---
title: "Translating Sumerian Texts"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Translating Sumerian Texts}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dev = "ragg_png"
)
```

## 1. Introduction

The first vignette ("Getting Started with sumer") introduced the basic functions of the package: sign conversion, dictionary lookup, and translation templates for individual lines. This vignette describes the complete workflow for working with entire texts.

The workflow consists of the following steps:

1. Load a transliterated or cuneiform text
2. Analyze the text using n-gram analysis and grammar probabilities
3. Translate line by line interactively
4. Generate a custom dictionary from the translations

The generated dictionary can be used as an additional source for future translations. This creates a cycle: each new translation improves the dictionary, and the improved dictionary facilitates the next translation.

```{r}
library(sumer)
```

## 2. Loading a Text

The package includes the example text "Enki and the World Order", a Sumerian myth. The text is stored as a text file:

```{r}
path <- system.file("extdata", "enki_and_the_world_order.txt", package = "sumer")
text <- readLines(path, encoding = "UTF-8")
```

The first few lines look like this:

```{r}
cat(text[1:5], sep = "\n")
```

Each line can optionally begin with a line number (e.g., `1)\t...`). Lines starting with `#` are treated as comments and ignored during analysis. The text can be in cuneiform or transliteration -- in the latter case, it can be automatically converted.

## 3. N-gram Analysis

### Finding frequent sign combinations

A good first step when working with a new text is to search for frequently recurring sign combinations (n-grams).
Such patterns are valuable clues: if a certain sequence of cuneiform signs appears repeatedly, it is likely a fixed term, a compound word, or an idiomatic expression.

```{r}
freq <- ngram_frequencies(text, min_freq = c(6, 4, 2))
head(freq, 10)
```

The `min_freq` parameter controls the minimum frequency for different n-gram lengths. The default value `c(6, 4, 2)` means: single signs must occur at least 6 times, pairs at least 4 times, and all longer combinations at least 2 times. Depending on the length of the text, these thresholds can be adjusted.

The result is a data frame with three columns: `frequency`, `length` (number of signs), and `combination` (cuneiform characters).

The analysis works from the longest combinations down to the shortest. When a long combination is identified as frequent, its occurrences are masked so that shorter sub-combinations are not falsely counted as frequent just because they are part of the longer combination.

### Marking n-grams in the text

With `mark_ngrams()`, the identified patterns are marked in the text with curly braces:

```{r}
text_marked <- mark_ngrams(text, freq)
cat(text_marked[1:5], sep = "\n")
```

In the output, recurring sign combinations are highlighted with `{...}`. This makes patterns visible that are easily overlooked when reading the raw cuneiform text.

### Searching for patterns in the text

You can also search for a specific pattern in the annotated text. To do this, convert the search term with `mark_ngrams()` into the same format and then search with `grepl()`:

```{r}
term <- "IGI.DIB.TU"
pattern <- mark_ngrams(term, freq)
pattern

result <- text_marked[grepl(pattern, text_marked, fixed = TRUE)]
cat(result, sep = "\n")
```

This finds all lines where the pattern IGI.DIB.TU occurs -- including its embedding within larger n-grams.

## 4. Grammar Probabilities

### Grammatical types from the dictionary

To understand the structure of a sentence, it is helpful to know which grammatical role each individual sign is likely to play. The function `sign_grammar()` looks up each sign of a string in the dictionary and counts how often it occurs with each grammatical type:

```{r}
dic <- read_dictionary()
sg <- sign_grammar("a-ma-ru ba-ur3 ra", dic)
```

The result is a data frame with one row per sign per grammatical type. The `n` column indicates how often this sign is attested with the respective type in the dictionary.

### Bayesian probabilities

The raw frequencies from the dictionary can be refined into probabilities using a Bayesian model. First, compute the prior distribution of types across all signs in the dictionary:

```{r}
prior <- prior_probs(dic, sentence_prob = 0.25)
```

The `sentence_prob` parameter corrects a systematic bias: if a dictionary was primarily built from noun phrases (rather than complete sentences), verbs are underrepresented in it. A value of `sentence_prob = 0.25` means that an estimated 25% of the dictionary entries come from complete sentences. Verb probabilities are then upweighted accordingly.

Next, `grammar_probs()` computes the posterior probabilities for each sign:

```{r}
gp <- grammar_probs(sg, prior, dic)
```

For signs with many dictionary entries, the observed frequencies dominate; for rare signs, the result falls back to the prior distribution.

### Visualization

The function `plot_sign_grammar()` presents the results as a stacked bar chart:

```{r, fig.width = 7, fig.height = 4}
plot_sign_grammar(gp, sign_names = FALSE)
```

Each bar represents a sign position in the sentence. The colours represent grammatical types: green for nouns (S), red shades for verbs (V) and verb operators, blue shades for attribute operators, and orange for other operators. A tall bar in a particular colour indicates that the sign likely has that grammatical function.
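The interplay between observed counts and the prior can be illustrated with a small base-R sketch. This is illustrative only and assumes a simple Dirichlet-multinomial smoothing scheme (posterior proportional to counts plus weighted prior); the actual model inside `grammar_probs()` may differ, and `alpha` is a hypothetical pseudo-count weight, not a package parameter:

```{r}
# Illustrative sketch, NOT the package implementation:
# posterior ~ observed counts + alpha * prior (Dirichlet smoothing)
posterior_sketch <- function(counts, prior, alpha = 5) {
  # counts: observed type frequencies for one sign (named vector)
  # prior:  prior type distribution over the same names (sums to 1)
  post <- counts + alpha * prior[names(counts)]
  post / sum(post)
}

toy_prior <- c(S = 0.5, V = 0.3, OP = 0.2)

# A well-attested sign: the observations dominate the prior
posterior_sketch(c(S = 40, V = 2, OP = 0), toy_prior)

# A rare sign: the result stays close to the prior
posterior_sketch(c(S = 1, V = 0, OP = 0), toy_prior)
```

This reproduces the behaviour described above: with many attestations the smoothing term barely matters, while a single attestation is pulled strongly towards the prior distribution.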
The chart can also be saved to a file:

```{r, eval = FALSE}
plot_sign_grammar(gp, output_file = "grammar.png")
```

## 5. Translating Line by Line

### Using the translate gadget with a text

The function `translate()` reaches its full potential when used together with an entire text. Instead of a string, you pass a line number and the text:

```{r, eval = FALSE}
result <- translate(3, text = text)
```

The gadget opens with the third line of the text and has access to the full text. This provides three additional information sources:

- **N-grams**: Frequent sign combinations computed from the entire text that appear in the current line. Additionally, n-grams that appear in both the current line and the neighbouring lines are marked with a checkmark in the Theme column -- these are thematic connections.
- **Context**: The neighbouring lines (up to 2 before and 2 after the current line), with marked n-grams. This shows at a glance which patterns repeat across line boundaries.
- **Grammar**: The bar chart of grammar probabilities for the current line.

### Adjusting the bracket structure

In the input field of the translation section, you can adjust the bracket structure of the sentence. This is particularly important when you have identified fixed terms or compound words:

- Mark a proper name as a fixed term with angle brackets: `<...>`
- Group a compound word with round brackets: `(e2-gal)`

After clicking "Update Skeleton", the template is regenerated while preserving all previously entered translations.

### Saving the result

When you click "Done", `translate()` returns a `skeleton` object. This can be saved as a text file:

```{r, eval = FALSE}
result <- translate(3, text = text)
writeLines(result, "line_003.txt")
```

The saved result is a text file in skeleton format (pipe format) that can be used as input for dictionary creation.
### Using multiple dictionaries

If you have already created your own dictionaries (see next chapter), they can be passed as additional sources:

```{r, eval = FALSE}
result <- translate(3, text = text,
                    dic = c(system.file("extdata", "sumer-dictionary.txt", package = "sumer"),
                            "my_dictionary.txt"))
```

The dictionaries are searched in the order specified. When automatically pre-filling the translation template, the first dictionary that contains an entry for a given substring wins. In the dictionary panel of the gadget, entries from all dictionaries are displayed side by side.

## 6. Building a Dictionary from Translations

### The annotation format

The skeleton files generated by `translate()` use the pipe format, which serves directly as input for dictionary creation. Every line starting with `|` is used as a dictionary entry:

```
|reading=SIGN_NAME=cuneiform:type:translation
```

A typical file looks like this:

```
an-en-ki-ki-a-ig-e2-kur-ra: SEN: The god Enki transforms the Earth. The one who establishes sustenance of human existence utilizes a supplier of energy from a distant place (the E-Kur temple).
|an-en-ki=AN.EN.KI=š’€­š’‚—š’† : S: god Enki
|ki-a=KI.A=š’† š’€€: V: to transform the Earth
| ki=KI=š’† : S: Earth
| a=A=š’€€: Sā˜’->V: to transform S
|ig=IG=š’……: S: one who establishes the sustenance of human existence
|e2-kur-ra=E2.KUR.RA=š’‚š’†³š’Š: V: to utilize a supplier of energy from a distant place
| e2-kur=E2.KUR=š’‚š’†³: S: supplier of energy from a distant place
| e2=E2=š’‚: ā˜’S->S: supplier of energy from S
| kur=KUR=𒆳: S: distant place
| ra=RA=š’Š: Sā˜’->V: to utilize S
```

Header lines (those without a leading `|`) and blank lines are ignored when reading the file. Only lines starting with `|` become dictionary entries.

### Creating the dictionary

Once you have translated several lines of a text and saved them to files, these can be combined into a dictionary with `make_dictionary()`.
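To make the field layout concrete, a single pipe line can be split mechanically with base R. The helper below is a hypothetical, illustrative parser (it is not part of the package; `read_translated_text()`, described below, is the supported way to read these files):

```{r}
# Illustrative parser for one pipe-format line (NOT a package function).
# Layout: |reading=SIGN_NAME=cuneiform:type:translation
parse_pipe_line <- function(line) {
  stopifnot(startsWith(trimws(line), "|"))
  body  <- sub("^\\s*\\|\\s*", "", line)          # drop the leading pipe
  parts <- strsplit(body, ":", fixed = TRUE)[[1]]  # split off type/translation
  eqs   <- strsplit(trimws(parts[1]), "=", fixed = TRUE)[[1]]
  data.frame(
    reading     = eqs[1],
    sign_name   = eqs[2],
    cuneiform   = eqs[3],
    type        = trimws(parts[2]),
    # re-join in case the translation itself contained a colon
    translation = trimws(paste(parts[-(1:2)], collapse = ":")),
    stringsAsFactors = FALSE
  )
}

parse_pipe_line("| kur=KUR=𒆳: S: distant place")
```

Sub-entries (lines indented after the `|`) use the same layout, so the leading-whitespace handling above covers them as well.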
The function accepts a vector of file paths:

```{r, eval = FALSE}
dictionary <- make_dictionary("line_003.txt")
```

The function reads the translation files, aggregates entries (identical sign-type-translation combinations are counted together), and adds cuneiform characters and readings. The result is a data frame in dictionary format.

Internally, `make_dictionary()` performs two steps that can also be called individually:

```{r, eval = FALSE}
# Step 1: Read translation files
translations <- read_translated_text("line_003.txt")

# Step 2: Convert to dictionary format
dictionary <- convert_to_dictionary(translations)
```

The intermediate step is useful if you want to edit the translations before conversion -- for example, to unify spelling conventions.

### Saving and loading the dictionary

The completed dictionary can be saved with metadata:

```{r, eval = FALSE}
save_dictionary(
  dic = dictionary,
  file = "my_dictionary.txt",
  author = "My Name",
  year = "2026",
  version = "1.0",
  url = "https://example.com/dictionary"
)
```

And loaded again later:

```{r, eval = FALSE}
my_dic <- read_dictionary("my_dictionary.txt")
look_up("ki", my_dic)
```

### The cycle

The custom dictionary can now be used in further translation work:

```{r, eval = FALSE}
# For lookup
look_up("lugal", my_dic)

# For interactive translation
result <- translate(4, text = text, dic = "my_dictionary.txt")
```

With each translated line, the dictionary grows. Frequent signs and expressions accumulate higher counts over time, and the automatic pre-filling of translation templates becomes increasingly accurate. In this way, you gradually build a comprehensive dictionary database based on your own texts.