interlineaR: Importing interlinearized corpora and dictionaries

Sylvain Loiseau

2018-05-19

Introduction

This package contains utility functions for importing into R interlinearized corpora and dictionaries in the various formats produced by descriptive linguistics software, such as SIL Toolbox or SIL Fieldworks.

All functions reading interlinearized texts return a list of data frames, where each data frame corresponds to a unit (text, sentence, word, morpheme) and each row in the data frame describes an occurrence of the corresponding unit. The set of tables is relational: in each data frame, some columns give IDs pointing to rows in the other data frames, so you can join morphemes to the words, sentences or texts they belong to.

This pivot format allows for various kinds of reshaping in R (for instance, grouping morphemes by words) as well as conversion between formats.
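As a rough illustration of the kind of reshaping this structure allows, the sketch below groups morphemes by word. It does not call the package: the list `corpus` is built by hand, and its column names (`txt`, `gls`, `morpheme_id`) and values are simplified assumptions rather than the exact output of the reading functions.

```r
# Toy pivot structure built by hand (hypothetical, simplified columns)
corpus <- list(
  words = data.frame(word_id = 1:2, sentence_id = c(1, 1),
                     txt = c("nahamwelo", "na"),
                     stringsAsFactors = FALSE),
  morphemes = data.frame(morpheme_id = 1:4, word_id = c(1, 1, 1, 2),
                         txt = c("naham", "-we", "-lo", "na"),
                         gls = c("Naham", "-MASC.SING", "COMITATIVE", "to_find"),
                         stringsAsFactors = FALSE)
)

# Group morphemes by word: one row per word, glosses pasted together
glosses <- aggregate(gls ~ word_id, data = corpus$morphemes,
                     FUN = paste, collapse = " ")

# Attach the grouped glosses to the words table through the shared id
words_glossed <- merge(corpus$words, glosses, by = "word_id")
print(words_glossed)
```

The same pattern (aggregate on an id column, then merge) should apply to any pair of tables in the list, provided they share an id column.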

Reading EMELD XML interlinear corpus

Turning the EMELD XML document into a set of data frames

EMELD is an XML vocabulary introduced in Baden Hughes, Steven Bird and Catherine Bow, Encoding and Presenting Interlinear Text Using XML Technologies [http://www.aclweb.org/anthology/U03-1008]. It is used by SIL Fieldworks as an export format.

corpuspath <- system.file("exampleData", "tuwariInterlinear.xml", package="interlineaR")
corpus <- read.emeld(corpuspath, vernacular.languages="tww")

The returned object is a named list. Each slot of the list contains a data.frame. The slots are named ‘morphemes’, ‘words’, ‘sentences’ and ‘texts’ (unless some of them have been discarded through the function arguments). Each row in a data frame describes an occurrence of a linguistic unit (text, sentence, word or morpheme).

Let’s look at the first rows of the “morphemes” data.frame:

head(corpus$morphemes)
text_id sentence_id word_id morphem_id type txt.tww cf.tww gls.en msa.en hn.en
1 1 1 1 stem a a I pers 2
1 1 2 2 stem otoiso otoiso tomorrow adv NA
1 1 3 3 stem naham naham Naham n_Npr NA
1 1 3 4 suffix -we -we -MASC.SING n_Npr:(NounGenderNumber) NA
1 1 3 5 suffix -lo -lo COMITATIVE n 3
1 1 4 6 stem na na to_find v NA

The first columns contain ids (indicating which text, sentence or word each morpheme belongs to). The other columns contain information extracted from the document. Each column name is made of the field name and the language of the field, separated by a dot. (The parameters of read.emeld allow you to indicate which fields, and in which language(s), you are interested in for each unit.) Each field may be repeated in different languages.
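Because of this naming convention, the field and the language can be recovered from the column names themselves. The following is a minimal sketch in base R, using the column names of the morphemes table printed above; it assumes exactly one dot per name, so a column such as title.abbreviation.en would need a different rule.

```r
# Column names as they appear in the morphemes table
cols <- c("txt.tww", "cf.tww", "gls.en", "msa.en", "hn.en")

# Split each name at the dot into field name and language code
parts <- strsplit(cols, ".", fixed = TRUE)
fields <- data.frame(
  column   = cols,
  field    = vapply(parts, `[`, character(1), 1),
  language = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)

# Names of the columns holding fields in English
english_cols <- fields$column[fields$language == "en"]
print(english_cols)
```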

The “words”, “sentences” and “texts” tables are built according to the same principles:

head(corpus$words)
text_id sentence_id word_id txt.tww gls.en pos.en
1 1 1 a NA NA
1 1 2 otoiso NA NA
1 1 3 nahamwelo NA NA
1 1 4 na NA NA
1 1 5 balusesapo NA NA
1 1 6 holotuafemamo NA NA
head(corpus$sentences)
text_id sentence_id note.en segnum.en gls.en lit.en
1 1 [mainaimwii] -> / malenaimwii/; gros problème : -mwelo, -welo, -we-lo… ? 1 I, tomorrow, next week, salim i kam, salim tok i go NA
2 2 Cahier : wefemo 1 we, today, young men, men, went to work. The work done, at two o’clock, we go to the garden cleaning ground(?), then the night we come back to the house. NA
3 3 1.1 Listen! NA
3 4 ; 1.2 I went downstream with a dog. NA
3 5 ‎‎upaoma - akiapmin. 1.3 Downstream, on Tepeso, I saw a crocodile, sleeping deep inside the water. NA
3 6 1.4 I shoot the crocodile with a spear on the top of the neck and I get him. NA
head(corpus$texts)
text_id title.en title.abbreviation.en source.en comment.en
1 141104_01_T2 (correction dans 2015.III.S18) 2014T2 NA NA
2 141104_02_T3 A day working on the airstrip (correction dans 2015.III.S18) 2014T3 NA NA
3 141104_03_T4 How Martin Sipamo killed a crocodile (correction dans 2015.III.S18) 2014T4 NA NA
4 141105_01_T5 2014T5 NA NA
5 141105_01_T5com 2014T5com NA NA
6 141105_03_T6 Hurry, the night is coming 2014T6 NA NA

Constructing data sets combining information from several data frames

This set of data frames is similar to the tables of a database: rows from the various tables point to each other through ids.

Using these ids, a new data.frame aggregating pieces of information coming from several data frames can be built. For instance, a table containing the columns of both the morphemes and the words can be built using:

morphemes_words <- merge(corpus$morphemes, corpus$words[,-c(1,2)], by="word_id", suffixes = c(".morpheme",".word"))
head(morphemes_words)
word_id text_id sentence_id morphem_id type txt.tww.morpheme cf.tww gls.en.morpheme msa.en hn.en txt.tww.word gls.en.word pos.en
1 1 1 1 stem a a I pers 2 a NA NA
2 1 1 2 stem otoiso otoiso tomorrow adv NA otoiso NA NA
3 1 1 3 stem naham naham Naham n_Npr NA nahamwelo NA NA
3 1 1 4 suffix -we -we -MASC.SING n_Npr:(NounGenderNumber) NA nahamwelo NA NA
3 1 1 5 suffix -lo -lo COMITATIVE n 3 nahamwelo NA NA
4 1 1 6 stem na na to_find v NA na NA NA
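The same approach can be used with the other ids. The sketch below attaches a sentence-level translation to each morpheme through sentence_id. The two data frames are hand-made stand-ins that reuse the column names shown above; their values are illustrative and are not read from the example corpus.

```r
# Toy tables reusing the id and field column names of the output above
morphemes <- data.frame(sentence_id = c(1, 1, 2), word_id = c(1, 2, 3),
                        morphem_id = 1:3,
                        gls.en = c("I", "tomorrow", "we"),
                        stringsAsFactors = FALSE)
sentences <- data.frame(sentence_id = 1:2,
                        gls.en = c("first free translation",
                                   "second free translation"),
                        stringsAsFactors = FALSE)

# Both tables have a gls.en column, so suffixes keep them apart after the join
morphemes_sentences <- merge(morphemes, sentences, by = "sentence_id",
                             suffixes = c(".morpheme", ".sentence"))
print(names(morphemes_sentences))
```

Every morpheme row now carries the translation of the sentence it belongs to, which makes it possible to filter or count morphemes by sentence-level properties.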

Reading Toolbox interlinear corpus

Toolbox [https://software.sil.org/toolbox/] is widely used for producing interlinearized corpora. It uses a specific text-based format.

corpuspath <- system.file("exampleData", "tuwariToolbox.txt", package="interlineaR")
corpus <- read.toolbox(corpuspath)

As with read.emeld, the result is a list containing the slots ‘morphemes’, ‘sentences’, ‘words’ and ‘texts’. These slots are data frames, where each row describes an occurrence:

head(corpus$morphemes)
texts_ids sentences_ids triplet_ids morphemes_id mb ge ps
1 1 1 1 ta we Pr
1 1 1 2 samuel Samuel Npr
1 1 1 3 -we -M.S -sfx
1 1 1 4 m- ?- ?-
1 1 1 5 iasa to_help v
1 1 1 6 -ne -Part -mode

As with read.emeld, optional fields can be declared. For instance, the Kakabe corpus (by Alexandra Vydrina) also contains morpheme glosses in Russian and French:

path <- system.file("exampleData", "kakabe.txt", package="interlineaR")
corpus <- read.toolbox(path, morpheme.fields.suppl = c("gr", "gf"))
head(corpus$morphemes)
texts_ids sentences_ids triplet_ids morphemes_id mb ge gr gf ps
1 1 1 1 mùsu woman женщина -ART т femme n
1 1 1 2 -È -AR от б -AR -mr
1 1 1 3 dóo T one ыть ма T un phn
1 1 1 4 bi be ниока -AR être cop
1 1 1 5 bàntará manioc T толочь -ART manioc n
1 1 1 6 -È LOC

The ‘sentences’ data frame contains a numeric id and the reference created for each sentence (“ref” field in Toolbox), plus (as with read.emeld) the original text, the free translation and the note (“tx”, “ft” and “nt” fields in Toolbox).

head(corpus$sentences)
texts_ids sentences_ids ft ref tx
1 1 mùséè dóo bi bàntaráà tùgéè là
1 2 Músa kéle-la báti n na-kɔ̀ri
1 3 kín-na-ma t’ a ladíi
1 4 wálè bi dúfen-na sínàn dé
1 5 dende bi faljɛ-la karaɲɛ tɔ
1 6 kɛɛ syɔ́ɲɛ̀ bi fáa-nden

The ‘texts’ data.frame contains a numeric id and the title (Toolbox ID) of each text.

Reading LIFT XML dictionary

LIFT is an XML vocabulary introduced by SIL [http://code.google.com/p/lift-standard]. It is used by SIL Fieldworks as an export format.

read.lift() produces a list of three data frames: “entries”, “senses” and “examples” (a “relations” table is yet to be added). These tables are linked through IDs, as in a relational database. All the fields of the dictionary, in all languages, can be extracted. The arguments of read.lift() allow you to list manually the fields you are interested in for each data frame; with simplify=TRUE, you can also reduce the fields (columns) to those that contain at least one non-empty value.

dicpath <- system.file("exampleData", "tuwariDictionary.lift", package="interlineaR")
dictionary <- read.lift(dicpath, vernacular.languages="tww", simplify=TRUE)

Table of the lexical units:

head(dictionary$entries)
id_LIFT id lexical-unit.tww morph-type variant.form.tww variant.morph-type
u_002794b9-f063-4c6b-b77d-39b8ecd618d1 1 u stem NA NA
ia_00909ffb-7e90-4b76-b948-cae56094abc9 2 ia stem NA NA
totolo_00b3f2c3-bd07-4282-9287-603f2285c720 3 totolo stem totolu stem
ofa_00dba42e-1a59-40e8-848d-dbe5bc20012e 4 ofa stem NA NA
tia3_013a1198-262e-45c6-9cab-fb6168f4f223 5 tia stem NA NA
waia3_015e37a5-77f2-4c45-9404-e90a36e48014 6 waia stem NA NA
head(dictionary$senses)
id_LIFT id lexem_id grammatical-info.value gloss.en usage-type semantic-domain-ddp4 grammatical-info.traits
582795c9-9350-4e3b-af34-b72e9b5c89aa 1 1 Noun fire Noun-infl-class:tano
5da1286e-07b0-47f2-81ae-5633ca9c875c 2 2 Noun talk 3.5 Communication Noun-infl-class:he
5da1286e-07b0-47f2-81ae-5633ca9c875d 3 2 Noun vernacular_language 3.5 Communication Noun-infl-class:he
5da1286e-07b0-47f2-81ae-5633ca9c875d 4 2 Noun word 3.5 Communication Noun-infl-class:he
ed8e2d65-f3da-4efe-92cc-f381d08f0c08 5 3 Noun island 1.2 World Noun-infl-class:fo
8c5651bd-27b4-4647-b46b-d7b209a2477f 6 4 Adverb now 8.4 Time
head(dictionary$examples)
id lexem_id sense_id example.form.tww
1 2 2 exemple1.1
2 2 2 exemple1.2
3 2 3 exemple2.1
4 2 3 exemple2.2
5 2 4 exemple3.1
6 2 4 exemple3.2
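Judging from the output above, lexem_id in the senses table appears to point to the id column of the entries table. Under that assumption, entries and senses can be joined as sketched below. The two data frames are hand-made and keep only a few of the columns and rows shown above; check.names = FALSE is needed to preserve the hyphen in lexical-unit.tww.

```r
# Toy tables reusing a few columns and values of the output above
entries <- data.frame(id = 1:2, `lexical-unit.tww` = c("u", "ia"),
                      check.names = FALSE, stringsAsFactors = FALSE)
senses <- data.frame(id = 1:4, lexem_id = c(1, 2, 2, 2),
                     gloss.en = c("fire", "talk", "vernacular_language", "word"),
                     stringsAsFactors = FALSE)

# Rename the sense id so that it does not clash with the entry id
names(senses)[names(senses) == "id"] <- "sense_id"

# Assumption: senses$lexem_id refers to entries$id
entries_senses <- merge(entries, senses, by.x = "id", by.y = "lexem_id")

# One string of glosses per lexical unit
glosses <- tapply(entries_senses$gloss.en,
                  entries_senses$`lexical-unit.tww`,
                  paste, collapse = "; ")
print(glosses)
```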

Some fields in LIFT may be repeated. For instance, several “semantic domains” can be expressed in the sense element:

<trait name ="semantic-domain-ddp4" value="1.3 Water"/>
<trait name ="semantic-domain-ddp4" value="6.7 Tool"/>

In that case, the values are concatenated, and the column “semantic-domain-ddp4” contains the value “1.3 Water,6.7 Tool”.
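If the individual values are needed again, the concatenated strings can be split on the comma. This is a minimal sketch in base R on a hand-made vector; it assumes that no single value contains a comma of its own.

```r
# Values as they might appear in the semantic-domain-ddp4 column
domains <- c("1.3 Water,6.7 Tool", "3.5 Communication", NA)

# Split each cell into its individual values (NA cells stay NA)
domain_list <- strsplit(domains, ",", fixed = TRUE)

# Number of elements obtained for each cell
n_domains <- lengths(domain_list)
print(domain_list[[1]])
```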

Some other fields are both repeated and expressed as key-value pairs, reflecting categories created for a particular language. In the following chunk, “Noun-infl-class” and “Noun-infl-class2” are two categories created for the nouns of a given language:

<grammatical-info value="Noun">
  <trait name="Noun-infl-class" value="fo"/>
  <trait name="Noun-infl-class2" value="hei"/>
</grammatical-info>

In that case, the corresponding column (“grammatical-info.traits” in the senses data.frame above) will contain: “Noun-infl-class:fo,Noun-infl-class2:hei”.
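Such a string can be turned back into key-value pairs with two successive splits. The sketch below uses base R only and assumes that neither the keys nor the values contain a comma or a colon.

```r
# A cell value in the format described above
traits <- "Noun-infl-class:fo,Noun-infl-class2:hei"

# First split the pairs on the comma, then each pair on the colon
pairs <- strsplit(strsplit(traits, ",", fixed = TRUE)[[1]], ":", fixed = TRUE)

# Build a named character vector: names are the keys, elements the values
values <- vapply(pairs, `[`, character(1), 2)
names(values) <- vapply(pairs, `[`, character(1), 1)
print(values)
```

A value can then be looked up by its category name, for instance values[["Noun-infl-class2"]].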