Introduction

This package contains utility functions for importing into R interlinearized corpora and dictionaries in the various formats produced by descriptive linguistics software, such as SIL Toolbox or SIL Fieldworks.

All functions that read interlinearized texts return a list of data frames. Each data frame corresponds to a unit (text, sentence, word, morpheme), and each row in a data frame describes an occurrence of the corresponding unit. The set of tables is relational: in each data frame, some columns give IDs pointing to rows in the other data frames, so you can join morphemes to the words, sentences or texts they belong to.

This pivot format allows for various reshaping operations in R (for instance, grouping morphemes by words) as well as conversion between formats.
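As an illustration of such reshaping, here is a minimal sketch that groups morphemes by word using base R only. The toy data frame copies a few values from the example output shown later in this document; it is built by hand so that the snippet runs without the package, and the names forms, glosses and words_rebuilt are illustrative.

```r
# Toy morphemes table (values copied from the example output in this document)
morphemes <- data.frame(
  word_id = c(1, 2, 3, 3, 3, 4),
  txt.tww = c("a", "otoiso", "naham", "-we", "-lo", "na"),
  gls.en  = c("I", "tomorrow", "Naham", "-MASC.SING", "COMITATIVE", "to_find"),
  stringsAsFactors = FALSE
)

# Paste together the forms and the glosses of the morphemes of each word
forms   <- tapply(morphemes$txt.tww, morphemes$word_id, paste, collapse = "")
glosses <- tapply(morphemes$gls.en,  morphemes$word_id, paste, collapse = "")

# One row per word
words_rebuilt <- data.frame(
  word_id = as.integer(names(forms)),
  form    = as.character(forms),
  gloss   = as.character(glosses),
  stringsAsFactors = FALSE
)
print(words_rebuilt)
```

The morphemes of word 3 end up on a single row, with the suffix hyphens kept as morpheme boundaries.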

Reading EMELD XML interlinear corpus

Turning the EMELD XML document into a set of data frames

EMELD is an XML vocabulary introduced in Baden Hughes, Steven Bird and Catherine Bow, Encoding and Presenting Interlinear Text Using XML Technologies [http://www.aclweb.org/anthology/U03-1008]. It is used by SIL Fieldworks as an export format.

corpuspath <- system.file("exampleData", "tuwariInterlinear.xml", package="interlineaR")
corpus <- read.emeld(corpuspath, vernacular.languages="tww")

The returned object is a named list. Each slot of the list contains a data.frame. The slots are named ‘morphemes’, ‘words’, ‘sentences’ and ‘texts’ (unless some of them have been discarded through the function arguments). Each row in a data frame describes an occurrence of a linguistic unit (text, sentence, word or morpheme).

Let’s look at the first rows of the “morphemes” data.frame:

head(corpus$morphemes)

text_id	sentence_id	word_id	morphem_id	type	txt.tww	cf.tww	gls.en	msa.en	hn.en
1	1	1	1	stem	a	a	I	pers	2
1	1	2	2	stem	otoiso	otoiso	tomorrow	adv	NA
1	1	3	3	stem	naham	naham	Naham	n_Npr	NA
1	1	3	4	suffix	-we	-we	-MASC.SING	n_Npr:(NounGenderNumber)	NA
1	1	3	5	suffix	-lo	-lo	COMITATIVE	n	3
1	1	4	6	stem	na	na	to_find	v	NA

The first columns contain IDs (indicating which text, sentence or word each morpheme belongs to). The other columns contain information extracted from the document. The name of each column is made of the field name and the language of the field, separated by a dot. (The parameters of read.emeld allow you to indicate which fields, and in which language(s), you are interested in for each unit.) Each field may be repeated in different languages.
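The naming convention makes it easy to select columns by field or by language. The following sketch splits a few of the column names shown in this document on their last dot; it assumes that the language code itself never contains a dot, and the helper names (cols, fields, english_cols) are illustrative.

```r
# Column names taken from the tables shown in this document
cols <- c("txt.tww", "cf.tww", "gls.en", "msa.en", "title.abbreviation.en")

fields <- data.frame(
  column   = cols,
  field    = sub("\\.[^.]*$", "", cols),  # everything before the last dot
  language = sub("^.*\\.", "", cols),     # everything after the last dot
  stringsAsFactors = FALSE
)

# Names of the columns holding English-language fields
english_cols <- fields$column[fields$language == "en"]
print(fields)
```

Splitting on the last dot rather than the first keeps field names such as title.abbreviation intact.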

The “words”, “sentences” and “texts” tables are built according to the same principles:

head(corpus$words)

text_id	sentence_id	word_id	txt.tww	gls.en	pos.en
1	1	1	a	NA	NA
1	1	2	otoiso	NA	NA
1	1	3	nahamwelo	NA	NA
1	1	4	na	NA	NA
1	1	5	balusesapo	NA	NA
1	1	6	holotuafemamo	NA	NA

head(corpus$sentences)

text_id	sentence_id	note.en	segnum.en	gls.en	lit.en
1	1	[mainaimwii] -> / malenaimwii/; gros problème : -mwelo, -welo, -we-lo… ?	1	I, tomorrow, next week, salim i kam, salim tok i go	NA
2	2	Cahier : wefemo	1	we, today, young men, men, went to work. The work done, at two o’clock, we go to the garden cleaning ground(?), then the night we come back to the house.	NA
3	3		1.1	Listen!	NA
3	4	;	1.2	I went downstream with a dog.	NA
3	5	‎‎upaoma - akiapmin.	1.3	Downstream, on Tepeso, I saw a crocodile, sleeping deep inside the water.	NA
3	6		1.4	I shoot the crocodile with a spear on the top of the neck and I get him.	NA

head(corpus$texts)

text_id	title.en	title.abbreviation.en	source.en	comment.en
1	141104_01_T2 (correction dans 2015.III.S18)	2014T2	NA	NA
2	141104_02_T3 A day working on the airstrip (correction dans 2015.III.S18)	2014T3	NA	NA
3	141104_03_T4 How Martin Sipamo killed a crocodile (correction dans 2015.III.S18)	2014T4	NA	NA
4	141105_01_T5	2014T5	NA	NA
5	141105_01_T5com	2014T5com	NA	NA
6	141105_03_T6 Hurry, the night is coming	2014T6	NA	NA

Constructing data sets combining information from several data frames

This set of data frames is similar to the tables of a database: rows from the various tables point to each other through IDs.

Using these IDs, you can build new data frames that aggregate pieces of information coming from several data frames. For instance, a table containing the columns of both the morphemes and the words can be built using:

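# Drop the first two columns of words (text_id, sentence_id): they are already present in morphemes
morphemes_words <- merge(corpus$morphemes, corpus$words[,-c(1,2)], by="word_id", suffixes = c(".morpheme",".word"))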
head(morphemes_words)

word_id	text_id	sentence_id	morphem_id	type	txt.tww.morpheme	cf.tww	gls.en.morpheme	msa.en	hn.en	txt.tww.word	gls.en.word	pos.en
1	1	1	1	stem	a	a	I	pers	2	a	NA	NA
2	1	1	2	stem	otoiso	otoiso	tomorrow	adv	NA	otoiso	NA	NA
3	1	1	3	stem	naham	naham	Naham	n_Npr	NA	nahamwelo	NA	NA
3	1	1	4	suffix	-we	-we	-MASC.SING	n_Npr:(NounGenderNumber)	NA	nahamwelo	NA	NA
3	1	1	5	suffix	-lo	-lo	COMITATIVE	n	3	nahamwelo	NA	NA
4	1	1	6	stem	na	na	to_find	v	NA	na	NA	NA
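The same approach works for any pair of tables. The sketch below joins morphemes to the free translation of their sentence and then looks up the sentences containing a given morpheme gloss. It uses small hand-built data frames whose values are copied from the output above, so that it runs without the example corpus; only the columns needed for the join are kept.

```r
# Toy tables with values copied from the example output in this document
morphemes <- data.frame(
  sentence_id = c(1, 1, 1, 1, 1, 1),
  word_id     = c(1, 2, 3, 3, 3, 4),
  morphem_id  = 1:6,
  gls.en      = c("I", "tomorrow", "Naham", "-MASC.SING", "COMITATIVE", "to_find"),
  stringsAsFactors = FALSE
)
sentences <- data.frame(
  sentence_id = c(1, 3, 4),
  gls.en      = c("I, tomorrow, next week, salim i kam, salim tok i go",
                  "Listen!",
                  "I went downstream with a dog."),
  stringsAsFactors = FALSE
)

# gls.en exists in both tables, so merge() disambiguates it with the suffixes
morphemes_sentences <- merge(morphemes, sentences, by = "sentence_id",
                             suffixes = c(".morpheme", ".sentence"))

# Free translations of the sentences containing a morpheme glossed "to_find"
hits <- unique(
  morphemes_sentences$gls.en.sentence[morphemes_sentences$gls.en.morpheme == "to_find"]
)
print(hits)
```

By default merge() performs an inner join, so sentences without any morpheme in the toy table (here 3 and 4) do not appear in the result.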

Reading Toolbox interlinear corpus

Toolbox [https://software.sil.org/toolbox/] is widely used for producing interlinearized corpora. It uses a specific text-based format.

corpuspath <- system.file("exampleData", "tuwariToolbox.txt", package="interlineaR")
corpus <- read.toolbox(corpuspath)

As with read.emeld, the result is a list containing the slots ‘morphemes’, ‘words’, ‘sentences’ and ‘texts’. These slots are data frames in which each row describes an occurrence:

head(corpus$morphemes)

texts_ids	sentences_ids	triplet_ids	morphemes_id	mb	ge	ps
1	1	1	1	ta	we	Pr
1	1	1	2	samuel	Samuel	Npr
1	1	1	3	-we	-M.S	-sfx
1	1	1	4	m-	?-	?-
1	1	1	5	iasa	to_help	v
1	1	1	6	-ne	-Part	-mode

As with read.emeld, optional fields can be declared. For instance, the Kakabe corpus (by Alexandra Vydrina) also contains morpheme glosses in Russian and French:

path <- system.file("exampleData", "kakabe.txt", package="interlineaR")
corpus <- read.toolbox(path, morpheme.fields.suppl = c("gr", "gf"))
head(corpus$morphemes)

texts_ids	sentences_ids	triplet_ids	morphemes_id	mb	ge	gr	gf	ps
1	1	1	1	mùsu	woman	женщина -ART т	femme	n
1	1	1	2	-È	-AR	от б	-AR	-mr
1	1	1	3	dóo	T one	ыть ма	T un	phn
1	1	1	4	bi	be	ниока -AR	être	cop
1	1	1	5	bàntará	manioc	T толочь -ART	manioc	n
1	1	1	6	-È		LOC

The ‘sentences’ data frame contains a numeric id and the reference created for each sentence (“ref” field in Toolbox), plus (as with read.emeld) the original text, the free translation and the note (“tx”, “ft” and “nt” fields in Toolbox, respectively).

head(corpus$sentences)

texts_ids	sentences_ids	tx
1	1	mùséè dóo bi bàntaráà tùgéè là
1	2	Músa kéle-la báti n na-kɔ̀ri
1	3	kín-na-ma t’ a ladíi
1	4	wálè bi dúfen-na sínàn dé
1	5	dende bi faljɛ-la karaɲɛ tɔ
1	6	kɛɛ syɔ́ɲɛ̀ bi fáa-nden

The ‘texts’ data frame contains a numeric id and the title (Toolbox ID) of each text.

Reading LIFT XML dictionary

LIFT is an XML vocabulary introduced by SIL [http://code.google.com/p/lift-standard]; it is used by SIL Fieldworks as an export format.

read.lift() produces a list of three data frames: “entries”, “senses” and “examples” (“relations” should be added). This set of tables is linked through IDs, as in a relational database. All the fields of the dictionary, in all languages, can be extracted. The arguments of read.lift() allow you to list manually the fields you are interested in for each data frame; with simplify=TRUE, you can also restrict the fields (columns) to those that have non-empty values.

dicpath <- system.file("exampleData", "tuwariDictionary.lift", package="interlineaR")
dictionary <- read.lift(dicpath, vernacular.languages="tww", simplify=TRUE)
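To give an idea of what this kind of simplification amounts to, here is a sketch on a plain data frame that drops the columns containing only NA or empty strings. This is an assumption about the general idea, not the package's actual implementation; drop_empty_columns and the note.en column are hypothetical names introduced for the illustration.

```r
# Hypothetical helper: keep only the columns with at least one non-empty value
drop_empty_columns <- function(df) {
  is_empty <- vapply(df, function(col) all(is.na(col) | col == ""), logical(1))
  df[, !is_empty, drop = FALSE]
}

# Toy entries table; "note.en" is an invented, entirely empty column
entries <- data.frame(
  id                 = c(1, 2, 3),
  `lexical-unit.tww` = c("u", "ia", "totolo"),
  `variant.form.tww` = c(NA, NA, "totolu"),
  `note.en`          = c(NA, "", NA),
  check.names = FALSE, stringsAsFactors = FALSE
)

simplified <- drop_empty_columns(entries)
print(names(simplified))
```

A column such as variant.form.tww is kept because one entry has a value, even though the others are NA.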

Table of the lexical units:

head(dictionary$entries)

id_LIFT	id	lexical-unit.tww	morph-type	variant.form.tww	variant.morph-type
u_002794b9-f063-4c6b-b77d-39b8ecd618d1	1	u	stem	NA	NA
ia_00909ffb-7e90-4b76-b948-cae56094abc9	2	ia	stem	NA	NA
totolo_00b3f2c3-bd07-4282-9287-603f2285c720	3	totolo	stem	totolu	stem
ofa_00dba42e-1a59-40e8-848d-dbe5bc20012e	4	ofa	stem	NA	NA
tia3_013a1198-262e-45c6-9cab-fb6168f4f223	5	tia	stem	NA	NA
waia3_015e37a5-77f2-4c45-9404-e90a36e48014	6	waia	stem	NA	NA

head(dictionary$senses)

id_LIFT	id	lexem_id	grammatical-info.value	gloss.en	semantic-domain-ddp4	grammatical-info.traits
582795c9-9350-4e3b-af34-b72e9b5c89aa	1	1	Noun	fire		Noun-infl-class:tano
5da1286e-07b0-47f2-81ae-5633ca9c875c	2	2	Noun	talk	3.5 Communication	Noun-infl-class:he
5da1286e-07b0-47f2-81ae-5633ca9c875d	3	2	Noun	vernacular_language	3.5 Communication	Noun-infl-class:he
5da1286e-07b0-47f2-81ae-5633ca9c875d	4	2	Noun	word	3.5 Communication	Noun-infl-class:he
ed8e2d65-f3da-4efe-92cc-f381d08f0c08	5	3	Noun	island	1.2 World	Noun-infl-class:fo
8c5651bd-27b4-4647-b46b-d7b209a2477f	6	4	Adverb	now	8.4 Time

head(dictionary$examples)

id	lexem_id	sense_id	example.form.tww
1	2	2	exemple1.1
2	2	2	exemple1.2
3	2	3	exemple2.1
4	2	3	exemple2.2
5	2	4	exemple3.1
6	2	4	exemple3.2
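The three tables can be joined through their IDs, as with the corpus tables. The sketch below uses hand-built data frames whose values are copied from the output above, and renames the id columns before merging so that each key has the same name in every table; the new names example_id, entry_senses and full are choices made for this illustration.

```r
# Toy tables with values copied from the example output in this document
entries <- data.frame(
  id = c(1, 2, 3),
  `lexical-unit.tww` = c("u", "ia", "totolo"),
  check.names = FALSE, stringsAsFactors = FALSE
)
senses <- data.frame(
  id       = c(1, 2, 3, 4, 5),
  lexem_id = c(1, 2, 2, 2, 3),
  gloss.en = c("fire", "talk", "vernacular_language", "word", "island"),
  stringsAsFactors = FALSE
)
examples <- data.frame(
  id       = 1:6,
  lexem_id = c(2, 2, 2, 2, 2, 2),
  sense_id = c(2, 2, 3, 3, 4, 4),
  `example.form.tww` = c("exemple1.1", "exemple1.2", "exemple2.1",
                         "exemple2.2", "exemple3.1", "exemple3.2"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Give each key an unambiguous name shared across the tables
names(entries)[names(entries) == "id"]   <- "lexem_id"
names(senses)[names(senses) == "id"]     <- "sense_id"
names(examples)[names(examples) == "id"] <- "example_id"

# One row per sense, with the lexical unit of its entry
entry_senses <- merge(entries, senses, by = "lexem_id")

# One row per example; all.x = TRUE keeps the senses that have no example
full <- merge(entry_senses, examples, by = c("lexem_id", "sense_id"), all.x = TRUE)
print(full)
```

With all.x = TRUE the join behaves like a left join: senses without examples are kept, with NA in the example columns.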

Some fields in LIFT may be repeated. For instance, several semantic domains can be expressed in the sense element:

<trait name ="semantic-domain-ddp4" value="1.3 Water"/>
<trait name ="semantic-domain-ddp4" value="6.7 Tool"/>

In that case, the values are concatenated, and the column “semantic-domain-ddp4” contains the value “1.3 Water,6.7 Tool”.
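If you need one row per semantic domain, the concatenated values can be split again. The following sketch does so in base R on a toy table; the ids are illustrative, and it assumes that a comma never occurs inside a domain label.

```r
# Toy senses table; the ids are illustrative
senses <- data.frame(
  id = c(1, 2),
  `semantic-domain-ddp4` = c("1.3 Water,6.7 Tool", "3.5 Communication"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Split each cell on the comma separator
domains <- strsplit(senses[["semantic-domain-ddp4"]], ",", fixed = TRUE)

# Long format: one row per (sense, semantic domain) pair
senses_domains <- data.frame(
  id     = rep(senses$id, lengths(domains)),
  domain = unlist(domains),
  stringsAsFactors = FALSE
)
print(senses_domains)
```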

Some other fields are both repeated and expressed as key-value pairs, reflecting categories created for a language. In the following chunk, “Noun-infl-class” and “Noun-infl-class2” are two categories created for the nouns of a given language:

<grammatical-info value="Noun">
  <trait name="Noun-infl-class" value="fo"/>
  <trait name="Noun-infl-class2" value="hei"/>
</grammatical-info>

In that case, the “grammatical-info.traits” column of the “senses” data frame will contain: “Noun-infl-class:fo,Noun-infl-class2:hei”.
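Such a string can be turned back into a named vector when the individual categories are needed. This is a minimal sketch in base R, assuming that neither the names nor the values contain a comma or a colon; the variable names are illustrative.

```r
traits <- "Noun-infl-class:fo,Noun-infl-class2:hei"

# Split into "name:value" items, then split each item into name and value
items <- strsplit(traits, ",", fixed = TRUE)[[1]]
pairs <- strsplit(items, ":", fixed = TRUE)

# Named character vector: names are the categories, elements are their values
trait_values <- vapply(pairs, `[`, character(1), 2)
names(trait_values) <- vapply(pairs, `[`, character(1), 1)

print(trait_values[["Noun-infl-class2"]])
```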

interlineaR: Importing interlinearized corpora and dictionaries

Sylvain Loiseau

2018-05-19