One of the most interesting things I’ve been doing recently has been working with Nick Thieberger, a linguist at Melbourne University, on “Digital Daisy Bates”, a Digital Humanities project based on the pioneering lexicographical research of Daisy Bates. In the early 20th century, Bates spent many years living in outback Australia, researching the languages and cultures of indigenous Australians, and produced dozens of lexicons of indigenous languages, using a common questionnaire. In our project, we are using modern digital methods to analyze all these lexicons. The first step of the project was to digitize and transcribe the questionnaires, so that we can then crunch the digital data and extract knowledge from it.
I will post more on the project later, but in this post I will restrict myself to describing some of the mechanics of the first stage of that data crunching.
We began by getting the documents transcribed and encoded in XML, using the TEI (Text Encoding Initiative) flavour of XML. To begin with, the text in those XML files is laid out in a table, but little else is recorded. The meaning of the text (that this is a word in English, and that these words are synonyms in some indigenous language) is not explicitly recorded, so it’s rather a superficial encoding of the original text.
Here for instance is a row of data from one of the lexicons:
<row xmlns="http://www.tei-c.org/ns/1.0"> <cell>Snake</cell> <cell>Burling, jundi (carpet), binma, yalun</cell> </row>
The left-hand column of the questionnaire form was pre-printed by Bates; in this row, the printed word was “Snake”. The right-hand column was left blank for recording translations in pen. In this case the recorder filled it in with:
Burling, jundi (carpet), binma, yalun
Our aim is to identify which of the words are intended to represent indigenous words, and which (like “carpet”) are actually additional English words which specify a refinement of the original term. In this case, the word “jundi” is specifically for “carpet snake”, whereas the other words may refer to snakes more generically. Of course, the other words might very well also refer to specific kinds of snakes; the English vocabulary that Bates chose is not without its flaws.
The next step is to pass these XML documents through a series of small transformation programs, each of which makes some interpretation of the text and enhances the XML markup in one way or another. The cumulative effect of the transformations is to produce a final output document in which the English and indigenous words are marked up explicitly, and their lexicographical relationships are marked up explicitly using hyperlinks.
For example, a few steps along in the transformation pipeline the same row will have been enhanced with punctuation characters parsed into <pc> markup:
<row xmlns="http://www.tei-c.org/ns/1.0"> <cell>Snake</cell> <cell> Burling <pc>, </pc> jundi <pc>(</pc> carpet <pc>)</pc> <pc>, </pc> binma <pc>, </pc> yalun </cell> </row>
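The punctuation step can be sketched in a few lines. The project's own transformations are XSLT; what follows is only a Python illustration of the same idea, and the function name `mark_punctuation` and the set of punctuation marks in `PUNCT` are assumptions made for this example. It also assumes the cell contains plain text and no child elements yet.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
CELL = f"{{{TEI_NS}}}cell"
PC = f"{{{TEI_NS}}}pc"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace

# Assumption: the punctuation marks worth marking up (hypothetical set).
PUNCT = re.compile(r"([,;=()])")

ROW = (
    '<row xmlns="http://www.tei-c.org/ns/1.0">'
    "<cell>Snake</cell>"
    "<cell>Burling, jundi (carpet), binma, yalun</cell>"
    "</row>"
)


def mark_punctuation(cell):
    """Wrap each punctuation mark in the cell's text in a <pc> element.

    Whitespace around words is stripped, so the original spacing is lost.
    """
    pieces = PUNCT.split(cell.text or "")
    cell.text = None
    last_pc = None
    for piece in pieces:
        piece = piece.strip()
        if not piece:
            continue
        if PUNCT.fullmatch(piece):
            last_pc = ET.SubElement(cell, PC)
            last_pc.text = piece
        elif last_pc is None:
            cell.text = piece      # word before the first punctuation mark
        else:
            last_pc.tail = piece   # word following a punctuation mark
    return cell


row = ET.fromstring(ROW)
answer = mark_punctuation(row.findall(CELL)[1])
print(ET.tostring(answer, encoding="unicode"))
```

Keeping the words as plain text between the `<pc>` elements means nothing is interpreted yet; this step only makes the punctuation addressable for the steps that follow.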
Once the punctuation characters are picked out, a subsequent transformation classifies the words into different types, based on the surrounding punctuation. Here, the TEI element <seg> (segment) is used to assign a type, which is either “item” or “parenthetical”:
<row xmlns="http://www.tei-c.org/ns/1.0"> <cell>Snake</cell> <cell> <seg type="item">Burling</seg> <pc>, </pc> <seg type="item">jundi <pc>(</pc> <seg type="parenthetical">carpet</seg> <pc>)</pc> </seg> <pc>, </pc> <seg type="item">binma</seg> <pc>, </pc> <seg type="item">yalun</seg> </cell> </row>
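One way to picture the classification step, again as a Python sketch and not the XSLT the project actually uses: walk the words and `<pc>` elements in order, start a new "item" segment for each word outside parentheses, and nest anything inside parentheses within the item it follows. The names `tokens` and `classify` are hypothetical, and the sketch does not try to handle commas inside parentheses or nested brackets.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
CELL = f"{{{TEI_NS}}}cell"
PC = f"{{{TEI_NS}}}pc"
SEG = f"{{{TEI_NS}}}seg"

ROW = (
    '<row xmlns="http://www.tei-c.org/ns/1.0"> <cell>Snake</cell> '
    "<cell> Burling <pc>, </pc> jundi <pc>(</pc> carpet <pc>)</pc> "
    "<pc>, </pc> binma <pc>, </pc> yalun </cell> </row>"
)


def tokens(cell):
    """Yield ("word", text) and ("pc", text) pairs in document order."""
    if cell.text and cell.text.strip():
        yield "word", cell.text.strip()
    for pc in cell:
        yield "pc", (pc.text or "").strip()
        if pc.tail and pc.tail.strip():
            yield "word", pc.tail.strip()


def classify(cell):
    """Return a new cell in which words are wrapped in typed <seg> elements."""
    out = ET.Element(cell.tag)
    item = None        # the item segment currently being built
    in_paren = False   # are we between "(" and ")"?
    for kind, text in tokens(cell):
        if kind == "word" and not in_paren:
            item = ET.SubElement(out, SEG, {"type": "item"})
            item.text = text
        elif kind == "word":
            parent = item if item is not None else out
            ET.SubElement(parent, SEG, {"type": "parenthetical"}).text = text
        elif text in ("(", ")"):
            parent = item if item is not None else out
            ET.SubElement(parent, PC).text = text
            in_paren = text == "("
        else:
            # Any other punctuation closes the current item.
            ET.SubElement(out, PC).text = text
            item = None
    return out


answer = classify(ET.fromstring(ROW).findall(CELL)[1])
for child in answer:
    print(child.tag.split("}")[1], child.get("type"), child.text)
```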
These “lexical” and “grammatical” transformations set the stage for the final transformations, which make a guess as to what the text actually means (which of the words are English, which are indigenous, and how they interrelate):
<row xmlns="http://www.tei-c.org/ns/1.0"> <cell> <gloss xml:id="snake" xml:lang="en">Snake</gloss> </cell> <cell> <term ref="#snake">Burling</term>, <term ref="#snake-carpet #snake">jundi </term> (<gloss type="narrow" xml:id="snake-carpet" xml:lang="en">carpet</gloss>) <term ref="#snake">binma</term>, <term ref="#snake">yalun</term> </cell> </row>
Note how the term “jundi” is linked to both the “Snake” and the “(carpet)” glosses, whereas the other terms are linked only to “Snake”. Note also that the words “Snake” and “carpet” are both now explicitly identified as English.
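The linking logic might be sketched as follows. This is a hedged Python illustration, not the project's XSLT: the function names `slug` and `interpret` and the scheme for building identifiers such as "snake-carpet" are assumptions inferred from the example above. Unlike the real output, this sketch drops the punctuation and places each narrow gloss directly after its term.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
CELL, SEG = f"{{{TEI_NS}}}cell", f"{{{TEI_NS}}}seg"
TERM, GLOSS = f"{{{TEI_NS}}}term", f"{{{TEI_NS}}}gloss"
XML_ID, XML_LANG = f"{{{XML_NS}}}id", f"{{{XML_NS}}}lang"

ROW = (
    '<row xmlns="http://www.tei-c.org/ns/1.0"> <cell>Snake</cell> <cell> '
    '<seg type="item">Burling</seg> <pc>, </pc> '
    '<seg type="item">jundi <pc>(</pc> '
    '<seg type="parenthetical">carpet</seg> <pc>)</pc> </seg> <pc>, </pc> '
    '<seg type="item">binma</seg> <pc>, </pc> '
    '<seg type="item">yalun</seg> </cell> </row>'
)


def slug(text):
    """Turn a gloss into an identifier fragment (assumed scheme)."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def interpret(row):
    """Build a row of <gloss> and <term> elements from a <seg>-classified row."""
    prompt_cell, answer_cell = row.findall(CELL)
    prompt = (prompt_cell.text or "").strip()
    prompt_id = slug(prompt)

    out = ET.Element(row.tag)
    gloss = ET.SubElement(ET.SubElement(out, CELL), GLOSS,
                          {XML_ID: prompt_id, XML_LANG: "en"})
    gloss.text = prompt

    answers = ET.SubElement(out, CELL)
    for item in answer_cell.findall(SEG):
        if item.get("type") != "item":
            continue
        term = ET.SubElement(answers, TERM)
        term.text = (item.text or "").strip()
        refs = [prompt_id]  # every term refers to the printed English word
        for paren in item.findall(SEG):
            if paren.get("type") != "parenthetical":
                continue
            narrow = (paren.text or "").strip()
            narrow_id = f"{prompt_id}-{slug(narrow)}"
            narrow_gloss = ET.SubElement(
                answers, GLOSS,
                {"type": "narrow", XML_ID: narrow_id, XML_LANG: "en"})
            narrow_gloss.text = narrow
            refs.insert(0, narrow_id)  # the narrower gloss is listed first
        term.set("ref", " ".join("#" + r for r in refs))
    return out


result = interpret(ET.fromstring(ROW))
for term in result.iter(TERM):
    print(term.text, term.get("ref"))
```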
In the end, our document processing pipeline produces a set of documents in which the lexicographical information is encoded explicitly. From these semantically enriched documents, we can extract the lexicographical data directly, populate a database, and produce statistics and visualisations.
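To show why the explicit markup makes extraction straightforward, here is a small Python sketch that reads the final form of the “Snake” row and resolves each term's links to the glosses they point at. The function name `extract_entries` and the shape of its return value are assumptions made for illustration; populating a real database is left out.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

ROW = (
    '<row xmlns="http://www.tei-c.org/ns/1.0"> <cell> '
    '<gloss xml:id="snake" xml:lang="en">Snake</gloss> </cell> <cell> '
    '<term ref="#snake">Burling</term>, '
    '<term ref="#snake-carpet #snake">jundi </term> '
    '(<gloss type="narrow" xml:id="snake-carpet" xml:lang="en">carpet</gloss>) '
    '<term ref="#snake">binma</term>, '
    '<term ref="#snake">yalun</term> </cell> </row>'
)


def extract_entries(row):
    """Return (term, [gloss, ...]) pairs by following each term's ref links."""
    glosses = {
        g.get(XML_ID): (g.text or "").strip()
        for g in row.iter(TEI + "gloss")
    }
    entries = []
    for term in row.iter(TEI + "term"):
        ids = [ref.lstrip("#") for ref in term.get("ref", "").split()]
        # Skip links that do not resolve to a gloss in this row.
        linked = [glosses[i] for i in ids if i in glosses]
        entries.append(((term.text or "").strip(), linked))
    return entries


for word, meanings in extract_entries(ET.fromstring(ROW)):
    print(word, "->", ", ".join(meanings))
```

Each pair could then become one or more rows in a database table, with the lexicon it came from recorded alongside.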
It’s worth noting that producing this kind of document-processing pipeline is an iterative process itself: each time you run the pipeline and study the data which it produces, you can go back and make tweaks to the pipeline, correcting errors and handling special cases, to gradually refine the dataset to a high quality.
This process captures almost all cases pretty well, but it has to be admitted that a simple XSLT script can’t be guaranteed to make a correct interpretation. In some cases the people filling in the original forms just haven’t used brackets and equals signs and commas in a consistent way at all. In these cases the 80/20 rule applies, and the solution is for a human to identify those errors, make their own interpretation, and mark up that interpretation manually.
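A pipeline can at least help a human find the doubtful cases. The Python sketch below flags handwritten answers whose punctuation looks inconsistent. The function name `needs_review` and the three heuristics it applies are purely hypothetical examples of such checks, not rules taken from the project.

```python
import re


def needs_review(text):
    """Return a list of reasons why an answer may need manual markup."""
    reasons = []

    # Heuristic 1: every "(" should be closed by a later ")".
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                break
    if depth != 0:
        reasons.append("unbalanced parentheses")

    # Heuristic 2: two separators in a row suggest a missing word.
    if re.search(r"[,;=]\s*[,;=]", text):
        reasons.append("adjacent separators")

    # Heuristic 3: a separator at either end suggests a truncated answer.
    if re.search(r"^\s*[,;=]|[,;=]\s*$", text):
        reasons.append("leading or trailing separator")

    return reasons


for sample in ["Burling, jundi (carpet), binma, yalun", "jundi (carpet, binma"]:
    print(repr(sample), needs_review(sample))
```

An empty list does not mean the automatic interpretation is right, only that none of these particular warning signs were found.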