Taking control of an uncontrolled vocabulary

A couple of days ago, Dan McCreary tweeted:

It reminded me of some work I had done a couple of years ago for a project which was at the time based on Linked Data, but which later switched away from that platform, leaving various bits of RDF-based work orphaned.

One particular piece which sprung to mind was a tool for dealing with vocabularies. Whether it’s useful for Dan’s talk I don’t know, but I thought I would dig it out and blog a little about it in case it’s of interest more generally to people working in Linked Open Data in Libraries, Archives and Museums (LODLAM).

I told Dan:

When he sounded interested, I made a promise:

I know I should find a better home for this and the other orphaned LODLAM components, but for now, the original code can be seen here:

https://github.com/Conal-Tuohy/huni/tree/master/data-store/www/skos-tools/vocabulary.xml

I’ll explain briefly how it works, but first, I think it’s necessary to explain the rationale for the vocabulary tool, and for that you need to see how it fits into the LODLAM environment.

At the moment there is a big push in the cultural sector towards moving data from legacy information systems into the “Linked Open Data (LOD) Cloud” – i.e. republishing the existing datasets as web-based sets of inter-linked data. In some cases people are actually migrating from their old infrastructure, but more commonly people are adding LOD capability to existing systems via some kind of API (this is a good approach, to my way of thinking – it reduces the cost and effort involved enormously). Either way, you have to be able to take your existing data and re-express it in terms of Linked Data, and that means facing up to some challenges, one of which is how to manage “vocabularies”.

Vocabularies, controlled and uncontrolled

What are “vocabularies” in this context? A “vocabulary” is a set of descriptive terms which can be applied to a record in a collection management system. For instance, a museum collection management system might have a record for a teacup, and the record could have a number of fields such as “type”, “maker”, “pattern”, “colour”, etc. The value of the “type” field would be “teacup”, for instance, but another piece in the collection might have the value “saucer” or “gravy boat” or what have you. These terms, “teacup”, “plate”, “dinner plate”, “saucer”, “gravy boat” etc, constitute a vocabulary.

In some cases, this set of terms is predefined in a formal list, This is called a “controlled vocabulary”. Usually each term has a description or definition (a “scope note”), and if there are links to other related terms (e.g. “dinner plate” is a “narrower term” of “plate”), as well as synonyms, including in other languages (“taza”, “plato”, etc) then the controlled vocabulary is called a thesaurus. A thesaurus or a controlled vocabulary can be a handy guide to finding things. You can navigate your way around a thesaurus, from one term to another, to find related classes of object which have been described with those terms, or the thesaurus can be used to automatically expand your search queries without you having to do anything; you can search for all items tagged as “plate” and the system will automatically also search for items tagged “dinner plate” or “bread plate”.

In other cases, though, these vocabularies are uncontrolled. They are just tags that people have entered in a database, and they may be consistent or inconsistent, depending on who did the data entry and why. An uncontrolled vocabulary is not so useful. If the vocabulary includes the terms “tea cup”, “teacup”, “Tea Cup”, etc. as distinct terms, then it’s not going to help people to find things because those synonyms aren’t linked together. If it includes terms like “Stirrup Cup” it’s going to be less than perfectly useful because most people don’t know what a Stirrup Cup is (it is a kind of cup).

The vocabulary tool

So one of the challenges in moving to a Linked Data environment is taking the legacy vocabularies which our systems use, and bringing them under control; linking synonyms and related terms together, providing definitions, and so on. This is where my vocabulary tool would come in.

In the Linked Data world, vocabularies are commonly modelled using a system called Simple Knowledge Organization System (SKOS). Using SKOS, every term (a “Concept” in SKOS) is identified by a unique URI, and these URIs are then associated with labels (such as “teacup”), definitions, and with other related Concepts.

The vocabulary tool is built with the assumption that a legacy vocabulary of terms has been migrated to RDF form by converting every one of the terms into a URI, simply by sticking a common prefix on it, and if necessary “munging” the text to replace, or encode spaces or other characters which aren’t allowed in URIs. For example, this might produce a bunch of URIs like this:

  • http://example.com/tableware/teacup
  • http://example.com/tableware/gravy_boat
  • etc.

What the tool then does is it finds all these URIs and gives you a web form which you can fill in to describe them and link them together. To be honest I’m not sure how far I got with this tool, but ultimately the idea would be that you would be able to organise the terms into a hierarchy, link synonyms, standardise inconsistencies by indicating “preferred” and “non-preferred” terms (i.e. you could say that “teacup” is preferred, and that “Tea Cup” is a non-preferred equivalent).

When you start the tool, you have the opportunity to enter a “base URI”, which in this case would be http://example.com/tableware/ – the tool would then find every such URI which was in use, and display them on the form for you to annotate. When you had finished imposing a bit of order on the vocabulary, you would click “Save” and your annotations would be stored in an RDF graph whose name was http://example.com/tableware/. Later, your legacy system might introduce more terms, and your Linked Data store would have some new URIs with that prefix. You would start up the form again, enter the base URI, and load all the URIs again. All your old annotations would also be loaded, and you would see the gaps where there were terms that hadn’t been dealt with; you could go and edit the definitions and click “Save” again.

In short, the idea of the tool was to be able to use, and to continue to use, legacy systems which lack controlled vocabularies, and actually impose control over those vocabularies after converting them to LOD.

How it works

OK here’s the technical bit.

The form is built using XForms technology, and I coded it to use a browser-based (i.e. Javascript) implementation of XForms called XSLTForms.

When the XForm loads, you can enter the common base URI of your vocabulary into a text box labelled “Concept Scheme URI”, and click the “Load” button. When the button is clicked, the vocabulary URI is substituted into a pre-written SPARQL query and sent off to a SPARQL server. This SPARQL query is the tricky part of the whole system really: it finds all the URIs, and it loads any labels which you might have already assigned them, and if any don’t have labels, it generates one by converting the last part of the URI back into plain text.

prefix skos: <http://www.w3.org/2004/02/skos/core#>

construct {
   ?vocabulary a skos:ConceptScheme ;
      skos:prefLabel ?vocabularyLabel.
   ?term a skos:Concept ;
      skos:inScheme ?vocabulary ;
      skos:prefLabel ?prefLabel .
      ?subject ?predicate ?object .
} where {
   bind(&lt;<vocabulary-uri><!--http://corbicula.huni.net.au/data/adb/occupation/--></vocabulary-uri>&gt; as ?vocabulary)
   {
      optional {?vocabulary skos:prefLabel ?existingVocabularyLabel}
      bind("Vocabulary Name" as ?vocabularyLabel)
      filter(!bound(?existingVocabularyLabel))
   } union {
      ?subject ?predicate ?term .
      bind(
         replace(substr(str(?term), strlen(str(?vocabulary)) + 1), "_", " ") as ?prefLabel
      )
      optional {?term skos:prefLabel ?existingPrefLabel}
      filter(!bound(?existingPrefLabel))
      filter(strstarts(str(?term), str(?vocabulary)))
      filter(?term != ?vocabulary)
   } union {
      graph ?vocabulary {
         ?subject ?predicate ?object
      }
   }
}

The resulting list of terms and labels is loaded into the form as a “data instance”, and the form automatically grows to provide data entry fields for all the terms in the instance. When you click the “Save” button, the entire vocabulary, including any labels you’ve entered, is saved back to the server.

One thought on “Taking control of an uncontrolled vocabulary”

Make a comment