Analysis & Policy Online

Notes for my Open Repositories 2017 conference presentation. I will edit this post later to flesh it out into a proper blog post.
Follow along at: conaltuohy.com/blog/analysis-policy-online/

background

  • Early discussion with Amanda Lawrence of APO (which at that time stood for “Australian Policy Online”) about text mining, at the 2015 LODLAM Summit in Sydney.
  • They needed automation to help with the cataloguing work, to improve discovery.
  • They needed to understand their own corpus better.
  • I suggested a particular technical approach based on previous work.
  • In 2016, APO contracted me to advise and help them build a system that would “mine” metadata from their corpus, and use Linked Data to model and explore it.

constraints

  • Openness
  • Integrate metadata from multiple text-mining processes, plus manually created metadata
  • Minimal dependency on their current platform (Drupal 7, now Drupal 8)
  • Lightweight; easy to make quick changes

technical approach

  • Use an entirely external metadata store (a SPARQL Graph Store); an example query appears after this list
  • Use a pipeline! Extract, Transform, Load
  • Use a standard protocol to extract data (first OAI-PMH, later sitemaps)
  • In fact, use web services for everything; the pipeline is then just a simple script that passes data between web services
  • Sure, XSLT and SPARQL Query, but what the hell is XProc?!
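
To make that concrete, here is a minimal sketch of the kind of question an external graph store can answer, assuming each harvested document’s metadata is stored in its own named graph. The property names (apo:mentions, apo:title) and the URIs are illustrative only, not APO’s actual vocabulary:

prefix apo: <http://example.org/apo/terms#>

# which documents mention a given organisation, and what are their titles?
select ?document ?title
where {
   graph ?document {              # one named graph per harvested document
      ?document apo:mentions <http://example.org/entity/Productivity_Commission> ;
                apo:title ?title .
   }
}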

progress

  • Configured Apache Tika as a web service, using the Stanford Named Entity Recognition toolkit
  • Built an XProc pipeline to harvest from Drupal’s OAI-PMH module, download digital objects, process them with Stanford NER via Tika, and store the resulting graphs in a Fuseki graph store
  • Harvested, and produced a graph of part of the corpus, but …
  • Turned out the Drupal OAI-PMH module was broken! So we used sitemaps instead
  • “Related” list added to APO dev site (NB I’ve seen this isn’t working in all browsers, and obviously needs more work; perhaps using an iframe is not the best idea. Try Chrome if you don’t see the list of related pages on the right)

next steps

  • Visualize the graph
  • Integrate more of the manually created metadata into the RDF graph
  • Add topic modelling (using MALLET) alongside the NER

Let’s see the code

Questions?

(if there’s any time remaining)

Linked Open Data Visualisation at #GLAMVR16

On Thursday last week I flew to Perth, in Western Australia, to speak at an event at Curtin University on visualisation of cultural heritage. Erik Champion, Professor of Cultural Visualisation, who organised the event, had asked me to talk about digital heritage collections and Linked Open Data (“LOD”).

The one-day event was entitled “GLAM VR: talks on Digital heritage, scholarly making & experiential media”, and combined presentations and workshops on cultural heritage data (GLAM = Galleries, Libraries, Archives, and Museums) with advanced visualisation technology (VR = Virtual Reality).

The venue was the Curtin HIVE (Hub for Immersive Visualisation and eResearch), a really impressive visualisation facility at Curtin University, with huge screens and panoramic and 3D displays.

There were about 50 people in attendance, and there would have been over a dozen different presenters, covering a lot of different topics, though with common threads linking them together. I really enjoyed the experience, and learned a lot. I won’t go into the detail of the other presentations here, but quite a few people were live-tweeting, and I’ve collected most of the Twitter stream from the day into a Storify story, which is well worth a read and following up.

My presentation

For my part, I had 40 minutes to cover my topic. I’d been a bit concerned that my talk was more data-focused and contained nothing specifically about VR, but I think on the day the relevance was actually apparent.

The presentation slides are available here as a PDF: Linked Open Data Visualisation

My aims were:

  • At a tactical level, to explain the basics of Linked Data from a technical point of view (i.e. to answer the question “what is it?”); to show that it’s not as hard as it’s usually made out to be; and to inspire people to get started with generating it, consuming it, and visualising it.
  • At a strategic level, to make the case for using Linked Data as a basis for visualisation; that adopting Linked Data technology is not at all a distraction from visualisation, but rather provides a powerful generic framework on top of which visualisations of various kinds can be more easily constructed, and given the kind of robustness that real scholarly work deserves.

Linked Data basics

I spent the first part of my talk explaining what Linked Open Data means; starting with “what is a graph?” and introducing RDF triples and Linked Data. Finally I showed a few simple SPARQL queries, without explaining SPARQL in any detail, but just to show the kinds of questions you can ask with a few lines of SPARQL code.
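
The slides aren’t reproduced here, but a query along these lines gives the flavour; the dataset and property names are invented purely for the sake of the example:

prefix ex: <http://example.org/schema#>

# "Which artists born before 1900 have works in the collection, and how many?"
select ?artist (count(?work) as ?works)
where {
   ?work ex:creator ?artist .
   ?artist ex:birthYear ?born .
   filter(?born < 1900)
}
group by ?artist
order by desc(?works)

Even this much — a pattern, a filter, a count — already answers a research-style question, and the result is a simple table ready to be charted.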

[Slide: What is an RDF graph?]

While I explained graph data models, I saw attendees nodding, which I took as a sign of understanding and not that they were nodding off to sleep; it was still pretty early in the day for that.

One thing I hoped to get across in this part of the presentation was just that Linked Data is not all that hard to get into. Sure, it’s not a trivial technology, but the barriers to entry are not that high; the basics of it are quite basic, so you can make a start and do plenty of useful things without having to know all the advanced stuff. For instance, there are a whole bunch of RDF serializations, but in fact you can get by with knowing only one. There are a zillion different ontologies, but again you only need to know the ontology you want to use, and you can do plenty of things without worrying about a formal ontology at all. I’d make the case for university eResearch agencies, Software Carpentry, and similar efforts to offer classes and basic support in this technology, especially in library and information science, and the humanities generally.

Linked Data as architecture

People often use the analogy of building when talking about making software. We talk about a “build process”, “platforms”, “architecture”, and so on. It’s not an exact analogy, but it is useful. Using that analogy, Linked Data provides a foundation that you can build a solid edifice on top of. If you skimp on the foundation, you may get started more quickly, but you will encounter problems later. If your project is small, and if it’s a temporary structure (a shack or bivouac), then architecture is not so important, and you can get away with skimping on foundations (and you probably should!), but the larger the project is (an office building), and the longer you want it to persist (a cathedral), the more valuable a good architecture will be. In the case of digital scholarly works, the common situation in academia is that weakly-architected works are being cranked out and published, but being hard to maintain, they tend to crumble away within a few years.

Crucially, a Linked Data dataset can capture the essence of what needs to be visualised, without being inextricably bound up with any particular genre of visualisation, or any particular visualisation software tool. This relative independence from specific tools is important because a dataset which is tied to a particular software platform needs to rely on the continued existence of that software, and experience shows that individual software packages come and go depressingly quickly. Often only a few years are enough for a software program to be “orphaned”, unavailable, obsolete, incompatible with the current software environment (e.g. requires Windows 95 or IE6), or even, in the case of software available online as a service, for it to completely disappear into thin air, if the service provider goes bust or shuts down the service for reasons of their own. In these cases you can suddenly realise you’ve been building your “scholarly output” on sand.

By contrast, a Linked Data dataset is standardised, and it’s readable with a variety of tools that support that standard. That provides you with a lot of options for how you could go on to visualise the data; that generic foundation gives you the possibility of building (and rebuilding) all kinds of different things on top of it.

Because of its generic nature and its openness to the Web, Linked Data technology has become a broad software ecosystem which already has a lot of people’s data riding on it; that kind of mass investment (a “bandwagon”, if you like) is insurance against it being wiped out by the whims or vicissitudes of individual businesses. That’s the major reason why a Linked Data dataset can be archived and stored long term with confidence.

Linked Open Data is about sharing your data for reuse

Finally, by publishing your dataset as Linked Open Data (independently of any visualisations you may have made of it), you are opening it up to reuse not only by yourself, but by others.

The graph model allows you to describe the meaning of the terms you’ve used (i.e. the analytical categories used in your data can themselves be described and categorised, because everything is a node in a graph). This means that other people can work out what your dataset actually means.

The use of URIs for identifiers means that others can easily cite your work and effectively contribute to your work by creating their own annotations on it. They don’t need to impinge on your work; their annotations can live somewhere else altogether and merely refer to nodes in your graph by those nodes’ identifiers (URIs). They can comment; they can add cross-references; they can assert equivalences to nodes in other graphs, elsewhere. Your scholarly work can break out of its box, to become part of an open web of knowledge that grows and ramifies and enriches us all.
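
As a concrete (and entirely hypothetical) illustration, a third party could run an update like this against their own triple store, enriching one of your nodes without touching your dataset at all; every URI and literal here is made up:

prefix owl:  <http://www.w3.org/2002/07/owl#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# a third party, in their own store, links your node to another identifier
# and adds a comment; your graph remains untouched
insert data {
   <http://your-project.example.org/id/person/123>
      owl:sameAs <http://www.wikidata.org/entity/Q12345> ;
      rdfs:comment "Also documented in our regional newspaper archive." .
}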

Visualizing Government Archives through Linked Data

Tonight I’m knocking back a gin and tonic to celebrate finishing a piece of software development for my client the Public Record Office Victoria; the archives of the government of the Australian state of Victoria.

The work, which will go live in a couple of weeks, was an update to a browser-based visualization tool which we first set up last year. In response to user testing, we made some changes to improve the visualization’s usability. It certainly looks a lot clearer than it did, and the addition of some online help makes it a bit more accessible for first-time users.

The visualization now looks like this (here showing the entire dataset, unfiltered, which is not actually that useful, though it is quite pretty):

[Screenshot: the provisualizer, showing the full dataset]

The bulk of the work, though, was to automate the preparation of data for the visualization.

Up until now, the dataset which you could visualize consisted of a couple of CSV files, manually assembled with considerable care and effort from reports exported from PROV’s repository “Archives One”. In the new system, this manual work will not need to be repeated. Instead, the same dataset will be assembled by an automated metadata-processing pipeline which will keep it continually up to date as government agencies and functions change over time.

It was not as big a job as you might think, since in fact a lot of the work to generate the data had already been done.

PROV’s Interoperable Data service

In 2012, in collaboration with their counterpart agency State Records New South Wales, PROV had set up an Interoperable Data publishing service with funding from the Australian National Data Service. They custom-built some software to export data from Archives One to produce a set of metadata records in RIF-CS format, and they deployed an off-the-shelf software application (an “OAI-PMH Repository”) to disseminate those metadata records over the web.

Originally, the OAI-PMH repository was serving data to the Australian National Data Service, which runs an aggregation service called Research Data Australia, which offers researchers pointers to all manner of scientific, historical and cultural datasets. The PROV metadata, covering the full history of government records in Victoria, is a useful resource for social science researchers, genealogists, historians, and others.

More recently, PROV’s OAI-PMH repository has also been harvested by the National Library of Australia’s Trove service.

Now at last it will be harvested by the Public Record Office itself.

The data pipeline

The software I’ve written is a web application, built using a programming language for data pipelines called XProc. The software itself is open source and available on GitHub in a repository with the ludicrously acronymous title PROV-RIF-SPARQL.

This XProc application tediously harvests the metadata records (there are more than 30000 of them) and converts each one from RIF-CS format into RDF/XML format. The RDF/XML data is a reformulation of the RIF-CS in which the hierarchical structures of the RIF-CS are re-expressed as a network of interconnected statements; a kind of web of nodes and links which mathematicians call a “graph”. The statements in these graphs are expressed using the international standard conceptual framework for cultural heritage data; the CIDOC-CRM. My harvester then stores all these RDF/XML documents (or “graphs”) in a SPARQL Graph Store (a kind of hybrid document store and database). The SPARQL Graph Store allows each graph to be addressed individually, but also for the entire dataset to be treated as a single graph, and queried as a whole. Finally, the RDF dataset is queried to produce the two summarised data files which the visualization itself requires; these are simple spreadsheets in CSV (Comma Separated Values) format. One table contains information about each government agency or function, and the other table lists the relationships which have historically existed between those agencies and functions.
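
To give a feel for that last step: a SPARQL SELECT query returns a table of rows, and SPARQL results serialise naturally to CSV, so a query’s output can be exactly the spreadsheet the visualization reads. The sketch below is a simplified, hypothetical version of the relationships query; the property names are placeholders, not the CIDOC-CRM terms the real data uses:

prefix ex: <http://example.org/prov-sketch#>

# roughly: one row per relationship between two agencies or functions,
# with the date range over which the relationship held
select ?source ?target ?relationshipType ?startDate ?endDate
where {
   ?relationship ex:source ?source ;
                 ex:target ?target ;
                 ex:type   ?relationshipType .
   optional { ?relationship ex:start ?startDate }
   optional { ?relationship ex:end   ?endDate }
}
order by ?source ?startDate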

The harvester has a basic user interface where you can start a data harvest; a process that takes about half an hour to complete. In this interface you can specify the location of the OAI-PMH server you want to harvest data from, the format of the data you want to harvest, and the location of the SPARQL Graph Store where you want to store the result, amongst other parameters. In practice, this user interface isn’t used by a human (except during testing); another small program running on a regular schedule makes the request.

[Screenshot: the harvester’s user interface]

At this stage of the project, the RDF graph is only used internally at PROV, where it functions purely as an intermediary between the RIF-CS input and the CSV output. The RDF data and the SPARQL database together just provide a convenient way to aggregate a big set of records and query the resulting aggregation. But later I have no doubt that the RDF data will be published directly as Linked Open Data, opening it up, and allowing it to be connected into a world-wide web of data.

Taking control of an uncontrolled vocabulary

A couple of days ago, Dan McCreary tweeted:

It reminded me of some work I had done a couple of years ago for a project which was at the time based on Linked Data, but which later switched away from that platform, leaving various bits of RDF-based work orphaned.

One particular piece which sprang to mind was a tool for dealing with vocabularies. Whether it’s useful for Dan’s talk I don’t know, but I thought I would dig it out and blog a little about it in case it’s of interest more generally to people working in Linked Open Data in Libraries, Archives and Museums (LODLAM).

I told Dan:

When he sounded interested, I made a promise:

I know I should find a better home for this and the other orphaned LODLAM components, but for now, the original code can be seen here:

https://github.com/Conal-Tuohy/huni/tree/master/data-store/www/skos-tools/vocabulary.xml

I’ll explain briefly how it works, but first, I think it’s necessary to explain the rationale for the vocabulary tool, and for that you need to see how it fits into the LODLAM environment.

At the moment there is a big push in the cultural sector towards moving data from legacy information systems into the “Linked Open Data (LOD) Cloud” – i.e. republishing the existing datasets as web-based sets of inter-linked data. In some cases people are actually migrating from their old infrastructure, but more commonly people are adding LOD capability to existing systems via some kind of API (this is a good approach, to my way of thinking – it reduces the cost and effort involved enormously). Either way, you have to be able to take your existing data and re-express it in terms of Linked Data, and that means facing up to some challenges, one of which is how to manage “vocabularies”.

Vocabularies, controlled and uncontrolled

What are “vocabularies” in this context? A “vocabulary” is a set of descriptive terms which can be applied to a record in a collection management system. For instance, a museum collection management system might have a record for a teacup, and the record could have a number of fields such as “type”, “maker”, “pattern”, “colour”, etc. The value of the “type” field would be “teacup”, for instance, but another piece in the collection might have the value “saucer” or “gravy boat” or what have you. These terms, “teacup”, “plate”, “dinner plate”, “saucer”, “gravy boat” etc, constitute a vocabulary.

In some cases, this set of terms is predefined in a formal list. This is called a “controlled vocabulary”. Usually each term has a description or definition (a “scope note”), and if there are links to other related terms (e.g. “dinner plate” is a “narrower term” of “plate”), as well as synonyms, including in other languages (“taza”, “plato”, etc.), then the controlled vocabulary is called a thesaurus. A thesaurus or a controlled vocabulary can be a handy guide to finding things. You can navigate your way around a thesaurus, from one term to another, to find related classes of object which have been described with those terms, or the thesaurus can be used to automatically expand your search queries without you having to do anything; you can search for all items tagged as “plate” and the system will automatically also search for items tagged “dinner plate” or “bread plate”.
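
Once the thesaurus is in a triple store (using SKOS, described below), that kind of query expansion is a one-line property path. A minimal sketch, assuming items are tagged with concept URIs via a hypothetical ex:tag property, and using a made-up concept URI:

prefix skos: <http://www.w3.org/2004/02/skos/core#>
prefix ex:   <http://example.org/collection#>

# find items tagged "plate" or any narrower term (dinner plate, bread plate, ...)
select ?item
where {
   <http://example.com/tableware/plate> skos:narrower* ?concept .
   ?item ex:tag ?concept .
}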

In other cases, though, these vocabularies are uncontrolled. They are just tags that people have entered in a database, and they may be consistent or inconsistent, depending on who did the data entry and why. An uncontrolled vocabulary is not so useful. If the vocabulary includes the terms “tea cup”, “teacup”, “Tea Cup”, etc. as distinct terms, then it’s not going to help people to find things because those synonyms aren’t linked together. If it includes terms like “Stirrup Cup” it’s going to be less than perfectly useful because most people don’t know what a Stirrup Cup is (it is a kind of cup).

The vocabulary tool

So one of the challenges in moving to a Linked Data environment is taking the legacy vocabularies which our systems use, and bringing them under control; linking synonyms and related terms together, providing definitions, and so on. This is where my vocabulary tool would come in.

In the Linked Data world, vocabularies are commonly modelled using a system called Simple Knowledge Organization System (SKOS). Using SKOS, every term (a “Concept” in SKOS) is identified by a unique URI, and these URIs are then associated with labels (such as “teacup”), definitions, and with other related Concepts.

The vocabulary tool is built with the assumption that a legacy vocabulary of terms has been migrated to RDF form by converting every one of the terms into a URI, simply by sticking a common prefix on it, and if necessary “munging” the text to replace or encode spaces or other characters which aren’t allowed in URIs. For example, this might produce a bunch of URIs like this:

  • http://example.com/tableware/teacup
  • http://example.com/tableware/gravy_boat
  • etc.

What the tool then does is it finds all these URIs and gives you a web form which you can fill in to describe them and link them together. To be honest I’m not sure how far I got with this tool, but ultimately the idea would be that you would be able to organise the terms into a hierarchy, link synonyms, standardise inconsistencies by indicating “preferred” and “non-preferred” terms (i.e. you could say that “teacup” is preferred, and that “Tea Cup” is a non-preferred equivalent).
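
As a concrete sketch, the statements the tool might save for one term could look something like this, expressed here as a SPARQL update just for illustration. The SKOS properties are standard, but the “cup” broader term is invented:

prefix skos: <http://www.w3.org/2004/02/skos/core#>

insert data {
   <http://example.com/tableware/teacup>
      a skos:Concept ;
      skos:inScheme  <http://example.com/tableware/> ;
      skos:prefLabel "teacup"@en ;
      skos:altLabel  "Tea Cup"@en , "tea cup"@en ;   # non-preferred variants
      skos:broader   <http://example.com/tableware/cup> .   # hypothetical broader term
}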

When you start the tool, you have the opportunity to enter a “base URI”, which in this case would be http://example.com/tableware/ – the tool would then find every such URI which was in use, and display them on the form for you to annotate. When you had finished imposing a bit of order on the vocabulary, you would click “Save” and your annotations would be stored in an RDF graph whose name was http://example.com/tableware/. Later, your legacy system might introduce more terms, and your Linked Data store would have some new URIs with that prefix. You would start up the form again, enter the base URI, and load all the URIs again. All your old annotations would also be loaded, and you would see the gaps where there were terms that hadn’t been dealt with; you could go and edit the definitions and click “Save” again.

In short, the idea of the tool was to be able to use, and to continue to use, legacy systems which lack controlled vocabularies, and actually impose control over those vocabularies after converting them to LOD.

How it works

OK here’s the technical bit.

The form is built using XForms technology, and I coded it to use a browser-based (i.e. Javascript) implementation of XForms called XSLTForms.

When the XForm loads, you can enter the common base URI of your vocabulary into a text box labelled “Concept Scheme URI”, and click the “Load” button. When the button is clicked, the vocabulary URI is substituted into a pre-written SPARQL query and sent off to a SPARQL server. This SPARQL query is the tricky part of the whole system really: it finds all the URIs, and it loads any labels which you might have already assigned them, and if any don’t have labels, it generates one by converting the last part of the URI back into plain text.

prefix skos: <http://www.w3.org/2004/02/skos/core#>

construct {
   # the vocabulary itself, as a SKOS Concept Scheme
   ?vocabulary a skos:ConceptScheme ;
      skos:prefLabel ?vocabularyLabel .
   # each term, as a SKOS Concept belonging to that scheme
   ?term a skos:Concept ;
      skos:inScheme ?vocabulary ;
      skos:prefLabel ?prefLabel .
   # plus any annotations already saved in the vocabulary's named graph
   ?subject ?predicate ?object .
} where {
   # the Concept Scheme base URI entered in the form is substituted in here
   bind(<http://corbicula.huni.net.au/data/adb/occupation/> as ?vocabulary)
   {
      # give the scheme a placeholder label if it doesn't already have one
      optional {?vocabulary skos:prefLabel ?existingVocabularyLabel}
      bind("Vocabulary Name" as ?vocabularyLabel)
      filter(!bound(?existingVocabularyLabel))
   } union {
      # find every URI under the base URI which is used anywhere as an object,
      # and, where it has no label yet, derive one from the end of the URI
      ?subject ?predicate ?term .
      bind(
         replace(substr(str(?term), strlen(str(?vocabulary)) + 1), "_", " ") as ?prefLabel
      )
      optional {?term skos:prefLabel ?existingPrefLabel}
      filter(!bound(?existingPrefLabel))
      filter(strstarts(str(?term), str(?vocabulary)))
      filter(?term != ?vocabulary)
   } union {
      # include everything already stored in this vocabulary's named graph
      graph ?vocabulary {
         ?subject ?predicate ?object
      }
   }
}

The resulting list of terms and labels is loaded into the form as a “data instance”, and the form automatically grows to provide data entry fields for all the terms in the instance. When you click the “Save” button, the entire vocabulary, including any labels you’ve entered, is saved back to the server.
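
The post doesn’t show the server side of that submission (the form actually posts its XForms data instance), but in SPARQL Update terms the effect of “Save” is roughly to replace the vocabulary’s named graph with the newly annotated version. A hypothetical sketch, reusing the tableware example:

prefix skos: <http://www.w3.org/2004/02/skos/core#>

# replace the vocabulary's named graph with the edited version (illustrative only)
drop silent graph <http://example.com/tableware/> ;
insert data {
   graph <http://example.com/tableware/> {
      <http://example.com/tableware/> a skos:ConceptScheme ;
         skos:prefLabel "Tableware" .
      <http://example.com/tableware/teacup> a skos:Concept ;
         skos:inScheme  <http://example.com/tableware/> ;
         skos:prefLabel "teacup" .
   }
}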
