Conal Tuohy's blog

Analysis & Policy Online

Conal — Tue, 27 Jun 2017 23:45:27 +0000

Notes for my Open Repositories 2017 conference presentation. I will edit this post later to flesh it out into a proper blog post.
Follow along at: conaltuohy.com/blog/analysis-policy-online/

background

Early discussion with Amanda Lawrence of APO (which at that time stood for “Australian Policy Online”) about text mining, at the 2015 LODLAM Summit in Sydney.
They needed automation to help with the cataloguing work, to improve discovery.
They needed to understand their own corpus better.
I suggested a particular technical approach based on previous work.
In 2016, APO contracted me to advise and help them build a system that would “mine” metadata from their corpus, and use Linked Data to model and explore it.

constraints

Openness
Integrate metadata from multiple text-mining processes, plus manually created metadata
Minimal dependency on their current platform (Drupal 7, now Drupal 8)
Lightweight; easy to make quick changes

technical approach

Use an entirely external metadata store (a SPARQL Graph Store)
Use a pipeline! Extract, Transform, Load
Use standard protocol to extract data (first OAI-PMH, later sitemaps)
In fact, use web services for everything; the pipeline is then just a simple script that passes data between web services
Sure, XSLT and SPARQL Query, but what the hell is XProc?!

progress

Configured Apache Tika as a web service, using Stanford Named Entity Recognition toolkit
Built XProc pipeline to harvest from Drupal’s OAI-PMH module, download digital objects, process them with Stanford NER via Tika, and store the resulting graphs in Fuseki graph store
Harvested, and produced a graph of part of the corpus, but …
Turned out the Drupal OAI-PMH module wa broken! So we used Sitemap instead
“Related” list added to APO dev site (NB I’ve seen this isn’t working in all browsers, and obviously needs more work, perhaps using an iframe is not the best idea. Try Chrome if you don’t see the list of related pages on the right)

next steps

Visualize the graph
Integrate more of the manually created metadata into the RDF graph
Add topic modelling (using MALLET) alongside the NER

Let’s see the code

Questions?

(if there’s any time remaining)

A tool for Web API harvesting

Conal — Sat, 31 Dec 2016 05:31:05 +0000

A medieval man harvesting metadata from a medieval Web API

As 2016 stumbles to an end, I’ve put in a few days’ work on my new project Oceania, which is to be a Linked Data service for cultural heritage in this part of the world. Part of this project involves harvesting data from cultural institutions which make their collections available via so-called “Web APIs”. There are some very standard ways to publish data, such as OAI-PMH, OpenSearch, SRU, RSS, etc, but many cultural heritage institutions instead offer custom-built APIs that work in their own peculiar way, which means that you need to put in a certain amount of effort in learning each API and dealing with its specific requirements. So I’ve turned to the problem of how to deal with these APIs in the most generic way possible, and written a program that can handle a lot of what is common in most Web APIs, and can be easily configured to understand the specifics of particular APIs.

This program, which I’ve called API Harvester, can be configured by giving it a few simple instructions: where to download the data from, how to split up the downloaded data into individual records, where to save the record files, how to name those files, and where to get the next batch of data from (i.e. how to resume the harvest). The API Harvester does have one hard requirement: it is only able to harvest data in XML format, but most of the APIs I’ve seen offered by cultural heritage institutions do provide XML, so I think it’s not a big limitation.

The API Harvester software is open source, and free to use; I hope that other people find it useful, and I’m happy to accept feedback or improvements, or examples of how to use it with specific APIs. I’ve created a wiki page to record example commands for harvesting from a variety of APIs, including OAI-PMH, the Trove API, and an RSS feed from this blog. This wiki page is currently open for editing, so if you use the API Harvester, I encourage you to record the command you use, so other people can benefit from your work. If you have trouble with it, or need a hand, feel free to raise an issue on the GitHub repository, leave a comment here, or contact me on Twitter.

Finally, a brief word on how to use the software: to tell the harvester how to pull a response apart into individual records, and where to download the next page of records from (and the next, and the next…), you give it instructions in the form of “XPath expressions”. XPath is a micro-language for querying XML documents; it allows you to refer to elements and attributes and pieces of text within an XML document, to perform basic arithmetic and manipulate strings of text. XPath is simple yet enormously powerful; if you are planning on doing anything with XML it’s an essential thing to learn, even if only to a very basic level. I’m not going to give a tutorial on XPath here (there are plenty on the web), but I’ll give an example of querying the Trove API, and briefly explain the XPath expressions used in that examples:

Here’s the command I would use to harvest metadata about maps, relevant to the word “oceania”, from the Trove API, and save the results in a new folder called “oceania-maps” in my Downloads folder:

java -jar apiharvester.jar directory="/home/ctuohy/Downloads/oceania-maps" retries=5 url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full" url-suffix="&key=XXXXXXX" records-xpath="/response/zone/records/*" id-xpath="@url" resumption-xpath="/response/zone/records/@next"

For legibility, I’ve split the command onto multiple lines, but this is a single command and should be entered on a single line.

Going through the parts of the command in order:

The command java launches a Java Virtual Machine to run the harvester application (which is written in the Java language).
The next item, -jar, tells Java to run a program that’s been packaged as a “Java Archive” (jar) file.
The next item, apiharvester.jar, is the harvester program itself, packaged as a jar file.

The remainder of the command consists of parameters that are passed to the API harvester and control its behaviour.

The first parameter, directory="/home/ctuohy/Downloads/oceania-maps", tells the harvester where to save the XML files; it will create this folder if it doesn’t already exist.
With the second parameter, retries=5, I’m telling the harvester to retry a download up to 5 times if it fails; Trove’s server can sometimes be a bit flaky at busy times; retrying a few times can save the day.
The third parameter, url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full", tells the harvester where to download the first batch of data from. To generate a URL like this, I recommend using Tim Sherratt’s excellent online tool, the Trove API Console.
The next parameter url-suffix="&key=XXXXXXX" specifies a suffix that the harvester will append to the end of all the URLs which it requests. Here, I’ve used url-suffix to specify Trove’s “API Key”; a password which each registered Trove API user is given. To get one of these, see the Trove Help Centre. NB XXXXXXX is not my actual API Key.

The remaining parameters are all XPath expressions. To understand them, it will be helpful to look at the XML content which the Trove API returns in response to that query, and which these XPath expressions apply to.

The first XPath parameter, records-xpath="/response/zone/records/*", identifies the elements in the XML which constitute the individual records. The XPath /response/zone/records/* describes a path down the hierarchical structure of the XML: the initial / refers to the start of the document, the response refers to an element with that name at the “root” of the document, then /zone refers to any element called zone within that response element, then /records refers to any records within any of those response elements, and the final /* refers to any elements (with any name) within any of of those response elements. In practice, this XPath expression identifies all the work elements in the API’s response, and means that each of these work elements (and its contents) ends up saved in its own file.
The next parameter, id-xpath="@url" tells the harvester where to find a unique identifier for the record, to generate a unique file name. This XPath is evaluated relative to the elements identified by the records-xpath; i.e. it gets evaluated once for each record, starting from the record’s work element. The expression @url means “the value of the attribute named url”; the result is that the harvested records are saved in files whose names are derived from these URLs. If you look at the XML, you’ll see I could equally have used the expression @id instead of @url.
The final parameter, resumption-xpath="/response/zone/records/@next", tells the harvester where to find a URL (or URLs) from which it can resume the harvest, after saving the records from the first response. You’ll see in the Trove API response that the records element has an attribute called next which contains a URL for this purpose. When the harvester evaluates this XPath expression, it gathers up the next URLs and repeats the whole download process again for each one. Eventually, the API will respond with a records element which doesn’t have a next attribute (meaning that there are no more records). At that point, the XPath expression will evaluate to nothing, and the harvester will run out of URLs to harvest, and grind to a halt.

Happy New Year to all my readers! I hope this tool is of use to some of you, and I wish you a productive year of metadata harvesting in 2017!

Oceania

Conal — Wed, 28 Dec 2016 06:41:58 +0000

I am really excited to have begun my latest project: a Linked Open Data service for online cultural heritage from New Zealand and Australia, and eventually, I hope, from our other neighbours. I have called the service “oceania.digital”

The big idea of oceania.digital is to pull together threads from a number of different “cultural” data sources and weave them together into a single web of data which people can use to tell a huge number of stories.

There are a number of different aspects to the project, and a corresponding number of stages to go through…

I need to gather the data together from a variety of sources. Both Trove and Digital NZ are doing this at the national level; I want to build on both of those data sources, and gradually add more and more.
Having gathered data from my data sources, I need to transform the harvested data into an interoperable form, namely the World Wide Web Consortium’s “Resource Description Framework” (RDF). The metaphor I suggest is that of teasing out threads from the raw data, so that the threads from one dataset can later be interwoven with those from another. This is the vision of the Semantic Web.
Having converted the data to RDF, I need to weave the threads together so that the data harvested from the different sources is explicitly linked to data from other sources. This means identifying where the same things (people, places, etc.) are described in the different data sources, and explicitly equating or merging those things. This is related to what librarians call “Authority Control“.
Finally, having produced a web of interconnected data, I need to make it practically useful to a wide range of people, not just Semantic Web nerds like me. I will need to build, curate, and inspire the development of new tools that will help end-users to tell stories using the RDF dataset. Most people won’t be programming with RDF themselves, and they won’t be excited by JSON-LD or SPARQL; they will need user-friendly software tools that allow them to summon up the data they need, with a minimum of technical geekery, and to use it to produce visualisations, links, images, maps, and timelines, which they can embed on their blogs and websites, in Facebook, Twitter, and other social media.

So far, I have set up a website, with some harvesting software and an RDF data store.

The first dataset I intend to process is “People Australia”; a collection of biographical which is aggregated from a variety of Australian sources and published by the National Library of Australia’s “People Australia”. Hopefully soon after I will be be able to add a related dataset from New Zealand.

Once I have some data available in RDF form, I will add some features to allow the data to be reused on other websites, then I’ll go back and add more datasets from elsewhere, and repeat the process.

If you’d like to keep in touch with the project as it progresses, you can follow the @OceaniaDigital account on Twitter, or follow my blog.

If you think you’d like to contribute to the project in any way, please do get in touch, either via Twitter or email!

Australian Society of Archivists 2016 conference #asalinks

Conal — Tue, 25 Oct 2016 08:15:53 +0000

Last week I participated in the 2016 conference of the Australian Society of Archivists, in Parramatta.

#ASALinks poster

I was very impressed by the programme and the discussion. I thought I’d jot down a few notes here about just a few of the presentations that were most closely related to my own work. The presentations were all recorded, and as the ASA’s YouTube channel is updated with newly edited videos, I’ll be editing this post to include those videos.

It was my first time at an ASA conference; I’d been dragged into it by Asa Letourneau, from the Public Record Office Victoria, with whom over the last year I’d been developing a piece of software called “PROVisualizer”, which appears right below here in the page (hint: click the “full screen” button in its bottom right corner if you want to have a play with it).

health

Asa and I gave a presentation on the PROVisualizer, talking about the history of the project from the early prototypes and models built at PROV, to the series of revisions of the product built in collaboration with me, and including the “Linked Data” infrastructure behind the visualization itself, and its prospects for further development and re-use.

You can access the PROVisualizer presention in PDF.

As always, I enjoyed Tim Sherratt‘s contribution: a keynote on redaction by ASIO (secret police) censors in the National Archives, called Turning the Inside Out.

The black marks are of course inserted by the ASIO censors in order to obscure and hide information, but Tim showed how it’s practicable to deconstruct the redactions’ role in the documents they obscure, and convert these voids, these absences, into positive signs in their own right; and that these signs can be utilized to discover politically sensitive texts, and zoom in precisely on the context that surrounds the censored details in each text. Also the censors made a lot of their redaction marks into cute little pictures of people and sailing ships, which raised a few laughs.

In the morning of the first day of talks, I got a kick out of Chris Hurley’s talk “Access to Archives (& Other Records) in the Digital Age”. His rejection of silos and purely hierarchical data models, and his vision of openness to, and accommodation of, small players in the archives space both really resonated with me, and I was pleased to be able to chat with him over coffee later in the afternoon about the history of this idea and about how things like Linked Data and the ICA’s “Records in Context” data model can help to realize it.

In the afternoon of the first day I was particularly struck by Ross Spencer‘s presentation about extracting metadata from full text resources. He spoke about using automation to identify the names of people, organisations, places, and so on, within the text of documents. For me this was particularly striking because I’d only just begun an almost identical project myself for the Australian Policy Online repository of policy documents. In fact it turned out we were using the same software (Apache Tika and the Stanford Named Entity Recognizer).

On the second day I was particularly struck by a few papers that were very close to my own interests. Nicole Kearney, from Museum Victoria, talked about her work coordinating the Biodiversity Heritage Library Australia.

This presentation was focused on getting value from the documentary heritage of museums; such things as field notes and diaries from scientific expeditions, by using the Atlas of Living Australia’s DigiVol transcription platform to allow volunteers to transcribe the text from digital images, and then publishing the text and images online using the BHL publication platform. In between there was slightly awkward part which involves Nicole converting from the CSV format produced by DigiVol into some more standard format for the BHL. I’ve had an interest in text transcription going back to slightly before the time I joined the New Zealand Electronic Text Centre at Victoria University of Wellington; this would’ve been about 2003, which seems like ancient times now.

After that I saw Val Love and Kirsty Cox talk about their journey in migrating the Alexander Turnbull Library‘s TAPUHI software to KE EMu. Impressive, given the historical complexity of TAPUHI, and the amount of data analysis required to make sense of its unique model, and to translate that into a more standard conceptual model, and to implement that model using EMu. It’s an immense milestone for the Turnbull, and I hope will lead in short order to the opening up of the collection metadata to greater reuse.

Finally I want to mention the talk “Missing Links: museum archives as evidence, context and content” from Mike Jones. This was another talk about breaking down barriers between collection management systems in museums: on the one hand, the museum’s collection of objects, and on the other, the institution’s archives. Of course those archives provide a great deal of context for the collection, but the reality is that the IT infrastructure and social organisation of these two systems is generally very distinct and separate. Mike’s talk was about integrating cultural heritage knowledge from different organisational structures, domains of professional expertise, different data models, and IT systems. I got a shout-out in one slide in the form of a reference to some experimental work I’d done with Museum Victoria’s web API, to convert it into a Linked Data service.

It’s my view that Linked Data technology offers a practical approach to resolving the complex data integration issues in cultural heritage: it is relatively easy to expose legacy systems, whatever they might be, in the form of Linked Data, and having done so, the task of integrating the data so exposed is also rather straight-forward (that’s what Linked Data was invented for, pretty much). To me the problem is how to sell this to an institution, in the sense that you have to offer the institution itself a “win” for undertaking the work. If it’s just that they can award themselves 5 gold stars for public service that’s not a great reason. You need to be able to deliver tangible value to museums themselves. This is where I think there’s a gap; in leveraging Linked Data to enhance exhibitions and also in-house collection management systems. If we can make it so that there’s value to institutions in creating and also consuming Linked Data, then we may be able to establish a virtuous circle to drive uptake of the technology, and see some progress in the integration of knowledge in the sector.

Linked Open Data Visualisation at #GLAMVR16

Conal — Tue, 30 Aug 2016 02:02:10 +0000

On Thursday last week I flew to Perth, in Western Australia, to speak at an event at Curtin University on visualisation of cultural heritage. Erik Champion, Professor of Cultural Visualisation, who organised the event, had asked me to talk about digital heritage collections and Linked Open Data (“LOD”).

The one-day event was entitled “GLAM VR: talks on Digital heritage, scholarly making & experiential media”, and combined presentations and workshops on cultural heritage data (GLAM = Galleries, Libraries, Archives, and Museums) with advanced visualisation technology (VR = Virtual Reality).

The venue was the Curtin HIVE (Hub for Immersive Visualisation and eResearch); a really impressive visualisation facility at Curtin University, with huge screens and panoramic and 3d displays.

There were about 50 people in attendance, and there would have been over a dozen different presenters, covering a lot of different topics, though with common threads linking them together. I really enjoyed the experience, and learned a lot. I won’t go into the detail of the other presentations, here, but quite a few people were live-tweeting, and I’ve collected most of the Twitter stream from the day into a Storify story, which is well worth a read and following up.

My presentation

For my part, I had 40 minutes to cover my topic. I’d been a bit concerned that my talk was more data-focused and contained nothing specifically about VR, but I think on the day the relevance was actually apparent.

The presentation slides are available here as a PDF: Linked Open Data Visualisation

My aims were:

At a tactical level, to explain the basics of Linked Data from a technical point of view (i.e. to answer the question “what is it?”); to show that it’s not as hard as it’s usually made out to be; and to inspire people to get started with generating it, consuming it, and visualising it.
At a strategic level, to make the case for using Linked Data as a basis for visualisation; that the discipline of adopting Linked Data technology is not at all a distraction from visualisation, but rather a powerful generic framework on top of which visualisations of various kinds can be more easily constructed, and given the kind of robustness that real scholarly work deserves.

Linked Data basics

I spent the first part of my talk explaining what Linked Open Data means; starting with “what is a graph?” and introducing RDF triples and Linked Data. Finally I showed a few simple SPARQL queries, without explaining SPARQL in any detail, but just to show the kinds of questions you can ask with a few lines of SPARQL code.

What is an RDF graph?

While I explained about graph data models, I saw attendees nodding, which I took as a sign of understanding and not that they were nodding off to sleep; it was still pretty early in the day for that.

One thing I hoped to get across in this part of the presentation was just that Linked Data is not all that hard to get into. Sure, it’s not a trivial technology, but barriers to entry are not that high; the basics of it are quite basic, so you can make a start and do plenty of useful things without having to know all the advanced stuff. For instance, there are a whole bunch of RDF serializations, but in fact you can get by with knowing only one. There are a zillion different ontologies, but again you only need to know the ontology you want to use, and you can do plenty of things without worrying about a formal ontology at all. I’d make the case for university eResearch agencies, software carpentry, and similar efforts, to be offering classes and basic support in this technology, especially in library and information science, and the humanities generally.

Linked Data as architecture

People often use the analogy of building, when talking about making software. We talk about a “build process”, “platforms”, and “architecture”, and so on. It’s not an exact analogy, but it is useful. Using that analogy, Linked Data provides a foundation that you can build a solid edifice on top of. If you skimp on the foundation, you may get started more quickly, but you will encounter problems later. If your project is small, and if it’s a temporary structure (a shack or bivouac), then architecture is not so important, and you can get away with skimping on foundations (and you probably should!), but the larger the project is (an office building), and the longer you want it to persist (a cathedral), the more valuable a good architecture will be. In the case of digital scholarly works, the common situation in academia is that weakly-architected works are being cranked out and published, but being hard to maintain, they tend to crumble away within a few years.

Crucially, a Linked Data dataset can capture the essence of what needs to be visualised, without being inextricably bound up with any particular genre of visualisation, or any particular visualisation software tool. This relative independence from specific tools is important because a dataset which is tied to a particular software platform needs to rely on the continued existence of that software, and experience shows that individual software packages come and go depressingly quickly. Often only a few years are enough for a software program to be “orphaned”, unavailable, obsolete, incompatible with the current software environment (e.g. requires Windows 95 or IE6), or even, in the case of software available online as a service, for it to completely disappear into thin air, if the service provider goes bust or shuts down the service for reasons of their own. In these cases you can suddenly realise you’ve been building your “scholarly output” on sand.

By contrast, a Linked Data dataset is standardised, and it’s readable with a variety of tools that support that standard. That provides you with a lot of options for how you could go on to visualise the data; that generic foundation gives you the possibility of building (and rebuilding) all kinds of different things on top of it.

Because of its generic nature and its openness to the Web, Linked Data technology has become a broad software ecosystem which already has a lot of people’s data riding on it; that kind of mass investment (a “bandwagon”, if you like) is insurance against it being wiped out by the whims or vicissitudes of individual businesses. That’s the major reason why a Linked Data dataset can be archived and stored long term with confidence.

Linked Open Data is about sharing your data for reuse

Finally, by publishing your dataset as Linked Open Data (independently of any visualisations you may have made of it), you are opening it up to reuse not only by yourself, but by others.

The graph model allows you to describe the meaning of the terms you’ve used (i.e. the analytical categories used in your data can themselves be described and categorised, because everything is a node in a graph). This means that other people can work out what your dataset actually means.

The use of URIs for identifiers means that others can easily cite your work and effectively contribute to your work by creating their own annotations on it. They don’t need to impinge on your work; their annotations can live somewhere else altogether and merely refer to nodes in your graph by those nodes’ identifiers (URIs). They can comment; they can add cross-references; they can assert equivalences to nodes in other graphs, elsewhere. Your scholarly work can break out of its box, to become part of an open web of knowledge that grows and ramifies and enriches us all.

Visualizing Government Archives through Linked Data

Conal — Tue, 05 Apr 2016 13:41:00 +0000

Tonight I’m knocking back a gin and tonic to celebrate finishing a piece of software development for my client the Public Record Office Victoria; the archives of the government of the Australian state of Victoria.

The work, which will go live in a couple of weeks, was an update to a browser-based visualization tool which we first set up last year. In response to user testing, we made some changes to improve the visualization’s usability. It certainly looks a lot clearer than it did, and the addition of some online help makes it a bit more accessible for first-time users.

The visualization now looks like this (here showing the entire dataset, unfiltered, which is not actually that useful, though it is quite pretty):

The bulk of the work, though, was to automate the preparation of data for the visualization.

Up until now, the dataset which you could visualize consisted of a couple of CSV files, manually assembled with considerable care and effort from reports exported from PROV’s repository “Archives One”. In the new system, this manual work will not need to be repeated. Instead, the same dataset will be assembled by an automated metadata-processing pipeline which will keep it continually up to date as government agencies and functions change over time.

It was not as big as job as you might think, since in fact a lot of the work to generate the data had already been done.

PROV’s Interoperable Data service

In 2012, in collaboration with their counterpart agency State Records New South Wales, PROV had set up an Interoperable Data publishing service with funding from the Australian National Data Service. They custom-built some software to export data from Archives One to produce a set of metadata records in RIF-CS format, and they deployed an off-the-shelf software application (an “OAI-PMH Repository”) to disseminate those metadata records over the web.

Originally, the OAI-PMH repository was serving data to the Australian National Data Service, which runs an aggregation service called Research Data Australia, which offers researchers pointers to all manner of scientific, historical and cultural datasets. The PROV metadata, covering the full history of government records in Victoria, is a useful resource for social science researchers, genealogists, historians, and others.

More recently, PROV’s OAI-PMH repository has also been harvested by the National Library of Australia’s Trove service.

Now at last it will be harvested by the Public Record Office itself.

The data pipeline

The software I’ve written consists of a web application which I wrote using a programming language for data pipelines called XProc. The software itself is open source and available on GitHub in a repository with the ludicrously acronymous title PROV-RIF-SPARQL.

This XProc application tediously harvests the metadata records (there are more than 30000 of them) and converts each one from RIF-CS format into RDF/XML format. The RDF/XML data is a reformulation of the RIF-CS in which the hierarchical structures of the RIF-CS are re-expressed as a network of interconnected statements; a kind of web of nodes and links which mathematicians call a “graph”. The statements in these graphs are expressed using the international standard conceptual framework for cultural heritage data; the CIDOC-CRM. My harvester then stores all these RDF/XML documents (or “graphs”) in a SPARQL Graph Store (a kind of hybrid document store and database). The SPARQL Graph Store allows each graph to be addressed individually, but also for the entire dataset to be treated as a single graph, and queried as a whole. Finally, the RDF dataset is queried to produce the two summarised data files which the visualization itself requires; these are simple spreadsheets in CSV (Comma Separated Values) format. One table contains information about each government agency or function, and the other table lists the relationships which have historically existed between those agencies and functions.

The harvester has a basic user interface where you can start a data harvest; a process that takes about half an hour to complete. In this interface you can specify the location of the OAI-PMH server you want to harvest data from, the format of the data you want to harvest, and the location of the SPARQL Graph Store where you want to store the result, amongst other parameters. In practice, this user interface isn’t used by a human (except during testing); another small program running on a regular schedule makes the request.

At this stage of the project, the RDF graph is only used internally to PROV, where it functions purely as an intermediate between the RIF-CS input and the CSV output. The RDF data and the SPARQL database together just provide a convenient way to aggregate a big set of records and query the resulting aggregation. But later I have no doubt that the RDF data will be published directly as Linked Open Data, opening it up, and allowing it to be connected into a world-wide web of data.

Taking control of an uncontrolled vocabulary

Conal — Mon, 16 Nov 2015 13:49:30 +0000

A couple of days ago, Dan McCreary tweeted:

Working on new ideas for NoSQL metadata management for a talk next week. Focus on #NoSQL, Documents, Graphs and #SKOS. Any suggestions?

— Dan McCreary (@dmccreary) November 14, 2015

It reminded me of some work I had done a couple of years ago for a project which was at the time based on Linked Data, but which later switched away from that platform, leaving various bits of RDF-based work orphaned.

One particular piece which sprung to mind was a tool for dealing with vocabularies. Whether it’s useful for Dan’s talk I don’t know, but I thought I would dig it out and blog a little about it in case it’s of interest more generally to people working in Linked Open Data in Libraries, Archives and Museums (LODLAM).

I told Dan:

@dmccreary i did a thing once with an xform using a sparql query to assemble a skos concept scheme, edit it, save in own graph. Of interest?

— Unholy Taco (@conal_tuohy) November 14, 2015

When he sounded interested, I made a promise:

@dmccreary i have the code somewhere. .. will dig it out

— Unholy Taco (@conal_tuohy) November 14, 2015

I know I should find a better home for this and the other orphaned LODLAM components, but for now, the original code can be seen here:

https://github.com/Conal-Tuohy/huni/tree/master/data-store/www/skos-tools/vocabulary.xml

I’ll explain briefly how it works, but first, I think it’s necessary to explain the rationale for the vocabulary tool, and for that you need to see how it fits into the LODLAM environment.

At the moment there is a big push in the cultural sector towards moving data from legacy information systems into the “Linked Open Data (LOD) Cloud” – i.e. republishing the existing datasets as web-based sets of inter-linked data. In some cases people are actually migrating from their old infrastructure, but more commonly people are adding LOD capability to existing systems via some kind of API (this is a good approach, to my way of thinking – it reduces the cost and effort involved enormously). Either way, you have to be able to take your existing data and re-express it in terms of Linked Data, and that means facing up to some challenges, one of which is how to manage “vocabularies”.

Vocabularies, controlled and uncontrolled

What are “vocabularies” in this context? A “vocabulary” is a set of descriptive terms which can be applied to a record in a collection management system. For instance, a museum collection management system might have a record for a teacup, and the record could have a number of fields such as “type”, “maker”, “pattern”, “colour”, etc. The value of the “type” field would be “teacup”, for instance, but another piece in the collection might have the value “saucer” or “gravy boat” or what have you. These terms, “teacup”, “plate”, “dinner plate”, “saucer”, “gravy boat” etc, constitute a vocabulary.

In some cases, this set of terms is predefined in a formal list, This is called a “controlled vocabulary”. Usually each term has a description or definition (a “scope note”), and if there are links to other related terms (e.g. “dinner plate” is a “narrower term” of “plate”), as well as synonyms, including in other languages (“taza”, “plato”, etc) then the controlled vocabulary is called a thesaurus. A thesaurus or a controlled vocabulary can be a handy guide to finding things. You can navigate your way around a thesaurus, from one term to another, to find related classes of object which have been described with those terms, or the thesaurus can be used to automatically expand your search queries without you having to do anything; you can search for all items tagged as “plate” and the system will automatically also search for items tagged “dinner plate” or “bread plate”.

In other cases, though, these vocabularies are uncontrolled. They are just tags that people have entered in a database, and they may be consistent or inconsistent, depending on who did the data entry and why. An uncontrolled vocabulary is not so useful. If the vocabulary includes the terms “tea cup”, “teacup”, “Tea Cup”, etc. as distinct terms, then it’s not going to help people to find things because those synonyms aren’t linked together. If it includes terms like “Stirrup Cup” it’s going to be less than perfectly useful because most people don’t know what a Stirrup Cup is (it is a kind of cup).

The vocabulary tool

So one of the challenges in moving to a Linked Data environment is taking the legacy vocabularies which our systems use, and bringing them under control; linking synonyms and related terms together, providing definitions, and so on. This is where my vocabulary tool would come in.

In the Linked Data world, vocabularies are commonly modelled using a system called Simple Knowledge Organization System (SKOS). Using SKOS, every term (a “Concept” in SKOS) is identified by a unique URI, and these URIs are then associated with labels (such as “teacup”), definitions, and with other related Concepts.

The vocabulary tool is built with the assumption that a legacy vocabulary of terms has been migrated to RDF form by converting every one of the terms into a URI, simply by sticking a common prefix on it, and if necessary “munging” the text to replace, or encode spaces or other characters which aren’t allowed in URIs. For example, this might produce a bunch of URIs like this:

http://example.com/tableware/teacup
http://example.com/tableware/gravy_boat
etc.

What the tool then does is it finds all these URIs and gives you a web form which you can fill in to describe them and link them together. To be honest I’m not sure how far I got with this tool, but ultimately the idea would be that you would be able to organise the terms into a hierarchy, link synonyms, standardise inconsistencies by indicating “preferred” and “non-preferred” terms (i.e. you could say that “teacup” is preferred, and that “Tea Cup” is a non-preferred equivalent).

When you start the tool, you have the opportunity to enter a “base URI”, which in this case would be http://example.com/tableware/ – the tool would then find every such URI which was in use, and display them on the form for you to annotate. When you had finished imposing a bit of order on the vocabulary, you would click “Save” and your annotations would be stored in an RDF graph whose name was http://example.com/tableware/. Later, your legacy system might introduce more terms, and your Linked Data store would have some new URIs with that prefix. You would start up the form again, enter the base URI, and load all the URIs again. All your old annotations would also be loaded, and you would see the gaps where there were terms that hadn’t been dealt with; you could go and edit the definitions and click “Save” again.

In short, the idea of the tool was to be able to use, and to continue to use, legacy systems which lack controlled vocabularies, and actually impose control over those vocabularies after converting them to LOD.

How it works

OK here’s the technical bit.

The form is built using XForms technology, and I coded it to use a browser-based (i.e. Javascript) implementation of XForms called XSLTForms.

When the XForm loads, you can enter the common base URI of your vocabulary into a text box labelled “Concept Scheme URI”, and click the “Load” button. When the button is clicked, the vocabulary URI is substituted into a pre-written SPARQL query and sent off to a SPARQL server. This SPARQL query is the tricky part of the whole system really: it finds all the URIs, and it loads any labels which you might have already assigned them, and if any don’t have labels, it generates one by converting the last part of the URI back into plain text.


prefix skos: 

construct {
   ?vocabulary a skos:ConceptScheme ;
      skos:prefLabel ?vocabularyLabel.
   ?term a skos:Concept ;
      skos:inScheme ?vocabulary ;
      skos:prefLabel ?prefLabel .
      ?subject ?predicate ?object .
} where {
   bind(<> as ?vocabulary)
   {
      optional {?vocabulary skos:prefLabel ?existingVocabularyLabel}
      bind("Vocabulary Name" as ?vocabularyLabel)
      filter(!bound(?existingVocabularyLabel))
   } union {
      ?subject ?predicate ?term .
      bind(
         replace(substr(str(?term), strlen(str(?vocabulary)) + 1), "_", " ") as ?prefLabel
      )
      optional {?term skos:prefLabel ?existingPrefLabel}
      filter(!bound(?existingPrefLabel))
      filter(strstarts(str(?term), str(?vocabulary)))
      filter(?term != ?vocabulary)
   } union {
      graph ?vocabulary {
         ?subject ?predicate ?object
      }
   }
}

The resulting list of terms and labels is loaded into the form as a “data instance”, and the form automatically grows to provide data entry fields for all the terms in the instance. When you click the “Save” button, the entire vocabulary, including any labels you’ve entered, is saved back to the server.

Bridging the conceptual gap: Museum Victoria’s collections API and the CIDOC Conceptual Reference Model

Conal — Wed, 21 Oct 2015 14:44:33 +0000

A Museum Victoria LOD graph about a teacup, shown using the LODLive visualizer.

This is the third in a series of posts about an experimental Linked Open Data (LOD) publication based on the web API of Museum Victoria.

The first post gave an introduction and overview of the architecture of the publication software, and the second dealt quite specifically with how names and identifiers work in the LOD publication software.

In this post I’ll cover how the publication software takes the data published by Museum Victoria’s API and reshapes it to fit a common conceptual model for museum data, the “Conceptual Reference Model” published by the documentation committee of the Internal Council of Museums. I’m not going to exhaustively describe the translation process (you can read the source code if you want the full story), but I’ll include examples to illustrate the typical issues that arise in such a translation.

The CIDOC Conceptual Reference Model

The CIDOC CRM, as it’s usually called, is a system of concepts for analysing and describing the content of museum collections. It is not intended to be a replacement for the Collection Management Systems which museums use to store their data; it is rather intended to function as a kind of lingua franca, through which content from a variety of systems can be expressed in a generally intelligible way.

The Conceptual Reference Model covers a wide range of museological concerns: items can be described in terms of their materials and mode of construction, as well as by who made them, where and when, and for what purpose.

The CRM also provides a framework to describe the events in which objects are broken into pieces, or joined to other objects, damaged or repaired, created or utterly destroyed. Objects can be described in terms of the symbolic and intellectual content which they embody, which are themselves treated as “intellectual objects”. The lineage of intellectual influence can be described, either speculatively, in a high-level way, or by explicitly tracing and documenting the influences that were known have taken place at particular times and locations. The legal history of objects can also be traced through transfer of ownership and custody, commission, sale and purchase, theft and looting, loss and discovery. Where the people involved in these histories are known, they too can be named and described and their histories interwoven with those of other people, objects, and ideas.

Core concepts and additional classification schemes

The CRM framework is quite high level. Only a fairly small number of very general types of thing are defined in the CRM; only concepts general enough to be useful for any kind of museum; whether a museum of computer games or of classical antiquity. Each of these concepts is identified by an alphanumeric code and an English-language name. In addition, the CRM framework allows for arbitrary typologies to be added on, to be used for further classifying pretty much anything. This is to allow all the terms from any classification system used in a museum to be exported directly into a CRM-based dataset, simply by describing each term as an “E55 Type". In short, the CRM consists of a fairly fixed common core, supplemented by a potentially infinite number of custom vocabularies which can be used to make fine distinctions of whatever kind are needed.

Therefore, a dataset based on the CRM will generally be directly comparable with another dataset only in terms of the core CRM-defined entities. The different classification schemes used by different datasets remain “local” vocabularies. To achieve full interoperability between datasets, these distinct typologies would additionally need to be aligned, by defining a “mapping” table which lists the equivalences or inequivalences between the terms in the two vocabularies. For instance, such a table might say that the term “moulded” used in Museum Victoria’s collection is more or less the same classification as “molding (forming)” in the Getty Art and Architecture thesaurus.

Change happens through “events”

To model how things change through time, the CRM uses the notion of an “event”. The production of a physical object, for instance, is modelled as an E12 Production event (NB concepts in the CRM are all identified by an alphanumeric code). This production event is linked to the object which it produced, as well as to the person or persons who played particular creative roles in that event. The event may also have a date and place associated with it, and may be linked to the materials and to the method used in the event.

On a somewhat philosophical note, this focus on discrete events is justified by the fact that not all of history is continuously documented, and we necessarily have a fragmentary knowledge of the history of any given object. Often a museum object will have passed through many hands, or will have been modified many times, and not all of this history is known in any detail. If we know that person A created an object for person B, and that X years later the object turned up in the hands of person C, we can’t assume that the object remained in person B’s hands all those X years. A data model which treated “ownership” as a property of an object would be liable to making such inflated claims to knowledge which is simply not there. Person C may have acquired it at any point during that period, and indeed there may have been many owners in between person B and person C. This is why it makes sense to document an object’s history in terms of the particular events which are known and attested to.

Museum Victoria’s API

How does Museum Victoria’s data fit in terms of the CIDOC model?

In general the model works pretty well for Museum Victoria, though there are also things in MV’s data which are not so easy to express in the CRM.

Items

Museum Victoria describes items as “Things made and used by people”. These correspond exactly to the notion of E22 Man-Made Object in the CIDOC CRM (if you can excuse the sexist language), described as comprising “physical objects purposely created by human activity.”

Every MV item is therefore expressed as an E22 Man-Made Object.

Titles

Museum Victoria’s objects have an objectName property which is a simple piece of text; a name or title. In the CIDOC CRM, the name of an object is something more complex; it’s an entity in its own right, called an E41 Appellation. The reason why a name is treated as more than just a simple property of an object is that in the CRM, it must be possible to treat an object’s name as an historical phenomenon; after all, it will have been applied to an object by a particular person (the person who created the object, perhaps, or an archaeologist who dug it out of the ground, or a curator or historian), at some historical point in time. An object may have a number of different names, each given it by different people, and used by different people at different times.

However, because the Museum Victoria names are simple (a single label) we can ignore most of that complexity. We only need to define an E41 Appellation whose value is the name, and link the E41 Appellation to the E22 Man-Made Object using a P1 is identified by association.

Articles, Items and their relationships

The MV API provides access to a number of “articles” which are documents related to the Museum’s collection. For each article, the API shows a list of the related collection items; and for each item, you can get the corresponding list of related articles. Although the exact nature of the relationship isn’t made explicit, it’s reasonable to assume that an item is in some way documented by the articles that are related to it. In the CIDOC CRM, such an article is considered an E31 Document, and it bears a P70 documents relationship to the item which it’s about.

If the relationship between an item and article is easy to guess, there are a couple of other relationships which are a little less obvious: an article also has a list of related articles, and each item also has a list of related items. What is that nature of those relationships? In what way exactly does article X relate to article Y, or item A to item B? The MV API’s documentation doesn’t say, and it wouldn’t surprise me if the Museum’s collection management system leaves this question up to the curators’ judgement.

A bit of empirical research seemed called for. I checked a few of the related items and the examples I found seemed to fall into two categories:

One item is a photograph depicting another item (the specific relationship here is really “depicts”)
Two items are both photographs of the same subject (the relationship is “has the same subject as”).

Obviously there are two different kinds of relationship here in the Museum’s collection, both of them presented (through the API) in the same way. As a human, I can tell them apart, but my proxy software is not going to be able to. So I need to find a more general relationship which subsumes both the relationships above, and fortunately, the CIDOC CRM includes such a relationship, namely P130 shows features of.

This property generalises the notions of “copy of” and “similar to” into a dynamic, asymmetric relationship, where the domain expresses the derivative, if such a direction can be established.
Otherwise, the relationship is symmetric. It is a shortcut of P15 was influenced by (influenced) in a creation or production, if such a reason for the similarity can be verified. Moreover it expresses similarity in cases that can be stated between two objects only, without historical knowledge about its reasons.

For example, I have a photograph of a piece of computer hardware (which is the relatedItem), and the photo is therefore a kind of derivative of the hardware (though the Museum Victoria API doesn’t tell me which of the objects was the original and which the derivative). In another example I have two photos of the same house; here there’s a similarity which is not due to one of the photos being derived from the other.

Ideally, it would be preferable to be able to represent these kinds of relationships more precisely; for instance, in the case of the two photos of the house, one could generate a resource that denotes the actual physical house itself, and link that to the photographs, but because the underlying data doesn’t include this information in a machine-readable form, the best we can do is to say that the two photos are similar.

Production techniques

Some of the items in the Museum’s collection are recorded as having been produced using a certain “technique”. For instance, archaeological artefacts in the MV collection have a property called archeologyTechnique, which contains the name of a general technique, such as moulded, in the case of certain ceramic items.

This corresponds to the CRM concept P32 used general technique, which is described like so:

This property identifies the technique or method that was employed in an activity.
These techniques should be drawn from an external E55 Type hierarchy of consistent terminology of
general techniques or methods such as embroidery, oil-painting, carbon dating, etc.

Note that in CIDOC this “general technique” used to manufacture an object is not a property of the object iself; it’s a property of the activity which produced the object (i.e. the whole process in which the potter pressed clay into a mould, glazed the cup, and fired it in a kiln).

Note also that, for the CIDOC CRM, the production technique used in making these tea-cups is not the text string “moulded”; it is actually an abstract concept identified by a URI. The string “moulded” is just a human-readable name attached as a property of that concept. That same concept might very well have a number of other names in other languages, or even in English there’s the American variant spelling “molded”, and synonyms such as “cast” that could all be alternative names for the same concept.

Translating a Museum Victoria item with a technique into the CRM therefore involves identifying three entities:

the object itself (an E22 Man-Made Object);
the production of the object (an E12 Production activity);
the technique used in the course of that activity to produce the object (an E55 Type of technique)

These three entities are then linked together:

The production event “P32 used general technique" of the technique; and
The production event [edit: ~~"P94 has created"~~] "P108 has produced" the object itself.

Notes

The items, articles, specimens and species in the Museum’s API are all already first-class objects and can be easily represented as concepts in Linked Data. The archeologyTechnique field also has a fairly restricted range of values, and each of those values (such as “moulded”) can be represented as a Linked Data concept as well. But there are a number of other fields in the Museum’s API which are in the form of relatively long pieces of descriptive text. For example, an object’s objectSummary field contains a long piece of text which describes the object in context. For example, here’s the objectSummary of one our moulded tea cups:

This reconstructed cup was excavated at the Commonwealth Block site between 1988 and 2003. There is a matching saucer that was found with it. The pattern is known as 'Moss Rose' and was made between 1850 and 1851 by Charles Meigh, Son & Pankhurst in Hanley, Staffordshire, England.

Homewares. Numerous crockery pieces were found all over the Little Lon site. Crockery gives us a glimpse of everyday life in Melbourne in the 1880s. In the houses around Little Lon, residents used decorated crockery. Most pieces were cheap earthenware or stoneware, yet provided colour and cheer. Only a few could afford to buy matching sets, and most china was probably acquired second-hand. Some were once expensive pieces. Householders mixed and matched their crockery from the great range of mass-produced designs available. 'Blue and white' and the 'willow' pattern, was the most popular choice and was produced by English potteries from 1790.

It’s not quite as long as an “article” but it’s not far off it. Another textual property is called physicalDescription, and has a narrower focus on the physical nature of the item:

This is a glazed earthenware teacup which has been reconstructed. It is decorated with a blue or black vine and leaf design around outside and inside of the cup which is known as 'Moss Rose' pattern.

The CIDOC CRM does include concepts related to the historical context and the physical nature of items, but it’s not at all easy to extract that detailed information from the descriptive prose of these, and similar fields. Because the information is stored in a long narrative form, it can’t be easily mapped to the denser data structure of a Linked Data graph. The best we can hope to do with these fields is to treat them as notes attached to the item.

The CIDOC CRM includes a concept for attaching a note: P3 has note. But to represent these two different types of note, it’s necessary to extend the CRM by creating two new, specialized versions (“sub-properties”) of the property called P3 has note, which I’ve called P3.1 objectSummary and P3.1 physicalDescription.

Summary

It’s possible to recognise three distinct patterns in mapping an API such as Museum Victoria’s to a Linked Data model like the CIDOC CRM.

Where the API provides access to a set of complex data objects of a particular type, these can
be mapped straight-forwardly to a corresponding class of Linked Data resources (e.g. the items, species, specimens, and articles in MV’s API).
Where the API exposes a simple data property, it can be straightforwardly converted to a Linked Data property (e.g. the two types of notes, in the example above).
Where the API exposes a simple data property whose values come from a fairly limited range (a “vocabulary”), then those individual property values can be assigned identifiers of their own, and effectively promoted from simple data properties to full-blown object properties (e.g. the production techniques in Museum Victoria’s API).

Conclusion

It’s been an interesting experiment, to generate Linked Open Data from an open API using a simple proxy: I think it shows that the technique is a very viable mechanism for institutions to break into the LOD cloud and contribute their collection in a standardised manner, without necessarily having to make any changes to their existing systems or invest in substantial software development work. To my mind, making that first step is a significant barrier that holds institutions and individuals back from realising the potential in their data. Once you have a system for publishing LOD, you are opening up a world of possibilities for external developers, data aggregators, and humanities researchers, and if your data is of interest to those external groups, you have the possibility of generating some significant returns on your investment, and the possibility of “harvesting” some of that work back into your institution’s own web presence in the form of better visualizations, discovery interfaces, and better understanding of your own collection.

Before the end of the year I hope to explore some further possibilities in the area of user interfaces based on Linked Data, to show some of the value that these Linked Data publishing systems can support.

Names in the Museum

Conal — Thu, 01 Oct 2015 04:56:14 +0000

My last blog post described an experimental Linked Open Data service I created, underpinned by Museum Victoria’s collection API. Mainly, I described the LOD service’s general framework, and explained how it worked in terms of data flow.

To recap briefly, the LOD service receives a request from a browser and in turn translates that request into one or more requests to the Museum Victoria API, interprets the result in terms of the CIDOC CRM, and returns the result to the browser. The LOD service does not have any data storage of its own; it’s purely an intermediary or proxy, like one of those real-time interpreters at the United Nations. I call this technique a “Linked Data proxy”.

I have a couple more blog posts to write about the experience. In this post, I’m going to write about how the Linked Data proxy deals with the issue of naming the various things which the Museum’s database contains.

Using Uniform Resource Identifiers (URIs) as names

Names are a central issue in any Linked Data system; anything of interest must be named with an HTTP URI; every piece of information which is recorded about a thing is attached to this name, and crucially, because these names are HTTP URIs, they can (in fact in a Linked Data system, they must) also serve as a means to obtain information about the thing.

In a nutshell there are three main tasks the Linked Data proxy has to be able to perform:

When it receives an HTTP request, it has to recognise the HTTP URI as an identifier that identifies a particular individual belonging to some general type: an artefact; a species; a manufacturing technique; etc.
Having recognised as some sort of name, it has to be able to look up and retrieve information about the particular individual which it identifies.
Having found some information about the named thing, it has to convert that information into RDF (the language of Linked Data), in the process converting any identifiers it has found into the kind of HTTP URIs it can recognise in future. A Linked Open Data client is going to want to use those identifiers to make further requests, so they have to match the kind of identifiers the LOD service can recognise (in step 1 above).

Recognising various HTTP URIs as identifiers for things in Museum Victoria’s collection

Let’s look at the task of recognising URIs as names first.

The Linked Data Proxy distinguishes between URIs that name different types of things by recognising different prefixes in the URIs. For instance, a URI beginning with http://conaltuohy.com/xproc-z/museum-victoria/resource/item/ will identity a particular item in the collection, whereas a URI beginning with http://conaltuohy.com/xproc-z/museum-victoria/resource/technique/ will identify some particular technique used in the manufacture of an item.

The four central entities of Museum Victoria’s API

The Museum Victoria API is organised around four main types of entity:

items
specimens
species
articles

The LOD service handles all four very similarly: since the MV API provides an identifier for every item, specimen, species, or article, the LOD service can generate a linked data identifier for each one just by sticking a prefix on the front. For example, the item which Museum Victoria identifies with the identifier items/1221868 can be identified with the Linked Data identifier http://conaltuohy.com/xproc-z/museum-victoria/resource/items/1221868 just by sticking http://conaltuohy.com/xproc-z/museum-victoria/resource/ in front of it, and a document about that item can be identified by http://conaltuohy.com/xproc-z/museum-victoria/data/items/1221868.

Secondary entities

So far so straightforward, but apart from these four main entity types, there are a number of things of interest which the Museum Victoria API deals with in a secondary way.

For example, the MV database includes information on how many of the artefacts in the collection were manufactured, in a field called technique. For instance, many ceramic items (e.g. teacups) in their collection were created from a mould, and have the value moulded in their technique field. The tricky thing here is that the techniques are not “first-class” entities like items. Instead, a technique is just a textual attribute of an item. This is a common enough situation in legacy data systems: the focus of the system is on what it sees as a “core” entity (a museum item in this case), which have their own identifiers and a bunch of properties hanging off them. Those properties are very much second-class citizens in the data model, and are often just textual labels. A number of items might share a common value for their technique field, but that common value is not stored anywhere except in the technique field of those items; it has no existence independent of those items.

In Linked Data systems, by contrast, such common values should be treated as first-class citizens, with their own identifiers, and with links that connect each technique to the items which were manufactured using that technique.

What is the LOD service to do? When expressing a technique as a URI, it can simply use the technique’s name itself (“moulded”) as part of the identifier, like so:

http://conaltuohy.com/xproc-z/museum-victoria/resource/technique/moulded

Then when the LOD service is responding to a request for a URI like the above, it can pull that prefix off and have the original world “moulded” back.

At this point the LOD service needs to be able to provide some information about the moulded technique. Because the technique is not a first-class object in the underlying collection database, there’s not much that can be said about it, apart from its name, obviously, which is “moulded”. All that the LOD service really knows about a given technique is that a certain set of items were produced using that technique, and it can retrieve that list using the Museum Victoria search API. The search API allows for searching by a number of different fields, including technique, so the Linked Data service can take the last component of the request URI it has received (“moulded”) and pass that to the search API, like so:

http://collections.museumvictoria.com.au/api/search/search?limit=100&technique=moulded

The result of the search is a list of items produced with the given technique, which the LOD service simply reformats into an RDF representation. As part of that conversion, the identifiers of the various moulded items in the results list (e.g. items/1280928) are converted into HTTP URIs simply by sticking the LOD service’s base URI on the front of them, e.g.

http://conaltuohy.com/xproc-z/museum-victoria/resource/items/1280928

External links

Tim Berners-Lee, the inventor of the World Wide Web, in an addendum to his “philosophical” post about Linked Data, suggested a “5-star” rating scheme for Linked Open Data, in which the fifth star requires that a dataset “link … to other people’s data to provide context”. Since the Museum Victoria data doesn’t include external links, it is tricky to earn this final star, but there is a way, based on MV’s use of the standard taxonomic naming system used in biology. Since many of MV’s items are biological specimens, we can use their taxonomic names to establish links to external sources which also use the same taxonomic names. For this example, I chose to link to biological data in Wikipedia, which, unknown to many people, also publishes a large dataset of Linked Open Data derived from the Wikipedia pages, including a lot of biological taxa. To establisha link to DBpedia, the LOD service takes the Museum’s taxonName field and inserts it into a SPARQL query, which it sends to DBpedia, essentially asking “do you have anything on file which has this binomial name?

select distinct ?species where { ?species dbp:binomial "{MV's taxon name goes here}"@en}

The result of the query is either a “no” or it’s a link to the species in Wikipedia’s database, which the LOD service can then republish.

coming up…

My next post in the series will look at some issues of how the Museum’s data relates to the CIDOC CRM model; where it matches neatly, and where it’s either more, or less specific than the CRM.

Linked Open Data built from a custom web API

Conal — Mon, 07 Sep 2015 08:51:37 +0000

I’ve spent a bit of time just recently poking at the new Web API of Museum Victoria Collections, and making a Linked Open Data service based on their API.

I’m writing this up as an example of one way — a relatively easy way — to publish Linked Data off the back of some existing API. I hope that some other libraries, archives, and museums with their own API will adopt this approach and start publishing their data in a standard Linked Data style, so it can be linked up with the wider web of data.

Two ways to skin a cat

There are two basic ways you can take an existing API and turn it into a Linked Data service.

One way is the “metadata aggregator” approach. In this approach, an “aggregator” periodically downloads (or “harvests”) the data from the API, in bulk, converts it all into RDF, and stores it in a triple-store. Then another application — a Linked Data publishing service — can read that aggregated data from the triple store using a SPARQL query and expose it in Linked Data form. The tricky part here is that you have to create and maintain your own copy (a cache) of all the data which the API provides. You run the risk that if the source data changes, then your cache is out of date. You need to schedule regular harvests to be sure that the copy you have is as up to date as you need it to be. You have to hope that the API can tell you which particular records have changed or been deleted, otherwise, you may have to download every piece of data just to be sure.

But this blog post is about another way which is much simpler: the “proxy” approach. A Linked Data proxy is a web application that receives a request for Linked Data, and in order to satisfy that request, makes one or more requests of its own to the API, processing the results it receives, and formatting them as Linked Data, which it then returns. The advantage of this approach is that every response to a request for Linked Data is freshly prepared. There is no need to maintain a cache of the data. There is no need for harvesting or scheduling. It’s simply a translator that sits in front of the API and translates what it says into Linked Data.

This is an approach I’ve been meaning to try out for a fair while, and in fact I gave a very brief presentation on the proxy idea at the recent LODLAM Summit in Sydney. All I needed was a good API to try it out with.

Museum Victoria Collection API

The Museum Victoria Collection API was announced on Twitter by Ely Wallis on August 25th:

The cat's out of the bag – @museumvictoria has a new collections online. We hope you like it! http://t.co/c0oNTWDk5t

— Ely Wallis (@elyw) August 25, 2015

Well as it happened I did like it, so I got in touch. Since it’s so new, the API’s documentation is a bit sparse, but I did get some helpful advice from the author of the API, Museum Victoria’s own Michael Mason, including details of how to perform searches, and useful hints about the data structures which the API provides.

In a nutshell, the Museum Victoria API provides access to data about four different sorts of things:

Items (artefacts in the Museum’s collections),
Specimens (biological specimens),
Species (which the specimens belong to), and
Articles (which document the other things)

There’s also a search API with which you can search within all of those categories.

Armed with this knowledge, I used my trusty XProc-Z proxy software to build a Linked Data proxy to that API.

Linked Data

Linked Data is a technique for publishing information on the web in a common, machine-understandable way.

The central principle of Linked Data is that all items of interest are identified with an HTTP URI (“Uniform Resource Identifier”). And “Resource” here doesn’t just mean web pages or other electronic resources; anything at all can be a “Resource”: people, physical objects, species of animal, days of the week … anything. If you take one of these URIs and put it into your browser, it will deliver you up some information which relates to the thing identified by the URI.

Because of course you can’t download a person or a species of animal, there is a special trick to this: if you send a request for a URI which identifies one of these “non-information resources”, such as a person, the server can’t simply respond by sending you an information resource (after all, you asked for a person, not a document). Instead it responds by saying “see also” and referring you to a different URL. This is basically saying “since you can’t download the resource you asked for (because it’s not an information resource), here is the URL of an information resource which is relevant to your request”. Then when your browser makes a request from that second URL, it receives an information resource. This is why, when browsing Linked Data, you will sometimes see the URI in your browser’s address bar change: first it makes a request for one URI and then is automatically redirected to another.

That information also needs to be encoded as RDF (“Resource Description Framework”). The RDF document you receive from a Linked Data server consists of a set of statements (called “triples”) about various things,including the original “resource” which your original URI identified, but usually also other things as well. Those statements assign various properties to the resources, and also link them to other resources. Since those other resources are also identified by URIs, you can follow those links and retrieve information about those related resources, and resources that are related to them, and so on.

Linked Data URIs as proxies for Museum Victoria identifiers

So one of the main tasks of the Linked Data proxy is to take any identifiers which it retrieves from the Museum Victoria API, and convert them into full HTTP URIs. That’s pretty easy; it’s just a matter of adding a prefix like “http://example/something/something/”. When the proxy receives a request for one of those URIs, it has to be able to turn it back into the form that Museum Victoria’s API uses. That basically involves trimming the prefix back off. Because many of the things identified in the Museum’s API are not information resources (many are physical objects), the proxy makes up two different URIs, one to denote the thing itself, and one to refer to the information about the thing.

The conceptual model (“ontology”)

The job of the proxy is to publish the data in a standard Linked Data vocabulary. There was an obvious choice here; the well-known museum ontology (and ISO standard) with the endearing name “CIDOC-CRM”. This is the Conceptual Reference Model produced by the International Committee for Documentation (CIDOC) of the International Council of Museums. This abstract model is published as an OWL ontology (a form that can be directly used in a Linked Data system) by a joint working group of computer scientists and museologists in Germany.

This Conceptual Reference Model defines terms for things such as physical objects, names, types, documents, and images, and also for relationships such as “being documented in”, or “having a type”, or “being an image of”. The proxy’s job is to translate the terminology used in Museum Victoria’s API into the terms defined in the CIDOC-CRM. Unsurprisingly, much of that translation is pretty easy, because there are long-standing principles in the museum world about how to organise collection information, and both the Museum Victoria API and the CIDOC-CRM are aligned to those principles.

As it happened I already knew the CIDOC-CRM model pretty well, which was one reason why a museum API was an attractive subject for this exercise.

Progress and prospects

At this stage I haven’t yet translated all the information which the Museum’s API provides; most of the details are still simply ignored. But already the translation does include titles and types, as well as descriptions, and many of the important relationships between resources (it wouldn’t be Linked Data without links!). I still want to flesh out the translation some more, to include more of the detailed information which the Museum’s API makes available.

This exercise was a test of my XProc-Z software, and of the general approach of using a proxy to publish Linked Data. Although the result is not yet a complete representation of the Museum’s API, I think it has at least proved the practicality of the approach.

At present my Linked Data service produces RDF in XML format only. There are many other ways that the RDF can be expressed, such as e.g. JSON-LD, and there are even ways to embed the RDF in HTML, which makes it easier for a human to read. But I’ve left that part of the project for now; it’s a very distinct part that will plug in quite easily, and in the meantime there are other pieces of software available that can do that part of the job.

See the demo

The proxy software itself is running here on my website conaltuohy.com, but for ease of viewing it’s more convenient to access it through another proxy which converts the Linked Data into an HTML view.

Here is an HTML view of a Linked Data resource which is a timber cabinet for storing computer software for an ancient computer: http://conaltuohy.com/xproc-z/museum-victoria/resource/items/1411018. Here is the same Linked Data resource description as raw RDF/XML: http://conaltuohy.com/xproc-z/museum-victoria/data/items/1411018 — note how, if you follow this link, the URL in your browser’s address bar changes as the Linked Data server redirects you from the identifier for the cabinet itself, to an identifier for a set of data about the cabinet.

The code

The source code for the proxy is available in the XProc-Z GitHub repository: I’ve packaged the Museum Victoria Linked Data service as one of XProc-Z’s example apps. The code is contained in two files:

museum-victoria.xpl which is a pipeline written in the XProc language, which deals with receiving and sending HTTP messages, and converting JSON into XML, and
museum-victoria-json-to-rdf.xsl, which is a stylesheet written in the XSLT language, which performs the translation between Museum Victoria’s vocabulary and the CIDOC-CRM vocabulary.