Visualizing Government Archives through Linked Data

Tonight I’m knocking back a gin and tonic to celebrate finishing a piece of software development for my client, the Public Record Office Victoria (PROV), the archives of the government of the Australian state of Victoria.

The work, which will go live in a couple of weeks, was an update to a browser-based visualization tool which we first set up last year. In response to user testing, we made some changes to improve the visualization’s usability. It certainly looks a lot clearer than it did, and the addition of some online help makes it a bit more accessible for first-time users.

The visualization now looks like this (here showing the entire dataset, unfiltered, which is not actually that useful, though it is quite pretty):

[Image: the provisualizer]

The bulk of the work, though, was to automate the preparation of data for the visualization.

Up until now, the dataset which you could visualize consisted of a couple of CSV files, manually assembled with considerable care and effort from reports exported from PROV’s repository “Archives One”. In the new system, this manual work will not need to be repeated. Instead, the same dataset will be assembled by an automated metadata-processing pipeline which will keep it continually up to date as government agencies and functions change over time.

It was not as big a job as you might think, since a lot of the work to generate the data had already been done.

PROV’s Interoperable Data service

In 2012, in collaboration with their counterpart agency State Records New South Wales, PROV had set up an Interoperable Data publishing service with funding from the Australian National Data Service. They custom-built some software to export data from Archives One to produce a set of metadata records in RIF-CS format, and they deployed an off-the-shelf software application (an “OAI-PMH Repository”) to disseminate those metadata records over the web.

Originally, the OAI-PMH repository was serving data to the Australian National Data Service, which runs an aggregation service called Research Data Australia, which offers researchers pointers to all manner of scientific, historical and cultural datasets. The PROV metadata, covering the full history of government records in Victoria, is a useful resource for social science researchers, genealogists, historians, and others.

More recently, PROV’s OAI-PMH repository has also been harvested by the National Library of Australia’s Trove service.

Now at last it will be harvested by the Public Record Office itself.

The data pipeline

The software I’ve written is a web application built in XProc, a programming language for data pipelines. It is open source and available on GitHub in a repository with the ludicrously acronymous title PROV-RIF-SPARQL.

This XProc application tediously harvests the metadata records (there are more than 30,000 of them) and converts each one from RIF-CS format into RDF/XML format. The RDF/XML data is a reformulation of the RIF-CS in which the hierarchical structures of the RIF-CS are re-expressed as a network of interconnected statements: a kind of web of nodes and links which mathematicians call a “graph”. The statements in these graphs are expressed using the CIDOC-CRM, the international standard conceptual framework for cultural heritage data.

My harvester then stores all these RDF/XML documents (or “graphs”) in a SPARQL Graph Store (a kind of hybrid document store and database). The SPARQL Graph Store allows each graph to be addressed individually, but also allows the entire dataset to be treated as a single graph and queried as a whole. Finally, the RDF dataset is queried to produce the two summarised data files which the visualization itself requires: simple spreadsheets in CSV (Comma Separated Values) format. One table contains information about each government agency or function, and the other lists the relationships which have historically existed between those agencies and functions.
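
To give a concrete (if simplified) picture of those stages, here is a rough sketch in Python rather than XProc. The endpoint URLs, the graph-store layout, and the query are illustrative assumptions, not the production pipeline:

    # A sketch of the pipeline's stages, with made-up endpoint URLs and a
    # placeholder conversion step. The real pipeline is written in XProc;
    # this just shows the same shape in Python.
    import requests
    from sickle import Sickle                      # OAI-PMH harvesting client
    from SPARQLWrapper import SPARQLWrapper, CSV   # SPARQL query client

    OAI_ENDPOINT = "https://example.prov.vic.gov.au/oai"        # assumption
    GRAPH_STORE = "https://example.prov.vic.gov.au/graphs"      # assumption
    SPARQL_ENDPOINT = "https://example.prov.vic.gov.au/sparql"  # assumption

    def rif_cs_to_rdf_xml(rif_cs: str) -> str:
        """Placeholder for the RIF-CS to CIDOC-CRM RDF/XML conversion
        (an XSLT/XProc transformation in the real pipeline)."""
        raise NotImplementedError

    # 1. Harvest every RIF-CS record (there are more than 30,000) and store
    #    each one, converted to RDF/XML, as an individually addressable graph.
    for record in Sickle(OAI_ENDPOINT).ListRecords(metadataPrefix="rif"):
        requests.put(
            GRAPH_STORE,
            params={"graph": record.header.identifier},  # one named graph per record
            data=rif_cs_to_rdf_xml(record.raw),
            headers={"Content-Type": "application/rdf+xml"},
        )

    # 2. Query the whole aggregation as a single graph to produce one of the
    #    two CSV files the visualization needs (the query is illustrative).
    sparql = SPARQLWrapper(SPARQL_ENDPOINT)
    sparql.setQuery("""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?agency ?label
        WHERE { GRAPH ?g { ?agency rdfs:label ?label } }
    """)
    sparql.setReturnFormat(CSV)
    with open("agencies.csv", "wb") as f:
        f.write(sparql.query().convert())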

The harvester has a basic user interface where you can start a data harvest, a process that takes about half an hour to complete. In this interface you can specify the location of the OAI-PMH server you want to harvest data from, the format of the data you want to harvest, and the location of the SPARQL Graph Store where you want to store the result, amongst other parameters. In practice, this user interface isn’t used by a human (except during testing); another small program running on a regular schedule makes the request.

[Image: the harvester’s user interface]
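
That scheduled request is nothing fancy; it amounts to something like this (the harvester URL and parameter names are made up for illustration):

    # What the scheduled job boils down to: an HTTP request to the harvester's
    # web interface, with the same parameters a human would fill in.
    # The URL and parameter names are illustrative assumptions.
    import requests

    response = requests.get(
        "https://example.prov.vic.gov.au/harvester",  # assumed harvester URL
        params={
            "oai-endpoint": "https://example.prov.vic.gov.au/oai",    # source repository
            "metadata-prefix": "rif",                                 # format to harvest
            "graph-store": "https://example.prov.vic.gov.au/graphs",  # destination
        },
        timeout=60 * 60,  # the harvest takes about half an hour
    )
    response.raise_for_status()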

At this stage of the project, the RDF graph is only used internally at PROV, where it functions purely as an intermediary between the RIF-CS input and the CSV output. The RDF data and the SPARQL database together just provide a convenient way to aggregate a big set of records and query the resulting aggregation. But I have no doubt that the RDF data will later be published directly as Linked Open Data, opening it up and allowing it to be connected into a world-wide web of data.

Taking control of an uncontrolled vocabulary

A couple of days ago, Dan McCreary tweeted:

It reminded me of some work I had done a couple of years ago for a project which was at the time based on Linked Data, but which later switched away from that platform, leaving various bits of RDF-based work orphaned.

One particular piece which sprang to mind was a tool for dealing with vocabularies. Whether it’s useful for Dan’s talk I don’t know, but I thought I would dig it out and blog a little about it in case it’s of interest more generally to people working in Linked Open Data in Libraries, Archives and Museums (LODLAM).

Bridging the conceptual gap: Museum Victoria’s collections API and the CIDOC Conceptual Reference Model

A Museum Victoria LOD graph about a teacup, shown using the LODLive visualizer.
This is the third in a series of posts about an experimental Linked Open Data (LOD) publication based on the web API of Museum Victoria.

The first post gave an introduction and overview of the architecture of the publication software, and the second dealt quite specifically with how names and identifiers work in the LOD publication software.

In this post I’ll cover how the publication software takes the data published by Museum Victoria’s API and reshapes it to fit a common conceptual model for museum data, the “Conceptual Reference Model” published by the documentation committee of the International Council of Museums. I’m not going to exhaustively describe the translation process (you can read the source code if you want the full story), but I’ll include examples to illustrate the typical issues that arise in such a translation.
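
To give a taste of what the translation involves, here is a minimal sketch of reshaping a single record into CRM terms. The JSON field names and the choice of CRM class are illustrative assumptions, not Museum Victoria’s actual schema or my full mapping:

    # Sketch: recasting one hypothetical Museum Victoria API record as
    # CIDOC-CRM triples using rdflib. Field names and the CRM mapping
    # are illustrative assumptions.
    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF, RDFS

    CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")

    item = {"id": "items/123", "objectName": "Teacup"}  # hypothetical API record

    g = Graph()
    subject = URIRef("https://example.org/resource/" + item["id"])  # minted LOD URI
    g.add((subject, RDF.type, CRM["E22_Man-Made_Object"]))  # a physical, human-made thing
    g.add((subject, RDFS.label, Literal(item["objectName"])))

    print(g.serialize(format="turtle"))
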

Names in the Museum

My last blog post described an experimental Linked Open Data service I created, underpinned by Museum Victoria’s collection API. Mainly, I described the LOD service’s general framework, and explained how it worked in terms of data flow.

To recap briefly, the LOD service receives a request from a browser and in turn translates that request into one or more requests to the Museum Victoria API, interprets the result in terms of the CIDOC CRM, and returns the result to the browser. The LOD service does not have any data storage of its own; it’s purely an intermediary or proxy, like one of those real-time interpreters at the United Nations. I call this technique a “Linked Data proxy”.
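
In outline, the proxy looks something like this sketch (assuming a Flask app; the API base URL and path are my guesses for illustration, and the CRM conversion is stubbed out):

    # A minimal Linked Data proxy: each incoming request is translated into a
    # Museum Victoria API call, the JSON response is recast as CIDOC-CRM RDF,
    # and the result is returned. There is no local storage at all.
    # The API base URL, path, and conversion step are assumptions/stubs.
    import requests
    from flask import Flask, Response

    MV_API = "https://collections.museumsvictoria.com.au/api"  # assumed base URL

    app = Flask(__name__)

    def json_to_crm_rdf(record: dict) -> str:
        """Stub: recast the API's JSON into CIDOC-CRM RDF (e.g. Turtle)."""
        raise NotImplementedError

    @app.route("/resource/items/<item_id>")
    def item(item_id):
        record = requests.get(f"{MV_API}/items/{item_id}", timeout=30).json()
        return Response(json_to_crm_rdf(record), mimetype="text/turtle")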

I have a couple more blog posts to write about the experience. In this post, I’m going to write about how the Linked Data proxy deals with the issue of naming the various things which the Museum’s database contains.


Linked Open Data built from a custom web API

I’ve spent a bit of time just recently poking at the new Web API of Museum Victoria Collections, and making a Linked Open Data service based on their API.

I’m writing this up as an example of one way — a relatively easy way — to publish Linked Data off the back of some existing API. I hope that some other libraries, archives, and museums with their own API will adopt this approach and start publishing their data in a standard Linked Data style, so it can be linked up with the wider web of data.


Zotero, Web APIs, and data formats

I’ve been doing some work recently (for a couple of different clients) with Zotero, the popular reference management software. I’ve always been a big fan of the product. It has a number of great features, including the fact that it integrates with users’ browsers, and can read metadata out of web pages, PDF files, linked data, and a whole bunch of APIs.

[Image: Zotero]

One especially nice feature of Zotero is that you can use it to collaborate with a group of people on a shared library of data which is stored in the cloud and synchronized to the devices of the group members.
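
Those group libraries are also accessible programmatically through Zotero’s web API. For example, a public group’s items can be listed like this (the group ID is a placeholder; private groups also need an API key):

    # List the most recent items in a Zotero group library via the
    # Zotero web API (v3). The group ID is a made-up placeholder.
    import requests

    GROUP_ID = "123456"  # placeholder: a real Zotero group ID goes here

    response = requests.get(
        f"https://api.zotero.org/groups/{GROUP_ID}/items",
        params={"format": "json", "limit": 5},
        headers={"Zotero-API-Version": "3"},
    )
    for item in response.json():
        print(item["data"].get("title", "(untitled)"))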

Proxying: a trick to easily add features to existing websites and applications

At the start of last month I attended the LODLAM (Linked Open Data in Libraries, Archives and Museums) Summit in Sydney, in the lovely Mitchell Library of the State Library of New South Wales.

The Summit is organised as an “un-conference”. There is no pre-defined agenda; the participants themselves draw one up at the start of the day. This makes it a very participatory event: your brain is in top gear the whole time, and everything is so interesting that you end up feeling a bit stunned at the end of the day.

One of the features of the Summit was a series of very brief talks (“speedos”) on a variety of topics. At the last minute I decided I’d contribute a quick rant on a particular hobby-horse of mine: the value of using proxies to build web applications, Linked Open Data, and so on.

Old News for Twitter

Yesterday I finished a little development project to build a TwitterBot for New Zealand’s online newspaper archive Papers Past.

What’s a “TwitterBot”? It’s a software application that autonomously (robotically, hence “-bot”) sends tweets. There are a lot of TwitterBots tweeting about all kinds of things. Tim Sherratt has produced a few, including one called @TroveNewsBot which tweets links to articles from the Australian online newspaper archive of Trove, and this was a direct inspiration for my TwitterBot. Recently Hugh Rundle produced a TwitterBot called Aus GLAM Blog Bot that tweets links to blog posts by people blogging in the Australian GLAM (Galleries, Libraries, Archives and Museums) sector. People like me. I’m looking forward to seeing Hugh’s bot tweeting about my bot.
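
The core of a TwitterBot really is tiny. Here is a hedged sketch using the tweepy library, with the credentials and the article-picking step left as placeholders (this is not the actual code of my bot):

    # The skeleton of a TwitterBot: pick an article, compose a tweet, post it.
    # Uses the tweepy library (v4). Credentials and the article-picking step
    # are placeholders, not the actual Papers Past bot code.
    import tweepy

    def pick_article() -> tuple[str, str]:
        """Placeholder: choose a Papers Past article (title, URL) to tweet."""
        raise NotImplementedError

    client = tweepy.Client(
        consumer_key="...", consumer_secret="...",
        access_token="...", access_token_secret="...",
    )

    title, url = pick_article()
    client.create_tweet(text=f"{title} {url}")  # run on a schedule, one tweet at a time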

Public OAI-PMH repository for Papers Past

I have deployed a publicly available service to provide access in bulk to newspaper articles from Papers Past — the National Library of New Zealand’s online collection of historical newspapers — via the DigitalNZ API.

The service allows access to newspaper articles in bulk (up to a maximum of 5000 articles), using OAI-PMH harvesting software. To gain access to the collection, point your OAI-PMH harvester to the repository with this URI:

https://papers-past-oai-pmh.herokuapp.com/
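
If you don’t have a harvester to hand, a few lines of Python with the Sickle library will page through the collection. I am assuming here that the repository offers records under the standard oai_dc metadata prefix:

    # Harvest articles from the Papers Past OAI-PMH repository using Sickle.
    # Assumes the standard oai_dc metadata prefix; Sickle follows OAI-PMH
    # resumption tokens automatically, paging through the result set
    # (capped at 5000 articles by the service).
    from sickle import Sickle

    sickle = Sickle("https://papers-past-oai-pmh.herokuapp.com/")
    for record in sickle.ListRecords(metadataPrefix="oai_dc"):
        print(record.header.identifier)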

Beta release of XProc-Z web server framework

I have at last released a “final” version of my web server framework, XProc-Z, for testing. The last features I had wanted to include were:

  • The ability for the XProc code in the web server to read information from its environment, so that a generic XProc pipeline can be customized by setting configuration properties.
  • Full support for sending and receiving binary files (i.e. non-text files). XProc is really a language for processing XML, but I think it will be handy to be able to deal with binary files as well from time to time.
  • A few sample XProc pipelines, to demonstrate the capability of the platform.

[Image: XProc-Z samples]