Conal Tuohy's blog – the blog of a digital humanities software developer
http://conaltuohy.com

Australian Society of Archivists 2016 conference #asalinks
http://conaltuohy.com/blog/asa-2016-asalinks/
Tue, 25 Oct 2016

Last week I participated in the 2016 conference of the Australian Society of Archivists, in Parramatta.

[Image: #ASALinks poster]

I was very impressed by the programme and the discussion. I thought I’d jot down a few notes here about just a few of the presentations that were most closely related to my own work. The presentations were all recorded, and as the ASA’s YouTube channel is updated with newly edited videos, I’ll be editing this post to include those videos.

It was my first time at an ASA conference; I’d been dragged into it by Asa Letourneau, from the Public Record Office Victoria, with whom over the last year I’d been developing a piece of software called “PROVisualizer”, which appears right below here in the page (hint: click the “full screen” button in its bottom right corner if you want to have a play with it).

Asa and I gave a presentation on the PROVisualizer, talking about the history of the project from the early prototypes and models built at PROV, to the series of revisions of the product built in collaboration with me, and including the “Linked Data” infrastructure behind the visualization itself, and its prospects for further development and re-use.

You can access the PROVisualizer presentation as a PDF.

As always, I enjoyed Tim Sherratt‘s contribution: a keynote on redaction by ASIO (secret police) censors in the National Archives, called Turning the Inside Out.

The black marks are of course inserted by the ASIO censors in order to obscure and hide information, but Tim showed how it’s practicable to deconstruct the redactions’ role in the documents they obscure, and convert these voids, these absences, into positive signs in their own right; and that these signs can be utilized to discover politically sensitive texts, and zoom in precisely on the context that surrounds the censored details in each text. Also the censors made a lot of their redaction marks into cute little pictures of people and sailing ships, which raised a few laughs.

In the morning of the first day of talks, I got a kick out of Chris Hurley’s talk “Access to Archives (& Other Records) in the Digital Age”. His rejection of silos and purely hierarchical data models, and his vision of openness to, and accommodation of, small players in the archives space both really resonated with me, and I was pleased to be able to chat with him over coffee later in the afternoon about the history of this idea and about how things like Linked Data and the ICA’s “Records in Context” data model can help to realize it.

In the afternoon of the first day I was particularly struck by Ross Spencer‘s presentation about extracting metadata from full text resources. He spoke about using automation to identify the names of people, organisations, places, and so on, within the text of documents. For me this was particularly striking because I’d only just begun an almost identical project myself for the Australian Policy Online repository of policy documents. In fact it turned out we were using the same software (Apache Tika and the Stanford Named Entity Recognizer).
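
To give a flavour of what that kind of pipeline can look like, here's a rough sketch in Python (not Ross's code, and not exactly the APO pipeline either). It assumes the tika and nltk packages are installed (along with NLTK's tokenizer data), and that the Stanford NER jar and classifier model have been downloaded separately; the file paths are just placeholders.

```python
# A minimal sketch of named-entity extraction from a document: Apache Tika
# extracts the plain text, and the Stanford Named Entity Recognizer (via
# NLTK's wrapper) tags people, organisations and places within that text.
# The file paths below are placeholders, not real locations.
from tika import parser
from nltk.tokenize import word_tokenize
from nltk.tag.stanford import StanfordNERTagger

# Extract plain text from a PDF (or any other format Tika understands)
parsed = parser.from_file("policy-document.pdf")
text = parsed.get("content") or ""

# Tag tokens with the Stanford NER 3-class model (person/organisation/location)
tagger = StanfordNERTagger(
    "english.all.3class.distsim.crf.ser.gz",  # classifier model (placeholder path)
    "stanford-ner.jar",                       # Stanford NER jar (placeholder path)
)
tagged = tagger.tag(word_tokenize(text))

# Keep only the tokens that were recognised as named entities
entities = [(token, label) for token, label in tagged if label != "O"]
print(entities[:20])
```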

On the second day I was particularly struck by a few papers that were very close to my own interests. Nicole Kearney, from Museum Victoria, talked about her work coordinating the Biodiversity Heritage Library Australia.

This presentation focused on getting value from the documentary heritage of museums (such things as field notes and diaries from scientific expeditions) by using the Atlas of Living Australia's DigiVol transcription platform to allow volunteers to transcribe the text from digital images, and then publishing the text and images online using the BHL publication platform. In between there was a slightly awkward step which involved Nicole converting the CSV format produced by DigiVol into a more standard format for the BHL. I've had an interest in text transcription going back to slightly before the time I joined the New Zealand Electronic Text Centre at Victoria University of Wellington; this would've been about 2003, which seems like ancient times now.

After that I saw Val Love and Kirsty Cox talk about their journey in migrating the Alexander Turnbull Library's collection data from the TAPUHI software to KE EMu. It was an impressive effort, given the historical complexity of TAPUHI and the amount of data analysis required to make sense of its unique model, translate that into a more standard conceptual model, and implement that model in EMu. It's an immense milestone for the Turnbull, and I hope it will lead in short order to the opening up of the collection metadata to greater reuse.

Finally I want to mention the talk "Missing Links: museum archives as evidence, context and content" from Mike Jones. This was another talk about breaking down barriers between collection management systems in museums: on the one hand, the museum's collection of objects, and on the other, the institution's archives. Of course those archives provide a great deal of context for the collection, but the reality is that the IT infrastructure and social organisation of these two systems are generally very distinct and separate. Mike's talk was about integrating cultural heritage knowledge across different organisational structures, domains of professional expertise, data models, and IT systems. I got a shout-out in one slide in the form of a reference to some experimental work I'd done with Museum Victoria's web API, to convert it into a Linked Data service.

It's my view that Linked Data technology offers a practical approach to resolving the complex data integration issues in cultural heritage: it is relatively easy to expose legacy systems, whatever they might be, in the form of Linked Data, and having done so, the task of integrating the data so exposed is also rather straightforward (that's what Linked Data was invented for, pretty much). To me the problem is how to sell this to an institution, in the sense that you have to offer the institution itself a "win" for undertaking the work. If it's just that they can award themselves 5 gold stars for public service, that's not a great reason. You need to be able to deliver tangible value to museums themselves. This is where I think there's a gap: in leveraging Linked Data to enhance exhibitions and also in-house collection management systems. If we can make it so that there's value to institutions in creating and also consuming Linked Data, then we may be able to establish a virtuous circle to drive uptake of the technology, and see some progress in the integration of knowledge in the sector.
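
To make that "relatively easy" claim a little more concrete, here's a tiny, purely hypothetical sketch using Python's rdflib: a single record from a legacy collection system re-expressed as a handful of RDF statements, with an owl:sameAs link asserting that another institution's URI identifies the same object. Every URI and property choice here is invented for illustration.

```python
# A hypothetical sketch of exposing a single legacy collection record as
# Linked Data with rdflib. All URIs here are invented for illustration.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, OWL, RDFS

g = Graph()
obj = URIRef("http://example.org/museum/object/12345")

# Re-express a legacy record (identifier, title, maker) as RDF statements
g.add((obj, RDFS.label, Literal("Ship model")))
g.add((obj, DCTERMS.creator, Literal("Jones & Co.")))

# Link this URI to another institution's URI for the same object
g.add((obj, OWL.sameAs, URIRef("http://example.org/archives/item/A987")))

# Serialize as Turtle, ready to publish at the object's URI
print(g.serialize(format="turtle"))
```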

 

Visualizing Government Archives through Linked Data
http://conaltuohy.com/blog/visualizing-government-archives-through-linked-data/
Tue, 05 Apr 2016

Tonight I'm knocking back a gin and tonic to celebrate finishing a piece of software development for my client the Public Record Office Victoria; the archives of the government of the Australian state of Victoria.

The work, which will go live in a couple of weeks, was an update to a browser-based visualization tool which we first set up last year. In response to user testing, we made some changes to improve the visualization’s usability. It certainly looks a lot clearer than it did, and the addition of some online help makes it a bit more accessible for first-time users.

The visualization now looks like this (here showing the entire dataset, unfiltered, which is not actually that useful, though it is quite pretty):

[Image: screenshot of the PROVisualizer]

The bulk of the work, though, was to automate the preparation of data for the visualization.

Up until now, the dataset which you could visualize consisted of a couple of CSV files, manually assembled with considerable care and effort from reports exported from PROV’s repository “Archives One”. In the new system, this manual work will not need to be repeated. Instead, the same dataset will be assembled by an automated metadata-processing pipeline which will keep it continually up to date as government agencies and functions change over time.

It was not as big a job as you might think, since in fact a lot of the work to generate the data had already been done.

PROV’s Interoperable Data service

In 2012, in collaboration with their counterpart agency State Records New South Wales, PROV had set up an Interoperable Data publishing service with funding from the Australian National Data Service. They custom-built some software to export data from Archives One to produce a set of metadata records in RIF-CS format, and they deployed an off-the-shelf software application (an “OAI-PMH Repository”) to disseminate those metadata records over the web.
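
For those unfamiliar with OAI-PMH, the protocol is nothing more exotic than HTTP requests that return XML, with large result sets paged using "resumption tokens". Here's a rough sketch of a harvest loop in Python; it's not PROV's actual harvester, the endpoint URL is a placeholder, and the "rif" metadata prefix is an assumption on my part (the repository's ListMetadataFormats response is the place to check for the real value).

```python
# A minimal sketch of an OAI-PMH harvest loop using the "requests" library.
# The endpoint URL and the "rif" metadata prefix are assumptions made for
# illustration only.
import xml.etree.ElementTree as ET
import requests

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ENDPOINT = "http://example.org/oai"   # placeholder OAI-PMH base URL

params = {"verb": "ListRecords", "metadataPrefix": "rif"}
records = []
while True:
    response = requests.get(ENDPOINT, params=params)
    tree = ET.fromstring(response.content)
    records.extend(tree.iter(OAI + "record"))
    token = tree.find(".//" + OAI + "resumptionToken")
    if token is None or not (token.text or "").strip():
        break
    # Later pages are requested with the verb and the resumption token only
    params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

print(len(records), "records harvested")
```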

Originally, the OAI-PMH repository was serving data to the Australian National Data Service, which runs an aggregation service called Research Data Australia, which offers researchers pointers to all manner of scientific, historical and cultural datasets. The PROV metadata, covering the full history of government records in Victoria, is a useful resource for social science researchers, genealogists, historians, and others.

More recently, PROV’s OAI-PMH repository has also been harvested by the National Library of Australia’s Trove service.

Now at last it will be harvested by the Public Record Office itself.

The data pipeline

The software I've written is a web application, built using XProc, a programming language for data pipelines. It is open source and available on GitHub in a repository with the ludicrously acronymous title PROV-RIF-SPARQL.

This XProc application tediously harvests the metadata records (there are more than 30000 of them) and converts each one from RIF-CS format into RDF/XML format. The RDF/XML data is a reformulation of the RIF-CS in which the hierarchical structures of the RIF-CS are re-expressed as a network of interconnected statements; a kind of web of nodes and links which mathematicians call a “graph”. The statements in these graphs are expressed using the international standard conceptual framework for cultural heritage data; the CIDOC-CRM. My harvester then stores all these RDF/XML documents (or “graphs”) in a SPARQL Graph Store (a kind of hybrid document store and database). The SPARQL Graph Store allows each graph to be addressed individually, but also for the entire dataset to be treated as a single graph, and queried as a whole. Finally, the RDF dataset is queried to produce the two summarised data files which the visualization itself requires; these are simple spreadsheets in CSV (Comma Separated Values) format. One table contains information about each government agency or function, and the other table lists the relationships which have historically existed between those agencies and functions.
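
To illustrate that final step: a SELECT query can be posted to the SPARQL endpoint over HTTP and the results requested directly as CSV, which most SPARQL stores will happily serialize. The sketch below is not one of the real queries (those are written against the CIDOC-CRM graph and are considerably longer); the endpoint URL and the vocabulary in the query are placeholders.

```python
# A sketch of querying a SPARQL endpoint for CSV results over HTTP, using the
# standard SPARQL protocol (a form-encoded "query" parameter) and content
# negotiation. The endpoint URL and vocabulary are placeholders, not PROV's
# actual endpoint or data model.
import requests

ENDPOINT = "http://example.org/sparql"   # placeholder SPARQL query endpoint

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?agency ?label
WHERE {
  ?agency a <http://example.org/vocab#Agency> ;
          rdfs:label ?label .
}
"""

# Ask for the SELECT results as CSV via the Accept header
response = requests.post(
    ENDPOINT,
    data={"query": QUERY},
    headers={"Accept": "text/csv"},
)
response.raise_for_status()

with open("agencies.csv", "w", encoding="utf-8") as f:
    f.write(response.text)
```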

The harvester has a basic user interface where you can start a data harvest; a process that takes about half an hour to complete. In this interface you can specify the location of the OAI-PMH server you want to harvest data from, the format of the data you want to harvest, and the location of the SPARQL Graph Store where you want to store the result, amongst other parameters. In practice, this user interface isn’t used by a human (except during testing); another small program running on a regular schedule makes the request.

[Image: the harvester's web user interface]
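
That scheduled trigger can be as simple as a small script, run by cron or a similar scheduler, which requests the harvester's web interface with the appropriate parameters. Here's a hypothetical sketch, with invented parameter names and URLs:

```python
# A hypothetical sketch of the scheduled trigger: a tiny script, run on a
# regular schedule, that starts a harvest by requesting the harvester's web
# interface. The URL and parameter names are invented for illustration; the
# real interface has its own.
import requests

HARVESTER = "http://localhost:8080/harvester"   # placeholder harvester address

response = requests.get(
    HARVESTER,
    params={
        "oai-base-url": "http://example.org/oai",         # OAI-PMH server to harvest
        "metadata-prefix": "rif",                         # metadata format to request
        "graph-store": "http://localhost:3030/prov/data", # SPARQL Graph Store to fill
    },
    timeout=60 * 60,  # allow plenty of time; a full harvest takes about half an hour
)
response.raise_for_status()
```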

At this stage of the project, the RDF graph is only used internally at PROV, where it functions purely as an intermediary between the RIF-CS input and the CSV output. The RDF data and the SPARQL database together just provide a convenient way to aggregate a big set of records and query the resulting aggregation. But I have no doubt that the RDF data will later be published directly as Linked Open Data, opening it up and allowing it to be connected into a world-wide web of data.
