oai-pmh – Conal Tuohy's blog
The blog of a digital humanities software developer
http://conaltuohy.com

Analysis & Policy Online
27 June 2017

Notes for my Open Repositories 2017 conference presentation. I will edit this post later to flesh it out into a proper blog post.
Follow along at: conaltuohy.com/blog/analysis-policy-online/

background

  • Early discussion with Amanda Lawrence of APO (which at that time stood for “Australian Policy Online”) about text mining, at the 2015 LODLAM Summit in Sydney.
  • They needed automation to help with the cataloguing work, to improve discovery.
  • They needed to understand their own corpus better.
  • I suggested a particular technical approach based on previous work.
  • In 2016, APO contracted me to advise and help them build a system that would “mine” metadata from their corpus, and use Linked Data to model and explore it.

constraints

  • Openness
  • Integrate metadata from multiple text-mining processes, plus manually created metadata
  • Minimal dependency on their current platform (Drupal 7, now Drupal 8)
  • Lightweight; easy to make quick changes

technical approach

  • Use an entirely external metadata store (a SPARQL Graph Store)
  • Use a pipeline! Extract, Transform, Load
  • Use standard protocol to extract data (first OAI-PMH, later sitemaps)
  • In fact, use web services for everything; the pipeline is then just a simple script that passes data between web services
  • Sure, XSLT and SPARQL Query, but what the hell is XProc?!

progress

  • Configured Apache Tika as a web service, using the Stanford Named Entity Recognition toolkit
  • Built an XProc pipeline to harvest from Drupal’s OAI-PMH module, download digital objects, process them with Stanford NER via Tika, and store the resulting graphs in a Fuseki graph store (a sketch of these web-service calls follows this list)
  • Harvested, and produced a graph of part of the corpus, but …
  • Turned out the Drupal OAI-PMH module was broken! So we used sitemaps instead
  • “Related” list added to APO dev site (NB I’ve seen this isn’t working in all browsers and obviously needs more work; perhaps using an iframe is not the best idea. Try Chrome if you don’t see the list of related pages on the right)
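
As a concrete illustration of the pipeline’s web-service calls, here is a minimal sketch in Python (rather than XProc, purely for brevity). It sends a downloaded document to a Tika server for named-entity extraction and stores the resulting triples as a named graph in Fuseki. The port numbers, dataset name, NER_PERSON metadata key, predicate URI and document URL are all assumptions for illustration, not APO’s actual configuration.

# A sketch only, not the production XProc pipeline. Assumes a local Tika server
# (port 9998) configured with the Stanford NER parser, and a Fuseki dataset
# named "apo"; the NER_PERSON key and the predicate URI are assumptions.
import requests

TIKA_RMETA = "http://localhost:9998/rmeta"      # Tika's recursive metadata endpoint
FUSEKI_DATA = "http://localhost:3030/apo/data"  # Fuseki Graph Store Protocol endpoint

def mine_and_store(doc_url):
    # download the digital object
    doc = requests.get(doc_url).content

    # ask Tika for metadata as JSON; with the named-entity parser enabled,
    # entity lists are assumed to appear under keys such as "NER_PERSON"
    meta = requests.put(TIKA_RMETA, data=doc,
                        headers={"Accept": "application/json"}).json()[0]
    people = meta.get("NER_PERSON", [])
    if isinstance(people, str):
        people = [people]

    # express the entities as Turtle triples, using a made-up predicate
    turtle = "".join(
        '<{}> <http://example.org/mentions> "{}" .\n'.format(doc_url, p.replace('"', ""))
        for p in people
    )

    # store the triples as a named graph (one graph per document)
    requests.put(FUSEKI_DATA, params={"graph": doc_url},
                 data=turtle.encode("utf-8"),
                 headers={"Content-Type": "text/turtle"})

# hypothetical document URL, for illustration only
mine_and_store("http://apo.org.au/files/example-report.pdf")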

next steps

  • Visualize the graph
  • Integrate more of the manually created metadata into the RDF graph
  • Add topic modelling (using MALLET) alongside the NER

Let’s see the code

Questions?

(if there’s any time remaining)

Visualizing Government Archives through Linked Data
5 April 2016

Tonight I’m knocking back a gin and tonic to celebrate finishing a piece of software development for my client the Public Record Office Victoria; the archives of the government of the Australian state of Victoria.

The work, which will go live in a couple of weeks, was an update to a browser-based visualization tool which we first set up last year. In response to user testing, we made some changes to improve the visualization’s usability. It certainly looks a lot clearer than it did, and the addition of some online help makes it a bit more accessible for first-time users.

The visualization now looks like this (here showing the entire dataset, unfiltered, which is not actually that useful, though it is quite pretty):

[Screenshot: the “provisualizer” visualization of the full dataset]

The bulk of the work, though, was to automate the preparation of data for the visualization.

Up until now, the dataset which you could visualize consisted of a couple of CSV files, manually assembled with considerable care and effort from reports exported from PROV’s repository “Archives One”. In the new system, this manual work will not need to be repeated. Instead, the same dataset will be assembled by an automated metadata-processing pipeline which will keep it continually up to date as government agencies and functions change over time.

It was not as big a job as you might think, since in fact a lot of the work to generate the data had already been done.

PROV’s Interoperable Data service

In 2012, in collaboration with their counterpart agency State Records New South Wales, PROV had set up an Interoperable Data publishing service with funding from the Australian National Data Service. They custom-built some software to export data from Archives One to produce a set of metadata records in RIF-CS format, and they deployed an off-the-shelf software application (an “OAI-PMH Repository”) to disseminate those metadata records over the web.

Originally, the OAI-PMH repository was serving data to the Australian National Data Service, which runs an aggregation service called Research Data Australia, which offers researchers pointers to all manner of scientific, historical and cultural datasets. The PROV metadata, covering the full history of government records in Victoria, is a useful resource for social science researchers, genealogists, historians, and others.

More recently, PROV’s OAI-PMH repository has also been harvested by the National Library of Australia’s Trove service.

Now at last it will be harvested by the Public Record Office itself.

The data pipeline

The software I’ve written is a web application, built using a programming language for data pipelines called XProc. It is open source and available on GitHub in a repository with the ludicrously acronymous title PROV-RIF-SPARQL.

This XProc application tediously harvests the metadata records (there are more than 30,000 of them) and converts each one from RIF-CS format into RDF/XML format. The RDF/XML data is a reformulation of the RIF-CS in which the hierarchical structures of the RIF-CS are re-expressed as a network of interconnected statements; a kind of web of nodes and links which mathematicians call a “graph”. The statements in these graphs are expressed using the international standard conceptual framework for cultural heritage data; the CIDOC-CRM.

My harvester then stores all these RDF/XML documents (or “graphs”) in a SPARQL Graph Store (a kind of hybrid document store and database). The SPARQL Graph Store allows each graph to be addressed individually, but also allows the entire dataset to be treated as a single graph and queried as a whole.

Finally, the RDF dataset is queried to produce the two summarised data files which the visualization itself requires; these are simple spreadsheets in CSV (Comma Separated Values) format. One table contains information about each government agency or function, and the other table lists the relationships which have historically existed between those agencies and functions.
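
To illustrate that final step, here is a minimal sketch of running a SPARQL SELECT query against the graph store’s query endpoint and saving the result as CSV. The dataset name and the query itself are placeholders rather than PROV’s actual service or data model, which depends on exactly how the CIDOC-CRM graphs are structured.

# A sketch only: query the aggregated RDF and write one of the CSV files that
# the visualization needs. The dataset name and the query are placeholders.
import requests

FUSEKI_QUERY = "http://localhost:3030/prov/query"  # hypothetical Fuseki query endpoint

QUERY = """
PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
SELECT ?agency ?name
WHERE {
  ?agency a crm:E74_Group ;
          crm:P1_is_identified_by/crm:P190_has_symbolic_content ?name .
}
"""

# Fuseki will serialise SELECT results as CSV when asked for text/csv
response = requests.post(FUSEKI_QUERY, data={"query": QUERY},
                         headers={"Accept": "text/csv"})
with open("agencies.csv", "w", encoding="utf-8") as f:
    f.write(response.text)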

The harvester has a basic user interface where you can start a data harvest; a process that takes about half an hour to complete. In this interface you can specify the location of the OAI-PMH server you want to harvest data from, the format of the data you want to harvest, and the location of the SPARQL Graph Store where you want to store the result, amongst other parameters. In practice, this user interface isn’t used by a human (except during testing); another small program running on a regular schedule makes the request.

[Screenshot: the harvester’s web interface]

At this stage of the project, the RDF graph is only used internally at PROV, where it functions purely as an intermediary between the RIF-CS input and the CSV output. The RDF data and the SPARQL database together just provide a convenient way to aggregate a big set of records and query the resulting aggregation. But later I have no doubt that the RDF data will be published directly as Linked Open Data, opening it up and allowing it to be connected into a world-wide web of data.

Public OAI-PMH repository for Papers Past
25 May 2015

I have deployed a publicly available service to provide access in bulk to newspaper articles from Papers Past — the National Library of New Zealand’s online collection of historical newspapers — via the DigitalNZ API.

The service allows access to newspaper articles in bulk (up to a maximum of 5000 articles), using OAI-PMH harvesting software. To gain access to the collection, point your OAI-PMH harvester to the repository with this URI:

https://papers-past-oai-pmh.herokuapp.com/

If you’re looking for a good harvester, let me recommend jOAI.

Searching

You can harvest records that match a search. Provide your search query as an OAI-PMH set; for example, to search for “titokowaru”, specify search:titokowaru as the value of the OAI-PMH set parameter:

https://papers-past-oai-pmh.herokuapp.com/?verb=ListRecords&metadataPrefix=oai_dc&set=search:titokowaru

Formats available

You can harvest records (i.e. articles) in one of three different formats:

  • html — this format returns the full text of the articles, and is likely to be the most useful format. Note that the text available through DigitalNZ has had punctuation and capitalization removed.
  • oai_dc — a simple metadata record.
  • digitalnz — straightforwardly based on DigitalNZ’s own metadata format.

To see all the formats the repository offers, use the ListMetadataFormats verb:

https://papers-past-oai-pmh.herokuapp.com/?verb=ListMetadataFormats
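
If you prefer to script a request yourself rather than use a dedicated harvester, a single ListRecords call is easy to make. The sketch below (in Python, purely illustrative) prints the identifiers of the first batch of matching records; a real harvest should also follow the resumptionToken that the repository returns when a result set spans more than one response.

# A sketch of one ListRecords request against the repository described above.
import requests
import xml.etree.ElementTree as ET

OAI = "https://papers-past-oai-pmh.herokuapp.com/"
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

response = requests.get(OAI, params={
    "verb": "ListRecords",
    "metadataPrefix": "oai_dc",
    "set": "search:titokowaru",
})
root = ET.fromstring(response.content)

# print the OAI identifier of each record in this batch
for record in root.findall(".//oai:record", NS):
    print(record.findtext("oai:header/oai:identifier", namespaces=NS))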

Happy harvesting!

How to download bulk newspaper articles from Papers Past
14 September 2014

Anyone interested in New Zealand history should already know about the amazing Papers Past website provided by the National Library of New Zealand, where you can read, search and browse millions of historical newspaper articles and advertisements from New Zealand.

You may also know about Digital New Zealand, which provides a central point for access to New Zealand’s digital culture.

This post is about using Digital NZ and Papers Past to get access, in bulk, to newspaper articles, in a form which is then open to being “crunched” with sophisticated “data mining” software, able to automatically discover patterns in the articles. In my earlier post How to download bulk newspaper articles from Trove, I wrote:

Some researchers want to go beyond merely using computers to help them find newspaper articles to read (what librarians call “resource discovery”); researchers also want to use computers to help them read those articles.

To use that kind of “data mining” software, you first need to have direct access to your data, and that can mean having to download thousands and thousands of articles. It’s not at all obvious how to do that, but in fact it can be made quite easy, as I will show.

First, though, a brisk stroll down memory lane…

The history of Papers Past

The Papers Past website can trace its origins back to its pre-digital ancestor the National Newspaper Collection, an archive of paper and microfilm. But the website has its own history, with two distinct phases. The original Papers Past was essentially just a collection of giant images of newspaper pages, organized by newspaper title and by date. It was a fabulous resource, but the tricky thing was knowing where to look.

Then in 2007 Papers Past was reincarnated in a much more sophisticated form, featuring full text search for individual articles.

This crucial improvement in usability came about through converting the page images to text. The Library extracted the full text from each newspaper page image using Optical Character Recognition software, which they had primed with lists of NZ words including placenames and Māori familial names, in order to more reliably recognize these words. Finally they had every headline manually checked and edited for accuracy. The current website is built around an index of all that text, linked to page images. By searching that index, you can retrieve a list of articles that might interest you, pick out a selection, and actually read them.

It’s notable that each of these development stages delivered a new way for researchers to work with the archive. The original website expanded the accessibility of the archive by exposing it to the entire internet, and the modern version of the website dramatically improved the discoverability of that information by allowing researchers to search within full text.

Since then there’s been a third change in the digital environment around Papers Past — where it has become linked up to a broader system called Digital New Zealand — and it seems to me that this change opens up yet another new way for researchers to engage with the archive.

Digital New Zealand

Digital New Zealand is an aggregation of information about NZ cultural items; books, movies, posters, art works, newspapers, and more, drawn from the catalogues and databases of galleries, libraries, museums, and other institutions. There amongst the contributors to Digital New Zealand is our friend Papers Past.

Digital New Zealand is more than just a website, though; its core offering is a so-called “Application Programming Interface”, or API, providing third-party programmers (i.e. people like me) with a way to access the data contained in Digital NZ. Using the Digital NZ API we can search, download, and even create new information resources.

The Digital New Zealand API is a custom-built thing, but functionally it’s not too different to many other APIs in use on the internet. In particular, some parts of it are very similar to a standard mechanism called OAI-PMH, which has been used by libraries and archives for over a decade.

The Open Archives Initiative Protocol for Metadata Harvesting

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol (i.e. a set of conversational conventions, but in this case it’s essentially synonymous with “API”) for machines to send each other large batches of structured information (e.g. library catalogue records) in a variety of formats.

There are two roles defined in the protocol: the “provider”, which manages and publishes a database of records, and the “harvester”, which periodically asks the provider for updates. Over the years a large number of software applications have been produced to function either as OAI-PMH providers or harvesters (or both). Could Papers Past be made to implement this protocol (as a provider)? If so, then we could access Papers Past using any of the existing harvester applications.

Retailer

I’ve been a fan of OAI-PMH from way back, and I’ve used it many times. It had struck me that if I built a “gateway” which translated services like Digital NZ into OAI-PMH, then historical researchers could use existing OAI-PMH software to download the articles they need. So I sat down and wrote this gateway software, and I gave it the name “Retailer” because it’s a kind of shop-front, providing “retail” access to data provided by a “wholesale” data provider such as Digital NZ.

Retailer is a generic framework for translating between one API and another; it could be used to translate all sorts of protocols. Each translation is effected by writing a script that defines the equivalences between a specific pair of APIs. Having written the generic Retailer, I then wrote some translation scripts: first a script to translate between OAI-PMH and the API of the National Library of Australia’s “Trove” service, and then another to translate between OAI-PMH and the API of Digital NZ.

I have bundled the core “Retailer” software with both these scripts, so you can harvest newspaper articles from both the NZ and Australian newspaper archives.

Installation

First, you will need to register with Digital New Zealand to obtain an “API Key”. This is a random password which they assign to you. Whenever you ask a question of the Digital NZ API, you must provide this API Key with your request, so that they know who you are. Since Retailer will be interacting with Digital NZ on your behalf, you will need to let Retailer know what your key is.

Then you need to set up a Java web server, such as Apache Tomcat. This provides the platform on which Retailer runs.

Once you have set up Tomcat, you can install Retailer in it, and configure it to run the Papers Past OAI-PMH provider script papers-past.xsl.

Finally you need to set up an OAI-PMH harvester. I’ve been using jOAI, which I can recommend heartily. It has the advantage that, like Retailer, it is a Java Servlet, so you can install it inside Tomcat, alongside Retailer itself.

Now you are all set to harvest newspaper articles!

Harvesting

To harvest, you should first go to Digital NZ and do some searching until you have a search query that returns you the results you want to harvest. Click through a selection of results to check that they are relevant; if necessary, refine your query to exclude articles you are not interested in.

Once you have decided on a good query, navigate your browser to your jOAI harvester (e.g. to http://localhost:8080/oai/) and add a new harvest. Set the harvest’s setSpec to be search: followed by your query, e.g. search:new zealand wars. Then run the harvest to actually download the search results. For details, see the Retailer documentation.

Why not give it a try? Any problems, leave a comment here or as an issue on the Retailer github site!

 

How to download bulk newspaper articles from Trove
15 August 2014

Before the internet, before TV, before radio, newspapers ruled. There were literally hundreds of newspapers, published in towns and cities all over Australia, and they carried the daily life of Australians in all its petty detail. For historians, newspapers were a diamond mine; the information content was hugely valuable; the hard part was all the digging you had to do. It used to be that you would have to go to a library where a newspaper collection was held, and search manually through text on paper or microfiche. You had to be prepared to put in a lot of hard slog.

But then everything changed. A humanities researcher once told me that for Australian researchers, the National Library of Australia’s “Trove” newspaper archive marked a radical break: “There was a Before Trove, and an After Trove”.

Discovering newspaper articles has never been easier, for both professional and amateur researchers. These days researchers can search through many millions of newspaper articles in a few seconds, from the comfort of their own web browser. Enter your query, retrieve a list of the top 20 hits, click through to read any of them, click for another 20 hits, and so on.

From “resource discovery” to “distant reading”

However, there’s no pleasing some people. Some researchers want to go beyond merely using computers to help them find newspaper articles to read (what librarians call “resource discovery”); researchers also want to use computers to help them read those articles.

If you search and discover an article; a few articles; even a few dozen articles, you can read them yourself and take whatever notes you need. But what if you want to read hundreds, thousands, or hundreds of thousands of articles? What if you wanted to analyze the entire corpus of Trove articles? That’s just not humanly possible. Of course, a computer can’t have quite the same “understanding” of a set of newspaper articles as a human being can, but it has the advantage that it can “read” in minutes or hours what would take a human years, or centuries, to wade through. There are a number of techniques for bulk machine-reading of text: what Franco Moretti called “distant reading”.

Actually this is a part of the so-called “big data” trend in research generally, in which computers are used to find useful information by crunching up vast amounts of data. In humanities research, “big data” generally (though not always) means large corpora of digitized text.

Harvesting text from Trove

At some stage, I have no doubt that the entire Trove corpus will become available to anyone who wants it, but as of this moment, it’s not entirely straightforward to acquire articles in bulk. The Trove programming interface allows for articles to be downloaded, but it needs a little help, because it was primarily intended as a means to discover individual resources, rather than a bulk data exchange mechanism.

A couple of months ago I was looking at the Trove API; the Application Programming Interface that Trove uses to provide automated access to their data. I was struck at the time by how similar it was to another mechanism which libraries use for bulk data exchange (e.g. of library catalogues), namely the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

OAI-PMH is a protocol (i.e. a set of conversational conventions) for machines to send each other large batches of structured information in a variety of formats. The protocol allows a requester (called a “harvester”) to ask a repository (called a “provider”) for data records in a particular format, and updated since a particular date. The provider then responds with a list of records.

If there are too many records to send in a single list (more than a few Megabytes of data is typically considered too large), then the provider will return a partial list, and also include a bookmark (called a “resumption token”) to mark how far through the full list it has got. Whenever the harvester receives a resumption token, it knows it can issue another request, handing the resumption token back to the provider, and receive in response another partial list, starting from where the previous list left off, along with yet another resumption token. At any point, the provider can let the harvester know that the list is finished, by returning an empty resumption token. In this way, very large batches of data can be transferred in bite-sized chunks.

Typically, libraries harvest data overnight, retrieving just the records which have been created or updated since their harvest the previous night.
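
To make that conversation concrete, here is a rough sketch of the loop a harvester runs. It is purely illustrative; in practice you would normally use an existing harvester such as jOAI rather than writing your own.

# Sketch of the harvester's side of the OAI-PMH conversation: request a list of
# records, then keep resubmitting the resumption token until the provider stops
# returning one.
import requests
import xml.etree.ElementTree as ET

NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def harvest(provider_url, metadata_prefix="oai_dc", set_spec=None):
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        root = ET.fromstring(requests.get(provider_url, params=params).content)
        for record in root.findall(".//oai:record", NS):
            yield record
        token = root.findtext(".//oai:resumptionToken", namespaces=NS)
        if not token:
            break  # an empty (or absent) resumption token means the list is complete
        # subsequent requests carry only the verb and the resumption token
        params = {"verb": "ListRecords", "resumptionToken": token}

# e.g. records = list(harvest("http://localhost:8080/retailer/", set_spec="search:annular solar eclipse"))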

On reflection, I realized that if Trove could just support OAI-PMH, then it would be possible to download articles from it in bulk, using any of a number of existing OAI-PMH programs. I wouldn’t need to write software to perform a harvest; I only needed to translate between the OAI-PMH protocol and the Trove API. An easy win!

Retailer

At this point I sat down and knocked out some code, which I called Retailer because it’s a kind of shop-front with Trove as a warehouse behind it. Retailer is able to act as an OAI-PMH provider by translating any OAI-PMH requests it receives into Trove-style requests which it forwards on to the Trove API. Then it takes the response from Trove, formats it as an OAI-PMH response, and returns that.

Because the Trove corpus is so large, it’s not really feasible to harvest the entire thing. So I’ve made it possible to specify a search query, to limit the results to something more manageable. In OAI-PMH, a repository can divide its records up into distinct subsets (called “sets”), and with Retailer you can use a Trove query as one of these sets.

How to use it

Let’s go step by step through how to use Retailer to harvest from Trove. For now, I’m going to detail how to do this using Linux. I hope later I will add instructions for Windows and MacOS.

First you will need to register with the National Library and obtain an API key. This is a kind of random looking password which you must provide to the Trove API with every request you make, and hence you have to tell Retailer what it is.

Next you will need to install Retailer. Retailer is a Java Servlet, which means you must first have installed Java and a Java Web Server such as Apache Tomcat. On a Debian-based Linux, this is as simple as:
sudo apt-get install tomcat7

Now you will install Retailer. Download the retailer.war file and save it e.g.
mkdir /var/lib/retailer
cp retailer.war /var/lib/retailer/retailer.war

To install it into Tomcat, you will need to create a file called retailer.xml in Tomcat’s configuration folder /var/lib/tomcat7/conf/Catalina/localhost with the following content. Copy and paste this and edit the trove key.

<Context path="/retailer" 
    docBase="/var/lib/retailer/retailer.war"
    antiResourceLocking="false">
  <Parameter name="trove-key" value="your-key-here" override="false"/>
</Context>

This is so that Retailer knows your Trove API key.

Next you will need an OAI-PMH harvester. There are several of these available. I am going to use jOAI, which has the advantage that it’s also a Java Servlet. Download the jOAI zip file, unzip it, find the oai.war file, and copy it into the webapps folder of Tomcat.
unzip joai_v3.1.1.3.zip
sudo cp joai_v3.1.1.3/oai.war /var/lib/tomcat7/webapps/

At this point, you should have all you need to start harvesting. Open your browser and navigate to http://localhost:8080/oai/admin/harvester.do – you should be talking to your jOAI harvester. Here’s where you set up your harvester. Click the “Add new harvest” button and fill out the form.

  • Repository Name — e.g. Trove. Call it what you like, but “Trove” is an obvious name.
  • Repository Base URL — e.g. http://localhost:8080/retailer/. This is the address of Retailer. Remember the harvester talks to Retailer, and behind the scenes Retailer is talking to Trove.
  • SetSpec — e.g. search:annular solar eclipse. Enter a search query here. Don’t leave the SetSpec blank or jOAI will attempt to harvest all of Trove, which may take many months.
  • Metadata format — e.g. html. This is where you specify what format you want to retrieve data in. The most useful value here is definitely html (which retrieves the articles as web pages), but you can also use oai_dc (which retrieves only certain metadata, such as the article title and URL, in an XML format), and trove (which retrieves the articles in a similar format to that returned by the Trove API, with minimal reformatting).

Save the harvest configuration, and run it by clicking “All” under the heading “Manually Harvest”. After a few minutes, jOAI should have downloaded your web pages, and it will point you at the folder where they can be found.
