Conal Tuohy’s blog – the blog of a digital humanities software developer
http://conaltuohy.com

A tool for Web API harvesting
31 December 2016
http://conaltuohy.com/blog/web-api-harvesting/
[Image: A medieval man harvesting metadata from a medieval Web API]

As 2016 stumbles to an end, I’ve put in a few days’ work on my new project Oceania, which is to be a Linked Data service for cultural heritage in this part of the world. Part of this project involves harvesting data from cultural institutions which make their collections available via so-called “Web APIs”. There are some very standard ways to publish data, such as OAI-PMH, OpenSearch, SRU, RSS, etc., but many cultural heritage institutions instead offer custom-built APIs that work in their own peculiar way, which means you need to put in a certain amount of effort to learn each API and deal with its specific requirements. So I’ve turned to the problem of how to deal with these APIs in the most generic way possible, and written a program that handles a lot of what is common to most Web APIs, and can be easily configured to understand the specifics of any particular API.

This program, which I’ve called API Harvester, can be configured by giving it a few simple instructions: where to download the data from, how to split up the downloaded data into individual records, where to save the record files, how to name those files, and where to get the next batch of data from (i.e. how to resume the harvest). The API Harvester does have one hard requirement: it is only able to harvest data in XML format, but most of the APIs I’ve seen offered by cultural heritage institutions do provide XML, so I think it’s not a big limitation.

The API Harvester software is open source, and free to use; I hope that other people find it useful, and I’m happy to accept feedback or improvements, or examples of how to use it with specific APIs. I’ve created a wiki page to record example commands for harvesting from a variety of APIs, including OAI-PMH, the Trove API, and an RSS feed from this blog. This wiki page is currently open for editing, so if you use the API Harvester, I encourage you to record the command you use, so other people can benefit from your work. If you have trouble with it, or need a hand, feel free to raise an issue on the GitHub repository, leave a comment here, or contact me on Twitter.

Finally, a brief word on how to use the software: to tell the harvester how to pull a response apart into individual records, and where to download the next page of records from (and the next, and the next…), you give it instructions in the form of “XPath expressions”. XPath is a micro-language for querying XML documents; it lets you refer to elements, attributes, and pieces of text within an XML document, perform basic arithmetic, and manipulate strings of text. XPath is simple yet enormously powerful; if you are planning on doing anything with XML it’s an essential thing to learn, even if only to a very basic level. I’m not going to give a tutorial on XPath here (there are plenty on the web), but below I’ll give an example of querying the Trove API and briefly explain the XPath expressions used in it.
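If you want to see XPath in action on its own first, here is a minimal, self-contained sketch using the XPath engine built into the Java standard library (the same platform the harvester itself runs on). The tiny XML document and the expressions in it are made up purely for illustration, and have nothing to do with any real API:

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class XPathDemo {
    public static void main(String[] args) throws Exception {
        // A tiny made-up document, purely to show the shape of an XPath query
        String xml = "<catalogue><item id='1'>Map of Oceania</item><item id='2'>Pacific atlas</item></catalogue>";
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Select every item element inside the catalogue element
        NodeList items = (NodeList) xpath.evaluate("/catalogue/item",
                new InputSource(new StringReader(xml)), XPathConstants.NODESET);
        for (int i = 0; i < items.getLength(); i++) {
            Node item = items.item(i);
            // "@id" is evaluated relative to the current item element
            String id = xpath.evaluate("@id", item);
            System.out.println(id + ": " + item.getTextContent());
        }
    }
}

Running this prints each item’s id attribute alongside its text content, which is exactly the kind of selection the harvester’s XPath parameters perform on real API responses.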

Here’s the command I would use to harvest metadata about maps, relevant to the word “oceania”, from the Trove API, and save the results in a new folder called “oceania-maps” in my Downloads folder:

java -jar apiharvester.jar
directory="/home/ctuohy/Downloads/oceania-maps"
retries=5
url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full"
url-suffix="&key=XXXXXXX"
records-xpath="/response/zone/records/*"
id-xpath="@url"
resumption-xpath="/response/zone/records/@next"

For legibility, I’ve split the command onto multiple lines, but this is a single command and should be entered on a single line.

Going through the parts of the command in order:

  • The command java launches a Java Virtual Machine to run the harvester application (which is written in the Java language).
  • The next item, -jar, tells Java to run a program that’s been packaged as a “Java Archive” (jar) file.
  • The next item, apiharvester.jar, is the harvester program itself, packaged as a jar file.

The remainder of the command consists of parameters that are passed to the API harvester and control its behaviour.

  • The first parameter, directory="/home/ctuohy/Downloads/oceania-maps", tells the harvester where to save the XML files; it will create this folder if it doesn’t already exist.
  • With the second parameter, retries=5, I’m telling the harvester to retry a download up to 5 times if it fails; Trove’s server can sometimes be a bit flaky at busy times; retrying a few times can save the day.
  • The third parameter, url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full", tells the harvester where to download the first batch of data from. To generate a URL like this, I recommend using Tim Sherratt’s excellent online tool, the Trove API Console.
  • The next parameter url-suffix="&key=XXXXXXX" specifies a suffix that the harvester will append to the end of all the URLs which it requests. Here, I’ve used url-suffix to specify Trove’s “API Key”; a password which each registered Trove API user is given. To get one of these, see the Trove Help Centre. NB XXXXXXX is not my actual API Key.
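Putting the url and url-suffix parameters together, the first request the harvester sends will therefore go to a URL of this form (with a real key in place of XXXXXXX):

http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full&key=XXXXXXX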

The remaining parameters are all XPath expressions. To understand them, it will be helpful to look at the XML content which the Trove API returns in response to that query, and which these XPath expressions apply to.
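I won’t reproduce a real response here, but its overall shape, heavily abridged, is roughly as follows (only the elements and attributes discussed below are shown; a real response carries more attributes, and much more detail inside each record):

<response>
  <zone>
    <records next="…">
      <work id="…" url="…">
        … title and other descriptive metadata …
      </work>
      <work id="…" url="…">
        …
      </work>
    </records>
  </zone>
</response>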

  • The first XPath parameter, records-xpath="/response/zone/records/*", identifies the elements in the XML which constitute the individual records. The XPath /response/zone/records/* describes a path down the hierarchical structure of the XML: the initial / refers to the start of the document, response refers to an element with that name at the “root” of the document, /zone refers to any element called zone within that response element, /records refers to any records element within any of those zone elements, and the final /* refers to any element (with any name) within any of those records elements. In practice, this XPath expression identifies all the work elements in the API’s response, and means that each of these work elements (and its contents) ends up saved in its own file.
  • The next parameter, id-xpath="@url", tells the harvester where to find a unique identifier for the record, to generate a unique file name. This XPath is evaluated relative to the elements identified by the records-xpath; i.e. it gets evaluated once for each record, starting from the record’s work element. The expression @url means “the value of the attribute named url”; the result is that the harvested records are saved in files whose names are derived from these URLs. If you look at the XML, you’ll see I could equally have used the expression @id instead of @url.
  • The final parameter, resumption-xpath="/response/zone/records/@next", tells the harvester where to find a URL (or URLs) from which it can resume the harvest after saving the records from the first response. You’ll see in the Trove API response that the records element has an attribute called next which contains a URL for this purpose. When the harvester evaluates this XPath expression, it gathers up the next URLs and repeats the whole download process for each one. Eventually, the API will respond with a records element which doesn’t have a next attribute (meaning that there are no more records); at that point the XPath expression will evaluate to nothing, the harvester will run out of URLs to harvest, and it will grind to a halt. The sketch after this list illustrates the whole loop.
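To make the overall flow concrete, here is a rough sketch in Java of the loop that these three XPath parameters drive. To be clear, this is not the API Harvester’s actual source code (which handles retries, file naming, and many other details properly); it is just an illustration of the logic, under the assumption that every response is XML and that the next attribute holds a complete URL:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

public class HarvestLoopSketch {
    public static void main(String[] args) throws Exception {
        // These values mirror the command-line parameters described above
        String startUrl = "http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full";
        String urlSuffix = "&key=XXXXXXX";  // appended to every request
        String recordsXPath = "/response/zone/records/*";
        String idXPath = "@url";
        String resumptionXPath = "/response/zone/records/@next";
        File directory = new File("oceania-maps");
        directory.mkdirs();

        DocumentBuilder parser = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        XPath xpath = XPathFactory.newInstance().newXPath();
        Transformer serializer = TransformerFactory.newInstance().newTransformer();

        Deque<String> queue = new ArrayDeque<>();
        queue.add(startUrl);
        while (!queue.isEmpty()) {
            // Download and parse the next page of results (retries omitted for brevity)
            Document response = parser.parse(queue.remove() + urlSuffix);
            // Split the response into individual records
            NodeList records = (NodeList) xpath.evaluate(recordsXPath, response, XPathConstants.NODESET);
            for (int i = 0; i < records.getLength(); i++) {
                Node record = records.item(i);
                // Derive a file name from the record's identifier, evaluated relative to the record;
                // the crude sanitisation here is just for the sketch, not the harvester's real naming scheme
                String id = xpath.evaluate(idXPath, record);
                File file = new File(directory, id.replaceAll("[^A-Za-z0-9]+", "-") + ".xml");
                serializer.transform(new DOMSource(record), new StreamResult(file));
            }
            // Queue up any "next" URLs; when there are none left, the harvest ends
            // (this sketch assumes the next attribute holds a complete URL)
            NodeList next = (NodeList) xpath.evaluate(resumptionXPath, response, XPathConstants.NODESET);
            for (int i = 0; i < next.getLength(); i++) {
                queue.add(next.item(i).getNodeValue());
            }
        }
    }
}

The loop ends naturally because the final response contributes no new URLs to the queue.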

Happy New Year to all my readers! I hope this tool is of use to some of you, and I wish you a productive year of metadata harvesting in 2017!

Oceania
28 December 2016
http://conaltuohy.com/blog/oceania/

I am really excited to have begun my latest project: a Linked Open Data service for online cultural heritage from New Zealand and Australia, and eventually, I hope, from our other neighbours. I have called the service “oceania.digital”.

[Image: Pinkerton map of Australia and New Zealand]

The big idea of oceania.digital is to pull together threads from a number of different “cultural” data sources and weave them into a single web of data which people can use to tell a huge number of stories.

There are a number of different aspects to the project, and a corresponding number of stages to go through…

  • I need to gather the data together from a variety of sources. Both Trove and Digital NZ are doing this at the national level; I want to build on both of those data sources, and gradually add more and more.
  • Having gathered data from my data sources, I need to transform the harvested data into an interoperable form, namely the World Wide Web Consortium’s “Resource Description Framework” (RDF). The metaphor I suggest is that of teasing out threads from the raw data, so that the threads from one dataset can later be interwoven with those from another. This is the vision of the Semantic Web.
  • Having converted the data to RDF, I need to weave the threads together so that the data harvested from the different sources is explicitly linked to data from other sources. This means identifying where the same things (people, places, etc.) are described in the different data sources, and explicitly equating or merging those things (there is a small illustration of what such a link looks like after this list). This is related to what librarians call “Authority Control”.
  • Finally, having produced a web of interconnected data, I need to make it practically useful to a wide range of people, not just Semantic Web nerds like me. I will need to build, curate, and inspire the development of new tools that will help end-users to tell stories using the RDF dataset. Most people won’t be programming with RDF themselves, and they won’t be excited by JSON-LD or SPARQL; they will need user-friendly software tools that allow them to summon up the data they need, with a minimum of technical geekery, and to use it to produce visualisations, links, images, maps, and timelines, which they can embed on their blogs and websites, in Facebook, Twitter, and other social media.
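To make that weaving step a little more concrete: in RDF, asserting that two descriptions refer to the same person can be as simple as one statement using the standard owl:sameAs property. Here is a minimal sketch in Turtle notation, where the URIs are invented placeholders rather than real identifiers from either service:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

# Two hypothetical identifiers, one from each data source, describing the same person
<http://example.org/people-australia/person/12345>
    owl:sameAs <http://example.org/digitalnz/person/67890> .

Once links like this are in the data store, a query can follow them and combine what each source says about the same person.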

So far, I have set up a website, with some harvesting software and an RDF data store.

The first dataset I intend to process is “People Australia”: a collection of biographical data aggregated from a variety of Australian sources and published by the National Library of Australia. Hopefully soon after that I will be able to add a related dataset from New Zealand.

Once I have some data available in RDF form, I will add some features to allow the data to be reused on other websites, then I’ll go back and add more datasets from elsewhere, and repeat the process.

If you’d like to keep in touch with the project as it progresses, you can follow the @OceaniaDigital account on Twitter, or follow my blog.

If you think you’d like to contribute to the project in any way, please do get in touch, either via Twitter or email!
