A tool for Web API harvesting

A medieval man harvesting metadata from a medieval Web API

As 2016 stumbles to an end, I’ve put in a few days’ work on my new project Oceania, which is to be a Linked Data service for cultural heritage in this part of the world. Part of this project involves harvesting data from cultural institutions which make their collections available via so-called “Web APIs”. There are some very standard ways to publish data, such as OAI-PMH, OpenSearch, SRU, RSS, etc, but many cultural heritage institutions instead offer custom-built APIs that work in their own peculiar way, which means that you need to put in a certain amount of effort in learning each API and dealing with its specific requirements. So I’ve turned to the problem of how to deal with these APIs in the most generic way possible, and written a program that can handle a lot of what is common in most Web APIs, and can be easily configured to understand the specifics of particular APIs.

This program, which I’ve called API Harvester, can be configured by giving it a few simple instructions: where to download the data from, how to split up the downloaded data into individual records, where to save the record files, how to name those files, and where to get the next batch of data from (i.e. how to resume the harvest). The API Harvester does have one hard requirement: it is only able to harvest data in XML format, but most of the APIs I’ve seen offered by cultural heritage institutions do provide XML, so I think it’s not a big limitation.

The API Harvester software is open source, and free to use; I hope that other people find it useful, and I’m happy to accept feedback or improvements, or examples of how to use it with specific APIs. I’ve created a wiki page to record example commands for harvesting from a variety of APIs, including OAI-PMH, the Trove API, and an RSS feed from this blog. This wiki page is currently open for editing, so if you use the API Harvester, I encourage you to record the command you use, so other people can benefit from your work. If you have trouble with it, or need a hand, feel free to raise an issue on the GitHub repository, leave a comment here, or contact me on Twitter.

Finally, a brief word on how to use the software: to tell the harvester how to pull a response apart into individual records, and where to download the next page of records from (and the next, and the next…), you give it instructions in the form of “XPath expressions”. XPath is a micro-language for querying XML documents; it allows you to refer to elements and attributes and pieces of text within an XML document, to perform basic arithmetic and manipulate strings of text. XPath is simple yet enormously powerful; if you are planning on doing anything with XML it’s an essential thing to learn, even if only to a very basic level. I’m not going to give a tutorial on XPath here (there are plenty on the web), but I’ll give an example of querying the Trove API, and briefly explain the XPath expressions used in that examples:

Here’s the command I would use to harvest metadata about maps, relevant to the word “oceania”, from the Trove API, and save the results in a new folder called “oceania-maps” in my Downloads folder:

java -jar apiharvester.jar
directory="/home/ctuohy/Downloads/oceania-maps"
retries=5
url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full"
url-suffix="&key=XXXXXXX"
records-xpath="/response/zone/records/*"
id-xpath="@url"
resumption-xpath="/response/zone/records/@next"

For legibility, I’ve split the command onto multiple lines, but this is a single command and should be entered on a single line.

Going through the parts of the command in order:

  • The command java launches a Java Virtual Machine to run the harvester application (which is written in the Java language).
  • The next item, -jar, tells Java to run a program that’s been packaged as a “Java Archive” (jar) file.
  • The next item, apiharvester.jar, is the harvester program itself, packaged as a jar file.

The remainder of the command consists of parameters that are passed to the API harvester and control its behaviour.

  • The first parameter, directory="/home/ctuohy/Downloads/oceania-maps", tells the harvester where to save the XML files; it will create this folder if it doesn’t already exist.
  • With the second parameter, retries=5, I’m telling the harvester to retry a download up to 5 times if it fails; Trove’s server can sometimes be a bit flaky at busy times; retrying a few times can save the day.
  • The third parameter, url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full", tells the harvester where to download the first batch of data from. To generate a URL like this, I recommend using Tim Sherratt’s excellent online tool, the Trove API Console.
  • The next parameter url-suffix="&key=XXXXXXX" specifies a suffix that the harvester will append to the end of all the URLs which it requests. Here, I’ve used url-suffix to specify Trove’s “API Key”; a password which each registered Trove API user is given. To get one of these, see the Trove Help Centre. NB XXXXXXX is not my actual API Key.

The remaining parameters are all XPath expressions. To understand them, it will be helpful to look at the XML content which the Trove API returns in response to that query, and which these XPath expressions apply to.

  • The first XPath parameter, records-xpath="/response/zone/records/*", identifies the elements in the XML which constitute the individual records. The XPath /response/zone/records/* describes a path down the hierarchical structure of the XML: the initial / refers to the start of the document, the response refers to an element with that name at the “root” of the document, then /zone refers to any element called zone within that response element, then /records refers to any records within any of those response elements, and the final /* refers to any elements (with any name) within any of of those response elements. In practice, this XPath expression identifies all the work elements in the API’s response, and means that each of these work elements (and its contents) ends up saved in its own file.
  • The next parameter, id-xpath="@url" tells the harvester where to find a unique identifier for the record, to generate a unique file name. This XPath is evaluated relative to the elements identified by the records-xpath; i.e. it gets evaluated once for each record, starting from the record’s work element. The expression @url means “the value of the attribute named url”; the result is that the harvested records are saved in files whose names are derived from these URLs. If you look at the XML, you’ll see I could equally have used the expression @id instead of @url.
  • The final parameter, resumption-xpath="/response/zone/records/@next", tells the harvester where to find a URL (or URLs) from which it can resume the harvest, after saving the records from the first response. You’ll see in the Trove API response that the records element has an attribute called next which contains a URL for this purpose. When the harvester evaluates this XPath expression, it gathers up the next URLs and repeats the whole download process again for each one. Eventually, the API will respond with a records element which doesn’t have a next attribute (meaning that there are no more records). At that point, the XPath expression will evaluate to nothing, and the harvester will run out of URLs to harvest, and grind to a halt.

Happy New Year to all my readers! I hope this tool is of use to some of you, and I wish you a productive year of metadata harvesting in 2017!

Make a comment