trove – Conal Tuohy's blog

A tool for Web API harvesting

Conal — Sat, 31 Dec 2016 05:31:05 +0000

A medieval man harvesting metadata from a medieval Web API

As 2016 stumbles to an end, I’ve put in a few days’ work on my new project Oceania, which is to be a Linked Data service for cultural heritage in this part of the world. Part of this project involves harvesting data from cultural institutions which make their collections available via so-called “Web APIs”. There are some very standard ways to publish data, such as OAI-PMH, OpenSearch, SRU, RSS, etc, but many cultural heritage institutions instead offer custom-built APIs that work in their own peculiar way, which means that you need to put in a certain amount of effort in learning each API and dealing with its specific requirements. So I’ve turned to the problem of how to deal with these APIs in the most generic way possible, and written a program that can handle a lot of what is common in most Web APIs, and can be easily configured to understand the specifics of particular APIs.

This program, which I’ve called API Harvester, can be configured by giving it a few simple instructions: where to download the data from, how to split up the downloaded data into individual records, where to save the record files, how to name those files, and where to get the next batch of data from (i.e. how to resume the harvest). The API Harvester does have one hard requirement: it is only able to harvest data in XML format, but most of the APIs I’ve seen offered by cultural heritage institutions do provide XML, so I think it’s not a big limitation.

The API Harvester software is open source, and free to use; I hope that other people find it useful, and I’m happy to accept feedback or improvements, or examples of how to use it with specific APIs. I’ve created a wiki page to record example commands for harvesting from a variety of APIs, including OAI-PMH, the Trove API, and an RSS feed from this blog. This wiki page is currently open for editing, so if you use the API Harvester, I encourage you to record the command you use, so other people can benefit from your work. If you have trouble with it, or need a hand, feel free to raise an issue on the GitHub repository, leave a comment here, or contact me on Twitter.

Finally, a brief word on how to use the software: to tell the harvester how to pull a response apart into individual records, and where to download the next page of records from (and the next, and the next…), you give it instructions in the form of “XPath expressions”. XPath is a micro-language for querying XML documents; it allows you to refer to elements and attributes and pieces of text within an XML document, to perform basic arithmetic and manipulate strings of text. XPath is simple yet enormously powerful; if you are planning on doing anything with XML it’s an essential thing to learn, even if only to a very basic level. I’m not going to give a tutorial on XPath here (there are plenty on the web), but I’ll give an example of querying the Trove API, and briefly explain the XPath expressions used in that examples:

Here’s the command I would use to harvest metadata about maps, relevant to the word “oceania”, from the Trove API, and save the results in a new folder called “oceania-maps” in my Downloads folder:

java -jar apiharvester.jar directory="/home/ctuohy/Downloads/oceania-maps" retries=5 url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full" url-suffix="&key=XXXXXXX" records-xpath="/response/zone/records/*" id-xpath="@url" resumption-xpath="/response/zone/records/@next"

For legibility, I’ve split the command onto multiple lines, but this is a single command and should be entered on a single line.

Going through the parts of the command in order:

The command java launches a Java Virtual Machine to run the harvester application (which is written in the Java language).
The next item, -jar, tells Java to run a program that’s been packaged as a “Java Archive” (jar) file.
The next item, apiharvester.jar, is the harvester program itself, packaged as a jar file.

The remainder of the command consists of parameters that are passed to the API harvester and control its behaviour.

The first parameter, directory="/home/ctuohy/Downloads/oceania-maps", tells the harvester where to save the XML files; it will create this folder if it doesn’t already exist.
With the second parameter, retries=5, I’m telling the harvester to retry a download up to 5 times if it fails; Trove’s server can sometimes be a bit flaky at busy times; retrying a few times can save the day.
The third parameter, url="http://api.trove.nla.gov.au/result?q=oceania&zone=map&reclevel=full", tells the harvester where to download the first batch of data from. To generate a URL like this, I recommend using Tim Sherratt’s excellent online tool, the Trove API Console.
The next parameter url-suffix="&key=XXXXXXX" specifies a suffix that the harvester will append to the end of all the URLs which it requests. Here, I’ve used url-suffix to specify Trove’s “API Key”; a password which each registered Trove API user is given. To get one of these, see the Trove Help Centre. NB XXXXXXX is not my actual API Key.

The remaining parameters are all XPath expressions. To understand them, it will be helpful to look at the XML content which the Trove API returns in response to that query, and which these XPath expressions apply to.

The first XPath parameter, records-xpath="/response/zone/records/*", identifies the elements in the XML which constitute the individual records. The XPath /response/zone/records/* describes a path down the hierarchical structure of the XML: the initial / refers to the start of the document, the response refers to an element with that name at the “root” of the document, then /zone refers to any element called zone within that response element, then /records refers to any records within any of those response elements, and the final /* refers to any elements (with any name) within any of of those response elements. In practice, this XPath expression identifies all the work elements in the API’s response, and means that each of these work elements (and its contents) ends up saved in its own file.
The next parameter, id-xpath="@url" tells the harvester where to find a unique identifier for the record, to generate a unique file name. This XPath is evaluated relative to the elements identified by the records-xpath; i.e. it gets evaluated once for each record, starting from the record’s work element. The expression @url means “the value of the attribute named url”; the result is that the harvested records are saved in files whose names are derived from these URLs. If you look at the XML, you’ll see I could equally have used the expression @id instead of @url.
The final parameter, resumption-xpath="/response/zone/records/@next", tells the harvester where to find a URL (or URLs) from which it can resume the harvest, after saving the records from the first response. You’ll see in the Trove API response that the records element has an attribute called next which contains a URL for this purpose. When the harvester evaluates this XPath expression, it gathers up the next URLs and repeats the whole download process again for each one. Eventually, the API will respond with a records element which doesn’t have a next attribute (meaning that there are no more records). At that point, the XPath expression will evaluate to nothing, and the harvester will run out of URLs to harvest, and grind to a halt.

Happy New Year to all my readers! I hope this tool is of use to some of you, and I wish you a productive year of metadata harvesting in 2017!

How to download bulk newspaper articles from Papers Past

Conal — Sun, 14 Sep 2014 10:25:22 +0000

Anyone interested in New Zealand history should already know about the amazing Papers Past website provided by the National Library of New Zealand, where you can read search and browse millions of historical newspaper articles and advertisements from New Zealand.

You may also know about Digital New Zealand, which provides a central point for access to New Zealand’s digital culture.

This post is about using Digital NZ and Papers Past to get access, in bulk, to newspaper articles, in a form which is then open to being “crunched” with sophisticated “data mining” software, able to automatically discover patterns in the articles. In my earlier post How to download bulk newspaper articles from Trove, I wrote:

Some researchers want to go beyond merely using computers to help them find newspaper articles to read (what librarians call “resource discovery”); researchers also want to use computers to help them read those articles.

To use that kind of “data mining” software, you first need to have direct access to your data, and that can mean having to download thousands and thousands of articles. It’s not at all obvious how to do that, but in fact it can be made quite easy, as I will show.

First, though, a brisk stroll down memory lane…

The history of Papers Past

The Papers Past website can trace its origins back to its pre-digital ancestor the National Newspaper Collection, an archive of paper and microfilm. But the website has its own history, with two distinct phases. The original Papers Past was essentially just a collection of giant images of newspaper pages, organized by newspaper title and by date. It was a fabulous resource, but the tricky thing was knowing where to look.

Then in 2007 Papers Past was reincarnated in a much more sophisticated form, featuring full text search for individual articles.

This crucial improvement in usability came about through converting the page images to text. The Library extracted the full text from each newspaper page image using Optical Character Recognition software, which they had primed with lists of NZ words including placenames and Māori familial names, in order to more reliably recognize these words. Finally they had every headline manually checked and edited for accuracy. The current website is built around an index of all that text, linked to page images. By searching that index, you can retrieve a list of articles that might interest you, pick out a selection, and actually read them.

It’s notable that each of these development stages delivered a new way for researchers to work with the archive. The original website expanded the accessibility of the archive by exposing it to the entire internet, and the modern version of the website dramatically improved the discoverability of that information by allowing researchers to search within full text.

Since then there’s been a third change in the digital environment around Papers Past — where it has become linked up to a broader system called Digital New Zealand — and it seems to me that this change opens up yet another new way for researchers to engage with the archive.

Digital New Zealand

Digital New Zealand is an aggregation of information about NZ cultural items; books, movies, posters, art works, newspapers, and more, drawn from the catalogues and databases of galleries, libraries, museums, and other institutions. There amongst the contributors to Digital New Zealand is our friend Papers Past.

Digital New Zealand is more than just a website, though; its core offering is a so-called “Application Programming Interface”, or API, providing third-party programmers (i.e. people like me) with a way to access the data contained in Digital NZ. Using the Digital NZ API we can search, download, and even create new information resources.

The Digital New Zealand API is a custom-built thing, but functionally it’s not too different to many other APIs in use on the internet. In particular, some parts of it are very similar to a standard mechanism called OAI-PMH, which has been used by libraries and archives for over a decade.

The Open Archives Initiatives Protocol for Metadata Harvesting

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol (i.e. a set of conversational conventions, but in this case it’s essentially synonymous with “API”) for machines to send each other large batches of structured information (e.g. library catalogue records) in a variety of formats.

There are two roles defined in the protocol; the “provider” which manages and publishes a database of records, and a “harvester”, which periodically asks the provider for updates. Over the years a large number of software applications have been produced to function either as OAI-PMH providers or harvesters (or both). Could Papers Past be made implement this protocol (as a provider)? If so, then we could access Papers Past using any of the existing harvester applications.

Retailer

I’ve been a fan of OAI-PMH from way back, and I’ve used it many times. It had struck me that if I built a “gateway” which translated services like Digital NZ into OAI-PMH, then historical researchers could use existing OAI-PMH software to download the articles they need. So I sat down and wrote this gateway software, and I gave it the name “Retailer” because it’s a kind of shop-front, providing “retail” access to data provided by a “wholesale” data provider such as Digital NZ.

The way Retailer works is that it is a generic framework for translating between one API and another. It could be used to translate all sorts of protocols. Each translation is effected by writing a specific script that defines the equivalences between a specific pair of APIs. Having written the generic Retailer, I then wrote some translation scripts: first I wrote a script to translate between OAI-PMH and the API of the National Library of Australia’s “Trove” service, and then I wrote a script to translate between OAI-PMH and the API of Digital NZ.

I have bundled the core “Retailer” software with both these scripts, so you can harvest newspaper articles from both the NZ and Australian newspaper archives.

Installation

First, you will need to register with Digital New Zealand to obtain an “API Key”. This is a random password which they assign to you. Whenever you ask a question of the Digital NZ API, you must provide this API Key with your request, so that they know who you are. Since Retailer will be interacting with Digital NZ on your behalf, you will need to let Retailer know what your key is.

Then you need to set up a Java web server, such as Apache Tomcat. This provides the platform on which Retailer runs.

Once you have set up Tomcat, you can install Retailer in it, and configure it to run the Papers Past OAI-PMH provider script papers-past.xsl.

Finally you need to set up an OAI-PMH harvester. I’ve been using jOAI, which I can recommend heartily. It has the advantage that, like Retailer, it is a Java Servlet, so you can install it inside Tomcat, alongside Retailer itself.

Now you are all set to harvest newspaper articles!

Harvesting

To harvest, you should first go to Digital NZ and do some searching until you have a search query that returns you the results you want to harvest. Click through a selection of results to check that they are relevant; if necessary, refine your query to exclude articles you are not interested in.

Once you have decided on a good query, navigate your browser to your jOAI harvester (e.g. to http://localhost:8080/oai/) and add a new harvest. Set the harvest’s setSpec to be search: followed by your query, e.g. search:new zealand wars. Then run the harvest to actually download the search results. For details, see the Retailer documentation.

Why not give it a try? Any problems, leave a comment here or as an issue on the Retailer github site!

How to download bulk newspaper articles from Trove

Conal — Fri, 15 Aug 2014 07:50:54 +0000

Before the internet, before TV, before radio, newspapers ruled. There were literally hundreds of newspapers, published in towns and cities all over Australia, and they carried the daily life of Australians in all its petty detail. For historians, newspapers were a diamond mine; the information content was hugely valuable; the hard part was all the digging you had to do. It used to be that you would have to go to a library where a newspaper collection was held, and search manually through text on paper or microfiche. You had to be prepared to put in a lot of hard slog.

But then everything changed. A humanities researcher once told me that for Australian researchers, the National Library’s of Australia’s “Trove” newspaper archive marked a radical break: “There was a Before Trove, and an After Trove”.

Discovering newspaper articles has never been easier, for both professional and amateur researchers. These days researchers can search through many millions of newspaper articles in a few seconds, from the comfort of their own web browser. Enter your query, retrieve a list of the top 20 hits, click through to read any of them, click for another 20 hits, and so on.

From “resource discovery” to “distant reading”

However, there’s no pleasing some people. Some researchers want to go beyond merely using computers to help them find newspaper articles to read (what librarians call “resource discovery”); researchers also want to use computers to help them read those articles.

If you search and discover an article; a few articles; even a few dozen articles, you can read them yourself and take whatever notes you need. But what if you want to read hundreds, thousands, or hundreds of thousands of articles? What if you wanted to analyze the entire corpus of Trove articles? That’s just not humanly possible. Of course, a computer can’t have quite the same “understanding” of a set of newspaper articles as a human being can, but it has the advantage that it can “read” in minutes or hours, what would take a human years, or centuries, to wade through. There are a number of techniques for bulk machine-reading of text; what Franco Moretti called “distant reading”.

Actually this is a part of the so-called “big data” trend in research generally, in which computers are used to find useful information by crunching up vast amounts of data. In humanities research, “big data” generally (though not always) means large corpora of digitized text.

Harvesting text from Trove

At some stage, I have no doubt that the entire Trove corpus will become available to anyone who wants it, but as of this moment, it’s not entirely straightforward to acquire articles in bulk. The Trove programming interface allows for articles to be downloaded, but it needs a little help, because it was primarily intended as a means to discover individual resources, rather than a bulk data exchange mechanism.

A couple of months ago I was looking at the Trove API; the Application Programming Interface that Trove uses to provide automated access to their data. I was struck at the time by how similar it was to another mechanism which libraries use for bulk data exchange (e.g. of library catalogues), namely the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).

OAI-PMH is a protocol (i.e. a set of conversational conventions) for machines to send each other large batches of structured information in a variety of formats. The protocol allows a requester (called a “harvester”) to ask a repository (called a “provider”) for data records in a particular format, and updated since a particular date. The provider then responds with a list of records. If there are too many records to send in a single list (more than a few Megabytes of data is typically considered too large), then the provider will return a partial list, and also include a bookmark (called a “resumption token”) to mark how far through the full list it has got. Whenever the harvester receives a resumption token, it knows it can issue another request, handing the resumption token back to the provider, and receive in response another partial list, starting from where the previous list left off, along with yet another resumption token. At any point, the provider can let the harvester know that the list is finished, by returning an empty resumption token. In this way, very large batches of data can be transferred in bite-sized chunks. Typically, libraries harvest data overnight, retrieving just the records which have been created or updated since their harvest the previous night.

On reflection, I realized that if Trove could just support OAI-PMH, then it would be possible to download articles from it in bulk, using any of a number of existing OAI-PMH programs. I wouldn’t need to write software to perform a harvest; I only needed to translate between the OAI-PMH protocol and the Trove API. An easy win!

Retailer

At this point I sat down and knocked out some code, which I called Retailer because it’s a kind of shop-front with Trove as a warehouse behind it. Retailer is able to act as an OAI-PMH provider by translating any OAI-PMH requests it receives into Trove-style requests which it forwards on to the Trove API. Then it takes the response from Trove, formats it as an OAI-PMH response, and returns that.

Because the Trove corpus is so large, it’s not really feasible to harvest the entire thing. So I’ve made it possible to specify a search query, to limit the results to something more manageable. In OAI-PMH, a repository can divide its records up into so-called distinct sets, and with Retailer you can use a Trove query as one of these sets.

How to use it

Let’s go step by step through how to use Retailer to harvest from Trove. For now, I’m going to detail how to do this using Linux. I hope later I will add instructions for Windows and MacOS.

First you will need to register with the National Library and obtain an API key. This is a kind of random looking password which you must provide to the Trove API with every request you make, and hence you have to tell Retailer what it is.

Next you will need to install Retailer. Retailer is a Java Servlet, which means you must first have installed Java and a Java Web Server such as Apache Tomcat. On a Debian-based Linux, this is as simple as:
sudo apt-get install tomcat7

Now you will install Retailer. Download the retailer.war file and save it e.g.
mkdir /var/lib/retailer cp retailer.war /var/lib/retailer/retailer.war

To install it into Tomcat, you will need to create a file called retailer.xml in Tomcat’s configuration folder /var/lib/tomcat7/conf/Catalina/localhost with the following content. Copy and paste this and edit the trove key.

 path="/retailer" 
    docBase="/var/lib/retailer/retailer.war"
    antiResourceLocking="false">
   name="trove-key" value="your-key-here" override="false"/>

This is so that Retailer knows your Trove API key.

Next you will need an OAI-PMH harvester. There are several of these available. I am going to use jOAI, which has the advantage that it’s also a Java Servlet. Download the jOAI zip file, unzip it, find the oai.war file, and copy it into the webapps folder of Tomcat.
unzip joai_v3.1.1.3.zip sudo cp joai_v3.1.1.3/oai.war /var/lib/tomcat7/webapps/

At this point, you should have all you need to start harvesting. Open your browser and navigate to http://localhost:8080/oai/admin/harvester.do – you should be talking to your jOAI harvester. Here’s where you set up your harvester. Click the “Add new harvest” button and fill out the form.

Field	Example value	Notes
Repository Name	`Trove`	Call it what you like, but “Trove” is an obvious name
Repository Base URL	`http://localhost:8080/retailer/`	This is the address of Retailer. Remember the harvester talks to Retailer, and behind the scenes Retailer is talking to Trove
SetSpec	`search:annular solar eclipse`	Enter a search query here. Don’t leave the SetSpec blank or jOAI will attempt to harvest all of Trove, which may take many months.
Metadata format	`html`	This is where you specify what format you want to retrieve data in. The most useful value here is definitely `html` (which retrieves the articles as web pages), but you can also use `oai_dc` (which retrieves only certain metadata, such as the article title and URL, in an XML format), and `trove` (which retrieves the articles in a similar format to that returned by the Trove API, with minimal reformatting).

Save the harvest configuration, and run it by clicking “All” under the heading “Manually Harvest”. After a few minutes, jOAI should have downloaded your web pages and point you at the folder where they can be found.