Before the internet, before TV, before radio, newspapers ruled. There were literally hundreds of newspapers, published in towns and cities all over Australia, and they carried the daily life of Australians in all its petty detail. For historians, newspapers were a diamond mine; the information content was hugely valuable; the hard part was all the digging you had to do. It used to be that you would have to go to a library where a newspaper collection was held, and search manually through text on paper or microfiche. You had to be prepared to put in a lot of hard slog.
But then everything changed. A humanities researcher once told me that for Australian researchers, the National Library of Australia’s “Trove” newspaper archive marked a radical break: “There was a Before Trove, and an After Trove”.
Discovering newspaper articles has never been easier, for both professional and amateur researchers. These days researchers can search through many millions of newspaper articles in a few seconds, from the comfort of their own web browser. Enter your query, retrieve a list of the top 20 hits, click through to read any of them, click for another 20 hits, and so on.
From “resource discovery” to “distant reading”
However, there’s no pleasing some people. Some researchers want to go beyond merely using computers to help them find newspaper articles to read (what librarians call “resource discovery”); researchers also want to use computers to help them read those articles.
If you search and discover an article, a few articles, or even a few dozen articles, you can read them yourself and take whatever notes you need. But what if you want to read hundreds, thousands, or hundreds of thousands of articles? What if you wanted to analyze the entire corpus of Trove articles? That’s just not humanly possible. Of course, a computer can’t have quite the same “understanding” of a set of newspaper articles as a human being can, but it has the advantage that it can “read” in minutes or hours what would take a human years, or centuries, to wade through. There are a number of techniques for this kind of bulk machine-reading of text, which Franco Moretti called “distant reading”.
Actually this is a part of the so-called “big data” trend in research generally, in which computers are used to find useful information by crunching up vast amounts of data. In humanities research, “big data” generally (though not always) means large corpora of digitized text.
Harvesting text from Trove
At some stage, I have no doubt that the entire Trove corpus will become available to anyone who wants it, but as of this moment, it’s not entirely straightforward to acquire articles in bulk. The Trove programming interface allows for articles to be downloaded, but it needs a little help, because it was primarily intended as a means to discover individual resources, rather than a bulk data exchange mechanism.
A couple of months ago I was looking at the Trove API, the Application Programming Interface that Trove uses to provide automated access to its data. I was struck at the time by how similar it was to another mechanism which libraries use for bulk data exchange (e.g. of library catalogues), namely the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
OAI-PMH is a protocol (i.e. a set of conversational conventions) for machines to send each other large batches of structured information in a variety of formats. The protocol allows a requester (called a “harvester”) to ask a repository (called a “provider”) for data records in a particular format, and updated since a particular date. The provider then responds with a list of records. If there are too many records to send in a single list (more than a few Megabytes of data is typically considered too large), then the provider will return a partial list, and also include a bookmark (called a “resumption token”) to mark how far through the full list it has got. Whenever the harvester receives a resumption token, it knows it can issue another request, handing the resumption token back to the provider, and receive in response another partial list, starting from where the previous list left off, along with yet another resumption token. At any point, the provider can let the harvester know that the list is finished, by returning an empty resumption token. In this way, very large batches of data can be transferred in bite-sized chunks. Typically, libraries harvest data overnight, retrieving just the records which have been created or updated since their harvest the previous night.
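The request-and-resume conversation described above can be sketched in a few lines of Python. This is only an illustration of the harvester's side of the protocol, not the code of any particular harvester; the function names `harvest` and `fetch_url` are made up for this example, and the `fetch` argument exists only so the loop can be tried out without a network connection.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# XML namespace used by every OAI-PMH 2.0 response.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_url(url):
    """Default fetcher: perform an HTTP GET and return the response body."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def harvest(base_url, metadata_prefix, set_spec=None, fetch=fetch_url):
    """Yield every <record> element, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    while True:
        url = base_url + "?" + urllib.parse.urlencode(params)
        root = ET.fromstring(fetch(url))
        list_records = root.find(OAI_NS + "ListRecords")
        if list_records is None:
            return  # an error response, or no matching records
        yield from list_records.findall(OAI_NS + "record")
        token = list_records.find(OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # absent or empty token: the list is complete
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords",
                  "resumptionToken": token.text.strip()}
```

Calling `harvest(base_url, metadata_prefix)` with the base URL of a provider and a format it supports returns a generator, so records arrive one partial list at a time and the full batch never has to fit in memory.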
On reflection, I realized that if Trove could just support OAI-PMH, then it would be possible to download articles from it in bulk, using any of a number of existing OAI-PMH programs. I wouldn’t need to write software to perform a harvest; I only needed to translate between the OAI-PMH protocol and the Trove API. An easy win!
At this point I sat down and knocked out some code, which I called Retailer because it’s a kind of shop-front with Trove as a warehouse behind it. Retailer is able to act as an OAI-PMH provider by translating any OAI-PMH requests it receives into Trove-style requests which it forwards on to the Trove API. Then it takes the response from Trove, formats it as an OAI-PMH response, and returns that.
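To make the idea of translation concrete, here is a rough Python sketch of how an OAI-PMH request might be mapped onto a paged search request. It is not Retailer's actual code (Retailer is a Java Servlet). The endpoint and the parameter names `key`, `zone`, `q`, `s`, `n` and `encoding` are my assumptions about the Trove API, and the `offset:query` resumption token format, the page size and both function names are invented purely for illustration.

```python
import urllib.parse

TROVE_API = "https://api.trove.nla.gov.au/result"  # assumed endpoint
PAGE_SIZE = 20  # assumed number of results per request


def to_trove_url(oai_args, trove_key):
    """Translate OAI-PMH ListRecords arguments into a search URL."""
    if oai_args.get("verb") != "ListRecords":
        raise ValueError("this sketch only handles ListRecords")
    token = oai_args.get("resumptionToken")
    if token:
        # Hypothetical token format: "<offset>:<query>"
        offset_text, query = token.split(":", 1)
        offset = int(offset_text)
    else:
        # First request: the OAI-PMH set carries the search query.
        offset, query = 0, oai_args.get("set", "")
    trove_args = {
        "key": trove_key,
        "zone": "newspaper",
        "q": query,
        "s": offset,      # where this page of results starts
        "n": PAGE_SIZE,   # how many results to return
        "encoding": "xml",
    }
    return TROVE_API + "?" + urllib.parse.urlencode(trove_args)


def next_token(offset, query, total_results):
    """Return the token for the next page, or "" when the list is done."""
    following = offset + PAGE_SIZE
    if following >= total_results:
        return ""  # an empty token tells the harvester to stop
    return "%d:%s" % (following, query)
```

The point of the design is that the resumption token carries everything needed to rebuild the next search request, so the translator does not have to remember anything between requests.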
Because the Trove corpus is so large, it’s not really feasible to harvest the entire thing. So I’ve made it possible to specify a search query, to limit the results to something more manageable. In OAI-PMH, a repository can divide its records up into distinct groups called “sets”, and with Retailer you can use a Trove query as one of these sets.
How to use it
Let’s go step by step through how to use Retailer to harvest from Trove. For now, I’m going to detail how to do this using Linux. I hope to add instructions for Windows and MacOS later.
First you will need to register with the National Library and obtain an API key. This is a kind of random-looking password which you must provide to the Trove API with every request you make, and hence you have to tell Retailer what it is.
Next you will need to prepare for installing Retailer. Retailer is a Java Servlet, which means you must first have installed Java and a Java web server such as Apache Tomcat. On a Debian-based Linux, this is as simple as:
sudo apt-get install tomcat7
Now you can install Retailer itself. Download the retailer.war file and save it somewhere Tomcat can read it, e.g.
sudo mkdir -p /var/lib/retailer
sudo cp retailer.war /var/lib/retailer/retailer.war
To install it into Tomcat, you will need to create a file called retailer.xml in Tomcat’s configuration folder /var/lib/tomcat7/conf/Catalina/localhost with the following content. Copy and paste this and edit the trove key.
<Context path="/retailer" docBase="/var/lib/retailer/retailer.war" antiResourceLocking="false">
  <Parameter name="trove-key" value="your-key-here" override="false"/>
</Context>
This is so that Retailer knows your Trove API key.
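It is easy to break this file with a stray character or to forget to replace the placeholder key. As an optional sanity check before restarting Tomcat, a few lines of Python can confirm that the file is well-formed and that the key has been filled in. The function name `read_trove_key` is made up for this sketch; the element and attribute names are the ones in the snippet above.

```python
import xml.etree.ElementTree as ET


def read_trove_key(context_xml):
    """Return the trove-key value from a Tomcat context file's text."""
    root = ET.fromstring(context_xml)  # raises ParseError if malformed
    if root.tag != "Context":
        raise ValueError("root element should be <Context>")
    for parameter in root.findall("Parameter"):
        if parameter.get("name") == "trove-key":
            value = parameter.get("value", "")
            if not value or value == "your-key-here":
                raise ValueError("trove-key still holds the placeholder")
            return value
    raise ValueError("no trove-key parameter found")
```

To use it, read the contents of your retailer.xml into a string and pass that string to `read_trove_key`; it either returns the key or raises an error saying what is wrong.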
Next you will need an OAI-PMH harvester. There are several of these available. I am going to use jOAI, which has the advantage that it’s also a Java Servlet. Download the jOAI zip file, unzip it, find the oai.war file, and copy it into the webapps folder of Tomcat.
sudo cp joai_v126.96.36.199/oai.war /var/lib/tomcat7/webapps/
At this point, you should have all you need to start harvesting. Open your browser and navigate to http://localhost:8080/oai/admin/harvester.do – you should be talking to your jOAI harvester. Here’s where you set up your harvester. Click the “Add new harvest” button and fill out the form.
|Repository Name||Call it what you like, but “Trove” is an obvious name|
|Repository Base URL||This is the address of Retailer. Remember the harvester talks to Retailer, and behind the scenes Retailer is talking to Trove|
|SetSpec||Enter a search query here. Don’t leave the SetSpec blank or jOAI will attempt to harvest all of Trove, which may take many months.|
|Metadata Prefix||This is where you specify what format you want to retrieve data in. The most useful value here is definitely
Save the harvest configuration, and run it by clicking “All” under the heading “Manually Harvest”. After a few minutes, jOAI should have downloaded your articles, and it will point you to the folder where they can be found.
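Once the harvest folder is full, the “distant reading” can begin. As a very rough first step, the Python sketch below counts word frequencies across every harvested file. It assumes the records are saved as individual files ending in .xml, and it simply pools all the text in each file without knowing anything about the record format, so treat it as a starting point rather than a finished analysis. The function name `word_counts` is made up for this example.

```python
import collections
import pathlib
import re
import xml.etree.ElementTree as ET


def word_counts(harvest_dir):
    """Count lower-cased words in every .xml file under harvest_dir."""
    counts = collections.Counter()
    for path in sorted(pathlib.Path(harvest_dir).rglob("*.xml")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            continue  # skip files that are not well-formed XML
        text = " ".join(root.itertext())
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts
```

Calling `word_counts(folder).most_common(20)` on the folder jOAI reports gives the twenty most frequent words in the harvest.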