Anyone interested in New Zealand history should already know about the amazing Papers Past website provided by the National Library of New Zealand, where you can read, search, and browse millions of historical newspaper articles and advertisements from New Zealand.
You may also know about Digital New Zealand, which provides a central point for access to New Zealand’s digital culture.
This post is about using Digital NZ and Papers Past to get access, in bulk, to newspaper articles, in a form which is then open to being “crunched” with sophisticated “data mining” software, able to automatically discover patterns in the articles. In my earlier post How to download bulk newspaper articles from Trove, I wrote:
Some researchers want to go beyond merely using computers to help them find newspaper articles to read (what librarians call “resource discovery”); researchers also want to use computers to help them read those articles.
To use that kind of “data mining” software, you first need to have direct access to your data, and that can mean having to download thousands and thousands of articles. It’s not at all obvious how to do that, but in fact it can be made quite easy, as I will show.
First, though, a brisk stroll down memory lane…
The history of Papers Past
The Papers Past website can trace its origins back to its pre-digital ancestor, the National Newspaper Collection, an archive of paper and microfilm. But the website has its own history, with two distinct phases. The original Papers Past was essentially just a collection of giant images of newspaper pages, organized by newspaper title and by date. It was a fabulous resource, but the tricky thing was knowing where to look.
Then in 2007 Papers Past was reincarnated in a much more sophisticated form, featuring full text search for individual articles.
This crucial improvement in usability came about through converting the page images to text. The Library extracted the full text from each newspaper page image using Optical Character Recognition software, which they had primed with lists of NZ words including placenames and Māori familial names, in order to more reliably recognize these words. Finally they had every headline manually checked and edited for accuracy. The current website is built around an index of all that text, linked to page images. By searching that index, you can retrieve a list of articles that might interest you, pick out a selection, and actually read them.
It’s notable that each of these development stages delivered a new way for researchers to work with the archive. The original website expanded the accessibility of the archive by exposing it to the entire internet, and the modern version of the website dramatically improved the discoverability of that information by allowing researchers to search within full text.
Since then there’s been a third change in the digital environment around Papers Past: it has become linked up to a broader system called Digital New Zealand, and it seems to me that this change opens up yet another new way for researchers to engage with the archive.
Digital New Zealand
Digital New Zealand is an aggregation of information about NZ cultural items: books, movies, posters, art works, newspapers, and more, drawn from the catalogues and databases of galleries, libraries, museums, and other institutions. There amongst the contributors to Digital New Zealand is our friend Papers Past.
Digital New Zealand is more than just a website, though; its core offering is a so-called “Application Programming Interface”, or API, providing third-party programmers (i.e. people like me) with a way to access the data contained in Digital NZ. Using the Digital NZ API we can search, download, and even create new information resources.
The Digital New Zealand API is a custom-built thing, but functionally it’s not too different to many other APIs in use on the internet. In particular, some parts of it are very similar to a standard mechanism called OAI-PMH, which has been used by libraries and archives for over a decade.
The Open Archives Initiative Protocol for Metadata Harvesting
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol (i.e. a set of conversational conventions, but in this case it’s essentially synonymous with “API”) for machines to send each other large batches of structured information (e.g. library catalogue records) in a variety of formats.
There are two roles defined in the protocol: the “provider”, which manages and publishes a database of records, and the “harvester”, which periodically asks the provider for updates. Over the years a large number of software applications have been produced to function either as OAI-PMH providers or harvesters (or both). Could Papers Past be made to implement this protocol (as a provider)? If so, then we could access Papers Past using any of the existing harvester applications.
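To make the provider/harvester conversation a bit more concrete, here is a minimal Python sketch of the harvester's side: it reads a provider's ListRecords response, pulls out the record identifiers, and looks for the “resumption token” that tells the harvester there is another batch to ask for. The sample XML is invented for illustration (the example.org identifiers and the page-2 token are made up); only the element names and the namespace come from the OAI-PMH specification.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# An invented, cut-down provider response (real ones carry full metadata).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2013-01-01T00:00:00Z</responseDate>
  <request verb="ListRecords">http://example.org/oai</request>
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2013-01-01</datestamp>
      </header>
      <metadata/>
    </record>
    <record>
      <header>
        <identifier>oai:example.org:2</identifier>
        <datestamp>2013-01-01</datestamp>
      </header>
      <metadata/>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(
            "./oai:ListRecords/oai:record/oai:header/oai:identifier", OAI_NS
        )
    ]
    token_element = root.find("./oai:ListRecords/oai:resumptionToken", OAI_NS)
    # An absent or empty token means this was the final batch.
    has_token = token_element is not None and token_element.text
    token = token_element.text if has_token else None
    return identifiers, token


identifiers, token = parse_list_records(SAMPLE_RESPONSE)
print(identifiers, token)
```

A real harvester such as jOAI does all of this for you, and keeps sending the token back to the provider until no more batches are offered.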
I’ve been a fan of OAI-PMH from way back, and I’ve used it many times. It had struck me that if I built a “gateway” which translated services like Digital NZ into OAI-PMH, then historical researchers could use existing OAI-PMH software to download the articles they need. So I sat down and wrote this gateway software, and I gave it the name “Retailer” because it’s a kind of shop-front, providing “retail” access to data provided by a “wholesale” data provider such as Digital NZ.
The way Retailer works is that it is a generic framework for translating between one API and another. It could be used to translate all sorts of protocols. Each translation is effected by writing a specific script that defines the equivalences between a specific pair of APIs. Having written the generic Retailer, I then wrote some translation scripts: first I wrote a script to translate between OAI-PMH and the API of the National Library of Australia’s “Trove” service, and then I wrote a script to translate between OAI-PMH and the API of Digital NZ.
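To give a flavour of what “defining the equivalences” between two APIs might involve, here is a small Python sketch. It is not Retailer's actual code, and it isn't the language Retailer's scripts are written in; the function name, the token format, and the search parameter names (api_key, text, page) are all assumptions made up for illustration. It turns the arguments of an OAI-PMH ListRecords request into the parameters of a generic paged search request.

```python
SEARCH_PREFIX = "search:"


def translate_request(oai_params, api_key):
    """Map OAI-PMH request arguments to hypothetical search-API parameters.

    Assumed token format (invented for this sketch): "<page>|<setSpec>".
    OAI-PMH sends a resumption token *instead of* the original arguments,
    so the token has to carry the query along with the page number.
    """
    if "resumptionToken" in oai_params:
        page_text, set_spec = oai_params["resumptionToken"].split("|", 1)
        page = int(page_text)
    else:
        set_spec = oai_params.get("set", "")
        page = 1

    if not set_spec.startswith(SEARCH_PREFIX):
        raise ValueError("expected a set of the form 'search:<query>'")

    return {
        "api_key": api_key,  # assumed parameter name
        "text": set_spec[len(SEARCH_PREFIX):],  # assumed parameter name
        "page": page,  # assumed parameter name
    }


first = translate_request(
    {"verb": "ListRecords", "metadataPrefix": "oai_dc",
     "set": "search:new zealand wars"},
    api_key="YOUR-KEY",
)
print(first)
```

The other half of the job, which this sketch leaves out, is translating the search results back into OAI-PMH records for the harvester.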
I have bundled the core “Retailer” software with both these scripts, so you can harvest newspaper articles from both the NZ and Australian newspaper archives.
First, you will need to register with Digital New Zealand to obtain an “API Key”. This is a random password which they assign to you. Whenever you ask a question of the Digital NZ API, you must provide this API Key with your request, so that they know who you are. Since Retailer will be interacting with Digital NZ on your behalf, you will need to let Retailer know what your key is.
Then you need to set up a Java web server, such as Apache Tomcat. This provides the platform on which Retailer runs.
Once you have set up Tomcat, you can install Retailer in it, and configure it to run the Papers Past OAI-PMH provider script.
Finally you need to set up an OAI-PMH harvester. I’ve been using jOAI, which I can recommend heartily. It has the advantage that, like Retailer, it is a Java Servlet, so you can install it inside Tomcat, alongside Retailer itself.
Now you are all set to harvest newspaper articles!
To harvest, you should first go to Digital NZ and do some searching until you have a search query that returns you the results you want to harvest. Click through a selection of results to check that they are relevant; if necessary, refine your query to exclude articles you are not interested in.
Once you have decided on a good query, navigate your browser to your jOAI harvester (e.g. to http://localhost:8080/oai/) and add a new harvest. Set the harvest’s setSpec to be search: followed by your query, e.g. search:new zealand wars. Then run the harvest to actually download the search results. For details, see the Retailer documentation.
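If you're curious what the harvester actually sends when you run that harvest, here is a rough Python sketch that builds the equivalent OAI-PMH request URL by hand. The verb, metadataPrefix, and set arguments are standard OAI-PMH; the base URL is only a placeholder, since the real address depends on how you installed Retailer in Tomcat.

```python
from urllib.parse import urlencode

# Placeholder address: substitute wherever your Retailer provider lives.
BASE_URL = "http://localhost:8080/retailer/"


def list_records_url(base_url, query, metadata_prefix="oai_dc"):
    """Build a ListRecords request whose set is 'search:' plus the query."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": "search:" + query,
    }
    # urlencode escapes the colon and spaces so the URL is valid.
    return base_url + "?" + urlencode(params)


print(list_records_url(BASE_URL, "new zealand wars"))
```

Pasting a URL like this into your browser is a handy way to check that the provider is responding before you set a full harvest running.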
Why not give it a try? If you have any problems, leave a comment here or open an issue on the Retailer GitHub site!