How to download bulk newspaper articles from Papers Past

Anyone interested in New Zealand history should already know about the amazing Papers Past website provided by the National Library of New Zealand, where you can read, search, and browse millions of historical newspaper articles and advertisements from New Zealand.

You may also know about Digital New Zealand, which provides a central point for access to New Zealand’s digital culture.

This post is about using Digital NZ and Papers Past to get access, in bulk, to newspaper articles, in a form which is then open to being “crunched” with sophisticated “data mining” software, able to automatically discover patterns in the articles. In my earlier post How to download bulk newspaper articles from Trove, I wrote:

Some researchers want to go beyond merely using computers to help them find newspaper articles to read (what librarians call “resource discovery”); researchers also want to use computers to help them read those articles.

To use that kind of “data mining” software, you first need to have direct access to your data, and that can mean having to download thousands and thousands of articles. It’s not at all obvious how to do that, but in fact it can be made quite easy, as I will show.

First, though, a brisk stroll down memory lane…

The history of Papers Past

The Papers Past website can trace its origins back to its pre-digital ancestor, the National Newspaper Collection, an archive of paper and microfilm. But the website has its own history, with two distinct phases. The original Papers Past was essentially just a collection of giant images of newspaper pages, organized by newspaper title and by date. It was a fabulous resource, but the tricky thing was knowing where to look.

Then in 2007 Papers Past was reincarnated in a much more sophisticated form, featuring full text search for individual articles.

This crucial improvement in usability came about through converting the page images to text. The Library extracted the full text from each newspaper page image using Optical Character Recognition software, which they had primed with lists of NZ words including placenames and Māori familial names, in order to more reliably recognize these words. Finally they had every headline manually checked and edited for accuracy. The current website is built around an index of all that text, linked to page images. By searching that index, you can retrieve a list of articles that might interest you, pick out a selection, and actually read them.

It’s notable that each of these development stages delivered a new way for researchers to work with the archive. The original website expanded the accessibility of the archive by exposing it to the entire internet, and the modern version of the website dramatically improved the discoverability of that information by allowing researchers to search within full text.

Since then there’s been a third change in the digital environment around Papers Past: it has become linked up to a broader system called Digital New Zealand. It seems to me that this change opens up yet another new way for researchers to engage with the archive.

Digital New Zealand

Digital New Zealand is an aggregation of information about NZ cultural items: books, movies, posters, art works, newspapers, and more, drawn from the catalogues and databases of galleries, libraries, museums, and other institutions. There amongst the contributors to Digital New Zealand is our friend Papers Past.

Digital New Zealand is more than just a website, though; its core offering is a so-called “Application Programming Interface”, or API, providing third-party programmers (i.e. people like me) with a way to access the data contained in Digital NZ. Using the Digital NZ API we can search, download, and even create new information resources.

The Digital New Zealand API is a custom-built thing, but functionally it’s not too different to many other APIs in use on the internet. In particular, some parts of it are very similar to a standard mechanism called OAI-PMH, which has been used by libraries and archives for over a decade.

The Open Archives Initiative Protocol for Metadata Harvesting

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol (i.e. a set of conversational conventions, but in this case it’s essentially synonymous with “API”) for machines to send each other large batches of structured information (e.g. library catalogue records) in a variety of formats.

There are two roles defined in the protocol: the “provider”, which manages and publishes a database of records, and the “harvester”, which periodically asks the provider for updates. Over the years a large number of software applications have been produced to function either as OAI-PMH providers or harvesters (or both). Could Papers Past be made to implement this protocol (as a provider)? If so, then we could access Papers Past using any of the existing harvester applications.
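To make the harvester’s side of that conversation a little more concrete, here is a minimal sketch in Python. It is not taken from any particular harvester. It reads one batch of a ListRecords response and picks out the record identifiers, plus the “resumption token” which the harvester sends back to ask for the next batch. The element names and the namespace come from the OAI-PMH specification; the sample XML is hand-written and the identifiers in it are made up.

```python
import xml.etree.ElementTree as ET

# Namespace used by every element of an OAI-PMH 2.0 response.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hand-written sample, trimmed to the parts this sketch reads.
# The identifiers and the token value are invented for illustration.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>example:1</identifier></header>
    </record>
    <record>
      <header><identifier>example:2</identifier></header>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def read_batch(xml_text):
    """Return (identifiers, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # A missing or empty token means the provider has no more batches.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token


if __name__ == "__main__":
    ids, token = read_batch(SAMPLE_RESPONSE)
    print(ids, token)
```

A real harvester would fetch each batch over HTTP and repeat the request with the returned token until no token comes back.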

Retailer

I’ve been a fan of OAI-PMH from way back, and I’ve used it many times. It had struck me that if I built a “gateway” which translated services like Digital NZ into OAI-PMH, then historical researchers could use existing OAI-PMH software to download the articles they need. So I sat down and wrote this gateway software, and I gave it the name “Retailer” because it’s a kind of shop-front, providing “retail” access to data provided by a “wholesale” data provider such as Digital NZ.

Retailer is a generic framework for translating between one API and another. It could be used to translate all sorts of protocols. Each translation is effected by writing a specific script that defines the equivalences between a specific pair of APIs. Having written the generic Retailer, I then wrote some translation scripts: first I wrote a script to translate between OAI-PMH and the API of the National Library of Australia’s “Trove” service, and then I wrote a script to translate between OAI-PMH and the API of Digital NZ.
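As a rough illustration of the kind of equivalence a translation script defines, here is a small Python sketch. It is not Retailer’s code, and it only shows the shape of the idea: an incoming OAI-PMH ListRecords request whose set is “search:” followed by a query gets turned into the parameters of a search request. The upstream parameter name “text” is an assumption made for the example.

```python
SEARCH_PREFIX = "search:"


def translate_request(oai_params):
    """Map OAI-PMH request parameters onto a (hypothetical) search API's parameters."""
    if oai_params.get("verb") != "ListRecords":
        raise ValueError("this sketch only handles the ListRecords verb")
    set_spec = oai_params.get("set", "")
    if not set_spec.startswith(SEARCH_PREFIX):
        raise ValueError("expected a set of the form 'search:<query>'")
    # Everything after the prefix is passed upstream as the search text.
    # "text" is an assumed parameter name, not a documented one.
    return {"text": set_spec[len(SEARCH_PREFIX):]}


if __name__ == "__main__":
    print(translate_request({"verb": "ListRecords", "set": "search:new zealand wars"}))
```

A full translation also has to work in the other direction, turning the search results into OAI-PMH records, which this sketch leaves out.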

I have bundled the core “Retailer” software with both these scripts, so you can harvest newspaper articles from both the NZ and Australian newspaper archives.

Installation

First, you will need to register with Digital New Zealand to obtain an “API Key”. This is a random password which they assign to you. Whenever you ask a question of the Digital NZ API, you must provide this API Key with your request, so that they know who you are. Since Retailer will be interacting with Digital NZ on your behalf, you will need to let Retailer know what your key is.
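If you are curious what “providing the key with your request” amounts to, here is a hedged Python sketch of the general pattern. The base URL is a placeholder, and the parameter name “api_key” and the environment variable name DIGITALNZ_API_KEY are assumptions for the example; check the Digital NZ developer documentation for the real details. Retailer does the equivalent of this for you once it knows your key.

```python
import os
from urllib.parse import urlencode


def url_with_key(base_url, params, api_key=None):
    """Append an API key to a request's query string.

    The key is taken from the argument or, failing that, from the
    DIGITALNZ_API_KEY environment variable (a name invented for this sketch).
    """
    key = api_key or os.environ.get("DIGITALNZ_API_KEY")
    if not key:
        raise RuntimeError("no API key supplied")
    # "api_key" is an assumed parameter name, not confirmed by this post.
    query = dict(params, api_key=key)
    return base_url + "?" + urlencode(query)


if __name__ == "__main__":
    # Placeholder URL and key, for illustration only.
    print(url_with_key("https://api.example.org/records", {"text": "gold rush"}, api_key="abc123"))
```

Keeping the key in an environment variable rather than in a script makes it less likely that you will publish it by accident.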

Then you need to set up a Java web server, such as Apache Tomcat. This provides the platform on which Retailer runs.

Once you have set up Tomcat, you can install Retailer in it, and configure it to run the Papers Past OAI-PMH provider script papers-past.xsl.

Finally you need to set up an OAI-PMH harvester. I’ve been using jOAI, which I can recommend heartily. It has the advantage that, like Retailer, it is a Java Servlet, so you can install it inside Tomcat, alongside Retailer itself.

Now you are all set to harvest newspaper articles!

Harvesting

To harvest, you should first go to Digital NZ and do some searching until you have a search query that returns you the results you want to harvest. Click through a selection of results to check that they are relevant; if necessary, refine your query to exclude articles you are not interested in.

Once you have decided on a good query, navigate your browser to your jOAI harvester (e.g. to http://localhost:8080/oai/) and add a new harvest. Set the harvest’s setSpec to “search:” followed by your query, e.g. “search:new zealand wars”. Then run the harvest to actually download the search results. For details, see the Retailer documentation.
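For the curious, this is roughly the request a harvester ends up sending on your behalf. The Python sketch below builds a ListRecords URL with the setSpec described above. The verb and the “metadataPrefix” and “set” parameters are standard OAI-PMH; the base URL is a made-up example of where Retailer might be running, and the “oai_dc” format is an assumption about what the provider script offers.

```python
from urllib.parse import urlencode


def set_spec_for(query):
    """Build the setSpec used in this post: 'search:' followed by the query."""
    return "search:" + query.strip()


def list_records_url(base_url, query, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a search-based set."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec_for(query),
    }
    # urlencode escapes the colon and spaces so the query survives the trip.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Hypothetical address for a local Retailer installation.
    print(list_records_url("http://localhost:8080/retailer/papers-past", "new zealand wars"))
```

With jOAI you never need to build this URL by hand; you just type the setSpec into the harvest form.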

Why not give it a try? If you have any problems, leave a comment here or raise an issue on the Retailer GitHub site!

 

4 thoughts on “How to download bulk newspaper articles from Papers Past”

  1. It’s great to see this being explored! I thought it also important to note that the DigitalNZ API was not designed to be a front-end to an OAI-PMH service, so apologies that we don’t have all the features to make this work as well as you would have liked. We do hope to provide download dump services in the future. In the meantime it is also important to understand that the terms of use for the DigitalNZ API do not allow users to permanently keep the data. The data must be refreshed every 30 days so as to support our license agreement with partners. Together with the query limit of 10,000 queries a day, it means that you will be able to use this to maintain a subset of the data. If you have any questions about using the DigitalNZ API, take a look at http://www.digitalnz.org/developers

  2. Great work – but why does the Papers Past service not have a native OAI-PMH service enabled? I understand it is a native part of the Veridian platform and no doubt it will be a key feature of any new Papers Past service, along with text correction. And why is the Papers Past data not available for open download instead of under restrictive DNZ rules? It’s almost all out of copyright, so why restrict?

    1. Good questions Paul! It obviously would be great to be able to harvest the native data format (METS ALTO I believe) via an official OAI-PMH service! What we have at the moment is only the Papers Past text after it has been digested by DigitalNZ. The text made available through DigitalNZ is normalised to remove punctuation and capitalisation, etc. because it is intended to serve as the input to a search index. Similarly the terms and conditions are generic DigitalNZ terms and conditions – not specific to Papers Past.

      1. Hi Paul + Conal. We have tested the OAI capability of our instance of Veridian. We’re keen to expose data, but we need to do it well because there’s interplay between a few things across different domains. These range from fundamental partnership restrictions in how we can make some content available, to answering system overhead questions, to getting organisational buy-in to the premise that this is a thing we need to do. Clearly, none of these are insurmountable. However there’s also the fact that we’re a relatively small group and we need to be a bit mindful of how we resource this. For now, we’re happy with having data services delivered at the current level by DigitalNZ, but that’s not to suggest that we don’t want to extend our data services in the future. We think XML APIs are good things, and I can tell you that we are also investigating options for making datasets available for download.
