Conal Tuohy's blog: the blog of a digital humanities software developer (http://conaltuohy.com)

Old News for Twitter
http://conaltuohy.com/blog/old-news-for-twitter/
Sat, 25 Jul 2015

Yesterday I finished a little development project to build a TwitterBot for New Zealand’s online newspaper archive Papers Past.

What’s a “TwitterBot”? It’s a software application that autonomously (robotically, hence “-bot”) sends tweets. There are a lot of TwitterBots tweeting about all kinds of things. Tim Sherratt has produced a few, including one called @TroveNewsBot which tweets links to articles from the Australian online newspaper archive of Trove, and this was a direct inspiration for my TwitterBot. Recently Hugh Rundle produced a TwitterBot called Aus GLAM Blog Bot that tweets links to blog posts by people blogging in the Australian GLAM (Galleries, Libraries, Archives and Museums) sector. People like me. I’m looking forward to seeing Hugh’s bot tweeting about my bot.

One nice thing about making a TwitterBot is the tight constraints you have to work under. That 140-character limit keeps you focused on doing one thing. Another nice thing about them is they are public performers; they get up on stage in front of the world and sing and dance, or they shout weird slogans, or whatever. If they are interesting, people will follow them. The other great thing about them is that they are autonomous; not even their creators know exactly what they will do and say.

Tim’s bots are written in Python, which is his programming language of choice. Hugh chose to write his in Javascript. My bot is written in XProc, which is a programming language designed for processing markup (XML and HTML) and pushing data around on the web. I’ve been using it for a while, and I thought it would be nice to add some tools to my XProc toolbox for dealing with Twitter. XML hackers may like to check out the source code for @NZPaperBot, on GitHub.

So I set my robot the task of tweeting pictures from newspapers that were published exactly 100 years ago, and after a bit of hacking with Papers Past and with Twitter, my bot posted its first tweet yesterday:

I’m looking forward to seeing what else it comes up with, and to expanding its behaviour in future. I’d like to see it responding to other people’s tweets, and to the tweets of other bots!

Public OAI-PMH repository for Papers Past
http://conaltuohy.com/blog/public-oai-pmh-repository-for-papers-past/
Mon, 25 May 2015

I have deployed a publicly available service to provide access in bulk to newspaper articles from Papers Past — the National Library of New Zealand’s online collection of historical newspapers — via the DigitalNZ API.

The service allows access to newspaper articles in bulk (up to a maximum of 5000 articles), using OAI-PMH harvesting software. To gain access to the collection, point your OAI-PMH harvester to the repository with this URI:

https://papers-past-oai-pmh.herokuapp.com/

If you’re looking for a good harvester, let me recommend jOAI.

Searching

You can harvest records that match a search. Provide your search query as an OAI-PMH set. For example, to search for “titokowaru”, specify search:titokowaru as the value of the OAI-PMH set parameter:

https://papers-past-oai-pmh.herokuapp.com/?verb=ListRecords&metadataPrefix=oai_dc&set=search:titokowaru
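If you are scripting your own requests rather than using a harvester, a request URL like the one above can be assembled with standard URL-encoding tools. A minimal Python sketch (the repository URL is the one given above; verb, metadataPrefix, and set are standard OAI-PMH request parameters):

```python
from urllib.parse import urlencode

# Base URI of the OAI-PMH repository (from the post above)
BASE = "https://papers-past-oai-pmh.herokuapp.com/"

def list_records_url(query, metadata_prefix="oai_dc"):
    """Build a ListRecords request that harvests articles matching a search."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        # search queries are expressed as OAI-PMH sets, prefixed "search:"
        "set": "search:" + query,
    }
    return BASE + "?" + urlencode(params)

print(list_records_url("titokowaru"))
# https://papers-past-oai-pmh.herokuapp.com/?verb=ListRecords&metadataPrefix=oai_dc&set=search%3Atitokowaru
```

Note that urlencode percent-encodes the colon in the set value; OAI-PMH providers accept either form.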

Formats available

You can harvest records (i.e. articles) in one of three different formats:

  • html — this format returns the full text of the articles, and is likely to be the most useful format. Note that the text available through DigitalNZ has had punctuation and capitalization removed.
  • oai_dc — a simple metadata record.
  • digitalnz — straightforwardly based on DigitalNZ’s own metadata format.

To see the list of supported formats, make a ListMetadataFormats request:

https://papers-past-oai-pmh.herokuapp.com/?verb=ListMetadataFormats
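Once you have a ListMetadataFormats response, the metadataPrefix values can be pulled out with any XML library. A sketch in Python, using an abbreviated sample response (illustrative only, not the server's exact output) to show where the prefixes live in the OAI-PMH namespace:

```python
import xml.etree.ElementTree as ET

# All OAI-PMH elements live in this XML namespace
OAI = "{http://www.openarchives.org/OAI/2.0/}"

# Abbreviated example of a ListMetadataFormats response; the three
# prefixes are the formats described in the post above.
sample = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat><metadataPrefix>html</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>oai_dc</metadataPrefix></metadataFormat>
    <metadataFormat><metadataPrefix>digitalnz</metadataPrefix></metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""

root = ET.fromstring(sample)
prefixes = [e.text for e in root.iter(OAI + "metadataPrefix")]
print(prefixes)  # ['html', 'oai_dc', 'digitalnz']
```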

Happy harvesting!

How to download bulk newspaper articles from Papers Past
http://conaltuohy.com/blog/how-to-download-bulk-newspaper-articles-from-papers-past/
Sun, 14 Sep 2014

Anyone interested in New Zealand history should already know about the amazing Papers Past website provided by the National Library of New Zealand, where you can read, search and browse millions of historical newspaper articles and advertisements from New Zealand.

You may also know about Digital New Zealand, which provides a central point for access to New Zealand’s digital culture.

This post is about using Digital NZ and Papers Past to get access, in bulk, to newspaper articles, in a form which is then open to being “crunched” with sophisticated “data mining” software, able to automatically discover patterns in the articles. In my earlier post How to download bulk newspaper articles from Trove, I wrote:

Some researchers want to go beyond merely using computers to help them find newspaper articles to read (what librarians call “resource discovery”); researchers also want to use computers to help them read those articles.

To use that kind of “data mining” software, you first need to have direct access to your data, and that can mean having to download thousands and thousands of articles. It’s not at all obvious how to do that, but in fact it can be made quite easy, as I will show.

First, though, a brisk stroll down memory lane…

The history of Papers Past

The Papers Past website can trace its origins back to its pre-digital ancestor, the National Newspaper Collection, an archive of paper and microfilm. But the website has its own history, with two distinct phases. The original Papers Past was essentially just a collection of giant images of newspaper pages, organized by newspaper title and by date. It was a fabulous resource, but the tricky thing was knowing where to look.

Then in 2007 Papers Past was reincarnated in a much more sophisticated form, featuring full text search for individual articles.

This crucial improvement in usability came about through converting the page images to text. The Library extracted the full text from each newspaper page image using Optical Character Recognition software, which they had primed with lists of NZ words including placenames and Māori familial names, in order to more reliably recognize these words. Finally they had every headline manually checked and edited for accuracy. The current website is built around an index of all that text, linked to page images. By searching that index, you can retrieve a list of articles that might interest you, pick out a selection, and actually read them.

It’s notable that each of these development stages delivered a new way for researchers to work with the archive. The original website expanded the accessibility of the archive by exposing it to the entire internet, and the modern version of the website dramatically improved the discoverability of that information by allowing researchers to search within full text.

Since then there’s been a third change in the digital environment around Papers Past — where it has become linked up to a broader system called Digital New Zealand — and it seems to me that this change opens up yet another new way for researchers to engage with the archive.

Digital New Zealand

Digital New Zealand is an aggregation of information about NZ cultural items: books, movies, posters, art works, newspapers, and more, drawn from the catalogues and databases of galleries, libraries, museums, and other institutions. There amongst the contributors to Digital New Zealand is our friend Papers Past.

Digital New Zealand is more than just a website, though; its core offering is a so-called “Application Programming Interface”, or API, providing third-party programmers (i.e. people like me) with a way to access the data contained in Digital NZ. Using the Digital NZ API we can search, download, and even create new information resources.

The Digital New Zealand API is a custom-built thing, but functionally it’s not too different to many other APIs in use on the internet. In particular, some parts of it are very similar to a standard mechanism called OAI-PMH, which has been used by libraries and archives for over a decade.

The Open Archives Initiative Protocol for Metadata Harvesting

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a protocol (i.e. a set of conversational conventions, but in this case it’s essentially synonymous with “API”) for machines to send each other large batches of structured information (e.g. library catalogue records) in a variety of formats.

There are two roles defined in the protocol: a “provider”, which manages and publishes a database of records, and a “harvester”, which periodically asks the provider for updates. Over the years a large number of software applications have been produced to function either as OAI-PMH providers or harvesters (or both). Could Papers Past be made to implement this protocol (as a provider)? If so, then we could access Papers Past using any of the existing harvester applications.
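The harvester role can be sketched in a few lines. This is an illustrative Python sketch, not how jOAI or any particular harvester is actually implemented: it issues a ListRecords request, then follows resumptionTokens until the provider signals that the list is complete. The fetch parameter is a stand-in for whatever HTTP client you use.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# All OAI-PMH elements live in this XML namespace
OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(base_url, fetch, metadata_prefix="oai_dc"):
    """Minimal sketch of an OAI-PMH harvester: request ListRecords,
    collect the records, and keep requesting with the provider's
    resumptionToken until there is no (or an empty) token left.
    `fetch` is any callable mapping a URL to the XML response body,
    so the loop can be exercised without a network connection."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    records = []
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        records.extend(root.iter(OAI + "record"))
        token = root.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no token, or an empty one: the list is complete
        # subsequent requests carry only the verb and the token
        params = {"verb": "ListRecords", "resumptionToken": token.text}
    return records
```

In practice fetch would be something like `lambda url: urllib.request.urlopen(url).read()`; passing it in as a parameter just keeps the loop easy to test.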

Retailer

I’ve been a fan of OAI-PMH from way back, and I’ve used it many times. It had struck me that if I built a “gateway” which translated services like Digital NZ into OAI-PMH, then historical researchers could use existing OAI-PMH software to download the articles they need. So I sat down and wrote this gateway software, and I gave it the name “Retailer” because it’s a kind of shop-front, providing “retail” access to data provided by a “wholesale” data provider such as Digital NZ.

Retailer is a generic framework for translating between one API and another; it could be used to translate all sorts of protocols. Each translation is effected by a script that defines the equivalences between a specific pair of APIs. Having written the generic Retailer, I then wrote some translation scripts: first a script to translate between OAI-PMH and the API of the National Library of Australia’s “Trove” service, and then a script to translate between OAI-PMH and the API of Digital NZ.

I have bundled the core “Retailer” software with both these scripts, so you can harvest newspaper articles from both the NZ and Australian newspaper archives.

Installation

First, you will need to register with Digital New Zealand to obtain an “API Key”. This is a random password which they assign to you. Whenever you ask a question of the Digital NZ API, you must provide this API Key with your request, so that they know who you are. Since Retailer will be interacting with Digital NZ on your behalf, you will need to let Retailer know what your key is.

Then you need to set up a Java web server, such as Apache Tomcat. This provides the platform on which Retailer runs.

Once you have set up Tomcat, you can install Retailer in it, and configure it to run the Papers Past OAI-PMH provider script papers-past.xsl.

Finally you need to set up an OAI-PMH harvester. I’ve been using jOAI, which I can recommend heartily. It has the advantage that, like Retailer, it is a Java Servlet, so you can install it inside Tomcat, alongside Retailer itself.

Now you are all set to harvest newspaper articles!

Harvesting

To harvest, you should first go to Digital NZ and do some searching until you have a search query that returns you the results you want to harvest. Click through a selection of results to check that they are relevant; if necessary, refine your query to exclude articles you are not interested in.

Once you have decided on a good query, navigate your browser to your jOAI harvester (e.g. to http://localhost:8080/oai/) and add a new harvest. Set the harvest’s setSpec to be search: followed by your query, e.g. search:new zealand wars. Then run the harvest to actually download the search results. For details, see the Retailer documentation.
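One small detail if you ever construct a request by hand rather than through jOAI: a setSpec like search:new zealand wars contains spaces, which must be percent-encoded when it goes into the request URL (jOAI does this for you). A sketch in Python:

```python
from urllib.parse import quote

def set_spec_param(query):
    """Build the set parameter value for a search harvest: the literal
    prefix "search:" followed by the query, percent-encoded for use in
    a URL. safe=":" keeps the prefix's colon unescaped."""
    return quote("search:" + query, safe=":")

print(set_spec_param("new zealand wars"))  # search:new%20zealand%20wars
```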

Why not give it a try? Any problems, leave a comment here or raise an issue on the Retailer GitHub site!