How to download bulk newspaper articles from Papers Past

Anyone interested in New Zealand history should already know about the amazing Papers Past website provided by the National Library of New Zealand, where you can read, search, and browse millions of historical newspaper articles and advertisements from New Zealand.

You may also know about Digital New Zealand, which provides a central point for access to New Zealand’s digital culture.

This post is about using Digital NZ and Papers Past to get access, in bulk, to newspaper articles, in a form which is then open to being “crunched” with sophisticated “data mining” software, able to automatically discover patterns in the articles. In my earlier post How to download bulk newspaper articles from Trove, I wrote:

Some researchers want to go beyond merely using computers to help them find newspaper articles to read (what librarians call “resource discovery”); researchers also want to use computers to help them read those articles.

To use that kind of “data mining” software, you first need to have direct access to your data, and that can mean having to download thousands and thousands of articles. It’s not at all obvious how to do that, but in fact it can be made quite easy, as I will show.
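As a rough illustration only (not necessarily the method described in this post), a bulk download usually boils down to asking a search API for one page of results after another until the results run out. The Python sketch below assumes a Digital NZ style JSON search endpoint; the URL, the parameter names (`api_key`, `text`, `and[collection][]`, `page`, `per_page`) and the `search`/`results` response shape are assumptions to check against the Digital NZ API documentation, and `build_url`, `fetch_page` and `harvest` are hypothetical names made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; verify against the Digital NZ API documentation.
API_URL = "https://api.digitalnz.org/v3/records.json"


def build_url(api_key, query, page, per_page=100):
    """Build the URL for one page of search results (parameter names assumed)."""
    params = [
        ("api_key", api_key),
        ("text", query),
        ("and[collection][]", "Papers Past"),  # assumed collection filter
        ("page", page),
        ("per_page", per_page),
    ]
    return API_URL + "?" + urlencode(params)


def fetch_page(url):
    """Download one page of results and parse it as JSON."""
    with urlopen(url) as response:
        return json.load(response)


def harvest(api_key, query, fetch=fetch_page, per_page=100, max_pages=None):
    """Yield every matching record, requesting one page at a time."""
    page = 1
    while max_pages is None or page <= max_pages:
        data = fetch(build_url(api_key, query, page, per_page))
        # Assumed response shape: {"search": {"results": [...]}}
        results = data.get("search", {}).get("results", [])
        if not results:
            break  # an empty page means there is nothing left to download
        yield from results
        page += 1
```

Something like `for record in harvest(my_key, "gold rush", max_pages=5): ...` would then walk through the first five pages of matches, where `my_key` stands for whatever API key the service issues. Passing the `fetch` function in as a parameter keeps the paging logic separate from the network call, so the loop can be tried out on canned data before it is pointed at the real service.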

First, though, a brisk stroll down memory lane…

How to download bulk newspaper articles from Trove

Before the internet, before TV, before radio, newspapers ruled. There were literally hundreds of newspapers, published in towns and cities all over Australia, and they carried the daily life of Australians in all its petty detail. For historians, newspapers were a diamond mine; the information content was hugely valuable; the hard part was all the digging you had to do. It used to be that you would have to go to a library where a newspaper collection was held, and search manually through text on paper or microfiche. You had to be prepared to put in a lot of hard slog.

But then everything changed. A humanities researcher once told me that for Australian researchers, the National Library of Australia’s “Trove” newspaper archive marked a radical break: “There was a Before Trove, and an After Trove”.

Interpreting the Australian indigenous lexicons of Daisy Bates

One of the most interesting things I’ve been doing recently has been working with Nick Thieberger, a linguist at Melbourne University, on “Digital Daisy Bates”, a Digital Humanities project based on the pioneering lexicographical research of Daisy Bates. In the early 20th century, Bates spent many years living in outback Australia, researching the languages and cultures of indigenous Australians, and produced dozens of lexicons of indigenous languages, using a common questionnaire. In our project, we are using modern digital methods to analyze all these lexicons. The first step of the project was to digitize and transcribe the questionnaires, so we can then crunch up the digital data and extract knowledge from it.

I will post more on the project later, but in this post I will restrict myself to describing some of the mechanics of the first stage of that data crunching.

Earlier this year I began to do work on my own account, as a software developer for hire.

I’ve got a few clients (a start!), an Australian Business Number, a business name (“Conal Tuohy”, which not coincidentally is also my own name), and finally the domain name. So it was about time I launched my new website and blog.

Not much to see here yet, is there? But I’m intending this site to host a blog and to show off samples of my work.

I expect I will write a fair bit about Digital Humanities and eResearch, as well as text encoding, software architecture, and whatever else takes my fancy.