Last week I travelled down from Brisbane to Canberra to attend the THATCamp Canberra event at the National Library of Australia.
It’s been a very pleasant trip. I’ve been staying with friends I haven’t seen for a while, the spring weather has been lovely, and the environment here really is beautiful, with lots of leafy trees and birds everywhere.
It’s been a busy few days though; today at last I’ve had a chance to relax and reflect on the THATCamp, and I thought I’d jot down a few things.
This particular THATCamp was themed around the National Library’s Trove service. The main organiser of the event was Tim Sherratt, who’s the manager of Trove. Apparently it was Trove’s 5th birthday, and coincidentally it was Tim’s birthday too, on the last day of the event; one of his colleagues had baked him a cake and we all sang him happy birthday during one of the breaks between sessions.
That’s why many of the THATCamp sessions were related to Trove, and I facilitated one such session myself, about the Retailer software I’d developed recently. Retailer is a bit of web middleware for transforming XML web APIs into other web APIs, and in particular, it includes a script to transform the newspaper part of Trove’s API into a standard API called OAI-PMH, suitable for bulk harvesting. By installing a copy of Retailer in front of Trove (i.e. using Trove as a wholesaler), you can use an OAI-PMH harvester application to download a very large number of newspaper articles in HTML format.
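To give a feel for what "bulk harvesting" over OAI-PMH involves, here is a minimal sketch of the harvester side in Python. It is not Retailer's code and not any particular harvester application; it only illustrates the standard OAI-PMH `ListRecords` loop, in which each response may carry a `resumptionToken` that you send back to get the next batch. The base URL is hypothetical (it would be wherever your copy of Retailer is installed), the `oai_dc` metadata prefix is an assumption about what the endpoint offers, and `fetch` is a placeholder for whatever HTTP client you prefer.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument:
        # it replaces metadataPrefix, set, and the date arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urllib.parse.urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) from one response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text
        for el in root.findall(".//oai:record/oai:header/oai:identifier", OAI_NS)
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty resumptionToken element means the list is complete.
    has_token = token_el is not None and token_el.text and token_el.text.strip()
    token = token_el.text.strip() if has_token else None
    return identifiers, token


def harvest(base_url, fetch, **kwargs):
    """Yield record identifiers, following resumption tokens until exhausted.

    `fetch` is any callable that takes a URL and returns the response XML.
    """
    token = None
    while True:
        url = list_records_url(base_url, resumption_token=token, **kwargs)
        identifiers, token = parse_list_records(fetch(url))
        yield from identifiers
        if not token:
            break
```

In real use, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`, and you would keep each whole record rather than only its identifier. An existing harvester application already does all of this for you, along with error handling and polite request pacing.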
I wanted to present the Retailer software, explain how it works, and demonstrate it, but also, and especially, to see if I could get a bunch of people to install it, along with an OAI-PMH harvester, and then harvest some full text. I particularly wanted to see what challenges came up in the installation, so that I could revise the documentation. Sure enough, it was not as easy as you might have hoped. I had written and tested Retailer on my own computer systems, which all run some kind of Linux, and the instructions I’d written reflected that. I have a mate who had installed it on OS X, but he’d had to adapt the instructions and it had taken him some time to get it going, with sporadic and half-arsed advice from me over the internet (mostly in the form of comments made inside Facebook Scrabble). I don’t personally know much about Macs as I’ve never had one myself, so this was the big worry for me on the day.
In the Retailer session it turned out that there were a couple of people there running Linux, and they had the software installed and had harvested some articles within a few minutes. I could hardly have done it quicker myself!
There were a few people running Windows, but also several people running OS X, and this is where we spent most of the time. One of the participants, Steve Leahy, put in a lot of this work and also volunteered to write it up, which will be great – thanks Steve!
So the upshot was:
- I gathered a bit of useful information about installation;
- I quadrupled my installed user base;
- I got to talk to a bunch of people about it; and
- I decided to also install the software myself “in the cloud”, so that people could use it without having to do any installation themselves.
It turned out, too, that Retailer may be of use to the National Library’s Trove team themselves, as a tool to help them harvest content from “collecting institutions” such as museums and galleries. It seems a lot of these institutions have their own custom web APIs, so the Trove team were quite keen to try out Retailer as a way to connect those APIs up to Trove’s existing OAI-PMH-based tools. I spent yesterday in a workshop at the Library, talking about XML and XSLT, and we spent about the last hour walking through a specific example of how Retailer could be deployed as a bridge from Museum Victoria’s “Collections Online” API into Trove.
Apart from spruiking Retailer, I had a lot of fun listening and learning in other sessions, and during the al fresco session imposed by the fire alarm.
I particularly enjoyed Glenn Roe’s talk about “distant reading”, and I took away a bunch of ideas and links to software. I’m hoping to deploy some of those tools to analyse and visualise the newspaper articles I’ve harvested from Trove, and I expect I’ll blog more about that experience in future.
Another interesting discussion was about Linked Open Data, which is a bigger story that I expect to tell in the near future, once I’ve published some LOD software I wrote last year.
At another session about digital curation, Greg Rolan, from Monash University, pointed me at a very interesting set of tools called BitCurator, which I expect will come in handy some time, even if just for tidying up my own files.