Data mining – Conal Tuohy's blog
The blog of a digital humanities software developer

Analysis & Policy Online
Tue, 27 Jun 2017

Notes for my Open Repositories 2017 conference presentation. I will edit this post later to flesh it out into a proper blog post.
Follow along at: conaltuohy.com/blog/analysis-policy-online/

background

  • Early discussion with Amanda Lawrence of APO (which at that time stood for “Australian Policy Online”) about text mining, at the 2015 LODLAM Summit in Sydney.
  • They needed automation to help with the cataloguing work, to improve discovery.
  • They needed to understand their own corpus better.
  • I suggested a particular technical approach based on previous work.
  • In 2016, APO contracted me to advise and help them build a system that would “mine” metadata from their corpus, and use Linked Data to model and explore it.

constraints

  • Openness
  • Integrate metadata from multiple text-mining processes, plus manually created metadata
  • Minimal dependency on their current platform (Drupal 7, now Drupal 8)
  • Lightweight; easy to make quick changes

technical approach

  • Use an entirely external metadata store (a SPARQL Graph Store)
  • Use a pipeline! Extract, Transform, Load
  • Use standard protocol to extract data (first OAI-PMH, later sitemaps)
  • In fact, use web services for everything; the pipeline is then just a simple script that passes data between web services (see the sketch after this list)
  • Sure, XSLT and SPARQL Query, but what the hell is XProc?!
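
Since every step is exposed as a web service, the pipeline itself can stay a very small script that simply passes data from one service to the next. Below is a minimal sketch of that idea in Python; note that the real pipeline is written in XProc, and every endpoint URL in the sketch is a placeholder or assumption rather than one of APO's actual services.

```python
# Sketch only: an Extract-Transform-Load loop that passes data between web services.
# The real APO pipeline is an XProc pipeline; all URLs here are assumptions/placeholders.
import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "https://example.org/sitemap.xml"   # hypothetical sitemap location
TIKA_URL = "http://localhost:9998/tika"           # Apache Tika Server text-extraction endpoint (default port assumed)
GRAPH_STORE = "http://localhost:3030/apo/data"    # Fuseki SPARQL Graph Store endpoint (dataset name assumed)

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def page_urls(sitemap_url):
    """Extract: list the resource URLs advertised in a sitemap."""
    root = ET.fromstring(requests.get(sitemap_url).content)
    return [loc.text for loc in root.findall(".//sm:loc", SITEMAP_NS)]

def extract_text(document_bytes):
    """Transform (first step): ask Tika Server for the plain text of a digital object."""
    response = requests.put(TIKA_URL, data=document_bytes,
                            headers={"Accept": "text/plain"})
    response.raise_for_status()
    return response.text

def load_graph(graph_uri, turtle):
    """Load: store an RDF graph under a named graph via the SPARQL Graph Store protocol."""
    response = requests.put(GRAPH_STORE, params={"graph": graph_uri},
                            data=turtle.encode("utf-8"),
                            headers={"Content-Type": "text/turtle"})
    response.raise_for_status()

if __name__ == "__main__":
    for url in page_urls(SITEMAP_URL):
        document = requests.get(url).content
        text = extract_text(document)
        # ... named-entity recognition / topic modelling and conversion to RDF
        # would happen here, producing Turtle to pass to load_graph(url, ...) ...
```

Whether the extract step reads OAI-PMH or a sitemap, only the first function changes; that kind of swappability is part of what makes the web-service approach lightweight and easy to change quickly.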

progress

  • Configured Apache Tika as a web service, using the Stanford Named Entity Recognition toolkit (see the sketch after this list)
  • Built an XProc pipeline to harvest from Drupal’s OAI-PMH module, download digital objects, process them with Stanford NER via Tika, and store the resulting graphs in a Fuseki graph store
  • Harvested, and produced a graph of part of the corpus, but …
  • It turned out the Drupal OAI-PMH module was broken! So we used sitemaps instead
  • “Related” list added to the APO dev site (NB: I’ve seen this isn’t working in all browsers and obviously needs more work; perhaps using an iframe is not the best idea. Try Chrome if you don’t see the list of related pages on the right)
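
To make the Tika step a little more concrete, here is a hedged sketch of asking Tika Server for a document's metadata and picking out the named entities. It assumes Tika Server has been started with a configuration that enables its NamedEntityParser backed by the Stanford NER models; the NER_* metadata field names are how that parser typically labels its output, so treat them as assumptions to verify against your own configuration.

```python
# Sketch only: reads named-entity metadata from a locally running Tika Server.
# Assumes Tika was configured with the NamedEntityParser and the Stanford NER models.
import requests

TIKA_META_URL = "http://localhost:9998/meta"   # Tika Server metadata endpoint (default port assumed)

def named_entities(document_bytes):
    """Return whatever NER_* metadata fields Tika produced for this document."""
    response = requests.put(TIKA_META_URL, data=document_bytes,
                            headers={"Accept": "application/json"})
    response.raise_for_status()
    metadata = response.json()
    # e.g. NER_PERSON, NER_ORGANIZATION, NER_LOCATION (field names assumed; check your Tika config)
    return {key: value for key, value in metadata.items() if key.startswith("NER_")}
```

From fields like these it is straightforward to mint RDF statements linking a document to the people, places, and organisations it mentions, which is the kind of graph that ends up in the Fuseki store.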

next steps

  • Visualize the graph
  • Integrate more of the manually created metadata into the RDF graph
  • Add topic modelling (using MALLET) alongside the NER

Let’s see the code

Questions?

(if there’s any time remaining)

Australian Society of Archivists 2016 conference #asalinks
Tue, 25 Oct 2016

Last week I participated in the 2016 conference of the Australian Society of Archivists, in Parramatta.

[Image: #ASALinks poster]

I was very impressed by the programme and the discussion. I thought I’d jot down some notes here about just a few of the presentations that were most closely related to my own work. The presentations were all recorded, and as the ASA’s YouTube channel is updated with newly edited videos, I’ll be editing this post to include those videos.

It was my first time at an ASA conference; I’d been dragged into it by Asa Letourneau, from the Public Record Office Victoria, with whom over the last year I’d been developing a piece of software called “PROVisualizer”, which appears right below here in the page (hint: click the “full screen” button in its bottom right corner if you want to have a play with it).

Asa and I gave a presentation on the PROVisualizer, talking about the history of the project from the early prototypes and models built at PROV, to the series of revisions of the product built in collaboration with me, and including the “Linked Data” infrastructure behind the visualization itself, and its prospects for further development and re-use.

You can access the PROVisualizer presentation in PDF.

As always, I enjoyed Tim Sherratt‘s contribution: a keynote on redaction by ASIO (secret police) censors in the National Archives, called Turning the Inside Out.

The black marks are of course inserted by the ASIO censors in order to obscure and hide information, but Tim showed how it’s practicable to deconstruct the redactions’ role in the documents they obscure, and convert these voids, these absences, into positive signs in their own right; and that these signs can be utilized to discover politically sensitive texts, and zoom in precisely on the context that surrounds the censored details in each text. Also the censors made a lot of their redaction marks into cute little pictures of people and sailing ships, which raised a few laughs.

In the morning of the first day of talks, I got a kick out of Chris Hurley’s talk “Access to Archives (& Other Records) in the Digital Age”. His rejection of silos and purely hierarchical data models, and his vision of openness to, and accommodation of, small players in the archives space both really resonated with me, and I was pleased to be able to chat with him over coffee later in the afternoon about the history of this idea and about how things like Linked Data and the ICA’s “Records in Contexts” data model can help to realize it.

In the afternoon of the first day I was particularly struck by Ross Spencer‘s presentation about extracting metadata from full text resources. He spoke about using automation to identify the names of people, organisations, places, and so on, within the text of documents. For me this was particularly striking because I’d only just begun an almost identical project myself for the Australian Policy Online repository of policy documents. In fact it turned out we were using the same software (Apache Tika and the Stanford Named Entity Recognizer).

On the second day I was particularly struck by a few papers that were very close to my own interests. Nicole Kearney, from Museum Victoria, talked about her work coordinating the Biodiversity Heritage Library Australia.

This presentation focused on getting value from the documentary heritage of museums (such things as field notes and diaries from scientific expeditions) by using the Atlas of Living Australia’s DigiVol transcription platform to allow volunteers to transcribe the text from digital images, and then publishing the text and images online using the BHL publication platform. In between there was a slightly awkward step which involved Nicole converting the CSV format produced by DigiVol into a more standard format for the BHL. I’ve had an interest in text transcription going back to slightly before the time I joined the New Zealand Electronic Text Centre at Victoria University of Wellington; this would’ve been about 2003, which seems like ancient times now.

After that I saw Val Love and Kirsty Cox talk about their journey in migrating the Alexander Turnbull Library’s TAPUHI software to KE EMu. It was an impressive effort, given the historical complexity of TAPUHI and the amount of data analysis required to make sense of its unique model, translate that into a more standard conceptual model, and implement that model using EMu. It’s an immense milestone for the Turnbull, and I hope it will lead in short order to the opening up of the collection metadata to greater reuse.

Finally I want to mention the talk “Missing Links: museum archives as evidence, context and content” from Mike Jones. This was another talk about breaking down barriers between collection management systems in museums: on the one hand, the museum’s collection of objects, and on the other, the institution’s archives. Of course those archives provide a great deal of context for the collection, but the reality is that the IT infrastructure and social organisation of these two systems are generally very distinct and separate. Mike’s talk was about integrating cultural heritage knowledge across different organisational structures, domains of professional expertise, data models, and IT systems. I got a shout-out in one slide, in the form of a reference to some experimental work I’d done with Museum Victoria’s web API, to convert it into a Linked Data service.

It’s my view that Linked Data technology offers a practical approach to resolving the complex data integration issues in cultural heritage: it is relatively easy to expose legacy systems, whatever they might be, in the form of Linked Data, and having done so, the task of integrating the data so exposed is also rather straightforward (that’s what Linked Data was invented for, pretty much). To me the problem is how to sell this to an institution, in the sense that you have to offer the institution itself a “win” for undertaking the work. If it’s just that they can award themselves 5 gold stars for public service, that’s not a great reason. You need to be able to deliver tangible value to museums themselves. This is where I think there’s a gap: in leveraging Linked Data to enhance exhibitions and also in-house collection management systems. If we can make it so that there’s value to institutions in creating and also consuming Linked Data, then we may be able to establish a virtuous circle to drive uptake of the technology, and see some progress in the integration of knowledge in the sector.

 
