Notes for my Open Repositories 2017 conference presentation. I will edit this post later to flesh it out into a proper blog post.
Follow along at: conaltuohy.com/blog/analysis-policy-online/
- Early discussion with Amanda Lawrence of APO (which at that time stood for “Australian Policy Online”) about text mining, at the 2015 LODLAM Summit in Sydney.
- They needed automation to help with the cataloguing work, to improve discovery.
- They needed to understand their own corpus better.
- I suggested a particular technical approach based on previous work.
- In 2016, APO contracted me to advise and help them build a system that would “mine” metadata from their corpus, and use Linked Data to model and explore it.
- Integrate metadata from multiple text-mining processes, plus manually created metadata
- Minimal dependency on their current platform (Drupal 7, now Drupal 8)
- Lightweight; easy to make quick changes
- Use an entirely external metadata store (a SPARQL Graph Store)
- Use a pipeline! Extract, Transform, Load
- Use a standard protocol to extract data (first OAI-PMH, later sitemaps)
- In fact, use web services for everything; the pipeline is then just a simple script that passes data between web services
- Sure, XSLT and SPARQL Query, but what the hell is XProc?!
- Configured Apache Tika as a web service, using the Stanford Named Entity Recognition (NER) toolkit
- Built an XProc pipeline to harvest from Drupal’s OAI-PMH module, download digital objects, process them with Stanford NER via Tika, and store the resulting graphs in a Fuseki graph store
- Harvested, and produced a graph of part of the corpus, but …
- Turned out the Drupal OAI-PMH module was broken! So we used sitemaps instead
- “Related” list added to the APO dev site (NB: I’ve seen that this isn’t working in all browsers, and it obviously needs more work; perhaps using an iframe is not the best idea. Try Chrome if you don’t see the list of related pages on the right.)
- Visualize the graph
- Integrate more of the manually created metadata into the RDF graph
- Add topic modelling (using MALLET) alongside the NER
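
As a rough illustration of the Extract, Transform, Load idea in the notes above, here is a minimal Python sketch. It is not the project's XProc code. It only shows the shape of the three steps: read page URLs from a sitemap, turn a list of entity names into N-Triples, and build a SPARQL Graph Store Protocol `PUT` request for a named graph. The function names, the `MENTIONS` predicate, and the `example.org` URLs are all hypothetical, and the entity list is supplied by hand rather than produced by Tika and Stanford NER.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical predicate, used for illustration only.
MENTIONS = "http://example.org/vocab/mentions"


def extract_urls(sitemap_xml):
    """Extract: return the page URLs listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def entities_to_ntriples(page_url, entities):
    """Transform: emit one N-Triples statement per entity name."""
    lines = []
    for name in entities:
        # Escape the characters that N-Triples string literals require.
        literal = (
            name.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        lines.append(f'<{page_url}> <{MENTIONS}> "{literal}" .')
    return "\n".join(lines) + "\n"


def graph_store_put(store_url, graph_uri, ntriples):
    """Load: build (but do not send) a Graph Store Protocol PUT request.

    A PUT to <store_url>?graph=<graph_uri> replaces that named graph.
    """
    query = urllib.parse.urlencode({"graph": graph_uri})
    return urllib.request.Request(
        f"{store_url}?{query}",
        data=ntriples.encode("utf-8"),
        headers={"Content-Type": "application/n-triples"},
        method="PUT",
    )


if __name__ == "__main__":
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>http://example.org/node/1</loc></url>"
        "</urlset>"
    )
    for url in extract_urls(sitemap):
        triples = entities_to_ntriples(url, ["Amanda Lawrence"])
        # Assumed store endpoint; pass the request to urllib.request.urlopen
        # to actually send it to a running graph store.
        request = graph_store_put("http://localhost:3030/apo/data", url, triples)
        print(request.get_method(), request.full_url)
```

Using one named graph per harvested page means that re-harvesting a page simply replaces its graph, which fits the goal of keeping the pipeline lightweight and easy to change.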
Let’s see the code
(if there’s any time remaining)