Visualization – Conal Tuohy's blog
The blog of a digital humanities software developer

Australian Society of Archivists 2016 conference #asalinks
25 October 2016

Last week I participated in the 2016 conference of the Australian Society of Archivists, in Parramatta.

[Image: #ASALinks conference poster]

I was very impressed by the programme and the discussion, so I thought I’d jot down a few notes here about the presentations most closely related to my own work. The presentations were all recorded, and as the ASA’s YouTube channel is updated with newly edited videos, I’ll update this post to include them.

It was my first time at an ASA conference; I’d been dragged into it by Asa Letourneau, from the Public Record Office Victoria, with whom, over the last year, I’d been developing a piece of software called “PROVisualizer”, which is embedded in the original blog post (hint: click the “full screen” button in its bottom right corner if you want to have a play with it).

Asa and I gave a presentation on the PROVisualizer, talking about the history of the project from the early prototypes and models built at PROV, to the series of revisions of the product built in collaboration with me, and including the “Linked Data” infrastructure behind the visualization itself, and its prospects for further development and re-use.

You can access the PROVisualizer presentation as a PDF.

As always, I enjoyed Tim Sherratt‘s contribution: a keynote on redaction by ASIO (secret police) censors in the National Archives, called Turning the Inside Out.

The black marks are of course inserted by the ASIO censors to obscure and hide information, but Tim showed how the redactions can be turned inside out: these voids, these absences, become positive signs in their own right, and those signs can be used to discover politically sensitive texts and to zoom in precisely on the context surrounding the censored details in each one. The censors also made many of their redaction marks into cute little pictures of people and sailing ships, which raised a few laughs.

In the morning of the first day of talks, I got a kick out of Chris Hurley’s talk “Access to Archives (& Other Records) in the Digital Age”. His rejection of silos and purely hierarchical data models, and his vision of openness to, and accommodation of, small players in the archives space both really resonated with me, and I was pleased to be able to chat with him over coffee later in the afternoon about the history of this idea and about how things like Linked Data and the ICA’s “Records in Context” data model can help to realize it.

In the afternoon of the first day I was particularly struck by Ross Spencer‘s presentation about extracting metadata from full text resources. He spoke about using automation to identify the names of people, organisations, places, and so on, within the text of documents. For me this was particularly striking because I’d only just begun an almost identical project myself for the Australian Policy Online repository of policy documents. In fact it turned out we were using the same software (Apache Tika and the Stanford Named Entity Recognizer).
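
To give a concrete sense of what that kind of workflow looks like (this is not Ross's code, just a minimal sketch of the same idea), here's a Python fragment using the tika package, which talks to a local Apache Tika server, and NLTK's wrapper for the Stanford Named Entity Recognizer. The input file name and the paths to the Stanford model and jar are hypothetical placeholders.

```python
from tika import parser                 # Apache Tika client: extracts plain text from PDFs, Word files, etc.
from nltk.tag import StanfordNERTagger  # NLTK's wrapper around the Stanford NER (requires Java)

# Hypothetical paths: substitute the real locations of your Stanford NER download.
NER_MODEL = "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz"
NER_JAR = "stanford-ner/stanford-ner.jar"

# 1. Extract plain text from a document with Tika.
parsed = parser.from_file("policy-document.pdf")  # hypothetical input file
text = parsed["content"] or ""

# 2. Tag tokens with the Stanford NER: PERSON, ORGANIZATION, LOCATION, or O (for "other").
tagger = StanfordNERTagger(NER_MODEL, NER_JAR)
tokens = text.split()  # a real pipeline would use a proper tokenizer
tagged = tagger.tag(tokens)

# 3. Keep only the recognised entity tokens.
entities = [(token, label) for token, label in tagged if label != "O"]
print(entities[:20])
```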

On the second day I was particularly struck by a few papers that were very close to my own interests. Nicole Kearney, from Museum Victoria, talked about her work coordinating the Biodiversity Heritage Library Australia.

This presentation focused on getting value from the documentary heritage of museums (such things as field notes and diaries from scientific expeditions) by using the Atlas of Living Australia’s DigiVol transcription platform to allow volunteers to transcribe the text from digital images, and then publishing the text and images online using the BHL publication platform. In between there is a slightly awkward step in which Nicole converts the CSV files produced by DigiVol into a more standard format for the BHL. I’ve had an interest in text transcription going back to slightly before the time I joined the New Zealand Electronic Text Centre at Victoria University of Wellington; that would’ve been about 2003, which seems like ancient times now.

After that I saw Val Love and Kirsty Cox talk about their journey in migrating the Alexander Turnbull Library’s TAPUHI software to KE EMu. It was an impressive effort, given the historical complexity of TAPUHI and the amount of data analysis required to make sense of its unique model, translate it into a more standard conceptual model, and implement that model in EMu. It’s an immense milestone for the Turnbull, and I hope it will lead in short order to the opening up of the collection metadata to greater reuse.

Finally I want to mention the talk “Missing Links: museum archives as evidence, context and content” from Mike Jones. This was another talk about breaking down barriers between collection management systems in museums: on the one hand, the museum’s collection of objects, and on the other, the institution’s archives. Those archives provide a great deal of context for the collection, but the reality is that the IT infrastructure and social organisation of the two systems are generally quite distinct and separate. Mike’s talk was about integrating cultural heritage knowledge across different organisational structures, domains of professional expertise, data models, and IT systems. I got a shout-out in one slide, in the form of a reference to some experimental work I’d done to convert Museum Victoria’s web API into a Linked Data service.

It’s my view that Linked Data technology offers a practical approach to resolving the complex data integration issues in cultural heritage: it is relatively easy to expose legacy systems, whatever they might be, as Linked Data, and having done so, the task of integrating the data so exposed is also rather straightforward (that’s what Linked Data was invented for, pretty much). To me the problem is how to sell this to an institution: you have to offer the institution itself a “win” for undertaking the work. If the only reward is that they can award themselves five gold stars for public service, that’s not a great reason. You need to be able to deliver tangible value to museums themselves, and this is where I think there’s a gap: in leveraging Linked Data to enhance exhibitions and in-house collection management systems. If we can make it so that there’s value to institutions in creating and also consuming Linked Data, then we may be able to establish a virtuous circle that drives uptake of the technology, and see some progress in the integration of knowledge in the sector.

 

Linked Open Data Visualisation at #GLAMVR16
30 August 2016

On Thursday last week I flew to Perth, in Western Australia, to speak at an event at Curtin University on visualisation of cultural heritage. Erik Champion, Professor of Cultural Visualisation, who organised the event, had asked me to talk about digital heritage collections and Linked Open Data (“LOD”).

The one-day event was entitled “GLAM VR: talks on Digital heritage, scholarly making & experiential media”, and combined presentations and workshops on cultural heritage data (GLAM = Galleries, Libraries, Archives, and Museums) with advanced visualisation technology (VR = Virtual Reality).

The venue was the Curtin HIVE (Hub for Immersive Visualisation and eResearch); a really impressive visualisation facility at Curtin University, with huge screens and panoramic and 3d displays.

There were about 50 people in attendance and over a dozen presenters, covering a lot of different topics, though with common threads linking them together. I really enjoyed the experience and learned a lot. I won’t go into the detail of the other presentations here, but quite a few people were live-tweeting, and I’ve collected most of the Twitter stream from the day into a Storify story, which is well worth reading and following up.

My presentation

For my part, I had 40 minutes to cover my topic. I’d been a bit concerned that my talk was more data-focused and contained nothing specifically about VR, but I think on the day the relevance was actually apparent.

The presentation slides are available here as a PDF: Linked Open Data Visualisation

My aims were:

  • At a tactical level, to explain the basics of Linked Data from a technical point of view (i.e. to answer the question “what is it?”); to show that it’s not as hard as it’s usually made out to be; and to inspire people to get started with generating it, consuming it, and visualising it.
  • At a strategic level, to make the case for using Linked Data as a basis for visualisation; that the discipline of adopting Linked Data technology is not at all a distraction from visualisation, but rather a powerful generic framework on top of which visualisations of various kinds can be more easily constructed, and given the kind of robustness that real scholarly work deserves.

Linked Data basics

I spent the first part of my talk explaining what Linked Open Data means; starting with “what is a graph?” and introducing RDF triples and Linked Data. Finally I showed a few simple SPARQL queries, without explaining SPARQL in any detail, but just to show the kinds of questions you can ask with a few lines of SPARQL code.
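
To give a flavour of what those few lines look like, here is a toy example using Python's rdflib library: it builds a tiny graph of invented statements about two government agencies and then asks it a question in SPARQL. The vocabulary and the data are made up purely for illustration.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")  # an invented namespace for this toy example

g = Graph()
# A handful of triples: two agencies and a "superseded by" relationship between them.
g.add((EX.agency1, RDF.type, EX.Agency))
g.add((EX.agency1, RDFS.label, Literal("Department of Lands")))
g.add((EX.agency2, RDF.type, EX.Agency))
g.add((EX.agency2, RDFS.label, Literal("Department of Crown Lands and Survey")))
g.add((EX.agency1, EX.supersededBy, EX.agency2))

# A few lines of SPARQL: "which agencies were superseded, and by whom?"
query = """
PREFIX ex: <http://example.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?oldLabel ?newLabel WHERE {
    ?old ex:supersededBy ?new .
    ?old rdfs:label ?oldLabel .
    ?new rdfs:label ?newLabel .
}
"""
for row in g.query(query):
    print(f"{row.oldLabel} was superseded by {row.newLabel}")
```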

[Slide: What is an RDF graph?]

While I explained graph data models, I saw attendees nodding, which I took as a sign of understanding and not that they were nodding off to sleep; it was still pretty early in the day for that.

One thing I hoped to get across in this part of the presentation was that Linked Data is not all that hard to get into. Sure, it’s not a trivial technology, but the barriers to entry are not that high; the basics are quite basic, so you can make a start and do plenty of useful things without having to know all the advanced stuff. For instance, there is a whole bunch of RDF serializations, but you can get by with knowing only one. There are a zillion different ontologies, but again you only need to know the ontology you want to use, and you can do plenty of things without worrying about a formal ontology at all. I’d make the case for university eResearch agencies, Software Carpentry, and similar efforts to offer classes and basic support in this technology, especially in library and information science and the humanities generally.
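
As a small illustration of the point about serializations, the following sketch (again Python's rdflib, with invented data) writes the same two triples out as Turtle, N-Triples and RDF/XML: three different spellings of the one graph, of which you only really need to be able to read one. It assumes rdflib version 6 or later, where serialize() returns a string.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/")  # invented namespace, for illustration only

g = Graph()
g.bind("ex", EX)
# Two triples, with no formal ontology: a resource and a human-readable label for it.
g.add((EX.poster, EX.depicts, EX.conference))
g.add((EX.poster, RDFS.label, Literal("Conference poster")))

# The same graph, spelled three different ways.
for fmt in ("turtle", "nt", "xml"):
    print(f"--- {fmt} ---")
    print(g.serialize(format=fmt))
```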

Linked Data as architecture

People often use the analogy of building, when talking about making software. We talk about a “build process”, “platforms”, and “architecture”, and so on. It’s not an exact analogy, but it is useful. Using that analogy, Linked Data provides a foundation that you can build a solid edifice on top of. If you skimp on the foundation, you may get started more quickly, but you will encounter problems later. If your project is small, and if it’s a temporary structure (a shack or bivouac), then architecture is not so important, and you can get away with skimping on foundations (and you probably should!), but the larger the project is (an office building), and the longer you want it to persist (a cathedral), the more valuable a good architecture will be. In the case of digital scholarly works, the common situation in academia is that weakly-architected works are being cranked out and published, but being hard to maintain, they tend to crumble away within a few years.

Crucially, a Linked Data dataset can capture the essence of what needs to be visualised, without being inextricably bound up with any particular genre of visualisation, or any particular visualisation software tool. This relative independence from specific tools is important because a dataset which is tied to a particular software platform needs to rely on the continued existence of that software, and experience shows that individual software packages come and go depressingly quickly. Often only a few years are enough for a software program to be “orphaned”, unavailable, obsolete, incompatible with the current software environment (e.g. requires Windows 95 or IE6), or even, in the case of software available online as a service, for it to completely disappear into thin air, if the service provider goes bust or shuts down the service for reasons of their own. In these cases you can suddenly realise you’ve been building your “scholarly output” on sand.

By contrast, a Linked Data dataset is standardised, and it’s readable with a variety of tools that support that standard. That provides you with a lot of options for how you could go on to visualise the data; that generic foundation gives you the possibility of building (and rebuilding) all kinds of different things on top of it.

Because of its generic nature and its openness to the Web, Linked Data technology has become a broad software ecosystem which already has a lot of people’s data riding on it; that kind of mass investment (a “bandwagon”, if you like) is insurance against it being wiped out by the whims or vicissitudes of individual businesses. That’s the major reason why a Linked Data dataset can be archived and stored long term with confidence.

Linked Open Data is about sharing your data for reuse

Finally, by publishing your dataset as Linked Open Data (independently of any visualisations you may have made of it), you are opening it up to reuse not only by yourself, but by others.

The graph model allows you to describe the meaning of the terms you’ve used (i.e. the analytical categories used in your data can themselves be described and categorised, because everything is a node in a graph). This means that other people can work out what your dataset actually means.

The use of URIs for identifiers means that others can easily cite your work and effectively contribute to your work by creating their own annotations on it. They don’t need to impinge on your work; their annotations can live somewhere else altogether and merely refer to nodes in your graph by those nodes’ identifiers (URIs). They can comment; they can add cross-references; they can assert equivalences to nodes in other graphs, elsewhere. Your scholarly work can break out of its box, to become part of an open web of knowledge that grows and ramifies and enriches us all.
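
To make that concrete, here is a hypothetical sketch of what such an external annotation might look like: someone else builds their own small graph, somewhere else entirely, that refers to a node in your dataset purely by its URI, adding a comment and asserting an equivalence to a node in another graph. All the URIs are placeholders invented for the example.

```python
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import OWL, RDFS

# A node in *your* published Linked Data (a placeholder URI).
your_node = URIRef("http://example.org/dataset/agency/123")

# Someone else's annotation graph, which can live anywhere on the web.
annotations = Graph()
annotations.add((your_node, RDFS.comment,
                 Literal("This agency also appears in our regional archives index.")))
# A cross-reference asserting equivalence with a node in another dataset (placeholder URI).
annotations.add((your_node, OWL.sameAs,
                 URIRef("http://another-example.org/entity/456")))

print(annotations.serialize(format="turtle"))
```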

Visualizing Government Archives through Linked Data
5 April 2016

Tonight I’m knocking back a gin and tonic to celebrate finishing a piece of software development for my client the Public Record Office Victoria; the archives of the government of the Australian state of Victoria.

The work, which will go live in a couple of weeks, was an update to a browser-based visualization tool which we first set up last year. In response to user testing, we made some changes to improve the visualization’s usability. It certainly looks a lot clearer than it did, and the addition of some online help makes it a bit more accessible for first-time users.

The visualization now looks like this (here showing the entire dataset, unfiltered, which is not actually that useful, though it is quite pretty):

[Screenshot: the PROVisualizer, showing the entire unfiltered dataset]

The bulk of the work, though, was to automate the preparation of data for the visualization.

Up until now, the dataset which you could visualize consisted of a couple of CSV files, manually assembled with considerable care and effort from reports exported from PROV’s repository “Archives One”. In the new system, this manual work will not need to be repeated. Instead, the same dataset will be assembled by an automated metadata-processing pipeline which will keep it continually up to date as government agencies and functions change over time.

It was not as big a job as you might think, since in fact a lot of the work to generate the data had already been done.

PROV’s Interoperable Data service

In 2012, in collaboration with their counterpart agency State Records New South Wales, PROV had set up an Interoperable Data publishing service with funding from the Australian National Data Service. They custom-built some software to export data from Archives One to produce a set of metadata records in RIF-CS format, and they deployed an off-the-shelf software application (an “OAI-PMH Repository”) to disseminate those metadata records over the web.
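
For anyone curious about what harvesting from such a repository involves on the client side, here is a rough Python sketch using the Sickle OAI-PMH library (the actual pipeline, described below, is written in XProc rather than Python). The endpoint URL and metadata prefix are assumptions for illustration; the real values depend on how the repository is configured, and can be discovered from its Identify and ListMetadataFormats responses.

```python
from sickle import Sickle  # a small Python client library for the OAI-PMH protocol

# Hypothetical endpoint and metadata prefix, for illustration only.
OAI_ENDPOINT = "http://example.org/oai"
METADATA_PREFIX = "rif"  # i.e. asking for RIF-CS records

client = Sickle(OAI_ENDPOINT)
records = client.ListRecords(metadataPrefix=METADATA_PREFIX, ignore_deleted=True)

for record in records:
    # Each record arrives as XML: record.header.identifier is its OAI identifier,
    # and record.raw is the full XML string, ready to be transformed downstream.
    print(record.header.identifier)
```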

Originally, the OAI-PMH repository was serving data to the Australian National Data Service, which runs an aggregation service called Research Data Australia, which offers researchers pointers to all manner of scientific, historical and cultural datasets. The PROV metadata, covering the full history of government records in Victoria, is a useful resource for social science researchers, genealogists, historians, and others.

More recently, PROV’s OAI-PMH repository has also been harvested by the National Library of Australia’s Trove service.

Now at last it will be harvested by the Public Record Office itself.

The data pipeline

The software I’ve written is a web application built in XProc, a programming language for data pipelines. It is open source and available on GitHub, in a repository with the ludicrously acronymous title PROV-RIF-SPARQL.

This XProc application tediously harvests the metadata records (there are more than 30000 of them) and converts each one from RIF-CS format into RDF/XML format. The RDF/XML data is a reformulation of the RIF-CS in which the hierarchical structures of the RIF-CS are re-expressed as a network of interconnected statements; a kind of web of nodes and links which mathematicians call a “graph”. The statements in these graphs are expressed using the international standard conceptual framework for cultural heritage data; the CIDOC-CRM. My harvester then stores all these RDF/XML documents (or “graphs”) in a SPARQL Graph Store (a kind of hybrid document store and database). The SPARQL Graph Store allows each graph to be addressed individually, but also for the entire dataset to be treated as a single graph, and queried as a whole. Finally, the RDF dataset is queried to produce the two summarised data files which the visualization itself requires; these are simple spreadsheets in CSV (Comma Separated Values) format. One table contains information about each government agency or function, and the other table lists the relationships which have historically existed between those agencies and functions.
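
The real pipeline is written in XProc, but the shape of its last two stages can be sketched in Python against the standard SPARQL 1.1 HTTP protocols: store each converted record as a named graph in the graph store, then query the whole dataset and save the result as CSV. The endpoint URLs, graph URI and query below are placeholders, and the RIF-CS to CIDOC-CRM conversion is reduced to a stub.

```python
import requests

# Hypothetical endpoints for a SPARQL Graph Store and its query service.
GRAPH_STORE = "http://localhost:3030/prov/data"    # SPARQL 1.1 Graph Store HTTP Protocol
SPARQL_QUERY = "http://localhost:3030/prov/query"  # SPARQL 1.1 Protocol (query endpoint)


def rifcs_to_rdfxml(rifcs_xml: str) -> str:
    """Stub: in the real pipeline an XSLT transformation maps RIF-CS to CIDOC-CRM RDF/XML."""
    raise NotImplementedError


def store_graph(graph_uri: str, rdf_xml: str) -> None:
    """Store one RDF/XML document as a named graph, addressable by its own URI."""
    response = requests.put(
        GRAPH_STORE,
        params={"graph": graph_uri},
        data=rdf_xml.encode("utf-8"),
        headers={"Content-Type": "application/rdf+xml"},
    )
    response.raise_for_status()


def query_to_csv(query: str, csv_path: str) -> None:
    """Run a SPARQL SELECT over the aggregated dataset and save the result table as CSV."""
    response = requests.post(
        SPARQL_QUERY,
        data={"query": query},
        headers={"Accept": "text/csv"},  # many SPARQL endpoints can return results as CSV
    )
    response.raise_for_status()
    with open(csv_path, "w", encoding="utf-8") as f:
        f.write(response.text)


# Placeholder query: the real one selects agencies, functions and their relationships.
AGENCIES_QUERY = """
SELECT ?entity ?label WHERE { ?entity ?labelProperty ?label }
"""
# query_to_csv(AGENCIES_QUERY, "agencies.csv")
```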

The harvester has a basic user interface where you can start a data harvest; a process that takes about half an hour to complete. In this interface you can specify the location of the OAI-PMH server you want to harvest data from, the format of the data you want to harvest, and the location of the SPARQL Graph Store where you want to store the result, amongst other parameters. In practice, this user interface isn’t used by a human (except during testing); another small program running on a regular schedule makes the request.

[Screenshot: the harvester's user interface]

At this stage of the project, the RDF graph is only used internally at PROV, where it functions purely as an intermediary between the RIF-CS input and the CSV output. The RDF data and the SPARQL database together just provide a convenient way to aggregate a big set of records and query the resulting aggregation. But I have no doubt that the RDF data will later be published directly as Linked Open Data, opening it up and allowing it to be connected into a world-wide web of data.
