LODLive – Conal Tuohy's blog

Linked Open Data Visualisation at #GLAMVR16

Conal — Tue, 30 Aug 2016 02:02:10 +0000

On Thursday last week I flew to Perth, in Western Australia, to speak at an event at Curtin University on visualisation of cultural heritage. Erik Champion, Professor of Cultural Visualisation, who organised the event, had asked me to talk about digital heritage collections and Linked Open Data (“LOD”).

The one-day event was entitled “GLAM VR: talks on Digital heritage, scholarly making & experiential media”, and combined presentations and workshops on cultural heritage data (GLAM = Galleries, Libraries, Archives, and Museums) with advanced visualisation technology (VR = Virtual Reality).

The venue was the Curtin HIVE (Hub for Immersive Visualisation and eResearch), a really impressive visualisation facility at Curtin University, with huge screens and panoramic and 3D displays.

There were about 50 people in attendance, and there would have been over a dozen different presenters, covering a lot of different topics, though with common threads linking them together. I really enjoyed the experience, and learned a lot. I won’t go into the detail of the other presentations here, but quite a few people were live-tweeting, and I’ve collected most of the Twitter stream from the day into a Storify story, which is well worth reading and following up.

My presentation

For my part, I had 40 minutes to cover my topic. I’d been a bit concerned that my talk was more data-focused and contained nothing specifically about VR, but I think on the day the relevance was actually apparent.

The presentation slides are available here as a PDF: Linked Open Data Visualisation

My aims were:

At a tactical level, to explain the basics of Linked Data from a technical point of view (i.e. to answer the question “what is it?”); to show that it’s not as hard as it’s usually made out to be; and to inspire people to get started with generating it, consuming it, and visualising it.
At a strategic level, to make the case for using Linked Data as a basis for visualisation; that the discipline of adopting Linked Data technology is not at all a distraction from visualisation, but rather a powerful generic framework on top of which visualisations of various kinds can be more easily constructed, and given the kind of robustness that real scholarly work deserves.

Linked Data basics

I spent the first part of my talk explaining what Linked Open Data means; starting with “what is a graph?” and introducing RDF triples and Linked Data. Finally I showed a few simple SPARQL queries, without explaining SPARQL in any detail, but just to show the kinds of questions you can ask with a few lines of SPARQL code.
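To make the idea of triples and pattern-style queries concrete, here is a minimal sketch in plain Python rather than in RDF or SPARQL proper. The `ex:` identifiers and the `match` helper are invented for illustration; a real project would use an RDF library and a SPARQL endpoint.

```python
# A graph is just a set of (subject, predicate, object) triples.
# All identifiers below are made up for illustration.
graph = {
    ("ex:teacup", "rdf:type", "ex:Cup"),
    ("ex:teacup", "ex:madeBy", "ex:pottery"),
    ("ex:saucer", "rdf:type", "ex:Saucer"),
    ("ex:saucer", "ex:madeBy", "ex:pottery"),
    ("ex:pottery", "ex:name", "Example Pottery Co."),
}

def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# "What did the pottery make?" is roughly the SPARQL pattern
#   SELECT ?thing WHERE { ?thing ex:madeBy ex:pottery }
made_by_pottery = [s for s, _, _ in match(graph, p="ex:madeBy", o="ex:pottery")]
print(made_by_pottery)
```

The point is only that a query is a triple with some positions left blank; SPARQL adds joins, filters and much more on top of that idea.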

What is an RDF graph?

While I explained about graph data models, I saw attendees nodding, which I took as a sign of understanding and not that they were nodding off to sleep; it was still pretty early in the day for that.

One thing I hoped to get across in this part of the presentation was that Linked Data is not all that hard to get into. Sure, it’s not a trivial technology, but the barriers to entry are not that high; the basics are quite simple, so you can make a start and do plenty of useful things without having to know all the advanced stuff. For instance, there are a whole bunch of RDF serializations, but in fact you can get by with knowing only one. There are a zillion different ontologies, but again you only need to know the ontology you want to use, and you can do plenty of things without worrying about a formal ontology at all. I’d make the case for university eResearch agencies, software carpentry, and similar efforts to be offering classes and basic support in this technology, especially in library and information science, and the humanities generally.

Linked Data as architecture

People often use the analogy of building, when talking about making software. We talk about a “build process”, “platforms”, and “architecture”, and so on. It’s not an exact analogy, but it is useful. Using that analogy, Linked Data provides a foundation that you can build a solid edifice on top of. If you skimp on the foundation, you may get started more quickly, but you will encounter problems later. If your project is small, and if it’s a temporary structure (a shack or bivouac), then architecture is not so important, and you can get away with skimping on foundations (and you probably should!), but the larger the project is (an office building), and the longer you want it to persist (a cathedral), the more valuable a good architecture will be. In the case of digital scholarly works, the common situation in academia is that weakly-architected works are being cranked out and published, but being hard to maintain, they tend to crumble away within a few years.

Crucially, a Linked Data dataset can capture the essence of what needs to be visualised, without being inextricably bound up with any particular genre of visualisation, or any particular visualisation software tool. This relative independence from specific tools is important because a dataset which is tied to a particular software platform needs to rely on the continued existence of that software, and experience shows that individual software packages come and go depressingly quickly. Often only a few years are enough for a software program to be “orphaned”, unavailable, obsolete, incompatible with the current software environment (e.g. requires Windows 95 or IE6), or even, in the case of software available online as a service, for it to completely disappear into thin air, if the service provider goes bust or shuts down the service for reasons of their own. In these cases you can suddenly realise you’ve been building your “scholarly output” on sand.

By contrast, a Linked Data dataset is standardised, and it’s readable with a variety of tools that support that standard. That provides you with a lot of options for how you could go on to visualise the data; that generic foundation gives you the possibility of building (and rebuilding) all kinds of different things on top of it.

Because of its generic nature and its openness to the Web, Linked Data technology has become a broad software ecosystem which already has a lot of people’s data riding on it; that kind of mass investment (a “bandwagon”, if you like) is insurance against it being wiped out by the whims or vicissitudes of individual businesses. That’s the major reason why a Linked Data dataset can be archived and stored long term with confidence.

Linked Open Data is about sharing your data for reuse

Finally, by publishing your dataset as Linked Open Data (independently of any visualisations you may have made of it), you are opening it up to reuse not only by yourself, but by others.

The graph model allows you to describe the meaning of the terms you’ve used (i.e. the analytical categories used in your data can themselves be described and categorised, because everything is a node in a graph). This means that other people can work out what your dataset actually means.

The use of URIs for identifiers means that others can easily cite your work and effectively contribute to your work by creating their own annotations on it. They don’t need to impinge on your work; their annotations can live somewhere else altogether and merely refer to nodes in your graph by those nodes’ identifiers (URIs). They can comment; they can add cross-references; they can assert equivalences to nodes in other graphs, elsewhere. Your scholarly work can break out of its box, to become part of an open web of knowledge that grows and ramifies and enriches us all.
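As a rough illustration of why shared URIs make this kind of third-party annotation cheap, the sketch below treats each graph as a Python set of triples. Every URI and the `ex:commentsOn` property are hypothetical; only `owl:sameAs` and `rdfs:label` are names of real, widely used properties.

```python
# The author's published graph (identifiers are hypothetical).
my_graph = {
    ("http://example.org/mine/teacup", "rdfs:label", "Moss Rose teacup"),
}

# Someone else's annotations, published elsewhere, pointing at the same URI.
their_graph = {
    ("http://example.net/notes/1", "ex:commentsOn", "http://example.org/mine/teacup"),
    ("http://example.org/mine/teacup", "owl:sameAs", "http://example.com/other/cup-42"),
}

# Because both graphs name the teacup with the same URI, combining them is set union.
merged = my_graph | their_graph

def about(graph, uri):
    """All triples that mention the URI as subject or object."""
    return sorted(t for t in graph if uri in (t[0], t[2]))

teacup_facts = about(merged, "http://example.org/mine/teacup")
print(len(teacup_facts))
```

Note that the original graph is untouched: the annotations live in a separate set and only meet the original data when a consumer chooses to merge them.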

Bridging the conceptual gap: Museum Victoria’s collections API and the CIDOC Conceptual Reference Model

Conal — Wed, 21 Oct 2015 14:44:33 +0000

A Museum Victoria LOD graph about a teacup, shown using the LODLive visualizer.

This is the third in a series of posts about an experimental Linked Open Data (LOD) publication based on the web API of Museum Victoria.

The first post gave an introduction and overview of the architecture of the publication software, and the second dealt quite specifically with how names and identifiers work in the LOD publication software.

In this post I’ll cover how the publication software takes the data published by Museum Victoria’s API and reshapes it to fit a common conceptual model for museum data, the “Conceptual Reference Model” published by the documentation committee of the International Council of Museums. I’m not going to exhaustively describe the translation process (you can read the source code if you want the full story), but I’ll include examples to illustrate the typical issues that arise in such a translation.

The CIDOC Conceptual Reference Model

The CIDOC CRM, as it’s usually called, is a system of concepts for analysing and describing the content of museum collections. It is not intended to be a replacement for the Collection Management Systems which museums use to store their data; it is rather intended to function as a kind of lingua franca, through which content from a variety of systems can be expressed in a generally intelligible way.

The Conceptual Reference Model covers a wide range of museological concerns: items can be described in terms of their materials and mode of construction, as well as by who made them, where and when, and for what purpose.

The CRM also provides a framework to describe the events in which objects are broken into pieces, or joined to other objects, damaged or repaired, created or utterly destroyed. Objects can be described in terms of the symbolic and intellectual content which they embody, which are themselves treated as “intellectual objects”. The lineage of intellectual influence can be described, either speculatively, in a high-level way, or by explicitly tracing and documenting the influences that are known to have taken place at particular times and locations. The legal history of objects can also be traced through transfer of ownership and custody, commission, sale and purchase, theft and looting, loss and discovery. Where the people involved in these histories are known, they too can be named and described and their histories interwoven with those of other people, objects, and ideas.

Core concepts and additional classification schemes

The CRM framework is quite high level. Only a fairly small number of very general types of thing are defined in the CRM: only concepts general enough to be useful for any kind of museum, whether a museum of computer games or of classical antiquity. Each of these concepts is identified by an alphanumeric code and an English-language name. In addition, the CRM framework allows for arbitrary typologies to be added on, to be used for further classifying pretty much anything. This is to allow all the terms from any classification system used in a museum to be exported directly into a CRM-based dataset, simply by describing each term as an “E55 Type”. In short, the CRM consists of a fairly fixed common core, supplemented by a potentially infinite number of custom vocabularies which can be used to make fine distinctions of whatever kind are needed.

Therefore, a dataset based on the CRM will generally be directly comparable with another dataset only in terms of the core CRM-defined entities. The different classification schemes used by different datasets remain “local” vocabularies. To achieve full interoperability between datasets, these distinct typologies would additionally need to be aligned, by defining a “mapping” table which lists the equivalences or inequivalences between the terms in the two vocabularies. For instance, such a table might say that the term “moulded” used in Museum Victoria’s collection is more or less the same classification as “molding (forming)” in the Getty Art & Architecture Thesaurus.
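A mapping table of this kind can be very small. The sketch below is one possible shape for it, assuming the alignment is expressed with the SKOS mapping property `skos:closeMatch` (a reasonable fit for “more or less the same”). Both URIs are placeholders, not real Museum Victoria or Getty identifiers, and `translate` is a hypothetical helper.

```python
# Hypothetical identifiers: neither base URI is a real Museum Victoria or Getty one.
MV = "http://example.org/mv/technique/"
AAT = "http://example.org/aat/"

# Each row: (local term, relationship, term in the other vocabulary).
mapping = [
    (MV + "moulded", "skos:closeMatch", AAT + "molding-forming"),
]

def translate(term, mapping):
    """Return the terms in the other vocabulary aligned to the given local term."""
    return [target for source, _, target in mapping if source == term]

print(translate(MV + "moulded", mapping))
print(translate(MV + "glazed", mapping))  # no alignment recorded, so an empty list
```

Keeping the relationship in the table, rather than assuming every row is an exact equivalence, leaves room to record weaker or negative alignments later.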

Change happens through “events”

To model how things change through time, the CRM uses the notion of an “event”. The production of a physical object, for instance, is modelled as an E12 Production event (NB concepts in the CRM are all identified by an alphanumeric code). This production event is linked to the object which it produced, as well as to the person or persons who played particular creative roles in that event. The event may also have a date and place associated with it, and may be linked to the materials and to the method used in the event.

On a somewhat philosophical note, this focus on discrete events is justified by the fact that not all of history is continuously documented, and we necessarily have a fragmentary knowledge of the history of any given object. Often a museum object will have passed through many hands, or will have been modified many times, and not all of this history is known in any detail. If we know that person A created an object for person B, and that X years later the object turned up in the hands of person C, we can’t assume that the object remained in person B’s hands all those X years. A data model which treated “ownership” as a property of an object would be liable to make such inflated claims to knowledge which is simply not there. Person C may have acquired it at any point during that period, and indeed there may have been many owners in between person B and person C. This is why it makes sense to document an object’s history in terms of the particular events which are known and attested to.
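The difference between an event-based record and an “owner” property can be sketched as follows. This is not CRM itself, just an illustration of the reasoning: the `Event` class, the years, and the `attested_holder` function are all invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str    # what happened, e.g. "production" or "custody"
    year: int    # the year for which the event is attested
    actor: str   # who was involved

# Only the attested events are recorded; people and years are invented.
history = [
    Event("production", 1850, "person A"),
    Event("custody", 1850, "person B"),
    Event("custody", 1880, "person C"),
]

def attested_holder(history, year):
    """Return the holder only if custody is documented for that exact year."""
    for event in history:
        if event.kind == "custody" and event.year == year:
            return event.actor
    return None  # no evidence either way

print(attested_holder(history, 1880))
print(attested_holder(history, 1865))
```

A model with a single “owner” field would be forced to answer “person B” for 1865; the event list makes the gap in our knowledge explicit instead.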

Museum Victoria’s API

How does Museum Victoria’s data fit in terms of the CIDOC model?

In general the model works pretty well for Museum Victoria, though there are also things in MV’s data which are not so easy to express in the CRM.

Items

Museum Victoria describes items as “Things made and used by people”. These correspond exactly to the notion of E22 Man-Made Object in the CIDOC CRM (if you can excuse the sexist language), described as comprising “physical objects purposely created by human activity.”

Every MV item is therefore expressed as an E22 Man-Made Object.

Titles

Museum Victoria’s objects have an objectName property which is a simple piece of text; a name or title. In the CIDOC CRM, the name of an object is something more complex; it’s an entity in its own right, called an E41 Appellation. The reason why a name is treated as more than just a simple property of an object is that in the CRM, it must be possible to treat an object’s name as an historical phenomenon; after all, it will have been applied to an object by a particular person (the person who created the object, perhaps, or an archaeologist who dug it out of the ground, or a curator or historian), at some historical point in time. An object may have a number of different names, each given it by different people, and used by different people at different times.

However, because the Museum Victoria names are simple (a single label) we can ignore most of that complexity. We only need to define an E41 Appellation whose value is the name, and link the E41 Appellation to the E22 Man-Made Object using a P1 is identified by association.
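In triple form, that mapping might look something like the sketch below. The `crm:` names follow the usual RDF spelling of the CRM classes and properties; the item URI, the `#name` suffix for the appellation, and the choice of `rdf:value` to carry the text are assumptions made for illustration, not necessarily what the publication software does.

```python
def title_triples(item_uri, object_name):
    """Turn a plain objectName string into an E41 Appellation linked to the item."""
    appellation_uri = item_uri + "#name"  # hypothetical URI scheme for the name entity
    return [
        (item_uri, "rdf:type", "crm:E22_Man-Made_Object"),
        (item_uri, "crm:P1_is_identified_by", appellation_uri),
        (appellation_uri, "rdf:type", "crm:E41_Appellation"),
        (appellation_uri, "rdf:value", object_name),  # assumed property for the text
    ]

triples = title_triples("http://example.org/item/1", "Teacup")
for triple in triples:
    print(triple)
```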

Articles, Items and their relationships

The MV API provides access to a number of “articles” which are documents related to the Museum’s collection. For each article, the API shows a list of the related collection items; and for each item, you can get the corresponding list of related articles. Although the exact nature of the relationship isn’t made explicit, it’s reasonable to assume that an item is in some way documented by the articles that are related to it. In the CIDOC CRM, such an article is considered an E31 Document, and it bears a P70 documents relationship to the item which it’s about.

If the relationship between an item and article is easy to guess, there are a couple of other relationships which are a little less obvious: an article also has a list of related articles, and each item also has a list of related items. What is the nature of those relationships? In what way exactly does article X relate to article Y, or item A to item B? The MV API’s documentation doesn’t say, and it wouldn’t surprise me if the Museum’s collection management system leaves this question up to the curators’ judgement.

A bit of empirical research seemed called for. I checked a few of the related items and the examples I found seemed to fall into two categories:

One item is a photograph depicting another item (the specific relationship here is really “depicts”)
Two items are both photographs of the same subject (the relationship is “has the same subject as”).

Obviously there are two different kinds of relationship here in the Museum’s collection, both of them presented (through the API) in the same way. As a human, I can tell them apart, but my proxy software is not going to be able to. So I need to find a more general relationship which subsumes both the relationships above, and fortunately, the CIDOC CRM includes such a relationship, namely P130 shows features of.

This property generalises the notions of “copy of” and “similar to” into a dynamic, asymmetric relationship, where the domain expresses the derivative, if such a direction can be established. Otherwise, the relationship is symmetric. It is a shortcut of P15 was influenced by (influenced) in a creation or production, if such a reason for the similarity can be verified. Moreover it expresses similarity in cases that can be stated between two objects only, without historical knowledge about its reasons.

For example, I have a photograph of a piece of computer hardware (which is the relatedItem), and the photo is therefore a kind of derivative of the hardware (though the Museum Victoria API doesn’t tell me which of the objects was the original and which the derivative). In another example I have two photos of the same house; here there’s a similarity which is not due to one of the photos being derived from the other.

Ideally, it would be preferable to be able to represent these kinds of relationships more precisely; for instance, in the case of the two photos of the house, one could generate a resource that denotes the actual physical house itself, and link that to the photographs, but because the underlying data doesn’t include this information in a machine-readable form, the best we can do is to say that the two photos are similar.
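The item and article relationships described in this section could be generated along the following lines. The record layout and the `relatedItems` and `relatedArticles` field names are guesses at the shape of the API’s JSON, and the URIs are placeholders; only the CRM class and property names come from the model itself.

```python
# A hypothetical item record, loosely shaped like an API response.
item = {
    "id": "http://example.org/item/photo-1",
    "relatedItems": ["http://example.org/item/computer-1"],
    "relatedArticles": ["http://example.org/article/9"],
}

def relationship_triples(item):
    """Map related items to P130 and related articles to E31 Document / P70."""
    triples = []
    for other in item.get("relatedItems", []):
        # The API does not say which item is the derivative, so the
        # direction of P130 here simply follows the record being read.
        triples.append((item["id"], "crm:P130_shows_features_of", other))
    for article in item.get("relatedArticles", []):
        triples.append((article, "rdf:type", "crm:E31_Document"))
        triples.append((article, "crm:P70_documents", item["id"]))
    return triples

triples = relationship_triples(item)
for triple in triples:
    print(triple)
```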

Production techniques

Some of the items in the Museum’s collection are recorded as having been produced using a certain “technique”. For instance, archaeological artefacts in the MV collection have a property called archeologyTechnique, which contains the name of a general technique, such as moulded, in the case of certain ceramic items.

This corresponds to the CRM concept P32 used general technique, which is described like so:

This property identifies the technique or method that was employed in an activity. These techniques should be drawn from an external E55 Type hierarchy of consistent terminology of general techniques or methods such as embroidery, oil-painting, carbon dating, etc.

Note that in CIDOC this “general technique” used to manufacture an object is not a property of the object itself; it’s a property of the activity which produced the object (i.e. the whole process in which the potter pressed clay into a mould, glazed the cup, and fired it in a kiln).

Note also that, for the CIDOC CRM, the production technique used in making these tea-cups is not the text string “moulded”; it is actually an abstract concept identified by a URI. The string “moulded” is just a human-readable name attached as a property of that concept. That same concept might very well have a number of other names in other languages, or even in English there’s the American variant spelling “molded”, and synonyms such as “cast” that could all be alternative names for the same concept.

Translating a Museum Victoria item with a technique into the CRM therefore involves identifying three entities:

the object itself (an E22 Man-Made Object);
the production of the object (an E12 Production activity);
the technique used in the course of that activity to produce the object (an E55 Type of technique)

These three entities are then linked together:

The production event is linked to the technique by “P32 used general technique”; and
The production event is linked to the object itself by “P108 has produced” (edit: originally given here as “P94 has created”).
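Put together, the three entities and two links can be sketched as a small function that turns an item and its technique string into triples. The `#production` suffix, the technique base URI, and the use of `rdfs:label` for the human-readable name are illustrative assumptions.

```python
TECHNIQUE_BASE = "http://example.org/technique/"  # hypothetical vocabulary namespace

def production_triples(item_uri, technique_label):
    """Model a technique as a property of the production event, not of the object."""
    production_uri = item_uri + "#production"       # hypothetical URI for the event
    technique_uri = TECHNIQUE_BASE + technique_label.lower().replace(" ", "-")
    return [
        (item_uri, "rdf:type", "crm:E22_Man-Made_Object"),
        (production_uri, "rdf:type", "crm:E12_Production"),
        (technique_uri, "rdf:type", "crm:E55_Type"),
        # The string is only a label on the concept, not the concept itself.
        (technique_uri, "rdfs:label", technique_label),
        (production_uri, "crm:P32_used_general_technique", technique_uri),
        (production_uri, "crm:P108_has_produced", item_uri),
    ]

triples = production_triples("http://example.org/item/1", "moulded")
for triple in triples:
    print(triple)
```

Because the technique has its own URI, further labels (“molded”, or names in other languages) can be attached to the same concept later without touching the item data.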

Notes

The items, articles, specimens and species in the Museum’s API are all already first-class objects and can be easily represented as concepts in Linked Data. The archeologyTechnique field also has a fairly restricted range of values, and each of those values (such as “moulded”) can be represented as a Linked Data concept as well. But there are a number of other fields in the Museum’s API which are in the form of relatively long pieces of descriptive text. An object’s objectSummary field, for instance, contains a long piece of text which describes the object in context. Here’s the objectSummary of one of our moulded tea cups:

This reconstructed cup was excavated at the Commonwealth Block site between 1988 and 2003. There is a matching saucer that was found with it. The pattern is known as 'Moss Rose' and was made between 1850 and 1851 by Charles Meigh, Son & Pankhurst in Hanley, Staffordshire, England.

Homewares. Numerous crockery pieces were found all over the Little Lon site. Crockery gives us a glimpse of everyday life in Melbourne in the 1880s. In the houses around Little Lon, residents used decorated crockery. Most pieces were cheap earthenware or stoneware, yet provided colour and cheer. Only a few could afford to buy matching sets, and most china was probably acquired second-hand. Some were once expensive pieces. Householders mixed and matched their crockery from the great range of mass-produced designs available. 'Blue and white' and the 'willow' pattern, was the most popular choice and was produced by English potteries from 1790.

It’s not quite as long as an “article” but it’s not far off it. Another textual property is called physicalDescription, and has a narrower focus on the physical nature of the item:

This is a glazed earthenware teacup which has been reconstructed. It is decorated with a blue or black vine and leaf design around outside and inside of the cup which is known as 'Moss Rose' pattern.

The CIDOC CRM does include concepts related to the historical context and the physical nature of items, but it’s not at all easy to extract that detailed information from the descriptive prose of these, and similar fields. Because the information is stored in a long narrative form, it can’t be easily mapped to the denser data structure of a Linked Data graph. The best we can hope to do with these fields is to treat them as notes attached to the item.

The CIDOC CRM includes a concept for attaching a note: P3 has note. But to represent these two different types of note, it’s necessary to extend the CRM by creating two new, specialized versions (“sub-properties”) of the property called P3 has note, which I’ve called P3.1 objectSummary and P3.1 physicalDescription.
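The practical benefit of declaring these as sub-properties is that a generic CRM-aware consumer can still find them as notes. A minimal sketch of that inference follows; the `ex:` property names, the item URI and the note texts are placeholders, and the lookup table stands in for `rdfs:subPropertyOf` statements.

```python
# Stand-in for rdfs:subPropertyOf declarations (property names are hypothetical).
SUBPROPERTY_OF = {
    "ex:objectSummary": "crm:P3_has_note",
    "ex:physicalDescription": "crm:P3_has_note",
}

item = "http://example.org/item/1"
triples = [
    (item, "ex:objectSummary", "A summary of the object in context."),
    (item, "ex:physicalDescription", "A description of the physical object."),
    (item, "rdfs:label", "Teacup"),
]

def notes_of(triples, subject):
    """Find every note on the subject, including those stated via a sub-property."""
    return [
        o for s, p, o in triples
        if s == subject
        and (p == "crm:P3_has_note" or SUBPROPERTY_OF.get(p) == "crm:P3_has_note")
    ]

notes = notes_of(triples, item)
print(notes)
```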

Summary

It’s possible to recognise three distinct patterns in mapping an API such as Museum Victoria’s to a Linked Data model like the CIDOC CRM.

Where the API provides access to a set of complex data objects of a particular type, these can be mapped straightforwardly to a corresponding class of Linked Data resources (e.g. the items, species, specimens, and articles in MV’s API).
Where the API exposes a simple data property, it can be straightforwardly converted to a Linked Data property (e.g. the two types of notes, in the example above).
Where the API exposes a simple data property whose values come from a fairly limited range (a “vocabulary”), then those individual property values can be assigned identifiers of their own, and effectively promoted from simple data properties to full-blown object properties (e.g. the production techniques in Museum Victoria’s API).

Conclusion

It’s been an interesting experiment to generate Linked Open Data from an open API using a simple proxy. I think it shows that the technique is a viable mechanism for institutions to break into the LOD cloud and contribute their collection in a standardised manner, without necessarily having to make any changes to their existing systems or invest in substantial software development work. To my mind, making that first step is a significant barrier that holds institutions and individuals back from realising the potential in their data. Once you have a system for publishing LOD, you are opening up a world of possibilities for external developers, data aggregators, and humanities researchers. If your data is of interest to those external groups, you have the possibility of generating some significant returns on your investment, and of “harvesting” some of that work back into your institution’s own web presence in the form of better visualizations, discovery interfaces, and better understanding of your own collection.

Before the end of the year I hope to explore some further possibilities in the area of user interfaces based on Linked Data, to show some of the value that these Linked Data publishing systems can support.