CIDOC-CRM – Conal Tuohy's blog http://conaltuohy.com The blog of a digital humanities software developer Wed, 28 Jun 2017 23:15:33 +0000 en-AU hourly 1 https://wordpress.org/?v=5.1.10 http://conaltuohy.com/blog/wp-content/uploads/2017/01/conal-avatar-with-hat-150x150.jpg CIDOC-CRM – Conal Tuohy's blog http://conaltuohy.com 32 32 74724268 Visualizing Government Archives through Linked Data http://conaltuohy.com/blog/visualizing-government-archives-through-linked-data/ http://conaltuohy.com/blog/visualizing-government-archives-through-linked-data/#comments Tue, 05 Apr 2016 13:41:00 +0000 http://conaltuohy.com/?p=383 Continue reading Visualizing Government Archives through Linked Data]]> Tonight I’m knocking back a gin and tonic to celebrate finishing a piece of software development for my client the Public Record Office Victoria; the archives of the government of the Australian state of Victoria.

The work, which will go live in a couple of weeks, was an update to a browser-based visualization tool which we first set up last year. In response to user testing, we made some changes to improve the visualization’s usability. It certainly looks a lot clearer than it did, and the addition of some online help makes it a bit more accessible for first-time users.

The visualization now looks like this (here showing the entire dataset, unfiltered, which is not actually that useful, though it is quite pretty):

provisualizer

The bulk of the work, though, was to automate the preparation of data for the visualization.

Up until now, the dataset which you could visualize consisted of a couple of CSV files, manually assembled with considerable care and effort from reports exported from PROV’s repository “Archives One”. In the new system, this manual work will not need to be repeated. Instead, the same dataset will be assembled by an automated metadata-processing pipeline which will keep it continually up to date as government agencies and functions change over time.

It was not as big as job as you might think, since in fact a lot of the work to generate the data had already been done.

PROV’s Interoperable Data service

In 2012, in collaboration with their counterpart agency State Records New South Wales, PROV had set up an Interoperable Data publishing service with funding from the Australian National Data Service. They custom-built some software to export data from Archives One to produce a set of metadata records in RIF-CS format, and they deployed an off-the-shelf software application (an “OAI-PMH Repository”) to disseminate those metadata records over the web.

Originally, the OAI-PMH repository was serving data to the Australian National Data Service, which runs an aggregation service called Research Data Australia, which offers researchers pointers to all manner of scientific, historical and cultural datasets. The PROV metadata, covering the full history of government records in Victoria, is a useful resource for social science researchers, genealogists, historians, and others.

More recently, PROV’s OAI-PMH repository has also been harvested by the National Library of Australia’s Trove service.

Now at last it will be harvested by the Public Record Office itself.

The data pipeline

The software I’ve written consists of a web application which I wrote using a programming language for data pipelines called XProc. The software itself is open source and available on GitHub in a repository with the ludicrously acronymous title PROV-RIF-SPARQL.

This XProc application tediously harvests the metadata records (there are more than 30000 of them) and converts each one from RIF-CS format into RDF/XML format. The RDF/XML data is a reformulation of the RIF-CS in which the hierarchical structures of the RIF-CS are re-expressed as a network of interconnected statements; a kind of web of nodes and links which mathematicians call a “graph”. The statements in these graphs are expressed using the international standard conceptual framework for cultural heritage data; the CIDOC-CRM. My harvester then stores all these RDF/XML documents (or “graphs”) in a SPARQL Graph Store (a kind of hybrid document store and database). The SPARQL Graph Store allows each graph to be addressed individually, but also for the entire dataset to be treated as a single graph, and queried as a whole. Finally, the RDF dataset is queried to produce the two summarised data files which the visualization itself requires; these are simple spreadsheets in CSV (Comma Separated Values) format. One table contains information about each government agency or function, and the other table lists the relationships which have historically existed between those agencies and functions.

The harvester has a basic user interface where you can start a data harvest; a process that takes about half an hour to complete. In this interface you can specify the location of the OAI-PMH server you want to harvest data from, the format of the data you want to harvest, and the location of the SPARQL Graph Store where you want to store the result, amongst other parameters. In practice, this user interface isn’t used by a human (except during testing); another small program running on a regular schedule makes the request.

harvester

At this stage of the project, the RDF graph is only used internally to PROV, where it functions purely as an intermediate between the RIF-CS input and the CSV output. The RDF data and the SPARQL database together just provide a convenient way to aggregate a big set of records and query the resulting aggregation. But later I have no doubt that the RDF data will be published directly as Linked Open Data, opening it up, and allowing it to be connected into a world-wide web of data.

]]>
http://conaltuohy.com/blog/visualizing-government-archives-through-linked-data/feed/ 3 383
Bridging the conceptual gap: Museum Victoria’s collections API and the CIDOC Conceptual Reference Model http://conaltuohy.com/blog/bridging-conceptual-gap-api-cidoc-crm/ http://conaltuohy.com/blog/bridging-conceptual-gap-api-cidoc-crm/#comments Wed, 21 Oct 2015 14:44:33 +0000 http://conaltuohy.com/?p=301 Continue reading Bridging the conceptual gap: Museum Victoria’s collections API and the CIDOC Conceptual Reference Model]]>
A Museum Victoria LOD graph about a teacup, shown using the LODLive visualizer.
A Museum Victoria LOD graph about a teacup, shown using the LODLive visualizer.
This is the third in a series of posts about an experimental Linked Open Data (LOD) publication based on the web API of Museum Victoria.

The first post gave an introduction and overview of the architecture of the publication software, and the second dealt quite specifically with how names and identifiers work in the LOD publication software.

In this post I’ll cover how the publication software takes the data published by Museum Victoria’s API and reshapes it to fit a common conceptual model for museum data, the “Conceptual Reference Model” published by the documentation committee of the Internal Council of Museums. I’m not going to exhaustively describe the translation process (you can read the source code if you want the full story), but I’ll include examples to illustrate the typical issues that arise in such a translation.

The CIDOC Conceptual Reference Model

The CIDOC CRM, as it’s usually called, is a system of concepts for analysing and describing the content of museum collections. It is not intended to be a replacement for the Collection Management Systems which museums use to store their data; it is rather intended to function as a kind of lingua franca, through which content from a variety of systems can be expressed in a generally intelligible way.

The Conceptual Reference Model covers a wide range of museological concerns: items can be described in terms of their materials and mode of construction, as well as by who made them, where and when, and for what purpose.

The CRM also provides a framework to describe the events in which objects are broken into pieces, or joined to other objects, damaged or repaired, created or utterly destroyed. Objects can be described in terms of the symbolic and intellectual content which they embody, which are themselves treated as “intellectual objects”. The lineage of intellectual influence can be described, either speculatively, in a high-level way, or by explicitly tracing and documenting the influences that were known have taken place at particular times and locations. The legal history of objects can also be traced through transfer of ownership and custody, commission, sale and purchase, theft and looting, loss and discovery. Where the people involved in these histories are known, they too can be named and described and their histories interwoven with those of other people, objects, and ideas.

Core concepts and additional classification schemes

The CRM framework is quite high level. Only a fairly small number of very general types of thing are defined in the CRM; only concepts general enough to be useful for any kind of museum; whether a museum of computer games or of classical antiquity. Each of these concepts is identified by an alphanumeric code and an English-language name. In addition, the CRM framework allows for arbitrary typologies to be added on, to be used for further classifying pretty much anything. This is to allow all the terms from any classification system used in a museum to be exported directly into a CRM-based dataset, simply by describing each term as an “E55 Type". In short, the CRM consists of a fairly fixed common core, supplemented by a potentially infinite number of custom vocabularies which can be used to make fine distinctions of whatever kind are needed.

Therefore, a dataset based on the CRM will generally be directly comparable with another dataset only in terms of the core CRM-defined entities. The different classification schemes used by different datasets remain “local” vocabularies. To achieve full interoperability between datasets, these distinct typologies would additionally need to be aligned, by defining a “mapping” table which lists the equivalences or inequivalences between the terms in the two vocabularies. For instance, such a table might say that the term “moulded” used in Museum Victoria’s collection is more or less the same classification as “molding (forming)” in the Getty Art and Architecture thesaurus.

Change happens through “events”

To model how things change through time, the CRM uses the notion of an “event”. The production of a physical object, for instance, is modelled as an E12 Production event (NB concepts in the CRM are all identified by an alphanumeric code). This production event is linked to the object which it produced, as well as to the person or persons who played particular creative roles in that event. The event may also have a date and place associated with it, and may be linked to the materials and to the method used in the event.

On a somewhat philosophical note, this focus on discrete events is justified by the fact that not all of history is continuously documented, and we necessarily have a fragmentary knowledge of the history of any given object. Often a museum object will have passed through many hands, or will have been modified many times, and not all of this history is known in any detail. If we know that person A created an object for person B, and that X years later the object turned up in the hands of person C, we can’t assume that the object remained in person B’s hands all those X years. A data model which treated “ownership” as a property of an object would be liable to making such inflated claims to knowledge which is simply not there. Person C may have acquired it at any point during that period, and indeed there may have been many owners in between person B and person C. This is why it makes sense to document an object’s history in terms of the particular events which are known and attested to.

Museum Victoria’s API

How does Museum Victoria’s data fit in terms of the CIDOC model?

In general the model works pretty well for Museum Victoria, though there are also things in MV’s data which are not so easy to express in the CRM.

Items

Museum Victoria describes items as “Things made and used by people”. These correspond exactly to the notion of E22 Man-Made Object in the CIDOC CRM (if you can excuse the sexist language), described as comprising “physical objects purposely created by human activity.”

Every MV item is therefore expressed as an E22 Man-Made Object.

Titles

Museum Victoria’s objects have an objectName property which is a simple piece of text; a name or title. In the CIDOC CRM, the name of an object is something more complex; it’s an entity in its own right, called an E41 Appellation. The reason why a name is treated as more than just a simple property of an object is that in the CRM, it must be possible to treat an object’s name as an historical phenomenon; after all, it will have been applied to an object by a particular person (the person who created the object, perhaps, or an archaeologist who dug it out of the ground, or a curator or historian), at some historical point in time. An object may have a number of different names, each given it by different people, and used by different people at different times.

However, because the Museum Victoria names are simple (a single label) we can ignore most of that complexity. We only need to define an E41 Appellation whose value is the name, and link the E41 Appellation to the E22 Man-Made Object using a P1 is identified by association.

Articles, Items and their relationships

The MV API provides access to a number of “articles” which are documents related to the Museum’s collection. For each article, the API shows a list of the related collection items; and for each item, you can get the corresponding list of related articles. Although the exact nature of the relationship isn’t made explicit, it’s reasonable to assume that an item is in some way documented by the articles that are related to it. In the CIDOC CRM, such an article is considered an E31 Document, and it bears a P70 documents relationship to the item which it’s about.

If the relationship between an item and article is easy to guess, there are a couple of other relationships which are a little less obvious: an article also has a list of related articles, and each item also has a list of related items. What is that nature of those relationships? In what way exactly does article X relate to article Y, or item A to item B? The MV API’s documentation doesn’t say, and it wouldn’t surprise me if the Museum’s collection management system leaves this question up to the curators’ judgement.

A bit of empirical research seemed called for. I checked a few of the related items and the examples I found seemed to fall into two categories:

  • One item is a photograph depicting another item (the specific relationship here is really “depicts”)
  • Two items are both photographs of the same subject (the relationship is “has the same subject as”).

Obviously there are two different kinds of relationship here in the Museum’s collection, both of them presented (through the API) in the same way. As a human, I can tell them apart, but my proxy software is not going to be able to. So I need to find a more general relationship which subsumes both the relationships above, and fortunately, the CIDOC CRM includes such a relationship, namely P130 shows features of.

This property generalises the notions of “copy of” and “similar to” into a dynamic, asymmetric relationship, where the domain expresses the derivative, if such a direction can be established.
Otherwise, the relationship is symmetric. It is a shortcut of P15 was influenced by (influenced) in a creation or production, if such a reason for the similarity can be verified. Moreover it expresses similarity in cases that can be stated between two objects only, without historical knowledge about its reasons.

For example, I have a photograph of a piece of computer hardware (which is the relatedItem), and the photo is therefore a kind of derivative of the hardware (though the Museum Victoria API doesn’t tell me which of the objects was the original and which the derivative). In another example I have two photos of the same house; here there’s a similarity which is not due to one of the photos being derived from the other.

Ideally, it would be preferable to be able to represent these kinds of relationships more precisely; for instance, in the case of the two photos of the house, one could generate a resource that denotes the actual physical house itself, and link that to the photographs, but because the underlying data doesn’t include this information in a machine-readable form, the best we can do is to say that the two photos are similar.

Production techniques

Some of the items in the Museum’s collection are recorded as having been produced using a certain “technique”. For instance, archaeological artefacts in the MV collection have a property called archeologyTechnique, which contains the name of a general technique, such as moulded, in the case of certain ceramic items.

This corresponds to the CRM concept P32 used general technique, which is described like so:

This property identifies the technique or method that was employed in an activity.
These techniques should be drawn from an external E55 Type hierarchy of consistent terminology of
general techniques or methods such as embroidery, oil-painting, carbon dating, etc.

Note that in CIDOC this “general technique” used to manufacture an object is not a property of the object iself; it’s a property of the activity which produced the object (i.e. the whole process in which the potter pressed clay into a mould, glazed the cup, and fired it in a kiln).

Note also that, for the CIDOC CRM, the production technique used in making these tea-cups is not the text string “moulded”; it is actually an abstract concept identified by a URI. The string “moulded” is just a human-readable name attached as a property of that concept. That same concept might very well have a number of other names in other languages, or even in English there’s the American variant spelling “molded”, and synonyms such as “cast” that could all be alternative names for the same concept.

Translating a Museum Victoria item with a technique into the CRM therefore involves identifying three entities:

  • the object itself (an E22 Man-Made Object);
  • the production of the object (an E12 Production activity);
  • the technique used in the course of that activity to produce the object (an E55 Type of technique)

These three entities are then linked together:

  • The production event “P32 used general technique" of the technique; and
  • The production event [edit: "P94 has created"] "P108 has produced" the object itself.

Notes

The items, articles, specimens and species in the Museum’s API are all already first-class objects and can be easily represented as concepts in Linked Data. The archeologyTechnique field also has a fairly restricted range of values, and each of those values (such as “moulded”) can be represented as a Linked Data concept as well. But there are a number of other fields in the Museum’s API which are in the form of relatively long pieces of descriptive text. For example, an object’s objectSummary field contains a long piece of text which describes the object in context. For example, here’s the objectSummary of one our moulded tea cups:

This reconstructed cup was excavated at the Commonwealth Block site between 1988 and 2003. There is a matching saucer that was found with it. The pattern is known as 'Moss Rose' and was made between 1850 and 1851 by Charles Meigh, Son & Pankhurst in Hanley, Staffordshire, England.


Homewares.
Numerous crockery pieces were found all over the Little Lon site. Crockery gives us a glimpse of everyday life in Melbourne in the 1880s. In the houses around Little Lon, residents used decorated crockery. Most pieces were cheap earthenware or stoneware, yet provided colour and cheer. Only a few could afford to buy matching sets, and most china was probably acquired second-hand. Some were once expensive pieces. Householders mixed and matched their crockery from the great range of mass-produced designs available. 'Blue and white' and the 'willow' pattern, was the most popular choice and was produced by English potteries from 1790.

It’s not quite as long as an “article” but it’s not far off it. Another textual property is called physicalDescription, and has a narrower focus on the physical nature of the item:

This is a glazed earthenware teacup which has been reconstructed. It is decorated with a blue or black vine and leaf design around outside and inside of the cup which is known as 'Moss Rose' pattern.

The CIDOC CRM does include concepts related to the historical context and the physical nature of items, but it’s not at all easy to extract that detailed information from the descriptive prose of these, and similar fields. Because the information is stored in a long narrative form, it can’t be easily mapped to the denser data structure of a Linked Data graph. The best we can hope to do with these fields is to treat them as notes attached to the item.

The CIDOC CRM includes a concept for attaching a note: P3 has note. But to represent these two different types of note, it’s necessary to extend the CRM by creating two new, specialized versions (“sub-properties”) of the property called P3 has note, which I’ve called P3.1 objectSummary and P3.1 physicalDescription.

Summary

It’s possible to recognise three distinct patterns in mapping an API such as Museum Victoria’s to a Linked Data model like the CIDOC CRM.

  1. Where the API provides access to a set of complex data objects of a particular type, these can
    be mapped straight-forwardly to a corresponding class of Linked Data resources (e.g. the items, species, specimens, and articles in MV’s API).
  2. Where the API exposes a simple data property, it can be straightforwardly converted to a Linked Data property (e.g. the two types of notes, in the example above).
  3. Where the API exposes a simple data property whose values come from a fairly limited range (a “vocabulary”), then those individual property values can be assigned identifiers of their own, and effectively promoted from simple data properties to full-blown object properties (e.g. the production techniques in Museum Victoria’s API).

Conclusion

It’s been an interesting experiment, to generate Linked Open Data from an open API using a simple proxy: I think it shows that the technique is a very viable mechanism for institutions to break into the LOD cloud and contribute their collection in a standardised manner, without necessarily having to make any changes to their existing systems or invest in substantial software development work. To my mind, making that first step is a significant barrier that holds institutions and individuals back from realising the potential in their data. Once you have a system for publishing LOD, you are opening up a world of possibilities for external developers, data aggregators, and humanities researchers, and if your data is of interest to those external groups, you have the possibility of generating some significant returns on your investment, and the possibility of “harvesting” some of that work back into your institution’s own web presence in the form of better visualizations, discovery interfaces, and better understanding of your own collection.

Before the end of the year I hope to explore some further possibilities in the area of user interfaces based on Linked Data, to show some of the value that these Linked Data publishing systems can support.

]]>
http://conaltuohy.com/blog/bridging-conceptual-gap-api-cidoc-crm/feed/ 6 301
Names in the Museum http://conaltuohy.com/blog/museum-names/ http://conaltuohy.com/blog/museum-names/#comments Thu, 01 Oct 2015 04:56:14 +0000 http://conaltuohy.com/?p=279 Continue reading Names in the Museum]]> My last blog post described an experimental Linked Open Data service I created, underpinned by Museum Victoria’s collection API. Mainly, I described the LOD service’s general framework, and explained how it worked in terms of data flow.

To recap briefly, the LOD service receives a request from a browser and in turn translates that request into one or more requests to the Museum Victoria API, interprets the result in terms of the CIDOC CRM, and returns the result to the browser. The LOD service does not have any data storage of its own; it’s purely an intermediary or proxy, like one of those real-time interpreters at the United Nations. I call this technique a “Linked Data proxy”.

I have a couple more blog posts to write about the experience. In this post, I’m going to write about how the Linked Data proxy deals with the issue of naming the various things which the Museum’s database contains.

Using Uniform Resource Identifiers (URIs) as names

Names are a central issue in any Linked Data system; anything of interest must be named with an HTTP URI; every piece of information which is recorded about a thing is attached to this name, and crucially, because these names are HTTP URIs, they can (in fact in a Linked Data system, they must) also serve as a means to obtain information about the thing.

In a nutshell there are three main tasks the Linked Data proxy has to be able to perform:

  1. When it receives an HTTP request, it has to recognise the HTTP URI as an identifier that identifies a particular individual belonging to some general type: an artefact; a species; a manufacturing technique; etc.
  2. Having recognised as some sort of name, it has to be able to look up and retrieve information about the particular individual which it identifies.
  3. Having found some information about the named thing, it has to convert that information into RDF (the language of Linked Data), in the process converting any identifiers it has found into the kind of HTTP URIs it can recognise in future. A Linked Open Data client is going to want to use those identifiers to make further requests, so they have to match the kind of identifiers the LOD service can recognise (in step 1 above).

Recognising various HTTP URIs as identifiers for things in Museum Victoria’s collection

Let’s look at the task of recognising URIs as names first.

The Linked Data Proxy distinguishes between URIs that name different types of things by recognising different prefixes in the URIs. For instance, a URI beginning with http://conaltuohy.com/xproc-z/museum-victoria/resource/item/ will identity a particular item in the collection, whereas a URI beginning with http://conaltuohy.com/xproc-z/museum-victoria/resource/technique/ will identify some particular technique used in the manufacture of an item.

The four central entities of Museum Victoria’s API

The Museum Victoria API is organised around four main types of entity:

  • items
  • specimens
  • species
  • articles

The LOD service handles all four very similarly: since the MV API provides an identifier for every item, specimen, species, or article, the LOD service can generate a linked data identifier for each one just by sticking a prefix on the front. For example, the item which Museum Victoria identifies with the identifier items/1221868 can be identified with the Linked Data identifier http://conaltuohy.com/xproc-z/museum-victoria/resource/items/1221868 just by sticking http://conaltuohy.com/xproc-z/museum-victoria/resource/ in front of it, and a document about that item can be identified by http://conaltuohy.com/xproc-z/museum-victoria/data/items/1221868.

Secondary entities

So far so straightforward, but apart from these four main entity types, there are a number of things of interest which the Museum Victoria API deals with in a secondary way.

For example, the MV database includes information on how many of the artefacts in the collection were manufactured, in a field called technique. For instance, many ceramic items (e.g. teacups) in their collection were created from a mould, and have the value moulded in their technique field. The tricky thing here is that the techniques are not “first-class” entities like items. Instead, a technique is just a textual attribute of an item. This is a common enough situation in legacy data systems: the focus of the system is on what it sees as a “core” entity (a museum item in this case), which have their own identifiers and a bunch of properties hanging off them. Those properties are very much second-class citizens in the data model, and are often just textual labels. A number of items might share a common value for their technique field, but that common value is not stored anywhere except in the technique field of those items; it has no existence independent of those items.

In Linked Data systems, by contrast, such common values should be treated as first-class citizens, with their own identifiers, and with links that connect each technique to the items which were manufactured using that technique.

What is the LOD service to do? When expressing a technique as a URI, it can simply use the technique’s name itself (“moulded”) as part of the identifier, like so:

http://conaltuohy.com/xproc-z/museum-victoria/resource/technique/moulded

Then when the LOD service is responding to a request for a URI like the above, it can pull that prefix off and have the original world “moulded” back.

At this point the LOD service needs to be able to provide some information about the moulded technique. Because the technique is not a first-class object in the underlying collection database, there’s not much that can be said about it, apart from its name, obviously, which is “moulded”. All that the LOD service really knows about a given technique is that a certain set of items were produced using that technique, and it can retrieve that list using the Museum Victoria search API. The search API allows for searching by a number of different fields, including technique, so the Linked Data service can take the last component of the request URI it has received (“moulded”) and pass that to the search API, like so:

http://collections.museumvictoria.com.au/api/search/search?limit=100&technique=moulded

The result of the search is a list of items produced with the given technique, which the LOD service simply reformats into an RDF representation. As part of that conversion, the identifiers of the various moulded items in the results list (e.g. items/1280928) are converted into HTTP URIs simply by sticking the LOD service’s base URI on the front of them, e.g.

http://conaltuohy.com/xproc-z/museum-victoria/resource/items/1280928

External links

Tim Berners-Lee, the inventor of the World Wide Web, in an addendum to his “philosophical” post about Linked Data, suggested a “5-star” rating scheme for Linked Open Data, in which the fifth star requires that a dataset “link … to other people’s data to provide context”. Since the Museum Victoria data doesn’t include external links, it is tricky to earn this final star, but there is a way, based on MV’s use of the standard taxonomic naming system used in biology. Since many of MV’s items are biological specimens, we can use their taxonomic names to establish links to external sources which also use the same taxonomic names. For this example, I chose to link to biological data in Wikipedia, which, unknown to many people, also publishes a large dataset of Linked Open Data derived from the Wikipedia pages, including a lot of biological taxa. To establisha link to DBpedia, the LOD service takes the Museum’s taxonName field and inserts it into a SPARQL query, which it sends to DBpedia, essentially asking “do you have anything on file which has this binomial name?

select distinct ?species where {
?species dbp:binomial "{MV's taxon name goes here}"@en}

The result of the query is either a “no” or it’s a link to the species in Wikipedia’s database, which the LOD service can then republish.

coming up…

My next post in the series will look at some issues of how the Museum’s data relates to the CIDOC CRM model; where it matches neatly, and where it’s either more, or less specific than the CRM.

]]>
http://conaltuohy.com/blog/museum-names/feed/ 1 279
Linked Open Data built from a custom web API http://conaltuohy.com/blog/lod-from-custom-web-api/ http://conaltuohy.com/blog/lod-from-custom-web-api/#comments Mon, 07 Sep 2015 08:51:37 +0000 http://conaltuohy.com/?p=268 Continue reading Linked Open Data built from a custom web API]]> I’ve spent a bit of time just recently poking at the new Web API of Museum Victoria Collections, and making a Linked Open Data service based on their API.

I’m writing this up as an example of one way — a relatively easy way — to publish Linked Data off the back of some existing API. I hope that some other libraries, archives, and museums with their own API will adopt this approach and start publishing their data in a standard Linked Data style, so it can be linked up with the wider web of data.

Two ways to skin a cat

There are two basic ways you can take an existing API and turn it into a Linked Data service.

One way is the “metadata aggregator” approach. In this approach, an “aggregator” periodically downloads (or “harvests”) the data from the API, in bulk, converts it all into RDF, and stores it in a triple-store. Then another application — a Linked Data publishing service — can read that aggregated data from the triple store using a SPARQL query and expose it in Linked Data form. The tricky part here is that you have to create and maintain your own copy (a cache) of all the data which the API provides. You run the risk that if the source data changes, then your cache is out of date. You need to schedule regular harvests to be sure that the copy you have is as up to date as you need it to be. You have to hope that the API can tell you which particular records have changed or been deleted, otherwise, you may have to download every piece of data just to be sure.

But this blog post is about another way which is much simpler: the “proxy” approach. A Linked Data proxy is a web application that receives a request for Linked Data, and in order to satisfy that request, makes one or more requests of its own to the API, processing the results it receives, and formatting them as Linked Data, which it then returns. The advantage of this approach is that every response to a request for Linked Data is freshly prepared. There is no need to maintain a cache of the data. There is no need for harvesting or scheduling. It’s simply a translator that sits in front of the API and translates what it says into Linked Data.

This is an approach I’ve been meaning to try out for a fair while, and in fact I gave a very brief presentation on the proxy idea at the recent LODLAM Summit in Sydney. All I needed was a good API to try it out with.

Museum Victoria Collection API

The Museum Victoria Collection API was announced on Twitter by Ely Wallis on August 25th:

Well as it happened I did like it, so I got in touch. Since it’s so new, the API’s documentation is a bit sparse, but I did get some helpful advice from the author of the API, Museum Victoria’s own Michael Mason, including details of how to perform searches, and useful hints about the data structures which the API provides.

In a nutshell, the Museum Victoria API provides access to data about four different sorts of things:

  • Items (artefacts in the Museum’s collections),
  • Specimens (biological specimens),
  • Species (which the specimens belong to), and
  • Articles (which document the other things)

There’s also a search API with which you can search within all of those categories.

Armed with this knowledge, I used my trusty XProc-Z proxy software to build a Linked Data proxy to that API.

Linked Data

Linked Data is a technique for publishing information on the web in a common, machine-understandable way.

The central principle of Linked Data is that all items of interest are identified with an HTTP URI (“Uniform Resource Identifier”). And “Resource” here doesn’t just mean web pages or other electronic resources; anything at all can be a “Resource”: people, physical objects, species of animal, days of the week … anything. If you take one of these URIs and put it into your browser, it will deliver you up some information which relates to the thing identified by the URI.

Because of course you can’t download a person or a species of animal, there is a special trick to this: if you send a request for a URI which identifies one of these “non-information resources”, such as a person, the server can’t simply respond by sending you an information resource (after all, you asked for a person, not a document). Instead it responds by saying “see also” and referring you to a different URL. This is basically saying “since you can’t download the resource you asked for (because it’s not an information resource), here is the URL of an information resource which is relevant to your request”. Then when your browser makes a request from that second URL, it receives an information resource. This is why, when browsing Linked Data, you will sometimes see the URI in your browser’s address bar change: first it makes a request for one URI and then is automatically redirected to another.

That information also needs to be encoded as RDF (“Resource Description Framework”). The RDF document you receive from a Linked Data server consists of a set of statements (called “triples”) about various things,including the original “resource” which your original URI identified, but usually also other things as well. Those statements assign various properties to the resources, and also link them to other resources. Since those other resources are also identified by URIs, you can follow those links and retrieve information about those related resources, and resources that are related to them, and so on.

Linked Data URIs as proxies for Museum Victoria identifiers

So one of the main tasks of the Linked Data proxy is to take any identifiers which it retrieves from the Museum Victoria API, and convert them into full HTTP URIs. That’s pretty easy; it’s just a matter of adding a prefix like “http://example/something/something/”. When the proxy receives a request for one of those URIs, it has to be able to turn it back into the form that Museum Victoria’s API uses. That basically involves trimming the prefix back off. Because many of the things identified in the Museum’s API are not information resources (many are physical objects), the proxy makes up two different URIs, one to denote the thing itself, and one to refer to the information about the thing.

The conceptual model (“ontology”)

The job of the proxy is to publish the data in a standard Linked Data vocabulary. There was an obvious choice here; the well-known museum ontology (and ISO standard) with the endearing name “CIDOC-CRM”. This is the Conceptual Reference Model produced by the International Committee for Documentation (CIDOC) of the International Council of Museums. This abstract model is published as an OWL ontology (a form that can be directly used in a Linked Data system) by a joint working group of computer scientists and museologists in Germany.

This Conceptual Reference Model defines terms for things such as physical objects, names, types, documents, and images, and also for relationships such as “being documented in”, or “having a type”, or “being an image of”. The proxy’s job is to translate the terminology used in Museum Victoria’s API into the terms defined in the CIDOC-CRM. Unsurprisingly, much of that translation is pretty easy, because there are long-standing principles in the museum world about how to organise collection information, and both the Museum Victoria API and the CIDOC-CRM are aligned to those principles.

As it happened I already knew the CIDOC-CRM model pretty well, which was one reason why a museum API was an attractive subject for this exercise.

Progress and prospects

At this stage I haven’t yet translated all the information which the Museum’s API provides; most of the details are still simply ignored. But already the translation does include titles and types, as well as descriptions, and many of the important relationships between resources (it wouldn’t be Linked Data without links!). I still want to flesh out the translation some more, to include more of the detailed information which the Museum’s API makes available.

This exercise was a test of my XProc-Z software, and of the general approach of using a proxy to publish Linked Data. Although the result is not yet a complete representation of the Museum’s API, I think it has at least proved the practicality of the approach.

At present my Linked Data service produces RDF in XML format only. There are many other ways that the RDF can be expressed, such as e.g. JSON-LD, and there are even ways to embed the RDF in HTML, which makes it easier for a human to read. But I’ve left that part of the project for now; it’s a very distinct part that will plug in quite easily, and in the meantime there are other pieces of software available that can do that part of the job.

See the demo

The proxy software itself is running here on my website conaltuohy.com, but for ease of viewing it’s more convenient to access it through another proxy which converts the Linked Data into an HTML view.

Here is an HTML view of a Linked Data resource which is a timber cabinet for storing computer software for an ancient computer: http://conaltuohy.com/xproc-z/museum-victoria/resource/items/1411018. Here is the same Linked Data resource description as raw RDF/XML: http://conaltuohy.com/xproc-z/museum-victoria/data/items/1411018 — note how, if you follow this link, the URL in your browser’s address bar changes as the Linked Data server redirects you from the identifier for the cabinet itself, to an identifier for a set of data about the cabinet.

The code

The source code for the proxy is available in the XProc-Z GitHub repository: I’ve packaged the Museum Victoria Linked Data service as one of XProc-Z’s example apps. The code is contained in two files:

  • museum-victoria.xpl which is a pipeline written in the XProc language, which deals with receiving and sending HTTP messages, and converting JSON into XML, and
  • museum-victoria-json-to-rdf.xsl, which is a stylesheet written in the XSLT language, which performs the translation between Museum Victoria’s vocabulary and the CIDOC-CRM vocabulary.
]]>
http://conaltuohy.com/blog/lod-from-custom-web-api/feed/ 5 268