XProc-Z – Conal Tuohy’s blog
The blog of a digital humanities software developer

Analysis & Policy Online

Notes for my Open Repositories 2017 conference presentation. I will edit this post later to flesh it out into a proper blog post.
Follow along at: conaltuohy.com/blog/analysis-policy-online/

background

  • Early discussion with Amanda Lawrence of APO (which at that time stood for “Australian Policy Online”) about text mining, at the 2015 LODLAM Summit in Sydney.
  • They needed automation to help with the cataloguing work, to improve discovery.
  • They needed to understand their own corpus better.
  • I suggested a particular technical approach based on previous work.
  • In 2016, APO contracted me to advise and help them build a system that would “mine” metadata from their corpus, and use Linked Data to model and explore it.

constraints

  • Openness
  • Integrate metadata from multiple text-mining processes, plus manually created metadata
  • Minimal dependency on their current platform (Drupal 7, now Drupal 8)
  • Lightweight; easy to make quick changes

technical approach

  • Use an entirely external metadata store (a SPARQL Graph Store)
  • Use a pipeline! Extract, Transform, Load
  • Use standard protocol to extract data (first OAI-PMH, later sitemaps)
  • In fact, use web services for everything; the pipeline is then just a simple script that passes data between web services
  • Sure, XSLT and SPARQL Query, but what the hell is XProc?!

progress

  • Configured Apache Tika as a web service, using the Stanford Named Entity Recognition toolkit
  • Built an XProc pipeline to harvest from Drupal’s OAI-PMH module, download digital objects, process them with Stanford NER via Tika, and store the resulting graphs in a Fuseki graph store (a sketch of that final “load” step follows this list)
  • Harvested, and produced a graph of part of the corpus, but …
  • Turned out the Drupal OAI-PMH module was broken! So we used the sitemap instead
  • “Related” list added to APO dev site (NB I’ve seen this isn’t working in all browsers, and it obviously needs more work; perhaps using an iframe is not the best idea. Try Chrome if you don’t see the list of related pages on the right)
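
For the “load” end of the pipeline, the SPARQL Graph Store Protocol makes storing a graph a single HTTP PUT. Here is a minimal sketch of that step in XProc 1.0; the Fuseki endpoint URI, the step type and the option names are assumptions for illustration, not the actual APO pipeline:

```xml
<p:declare-step version="1.0" name="store-graph" type="ex:store-graph"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:ex="http://example.org/ns">

  <!-- an RDF/XML document in; the graph store's full HTTP response out -->
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- both URIs are assumptions: a local Fuseki dataset called "apo",
       and the named graph to create or replace -->
  <p:option name="graph-store" select="'http://localhost:3030/apo/data'"/>
  <p:option name="graph-uri" required="true"/>

  <!-- wrap the RDF/XML in a c:body, then in a c:request that PUTs it
       to the SPARQL Graph Store Protocol endpoint -->
  <p:wrap match="/*" wrapper="c:body"/>
  <p:add-attribute match="/c:body" attribute-name="content-type"
      attribute-value="application/rdf+xml"/>
  <p:wrap match="/c:body" wrapper="c:request"/>
  <p:add-attribute match="/c:request" attribute-name="method"
      attribute-value="put"/>
  <p:add-attribute match="/c:request" attribute-name="detailed"
      attribute-value="true"/>
  <p:add-attribute match="/c:request" attribute-name="href">
    <p:with-option name="attribute-value"
        select="concat($graph-store, '?graph=', encode-for-uri($graph-uri))"/>
  </p:add-attribute>

  <!-- a PUT replaces the named graph with the new set of triples -->
  <p:http-request/>
</p:declare-step>
```

A harvesting pipeline can call a step like this once per harvested resource, using the resource’s URI as the graph name, so that each document’s triples end up in their own named graph.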

next steps

  • Visualize the graph
  • Integrate more of the manually created metadata into the RDF graph
  • Add topic modelling (using MALLET) alongside the NER

Let’s see the code

Questions?

(if there’s any time remaining)

Linked Open Data built from a custom web API

I’ve spent a bit of time just recently poking at the new Web API of Museum Victoria Collections, and making a Linked Open Data service based on their API.

I’m writing this up as an example of one way — a relatively easy way — to publish Linked Data off the back of some existing API. I hope that some other libraries, archives, and museums with their own API will adopt this approach and start publishing their data in a standard Linked Data style, so it can be linked up with the wider web of data.

Two ways to skin a cat

There are two basic ways you can take an existing API and turn it into a Linked Data service.

One way is the “metadata aggregator” approach. In this approach, an “aggregator” periodically downloads (or “harvests”) the data from the API, in bulk, converts it all into RDF, and stores it in a triple-store. Then another application — a Linked Data publishing service — can read that aggregated data from the triple store using a SPARQL query and expose it in Linked Data form. The tricky part here is that you have to create and maintain your own copy (a cache) of all the data which the API provides. You run the risk that if the source data changes, then your cache is out of date. You need to schedule regular harvests to be sure that the copy you have is as up to date as you need it to be. You have to hope that the API can tell you which particular records have changed or been deleted; otherwise, you may have to download every piece of data just to be sure.

But this blog post is about another way which is much simpler: the “proxy” approach. A Linked Data proxy is a web application that receives a request for Linked Data, and in order to satisfy that request, makes one or more requests of its own to the API, processing the results it receives, and formatting them as Linked Data, which it then returns. The advantage of this approach is that every response to a request for Linked Data is freshly prepared. There is no need to maintain a cache of the data. There is no need for harvesting or scheduling. It’s simply a translator that sits in front of the API and translates what it says into Linked Data.
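
To make the proxy approach a little more concrete, here is a sketch of the general shape such a pipeline can take in XProc. This is illustration only, not the Museum Victoria code described below: the back-end URI and the stylesheet are placeholders, the back end is assumed to return XML (a JSON API needs an extra conversion step), and the convention of receiving a c:request document and returning a c:response document is the one XProc-Z uses (see the XProc-Z post further down this page).

```xml
<p:declare-step version="1.0" name="proxy"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step">

  <!-- the client's HTTP request in; the translated response out -->
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- 1. turn the incoming request into a request to the back-end API;
         here we just swap the host (a real proxy would also map paths,
         parameters and headers) -->
  <p:add-attribute match="/c:request" attribute-name="href">
    <p:with-option name="attribute-value"
        select="replace(/c:request/@href, '^https?://[^/]+/', 'http://api.example.org/')"/>
  </p:add-attribute>

  <!-- 2. call the back-end API -->
  <p:http-request/>

  <!-- 3. translate its answer into the vocabulary we want to publish -->
  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="translate.xsl"/>
    </p:input>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>

  <!-- 4. wrap the result as the response to send back to the client -->
  <p:wrap match="/*" wrapper="c:body"/>
  <p:add-attribute match="/c:body" attribute-name="content-type"
      attribute-value="application/xml"/>
  <p:wrap match="/c:body" wrapper="c:response"/>
  <p:add-attribute match="/c:response" attribute-name="status"
      attribute-value="200"/>
</p:declare-step>
```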

This is an approach I’ve been meaning to try out for a fair while, and in fact I gave a very brief presentation on the proxy idea at the recent LODLAM Summit in Sydney. All I needed was a good API to try it out with.

Museum Victoria Collection API

The Museum Victoria Collection API was announced on Twitter by Ely Wallis on August 25th.

Well, as it happened, I did like it, so I got in touch. Since it’s so new, the API’s documentation is a bit sparse, but I did get some helpful advice from the author of the API, Museum Victoria’s own Michael Mason, including details of how to perform searches, and useful hints about the data structures which the API provides.

In a nutshell, the Museum Victoria API provides access to data about four different sorts of things:

  • Items (artefacts in the Museum’s collections),
  • Specimens (biological specimens),
  • Species (which the specimens belong to), and
  • Articles (which document the other things)

There’s also a search API with which you can search within all of those categories.

Armed with this knowledge, I used my trusty XProc-Z proxy software to build a Linked Data proxy to that API.

Linked Data

Linked Data is a technique for publishing information on the web in a common, machine-understandable way.

The central principle of Linked Data is that all items of interest are identified with an HTTP URI (“Uniform Resource Identifier”). And “Resource” here doesn’t just mean web pages or other electronic resources; anything at all can be a “Resource”: people, physical objects, species of animal, days of the week … anything. If you take one of these URIs and put it into your browser, it will deliver you up some information which relates to the thing identified by the URI.

Because of course you can’t download a person or a species of animal, there is a special trick to this: if you send a request for a URI which identifies one of these “non-information resources”, such as a person, the server can’t simply respond by sending you an information resource (after all, you asked for a person, not a document). Instead it responds by saying “see also” and referring you to a different URL. This is basically saying “since you can’t download the resource you asked for (because it’s not an information resource), here is the URL of an information resource which is relevant to your request”. Then when your browser makes a request from that second URL, it receives an information resource. This is why, when browsing Linked Data, you will sometimes see the URI in your browser’s address bar change: first it makes a request for one URI and then is automatically redirected to another.
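
Here is a sketch of how that redirect can be expressed as an XProc pipeline in the XProc-Z style, where the incoming HTTP request arrives as a c:request document and the reply is returned as a c:response document. This is illustration only, not the actual proxy code; the /resource/-to-/data/ rewrite simply mirrors the URI pattern used in the demo later in this post.

```xml
<p:declare-step version="1.0" name="see-other"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step">

  <!-- the incoming HTTP request, as a c:request document -->
  <p:input port="source"/>
  <!-- the outgoing HTTP response, as a c:response document -->
  <p:output port="result"/>

  <!-- derive the "information resource" URI from the requested URI
       by swapping "/resource/" for "/data/" (the pattern used in the demo) -->
  <p:variable name="data-uri"
      select="replace(/c:request/@href, '/resource/', '/data/')"/>

  <!-- a 303 "See Other" response, with a placeholder Location header -->
  <p:identity>
    <p:input port="source">
      <p:inline>
        <c:response status="303">
          <c:header name="Location" value="placeholder"/>
        </c:response>
      </p:inline>
    </p:input>
  </p:identity>

  <!-- point the Location header at the computed data URI -->
  <p:add-attribute match="c:header" attribute-name="value">
    <p:with-option name="attribute-value" select="$data-uri"/>
  </p:add-attribute>
</p:declare-step>
```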

That information also needs to be encoded as RDF (“Resource Description Framework”). The RDF document you receive from a Linked Data server consists of a set of statements (called “triples”) about various things, including the “resource” which your original URI identified, but usually also other things as well. Those statements assign various properties to the resources, and also link them to other resources. Since those other resources are also identified by URIs, you can follow those links and retrieve information about those related resources, and resources that are related to them, and so on.

Linked Data URIs as proxies for Museum Victoria identifiers

So one of the main tasks of the Linked Data proxy is to take any identifiers which it retrieves from the Museum Victoria API, and convert them into full HTTP URIs. That’s pretty easy; it’s just a matter of adding a prefix like “http://example/something/something/”. When the proxy receives a request for one of those URIs, it has to be able to turn it back into the form that Museum Victoria’s API uses. That basically involves trimming the prefix back off. Because many of the things identified in the Museum’s API are not information resources (many are physical objects), the proxy makes up two different URIs, one to denote the thing itself, and one to refer to the information about the thing.

The conceptual model (“ontology”)

The job of the proxy is to publish the data in a standard Linked Data vocabulary. There was an obvious choice here; the well-known museum ontology (and ISO standard) with the endearing name “CIDOC-CRM”. This is the Conceptual Reference Model produced by the International Committee for Documentation (CIDOC) of the International Council of Museums. This abstract model is published as an OWL ontology (a form that can be directly used in a Linked Data system) by a joint working group of computer scientists and museologists in Germany.

This Conceptual Reference Model defines terms for things such as physical objects, names, types, documents, and images, and also for relationships such as “being documented in”, or “having a type”, or “being an image of”. The proxy’s job is to translate the terminology used in Museum Victoria’s API into the terms defined in the CIDOC-CRM. Unsurprisingly, much of that translation is pretty easy, because there are long-standing principles in the museum world about how to organise collection information, and both the Museum Victoria API and the CIDOC-CRM are aligned to those principles.

As it happened I already knew the CIDOC-CRM model pretty well, which was one reason why a museum API was an attractive subject for this exercise.

Progress and prospects

At this stage I haven’t yet translated all the information which the Museum’s API provides; most of the details are still simply ignored. But already the translation does include titles and types, as well as descriptions, and many of the important relationships between resources (it wouldn’t be Linked Data without links!). I still want to flesh out the translation some more, to include more of the detailed information which the Museum’s API makes available.

This exercise was a test of my XProc-Z software, and of the general approach of using a proxy to publish Linked Data. Although the result is not yet a complete representation of the Museum’s API, I think it has at least proved the practicality of the approach.

At present my Linked Data service produces RDF in XML format only. There are many other ways that the RDF can be expressed, such as JSON-LD, and there are even ways to embed the RDF in HTML, which makes it easier for a human to read. But I’ve left that part of the project for now; it’s a very distinct part that will plug in quite easily, and in the meantime there are other pieces of software available that can do that part of the job.

See the demo

The proxy software itself is running here on my website conaltuohy.com, but for ease of viewing it’s more convenient to access it through another proxy which converts the Linked Data into an HTML view.

Here is an HTML view of a Linked Data resource which is a timber cabinet for storing computer software for an ancient computer: http://conaltuohy.com/xproc-z/museum-victoria/resource/items/1411018. Here is the same Linked Data resource description as raw RDF/XML: http://conaltuohy.com/xproc-z/museum-victoria/data/items/1411018 — note how, if you follow this link, the URL in your browser’s address bar changes as the Linked Data server redirects you from the identifier for the cabinet itself, to an identifier for a set of data about the cabinet.

The code

The source code for the proxy is available in the XProc-Z GitHub repository: I’ve packaged the Museum Victoria Linked Data service as one of XProc-Z’s example apps. The code is contained in two files (a sketch showing how they fit together follows the list):

  • museum-victoria.xpl which is a pipeline written in the XProc language, which deals with receiving and sending HTTP messages, and converting JSON into XML, and
  • museum-victoria-json-to-rdf.xsl, which is a stylesheet written in the XSLT language, which performs the translation between Museum Victoria’s vocabulary and the CIDOC-CRM vocabulary.
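
As a rough sketch of how the two pieces fit together (the real museum-victoria.xpl also does the HTTP handling and JSON-to-XML conversion mentioned above, which are omitted here), the hand-off to the stylesheet looks something like this:

```xml
<p:declare-step version="1.0" name="convert"
    xmlns:p="http://www.w3.org/ns/xproc">

  <!-- a Museum Victoria record, already converted from JSON to XML, in;
       a CIDOC-CRM description in RDF/XML out -->
  <p:input port="source"/>
  <p:output port="result"/>

  <p:xslt>
    <p:input port="stylesheet">
      <p:document href="museum-victoria-json-to-rdf.xsl"/>
    </p:input>
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>
</p:declare-step>
```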

Beta release of XProc-Z web server framework

I have at last released a “final” version of my web server framework, XProc-Z, for testing. The last features I had wanted to include were:

  • The ability for the XProc code in the web server to read information from its environment, so that a generic XProc pipeline can be customized by setting configuration properties.
  • Full support for sending and receiving binary files (i.e. non-text files). XProc is really a language for processing XML, but I think it will be handy to be able to deal with binary files as well from time to time.
  • A few sample XProc pipelines, to demonstrate the capability of the platform.

[Screenshot: the XProc-Z sample pipelines]

Now the software is out there for people to try, and already I have a friend — a medievalist — who has installed it and started to use it to develop a web application. It’s exciting to have an “installed base” (one person, but it’s a start!) for the software which previously I was the only one to use.

Also, now that the XProc-Z platform is more or less complete, I will be using it myself to build an application for Library and Archives people to convert their collection metadata into Linked Open Data form.

I hope that the platform will turn out to be useful generally in the Digital Humanities and Library fields; there’s a lot of processing of XML going on, and XProc is an ideal programming language for that. Since it’s designed to run XProc pipelines, on the web, with minimal extras, XProc-Z is also highly appropriate for web-based XML processing applications, making it one of the most concise and simple ways to write applications of that nature. If you are a DH developer and you already know XSLT, XQuery, or XPath, you will find XProc a pretty amenable language – I totally recommend it!

If you’re interested to see it in action, you can view it on this server, running the sample pipelines which I’ve included. You can also view the Java source code or the sample XProc pipelines on the github site. The “main” pipeline is xproc-z.xpl.

If you’re interested to give it a try, and you know — or don’t mind learning — a bit of XProc, feel free to download the software and fire it up on a machine of your own. I am happy to answer questions about it and generally help to get people going. You can comment here on the blog, email me, or post an “issue” on the github site.

Long running and asynchronous processes in XProc-Z

I added a useful feature to my web server framework XProc-Z, to allow it to run processes that take a long time – longer than the lifetime of an HTTP request.

Specifically, I had in mind running OAI-PMH harvests, in which a harvester may need to run for hours and make hundreds or thousands of HTTP requests of its own, in order to download a large set of metadata records. But any long running process faces the same issue: if a person makes a request to initiate a long running process, using a web browser, the server can’t afford to wait until the process is complete before it responds, because the user’s HTTP request will time out. Instead, the server needs to return a response to say “process has started…” and then to continue the processing in the background.

Another issue with long running processes in XProc is that they often require recursion, and can consume infeasibly large amounts of memory. OAI-PMH is one task which requires recursion. To perform a harvest from an OAI-PMH repository, an OAI-PMH harvester makes a request for a batch of records, and receives along with that batch a so-called “resumption token” which is a reference to the next batch of records. When it’s finished processing the first batch, the harvester makes another request to the repository, passing back the resumption token it received earlier, and receives in response another batch of records, with another resumption token. After a number of requests like this, the final batch will include an empty resumption token, indicating that there are no records left to harvest. In XProc, because of the nature of the language itself, such a process is necessarily a recursive one; to complete the harvest of the first batch one has to finish harvesting the second, and to complete the second, one has to have completed the third, and so on. This tends to produce issues for the XProc interpreter: stack overflows and other memory errors. The canonical solution to such problems is to “flatten” the recursion by converting it into iteration, but since the XProc language doesn’t support this style of programming, it needs to be provided by the application framework instead.
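
To make the shape of the problem concrete, here is a sketch of such a recursive harvest in XProc 1.0. The step type, the option name and the URIs are invented for illustration; the essential point is that each invocation cannot finish until the recursive call for the next batch has finished.

```xml
<p:declare-step version="1.0" name="harvest" type="ex:harvest"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step"
    xmlns:oai="http://www.openarchives.org/OAI/2.0/"
    xmlns:ex="http://example.org/ns">

  <!-- the full sequence of OAI-PMH response documents -->
  <p:output port="result" sequence="true"/>
  <!-- the complete request URI for one batch, e.g.
       http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc -->
  <p:option name="request-uri" required="true"/>

  <!-- fetch one batch of records -->
  <p:add-attribute match="/c:request" attribute-name="href">
    <p:input port="source">
      <p:inline><c:request method="get"/></p:inline>
    </p:input>
    <p:with-option name="attribute-value" select="$request-uri">
      <p:empty/>
    </p:with-option>
  </p:add-attribute>
  <p:http-request name="batch"/>

  <p:choose>
    <!-- a non-empty resumption token means there are more batches: recurse -->
    <p:when test="normalize-space((//oai:resumptionToken)[1]) != ''">
      <p:output port="result" sequence="true"/>
      <ex:harvest name="rest">
        <p:with-option name="request-uri"
            select="concat(substring-before($request-uri, '?'),
                           '?verb=ListRecords&amp;resumptionToken=',
                           encode-for-uri(normalize-space((//oai:resumptionToken)[1])))"/>
      </ex:harvest>
      <!-- this batch, followed by everything the recursive call returns -->
      <p:identity>
        <p:input port="source">
          <p:pipe step="batch" port="result"/>
          <p:pipe step="rest" port="result"/>
        </p:input>
      </p:identity>
    </p:when>
    <!-- an empty (or absent) token means this was the last batch -->
    <p:otherwise>
      <p:output port="result" sequence="true"/>
      <p:identity>
        <p:input port="source">
          <p:pipe step="batch" port="result"/>
        </p:input>
      </p:identity>
    </p:otherwise>
  </p:choose>
</p:declare-step>
```

Each invocation has to wait for the one it spawned, so the whole harvest accumulates on the stack; that is the memory problem the trampoline described below is designed to avoid.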

[Diagram: UML sequence diagram of the XProc-Z trampoline]
Accordingly, I have modified XProc-Z to include what’s known as a “trampoline”. When XProc-Z executes a pipeline, it now expects not only an HTTP response (which it sends off to the waiting web browser), but also an optional sequence of requests, which it then executes in their own threads, asynchronously. These requests are effectively made by a pipeline to itself, via the XProc-Z framework. XProc-Z receives the request from the pipeline and bounces it straight back; hence the term “trampoline”. When a pipeline returns, it releases the memory it has been using, and the next invocation of the pipeline starts with a clean slate. That means the long-term memory consumption of the pipeline is limited to what it needs during a single bounce. By dividing its work into small enough batches, a pipeline can keep its memory consumption arbitrarily low.
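
As a rough illustration of the pattern (the URIs are placeholders, and the exact convention by which XProc-Z picks the follow-up requests out of the pipeline’s output is simplified here), a pipeline might answer the browser straight away and queue up the next piece of work for itself:

```xml
<p:declare-step version="1.0" name="start-harvest"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step">

  <!-- the incoming request (unused in this sketch) -->
  <p:input port="source"/>
  <!-- a sequence: the response for the browser, then the follow-up request -->
  <p:output port="result" sequence="true"/>

  <p:identity>
    <p:input port="source">
      <!-- first: tell the user the work has started -->
      <p:inline>
        <c:response status="200">
          <c:body content-type="text/plain">Harvest started; processing continues in the background.</c:body>
        </c:response>
      </p:inline>
      <!-- then: a request back to this same pipeline for the first batch,
           which XProc-Z executes asynchronously in its own thread
           (the URI is a placeholder) -->
      <p:inline>
        <c:request method="get" href="http://localhost:8080/xproc-z/harvest"/>
      </p:inline>
    </p:input>
  </p:identity>
</p:declare-step>
```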

XProc-Z

Last weekend I finally released my latest work of art; a software application called XProc-Z. It’s a fairly small thing, but it’s the result of a lot of thought, and I’m very pleased with it. I hope to make a lot of use of it myself, and I hope I can interest other people in using it too.

A lot of the work I do involves crunching up metadata records, XML-encoded text, web pages, and the like. In the old days I used to use Apache Cocoon for this kind of work, but in recent years the development community has moved Cocoon (especially since version 3) in a different direction. Now it’s more of a Java web server framework with many XML-related features. To actually build an application with Cocoon now, you have to put your Java hat on and write some Java and compile it, with Maven and all that Java stuff. That’s all very well, if you like that sort of thing, but it is not very lightweight. I would prefer to be able to just write a script, and not have to write and compile Java. And there are better languages around for that purpose. In particular, there’s a relatively new language for scripting XML processing, called XProc.

XProc; a language for data plumbing

XProc is a language for writing XML pipelines; it uses the idea of data “flowing” from various sources, step by step through a network of pipes and filters, to reach its destination. It’s a language for data plumbing.

So for the last few years I have tended to use the XProc programming language for XML processing tasks. For tasks such as these XProc is an ideal language because its features are designed for precisely these purposes. For instance it takes only a dozen or so lines of code to read a bunch of XML files from a website, transform them with an XSLT, validate them with a schema, and finally save the valid files in one folder and the invalid files in another.
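
Here is a sketch of that very pipeline; a little longer than a dozen lines once namespaces and comments are included, and with placeholder URIs, stylesheet and schema, but the shape is the same.

```xml
<p:declare-step version="1.0" name="main"
    xmlns:p="http://www.w3.org/ns/xproc">

  <p:for-each name="files">
    <!-- the documents to process; in practice this list might come
         from p:directory-list or from a fetched index page -->
    <p:iteration-source>
      <p:document href="http://example.org/docs/one.xml"/>
      <p:document href="http://example.org/docs/two.xml"/>
    </p:iteration-source>

    <!-- transform each document -->
    <p:xslt name="transform">
      <p:input port="stylesheet">
        <p:document href="transform.xsl"/>
      </p:input>
      <p:input port="parameters">
        <p:empty/>
      </p:input>
    </p:xslt>

    <!-- validate it; by default the step raises an error for an invalid
         document, which the p:catch branch handles -->
    <p:try>
      <p:group>
        <p:validate-with-relax-ng>
          <p:input port="schema">
            <p:document href="schema.rng"/>
          </p:input>
        </p:validate-with-relax-ng>
        <p:store>
          <p:with-option name="href"
              select="concat('valid/', p:iteration-position(), '.xml')"/>
        </p:store>
      </p:group>
      <p:catch>
        <p:store>
          <p:input port="source">
            <p:pipe step="transform" port="result"/>
          </p:input>
          <p:with-option name="href"
              select="concat('invalid/', p:iteration-position(), '.xml')"/>
        </p:store>
      </p:catch>
    </p:try>
  </p:for-each>
</p:declare-step>
```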

Running XProc programs on the Web

Unlike Cocoon, XProc is not intended primarily for writing web servers, and for a while at least, there was no convenient way to run XProc pipelines as a web server at all. I’ve tended to run my XProc pipelines from the command line (using the XProc interpreter Calabash, by Norm Walsh) where they can read from the Web, and write to the Web, but they aren’t themselves actually part of the Web. It’s always struck me, though, that it would be a great language for writing web applications, and so I did some research to try to find a good way to run my XProc code on the Web.

I had a look at about 5 different ways, but none of them offered quite what I wanted. The problem lay in the details of the mechanisms by which an HTTP request is passed to your pipeline, and in which your pipeline outputs its response. For instance, if a browser makes an HTTP “POST” request for a resource with a particular URI, and passes it a bunch of parameters encoded in the “application/x-www-form-urlencoded” format, somehow that request has to invoke a particular pipeline, and pass those parameters to it. As far as I could tell, each of the different frameworks had their own custom mechanism for this. Some of them were quite restrictive; a URI directly identified a pipeline, and any URI parameters were passed to the pipeline as pipeline parameters. Others were more flexible; you could tweak it so that various properties of a request, taken together, identified which pipeline to run; you could pass not just form parameters, but other things, such as HTTP request headers, to the pipeline, and so on. Generally, to customize the way the HTTP request was handled you had to write some Java code, or write some custom XML configuration file. That was all a bit discouraging to me, because it didn’t fit with two requirements that I had in mind:

  • Firstly, I wanted to be able to write an application entirely in the XProc language, without any Java coding or compilation. This is to keep it simple. I myself am a fluent Java programmer, but there are a lot of potential XProc programmers who don’t know Java, and who would find it a barrier to have to set up a Java development environment. Why should they have to? I know that in the library world, and in the Digital Humanities community, there are a lot of people who know XML, and know XSLT, and for whom XProc could be a really easy next step, but having to learn Java (even at a basic level) would be an effective barrier.
  • Secondly, and this is more of an issue for me personally; I want to be able to write XProc applications that handle any kind of HTTP request. I don’t just want to be able to do GET and POST, but also PUT, HEAD, and so on. I want my applications to have access not only to URI parameters, but also to cookies, HTTP request headers, multipart content uploads – everything.

Proxies

It might seem odd to want to be able to handle any arbitrary HTTP request; surely if I’m writing a web application I can ensure my front end makes only the sort of requests that my back end can handle? That’s true — if you’re writing a specific application, or applications of a specific kind. But I want to be able to also write really generic applications; namely proxies.

A web proxy is both a client and a server. It’s an intermediary which sits in between a client and one or more servers. It receives a request from a client, which it passes on to a server (perhaps modifying the request first), and then retrieves the response from the server, which it returns to the client (perhaps modifying the response first).

This allows a proxy to transform an existing web application into something different. For instance a proxy could make a blog look like a Linked Data store, or make a repository of TEI files look like a website, or a map, or an RSS feed. A proxy can turn a website into a web API, or vice versa. It can mash up two or more web APIs and make them look like another web API.

As well as transforming one kind of web server into something quite different, it can also just add some extra feature to an existing web server. For instance, it can enhance the web pages provided by one server by adding links or related information retrieved from another server.

In general I think that the proxy design pattern is seriously under-valued and under-used. It’s a powerful technique for assembling large systems out of smaller parts. The World Wide Web has been designed specifically to facilitate the use of proxies, and yet many web developers are not really even aware of the technique. Part of my goal with XProc-Z is to facilitate and encourage the use of this pattern.

XProc-Z

I figured that to make a really proxy-friendly XProc server, I would have to construct it myself. Earlier this year I had written a program called Retailer, which is a platform for hosting web apps written in the XSLT language, so I started with that and replaced the XSLT bits with XProc bits, using the Calabash XProc interpreter, and I was done.

Reusing XProc’s request and response documents

For maximum flexibility, I decided to pass the entire HTTP request to a single XProc pipeline, and leave it up to the pipeline itself to decide how to handle any headers, parameters, and so on. An XProc pipeline that didn’t need to know about the HTTP Accept header, for instance, could just ignore that header, but the header would always be passed to it anyway, just in case.

To pass the request to the pipeline, and to retrieve the response, I re-used a mechanism already present in the XProc language, which just had to be turned inside out. XProc has a step called http-request, and associated request and response XML document types. In XProc, an HTTP request is made by creating a request document containing the details of the request, and piping the document into an http-request step, which actually makes the HTTP request, and in turn outputs a response document. By following this pattern, I could make use of the existing definitions of request and response, and not have to add any extraneous or “foreign” mechanisms. In XProc-Z’s binding mechanism, an HTTP request received from a web user agent is converted into a request object and passed into the XProc-Z pipeline. The output of the pipeline is expected to be a response document, which XProc-Z converts into an actual HTTP response to the web user agent. In other words, an XProc-Z pipeline has the same signature as the standard XProc http-request step, which makes a lot of sense if you think about it.

This means that an XProc-Z server can make do with a single pipeline which handles any request. The pipeline can parse the request in an arbitrary way; using cookies, parsing URI parameters and HTTP headers, accepting PUT and POST requests, and returning arbitrary HTTP response codes and headers. The business of “routing” request URIs and parsing parameters is all left up to the XProc pipeline itself. So the binding mechanism is very simple and leaves maximum flexibility to the pipeline, which can then be used to implement any kind of HTTP based protocol.
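
To make that concrete, here is about the smallest useful pipeline under this convention (a sketch for illustration, not code from the XProc-Z repository): it accepts whatever c:request arrives and answers with a c:response whose body is the request itself, so the browser shows exactly what the pipeline was given.

```xml
<p:declare-step version="1.0" name="main"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:c="http://www.w3.org/ns/xproc-step">

  <!-- the same signature as p:http-request: a c:request in, a c:response out -->
  <p:input port="source"/>
  <p:output port="result"/>

  <!-- wrap the incoming request as the body of a 200 response -->
  <p:wrap match="/c:request" wrapper="c:body"/>
  <p:add-attribute match="/c:body" attribute-name="content-type"
      attribute-value="application/xml"/>
  <p:wrap match="/c:body" wrapper="c:response"/>
  <p:add-attribute match="/c:response" attribute-name="status"
      attribute-value="200"/>
</p:declare-step>
```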

My first XProc-Z app

Finally, as a demo, I wrote a small pipeline for my friend Dot. The pipeline shows you a list of XML-encoded manuscripts, and lets you pick a selection.

[Screenshot: Making a selection]

Then, when you have made a selection and clicked the button, the pipeline is invoked again. It runs a stylesheet (which Dot had already written) over each of the selected XML files, aggregates the results into a single web page, and returns the page to your browser.
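
In outline, that second stage looks something like the sketch below. This is a simplification with placeholder file names; the real pipeline reads the list of selected manuscripts from the submitted form parameters and uses Dot’s own stylesheet.

```xml
<p:declare-step version="1.0" name="visualise"
    xmlns:p="http://www.w3.org/ns/xproc"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">

  <!-- the aggregated result to send back to the browser -->
  <p:output port="result"/>

  <!-- apply the stylesheet to each selected manuscript -->
  <p:for-each name="manuscripts">
    <p:iteration-source>
      <p:document href="manuscripts/ms-1.xml"/>
      <p:document href="manuscripts/ms-2.xml"/>
    </p:iteration-source>
    <p:output port="result" sequence="true"/>
    <p:xslt>
      <p:input port="stylesheet">
        <p:document href="visualise-illustrations.xsl"/>
      </p:input>
      <p:input port="parameters">
        <p:empty/>
      </p:input>
    </p:xslt>
  </p:for-each>

  <!-- gather the per-manuscript fragments into a single document
       (a real pipeline would insert them into a complete XHTML page) -->
  <p:wrap-sequence wrapper="xhtml:div"/>
</p:declare-step>
```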

[Screenshot: Visualize the placement of illustrations in the manuscripts you selected]
Next steps?

I am planning to use XProc-Z as a framework for building some Linked Open Data software, for publishing Linked Data from various legacy systems. I have some code lying around which mostly just needs some repackaging to turn it into a form that will run in XProc-Z.

I’m open to suggestions, though, and I’d be delighted to see other people using it. Let me know in the comments below if you have any bright ideas, or if you’d like to use it and need a hand getting started.
