XProc – Conal Tuohy's blog

Analysis & Policy Online

Conal — Tue, 27 Jun 2017 23:45:27 +0000

Notes for my Open Repositories 2017 conference presentation. I will edit this post later to flesh it out into a proper blog post.
Follow along at: conaltuohy.com/blog/analysis-policy-online/

background

Early discussion with Amanda Lawrence of APO (which at that time stood for “Australian Policy Online”) about text mining, at the 2015 LODLAM Summit in Sydney.
They needed automation to help with the cataloguing work, to improve discovery.
They needed to understand their own corpus better.
I suggested a particular technical approach based on previous work.
In 2016, APO contracted me to advise and help them build a system that would “mine” metadata from their corpus, and use Linked Data to model and explore it.

constraints

Openness
Integrate metadata from multiple text-mining processes, plus manually created metadata
Minimal dependency on their current platform (Drupal 7, now Drupal 8)
Lightweight; easy to make quick changes

technical approach

Use an entirely external metadata store (a SPARQL Graph Store)
Use a pipeline! Extract, Transform, Load
Use standard protocol to extract data (first OAI-PMH, later sitemaps)
In fact, use web services for everything; the pipeline is then just a simple script that passes data between web services
Sure, XSLT and SPARQL Query, but what the hell is XProc?!

progress

Configured Apache Tika as a web service, using Stanford Named Entity Recognition toolkit
Built XProc pipeline to harvest from Drupal’s OAI-PMH module, download digital objects, process them with Stanford NER via Tika, and store the resulting graphs in Fuseki graph store
Harvested, and produced a graph of part of the corpus, but …
Turned out the Drupal OAI-PMH module was broken! So we used Sitemap instead
“Related” list added to APO dev site (NB I’ve seen this isn’t working in all browsers, and it obviously needs more work; perhaps using an iframe is not the best idea. Try Chrome if you don’t see the list of related pages on the right.)
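
As a rough illustration of the "extract" step once the harvest switched to sitemaps, the sketch below pulls page URLs out of a sitemap document. The real pipeline is written in XProc; this is Python for illustration only, and the sample sitemap and its URLs are made up.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical sitemap content, for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.org/node/1</loc></url>
  <url><loc>http://example.org/node/2</loc></url>
</urlset>"""


def extract_locations(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter("{%s}loc" % SITEMAP_NS)]
```

Each URL returned would then be handed to the next web service in the chain (download, entity recognition, graph store).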

next steps

Visualize the graph
Integrate more of the manually created metadata into the RDF graph
Add topic modelling (using MALLET) alongside the NER

Let’s see the code

Questions?

(if there’s any time remaining)

Zotero, Web APIs, and data formats

Conal — Sun, 30 Aug 2015 05:49:44 +0000

I’ve been doing some work recently (for a couple of different clients) with Zotero, the popular reference management software. I’ve always been a big fan of the product. It has a number of great features, including the fact that it integrates with users’ browsers, and can read metadata out of web pages, PDF files, linked data, and a whole bunch of APIs.

One especially nice feature of Zotero is that you can use it to collaborate with a group of people on a shared library of data which is stored in the cloud and synchronized to the devices of the group members.

Getting data out of Zotero’s web API

If you then want to get the data out of Zotero to do other things with it, you have a number of options. Zotero supports many standard export formats, but the problem I found was that none of those export formats exposed the full richness of your data. Some formats don’t include the “Tags” that you can apply to items in your library; some don’t reflect the hierarchical structure of ‘Collections’ in your library; and so on. It seems the only way to get the full story is to use Zotero’s web API.

Like any web API, this API is a great thing; it makes it possible to use Zotero as a platform for building all kinds of other web applications and systems. The nice thing about a web API is that it’s open to being accessed by any other kind of software. You don’t need to write your software in Javascript or in PHP (the Zotero data server is written in PHP). To access a web API you only need to be able to make HTTP requests, so you’re not tied to any particular platform.

Zotero’s web API is pretty good as web APIs go, though it does have a weakness which is common to many “web APIs”. The weakness is that it’s not obvious how to interpret the data which Zotero provides, and this is a practical barrier to the use of the API. It certainly was for me.

REST

Zotero’s API documentation makes mention of the buzzword “REST”, which is an acronym for “Representational State Transfer”. REST is the name for a style of network communications, defined by a set of design principles or guidelines. A network protocol or web API that conforms to those guidelines is said to be “RESTful”. However, in practice a great many “RESTful” web APIs fail to conform to one or more of the principles, commonly the principle of the “Uniform Interface”, one corollary of which is that the packets of information sent back and forth must be “self-descriptive”.

Self-descriptive messages

To get it right, a RESTful web API needs to provide self-descriptive information; the information it sends you must describe itself sufficiently that you could work out what to do with it. Often the publishers of APIs rely on providing documentation of the different data formats their API provides, and they expect you to have found and read that documentation before you use their API, and to already know what kind of response you will get from the various different parts of their API. But if an API relies on you already knowing what kind of information it provides, then it’s not RESTful. This unfortunately is the case with Zotero.

So how should an API publish “self-descriptive” data?

The HTTP `Content-Type` header

The main mechanism a web server uses to publish self-descriptive data is to include along with the data a Content-Type header which explicitly declares the format of that information using a code called an “Internet Media Type”. There are a zillion of these Internet Media Types, including image/jpeg and image/png for images, text/html for web pages, application/xml for generic XML documents, or application/json for generic JSON data objects. Of these examples, the last two stand out as different because they are not very specific. What does an XML document mean? What does a JSON object mean? They could mean anything at all, because XML and JSON are generic data formats which can be used to transmit all different kinds of information. It’s possible to be more specific about what kind of XML or JSON you are producing, by saying for instance application/rdf+xml (RDF data encoded in XML) or application/ld+json (Linked Data encoded in JSON). But if you only give a more generic Content-Type, then a client will need to look inside the data package itself to determine what it means.

If Zotero were to publish its data as application/zotero+json, that would be an improvement. It would mean that Zotero data in this format could be exchanged around in other systems, and still be understandable. As it stands, Zotero’s application/json data can only reliably be understood if you have just read it from Zotero.
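
To make the distinction concrete, here is a minimal Python sketch of how a client might decide whether a Content-Type header tells it anything specific. The set of "generic" types and the function names are assumptions chosen for illustration, and application/zotero+json is the hypothetical media type suggested above, not one Zotero actually serves.

```python
from email.message import Message

# Media types that name a syntax but say nothing about the vocabulary inside.
# This short list is an assumption for illustration, not an official registry.
GENERIC_TYPES = {"application/xml", "text/xml", "application/json"}


def media_type(content_type_header):
    """Extract the bare media type (without parameters) from a Content-Type header."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()


def is_specific(content_type_header):
    """True if the media type names a particular format rather than a generic syntax."""
    return media_type(content_type_header) not in GENERIC_TYPES
```

With this, `is_specific("application/ld+json; charset=utf-8")` is true and `is_specific("application/json")` is false; in the second case the client has to open the payload and guess.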

Here’s an example of the JSON data you can read from Zotero’s API: https://api.zotero.org/groups/300568/items?v=3&format=json

Namespaces

One of the nice features of XML is the concept of “namespaces”. These are distinct vocabularies with globally unique names, which allow you to unambiguously identify what kind of XML data you are looking at. If a piece of software can recognise the namespace or namespaces that a document uses, then it’s in a position to understand what it means, and to process it usefully. Otherwise a human is going to have to look at the XML and try to make some sense of it. JSON doesn’t have an equivalent to XML Namespaces (although JSON-LD does), so that means that information served up as application/json can’t be considered very self-descriptive.

Another interesting point about XML Namespaces is that each of these vocabularies is uniquely identified by a URI; that is, the URI is the name of the vocabulary. This has the nice feature that you can open an XML file, find the namespace URI, plug that namespace URI into your browser, and magically be presented with some useful information about that vocabulary. In other words, any data in this format will always contain a hyperlink to its own documentation (called a “Namespace Document”).

If Zotero were to publish its data in XML, and use a “Zotero” namespace to label all the terms in its vocabulary, then that would be another improvement. Any XML documents of that type could be downloaded from Zotero and stored in any other kind of system, and because they would contain that identifier, they would still be identifiably Zotero-flavoured, even after they had long been detached from Zotero itself.
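
Finding the namespace of a document takes very little code. The Python sketch below (the function name is made up for illustration) reads the namespace URI off the root element, which is all a piece of software needs in order to recognise the vocabulary.

```python
import xml.etree.ElementTree as ET


def root_namespace(xml_text):
    """Return the namespace URI of the document's root element, or None."""
    tag = ET.fromstring(xml_text).tag
    # ElementTree writes qualified names as "{namespace-uri}local-name".
    if tag.startswith("{"):
        return tag[1:].split("}", 1)[0]
    return None
```

For a TEI document whose root element is `<TEI xmlns="http://www.tei-c.org/ns/1.0">`, this returns the TEI namespace URI; for an element with no namespace it returns `None`.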

Formalised data formats

Although it is a problem that Zotero’s JSON data format doesn’t have its own formal name by which it can identify itself, the more critical issue for me in attempting to understand the Zotero data was that the data format exposed by the API is barely documented at all.

If you read the JSON data, you will see names such as publicationTitle, itemType, dateAdded, etc, and you can have a guess at what they mean, but it shouldn’t be necessary to guess what they mean, or to understand the relationships between them. I had to spend hours analysing the dataset I had extracted from the web API, before I could seriously attempt to convert it to some other form. There is some documentation scattered about here and there, but no authoritative description of the data format. Compare this to the situation with the more formalised formats which Zotero can export: TEI, RIS, MODS, etc, which have formal specifications defining all the terms in their vocabulary.
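
The kind of analysis involved can be sketched in a few lines of Python: survey which field names turn up on which type of item. The records below are made up, using only field names mentioned above, and are simplified; they are not real Zotero output.

```python
from collections import defaultdict

# Made-up records, simplified for illustration; not real Zotero output.
SAMPLE_ITEMS = [
    {"itemType": "journalArticle", "publicationTitle": "Example Journal",
     "dateAdded": "2015-08-01T00:00:00Z"},
    {"itemType": "book", "dateAdded": "2015-08-02T00:00:00Z"},
]


def fields_by_item_type(items):
    """Map each itemType to the sorted list of field names seen on items of that type."""
    seen = defaultdict(set)
    for item in items:
        seen[item.get("itemType", "unknown")].update(item.keys())
    return {item_type: sorted(names) for item_type, names in seen.items()}
```

A survey like this tells you which names occur, but not what they mean or how they relate to one another, which is exactly what a formal specification would provide.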

Is this something that Zotero could do? It’s hard to say; it would require some technical changes to the Zotero data server code, but probably more significantly it would involve a change in collective mindset by the developers involved; to see Zotero’s data model as an abstraction independent of Zotero’s data server application; a genuine public language for communicating between arbitrary bibliographic systems, not merely a kind of window into the internal workings of a particular software system.

This is a common situation in web applications which offer an API: the application developers are focused intellectually on the application itself; its own internal workings and the functionality it provides, and they naturally tend to see the API as merely an aspect of that system. The idea that the data format of the API might have a life independent of their software, or that it might even outlive their software altogether, is a stretch. But if the data which their system works with is important, then it is surely important enough to accord some formal status: to give it a name; an Internet Media Type, even a namespace URI, to constrain it with a schema, and to explain it with formal documentation.

Next steps

As usual, the code I’m writing is published on GitHub; in the first instance this XProc pipeline to convert a Zotero library to EAD. But this was really a first stab at the problem; where I’d like to go is to try to formalise and specify the Zotero data format itself; to give it an XML encoding with a formal definition, and then to build other systems, such as Linked Data systems, on top of that formalised format.

Old News for Twitter

Conal — Sat, 25 Jul 2015 09:59:20 +0000

Yesterday I finished a little development project to build a TwitterBot for New Zealand’s online newspaper archive Papers Past.

What’s a “TwitterBot”? It’s a software application that autonomously (robotically, hence “-bot”) sends tweets. There are a lot of TwitterBots tweeting about all kinds of things. Tim Sherratt has produced a few, including one called @TroveNewsBot which tweets links to articles from the Australian online newspaper archive of Trove, and this was a direct inspiration for my TwitterBot. Recently Hugh Rundle produced a TwitterBot called Aus GLAM Blog Bot that tweets links to blog posts by people blogging in the Australian GLAM (Galleries, Libraries, Archives and Museums) sector. People like me. I’m looking forward to seeing Hugh’s bot tweeting about my bot.

One nice thing about making a TwitterBot is the tight constraints you have to work under. That 140-character limit keeps you focused on doing one thing. Another nice thing about them is they are public performers; they get up on stage in front of the world and sing and dance, or they shout weird slogans, or whatever. If they are interesting, people will follow them. The other great thing about them is that they are autonomous; not even their creators know exactly what they will do and say.

Tim’s bots are written in Python, which is his programming language of choice. Hugh chose to write his in Javascript. My bot is written in XProc, which is a programming language designed for processing markup (XML and HTML) and pushing data around on the web. I’ve been using it for a while, and I thought it would be nice to add some tools to my XProc toolbox for dealing with Twitter. XML hackers may like to check out the source code for @NZPaperBot, on GitHub.

So I set my robot the task of tweeting pictures from newspapers that were published exactly 100 years ago, and after a bit of hacking with Papers Past and with Twitter, my bot posted its first tweet yesterday:

PERFECTLY SANITARY. – Star #100years http://t.co/DKI8NeE4LL– pic.twitter.com/cvHUdoyYSK

— NZ Paper Bot (@NZPaperBot) July 24, 2015

I’m looking forward to seeing what else it comes up with, and to expanding its behaviour in future. I’d like to see it responding to other people’s tweets, and to the tweets of other bots!

Beta release of XProc-Z web server framework

Conal — Thu, 14 May 2015 04:12:04 +0000

I have at last released a “final” version of my web server framework, XProc-Z, for testing. The last features I had wanted to include were:

The ability for the XProc code in the web server to read information from its environment, so that a generic XProc pipeline can be customized by setting configuration properties.
Full support for sending and receiving binary files (i.e. non text files). XProc is really a language for processing XML, but I think it will be handy to be able to deal with binary files as well from time to time.
A few sample XProc pipelines, to demonstrate the capability of the platform.

Now the software is out there for people to try, and already I have a friend — a medievalist — who has installed it and started to use it to develop a web application. It’s exciting to have an “installed base” (one person, but it’s a start!) for the software which previously I was the only one to use.

Also, now that the XProc-Z platform is more or less complete, I will be using it myself to build an application for Library and Archives people to convert their collection metadata into Linked Open Data form.

I hope that the platform will turn out to be useful generally in the Digital Humanities and Library fields; there’s a lot of processing of XML going on, and XProc is an ideal programming language for that. Since it’s designed to run XProc pipelines, on the web, with minimal extras, XProc-Z is also highly appropriate for web-based XML processing applications, making it one of the most concise and simple ways to write applications of that nature. If you are a DH developer and you already know XSLT, XQuery, or XPath, you will find XProc a pretty amenable language – I totally recommend it!

If you’re interested to see it in action, you can view it on this server, running the sample pipelines which I’ve included. You can also view the Java source code or the sample XProc pipelines on the GitHub site. The “main” pipeline is xproc-z.xpl.

If you’re interested to give it a try, and you know (or don’t mind learning) a bit of XProc, feel free to download the software and fire it up on a machine of your own. I am happy to answer questions about it and generally help to get people going. You can comment here on the blog, email me, or post an “issue” on the GitHub site.

XProc-Z

Conal — Tue, 09 Dec 2014 09:38:36 +0000

Last weekend I finally released my latest work of art; a software application called XProc-Z. It’s a fairly small thing, but it’s the result of a lot of thought, and I’m very pleased with it. I hope to make a lot of use of it myself, and I hope I can interest other people in using it too.

A lot of the work I do involves crunching up metadata records, XML-encoded text, web pages, and the like. In the old days I used to use Apache Cocoon for this kind of work, but in recent years the development community has moved Cocoon (especially since version 3) in a different direction. Now it’s more of a Java web server framework with many XML-related features. To actually build an application with Cocoon now, you have to put your Java hat on and write some Java and compile it, with Maven and all that Java stuff. That’s all very well, if you like that sort of thing, but it is not very lightweight. I would prefer to be able to just write a script, and not have to write and compile Java. And there are better languages around for that purpose. In particular, there’s a relatively new language for scripting XML processing, called XProc.

XProc; a language for data plumbing

XProc is a language for writing XML pipelines; it uses the idea of data “flowing” from various sources, step by step through a network of pipes and filters, to reach its destination. It’s a language for data plumbing.

So for the last few years I have tended to use the XProc programming language for XML processing tasks. For tasks such as these XProc is an ideal language because its features are designed for precisely these purposes. For instance it takes only a dozen or so lines of code to read a bunch of XML files from a website, transform them with an XSLT, validate them with a schema, and finally save the valid files in one folder and the invalid files in another.

Running XProc programs on the Web

Unlike Cocoon, XProc is not intended primarily for writing web servers, and for a while at least, there was no convenient way to run XProc pipelines as a web server at all. I’ve tended to run my XProc pipelines from the command line (using the XProc interpreter Calabash, by Norm Walsh) where they can read from the Web, and write to the Web, but they aren’t themselves actually part of the Web. It’s always struck me, though, that it would be a great language for writing web applications, and so I did some research to try to find a good way to run my XProc code on the Web.

I had a look at about five different ways, but none of them offered quite what I wanted. The problem lay in the details of the mechanisms by which an HTTP request is passed to your pipeline, and by which your pipeline outputs its response. For instance, if a browser makes an HTTP “POST” request for a resource with a particular URI, and passes it a bunch of parameters encoded in the “application/x-www-form-urlencoded” format, somehow that request has to invoke a particular pipeline, and pass those parameters to it. As far as I could tell, each of the different frameworks had its own custom mechanism for this. Some of them were quite restrictive; a URI directly identified a pipeline, and any URI parameters were passed to the pipeline as pipeline parameters. Others were more flexible; you could tweak them so that various properties of a request, taken together, identified which pipeline to run; you could pass not just form parameters, but other things, such as HTTP request headers, to the pipeline, and so on. Generally, to customize the way the HTTP request was handled you had to write some Java code, or write some custom XML configuration file. That was all a bit discouraging to me, because it didn’t fit with two requirements that I had in mind:

Firstly, I wanted to be able to write an application entirely in the XProc language, without any Java coding or compilation. This is to keep it simple. I myself am a fluent Java programmer, but there are a lot of potential XProc programmers who don’t know Java, and who would find it a barrier to have to set up a Java development environment. Why should they have to? I know that in the library world, and in the Digital Humanities community, there are a lot of people who know XML, and know XSLT, and for whom XProc could be a really easy next step, but having to learn Java (even at a basic level) would be an effective barrier.
Secondly, and this is more of an issue for me personally; I want to be able to write XProc applications that handle any kind of HTTP request. I don’t just want to be able to do GET and POST, but also PUT, HEAD, and so on. I want my applications to have access not only to URI parameters, but also to cookies, HTTP request headers, multipart content uploads – everything.

Proxies

It might seem odd to want to be able to handle any arbitrary HTTP request; surely if I’m writing a web application I can ensure my front end makes only the sort of requests that my back end can handle? That’s true — if you’re writing a specific application, or applications of a specific kind. But I want to be able to also write really generic applications; namely proxies.

A web proxy is both a client and a server. It’s an intermediary which sits in between a client and one or more servers. It receives a request from a client, which it passes on to a server (perhaps modifying the request first), and then retrieves the response from the server, which it returns to the client (perhaps modifying the response first).

This allows a proxy to transform an existing web application into something different. For instance a proxy could make a blog look like a Linked Data store, or make a repository of TEI files look like a website, or a map, or an RSS feed. A proxy can turn a website into a web API, or vice versa. It can mash up two or more web APIs and make them look like another web API.

As well as transforming one kind of web server into something quite different, it can also just add some extra feature to an existing web server. For instance, it can enhance the web pages provided by one server by adding links or related information retrieved from another server.
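
The pattern itself is simple enough to sketch without any networking. In the Python below, servers are stood in for by plain functions that take a request and return a response; `make_proxy`, `blog_server` and `add_related_links` are hypothetical names invented for this illustration, and the upstream "server" is a stub.

```python
def make_proxy(upstream, rewrite_request=None, rewrite_response=None):
    """Wrap an upstream handler so requests and responses can be modified in transit."""
    def proxy(request):
        if rewrite_request is not None:
            request = rewrite_request(request)
        response = upstream(request)
        if rewrite_response is not None:
            response = rewrite_response(response)
        return response
    return proxy


# Hypothetical upstream server, stubbed as a function for illustration.
def blog_server(request):
    return {"status": 200, "headers": {"Content-Type": "text/html"},
            "body": "<p>post at %s</p>" % request["path"]}


def add_related_links(response):
    """Enhance the upstream page by appending extra content."""
    enhanced = dict(response)
    enhanced["body"] = response["body"] + "<ul><li>related</li></ul>"
    return enhanced


enhanced_blog = make_proxy(blog_server, rewrite_response=add_related_links)
```

To the client, `enhanced_blog` looks like just another server; the original `blog_server` is untouched and unaware of the extra feature.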

In general I think that the proxy design pattern is seriously under-valued and under-used. It’s a powerful technique for assembling large systems out of smaller parts. The World Wide Web has been designed specifically to facilitate the use of proxies, and yet many web developers are not really even aware of the technique. Part of my goal with XProc-Z is to facilitate and encourage the use of this pattern.

XProc-Z

I figured that to make a really proxy-friendly XProc server, I would have to construct it myself. Earlier this year I had written a program called Retailer which is a platform for hosting web apps written in the XSLT language, so I started with that and replaced the XSLT bits with XProc bits, using the Calabash XProc interpreter, and I was done.

Reusing XProc’s `request` and `response` documents

For maximum flexibility, I decided to pass the entire HTTP request to a single XProc pipeline, and leave it up to the pipeline itself to decide how to handle any headers, parameters, and so on. An XProc pipeline that didn’t need to know about the HTTP Accept header, for instance, could just ignore that header, but the header would always be passed to it anyway, just in case.

To pass the request to the pipeline, and to retrieve the response, I re-used a mechanism already present in the XProc language, which just had to be turned inside out. XProc has a step called http-request, and associated request and response XML document types. In XProc, an HTTP request is made by creating a request document containing the details of the request, and piping the document into an http-request step, which actually makes the HTTP request, and in turn outputs a response document. By following this pattern, I could make use of the existing definitions of request and response, and not have to add any extraneous or “foreign” mechanisms. In XProc-Z’s binding mechanism, an HTTP request received from a web user agent is converted into a request object and passed into the XProc-Z pipeline. The output of the pipeline is expected to be a response document, which XProc-Z converts into an actual HTTP response to the web user agent. In other words, an XProc-Z pipeline has the same signature as the standard XProc http-request step, which makes a lot of sense if you think about it.
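
To give a feel for the shape of these documents, the Python sketch below builds a `c:request` element and reads a `c:response` element back, using the element and attribute names from the XProc `http-request` vocabulary. It is only an illustration of the binding idea: XProc-Z itself is written in Java, the helper names here are invented, and the exact documents XProc-Z produces may carry more detail than shown.

```python
import xml.etree.ElementTree as ET

# Namespace of the XProc step vocabulary, conventionally bound to the "c" prefix.
C_NS = "http://www.w3.org/ns/xproc-step"
ET.register_namespace("c", C_NS)


def to_request_document(method, href, headers):
    """Build a c:request element describing an incoming HTTP request."""
    request = ET.Element("{%s}request" % C_NS, {"method": method, "href": href})
    for name, value in headers.items():
        ET.SubElement(request, "{%s}header" % C_NS, {"name": name, "value": value})
    return request


def to_http_response(response):
    """Read the status code and headers back out of a c:response element."""
    headers = {h.get("name"): h.get("value")
               for h in response.findall("{%s}header" % C_NS)}
    return int(response.get("status")), headers
```

The server's job is then reduced to these two conversions, with everything in between left to the pipeline.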

This means that an XProc-Z server can make do with a single pipeline which handles any request. The pipeline can parse the request in an arbitrary way; using cookies, parsing URI parameters and HTTP headers, accepting PUT and POST requests, and returning arbitrary HTTP response codes and headers. The business of “routing” request URIs and parsing parameters is all left up to the XProc pipeline itself. So the binding mechanism is very simple and leaves maximum flexibility to the pipeline, which can then be used to implement any kind of HTTP based protocol.
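
The "one pipeline handles everything" idea can be sketched as a single function that receives the whole request and does its own routing. This is Python rather than XProc, with requests and responses simplified to dictionaries, and the paths and parameter names are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def handle(request):
    """Single entry point: inspect the whole request and decide what to do."""
    url = urlsplit(request["href"])
    params = parse_qs(url.query)
    if request["method"] == "GET" and url.path == "/hello":
        name = params.get("name", ["world"])[0]
        return {"status": 200, "headers": {"Content-Type": "text/plain"},
                "body": "Hello, %s" % name}
    if request["method"] == "PUT":
        # A real application would store request["body"] somewhere.
        return {"status": 204, "headers": {}, "body": ""}
    return {"status": 404, "headers": {"Content-Type": "text/plain"},
            "body": "Not found"}
```

Nothing outside the handler needs to know which URIs or methods exist, so adding a new route or a new HTTP method is a change to the pipeline alone.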

My first XProc-Z app

Finally, as a demo, I wrote a small pipeline for my friend Dot. The pipeline shows you a list of XML-encoded manuscripts, and lets you pick a selection.

Making a selection

Then, when you have made a selection, and clicked the button, the pipeline is invoked again. It runs a stylesheet (which Dot had already written) over each of the selected XML files, aggregates the results into a single web page, and returns the page to your browser.

Visualize the placement of illustrations in the manuscripts you selected

Next steps?

I am planning to use XProc-Z as a framework for building some Linked Open Data software, for publishing Linked Data from various legacy systems. I have some code lying around which mostly just needs some repackaging to turn it into a form that will run in XProc-Z.

I’m open to suggestions, though, and I’d be delighted to see other people using it. Let me know in the comments below if you have any bright ideas, or if you’d like to use it and need a hand getting started.
