Zotero, Web APIs, and data formats

Conal — Sun, 30 Aug 2015 05:49:44 +0000

I’ve been doing some work recently (for a couple of different clients) with Zotero, the popular reference management software. I’ve always been a big fan of the product. It has a number of great features, including the fact that it integrates with users’ browsers, and can read metadata out of web pages, PDF files, linked data, and a whole bunch of APIs.

One especially nice feature of Zotero is that you can use it to collaborate with a group of people on a shared library of data which is stored in the cloud and synchronized to the devices of the group members.

Getting data out of Zotero’s web API

If you then want to get the data out of Zotero to do other things with it, you have a number of options. Zotero supports many standard export formats, but the problem I found was that none of those export formats exposed the full richness of your data. Some formats don’t include the “Tags” that you can apply to items in your library; some don’t reflect the hierarchical structure of ‘Collections’ in your library; and so on. It seems the only way to get the full story is to use Zotero’s web API.

Like any web API, this API is a great thing; it makes it possible to use Zotero as a platform for building all kinds of other web applications and systems. The nice thing about a web API is that it’s open to being accessed by any other kind of software. You don’t need to write your software in JavaScript or in PHP (the Zotero data server is written in PHP). To access a web API you only need to be able to make HTTP requests, so you’re not tied to any particular platform.

Zotero’s web API is pretty good as web APIs go, though it does have a weakness which is common to many “web APIs”. The weakness is that it’s not obvious how to interpret the data which Zotero provides, and this is a practical barrier to the use of the API. It certainly was for me.

REST

Zotero’s API documentation makes mention of the buzzword “REST”, which is an acronym for “Representational State Transfer”. REST is the name for a style of network communications, defined by a set of design principles or guidelines. A network protocol or web API that conforms to those guidelines is said to be “RESTful”. However, in practice a great many “RESTful” web APIs fail to conform to one or more of the principles, commonly the principle of the “Uniform Interface”, one corollary of which is that the packets of information sent back and forth must be “self-descriptive”.

Self-descriptive messages

To get it right, a RESTful web API needs to provide self-descriptive information; the information it sends you must describe itself sufficiently that you could work out what to do with it. Often the publishers of APIs rely on providing documentation of the different data formats their API provides, and they expect you to have found and read that documentation before you use their API, and to already know what kind of response you will get from the various different parts of their API. But if an API relies on you already knowing what kind of information it provides, then it’s not RESTful. This unfortunately is the case with Zotero.

So how should an API publish “self-descriptive” data?

The HTTP `Content-Type` header

The main mechanism a web server uses to publish self-descriptive data is to include along with the data a Content-Type header which explicitly declares the format of that information using a code called an “Internet Media Type”. There are a zillion of these Internet Media Types, including image/jpeg and image/png for images, text/html for web pages, application/xml for generic XML documents, or application/json for generic JSON data objects. Of these examples, the last two stand out as different because they are not very specific. What does an XML document mean? What does a JSON object mean? They could mean anything at all, because XML and JSON are generic data formats which can be used to transmit all different kinds of information. It’s possible to be more specific about what kind of XML or JSON you are producing, by saying for instance application/rdf+xml (RDF data encoded in XML) or application/ld+json (Linked Data encoded in JSON). But if you only give a more generic Content-Type, then a client will need to look inside the data package itself to determine what it means.
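As a rough illustration of what a client can and cannot learn from this header, here is a minimal Python sketch. The helper names `media_type` and `is_generic`, and the choice to treat only `application/json` and `application/xml` as "generic", are assumptions made for the example rather than part of any standard or of Zotero's API.

```python
from email.message import Message

# Generic container formats named in the text: knowing one of these alone
# does not tell a client what the data means. (Illustrative list only.)
GENERIC_TYPES = {"application/json", "application/xml"}


def media_type(content_type_header: str) -> str:
    """Return the bare media type from a Content-Type header value,
    dropping parameters such as charset."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()


def is_generic(content_type_header: str) -> bool:
    """True if the header names only a generic container format."""
    return media_type(content_type_header) in GENERIC_TYPES


for header in ("application/json; charset=utf-8",
               "application/ld+json",
               "application/rdf+xml"):
    label = "generic" if is_generic(header) else "specific"
    print(header, "->", label)
```

A client that sees one of the specific types can pick a suitable parser straight away; one that sees a generic type has to open the payload and guess.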

If Zotero were to publish its data as application/zotero+json, that would be an improvement. It would mean that Zotero data in this format could be exchanged around in other systems, and still be understandable. As it stands, Zotero’s application/json data can only reliably be understood if you have just read it from Zotero.

Here’s an example of the JSON data you can read from Zotero’s API: https://api.zotero.org/groups/300568/items?v=3&format=json
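Because the API is plain HTTP, reading that URL from a program takes only a few lines. The following Python sketch uses just the standard library; the function names are hypothetical, and whether the request succeeds depends on network access and on the group library being publicly readable. The only thing it assumes about the API is the URL pattern shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_ROOT = "https://api.zotero.org"


def group_items_url(group_id: int, version: int = 3, fmt: str = "json") -> str:
    """Build the items URL for a group library, following the pattern
    of the example URL in the text."""
    query = urlencode({"v": version, "format": fmt})
    return f"{API_ROOT}/groups/{group_id}/items?{query}"


def fetch_group_items(group_id: int):
    """Fetch a group's items and return (content_type, parsed_json).
    Requires network access."""
    request = Request(group_items_url(group_id),
                      headers={"Accept": "application/json"})
    with urlopen(request) as response:
        content_type = response.headers.get("Content-Type")
        body = response.read().decode("utf-8")
    return content_type, json.loads(body)


# Only the URL is built here; call fetch_group_items(300568) to go online.
print(group_items_url(300568))
```

Returning the `Content-Type` alongside the parsed data makes it easy to see exactly how the server has chosen to label its response.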

Namespaces

One of the nice features of XML is the concept of “namespaces”. These are distinct vocabularies with globally unique names, which allow you to unambiguously identify what kind of XML data you are looking at. If a piece of software can recognise the namespace or namespaces that a document uses, then it’s in a position to understand what it means, and to process it usefully. Otherwise a human is going to have to look at the XML and try to make some sense of it. JSON doesn’t have an equivalent to XML Namespaces (although JSON-LD does), so that means that information served up as application/json can’t be considered very self-descriptive.

Another interesting point about XML Namespaces is that each of these vocabularies is uniquely identified by a URI; that is, the URI is the name of the vocabulary. This has the nice feature that you can open an XML file, find the namespace URI, plug that namespace URI into your browser, and magically be presented with some useful information about that vocabulary. In other words, any data in this format will always contain a hyperlink to its own documentation (called a “Namespace Document”).
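To make that concrete, here is a small Python sketch that pulls the namespace URI out of a document's root element, using the TEI namespace as the sample. The function name `root_namespace` is made up for the example; the parsing itself relies only on the standard library's ElementTree, which writes qualified names in the form `{namespace-uri}local-name`.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def root_namespace(xml_text: str) -> Optional[str]:
    """Return the namespace URI of the root element, or None if the
    root element is not in any namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree represents a qualified name as "{namespace-uri}local-name".
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return None


sample = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader/></TEI>'
print(root_namespace(sample))
```

A program can use the returned URI to recognise the vocabulary, and a person can paste it into a browser to look for the namespace document.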

If Zotero were to publish its data in XML, and use a “Zotero” namespace to label all the terms in its vocabulary, then that would be another improvement. Any XML documents of that type could be downloaded from Zotero and stored in any other kind of system, and because they would contain that identifier, they would still be identifiably Zotero-flavoured, even after they had long been detached from Zotero itself.

Formalised data formats

Although it is a problem that Zotero’s JSON data format doesn’t have its own formal name by which it can identify itself, the more critical issue for me in attempting to understand the Zotero data was that the data format exposed by the API is barely documented at all.

If you read the JSON data, you will see names such as publicationTitle, itemType, dateAdded, etc., and you can have a guess at what they mean, but it shouldn’t be necessary to guess what they mean, or to guess at the relationships between them. I had to spend hours analysing the dataset I had extracted from the web API, before I could seriously attempt to convert it to some other form. There is some documentation scattered about here and there, but no authoritative description of the data format. Compare this to the situation with the more formalised formats which Zotero can export: TEI, RIS, MODS, etc., which have formal specifications defining all the terms in their vocabulary.
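In the absence of a specification, one way to start that kind of analysis is simply to count which field names occur across the items you have downloaded. The Python sketch below does this over a tiny hand-made sample. It assumes that each item's descriptive fields sit under a `data` key (falling back to the item itself if not), and the sample values are invented; only the field names publicationTitle, itemType and dateAdded come from the real data.

```python
from collections import Counter


def survey_fields(items):
    """Count how many items use each field name."""
    counts = Counter()
    for item in items:
        # Assumption: descriptive fields are nested under "data";
        # if that key is absent, treat the item itself as the field map.
        fields = item.get("data", item)
        counts.update(fields.keys())
    return counts


# Invented sample standing in for a parsed API response.
sample_items = [
    {"data": {"itemType": "journalArticle", "title": "An article",
              "publicationTitle": "A journal",
              "dateAdded": "2015-08-01T00:00:00Z"}},
    {"data": {"itemType": "book", "title": "A book",
              "dateAdded": "2015-08-02T00:00:00Z"}},
]

for name, count in survey_fields(sample_items).most_common():
    print(f"{name}: used by {count} of {len(sample_items)} items")
```

A survey like this shows which fields exist and how often they are used, but not what they mean, which is exactly the gap that a formal description of the format would fill.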

Is this something that Zotero could do? It’s hard to say; it would require some technical changes to the Zotero data server code, but probably more significantly it would involve a change in collective mindset by the developers involved: to see Zotero’s data model as an abstraction independent of Zotero’s data server application; a genuine public language for communicating between arbitrary bibliographic systems, not merely a kind of window into the internal workings of a particular software system.

This is a common situation in web applications which offer an API: the application developers are focused intellectually on the application itself, its own internal workings and the functionality it provides, and they naturally tend to see the API as merely an aspect of that system. The idea that the data format of the API might have a life independent of their software, or that it might even outlive their software altogether, is a stretch. But if the data which their system works with is important, then it is surely important enough to accord some formal status: to give it a name (an Internet Media Type, or even a namespace URI), to constrain it with a schema, and to explain it with formal documentation.

Next steps

As usual, the code I’m writing is published on GitHub; in the first instance this XProc pipeline to convert a Zotero library to EAD. But this was really a first stab at the problem; where I’d like to go is to try to formalise and specify the Zotero data format itself; to give it an XML encoding with a formal definition, and then to build other systems, such as Linked Data systems, on top of that formalised format.

Zotero – Conal Tuohy's blog