Long running and asynchronous processes in XProc-Z

I added a useful feature to my web server framework XProc-Z, to allow it to run processes that take a long time – longer than the lifetime of an HTTP request.

Specifically, I had in mind running OAI-PMH harvests, in which a harvester may need to run for hours and make hundreds or thousands of HTTP requests of its own, in order to download a large set of metadata records. But any long running process faces the same issue: if a person makes a request to initiate a long running process, using a web browser, the server can’t afford to wait until the process is complete before it responds, because the user’s HTTP request will time out. Instead, the server needs to return a response to say “process has started…” and then to continue the processing in the background.

Another issue with long running processes in XProc is that they often require recursion, and can consume infeasibly large amounts of memory. OAI-PMH harvesting is one task which requires recursion. To perform a harvest from an OAI-PMH repository, a harvester makes a request for a batch of records, and receives along with that batch a so-called “resumption token”, which is a reference to the next batch of records. When it has finished processing the first batch, the harvester makes another request to the repository, passing back the resumption token it received earlier, and receives in response another batch of records, with another resumption token. After a number of requests like this, the final batch will include an empty resumption token, indicating that there are no records left to harvest.

In XProc, because of the nature of the language itself, such a process is necessarily a recursive one: to complete the harvest of the first batch one has to finish harvesting the second, and to complete the second, one has to have completed the third, and so on. Every unfinished batch stays on the stack until the last one completes, which tends to produce problems for the XProc interpreter: stack overflows and other memory errors. The canonical solution to such problems is to “flatten” the recursion by converting it into iteration, but since the XProc language doesn’t support this style of programming, it needs to be provided by the application framework instead.
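To make the difference between the two styles concrete, here is a minimal sketch in Python rather than XProc. The in-memory `BATCHES` table and the `list_records` function are hypothetical stand-ins for a real OAI-PMH repository; nothing here is taken from XProc-Z itself. The recursive version has the same shape as the XProc harvest described above, and the iterative version is the “flattened” form.

```python
# Hypothetical stand-in for an OAI-PMH repository. Each entry maps the
# resumption token sent by the harvester (None for the first request)
# to a batch of records plus the token for the next batch.
BATCHES = {
    None: (["rec1", "rec2"], "token-a"),
    "token-a": (["rec3", "rec4"], "token-b"),
    "token-b": (["rec5"], ""),  # empty token: no records left to harvest
}


def list_records(token):
    """Return (records, next_token) for one request to the repository."""
    return BATCHES[token]


def harvest_recursive(token=None):
    """Recursive harvest: the first call cannot return until every later
    batch has been harvested, so the call stack grows by one frame per batch."""
    records, next_token = list_records(token)
    if next_token == "":
        return list(records)
    return list(records) + harvest_recursive(next_token)


def harvest_iterative():
    """Flattened harvest: a loop carries the token forward, so the stack
    depth stays constant however many batches there are."""
    harvested = []
    token = None
    while True:
        records, next_token = list_records(token)
        harvested.extend(records)
        if next_token == "":
            return harvested
        token = next_token
```

Both functions return the same list of records; only the recursive one needs a stack frame per batch, which is what becomes a problem over a harvest of hundreds or thousands of batches.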

[Figure: UML sequence diagram of the XProc-Z trampoline]

Accordingly, I have modified XProc-Z to include what’s known as a “trampoline”. When XProc-Z executes a pipeline, it now expects the pipeline to produce not only an HTTP response (which it sends off to the waiting web browser), but also an optional sequence of requests, which it then executes asynchronously, each in its own thread. These requests are effectively made by a pipeline to itself, via the XProc-Z framework. XProc-Z receives the request from the pipeline and bounces it straight back; hence the term “trampoline”. When a pipeline returns, it releases the memory it has been using, and the next invocation of the pipeline starts with a clean slate. That means the long term memory consumption of the pipeline is limited to what it needs during a single bounce. By dividing its work into small enough batches, a pipeline can keep its memory consumption arbitrarily low.
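The following Python sketch illustrates the general idea of such a trampoline. It is not the XProc-Z implementation: the `Trampoline` class, the `harvest_pipeline` function, the dictionary-shaped requests and the in-memory `BATCHES` and `HARVESTED` stores are all hypothetical names invented for the illustration. The pipeline returns a response plus a list of follow-up requests, and the framework bounces each follow-up back to the same pipeline in a new thread.

```python
import threading

# Hypothetical repository: resumption token -> (records, next token).
BATCHES = {
    None: (["rec1", "rec2"], "token-a"),
    "token-a": (["rec3", "rec4"], "token-b"),
    "token-b": (["rec5"], ""),  # empty token: harvest is finished
}

# Hypothetical stand-in for wherever harvested records are persisted.
HARVESTED = []


def harvest_pipeline(request):
    """Process exactly one batch, then return (response, follow-up requests).
    All state needed for the next bounce travels in the follow-up request."""
    records, next_token = BATCHES[request["token"]]
    HARVESTED.extend(records)
    if next_token == "":
        return "harvest complete", []
    return "harvest started", [{"token": next_token}]


class Trampoline:
    """Runs a pipeline and bounces its follow-up requests back to it."""

    def __init__(self, pipeline):
        self.pipeline = pipeline
        self._lock = threading.Lock()
        self._threads = []

    def handle(self, request):
        """Run the pipeline once; schedule each follow-up in its own thread
        and return the response immediately."""
        response, follow_ups = self.pipeline(request)
        for follow_up in follow_ups:
            thread = threading.Thread(target=self.handle, args=(follow_up,))
            with self._lock:
                # Register and start under the lock so wait() never sees
                # a thread that has not been started yet.
                self._threads.append(thread)
                thread.start()
        return response

    def wait(self):
        """Block until every bounced request has finished (for demos/tests)."""
        while True:
            with self._lock:
                if not self._threads:
                    return
                thread = self._threads.pop()
            thread.join()
```

Calling `Trampoline(harvest_pipeline).handle({"token": None})` returns `"harvest started"` as soon as the first batch has been processed, while the remaining batches are harvested in the background. Each invocation of `harvest_pipeline` returns before the next one does any work, so no invocation holds on to the memory of the batches before it.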
