International Symposium on Processing XML Efficiently

[10 August 2009, delayed by network problems …]

The International Symposium on Processing XML Efficiently, chaired by Michael Kay, has now reached its midpoint, at lunch time.

The morning began with a clear and useful report by Michael Leventhal and Eric Lemoine of LSI on six years of experience building XML chips. Tarari, which was spun off by Intel in 2002, started with a tokenizer on a chip, based on a field-programmable gate array (FPGA), and has gradually added more capabilities, including parse-time evaluation of XPath queries, later full XSLT, and even parse-time XSD schema validation. The goal is not to perform full XML processing on the chip, but to perform tasks which software higher up in the stack will find useful.

One property of the chip I found interesting was that it attempts to treat parsing as a stateless activity, which aids security and allows several XML documents to be processed at once. Of course, parsing is not by nature stateless, but the various specialized processes implemented on the chip produce relevant state information as part of their output, and that state information is cycled around to be fed into the input side of the chip together with the next bits of the document in question. It reminds me of the way Web (and CICS) applications make themselves stateless by externalizing the state of a long conversational interaction by sending it to the client as hidden data.
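To make that concrete, here is a minimal sketch, in Python rather than silicon, of what externalizing parser state can look like. The toy tokenizer, its two-field state record, and the chunk interface are my own invention for illustration, not the chip’s actual design.

    # A toy "stateless" tokenizer: all parser state lives in the small record
    # that the caller passes in and gets back, never in the processor itself.
    def tokenize_chunk(chunk, state=("TEXT", "")):
        """Return (tokens, new_state); state = (mode, text of any unfinished token)."""
        mode, pending = state                      # mode is "TEXT" or "TAG"
        tokens, buf = [], list(pending)
        for ch in chunk:
            if mode == "TEXT":
                if ch == "<":
                    if buf:
                        tokens.append(("text", "".join(buf)))
                        buf = []
                    mode = "TAG"
                else:
                    buf.append(ch)
            else:                                  # inside a tag
                if ch == ">":
                    tokens.append(("tag", "".join(buf)))
                    buf = []
                    mode = "TEXT"
                else:
                    buf.append(ch)
        return tokens, (mode, "".join(buf))

    # Chunks of different documents can be interleaved; each carries its own state.
    t1, s1 = tokenize_chunk("<a>hel")              # document 1, first chunk
    t2, s2 = tokenize_chunk("<b x='1'>other doc")  # document 2, first chunk
    t3, s1 = tokenize_chunk("lo</a>", s1)          # document 1 resumes where it left off

Because each call is a pure function of the chunk and the incoming state, the processor holds nothing between calls, and chunks belonging to different documents can be freely interleaved.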

David Maze of IBM then spoke about the Data Power XML appliance; I had heard people talk about it before, but had never gotten a very clear idea of what the box actually does. This talk dispelled a lot of uncertainty. In contrast to the LSI chip, the Data Power appliance is designed to perform full XML processing, and with throughput rather than reduced latency as the design goal. But the list of services offered is still rather similar: low-level parsing, XPath evaluation, XSLT processing, schema validation. Some are done during parsing, and some by means of a set of specialized post-processing primitives.

Rob Cameron and some of his students at Simon Fraser University came next. They have been exploring ways to exploit the single-instruction/multiple-data (SIMD) instructions which have been appearing in the instruction sets of recent chips. They turn a stream of octets into eight parallel streams of bits, and can thus process eight times as much of the document in a single operation as they would otherwise be able to. The part that blew my mind was the explanation of parsing by means of bitstream addition. He used decimal numeric character references to illustrate. I can’t explain in detail here, but the basic idea is: you make a bit stream for the ‘&’ character (where a 1 in position n means that the nth character in the input stream is a ‘&’). Make a similar bit stream for the ‘#’. Shift one of them by a position and AND them together; you have the occurrences of the ‘&#’ delimiter in the document. Make a similar bit stream for decimal digits; you may frequently have multiple decimal digits in a row.

Now comes a completely unexpected trick. Reverse the bit array, so it reads right to left. Now shift the ‘&#’ delimiter bit stream by one position, and ADD it to the decimal-digit bit stream. If the delimiter was followed by a decimal digit (as in a decimal character reference it must be), there will be two ones in the same column. They will sum to ‘10’, and the one will carry. If the following character is also a decimal digit, it will sum, with the carry, to ‘10’. And so on, until finally you reach the end of the sequence of adjacent decimal digits and are left with a 1 in the result bitstream, just past the last digit. Compare that with the bit stream for the ‘;’ character: wherever that trailing 1 fails to line up with a ‘;’, you have now diagnosed a well-formedness error in the input.

Upshot: A single machine operation has scanned from the beginning to the end of (multiple instances of) a variable-length construct and identified the end-point of (each instance of) the construct. With a little cleverness, this can be applied to pretty much all the regular-language phenomena in XML parsing. The speedups they report as a result are substantial.
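For the curious, here is a rough sketch in Python of how the trick plays out. This is my reconstruction of the idea, not Parabix code; the sample text and the helper names are invented for illustration. Using Python’s unbounded integers as bit streams, with character n at bit n, makes addition carry toward later characters, so the reversal step above happens implicitly.

    # Bit i of each integer says whether character i has the given property.
    def bitstream(text, pred):
        bits = 0
        for i, ch in enumerate(text):
            if pred(ch):
                bits |= 1 << i
        return bits

    text = "x &#38; y &#1z; z"                 # one good reference, one broken one

    amp   = bitstream(text, lambda c: c == "&")
    hash_ = bitstream(text, lambda c: c == "#")
    digit = bitstream(text, str.isdigit)
    semi  = bitstream(text, lambda c: c == ";")

    # '&' immediately followed by '#': shift one stream and AND.
    delim = (amp << 1) & hash_                 # a 1 on the '#' of each '&#'

    # Shift the delimiter marker onto the first digit, then ADD: the carry
    # ripples through each run of digits and leaves a 1 just past its end.
    # (Edge cases such as '&#;' with no digits at all are ignored here.)
    end_of_run = ((delim << 1) + digit) & ~digit

    ok  = end_of_run & semi                    # digit run terminated by ';'
    bad = end_of_run & ~semi                   # digit run not terminated by ';'

    for i in range(len(text)):
        if ok & (1 << i):
            print("reference correctly terminated at position", i)
        if bad & (1 << i):
            print("ill-formed reference: digits end at position", i, repr(text[i]))

On real hardware, as I understand it, the same additions and ANDs run over fixed-width SIMD registers a block at a time, which is where the speed comes from.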

The use of bitstream operations provides a lot of parallelism; the one place the Parabix system has to drop into a loop seems to be for the recognition of attribute-value pairs. I keep wondering whether optimistic parallelism might not allow that problem to be solved, too.

Mohamed Zergaoui gave a characteristically useful discussion of streamability and its conceptual complications, and David Lee and Norm Walsh presented some work on timing various ways of writing and running pipelines of XML processes. When components written in Java were run from scripting languages like bash, the time required for the XML processing (in the real-world application they used for testing) was dwarfed by the cost of repeatedly launching the Java Virtual Machine. Shifting to a system like xmlsh or calabash, which launches the JVM once rather than once per step, produced fifty- to hundred-fold speedups.
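It is easy to get a feel for the size of that overhead. The sketch below is mine, not Lee and Walsh’s benchmark, and assumes only that a java binary is on the PATH; it simply times repeated JVM launches.

    import subprocess, time

    N = 20
    start = time.time()
    for _ in range(N):
        # Each iteration pays the full cost of starting a JVM, just as a bash
        # script does when it invokes a Java-based XML tool once per document.
        subprocess.run(["java", "-version"], capture_output=True)
    per_launch = (time.time() - start) / N
    print(f"average JVM startup cost: {per_launch:.3f} s per invocation")
    # Multiply that by the number of pipeline steps and documents: it is this
    # overhead, not the XML work, that xmlsh and calabash avoid by keeping a
    # single JVM alive for the whole pipeline.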

Efficient processing of XML

[9 June 2009]

The organizers of Balisage (among them me) have announced that the program is now available for the International Symposium on Processing XML Efficiently, chaired by Michael Kay of Saxonica, which will take place the day before the Balisage conference starts, in the same venue. I reproduce the announcement below.

PROGRAM NOW AVAILABLE

International Symposium on Processing XML Efficiently:
Overcoming Limits on Space, Time, or Bandwidth

Monday August 10, 2009
Hotel Europa, Montréal, Canada

Chair: Michael Kay, Saxonica
Symposium description: http://www.balisage.net/Processing/
Detailed Program: http://www.balisage.net/Processing/Program.html
Registration: http://www.balisage.net/registration.html

Developers have said it: “XML is too slow!”, where “slow” can mean many things including elapsed time, throughput, latency, memory use, and bandwidth consumption.

The aim of this one-day symposium is to understand these problems better and to explore and share approaches to solving them. We’ll hear about attempts to tackle the problem at many levels of the processing stack. Some developers are addressing the XML parsing bottleneck at the hardware level with custom chips or with hardware-assisted techniques. Some researchers are looking for ways to compress XML efficiently without sacrificing the ability to perform queries, while others are focusing on the ability to perform queries and transformations in streaming mode. We’ll hear from a group who believe the problem (and its solution) lies not with the individual component technologies that make up an application, but with the integration technology that binds the components together.

We’ll also hear from someone who has solved the problems in real life, demonstrating that it is possible to build XML-based applications handling very large numbers of documents and a high throughput of queries while offering good response time to users. And that with today’s technologies, not tomorrow’s.

If you are interested in this symposium we invite you to read about
“Balisage: The Markup Conference”, which follows it in the same location:
http://www.balisage.net/

Questions: info@balisage.net

XML in the browser at Balisage

[6 June 2009]

It’s been some time since XML was first specified, with the hope that once browsers supported XML, it would become easy to deliver content on the Web in XML.

A lot of people spent a lot of time, in those early years, warning that you couldn’t really deliver XML on the Web in practice, because too many people were still using non-XML-capable (or non-XSLT-enabled) browsers. (At least, it seemed that way to me.) I got used to the idea that you couldn’t really do it. So it was a bit of a surprise to me, a few years ago, to discover that I could, after all. There are some dark corners in some browsers’ implementations of XSLT (no information about unparsed entities in Opera, no namespace nodes in Firefox — though that last one is being fixed even as I write, which is good news), but there are workarounds; the situation is probably at least as good with respect to XSLT as it is with respect to Javascript and CSS. I have not had to draft an important paper in HTML, or worry about translating it into HTML in order to deliver it on the Web, in years.

Why is this fact not more widely known and exploited by users of XML?

It will surprise no one, after what I just said, that one of the papers I’m looking forward to hearing at Balisage 2009 (coming up very soon — the deadline for late-breaking news submissions is 19 June, and the conference itself is the week of 10 August) is a talk by Alex Milowski under the title “XML in the browser: the next decade”. Alex Milowski is one of the smartest and most thoughtful technologists I know, and I look forward to hearing what he has to say.

He’ll talk (as I know from having read the paper proposal) about what some people hoped for, from XML in the browser, ten years ago, and about what has happened instead. He’ll talk about what can be done with XML in the browser today, and what most needs fixing in the current situation. There will probably be demos. And, most interesting, I expect he will provide some thoughts on the long-term direction we should be trying to take, in order to make the Web and XML get along better together.

If you think XML can help make the Web and the world a better place, you might want to come to Montréal this August to hang around with Alex, and with me, and with others of your kind. It’s always a rewarding experience.