[10 August 2009, delayed by network problems …]
The International Symposium on Processing XML Efficiently, chaired by Michael Kay, has now reached its midpoint, at lunch time.
The morning began with a clear and useful report by Michael Leventhal and Eric Lemoine of LSI on six years of experience building XML chips. Tarari, which was spun off by Intel in 2002, started with a tokenizer on a chip, based on a field-programmable gate array (FPGA), and has gradually added capabilities: parse-time evaluation of XPath queries, later full XSLT, and even parse-time XSD schema validation. The goal is not to perform full XML processing on the chip, but to perform tasks which software higher up in the stack will find useful.
One property of the chip I found interesting was that it attempts to treat parsing as a stateless activity, which aids security and allows several XML documents to be processed at once. Of course, parsing is not by nature stateless, but the various specialized processes implemented on the chip produce the relevant state information as part of their output, and that state information is cycled around and fed into the input side of the chip together with the next bits of the document in question. It reminds me of the way Web (and CICS) applications make themselves stateless by externalizing the state of a long conversational interaction, sending it to the client as hidden data.
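The pattern is easy to sketch in software. Here is a toy illustration in Python (a sketch of the idea only, not the Tarari interface; all the names are my own invention): the processing function keeps no state of its own, and whatever context survives a chunk boundary is handed back to the caller to be fed in again with the next chunk, so the same function can serve any number of interleaved documents.

    from dataclasses import dataclass

    @dataclass
    class ParseState:
        in_tag: bool = False   # are we inside a '<...>' construct?
        partial: str = ""      # token text spilled across the chunk boundary

    def process_chunk(state: ParseState, chunk: str):
        """Tokenize one chunk into ('tag'|'text', content) events; return
        the events and the state to hand back with the next chunk."""
        events = []
        buf, in_tag = state.partial, state.in_tag
        for ch in chunk:
            if not in_tag and ch == '<':
                if buf:
                    events.append(('text', buf))
                buf, in_tag = '', True
            elif in_tag and ch == '>':
                events.append(('tag', buf))
                buf, in_tag = '', False
            else:
                buf += ch
        return events, ParseState(in_tag, buf)

    # Two documents interleaved through the same stateless function:
    s1, s2 = ParseState(), ParseState()
    ev1, s1 = process_chunk(s1, "<a>hel")
    ev2, s2 = process_chunk(s2, "<x><y")
    more1, s1 = process_chunk(s1, "lo</a>")
    more2, s2 = process_chunk(s2, "></x>")
    print(ev1 + more1)   # [('tag', 'a'), ('text', 'hello'), ('tag', '/a')]
    print(ev2 + more2)   # [('tag', 'x'), ('tag', 'y'), ('tag', '/x')]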
David Maze of IBM then spoke about the DataPower XML appliance; I had heard people talk about it before, but had never gotten a very clear idea of what the box actually does. This talk dispelled a lot of uncertainty. In contrast to the LSI chip, the DataPower appliance is designed to perform full XML processing, with throughput, rather than reduced latency, as the design goal. But the list of services offered is still rather similar: low-level parsing, XPath evaluation, XSLT processing, schema validation. Some are done during parsing, and some by means of a set of specialized post-processing primitives.
Rob Cameron and some of his students at Simon Fraser University came next. They have been exploring ways to exploit the single-instruction/multiple-data (SIMD) instructions which have been appearing in the instruction sets of recent chips. They turn a stream of octets into eight parallel streams of bits, and can thus process eight times as much of the document in a single operation as they otherwise could. The part that blew my mind was the explanation of parsing by means of bitstream addition; he used decimal numeric character references to illustrate. I can’t explain it in detail here, but the basic idea is: you make a bit stream for the ‘&’ character (where a 1 in position n means that the nth character in the input stream is a ‘&’). Make a similar bit stream for the ‘#’, shift one of them by one position, and AND the two together; you now have the occurrences of the ‘&#’ delimiter in the document. Make a similar bit stream for decimal digits; you may frequently have multiple decimal digits in a row. Now comes an utterly unexpected trick. Reverse the bit array, so it reads right to left. Now shift the ‘&#’ delimiter bit stream by one position, and ADD it to the decimal-digit bit stream. If the delimiter was followed by a decimal digit (as in a decimal character reference it must be), there will be two ones in the same column. They will sum to binary ’10’, and the one will carry. If the following character is also a decimal digit, it will sum, with the carry, to ’10’ again. And so on, until finally you reach the end of the sequence of adjacent decimal digits and are left with a single 1 in the result bitstream, in the position just past the last digit. AND that together with the bit stream for the ‘;’ character: wherever there is a 1 in the result, the reference is properly terminated. Wherever that leftover 1 falls on anything other than a ‘;’, you have diagnosed a well-formedness error in the input.
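To make the trick concrete, here is a small sketch in Python (my own reconstruction, not the Parabix code, which of course works on wide SIMD registers rather than Python integers). Arbitrary-precision integers serve as bitstreams, with bit n standing for character n; in this representation, integer addition already carries ‘forward’ through the text, so no explicit reversal is needed.

    def bitstream(text, predicate):
        """An int whose bit n is 1 iff predicate(text[n]) holds."""
        bits = 0
        for n, ch in enumerate(text):
            if predicate(ch):
                bits |= 1 << n
        return bits

    def scan_decimal_refs(text):
        amp   = bitstream(text, lambda c: c == '&')
        hash_ = bitstream(text, lambda c: c == '#')
        digit = bitstream(text, str.isdigit)
        semi  = bitstream(text, lambda c: c == ';')

        # '&#' delimiters: an '&' whose successor is a '#'.
        marker = amp & (hash_ >> 1)

        # Shift the marker to the first character after '&#' and ADD the
        # digit stream: the carry ripples through each run of digits and
        # leaves a single 1 at the first non-digit after the run.
        ends = (digit + (marker << 2)) & ~digit

        return ends & semi, ends & ~semi    # (proper ends, errors)

    text = "x &#38; y &#12q z"
    ok, bad = scan_decimal_refs(text)
    print([n for n in range(len(text)) if ok >> n & 1])    # [6]  the ';'
    print([n for n in range(len(text)) if bad >> n & 1])   # [14] the 'q'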
Upshot: A single machine operation has scanned from the beginning to the end of (multiple instances of) a variable-length construct and identified the end-point of (each instance of) the construct. With a little cleverness, this can be applied to pretty much all the regular-language phenomena in XML parsing. The speedups they report as a result are substantial.
The use of bitstream operations provides a lot of parallelism; the one place the Parabix system has to drop into a loop seems to be for the recognition of attribute-value pairs. I keep wondering whether optimistic parallelism might not allow that problem to be solved, too.
Mohamed Zergaoui gave a characteristically useful discussion of streamability and its conceptual complications, and David Lee and Norm Walsh presented some work on timing various ways of writing and running pipelines of XML processes. When running components written in Java from scripting languages like bash, the time required for the XML processing (in the real-world application they used for testing) was dwarfed by the cost of repeatedly launching the Java Virtual Machine. Shifting to a system like xmlsh or Calabash, which launches the JVM once rather than repeatedly, gained fifty- and hundred-fold speedups.
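For a feel for the numbers, here is a toy measurement in Python (my own sketch, not the benchmark Lee and Walsh ran; it assumes a java binary on the PATH and uses java -version as a stand-in for a real pipeline stage):

    import subprocess, time

    N = 20   # pretend twenty documents go through one Java pipeline stage

    t0 = time.perf_counter()
    for _ in range(N):
        # a fresh JVM per document, as when a bash loop invokes a Java tool
        subprocess.run(["java", "-version"], capture_output=True, check=True)
    elapsed = time.perf_counter() - t0

    # xmlsh and Calabash pay the launch cost once and keep the JVM warm.
    print(f"{N} JVM launches took {elapsed:.2f}s "
          f"({elapsed / N * 1000:.0f} ms each, before any XML work is done)")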