Metadata and search – a concrete example

[18 August 2009]

Here’s a concrete example of the difference between the metadata-aware search we would like to have, and the metadata-oblivious full-text search we mostly have today, encountered the other day at the Balisage 2009 conference in Montréal.

Try to find a video of the song “I don’t want to go to Toronto”, by a group called Radio Free Vestibule.

When I search video.google.com for “I don’t want to go to Toronto”, I get, in first place, a song called “I don’t want to go”, performed live in Toronto. When I put quotation marks around the title, it tells me nothing matches and shows me a video of Elvis Costello singing “I don’t want to go to Chelsea”.

It’s always good to have concrete examples, and I always like real ones better than made-up examples. (Real examples do often have a disconcerting habit of bringing in one complication after another and involving more than one problem, which is why good ones are so hard to find. But I don’t see many extraneous complications in this one.)

International Symposium on Processing XML Efficiently

[10 August 2009, delayed by network problems …]

The International Symposium on Processing XML Efficiently, chaired by Michael Kay, has now reached its midpoint, at lunch time.

The morning began with a clear and useful report by Michael Leventhal and Eric Lemoine at LSI, talking about six years of experience building XML chips. Tarari, which was spun off by Intel in 2002, started with a tokenizer on a chip, based on a field-programmable gate array (FPGA) and has gradually developed more capabilities, including parse-time evaluation of XPath queries, later on full XSLT, and even parse-time XSD schema validation. The goal is not to perform full XML processing on the chip, but to perform tasks which software higher up in the stack will find useful.

One property of the chip I found interesting was that it attempts to treat parsing as a stateless activity, which aids security and allows several XML documents to be processed at once. Of course, parsing is not by nature stateless, but the various specialized processes implemented on the chip produce relevant state information as part of their output, and that state information is cycled around to be fed into the input side of the chip together with the next bits of the document in question. It reminds me of the way Web (and CICS) applications make themselves stateless by externalizing the state of a long conversational interaction by sending it to the client as hidden data.

David Maze of IBM then spoke about the Data Power XML appliance; I had heard people talk about it before, but had never gotten a very clear idea of what the box actually does. This talk dispelled a lot of uncertainty. In contrast to the LSI chip, the Data Power appliance is designed to perform full XML processing, and with throughput rather than reduced latency as the design goal. But the list of services offered is still rather similar: low-level parsing, XPath evaluation, XSLT processing, schema validation. Some are done during parsing, and some by means of a set of specialized post-processing primitives.

Rob Cameron and some of his students at Simon Fraser University came next. They have been exploring ways to exploit the single-instruction/multiple-data (SIMD) instructions which have been appearing in the instruction sets of recent chips. They turn a stream of octets into eight streams of bits, and can thus process eight times as much of the document in a single operation as they would otherwise be able to. The part that blew my mind was the explanation of parsing by means of bitstream addition. He used decimal numeric character references to illustrate. I can’t explain in detail here, but the basic idea is: you make a bit stream for the ‘&’ character (where a 1 in position n means that the nth character in the input stream is a ‘&’. Make a similar bit stream for the ‘#’. And them together; you have the occurrences of the ‘&#’ delimiter in the document. Make a similar bit stream for decimal digits; you may frequently have multiple decimal digits in a row. Now comes an extremely expected trick. Reverse the bit array, so it rights right to left. Now shift the ‘&#’ delimiter bit stream by one position, and ADD it to the decimal-digit bit stream. If the delimiter was followed by a decimal digit (as in a decimal character reference it must be), there will be two ones in the same column. They will sum to ’10’, and the one will carry. If the following character is also a decimal digit, it will sum, with the carry, to ’10’. And so on, until finally you reach the end of the sequence of adjacent decimal digits, and are left with a 1 in the result bitstream. AND that together with the bit stream for the ‘;’ character, and wherever there is a 1 in the result you have now diagnosed a well-formedness error in the input.

Upshot: A single machine operation has scanned from the beginning to the end of (multiple instances of) a variable-length construct and identified the end-point of (each instance of) the construct. With a little cleverness, this can be applied to pretty much all the regular-language phenomena in XML parsing. The speedups they report as a result are substantial.

The use of bitstream operations provides a lot of parallelism; the one place the Parabix system has to drop into a loop seems to be for the recognition of attribute-value pairs. I keep wondering whether optimistic parallelism might not allow that problem to be solved, too.

Mohamed Zergaoui gave a characteristically useful discussion of streamability and its conceptual complications, and David Lee and Norm Walsh presented some work on timing various different ways of writing and running pipelines of XML processes. When running components written in Java, from scripting languages like bash, the time required for the XML processing (in the real-world application they used for testing) was dwarfed by the cost of repeatedly launching the Java Virtual Machine. Shifting to a system like xmlsh or calabash, which launches the JVM once not repeatedly, gained fifty- and hundred-fold speedups.

NACS and W3C

[8 August 2009]

Just read a long interviw between Ian Jacobs of W3C and David Ezell, the chair of the W3C XML Schema working group and the representative of the National Association of Convenience and Petroleum Retailers (NACS) on the W3C’s Advisory Committee.

I may be biased, since I’ve worked closely with David for several years, but what he says in the interview seems to illustrate well the advantages for user organizations to be involved in standards development, instead of just leaving the standards work to their vendors. User organizations are woefully underrepresented on pretty much every standards group I know about; I wish more organizations took the enlightened approach NACS has taken in participating in W3C work and in supporting David’s work as chair of the XML Schema working group. Boeing deserves kudos, too, for their participation in XML Schema work. But if we had had even more users, and a less pronounced dominance of the group by vendors, I think the spec would have been better for it.

If your organization can join W3C, or other standards bodies relevant to your work, think about doing so.

Even without being a W3C member, of course, any member of the public is invited to comment on published drafts of specs, and the comments are typically taken very seriously. If you can’t afford the commitment of membership and working group participation, commenting on drafts is a good way to influence the specs. But if you can join, I encourage you to do so!

Balisage is calling …

[5 August 2009]

This week I’m busy trying to wrap things up before heading to Montréal next week for Balisage. Songs from South Pacific keep running through my head, starting of course with “Bali Ha’i” (to which Enrique is working on a contrafacture).

I had meant to post periodically over the summer about papers I’m particularly looking forward to hearing, in the interest of reminding people about the conference and trying to encourage attendance. I only managed one or two, but it seems I needn’t feel guilty, after all. The conference chair, Tommie Usdin of Mulberry Technologies, tells me that we have now pre-registered more people for Balisage 2009 than we have had at any previous Balisage.

So even without my reminding people about what is on the program, people are coming to the conference anyway. Good! But I can’t resist mentioning here: Fabio Vitali and his colleagues have a really super idea for encoding overlapping structures by using RDF (which automatically means that we can try using SPARQL to query such documents). The continuing work on XML representations of overlap in Bielefeld and Lyon continues to bear fruit: Maik Stührenberg and Daniel Jettka of Bielefeld are talking about XStandoff, the successor to the Sekimo General Format (SGF) developed earlier in Bielefeld, while Pierre Edouard Portier and Sylvie Calabretto of Lyon are talking about the problem of constructing documents using formats like Lyon’s MultiX. And Desmond Schmidt of Queensland University of Technology is coming, to talk about his work on overlapping structures in multi-versioned documents.

Norm Walsh and Michael Kay are both talking about pipelines in XML processing. Michael is also chairing a full-day symposium on Monday about efficient processing of XML. (Why did no one from the EXI effort offer a paper?!) Kurt Cagle is talking about XML and linked data (that would be the rebranding of the Semantic Web).

And there’s a lot more. See the program for details.

So this year Balisage will be bigger than ever before.

I hope to see you in Montréal next week!