The joy of testing

November 5th, 2009

[5 November 2009, some additions 6 November 2009]

I’m using Jeni Tennison’s xspec to develop tests for a simple stylesheet I’m writing. An xspec test takes the form of a scenario along the lines of

  • When you match a foo element, do this.
  • When you call function bar with these arguments, expect this result.
  • When you call this named template, expect this result.

It’s a relatively young project, and the documentation is perhaps best described as nascent. Working from the documentation (it does exist, which makes for a nice change from some things I work with), I first wrote nine or ten tests to describe the behavior of an existing stylesheet; when I ran the tests against that stylesheet, all of them reported failures, because my formulation of the expected results violated various silent assumptions of the xspec code. That might indicate opportunities for making the xspec documentation more informative. I’ve spent an enjoyable hour or two this evening, however, looking at the xspec code and figuring out how my test cases are confusing it, reformulating them, and watching the bars of red in the test report change, one by one, to green. It’s nice to have a visible sign of forward progress.

There are other XSLT test frameworks I haven’t tried, and I can’t compare xspec to any of them. But I can say this: if you are developing XSLT stylesheets and aren’t using any of the available test frameworks, you really ought to look into xspec.

A helpful page about XSLT testing is maintained by Tony Graham of Menteith Consulting. If xspec doesn’t work out for you, check out the other frameworks he lists there.

Day of the dead

November 2nd, 2009

[2 November 2009]

Today is the Feast of All Souls, better known where I come from as the Day of the Dead. It’s a useful day to remember the dead.

Today, I am thinking particularly of Donald Walker, Antonio Zampolli, Yuri Rubinsky, each important in different ways to me. Life remains (as I expected when they died) a little harder without them around.

It’s also a good day to think about the death that will come for each of us before long.

Deyr fé,    deyia frœndr,
deyr siálfr it sama;
enn orðztírr    deyr aldregi,
hveim er sér góðan getr.
Deyr fé,    deyia frœndr,
deyr siálfr it sama;
ec veit einn,    at aldri deyr:
dómr um dauðan hvern.

What will we leave to those who stay here after us? What would we like to be remembered by?

Firefox and namespace nodes: an open plea

October 21st, 2009

[21 October 2009]

One of the long-standing gaps in Mozilla’s support for XSLT 1.0 is its failure to support the XPath namespace axis; for the many stylesheets that don’t use that axis, the gap is not a problem. But access to the namespace axis is essential for many stylesheets that work upon XSLT stylesheets, or XSD schema documents, or any other documents which may have namespace-qualified data in attribute values and element content; Firefox’s failure to support it means that browser-based tools for those vocabularies must often carry warnings like “Works with everything except Firefox.” What a drag.

So it was encouraging, early this year, when a team of students at Simon Fraser University provided a fix for the bug. (Thank you, SFU! Way to go!) What I don’t understand is: given that there is a fix, given that it passes all the tests, given that this fix removes one of the major blots on Firefox’s XSLT conformance, why isn’t it in the product yet?

I wonder if it’s because the responsible parties don’t perceive the bug or the fix as important; that would be understandable, since with 17 votes in favor of fixing the issue, this bug is way down among the weeds. If that’s the reason, then perhaps it would help if those who do feel the bug is important were to raise the vote total of the relevant bug a bit.

So if you care about XSLT support and have a login ID on bugzilla.mozilla.org, please navigate to bug 94270 and vote for the bug. (Click the ‘vote’ link next to the display of the Importance field.) If you care about XSLT support in Firefox and don’t have a login ID on bugzilla.mozilla.org, I urge you to consider getting a login ID, so you can vote for this bug.

If anyone reading this has insight into the dynamics of getting a fix that’s ready and tested into the next release, I for one would be interested to learn more.

[My evil twin Enrique has produced a poetic version of this plea, addressed to the committers of Firefox:

Without a warning, you broke my heart.
I've got this QName to take apart,
But the namespace axis returned no nodes.
Can't find no binding to guide my code.

So come on, Firefox, Firefox please!
an' I'm beggin' you, Firefox, and I'm on my knees.
Please fix this bug,
Accept the patch.
The namespace axis —
Let it match, let it match, let it match.

I'd refuse to include it, as being a sacrilege against the memory of Ron McKernan, but he'd just hack the web site and add it anyway. And if he did, he'd make it look as if I had included it myself, against my better judgement. He's really past all controlling lately.]

Changing stylesheets in midstream

October 19th, 2009

[19 October 2009]

My evil twin Enrique came by the other day in a great state of excitement. There’s been a bit of a kerfuffle in some W3C working groups lately, he told me. As some readers will know, the W3C recently unveiled a new design for their web site. (Many people seem to want to call this a site redesign, but as far as I know most of the site was originally developed by individuals and working groups working autonomously, and outside the front page, the Tech Reports page, and the other pages maintained by the Communications Team, the site never had a consistent design to begin with. Surely it’s only a redesign if there was a design there in the first place?)

Almost all the comments on the new design appear to be positive — at least, they were until some spec editors and working group chairs noticed that the site redesign had included reformatted versions of their working groups’ current Recommendations, which the working groups had not looked at before and which proved, when examined, to be sub-optimal in some ways.

“Sub-optimal is putting it mildly,” laughed Enrique. “Some of the specs looked like night soil on toast. And some of the editors were fit to be tied.” Enough pain was expressed over the new look of the old specs, apparently, that after a couple of days the standard URLs for existing Recommendations were all reset, and no longer point to the reformatted versions. (The reformatted versions are still around — no one at W3C ever deletes anything, it’s a point of some pride — though you have to know what URIs to point to.)

One of the most visible problems is that in some specs, extra space was appearing before and after large numbers of hyperlinked special terms. “You know what it was?” chortled Enrique. “Some bright young thing at some bright young design agency seems to have thought a 20px padding would be a good idea for the CODE element. Do these people not know any HTML? Here, look at the stylesheet!” He pulled out a hand-held and showed me a rule from one of the new stylesheets (reformatted here for legibility):

h1, h2, h3, h4, h5, h6, ul, ol, dl, p, pre, blockquote, code {
  padding:20px 20px 0 20px;
}

He was cackling with malice now. “The stylesheet author seems to have thought that code was not for inline material but for indented blocks. Where do they get these people? And giving measurements in pixels is so dead-tree-oriented!”

“Now, now,” I said. “I’m sure you were a bright young thing once yourself.”

“Not me,” he returned brightly. “I was fifty-two the day I was born, and I’ve always been dumb as a post.”

“Two, actually. Odd, though,” I said. “When I retrieve the reformatted versions of the XML and XSLT 2.0 specs, I don’t see extra white space around code elements.” I retrieved the stylesheet with the bogus padding values for code; the rule now read

h1, h2, h3, h4, h5, h6, ul, ol, dl, p, pre, blockquote {
    padding:20px 20px 0 20px;
}

“Those bastards!” Enrique cried. “You mean they’ve fixed it? I was going to charge them big bucks to tell them what was wrong!” And he stomped off again in spluttering disappointment. I haven’t seen him since, but I’m not worried; he’ll get over it.

[The new W3C site is the result of a long design history, and really does appear to be an improvement, for the most part. It makes it much easier than the old site to find your way around (or so I believe — I knew the old site structure well enough that the new one just confuses me; I assume that will pass). The new look intended for W3C technical reports (i.e. Recommendations, Notes, Working Drafts, etc.) can be inspected on the beta site's Tech Reports page, or the beta site's version of the new Standards page. I haven't yet decided whether I think the new tech report styling is an improvement or not, and if it is, whether it's enough of an improvement to justify the disruption of restyling the entire body of existing Recommendations. I'll be interested in readers' reactions.

One thing is unsurprising: if you launch a new stylesheet on technical material whose authors and editors pride themselves on precision, you would do well not to make it public until they have confirmed that it is OK. And it would be smart, before you let them see it at all, and certainly before you make it public, to make sure the new stylesheet doesn't introduce highly visible problems like 20 pixels of extra white space around every code element.

Live and learn.]

Looking for open source XML software?

September 30th, 2009

[30 September 2009]

Last week I participated in the XML Summer School organized by Eleven Informatics at St. Edmund Hall in Oxford. I hope the participants enjoyed it as much as the speakers did. The weather certainly cooperated, although it felt more autumnal than summery by the end of the week.

One of my responsibilities during the week was to give a survey of open-source software for XML applications; this turns out to be harder than it might look because there are so many, with such varying degrees of polish, reliability, and completeness. There are several lists of XML software, and open-source software, and open-source XML software (general, or in some specific categories) on the Web, but many of them appear not to have been maintained or updated in several years. (Honorable exceptions include the lists maintained by Ron Bourret on databases and XML, Lars Marius Garshol on XML tools and Topic-Map tools, and Tony Graham on XSLT testing tools.) So the lists I made, arbitrary and capricious though some aspects of them are, may be helpful.

Eventually I plan to turn the information gathered into a more convenient form, and set up some infrastructure to make it easier to maintain, but in the meantime the slides I prepared for the session may be helpful; they provide a coarsely categorized and tersely annotated list of some open-source XML software that readers of this klog may find interesting.

Quote of the day

August 25th, 2009

[25 August 2009]

Data persistence is a crapshoot. Load the dice.

-Dorothea Salo, Equipment and data curation, 7 August 2009 (on preferring widely supported open formats to niche formats and closed formats).

Metadata and search - a concrete example

August 18th, 2009

[18 August 2009]

Here’s a concrete example of the difference between the metadata-aware search we would like to have, and the metadata-oblivious full-text search we mostly have today, encountered the other day at the Balisage 2009 conference in Montréal.

Try to find a video of the song “I don’t want to go to Toronto”, by a group called Radio Free Vestibule.

When I search video.google.com for “I don’t want to go to Toronto”, I get, in first place, a song called “I don’t want to go”, performed live in Toronto. When I put quotation marks around the title, it tells me nothing matches and shows me a video of Elvis Costello singing “I don’t want to go to Chelsea”.

It’s always good to have concrete examples, and I always like real ones better than made-up examples. (Real examples do often have a disconcerting habit of bringing in one complication after another and involving more than one problem, which is why good ones are so hard to find. But I don’t see many extraneous complications in this one.)
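For readers who like to see such things spelled out, here is a toy sketch of the difference, in Python. The three records are hypothetical, modeled loosely on the results described above (they are not real search-engine data), and the function names are mine. The full-text search treats each record as a bag of words; the metadata-aware search knows which field is the title.

```python
# Hypothetical records, loosely modeled on the videos mentioned in the post.
RECORDS = [
    {"id": "live", "title": "I don't want to go",
     "description": "performed live in Toronto"},
    {"id": "chelsea", "title": "I don't want to go to Chelsea",
     "description": "Elvis Costello"},
    {"id": "rfv", "title": "I don't want to go to Toronto",
     "description": "Radio Free Vestibule"},
]


def words(text):
    """Lower-case a string and split it on white space."""
    return text.lower().split()


def full_text_hits(records, query):
    """Metadata-oblivious search: a record matches when every query word
    occurs somewhere in it, no matter in which field."""
    wanted = set(words(query))
    hits = []
    for rec in records:
        bag = set(words(rec["title"] + " " + rec["description"]))
        if wanted <= bag:
            hits.append(rec["id"])
    return hits


def title_hits(records, query):
    """Metadata-aware search: the query must be the title, word for word."""
    return [rec["id"] for rec in records
            if words(rec["title"]) == words(query)]
```

Asked for "I don't want to go to Toronto", the full-text search cannot tell the live performance in Toronto from the song about Toronto, since both records contain all the words; the title search returns only the latter.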

International Symposium on Processing XML Efficiently

August 12th, 2009

[10 August 2009, delayed by network problems ...]

The International Symposium on Processing XML Efficiently, chaired by Michael Kay, has now reached its midpoint, at lunch time.

The morning began with a clear and useful report by Michael Leventhal and Eric Lemoine at LSI, talking about six years of experience building XML chips. Tarari, which was spun off by Intel in 2002, started with a tokenizer on a chip, based on a field-programmable gate array (FPGA) and has gradually developed more capabilities, including parse-time evaluation of XPath queries, later on full XSLT, and even parse-time XSD schema validation. The goal is not to perform full XML processing on the chip, but to perform tasks which software higher up in the stack will find useful.

One property of the chip I found interesting was that it attempts to treat parsing as a stateless activity, which aids security and allows several XML documents to be processed at once. Of course, parsing is not by nature stateless, but the various specialized processes implemented on the chip produce relevant state information as part of their output, and that state information is cycled around to be fed into the input side of the chip together with the next bits of the document in question. It reminds me of the way Web (and CICS) applications make themselves stateless by externalizing the state of a long conversational interaction by sending it to the client as hidden data.
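The idea of externalizing state can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not a description of the LSI hardware: the scanner below knows only about '<' and '>' (no attributes, comments, or CDATA sections), and the names ScanState and scan_chunk are hypothetical. The point is that the function keeps nothing between calls; whatever it needs to remember comes back out with the tokens and must be handed in again with the next chunk of the same document.

```python
from typing import NamedTuple


class ScanState(NamedTuple):
    in_tag: bool = False   # True while between '<' and '>'
    pending: str = ""      # characters of the token seen so far


def scan_chunk(state, chunk):
    """Pure function: (state, chunk) -> (tokens, new_state).

    Nothing is remembered between calls; the caller carries the state.
    """
    tokens = []
    in_tag, pending = state
    for ch in chunk:
        if ch == "<" and not in_tag:
            if pending:
                tokens.append(("text", pending))
            in_tag, pending = True, ""
        elif ch == ">" and in_tag:
            tokens.append(("tag", pending))
            in_tag, pending = False, ""
        else:
            pending += ch
    return tokens, ScanState(in_tag, pending)


def demo():
    """Interleave chunks of two documents through the same scanner."""
    state_a, state_b = ScanState(), ScanState()
    out_a, out_b = [], []
    for chunk_a, chunk_b in [("<a>he", "<b"), ("llo</a>", ">x</b>")]:
        tokens, state_a = scan_chunk(state_a, chunk_a)
        out_a.extend(tokens)
        tokens, state_b = scan_chunk(state_b, chunk_b)
        out_b.extend(tokens)
    return out_a, out_b
```

Because each call receives its state explicitly, chunks of document A and document B can alternate freely without the scanner confusing the two.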

David Maze of IBM then spoke about the Data Power XML appliance; I had heard people talk about it before, but had never gotten a very clear idea of what the box actually does. This talk dispelled a lot of uncertainty. In contrast to the LSI chip, the Data Power appliance is designed to perform full XML processing, and with throughput rather than reduced latency as the design goal. But the list of services offered is still rather similar: low-level parsing, XPath evaluation, XSLT processing, schema validation. Some are done during parsing, and some by means of a set of specialized post-processing primitives.

Rob Cameron and some of his students at Simon Fraser University came next. They have been exploring ways to exploit the single-instruction/multiple-data (SIMD) instructions which have been appearing in the instruction sets of recent chips. They turn a stream of octets into eight streams of bits, and can thus process eight times as much of the document in a single operation as they would otherwise be able to.

The part that blew my mind was the explanation of parsing by means of bitstream addition. He used decimal numeric character references to illustrate. I can’t explain in detail here, but the basic idea is: you make a bit stream for the ‘&’ character (where a 1 in position n means that the nth character in the input stream is a ‘&’). Make a similar bit stream for the ‘#’. AND them together (after shifting one of them by one position, so that each ‘&’ lines up with the character that follows it); you have the occurrences of the ‘&#’ delimiter in the document. Make a similar bit stream for decimal digits; you may frequently have multiple decimal digits in a row. Now comes an extremely unexpected trick. Reverse the bit array, so it runs right to left. Now shift the ‘&#’ delimiter bit stream by one position, and ADD it to the decimal-digit bit stream. If the delimiter was followed by a decimal digit (as in a decimal character reference it must be), there will be two ones in the same column. They will sum to ‘10’, and the one will carry. If the following character is also a decimal digit, it will sum, with the carry, to ‘10’. And so on, until finally you reach the end of the sequence of adjacent decimal digits, and are left with a 1 in the result bitstream. AND that together with the bit stream for the ‘;’ character, and wherever there is a 1 in the result you have found the end of a well-formed character reference; a carried 1 that does not fall on a ‘;’ marks a well-formedness error in the input.

Upshot: A single machine operation has scanned from the beginning to the end of (multiple instances of) a variable-length construct and identified the end-point of (each instance of) the construct. With a little cleverness, this can be applied to pretty much all the regular-language phenomena in XML parsing. The speedups they report as a result are substantial.
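Here is a minimal sketch of the carry trick in Python, written from my notes on the talk and not taken from the Parabix code. Python integers stand in for the bit streams, with bit i standing for character i, so that carries move forward through the text (that is the "reversed" orientation). The function names are hypothetical, hexadecimal references (‘&#x…;’) are not handled, and the final masking with the complement of the digit stream is my own addition, to discard digits that do not belong to any character reference.

```python
def char_stream(text, predicate):
    """Bit stream as an int: bit i is 1 when predicate(text[i]) holds."""
    bits = 0
    for i, ch in enumerate(text):
        if predicate(ch):
            bits |= 1 << i
    return bits


def positions(bits):
    """List the positions whose bit is set, in ascending order."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def scan_decimal_char_refs(text):
    """Return (ok, bad): positions where a decimal character reference
    ends correctly on ';', and positions where one goes wrong."""
    amp = char_stream(text, lambda c: c == "&")
    hash_ = char_stream(text, lambda c: c == "#")
    digit = char_stream(text, lambda c: c in "0123456789")
    semi = char_stream(text, lambda c: c == ";")

    delim = (amp << 1) & hash_   # '#' immediately preceded by '&'
    cursor = delim << 1          # first position after each '&#'
    no_digit = cursor & ~digit   # '&#' not followed by any digit
    start = cursor & digit       # '&#' followed by a digit

    # One addition scans every digit run that begins at a cursor: the
    # carry ripples through the run and lands just past its last digit.
    after = (start + digit) & ~digit

    ok = after & semi
    bad = (after & ~semi) | no_digit
    return positions(ok), positions(bad)
```

For the input "a&#65;b&#66 c 12", a single addition handles both references at once: the first scan lands on the ‘;’ at position 5, the second lands on the space at position 11, which is an error, and the stray digits at the end are ignored.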

The use of bitstream operations provides a lot of parallelism; the one place the Parabix system has to drop into a loop seems to be for the recognition of attribute-value pairs. I keep wondering whether optimistic parallelism might not allow that problem to be solved, too.

Mohamed Zergaoui gave a characteristically useful discussion of streamability and its conceptual complications, and David Lee and Norm Walsh presented some work on timing various different ways of writing and running pipelines of XML processes. When components written in Java were run from scripting languages like bash, the time required for the XML processing (in the real-world application they used for testing) was dwarfed by the cost of repeatedly launching the Java Virtual Machine. Shifting to a system like xmlsh or calabash, which launches the JVM once rather than repeatedly, gained fifty- and hundred-fold speedups.

NACS and W3C

August 7th, 2009

[8 August 2009]

Just read a long interview between Ian Jacobs of W3C and David Ezell, the chair of the W3C XML Schema working group and the representative of the National Association of Convenience and Petroleum Retailers (NACS) on the W3C’s Advisory Committee.

I may be biased, since I’ve worked closely with David for several years, but what he says in the interview seems to illustrate well the advantages to user organizations of being involved in standards development, instead of just leaving the standards work to their vendors. User organizations are woefully underrepresented on pretty much every standards group I know about; I wish more organizations took the enlightened approach NACS has taken in participating in W3C work and in supporting David’s work as chair of the XML Schema working group. Boeing deserves kudos, too, for their participation in XML Schema work. But if we had had even more users, and a less pronounced dominance of the group by vendors, I think the spec would have been better for it.

If your organization can join W3C, or other standards bodies relevant to your work, think about doing so.

Even without being a W3C member, of course, any member of the public is invited to comment on published drafts of specs, and the comments are typically taken very seriously. If you can’t afford the commitment of membership and working group participation, commenting on drafts is a good way to influence the specs. But if you can join, I encourage you to do so!

Balisage is calling …

August 5th, 2009

[5 August 2009]

This week I’m busy trying to wrap things up before heading to Montréal next week for Balisage. Songs from South Pacific keep running through my head, starting of course with “Bali Ha’i” (to which Enrique is working on a contrafacture).

I had meant to post periodically over the summer about papers I’m particularly looking forward to hearing, in the interest of reminding people about the conference and trying to encourage attendance. I only managed one or two, but it seems I needn’t feel guilty, after all. The conference chair, Tommie Usdin of Mulberry Technologies, tells me that we have now pre-registered more people for Balisage 2009 than we have had at any previous Balisage.

So even without my reminding people about what is on the program, people are coming to the conference anyway. Good! But I can’t resist mentioning here: Fabio Vitali and his colleagues have a really super idea for encoding overlapping structures by using RDF (which automatically means that we can try using SPARQL to query such documents). The work on XML representations of overlap in Bielefeld and Lyon continues to bear fruit: Maik Stührenberg and Daniel Jettka of Bielefeld are talking about XStandoff, the successor to the Sekimo General Format (SGF) developed earlier in Bielefeld, while Pierre Edouard Portier and Sylvie Calabretto of Lyon are talking about the problem of constructing documents using formats like Lyon’s MultiX. And Desmond Schmidt of Queensland University of Technology is coming, to talk about his work on overlapping structures in multi-versioned documents.

Norm Walsh and Michael Kay are both talking about pipelines in XML processing. Michael is also chairing a full-day symposium on Monday about efficient processing of XML. (Why did no one from the EXI effort offer a paper?!) Kurt Cagle is talking about XML and linked data (that would be the rebranding of the Semantic Web).

And there’s a lot more. See the program for details.

So this year Balisage will be bigger than ever before.

I hope to see you in Montréal next week!