Fear, uncertainty, and XML 1.0 Fifth Edition

[11 June 2009]

From time to time people tell me that the transition from XML 1.0 Fourth Edition to XML 1.0 Fifth Edition is hard, just as from time to time people have said the transition from XML 1.0 to XML 1.1 would be hard and might break systems that consume XML data. I just spoke with a friend who told me their company was having internal discussions about what to do about XML 1.0 Fifth Edition, because some of their customers had expressed “concern” (possibly “deep concern”).

I’ve never understood what is hard about either transition; perhaps if I ask here someone can explain it to me.

There are two classes of software to consider: (a) software which checks that a string is a legal XML name, and (b) software which just consumes valid or well-formed XML, without doing its own checking.

Software that actively checks XML names

Obviously, if you are going to upgrade your XML processors from 1.0 Fourth Edition to 1.0 Fifth Edition (or to XML 1.1), you are going to need to change them. No one has ever argued seriously that that’s hard (not even Noah Mendelsohn). Anyone who has written a parser for names can tell you that the new definition of Name is simpler; the only serious likelihood is that a programmer comparing the complexity of the old definition with the relative simplicity of the new may be mildly depressed that the complexity was ever needed (long story, let’s not go there), and will lose a moment or two sighing deeply.
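To make the simplicity concrete, here is a minimal sketch of a Fifth Edition name checker in Python. The code-point ranges are transcribed from the 5E productions for NameStartChar and NameChar; the function names are my own invention, and a real parser would of course fold this into its tokenizer.

    # Code-point ranges from the XML 1.0 5E NameStartChar production.
    NAME_START_RANGES = [
        (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),  # : A-Z _ a-z
        (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
        (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
        (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
        (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
    ]

    # NameChar adds hyphen, dot, digits, middle dot, and two small ranges.
    NAME_CHAR_RANGES = NAME_START_RANGES + [
        (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
        (0x300, 0x36F), (0x203F, 0x2040),
    ]

    def _in_ranges(ch, ranges):
        cp = ord(ch)
        return any(lo <= cp <= hi for lo, hi in ranges)

    def is_xml_name(s):
        """True if s matches the XML 1.0 5E Name production."""
        return (bool(s)
                and _in_ranges(s[0], NAME_START_RANGES)
                and all(_in_ranges(c, NAME_CHAR_RANGES) for c in s[1:]))

    # U+0F47 and U+0F49 were already name characters in 4E; U+0F48, which
    # lies between them, becomes a name character only with 5E's ranges.
    assert is_xml_name('\u0F47\u0F48\u0F49')

That is the whole job; compare it with the page-long character tables of the Fourth Edition, and the sigh of the mildly depressed programmer becomes audible.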

Software that isn’t an XML parser but which has decided for reasons of its own to use XML’s definition of Name may or may not also need to change. Since it’s not an XML parser, it has no obligation to follow the XML spec in the first place. But if you want to change it to keep it in synch with XML, the change is simple, just as it is for an XML parser.

Software that doesn’t check XML names but assumes they are OK

Noah Mendelsohn (my esteemed colleague in the W3C XML Schema working group) was eloquent, in presentations I heard him give, about the danger that an XML 1.1 processor would let data through that an XML 1.0 processor would not have let through, and that that new data might break other software which had been relying on the XML processor upstream for sanity-checking its input data. Such reliance is not at all a bad thing; one point of using XML is precisely that valid or well-formed XML is much more predictable than arbitrary octet sequences.

Software of this kind, which doesn’t itself check its input data, can in principle break when presented with data it’s not prepared for. So the prospect that XML 1.1 (or XML 1.0 Fifth Edition) might break such software naturally scares people a lot. Noah (and possibly others) successfully scared enough people that many shied away from XML 1.1. Now purveyors of fear, uncertainty, and doubt are trying to scare people away from XML 1.0 Fifth Edition.

But what they are spreading was FUD when they were talking about XML 1.1, and it’s FUD now. It’s not logically impossible for software to exist which works fine when presented with XML 1.0 Fourth Edition input, and which will break when presented with Fifth Edition input. But such software would be unusual in the extreme, even eccentric. No one has ever actually identified such software to me. I’ve been asking, every time this comes up, for the last five or six years.

It’s not at all clear that such software could be constructed by any programmer of ordinary competence. Trying to prevent the use of minority scripts in XML names, for the sake of avoiding the hypothetical risk of breaking hypothetical software which (if it existed) would be a textbook case of poor design and poor implementation, is just insane.

Let us imagine the existence of such a piece of software; let’s call this software N.

We know very little about N, only that N has no problem with any XML 1.0 name, but will break when confronted with at least some XML 1.1 names that are not also XML 1.0 names. So, let’s see: that means that N is perfectly happy to consume a name containing Tibetan characters, but N might break in an ugly way when confronted with Hittite. Or perhaps N is perfectly happy with the Tibetan characters U+0F47 and U+0F49, which are legal in XML 1.0 4E names, but N will break if confronted with the character U+0F48, which lies between them.

How can this be? By hypothesis, N is not running its own name checker that implements the 1.0 4E rules (if it is, then N belongs in class (a) above, and when confronted with U+0F48 N does not break but issues an error message). What can it possibly be doing with data that comes in marked as a Name, that causes it to handle U+0F47 and U+0F49, but not U+0F48?
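For concreteness, here (in a sketch I have invented purely to make the point) is roughly what N would have to contain; I have never seen anything like it in real code:

    def n_handle_name(table, name, value):
        # Invented for illustration: the kind of test N would need in order
        # to swallow every 4E name but die on a name that is new in 5E.
        if '\u0F48' in name:              # single out one code point...
            raise SystemExit('boom')      # ...and fall over, without even
                                          # the courtesy of an error message
        table[name] = value               # N's ordinary work, whatever it is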

As far as I can tell, by far the most common thing to do when ingesting something marked as an XML name is to copy it into a variable typed to accept Unicode strings. Use of this variable may well exploit the fact that it won’t contain blanks. But I haven’t seen much code that is written to exploit the fact that a Unicode string does or does not contain any occurrences of U+0F48, or of characters in various minority writing systems. Maybe I’m just young and ignorant; it’s only thirty years since I started programming, and I’ve mostly worked in fairly restricted areas (text processing, markup, character set problems, that kind of thing), so there’s a lot I don’t know.
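Here, again as an invented Python sketch, is that common case; note that nothing in it cares which code points the name contains:

    def index_element(table, name, element):
        # The name arrives as a Unicode string. We may rely on its having
        # no blanks in it, but nothing here distinguishes U+0F47 or U+0F49
        # from U+0F48; the name is stored and compared, never dissected.
        table.setdefault(name, []).append(element)

Names of this kind get used as dictionary keys, written out again, or compared for equality, and all three operations are indifferent to which scripts the characters come from.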

So if anyone can enlighten me, please do. What rational programmer of even modest competence (or, for that matter, what programmer completely lacking in competence) will write code that (a) is not an implementation of the name rules of XML 1.0 4E, that (b) accepts all names defined according to the rules of XML 1.0 4E, and that (c) will die when confronted with some name which is legal by the rules of 1.0 5E?

In earlier discussions, Michael Kay tried to suggest why a program might fail on 1.0 5E names, but all the plausible examples of such a program involve the program assuming that the characters are all ASCII, or all ISO 8859-1, or all in some other historical character set. Such programs will certainly fail when confronted with 1.0 5E names. But they will also fail when confronted with XML 1.0 4E names, so they don’t satisfy condition (b) in the list.
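A small invented sketch shows why: a program that forces names into a legacy encoding falls over long before the Fifth Edition enters the picture.

    name = '\u0F47'                  # TIBETAN LETTER JA, legal in 4E names
    try:
        name.encode('iso-8859-1')    # the assumed legacy character set
    except UnicodeEncodeError:
        print('chokes on a perfectly good 4E name: condition (b) fails')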

In order to have properties (a), (b), and (c), software would have to be seriously pathological in design and coding. And I don’t mean that in a good sense.

I conclude: insofar as the resistance to XML 1.1 and to XML 1.0 Fifth Edition is based on fear that the shift will break deployed software, it is irrational and based on a complete misunderstanding of the detailed technical issues involved. Those who are spreading this FUD are doing neither themselves, nor their companies, nor the community a service.

Efficient processing of XML

[9 June 2009]

The organizers of Balisage (among them me) have announced that the program is now available for the International Symposium on Processing XML Efficiently, chaired by Michael Kay of Saxonica, which will take place the day before the Balisage conference starts, in the same venue. I reproduce the announcement below.

PROGRAM NOW AVAILABLE

International Symposium on Processing XML Efficiently:
Overcoming Limits on Space, Time, or Bandwidth

Monday August 10, 2009
Hotel Europa, Montréal, Canada

Chair: Michael Kay, Saxonica
Symposium description: http://www.balisage.net/Processing/
Detailed Program: http://www.balisage.net/Processing/Program.html
Registration: http://www.balisage.net/registration.html

Developers have said it: “XML is too slow!”, where “slow” can mean many things including elapsed time, throughput, latency, memory use, and bandwidth consumption.

The aim of this one-day symposium is to understand these problems better and to explore and share approaches to solving them. We’ll hear about attempts to tackle the problem at many levels of the processing stack. Some developers are addressing the XML parsing bottleneck at the hardware level with custom chips or with hardware-assisted techniques. Some researchers are looking for ways to compress XML efficiently without sacrificing the ability to perform queries, while others are focusing on the ability to perform queries and transformations in streaming mode. We’ll hear from a group who believe the problem (and its solution) lies not with the individual component technologies that make up an application, but with the integration technology that binds the components together.

We’ll also hear from someone who has solved the problems in real life, demonstrating that it is possible to build XML-based applications handling very large numbers of documents and a high throughput of queries while offering good response time to users. And that with today’s technologies, not tomorrow’s.

If you are interested in this symposium we invite you to read about
“Balisage: The Markup Conference”, which follows it in the same location:
http://www.balisage.net/

Questions: info@balisage.net

XML in the browser at Balisage

[6 June 2009]

It’s been some time since XML was first specified, with the hope that once browsers supported XML, it would become easy to deliver content on the Web in XML.

A lot of people spent a lot of time, in those early years, warning that you couldn’t really deliver XML on the Web in practice, because too many people were still using non-XML-capable (or non-XSLT-enabled) browsers. (At least, it seemed that way to me.) I got used to the idea that you couldn’t really do it. So it was a bit of a surprise to me, a few years ago, to discover that I could, after all. There are some dark corners in some browsers’ implementations of XSLT (no information about unparsed entities in Opera, no namespace nodes in Firefox — though that last one is being fixed even as I write, which is good news), but there are workarounds; the situation is probably at least as good with respect to XSLT as it is with respect to JavaScript and CSS. I have not had to draft an important paper in HTML, or worry about translating it into HTML in order to deliver it on the Web, in years.
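For anyone who has not tried it, here is a minimal sketch of what delivering XML on the Web amounts to; the file names are invented, and “text/xsl” is the pseudo-type browsers have historically expected in the xml-stylesheet processing instruction:

    # Write a document that points the browser at its own stylesheet;
    # the browser fetches paper.xsl and renders the transformed result.
    paper = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<?xml-stylesheet type="text/xsl" href="paper.xsl"?>\n'
        '<paper><title>XML in the browser</title></paper>\n'
    )
    with open('paper.xml', 'w', encoding='utf-8') as out:
        out.write(paper)

Serve paper.xml and paper.xsl from the same directory and the browser does the rest; no server-side transformation step is needed.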

Why is this fact not more widely known and exploited by users of XML?

It will surprise no one, after what I just said, that one of the papers I’m looking forward to hearing at Balisage 2009 (coming up very soon — the deadline for late-breaking news submissions is 19 June, and the conference itself is the week of 10 August) is a talk by Alex Milowski under the title “XML in the browser: the next decade”. Alex is one of the smartest and most thoughtful technologists I know, and I look forward to hearing what he has to say.

He’ll talk (as I know from having read the paper proposal) about what some people hoped for, from XML in the browser, ten years ago, and about what has happened instead. He’ll talk about what can be done with XML in the browser today, and what most needs fixing in the current situation. There will probably be demos. And, most interesting, I expect he will provide some thoughts on the long-term direction we should be trying to take, in order to make the Web and XML get along better together.

If you think XML can help make the Web and the world a better place, you might want to come to Montréal this August to hang around with Alex, and with me, and with others of your kind. It’s always a rewarding experience.