[11 June 2009]
From time to time people tell me that the transition from XML 1.0 Fourth Edition to XML 1.0 Fifth Edition is hard. Just as from time to time people have said the transition from XML 1.0 to XML 1.1 would be hard, and might break systems that consume XML data. I just spoke with a friend who told me their company was having internal discussions about what to do about XML 1.0 Fifth Edition, because some of their customers had expressed “concern” (possibly “deep concern”).
I’ve never understood what is hard about either transition; perhaps if I ask here someone can explain it to me.
There are two classes of software to consider: (a) software which checks that a string is a legal XML name, and (b) software which just consumes valid or well-formed XML, without doing its own checking.
Software that actively checks XML names
Obviously, if you are going to upgrade your XML processors from 1.0 Fourth Edition to 1.0 Fifth Edition (or to XML 1.1), you are going to need to change them. No one has ever argued seriously that that’s hard (not even Noah Mendelsohn). Anyone who has written a parser for names can tell you that the new definition of Name is simpler; the only serious likelihood is that a programmer comparing the complexity of the old definition with the relative simplicity of the new may be mildly depressed that the complexity was ever needed (long story, let’s not go there), and will lose a moment or two sighing deeply.
Software that isn’t an XML parser but which has decided for reasons of its own to use XML’s definition of Name may or may not also need to change. Since it’s not an XML parser, it has no obligation to follow the XML spec in the first place. But if you want to change it to keep it in synch with XML, the change is simple, just as it is for an XML parser.
Software that doesn’t check XML names but assumes they are OK
Noah Mendelsohn (my esteemed colleague in the W3C XML Schema working group) was eloquent, in presentations I heard him give, about the danger that an XML 1.1 processor would let data through that an XML 1.0 processor would not have let through, and that that new data might break other software which had been relying on the XML processor upstream for sanity-checking its input data. Such reliance is not at all a bad thing; one point of using XML is precisely that valid or well-formed XML is much more predictable than arbitrary octet sequences.
Software of this kind, which doesn’t itself check its input data, can in principle break when presented with data it’s not prepared for. So the prospect that XML 1.1 (or XML 1.0 Fifth Edition) might break such software naturally scares people a lot. Noah (and possibly others) successfully scared enough people that many people shied away from XML 1.1. Now purveyors of fear, uncertainty, and doubt are trying to scare people away from XML 1.0 Fifth Edition.
But what they are spreading was FUD when they were talking about XML 1.1, and it’s FUD now. It’s not logically impossible for software to exist which works fine when presented with XML 1.0 Fourth Edition input, and which will break when presented with Fifth Edition input. But such software would be unusual in the extreme, even eccentric. No one has ever actually identified such software to me. I’ve been asking, every time this comes up, for the last five or six years.
It’s not at all clear that such software could be constructed by any programmer of ordinary competence. To try to prevent the use of minority scripts in XML names for the sake of avoiding the hypothetical risk of breaking hypothetical software which (if it existed) would be a textbook case of poor design and poor implementation, is just insane.
Let us imagine the existence of such a piece of software; let’s call this software N.
We know very little about N, only that N has no problem with any XML 1.0 name, but will break when confronted with at least some XML 1.1 names that are not also XML 1.0 names. So, let’s see: that means that N is perfectly happy to consume a name containing Tibetan characters, but N might break in an ugly way when confronted with Hittite. Or perhaps N is perfectly happy with the Tibetan characters U+0F47 and U+0F49, which are legal in XML 1.0 4E names, but N will break if confronted with the character U+0F48, which lies between them.
How can this be? By hypothesis, N is not running its own name checker that implements the 1.0 4E rules (if it is, then N belongs in class (a) above, and when confronted with U+0F48 N does not break but issues an error message). What can it possibly be doing with data that comes in marked as a Name, that causes it to handle U+0F47 and U+0F49, but not U+0F48?
As far as I can tell, by far the most common thing to do when ingesting something marked as an XML name is to copy it into a variable typed to accept Unicode strings. Use of this variable may well exploit the fact that it won’t contain blanks. But I haven’t seen much code that is written to exploit the fact that a Unicode string does or does not contain any occurrences of U+0F48, or of characters in various minority writing systems. Maybe I’m just young and ignorant; it’s only thirty years since I started programming, and I’ve mostly worked in fairly restricted areas (text processing, markup, character set problems, that kind of thing), so there’s a lot I don’t know.
So, please, if anyone can enlighten me, please do. What rational programmer of even modest competence — or for that matter, what programmer completely lacking in competence, will write code that (a) is not an implementation of the name rules of XML 1.0 4E, that (b) accepts all names defined according to the rules of XML 1.0 4E, and that (c) will die when confronted with some name which is legal by the rules of 1.0 5E?
In earlier discussions, Michael Kay tried to suggest why a program might fail on 1.0 5E names, but all the plausible examples of such a program involve the program assuming that the characters are all ASCII, or all ISO 8859-1, or all in some other historical character set. Such programs will certainly fail when confronted with 1.0 5E names. But they will also fail when confronted with XML 1.0 4E names, so they don’t satisfy condition (b) in the list.
In order to have properties (a), (b), and (c), software would have to be seriously pathological in design and coding. And I don’t mean that in a good sense.
I conclude: insofar as the resistance to XML 1.1, and to XML 1.0 Fifth Edition is based on fear that the shift will break deployed software, it’s irrational and based on a complete misunderstanding of the detailed technical issues involved. Those who are spreading this FUD are doing neither themselves, nor their companies, nor the community, a service.