Fear, uncertainty, and XML 1.0 Fifth Edition

[11 June 2009]

From time to time people tell me that the transition from XML 1.0 Fourth Edition to XML 1.0 Fifth Edition is hard, just as from time to time people have said that the transition from XML 1.0 to XML 1.1 would be hard and might break systems that consume XML data. I just spoke with a friend who told me their company was having internal discussions about what to do about XML 1.0 Fifth Edition, because some of their customers had expressed “concern” (possibly “deep concern”).

I’ve never understood what is hard about either transition; perhaps if I ask here someone can explain it to me.

There are two classes of software to consider: (a) software which checks that a string is a legal XML name, and (b) software which just consumes valid or well-formed XML, without doing its own checking.

Software that actively checks XML names

Obviously, if you are going to upgrade your XML processors from 1.0 Fourth Edition to 1.0 Fifth Edition (or to XML 1.1), you are going to need to change them. No one has ever argued seriously that that’s hard (not even Noah Mendelsohn). Anyone who has written a parser for names can tell you that the new definition of Name is simpler; the only serious likelihood is that a programmer comparing the complexity of the old definition with the relative simplicity of the new may be mildly depressed that the complexity was ever needed (long story, let’s not go there), and will lose a moment or two sighing deeply.
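To make the point about simplicity concrete, here is a minimal sketch of a Fifth Edition name checker in Python. The code point ranges are transcribed from the NameStartChar and NameChar productions of the spec as I understand them; the function and table names are illustrative assumptions, not part of any real library, and a production parser should be checked against the spec itself.

```python
# Sketch only: ranges follow the Fifth Edition NameStartChar / NameChar
# productions; all identifiers here are hypothetical.
NAME_START_RANGES = [
    (0x3A, 0x3A),        # ":"
    (0x41, 0x5A),        # A-Z
    (0x5F, 0x5F),        # "_"
    (0x61, 0x7A),        # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Characters allowed after the first position, in addition to the above.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),        # "-" and "."
    (0x30, 0x39),        # 0-9
    (0xB7, 0xB7),        # middle dot
    (0x300, 0x36F),      # combining diacritical marks
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """Return True if the code point of ch falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_name_5e(s):
    """Check a string against the Name production: one start char, then name chars."""
    return (
        bool(s)
        and is_name_start_char(s[0])
        and all(is_name_char(ch) for ch in s[1:])
    )
```

The whole definition fits in two short range tables, which is roughly what "simpler" means here: the Fourth Edition rules need several long lists of individually enumerated ranges to do the same job.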

Software that isn’t an XML parser but which has decided for reasons of its own to use XML’s definition of Name may or may not also need to change. Since it’s not an XML parser, it has no obligation to follow the XML spec in the first place. But if you want to change it to keep it in synch with XML, the change is simple, just as it is for an XML parser.

Software that doesn’t check XML names but assumes they are OK

Noah Mendelsohn (my esteemed colleague in the W3C XML Schema working group) was eloquent, in presentations I heard him give, about the danger that an XML 1.1 processor would let data through that an XML 1.0 processor would not have let through, and that that new data might break other software which had been relying on the XML processor upstream for sanity-checking its input data. Such reliance is not at all a bad thing; one point of using XML is precisely that valid or well-formed XML is much more predictable than arbitrary octet sequences.

Software of this kind, which doesn’t itself check its input data, can in principle break when presented with data it’s not prepared for. So the prospect that XML 1.1 (or XML 1.0 Fifth Edition) might break such software naturally scares people a lot. Noah (and possibly others) successfully scared enough people that many people shied away from XML 1.1. Now purveyors of fear, uncertainty, and doubt are trying to scare people away from XML 1.0 Fifth Edition.

But what they are spreading was FUD when they were talking about XML 1.1, and it’s FUD now. It’s not logically impossible for software to exist which works fine when presented with XML 1.0 Fourth Edition input, and which will break when presented with Fifth Edition input. But such software would be unusual in the extreme, even eccentric. No one has ever actually identified such software to me. I’ve been asking, every time this comes up, for the last five or six years.

It’s not at all clear that such software could be constructed by any programmer of ordinary competence. To try to prevent the use of minority scripts in XML names, for the sake of avoiding the hypothetical risk of breaking hypothetical software which (if it existed) would be a textbook case of poor design and poor implementation, is just insane.

Let us imagine the existence of such a piece of software; let’s call this software N.

We know very little about N, only that N has no problem with any XML 1.0 name, but will break when confronted with at least some XML 1.1 names that are not also XML 1.0 names. So, let’s see: that means that N is perfectly happy to consume a name containing Tibetan characters, but N might break in an ugly way when confronted with Hittite. Or perhaps N is perfectly happy with the Tibetan characters U+0F47 and U+0F49, which are legal in XML 1.0 4E names, but N will break if confronted with the character U+0F48, which lies between them.

How can this be? By hypothesis, N is not running its own name checker that implements the 1.0 4E rules (if it is, then N belongs in class (a) above, and when confronted with U+0F48 N does not break but issues an error message). What can it possibly be doing with data that comes in marked as a Name, that causes it to handle U+0F47 and U+0F49, but not U+0F48?

As far as I can tell, by far the most common thing to do when ingesting something marked as an XML name is to copy it into a variable typed to accept Unicode strings. Use of this variable may well exploit the fact that it won’t contain blanks. But I haven’t seen much code that is written to exploit the fact that a Unicode string does or does not contain any occurrences of U+0F48, or of characters in various minority writing systems. Maybe I’m just young and ignorant; it’s only thirty years since I started programming, and I’ve mostly worked in fairly restricted areas (text processing, markup, character set problems, that kind of thing), so there’s a lot I don’t know.
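A minimal sketch of that common pattern, in Python: a hypothetical downstream consumer that stores names as ordinary Unicode strings and relies only on their containing no blanks. The helper names are invented for illustration. Nothing in it can tell U+0F48 from its two neighbours.

```python
def pack_names(names):
    """Join names into one space-separated string.

    Relies only on the guarantee that an XML name contains no blanks.
    """
    for name in names:
        if " " in name:
            raise ValueError("not an XML name (contains a blank): %r" % name)
    return " ".join(names)


def unpack_names(packed):
    """Split a space-separated string back into the individual names."""
    return packed.split(" ") if packed else []


# U+0F47 and U+0F49 are legal in 4E names; U+0F48 is legal only under 5E.
names = ["a\u0f47", "a\u0f48", "a\u0f49"]
round_tripped = unpack_names(pack_names(names))

# Names also work as dictionary keys, whatever characters they contain.
lookup = {name: len(name) for name in round_tripped}
```

All three names survive the round trip and the dictionary lookup in exactly the same way; the code has no place where the difference between the two editions could matter.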

So, please, if anyone can enlighten me, please do. What rational programmer of even modest competence (or, for that matter, what programmer completely lacking in competence) will write code that (a) is not an implementation of the name rules of XML 1.0 4E, (b) accepts all names defined according to the rules of XML 1.0 4E, and (c) will die when confronted with some name which is legal by the rules of 1.0 5E?

In earlier discussions, Michael Kay tried to suggest why a program might fail on 1.0 5E names, but all the plausible examples of such a program involve the program assuming that the characters are all ASCII, or all ISO 8859-1, or all in some other historical character set. Such programs will certainly fail when confronted with 1.0 5E names. But they will also fail when confronted with XML 1.0 4E names, so they don’t satisfy condition (b) in the list.
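The failure mode can be sketched in a few lines of Python. The functions below are hypothetical stand-ins for a downstream step that wrongly assumes every name is pure ASCII; the example names use the Tibetan characters discussed earlier.

```python
def to_ascii_key(name):
    """Hypothetical downstream step that assumes names are pure ASCII."""
    # str.encode raises UnicodeEncodeError on any non-ASCII character.
    return name.encode("ascii")


def survives(name):
    """Report whether the ASCII-only step copes with the given name."""
    try:
        to_ascii_key(name)
        return True
    except UnicodeEncodeError:
        return False


ascii_name = "title"         # legal under both 4E and 5E
name_4e = "a\u0f47"          # contains U+0F47, legal under 4E
name_5e_only = "a\u0f48"     # contains U+0F48, legal only under 5E

results = {
    "ascii": survives(ascii_name),
    "4e": survives(name_4e),
    "5e_only": survives(name_5e_only),
}
```

The ASCII-only step fails on the Fourth Edition name just as it fails on the Fifth Edition one, so it does not meet condition (b) and is not an example of software N.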

In order to have properties (a), (b), and (c), software would have to be seriously pathological in design and coding. And I don’t mean that in a good sense.

I conclude: insofar as the resistance to XML 1.1 and to XML 1.0 Fifth Edition is based on fear that the shift will break deployed software, it’s irrational and based on a complete misunderstanding of the detailed technical issues involved. Those who are spreading this FUD are doing neither themselves, nor their companies, nor the community, a service.

6 thoughts on “Fear, uncertainty, and XML 1.0 Fifth Edition”

  1. Enrique whispered into my ear:

    “Dude. You so do not understand FUD. Anyone might be a terrorist. So carry your machine gun at all times. And don’t hesitate to use it. He who shoots last, loses.”

  2. That’s the best piece of argumentation I’ve read trying to dispel the FUD about these name rule changes. Thank you!

  3. I think arguments based on the probability of problems arising as a result of normal sensible software development slightly miss the point. There are two things that concern me more:

    (a) there’s a whole mass of interconnected componentry in the field of XML software itself, and this has to satisfy a lot of test cases for boundary conditions. When the components implement different versions of fundamental data types, test cases for edge cases will fail, and the net result is a lot of expense for software developers. You can’t persuade your QA people that it doesn’t matter if they can break the software, it will never happen in the wild. So you end up adding switches to your product to allow the user to configure which version of the spec it should implement, and so on: a general increase in complexity and cost for the whole system, in return for which very few users see a benefit.

    (b) there’s scope for malicious people to exploit the differences in the rules being enforced by different pieces of software, making breakage of systems much more likely than would happen by chance. For example, you can try sending a document containing a 5e name used as an ID, hoping that the data validation at the receiving end will accept it, but that somewhere in the system behind that there will be a 4e parser that throws it out; so the gateway validation has failed to protect the receiving system in the way it was designed to. (I’ll try this with my next tax return…)

  4. Michael, you make good arguments for the proposition that a complex of interconnected XML software should probably be upgraded all at once rather than piecemeal, and that adapting XML systems to follow a change in the XML spec may require revision of one’s test suite. That can, of course, be tricky.

    But I think that’s just an instantiation of the general rule that when you have complex systems in which components make a lot of assumptions (especially undocumented assumptions) about what other components are doing, changing them can be tricky. Nothing about it seems special to XML or to 1.0 5e.

    Your points seem to support the proposition that we should be patient if vendors say the shift to support 5e won’t be instantaneous or even quick. But they do not seem to me to support the claim I have heard made, and which was influential in deterring uptake of XML 1.1, that the change to 5e is likely to damage or cause new failures in innocent bystanders (i.e. downstream consumers of XML data). I continue to believe, based on the evidence available, that that claim is pure FUD.

  5. Although I do want to allow name characters (e.g., KATAKANA MIDDLE DOT) of the 5th edition, I am still concerned. I am not concerned about XML parsers or application programs. I am concerned about W3C specifications that use name characters of XML 1.0. Do all of them allow name characters of the 5th edition? AFAIK, XSD (2nd edition) does not. I heard that the 3rd edition will, but it is not available yet. A nice overview of the current status of other specifications would help a lot.

    The transition is hard because most users have no interest in the new name characters. Please don’t get me wrong. I love them. I need them. But even Japanese users are not strongly interested. Being a good standard does not guarantee success.
