Fear, uncertainty, and XML 1.0 Fifth Edition

[11 June 2009]

From time to time people tell me that the transition from XML 1.0 Fourth Edition to XML 1.0 Fifth Edition is hard. Just as from time to time people have said the transition from XML 1.0 to XML 1.1 would be hard, and might break systems that consume XML data. I just spoke with a friend who told me their company was having internal discussions about what to do about XML 1.0 Fifth Edition, because some of their customers had expressed “concern” (possibly “deep concern”).

I’ve never understood what is hard about either transition; perhaps if I ask here someone can explain it to me.

There are two classes of software to consider: (a) software which checks that a string is a legal XML name, and (b) software which just consumes valid or well-formed XML, without doing its own checking.

Software that actively checks XML names

Obviously, if you are going to upgrade your XML processors from 1.0 Fourth Edition to 1.0 Fifth Edition (or to XML 1.1), you are going to need to change them. No one has ever argued seriously that that’s hard (not even Noah Mendelsohn). Anyone who has written a parser for names can tell you that the new definition of Name is simpler; the only serious likelihood is that a programmer comparing the complexity of the old definition with the relative simplicity of the new may be mildly depressed that the complexity was ever needed (long story, let’s not go there), and will lose a moment or two sighing deeply.
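
To give a sense of how small the new definition is, here is a rough Python sketch of a Fifth Edition name check. The ranges are transcribed as best recalled from the NameStartChar and NameChar productions and should be checked against the spec before being relied on; the function and constant names are invented for this illustration.

```python
# Ranges recalled from the Fifth Edition's NameStartChar production
# (an assumption to verify against the spec, not an authoritative copy).
NAME_START_RANGES = [
    (0x3A, 0x3A),              # ":"
    (0x41, 0x5A),              # A-Z
    (0x5F, 0x5F),              # "_"
    (0x61, 0x7A),              # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Extra characters allowed after the first position (NameChar).
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),              # "-" and "."
    (0x30, 0x39),              # 0-9
    (0xB7, 0xB7),
    (0x300, 0x36F),
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """True if the character's code point falls in any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_name_5e(s):
    """Check a string against the (recalled) 5E Name production."""
    if not s or not _in_ranges(s[0], NAME_START_RANGES):
        return False
    return all(
        _in_ranges(c, NAME_START_RANGES) or _in_ranges(c, NAME_EXTRA_RANGES)
        for c in s[1:]
    )
```

Under these ranges, is_name_5e("\u0F48") is true, while a string starting with a digit or containing a blank is rejected.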

Software that isn’t an XML parser but which has decided for reasons of its own to use XML’s definition of Name may or may not also need to change. Since it’s not an XML parser, it has no obligation to follow the XML spec in the first place. But if you want to change it to keep it in synch with XML, the change is simple, just as it is for an XML parser.

Software that doesn’t check XML names but assumes they are OK

Noah Mendelsohn (my esteemed colleague in the W3C XML Schema working group) was eloquent, in presentations I heard him give, about the danger that an XML 1.1 processor would let data through that an XML 1.0 processor would not have let through, and that that new data might break other software which had been relying on the XML processor upstream for sanity-checking its input data. Such reliance is not at all a bad thing; one point of using XML is precisely that valid or well-formed XML is much more predictable than arbitrary octet sequences.

Software of this kind, which doesn’t itself check its input data, can in principle break when presented with data it’s not prepared for. So the prospect that XML 1.1 (or XML 1.0 Fifth Edition) might break such software naturally scares people a lot. Noah (and possibly others) successfully scared enough people that many people shied away from XML 1.1. Now purveyors of fear, uncertainty, and doubt are trying to scare people away from XML 1.0 Fifth Edition.

But what they are spreading was FUD when they were talking about XML 1.1, and it’s FUD now. It’s not logically impossible for software to exist which works fine when presented with XML 1.0 Fourth Edition input, and which will break when presented with Fifth Edition input. But such software would be unusual in the extreme, even eccentric. No one has ever actually identified such software to me. I’ve been asking, every time this comes up, for the last five or six years.

It’s not at all clear that such software could be constructed by any programmer of ordinary competence. To try to prevent the use of minority scripts in XML names for the sake of avoiding the hypothetical risk of breaking hypothetical software which (if it existed) would be a textbook case of poor design and poor implementation, is just insane.

Let us imagine the existence of such a piece of software; let’s call this software N.

We know very little about N, only that N has no problem with any XML 1.0 name, but will break when confronted with at least some XML 1.1 names that are not also XML 1.0 names. So, let’s see: that means that N is perfectly happy to consume a name containing Tibetan characters, but N might break in an ugly way when confronted with Hittite. Or perhaps N is perfectly happy with the Tibetan characters U+0F47 and U+0F49, which are legal in XML 1.0 4E names, but N will break if confronted with the character U+0F48, which lies between them.

How can this be? By hypothesis, N is not running its own name checker that implements the 1.0 4E rules (if it is, then N belongs in class (a) above, and when confronted with U+0F48 N does not break but issues an error message). What can it possibly be doing with data that comes in marked as a Name, that causes it to handle U+0F47 and U+0F49, but not U+0F48?

As far as I can tell, by far the most common thing to do when ingesting something marked as an XML name is to copy it into a variable typed to accept Unicode strings. Use of this variable may well exploit the fact that it won’t contain blanks. But I haven’t seen much code that is written to exploit the fact that a Unicode string does or does not contain any occurrences of U+0F48, or of characters in various minority writing systems. Maybe I’m just young and ignorant; it’s only thirty years since I started programming, and I’ve mostly worked in fairly restricted areas (text processing, markup, character set problems, that kind of thing), so there’s a lot I don’t know.
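
A minimal Python sketch of that common pattern follows. The function name and the sample data are hypothetical; the point is only that a consumer which treats names as opaque Unicode strings, and relies on nothing but the absence of blanks, cannot tell U+0F48 from its neighbors.

```python
def index_by_name(pairs):
    """Group values under their names, as a downstream consumer might."""
    table = {}
    for name, value in pairs:
        # The one property relied on: a name contains no blanks.
        if " " in name:
            raise ValueError("not a single token: %r" % name)
        table.setdefault(name, []).append(value)
    return table


# U+0F47 and U+0F49 are legal in 4E names; U+0F48 is legal only under 5E.
names = ["\u0F47", "\u0F48", "\u0F49"]
table = index_by_name((name, len(name)) for name in names)
```

All three names end up as keys in the table; nothing in the code distinguishes the 5E-only name from the other two.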

So, please, if anyone can enlighten me, please do. What rational programmer of even modest competence (or, for that matter, what programmer completely lacking in competence) will write code that (a) is not an implementation of the name rules of XML 1.0 4E, that (b) accepts all names defined according to the rules of XML 1.0 4E, and that (c) will die when confronted with some name which is legal by the rules of 1.0 5E?

In earlier discussions, Michael Kay tried to suggest why a program might fail on 1.0 5E names, but all the plausible examples of such a program involve the program assuming that the characters are all ASCII, or all ISO 8859-1, or all in some other historical character set. Such programs will certainly fail when confronted with 1.0 5E names. But they will also fail when confronted with XML 1.0 4E names, so they don’t satisfy condition (b) in the list.
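
The argument can be made concrete with a small Python sketch. The ASCII-only consumer below is hypothetical, standing in for the kind of program described in those discussions; it fails on the 5E-only character, but it fails in exactly the same way on a character that 4E already allowed.

```python
def ascii_only_consumer(name):
    """Hypothetical legacy consumer that stores names in an ASCII-only field."""
    return name.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input


def survives(name):
    """Report whether the consumer handles the name without an error."""
    try:
        ascii_only_consumer(name)
        return True
    except UnicodeEncodeError:
        return False


legal_in_4e = "\u0F47"  # Tibetan letter cited above as legal in 4E names
new_in_5e = "\u0F48"    # legal only under the 5E rules

results = {
    "4E name": survives(legal_in_4e),
    "5E-only name": survives(new_in_5e),
}
```

Both entries in results are False, so this consumer already violates condition (b) and tells us nothing about the risk of moving to 5E.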

In order to have properties (a), (b), and (c), software would have to be seriously pathological in design and coding. And I don’t mean that in a good sense.

I conclude: insofar as the resistance to XML 1.1, and to XML 1.0 Fifth Edition, is based on fear that the shift will break deployed software, it’s irrational and based on a complete misunderstanding of the detailed technical issues involved. Those who are spreading this FUD are doing neither themselves, nor their companies, nor the community, a service.

Such a little thing, to provoke so many thoughts

[3 February 2009]

I saw an interesting bit of stray XML this morning, which raises a number of questions worth mulling over.

The New York Times sends me email every morning with headlines from the day’s papers; this is one of the perqs, I think, of being a subscriber. Given the choice, I asked for the ASCII-only version (like many people I know, I don’t much like HTML email; I’m not sure this is a well-founded prejudice, but there it is).

In today’s headlines, I find the following fragment:

- QUOTATION OF THE DAY -

"Oh, you're one of <em style="i">them."
- IRIS CHAU, recounting an acquaintance's reaction when she said she
worked at a banking company.

My eye was caught (perhaps this is just a déformation professionnelle) by the unmatched start-tag for an ‘em’ element, appearing in what was intended to be a markup-free context. This seems to tell us several things about the internal system used by the Times, and to invite several questions:

  • They seem to be using markup (perhaps HTML, perhaps some other XML vocabulary), not just for delivery of the headlines but in the system for preparing features like the quotation of the day.
  • Their XML vocabulary seems to require an inline specification of rendering information. This seems odd; wouldn’t it be natural for a stylesheet to say that ‘em’ elements should be italic? How many other renderings of emphasis are there likely to be in the main story block, in a broadsheet? (Oh, well; not my design, and I don’t know what constraints the vocabulary designer was working under.)
  • Assuming that this quotation was cut and pasted out of the story in which it appears (which does not appear to be ill-formed), could we infer that the journalist who did the cut and paste was using an editing tool that had no markup awareness? Would a tool with better awareness of markup have helped here? (E.g. by picking up the end-tag of the ‘em’ element, as soon as it picks up the start-tag, as SoftQuad’s Author/Editor used to do [and presumably still does in its current incarnation], or by dropping the start-tag.)
  • Could a better validator somewhere in the system that generates the ASCII headlines have caught this error? Could it have been fixed automatically, or with minimal-cost human intervention?
  • What properties would a validator for that process need to have, to induce people to view it as an asset instead of a nuisance?
  • What properties would a schema language need to have, in order to make it easier to write a validator with those properties? And how might the schema language go about encouraging implementors to write such validators?
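
As a very rough indication of how cheap the simplest such check could be, the following Python sketch flags anything tag-shaped in text that is supposed to be markup-free. It is a heuristic, not a validator, and says nothing about how the Times's system actually works; the pattern and the function name are invented for this illustration.

```python
import re

# Anything shaped like a start-tag, end-tag, or empty-element tag.
TAG_LIKE = re.compile(r"</?[A-Za-z][\w:.-]*(?:\s+[^<>]*)?/?>")


def stray_markup(text):
    """Return every tag-like substring found in supposedly plain text."""
    return [match.group(0) for match in TAG_LIKE.finditer(text)]


quote = '"Oh, you\'re one of <em style="i">them."'
found = stray_markup(quote)
```

Here found holds the single stray start-tag from the quotation, while ordinary uses of the less-than and greater-than signs in running text (as in "rates < 5% and > 2%") are left alone. Deciding what to do with a hit, whether to drop the tag or route the item to a human, is the harder design question raised above.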

Hmmmmm.

XPath 1, Enrique 0

[27 January 2009]

My evil twin Enrique came by the other evening, excited and full of himself. “I’ve just found a bug in Saxon!” he announced.

“Really?” I said. “That would be entertaining. Michael Kay keeps finding new bugs in the XSD 1.1 spec; it would be nice to pay him back. Still, I doubt very seriously that you’ve found a bug. What’s the story?”

“I’m working on a stylesheet to generate an SVG image showing a particular hierarchy of objects. And at one point, I have to know how many elements named object there are, descended from the object with name="X", including X itself, if X is a preceding sibling of the current element or one of its ancestors. At another point, I need the same number, but excluding X itself. So I wrote two XPath expressions, like this:

count(preceding::object[@name="X"]//object)
count(preceding::object[@name="X"]/descendant::object)

“I expected both to evaluate to 0, if X is not somewhere on our left, and to some pair of numbers n and n – 1, if it is.”

“I think I see what you’re trying to do,” I said. (And I did; I did something very similar myself, not long ago, working on a new type hierarchy diagram for XSD 1.1 Datatypes.) “And what did you get?”

He pulled out his laptop and ran a stylesheet, from which messages reported that the two expressions evaluated sometimes to 0 and 0, and sometimes to 11 and 11, respectively, but never to 12 and 11, which, we established by inspection, was what Enrique wanted.

At this point, dear reader, you may already know what Enrique’s mistake was. If so, I congratulate your perspicuity. I confess that I did not. If you share my uncertainty, it might be rewarding to pause now, before reading on, to figure out for yourself why Enrique was wrong.

“So now do you believe I’ve found a bug in Saxon?” asked Enrique.

“Well, no,” I said.

“What do you mean, no?!” he protested. “Object X has eleven descendants of type object. Right?”

“Right.”

“Plus one for self, so the count of objects descended from X by zero or more steps (i.e. including X itself) is twelve, right?”

“Right.”

“So ‘preceding::object [@name = "X"] / descendant-or-self :: object’ should be returning 12, not 11! Right?”

“Right.”

“You know that ‘//’ is short for descendant-or-self, right?”

“Right.”

[Well, wrong, actually. See below.]

“So it’s a bug! Why won’t you admit that I’ve found a bug in Saxon?”

“Well, let’s put it this way. I know Michael Kay. And I know you. And if your expectations disagree with his code — even if my expectations disagree with his code — well, I know where I’m putting my money. What does xsltproc say?”

Xsltproc also gave the same value to both expressions.

And both Saxon and xsltproc gave the expected answer of 12 for the expression

count(preceding::object[@name="X"]/descendant-or-self::object)

“A bug in both Saxon and xsltproc?” marveled Enrique. “That’s amazing, I must be brilliant!”

“A bug in both Saxon and xsltproc? That is incredible,” I corrected him. “As in: not believable. Michael can be wrong; that’s possible. Daniel Veillard can be wrong; that’s possible. Michael and DV both wrong, possible, but extremely unlikely. Michael and DV wrong, and you right? Slightly less likely than the spontaneous formation, by a set of atoms, of a working Infinite Improbability Drive.”

[Enrique doesn’t need to be told this, but some readers may need to be reminded that xsltproc is the command-line interface to libxslt, which is written by my friend and former W3C colleague Daniel Veillard, best known to friends as DV. And Michael Kay, of course, is the author of Saxon. If you don’t know what Saxon and libxslt are, dear reader, why on earth are you reading this posting?]

But what was the story?

Eventually, a certain amount of groveling through the prose of section 2.5 of the XPath 1.0 spec showed where Enrique and I had gone wrong. “//” is short for “descendant-or-self” only in a rough and ready way. In particular, “$X//object” is not equivalent to “$X/descendant-or-self::object”, which is clearly what Enrique (and I) had reckoned. Strictly speaking, what XPath says is:

// is short for /descendant-or-self::node()/.

So “$X // object” is equivalent not to “$X/ descendant-or-self :: object”, but to “$X/ descendant-or-self :: node()/ child::object” — or (confusingly, to me, and despite the note in the XPath spec which appears to say differently) to “$X / descendant :: object”. (The note is making a point about how predicates are evaluated, which doesn’t apply in Enrique’s case.)
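
The difference can also be checked outside XSLT. The following Python sketch uses the standard library's ElementTree, whose limited path language has no preceding axis, so it starts from X directly. The miniature document and the variable names are invented for illustration, and X has four object descendants here rather than Enrique's eleven.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the hierarchy: X has four object descendants.
doc = ET.fromstring(
    '<root>'
    '<object name="X">'
    '<object name="a"><object name="a1"/><object name="a2"/></object>'
    '<object name="b"/>'
    '</object>'
    '<object name="current"/>'
    '</root>'
)
x = doc.find("object[@name='X']")

# Analogue of $X//object: ElementTree's './/object' searches below X only.
abbreviated = len(x.findall(".//object"))

# The spelled-out expansion $X/descendant-or-self::node()/child::object:
# visit X and every element below it, then count the object children of each.
expanded = sum(
    1 for node in x.iter() for child in node if child.tag == "object"
)

# Analogue of $X/descendant-or-self::object: iter() starts with X itself.
descendant_or_self = sum(1 for _ in x.iter("object"))

print(abbreviated, expanded, descendant_or_self)
```

This prints 4 4 5: the abbreviated form and its expansion agree with each other, and only the explicit descendant-or-self step counts X itself.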

Enrique was crestfallen; he had been sure that his technical credibility would rise sharply if he had found a bug in Saxon and libxslt. I, on the other hand, was relieved; I now knew how to fix the bug in the stylesheet that generates the SVG image of the XSD 1.1 type hierarchy.

Me, I’m going back to my old habit of just ignoring the abbreviated syntax and using the full syntax all the time: it’s less error prone, because it’s more explicit.

Writing tight

[13 November 2008]

Dimitre Novatchev has called attention to a recent question on the Stack Overflow programming Q and A web site:

I have an XPath expression which provides me a sequence of values like the one below:

1 2 2 3 4 5 5 6 7

It is easy to convert this to a set of unique values “1 2 3 4 5 6 7” using the distinct-values function. However, what I want to extract is the list of duplicate values = “2 5”. I can’t think of an easy way to do this. Can anyone help?

Dimitre’s solution is beautiful and concise: 21 characters long (longer if you use variables with names longer than single characters), and well worth the five or ten minutes it took me to work out why it works. I won’t explain it to you, dear reader; you deserve the brief puzzlement and Aha-moment of understanding it.

Despite being terse, it’s not the kind of thing you’d enter in an Obfuscated XPath contest; it just uses an idiom I haven’t seen before. I’ll see it again, though, because I’ll use it myself; as I say, it’s beautiful. (I do confess to a certain curiosity about how he would modify it if, as he says, efficiency needed to be addressed.)
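
For readers who want something to test a candidate answer against, the requirement in the question can be restated in Python. This is deliberately not the XPath idiom and gives nothing away about it; the function name is invented, and the sketch relies on Counter preserving first-seen order, as it does in Python 3.7 and later.

```python
from collections import Counter


def duplicates(seq):
    """Return each value that occurs more than once, in first-seen order."""
    return [value for value, count in Counter(seq).items() if count > 1]


values = [1, 2, 2, 3, 4, 5, 5, 6, 7]
repeated = duplicates(values)
```

For the sequence in the question, repeated is [2, 5].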

Dimitre gets my vote for this month’s best programming-language application of Strunk and White’s rule: “Omit needless words!”

I’ve been wandering early …

[21 October 2008]

I’ve been traveling a good deal lately.

This is the first in a series of posts recording some of my impressions.

In late September the XSL Working Group held a one-week meeting in Prague to work on revisions of XSLT to make it easier to support streaming transformations in XSLT. By streaming, the WG means:

  • The transformation can run in memory independent of document size (sometimes constant memory, sometimes memory proportional to document depth, sometimes memory proportional to the size of discrete windows of data in the document).
  • The transformation can begin delivering results before all of the input is available (e.g. can work on so-called ‘infinite’ XML documents like streams of stock quotations).
  • The transformation can be performed in a single pass over the document.
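
These three properties can be illustrated by analogy in Python, using the standard library's incremental parser. This is only a sketch of what streaming means in general, not of anything the WG has proposed for XSLT; the quotes vocabulary and the function name are made up for the example.

```python
import io
import xml.etree.ElementTree as ET


def stream_quotes(source):
    """Yield (symbol, price) pairs one at a time, in a single pass."""
    events = ET.iterparse(source, events=("start", "end"))
    _, root = next(events)  # the first event is the start of the root element
    for event, elem in events:
        if event == "end" and elem.tag == "quote":
            # A result is delivered before the rest of the input is read.
            yield elem.get("symbol"), float(elem.get("price"))
            # Discard what has been processed so memory does not grow
            # with the length of the document.
            root.clear()


data = io.BytesIO(
    b'<quotes>'
    b'<quote symbol="AAA" price="10.5"/>'
    b'<quote symbol="BBB" price="20.0"/>'
    b'</quotes>'
)
results = list(stream_quotes(data))
```

Because each quote is handled and then discarded, the same loop would work on an input that never ends, which is the sense of the ‘infinite’ documents mentioned above.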

It turns out that for different use cases it can be necessary or useful to:

  • declare a particular input document as streamable / to be streamed if possible
  • declare a particular set of templates as streamable
  • declare that particular parts of the document need to be available in full (buffered for random access) for part of the transform, but can then be discarded (e.g. for windowing use cases)

Some members of the WG may have been apprehensive at the thought of five straight days of WG discussions. Would we have enough to do, or would we run out of ideas on Tuesday and spend Wednesday through Friday staring at the floor in embarrassment while the chair urged us to get some work done? (If no one else harbored these fears, I certainly did.) But in fact we had a lively discussion straight through to the end of the week, and made what I think was good progress toward concrete proposals for the spec.

Among the technical ideas with most legs is (I think) the idea that sometimes what you want to do with a particular node in the input tree can actually be done with a partial copy of the input tree, and that different kinds of partial copy may be appropriate in different situations.

If you perform a deep copy of an element node in an XDM data model instance, for example, you have access to the entire subtree rooted at that node, but not to any of its siblings or ancestors, nor to anything else in the tree from which it came. For cases where you wish to (or must) allow access to the subtree rooted in a node, but to nothing else, such a deep copy is ideal: it preserves the information you want to have available, and it makes all other information inaccessible. (This is essentially the way that XSD 1.1 restricts assertions to the subtree rooted in a given node: logically speaking the assertions are evaluated against a copy of the node, not against the node itself.)

Several kinds of copy can be distinguished. In the terminology of the XSL Working Group (using terms introduced mostly by Michael Kay and Mohamed Zergaoui):

  • Y-copy: contains the subtree rooted in the node being copied, and also all of its ancestor nodes and their attributes, but none of their siblings. It is thus shaped like an upside-down uppercase Y.
  • Nabla-copy: contains just the subtree rooted in the node being copied. It is thus shaped like an upside-down nabla. (Yes, also like a right-side-up delta, but since the Y-copy requires the Y to be inverted, we say nabla copy not delta copy. Besides, a delta copy would sound more like something used in change management.)
  • Dot-copy: contains just the node being copied, itself, and its attributes if any.
  • Yen-copy: like a Y-copy, but includes the siblings of each ancestor together with their attributes (although not their children).
  • Spanish exclamation-point copy: contains just the node being copied, and its ancestors, together with their attributes. Shaped like an exclamation point (dot, with something above it), or like an upside-down Spanish exclamation point.
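
Four of these copies can be sketched in a few lines of Python over ElementTree, which may make the shapes easier to see. The helper names are invented for this illustration, the yen-copy is omitted for brevity, and ElementTree's element-only model is a simplification of XDM.

```python
import copy
import xml.etree.ElementTree as ET


def parent_map(root):
    """Map each element to its parent (ElementTree keeps no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}


def ancestors(node, parents):
    """Return the ancestors of node, nearest first."""
    chain = []
    while node in parents:
        node = parents[node]
        chain.append(node)
    return chain


def dot_copy(node):
    """Just the node and its attributes: no children, no ancestors."""
    return ET.Element(node.tag, dict(node.attrib))


def nabla_copy(node):
    """The whole subtree rooted in the node, and nothing else."""
    return copy.deepcopy(node)


def _wrap_in_ancestors(inner, node, parents):
    """Nest inner inside attribute-only copies of each ancestor of node."""
    for ancestor in ancestors(node, parents):
        shell = ET.Element(ancestor.tag, dict(ancestor.attrib))
        shell.append(inner)
        inner = shell
    return inner  # the copied root


def exclamation_copy(node, parents):
    """The node plus its ancestors, each with attributes only."""
    return _wrap_in_ancestors(dot_copy(node), node, parents)


def y_copy(node, parents):
    """The node's subtree plus its ancestors and their attributes."""
    return _wrap_in_ancestors(nabla_copy(node), node, parents)


root = ET.fromstring(
    '<doc xml:lang="en">'
    '<sec id="s1"><p id="p1">hi<em>x</em></p><p id="p2"/></sec>'
    '<sec id="s2"/>'
    '</doc>'
)
parents = parent_map(root)
p1 = root.find(".//p[@id='p1']")

y_tags = [el.tag for el in y_copy(p1, parents).iter()]
bang_tags = [el.tag for el in exclamation_copy(p1, parents).iter()]
```

With this sample document, y_tags lists doc, sec, p and em, while bang_tags stops at p; in neither copy do the sibling p or the second sec appear.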

I’ve been quite taken recently by one possible application of these ideas outside of the streaming XSLT work. In the current draft, assertions in XSD 1.1 are restricted to / are evaluated against a nabla-copy of the element or attribute being validated, and the conditions used for conditional type assignment are evaluated against a dot copy of the element. These restrictions are painful, especially the latter since it makes it impossible to select the type of an element depending on its xml:lang value (which is inherited from an ancestor if not specified locally). But XSD 1.1 could readily support conditions on the nearest value of xml:lang if conditions were evaluated on a Spanish-exclamation-point copy, instead of on a dot copy, of the element in question. I don’t know whether the XML Schema WG will buy this approach, but the possibility does seem to suggest that there is value in the idea of thinking about things in terms of invariants preserved by different kinds of node copying operations.
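
The xml:lang case can be made concrete with a small Python sketch. It models each kind of copy simply as the list of attribute sets it would expose, nearest node first; the sample document and the names are hypothetical, and nothing here reflects actual XSD 1.1 processing.

```python
import xml.etree.ElementTree as ET

# ElementTree's expanded name for the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def nearest_lang(attribute_sets):
    """Return the first xml:lang value found, walking outward from the node."""
    for attrs in attribute_sets:
        if XML_LANG in attrs:
            return attrs[XML_LANG]
    return None


root = ET.fromstring('<doc xml:lang="de"><sec><p>Hallo</p></sec></doc>')
parents = {child: parent for parent in root.iter() for child in parent}
p = root.find(".//p")

# Dot copy: only the node's own attributes are visible.
dot_view = [dict(p.attrib)]

# Spanish-exclamation-point copy: the node's attributes, then those of
# each ancestor in turn.
exclamation_view = [dict(p.attrib)]
node = p
while node in parents:
    node = parents[node]
    exclamation_view.append(dict(node.attrib))

lang_from_dot = nearest_lang(dot_view)
lang_from_exclamation = nearest_lang(exclamation_view)
```

Here lang_from_dot is None, because the p element carries no xml:lang of its own, while lang_from_exclamation is "de", inherited from the doc element.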