Archive for the ‘XML’ Category

Aquamacs, XEmacs, and psgml

Monday, April 26th, 2010

[26 April 2010]

The other day I thought perhaps it was time to try Aquamacs again, a nice, actively maintained port of FSF Emacs to Mac OS X. I’ve been using a copy of XEmacs I compiled myself years ago, with Andrew Choi’s Carbon XEmacs patches, but recently it has accumulated some problems I don’t have the patience to diagnose.

Got Lennart Staflin’s psgml package (a major mode for SGML and XML documents) installed and compiled; this is a prerequisite for any Emacs being habitable for me. (Why is package management such a dirty word in FSF Emacs, by the way?) And discovered that psgml takes about 50% longer in some tests (9 seconds in XEmacs vs. 14 in Aquamacs) to parse one of my larger documents in Aquamacs than in XEmacs. In other tests, it was 9 seconds vs. 90 seconds (or so; I kept getting bored and losing count between sixty and eighty seconds).

I may not be leaving XEmacs after all (undiagnosed problems or no undiagnosed problems).

One way to define the XPath data model

Tuesday, April 6th, 2010

[6 April 2010; addenda and copy editing 7-8 April 2010]

After discovering earlier this year that the definition of the XPath 1.0 data model falls short of the goal of guaranteeing the desired properties to all instances of the data model, I’ve been spending some time experimenting with alternative definitions, trying to see what must be specified a priori and what properties can be left to follow from others.

It’s no particular surprise that the data model can be defined in a variety of different ways. I’ve worked out three with a certain degree of precision. Here is one, which is not the usual way of defining things. For simplicity, it ignores attributes and namespace nodes; it’s easy enough to add them in once the foundations are a bit firmer.

Assume a non-empty finite set S and two binary relations R and Q on S, with the following properties [Some constraints are shown here as deleted: they were included in the first version of this list but later proved to be redundant; see below] :

  1. R is functional, acyclic, and injective (i.e. for any x and y, R(x) = R(y) implies x = y).
  2. There is exactly one member of S which is not in the domain of R (i.e. R(e) has no value), and exactly one which is not in the range of R (i.e. there is one element e such that for no element f do we have e = R(f)).
  3. Q is transitive and acyclic.
  4. The transitive reduction of Q is functional and injective.
  5. It will be observed that R essentially places the elements of S in a sequence without duplicates. For all elements e, f, g, h of S, if Q includes the pairs (e, f) and (g, h) and if g falls between e and f in the sequence defined by R (or, more formally, if the transitive closure of R contains the pairs (e, f), (e, g), and (g, f)), then h also falls between e and f in that sequence.
  6. The transitive closure of the inverse of R (i.e. R⁻¹*) contains Q as a subset.
  7. The single element of S which is not in the domain of R is also neither in the domain nor the range of Q.
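
The observation that R places the elements of S in a sequence can be made concrete. What follows is a minimal sketch in Python (a language the post itself does not use), covering only constraints 1 and 2 and leaving Q aside. It assumes S is given as a set and R as a set of pairs (x, y) read as R(x) = y; the function name sequence_from_r is hypothetical.

```python
def sequence_from_r(S, R):
    """Return the members of S in the order traced by R, or raise ValueError.

    S is a set of hashable items; R is a set of (x, y) pairs read as R(x) = y.
    Only constraints 1 and 2 of the list above are checked here.
    """
    domain = [x for x, _ in R]
    rng = [y for _, y in R]
    if not (set(domain) <= S and set(rng) <= S):
        raise ValueError("R mentions elements outside S")
    if len(set(domain)) != len(domain):
        raise ValueError("R is not functional")
    if len(set(rng)) != len(rng):
        raise ValueError("R is not injective")

    outside_domain = S - set(domain)  # elements e for which R(e) has no value
    outside_range = S - set(rng)      # elements that are R(f) for no f
    if len(outside_domain) != 1 or len(outside_range) != 1:
        raise ValueError("need exactly one element outside the domain "
                         "and exactly one outside the range")

    # Walk from the one element outside the range, following R.
    step = dict(R)
    current = next(iter(outside_range))
    order = [current]
    while current in step:
        current = step[current]
        order.append(current)

    # Anything not reached by the walk must lie on a cycle.
    if len(order) != len(S):
        raise ValueError("R is not acyclic")
    return order


print(sequence_from_r({"a", "b", "c"}, {("a", "b"), ("b", "c")}))
```

Whether R(x) should be read as the element after x or the element before it is not settled by this sketch; it only shows that the two constraints leave exactly one way to line up the members of S.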

It turns out that if we have any relations R and Q defined on some set S and satisfying the constraints above, then we have an instance of the XPath 1.0 data model. The nodes in the model instance, the axes defining their interrelations, and so on can all be defined in terms of S, R, and Q.

For the moment, I’ll leave the details as an exercise for the reader. (I also realize, as I’m about to click “Publish”, that I have not actually checked to see whether the set of constraints given above is minimal. I started with a short list and added constraints until S, R, and Q sufficed to determine a unique data model instance, but I have not checked to see whether any of the later additions rendered any of the earlier additions unnecessary. So points for any reader who identifies redundant constraints in the list given above.)

[When I did check for minimality, it turned out that several of the constraints included in the list above are redundant. The fact that relation R is functional and injective, for example, follows from the others shown. Actually it follows from a subset of them. The deletions above show one way of reducing the number of a priori constraints: they all follow from the others and can be dropped. None of the remaining items follows from the others; if any of them are deleted, the constraints no longer suffice to ensure the properties required by XPath.]

How formal can you get?

Tuesday, January 26th, 2010

[26 January 2010; updated a URI 28 Jan 2010]

In a recent comment on an earlier post here, David Carlisle wrote of proofs concerning properties of the XPath 1.0 data model

It’s not clear how formal any such proof could be, given that the XPath model and even more so, XML itself are defined so informally.

It’s true that the XPath spec defines the data model in prose and not in formulas. But on the whole it’s relatively formal and explicit prose, and I would expect it to be relatively easy to translate the prose into a formal notation.

As a test of that proposition, I recently improved a shining hour or four by translating section 5 of the XPath 1.0 spec into Alloy, the modeling tool developed by Daniel Jackson and others in the MIT Software Design Group. (I would describe Alloy briefly here, but I’ve written about Alloy often enough in this blog that I won’t describe it again now; regular readers of this blog will already know about it, and those who don’t can search the Web for information, or search this blog for what I’ve said about it before.)

The full Alloy model of XPath 1.0 is available from the Black Mesa Technologies web site. It will be of most interest, of course, for those familiar with Alloy, or wanting to learn more about it. But Alloy’s formalism is simple enough that anyone with a taste for formal methods will find it easy going, and the document paraphrases all the important things in English prose.

The net result, I think, is that there are (unsurprisingly) some places where the definition of the XPath 1.0 data model could or should be more explicit, and a few places where it seems a rule or two given explicitly is strictly speaking redundant, but by and large the definition is pretty clean and seems to mean what its creators meant it to mean. By and large, the definition of the data model is formulated without reliance on the XML spec (there are a few places where this separation could be cleaner), so that the informality of the XML spec (or what I prefer to think of as its programmatic promiscuity regarding models) does not in fact make it hard to formalize the XPath 1.0 data model.

In the current version, there are a couple of rough bits that I hope to sand down before I use this model in any other contexts. The ones I’m currently aware of are these:

  • The definition of the parent relation does not require that whenever parent(c,p) is true, either child(p,c) or attribute(p,c) is true. The XPath 1.0 spec doesn’t say this explicitly, but I think its authors probably felt that it would be pedantic to say something like that for a relation with a name like parent.
  • Similarly, the relations parent and ch are not guaranteed acyclic, though I think the original spec assumes that the names parent and child make clear that the relations should be acyclic. (But see the song “I’m my own Grandpa” and the Alloy models illustrating it, developed in Daniel Jackson’s Software Abstractions and shipped with Alloy in the samples directory.)
  • The predicate precedes, intended to model document order, is defined recursively; this is not legal in Alloy. So a conventional relation on nodes will need to be defined instead, with appropriate constraints. It is still not clear whether the rules given in the spec suffice to make the order total (at least in the absence of multiply occurring children); if they don’t, I’d like to propose an additional predicate which has that effect.
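
The first two rough bits can be illustrated outside Alloy. The following is a small Python sketch, not a rendering of the Alloy model itself: it assumes nodes are plain string labels, that parent maps each node to its parent, and that children and attributes map a node to lists of nodes. All three dictionary names and both function names are hypothetical.

```python
def parent_is_mirrored(parent, children, attributes):
    """True if parent(c, p) always comes with child(p, c) or attribute(p, c)."""
    return all(
        c in children.get(p, ()) or c in attributes.get(p, ())
        for c, p in parent.items()
    )


def parent_is_acyclic(parent):
    """True if no node can reach itself by following parent links."""
    for start in parent:
        seen = {start}
        node = start
        while node in parent:
            node = parent[node]
            if node in seen:
                return False
            seen.add(node)
    return True


# A well-behaved instance: the root has child A, and A has child B.
parent = {"A": "root", "B": "A"}
children = {"root": ["A"], "A": ["B"]}
print(parent_is_mirrored(parent, children, {}), parent_is_acyclic(parent))

# An instance in the spirit of the song: two nodes that are each other's parent.
grandpa = {"me": "dad", "dad": "me"}
print(parent_is_acyclic(grandpa))
```

Nothing in the first instance forces either property; they hold only because the data happens to respect them, which is the point of the two bullets.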

Two, four, three, who’s counting?

Monday, January 25th, 2010

[25 January 2010; correct botched formulation, 26 January]

The XPath 1.0 puzzle introduced and discussed in earlier posts continues to occupy some of my thoughts. Consider the document which can be written

  <a><b/><b/><b/></a>

The central question is this: given an instance of the XPath 1.0 data model corresponding to this document (or to any equivalent document), how many element nodes are present in the instance?

It turns out that not only are the answers four and two both consistent (as far as I can tell) with the definition of the XPath 1.0 data model, but that other answers are possible as well. Well, another answer.

Consider the document

<!DOCTYPE a [
<!ELEMENT a ANY>
<!ELEMENT b ANY>
<!ENTITY b '<b/>' >
]>
<a>&b;&b;<b/></a>

This is not the same serial-form XML document as the one at the top of the page, and the data-model equivalent is not necessarily the same, either, I think. But they both have a document element of type ‘a’, whose ordered list of children has length three, with each child being of type ‘b’.

If we take token as denoting some physical object, marks on paper or some other way of writing a character type (here, pixels on screens, magnetic fields on suitable media, or optical effects on other media), which I am told is its proper meaning as it was introduced by Peirce, then here there appear to me to be three string tokens matching the element production (one ‘a’ and two ‘b’s), not four and not two, and thus three element nodes in the data model, not four and not two.

I’m not really at all sure that the right way to define XML documents (and by extension XML elements) is as strings. But if we do wish to say that an element is a string (of some kind), there seem to be at least three different approaches we can take:

  1. a sequence of character types (i.e. a string-type)
  2. a sequence of character tokens (i.e. a string-token)
  3. an occurrence, within some context, of a sequence of character types

Some of the issues relating to these concepts are very well laid out in the article “Types and Tokens” written by Linda Wetzel for the Stanford Encyclopedia of Philosophy. It is interesting to know that SGML and XML accomplish simply, through entity references, what some philosophers have regarded as impossible: we have more occurrences of a thing than we have tokens for it. In the second document given above, it seems clear (at least to me, but I am no philosopher) that there are two string-tokens of type “<b/>” — one in the entity declaration and one in the document content — and three occurrences of that string-type in the document body. When entity expansion is performed by a machine, we are quite likely to be able to identify different tokens with the different occurrences of the string-type (as long as we are willing to entertain time-slices as part of our definition of string-token), but when I just look at the document and count the occurrences of the string-type, I don’t create new tokens in the physical world (unless you want to include mental acts as tokens, but I am pretty sure that that would be a bad idea).
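
The three counts for the string “<b/>” in the second document can be tallied mechanically. This is a rough Python sketch: it expands the entity by naive textual substitution rather than with an XML parser, which is an assumption that works only for this small example, and the variable names are hypothetical.

```python
import re

# The second document from the post, as a Python string.
SOURCE = """<!DOCTYPE a [
<!ELEMENT a ANY>
<!ELEMENT b ANY>
<!ENTITY b '<b/>' >
]>
<a>&b;&b;<b/></a>"""

ENTITY_TEXT = "<b/>"

# String-tokens: places where the characters <b/> are actually written down.
tokens_in_source = SOURCE.count(ENTITY_TEXT)

# Occurrences: count again after expanding &b; in the document body.
body = SOURCE.split("]>", 1)[1].strip()
expanded_body = body.replace("&b;", ENTITY_TEXT)
occurrences = re.findall(r"<b/>", expanded_body)
occurrence_count = len(occurrences)

# String-types: distinct character sequences among those occurrences.
type_count = len(set(occurrences))

print(tokens_in_source, occurrence_count, type_count)
```

The expansion step does create new tokens inside the running program, which matches the remark above about machine expansion; the count of two applies to the document as written.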

The good news is that there is plenty of philosophical light to throw on these issues, and the issues are well understood by philosophers. Also good news is that the XPath 1.0 data model appears to be consistent with all three possible views.

The bad news, I think, is that while the philosophers understand the issues quite well, they don’t agree on what to do about them. Also bad news is that the XPath 1.0 spec is consistent with all three views, even though it clearly wants to have a single answer to questions like “How many elements are in this document?”. Also, it seems clear that the XPath 1.0 spec really wants to view elements as occurrences (or, at least, the occurrence story produces the results the XPath 1.0 spec relies on its data model producing), but the notion of occurrence appears to be regarded by philosophers as easily the most problematic of the three concepts competing for our patronage.

Fortunately, to make the XPath 1.0 data model do what its creators wanted it to do, all that is necessary is to say explicitly that no node occurs more than once (there’s that word again!) in its parent’s ordered list of children. Or, equivalently, that the cardinality of the set of child nodes is the same as the length of the ordered list of children.
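
The cardinality formulation translates directly into a check. The sketch below is in Python and assumes a hypothetical Node class whose children attribute is an ordered list; it compares node identity, not node names, so three distinct ‘b’ nodes pass while one ‘b’ node listed three times does not.

```python
class Node:
    """A bare-bones element node: a name and an ordered list of children."""

    def __init__(self, name):
        self.name = name
        self.children = []


def children_are_distinct(node):
    """True if the set of child nodes is as large as the list of children."""
    return len({id(child) for child in node.children}) == len(node.children)


# The "four" reading: one 'a' node with three separate 'b' nodes.
a_four = Node("a")
a_four.children = [Node("b"), Node("b"), Node("b")]

# The "two" reading: one 'a' node whose list mentions the same 'b' node three times.
a_two = Node("a")
b = Node("b")
a_two.children = [b, b, b]

print(children_are_distinct(a_four), children_are_distinct(a_two))
```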

[In full XPath terms, we may also think of it as forbidding any node to appear among the result of evaluating the expression “preceding-sibling::* | following-sibling::*” with that node as the context item. But the data model does not provide a primitive definition of sibling-hood, so it can't conveniently be formulated that way in the data model.]

Tell me, Captain Yossarian, how many elements do you see?

Friday, January 22nd, 2010

[22 January 2010]

In an earlier post, I asked how many element nodes are present in the following XML document’s representation in the XPath 1.0 data model.

  <a><b/><b/><b/></a>

I think the spec clearly expects the answer “four” (a parent and three children). More than that, I think the spec reflects the belief of its authors and editors that that answer follows necessarily from the properties of the data model as defined in section 5 of the spec.

But I don’t think “four” is the only answer consistent with the data model as defined.

In particular, consider the answer “two”: one ‘a’ element node, and one ‘b’ element node (which for brevity I’ll just call A and B from here on out; note that A and B are element nodes, whose generic identifiers happen to be ‘a’ and ‘b’). As far as I can tell, the abstract object just defined obeys all the rules which define the XPath 1.0 model. The rules which seem to apply to this document are these:

  1. “The tree contains nodes.”

    Check.

  2. “Some types of nodes … have an expanded-name …”

    Check: here the names are “a” and “b”.

  3. “There is an ordering, document order, defined on all the nodes in the document …”

    Check: in the model instance I have in mind the nodes are ordered. (In fact they have a total order, which is more than the spec explicitly requires here.) The root node is first, A second, and B third.

  4. “… corresponding to the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities.”

    Check.

  5. “Thus, the root node will be the first node.”

    Check.

  6. “Element nodes occur before their children.”

    Check: The element node A occurs before its child B in the ordering.

  7. “Thus, document order orders element nodes in order of the occurrence of their start-tag in the XML (after expansion of entities).”

    Check: the start-tags for B begin at positions 4, 8, and 12 of the document’s serial form (counting from 1), and the start-tag for A begins at position 1. So the order of the start-tags in the XML matches the order of the nodes in the model.

    If we had several elements with multiple occurrences and thus multiple start-tags, and the positions of the start-tags were intermingled (as in <a> <b/> <c/> <b/> <c/> <b/> </a>), then it would appear that we had only a partial order on them. If the spec specified that document order was a total ordering over all nodes, we might have a problem. But it doesn’t actually say that; it just speaks of an “ordering”; it would seem strange to argue that a partial ordering is not an “ordering”.

  8. “Root nodes and element nodes have an ordered list of child nodes.”

    Check: the root node’s list of children is {1 → A}, A has the list {1 → B, 2 → B, 3 → B}, and B has the empty list {}.

  9. “Nodes never share children: if one node is not the same node as another node, then none of the children of the one node will be the same node as any of the children of another node.”

    Check: the sets {A}, {B}, and {} (representing the children, respectively, of the root node, A, and B) are pairwise disjoint.

  10. “Every node other than the root node has exactly one parent, which is either an element node or the root node.”

    Check. A has the root node as its parent, B has A as its parent.

  11. “A root node or an element node is the parent of each of its child nodes.”

    Check: the root’s only child is A, and A’s parent is the root. A’s only child is B, and B’s parent is A.

  12. “The descendants of a node are the children of the node and the descendants of the children of the node.”

    Check. The descendants of the root node are {A, B}, those of A are {B}, those of B are {}.
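
The checks for rules 8 through 12 can be replayed mechanically on the two-element instance. This is a minimal Python sketch under stated assumptions: nodes are represented by the labels "root", "A", and "B", the children and parent dictionaries encode the lists and parents given in the checks above, and the function names are hypothetical.

```python
# The instance described above: ordered child lists (rule 8) and parents (rule 10).
children = {"root": ["A"], "A": ["B", "B", "B"], "B": []}
parent = {"A": "root", "B": "A"}


def child_sets_disjoint(children):
    """Rule 9: distinct nodes never share a child."""
    nodes = list(children)
    return all(
        set(children[m]).isdisjoint(children[n])
        for i, m in enumerate(nodes)
        for n in nodes[i + 1:]
    )


def parents_consistent(children, parent, root="root"):
    """Rules 10 and 11: each non-root node has one parent, which lists it as a child."""
    non_root = set(children) - {root}
    one_parent_each = set(parent) == non_root
    parent_of_children = all(
        parent.get(c) == p for p, kids in children.items() for c in kids
    )
    return one_parent_each and parent_of_children


def descendants(node, children):
    """Rule 12: children of the node plus the descendants of those children."""
    found = set()
    for child in children[node]:
        found.add(child)
        found |= descendants(child, children)
    return found


print(child_sets_disjoint(children), parents_consistent(children, parent))
print(sorted(descendants("root", children)))
```

None of these checks looks at how often B appears in A's list, which is why the instance passes them.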

That’s it for the general rules; I think it’s clear that the construction we are describing satisfies them. The subsections of section 5 have some more specific rules, including one that is relevant here.

  1. “There is an element node for every element in the document.” (Sec. 5.2.)

    This rule was cited by John Cowan in his answer to the earlier post; it seems to me it can be taken in either of two ways.

    First, we can take it (as John did, and as I did the first time through this analysis) as saying that for each element node in an instance of the data model, there is an element in the corresponding serial-form XML document, and conversely (I read it as claiming a one to one correspondence) for every element in the serial-form document, there is an element node in the data model instance.

    In this case, the rule seems to me to have two problems. The first problem is that the rule assumes a mapping between XML serial-form documents and data model instances and further assumes (if we take the word “the” and the use of the singular seriously; we are, after all, dealing with a formal specification written by a group of gifted spec-writers, and edited by two of the best in the business) that the mapping from data model instance to serial-form document is a function. But how can it be a function, given that the data model does not model insignificant white space? There are infinitely many serial-form XML documents equivalent to any given data model instance. Serialization will not be a function unless additional rules are specified. And in any case, when we set out to define a formal data model as is done in the XPath 1.0 spec, I think the usual idea is that we should define the data model in such a way as to make it possible to prove that every data-model instance corresponds to a class of XML documents treated as equivalent for XPath purposes, and that every XML document corresponds to a data model instance. If the rule really does appeal to the number of elements in the serial-form XML document, then it’s assuming, rather than establishing, the correspondence. It’s hard to believe that either Steve DeRose or James Clark could make that mistake.

    The second problem, on this reading of the rule, is that it’s hard to say whether a given data model instance obeys the rule, because it’s not clear that XML gives a determinate answer for the question.

    Some argue that XML documents are, by definition, strings that match the relevant production of the XML spec (on this see my post of 5 March 2008); by the same logic we can infer that an element is a string matching the element production.

    [Note: For what it's worth I don't think the XML spec explicitly identifies either documents or elements with strings; the argument that XML documents and elements are strings rests on the claim that they can't be anything else, or that making them anything else would make the spec inconsistent. As I noted in my blog post of 2008, there is at least one passage which seems to assume that documents are strings (it speaks of a document matching the document production), but I believe that passage is just a case of bad drafting.]

    If for discussion’s sake we accept this argument, then it seems we must ask ourselves: is the string consisting of the four characters U+003C, U+0062, U+002F, U+003E, in order, one string or three strings?

    The answer, as students of philosophy will have been shouting out at home for some moments now, is “yes”. If by character you mean ‘character type’, then one string (or string type). If on the other hand you mean ‘character token’, then for the document shown above, I think pretty clearly three strings (string tokens).

    So, on this first reading of the rule, check. Two distinct elements in the XML (counting string types), two in the data model instance. (To show that this rule excludes the model instance we’re discussing, it would be necessary to show that the serialized XML document has four elements, and that counting only two elements is inconsistent with the XML spec. Given how coy the XML spec is on the nature of XML documents, I don’t believe such a showing possible.)

    The second reading of the rule is that “document” does not mean, in this sentence, something different from the data model instance, but is just a way of referring to the entirety of the data model instance itself. A quick glance at the usage of the word “document” in the XPath 1.0 spec suggests that that is in fact its most common usage. In recent years, influenced perhaps by the work on the XPath 2.0 data model, with formalists of the caliber of Mary Fernández and Phil Wadler, many people have begun to think it natural to define an abstract model independently of the XML spec, and then (as I suggested above) establish in separate steps that there is a correspondence between the set of all XML documents viewed as character sequences and the set of all instances of the data model.

    The XPath 1.0 spec seems to take a slightly different tack, rhetorically. The definition of the data model begins

    XPath operates on an XML document as a tree. This section describes how XPath models an XML document as a tree.

    I take this as a suggestion that the data model instance operated on by XPath 1.0 can be thought of not as a thing separate from the XML document (whatever that is) but as a particular way of looking at and thinking about the XML document. I think it’s true that there was (historically speaking) no consensus among the XML community at that time that the term XML document referred to a string, as opposed to a tree. I think the idea would have met fierce resistance.

    On this reading, the rule quoted above is either a vacuous statement, or a statement about usage, establishing the correspondence (or equivalence) between the terms element and element node.

    So, on this second reading, check. Two elements, two element nodes. Same thing, really, element node and element.

As I say, I think it’s quite clear which answer the XPath 1.0 spec intends the question to have: plenty of places in the spec clearly rely on element nodes never having themselves as siblings, just as plenty of places rely on element nodes never having more than one parent. Both properties are a common-sensical interpretation of the element structure of XML. I believe the point of defining the data model explicitly is to eliminate, as far as possible, the need to appeal to common sense and “what everyone knows”, to get the required postulates down on paper so that any implementation which respects those postulates and obeys the constraints will conform and inter-operate. For the parent relation, the definition of the model makes the common-sense interpretation of XML explicit. But not (as far as I can see) for the sibling relation.

Perhaps the creators of the XPath 1.0 spec felt that no explicit statement about no elements being their own siblings was necessary, because it followed from other properties of the model as specified. If so, I think either I must have missed something, or (less likely, but still possible) they did. If the property is to hold for all instances of the model, and if it does not follow from the current definition of the model, then perhaps it needs to be stated explicitly as part of the definition of the model.

[When he reached the end of this post, my evil twin Enrique turned to me and asked “Who's Yossarian? Was he a member of the XSL Working Group?” “No, he was a character in Joseph Heller's novel Catch 22. The title of the post is a reference to an elaborate bit in chapter 18 of the novel.” “And by ‘elaborate,’” mused Enrique, “you mean —” “Exactly: that it's too long to quote here and still claim fair use. Besides, this isn't a commentary on Catch 22. Just search the Web (or the book) for the phrase ‘I see everything twice.’”]

An XPath 1.0 puzzle

Wednesday, January 20th, 2010

[20 January 2010]

Consider the XML document shown below, and in particular consider its representation in the XPath 1.0 data model.

  <a><b/><b/><b/></a>

How many element nodes are there in this document, regarded as an instance of the XPath 1.0 data model? I think it’s clear that, for purposes of XPath 1.0, the expected answer is four: one of type ‘a’ and three of type ‘b’, all children of the ‘a’ element.

I am finding it unexpectedly difficult to prove that conclusion formally on the basis of the definition of the data model given in the spec. I wonder if anyone else will have better luck.
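
For comparison, here is what one widely used parser builds for this document. The sketch uses Python's standard xml.etree.ElementTree module; its tree is not the XPath 1.0 data model (there is no separate root node, for one thing), so this shows only the expected answer in practice and proves nothing about the spec.

```python
import xml.etree.ElementTree as ET

# Parse the document from the post and walk every element in document order.
document_element = ET.fromstring("<a><b/><b/><b/></a>")
elements = list(document_element.iter())

# Count distinct element objects by identity, not by name.
distinct_count = len({id(element) for element in elements})

print([element.tag for element in elements], distinct_count)
```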

XML as a sort-of open format

Thursday, December 3rd, 2009

[3 December 2009]

I just encountered the following statements in technical documentation for a family of products which I’ll leave nameless.

This document does not describe the complete XML schema for either [Application 1] or [Application 2]. The complete XML schema for both applications is not available and will not be made public.

Perhaps there can be good reasons for such a situation. Perhaps the developers really don’t know how to use any existing schema language to describe the set of documents they actually accept; perhaps only a Turing machine can actually identify the set of documents accepted, and the developers were unwilling to work with a simpler set whose membership could be more cheaply decided. (Well, wait, those may be reasons, but they don’t actually qualify as “good”.)

I wonder whether this is an insidious attempt to look like the products have an open format (See? it’s XML! How much more open can you get?) while ensuring that the commercial products in question remain the only arbiters of acceptable documents? Or whether the programmers in question were just too lazy to specify a clean vocabulary and ensure that their software handles all documents which meet some standard of validity that does not require Turing completeness?

Having a partially defined XML format is, at least for me, still a great deal more convenient than having the format be binary and completely undocumented. But it certainly seems to fall a long distance short of what XML could make possible.

Changing stylesheets in midstream

Monday, October 19th, 2009

[19 October 2009]

My evil twin Enrique came by the other day in a great state of excitement. There’s been a bit of a kerfuffle in some W3C working groups lately, he told me. As some readers will know, the W3C recently unveiled a new design for their web site. (Many people seem to want to call this a site redesign, but as far as I know most of the site was originally developed by individuals and working groups working autonomously, and outside the front page, the Tech Reports page, and the other pages maintained by the Communications Team, the site never had a consistent design to begin with. Surely it’s only a redesign if there was a design there in the first place?)

Almost all the comments on the new design appear to be positive — at least, they were until some spec editors and working group chairs noticed that the site redesign had included reformatted versions of their working groups’ current Recommendations, which the working groups had not looked at before and which proved, when examined, to be sub-optimal in some ways.

“Sub-optimal is putting it mildly,” laughed Enrique. “Some of the specs looked like night soil on toast. And some of the editors were fit to be tied.” Enough pain was expressed over the new look of the old specs, apparently, that after a couple of days the standard URLs for existing Recommendations were all reset, and no longer point to the reformatted versions. (The reformatted versions are still around — no one at W3C ever deletes anything, it’s a point of some pride — though you have to know what URIs to point to.)

One of the most visible problems is that in some specs, extra space was appearing before and after large numbers of hyperlinked special terms. “You know what it was?” chortled Enrique. “Some bright young thing at some bright young design agency seems to have thought a 20px padding would be a good idea for the CODE element. Do these people not know any HTML? Here, look at the stylesheet!” He pulled out a hand-held and showed me a rule from one of the new stylesheets (reformatted here for legibility):

h1, h2, h3, h4, h5, h6, ul, ol, dl, p, pre, blockquote, code {
  padding:20px 20px 0 20px;
}

He was cackling with malice now. “The stylesheet author seems to have thought that code was not for inline material but for indented blocks. Where do they get these people? And giving measurements in pixels is so dead-tree-oriented!”

“Now, now,” I said. “I’m sure you were a bright young thing once yourself.”

“Not me,” he returned brightly. “I was fifty-two the day I was born, and I’ve always been dumb as a post.”

“Two, actually. Odd, though,” I said. “When I retrieve the reformatted versions of the XML and XSLT 2.0 specs, I don’t see extra white space around code elements.” I retrieved the stylesheet with the bogus padding values for code; the rule now read

h1, h2, h3, h4, h5, h6, ul, ol, dl, p, pre, blockquote {
    padding:20px 20px 0 20px;
}

“Those bastards!” Enrique cried. “You mean they’ve fixed it? I was going to charge them big bucks to tell them what was wrong!” And he stomped off again in spluttering disappointment. I haven’t seen him since, but I’m not worried; he’ll get over it.

[The new W3C site is the result of a long design history, and really does appear to be an improvement, for the most part. It makes it much easier than the old site to find your way around (or so I believe — I knew the old site structure well enough that the new one just confuses me; I assume that will pass). The new look intended for W3C technical reports (i.e. Recommendations, Notes, Working Drafts, etc.) can be inspected on the beta site's Tech Reports page, or the beta site's version of the new Standards page. I haven't yet decided whether I think the new tech report styling is an improvement or not, and if it is, whether it's enough of an improvement to justify the disruption of restyling the entire body of existing Recommendations. I'll be interested in readers' reactions.

One thing is unsurprising: if you launch a new stylesheet on technical material whose authors and editors pride themselves on precision, you would do well not to make it public until they have confirmed that it is OK. And it would be smart, before you let them see it at all, and certainly before you make it public, to make sure the new stylesheet doesn't introduce highly visible problems like 20 pixels of extra white space around every code element.

Live and learn.]

Looking for open source XML software?

Wednesday, September 30th, 2009

[30 September 2009]

Last week I participated in the XML Summer School organized by Eleven Informatics at St. Edmund Hall in Oxford. I hope the participants enjoyed it as much as the speakers did. The weather certainly cooperated, although it felt more autumnal than summery by the end of the week.

One of my responsibilities during the week was to give a survey of open-source software for XML applications; this turns out to be harder than it might look because there are so many, with such varying degrees of polish, reliability, and completeness. There are several lists of XML software, and open-source software, and open-source XML software (general, or in some specific categories) on the Web, but many of them appear not to have been maintained or updated in several years. (Honorable exceptions include the lists maintained by Ron Bourret on databases and XML, Lars Marius Garshol on XML tools and Topic-Map tools, and Tony Graham on XSLT testing tools.) So the lists I made, arbitrary and capricious though some aspects of them are, may be helpful.

Eventually I plan to turn the information gathered into a more convenient form, and set up some infrastructure to make it easier to maintain, but in the meantime the slides I prepared for the session may be helpful; they provide a coarsely categorized and tersely annotated list of some open-source XML software that readers of this klog may find interesting.

Fear, uncertainty, and XML 1.0 Fifth Edition

Thursday, June 11th, 2009

[11 June 2009]

From time to time people tell me that the transition from XML 1.0 Fourth Edition to XML 1.0 Fifth Edition is hard, just as from time to time people have said the transition from XML 1.0 to XML 1.1 would be hard, and might break systems that consume XML data. I just spoke with a friend who told me their company was having internal discussions about what to do about XML 1.0 Fifth Edition, because some of their customers had expressed “concern” (possibly “deep concern”).

I’ve never understood what is hard about either transition; perhaps if I ask here someone can explain it to me.

There are two classes of software to consider: (a) software which checks that a string is a legal XML name, and (b) software which just consumes valid or well-formed XML, without doing its own checking.

Software that actively checks XML names

Obviously, if you are going to upgrade your XML processors from 1.0 Fourth Edition to 1.0 Fifth Edition (or to XML 1.1), you are going to need to change them. No one has ever argued seriously that that’s hard (not even Noah Mendelsohn). Anyone who has written a parser for names can tell you that the new definition of Name is simpler; the only serious likelihood is that a programmer comparing the complexity of the old definition with the relative simplicity of the new may be mildly depressed that the complexity was ever needed (long story, let’s not go there), and will lose a moment or two sighing deeply.

Software that isn’t an XML parser but which has decided for reasons of its own to use XML’s definition of Name may or may not also need to change. Since it’s not an XML parser, it has no obligation to follow the XML spec in the first place. But if you want to change it to keep it in synch with XML, the change is simple, just as it is for an XML parser.
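To give a sense of how small the new definition is, here is a minimal sketch of a Fifth Edition name checker in Python. The ranges are my transcription of the NameStartChar and NameChar productions and should be checked against the Recommendation before anyone relies on them; the function names are made up for this illustration.

```python
# Sketch of the XML 1.0 Fifth Edition Name rule as two range tables.
# The ranges are transcribed from the NameStartChar and NameChar
# productions; verify them against the spec before relying on them.

NAME_START_RANGES = [
    (0x3A, 0x3A),                    # ":"
    (0x41, 0x5A),                    # A-Z
    (0x5F, 0x5F),                    # "_"
    (0x61, 0x7A),                    # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

# Characters allowed after the first position, in addition to the above.
NAME_EXTRA_RANGES = [
    (0x2D, 0x2E),                    # "-" and "."
    (0x30, 0x39),                    # 0-9
    (0xB7, 0xB7),                    # middle dot
    (0x300, 0x36F),                  # combining diacritical marks
    (0x203F, 0x2040),
]


def _in_ranges(ch, ranges):
    """True if the code point of ch falls in any (low, high) pair."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def is_name_start_char(ch):
    return _in_ranges(ch, NAME_START_RANGES)


def is_name_char(ch):
    return is_name_start_char(ch) or _in_ranges(ch, NAME_EXTRA_RANGES)


def is_name_5e(s):
    """True if s is non-empty, starts with a name-start character,
    and continues with name characters only."""
    return (bool(s)
            and is_name_start_char(s[0])
            and all(is_name_char(c) for c in s[1:]))
```

The whole rule fits in two short tables and three one-line predicates; the Fourth Edition equivalent needs the long BaseChar, Ideographic, CombiningChar, Digit, and Extender lists from the spec's appendix.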

Software that doesn’t check XML names but assumes they are OK

Noah Mendelsohn (my esteemed colleague in the W3C XML Schema working group) was eloquent, in presentations I heard him give, about the danger that an XML 1.1 processor would let data through that an XML 1.0 processor would not have let through, and that that new data might break other software which had been relying on the XML processor upstream for sanity-checking its input data. Such reliance is not at all a bad thing; one point of using XML is precisely that valid or well-formed XML is much more predictable than arbitrary octet sequences.

Software of this kind, which doesn’t itself check its input data, can in principle break when presented with data it’s not prepared for. So the prospect that XML 1.1 (or XML 1.0 Fifth Edition) might break such software naturally scares people a lot. Noah (and possibly others) successfully scared enough people that many people shied away from XML 1.1. Now purveyors of fear, uncertainty, and doubt are trying to scare people away from XML 1.0 Fifth Edition.

But what they are spreading was FUD when they were talking about XML 1.1, and it’s FUD now. It’s not logically impossible for software to exist which works fine when presented with XML 1.0 Fourth Edition input, and which will break when presented with Fifth Edition input. But such software would be unusual in the extreme, even eccentric. No one has ever actually identified such software to me. I’ve been asking, every time this comes up, for the last five or six years.

It’s not at all clear that such software could be constructed by any programmer of ordinary competence. To try to prevent the use of minority scripts in XML names for the sake of avoiding the hypothetical risk of breaking hypothetical software which (if it existed) would be a textbook case of poor design and poor implementation, is just insane.

Let us imagine the existence of such a piece of software; let’s call this software N.

We know very little about N, only that N has no problem with any XML 1.0 name, but will break when confronted with at least some XML 1.1 names that are not also XML 1.0 names. So, let’s see: that means that N is perfectly happy to consume a name containing Tibetan characters, but N might break in an ugly way when confronted with Hittite. Or perhaps N is perfectly happy with the Tibetan characters U+0F47 and U+0F49, which are legal in XML 1.0 4E names, but N will break if confronted with the character U+0F48, which lies between them.

How can this be? By hypothesis, N is not running its own name checker that implements the 1.0 4E rules (if it is, then N belongs in class (a) above, and when confronted with U+0F48 N does not break but issues an error message). What can it possibly be doing with data that comes in marked as a Name, that causes it to handle U+0F47 and U+0F49, but not U+0F48?
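The three code points in question can be lined up side by side. The sketch below uses only excerpts of the two definitions: the Tibetan letter ranges as I read them in the Fourth Edition BaseChar production, and the single Fifth Edition NameStartChar range that covers the same block. It is not a full implementation of either edition, and the names are invented for the illustration.

```python
# Excerpt only: Tibetan letters in the Fourth Edition BaseChar production.
# Note the gap at U+0F48. The full 4E definition has many more ranges.
TIBETAN_LETTERS_4E = [(0x0F40, 0x0F47), (0x0F49, 0x0F69)]

# Excerpt only: the one Fifth Edition NameStartChar range covering this block.
COVERING_RANGE_5E = (0x037F, 0x1FFF)


def ok_in_4e_excerpt(cp):
    """True if cp falls in one of the 4E Tibetan letter ranges above."""
    return any(low <= cp <= high for low, high in TIBETAN_LETTERS_4E)


def ok_in_5e_excerpt(cp):
    """True if cp falls in the 5E range above."""
    low, high = COVERING_RANGE_5E
    return low <= cp <= high


def compare(code_points):
    """Return (label, allowed in 4E excerpt, allowed in 5E excerpt) triples."""
    return [(f"U+{cp:04X}", ok_in_4e_excerpt(cp), ok_in_5e_excerpt(cp))
            for cp in code_points]


for label, in_4e, in_5e in compare([0x0F47, 0x0F48, 0x0F49]):
    print(label, "4E:", in_4e, "5E:", in_5e)
```

For N to break on U+0F48 alone, something in N would have to behave like the first table without actually being a name checker.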

As far as I can tell, by far the most common thing to do when ingesting something marked as an XML name is to copy it into a variable typed to accept Unicode strings. Use of this variable may well exploit the fact that it won’t contain blanks. But I haven’t seen much code that is written to exploit the fact that a Unicode string does or does not contain any occurrences of U+0F48, or of characters in various minority writing systems. Maybe I’m just young and ignorant; it’s only thirty years since I started programming, and I’ve mostly worked in fairly restricted areas (text processing, markup, character set problems, that kind of thing), so there’s a lot I don’t know.
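A minimal sketch of the kind of downstream code I have in mind, in Python: it stores names as Unicode strings and leans on the absence of blanks to pack several names into one string. The functions are hypothetical, written for this illustration and not taken from any real system.

```python
def count_names(names):
    """Use each name as a dictionary key; any Unicode string will do."""
    counts = {}
    for name in names:
        counts[name] = counts.get(name, 0) + 1
    return counts


def pack_names(names):
    """Join names with single spaces. This relies on the upstream XML
    processor's guarantee that no name contains a space."""
    return " ".join(names)


def unpack_names(packed):
    """Inverse of pack_names, under the same no-spaces assumption."""
    return packed.split(" ")


# One ASCII name, one name legal in 4E (U+0F47), one legal only in 5E (U+0F48).
names = ["para", "\u0f47", "\u0f48", "para"]
assert unpack_names(pack_names(names)) == names
print(count_names(names)["para"])
```

Nothing here looks inside the string beyond the space character, so the code has no way to tell U+0F48 from its neighbors, let alone fail on it.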

So, please, if anyone can enlighten me, do. What rational programmer of even modest competence (or, for that matter, what programmer completely lacking in competence) will write code that (a) is not an implementation of the name rules of XML 1.0 4E, that (b) accepts all names defined according to the rules of XML 1.0 4E, and that (c) will die when confronted with some name which is legal by the rules of 1.0 5E?

In earlier discussions, Michael Kay tried to suggest why a program might fail on 1.0 5E names, but all the plausible examples of such a program involve the program assuming that the characters are all ASCII, or all ISO 8859-1, or all in some other historical character set. Such programs will certainly fail when confronted with 1.0 5E names. But they will also fail when confronted with XML 1.0 4E names, so they don’t satisfy condition (b) in the list.
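The point is easy to check with a sketch of such a program. The consumer below is hypothetical: it assumes every name is pure ASCII, which is the sort of assumption the plausible examples all turn on.

```python
def store_as_ascii(name):
    """Hypothetical legacy consumer: assumes every name is pure ASCII."""
    return name.encode("ascii")


def survives(name):
    """True if the legacy consumer handles the name without an error."""
    try:
        store_as_ascii(name)
        return True
    except UnicodeEncodeError:
        return False


print(survives("para"))     # ASCII name, legal in both editions
print(survives("\u0f47"))   # Tibetan letter, legal in 4E and 5E names
print(survives("\u0f48"))   # legal in 5E names only
```

The consumer rejects the Fourth Edition name U+0F47 in exactly the way it rejects U+0F48, so it was already broken for legal 4E input and fails condition (b).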

In order to have properties (a), (b), and (c), software would have to be seriously pathological in design and coding. And I don’t mean that in a good sense.

I conclude: insofar as the resistance to XML 1.1 and to XML 1.0 Fifth Edition is based on fear that the shift will break deployed software, it’s irrational and based on a complete misunderstanding of the detailed technical issues involved. Those who are spreading this FUD are doing neither themselves, nor their companies, nor the community, a service.