Are XML documents character sequences?

[5 March 2008]

Henry Thompson has raised an issue against the SML spec, asking why SML defines the term document as

A well-formed XML document, as defined in [XML].

Henry suggests:

‘document’ as defined in the XML spec. is a character sequence — this seems unnecessarily restrictive — surely a reference to the Infoset spec. would be more appropriate here?

At this point, a voice which I identify as that of my evil twin Skippy sounds in my head, to wonder whether it really makes a difference or not.

I disagree with the premise that the XML spec defines “an XML document” solely as a character sequence. What I remember from the XML spec about the fundamental nature of XML documents is mostly energetic hand-waving.

Hmm. What is the truth of the matter?

I think Skippy has a point here, at least about the hand-waving. Both the first edition of the XML spec (the one with which I am probably most familiar) and the current (fourth) edition say “A textual object is a well-formed XML document if” certain conditions are met. Surely the phrase “textual object” begs the question.

Both versions of the spec also hyperlink the words “XML document” to what seems to be a definition of sorts:

A data object is an XML document if it is well-formed, as defined in this specification. A well-formed XML document may in addition be valid if it meets certain further constraints.

The phrase “data object” may be ever so slightly less suspect than “textual object”, but I’m feeling quite a breeze here.

Certainly this is consistent with my recollection that I actively resisted Dan Connolly’s repeated request to nail down, once and for all, what counted as an “XML document”. Most SGML users, I think, were inclined to regard the terms document and SGML document as denoting not just sequences of characters, but also any data structures with recognizable elements, attributes, parents, siblings, and children, and so forth that one might choose to represent the same information in. And I certainly planned to do the same for the term XML document. If anyone had told me that a DOM or SAX interface does not expose an XML document but something which must be religiously distinguished from XML documents, I would have said “Bullshit. The document production defines the serialization of a document, not the document.”
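To make the distinction between a document and its serialization concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. It is not taken from the XML spec or from SML; the two sample strings and the helper `shape` are made up for illustration. The two strings differ as character sequences, yet a parser hands back the same elements, attributes, and children for both.

```python
import xml.etree.ElementTree as ET

# Two different character sequences (hypothetical sample data).
# They differ only in quoting style and in whitespace inside the tags.
serialization_a = '<memo date="2008-03-05"><to>Skippy</to></memo>'
serialization_b = "<memo  date='2008-03-05' ><to >Skippy</to ></memo >"


def shape(element):
    """Reduce a parsed element to a plain, comparable structure.

    Illustrative helper: records the tag, attributes, leading text,
    and children of an element, recursively.
    """
    return (
        element.tag,
        dict(element.attrib),
        element.text or "",
        [shape(child) for child in element],
    )


tree_a = shape(ET.fromstring(serialization_a))
tree_b = shape(ET.fromstring(serialization_b))

# Compared as character sequences, the two inputs are different...
same_characters = serialization_a == serialization_b

# ...but compared as parsed structures, they are the same.
same_structure = tree_a == tree_b

print("same characters:", same_characters)
print("same structure: ", same_structure)
```

Whether one says these are two XML documents or one document written down twice is exactly the question at issue; the sketch only shows that the two readings give different answers.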

Some people might say, given the level of abstraction implicit in the XML WG’s usage of the term document (or at least in mine), that we regarded the term document as denoting any instance of the data model. (Of course, others will immediately chime in to say that SGML and XML don’t have data models. But as I’ve said elsewhere, at this point in the discussion I don’t have any idea what the second group means by the words “have a data model”, so I can’t say one way or another.) Speaking for myself, I think I would have been horrified by the suggestion that once one had read a document into memory and begun working with it, one was no longer working with an XML document.

But nevertheless, the XML spec may have nailed things down in just the way I thought I had successfully resisted doing. The conditions imposed on a data object, in order for it to be an XML document, include that it be well-formed, which in turn entails that

Taken as a whole, it matches the production labeled document.

It’s embarrassing for those who like to think of XML documents in a slightly more abstract way, but it’s not clear to me how anything but a character sequence can match the ‘document’ production.

Sorry, Skippy, but while I think Henry’s note over-simplified things, he is probably more nearly right than you.

It probably won’t change by one whit the way I use the term XML document. But it’s nice to learn some new things now and then, even if what you learn is that you failed at something you attempted, ten years ago, and that as a consequence, you’ve been wrong about a related question ever since.

2 thoughts on “Are XML documents character sequences?”

  1. I had a little dispute with Dan on a closely related point, IIRC. He was claiming that XML documents are like integers: all of them exist. I held on the contrary that the number of documents was inherently finite, and that a document exists only if someone has assembled the particular character sequence in some fixed medium: paper, disk, flash memory, whatever; and that the document ceases to exist when the sequence of characters is erased or dismembered. All this by analogy to physical documents.

  2. Pingback: Messages in a Bottle
