Choosing schema-validation roots

[18 February 2008, Prague]

A colleague asks:

naive XML schema question — How does a validating parser know which xs:element is supposed to be the root/document element? I don’t see anything in the schema that tells it.

I’m not getting any love from google or the schema Recs. (I’ve looked at every use of the word “root” in the Recs, with no clues.)

I hate it when smart people who are willing to put in some work to understand things can’t find the answer to their questions in the schema spec. So first of all, I’m sorry. I apologize on behalf of the spec on which I’ve now spent a large proportion of my working life. (I wish I thought I could do something about it, but the XML Schema WG has been appallingly reluctant to fix the incomprehensibility problems of the spec. I think the 1.1 spec is marginally better than 1.0 in some ways, but only marginally and only in some ways. If you hated the 1.0 spec, you may find you hate 1.1 ever so slightly less, but it’s unlikely to charm you into liking it.)

But this question does come up a lot. And if the WG won’t explain it clearly in the spec, then at least I can try to explain it clearly here.

The choice of validation root is not specified by XSD. Formally it’s regarded as out of scope; in practice, the expectation is that processors will either provide a useful method of choosing where to start validation and users will specify the validation root at invocation time, or that processors will provide a useful default choice (e.g. the document root), or that in some cases processors will provide a fixed choice (e.g. the document root). In the latter case the user can be said to have chosen to start validation at that fixed point by choosing to use that particular validator. That may sound Orwellian, but in principle, at least, the rule is that if you don’t like the level of control given you by a given tool, then why are you using that tool? File a bug report, or an enhancement request, or get another tool. Or both.

The closest the XSD spec comes to talking about this is in section 5.2 (“Assessing Schema-validity”). Personally, I find the discussion in XSD 1.1 marginally clearer than the discussion in 1.0, but I may be exhibiting my bias in that.

My colleague continues:

Preliminary experiments suggest that at least in a normal schema, you can, in fact, just give a fragment of a document and have the document be considered schema valid. So “<br/>” is a schema-valid HTML document? Very odd.

Well, no and yes. “<br/>” is schema-valid against the HTML schema, if schema-validity assessment starts with that element and any of (a) the corresponding element declaration, (b) the relevant type definition, or (c) the instruction to start in lax or strict wildcard mode and look for an applicable definition. And if that element happens to be the document root, then yes, it’s a document valid against the XHTML schema.

Since the default setting for many XSD validators is to start at the document root in lax-wildcard mode, they accept your sample document as valid.
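The three ways of starting assessment can be sketched in a few lines. This is a toy illustration, not a real validator: the one-level “schema” and the names in it are invented, and a real processor would of course go on to validate content and attributes. But it shows why lax mode quietly accepts an undeclared element while strict mode rejects it.

```python
# A toy sketch (not a real validator) of how schema-validity assessment
# can start: with a caller-supplied declaration, or by looking one up
# in 'strict' or 'lax' wildcard mode.  The declarations are invented.

# A pretend schema: a map from element names to declarations.
declarations = {
    "html": "declaration-for-html",
    "br": "declaration-for-br",
}

def assess(element_name, mode="lax", start_decl=None):
    """Return 'valid', 'invalid', or 'notKnown' for the start element."""
    if start_decl is not None:
        # (a) the caller names the declaration to start with
        decl = start_decl
    elif element_name in declarations:
        # a matching declaration was found by name
        decl = declarations[element_name]
    elif mode == "strict":
        # strict wildcard: a missing declaration is an error
        return "invalid"
    else:
        # lax wildcard: no declaration found, so no verdict
        return "notKnown"
    # A real processor would now validate content against decl;
    # here every element with a declaration counts as valid.
    return "valid"

print(assess("br"))                  # 'valid': br is declared
print(assess("unknown"))             # 'notKnown': lax mode skips it
print(assess("unknown", "strict"))   # 'invalid': strict demands a declaration
```

Starting at `br` in lax mode finds the `br` declaration and reports validity, exactly as the XSD validators described above do with the `<br/>` document.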

An analogous result could be achieved using a DTD, by writing

<!DOCTYPE br PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<br/>

I think that those who run an XML validator over that document will find that it is valid against the DTD.

The document type definition at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd has no formal specification that any particular element must be the root element; the constraint on the generic identifier of the root element is specified as part of the document type declaration, the “<!DOCTYPE” part. Analogously, the XSD schema for XHTML doesn’t have any formal specification of any required root element, or required starting declaration; both get specified at validation time. Both when using DTDs and when using XSD, this allows you to validate one part of a document at a time. If you’re editing a large document and are storing different parts of it in different files, it’s convenient to be able to validate each part independently.

Another analogy is with the formal definition of a grammar: the set of productions that most of us think of as a grammar does not specify the start symbol. The start symbol is specified in a different part of the tuple that is, for formal-language purposes, the grammar. To describe schemas in these terms: the schema, or the collection of element and other declarations in a DTD file, does not define a full document grammar, but a set of productions for a document grammar. The start symbol is specified separately, in a doctype declaration for DTDs, and at validator invocation time for XSD schemas.
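The analogy can be made concrete with a toy recognizer. One fixed set of productions, with the start symbol supplied separately at invocation time, recognizes different languages; the grammar and element-like names here are invented for illustration.

```python
# The same set of productions recognizes different languages depending
# on which symbol you designate as the start symbol -- just as the same
# schema validates different documents depending on where validation
# starts.  The grammar and the recognizer are toys for illustration.

productions = {
    "Doc":  [["Head", "Body"]],
    "Head": [["title"]],
    "Body": [["para"], ["para", "Body"]],
}

def derives(symbol, tokens):
    """True if `symbol` derives exactly the given token sequence."""
    if symbol not in productions:          # terminal symbol
        return len(tokens) == 1 and tokens[0] == symbol
    return any(matches(rhs, tokens) for rhs in productions[symbol])

def matches(rhs, tokens):
    """True if the sequence of symbols in rhs derives tokens."""
    if not rhs:
        return not tokens
    first, rest = rhs[0], rhs[1:]
    # try every split point for the first symbol
    return any(derives(first, tokens[:i]) and matches(rest, tokens[i:])
               for i in range(len(tokens) + 1))

# One production set, two different "grammars":
print(derives("Doc",  ["title", "para", "para"]))  # True: full document
print(derives("Body", ["para", "para"]))           # True: a valid fragment
print(derives("Doc",  ["para", "para"]))           # False: wrong start symbol
```

Choosing `Body` as the start symbol is the grammar-theoretic counterpart of telling an XSD validator to begin assessment at a subtree rather than at the document root.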

The rules for the HTML vocabulary specify that a conforming HTML document should start with an ‘html’ element, so if you want to check conformance to the HTML spec (as opposed to schema-validity against the XHTML schema, which is not quite the same thing) you don’t get so much choice of how to invoke the validator: you should start with the declaration for the ‘html’ element and with the document’s root element.

If the validator you’re using doesn’t allow you to specify (a) where to start, and (b) what to start with, then you really should file a bug report or a request for enhancement. And whether you do that or not, you really should understand that some of the consequences of the implementation’s default choices are properties of how you are performing validity assessment, not properties of XSD validation in itself.

Some people dislike having to say explicitly that use of a particular vocabulary must start with a particular element, so they take pains to make only that one element top-level; all other elements are defined locally to complex types. This is an effective way of preventing abuse, but it also pretty effectively prevents re-use, and it makes the schema harder to maintain, work with, or reason about. I can’t see such a schema without thinking someone has just cut off their nose in order to spite their face.

Another plug for XML Catalogs (and caching)

The W3C systems group posted a blog entry the other day about the caching of DTDs and schemas. The failure of some XML software to use caches wisely is causing unbelievable amounts of traffic on the W3C site: in some cases, the same IP address is requesting the same DTD file hundreds or thousands of times in the space of a few hours.

The blog has good pointers to resources about using HTTP caching well, and about XML Catalogs.

I’ve said it before, and I’ll say it again: every piece of software that works with XML ought to use XML Catalogs. By all means allow the user to turn it off, but support it, and turn it on by default. The main reason is: it makes the life of your users easier. And the kind of problem discussed by the systeam blog post is one more reason.

W3C working group meetings / Preston’s Maxim

[25 January 2008]

I’m finishing a week of working group meetings in Florida, in the usual state of fatigue.

The SML (Service Modeling Language) and XML Query working groups met Monday through Wednesday. SML followed its usual practice of driving its agenda from Bugzilla: review the issues for which the editors have produced proposals, and act on the proposals, then review the ones we have discussed before without reaching consensus suitable for sending to the editors, then others. The working group spent a fair bit of time discussing issues I had raised or was being recalcitrant on, only to end up resolving them without making the suggested change. I felt a little guilty at taking so much of the group’s time, but no one exhibited any sign of resentment. In one or two cases I was in the end persuaded to change my position; in others it simply became clear that I wasn’t going to manage to persuade the group to do as I suggested. I have always found it easier to accept with (what I hope is) good grace a decision going against me, if I can feel that I’ve been given a fair chance to speak my piece and persuade the others in the group. The chairs of SML are good at ensuring that everyone gets a chance to make their case, but also adept (among their other virtues) at moving the group along at a pretty good clip.

(In some working groups I’ve been in, by contrast, some participants made it a habit not to argue the merits of the issue but instead to spend the time available arguing over whether the group should be taking any time on the issue at all. This tactic reduces the time available for discussion, impairs the quality of the group’s deliberation, and thus reduces the chance that the group will reach consensus; it’s thus extremely useful for those who wish to defend the status quo but do not have, or are not willing to expose, technical arguments for their position. The fact that this practice reduces me to a state of incoherent rage is merely a side benefit.)

“Service Modeling Language” is an unfortunate name, I think: apart from the fact that the phrase doesn’t suggest any very clear or specific meaning to anyone hearing it for the first time, the meanings it does suggest have pretty much nothing whatever to do with the language. SML defines a set of facilities for cross-document validation, in particular validation of, and by means of, inter-document references. Referential integrity can be checked using XSD (aka XML Schema), but only within the confines of a single document; SML makes it possible to perform referential integrity checking over a collection of documents, with cross-document analogues of XSD’s key, keyref, and unique constraints and with further abilities, in particular being able to specify simply that inter-document references of a given kind must point to elements with a particular expanded name, or elements with a given governing type definition, or that chains of references of a particular kind must be acyclic. In addition, the SML Interchange Format (SML-IF) specifies rules that make it easier to specify exactly what schema is to be used for validating a document using XSD and thus to get consistent validation results.

The XML Schema working group met Wednesday through Friday. Wednesday morning went to a joint session with the SML and XML Query working groups: Kumar Pandit gave a high-level overview of SML and there was discussion. Then in a joint session with XML Query, we discussed the issue of implementation-defined primitive types.

The rest of the time, the Schema WG worked on last-call issues against XML Schema. Since we had a rough priority sort of the issues, we were able just to sort the issues list and open them one after the other and ask “OK, what do we do about this one?”

Among the highlights visible from Bugzilla:

  • Assertions will be allowed on simple types, not just on complex types.
  • For negative wildcards, the keyword ##definedSibling will be available, so that schema authors can conveniently say “Allow anything except elements already included in this content model”; this is in addition to the already-present ##defined (“Allow anything except elements defined in this schema”). The Working Group was unable to achieve consensus on deep-sixing the latter; it has really surprising effects when new declarations are included in a schema and seems likely to produce mystifying problems in most usage scenarios, but some Working Group members are convinced it’s exactly what they or their users want.
  • The Working Group declined a proposal that some thought would have made it easier to support XHTML Modularization (in particular, the constraints on xhtml:form and xhtml:input); it would have made it possible for the validity of an element against a type to depend, in some cases, on where the element appears. Since some systems (e.g. XPath 2.0, XQuery 1.0, and XSLT 2.0) assume that type-validity is independent of context, the change would have had a high cost.
  • The sections of the Structures spec which contain validation rules and constraints on components (and the like) will be broken up into smaller chunks to try to make them easier to navigate.
  • The group hearkened to the words of Norm Walsh on the name of the spec (roughly paraphrasable as “XSDL? Not WXS? Not XSD? XSDL? What are you smoking?”); the name of the language will be XSD 1.1, not XSDL 1.1.
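For the curious, the negative-wildcard keywords mentioned above surface in XSD 1.1 syntax as values of the notQName attribute on a wildcard. A sketch of a content model with an open slot at the end (the element names here are invented for illustration, and the draft syntax may still change):

```xml
<!-- Allow any element in the trailing slot EXCEPT ones already
     declared as siblings in this content model (##definedSibling). -->
<xs:complexType name="section"
                xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:sequence>
    <xs:element name="title"/>
    <xs:element name="para" maxOccurs="unbounded"/>
    <xs:any notQName="##definedSibling" processContents="lax"
            minOccurs="0" maxOccurs="unbounded"/>
  </xs:sequence>
</xs:complexType>
```

With ##definedSibling, adding new declarations elsewhere in the schema cannot silently change what this wildcard excludes, which is precisely the surprise that ##defined invites.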

We made it through the entire stack of issues in the two and a half days; as Michael J. Preston (a prolific creator of machine-generated concordances known to a select few as “the wild man of Boulder”) once told me: it’s amazing how much you can get done if you just put your ass in a chair and do it.

Primitives and non-primitives in XSDL

John Cowan asks, in a comment on another post here, what possible rationale could have governed the decisions in XSDL 1.0 about which types to define as primitives and which to derive from other types.

I started to reply in a follow-up comment, but my reply grew too long for that genre, so I’m promoting it to a separate post.

The questions John asks are good ones. Unfortunately, I don’t have good answers. In all the puzzling cases he notes, my explanation of why XSDL is as it is begins with the words “for historical reasons …”.


Allowing ‘extension primitives’ in XML Schema?

In issue 3251 against XSDL 1.1 (aka ‘XML Schema 1.1’ to those who haven’t internalized the new acronym), Michael Kay suggests that XSDL, like other languages, allow implementations to define primitive types additional to those described in the XSDL spec.

I’ve been thinking about this a fair bit recently.

The result is too much information to put comfortably into a single post, so I’ll limit this post to a discussion of the rationale for the idea.

‘User-defined’ primitives?

Michael is not the first to suggest allowing the set of primitives to be extended without revving the XSDL spec. I’ve heard others, too, express a desire for this or for something similar (see below). In one memorable exchange at Extreme Markup Languages a couple years ago, Ann Wrightson noted that in some projects she has worked on, the need for additional primitives is keenly felt. In the heat of the moment, she spoke feelingly of the “arrogance” of languages that don’t allow users to define their own primitives. I remember it partly because that word stung; I doubt that my reply was as calmly reasoned and equable as I would have liked.

Strictly speaking, of course, the notion of ‘user-defined primitives’ is a contradiction in terms. If a user can define something meaningfully (as opposed to just declaring a sort of black box), then it seems inevitable that that definition will appeal to some other concepts, in terms of which the new thing is to be understood and processed. Those concepts, in turn, either have formal definitions in terms of yet other concepts, or they have no formal definition but must just be known to the processor or understood by human readers by other means. The term primitive denotes this last class of things, that aren’t defined within the system. Whatever a user can define, in the nature of things it can’t very well be a primitive in this sense of the word.

Defining types without lying to the processor

But if one can keep one’s pedantry under control long enough, it’s worth while trying to understand what is desired, before dismissing the expression of the desire as self-contradictory. It’s suggestive that some people point to DTLL (the Datatype Library Language being designed by Jeni Tennison) as an instance of the kind of thing needed. I have seen descriptions of DTLL that claimed that it had no primitives, or that it allowed user-defined primitives (thus falling into the logical trap just mentioned), but I believe that in public discussions Jeni has been more careful.

In practice, DTLL does have primitives, in the sense of basic concepts not defined in terms of other concepts still more basic. In the version Jeni presented at Extreme Markup Languages 2006, the primitive notions in terms of which all types are defined are (a) the idea of tuples with named parts and (b) the four datatypes of XPath 1.0. (Note: I specify the version not because I know that DTLL has since changed, but only because I don’t know that it has not changed; DTLL, alas, is still on the ‘urgent / important / read this soon’ pile, which means it’s continually being shoved aside by things labeled ‘super-urgent / life-threatening / read this yesterday’. It also suffers from my expectation that I’ll have to concentrate and think hard; surfing the Web and reading people’s blogs seems less strenuous.)

But DTLL does not have primitives, in the sense of a set of semantically described types from which all other types are to be constructed. All types (if I remember correctly) are in the same pool, and there is no formal distinction between primitive types and derived types.

Of course, the distinction between primitives and derived types has no particular formal significance in XSDL, either. What you can do with one, you can do with the other. The special types, on the other hand, are, well, special: you can derive new types from the primitives, but you can’t derive new types from the specials, like xs:anySimpleType or (in XSDL 1.1) xs:anyAtomicType. Such derivations require magic (which is one way of saying there isn’t any formal method of defining the nature of a type whose base type is xs:anyAtomicType: it requires [normative] prose).

But XSDL 1.0 and the current draft of XSDL 1.1 do require that all user-defined types be defined in terms of other known types. And this has a couple of irritating corollaries.

If you want to define an XSDL datatype for dates in the form defined by RFC 2822 (e.g. “18 Jan 2008”), or for lengths in any of the forms defined by (X)HTML or CSS (which accepts “50%”, and “50” [pixels], and “50*”, and “50em”, and so on), you can do so if you have enough patience to work out the required regular expressions. But you can’t derive the new rfc2822:date type from xs:date (as you might wish to do, to signal that they share the same value space). You must lie to the processor, and say that really, the set of RFC 2822 dates is a subtype of xs:string.
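The difference between validating the lexical shape and knowing the value space can be sketched in a few lines of Python. The pattern below is deliberately simplified (no day-of-week, no time zone), and `strptime` stands in for what a genuine date primitive would give the validator.

```python
import re
from datetime import datetime

# Deriving the type from xs:string means the validator sees only the
# lexical form: a regular expression can check the *shape* of an RFC
# 2822-style date but knows nothing about its *value*.
# (Pattern simplified for illustration.)
DATE_SHAPE = re.compile(r"\d{1,2} (Jan|Feb|Mar|Apr|May|Jun|"
                        r"Jul|Aug|Sep|Oct|Nov|Dec) \d{4}$")

print(bool(DATE_SHAPE.match("18 Jan 2008")))   # True: shape is fine
print(bool(DATE_SHAPE.match("99 Jan 2008")))   # True: shape-valid, value nonsense

# What a genuine date primitive could do, and a string-derived type
# cannot: compare values, e.g. enforce a lower bound for plausible
# email dates.
def plausible_email_date(s, lower=datetime(1950, 1, 1)):
    return datetime.strptime(s, "%d %b %Y") >= lower

print(plausible_email_date("18 Jan 2008"))   # True
print(plausible_email_date("18 Jan 1908"))   # False
```

The pattern happily accepts “99 Jan 2008”; only by parsing into a real value space can lower and upper bounds mean anything.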

Exercise for those gifted at casuistry: write a short essay explaining that what is really being defined here is the set of RFC 2822 date expressions, which really are strings, so this is actually all just fine and nothing to complain about.

Exercise for those who care about having notations humans can actually use: read the essay produced by your casuist colleague and refrain from punching the author in the eye.

Lying to the processor is always dangerous, and usually a bad idea; designing a system that requires lying to the processor in order to do something useful is similarly a bad idea (and can lead to heated remarks about the arrogance of the system). When the schema author is forced to pretend that the value space of email dates is the value space of strings (i.e. sequences of UCS characters), it not only does violence to the soul of the schema author and makes the processor miss the opportunity to use a specially optimized storage format for the values, but it also makes it impossible to derive further types by imposing lower and upper bounds on the value space (e.g. a type for plausible email dates: email dated before 1950? probably not). And you can’t expect the XSDL validator to understand the relation among pixels and points and picas and ems and inches, so just forget about the idea of restricting the length type with an upper bound of “2 pc” (two picas) and having the software know that “30pt” (thirty points) exceeds that upper bound.
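The unit arithmetic the validator would need is not deep; the problem is that nothing in XSD lets you teach it to the processor. A sketch, restricted to absolute CSS units (relative units like ‘em’ can’t be resolved without context, which is part of why this is hard), using the standard conversions 1in = 72pt = 6pc = 96px:

```python
import re

# Points per unit, from the standard CSS absolute-length conversions.
POINTS_PER_UNIT = {"pt": 1.0, "pc": 12.0, "in": 72.0, "px": 72.0 / 96.0}

def to_points(length):
    """Convert a length like '30pt' or '2pc' to points."""
    m = re.fullmatch(r"\s*([\d.]+)\s*(pt|pc|in|px)\s*", length)
    if m is None:
        raise ValueError(f"unsupported length: {length!r}")
    return float(m.group(1)) * POINTS_PER_UNIT[m.group(2)]

def within_max(length, bound):
    """Would this length satisfy a maxInclusive facet of `bound`?"""
    return to_points(length) <= to_points(bound)

print(within_max("30pt", "2pc"))   # False: 30pt exceeds 24pt
print(within_max("20pt", "2pc"))   # True
```

Thirty lines of unit knowledge would suffice for the facet check; the missing piece is any sanctioned way to plug such knowledge into an XSD validator.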

What about NOTATION?

As suggested in the examples just above, there are a lot of specialized notations that could usefully be defined as extension primitives in an XSDL context. Different date formats are a rich vein in themselves, but any specialized form of expression used to capture specialized information concisely is a candidate. Lengths, in a document formatting context. Rational numbers. Colors (again in a display context). Read the sections on data types in the HTML and CSS specs. Read the section on abstract data types in the programming textbook of your choice. Lots of possibilities.

One might argue that the correct way to handle these is to declare them for what they are: specialized notations, which may be processed and validated by a specialized processor (called, perhaps, as a plugin by the validator) but which a general-purpose markup validator needn’t be expected to know about.

In principle, this could work, I guess. And it may be (related to) what the designers of ISO 8879 (the SGML spec) had in mind when they defined NOTATION. But I don’t think NOTATION will fly as a solution for this kind of problem today, for several reasons:

  • There is no history or established practice of SGML or XML validators calling subvalidators to validate the data declared as being in a particular notation. So the ostensible reason for using NOTATION (“It’s already there! you don’t need to do anything!”) doesn’t really hold up: using declared notations to guide validation would be plowing new ground.
  • Almost no one ever really wants to use notations. In part this is because few people ever feel they really understand what notations are good for, and usually not for long, and they don’t always agree. So software developers never really know what to do with them, and end up doing nothing much with them.
  • If RFC 2822 dates are a ‘notation’ rather than a ‘datatype’, then surely the same applies to ISO 8601 dates. Why treat some date expressions as lexical representations of dates, and others as magic cookies for a black-box process? If notations were intended to keep software that works with SGML and XML markup from having to understand things like integers and dates, then the final returns are now in, and we can certify that that attempt didn’t succeed. (Some document-oriented friends of mine like to tell me that all this datatype stuff was foisted on XSDL by data-heads and does nothing at all for documents. I keep having to remind them that they spent pretty much the entire period from 1986 to 1998 warning their students that neither content models nor attribute declarations allowed you to specify, for example, that the content of a quantity element, or the value of a quantity attribute, ought to be [for example] a quantity — i.e. an integer, preferably restricted to a plausible range. XSDL may have given you things you never asked for, and it may give you things you asked for in a form you didn’t expect and don’t much like. But don’t claim you never asked for datatypes. I was there, and while I don’t have videotapes, I do remember.)

Who defines the new primitives?

Some people are nervous at the idea of trying to allow users to define new kinds of dates, or new kinds of values, in part because the attempt to allow the definition of arbitrary value spaces, in a form that can actually be used to check the validity of data, seems certain to end up by putting a Turing complete language into the spec, or by pointing to some existing programming language and requiring that people use that language to define their new types (and requiring all schema processors to include an interpreter for that language). And the spec ends up defining not just a schema language, but a set of APIs.

However you cut it, it seems a quick route to a language war; include me out.

Michael Kay has pointed out that there is a much simpler way. Don’t try to provide for user-defined types derived by restriction from xs:anyAtomicType in some interoperable way. That would require a lot of machinery, and would be difficult to reach consensus on.

Michael proposes: just specify that implementations may provide additional implementation-defined primitive types. In the nature of things, an implementation can do this however it wants. Some implementors will code up email dates and CSS lengths the same way they code the other primitives. Fine. Some implementors will expose the API that their existing primitive types use, so they choose, at the appropriate moment, to link in a set of extension types, or not. Some will allow users to provide implementations of extension types, using that API, and link them at run time. Some may provide extension syntax to allow users to describe new types in some usable way (DTLL, anyone?) without having to write code in Java or C or [name of language here].
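What such an implementor-defined hook might look like can be sketched as a run-time registry of primitive types. Everything here is invented for illustration: the API, the class, and the `ext:email-date` name are assumptions of the sketch, not anything mandated by any spec or proposal.

```python
from datetime import datetime

class PrimitiveType:
    """A hypothetical extension primitive: a name plus a parse function
    mapping lexical forms to values (raising ValueError on failure)."""
    def __init__(self, name, parse):
        self.name = name
        self.parse = parse

    def valid(self, lexical):
        try:
            self.parse(lexical)
            return True
        except ValueError:
            return False

# The processor's registry of primitives, extended at run time.
primitives = {}

def register(name, parse):
    primitives[name] = PrimitiveType(name, parse)

# An implementor (or a user, via the implementor's API) links in a
# new primitive for RFC-2822-style dates.
register("ext:email-date", lambda s: datetime.strptime(s, "%d %b %Y"))

t = primitives["ext:email-date"]
print(t.valid("18 Jan 2008"))    # True
print(t.valid("Jan 18, 2008"))   # False: wrong lexical form
# Because parse yields real values, facets like maxInclusive can
# compare values instead of comparing strings.
print(t.parse("18 Jan 2008") < t.parse("25 Dec 2010"))   # True
```

Whether the registry is populated by compiled-in code, a linker flag, or user-supplied plugins is exactly the kind of decision the proposal leaves to each implementation.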

That way, all the burden of designing the way to allow user-specified types to interface with the rest of the implementation falls on the implementors, if they wish to carry it, and not on the XSDL spec. (Hurrah, cries the editor.)

If enough implementations do that, and enough experience is gained, the relevant community might eventually come to some consensus on a good way to do it interoperably. And at that point, it could go into the spec for some schema language.

This has worked tolerably well for the analogous situation with implementation-defined / user-defined XPath functions in XSLT 1.0. XSLT 1.0 doesn’t provide syntax for declaring user-defined functions; it merely allows implementations to support additional functions in XPath expressions. Some implementations did so, providing either their own functions or functions defined by projects like EXSLT. And some implementations did in fact provide syntax for allowing users to define functions without writing Java or C code and running the linker. (And lo! the experience thus gained has made it possible to get consensus in XSLT 2.0 for standardized syntax for user-defined functions.)

But with that thought, I have reached a topic perhaps better separated out into another post. Other languages, specs, and technologies have faced similar decisions (should we allow implementations to add new blorts to our language, or should we restrict them to using the standard blorts?); what has the experience been like? What can we learn from the experience?