[18 February 2008, Prague]
A colleague asks:
naive XML schema question — How does a validating parser know which xs:element is supposed to be the root/document element? I don’t see anything in the schema that tells it.
I’m not getting any love from google or the schema Recs. (I’ve looked at every use of the word “root” in the Recs, with no clues.)
I hate it when smart people who are willing to put in some work to understand things can’t find the answer to their questions in the schema spec. So first of all, I’m sorry. I apologize on behalf of the spec on which I’ve now spent a large proportion of my working life. (I wish I thought I could do something about it, but the XML Schema WG has been appallingly reluctant to fix the incomprehensibility problems of the spec. I think the 1.1 spec is marginally better than 1.0 in some ways, but only marginally and only in some ways. If you hated the 1.0 spec, you may find you hate 1.1 ever so slightly less, but it’s unlikely to charm you into liking it.)
But this question does come up a lot. And if the WG won’t explain it clearly in the spec, then at least I can try to explain it clearly here.
The choice of validation root is not specified by XSD. Formally it’s regarded as out of scope; in practice, the expectation is that processors will either provide a useful method of choosing where to start validation and users will specify the validation root at invocation time, or that processors will provide a useful default choice (e.g. the document root), or that in some cases processors will provide a fixed choice (e.g. the document root). In the last case the user can be said to have chosen to start validation at that fixed point by choosing to use that particular validator. That may sound Orwellian, but in principle, at least, the rule is that if you don’t like the level of control a given tool gives you, then why are you using that tool? File a bug report, or an enhancement request, or get another tool. Or both.
The closest the XSD spec comes to talking about this is in section 5.2 (“Assessing Schema-validity”). Personally, I find the discussion in XSD 1.1 marginally clearer than the discussion in 1.0, but I may be exhibiting my bias in that.
My colleague continues:
Preliminary experiments suggest that at least in a normal schema, you can, in fact, just give a fragment of a document and have the document be considered schema valid. So “<br/>” is a schema-valid HTML document? Very odd.
Well, no and yes. “<br/>” is schema-valid against the HTML schema, if schema-validity assessment starts with that element and any of (a) the corresponding element declaration, (b) the relevant type definition, or (c) the instruction to start in lax or strict wildcard mode and look for an applicable definition. And if that element happens to be the document root, then yes, it’s a document valid against the XHTML schema.
Since the default setting for many XSD validators is to start at the document root in lax-wildcard mode, they accept your sample document as valid.
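To make the role of the invocation-time choice concrete, here is a minimal sketch in Python. It is not an XSD processor: the `DECLS` table and the `validate` function are hypothetical stand-ins for a schema's top-level element declarations and a validator's entry point, and the content models are reduced to "which child names are allowed". The point is only that the table never names a root; the caller does.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a schema: each declared element name maps to
# the set of child element names it may contain. Nothing here names a root.
DECLS = {
    "html": {"head", "body"},
    "head": {"title"},
    "title": set(),
    "body": {"p", "br"},
    "p": {"br"},
    "br": set(),
}

def validate(element, start=None):
    """Check `element` against DECLS.

    `start` names the declaration to begin with. If it is omitted, the
    declaration is looked up by the element's own name (loosely like
    starting in strict wildcard mode).
    """
    name = start if start is not None else element.tag
    if name not in DECLS or element.tag != name:
        return False
    allowed = DECLS[name]
    # Children are always matched to declarations by their own names.
    return all(child.tag in allowed and validate(child) for child in element)

doc = ET.fromstring("<br/>")
print(validate(doc))                # True: start with whatever the root is
print(validate(doc, start="html"))  # False: caller insists on 'html'
```

The same declarations give different answers for the same document, depending only on what the caller says to start with.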
An analogous result could be achieved using a DTD, by writing
<!DOCTYPE br PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<br/>
I think that those who run an XML validator over that document will find that it is valid against the DTD.
The document type definition at http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd has no formal specification that any particular element must be the root element; the constraint on the generic identifier of the root element is specified as part of the document type declaration, the “<!DOCTYPE” part. Analogously, the XSD schema for XHTML doesn’t have any formal specification of any required root element, or required starting declaration; both get specified at validation time. Both when using DTDs and when using XSD, this allows you to validate one part of a document at a time. If you’re editing a large document and are storing different parts of it in different files, it’s convenient to be able to validate each part independently.
Another analogy is with the formal definition of a grammar: the set of productions that most of us think of as a grammar does not specify the start symbol. The start symbol is specified in a different part of the tuple that is, for formal-language purposes, the grammar. To describe schemas in these terms: the schema, or the collection of element and other declarations in a DTD file, does not define a full document grammar, but a set of productions for a document grammar. The start symbol is specified separately, in a doctype declaration for DTDs, and at validator invocation time for XSD schemas.
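The grammar analogy can be sketched in a few lines of Python. Everything here is made up for illustration: the toy `PRODUCTIONS` table, the symbol names, and the naive `derives` recognizer (which assumes no left recursion and no empty productions). The table plays the part of the schema; the start symbol is passed in separately by the caller.

```python
# Productions only: no start symbol appears anywhere in this table.
# Symbols without an entry ('t', 'p') are terminals.
PRODUCTIONS = {
    "doc": [["head", "body"]],
    "head": [["t"]],
    "body": [["p"], ["p", "body"]],
}

def derives(symbol, tokens):
    """Return True if `symbol` can derive exactly the sequence `tokens`."""
    tokens = list(tokens)
    if symbol not in PRODUCTIONS:  # terminal symbol
        return tokens == [symbol]
    return any(_seq(rhs, tokens) for rhs in PRODUCTIONS[symbol])

def _seq(symbols, tokens):
    """Return True if the symbol sequence derives exactly `tokens`."""
    if not symbols:
        return not tokens
    first, rest = symbols[0], symbols[1:]
    # Try every split between what `first` covers and what `rest` covers.
    return any(
        derives(first, tokens[:i]) and _seq(rest, tokens[i:])
        for i in range(len(tokens) + 1)
    )

tokens = ["t", "p", "p"]
print(derives("doc", tokens))   # True: a full document
print(derives("body", tokens))  # False: same productions, other start symbol
print(derives("body", ["p"]))   # True: a fragment, if we start at 'body'
```

Only the pair of productions and start symbol decides which strings are in the language; the productions alone do not.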
The rules for the HTML vocabulary specify that a conforming HTML document should start with an ‘html’ element, so if you want to check conformance to the HTML spec (as opposed to schema-validity against the XHTML schema, which is not quite the same thing) you don’t get so much choice of how to invoke the validator: you should start with the declaration for the ‘html’ element and with the document’s root element.
If the validator you’re using doesn’t allow you to specify (a) where to start, and (b) what to start with, then you really should file a bug report or a request for enhancement. And whether you do that or not, you really should understand that some of the consequences of the implementation’s default choices are properties of how you are performing validity assessment, not properties of XSD validation in itself.
Some people dislike having to say explicitly that use of a particular vocabulary must start with a particular element, so they take pains to make only that one element top-level; all other elements are defined locally to complex types. This is an effective way of preventing abuse, but it also pretty effectively prevents re-use, and it makes the schema harder to maintain, work with, or reason about. I can’t see such a schema without thinking someone has just cut off their nose in order to spite their face.
In SGML, DTDs were mandatory and start-tag omission meant that even if the first tag was , the first element could be DOC, so the root-element in the DOCTYPE declaration was absolutely necessary. XML tossed out start-tag omission but preserved this root element for backward compatibility with SGML, and XSD (sorry, haven’t memorized the acronym du jour yet) preserved rootlessness for backward compatibility with DTDs.
RELAX NG and its ancestors TREX and RELAX, though, are rooted schema languages: a Russian-doll schema is implicitly rooted, and a grammar-rule schema is rooted using the reserved name ‘start’. (It’s reserved only as a rule name, not as an element name.) This does not impede reuse, because when you include a grammar inside another grammar you normally override the definition of ‘start’.
I always liked the idea of partial document validation. It is however flawed because of the id/idref mechanism.
My most recent example was with EAD (Archival Description DTD). XML instances may be huge (several MB, which is huge if you think of EAD as an editorial DTD), so breaking them down at the component level is an excellent idea.
However, I still had to change the DTD for editing, because IDREF attributes become inconsistent when they point outside the component I am editing.
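A rough Python sketch of the problem, under stated assumptions: the attribute names `id` and `target`, the element names, and the helper `dangling_idrefs` are all hypothetical, and attribute types are assumed by name rather than read from a DTD.

```python
import xml.etree.ElementTree as ET

def dangling_idrefs(element, id_attr="id", ref_attr="target"):
    """Return reference values under `element` with no matching ID under it."""
    ids = {e.get(id_attr) for e in element.iter() if e.get(id_attr) is not None}
    refs = [e.get(ref_attr) for e in element.iter() if e.get(ref_attr) is not None]
    return [r for r in refs if r not in ids]

doc = ET.fromstring(
    '<archdesc>'
    '<c id="c1"><p>First component</p></c>'
    '<c id="c2"><ref target="c1"/></c>'
    '</archdesc>'
)

print(dangling_idrefs(doc))     # []: every reference resolves in the whole document
print(dangling_idrefs(doc[1]))  # ['c1']: the second component alone has a dangling reference
```

Validated on its own, the second component fails an ID/IDREF check even though the full document passes.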
The DTDs and XSD schemas I create are now free of ID/IDREF attributes, and the checks are usually done with Schematron, which allows finer control. My other rule is: avoid entities.