Notes on schema-validation results

C. M. Sperberg-McQueen

7 December 2001, rev. 27 March 2003



This document describes the results of schema validation as described in the W3C Recommendation XML Schema 1.0, in particular the ways those results differ from the results of (DTD-based) validation as described in ISO 8879 and the XML 1.0 specification.
It is based on email sent by the author to the XML Query Working Group in June 2001.

1. Introduction

At 2001-05-02 10:37, C. M. Sperberg-McQueen wrote: (full text at http://lists.w3.org/Archives/Member/w3c-xml-query-wg/2001May/0028.html)
Section 3.3 [of the 27 April data model] has a paragraph which reads
A "schema-invalid document" is an XML document that has a corresponding schema but whose schema-validity assessment has resulted in one or more element/attribute information items being assigned values other than 'valid' for the [validity] property in the PSVI.
I think the concept outlined here is going to be important, but I am uncomfortable with the term used for it (without being able to propose a better one at the moment). ...
XML Schema distinguishes valid nodes (on which strict assessment of validity was attempted, and found the node valid), invalid nodes (on which strict assessment was attempted, and found an error), and nodes for which the validity status is 'notKnown'. Our definition covers documents within which any element or attribute is 'invalid' or 'notKnown', which means it also covers documents within which any node was processed as a 'black box' (i.e. skipped during validation), in addition to documents within which there is some detected error. The term 'schema-invalid' will tend to suggest to the non-paranoid reader a meaning narrower than that given by the definition.
We discussed this at the XML Query face to face in May and Mary suggested substituting the term "incompletely validated". Although I had suggested this term myself in my note of 2 May, I find that I am as troubled by it as by the term "invalid".
Upon consideration, I think I now know why. The fact of the matter is that XML Schema (a) provides more information about schema-validity than a single bit (valid/invalid), and (b) provides schema-validity information not just about the document as a whole but about each element and attribute. If we want the data model to cover more than only the set of schema-valid documents, I think we need (a) to make more than a binary distinction ourselves, and (b) to consider validity as a property of (and validation an operation on) elements / subtrees, not solely documents.
At the very least, if we want to go beyond fully-validated schema-valid documents, we need to come to grips with which set of documents, other than the schema-valid documents, we actually want to cover, and how.
The purpose of the rest of this note is to provide a list of the cases I think can usefully be distinguished, and to note where we have decisions to make. If people agree that we need something more than a binary switch, I will be willing to attempt formulating specific language for the data model document.
A conforming XML Schema processor provides information on (inter alia)
  • the ancestor element at which schema-validation ('assessment') started
  • whether this particular element and its descendants were schema-validated or not
  • the result of the assessment
  • the type associated with the element
The various combinations of values of the [validation attempted], [validity], and [type definition] properties can usefully distinguish several cases: eight by my count. A diagram showing the various combinations of [validation attempted] and [validity] is at http://www.w3.org/XML/2001/06/validity-outcomes if that helps. [Note 2003-03-26: this table is now reproduced below, since I was tired of not finding it here. -CMSMcQ]
(Pedantic note: conforming XML Schema processors are allowed not to provide the [type definition] property, if instead they provide a bundle of properties including the [type definition name], [type definition namespace], and [type definition anonymous] properties. I ignore such light-weight processors here because I assume that XML Query will require access to the type definition components themselves. Anyone not sharing that assumption may implicitly insert the phrase "(or [type definition name] and related properties)" wherever I mention the [type definition] property, as long as they adjust quantifiers and negation properly.)
The outcome of XML-Schema-conformant validity assessment is conveyed by two properties, [validation attempted] and [validity], each of which has three possible values. In the resulting three-by-three matrix, not all combinations are possible; the possible combinations distinguish several different states of affairs.
In the following table, the different cases outlined below are labeled by number. Shorthand codes indicate whether the node (N), its children (C), and its descendants (D) were assessed, whether they are valid, and whether the [type definition] (td) and [schema error code] (sec) properties are present (+) or absent (-).
The rows of the table give the three values of [validation attempted]; the columns give the three values of [validity].

[validation attempted] = "full"
(this node and all descendants were fully assessed)

  [validity] = "valid"
  OK. The entire subtree from here down has been strictly assessed and is valid.
    Case 1
      Assess: +N +C +D
      Valid:  +N +C +D
      Props:  +td -sec

  [validity] = "notKnown"
  Not possible: validation-attempted="full" implies strict assessment, and strict assessment implies validity is either "valid" or "invalid".

  [validity] = "invalid"
  OK. The entire subtree from here down has been strictly assessed and has an error, either here or at some descendant.
    Case 2: There is a problem with the current element: it is not locally valid.
      Assess: +N +C +D
      Valid:  -N
      Props:  -td +sec
    Case 3: The current element is locally valid, but a child is invalid or missing a required declaration.
      Assess: +N +C +D
      Valid:  +N -C
      Props:  +td -sec

[validation attempted] = "partial"
(either this node was fully assessed but some descendant was not, or vice versa)

  [validity] = "valid"
  OK. This node was strictly assessed and was valid; none of its attributes or children was invalid, and none was missing any required declaration.
    Case 4
      Assess: +N
      Valid:  +N
      Valid-or-notKnown: +C
      Required-decls-found: +C
      Props:  +td -sec

  [validity] = "notKnown"
  OK. This node was not strictly assessed (but one of its descendants was).
    Case 7
      Assess: -N +D
      Props:  -td -sec

  [validity] = "invalid"
  OK. This node was strictly assessed and was invalid, or else at least one of its direct dependents was invalid. Also, some descendant was not strictly assessed.
    Case 5: This node is locally invalid.
      Assess: +N -D
      Valid:  -N
      Props:  -td +sec
    Case 6: A descendant is invalid.
      Assess: +N
      Valid:  +N -D
      Props:  +td -sec

[validation attempted] = "none"
(neither this node nor any descendant was strictly assessed)

  [validity] = "valid"
  Not possible: validation-attempted="none" implies no strict assessment, and validity "valid" implies strict assessment.

  [validity] = "notKnown"
  OK. This subtree was skipped: no strict assessment, no validity information.
    Case 8: A skipped subtree.
      Assess: -N -C -D
      Props:  -td -sec

  [validity] = "invalid"
  Not possible: validation-attempted="none" implies no strict assessment, and validity="invalid" implies strict assessment.
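Read as a decision table, the matrix above amounts to a small lookup. The following Python sketch is illustrative only (the function is mine, not part of any specification); the case numbers and property combinations follow the table:

```python
def classify(attempted, validity, has_type_definition):
    """Map a combination of the PSVI properties [validation attempted],
    [validity], and presence/absence of [type definition] to one of the
    eight cases in the table above. Returns None for the combinations
    marked "not possible"."""
    table = {
        ("full",    "valid",    True):  1,
        ("full",    "invalid",  False): 2,  # locally invalid
        ("full",    "invalid",  True):  3,  # locally valid, invalid descendant
        ("partial", "valid",    True):  4,
        ("partial", "invalid",  False): 5,  # locally invalid
        ("partial", "invalid",  True):  6,  # invalid descendant
        ("partial", "notKnown", False): 7,  # this node not strictly assessed
        ("none",    "notKnown", False): 8,  # skipped subtree
    }
    return table.get((attempted, validity, has_type_definition))
```

For example, classify("full", "notKnown", False) returns None, since strict assessment always yields "valid" or "invalid".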

2. When the entire subtree has been schema-validated

2.1. Full validation, valid

First, we can distinguish three cases in which the entire subtree has been schema-validated:
1 This element, and all of its descendants, have been checked and are schema-valid. This is the rough equivalent of DTD-based validation: everything has a declaration, and everything conforms to the declaration.
  [validation attempted] = "full"
  [validity] = "valid"
  [type definition] property is present
The Query/XPath data model has to cover these elements.

2.2. Full validation, invalid

2 This element, and all of its descendants, have been checked and there is a problem right here at this element (and possibly also with some descendant).
  [validation attempted] = "full"
  [validity] = "invalid"
  [type definition] property is not present
We need to decide whether the Query/XPath data model should cover these elements and/or their descendants. It seems plausible to want to cover at least all fully-assessed schema-valid descendants (i.e. descendants in class 1). We can also cover the element with the problem by treating it as if it had the urType.

2.3. Full validation, locally valid

3 This element, and all of its descendants, have been checked and while this element is 'locally valid', some descendant is invalid.
  [validation attempted] = "full"
  [validity] = "invalid"
  [type definition] property is present
This will be the description of the top-level element in a database, if one attribute in one record is out of bounds. It seems plausible to want to cover at least these elements, and probably at least some of their descendants (i.e. at least those descendants which are also in this class).

3. Partial schema-validation

Second, there are four cases in which part of the subtree has been schema-validated and part not.
Schema-validity will not be assessed on elements or attributes if they, or some ancestor, match a wildcard which prescribes "skip" processing. Skip processing forbids schema-validity assessment and creates a 'black-box' location in a document, in which any well-formed XML is legal. Schema-validity will also not be assessed for elements and attributes if (a) they or some ancestor match a wildcard which prescribes "lax" processing and (b) no declaration is available for them. Lax processing calls for schema-validity to be assessed for elements and attributes if matching declarations are available, and skipped if declarations are not available; it creates a 'white box' in which undeclared elements and attributes are allowed, but in which all elements and attributes are schema-validated if declarations are available for them.
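The skip/lax/strict distinction just described can be sketched as a per-element decision. This is an illustrative Python sketch under my own naming (the function and parameters are not from the Recommendation):

```python
def will_assess(process_contents, declaration_available):
    """Decide whether schema-validity will be assessed for an element,
    given the processing mode of the wildcard it matched ("skip", "lax",
    or "strict") and whether a declaration is available for it."""
    if process_contents == "skip":
        # Black box: assessment is forbidden; any well-formed XML is legal.
        return False
    if process_contents == "lax":
        # White box: assessed only if a matching declaration is available.
        return declaration_available
    # Strict: assessment is always attempted (a missing declaration
    # is then an error rather than a reason to skip).
    return True
```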

3.1. Partial validation, valid item

4 This element has been schema-validated, and is schema-valid (which means also that none of its attributes or children is invalid or missing a required declaration), but some descendant is not marked "valid".
  [validation attempted] = "partial"
  [validity] = "valid"
  [type definition] property is present
I believe we want to cover these elements in our data model.

3.2. Partial validation, (locally) invalid

5 This element has been schema-validated, and is invalid because there is a problem right here at this element.
  [validation attempted] = "partial"  
  [validity] = "invalid"
  [type definition] property is not present
I believe we need to decide whether we want to cover these elements in our data model. If we do wish to cover them, we can do so (I think) by assigning them the urType.

3.3. Partial validation, locally valid

6 This element has been schema-validated, and is invalid because although it's OK 'locally', it has some invalid descendant.
  [validation attempted] = "partial"
  [validity] = "invalid"
  [type definition] property is present
I believe we do want to cover these elements in our data model.

3.4. Partial validation, locally unvalidated

7 This element has not been schema-validated, but at least one of its descendants has been.
  [validation attempted] = "partial"
  [validity] = "notKnown"
  [type definition] property not present
I believe we need to decide whether we want to cover these elements in our data model. I believe we do, and that we can do so by assigning them the urType.

4. When the subtree was skipped

Finally, there is one case in which no part of the subtree has been schema-validated.

4.1. Unvalidated

8 Neither this element nor any of its descendants has been schema-validated.
  [validation attempted] = "none"
  [validity] = "notKnown"
  [type definition] and [type definition name] properties not present
The Query/XPath data model can easily cover these elements by assigning the urSimpleType to all attributes and the urType to all elements.
I think we can cover all these cases if we simply assign the urType and urSimpleType to items which have no [type definition] property. The question is, do we wish to do so? (If we do, we need to be careful to distinguish elements associated with the urType by the schema validator from those for which the association with the urType came from the query system, not the schema validator.)
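A minimal sketch of that fallback rule, including the provenance flag the parenthesis above calls for (the names here are illustrative, not from any specification):

```python
def effective_type(kind, type_definition):
    """Return (type name, from_validator) for an item in the data model.
    kind is "element" or "attribute"; type_definition is the PSVI
    [type definition] value, or None if the property is absent.
    from_validator records whether the association came from the schema
    validator (True) or was supplied as a fallback by the query system
    (False)."""
    if type_definition is not None:
        return (type_definition, True)
    # No [type definition] property: fall back to the urType for
    # elements and the urSimpleType for attributes.
    fallback = "urType" if kind == "element" else "urSimpleType"
    return (fallback, False)
```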