XML Schema 1.0

A language for document grammars

ACH/ALLC 2003 / Web X: a decade of the World Wide Web

1 June 2003

C. M. Sperberg-McQueen
Architecture Domain Lead, World Wide Web Consortium



1. Overview

2. The Iron Law

Garbage* in, garbage out.

3. Grammars

4. A rewrite system

(Click in the rectangle to start this animation.)
Figure 1: Writing and rewriting and rewriting ...
N.B. what we are producing are strings, not structures. No memory.

5. A rewrite system (2)

(Click in the rectangle to start this animation.)
Figure 2: A derivation using a rewrite system.
Now we have some memory, but it's not very tractable.

6. A rewrite system (3)

Figure 3: For context-free grammars, the derivation can be interpreted as a tree.

7. Document grammars

Origin: pragmatic, not theoretical; partial post hoc alignment with formal language theory.
Formal specification of validity conditions → automated validation.
Er, ah, partial formal specification of validity conditions → partial automated validation.
Distinction between
  • document type definition (DTD) and
  • “set of effective formal declarations”
→ division of labor.

8. Conceptual layers

Three layers of rules governing data.

9. Conceptual layers (2)

Distinguish logical and physical structure.

10. Uses of document grammars

Document grammars may have several uses:
  • in the struggle against dirty data
  • as documentation of a contract between data provider and data consumer
  • as documentation of the content of data flows
  • as specification of client/server protocols

11. A DTD

Limericks and canzone:
<!ELEMENT poem (haiku | limerick | canzone) >

<!ELEMENT haiku    (l, l, l) >
<!ELEMENT limerick (trimeter, trimeter, 
                    dimeter, dimeter, 
                    trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter  (#PCDATA)>

<!ELEMENT canzone   (stanza+) >
<!ELEMENT stanza    (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen   (l+) >
<!ELEMENT abgesang  (l+) >
<!ELEMENT l         (#PCDATA) >

12. A haiku

<haiku>
 <l>Summer grasses &mdash;</l>
 <l>all that's left</l>
 <l>of warriors' dreams.</l>
</haiku>

13. A limerick (Edward Lear)

<limerick>
  <trimeter>
    There was an Old Man of Peru,
  </trimeter>
  <trimeter>
    who watched his wife making a stew;
  </trimeter>
  <dimeter>But once by mistake,</dimeter>
  <dimeter>In a stove she did bake,</dimeter>
  <trimeter>
    That unfortunate Man of Peru.
  </trimeter>
</limerick>

14. A canzone

<canzone>
  <stanza>
    <aufgesang>
      <stollen>
        <l>unter den linden an der heide</l>
        <l>da unser zweier bette was</l>
      </stollen>
      <stollen>
        <l>da mugt ir vinden schone beide</l>
        <l>gebrochen bluomen unde gras</l>
      </stollen>
    </aufgesang>
    <abgesang>
      <l>kuste er mich? wol tusentstunt</l>
      <l>tandaradei</l>
      <l>seht wie rot mir ist der munt</l>
    </abgesang>
  </stanza>
</canzone>

15. Note on the canzone DTD

16. XML Schema

A W3C* Recommendation*.
Design space:

17. The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a schema element:
<xsd:schema 
 xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify a document-root element / start symbol.

18. Declaring elements

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence maxOccurs="unbounded">
    <xsd:element ref="stanza"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stanza">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

19. Declaring elements

Note difference between element declaration (outer) and element reference (inner).
Implicit occurrence information: min = max = 1.

20. Positive closure

 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

21. Character data

 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
or
 <xsd:element name="l" type="xsd:string"/>

22. Supporting programming-language and dbms paradigms

23. The tag/type distinction

Let's generalize from
 <xsd:element name="l" type="xsd:string"/>
Can we do that for every element type?
N.B. four usages for the term type
  • element type (vs. element, element instance)
  • data type
    • simple type (lexical form has no markup)
    • complex type (can have element children)
In the example, l may be thought of as an accessor.

24. Top-level named complex types

Named types can be used to capture commonalities:
 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" 
                maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="abgesang" type="lines">
 <xsd:element name="stollen" type="lines">

25. Top-level complex types

... or just to provide a name for a type:
 <xsd:complexType name="canzoneform">
  <xsd:sequence>
   <xsd:element ref="aufgesang"/>
   <xsd:element ref="abgesang"/>
  </xsd:sequence>
 </xsd:complexType>
 <xsd:complexType name="aufgesang">
  <xsd:sequence>
   <xsd:element ref="stollen"/>
   <xsd:element ref="stollen"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="stanza" type="canzoneform"/>
 <xsd:element name="aufgesang" type="aufgesang"/>

26. Anonymous types

We can hide things using anonymous local types:
 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence maxOccurs="unbounded">
    <xsd:element name="stanza">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name="aufgesang">
        <xsd:complexType>
         <xsd:sequence>
          <xsd:element name="stollen" type="lines"/>
          <xsd:element name="stollen" type="lines"/>
         </xsd:sequence>
        </xsd:complexType>
       </xsd:element>
       <xsd:element name="abgesang" type="lines"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
Note nested declarations and definitions.

27. Simple datatypes

28. Built-in primitive datatypes

29. Built-in derived datatypes

30. What is an atomic?

Extensional:
  • a set of values V
  • a set of lexical forms L
  • a mapping from L to V
Intensional:
  • a base mapping L → V
  • a set of fundamental facets:
    • equality (identity)
    • order (partial, total, none)
    • boundedness
    • cardinality
    • numeric
  • a set of constraining facets:
    • length, minLength, maxLength
    • pattern (constrains lexical space)
    • enumeration
    • whiteSpace
    • maxInclusive, maxExclusive, minInclusive, minExclusive
    • totalDigits, fractionDigits

31. Non-atomic simple types

32. Inheritance Type derivation

It turns out to be hard to model stepwise refinement of types:
  • restriction (preserves subset semantics)
  • extension (preserves prefix semantics)

33. Inheritance in document systems

Existing document systems turn out to have a very different model of class systems and inheritance.
  • inheritance of attributes
  • inheritance of locations
XML Schema models these with
  • inheritance (by extension or restriction)
  • substitution groups

34. Schemas and namespaces

Some (unpleasant) facts of life:
  • Namespaces are not incompatible with document grammars
  • — but they don't play well with DTDs.
  • Namespaces allow us to distinguish mine from not-mine.
  • Namespaces do not provide universal names.
  • The namespace : language relation is 1:n.
  • The language : grammar relation is 1:n.
  • Therefore, the namespace : schema relation is 1:n.
Live with it.

35. Schema layers

We distinguish:
  • schema documents (with single target namespace)
  • schemas (sets of abstract components)
Schema composition operations:
  • import
  • include
  • include with override / redefine

36. Modularization

XML Schema makes it possible to write modular document type definitions:
  • late collection of schema components
  • namespace-aware name matching, validation
  • white-box wildcards (lax / opportunistic)
  • black-box wildcards (skip)

37. The tag/type distinction and non-local effects

Consider the HTML input element:
  • legal only in p and similar elements
  • legal only within form elements
SGML DTDs have partial solutions:
  • inclusion exceptions
  • content models

38. Non-local effects in XML Schema

Fundamentally, we trade verbosity for context-sensitivity:
 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, unlimited of context sensitivity).

39. Determinism

The determinism rule remains controversial:
  • LL(1) guarantees may help implementors
  • All regular languages have a deterministic FSA;
  • ... but not necessarily a deterministic regular expression!
  • Implications for closure under union, intersection.
  • Implications for subsumption tests.
  • Implications for interoperability, single-pass processing.

40. Practical issues

  • XML notation*
  • Linking document and schema
    • namespace name
    • schemaLocation hint
  • Hooks for schema annotation: the annotation element

41. Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.
  • additions, no changes
  • type assignment information
  • validation-attempted information (strict, lax, skip)
  • validation-outcome information

42. Thank you

C. M. Sperberg-McQueen
Architecture Domain Leader
World Wide Web Consortium
cmsmcq@w3.org
For more information: http://www.w3.org/XML