XML Schema 1.0

A language for document grammars

ACH/ALLC 2003 / Web X: a decade of the World Wide Web

1 June 2003

C. M. Sperberg-McQueen
Architecture Domain Lead, World Wide Web Consortium

1. Overview
2. The Iron Law
3. Grammars
4. A rewrite system
5. A rewrite system (2)
6. A rewrite system (3)
7. Document grammars
8. Conceptual layers
9. Conceptual layers (2)
10. Uses of document grammars
11. A DTD
12. A haiku
13. A limerick (Edward Lear)
14. A canzone
15. Note on the canzone DTD
16. XML Schema
17. The canzone schema v.1
18. Declaring elements
19. Declaring elements
20. Positive closure
21. Character data
22. Supporting programming-language and dbms paradigms
23. The tag/type distinction
24. Top-level named complex types
25. Top-level complex types
26. Anonymous types
27. Simple datatypes
28. Built-in primitive datatypes
29. Built-in derived datatypes
30. What is an atomic?
31. Non-atomic simple types
32. ~~Inheritance~~ Type derivation
33. Inheritance in document systems
34. Schemas and namespaces
35. Schema layers
36. Modularization
37. The tag/type distinction and non-local effects
38. Non-local effects in XML Schema
39. Determinism
40. Practical issues
41. Post-schema-validation infoset (PSVI)
42. Thank you

1. Overview

grammars?
document grammars?
an example
XML Schema 1.0
outlook

2. The Iron Law

Garbage* in, garbage out.

3. Grammars

informally: the rules governing how things work
formally: Chomsky's explication
- rewriting systems; a grammar as
  Γ = (V_N, V_T, P, S)
- hence set recognizers
- the Chomsky hierarchy
  - unrestricted phrase-structure grammars
  - monotonic grammars
  - context-free grammars
  - regular grammars
- describable and indescribable sets

4. A rewrite system

(Click in the rectangle to start this animation.)

Figure 1: Writing and rewriting and rewriting ...

N.B. what we are producing are strings, not structures. No memory.

5. A rewrite system (2)

(Click in the rectangle to start this animation.)

Figure 2: A derivation using a rewrite system.

Now we have some memory, but it's not very tractable.

6. A rewrite system (3)

Figure 3: For context-free grammars, the derivation can be interpreted as a tree.

7. Document grammars

Origin: pragmatic, not theoretical; partial post hoc alignment with formal language theory.

Formal specification of validity conditions → automated validation.

Er, ah, partial formal specification of validity conditions → partial automated validation.

Distinction between

document type definition (DTD) and
“set of effective formal declarations”

→ division of labor.

8. Conceptual layers

Three layers of rules governing data.

9. Conceptual layers (2)

Distinguish logical and physical structure.

10. Uses of document grammars

Document grammars may have several uses:

in the struggle against dirty data
as documentation of a contract between data provider and data consumer
as documentation of the content of data flows
as specification of client/server protocols

11. A DTD

Limericks and canzone:

<!ELEMENT poem (haiku | limerick | canzone) >

<!ELEMENT haiku    (l, l, l) >
<!ELEMENT limerick (trimeter, trimeter, 
                    dimeter, dimeter, 
                    trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter  (#PCDATA)>

<!ELEMENT canzone   (stanza+) >
<!ELEMENT stanza    (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen   (l+) >
<!ELEMENT abgesang  (l+) >
<!ELEMENT l         (#PCDATA) >

12. A haiku

<haiku>
 <l>Summer grasses &mdash;</l>
 <l>all that's left</l>
 <l>of warriors' dreams.</l>
</haiku>

13. A limerick (Edward Lear)

<limerick>
  <trimeter>
    There was an Old Man of Peru,
  </trimeter>
  <trimeter>
    who watched his wife making a stew;
  </trimeter>
  <dimeter>But once by mistake,</dimeter>
  <dimeter>In a stove she did bake,</dimeter>
  <trimeter>
    That unfortunate Man of Peru.
  </trimeter>
</limerick>

14. A canzone

<canzone>
  <stanza>
    <aufgesang>
      <stollen>
        <l>unter den linden an der heide</l>
        <l>da unser zweier bette was</l>
      </stollen>
      <stollen>
        <l>da mugt ir vinden schone beide</l>
        <l>gebrochen bluomen unde gras</l>
      </stollen>
    </aufgesang>
    <abgesang>
      <l>kuste er mich? wol tusentstunt</l>
      <l>tandaradei</l>
      <l>seht wie rot mir ist der munt</l>
    </abgesang>
  </stanza>
</canzone>

15. Note on the canzone DTD

All the non-terminals in the grammar show up as tags. (Changeable within limits.)
The two Stollen must have same number of lines; this rule is not expressed.
The Abgesang must have more lines than a Stollen, fewer than Aufgesang; this rule is not expressed.
Within a song, each stanza is metrically identical; this rule is not expressed.

16. XML Schema

A W3C* Recommendation*.

Design space:

DTD++, DTD--
instance syntax
supporting programming-language and database-oriented types
design problems

17. The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.

At the outer level is a schema element:

<xsd:schema 
 xmlns:xsd="http://www.w3.org/2001/XMLSchema">
 <!--* element declarations go here *-->
</xsd:schema>

N.B. the schema does not identify a document-root element / start symbol.

18. Declaring elements

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence maxOccurs="unbounded">
    <xsd:element ref="stanza"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stanza">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

19. Declaring elements

Note difference between element declaration (outer) and element reference (inner).

Implicit occurrence information: min = max = 1.

20. Positive closure

 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

21. Character data

 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="l" type="xsd:string"/>

22. Supporting programming-language and dbms paradigms

tag/type distinction
named and anonymous datatypes
simple datatypes

23. The tag/type distinction

Let's generalize from

 <xsd:element name="l" type="xsd:string"/>

Can we do that for every element type?

N.B. four usages for the term type

element type (vs. element, element instance)
data type
- simple type (lexical form has no markup)
- complex type (can have element children)

In the example, l may be thought of as an accessor.

24. Top-level named complex types

Named types can be used to capture commonalities:

 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" 
                maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="abgesang" type="lines">
 <xsd:element name="stollen" type="lines">

25. Top-level complex types

... or just to provide a name for a type:

 <xsd:complexType name="canzoneform">
  <xsd:sequence>
   <xsd:element ref="aufgesang"/>
   <xsd:element ref="abgesang"/>
  </xsd:sequence>
 </xsd:complexType>
 <xsd:complexType name="aufgesang">
  <xsd:sequence>
   <xsd:element ref="stollen"/>
   <xsd:element ref="stollen"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="stanza" type="canzoneform"/>
 <xsd:element name="aufgesang" type="aufgesang"/>

26. Anonymous types

We can hide things using anonymous local types:

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence maxOccurs="unbounded">
    <xsd:element name="stanza">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name="aufgesang">
        <xsd:complexType>
         <xsd:sequence>
          <xsd:element name="stollen" type="lines"/>
          <xsd:element name="stollen" type="lines"/>
         </xsd:sequence>
        </xsd:complexType>
       </xsd:element>
       <xsd:element name="abgesang" type="lines"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

Note nested declarations and definitions.

27. Simple datatypes

built-in
- primitive
- derived
user-defined

28. Built-in primitive datatypes

string
boolean
number, float, double
duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth
hexBinary, base64Binary
anyURI
QName
NOTATION

29. Built-in derived datatypes

normalizedString, token, language
IDREFS, ENTITIES, NMTOKEN, NMTOKENS, Name, NCName, ID, IDREF, ENTITY
integer, nonPositiveInteger, negativeInteger, long, int, short, byte, nonNegativeInteger, unsignedLong, unsignedInt, unsignedShort, unsignedByte, positiveInteger

30. What is an atomic?

Extensional:

a set of values V
a set of lexical forms L
a mapping from L to V

Intensional:

a base mapping L → V
a set of fundamental facets:
- equality (identity)
- order (partial, total, none)
- boundedness
- cardinality
- numeric
a set of constraining facets:
- length, minLength, maxLength
- pattern (constrains lexical space)
- enumeration
- whiteSpace
- maxInclusive, maxExclusive, minInclusive, minExclusive
- totalDigits, fractionDigits

31. Non-atomic simple types

list (white-space delimited)
unions (ordered)

32. Inheritance Type derivation

It turns out to be hard to model stepwise refinement of types:

restriction (preserves subset semantics)
extension (preserves prefix semantics)

33. Inheritance in document systems

Existing document systems turn out to have a very different model of class systems and inheritance.

inheritance of attributes
inheritance of locations

XML Schema models these with

inheritance (by extension or restriction)
substitution groups

34. Schemas and namespaces

Some (unpleasant) facts of life:

Namespaces are not incompatible with document grammars
— but they don't play well with DTDs.
Namespaces allow us to distinguish mine from not-mine.
Namespaces do not provide universal names.
The namespace : language relation is 1:n.
The language : grammar relation is 1:n.
Therefore, the namespace : schema relation is 1:n.

Live with it.

35. Schema layers

We distinguish:

schema documents (with single target namespace)
schemas (sets of abstract components)

Schema composition operations:

import
include
include with override / redefine

36. Modularization

XML Schema makes it possible to write modular document type definitions:

late collection of schema components
namespace-aware name matching, validation
white-box wildcards (lax / opportunistic)
black-box wildcards (skip)

37. The tag/type distinction and non-local effects

Consider the HTML input element:

legal only in p and similar elements
legal only within form elements

SGML DTDs have partial solutions:

inclusion exceptions
content models

38. Non-local effects in XML Schema

Fundamentally, we trade verbosity for context-sensitivity:

 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>

One bit of context information = double the size of grammar.

Cf. van Wijngaarden grammars (infinite size, unlimited of context sensitivity).

39. Determinism

The determinism rule remains controversial:

LL(1) guarantees may help implementors
All regular languages have a deterministic FSA;
... but not necessarily a deterministic regular expression!
Implications for closure under union, intersection.
Implications for subsumption tests.
Implications for interoperability, single-pass processing.

40. Practical issues

XML notation*
Linking document and schema
- namespace name
- schemaLocation hint
Hooks for schema annotation: the annotation element

41. Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.

additions, no changes
type assignment information
validation-attempted information (strict, lax, skip)
validation-outcome information

42. Thank you

C. M. Sperberg-McQueen
Architecture Domain Leader
World Wide Web Consortium
cmsmcq@w3.org

For more information: http://www.w3.org/XML