XML Schema 1.0
A language for document grammars
ACH/ALLC 2003 / Web X: a decade of the World Wide Web
1 June 2003
C. M. Sperberg-McQueen
Architecture Domain Lead, World Wide Web Consortium
1. Overview
- grammars?
- document grammars?
- an example
- XML Schema 1.0
- outlook
2. The Iron Law
Garbage* in, garbage out.
3. Grammars
- informally: the rules governing how things work
- formally: Chomsky's explication
- rewriting systems; a grammar as
Γ = (VN,
VT,
P,
S)
- hence set recognizers
- the Chomsky hierarchy
- unrestricted phrase-structure grammars
- monotonic grammars
- context-free grammars
- regular grammars
- describable and indescribable sets
4. A rewrite system
(Click in the rectangle to start this animation.)
N.B. what we are producing are strings, not structures. No memory.
5. A rewrite system (2)
(Click in the rectangle to start this animation.)
Now we have some memory, but it's not very tractable.
6. A rewrite system (3)
7. Document grammars
Origin: pragmatic, not theoretical; partial post hoc
alignment with formal language theory.
Formal specification of validity conditions → automated validation.
Er, ah, partial formal specification of validity conditions → partial automated validation.
Distinction between
- document type definition (DTD) and
- “set of effective formal declarations”
→ division of labor.
10. Uses of document grammars
Document grammars may have several uses:
- in the struggle against dirty data
- as documentation of a contract between data provider and data consumer
- as documentation of the content of data flows
- as specification of client/server protocols
Limericks and canzone:
<!ELEMENT poem (haiku | limerick | canzone) >
<!ELEMENT haiku (l, l, l) >
<!ELEMENT limerick (trimeter, trimeter,
dimeter, dimeter,
trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter (#PCDATA)>
<!ELEMENT canzone (stanza+) >
<!ELEMENT stanza (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen (l+) >
<!ELEMENT abgesang (l+) >
<!ELEMENT l (#PCDATA) >
12. A haiku
<haiku>
<l>Summer grasses —</l>
<l>all that's left</l>
<l>of warriors' dreams.</l>
</haiku>
13. A limerick (Edward Lear)
<limerick>
<trimeter>
There was an Old Man of Peru,
</trimeter>
<trimeter>
who watched his wife making a stew;
</trimeter>
<dimeter>But once by mistake,</dimeter>
<dimeter>In a stove she did bake,</dimeter>
<trimeter>
That unfortunate Man of Peru.
</trimeter>
</limerick>
14. A canzone
<canzone>
<stanza>
<aufgesang>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
</aufgesang>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</stanza>
</canzone>
15. Note on the canzone DTD
- All the non-terminals in the grammar show up as tags.
(Changeable within limits.)
- The two Stollen must have same number of
lines; this rule is not expressed.
- The Abgesang must have more lines than a
Stollen, fewer than Aufgesang; this rule is not expressed.
- Within a song, each stanza is metrically identical;
this rule is not expressed.
16. XML Schema
A W3C* Recommendation*.
Design space:
- DTD++, DTD--
- instance syntax
- supporting programming-language and database-oriented
types
- design problems
17. The canzone schema v.1
In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a
schema element:
<xsd:schema
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify
a document-root element / start symbol.
18. Declaring elements
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence maxOccurs="unbounded">
<xsd:element ref="stanza"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="stanza">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
19. Declaring elements
Note difference between element declaration (outer)
and element reference (inner).
Implicit occurrence information: min = max = 1.
20. Positive closure
<xsd:element name="abgesang">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="stollen">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
21. Character data
<xsd:element name="l">
<xsd:complexType mixed="true">
<xsd:sequence>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
or
<xsd:element name="l" type="xsd:string"/>
22. Supporting programming-language and dbms paradigms
- tag/type distinction
- named and anonymous datatypes
- simple datatypes
23. The tag/type distinction
Let's generalize from
<xsd:element name="l" type="xsd:string"/>
Can we do that for every element type?
N.B. four usages for the term
type- element type (vs. element, element instance)
- data type
- simple type (lexical form has no markup)
- complex type (can have element children)
In the example,
l may be thought of as an
accessor.
24. Top-level named complex types
Named types can be used to capture commonalities:
<xsd:complexType name="lines">
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
<xsd:element name="abgesang" type="lines">
<xsd:element name="stollen" type="lines">
25. Top-level complex types
... or just to provide a name for a type:
<xsd:complexType name="canzoneform">
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
<xsd:complexType name="aufgesang">
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
<xsd:element name="stanza" type="canzoneform"/>
<xsd:element name="aufgesang" type="aufgesang"/>
26. Anonymous types
We can hide things using
anonymous local types:
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence maxOccurs="unbounded">
<xsd:element name="stanza">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element name="stollen" type="lines"/>
<xsd:element name="stollen" type="lines"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="abgesang" type="lines"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Note nested declarations and definitions.
28. Built-in primitive datatypes
- string
- boolean
- number, float, double
- duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay,
gMonth
- hexBinary, base64Binary
- anyURI
- QName
- NOTATION
29. Built-in derived datatypes
- normalizedString, token, language
-
IDREFS,
ENTITIES,
NMTOKEN,
NMTOKENS,
Name,
NCName,
ID,
IDREF,
ENTITY
-
integer,
nonPositiveInteger,
negativeInteger,
long,
int,
short,
byte,
nonNegativeInteger,
unsignedLong,
unsignedInt,
unsignedShort,
unsignedByte,
positiveInteger
30. What is an atomic?
Extensional:
- a set of values V
- a set of lexical forms L
- a mapping from L to V
Intensional:
- a base mapping L → V
- a set of fundamental facets:
- equality (identity)
- order (partial, total, none)
- boundedness
- cardinality
- numeric
- a set of constraining facets:
- length, minLength, maxLength
- pattern (constrains lexical space)
- enumeration
- whiteSpace
- maxInclusive, maxExclusive, minInclusive, minExclusive
- totalDigits, fractionDigits
31. Non-atomic simple types
- list (white-space delimited)
- unions (ordered)
32. Inheritance Type derivation
It turns out to be hard to model stepwise refinement
of types:
- restriction (preserves subset semantics)
- extension (preserves prefix semantics)
33. Inheritance in document systems
Existing document systems turn out to have a very different model
of class systems and inheritance.
- inheritance of attributes
- inheritance of locations
XML Schema models these with
- inheritance (by extension or restriction)
- substitution groups
34. Schemas and namespaces
Some (unpleasant) facts of life:
- Namespaces are not incompatible with document grammars
- — but they don't play well with DTDs.
- Namespaces allow us to distinguish mine
from not-mine.
- Namespaces do not provide universal names.
- The namespace : language
relation is 1:n.
- The language : grammar
relation is 1:n.
- Therefore, the namespace : schema
relation is 1:n.
Live with it.
35. Schema layers
We distinguish:
- schema documents (with single target namespace)
- schemas (sets of abstract components)
Schema composition operations:
- import
- include
- include with override / redefine
36. Modularization
XML Schema makes it possible to write modular document
type definitions:
- late collection of schema components
- namespace-aware name matching, validation
- white-box wildcards (lax / opportunistic)
- black-box wildcards (skip)
37. The tag/type distinction and non-local effects
Consider the HTML
input element:
- legal only in p and similar elements
- legal only within form elements
SGML DTDs have partial solutions:
- inclusion exceptions
- content models
38. Non-local effects in XML Schema
Fundamentally, we trade verbosity for context-sensitivity:
<xsd:element name="div" type="div-type"/>
<xsd:element name="div" type="div-in-form-type"/>
<xsd:element name="p" type="p-type"/>
<xsd:element name="p" type="p-in-form-type"/>
<xsd:element name="ul" type="ul-type"/>
<xsd:element name="ul" type="ul-in-form-type"/>
<xsd:element name="li" type="li-type"/>
<xsd:element name="li" type="li-in-form-type"/>
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, unlimited of
context sensitivity).
39. Determinism
The determinism rule remains controversial:
- LL(1) guarantees may help implementors
- All regular languages have a deterministic FSA;
- ... but not necessarily a deterministic
regular expression!
- Implications for closure under union, intersection.
- Implications for subsumption tests.
- Implications for interoperability, single-pass processing.
40. Practical issues
- XML notation*
- Linking document and schema
- namespace name
- schemaLocation hint
- Hooks for schema annotation: the annotation
element
41. Post-schema-validation infoset (PSVI)
XML-Schema validation: infoset → infoset.
- additions, no changes
- type assignment information
- validation-attempted information (strict, lax, skip)
- validation-outcome information
42. Thank you
C. M. Sperberg-McQueen
Architecture Domain Leader
World Wide Web Consortium
cmsmcq@w3.org