Constrain early and often:

XML Schema and the definition of document grammars for XML vocabularies

Fraunhofer

Institut für integrierte

Publikations- und Informationssysteme

C. M. Sperberg-McQueen

11 October 2001

Abstract

In the historical development of markup languages, few innovations have been more important than the introduction of the notion of document grammars for constraining documents and defining document types. Both SGML (ISO 8870) and XML 1.0 define a specialized notation (the DTD) for defining document grammars; more recently a number of alternative languages have been proposed. The W3C XML Schema language replicates the essential functionality of DTDs, and adds a number of features: the use of XML instance syntax rather than an ad hoc notation, clear relationships between schemas and namespaces, a systematic distinction between element types and data types, and a single-inheritance form of type derivation. In this talk, Michael Sperberg-McQueen, who serves as co-chair of the W3C XML Schema Working Group, will discuss the fundamental design issues of schema languages, and the choices made by the W3C group in defining XML Schema.



I. Overview

I.1. Overview

  • overview
  • the idea of document grammars
  • DTDs as document grammars
  • XML Schema
  • design issues and research questions

II. The idea of document grammars

II.1. Some models of text

  • Rectangular:
    text   ::= record
    record ::= CHAR*    // or CHAR{80} 
    
  • Linear
    text ::= CHAR*

II.2. Tree-shaped models

  • Unlabeled tree:
    text   ::= node*
    node   ::= node* | CHAR*
    
  • Fixed-depth, fixed-label tree:
    text   ::= page*
    page   ::= line* 
    line   ::= CHAR*
    

II.3. Models of marked-up text

  • ODTAO:
    text ::= (CHAR | COMMAND)*
    
    This only looks regular.
  • Pseudo-grammars:
    command ::= '\begin{' NAME '}'
              | '\end{' NAME '}'
    
    command ::= ':' NAME '.'?
              | ':e'' NAME '.'?
    
    command ::= '<' NAME attributes '>'
              | '</'' NAME '>'
    

II.4. COCOA tags

  • Virtual variables:
    command ::= '<' CHAR value '>'
    

II.5. Document grammars (DGs)

Origin pragmatic, not theoretical.
Later aligned (partially) with language theory.
Formal specification of validity rules → automated validation. (Cf. Algol vs. Fortran.)
Distinction between document type definition (DTD) and “the set of effective formal declarations”.

II.6. The uses of DGs

Document grammars may have several uses:
  • in the struggle against dirty data
  • as documentation of a contract between data provider and data consumer
  • as documentation of the content of data flows
  • as specification of client/server protocols

III. DTDs as document grammars

III.1. DTDs as DGs

SGML/XML DTDs resemble Backus-Naur Form grammars, but:
  • They describe bracketed languages* ...
  • ... so ‘non-terminals’ are visible*.
  • SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete parsing problem for non-bracketed L).
  • They are not purely grammatical (notations, entities).
  • Determinism rule.

III.2. A document grammar

Limericks and canzone:
poem     ::= limerick | canzone

limerick ::= trimeter trimeter dimeter 
             dimeter trimeter
trimeter ::= CHAR+
dimeter  ::= CHAR+

canzone   ::= aufgesang abgesang
aufgesang ::= stollen stollen
stollen   ::= line+
abgesang  ::= line+

III.3. A DTD

Limericks and canzone:
<!ELEMENT poem (limerick | canzone) >

<!ELEMENT limerick (trimeter, trimeter, 
                    dimeter, dimeter, 
                    trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter  (#PCDATA)>

<!ELEMENT canzone   (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen   (l+) >
<!ELEMENT abgesang  (l+) >
<!ELEMENT l         (#PCDATA) >

III.4. A limerick

<limerick>
  <trimeter>
    There was a young lady named Bright
  </trimeter>
  <trimeter>
    whose speed was much faster than light.
  </trimeter>
  <dimeter>She set out one day,</dimeter>
  <dimeter>in a relative way,</dimeter>
  <trimeter>
    and returned on the previous night.
  </trimeter>
</limerick>

III.5. A canzone

<canzone>
  <aufgesang>
    <stollen>
      <l>unter den linden an der heide</l>
      <l>da unser zweier bette was</l>
    </stollen>
    <stollen>
      <l>da mugt ir vinden schone beide</l>
      <l>gebrochen bluomen unde gras</l>
    </stollen>
  </aufgesang>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>

III.6. Note on the canzone DTD

  • All the non-terminals show up as tags.
  • The two Stollen must have same number of lines; this rule is not expressed.
  • The Abgesang must have more lines than a Stollen, fewer than Aufgesang; this rule is not expressed.

III.7. Removing non-terminals

<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines     "l+" >
<!ELEMENT canzone   (%aufgesang;, abgesang) >
<!ELEMENT stollen   (%lines;) >
<!ELEMENT abgesang  (%lines;) >
<!ELEMENT l         (#PCDATA) >

III.8. The canzone minus explicit Aufgesang

<canzone>
  <stollen>
    <l>unter den linden an der heide</l>
    <l>da unser zweier bette was</l>
  </stollen>
  <stollen>
    <l>da mugt ir vinden schone beide</l>
    <l>gebrochen bluomen unde gras</l>
  </stollen>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
    <l>seht wie rot mir ist der munt</l>
  </abgesang>
</canzone>

III.9. The canzone minus NTs

<canzone>
  <l>unter den linden an der heide</l>
  <l>da unser zweier bette was</l>
  <l>da mugt ir vinden schone beide</l>
  <l>gebrochen bluomen unde gras</l>
  <l>kuste er mich? wol tusentstunt</l>
  <l>tandaradei</l>
  <l>seht wie rot mir ist der munt</l>
</canzone>

III.10. Removing all non-terminals

<!ENTITY % stollen   "l+" >
<!ENTITY % aufgesang "%stollen;, %stollen;" >
<!ENTITY % abgesang  "l+" >
<!ELEMENT canzone   (%aufgesang;, %abgesang;) >
<!ELEMENT l         (#PCDATA) >
ERROR: this DTD is illegal; why?

IV. XML Schema and DTD functionality

IV.1. XML Schema

  • DTD++, DTD--
  • instance syntax
  • supporting programming-language and database-oriented types
  • design problems

IV.2. The canzone schema v.1

In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a schema element:
<xsd:schema>
 <!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify a document-root element / start symbol.

IV.3. Declaring elements

 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="aufgesang"/>
    <xsd:element ref="abgesang"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="aufgesang">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element ref="stollen"/>
    <xsd:element ref="stollen"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

IV.4. Declaring elements

  • Note difference between element declaration (outer) and element reference (inner).
  • Implicit occurrence information: min = max = 1.

IV.5. Positive closure

 <xsd:element name="abgesang">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

 <xsd:element name="stollen">
  <xsd:complexType>
   <xsd:sequence minOccurs="1" 
                 maxOccurs="unbounded">
    <xsd:element ref="l"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>

IV.6. Character data

 <xsd:element name="l">
  <xsd:complexType mixed="true">
   <xsd:sequence>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
or
 <xsd:element name="l" type="xsd:string"/>

V. Supporting PL/DBMS type notions

V.1. Supporting programming-language and dbms paradigms

  • tag/type distinction
  • named and anonymous datatypes
  • simple datatypes

V.2. The tag/type distinction

Let's generalize from
 <xsd:element name="l" type="xsd:string"/>
Can we do that for every element type?
N.B. four kinds of type
  • element type (vs. element, element instance)
  • data type
    • simple type (lexical form has no markup)
    • complex type (has element children)
In the example, l may be thought of as an accessor.

V.3. Top-level named complex types

Named types can be used to capture commonalities:
 <xsd:complexType name="lines">
  <xsd:sequence minOccurs="1" 
                maxOccurs="unbounded">
   <xsd:element ref="l"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="abgesang" type="lines">
 <xsd:element name="stollen" type="lines">

V.4. Top-level complex types

... or just to provide a name for a type:
 <xsd:complexType name="canzoneform">
  <xsd:sequence>
   <xsd:element ref="aufgesang"/>
   <xsd:element ref="abgesang"/>
  </xsd:sequence>
 </xsd:complexType>
 <xsd:complexType name="aufgesang">
  <xsd:sequence>
   <xsd:element ref="stollen"/>
   <xsd:element ref="stollen"/>
  </xsd:sequence>
 </xsd:complexType>

 <xsd:element name="canzone" type="canzoneform"/>
 <xsd:element name="aufgesang" type="aufgesang">

V.5. Anonymous types

We can hide things using anonymous local types:
 <xsd:element name="canzone">
  <xsd:complexType>
   <xsd:sequence>
    <xsd:element name="aufgesang">
     <xsd:complexType>
      <xsd:sequence>
       <xsd:element name="stollen" type="lines"/>
       <xsd:element name="stollen" type="lines"/>
      </xsd:sequence>
     </xsd:complexType>
    </xsd:element>
    <xsd:element name="abgesang" type="lines"/>
   </xsd:sequence>
  </xsd:complexType>
 </xsd:element>
Note nested declarations and definitions.

V.6. Simple datatypes

  • built-in
    • primitive
    • derived
  • user-defined

V.7. Built-in primitive datatypes

  • string
  • boolean
  • number, float, double
  • duration, dateTime, time, date, gYearMonth, gYear, gMonthDay, gDay, gMonth
  • hexBinary, base64Binary
  • anyURI
  • QName
  • NOTATION

V.8. Built-in derived datatypes

  • normalizedString, token, language
  • IDREFS, ENTITIES, NMTOKEN, NMTOKENS, Name, NCName, ID, IDREF, ENTITY
  • integer, nonPositiveInteger, negativeInteger, long, int, short, byte, nonNegativeInteger, unsignedLong, unsignedInt, unsignedShort, unsignedByte, positiveInteger

V.9. What is an atomic?

Extensional:
  • a set of values V
  • a set of lexical forms L
  • a mapping from L to V
Intensional:
  • a base mapping L → V
  • a set of fundamental facets:
    • equality (identity)
    • order (partial, total, none)
    • boundedness
    • cardinality
    • numeric
  • a set of constraining facets:
    • length, minLength, maxLength
    • pattern (constrains lexical space)
    • enumeration
    • whiteSpace
    • maxInclusive, maxExclusive, minInclusive, minExclusive
    • totalDigits, fractionDigits

V.10. Non-atomic simple types

  • list (white-space delimited)
  • unions (ordered)

V.11. Inheritance Type derivation

It turns out to be hard to model stepwise refinement of types:
  • restriction (preserves subset semantics)
  • extension (preserves prefix semantics)

V.12. Inheritance in document systems

Document systems turn out to have a very different model of class systems and inheritance.
  • inheritance of attributes
  • inheritance of locations

VI. Design problems and research questions

VI.1. Schemas and namespaces

Some (unpleasant) facts of life:
  • Namespaces allow us to distinguish mine from not-mine.
  • Namespaces do not provide universal names.
  • The namespace : language relation is 1:n.
  • The language : grammar relation is 1:n.
  • Therefore, the namespace : schema relation is 1:n.
Live with it.

VI.2. Schema layers

We distinguish:
  • schema documents (with single target namespace)
  • schemas (sets of abstract components)
Schema composition operations:
  • import
  • include
  • include with override / redefine

VI.3. Modularization

XML Schema makes it possible to write modular document type definitions:
  • late collection of schema components
  • namespace-aware name matching, validation
  • white-box wildcards (lax / opportunistic)
  • black-box wildcards (skip)

VI.4. Linking document and schema

  • namespace name
  • schemaLocation hint

VI.5. Post-schema-validation infoset (PSVI)

XML-Schema validation: infoset → infoset.
  • additions, no changes
  • type assignment information
  • validation-attempted information (strict, lax, skip)
  • validation-outcome information

VI.6. Non-local effects

Consider the HTML input element:
  • legal only in p and similar elements
  • legal only within form elements
SGML DTDs have partial solutions:
  • inclusion exceptions
  • content models

VI.7. Non-local effects in XML Schema

Fundamentally, we trade verbosity for context-sensitivity:
 <xsd:element name="div" type="div-type"/>
 <xsd:element name="div" type="div-in-form-type"/>

 <xsd:element name="p" type="p-type"/>
 <xsd:element name="p" type="p-in-form-type"/>

 <xsd:element name="ul" type="ul-type"/>
 <xsd:element name="ul" type="ul-in-form-type"/>

 <xsd:element name="li" type="li-type"/>
 <xsd:element name="li" type="li-in-form-type"/>
One bit of context information = double the size of grammar.
Cf. van Wijngaarden grammars (infinite size, arbitrary amounts of context sensitivity).

VI.8. Determinism

The determinism rule remains controversial:
  • LL(1) guarantees may help implementors
  • All regular languages have a deterministic FSA;
  • ... but not necessarily a deterministic regular expression!
  • Implications for closure under union, intersection.
  • Implications for subsumption tests.