Abstract
In the historical development of markup languages, few innovations
have been more important than the introduction of the notion of
document grammars for constraining documents and defining document
types. Document grammars provide a simple, easily understood method
of specifying rules for the validity of XML documents. By helping
keep data clean, they make it easier to write simpler, more reliable
software.
Both SGML (ISO 8879) and XML 1.0 define a specialized notation (the
DTD) for defining document grammars; more recently a number of
alternative languages have been proposed. The W3C XML Schema language
replicates the essential functionality of DTDs, and adds a number of
features: the use of XML instance syntax rather than an ad hoc
notation, clear relationships between schemas and namespaces, a
systematic distinction between element types and data types, and a
single-inheritance form of type derivation.
This presentation will outline some of the fundamental features of
document grammars for XML, and at the same time introduce the basics
of the W3C XML Schema 1.0 language. Some fundamental design issues of
schema languages will be discussed, together with the choices made by
the W3C group in defining XML Schema.
III. An example: DTDs as document grammars
DTDs resemble Backus-Naur Form grammars, but:
- They describe ‘bracketed’ languages* ...
- ... so ‘non-terminals’ are visible*.
- SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete
parsing problem for non-bracketed L).
- They are not purely grammatical (notations, entities).
- Content models must obey a determinism rule (roughly an LL(1) requirement).
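The determinism rule is easiest to see on a small content model. The sketch below uses hypothetical element names (stanza, l, refrain) that are not part of the poem DTD discussed in this section. It describes a model whose first child cannot be matched to a unique branch, and declares a deterministic rewrite that accepts the same documents.

```xml
<?xml version="1.0"?>
<!DOCTYPE stanza [
  <!-- Hypothetical element names, for illustration only.

       Non-deterministic:   ((l, refrain) | (l, l))
       On reading the first l, a parser cannot know which of the two
       branches it is in without looking ahead to the next element.

       Deterministic and equivalent: factor out the shared first l. -->
  <!ELEMENT stanza  (l, (refrain | l)) >
  <!ELEMENT l       (#PCDATA) >
  <!ELEMENT refrain (#PCDATA) >
]>
<stanza>
  <l>vor dem walde in einem tal,</l>
  <refrain>tandaradei,</refrain>
</stanza>
```

A plain BNF grammar carries no such restriction, which is one of the ways in which DTDs differ from it.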
III.2. Example: limericks
Consider two kinds of poem. The limerick:
There was a young lady named Bright
whose speed was much faster than light.
She set out one day,
in a relative way,
and returned on the previous night.
III.3. ... and canzone
Under der linden an der heide,
dâ unser zweier bette was,
dâ muget ir vinden schône beide
gebrochen bluomen unde gras.
vor dem walde in einem tal,
tandaradei,
schône sanc diu nahtegal.
III.4. A document grammar
Limericks and canzone:
poem ::= limerick | canzone
limerick ::= trimeter trimeter dimeter dimeter trimeter
trimeter ::= CHAR+
dimeter ::= CHAR+
canzone ::= aufgesang abgesang
aufgesang ::= stollen stollen
stollen ::= line+
abgesang ::= line+
line ::= CHAR+
The same grammar, written as a DTD:
<!ELEMENT poem (limerick | canzone) >
<!ELEMENT limerick (trimeter, trimeter,
dimeter, dimeter,
trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter (#PCDATA)>
<!ELEMENT canzone (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen (l+) >
<!ELEMENT abgesang (l+) >
<!ELEMENT l (#PCDATA) >
<poem>
<limerick>
<trimeter>
There was a young lady named Bright
</trimeter>
<trimeter>
whose speed was much faster than light.
</trimeter>
<dimeter>She set out one day,</dimeter>
<dimeter>in a relative way,</dimeter>
<trimeter>
and returned on the previous night.
</trimeter>
</limerick>
</poem>
<poem>
<canzone>
<aufgesang>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
</aufgesang>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
</poem>
III.8. Note on the poem DTD
- All the non-terminals show up as tags.
- The trimeter and dimeter lines should scan with 3 and 2 dactyls,
respectively; this rule is not expressed.
- The two Stollen must have the same number of
lines; this rule is not expressed.
- The Abgesang must have more lines than a
Stollen and fewer than the Aufgesang; this rule is not expressed.
- No grammar detects the errors in the
previous example.
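The unexpressed rules can be illustrated with a sketch of a document that is valid against the poem DTD yet breaks two of them: the two Stollen differ in length, and the Abgesang is no longer than the first Stollen. The distribution of lines is deliberately wrong and is an assumption made for illustration. The declarations are copied from the DTD into an internal subset so that the example stands alone.

```xml
<?xml version="1.0"?>
<!DOCTYPE canzone [
  <!ELEMENT canzone   (aufgesang, abgesang) >
  <!ELEMENT aufgesang (stollen, stollen) >
  <!ELEMENT stollen   (l+) >
  <!ELEMENT abgesang  (l+) >
  <!ELEMENT l         (#PCDATA) >
]>
<!-- Valid against the DTD, but wrong as a canzone. -->
<canzone>
  <aufgesang>
    <stollen>  <!-- 1 line -->
      <l>unter den linden an der heide</l>
    </stollen>
    <stollen>  <!-- 3 lines: not equal to the first Stollen -->
      <l>da unser zweier bette was</l>
      <l>da mugt ir vinden schone beide</l>
      <l>gebrochen bluomen unde gras</l>
    </stollen>
  </aufgesang>
  <abgesang>   <!-- 1 line: not longer than a Stollen -->
    <l>tandaradei</l>
  </abgesang>
</canzone>
```

A validating parser accepts this document, because each content model only says "one or more l" and cannot compare counts across sibling elements.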
III.9. Removing non-terminals
<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines "l+" >
<!ELEMENT canzone (%aufgesang;, abgesang) >
<!ELEMENT stollen (%lines;) >
<!ELEMENT abgesang (%lines;) >
<!ELEMENT l (#PCDATA) >
This allows the DTD to record our understanding.
But can anyone use that understanding?
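For comparison, here is a sketch of the declarations a parser effectively sees once the parameter entities are expanded. It places them in an internal subset, with a shortened instance, so that it stands alone. The entity-based form itself would have to live in an external DTD, since XML 1.0 does not allow parameter-entity references inside declarations in the internal subset.

```xml
<?xml version="1.0"?>
<!DOCTYPE canzone [
  <!-- %aufgesang; has been replaced by "stollen, stollen" and
       %lines; by "l+"; no aufgesang element type remains. -->
  <!ELEMENT canzone  (stollen, stollen, abgesang) >
  <!ELEMENT stollen  (l+) >
  <!ELEMENT abgesang (l+) >
  <!ELEMENT l        (#PCDATA) >
]>
<canzone>
  <stollen><l>unter den linden an der heide</l></stollen>
  <stollen><l>da mugt ir vinden schone beide</l></stollen>
  <abgesang>
    <l>kuste er mich? wol tusentstunt</l>
    <l>tandaradei</l>
  </abgesang>
</canzone>
```

The grouping of the two Stollen into an Aufgesang survives only in the DTD source. Nothing in the expanded declarations or in the instance records it.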
III.10. The canzone minus explicit Aufgesang
<canzone>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
IV. XML Schema and DTD functionality
- the idea of document grammars
- DTDs as document grammars (example)
- XML Schema: replicating DTDs
- XML Schema: types
- another example
- design issues and research questions
IV.2. XML Schema
- DTD++ (inheritance, real data types)
- DTD-- (no entities)
- instance syntax
- supporting programming-language and database-oriented types
- design problems
IV.3. The canzone schema v.1
In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a schema element:
<xsd:schema xmlns:xsd =
"http://www.w3.org/2001/XMLSchema"
>
<!--* element and type declarations
* go here ... *-->
</xsd:schema>
N.B. the schema does not identify
a document-root element / start symbol.
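One consequence can be sketched with an instance document. Assuming the finished version 1 schema declares every element at the top level, as DTD-style declarations do, a schema-valid document may start at any of them, for example at abgesang. The file name canzone.xsd is hypothetical.

```xml
<?xml version="1.0"?>
<!-- canzone.xsd is a hypothetical name for the version 1 schema. -->
<abgesang xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:noNamespaceSchemaLocation="canzone.xsd">
  <l>kuste er mich? wol tusentstunt</l>
  <l>tandaradei</l>
  <l>seht wie rot mir ist der munt</l>
</abgesang>
```

With a DTD, by contrast, the document type declaration in the instance names the expected root element.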
IV.4. Declaring elements
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
IV.5. Declaring elements
- Note the difference between an element declaration (outer)
and an element reference (inner).
- Occurrence information is implicit here: minOccurs and maxOccurs both default to 1.
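As a sketch of what the defaults mean, the aufgesang declaration can be written with the occurrence attributes spelled out. The stollen declaration here is only a placeholder so that the fragment is a complete schema.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="aufgesang">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Same meaning as the two bare references: each occurs exactly once. -->
        <xsd:element ref="stollen" minOccurs="1" maxOccurs="1"/>
        <xsd:element ref="stollen" minOccurs="1" maxOccurs="1"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <!-- Placeholder so that the reference resolves; not the real declaration. -->
  <xsd:element name="stollen" type="xsd:string"/>
</xsd:schema>
```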
IV.6. Repeated elements
<xsd:element name="abgesang">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="stollen">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
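The repetition can also be attached to the element reference instead of the enclosing sequence, which reads more like the DTD's l+. The following sketch accepts the same content as the stollen declaration above. The declaration of l is a simple stand-in so that the fragment is a complete schema.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="stollen">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Occurrence bounds on the element reference rather than on the sequence. -->
        <xsd:element ref="l" minOccurs="1" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <!-- Stand-in declaration so that the reference resolves. -->
  <xsd:element name="l" type="xsd:string"/>
</xsd:schema>
```

With a single-element sequence the two placements are equivalent. They differ once the sequence holds more than one particle, because bounds on the sequence then repeat the whole group.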
IV.7. Character data
<xsd:element name="l">
<xsd:complexType mixed="true">
<xsd:sequence>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
or
<xsd:element name="l" type="xsd:string"/>