In the historical development of markup languages, few innovations
have been more important than the introduction of the notion of
document grammars for constraining documents and defining document
types. Both SGML (ISO 8870) and XML 1.0 define a specialized notation
(the DTD) for defining document grammars; more recently a number of
alternative languages have been proposed. The W3C XML Schema language
replicates the essential functionality of DTDs, and adds a number of
features: the use of XML instance syntax rather than an ad hoc
notation, clear relationships between schemas and namespaces,
a systematic distinction between element types and data types,
and a single-inheritance form of type derivation. In this talk,
Michael Sperberg-McQueen, who serves as co-chair of the W3C XML Schema
Working Group, will discuss the fundamental design issues of
schema languages, and the choices made by the W3C group in defining
XML Schema.
The idea of document grammars
Rectangular:
text ::= record
record ::= CHAR* // or CHAR{80}
Unlabeled tree:
text ::= node*
node ::= node* | CHAR*
Fixed-depth, fixed-label tree:
text ::= page*
page ::= line*
line ::= CHAR*
Models of marked-up text | ← ↑ → |
ODTAO:
text ::= (CHAR | COMMAND)*
This only
looks regular.
Pseudo-grammars:
command ::= '\begin{' NAME '}'
| '\end{' NAME '}'
command ::= ':' NAME '.'?
| ':e'' NAME '.'?
command ::= '<' NAME attributes '>'
| '</'' NAME '>'
Virtual variables:
command ::= '<' CHAR value '>'
Document grammars (DGs) | ← ↑ → |
Origin pragmatic, not theoretical.
Later aligned (partially) with language theory.
Formal specification of validity rules → automated validation.
(Cf. Algol vs. Fortran.)
Distinction between document type definition (DTD)
and “the set of effective formal declarations”.
Document grammars may have several uses:
- in the struggle against dirty data
- as documentation of a contract between data provider
and data consumer
- as documentation of the content of data flows
- as specification of client/server protocols
DTDs as document grammars
SGML/XML DTDs resemble Backus-Naur Form grammars, but:
- They describe bracketed languages* ...
- ... so ‘non-terminals’ are visible*.
- SGML allows inclusion and exclusion exceptions (Rizzi: NP-complete
parsing problem for non-bracketed L).
- They are not purely grammatical (notations, entities).
- Determinism rule.
Limericks and canzone:
poem ::= limerick | canzone
limerick ::= trimeter trimeter dimeter
dimeter trimeter
trimeter ::= CHAR+
dimeter ::= CHAR+
canzone ::= aufgesang abgesang
aufgesang ::= stollen stollen
stollen ::= line+
abgesang ::= line+
Limericks and canzone:
<!ELEMENT poem (limerick | canzone) >
<!ELEMENT limerick (trimeter, trimeter,
dimeter, dimeter,
trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter (#PCDATA)>
<!ELEMENT canzone (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen (l+) >
<!ELEMENT abgesang (l+) >
<!ELEMENT l (#PCDATA) >
<limerick>
<trimeter>
There was a young lady named Bright
</trimeter>
<trimeter>
whose speed was much faster than light.
</trimeter>
<dimeter>She set out one day,</dimeter>
<dimeter>in a relative way,</dimeter>
<trimeter>
and returned on the previous night.
</trimeter>
</limerick>
<canzone>
<aufgesang>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
</aufgesang>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
Note on the canzone DTD | ← ↑ → |
- All the non-terminals show up as tags.
- The two Stollen must have same number of
lines; this rule is not expressed.
- The Abgesang must have more lines than a
Stollen, fewer than Aufgesang; this rule is not expressed.
Removing non-terminals | ← ↑ → |
<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines "l+" >
<!ELEMENT canzone (%aufgesang;, abgesang) >
<!ELEMENT stollen (%lines;) >
<!ELEMENT abgesang (%lines;) >
<!ELEMENT l (#PCDATA) >
The canzone minus explicit Aufgesang | ← ↑ → |
<canzone>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
The canzone minus NTs | ← ↑ → |
<canzone>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</canzone>
Removing all non-terminals | ← ↑ → |
<!ENTITY % stollen "l+" >
<!ENTITY % aufgesang "%stollen;, %stollen;" >
<!ENTITY % abgesang "l+" >
<!ELEMENT canzone (%aufgesang;, %abgesang;) >
<!ELEMENT l (#PCDATA) >
ERROR: this DTD is illegal; why?
XML Schema and DTD functionality
- DTD++, DTD--
- instance syntax
- supporting programming-language and database-oriented
types
- design problems
The canzone schema v.1 | ← ↑ → |
In version 1 of this schema, we imitate the DTD slavishly.
At the outer level is a
schema element:
<xsd:schema>
<!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify
a document-root element / start symbol.
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
- Note difference between element declaration (outer)
and element reference (inner).
- Implicit occurrence information: min = max = 1.
<xsd:element name="abgesang">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="stollen">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="l">
<xsd:complexType mixed="true">
<xsd:sequence>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
or
<xsd:element name="l" type="xsd:string"/>