1. Background
1.1. Some relevant projects
Some corpus linguists will already be aware of SGML and XML,
because they have heard of one or another of the many projects relevant
to the application of SGML and XML to corpus-linguistic work.
The TEI Guidelines are a set of SGML document-type definitions
prepared over several years in the late 1980s and early 1990s by the
Text Encoding Initiative (TEI), an international project sponsored by
the Association for Computers and the Humanities (ACH), the
Association for Literary and Linguistic Computing (ALLC), and the
Association for Computational Linguistics (ACL). They are
maintained and further developed now by the TEI Consortium.
The markup vocabulary specified by the TEI Guidelines is intended to
allow for the encoding of virtually any textual material that might be
of interest for linguistic or textual study of any kind — which
means any textual material at all, of any period, language, or genre.
Corpus linguistics was among the fields best represented
in the working groups of the TEI; Stig Johansson, for
example, who was instrumental in completing the Lancaster-Oslo/Bergen
corpus of English, served as chair of a Working Committee, and Antonio
Zampolli of Pisa served on the TEI Steering Committee. As a result,
the TEI vocabulary contains (along with much else) much that is useful
in corpus work.
The British National Corpus (BNC) began work before the TEI Guidelines
were complete, but is encoded using a variant of the TEI encoding
scheme (Lou Burnard, one of the co-editors of the TEI Guidelines, led
the computing work on the BNC). The BNC consists of 100 million words
of modern British English: 90 million words of written material and 10
million words of spoken material, with careful attention to questions
of balance and elaborate attempts at characterization of each sample
so that subcorpora can be selected to suit special interests.
The Corpus Encoding Standard (CES) is a specialization or profile
of the TEI vocabulary, in which many of the options left to local
preference in the TEI scheme are tightened up in the interests of
greater uniformity of practice in the collection of corpora and
similar language resources.
Another application of SGML (and the TEI) to corpus building is
in the Edinburgh and Chiba map task corpora prepared (respectively) by
Henry Thompson (Edinburgh) and Syun Tutiya (Chiba). Here, SGML
is used to encode spoken language of people engaged in a particular
task; in the Chiba materials, the transcriptions are synchronized
in some detail with videotape recordings of the conversations, to
allow study of gesture, eye contact, and so on.
1.2. SGML and XML
But what is this SGML which these projects apply to corpus
linguistics? And how does it relate to the XML mentioned in my
title?
SGML, or the Standard Generalized Markup Language, is an
international standard defined by ISO 8879:1986. It offers a
non-proprietary means of providing explicit descriptive markup of any
collection of textual features which may be of interest. It does not,
as one might at first expect, achieve this by defining a vocabulary or
set of tags for marking up textual features; it defines nothing of the
sort. Instead, it defines a meta-language by means of which the user
of SGML may define an arbitrary set of tags (or, more precisely, of
element types) for marking up documents of a given sort.
The user decides what element types to define, and the meaning
ascribed to them (what textual features they should describe); the
selection of textual features to be marked up is limited not by the
capabilities of a particular piece of software (as in word processors
or formatting languages), nor by the decisions made in a committee
defining a markup language (as in HTML and most other specific
applications of SGML), but only by the user's interests and the user's
ability to mark up the features of interest (both economic and
intellectual limitations play a role).
The meta-language provided by SGML takes the form of a
document type definition, which contains what might be
called a ‘document grammar’, roughly in the style
of a BNF (Backus/Naur Form) grammar.[2] SGML also
allows the user to declare the coded character set they are using;
this allows SGML to be used on virtually any kind of computer
system.
XML (the Extensible Markup Language) is by contrast not an
International Standard in the narrow sense, although it is fast
becoming a de facto standard. It is defined by a Recommendation
of the World Wide Web Consortium (W3C), first issued in 1998 and
re-issued with corrections from time to time since then. XML is a
subset of SGML, both in the sense that the rules of XML are a subset
of the rules of SGML and in the sense that every well-formed XML
document is a legal SGML document. The subset was designed to be
easier to handle on the World Wide Web (the responsible Working Group
was originally called the “SGML on the Web” Working Group), and
in particular to be easier to parse than SGML. There are fewer
optional features (the most prominent exception being that processing
of the document grammar or DTD is not required of conforming XML
processors), and the character set is constrained to be Unicode
(although non-Unicode character encodings may be used, the repertoire
of characters is defined exclusively with reference to Unicode / ISO
10646). By making XML easier to parse than SGML, the designers hoped
to make it easier to write software for XML than for SGML and thus to
encourage software development. Since 1998, this hope has been
amply fulfilled.
I should pause for a moment to describe the World Wide Web
Consortium, as the body responsible for the XML specification. W3C is
a member-supported organization which creates Web standards (in the
form of Recommendations, which describe recommended practice). Its
mission is “to lead the Web to its full potential”, and W3C is
accordingly engaged in a wide variety of endeavors to ensure that the
World Wide Web is accessible to all, regardless of language,
geography, script, visual impairment, or other disability; to improve
the utility of the Web as a medium not just for human to human but
also for program to program communication; to address social concerns
confronting the Web, in particular the development of a Web of Trust
which makes the Web a more successful collaborative environment; to
improve the interoperability of Web-based software and the
evolvability of the technical design of the Web; to encourage
decentralization in the Web; and to provide better standards-based
multimedia formats.
2. Markup and markup languages
I have talked about some corpus-related projects using SGML and
XML, and I have described what SGML and XML are. But what is the
markup to which their names allude?
Markup (the term comes from the traditional printer's term for the
markings in a manuscript which tell the typesetter how to style the
document) is a system of marks (tags) appearing in the
electronic form of a document, which serves to provide a more explicit
representation of textual phenomena not adequately captured by a
simple sequence of characters. Informally, users of markup often
speak as if the purpose of markup is to add information to a
transcription of a text; this is a convenient manner of speaking, but
an oversimplification: quite frequently the information made explicit
by the markup is already notionally part of the text, and is not so
much added to the text by the markup as
exhibited by it.
The nature and utility of markup may be exhibited best by
some simple examples.
Consider first a very simple transcription which uses no explicit
markup. This is the beginning of a transcription of an English
translation of Ibsen's
Peer Gynt, as distributed by the
Oxford Text Archive.[3]
1875
PEER GYNT
by Henrik Ibsen
THE CHARACTERS
ASE, a peasant's widow.
PEER GYNT, her son.
TWO OLD WOMEN with corn-sacks. ASLAK, a smith. WEDDING-GUESTS. A
MASTER-COOK, A FIDDLER, etc.
A MAN AND WIFE, newcomers to the district.
SOLVEIG and LITTLE HELGA, their daughters.
THE FARMER AT HEGSTAD.
INGRID, his daughter.
THE BRIDEGROOM and His PARENTS.
THREE SAETER-GIRLS. A GREEN-CLAD WOMAN.
THE OLD MAN OF THE DOVRE.
A TROLL-COURTIER. SEVERAL OTHERS. TROLL-MAIDENS and TROLL-URCHINS. A
COUPLE OF WITCHES. BROWNIES, NIXIES, GNOMES, etc.
AN UGLY BRAT. A VOICE IN THE DARKNESS. BIRD-CRIES.
KARI, a cottar's wife.
Master COTTON, Monsieur BALLON, Herren VON EBERKOPF and
TRUMPETERSTRALE, gentlemen on their travels. A THIEF and A RECEIVER.
ANITRA, daughter of a Bedouin chief.
ARABS, FEMALE SLAVES, DANCING-GIRLS, etc.
THE MEMNON-STATUE (singing). THE SPHINX AT GIZEH (muta persona).
PROFESSOR BEGRIFFENFELDT, Dr. Phil., director of the madhouse at
Cairo.
HUHU, a language-reformer from the coast of Malabar. HUSSEIN, an
eastern Minister. A FELLAH, with a royal mummy.
SEVERAL MADMEN, with their KEEPERS.
A NORWEGIAN SKIPPER and HIS CREW. A STRANGE PASSENGER.
A PASTOR. A FUNERAL-PARTY. A PARISH-OFFICER. A BUTTON-MOULDER. A
LEAN PERSON.
The action, which opens in the beginning of the nineteenth
century, and ends around the 1860's, takes place partly in
Gudbrandsdalen, and on the mountains around it, partly on the coast
of Morocco, in the desert of Sahara, in a madhouse at Cairo, at sea,
etc.
ACT FIRST
SCENE FIRST
[A wooded hillside near ASE's farm. A river rushes down the slope.
On the further side of it an old mill shed. It is a hot day in
summer.]
[PEER GYNT, a strongly-built youth of twenty, comes down the
pathway. His mother, ASE, a small, slightly built woman, follows
him, scolding angrily.]
ASE
Peer, you're lying!
PEER [without stopping].
No, I am not!
ASE
Well then, swear that it is true!
PEER
Swear? Why should I?
ASE
See, you dare not!
It's a lie from first to last.
The basic approach has been to render the text in ASCII characters
approximately as it might be rendered by someone using a typewriter.
The words of the text are mostly there, although no distinction is made
between the words of Ibsen's text and other material (e.g. the date,
title, and authorship attribution) transcribed from the exemplar.
Also, it is clear that not everything in the exemplar has been
transcribed. The date, title, and author are given, but they are not
distinguished from the rest of the transcription in such a way as to
make it easy for an automatic process (as opposed to a human) to
identify them. The date given (1875) is cryptic: it is not the date
of original composition (the poem was written and first published
in 1867), nor that of first performance as a play (1876), nor that
of the original publication of this edition (apparently 1892).
1875 may perhaps be intended as the date at which the adaptation
for the stage was made.
The restriction to ASCII characters means that the name of Peer's
mother Åse is mistranscribed as “Ase”; italics in the
original are not signaled, and footnotes have been lost. Speeches
are visually set off from each other, but the user of the transcription
will need to write a special-purpose parser in order to be able to
tell, when processing the transcription, who speaks which words of
dialog, and to distinguish dialog from stage directions and
speech attributions.
In sum, this first example captures the lexical material of the
source text, but does not capture the textual structure or relevant
meta-textual information in any form suitable for machine processing.
Some transcriptions have more information. Older electronic
material, in particular, is apt to be presented in forms like
this fragment of Walther von der Vogelweide[4]:
|s001
|l001 ich sâz ûf eime steine
|l002 und dâhte bein mit beine
|l003 dar ûf sazt ich den ellenbogen
or this representation of the opening lines of
Beowulf:
|b001
|l001a Hw*a/et, we GAR-DENA
|l001b in geardagum t*rym gefrunnon
|l002a Hu t*a *a/ed*elingas
|l002b ellen fremedon
Stanzas (|s),
books (|b), and
lines of verse (|l) are
explicitly marked, and numbers are given as
identifiers. In the Beowulf fragment,
the presence of asterisks and slashes marks
the spelling of characters not present in the
character set of a card-punch: wynn (w*),
aesc (a/e),
thorn (t*),
edh (d*).
By means of these simple conventions, it becomes possible
for concordances to provide line numbers in the lists
of word occurrences, and for a typesetter or suitable
computer printer to provide an appropriate representation
of the special characters.
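To anticipate the fuller discussion of XML below: the same
information can be recorded with explicit elements and attributes
(the element and attribute names in the following sketch are merely
illustrative, not taken from any particular vocabulary), and in a
Unicode encoding the special characters of the Beowulf fragment
could likewise be written directly, without asterisk and slash
conventions:
<stanza n="s001">
  <l n="l001">ich sâz ûf eime steine</l>
  <l n="l002">und dâhte bein mit beine</l>
  <l n="l003">dar ûf sazt ich den ellenbogen</l>
</stanza>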
Sometimes, markup is used for linguistic annotation.
In cases like this, the frequently heard characterization of
markup as information
added to a text becomes
more and more problematic. Is linguistic annotation
to be described as the addition of extra-textual information?
Or only as an explicit representation of information
intrinsic to the text quâ text and left
implicit by conventional orthographic transcription?
Consider this fragment of a sample from the BNC, with simple
phrase-structure tagging.
S0CF6003 v
[S [N TROUBLED_JJ [ morning_NNT1 television_NN1 ]
station_NN1 GMTV_NP1 N] finally_RR [V had_VHD
[N something_PN1 [Ti to_TO smile_VVI [P about_II
P]Ti]N][Nr last_MD night_NNT1 [Fr when_RRQ [N
it_PPH1 N][V was_VBDZ revealed_VVN [Fn[N it_PPH1
N][V gained_VVD [N an_AT1 extra_JJ million_NNO
viewers_NN2 N][P over_II [N the_AT last_MD two_MC
weeks_NNT2 N]P]V]Fn]V]Fr]Nr]V] ._YSTP S]
Fortunately, it is possible to build working systems
of markup while remaining agnostic as to whether
linguistic information of this kind, or any information
of any kind, is part of the text or not.
While the distinction between textual and extra-textual
features has great potential interest for all of us who
care about texts and language, it has no formal
importance. Formally, the crucial opposition is that
between content and
markup.
A markup language does four things:
- defines a vocabulary to use in markup.
- specifies how markup constructs can occur
(containment, sequence, ...), providing a contract
between data sources and data sinks.
- tells how to distinguish markup from content.
- indicates what the constructs of its vocabulary mean
and how they are to be used.
A simple example may make this more concrete. Following
the tradition of many computer languages, we use a trivial
‘hello, world’ example which illustrates
some of the most important features of XML:
<!DOCTYPE greetings [
<!ELEMENT greetings (hello+) >
<!ELEMENT hello (#PCDATA) >
<!ATTLIST hello
lang CDATA #IMPLIED >
<!ENTITY szlig "&#223;" >
<!ENTITY uuml "&#252;" >
]>
<greetings>
<hello lang="en">Hello, world!</hello>
<hello lang="fr">Bon jour, tout le monde!</hello>
<hello lang="no">Goddag!</hello>
<hello lang="de">Guten Tag!</hello>
<hello lang="de-franken">Grüß Gott!</hello>
</greetings>
The document is divided into two parts: first the
document type declaration, which runs from the
first line (“<!DOCTYPE greetings [”)
through the eighth (“]>”), and then the
document instance, which runs from the ninth
line (“<greetings>”)
through the fifteenth (“</greetings>”).
Let us consider the document instance first. It is encoded
using a fairly simple form of labeled bracketing. The document
is divided into structural units known as elements;
each element is delimited by a start-tag at its
beginning and an end-tag at its end, each
giving the element type of the element. Here,
the top-level element (there is always exactly one outermost
element in any well-formed XML document) is of type
greetings, and it has five child elements,
each of type hello, each containing a
greeting. The final hello element uses two
entity references (the strings “&uuml;”
and “&szlig;”) to represent characters
not conveniently accessible from my US-oriented keyboard.
The start-tag can also contain a list of attribute-value
pairs showing attributes of the element type; in this
example, each hello element is labeled
with a code for its language.
The document type declaration contains declarations for
the two element types; the element declarations may be
thought of as productions in a regular right-part grammar
defining the set of valid documents. The declaration
<!ELEMENT greetings (hello+) >
indicates that valid elements of type
greetings
contain one or more elements of type
hello.
The declaration
<!ELEMENT hello (#PCDATA) >
indicates that
hello elements contain
character data.
Attributes may also be declared: the declaration
<!ATTLIST hello
lang CDATA #IMPLIED >
indicates that elements of type
hello may but need
not carry a
lang attribute, which in turn contains
a string of character data.[5]
The entities referred to in the document are also declared.
Since entities are simply named strings of characters, the
declaration for an entity consists, in the simple case, just
of the entity name and its replacement value:
<!ENTITY szlig "&#223;" >
<!ENTITY uuml "&#252;" >
Here, the name
szlig is defined for the string
consisting of the single character whose decimal number in
Unicode is 223 (U+00DF, in the usual Unicode notation), and
the name
uuml is assigned a replacement string
consisting of the character U+00FC.[6]
A markup language, I said earlier, does four things.
First, it defines a vocabulary for use in markup. Here,
the vocabulary consists of the element types greetings
and hello, and the attribute lang on
the hello element. Second, it specifies how
markup can occur; here, the rule is that hello elements
occur only within a greetings element, and that
any number of them may occur in the same greetings
element. Third, it tells us how to distinguish markup from
content. Here, the rules for distinguishing markup and content
are given by XML: markup consists of start- and end-tags
(delimited by angle brackets), entity references (delimited
by ampersand and semicolon), and some other constructs not
shown here.
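The most important of those other constructs can at least be
glimpsed in passing. The following free-standing fragment (it is
not part of the greetings example, and the stylesheet name is
invented) shows a comment, a processing instruction, a numeric
character reference, and a CDATA section; all of these are
recognized as markup by an XML parser, although the content of the
CDATA section, like the character addressed by the character
reference, is ordinary character data:
<!-- comments are addressed to human readers, not to applications -->
<?xml-stylesheet type="text/xsl" href="greetings.xsl"?>
<hello lang="de">Gr&#252;&#223; Gott!</hello>
<note><![CDATA[Inside a CDATA section, < and & lose their special meaning.]]></note>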
Fourth, a markup language specifies a meaning for the constructs it
defines. Notice that no formal declarations occur in our example to
identify the meaning or expected usage of the elements or attributes.
XML does not provide any particular set of semantic primitives in
terms of which such semantic or pragmatic information could be given.
This is frequently regarded as a weakness of XML, but in fact it is
XML's semantic agnosticism which ensures that XML can be used for so
many different types of information. The meanings to be conveyed by
an XML vocabulary are limited only by the ingenuity of the vocabulary
designer; their translation into operational semantics is limited only
by the wit of those who write programs to process the vocabulary.
Vocabularies intended for serious work should be, and typically are,
documented by conventional prose documentation. The TEI
Guidelines, for example, provide hundreds of pages of
prose documentation for the TEI vocabulary; documentation similar in
essence, albeit frequently less extensive, is provided for other
widely used vocabularies like DocBook and HTML. For throw-away
vocabularies and toy illustrations like the one above, on the other
hand, the documentation of the vocabulary's meaning may be limited to
the use of suggestive natural-language names for elements and
attributes, and the provision of some examples.
3. The XML Landscape
Since SGML and XML define neither a specific set of element
types and attributes nor a specific set of primitive semantic
notions, they may best be regarded (despite their names) not as
markup languages but as meta-languages for the
definition of markup languages.
A particular markup language defined using XML is conventionally
referred to as an application of XML. It is through
particular applications that most users will encounter XML. XHTML,
the XML-based formulation of HTML, may be the best known application
of XML. Corpus linguists may be familiar with, or at least have heard
of, the Text Encoding Initiative Guidelines and the
Corpus Encoding Standard which applies them to corpus-building. Those
interested in graphics will know the W3C's Scalable Vector Graphics
language SVG, which provides compact scalable graphics. Another type
of application is represented by XSLT (Extensible Stylesheet Language:
Transformations), a W3C Recommendation defining an XML-based
functional programming language for transformations of XML data into
XML and other forms; it is very widely used for all kinds of data
processing, including the translation of data from other XML formats
into (X)HTML.
When people speak of the increasing complexity of using XML, they
sometimes have in mind the proliferation of such specialized
applications of XML. There have long since been too many to keep
track of, and more are developed all the time. But since one of the
main goals of XML was precisely to make it easier for users and
communities of users to define the markup languages they need, there
is no point in bemoaning the fact that many users and communities have
seized the opportunity to do just that.[7] The use of XML to define
specialized vocabularies was part of the plan from the beginning.
In other ways, however, the XML landscape is much more complex
today than was foreseen when work on XML started in 1996. The
original plan of the W3C's project for “SGML on the Web”
was to provide Web-friendly simplifications of the three major
specifications being developed by the ISO / IEC working group
responsible for document processing standards:[8]
- XML would be a simplification or subset of SGML.
- XLink would be a subset of HyTime, the Hypermedia / Time-based
Structuring Language for hypermedia architectures, defined by ISO 10744.
- XSL (Extensible Stylesheet Language) would be a subset or
simplification of DSSSL, the Document Style Semantics and Specification
Language for defining stylesheets to guide the formatting
and presentation of marked-up documents, defined by
ISO 10179.
The developers of XML wanted something like SGML on the Web, because
we were accustomed to the power of defining our own markup languages,
suited fairly precisely to the material we were encoding and the kinds
of processing we wished to perform; the limitation of the World Wide
Web to the single markup language HTML was felt to be unsustainably
constraining. Many people familiar with HyTime and other hypertext
systems felt similarly constrained by the very limited hypertext
mechanisms built into HTML; it was thought highly desirable to provide
more powerful mechanisms, of the kind most other hypertext systems
offered. As the sometime chair of the XLink Working Group (Bill Smith,
of Sun Microsystems) used to say, the goal of XLink was to bring
hypertext on the Web forward, into the 1960s. And, of course, in
order to process and display documents using generic markup, it is
essential to have some kind of stylesheet language.
In the event, all three of these goals were met, although
XLink and XSL were eventually split into several specifications.
But the most striking change in the plan was the sheer number of
other specifications which have been developed which were not
part of the original conception:
- XML
- Namespaces in XML
- XML Information Set
- XLink
- XPointer
- XSLT (XSL Transformations)
- XSL Formatting Objects
- XPath
- XML Schema
- XML Query
- Document Object Model
- SOAP (Simple Object Access Protocol)
- WSDL (Web Services Description Language)
This is by no means an exhaustive list even of W3C specifications
relevant to the XML infrastructure, not to speak of work done
outside the W3C framework.
It would exceed the patience even of the most well-behaved reader
to attempt to give a clear idea of the technical content of all of
these specifications. Instead, in the remainder of this paper I will
limit myself to a brief, schematic introduction of two of the
technologies I believe are most salient to the work of corpus
linguists, first XML Schema and second the complex of specifications
surrounding XPath.[9]
4. Document grammar
One of the key innovations of SGML, which sets it apart from most
other families of markup languages, is the notion that markup can be
validated against a formal definition of its structure.
As illustrated above, an SGML or XML document can be accompanied by a
document type declaration, in which elements and
attributes may be declared, and which provides a partial formal
expression of the
document type definition, which is
defined as the rules governing the application of the vocabulary to documents
of a particular type. Only part of the document type definition is
captured formally, for the simple reason that in the current state of
knowledge there are no tools for the formal expression of arbitrary
semantics which are as convenient and as precise as the well known
tools for the formal expression of syntax.[10]
The element declarations in a document type declaration provide
what we may call a document grammar, which generates
a set of documents in the same way that a grammar for a natural
language generates a set of sentences or discourses. The process of
checking a document instance against the declarations is called
validation.
Although its connections to automata theory give it a rather
academic air, validation was introduced in SGML not for theoretical
reasons, but for purely pragmatic reasons: errors in input data can
cause costly delays in document processing and typesetting, and
validation was introduced as a method of reducing such errors by
detecting (some of) them mechanically. It was only after the fact
that the validation rules of SGML were partially aligned with
principles of conventional automata theory.[11]
The formal specification of document validity by means of a DTD
means that whole classes of encoding errors can be detected
by automated processes, without requiring human proofreading; it
is thus a step forward in markup languages roughly analogous
to the introduction of Backus/Naur Form grammars in the specification
of Algol 60, and for much the same reasons.
Document grammars, in whatever notation they are expressed,
may have several uses:
- They assist in the perpetual struggle against dirty data.
- They can serve as documentation of a contract between data providers
and data consumers.
- They can specify the content of particular data flows
within a complex system.
- They can also be used to specify client/server protocols.
The notation for document grammars provided by SGML and XML
is conventionally referred to as
DTD notation;
in that notation, element declarations resemble individual
productions in a Backus/Naur Form grammar, but there are a
number of differences which should be kept in mind.
First of all, the start-tag of the element makes explicit
which production rule to use; certain kinds of ambiguity and
parsing difficulty thus melt away. As a parsing theorist would
say, the language being described is a bracketed language
(as defined by [Ginsburg/Harrison 1967]),
because each non-terminal has a distinctive start and end
string. Second, the right-hand side of each production consists of
a regular expression, so that technically speaking we have
a kind of regular-right-part grammar, not a BNF grammar.
Third, DTDs are not purely grammatical; as the example
shown above illustrates, they also include declarations for
entities, which have no grammatical function. And finally,
the content model of an element declaration is restricted, both
in SGML and in XML DTDs, to deterministic expressions.
XML Schema is an XML-based language for writing document grammars;
it is intended to make the use of DTDs for grammatical purposes
unnecessary. (For a variety of reasons, XML Schema does not replace
DTDs entirely: XML Schema has no provision for entity declarations,
so DTDs will continue to be used for that purpose.)
Let us consider a simple example of a document grammar, which
will allow us to illustrate some of the capabilities and limitations
both of DTDs and of XML Schema. The grammar describes two kinds
of poem: limericks, and canzone. A poem is one or the
other of these.
poem ::= limerick | canzone
A limerick consists of two lines of trimeter, two of dimeter,
and a final trimeter. Each of these is just (for the purposes
of this grammar) a sequence of characters.
limerick ::= trimeter trimeter dimeter
dimeter trimeter
trimeter ::= CHAR+
dimeter ::= CHAR+
For the internal structure of the canzone, we will use the
terminology developed by the Meistersänger (since that
is how I learned it): a canzone consists of two parts, an
Aufgesang and an
Abgesang; the
Aufgesang consists of two
Stollen.
The
Stollen and the
Abgesang in
turn consist of lines.[12]
canzone ::= aufgesang abgesang
aufgesang ::= stollen stollen
stollen ::= line+
abgesang ::= line+
A translation of the grammar into DTD form makes the
formal similarity clear:
<!ELEMENT poem (limerick | canzone) >
<!ELEMENT limerick (trimeter, trimeter,
dimeter, dimeter,
trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter (#PCDATA)>
<!ELEMENT canzone (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen (l+) >
<!ELEMENT abgesang (l+) >
<!ELEMENT l (#PCDATA) >
Conforming to the grammar, we can encode poems, both limericks:
<limerick>
<trimeter>
There was a young lady named Bright
</trimeter>
<trimeter>
whose speed was much faster than light.
</trimeter>
<dimeter>She set out one day,</dimeter>
<dimeter>in a relative way,</dimeter>
<trimeter>
and returned on the previous night.
</trimeter>
</limerick>
and canzone:
<canzone>
<aufgesang>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
</aufgesang>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
Note that in the DTD translation of the grammar, each non-terminal
symbol appears as an element type; the result is that instead of
tagging each line of verse the same way, we must use three different
element types, depending on context. Note also that it is a rule of
canzone form (at least, as practiced in Minnesang) that the two
Stollen must have the same number of lines, and that the
Abgesang must have more lines than a
Stollen, but fewer than the Aufgesang.
This rule is not expressed in this grammar.
The mechanical identification of grammatical non-terminals with
element types has given us a rather ponderous style of markup;
some users of XML would prefer not to tag the
Aufgesang
explicitly, thus:
<canzone>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
We can modify the DTD to accomplish this by translating
the non-terminal
aufgesang not into an element
type but into a parameter entity.[13]
<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines "l+" >
<!ELEMENT canzone (%aufgesang;, abgesang) >
<!ELEMENT stollen (%lines;) >
<!ELEMENT abgesang (%lines;) >
<!ELEMENT l (#PCDATA) >
Indeed, we could go further and remove almost all the
non-terminals, leaving only the elements for the poem
as a whole and for the individual verse lines, so that
the poem might be encoded thus:
<canzone>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</canzone>
As a first approximation of this change, we might write
the document grammar thus:
<!ENTITY % stollen "l+" >
<!ENTITY % aufgesang "%stollen;, %stollen;" >
<!ENTITY % abgesang "l+" >
<!ELEMENT canzone (%aufgesang;, %abgesang;) >
<!ELEMENT l (#PCDATA) >
This DTD, however, is illegal, because it is ambiguous
and hence non-deterministic: after expansion of the
parameter-entity references, it amounts to this:
<!ELEMENT canzone (l+, l+, l+) >
<!ELEMENT l (#PCDATA) >
It is not clear which occurrence of “l+”
any given line of the poem should be assigned to, except
that the first line must obviously be attributed to the
first “l+”, and the last line of the poem
to the third. The obvious revision of the definition of
canzone alleviates this problem and makes
a legal DTD:
<!ELEMENT canzone (l+) >
<!ELEMENT l (#PCDATA) >
This grammar has lost all of the internal structure of
the poem; whether that is an advantage or disadvantage
will vary with one's purpose in creating the encoding.
5. XML Schema
One of the principal disadvantages of DTD notation is that
while a DTD obviously contains structured information,
that structure is not exposed in the way recommended by
XML (that is, using markup), but is instead captured only
in an ad hoc syntax. This makes reuse of the information
harder, since any use of the information in a DTD requires
the creation of a DTD parser. In addition, the
purely grammatical orientation of DTDs fails to capture
a number of concepts important for modeling information in
the general case. Support for datatypes of the kind
conventional in programming languages and database management
systems is missing, as is type inheritance.
XML Schema was developed with the goal of addressing these
perceived problems. It uses XML document-instance syntax. It offers
much the same basic functionality as DTDs: both are notations for
document grammars. But XML Schema has more than DTDs, in some ways
(datatypes, type inheritance, etc.), and less in other ways (no entity
declarations).
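A small taste of the datatype machinery may be given here; the
declarations below (which would appear inside the xsd:schema
element introduced in a moment) show how the lang attribute of the
earlier hello example could be typed with the built-in xsd:language
type, or with a user-derived type constrained by a regular
expression (the type name langcode and the particular pattern are
merely illustrative):
<xsd:attribute name="lang" type="xsd:language"/>
<xsd:simpleType name="langcode">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="[a-z]{2}(-[a-z0-9]+)*"/>
  </xsd:restriction>
</xsd:simpleType>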
There is not space here for a serious introduction to XML Schema;
we will have to content ourselves with a slavish imitation of the
first version of the DTD.
At the outer level is a
schema element in the
XML Schema namespace:
<xsd:schema
xmlns:xsd ="http://www.w3.org/2001/XMLSchema" >
<!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify
a document-root element / start symbol. In this,
XML Schema appears to differ from the DTDs of XML and
SGML, where the
<!DOCTYPE ... >
declaration is used to give the type (generic identifier) of the
document's root element. But in practice, the
<!DOCTYPE ... > declaration occurs
in the document instance; in markup vocabularies intended
for serious use, the free-standing files containing the
declarations of elements and attributes do
not
indicate which element type is required to be the
root element.
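A minimal sketch may make the point concrete. If the declarations
given earlier for poems were stored in a free-standing file (here
given the invented name poems.dtd), each document instance would
name its own root element in its document type declaration:
<?xml version="1.0"?>
<!DOCTYPE canzone SYSTEM "poems.dtd">
<canzone>
  <!--* aufgesang and abgesang as shown earlier *-->
</canzone>
Another instance could equally well declare limerick or poem as its
root element; the choice is made by the instance, not by the file
of declarations.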
Following the simple DTD, we can declare the
elements
canzone and
aufgesang
as containing sequences of specified children:
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Note that there are two distinct uses of the xsd:element
element. Some are element declarations (the outer ones),
while others are element references (the inner ones).
Implicitly, each element reference matches a single occurrence of
the element: the minimum and maximum number of occurrences is one.
The
abgesang and
stollen elements
require that we write a content model which matches one or more
l elements; this can be done by specifying explicit
values for the attributes
minOccurs and
maxOccurs:
<xsd:element name="abgesang">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="stollen">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
The
l element itself needs to allow for mixed
content; the usual idiom is this:
<xsd:element name="l">
<xsd:complexType mixed="true">
<xsd:sequence>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
or this:
<xsd:element name="l" type="xsd:string"/>
Of these, the first is preferable for natural-language material,
as it is more easily adjusted
when (not if) it
becomes necessary to allow child elements within a line of
verse (e.g. for quotations, or for emphasized words).
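When that adjustment becomes necessary, the empty sequence is
simply replaced; the following sketch (the emph element is invented
purely for illustration) allows any mixture of character data and
emphasized phrases within a line of verse:
<xsd:element name="l">
  <xsd:complexType mixed="true">
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element ref="emph"/>
    </xsd:choice>
  </xsd:complexType>
</xsd:element>
<xsd:element name="emph" type="xsd:string"/>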
The foregoing provides just a glimpse of XML Schema, but
I hope it suffices to make clear the utility of XML Schema
for laying out, concisely and formally, the structure of
a markup vocabulary, and for expressing some of the more
obvious kinds of structural integrity constraints we
may wish to guarantee to users of our data collections.
6. XML Query, XSLT, and XPath
In the long run, however, purely declarative expressions
of structure provide only so much satisfaction. One of the
reasons we create large collections of language data is so
that we can use it to study the language. That means that
we need to be able to find, in the mass of material
collected, examples of linguistic phenomena relevant to
a particular question, and then to manipulate those
examples conveniently.
For purposes of such search, retrieval, and manipulation,
the specifications of XPath (1.0 and 2.0), XSLT (1.0 and 2.0)
and XQuery (1.0) are all of particular interest. In theory,
at least, the core functions of these specifications are
quite distinct. XPath provides a language for identifying (or,
as some say, addressing) particular elements and
attributes in an XML document, while XSLT is designed for use
in document formatting or rendering systems, and XQuery
is designed to provide, for XML-encoded data, data manipulation
functionality analogous to that provided for relational
data by SQL, the Structured Query Language. In practice,
however, the three specifications are intimately related:
XPath 2.0 is used as an expression language by both XQuery 1.0 and
XSLT 2.0, and all three specifications use a common
data model and are based on the same formal semantics.
All three languages can be used for querying XML data
(although XSLT and XQuery both allow for manipulation and
elaboration of the data, whereas systems which use XPath as
their query language generally confine themselves to
presenting the results, without modification), and both
XSLT and XQuery can be used for manipulation of XML tree
structures.
The reader may be surprised that a language designed for document
formatting should have powerful declarative mechanisms for
manipulation of tree structures; it may be worthwhile to digress for a
moment to explain how this comes to be so. As mentioned earlier, XSL
(the Extensible Stylesheet Language) is designed as a Web-oriented
analog to the international standard DSSSL. Both DSSSL and XSL are
intended to support routine tasks in document formatting. They thus
must provide facilities for styling blocks of text, setting it in a
particular font family (e.g. Times Roman, or Helvetica, or Lucida),
with a particular font treatment (italic, bold, demi-bold, etc.), on a
particular measure, in a particular color, and so on. In an SGML or
XML context, this frequently takes the form of associating a
particular set of formatting properties with elements of the document;
a number of stylesheet languages work this way, from W3C's Cascading
Stylesheets (CSS), which was originally designed specifically for HTML
but was early on extended to work with arbitrary markup vocabularies,
to a number of proprietary stylesheet languages used by SGML editing
and formatting systems.
The satisfactory layout of all but the simplest documents, however,
requires more than the association of formatting properties with
elements in the input document. Some information in the document must
be duplicated to appear twice, e.g. the titles of sections, which must
appear both at the head of the section and in the table of contents.
Some must appear an unknown number of times in the output, e.g. the
left and right running heads of chapters, which must be replicated to
appear once on each page opening in the output. Some must be moved
from one location to another (e.g. notes, which are conventionally
stored in the source document at the logical point of attachment to
the main text, but which must be moved either to the bottom of the
page, or to the end of the chapter, or to the end of the book, for
conventional print publication). Some information must be added at
formatting time: page numbers, and quite often headings like
“Chapter VII”. Not infrequently, some information in the input
document must be suppressed; metadata about the revision history
of the electronic document, or authorial notes to the copy editor,
may appear in working drafts but not in final copy. And so on.
All of these tasks require that the tree of structural units to which
formatting properties are to be attached be different from the
tree structure of the input document. In some cases, the difference
in tree structure is minor; in other cases, it is profound.
The consequence of these requirements is that both DSSSL and XSL
specify both a set of formatting objects with rendering properties
and a notation for tree transformations of arbitrary
complexity. In both cases, but more especially in the case of XSL,
the tree transformation part of the system has been adopted by many
users not only for use in formatting systems but as a general purpose
tool for processing XML-encoded data. After all, if the general
pattern of data processing is to accept data as input, perform various
automatic transformations upon it, and write the results out as
output, then whenever one's data is in XML it is convenient to express
the required manipulation in a language which understands XML
structures natively.[14]
Viewed as a generic tree-transformation tool, XSLT has a number of
salient properties. First and foremost, XSLT stylesheets are
themselves written in XML. This proves a stumbling block for some
users, but is regarded by many, including me, as a key advantage of
XSLT over many conventional programming languages. Because XSLT is
written in XML, XSLT transformations can be used to process XSLT
stylesheets as input, or to produce them as output. Relatively few
conventional programming languages make such second-level processing
convenient.
The logic of an XSLT stylesheet can be driven either by the
structure of the output, or by the structure of the input, or by
a mixture of the two. The input-driven style is particularly
important for those working with natural-language material and
human-readable documents, since their structure is typically much
more elaborate, and much less regular, than that of data
conventionally stored in databases.
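A minimal sketch of the input-driven style, using the canzone
markup introduced above, consists of little more than a handful of
template rules; the HTML element names in the output are, of
course, only one possible rendering:
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!--* each canzone in the input becomes an HTML div;
      the processor visits whatever the input contains *-->
  <xsl:template match="canzone">
    <div class="canzone">
      <xsl:apply-templates/>
    </div>
  </xsl:template>
  <!--* each line of verse becomes a paragraph *-->
  <xsl:template match="l">
    <p class="verse-line">
      <xsl:apply-templates/>
    </p>
  </xsl:template>
</xsl:stylesheet>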
Like XML itself, XSLT is designed to have a declarative
semantics, and it falls squarely within the family of functional
programming languages.[15] The
ability to call named templates with parameters makes
XSLT Turing-complete, so that in theory its expressive power
is the same as that of any other programming language.
XSLT 1.0 is effectively an untyped language (it does have a simple
type system, but only four types), but XSLT 2.0 adopts
the basic types defined by XML Schema 1.0.
XQuery has been developed as an industrial-strength query language
for XML data, with much of the work being carried by major vendors of
SQL database systems. Decades of experience on problems of indexing,
query rewriting and optimization, and schema validation for relational
data are being applied systematically to XML data, with results that
promise to be dramatic for all those with large volumes of XML data to
store, search, and manipulate. The collaboration between the W3C XML
Query and XSL Working Groups has produced an explicit data model, a
formal semantics suitable for research on query optimization, and
a well developed system of static typing, to allow creators of
queries to know in advance that their queries are type-safe.
XQuery 1.0 differs from XSLT 2.0 most visibly in having a
keyword-based syntax, not an XML-based syntax. There are also
a number of more subtle differences, but the common functionality
shared by the two languages is larger than the areas of
functionality specific to either of them.
That common functionality constitutes the XML Path language,
XPath.
XPath 1.0 originated as a language which represented the
intersection between the match expressions of XSLT 1.0 and XPointer
1.0, both of them then Working Drafts. Since their match expressions
had very similar functionality, it was thought desirable to provide a
single expression of that functionality, rather than two incompatible
expresssions. Similarly, XPath 2.0 captures the functionality common
to XSLT 2.0 and XQuery 1.0.
For some purposes, the heart of a query language is not the
manipulations it can perform upon data, but its ability to find the
data of interest. Gaps in a query system's ability to manipulate data
can frequently be made good outside the query system: further
manipulations can always be performed in an arbitrary programming
language, once the relevant data have been found.
It is perhaps for this reason that XPath 1.0 and 2.0 are already,
in themselves, frequently used as query languages for XML. In the
remainder of this paper I'd like to describe XPath briefly, and make
the case that it ought to be of interest to corpus linguists, as to
any other potential users of complex data.
At heart, XPath is an addressing language.
Many applications need to ‘address’ parts of
XML documents, in order to format them (as in XSLT), or to
find the target end of a hyperlink (as in XPointer), or to
extract information in the process of constructing documents
from existing fragments, or for query and retrieval, or to
express and check constraints on data validity, or for any
number of other purposes. XPath captures the functionality
common to such needs.
For XPath purposes, a document is an ordered tree with
- a root node
- element nodes
- text nodes
- attribute nodes
- namespace nodes
- processing instruction nodes
- comment nodes
In XPath 2.0 (but not 1.0), elements and attributes can have type annotations.
There is no structure sharing: distinct XML elements are distinct
for purposes of XPath. The boundaries of entities are not represented in
the data model. Namespace prefixes are resolved.
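A small invented example may make the node kinds concrete (the
stylesheet reference and the status attribute are there only to
populate the model). In the following document the XPath data
model sees a root node, three element nodes, one attribute node,
one comment node, one processing-instruction node, and a number of
text nodes (whitespace used for indentation forms text nodes of its
own):
<?xml-stylesheet type="text/xsl" href="poems.xsl"?>
<canzone status="draft">
  <!-- transcription unverified -->
  <l>unter den linden an der heide</l>
  <l>da unser zweier bette was</l>
</canzone>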
An XPath expression consists of a sequence of steps,
each identifying some set of elements or attributes in the document.
Each step starts from one or more nodes in the document structure
and moves to some other set of nodes; the expression as a whole
assumes a
context node which provides the input for
the first step. The result set of the final step is the result
node set for the expression as a whole.[16]
Abstractly, a step consists of an
axis identifier
(which indicates a direction of movement through the document
tree), a
node test which allows certain nodes along
that axis to be selected and others ignored, and a sequence of
Boolean tests, or predicates, which further constrain the result:
axis::node-test [predicate] [predicate] ...
For example, the XPath expression
“descendant::figure[@rend="svg"]” denotes the set of
figure elements which are (a) descendants of the
current element, and (b) have a
rend attribute with a
value of “svg”.
A good idea of the power of XPath can be gained simply by
listing the different axes along which we can search for
elements and attributes:
- child (selects elements, text nodes,
comments, and processing instructions directly contained by the current
node)
- parent (selects the parent of the current node; only elements
and the root node can be parents)
- attribute (selects attributes of the current
element)
- following, following-sibling (select elements, text nodes, comments, or processing instructions which
follow the current node in document order; in the case of
following-sibling, they must also be children of the same
parent)
- preceding, preceding-sibling (like following and following-sibling, but
moving backwards in document order)
- self
- namespace (selects only namespace declarations in
scope for the current element)
- ancestor, ancestor-or-self (select elements which enclose the current node)
- descendant, descendant-or-self (select elements, text nodes, comments, or processing instructions which
are either children, or children of children, etc., of the current node)
Some simple examples may show how XPath expressions can be used to
find examples of interest in an XML-encoded corpus in which
XML elements are used to represent parse trees in a simple and
straightforward way.[17]
- child::Fa
(selects all adverbial-clause children)
- child::*
(selects all element children)
- child::text()
(selects all text node children)
- child::node()
(selects all children)
- attribute::del
(selects all attributes of the current element named del)
- attribute::*
(selects all attributes of the current element)
- descendant::N
(selects all noun-phrase descendants)
- ancestor::S
(selects all sentence [clause] ancestors)
- ancestor-or-self::S
(selects all sentence [clause] context nodes or ancestors)
- descendant-or-self::Nr
(selects all context nodes or descendants which are temporal adverbial
noun phrases)
- /descendant-or-self::S[not(./descendant::N)] selects
all sentences (S elements) which contain no noun
phrases (N elements)
- /descendant-or-self::N[./child::w[@t='AT1']
and ./child::w[@t='JJ'] and ./child::w[@t='NN1']]
selects all noun phrases (N elements)
which directly contain at least one determiner (a
w element marked with t='AT1'),
one adjective (JJ), and one singular noun (NN1)
- child::w[@t='AT1' and following-sibling::w[1][@t='JJ']
and following-sibling::w[2][@t='NN1']]
selects determiners (AT1) followed immediately by
adjectives and then by singular nouns
- /descendant-or-self::N[./child::w[@t='AT1'
and following-sibling::w[1][@t='JJ']
and following-sibling::w[2][@t='NN1']]]
selects noun phrases containing the sequence AT1,
JJ, NN1.
- /descendant-or-self::w[@t='NP1'
and @f=translate(@f, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')]
selects all proper nouns (words tagged NP1) whose f (form) attribute is
in upper case (is identical to the result of
uppercasing it)
A short syntax is also available, which is more compact, but
which would require more explanation than seems appropriate here.
XPath is, as noted, already widely used for queries.
It can be used to select elements and attributes, or to
return strings and numbers resulting from relatively
simple calculations on matched nodes. Nodes in the XML
tree can be selected by position, by type, by name, by
value, or by combinations of these. By means of
predicates, co-occurrence constraints can be expressed.
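For example, the rule of canzone form noted earlier, which the
document grammar could not express (the two Stollen have the same
number of lines; the Abgesang has more lines than a Stollen but
fewer than the Aufgesang), can be stated as an XPath predicate.
The following expression, a sketch against the fully tagged form of
the canzone markup, selects any canzone which violates the rule, so
that an empty result certifies that the constraint holds:
/descendant-or-self::canzone
  [not(count(child::aufgesang/child::stollen[1]/child::l)
       = count(child::aufgesang/child::stollen[2]/child::l))
   or not(count(child::abgesang/child::l)
          > count(child::aufgesang/child::stollen[1]/child::l))
   or not(count(child::abgesang/child::l)
          < count(child::aufgesang/child::stollen/child::l))]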
The greatest weaknesses of XPath as a free-standing
query language are that XPath 1.0 has no data types,
and there is very little type checking. This minimizes
the number of type errors raised by an XPath processor,
which makes using XPath 1.0 a rather more cheerful
experience than using some other languages. But it also
means that in case of an error in an XPath expression,
the system is liable to produce incorrect answers owing to
an incorrect understanding of what was desired. (If
the built-in type coercions do the right thing, of course,
the answers will not be incorrect.)
More elaborate queries and processing are possible
using XSLT or XQuery: creation of new
elements and attributes, restructuring of the
input, calculation of totals and subtotals,
all of the conveniences familiar to users of
database management systems.
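As one small illustration, the following template rule (a sketch
assuming the parse-tree markup used in the XPath examples above)
causes an XSLT processor to write out, for each sentence, the
number of noun phrases it dominates:
<!--* for each sentence (S element), output the number of
    noun phrases (N elements) it dominates, one per line *-->
<xsl:template match="S">
  <xsl:value-of select="count(descendant::N)"/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>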
A. References
Brüggemann-Klein, Anne.
1993.
Formal models in document processing.
Habilitationsschrift, Freiburg i.Br., 1993. 110 pp.
Available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps
(Cover pages archival copy also at
http://www.oasis-open.org/cover/bruggDissert-ps.gz).
[Brüggemann-Klein provides a formal definition of
1-unambiguity, which corresponds to the notion of
unambiguity in ISO 8879 and determinism
in XML 1.0. Her definition of 1-unambiguity can be used to
check XML Schema's Unique Particle Attribution constraint
by changing every minOccurs and maxOccurs value greater than 1 to 1,
if the two are equal, and otherwise changing minOccurs to 1 and
maxOccurs greater than 1 to unbounded.]
Brüggemann-Klein, Anne.
1993.
“Regular expressions into finite automata.”
Theoretical Computer Science
120.2 (1993): 197-213.
[Ginsburg/Harrison 1967]
Ginsburg, S., and
M. M. Harrison.
“Bracketed context-free languages”.
Journal of computer and system sciences
1.1 (1967): 1-23.
[ISO 1986]
International Organization for Standardization (ISO).
1986.
ISO 8879-1986
(E). Information processing — Text and Office Systems —
Standard Generalized Markup Language (SGML). International
Organization for Standardization, Geneva, 1986.
[ISO/IEC 1992]
International Organization for Standardization (ISO);
International Electrotechnical Commission (IEC).
1992.
ISO/IEC 10744:1992 (E).
Information technology — Hypermedia / Time-based Structuring
Language (HyTime). International
Organization for Standardization, Geneva, 1992.
[ISO/IEC 1996]
International Organization for Standardization (ISO);
International Electrotechnical Commission (IEC).
1996.
[Draft] Corrected HyTime Standard
ISO/IEC 10744:1992 (E).
[n.p.]: Prepared by W. Eliot Kimber for Charles F. Goldfarb, Editor,
13 November 1996.
[W3C 1999]
World Wide Web Consortium (W3C).
XML Path Language (XPath)
Version 1.0, ed. James Clark and Steve DeRose.
W3C Recommendation 16 November 1999.
Published by the
World Wide Web Consortium at
http://www.w3.org/TR/xpath, November
1999.
[W3C 2000]
W3C. Document Object Model (DOM) level 1 specification. Published
by the World Wide Web Consortium at
http://www.w3.org/TR/REC-DOM-Level-1, September 2000. W3C
Recommendation.
[W3C 2001a]
“XML Schema Part 0: Primer”,
ed. David Fallside.
W3C Recommendation, 2 May 2001.
[Cambridge, Sophia-Antipolis, Tokyo: W3C]
http://www.w3.org/TR/xmlschema-0/.
[W3C 2001b]
W3C.
2001.
XML Schema Part 1: Structures, ed.
Henry S. Thompson,
David Beech,
Murray Maloney, and
Noah Mendelsohn.
W3C Recommendation 2 May 2001.
[Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium.
http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/
[W3C 2001c]
W3C.
2001.
XML
Schema Part 2: Datatypes, ed.
Biron, Paul V. and
Ashok Malhotra.
W3C Recommendation 2 May 2001.
[Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium.
http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/
[W3C 2004a]
World Wide Web Consortium (W3C).
Extensible Markup Language (XML) 1.0 (Third Edition),
ed.
Tim Bray,
Jean Paoli,
C. M. Sperberg-McQueen,
Eve Maler (Second Edition),
François Yergeau (Third Edition).
W3C Recommendation 4 February 2004.
Published by the World Wide Web Consortium at
http://www.w3.org/TR/REC-xml/.
[W3C 2004b]
World Wide Web Consortium (W3C).
XML Information Set (Second Edition),
ed. John Cowan
and Richard Tobin.
W3C Recommendation 4 February 2004.
Published by the World Wide Web Consortium at
http://www.w3.org/TR/xml-infoset.
[W3C 2004c]
World Wide Web Consortium (W3C).
XQuery 1.0 and XPath 2.0 Data Model,
ed. Mary Fernández et al.
W3C Working Draft 23 July 2004.
Published by the
World Wide Web Consortium at
http://www.w3.org/TR/xpath-datamodel/,
2004.