1. Background
1.1. Some relevant projects
Some corpus linguists will already be aware of SGML and XML,
because they have heard of one or another of the many projects relevant
to the application of SGML and XML to corpus-linguistic work.
The TEI Guidelines are a set of SGML document-type definitions
prepared over several years in the late 1980s and early 1990s by the
Text Encoding Initiative (TEI), an international project sponsored by
the Association for Computers and the Humanities (ACH), the
Association for Literary and Linguistic Computing (ALLC), and the
Association for Computational Linguistics (ACL). They are
maintained and further developed now by the TEI Consortium.
The markup vocabulary specified by the TEI Guidelines is intended to
allow for the encoding of virtually any textual material that might be
of interest for linguistic or textual study of any kind — which
means any textual material at all, of any period, language, or genre.
Corpus linguistics was among the fields best represented
in the working groups of the TEI; Stig Johansson, for
example, who was instrumental in completing the Lancaster-Oslo/Bergen
corpus of English, served as chair of a Working Committee, and Antonio
Zampolli of Pisa served on the TEI Steering Committee. As a result,
the TEI vocabulary contains (along with much else) much that is useful
in corpus work.
The British National Corpus (BNC) began work before the TEI Guidelines
were complete, but is encoded using a variant of the TEI encoding
scheme (Lou Burnard, one of the co-editors of the TEI Guidelines, led
the computing work on the BNC). The BNC consists of 100 million words
of modern British English: 90 million words of written material and 10
million words of spoken material, with careful attention to questions
of balance and elaborate attempts at characterization of each sample
so that subcorpora can be selected to suit special interests.
The Corpus Encoding Standard (CES) is a specialization or profile
of the TEI vocabulary, in which many of the options left to local
preference in the TEI scheme are tightened up in the interests of
greater uniformity of practice in the collection of corpora and
similar language resources.
Another application of SGML (and the TEI) to corpus building is
in the Edinburgh and Chiba map task corpora prepared (respectively) by
Henry Thompson (Edinburgh) and Syun Tutiya (Chiba). Here, SGML
is used to encode spoken language of people engaged in a particular
task; in the Chiba materials, the transcriptions are synchronized
in some detail with videotape recordings of the conversations, to
allow study of gesture, eye contact, and so on.
1.2. SGML and XML
But what is this SGML which these projects apply to corpus
linguistics? And how does it relate to the XML mentioned in my
title?
SGML, or the Standard Generalized Markup Language, is an
international standard defined by ISO 8879:1986. It offers a
non-proprietary means of providing explicit descriptive markup of any
collection of textual features which may be of interest. It does not,
as one might at first expect, achieve this by defining a vocabulary or
set of tags for marking up textual features; it defines nothing of the
sort. Instead, it defines a meta-language by means of which the user
of SGML may define an arbitrary set of tags (or, more precisely, of
element types) for marking up documents of a given sort.
The user decides what element types to define, and the meaning
ascribed to them (what textual features they should describe); the
selection of textual features to be marked up is limited not by the
capabilities of a particular piece of software (as in word processors
or formatting languages), nor by the decisions made in a committee
defining a markup language (as in HTML and most other specific
applications of SGML), but only by the user's interests and the user's
ability to mark up the features of interest (both economic and
intellectual limitations play a role).
The meta-language provided by SGML takes the form of a
document type definition, which contains what might be
called a ‘document grammar’, roughly in the style
of a BNF (Backus/Naur Form) grammar.[2] SGML also
allows the user to declare the coded character set they are using;
this allows SGML to be used on virtually any kind of computer
system.
XML (the Extensible Markup Language) is by contrast not an
International Standard in the narrow sense, although it is fast
becoming a de facto standard. It is defined by a Recommendation
of the World Wide Web Consortium (W3C), first issued in 1998 and
re-issued with corrections from time to time since then. XML is a
subset of SGML, both in the sense that the rules of XML are a subset
of the rules of SGML and in the sense that every well-formed XML
document is a legal SGML document. The subset was designed to be
easier to handle on the World Wide Web (the responsible Working Group
was originally called the “SGML on the Web” Working Group), and
in particular to be easier to parse than SGML. There are fewer
optional features (the most prominent exception being that processing
of the document grammar or DTD is not required of conforming XML
processors), and the character set is constrained to be Unicode
(although non-Unicode character encodings may be used, the repertoire
of characters is defined exclusively with reference to Unicode / ISO
10646). By making XML easier to parse than SGML, the designers hoped
to make it easier to write software for XML than for SGML and thus to
encourage software development. Since 1998, this hope has been
amply fulfilled.
I should pause for a moment to describe the World Wide Web
Consortium, as the body responsible for the XML specification. W3C is
a member-supported organization which creates Web standards (in the
form of Recommendations, which describe recommended practice). Its
mission is “to lead the Web to its full potential”, and W3C is
accordingly engaged in a wide variety of endeavors to ensure that the
World Wide Web is accessible to all, regardless of language,
geography, script, visual impairment, or other disability; to improve
the utility of the Web as a medium not just for human to human but
also for program to program communication; to address social concerns
confronting the Web, in particular the development of a Web of Trust
which makes the Web a more successful collaborative environment; to
improve the interoperability of Web-based software and the
evolvability of the technical design of the Web; to encourage
decentralization in the Web; and to provide better standards-based
multimedia formats.
2. Markup and markup languages
I have talked about some corpus-related projects using SGML and
XML, and I have described what SGML and XML are. But what is the
markup to which their names allude?
Markup (the term comes from the traditional printer's term for the
markings in a manuscript which tell the typesetter how to style the
document) is a system of marks (tags) appearing in the
electronic form of a document, which serves to provide a more explicit
representation of textual phenomena not adequately captured by a
simple sequence of characters. Informally, users of markup often
speak as if the purpose of markup is to add information to a
transcription of a text; this is a convenient manner of speaking, but
an oversimplification: quite frequently the information made explicit
by the markup is already notionally part of the text, and is not so
much added to the text by the markup as
exhibited by it.
The nature and utility of markup may be exhibited best by
some simple examples.
Consider first a very simple transcription which uses no explicit
markup. This is the beginning of a transcription of an English
translation of Ibsen's
Peer Gynt, as distributed by the
Oxford Text Archive.[3]
1875
PEER GYNT
by Henrik Ibsen
THE CHARACTERS
ASE, a peasant's widow.
PEER GYNT, her son.
TWO OLD WOMEN with corn-sacks. ASLAK, a smith. WEDDING-GUESTS. A
MASTER-COOK, A FIDDLER, etc.
A MAN AND WIFE, newcomers to the district.
SOLVEIG and LITTLE HELGA, their daughters.
THE FARMER AT HEGSTAD.
INGRID, his daughter.
THE BRIDEGROOM and His PARENTS.
THREE SAETER-GIRLS. A GREEN-CLAD WOMAN.
THE OLD MAN OF THE DOVRE.
A TROLL-COURTIER. SEVERAL OTHERS. TROLL-MAIDENS and TROLL-URCHINS. A
COUPLE OF WITCHES. BROWNIES, NIXIES, GNOMES, etc.
AN UGLY BRAT. A VOICE IN THE DARKNESS. BIRD-CRIES.
KARI, a cottar's wife.
Master COTTON, Monsieur BALLON, Herren VON EBERKOPF and
TRUMPETERSTRALE, gentlemen on their travels. A THIEF and A RECEIVER.
ANITRA, daughter of a Bedouin chief.
ARABS, FEMALE SLAVES, DANCING-GIRLS, etc.
THE MEMNON-STATUE (singing). THE SPHINX AT GIZEH (muta persona).
PROFESSOR BEGRIFFENFELDT, Dr. Phil., director of the madhouse at
Cairo.
HUHU, a language-reformer from the coast of Malabar. HUSSEIN, an
eastern Minister. A FELLAH, with a royal mummy.
SEVERAL MADMEN, with their KEEPERS.
A NORWEGIAN SKIPPER and HIS CREW. A STRANGE PASSENGER.
A PASTOR. A FUNERAL-PARTY. A PARISH-OFFICER. A BUTTON-MOULDER. A
LEAN PERSON.
The action, which opens in the beginning of the nineteenth
century, and ends around the 1860's, takes place partly in
Gudbrandsdalen, and on the mountains around it, partly on the coast
of Morocco, in the desert of Sahara, in a madhouse at Cairo, at sea,
etc.
ACT FIRST
SCENE FIRST
[A wooded hillside near ASE's farm. A river rushes down the slope.
On the further side of it an old mill shed. It is a hot day in
summer.]
[PEER GYNT, a strongly-built youth of twenty, comes down the
pathway. His mother, ASE, a small, slightly built woman, follows
him, scolding angrily.]
ASE
Peer, you're lying!
PEER [without stopping].
No, I am not!
ASE
Well then, swear that it is true!
PEER
Swear? Why should I?
ASE
See, you dare not!
It's a lie from first to last.
The basic approach has been to render the text in ASCII characters
approximately as it might be rendered by someone using a typewriter.
The words of the text are mostly there, although no distinction is made
between the words of Ibsen's text and other material (e.g. the date,
title, and authorship attribution) transcribed from the exemplar.
Also, it is clear that not everything in the exemplar has been
transcribed. The date, title, and author are given, but they are not
distinguished from the rest of the transcription in such a way as to
make it easy for an automatic process (as opposed to a human) to
identify them. The date given (1875) is cryptic: it is not the date
of original composition (the poem was written and first published
in 1867), nor that of first performance as a play (1876), nor that
of the original publication of this edition (apparently 1892).
1875 may perhaps be intended as the date at which the adaptation
for the stage was made.
The restriction to ASCII characters means that the name of Peer's
mother Åse is mistranscribed as “Ase”; italics in the
original are not signaled, and footnotes have been lost. Speeches
are visually set off from each other, but the user of the transcription
will need to write a special-purpose parser in order to be able to
tell, when processing the transcription, who speaks which words of
dialog, and to distinguish dialog from stage directions and
speech attributions.
In sum, this first example captures the lexical material of the
source text, but does not capture the textual structure or relevant
meta-textual information in any form suitable for machine processing.
Some transcriptions have more information. Older electronic
material, in particular, is apt to be presented in forms like
this fragment of Walther von der Vogelweide[4]:
|s001
|l001 ich sâz ûf eime steine
|l002 und dâhte bein mit beine
|l003 dar ûf sazt ich den ellenbogen
or this representation of the opening lines of
Beowulf:
|b001
|l001a Hw*a/et, we GAR-DENA
|l001b in geardagum t*rym gefrunnon
|l002a Hu t*a *a/ed*elingas
|l002b ellen fremedon
Stanzas (|s),
books (|b), and
lines of verse (|l) are
explicitly marked, and numbers are given as
identifiers. In the Beowulf fragment,
the presence of asterisks and slashes marks
the spelling of characters not present in the
character set of a card-punch: wynn (w*),
aesc (a/e),
thorn (t*),
edh (d*).
By means of these simple conventions, it becomes possible
for concordances to provide line numbers in the lists
of word occurrences, and for a typesetter or suitable
computer printer to provide an appropriate representation
of the special characters.
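To anticipate the fuller discussion of XML below: the same
information can be recorded with explicit elements and attributes
(the element and attribute names in the following sketch are merely
illustrative, not taken from any particular vocabulary), and in a
Unicode encoding the special characters of the Beowulf fragment
could likewise be written directly, without asterisk and slash
conventions:
<stanza n="s001">
  <l n="l001">ich sâz ûf eime steine</l>
  <l n="l002">und dâhte bein mit beine</l>
  <l n="l003">dar ûf sazt ich den ellenbogen</l>
</stanza>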
Sometimes, markup is used for linguistic annotation.
In cases like this, the frequently heard characterization of
markup as information
added to a text becomes
more and more problematic. Is linguistic annotation
to be described as the addition of extra-textual information?
Or only as an explicit representation of information
intrinsic to the text quâ text and left
implicit by conventional orthographic transcription?
Consider this fragment of a sample from the BNC, with simple
phrase-structure tagging.
S0CF6003 v
[S [N TROUBLED_JJ [ morning_NNT1 television_NN1 ]
station_NN1 GMTV_NP1 N] finally_RR [V had_VHD
[N something_PN1 [Ti to_TO smile_VVI [P about_II
P]Ti]N][Nr last_MD night_NNT1 [Fr when_RRQ [N
it_PPH1 N][V was_VBDZ revealed_VVN [Fn[N it_PPH1
N][V gained_VVD [N an_AT1 extra_JJ million_NNO
viewers_NN2 N][P over_II [N the_AT last_MD two_MC
weeks_NNT2 N]P]V]Fn]V]Fr]Nr]V] ._YSTP S]
Fortunately, it is possible to build working systems
of markup while remaining agnostic as to whether
linguistic information of this kind, or any information
of any kind, is part of the text or not.
While the distinction between textual and extra-textual
features has great potential interest for all of us who
care about texts and language, it has no formal
importance. Formally, the crucial opposition is that
between content and
markup.
A markup language does four things:
- defines a vocabulary to use in markup.
- specifies how markup constructs can occur
(containment, sequence, ...), providing a contract
between data sources and data sinks.
- tells how to distinguish markup from content.
- indicates what the constructs of its vocabulary mean
and how they are to be used.
A simple example may make this more concrete. Following
the tradition of many computer languages, we use a trivial
‘hello, world’ example which illustrates
some of the most important features of XML:
<!DOCTYPE greetings [
<!ELEMENT greetings (hello+) >
<!ELEMENT hello (#PCDATA) >
<!ATTLIST hello
lang CDATA #IMPLIED >
<!ENTITY szlig "&#223;" >
<!ENTITY uuml "&#252;" >
]>
<greetings>
<hello lang="en">Hello, world!</hello>
<hello lang="fr">Bon jour, tout le monde!</hello>
<hello lang="no">Goddag!</hello>
<hello lang="de">Guten Tag!</hello>
<hello lang="de-franken">Grüß Gott!</hello>
</greetings>
The document is divided into two parts: first the
document type declaration, which runs from the
first line (“<!DOCTYPE greetings [”)
through the eighth (“]>”), and then the
document instance, which runs from the ninth
line (“<greetings>”)
through the fifteenth (“</greetings>”).
Let us consider the document instance first. It is encoded
using a fairly simple form of labeled bracketing. The document
is divided into structural units known as elements;
each element is delimited by a start-tag at its
beginning and an end-tag at its end, each
giving the element type of the element. Here,
the top-level element (there is always exactly one outermost
element in any well-formed XML document) is of type
greetings, and it has five child elements,
each of type hello, each containing a
greeting. The final hello element uses two
entity references (the strings “&uuml;”
and “&szlig;”) to represent characters
not conveniently accessible from my US-oriented keyboard.
The start-tag can also contain a list of attribute-value
pairs showing attributes of the element type; in this
example, each hello element is labeled
with a code for its language.
The document type declaration contains declarations for
the two element types; the element declarations may be
thought of as productions in a regular right-part grammar
defining the set of valid documents. The declaration
<!ELEMENT greetings (hello+) >
indicates that valid elements of type
greetings
contain one or more elements of type
hello.
The declaration
<!ELEMENT hello (#PCDATA) >
indicates that
hello elements contain
character data.
Attributes may also be declared: the declaration
<!ATTLIST hello
lang CDATA #IMPLIED >
indicates that elements of type
hello may but need
not carry a
lang attribute, which in turn contains
a string of character data.[5]
The entities referred to in the document are also declared.
Since entities are simply named strings of characters, the
declaration for an entity consists, in the simple case, just
of the entity name and its replacement value:
<!ENTITY szlig "&#223;" >
<!ENTITY uuml "&#252;" >
Here, the name
szlig is defined for the string
consisting of the single character whose decimal number in
Unicode is 223 (U+00DF, in the usual Unicode notation), and
the name
uuml is assigned a replacement string
consisting of the character U+00FC.[6]
A markup language, I said earlier, does four things.
First, it defines a vocabulary for use in markup. Here,
the vocabulary consists of the element types greetings
and hello, and the attribute lang on
the hello element. Second, it specifies how
markup can occur; here, the rule is that hello elements
occur only within a greetings element, and that
any number of them may occur in the same greetings
element. Third, it tells us how to distinguish markup from
content. Here, the rules for distinguishing markup and content
are given by XML: markup consists of start- and end-tags
(delimited by angle brackets), entity references (delimited
by ampersand and semicolon), and some other constructs not
shown here.
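The most important of those other constructs can at least be
glimpsed in passing. The following free-standing fragment (it is
not part of the greetings example, and the stylesheet name is
invented) shows a comment, a processing instruction, a numeric
character reference, and a CDATA section; all of these are
recognized as markup by an XML parser, although the content of the
CDATA section, like the character addressed by the character
reference, is ordinary character data:
<!-- comments are addressed to human readers, not to applications -->
<?xml-stylesheet type="text/xsl" href="greetings.xsl"?>
<hello lang="de">Gr&#252;&#223; Gott!</hello>
<note><![CDATA[Inside a CDATA section, < and & lose their special meaning.]]></note>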
Fourth, a markup language specifies a meaning for the constructs it
defines. Notice that no formal declarations occur in our example to
identify the meaning or expected usage of the elements or attributes.
XML does not provide any particular set of semantic primitives in
terms of which such semantic or pragmatic information could be given.
This is frequently regarded as a weakness of XML, but in fact it is
XML's semantic agnosticism which ensures that XML can be used for so
many different types of information. The meanings to be conveyed by
an XML vocabulary are limited only by the ingenuity of the vocabulary
designer; their translation into operational semantics is limited only
by the wit of those who write programs to process the vocabulary.
Vocabularies intended for serious work should be, and typically are,
documented by conventional prose documentation. The TEI
Guidelines, for example, provide hundreds of pages of
prose documentation for the TEI vocabulary; documentation similar in
essence, albeit frequently less extensive, is provided for other
widely used vocabularies like DocBook and HTML. For throw-away
vocabularies and toy illustrations like the one above, on the other
hand, the documentation of the vocabulary's meaning may be limited to
the use of suggestive natural-language names for elements and
attributes, and the provision of some examples.
3. The XML Landscape
Since SGML and XML define neither a specific set of element
types and attributes nor a specific set of primitive semantic
notions, they may best be regarded (despite their names) not as
markup languages but as meta-languages for the
definition of markup languages.
A particular markup language defined using XML is conventionally
referred to as an application of XML. It is through
particular applications that most users will encounter XML. XHTML,
the XML-based formulation of HTML, may be the best known application
of XML. Corpus linguists may be familiar with, or at least have heard
of, the Text Encoding Initiative Guidelines and the
Corpus Encoding Standard which applies them to corpus-building. Those
interested in graphics will know the W3C's Scalable Vector Graphics
language SVG, which provides compact scalable graphics. Another type
of application is represented by XSLT (Extensible Stylesheet Language:
Transformations), a W3C Recommendation defining an XML-based
functional programming language for transformations of XML data into
XML and other forms; it is very widely used for all kinds of data
processing, including the translation of data from other XML formats
into (X)HTML.
When people speak of the increasing complexity of using XML, they
sometimes have in mind the proliferation of such specialized
applications of XML. There have long since been too many to keep
track of, and more are developed all the time. But since one of the
main goals of XML was precisely to make it easier for users and
communities of users to define the markup languages they need, there
is no point in bemoaning the fact that many users and communities have
seized the opportunity to do just that.[7] The use of XML to define
specialized vocabularies was part of the plan from the beginning.
In other ways, however, the XML landscape is much more complex
today than was foreseen when work on XML started in 1996. The
original plan of the W3C's project for “SGML on the Web”
was to provide Web-friendly simplifications of the three major
specifications being developed by the ISO / IEC working group
responsible for document processing standards:[8]
- XML would be a simplification or subset of SGML.
- XLink would be a subset of HyTime, the Hypermedia / Time-based
Structuring Language for hypermedia architectures, defined by ISO 10744.
- XSL (Extensible Stylesheet Language) would be a subset or
simplification of DSSSL, the Document Style Semantics and Specification
Language for defining stylesheets to guide the formatting
and presentation of marked-up documents, defined by
ISO 10179.
The developers of XML wanted something like SGML on the Web, because
we were accustomed to the power of defining our own markup languages,
suited fairly precisely to the material we were encoding and the kinds
of processing we wished to perform; the limitation of the World Wide
Web to the single markup language HTML was felt to be unsustainably
constraining. Many people familiar with HyTime and other hypertext
systems felt similarly constrained by the very limited hypertext
mechanisms built into HTML; it was thought highly desirable to provide
more powerful mechanisms, of the kind most other hypertext systems
offered. As the sometime chair of the XLink Working Group (Bill Smith,
of Sun Microsystems) used to say, the goal of XLink was to bring
hypertext on the Web forward, into the 1960s. And, of course, in
order to process and display documents using generic markup, it is
essential to have some kind of stylesheet language.
In the event, all three of these goals were met, although
XLink and XSL were eventually split into several specifications.
But the most striking change in the plan was the sheer number of
other specifications which have been developed which were not
part of the original conception:
- XML
- Namespaces in XML
- XML Information Set
- XLink
- XPointer
- XSLT (XSL Transformations)
- XSL Formatting Objects
- XPath
- XML Schema
- XML Query
- Document Object Model
- SOAP (Simple Object Access Protocol)
- WSDL (Web Services Description Language)
This is by no means an exhaustive list even of W3C specifications
relevant to the XML infrastructure, not to speak of work done
outside the W3C framework.
It would exceed the patience even of the most well-behaved reader
to attempt to give a clear idea of the technical content of all of
these specifications. Instead, in the remainder of this paper I will
limit myself to a brief, schematic introduction of two of the
technologies I believe are most salient to the work of corpus
linguists, first XML Schema and second the complex of specifications
surrounding XPath.[9]
4. Document grammar
One of the key innovations of SGML, which sets it apart from most
other families of markup languages, is the notion that markup can be
validated against a formal definition of its structure.
As illustrated above, an SGML or XML document can be accompanied by a
document type declaration, in which elements and
attributes may be declared, and which provides a partial formal
expression of the
document type definition, which is
defined as the rules governing the application of the vocabulary to documents
of a particular type. Only part of the document type definition is
captured formally, for the simple reason that in the current state of
knowledge there are no tools for the formal expression of arbitrary
semantics which are as convenient and as precise as the well known
tools for the formal expression of syntax.[10]
The element declarations in a document type declaration provide
what we may call a document grammar, which generates
a set of documents in the same way that a grammar for a natural
language generates a set of sentences or discourses. The process of
checking a document instance against the declarations is called
validation.
Although its connections to automata theory give it a rather
academic air, validation was introduced in SGML not for theoretical
reasons, but for purely pragmatic reasons: errors in input data can
cause costly delays in document processing and typesetting, and
validation was introduced as a method of reducing such errors by
detecting (some of) them mechanically. It was only after the fact
that the validation rules of SGML were partially aligned with
principles of conventional automata theory.[11]
The formal specification of document validity by means of a DTD
means that whole classes of encoding errors can be detected
by automated processes, without requiring human proofreading; it
is thus a step forward in markup languages roughly analogous
to the introduction of Backus/Naur Form grammars in the specification
of Algol 60, and for much the same reasons.
Document grammars, in whatever notation they are expressed,
may have several uses:
- They assist in the perpetual struggle against dirty data.
- They can serve as documentation of a contract between data providers
and data consumers.
- They can specify the content of particular data flows
within a complex system.
- They can also be used to specify client/server protocols.
The notation for document grammars provided by SGML and XML
is conventionally referred to as
DTD notation;
in that notation, element declarations resemble individual
productions in a Backus/Naur Form grammar, but there are a
number of differences which should be kept in mind.
First of all, the start-tag of the element makes explicit
which production rule to use; certain kinds of ambiguity and
parsing difficulty thus melt away. As a parsing theorist would
say, the language being described is a bracketed language
(as defined by [Ginsburg/Harrison 1967]),
because each non-terminal has a distinctive start and end
string. Second, the right-hand side of each production consists of
a regular expression, so that technically speaking we have
a kind of regular-right-part grammar, not a BNF grammar.
Third, DTDs are not purely grammatical; as the example
shown above illustrates, they also include declarations for
entities, which have no grammatical function. And finally,
the content model of an element declaration is restricted, both
in SGML and in XML DTDs, to deterministic expressions.
XML Schema is an XML-based language for writing document grammars;
it is intended to make the use of DTDs for grammatical purposes
unnecessary. (For a variety of reasons, XML Schema does not replace
DTDs entirely: XML Schema has no provision for entity declarations,
so DTDs will continue to be used for that purpose.)
Let us consider a simple example of a document grammar, which
will allow us to illustrate some of the capabilities and limitations
both of DTDs and of XML Schema. The grammar describes two kinds
of poem: limericks, and canzone. A poem is one or the
other of these.
poem ::= limerick | canzone
A limerick consists of two lines of trimeter, two of dimeter,
and a final trimeter. Each of these is just (for the purposes
of this grammar) a sequence of characters.
limerick ::= trimeter trimeter dimeter
dimeter trimeter
trimeter ::= CHAR+
dimeter ::= CHAR+
For the internal structure of the canzone, we will use the
terminology developed by the Meistersänger (since that
is how I learned it): a canzone consists of two parts, an
Aufgesang and an
Abgesang; the
Aufgesang consists of two
Stollen.
The
Stollen and the
Abgesang in
turn consist of lines.[12]
canzone ::= aufgesang abgesang
aufgesang ::= stollen stollen
stollen ::= line+
abgesang ::= line+
A translation of the grammar into DTD form makes the
formal similarity clear:
<!ELEMENT poem (limerick | canzone) >
<!ELEMENT limerick (trimeter, trimeter,
dimeter, dimeter,
trimeter)>
<!ELEMENT trimeter (#PCDATA)>
<!ELEMENT dimeter (#PCDATA)>
<!ELEMENT canzone (aufgesang, abgesang) >
<!ELEMENT aufgesang (stollen, stollen) >
<!ELEMENT stollen (l+) >
<!ELEMENT abgesang (l+) >
<!ELEMENT l (#PCDATA) >
Conforming to the grammar, we can encode poems, both limericks:
<limerick>
<trimeter>
There was a young lady named Bright
</trimeter>
<trimeter>
whose speed was much faster than light.
</trimeter>
<dimeter>She set out one day,</dimeter>
<dimeter>in a relative way,</dimeter>
<trimeter>
and returned on the previous night.
</trimeter>
</limerick>
and canzone:
<canzone>
<aufgesang>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
</aufgesang>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
Note that in the DTD translation of the grammar, each non-terminal
symbol appears as an element type; the result is that instead of
tagging each line of verse the same way, we must use three different
element types, depending on context. Note also that it is a rule of
canzone form (at least, as practiced in Minnesang) that the two
Stollen must have the same number of lines, and that the
Abgesang must have more lines than a
Stollen, but fewer than the Aufgesang.
This rule is not expressed in this grammar.
The mechanical identification of grammatical non-terminals with
element types has given us a rather ponderous style of markup;
some users of XML would prefer not to tag the
Aufgesang
explicitly, thus:
<canzone>
<stollen>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
</stollen>
<stollen>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
</stollen>
<abgesang>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</abgesang>
</canzone>
We can modify the DTD to accomplish this by translating
the non-terminal
aufgesang not into an element
type but into a parameter entity.[13]
<!ENTITY % aufgesang "stollen, stollen" >
<!ENTITY % lines "l+" >
<!ELEMENT canzone (%aufgesang;, abgesang) >
<!ELEMENT stollen (%lines;) >
<!ELEMENT abgesang (%lines;) >
<!ELEMENT l (#PCDATA) >
Indeed, we could go further and remove almost all the
non-terminals, leaving only the elements for the poem
as a whole and for the individual verse lines, so that
the poem might be encoded thus:
<canzone>
<l>unter den linden an der heide</l>
<l>da unser zweier bette was</l>
<l>da mugt ir vinden schone beide</l>
<l>gebrochen bluomen unde gras</l>
<l>kuste er mich? wol tusentstunt</l>
<l>tandaradei</l>
<l>seht wie rot mir ist der munt</l>
</canzone>
As a first approximation of this change, we might write
the document grammar thus:
<!ENTITY % stollen "l+" >
<!ENTITY % aufgesang "%stollen;, %stollen;" >
<!ENTITY % abgesang "l+" >
<!ELEMENT canzone (%aufgesang;, %abgesang;) >
<!ELEMENT l (#PCDATA) >
This DTD, however, is illegal, because it is ambiguous
and hence non-deterministic: after expansion of the
parameter-entity references, it amounts to this:
<!ELEMENT canzone (l+, l+, l+) >
<!ELEMENT l (#PCDATA) >
It is not clear which occurrence of “l+”
any given line of the poem should be assigned to, except
that the first line must obviously be attributed to the
first “l+”, and the last line of the poem
to the third. The obvious revision of the definition of
canzone alleviates this problem and makes
a legal DTD:
<!ELEMENT canzone (l+) >
<!ELEMENT l (#PCDATA) >
This grammar has lost all of the internal structure of
the poem; whether that is an advantage or disadvantage
will vary with one's purpose in creating the encoding.
5. XML Schema
One of the principal disadvantages of DTD notation is that
while a DTD obviously contains structured information,
that structure is not exposed in the way recommended by
XML (that is, using markup), but is instead captured only
in an ad hoc syntax. This makes reuse of the information
harder, since any use of the information in a DTD requires
the creation of a DTD parser. In addition, the
purely grammatical orientation of DTDs fails to capture
a number of concepts important for modeling information in
the general case. Support for datatypes of the kind
conventional in programming languages and database management
systems is missing, as is type inheritance.
XML Schema was developed with the goal of addressing these
perceived problems. It uses XML document-instance syntax. It offers
much the same basic functionality as DTDs: both are notations for
document grammars. But XML Schema has more than DTDs, in some ways
(datatypes, type inheritance, etc.), and less in other ways (no entity
declarations).
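A small taste of the datatype machinery may be given here; the
declarations below (which would appear inside the xsd:schema
element introduced in a moment) show how the lang attribute of the
earlier hello example could be typed with the built-in xsd:language
type, or with a user-derived type constrained by a regular
expression (the type name langcode and the particular pattern are
merely illustrative):
<xsd:attribute name="lang" type="xsd:language"/>
<xsd:simpleType name="langcode">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="[a-z]{2}(-[a-z0-9]+)*"/>
  </xsd:restriction>
</xsd:simpleType>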
There is not space here for a serious introduction to XML Schema;
we will have to content ourselves with a slavish imitation of the
first version of the DTD.
At the outer level is a
schema element in the
XML Schema namespace:
<xsd:schema
xmlns:xsd ="http://www.w3.org/2001/XMLSchema" >
<!--* element declarations go here *-->
</xsd:schema>
N.B. the schema does not identify
a document-root element / start symbol. In this,
XML Schema appears to differ from the DTDs of XML and
SGML, where the
<!DOCTYPE ... >
declaration is used to give the type (generic identifier) of the
document's root element. But in practice, the
<!DOCTYPE ... > declaration occurs
in the document instance; in markup vocabularies intended
for serious use, the free-standing files containing the
declarations of elements and attributes do
not
indicate which element type is required to be the
root element.
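A minimal sketch may make the point concrete. If the declarations
given earlier for poems were stored in a free-standing file (here
given the invented name poems.dtd), each document instance would
name its own root element in its document type declaration:
<?xml version="1.0"?>
<!DOCTYPE canzone SYSTEM "poems.dtd">
<canzone>
  <!--* aufgesang and abgesang as shown earlier *-->
</canzone>
Another instance could equally well declare limerick or poem as its
root element; the choice is made by the instance, not by the file
of declarations.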
Following the simple DTD, we can declare the
elements
canzone and
aufgesang
as containing sequences of specified children:
<xsd:element name="canzone">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="aufgesang"/>
<xsd:element ref="abgesang"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="aufgesang">
<xsd:complexType>
<xsd:sequence>
<xsd:element ref="stollen"/>
<xsd:element ref="stollen"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
Note that there are two distinct uses of the xsd:element
element. Some are element declarations (the outer ones),
while others are element references (the inner ones).
Implicitly, each element reference matches a single occurrence of
the element: the minimum and maximum number of occurrences is one.
The
abgesang and
stollen elements
require that we write a content model which matches one or more
l elements; this can be done by specifying explicit
values for the attributes
minOccurs and
maxOccurs:
<xsd:element name="abgesang">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
<xsd:element name="stollen">
<xsd:complexType>
<xsd:sequence minOccurs="1"
maxOccurs="unbounded">
<xsd:element ref="l"/>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
The
l element itself needs to allow for mixed
content; the usual idiom is this:
<xsd:element name="l">
<xsd:complexType mixed="true">
<xsd:sequence>
</xsd:sequence>
</xsd:complexType>
</xsd:element>
or this:
<xsd:element name="l" type="xsd:string"/>
Of these, the first is preferable for natural-language material,
as it is more easily adjusted
when (not if) it
becomes necessary to allow child elements within a line of
verse (e.g. for quotations, or for emphasized words).
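When that adjustment becomes necessary, the empty sequence is
simply replaced; the following sketch (the emph element is invented
purely for illustration) allows any mixture of character data and
emphasized phrases within a line of verse:
<xsd:element name="l">
  <xsd:complexType mixed="true">
    <xsd:choice minOccurs="0" maxOccurs="unbounded">
      <xsd:element ref="emph"/>
    </xsd:choice>
  </xsd:complexType>
</xsd:element>
<xsd:element name="emph" type="xsd:string"/>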
The foregoing provides just a glimpse of XML Schema, but
I hope it suffices to make clear the utility of XML Schema
for laying out, concisely and formally, the structure of
a markup vocabulary, and for expressing some of the more
obvious kinds of structural integrity constraints we
may wish to guarantee to users of our data collections.
6. XML Query, XSLT, and XPath
In the long run, however, purely declarative expressions
of structure provide only so much satisfaction. One of the
reasons we create large collections of language data is so
that we can use it to study the language. That means that
we need to be able to find, in the mass of material
collected, examples of linguistic phenomena relevant to
a particular question, and then to manipulate those
examples conveniently.
For purposes of such search, retrieval, and manipulation,
the specifications of XPath (1.0 and 2.0), XSLT (1.0 and 2.0)
and XQuery (1.0) are all of particular interest. In theory,
at least, the core functions of these specifications are
quite distinct. XPath provides a language for identifying (or,
as some say, addressing) particular elements and
attributes in an XML document, while XSLT is designed for use
in document formatting or rendering systems, and XQuery
is designed to provide, for XML-encoded data, data manipulation
functionality analogous to that provided for relational
data by SQL, the Structured Query Language. In practice,
however, the three specifications are intimately related:
XPath 2.0 is used as an expression language by both XQuery 1.0 and
XSLT 2.0, and all three specifications use a common
data model and are based on the same formal semantics.
All three languages can be used for querying XML data
(although XSLT and XQuery both allow for manipulation and
elaboration of the data, whereas systems which use XPath as
their query language generally confine themselves to
presenting the results, without modification), and both
XSLT and XQuery can be used for manipulation of XML tree
structures.
The reader may be surprised that a language designed for document
formatting should have powerful declarative mechanisms for
manipulation of tree structures; it may be worthwhile to digress for a
moment to explain how this comes to be so. As mentioned earlier, XSL
(the Extensible Stylesheet Language) is designed as a Web-oriented
analog to the international standard DSSSL. Both DSSSL and XSL are
intended to support routine tasks in document formatting. They thus
must provide facilities for styling blocks of text, setting it in a
particular font family (e.g. Times Roman, or Helvetica, or Lucida),
with a particular font treatment (italic, bold, demi-bold, etc.), on a
particular measure, in a particular color, and so on. In an SGML or
XML context, this frequently takes the form of associating a
particular set of formatting properties with elements of the document;
a number of stylesheet languages work this way, from W3C's Cascading
Stylesheets (CSS), which was originally designed specifically for HTML
but was early on extended to work with arbitrary markup vocabularies,
to a number of proprietary stylesheet languages used by SGML editing
and formatting systems.
The satisfactory layout of all but the simplest documents, however,
requires more than the association of formatting properties with
elements in the input document. Some information in the document must
be duplicated to appear twice, e.g. the titles of sections, which must
appear both at the head of the section and in the table of contents.
Some must appear an unknown number of times in the output, e.g. the
left and right running heads of chapters, which must be replicated to
appear once on each page opening in the output. Some must be moved
from one location to another (e.g. notes, which are conventionally
stored in the source document at the logical point of attachment to
the main text, but which must be moved either to the bottom of the
page, or to the end of the chapter, or to the end of the book, for
conventional print publication). Some information must be added at
formatting time: page numbers, and quite often headings like
“Chapter VII”. Not infrequently, some information in the input
document must be suppressed; metadata about the revision history
of the electronic document, or authorial notes to the copy editor,
may appear in working drafts but not in final copy. And so on.
All of these tasks require that the tree of structural units to which
formatting properties are to be attached be different from the
tree structure of the input document. In some cases, the difference
in tree structure is minor; in other cases, it is profound.
The consequence of these requirements is that both DSSSL and XSL
specify both a set of formatting objects with rendering properties
and a notation for tree transformations of arbitrary
complexity. In both cases, but more especially in the case of XSL,
the tree transformation part of the system has been adopted by many
users not only for use in formatting systems but as a general purpose
tool for processing XML-encoded data. After all, if the general
pattern of data processing is to accept data as input, perform various
automatic transformations upon it, and write the results out as
output, then whenever one's data is in XML it is convenient to express
the required manipulation in a language which understands XML
structures natively.[14]
Viewed as a generic tree-transformation tool, XSLT has a number of
salient properties. First and foremost, XSLT stylesheets are
themselves written in XML. This proves a stumbling block for some
users, but is regarded by many, including me, as a key advantage of
XSLT over many conventional programming languages. Because XSLT is
written in XML, XSLT transformations can be used to process XSLT
stylesheets as input, or to produce them as output. Relatively few
conventional programming languages make such second-level processing
convenient.
The logic of an XSLT stylesheet can be driven either by the
structure of the output, or by the structure of the input, or by
a mixture of the two. The input-driven style is particularly
important for those working with natural-language material and
human-readable documents, since their structure is typically much
more elaborate, and much less regular, than that of data
conventionally stored in databases.
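A minimal sketch of the input-driven style, using the canzone
markup introduced above, consists of little more than a handful of
template rules; the HTML element names in the output are, of
course, only one possible rendering:
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!--* each canzone in the input becomes an HTML div;
      the processor visits whatever the input contains *-->
  <xsl:template match="canzone">
    <div class="canzone">
      <xsl:apply-templates/>
    </div>
  </xsl:template>
  <!--* each line of verse becomes a paragraph *-->
  <xsl:template match="l">
    <p class="verse-line">
      <xsl:apply-templates/>
    </p>
  </xsl:template>
</xsl:stylesheet>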
Like XML itself, XSLT is designed to have a declarative
semantics, and it falls squarely within the family of functional
programming languages.[15] The
ability to call named templates with parameters makes
XSLT Turing-complete, so that in theory its expressive power
is the same as that of any other programming language.
XSLT 1.0 is effectively an untyped language (it does have a simple
type system, but only four types), but XSLT 2.0 adopts
the basic types defined by XML Schema 1.0.
XQuery has been developed as an industrial-strength query language
for XML data, with much of the work being carried by major vendors of
SQL database systems. Decades of experience on problems of indexing,
query rewriting and optimization, and schema validation for relational
data are being applied systematically to XML data, with results that
promise to be dramatic for all those with large volumes of XML data to
store, search, and manipulate. The collaboration between the W3C XML
Query and XSL Working Groups has produced an explicit data model, a
formal semantics suitable for research on query optimization, and
a well developed system of static typing, to allow creators of
queries to know in advance that their queries are type-safe.
XQuery 1.0 differs from XSLT 2.0 most visibly in having a
keyword-based syntax, not an XML-based syntax. There are also
a number of more subtle differences, but the common functionality
shared by the two languages is larger than the areas of
functionality specific to either of them.
That common functionality constitutes the XML Path language,
XPath.
XPath 1.0 originated as a language which represented the
intersection between the match expressions of XSLT 1.0 and XPointer
1.0, both of them then Working Drafts. Since their match expressions
had very similar functionality, it was thought desirable to provide a
single expression of that functionality, rather than two incompatible
expresssions. Similarly, XPath 2.0 captures the functionality common
to XSLT 2.0 and XQuery 1.0.
For some purposes, the heart of a query language is not the
manipulations it can perform upon data, but its ability to find the
data of interest. Gaps in a query system's ability to manipulate data
can frequently be made good outside the query system: further
manipulations can always be performed in an arbitrary programming
language, once the relevant data have been found.
It is perhaps for this reason that XPath 1.0 and 2.0 are already,
in themselves, frequently used as query languages for XML. In the
remainder of this paper I'd like to describe XPath briefly, and make
the case that it ought to be of interest to corpus linguists, as to
any other potential users of complex data.
At heart, XPath is an addressing language.
Many applications need to ‘address’ parts of
XML documents, in order to format them (as in XSLT), or to
find the target end of a hyperlink (as in XPointer), or to
extract information in the process of constructing documents
from existing fragments, or for query and retrieval, or to
express and check constraints on data validity, or for any
number of other purposes. XPath captures the functionality
common to such needs.
For XPath purposes, a document is an ordered tree with
- a root node
- element nodes
- text nodes
- attribute nodes
- namespace nodes
- processing instruction nodes
- comment nodes
In XPath 2.0 (but not 1.0), elements and attributes can have type annotations.
There is no structure sharing: distinct XML elements are distinct
for purposes of XPath. The boundaries of entities are not represented in
the data model. Namespace prefixes are resolved.
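A small invented example may make the node kinds concrete (the
stylesheet reference and the status attribute are there only to
populate the model). In the following document the XPath data
model sees a root node, three element nodes, one attribute node,
one comment node, one processing-instruction node, and a number of
text nodes (whitespace used for indentation forms text nodes of its
own):
<?xml-stylesheet type="text/xsl" href="poems.xsl"?>
<canzone status="draft">
  <!-- transcription unverified -->
  <l>unter den linden an der heide</l>
  <l>da unser zweier bette was</l>
</canzone>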
An XPath expression consists of a sequence of steps,
each identifying some set of elements or attributes in the document.
Each step starts from one or more nodes in the document structure
and moves to some other set of nodes; the expression as a whole
assumes a
context node which provides the input for
the first step. The result set of the final step is the result
node set for the expression as a whole.[16]
Abstractly, a step consists of an
axis identifier
(which indicates a direction of movement through the document
tree), a
node test which allows certain nodes along
that axis to be selected and others ignored, and a sequence of
Boolean tests, or predicates, which further constrain the result:
axis::node-test [predicate] [predicate] ...
For example, the XPath expression
“descendant::figure[@rend="svg"]” denotes the set of
figure elements which are (a) descendants of the
current element, and (b) have a
rend attribute with a
value of “svg”.
A good idea of the power of XPath can be gained simply by
listing the different axes along which we can search for
elements and attributes:
- child (selects elements, text nodes,
comments, and processing instructions directly contained by the current
node)
- parent (selects the parent of the current node; only elements
and the root node can be parents)
- attribute (selects attributes of the current
element)
- following, following-sibling (select elements, text nodes, comments, or processing instructions which
follow the current node in document order; in the case of
following-sibling, they must also be children of the same
parent)
- preceding, preceding-sibling (like following and following-sibling, but
moving backwards in document order)
- self
- namespace (selects only namespace declarations in
scope for the current element)
- ancestor, ancestor-or-self (select elements which enclose the current node)
- descendant, descendant-or-self (select elements, text nodes, comments, or processing instructions which
are either children, or children of children, etc., of the current node)
Some simple examples may show how XPath expressions can be used to
find examples of interest in an XML-encoded corpus in which
XML elements are used to represent parse trees in a simple and
straightforward way.[17]
- child::Fa
(selects all adverbial-clause children)
- child::*
(selects all element children)
- child::text()
(selects all text node children)
- child::node()
(selects all children)
- attribute::del
(selects all attributes of the current element named del)
- attribute::*
(selects all attributes of the current element)
- descendant::N
(selects all noun-phrase descendants)
- ancestor::S
(selects all sentence [clause] ancestors)
- ancestor-or-self::S
(selects all sentence [clause] context nodes or ancestors)
- descendant-or-self::Nr
(selects all context nodes or descendants which are temporal adverbial
noun phrases)
- /descendant-or-self::S[not(./descendant::N)] selects
all sentences (S elements) which contain no noun
phrases (N elements)
- /descendant-or-self::N[./child::w[@t='AT1']
and ./child::w[@t='JJ'] and ./child::w[@t='NN1']]
selects all noun phrases (N elements)
which directly contain at least one determiner (a
w element marked with t='AT1'),
one adjective (JJ), and one singular noun (NN1)
- child::w[@t='AT1' and following-sibling::w[1][@t='JJ']
and following-sibling::w[2][@t='NN1']]
selects determiners (AT1) followed immediately by
adjectives and then by singular nouns
- /descendant-or-self::N[./child::w[@t='AT1'
and following-sibling::w[1][@t='JJ']
and following-sibling::w[2][@t='NN1']]]
selects noun phrases containing the sequence AT1,
JJ, NN1.
- /descendant-or-self::w[@t='NP1'
and @f=translate(@f, 'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')]
selects all proper nouns (words tagged NP1) whose f (form) attribute is
in upper case (is identical to the result of
uppercasing it)
A short syntax is also available, which is more compact, but
which would require more explanation than seems appropriate here.
XPath is, as noted, already widely used for queries.
It can be used to select elements and attributes, or to
return strings and numbers resulting from relatively
simple calculations on matched nodes. Nodes in the XML
tree can be selected by position, by type, by name, by
value, or by combinations of these. By means of
predicates, co-occurrence constraints can be expressed.
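For example, the rule of canzone form noted earlier, which the
document grammar could not express (the two Stollen have the same
number of lines; the Abgesang has more lines than a Stollen but
fewer than the Aufgesang), can be stated as an XPath predicate.
The following expression, a sketch against the fully tagged form of
the canzone markup, selects any canzone which violates the rule, so
that an empty result certifies that the constraint holds:
/descendant-or-self::canzone
  [not(count(child::aufgesang/child::stollen[1]/child::l)
       = count(child::aufgesang/child::stollen[2]/child::l))
   or not(count(child::abgesang/child::l)
          > count(child::aufgesang/child::stollen[1]/child::l))
   or not(count(child::abgesang/child::l)
          < count(child::aufgesang/child::stollen/child::l))]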
The greatest weaknesses of XPath as a free-standing
query language are that XPath 1.0 has no data types,
and there is very little type checking. This minimizes
the number of type errors raised by an XPath processor,
which makes using XPath 1.0 a rather more cheerful
experience than using some other languages. But it also
means that in case of an error in an XPath expression,
the system is liable to produce incorrect answers owing to
an incorrect understanding of what was desired. (If
the built-in type coercions do the right thing, of course,
the answers will not be incorrect.)
More elaborate queries and processing are possible
using XSLT or XQuery: creation of new
elements and attributes, restructuring of the
input, calculation of totals and subtotals,
all of the conveniences familiar to users of
database management systems.
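As one small illustration, the following template rule (a sketch
assuming the parse-tree markup used in the XPath examples above)
causes an XSLT processor to write out, for each sentence, the
number of noun phrases it dominates:
<!--* for each sentence (S element), output the number of
    noun phrases (N elements) it dominates, one per line *-->
<xsl:template match="S">
  <xsl:value-of select="count(descendant::N)"/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>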
A. References
Brüggemann-Klein, Anne.
1993.
Formal models in document processing.
Habilitationsschrift, Freiburg i.Br., 1993. 110 pp.
Available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps
(Cover pages archival copy also at
http://www.oasis-open.org/cover/bruggDissert-ps.gz).
[Brüggemann-Klein provides a formal definition of
1-unambiguity, which corresponds to the notion of
unambiguity in ISO 8879 and determinism
in XML 1.0. Her definition of 1-unambiguity can be used to
check XML Schema's Unique Particle Attribution constraint
by changing every minOccurs and maxOccurs value greater than 1 to 1,
if the two are equal, and otherwise changing minOccurs to 1 and
maxOccurs greater than 1 to unbounded.]
Brüggemann-Klein, Anne.
1993.
“Regular expressions into finite automata.”
Theoretical Computer Science
120.2 (1993): 197-213.
[Ginsburg/Harrison 1967]
Ginsburg, S., and
M. M. Harrison.
“Bracketed context-free languages”.
Journal of computer and system sciences
1.1 (1967): 1-23.
[ISO 1986]
International Organization for Standardization (ISO).
1986.
ISO 8879-1986
(E). Information processing — Text and Office Systems —
Standard Generalized Markup Language (SGML). International
Organization for Standardization, Geneva, 1986.
[ISO/IEC 1992]
International Organization for Standardization (ISO);
International Electrotechnical Commission (IEC).
1992.
ISO/IEC 10744:1992 (E).
Information technology — Hypermedia / Time-based Structuring
Language (HyTime). International
Organization for Standardization, Geneva, 1992.
[ISO/IEC 1996]
International Organization for Standardization (ISO);
International Electrotechnical Commission (IEC).
1996.
[Draft] Corrected HyTime Standard
ISO/IEC 10744:1992 (E).
[n.p.]: Prepared by W. Eliot Kimber for Charles F. Goldfarb, Editor,
13 November 1996.
[W3C 1999]
World Wide Web Consortium (W3C).
XML Path Language (XPath)
Version 1.0, ed. James Clark and Steve DeRose.
W3C Recommendation 16 November 1999.
Published by the
World Wide Web Consortium at
http://www.w3.org/TR/xpath, November
1999.
[W3C 2000]
W3C. Document Object Model (DOM) level 1 specification. Published
by the World Wide Web Consortium at
http://www.w3.org/TR/REC-DOM-Level-1, September 2000. W3C
Recommendation.
[W3C 2001a]
“XML Schema Part 0: Primer”,
ed. David Fallside.
W3C Recommendation, 2 May 2001.
[Cambridge, Sophia-Antipolis, Tokyo: W3C]
http://www.w3.org/TR/xmlschema-0/.
[W3C 2001b]
W3C.
2001.
XML Schema Part 1: Structures, ed.
Henry S. Thompson,
David Beech,
Murray Maloney, and
Noah Mendelsohn.
W3C Recommendation 2 May 2001.
[Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium.
http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/
[W3C 2001c]
W3C.
2001.
XML
Schema Part 2: Datatypes, ed.
Biron, Paul V. and
Ashok Malhotra.
W3C Recommendation 2 May 2001.
[Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium.
http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/
[W3C 2004a]
World Wide Web Consortium (W3C).
Extensible Markup Language (XML) 1.0 (Third Edition),
ed.
Tim Bray,
Jean Paoli,
C. M. Sperberg-McQueen,
Eve Maler (Second Edition),
François Yergeau (Third Edition).
W3C Recommendation 4 February 2004.
Published by the World Wide Web Consortium at
http://www.w3.org/TR/REC-xml/.
[W3C 2004b]
World Wide Web Consortium (W3C).
XML Information Set (Second Edition),
ed. John Cowan
and Richard Tobin.
W3C Recommendation 4 February 2004.
Published by the World Wide Web Consortium at
http://www.w3.org/TR/xml-infoset.
[W3C 2004c]
World Wide Web Consortium (W3C).
XQuery 1.0 and XPath 2.0 Data Model,
ed. Mary Fernández et al.
W3C Working Draft 23 July 2004.
Published by the
World Wide Web Consortium at
http://www.w3.org/TR/xpath-datamodel/,
2004.