What is XML and Why Should Humanists Care?

C. M. Sperberg-McQueen (University of Illinois at Chicago)

[Abstract for a talk at] DRH '97, Oxford, September 1997

The Extensible Markup Language (XML) is a subset of the Standard Generalized Markup Language (SGML) intended to make it more usable for distributing materials on the World Wide Web. XML differs from SGML primarily in simplifying the sometimes intimidating formalisms of SGML in order to ensure that an XML parser is simple enough to embed in even lightweight software, including Web browsers. It differs from HTML primarily in allowing the user to specify new tags, marking types of elements not foreseen in the HTML specification, and making it possible for common off-the-shelf browsers and other software to handle such user-defined element types usefully.

In XML, for example, the publisher of a scholarly edition can define element types like <del> or <add> for marking manuscript cancellations or interlineations. Under control of the style sheet, these could be marked with brackets and arrows, displayed in special colors, or otherwise made visible to the reader; an alternate style sheet might omit the cancelled material entirely and include the inserted material silently, thus creating a sort of "fair copy" of the document. In HTML, there are no element types for registering cancellations or interlineations, and it is unlikely that there ever will be. If a Web publisher wishes to allow the user two or more alternative views of the same document, then two distinct HTML forms of the same material must be available. These may be prepared in advance or generated on the fly; either method makes online publication of complex material more complicated and places a heavier burden on the resource provider, compared with the SGML/XML model.
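The idea of deriving two views from one marked-up source can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the XML specifications: the sample line, the render function, and the two dictionaries standing in for style sheets are all hypothetical, and a real browser would read a proper style-sheet language rather than a Python dictionary.

```python
import xml.etree.ElementTree as ET

# A hypothetical manuscript line with one cancellation and one interlineation.
SOURCE = "<line>The <del>quick</del><add>swift</add> fox ran.</line>"

# Two stand-ins for style sheets (an assumption for illustration): each maps
# an element type to a function that turns the element's content into display text.
DIPLOMATIC = {
    "del": lambda text: "[" + text + "]",   # show cancelled text in brackets
    "add": lambda text: "^" + text + "^",   # flag inserted text with carets
}
FAIR_COPY = {
    "del": lambda text: "",                 # omit cancelled text entirely
    "add": lambda text: text,               # include inserted text silently
}

def render(element, style):
    """Render an element and its descendants as plain text under a style."""
    parts = [element.text or ""]
    for child in element:
        parts.append(render(child, style))
        parts.append(child.tail or "")      # text following the child element
    content = "".join(parts)
    rule = style.get(element.tag)           # element types without a rule pass through
    return rule(content) if rule else content

root = ET.fromstring(SOURCE)
print(render(root, DIPLOMATIC))  # The [quick]^swift^ fox ran.
print(render(root, FAIR_COPY))   # The swift fox ran.
```

The source document is stored once; only the style mapping changes between the two views.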

The paper will outline briefly the structure and current status of the XML specifications, before illustrating XML usage with a series of concrete examples showing various aspects of XML markup and how it will work in practice. The examples should make clear what XML can do that existing HTML and SGML systems cannot do.

Consider a text in modern English, presented approximately the way that an ancient manuscript or inscription might have rendered it:

FOURSCOREANDSEVENYEARSAGOOURFOREFATHERSB
ROUGHTFORTHONTHISCONTINENTANEWNATIONCONC
EIVEDINLIBERTYANDDEDICATEDTOTHEPROPOSITI
ONTHATALLMENARECREATEDEQUALNOWWEAREENGAG
EDINAGREATCIVILWARTESTINGWHETHERTHATNATI
ONORANYNATIONSOCONCEIVEDANDSODEDICATEDCA
NLONGENDURE

Using the markup techniques of white space, punctuation, and case distinctions which have become standard in European languages, the text is significantly more readable:

Four score and seven years ago,
our forefathers brought forth
on this continent a new nation,
conceived in liberty and
dedicated to the proposition
that all men are created equal.
Now we are engaged in a great civil war,
testing whether that nation
or any nation so conceived
and so dedicated
can long endure.

Conventions like punctuation and standard layout serve to make the structure of the text more explicit, thus making reading easier. Like most formal systems intended for human readers, punctuation and the division of text into words can allow themselves occasional ambiguities and uncertainties without becoming pointless. Text software is much worse than humans at handling ambiguity and uncertainty, however, and experience has shown that software can do a much better job of processing texts if explicit markup schemes are used to reduce the ambiguity and uncertainty of textual structure. Virtually all text-processing software (concordance packages, word processors, Web browsers, indexers, information retrieval systems, ...) uses some sort of markup, some of it in the form of undocumented binary codes embedded in a file and some in the form of human-readable codes in publicly documented forms. XML is concerned with the latter form.

Using explicit markup to indicate the beginnings and endings of paragraphs and sentences, our example might look something like this:

<p><s>Four score and seven years ago,
our forefathers brought forth
on this continent a new nation,
conceived in liberty and
dedicated to the proposition
that all men are created equal.</s>
<s>Now we are engaged in a great civil war,
testing whether that nation
or any nation so conceived
and so dedicated
can long endure.</s></p>

The markup in this example takes the form of start-tags (<p>, <s>) marking the beginnings of structures, and end-tags (</p>, </s>) marking their ends. In XML, as in SGML and HTML, all such structures must nest properly within containing structures. (Many HTML processors do not enforce this rule, though the HTML specification is quite clear on the topic.)
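The nesting rule is easy to check mechanically. The sketch below assumes the parser in Python's standard xml.etree.ElementTree module, which the text itself does not mention; the helper name is_well_formed and the shortened sample strings are illustrative only.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if the markup parses as well-formed XML, else False."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Both sentences nest properly inside the paragraph.
nested = "<p><s>Four score and seven years ago.</s><s>Now we are engaged.</s></p>"

# The paragraph is closed before the sentence it contains: improper nesting.
overlapping = "<p><s>Four score and seven years ago.</p></s>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```

An XML parser rejects the second string outright, whereas a lenient HTML processor might silently accept comparable input.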

Even such a simple example as the one just given illustrates the main shortcoming of HTML for Web publication of scholarly resources. HTML defines a single finite vocabulary of textual structures, which (in its current state) has no way of expressing the claim "This is a sentence." Since HTML is intended primarily for driving browsers, and since conventional typography marks sentence boundaries only with punctuation and not with layout, HTML has no pressing need for sentence-boundary tags. For many scholarly purposes, however, including linguistic analysis and the presentation of material in interactive concordance systems, sentence boundaries are extremely useful and important. It would be convenient to be able to mark them explicitly, so that software could make use of them where appropriate.
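As a rough illustration of what software can do once sentence boundaries are explicit, here is a sketch of a sentence-level lookup of the kind a concordance system might perform. It assumes the <p> and <s> markup of the example text and Python's standard ElementTree parser; the function name sentences_containing is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEXT = """<p><s>Four score and seven years ago,
our forefathers brought forth
on this continent a new nation,
conceived in liberty and
dedicated to the proposition
that all men are created equal.</s>
<s>Now we are engaged in a great civil war,
testing whether that nation
or any nation so conceived
and so dedicated
can long endure.</s></p>"""

def sentences_containing(markup, word):
    """Return the text of every <s> element in which word occurs."""
    root = ET.fromstring(markup)
    # Match the word as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for s in root.iter("s"):
        # Collect the sentence text and collapse line breaks into single spaces.
        text = " ".join("".join(s.itertext()).split())
        if pattern.search(text):
            hits.append(text)
    return hits

for hit in sentences_containing(TEXT, "liberty"):
    print(hit)
```

Because the sentence is an explicit element, the program never has to guess at boundaries from punctuation.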

It is not feasible, however, even if it were thought wise, to extend HTML to include every textual feature of possible interest to scholarship. It is politically infeasible because most HTML browser users and makers have no serious interest in textual scholarship. It is intellectually infeasible because no finite vocabulary of textual features can ever completely register everything of possible interest to scholarship. (The Text Encoding Initiative has made an effort to register the features most widely agreed to be of interest, but even it is inherently incomplete, and acknowledges that fact by building into itself an elaborate mechanism to allow users to define their own new element types.)

Web browsers are currently built to handle the fixed vocabulary of HTML, perhaps with a few proprietary extensions. Because the vocabulary is fixed in advance, the browser developer can hard-code the special handling of each element type and build it into the browser at compile time. Abandoning the notion of a finite vocabulary would require a more complex browser that can load such information at run time, in the form of a style sheet. As it happens, of course, browser developers are currently working to provide precisely this capability -- extending it to support XML is a relatively simple step for any browser with full style-sheet support.

The implications of extensible markup vocabularies, and thus of XML, for humanists and resource providers are enormous. Text-bases prepared in SGML (e.g. using the TEI encoding scheme) need no longer be translated down into HTML for distribution, which means both that publication is simpler and that information need no longer be lost in the process of publication. Special resources (e.g. resource catalogs) can be distributed directly in XML, allowing client-side software (e.g. Java applets) to search them and display them appropriately on the user's machine; it is no longer necessary to do all the searching and manipulation of databases and similar resources on the server. Instead, the client can perform useful work with the data, since the data can be sent to the client without stripping out most of the useful structure. XML, some experts predict, "will give Java something to do."

It will also give the providers of digital resources for the humanities a great deal to do, and tools to do it with.