
Ten years of XML, twenty years of SGML

Report from an unexpected success

C. M. Sperberg-McQueen

Budapest, 22 February 2008

Overview

Ancient history: SGML

  • Developed in the publishing industry
  • 1967 Bill Tunnicliffe, Graphic Communications Association
    GenCode (generic coding)
  • 1960s/'70s Charles Goldfarb et al.
    GML Generalized Markup Language
    (aka Goldfarb, Mosher, Lorie)
  • 1970s/80s ANSI, ISO, ...
    SGML Standard Generalized Markup Language
    (aka Swede, Goldfarb, Mosher, Lorie)

SGML: a niche technology

  • Very general
  • Very abstract
    • Elements, attributes, tags
    • <title> ... </title>
    • but also {title} ... {.title}
    • or .title ... .etitle
    • or ...
  • Very complex or complicated
    • Minimization techniques (tag omission, tag abbreviation, tags signaled by arbitrary character patterns)
    • Unconventional grammar, non-standard description of parser
    • Ambiguous grammar (hundreds of reduce/reduce conflicts)

Why SGML never failed

  • Very general
  • All power to the information creator / owner!
  • No SGML users ever went back

What did we want?

  1. Essential virtues of SGML
    • Not a single tag set / language, but a metalanguage
    • Declarative, not imperative
    • Descriptive / logical, not format-oriented
    • Reflects reality
      or at least our understanding of reality
    • Adamic naming
    • The three-legged stool
      • Simple serial data format
      • Obvious data structure (tree)
      • Validation*

Design goals

  1. XML shall be straightforwardly usable over the Internet.
  2. XML shall support a wide variety of applications.
  3. XML shall be compatible with SGML.
  4. It shall be easy to write programs which process XML documents.
  5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
  6. XML documents should be human-legible and reasonably clear.
  7. The XML design should be prepared quickly.
  8. The design of XML shall be formal and concise.
  9. XML documents shall be easy to create.
  10. Terseness in XML markup is of minimal importance.

Predecessors / prior art

  • Basic SGML (ISO 8879)
  • Minimal SGML (ISO 8879)
  • Lexical Analyzer for HTML and Basic SGML (Dan Connolly)
  • Minimal Generalized Markup Language (Tim Bray)
  • Normalised SGML (Henry Thompson et al.)
  • Poor Folks' SGML
  • SGML Lite (Bert Bos)
  • SGML Online (Eliot Kimber)
  • TEI Interchange Format (Sperberg-McQueen, Burnard et al.)

XML

  • Document structure same as SGML
  • Everything is delimited
    • Documents
    • Elements
    • Attribute values
    • Tags
    • Entity references
  • Every XML document is at the same time an SGML document.
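
A minimal sketch of what "everything is delimited" buys a programmer, assuming Python's standard xml.etree.ElementTree module as the parser. The sample document and the helper name outline are made up for illustration; they are not from the talk. Because every element, attribute value, and entity reference has explicit delimiters, the tree can be recovered with no DTD and no guessing:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: every element has a start and end tag,
# the attribute value is quoted, and the entity reference ends with ';'.
doc = '<talk lang="en"><title>XML &amp; SGML</title><place>Budapest</place></talk>'


def outline(elem, depth=0):
    """Return the element tree as a list of indented element names."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(doc)          # no DTD is needed to find the structure
tree_lines = outline(root)         # one line per element, indented by depth
title_text = root.find("title").text
language = root.get("lang")

print("\n".join(tree_lines))
```

The same input with an omitted end tag or an unquoted attribute value would be legal under some SGML declarations, but an XML parser rejects it.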

The plan

  • SGML → XML
  • DSSSL → XSL (FO), XSLT
  • HyTime → XLink

What actually happened

  • XML
  • XSLT, XSL FO
  • XLink, XPointer
  • XPath
  • XQuery
  • XML Schema, Update, Full Text
  • ¿Efficient XML Interchange?
  • RDF/XML
  • XHTML
  • MathML
  • ...

Things I've felt twinges over

  • no CAPACITIES,
    so no minimal requirements for software
  • no IGNORE keyword for marked sections
    so harder to work with conditional text

Twinges (2)

  • Difference between WF parsers and full (validating) parsers grew too wide
  • standalone declaration not used
  • non-standard characters (Unicode Private Use Area) ill supported
  • PUBLIC identifiers not well supported
  • Missing entity declaration is WF error (implications for XSD)
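
The last point can be seen in a short sketch, assuming Python's xml.etree.ElementTree (built on the non-validating expat parser). The entity name org and both sample documents are invented for illustration. In a document with no DTD at all, a reference to an undeclared entity is a well-formedness error, so the parser stops instead of passing the reference through:

```python
import xml.etree.ElementTree as ET

# Hypothetical documents: the first declares the entity in its internal
# subset, the second refers to it without any declaration.
declared = '<!DOCTYPE note [<!ENTITY org "W3C">]><note>&org;</note>'
undeclared = '<note>&org;</note>'

# The internal-subset declaration is read even by a non-validating parser.
expanded = ET.fromstring(declared).text

# Without a declaration the document is not well-formed.
try:
    ET.fromstring(undeclared)
    error = None
except ET.ParseError as exc:
    error = exc

print(expanded)
print("undeclared entity rejected:", error is not None)
```

One consequence is that a schema language without entity declarations cannot, on its own, make the second document acceptable to a parser.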

All's well that ends well

  • XML's ‘draconian’ error handling
  • Quotation marks for attribute values
  • No SHORTREF, DATAREF, CONCUR, etc.
  • Informal goals:
    • Maximum 20 pages of normative text.
    • A decent programmer should be able to write a parser in a week.
  • Unicode+
  • Parsing without a DTD
  • Well-formedness / validity
  • URIs as system identifiers
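
Several of these points (draconian error handling, quoted attribute values, parsing without a DTD) can be illustrated together. The following is a sketch only, assuming Python's xml.etree.ElementTree; the helper is_well_formed and the sample strings are hypothetical. A conforming parser reports every well-formedness error as fatal instead of guessing what the author meant:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text, False on any fatal error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples; none of them has a DTD.
samples = {
    "quoted attribute, closed tags": '<p class="note">ok</p>',
    "unquoted attribute value": '<p class=note>ok</p>',
    "omitted end tag": '<p class="note">ok',
    "overlapping elements": '<p><b>ok</p></b>',
}

results = {label: is_well_formed(text) for label, text in samples.items()}

for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```

Only the first sample is accepted. The check says nothing about validity, which would additionally require a DTD or schema and a validating processor.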

Should people develop their own XML vocabularies?

Tim Bray argues not:
  • Neither easy nor fun
  • Pass/Fail ratio (success is doubtful)
  • Software pain
  • Restating Metcalfe (network effects come if we all use a few languages)
  • Opportunity cost
  • The Big Five do what you need: XHTML + Microformats, DocBook, ODF, UBL, and Atom.

A vocabulary of one's own? (2)

And yet:
  • It is fun.
  • Competitive advantage
  • Network effects (XML)
  • Opportunity cost comes in many forms.
  • Microformats are also vocabularies
    (and interact with the host language in complicated ways)
  • The essential is often the unique and cannot be handled adequately by cookie-cutter analysis.

When you develop vocabularies

  • Technical knowledge is good.
    Fidelity to reality (as you understand it) is indispensable.

What can we learn?

  • Backwards and forwards compatibility are important,
    but they are not the only thing
  • Never be afraid to think it through again, from first principles!
  • Know your data!
  • It is often possible to find an equilibrium between generality and specificity.