Ten years of XML, twenty years of SGML
Report from an unexpected success
C. M. Sperberg-McQueen
Budapest, 22 February 2008
Overview
Prehistory and beginnings of SGML and XML
What we planned
What actually happened
Retrospective wisdom: decisions that look right
Things that didn't work out
What can we learn?
Ancient history: SGML
Developed in the publishing industry
1967 Bill Tunnicliffe, Graphic Communications Association
GenCode (generic coding)
1960s/'70s Charles Goldfarb et al.
GML Generalized Markup Language
(aka Goldfarb, Mosher, Lorie)
1970s/'80s ANSI, ISO, ...
SGML Standard Generalized Markup Language
(aka Swede, Goldfarb, Mosher, Lorie)
SGML: a niche technology
Very general
Very abstract
Elements, attributes, tags
<title> ... </title>
but also
{title} ... {.title}
or
.title ... .etitle
or ...
Very complex
or complicated
Minimization techniques (tag omission, tag abbreviation, tag using arbitrary patterns)
Unconventional grammar, non-standard description of parser
Ambiguous grammar (hundreds of reduce/reduce conflicts)
Why SGML never failed
Very general
All power to the information creator / owner!
No SGML users ever went back
What did we want?
Essential virtues of SGML
Not a single tag set / language, but a metalanguage
Declarative, not imperative
Descriptive / logical, not format-oriented
Reflects reality
or at least our understanding of reality
Adamic naming
The three-legged stool
Simple serial data format
Obvious data structure (tree)
Validation*
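The first two legs of the stool can be illustrated with a small sketch: a serial string goes in, and a tree comes out with no application-specific parsing code. This uses Python's standard xml.etree.ElementTree module; the sample document and its element names are made up for illustration. The third leg is not shown, because ElementTree does not validate.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
SERIAL = "<talk><title>Ten years of XML</title><slide n='1'>Overview</slide></talk>"


def outline(element, depth=0):
    """Return one indented line per element, showing the tree structure."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SERIAL)  # serial character data in, tree out
tree_lines = outline(root)
print("\n".join(tree_lines))
```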
Design goals
XML shall be straightforwardly usable over the Internet.
XML shall support a wide variety of applications.
XML shall be compatible with SGML.
It shall be easy to write programs which process XML documents.
The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
XML documents should be human-legible and reasonably clear.
The XML design should be prepared quickly.
The design of XML shall be formal and concise.
XML documents shall be easy to create.
Terseness in XML markup is of minimal importance.
Predecessors / prior art
Basic SGML (ISO 8879)
Minimal SGML (ISO 8879)
Lexical Analyzer for HTML and Basic SGML (Dan Connolly)
Minimal Generalized Markup Language (Tim Bray)
Normalised SGML (Henry Thompson et al.)
Poor Folks' SGML
SGML Lite (Bert Bos)
SGML Online (Eliot Kimber)
TEI Interchange Format (Sperberg-McQueen, Burnard et al.)
XML
Document structure same as SGML
Everything is delimited
Documents
Elements
Attribute values
Tags
Entity references
Every XML document is at the same time an SGML document.
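One way to see what "everything is delimited" means in practice is to list the events a streaming parser reports: every element produces an explicit start and an explicit end, with nothing inferred. The sketch below uses iterparse from Python's standard library on a made-up document; it is an illustration, not part of the original talk.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical two-paragraph document; names are illustrative only.
DOC = b'<doc><p class="a">one</p><p>two</p></doc>'

# Collect (event, tag) pairs in document order.
events = [
    (event, element.tag)
    for event, element in ET.iterparse(io.BytesIO(DOC), events=("start", "end"))
]

for event, tag in events:
    print(event, tag)
```

Each "start" is matched by its own "end"; in SGML with tag omission, a parser would need the DTD to work out where the first p stops.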
The plan
SGML
→ XML
DSSSL
→ XSL (FO), XSLT
HyTime
→ XLink
What actually happened
XML
XSLT, XSL FO
XLink, XPointer
XPath
XQuery
XML Schema, Update, Full Text
¿Efficient XML for Interchange?
RDF/XML
XHTML
MathML
...
Things I've felt twinges over
no CAPACITIES, so no minimal requirements for software
no IGNORE keyword for marked sections, so harder to work with conditional text
Twinges (2)
Difference between WF parsers and full (validating) parsers grew too wide
standalone declaration not used
non-standard characters (Unicode Private Use Area) ill supported
PUBLIC identifiers not well supported
Missing entity declaration is WF error (implications for XSD)
All's well that ends well
XML's ‘draconian’ error handling
Quotation marks for attribute values
No SHORTREF, DATAREF, CONCUR, etc.
Informal goals:
Maximum 20 pages of normative text.
A decent programmer should be able to write a parser in a week.
Unicode+
Parsing without a DTD
Well-formedness / validity
URIs as system identifiers
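Several of the points above (draconian error handling, quoted attribute values, parsing without a DTD) can be observed with any conforming non-validating parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name is_well_formed and the sample documents are invented for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False on any well-formedness error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# No DTD anywhere: the document still parses.
ok = is_well_formed('<note lang="en">hello</note>')
# Unquoted attribute value: rejected outright, not repaired.
unquoted = is_well_formed('<note lang=en>hello</note>')
# Missing end tag: also rejected.
unclosed = is_well_formed('<note lang="en">hello')

print(ok, unquoted, unclosed)
```

The parser makes no attempt to guess what the author meant in the second and third cases, which is the behavior the "draconian" label refers to.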
Should people develop their own XML vocabularies?
Tim Bray argues not:
Neither easy nor fun
Pass/Fail ratio (success is doubtful)
Software pain
Restating Metcalfe (network effects come if we all use a few languages)
Opportunity cost
The Big Five do what you need: XHTML + Microformats, DocBook, ODF, UBL, and Atom.
A vocabulary of one's own? (2)
And yet:
It is fun.
Competitive advantage
Network effects (XML)
Opportunity cost comes in many forms.
Microformats are also vocabularies
(and interact with the host language in complicated ways)
The essential is often the unique and cannot be handled adequately by cookie-cutter analysis.
When you develop vocabularies
Technical knowledge is good.
Fidelity to reality (as you understand it) is indispensable.
What can we learn?
Backwards compatibility, forwards compatibility are important
but not the only thing
Never be afraid to think it through again, from first principles!
Know your data!
It is often possible to find an equilibrium between generality and specificity.