Michael Sperberg-McQueen his document list

C. M. Sperberg-McQueen

This document list points to some of the documents I make available in this Web space.

Documents
XSLT stylesheets (for slides, for displaying a PSVI)

Documents

A small test suite for (one version of) the purchase-order schema in the XML Schema 1.0 Primer; prepared in connection with the work on schemas and definite clause translation grammars listed below.
Notes on finite state automata with counters, a paper (still unpolished and in some respects unfinished) trying to work out what finite-state automata would be like if they were augmented with counters, in ways analogous to the extension of regular expressions to allow them to use integer exponents instead of just the Kleene star. Someone has to have done this before me, but I haven't found it. Not touched since May 2004; if I'm not going to finish and polish it anytime soon, it may as well be visible in its current state.
A brief introduction to definite clause grammars and definite clause translation grammars, a working paper prepared for the W3C XML Schema Working Group, January 2004.
A definite-clause grammar representation of an XSD schema, a working paper prepared for the W3C XML Schema Working Group, January 2004.
Trip report: XML 2003, Philadelphia, December 2003.
Two posters from XML 2003, December 2003: “Why would you want to perform the world's most expensive identity transform?” and “How being schema-valid is different from being pregnant”.
Logic grammars and XML Schema. Paper for Extreme Markup Languages 2003, Montréal. (An abbreviated form of the paper Notes on logic grammars and XML Schema, which I never quite completed in its original form and which has now been split into several papers.)
XML Schema 1.0: A language for document grammars, slides for a talk I gave at ACH/ALLC 2003 (see the trip report listed below).
Trip report: ACH/ALLC 2003 / Web X: a decade of the World Wide Web, Annual joint conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing, Athens, Georgia, 29 May - 2 June 2003.

XML and related topics

A dialog on surrogate characters in XML, 21 March 2007. An answer to an inquiry; of interest for character set geeks, XML casuists, and possibly others.
Are C1 characters legal in XHTML 1.0?, 23 March 2007. I seem to be stuck on character-set problems this week.

XML Schema-related topics

I include a couple of papers here for which I am not, technically, the author, but in which I have a certain paternal interest.

Sevastopol: An XSD schema represented as a definite-clause translation grammar, a working paper prepared for the W3C XML Schema Working Group, January 2004 - October 2005.
How schema-validity is different from being married, a paper submitted to XML 2005 in Atlanta, November 2005.
Applications of Brzozowski derivatives to XML Schema processing, a paper given at Extreme Markup Languages 2005 in August. A brief summary of what Brzozowski derivatives are, followed by discussions of how to use them in validation (they allow validation of content models which cannot be translated into finite state automata without prohibitive cost), checking content model determinism (they allow a precise definition of the determinism rule in terms of content models, without any detour through finite state automata), checking subsumption relations between content models, and other applications. In general, Brzozowski derivatives are a beautiful and elegant tool that deserves to be better known. (If special characters like ε and ∅ do not display usefully [i.e. as a Greek epsilon and the empty-set symbol] in your browser, you'll want the asciified version of the paper.) The text is available (styled differently) from the Extreme Markup Languages proceedings online in XML, HTML, and PDF. The slides are also available.
XML Schema validation outcomes, a tabular display created together with Henry Thompson and Richard Tobin.
Notes on schema resolution and Notes on schema-validation results, both of which are draft notes on topics relating to XML Schema, expanded from some old postings on email lists.
An unfinished paper on S-expressions for XML documents and for XML Schema components (rev. August 2001), which attempts to rewrite the rules for XML Schema component structure and document validation in Lisp, using S-expressions to represent the components and defining functions to capture the meaning of key terms in XML Schema. (Currently accessible only to W3C members, sorry. I'll make it public when it's further along. [If ever. It doesn't look as if I'm ever going to get back to this.])
Context-sensitive rules in XML Schema (February 2000, rev. April 2000). (XML version) The title is a slight misnomer: what is offered is an example of limited non-local effects, not really arbitrary context-sensitivity. There is a clear tradeoff between the limited awareness of context and the size of the grammar (approximately: one bit of context-sensitivity requires doubling the size of the affected portion of the grammar). I hope to clean this up and publish it sometime.
Replicating DTD Functionality Using XML Schema (April 2000).
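
To make the idea of validating with Brzozowski derivatives (mentioned in the entry on the Extreme 2005 paper above) a little more concrete, here is a minimal sketch in Python. It is not taken from the paper: the class names, the functions nullable, deriv and matches, and the sample content model are all assumptions made for illustration. The derivative of an expression with respect to a symbol is the expression that matches whatever may follow that symbol; a sequence of child elements is valid if, after taking derivatives for each child in turn, the resulting expression accepts the empty sequence.

```python
from dataclasses import dataclass


class Re:
    """Base class for content-model expressions (hypothetical names throughout)."""


@dataclass(frozen=True)
class Empty(Re):   # the empty set: matches nothing
    pass


@dataclass(frozen=True)
class Eps(Re):     # epsilon: matches only the empty sequence
    pass


@dataclass(frozen=True)
class Sym(Re):     # a single element name
    name: str


@dataclass(frozen=True)
class Seq(Re):     # left followed by right
    left: Re
    right: Re


@dataclass(frozen=True)
class Alt(Re):     # left or right
    left: Re
    right: Re


@dataclass(frozen=True)
class Star(Re):    # zero or more repetitions of inner
    inner: Re


def nullable(r):
    """True if r accepts the empty sequence."""
    if isinstance(r, (Eps, Star)):
        return True
    if isinstance(r, (Empty, Sym)):
        return False
    if isinstance(r, Seq):
        return nullable(r.left) and nullable(r.right)
    return nullable(r.left) or nullable(r.right)  # Alt


def deriv(r, a):
    """Derivative of r with respect to the symbol a."""
    if isinstance(r, (Empty, Eps)):
        return Empty()
    if isinstance(r, Sym):
        return Eps() if r.name == a else Empty()
    if isinstance(r, Alt):
        return Alt(deriv(r.left, a), deriv(r.right, a))
    if isinstance(r, Star):
        return Seq(deriv(r.inner, a), r)
    # Seq: consume a from the left part; if the left part may be empty,
    # a may also be consumed by the right part.
    first = Seq(deriv(r.left, a), r.right)
    if nullable(r.left):
        return Alt(first, deriv(r.right, a))
    return first


def matches(r, symbols):
    """Validate a sequence of element names against the content model r."""
    # No simplification is applied, so expressions grow with the input;
    # a serious implementation would normalize after each step.
    for a in symbols:
        r = deriv(r, a)
    return nullable(r)


# Hypothetical content model: (title, author*, (para | list)*)
MODEL = Seq(Sym("title"),
            Seq(Star(Sym("author")), Star(Alt(Sym("para"), Sym("list")))))

print(matches(MODEL, ["title", "author", "para", "list"]))
print(matches(MODEL, ["title", "para", "author"]))
```

No automaton is ever built: the expression itself serves as the state, which is why the technique remains usable for content models whose automata would be prohibitively large.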

Markup theory

See also the talk given at ACH/ALLC 2001.

Notes on schema annotation (February and April 2002). Some general thoughts on what the actual problem is and what a solution might look like, in an attempt to define terms clearly enough to allow useful discussion about what should be done.
Meaning and Interpretation of Markup, by C. M. Sperberg-McQueen, Claus Huitfeldt, and Allen Renear. Published in Markup Languages: Theory & Practice 2.3 (2000): 215-234.
GODDAG: A Data Structure for Overlapping Hierarchies, by C. M. Sperberg-McQueen and Claus Huitfeldt. Paper given at Principles of Digital Document Processing 2000, Munich, September 2000. Published in DDEP-PODDP 2000, ed. P. King and E.V. Munson, Lecture Notes in Computer Science 2023 (Berlin: Springer, 2004), pp. 139-160.

Talks

Mostly slides from talks; in a few cases, post-hoc expansions or transcriptions from tape. In some cases, you can get a reasonable idea of what I said from my slides, and in other cases, not.

XML vocabulary design and specification Using W3C XML Schema 1.0, slides from talk during the 'training track' at XML 2007, sponsored by IDEAlliance, Boston, 5 December 2007.
Meaning and interpretation of markup: a report on the Bechamel Project, slides from talk sponsored by the W3C German/Austrian Office at the Fraunhofer Gesellschaft Institut Medienkommunikation in Sankt Augustin, Germany, 1 October 2004.
What does XML have to do with Immanuel Kant?, slides from talk at Net.Object Days 2004, Erfurt 29 September 2004.
Semantic interpretation of XML documents, slides from talk at Modelling linguistic information resources / Modellierung sprachlicher Informationsressourcen, workshop organized by the Zentrum für interdisziplinäre Forschung (ZiF) of the Universität Bielefeld, 12-14 January 2004.
Perspectives on XML and related standards, slides from talk at Korpuslinguistik deutsch: synchron, diachron, kontrastiv, University of Würzburg, 22 February 2003.
What matters? (August 2002). Closing remarks at the conference Extreme Markup Languages 2002.
Slides from The TEI is dead; Long live the TEI (November 2001), the opening keynote at the first members meeting of the TEI Consortium, held in Pisa 16-17 November 2001 (printer-friendlier version).
Slides from A gentle introduction to XML Schema and XML document grammars (October 2001), a talk I gave as part of a series of seminars and talks organized by the HIT Centre (Humanities Information Technology Research Programme) at the University of Bergen, where I am a guest researcher. (XML version, printer version). Mostly this is a very slightly revised version of the slides from Darmstadt earlier this month, but I've added another extended example based on some software in Bergen.
Slides from Constrain early and often: XML Schema and the definition of document grammars for XML vocabularies (October 2001), a talk I gave at the Fraunhofer Institut für integrierte Publikations- und Informationssysteme in Darmstadt. Thanks to my host, Peter Fankhauser, who argues that it would be really cool to have a well-defined method for providing type labeling on an XML instance independent of any schema or schema validation. (printer version)
Slides from The World Wide Web Consortium and Standards (June 2001), a talk I gave on W3C work relevant to computing in the humanities, in a session at ACH/ALLC 2001, the annual conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing.
Slides from a talk on Practical Extraction of Meaning from Markup I gave at ACH/ALLC 2001 (June 2001), reporting on work being done together with Claus Huitfeldt (University of Bergen) and Allen Renear (University of Illinois at Urbana/Champaign).
What is XML and Why Should Humanists Care?, an abstract for a talk delivered at the Digital Resources for the Humanities DRH '97 conference in Oxford, September 1997.
Back to the Frontiers and Edges, Closing Remarks at SGML '92: the quiet revolution, sponsored by the Graphic Communications Association. Danvers, Massachusetts, 29 October 1992. This got the TEI document number ED W31. See also the trip report, below.

Trip reports

Trip report: Korpuslinguistik deutsch: synchron, diachron, kontrastiv, University of Würzburg, 20-23 February 2003.
ED R2: Trip report from SGML '92. See also my talk, above.
ED R3: Trip report from ASIS '93.
ED R4: Trip report from Tokyo and Boston, December 1993.
ED R5: Trip report, Lisbon workshop July 1994.

These documents were done as part of my work on the Text Encoding Initiative. (A lot of others will eventually appear here, when I get around to converting them into XML and making HTML versions to put here.)

ED P1: Design Principles for Text Encoding Guidelines (1988, rev. 1990). An early attempt to enunciate the ground rules for the TEI.
ED P3: Theoretical Stance and Resolution of Theory Conflict (1989). A statement of the problem of arriving at consensus in fields where different people take very different theoretical stances. It had no visible effect on most of the TEI working groups, but at least it was useful for me.
ED W03: SGML Problems for Research (1988). A transcription of my notes for a talk at SGML '88. Not complete: the notes may cover a quarter of what I said, but not much more. (In particular, the example from the Glossa ordinaria is not here.)
ED W05: Notes on Features and Tags (1989). Lou Burnard and I wrote the first draft of this over a quiet weekend in Luxembourg; this paper introduced the notion of the `Waterloo DTD', which (as I later discovered) infuriated some of our colleagues in Waterloo, who were unhappy at the notion that they used any kind of DTD at all. (It was intended as a compliment.)
ED W12: Tagging parts of speech (1990). A study of various ways of marking parts of speech in SGML, based on a careful study of the part-of-speech tagging in the Lancaster/Oslo/Bergen (LOB) corpus.
AI1 W02: List of common morphological features for inclusion in TEI Starter Set of grammatical-annotation tags (1991). A set of lists of word classes and features which some Western European languages express morphologically, and which may therefore be useful in linguistic annotation of Western European languages. I helped draft this, but the intellectual responsibility lies with the linguistic-annotation working group (AI 1) of the TEI (members included Terry Langendoen, Stephen Anderson, Geoffrey Sampson, Nicoletta Calzolari, and Gary Simons). As a result, this is linguistically much more sophisticated than ED W12.

Other

A URI recognizer: a transcription into definite-clause grammar of the ABNF productions for URIs and URI references in RFC 3986. (ABNF is defined in RFC 2234.)
An unfinished paper on Namespaces and RDDL (September 2001).
SWeb: an SGML Tag Set for Literate Programming (1993, rev. 1994, 1995, 1996). A set of tags for literate programming, designed to be used as a DTD module, in combination with some tag set for document structure, etc., such as TEI Lite or DocBook or HTML. An XML version of the combined SWeb + TEI Lite DTD (swebxml.dtd) is available, as is a display stylesheet (swebtohtml.xsl, which relies on tltohtml.xsl being in the same directory). A simple tangle processor in XSLT has replaced swebyacc, my earlier yacc+lex based program. It took a couple of hours to implement, as compared to several weeks' intermittent work for swebyacc.
Working documentation for the Web ORB and Oasis (WOO) system (1998-1999), the last production database I worked on before leaving the University of Illinois at Chicago. Not likely to be of general interest, though I'm still proud of parts of it.
AIK L2: a letter to the staff of Amalgamated Interkludge (AIK) from You Know Who, dated 8 November 1993. Amalgamated Interkludge was (is?) a fictional organization whose products, policies and marketing slogans ("AIK: we have hammers of many sizes") were a topic of humorous discussion among members of the technical staff in the organization where I worked at the time.
This piece was a way of letting off steam about issues of quality control (in particular, the apparent determination of management to prevent programmers from exerting any); the names used are either those of well-known computer scientists or of colleagues (now: former colleagues). It was distributed anonymously, but often quoted. My colleague John Andrews did actually use the Andrews test, though he used it as a milestone, rather than a completion test, and when I wrote this Fred Damen was well on his way to becoming a figure of legend, though there is no record that he ever used the Damen test.
A directed-graph data structure for text manipulation, abstract for a talk given at the 1989 ICCH/ALLC conference The Dynamic Text at the University of Toronto. Describes the ‘Rhine Delta’ data structure for the representation of textual variation. I can't believe I was the first to invent this, but I have not found clear antecedents.

The formal paper I intended to write on this work was never finished, partly because other work was more pressing and partly because I never quite figured out how to reconcile the Rhine Delta with the non-linear representation of texts as trees or graphs which is implicit in SGML and XML. (There is a reconciliation of sorts in the TEI markup for textual variants, which can readily be used to generate Rhine Delta structures within single elements like paragraphs or lines of verse, but variations which span structural boundaries still feel like an unsolved problem.)
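
The abstract is the place to look for the Rhine Delta itself; what follows is only a minimal sketch, in Python, of the general idea of recording textual variation in a directed graph, and it should not be read as the data structure described in the talk. The class VariantGraph, its methods, and the crude alignment rule (two witnesses share a node when they have the same word at the same occurrence count) are all assumptions made for illustration. Each arc carries the sigla of the witnesses that follow it, so the witnesses run together where they agree and branch apart where they differ.

```python
from collections import defaultdict

# Artificial source and sink nodes; real word nodes have a count of 1 or more.
START, END = ("<start>", 0), ("<end>", 0)


class VariantGraph:
    """Hypothetical directed graph of textual variants across witnesses."""

    def __init__(self):
        # arcs[source][target] = set of witness sigla that follow that arc
        self.arcs = defaultdict(lambda: defaultdict(set))

    def add_witness(self, siglum, words):
        """Thread one witness through the graph from START to END."""
        seen = defaultdict(int)
        path = [START]
        for word in words:
            seen[word] += 1
            # Naive alignment: a node is (word, nth occurrence in this witness).
            path.append((word, seen[word]))
        path.append(END)
        for source, target in zip(path, path[1:]):
            self.arcs[source][target].add(siglum)

    def read(self, siglum):
        """Recover the text of one witness by following its arcs."""
        words, node = [], START
        while node != END:
            node = next(t for t, sigla in self.arcs[node].items()
                        if siglum in sigla)
            if node != END:
                words.append(node[0])
        return words

    def variation_points(self):
        """Nodes after which the witnesses diverge."""
        return sorted(n for n, targets in self.arcs.items()
                      if len(targets) > 1)


graph = VariantGraph()
graph.add_witness("A", "the quick brown fox".split())
graph.add_witness("B", "the quick red fox".split())
print(graph.read("B"))
print(graph.variation_points())
```

A structure like this is flat: it knows nothing about paragraphs, lines, or other element boundaries, which is one way of seeing why reconciling it with the tree model of SGML and XML is not straightforward.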

XSLT stylesheets

Slides in TEI Lite

There have been some requests for the XSL stylesheet I use for my slides. I write my slides (like almost everything else I write) in TEI Lite, and after years of using an aging copy of SoftQuad Panorama Pro to project them, have changed to using an XSLT stylesheet in an XSL-capable Web browser to do so. If you want to look at what I've done and how it works, here are some pointers:

TEI Lite DTD, hacked for XML (this is my copy; there is a more official copy at the TEI Consortium web site, which may vary in minor ways)
TEI Lite documentation
tltohtml.xsl, my base XSLT style sheet for translating TEI Lite into HTML
tlslides.xsl, the stylesheet which turns TEI div1 or top-level div elements into slides and adds the hyperlinks; for the rest, it imports the base tltohtml stylesheet. Set the browser locally to use a large font; otherwise the slide titles are smaller than the slide text (I haven't quite figured out CSS font-size properties, I guess).
sample slides, with HTML version produced by tlslides.xsl and printer-friendlier version produced by tltohtml.xsl.
A DTD for XSLT stylesheets; this works for stylesheets that use xsl:element instead of having literal result elements, and I use it to get better editing of stylesheets in emacs using PSGML.
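
For readers who want the gist of the slide transformation without reading the XSLT, here is a rough sketch in Python of the idea described for tlslides.xsl above: each div1 (or top-level div) becomes one slide, with hyperlinks to its neighbors. It is not a translation of the stylesheet. The function name slides_from_tei, the slide1, slide2, ... anchors, and the decision to handle only head and p children are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def slides_from_tei(xml_text):
    """Return one HTML fragment per top-level div1 (or div) in the TEI body."""
    body = ET.fromstring(xml_text).find("./text/body")
    divs = [el for el in body if el.tag in ("div1", "div")]
    slides = []
    for i, div in enumerate(divs):
        title = escape(div.findtext("head", default=""))
        # Only plain paragraphs are handled here; the real stylesheet
        # delegates everything else to the base TEI Lite stylesheet.
        paras = "".join(
            "<p>" + escape("".join(p.itertext())) + "</p>"
            for p in div.findall("p")
        )
        # Hypothetical anchors slide1, slide2, ... link neighboring slides.
        links = []
        if i > 0:
            links.append('<a href="#slide%d">previous</a>' % i)
        if i < len(divs) - 1:
            links.append('<a href="#slide%d">next</a>' % (i + 2))
        nav = " | ".join(links)
        slides.append(
            '<div id="slide%d"><h1>%s</h1>%s<p>%s</p></div>'
            % (i + 1, title, paras, nav)
        )
    return slides


SAMPLE = (
    "<TEI.2><text><body>"
    "<div1><head>One</head><p>Hello</p></div1>"
    "<div1><head>Two</head><p>World</p></div1>"
    "</body></text></TEI.2>"
)

for slide in slides_from_tei(SAMPLE):
    print(slide)
```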

When giving talks, I currently use Galeon to display my slides; the one drawback is that in order to persuade Galeon to display any graphics I have to do a batch transformation to HTML, rather than displaying directly from the XML. It's a pain, but a manageable pain. Before moving to Linux, I used Microsoft's Internet Explorer for displaying my slides, because of its built-in XSLT support. These stylesheets worked with version 6.0; I haven't tried them with earlier versions. (Before using IE for displaying slides, I used SoftQuad Panorama [pause for teary-eyed recollections], but those stylesheets are different.)

Display of the post-schema-validation infoset

As a training and debugging aid, I frequently use the options on the XSV and Xerces-J schema validators which cause them to write out an XML representation of the post-schema-validation infoset (PSVI), and then run that through XSLT to display the document in a Web browser. In the output, the text color reflects the value of the [validity] property on each element or attribute: green text means it's valid, red means invalid, amber means unknown. The background reflects the [validation attempted] property: white background means fully validated, light gray partially validated, dark gray unvalidated.
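
The color scheme amounts to two small lookup tables, sketched here in Python. The property values (valid, invalid, notKnown; full, partial, none) are those defined for the PSVI by XML Schema; the function name css_for and the particular CSS color values are assumptions made for illustration and need not match what psvihtml.xsl actually emits.

```python
# Text color, keyed by the value of the [validity] property.
TEXT_COLOR = {
    "valid": "green",
    "invalid": "red",
    "notKnown": "#ffbf00",   # amber (assumed hex value)
}

# Background color, keyed by the value of the [validation attempted] property.
BACKGROUND = {
    "full": "white",
    "partial": "lightgray",
    "none": "darkgray",
}


def css_for(validity, validation_attempted):
    """Return a CSS declaration string for one element or attribute."""
    return "color: %s; background-color: %s" % (
        TEXT_COLOR[validity],
        BACKGROUND[validation_attempted],
    )


print(css_for("valid", "full"))
print(css_for("notKnown", "none"))
```

Keeping the two properties on separate visual channels (text versus background) means every combination stays readable at a glance, for example red text on light gray for an invalid element that was only partially validated.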

For example, to use both XSV and Xerces-J to validate the XML file theduck.xml against the schema pointed to by its xsi:schemaLocation attribute, I say simply

psvixsv theduck.xml
psvixercesj theduck.xml

The scripts cause new tabs to open in my Web browser window with the two PSVI dumps in XML (for exploration) and the HTML versions of the PSVI (for visual examination).

The things I use for this are:

psvihtml.xsl: my XSLT stylesheet for transforming PSVI dumps into HTML. This is the only thing you really need; the other things listed here are merely for convenience in using this XSLT file.
shell scripts for invoking the processors with the right options to elicit a PSVI dump in XML form:
- psvixsv: for XSV
- psvixercesj: for Xerces-J (this script also uses sed to insert namespace declarations into the Xerces PSVI dump)
- xsv.xsl: an XSLT stylesheet I use to make XSV output display with slightly more detail (not essential to the display of the PSVI, but included because my script mentions it)
sample input:
- theduck.xml: an XML document
- tds.xsd: a sample schema, which makes parts of theduck.xml valid, some parts invalid, and some parts unknown.
sample output:
- theduck.xml.psvi.out.xerces.xml: sample output from Xerces (XML)
- theduck.xml.psvi.out.xerces.html: sample output from Xerces (HTML)
- theduck.xml.xsvout.xml: XSV diagnostic output for the sample input
- theduck.xml.psvi.out.xsv.xml: sample output from XSV (XML)
- theduck.xml.psvi.out.xsv.html: sample output from XSV (HTML).
Note that XSV and Xerces disagree on whether the author element is valid or invalid and on whether its child elements are notKnown or invalid; open the two output files side by side and the differences in color should make clear where the two processors disagree.

Since I wrote these only for myself, I have not done anything to the shell scripts to make them system independent; you will need to adapt them to make sure the paths are right for your system. If you have trouble, please send me email and I'll try to help, within the limits of the time I have available and my deplorably short attention span.

$Id: doclist.html,v 1.57 2007/03/24 02:05:04 cmsmcq Exp $