XML and the struggle to keep documentation current and in synch with practice

[9 December 2017]

One of the nice things about having data in a reusable form like SGML or XML that is not application-specific is that it makes it easier to keep documentation in synch with practices and/or software. (Relational databases have some of the same advantages, but I don’t find them handy for texts, and annotating specific data values can require arbitrarily complex technical prose.)

An example I am disproportionately pleased with has recently come about.

The project Annotated Turki Manuscripts from the Jarring Collection Online is transcribing some Central Asian manuscripts collected in and near Kashgar in the first half of the twentieth century. The manuscripts we are working with are written in Perso-Arabic script, and in order to make them better accessible to interested readers more comfortable with Latin script than with Perso-Arabic we provide transcriptions in Latin transliteration as well as in the original script. The domain specialists in the project have spent a lot of time working on the transliteration scheme, trying to make it easily readable while still retaining a 1:1 relation with the original so that no information is lost in transliteration.

Because the transliteration scheme itself is a significant work product, we want to document it. Because it needs to be applied to every new transcription, it also needs to be realized in executable code. And, as one might expect, the scheme has changed slightly as we have gained experience with the manuscripts and with it.

Our representation of the transliteration scheme has taken a variety of forms over the last couple of years: extensive notes on a whiteboard, images of that whiteboard, entries in a table in the project wiki, hard-coded tables of character mappings in an XSLT stylesheet written by a student and other stylesheets derived from it, a spreadsheet, and recently also an XML document, which is both on the Web in XML form with a stylesheet to render it more or less legible to humans (transliteration tables are intrinsically kind of dense) and used by the latest incarnation of the student’s stylesheet (itself on the Web), replacing the hard-coded representation used in earlier versions.

The XML representation has the disadvantage that it’s not as easy to sort it in many different ways as it is to sort the spreadsheet; it has the advantage over the spreadsheet of significantly better data normalization and less redundancy. Compared to the tables used in earlier versions of the XSLT stylesheet, the XML document is significantly better documented and thus easier to debug. (The redundant presentation of strings as displayed characters and as sequences of Unicode code points is important in a way that will surprise no one who has struggled with special character handling issues before.) The mixture of prose and tabular data in the XML, and the more systematic distinction between information about a particular Perso-Arabic string and information about a particular phonetic realization of that string and the Latin-script regularization of that pronunciation), are things that XML does really easily, and other data formats don’t do nearly as easily.

Using XSLT stylesheets to make XML representations of information (here the script-to-script mapping rules) more easily human readable seems similar in spirit to literate programming as developed and practiced by Donald Knuth, although different in details.