XML and the struggle to keep documentation current and in synch with practice

[9 December 2017]

One of the nice things about having data in a reusable form like SGML or XML that is not application-specific is that it makes it easier to keep documentation in synch with practices and/or software. (Relational databases have some of the same advantages, but I don’t find them handy for texts, and annotating specific data values can require arbitrarily complex technical prose.)

An example I am disproportionately pleased with has recently come about.

The project Annotated Turki Manuscripts from the Jarring Collection Online is transcribing some Central Asian manuscripts collected in and near Kashgar in the first half of the twentieth century. The manuscripts we are working with are written in Perso-Arabic script; to make them more accessible to readers who are more comfortable with Latin script, we provide transcriptions in Latin transliteration as well as in the original script. The domain specialists in the project have spent a lot of time working on the transliteration scheme, trying to make it easily readable while still retaining a 1:1 relation with the original so that no information is lost in transliteration.

Because the transliteration scheme itself is a significant work product, we want to document it. Because it needs to be applied to every new transcription, it also needs to be realized in executable code. And, as one might expect, the scheme has changed slightly as we have gained experience with the manuscripts and with the scheme itself.

Our representation of the transliteration scheme has taken a variety of forms over the last couple of years: extensive notes on a whiteboard, images of that whiteboard, entries in a table in the project wiki, hard-coded tables of character mappings in an XSLT stylesheet written by a student and in other stylesheets derived from it, a spreadsheet, and recently also an XML document. The XML document is on the Web in XML form, with a stylesheet to render it more or less legible to humans (transliteration tables are intrinsically kind of dense); it is also used by the latest incarnation of the student’s stylesheet (itself on the Web), where it replaces the hard-coded representation used in earlier versions.

The XML representation has the disadvantage that it’s not as easy to sort it in many different ways as it is to sort the spreadsheet; it has the advantage over the spreadsheet of significantly better data normalization and less redundancy. Compared to the tables used in earlier versions of the XSLT stylesheet, the XML document is significantly better documented and thus easier to debug. (The redundant presentation of strings as displayed characters and as sequences of Unicode code points is important in a way that will surprise no one who has struggled with special character handling issues before.) The mixture of prose and tabular data in the XML, and the more systematic distinction between information about a particular Perso-Arabic string and information about a particular phonetic realization of that string (and the Latin-script regularization of that pronunciation), are things that XML does really easily, and other data formats don’t do nearly as easily.
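To make the idea of one table serving as both documentation and executable input a little more concrete, here is a minimal sketch in Python (not the project's XSLT). Everything in it is an assumption made for illustration: the element and attribute names (table, mapping, source, latin, chars, codepoints) are invented, and the two mappings are placeholders, not entries from the project's actual scheme. It shows how the redundant code-point listing can be checked against the displayed characters when the table is loaded.

```python
import xml.etree.ElementTree as ET

# Hypothetical table format; names and mappings are placeholders for illustration.
TABLE_XML = """\
<table>
  <mapping>
    <source chars="&#x628;" codepoints="U+0628"/>
    <latin chars="b" codepoints="U+0062"/>
    <note>Placeholder entry; prose documentation lives next to the data.</note>
  </mapping>
  <mapping>
    <source chars="&#x634;" codepoints="U+0634"/>
    <latin chars="&#x161;" codepoints="U+0161"/>
    <note>Placeholder entry.</note>
  </mapping>
</table>
"""


def codepoints(text):
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def load_table(xml_text):
    """Return {source string: Latin string}, checking the redundant encodings."""
    mapping = {}
    for entry in ET.fromstring(xml_text).iter("mapping"):
        source, latin = entry.find("source"), entry.find("latin")
        for child in (source, latin):
            # The displayed characters and the listed code points must agree.
            if codepoints(child.get("chars")) != child.get("codepoints"):
                raise ValueError(f"mismatch in <{child.tag}>: {child.get('codepoints')}")
        mapping[source.get("chars")] = latin.get("chars")
    return mapping


def transliterate(text, mapping):
    """Replace source strings with Latin ones, trying the longest source strings first."""
    keys = sorted(mapping, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # pass unmapped characters through unchanged
            i += 1
    return "".join(out)


table = load_table(TABLE_XML)
print(transliterate("\u0628\u0634", table))
```

Because the code reads the same document that humans read, a change to the scheme is made in one place; a stylesheet that imports the XML table gets the same benefit.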

Using XSLT stylesheets to make XML representations of information (here the script-to-script mapping rules) more easily human readable seems similar in spirit to literate programming as developed and practiced by Donald Knuth, although different in details.

Fundamental primitives of XSLT programming


A friend planning an introductory course on programming for linguists recently asked me what I thought such linguist-programmers absolutely needed to learn in their first semester of programming. I thought back to a really helpful “Introduction to Programming” course taught in the 1980s at the Princeton University Computer Center by Howard Strauss, then the head of User Services. As I remember it, it consisted essentially of the introduction of three flow-chart patterns (for a sequence of steps, for a conditional, and for a while-loop), with instructions on how to use them that went something like this:

  1. Start with a single box whose text describes the functionality to be implemented.
  2. If every box in the diagram is trivial to implement in your programming language, stop: you’re done. Implement the program using the obvious translation of sequences, loops, and conditionals into your language.
  3. Otherwise choose some single box in the diagram whose functionality is non-trivial (will require more than a few lines of code) and replace it with a pattern: either break it down into a sequence of steps, or make it into a while-condition-do-action loop, or make it into an if-then-else choice.
  4. Return to step 2.

I recommended this idea to my friend, since when I started to learn to program I found these three patterns extremely helpful. As I thought about it further, it occurred to me that the three patterns in question correspond 1:1 to (a) the three constructors used in regular languages, and (b) the three patterns proposed in the 1970s by Michael A. Jackson. The diagrams I learned from Howard Strauss were not the same as Jackson’s diagrams graphically, but the semantics were essentially the same. I expect that a good argument can be made that together with function calls and recursion, those three patterns are the atomic patterns of software design for all conventional (i.e. sequential imperative) languages.

I think the patterns provide a useful starting point for a novice programmer: if you can see how to express an idea using those three patterns, it’s hard not to see how to capture it in a program in Pascal, or C, or Python, or whatever language you’re using. Jackson is quite good on deriving the structure of the program from the structures of the input and output in a systematic way.
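As a small illustration of that translation step, here is a sketch in Python of a design that has been refined down to the three patterns. The task (counting the words of at least a given length) and the function name are invented for the example; the comments mark which pattern each construct realizes. Sequence, choice, and repetition are also the counterparts of concatenation, alternation, and Kleene star in regular expressions.

```python
def count_long_words(words, min_length):
    """Count the words that have at least min_length characters."""
    # Sequence: initialise, then loop, then return.
    count = 0
    i = 0
    # Loop: while words remain, handle the next one.
    while i < len(words):
        # Conditional: if the word is long enough, count it; else do nothing.
        if len(words[i]) >= min_length:
            count += 1
        else:
            pass
        i += 1
    return count


print(count_long_words("the quick brown fox jumps".split(), 5))  # prints 3
```

An idiomatic Python version would use a for loop or a generator expression; the explicit while loop is kept here only to stay close to the flow-chart pattern.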

The languages I most often teach, however, are XSLT and XQuery; they do not fall into the class of conventional sequential imperative languages, and the three patterns I learned from Howard Strauss (and which Howard Strauss may or may not have learned from Michael A. Jackson) do not help me structure a program in either language.

Is there a similarly small set of simple fundamental patterns that can be used to describe how to build up an XSLT transformation, or an XQuery program?

What are they?

Do they have a plausible graphical representation one could use for sketching out a stepwise refinement of a design?

The joy of testing

[5 November 2009, some additions 6 November 2009]

I’m using Jeni Tennison’s xspec to develop tests for a simple stylesheet I’m writing. An xspec test takes the form of a scenario along the lines of

  • When you match a foo element, do this.
  • When you call function bar with these arguments, expect this result.
  • When you call this named template, expect this result.

It’s a relatively young project, and the documentation is perhaps best described as nascent. Working from the documentation (it does exist, which makes for a nice change from some things I work with), I first wrote nine or ten tests to describe the behavior of an existing stylesheet; when I ran the tests against that stylesheet, all of them reported failures, because my formulation of the expected results violated various silent assumptions of the xspec code. That might indicate opportunities for making the xspec documentation more informative. I’ve spent an enjoyable hour or two this evening, however, looking at the xspec code and figuring out how my test cases are confusing it, reformulating them, and watching the bars of red in the test report change, one by one, to green. It’s nice to have a visible sign of forward progress.

There are other XSLT test frameworks I haven’t tried, and I can’t compare xspec to any of them. But I can say this: if you are developing XSLT stylesheets and aren’t using any of the available test frameworks, you really ought to look into xspec.

A helpful page about XSLT testing is maintained by Tony Graham of Menteith Consulting. If xspec doesn’t work out for you, check out the other frameworks he lists there.

Firefox and namespace nodes: an open plea

[21 October 2009]

One of the long-standing gaps in Mozilla’s support for XSLT 1.0 is its failure to support the XPath namespace axis; for the many stylesheets that don’t use that axis, the gap is not a problem. But access to the namespace axis is essential for many stylesheets that work upon XSLT stylesheets, or XSD schema documents, or any other documents which may have namespace-qualified data in attribute values and element content; Firefox’s failure to support it means that browser-based tools for those vocabularies must often carry warnings like “Works with everything except Firefox.” What a drag.
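To show why in-scope namespace bindings matter for such documents, here is a sketch in Python, offered as an analogy and not as XPath or as anything Firefox does. In XPath 1.0 one would look up the binding for a prefix with something like namespace::*[name()='my']; the sketch does the equivalent by tracking prefix declarations while parsing. The sample schema fragment, the example.com URI, and the function name are all made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Made-up schema fragment: the value of @type is a QName, so its meaning
# depends on the namespace bindings in scope on the element carrying it.
DOC = b"""<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:my="http://example.com/ns">
  <xs:element name="price" type="my:money"/>
</xs:schema>"""


def qname_attributes(xml_bytes, attr_name):
    """Yield (namespace URI, local name) for each QName-valued attribute."""
    scopes = [{}]  # stack of prefix-to-URI dicts, one per open element
    pending = {}   # declarations reported since the last start tag
    events = ("start-ns", "start", "end")
    for event, payload in ET.iterparse(io.BytesIO(xml_bytes), events=events):
        if event == "start-ns":
            prefix, uri = payload
            pending[prefix] = uri
        elif event == "start":
            # New scope: inherited bindings plus this element's declarations.
            scopes.append({**scopes[-1], **pending})
            pending = {}
            value = payload.get(attr_name)
            if value is not None:
                prefix, _, local = value.rpartition(":")
                # An unbound prefix yields None for the namespace URI.
                yield scopes[-1].get(prefix), local
        else:
            scopes.pop()


for uri, local in qname_attributes(DOC, "type"):
    print(uri, local)
```

Without access to the bindings, the string my:money cannot be taken apart reliably; the same prefix may be bound to different namespaces in different documents, or in different parts of one document.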

So it was encouraging, early this year, when a team of students at Simon Fraser University provided a fix for the bug. (Thank you, SFU! Way to go!) What I don’t understand is: given that there is a fix, given that it passes all the tests, given that this fix removes one of the major blots on Firefox’s XSLT conformance, why isn’t it in the product yet?

I wonder if it’s because the responsible parties don’t perceive the bug or the fix as important; that would be understandable, since with 17 votes in favor of fixing the issue, this bug is way down among the weeds. If that’s the reason, then perhaps it would help if those who do feel the bug is important were to raise the vote total of the relevant bug a bit.

So if you care about XSLT support and have a login ID on bugzilla.mozilla.org, please navigate to bug 94270 and vote for the bug. (Click the ‘vote’ link next to the display of the Importance field.) If you care about XSLT support in Firefox and don’t have a login ID on bugzilla.mozilla.org, I urge you to consider getting a login ID, so you can vote for this bug.

If anyone reading this has insight into the dynamics of getting a fix that’s ready and tested into the next release, I for one would be interested to learn more.

[My evil twin Enrique has produced a poetic version of this plea, addressed to the committers of Firefox:

Without a warning, you broke my heart.
I’ve got this QName to take apart,
But the namespace axis returned no nodes.
Can’t find no binding to guide my code.

So come on, Firefox, Firefox please!
an’ I’m beggin’ you, Firefox, and I’m on my knees.
Please fix this bug,
Accept the patch.
The namespace axis —
Let it match, let it match, let it match.

I’d refuse to include it, as being a sacrilege against the memory of Ron McKernan, but he’d just hack the web site and add it anyway. And if he did, he’d make it look as if I had included it myself, against my better judgement. He’s really past all controlling lately.]


[6 May 2009]

Short version: I’ve made a new toy for playing with one aspect of XSD 1.1, namely the conditional inclusion of elements in schema documents, controlled by attributes in the vc (version control) namespace. The experience has reminded me of reasons to like the vc:* design and of reasons to like (and hate) Javascript in the browser.
