XML and the struggle to keep documentation current and in synch with practice

[9 December 2017]

One of the nice things about having data in a reusable form like SGML or XML that is not application-specific is that it makes it easier to keep documentation in synch with practices and/or software. (Relational databases have some of the same advantages, but I don’t find them handy for texts, and annotating specific data values can require arbitrarily complex technical prose.)

An example I am disproportionately pleased with has recently come about.

The project Annotated Turki Manuscripts from the Jarring Collection Online is transcribing some Central Asian manuscripts collected in and near Kashgar in the first half of the twentieth century. The manuscripts we are working with are written in Perso-Arabic script, and in order to make them more accessible to interested readers more comfortable with Latin script than with Perso-Arabic we provide transcriptions in Latin transliteration as well as in the original script. The domain specialists in the project have spent a lot of time working on the transliteration scheme, trying to make it easily readable while still retaining a 1:1 relation with the original, so that no information is lost in transliteration.

Because the transliteration scheme itself is a significant work product, we want to document it. Because it needs to be applied to every new transcription, it also needs to be realized in executable code. And, as one might expect, the scheme has changed slightly as we have gained experience with the manuscripts and with it.

Our representation of the transliteration scheme has taken a variety of forms over the last couple of years: extensive notes on a whiteboard, images of that whiteboard, entries in a table in the project wiki, hard-coded tables of character mappings in an XSLT stylesheet written by a student (and in other stylesheets derived from it), a spreadsheet, and recently also an XML document. That XML document is both on the Web in XML form, with a stylesheet to render it more or less legible to humans (transliteration tables are intrinsically kind of dense), and used by the latest incarnation of the student’s stylesheet (itself on the Web), replacing the hard-coded representation used in earlier versions.

The XML representation has the disadvantage that it’s not as easy to sort in many different ways as the spreadsheet; it has the advantage over the spreadsheet of significantly better data normalization and less redundancy. Compared to the tables used in earlier versions of the XSLT stylesheet, the XML document is significantly better documented and thus easier to debug. (The redundant presentation of strings both as displayed characters and as sequences of Unicode code points is important in a way that will surprise no one who has struggled with special character handling before.) The mixture of prose and tabular data in the XML, and the more systematic distinction between information about a particular Perso-Arabic string, information about a particular phonetic realization of that string, and the Latin-script regularization of that pronunciation, are things that XML does really easily and other data formats don’t do nearly as easily.
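The way such a mapping table drives the conversion can be sketched in miniature (here in Python rather than XSLT, and with invented placeholder character pairs, not the project’s actual scheme):

```python
# A toy table-driven transliterator. MAPPING stands in for the project's
# XML mapping document; the pairs below are placeholders, not the real scheme.
MAPPING = {
    "\u0628": "b",   # ARABIC LETTER BEH (placeholder value)
    "\u067E": "p",   # ARABIC LETTER PEH (placeholder value)
    "\u0627": "a",   # ARABIC LETTER ALEF (placeholder value)
}

def transliterate(text: str, mapping: dict[str, str]) -> str:
    """Greedy longest-match transliteration over the input string."""
    keys = sorted(mapping, key=len, reverse=True)  # try longer keys first
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(mapping[k])
                i += len(k)
                break
        else:                      # no mapping entry: keep the character as-is
            out.append(text[i])
            i += 1
    return "".join(out)
```

Sorting candidate keys longest-first matters once the table contains multi-character source strings; a 1:1 scheme depends on the longer match winning.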

Using XSLT stylesheets to make XML representations of information (here the script-to-script mapping rules) more easily human readable seems similar in spirit to literate programming as developed and practiced by Donald Knuth, although different in details.

Searching for patterns in XML siblings

[9 April 2013]

I’ve been experimenting with searches for elements in XML documents whose sequences of children match certain patterns. I’ve got some interesting results, but before I can describe them in a way that will make sense for the reader, I’ll have to provide some background information.

For example, consider a TEI-encoded language corpus where each word is tagged as a w element and carries a pos attribute. At the bottom levels of the XML tree, the documents in the corpus might look like this (this extract is from COLT, the Bergen Corpus of London Teenage English, as distributed by ICAME, the International Computer Archive of Modern and Medieval English; an earlier version of COLT was also included in the British National Corpus):

<u id="345" who="14-7">
<s n="407">
<w pos="PPIS1">I</w>
<w pos="VBDZ">was</w>
<w pos="XX">n't</w>
<w pos="JJ">sure</w>
<w pos="DDQ">what</w>
<w pos="TO">to</w>
<w pos="VVI">revise</w>
<w pos="RR">though</w>
</s>
</u>
<u id="346" who="14-1">
<s n="408">
<w pos="PPIS1">I</w>
<w pos="VV0">know</w>
<w pos="YCOM">,</w>
<w pos="VBZ">is</w>
<w pos="PPH1">it</w>
<w pos="AT">the</w>
<w pos="JJ">whole</w>
<w pos="JJ">bloody</w>
<w pos="NN1">book</w>
<w pos="CC">or</w>
<w pos="RR">just</w>
<w pos="AT">the</w>
<w pos="NN2">bits</w>
<w pos="PPHS1">she</w>
<w pos="VVD">tested</w>
<w pos="PPIO2">us</w>
<w pos="RP">on</w>
<w pos="YSTP">.</w>
</s>
</u>

These two u (utterance) elements record part of a conversation between speaker 14-7, a 13-year-old female named Kate of unknown socio-economic background, and speaker 14-1, a 13-year-old female named Sarah in socio-economic group 2 (that’s the middle group; 1 is high, 3 is low).

Suppose our interest is piqued by the phrase “the whole bloody book” and we decide to look at other passages where we find a definite article, followed by two (or more) adjectives, followed by a noun.

Using the part-of-speech tags used here, supplied by CLAWS, the part-of-speech tagger developed at Lancaster’s UCREL (University Centre for Computer Corpus Research on Language), this amounts, at a first approximation, to searching for a w element with pos = AT, followed by two or more w elements with pos = JJ, followed by a w element with pos = NN1. If we want other possible determiners (“a”, “an”, “every”, “some”, etc.) and not just “the” and “no”, and other kinds of adjective, and other forms of noun, the query eventually looks like this:

let $determiner := ('AT', 'AT1', 'DD',
'DD1', 'DD2',
'DDQ', 'DDQGE', 'DDQV'),
$adjective := ('JJ', 'JJR', 'JJT', 'JK',
'MC', 'MCMC', 'MC1', 'MD'),
$noun := ('MC1', 'MC2', 'ND1',
'NN', 'NN1', 'NN2',
'NNJ', 'NNJ2',
'NNL1', 'NNL2',
'NNT1', 'NNT2',
'NNU', 'NNU1', 'NNU2',
'NP', 'NP1', 'NP2',
'NPD1', 'NPD2',
'NPM1', 'NPM2' )

let $hits :=
    //w[@pos = $determiner]
       [following-sibling::w[1][@pos = $adjective]
        [following-sibling::w[1][@pos = $adjective]
         [following-sibling::w[1][@pos = $noun]]]]
for $h in $hits return
    <hit doc="{base-uri($h)}">{ $h }</hit>

Such searches pose several problems, for which I’ve been mulling over solutions for a while now.

  • One problem is finding a good way to express the concept of “two or more adjectives”. (The attentive reader will have noticed that the XQuery given searches for determiners followed by exactly two adjectives and a noun, not two or more adjectives.)

    To this, the obvious solution is regular expressions over w elements. The obvious problem standing in the way of this obvious solution is that XPath, XQuery, and XSLT don’t actually have support in their syntax or in their function library for regular expressions over sequences of elements, only regular expressions over sequences of characters.

  • A second problem is finding a syntax for expressing the query which ordinary working linguists will find less daunting or more convenient than XQuery.

    Why ordinary working linguists should find XQuery daunting, I don’t know, but I’m told they will. But even if one doesn’t find XQuery daunting, one may find the syntax required for sibling searches a bit cumbersome. The absence of a named axis meaning “immediate following sibling” is particularly inconvenient, because it means one must perpetually remember to add “[1]” to steps; experience shows that forgetting that predicate in even one place can lead to bewildering results. Fortunately (or not), the world in general (and even just the world of corpus linguistics) contains a large number of query languages that can be adopted or used for inspiration.

    Once such a syntax is invented or identified, of course, one will have the problem of building an evaluator for expressions in the new language, for example by transforming expressions in the new syntax into XQuery expressions which the XQuery engine or an XSLT processor evaluates, or by writing an interpreter for the new language in XQuery or XSLT.

  • A third problem is finding a good way to make the queries faster.

    I’ve been experimenting with building user-level indices to help with this. By user-level indices I mean user-constructed XML documents which serve the same purpose as dbms-managed indices: they contain a subset of the information in the primary (or ‘real’) documents, typically in a different structure, and they can make certain queries faster. They are not to be confused with the indices that most database management systems can build on their own, with or without user action. Preliminary results are encouraging.
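The “regular expressions over w elements” idea from the first problem can at least be approximated outside the XPath family by projecting each pos value onto a single character and running an ordinary string regex over the result. A sketch (the tag classes below are a simplification of the CLAWS sets used in the query above):

```python
import re

# Hypothetical projection of CLAWS tags onto one-letter classes:
# D = determiner, A = adjective, N = noun, O = anything else.
def tag_class(pos: str) -> str:
    if pos in {'AT', 'AT1', 'DD', 'DD1', 'DD2', 'DDQ', 'DDQGE', 'DDQV'}:
        return 'D'
    if pos in {'JJ', 'JJR', 'JJT', 'JK'}:
        return 'A'
    if pos.startswith('NN') or pos.startswith('NP'):
        return 'N'
    return 'O'

def find_matches(pos_seq, pattern=r'DAA+N'):
    """Return (start, end) token offsets where the POS pattern matches.

    The default pattern is 'determiner, two or more adjectives, noun' --
    the 'whole bloody book' query, with A+ supplying the open-ended repetition
    that the nested XPath predicates cannot express.
    """
    s = ''.join(tag_class(p) for p in pos_seq)
    return [m.span() for m in re.finditer(pattern, s)]
```

The character-level regex engine then does all the work; the price is that matches come back as token offsets which must be mapped back to the w elements.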

More on these individual problems in other posts.

URI casuistry

[4 March 2013]

I’ve been working lately on improving a DTD parser I wrote some time ago.

It’s been instructive to work with libcurl, the well known URL library by Daniel Stenberg and others, and with uriparser, a library for parsing and manipulating URIs written by Weijia Song and Sebastian Pipping (to all of whom thanks); I use the latter to absolutize relative references in entity declarations and the former to dereference the resulting absolute URIs.

A couple of interesting problems arise.

Relative references in parameter-entity declarations

When a parameter entity declaration uses a relative reference in its system identifier, you need to resolve that relative reference against the base URI. Section 5 of RFC 3986 is clear that the base URI should come from the URI of the DTD file in which the relative reference is used. So if the DTD http://example.com/2013/doc.dtd contains a parameter entity declaration of the form

<!ENTITY % chapters SYSTEM "chapters.dtd">

then the relative reference chapters.dtd is resolved against the base URI http://example.com/2013/doc.dtd to produce the absolutized form http://example.com/2013/chapters.dtd. This is true even if the reference to the parameter entity chapters occurs not in doc.dtd but in some other file: the logic is, as I understand it, that the relative reference is actually used when the parameter entity is declared, not when it is referenced, and the base URI comes from the place where the relative reference is used. Of course, in many or most cases the declaration and the reference to an external parameter entity will occur in the same resource.
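This resolution behavior is easy to confirm with any implementation of RFC 3986; for instance, Python’s urllib.parse.urljoin:

```python
from urllib.parse import urljoin

# Resolving the system identifier against the DTD's own URI, per RFC 3986 §5:
base = "http://example.com/2013/doc.dtd"
resolved = urljoin(base, "chapters.dtd")
# The last path segment of the base ("doc.dtd") is dropped before the
# relative reference is merged, yielding .../2013/chapters.dtd.
print(resolved)  # http://example.com/2013/chapters.dtd
```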

I should qualify my statement; this is what I believe to be correct and natural practice, and what I believe to be implied by RFC 3986. I have occasionally encountered software which behaved differently; I have put it down to bugs, but it may mean that some developers of XML software interpret the base-URI rules of RFC 3986 differently. And one could also argue that use is not the issue; the base URI to be used is the URI of the resource within which the relative reference physically occurs; in this case it amounts to the same thing.

I’m not sure, however, what ought to happen if we add a level of indirection. Suppose …

  • DTD file http://example.com/a.dtd contains the declaration <!ENTITY % chapdecl '&#x003C;!ENTITY % chapters SYSTEM "chapters.dtd">'> (not a declaration of the parameter entity chapters, but the declaration of a parameter entity containing the declaration of that parameter entity).
  • DTD file http://example.com/extras/b.dtd contains a parameter entity reference to %chapdecl; (and thus, logically, it is this DTD file that contains the actual declaration of chapters and the actual use of the relative reference).
  • DTD http://example.com/others/c.dtd contains a reference to %chapters;.

Which DTD file should provide the base URI for resolving the relative reference? I think the declaration/reference logic rules out the third candidate. If we say that we should take the base URI from the entity in which the relative reference was used, and that the relative reference is used when the parameter entity chapters is declared, then the second choice (b.dtd) seems to have the strongest claim. If we say that we should take the base URI from the entity in which the relative reference appears, and that the relative reference here clearly appears in the declaration of the parameter entity chapdecl, then the first choice (a.dtd) seems to have the strongest claim.

I am sometimes tempted by the one and sometimes by the other. The logic that argues for a.dtd has, however, a serious problem: the declaration of chapdecl might just as easily look like this: <!ENTITY % chapdecl '&#x003C;!ENTITY % chapters SYSTEM "%fn_pt1;%fn_pt2;.%fn_ext;">'>, with the relative reference composed of references to three parameter entities, each perhaps declared in a different resource with a different URI. Where does the relative reference “appear” in that case? So on the whole the second choice (b.dtd) seems the right one.

But my code actually chooses a.dtd for the base URI in this case: each parameter entity carries a property that indicates what base URI to use for relative references that occur within the text of that parameter entity, and in the case of internal entities like chapdecl the base URI is inherited from the entity on top of the entity stack when the declaration is read. Here, the top of the stack is chapdecl, which as an internal entity inherits its base-URI-for-children property from a.dtd.

Changing to use the base URI of the resource within which the entity declaration appears (to get b.dtd as the answer) would require adding new code to delay the calculation of the base URI: possible, but fiddly, easy to get wrong, and not my top priority. (I am grateful that in fact these cases never appear in the DTDs I care about, though if they did I might have firmer intuitions about what behavior to regard as correct.)
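A sketch of that bookkeeping may make the behavior concrete (hypothetical names, not the parser’s actual code):

```python
from urllib.parse import urljoin

class Entity:
    """An entity with a base-URI-for-children property: the base URI used
    for relative references occurring within this entity's text."""
    def __init__(self, name: str, base_uri: str):
        self.name = name
        self.base_uri = base_uri

stack: list[Entity] = []

# a.dtd is being read when chapdecl is declared, so chapdecl (an internal
# entity) inherits a.dtd's URI as its base-URI-for-children property.
stack.append(Entity("a.dtd", "http://example.com/a.dtd"))
chapdecl = Entity("chapdecl", stack[-1].base_uri)
stack.pop()

# Later, b.dtd references %chapdecl;. chapdecl goes on top of the stack,
# so the relative reference "chapters.dtd" inside its text is resolved
# against chapdecl's inherited base URI -- that is, a.dtd's, not b.dtd's.
stack.append(Entity("b.dtd", "http://example.com/extras/b.dtd"))
stack.append(chapdecl)
resolved = urljoin(stack[-1].base_uri, "chapters.dtd")
# resolved == "http://example.com/chapters.dtd"
# (not "http://example.com/extras/chapters.dtd")
```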


A similar complication arises when we wish to follow the advice of some commentators on the W3C system team’s blog post on excessive DTD traffic and provide an HTTP_REFERER value that indicates the source of the URI from which we are retrieving a DTD. In the case given above, which file should be listed as the source of the reference: a.dtd, b.dtd, or c.dtd?

It may be that answers to this are laid out in a spec somewhere. But … where?

Posted in XML

Recursive descent parsing in XQuery (and other functional languages)

[7 January 2013; typos in code patterns corrected 8 January and 21 June 2013]

Everyone who designs or builds systems to do interesting work occasionally needs to deal with input in some specialized notation or other. Nowadays a lot of specialized information is in XML, and in that case the need is to deal with vocabularies designed for the particular kind of specialized information involved. But sometimes specialized data comes in its own non-XML syntax — ISBNs, URIs, expressions in symbolic logic, chess-game notation, textual user interface languages, query languages, and so on. Even in the XML universe there are plenty of notations for structured information of various kinds that are not XML: DTDs, XPath expressions, CSS, XQuery.

In these cases, if you want to do anything useful that depends on understanding the structure of the data, you’re going to need to write a parser for its notation. For some simple cases like ISBNs and ISSNs, or URIs, you can get by with regular expressions. (At least, more or less: if you want to validate the check digit in an ISBN or ISSN, regular expressions are going to have a hard time getting the job done, though oddly enough a finite state automaton would have no particular trouble with it.) But many interesting notations are context-free languages, which means regular expressions don’t suffice to distinguish well-formed expressions from other strings of characters, or to identify the internal structure of expressions.
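The check-digit point deserves a word of illustration: validating an ISBN-10 requires only a bounded amount of state (the weighted sum modulo 11), which is exactly what a finite automaton can carry, even though no practical regular expression enumerates the possibilities. A sketch:

```python
def isbn10_valid(isbn: str) -> bool:
    """Finite-state check of the ISBN-10 check digit.

    The only state carried across the input is (position, running weighted
    sum mod 11) -- a bounded set, hence within the power of a DFA.
    """
    digits = isbn.replace("-", "")
    if len(digits) != 10:
        return False
    total = 0
    for i, ch in enumerate(digits):
        if ch == "X" and i == 9:       # 'X' stands for 10, final position only
            value = 10
        elif ch.isdigit():
            value = int(ch)
        else:
            return False
        total = (total + (10 - i) * value) % 11   # state stays in 0..10
    return total == 0
```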

Now, if you’re writing in C and you run into this problem, you can easily use yacc and lex to generate a parser for your language (assuming it satisfies the requirements of yacc). If you’re writing in Java, there are several parser-generator tools to choose from. If you’re writing in a less widely used language, you may find a parser generator, or you may not.

It’s handy, in this situation, to be able to write your own parsers from scratch.

By far the simplest method for hand-written parsing is the one known as recursive descent. For each non-terminal symbol in the grammar, there is a routine whose job it is to read strings in the input which represent that non-terminal. The structure of the routine follows the structure of the grammar rules for that non-terminal in a simple way, which makes recursive-descent parsers feel close to the structure of the information and also to the structure of the parsing process. (The parser generated by yacc, on the other hand, remains a completely opaque black box to me, even after years of using it.)

In his book Compiler Construction (Harlow, England: Addison-Wesley, 1996, tr. from Grundlagen und Techniken des Compilerbaus [Bonn: Addison-Wesley, 1996]), the Swiss computer scientist Niklaus Wirth summarizes the rules for formulating a recursive-descent parser in a table occupying less than half a page. For each construct in the EBNF notation for grammars, Wirth shows a corresponding construct in an imperative programming language (Oberon), so before I show the table I should perhaps review the EBNF notation. In any formal notation for grammars, a grammar is made up of a sequence of grammar productions, and each production (in the case of context-free grammars) consists of a single non-terminal symbol on the left-hand side of the rule, and an expression on the right-hand side of the rule which represents the possible realizations of the non-terminal. The right-hand side expression is made up of non-terminal symbols and terminal symbols (e.g. quoted strings), arranged in sequences, separated as need be by choice operators (for choice, the or-bar | is used), with parentheses, square brackets (which mark optional material), and braces (which mark material that can occur zero or more times).

Wirth’s EBNF for EBNF will serve to illustrate the syntax:

syntax = {production}.
production = identifier “=” expression “.”.
expression = term {“|” term}.
term = factor {factor}.
factor = identifier | string | “(” expression “)” | “[” expression “]” | “{” expression “}”.
identifier = letter {letter | digit}.
string = “”” {character} “””.
letter = “A” | … | “Z”.
digit = “0” | … | “9”.

(It may be worth noting that this formulation achieves its simplicity in part by hand-waving: it doesn’t say anything about whitespace separating identifiers, and the definition of string is not one a machine can be expected to read and understand. But Wirth isn’t writing this grammar for a machine, but for human readers.)

It’s easy to see that the routines in a recursive-descent parser for a grammar in this notation must deal with six constructs on the right-hand side of rules: strings, parenthesized expressions (three kinds), sequences of expressions, and choices between expressions. Wirth summarizes the necessary code in this table with the construct K on the left, and the program for it, Pr(K), on the right. In the code fragments, sym is a global variable representing the symbol most recently read from the input stream, and next is the routine responsible for reading the input stream and updating sym. The meta-expression first(K) denotes the set of symbols which can possibly occur as the first symbol of a string derived from construct K.

K                           Pr(K)
“x”                         IF sym = “x” THEN next ELSE error END
(exp)                       Pr(exp)
[exp]                       IF sym IN first(exp) THEN Pr(exp) END
{exp}                       WHILE sym IN first(exp) DO Pr(exp) END
fac0 fac1 … facn            Pr(fac0); Pr(fac1); … ; Pr(facn)
term0 | term1 | … | termn   CASE sym OF
                              first(term0) : Pr(term0)
                            | first(term1) : Pr(term1)
                            | …
                            | first(termn) : Pr(termn)
                            END

This is easy enough to express in any language that has global variables and assignment statements. But what do we do when we are writing an application in a functional language, like XQuery or XSLT, and need to parse sentences in some context-free language? No assignment statements, and all functions have the property that if you call them several times with the same argument you will always get the same results back. [Addendum, 8 January 2013: XQuery and XSLT users do in fact have access to useful parser generators: see the pointers to Gunther Rademacher’s REx and Dmitre Novatchev’s FXSL provided in the comments. The question does, however, still arise for those who want to write recursive-descent parsers along the lines sketched by Wirth, which is where this post is trying to go.]

I’ve wondered about this for some time (my notes show I was worrying about it a year ago today), and the other day a simple solution occurred to me: each of the functions in a recursive-descent parser depends on the state of the input, so in a functional language the state of the input has to be passed to the function as an argument. And each function changes the state of the input (by advancing the input pointer), which in a functional language we can represent by having each function return the new state of the input and the new current symbol as its result.

A new table, analogous to Wirth’s, but with XQuery code patterns on the right-hand side, looks like this. Here, the common assumption is that each function is passed a parameter named $E0 whose value is an env element with two children: sym contains the current symbol and input contains the remainder of the input (which for simplicity I’m going to assume is a string). If an error condition arises, an error element is added to the environment. The job of reading the next token is handled by the function next().

k: “x”
Pr(k, $E0):
    if ($E0/sym = “x”)
    then next($E0)
    else <env>
           <error>expected “x” but did not find it</error>
         </env>

k: (exp)
Pr(k, $E0): Pr(exp, $E0)

k: [exp]
Pr(k, $E0):
    if ($E0/sym = first(exp))
    then Pr(exp, $E0)
    else $E0

k: {exp}
This requires two translations. For each such sequence exp, we declare a function:

declare function seq_exp(
  $E0 as element(env)
) as element(env) {
  if ($E0/sym = first(exp))
  then let $E1 := Pr(exp, $E0),
           $E2 := seq_exp($E1)
       return $E2
  else $E0
};

Inline, we just call that function: seq_exp($E0).

k: fac0 fac1 … facn
Pr(k, $E0):
    let $E1 := Pr(fac0, $E0),
        $E2 := Pr(fac1, $E1),
        … ,
        $En+1 := Pr(facn, $En)
    return $En+1

k: term0 | term1 | … | termn
Pr(k, $E0):
    if ($E0/sym = first(term0)) then Pr(term0, $E0)
    else if ($E0/sym = first(term1)) then Pr(term1, $E0)
    else …
    else if ($E0/sym = first(termn)) then Pr(termn, $E0)
    else <env><error>unexpected symbol</error></env>
Like Wirth’s, the code shown here produces a recognizer that doesn’t do anything with the input except read it and accept or reject it; like Wirth’s, it can easily be extended to do things with the input. (My first instinct, of course, is to return an XML representation of the input string’s abstract syntax tree, so I can process it further using other XML tools.)
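For readers who want to see the state-threading pattern in miniature outside XQuery, here is a sketch of the same idea in Python, written in the same functional style (each function takes the environment and returns a new one, never mutating), for the toy grammar S = “a” {“b”} “c”:

```python
from typing import NamedTuple, Optional

class Env(NamedTuple):
    sym: Optional[str]     # current symbol
    rest: str              # unread input
    error: Optional[str]   # set when recognition fails

def nxt(e: Env) -> Env:
    """Read the next symbol, functionally: return a new Env."""
    if e.rest:
        return Env(e.rest[0], e.rest[1:], e.error)
    return Env(None, "", e.error)

def expect(e: Env, x: str) -> Env:
    """The Pr("x") pattern: advance on match, record an error otherwise."""
    if e.error is None and e.sym == x:
        return nxt(e)
    return e._replace(error=f"expected {x!r}")

def rep_b(e: Env) -> Env:
    """The Pr({exp}) pattern: while sym is in first(exp), recurse."""
    if e.error is None and e.sym == "b":
        return rep_b(expect(e, "b"))
    return e

def s(e: Env) -> Env:
    """S = "a" {"b"} "c" -- a sequence, threaded left to right."""
    return expect(rep_b(expect(e, "a")), "c")

def recognize(text: str) -> bool:
    final = s(nxt(Env(None, text, None)))
    return final.error is None and final.sym is None
```

The environment plays exactly the role of the env element in the XQuery table: symbol, remaining input, and error flag travel together through every call.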

Vocabulary specialization and generalization, use and interchange: things our schema languages should make easier than they do

[23 January 2011]

Perhaps it’s because the call for participation in the 2011 Balisage Pre-conference Symposium on XML Document Interchange has just come out, or perhaps it’s for other reasons, but I found myself thinking today about the problem of specialized uses of generalized vocabularies.

There are lots of vocabularies defined in fairly general terms: HTML, TEI, DocBook, the NLM article tag set, you can surely think of plenty yourself.

Often, for a specific purpose in a specific organization or project, it would be handy to have a much tighter, much more specific vocabulary (and thus one that’s semantically richer, easier to process, and easier to validate tightly). For example, consider writing and managing an issues list (or a list of use cases, or any other list containing items of a specialized genre), in a generic vocabulary. It’s easy enough: you just have a section for each issue, and within that section you have standard subsections on where the issue came from, what part of the project it relates to, its current status, and the history of your work on it. Easy enough. And if you’re the kind of person who writes macros in whatever editor you use, you can write a macro to set up a new issue by adding a section of type ‘issue’ with subsections with appropriate types and headings. But isn’t that precisely what a markup-aware editor typically does? Well, yes, typically: any schema-aware editor can look at the schema, and as soon as you say “add a new issue” it can populate the new element with all of the required subelements. Or it could, if you had an element type called ‘issue’, with appropriately named sub-elements. If instead you are using a generic ‘div’ element, your editor is going to have a hard time helping you, because you haven’t said what you really mean. You want an issue, but what you’ve said is ‘add a div’.

Some schemas, and some schema languages, try to make this easier by allowing you to say, essentially, that an issue element is a kind of div, and that the content model for issue is a specialization of that for div (and so on). This is better than nothing, but I’m probably not the only person who fails to use these facilities in all the cases where they would be helpful. And when I do, I have to extend the standard stylesheets for my generic vocabulary to handle my new elements, because even when the stylesheet language supports the specialization mechanisms of the schema language (as XSLT 2.0 supports element substitution groups in XSD), most stylesheets are not written to take advantage of them. And if I’m exchanging documents with someone else, they may or may not want to have to deal with my extensions to the schema.

I wonder if we might get a better answer if (a) in our schema languages it were as easy to write a rule for div type='issue' as for issue, and (b) in our validation tools it were as easy to apply multiple grammars to a document as a single grammar, and to specify that the class of documents we are interested in is given by the intersection of the grammars, or by their union, or (for grammars A, B, C) by A ∪ (B ∩ ¬ C). Also (c) if for any schema extension mechanism it were easy to generate a transformation to take documents in the extended schema into the base schema, and vice versa.
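Point (b), combining grammars with boolean operators, can be sketched by treating each validator as a predicate on documents and composing predicates. (The validators A, B, and C below are toy stand-ins for real schema processors, and the “documents” are just strings; this is an illustration of the composition, nothing more.)

```python
# Boolean combinators over validators (document -> bool predicates):
def union(a, b):        return lambda doc: a(doc) or b(doc)
def intersect(a, b):    return lambda doc: a(doc) and b(doc)
def complement(a):      return lambda doc: not a(doc)

# Toy stand-ins for three grammars, each checking one property of a document:
A = lambda doc: "issue" in doc
B = lambda doc: "div" in doc
C = lambda doc: "draft" in doc

# The class of documents given by A ∪ (B ∩ ¬C):
accept = union(A, intersect(B, complement(C)))
```

A validation tool that let schemas compose this way would let the document class of interest be described declaratively rather than baked into a single grammar.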

Perhaps NVDL may be in a position to help with (b), though I’ve never learned it well enough to know and it seems to be more heavily freighted with unacknowledged assumptions about schema languages and validation than I’d like.

And perhaps Relax NG already can handle both (a) and (b).

Homework to do.