… I’ve been wandering late … (travels, part 2)

[26 October 2008]

This is the second in a series of posts recording some of my impressions from recent travels.

After the XSLT meetings described in the previous post, and then a week at home, during which I was distracted by events that don’t need to be described here, I left again in early October for Europe. During the first half of last week [week before last, now], I was in Mannheim attending a workshop organized by the electronic publications working group of the Union of German Academies of Science. Most of the projects represented were dictionaries of one stripe or another, many of them historical dictionaries (the Thesaurus Linguae Latinae, the Dictionnaire Etymologique de l’Ancien Français, the Deutsches Rechtswörterbuch, the Qumran-Wörterbuch, the Wörterbuch der deutschen Winzersprache, both an Althochdeutsches Wörterbuch and an Althochdeutsches Etymologisches Wörterbuch, a whole set of dialect dictionaries, and others too numerous to name).

Some of the projects are making well-planned, good use of information technology (the Qumran dictionary in Göttingen sticks particularly in my mind), but many suffer from the huge weight of a paper legacy, or from short-sighted decisions made years ago. I’m sure it seemed like a good idea at the time to standardize on Word 6, and to build the project work flow around a set of Word 6 macros which are thought not to run properly in Word 7 or later versions of Word, and which were built by a clever participant in the project who is now long gone, leaving no one who can maintain or recreate them. But however good an idea it seemed at the time, it was in reality a foolish decision for which the project is now paying a price (being stuck in outdated software, without the ability to upgrade, and with increasing difficulty finding support), and for which the academy sponsoring the project, and the future users of the work product, will continue paying for many years to come.

I gave a lecture Monday evening under the title “Standards in der geisteswissenschaftlichen Textdatenverarbeitung: Über die Zukunftssicherung von Sprachdaten” (roughly: standards in humanities text-data processing, on future-proofing language data), in which I argued that the IT practice of projects involved with the preservation of our common cultural heritage must attend to a variety of problems that can make their work inaccessible to posterity.

The consistent use of suitably developed and carefully chosen open standards is by far the best way to ensure that the data and tools we create today can still be used by the intended beneficiaries in the future. I ended with a plea for people with suitable backgrounds in the history of language and culture to participate in standardization work, to ensure that the standards developed at W3C or elsewhere provide suitable support for the cultural heritage. The main problem, of course, is that the academy projects are already stretched thin and have no resources to spare for extra work. But I hope that the academies will realize that they have a role to play here, which is directly related to their mission.

It’s best, of course, if appropriate bodies join W3C as members and provide people to serve in Working Groups. More universities, more academies, more user organizations need to join and help guide the Web. (Boeing and Chevron and other user organizations within W3C do a lot for all users of the Web, but there is only so much they can accomplish as single members; they need reinforcements!) But even if an organization does not or cannot join W3C, an individual can have an impact by commenting on draft W3C specifications. All W3C working groups are required by the W3C development process to respond formally to comments received on Last-Call drafts, and to make a good-faith effort to satisfy the originator of the comment, either by doing as suggested or by providing a persuasive rationale for not doing so. (Maybe it’s not a good idea after all, maybe it’s a good idea but conflicts with other good ideas, etc.) It is not unknown for individuals outside the working group (and outside W3C entirely) to have a significant impact on a spec just by commenting on it systematically.

Whether anyone in the academies will take the invitation to heart remains to be seen, though at least a couple of people asked hesitantly after the lecture how much membership dues actually run. So maybe someday …

I’ve been wandering early …

[21 October 2008]

I’ve been traveling a good deal lately.

This is the first in a series of posts recording some of my impressions.

In late September the XSL Working Group held a one-week meeting in Prague to work on revisions of XSLT to make it easier to support streaming transformations in XSLT. By streaming, the WG means:

  • The transformation can run in memory independent of document size (sometimes constant memory, sometimes memory proportional to document depth, sometimes memory proportional to the size of discrete windows of data in the document).
  • The transformation can begin delivering results before all of the input is available (e.g. can work on so-called ‘infinite’ XML documents like streams of stock quotations).
  • The transformation can be performed in a single pass over the document.
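The flavor of these properties can be suggested outside XSLT. Here is a minimal Python sketch (an event-based parse, not XSLT, and with invented element names) of a single-pass transformation whose memory use stays bounded no matter how long the input runs:

```python
# Count 'quote' elements and sum their 'price' attributes in a single pass,
# discarding each subtree as soon as it has been processed, so memory use
# does not grow with document size. (Toy document; names are invented.)
import io
import xml.etree.ElementTree as ET

doc = io.BytesIO(
    b"<quotes>"
    b"<quote symbol='A' price='10'/>"
    b"<quote symbol='B' price='32'/>"
    b"<quote symbol='A' price='11'/>"
    b"</quotes>"
)

count, total = 0, 0.0
for event, elem in ET.iterparse(doc, events=("end",)):
    if elem.tag == "quote":
        count += 1
        total += float(elem.get("price"))
        elem.clear()  # discard the processed subtree: memory stays bounded

print(count, total)  # 3 53.0
```

If `doc` were an unending stream of quotations, the loop could just as well emit partial results as it went, which is the second property in the list above.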

It turns out that for different use cases it can be necessary or useful to:

  • declare a particular input document as streamable / to be streamed if possible
  • declare a particular set of templates as streamable
  • declare that particular parts of the document need to be available in full (buffered for random access) for part of the transform, but can then be discarded (e.g. for windowing use cases)

Some members of the WG may have been apprehensive at the thought of five straight days of WG discussions. Would we have enough to do, or would we run out of ideas on Tuesday and spend Wednesday through Friday staring at the floor in embarrassment while the chair urged us to get some work done? (If no one else harbored these fears, I certainly did.) But in fact we had a lively discussion straight through to the end of the week, and made what I think was good progress toward concrete proposals for the spec.

Among the technical ideas with most legs is (I think) the idea that sometimes what you want to do with a particular node in the input tree can actually be done with a partial copy of the input tree, and that different kinds of partial copy may be appropriate in different situations.

If you perform a deep copy of an element node in an XDM data model instance, for example, you have access to the entire subtree rooted at that node, but not to any of its siblings or ancestors, nor to anything else in the tree from which it came. For cases where you wish to (or must) allow access to the subtree rooted in a node, but to nothing else, such a deep copy is ideal: it preserves the information you want to have available, and it makes all other information inaccessible. (This is essentially the way that XSD 1.1 restricts assertions to the subtree rooted in a given node: logically speaking the assertions are evaluated against a copy of the node, not against the node itself.)
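The point about deep copies can be made concrete with a toy tree model (a hypothetical `Node` class, not a real XDM implementation):

```python
# A minimal sketch: a deep copy preserves the subtree below a node but
# cuts off its ancestors and siblings. The Node class is hypothetical.
class Node:
    def __init__(self, name, attrs=None, children=()):
        self.name, self.attrs, self.parent = name, attrs or {}, None
        self.children = list(children)
        for c in self.children:
            c.parent = self

    def deep_copy(self):
        # copy this node and everything below it; the copy has no parent,
        # so the original's ancestors and siblings are unreachable from it
        return Node(self.name, dict(self.attrs),
                    [c.deep_copy() for c in self.children])

title = Node("title")
chapter = Node("chapter", children=[title])
book = Node("book", {"xml:lang": "en"}, [chapter])

copy_of_chapter = chapter.deep_copy()
print(chapter.parent.name)               # book: the original sees its ancestor
print(copy_of_chapter.parent)            # None: the copy cannot
print(copy_of_chapter.children[0].name)  # title: the subtree is preserved
```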

Several kinds of copy can be distinguished. In the terminology of the XSL Working Group (using terms introduced mostly by Michael Kay and Mohamed Zergaoui):

  • Y-copy: contains the subtree rooted in the node being copied, and also all of its ancestor nodes and their attributes, but none of their siblings. It is thus shaped like an upside down uppercase Y.
  • Nabla-copy: contains just the subtree rooted in the node being copied. It is thus shaped like an upside-down nabla. (Yes, also like a right-side-up delta, but since the Y-copy requires the Y to be inverted, we say nabla copy not delta copy. Besides, a delta copy would sound more like something used in change management.)
  • Dot-copy: contains just the node being copied, itself, and its attributes if any.
  • Yen-copy: like a Y-copy, but includes the siblings of each ancestor together with their attributes (although not their children).
  • Spanish exclamation-point copy: contains just the node being copied, and its ancestors, together with their attributes. Shaped like an exclamation point (dot, with something above it), or like an upside-down Spanish exclamation point.
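Two of these copies are easy to sketch with the same kind of toy `Node` class (hypothetical code, not the Working Group's definitions); the exclamation-point copy is the one that keeps the ancestor chain and its attributes:

```python
# Sketches of a dot-copy and a Spanish-exclamation-point copy over a
# hypothetical Node class (name, attrs, parent, children).
class Node:
    def __init__(self, name, attrs=None, children=()):
        self.name, self.attrs, self.parent = name, attrs or {}, None
        self.children = list(children)
        for c in self.children:
            c.parent = self

def dot_copy(node):
    # dot-copy: just the node and its attributes; no children, no ancestors
    return Node(node.name, dict(node.attrs))

def excl_copy(node):
    # exclamation-point copy: the node plus its chain of ancestors (with
    # their attributes), but no siblings and no children
    copy = Node(node.name, dict(node.attrs))
    src, tip = node.parent, copy
    while src is not None:
        tip = Node(src.name, dict(src.attrs), [tip])
        src = src.parent
    return copy

para = Node("p")
div = Node("div", {"xml:lang": "de"}, [para])
doc = Node("doc", {"xml:lang": "en"}, [div])

d = dot_copy(div)
print(d.attrs, d.children, d.parent)  # {'xml:lang': 'de'} [] None

# from an exclamation-point copy, the nearest xml:lang is still recoverable
c, langs = excl_copy(para), []
while c is not None:
    if "xml:lang" in c.attrs:
        langs.append(c.attrs["xml:lang"])
    c = c.parent
print(langs)  # ['de', 'en']
```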

I’ve been quite taken recently by one possible application of these ideas outside of the streaming XSLT work. In the current draft, assertions in XSD 1.1 are restricted to / are evaluated against a nabla-copy of the element or attribute being validated, and the conditions used for conditional type assignment are evaluated against a dot-copy of the element. These restrictions are painful, especially the latter, since it makes it impossible to select the type of an element depending on its xml:lang value (which is inherited from an ancestor if not specified locally). But XSD 1.1 could readily support conditions on the nearest value of xml:lang if conditions were evaluated on a Spanish-exclamation-point copy, instead of on a dot-copy, of the element in question. I don’t know whether the XML Schema WG will buy this approach, but the possibility does seem to suggest that there is value in thinking about things in terms of the invariants preserved by different kinds of node-copying operations.

More questions about Monte Carlo test-case generation

[9 August 2008]

Dave Pawson’s response to the previous post here has made me wonder about ways to tweak the Monte Carlo methods for generating test strings using random number generators — in particular, ways to make them more systematic (and also, perhaps, more automatic).

At the moment, I don’t have answers, but a couple of questions have occurred to me, and if I write them down on a piece of paper I’m just going to lose the paper. So I’ll write them down here.

(1) Can we measure the effectiveness of different methods of test generation by counting how many random strings must be generated in order to obtain a certain number of ‘interesting’ test cases?

(2) One way to define ‘interesting’: given two grammars for similar but not identical languages, how many random strings must we generate, using different methods, in order to find ten strings that exercise the difference between the languages? (I.e. ten strings that are members of L1 but not of L2, or vice versa.)

If we think we know that L1 and L2 differ in exactly three ways, we can also ask: how many test cases must we generate in order to find examples (or: ten examples) of each of the three differences between the languages? (A method that quickly exercises the first two differences but never finds the third would not be our best choice for work with languages where we don’t know the answers in advance.)
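Measurement (1)/(2) is easy to prototype. In this sketch, two toy regular expressions stand in for the grammars of L1 and L2 (L1 allows an optional sign on integers, L2 does not; both grammars and the alphabet are invented for the example):

```python
# Count how many random strings must be generated before we find ten
# that distinguish L1 from L2 (member of one language but not the other).
import random
import re

L1 = re.compile(r'[+-]?[0-9]+')   # signed or unsigned integers
L2 = re.compile(r'[0-9]+')        # unsigned integers only

def random_string(rng, alphabet, max_len=5):
    n = rng.randint(1, max_len)
    return ''.join(rng.choice(alphabet) for _ in range(n))

def count_until_distinguished(alphabet, wanted=10, seed=42):
    rng = random.Random(seed)
    generated = found = 0
    while found < wanted:
        s = random_string(rng, alphabet)
        generated += 1
        if bool(L1.fullmatch(s)) != bool(L2.fullmatch(s)):
            found += 1
    return generated

print(count_until_distinguished('0123456789+-'))
```

Running the same loop over full printable ASCII instead of this small grammar-aware alphabet yields a much larger count, which is exactly the kind of difference question (1) proposes to measure.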

(3) Another way: given any grammar, how many test strings must we generate, using a given method, in order to find three different strings which match each non-terminal in the grammar?

How many other ways of defining ‘interesting’ test cases can we find? How many other ways can we find of measuring the effectiveness of methods of generating test cases?

(4) At the moment, I know three methods for generating pseudo-random strings for testing grammars:

  1. Run the grammar itself as a generator, possibly using probabilities assigned to different branches at each choice point. John Cowan described this method in a comment on an earlier post here; Liam Quin also described essentially this method to me, in private conversation. If I need a name for this method, I’ll call it probabilistic sentence generation.
  2. For a given length of string n (given as a parameter, or itself chosen randomly), pick n characters randomly from the character set and concatenate them. I’ll call this random character concatenation.
  3. For a given number n, choose n items randomly from an array of strings constructed to include strings that match each non-terminal and terminal in the grammar’s vocabulary, and concatenate them. Since here the random selection is from strings representing the terminal + non-terminal vocabulary, I’ll call this random TNT concatenation.
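Method 1 can be sketched in a few lines of Python. The toy grammar (signed integers) and the weights are invented for the example; the point is only the mechanism of running the grammar forwards with probabilities at each choice point:

```python
# Probabilistic sentence generation: walk the grammar, choosing among
# alternatives with assigned weights. Toy grammar; all names hypothetical.
import random

# each non-terminal maps to a list of (weight, alternative) pairs;
# an alternative is a sequence of terminals (strings) and non-terminals
GRAMMAR = {
    'number': [(1, ['sign', 'digits']), (2, ['digits'])],
    'sign':   [(1, ['+']), (1, ['-'])],
    'digits': [(3, ['digit']), (1, ['digit', 'digits'])],  # biased short
    'digit':  [(1, [d]) for d in '0123456789'],
}

def generate(symbol, rng):
    if symbol not in GRAMMAR:          # terminals are emitted literally
        return symbol
    weights = [w for w, _ in GRAMMAR[symbol]]
    alts = [a for _, a in GRAMMAR[symbol]]
    alt = rng.choices(alts, weights=weights)[0]
    return ''.join(generate(s, rng) for s in alt)

rng = random.Random(1)
samples = [generate('number', rng) for _ in range(5)]
print(samples)  # every sample is a well-formed signed or unsigned integer
```

Note the weights on the recursive rule for 'digits': without a bias toward the terminating alternative, generated strings can grow very long.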

Question: are there other ways to use random-number generators to generate test strings for grammars?

(5) Can the array of strings used in the TNT method be constructed automatically, by using random character concatenation until you have matches for each terminal and non-terminal, and the array is fully populated?

(Here, Enrique dug his elbow into my rib. “Well, obviously yes it can,” he hissed. “What kind of idiotic question is that?!” “Hey, I’m just brainstorming. Gimme a break. Besides, this helps set up the two questions that follow.”)

I should mention here that I have a strong gut feeling that it’s useful, in practical cases, to augment the grammar with a terminal class for characters (or strings) which can never occur in sentences of the language. For a grammar for URIs, that is, the class of characters illegal in URIs definitely needs to be represented in the TNT array; for a grammar for numbers in scientific notation, you definitely want to have a few entries in the TNT array for characters like space, or #, or letters of the alphabet other than ‘e’ and ‘E’, which occur in the coded character sets your users will be using but are not part of the alphabet of the language in the formal sense. I have no good formal grip on this gut feeling, and no good formal way to describe it; I’ll work on that. But for now, understand that when I talk about strings matching terminals, I also intend to include some characters that don’t match any terminal. Otherwise you will surely fail to exercise some error detection paths in your system.

(6) Can the TNT array be populated by a bootstrapping method? Would this work?

  1. Start with an array that just includes substrings for the terminals.
  2. Select a production rule whose right-hand side contains only terminals (or, later on, one whose right hand side contains only items for which the TNT array has substrings).
  3. Generate random strings until you have a match, or two matches, for that non-terminal. (N.B. for recursive rules you’ll want to treat each choice in the right-hand side as a separate production rule.)
  4. Add the new substrings to the TNT array.
  5. Repeat (i.e. return to step 2), until you have strings in the TNT array for each production rule (not just each non-terminal and terminal) in the grammar.
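The steps above can be sketched as follows, with regular expressions standing in for the production rules of a hypothetical toy grammar (signed numbers), and with the productions listed in an order where each right-hand side uses only items already covered:

```python
# A sketch of the bootstrap in (6): start from terminals, then for each
# production in turn generate random TNT strings until it has matches.
import random
import re

TERMINALS = list('0123456789') + ['+', '-']
PRODUCTIONS = [                                # stand-ins for the rules
    ('digits', re.compile(r'[0-9]+')),         # needs terminals only
    ('signed', re.compile(r'[+-][0-9]+')),     # needs sign and digits
]

def bootstrap(seed=11, per_rule=2):
    rng = random.Random(seed)
    tnt = list(TERMINALS)                      # step 1: terminals only
    for name, rule in PRODUCTIONS:             # steps 2 and 5
        matches = []
        while len(matches) < per_rule:         # step 3: generate until match
            k = rng.randint(1, 3)
            s = ''.join(rng.choice(tnt) for _ in range(k))
            if rule.fullmatch(s):
                matches.append(s)
        tnt.extend(matches)                    # step 4: grow the TNT array
    return tnt

tnt = bootstrap()
print(len(tnt))  # 16: twelve terminals plus two strings per production
```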

(7) Another possible bootstrapping method:

  1. Begin with a TNT array for all the terminals, as just described.
  2. Also begin with a count table showing the non-terminals (or better, the production rules) of the grammar, and how many substrings they have in the TNT array; to begin with, every production rule in the count table will have zero matches.
  3. Generate a random string from the TNT array.
  4. Check the string against each rule in the grammar table for which you have fewer than three strings in the TNT array; when you find a match, add that string to the TNT array and increment the number for that rule in the count table. If the string doesn’t match any rules (or any rules that aren’t already at 3), discard it and generate another (i.e. go back to step 3). If every entry in the count table has a count of 3, stop.
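This count-table loop can also be sketched, again with regular expressions standing in for the production rules of a hypothetical toy grammar (decimal numbers); the quota of three strings per rule is the arbitrary choice discussed below:

```python
# A sketch of the bootstrapping method in (7): generate random strings
# from the growing TNT array, and stop once every rule reaches its quota.
import random
import re

RULES = {  # rule name -> regex standing in for the production
    'digits':  re.compile(r'[0-9]+'),
    'signed':  re.compile(r'[+-][0-9]+'),
    'decimal': re.compile(r'[+-]?[0-9]+\.[0-9]+'),
}

def bootstrap_with_counts(seed=7, quota=3):
    rng = random.Random(seed)
    tnt = list('0123456789') + ['+', '-', '.']  # step 1: terminals only
    counts = {name: 0 for name in RULES}        # step 2: the count table
    while min(counts.values()) < quota:
        k = rng.randint(2, 4)                   # step 3: random TNT string
        s = ''.join(rng.choice(tnt) for _ in range(k))
        for name, rule in RULES.items():        # step 4: check the rules
            if counts[name] < quota and rule.fullmatch(s):
                tnt.append(s)
                counts[name] += 1
    return tnt, counts

tnt, counts = bootstrap_with_counts()
print(counts)  # each rule has reached its quota of 3
```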

The description just given assumes that you want three strings in the TNT table for each rule in the grammar. That’s an arbitrary choice; it could be two, or one, or ten. And if there is reason to want especially thorough testing in some parts of the grammar, we could assign different quotas to different non-terminals. (If I’m particularly worried about character classes, then I want a lot of test cases that exercise that part of the grammar; I can get more by giving the non-terminals in that part of the grammar higher representation in the TNT table.)

(8) When I originally sketched out the TNT method I assumed that what I wanted was a few strings for each non-terminal (and terminal) in the grammar. In the course of working through the questions above I have come to believe that what we want is a few strings for each rule in the grammar: non-terminals with three rules need more strings than non-terminals with just one rule. Which is true? If the measurement methods described above work, it should be possible to measure the effectiveness of different approaches quantitatively.

(9) In setting up the TNT array, is the best default behavior going to be to provide the same number of strings for each item in the terminal + non-terminal vocabulary? Or would we do better to count the number of times each TNT vocabulary item is actually mentioned in the grammar, and assign slots proportionally? One mention: two strings. Eight mentions: sixteen strings. (Or should we do it logarithmically? One mention, two strings; eight mentions, three times two strings; n mentions, 2 times log2 n strings?)

(10) So far, this has all been about sequences of characters. How many of these methods can be transferred conveniently to the task of generating XML documents as test cases for document grammars?

Two styles of Monte Carlo testing for grammars

[8 August 2008]

In an earlier post I talked about generating test cases for grammars, and asked a number of questions. I’ve been meaning to get back to the topic; this afternoon I revisited some work I did with the XSD 1.1 grammar for regular expressions, and so the topic is again on my mind.

Pseudo-random strings

In the past, I’ve used Monte Carlo test generation to test grammars, in particular to find strings that distinguish between grammars. For example, given two grammars for the lexical space of a particular datatype, can we find strings that are accepted by one grammar and rejected by the other?

But random strings can also be used just to understand a given grammar better.

At one point in our work on XSD 1.1, for example, I proposed that in order to have a looser coupling between XSD and the RFCs that define URIs, we should just accept any string at all as a type-valid URI, much the same way that HTML browsers ‘accept’ any string as the value of an href attribute. They may not understand it, but they don’t raise an error message. The rationale for my proposal was that in any case, the generic grammars defined by RFC 3986 and its predecessors are so vague, so general, and so forgiving, that it wasn’t really very clear that they outlawed very much at all. I’ve heard people who know URIs well say that really the only strings that grammar won’t accept are strings containing two hash signs. Others, equally well informed, say that those are legal, too.

The scheme-specific grammars (e.g. the grammar for http: URLs) are much more useful in catching errors, I argued. But we don’t require that they be checked, and don’t really want to make them part of the definition of type validity for the anyURI type. (I think a good validator should check anyURI value against the scheme-specific grammars and issue warnings if there are problems, but I don’t know any that do that.) So, I argued, the XSD spec was giving the illusion of useful value checking, but not the reality. Others in the working group maintained that the generic grammar does perform some useful checks, and wanted to retain it.

Generating ten thousand or so strings of twenty or forty characters each, with each character a randomly chosen character in the printable range of the ASCII character set, persuaded me that I was wrong. Even if the only useful thing the generic URI grammar does is eliminate strings containing forbidden characters, it rejects well over half the random strings I generated.
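A rough re-creation of that experiment (not the original script) takes only a few lines. The character set below is my reading of the characters RFC 3986 permits somewhere in a URI; a string containing anything else is rejected by the generic grammar outright:

```python
# Generate random 20-character strings over printable ASCII and count how
# many contain at least one character forbidden anywhere in a URI.
import random
import string

# unreserved / gen-delims / sub-delims / percent, per RFC 3986
URI_CHARS = set(string.ascii_letters + string.digits + "-._~"
                + ":/?#[]@" + "!$&'()*+,;=" + "%")

rng = random.Random(0)
printable = [chr(c) for c in range(32, 127)]

trials = 10_000
rejected = sum(
    1 for _ in range(trials)
    if any(ch not in URI_CHARS
           for ch in (rng.choice(printable) for _ in range(20)))
)
print(rejected / trials)  # well over half the strings contain a forbidden character
```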

So I abandoned my proposal, and in XSD 1.1, type validity for anyURI does involve checking the literal against the grammar given in the RFC.

More interesting pseudo-random strings

One problem, for a human looking at the data, is that if you generate random strings as I did, by generating twenty random numbers between 32 and 127, and then concatenating the corresponding characters, then you generate an awful lot of uninteresting test cases, and you can go a long long time before you actually exercise a specific part of your grammar.

So when I faced the task of generating test cases for XSD’s regular-expression grammar, I wanted some method that would generate more interesting test cases, test cases more likely to exercise all the parts of the grammar. In particular, I wanted to make sure we got good coverage of the more obscure corners of the character-class constructs. These have relatively elaborate substructures and the likelihood of generating a complex character-class expression using randomly chosen characters was way too low.

To make a long story short, I think I found a way. This is what I did.

  1. I built a set of strings in the following way.
    1. For each non-terminal in the grammar, I generated one or more strings that matched that non-terminal, and added them to the set.
    2. For each literal string mentioned in the grammar, I included that literal string in the set. For example, the first production rule in the regex grammar is
      regExp ::= branch ( '|' branch )* 

      So I added “|” to the set of strings.

    3. For each class of terminal symbol, I included one or two examples in the set of strings.

    For the regex grammar, I ended up with a set of a hundred or so strings, from “a” to “(” to “[\p{IsBasicLatin}-[aeiou]]”.

  2. I put the set of strings into an array, in no particular order.
  3. Then, I generated sequences of five or ten random numbers between 0 and the size of my set of strings. For example, “[92, 21, 48, 6, 41, 69, 47]”.
  4. Then I used each number to look up the corresponding string in the array, and concatenated the results. For the random numbers just given, this produced the strings “}”, “a”, “2”, “)”, “|xyz”, “|”, “[\p{IsBasicLatin}-[aeiou]]”, and “(”.
    Together, these yield the test string “}a2)|xyz|[\p{IsBasicLatin}-[aeiou]](”.
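In outline, the generator can be sketched like this (in Python rather than the original Perl, and with a much-abbreviated fragment array standing in for the hundred-string set):

```python
# Random TNT concatenation: pick fragments known to be grammatically
# interesting and glue them together. The fragment list is illustrative
# only, a small sample of the kind of set described above.
import random

FRAGMENTS = [
    "a", "2", "(", ")", "|", "}", "|xyz", "*", "?", "\\d",
    "[aeiou]", "[^aeiou]", "[\\p{IsBasicLatin}-[aeiou]]",
]

def random_test_case(rng, min_len=5, max_len=10):
    # pick between min_len and max_len random fragments and concatenate
    n = rng.randint(min_len, max_len)
    return ''.join(rng.choice(FRAGMENTS) for _ in range(n))

rng = random.Random(3)
for _ in range(3):
    print(random_test_case(rng))
```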

The test cases generated by this modified method (random selection of substrings from a set of strings known to be grammatically interesting) exercised the grammar far better than test cases generated by the random-character method.

If I were a better mathematician, I might be able to characterize the difference. But in a purely informal way, one can think of the production rules of any grammar as a set of regular expressions over an alphabet consisting of both the terminal and the non-terminal symbols of the grammar, not just of characters. In any normal grammar, most random character sequences will fail to match most non-terminals. Test cases generated by selecting characters at random will have the same effect as selecting items randomly from the terminal + non-terminal vocabulary, but with a bias (typically a very strong bias) against selecting any non-terminal symbol. By building a set of substrings as described, the second Monte Carlo method evens the odds a bit, and you get sequences that are much more likely to match the regular expressions over the terminal + non-terminal vocabulary, or to come very close to matching them. And that combination (matching all the productions, and coming close to matching each of the productions, but missing) is as close as I can come to characterizing the set of ‘interesting’ test cases.

Certainly, it was not necessary to generate tens of thousands of test cases in order to find examples that were valid against one form of the grammar, and invalid against a different form of the grammar.

If anyone wants to look at this quick and dirty test generator, the code for the test generator is at http://www.w3.org/XML/2008/03/xsdl-regex/testgenerator1.pl; the parser for the grammars I was testing is in the other files of the same directory.