Digital Humanities 2008

After two days of drizzle and gray skies, the sun came out on Saturday to make the last day of Digital Humanities 2008 memorable and ensure that the participants all remember Finland and Oulu as beautiful and not (only) as gray and wet and chilly. Watching the sun set over the water, a few minutes before midnight, by gliding very slowly sideways beneath the horizon, gave me great satisfaction.

The Digital Humanities conference is the successor to the annual joint conference of the Association for Computers and the Humanities (ACH) and the Association for Literary and Linguistic Computing (ALLC), now organized by the umbrella organization they have founded, which in a bit of nomenclature worthy of Garrison Keillor is called the Alliance of Digital Humanities Organizations.

There were a lot of good papers this year, and I don’t have time to go through them all here, since I’m supposed to be getting ready to catch the airport bus. So I hope to do a sort of fragmented trip report in the form of follow-up posts on a number of projects and topics that caught my eye. A full-text XML search engine I had never heard of before (TauRo, from the Scuola Normale Superiore in Pisa), bibliographic software from Brown, and a whole long series of digital editions and databases are what jump to my mind now, in my haste. The attendance was better than I had expected, and confirmed what some have long suspected: King’s College London has become the 800-pound gorilla of humanities computing. Ten percent of the attendees had King’s affiliations, there was an endless series of reports on intelligently conceived and deftly executed projects from King’s, and King’s delegates seemed to play a disproportionately large role in posing incisive questions and in the interesting parts of discussions. There were plenty of good projects done elsewhere, too, but what Harold Short and his colleagues have done at King’s is really remarkable: someone interested in how institutions are built up to eminence (whether as a study in organizational management or with an eye to building up an organization of their own) should really do a study of how they have gone about it.

As local organizer, Lisa-Lena Opas-Hänninen has done an amazing job, and Espen Ore’s program committee deserves credit for a memorable program. Next year’s organizers at the University of Maryland in College Park have a tough act to follow.

XSD 1.1 is in Last Call

Yesterday the World Wide Web Consortium published new drafts of its XML Schema Definition Language (XSD) 1.1, as ‘last-call’ drafts.

The idiom has an obscure history, but is clearly related to the last call for orders in pubs which must close by a certain hour. The working group responsible for a specification labels it ‘last call’, as in ‘last call for comments’, to indicate that the working group believes the spec is finished and ready to move forward. If other working groups or external readers have been waiting to review the document, thinking “there’s no point reviewing it now because they are still changing things”, the last call is a signal that the responsible working group has stopped changing things, so if you want to review it, it’s now or never.

The effect, of course, can be to evoke a lot of comments that require significant rework of the spec, so that in fact it would be foolish for a working group to believe they are essentially done when they reach last call. (Not that it matters what the WG thinks: a working group that believes last call is the end of the real work will soon be taught better.)

In the case of XSD 1.1, this is the second last-call publication both for the Datatypes spec and for the Structures spec (published previously as last-call working drafts in February 2006 and in August 2007, respectively). Each elicited scores of comments: by my count there are 126 Bugzilla issues on Datatypes opened since 17 February 2006, and 96 issues opened against Structures since 31 August 2007. We have closed all of the substantive comments: most by fixing the problem, and a few (sigh) by discovering either that we could not reach consensus on what to do about the problem (or, in some cases, on whether there really was a problem before us), or that we could not make the requested change without more delay than seemed warrantable. There are still a number of ‘editorial’ issues open; these are expected not to affect the conformance requirements of the spec or to change the results of anyone’s review, and we therefore hope to be able to close them after going to last call.

XSD 1.1 is, I think, somewhat improved over XSD 1.0 in a number of ways, ranging from the very small but symbolically very significant to much larger changes. On the small but significant side: the spec has a name now (XSD) that is distinct from the generic noun phrase used to describe the subject matter of the spec (XML schemas), which should make it easier for people to talk about XML schema languages other than XSD without confusing some listeners. On the larger side:

  • XSD 1.1 supports XPath 2.0 assertions on complex and simple types (a small example follows this list). The subset of XPath 2.0 defined for assertions in earlier drafts of XSD 1.1 has been dropped; processors are expected to support all of XPath 2.0 for assertions. (There is, however, a subset defined for conditional type assignment, although here too schema authors are allowed to use, and processors are allowed to support, full XPath.)
  • ‘Negative’ wildcards are allowed, that is, wildcards which match all names except some specified set. The excluded names can be listed explicitly, or can be “all the elements defined in the schema” or “all the elements present in the content model”.
  • The xs:redefine element has been deprecated, and a new xs:override element has been defined which has clearer semantics and is easier to use.
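
Here, for flavor, is a small illustrative fragment showing the assertion and conditional-type-assignment syntax. All of the element, attribute, and type names are invented for the example, and the alternative types are assumed to be derived from the declared type of the element.

    <xs:complexType name="rangeType">
      <xs:attribute name="min" type="xs:decimal" use="required"/>
      <xs:attribute name="max" type="xs:decimal" use="required"/>
      <!-- valid only if max is not less than min -->
      <xs:assert test="@max ge @min"/>
    </xs:complexType>

    <xs:element name="message" type="messageType">
      <!-- the governing type of the element varies with its kind attribute;
           textMessageType and binaryMessageType are assumed to be derived
           from messageType -->
      <xs:alternative test="@kind = 'text'"   type="textMessageType"/>
      <xs:alternative test="@kind = 'binary'" type="binaryMessageType"/>
    </xs:element>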

Some changes vis-a-vis 1.0 were already visible in earlier drafts of 1.1:

  • The rules requiring deterministic content models have been relaxed to allow wildcards to compete with elements (although the determinism rule has not been eliminated completely, as some would prefer).
  • XSD 1.1 supports both XML 1.0 and XML 1.1.
  • A conditional inclusion mechanism is defined for schema documents, which allows schema authors to write schema documents that will work with multiple versions of XSD (a small example follows this list). (This conditional inclusion mechanism is not part of XSD 1.0, and cannot be added to it by an erratum, but there is no reason a conforming XSD 1.0 processor cannot support it, and I encourage makers of 1.0 processors to add support for it.)
  • Schema authors can specify various kinds of ‘open content’ for content models; this can make it easier to produce new versions of a vocabulary with the property that any document valid against the new vocabulary will also be valid against the old.
  • The Datatypes spec includes a precisionDecimal datatype intended to support the IEEE 754R floating-point decimal specification recently approved by IEEE.
  • Processors are allowed to support primitive datatypes, and datatype facets, additional to those defined in the specification.
  • We have revised many, many passages in the spec to try to make them clearer. It has not been easy to rewrite for clarity while retaining the kind of close correspondence to 1.0 that allows the working group and implementors to be confident that the rewrite has not inadvertently changed the conformance criteria. Some readers will doubtless wish that the working group had done more in this regard. But I venture to hope that many readers will be glad for the improvements in wording. The spec is still complex and some parts of it still make for hard going, but I think the changes are trending in the right direction.
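
To illustrate the conditional-inclusion mechanism mentioned above, here is a purely illustrative sketch (the type name and its content are invented): two definitions of the same simple type sit side by side in one schema document, marked with the vc:minVersion and vc:maxVersion attributes from the versioning namespace, so that a 1.1 processor sees the definition which uses an assertion facet and a 1.0 processor which supports the mechanism sees only the plain fallback.

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:vc="http://www.w3.org/2007/XMLSchema-versioning">

      <!-- seen only by processors implementing XSD 1.1 or later -->
      <xs:simpleType name="evenNumber" vc:minVersion="1.1">
        <xs:restriction base="xs:integer">
          <xs:assertion test="$value mod 2 eq 0"/>
        </xs:restriction>
      </xs:simpleType>

      <!-- fallback seen only by pre-1.1 (that is, 1.0) processors -->
      <xs:simpleType name="evenNumber" vc:maxVersion="1.1">
        <xs:restriction base="xs:integer"/>
      </xs:simpleType>

    </xs:schema>

A 1.0 processor which does not implement the mechanism will, of course, simply see two competing definitions of the same type, which is exactly why it is worth adding support for the vc attributes to 1.0 processors.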

If you have any interest in XSD, or in XML schema languages in general, I hope you will take the time to read and comment on XSD 1.1. The comment period runs through 12 September 2008. The specs may be found on the W3C Technical Reports index page.

Thinking about test cases for grammars

[19 June 2008]

I’ve been thinking about testing and test cases a lot recently.

I don’t have time to write it all up, and it wouldn’t fit comfortably in a single post anyway. But several questions have turned out to provide a lot of food for thought.

The topic first offered itself in connection with several bug reports against the grammar for regular expressions in XSD Part 2: Datatypes, and with the prospect of revising the grammar to resolve the issues. When revising a grammar, it would be really useful to be confident that the changes one is making change the parts of the language one wants to change, and leave the rest of the language untouched. In the extreme case, perhaps we don’t want to change the language at all, just to reformulate the grammar to be easier to follow or to provide semantics for. How can we be confident that the old grammar and the new one describe the same language?

For regular languages, I believe the problem is fairly straightforward. You can calculate the minimal FSA for each language and check whether the two minimal FSAs are isomorphic. Or you can calculate both set differences (L1 – L2 and L2 – L1) and check that both of them are the empty set. And there are tools like Grail that can help you perform the check, although the notation Grail uses is just different enough from the notation XSD uses to make confusion and translation errors possible (but similar enough that you think it would be over-engineering to try to automate the translation).
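
As a cheap, bounded stand-in for the set-difference check, one can simply enumerate every string over a small alphabet up to a modest length and look for a string on which the two candidate expressions disagree. Here is a minimal sketch in Python; the regexes and the length bound are invented examples, and an empty result is only evidence of equivalence up to that length, not a proof (the minimal-FSA or set-difference constructions are what give a real answer).

    # Bounded, brute-force comparison of two regular expressions: enumerate
    # every string over a small alphabet up to max_len and report strings on
    # which the two disagree.  Not a proof of equivalence, just a smoke test.
    import itertools
    import re

    def disagreements(regex1, regex2, alphabet, max_len):
        r1, r2 = re.compile(regex1), re.compile(regex2)
        for n in range(max_len + 1):
            for chars in itertools.product(alphabet, repeat=n):
                s = "".join(chars)
                if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s)):
                    yield s

    # Two formulations of "an even number of a's" (invented examples);
    # an empty list means no difference was found up to length 8.
    old_form = "(b*ab*ab*)*b*"
    new_form = "b*(ab*ab*)*"
    print(list(itertools.islice(disagreements(old_form, new_form, "ab", 8), 5)))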

But for context-free languages, the situation is not so good. The equivalence of two context-free languages is not even decidable in the general case (though it is decidable for some restricted classes, such as deterministic context-free languages), and even to work out what can usefully be done for a particular pair of grammars I would have to spend time rereading Hopcroft and Ullman, or Grune and Jacobs. And I don’t know of any good grammar-analysis tools. (When I ask people, they say the closest thing they know of to a grammar analysis tool is the error messages from yacc and its ilk.) So even if one did buckle down and try to prove the original form of the grammar and the new form equivalent, the possibility of errors in the proof is quite real and it would be nice to have a convenient way of generating a thorough set of test cases.

I can think of two ways to generate test cases:

  • Generation of random or pseudo-random strings; let’s call this Monte Carlo testing. (A tiny sketch of this option follows the list.)
  • Careful systematic creation of test cases. I.e., hard work, either in the manual construction of tests or in setting things up for automatic test generation.
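
Here is a minimal sketch of the Monte Carlo option, with everything in it invented for illustration: draw strings at random over a small alphabet and count how many even match a toy expression standing in for the grammar under test. It also previews the weakness discussed below, namely that almost none of the random strings are interesting.

    # Monte Carlo generation in miniature: random strings over a small
    # alphabet, with a toy regular expression standing in for the grammar
    # under test.  Alphabet, lengths, and the expression are all invented.
    import random
    import re

    ALPHABET = "ab()|*"
    TOY_GRAMMAR = re.compile(r"\((a|b)\*\)")   # matches e.g. "(a*)" or "(b*)"

    def random_string(max_len=6):
        length = random.randint(0, max_len)
        return "".join(random.choice(ALPHABET) for _ in range(length))

    cases = [random_string() for _ in range(10000)]
    hits = [s for s in cases if TOY_GRAMMAR.fullmatch(s)]
    print(len(hits), "of", len(cases), "random strings matched")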

Naturally my first thought was how to avoid hard work by generating useful test cases with minimal labor.

The bad news is that this only led to other questions, like “what do you mean by useful test cases?”

The obvious answer is that in the grammar comparison case, one wants to generate test cases which will expose differences in the languages defined by the two grammars, just as in the case of software one wants test cases which will expose errors in the program. The parallel suggests that one might learn something useful by attempting to apply general testing principles to grammars and to specifications based on grammars.

So I’ve been thinking about some questions which arise from that idea. In much of this I am guided by Glenford J. Myers, The art of software testing (New York: Wiley, 1979). If I had no other reasons for being grateful to Paul Cotton, his recommending Myers to me would still put me in his debt.

  • For the various measures of test ‘coverage’ defined by Myers (statement coverage, decision coverage, condition coverage, decision/condition coverage, multiple condition coverage), what are the corresponding measures for grammars? (One obvious candidate, production coverage, is sketched just after this list.)
  • If one generates random strings to use as test cases, how long does the string need to be in order to be useful? (For some meaning of “useful” — for example, in order to ensure that all parts of the grammar can be exercised.)
  • How long can the strings get before they are clearly not testing anything that shorter strings haven’t already tested adequately (for some meaning of “adequate”)?
  • From earlier experience generating random strings as test cases, I know that for pretty much any definition of “interesting test case”, the large majority of random test cases are not “interesting”. Is there a way to increase the likelihood of a test case being interesting? A way that doesn’t involve hard work, I mean.
  • How good a job can we do at generating interesting test cases with only minimal understanding of the language, or with only minimal analysis of its grammar or FSA?
  • What kinds of analysis offer the best bang for the buck in terms of improving our ability to generate test cases automatically?
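
By way of illustration for the first question, here is a sketch of one obvious grammar analogue of statement coverage, which might be called production coverage: the fraction of the grammar’s productions exercised by a set of test cases. The toy grammar is invented (a scrap of arithmetic expressions, not the XSD regular-expression grammar), and for simplicity the bookkeeping rides along with random generation; measuring the coverage of an externally supplied test set would require a parser instead.

    # "Production coverage": the fraction of the grammar's productions
    # exercised.  Coverage is recorded during random generation from an
    # invented toy grammar.
    import random

    GRAMMAR = {
        "expr":   [["term", "+", "expr"], ["term"]],
        "term":   [["factor", "*", "term"], ["factor"]],
        "factor": [["(", "expr", ")"], ["num"]],
        "num":    [["0"], ["1"], ["2"]],
    }

    def generate(symbol, used, depth=0, max_depth=8):
        if symbol not in GRAMMAR:                    # terminal symbol
            return symbol
        choices = GRAMMAR[symbol]
        if depth >= max_depth:                       # force termination
            choices = [min(choices, key=len)]
        production = random.choice(choices)
        used.add((symbol, tuple(production)))        # record the production taken
        return "".join(generate(s, used, depth + 1, max_depth) for s in production)

    def production_coverage(n_cases):
        used = set()
        for _ in range(n_cases):
            generate("expr", used)
        total = sum(len(alternatives) for alternatives in GRAMMAR.values())
        return len(used) / total

    for n in (1, 5, 50):
        print(n, "random cases:", production_coverage(n))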

Balisage offers hope for the deadline-challenged

In my never-ending quest to help those who, like myself, never get around to things until the deadline is breathing down their necks, I have until now avoided mentioning that Balisage, the conference on markup theory and practice, has issued a call for late-breaking news.

The deadline for late-breaking submissions is 13 June 2008. It is now officially breathing down your neck.

There, will that do the job?

The 13th of June is this Friday. Just enough time to write up that great piece of work you just did, but not long enough to make a huge production of it and get yourself tied up in knots.

Balisage is an annual conference on markup theory and practice, held in early August each year in Montréal. Well, I say annual, but strictly speaking this is Balisage’s first year. The organizers have in the past been involved in other conferences in Montréal in August (most recently Extreme Markup Languages), and we regard Balisage as the natural continuation. So if you have always wanted to go to the Extreme Markup Languages conference, and are disappointed to see no announcement of it this year, come to Balisage. I think you’ll find what you’re looking for.

The full call for late-breaking news, and details of how to make a submission, are at http://www.balisage.net/latebreaking-call.html.