[14 July 2009, Happy Bastille Day]
In interoperability testing, test cases are particularly interesting when they elicit different behaviors from different processors.
When a grammar is revised, strings are of particular interest when they are grammatical according to one version of the grammar but not according to the other. Either the change was intended (and the string provides an example) or it was not intended (and the string exhibits a problem). Differences in the structure of the parse trees produced by the different grammars may be interesting, too.
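To make that concrete, here is a minimal sketch in Python, with two invented toy grammars for identifiers standing in for the old and new revisions (nothing here comes from any real spec). Any string accepted by one revision and rejected by the other is exactly the kind of interesting string just described.

```python
import re

# Two toy "revisions" of a grammar for identifiers, written as regexes.
grammar_v1 = re.compile(r'[A-Za-z][A-Za-z0-9]*\Z')     # old rule: letters and digits only
grammar_v2 = re.compile(r'[A-Za-z][A-Za-z0-9_-]*\Z')   # new rule: also underscore and hyphen

for s in ['foo', 'foo-bar', 'foo_1', '1foo']:
    v1_ok = bool(grammar_v1.match(s))
    v2_ok = bool(grammar_v2.match(s))
    if v1_ok != v2_ok:
        # daylight: the revision changed this string's status
        print(f'{s!r}: v1={v1_ok}, v2={v2_ok}')
```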
Several working groups I’ve served on have spent time worrying about whether our spec’s rules for handling (especially escaping and unescaping) URIs and IRIs should align with the rules specified by a variety of other specs (HTML 4.01, XSLT 1.0, any of the various RFCs which have at various times been the authoritative source of information, any of the various internet drafts which have later turned into, or failed to turn into, RFCs, etc. etc. ad luxuriam). At any given time, it would have been really really useful to have an answer to the question “Do these two different formulations of the rules ever actually produce different results? Or are they just different ways of saying the same thing? And if they do ever produce different results, are the cases involved already so pathological on other grounds that we don’t actually mind?”
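By way of illustration only, here is a hedged sketch of such a comparison in Python. The two functions are stand-ins for two formulations of an escaping rule (both built on urllib.parse.quote, with different sets of characters left unescaped); they are not faithful implementations of HTML 4.01, XSLT 1.0, or any RFC.

```python
from urllib.parse import quote

def escape_a(s):
    # variant A: leave the usual reserved characters untouched
    return quote(s, safe="/?#[]@!$&'()*+,;=:")

def escape_b(s):
    # variant B: escape everything except unreserved characters and '/'
    return quote(s, safe="/")

candidates = ['http://example.com/a b', 'path/with;semicolon', 'q?x=1&y=2', 'plain']
for s in candidates:
    a, b = escape_a(s), escape_b(s)
    if a != b:
        print(f'daylight on {s!r}:')
        print('  A ->', a)
        print('  B ->', b)
```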
These cases have in common that they exhibit discrepancies among things that (other things being equal) are (or might be) expected to be indistinguishable. That is, they document the daylight (which my Oxford dictionary glosses as “visible distance between one … thing and another”) between things that shouldn’t have daylight between them.
I’m coming to believe that during the development of a spec, daylight analysis — seeking and finding instances of daylight between things, or seeking and failing to find any daylight — may be the most important function of test cases. If not the most important, then surely a very important use.
If this conjecture is true, then it ought to have implications for judging the relative effectiveness of different methods for constructing collections of test cases. Traditional testing is measured by how many bugs it finds, and at what cost; techniques for generating test cases are valued highly or not depending on how likely they seem to be to find new bugs. For spec development, the utility of a test is tied instead to the likelihood of its finding daylight between the old and the new formulations of some rule.
Hmm. Is that a difference or not? I suppose a bug can be regarded as an instance of daylight between the implementation and the spec it’s supposed to be implementing. So perhaps bug-finding is just a special case of daylight analysis.
Of course, each revision of a rule or a grammar provides a new opportunity for daylight to arise and be found. It seems to follow that you may need new test cases for each revision. Automated methods that allow quick generation of relevant new test cases would be particularly useful.
Daylight analysis is not the only useful goal of test generation during spec development. Tests that illustrate the intended behavior of a rule are useful for clarification. Call these exemplary tests.
And for checking that a working group understands a problem area well, and that the rule for that problem is well formulated, it can be useful to construct tests with randomly selected properties, just to check that everyone agrees on how the rule should handle them. Call these sanity checks. Random selection of properties in sanity checks helps ensure that you don’t inadvertently feed your rule only ‘sensible’ examples which happen not to exercise its flaws. I once used an Alloy model to make twelve very simple test cases for the schema composition part of XSD; nine of them turned out to pose questions to which the spec’s answer is not obvious. (The twelve were a bit redundant: once the redundancies were eliminated, they boiled down to four distinct examples, for one of which the spec provides a clear analysis.) If I had limited myself to sensible examples, I would almost certainly have missed most or all of the problematic cases.
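This is not the Alloy model itself, just a hedged Python sketch of the same idea, with invented property names: enumerate the combinations of a few boolean properties and put a random sample of them in front of the group.

```python
import itertools
import random

# Hypothetical boolean properties of a test schema (illustrative names only;
# the real exercise used an Alloy model of XSD schema composition).
properties = ['has_target_namespace', 'is_redefined', 'is_chameleon_include']

# Enumerate all combinations, then sample a handful at random.
all_cases = list(itertools.product([False, True], repeat=len(properties)))
for combo in random.sample(all_cases, k=4):
    case = dict(zip(properties, combo))
    print(case)   # ask the working group: what should the rule say here?
```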
Daylight analysis as a goal works well with random generation of test cases, because it helps deal with one of the great problems of random generation: most randomly generated test cases (at least, the ones I have been able to generate using random means) are rather boring. The goal of finding daylight provides a simple filter: a randomly generated test case is interesting if it finds daylight between the two things you are comparing, and uninteresting otherwise. In realistic cases, uninteresting test cases will overwhelmingly outnumber interesting ones, but if you can apply an automated filter (parse with grammar G1, parse with grammar G2, compare the results; if they differ, you have found daylight), then you can keep a few uninteresting test cases (just in case), throw most of them away, and focus on the interesting ones.
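Here is a sketch of that filter in Python; the two toy grammars are hypothetical stand-ins for G1 and G2.

```python
import random
import re
import string

g1 = re.compile(r'[a-z][a-z0-9]*\Z')   # hypothetical old grammar
g2 = re.compile(r'[a-z]+\Z')           # hypothetical new grammar

def random_candidate():
    # generate a short random string over a small alphabet
    alphabet = string.ascii_lowercase + string.digits
    return ''.join(random.choice(alphabet) for _ in range(random.randint(1, 6)))

interesting, boring = [], []
for _ in range(1000):
    s = random_candidate()
    if bool(g1.match(s)) != bool(g2.match(s)):
        interesting.append(s)    # daylight found: keep it
    else:
        boring.append(s)         # keep a few of these, just in case

print(len(interesting), 'interesting,', len(boring), 'boring')
print('sample of interesting cases:', interesting[:5])
```

In a real setting the candidate generator and the two parsers would be the expensive parts; the filter itself stays this simple.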