More questions about Monte Carlo test-case generation

[9 August 2008]

Dave Pawson’s response to the previous post here has made me wonder about ways to tweak the Monte Carlo methods for generating test strings using random number generators — in particular, ways to make them more systematic (and also, perhaps, more automatic).

At the moment, I don’t have answers, but a couple of questions have occurred to me, and if I write them down on a piece of paper I’m just going to lose the paper. So I’ll write them down here.

(1) Can we measure the effectiveness of different methods of test generation by counting how many random strings must be generated in order to obtain a certain number of ‘interesting’ test cases?

(2) One way to define ‘interesting’: given two grammars for similar but not identical languages, how many random strings must we generate, using different methods, in order to find ten strings that exercise the difference between the languages? (I.e. ten strings that are members of L1 but not of L2, or vice versa.)

If we think we know that L1 and L2 differ in exactly three ways, we can also ask: how many test cases must we generate in order to find examples (or: ten examples) of each of the three differences between the languages? (A method that quickly exercises the first two differences but never finds the third would not be our best choice for work with languages where we don’t know the answers in advance.)
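A minimal sketch of how such a count might be taken, under loud assumptions: the two regular expressions are hypothetical stand-ins for the grammars of L1 and L2 (signed versus unsigned integers), `random_chars` is plain random character concatenation over an invented alphabet, and the ten strings are required to be distinct. Any other generator with the same signature could be passed in and compared on the same footing.

```python
import random
import re

# Hypothetical stand-ins for two similar grammars:
# L1 accepts optionally signed integers, L2 only unsigned ones.
L1 = re.compile(r"[+-]?[0-9]+")
L2 = re.compile(r"[0-9]+")

ALPHABET = "0123456789+-e. "  # assumed character set, for illustration only


def random_chars(rng, max_len=4):
    """Random character concatenation: 1..max_len characters from ALPHABET."""
    n = rng.randint(1, max_len)
    return "".join(rng.choice(ALPHABET) for _ in range(n))


def trials_to_find(generate, wanted=10, limit=1_000_000, seed=0):
    """Count the strings generated before `wanted` distinct strings have been
    seen that belong to exactly one of the two languages."""
    rng = random.Random(seed)
    found = set()
    for trials in range(1, limit + 1):
        s = generate(rng)
        in_l1 = L1.fullmatch(s) is not None
        in_l2 = L2.fullmatch(s) is not None
        if in_l1 != in_l2:  # the string exercises a difference
            found.add(s)
            if len(found) == wanted:
                return trials, sorted(found)
    return None, sorted(found)  # gave up before reaching the target


if __name__ == "__main__":
    trials, examples = trials_to_find(random_chars)
    print(trials, examples)
```

The number returned is the measure proposed in question (1): the smaller it is for a given generator, the more effective that generator is at finding this particular difference.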

(3) Another way: given any grammar, how many test strings must we generate, using a given method, in order to find three different strings which match each non-terminal in the grammar?

How many other ways of defining ‘interesting’ test cases can we find? How many other ways can we find of measuring the effectiveness of methods of generating test cases?

(4) At the moment, I know three methods for generating pseudo-random strings for testing grammars:

  1. Run the grammar itself as a generator, possibly using probabilities assigned to different branches at each choice point. John Cowan described this method in a comment on an earlier post here; Liam Quin also described essentially this method to me, in private conversation. If I need a name for this method, I’ll call it probabilistic sentence generation.
  2. For a given length of string n (given as a parameter, or itself chosen randomly), pick n characters randomly from the character set and concatenate them. I’ll call this random character concatenation.
  3. For a given number n, choose n items randomly from an array of strings constructed to include strings that match each non-terminal and terminal in the grammar’s vocabulary, and concatenate them. Since here the random selection is from strings representing the terminal + non-terminal vocabulary, I’ll call this random TNT concatenation.
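The three methods can be sketched side by side. Everything concrete here is an assumption made for illustration: the toy grammar, its branch weights, the alphabet, and the hand-built TNT array are all invented, not taken from any real system.

```python
import random

# Hypothetical toy grammar. Keys are non-terminals; any symbol that is not a
# key is a literal terminal. Each alternative is (weight, list of symbols).
GRAMMAR = {
    "NUMBER": [(1, ["INT"]), (1, ["INT", "e", "INT"])],
    "INT":    [(1, ["DIGITS"]), (1, ["-", "DIGITS"])],
    "DIGITS": [(2, ["DIGIT"]), (1, ["DIGIT", "DIGITS"])],
    "DIGIT":  [(1, [d]) for d in "0123456789"],
}


def probabilistic_sentence(rng, symbol="NUMBER"):
    """Method 1: run the grammar as a generator, weighting each choice."""
    if symbol not in GRAMMAR:
        return symbol  # a terminal stands for itself
    alternatives = GRAMMAR[symbol]
    weights = [weight for weight, _ in alternatives]
    _, rhs = rng.choices(alternatives, weights=weights)[0]
    return "".join(probabilistic_sentence(rng, s) for s in rhs)


def random_characters(rng, n, alphabet="0123456789-e #x"):
    """Method 2: concatenate n characters chosen at random."""
    return "".join(rng.choice(alphabet) for _ in range(n))


# Hand-built TNT array (an assumption): strings for terminals and
# non-terminals, plus two characters that are illegal in the language.
TNT = ["7", "42", "-3", "1e5", "-2e-9", "e", "-", "#", " "]


def random_tnt(rng, n, tnt=TNT):
    """Method 3: concatenate n items chosen at random from the TNT array."""
    return "".join(rng.choice(tnt) for _ in range(n))


if __name__ == "__main__":
    rng = random.Random(2008)
    print(probabilistic_sentence(rng))
    print(repr(random_characters(rng, 5)))
    print(repr(random_tnt(rng, 3)))
```

Method 1 produces only legal sentences; methods 2 and 3 produce mostly illegal ones, with method 3 tending to land much closer to the boundary of the language.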

Question: are there other ways to use random-number generators to generate test strings for grammars?

(5) Can the array of strings used in the TNT method be constructed automatically, by using random character concatenation until you have matches for each terminal and non-terminal, and the array is fully populated?

(Here, Enrique dug his elbow into my rib. “Well, obviously yes it can,” he hissed. “What kind of idiotic question is that?!” “Hey, I’m just brainstorming. Gimme a break. Besides, this helps set up the two questions that follow.”)

I should mention here that I have a strong gut feeling that it’s useful, in practical cases, to augment the grammar with a terminal class for characters (or strings) which can never occur in sentences of the language. For a grammar for URIs, that is, the class of characters illegal in URIs definitely needs to be represented in the TNT array; for a grammar for numbers in scientific notation, you definitely want to have a few entries in the TNT array for characters like space, or #, or letters of the alphabet other than ‘e’ and ‘E’, which occur in the coded character sets your users will be using but are not part of the alphabet of the language in the formal sense. I have no good formal grip on this gut feeling, and no good formal way to describe it; I’ll work on that. But for now, understand that when I talk about strings matching terminals, I also intend to include some characters that don’t match any terminal. Otherwise you will surely fail to exercise some error detection paths in your system.
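One way to make the idea concrete, as a sketch only: take the characters users can plausibly type, subtract the formal alphabet of the language, and add a few of what remains to the TNT array. The alphabet, the assumed user character set (printable ASCII), and the function names are all hypothetical.

```python
import random
import string

# Assumption: the formal alphabet of a toy scientific-notation language.
LANGUAGE_ALPHABET = set("0123456789+-.eE")

# Assumption: users type printable ASCII; a real system may see far more.
USER_CHARACTERS = string.ascii_letters + string.digits + string.punctuation + " "


def illegal_characters(user_characters=USER_CHARACTERS, alphabet=LANGUAGE_ALPHABET):
    """Characters users can produce that match no terminal of the grammar."""
    return [c for c in user_characters if c not in alphabet]


def augment(tnt, rng, how_many=4):
    """Return a copy of the TNT array with a few illegal characters added."""
    return list(tnt) + rng.sample(illegal_characters(), how_many)


if __name__ == "__main__":
    print(augment(["7", "42", "1e5"], random.Random(0)))
```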

(6) Can the TNT array be populated by a bootstrapping method? Would this work?

  1. Start with an array that just includes substrings for the terminals.
  2. Select a production rule whose right-hand side contains only terminals (or, later on, one whose right-hand side contains only items for which the TNT array has substrings).
  3. Generate random strings until you have a match, or two matches, for that non-terminal. (N.B. for recursive rules you’ll want to treat each choice in the right-hand side as a separate production rule.)
  4. Add the new substrings to the TNT array.
  5. Repeat (i.e. return to step 2), until you have strings in the TNT array for each production rule (not just each non-terminal and terminal) in the grammar.
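The steps above can be sketched as follows. Several things here are assumptions rather than part of the method as stated: the toy grammar is invented, a regular expression stands in for a real recogniser for each rule, the random strings of step 3 are produced by concatenating items from the current TNT array, and two distinct matches are collected per rule.

```python
import random
import re

# Step 1 material: terminal classes, including a class of illegal characters.
TERMINALS = {
    "digit": list("0123456789"),
    "sign": ["+", "-"],
    "e": ["e", "E"],
    "illegal": [" ", "#"],
}

# Hypothetical rules: (left-hand side, right-hand side, recogniser).
# The recursive non-terminal "digits" is split into one rule per choice.
RULES = [
    ("digits", ["digit"], re.compile(r"[0-9]")),
    ("digits", ["digit", "digits"], re.compile(r"[0-9]{2,}")),
    ("integer", ["digits"], re.compile(r"[0-9]+")),
    ("integer", ["sign", "digits"], re.compile(r"[+-][0-9]+")),
    ("number", ["integer", "e", "integer"],
     re.compile(r"[+-]?[0-9]+[eE][+-]?[0-9]+")),
]


def bootstrap_tnt(per_rule=2, max_items=3, limit=100_000, seed=0):
    rng = random.Random(seed)
    tnt = {name: list(strings) for name, strings in TERMINALS.items()}  # step 1
    pending = list(RULES)
    while pending:  # step 5: repeat until every rule is covered
        # Step 2: pick a rule whose right-hand side already has TNT entries.
        rule = next((r for r in pending if all(s in tnt for s in r[1])), None)
        if rule is None:
            raise ValueError("remaining rules can never be covered")
        lhs, _, recogniser = rule
        pool = [s for strings in tnt.values() for s in strings]
        found = []
        for _ in range(limit):  # step 3: generate until enough matches
            n = rng.randint(1, max_items)
            candidate = "".join(rng.choice(pool) for _ in range(n))
            if recogniser.fullmatch(candidate) and candidate not in found:
                found.append(candidate)
                if len(found) == per_rule:
                    break
        else:
            raise RuntimeError(f"gave up looking for matches for {lhs}")
        tnt.setdefault(lhs, []).extend(found)  # step 4
        pending.remove(rule)
    return tnt


if __name__ == "__main__":
    for symbol, strings in bootstrap_tnt().items():
        print(symbol, strings)
```

On this toy grammar the answer to "would this work?" is yes, but the sketch says nothing about how the trial counts grow for rules with long right-hand sides.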

(7) Another possible bootstrapping method:

  1. Begin with a TNT array for all the terminals, as just described.
  2. Also begin with a count table showing the non-terminals (or better, the production rules) of the grammar, and how many substrings they have in the TNT array; to begin with, every production rule in the count table will have zero matches.
  3. Generate a random string from the TNT array.
  4. Check the string against each rule in the count table for which you have fewer than three strings in the TNT array; when you find a match, add that string to the TNT array and increment the number for that rule in the count table. If the string doesn’t match any rules (or any rules that aren’t already at 3), discard it and generate another (i.e. go back to step 3). If every entry in the count table has a count of 3, stop.

The description just given assumes that you want three strings in the TNT array for each rule in the grammar. That’s an arbitrary choice; it could be two, or one, or ten. And if there is reason to want especially thorough testing in some parts of the grammar, we could assign different quotas to different non-terminals. (If I’m particularly worried about character classes, then I want a lot of test cases that exercise that part of the grammar; I can get more by giving the non-terminals in that part of the grammar higher representation in the TNT array.)
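A sketch of the count-table procedure with per-rule quotas. The rule names and their regular-expression recognisers are hypothetical stand-ins for a real grammar and parser, and two details are my assumptions: a candidate already present in the TNT array is discarded, and a candidate is credited only to the first under-quota rule it matches.

```python
import random
import re

# Step 1 material: strings for the terminals, plus two illegal characters.
TERMINAL_STRINGS = list("0123456789") + ["+", "-", ".", "e", "E", " ", "#"]

# Hypothetical rules, each with a regular expression as its recogniser.
RULES = {
    "digits": re.compile(r"[0-9]+"),
    "integer": re.compile(r"[+-][0-9]+"),
    "decimal": re.compile(r"[+-]?[0-9]+\.[0-9]+"),
    "number": re.compile(r"[+-]?[0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+"),
}


def fill_tnt(quotas=None, default_quota=3, max_items=4, limit=200_000, seed=0):
    """Grow the TNT array until every rule has met its quota of strings."""
    rng = random.Random(seed)
    quotas = quotas or {}
    quota = {rule: quotas.get(rule, default_quota) for rule in RULES}
    tnt = list(TERMINAL_STRINGS)          # step 1
    counts = {rule: 0 for rule in RULES}  # step 2: the count table
    for _ in range(limit):
        n = rng.randint(1, max_items)
        candidate = "".join(rng.choice(tnt) for _ in range(n))  # step 3
        if candidate not in tnt:
            for rule, recogniser in RULES.items():              # step 4
                if counts[rule] < quota[rule] and recogniser.fullmatch(candidate):
                    tnt.append(candidate)
                    counts[rule] += 1
                    break
        if all(counts[rule] >= quota[rule] for rule in RULES):
            return tnt, counts
    raise RuntimeError("quotas not reached within the trial limit")


if __name__ == "__main__":
    # Ask for heavier coverage of "digits" than of the other rules.
    tnt, counts = fill_tnt({"digits": 5})
    print(counts)
    print(tnt[len(TERMINAL_STRINGS):])
```

Because new matches go straight back into the array that candidates are drawn from, later candidates are built partly from earlier successes, which is what makes the longer rules reachable at all.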

(8) When I originally sketched out the TNT method I assumed that what I wanted was a few strings for each non-terminal (and terminal) in the grammar. In the course of working through the questions above I have come to believe that what we want is a few strings for each rule in the grammar: non-terminals with three rules need more strings than non-terminals with just one rule. Which is true? If the measurement methods described above work, it should be possible to measure the effectiveness of different approaches quantitatively.

(9) In setting up the TNT array, is the best default behavior going to be to provide the same number of strings for each item in the terminal + non-terminal vocabulary? Or would we do better to count the number of times each TNT vocabulary item is actually mentioned in the grammar, and assign slots proportionally? One mention: two strings. Eight mentions: sixteen strings. (Or should we do it logarithmically? One mention, two strings; eight mentions, three times two strings; n mentions, 2 times log2 n strings?)
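The two allocation schemes can be compared numerically. One wrinkle: taken literally, 2 times log2 n gives zero strings for a single mention, since log2 1 = 0, which contradicts "one mention, two strings". The sketch below therefore assumes a floor of one unit, and also assumes rounding up when n is not a power of two; both choices are mine, as are the function names.

```python
import math


def proportional_slots(mentions, per_mention=2):
    """Two strings per mention: 1 -> 2, 8 -> 16."""
    return per_mention * mentions


def logarithmic_slots(mentions, unit=2):
    """`unit` strings times log2 of the mention count, rounded up, with a
    floor of one unit (an assumption) so that one mention still gets two."""
    return unit * max(1, math.ceil(math.log2(mentions)))


if __name__ == "__main__":
    for n in (1, 2, 3, 8, 100):
        print(n, proportional_slots(n), logarithmic_slots(n))
```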

(10) So far, this has all been about sequences of characters. How many of these methods can be transferred conveniently to the task of generating XML documents as test cases for document grammars?

One thought on “More questions about Monte Carlo test-case generation”

  1. I don’t think you are being lazy enough, Michael. Take any of your grammar non-terminals. Feed it to a function [[optional length parameter | number of terms wanted ]] which returns an array of the strings of characters you want. How to characterise that function?

    Wouldn’t that be easier to test?

    Another definition of ‘interesting’. Put a black hat on and devise test data manually. How long does it take to detect that weakness? Or perhaps introduce a weakness in the grammar and see how many calls to the function it takes to detect the ‘flaw’.

    regards DaveP
