How formal can you get?

[26 January 2010; updated a URI 28 Jan 2010]

In a recent comment on an earlier post here, David Carlisle wrote of proofs concerning properties of the XPath 1.0 data model

It’s not clear how formal any such proof could be, given that the XPath model and even more so, XML itself are defined so informally.

It’s true that the XPath spec defines the data model in prose and not in formulas. But on the whole it’s relatively formal and explicit prose, and I would expect it to be relatively easy to translate the prose into a formal notation.

As a test of that proposition, I recently improved a shining hour or four by translating section 5 of the XPath 1.0 spec into Alloy, the modeling tool developed by Daniel Jackson and others in the MIT Software Design Group. (I would describe Alloy briefly here, but I’ve written about Alloy often enough in this blog that I won’t describe it again now; regular readers of this blog will already know about it, and those who don’t can search the Web for information, or search this blog for what I’ve said about it before.)

The full Alloy model of XPath 1.0 is available from the Black Mesa Technologies web site. It will be of most interest, of course, for those familiar with Alloy, or wanting to learn more about it. But Alloy’s formalism is simple enough that anyone with a taste for formal methods will find it easy going, and the document paraphrases all the important things in English prose.

The net result, I think, is that there are (unsurprisingly) some places where the definition of the XPath 1.0 data model could or should be more explicit, and a few places where it seems a rule or two given explicitly is strictly speaking redundant, but by and large the definition is pretty clean and seems to mean what its creators meant it to mean. By and large, the definition of the data model is formulated without reliance on the XML spec (there are a few places where this separation could be cleaner), so that the informality of the XML spec (or what I prefer to think of as its programmatic promiscuity regarding models) does not in fact make it hard to formalize the XPath 1.0 data model.

In the current version, there are a couple of rough bits that I hope to sand down before I use this model in any other contexts. The ones I’m currently aware of are these:

  • The definition of the parent relation does not require that whenever parent(c,p) is true, either child(p,c) or attribute(p,c) is true. The XPath 1.0 spec doesn’t say this explicitly, but I think its authors probably felt that it would be pedantic to say something like that for a relation with a name like parent.
  • Similarly, the relations parent and ch are not guaranteed acyclic, though I think the original spec assumes that the names parent and child make clear that the relations should be acyclic. (But see the song “I’m my own Grandpa” and the Alloy models illustrating it, developed in Daniel Jackson’s Software Abstractions and shipped with Alloy in the samples directory.)
  • The predicate precedes intended to model document order is defined recursively; this is not legal in Alloy. So a conventional relation on nodes will need to be defined instead, with appropriate constraints. It is still not clear whether the rules given in the spec suffice to make the order total (at least in the absence of multiply occurring children); if they don’t, I’d like to propose an additional predicate which has that effect.