One way to define the XPath data model

[6 April 2010; addenda and copy editing 7-8 April 2010]

After discovering earlier this year that the definition of the XPath 1.0 data model falls short of the goal of guaranteeing the desired properties to all instances of the data model, I’ve been spending some time experimenting with alternative definitions, trying to see what must be specified a priori and what properties can be left to follow from others.

It’s no particular surprise that the data model can be defined in a variety of different ways. I’ve worked out three with a certain degree of precision. Here is one, which is not the usual way of defining things. For simplicity, it ignores attributes and namespace nodes; it’s easy enough to add them in once the foundations are a bit firmer.

Assume a non-empty finite set S and two binary relations R and Q on S, with the following properties [Some constraints are shown here as deleted: they were included in the first version of this list but later proved to be redundant; see below] :

  1. R is functional, acyclic, and injective (i.e. for any x and y, R(x) = R(y) implies x = y).
  2. There is exactly one member of S which is not in the domain of R (i.e. R(e) has no value), and exactly one which is not in the range of R (i.e. there is one element e such that for no element f do we have e = R(f)).
  3. Q is transitive and acyclic.
  4. The transitive reduction of Q is functional and injective.
  5. It will be observed that R essentially places the elements of S in a sequence without duplicates. For all elements e, f, g, h of S, if Q includes the pairs (e, f) and (g, h) and if g falls between e and f in the sequence defined by R (or, more formally, if the transitive closure of R contains the pairs (e, f), (e, g), and (g, f)), then h also falls between e and f in that sequence.
  6. The transitive closure of the inverse of R (i.e. R-1*) contains Q as a subset.
  7. The single element of S which is not in the domain of R is also neither in the domain nor the range of Q.

It turns out that if we have any relations R and Q defined on some set S, then we have an instance of the XPath 1.0 data model. The nodes in the model instance, the axes defining their interrelations, and so on can all be defined in terms of S, R, and Q.

For the moment, I’ll leave the details as an exercise for the reader. (I also realize, as I’m about to click “Publish”, that I have not actually checked to see whether the set of constraints given above is minimal. I started with a short list and added constraints until S, R, and Q sufficed to determine a unique data model instance, but I have not checked to see whether any of the later additions rendered any of the earlier additions unnecessary. So points for any reader who identifies redundant constraints in the list given above.)

[When I did check for minimality, it turned out that several of the constraints included in the list above are redundant. The fact that relation R is functional and injective, for example, follows from the others shown. Actually it follows from a subset of them. The deletions above show one way of reducing the number of a priori constraints: they all follow from the others and can be dropped. None of the remaining items follows from the others; if any of them are deleted, the constraints no longer suffice to ensure the properties required by XPath.]