Some difficulties in specifying markup semantics

C. M. Sperberg-McQueen
World Wide Web Consortium

Allen Renear
University of Illinois

Claus Huitfeldt
University of Bergen

David Dubin
University of Illinois

This document is a working draft.

If you are not actively collaborating with the authors on the project described here, then you probably stumbled across this by accident. Please do not quote publicly or point other people to this document. The URI at which this document now exists is very unlikely to remain permanently available.



This paper summarizes the state of play, as of our presentation in Tübingen last year, on some aspects of the Bechamel project's efforts to show how the meaning of markup can be specified; it is derived from the slides used for one of the two talks given in Tübingen. In its current state, it is the responsibility of the first author; it has not been reviewed by the other authors.

1. Background

The Bechamel project can be understood as an attempt to carry out the programme implicit in Wilhelm Ott's description, thirty years ago, of computing in the humanities:

Ihr [d.i. der EDV] Einsatz ist überall dort möglich, wo Daten irgendwelcher Art – also auch Texte – nach eindeutig formulierbaren und vollständig formalisierbaren Regeln verarbeitet werden müssen.
[Its (i.e. electronic data processing's) use is possible wherever data of any kind (including texts) must be processed according to unambiguously formulable and completely formalizable rules.]

Wilhelm Ott, Metrische Analysen zu Vergil Aeneis Buch VI (Tübingen: Niemeyer, 1973), p. V.
That is, we wish to formulate unambiguous, fully formalized rules for assigning meaning to the markup in documents.
Our working definition of meaning is exhibited in this quotation from a work on program semantics:
... we shall accept that the meaning of A is the set of sentences S true because of A. The set S may also be called the set of consequences of A. Calling sentences of S consequences of A underscores the fact that there is an underlying logic which allows one to deduce that a sentence is a consequence of A.
Wladyslaw M. Turski and Thomas S. E. Maibaum, The specification of computer programs (Wokingham: Addison-Wesley, 1987), p. 4.

2. An example

Let us examine an example. In 1775, the South Carolina political leader Henry Laurens wrote a letter to the royal governor of South Carolina, Lord William Campbell. The beginning of the letter might read thus in an online edition:
It was be For When we applied to Your Excellency for leave to adjourn it was because we foresaw that we were should continue wasting our own time ...
(Here “It was be”, “For”, and “were” are deletions, and “should continue” is an insertion, as the encoding below makes explicit.)
In XML or SGML form, this passage might plausibly be encoded thus:
<p><del>It was be</del> <del>For</del> When we 
applied to Your Excellency for leave to adjourn 
it was because we foresaw that we 
<del>were</del> <add>should continue</add> 
wasting our own time ... </p>
[Discussion ...]
A non-literary example may also be useful. This is the purchase order used as an example in Part 0 (the Primer) of the XML Schema 1.0 specification:
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <shipTo country="US">
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Mill Valley</city>
        <state>CA</state>
        <zip>90952</zip>
    </shipTo>
    <billTo country="US">
        <name>Robert Smith</name>
        <street>8 Oak Avenue</street>
        <city>Old Town</city>
        <state>PA</state>
        <zip>95819</zip>
    </billTo>
    <comment>Hurry, my lawn is going wild!</comment>
    <items>
        <item partNum="872-AA">
            <productName>Lawnmower</productName>
            <quantity>1</quantity>
            <USPrice>148.95</USPrice>
            <comment>Confirm this is electric</comment>
        </item>
        <item partNum="926-AA">
            <productName>Baby Monitor</productName>
            <quantity>1</quantity>
            <USPrice>39.98</USPrice>
            <shipDate>1999-05-21</shipDate>
        </item>
    </items>
</purchaseOrder>
Part of the document tree from the Laurens example [Is this relevant? ...]
[If this is relevant, it is to note that we expect that, in general, the semantics of colloquial XML markup vocabularies will exploit the tree structure.]
From this document, it is straightforward to imagine drawing some inferences based on the element types. For example, the del element indicates that something was deleted; the add element, that something was inserted. If we give these things arbitrary names, we can write:
deleted(n81). /* "It was be" */
deleted(n84). /* "For" */
inserted(n90). /* "should continue" */
Similarly, there is a relation that holds between purchase orders and their shipping addresses, or between purchase orders and their billing addresses. We can write it this way in a straightforward translation:
po_shippingaddress(p123,a45).
po_billingaddress(p123,a46).
These facts can be paraphrased as saying “The relation po_shippingaddress holds between p123 and a45” and “The relation po_billingaddress holds between p123 and a46”. Alternatively, we can use a meta-logical predicate to capture this information (this can make it easier to search for all relations holding for a particular object).
relation_applies(shipto,[p123,a45]).
relation_applies(billto,[p123,a46]).
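As a hedged illustration of that convenience, the following Prolog sketch collects every relation in which a given object participates; the predicate relations_of/2 and the extra line-item fact (with the invented identifier i877) are ours, not part of any tag set definition.
:- use_module(library(lists)).   % for member/2

relation_applies(shipto,  [p123, a45]).
relation_applies(billto,  [p123, a46]).
relation_applies(po_item, [p123, i877]).   /* hypothetical line item */

/* relations_of(+Object, -Pairs): every relation, with its argument
   list, in which Object occurs as a participant. */
relations_of(Object, Pairs) :-
    findall(Rel-Args,
            ( relation_applies(Rel, Args),
              member(Object, Args) ),
            Pairs).
The query relations_of(p123, Pairs) then returns all three relations at once, which is exactly the kind of search the meta-logical form makes convenient.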
From the occurrences of the del element in the Laurens transcription, we can infer a set of ground facts like the following:
deleted(n81).
deleted(n84).
deleted(n87).
deleted(n96).
deleted(n105).
deleted(n123).
deleted(n133).
deleted(n137).
deleted(n143).
deleted(n149).
...

The simple pattern observable here allows us, in turn, to think of the meaning of the del element type as being captured by a sentence skeleton (or sentence schema) of the following form:
deleted(___).
When we interpret the markup in a particular document, we fill in the blank appropriately. Just how that happens in the general case is a topic of continuing work, but for the del element it's clear that the blank is filled either with the identifier of the element (if we are willing to conflate the element with what it encodes) or (if we wish to be very careful to distinguish the representation from the thing represented) with the identifier of the thing encoded by the del element.
We can apply this principle to the other relations we have mentioned above, which gives us sentence skeletons like the following:
deleted(___).
inserted(___).
po_shippingaddress(___,___).
po_billingaddress(___,___).
po_item(___,___).
Note that some of these have more than one blank; part of our task is to specify how a description of the meaning of a vocabulary can specify how each blank is to be filled in.
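To make the blank-filling concrete for the simplest case, here is a minimal Prolog sketch; it assumes that the image sentences record each element's generic identifier in the form gi(Node, GI), which is one plausible representation rather than a commitment of the project.
/* image sentences (assumed form) */
gi(n81, del).
gi(n84, del).
gi(n90, add).

/* mapping rules: fill the single blank of deleted(___) and
   inserted(___) with the element's identifier */
deleted(N)  :- gi(N, del).
inserted(N) :- gi(N, add).
The query deleted(X) then enumerates n81 and n84, reproducing the ground facts listed above.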

3. The system we wish to build

The system we seek to build might be put together in two different ways. The following system overview illustrates the first way: From an XML document instance and a formal tag set definition (FTSD), we create several sets of Prolog sentences: a set of image sentences which represent the original XML document, a set of property rules which govern the propagation of properties through the document, and a set of mapping rules which specify how the information in the markup maps into the application domain. From these sets of sentences, others are derived by logical inference. The application sentences are, in some sense, what is most obviously meant by our saying that the meaning of markup is the set of inferences it licenses.
A second approach exploits the ability to do XPath evaluation conveniently in XSLT, and generates the application sentences directly from the XML instance by means of an XSLT stylesheet which was itself derived from the FTSD. For purposes of comparison between the two approaches, it is possible to make the XSLT stylesheets also generate the other sets of sentences shown in the first system overview.
The two system designs appear to differ primarily in where they put their emphasis. Using XSLT is a way to reduce implementation time for a system to produce application sentences; the XSLT-based approach seems to regard the application sentences and the FTSD as the foci of interest. Using Prolog inference rules to generate the application sentences appears to reflect a view that the propagation sentences, property rules, and mapping rules themselves are of significant interest, and that the entire path from XML (image sentences) to application sentences ought to be traversed by inference, not by any other mechanisms.
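As a small sketch of the first, inference-driven path (the predicate application_sentences/1 and the fixed pair of skeletons are ours, purely for illustration), the consequence set can be enumerated mechanically from image sentences and mapping rules like those sketched at the end of section 2:
:- use_module(library(lists)).   % for append/3

/* collect the application sentences derivable for the skeletons
   deleted(___) and inserted(___) */
application_sentences(Sentences) :-
    findall(deleted(N),  deleted(N),  Dels),
    findall(inserted(N), inserted(N), Adds),
    append(Dels, Adds, Sentences).
In this limited sense the program literally computes the set of inferences the markup licenses, at least for the skeletons it knows about.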

4. Some technical and non-technical problems

In our work so far, we have encountered a number of problems, many of them technical in nature. This section summarizes the solutions we think we have found to some of them.
  1. n-ary statements
  2. deixis
  3. inheritance
  4. distributed vs. non-distributed properties
  5. overriding, conflict, union
  6. milestones
  7. individuals[1] (what are we talking about?)
  8. references to the same individual
  9. certainty and responsibility

4.1. n-ary statements

This is one of the problems which defeats the straw-man proposal of 2000. Some inferences we want to draw from the markup require predicates with more than one argument (for elements) or more than two (for attributes). For example:
  • TEI.2 element:

    There exists a text [document, text witness, ...]; this is an electronic representation of that text.

  • bibl element:

    There exists a bibliographic item; this element contains a bibliographic description of that item.

  • head element:

    This element is the title of the immediately enclosing div / div0 / div1 / div2 ... element.

Our solution is straightforward: We allow skeleton sentences to have predicates which take two or more arguments.
text(X) & encoding(Y) & encodes(Y,X).
bibitem(X) & bibdesc(Y) & bibitem_desc(X,Y).
section_title(X) & title_of_section(X,Y).
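Instantiating the second of these skeletons for a single hypothetical bibl element (node n200, with b1 as an arbitrary name for the bibliographic item it describes) would produce ground facts such as:
bibitem(b1).
bibdesc(n200).
bibitem_desc(b1, n200).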

4.2. Deixis

The straw-man proposal of 2000 did not need any way to point to elements or attributes, because every element and every attribute had a translation with the same arity and the same arguments. Allowing arbitrary arity requires us to be able to say how to fill the additional argument slots. Our hypothesis is that in colloquial XML the additional arguments are most conveniently specified by expressions which specify a location; in many cases, the location is relative to the location of the markup whose skeleton sentence is being filled in.
Consider the title element:

This is a title of the bibliographic item described by the nearest containing bibl element.

We need a way to point to things like the nearest containing bibl element.
Solution: use XPath expressions.
title([[.]]) & 
(for all x) 
  (bibitem_desc(x, [[ ancestor::bibl ]]) 
    → bibitem_title(x, [[.]]) )
The double brackets are used to enclose deictic expressions in the examples here and below; the precise nature of their denotation is a problem yet to be elucidated.
There are alternate solutions:
  • caterpillar expressions [Brüggemann-Klein/Wood 2000]
  • CSS selectors
  • a subset of XPath axes, in Prolog notation
  • ...
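As a hedged illustration of the third option above (a subset of XPath axes, in Prolog notation), a fragment of the ancestor axis can be expressed directly over the image sentences. We assume the image-sentence forms parent(Child, Parent) and gi(Node, GI) (parent/2 is used in the same sense in section 4.5 below); nearest_ancestor_gi/3 is our own name.
/* nearest_ancestor_gi(+Node, +GI, -Anc): the nearest ancestor of
   Node whose generic identifier is GI; this is roughly what the
   deictic expression [[ ancestor::bibl ]] asks for when GI = bibl. */
nearest_ancestor_gi(Node, GI, Anc) :-
    parent(Node, Parent),
    (   gi(Parent, GI)
    ->  Anc = Parent
    ;   nearest_ancestor_gi(Parent, GI, Anc)
    ).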

4.3. Inheritance

One of the most obvious features of colloquial XML vocabularies is that they use inheritance to propagate properties from parent to child. Consider a simplified lang attribute: <foo lang="L"> implies:

For each descendant d, the language of d is L.

Or, more formally:

has_language( [[descendant-or-self::*]] , [[ attribute::lang ]] ).
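An executable approximation of this has_language relation, pulling the value up the tree rather than pushing it down, might look like the following sketch; attv/3 and parent/2 are the image-sentence forms assumed here (both are used in the same sense in section 4.5).
/* has_language(?Node, ?Lang): Node is in language Lang if Node
   itself, or any ancestor, carries lang="Lang".  Deliberately
   ignores overriding, which section 4.5 takes up. */
has_language(N, L) :- attv(N, lang, L).
has_language(N, L) :- parent(N, P), has_language(P, L).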

4.4. Distribution

Not all properties are inherited in the same way. Some, we say, are distributed, others non-distributed. For example, one of the following sentences applies, one does not:
If a chapter is in a language, then by default everything in that chapter is in that language.
* If a chapter is a chapter, then by default everything in that chapter is also a chapter.
(The asterisk beside the second sentence signals its falsehood.)
Solution: We can distinguish two approaches to handling the distribution of properties.
  1. For each element or attribute with a property P, formulate sentences appropriately:

    For each child c, c has property P

    or

    For each child c, c has property is-part-of-a-P.

  2. Allow second-level descriptions of predicates (a propagation sketch follows this list):
    distributed(lang).
    distributed(quoted).
    distributed(highlighted).
    non_distributed(p).
    non_distributed(div).
    
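A hedged sketch of how such second-level declarations might drive propagation: hasprop_direct/2, which holds the properties asserted directly by the markup, is our invention, and parent/2 is the image-sentence form used elsewhere in this paper.
/* a property propagates from parent to child only if its name
   has been declared distributed */
hasprop(N, Prop) :- hasprop_direct(N, Prop).
hasprop(N, Prop) :-
    parent(N, P),
    hasprop(P, Prop),
    functor(Prop, Name, _),
    distributed(Name).
With distributed(lang) declared, a fact like hasprop_direct(n5, lang(de)) (hypothetical identifiers) then yields hasprop(D, lang(de)) for every descendant D of n5, while a non-distributed property stays where it was asserted.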

4.5. Overriding (e.g. lang)

The account of inheritance given above is too simple: in many cases, the inherited value of a property can be overridden by a locally specified value. The lang attribute of TEI and HTML, and the xml:lang attribute of the XML specification, are well known examples of this design pattern.
There are several ways to think about this and express it:
  1. We can propagate down only to nodes which don't have their own lang attributes:
    lang="X" → for each c in {child::text()}, 
                 haslang(c,X)
             & for each c in {child::*[not(@lang)]}, 
                 haslang(c,X)
    In English: when an element E has the attribute-value specification lang="X" for some X, then:
    1. Each text-node child of E has language X.
    2. Each element C which is a child of E has language X, if C does not have an attribute-value specification for lang.
    Note that we write haslang(N,L) to signal that node N has language L.
    This formulation assigns the property of being in a particular language to both elements and text nodes (but not to attributes).
    Note that for the recursive step down to element children, we really want to cause the haslang property to propagate further, e.g. by re-executing this rule on the element children as if they had an explicit lang value. But this formulation doesn't actually manage to capture that requirement.
  2. An alternative view is that XML elements don't have a language — only text-nodes do. On this view, the element on which the lang attribute occurs has nothing direct to do with the language property: it's just the common parent of a bunch of text nodes which are in the same language. To get the right language property for every text node, there is a simple rule:
    (for all x) 
      (gi(x,'#pcdata') 
        → haslang(x,ancestor::*[@lang][1]/@lang))
    In English: for every text node (nodes with a pseudo-GI of #pcdata), the haslang relation links that text node to the language specified on the nearest enclosing element which has a value for lang.
    This view poses two challenges:
    • The information we need is carried by the lang attribute, but the sentences we generate are all related not to the attribute-value specifications for lang, nor to their parent elements, but to the text nodes in the document. This suggests that the rules associated with a particular construct (here: the lang attribute) may need to fire not when that construct is encountered, but at some other time. Instead of pushing the information down from the element with the lang attribute, this formulation imagines us pulling it down to each text node. It's not clear how best to organize the choice of push and pull in a working system.
    • No language information is associated with any attribute. This is implausible, but it's fortunately also easy to correct.
  3. Another formulation, which attempts to repair the problem in the first formulation:
    attv(N,lang,L) → haslang(N,L).
    For every element N with a lang value of L, infer haslang(N,L).
    Define haslang thus:
    haslang(N,L) →
      (for all y)
        (parent(y,N) & ¬(attv(y,lang,_)) 
          → haslang(y,L))
        & (for all y)
            (parent(y,N) & gi(y,'#pcdata') 
              → haslang(y,L))
    
    I.e. haslang(N,L) implies that haslang(E,L) applies to every element E which is a child of N, if E doesn't have an attribute value specification for lang, and also that haslang(T,L) applies to every text node child T of N.
  4. A fourth formulation, which resembles the second formulation above in using ‘pull’ rather than ‘push’, but which ascribes a language property to elements rather than to text nodes (if only to show that we can):
    (for all x : element) 
      (attv(x,lang,_) → language(x,x->@lang))
    (for all x : element) 
      (¬attv(x,lang,_) 
        → language(x,x->ancestor::*[@lang][1]/@lang)).
    
    Where x->y means ‘the value of the XPath expression y, interpreted with x as the current node’.
    For all elements X, if X has a lang value, then the language of X is whatever the lang attribute says. And: For all elements X, if X has no lang value, then the language of X is whatever the lang attribute says on the nearest ancestor which has a lang attribute.
    Analogous rules for text nodes and attributes would be necessary in practice:
    (for all x : text) 
      (language(x,x->ancestor::*[@lang][1]/@lang)).
    (for all x : attribute) 
      (language(x,x->ancestor::*[@lang][1]/@lang)).
    
  5. We can merge all the rules given in the preceding formulation into a single rule:
    (for all x : node) 
      ((element(x) | attribute(x) | textnode(x))
        → language(x,
          [[ x->ancestor-or-self::*[@lang][1]/@lang ]] ))
    
Solution: There are two approaches; it's not clear which to prefer. We can define lang top-down:
(for all  e) [element(e) →
  hasprop([[.]], 
    language([[ ancestor-or-self::*[@lang][1]/@lang ]])
  )]
(for all  n) [(att(n) | text(n)) 
  & hasprop([[ parent::* ]], language(L))
    → hasprop([[.]], language(L))]
In English: for every element e, e has a language property whose value is given by the lang attribute on e itself, or on the nearest ancestor element which has a lang value. For every attribute or text node n, n has a language property whose value is given by the lang attribute on the nearest ancestor element which has a lang value.
Or we can define it bottom up:
let L = [[ ancestor-or-self::*[@lang][1]/@lang ]]
in (hasprop([[.]], language(L))
  & hasprop([[ child::text() ]], language(L))
  & hasprop([[ attribute::* ]], language(L))
)
In English: [for every element E], let L be the lang value on E itself or on the nearest ancestor with such a value. Then E itself, each attribute of E, and each text-node child of E, has a language property with the value L.
Note that none of the formulations given here accounts explicitly for elements, text nodes, or attributes which:
  • have no ancestor with a lang value
  • do not contain natural-language text and therefore ought not, strictly speaking, to be associated with a natural language
Perhaps the meaning (for us, at least) of the lang attribute is that every element and attribute ought to be classified as having, or not having, a language property.
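One possible way to close the first of these gaps, offered purely as an illustration, is to give every node an explicit default; the value unspecified and the predicate language_of/2 are our inventions, and attv/3 and parent/2 are as above.
/* language_of(+Node, -Lang): the effective language of Node, or
   the atom 'unspecified' if neither Node nor any ancestor carries
   a lang value. */
language_of(N, L) :- attv(N, lang, L), !.
language_of(N, L) :- parent(N, P), !, language_of(P, L).
language_of(_, unspecified).
This at least makes the has-no-language case explicit, in the spirit of the classification suggested in the preceding paragraph.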

4.6. Conflict and union

Another problem related to overriding is the need to distinguish conflicting values from compatible values. A vocabulary with concepts like bold and italic must say how they relate: do they override each other or complement each other?
Solution: if they complement each other, it's because each specifies a different property. If they override each other, it's because both specify the same property. Formulate the rules accordingly. [N.B. This is going to make it very hard to provide a general interpretation of the TEI rend attribute.]
When b within i gives bold italic:
b: bold( [[descendant-or-self::*]] )
i: italic( [[descendant-or-self::*]] )
So
<b>aaa <i>bbb</i> ccc</b>
gives
bold(n1)
& bold(n2)
& bold(n3)
& italic(n2)
...
When b within i gives bold, i within b gives italic:
b: fontstyle( [[descendant-or-self::*]], bold )
i: fontstyle( [[descendant-or-self::*]], italic )
or
(for all  n) [textnode(n) →
  hasprop([[.]], 
    fontstyle([[ ancestor-or-self::*
      [name() = 'b' or name() = 'i'][1] ]])
  )]
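For contrast with the bold-italic facts given earlier, the intended outcome of the override reading is that each text node ends up with exactly one fontstyle value, the innermost governing element winning. Writing t1, t2, and t3 (hypothetical identifiers) for the text nodes ‘aaa’, ‘bbb’, and ‘ccc’, the same fragment would yield:
fontstyle(t1, bold).   /* "aaa": governed by the outer b */
fontstyle(t2, italic). /* "bbb": the inner i overrides the b */
fontstyle(t3, bold).   /* "ccc": governed by the outer b */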

4.7. Milestones

Milestones are unusual, and often frowned upon, because they require looking back along the frontier of the tree, treating the XML as a stream, rather than up, down, or sideways in the tree.
Solution: The meaning of a milestone, however, is easy to specify, as long as the language for deictic expressions allows it. In XPath, the preceding axis is what is needed. Here's a sample formulation:
(for all  n) ((textnode(n) | element(n))
  → hasprop([[.]], pagenum( 
    [[ preceding::pb[1]/@n ]])))
For each text node or element N, N has a pagenum property which has as its value the value of the n attribute on the most recent milestone element.
Note that for elements, the pagenum property is, in effect, the page number of the first page, not necessarily the only page number on which material from that element appears.
Complications will be necessary to deal with transcriptions which provide multiple sets of page numbers (such as McKinnon's Kierkegaard).
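A hedged executable counterpart of the formulation above: assume, besides gi/2 and attv/3, an image sentence docorder(Node, Pos) giving each node's position in document order (this representation is our assumption, not something the project has fixed). The page number of a node is then the n value of the pb with the greatest position still smaller than the node's own.
:- use_module(library(lists)).   % for max_member/2

pagenum_of(Node, PageNum) :-
    docorder(Node, Pos),
    findall(P-N,
            ( gi(PB, pb),
              attv(PB, n, N),
              docorder(PB, P),
              P < Pos ),
            Pairs),
    max_member(_-PageNum, Pairs).   /* the nearest preceding pb wins */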

4.8. Individuals

What do the identifiers n81 etc. denote?
  • elements
  • contents (as tokens)
  • contents (as types)
  • things outside the markup, in the application domain, [whose existence we believe in because of the markup]:
    • bibliographic items
    • manuscripts or printed editions used as the copy text for the electronic transcription
    • the work as an abstract object
    • the expression of the work (also abstract)
    • the manifestation of the expression
    • one particular physical (?) text carrier
(there exists x)
  (bibitem(x) 
  & bibdesc([[.]]) 
  & bibitem_desc(x,[[.]])).
Disagreements over what individuals to posit have thus far been a major stumbling block in our attempts to write skeleton sentences for specific tag sets or document instances. Does the root element in a MECS-WIT transcription license all of the following inferences?
  1. * There exists a text T.
  2. There exists a manuscript M.
  3. There exists a MECS transcription E.
  4. The catalog number of M is ...
  5. The Wittgenstein Archive ID number of E is ...
  6. * The language of T is German.
  7. * M is a manuscript of T.
  8. * E is a transcription of T, from M.
or should the starred items be replaced by alternative formulations?
  • * The language of M is German.
  • * E is a transcription of M.
Solution:
  1. Think hard.
  2. Be careful about data types.
I think there are two possibilities for reducing or eliminating the ambiguity of the sentences:
  • Define a set of functors which effectively cast their argument (to element, to token, to type, ...). Either establish a clear default and use the casts only where needed, or establish the rule that there is no default and use the casts for all cases.
  • Adopt rules like the following:
    • Strings (or Prolog atoms) denote types, not tokens.
    • Node identifiers for text nodes denote tokens, not types.
    • Node identifiers for elements and attributes denote those constructs in the input document, as tokens (object semantics, not value semantics). To get at their string values, use an XPath string() function or the like. To get at the objects they encode, use the Prolog relation models(O,L) or the like.
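A small illustration of this second set of rules, with every identifier invented for the example and the argument order of models/2 merely our guess: the node identifier n200 denotes a bibl element as a token in the document, the constant b1 denotes the bibliographic item in the application domain, and models/2 links the two.
gi(n200, bibl).            /* the element, as a token in the document */
models(n200, b1).          /* the element encodes the domain object b1 */
bib_item(b1).              /* the bibliographic item itself */
bibl_item_desc(b1, n200).  /* n200 is / contains a description of b1 */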

4.9. References to the same individual

In some cases, different markup constructs generate separate inferences about the same individuals. The meaning of a bibl element includes, for example, the inference “there exists a bibliographic item I, and this element is / contains a bibliographic description of I”. A title element inside a bibl element does not mean only ‘this is a title’ or even ‘this is the title of a bibliographic item’, but ‘this is the title of the bibliographic item whose existence is posited by the enclosing bibl element’.
Fortunately, Richard Montague was here before us, and we have a reasonably well established way of handling this case.
First, we translate the existential assertion of the bibl element thus:
(there exists x)
  (bib_item(x) 
  & bibl_item_desc(x,[[.]]) 
  & (for all  y)(bibl_item_desc(y,[[.]]) ↔ x = y))
In English: there is some bibliographic item X for which this element is a bibliographic description, and for any bibliographic item Y for which this element is a bibliographic description, X = Y. Or, more loosely: there is exactly one bibliographic item for which this element is / contains a bibliographic description.
If the existential assertion given above is in our context, we can successfully refer to it in the application sentences generated by the title element:
title([[.]]) 
& (for all  x)
  (bibl_item_desc(x, [[ ancestor::bibl ]]) 
    → bib_item_title(x, [[.]]))
In English: this is a title, and for every X such that the bibl element enclosing this title is a bibliographic description of X, this is the title of X. Or, more colloquially: this element is a title, and specifically it is the title of the bibliographic item described by the enclosing bibl.
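Prolog has no direct way to assert an existential, so a working system must approximate this translation somehow. One standard move, our choice for this sketch rather than a decision of the project, is to Skolemize: name the item posited for a bibl node B by the term item(B). The title rule can then reach it through the deictic machinery sketched in section 4.2 (nearest_ancestor_gi/3).
/* Skolemized translation of the bibl skeleton */
bib_item(item(B))          :- gi(B, bibl).
bibl_item_desc(item(B), B) :- gi(B, bibl).

/* translation of the title skeleton: T is the title of whatever
   item the nearest enclosing bibl describes */
bib_item_title(X, T) :-
    gi(T, title),
    nearest_ancestor_gi(T, bibl, B),
    bibl_item_desc(X, B).
The Skolem term gives each bibl its own item, which is enough for the reference to succeed; as the notes below observe, it is the existential claim, not uniqueness, that does the work.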
Note in passing that the inferences just described are not invariably safe: in an unstructured bibl for a journal article there are apt to be two title elements: one for the article and one for the journal. I think world knowledge (i.e. knowledge of common bibliographic practice) allows us to distinguish them, but I believe that it's world knowledge, not markup semantics. For a biblStruct, on the other hand, such inferences are more likely to be safe.
Note also that the uniqueness of the bibliographic item is not really essential to successful reference: what is crucial is the existential claim.

4.10. Certainty and responsibility

One of the first problems one bumps up against in devising skeleton sentences for the TEI, or MECS-WIT, is the extensive support in these markup schemes for recording uncertainties, alternatives, etc. It is easier to see how to write sentences for the unmarked cases, in which either certainty is assumed or at any rate nothing is said about certainty or uncertainty.
If we are sure, that is, we write:
abbreviation(n81,"S","South").
Solution 1: If we are not sure, one obvious way to handle it is to reify the proposition. Give it an identifier, so that we can refer to it from elsewhere.
abbreviation(s443, n81, "S", "South").
certainty(s443, uncertain).
responsibility(s443, renear).
This manoeuvre allows us to record the information, at the cost of requiring queries to be aware of the two distinct predicates abbreviation/3 and abbreviation/4. It might be better to require a unique identifier on every assertion, to eliminate the variation. But for now, I think we should prefer the variable number of arguments, with auxiliary predicates which understand both forms.
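An auxiliary predicate of the kind just mentioned might look like the following sketch; the name abbrev/5 and the default values certain and unattributed are our inventions.
:- dynamic abbreviation/3, abbreviation/4,
           certainty/2, responsibility/2.

/* abbrev(Node, Short, Expansion, Certainty, Responsibility)
   accepts both the reified and the plain form, supplying defaults
   when nothing has been recorded. */
abbrev(Node, Short, Exp, Cert, Resp) :-
    abbreviation(Id, Node, Short, Exp),
    ( certainty(Id, Cert)      -> true ; Cert = certain ),
    ( responsibility(Id, Resp) -> true ; Resp = unattributed ).
abbrev(Node, Short, Exp, certain, unattributed) :-
    abbreviation(Node, Short, Exp).
Queries can then use abbrev/5 throughout and ignore which of the two recorded forms was used.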
Solution 2: (This one wasn't in the slides from Tübingen, but only in the working paper “Desperately seeking sentences”.) Provide certainty information by using the proposition itself, as a structure, within the certainty and responsibility predicates:
abbreviation(n81, "S", "South").
certainty(abbreviation(n81, "S", "South"), uncertain).
responsibility(abbreviation(n81, "S", "South"), renear).
I think mapping to this from TEI's certainty and responsibility elements is slightly less convenient than mapping to solution 1.
Solution 3: hard-code a certainty and a responsibility argument into the predicates that need it.
abbreviation(n81, "S", "South", uncertain, renear).
In view of the fact that virtually anything can be uncertain, in TEI explicitly so, this seems unduly cumbersome.

Notes

[1] In the logical sense: what are the individual nameable objects in the universe of discourse?