Desperately seeking skeletons

Notes on work towards formulating skeleton sentences

C. M. Sperberg-McQueen

6 June 2002

rev. 12 June 2002

$Id: desperately-seeking-skeletons.html,v 1.13 2002/07/16 15:55:35 cmsmcq Exp $

This document is a working draft.

If you are not actively collaborating with me on the project described here, then you probably stumbled across this by accident. Please do not quote publicly or point other people to this document. The URI at which this document now exists is very unlikely to remain permanently available.

This document contains informal, incomplete, and half-digested notes on TEI Lite and other DTDs, made as I was trying to formulate some skeleton sentences for them.

1. Default interpretations

[6 June 2002]
Perhaps we can start by saying what happens in ‘normal’ (perhaps we should say “simple” instead) document representation. If I have a TEI document of the form
<TEI.2> ...
 <text id="T">
   <div1 id="a" type="chapter">...</div1>
   <div1 id="b" type="chapter">...</div1>
   <div1 id="c" type="chapter">...</div1>
then in the simple case I can say that text T consists of three chapters, represented in turn by elements a, b, and c. That is:
  • everything here is representing some constituent part of the text
  • everything in the text (well, clearly not -- perhaps I mean everything of the kind I'm representing, here “every chapter in the text”) is represented here
  • the constituents of the text appear in the same order as their representations here in the encoding
From the absence of a back element I am likely to infer the absence of any back matter in the text, or in the exemplar. From the absence of front matter, one might expect similarly to be able to infer the absence of any front matter in the text or exemplar; in practice, of course, front matter is so commonly part of modern texts that my first instinct is to look at the encodingDesc to check for information about sampling (in the samplingDecl or in prose) to see whether the front matter has been omitted on principle, or to look at the sourceDesc or profileDesc to see when the text was created or first published, since early texts more often omit front matter.
Formally, I suppose I ought to have the same reaction for front and back matter. Is the fact that I don't something that should be reflected in the inference rules? Or not? Am I committing inference abuse when I react differently?
Obvious exceptions: link, interp, and so on are not constituent parts of the text, but overt interpretations; their location is also immaterial. We may need a (second-level?) constituent property to distinguish elements which represent constituent parts of the text from others.
Exception: notes from an editor or transcriber may not be regarded as part of the text for all purposes, but their location is likely to be important. Perhaps a location_significant property?

2. Abbreviations

[6 June 2002]
The abbr element says that its content is an abbreviation. Specifically, it says that the token marked is an abbreviation; the type may or may not always be an abbreviation, but this one here is.
If it has an expan attribute, then we can infer that the meaning of this abbreviation, here, is given by the value of that attribute.
I am not quite sure how to convey the difference between token and type, or between general equivalences and what holds here. I can think of two translations: the first ignores the problems, while the second attempts to convey the distinction by the pseudo-function token(), which means simply ‘viewed as a token’, and the magic keyword here, which simply means “in this context” (without, I fear, attempting to specify what it is about the context which matters).
Possibly work out two systems, one with and one without here and the type/token distinction?
Some sentences:

if @expan then expansion(.,@expan).

if (@cert and @expan) then certainty(expansion(.,@expan),@cert).
else if (@expan) then certainty(expansion(.,@expan),unknown).

if (@resp and @expan) then responsible(expansion(.,@expan),@resp).
else if (@expan) then responsible(expansion(.,@expan),'author').

if (@type) then type(.,@type).
These rules can be encoded in XSLT this way:
 <xsl:template match="abbr">
  <xsl:variable name="this">
   <xsl:call-template name="make-elemid"/>
  </xsl:variable>
  <xsl:text>abbreviation(</xsl:text>
  <xsl:value-of select="$this"/>
  <xsl:text>).&#10;</xsl:text>

  <xsl:if test="@expan">
   <xsl:text>expansion(</xsl:text>
   <xsl:value-of select="$this"/>
   <xsl:text>,'</xsl:text>
   <xsl:value-of select="@expan"/>
   <xsl:text>').&#10;</xsl:text>

   <xsl:text>responsible(expansion(</xsl:text>
   <xsl:value-of select="$this"/>
   <xsl:text>,'</xsl:text>
   <xsl:value-of select="@expan"/>
   <xsl:text>'),'</xsl:text>
   <xsl:choose>
    <xsl:when test="@resp"><xsl:value-of select="@resp"/></xsl:when>
    <xsl:otherwise>author</xsl:otherwise>
   </xsl:choose>
   <xsl:text>').&#10;</xsl:text>

   <xsl:text>certainty(expansion(</xsl:text>
   <xsl:value-of select="$this"/>
   <xsl:text>,'</xsl:text>
   <xsl:value-of select="@expan"/>
   <xsl:text>'),</xsl:text>
   <xsl:choose>
    <xsl:when test="@cert"><xsl:value-of select="@cert"/></xsl:when>
    <xsl:otherwise>unknown</xsl:otherwise>
   </xsl:choose>
   <xsl:text>).&#10;</xsl:text>
  </xsl:if>

  <xsl:if test="@type">
   <xsl:text>type(</xsl:text>
   <xsl:value-of select="$this"/>
   <xsl:text>,</xsl:text>
   <xsl:value-of select="@type"/>
   <xsl:text>).&#10;</xsl:text>
  </xsl:if>
 </xsl:template>
If the header is silent, then the absence of any expan elements implies that abbreviations have not been expanded. From the absence of abbr elements, we may not infer the absence of abbreviations; they need not be tagged.
Taking the examples in the TSD of P3, namely
(1) The address is Southmoor <abbr expan="road">Rd</abbr>.
(2) The address is Southmoor <abbr expan="road" resp="LB">Rd</abbr>.
(3) <abbr type="acronym">RSVP</abbr>
(4) <abbr type="organization">SPQR</abbr>
(5) <abbr type="brevigraph">&per;</abbr>
(6) <abbr type="contraction">yr hbl servt</abbr>
(7) <abbr type="organization"
          expan="senatus populusque romanorum">SPQR</abbr>
We get the following sentences (sorted):
certainty(expansion(n25,'senatus populusque romanorum'),unknown).
expansion(n25,'senatus populusque romanorum').
responsible(expansion(n25,'senatus populusque romanorum'),'author').
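The expansion-related rules can also be mimicked procedurally; here is a hedged Python sketch (the function name is invented), which reproduces the sorted sentences for example (7):

```python
def abbr_sentences(elemid, attrs):
    """Generate sentences for one abbr element with an expan attribute:
    the expansion itself, plus responsibility and certainty qualifiers
    defaulting to 'author' and unknown."""
    sentences = []
    if "expan" in attrs:
        e = "expansion(%s,'%s')" % (elemid, attrs["expan"])
        sentences.append(e + ".")
        sentences.append("responsible(%s,'%s')." % (e, attrs.get("resp", "author")))
        sentences.append("certainty(%s,%s)." % (e, attrs.get("cert", "unknown")))
    return sorted(sentences)

for s in abbr_sentences("n25", {"type": "organization",
                                "expan": "senatus populusque romanorum"}):
    print(s)
```

Running this on example (2) instead would give responsible(...,'LB') in place of the 'author' default.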

3. MEP

[6-7 June 2002]
Another example worth looking at is the MEP transcription of Abraham Lincoln's letter of 10 April 1862 to Richard Yates and William Butler. (Full text is in MEP W10.)

3.1. Inferences

The following markup-based inferences seem plausible here (working more or less top to bottom in the document).
  • There is a historical document; for reference, let's call it D, to distinguish it from its electronic transcription, which we'll call T.
  • The sender of D was Abraham Lincoln. (This breaks down into: A sender of D was Abraham Lincoln. There was only one sender. Hence, the sender of D was A.L.)
  • An addressee of D was Richard Yates.
  • An addressee of D was William Butler.
  • D was prepared (completed) on 10 April 1862.
  • T was prepared by David R. Chesnutt and C. M. Sperberg-McQueen.
  • Work was done on T on 31 August 1999.
  • Work was done on T on 29 October 1999.
  • T (the electronic form) was prepared on the basis of Abraham Lincoln, Speeches and Writings 1859-1865 (New York: Library of America, 1989), p. 315.
  • D was prepared (completed) in Washington.
  • The addressee Richard Yates was a person.
  • The addressee William Butler was a person.
  • D contains an address and a paragraph, in that order. [This is not the same as “D consists of ...”]
  • The address in D is to “Hon. R. Yates & Wm. Butler / Springfield, Ills.”
  • The paragraph in D reads as given.
Missing from the inferences are the following. Perhaps we should regard some of them as licensed by the markup after all, or perhaps we should augment the markup to license them?
  • D is a letter.
Some items are problematic in practice.
  • Both Basler and Library of America give the letter without salutation. Does this mean that there is no salutation, or did they omit all salutations (e.g. to save space)?
  • Basler gives the signature as “A. Lincoln”; the Library of America gives no signature. Since LoA systematically omits all signatures, we cannot infer from its absence that there is no signature in the document. Since (if I remember correctly) Basler normalizes all signatures to the same form, we may not be able to infer anything about the actual form used (although I think I recall Basler or LoA saying it's almost always “A. Lincoln”).
  • Basler prints the date and place line opposite the address. The latter is not normalized -- is the former? Our transcription, based on LoA, puts the dateline outside of the document body, which means it may have been normalized by the editor. So whatever the facts of the matter are, the markup does not actually license the inference that the letter has a date and place line, or any inferences about how the name Washington is spelled in it, if the document has one.
  • Note that Basler's text (shown in the figure) and the Library of America text (given in the text) differ: “not as plenty” vs. “not so plenty”.

3.2. Prolog formulation of inferences

We can formalize the inferences made in something like the following Prolog form:
encodes(x1,d1).   /* or should this be encodes(n1,d1)? */
sender(d1,'Abraham Lincoln').
addressee(d1,'Richard Yates').
addressee(d1,'William Butler').
preparedBy(x1,'David R. Chesnutt and C. M. Sperberg-McQueen').
prepDate(x1,'31 August 1999').
prepDate(x1,'29 October 1999').
exemplar(x1,'Abraham Lincoln, Speeches and
    Writings 1859-1865 (New York:  Library of
    America, 1989), p. 315').
person('Richard Yates').
person('William Butler').
OK, now, to make this fit into our framework as enunciated so far, I think I need to distinguish different layers.

3.3. Layers

3.3.1. Image sentences

All the sentences with predicates node, gi, etc. go here. An extract:

content(n62,"\n  ").


content(n64,"\n    ").


content(n66,"Hon. R. Yates & Wm. Butler").

content(n67,"\n    ").


content(n69,"Springfield, Ills.").

content(n70,"\n  ").

content(n71,"\n  ").


content(n73,"I fully appreciate Gen. Pope's ... as blackberries.").



3.3.2. Propagation sentences

In this layer, we ascribe properties to the elements; we can follow the straw man proposal as far as it goes, and in some cases we may need to do something different.
For the bit of document represented by the image sentences just given, namely the document body, we might have:
has_property(n61,'document body').

has_property(n63,'address'). /* it's an address */
has_property(n63,'address of letter'). /* it's THE address */
has_property(n63,
  stringvalue("Hon. R. Yates & Wm. Butler / Springfield, Ills.")).
  /* is this a useful way to do string values? */

has_property(n65,'address line').
has_property(n65,stringvalue("Hon. R. Yates & Wm. Butler")).
has_property(n68,'address line').
has_property(n68,stringvalue("Springfield, Ills.")).

  stringvalue("I fully appreciate Gen. Pope's ... as blackberries.")).

3.3.3. Application sentences

In this layer, we move away from elements and element-like objects, and try to make the transition to the kinds of objects you would find in a formal theory of the application area. In the example, for simplicity I am going to start using suggestive names for individuals, rather than mechanically produced identifiers; this is to help me keep track of what I'm doing, not because I expect the running system to do this.
For the entire document, we might have:
person(yates,"Richard Yates").
person(butler,"William Butler").
person(lincoln,"Abraham Lincoln").
letter(al_to_yates_18620410,lincoln,[yates, butler]).
Surely there is more to this?

4. The minutes example

[10-12 June 2002]
CH and MSM started working (Monday 10 June) on a simple imaginary example of minutes from a meeting. Things seem to be starting to cohere.

4.1. The XML

4.1.1. The instance

Here is the sample:
<!DOCTYPE minutes SYSTEM "minutes.dtd" >
<minutes lang="en">
<revisionHistory>
<rev><date>2002-03-21</date><person>Paul</person>
<what>drafted minutes during meeting</what></rev>
<rev><date isoform="2002-03-31">31 Mar</date><person>Paul</person>
<what>put draft minutes on server</what></rev>
<rev><date isoform="2002-03-21" lang="la">idem</date><person>Paul</person>
<what>found and fixed three typos, supply
isoform attributes for some dates, put fresh copy on server</what></rev>
</revisionHistory>
<present><person>John</person><person>Paul</person></present>
<absent><person>Pius</person></absent>
<date isoform="2002-03-21">21 March 2002</date>
<body>
<p><person>John</person> expressed concern about <person>Pius</person>'s
attendance record.
<resolved>to increase the fine for missed meetings to a nickel.</resolved></p>
<p>We discussed the date and place of the next meeting.</p>
<p>Agreed: <b>Next meeting is <date>28 March</date>, usual place</b>.
<action><person>Paul</person>
<what>to order refreshments</what>
<when>26 March</when></action>
(N.B. The kitchen is, <i lang="la">mirabile dictu</i>,
demanding two days' notice for refreshments now.)</p>
</body>
</minutes>

4.1.2. The DTD

The DTD is fairly straightforward; it is intended to provide some examples of distributive and non-distributive processes and of context-dependent meanings.
<!ENTITY % phrases "(#PCDATA | date | i | b | person | action | resolved)*" >
<!-- %a.global; is an assumed reconstruction of the shared attribute
     set; lang is the only global attribute used in the sample -->
<!ENTITY % a.global "lang CDATA #IMPLIED" >

<!ELEMENT minutes (revisionHistory?, present, absent, date, body) >
<!ATTLIST minutes %a.global; >

<!ELEMENT revisionHistory (rev*) >
<!ATTLIST revisionHistory %a.global; >
<!ELEMENT rev (date, person, what) >
<!ATTLIST rev %a.global; >

<!ELEMENT present (person*) >
<!ATTLIST present %a.global; >
<!ELEMENT absent (person*) >
<!ATTLIST absent %a.global; >
<!ELEMENT person (#PCDATA) >
<!ATTLIST person %a.global; >
<!ELEMENT date (#PCDATA) >
<!ATTLIST date %a.global;
          isoform CDATA #IMPLIED >
<!ELEMENT body (p | action | resolved | div)* >
<!ATTLIST body %a.global; >
<!ELEMENT div (head, (p | action | resolved | div)*) >
<!ATTLIST div %a.global; >
<!ELEMENT action (person, what, when) >
<!ATTLIST action %a.global; >
<!ELEMENT resolved %phrases; >
<!ATTLIST resolved %a.global; >
<!ELEMENT what (#PCDATA) >
<!ATTLIST what %a.global; >
<!ELEMENT when (#PCDATA | date)* >
<!ATTLIST when %a.global; >
<!ELEMENT head (#PCDATA | date | i | b | person)* >
<!ATTLIST head %a.global; >
<!ELEMENT p %phrases; >
<!ATTLIST p %a.global; >
<!ELEMENT i %phrases; >
<!ATTLIST i %a.global; >
<!ELEMENT b %phrases; >
<!ATTLIST b %a.global; >

4.2. Image sentences

The image sentences for part of the document, using the current Prolog predicates, are these.
There are 99 nodes in the tree returned by an XML processor. For i = 1 to 99, we have node declarations which supply arbitrary names for those nodes, and similarly information on the traversal order of nodes:
The various element nodes have generic identifiers as shown; all nodes omitted here are text nodes and have the value cdata (or, as shown here, '#pcdata' to avoid name clashes with user names).
(N.B. the SWI parser omits some of these nodes: being an SGML parser at heart, it cannot bring itself to report text nodes for white space in element content. My XML parser is less merciful.)
Some elements have attributes:
The text nodes have content (I include a couple of examples for text nodes which just have newline or other whitespace as content, but otherwise I omit them silently).
content(n14,"drafted minutes during meeting").
content(n20,"31 Mar").
content(n26,"put draft minutes on server").
content(n38,"found and fixed three typos, supply\nisoform attributes for some dates, put fresh copy on server").
content(n58,"21 March 2002").
content(n65," expressed concern about ").
content(n68,"'s\nattendance record.\n").
content(n70,"to increase the fine for missed meetings to a nickel.").
content(n73,"We discussed the date and place of the next meeting.").
content(n76,"Agreed: ").
content(n78,"Next meeting is ").
content(n80,"28 March").
content(n81,", usual place").
content(n89,"to order refreshments").
content(n92,"26 March").
content(n94,"\n(N.B. The kitchen is, ").
content(n96,"mirabile dictu").
content(n97,",\ndemanding two days' notice for refreshments now.)").
The tree structure is encoded by the first_child, nsib, and parent predicates; only samples are given.

4.3. Property axioms and propagation sentences

The details of this level are unsettled; the discussion here assumes that propagation sentences refer to the same individuals as the image sentences, but attribute different properties to them. If we wished instead to postulate a different set of individuals, and generate identifiers for them, we could do so.

4.3.1. Propagation sentences

We begin by listing (some of? all of?) the sentences we wish to be able to infer from the image sentences. Later, we will specify the property axioms in such a way as to allow us to make the appropriate inferences.
The propagation sentences correspond closely to the kinds of inferences allowed by the straw man proposal of 1998-2000. In the simple case, the rules enunciated here are identical to those of the straw man proposal.

Kinds of things
First, we observe that the document is the minutes of a meeting; we associate the document with its root element:
We similarly note that various things are things of specified kinds. It's not entirely clear just how far to go here. For now, we follow the straw man proposal in assuming that every element needs such a predicate. That means we have object-class properties for the constituent parts of the minutes:
This is one way to do it; others may be entertained and may be better.[1]
We also want object-class information for the person, date, and item description elements in the revision history, together with the rev units which enclose them:
and in the lists of those present and absent:
Within the body, the elements and their meaning may be slightly different — more ‘textual’, perhaps. But let's assume we should postulate a property (class?) sentence for each element. Then we have paragraphs:
and action items and resolutions (together with their contents):
And floating around in the prose soup, we have some remaining stuff:
We note in passing that the words
Agreed: <b>Next meeting is <date>28 March</date>, usual place</b>.
ought perhaps to have been captured as a resolved element. A human might well treat it in much the same way, unless aware that the organization in question had decided to distinguish scheduling decisions from other resolutions. A machine working only from the markup has no choice but to distinguish the two. The inferences will only be as good as the markup: garbage in, garbage out, and no silk purses will be made out of sows' ears.

Part-of and sequence relations
Next, we note that certain things are constituent parts of their parents; by implication, others may not be. Similarly, sequence is important for some things, but carries no information for others.[2] For the constituent-part relation, we assume the predicate part_whole. In this document, the part/whole relation is a subset of the child/parent relation; there might be documents in which it's not (e.g. TEI documents with virtual elements). Note that not every child/parent pair shows up here: for our purposes, revisionHistory is applicable to the document element, but does not stand in a part/whole relation to it. The part_whole relation is thus a kind of filtering of the child/parent relation; such a filtering will be particularly important for documents like the sample purchase order, where a grouping element like items does not turn into an object at any later point.
part_whole(n05,n03). /* rev is part of revisionHistory */
part_whole(n07,n05). /* date is part of rev */
part_whole(n10,n05). /* person is part of rev */
part_whole(n13,n05). /* what is part of rev */

part_whole(n17,n03). /* rev is part of revisionHistory */
part_whole(n19,n17). /* date is part of rev */
part_whole(n22,n17). /* person is part of rev */
part_whole(n25,n17). /* what is part of rev */

part_whole(n29,n03). /* rev is part of revisionHistory */
part_whole(n31,n29). /* date is part of rev */
part_whole(n34,n29). /* person is part of rev */
part_whole(n37,n29). /* what is part of rev */

part_whole(n42,n01). /* present is part of minutes */
part_whole(n51,n01). /* absent is part of minutes */
part_whole(n57,n01). /* date is part of minutes */
part_whole(n60,n01). /* body is part of minutes */

part_whole(n44,n42). /* person is part of present */
part_whole(n47,n42). /* person is part of present */
part_whole(n53,n51). /* person is part of absent */

part_whole(n62,n60). /* p is part of body */
part_whole(n72,n60). /* p is part of body */
part_whole(n75,n60). /* p is part of body */

part_whole(n85,n83). /* person is part of action */
part_whole(n88,n83). /* what is part of action */
part_whole(n91,n83). /* when is part of action */
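The filtering of child/parent into part_whole can also be sketched procedurally; a hedged Python sketch, with an invented exclusion table standing in for the vocabulary description:

```python
# pairs (child-GI, parent-GI) which are applicable but not constituent;
# the table contents are illustrative
not_constituent = {("revisionHistory", "minutes")}

child_parent = [("rev", "revisionHistory"),
                ("revisionHistory", "minutes"),
                ("present", "minutes"),
                ("body", "minutes")]

# part_whole is the child/parent relation minus the excluded pairs
part_whole = [(c, p) for (c, p) in child_parent
              if (c, p) not in not_constituent]
print(part_whole)
```

As in the Prolog facts above, revisionHistory survives as a child of minutes but is filtered out of the part/whole relation.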
For the ordering predicate (which is a filtering of nsib, just as part_whole is a filtering of parent), we use the predicate succ, which is analogous to that used in definitions of integers as 0 and, for each integer i, succ(i).
/* successor predicate: direct adjacency */
succ(n05,n17). /* the rev elements are ordered */

succ(n62,n72). /* the p elements within the body are ordered */

succ(n63,n65). /* the contents of each p element are ordered */
/* paragraph n72 has only one child, no sequencing applies */
succ(n76, n77).
succ(n77, n82).
succ(n82, n83).
succ(n83, n94).
succ(n94, n95).
succ(n95, n97).
Note that the children of minutes (present, absent, date, and body) are not ordered by the succ predicate. The DTD prescribes the order, so the order of appearance does not convey any information not already conveyed by the minutes tag. If the DTD did allow multiple orders, we could specify that order is not significant by formulating a rule which ensures that no succ relation holds between siblings.
Using succ, an overall sequence comparison can be built, something like this:
precedes(X,Y) :- succ(X,Y).
precedes(X,Y) :- succ(X,Z), precedes(Z,Y).
precedes(X,Y) :- parent(X,Z), precedes(Z,Y).
precedes(X,Y) :- parent(Y,Z), precedes(X,Z).
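As a cross-check, the intent of these four clauses can be computed procedurally; a hedged Python sketch over a few invented facts, reading succ(X,Y) as “X immediately precedes Y” and parent as parent(Child,Parent):

```python
succ = {("a", "b"), ("b", "c")}        # three siblings in order
parent = {("a1", "a"), ("c1", "c")}    # a1 under a, c1 under c

def precedes(x, y, seen=None):
    """Transcription of the four precedes/2 clauses; a seen-set
    guards against revisiting the same subgoal."""
    seen = (seen or set()) | {(x, y)}
    if (x, y) in succ:                 # precedes(X,Y) :- succ(X,Y).
        return True
    for (a, b) in succ:                # succ(X,Z), precedes(Z,Y)
        if a == x and (b, y) not in seen and precedes(b, y, seen):
            return True
    for (c, p) in parent:
        # parent(X,Z), precedes(Z,Y): X inherits position from its parent
        if c == x and (p, y) not in seen and precedes(p, y, seen):
            return True
        # parent(Y,Z), precedes(X,Z)
        if c == y and (x, p) not in seen and precedes(x, p, seen):
            return True
    return False

print(precedes("a1", "c1"))
```

On these facts a1 precedes c1 (their parents a and c are ordered siblings), and the relation is properly asymmetric.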
(I suspect this may allocate sequential position to some things I'd rather not see sequenced, but I'll let it go for now.)

Distributive properties
[This section 25 June 2002]
Finally, we note that the markup makes explicit the language of the text, using a lang attribute with an inherited value, similar to that used by TEI and HTML and similar to the xml:lang attribute defined by XML 1.0. We will use the lang(Node,Langcode) predicate to express the language of a given node.
We make the following assumptions about the lang relation:
  • When N is a text node, lang(N, L) means “Node N is in the language whose code is L.”[3]
  • When N is an element node, lang(N,L) is true for at most one language code L and means that every text node underneath it is in language L. Alternate assumptions:
    • No text node underneath it has any language other than L (so we don't lose out on TEI.2 having language L just because we decided that dates have no language value).
    • No text node underneath it has any language other than L, and at least one has language L.
    • At least one text node beneath it has language L, and lang(N,L) is true for each language L which appears in the content of N.
    • As above, but elements with more than one language within them use the code mul — so lang(N,"mul") does not mean that all text nodes beneath N have lang(C,"mul").
Open questions include
  • accounting for the language of attribute values; perhaps a syntax like lang(att(Elem,Attname),L) would suffice
  • distinguishing attributes (or elements!) for which assigning a language is nonsensical; attributes of type ID or IDREF are an obvious example — in fact attributes of any type other than xsd:string, but also some of type xsd:string (part numbers!); is it harmless to allow assertions to be made about their language? Note that some specifiers of markup semantics will want to exclude proper nouns, dates, numbers, etc. from being assigned a language; others may not care. This means we have to show how to block propagation into a subtree, for those who want to block it.
Let's start from the bottom. For natural-language material in the document, the text nodes need to be identified as being in a particular language. Most of the document is, of course, in English:
lang(n14,"en"). /* "drafted minutes during meeting" */
lang(n20,"en"). /* "31 Mar" */
lang(n26,"en"). /* "put draft minutes on server" */
lang(n38,"en"). /* "found and fixed three typos, ..." */
lang(n58,"en"). /* "21 March 2002" */
lang(n65,"en"). /* " expressed concern about " */
lang(n68,"en"). /* "'s\nattendance record.\n" */
lang(n70,"en"). /* "to increase the fine for missed meetings to a nickel." */
lang(n73,"en"). /* "We discussed the date and place of the next meeting." */
lang(n76,"en"). /* "Agreed: " */
lang(n78,"en"). /* "Next meeting is " */
lang(n80,"en"). /* "28 March" */
lang(n81,"en"). /* ", usual place" */
lang(n89,"en"). /* "to order refreshments" */
lang(n92,"en"). /* "26 March" */
lang(n94,"en"). /* "\n(N.B. The kitchen is, " */
lang(n97,"en"). /* ",\ndemanding two days' notice for refreshments now.)" */
Note that “31 Mar”, at least, could also be encountered without surprise in some other languages. The claim that it's in English is a claim about the occurrence of “31 Mar” here, not other possible occurrences. That is, the lang predicate applies to the tokens (or to the specific text node), not to the string.
Latin words occur twice:
lang(n32,"la"). /* "idem" */
lang(n96,"la"). /* "mirabile dictu" */
As noted above, opinions may differ about whether we should or may say that something like “2002-03-21” is in a particular language. Proper nouns are often said to be not in a particular language,[4] or perhaps to be in all: “Paul”, “Pius”, and “John” could appear in any language context. In practice, I believe most encoding projects would have no hesitation in saying proper names can be assigned to the language of their context without causing problems. Doing it that way, however, won't provide any useful illustration of anything not already provided by other elements here. So let us assume that we wish to mark proper nouns and dates as language-neutral, for which we will use the ISO 639-2 code und (‘undetermined’).
lang(n08,"und").  /* "2002-03-21" */
lang(n11,"und").  /* "Paul" */
lang(n23,"und").  /* "Paul" */
lang(n35,"und").  /* "Paul" */
lang(n45,"und").  /* "John" */
lang(n48,"und").  /* "Paul" */
lang(n54,"und").  /* "Pius" */
lang(n64,"und").  /* "John" */
lang(n67,"und").  /* "Pius" */
lang(n86,"und").  /* "Paul" */
The challenge for us is finding a way to prevent the language property from propagating into the elements dominating these text nodes.
Text nodes which contain only white space seem to present an interesting metaphysical challenge: is it meaningful to say that a newline followed by two blanks is in English? Is it harmful? We can say that the lang predicate means exactly the same thing for such a node as it means for any other node, or that it is meaningless when applied to such a node but does no harm, or that it is meaningful and must be avoided. This affects some 37 nodes (36 with just a newline, one with a full stop and a newline).
If we wish to assert the lang property for elements, according to the assumptions stated above, then we need to assert that the following elements are English, or Latin, or undetermined, respectively.
lang(n13,"en").  /* what */
lang(n19,"en").  /* date */
lang(n25,"en").  /* what */
lang(n37,"en").  /* what */
lang(n57,"en").  /* date */
lang(n69,"en").  /* resolved */
lang(n72,"en").  /* p */
lang(n77,"en").  /* b */
lang(n79,"en").  /* date */
lang(n88,"en").  /* what */
lang(n91,"en").  /* when */

lang(n31,"la").  /* date */
lang(n95,"la").  /* i */

lang(n07,"und").  /* date */
lang(n10,"und").  /* person */
lang(n22,"und").  /* person */
lang(n34,"und").  /* person */
lang(n42,"und").  /* present */
lang(n44,"und").  /* person */
lang(n47,"und").  /* person */
lang(n51,"und").  /* absent */
lang(n53,"und").  /* person */
lang(n63,"und").  /* person */
lang(n66,"und").  /* person */
lang(n85,"und").  /* person */
If we follow the rules outlined above, the other elements won't have any value for the lang property; if we tag them lang(N,"mul"), then we would have the following:
lang(n01,"mul").  /* minutes */
lang(n03,"mul").  /* revisionHistory */
lang(n05,"mul").  /* rev */
lang(n17,"mul").  /* rev */
lang(n29,"mul").  /* rev */
lang(n60,"mul").  /* body */
lang(n62,"mul").  /* p */
lang(n75,"mul").  /* p */
lang(n83,"mul").  /* action */
And if we specified that when undetermined language and a specified language mix, then the parent should be tagged with the specified language, some elements will change their tagging:
lang(n05,"en").  /* rev */
lang(n17,"en").  /* rev */
lang(n62,"en").  /* p */
lang(n83,"en").  /* action */
The rules to follow may be expressed this way; there are probably more compact and clearer ways to put this:
  1. Assign lang values to each text node.
    1. For each text node T, find the lang attribute value on the nearest ancestor with such a value; call this L.
    2. For each text node, test to see whether it should have the lang property or not. Depending on the vocabulary being described, this may be
      • test to see whether any ancestor is on the ‘non-lingual’ list, or
      • test to see whether any ancestor is on the ‘lingual’ list, or
      • test to see whether the nearest ancestor with a ‘lingual’ property has the value true or false
      Our example works fine with the first rule, but the third would also do fine for us; the second would not.
    3. If T should have the lang value, assert lang(T,L).
    4. Otherwise, assert lang(T,"und").
  2. Assign lang values to each element: Starting at the root, perform a depth-first traversal of the tree. Calculate the lang value of each node N after calculating those of all children. Select from the following cases:
    • If no children have a lang value, assert nothing and continue.
    • If at least one child has a lang value, and all children with a lang value have the same value L, then assert lang(N,L).
    • If at least one child has a lang value, but more than one value is used, then assert nothing and continue.
If we wish to use the code mul for polyglot elements, then we can change the last rule to
  • If at least one child has a lang value, but more than one value is used, then assert lang(N,"mul").
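The two passes just described can be sketched in Python; this is a hedged sketch on an invented miniature tree (element names and the non-lingual list are illustrative), using the mul variant of the last rule:

```python
non_lingual = {"person", "date"}       # the 'non-lingual' list

def assign(node, inherited, out):
    """Pass a lang value down from the nearest lang attribute, giving
    non-lingual subtrees "und"; then compute each element's value
    bottom-up from its children: consensus, or "mul" when mixed."""
    if isinstance(node, str):          # text node: inherit
        out[node] = inherited
        return inherited
    name, attrs, children = node
    lang = attrs.get("lang", inherited)
    if name in non_lingual:
        lang = "und"
    child_langs = {assign(c, lang, out) for c in children}
    child_langs.discard(None)          # children with no lang value
    if len(child_langs) == 1:
        out[name] = child_langs.pop()
    elif len(child_langs) > 1:
        out[name] = "mul"
    return out.get(name)

tree = ("p", {"lang": "en"},
        ["We met on ", ("date", {}, ["2002-03-21"]), ", ",
         ("i", {"lang": "la"}, ["mirabile dictu"]), "."])
out = {}
assign(tree, "en", out)
print(out["date"], out["i"], out["p"])
```

The paragraph mixes English, Latin, and an undetermined date, so it comes out as mul; under the no-mul variant it would simply get no value.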

4.3.2. Property axioms (take 1)

To know how to formulate the property axioms, we need to make a choice between two ways of proceeding.
  • Generate the propagation sentences explicitly, and assert them as facts in the Prolog database. Let's call this choice active assertion.
  • Provide rules such that if one of the propagation sentences is formulated as a Prolog goal, the goal will succeed with an appropriate unification. Let's call this choice verification.
In the first case, if we cycle through all the property axioms and assert all the appropriate sentences, then a dump of the Prolog database will include sentences like those given in the preceding section. In the second case, the dump will not have those sentences (unless they get there some other way), but the sentences will nonetheless be known to be true. If we ask Prolog the question "is this true?", we will get the answer "yes".
?- part_whole(n42,n01).

Yes
We can also ask what items it's true for:
49 ?- part_whole(X,Y).

X = n05
Y = n03 ;

X = n07
Y = n05

Yes
(For the Prolog novices among us, I should observe that the lines beginning with X and Y were typed by the system, but the semicolons ending some of them were typed by the user. The semicolon is used in Prolog as an 'or' connector, and responding to a solution of a goal with ';' is a way of asking to see other solutions. After the last one, the system replies 'Yes' because it succeeded in finding a solution the last time I asked; if I had continued to press ';', eventually there would be no further solutions and the final answer would be No.)
This section discusses the verification approach, because it's simpler to do in Prolog. The next section discusses the active-assertion approach.
We can implement the verification approach simply. For each kind of propagation sentence — i.e., for each predicate — we write a rule that says under what circumstances the predicate is true.
Whenever we have a minutes element, we have a set of minutes for a meeting.
minutes(X) :- node(X), gi(X,minutes).
In the current state of play, the node(X) is redundant because nothing that is not a node appears as the first argument of a gi/2 predicate.
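The contrast between the two approaches can be sketched in Python (a hedged sketch; node names are taken from the example, the rest is invented):

```python
gi = {"n01": "minutes", "n42": "present", "n51": "absent"}

def minutes(x):
    """Verification: the property is checked only when asked for,
    like posing the goal minutes(X) to Prolog."""
    return gi.get(x) == "minutes"

# Active assertion: the derived facts are materialized up front,
# playing the role of assert/1 adding them to the Prolog database.
asserted = {("minutes", x) for x in gi if gi[x] == "minutes"}

print(minutes("n01"), ("minutes", "n01") in asserted)
```

Both answer the same question; they differ only in whether the answer is computed on demand or stored in advance.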
Whenever we have a present element, we have the list of those present at the meeting. And similarly for absent elements.
present_list(X) :- node(X), gi(X,present).
absent_list(X) :- node(X), gi(X,absent).
We will want, eventually, to stipulate that a particular list of people present belongs to a particular set of minutes, but for now CH and MSM are guessing that this belongs in the application-sentence layer, not here.
For the date element, we can specify at least three meanings. As the child of minutes, it denotes the date of the meeting. As the child of action, it denotes the due date for the action. Everywhere, it marks its content as a date. We could decide, I think, to place the rule for differentiating these meanings into the application sentence layer (that is to say, include it in the mapping axioms), on the grounds that it's really relevant to meetings and actions and other application objects. In that case, we have a very simple rule here:
date(X) :- node(X), gi(X,date).
standard_value_of_date(X,Y) :- date(X), attv(X,isoform,Y).
Alternatively, we could decide that we wish to distinguish the usages at the property level, possibly on the grounds that the differentiae relate to document structure. In which case we...
[Example of distinguishing the uses of date in the propagation layer to be worked out.]
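If we did want to distinguish them here, a minimal sketch might run as follows, assuming the parent/2 predicate used elsewhere in these notes (parent(C,P) holding when P is the parent node of C):

```prolog
/* Sketch only: differentiate the meanings of date by the generic
   identifier of the parent element. */
meeting_date(X) :- date(X), parent(X,P), gi(P,minutes).
due_date(X)     :- date(X), parent(X,P), gi(P,action).
```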
The lang property will require a more elaborate rule than those above; we will make use of the nearest_anc predicate defined by David. For text nodes:
/* nearest_lang(N,L): L is the lang value in force at N, i.e. the value
   of the lang attribute on the nearest ancestor-or-self of N that
   carries one.  The completion of the first clause below depends on
   David's nearest_anc predicate; the signature shown is a guess. */
nearest_lang(N,L) :- node(N), nearest_anc(N,lang,A), attv(A,lang,L).
lingual(N) :- node(N),
    not((ancestor(N,A), gi(A,G), non_lingual(G))).
lang(N,L) :- lingual(N), nearest_lang(N,L).
lang(N,"und") :- not(lingual(N)).
For elements (perhaps the complexity here is such that we ought not to assign the property to elements? but then what do we do when we encounter a vocabulary whose designer says explicitly that elements do have such and such a bottom-up property?), we might have:
/* We define child_language(N,L) as a relation holding between
   a parent node and the languages of its child nodes.
   Then setof(L,child_language(N,L),List) returns a List
   of all the (distinct) languages of all the children. */
child_language(N,L) :- parent(C,N), lang(C,L).

/* Note: setof/3 fails outright when its goal has no solutions, so the
   childless case must be tested with findall/3 instead; and "mul"
   requires at least two distinct languages, hence the pattern [_,_|_]
   rather than [H|T], which would also match a one-element list. */
lang(N,"qqq") :- findall(L,child_language(N,L),[]).
lang(N,L) :- setof(L,child_language(N,L),[L]).
lang(N,"mul") :- setof(L,child_language(N,L),[_,_|_]).
I have added an explicit code (qqq) for material without a meaningful lang property value; in the description above, no assertions were made about these nodes, but it seems easier in the Prolog to assert something for every node. The code qqq is from a range “Reserved for local use”.

4.3.3. Property axioms (take 2)

If we want to assert the propagation sentences actively, we need to construct the sentences (for this to work, there should be a finite number of them) and then use the Prolog assert predicate to add them to the database.
[to be supplied]
[On one view, if we take this approach the property axioms need to be formulated in a very different way, with assert clauses and so on. Actually, though, for any given rule-defined predicate, we can generate the set of facts which represent the application of the rule to a particular document, and assert them. In SWI Prolog, doing so will require that the predicate be declared as dynamic. The process can go something like this:
16 ?- dynamic(part_whole/2).

17 ?- findall(part_whole(P,W), part_whole(P,W), Bag), asserta_each(Bag).

P = _G555
W = _G556
Bag = [part_whole(n05, n03), part_whole(n07, n05), part_whole(n10, n05), part_w\
hole(n13, n05), part_whole(n17, n03), part_whole(n19, n17), part_whole(n22, n17\
), part_whole(n25, n17), part_whole(..., ...)|...]

18 ?-
where asserta_each is defined this way:
asserta_each([]).
asserta_each([H|T]) :- asserta(H), asserta_each(T).
Analogous clauses for assert, assertz, record, recorda, recordz, etc. could be used.]
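In SWI-Prolog the list-building step can also be skipped, using the standard forall/2 control predicate to assert each solution directly (a sketch; the name materialize is mine, not David's):

```prolog
/* materialize(+Template): for every solution of Template, assert the
   instantiated Template as a fact.  The predicate being materialized
   must have been declared dynamic, as above. */
materialize(Template) :- forall(Template, assertz(Template)).

/* e.g.  ?- materialize(part_whole(_,_)). */
```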

4.4. Mapping axioms and application sentences

For most purposes, if we associate the meaning of markup with the set of sentences true because of the markup in the document, the sentences of most interest are those about objects at the application level: the application sentences.

4.4.1. Application sentences in English

Tentative application sentences:
  • There was a meeting on 21 March 2002.
  • The minutes of the meeting are in the document rooted at n01.
  • John and Paul attended.
  • Pius was absent.
  • John is a person. (And Paul. And Pius.)
  • The meeting produced one resolution, namely to increase the fine for absenteeism by a nickel.
  • The meeting produced one action, namely Paul to order refreshments for the next meeting.

4.4.2. Application sentences in naive Prolog

In one possible Prolog form (not in exactly the same order):
resolution(r33,"to increase the fine for missed meetings to a nickel.").
action(a347,p2,informal_date("26 March"),"to order refreshments").
The biggest challenge to us here is posed by the need to know, somehow, the identifier used to name the meeting. There are several possible approaches:
  • We can use a structured identifier like meeting(n1), which is roughly equivalent to saying “the meeting whose minutes are recorded by n1” in natural language. Rules like meeting_minutes(meeting(n1),n1) will look rather like tautologies. That may be OK: they are tautologies.
  • We can call gensym as part of the process of asserting or of verifying the application sentences, and record the result somehow so we don't end up generating multiple identifiers for the same meeting (that is, multiple identifiers for the meeting based on the existence of the same set of minutes). David's construct predicate takes this approach.
For now, I'll follow David in using construct.
The easiest way to use construct is almost certainly to use the predicates provided by David for writing application sentences, so let's rewrite the application sentences above using that syntax: an objects-and-properties notation that will feel familiar to anyone who has spent time getting their head around the XML Information Set or RDF.

4.4.3. Application sentences using object-oriented predicates

After writing the section above, I looked at David Dubin's work to see how he had solved the problem of identifier generation, and discovered that he has worked out a small set of predicates which allow a convenient and consistent formulation of application-level information, involving objects, classes, properties, and relations. This section formulates the application sentences for the meeting minutes example using those predicates.
First, let us declare the relevant classes:
class(document). /* ? minutes will be of type document */
Now, we can say that the document at node n1 is minutes, and that (from it we know that) there was a meeting. The identifiers are not necessarily what we'll get with Prolog.
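The declarations might run something like this (the instance-declaration predicate object/2 is my guess at David's notation; o1 and o2 are arbitrary identifiers):

```prolog
class(meeting).
object(o1,document).  /* the minutes document rooted at node n1 */
object(o2,meeting).   /* the meeting of which o1 is the record  */
```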

We don't say (here) that it was from the existence of the minutes (o1) that we inferred the existence of the meeting (o2), but that's OK. That's the role of the mapping axioms.[5]
Now let us observe that the meeting took place on 21 March. The simplest way to do this might be to say something like this:
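Perhaps, guessing at the opv/3 object-property-value notation used further below:

```prolog
opv(o2,date,"21 March 2002").
```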
This, however, won't do justice to dates, which have (in the sample DTD) both a string form (the content of the date element) and a normalized ISO-format form, the latter being optional. We might describe dates this way: [6]
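For instance, echoing the property_of/3 declarations used for resolutions and actions below (and footnote 6's property_of(date,isoform,...) example), a guess:

```prolog
class(date).
property_of(date, localform, string).  /* string content of the element */
property_of(date, isoform,   string).  /* optional normalized ISO form  */
```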
Thus armed, we can make our application sentences better:
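Perhaps as follows, with o3 an arbitrary identifier for the hypothetical date object:

```prolog
object(o3,date).
opv(o3,localform,"21 March 2002").
opv(o3,isoform,"2002-03-21").
opv(o2,date,o3).   /* the meeting's date is the date object o3 */
```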
In reality, someone will observe that “2002-03-21” is itself in ISO form, and so we will also have the inference
but I don't think that's markup-based inference. It's based on knowledge of real-world date formats. (Unless of course the DTD had the rule “The isoform attribute may be omitted if and only if the content is already in ISO form.” But it's doubtful that Paul, who seems to have done the DTD, did it that way.)
The minutes of the meeting are the ones at node n1.
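In relation form this might be put as follows (rel/2, pairing a declared relation with a tuple of objects, is my guess at the instance-level form):

```prolog
relation(meeting_minutes,[meeting, document]).
rel(meeting_minutes,[o2,o1]).
```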
John and Paul attended; Pius was absent. They are all persons. Since the number of people who attend or are absent can in principle be arbitrarily large, we express attendance and non-attendance not as properties but as relations.




relation(person_attends_meeting,[person, meeting]).
relation(person_misses_meeting,[person, meeting]).
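The instance-level assertions might then be (again guessing rel/2 for relation instances; o4 through o6 are arbitrary identifiers):

```prolog
class(person).
object(o4,person).  /* John */
object(o5,person).  /* Paul */
object(o6,person).  /* Pius */
rel(person_attends_meeting,[o4,o2]).
rel(person_attends_meeting,[o5,o2]).
rel(person_misses_meeting,[o6,o2]).
```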
And now finally to the resolution and the action item. First, we define their classes; all we are going to remember about them is the meeting they occurred in and their content.
property_of(resolution, meeting, object(meeting)).
property_of(resolution, text, string).

property_of(action, meeting, object(meeting)).
property_of(action, who, string).
property_of(action, when, string).
property_of(action, what, string).
The content model for the when element is (#PCDATA | date)*, which is (we can hypothesize) intended to accommodate instantiations like <when>Whenever it's convenient, but no later than <date>next Thursday</date></when>. The property defined above is thus a bit too simple: the when property is not a string, but a string which may contain date objects. A full working out of an object class informal-date might be rewarding, but I'll pass for now, and simply say that in addition to the properties defined above, actions can have a relation to date objects; the DTD ensures that date elements can occur only in the when element.
The meeting, meanwhile, should have an inverse relation (not sure this is important in Prolog in practice, but it is needed in some object systems):
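Perhaps something like the following (the relation names are mine):

```prolog
relation(meeting_has_resolution,[meeting, resolution]).
relation(meeting_has_action,[meeting, action]).
```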
Thus armed, we can specify the resolution and action item agreed at this meeting:
opv(o7,text,"to increase the fine for missed meetings to a nickel.").

opv(o8,when,"26 March").
opv(o8,what,"to order refreshments").

opv(o9,localform,"26 March").

4.4.4. Generating the application sentences

[Discussion of mapping axioms and how to generate the application sentences to be supplied here.]
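One possible shape for such a generation step, using SWI-Prolog's gensym/2 (mentioned above as one way to mint identifiers); the predicate names here are mine, not David's:

```prolog
:- dynamic(object/2), dynamic(meeting_minutes/2).

/* For each minutes element, mint a fresh object identifier, declare a
   meeting object, and record which node carries its minutes. */
generate_meetings :-
    forall(minutes(N),
           ( gensym(o, Id),
             assertz(object(Id, meeting)),
             assertz(meeting_minutes(Id, N)) )).
```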

4.5. World knowledge and further inferences

[To be worked out.]
Examples of further inferences we make based on world knowledge:
  • Since there was a meeting organized enough to have written minutes, the chances are that this is a meeting of some standing organization. Counter-indices would be labels like ad hoc, open public meeting, etc., or absence of a list of people absent. (For purposes of reference, let us refer to this putative organization as O, and to the hypothesis of O's existence as H1.)
  • The membership of O on the date of the meeting was (probably) John, Paul, and Pius.
  • Meetings take place on a given date, and (for conventional meetings, not teleconferences) at a particular location. (Let's call the location L and the date D.)
  • Since they attended the meeting, John and Paul were presumably physically in location L on date D, at least at the time of the meeting.
  • Since he did not attend the meeting, Pius may be presumed not in location L on date D at the time of the meeting. Note that for purposes of this inference, location L must be taken in a very narrow sense, not a city or even a building, but a room.
  • Since they attended the meeting, John and Paul may be presumed alive on date D, at least at the time of the meeting.
  • Since he did not attend the meeting, we cannot infer with certainty that Pius was alive at the time of the meeting; since he is listed as absent, though, we may infer that he was expected.
  • Since Pius was expected at the meeting, we may infer that (as far as John and Paul knew) he was alive at the time of the meeting.[7]
  • The names of the attendees and absentee, and their propensity for falling into Latin, suggest that O is perhaps a committee or club[8] whose members served as popes sometime during [the middle decades of?] the twentieth century. Let us refer to this hypothesis as H2.[9]
  • If H2 is correct, then there is only one place (well, theoretically two places — could this explain why Benedict XV, Pius X and XI, and John Paul I are not listed either as present or as absent?) where O could be holding its meetings. [10] In this case, we can conclude that the document either has a miraculous provenance, or is a fiction.
These inferences rely not only on the application sentences we have generated from this document, but also on real-world knowledge about meetings, the taking of minutes of meetings, etc. We do not know whether any knowledge base exists from which such inferences could be drawn; the use of application sentences for further inferences with the help of a general purpose knowledge base is one possible application of the system we have sketched out, but it lies, strictly speaking, beyond the scope of our project at present.

5. Distribution

[11 July 2002]
I continue to worry about distributed properties and how to describe them. On the blackboard, I have written several skeleton sentences for the TEI lang attribute; they also apply, of course, to the HTML lang and the xml:lang attributes.

5.1. First attempt

My first attempt is an attempt to say, roughly, “if I am an element with a lang attribute, any text-node children I have are in the indicated language, and any element-node children I have which don't have lang attributes of their own have the indicated language, too.”
lang="X" → child::text() has language X
                & child::*[not(@lang)] has language X
Note that the notion “element E has language L” ought to mean about the same thing as “element E has the attribute-value specification lang="X"”, but this formulation doesn't quite get there.

5.2. Second attempt

Second attempt: assume the predicate haslang(X,L) with the meaning “text X is in language L”, and assume / claim / argue that it only makes sense for text nodes. Then elements may have a value for the lang attribute, but its purpose is solely to allow us to predicate a haslang property for the text nodes in that subtree. The elements themselves have no analogous property.[11] The function of the lang attribute on an element, that is, is to license not inferences about that element, but inferences about the text nodes within that element.
We therefore document the lang attribute with a skeleton sentence that quantifies over all text nodes:
(for all x) (gi(x,'#pcdata') → haslang(x,ancestor::*[@lang][1]/@lang))
or, roughly, “for all x, if x is a text node, then the haslang property of x has the value of the lang attribute on the nearest ancestor which has a lang attribute.” [12]
This reverses, in some sense, the direction of the inference: instead of making the element with the lang attribute the current element, and pushing information down to the descendants, we position ourselves among the descendants by means of the universal quantifier, and pull information down from the ancestor.

5.3. Third attempt

Third attempt: repair the problem in attempt 1:
(1) attv(N,lang,L) → haslang(N,L).
(2) haslang(N,L) →
      (for all y)(parent(y,N) & ¬(attv(y,lang,_)) → haslang(y,L))
    & (for all y)(parent(y,N) & gi(y,'#pcdata') → haslang(y,L))
where the meaning of haslang(N,L) is approximately “if N is a text node, the tokens in N are in language L; if N is an element node, the text nodes in N are in language L unless otherwise specified.” I think this captures the meaning reasonably well, but in practical terms I am uneasy for reasons I cannot quite identify. I think I am uneasy because relying on sentence (2) for the core notion of distribution means I don't know how to make XSLT or Prolog generate the base set of inferences.

5.4. Fourth attempt

Another attempt at ‘pull’ semantics, this time allowing the language predicate to apply both to elements and to text nodes.
(for all x : element) (attv(x,lang,_) → language(x,x->@lang))
(for all x : element) (¬attv(x,lang,_) → language(x,x->ancestor::*[@lang][1]/@lang)).
The notation e->E may be read as “the value of the XPath expression E interpreted with node e as the current node”.
This has the advantage of being clear, correct, and easy to use to generate a set of language sentences, one for each element.
This formulation is perhaps too narrow, however: what about text nodes? We can cover them, and attributes as well, by changing “element” to “node” — this in turn is too broad, in that it licenses inferences about comments and processing instructions, which is probably going too far.

5.5. Fifth attempt

The two rules in the fourth attempt may be merged to give a single rule:
(for all x : node) 
   ((element(x) | attribute(x) | textnode(x))
   → language(x,x->ancestor-or-self::*[@lang][1]/@lang))
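For comparison — and as one answer to the worry voiced above about generating the base set of inferences — a Prolog rendering of the same pull rule, using the attv/3 and parent/2 predicates of the earlier sections (parent(C,P) with P the parent of C), might be:

```prolog
/* language(N,L): L is the value of the lang attribute on the nearest
   ancestor-or-self of N that carries one. */
language(N,L) :- attv(N,lang,L).
language(N,L) :- node(N), \+ attv(N,lang,_), parent(N,P), language(P,L).
```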

5.6. Questions

It seems clear that we can formulate rather different skeleton sentences which license the same sets of inferences, or substantially the same sets. How do we choose among them?
I lean toward some criteria like the following:
  1. Can we write a working system that can use them to generate the appropriate list of licensed inferences, or to calculate the truth or falsehood of some candidate sentence?
  2. Other things being equal, which alternative is clearer or more perspicuous?
  3. Other things being equal, which alternative gets by with less mechanism?
  4. Other things being equal, which alternative is shorter?
  5. Other things being equal, which alternative is more elegant?
I am not certain of the correct order for the items after the first.

5.7. Sixth attempt

This appears to be a combination of several or all of the above:
Aa[elem(a)
   → hasprop(.,language(ancestor-or-self::*[@lang][1]/@lang))]
Aa[[att(a) | text(a)] & hasprop(parent::*,language(L))
   → hasprop(.,language(L))]

5.8. Seventh attempt

Yet another way to put this: we note that according to SGML doctrine every element in the TEI has a lang attribute, whether it carries an attribute-value specification for it or not. We can, if we want, express everything we wish by rules which fire on each element. Note, in reading what follows, that XPath coerces attributes without attribute-value specifications to false, not true.
let L = ancestor-or-self::*[@lang][1]/@lang
in (hasprop(.,language(L))
  & hasprop(child::text(),language(L))
  & hasprop(attribute::*,language(L)))

6. Unique identifications

One challenge we face is this: if the bibl element postulates the existence of a bibliographic item of which the bibl encodes the bibliographic description, that's fine. But when the title element needs to say “This is the title of the bibliographic item described by my enclosing bibl element”, then we need a way to translate that definite description in such a way as to identify not just the enclosing bibl element, which XPath does nicely, but an arbitrary object whose existence we infer from that element.
I've been reading David R. Dowty, Robert E. Wall, and Stanley Peters, Introduction to Montague Semantics (Dordrecht: Kluwer, 1980); the extensive focus on quantification means this is not as directly related to problems of markup semantics as I thought it might turn out to be, but I've learned some things and it's interesting.
One thing I learned from looking at their sample translation of "The mayor walks" (p. 279) as
   Ey[Ax[mayor'(x) ↔ x = y] ^ walk'(y)] 
was that we can handle this problem more easily than I had feared. I think we can handle bibl and title by saying in the rule for bibl something like
  Ex[bibliographic_item(x) & bibl_describes_item(.,x) 
    & Ay[bibl_describes_item(.,y) ↔ x = y]]
and then in the rule for title something like
    & Ax[bibl_describes_item(ancestor::bibl, x) → title_of_item(.,x)]
In a system for production work, it may make sense to have a concise notational convention for this, which we can translate into this slightly more verbose form, but for now I think this is concise enough.

A. To do

In principle, this document is just a working log, not a paper or a draft paper. But if the examples begin to look better, they may be requisitioned for use in papers, and even as a working log, there are things that need to be finished if the examples are to count as fully worked out.
  • Show how to associate date with different properties in the propagation layer, based on location of the element.
  • Smooth and clarify the section on property axioms (take 2).
  • Supply mapping axioms for the meeting minutes example (need to study DD's po.transcript and
  • Show three sets of mapping axioms and application sentences for the purchase order: DD's Prolog rules, XSLT equivalents, FOPC with XPath evaluation function.


[1] For example:
present_list(n1,n3). /* n3 is the present list OF n1! */
At the moment, CH and MSM propose to postpone the association of n3 with n1 to the application sentences.
Similarly, we could say
etc., if we took the predicate to be a statement about the type, rather than the token.
[2] These could be represented more compactly, perhaps, as
where the keywords tuple and sequence record whether predicates like precedes(X,Y) are applicable on this level or not. The keyword set would also be in order, but there might be no practical difference between tuple and set in that case.
Having an explicit ordering predicate, like the list in the comprises predicate, or the precedes predicate in the main text, is a simple way to convey the difference between things that are ordered with respect to each other (such as the paragraphs of the body) and things which are not usefully thought of as ordered with respect to each other (such as the present list, absent list, date, and body of the minutes — their sequence of occurrence in the document conveys no information whatsoever, and is accordingly fixed by the content model). But it might be simpler to say, in effect, “the order of the document counts unless there are special circumstances”, which we could do by saying something like
meaning that the nodes indicated are not ordered with respect to their siblings.
The question “what kind of appeals to order in the document will we actually need to support?” is perhaps best answered empirically.
[3] It is difficult to pass by this without wanting to elaborate on cases of uncertainty, qualification, and waffling.
[4] This is mostly plausible, but it makes it hard to explain the differences among Titus Livius, Tite-Live, and Livy.
[5] Or perhaps not, in this case. In cases like the purchase order, where we infer the application objects from XML constructs which more or less explicitly represent them, the syntax DD supplies is straightforward. Here, we are insisting on inserting an intermediate link in the chain, and claiming that the XML is a representation of a document, and that it is from the document that we draw the inference. A distinction without a difference?
[6] Along about this point, I begin to think in a production system I'd want XML Schema simple types for the type column here, so I could say property_of(date,isoform,xsd:date).
[7] Opinions differ about whether it is conceivable that someone known to be recently deceased could be listed under “Absent” in the minutes of a meeting.
[8] The nickel fine for absenteeism really does strongly suggest a club, as does the emphasis on refreshments. But this observation relies on an understanding of the text, not just on the application sentences derived from the markup.
[9] If H2 holds, then the name “Pius” is ambiguous: Pius X, XI, or XII. There are, of course, many other candidate identifications for these three names; even among the popes, there are other Johns, Pauls, Piuses. But it is not until the twentieth century that any three popes who took these three names were close enough in time to have known each other. But see also below.
[10] Note that John XXIII did not have the name John while Pius XII was alive, and Paul VI only took that name after both the others were dead. Note also that on the purported date, none of the three were alive on earth.
[11] These premises don't hold up very well when confronted with the TEI w element and intra-word markup, so I don't think I want to go here. But it's clear that attributing a language to a string of characters, or to a sequence of tokens in a text, is subtly different from attributing a language to an SGML element.
[12] We leave unspecified, for now, what this means in the case that no ancestor has a lang attribute. It might reduce to haslang(x,'') or to haslang(x,null) or raise an error; in the latter case we will want to arrange to protect the sentence with a conditional: “if there is an ancestor with a value for the lang attribute, then the value of haslang for x is ...”