This document is a working draft.
If you are not actively collaborating with the authors on the
project described here, then you probably stumbled across this
by accident. Please do not quote publicly or point other people
to this document. The URI at which this document now exists
is very unlikely to remain permanently available.
This paper summarizes the state of play on some aspects of the
Bechamel project's efforts to show how the meaning of markup can be
specified, as of our presentation in Tübingen last year (it is
derived from the slides used for one of the two talks in
Tübingen). In its current state, it is the responsibility of the
first author; it has not been reviewed by the other authors.
2. An example
Let us examine an example. In 1775, the South Carolina political
leader Henry Laurens wrote a letter to the royal governor of South
Carolina, Lord William Campbell. The beginning of the letter
might read thus in an online edition:
[del: It was be] [del: For] When we
applied to Your Excellency for leave to adjourn
it was because we foresaw that we
[del: were] [add: should continue]
wasting our own time ...
In XML or SGML form, this passage might plausibly be
encoded thus:
<p><del>It was be</del> <del>For</del> When we
applied to Your Excellency for leave to adjourn
it was because we foresaw that we
<del>were</del> <add>should continue</add>
wasting our own time ... </p>
[Discussion ...]
A non-literary example may also be useful. This is the
purchase order used as an example in Part 0 (the tutorial) of
the XML Schema 1.0 specification:
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
<shipTo country="US">
<name>Alice Smith</name>
<street>123 Maple Street</street>
<city>Mill Valley</city>
<state>CA</state>
<zip>90952</zip>
</shipTo>
<billTo country="US">
<name>Robert Smith</name>
<street>8 Oak Avenue</street>
<city>Old Town</city>
<state>PA</state>
<zip>95819</zip>
</billTo>
<comment>Hurry, my lawn is going wild!</comment>
<items>
<item partNum="872-AA">
<productName>Lawnmower</productName>
<quantity>1</quantity>
<USPrice>148.95</USPrice>
<comment>Confirm this is electric</comment>
</item>
<item partNum="926-AA">
<productName>Baby Monitor</productName>
<quantity>1</quantity>
<USPrice>39.98</USPrice>
<shipDate>1999-05-21</shipDate>
</item>
</items>
</purchaseOrder>
Part of the document tree from the Laurens example
[Is this relevant? ...]
[If this is relevant, it is to note that we expect
that in general, the semantics of colloquial XML markup
vocabularies will exploit the tree structure.]
From this document, it is straightforward to imagine
drawing some inferences based on the element types. For
example, the
del element indicates that
something was deleted; the
add that something
was inserted. If we give these things arbitrary names, we
can write:
deleted(n81). /* "It was be" */
deleted(n84). /* "For" */
inserted(n90). /* "should continue" */
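The step from element occurrences to ground facts can be made concrete with a small sketch. The project's actual implementations are in Prolog and XSLT; the following Python fragment is illustrative only, and its node identifiers (n2, n3, ...) are simply assigned in document order rather than matching the arbitrary names used above.

```python
# Sketch: emit Prolog-style ground facts for <del> and <add> elements.
# Node identifiers are assigned in document order; the mapping from
# element type to predicate name is the "meaning" being specified.
import xml.etree.ElementTree as ET

DOC = """<p><del>It was be</del> <del>For</del> When we
applied to Your Excellency for leave to adjourn
it was because we foresaw that we
<del>were</del> <add>should continue</add>
wasting our own time ... </p>"""

PREDICATE = {"del": "deleted", "add": "inserted"}

def facts(xml_text):
    root = ET.fromstring(xml_text)
    out = []
    for i, el in enumerate(root.iter(), start=1):  # document order
        pred = PREDICATE.get(el.tag)
        if pred:
            out.append(f"{pred}(n{i}).")
    return out
```

Running `facts(DOC)` yields `deleted(n2).`, `deleted(n3).`, `deleted(n4).`, and `inserted(n5).`, one fact per marked element.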
Similarly, there is a relation that holds between purchase
orders and their shipping addresses, or between purchase orders
and their billing addresses. We can write it this way in
a straightforward translation:
po_shippingaddress(p123,a45).
po_billingaddress(p123,a46).
These predicates can be paraphrased as saying “The relation
po_shippingaddress holds between
p123
and
a45” and
“The relation
po_billingaddress holds between
p123
and
a46”.
Alternatively, we can use a meta-logical predicate to capture
this information (this can make it easier to search for
all relations holding for a particular object).
relation_applies(shipto,[p123,a45]).
relation_applies(billto,[p123,a46]).
From the occurrences of the
del element in this
document, we can infer a set of ground facts like the following:
deleted(n81).
deleted(n84).
deleted(n87).
deleted(n96).
deleted(n105).
deleted(n123).
deleted(n133).
deleted(n137).
deleted(n143).
deleted(n149).
...
The simple pattern observable here in turn allows us to think of
the meaning of
del as an element type as being captured
by a
sentence skeleton (or sentence schema) of the
following form:
deleted(___).
When we interpret the markup in a particular document, we fill in the
blank appropriately. Just how that happens in the general case is a
topic of continuing work, but for the
del element it's
clear that the blank is filled either with the identifier of the
element (if we are willing to conflate the element with what it
encodes) or (if we wish to be very careful to distinguish the
representation from the thing represented) the identifier of the thing
encoded by the
del element.
We can apply this principle to the other relations we have
mentioned above, which gives us sentence skeletons like the following:
deleted(___).
inserted(___).
po_shippingaddress(___,___).
po_billingaddress(___,___).
po_item(___,___).
Note that some of these have more than one blank; part of our task
is to specify how a description of the meaning of a vocabulary can
specify how each blank is to be filled in.
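A skeleton with blanks can itself be given a concrete, if simple-minded, representation: a predicate name plus an arity, with instantiation consisting of supplying one value per blank. The sketch below (Python, illustrative only; the real question of how blanks are located in a document is left open) mirrors the skeleton inventory just listed.

```python
# Sketch: a sentence skeleton is a predicate name plus an arity;
# filling in the blanks means supplying one argument per blank.
def fill(skeleton, args):
    predicate, arity = skeleton
    if len(args) != arity:
        raise ValueError(f"{predicate}/{arity} needs {arity} argument(s)")
    return f"{predicate}({','.join(args)})."

# The inventory of skeletons from the text; arities vary.
SKELETONS = {
    "deleted": ("deleted", 1),
    "inserted": ("inserted", 1),
    "po_shippingaddress": ("po_shippingaddress", 2),
    "po_billingaddress": ("po_billingaddress", 2),
    "po_item": ("po_item", 2),
}
```

For example, `fill(SKELETONS["deleted"], ["n81"])` produces `deleted(n81).` and `fill(SKELETONS["po_shippingaddress"], ["p123", "a45"])` produces `po_shippingaddress(p123,a45).`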
3. The system we wish to build
The system we seek to build might be put together in two different
ways. The following system overview illustrates the first way:
From an XML document instance and a formal tag set definition (FTSD),
we create several sets of Prolog sentences: a set of
image
sentences which represent the original XML document, a set of
property rules which govern the propagation of properties
through the document, and a set of
mapping rules which
specify how the information in the markup maps into the application
domain. From these sets of sentences, others are derived by logical
inference. The
application sentences are, in some sense,
what is most obviously meant by our saying that the meaning of markup
is the set of inferences it licenses.
A second approach
exploits the ability to do XPath evaluation conveniently in XSLT, and
generates the application sentences directly from the XML instance by
means of an XSLT stylesheet which was itself derived from the FTSD.
For purposes of comparison between the two approaches, it is
possible to make the XSLT stylesheets also generate the other sets
of sentences shown in the first system overview.
The two system designs appear to differ primarily in where they put
their emphasis. Using XSLT is a way to reduce implementation time for
a system to produce application sentences; the XSLT-based approach
seems to regard the application sentences and the FTSD as the foci of
interest. Using Prolog inference rules to generate the application
sentences appears to reflect a view that the propagation sentences,
property rules, and mapping rules themselves are of significant
interest, and that the entire path from XML (image sentences) to
application sentences ought to be traversed by inference, not by
any other mechanisms.
4. Some technical and non-technical problems
In our work so far, we have encountered a number of
problems, many of them technical in nature. This section
summarizes the solutions we think we have found to
some of them.
- n-ary statements
- deixis
- inheritance
- distributed, non-distributed properties
- overriding, conflict, union
- milestones
- individuals[1] (what are we talking about?)
- references to the same individual
- certainty and responsibility
4.1. n-ary statements
This is one of the problems which defeats the straw-man
proposal of 2000. Some inferences we want to draw from the markup
require more than one argument (for elements) or more than
two (for attributes). For example:
TEI.2 element:
There exists a text [document, text witness, ...];
this is an electronic representation of
that text.
bibl element:
There exists a bibliographic item;
this element contains a bibliographic
description of that item.
head element:
This element is the title of the
immediately enclosing div / div0 / div1 / div2 ... element.
Our solution is straightforward: We allow
skeleton sentences to have predicates which take
two or more arguments.
text(X) & encoding(Y) & encodes(Y,X).
bibitem(X) & bibdesc(Y) & bibitem_desc(X,Y).
section-title(X) & title_of_section(X,Y).
The straw-man proposal of 2000 did not need any way to point to
elements or attributes, because every element and every attribute had
a translation with the same arity and the same arguments. Allowing
arbitrary arity requires us to be able to say how to fill the
additional argument slots. Our hypothesis is that in colloquial XML
the additional arguments are most conveniently specified by
expressions which specify a location; in many cases, the location is
relative to the location of the markup whose skeleton sentence is
being filled in.
Consider the
title element:
This is a title of
the bibliographic item
described by the nearest containing bibl
element.
We need a way to point to things like
the nearest containing bibl
element. Solution: use XPath expressions.
title([[.]]) &
(for all x)
(bibitem_desc(x, [[ ancestor::bibl ]])
→ bibitem_title(x, [[.]]) )
The double brackets are used to enclose deictic expressions
in the examples here and below; the precise nature of their
denotation is a problem yet to be elucidated.
There are alternate solutions:
- caterpillar expressions [Brüggemann-Klein/Wood 2000]
- CSS selectors
- a subset of XPath axes, in Prolog notation
- ...
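Whatever deictic notation is chosen, evaluating it means walking the document relative to the current node. The following sketch (Python with ElementTree, illustrative only; ElementTree has no ancestor axis, so a parent map stands in for ancestor::bibl) fills the extra argument slot of the title skeleton with the nearest containing bibl.

```python
# Sketch: fill an extra argument slot deictically, here with
# "the nearest containing bibl element", evaluated relative to
# the node whose skeleton sentence is being instantiated.
import xml.etree.ElementTree as ET

DOC = "<bibl><title>Syntactic Structures</title></bibl>"

def nearest_ancestor(root, node, tag):
    # ElementTree keeps no parent pointers, so build a child->parent map.
    parents = {c: p for p in root.iter() for c in p}
    while node in parents:
        node = parents[node]
        if node.tag == tag:
            return node
    return None

def title_sentences(xml_text):
    root = ET.fromstring(xml_text)
    ids = {el: f"n{i}" for i, el in enumerate(root.iter(), start=1)}
    out = []
    for el in root.iter("title"):
        bibl = nearest_ancestor(root, el, "bibl")
        if bibl is not None:
            out.append(f"bibitem_desc(x,{ids[bibl]}) -> bibitem_title(x,{ids[el]}).")
    return out
```

On the one-entry DOC above, this produces a single sentence linking the title (n2) to its enclosing bibl (n1).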
4.3. Inheritance
One of the most obvious features of colloquial XML vocabularies
is that they use inheritance to propagate properties from parent to
child. Consider a simplified
lang attribute:
<foo lang="L"> implies:
For each descendant d, the language of d
is L.
Or, more formally:
has_language( [[descendant-or-self::*]] , [[ attribute::lang ]] ).
4.4. Distribution
Not all properties are inherited in the same way.
Some, we say, are distributed, others non-distributed. For example,
one of the following sentences is true, the other is not:
If a chapter is in a language, then by default everything in that
chapter is in that language.
* If a chapter is a chapter, then by default everything in that
chapter is also a chapter.
(The asterisk beside the second sentence signals its falsehood.)
Solution:
We can distinguish two approaches to distribution of properties.
- For each element or attribute with a property
P, formulate sentences appropriately:
For each child c, c has property P
or For each child c, c has property is-part-of-a-P.
- Allow second-level descriptions of predicates:
distributed(lang).
distributed(quoted).
distributed(highlighted).
non-distributed(p).
non-distributed(div).
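The second approach, a second-level table of predicates, can be sketched directly. The fragment below is illustrative Python (the project's own rules are Prolog); the property names come from the examples just given.

```python
# Sketch: a second-level description saying which properties
# distribute to descendants as-is, and which do not. A
# non-distributed property P yields "is-part-of-a-P" on children.
DISTRIBUTED = {"lang", "quoted", "highlighted"}
NON_DISTRIBUTED = {"p", "div"}

def child_property(prop):
    if prop in DISTRIBUTED:
        return prop
    if prop in NON_DISTRIBUTED:
        return f"is-part-of-a-{prop}"
    raise KeyError(f"no distribution rule for {prop!r}")
```

So `child_property("lang")` is simply `lang`, while `child_property("div")` is `is-part-of-a-div`: everything in a chapter is in the chapter's language, but is not itself a chapter.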
4.5. Overriding (e.g. lang)
The account of inheritance given above is too simple: in many
cases, the inherited value of a property can be overridden by a
locally specified value. The lang attribute of TEI
and HTML, and the xml:lang attribute of the XML
specification, is a well known example of this design pattern.
There are several ways to think about
this and express it:
We can propagate down only to nodes which don't have
their own
lang attributes:
lang="X" → for each c in {child::text()},
haslang(c,X)
& for each c in {child::*[not(@lang)]},
haslang(c,X)
In English: when an element E has the attribute-value specification
lang="X" for some X, then:
- Each text-node child of E has language X.
- Each element C which is a child of E has language X,
if C does not have an attribute-value specification
for lang.
Note that we write
haslang(N,L) to signal that node N
has language L.
This formulation assigns the property of being in a particular
language to both elements and text nodes (but not to attributes).
Note that for the recursive step down to element children, we
really want to cause the haslang property to propagate
further, e.g. by re-executing this rule on the element children as if
they had an explicit lang value. But this formulation
doesn't actually manage to capture that requirement.
An alternative view is that XML elements don't have a
language — only text-nodes do. On this view, the element on
which the
lang attribute occurs has nothing direct to
do with the language property: it's just the common parent of a bunch
of text nodes which are in the same language. To get the right
language property for every text node, there is a simple rule:
(for all x)
(gi(x,'#pcdata')
→ haslang(x,ancestor::*[@lang][1]/@lang))
In English: for every text node (nodes with a pseudo-GI of
#pcdata), the
haslang relation links
that text node to the language specified on the nearest enclosing
element which has a value for
lang.
This view poses two challenges:
- The information we need is carried by the lang
attribute, but the sentences we generate are all related not to the
attribute-value specifications for lang, nor to their
parent elements, but to the text nodes in the document. This suggests
that the rules associated with a particular construct (here: the
lang attribute) may need to fire not when that
construct is encountered, but at some other time. Instead of pushing
the information down from the element with the lang
attribute, this formulation imagines us pulling it down to each text
node. It's not clear how best to organize the choice of push and pull
in a working system.
- No language information is associated with any attribute. This
is implausible, but it's fortunately also easy to correct.
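Whichever way the choice between push and pull falls, the observable result for text nodes is the same, and can be sketched procedurally. The fragment below (illustrative Python, not the project's Prolog formulation) carries the inherited value down and lets a local lang value override it; tail text is ignored for brevity, so this is a simplification.

```python
# Sketch: each text node gets the lang value of the nearest
# enclosing element that specifies one; a local value overrides
# an inherited one.
import xml.etree.ElementTree as ET

DOC = ('<div lang="en">hello '
       '<q lang="de">hallo</q>'
       '<emph>world</emph></div>')

def text_languages(xml_text):
    root = ET.fromstring(xml_text)
    out = []
    def walk(el, lang):
        lang = el.get("lang", lang)   # local value overrides inherited
        if el.text and el.text.strip():
            out.append((el.text.strip(), lang))
        for child in el:
            walk(child, lang)
    walk(root, None)
    return out
```

On the sample document, "hello" and "world" come out as English and "hallo" as German, as the overriding rule requires.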
Another formulation, which attempts to repair the problem
in the first formulation:
attv(N,lang,L) → haslang(N,L).
For every element N with a
lang value of L,
infer
haslang(N,L).
Define
haslang thus:
haslang(N,L) →
(for all y)
(parent(y,N) & ¬(attv(y,lang,_))
→ haslang(y,L))
& (for all y)
(parent(y,N) & gi(y,'#pcdata')
→ haslang(y,L))
I.e.
haslang(N,L) implies that
haslang(E,L)
applies to every element E which is a child of N, if E doesn't have
an attribute value specification for
lang, and also
that
haslang(T,L) applies to every text node child T of
N.
A fourth formulation, which resembles the second formulation
above in using ‘pull’ rather than
‘push’, but which ascribes a language property to
elements rather than to text nodes (if only to show that we can):
(for all x : element)
(attv(x,lang,_) → language(x,x->@lang))
(for all x : element)
(¬attv(x,lang,_)
→ language(x,x->ancestor::*[@lang][1]/@lang)).
Where
x->y means ‘
the value of the XPath expression
y, interpreted with x as the current node’.
For all elements X, if X has a lang value, then
the language of X is whatever the lang attribute
says. And:
For all elements X, if X has no lang value, then
the language of X is whatever the lang attribute says
on the nearest ancestor which has a lang attribute.
Analogous rules for text nodes and attributes would be
necessary in practice:
(for all x : text)
(language(x,x->ancestor::*[@lang][1]/@lang)).
(for all x : attribute)
(language(x,x->ancestor::*[@lang][1]/@lang)).
We can merge all the rules given in the preceding
formulation into a single rule:
(for all x : node)
((element(x) | attribute(x) | textnode(x))
→ language(x,
[[ x->ancestor-or-self::*[@lang][1]/@lang ]] ))
Solution: There are two approaches; it's not
clear which to prefer. We can define
lang top-down:
(for all e) [element(e) →
hasprop([[.]],
[[ language(ancestor-or-self::*[@lang][1]/@lang) ]]
)]
(for all n) [(att(n) | text(n))
& hasprop(parent::*,language(L))
→ hasprop(.,language(L))]
In English: for every element e, e has a
language
property whose value is given by the
lang attribute on
e itself, or on the nearest ancestor element which has a
lang value. For every attribute or text node n, n has a
language property whose value is given by the
lang attribute on the nearest ancestor element which
has a
lang value.
Or we can define it bottom up:
let L = ancestor-or-self::*[@lang][1]/@lang
in (hasprop(.,language(L))
& hasprop(child::text(),language(L))
& hasprop(attribute::*,language(L))
)
In English: [for every element E], let L be the
lang
value on the nearest ancestor with such a value. Then E itself,
each attribute of E, and each text-node child of E, has a
language property with the value L.
Note that none of the formulations given here accounts
explicitly for elements, text nodes, or attributes which:
- have no ancestor with a lang value
- do not contain natural-language text and therefore ought
not, strictly speaking, to be associated with a natural language
Perhaps the meaning (for us, at least) of the lang
attribute is that every element and attribute ought to be classified
as having, or not having, a language property.
4.6. Conflict and union
Another problem related to overriding is the need to distinguish
conflicting values from compatible values. A vocabulary with concepts
like bold and italic must say how they
relate: do they override each other or complement each other?
Solution: if they complement each other, it's because
each specifies a different property. If they override each other,
it's because both specify the same property. Formulate the rules
accordingly. [N.B. This is going to make it very hard to provide a
general interpretation of the TEI rend attribute.]
When
b within
i gives
bold italic:
b: bold( [[descendant-or-self::*]] )
i: italic( [[descendant-or-self::*]] )
So
<b>aaa <i>bbb</i> ccc</b>
gives
bold(n1)
& bold(n2)
& bold(n3)
& italic(n2)
...
When
b within
i gives
bold,
i within
b gives italic:
b: fontstyle( [[descendant-or-self::*]], bold )
i: fontstyle( [[descendant-or-self::*]], italic )
or
(for all n) [textnode(n) →
hasprop([[.]],
[[fontstyle(ancestor-or-self::*
[name() = 'b' or name() = 'i'][1]) ]]
)]
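The "nearest wins" reading of the second case can be sketched as a traversal in which the closest enclosing b or i decides the style of each text node. This is illustrative Python, not the project's Prolog rules; the tag and property names follow the text.

```python
# Sketch: the font style of a text node is decided by its closest
# enclosing b or i element ("nearest wins"), so b within i gives
# bold and i within b gives italic.
import xml.etree.ElementTree as ET

STYLE = {"b": "bold", "i": "italic"}

def font_styles(xml_text):
    root = ET.fromstring(xml_text)
    out = []
    def walk(el, style):
        style = STYLE.get(el.tag, style)   # nearest b/i ancestor wins
        if el.text and el.text.strip():
            out.append((el.text.strip(), style))
        for child in el:
            walk(child, style)
            if child.tail and child.tail.strip():
                out.append((child.tail.strip(), style))
    walk(root, None)
    return out
```

On `<b>aaa <i>bbb</i> ccc</b>` this yields bold for "aaa" and "ccc" but italic for "bbb", since the i overrides the enclosing b.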
4.7. Milestones
Milestones are unusual, and often frowned upon, because they
require looking back along the frontier of the tree, treating the XML
as a stream, rather than up, down, or sideways in the tree.
Solution:
The meaning of a milestone, however, is easy to specify, as long
as the language for deictic expressions allows it. In XPath,
the
preceding axis is what is needed. Here's a
sample formulation:
(for all n) ((textnode(n) | element(n))
→ hasprop(.,pagenum(
[[preceding::pb/@n]])))
For each text node or element N, N has a
pagenum
property which has as its value the value of the
n
attribute on the most recent milestone element.
Note that for elements, the pagenum property
is, in effect, the page number of the first page, not necessarily
the only page number on which material from that element appears.
Complications will be necessary to deal with transcriptions
which provide multiple sets of page numbers (such as McKinnon's
Kierkegaard).
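The stream-like character of milestone processing is easy to see in a procedural sketch: a single variable holds the current page and is updated whenever a pb is encountered in document order. The fragment is illustrative Python (element and attribute names follow the example in the text, not any particular tag set's definition).

```python
# Sketch: assign each text node the @n of the most recent preceding
# <pb/> milestone, treating the document as a stream rather than
# looking up, down, or sideways in the tree.
import xml.etree.ElementTree as ET

DOC = '<div>start <pb n="4"/>middle <pb n="5"/>end</div>'

def page_numbers(xml_text):
    root = ET.fromstring(xml_text)
    out, page = [], None
    def walk(el):
        nonlocal page
        if el.tag == "pb":
            page = el.get("n")       # milestone: update current page
        if el.text and el.text.strip():
            out.append((el.text.strip(), page))
        for child in el:
            walk(child)
            if child.tail and child.tail.strip():
                out.append((child.tail.strip(), page))
    walk(root)
    return out
```

Text before the first pb has no page number at all, which is exactly the unhandled boundary case noted for the lang formulations above.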
4.8. Individuals
What do the identifiers
n81 etc. denote?
- elements
- contents (as tokens)
- contents (as types)
- things outside the markup, in the application domain,
[whose existence we believe in because of the markup]:
- bibliographic items
- manuscripts or printed editions used as the copy text for
the electronic transcription
- the work as an abstract object
- the expression of the work (also abstract)
- the manifestation of the expression
- one particular physical (?) text carrier
(there exists x)
(bibitem(x)
& bibdesc([[.]])
& bibitem_desc(x,[[.]])).
Disagreements over what individuals to posit have thus far been
a major stumbling block in our attempts to write skeleton sentences
for specific tag sets or document instances. Does the root element
in a MECS-WIT transcription license all of the following inferences?
- * There exists a text T.
- There exists a manuscript M.
- There exists a MECS transcription E.
- The catalog number of M is ...
- The Wittgenstein Archive ID number of E is ...
- * The language of T is German.
- * M is a manuscript of T.
- * E is a transcription of T, from M.
or should the starred items be replaced by alternative formulations?
- * The language of M is German.
- * E is a transcription of M.
Solution:
- Think hard.
- Be careful about data types.
I think there are two possibilities for reducing or eliminating
the ambiguity of the sentences:
- Define a set of functors which effectively cast their
argument (to element, to token, to type, ...). Either establish
a clear default and use the casts only where needed, or
establish the rule that there is no default and use the casts
for all cases.
- Adopt rules like the following:
- Strings (or Prolog atoms) denote types, not tokens.
- Node identifiers for text nodes denote tokens, not
types.
- Node identifiers for elements and attributes denote those
constructs in the input document, as tokens (object semantics, not
value semantics). To get at their string values, use an XPath
string() function or the like. To get at the objects they
encode, use the Prolog relation models(O,L) or
the like.
4.9. References to the same individual
In some cases, different markup constructs generate separate
inferences about the same individuals. The meaning of a
bibl element includes, for example, the inference
“there exists a bibliographic item I, and this element is / contains
a bibliographic description of I”. A title element
inside a bibl element does not mean only ‘this is
a title’ or even ‘this is the title of a bibliographic
item’, but ‘this is the title of the
bibliographic item whose existence is posited by the enclosing
bibl element.’
Fortunately, Richard Montague was here before us, and we have a
reasonably well established way of handling this case.
First, we translate the existential assertion of the
bibl
element thus:
(there exists x)
(bib_item(x)
& bibl_item_desc(x,[[.]])
& (for all y)(bibl_item_desc(y,[[.]]) ↔ x = y))
In English: there is some bibliographic item X for which this element
is a bibliographic description, and for any bibliographic item Y for
which this element is a bibliographic description, X = Y. Or, more
loosely: there is exactly one bibliographic item for which this
element is / contains a bibliographic description.
If the existential assertion given above is in our context, we can
successfully refer to it in the application sentences generated by
the
title element:
title([[.]])
& (for all x)
(bibl_item_desc(x, [[ancestor::bibl]])
→ bib_item_title(x,[[.]]))
In English: this is a title, and for every X such that the
bibl element enclosing this title is a bibliographic
description of X, this is the title of X. Or, more colloquially:
this element is a title, and specifically it is the title of the
bibliographic item described by the enclosing
bibl.
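The mechanics of this reference can be sketched procedurally: each bibl posits exactly one new item, and a title inside it attaches to the item posited by its nearest enclosing bibl. The sketch is illustrative Python, not the Montague-style logical treatment itself; the item identifiers (b1, ...) are invented here.

```python
# Sketch: each bibl element posits exactly one bibliographic item;
# a title inside a bibl is attached to the item posited by its
# nearest enclosing bibl.
import xml.etree.ElementTree as ET

DOC = ("<bibl><title>Syntactic Structures</title>"
       "<author>Chomsky</author></bibl>")

def bibl_facts(xml_text):
    root = ET.fromstring(xml_text)
    out, count = [], 0
    def walk(el, item):
        nonlocal count
        if el.tag == "bibl":
            count += 1
            item = f"b{count}"          # the posited bibliographic item
            out.append(f"bib_item({item}).")
        if el.tag == "title" and item:
            out.append(f'bib_item_title({item},"{el.text}").')
        for child in el:
            walk(child, item)
    walk(root, None)
    return out
```

The crucial point is that the title rule never names the item directly; it reaches it only through the enclosing bibl, just as the logical formulation reaches it through the existential claim.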
Note in passing that the inferences just described are not
invariably safe: in an unstructured bibl for a journal
article there are apt to be two title elements: one for
the article and one for the journal. I think world knowledge (i.e.
knowledge of common bibliographic practice) allows us to distinguish
them, but I believe that it's world knowledge, not markup semantics.
For a biblStruct, on the other hand, such inferences
are more likely to be safe.
Note also that the uniqueness of the bibliographic item is not
really essential to successful reference: what is crucial is the
existential claim.
4.10. Certainty and responsibility
One of the first problems one bumps up against in devising skeleton
sentences for the TEI, or MECS-WIT, is the extensive support in these
markup schemes for recording uncertainties, alternatives, etc. It is
easier to see how to write sentences for the unmarked cases in which
either certainty is assumed or in any case nothing is said about
certainty or uncertainty.
If we are sure, that is, we write:
abbreviation(n81,"S","South").
Solution 1:
If we are not sure, one obvious way to handle it is to
reify the proposition. Give it an identifier, so that we can
refer to it from elsewhere.
abbreviation(s443, n81, "S", "South").
certainty(s443, uncertain).
responsibility(s443, renear).
This manoeuvre allows us to record the information, at the cost
of requiring queries to be aware of the two distinct predicates
abbreviation/3 and
abbreviation/4.
It might be better to require a unique identifier on every assertion,
to eliminate the variation. But for now, I think we should prefer
the variable number of arguments, with auxiliary predicates which
understand both forms.
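The reification manoeuvre of Solution 1 can be sketched as a small fact store: plain assertions stay n-ary, while uncertain ones get a fresh statement identifier which the certainty and responsibility annotations then refer to. This is an illustrative Python sketch, not the project's Prolog machinery; the statement-id scheme is invented here.

```python
# Sketch of Solution 1: reify an uncertain assertion by giving it
# an identifier, then annotate that identifier. Plain assertions
# keep their original arity; reified ones carry the id as an
# extra first argument (abbreviation/3 vs abbreviation/4).
class FactStore:
    def __init__(self):
        self.facts = []
        self._next = 0

    def assert_fact(self, predicate, *args, certainty=None, responsibility=None):
        if certainty is None and responsibility is None:
            self.facts.append((predicate, args))       # plain n-ary fact
            return None
        self._next += 1
        sid = f"s{self._next}"
        self.facts.append((predicate, (sid,) + args))  # reified: id first
        if certainty:
            self.facts.append(("certainty", (sid, certainty)))
        if responsibility:
            self.facts.append(("responsibility", (sid, responsibility)))
        return sid
```

A query layer over such a store would need to understand both the plain and the reified form of each predicate, which is exactly the cost noted above.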
Solution 2: (This one wasn't in the slides
from Tübingen, but only in the working paper
“Desperately seeking sentences”.)
Provide certainty information using the predicate as a structure:
abbreviation(n81, "S", "South").
certainty(abbreviation(n81, "S", "South"), uncertain).
responsibility(abbreviation(n81, "S", "South"), renear).
I think mapping to this from TEI's certainty and responsibility
elements is slightly less convenient than mapping to solution 1.
Solution 3: hard-code a certainty and a responsibility
argument into the predicates that need it.
abbreviation(n81, "S", "South", uncertain, renear).
In view of the fact that virtually anything can be uncertain,
in TEI explicitly so, this seems unduly cumbersome.