Report from Bergen, October 2003

Claus Huitfeldt

C. M. Sperberg-McQueen

8 October 2003

This document is a working draft.

If you are not actively collaborating with me on the project described here, then you probably stumbled across this by accident. Please do not quote publicly or point other people to this document. The URI at which this document now exists is very unlikely to remain permanently available.



Claus and MSM have been trying to recover the results of our discussions in June; we each have the firm conviction that we reached important results in our discussions then, but we have forgotten what they were. This document is an attempt to get things written down.

1. Notation

Typing: (CH and?) MSM believe that in June the Bechamel Project agreed to write
(n : node)(d : document)(s : string)(P(n,d,s))
instead of (the equivalent):
(n)(d)(s)
  (node(n) & document(d) & string(s)
   → P(n,d,s))
In other words, we allow ourselves to assume a relatively simple type system and suitable type predicates.
Functions: CH and MSM believe that in June we all agreed to use relational notation rather than functional notation. If f is a function containing the pair (x, y), then we do not write
... f(x) ...
but
... f(x,y) ... y ...
Metamarkup: MSM believes that in June we agreed that the object system needs to allow predicates concerning the markup and the marked up document, rather than just to things at the pure application level.

2. Claus's axioms from June

2.1. The example

We began by rewriting the example using the image sentence predicates from the Extreme 2002 paper. The sample document begins:
<text/
<doc/
  <code-version/1.01>
  <catno/171>
  <ti/Farbenlehre>
  <crncopy/70> 
  ... 
/doc> 
... 
/text>
(For now, we ignore the fact that MECS documents aren't necessarily trees. This one is a tree, in the bits we care about for now.)
The image sentences are these. The node predicate is the same as for XML:
node(n00).  node(n01).  node(n02).  node(n03).  node(n04).  
node(n05).  node(n06).  node(n07).  node(n08).  node(n09).  
So is the gi predicate:
gi(n00,text).
gi(n01,doc).     
gi(n02,code-version). 
gi(n03,catno).   
gi(n04,ti).      
gi(n05,crncopy). 
gi(n06,'#pcdata'). gi(n07,'#pcdata').
gi(n08,'#pcdata'). gi(n09,'#pcdata').
The parent predicate will no longer be a function when the image sentences describe a MECS or TexMECS document, and MSM thinks we may wish to change the way the predicate works, but since this sample document is a tree, we don't need to address that question here.
parent(n01,n00).  
parent(n02,n01).  parent(n03,n01).  parent(n04,n01).  parent(n05,n01).  
parent(n06,n02).  parent(n07,n03).  parent(n08,n04).  parent(n09,n05).
The first-child and nsib predicates here are as described in the Extreme paper; in the long run, TexMECS will require a replacement for the nsib predicate, since the next-sibling relation in Goddag structures is ternary, not binary.
first-child(n00,n01).
first-child(n01,n02).
first-child(n02,n06).
first-child(n03,n07).
first-child(n04,n08).
first-child(n05,n09).
nsib(n02,n03).
nsib(n03,n04).
nsib(n04,n05).
For simplicity, we omit all the PCDATA nodes which contain only whitespace.
content(n06,'1.01'). 
content(n07,'171'). 
content(n08,'Farbenlehre'). 
content(n09,'70'). 
In reviewing the Urbana notes, MSM asked “So, what are the constants c0 and c1?” CH said he wasn't sure.
In reviewing the image sentences just recorded, CH asked “So, what are the contents of node n0?” MSM said there are two different answers, depending on how one translates that question from English prose into the formalism:
  1. What is in the content relation with n0 as the container? In Prolog:
    ?- content(n00,X).
     
    No
    
  2. Of what things is n0 the parent? In Prolog:
    ?- parent(X,n00).
     
    X = n01 ;
     
    No
    
    or
    ?- findall(X,parent(X,n00),List).
                                                                                    
    X = _G157
    List = [n01]
     
    Yes
    

2.2. The axioms

CH says that in examining the axioms in the Urbana notes, it's best to ignore the quantifiers at first, and try to figure out the correct quantification later, based on the sentence.
Let's work backwards. We want to generate the sentence
Documentation(n0,n1)
so the axiom we are looking for needs to have that clause as the consequent of some implication.
CH and MSM first wrote two similar forms of the axiom:
(n : node)(m : node)
  (gi(n,doc) & parent(n,m)
  → Documentation(m,n))
CH's form was fully bound (in order to evade having to specify quantifiers):
gi(n1,doc) & parent(n1,n0) → Documentation(n0,n1)

2.3. May we assume valid documents?

We asked ourselves whether we can in fact assume valid documents.
On one hand, we can argue that the task of formal semantics is to provide interpretations of well-formed formulas; there is no need to avoid providing interpretations of ill-formed formulas. In this context, the set of well-formed formulas corresponds to the set of valid documents, not the set of all data streams. Less formally: it's hard enough for an account of MECS-WIT semantics to describe how to interpret MECS-WIT documents; asking it to provide coherent semantics for documents which contain errors (or which are not, on a strict view, MECS-WIT documents at all) is asking too much.
On the other hand, it is troubling (at least to MSM) to think of a semantic interpretation engine which blithely ascribes an interpretation to data streams without noticing if they happen to contain garbage. It would be far preferable for such an interpretation engine to reply “This data stream is meaningless” or “No interpretation is possible for invalid documents” when presented with invalid / ill-formed data.
Let us call an interpretation engine ‘blithe’ if it interprets well-formed data but does not distinguish well-formed from ill-formed data, and ‘careful’ if it won't interpret things that aren't valid.
We note that one way to make a careful engine out of a blithe engine is to wrap it in a validator; in pseudo-code:
if (validate(doc) == VALID) {
   call_blithe_engine(doc);
} else {
   printf("Huh?\n");
}
We also note that since MECS documents cannot be validated mechanically, we cannot build a careful engine for MECS using this technique. We can write the inferences in a way that involves some of the syntactic regularities which would be guaranteed by syntactic validation. Or we can write the syntactic regularities out as rules of their own. For example, any well-behaved MECS-Wit document has a single doc element, and that doc element is a direct child of the text element (or of the data stream — we continue to jump back and forth between being willing to postulate such an element and not being willing to do so):
(for all s : mecswit-stream)
  (well-behaved(s)
  → (∃ n : node)(in-stream(n,s)
    & gi(n,doc)
    & (for all p : node)(gi(p,doc) & in-stream(p,s) → p = n)
    & parent(n,s)))
We can if we wish use rules of this kind to check to see whether a MECS-Wit data stream is ill-behaved.

2.4. Alternative formulations of axiom A1

We can also mention some of the syntactic regularities in the interpretation of element types. The modified axiom A1' from Urbana does this:
(for all p) doc(p) 
  → (∃ q)(ancestor(q,p) & Documentation(p,q) & Stream(q))
In English: for any p, if p is a doc element, then there exists a q which is an ancestor of p, and q is a data stream, and p is the documentation for q. (Or, less formally: p is the documentation for the data stream p is in.)
MSM proposed the following alternative formulation:
(for all n : node)(for all m : node)
  (gi(n,doc) & gi(m,text) & parent(n,m)
  & (for all p : node)(gi(p,doc) & descend_anc(p,m)
    → p = n)
  → Documentation(n,m))
In English: for any nodes n and m, if n is a doc element which is a child of the text element m, and m has no other doc elements anywhere, then n is the documentation for m (less formally: n is the documentation for the document it's in).
Unresolved question: we agreed that the clauses relating to the enclosing data stream or document element, and to the uniqueness of the doc element, are ‘syntactic’ in the sense that they are true of any well-behaved MECSWIT document and would be guaranteed by the equivalent of DTD-based validation. We also thought it might be useful to understand why Claus's instinct is to put these clauses into the consequent, and MSM's instinct is to put them into the antecedent.
But we decided not to try to go into that question just now.

2.5. A quick aside on desert landscapes vs fertile terrain

We have the impression we talked in Urbana about how to deal with the sentences whose gloss reads something like ‘At this point, the item has a [name of type of textual object].’ Neither of is certain we remember the conclusions we reached, if any (but MSM thinks we reached conclusions because he went home thinking he knew what he needed to know).
When we looked more closely at the problem, it looked more tractable than MSM had feared. We can take two distinct approaches:
  • A ‘fertile valley’ approach will postulate the existence of lots of objects which are paragraphs, sections, emphatic phrases, and so on. (This may be the approach textual Platonists will lean toward.) There will be lots of objects at the application level, and they will have lots of different types, from character through deleted-bit through paragraph through notebook.
  • A ‘desert landscape’ approach will avoid postulating any such things, preferring to work only with the document as a sequence of characters, (or, for engineering purposes only, a sequence of sequences of characters — those who think in XPath and Infoset terms will say “a sequence of text nodes”, vs. “a sequence of character information items”). There will be fewer objects at the application level, and they will all be characters or character sequences.

3. Friday 10 October 2003

3.1. Goals for now

Goals for now:
  • continue with image sentences
  • when we have a larger body of them (ideally including text structuring tags and some phrase-level tags), look at a simple, short example and try to generate image sentences and application sentences manually for a fragment
  • MSM to try to generate image sentences from TexMecs
  • CH to try to write a translator to go from Mecs to TexMecs

3.2. Alternative formulations of A1 (validity revisited)

Consider these formulations of axiom A1:
  1. gi(n1,doc) → documentation(n1,n0)
N.B. this one doesn't bind n0. If we do that, we get
  1. gi(n1,doc) → (∃ x)(parent(n1,x) & documentation(n1,x))
There was a twinkle in Claus's eye as he proposed
  1. gi(n1,doc) → documentation(n1,mother_of(n1))
and asked MSM what he thought. MSM said he rather liked it, since it was compact, but noted that we had agreed in Urbana not to use functional notation, on the grounds that mixing functional notation and relational notation was likely to be confusing in the long run. This may mean MSM passed CH's test.
Version 2 looks better, but it is a ground form and thus can't really be a skeleton sentence or an axiom. Can we quantify it?
  1. (for all n : node)(gi(n,doc) → (∃ x : node)(parent(n,x) & documentation(n,x)))
Now consider
  1. (for all n : node)(for all x : node) (gi(n,doc) & parent(n,x) → documentation(n,x))
Both CH and MSM prefer version 5 to version 4. Why? If we assume well-behaved documents, versions 4 and 5 will never predict different conclusions.
We tried out various formulations of the reason:
  • We don't really want to assume well-behaved documents after all, unless it buys us something. If version 4 were much simpler than version 5, we might prefer it. But it's not. (MSM counts atomic tokens and finds that 4 and 5 are identical in length.)
  • We don't want existential quantification in the right hand side. (This is too strong; there are counter-examples.)
  • We don't want existential quantification over image-sentence objects or markup-language constructs). We either have something in the image sentences, or we don't have it. There is no sense in postulating the existence of an element as version 4 does: either we have the element, in which case we don't need a RHS to postulate its existence, or we don't, in which case we don't know anything about it except that it ought to have existed, had the world been a better place.
  • We don't want to infer syntactic regularities in the semantic rules. (If we want axioms which capture logical regularities, we can formulate them as standalone axioms, as formulated in section 2.3.)
  • We will allow ourselves to assume that MECS-WIT documents are well behaved only if the assumption gives us an advantage in terms of simplification.

3.3. Code-version

For code-version: (for all n : node) (for all m : node) (for all s : string) (gi(n,codeversion) & content(n,s) & root(n,m) → conforms(m,s))

4. Monday 13 October

4.1. Catno

For catno: This item has number n in the Von Wright catalog of 1984.
MSM's first formulation:
(for all n : node) (for all s : string) (for all s2 : stream)
  (gi(n,catno) & content(n,s) & in(n,s2) 
  → (∃ i : item)
    (has-vw(i,s) & transcribes(s2,i)))
Recast to use doc and assume a document root element rather than a stream:
(for all n : node) (for all m : node) (for all s : string) 
  (gi(n,catno) & content(n,s1) & root(n,s) 
  → (∃ i : item) (has-vw(i,s) & transcribes(m,i)))
CH's first formulation:
doc(n,catno) & content(n,'171') & root(n,m) 
→ transcription(o,n) & item(o) & vwno(o,'171')
We digressed to the notion of ‘having a VW number’ — do we need to explicate this? Perhaps, thinks MSM:
hasvwno(i : item, s : string)
 iff
(∃ e : entry) 
  (e in vwcatalog 
  & entry_id(e, s)
  & entry_item(e,i) )
N.B. CH doesn't find this particularly compelling. But we don't actually need to do this.
It's also worth noting that identifying i (or o) as an item is in some sense redundant, since only items occur as the first argument in the relation vwno(_,_).
Can we say explicitly that (a) this transcription only transcribes one item, and (b) only one item has this VW number? Yes, we can:
(for all n, m : node)(for all s: string)
  (gi(n,catno) & root(n, m) & content(n, s)
  → (∃ i : item)(vwno(I,s) & transcribes(m, i)
    & (for all j : item)(vwno(j,s) → i = j)
    & (for all j : item)(transcribes(m,j) → i = j)))
Claus offers an alternative formfor the first j clause:
& ¬(∃ j : item)(vwno(j,s) & ¬(i = j))

4.2. Ti

This item has the title s.
It's a syntactic fact that there is only one ti per item. In practice it uses the title given in von Wright.
(for all n : node) (for all i : item) (for all s : string)
  (gi(n, ti) & content(n, s) & elem_item(n, i)
  → item_title(i, s))
We may specify an auxiliary relation item/2:
element_item(n, i) iff (∃ m : node)(root(n, m) & transcribes(m, i))
CH frowns a little over the auxiliary relation.

4.3. Crncopy

MSM tries: This item appears in photocopy in volume n of the Cornell edition.
(for all n : node) (for all s : string) (for all i : item)
  (gi(n, crncopy) & content(n, s) & element_item(n, i)
  → has_crn_no(i, s))

has_crn_no(i, s) iff (∃ c)(cornellvol(c) & reproduces(c, i) & volume_no(c, s))
CH has something very different.
gi(n,'crncopy') & element_item(n, i) & content(n, s)
  → (∃ j)(copy(j,i) & crnno(j, n))
N.B. in dealing with catno, we did not postulate the von Wright catalog and its entries directly, but just said vwno(i, s). Here, CH does want to postulate the existence of something new. Why?
  • The VW numbers, really, are numbers of the items, whereas the Cornell numbers are at root numbers of something else.
N.B. one Cornell volume may reproduce several items. There are rare cases (just one?) where an item (the large TS) is split over several physical volumes in the Cornell paper edition.
The reels are a whole other story; they are not 1:1 with the paper volumes, nor with items. The mapping is a bit of a mystery in fact.

4.4. Wabcopy

As for crncopy. No, perhaps not: probably contains a prose description.
(for all n : node) (for all i : item) (for all s : string)
  (gi(n, wabcopy) & element_item(n, i) & content(n, s)
  → (∃ j : copy)(copies(j, i) & nldescribes(s, j)))
MSM's version of this and CH's are the same exept for the last bit, where MSM has nldescribes (≉ ‘provides a natural language description of’) and CH has designates or identifies.
This is a bit of a soft spot. (I.e. we should probably come back to this.)

4.5. Datfrom, datto

At this point, MSM found himself growing tired of writing universal quantifiers and elaborate antecedents for conditionals which do nothing but bind the variables to values in the document, and he relapsed into writing skeleton sentences:
datfrom: est_terminus_a_quo(item(.), contents(.))
datto: est_terminus_ad_quem(item(.), contents(.))
I.e.
(for all n : node) (for all i : item) (for all s : string)
  (gi(n, datfrom) & element_item(n, i) & contents(n, s)
  → est_terminus_a_quo(i, s))
(for all n : node) (for all i : item) (for all s : string)
  (gi(n, datto) & element_item(n, i) & contents(n, s)
  → est_terminus_ad_quem(i, s))
On second thought, MSM wondered whether we do want to quantify over dates, so that we get something with
... → (for all de, di : date)(written_date(i, di) & string_date(s, de)
       → di ≤ de)
i.e. ‘the actual date the item was written is less than or equal to (comes on or before the date of) the estimate
We did not resolve this question. CH was not quite happy but we did not clarify just why.

5. 14 August 2003

5.1. Dates, continued

Another pass at datfrom and datto:
(for all n : node) (for all i : item) (for all s : string) (for all di,de : date)
  (gi(n, datfrom) & element_item(n,i) & contents(n,s)
  & date_lex_val(s,de) & item_date(i, di)
  → di ≥ de)

(for all n : node) (for all i : item) (for all s : string) (for all di,de : date)
  (gi(n, datto) & element_item(n,i) & contents(n,s)
  & date_lex_val(s,de) & item_date(i, di)
  → di ≤ de)
MSM's first version of the preceding used a generic predicate for mapping a lexical form to a value given a type name, so instead of date_lex_val(s, de) he had type_lex_val(date,s,de). That felt uncomfortable to Claus (a bit too much like meta-logic?) so we changed it. The name of a type appearing in an argument position made Claus wary.
CH is also a bit worried about “di ≥ de” — MSM argues that all sets necessarily have partial orderings, so that assuming a less-than-or-equals relation does not involve us in anything we haven't already bought into already.
We agreed to proceed, but if the roof falls in later, we may need to revisit these cases.
We can rewrite the consequent ge(di, de) if that makes us feel better. (MSM is unimpressed.)
[We later noticed that we did not deal with the uncertainy implicit in the prose documentation of datfrom and datto. They are often estimates and could be wrong without invalidating everything else.

5.2. The contents relation

We noticed at this point that we have quietly been using contents(n, s) without definition for nodes with generic identifiers other than #pcdata. It means:
(for all n : node) (for all s : string)
  ¬ gi(n, '#pcdata') →
  (contents(n, s) iff (∃ m : node) 
                      (parent(m, n) & first-child(n, m)
                      & ¬(∃ l : node)(nsib(m,l))
                      & gi(m,'#pcdata')
                      & contents(m,s)))

5.3. Transcribers

MSM tries
(for all n, m : node) (for all i : item) (for all s : string)
  (gi(n,'transcribers') & contents(n,s) & element_item(n, i) & root(n, m)
    → (∃ x : persons)(transcribes(x, i, m) & nldescribes(s, x)))
One might think first of a two-argument transcribes relation.
Claus's draft:
g(n, transcribers) & contents(n, s) & root(n, m)
  → (∃ p)(designates(s, p) & transcribers(p, m) & person(p))
N.B. In this context, the information content of the ternary relation transcribes(p,i,m) and the binary relation transcribers(p,m) is the same, since catno establishes a transcribes/2 relation between m and i.
We argued a bit about this, and decided to use the binary relation transcriber_transcript rather than the ternary relation. CH feels the ternary relation is partially redundant, and MSM agrees (not in third normal form).
Do we care about having a person predicate, for transcribers and proofreaders? MSM thinks so, CH is not sure.
We can do proofreaders by analogy with transcribers.

5.4. Sections

MSM's first cut:
(for all n : node) (for all i : item) (for all s : string)
  (gi(n, sections) & content(n, s) & element_item(n, i)
  → section_description(i,s))
CH's first cut:
  gi(n, sections) & content(n, s) & root(n, m)
  → (for all x : node)
      (gi(x, sec) & contents(x, t) & root(x,m)
      → describes_recognition_criteria(t,x))
Joint revision:
(for all n : node) (for all i : item) (for all s : string)
  (gi(n, sections) & content(n, s) & element_item(n, i)
  → (for all x : node)
      (gi(x, sec) & root(x,m)
      → describes_recognition_criteria(s,x))
We discussed whether the sections element describes the item or the sec elements in the transcription.
Question: do we write this from the point of view of the sections element, or from that of the sec element?
After a lunch break, we produced:
(for all n : node) (for all m : node) (for all s : string)
  (gi(n, sec) & element_sectionselement(n, m) & gi(m, sections) & contents(m, s)
  → describes_recognition_criteria(s, n))
Note that gi(m, sections) is redundant, since it should always be true if element_sections(_, m).
Which of the forms here is preferable? The last (after lunch) is shorter. And note that since the sections element is optional, we can make it shorter still:
(for all n, m, r : node) (for all s : string)
  (gi(n, sec) & gi(m, sections) & root(n, r) & root(m, r) & contents(m, s)
  → describes_recognition_criteria(s, n))

5.5. Seclines

We talked about this one, but blew it off.

5.6. Hands

The contents of hands can range from things like “h = unspecified later hand” or “s = Typoskript-Ergänzung; h = Nachtrag von LW, Mittel noch zu identifizieren; H1 = Nachtrag in Bleistift von LW; S1 = Nachtrag von Rush Rhees; See comments in Transcom” to little essays about hunches and hypotheses and evidence concerning the identity of the hands.
The codes thus identified are used a suffixes on generic identifiers to indicate who did what. The codes match the regular expression (_(h|H[1-9]|S[1-9]))?(_c(h|H[1-9]|S[1-9]))*, but perhaps it's easier to explain this in writing if we specify a grammar:
suffix  ::= resp cancel* | cancel+ 
resp    ::= '_' handid
cancel  ::= '_c' handid?
handid  ::= known | unknown
known   ::= h | S[1-9]
unknown ::= H[1-9] | X
A suffix or part of a suffix matching the resp production indicates who is responsible for whatever the generic identifier is identifying: an addition, a deletion, a line in the margin, etc. A suffix or part of a suffix matching the cancel production indicates who is responsible for canceling (a) whatever was done by the hand named in the immediately preceding suffix segment, if there is one, or otherwise (b) whatever is indicated by the generic identifier.
For example: if the generic identifier xxx indicates a deletion of some kind, then
  • xxx_h would mean h performed the deletion (note that in practice, the deletion semantics may already indicate this, so this suffix may not appear with some generic identifiers)
  • xxx_h_cS1 would mean h performed the deletion and S1 canceled the deletion (restoring the text)
  • xxx_h_cS1_cS2 would mean h performed the deletion and S1 canceled the deletion and S2 canceled the cancelation (restoring the deletion and removing the text)
In practice, examination of the data confirms CH's belief that only a very restricted set of suffixes is used:
  • simple resp suffixes: _h, _hm, _H1, _H2, _H3, _H4, _H5, _s, _S1, _S1/, _S2, _X
  • simple cancel suffixes: _c, _ch, _cH5
  • a resp indicator with one cancellation: _h_c, _h_ch, _h_cH5, _H1_cH1, _S1_ch, _S1_cH5
We talked about the problem of interpreting the suffixes and considered, as an example, the gi add_S1_cS2 and eventually arrived at the following tentative discussion proposal:
(for all n : node) (for all i : item) (for all s : string)
(for all h : node) (for all sh : string) 
  (gi(n,added_S1_cS2) & contents(n,s) & element_item(n,i)
  & xpath(n,'//hands',h) & contents(h,sh)
  → (for all c : ucschar) (c in s 
    → (∃ g : graph) 
      (g in i & /* location of g ... */
      & added(g)
      & canceled(g)
      & (∃ p1 : person) (added_by(g, p1) & nldescribes(hs,p1))
      & (∃ p2 : person) (canceled_by(g,p2) & nldescribes(hs,p2)))))
where xpath/3 takes a node set as a first argument, an XPath expression as a second, and returns a node set as its third argument (we're here assuming that singleton node sets can be dealt with in the same way as single nodes).
[Note 25 October: The semantics of this are straightforward, but rather implausible, since the normal way to cancel an addition is to delete it, so that instead of an addition with the suffix _S1_cS2 we'll have an addition with _S1 and a deletion with _S2. In point of fact, none of the insertion elements seem to have any suffixes with anything other than a single term. The only generic identifiers with two-term suffixes are
  • underlining: us1_h_ch, us1_h_cH5 (us1 = ‘underlined by one straight solid line’), uw1_h_ch (‘underlined by one solid wavy line’)
  • lines in margins: clirm_h_ch (‘text with curved line in right margin’), wlirm_h_ch, wlirm_h_ch (‘text with wavy line(s) in right margin’)
  • relocated and cross-referenced text: co_h_cH5, co_h_ch (co = ‘change order, first element not part of context’)
  • deletions: d_H1_cH1, d_S1_cH5, d_S1_ch, d_h_c, d_h_cH5, d_h_ch (d = ‘deletion’)
  • overwriting: p.o_h_cH5, p.o_h_cH5, p.o_h_ch, p.o_h_ch (p.o is the polyelement code for ‘overwriting)
  • other: vdline_h_ch (perhaps a typo for vpline_h_ch?)

6. Notes on eXist

During this month, Sindre Sørensen has made progress in making Cocoon + eXist work as a potential replacement for the Mots-15 system and the XML-aware search engines.
We have encountered some problems.
  • The searches //d/parent::s[i] and //i/parent::s[d] should produce the same results as //s[d and i], but don't. They appear (at a cursory glance) to produce the same hits as each other (the counts per document are the same, and where we've inspected the hits they have been the same), but the hits include a lot of garbage, in particular results which are not s elements. The search //d/parent::*[i] appears to return the same hits as the first two problematic searches.

7. Character data and transcription proper

7.1. A question

MSM suggested we begin our attack on location questions by asking “What does it mean if I find the character sequence ‘abcde’ (uninterrupted by any markup) in the transcription?”

7.2. Some intuitions

Our initial thoughts:
  • For clarity, let's distinguish (as we have done above) between ucs-character and graph. The former occur in electronic documents, such as the MECS-WIT or XML-WIT documents we are considering here; the latter occur in manuscripts and typescripts and so on. Note that when the rubber meets the road, a ucs-character in this sense is a token, not a type: if we are considering the string abracadabra, occurring in a document, it has eleven ucs-characters, not five. When being careful we may speak about the grapheme with which a particular ucs-character (and also a particular graph) is associated, or to which it is assigned by a reader.
  • For each ucs-character in the string “abcde”, we postulate the existence of a corresponding graph in the item being transcribed; let's call them g1 through g5.
  • We postulate an order relation on the graphs. For now, we'll content ourselves with saying it contains
    • nextgraph(g1,g2)
    • nextgraph(g2,g3)
    • nextgraph(g3,g4)
    • nextgraph(g4,g5)
  • We leave for later the questions “What pairs in order relations have g1 as their second member?” and “What pairs in order relations have g5 as their first member?” If g1 is preceded, and g5 followed, by an XML tag, then the answer to these questions will depend on what those tags are, as well as on what we plan to do about the fundamental non-linearities of written matter.
  • MSM speculates (CH has stepped away for a moment) that we will end up postulating several order relations (e.g. one which includes and one which excludes things like deletions) — or more likely, some astronomically high number of potential order relations, one for each beta-text. The cheapest way to do this is presumably not to try to enumerate the orderings, but only the choice points and their interrelations.

7.3. A simple example

26 October 2003: This morning we discussed how to translate the intuitions above into conditionals, at least for the simple case.
We considered the sample document
<s>John <us1>loves</us1> Mary.</s>
The image sentences are straightforward.
node(n0001).  node(n0001).  node(n0003).  
node(n0004).  node(n0005).
travord(n0001,1).  travord(n0002,2).  travord(n0003,3).
travord(n0004,4).  travord(n0005,5).
gi(n0001,s).  gi(n0003,us1).
gi(n0002,'#pcdata').  gi(n0004,'#pcdata').  gi(n0005,'#pcdata').

first_child(n0001,n0002).  first_child(n0003,n0004).
nsib(n0002,n0003).  nsib(n0003,n0005).
parent(n0002,n0001).  parent(n0003,n0001).  
parent(n0004,n0003).  parent(n0005,n0001).

content(n0002,"John ").  content(n0004,"loves").  
content(n0005," Mary.").
From these, we want to generate first of all some existence claims for the graphs in the manuscript:
graph(g1).  graph(g2). ... graph(g16).
Also, information on what grapheme each graph is assigned to:
grapheme(g1,'J').  grapheme(g2,'o'). ... grapheme(g16'.').
And finally we want to capture the order relation within the PCDATA chunks:
gnext(g1,g2). gnext(g2,g3). gnext(g3,g4). gnext(g4,g5).
gnext(g6,g7). gnext(g7,g8). gnext(g8,g9). gnext(g9,g10).
gnext(g11,g12).  gnext(g12,g13).  gnext(g13,g14).  gnext(g14,g15).
gnext(g15,g16).
Figuring out how to arrange for gnext(g5, g6), gnext(g10, g11), gnext(X, g1), and gnext(g16, Y) is reserved for a later step.

7.4. Initial proposals

CH produced an algorithm in pseudo-code (which I cannot reproduce here because I don't have the piece of paper). MSM produced first the following paraphrase of the intuitions itemized above, and then an attempt at translating it into logic. “For every character in a TCDATA node[1], there is a graph in the manuscript with appropriate grapheme and gnext relations.”
It turns out to seem simplest to separate the inference of the existence of the graph, with its grapheme information, on the one hand, from the inference of the gnext pairs on the other.
The existence of the graph:
(for all n, p : node) (for all i : item) (for all s : string) 
(for all gid : NAME)
(for all c : ucs-character) (for all j : positive-integer)
(for all eme : grapheme)
  (gi(n,gid) & transcription_element(gid) & element_item(n, i) 
  & parent(p, n) & gi(p, '#pcdata') & contents(p, s)
  & position(s, j, c) & ucschar_grapheme(c, eme)
  → (∃ g : graph)
      (graph(g) & graph_in_item(g,i)
      & grapheme(g, eme)
      & models(c, g)
      & (for all g2 : graph) (models(c, g2) → g = g2)))
In English:
  • If a PCDATA node which is the child of a transcription element has string s as its contents, and
  • the UCS-character c occurs in that string, then
  • there is a graph g in the item being transcribed, and
  • the grapheme of g is the same as that represented by c, and
  • UCS character c models graph g, and
  • c models only graph g.
The order relation:
(for all p : node) (for all s : string)
(for all c1, c2 : ucs-character) (for all j, k : positive-integer)
(for all g1, g2 : graph)
  (gi(p, '#pcdata') & contents(p, s)
  & position(s, j, c1) & position(s, k, c2) 
  & models(c1, g1) & models(c2, g2) 
  & j + 1 = k
  → gnext(g1,g2)
In English:
  • If two characters c1 and c2 are adjacent in a PCDATA node, then
  • the graphs modeled by c1 and c2, if any, are in the gnext relation.

7.5. An attempt at a generalization

During our walk with Class 1A this afternoon, MSM came to believe that he had found the way to handle the missing elements of the gnext relation. It comes down to this:
  • A Each PCDATA node has one first ucschar and one last ucschar.
  • B Each element node has a set of first ucschars and a set of last ucschars. Let us call these sets StartChar and FinalChar.
  • C In ‘simple’ cases (those with unproblematic sequence), the StartChar set of an element is the StartChar set of its first child node, and the element's FinalChar set is the FinalChar set of the element's last child node.
  • D In some other cases, e.g. a substitution or any set of alternative readings, the StartChar set of the substitution is the union of the StartChar sets of the alternatives, and the FinalChar set is similarly the union of the FinalChar sets of the alternatives.
  • E If one alternative in a substitution is the empty string[2] then for purposes of the order relation the empty-string alternative is replaced with a string consisting of the special symbol lambda; thus both the StartChar and FinalChar sets of the substitution will contain lambda.
  • F The gnext relation is calculated with the help of an auxiliary relation gnext-aux and its powers gnextaux2, gnextaux3, gnextaux4, ....
  • G Calculating gnextaux: In simple cases, a graph g1 modeled by a ucschar in the FinalChar set of node n1 and a graph g2 modeled by a ucschar in the StartChar set of node n2 are in the relation gnextaux(g1,g2) just in case nsib(n1,n2). In more complex cases, g1 and g2 are in this relation just in case n2 canfollow n1, where the canfollow relation is one which
    • may be and often is asserted implicitly by adjacency of nodes, or
    • may be suppressed (left unasserted) even for adjacent nodes, depending on the parent of the nodes or on the identity of one of the nodes, or
    • may be asserted explicitly by markup like virtual elements or TEI next/prev markup.
  • H Calculating gnext: For any two graphs g1 and g2 not modeled by ucschars in the same PCDATA chunk,
    • If neither g1 nor g2 is a lambda, then we have gnext(g1,g2) if and only if we have gnextaux(g1,g2).
    • If we have gnextaux(g1,g2) and g2 is a lambda, then we find the g3 for which we have gnextaux(g2,g3) (i.e. we have gnextaux2(g1,g3). If g3 is also a lambda, we keep going until we find a non-lambda or reach a ucschar for which we have no gnextaux. [That is, we take the lowest non-zero power of gnextaux for which both items in the pair are non-lambda, if any such power exists.]
It would be nice to have a version of this which didn't require lambdas.

Notes

[1] TCDATA is a nonce term for ‘transcribed character data’.
[2] One way of thinking of deletions and insertions reads them as substitutions in which one alternative is the empty string; this is not the way MECS-WIT thinks of them, and whether deletions and insertions always fit into the pattern described in this proposition is a decision to be made in formalizing the meaning of a given vocabulary.