graphic with four colored squares
Cover page image (keys)

Containment and dominance in Goddag structures

C. M. Sperberg-McQueen

Claus Huitfeldt

Processing Text-Technological Resources

Bielefeld, March 2008

Introduction

Overview

Overview, take two

Goddag structures

The tripod

What do SGML and XML offer?
  • formats for serialized character streams
  • also an obvious data format (DAG with obvious single-rooted tree)
  • also a validation story
MLCD (Markup Languages for Complex Documents) defines (experimentally!)
  • TexMecs format for serialized character streams
  • Goddag structures
  • rabbit/duck grammars

Goddag

Informally, tangled trees with shared substructures.
More formally:
  • directed acyclic graphs
  • arcs = parent/child relation
  • each parent's children ordered (or unordered) separately
  • parents may share children (multiple parentage)

Overlap

Simple overlap (jj and np):
<head|
<np|Broadway <jj|Hit|np> or Miss|jj>
|head>
Self-overlap (coll):
<head|
<coll~1|Broadway <coll~2|Hit|coll~1> or Miss|coll~2>
|head>

Simple overlap

Example: Peer Gynt

FØRSTE HANDLING

[En li med løvtrær nær ved Åses gård. En elv fosser nedover. Et gammelt kvernehus på den annen side. Het sommerdag.]
[ Peer Gynt, en sterkbygget tyveårs gutt, kommer nedover gangstien. Åse, moren, liten og fin, følger etter. Hun er vred og skjeller. ]
Åse: Peer, du lyver!
Peer Gynt : [uten å stanse] Nei, jeg gjør ei!
Åse: Nå, så bann på det er sant!
Peer Gynt: Hvorfor banne?
Åse: Tvi, du tør ei!
Alt i hop er tøv og tant!

Example: Peer Gynt (in English)

ACT FIRST

SCENE FIRST
[ A wooded hillside near ÅSE's farm. A river rushes down the slope. On the further side of it an old mill shed. It is a hot day in summer.]
[ PEER GYNT, a strongly-built youth of twenty, comes down the pathway. His mother, ÅSE, a small, slightly built woman, follows him, scolding angrily.]
ÅSE: Peer, you're lying!
PEER GYNT : [without stopping] No, I'm not!
ÅSE: Well then, swear to me it's true.
PEER GYNT: Swear? why should I?
ÅSE: See, you dare not!
It's a lie from first to last.

Markup for Peer Gynt

<act|
<head|ACT FIRST|head>
<scene|
<head|SCENE FIRST|head>
<sp who="Åse"|<L|Peer, you're lying!|sp>
<sp who="Peer"|<stage|without stopping.|stage>
No, I am not!|L>|sp>
<sp who="Åse"|<L|Well then, swear that it is true!|L>|sp>
<sp who="Peer"|<L|Swear? Why should I?|sp>
<sp who="Åse"|See, you dare not!|L>
<L|It's a lie from first to last.|L>|sp> ...
|scene>
|act>

Peer Gynt

Virtual elements

HUGHIE: How did that translation go?
Da de dum de dum,
gets a new frog —
...
LOUIS: Er,
...
it's a new pond.
DEWEY: Ah ...
When the old pond
...
Right. That's it.

Virtual elements

<sp who="HUGHIE"|<p|How did that translation go?|p>
<lg type="haiku"|
<l|da de dum de dum,|l>
<l@frog|gets a new frog,|l>
<l|...|l>|lg>|sp>
<sp who="LOUIS"|<p|Er,|p>
<lg|<l|...|l><l@new|it's a new pond.|l>|lg>|sp>
<sp who="DEWEY"|<p|Ah ...|p>
<lg|<l@pond|When the old pond|l>
<l|...|l>|lg>
<p|Right. That's it.|p>|sp>
<lg|
<^l^pond>
<^l^frog>
<^l^new>
|lg>

Hughie, Louis, and Dewey

The problem (or: the symptoms)

Multiple views

Peer Gynt, verse view

Peer Gynt, dramatic view

Peer Gynt, both views

Note that this is not quite the same as the basic Goddag structure

Peer Gynt

Do the parent-child relations here make sense?

Analogue: Shared structures in XCONCUR

Does concurrent XML exhibit a similar tension?
<(d)act><(v)act>
<(d)head><(v)head>ACT FIRST</(v)head></(d)head>
<(v)scene><(d)scene>
<(d)head><(v)head>SCENE FIRST</(v)head></(d)head>
<(v)stage><(v)stage>A wooded hillside ...</(v)stage></(d)stage>
<(d)sp><(d)speaker>AASE</(d)speaker>
  <(v)l>Peer, you're lying! </(d)sp>
<(d)sp><(d)speaker>PEER GYNT </(d)speaker> 
<(d)stage><(v)stage>without stopping</(d)stage></(v)stage> 
    No, I'm not!</(v)l></(d)sp>
<(d)sp><(d)speaker>AASE</(d)speaker>
  <(v)l>Well then, swear to me it's true.</(v)l></(d)sp>
<(d)sp><(d)speaker>PEER GYNT</(d)speaker>
  <(v)l>Swear?  why should I?</(d)sp>
<(d)sp><(d)speaker>AASE</(d)speaker>
    See, you dare not!</(v)l>
  <(v)l>Every word of it's a lie.</(v)l>
</(d)sp> ...
</(v)scene></(v)act></(d)scene></(d)act>

Analogue: Shared structures in XCONCUR

Analogue: Shared substructures in colored XML

In multi-colored trees, only nodes are shared, not (necessarily) subtrees.

Analogue: Parent/child relations in fragmented XML

Fragmentation is one common technique for forcing some overlaps into a tree.
<sp><speaker>ÅSE</speaker>
  <l part="I">Peer, you're lying!</l></sp>
<sp><speaker>PEER GYNT </speaker> <stage>without stopping</stage> 
  <l part="F">No, I'm not!</l></sp>
<sp><speaker>ÅSE</speaker>
  <l>Well then, swear to me it's true.</l></sp>
<sp><speaker>PEER GYNT</speaker>
  <l part="I">Swear?  why should I?</l></sp>
<sp><speaker>ÅSE</speaker>
  <l part="F">See, you dare not!</l>
  <l>It's a lie from first to last.</l></sp>
...

Parent/child relations in fragmented XML

If parent/child relations mirror containment (credit: Claus Huitfeldt and Vemund Olstad):
 <text>
  <front>...</front>
  <body>
   <page part="I" n="9">
    <head>A CHRISTMAS CAROL.</head>
   </page>
   <div1 n="S1" id="S1" type="stave" org="uniform" sample="complete" part="N" TEIform="div1">
    <page part="M" n="9">
     <head>MARLEY'S GHOST</head>
     <p id="P-10">Marley was dead: to begin with.  There is no doubt whatever about that. ... </p>
     <p id="P-11">Mind!  I don't mean to say that I know, of my own knowledge, ...  </p>
     <p id="P-12">Scrooge knew he was dead?  Of course he did.  How could it be otherwise?  ...  </p>
    </page>
    <p id="P-13">
     <page part="F" n="9">
      The mention of Marley's funeral brings me back to the point ...
      ... his taking a stroll at night, in an easterly wind, upon his own
     </page>
     <page part="I" n="10">
      ramparts, than there  would be in any other middle-aged gentleman rashly turning out after dark ...
     </page>
    </p>
    <page part="M" n="10">
     <p id="P-14">Scrooge never painted out Old Marley's name.  There it stood, ...     </p>
     <p id="P-15">Oh!  But he was a tight-fisted hand at the grindstone, Scrooge!  ...     </p>
     <p id="P-16">External heat and cold had little influence on Scrooge. No warmth ...     </p>
     <p id="P-17">Nobody ever stopped him in the street to say, with gladsome looks, <q>My dear ...  </p>
    </page>
    <p id="P-18">
     <page part="F" n="10">But what did Scrooge care!  It was the very thing he liked.  
      To edge his way along the crowded paths of life, warning all</page>
     <page part="I" n="11"> 
      human sympathy to keep its distance, was what the knowing ones call
      <soCalled>nuts</soCalled> to Scrooge.
     </page>
    </p>
    <page part="F" n="11">
     <p id="P-19">Once upon a time ~ of all the good days in the year, on Christmas Eve ...  </p>
     <p id="P-20">The door of Scrooge's counting-house was open that he might keep his eye upon ... </p>
     <p id="P-21"><q>A merry Christmas, uncle!  God save you!</q> cried a cheerful voice.  ...  </p>
     <p id="P-22"><q part="I" id="q-4">Bah!</q> said Scrooge, <q part="F" id="q-5">Humbug!</q> </p>
     <p id="P-23">He had so heated himself with rapid walking in the fog and frost, this nephew ... </p>
     ...
    </page>
    ...
   </div1>
   ...
  </body>
 </text>

Symptoms of a problem

The patient presents the following symptoms:
  • parent-child relations inconstant
  • overall structure of scene seems chaotic
  • no explicit relations between nodes in different trees (XConcur)
    • identity
    • containment / dominance
    • (Colored XML) Same node has different substructure in different colors. But elementnode.
Is there a cure? Is there a diagnosis?

Tentative diagnosis and prescription

Points of unease

Why do we care?

Derivation is not necessarily tree-like

In rewrite systems derivation is the fundamental concept.
It doesn't necessarily give us a tree. Consider the context-sensitive grammar:
S → abc | aSQ
bQc → bbcc
cQ → Qc

Derivation graph

Containment and dominance in grammars

In context-free grammars*
  • start-symbol has no predecessor
  • every other symbol does
  • ... so we have a tree

An example

Expr → Term
Expr → Expr Add Term
Term → Factor
Term → Term Mul Factor
Factor → Number | Var | '(' Expr ')'
Add → '+' | '-'
Mul → '×' | '÷'
...

Parse tree

Trees for type-1 grammars

Context-sensitive grammars in the narrow sense (Chomsky's type-1) do produce trees, but slightly odd ones.
The grammar
S → abC | aSQ
bQC → bbCC
CQ → CX
CX → QX
QX → QC
C → c
produces:

Context-free grammars do more than recognition

Why care about parse trees? Because they show us what the expression means:
Expr → Term { V[0] := V[1]; }
Expr → Expr Add Term { if (Add.op = '+') then V[0] := V[1] + V[2] else V[0] := V[1] - V[2]; }
Term → Factor { V[0] := V[1]; }
Term → Term Mul Factor { if (Mul.op = '×') then V[0] := V[1] * V[2] else V[0] := V[1] / V[2]; }
Factor → Number | Var | '(' Expr ')'
Add → '+' | '-'
Mul → '×' | '÷'
...

Why care about parse trees?

Two words: compositional semantics.

Containment and dominance in CFGs

In CFGs
  • X contains Y
is roughly* the same as
  • X dominates Y
But need they be the same?
*But X → Y ≠ Y → X.

Containment and dominance in CFGs 2

Specifically:
  1. “X dominates Y” ⇒ “X contains Y”
  2. “X properly contains Y” ⇒ “X dominates Y”
  3. “X contains Y” ∧ “Y contains X” ⇒ “X dominates Y” ∨ “Y dominates X”

A modest proposal

Let us distinguish:
  • “X contains Y”: statement about what part of the frontier / base data is dominated by X and Y
  • “X dominates Y”: statement about grammatical relation (“Y” appears on RHS of some production for “X”)
So dominance is reserved for “meaningful” relations.*
* Was soll das schließlich bedeuten?

Modest proposal 2

So:
  1. “X dominates Y” ⇒ “X contains Y”
  2. “X properly contains Y” ∧ “X and Y in same grammar” ⇒ “X dominates Y”
  3. “X contains Y” ∧ “Y contains X” ∧ “X and Y in same grammar” ⇒ “X dominates Y” ∨ “Y dominates X”

Two roles of grammars

Grammars
  • define sets of strings
  • define sets of structures
Is “John loves Mary” a sentence?
Or is the sentence “s(np(pn(John)), vp(finite_verb(loves), np(pn(Mary))))”?

A new algorithm for generating Goddags

  1. Assign a color to each grammar.
  2. Validate with rabbit-duck grammars.
  3. For each dominance relation, introduce parent-child arc (color-code for grammar).
  4. Introduce additional non-dominance containment arcs (gray) as needed.
  5. Result has some ‘redundant’ arcs. Mark them as ‘weak’.
To see one view: filter by color.
To see containment: ignore weak arcs.
Cf. color-filtering of multi-colored trees.
Cf. dominance/containment distinction of XConcur.

Concluding unscientific postscript

  • background: Goddag structures
  • symptoms of a problem: capricious parent/child relations (etc.) in the data anxiety in the encoder
  • solution: distinguish ‘meaningful’ parent/child relations from merely empirical containment relations
The devil is in the details.