Containment and dominance in Goddag structures
C. M. Sperberg-McQueen
Claus Huitfeldt
Processing Text-Technological Resources
Bielefeld, March 2008
Introduction
Overview
- background
- problem? well, maybe
- solution? probably not — speculation on where a solution might be, instead
Overview, take two
- Goddag structures
- symptoms of a problem
- multiple views over the same data
- analogous problems: XConcur, colored XML, fragmentation
- towards an analysis of the problem
- why does it matter?
- a hint from grammar (digression on containment and dominance in formal grammars)
- containment ≠ dominance?
- some implications
Goddag structures
- What are Goddags? (Review)
- description
- simple example
- example: verse drama
- example: multiple orders
The tripod
What do SGML and XML offer?
- formats for serialized character streams
- also an obvious data format (DAG with obvious single-rooted tree)
- also a validation story
MLCD (Markup Languages for Complex Documents) defines (experimentally!)
- TexMecs format for serialized character streams
- Goddag structures
- rabbit/duck grammars
Goddag
Informally, tangled trees with shared substructures.
More formally:
- directed acyclic graphs
- arcs = parent/child relation
- each parent's children ordered (or unordered) separately
- parents may share children (multiple parentage)
Overlap
Simple overlap (
jj and
np):
<head|<np|Broadway <jj|Hit|np> or Miss|jj>
|head>
Self-overlap (
coll):
<head|<coll~1|Broadway <coll~2|Hit|coll~1> or Miss|coll~2>
|head>
Example: Peer Gynt
FØRSTE HANDLING
[
En li med løvtrær nær ved Åses gård.
En elv fosser nedover.
Et gammelt kvernehus på den annen side. Het sommerdag.]
[
Peer Gynt, en sterkbygget tyveårs gutt, kommer nedover gangstien.
Åse, moren, liten og fin, følger etter.
Hun er vred og skjeller.
]
Åse: Peer, du lyver!
Peer Gynt : [uten å stanse]
Nei, jeg gjør ei!
Åse: Nå, så bann på det er sant!
Peer Gynt: Hvorfor banne?
Åse: Tvi, du tør ei!
Alt i hop er tøv og tant!
Example: Peer Gynt (in English)
ACT FIRST
SCENE FIRST
[
A wooded hillside near ÅSE's farm. A river rushes down the slope.
On the further side of it an old mill shed. It is a hot day in
summer.]
[
PEER GYNT, a strongly-built youth of twenty, comes down the
pathway. His mother, ÅSE, a small, slightly built woman, follows
him, scolding angrily.]
ÅSE: Peer, you're lying!
PEER GYNT : [without stopping] No,
I'm not!
ÅSE: Well then, swear to me it's true.
PEER GYNT: Swear? why should I?
ÅSE: See, you dare not!
It's a lie from first to last.
Markup for Peer Gynt
<act|
<head|ACT FIRST|head>
<scene|
<head|SCENE FIRST|head>
<sp who="Åse"|<L|Peer, you're lying!|sp>
<sp who="Peer"|<stage|without stopping.
|stage>
<sp who="Åse"|<L|Well then,
swear that it is true!|L>|sp>
<sp who="Peer"|<L|Swear? Why should I?|sp>
<sp who="Åse"|See, you dare not!|L>
<L|It's a lie from first to last.|L>|sp>
...
|scene>
|act>
Virtual elements
HUGHIE:
How did that translation go?
Da de dum de dum,
gets a new frog —
...
LOUIS:
Er,
...
it's a new pond.
DEWEY:
Ah ...
When the old pond
...
Right. That's it.
Virtual elements
<sp who="HUGHIE"|<p|How did that translation go?|p>
<lg type="haiku"|<l|da de dum de dum,|l>
<l@frog|gets a new frog,|l>
<l|...|l>|lg>|sp>
<sp who="LOUIS"|<p|Er,|p>
<lg|<l|...|l><l@new|it's a new pond.|l>|lg>|sp>
<sp who="DEWEY"|<p|Ah ...|p>
<lg|<l@pond|When the old pond|l>
<l|...|l>|lg>
<p|Right. That's it.|p>|sp>
<lg|<^l^pond>
<^l^frog>
<^l^new>
|lg>
The problem (or: the symptoms)
- multiple views
- analogues:
- concurrent markup
- colored XML
- fragmentation
Multiple views
- what do we wish to do with multiple views?
- co-existence
- identity of shared nodes
- convenient filtering
- interrelation (e.g. enjambement)
Peer Gynt, both views
Note that this is not quite the same as the basic
Goddag structure
Peer Gynt
Do the parent-child relations here make sense?
Analogue: Shared structures in XCONCUR
Does concurrent XML exhibit a similar tension?
<(d)act><(v)act>
<(d)head><(v)head>ACT FIRST</(v)head></(d)head>
<(v)scene><(d)scene>
<(d)head><(v)head>SCENE FIRST</(v)head></(d)head>
<(v)stage><(v)stage>A wooded hillside ...</(v)stage></(d)stage>
<(d)sp><(d)speaker>AASE</(d)speaker>
<(v)l>Peer, you're lying! </(d)sp>
<(d)sp><(d)speaker>PEER GYNT </(d)speaker>
<(d)stage><(v)stage>without stopping</(d)stage></(v)stage>
No, I'm not!</(v)l></(d)sp>
<(d)sp><(d)speaker>AASE</(d)speaker>
<(v)l>Well then, swear to me it's true.</(v)l></(d)sp>
<(d)sp><(d)speaker>PEER GYNT</(d)speaker>
<(v)l>Swear? why should I?</(d)sp>
<(d)sp><(d)speaker>AASE</(d)speaker>
See, you dare not!</(v)l>
<(v)l>Every word of it's a lie.</(v)l>
</(d)sp> ...
</(v)scene></(v)act></(d)scene></(d)act>
Analogue: Shared structures in XCONCUR
Analogue: Shared substructures in colored XML
In multi-colored trees, only nodes are shared, not (necessarily)
subtrees.
Analogue: Parent/child relations in fragmented XML
Fragmentation is one common technique for forcing some overlaps
into a tree.
<sp><speaker>ÅSE</speaker>
<l part="I">Peer, you're lying!</l></sp>
<sp><speaker>PEER GYNT </speaker> <stage>without stopping</stage>
<l part="F">No, I'm not!</l></sp>
<sp><speaker>ÅSE</speaker>
<l>Well then, swear to me it's true.</l></sp>
<sp><speaker>PEER GYNT</speaker>
<l part="I">Swear? why should I?</l></sp>
<sp><speaker>ÅSE</speaker>
<l part="F">See, you dare not!</l>
<l>It's a lie from first to last.</l></sp>
...
Parent/child relations in fragmented XML
If parent/child relations mirror containment (
credit: Claus
Huitfeldt and Vemund Olstad):
<text>
<front>...</front>
<body>
<page part="I" n="9">
<head>A CHRISTMAS CAROL.</head>
</page>
<div1 n="S1" id="S1" type="stave" org="uniform" sample="complete" part="N" TEIform="div1">
<page part="M" n="9">
<head>MARLEY'S GHOST</head>
<p id="P-10">Marley was dead: to begin with. There is no doubt whatever about that. ... </p>
<p id="P-11">Mind! I don't mean to say that I know, of my own knowledge, ... </p>
<p id="P-12">Scrooge knew he was dead? Of course he did. How could it be otherwise? ... </p>
</page>
<p id="P-13">
<page part="F" n="9">
The mention of Marley's funeral brings me back to the point ...
... his taking a stroll at night, in an easterly wind, upon his own
</page>
<page part="I" n="10">
ramparts, than there would be in any other middle-aged gentleman rashly turning out after dark ...
</page>
</p>
<page part="M" n="10">
<p id="P-14">Scrooge never painted out Old Marley's name. There it stood, ... </p>
<p id="P-15">Oh! But he was a tight-fisted hand at the grindstone, Scrooge! ... </p>
<p id="P-16">External heat and cold had little influence on Scrooge. No warmth ... </p>
<p id="P-17">Nobody ever stopped him in the street to say, with gladsome looks, <q>My dear ... </p>
</page>
<p id="P-18">
<page part="F" n="10">But what did Scrooge care! It was the very thing he liked.
To edge his way along the crowded paths of life, warning all</page>
<page part="I" n="11">
human sympathy to keep its distance, was what the knowing ones call
<soCalled>nuts</soCalled> to Scrooge.
</page>
</p>
<page part="F" n="11">
<p id="P-19">Once upon a time ~ of all the good days in the year, on Christmas Eve ... </p>
<p id="P-20">The door of Scrooge's counting-house was open that he might keep his eye upon ... </p>
<p id="P-21"><q>A merry Christmas, uncle! God save you!</q> cried a cheerful voice. ... </p>
<p id="P-22"><q part="I" id="q-4">Bah!</q> said Scrooge, <q part="F" id="q-5">Humbug!</q> </p>
<p id="P-23">He had so heated himself with rapid walking in the fog and frost, this nephew ... </p>
...
</page>
...
</div1>
...
</body>
</text>
Symptoms of a problem
The patient presents the following symptoms:
- parent-child relations inconstant
- overall structure of scene seems chaotic
- no explicit relations between nodes in different trees (XConcur)
- identity
- containment / dominance
- (Colored XML) Same node has different substructure in different colors.
But element ≠ node.
Is there a cure? Is there a diagnosis?
Tentative diagnosis and prescription
- So ... what's the diagnosis?
- Is there a problem at all?
- No, just get used to it.
- Yes, if you can only find out what it is.
- Why do we care?
- Digression into grammars
- Syntax and semantics
- Stop assuming that containment = dominance
Points of unease
- we want 1:1 relation between markup units (XML elements) and
conceptual units (‘logical elements’),
don't always get it
- parent/child relations fluctuate
- what is a scene? a div?
- A a sequence of SP (or STAGE) elements
- B a sequence of L (or STAGE) elements
- C a sequence of SP or L or STAGE elements
Why do we care?
- structural units (conceptual, markup) are tools of thought
- fluctuation in parent/child relations
makes navigation / querying / processing hard
- grammar rules capture (or crystallize) knowledge
- grammar is used for validation
Derivation is not necessarily tree-like
In rewrite systems derivation is the
fundamental concept.
It doesn't necessarily give us a tree. Consider the context-sensitive grammar:
S → abc | aSQ
bQc → bbcc
cQ → Qc
Containment and dominance in grammars
In context-free grammars*
- start-symbol has no predecessor
- every other symbol does
- ... so we have a tree
An example
Expr → Term
Expr → Expr Add Term
Term → Factor
Term → Term Mul Factor
Factor → Number | Var | '(' Expr ')'
Add → '+' | '-'
Mul → '×' | '÷'
...
Trees for type-1 grammars
Context-sensitive grammars in the narrow sense (Chomsky's type-1)
do produce trees, but slightly odd ones.
The grammar
S → abC | aSQ
bQC → bbCC
CQ → CX
CX → QX
QX → QC
C → c
produces:
Context-free grammars do more than recognition
Why care about parse trees?
Because they show us what the expression means:
Expr → Term { V[0] := V[1]; }
Expr → Expr Add Term { if (Add.op = '+') then V[0] := V[1] + V[2] else V[0] := V[1] - V[2]; }
Term → Factor { V[0] := V[1]; }
Term → Term Mul Factor { if (Mul.op = '×') then V[0] := V[1] * V[2] else V[0] := V[1] / V[2]; }
Factor → Number | Var | '(' Expr ')'
Add → '+' | '-'
Mul → '×' | '÷'
...
Why care about parse trees?
Two words: compositional semantics.
Containment and dominance in CFGs
In CFGs
is roughly* the same as
But need they be the same?
*But X → Y ≠ Y → X.
Containment and dominance in CFGs 2
Specifically:
- “X dominates Y” ⇒ “X contains Y”
- “X properly contains Y” ⇒ “X dominates Y”
- “X contains Y” ∧ “Y contains X” ⇒ “X dominates Y” ∨ “Y dominates X”
A modest proposal
Let us distinguish:
- “X contains Y”: statement about what part of the frontier / base data is dominated by X and Y
- “X dominates Y”: statement about grammatical relation (“Y” appears on RHS of
some production for “X”)
So
dominance is reserved for
“meaningful” relations.
** Was soll das schließlich bedeuten?
Modest proposal 2
So:
- “X dominates Y” ⇒ “X contains Y”
- “X properly contains Y” ∧ “X and Y in same grammar” ⇒ “X dominates Y”
- “X contains Y” ∧ “Y contains X” ∧ “X and Y in same grammar”
⇒ “X dominates Y” ∨ “Y dominates X”
Two roles of grammars
Grammars
- define sets of strings
- define sets of structures
Is “John loves Mary” a sentence?
Or is the sentence
“s(np(pn(John)), vp(finite_verb(loves), np(pn(Mary))))”?
A new algorithm for generating Goddags
- Assign a color to each grammar.
- Validate with rabbit-duck grammars.
- For each dominance relation, introduce parent-child arc (color-code for grammar).
- Introduce additional non-dominance containment arcs (gray) as needed.
- Result has some ‘redundant’ arcs. Mark them as ‘weak’.
To see one view: filter by color.
To see containment: ignore weak arcs.
Cf. color-filtering of multi-colored trees.
Cf. dominance/containment distinction of XConcur.
Concluding unscientific postscript
- background: Goddag structures
- symptoms of a problem:
capricious parent/child relations (etc.) in the data
anxiety in the encoder
- solution: distinguish ‘meaningful’ parent/child relations
from merely empirical containment relations
The devil is in the details.