graphic with four colored squares
Cover page image (keys)

Rabbit/duck grammars: a validation method for overlapping structures

C. M. Sperberg-McQueen

Workdshop on Multidimensional Markup (Goddag workshop), Amsterdam

1 December 2008

Overview

Validation with CONCUR

A document has n trees.
Each tree has a grammar.
All trees share leaves (unless you want a roaring headache).

Example: Peer Gynt

Åse: Peer, du lyver!
Peer Gynt : [uten å stanse] Nei, jeg gjør ei!
Åse: Nå, så bann på det er sant!
Peer Gynt: Hvorfor banne?
Åse: Tvi, du tør ei!
Alt i hop er tøv og tant!
In English:
AASE: Peer, you're lying!
PEER GYNT : [without stopping] No, I'm not!
AASE: Well then, swear to me it's true.
PEER GYNT: Swear? why should I?
AASE: See, you dare not!
Every word of it's a lie.

Concurrent Peer

<(d,v)act><(v)scene><(d)scene>
<(d)sp><(d)speaker>AASE</(d)speaker>
  <(v)l>Peer, you're lying! </(d)sp>
<(d)sp><(d)speaker>PEER GYNT </(d)speaker> 
<(d,v)stage>without stopping</(d,v)stage> 
    No, I'm not!</(v)l></(d)sp>
<(d)sp><(d)speaker>AASE</(d)speaker>
  <(v)l>Well then, swear to me it's true.</(v)l></(d)sp>
<(d)sp><(d)speaker>PEER GYNT</(d)speaker>
  <(v)l>Swear?  why should I?</(d)sp>
<(d)sp><(d)speaker>AASE</(d)speaker>
    See, you dare not!</(v)l>
  <(v)l>Every word of it's a lie.</(v)l>
</(d)sp> ...
</(d,v)scene></act>

Desiderata

Some things we would like to do, but can't do using CONCUR:
  • insist that act and scene be the same in verse and drama trees.
  • forbid line boundaries in speaker attributions.
  • make speaker or note invisible to verse hierarchy.

Goals

Can we get The Right Thing?
Maybe. In the meantime we want something
  • powerful enough to be useful,
  • but not too powerful;
  • clear*;
  • simple*;
  • ambidextrous.

Rabbit/duck grammars

Related work

Rabbit/duck grammars

A simple extension to CONCUR. A set of rabbit/duck grammars is:
  • a set of grammars
  • intended for use together
  • whose intersection defines the language to be accepted.

The class system

In each grammar, each element is one* of:
  1. a normal element
  2. a milestone element
  3. a transparent element
  4. an invisible or opaque element

The Gestalt switch

Rabbit/duck grammars thus perform a gestalt switch on tags; named for:

Content-model extensions

Elementary class consciousness

An element foo is
  • first-class if defined and mentioned normally (foo)
  • second-class if undefined and mentioned as (#tag(foo))
  • ... or declared MILESTONES
  • third-class if undefined and unmentioned
  • ... or declared IGNORE-TAGS
  • fourth-class if declared IGNORE

Peer Gynt dramatic view

<!--* Grammar D *-->
<!GRAMMAR play [
<!ELEMENT play  (act+) >
<!ELEMENT act   (scene+)
<!ELEMENT scene (sp | stage | #tag(L))* >
<!ELEMENT sp    (#PCDATA | stage | #tag(L))* >
<!ATTLIST sp    
          who CDATA #REQUIRED >
<!ELEMENT stage (#PCDATA) >
]>

Peer Gynt verse view

<!--* Grammar V *-->
<!GRAMMAR play [
<!ELEMENT play  (act+) >
<!ELEMENT act   (scene+)
<!ELEMENT scene (L | stage | speaker | #tag(sp) )* >
<!ELEMENT L     (#PCDATA | stage | speaker | #tag(sp) )* >
<!ELEMENT stage (#PCDATA) >
<!ELEMENT speaker (#PCDATA) >
]>

Losing the stage directions

<!--* Grammar V *-->
<!GRAMMAR play [
<!ELEMENT play  (act+) >
<!ELEMENT act   (scene+)
<!ELEMENT scene (L | #tag(sp) )* >
<!ELEMENT L     (#PCDATA | #tag(sp) )* >
<!ELEMENT stage IGNORE >
<!ELEMENT speaker IGNORE >
]>

Parsing: the problem

The Goddag structure

Parsing the verse hierarchy

Parsing the drama hierarchy

Parsing both

Two general conditions

The class system revisited

The class system ensures that we can:
  1. re-create DTDs and conventional schemas
  2. constrain interrelations among grammars (within limits)
  3. allow unconstrained co-existence (redundant); cf.
    • CONCUR (all elements 1st- or 3d-class)
    • HTML (ignore tags you don't understand)
  4. allow grammars to play ostrich: hide your head in the sand when you see something you don't understand

Two forms of self-overlap

Handling parallel hierarchies

Hard-core overlap

Soft-core

<head>
  <np>
   Broadway
  <jj>
    Hit
  </np>
    or
    Miss
  </jj>
</head>

Parse the noun phrase

<head>
  <np>
    Broadway
    <jj sID="j123"/>
    Hit
  </np>
  or
  Miss
  <jj eID="j123"/>
</head>

Parse the adjective

<head>
  <np sID="np23"/>
  Broadway
  <jj>
    Hit
    <np eID="np23">
    or
    Miss
  </jj>
</head>

Hard-core (Trojan horses)

<head>
 <coll sID="np23"/>
 Broadway
 <coll sID="jj123"/>
 Hit
 <coll eID="np23"/>
 or
 Miss
 <coll eID="jj123"/>
</head>

Parse left, parse right

<head>                          <head>
  <coll>                          <coll sID="np23"/>
    Broadway                      Broadway
    <coll sID="j123"/>            <coll>
    Hit                             Hit
  </coll>                           <coll eID="np23">
  or                                or
  Miss                              Miss
  <coll eID="j123"/>              </coll>
</head>                         </head>

Generalizing

If there are n overlapping instances of element foo, then
  • We want n parses / readings of (this bit of) the document:
  • one for each element instance.
  • In the other readings, each instance is treated as second- (or third-) class.

Self-overlap in content models

<!GRAMMAR head [
<!ELEMENT head #overlap(
              (#PCDATA | coll | #tag(coll))*
              ) >
<!ELEMENT coll (#PCDATA | #tag(coll))*) >
]>

Implementation

Easier than you might think:
  • XML projection (not for self-overlap):
    1. Translate each grammar into XML.
    2. Translate instance into n XML documents.
    3. Validate the n derivative instances.
  • ad hoc parsers (lexical scanner knows about element classes)
  • single-pass using parallel parsing
  • single-pass using Brzozowski derivatives

Applications

Example: the valid contract

Imagine a language for contracts:
  • normal document structure
  • variable location for seller, buyer, jurisdiction, etc.
  • one buyer, one seller, one jurisdiction, one set of dates, ...

Conventional structure

<!GRAMMAR text [
<!ENTITY % hypertext "a" >
<!ENTITY % hilites "it | bd | tt" >
<!ENTITY % buyer "individual-buyer | corp-buyer 
  | buyer-PIN | buyer-VATnum | buyer-address" >
<!ENTITY % seller "individual-seller | corp-seller 
  | seller-PIN | seller-VATnum | seller-address" >
<!ENTITY % misc "start-date | end-date | amount
  | description-of-goods | jurisdiction" >

<!ELEMENT text (head, p*, div*) >
<!ELEMENT div  (head?, p*, div*) >
<!ELEMENT p    (#PCDATA | %buyer; | %seller; 
  | %hypertext; | %hilites; | %misc;)* >
<!ELEMENT head (#PCDATA | %buyer; | %seller; 
  | %hypertext; | %hilites; | %misc;)* >
<!ELEMENT a    (#PCDATA | %hilites;)* >
<!ELEMENT it   (#PCDATA | %hilites;)* >
<!ELEMENT bd   (#PCDATA | %hilites;)* >
<!ELEMENT tt   (#PCDATA | %hilites;)* >

<!ELEMENT individual-buyer  (#PCDATA | %hilites;)* >
<!ELEMENT corp-buyer        (#PCDATA | %hilites;)* >
<!ELEMENT buyer-PIN         (#PCDATA | %hilites;)* >
<!ELEMENT buyer-VATnum      (#PCDATA | %hilites;)* >
<!ELEMENT buyer-address     (#PCDATA | %hilites;)* >
<!ELEMENT individual-seller (#PCDATA | %hilites;)* >
<!ELEMENT corp-seller       (#PCDATA | %hilites;)* >
<!ELEMENT seller-PIN        (#PCDATA | %hilites;)* >
<!ELEMENT seller-VATnum     (#PCDATA | %hilites;)* >
<!ELEMENT seller-address    (#PCDATA | %hilites;)* >
<!ELEMENT start-date        (#PCDATA | %hilites;)* >
<!ELEMENT end-date          (#PCDATA | %hilites;)* >
<!ELEMENT amount            (#PCDATA | %hilites;)* >
<!ELEMENT description-of-goods (#PCDATA | %hilites;)* >
<!ELEMENT jurisdiction      (#PCDATA | %hilites;)* >
]>

One-buyer, one-seller constraints

<!ELEMENT doc  (#PCDATA 
               & ( (individual-buyer & buyer-PIN) 
                 | (corp-buyer & buyer-VATnum)
                 )
               & ( (individual-seller & seller-PIN) 
                 | (corp-seller & seller-VATnum)
                 )
               & buyer-address
               & seller-address
               & start-date
               & end-date
               & amount
               & description-of-goods
               & jurisdiction) >
<!--* N.B. interleave interpretation of & needed here, 
    * not the SGML interpretation *-->

Evaluation

How are we doing? Rabbit/duck grammars are:
powerful enough to be useful,
but not too powerful;
clear*;
simple*;
ambidextrous.

Follow-on work