Applications of Brzozowski derivatives to XML Schema processing

C. M. Sperberg-McQueen

Extreme Markup Languages 2005

Montréal, 3 August 2005

TOC | First


Overview

previous table of contents next
I of III

Definition

previous table of contents next
1 of 3 [40]
Source: Janusz A. Brzozowski, “Derivatives of regular expressions,” Journal of the ACM 11.4 (1964): 481-494.
Definition:
Given a set R of sequences and some finite sequence s, the derivative of R with respect to s is denoted by DsR and is DsR = {t | stR}.
In words, the set of strings t which can follow s in sentences of R.
N.B. If R is regular, DsR must be regular.

Brzozowski's contributions

previous table of contents next
2 of 3 [40]
Brzozowski showed:
  • how to construct a regex for DsR
  • any regular R has only finite number of unequal characteristic derivatives
  • ... and a finite number of dissimilar derivatives
  • how to construct FSA from regex

What use?

previous table of contents next
3 of 3 [40]
Most important, simple and beautiful.
Previous work, but not widely understood.

Related work

previous table of contents next
On derivatives and markup:
  • [Wilmott] [1991]
  • Brüggemann-Klein / Wood (1993, 1998, ...)
  • English 1999
  • Clark 2002
  • Kawaguchi 2003
Use of derivatives in regex library:
  • Hopkins 1994
On XSD:
  • Thompson/Tobin 2003
  • Fuchs/Brown 2003

Preliminaries

previous table of contents next
II of III

Content model notation

previous table of contents next
1 of 19 [40]

XML Schema equivalents

previous table of contents next
2 of 19 [40]
  RE XSD 1.0
empty sequence ε <xsd:sequence/>
empty set <xsd:choice/>
child element expanded name <xsd:element ref = "QName"/>
<xsd:element name = "NCName"> ... </xsd:element>
wildcard wc(KW, NS) <xsd:any namespace = "NS" processContents = "KW"/>
neg. wildcard wc(KW, not(NS)) <xsd:any namespace = "##other" processContents = "KW"/>

XML Schema equivalents (2)

previous table of contents next
3 of 19 [40]
  RE XSD 1.0
disjunction F | G <xsd:choice> F G </xsd:choice>
concatenation F, G <xsd:sequence> F G </xsd:sequence>
all-group F & G <xsd:all> F G </xsd:all>
repetition F{n, m} <F minOccurs="n" maxOccurs="m" />

Nullability

previous table of contents next
4 of 19 [40]
Define ν(E) ≡ (ε ∈ L(E)).
E ν(E)
ε true
false
expanded name / QName false
wc(KW, NS), wc(KW, not(NS)) false
F | G ν(F) ∨ ν(G)
F, G ν(F) ∧ ν(G)
F & G ν(F) ∧ ν(G)
F{n, m} n = 0 ∨ ν(F)

Brzozowski nullability / Delta

previous table of contents next
Define δ(E) ≡ ε iff ν(E), else ∅

Constructing derivatives

previous table of contents next
5 of 19 [40]
First, define DxE for a single symbol x.
Then DsE involves taking single-symbol derivative, one symbol at a time.
If s = < x1, x2, x3, ..., xn >,
then DsE = Dxn (... (Dx3 (Dx2 (Dx1E)))...)

Derivatives of ε and ∅

previous table of contents next
6 of 19 [40]
E DxE
ε
If E = ε, no string can follow x and make a member of {ε}.
E DxE
If E = ∅, no string can follow x and make a member of ∅.

Derivatives of QNames

previous table of contents next
7 of 19 [40]
E DxE
expanded name / QName e if x = e, then ε, else ∅
If E = a, then
  • ε can follow a and make a.
  • Nothing can follow b and make a member of {a}.

Derivatives of wildcards

previous table of contents next
8 of 19 [40]
E DxE
wc(KW, NS) if x ∈ namespace NNS, then ε, else ∅
wc(KW, not(NS)) if x ∈ namespace NNS, then ∅, else ε
Same logic as for QNames.

Derivatives of choices

previous table of contents next
9 of 19 [40]
E DxE
F | G (DxF) | (DxG)
For example, if E = (x | y), then
DxE = Dxx | Dxy
   = ε | ∅
   = ε

Derivatives of sequences

previous table of contents next
10 of 19 [40]
E DxE
F, G if ν(F)
then ((DxF), G) | (DxG),
else ((DxF), G)
For example, if E = (x, y), then
DxE = ((Dxx), y)
   = (ε, y)
   = y

Sequence examples

previous table of contents next
11 of 19 [40]
Or if E = (x, y) | (y, z, w), then
DxE = ((Dxx), y) | ((Dxy), z, w)
   = (ε, y) | (∅, z, w)
   = (y) | (∅)
   = y
N.B. The expression need not be deterministic. If E = (x, y) | (x, z), then
DxE = ((Dxx), y) | ((Dxx), z)
   = (ε, y) | (ε, z)
   = (y) | (z)
   = y | z

Derivatives of all-groups

previous table of contents next
12 of 19 [40]
E DxE
F & G ((DxF) & G) | (F & (DxG))
For example, if E = (y & (x & z)), then
DxE = ((Dxy) & x & z) | (y & Dx(x & z))
   = (∅ & x & z) | (y & ((Dx(x) & z) | (x & Dx(z))))
   = (∅) | (y & (((ε) & z) | (x & (∅))))
   = (y & (((ε) & z) | (∅)))
   = (y & z)

Derivatives of repetitions

previous table of contents next
13 of 19 [40]
E DxE
F{n, m} if n = 0 ∧ m = unbounded, then
(DxF), F{0, unbounded}
else if not ν(F), then
(DxF), F{--n, --m}
else (when ν(F) and EF*)
(DxF), F{--n, --m} | DxF{--n, --m}
where
  • --0 = 0
  • --unbounded = unbounded
  • --n = n - 1, for n > 0

Repetition examples

previous table of contents next
14 of 19 [40]
For example, if E = (x, y){2, 4}, then
DxE = ((Dx(x, y)), (x, y){1, 3})
   = (y, (x, y){1, 3})
Or, if E = (x, y){0, unbounded}, then
DxE = ((Dx(x, y)), (x, y){0, unbounded})
   = (y, (x, y){0, unbounded})

A nullable repetition example

previous table of contents next
If E = ((x, y)?, z?){2, 4}, then
DxE= ((Dx((x,y)?,z?)), ((x,y)?,z?){1,3})
       | (Dx((x,y)?,z?){1,3})
   = (y?,z?,((x,y)?,z?){1,3})
       | (y?,z?,((x,y)?,z?){0,2})
       | (Dx((x,y)?,z?){0,2})
   = (y?,z?,((x,y)?,z?){1,3})
       | (y?,z?,((x,y)?,z?){0,2})
       | (Dx((x,y)?,z?), ((x,y)?,z?){0,1}) | (Dx((x,y)?,z?){0,1})
   = (y?,z?,((x,y)?,z?){1,3})
       | (y?,z?,((x,y)?,z?){0,2})
       | (y?,z?,((x,y)?,z?){0,1})
       | (Dx((x,y)?,z?){0,1})

Nullable repetition (conclusion)

previous table of contents next
   = (y?,z?,((x,y)?,z?){1,3})
       | (y?,z?,((x,y)?,z?){0,2})
       | (y?,z?,((x,y)?,z?){0,1})
       | (Dx((x,y)?,z?), ((x,y)?,z?){0,0})
       | (Dx((x,y)?,z?){0,0})
   = (y?,z?,((x,y)?,z?){1,3})
       | (y?,z?,((x,y)?,z?){0,2})
       | ((y?,z?),((x,y)?,z?){0,1})
       | ((y?,z?), ε)
       | (Dxε)
   = (y?,z?,((x,y)?,z?){1,3})
       | (y?,z?,((x,y)?,z?){0,2})
       | (y?,z?,((x,y)?,z?){0,1})
       | (y?,z?)

Simplifying expressions

previous table of contents next
15 of 19 [40]
We've already made some use of these:
  • E | E = E
  • E | F = F | E
  • (E | F) | G = E | (F | G)
  • ε, E = E, ε = E
  • ε & E = E & ε = E
  • ∅ | E = E | ∅ = E
  • ∅, E = E, ∅ = ∅
  • ∅ & E = E & ∅ = ∅

Characteristic derivatives

previous table of contents next
16 of 19 [40]
Two regexes are similar if one can be transformed into the other using
  • E | E = E
  • E | F = F | E
  • (E | F) | G = E | (F | G)

Finding all derivatives

previous table of contents next
17 of 19 [40]
  1. Let counter n = 0, set Derivatives = {}, flag alldone = true.
  2. For each string s ∈ Σ* with length n,
    1. Find F = DsE.
    2. If FDerivatives, then do nothing.
    3. Else add F to Derivatives and set alldone flag = false.
    4. If alldone = false, then increment n, set alldone flag = true, and go to 2.
    5. Else stop.
The characteristic derivatives are those in the set Derivatives.

Regex to FSA

previous table of contents next
18 of 19 [40]
  1. Find the characteristic derivatives.
  2. Make one state per characteristic derivative.
  3. Connect two states F and G with an arc labeled x if and only if G = DxF.

FSA example

previous table of contents next
19 of 19 [40]
If E = (a, b, c+), then characteristic derivatives / states are
  1. DεE = (a, b, c+)
  2. DaE = (b, c+)
  3. DabE = c+
  4. DabcE = DabccE = DabcccE = ... = c*
  5. For all other s, DsE = ∅

Schema processing

previous table of contents next
III of III

Validation

previous table of contents next
1 of 18 [40]
For all regular expressions E and all strings s,
“Is sL(E)?”
“Is DsE nullable?”

Validation example

previous table of contents next
2 of 18 [40]
If E = (h+, ((p+, s*) | (p*, s+)), t?), is the sequence h h p p p p p s valid?
  • DhE = h*, ((p+, s*) | (p*, s+)), t?
  • Dh hE = h*, ((p+, s*) | (p*, s+)), t?
  • Dh h pE = ((p*, s*) | (p* s+)), t?
  • ...
  • Dh h p p p p pE = ((p*, s*) | (p* s+)), t?
  • Dh h p p p p p sE = (s* | s*), t?
Is (s* | s*), t? nullable?
Yes: sequence is valid.

EDI example

previous table of contents next
3 of 18 [40]
If E = ((h?, i{1,9999}){1,9999}), is a sequence of 437 item elements valid?
FSA construction explodes.
Derivative grows, but not explosively.
  • DiE = i{0,9998}, (h?, i{1,9999}){0,9998}
  • DiiE = (i{0,9997}, (h?, i{1,9999}){0,9998})
           | (i{0,9998}, (h?, i{1,9999}){0,9997})
  • DiiE = (i{0,9996}, (h?, i{1,9999}){0,9998})
           | (i{0,9997}, (h?, i{1,9999}){0,9997})
           | (i{0,9998}, (h?, i{1,9999}){0,9996})
  • ...

Checking determinism

previous table of contents next
4 of 18 [40]

Initial determinism

previous table of contents next
5 of 18 [40]
How many ways can initial x match in E?
E c(x,E)
ε 0
0
expanded name e if x = e then 1 else 0
wc(KW, NS) if xNNS then 1 else 0
wc(KW, not(NS)) if xNNS then 0 else 1
F | G c(x, F) + c(x, G)
F, G if ν(F) then c(x, F) + c(x, G), else c(x, F)
F & G c(x, F) + c(x, G)
F{n, m} c(x, F)

Determinism reformulated

previous table of contents next
6 of 18 [40]
SGML unambiguity = XML determinism = XSD unique particle attribution:
For all symbols x ∈ Σ,
and for all characteristic derivatives F of the original content model,
c(x, F) ≤ 1.

Simplified determinism check

previous table of contents next
7 of 18 [40]
Before checking for determinism, simplify expression E to E′:
  • If E = ε or ∅ or an expanded name or wc(KW,NS) or wc(KW,not(NS)), then E′ = E.
  • If E = F | G, then E′ = F′ | G′.
  • If E = F, G, then E′ = F′, G′.
  • If E = F{n, m}, then E′ = F′{j,k}, where
    • F′ is the recursive simplification of F.
    • If n = m > 1, then j = k = 2.
    • Else j = min(1,n), k = min(2,m), where for all natural numbers n, min(n, unbounded) = n

Determinism example

previous table of contents next
8 of 18 [40]
For example, if E = (a{2,4}, a), then characteristic derivatives are:
  • DεE = (a{2,4}, a), c(a, DεE) = 1
  • DaE = (a{1,3}, a), c(a, DaE) = 1
  • DaaE = (a{0,2}, a), c(a, DaaE) = 2
  • DaaaE = ((a?, a) | ε), c(a, DaaaE) = 2
  • DaaaaE = (a | ε), c(a, DaaaaE) = 1
The equivalent E′ is (a{1,2}, a); the characteristic derivatives are:
  • DεE′ = (a{1,2}, a), c(a, DεE′) = 1
  • DaE′ = (a{0,1}, a), c(a, DaE′) = 2
  • DaaE′ = (a | ε), c(a, DaaE′) = 1
E′ is not deterministic (and therefore neither is E).

Cost of determinism check

previous table of contents next
9 of 18 [40]
Roughly equivalent* to
  • constructing Glushkov automaton
  • checking its determinism
* at worst, the same cost

Weakened wildcards

previous table of contents next
10 of 18 [40]

all-groups

previous table of contents next
11 of 18 [40]

all-group examples

previous table of contents next
12 of 18 [40]
If E = (a, b+) & ((c* | d+), e), then interleave rule accepts content a b e b:
  1. Da E = b+ & ((c* | d*), e)
  2. Dab E = b* & ((c* | d*), e)
  3. Dabe E = (b* | ∅) = b*
  4. Dabeb E = b*
while non-interleave rule rejects it:
  1. Da E = b+, ((c* | d*), e)
  2. Dab E = b*, ((c* | d*), e)
  3. Dabe E = (ε | ∅) = ε
  4. Dabeb E = (∅ | ∅) = ∅

Content model subsumption

previous table of contents next
13 of 18 [40]
F subsumes G iff GF.
Informally, iff every element valid against content model of G is valid against content model of F.

Checking subsumption

previous table of contents next
14 of 18 [40]
First, extend syntax:
  • F and G
  • ~F
Next, define derivative w.r.t. x:
  • Ds (F and G) = (Ds F) and (Ds G)
  • Ds (~F) = ~(Ds F)

The subsumption problem

previous table of contents next
15 of 18 [40]
F subsumes G
GF
≡ L(G and ~F) = ∅
≡ no characteristic derivative of (G and ~F) is nullable.

Subsumption example

previous table of contents next
16 of 18 [40]
Let
  • F = (a?, (b?, c*)+, a?)
  • G = (b | c)*, a)
Prove GF.

Solution

previous table of contents next
17 of 18 [40]
Derivatives of E = (F and ~G) = (((b | c)*, a) and ~(a?, (b?, c*)+, a?)) are:
  • DεE = (((b | c)*, a) and ~(a?, (b?, c*)+, a?))
  • DaE = ((ε) and ~(((b?, c*)+, a?) | ε))
  • DbE = (((b | c)*, a) and ~(c*, (b?, c*)*, a?))
  • DcE = (((b | c)*, a) and ~((b?, c*)*, a?))
  • DaaE = ((∅) and ~(ε | ε))
  • DabE = ((∅) and ~((c*, (b?, c*)*, a?) | ε))
  • DbaE = ((ε) and ~(ε))
None are nullable; there are no strings matching E; F subsumes G.

Conclusion

previous table of contents next
18 of 18 [40]
Brzozowski derivatives
  • easily extensible to new regex notations
  • concise, precise
  • easy to implement
  • sometimes cheaper / faster than FSA-based methods
  • even when theoretically not faster, often easier to use, more perspicuous