A paper accepted for presentation at Extreme Markup Languages 2005, Montréal
In words, the derivative of R with respect to s is the set of strings t which can follow s in sentences of R, or: the set of strings t such that the concatenation of s and t is a sentence in R.Given a set R of sequences and some finite sequence s, the derivative of R with respect to s is denoted by D_{s}R and is D_{s}R = {t | st ∈ R}.
and(E | (F | G))
= ((E | F) | G)
= (E | F | G)
and(E, (F, G))
= ((E, F), G)
= (E, F, G)
(E & (F & G))
= ((E & F) & G)
= (E & F & G)
D_{s} E = D_{<x1, x2, ... xn>} E = D_{xn} (... (D_{x3} (D_{x2} (D_{x1} E))) ...)
( (script | style | meta)*, ( (title, (script | style | meta)*, (base, (script | style | meta)*)?) | (base, (script | style | meta)*, (title, (script | style | meta)*)) ) )which is a slight simplification of the definition of head in XHTML. If we seek to validate the sequence <meta, title, style> against this content model, we proceed as follows:
(e{1,5}, b{0,1}){1,5}
e{0,4}, b?, (e{1,5}, b?){0,4}
e{0,3}, b?, (e{1,5}, b?){0,4} | e{0,4}, b?, (e{1,5}, b?){0,3}
e{0,2}, b?, (e{1,5}, b?){0,4} | e{0,4}, b?, (e{1,5}, b?){0,3} | e{0,3}, b?, (e{1,5}, b?){0,3} | e{0,4}, b?, (e{1,5}, b?){0,2}
(e{1,5}, b?){0,4} | (e{1,5}, b?){0,3} | (e{1,5}, b?){0,3} | (e{1,5}, b?){0,2}'
e?, b?, (e{1,5}, b?){0,4} | e{0,4}, b?, (e{1,5}, b?){0,3} | e{0,3}, b?, (e{1,5}, b?){0,3} | e{0,4}, b?, (e{1,5}, b?){0,2} | e{0,2}, b?, (e{1,5}, b?){0,3} | e{0,4}, b?, (e{1,5}, b?){0,2} | e{0,3}, b?, (e{1,5}, b?){0,2} | e{0,4}, b?, (e{1,5}, b?)?
(N.B. this definition of subsumption corresponds to shallow local validation, not to deep validation.)B subsumes R if and only if for all strings s, δ(D_{s} (R and ~B)) = ∅
We'll use variables n (current string length), X (the set of characteristic derivatives found so far), and f (a Boolean flag).
Start with n = 0, X = ∅, f = false. For each s in the set of all strings of length n,
Let F = D_{s} E. If F is equivalent to any member of X, do nothing. Otherwise, let X = X ∪ {F} and let f = true. If f = true, then increment n and go to step 2. Otherwise, f = false, and X contains the set of characteristic derivatives.
we end up calculating the derivatives of R in parallel to those of B, with the exception that we don't necessarily generate all n × m pairs of characteristic derivatives.D_{s} (R and ~B) = (D_{s} R) and (D_{s} ~B) = (D_{s} R) and ~(D_{s} B)
(((c, d, b){2,3}), c, d, e, f) and ~( ( (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,3})) | (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) ), f)
(d, b, ((c, d, b){1,2}), c, d, e, f) and ~( ( (d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,3})) | (d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?)) | (d, ((b, c, d){0,4}), (e?)) ), f )
(c, d, e, f) and ~( ( (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?)){0,3})) | (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?)) | (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?)) | (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?)) | (c, d, ((b, c, d){0,2}), (e?)) | (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?)) | (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?)) | (c, d, ((b, c, d){0,2}), (e?)) | (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?)) | (c, d, ((b, c, d){0,2}), (e?)) | (c, d, ((b, c, d)?), (e?)) ), f)
(c, d, e, f) and ~( ( (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?)){0,3})) | (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (c, d, ((b, c, d){0,4}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,4}), (e?)) | (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (c, d, ((b, c, d){0,3}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,3}), (e?)) | (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (c, d, ((b, c, d){0,2}), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d){0,2}), (e?)) | (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?)){0,2})) | (c, d, ((b, c, d)?), (e?), ((((b, c, d){0,5}), (e?))?)) | (c, d, ((b, c, d)?), (e?)) ), f)which has 13 disjuncts in the inner expression, rather than 35. It may be worth noting that Brzozowski shows explicitly that even if only relatively rudimentary simplification is done on the characteristic derivatives, there is still only a finite number of them. More formally, he shows that not only is there a finite number of characteristic derivatives such than none is equal to the others, there is also a finite number, albeit a larger number, such that none is similar to the others, where similar expressions can be transformed into each other using only the identities
f and ~( ( ((((b, c, d){0,5}), (e?)){0,3}) | ((((b, c, d){0,5}), (e?)){0,2}) | ((((b, c, d){0,5}), (e?))?) | ε ), f)
Brüggemann-Klein, Anne. 1993. “Regular expressions into finite automata.” Theoretical Computer Science 120.2 (1993): 197-213.
Brüggemann-Klein, Anne. 1993. Formal models in document processing. Habilitationsschrift, Freiburg i.Br., 1993. 110 pp. Available at ftp://ftp.informatik.uni-freiburg.de/documents/papers/brueggem/habil.ps (Cover pages archival copy also at http://www.oasis-open.org/cover/bruggDissert-ps.gz).
Brüggemann-Klein, Anne, and Derick Wood. 1998. “One-unambiguous regular languages.” Information and computation 140 (1998): 229-253.
Brzozowski, Janusz A. “Derivatives of regular expressions” Journal of the ACM 11.4 (1964): 481-494.
Clark, James. 2002. “An algorithm for RELAX NG validation.” 13 February 2002. http://www.thaiopensource.com/relaxng/derivative.html
English, Joe. 1999. How to validate XML. http://www.flightlab.com/~joe/sgml/validate.html
Foster, Bob. 2003. “Derivative works [Parsing].” Blog entry, 27 January 2003, in Technical difficulties. http://bobfoster.com/plog/index.php?m=200301
Foster, Bob. 2004. “Derivatives of Bounded Rep[e]tition.” Blog entry, 17 May 2004, in City Life: Full Speed Ahead. http://jroller.com/comments/bobfoster/FullSpeedAhead/derivatives_of_bounded_repitition
Fuchs, Matthew, and Allen Brown. 2003. “Supporting UPA and restriction on an extension of XML Schema.” Extreme Markup Languages, Montréal, 2003. Available online at http://www.idealliance.org/papers/extreme03/html/2003/Fuchs01/EML2003Fuchs01.html
Glushkov, V. N. 1961. “The abstract theory of automata”, tr. J. M. Jackson, in Russian Mathematical Surveys, A translation of the survey articles and of selected biographical articles in Uspekhi matematicheskikh nauk, ed. J. L. B. Cooper, vol. 16 (London: Cleaver-Hume, 1961), pp. 1-53.
Hopkins, Mark. 1994. “Regular Expression Software”. Posting to comp.compilers, 17 February 1994. http://compilers.iecc.com/comparch/article/94-02-109
International Organization for Standardization (ISO). 1986. ISO 8879-1986 (E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). International Organization for Standardization, Geneva, 1986.
Kawaguchi, Kohsuke. 2003. Sun Multi-Schema XML Validator. Version 1.2, April 2003. Available from http://www.sun.com/software/xml/developers/multischema/
Schmidt, Martin. 2002. Design and Implementation of a validating XML parser in Haskell. Master's thesis, Fachhochschule Wedel (Germany).
Sperberg-McQueen, C. M. 2004. “Notes on finite state automata with counters.” http://www.w3.org/XML/2004/05/msm-cfa.html
Thompson, Henry S., and Richard Tobin. 2003. “Using finite state automata to implement W3C XML Schema content model validation and restriction checking.” XML Europe 2003, London. Available online at http://www.idealliance.org/papers/dx_xmle03/papers/02-02-05/02-02-05.html (imperfectly rendered) or at http://www.ltg.ed.ac.uk/~ht/XML_Europe_2003.html.
[Wilmott, Sam]. [1991]. Content model algebra. Exoterica white paper. [Dated and attributed from external evidence. Original at http://www.omnimark.com/white/cma/cma.html has disappeared, but copies appear to be available at http://www.oasis-open.org/cover/omnimarkContentModelAlgebra200101.html and http://home.chello.no/~mgrsby/sgmlintr/sgmlcont.htm]
World Wide Web Consortium (W3C). Extensible Markup Language (XML) 1.0 (Third Edition), ed. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler (Second Edition), François Yergeau (Third Edition). W3C Recommendation 4 February 2004. Published by the World Wide Web Consortium at http://www.w3.org/TR/REC-xml/.
World Wide Web Consortium (W3C). “XML Schema Part 0: Primer”, ed. David Fallside. W3C Recommendation, 2 May 2001. [Cambridge, Sophia-Antipolis, Tokyo: W3C] http://www.w3.org/TR/xmlschema-0/.
World Wide Web Consortium (W3C). 2001. XML Schema Part 1: Structures, ed. Henry S. Thompson, David Beech, Murray Maloney, and Noah Mendelsohn. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-1-20010502/
World Wide Web Consortium (W3C). 2001. XML Schema Part 2: Datatypes, ed. Biron, Paul V. and Ashok Malhotra. W3C Recommendation 2 May 2001. [Cambridge, Sophia-Antipolis, and Tokyo]: World Wide Web Consortium. http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/
World Wide Web Consortium (W3C). Namespaces in XML 1.1, ed. Tim Bray, Dave Hollander, Andrew Layman, and Richard Tobin. W3C Recommendation 4 February 2004. Published by the World Wide Web Consortium at http://www.w3.org/TR/xml-names11.