This document describes recursive-descent parsers for arithmetic
expressions, written in XSLT 1.0 [and eventually also in 2.0]. Its
purpose is to show how the tokenization functions, regular-expression
support, and reuse of intermediate result trees in XSLT 2.0 make it
easier to write parsers in XSLT.
1. A grammar for arithmetic expressions
The expressions we are interested in parsing use a conventional
infix notation. The grammar we will work with is similar to
those of many conventional languages, although it is not identical
to any, as far as I know.
Expressed in Wirth's EBNF notation, it is:
< 1 Grammar [File arithmetic.bnf] > ≡
expr = [ sign ] term { addop term } .
term = factor { mulop factor } .
factor = num { expop num }
sign = addop
addop = '+' | '-'
mulop = '*' | '/' | '%' | 'div' | 'mod' | '×' | '÷'
expop = '**' | '^'
num = INT | DEC | VAR | '(' expr ')'
In Wirth's notation, braces surround sequences that can occur
zero or more times, while brackets surround optional material.
So the first production says that an expr is
zero or one occurrences of sign, followed by
a term, followed by zero or more occurrences
of the sequence ( addop, term).
Less formally: one or more terms, separated by addition
operators and the whole optionally preceded by a sign.
The primitive token types are discussed below.
Sample expressions in this language include:
- 4
- 4.235
- $x
- (4)
- 3 + 7 - 4 + 124.45 - 42
- 3 + ((7 - 4) + (124.45 - 42))
- 3 * ((7 div 4) ÷ (124.45 - 42))
- 2 ** 4
- (3 × 17) ** 2 - 1
3. Tokenization
The arithmetic expressions we'll be parsing will be ASCII strings;
the first step toward simplifying the work of the parser will be
to tokenize the input string. We distinguish the token types
lpar, rpar (left and right parenthesis),
addop (-, +),
mulop (*, ×, /, %, mul, div, ÷),
expop (**, ^),
var, int, dec,
and err.
[Tokenization template from arith1a.xsl goes here.]
4. Parsing
A recursive-descent parser in a procedural language translates
the grammar rules very directly into code. In such a language,
the spine of the procedure for
expr would
look something like this:
procedure rd_expr:
if current_symbol in start('expr') {
rd_sign;
}
rd_term;
do while current_symbol = 'addop' {
rd_addop;
rd_term;
}
In principle, we can write a template for
expr whose
spine is very similar:
<xsl:template name="rd_expr">
<xsl:if test="contains($start_expr, name($current_token))">
<xsl:call-template name="rd_sign"/>
</xsl:if>
<xsl:call-template name="rd_term"/>
<xsl:if test="name($current_token) = 'addop'">
<xsl:call-template name="rd_addop_term_seq"/>
</xsl:if>
</xsl:template>
(where the variable
$start_expr is to be imagined
as holding a list of the start symbols of the non-terminal
expr, and
$current_token is an
element representing a token. The input to the parser, that is,
will be a flat sequence of elements representing tokens in the
input.
A first complication is that we don't want just to recognize whether
the input is or is not a legal sentence, but to pass information about
its structure to other code. In pseudo-code, we'd want something more
like the following:
function rd_expr(): Expr {
var s : Sign,
t0, t1 : Term,
o : Addop,
e0, e1 : Expr;
if current_symbol in start('expr') then
s := rd_sign()
else
s := nil;
t0 := rd_term();
if s = subtract then
e := newOp(subtract,0,t0)
else
e := t0;
do while current_symbol = 'addop' {
o := rd_addop();
t1 := rd_term();
e := newOp(o,e,t2);
}
}
One is to imagine that the function
newOp constructs
an operator node in the abstract syntax tree. If we fold some things
together, the code becomes more compact:
function rd_expr(): Expr {
var t : Term,
o : Addop,
e : Expr;
if current_symbol in start('expr') then
e := newOp(rd_sign(),0,rd_term())
else
e := rd_term();
do while current_symbol = 'addop' {
o := rd_addop();
t := rd_term();
e := newOp(o,e,t2);
}
return e;
}
In XSLT 1.0, this iteration can be expressed as recursion,
and we can return the resulting expression with code like this.
after initializing the variables
e and
t
to the left- and right-hand arguments of the operator:
<xsl:choose>
<!--* If there is no leading sign, then (a) read a term and
* (b) pass it to rd_termseq as left-hand argument *-->
<xsl:when test="not(contains($start_sign,name($current)))">
<xsl:variable name="t">
<xsl:call-template name="rd_term"/>
</xsl:variable>
<xsl:call-template name="rd_termseq">
<xsl:with-param name="lhs" select="$t"/>
</xsl:call-template>
</xsl:when>
<!--* If there IS a leading sign, call rd_termseq with zero
* as the left-hand argument *-->
<xsl:otherwise>
<xsl:variable name="s">
<xsl:call-template name="rd_sign"/>
</xsl:variable>
<xsl:variable name="t">
<xsl:call-template name="rd_term"/>
</xsl:variable>
<xsl:variable name="e">
<xsl:element name="op">
<xsl:attribute name="op"><xsl:value-of select="$s"/></xsl:attribute>
<xsl:element name="num">
<xsl:element name="int">0</xsl:element>
</xsl:element>
<xsl:copy-of select="$t"/>
</xsl:element>
</xsl:variable>
<!--* recur *-->
<xsl:call-template name="rd_termseq">
<xsl:with-param name="lhs" select="$e"/>
</xsl:call-template>
</xsl:otherwise>
</xsl:choose>
There is, however, another complication. XSLT has no side effects,
and no assignments, so input symbols cannot be
‘consumed’ and variables like
$current_token cannot be updated by reading
sub-expressions. The ‘current position’, that is,
cannot be moved as a side effect of reading an expression. We will
solve this by passing the current position element as an argument
(under the name cursor), and figuring out, after
each call to read a subexpression, how far it must be advanced.
With these elaborations, the code for the signless case looks
like this:
< 4 Read unsigned expression > ≡
<!--* If there is no leading sign, then (a) read a term and
* (b) pass it to rd_termseq as left-hand argument *-->
<xsl:when test="not(contains($start_sign,name($current)))">
<xsl:variable name="t">
<xsl:call-template name="rd_term">
<xsl:with-param name="cursor" select="$cursor"/>
</xsl:call-template>
</xsl:variable>
<xsl:variable name="w_t">
<xsl:call-template name="width_term">
<xsl:with-param name="cursor" select="$cursor"/>
</xsl:call-template>
</xsl:variable>
<xsl:call-template name="rd_termseq">
<xsl:with-param name="cursor"
select="$cursor/following-sibling::*[$w_t]"/>
<xsl:with-param name="lhs" select="$t"/>
</xsl:call-template>
</xsl:when>
This code is not used elsewhere.
That gives us enough basic principles to go on with:
- Each non-terminal is represented by two templates, one for
reading it, which returns an appropriate element, and one for
calculating its width, which is used to increment the cursor.
- Optional items give rise to a conditional (xsl:choose).
- Each repetition on a right hand side is represented by a (recursive) template.
- Each item on the right hand side is handled by a function call
(xsl:call-template) to the template for reading that
non-terminal, followed by a second call to the template for calculating
the width (in tokens) of the expression just read. The first call
in a sequence passes $cursor as the cursor
argument, and later calls use $cursor/following-sibling::*[$w],
where w is the width of the items already read.
With these principles, we can build a parser in XSLT 1.0.
5. The XSLT 1.0 parser
The top level of the stylesheet imports an identity
transformation and includes templates for each non-terminal
and repetition:
The main templates, for non-terminals, each have the same basic form
First, they check to make sure the cursor is in the start set for
the non-terminal they are recognizing; if so, they do the work.
Otherwise, they raise an error.
< 6 Read expr > ≡
<xsl:template name="rd_expr">
<xsl:param name="cursor"/>
<xsl:choose>
<xsl:when test="contains($start_expr,name($cursor))">
{Read an expression 8}
</xsl:when>
<xsl:otherwise>
<xsl:message>!!! Trying to read an expression but cannot.</xsl:message>
<xsl:message>Found <xsl:value-of select="name($cursor)"/>
with value <xsl:value-of select="$cursor"/>. !!!</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
This code is used in < Top level of 1.0 parser 5 >
The start set of
expr is the union of the start sets
of
sign and
term. Perhaps it's simplest
if we just define all the start sets and get them out of the way.
< 7 Initialize global variables > ≡
<xsl:variable name="start_addop">addop</xsl:variable>
<xsl:variable name="start_mulop">mulop</xsl:variable>
<xsl:variable name="start_expop">expop</xsl:variable>
<xsl:variable name="start_sign"><xsl:value-of select="$start_addop"/></xsl:variable>
<xsl:variable name="start_num">int dec var lpar</xsl:variable>
<xsl:variable name="start_factor"><xsl:value-of select="$start_num"/></xsl:variable>
<xsl:variable name="start_term"><xsl:value-of select="$start_factor"/></xsl:variable>
<xsl:variable name="start_expr">
<xsl:value-of select="concat($start_sign,' ',$start_term)"/>
</xsl:variable>
This code is used in < Top level of 1.0 parser 5 >
The actual work is done in a conditional with one branch for the leading-sign
case and one for the unsigned case:
Reading an unsigned expression is simpler, so let's do it first.
First, we read the term at the beginning of the expression, assigning
it to the variable
t, and its width to
w_t.
Then we return the result of the template, either by returning
t or by calling
rd_termseq to
handling the repetition of addition operators and further terms.
Whenever we call a template to read a non-terminal at a given
location, we will also call a template to calculate its width. The
pattern is shown here:
< 10 Set t and w_t by calling rd_term at cursor > ≡
<xsl:variable name="t">
<xsl:call-template name="rd_term">
<xsl:with-param name="cursor" select="$cursor"/>
</xsl:call-template>
</xsl:variable>
<xsl:variable name="w_t">
<xsl:call-template name="width_term">
<xsl:with-param name="cursor" select="$cursor"/>
</xsl:call-template>
</xsl:variable>
This code is used in < Read an unsigned expression 9 >
Having initialized variables
t and
w_t,
we can figure out what to do next.
If there is an
addop following the term we just read
(i.e. the token
w_t tokens over from the cursor has the
generic identifier
addop), then we call
rd_termseq to read as many occurrences of the
(
addop,
term) sequence as there are in
the expression. Since we want addition operators to be
left-associative (so “
7 - 2 + 3” is equivalent to
“
(7 - 2) + 3” ⇒ 8 not “
7 - (2 +
3)” ⇒ 2), we can't generate the top-level expression
yet: that will be generated by
rd_termseq when it reads
the last
addop. We pass the term we've read in as a
parameter to
rd_termseq, so it can be used as the
left-hand argument to the next operator read.
If there is no following addop, then we just return
t.
If an
addop follows, we call
rd_termseq,
with
t as a left-hand argument for the next operator,
and a cursor that has been incremented by
w_t tokens.
This call will produce the result we actually return, so it's not wrapped
in a variable declaration.
Reading a signed expression is similar, but the arithmetic is slightly
different.
First we read the sign, then we read the term, then we calculate
values for
t and
w_t, after which
the code continues as before.
Since signs are very straightforward tokens, we won't have a separate
template for reading a sign and calculating its width. The sign itself
is the string value (content) of the
addop element, and
the width is always one.
Calling
rd_term and
width_term is
exactly the same as in the unsigned case shown above, but we have to
have different code because we need a different value for the
cursor argument.
< 15 Set t0 and w_t by calling rd_term at cursor + 1 > ≡
<xsl:variable name="t0">
<xsl:call-template name="rd_term">
<xsl:with-param name="cursor" select="$cursor/following-sibling::*[1]"/>
</xsl:call-template>
</xsl:variable>
<xsl:variable name="w_t">
<xsl:call-template name="width_term">
<xsl:with-param name="cursor" select="$cursor/following-sibling::*[1]"/>
</xsl:call-template>
</xsl:variable>
This code is used in < Read a signed expression 13 >
Having initialized variables
s and
t, we
can now calculate the right expression to pass to
rd_termseq
for use as a left-hand argument. If the sign is positive, we can ignore
the sign and just pass the term. If it's negative, we create an expression
ourselves, subtracting the term from zero.
< 16 Calculate an expression t to pass to rd_termseq > ≡
<xsl:variable name="t">
<xsl:choose>
<xsl:when test="$s = '-'">
<xsl:element name="op">
<xsl:attribute name="op">-</xsl:attribute>
<xsl:element name="num"><xsl:element name="int">0</xsl:element></xsl:element>
<xsl:copy-of select="$t0"/>
</xsl:element>
</xsl:when>
<xsl:when test="$s = '-'">
<xsl:copy-of select="$t0"/>
</xsl:when>
<xsl:otherwise>
<xsl:message>I didn't know there were addition operators other than - and +. Huh?</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:variable>
This code is used in < Read a signed expression 13 >
< 18 Read term > ≡
<xsl:template name="rd_term">
<xsl:param name="cursor"/>
<xsl:choose>
<xsl:when test="contains($start_term,name($cursor))">
{Read a term !! BAD REFERENCE! No scrap called x1.rd_term_go!}
</xsl:when>
<xsl:otherwise>
<xsl:message>!!! Trying to read a term but cannot.</xsl:message>
<xsl:message>Found <xsl:value-of select="name($cursor)"/>
with value <xsl:value-of select="$cursor"/>. !!!</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
This code is used in < Top level of 1.0 parser 5 >
< 22 Read num > ≡
<xsl:template name="rd_num">
<xsl:param name="cursor"/>
<xsl:choose>
<xsl:when test="contains($start_num,name($cursor))">
{Read a number !! BAD REFERENCE! No scrap called x1.rd_num_go!}
</xsl:when>
<xsl:otherwise>
<xsl:message>!!! Trying to read a number but cannot.</xsl:message>
<xsl:message>Found <xsl:value-of select="name($cursor)"/>
with value <xsl:value-of select="$cursor"/>. !!!</xsl:message>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
This code is used in < Top level of 1.0 parser 5 >
The templates for repetitions:
...
When there is a sign, the XSLT 1.0 code looks like
this:
this:
this:
< 29 Read signed expression > ≡
<xsl:variable name="t">
<xsl:call-template name="rd_term">
<xsl:with-param name="cursor" select="$cursor"/>
</xsl:call-template>
</xsl:variable>
<xsl:variable name="w_t">
<xsl:call-template name="width_term">
<xsl:with-param name="cursor" select="$cursor"/>
</xsl:call-template>
</xsl:variable>
This code is used in < Read signed expression 27 >
this:
< 30 Read signed expression > ≡
<xsl:variable name="e">
<xsl:element name="op">
<xsl:attribute name="op"><xsl:value-of select="$s"/></xsl:attribute>
<xsl:element name="num"><xsl:element name="int">0</xsl:element></xsl:element>
<xsl:copy-of select="$t"/>
</xsl:element>
</xsl:variable>
This code is used in < Read signed expression 27 >