Parsing arithmetic expressions

in XSLT 1.0 and 2.0

C. M. Sperberg-McQueen

6 February 2005



This document describes recursive-descent parsers for arithmetic expressions, written in XSLT 1.0 [and eventually also in 2.0]. Its purpose is to show how the tokenization functions, regular-expression support, and reuse of intermediate result trees in XSLT 2.0 make it easier to write parsers in XSLT.

1. A grammar for arithmetic expressions

The expressions we are interested in parsing use a conventional infix notation. The grammar we will work with is similar to those of many conventional languages, although it is not identical to any, as far as I know. Expressed in Wirth's EBNF notation, it is:
< 1 Grammar [File arithmetic.bnf] > ≡
expr = [ sign ] term { addop term } .
term = factor { mulop factor } .
factor = num { expop num }
sign  = addop
addop = '+' | '-'
mulop = '*' | '/' | '%' | 'div' | 'mod' | '×' | '÷'
expop = '**' | '^'
num = INT | DEC | VAR | '(' expr ')'



In Wirth's notation, braces surround sequences that can occur zero or more times, while brackets surround optional material. So the first production says that an expr is zero or one occurrences of sign, followed by a term, followed by zero or more occurrences of the sequence ( addop, term). Less formally: one or more terms, separated by addition operators and the whole optionally preceded by a sign.
The primitive token types are discussed below.
Sample expressions in this language include:
  • 4
  • 4.235
  • $x
  • (4)
  • 3 + 7 - 4 + 124.45 - 42
  • 3 + ((7 - 4) + (124.45 - 42))
  • 3 * ((7 div 4) ÷ (124.45 - 42))
  • 2 ** 4
  • (3 × 17) ** 2 - 1

2. Abstract syntax trees

The task of the parser is to read an expression and write out its abstract syntax tree in XML form. In the interests of verisimilitude, we'll also define simple rules in XSLT for evaluating expressions and performing simple integer / float type checking.
The abstract syntax trees will conform to the following DTD:
< 2 DTD for abstract syntax trees of arithmetic expressions > ≡

<!ENTITY % expr "sum | diff | prod | rdiv | idiv | mod
   | incr | decr | const | var">
<!ELEMENT e     (%expr;) >
<!ELEMENT sum   ((%expr;), (%expr;)) >
<!ELEMENT diff  ((%expr;), (%expr;)) >
<!ELEMENT prod  ((%expr;), (%expr;)) >
<!ELEMENT rdiv  ((%expr;), (%expr;)) >
<!ELEMENT idiv  ((%expr;), (%expr;)) >
<!ELEMENT mod   ((%expr;), (%expr;)) >
<!ELEMENT incr  (%expr;) >
<!ELEMENT decr  (%expr;) >
<!ELEMENT const (int | dec) >
<!ELEMENT var   (#PCDATA) >
<!ELEMENT int   (#PCDATA) >
<!ELEMENT dec   (#PCDATA) >

This code is not used elsewhere.

Or maybe we should use a more compact DTD:
< 3 DTD for abstract syntax trees of arithmetic expressions [File arithmetic.dtd] > ≡
<!ENTITY % expr "op | unop | val" >
<!ENTITY % atts "
          value CDATA #IMPLIED
          type  (int | real) #IMPLIED" >
<!ELEMENT e     (%expr;) >
<!ATTLIST e     %atts; >
<!ELEMENT op    ((%expr;), (%expr;)) >
<!ATTLIST op    %atts; 
          op    CDATA   #REQUIRED 
>
<!ELEMENT unop  (%expr;) >
<!ATTLIST unop  %atts; 
          op    CDATA   #REQUIRED 
>
<!ELEMENT val   (int | dec | var) >
<!ATTLIST val   %atts; >
<!ELEMENT int   (#PCDATA) >
<!ELEMENT dec   (#PCDATA) >
<!ELEMENT var   (#PCDATA) >



3. Tokenization

The arithmetic expressions we'll be parsing will be ASCII strings; the first step toward simplifying the work of the parser will be to tokenize the input string. We distinguish the token types lpar, rpar (left and right parenthesis), addop (-, +), mulop (*, ×, /, %, mul, div, ÷), expop (**, ^), var, int, dec, and err.
[Tokenization template from arith1a.xsl goes here.]

4. Parsing

A recursive-descent parser in a procedural language translates the grammar rules very directly into code. In such a language, the spine of the procedure for expr would look something like this:
procedure rd_expr:
  if current_symbol in start('expr') {
    rd_sign;
  }
  rd_term;
  do while current_symbol = 'addop' {
    rd_addop;
    rd_term;
  }
In principle, we can write a template for expr whose spine is very similar:
 <xsl:template name="rd_expr">
  <xsl:if test="contains($start_expr, name($current_token))">
   <xsl:call-template name="rd_sign"/>
  </xsl:if>
  <xsl:call-template name="rd_term"/>
  <xsl:if test="name($current_token) = 'addop'">
   <xsl:call-template name="rd_addop_term_seq"/>
  </xsl:if>
 </xsl:template>
(where the variable $start_expr is to be imagined as holding a list of the start symbols of the non-terminal expr, and $current_token is an element representing a token. The input to the parser, that is, will be a flat sequence of elements representing tokens in the input.
A first complication is that we don't want just to recognize whether the input is or is not a legal sentence, but to pass information about its structure to other code. In pseudo-code, we'd want something more like the following:
function rd_expr(): Expr {
  var s : Sign,
      t0, t1 : Term,
      o : Addop,
      e0, e1 : Expr;
  if current_symbol in start('expr') then
    s := rd_sign()
  else 
    s := nil;
  t0 := rd_term();
  if s = subtract then
    e := newOp(subtract,0,t0)
  else
    e := t0;
  do while current_symbol = 'addop' {
    o := rd_addop();
    t1 := rd_term();
    e := newOp(o,e,t2);
  }
}
One is to imagine that the function newOp constructs an operator node in the abstract syntax tree. If we fold some things together, the code becomes more compact:
function rd_expr(): Expr {
  var t : Term,
      o : Addop,
      e : Expr;
  if current_symbol in start('expr') then 
    e := newOp(rd_sign(),0,rd_term())
  else 
    e := rd_term();
  do while current_symbol = 'addop' {
    o := rd_addop();
    t := rd_term();
    e := newOp(o,e,t2);
  }
  return e;
}
In XSLT 1.0, this iteration can be expressed as recursion, and we can return the resulting expression with code like this. after initializing the variables e and t to the left- and right-hand arguments of the operator:
  <xsl:choose>
   <!--* If there is no leading sign, then (a) read a term and 
       * (b) pass it to rd_termseq as left-hand argument *-->
   <xsl:when test="not(contains($start_sign,name($current)))">
    <xsl:variable name="t">
     <xsl:call-template name="rd_term"/>
    </xsl:variable>
    <xsl:call-template name="rd_termseq">
     <xsl:with-param name="lhs" select="$t"/>
    </xsl:call-template>
   </xsl:when>

   <!--* If there IS a leading sign, call rd_termseq with zero
       * as the left-hand argument *-->
   <xsl:otherwise>
    <xsl:variable name="s">
     <xsl:call-template name="rd_sign"/>
    </xsl:variable>
    <xsl:variable name="t">
     <xsl:call-template name="rd_term"/>
    </xsl:variable>
    <xsl:variable name="e">
     <xsl:element name="op">
      <xsl:attribute name="op"><xsl:value-of select="$s"/></xsl:attribute>
      <xsl:element name="num">
        <xsl:element name="int">0</xsl:element>
      </xsl:element>
      <xsl:copy-of select="$t"/>
     </xsl:element>
    </xsl:variable>
    <!--* recur *-->
    <xsl:call-template name="rd_termseq">
     <xsl:with-param name="lhs" select="$e"/>
    </xsl:call-template>
   </xsl:otherwise>
  </xsl:choose>
There is, however, another complication. XSLT has no side effects, and no assignments, so input symbols cannot be ‘consumed’ and variables like $current_token cannot be updated by reading sub-expressions. The ‘current position’, that is, cannot be moved as a side effect of reading an expression. We will solve this by passing the current position element as an argument (under the name cursor), and figuring out, after each call to read a subexpression, how far it must be advanced.
With these elaborations, the code for the signless case looks like this:
< 4 Read unsigned expression > ≡
   <!--* If there is no leading sign, then (a) read a term and 
       * (b) pass it to rd_termseq as left-hand argument *-->
   <xsl:when test="not(contains($start_sign,name($current)))">

    <xsl:variable name="t">
     <xsl:call-template name="rd_term">
      <xsl:with-param name="cursor" select="$cursor"/>
     </xsl:call-template>
    </xsl:variable>
    <xsl:variable name="w_t">
     <xsl:call-template name="width_term">
      <xsl:with-param name="cursor" select="$cursor"/>
     </xsl:call-template>
    </xsl:variable>
    <xsl:call-template name="rd_termseq">
     <xsl:with-param name="cursor" 
       select="$cursor/following-sibling::*[$w_t]"/>
     <xsl:with-param name="lhs" select="$t"/>
    </xsl:call-template>
    
   </xsl:when>

This code is not used elsewhere.

That gives us enough basic principles to go on with:
  • Each non-terminal is represented by two templates, one for reading it, which returns an appropriate element, and one for calculating its width, which is used to increment the cursor.
  • Optional items give rise to a conditional (xsl:choose).
  • Each repetition on a right hand side is represented by a (recursive) template.
  • Each item on the right hand side is handled by a function call (xsl:call-template) to the template for reading that non-terminal, followed by a second call to the template for calculating the width (in tokens) of the expression just read. The first call in a sequence passes $cursor as the cursor argument, and later calls use $cursor/following-sibling::*[$w], where w is the width of the items already read.
With these principles, we can build a parser in XSLT 1.0.

5. The XSLT 1.0 parser

The top level of the stylesheet imports an identity transformation and includes templates for each non-terminal and repetition:
< 5 Top level of 1.0 parser [File arithmetic.1b.xsl] > ≡
<!DOCTYPE xsl:stylesheet PUBLIC 'http://www.w3.org/1999/XSL/Transform'
      '/SGML/Public/W3C/xslt10.dtd' [
<!ENTITY nl "<xsl:text xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>&#xA;</xsl:text>">
]>
<xsl:stylesheet version="1.0">

 <!--* arithmetic.1b.xsl: A simple parser for arithmetic expressions, 
     * in XSLT 1.0. 
     * 
     * This is stage b, which does the parsing.  Because we don't
     * have i/o with side effects and we don't have updates, we end
     * up parsing everything twice: once to return the value (rd_e,
     * rd_term) and once to calculate the width of the expression we
     * parsed (width_e, width_term), so we can skip past it.
     * 
     * This stylesheet was generated by a literate programming system.
     * For maintenance, edit arith.parser.xml, not this file.
     *-->

 <xsl:include href="/SGML/Styles/identity.xsl"/>

{Initialize global variables 7}

 <!--* I. Main templates, for non-terminals *-->
{Read expr 6}
{Find width of expr 17}
{Read term 18}
{Find width of term 19}
{Read factor 20}
{Find width of factor 21}
{Read num 22}
{Find width of num 23}

 <!--* II. Templates for repetitions *-->
{Read {addop term} sequence 24}
{Read {mulop factor} sequence 25}
{Read {expop num} sequence 26}

</xsl:stylesheet>
<!-- Keep this comment at the end of the file
Local variables:
mode: xml
sgml-default-dtd-file:"/SGML/Public/Emacs/xslt.ced"
sgml-omittag:t
sgml-shorttag:t
sgml-indent-data:t
sgml-indent-step:1
End:
-->



The main templates, for non-terminals, each have the same basic form First, they check to make sure the cursor is in the start set for the non-terminal they are recognizing; if so, they do the work. Otherwise, they raise an error.
< 6 Read expr > ≡
 <xsl:template name="rd_expr">
  <xsl:param name="cursor"/>

  <xsl:choose>
   <xsl:when test="contains($start_expr,name($cursor))">
    {Read an expression 8}
   </xsl:when>
   <xsl:otherwise>
    <xsl:message>!!! Trying to read an expression but cannot.</xsl:message>
    <xsl:message>Found <xsl:value-of select="name($cursor)"/> 
     with value <xsl:value-of select="$cursor"/>. !!!</xsl:message>
   </xsl:otherwise>
  </xsl:choose>

 </xsl:template>

This code is used in < Top level of 1.0 parser 5 >

The start set of expr is the union of the start sets of sign and term. Perhaps it's simplest if we just define all the start sets and get them out of the way.
< 7 Initialize global variables > ≡
 <xsl:variable name="start_addop">addop</xsl:variable>
 <xsl:variable name="start_mulop">mulop</xsl:variable>
 <xsl:variable name="start_expop">expop</xsl:variable>
 <xsl:variable name="start_sign"><xsl:value-of select="$start_addop"/></xsl:variable>
 <xsl:variable name="start_num">int dec var lpar</xsl:variable>
 <xsl:variable name="start_factor"><xsl:value-of select="$start_num"/></xsl:variable>
 <xsl:variable name="start_term"><xsl:value-of select="$start_factor"/></xsl:variable>
 <xsl:variable name="start_expr"> 
  <xsl:value-of select="concat($start_sign,' ',$start_term)"/> 
 </xsl:variable>

This code is used in < Top level of 1.0 parser 5 >

The actual work is done in a conditional with one branch for the leading-sign case and one for the unsigned case:
< 8 Read an expression > ≡
    <xsl:choose>
     <xsl:when test="contains($start_sign,name($cursor))">
      {Read a signed expression 13}
     </xsl:when>
     <xsl:otherwise>
      {Read an unsigned expression 9}
     </xsl:otherwise>
    </xsl:choose>

This code is used in < Read expr 6 >

Reading an unsigned expression is simpler, so let's do it first.
< 9 Read an unsigned expression > ≡
      <!--* If there is no leading sign, then (a) read a term t and 
          * (b) if an addop follows then pass t to rd_termseq 
          * as left-hand argument, else (c) return t *-->
      {Set t and w_t by calling rd_term at cursor 10}
      {Return appropriate result for rd_expr 11}

This code is used in < Read an expression 8 >

First, we read the term at the beginning of the expression, assigning it to the variable t, and its width to w_t. Then we return the result of the template, either by returning t or by calling rd_termseq to handling the repetition of addition operators and further terms.
Whenever we call a template to read a non-terminal at a given location, we will also call a template to calculate its width. The pattern is shown here:
< 10 Set t and w_t by calling rd_term at cursor > ≡
      <xsl:variable name="t">
       <xsl:call-template name="rd_term">
	<xsl:with-param name="cursor" select="$cursor"/>
       </xsl:call-template>
      </xsl:variable>
      <xsl:variable name="w_t">
       <xsl:call-template name="width_term">
	<xsl:with-param name="cursor" select="$cursor"/>
       </xsl:call-template>
      </xsl:variable>

This code is used in < Read an unsigned expression 9 >

Having initialized variables t and w_t, we can figure out what to do next.
< 11 Return appropriate result for rd_expr > ≡
      <xsl:choose>
       <xsl:when test="name($cursor/following-sibling::*[$w_t]) = 'addop'">
        {Call rd_termseq with $t, at cursor + $w_t 12}
       </xsl:when>
       <xsl:otherwise>
	<xsl:copy-of select="$t"/>
       </xsl:otherwise>
      </xsl:choose>

This code is used in < Read an unsigned expression 9 > < Read a signed expression 13 >

If there is an addop following the term we just read (i.e. the token w_t tokens over from the cursor has the generic identifier addop), then we call rd_termseq to read as many occurrences of the (addop, term) sequence as there are in the expression. Since we want addition operators to be left-associative (so “7 - 2 + 3” is equivalent to “(7 - 2) + 3” ⇒ 8 not “7 - (2 + 3)” ⇒ 2), we can't generate the top-level expression yet: that will be generated by rd_termseq when it reads the last addop. We pass the term we've read in as a parameter to rd_termseq, so it can be used as the left-hand argument to the next operator read.
If there is no following addop, then we just return t.
If an addop follows, we call rd_termseq, with t as a left-hand argument for the next operator, and a cursor that has been incremented by w_t tokens. This call will produce the result we actually return, so it's not wrapped in a variable declaration.
< 12 Call rd_termseq with $t, at cursor + $w_t > ≡
      <xsl:call-template name="rd_termseq">
       <xsl:with-param name="cursor" 
	select="$cursor/following-sibling::*[$w_t]"/>
       <xsl:with-param name="lhs" select="$t"/>
      </xsl:call-template>

This code is used in < Return appropriate result for rd_expr 11 >

Reading a signed expression is similar, but the arithmetic is slightly different.
< 13 Read a signed expression > ≡
      <!--* If there is a leading sign, call rd_termseq with zero
          * as the left-hand argument *-->
      {Set s by reading sign at cursor 14}
      {Set t0 and w_t by calling rd_term at cursor + 1 15}
      {Calculate an expression t to pass to rd_termseq 16}
      {Return appropriate result for rd_expr 11}
 
      <xsl:variable name="e">
       <xsl:element name="op">
	<xsl:attribute name="op"><xsl:value-of select="$s"/></xsl:attribute>
	<xsl:element name="num"><xsl:element name="int">0</xsl:element></xsl:element>
	<xsl:copy-of select="$t"/>
       </xsl:element>
      </xsl:variable>
      <xsl:call-template name="rd_termseq">
       <xsl:with-param name="cursor" select="$cursor"/>
       <xsl:with-param name="lhs" select="$e"/>
      </xsl:call-template>

This code is used in < Read an expression 8 >

First we read the sign, then we read the term, then we calculate values for t and w_t, after which the code continues as before.
Since signs are very straightforward tokens, we won't have a separate template for reading a sign and calculating its width. The sign itself is the string value (content) of the addop element, and the width is always one.
< 14 Set s by reading sign at cursor > ≡
      <xsl:variable name="s">
       <xsl:value-of select="$cursor"/>
      </xsl:variable>

This code is used in < Read a signed expression 13 >

Calling rd_term and width_term is exactly the same as in the unsigned case shown above, but we have to have different code because we need a different value for the cursor argument.
< 15 Set t0 and w_t by calling rd_term at cursor + 1 > ≡
      <xsl:variable name="t0">
       <xsl:call-template name="rd_term">
	<xsl:with-param name="cursor" select="$cursor/following-sibling::*[1]"/>
       </xsl:call-template>
      </xsl:variable>
      <xsl:variable name="w_t">
       <xsl:call-template name="width_term">
	<xsl:with-param name="cursor" select="$cursor/following-sibling::*[1]"/>
       </xsl:call-template>
      </xsl:variable>

This code is used in < Read a signed expression 13 >

Having initialized variables s and t, we can now calculate the right expression to pass to rd_termseq for use as a left-hand argument. If the sign is positive, we can ignore the sign and just pass the term. If it's negative, we create an expression ourselves, subtracting the term from zero.
< 16 Calculate an expression t to pass to rd_termseq > ≡
      <xsl:variable name="t">
       <xsl:choose>
	<xsl:when test="$s = '-'">
	 <xsl:element name="op">
	  <xsl:attribute name="op">-</xsl:attribute>
	  <xsl:element name="num"><xsl:element name="int">0</xsl:element></xsl:element>
	  <xsl:copy-of select="$t0"/>
	 </xsl:element>
	</xsl:when>
	<xsl:when test="$s = '-'">
	 <xsl:copy-of select="$t0"/>
	</xsl:when>
	<xsl:otherwise>
	 <xsl:message>I didn't know there were addition operators other than - and +. Huh?</xsl:message>
	</xsl:otherwise>
       </xsl:choose>
      </xsl:variable>

This code is used in < Read a signed expression 13 >

< 18 Read term > ≡
 <xsl:template name="rd_term">
  <xsl:param name="cursor"/>

  <xsl:choose>
   <xsl:when test="contains($start_term,name($cursor))">
    {Read a term !! BAD REFERENCE! No scrap called x1.rd_term_go!}
   </xsl:when>
   <xsl:otherwise>
    <xsl:message>!!! Trying to read a term but cannot.</xsl:message>
    <xsl:message>Found <xsl:value-of select="name($cursor)"/> 
     with value <xsl:value-of select="$cursor"/>. !!!</xsl:message>
   </xsl:otherwise>
  </xsl:choose>

 </xsl:template>

This code is used in < Top level of 1.0 parser 5 >

< 20 Read factor > ≡
 <xsl:template name="rd_factor">
  <xsl:param name="cursor"/>

  <xsl:choose>
   <xsl:when test="contains($start_factor,name($cursor))">
    {Read a factor !! BAD REFERENCE! No scrap called x1.rd_factor_go!}
   </xsl:when>
   <xsl:otherwise>
    <xsl:message>!!! Trying to read a factor but cannot.</xsl:message>
    <xsl:message>Found <xsl:value-of select="name($cursor)"/> 
     with value <xsl:value-of select="$cursor"/>. !!!</xsl:message>
   </xsl:otherwise>
  </xsl:choose>

 </xsl:template>

This code is used in < Top level of 1.0 parser 5 >

< 22 Read num > ≡
 <xsl:template name="rd_num">
  <xsl:param name="cursor"/>

  <xsl:choose>
   <xsl:when test="contains($start_num,name($cursor))">
    {Read a number !! BAD REFERENCE! No scrap called x1.rd_num_go!}
   </xsl:when>
   <xsl:otherwise>
    <xsl:message>!!! Trying to read a number but cannot.</xsl:message>
    <xsl:message>Found <xsl:value-of select="name($cursor)"/> 
     with value <xsl:value-of select="$cursor"/>. !!!</xsl:message>
   </xsl:otherwise>
  </xsl:choose>

 </xsl:template>

This code is used in < Top level of 1.0 parser 5 >

The templates for repetitions:
< 24 Read {addop term} sequence > ≡

This code is used in < Top level of 1.0 parser 5 >

...
When there is a sign, the XSLT 1.0 code looks like this:
< 27 Read signed expression > ≡
    {Read signed expression 28}
    {Read signed expression 29}
    {Read signed expression 30}

    <xsl:call-template name="rd_termseq">
     <xsl:with-param name="cursor" select="$cursor"/>
     <xsl:with-param name="lhs" select="$e"/>
    </xsl:call-template>

This code is not used elsewhere.

this:
< 28 Read signed expression > ≡
    <xsl:variable name="s">
     <xsl:value-of select="$cursor"/>
    </xsl:variable>

This code is used in < Read signed expression 27 >

this:
< 29 Read signed expression > ≡
    <xsl:variable name="t">
     <xsl:call-template name="rd_term">
      <xsl:with-param name="cursor" select="$cursor"/>
     </xsl:call-template>
    </xsl:variable>
    <xsl:variable name="w_t">
     <xsl:call-template name="width_term">
      <xsl:with-param name="cursor" select="$cursor"/>
     </xsl:call-template>
    </xsl:variable>

This code is used in < Read signed expression 27 >

this:
< 30 Read signed expression > ≡
    <xsl:variable name="e">
     <xsl:element name="op">
      <xsl:attribute name="op"><xsl:value-of select="$s"/></xsl:attribute>
      <xsl:element name="num"><xsl:element name="int">0</xsl:element></xsl:element>
      <xsl:copy-of select="$t"/>
     </xsl:element>
    </xsl:variable>

This code is used in < Read signed expression 27 >