Fundamental primitives of XSLT programming

[2016-08-07]

A friend planning an introductory course on programming for linguists recently asked me what I thought such linguist-programmers absolutely needed to learn in their first semester of programming. I thought back to a really helpful “Introduction to Programming” course taught in the 1980s at the Princeton University Computer Center by Howard Strauss, then the head of User Services. As I remember it, it consisted essentially of the introduction of three flow-chart patterns (for a sequence of steps, for a conditional, and for a while-loop), with instructions on how to use them that went something like this:

  1. Start with a single box whose text describes the functionality to be implemented.
  2. If every box in the diagram is trivial to implement in your programming language, stop: you’re done. Implement the program using the obvious translation of sequences, loops, and conditionals into your language.
  3. Otherwise choose some single box in the diagram whose functionality is non-trivial (will require more than a few lines of code) and replace it with a pattern: either break it down into a sequence of steps, or make it into a while-condition-do-action loop, or make it into an if-then-else choice.
  4. Return to step 2.

I recommended this idea to my friend, since when I started to learn to program I found these three patterns extremely helpful. As I thought about it further, it occurred to me that the three patterns in question correspond 1:1 to (a) the three constructors used in regular languages, and (b) the three patterns proposed in the 1970s by Michael A. Jackson. The diagrams I learned from Howard Strauss were not the same as Jackson’s diagrams graphically, but the semantics were essentially the same. I expect that a good argument can be made that together with function calls and recursion, those three patterns are the atomic patterns of software design for all conventional (i.e. sequential imperative) languages.

I think the patterns provide a useful starting point for a novice programmer: if you can see how to express an idea using those three patterns, it’s hard not to see how to capture it in a program in Pascal, or C, or Python, or whatever language you’re using. Jackson is quite good on deriving the structure of the program from the structures of the input and output in a systematic way.
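
To make the three patterns concrete, here is a minimal sketch in POSIX shell (chosen only because the other examples on this page are shell; it is not taken from Strauss’s course or from Jackson’s books). The top-level box is “count the non-empty lines on standard input”; the function name count_nonempty is invented for the illustration. One refinement turns the box into a sequence, the middle step of the sequence into a loop, and the body of the loop into a conditional.

```shell
#!/bin/sh
# Top-level box: "count the non-empty lines on standard input".
# count_nonempty is a hypothetical name used only for this sketch.
count_nonempty() {
    count=0                         # sequence: first step, initialise
    while IFS= read -r line; do     # loop: while there is input, do ...
        if [ -n "$line" ]; then     # conditional: if non-empty, then ...
            count=$((count + 1))
        fi
    done
    echo "$count"                   # sequence: last step, report
}

# Two of these three lines are non-empty, so this prints 2.
printf 'alpha\n\nbeta\n' | count_nonempty
```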

The languages I most often teach, however, are XSLT and XQuery; they do not fall into the class of conventional sequential imperative languages, and the three patterns I learned from Howard Strauss (and which Howard Strauss may or may not have learned from Michael A. Jackson) do not help me structure a program in either language.

Is there a similarly small set of simple fundamental patterns that can be used to describe how to build up an XSLT transformation, or an XQuery program?

What are they?

Do they have a plausible graphical representation one could use for sketching out a stepwise refinement of a design?

Running the BaseX XQuery engine in the OpenShift cloud platform

[7 January 2016]

Late last year I worked through the process of making the BaseX XQuery engine run under Tomcat 6 at a commercial Java hosting provider. As I mentioned in that post, I also spent some time trying to make a cloud-services solution work, using OpenShift and Andy Bunce’s excellent Openshift quick start for BaseX. I ran into trouble then, because the Red Hat Cloud (rhc) tools refused to install on either of my two machines because the operating systems were more than twelve months old.

But during a quiet day or two last month I downloaded a new operating system for the newer machine (I’m exaggerating; it probably only took sixteen hours or so), and the other day I tried again.

The instructions for the quick-start setup are a bit terse, so for future reference (and for the benefit of any other XQuery developers out there who would like a little more detail in the instructions), this is my checklist for next time. Many of the same considerations apply as for installing BaseX under Tomcat; see the earlier post for more discussion.

Prerequisites

The following checklist assumes that:

  • You are reasonably comfortable with command-line tools.
  • You are reasonably comfortable with git, or can copy and paste git incantations without mucking them up.
  • You know what the basexhttp script is (or can live without knowing all the details of how things work).
  • It may be helpful if you’re familiar with ssh and scp, but it’s not essential.

If these don’t all apply, you will need to make some adjustments on the fly.

Preliminary preparation

  1. Sign up for an account at OpenShift.

    I don’t remember how long this took, maybe an hour. (It would go faster if it were easier to find the actual signup page and if one didn’t read the agreements one is consenting to.)

    I went through the “log in to the console and create an app” tutorial at OpenShift and “created” my first application (by clicking a button). I didn’t find this helpful or instructive, but YMMV.

    Remember your userid and password; you will need them repeatedly.

  2. Install the OpenShift / Red Hat Cloud command-line tools (rhc); instructions are on the Getting Started with OpenShift Online page at OpenShift.

    This also takes a little while, to download and install all the libraries on which the rhc command-line tools depend. My recollection is that it took half an hour or so. I expect it’s faster for people with faster connections than mine.

    (This is where things went terribly wrong for me in November, since Red Hat’s command-line tools refused to be installed in an operating system that shipped two years ago. Trying to solve the problem by upgrading the system’s Ruby interpreter landed me in dependency hell.)

  3. Decide in advance what userids you wish to specify for your BaseX database(s); for each userid specify the initial password and the initial permissions for each user. At the very least, determine what userid and password will be used as the default database user for the app.

    To allow myself to undertake each database operation with the lowest feasible level of privilege, I made myself an array of userids, one for each privilege level, and assigned them passwords generated by a simple random process:

    • Angie (admin privileges)
    • Chris (create privileges)
    • Will (write privileges)
    • Ralph (read privileges)
    • Nadine (no privileges)
  4. Prepare in advance a small XML file or two to use in creating a helloworld database to check that things are running as you expect.

    I use the documents in http://cmsmcq.com/2015/11/XQuery-over-HTTP/data/ for making a helloworld example. When I use the REST interface to create a database and populate it, it’s simplest to retrieve the documents by URI; when I use the dba interface of BaseX 8.3.1, it proves simplest if I have copies of them on my hard disk.
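
For the userids in step 3, one possible “simple random process” is to draw letters and digits from the system’s random source. This is only a sketch of one way to do it, not necessarily the process used for the accounts above; the function name random_password and the default length of 24 are assumptions made for the example.

```shell
#!/bin/sh
# Print one random password of the requested length (default 24),
# made of ASCII letters and digits read from /dev/urandom.
# random_password is a hypothetical helper, not part of BaseX or rhc.
random_password() {
    length=${1:-24}
    LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c "$length"
    echo
}

# One line per userid from the list in step 3: name, tab, password.
for user in Angie Chris Will Ralph Nadine; do
    printf '%s\t%s\n' "$user" "$(random_password 24)"
done
```

The output is different on every run, so it needs to be saved somewhere safe before the accounts are created.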

Doing the deed

The basic instructions are all given in the quick-start readme file; they are, perhaps, a little terse there, for readers who haven’t worked with OpenShift before, so I’ll repeat them with some comments. We are going to create a do-nothing OpenShift application, copy Andy Bunce’s BaseX quickstart setup into it, and check it in. For concreteness, I will assume the app to be built is named “Allegheny”, and the OpenShift user’s domain is “AIK”.

  1. rhc app create -a Allegheny -t diy-0.1

    This creates a do-nothing app named Allegheny on the OpenShift server, using the diy-0.1 cartridge, then clones a git repository of the code into a new directory called Allegheny, on your hard disk. The DIY cartridge provides a sort of minimal environment for an app, on a do-it-yourself basis; fortunately for us, we do not have to do it ourselves, as Andy Bunce has done all the crucial bits for us.

    When I did it this morning, this step took a little over two minutes.

  2. cd Allegheny

    Move into the app directory.

    Do not omit this step; you will regret it, especially if you use git yourself as a backup or synching mechanism. (That is to say, when I forgot to do this, the next step put all of the quickstart code into my home directory. It took me a painfully long time to find and delete it all and get it back out of my git repository. And my .gitignore file seems to have lost both its content and its history. Fortunately, I do have backups.)

  3. git remote add upstream -m master https://github.com/Quodatum/openshift-basex-quick-start.git

    git pull -s recursive -X theirs upstream master

    Pull the quickstart code into your app. (If you want to understand in detail what these lines do, I recommend the git manual. I am not going to try to explain.)

    When I ran this this morning, it took a little under two minutes; it seemed longer.

    If you’re in a hurry, go ahead to the next step. Otherwise, now may be a good time to pause to look at the application’s directory structure, since it’s where you will be doing your development. The crucial bits appear (in the current state of my imperfect knowledge) to be:

    • config (A bash script which sets some variables to be used in other bash scripts, including BASEX_USER and BASEX_PASSWORD, which are used as the values of the -U and -P options in calls to the basexhttp script. The URI of the version of BaseX to install is also given here, as are the port numbers the server should use.)

    • .openshift/ (part of the OpenShift infrastructure)

      • action_hooks/ (this is where the application developer places code to be executed by the OpenShift infrastructure at predefined moments)

        • start (a bash script to be executed when the application needs to be started; the script provided by the quickstart calls basexhttp start)

        • stop (a bash script to be executed when the application needs to be stopped; the script provided by the quickstart calls pkill java to stop the server; in principle it would rather call basexhttp stop, but at the moment that doesn’t work)

      • cron/ (Directory for cron jobs to be executed on behalf of the application, with subdirectories hourly, daily, etc.; the README file has a pointer to the relevant OpenShift documentation. The weekly subdirectory appears to have a weekly statistics job in it.)

      • markers/ (I have no idea what this is; the README points to the documentation.)

    • basex/ (For XQuery developers, this is where the action is.)

      • lib/ (Contains a JAR file for Saxon HE 9.7.0.1.)

      • repo/ (The BaseX repository; contains XQuery modules installed using REPO INSTALL URI-of-module; for details see the BaseX wiki under Repository.)

      • webapp/

        • restxq.xqm (A sample RESTXQ function by Andy Bunce supplied as part of the quickstart framework; it generates the default welcome page.)

        • WEB-INF/ (Contains the configuration files governing BaseX’s behavior as a web service)

          • jetty.xml

            (Jetty-specific configuration. With any luck you’ll never need to change anything here. The BaseX documentation refers readers to the Jetty documentation for details, which in turn suggests that you consult the Jetty 9 documentation instead, and good luck to you if you actually need to understand the details. The salient item for the quickstart is setting the IP address and port for the server, using the environment variables OPENSHIFT_DIY_IP and OPENSHIFT_DIY_PORT.)

          • web.xml

            (Generic servlet configuration. By default, the quickstart has the REST and WebDAV interfaces turned off and the RESTXQ interface turned on; you’ll need to edit the web.xml file to turn the REST and WebDAV interfaces back on. This document also specifies the default user and password for the server; how the specification in the web.xml file relates to the values given by the -U and -P options passed to basexhttp remains a mystery to me.

            Since the default web.xml file for the quickstart does not set the RESTXQPATH or RESTPATH options, they default to “the standard web application directory”, which appears in this case to be the webapps directory in the root directory of the git repository. That would be consistent with the placement of restxq.xqm. The web.xml file also doesn’t specify the REPOPATH option; the documentation says that it defaults to {home}/BaseXRepo, but apparently {home}/repo is also a possibility; that would be consistent with the placement of the repo/ directory here.)

    The other files and directories either have obvious functions (.git, .gitignore, LICENSE, README.md) or appear to be just samples (diy/, misc/).

  4. You can now adjust things in the configuration, if you wish. I mostly wait until later.

  5. git push origin master

    This checks in your changes on the server. The OpenShift infrastructure will tell the running DIY application to shut down (i.e. it will run the script in the checked-in version of .openshift/action_hooks/stop), push the changes from your local git repository to the copy in the cloud, and restart the app (by running .openshift/action_hooks/start).

    When I ran it this morning, this step took just under three minutes, including fetching and installing the BaseX binary.

    (In my initial experiments, I ran into a problem here; there is some disagreement between BaseX and OpenShift regarding who can connect to what ports, and the result is that basexhttp stop doesn’t have the desired effect, which means the check-in fails. In the meantime, Andy Bunce has rewritten the quickstart code with a temporary workaround.)

  6. Set up userids.

    BaseX can now be configured, using the dba application that ships with BaseX 8, and which is conveniently linked from the default quickstart welcome page at http://allegheny-AIK.rhcloud.com/.

    The only essential configuration to do at this point is to change the admin password. I try to do this quickly, since until it is done, anyone who happens across this BaseX engine on the open Web can have admin privileges just by logging in using the default password.

    There are a couple of catches that make the process very slightly less straightforward than it might be.

    • The users.xml file will go into the OpenShift data directory, which means that you cannot conveniently put it in place before starting BaseX. (If inconvenience is no object, then by all means, be my guest: first prepare a users.xml file with another copy of BaseX, then scp it to the ./app-root/data/basex/data directory of your OpenShift app, before doing the push in the previous step.)

    • The config file needs to know the admin userid and password, in order to start and stop the server. (At least, I think it needs to know them; I haven’t actually tried putting gibberish into those variables to see.) If we change the admin password, we risk not being able to stop and start the server gracefully.

    This is the technique I’ve worked out for dealing with these catches. There are probably better ways.

    • Log in to the dba application using the default admin userid and password.
    • Add the users in the list prepared earlier, with the corresponding privileges and passwords. Note that this list includes a second user with admin privileges, here called Angie.
    • In the local repo, change the relevant parts of the config file to read as follows (substituting your userid and password of choice, of course):
      BASEX_USER="Angie"
      BASEX_PASSWORD="Where-is-the-devil-in-Evelyn-what's-it-doing-in-Angela's-eyes"
    • Check in your changes and push them to OpenShift (git add config; git commit -m "Changing userid and password"; git push origin master).
    • Log in to the dba application again (actually, you’re probably still logged in), and change the password of the admin user. At this point, we have closed the window of vulnerability we opened when we started the server with the default admin password. When I’m feeling paranoid, I take this moment to check the Databases, Users, Files, and Logs tabs to see whether any intruders actually showed up and did anything.
  7. Test

    To test that things are going as expected, I also create a database and do a few queries with the dba application and the REST interface.
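
Steps 1 to 5 above can be gathered into one script. The sketch below uses only the commands given in those steps; the wrapper functions run and deploy and the DRY_RUN switch are additions for the example. By default it only prints the commands, so it can be read through safely before anything is created. It stops if the cd fails, which guards against pulling the quickstart code into the wrong directory.

```shell
#!/bin/sh
# Steps 1 to 5 in one place. By default the commands are only printed;
# set DRY_RUN=0 to execute them. run and deploy are hypothetical helpers.
APP=${APP:-Allegheny}
UPSTREAM=https://github.com/Quodatum/openshift-basex-quick-start.git
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

deploy() {
    run rhc app create -a "$APP" -t diy-0.1 || return 1
    # Give up here if the cd fails, so that the pull below never
    # lands in the wrong directory.
    run cd "$APP" || return 1
    run git remote add upstream -m master "$UPSTREAM" || return 1
    run git pull -s recursive -X theirs upstream master || return 1
    run git push origin master
}

deploy
```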

So now I know three different ways to have access to an XQuery database over the web:

  • Run it as a server on a machine you control, or on which you can persuade the sysadmin to install it. (I assume that this can be done in the virtual private servers offered by many Web hosting providers, but I haven’t done it that way myself.)
  • Run it as a servlet running under Tomcat on a Java hosting provider.
  • Run it as an application in a cloud service.

My personal experience with these is all with BaseX, but of course all three methods will also work, at least in principle, for other XQuery engines like eXist and MarkLogic.

It’s always good to have more than one string to one’s bow.

Running the BaseX XQuery engine under Tomcat

[30 November 2015; notes on Tomcat 7 and BaseX 8.3.1 added 7 January 2016]

For a long time I have wished to have an XQuery engine I could use as a back end for Web applications.

Once upon a time, one could use 28msec’s Sausalito platform for this, but not since 28msec stopped supporting it. (In theory, one can still use their new tools to build XQuery-based applications, but I was not able to figure out how, the last time I tried.) But as I wrote when I first used Sausalito, “having an XML database on demand is a lot like having running water on demand — those who have never had it may think it’s a luxury anyone should be able to live without, but once you’ve had it, it can be hard to go back.”

I spent a couple days this past week trying again to choose among the various options (an XQuery server running in a shared hosting environment, running in a dedicated virtual private server, running in the cloud) and succeeded, after two days of work, in getting BaseX running under Tomcat at a Java hosting provider. (If you’re looking for Java hosting, I can say that I have been happy at Kattare.com.) Some of those two days went to attempting to make another solution work (but OpenShift appears to be unwilling to install its command-line tools on either of my current operating systems, and my network connection appears to be unequal to the task of downloading an OS upgrade for the sake of making OpenShift’s installation software happy), and some went to missteps.

If I could have spent time only on the things that turned out to work, it would only have taken a couple hours.

For the record, and for the benefit of any other XQuery developers out there who face the task of getting an XQuery engine running in a commercial hosting environment, this is the checklist I am making for myself for next time.

Prerequisites

The following checklist assumes that:

  • Your Java hosting provider provides you with Tomcat.
  • Your Java hosting provider provides you with ssh access.
  • You know how to ssh into your hosting provider and are comfortable using bash or another shell.
  • You have and can run a copy of BaseX installed on your local machine, which is the same version of BaseX as you are going to install in Tomcat. (Warning: if you are running a current BaseX locally, you may need to install a back-level BaseX for preparatory step 3 below.) It’s helpful to know where it’s installed, so you can launch the stand-alone version of BaseX successfully.
  • You are interested in getting an XQuery engine set up, so you can develop XQuery apps; you are not seeking to develop Java applications.
    (Too bad for you; in order to install anything at all in Tomcat you must understand a little about Tomcat, and all of the Tomcat documentation is aimed at Java developers, not at users of Java applications. Sorry about that; I don’t like it either. As Liam Quin recently remarked to me, our refrigerator salesmen manage to tell us how to install the thing without expecting us to read documentation about how to build a freon condenser; someday the people who build Java application containers like Tomcat will realize that their users include people who want to use, not write, Java programs. I can hardly wait for the time when Java infrastructure is documented as intelligently as refrigerators. But so far, the refrigerators are winning.)

If these don’t all apply, you will need to make some adjustments on the fly.

Preliminary preparation

  1. Review enough of the Tomcat configuration on your host (consulting the Tomcat documentation where necessary) to know in advance the answers to the following questions:

    • 1.a What version of Tomcat and Java are you using?

      If you’re running Tomcat 6 and Java 1.6, you’ll need a version of BaseX that runs under Java 1.6. The last of that line was BaseX 7.9.

      If you’re running Tomcat 7 and Java 1.7, you can use the current version of BaseX (as I write this, that’s BaseX 8.3.1).

      If you’re running some other combination, I have no idea whether you can run BaseX 8 or not; there are conflicting reports on the Web, and I’m not interested enough in Java application development to bother trying to track down the details. Either stick with BaseX 7.9 for safety or try BaseX 8 and see if it works.

      Trying to run BaseX 8.x under Tomcat 6 and Java 1.6 will not break anything (that I know of). But the servlet will not deploy, and you will find an error message in your Tomcat logs that (how can I put this tactfully?) appears to be aimed at Java developers and is not very helpful to other readers. In my case, trying to deploy BaseX 8.3.1 in a Tomcat 6 environment cost me a few hours and some hairpulling, because it took time to understand what was going wrong.

      The original version of this post discussed only installation of BaseX 7.9, but I have now upgraded to Tomcat 7 and can also report on the process for BaseX 8.3.1.

    • 1.b What directory serves as the base directory for Tomcat ($CATALINA_BASE)?

      On my hosting provider, it’s my home directory (i.e. ~) on the Java host.

    • 1.c What URI is used for the web application management interface (the “Manager application”) in your Tomcat?

      This was surprisingly time-consuming for me; I host two domains at my Java hosting provider, which I’ll call able.example.com and baker.example.com here, and the Manager application was available only on one of them (the one configured by server.xml as the ‘default’ one). But eventually I found that http://baker.example.com/manager/html/list gave me results; the various candidates with paths like $host/manager/html, $host/manager, $host/server/manager/html/list, $host/server/manager/html, $host/server/manager, and any path at all on the host http://able.example.com/, all failed.

    • 1.d What userid and password do you use to sign in to the Manager application? Or equivalently for most purposes, where is your tomcat-users.xml file?

      Please tell me you and your hosting provider didn’t leave it at the Tomcat default. If you did, now would be a good time to change it, so that random strangers don’t have administration privileges in your Tomcat instance.

      In my installations, this information is in $CATALINA_BASE/conf/tomcat-users.xml; I find it convenient to have that file open in an emacs buffer before trying to navigate to the Manager application, so I can copy and paste the password instead of typing it.

    • 1.e What does the Manager application look like when things are normal?

      That is, look at it beforehand, before you do anything. Otherwise, you’ll never be able to tell whether what you’re looking at afterwards is normal, or you broke something.

    • 1.f Where is the webapps directory for the Tomcat engine and host into which you wish to install?

      In my case, there is one for each virtual host, named with the virtual host name, a hyphen, and the string “webapps”: able_example_com-webapps and baker_example_com-webapps.

    • 1.g How do you stop and restart Tomcat?

    • 1.h Where does your Tomcat write its logs?

      On my hosting provider, these go into ~/logs.
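
Some of the questions in step 1 can be answered from the shell on the Java host. The following is a rough sketch, assuming a layout like the one described above (base directory in the home directory, conf/tomcat-users.xml, per-host directories whose names end in “webapps”, and a logs directory); other hosts will differ. The function name survey_tomcat is made up for the example.

```shell
#!/bin/sh
# Report which of the files and directories asked about in step 1
# can be found under a given Tomcat base directory.
# survey_tomcat is a hypothetical helper; the layout is an assumption.
survey_tomcat() {
    base=$1
    echo "base: $base"
    if [ -f "$base/conf/tomcat-users.xml" ]; then
        echo "users file: $base/conf/tomcat-users.xml"
    else
        echo "users file: not found under conf/"
    fi
    # webapps directories, including per-host ones such as
    # able_example_com-webapps
    for dir in "$base"/*webapps; do
        if [ -d "$dir" ]; then
            echo "webapps: $dir"
        fi
    done
    if [ -d "$base/logs" ]; then
        echo "logs: $base/logs"
    fi
    return 0
}

survey_tomcat "${CATALINA_BASE:-$HOME}"

# Question 1.a: the Java version, if java is on the PATH.
if command -v java >/dev/null 2>&1; then
    java -version 2>&1 | head -n 1
fi
```

The Manager URI and the way to restart Tomcat cannot be found this way; for those, the hosting provider’s own documentation is the place to look.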

  2. Decide in advance what userids you wish to specify for your BaseX database(s); for each userid specify the initial password and the initial permissions for each user. At the very least, determine what userid and password will be used as the default database user for the servlet.

    For what it’s worth, I like to set the database up with two users, initially: an administrator userid with ADMIN permissions, whom I’ll call ‘Abel’ here, and a userid with READ permissions, whom I’ll call ‘Romeo’. For purposes of this exposition I’ll assume their passwords are ‘Elba1812’ and ‘Juliet1597’.

  3. Prepare a suitable file of userids, passwords, and privileges.

    How you do this varies between BaseX 7.9 and BaseX 8.0 and later.

    In BaseX 7.9, the documentation tells us that global users, passwords, and permissions are stored in a .basexperm file in the BaseX home directory. The calculation of what counts as the home directory varies depending on how BaseX is installed, and I am unable to deduce from the documentation what counts as the home directory when BaseX 7.9 is running under Tomcat. Experiment, however, shows that it’s apparently …/webapps/BaseX79/WEB-INF. (At least, that’s where the .basexperm file needs to go.)

    An examination of the .basexperm file on your local installation (look in your home directory) will show you (at least, it showed me) that it’s not intended for editing by a human. We will need to use BaseX to make a .basexperm file for us.

    You might be tempted to say “Wait, let’s get BaseX running on the server, then use the REST interface to create users Abel and Romeo and assign them passwords and permissions, then use the Abel userid to assign permission level NONE to the built-in userid ‘admin’.” I like the way you think, but I don’t think that’s going to work. (Ask me how I know.) The server command for creating or altering passwords does not allow the password value to be given as an argument: instead, the server prompts you for the password. But there is no way for the REST interface to prompt you for a password.

    One workaround (this is apparently what the BaseX developers had in mind for this situation) is to use BaseX 7.9 on your local machine (we’ll use the stand-alone version, but if you’re adventurous you can use the client-server version or the GUI) to make a .basexperm (permissions) file, which you will then copy to the appropriate directory on the server.

    Since I don’t want the .basexperm file I create now to interfere with the existing set of users and passwords on my local machine, I want to run BaseX with a new home directory that is not the one usually used. So what I do in my bash shell is something like the following:

    # make a temporary place for a .basexperm
    # file to reside
    
    mkdir ~/basex-temp-home
    cd ~/basex-temp-home
    echo "# hi there" > .basexhome
    echo "DBPATH = `pwd`/data" > .basex
    
    # Now launch BaseX.  N.B.
    
    /opt/local/BaseX-7.9/basex/bin/basex
    

    In BaseX, I issue the following server commands to create the users I want, and assign a new password for the built-in admin user. I’d assign NONE to admin if I could, but I can’t, so I just reset the password.

    # Just check to see where we are to start with:
    SHOW USERS
    
    CREATE USER Abel
    # at this point, the server prompts for
    # the password 'Elba1812'
    CREATE USER Romeo
    # password:  Juliet1597
    GRANT ADMIN TO Abel
    GRANT READ TO Romeo
    ALTER USER admin
    # password:  Elba1812
    
    # now, just to make sure we are
    # where we want to be
    SHOW USERS
    
    EXIT
    

    If all has gone well, BaseX will have created a .basexperm file in directory ~/basex-temp-home.

    (Alternatively, I guess one might rename the existing .basexperm to .temp.basexperm.old, launch the GUI, issue the server commands given above, exit, move the new .basexperm out of ~ and into some other place, then give the old .basexperm its old name back. But I haven’t tried it this way.)

    In BaseX 8.3.1, the documentation tells us that users, passwords, and permissions are stored in a human-editable users.xml document in the database directory. Experiment shows that if in this environment we add a new user, BaseX 8.3.1 places the users.xml document at …/webapps/BaseX831/data.

    One could go through a process similar to that described above to create a users.xml file, but in practice I became lazy and handled things in a slightly easier, slightly riskier way (see description of “the lazy way” below in steps 8 and 9).

  4. Prepare in advance a small Web-accessible XML file or two to use in creating a helloworld database to check that things are running as you expect.

    I use the documents in http://cmsmcq.com/2015/11/XQuery-over-HTTP/data/ for making a helloworld example. When installing BaseX 8.3.1, it proves simpler to use XML documents on my hard disk.
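
If no suitable documents are at hand, a tiny one can be written on the spot. The sketch below creates such a file and prints, without running it, a curl command that would upload the file with an HTTP PUT. It assumes that the REST interface is enabled, that the servlet ends up at a path like /BaseX79/rest, and that a PUT to a database path creates that database; the host, path, database name, and function names are all placeholders to be checked against your own installation and the BaseX REST documentation.

```shell
#!/bin/sh
# Write a tiny XML document to use as a helloworld database.
# make_hello_doc and show_put_command are hypothetical helpers.
make_hello_doc() {
    cat > "$1" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<greetings>
  <greeting xml:lang="en">Hello, world!</greeting>
</greetings>
EOF
}

# Print (do not run) a curl command that would PUT the document to a
# REST URI. curl -u with a userid alone prompts for the password.
show_put_command() {
    echo "curl -u Abel -X PUT -T $1 $2"
}

doc="${TMPDIR:-/tmp}/hello.xml"
make_hello_doc "$doc"
show_put_command "$doc" http://baker.example.com/BaseX79/rest/hello
```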

Doing the deed

After the four preparation steps just described, the steps for actually installing the software are very simple in principle: first fetch a copy of the WAR file for BaseX into the appropriate webapps directory, adjust its configuration, and then restart Tomcat to deploy the app.

  1. Download the WAR file for BaseX.

    In one window, I use ssh to reach my hosting provider. In another, I navigate to the Downloads area on the BaseX web site.

    In the browser window, I right-click the WAR file and save its URI to the clipboard (in different browsers this is Copy Link, or Copy Link Address, or Copy Link Location).

    In the ssh window, I issue the following commands, using a Paste action (command-V) to insert the URI of the WAR file (substituting the BaseX version number for $vvv):

    mkdir incoming
    cd incoming
    curl --output BaseX$vvv.war $URI-of-WAR-file
    
  2. Open the Manager application in a browser window and keep an eye on it.

  3. Put the WAR file in the webapps directory

    If you do this while Tomcat is running, Tomcat will unpack the WAR file; this is convenient, because you need to edit files inside it. On the other hand, you don’t want BaseX to start up yet, so keep an eye on the Manager window to make sure that BaseX doesn’t suddenly show up there. If it does, use the Manager application to stop it.

  4. Configure BaseX

    The only essential configuration to do at this point is to install the userids prepared earlier and change the default username and password used when the HTTP request supplies none; otherwise anyone who happens across your BaseX engine on the open Web has admin privileges.

    • 8.a Copy the .basexperm or users.xml to your Java hosting provider. (Or, if we are doing things the lazy way, do nothing here.)

      It doesn’t much matter how you do this (but be aware that the file is binary; cut and paste is not going to work). I use scp; from my local machine I type:

      scp .basexperm able.example.com:incoming/.basexperm
      

      I could copy it straight to the webapps directory, but I like to keep a copy in ~/incoming in case I end up zapping the webapps subdirectory.

    • 8.b Place the .basexperm file in the …/webapps/BaseX79/WEB-INF directory, or the users.xml in the …/webapps/BaseX831/data directory. (Again, if we are doing things the lazy way, we do nothing here.)

      In my ssh window on the web host I type:

      cd ~/able_example_com-webapps/BaseX79/WEB-INF
      cp ~/incoming/.basexperm .
      

    • 8.c Change the default user and password by editing the web.xml file. (Even if we are doing things the lazy way, this is worth doing.)

      No one who presents no credentials gets to create a new database or change the setup on our server. So the default user should not have admin permissions, only read permissions. It is this for which we made the read-only user Romeo.

      In …/webapps/BaseX79/WEB-INF/web.xml, close to the top of the file, the default user credentials are set:

      <!-- Set default credentials -->
      <context-param>
        <param-name>org.basex.user</param-name>
        <param-value>admin</param-value>
      </context-param>
      <context-param>
        <param-name>org.basex.password</param-name>
        <param-value>admin</param-value>
      </context-param>
      

      This changes to

      <!-- Set default credentials -->
      <context-param>
        <param-name>org.basex.user</param-name>
        <param-value>Romeo</param-value>
      </context-param>
      <context-param>
        <param-name>org.basex.password</param-name>
        <param-value>Juliet1597</param-value>
      </context-param>
      
    • 8.d Take care of anything else you need to configure.

      In my Tomcat 6 installation, there is already a servlet named ‘default’, so the rules at the end of web.xml for static content cause problems. For the moment, I comment them out. I may try to deal properly with them later, but I am looking for a back end, not a complete web server. If I have static content to be served, Apache will serve it, not BaseX.

      Under Tomcat 7, the rules for static content caused no trouble, so I didn’t comment them out.
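
As a sanity check after editing, the default credentials can be read back out of web.xml. The following Python sketch is not part of BaseX or Tomcat: the function names are made up for illustration, and the embedded XML is just the fragment shown above wrapped in a hypothetical web-app root element. It only reports what the file says; it does not verify that BaseX actually honors the setting.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the relevant part of WEB-INF/web.xml.
WEB_XML = """<web-app>
  <context-param>
    <param-name>org.basex.user</param-name>
    <param-value>admin</param-value>
  </context-param>
  <context-param>
    <param-name>org.basex.password</param-name>
    <param-value>admin</param-value>
  </context-param>
</web-app>"""


def local(tag):
    """Drop a leading '{namespace}' so the code works with or without one."""
    return tag.rsplit("}", 1)[-1]


def context_params(xml_text):
    """Return the context-param name/value pairs found in a web.xml string."""
    params = {}
    for elem in ET.fromstring(xml_text).iter():
        if local(elem.tag) != "context-param":
            continue
        fields = {local(c.tag): (c.text or "").strip() for c in elem}
        if "param-name" in fields:
            params[fields["param-name"]] = fields.get("param-value", "")
    return params


def default_user_is_admin(xml_text):
    """True only if web.xml explicitly names 'admin' as the default user."""
    return context_params(xml_text).get("org.basex.user") == "admin"


print(default_user_is_admin(WEB_XML))
```

To check a real deployment, read the text of the web.xml file from wherever your webapps directory lives and pass it to default_user_is_admin.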

  5. Flip the switch

    Restart Tomcat to cause it to deploy BaseX.

    You can in principle cause Tomcat to deploy the new application by issuing appropriate commands from the Manager application, and possibly even upload the WAR file from your machine, without logging in to your hosting provider. When I tried doing it that way, I ran into other problems, so I’ve never actually done it that way.

    Check in the Manager application to make sure BaseX appears in the list of servlets and is deployed.

    It’s at this point that I ran into failures with BaseX 8.3.1 running under Tomcat 6. The Tomcat logs (remember I told you to find out where your Tomcat logs are going?) said (among many other things):

    Nov 27, 2015 6:44:35 PM org.apache.catalina.startup.HostConfig deployWAR
    SEVERE: Error deploying web application archive BaseX831.war
    java.lang.UnsupportedClassVersionError: org/basex/http/SessionListener :
    Unsupported major.minor version 51.0 (unable to load class org.basex.http.SessionListener)

    Searching on the Web for the wording of the error message tells me that this is what happens when you try to run a Java program compiled with Java 7 (or 1.7? or 51.0? I wonder what drugs Sun engineers were on when they decided how to number Java versions? And how many drugs Oracle engineers have had to consume in order to be willing to continue following the same pattern?) under Java 6 (or 1.6). This is what alerted me to the statement that was there on the BaseX downloads site all along: “Versions before BaseX 8.0 can be run with Java 6.” Read with a heightened sensitivity to subtle entailments, we can infer that they mean: version 8.0 and later won’t run with Java 6.

    This analysis seems consistent with the fact that I get the same error message when I try to run BaseX 8.x on a system whose default Java is Java 6.
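
The “major.minor version 51.0” in the error message is stored in the header of every compiled class file: four magic bytes (CAFEBABE), then the minor version, then the major version, each as a big-endian 16-bit number. Subtracting 44 from the major version gives the Java release (50 is Java 6, 51 is Java 7). A small Python sketch of that arithmetic follows; the function names are invented for illustration, and the sample bytes are a hand-made header rather than a real BaseX class.

```python
import struct


def class_file_version(data):
    """Return (major, minor) from the first eight bytes of a .class file."""
    if len(data) < 8:
        raise ValueError("too short to be a class file")
    magic, minor, major = struct.unpack(">IHH", data[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major, minor


def java_release(major):
    """Map a class-file major version to the Java release that introduced it."""
    return major - 44


# Hand-made header: magic number, minor version 0, major version 51.
header = bytes.fromhex("cafebabe00000033")
major, minor = class_file_version(header)
print(major, minor, java_release(major))
```

Applied to the bytes of a class extracted from the WAR file, this shows in advance whether the code needs a newer Java than the one Tomcat is running under.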

    If we are using BaseX 8.* and doing things the lazy way, it is now that we set up our users and passwords. Do this first thing, since until you do it, your server is open to anyone who happens by and tries “admin” as the userid and “admin” as the password.

    1. Log in to the dba application that comes with BaseX 8, using the default admin username and password. On my setup, the dba application is at http://baker.example.com/BaseX831/dba.
    2. Change the password for the admin account to Elba1812. (This closes the door to our casual intruders.)
    3. Glance at the dba application’s tabs for Databases, Users, Files, and Settings, just to make sure no one has exploited the window of vulnerability we opened by starting the server with the default admin password in place.
    4. Add the userid Abel with password Elba1812.
    5. Add the userid Romeo with password Juliet1597.
  6. Test

    To test that things are going as expected, the REST interface can be used (both in 7.9 and 8.3.1) to create and query a database. In 8.3.1, it’s easier to use the dba application. I’ll describe both methods.

    Using the REST interface

    First, a query that should work from the default user (without credentials):

    http://able.example.com/BaseX79/rest
    

    Hmm. This doesn’t in fact work for me: it prompts me for a userid and password. One possible explanation is that I mistyped the password either when creating the .basexperm file or in the web.xml file. Another is that I have once again managed to misunderstand what the BaseX wiki says about the USER and PASSWORD options. For the moment, however, I am going to ignore this problem of failure in setting the default user. When I do give the appropriate userid and password, I get the expected response from the REST server.

    Next, some queries that should require credentials; if you dereference these using curl or wget without specifying the --user option, you will get an “Access denied” message, and if you try in a browser the browser will normally prompt you for userid and password. Supply user Abel and password Elba1812.

    All of the following URIs contain characters that will need to be escaped; I find that the browser always takes care of that for me.

    http://able.example.com/BaseX79/rest?command=show users
    

    Then create a database and add some documents:

    http://able.example.com/BaseX79/rest?command=create db helloworld http://cmsmcq.com/2015/11/XQuery-over-HTTP/data/doc1.xml
    http://able.example.com/BaseX79/rest/helloworld?command=add http://cmsmcq.com/2015/11/XQuery-over-HTTP/data/doc2.xml
    http://able.example.com/BaseX79/rest/helloworld?command=add http://cmsmcq.com/2015/11/XQuery-over-HTTP/data/doc3.xml
    http://able.example.com/BaseX79/rest/helloworld?command=add http://cmsmcq.com/2015/11/XQuery-over-HTTP/data/doc4.xml
    

    Now we can query them:

    http://able.example.com/BaseX79/rest/helloworld?query={/}
    

    Finally, we can delete the database:

    http://able.example.com/BaseX79/rest?command=drop db helloworld
    

    If the responses to dereferencing these URIs are as expected, then hurrah, it works, and you can now get on with your application development.

    If not, then check the logs (remember I told you to find out where your logs are going?) and good luck to you.
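
The browser escapes the spaces and braces in the URIs above without being asked; a command-line tool or a script has to do it explicitly. Here is a minimal Python sketch of the escaping, using only the standard library. The helper rest_url is a made-up name, and the base URI is the example host used in this post, so adjust both to taste.

```python
from urllib.parse import quote, urlencode

# Example host and path from this post; replace with your own.
BASE = "http://able.example.com/BaseX79/rest"


def rest_url(path="", **params):
    """Build a REST URI with the path and query string percent-encoded."""
    url = BASE + ("/" + quote(path) if path else "")
    if params:
        # quote_via=quote encodes spaces as %20 rather than '+'.
        url += "?" + urlencode(params, quote_via=quote)
    return url


print(rest_url(command="show users"))
print(rest_url("helloworld", query="{/}"))
```

The resulting strings can be handed to curl or wget together with the credentials.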

    Using the dba application in BaseX 8.3.1

    In the Databases tab, create a helloworld database.

    Then use the Add button to add resources to it. In 8.3.1, this expects you to upload files from your browser; it doesn’t seem to accept URIs. So I uploaded my local copies of the four hello-world documents, one at a time.

    In the Queries tab, type some queries in the query widget, click the Run button to evaluate them, and make sure the results are as expected.

    • collection('helloworld')/*
    • collection('helloworld')//greeting[@var='northern']

    The REST queries listed above should also work.

Searching for patterns in XML siblings

[9 April 2013]

I’ve been experimenting with searches for elements in XML documents whose sequences of children match certain patterns. I’ve got some interesting results, but before I can describe them in a way that will make sense for the reader, I’ll have to provide some background information.

For example, consider a TEI-encoded language corpus where each word is tagged as a w element and carries a pos attribute. At the bottom levels of the XML tree, the documents in the corpus might look like this (this extract is from COLT, the Bergen Corpus of London Teenage English, as distributed by ICAME, the International Computer Archive of Modern and Medieval English; an earlier version of COLT was also included in the British National Corpus):

<u id="345" who="14-7">
<s n="407">
<w pos="PPIS1">I</w>
<w pos="VBDZ">was</w>
<w pos="XX">n't</w>
<w pos="JJ">sure</w>
<w pos="DDQ">what</w>
<w pos="TO">to</w>
<w pos="VVI">revise</w>
<w pos="RR">though</w>
</s>
</u>
<u id="346" who="14-1">
<s n="408">
<w pos="PPIS1">I</w>
<w pos="VV0">know</w>
<w pos="YCOM">,</w>
<w pos="VBZ">is</w>
<w pos="PPH1">it</w>
<w pos="AT">the</w>
<w pos="JJ">whole</w>
<w pos="JJ">bloody</w>
<w pos="NN1">book</w>
<w pos="CC">or</w>
<w pos="RR">just</w>
<w pos="AT">the</w>
<w pos="NN2">bits</w>
<w pos="PPHS1">she</w>
<w pos="VVD">tested</w>
<w pos="PPIO2">us</w>
<w pos="RP">on</w>
<w pos="YSTP">.</w>
</s>
</u>

These two u (utterance) elements record part of a conversation between speaker 14-7, a 13-year-old female named Kate of unknown socio-economic background, and speaker 14-1, a 13-year-old female named Sarah in socio-economic group 2 (that’s the middle group; 1 is high, 3 is low).

Suppose our interest is piqued by the phrase “the whole bloody book” and we decide to look at other passages where we find a definite article, followed by two (or more) adjectives, followed by a noun.

Using the part-of-speech tags used here, supplied by CLAWS, the part-of-speech tagger developed at Lancaster’s UCREL (University Centre for Computer Corpus Research on Language), this amounts at a first approximation to searching for a w element with pos = AT, followed by two or more w elements with pos = JJ, followed by a w element with pos = NN1. If we want other possible determiners (“a”, “an”, “every”, “some”, etc.) and not just “the” and “no”, and other kinds of adjective, and other forms of noun, the query eventually looks like this:

let $determiner := ('AT', 'AT1', 'DD',
                    'DD1', 'DD2',
                    'DDQ', 'DDQGE', 'DDQV'),
    $adjective := ('JJ', 'JJR', 'JJT', 'JK',
                   'MC', 'MCMC', 'MC1', 'MD'),
    $noun := ('MC1', 'MC2', 'ND1',
              'NN', 'NN1', 'NN2',
              'NNJ', 'NNJ2',
              'NNL1', 'NNL2',
              'NNT1', 'NNT2',
              'NNU', 'NNU1', 'NNU2',
              'NP', 'NP1', 'NP2',
              'NPD1', 'NPD2',
              'NPM1', 'NPM2' )

let $hits :=
  collection('COLT')
  //w[@pos=$determiner]
     [following-sibling::w[1][@pos = $adjective]
       [following-sibling::w[1][@pos = $adjective]
         [following-sibling::w[1][@pos = $noun]
     ]]]
for $h in $hits return
  <hit doc="{base-uri($h)}">{
    $h,
    <orth>{
      normalize-space(string($h/..))
    }</orth>,
    $h/..
  }</hit>

Such searches pose several problems, for which I’ve been mulling over solutions for a while now.

  • One problem is finding a good way to express the concept of “two or more adjectives”. (The attentive reader will have noticed that the XQuery given searches for determiners followed by exactly two adjectives and a noun, not two or more adjectives.)

    To this, the obvious solution is regular expressions over w elements. The obvious problem standing in the way of this obvious solution is that XPath, XQuery, and XSLT don’t actually have support in their syntax or in their function library for regular expressions over sequences of elements, only regular expressions over sequences of characters.

  • A second problem is finding a syntax for expressing the query which ordinary working linguists will find less daunting or more convenient than XQuery.

    Why ordinary working linguists should find XQuery daunting, I don’t know, but I’m told they will. But even if one doesn’t find XQuery daunting, one may find the syntax required for sibling searches a bit cumbersome. The absence of a named axis meaning “immediate following sibling” is particularly inconvenient, because it means one must perpetually remember to add “[1]” to steps; experience shows that forgetting that predicate in even one place can lead to bewildering results. Fortunately (or not), the world in general (and even just the world of corpus linguistics) contains a large number of query languages that can be adopted or used for inspiration.

    Once such a syntax is invented or identified, of course, one will have the problem of building an evaluator for expressions in the new language, for example by transforming expressions in the new syntax into XQuery expressions which the XQuery engine or an XSLT processor evaluates, or by writing an interpreter for the new language in XQuery or XSLT.

  • A third problem is finding a good way to make the queries faster.

    I’ve been experimenting with building user-level indices to help with this. By user-level indices I mean user-constructed XML documents which serve the same purpose as dbms-managed indices: they contain a subset of the information in the primary (or ‘real’) documents, typically in a different structure, and they can make certain queries faster. They are not to be confused with the indices that most database management systems can build on their own, with or without user action. Preliminary results are encouraging.
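
To make the first problem concrete, here is one possible workaround, sketched in Python rather than XQuery so that it can be run without a database: encode each run of sibling w elements as a string of part-of-speech tags and let an ordinary character-level regular expression do the work. This is an illustration under stated assumptions, not the solution developed in later posts. It assumes that tags contain no spaces, the function names are invented, and the sample is a shortened copy of the second utterance above. Note also that finditer reports only non-overlapping matches.

```python
import re
import xml.etree.ElementTree as ET

# Tag classes copied from the XQuery above.
DETERMINER = ["AT", "AT1", "DD", "DD1", "DD2", "DDQ", "DDQGE", "DDQV"]
ADJECTIVE = ["JJ", "JJR", "JJT", "JK", "MC", "MCMC", "MC1", "MD"]
NOUN = ["MC1", "MC2", "ND1", "NN", "NN1", "NN2", "NNJ", "NNJ2",
        "NNL1", "NNL2", "NNT1", "NNT2", "NNU", "NNU1", "NNU2",
        "NP", "NP1", "NP2", "NPD1", "NPD2", "NPM1", "NPM2"]


def one_of(tags):
    """Regex fragment matching one encoded token (tag plus trailing space)."""
    return "(?:" + "|".join(re.escape(t) for t in tags) + ") "


# Determiner, two or more adjectives, noun; a match must start at a token boundary.
PATTERN = re.compile(
    r"(?:^|(?<= ))" + one_of(DETERMINER)
    + "(?:" + one_of(ADJECTIVE) + "){2,}"
    + one_of(NOUN)
)


def find_phrases(xml_text):
    """Return the word forms of every matching run of sibling w elements."""
    hits = []
    for parent in ET.fromstring(xml_text).iter():
        words = [child for child in parent if child.tag == "w"]
        # One tag per token, each followed by a single space.
        encoded = "".join(w.get("pos", "?") + " " for w in words)
        for m in PATTERN.finditer(encoded):
            first = encoded[:m.start()].count(" ")   # index of first matched token
            count = m.group().count(" ")             # number of matched tokens
            hits.append([w.text for w in words[first:first + count]])
    return hits


SAMPLE = """<u id="346" who="14-1"><s n="408">
<w pos="VBZ">is</w><w pos="PPH1">it</w>
<w pos="AT">the</w><w pos="JJ">whole</w><w pos="JJ">bloody</w><w pos="NN1">book</w>
<w pos="CC">or</w><w pos="RR">just</w><w pos="AT">the</w><w pos="NN2">bits</w>
</s></u>"""

print(find_phrases(SAMPLE))  # → [['the', 'whole', 'bloody', 'book']]
```

The trailing space after each tag is what keeps AT from matching the first two characters of AT1; the price of the trick is the bookkeeping needed to map character offsets back to elements.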

More on these individual problems in other posts.

Recursive descent parsing in XQuery (and other functional languages)

[7 January 2013; typos in code patterns corrected 8 January and 21 June 2013]

Everyone who designs or builds systems to do interesting work occasionally needs to deal with input in some specialized notation or other. Nowadays a lot of specialized information is in XML, and in that case the need is to deal with vocabularies designed for the particular kind of specialized information involved. But sometimes specialized data comes in its own non-XML syntax — ISBNs, URIs, expressions in symbolic logic, chess-game notation, textual user interface languages, query languages, and so on. Even in the XML universe there are plenty of notations for structured information of various kinds that are not XML: DTDs, XPath expressions, CSS, XQuery.

In these cases, if you want to do anything useful that depends on understanding the structure of the data, you’re going to need to write a parser for its notation. For some simple cases like ISBNs and ISSNs, or URIs, you can get by with regular expressions. (At least, more or less: if you want to validate the check digit in an ISBN or ISSN, regular expressions are going to have a hard time getting the job done, though oddly enough a finite state automaton would have no particular trouble with it.) But many interesting notations are context-free languages, which means regular expressions don’t suffice to distinguish well-formed expressions from other strings of characters, or to identify the internal structure of expressions.

Now, if you’re writing in C and you run into this problem, you can easily use yacc and lex to generate a parser for your language (assuming it satisfies the requirements of yacc). If you’re writing in Java, there are several parser-generator tools to choose from. If you’re writing in a less widely used language, you may find a parser generator, or you may not.

It’s handy, in this situation, to be able to write your own parsers from scratch.

By far the simplest method for hand-written parsing is the one known as recursive descent. For each non-terminal symbol in the grammar, there is a routine whose job it is to read strings in the input which represent that non-terminal. The structure of the routine follows the structure of the grammar rules for that non-terminal in a simple way, which makes recursive-descent parsers feel close to the structure of the information and also to the structure of the parsing process. (The parser generated by yacc, on the other hand, remains a completely opaque black box to me, even after years of using it.)

In his book Compiler Construction (Harlow, England: Addison-Wesley, 1996, tr. from Grundlagen und Techniken des Compilerbaus [Bonn: Addison-Wesley, 1996]), the Swiss computer scientist Niklaus Wirth summarizes the rules for formulating a recursive-descent parser in a table occupying less than half a page. For each construct in the EBNF notation for grammars, Wirth shows a corresponding construct in an imperative programming language (Oberon), so before I show the table I should perhaps review the EBNF notation. In any formal notation for grammars, a grammar is made up of a sequence of grammar productions, and each production (in the case of context-free grammars) consists of a single non-terminal symbol on the left-hand side of the rule, and an expression on the right-hand side of the rule which represents the possible realizations of the non-terminal. The right-hand side expression is made up of non-terminal symbols and terminal symbols (e.g. quoted strings), arranged in sequences, separated as need be by choice operators (for choice, the or-bar | is used), with parentheses, square brackets (which mark optional material), and braces (which mark material that can occur zero or more times).

Wirth’s EBNF for EBNF will serve to illustrate the syntax:

syntax = {production}.
production = identifier “=” expression “.”.
expression = term {“|” term}.
term = factor {factor}.
factor = identifier | string | “(” expression “)” | “[” expression “]” | “{” expression “}”.
identifier = letter {letter | digit}.
string = “”” {character} “””.
letter = “A” | … | “Z”.
digit = “0” | … | “9”.

(It may be worth noting that this formulation achieves its simplicity in part by hand-waving: it doesn’t say anything about whitespace separating identifiers, and the definition of string is not one a machine can be expected to read and understand. But Wirth isn’t writing this grammar for a machine, but for human readers.)

It’s easy to see that the routines in a recursive-descent parser for a grammar in this notation must deal with six constructs on the right-hand side of rules: strings, parenthesized expressions (three kinds), sequences of expressions, and choices between expressions. Wirth summarizes the necessary code in this table with the construct K on the left, and the program for it, Pr(K), on the right. In the code fragments, sym is a global variable representing the symbol most recently read from the input stream, and next is the routine responsible for reading the input stream and updating sym. The meta-expression first(K) denotes the set of symbols which can possibly occur as the first symbol of a string derived from construct K.

K                            Pr(K)

“x”                          IF sym = “x” THEN next ELSE error END
(exp)                        Pr(exp)
[exp]                        IF sym IN first(exp) THEN Pr(exp) END
{exp}                        WHILE sym IN first(exp) DO Pr(exp) END
fac0 fac1 … facn             Pr(fac0); Pr(fac1); … ; Pr(facn)
term0 | term1 | … | termn    CASE sym OF
                                first(term0) : Pr(term0)
                              | first(term1) : Pr(term1)
                              | …
                              | first(termn) : Pr(termn)
                             END

This is easy enough to express in any language that has global variables and assignment statements. But what do we do when we are writing an application in a functional language, like XQuery or XSLT, and need to parse sentences in some context-free language? No assignment statements, and all functions have the property that if you call them several times with the same argument you will always get the same results back. [Addendum, 8 January 2013: XQuery and XSLT users do in fact have access to useful parser generators: see the pointers to Gunther Rademacher’s REx and Dimitre Novatchev’s FXSL provided in the comments. The question does, however, still arise for those who want to write recursive-descent parsers along the lines sketched by Wirth, which is where this post is trying to go.]

I’ve wondered about this for some time (my notes show I was worrying about it a year ago today), and the other day a simple solution occurred to me: each of the functions in a recursive descent parser depends on the state of the input, so in a functional language the state of the input has to be passed to the function as an argument. And each function changes the state of the input (by advancing the input pointer), which in a functional language we can represent by having each function return the new state of the input and the new current symbol as its result.

A new table, analogous to Wirth’s, but with XQuery code patterns on the right-hand side, looks like this. Here, the common assumption is that each function is passed a parameter named $E0 whose value is an env element with two children: sym contains the current symbol and input contains the remainder of the input (which for simplicity I’m going to assume is a string). If an error condition arises, an error element is added to the environment. The job of reading the next token is handled by the function next().

K                            Pr(K, $E0)

“x”                          if ($E0/sym = "x")
                               then next($E0)
                               else <env>
                                 <error>expected “x” but did not find it</error>
                                 {$E0/*}
                               </env>

(exp)                        Pr(exp, $E0)

[exp]                        if ($E0/sym = first(exp)) then
                               Pr(exp, $E0)
                             else
                               $E0

{exp}                        This requires two translations. For each such sequence exp, we declare a function:

                             declare function seq_exp(
                               $E0 as element(env)
                             ) as element(env) {
                               if ($E0/sym = first(exp)) then
                                 let $E1 := Pr(exp, $E0),
                                     $E2 := seq_exp($E1)
                                 return $E2
                               else
                                 $E0
                             };

                             Inline, we just call that function:

                             seq_exp($E0)

fac0 fac1 … facn             let $E1 := Pr(fac0, $E0),
                                 $E2 := Pr(fac1, $E1),
                                 … ,
                                 $En+1 := Pr(facn, $En)
                             return $En+1

term0 | term1 | … | termn    if ($E0/sym = first(term0)) then
                               Pr(term0, $E0)
                             else if ($E0/sym = first(term1)) then
                               Pr(term1, $E0)
                             …
                             else if ($E0/sym = first(termn)) then
                               Pr(termn, $E0)
                             else <env>
                               <error>expected term0, term1, … or termn but did not find it</error>
                               {$E0/*}
                             </env>

Like Wirth’s, the code shown here produces a recognizer that doesn’t do anything with the input except read it and accept or reject it; like Wirth’s, it can easily be extended to do things with the input. (My first instinct, of course, is to return an XML representation of the input string’s abstract syntax tree, so I can process it further using other XML tools.)
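
The same state-threading idea can be tried out in any language by refusing to use assignment to shared state. The sketch below does so in Python for a toy grammar that is not from Wirth or from this post (expression = term {"+" term}. term = factor {"*" factor}. factor = digit | "(" expression ")".). The Env tuple plays the role of the env element, next_sym stands in for next(), and all the names are assumptions made for the illustration. Unlike the patterns above, it stops at the first error instead of accumulating error elements.

```python
from typing import NamedTuple, Optional

DIGITS = frozenset("0123456789")


class Env(NamedTuple):
    sym: str                      # current symbol; "" means end of input
    input: str                    # remainder of the input
    error: Optional[str] = None   # first error encountered, if any


def next_sym(e):
    """Return a new Env with the next character as the current symbol."""
    return Env(e.input[:1], e.input[1:], e.error)


def fail(e, message):
    """Record an error unless one has already been recorded."""
    return e if e.error is not None else e._replace(error=message)


def expect(x, e):
    # Pattern for a terminal "x".
    if e.error is None and e.sym == x:
        return next_sym(e)
    return fail(e, f"expected {x!r} but found {e.sym!r}")


def factor(e):
    # factor = digit | "(" expression ")": choice on the current symbol.
    if e.error is not None:
        return e
    if e.sym in DIGITS:
        return next_sym(e)
    if e.sym == "(":
        return expect(")", expression(next_sym(e)))
    return fail(e, f"expected digit or '(' but found {e.sym!r}")


def more_factors(e):
    # {"*" factor}: recursion takes the place of the WHILE loop.
    if e.error is None and e.sym == "*":
        return more_factors(factor(next_sym(e)))
    return e


def term(e):
    # term = factor {"*" factor}: each step receives the Env returned by the last.
    return more_factors(factor(e))


def more_terms(e):
    # {"+" term}
    if e.error is None and e.sym == "+":
        return more_terms(term(next_sym(e)))
    return e


def expression(e):
    # expression = term {"+" term}
    return more_terms(term(e))


def recognizes(text):
    """True if the whole string is a sentence of the toy grammar."""
    e = expression(next_sym(Env("", text)))
    return e.error is None and e.sym == ""


print(recognizes("1+2*(3+4)"), recognizes("1+"))  # → True False
```

No function modifies its argument; each returns a fresh Env, so calling any of them twice with the same Env gives the same result, which is the property that XQuery and XSLT functions have as well.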