Running the BaseX XQuery engine under Tomcat

[30 November 2015; notes on Tomcat 7 and BaseX 8.3.1 added 7 January 2016]

For a long time I have wished to have an XQuery engine I could use as a back end for Web applications.

Once upon a time, one could use 28msec’s Sausalito platform for this, but not since 28msec stopped supporting it. (In theory, one can still use their new tools to build XQuery-based applications, but I was not able to figure out how, the last time I tried.) But as I wrote when I first used Sausalito, “having an XML database on demand is a lot like having running water on demand — those who have never had it may think it’s a luxury anyone should be able to live without, but once you’ve had it, it can be hard to go back.”

I spent a couple days this past week trying again to choose among the various options (an XQuery server running in a shared hosting environment, running in a dedicated virtual private server, running in the cloud) and succeeded, after two days of work, in getting BaseX running under Tomcat at a Java hosting provider. (If you’re looking for Java hosting, I can say that I have been happy at Some of those two days went to attempting to make another solution work (but OpenShift appears to be unwilling to install its command-line tools on either of my current operating systems, and my network connection appears to be unequal to the task of downloading an OS upgrade for the sake of making OpenShift’s installation software happy), and some went to missteps.

If I could have spent time only on the things that turned out to work, it would only have taken a couple hours.

For the record, and for the benefit of any other XQuery developers out there who face the task of getting an XQuery engine running in a commercial hosting environment, this is the checklist I am making for myself for next time.


The following checklist assumes that:

  • Your Java hosting provider provides you with Tomcat.
  • Your Java hosting provider provides you with ssh access.
  • You know how to ssh into your hosting provider and are comfortable using bash or another shell.
  • You have and can run a copy of BaseX installed on your local machine, which is the same version of BaseX as you are going to install in Tomcat. (Warning: if you are running a current BaseX locally, you may need to install a back-level BaseX for preparatory step 3 below.) It’s helpful to know where it’s installed, so you can launch the stand-alone version of BaseX successfully.
  • You are interested in getting an XQuery engine set up, so you can develop XQuery apps; you are not seeking to develop Java applications.
    (Too bad for you; in order to install anything at all in Tomcat you must understand a little about Tomcat, and all of the Tomcat documentation is aimed at Java developers, not at users of Java applications. Sorry about that; I don’t like it either. As Liam Quin recently remarked to me, our refrigerator salesmen manage to tell us how to install the thing without expecting us to read documentation about how to build a freon condenser; someday the people who build Java application containers like Tomcat will realize that their users include people who want to use not write Java programs. I can hardly wait for the time when Java infrastructure is documented as intelligently as refrigerators. But so far, the refrigerators are winning.)

If these don’t all apply, you will need to make some adjustments on the fly.

Preliminary preparation

  1. Review enough of the Tomcat configuration on your host (consulting the Tomcat documentation where necessary) to know in advance the answers to the following questions:

    • 1.a What version of Tomcat and Java are you using?

      If you’re running Tomcat 6 and Java 1.6, you’ll need a version of BaseX that runs under Java 1.6. The last of that line was BaseX 7.9.

      If you’re running Tomcat 7 and Java 1.7, you can use the current version of BaseX (as I write this, that’s BaseX 8.3.1).

      If you’re running some other combination, I have no idea whether you can run BaseX 8 or not; there are conflicting reports on the Web, and I’m not interested enough in Java application development to bother trying to track down the details. Either stick with BaseX 7.9 for safety or try BaseX 8 and see if it works.

      Trying to run BaseX 8.x under Tomcat 6 and Java 1.6 will not break anything (that I know of). But the servlet will not deploy, and you will find an error message in your Tomcat logs that (how can I put this tactfully?) appears to be aimed at Java developers and is not very helpful to other readers. In my case, trying to deploy BaseX 8.3.1 in a Tomcat 6 environment cost me a few hours and some hairpulling, because it took time to understand what was going wrong.

      The original version of this post discussed only installation of BaseX 7.9, but I had now upgraded to Tomcat 7 and can also report on the process for BaseX 8.3.1.

    • 1.b What directory serves as the base directory for Tomcat ($CATALINA_BASE)?

      On my hosting provider, it’s my home directory (i.e. ~) on the Java host.

    • 1.c What URI is used for the web application management interface (the “Manager application”) in your Tomcat?

      This was surprisingly time consuming for me; I host two domains at my Java hosting provider, which I’ll call and here, and the Manager application was available only on one of them (the one configured by server.xml as the ‘default’ one). But eventually I found that gave me results; the various candidates with paths like $host/manager/html, $host/manager, $host/server/manager/html/list, $host/server/manager/html, $host/server/manager, and any path at all on the host, all failed.

    • 1.d What userid and password do you use to sign in to the Manager application? Or equivalently for most purposes, where is your tomcat-users.xml file?

      Please tell me you and your hosting provider didn’t leave it at the Tomcat default. If you did, now would be a good time to change it, so that random strangers don’t have administration privileges in your Tomcat instance.

      In my installations, this information is in $CATALINA_BASE/conf/tomcat-users.xml; I find it convenient to have that file open in an emacs buffer before trying to navigate to the Manager application, so I can copy and paste the password instead of typing it.

    • 1.e What does the Manager application look like when things are normal?

      That is, look at it before hand, before you do anything. Otherwise, you’ll never be able to tell whether what you’re looking at afterwards is normal, or you broke something.

    • 1.f Where is the webapps directory for the Tomcat engine and host into which you wish to install?

      In my case, there is one for each virtual host, named with the virtual host name, a hyphen, and the string “webapps”: able_example_com-webapps and baker_example_com-webapps.

    • 1.g How do you stop and restart Tomcat?

    • 1.h Where does your Tomcat write its logs?

    • On my hosting provider, these go into ~/logs.

  2. Decide in advance what userids you wish to specify for your BaseX database(s); for each userid specify the initial password and the initial permissions for each user. At the very least, determine what userid and password will be used as the default database user for the servlet.

    For what it’s worth, I like to set the database up with two users, initially: an administrator userid with ADMIN permissions, whom I’ll call ‘Abel’ here, and a userid with READ permissions, whom I’ll call ‘Romeo’. For purposes of this exposition I’ll assume their passwords are ‘Elba1812’ and ‘Juliet1597’.

  3. Prepare a suitable file of userids, passwords, and privileges.

    How you do this varies between BaseX 7.9 and BaseX 8.0 and later.

    In BaseX 7.9, the documentation tells us that global users, passwords, and permissions are stored in a .basexperm file in the BaseX home directory. The calculation of what counts as the home directory varies depending on how BaseX is installed, and I am unable to deduce from the documentation what counts as the home directory when BaseX 7.9 is running under Tomcat. Experiment, however, shows that it’s apparently …/webapps/BaseX79/WEB-INF. (At least, that’s where the .basexperm file needs to go.)

    An examination of the .basexperm file on your local installation (look in your home directory) will show you (at least, it showed me) that it’s not intended for editing by a human. We will need to use BaseX to make a .basexperm file for us.

    You might be tempted to say “Wait, let’s get BaseX running on the server, then use the REST interface to create users Abel and Romeo and assign them passwords and permissions, then use the Abel userid to assign permission level NONE to the built-in userid ‘admin’. I like the way you think, but I don’t think that’s going to work. (Ask me how I know.) The server command for creating or altering passwords does not allow the password value to be given as an argument: instead, the server prompts you for the password. But there is no way for the REST interface to prompt you for a password.

    One workaround (this is apparently what the BaseX developers had in mind for this situation) is to use BaseX 7.9 on your local machine (we’ll use the stand-alone version, but if you’re adventurous you can use the client-server version or the GUI) to make a .basexperm (permissions) file, which you will then copy to the appropriate directory on the server.

    Since I don’t want the .basexperm file I create now to interfere with the existing set of users and passwords on my local machine, I want to run BaseX with a new home directory that is not the one usually used. So what I do in my bash shell is something like the following:

    # make a temporary place for a .basexperm
    # file to reside
    mkdir ~/basex-temp-home
    cd ~/basex-temp-home
    echo "# hi there" > .basexhome
    echo "DBPATH = `pwd`/data" > .basex
    # Now launch BaseX.  N.B.

    In BaseX, I issue the following server commands to create the users I want, and assign a new password for the built-in admin user. I’d assign NONE to admin if I could, but I can’t, so I just reset the password.

    # Just check to see where we are to start with:
    # at this point, the server prompts for
    # the password 'Elba1812'
    # password:  Juliet1597
    ALTER USER admin
    # password:  Elba1812
    # now, just to make sure we are
    # where we want to be

    If all has gone well, BaseX will have created a .basexperm file in directory ~/basex-temp-home.

    (Alternatively, I guess one might rename the existing .basexperm to .temp.basexperm.old, launch the GUI, issue the server commands given above, exit, move the new .basexperm out of ~ and into some other place, then give the old .basexperm its old name back. But I haven’t tried it this way.)

    In BaseX 8.3.1, the documentation tells us that users, passwords, and permissions are stored in a human-editable users.xml document in the database directory. Experiment shows that if in this environment we add a new user, BaseX 8.3.1 places the users.xml document at …/webapps/BaseX831/data.

    One could go through a process similar to that described above to create a users.xml file, but in practice I became lazy and handled things a slightly easier, slightly riskier way (see description of “the lazy way” below in steps 8 and 9).

  4. Prepare in advance a small Web-accessible XML file or two to use in creating a helloworld database to check that things are running as you expect.

    I use the documents in for making a helloworld example. When installing BaseX 8.3.1, it proves simpler to use XML documents on my hard disk.

Doing the deed

After the four preparation steps just described, the steps for actually installing the software are very simple in principle: first fetch a copy of the WAR file for BaseX into the appropriate webapps directory, adjust its configuration, and then restart Tomcat to deploy the app.

  1. Download the WAR file for BaseX.

    In one window, I use ssh to reach my hosting provider. In another, I navigate to the Downloads area on the BaseX web site.

    In the browser window, I right-click the WAR file and save its URI to the clipboard (in different browsers this is Copy Link, or Copy Link Address, or Copy Link Location).

    In the ssh window, I issue the following commands, using a Paste action (command-V) to insert the URI of the WAR file (substituting the BaseX version number for $vvv):

    mkdir incoming
    cd incoming
    curl --output BaseX$vvv.war $URI-of-WAR-file
  2. Open the Manager application in a browser window and keep an eye on it.

  3. Put the WAR file in the webapps directory

    If you do this while Tomcat is running, Tomcat will unpack the WAR file; this is convenient, because you need to edit files inside it. On the other hand, you don’t want BaseX to start up yet, so keep an eye on the Manager window to make sure that BaseX doesn’t suddenly show up there. If it does, use the Manager application to stop it.

  4. Configure BaseX

    The only essential configuration to do at this point is to install the userids prepared earlier and change the default username and password used when the HTTP request supplies none; otherwise anyone who happens across your BaseX engine on the open Web has admin privileges.

    • 8.a Copy the .basexperm or users.xml to your Java hosting provider. (Or, if we are doing things the lazy way, do nothing here.)

      It doesn’t much matter how you do this (but be aware that the file is binary; cut and paste is not going to work). I use scp; from my local machine I type:

      scp .basexperm

      I could copy it straight to the webapps directory, but I like to keep a copy in ~/incoming in case I end up zapping the webapps subdirectory.

    • 8.b Place the .basexperm file in the …/webapps/BaseX79/WEB-INF directory, or the users.xml in the …/webapps/BaseX831/data directory. (Again, if we are doing things the lazy way, we do nothing here.)

      In my ssh window on the web host I type:

      cd ~/able_example_com-webapps/BaseX79
      cp ~/incoming/.basexperm .

      8.c Change the default user and password by editing the web.xml file. (Even if we are doing things the lazy way, this is worth doing.)

      No one who presents no credentials gets to create a new database or change the setup on our server. So the default user should not have admin permissions, only read permissions. It is this for which we made the read-only user Romeo.

      In …/webapps/BaseX79/WEB-INF/web.xml, close to the top of the file, the default user credentials are set:

      <!-- Set default credentials -->

      This changes to

      <!-- Set default credentials -->
    • 8.d Take care of anything else you need to configure.

      In my Tomcat 6 installation, there is already a servlet named ‘default’, so the rules at the end of web.xml for static content cause problems. For the moment, I comment them out. I may try to deal properly with them later, but I am looking for a back end, not a complete web server. If I have static content to be served, Apache will serve it, not BaseX.

      Under Tomcat 7, the rules for static content caused no trouble, so I didn’t comment them out.

  5. Flip the switch

    Restart Tomcat to cause it to deploy BaseX.

    You can in principle cause Tomcat to deploy the new application by issuing appropriate commands from the Manager application, and possibly even upload the WAR file from your machine, without logging in to your hosting provider. When I tried doing it that way, I ran into other problems, so I’ve never actually done it that way.

    Check in the Manager application to make sure BaseX appears in the list of servlets and is deployed.

    It’s at this point that I ran into failures with BaseX 8.3.1 running under Tomcat 6. The Tomcat logs (remember I told you to find out where your Tomcat logs are going?) said (among many other things):

    Nov 27, 2015 6:44:35 PM org.apache.catalina.startup.HostConfig deployWAR

    SEVERE: Error deploying web application archive BaseX831.war

    java.lang.UnsupportedClassVersionError: org/basex/http/SessionListener :
    Unsupported major.minor version 51.0 (unable to load class org.basex.http.SessionListener)

    Searching on the Web for the wording of the error message tells me that this is what happens when you try to run a Java program compiled with Java 7 (or 1.7? or 51.0? I wonder what drugs Sun engineers were on when they decided how to number Java versions? And how many drugs Oracle engineers have had to consume in order to be willing to continue following the same pattern?) under Java 6 (or 1.6). This is what alerted me to the statement that was there on the BaseX downloads site all along: “Versions before BaseX 8.0 can be run with Java 6.” Read with a heightened sensitivity to subtle entailments, we can infer that they mean: version 8.0 and later won’t run with Java 6.

    This analysis seems consistent with the fact that I get the same error message when I try to run BaseX 8.x on a system whose default Java is Java 6.

    If we are using BaseX 8.* and doing things the lazy way, it is now that we set up our users and passwords. Do this first thing, since until you do it, your server is open to anyone who happens by and tries “admin” as the userid and “admin” as the password.

    1. Log in to the dba application that comes with BaseX 8, using the default admin username and password. On my setup, the dba application is at
    2. Change the password for the admin account to Elba1812. (This closes the door to our casual intruders.)
    3. Glance at the dba application’s tabs for Databases, Users, Files, and Settings,
      just to make sure no one has exploited the window of vulnerability we opened by starting the server with the default admin password in place.
    4. Add the userid Abel with password Elba1812.
    5. Add the userid Romeo with password Juliet1597.
  6. Test

    To test that things are going as expected, the REST interface can be used (both in 7.9 and 8.3.1) to create and query a database. In 8.3.1, it’s easier to use the dba application. I’ll describe both methods.

    Using the REST interface

    First, a query that should work from the default user (without credentials):

    Hmm. This doesn’t in fact work for me: it prompts me for a userid and password. One possible explanation is that I mistyped the password either when creating the .basexperm file or in the web.xml file. Another is that I have once again managed to misunderstand what the BaseX wiki says about the USER and PASSWORD options. For the moment, however, I am going to ignore this problem of failure in setting the default user. When I do give the appropriate userid and password, I get the expected response from the REST server.

    Next, some queries that should require credentials; if you dereference this using curl or wget without specifying the –user option, you will get an “Access denied” message, and if you try in a browser the browser will normally prompt you for userid and password. Supply user Abel and password Elba1812.

    All of the following URIs contain characters that will need to be escaped; I find that the browser always takes care of that for me. users

    Then create a database and add some documents: db helloworld

    Now we can query them:{/}

    Finally, we can delete the database: db helloworld

    If the responses to dereferencing these URIs is as expected, then hurrah, it works, and you can now get on with your application development.

    If not, then check the logs (remember I told you to find out where your logs are going?) and good luck to you.

    Using the dba application in BaseX 8.3.1

    In the Databases tab, create a helloworld database.

    Then use the Add button to add resources to it. In 8.3.1, this expects
    you to upload files from your browser; it doesn’t seem to accept URIs. So I uploaded
    my local copies of the four hello-world documents, one at a time.

    In the Queries tab, type some queries in the query widget, click the Run button to evaluate them, and make sure the results are as expected.

    • collection('helloworld')/*
    • collection('helloworld')//greeting[@var='northern']

    The REST queries listed above should also work.

Mechanization of logic and mechanization of numerical calculation

[9 January 2015]

In their book Mechanization of reasoning in a historical perspective (Amsterdam: Rodopi, 1995), Witold Marciszewski and Roman Murawski write (actually, this chapter is by WM):

Albert Einstein could not do without arithmetical data-processing in his computations, but had no need to resort to rules of formalized deduction in his reasonings …

Point taken. (It should be noted that WM and RM define data-processing broadly, to include things like the mechanical rules for multiplication and long division taught in primary schools.) We use mechanical rules for arithmetic calculation all the time, but very seldom for inference. It would be interesting to have a theory as to why this might be so. Perhaps humans are better at inference than at arithmetic? (Certain kinds of inference will, I guess, have been reinforced and selected for by evolution.) Or perhaps humans are so bad at inference that people seldom or never engage in chains of inferences that would require mechanical aid? As if we never did arithmetic with numbers greater than ten?

Of course, sometimes people make mistaken inferences, which mechanized reasoning aids could in theory prevent.

And compare Leibniz’s suggestion that people would do better in some endeavors if they did resort more often to rules of formalized deduction (in “De logica nova condenda”, in Die Grundlagen des logischen Kalküls, ed. and tr. Franz Schupp with Stephanie Weber [Hamburg: Meiner,2000], p. 2 [my translation, following Schupp and Weber]):

Est vero in notra potestate, ut in colligendo non erremus, si scilicet quoad argumentandi formam rigide observemus regulas Logicas, quoad materiam vero nullas assumamus propositiones, quarum vel veritas, vel major ex datis probabilitas, non sit jam antea rigorose demonstrata. Quam methodum secuti sunt Mathematici, admirando cum successu.

Est etiam in potestate nostra, ut controversias finiamus, si scilicet argumenta quae afferuntur in formam accurate redigamus, non syllogismos tantum formando atque examinando, set et prosyllogismos, et prosyllogismorum prosyllogismos, donec vel absolvatur probatio, vel constet quid adhuc investigandum probandumve argumentanti supersit, ne scilicet inani circulo priora repetat, et Diogenis dolium volvat.

But it is in our power not to err in logical inference, namely if we rigidly observe the rules of logic, with respect to the form of argument, and if with respect to the subject matter weal assume nothing whose truth, or at least its likelihood, has not been shown on the basis of available data. Which is the method that the mathematicians have followed with admirable success.

It is also in our power to put an end to controversies, namely if we put the arguments brought forward accurately into a form, in which we not only formulate and examine the syllogisms of the argument, but the prosyllogisms, and the prosyllogisms of the prosyllogisms, until finally the proof is completed, or else it is established what parts of the argument must be further investigated in order to avoid falling into an empty circle repeating what has already been said, and rolling the tub of Diogenes.

Is there a contradiction here, or can these two views be reconciled?

More digital than thou

[16 December 2013]

An odd thing has started happening in reviews for the Digital Humanities conference: reviewers are objecting to papers if the reviewer thinks it has relevance beyond the field of DH, apparently on the grounds that the topic is then insufficiently digital. It doesn’t matter how relevant the topic is to work in DH, or how deeply embedded the topic is in a core DH topic like text encoding — if some reviewers don’t see a computer in the proposal, they want to exclude it from the conference.

[“You’re making this up,” said my evil twin Enrique. “That’s absurd; I don’t believe it.” Absurd, yes, but made up? no. The reviewing period for DH 2014 just ended. One paper proposal I reviewed addressed the conceptual analysis of … I’ll call the topic X, to preserve some fig leaf of anonymity here; X in turn is a core practice of many DH projects and an essential part of any account of the meaning of tag sets like that of the Text Encoding Initiative. I thought the paper constituted a useful contribution to an ongoing discussion within the framework of DH. Another reviewer found the paper interesting and thoughtful but found nothing specifically digital in the proposal [after all, X can arise in a pen and paper world, too, computers are not essential], graded it 0 for relevance to DH theory and practice, and voted to reject it. It should be submitted, this reviewer said, to a conference in [another field where X is a concern] and not in a DH conference. I showed this to Enrique. For once, he was speechless. Thank you, Reviewer 2!]

At some level, the question is what counts as work in the field of digital humanities. Is it only work in which computers figure in an essential role? Or is digital humanities concerned with the application of computers to the humanities and all the problems that arise in that effort? I think the latter. Some reviewers appear to think the former. If solving an essential problem turns out to involve considerations that are not peculiar to work with computing machines, however, what do we do? I believe that the problem remains relevant to DH because it’s rooted in DH and because those interested in doing good work in DH need the answer. The other reviewer seems to take the view that once the question becomes more general, and applicable to work that doesn’t use computers, the question is no longer peculiarly digital, and thus not sufficiently digital; they would like to exclude it from the DH conference on the grounds that it addresses a question of interest not only to DH but also to other fields.

[“Wait a second,” said Enrique. “You’re saying there’s a pattern here. Isn’t it just this one reviewer?” “No,” I said. “This line of thought also came up in at least one other review (possibly in a more benign form), and it has also been seen in past years’ reviews.” “Is that why you’re so touchy about this?” laughed Enrique. “Did they reject one of your papers on this account?” “Oh, hush.”]

The notion that there must be something about a topic that is peculiar to the digital humanities, as opposed to the broader humanities disciplines, makes sense perhaps if one believes that the digital humanities are intrinsically different from the pre-electronic humanities disciplines. On that view, any DH topic is necessarily distinct from any non-DH humanities topic, and once a topic is relevant to the broader humanistic fields (e.g. “what is the nature of X?”), it is ipso facto no longer a DH topic.

This is like arguing that papers about the concept of literary genre don’t belong at a conference about English literary history, because there is nothing peculiarly English about the notion of genre and any general discussion will (if it’s worth anything at all) also be relevant to non-English literatures. Or like trying to exclude work on the theory of computation from computer science conferences because it applies to all computation, not only to computation carried out by electronic stored-program binary digital computers.

[“I notice that leaders in the field of computer science occasionally feel obliged to remark that the name computer science is a misnomer,” said Enrique, “because computers are in no sense an essential element in the field.“ ”Perhaps they have the same problem, in their way,” I said. “My sympathy,” said Enrique.]

Another problem I see with the view my co-reviewer seems to hold is that some people believe that DH is not intrinsically different from the pre-electronic humanities disciplines. I didn’t get involved with computers in order to stop doing traditional philology; I got involved to do it better. Computers allow a much broader basis for our literary and linguistic argumentation, and they demand a higher degree of precision than philologists typically achieved in past decades or centuries, but I believe that digital philology is recognizably the same discipline as pre-digital philology. If the DH conference were to start refusing papers on the grounds that they are describing work that might be relevant to scholars working with pen and paper, then it would be presupposing an answer to the important question of how computers change and don’t change the world, instead of encouraging discussion of that question. (And it would be excluding people like me from the conference, or trying to.)

[“Trying in vain, I bet,” said Enrique. “Well, yeah. I’m here, and I’m not leaving.”]

These exclusionary impulses are ironic, in their way, because one of the motive forces behind the formation of scholarly organizations like the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing (now the European Association for Digital Humanities) and behind their journals and conferences was that early practitioners in the field were often made to feel unwelcome in the conferences and journals of their home disciplines. The late Paul Fortier was eloquent on the subject of the difficulties he had trying to publish his computer-assisted work on Céline in conventional journals for Romance languages and literatures; he often spoke of Computers and the Humanities as having made it possible for him and others like him to have an academic career.

It will be a sad thing if the recent growth in degree programs devoted to digital humanities turns out to result in the field of DH setting up signs at all boundaries reading “Outsiders not wanted.”

Flattening structure without losing all rules

[25 November 2013, typos corrected 26 November; spam links removed 27 April 2017]

The other day during the question-and-answer session after my talk at Markupforum 2013, Marko Hedler speculated that some authors (and perhaps some others) might prefer to think of text as having a rather flat structure, not a hierarchical nesting structure: a sequence of paragraphs, strewn about with the occasional heading, rather than a collection of hierarchically nesting sections, each with a heading of the appropriate level.

One possible argument for this would be how well the flat model agrees with the text model implicit in word processors. Whether word processors use a flat model because their developers have correctly intuited what authors left to their own devices think, or whether author who want a flat structure think as they do because their minds have been poisoned by exposure to word processors, we need not now decide.

In a hallway conversation, Prof. Hedler wondered if it might not be possible to invent some sort of validation technique to define such a flat sequence. To this, the immediate answer is yes (we can validate such a sequence) and no (no new invention is needed): as the various schemas for HTML demonstrate, it’s easy enough to define a text body as a sequence of paragraphs and headngs:

<!ELEMENT body (p | ul | ol | blockquote
| h1 | h2 | h3 ...)*>

What, we wondered then, if want to enforce the usual rules about heading levels? We need to check the sequence of headings and ensure that any heading is at most one level deeper than the preceding heading, so a level-one heading is always followed by another level-one heading or by a level-two heading, but not by a heading of level three or greater, while (for example) a fourth-level heading can be immediately followed by a fifth-level heading, but not a sixth-level heading. And so forth.

The discussion having begun with the conjecture that conventional vocabularies (like TEI or JATS) use nesting sections because nesting sections are the kinds of things we can specify using context-free grammars, we were inclined to believe that it might be hard to perform this level check with a conventional grammar-based schema language. In another talk at the conference, Gerrit Imsieke sketched a solution in Schematron (from memory, something like this rule for an h1: following-sibling::* [matches(local-name(), '^hd')] [1] /self::*[self::h1 or self::h2] — that can’t be right as it stands, but you get the idea).

It turns out we were wrong. It’s not only not impossible to check heading levels in a flat structure using a grammar-based schema language, it’s quite straightforward. Here is a solution in DTD syntax:

<!ENTITY p-level 'p | ul | ol | blockquote' >
<!ENTITY pseq '(%p-level;)+' >
<!ENTITY h6seq 'h6, %pseq;' >
<!ENTITY h5seq 'h5, (%pseq;)?, (%h6seq;)*'>
<!ENTITY h4seq 'h4, (%pseq;)?, (%h5seq;)*'>
<!ENTITY h3seq 'h3, (%pseq;)?, (%h4seq;)*'>
<!ENTITY h2seq 'h2, (%pseq;)?, (%h3seq;)*'>
<!ENTITY h1seq 'h1, (%pseq;)?, (%h2seq;)*'>

<!ELEMENT body ((%pseq;)?, (%h1seq;)*) '>

Note: as defined, these fail (for simplicity’s sake) to guarantee that a section at any level must contain either paragraph-level elements or subsections (lower-level headings) or both; a purely mechanical reformulation can make that guarantee:

<!ENTITY h1seq 'h1,
((%pseq;, (%h2seq;)*)
| (%h2seq;)+)' >

Searching for patterns in XML siblings

[9 April 2013]

I’ve been experimenting with searches for elements in XML documents whose sequences of children match certain patterns. I’ve got some interesting results, but before I can describe them in a way that will make sense for the reader, I’ll have to provide some background information.

For example, consider a TEI-encoded language corpus where each word is tagged as a w element and carries a pos attribute. At the bottom levels of the XML tree, the documents in the corpus might look like this (this extract is from COLT, the Bergen Corpus of London Teenage English, as distributed by ICAME, the International Computer Archive of Modern and Medieval English; an earlier version of COLT was also included in the British National Corpus):

<u id="345" who="14-7">
<s n="407">
<w pos="PPIS1">I</w>
<w pos="VBDZ">was</w>
<w pos="XX">n't</w>
<w pos="JJ">sure</w>
<w pos="DDQ">what</w>
<w pos="TO">to</w>
<w pos="VVI">revise</w>
<w pos="RR">though</w>
<u id="346" who="14-1">
<s n="408">
<w pos="PPIS1">I</w>
<w pos="VV0">know</w>
<w pos="YCOM">,</w>
<w pos="VBZ">is</w>
<w pos="PPH1">it</w>
<w pos="AT">the</w>
<w pos="JJ">whole</w>
<w pos="JJ">bloody</w>
<w pos="NN1">book</w>
<w pos="CC">or</w>
<w pos="RR">just</w>
<w pos="AT">the</w>
<w pos="NN2">bits</w>
<w pos="PPHS1">she</w>
<w pos="VVD">tested</w>
<w pos="PPIO2">us</w>
<w pos="RP">on</w>
<w pos="YSTP">.</w>

These two u (utterance) elements record part of a conversation between speaker 14-7, a 13-year-old female named Kate of unknown socio-economic background, and speaker 14-1, a 13-year-old female named Sarah in socio-economic group 2 (that’s the middle group; 1 is high, 3 is low).

Suppose our interest is piqued by the phrase “the whole bloody book” and we decide we to look at other passages where we find a definite article, followed by two (or more) adjectives, followed by a noun.

Using the part-of-speech tags used here, supplied by CLAWS, the part-of-speech tagger developed at Lancaster’s UCREL (University Centre for Computer Corpus Research on Language), this amounts at a first approximation to searching for a w element with pos = AT, followed by two or more w elements with pos = JJ, followed by a w element with pos = NN1. If we want other possible determiners (“a”, “an”, “every”, “some”, etc.) and not just “the” and “no”, and other kinds of adjective, and other forms of noun, the query eventually looks like this:

let $determiner := ('AT', 'AT1', 'DD',
'DD1', 'DD2',
'DDQ', 'DDQGE', 'DDQV'),
$adjective := ('JJ', 'JJR', 'JJT', 'JK',
'MC', 'MCMC', 'MC1', 'MD'),
$noun := ('MC1', 'MC2', 'ND1',
'NN', 'NN1', 'NN2',
'NNJ', 'NNJ2',
'NNL1', 'NNL2',
'NNT1', 'NNT2',
'NNU', 'NNU1', 'NNU2',
'NP', 'NP1', 'NP2',
'NPD1', 'NPD2',
'NPM1', 'NPM2' )

let $hits :=
[following-sibling::w[1][@pos = $adjective]
[following-sibling::w[1][@pos = $adjective]
[following-sibling::w[1][@pos = $noun]
for $h in $hits return
<hit doc="{base-uri($h)}">{

Such searches pose several problems, for which I’ve been mulling over solutions for a while now.

  • One problem is finding a good way to express the concept of “two or more adjectives”. (The attentive reader will have noticed that the XQuery given searches for determiners followed by exactly two adjectives and a noun, not two or more adjectives.)

    To this, the obvious solution is regular expressions over w elements. The obvious problem standing in the way of this obvious solution is that XPath, XQuery, and XSLT don’t actually have support in their syntax or in their function library for regular expressions over sequences of elements, only regular expressions over sequences of characters.

  • A second problem is finding a syntax for expressing the query which ordinary working linguists will find less daunting or more convenient than XQuery.

    Why ordinary working linguists should find XQuery daunting, I don’t know, but I’m told they will. But even if one doesn’t find XQuery daunting, one may find the syntax required for sibling searches a bit cumbersome. The absence of a named axis meaning “immediate following sibling” is particularly inconvenient, because it means one must perpetually remember to add “[1]” to steps; experience shows that forgetting that predicate in even one place can lead to bewildering results. Fortunately (or not), the world in general (and even just the world of corpus linguistics) contains a large number of query languages that can be adopted or used for inspiration.

    Once such a syntax is invented or identified, of course, one will have the problem of building an evaluator for expressions in the new language, for example by transforming expressions in the new syntax into XQuery expressions which the XQuery engine or an XSLT processor evaluates, or by writing an interpreter for the new language in XQuery or XSLT.

  • A third problem is finding a good way to make the queries faster.

    I’ve been experimenting with building user-level indices to help with this. By user-level indices I mean user-constructed XML documents which serve the same purpose as dbms-managed indices: they contain a subset of the information in the primary (or ‘real’) documents, typically in a different structure, and they can make certain queries faster. They are not to be confused with the indices that most database management systems can build on their own, with or without user action. Preliminary results are encouraging.

More on these individual problems in other posts.