Running the BaseX XQuery engine in the OpenShift cloud platform

[7 January 2016]

Late last year I worked through the process of making the BaseX XQuery engine run under Tomcat 6 at a commercial Java hosting provider. As I mentioned in that post, I also spent some time trying to make a cloud-services solution work, using OpenShift and Andy Bunce’s excellent Openshift quick start for BaseX. I ran into trouble then: the Red Hat Cloud (rhc) tools refused to install on either of my two machines, because the operating systems were more than twelve months old.

But during a quiet day or two last month I downloaded a new operating system for the newer machine (I’m exaggerating; it probably only took sixteen hours or so), and the other day I tried again.

The instructions for the quick-start setup are a bit terse, so for future reference (and for the benefit of any other XQuery developers out there who would like a little more detail in the instructions), this is my checklist for next time. Many of the same considerations apply as for installing BaseX under Tomcat; see the earlier post for more discussion.

Prerequisites

The following checklist assumes that:

  • You are reasonably comfortable with command-line tools.
  • You are reasonably comfortable with git, or can copy and paste git incantations without mucking them up.
  • You know what the basexhttp script is (or can live without knowing all the details of how things work).
  • It may be helpful if you’re familiar with ssh and scp, but it’s not essential.

If these don’t all apply, you will need to make some adjustments on the fly.

Preliminary preparation

  1. Sign up for an account at OpenShift.

    I don’t remember how long this took, maybe an hour. (It would go faster if it were easier to find the actual signup page and if one didn’t read the agreements one is consenting to.)

    I went through the “log in to the console and create an app” tutorial at OpenShift and “created” my first application (by clicking a button). I didn’t find this helpful or instructive, but YMMV.

    Remember your userid and password; you will need them repeatedly.

  2. Install the OpenShift / Red Hat Cloud command-line tools (rhc); instructions are on the Getting Started with OpenShift Online page at OpenShift.

    This also takes a little while, to download and install all the libraries on which the rhc command-line tools depend. My recollection is that it took half an hour or so. I expect it’s faster for people with faster connections than mine.

    (This is where things went terribly wrong for me in November, since Red Hat’s command-line tools refused to be installed in an operating system that shipped two years ago. Trying to solve the problem by upgrading the system’s Ruby interpreter landed me in dependency hell.)

  3. Decide in advance what userids you wish to specify for your BaseX database(s); for each userid specify the initial password and the initial permissions for each user. At the very least, determine what userid and password will be used as the default database user for the app.

    To allow myself to undertake each database operation with the lowest feasible level of privilege, I made myself an array of userids, one for each privilege level, and assigned them passwords generated by a simple random process:

    • Angie (admin privileges)
    • Chris (create privileges)
    • Will (write privileges)
    • Ralph (read privileges)
    • Nadine (no privileges)
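
    The “simple random process” can be anything that produces unguessable strings. What follows is a minimal sketch, assuming a Unix-like shell with /dev/urandom and a head that accepts -c; the function name gen_password and the length of 24 characters are illustrative choices of mine, not anything BaseX or OpenShift requires.

```shell
#!/bin/sh
# Hypothetical helper: print a random alphanumeric password.
# Reads 1024 random bytes, keeps only letters and digits, and takes the
# first $length of them. Intended for short passwords (a few dozen chars).
gen_password() {
    length=${1:-24}
    head -c 1024 /dev/urandom | LC_ALL=C tr -dc 'A-Za-z0-9' | head -c "$length"
    echo
}

# One password per userid from the list above.
for user in Angie Chris Will Ralph Nadine; do
    printf '%s\t%s\n' "$user" "$(gen_password 24)"
done
```

    Keep the output somewhere safe; you will be typing these passwords into the dba application later.
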
  4. Prepare in advance a small XML file or two to use in creating a helloworld database to check that things are running as you expect.

    I use the documents in http://cmsmcq.com/2015/11/XQuery-over-HTTP/data/ for making a helloworld example. When I use the REST interface to create a database and populate it, it’s simplest to retrieve the documents by URI; when I use the dba interface of BaseX 8.3.1, it proves simplest if I have copies of them on my hard disk.
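
    If you have no suitable documents to hand, something very small will do. This is a sketch only: the file name hello.xml and its contents are made up for illustration and are not the documents at the URI above.

```shell
#!/bin/sh
# Write a tiny, well-formed XML document to use as a smoke test.
# The file name and the content are placeholders.
cat > hello.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<greetings>
  <greeting xml:lang="en">Hello, world!</greeting>
  <greeting xml:lang="de">Hallo, Welt!</greeting>
</greetings>
EOF

# Quick sanity check: count the lines that contain a greeting element.
grep -c '<greeting ' hello.xml
```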

Doing the deed

The basic instructions are all given in the quick-start readme file; they are, perhaps, a little terse there, for readers who haven’t worked with OpenShift before, so I’ll repeat them with some comments. We are going to create a do-nothing OpenShift application, copy Andy Bunce’s BaseX quickstart setup into it, and check it in. For concreteness, I will assume the app to be built is named “Allegheny”, and the OpenShift user’s domain is “AIK”.

  1. rhc app create -a Allegheny -t diy-0.1

    This creates a do-nothing app named Allegheny on the OpenShift server, using the diy-0.1 cartridge, then clones a git repository of the code into a new directory called Allegheny, on your hard disk. The DIY cartridge provides a sort of minimal environment for an app, on a do-it-yourself basis; fortunately for us, we do not have to do it ourselves, as Andy Bunce has done all the crucial bits for us.

    When I did it this morning, this step took a little over two minutes.

  2. cd Allegheny

    Move into the app directory.

    Do not omit this step; you will regret it, especially if you use git yourself as a backup or synching mechanism. (That is to say, when I forgot to do this, the next step put all of the quickstart code into my home directory. It took me a painfully long time to find and delete it all and get it back out of my git repository. And my .gitignore file seems to have lost both its content and its history. Fortunately, I do have backups.)
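
    One way to make this mistake harder to repeat is to check where you are before pulling anything in. A minimal sketch, assuming the app directory is named Allegheny as in this example; the function name in_app_repo is hypothetical.

```shell
#!/bin/sh
# Succeed only if the current directory looks like the top of the app's
# git working copy: it has the expected name and contains a .git directory.
in_app_repo() {
    expected=${1:-Allegheny}
    [ "$(basename "$PWD")" = "$expected" ] && [ -d .git ]
}

if in_app_repo Allegheny; then
    echo "OK: in the Allegheny working copy; safe to run the git commands of the next step."
else
    echo "Not in the Allegheny working copy; cd there first." >&2
fi
```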

  3. git remote add upstream -m master https://github.com/Quodatum/openshift-basex-quick-start.git

    git pull -s recursive -X theirs upstream master

    Pull the quickstart code into your app. (If you want to understand in detail what these lines do, I recommend the git manual. I am not going to try to explain.)

    When I ran this this morning, it took a little under two minutes; it seemed longer.

    If you’re in a hurry, go ahead to the next step. Otherwise, now may be a good time to pause to look at the application’s directory structure, since it’s where you will be doing your development. The crucial bits appear (in the current state of my imperfect knowledge) to be:

    • config (A bash script which sets some variables to be used in other bash scripts, including BASEX_USER and BASEX_PASSWORD, which are used as the values of the -U and -P options in calls to the basexhttp script. The URI of the version of BaseX to install is also given here, as are the port numbers the server should use.)

    • .openshift/ (part of the OpenShift infrastructure)

      • action_hooks/ (this is where the application developer places code to be executed by the OpenShift infrastructure at predefined moments)

        • start (a bash script to be executed when the application needs to be started; the script provided by the quickstart calls basexhttp start)

        • stop (a bash script to be executed when the application needs to be stopped; the script provided by the quickstart calls pkill java to stop the server; in principle it would rather call basexhttp stop, but at the moment that doesn’t work)

      • cron/ (Directory for cron jobs to be executed on behalf of the application, with subdirectories hourly, daily, etc.; the README file has a pointer to the relevant OpenShift documentation. The weekly subdirectory appears to have a weekly statistics job in it.)

      • markers/ (I have no idea what this is; the README points to the documentation.)

    • basex/ (For XQuery developers, this is where the action is.)

      • lib/ (Contains a JAR file for Saxon HE 9.7.0.1.)

      • repo/ (The BaseX repository; contains XQuery modules installed using REPO INSTALL URI-of-module; for details see the BaseX wiki under Repository.)

      • webapp/

        • restxq.xqm (A sample RESTXQ function by Andy Bunce supplied as part of the quickstart framework; it generates the default welcome page.)

        • WEB-INF/ (Contains the configuration files governing BaseX’s behavior as a web service)

          • jetty.xml

            (Jetty-specific configuration. With any luck you’ll never need to change anything here. The BaseX documentation refers readers to the Jetty documentation for details, which in turn suggests that you consult the Jetty 9 documentation instead, and good luck to you if you actually need to understand the details. The salient item for the quickstart is setting the IP address and port for the server, using the environment variables OPENSHIFT_DIY_IP and OPENSHIFT_DIY_PORT.)

          • web.xml

            (Generic servlet configuration. By default, the quickstart has the REST and WebDAV interfaces turned off and the RESTXQ interface turned on; you’ll need to edit the web.xml file to turn the REST and WebDAV interfaces back on. This document also specifies the default user and password for the server; how the specification in the web.xml file relates to the values given by the -U and -P options passed to basexhttp remains a mystery to me.

            Since the default web.xml file for the quickstart does not set the RESTXQPATH or RESTPATH options, they default to “the standard web application directory”, which appears in this case to be the webapp directory in the git repository. That would be consistent with the placement of restxq.xqm. The web.xml file also doesn’t specify the REPOPATH option; the documentation says that it defaults to {home}/BaseXRepo, but apparently {home}/repo is also a possibility; that would be consistent with the placement of the repo/ directory here.)

    The other files and directories either have obvious functions (.git, .gitignore, LICENSE, README.md) or appear to be just samples (diy/, misc/).
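
    To see what address Jetty will be told to listen on, you can print the two environment variables that jetty.xml relies on. A sketch, meant to be run in an ssh session on the OpenShift server; the fallback values are placeholders of mine, so that the script also runs on a local machine where the variables are unset.

```shell
#!/bin/sh
# Print the address taken from the OpenShift DIY environment variables.
# The fallbacks (127.0.0.1 and 8080) are placeholders for local use only.
listen_address() {
    printf '%s:%s\n' "${OPENSHIFT_DIY_IP:-127.0.0.1}" "${OPENSHIFT_DIY_PORT:-8080}"
}

listen_address
```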

  4. You can now adjust things in the configuration, if you wish. I mostly wait until later.

  5. git push origin master

    This checks in your changes on the server. The OpenShift infrastructure will tell the running DIY application to shut down (i.e. it will run the script in the checked-in version of .openshift/action_hooks/stop), push the changes from your local git repository to the copy in the cloud, and restart the app (by running .openshift/action_hooks/start).

    When I ran it this morning, this step took just under three minutes, including fetching and installing the BaseX binary.

    (In my initial experiments, I ran into a problem here; there is some disagreement between BaseX and OpenShift regarding who can connect to what ports, and the result is that basexhttp stop doesn’t have the desired effect, which means the check-in fails. In the meantime, Andy Bunce has rewritten the quickstart code with a temporary workaround.)

  6. Set up userids.

    BaseX can now be configured, using the dba application that ships with BaseX 8, and which is conveniently linked from the default quickstart welcome page at http://allegheny-AIK.rhcloud.com/.

    The only essential configuration to do at this point is to change the admin password. I try to do this quickly, since until it is done, anyone who happens across this BaseX engine on the open Web can have admin privileges just by logging in using the default password.

    There are a couple of catches that make the process very slightly less straightforward than it might be.

    • The users.xml file will go into the OpenShift data directory, which means that you cannot conveniently put it in place before starting BaseX. (If inconvenience is no object, then by all means, be my guest: first prepare a users.xml file with another copy of BaseX, then scp it to the ./app-root/data/basex/data directory of your OpenShift app, before doing the push in the previous step.)

    • The config file needs to know the admin userid and password, in order to start and stop the server. (At least, I think it needs to know them; I haven’t actually tried putting gibberish into those variables to see.) If we change the admin password, we risk not being able to stop and start the server gracefully.

    This is the technique I’ve worked out for dealing with these catches. There are probably better ways.

    • Log in to the dba application using the default admin userid and password.
    • Add the users in the list prepared earlier, with the corresponding privileges and passwords. Note that this list includes a second user with admin privileges, here called Angie.
    • In the local repo, change the relevant parts of the config file to read as follows (substituting your userid and password of choice, of course):
      BASEX_USER="Angie"
      BASEX_PASSWORD="Where-is-the-devil-in-Evelyn-what's-it-doing-in-Angela's-eyes"
    • Check in your changes and push them to OpenShift (git add config; git commit -m "Changing userid and password"; git push origin master).
    • Log in to the dba application again (actually, you’re probably still logged in), and change the password of the admin user. At this point, we have closed the window of vulnerability we opened when we started the server with the default admin password. When I’m feeling paranoid, I take this moment to check the Databases, Users, Files, and Logs tabs to see whether any intruders actually showed up and did anything.
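
    The config edit in the steps above can also be scripted. A sketch only: the helper set_config_var is hypothetical, and it assumes that each variable sits on a line of its own in the form NAME="value", as in the two lines shown above. It is demonstrated here on a scratch file rather than on the real config file, and the starting values in that scratch file are placeholders.

```shell
#!/bin/sh
# Hypothetical helper: rewrite the line NAME="..." in a bash-style config file.
# The value is passed through the environment so that awk never treats it as
# a pattern or replacement string. Values containing ", $, ` or \ would still
# need escaping to survive bash's double quotes; this sketch does not do that.
set_config_var() {
    file=$1 name=$2 value=$3
    tmp=$(mktemp) || return 1
    CFG_NAME="$name" CFG_VALUE="$value" awk '
        index($0, ENVIRON["CFG_NAME"] "=") == 1 {
            print ENVIRON["CFG_NAME"] "=\"" ENVIRON["CFG_VALUE"] "\""
            next
        }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file" && rm -f "$tmp"
}

# Demonstration on a scratch file standing in for the quickstart's config.
scratch=$(mktemp)
printf '%s\n' 'BASEX_USER="placeholder"' 'BASEX_PASSWORD="placeholder"' > "$scratch"
set_config_var "$scratch" BASEX_USER "Angie"
set_config_var "$scratch" BASEX_PASSWORD "Where-is-the-devil-in-Evelyn-what's-it-doing-in-Angela's-eyes"
cat "$scratch"

# In the real repository the same edit on the file config would be followed by:
#   git add config
#   git commit -m "Changing userid and password"
#   git push origin master
```
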
  7. Test

    To test that things are going as expected, I also create a database and do a few queries with the dba application and the REST interface.
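
    For the REST part of the test, curl is enough. This is a sketch under several assumptions: the REST interface has been switched back on in web.xml (the quickstart ships with it off), it is served under the /rest path as in a stock BaseX installation, the server accepts HTTP basic authentication, and hello.xml stands in for one of the small XML files prepared earlier. The database name helloworld, the password, and the helper names are placeholders. With DRY_RUN=1, the default here, the commands are only printed, so nothing is sent until you are ready.

```shell
#!/bin/sh
# Placeholders: adjust to your own app, user and password.
BASE_URL=${BASE_URL:-http://allegheny-AIK.rhcloud.com}
DB_USER=${DB_USER:-Chris}
DB_PASS=${DB_PASS:-change-me}
DRY_RUN=${DRY_RUN:-1}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Create (or replace) a database from a local XML file via HTTP PUT.
create_db() {
    run curl -sS -u "$DB_USER:$DB_PASS" -X PUT -T "$2" "$BASE_URL/rest/$1"
}

# Evaluate an XQuery expression against a database via HTTP GET.
query_db() {
    run curl -sS -u "$DB_USER:$DB_PASS" -G --data-urlencode "query=$2" "$BASE_URL/rest/$1"
}

create_db helloworld hello.xml
query_db helloworld 'count(//*)'
```

    Set DRY_RUN=0 to send the requests for real. Using the userid with create privileges for the first call and one with only read privileges for the second keeps to the lowest-privilege habit described earlier.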

So now I know three different ways to have access to an XQuery database over the web:

  • Run it as a server on a machine you control, or on which you can persuade the sysadmin to install it. (I assume that this can be done in the virtual private servers offered by many Web hosting providers, but I haven’t done it that way myself.)
  • Run it as a servlet running under Tomcat on a Java hosting provider.
  • Run it as an application in a cloud service.

My personal experience with these is all with BaseX, but of course all three methods will also work, at least in principle, for other XQuery engines like eXist and MarkLogic.

It’s always good to have more than one string to one’s bow.