Working on your Spanish

[18 November 2017; copy edits 18-20 November]

In just over seven months, the annual Digital Humanities conference will take place in Mexico City. I doubt that I’m the only digital humanist in the world who is surreptitiously trying to improve my Spanish before next June.

If you are trying to improve your Spanish (whether for that reason or for others), here are some things I am finding useful.

First, a textbook.

  • The French publisher Assimil has a Spanish volume in their sans peine series: L’espagnol sans peine. It’s available in several versions for speakers of different languages: English (Spanish with ease), German (Spanisch ohne Mühe), Italian, Dutch, and Portuguese. The series is described, quite accurately, as suitable for “beginners and pseudo-beginners” (débutants et faux-débutants).

    I bought the book and CDs direct from the Assimil web site and had them shipped to me in the U.S. without any difficulty; the only catch for some potential buyers is that the site is in French. Period. (Wait, aren’t they specialists in books for foreign-language learners? Don’t their web people realize the firm has non-Francophone target readers? Ah, well. There are some English-language resellers one can use, or you can grab a Francophone friend and make them babysit you through finding the book you want and making your purchase.)

    I won’t attempt to describe the Assimil method here. The linguist John McWhorter has given a good account of their results in his piece on the NPR web site (and there is some useful concrete advice at the site ‘How to Learn any Language’, for those who find the books’ instructions vague). I will say that for those who like me are attempting to learn (or re-learn) a language on their own and not in a class, the Assimil sans peine series has no equal that I know of.

    I bought the combination pack with the printed book, sound CDs with recordings of the dialogues, and a CD with MPEG versions of the recordings; I have downloaded the MPEG recordings to a directory on an Android tablet, imported the directory into the Podcast Addict app as an audio book, and I use Podcast Addict to play the day’s recording on continuous loop while I fetch the paper, wash dishes, etc. Listening in a podcast player has the advantage that I can speed up the early lessons to something approaching normal conversational speed.

Second, a supply of relatively easy reading and listening material. My searches for podcasts for learners of Spanish as a foreign language turned up large numbers of results, some of which were obviously irrelevant and some of which I was able to delete without qualms after listening for a couple of minutes. I now listen to three:

  • Español automático, prepared by a personable teacher of Spanish named Karo Martínez; in some installments, she speaks in relatively simple Spanish about assorted topics (the relative merits of the various actors who have played James Bond, the history of Catalonia, how to learn Spanish, and others; self-help topics of the how-to-be-more-organized variety are not uncommon), in others she specifically discusses issues of Spanish idiomatic usage and vocabulary. Episodes generally range from twenty to forty minutes (although when she got going on her introduction to the history of Barcelona, it ran over an hour). Transcripts and other additional materials appear to be available from the web site, some for purchase, which I hope helps pay the cost of producing and publishing the podcast, and others for free (but you have to give them your email address, which induces a spasm of irrational privacy paranoia in me, so I have no idea what form the transcripts take). The only blemish for me is the repeated plea for listeners to file a five-star review of the podcast in iTunes.

  • El oso latino habla español — para mejorar su español, a quirky podcast put together in Sherbrooke, Québec, produced by a Québécois Spanish-learner named Pascal Dion and featuring the Peruvian Oswaldo Horna Montes (known, I gather, to his friends as El oso latino) as the main speaker. Episodes often include interviews with visitors from Latin America, with other Latin Americans living in Sherbrooke, or with anyone whom the host and producer think will be interesting; topics of discussion regularly include differences among varieties of Spanish and idioms peculiar to this or that regional variety. In one show, Montes narrated the preparation of a Peruvian dish for dinner. Dion appears in a regular segment on language-learner errors called Crónica del gringo, and Montes’s daughters in the segment Los chistes de Celia y de Marisol, which makes me laugh till I cry even though I have not yet understood a word of any of the jokes they have told. Lots of music. The most personal and thus the most memorable of these three podcasts. Generally around 30 minutes per episode.

  • News in slow Spanish, which is more or less what it sounds like: a weekly podcast with news stories in Spanish (there are similar News in Slow X podcasts for a number of different languages). There are several versions (intermediate vs. advanced, Castilian vs. Latin American), but oddly only one that I was able to locate from within Podcast Addict (Spain, Advanced, which proves not too advanced for me). News in Slow Spanish (Latino) does appear in the Tune-In Radio app (the only thing I miss there is continuous looping).

    I shied away from this at first; I have decades-old bad memories of unconvincing ‘newscasts’ prepared specially for language learners, filled with soft news (to give them a longer shelf life) and painful explanations of words. But I ended up trying the podcast, after I failed to find an app or podcast that would give me a conventional five- to fifteen-minute radio news broadcast once a day or once an hour (something along the lines of the NPR News app, or the Radio Canada app — still looking; if you know of anything, let me know). And I was convinced. The news items are real, current, and interesting, and the editorial comment feels lively and intelligent. In the Latin-American version, particularly, it is interesting to listen to the friendly discussion between hosts with slightly different political leanings.

    I’ve been listening to the initial free portion of the podcast; a longer episode with more news items and some discussions of Spanish grammar and idioms is available as a paid service. The free portion runs about five minutes. (Every now and then there is either a slip or an intentional freebie, and the entire thirty-minute program is included.)

On easy reading material, I’m not doing too well. Children’s and young-adult books are an obvious choice, but I don’t know an easy way to know what’s worth buying and what’s not. Well, actually I do. The next time I’m at my public library I’ll ask at the information desk for recommendations.

And third, a supply of normal Spanish for listening and reading, ideally interesting and not over-challenging. Here, of course, the choice is limitless and the expanse of possibilities feels as trackless as Borges’s Library.

There is always the news (which tends to have a relatively manageable vocabulary, and to have a lot of short pieces). There are Android apps for any number of Spanish-language newspapers, most of which I haven’t heard of and some of which may or may not be worth reading. With my mind focused on Mexico City, I have looked only at apps and web sites for Mexican newspapers. A Mexican colleague (whom I thank, but who shall remain nameless because I haven’t asked permission to name them here) has suggested:

  • Animal politico (left wing); the web site is fine on a desktop machine and a bit hard to navigate on a tablet. I didn’t find any app.
  • El Universal (center right); I did find an app but did not find it usable.
  • La Jornada (left wing); I find the Android app usable (though I wish it allowed me to adjust the font size), so I haven’t worked with the web site.

At the moment, I confess to finding Mexican newspapers slightly heavy going.

Since I’m a digital humanist, and I’m looking forward to DH 2018, I read the blog run by the Red de humanidades digitales with great interest, even if sometimes with imperfect comprehension.

Eventually, I’ll be looking for Spanish-language detective stories and the like: page-turners are a real boon for a foreign-language learner, so I will happily read many things in a foreign language whose English equivalents I wouldn’t normally be caught dead with. (I’m told that on the same principle, some adult literacy programs in this country do great work with Mickey Spillane.) Suggestions welcome.

Finding good podcasts aimed not at language learners but at intelligent adults has been a challenge, but looking around for podcasts on the sites of UNAM (Universidad Nacional Autónoma de México) and IMER (Instituto Mexicano de la Radio) has produced dividends, as have some journalistic think pieces I found on the Web on new media in Latin America. Right now my subscriptions include the following. (At the moment, all of these are tough sledding for me, but they repay repeated listening. The ability to slow the playback helps — it’s like being able to say “¡Demasiado rápido! ¡Más despacio, por favor!” and have the podcast nod and slow down.)

  • Azul Chiclamino, a podcast by Rodrigo Llop. I have no idea how to describe this; perhaps the subtitle will do: La realidad de lo absurdo. This is sometimes characterized as a humorous broadcast; for a sufficiently nuanced definition of “humor” (think Mark Twain) that’s probably true, but I find the podcast much more appealing than that label would lead me to expect.
  • Radio Ambulante, an NPR-affiliated podcast. Feels a bit like Radio Lab or This American Life, in Spanish: thoughtful, serious, well produced. Has the advantage that its stories are often about Hispanic affairs in the US, so I understand some of the background; has the drawback that its stories are often about the US, so I’m not learning about Latin America.
  • Ráfagas de pensamiento, a series of short pieces by the philosopher Ernesto Priani Saisó of UNAM, often reacting to a passage in an earlier philosopher (Nietzsche, Husserl, Leibniz, More, …). Produced with atmospheric music and read by what sound like professional voice actors. Has the disadvantage for me that the background music can make the words harder to hear; has the advantage that it’s worth listening to. Usually 3-5 minutes per piece.
  • A multipart dramatization by Radio UNAM of Así asesinaron a Trotski by Leandro Sánchez Salazar (the man in charge of the investigation of Trotsky’s murder). I have no idea what Sánchez’s book is like as a historical source, but it has the virtue of strong narrative drive (even though I already know how it turns out). I may need to read the book in order to understand some of the broadcast.

Netflix and Amazon appear to have rather thin selections when it comes to Spanish-language films, but they do have some. If anyone reading this knows an effective way to search by language on either, I’m all ears; surely searching for “Almodóvar” should not be the only possibility (I am going to save Buñuel for later, when my Spanish is better and I can tell the difference between surrealism and not understanding the words). YouTube has a fair bit of Spanish content, though again I have not found any good way to find it except for searching on random Spanish words. An impulsive search on “Así asesinaron a Trotski” turned up several documentaries on Trotsky’s assassination, Trotsky’s life, Trotsky and Stalin, and Ramón Mercader (the man who killed Trotsky), as well as a few seminars on Trotskyite political theory.

[Addendum: on Netflix, selecting Browse / Audio & Subtitles takes the user to an interface where one can browse items with audio, or subtitles, in a given language. This is imperfect, but probably better than nothing. Looking for something to watch in the resulting display feels like looking for a book to read in a library arranged by color; for every ten times you feel irritated by its apparently random arrangement and the inconvenience of having to click on something every time you would like more information than is given in the icon, you may once or twice feel pleased by some serendipity.]

All of this is, of course, just my two cents. As may be clear from the above, my language learning work happens mostly on an Android tablet, not on a desktop machine.

n-gram Markov models as regular approximations?

[16 September 2016][cleared out Viagra spam 27 April 2017]

By construction, Markov models can generate, or recognize, regular languages. This follows from the fact that they are essentially weighted finite state automata.

If we take a body of material that conforms to a context-free grammar G, segment it into tokens the way a lexical scanner would do, and then construct an n-gram Markov model on the basis of the material, we will have a weighted finite state automaton (FSA) that produces or recognizes a regular approximation of the context-free language of G.
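To make that concrete with a toy sketch (mine, not part of the original argument): an n-gram model built from scanned tokens is just a table of weighted transitions, which is to say a weighted FSA. For bigrams, something like:

```python
from collections import Counter, defaultdict

def bigram_model(tokens):
    """Count token-to-token transitions and normalize to probabilities.
    The result is in effect a weighted FSA: states are tokens, and
    each transition a -> b carries the probability P(b | a)."""
    counts = defaultdict(Counter)
    for a, b in zip(tokens, tokens[1:]):
        counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

# A toy token stream, as a lexical scanner might emit it:
tokens = ["if", "(", "x", ")", "{", "y", "=", "x", ";", "}",
          "else", "{", "y", "=", "0", ";", "}"]
model = bigram_model(tokens)
print(model["x"])   # x is followed by ")" and ";" equally often here
```

Strings generated by walking these transitions are exactly the strings of the weighted FSA, so whatever context-free structure the training material had survives only to the extent that the automaton can carry it.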

How will that regular approximation, and any regular grammar that expresses it (with or without the weights), relate to the original context-free grammar G? How will it relate to other regular approximations of L(G), the kind that one gets by manipulating the grammar G directly?

This is mostly an abstract question, but regular approximations have many uses.

  • Recognizing regular languages is often faster than recognizing context-free languages, and requires fewer resources.
  • Schema languages like XSD and Relax NG can use regular expressions to restrict strings, but not context-free grammars. (Amelia Lewis told me once she’d like to design a schema language which allowed context-free grammars for constraining strings; I think that would be an interesting schema language.)
  • Language-specific editing modes (e.g. c-mode in Emacs) must often treat the language in question as if it were regular (i.e. you don’t really have access to the context-free grammar or a parse tree when deciding how much to indent a line: you have to decide based on whether the preceding line ends with an open brace or not, and so on). I’ve never found descriptions of how to write language modes in Emacs easy to follow, perhaps because they don’t explain how to stop thinking about a language in terms of its context-free grammar and how to think about it instead as a regular language.

But probably I’m thinking about this today because I’ve been thinking about part-of-speech tagging a lot recently. The easiest way to build a reasonably good part-of-speech tagger nowadays appears to be to build a hidden Markov model on the basis of some training corpus. (There are other more sophisticated approaches, which fight it out among themselves for the odd tenth of a percentage point in their correctly-tagged rates. They don’t appear to be nearly as easy to understand and build.)

I conjecture that one of the reasons bigram and trigram part-of-speech taggers do as well as they do is that they apply information about what can occur where in a sentence, in a way that has been dumbed down to the point where it can fit into (be expressed by) a finite state automaton. I wonder if a systematic way to go from a context-free grammar to an FSA could help build better / smarter taggers. Could it capture more information about grammatical context? Transmit that information over longer distances? Provide guidance in cases where the available training data would otherwise be too sparse?

One reason trigrams can do better than bigrams is that they provide more information for the decision about how to tag each word: the choice of tag depends not just on the preceding tag but on the preceding two tags. One reason trigrams can do worse than bigrams is that there are a lot more potential trigrams for any set of part-of-speech tags than there are bigrams, so trigrams are apt to suffer from sparse-data problems unless one has “a lot” of training data (for some meaning of “a lot”); 4-grams, 5-grams, etc., appear to be non-starters because of the sparse-data problem. Could a systematic derivation of a Markov model from a CFG help? Could we judiciously tweak the underlying FSA to carry information we know (or suspect) will be useful? That would provide the same advantages as n-grams with larger n. Could generating the FSA from a grammar help provide guidance for distinguishing grammatical-but-infrequent turns of phrase from gibberish? That would help minimize the sparse-data problem for n-grams with larger n.
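The sparse-data problem is easy to quantify with back-of-the-envelope arithmetic (the tagset and corpus sizes below are illustrative assumptions, not measurements):

```python
tagset = 36          # roughly the size of the Penn Treebank tagset
corpus = 1_000_000   # number of tagged tokens available for training

for n in (2, 3, 4, 5):
    possible = tagset ** n
    # a corpus of N tokens contains at most N distinct n-grams, so
    # this ratio is an upper bound on coverage of the n-gram space
    bound = min(1.0, corpus / possible)
    print(f"{n}-grams: {possible:>10,} possible; coverage at most {bound:.1%}")
```

On these assumptions a million training tokens could in principle attest every bigram and trigram, but at most a few percent of all possible 5-grams, which is the non-starter described above.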

It’s the 16th of September — Happy Independence Day, neighbors! Viva Hidalgo!

Fundamental primitives of XSLT programming

[2016-08-07]

A friend planning an introductory course on programming for linguists recently asked me what I thought such linguist-programmers absolutely needed to learn in their first semester of programming. I thought back to a really helpful “Introduction to Programming” course taught in the 1980s at the Princeton University Computer Center by Howard Strauss, then the head of User Services. As I remember it, it consisted essentially of the introduction of three flow-chart patterns (for a sequence of steps, for a conditional, and for a while-loop), with instructions on how to use them that went something like this:

  1. Start with a single box whose text describes the functionality to be implemented.
  2. If every box in the diagram is trivial to implement in your programming language, stop: you’re done. Implement the program using the obvious translation of sequences, loops, and conditionals into your language.
  3. Otherwise choose some single box in the diagram whose functionality is non-trivial (will require more than a few lines of code) and replace it with a pattern: either break it down into a sequence of steps, or make it into a while-condition-do-action loop, or make it into an if-then-else choice.
  4. Return to step 2.
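A toy example of my own (not from Strauss’s course): a function that sums the positive numbers in a list uses exactly the three patterns, and nothing else:

```python
def sum_positive(numbers):
    """Sum the positive numbers in a list using only the three basic
    patterns: sequence, while-loop, and conditional."""
    total = 0                  # sequence: one step after another
    i = 0
    while i < len(numbers):    # while-condition-do-action loop
        if numbers[i] > 0:     # if-then-else choice (empty else branch)
            total += numbers[i]
        i += 1
    return total

print(sum_positive([3, -1, 4, -1, 5]))  # prints 12
```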

I recommended this idea to my friend, since when I started to learn to program I found these three patterns extremely helpful. As I thought about it further, it occurred to me that the three patterns in question correspond 1:1 to (a) the three constructors of regular expressions (concatenation, alternation, and repetition), and (b) the three program-structuring patterns proposed in the 1970s by Michael A. Jackson. The diagrams I learned from Howard Strauss were not the same as Jackson’s diagrams graphically, but the semantics were essentially the same. I expect that a good argument can be made that together with function calls and recursion, those three patterns are the atomic patterns of software design for all conventional (i.e. sequential imperative) languages.

I think the patterns provide a useful starting point for a novice programmer: if you can see how to express an idea using those three patterns, it’s hard not to see how to capture it in a program in Pascal, or C, or Python, or whatever language you’re using. Jackson is quite good on deriving the structure of the program from the structures of the input and output in a systematic way.

The languages I most often teach, however, are XSLT and XQuery; they do not fall into the class of conventional sequential imperative languages, and the three patterns I learned from Howard Strauss (and which Howard Strauss may or may not have learned from Michael A. Jackson) do not help me structure a program in either language.

Is there a similarly small set of simple fundamental patterns that can be used to describe how to build up an XSLT transformation, or an XQuery program?

What are they?

Do they have a plausible graphical representation one could use for sketching out a stepwise refinement of a design?

Google Play Books and CSS

[3 February 2016]

My evil twin Enrique came by in a rage the other day. He had spent the entire weekend trying to figure out why Google Play Books, the ebook reader made by Google, was ignoring large parts of the CSS styling in an EPub3 ebook he had been working on. The CSS had worked fine in an earlier test, but he had since made numerous changes and improvements, and in the latest test GPB ignored not just the new stylesheet rules but most of the old ones.

His approach involved making test book after test book, trying to split the difference between the version that worked (but did not style everything as desired) and the version that didn’t work — the same sort of binary search depressingly familiar to developers of all kinds working with mysterious problems in the absence of useful diagnostics.

On the plus side, Enrique learned something from the experience; he has hints for those who have to do this with ebooks:

  • Make the testbook as short and small as possible, to minimize the wait time while you upload to Google Play from the development machine and then download again to the Android device you are targeting.
  • Make at least one change to the metadata in the ebook’s OPF file (often called content.opf, though that name is not required); otherwise some reading software will conclude that it already has the book, and not bother to download it or load it or show it to the user. Changing the title and last-modified date seems to help. (Changing the title is important, actually, since otherwise the test book will be impossible to identify in a list of ebook titles.)
  • Give each test book a distinctive cover (any PNG will do); otherwise it’s impossible to tell them apart in the cover-thumbnails view offered by many reading systems.
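A sketch of how the metadata bump might be scripted (my own illustration, not part of Enrique’s toolchain: the element names dc:title and dcterms:modified are standard EPUB 3 package metadata, but the regex approach here is fragile, and a robust tool would use an XML parser):

```python
import datetime
import re

def stamp_test_book(opf_text, label):
    """Append a label to the dc:title and refresh dcterms:modified in
    an OPF document, so reading systems see a distinct, newer book."""
    now = datetime.datetime.now(datetime.timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # tack the label onto the first dc:title element
    opf_text = re.sub(r"(<dc:title[^>]*>)(.*?)(</dc:title>)",
                      lambda m: f"{m.group(1)}{m.group(2)} {label}{m.group(3)}",
                      opf_text, count=1)
    # replace the last-modified timestamp with the current time
    opf_text = re.sub(r'(<meta property="dcterms:modified">).*?(</meta>)',
                      lambda m: f"{m.group(1)}{now}{m.group(2)}",
                      opf_text, count=1)
    return opf_text
```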

And what was the problem? Why did GPB object to the CSS?

That, said Enrique, was the infuriating thing. GPB had no trouble with any of the rules in the CSS files; it had no trouble with the same CSS if all the comments were deleted.

Were the comments ill-formed? No, not according to W3C’s CSS validator or any other validator Enrique could find.

But GPB ignored the main CSS file if it began with an identifying comment, a revision history, and an overview of the structure of the file. Delete the revision history and the overview, and all was well. Move them to the end of the file, and all was well. Put them at the top of the file, and GPB ignored the file. As far as Enrique could tell, GPB was scanning the first 500 or 700 bytes of the file looking for a CSS rule, and ignored the file if it didn’t find one. An @import statement doesn’t count, he said. Moving the import instructions to the top of the file did not solve the problem; only deleting the long and useful comments solved the problem.
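Enrique’s conjecture translates readily into a rough lint check (my sketch; the 700-byte window is his guess, not documented behavior, and the comment-stripping is naive):

```python
import re

def gpb_suspect(css_text, window=700):
    """Return True if no CSS rule appears within the first `window`
    characters once complete comments are stripped and @import lines
    are ignored -- the condition under which, on Enrique's guess,
    Google Play Books ignores the file."""
    head = css_text[:window]
    head = re.sub(r"/\*.*?\*/", "", head, flags=re.S)  # drop complete comments
    head = re.sub(r"@import[^;]*;", "", head)          # @import doesn't count
    return "{" not in head     # no rule opener seen in the window

print(gpb_suspect("/* " + "revision history... " * 60 + "*/ p { color: red }"))  # True
print(gpb_suspect("p { color: red } /* history at the end is fine */"))          # False
```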

“Is this documented?” I asked. This only made Enrique sputter all the more. I didn’t hear everything he said (probably just as well; he curses like a sailor sometimes), but I gathered that it’s not documented anywhere he could see. (Pointers to documentation of the various ad-hoc restrictions imposed by Google Play Books, or to a GPB-oriented validator for ebooks, would be welcome. Enrique has sworn that he will never try to make a book readable in Google Play Books again, because it’s not worth the aggravation, but other people might still want to.)

It’s a shame that the conformance level of so many EPub3 readers is so abysmally low (and also a shame that even conforming EPub3 readers offer so little interoperability); it makes it hard to publish ebooks of any texts that are more complicated than a George Eliot novel (division into chapters and prose paragraphs is pretty much the only thing an epub publisher can count on; anything more complicated rapidly becomes a nightmare, in my limited experience).

Running the BaseX XQuery engine in the OpenShift cloud platform

[7 January 2016]

Late last year I worked through the process of making the BaseX XQuery engine run under Tomcat 6 at a commercial Java hosting provider. As I mentioned in that post, I also spent some time trying to make a cloud-services solution work, using OpenShift and Andy Bunce’s excellent Openshift quick start for BaseX. I ran into trouble then, because the Red Hat Cloud (rhc) tools refused to install on either of my two machines because the operating systems were more than twelve months old.

But during a quiet day or two last month I downloaded a new operating system for the newer machine (I’m exaggerating; it probably only took sixteen hours or so), and the other day I tried again.

The instructions for the quick-start set up are a bit terse, so for future reference (and for the benefit of any other XQuery developers out there who would like a little more detail in the instructions), this is my checklist for next time. Many of the same considerations apply as for installing BaseX under Tomcat; see the earlier post for more discussion.

Prerequisites

The following checklist assumes that:

  • You are reasonably comfortable with command-line tools.
  • You are reasonably comfortable with git, or can copy and paste git incantations without mucking them up.
  • You know what the basexhttp script is (or can live without knowing all the details of how things work).
  • You have at least a passing familiarity with ssh and scp (helpful, but not essential).

If these don’t all apply, you will need to make some adjustments on the fly.

Preliminary preparation

  1. Sign up for an account at OpenShift.

    I don’t remember how long this took, maybe an hour. (It would go faster if it were easier to find the actual signup page and if one didn’t read the agreements one is consenting to.)

    I went through the “log in to the console and create an app” tutorial at OpenShift and “created” my first application (by clicking a button). I didn’t find this helpful or instructive, but YMMV.

    Remember your userid and password; you will need them repeatedly.

  2. Install the OpenShift / Red Hat Cloud command-line tools (rhc); instructions are on the Getting Started with OpenShift Online page at OpenShift.

    This also takes a little while, to download and install all the libraries on which the rhc command-line tools depend. My recollection is that it took half an hour or so. I expect it’s faster for people with faster connections than mine.

    (This is where things went terribly wrong for me in November, since Red Hat’s command-line tools refused to be installed in an operating system that shipped two years ago. Trying to solve the problem by upgrading the system’s Ruby interpreter landed me in dependency hell.)

  3. Decide in advance what userids you wish to specify for your BaseX database(s); for each userid specify the initial password and the initial permissions for each user. At the very least, determine what userid and password will be used as the default database user for the app.

    To allow myself to undertake each database operation with the lowest feasible level of privilege, I made myself an array of userids, one for each privilege level, and assigned them passwords generated by a simple random process:

    • Angie (admin privileges)
    • Chris (create privileges)
    • Will (write privileges)
    • Ralph (read privileges)
    • Nadine (no privileges)
  4. Prepare in advance a small XML file or two to use in creating a helloworld database to check that things are running as you expect.

    I use the documents in http://cmsmcq.com/2015/11/XQuery-over-HTTP/data/ for making a helloworld example. When I use the REST interface to create a database and populate it, it’s simplest to retrieve the documents by URI; when I use the dba interface of BaseX 8.3.1, it proves simplest if I have copies of them on my hard disk.

Doing the deed

The basic instructions are all given in the quick-start readme file; they are, perhaps, a little terse there, for readers who haven’t worked with OpenShift before, so I’ll repeat them with some comments. We are going to create a do-nothing OpenShift application, copy Andy Bunce’s BaseX quickstart setup into it, and check it in. For concreteness, I will assume the app to be built is named “Allegheny”, and the OpenShift user’s domain is “AIK”.

  1. rhc app create -a Allegheny -t diy-0.1

    This creates a do-nothing app named Allegheny on the OpenShift server, using the diy-0.1 cartridge, then clones a git repository of the code into a new directory called Allegheny, on your hard disk. The DIY cartridge provides a sort of minimal environment for an app, on a do-it-yourself basis; fortunately for us, we do not have to do it ourselves, as Andy Bunce has done all the crucial bits for us.

    When I did it this morning, this step took a little over two minutes.

  2. cd Allegheny

    Move into the app directory.

    Do not omit this step; you will regret it, especially if you use git yourself as a backup or synching mechanism. (That is to say, when I forgot to do this, the next step put all of the quickstart code into my home directory. It took me a painfully long time to find and delete it all and get it back out of my git repository. And my .gitignore file seems to have lost both its content and its history. Fortunately, I do have backups.)

  3. git remote add upstream -m master https://github.com/Quodatum/openshift-basex-quick-start.git

    git pull -s recursive -X theirs upstream master

    Pull the quickstart code into your app. (If you want to understand in detail what these lines do, I recommend the git manual. I am not going to try to explain.)

    When I ran this this morning, it took a little under two minutes; it seemed longer.

    If you’re in a hurry, go ahead to the next step. Otherwise, now may be a good time to pause to look at the application’s directory structure, since it’s where you will be doing your development. The crucial bits appear (in the current state of my imperfect knowledge) to be:

    • config (A bash script which sets some variables to be used in other bash scripts, including BASEX_USER and BASEX_PASSWORD, which are used as the values of the -U and -P options in calls to the basexhttp script. The URI of the version of BaseX to install is also given here, as are the port numbers the server should use.)

    • .openshift/ (part of the OpenShift infrastructure)

      • action_hooks/ (this is where the application developer places code to be executed by the OpenShift infrastructure at predefined moments)

        • start (a bash script to be executed when the application needs to be started; the script provided by the quickstart calls basexhttp start)

        • stop (a bash script to be executed when the application needs to be stopped; the script provided by the quickstart calls pkill java to stop the server; in principle it would rather call basexhttp stop, but at the moment that doesn’t work)

      • cron/ (Directory for cron jobs to be executed on behalf of the application, with subdirectories hourly, daily, etc.; the README file has a pointer to the relevant OpenShift documentation. The weekly subdirectory appears to have a weekly statistics job in it.)

      • markers/ (I have no idea what this is; the README points to the documentation.)

    • basex/ (For XQuery developers, this is where the action is.)

      • lib/ (Contains a JAR file for Saxon HE 9.7.0.1.)

      • repo/ (The BaseX repository; contains XQuery modules installed using REPO INSTALL URI-of-module; for details see the BaseX wiki under Repository.)

      • webapp/

        • restxq.xqm/ (A sample RESTXQ function by Andy Bunce supplied as part of the quickstart framework; it generates the default welcome page.)

        • WEB-INF/ (Contains the configuration files governing BaseX’s behavior as a web service)

          • jetty.xml

            (Jetty-specific configuration. With any luck you’ll never need to change anything here. The BaseX documentation refers readers to the Jetty documentation for details, which in turn suggests that you consult the Jetty 9 documentation instead, and good luck to you if you actually need to understand the details. The salient item for the quickstart is setting the IP address and port for the server, using the environment variables OPENSHIFT_DIY_IP and OPENSHIFT_DIY_PORT.)

          • web.xml

            (Generic servlet configuration. By default, the quickstart has the REST and WebDAV interfaces turned off and the RESTXQ interface turned on; you’ll need to edit the web.xml file to turn the REST and WebDAV interfaces back on. This document also specifies the default user and password for the server; how the specification in the web.xml file relates to the values given by the -U and -P options passed to basexhttp remains a mystery to me.

            Since the default web.xml file for the quickstart does not set the RESTXQPATH or RESTPATH options, they default to “the standard web application directory”, which appears in this case to be the webapp directory in the root directory of the git repository. That would be consistent with the placement of restxq.xqm. The web.xml file also doesn’t specify the REPOPATH option; the documentation says that it defaults to {home}/BaseXRepo, but apparently {home}/repo is also a possibility; that would be consistent with the placement of the repo/ directory here.)
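Concretely, turning the REST interface back on amounts to adding (or un-commenting) a servlet declaration and mapping in WEB-INF/web.xml. The sketch below uses the servlet class name given on the BaseX wiki; verify it against the web.xml shipped with your BaseX version:

```xml
<!-- Sketch: re-enabling the REST interface in WEB-INF/web.xml.
     The class name is the one documented on the BaseX wiki. -->
<servlet>
  <servlet-name>REST</servlet-name>
  <servlet-class>org.basex.http.rest.RESTServlet</servlet-class>
</servlet>
<servlet-mapping>
  <servlet-name>REST</servlet-name>
  <url-pattern>/rest/*</url-pattern>
</servlet-mapping>
```

The WebDAV interface is re-enabled the same way, with its own servlet declaration and mapping.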

    The other files and directories either have obvious functions (.git, .gitignore, LICENSE, README.md) or appear to be just samples (diy/, misc/).
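The action hooks mentioned above are ordinary shell scripts. Here is a minimal sketch of the pair in one file; the PID-file approach in stop is my own assumption (the quickstart itself currently stops the server with pkill java, as noted above), and the paths are illustrative:

```shell
#!/bin/bash
# Sketch of .openshift/action_hooks/ start and stop logic.
# Assumptions: a PID file in the data directory, and the location
# of the basexhttp binary; both are illustrative, not the
# quickstart's actual code.

PIDFILE="${OPENSHIFT_DATA_DIR:-/tmp}/basexhttp.pid"

start() {
    # Launch the HTTP server and remember its process id.
    "${OPENSHIFT_REPO_DIR:-.}/basex/bin/basexhttp" &
    echo $! > "$PIDFILE"
}

stop() {
    # Kill exactly the process we started, rather than all Java.
    [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
}

case "${1:-}" in
    start) start ;;
    stop)  stop ;;
esac
```

Split into two files (start and stop), each half becomes one hook; OpenShift invokes them at the predefined moments described above.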

  4. You can now adjust things in the configuration, if you wish. I mostly wait until later.

  5. git push origin master

    This checks in your changes on the server. The OpenShift infrastructure will tell the running DIY application to shut down (i.e. it will run the script in the checked-in version of .openshift/action_hooks/stop), push the changes from your local git repository to the copy in the cloud, and restart the app (by running .openshift/action_hooks/start).

    When I ran it this morning, this step took just under three minutes, including fetching and installing the BaseX binary.

    (In my initial experiments, I ran into a problem here: BaseX and OpenShift disagree about who may connect to which ports, with the result that basexhttp stop doesn’t have the desired effect, and so the check-in fails. Since then, Andy Bunce has rewritten the quickstart code with a temporary workaround.)

  6. Set up userids.

    BaseX can now be configured using the dba application, which ships with BaseX 8 and is conveniently linked from the default quickstart welcome page at http://allegheny-AIK.rhcloud.com/.

    The only essential configuration to do at this point is to change the admin password. I try to do this quickly, since until it is done, anyone who happens across this BaseX engine on the open Web can have admin privileges just by logging in using the default password.

    There are a couple of catches that make the process very slightly less straightforward than it might be.

    • The users.xml file will go into the OpenShift data directory, which means that you cannot conveniently put it in place before starting BaseX. (If inconvenience is no object, then by all means, be my guest: first prepare a users.xml file with another copy of BaseX, then scp it to the ./app-root/data/basex/data directory of your OpenShift app, before doing the push in the previous step.)

    • The config file needs to know the admin userid and password, in order to start and stop the server. (At least, I think it needs to know them; I haven’t actually tried putting gibberish into those variables to see.) If we change the admin password, we risk not being able to stop and start the server gracefully.
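For the record, the scp route sketched in the first bullet would look something like the following; the ssh target, local BaseX paths, and credentials are all placeholders, not values from any real deployment:

```shell
#!/bin/sh
# Sketch: prepare users.xml locally, then copy it into the app's
# data directory before the git push. Host, user, and paths are
# placeholders.

SSH_TARGET='svc-id@allegheny-AIK.rhcloud.com'   # from `rhc app show`
DATADIR='./app-root/data/basex/data'            # data dir named above

# 1. With a scratch local BaseX, create the users you want; BaseX 8
#    records them in users.xml in its data directory, e.g.:
#      basex -c "CREATE USER Angie s3cret; GRANT admin TO Angie"

# 2. Copy the resulting file into place (shown, not run):
echo scp ~/basex/data/users.xml "$SSH_TARGET:$DATADIR"
```

Doing this before the push means the server never runs with only the default admin account.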

    This is the technique I’ve worked out for dealing with these catches. There are probably better ways.

    • Log in to the dba application using the default admin userid and password.
    • Add the users in the list prepared earlier, with the corresponding privileges and passwords. Note that this list includes a second user with admin privileges, here called Angie.
    • In the local repo, change the relevant parts of the config file to read as follows (substituting your userid and password of choice, of course):
      BASEX_USER="Angie"
      BASEX_PASSWORD="Where-is-the-devil-in-Evelyn-what's-it-doing-in-Angela's-eyes"
    • Check in your changes and push them to OpenShift (git add config; git commit -m "Changing userid and password"; git push origin master).
    • Log in to the dba application again (actually, you’re probably still logged in), and change the password of the admin user. At this point, we have closed the window of vulnerability we opened when we started the server with the default admin password. When I’m feeling paranoid, I take this moment to check the Databases, Users, Files, and Logs tabs to see whether any intruders actually showed up and did anything.
  7. Test.

    To test that things are going as expected, I also create a database and run a few queries, using both the dba application and the REST interface.
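A minimal smoke test of the REST interface can be run with curl; the /rest paths are those documented for BaseX’s REST API, while the URL and credentials below are placeholders for your own deployment:

```shell
#!/bin/sh
# Sketch: smoke-testing the REST interface with curl. APP and AUTH
# are placeholders for your own app URL and credentials.
APP=${APP:-http://allegheny-AIK.rhcloud.com}
AUTH=${AUTH:-Angie:s3cret}

# Echo each command before running it; don't abort if the server
# is unreachable (e.g. when trying this offline).
run() { echo "+ $*"; "$@" 2>/dev/null || echo '(request failed)'; }

# Create (or replace) an empty database named "smoke":
run curl -fs --max-time 10 -u "$AUTH" -X PUT "$APP/rest/smoke"
# Evaluate a trivial query against it:
run curl -fs --max-time 10 -u "$AUTH" "$APP/rest/smoke?query=1+to+5"
```

If both requests succeed, the server is up, the credentials work, and the REST interface is really enabled.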

So now I know three different ways to make an XQuery database accessible over the web:

  • Run it as a server on a machine you control, or on which you can persuade the sysadmin to install it. (I assume that this can be done in the virtual private servers offered by many Web hosting providers, but I haven’t done it that way myself.)
  • Run it as a servlet under Tomcat on a Java hosting provider.
  • Run it as an application in a cloud service.

My personal experience with these is all with BaseX, but of course all three methods will also work, at least in principle, for other XQuery engines like eXist and MarkLogic.

It’s always good to have more than one string to one’s bow.