7500 to 1

[14 December 2017]

A little less than a month ago, noticing that spam had gotten out of hand here again, I started monitoring comments on this blog manually, to observe them more closely.

Since that time, the About page and the posts still open for comments have received something more than 7500 comments (so, between 200 and 300 a day), not counting those blocked by the Bad Behavior plugin), of which one was a relevant comment made by a human, and the others were spam.

Sigh.

XML and the struggle to keep documentation current and in synch with practice

[9 December 2017]

One of the nice things about having data in a reusable form like SGML or XML that is not application-specific is that it makes it easier to keep documentation in synch with practices and/or software. (Relational databases have some of the same advantages, but I don’t find them handy for texts, and annotating specific data values can require arbitrarily complex technical prose.)

An example I am disproportionately pleased with has recently come about.

The project Annotated Turki Manuscripts from the Jarring Collection Online is transcribing some Central Asian manuscripts collected in and near Kashgar in the first half of the twentieth century. The manuscripts we are working with are written in Perso-Arabic script, and in order to make them better accessible to interested readers more comfortable with Latin script than with Perso-Arabic we provide transcriptions in Latin transliteration as well as in the original script. The domain specialists in the project have spent a lot of time working on the transliteration scheme, trying to make it easily readable while still retaining a 1:1 relation with the original so that no information is lost in transliteration.

Because the transliteration scheme itself is a significant work product, we want to document it. Because it needs to be applied to every new transcription, it also needs to be realized in executable code. And, as one might expect, the scheme has changed slightly as we have gained experience with the manuscripts and with it.

Our representation of the transliteration scheme has taken a variety of forms over the last couple of years: extensive notes on a whiteboard, images of that whiteboard, entries in a table in the project wiki, hard-coded tables of character mappings in an XSLT stylesheet written by a student and other stylesheets derived from it, a spreadsheet, and recently also an XML document, which is both on the Web in XML form with a stylesheet to render it more or less legible to humans (transliteration tables are intrinsically kind of dense) and used by the latest incarnation of the student’s stylesheet (itself on the Web), replacing the hard-coded representation used in earlier versions.

The XML representation has the disadvantage that it’s not as easy to sort it in many different ways as it is to sort the spreadsheet; it has the advantage over the spreadsheet of significantly better data normalization and less redundancy. Compared to the tables used in earlier versions of the XSLT stylesheet, the XML document is significantly better documented and thus easier to debug. (The redundant presentation of strings as displayed characters and as sequences of Unicode code points is important in a way that will surprise no one who has struggled with special character handling issues before.) The mixture of prose and tabular data in the XML, and the more systematic distinction between information about a particular Perso-Arabic string and information about a particular phonetic realization of that string and the Latin-script regularization of that pronunciation), are things that XML does really easily, and other data formats don’t do nearly as easily.

Using XSLT stylesheets to make XML representations of information (here the script-to-script mapping rules) more easily human readable seems similar in spirit to literate programming as developed and practiced by Donald Knuth, although different in details.

Working on your Spanish

[18 November 2017; copy edits 18-24 November]

In just over seven months, the annual Digital Humanities conference will take place in Mexico City. I doubt that I’m the only digital humanist in the world who is surreptitiously trying to improve my Spanish before next June.

If you are trying to improve your Spanish (whether for that reason or for others), here are some things I am finding useful.

First, a textbook.

  • The French publisher Assimil has a Spanish volume in their sans peine series: L’espagnol sans peine. It’s available in several versions for speakers of different languages: English (Spanish with ease), German (Spanisch ohne Mühe), Italian, Dutch, and Portuguese. The series is described, quite accurately, as suitable for “beginners and pseudo-beginners” (débutants et faux-débutants).

    I bought the book and CDs direct from the Assimil web site and had them shipped to me in the U.S. without any difficulty; the only catch for some potential buyers is that the site is in French. Period. (Wait, aren’t they specialists in books for foreign-language learners? Don’t their web people realize the firm has non-Francophone target readers? Ah, well. There are some English-language resellers one can use, or you can grab a Francophone friend and make them babysit you through finding the book you want and making your purchase.)

    I won’t attempt to describe the Assimil method here. The linguist John McWhorter has given a good account of their results in his piece on the NPR web site (and there is some useful concrete advice at the site ‘How to Learn any Language’, for those who find the books’ instructions vague). I will say that for those who like me are attempting to learn (or re-learn) a language on their own and not in a class, the Assimil sans peine series has no equal that I know of.

    I bought the combination pack with the printed book, sound CDs with recordings of the dialogues, and a CD with MPEG versions of the recordings; I have downloaded the MPEG recordings to a directory on an Android tablet, imported the directory into the Podcast Addict app as an audio book, and I use Podcast Addict to play the day’s recording on continuous loop while I fetch the paper, wash dishes, etc. Listening in a podcast player has the advantage that I can speed up the early lessons to something approaching normal conversational speed.

Second, a supply of relatively easy reading and listening material. My searches for podcasts for learners of Spanish as a foreign languages turned up large numbers of results, some of which were obviously irrelevant and some of which I was able to delete without qualms after listening for a couple of minutes. I now listen to three:

  • Español automatico, prepared by a personable teacher of Spanish named Karo Martínez; in some installments, she speaks in relatively simple Spanish about assorted topics (the relative merits of the various actors who have played James Bond, the history of Catalonia, how to learn Spanish, and others; self-help topics of the how-to-be-more-organized variety are not uncommon), in others she specifically discusses issues of Spanish idiomatic usage and vocabulary. Episodes generally range from twenty to forty minutes (although when she got going on her introduction to the history of Barcelona, it ran over an hour). Transcripts and other additional materials appear to be available from the web site, some for purchase, which I hope pays for the cost of producing and publishing the podcast, and others for free (but you have to give them your email address, which induces a spasm of irrational privacy paranoia in me, so I have no idea what form the transcripts take). The only blemish for me is the repeated plea for listeners to file a five-star review of the podcast in iTunes.

  • El oso latino habla español — para mejorar su español, a quirky podcast put together in Sherbrooke, Québec, produced by a Québecois Spanish-learner named Pascal Dion and featuring the Peruvian Oswaldo Horna Montes (known, I gather, to his friends as El oso latino) as the main speaker. Episodes often include interviews with visitors from Latin America, with other Latin Americans living in Sherbrooke, or with anyone whom the host and producer think will be interesting; topics of discussion regularly include differences among varieties of Spanish and idioms peculiar to this or that regional variety. In one show, Montes narrated the preparation of a Peruvian dish for dinner. Dion appears in a regular segment on language-learner errors called Crónica del gringo, and Montes’s daughters in the segment Los chistes de Celia y de Marisol, which makes me laugh til I cry even though I have not yet understood a word of any of the jokes they have told. Lots of music. The most personal and thus the most memorable of these three podcasts. Generally around 30 minutes per episode.

  • News in slow Spanish, which is more or less what it sounds like: a weekly podcast with news stories in Spanish (there are similar News in Slow X podcasts for a number of different languages). There are several versions (intermediate vs. advanced, Castilian vs. Latin American), but oddly only one that I was able to locate from within Podcast Addict (Spain, Advanced, which proves not too advanced for me). News in Slow Spanish (Latino) does appear in the Tune-In Radio app (the only thing I miss there is continuous looping).

    I shied away from this at first; I have decades-old bad memories of unconvincing ‘newscasts’ specially for language learners filled with soft news (to give them a longer shelf life) and painful explanations of words. But I ended up trying the podcast, after I failed to find an app or podcast that would give me a conventional five- to fifteen-minute radio news broadcast on demand, preferably updated once a day or once an hour (something along the lines of the NPR News app, or the Radio Canada app, which will play you the most recent hourly newscast on demand — still looking; if you know of anything let me know). And I was convinced. The news items are real, current, and interesting, and the editorial comment feels lively and intelligent. In the Latin-American version, particularly, it is interesting to listen to the friendly discussion between hosts with slightly different political leanings.

    I’ve been listening to the initial free portion of the podcast; a longer episode with more news items and some discussions of Spanish grammar and idioms is available as a paid service. The free portion runs about five minutes. (Every now and then there is either a slip or an intentional freebie, and the entire thirty-minute program is included.)

On easy reading material, I’m not doing too well. Children’s and young-adult books are an obvious choice, but I don’t know an easy way to know what’s worth buying and what’s not. Well, actually I do. The next time I’m at my public library I’ll ask at the information desk for recommendations.

And third, a supply of normal Spanish for listening and reading, ideally interesting and not over-challenging. Here, of course, the choice is limitless and the expanse of possibilities feels as trackless as Borges’s Library. (If you need more motivation to work on your Spanish, think about it — you could be reading Borges in the original!)

There is always the news (which tends to have a relatively manageable vocabulary, and to have a lot of short pieces). There are Android apps for any number of Spanish-language newspapers, most of which I haven’t heard of and some of which may or may not be worth reading. With my mind focused on Mexico City, I have looked only at apps and web sites for Mexican newspapers. A Mexican colleague (whom I thank, but who shall remain nameless here because I haven’t asked permission to name them here) has suggested:

  • Animal politico (left wing); the web site is fine on a desktop machine and a bit hard to navigate on a tablet. I didn’t find any app.
  • El Universal (center right); I did find an app but did not find it usable.
  • La Jornada (left wing); I find the Android app usable (though I wish it allowed me to adjust the font size), so I haven’t worked with the web site.

At the moment, I confess to finding Mexican newspapers slightly heavy going.

Since I’m a digital humanist, and I’m looking forward to DH 2018, I read the blog run by the Red de humanidades digitales with great interest, even if sometimes with imperfect comprehension.

Eventually, I’ll be looking for Spanish-language detective stories and the like: page-turners are a real boon for a foreign-language learner, so I will happily read many things in a foreign language whose English equivalents I wouldn’t normally be caught dead with. (I’m told that on the same principle, some adult literacy programs in this country do great work with Mickey Spillane.) Suggestions welcome.

Finding good podcasts aimed not at language learners but at intelligent adults has been a challenge, but looking around for podcasts on the sites of UNAM (Universidad Nacional Autónomo de México) and IMER (Instituto Mexicano de la Radio) has produced dividends, as have some journalistic think pieces I found on the Web on new media in Latin America. Right now my subscriptions include the following. (At the moment, all of these are tough sledding for me, but they repay repeated listening. The ability of podcast software to slow the playback helps — it’s like being able to say “¡Demasiado rápido! ¡Más despacio, por favor!” and have the podcast nod and slow down.)

  • Azul Chiclamino, a podcast by Rodrigo Llop. I have no idea how to describe this; perhaps the subtitle will do: La realidad de lo absurdo. This is sometimes characterized as a humorous broadcast; for a sufficiently nuanced definition of “humor” (think Mark Twain) that’s probably true, but I find the podcast much more appealing than that label would lead me to expect.
  • Radio Ambulante, an NPR-affiliated podcast. Feels a bit like Radio Lab or This American Life, in Spanish: thoughtful, serious, well produced. Has the advantage that its stories are often about Hispanic affairs in the US, so I understand some of the background; has the drawback that its stories are often about the US, which doesn’t help break out of a US news bubble.
  • Ráfagas de pensamiento, a series of short pieces by the philosopher Ernesto Priani Saisó of UNAM, often reacting to a passage in an earlier philosopher (Nietzche, Husserl, Leibniz, More, …). Produced with atmospheric music and read by what sound like professional voice actors. Has the disadvantage for me that the background music can make the words harder to hear; has the advantage that it’s worth listening to. Usually 3-5 minutes per piece.
  • A multipart dramatization by Radio UNAM of Así asesinaron a Trotski by Leandro Sánchez Salazar, the man in charge of the investigation of Trotsky’s murder. I have no idea what Sánchez’s book is like as a historical source, but it has the virtue of strong narrative drive — even though I already know how it turns out. I may need to read the book in order to understand some of the broadcast.

Netflix and Amazon appear to have rather thin selections when it comes to Spanish-language films but they do have some. If anyone reading this knows an effective way to search by language on either, I’m all ears; surely searching for “Almodovar” should not be the only possibility (I am going to save Buñuel for later, when my Spanish is better and I can tell the difference between surrealism and not understanding the words). YouTube has a fair bit of Spanish content, though again I have not found any good way to find it except for searching on random Spanish words. An impulsive search on “Así asesinaron a Trotski” turned up several documentaries on Trotsky’s assassination, Trotsky’s life, Trotsky and Stalin, and Ramon Mercader (the man who killed Trotsky), as well as a few seminars on Trotskyite political theory.

[Addendum: on Netflix, selecting Browse / Audio & Subtitles takes the user to an interface where one can browse items with audio, or subtitles, in a given language. This is imperfect, but probably better than nothing. Looking for something to watch in the resulting display feels like looking for a book to read in a library arranged by color; for every ten times you feel irritated by its apparently random arrangement and the inconvenience of having to click on something every time you would like more information than is given in the icon, you may once or twice feel pleased by some serendipity.]

All of this is, of course, just my two cents. As may be clear from the above, my language learning work happens mostly on an Android tablet, not on a desktop machine.

n-gram Markov models as regular approximations?

[16 September 2016][cleared out Viagra spam 27 April 2017]

By construction, Markov models can generate, or recognize, regular languages. This follow from the fact that they are essentially weighted finite state automata.

If we take a body of material that conforms to a context-free grammar G, segment it into tokens the way a lexical scanner would do, and then construct an n-gram Markov model on the basis of the material, we will have a weighted finite state automaton (FSA) that produces or recognizes a regular approximation of the context-free language of G.

How will that regular approximation, and any regular grammar that expresses it (with or without the weights) relate to the original context-free grammar G? How will it relate to other regular approximations of L(G), the kind that one gets by manipulating the grammar G directly?

This is mostly an abstract question, but regular approximations have many uses.

  • Recognizing regular languages is often faster than recognizing context-free languages, and requires fewer resources.
  • Schema languages like XSD and Relax NG can use regular expressions to restrict strings, but not context-free grammars. (Amelia Lewis told me once she’d like to design a schema language which allowed context-free grammars for constraining strings; I think that would be an interesting schema language.)
  • Language-specific editing modes (e.g. c-mode in Emacs) must often treat the language in question as if it were regular (i.e. you don’t really have access to the context-free grammar or a parse tree when deciding how much to indent a line: you have to decide based on whether the preceding line ends with an open brace or not, and so on). I’ve never found descriptions of how to write language modes in Emacs easy to follow, perhaps because they don’t explain how to stop thinking about a language in terms of its context-free grammar and how to think about it instead as a regular language.

But probably I’m thinking about this today because I’ve been thinking about part-of-speech tagging a lot recently. The easiest way to build a reasonably good part-of-speech tagger nowadays appears to be to build a hidden Markov model on the basis of some training corpus. (There are other more sophisticated approaches, which fight it out among themselves for the odd tenth of a percentage point in their correctly-tagged rates. They don’t appear to be nearly as easy to understand and build.)

I conjecture that one of the reasons bigram and trigram part-of-speech taggers do as well as they do is that they apply information about what can occur where in a sentence, in a way that has been dumbed down to the point where it can fit into (be expressed by) a finite state automaton. I wonder if a systematic way to go from a context-free grammar to an FSA could help build better / smarter taggers. Could it capture more information about grammatical context? Transmit that information over longer distances? Provide guidance in cases where the available training data would otherwise be too sparse?

One reason trigrams can do better than bigrams is that they provide more information for the decision about how to tag each word: the choice of tag depends not just on the preceding tag but on the preceding two tags. One reason trigrams can do worse than bigrams is that there are a lot more potential trigrams for any set of part-of-speech tags than there are bigrams, so trigrams are apt to suffer from sparse-data problems unless one has “a lot” of training data (for some meaning of “a lot”); 4-grams, 5-grams, etc., appear to be non-starters because of the sparse-data problem. Could a systematic derivation of a Markov model from a CFG help? Could we judiciously tweak the underlying FSA to carry information we know (or suspect) will be useful? That would provide the same advantages as n-grams with larger n. Could generating the FSA from a grammar help provide guidance for distinguishing grammatical-but-infrequent turns of phrase from gibberish? That would help minimize the sparse-data problem for n-grams with larger n.

It’s the 16th of September — Happy Independence Day, neighbors! Viva Hidalgo!

Fundamental primitives of XSLT programming

[2016-08-07]

A friend planning an introductory course on programming for linguists recently asked me what I thought such linguist-programmers absolutely needed to learn in their first semester of programming. I thought back to a really helpful “Introduction to Programming” course taught in the 1980s at the Princeton University Computer Center by Howard Strauss, then the head of User Services. As I remember it, it consisted essentially of the introduction of three flow-chart patterns (for a sequence of steps, for a conditional, and for a while-loop), with instructions on how to use them that went something like this:

  1. Start with a single box whose text describes the functionality to be implemented.
  2. If every box in the diagram is trivial to implement in your programming language, stop: you’re done. Implement the program using the obvious translation of sequences, loops, and conditionals into your language.
  3. Otherwise choose some single box in the diagram whose functionality is non-trivial (will require more than a few lines of code) and replace it with a pattern: either break it down into a sequence of steps, or make it into a while-condition-do-action loop, or make it into an if-then-else choice.
  4. Return to step 2.

I recommended this idea to my friend, since when I started to learn to program I found these three patterns extremely helpful. As I thought about it further, it occurred to me that the three patterns in question correspond 1:1 to (a) the three constructors used in regular languages, and (b) the three patterns proposed in the 1970s by Michael A. Jackson. The diagrams I learned from Howard Strauss were not the same as Jackson’s diagrams graphically, but the semantics were essentially the same. I expect that a good argument can be made that together with function calls and recursion, those three patterns are the atomic patterns of software design for all conventional (i.e. sequential imperative) languages.

I think the patterns provide a useful starting point for a novice programmer: if you can see how to express an idea using those three patterns, it’s hard not to see how to capture it in a program in Pascal, or C, or Python, or whatever language you’re using. Jackson is quite good on deriving the structure of the program from the structures of the input and output in a systematic way.

The languages I most often teach, however, are XSLT and XQuery; they do not fall into the class of conventional sequential imperative languages, and the three patterns I learned from Howard Strauss (and which Howard Strauss may or may not have learned from Michael A. Jackson) do not help me structure a program in either language.

Is there a similarly small set of simple fundamental patterns that can be used to describe how to build up an XSLT transformation, or an XQuery program?

What are they?

Do they have a plausible graphical representation one could use for sketching out a stepwise refinement of a design?