Scenes from a Recommendation 3: Boston, Prudential Tower

Another memory from the development of XML.

It’s November 1996, at the GCA SGML ’96 conference, at the Sheraton in Boston. The SGML on the Web Working Group and ERB have just been through an exhausting and exhilarating few weeks, when from a standing start we prepared the first public working draft of XML. At this conference, we have been given a slot for late-breaking news and will give the first public presentation of our work.

Lou Burnard, of Oxford University Computing Services, the founder of the Oxford Text Archive, is there to give an opening plenary talk about the British National Corpus, a 100-million-word representative corpus of British English, tagged in SGML. Lou and I are old friends; since 1988 we have worked together as editors of the Guidelines of the Text Encoding Initiative. Working together to shepherd a couple dozen working groups and task forces full of recalcitrant academics and other-worldly text theorists (“but why should a stanza have to contain lines of verse? I can perfectly well imagine a stanza containing no lines at all”) from requirements to draft proposals, to turn their wildly inconsistent and incomplete results into something resembling a coherent set of rules for encoding textual material likely to be useful to scholarship, and to produce in the end 1500 pages of mostly coherent prose documentation for the TEI encoding scheme, Lou and I have been effectively joined at the hip for years. We have consumed large quantities of good Scotch whisky together, and some quantities of beer and not so good whisky. We have told each other our life stories, in a state of sufficient inebriation that neither of us remembers any details beyond our shared admiration for Frank Zappa. We have sympathized with each other in our struggles with our respective managements; we have supported each other in working group and steering committee meetings; we have pissed each other off repeatedly, and learned, with a little help from our friends (thank you, Elaine), to patch things up and keep going. No one but my wife knew me better than Lou; no one but my wife could push my buttons and enrage me more effectively. (And she didn’t push those buttons nearly as often as Lou did.)

Tim Bray is also there, naturally. He and I have not worked together nearly as long as Lou and I have, but the compressed schedule and the intensity of the XML work have made it a similarly close relationship. We spend time on the phone arguing about the best way to define this feature or that, or counting noses to see which way a forthcoming decision is likely to come out (we liked to try to draft wording in advance of the decisions, when possible). We commiserate when Charles Goldfarb calls and spends a couple hours trying to wear us down on the technical issue of the day. (Fortunately, Charles called Tim and Jon Bosak more often than me. Either he decided he couldn’t wear me down, or he concluded I was a lightweight not worth worrying about. I’m not complaining.) Like Lou, Tim often reads a passage I have drafted and says “This is way too complicated, let’s just leave this, and this, and this, and that, out. See? Now it’s a lot simpler.”

At one point I believed it was generally a good idea for an editorial team to have a minimalist and a maximalist yoked together: the maximalist gets you the functionality you need, and the minimalist keeps it from being much more than you need. Maybe it is a good idea in general. Or maybe it was just that it worked well both in the TEI and in XML. At the very least, it’s suggestive that in the work on the XML Schema spec, I was the resident minimalist; if in any working group I am the minimalist, it’s a good bet that the product of that WG will be regarded as baroque by most readers.

It’s the evening before the conference proper, and there is a reception for attendees in a lounge at the top of the Prudential Tower. I am standing chatting with Tim Bray and Lauren Wood, and suddenly Lou comes striding urgently across the room towards us. He reaches us. He looks at me; he looks at Tim; and he says, in pitch-perfect tones of the injured spouse, “So this is the other editor you’ve been seeing behind my back!”

Applescript, so close and yet so far

[2 February 2008]

There are lots of big things on my mind lately: papers due and overdue and long overdue, submission deadlines coming up, and a long, long list of things to fix in the XSD 1.1 spec.

But there are some little things that refuse to stop taking up time and energy.

Years ago, tired of the hassles of trying to synchronize desktop and laptop, I followed the example of my friend Willard McCarty and started using my laptop as my only machine. This has worked pretty well on the whole, though it has saddled me with heavier laptops than some of my friends carry and given me less disk space than I could have had on desktop machines bought for the same price.

But a key part of making this work is having an external keyboard to use at my desk. I use a wave-shaped keyboard from Logitech, and to make things work as I expect, I use the Mac System Preferences interface to switch the Option and Command keys when I’m using the external keyboard.

Unfortunately, when I’m using the Powerbook’s own keyboard, this system preference must be undone. And then when I return to my desk, I have to switch the keys again.

Changing the relevant keyboard settings takes seven or eight mouse clicks. That gets old. I’d like to automate it; can Applescript help? Yes, it can: the sample scripts include at least one example of scripting a change to System Preferences.

So I spent some time the other day trying to script my task: one script to launch System Preferences, choose Keyboard and Mouse, choose Modifier Keys, switch Command and Option, choose OK, and quit; another to go the other way.

The documentation makes it fairly clear that I need to know the names for buttons and subpanes and so on provided by the application, so I can tell Applescript which things to activate. But I seem to be missing a step; I can’t find anything that tells me what names System Preferences gives to its panes. There’s an Open Dictionary option in Script Editor, but the dictionary for System Preferences only tells me that it defines things called panes. It doesn’t tell me — or am I just missing something here? — what IDs those panes have, or how to find out.
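For the record, here is the sort of probe I would expect to work, wrapped in Python around the stock osascript command-line tool so that I can run it from a shell. It is a sketch only, and untested: it assumes that the panes System Preferences admits to having also expose id and name properties that a script can query, which is exactly the part I have not managed to confirm.

    # Untested sketch: ask System Preferences itself to enumerate its panes,
    # on the assumption (hinted at, but not spelled out, by its dictionary)
    # that each pane carries scriptable "id" and "name" properties.
    import subprocess

    def ask(applescript: str) -> str:
        """Run one line of AppleScript via the stock osascript tool."""
        result = subprocess.run(
            ["osascript", "-e", applescript],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    if __name__ == "__main__":
        # osascript prints each AppleScript list as one comma-separated line.
        print(ask('tell application "System Preferences" to get the id of every pane'))
        print(ask('tell application "System Preferences" to get the name of every pane'))

If that works, the id strings (rather than the human-readable names) are presumably what a script would hand back to System Preferences to open a particular pane; but that is a guess, and my budget for guessing is spent.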

At the moment, this task is out of time and is going back to the bottom of the to-do list. But every time I take my machine away from my desk, or bring it back, I’m reminded that I haven’t solved this one yet.

W3C working group meetings / Preston’s Maxim

[25 January 2008]

I’m finishing a week of working group meetings in Florida, in the usual state of fatigue.

The SML (Service Modeling Language) and XML Query working groups met Monday through Wednesday. SML followed its usual practice of driving its agenda from Bugzilla: first review the issues for which the editors have produced proposals, and act on the proposals; then review the ones we have discussed before without reaching enough consensus to send them to the editors; then take up the rest. The working group spent a fair bit of time discussing issues I had raised or was being recalcitrant on, only to end up resolving them without making the suggested change. I felt a little guilty at taking so much of the group’s time, but no one exhibited any sign of resentment. In one or two cases I was in the end persuaded to change my position; in others it simply became clear that I wasn’t going to manage to persuade the group to do as I suggested. I have always found it easier to accept with (what I hope is) good grace a decision going against me, if I can feel that I’ve been given a fair chance to speak my piece and persuade the others in the group. The chairs of SML are good at ensuring that everyone gets a chance to make their case, but also adept (among their other virtues) at moving the group along at a pretty good clip.

(In some working groups I’ve been in, by contrast, some participants made it a habit not to argue the merits of the issue but instead to spend the time available arguing over whether the group should be taking any time on the issue at all. This tactic reduces the time available for discussion, impairs the quality of the group’s deliberation, and thus reduces the chance that the group will reach consensus; it’s therefore extremely useful for those who wish to defend the status quo but do not have, or are not willing to expose, technical arguments for their position. The fact that this practice reduces me to a state of incoherent rage is merely a side benefit.)

“Service Modeling Language” is an unfortunate name, I think: apart from the fact that the phrase doesn’t suggest any very clear or specific meaning to anyone hearing it for the first time, the meanings it does suggest have pretty much nothing whatever to do with the language. SML defines a set of facilities for cross-document validation, in particular validation of, and by means of, inter-document references. Referential integrity can be checked using XSD (aka XML Schema), but only within the confines of a single document; SML makes it possible to perform referential integrity checking over a collection of documents, with cross-document analogues of XSD’s key, keyref, and unique constraints, and with further abilities: in particular, one can specify that inter-document references of a given kind must point to elements with a particular expanded name, or to elements with a given governing type definition, or that chains of references of a particular kind must be acyclic. In addition, the SML Interchange Format (SML-IF) specifies rules that make it easier to specify exactly what schema is to be used for validating a document with XSD and thus to get consistent validation results.

The XML Schema working group met Wednesday through Friday. Wednesday morning went to a joint session with the SML and XML Query working groups: Kumar Pandit gave a high-level overview of SML and there was discussion. Then in a joint session with XML Query, we discussed the issue of implementation-defined primitive types.

The rest of the time, the Schema WG worked on last-call issues against XML Schema. Since we had a rough priority sort of the issues, we were able just to sort the issues list and open them one after the other and ask “OK, what do we do about this one?”

Among the highlights visible from Bugzilla:

  • Assertions will be allowed on simple types, not just on complex types.
  • For negative wildcards, the keyword ##definedSibling will be available, so that schema authors can conveniently say “Allow anything except elements already included in this content model”; this is in addition to the already-present ##defined (“Allow anything except elements defined in this schema”). The Working Group was unable to achieve consensus on deep-sixing the latter; it has really surprising effects when new declarations are included in a schema and seems likely to produce mystifying problems in most usage scenarios, but some Working Group members are convinced it’s exactly what they or their users want.
  • The Working Group declined a proposal that some thought would have made it easier to support XHTML Modularization (in particular, the constraints on xhtml:form and xhtml:input); it would have made it possible for the validity of an element against a type to depend, in some cases, on where the element appears. Since some systems (e.g. XPath 2.0, XQuery 1.0, and XSLT 2.0) assume that type-validity is independent of context, the change would have had a high cost.
  • The sections of the Structures spec which contain validation rules and constraints on components (and the like) will be broken up into smaller chunks to try to make them easier to navigate.
  • The group hearkened to the words of Norm Walsh on the name of the spec (roughly paraphrasable as “XSDL? Not WXS? Not XSD? XSDL? What are you smoking?”); the name of the language will be XSD 1.1, not XSDL 1.1.

We made it through the entire stack of issues in the two and a half days; as Michael J. Preston (a prolific creator of machine-generated concordances known to a select few as “the wild man of Boulder”) once told me: it’s amazing how much you can get done if you just put your ass in a chair and do it.

Primitives and non-primitives in XSDL

John Cowan asks, in a comment on another post here, what possible rationale could have governed the decisions in XSDL 1.0 about which types to define as primitives and which to derive from other types.

I started to reply in a follow-up comment, but my reply grew too long for that genre, so I’m promoting it to a separate post.

The questions John asks are good ones. Unfortunately, I don’t have good answers. In all the puzzling cases he notes, my explanation of why XSDL is as it is begins with the words “for historical reasons …”.

Honeypots: better than CAPTCHAs?

[17 January 2008]

As noted earlier, the short period of time between starting a blog and encountering comment spam has now passed, for this blog. And while the volume of comment spam is currently very low by most standards, for once I’d like to get out in front of a problem.

So when not otherwise committed, I spent most of yesterday reading about various comment-spam countermeasures, starting with those recommended by the commenters on my earlier post. (More comments on that post, faster, than on any other post yet: clearly the topic hit a nerve.)

If you’re keeping score, I decided ultimately to install Spam Karma 2, in part because my colleague Dom Hazaël-Massieux uses it, so I hope I can lean on him for support.

But the most interesting idea I encountered was certainly the one mentioned here by Lars Marius Garshol (to whom thanks). The full exposition of the idea by Ned Batchelder is perfectly clear (and to be recommended), but the gist can be summarized thus:

  • Some comment spam comes from humans hired to create it.
  • Some spam comes from “playback bots” which learn the structure of a comment form once (with human assistance) and then post comments repeatedly, substituting link spam into selected fields.
  • Some comment spam comes from “form-filling bots”, which read the form and put more or less appropriate data into more or less the right fields, apparently guiding their behavior by field type and/or name.

For the first (human) kind of spam, there isn’t much you can do (says Batchelder). You can’t prevent it reliably. You can use rel="nofollow" in an attempt to discourage them, but Michael Hampton has argued in a spirited essay on rel="nofollow" that in fact nofollow doesn’t discourage spammers. By now that claim is more an empirical observation than a prediction. By making it harder to manipulate search engine rankings, rel="nofollow" makes spammers think it even more urgent (says Hampton) to get functional links into other places where people may click on them.

But I can nevertheless understand the inclination to use rel="nofollow": it’s not unreasonable to feel that if people are going to deface your site, you’d at least like to ensure their search engine ranking doesn’t benefit from the vandalism.

And of course, you can also always delete their comments manually when you see them.

For the playback bots, Batchelder uses a clever combination of hashing and a local secret to fight back: if you change the names of fields in the form, by hashing the original names together with a time stamp and possibly the requestor’s IP address, then (a) you can detect comments submitted a suspiciously long time after the comment form was downloaded, and (b) you can prevent the site-specific script from being deployed to an army of robots at different IP addresses.
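To make the mechanics concrete, here is my own back-of-the-envelope reconstruction of the scheme in Python. It is a sketch, not Batchelder’s code; the secret, the three-hour window, the HMAC-SHA1 digest, and the helper names are all placeholders of my own choosing.

    # Sketch of the field-renaming defence; illustrative only.
    import hashlib
    import hmac
    import time

    SECRET = b"local-secret-never-sent-to-the-client"
    MAX_AGE = 3 * 60 * 60   # seconds; reject forms downloaded longer ago than this

    def disguise(field: str, timestamp: int, client_ip: str) -> str:
        """Per-request name under which `field` appears in the comment form."""
        message = f"{field}:{timestamp}:{client_ip}".encode()
        return "f" + hmac.new(SECRET, message, hashlib.sha1).hexdigest()[:12]

    def form_names(fields, client_ip):
        """Names to emit when serving the form, plus the timestamp to embed."""
        ts = int(time.time())
        return ts, {f: disguise(f, ts, client_ip) for f in fields}

    def acceptable(posted, ts, client_ip, fields) -> bool:
        """A submission passes only if it is fresh and its field names match
        what we would have generated for this timestamp and this address."""
        if time.time() - ts > MAX_AGE:
            return False        # (a) suspiciously old: likely a playback bot
        return all(disguise(f, ts, client_ip) in posted for f in fields)

The timestamp travels in the clear (in a hidden field, say), but the secret never leaves the server, so a bot that has memorized one rendering of the form cannot recompute the field names for another time or another address; that is point (b).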

My colleague Liam Quin has pointed out that this risks some inconvenience to real readers. If someone starts to write a comment on a post, then breaks off to head for the airport, and finally finishes editing their comment and submitting it after reaching their hotel at the other end of a journey, then not only will several hours have passed, but their IP number will have changed. Liam and I both travel a lot, so it may be easy for us to overestimate the frequency with which that happens in the population at large, but it’s an issue. And users behind some proxy servers (including those at hotels) will frequently appear to shift their IP addresses in a quirky and capricious manner.

For form-filling bots, Batchelder uses invisible fields as ‘honeypots’. These aren’t hidden fields (which won’t deceive bots, because they know about them), but fields created in such a way that they are not visible to sighted human users. Since humans don’t see them, humans won’t fill them out, while a form-filling bot will see them and (in accordance with its nature) will fill them out. This gives the program which handles comment submissions a convenient test: if there’s new data in the honeypot field, the comment is pretty certain to be spam.
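The server-side half of the test is then almost trivially small; something along these lines, where the field name and the shape of the submitted form are invented for illustration:

    # Honeypot test, server side: a form-filling bot dutifully populates the
    # invisible field; a human, who never sees it, leaves it empty.
    HONEYPOT_FIELD = "website2"   # hypothetical name of the invisible field

    def looks_like_spam(posted_form: dict) -> bool:
        """True if the invisible field came back non-empty."""
        return bool(posted_form.get(HONEYPOT_FIELD, "").strip())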

Batchelder proposes a wide variety of methods for making fields invisible: the CSS styles “display: none” or “font-size: 0”, or positioning the field absolutely and then carefully placing an opaque image (or something else opaque) over it. And we haven’t even gotten into Javascript yet.

For the sake of users with Javascript turned off and/or CSS-impaired browsers, the field will be labeled “Please leave this field blank; it’s here to catch spambots” or something similar.

In some ways, the invisible-honeypot idea seems to resemble the idea of CAPTCHAs. In both cases, the human + computer system requesting something from a server is asked to perform some unpredictable task which a bot acting alone will find difficult. In the case of CAPTCHAs, the task is typically character recognition from an image, or answering a simple question in natural language. In the case of the honeypot, the task is calculating whether a reasonably correct browser’s implementation of Javascript and CSS will or will not render a given field in such a way that a human will perceive it. This problem may be soluble, in the general case or in many common cases, by a program acting alone, but by far the simplest way to perform the task is to display the page in the usual way and let a human look to see whether the field is visible or not. That is, unlike a conventional CAPTCHA, a honeypot input field demands a task which the browser and human are going to be performing anyway.

The first question that came to my mind was “But wait. What about screen readers? Do typical screen readers do Javascript? CSS?”

My colleagues in the Web Accessibility Initiative tell me the answer is pretty much a firm “Sometimes.” Most screen readers (they tell me) do Javascript; behavior for constructs like CSS display: none apparently varies. (Everyone presumably agrees that a screen reader shouldn’t read material so marked, but some readers do; either their developers disagree or they haven’t yet gotten around to making the right thing happen.) If you use this technique, you do want to make sure the “Please leave empty” label is associated with the field in a way that will be clear to screen readers and the like. (Of course, this holds for all field labels, not just labels for invisible fields. See Techniques for WCAG 2.0 and Understanding WCAG 2.0 for more on this topic.)

The upshot appears to be:

  • For sighted or unsighted readers with Javascript and/or CSS processing supported by their software and turned on, a honeypot of this kind is unseen / unheard / unperceived (unless something goes wrong), and incurs no measurable cost to the human. The cost of the extra CSS or Javascript processing by the machine is probably measurable but negligible.
  • For human readers whose browsers and/or readers don’t do Javascript and/or CSS, the cost incurred by a honeypot of this kind is (a) some clutter on the page and (b) perhaps a moment of distraction while the reader wonders “But why put a field there if you want me to leave it blank?” or “But how can putting a data entry field here help to catch spambots?” For most users, I guess this cost is comparable to that of a CAPTCHA, possibly lower. For users excluded by a CAPTCHA (unsighted users asked to read an image, linguistically hesitant users asked to perform in a language not necessarily their own), the cost of a honeypot seems likely to be either a little lower than that of a CAPTCHA, or a lot lower.

I’m not an accessibility expert, and I haven’t thought about this for very long. But it sure looks like a great idea to me, superior to CAPTCHAs for many users, and no worse than CAPTCHAs (as far as I can now tell) for anyone.

If this blog used homebrew software, I’d surely apply these techniques for resisting comment spam. And I think I can figure out how to modify WordPress to use some of them, if I ever get the time. But I didn’t see any off-the-shelf plugins for WordPress that use them. (It’s possible that Bad Behavior uses these or similar techniques, but I haven’t been able to get a clear idea of what it does, and it has what looks like a misguided affinity for the idea of blacklists, on which I have given up. As Mark Pilgrim points out, when we fight link spam, we might as well try to learn from the experience of fighting spam in other media.)

Is there a catch? Am I missing something?

What’s not to like?