W3C working group meetings / Preston’s Maxim

[25 January 2008]

I’m finishing a week of working group meetings in Florida, in the usual state of fatigue.

The SML (Service Modeling Language) and XML Query working groups met Monday-Wednesday. SML followed its usual practice of driving its agenda from Bugzilla: review the issues for which the editors have produced proposals, and act on the proposals, then review the ones that we have discussed before but not gotten consensus suitable for sending them to the editors, then others. The working group spent a fair bit of time discussing issues I had raised or was being recalcitrant on, only to end up resolving them without making the suggested change. I felt a little guilty at taking so much of the group’s time, but no one exhibited any sign of resentment. In one or two cases I was in the end persuaded to change my position; in others it simply became clear that I wasn’t going to manage to persuade the group to do as I suggested. I have always found it easier to accept with (what I hope is) good grace a decision going against me, if I can feel that I’ve been given a fair chance to speak my piece and persuade the others in the group. The chairs of SML are good at ensuring that everyone gets a chance to make their case, but also adept (among their other virtues) at moving the group along at a pretty good clip.

(In some working groups I’ve been in, by contrast, some participants made it a habit not to argue the merits of the issue but instead to spend the time available arguing over whether the group should be taking any time on the issue at all. This tactic reduces the time available for discussion, impairs the quality of the group’s deliberation, and thus reduces the chance that the group will reach consensus, which makes it extremely useful for those who wish to defend the status quo but do not have, or are not willing to expose, technical arguments for their position. The fact that this practice reduces me to a state of incoherent rage is merely a side benefit.)

“Service Modeling Language” is an unfortunate name, I think: apart from the fact that the phrase doesn’t suggest any very clear or specific meaning to anyone hearing it for the first time, the meanings it does suggest have pretty much nothing whatever to do with the language. SML defines a set of facilities for cross-document validation, in particular validation of, and by means of, inter-document references. Referential integrity can be checked using XSD (aka XML Schema), but only within the confines of a single document; SML makes it possible to perform referential-integrity checking over a collection of documents, with cross-document analogues of XSD’s key, keyref, and unique constraints and with further abilities: in particular, one can specify that inter-document references of a given kind must point to elements with a particular expanded name or with a given governing type definition, or that chains of references of a particular kind must be acyclic. In addition, the SML Interchange Format (SML-IF) specifies rules that make it easier to say exactly which schema is to be used for validating a document with XSD, and thus to get consistent validation results.

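To give a concrete flavor of it: in an instance document an SML reference is (roughly, and from memory of the current drafts) an element marked sml:ref="true" whose target is identified by a child such as sml:uri, and the constraints on references of a given kind are expressed by extra attributes on ordinary XSD element declarations. The element names, the target document, and the prefixes below are invented for illustration, and the namespace declarations are omitted; consult the drafts for the real details.

    <!-- In an instance document: a reference to an element elsewhere -->
    <assignedHost sml:ref="true">
      <sml:uri>../inventory/host42.xml</sml:uri>
    </assignedHost>

    <!-- In the schema: constraints on references of this kind -->
    <xs:element name="assignedHost"
                sml:targetElement="inv:host"
                sml:acyclic="true"/>
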
The XML Schema working group met Wednesday through Friday. Wednesday morning went to a joint session with the SML and XML Query working groups: Kumar Pandit gave a high-level overview of SML and there was discussion. Then in a joint session with XML Query, we discussed the issue of implementation-defined primitive types.

The rest of the time, the Schema WG worked on last-call issues against XML Schema. Since we had a rough priority sort of the issues, we were able just to sort the issues list and open them one after the other and ask “OK, what do we do about this one?”

Among the highlights visible from Bugzilla:

  • Assertions will be allowed on simple types, not just on complex types (a rough sketch of the syntax follows this list).
  • For negative wildcards, the keyword ##definedSibling will be available (also shown in the sketch after this list), so that schema authors can conveniently say “Allow anything except elements already included in this content model”; this is in addition to the already-present ##defined (“Allow anything except elements defined in this schema”). The Working Group was unable to achieve consensus on deep-sixing the latter; it has really surprising effects when new declarations are included in a schema and seems likely to produce mystifying problems in most usage scenarios, but some Working Group members are convinced it’s exactly what they or their users want.
  • The Working Group declined a proposal that some thought would have made it easier to support XHTML Modularization (in particular, the constraints on xhtml:form and xhtml:input); it would have made it possible for the validity of an element against a type to depend, in some cases, on where the element appears. Since some systems (e.g. XPath 2.0, XQuery 1.0, and XSLT 2.0) assume that type-validity is independent of context, the change would have had a high cost.
  • The sections of the Structures spec which contain validation rules and constraints on components (and the like) will be broken up into smaller chunks to try to make them easier to navigate.
  • The group hearkened to the words of Norm Walsh on the name of the spec (roughly paraphrasable as “XSDL? Not WXS? Not XSD? XSDL? What are you smoking?”); the name of the language will be XSD 1.1, not XSDL 1.1.

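For the first two items in the list, here is a rough sketch of what the syntax in the current drafts looks like (the type and element names are invented for illustration, and the details could still change before the spec is finished):

    <!-- An assertion on a simple type: in XSD 1.1, a constraining facet -->
    <xs:simpleType name="evenInteger">
      <xs:restriction base="xs:integer">
        <xs:assertion test="$value mod 2 = 0"/>
      </xs:restriction>
    </xs:simpleType>

    <!-- A negative wildcard: allow anything except elements already
         named in this content model -->
    <xs:complexType name="extensiblePara">
      <xs:sequence>
        <xs:element ref="title" minOccurs="0"/>
        <xs:any notQName="##definedSibling" processContents="lax"
                minOccurs="0" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
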
We made it through the entire stack of issues in the two and a half days; as Michael J. Preston (a prolific creator of machine-generated concordances known to a select few as “the wild man of Boulder”) once told me: it’s amazing how much you can get done if you just put your ass in a chair and do it.

Primitives and non-primitives in XSDL

John Cowan asks, in a comment on another post here, what possible rationale could have governed the decisions in XSDL 1.0 about which types to define as primitives and which to derive from other types.

I started to reply in a follow-up comment, but my reply grew too long for that genre, so I’m promoting it to a separate post.

The questions John asks are good ones. Unfortunately, I don’t have good answers. In all the puzzling cases he notes, my explanation of why XSDL is as it is begins with the words “for historical reasons …”.

Allowing ‘extension primitives’ in XML Schema?

In issue 3251 against XSDL 1.1 (aka ‘XML Schema 1.1’ to those who haven’t internalized the new acronym), Michael Kay suggests that XSDL, like other languages, allow implementations to define primitive types additional to those described in the XSDL spec.

I’ve been thinking about this a fair bit recently.

The result is too much information to put comfortably into a single post, so I’ll limit this post to a discussion of the rationale for the idea.

‘User-defined’ primitives?

Michael is not the first to suggest allowing the set of primitives to be extended without revving the XSDL spec. I’ve heard others, too, express a desire for this or for something similar (see below). In one memorable exchange at Extreme Markup Languages a couple years ago, Ann Wrightson noted that in some projects she has worked on, the need for additional primitives is keenly felt. In the heat of the moment, she spoke feelingly of the “arrogance” of languages that don’t allow users to define their own primitives. I remember it partly because that word stung; I doubt that my reply was as calmly reasoned and equable as I would have liked.

Strictly speaking, of course, the notion of ‘user-defined primitives’ is a contradiction in terms. If a user can define something meaningfully (as opposed to just declaring a sort of black box), then it seems inevitable that that definition will appeal to some other concepts, in terms of which the new thing is to be understood and processed. Those concepts, in turn, either have formal definitions in terms of yet other concepts, or they have no formal definition but must just be known to the processor or understood by human readers by other means. The term primitive denotes this last class of things, the ones that aren’t defined within the system. Whatever a user can define, in the nature of things it can’t very well be a primitive in this sense of the word.

Defining types without lying to the processor

But if one can keep one’s pedantry under control long enough, it’s worth while trying to understand what is desired, before dismissing the expression of the desire as self-contradictory. It’s suggestive that some people point to DTLL (the Datatype Library Language being designed by Jeni Tennison) as an instance of the kind of thing needed. I have seen descriptions of DTLL that claimed that it had no primitives, or that it allowed user-defined primitives (thus falling into the logical trap just mentioned), but I believe that in public discussions Jeni has been more careful.

In practice, DTLL does have primitives, in the sense of basic concepts not defined in terms of other concepts still more basic. In the version Jeni presented at Extreme Markup Languages 2006, the primitive notions in terms of which all types are defined are (a) the idea of tuples with named parts and (b) the four datatypes of XPath 1.0. (Note: I specify the version not because I know that DTLL has since changed, but only because I don’t know that it has not changed; DTLL, alas, is still on the ‘urgent / important / read this soon’ pile, which means it’s continually being shoved aside by things labeled ‘super-urgent / life-threatening / read this yesterday’. It also suffers from my expectation that I’ll have to concentrate and think hard; surfing the Web and reading people’s blogs seems less strenuous.)

But DTLL does not have primitives, in the sense of a set of semantically described types from which all other types are to be constructed. All types (if I remember correctly) are in the same pool, and there is no formal distinction between primitive types and derived types.

Of course, the distinction between primitives and derived types has no particular formal significance in XSDL, either. What you can do with one, you can do with the other. The special types, on the other hand, are, well, special: you can derive new types from the primitives, but you can’t derive new types from the specials, like xs:anySimpleType or (in XSDL 1.1) xs:anyAtomicType. Such derivations require magic (which is one way of saying there isn’t any formal method of defining the nature of a type whose base type is xs:anyAtomicType: it requires [normative] prose).

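To make the asymmetry concrete (the type names below are invented): deriving from a primitive is routine and entirely defined by the spec, while writing a restriction whose base is one of the special types is not something a user schema can meaningfully do.

    <!-- Routine: a restriction of the primitive xs:decimal -->
    <xs:simpleType name="probability">
      <xs:restriction base="xs:decimal">
        <xs:minInclusive value="0"/>
        <xs:maxInclusive value="1"/>
      </xs:restriction>
    </xs:simpleType>

    <!-- Magic: not something a schema author can legitimately write.
         Only the spec's built-in primitives have xs:anyAtomicType as
         their base, and what that means is spelled out in normative
         prose, not in schema syntax. -->
    <xs:simpleType name="myWouldBePrimitive">
      <xs:restriction base="xs:anyAtomicType"/>
    </xs:simpleType>
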
But XSDL 1.0 and the current draft of XSDL 1.1 do require that all user-defined types be defined in terms of other known types. And this has a couple of irritating corollaries.

If you want to define an XSDL datatype for dates in the form defined by RFC 2822 (e.g. “18 Jan 2008”), or for lengths in any of the forms defined by (X)HTML or CSS (which accept “50%”, and “50” [pixels], and “50*”, and “50em”, and so on), you can do so if you have enough patience to work out the required regular expressions. But you can’t derive the new rfc2822:date type from xs:date (as you might wish to do, to signal that they share the same value space). You must lie to the processor, and say that really, the set of RFC 2822 dates is a subtype of xs:string.

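A sketch of the lie (the pattern here is simplified, nothing like the full RFC 2822 grammar):

    <xs:simpleType name="rfc2822-date">
      <!-- The lie: the base type is xs:string, because xs:date is not allowed -->
      <xs:restriction base="xs:string">
        <xs:pattern
          value="\d{1,2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) \d{4}"/>
      </xs:restriction>
    </xs:simpleType>
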
Exercise for those gifted at casuistry: write a short essay explaining that what is really being defined here is the set of RFC 2822 date expressions, which really are strings, so this is actually all just fine and nothing to complain about.

Exercise for those who care about having notations humans can actually use: read the essay produced by your casuist colleague and refrain from punching the author in the eye.

Lying to the processor is always dangerous, and usually a bad idea; designing a system that requires lying to the processor in order to do something useful is similarly a bad idea (and can lead to heated remarks about the arrogance of the system). When the schema author is forced to pretend that the value space of email dates is the value space of strings (i.e. sequences of UCS characters), it not only does violence to the soul of the schema author and makes the processor miss the opportunity to use a specially optimized storage format for the values, but it also makes it impossible to derive further types by imposing lower and upper bounds on the value space (e.g. a type for plausible email dates: email dated before 1950? probably not). And you can’t expect the XSDL validator to understand the relation among pixels and points and picas and ems and inches, so just forget about the idea of restricting the length type with an upper bound of “2 pc” (two picas) and having the software know that “30pt” (thirty points) exceeds that upper bound.

What about NOTATION?

As suggested in the examples just above, there are a lot of specialized notations that could usefully be defined as extension primitives in an XSDL context. Different date formats are a rich vein in themselves, but any specialized form of expression used to capture specialized information concisely is a candidate. Lengths, in a document formatting context. Rational numbers. Colors (again in a display context). Read the sections on data types in the HTML and CSS specs. Read the section on abstract data types in the programming textbook of your choice. Lots of possibilities.

One might argue that the correct way to handle these is to declare them for what they are: specialized notations, which may be processed and validated by a specialized processor (called, perhaps, as a plugin by the validator) but which a general-purpose markup validator needn’t be expected to know about.

In principle, this could work, I guess. And it may be (related to) what the designers of ISO 8879 (the SGML spec) had in mind when they defined NOTATION. But I don’t think NOTATION will fly as a solution for this kind of problem today, for several reasons:

  • There is no history or established practice of SGML or XML validators calling subvalidators to validate the data declared as being in a particular notation. So the ostensible reason for using NOTATION (“It’s already there! you don’t need to do anything!”) doesn’t really hold up: using declared notations to guide validation would be plowing new ground.
  • Almost no one ever really wants to use notations. In part this is because few people ever feel they really understand what notations are good for; those who do feel they understand rarely feel so for long, and they don’t always agree with each other. So software developers never really know what to do with notations, and end up doing nothing much with them.
  • If RFC 2822 dates are a ‘notation’ rather than a ‘datatype’, then surely the same applies to ISO 8601 dates. Why treat some date expressions as lexical representations of dates, and others as magic cookies for a black-box process? If notations were intended to keep software that works with SGML and XML markup from having to understand things like integers and dates, then the final returns are now in, and we can certify that that attempt didn’t succeed. (Some document-oriented friends of mine like to tell me that all this datatype stuff was foisted on XSDL by data-heads and does nothing at all for documents. I keep having to remind them that they spent pretty much the entire period from 1986 to 1998 warning their students that neither content models nor attribute declarations allowed you to specify that the content of a quantity element, or the value of a quantity attribute, ought to be, for example, a quantity — i.e. an integer, preferably restricted to a plausible range. XSDL may have given you things you never asked for, and it may give you things you asked for in a form you didn’t expect and don’t much like. But don’t claim you never asked for datatypes. I was there, and while I don’t have videotapes, I do remember.)

Who defines the new primitives?

Some people are nervous at the idea of trying to allow users to define new kinds of dates, or new kinds of values, in part because the attempt to allow the definition of arbitrary value spaces, in a form that can actually be used to check the validity of data, seems certain to end up putting a Turing-complete language into the spec, or pointing to some existing programming language and requiring that people use that language to define their new types (and requiring all schema processors to include an interpreter for that language). And the spec ends up defining not just a schema language, but a set of APIs.

However you cut it, it seems a quick route to a language war; include me out.

Michael Kay has pointed out that there is a much simpler way. Don’t try to provide for user-defined types derived by restriction from xs:anyAtomicType in some interoperable way. That would require a lot of machinery, and would be difficult to reach consensus on.

Michael proposes: just specify that implementations may provide additional implementation-defined primitive types. In the nature of things, an implementation can do this however it wants. Some implementors will code up email dates and CSS lengths the same way they code the other primitives. Fine. Some implementors will expose the API that their existing primitive types use, so that they can choose, at the appropriate moment, to link in a set of extension types, or not. Some will allow users to provide implementations of extension types, using that API, and link them at run time. Some may provide extension syntax to allow users to describe new types in some usable way (DTLL, anyone?) without having to write code in Java or C or [name of language here].

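What that might look like to a schema author, under one hypothetical implementation (the ext prefix, its namespace, the cssLength type, and its facet behavior are all invented here; nothing in the spec, and so far as I know nothing in any product, defines them):

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               xmlns:ext="http://example.org/extension-primitives">

      <!-- ext:cssLength: a hypothetical implementation-defined primitive.
           If the implementation also teaches its facet machinery about
           CSS units, this restriction could make "30pt" invalid. -->
      <xs:simpleType name="narrowLength">
        <xs:restriction base="ext:cssLength">
          <xs:maxInclusive value="2pc"/>
        </xs:restriction>
      </xs:simpleType>

    </xs:schema>
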
That way, all the burden of designing the way to allow user-specified types to interface with the rest of the implementation falls on the implementors, if they wish to carry it, and not on the XSDL spec. (Hurrah, cries the editor.)

If enough implementations do that, and enough experience is gained, the relevant community might eventually come to some consensus on a good way to do it interoperably. And at that point, it could go into the spec for some schema language.

This has worked tolerably well for the analogous situation with implementation-defined / user-defined XPath functions in XSLT 1.0. XSLT 1.0 doesn’t provide syntax for declaring user-defined functions; it merely allows implementations to support additional functions in XPath expressions. Some implementations did so, supporting either their own functions or functions defined by projects like EXSLT. And some implementations did in fact provide syntax for allowing users to define functions without writing Java or C code and running the linker. (And lo! the experience thus gained has made it possible to get consensus in XSLT 2.0 for standardized syntax for user-defined functions.)

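For a reminder of what this looks like in practice: an XSLT 1.0 stylesheet can call an extension function such as EXSLT’s date:date-time() and use the standard function-available() test to degrade gracefully on processors that don’t provide it (the output element name here is invented):

    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:date="http://exslt.org/dates-and-times">
      <xsl:template match="/">
        <generated-at>
          <xsl:choose>
            <!-- Implementation-defined: present in some processors, absent in others -->
            <xsl:when test="function-available('date:date-time')">
              <xsl:value-of select="date:date-time()"/>
            </xsl:when>
            <xsl:otherwise>unknown</xsl:otherwise>
          </xsl:choose>
        </generated-at>
      </xsl:template>
    </xsl:stylesheet>
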
But with that thought, I have reached a topic perhaps better separated out into another post. Other languages, specs, and technologies have faced similar decisions (should we allow implementations to add new blorts to our language, or should we restrict them to using the standard blorts?); what has the experience been like? What can we learn from the experience?

Safari’s love/hate relation with XML

[17 January 2008]

So what is it about Safari and XSLT?

I write a lot of documents. I write them in XML. I really like it when publishing them on the Web means: just checking them into the W3C’s CVS repository (from which they propagate to the W3C’s Web servers automatically, courtesy of some extremely nifty software put together by our Systems Team with chewing gum, scotch tape, and baling wire, er, great ingenuity). No muss, no fuss. No running make or ant. Just. Check. It. In. And presto! it’s on the Web.

And mostly that works.

Actually, for the browsers I usually use, it always works. But I have friends who tell me I really should be using Safari, as it’s faster and simpler and better in ways that momentarily defeated their ability to explain — if I tried it for a while, I would see, I think was the idea.

But Safari puts me in a bind.

I can’t use Safari as my daily browser if I can’t reliably display XML in it.

And I can’t write it off entirely as a waste of my time, since much of the time it does display XML just fine.

That is: sometimes it works. And sometimes it doesn’t. And so far I have not been able to get much light shed on when.

To take a simple example: consider this working paper I’m writing for the W3C SML Working Group. There are two copies of this document: one on the W3C server at the URI just linked to, and one on my local file system, in the directory subtree that holds stuff I’ve checked out from the W3C CVS server. All the references (to DTD, to stylesheet, to images, …) are relative, so that they work on the local copy even when I’m off the network, and so I don’t have to change them when I check revisions in.

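The setup is the ordinary one for client-side XSLT, with every reference relative (the file names here are invented, not those of the actual working paper):

    <?xml version="1.0" encoding="utf-8"?>
    <?xml-stylesheet type="text/xsl" href="wg-notes.xsl"?>
    <!DOCTYPE report SYSTEM "wg-notes.dtd">
    <report>
      ...
    </report>
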
The local document displays fine in Firefox. So does the copy on the server. Opera displays both the local and the Web copy just fine. Internet Explorer displays the Web copy just fine. I don’t have a copy of IE that can check the local copy, but I used IE for display of local XML for a long time; I’m confident it would work fine.

Safari displays the local copy just fine.

And on the Web copy? Safari gives me a blank screen.

Safari has, it seems, a love/hate relation with XML.

And that means I have a love/hate relation with Safari.

Honeypots: better than CAPTCHAs?

[17 January 2008]

As noted earlier, the short period of time between starting a blog and encountering comment spam has now passed, for this blog. And while the volume of comment spam is currently very low by most standards, for once I’d like to get out in front of a problem.

So when not otherwise committed, I spent most of yesterday reading about various comment-spam countermeasures, starting with those recommended by those who commented on my earlier post. (More comments on that post, faster, than on any other post yet: clearly the topic hit a nerve.)

If you’re keeping score, I decided ultimately to install Spam Karma 2, in part because my colleague Dom Hazaël-Massieux uses it, so I hope I can lean on him for support.

But the most interesting idea I encountered was certainly the one mentioned here by Lars Marius Garshol (to whom thanks). The full exposition of the idea by Ned Batchelder is perfectly clear (and to be recommended), but the gist can be summarized thus:

  • Some comment spam comes from humans hired to create it.
  • Some spam comes from “playback bots” which learn the structure of a comment form once (with human assistance) and then post comments repeatedly, substituting link spam into selected fields.
  • Some comment spam comes from “form-filling bots”, which read the form and put more or less appropriate data into more or less the right fields, apparently guiding their behavior by field type and/or name.

For the first (human) kind of spam, there isn’t much you can do (says Batchelder). You can’t prevent it reliably. You can use rel="nofollow" in an attempt to discourage them, but Michael Hampton has argued in a spirited essay on rel="nofollow" that in fact nofollow doesn’t discourage spammers. By now that claim is more an empirical observation than a prediction. By making it harder to manipulate search engine rankings, rel="nofollow" makes spammers think it even more urgent (says Hampton) to get functional links into other places where people may click on them.

But I can nevertheless understand the inclination to use rel="nofollow": it’s not unreasonable to feel that if people are going to deface your site, you’d at least like to ensure their search engine ranking doesn’t benefit from the vandalism.

And of course, you can also always delete their comments manually when you see them.

For the playback bots, Batchelder uses a clever combination of hashing and a local secret to fight back: if you change the names of fields in the form, by hashing the original names together with a time stamp and possibly the requestor’s IP address, then (a) you can detect comments submitted a suspiciously long time after the comment form was downloaded, and (b) you can prevent the site-specific script from being deployed to an army of robots at different IP addresses.

My colleague Liam Quin has pointed out that this risks some inconvenience to real readers. If someone starts to write a comment on a post, then breaks off to head for the airport, and finally finishes editing their comment and submitting it after reaching their hotel at the other end of a journey, then not only will several hours have passed, but their IP number will have changed. Liam and I both travel a lot, so it may be easy for us to overestimate the frequency with which that happens in the population at large, but it’s an issue. And users behind some proxy servers (including those at hotels) will frequently appear to shift their IP addresses in a quirky and capricious manner.

For form-filling bots, Batchelder uses invisible fields as ‘honeypots’. These aren’t hidden fields (which won’t deceive bots, because they know about them), but fields created in such a way that they are not visible to sighted human users. Since humans don’t see them, humans won’t fill them out, while a form-filling bot will see them and (in accordance with its nature) will fill them out. This gives the program which handles comment submissions a convenient test: if there’s new data in the honeypot field, the comment is pretty certain to be spam.

Batchelder proposes a wide variety of methods for making fields invisible: CSS style “display: none” or “font-size: 0”, positioning the field absolutely and then carefully positioning an opaque image or something else opaque over it. And we haven’t even gotten into Javascript yet.

For the sake of users with Javascript turned off and/or CSS-impaired browsers, the field will be labeled “Please leave this field blank; it’s here to catch spambots” or something similar.

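A minimal sketch of such a field (the field name is invented; in a real form you would probably hide it with a rule in an external stylesheet rather than an inline style, to make the bots work a little harder):

    <p style="display: none">
      <label for="website2">Please leave this field blank;
        it’s here to catch spambots.</label>
      <input type="text" name="website2" id="website2" value="" />
    </p>

The comment handler then rejects, or quarantines, any submission in which website2 arrives non-empty.
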
In some ways, the invisible-honeypot idea seems to resemble the idea of CAPTCHAs. In both cases, the human + computer system requesting something from a server is asked to perform some unpredictable task which a bot acting alone will find difficult. In the case of CAPTCHAs, the task is typically character recognition from an image, or answering a simple question in natural language. In the case of the honeypot, the task is calculating whether a reasonably correct browser’s implementation of Javascript and CSS will or will not render a given field in such a way that a human will perceive it. This problem may be soluble, in the general case or in many common cases, by a program acting alone, but by far the simplest way to perform it is to display the page in the usual way and let a human look to see whether the field is visible or not. That is, unlike a conventional CAPTCHA, a honeypot input field demands a task which the browser and human are going to be performing anyway.

The first question that came to my mind was “But wait. What about screen readers? Do typical screen readers do Javascript? CSS?”

My colleagues in the Web Accessibility Initiative tell me the answer is pretty much a firm “Sometimes.” Most screen readers (they tell me) do Javascript; behavior for constructs like CSS display: none apparently varies. (Everyone presumably agrees that a screen reader shouldn’t read material so marked, but some readers do; either their developers disagree or they haven’t yet gotten around to making the right thing happen.) If you use this technique, you do want to make sure the “Please leave empty” label is associated with the field in a way that will be clear to screen readers and the like. (Of course, this holds for all field labels, not just labels for invisible fields. See Techniques for WCAG 2.0 and Understanding WCAG 2.0 for more on this topic.)

The upshot appears to be:

  • For sighted or unsighted readers with Javascript and/or CSS processing supported by their software and turned on, a honeypot of this kind is unseen / unheard / unperceived (unless something goes wrong), and incurs no measurable cost to the human. The cost of the extra CSS or Javascript processing by the machine is probably measurable but negligible.
  • For human readers whose browsers and/or readers don’t do Javascript and/or CSS, the cost incurred by a honeypot of this kind is (a) some clutter on the page and (b) perhaps a moment of distraction while the reader wonders “But why put a field there if you want me to leave it blank?” or “But how can putting a data entry field here help to catch spambots?” For most users, I guess this cost is comparable to that of a CAPTCHA, possibly lower. For users excluded by a CAPTCHA (unsighted users asked to read an image, linguistically hesitant users asked to perform in a language not necessarily their own), the cost of a honeypot seems likely to be either a little lower than that of a CAPTCHA, or a lot lower.

I’m not an accessibility expert, and I haven’t thought about this for very long. But it sure looks like a great idea to me, superior to CAPTCHAs for many users, and no worse than CAPTCHAs (as far as I can now tell) for anyone.

If this blog used homebrew software, I’d surely apply these techniques for resisting comment spam. And I think I can figure out how to modify WordPress to use some of them, if I ever get the time. But I didn’t see any off-the-shelf plugins for WordPress that use them. (It’s possible that Bad Behavior uses these or similar techniques, but I haven’t been able to get a clear idea of what it does, and it has what looks like a misguided affinity for the idea of blacklists, on which I have given up. As Mark Pilgrim points out, when we fight link spam, we might as well try to learn from the experience of fighting spam in other media.)

Is there a catch? Am I missing something?

What’s not to like?