Such a little thing, to provoke so many thoughts

[3 February 2009]

I saw an interesting bit of stray XML this morning, which raises a number of questions worth mulling over.

The New York Times sends me email every morning with headlines from the day’s papers; this is one of the perqs, I think, of being a subscriber. Given the choice, I asked for the ASCII-only version (like many people I know, I don’t much like HTML email; I’m not sure this is a well-founded prejudice, but there it is).

In today’s headlines, I find the following fragment:

- QUOTATION OF THE DAY -

"Oh, you're one of <em style="i">them."
- IRIS CHAU, recounting an acquaintance's reaction when she said she
worked at a banking company.

My eye was caught (perhaps this is just a déformation professionnelle) by the unmatched start-tag for an ’em’ element, appearing in what was intended to be a markup-free context. This seems to tell us several things about the internal system used by the Times, and to invite several questions:

They seem to be using markup (perhaps HTML, perhaps some other XML vocabulary), not just for delivery of the headlines but in the system for preparing features like the quotation of the day.
Their XML vocabulary seems to require an inline specification of rendering information. This seems odd; wouldn’t it be natural for a stylesheet to say that ’em’ elements should be italic? How many other renderings of emphasis are there likely to be in the main story block, in a broadsheet? (Oh, well; not my design, and I don’t know what constraints the vocabulary designer was working under.)
Assuming that this quotation was cut and pasted out of the story in which it appears (which does not appear to be ill-formed), could we infer that the journalist who did the cut and paste was using an editing tool that had no markup awareness? Would a tool with better awareness of markup have helped here? (E.g. by picking up the end-tag of the ’em’ element, as soon as it picks up the start-tag, as SoftQuad’s Author/Editor used to do [and presumably still does in its current incarnation], or by dropping the start-tag.)
Could a better validator somewhere in the system that generates the ASCII headlines have caught this error? Could it have been fixed automatically, or with minimal-cost human intervention?
What properties would a validator for that process need to have, to induce people to view it as an asset instead of a nuisance?
What properties would a schema language need to have, in order to make it easier to write a validator with those properties? And how might the schema language go about encouraging implementors to write such validators?
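
As a rough illustration of the simplest check the questions above point toward, here is a minimal Python sketch that flags tag-like strings in text that is supposed to be markup-free, and optionally drops them. Nothing here reflects how the Times system actually works; the names TAG_LIKE, find_stray_tags, and strip_stray_tags are invented for the example, and the pattern is a heuristic, not an XML parser.

```python
import re

# Heuristic (an assumption for this sketch): anything shaped like an
# XML/HTML start-tag or end-tag counts as stray markup in plain text.
TAG_LIKE = re.compile(r"</?[A-Za-z][A-Za-z0-9:._-]*(?:\s[^<>]*)?/?>")


def find_stray_tags(text):
    """Return (line_number, tag) pairs for tag-like strings in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in TAG_LIKE.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def strip_stray_tags(text):
    """Drop tag-like strings, leaving the surrounding text untouched."""
    return TAG_LIKE.sub("", text)


# The line from the email quoted above.
sample = '"Oh, you\'re one of <em style="i">them."'

for lineno, tag in find_stray_tags(sample):
    print(f"line {lineno}: possible stray markup {tag}")

print(strip_stray_tags(sample))
```

A check like this could report the problem for a human to confirm, or repair it silently by dropping the tag. Which of the two makes a validator feel like an asset and not a nuisance is exactly the open question.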

Hmmmmm.

Messages in a Bottle

CMSMcQ's klog
