[17 January 2008]
As noted earlier, the short period of time between starting a blog and encountering comment spam has now passed, for this blog. And while the volume of comment spam is currently very low by most standards, for once I’d like to get out in front of a problem.
So when not otherwise committed, I spent most of yesterday reading about various comment-spam countermeasures, starting with those recommended by those who commented on my earlier post. (More comments on that post, faster, than on any other post yet: clearly the topic hit a nerve.)
If you’re keeping score, I decided ultimately to install Spam Karma 2, in part because my colleague Dom Hazaël-Massieux uses it, so I hope I can lean on him for support.
But the most interesting idea I encountered was certainly the one mentioned here by Lars Marius Garshol (to whom thanks). The full exposition of the idea by Ned Batchelder is perfectly clear (and to be recommended), but the gist can be summarized thus:
- Some comment spam comes from humans hired to create it.
- Some spam comes from “playback bots” which learn the structure of a comment form once (with human assistance) and then post comments repeatedly, substituting link spam into selected fields.
- Some comment spam comes from “form-filling bots”, which read the form and put more or less appropriate data into more or less the right fields, apparently guiding their behavior by field type and/or name.
For the first (human) kind of spam, there isn’t much you can do (says Batchelder). You can’t prevent it reliably. You can use rel="nofollow" in an attempt to discourage them, but Michael Hampton has argued in a spirited essay on rel="nofollow" that in fact nofollow doesn’t discourage spammers. By now that claim is more an empirical observation than a prediction. By making it harder to manipulate search engine rankings, rel="nofollow" makes spammers think it even more urgent (says Hampton) to get functional links into other places where people may click on them.
But I can nevertheless understand the inclination to use rel="nofollow": it’s not unreasonable to feel that if people are going to deface your site, you’d at least like to ensure their search engine ranking doesn’t benefit from the vandalism.
And of course, you can also always delete their comments manually when you see them.
For the playback bots, Batchelder uses a clever combination of hashing and a local secret to fight back: if you change the names of fields in the form, by hashing the original names together with a time stamp and possibly the requestor’s IP address, then (a) you can detect comments submitted a suspiciously long time after the comment form was downloaded, and (b) you can prevent the site-specific script from being deployed to an army of robots at different IP addresses.
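To make the mechanism concrete, here is a minimal sketch of how such a scheme might work. This is my own illustration, not Batchelder’s code: the secret, the field names, and the one-hour limit are all invented for the example.

```python
import hashlib
import hmac
import time

# Assumptions for illustration only: the secret value, the field names,
# and the one-hour freshness limit are made up, not taken from Batchelder.
SECRET = b"local-secret-known-only-to-this-server"
MAX_AGE = 3600  # seconds: reject forms submitted long after they were served

def disguise(field, timestamp, ip):
    """Derive a per-request field name from the real name, a timestamp,
    and the requestor's IP address, keyed with the local secret."""
    msg = f"{field}:{timestamp}:{ip}".encode()
    return hmac.new(SECRET, msg, hashlib.sha256).hexdigest()[:16]

def render_form(ip):
    """Return the timestamp (sent along in the clear) and the disguised
    field names to use when emitting the comment form."""
    ts = int(time.time())
    names = {f: disguise(f, ts, ip) for f in ("author", "email", "comment")}
    return ts, names

def accept(submitted, ts, ip):
    """Accept a submission only if it is fresh and its field names match
    the timestamp and IP address it claims to come from."""
    if int(time.time()) - ts > MAX_AGE:
        return False  # (a) suspiciously stale form
    # (b) a script captured at another IP computes different field names,
    # so the expected name will simply be absent from the submission
    return disguise("comment", ts, ip) in submitted
```

Because the secret never leaves the server, a playback bot can replay a captured form only from the same address and only for a limited time.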
My colleague Liam Quin has pointed out that this risks some inconvenience to real readers. If someone starts to write a comment on a post, then breaks off to head for the airport, and finally finishes editing their comment and submitting it after reaching their hotel at the other end of a journey, then not only will several hours have passed, but their IP number will have changed. Liam and I both travel a lot, so it may be easy for us to overestimate the frequency with which that happens in the population at large, but it’s an issue. And users behind some proxy servers (including those at hotels) will frequently appear to shift their IP addresses in a quirky and capricious manner.
For form-filling bots, Batchelder uses invisible fields as ‘honeypots’. These aren’t hidden fields (which won’t deceive bots, because they know about them), but fields created in such a way that they are not visible to sighted human users. Since humans don’t see them, humans won’t fill them out, while a form-filling bot will see them and (in accordance with its nature) will fill them out. This gives the program which handles comment submissions a convenient test: if there’s new data in the honeypot field, the comment is pretty certain to be spam.
Batchelder proposes a wide variety of methods for making fields invisible, for example the CSS style “display: none”. But the handling of “display: none” by screen readers apparently varies. (Everyone presumably agrees that a screen reader shouldn’t read material so marked, but some readers do; either their developers disagree or they haven’t yet gotten around to making the right thing happen.) You do want to be sure, if you use this technique, that the “Please leave empty” label is associated with the field in a way that will be clear to screen readers and the like. (Of course, this holds for all field labels, not just labels for invisible fields. See Techniques for WCAG 2.0 and Understanding WCAG 2.0 for more on this topic.)
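The honeypot test itself is almost trivially simple. The sketch below is my own illustration, not Batchelder’s or any plugin’s code: the field name “website2”, the markup, and the labeling are invented for the example.

```python
# Assumption for illustration: the honeypot field name and markup are
# invented here, not drawn from Batchelder or from any real plugin.
HONEYPOT = "website2"

# The field is styled invisible, and the label tells screen-reader users
# (who may encounter it anyway) to leave it alone.
FORM_SNIPPET = """
<div style="display: none">
  <label for="website2">Please leave this field empty</label>
  <input type="text" name="website2" id="website2" value="">
</div>
"""

def looks_like_spam(submitted):
    """Return True if the honeypot field came back non-empty.

    A human never sees the field, so a real comment leaves it blank;
    a form-filling bot will usually have put something into it.
    """
    return bool(submitted.get(HONEYPOT, "").strip())
```

The comment handler then discards (or quarantines) any submission for which the test returns True.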
The upshot appears to be this: I’m not an accessibility expert, and I haven’t thought about this for very long, but it sure looks like a great idea to me, superior to CAPTCHAs for many users, and no worse than CAPTCHAs (as far as I can now tell) for anyone.
If this blog used homebrew software, I’d surely apply these techniques for resisting comment spam. And I think I can figure out how to modify WordPress to use some of them, if I ever get the time. But I didn’t see any off-the-shelf plugins for WordPress that use them. (It’s possible that Bad Behavior uses these or similar techniques, but I haven’t been able to get a clear idea of what it does, and it has what looks like a misguided affinity for the idea of blacklists, on which I have given up. As Mark Pilgrim points out, when we fight link spam, we might as well try to learn from the experience of fighting spam in other media.)
Is there a catch? Am I missing something?
What’s not to like?