You loaded it, you might as well read it

August 21, 2002

The spam abyss stares back

Everyone hates spam (unsolicited bulk e-mail, not the canned meat product), but we must not become so zealous in our fight that we attack the innocent. This time, Ed Felten has been unfairly blacklisted. Unlike John Gilmore, who had been operating an open relay, Mr Felten got blacklisted simply because someone else recommended his site on a mailing list and a third person erroneously reported it as spam.

The sad thing is not that the mail was incorrectly reported as spam (although that is sad), but that SpamCop refused to correct their incorrect listing.

Everybody involved (me, my ISP, the person who filed the complaint, and the author of the message) agreed that the report was an error, and we all told this to SpamCop. Naturally, SpamCop failed to respond and continued to block the site.

This, unfortunately, undermines the legitimacy of the blacklists themselves. As Mr Felten points out, there are many ways to take advantage of SpamCop’s lack of discrimination and get someone listed incorrectly. As I see it, though, it’s even worse. If I were an unethical spammer (that is, a spammer), I’d be busy sending in tons of false reports to SpamCop, not to get the particular sites blocked, but to destroy the credibility of SpamCop. If enough high-powered, non-spam sites get blocked, people will stop using it. (via Cory Doctorow)

Loose parsing

Mark Pilgrim, noting the wide variation and often-invalid markup seen in RSS feeds, puts together an “ultra-liberal RSS parser” designed to be able to interpret existing feeds. Joe Gregorio points out that if everyone can parse badly-formed feeds, then there is no incentive to fix them. Mr Pilgrim agrees, in principle, but points out that:

  1. This parser would be used in end-user products, and end-users will want it to work even if the feed is invalid
  2. No web browser would refuse to display badly-formed web pages, because virtually every web page has invalid HTML code

(Lost? RSS is a standard for representing the content of a site, particularly newly-added material. Unlike web pages, which are intended for human consumption, RSS feeds are intended for software, such as NetNewsWire.)

While I agree with Mr Pilgrim up to a point (users don’t want and shouldn’t need to know about valid/invalid markup), I can’t help but note that it’s the ability of web browsers to handle invalid HTML that accounts for the poor coding seen on the web. Had browsers been less forgiving to start with, developers wouldn’t have fallen into bad habits.

In fact, it was the desire to avoid situations where most markup is invalid that led to the rule in the XML specification which effectively forbids software from trying to interpret badly-formed XML. It was hoped that this all-or-nothing tactic would force developers to fix errors.
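
To make the all-or-nothing rule concrete, here is a minimal sketch using Python’s standard XML parser. The two feed fragments and the strict_title helper are invented for illustration; the only difference between the fragments is one unescaped ampersand.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed fragments, invented for this example.
WELL_FORMED = "<channel><title>News &amp; notes</title></channel>"
BADLY_FORMED = "<channel><title>News & notes</title></channel>"  # bare ampersand


def strict_title(document):
    """Return the title text, or None if the document is not well-formed."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError:
        # The parser reports a fatal error and hands back no partial
        # document, so there is nothing left to interpret.
        return None
    return root.findtext("title")


print(strict_title(WELL_FORMED))
print(strict_title(BADLY_FORMED))
```

The first call yields the title; the second yields nothing at all, even though a human can see perfectly well what the author meant. That gap is exactly what a liberal parser tries to paper over.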

You might wonder why strict coding is important, if people can build parsers that can interpret invalid markup, but the reasons are pretty simple. First, it’s a lot harder to build a parser that can interpret invalid markup, and, second, there’s no guarantee that two different parsers will interpret the same invalid markup in the same way.

Mr Pilgrim also points out the lack of a good RSS validator. While I don’t have the resources or motivation to write one myself right now, I will note that any such validator should really be built in layers. First, you have to make sure the document is well-formed XML. If it isn’t, officially it cannot be further interpreted. Next, you need to make sure that the XML conforms to the RDF syntax. Lastly, you check whether the RDF graph contains the necessary classes and properties for a reasonable RSS channel. The last step has only a few definite rules, and then a bunch of heuristics, such as whether certain common properties are in the right namespace or whether unrecognized names are used in known namespaces.
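
The layering might look something like the rough sketch below. It is not a real validator: the validate function, its messages, and the REQUIRED list are my own assumptions, the RDF layer is reduced to checking the root element (genuine RDF syntax checking is far more involved), and the last layer covers only a simplified subset of what an RSS 1.0 channel needs.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS_NS = "http://purl.org/rss/1.0/"
# Simplified subset of required channel properties (an assumption).
REQUIRED = ("title", "link", "description")


def validate(document):
    """Return a list of problems; an empty list means none were found."""
    # Layer 1: well-formedness. If this fails, nothing else is checked.
    try:
        root = ET.fromstring(document)
    except ET.ParseError as err:
        return ["not well-formed XML: %s" % err]

    # Layer 2: a crude stand-in for checking RDF syntax.
    if root.tag != "{%s}RDF" % RDF_NS:
        return ["root element is not rdf:RDF"]

    # Layer 3: definite rules about the channel.
    channel = root.find("{%s}channel" % RSS_NS)
    if channel is None:
        return ["no RSS channel found"]
    problems = []
    for name in REQUIRED:
        if channel.find("{%s}%s" % (RSS_NS, name)) is None:
            problems.append("channel is missing %s" % name)

    # Layer 3, continued: one heuristic. A familiar RSS name that
    # appears with no namespace at all is probably a mistake.
    for child in channel:
        if child.tag in REQUIRED:
            problems.append("%s is not in the RSS namespace" % child.tag)
    return problems


SAMPLE = (
    '<rdf:RDF xmlns:rdf="%s" xmlns="%s">'
    '<channel rdf:about="http://example.org/">'
    "<title>Example</title><link>http://example.org/</link>"
    "</channel></rdf:RDF>" % (RDF_NS, RSS_NS)
)
print(validate(SAMPLE))
```

Each layer returns early when it fails, so a later layer never has to guess at the meaning of input that an earlier layer rejected.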