The spam abyss stares back
Everyone hates spam (unsolicited bulk e-mail, not the canned meat
product), but we must not become so zealous in our fight that we
attack the innocent. This time, Ed Felten has been
unfairly blacklisted. Unlike John Gilmore, who had been operating
an open relay, Mr Felten got blacklisted simply because someone
else recommended his site on a mailing list and a third person erroneously reported it
as spam.
The sad thing is not that the mail was incorrectly reported as
spam (although that is sad), but that SpamCop refused to correct
the listing.
Everybody involved (me, my ISP,
the person who filed the
complaint, and the author of the message) agreed that the report was an
error, and we all told this to SpamCop. Naturally, SpamCop failed to
respond and continued to block the site.
This, unfortunately, undermines the legitimacy of the blacklists
themselves. As Mr Felten points out, there are many
ways to take
advantage of SpamCop’s lack of discrimination and get someone
listed incorrectly. As I see it, though, it’s even worse.
If I were an unethical spammer (that is, a spammer), I’d be busy
sending in tons of false reports to SpamCop—not to get the particular
sites blocked, but to destroy the credibility of SpamCop. If enough
high-powered, non-spam sites get blocked, people will stop using it.
(via Cory Doctorow)
#
Loose parsing
Mark Pilgrim, noting the wide variation and often-invalid markup
seen in RSS feeds,
puts together an “ultra-liberal
RSS parser” designed
to be able to interpret existing feeds. Joe Gregorio points out that
if everyone can parse badly-formed feeds, then there is
no
incentive to fix them. Mr Pilgrim agrees, in principle,
but points out that:
- This parser would be used in end-user products, and end-users
will want it to work even if the feed is invalid
- No web browser would refuse to display badly-formed web pages,
because virtually every web page has invalid
HTML code
(Lost? RSS is a standard
for representing the content of a site, particularly newly-added
material. Unlike web pages, which are intended for human consumption,
RSS feeds are intended for
software, such as NetNewsWire.)
While I agree with Mr Pilgrim up to a point (users don’t want
and shouldn’t need to know about valid/invalid markup), I can’t
help but note that it’s the ability of web browsers to handle
invalid HTML
that accounts for the poor coding seen on the web. Had browsers been
less forgiving to start with, developers wouldn’t have fallen into
bad habits.
In fact, it was the desire to avoid situations where most markup
is invalid that led to the rule in the
XML specification
which effectively forbids software from trying to interpret badly-formed
XML. It was hoped
that this all-or-nothing tactic would force developers to fix errors.
You might wonder why strict coding is important, if people can build
parsers that can interpret invalid markup, but the reasons are pretty
simple. First, it’s a lot harder to build a parser that can interpret
invalid markup, and, second, there’s no guarantee that two different
parsers will interpret the same invalid markup in the same way.
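To make both problems concrete, here is a minimal Python sketch (the
broken feed text is invented for illustration) of how a conforming XML
parser reacts when a feed leaves its &lt;title&gt; element unclosed:

    import xml.etree.ElementTree as ET

    # An invented feed fragment with a well-formedness error:
    # the <title> element is never closed.
    bad_feed = "<rss><channel><title>My Weblog</channel></rss>"

    try:
        ET.fromstring(bad_feed)
    except ET.ParseError as err:
        # A conforming parser must report the error and stop,
        # rather than guess what the author meant.
        print("not well-formed:", err)

A liberal parser, given the same input, has to decide whether the stray
closing tag implicitly ends the title or should be ignored, and two
liberal parsers can reasonably make different choices.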
Mr Pilgrim also points out the lack of a good
RSS validator. While I don’t
have the resources or motivation to write one myself right now, I will
note that any such validator should really be built in layers.
First, you have to make sure the document is well-formed
XML. If it
isn’t, officially it cannot be further interpreted. Next, you
need to make sure that the XML
conforms to the RDF
syntax. Lastly, you check whether the
RDF graph
contains the necessary classes and properties for a reasonable
RSS channel. The last
step has only a few definite rules, and then a bunch of heuristics,
like whether certain common properties are in the right namespace,
or whether unrecognized names are used in known namespaces.
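I’m not going to write that validator here, but a rough sketch of the
layering, in Python, might look like the following. (It leans on the
standard library for the well-formedness check and on the third-party
rdflib package for the RDF step; the channel properties it looks for
are just the obvious RSS 1.0 ones, not a complete rule set.)

    import xml.etree.ElementTree as ET

    from rdflib import Graph, Namespace      # third-party RDF library
    from rdflib.namespace import RDF

    RSS1 = Namespace("http://purl.org/rss/1.0/")

    def validate(feed_text):
        problems = []

        # Layer 1: the document must be well-formed XML; if it
        # isn't, officially nothing more can be done with it.
        try:
            ET.fromstring(feed_text)
        except ET.ParseError as err:
            return ["not well-formed XML: %s" % err]

        # Layer 2: the XML must also be legal RDF syntax.
        graph = Graph()
        try:
            graph.parse(data=feed_text, format="xml")
        except Exception as err:   # rdflib reports RDF syntax errors here
            return ["not valid RDF/XML: %s" % err]

        # Layer 3: the RDF graph should describe a plausible channel.
        channels = list(graph.subjects(RDF.type, RSS1.channel))
        if not channels:
            problems.append("no rss:channel found")
        for chan in channels:
            for prop in (RSS1.title, RSS1.link, RSS1.description):
                if graph.value(chan, prop) is None:
                    problems.append("channel is missing %s" % prop)

        return problems

The heuristic checks (common properties in the wrong namespace,
unrecognized names in known namespaces) would sit on top of this,
reported as warnings rather than hard failures.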
#