bogofilter

Now that I have a new, much faster desktop machine, I finally took the plunge and set up Bayesian spam filtering. Hand-maintained rules were missing more and more spam: not enough to bother me much for my regular mail, but enough to bother me for group-advice and newgroups mail. I was losing a lot of stuff in the spam deluge.

I've now got Gnus 5.10.2 all set up using the Debian bogofilter package and it works like a charm. So far, it's been perfect. (I pre-seeded it with the past six months of group-advice and newgroups mail, among some other things, and apparently didn't make too many errors.) The only drawback is that registering new spam or non-spam takes a little while, even with a fairly fast machine, but not long enough that I think it will bother me.

The documentation has some real problems, though, so here are a few additional notes:

The documentation really wants you to use customize, preferably on each group. I don't like putting a lot of stuff into group properties, so I instead set the general variables. gnus-spam-newsgroup-contents takes a list of pairs, each mapping a regex that matches group names (including the nnml: part) to either gnus-group-spam-classification-spam or gnus-group-spam-classification-ham. Note that backslashes in the regexes have to be doubled inside Emacs Lisp strings. Example:

(custom-set-variables '(gnus-spam-newsgroup-contents
  '(("^nnml:mail\\.spam.*" gnus-group-spam-classification-spam)
    ("^nnml:mail\\.\\(eyrie\\|rra\\)" gnus-group-spam-classification-ham)
    ("^nnml:work\\.\\(personal\\|news\\)" gnus-group-spam-classification-ham)
    ("^nnml:project\\..*" gnus-group-spam-classification-ham))))
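One way to check regexes like these before handing them to Gnus is to run them against a few group names by hand. The following is a minimal sketch, not anything that Gnus or spam.el provides: the helper my-spam-group-classification and the variable my-example-contents are hypothetical names, and taking the first matching entry is an assumption about lookup order rather than documented Gnus behavior.

```elisp
;; Hypothetical helper for checking regexes; not part of Gnus or spam.el.
(defun my-spam-group-classification (group alist)
  "Return the classification symbol from the first ALIST entry matching GROUP.
ALIST has the same shape as `gnus-spam-newsgroup-contents'.
Return nil if no entry matches."
  (let ((result nil))
    (dolist (entry alist result)
      (when (and (null result)
                 (string-match (car entry) group))
        (setq result (cadr entry))))))

;; A sample list in the same shape as the setting above.  Backslashes
;; are doubled because they sit inside Emacs Lisp strings.
(defvar my-example-contents
  '(("^nnml:mail\\.spam.*" gnus-group-spam-classification-spam)
    ("^nnml:mail\\.\\(eyrie\\|rra\\)" gnus-group-spam-classification-ham)))

(my-spam-group-classification "nnml:mail.rra" my-example-contents)
;; => gnus-group-spam-classification-ham
(my-spam-group-classification "nnml:mail.spam" my-example-contents)
;; => gnus-group-spam-classification-spam
(my-spam-group-classification "nnml:misc" my-example-contents)
;; => nil
```

Evaluating the three calls in the scratch buffer shows at a glance whether a group falls into the spam list, the ham list, or neither.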

gnus-spam-process-newsgroups takes a similar list of pairs, each mapping a group-name regex to a list of processing functions, which is different from what the documentation says. Example:

(custom-set-variables '(gnus-spam-process-newsgroups
  '(("^nnml:project\\..*"
     (gnus-group-spam-exit-processor-bogofilter
      gnus-group-ham-exit-processor-bogofilter))
    ("^nnml:mail\\.\\(rra\\|eyrie\\)"
     (gnus-group-spam-exit-processor-bogofilter
      gnus-group-ham-exit-processor-bogofilter))
    ("^nnml:work\\.\\(personal\\|news\\)"
     (gnus-group-spam-exit-processor-bogofilter
      gnus-group-ham-exit-processor-bogofilter))
    ("^nnml:mail\\.spam.*"
     (gnus-group-spam-exit-processor-bogofilter)))))

Another point that's very unclear from the documentation, and that I only found out through a lot of fiddling, is that you need to add both the spam and ham bogofilter processors to the list of processors for each of your ham groups. Otherwise, the spam that you mark in those groups never actually gets registered with bogofilter.
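Since every ham group needs the same pair of processors, the list can also be built rather than typed out, which makes it harder to forget one. This is only a sketch of that idea: it assumes that setting the variable with setq in your Gnus init file behaves the same as the custom-set-variables form, and the local names both and ham-groups are made up for the example.

```elisp
;; Sketch: build the same list as above from a list of ham-group regexes.
(let ((both '(gnus-group-spam-exit-processor-bogofilter
              gnus-group-ham-exit-processor-bogofilter))
      (ham-groups '("^nnml:project\\..*"
                    "^nnml:mail\\.\\(rra\\|eyrie\\)"
                    "^nnml:work\\.\\(personal\\|news\\)")))
  (setq gnus-spam-process-newsgroups
        (append
         ;; Every ham group gets both the spam and the ham processor.
         (mapcar (lambda (regex) (list regex both)) ham-groups)
         ;; The spam group only needs the spam processor.
         '(("^nnml:mail\\.spam.*"
            (gnus-group-spam-exit-processor-bogofilter))))))
```

Adding a new ham group is then a one-line change to ham-groups.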

I played with the option to move marked spam into a different group but decided that I didn't like it. I would then have to go read that group and see that spam again to get it to actually expire. So I just leave it in the group and let it expire with the rest of the traffic.

Note that the spam processing stuff does not play well with groups marked auto-expire, since the expirable mark E is not one of the marks that causes a message to be registered as non-spam. I switched all of my auto-expire groups over to total-expire since that's how I use them anyway, and since the registration functions like read marks (r or R) much better than expirable marks. You can change the definition of ham-marks to include expirable messages, but the problem there is that each time you go back into the group to look at old messages, those old messages will all get re-registered as non-spam, which skews the counts.

Other than that, the documentation is okay (you want to use the spam.el package, and I recommend bogofilter over the built-in spam-stats stuff since the latter is going to be a lot slower). Do pay attention to the bit about not running messages through bogofilter in groups where you're not going to use bogofilter filtering. For example, I don't register any of the groups that I split out before I apply spam checks as non-spam groups, because that would skew the counts for the things that bogofilter actually gets to see.
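To make the "split out before I apply spam checks" part concrete, here is a minimal sketch of a fancy-split setup. It assumes, going by the spam.el documentation, that spam-use-bogofilter turns on the bogofilter check and that spam-split can be called from nnmail-split-fancy. The header pattern and the group names list.example and mail.misc are hypothetical, and depending on the Gnus version some additional spam.el initialization may be needed.

```elisp
;; Sketch only: the header pattern and group names are hypothetical.
(require 'spam)
(setq spam-use-bogofilter t)   ; assumed to enable the bogofilter check
(setq nnmail-split-methods 'nnmail-split-fancy)
(setq nnmail-split-fancy
      '(| ;; Split out before any spam check, so bogofilter never sees it.
          ("list-id" "example-list" "list.example")
          ;; Everything else goes through the spam check; spam-split is
          ;; assumed to return the spam group for spam and nil otherwise.
          (: spam-split)
          ;; Default group for mail that was checked and not judged spam.
          "mail.misc"))
```

With a setup like this, list.example would get no bogofilter processors at all, while mail.misc would be listed as a ham group with both processors.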

One final note: bogoutil is the program that lets you fiddle with the databases. One useful thing to do with it is to clean out all the tokens that have only one occurrence, since those are unlikely to recur. I'm guessing I'll probably do that every three months or so. (Note that you can also use db4.0_dump -r and then db4.0_load to dump and restore the database, which is the only way to get it to shrink in size even if you've taken things out of it.)

I should probably stick this all up somewhere more permanent on my web pages....

Posted: 2003-06-07 11:10

Last spun 2022-02-06 from thread modified 2013-01-04