Re: Spam, bogofilter, etc - Gordon Cormack

From: Gordon Cormack <gvcormac@uwaterloo.ca>
To: linux-kernel@vger.kernel.org
Subject: Re: Spam, bogofilter, etc
Date: Tue, 3 Oct 2006 10:50:51 +0000 (UTC)
Message-ID: <loom.20061003T123315-373@post.gmane.org>
In-Reply-To: Pine.LNX.4.64.0610020933020.3952@g5.osdl.org

Linus Torvalds <torvalds <at> osdl.org> writes:

> I'm sorry, but spam-filtering is simply harder than the bayesian 
> word-count weenies think it is. I even used to _know_ something about 
> bayesian filtering, since it was one of the projects I worked on at uni, 
> and dammit, it's not a good approach, as shown by the fact that it's 
> trivial to get around.

Linus, I've seen no evidence that statistical filters are trivial
to beat.  Can you provide some? 

> I don't know why people got so excited about the whole bayesian thing. 
> It's fine as _one_ small clause in a bigger framework of deciding spam, 
> but it's totally inappropriate for a "yes/no" kind of decision on its own.

Why is that?  Statistical filters (so-called 'Bayesian') have lower 
false positive and false negative rates than many other approaches.
Bogofilter is one of the better ones, although it is not particularly
Bayesian.

> If you want a yes/no kind of thing, do it on real hard issues, like not 
> accepting email from machines that aren't registered MX gateways. Sure, 
> that will mean that people who just set up their local sendmail thing and 
> connect directly to port 25 will just not be able to email, but let's face 
> it, that's why we have ISP's and DNS in the first place.

You are saying that this sort of false positive is acceptable to
you, yet you make no claim about the corresponding false
negative rate.
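
To make the two error rates concrete, here is a minimal sketch of how
both would be measured for any yes/no test.  The function name and the
sample data are made up for illustration; they are not measurements of
any real filter or MX check.

```python
def error_rates(labels, verdicts):
    """Return (false_positive_rate, false_negative_rate).

    labels   -- True where the message really is spam
    verdicts -- True where the test rejected the message
    """
    ham = spam = false_pos = false_neg = 0
    for is_spam, rejected in zip(labels, verdicts):
        if is_spam:
            spam += 1
            false_neg += 0 if rejected else 1   # spam that got through
        else:
            ham += 1
            false_pos += 1 if rejected else 0   # legitimate mail rejected
    # A rate is undefined when there are no messages of that class.
    fpr = false_pos / ham if ham else None
    fnr = false_neg / spam if spam else None
    return fpr, fnr


# Hypothetical sample: 5 legitimate messages followed by 4 spam messages.
labels = [False] * 5 + [True] * 4
verdicts = [True, False, False, False, False, True, True, True, False]
fpr, fnr = error_rates(labels, verdicts)
print(fpr, fnr)
```

A test is only characterised once both numbers are reported; quoting
one without the other says little about how useful the test is.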

So-called yes/no values are simply tests with their own failure
rates.  As such, they have strictly less information than 
scores or probability estimates that offer a confidence
estimate as well.  The trick is in combining several sources
of evidence, and 'Bayesian' is but one method of combining this
evidence.  
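
One way to picture this is a naive-Bayes style combination in log-odds
space.  The sketch below is only an illustration under an independence
assumption; it is not how Bogofilter or SpamAssassin actually combine
evidence, and the function names, rates, and prior are all hypothetical.

```python
import math


def binary_test_log_lr(fired, tpr, fpr):
    """Log-likelihood ratio contributed by a yes/no test.

    tpr -- P(test fires | spam), fpr -- P(test fires | ham).
    Both are assumed to be known and strictly between 0 and 1.
    """
    if fired:
        return math.log(tpr / fpr)
    return math.log((1.0 - tpr) / (1.0 - fpr))


def score_log_lr(p_spam, prior=0.5):
    """Log-likelihood ratio implied by a filter's spam probability,
    relative to the prior that probability was computed against."""
    p = min(max(p_spam, 1e-9), 1.0 - 1e-9)   # keep the logs finite
    return math.log(p / (1.0 - p)) - math.log(prior / (1.0 - prior))


def combine(log_lrs, prior=0.5):
    """Posterior spam probability, treating the evidence as independent."""
    log_odds = math.log(prior / (1.0 - prior)) + sum(log_lrs)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical evidence: a hard test that fired, plus a statistical
# filter that leans toward "not spam".
evidence = [
    binary_test_log_lr(fired=True, tpr=0.60, fpr=0.05),
    score_log_lr(0.30),
]
posterior = combine(evidence)
print(round(posterior, 4))
```

The yes/no test enters the sum as a fixed amount set by its two failure
rates, while the score contributes more or less depending on how
confident the filter is.  That is the sense in which a bare yes/no
carries less information.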
> 
> If you want to do word analysis, use it like SpamAssassin does it - with 
> some Bayesian rule perhaps adding a few points to the score. That's 
> entirely appropriate. But running bogo-filter _instead_ of spamassassin is 
> just asinine.

SpamAssassin performs quite poorly with the default weight
given to its statistical filter.  It works much better
if you increase the weight.  Many tests show that it works
better still if you simply discard the ad hoc rules and
rely on the 'Bayesian' filter alone.  I have found that
almost all of the false positives I've encountered in
the last 3 years have been due to SpamAssassin's ad hoc
rules, not its statistical filter.
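
The effect of the weights can be illustrated with a toy additive
scorer.  This is a sketch only: the rule names, weights, and threshold
below are invented for the example and are not SpamAssassin's actual
rules or defaults.

```python
THRESHOLD = 5.0  # hypothetical cut-off: at or above this is called spam


def is_spam(hits, weights, threshold=THRESHOLD):
    """Sum the weights of the rules that fired and compare to the threshold.
    Rules missing from `weights` contribute nothing."""
    return sum(weights.get(name, 0.0) for name in hits) >= threshold


# Hypothetical weight tables.
default_weights = {"stat_filter": 2.0, "adhoc_a": 1.5,
                   "adhoc_b": 2.0, "adhoc_c": 1.8}
boosted_weights = dict(default_weights, stat_filter=5.0)
stat_only_weights = {"stat_filter": 5.0}

# A spam message that only the statistical filter recognises.
spam_hits = ["stat_filter"]
# A legitimate message that happens to trip three ad hoc rules.
ham_hits = ["adhoc_a", "adhoc_b", "adhoc_c"]

for label, weights in [("default", default_weights),
                       ("boosted", boosted_weights),
                       ("stat only", stat_only_weights)]:
    print(label, is_spam(spam_hits, weights), is_spam(ham_hits, weights))
```

In this toy setup the low default weight lets the spam through,
raising it catches the spam, and only dropping the ad hoc rules
removes the false positive on the legitimate message.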

References

   http://plg.uwaterloo.ca/~gvcormac/trecspamtrack05
   http://plg.uwaterloo.ca/~gvcormac/spamassassin.html
   http://www.ceas.cc/2006/listabs.html#12.pdf

Gordon Cormack
University of Waterloo
