* Re: Spam, bogofilter, etc
2006-10-02 16:40 ` Linus Torvalds
@ 2006-10-02 17:49 ` Alan Cox
2006-10-02 17:19 ` David Lang
` (2 more replies)
2006-10-02 21:33 ` Horst H. von Brand
` (4 subsequent siblings)
5 siblings, 3 replies; 35+ messages in thread
From: Alan Cox @ 2006-10-02 17:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin J. Bligh, Lee Revell, Matti Aarnio, linux-kernel
Ar Llu, 2006-10-02 am 09:40 -0700, ysgrifennodd Linus Torvalds:
> If you want a yes/no kind of thing, do it on real hard issues, like not
> accepting email from machines that aren't registered MX gateways. Sure,
> that will mean that people who just set up their local sendmail thing and
> connect directly to port 25 will just not be able to email, but let's face
> it, that's why we have ISP's and DNS in the first place.
Except most of the ISPs are incompetent and many people have to run
their own mail system in order to get mail that actually *works*. I've
had that experience several times, although thankfully I now have a sane
ISP.
MX checking is as broken or more broken than bayes.
There is another reason bayes is not very good too - every good spammer
reruns their message through spamassassin adding random text till they
get a good score *then* they spew it out.
Alan
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 17:49 ` Alan Cox
@ 2006-10-02 17:19 ` David Lang
2006-10-02 18:02 ` Linus Torvalds
2006-10-03 17:32 ` Mariusz Kozlowski
2 siblings, 0 replies; 35+ messages in thread
From: David Lang @ 2006-10-02 17:19 UTC (permalink / raw)
To: Alan Cox
Cc: Linus Torvalds, Martin J. Bligh, Lee Revell, Matti Aarnio,
linux-kernel
On Mon, 2 Oct 2006, Alan Cox wrote:
> Ar Llu, 2006-10-02 am 09:40 -0700, ysgrifennodd Linus Torvalds:
>> If you want a yes/no kind of thing, do it on real hard issues, like not
>> accepting email from machines that aren't registered MX gateways. Sure,
>> that will mean that people who just set up their local sendmail thing and
>> connect directly to port 25 will just not be able to email, but let's face
>> it, that's why we have ISP's and DNS in the first place.
>
> Except most of the ISPs are incompetent and many people have to run
> their own mail system in order to get mail that actually *works*. I've
> had that experience several times, although thankfully I now have a sane
> ISP.
>
> MX checking is as broken or more broken than bayes.
>
> There is another reason bayes is not very good too - every good spammer
> reruns their message through spamassassin adding random text till they
> get a good score *then* they spew it out.
That's why you don't use a fixed table like that. If the table is customized for
your mail then it's unlikely to agree with anyone else's, so mail that will get
through their filter won't get through yours (and vice versa).
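
A minimal sketch of the per-site table idea, assuming a toy word-presence filter with add-one smoothing. The TokenFilter class, the whitespace tokenizer, and the training messages are all hypothetical and are not taken from bogofilter or SpamAssassin. It only shows that two sites trained on different mail can score the same message differently.

```python
import math
from collections import Counter


class TokenFilter:
    """Toy per-site filter: counts in how many spam/ham messages each token appears."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        tokens = set(text.lower().split())  # naive whitespace tokenizer (assumption)
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def score(self, text):
        """Return a spam probability in (0, 1), treating tokens as independent."""
        log_odds = 0.0
        for tok in set(text.lower().split()):
            # Add-one smoothing so unseen tokens do not produce zero probabilities.
            p_spam = (self.spam[tok] + 1) / (self.n_spam + 2)
            p_ham = (self.ham[tok] + 1) / (self.n_ham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


# Two sites with different mail train two different tables.
kernel_dev = TokenFilter()
kernel_dev.train("patch for scheduler lockup", is_spam=False)
kernel_dev.train("cheap pills online", is_spam=True)

pharmacy = TokenFilter()
pharmacy.train("cheap pills restock order", is_spam=False)
pharmacy.train("kernel patch enlargement", is_spam=True)

message = "cheap pills patch"
```

A message tuned to slip past one of these tables gets no guarantee against the other, which is the point about not sharing a fixed table.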
David Lang
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 17:49 ` Alan Cox
2006-10-02 17:19 ` David Lang
@ 2006-10-02 18:02 ` Linus Torvalds
2006-10-02 18:07 ` Martin Bligh
` (3 more replies)
2006-10-03 17:32 ` Mariusz Kozlowski
2 siblings, 4 replies; 35+ messages in thread
From: Linus Torvalds @ 2006-10-02 18:02 UTC (permalink / raw)
To: Alan Cox; +Cc: Martin J. Bligh, Lee Revell, Matti Aarnio, linux-kernel
On Mon, 2 Oct 2006, Alan Cox wrote:
>
> Ar Llu, 2006-10-02 am 09:40 -0700, ysgrifennodd Linus Torvalds:
> > If you want a yes/no kind of thing, do it on real hard issues, like not
> > accepting email from machines that aren't registered MX gateways. Sure,
> > that will mean that people who just set up their local sendmail thing and
> > connect directly to port 25 will just not be able to email, but let's face
> > it, that's why we have ISP's and DNS in the first place.
>
> Except most of the ISPs are incompetent and many people have to run
> their own mail system in order to get mail that actually *works*. I've
> had that experience several times, although thankfully I now have a sane
> ISP.
Sure. I kind of agree - I'm just saying that if you have a _hard_
decision, you should base it on _hard_ data.
The MX checking is at least hard, and is a valid reason to just deny
email. I'm not claiming it's "perfect", but it's a hell of a lot better
than bayes.
> MX checking is as broken or more broken than bayes.
I have to say, OSDL has been doing MX checking, and it's effective as
hell. Most importantly, when it _does_ break, it's not because some
"content" is considered inappropriate, it's because some ISP does
something technically wrong.
OSDL also refused to talk to open mail relays etc. I got into something of
a (fairly civilized) shouting match with John Gilmore over it, who used to
send out email from a "fake open mail relay" on principle (maybe he still
does). He claimed I was censoring his free speech rights when I didn't
read his emails, but I just told him that I was expressing my right to not
listen to people who are so stupid that they can't configure their email
servers.
(I'm not saying that John is stupid, since he did it on purpose, but he
was also clever enough to know exactly what was involved, so it's not like
he couldn't be heard if he wanted to - it's not "censoring" if nobody
listens to you because you built your own sound-proof walls around you).
> There is another reason bayes is not very good too - every good spammer
> reruns their message through spamassassin adding random text till they
> get a good score *then* they spew it out.
Yes. Which is why it's better to rely on hard technical data, or on a
large body of different small rules, including some that are personalized
(ie white-lists and blacklists that are site-specific, including making
things like the bayesian rules be per-site - perhaps _seeded_ by some
common data, but updated locally).
Of course, the MX checking can also be avoided, and a lot of spam-bots
know to use the ISP connection instead of a direct port-25 approach. But
at least that way, the mail gateway can (and often does) notice the
flooding, and many ISP's successfully throttle at least some spam at the
source, so it does actually have real meaning.
Linus
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 18:02 ` Linus Torvalds
@ 2006-10-02 18:07 ` Martin Bligh
2006-10-02 18:22 ` Valdis.Kletnieks
` (2 subsequent siblings)
3 siblings, 0 replies; 35+ messages in thread
From: Martin Bligh @ 2006-10-02 18:07 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Alan Cox, Lee Revell, Matti Aarnio, linux-kernel
>>MX checking is as broken or more broken than bayes.
>
> I have to say, OSDL has been doing MX checking, and it's effective as
> hell. Most importantly, when it _does_ break, it's not because some
> "content" is considered inappropriate, it's because some ISP does
> something technically wrong.
>
> OSDL also refused to talk to open mail relays etc. I got into something of
> a (fairly civilized) shouting match with John Gilmore over it, who used to
> send out email from a "fake open mail relay" on principle (maybe he still
> does). He claimed I was censoring his free speech rights when I didn't
> read his emails, but I just told him that I was expressing my right to not
> listen to people who are so stupid that they can't configure their email
> servers.
That was actually pretty broken. Sending Andrew email stopped working
for ages. IIRC because I was sending email from my home address through
the IBM work server. It's not a trouble-free solution, and otherwise
fairly reasonable things stop working. I forget what the OSDL admins
did in the end ... I think put in a specific exception for an IP range.
M.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 18:02 ` Linus Torvalds
2006-10-02 18:07 ` Martin Bligh
@ 2006-10-02 18:22 ` Valdis.Kletnieks
2006-10-02 18:29 ` Linus Torvalds
2006-10-02 21:58 ` Alan Cox
2006-10-04 22:41 ` Adrian Bunk
3 siblings, 1 reply; 35+ messages in thread
From: Valdis.Kletnieks @ 2006-10-02 18:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Martin J. Bligh, Lee Revell, Matti Aarnio, linux-kernel
On Mon, 02 Oct 2006 11:02:36 PDT, Linus Torvalds said:
> > MX checking is as broken or more broken than bayes.
>
> I have to say, OSDL has been doing MX checking, and it's effective as
> hell. Most importantly, when it _does_ break, it's not because some
> "content" is considered inappropriate, it's because some ISP does
> something technically wrong.
How did OSDL's MX checking deal with split in/out configurations like ours,
where our MX points at a load-balanced farm of Mirapoint front end appliances
with 1 IP address, but our main off-campus *outbound* comes from a different
address?
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 18:22 ` Valdis.Kletnieks
@ 2006-10-02 18:29 ` Linus Torvalds
2006-10-02 19:31 ` jdow
2006-10-02 19:31 ` Antonio Vargas
0 siblings, 2 replies; 35+ messages in thread
From: Linus Torvalds @ 2006-10-02 18:29 UTC (permalink / raw)
To: Valdis.Kletnieks
Cc: Alan Cox, Martin J. Bligh, Lee Revell, Matti Aarnio, linux-kernel
On Mon, 2 Oct 2006, Valdis.Kletnieks@vt.edu wrote:
>
> How did OSDL's MX checking deal with split in/out configurations like ours,
> where our MX points at a load-balanced farm of Mirapoint front end appliances
> with 1 IP address, but our main off-campus *outbound* comes from a different
> address?
Hey, if I knew what I was doing, I'd be in MIS.
As it is, I just criticise other people's patches.
Linus
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 18:29 ` Linus Torvalds
@ 2006-10-02 19:31 ` jdow
2006-10-02 19:31 ` Antonio Vargas
1 sibling, 0 replies; 35+ messages in thread
From: jdow @ 2006-10-02 19:31 UTC (permalink / raw)
To: Linus Torvalds, Valdis.Kletnieks; +Cc: linux-kernel
From: "Linus Torvalds" <torvalds@osdl.org>
> On Mon, 2 Oct 2006, Valdis.Kletnieks@vt.edu wrote:
>>
>> How did OSDL's MX checking deal with split in/out configurations like ours,
>> where our MX points at a load-balanced farm of Mirapoint front end appliances
>> with 1 IP address, but our main off-campus *outbound* comes from a different
>> address?
>
> Hey, if I knew what I was doing, I'd be in MIS.
>
> As it is, I just criticise other people's patches.
DK or DKIM comes to mind. SpamAssassin 3.1.5 handles it neatly.
Offhand, expecting a list to maintain perfect anti-spam filtering is rather
difficult. Distributed processing works better. Folks should have
their own anti-spam tools and train them to their own preferences.
(It helps with a list like this one to have a SpamAssassin meta
rule that boosts the scores for BAYES_80 and above while reducing
scores for BAYES_40 and below. It also helps to run a lot of the
SARE, SpamAssassin Rules Emporium, rule sets. Pick and choose for
your particular needs. http://www.rulesemporium.com/rules)
{^_^} Joanne
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 18:29 ` Linus Torvalds
2006-10-02 19:31 ` jdow
@ 2006-10-02 19:31 ` Antonio Vargas
1 sibling, 0 replies; 35+ messages in thread
From: Antonio Vargas @ 2006-10-02 19:31 UTC (permalink / raw)
To: Linus Torvalds, Valdis.Kletnieks, Alan Cox, Martin J. Bligh,
Lee Revell, Matti Aarnio, linux-kernel
On 10/2/06, Linus Torvalds <torvalds@osdl.org> wrote:
>
>
> On Mon, 2 Oct 2006, Valdis.Kletnieks@vt.edu wrote:
> >
> > How did OSDL's MX checking deal with split in/out configurations like ours,
> > where our MX points at a load-balanced farm of Mirapoint front end appliances
> > with 1 IP address, but our main off-campus *outbound* comes from a different
> > address?
>
> Hey, if I knew what I was doing, I'd be in MIS.
>
I'd rather say you are not in MIS exactly because you prefer knowing
what you are doing.
> As it is, I just criticise other people's patches.
>
> Linus
--
Greetz, Antonio Vargas aka winden of network
Every day, every year
you have to work
you have to study
you have to scene.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 18:02 ` Linus Torvalds
2006-10-02 18:07 ` Martin Bligh
2006-10-02 18:22 ` Valdis.Kletnieks
@ 2006-10-02 21:58 ` Alan Cox
2006-10-04 22:41 ` Adrian Bunk
3 siblings, 0 replies; 35+ messages in thread
From: Alan Cox @ 2006-10-02 21:58 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin J. Bligh, Lee Revell, Matti Aarnio, linux-kernel
> Of course, the MX checking can also be avoided, and a lot of spam-bots
> know to use the ISP connection instead of a direct port-25 approach. But
> at least that way, the mail gateway can (and often does) notice the
> flooding, and many ISP's successfully throttle at least some spam at the
> source, so it does actually have real meaning.
Actually some of the smarter big ISPs with the less technical customers
transproxy port 25 anyway - using big Linux boxes and the netfilter
code.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 18:02 ` Linus Torvalds
` (2 preceding siblings ...)
2006-10-02 21:58 ` Alan Cox
@ 2006-10-04 22:41 ` Adrian Bunk
3 siblings, 0 replies; 35+ messages in thread
From: Adrian Bunk @ 2006-10-04 22:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Alan Cox, Martin J. Bligh, Lee Revell, Matti Aarnio, linux-kernel
On Mon, Oct 02, 2006 at 11:02:36AM -0700, Linus Torvalds wrote:
>
>
> On Mon, 2 Oct 2006, Alan Cox wrote:
> >
> > Ar Llu, 2006-10-02 am 09:40 -0700, ysgrifennodd Linus Torvalds:
> > > If you want a yes/no kind of thing, do it on real hard issues, like not
> > > accepting email from machines that aren't registered MX gateways. Sure,
> > > that will mean that people who just set up their local sendmail thing and
> > > connect directly to port 25 will just not be able to email, but let's face
> > > it, that's why we have ISP's and DNS in the first place.
> >
> > Except most of the ISPs are incompetent and many people have to run
> > their own mail system in order to get mail that actually *works*. I've
> > had that experience several times, although thankfully I now have a sane
> > ISP.
>
> Sure. I kind of agree - I'm just saying that if you have a _hard_
> decision, you should base it on _hard_ data.
>...
My personal hard data is:
- if you are sending emails to me, the fourth-last mail server in the
path (the one that actually receives the emails from the Internet)
does greylisting, IOW much spam that can be trivially determined is
already eliminated when bogofilter gets the emails
- much spam I'm getting comes through lists like linux-kernel that
have already filtered out the easy to determine spam
- despite these points, bogofilter catches 90% of the arriving spam
- one false positive every 1-2 years (sic)
- I can (and do) train bogofilter myself
It might have its weaknesses and might therefore not work well forever,
but at least during the last years bogofilter served me well.
> Linus
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 17:49 ` Alan Cox
2006-10-02 17:19 ` David Lang
2006-10-02 18:02 ` Linus Torvalds
@ 2006-10-03 17:32 ` Mariusz Kozlowski
2 siblings, 0 replies; 35+ messages in thread
From: Mariusz Kozlowski @ 2006-10-03 17:32 UTC (permalink / raw)
To: Alan Cox; +Cc: linux-kernel
> every good spammer reruns their message through spamassassin adding random
> text till they get a good score *then* they spew it out.
That's a flaw in the whole idea of having separate, pre-defined (by humans)
rules that catch miscellaneous spam indicators which are obvious to us. If you
had a filter that you just feed with raw data from many sources and that does
pattern recognition and learns on its own, there would (probably) be no way to
get around it. At least it wouldn't be easy. In fact, when e.g. an ANN
(artificial neural network) is used as the classifier, the rules created by
training are hidden and have no obvious representation to us, so one would have
no idea what to change to get the desired filter output.
Mariusz
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 16:40 ` Linus Torvalds
2006-10-02 17:49 ` Alan Cox
@ 2006-10-02 21:33 ` Horst H. von Brand
2006-10-03 8:08 ` John Graham-Cumming
` (3 subsequent siblings)
5 siblings, 0 replies; 35+ messages in thread
From: Horst H. von Brand @ 2006-10-02 21:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin J. Bligh, Lee Revell, Matti Aarnio, linux-kernel
Linus Torvalds <torvalds@osdl.org> wrote:
[...]
> If you want a yes/no kind of thing, do it on real hard issues, like not
> accepting email from machines that aren't registered MX gateways. Sure,
> that will mean that people who just set up their local sendmail thing and
> connect directly to port 25 will just not be able to email, but let's face
> it, that's why we have ISP's and DNS in the first place.
Larger sites have ingoing (MX) machines and outgoing (no MX) ones... this
is useless. And the whole SPF fiasco shows that such mechanisms (DNS based,
remote site publishes the data) are even easier to bypass (I've seen
statistics showing that the overwhelming majority of SPF-"protected" email
is spam).
What does work rather well is greylisting (on first try tell them to come
back later, spammers rarely retry their junk).
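
A minimal sketch of greylisting as described above. The Greylist class, the (client IP, sender, recipient) key, and the 300-second delay are assumptions made for illustration, not the behaviour of any particular MTA or greylisting daemon.

```python
import time


class Greylist:
    """Toy greylist keyed on the (client IP, envelope sender, recipient) triplet."""

    def __init__(self, min_delay=300, now=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.now = now              # injectable clock, handy for testing
        self.first_seen = {}        # triplet -> time of the first delivery attempt

    def check(self, client_ip, mail_from, rcpt_to):
        """Return an SMTP-style reply code: 451 to defer, 250 to accept."""
        triplet = (client_ip, mail_from.lower(), rcpt_to.lower())
        t = self.now()
        first = self.first_seen.setdefault(triplet, t)
        if t - first < self.min_delay:
            return 451  # temporary failure: a real MTA will queue and retry
        return 250      # the sender came back after the delay: accept
```

A sender that never retries is never accepted; a normal mail server that queues and retries gets through after one delay.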
Add blacklists (sadly, there are few reliable ones, AFAICS) and you cut it
down even more.
And yes, there is no silver bullet. This is an arms race, get a new
anti-spam device (filter configuration, ...) and soon they will figure out
how to bypass it.
In any case, I've seen claims that around 80% of email now is spam. That
it is still only a little in LKML says that the listmasters are doing an
outstanding job.
--
Dr. Horst H. von Brand User #22616 counter.li.org
Departamento de Informatica Fono: +56 32 2654431
Universidad Tecnica Federico Santa Maria +56 32 2654239
Casilla 110-V, Valparaiso, Chile Fax: +56 32 2797513
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 16:40 ` Linus Torvalds
2006-10-02 17:49 ` Alan Cox
2006-10-02 21:33 ` Horst H. von Brand
@ 2006-10-03 8:08 ` John Graham-Cumming
2006-10-03 8:52 ` Howard Chu
2006-10-03 9:40 ` Devdas Bhagat
` (2 subsequent siblings)
5 siblings, 1 reply; 35+ messages in thread
From: John Graham-Cumming @ 2006-10-03 8:08 UTC (permalink / raw)
To: linux-kernel
Linus Torvalds <torvalds <at> osdl.org> writes:
> I'm sorry, but spam-filtering is simply harder than the bayesian
> word-count weenies think it is. I even used to _know_ something about
> bayesian filtering, since it was one of the projects I worked on at uni,
> and dammit, it's not a good approach, as shown by the fact that it's
> trivial to get around.
Have you actually followed any of the research into Bayesian (and similar
machine learning based) anti-spam filtering, and attacks on such filters? Are
you making a claim that these filters are 'trivial to get around' based on a
project you did at University over 10 years ago?
John.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-03 8:08 ` John Graham-Cumming
@ 2006-10-03 8:52 ` Howard Chu
0 siblings, 0 replies; 35+ messages in thread
From: Howard Chu @ 2006-10-03 8:52 UTC (permalink / raw)
To: John Graham-Cumming; +Cc: linux-kernel
John Graham-Cumming wrote:
> Linus Torvalds <torvalds <at> osdl.org> writes:
>> I'm sorry, but spam-filtering is simply harder than the bayesian
>> word-count weenies think it is. I even used to _know_ something about
>> bayesian filtering, since it was one of the projects I worked on at uni,
>> and dammit, it's not a good approach, as shown by the fact that it's
>> trivial to get around.
> Have you actually followed any of the research into Bayesian (and similar
> machine learning based) anti-spam filtering, and attacks on such filters? Are
> you making a claim that these filters are 'trivial to get around' based on a
> project you did at University over 10 years ago?
Well the recent spate of spams with technical/jargon keywords in their
subjects was enough to make my Seamonkey client start marking all
incoming mail as spam. Interesting that recent journals talk about this
as an approach to get spam past current filters; instead it had a
reverse effect.
So much for email management at our hosting provider. At least on my
highlandsun.com domain I've got my own sendmail milter blocking spams
before they get into the server. It's basically the equivalent of a
sendmail accessdb in LDAP, plus simple rules to reject relays from
unregistered IP addresses, or addresses with dynamically generated
hostnames. Rejecting with 451 temporary failure is also useful, most
bulk mailer programs fail immediately and go away. Real mail servers
will retry; by looking at the logs of the envelope FROM and RCPT I can
pick out any emails that should have been let thru and add an OK
exception to LDAP so the message eventually gets redelivered. I suppose
I could put a URL in the reject error message, and let the sender
confirm it from there. At this point the only spam that gets thru is
from dedicated mass marketers with legitimate DNS registrations and I
just manually add their subnets to my blacklist.
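
A rough sketch of the kind of connection-time policy described above, under stated assumptions: the policy function, the plain Python set and list standing in for the LDAP-backed exception and blacklist data, and the DYNAMIC_HOSTNAME heuristic are all hypothetical and are not the actual milter.

```python
import ipaddress
import re

# Heuristic guess at "dynamically generated" hostnames (an assumption for this
# sketch): an embedded dotted or dashed IPv4 address, or a typical pool keyword.
DYNAMIC_HOSTNAME = re.compile(
    r"(\d{1,3}[-.]){3}\d{1,3}|dsl|dhcp|dial|pool|dyn", re.IGNORECASE
)


def policy(client_ip, client_hostname, allow, blacklist):
    """Return an SMTP-style reply for a connecting client.

    allow     -- set of IP strings with a manual OK exception
    blacklist -- list of ipaddress networks that are rejected outright
    """
    if client_ip in allow:
        return "250 OK"
    ip = ipaddress.ip_address(client_ip)
    if any(ip in net for net in blacklist):
        return "550 rejected"
    # No reverse DNS, or a hostname that looks dynamically generated: defer.
    # Real mail servers retry and can be given an exception after a log review.
    if client_hostname is None or DYNAMIC_HOSTNAME.search(client_hostname):
        return "451 try again later"
    return "250 OK"
```

Deferring with 451 rather than rejecting keeps mistakes recoverable: a legitimate sender retries, and adding its address to the exception set lets the retry through.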
(One then is faced with the interesting question - what if someone from
one of those companies was actually trying to hire my services? Their
loss I guess, sometimes money really is tainted...)
--
-- Howard Chu
Chief Architect, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc
OpenLDAP Core Team http://www.openldap.org/project/
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 16:40 ` Linus Torvalds
` (2 preceding siblings ...)
2006-10-03 8:08 ` John Graham-Cumming
@ 2006-10-03 9:40 ` Devdas Bhagat
2006-10-03 9:43 ` Helge Hafting
2006-10-03 10:50 ` Gordon Cormack
5 siblings, 0 replies; 35+ messages in thread
From: Devdas Bhagat @ 2006-10-03 9:40 UTC (permalink / raw)
To: linux-kernel
Linus Torvalds <torvalds <at> osdl.org> writes:
<snip>
> I'm sorry, but spam-filtering is simply harder than the bayesian
> word-count weenies think it is. I even used to _know_ something about
Spam stopping is harder than anyone thinks it is. Spam is about consent, not
content, and we have no really reliable way yet of knowing consent (except a
pure whitelist).
> If you want a yes/no kind of thing, do it on real hard issues, like not
> accepting email from machines that aren't registered MX gateways. Sure,
Uhm, MX is for receiving mail, not sending it. Plenty of organisations have
different hosts for MX MTAs and outbound MTAs. I work in that field, so just a
warning note for anyone who wants to take Linus' advice.
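
To make the failure mode concrete, here is a small sketch using a hard-coded table in place of real DNS lookups. The domain, the addresses, and the passes_mx_check function are hypothetical.

```python
# Hypothetical DNS data for one site with split inbound/outbound mail hosts.
MX_ADDRESSES = {"example.edu": {"192.0.2.25"}}        # where the MX record points
OUTBOUND_ADDRESSES = {"example.edu": {"192.0.2.80"}}  # where its mail is sent from


def passes_mx_check(sender_domain, client_ip, mx_addresses=MX_ADDRESSES):
    """Naive check: accept only if the client is one of the sender domain's MX hosts."""
    return client_ip in mx_addresses.get(sender_domain, set())


# The site's legitimate outbound host is not listed as an MX, so it is refused.
legit_outbound_ip = next(iter(OUTBOUND_ADDRESSES["example.edu"]))
refused = not passes_mx_check("example.edu", legit_outbound_ip)
```

Here refused is True even though the message is perfectly legitimate, which is exactly the false positive a split in/out configuration runs into.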
Devdas Bhagat
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 16:40 ` Linus Torvalds
` (3 preceding siblings ...)
2006-10-03 9:40 ` Devdas Bhagat
@ 2006-10-03 9:43 ` Helge Hafting
2006-10-03 10:50 ` Gordon Cormack
5 siblings, 0 replies; 35+ messages in thread
From: Helge Hafting @ 2006-10-03 9:43 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Martin J. Bligh, Lee Revell, Matti Aarnio, linux-kernel
Linus Torvalds wrote:
> On Mon, 2 Oct 2006, Martin J. Bligh wrote:
>
>> If you got rid of "slut" and "schoolgirl" that'd get rid of half of it.
>>
>
> The problem with bogo-filter is that THE WHOLE CONCEPT IS FLAWED.
>
Perhaps, but it works remarkably well anyway. After training with a
few thousand messages of each kind the amount of wrong
decisions is low. Each month I retrain the filter with the 20
or so messages it wasn't able to classify. (I sort into
spam, nonspam, and "dubious".)
Helge Hafting
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Spam, bogofilter, etc
2006-10-02 16:40 ` Linus Torvalds
` (4 preceding siblings ...)
2006-10-03 9:43 ` Helge Hafting
@ 2006-10-03 10:50 ` Gordon Cormack
5 siblings, 0 replies; 35+ messages in thread
From: Gordon Cormack @ 2006-10-03 10:50 UTC (permalink / raw)
To: linux-kernel
Linus Torvalds <torvalds <at> osdl.org> writes:
> I'm sorry, but spam-filtering is simply harder than the bayesian
> word-count weenies think it is. I even used to _know_ something about
> bayesian filtering, since it was one of the projects I worked on at uni,
> and dammit, it's not a good approach, as shown by the fact that it's
> trivial to get around.
Linus, I've seen no evidence that statistical filters are trivial
to beat. Can you provide some?
> I don't know why people got so excited about the whole bayesian thing.
> It's fine as _one_ small clause in a bigger framework of deciding spam,
> but it's totally inappropriate for a "yes/no" kind of decision on its own.
Why is that? Statistical filters (so-called 'Bayesian') have lower
false positive and false negative rates than many other approaches.
Bogofilter is one of the better ones, although it is not particularly
Bayesian.
> If you want a yes/no kind of thing, do it on real hard issues, like not
> accepting email from machines that aren't registered MX gateways. Sure,
> that will mean that people who just set up their local sendmail thing and
> connect directly to port 25 will just not be able to email, but let's face
> it, that's why we have ISP's and DNS in the first place.
You are saying that this sort of false positive is acceptable to
you. With no corresponding claim as to the corresponding false
negative rate.
So-called yes/no values are simply tests with their own failure
rates. As such, they have strictly less information than
scores or probability estimates that offer a confidence
estimate as well. The trick is in combining several sources
of evidence, and 'Bayesian' is but one method of combining this
evidence.
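
One way to sketch this, assuming independent tests and an even prior: turn each source of evidence into a spam probability and add the log-odds. The function names and the example rates are hypothetical, and this is only one of several ways to combine evidence.

```python
import math


def binary_test_as_probability(fired, hit_rate, false_alarm_rate):
    """Spam probability implied by a yes/no test, at an even prior.

    hit_rate         -- fraction of spam on which the test fires
    false_alarm_rate -- fraction of legitimate mail on which it fires
    """
    if fired:
        s, h = hit_rate, false_alarm_rate
    else:
        s, h = 1 - hit_rate, 1 - false_alarm_rate
    return s / (s + h)


def combine(probabilities):
    """Combine per-test spam probabilities by summing their log-odds."""
    log_odds = 0.0
    for p in probabilities:
        p = min(max(p, 1e-6), 1 - 1e-6)  # clamp to avoid log(0)
        log_odds += math.log(p / (1 - p))
    return 1 / (1 + math.exp(-log_odds))


# A hard test that fires on 60% of spam and 1% of legitimate mail (made-up
# rates), combined with a statistical filter's score of 0.30 for the message.
hard_test = binary_test_as_probability(True, hit_rate=0.6, false_alarm_rate=0.01)
overall = combine([hard_test, 0.30])
```

Seen this way, a yes/no test is a probability estimate that can take only two values, so a score with a confidence attached carries at least as much information.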
>
> If you want to do word analysis, use it like SpamAssassin does it - with
> some Bayesian rule perhaps adding a few points to the score. That's
> entirely appropriate. But running bogo-filter _instead_ of spamassassin is
> just asinine.
Spamassassin performs quite poorly with the default weight
given to its statistical filter. It works much better
if you increase the weight. Many tests show that it works
better still if you simply discard the ad hoc rules and
rely on the 'Bayesian' filter alone. I have found that
almost all of the false positives I've encountered in
the last 3 years have been due to Spamassassin's ad hoc
rules, not its statistical filter.
References
http://plg.uwaterloo.ca/~gvcormac/trecspamtrack05
http://plg.uwaterloo.ca/~gvcormac/spamassassin.html
http://www.ceas.cc/2006/listabs.html#12.pdf
Gordon Cormack
University of Waterloo
^ permalink raw reply [flat|nested] 35+ messages in thread