From: Russ Allbery <rra@stanford.edu>
To: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [OT] Confirmation Spam Blocking was: List 'linux-dvb' closed to public posts
Date: Sat, 24 Jan 2004 15:59:11 -0800
Message-ID: <87ad4c99v4.fsf@windlord.stanford.edu>
In-Reply-To: <1hDmg-4AP-9@gated-at.bofh.it> (Linus Torvalds's message of "Sat, 24 Jan 2004 22:20:09 +0100")
Linus Torvalds <torvalds@osdl.org> writes:
> Especially if the "random words" in the spam end up being weighted by
> real frequency, you just _cannot_ use single-word bayes filters on
> it. Or if you do, you'll eventually have those words either being
> neutral, or (worst of all cases) you'll have real mail be marked as spam
> after having aggressively trained the filter for the spams.
> It might not be that big of a deal especially if you have a fairly
> narrow scope of emails in your ham-list, but people who get mail from
> varied sources _will_ get screwed by this, one way or the other.
After having put a couple thousand messages a day through bogofilter for
around half a year now, this is, so far at least, not borne out by my
experience. Single-word Bayesian filters are still working fine for me in
practice and legitimate e-mail is not being misclassified as spam because
of this sort of dictionary poisoning. All the misclassifications I've
seen have been for very obvious reasons unrelated to Markov chains (I
generally have to explicitly train bogofilter a few times on invoices and
shipping notices from commerce sites, for example, since most
commerce-related words occur with a very high frequency in spam), and it
seems unlikely that they would be measurably helped by multiple-word
Bayesian algorithms.
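
To make the commerce-site case concrete, here is a minimal sketch of a
single-word Bayesian filter. It is a toy with made-up training messages,
and it is an assumption for illustration only: the class SingleWordFilter
and its methods are hypothetical names, and this is not bogofilter's
actual algorithm. It only shows how words that are common in spam can
push an invoice over the line until the invoice itself has been trained
as legitimate mail a few times.

```python
import math
from collections import Counter


class SingleWordFilter:
    """Toy single-word naive Bayes filter (hypothetical, for illustration)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # Count each distinct word once per message.
        self.counts[label].update(set(text.lower().split()))
        self.messages[label] += 1

    def word_spamminess(self, word):
        # Laplace-smoothed P(word | spam) and P(word | ham), so the
        # result is always strictly between 0 and 1.
        p_spam = (self.counts["spam"][word] + 1) / (self.messages["spam"] + 2)
        p_ham = (self.counts["ham"][word] + 1) / (self.messages["ham"] + 2)
        return p_spam / (p_spam + p_ham)

    def score(self, text):
        # Sum of per-word log-odds; a positive total leans spam.
        total = 0.0
        for word in set(text.lower().split()):
            p = self.word_spamminess(word)
            total += math.log(p) - math.log(1 - p)
        return total


f = SingleWordFilter()
for _ in range(20):
    f.train("order now free shipping discount price", "spam")
for _ in range(20):
    f.train("patch for the scheduler please review", "ham")

invoice = "invoice for your order shipping confirmation"
before = f.score(invoice)   # "order" and "shipping" dominate: leans spam

for _ in range(3):          # explicitly train the invoice as ham a few times
    f.train(invoice, "ham")
after = f.score(invoice)    # now leans ham

print(before > 0, after < 0)
```

In this toy corpus the misclassification has an obvious cause (two
commerce words seen only in spam) and is corrected by a handful of
explicit trainings, without any change to how the message is tokenized.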
Perhaps this will become a problem eventually (where eventually involves
more than one hundred thousand messages), but if so, I've not yet seen any
evidence of it.
Maybe I just have that narrow scope of e-mail that you refer to. I'm not
sure how to measure that. My gut instinct is that most people have a
pretty narrow scope of e-mail that they receive, relative to all the
possible legitimate e-mail messages (and I'm much more skeptical of
Bayesian filters when applied site-wide rather than to a single mailbox).
Using multiple words is probably better along some axes (faster training,
perhaps), but a sufficiently trained single-word filter doesn't appear to
have any real difficulties. I'm inclined to believe that people who are
experiencing these sorts of problems with Bayesian filters are using
inferior implementations, haven't sufficiently trained their filters, or
have a radically different range of legitimate e-mail than I do.
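
For readers unfamiliar with the distinction, the following sketch
contrasts single-word tokens with adjacent word-pair tokens. The function
names and both sample messages are invented for illustration, and treating
"multiple words" as adjacent pairs is an assumption; real multi-word
filters may tokenize differently. It shows only that a spam stuffed with
legitimate-looking words overlaps heavily with real mail at the word level
and hardly at all at the pair level. It says nothing about whether that
difference matters for a well-trained filter in practice.

```python
def word_tokens(text):
    """Single-word tokens: the set of distinct lowercase words."""
    return set(text.lower().split())


def pair_tokens(text):
    """Multiple-word tokens, taken here to be adjacent word pairs."""
    words = text.lower().split()
    return set(zip(words, words[1:]))


# Hypothetical messages: a legitimate mail and a spam padded with
# words lifted from legitimate mail in scrambled order.
ham = "please review the scheduler patch before the merge window"
poisoned_spam = "window patch merge please scheduler review buy pills"

shared_words = word_tokens(ham) & word_tokens(poisoned_spam)
shared_pairs = pair_tokens(ham) & pair_tokens(poisoned_spam)

print(len(shared_words), len(shared_pairs))
```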
--
Russ Allbery (rra@stanford.edu) <http://www.eyrie.org/~eagle/>