
Re: [OCLUG-Tech] Spam Assassin question

On Thu, Jul 20, 2006 at 10:08:31AM -0400, Hugh Campbell wrote:

> The spam filtering definitely _does_ work but it doesn't seem
> particularly effective.  At a minimum, it seems much less trainable
> than my cat (that's not a lot).  I seem to be training the spam
> filter repetitively against the same or very similar types of mail.

You need huge amounts of spam to train spamassassin, but it does
indeed work.  After a bout of spam making it through, I trained it
against my 4000 spams collected over time (most of them categorised
as such by spamassassin itself), and then received no uncaught spam
for a while.
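
For anyone wanting to do the same kind of bulk training, here is a
minimal sketch in Python that builds the sa-learn command line for a
folder of collected mail.  The folder and file names are hypothetical,
and the helper functions are mine rather than part of SpamAssassin;
only the sa-learn options --spam, --ham and --mbox are real.

```python
import subprocess

def sa_learn_command(kind, path, mbox=False):
    """Build an sa-learn command line; kind is 'spam' or 'ham'."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    cmd = ["sa-learn", f"--{kind}"]
    if mbox:
        # path is a single mbox file rather than a directory of messages
        cmd.append("--mbox")
    cmd.append(path)
    return cmd

def train(kind, path, mbox=False):
    """Run sa-learn; this needs SpamAssassin installed on the machine."""
    return subprocess.run(sa_learn_command(kind, path, mbox), check=True)

# Hypothetical locations for the collected spam and known-good mail.
print(sa_learn_command("spam", "Mail/spam/cur"))
print(sa_learn_command("ham", "Mail/inbox.mbox", mbox=True))
```

Training on known-good mail as well as on the spam pile matters, since
the Bayesian filter needs examples of both.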

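Note that even with "autolearn" turned on, it only learns from
messages whose score is far from the spam threshold.  Clearly
low-scoring mail is learned as "ham" (non-spam), but a message has to
score well above the threshold before it is automatically learned as
spam, since auto-learning from borderline mail could be dangerous (a
la "contagious") if it starts to classify good messages as spam.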

With autolearning on, when you tell spamassassin to learn a message as
spam, it also undoes any "ham" learning it might have done.  So it's
still a good idea to learn spams as spam, lest similar spam messages
start to be classified as ham instead of spam.
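
To picture that "undo" step, here is a toy sketch in Python.  It is
not how spamassassin stores its Bayes data; the class, its fields and
the message id are all made up for illustration.  It only shows the
idea that relearning a message under the other label first removes
what was learned before.

```python
from collections import Counter

class ToyBayesStore:
    """Toy token store; NOT SpamAssassin's real database or tokenizer."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.learned = {}  # message id -> label it was last learned as

    def learn(self, msg_id, text, label):
        tokens = Counter(set(text.lower().split()))
        previous = self.learned.get(msg_id)
        if previous == label:
            return  # already learned this way, nothing to do
        if previous is not None:
            # Forget the earlier learning; Counter subtraction with "-"
            # drops tokens whose count falls to zero.
            self.counts[previous] = self.counts[previous] - tokens
        self.counts[label].update(tokens)
        self.learned[msg_id] = label

store = ToyBayesStore()
# Suppose the message was first (wrongly) learned as ham...
store.learn("msg1", "What is OEM Software", "ham")
# ...and is then explicitly learned as spam.
store.learn("msg1", "What is OEM Software", "spam")
print(store.counts["ham"]["oem"], store.counts["spam"]["oem"])  # prints: 0 1
```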

Finally, even a 100% Bayesian match can only contribute 3.5 points to
a message in the default configuration.  Unless you raise the value of
a Bayesian match and/or lower your spam threshold, you need other
characteristics to classify a message -- which, thankfully, are
usually present in professional spams.
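
The arithmetic can be sketched in a few lines of Python.  The 3.5
points and the default threshold of 5.0 are the stock values; the
other rule names and their scores are hypothetical, picked only to
show how the total crosses the threshold.

```python
# Sketch of how rule scores add up against the spam threshold.
REQUIRED_SCORE = 5.0  # default spam threshold (required_score)
BAYES_MAX = 3.5       # most a Bayesian match contributes by default

def is_spam(rule_scores, required=REQUIRED_SCORE):
    """Return (total, verdict) for a dict of rule name -> points."""
    total = sum(rule_scores.values())
    return total, total >= required

bayes_only = {"BAYES_99": BAYES_MAX}
bayes_plus_cues = {
    "BAYES_99": BAYES_MAX,
    "HYPOTHETICAL_HTML_ONLY": 1.0,   # made-up rule and score
    "HYPOTHETICAL_BLOCKLIST": 1.5,   # made-up rule and score
}

print(is_spam(bayes_only))                # (3.5, False)
print(is_spam(bayes_plus_cues))           # (6.0, True)
print(is_spam(bayes_only, required=3.0))  # (3.5, True)
```

The last line shows the other way out: with a lowered threshold, the
Bayesian match alone is enough.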

As a side note, it's been my experience that the easiest spams to
catch (ironically) are the professional ones, since their huge reader
base means tons of people get annoyed enough to add the appropriate
rules to anti-spam software.  Local businesses that add you to their
"mailing list", on the other hand, may get their mails through easily
as long as their language isn't excessively pushy.

> In frustration, I finally decided to set up a quick filter in KMail
> to get rid of the e-mail "What is OEM Software" that I keep getting,
> and remembered that I shouldn't have to hand write a filter to get
> rid of such a trivially easy-to-spot spam.

Easy to spot for a human.  For spamassassin, it needs other cues.  The
default rules are generally more aimed towards "typical" spams --
penis enlargement / "performance enhancement", Nigerian letters, etc.
Aside from those, it also checks for known spamming hosts (via dynamic
online lists), and characteristics of spam in general (like HTML-only
messages, failure to adhere to mail protocol, etc.).

If spamassassin is working, but then similar spams start to get
through, generally the best option is to upgrade spamassassin to get
the latest set of rules.  If you have enough of a spam backlog,
though, training can provide an additional anti-spam edge.
