SpamHash

From ThorxWiki

Revision as of 08:01, 13 March 2009

aka SMTP, aka rfc2822

  • Some problems associated with email...
    • spam.

Solution to spam problem

  • spamsum

How does this solve?

A spampot delivery point is known (assumed) to receive 100% spam. All incoming messages to this mailbox are spamsum'd.

Then, as a separate filter, ALL incoming messages to anyone are spamsum-checked against the database of scores generated from the spampot. If a new message rates as "too similar", it is also assumed to be spam and dropped. Bye bye.
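
The two-stage flow described above can be sketched in Python. This is only an illustration under loud assumptions: difflib's SequenceMatcher stands in for spamsum's fuzzy comparison (it is not the spamsum algorithm, and it keeps whole message texts rather than compact scores), the class and method names are hypothetical, and the threshold of 80 is on difflib's scale, not comparable to spamsum's -T values.

```python
from difflib import SequenceMatcher


class SpampotFilter:
    """Toy model of the spampot idea; difflib stands in for spamsum."""

    def __init__(self, threshold=80):
        # Similarity (0-100, difflib scale) at or above which mail is dropped.
        self.threshold = threshold
        self.known_spam = []  # stand-in for the spamsum score database

    def learn(self, message):
        """Record a message that arrived at the spampot address."""
        self.known_spam.append(message)

    def score(self, message):
        """Best similarity (0-100) between message and any known spam."""
        best = 0
        for spam in self.known_spam:
            ratio = SequenceMatcher(None, spam, message).ratio()
            best = max(best, round(ratio * 100))
        return best

    def is_spam(self, message):
        """True if the message is 'too similar' to something in the spampot."""
        return self.score(message) >= self.threshold


spampot = SpampotFilter()
spampot.learn("Buy cheap pills now! Visit our store today, offer 1234")

# A near-copy with a different tracking number is caught ...
print(spampot.is_spam("Buy cheap pills now! Visit our store today, offer 9876"))
# ... while an unrelated message passes through.
print(spampot.is_spam("Hi Bob, are we still meeting for lunch on Friday?"))
```

The point of the sketch is the division of labour: only the spampot address ever feeds the database, while every other mailbox only reads from it.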

Notes

  • The spamsum score db should be rotated. New scores are appended at the end, so approx a week of scores should be kept. Note, however, that this would be a flat text file with no internal dates (would spamsum mind if dates were munged in as "not correctly formatted" spamsum scores?), so rotation would likely have to be done by number of lines. Say, 'keep 10,000 scores', rotated daily.
  • filtered messages should be kept for sanity checking
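
Rotation by number of lines, as suggested in the first note, might look something like the following Python sketch. The function name and the default of 10,000 lines are assumptions for illustration, and the sketch ignores the possibility of procmail appending a score while the rotation is running.

```python
from collections import deque
import os
import tempfile


def rotate_scores(path, keep=10000):
    """Trim the score file at `path` to its last `keep` lines.

    Returns the number of lines dropped.
    """
    total = 0
    newest = deque(maxlen=keep)  # retains only the most recent `keep` lines
    with open(path, "rb") as src:
        for line in src:
            total += 1
            newest.append(line)

    # Write to a temp file in the same directory, then swap it into place,
    # so a reader never sees a half-written score file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as dst:
        dst.writelines(newest)
    os.replace(tmp_path, path)
    return total - len(newest)
```

Run daily (from cron, say) against emailtmp/spamsum_scores, this keeps the newest scores because new entries are always appended at the end of the file.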

Pros

  • spamsum is quick, and saves messages from being filtered by heavier bayesian/etc filters
  • dynamically reacts to new spam, so long as the spampot is sufficiently knowledgeable

Cons

  • spampot address requires accepting messages we consider to be known spam
  • requires the whole message to be accepted (filtering happens post-DATA)

Results

  • SMTP Phase: post-DATA
  • CPU Use: relatively low (significantly lower than, for eg, any bayesian filter)
  • Memory Use: low (assumed)
  • False Positives: low (assuming the spampot used to collect spamsum scores isn't corrupted by ham, eg: mailing lists or email memes)
  • Maintenance: low
  • Effectiveness: good (95% capture rate observed after only 8 hours of testing)

Testing

Configuration

My .procmailrc file within my spampot address

# all mail to this user is assumed to be spam. Nothing legit comes here.
# ...thus, generate a spamsum score on EVERYTHING

SHELL=/bin/sh
DROPPRIVS=yes
VERBOSE=on
LOGFILE=emailtmp/procmail.log

# (this comes first so spampot messages aren't spamsum'd twice)
:0 Wc
| /usr/local/bin/spamsum -T25 -d emailtmp/spamsum_scores -C -
# 'a' means the previous recipe ran successfully, i.e. this message is similar
# to a previously found spam in the spamsum scores. So, we pull it now.
:0 a
emailtmp/spampotcaught.d

:0
{
       # if the message wasn't previously caught as being spam, then let's
       # mark it as potential spam now with spamsum scoring :)
       :0 c
       | /usr/local/bin/spamsum - >> emailtmp/spamsum_scores
       # and since it's a spampot, we save it separately for now (for testing)
       :0
       emailtmp/spamsumcache.d
}

# note that all deliveries could be to /dev/null as all messages are assumed to be spam. 
# safer will be to remove old messages from the caches after a short period (week?)


Some email addresses known to receive spam were directed into the spamsum spampot and self-filtered. After approx an hour, the spampot cache (emails filtered because they matched emails that had already been spamsummed and saved to the spamsum cache) was already at a 1:1 ratio of messages with the spamsum cache. After 8 hours, spamsum:spampot was 1:2.5. That is, we've caught 10 messages for every 4 we have spamsummed, though the most recent mails are running at a 1:20 ratio (95% effective). I expect the ratio to improve as a larger spamsum history is saved. I expect a useful history to be approx 100 times what I had after 8 hours, ie approx 1 month or 10,000 messages.
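
The ratios quoted above translate into capture rates as follows. This small Python check assumes a ratio of 1:N means one message spamsummed (new, matched nothing) for every N messages caught (matched an earlier score); the function name is hypothetical.

```python
def capture_rate(summed, caught):
    """Fraction of spampot mail caught by matching an earlier score.

    `summed` counts messages that matched nothing and were added to the
    score file; `caught` counts messages that matched an existing score.
    """
    total = summed + caught
    if total == 0:
        raise ValueError("no messages counted")
    return caught / total


for label, summed, caught in [
    ("after approx 1 hour", 1, 1),
    ("after 8 hours", 1, 2.5),
    ("most recent mail", 1, 20),
]:
    print(f"{label}: {capture_rate(summed, caught):.1%}")
# after approx 1 hour: 50.0%
# after 8 hours: 71.4%
# most recent mail: 95.2%
```

So the "95% effective" figure corresponds to the 1:20 ratio seen on the most recent mail, while the cumulative figure over the first 8 hours is closer to 71%.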

Links

Some good spam reading here:
