SpamHash

From ThorxWiki

Revision as of 09:25, 13 March 2009


aka SMTP, aka RFC 2822

  • Some problems associated with email...
    • spam.

Solution to spam problem

  • spamsum

How does this solve it?

A spampot delivery point is known (assumed) to receive 100% spam. All incoming messages to this mailbox are spamsum'd.

Then, as a separate filter, ALL incoming messages to anyone are checked against the database of spamsum scores generated from those spampot messages. If a new message rates as "too similar", it is also assumed to be spam, and dropped. Bye bye.
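To make the matching step concrete, here is a toy Python sketch of a piecewise hash — an illustration only, not spamsum's actual algorithm (spamsum uses a rolling hash and adaptive block sizes, and its output format differs); `fuzzy_sig` and `similarity` are hypothetical names:

```python
import difflib

CHARS = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ+/"

def fuzzy_sig(data: bytes) -> str:
    """Toy piecewise hash: reduce each chunk to one character, with chunk
    boundaries chosen by the content itself (here simply whitespace;
    spamsum's rolling hash picks boundaries that survive edits better)."""
    sig, h = [], 0
    for b in data:
        if b in (0x20, 0x0A):              # content-defined chunk boundary
            sig.append(CHARS[h % 64])
            h = 0
        else:
            h = (h * 31 + b) & 0xFFFFFFFF  # hash of the current chunk
    sig.append(CHARS[h % 64])
    return "".join(sig)

def similarity(a: bytes, b: bytes) -> int:
    """0-100 score: how alike are the two signatures?"""
    sm = difflib.SequenceMatcher(None, fuzzy_sig(a), fuzzy_sig(b))
    return int(100 * sm.ratio())

spam    = b"Buy cheap meds now, limited offer, click here fast"
variant = b"Buy CHEAP meds now, limited offer, click here now"
ham     = b"Minutes of the March meeting are attached for review"

assert similarity(spam, spam) == 100
assert similarity(spam, variant) > similarity(spam, ham)   # mutated spam still matches
```

Because a tweaked word only changes a couple of signature characters, most of the signature survives mutation — which is what lets one spampot hash catch a whole family of reworded spams.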

Notes

  • The spamsum score db should be rotated, with new scores appended at the end and approx a week of scores kept. (Note however that this would be a flat text file with no internal dates (would spamsum mind if dates were munged in as badly-formatted spamsum score lines?), so rotation would likely have to be done by number of lines: say, 'keep 10,000 scores', rotated daily.)
  • filtered messages should be kept for sanity checking
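The "number of lines" rotation could be sketched like this (`rotate_scores` is a hypothetical helper, not part of spamsum; it assumes a flat one-score-per-line file):

```python
import tempfile
from pathlib import Path

def rotate_scores(path, keep=10_000):
    """Trim a flat spamsum score file to its newest `keep` lines.
    New scores are appended at the end, so the tail is the freshest."""
    p = Path(path)
    lines = p.read_text().splitlines(keepends=True)
    if len(lines) > keep:
        p.write_text("".join(lines[-keep:]))

# demo against a throwaway file with 25 fake score lines
scores = Path(tempfile.mkdtemp()) / "spamsum_scores"
scores.write_text("".join(f"fakescore-{i}\n" for i in range(25)))
rotate_scores(scores, keep=10)
assert scores.read_text().splitlines() == [f"fakescore-{i}" for i in range(15, 25)]
```

Run daily from cron against the real scores file with the default keep=10,000.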

Pros

  • spamsum is quick, and saves messages from being run through heavier Bayesian/etc filters
  • dynamically reacts to new spam - so long as the spampot is sufficiently knowledgeable

Cons

  • spampot address requires accepting messages we consider to be known spam
  • requires totality of message to be accepted

Results

SMTP Phase: post-DATA
CPU Use: relatively low (significantly lower than, for eg, any Bayesian filter)
Memory Use: low (assumed)
False Positives: low (assuming the spampot used to collect spamsum scores isn't corrupted by ham (eg: mailing lists, or email memes))
Maintenance: low
Effectiveness: good (95% capture rate observed after only 8 hours of testing)

Testing

Some email addresses known to receive spam were directed into the spamsum spampot and self-filtered (the procmail config below). After approx an hour, the spampot cache (emails filtered because they matched emails which were previously spamsummed and saved to the spamsum cache) was already at a 1:1 ratio of messages with the spamsum cache. After 8 hours, spamsum:spampot was 1:2.5; that is, we caught 10 messages for every 4 we spamsummed. That number dates back to the 'ignorant' (ie, empty) spamsum_scores file, and the most recent mails are in fact running at a 1:20 ratio (95% effective). I expect the ratio to improve further as a larger spamsum history is saved (I aim for a spamsum history approx 100 times what I had after 8 hours, ie approx 1 month or 10,000 messages).
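A quick sanity check of the arithmetic behind those ratios (`capture_rate` is a hypothetical helper; a spamsum:spampot ratio of 1:N means N messages caught per fresh hash):

```python
def capture_rate(new_hashes: int, caught: int) -> float:
    """Fraction of spampot mail matched by an existing score,
    ie caught without needing a fresh spamsum entry."""
    return caught / (new_hashes + caught)

# spamsum:spampot at 1:2.5 over the whole run (4 new hashes per 10 caught)
assert round(capture_rate(4, 10), 2) == 0.71
# recent mail at 1:20 -- the ~95% effectiveness quoted above
assert round(capture_rate(1, 20), 2) == 0.95
```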

Configuration

My .procmailrc file for my spampot address:

# all mail to this user is assumed to be spam. Nothing legit comes here.
# ...thus, generate a spamsum score on EVERYTHING

SHELL=/bin/sh
DROPPRIVS=yes
VERBOSE=on
LOGFILE=emailtmp/procmail.log

# (this comes first so spampot messages aren't spamsum'd twice)
:0 Wc
| /usr/local/bin/spamsum -T25 -d emailtmp/spamsum_scores -C -
# 'a' means the previous recipe ran successfully. ie, this message is similar
# to a previously found spam in the spamsum score. So, we pull it now. 
:0 a
emailtmp/spampotcaught.d

:0
{
       # if the message wasn't previously caught as being spam, then let's
       # mark it as potential spam now with spamsum scoring :)
       :0 c
       | /usr/local/bin/spamsum - >> emailtmp/spamsum_scores
       # and since it's a spampot, we save it separately for now (for testing)
       :0
       emailtmp/spamsumcache.d
}

# note that all deliveries could be to /dev/null as all messages are assumed to be spam. 
# safer will be to remove old messages from the caches after a short period (week?)

Thoughts / TODO

  • Our test filter only determines whether a spam is similar to a previously scored email. We don't know how similar, ie we don't know how much our effectiveness would change with a different score threshold.
    • Test this by running every spampot message over the spamsum_scores and analysing the resulting scores (they will all be greater than the threshold in the procmail config).
    • Test also the spamsum cache messages to see how many we could be saving (for each message in the cache, generate a spamsum, then grep -v that out of the spamsum_scores file (so we don't get a perfect match) and generate a spamsum similarity score. Analyse...)
  • Check the spampot results for the timeframe over which similar emails show up (graph subject lines against dates?)
  • Greater efficiency: run spamsum as a daemon.
  • Different type of use: could the algorithm be altered to produce a hash which can validate partial messages? That way, if a message was only 50% received but already matched with a 100% score for that amount of data, we could close the connection early. (Would the development and in-use resource overhead be worth it just to move the scoring to the SMTP "DATA" phase?)
  • Test spamsum memory and CPU usage over a LARGE dataset (many large files, basically).
    • Expectation: memory will always be low (spamsum does not even have to hold the entire file in memory to generate the spamsum), CPU will get relatively high, and throughput will be limited by disk IO.
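The leave-one-out test sketched above could be prototyped roughly like this, with difflib standing in for a real `spamsum -C` comparison (the point here is the grep -v / rescore loop, not the hash itself; `score` and `leave_one_out` are hypothetical names):

```python
import difflib

def score(a: str, b: str) -> int:
    # stand-in for comparing one message against a scores-file entry;
    # any 0-100 similarity measure slots in here
    return int(100 * difflib.SequenceMatcher(None, a, b).ratio())

def leave_one_out(cache):
    """For each cached message, drop its own entry (the `grep -v` step)
    and report its best score against everything that remains."""
    return [
        max((score(msg, other) for j, other in enumerate(cache) if i != j),
            default=0)
        for i, msg in enumerate(cache)
    ]

cache = [
    "Buy cheap meds now, limited offer, click here fast",
    "Buy CHEAP meds now, limited offer, click here now",
    "Minutes of the March meeting are attached for review",
]
best = leave_one_out(cache)
# the two near-duplicate spams vouch for each other; the odd one out scores
# low, so only it would have needed a fresh spamsum entry
assert best[0] > best[2] and best[1] > best[2]
```

Messages whose best leave-one-out score clears the filter threshold are ones the spamsum cache could have caught without a fresh entry.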

Links

Further spam reading here:

  • A plan for Spam - Paul Graham: http://www.paulgraham.com/spam.html
  • http://www.acme.com/mail_filtering/ - This guy is the God of spam filterers
