SpamHash

From ThorxWiki

Revision as of 11:57, 14 March 2009


aka SMTP, aka RFC 2822

  • Some problems associated with email...
    • spam.

Solution to spam problem

  • spamsum

How does this solve?

A spam honeypot delivery point is known (or assumed) to receive 100% spam. All incoming messages to this mailbox are then spamsummed.

Then, as a separate filter, all incoming messages to anyone are checked against the database of spamsum scores. If a new message rates as "too similar", it is also assumed to be spam, and classified as such.
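The idea underneath this, context-triggered piecewise hashing, can be illustrated with a minimal Python sketch. This is a simplification for illustration only: the real spamsum adapts its block size to the input and emits base64 signatures, and the window size, one-letter alphabet, and similarity measure below are my own assumptions, not spamsum's.

```python
import difflib

def piecewise_hash(data: bytes, block_size: int = 16) -> str:
    """Context-triggered piecewise hash: a rolling sum over a 7-byte
    window decides chunk boundaries, and each chunk is reduced to one
    letter. Local edits only disturb nearby chunks, so similar inputs
    yield similar signatures."""
    window, rolling, sig, chunk = [], 0, [], 0
    for byte in data:
        window.append(byte)
        rolling += byte
        if len(window) > 7:
            rolling -= window.pop(0)
        chunk = (chunk * 31 + byte) % 100000
        if rolling % block_size == block_size - 1:
            sig.append(chr(ord('a') + chunk % 26))  # close this chunk
            chunk = 0
    sig.append(chr(ord('a') + chunk % 26))          # final partial chunk
    return ''.join(sig)

def similarity(sig_a: str, sig_b: str) -> float:
    """Crude stand-in for spamsum's edit-distance score: 0.0 to 1.0."""
    return difflib.SequenceMatcher(None, sig_a, sig_b).ratio()

spam = (b"Dear friend, you have been specially selected to receive this "
        b"exclusive offer on discount pharmaceuticals, act now before "
        b"supplies run out and this once in a lifetime offer expires.")
variant = spam.replace(b"friend", b"madam")  # the same spam, lightly edited
ham = (b"Meeting moved to Thursday at three; please bring the quarterly "
       b"figures and a short summary of the survey results so we can "
       b"review everything together before the afternoon board call.")

print(similarity(piecewise_hash(spam), piecewise_hash(variant)))  # high
print(similarity(piecewise_hash(spam), piecewise_hash(ham)))      # low
```

Because chunk boundaries are chosen by content rather than byte position, the signatures resynchronise shortly after an edit, which is why a lightly edited copy of a spam still scores as "too similar".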

Lexicon used here

spamsum: the program
spamsum_scores: the file listing spamsum scores
spampot: emails which were caught by being too similar to a pre-scored email
spamsum cache: archive of emails that generated the spamsum_scores file
spampot cache: archive of emails caught in the spampot.

note to self: 'spampot' should arguably be reversed, since the spamsum cache is really the spampot...

Notes

  • The spamsum_scores file should be rotated. New scores are appended at the end, so approximately a month of scores should be kept. Note, however, that this is a flat text file with no internal dates (would spamsum mind if dates were munged in as incorrectly-formatted spamsum scores?), so rotation would likely have to be performed by number of lines: say, 'keep the last 10,000 scores', rotated daily.
  • filtered messages should be kept for sanity checking
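The line-count rotation suggested above is straightforward to sketch in Python. The filename and the 10,000-line figure are the ones used in these notes; a real deployment would want locking against a concurrent procmail append, which this sketch ignores.

```python
from collections import deque

def rotate_scores(path: str, keep: int = 10000) -> int:
    """Trim a spamsum_scores file to its newest `keep` lines.
    New scores are appended at the end, so the tail of the file
    is the most recent. Returns the number of lines kept."""
    with open(path) as f:
        newest = deque(f, maxlen=keep)   # keeps only the last `keep` lines
    with open(path, "w") as f:
        f.writelines(newest)
    return len(newest)

# simulate a scores file with 15,000 entries, then rotate it
with open("spamsum_scores", "w") as f:
    f.writelines(f"score-{i}\n" for i in range(15000))
print(rotate_scores("spamsum_scores"))  # 10000
```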

Pros

  • spamsum is quick, saving messages from being run through heavier bayesian/etc filters
  • dynamically reacts to new spam, so long as the spampot is sufficiently knowledgeable

Cons

  • spampot address requires accepting messages we consider to be known spam
  • is post-DATA - bandwidth for the spam is already consumed
    • however: could the spams caught then be used to generate dynamic blacklists of servers?

Scope

I expect spamsum to be equally usable at a personal and at an organisational level. However, at a personal level I expect (awaiting testing) that it can be used on ALL email, on the basis that the ONLY repeats received will be spam, thus allowing all ham through. This is not true for organisations, so a more dedicated spampot must be set up to create ONLY spam hashes. The accuracy of this spampot is likely to have a large impact on the false-positive rate seen.

Results

(Results table style taken from acme.com writeups)

SMTP Phase       post-DATA
CPU Use          low (assumed, to be tested)
Memory Use       low (assumed, to be tested)
False Positives  low (assumed, to be tested)
Maintenance      low
Effectiveness    good (90% capture rate observed on the second day, and 95% observed in the second quarter of that day); should continue to improve

Testing

Some email addresses known to receive spam were directed into the spamsum spampot and self-filtered (the procmail config below). After approximately an hour, the spampot cache (emails filtered because they matched emails which had previously been spamsummed and saved to the spamsum cache) was already at a 1:1 ratio of messages with the spamsum cache. After 8 hours, spamsum:spampot was 1:2.5; that is, we had caught 10 messages for every 4 we had spamsummed. That number dates back to the 'ignorant' (ie, empty) spamsum_scores file; the most recent mails are in fact running at a 1:20 ratio (95% effective). I expect the ratio to improve further as a larger spamsum history is saved (I aim for a history approximately 100 times what I had after 8 hours, ie approximately 1 month or 10,000 messages).

Remember, however, that this is on a mailbox which is assumed to be receiving 100% spam, so the possibility of false positives is not yet being tested; as such, this cannot be claimed to be a 'scientific' test.
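The effectiveness figures above follow directly from the cache ratio: every message that is not caught gets spamsummed, so the spamsum cache counts the misses. A quick check of the arithmetic:

```python
def capture_rate(spamsummed: int, caught: int) -> float:
    """Fraction of spam caught, given that every missed spam
    ends up spamsummed rather than in the spampot cache."""
    return caught / (spamsummed + caught)

# overall 8-hour figure: 1:2.5, ie 10 caught for every 4 spamsummed
print(round(capture_rate(4, 10), 3))   # 0.714
# most recent mails: 1:20
print(round(capture_rate(1, 20), 3))   # 0.952, the ~95% quoted above
```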

Configuration

My .procmailrc file within my spampot address

# all mail to this user is assumed to be spam. Nothing legit comes here.
# ...thus, generate a spamsum score on EVERYTHING

SHELL=/bin/sh
DROPPRIVS=yes
VERBOSE=on
LOGFILE=emailtmp/procmail.log

# (this comes first so spampot messages aren't spamsum'd twice)
:0 Wc
| /usr/local/bin/spamsum -T25 -d emailtmp/spamsum_scores -C -
# 'a' means the previous recipe ran successfully, ie this message is similar
# to a previously scored spam in spamsum_scores. So, we pull it now.
:0 a
emailtmp/spampotcaught.d

:0
{
       # if the message wasn't previously caught as being spam, then let's
       # mark it as potential spam now with spamsum scoring :)
       :0 c
       | /usr/local/bin/spamsum - >> emailtmp/spamsum_scores
       # and since it's a spampot, we save it separately for now (for testing)
       :0
       emailtmp/spamsumcache.d
}

# note that all deliveries could be to /dev/null, as all messages are assumed to be spam.
# safer would be to remove old messages from the caches after a short period (a week?)
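The "remove old messages after a short period" idea in the comment above could look like the sketch below. It assumes maildir-style caches (one file per message, as the config delivers them) and uses file mtime as the age; the directory name in the demo mirrors the config's spampotcaught.d but is created locally for illustration.

```python
import os
import time

def expire_cache(cache_dir: str, max_age_days: int = 7) -> int:
    """Delete cached messages older than max_age_days, judged by
    file modification time. Returns the number of files removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = 0
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed += 1
    return removed

# demonstrate on a throwaway directory: one old message, one fresh one
os.makedirs("spampotcaught.d", exist_ok=True)
for fname, age_days in (("old-msg", 10), ("new-msg", 0)):
    path = os.path.join("spampotcaught.d", fname)
    open(path, "w").close()
    stamp = time.time() - age_days * 86400
    os.utime(path, (stamp, stamp))          # backdate the mtime

print(expire_cache("spampotcaught.d"))      # 1 (only old-msg is removed)
```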

Thoughts

  • Our test filter only determines whether a spam is similar to a previously scored email. We don't know how similar, ie we don't know how much our effectiveness would change with a different threshold score.
    • Test this by running every spampot message over the spamsum_scores and analysing the resulting scores (they will all be greater than the threshold in the procmail config).
    • Test also the spamsum cache messages to see how many we could be saving: for each message in the cache, generate a spamsum, then grep -v that out of the spamsum_scores file (so we don't get a perfect match) and generate a spamsum similarity score. Analyse... (alt: do this test with ssdeep)
  • Greater efficiency: spamsum as a daemon
  • Different type of use: could the algorithm be altered to produce a hash which can validate partial messages? That way, if a message was 50% received but already matching a 100% score for that amount of data, we could close the connection early. (Would the development and in-use resource overhead be worth it just to move the scoring to the SMTP "DATA" phase?)
  • Is a spampot even necessary? Couldn't this simply be run on a complete email dataset? After all, it works by allowing through the first instance of every unique email anyway, and ham tends to be relatively unique, whilst spam tends to come in repetitive sets...
    • Yes... in simple testing, simply quoting an email in a response makes it quite dissimilar, and the reply to that (which will be the next message spamsum sees) will have two levels of quoting! (TODO: get numbers)
    • TODO: test simply by feeding a week's corpus of ALL my regular email through spamsum, simulating this.
      • Do this twice: once naively, once with pre-learnt spamsum_scores from the spampot.
      • Then do it another way: over a known 100% ham corpus (save a corpus of ham messages in MH or maildir format).
      • Expectation: this will be effective, except possibly for email memes: if the same funny picture is sent to you twice, even by different people, it will be base64 encoded the same and thus show up as EXTREMELY similar. (How common this is should show up in the 100% ham corpus test.)

TODO

  • test spamsum memory and CPU usage over
    • LARGE files
    • LARGE spamsum_scores dataset (say with 1000, 10,000, 100,000 results precomputed)
    • Both the above at once
    • Expectation: memory will always be low (spamsum does not have to hold the entire file in memory to generate the hash), CPU will get relatively high (yay, data processing), and throughput will be limited by disk IO. (When run through procmail, the file is piped in: from disk or from RAM?) Spamsum does take longer than md5 and similar cryptographic hashes, due to the nature of the hash generated; comparing against a list of n hashes in spamsum_scores takes O(n) time. (See performance details in Kornblum's paper linked below.)
  • Check the spampot results for the timeframe that similar emails show up. (graph subject lines against dates?)
  • investigate ssdeep as an alternative to spamsum (cos it's in debian and maintained! :D)
    • Apparently it does not support piped data as input, so it may not be viable after all
  • investigate the feasibility of datamining the spampot for servers that could be dynamically blacklisted, as well as generating dynamic header_checks and body_checks. Caution: apply such dynamism with extreme care!
  • graph the delivery rate to both caches (this should show the spamsum cache delivery rate dropping over time and levelling off, giving a good idea of an appropriate retention policy)

Links

Further spam reading here:
