SpamHash
email... aka SMTP, aka rfc2822
There are some problems associated with email...
- spam.
- storage - see Maildir
A solution to the spam problem?
- spam hash!
NADS: Nemo's Approach to Destroying Spam
That is, this is a NemProject, named in accordance with NINS.
(Currently I use 'NADS' and 'SpamHash' fairly interchangeably. If in the future I work on any other antispam measures, they would be part of the NADS family alongside SpamHash; at this time, however, SpamHash is the only member of the family.)
What does it do?
A spam honeypot delivery point is known to receive 100% spam. All incoming messages to this mailbox are then hashed with spamsum, and a database of hashes is kept.
Then, as a separate filter, all incoming messages to anyone are checked against the database of hashes. If a new message rates as "similar", it too is assumed to be spam, and is classified and sorted as such.
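A minimal sketch of how the honeypot half can be wired up with procmail and the spamsum command-line tool; the paths, the .d directories (one message per file) and the -T25 threshold are illustrative rather than prescriptive:
<pre>
# .procmailrc for the spampot address: everything arriving here is assumed to be spam.
SHELL=/bin/sh
DROPPRIVS=yes
LOGFILE=emailtmp/procmail.log

# Is this message already similar to a known hash? (W = use the exit code, c = keep a copy)
:0 Wc
| /usr/local/bin/spamsum -T25 -d emailtmp/spamsum_scores -C -

# 'a' = only if the previous recipe succeeded, i.e. the message matched: file it as a catch.
:0 a
emailtmp/spamcatches.d

# Otherwise, add its hash to the database and archive the message itself.
:0
{
    :0 c
    | /usr/local/bin/spamsum - >> emailtmp/spamsum_scores

    :0
    emailtmp/spamhashes.d
}
</pre>
For everyone else's mail, only the first two recipes (the -C check and the catch delivery) would run, so that ordinary messages are never added to the hashDB.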
Notes
- ssdeep (based upon spamsum) is a maintained utility (in Debian testing at the time of writing, for example) that can be used for some testing of effectiveness. It also appears to perform considerably better than spamsum, no doubt due to optimisations. It cannot accept data from STDIN, except in a limited way via /dev/stdin.
- The hashDB should be rotated. New scores are appended at the end, so keep approximately a month of scores. Note, however, that this is a flat text file with no internal dates (would spamsum mind if dates were munged in as "incorrectly formatted" spamsum scores?), so rotation would likely have to be done by number of lines: say, 'keep the last 10,000 scores' and rotate daily (a sketch follows this list).
- filtered messages should be kept for sanity checking
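A minimal sketch of that line-count rotation, suitable for a daily cron job; the path and the 10,000-line cutoff are just the figures suggested above:
<pre>
#!/bin/sh
# Keep only the newest 10,000 spamsum scores; anything older silently ages out.
DB=emailtmp/spamsum_scores
tail -n 10000 "$DB" > "$DB.new" && mv "$DB.new" "$DB"
</pre>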
Lexicon used here
- spamsum: the algorithm that generates spamsum hashes. Note that 'spamsum' is also the name of the first program that implemented the spamsum algorithm; I will endeavour to make it clear from context which I mean.
- ssdeep: another program that implements the spamsum algorithm.
- hashDB: the database of spamsum scores (currently implemented as a flat file: spamsum_scores).
- spamcatches: archive of emails which were caught by being too similar to a pre-scored message.
- spamhashes: archive of emails that generated the hashDB.
Scope
SMTP Phase | post-DATA |
CPU Use | medium (needs more testing) |
Memory Use | low (needs more testing) |
False Positives | low (assumed, to be tested) |
Maintenance | low |
Effectiveness | good (stabilising at 90-95% on a spam honeypot with spamsum threshold at 25. Yet to formally test on a mixed feed) |
I expect spamsum to be equally usable at a personal and an organisational level. However, at a personal level I expect (awaiting testing) that it can be used on ALL email, on the expectation that the ONLY emails likely to be similar to a previous email will be spam, thus allowing all ham through. This is not true for organisations, so a more dedicated spampot must be set up to create ONLY spam hashes. The accuracy of this spampot is likely to have a large impact on the false-positive rate seen, though in part the same 'ham isn't duplicated' rule will apply.
Pros
- spamsum is quick, saving messages from being run through heavier Bayesian and similar filters
- dynamically reacts to new spam as soon as a single message is seen
Cons
- the spampot address requires accepting messages we consider to be known spam; i.e. it is post-DATA, so bandwidth for the spam is already consumed
- however: could the spams caught then be datamined to generate dynamic blacklists of servers?
- one false positive (especially for messages which may be repeated, e.g. a forwarded meme or a mailing list post) can lead to others
Testing
Algorithm performance
In which I test the performance of hashing methods (spamsum, ssdeep, md5) on large files (large by size and/or count); a rough comparison harness is sketched after the writeup link below.
- Writeup here: SpamHash/BinaryTest
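A rough sketch of the comparison, assuming the three tools are installed and using a single large file as input (the real numbers are in the writeup):
<pre>
#!/bin/sh
# Time each hashing tool over the same large file, e.g. a big procmail.log.
FILE=procmail.log
for tool in spamsum ssdeep md5sum; do
    echo "== $tool =="
    time $tool "$FILE"
done
</pre>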
Conclusion Summary
- Spamsum is REALLY SLOW on a big file.
- TODO: test on multiple small files
Self-training on 100% spam feed, naive start
In which I feed a live spam honeypot feed into a spamsum setup.
- Writeup here: SpamHash/NaiveSpamTest
Conclusion Summary
- At threshold 25, 90-95% of incoming messages are caught (pretty good imho)
Self-training on 100% ham corpus, naive start
In which I feed a ham corpus into a spamsum setup, simulating 11 years of ham.
- Writeup here: SpamHash/NaiveHamTest
Conclusion summary
- At a threshold of 75, the false-positive rate is an unacceptable 1% (approx). Lower thresholds (25 and 50) are so bad as to not be worth further testing.
Testing of hash longevity
In which I attempt to determine how long it is worthwhile to keep the hash for a given spam. To do so I will have to create a daily (or hourly?) diff to know which hashes were NEW in that period (and save them to a file). Then I test each incoming message against each old file in turn, noting which one it matched on.
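A rough sketch of one way to do that with standard tools; the file names are illustrative, and it assumes the hashDB is still the flat spamsum_scores file:
<pre>
#!/bin/sh
# Daily: save only the hashes that have appeared since the last snapshot.
DB=emailtmp/spamsum_scores
touch hashdb.snapshot
grep -vxFf hashdb.snapshot "$DB" > "hashes.$(date +%F)"
cp "$DB" hashdb.snapshot

# For a given caught message, find the newest daily file it still matches.
for f in $(ls -t hashes.*); do
    /usr/local/bin/spamsum -T25 -d "$f" -C message.eml && echo "matched $f" && break
done
</pre>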
Prediction
I expect most caught spam will be similar to a message from that day, or the day or two previous. I expect to see a strong dropoff in similarity after about a week, and as such it is unlikely to be worthwhile keeping a hash for more than approximately two weeks.
- note that in 2005(?) I ran similar spam similarity tests based on subject line alone, and found that common subject lines rarely persisted longer than about 5 days. My prediction here is based on that information.
Hash corpus size
How large a hashDB is it worth keeping, noting that comparison performance is O(n)? (See Kornblum's paper linked below.)
Thoughts
- Our test filter only determines whether a spam is similar to a previously scored email. We don't know how similar; i.e. we don't know how much our effectiveness would change with a different threshold score.
- Test this by running every spamcaught message over the hashDB and analysing the resulting scores.
- Test also the spamsum cache messages to see how many more we could be capturing: for each message in the cache, generate a spamsum, grep -v that out of the hashDB (the spamsum_scores file) so we don't get a perfect match, and generate a spamsum similarity score, then analyse (alternatively, do this test with ssdeep). See the sketch after this list.
- Greater efficiency: spamsum as a daemon?
- Different type of use: could the algorithm be altered to produce a hash which can validate partial messages? That way, if a message was 50% received but already matching a 100% score for that amount of data, we could close the connection early. (Would the development and in-use resource overhead be worth it just to move the scoring to the SMTP "DATA" phase?)
- Should all spams caught also be spamsum'd, ensuring a wider net? There is a performance overhead from a blossoming hashDB; the benefit is that spams not quite close enough to be caught may be close enough to other caught messages.
- Because this naturally blocks self-similar spam, might this in fact work AGAINST the effectiveness of downstream Bayesian filters?
- I think (hope) not, or not too much anyway. Bayesian filters notice fine-grained similarities, whilst spamsum only notices whole-content-sized similarities.
- Can we add a header to the email with the spamsum hash? (Noting that its very existence will change the hash unless we use a no-headers option on spamsum.) Header addition could be done by formail, IF we can get the hash out of spamsum without having to rehash the message again.
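A hedged sketch of that re-scoring test. It assumes the spamsum cache is the spamhashes.d directory (one message per file) and that spamsum reports something useful when run with -d and -C; if it only sets an exit code, the loop still shows which messages would match at the given threshold. Paths are illustrative:
<pre>
#!/bin/sh
# For each archived message: remove its own hash from a working copy of the hashDB,
# then check how similar the message still is to the remaining hashes.
DB=emailtmp/spamsum_scores
for msg in emailtmp/spamhashes.d/*; do
    self=$(/usr/local/bin/spamsum - < "$msg")
    grep -vF "$self" "$DB" > /tmp/scores.$$
    printf '%s: ' "$msg"
    /usr/local/bin/spamsum -T25 -d /tmp/scores.$$ -C "$msg" || echo "no match"
done
rm -f /tmp/scores.$$
</pre>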
TODO
- test spamsum memory and CPU usage over:
- hashing performance on large files of several sizes, in a testing environment
- hashing performance on MANY small (email-sized) files
- LARGE hashDB dataset (say with 1000, 10,000, 100,000 results precomputed)
- all of the above at once, and also all of the above compared against ssdeep/md5/sha1/etc.
- Expectation: memory use? No idea (spamsum does not have to hold the entire file in memory to generate the spamsum, but if the input is STDIN it has to store it in order to seek within it!). CPU will get relatively high (yay data processing), and throughput will be limited by disk IO (when run through procmail, the file is piped in; from disk or from RAM?). Spamsum does take longer than md5 and similar cryptographic hashes, due to the nature of the hash generated. Comparing against a list of n hashes in the hashDB is O(n) time (see performance details in Kornblum's paper linked below).
- Check the spampot results for the timeframe that similar emails show up. (graph subject lines against dates?)
- compare with razor/pyzor, which use an internal hash too!
- Also compare with Distributed Checksum Clearinghouse: http://www.rhyolite.com/dcc/
- investigate feasibility of datamining the spampot for servers that could be dynamically blacklisted, as well as generating dynamic header_checks and body_checks. Caution: apply such dynamism with extreme caution!
- graph the delivery rate to both caches (this should show the spamsum cache delivery rate dropping over time and levelling off, giving a good idea of an appropriate retention policy)
- test EVERYTHING with and without the no-headers option, and with a forced blocksize (what is most common? and/or the smallest block allowed, 3)
- devise some tests to compare against Bayesian methods (resource performance? effectiveness performance?): spambayes/spamassassin/bogofilter/etc. Remember also that performance on the same file x will change according to n, the number of spamsums in the hashDB, or m, the number of tokens in a Bayes system!
- given 12 (24? 36?) months of email, simulate SpamHash from a naive start. Simulate Bayesian on the same email, given training on 100% of messages from the first month and the first week (to show how (I expect) Bayesian performance degrades over time without maintenance). Then simulate "maintenance" by requiring retraining on all messages in a day [a] every month, [b] whenever the success rate drops below a certain threshold.
- note that none of these tests use an independent spam honeypot. Could that also be tested?
- Measure (how?) CPU/memory/resource performance on all these
- simulate bayesian results when self-trained on a honeypot?
- will bayesian do well without any ham training? or when trained multiple times on repeat messages?
Future
Whilst each hash may only have a useful life of a week or two (expected, anyway), the best method would be to expire hashes via LRU. In practice, FIFO is likely to be easier.
- LRU could be handled with current software (procmail and spamsum): generate a .hash file with one hash in each, then test each incoming mail against each .hash file separately, from newest to oldest, and 'touch' the successful file. This, however, is a horribly inefficient method and has no real-world usefulness ;) (sketched below purely for illustration)
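Purely to illustrate that touch-based LRU idea (not recommended; file names are hypothetical, and GNU find is assumed for -delete):
<pre>
#!/bin/sh
# One hash per file under hashes.d/; check newest first, refresh whichever matched.
for f in $(ls -t hashes.d/*.hash); do
    if /usr/local/bin/spamsum -T25 -d "$f" -C message.eml; then
        touch "$f"      # it earned another lease on life
        break
    fi
done
# Expiry is then just "delete hash files not touched in the last two weeks".
find hashes.d -name '*.hash' -mtime +14 -delete
</pre>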
Links
Further spam reading here:
- A plan for Spam - Paul Graham
- http://www.acme.com/mail_filtering/ - This guy is the God of spam filterers
- http://code.google.com/p/pyspamsum/ - This is the version of spamsum I use (though I don't use the python wrapper)
- http://www.samba.org/ftp/unpacked/junkcode/spamsum/ - Tridge's original spamsum
- http://www.dfrws.org/2006/proceedings/12-Kornblum.pdf - a paper on identifying similar files. Develops ssdeep from spamsum
- http://ssdeep.sourceforge.net/ - unlike spamsum which is effectively abandoned, this is in active development (and even in Debian!) :)
- http://www.forensicswiki.org/wiki/Ssdeep - Discussion of ssdeep as a forensics tool
- pyzor - http://pyzor.sourceforge.net/