A Solution to spam problem?

spam hash!

NADS: Nemo's Approach to Destroying Spam

That is, this is a NemProject, named in accordance with NINS.

(currently I use 'NADS' and 'SpamHash' fairly interchangably. If in the future I work on any other antispam measures, then they would be part of the NADS family alongside SpamHash. At this time however, SpamHash is the only member in the family.

What does it do?

A spam honeypot delivery point is known to recieve 100% spam. All incoming messages to this mailbox are then hashed with spamsum, and a database of hashes kept.

Then as a seperate filter, all incoming messages to anyone are checked against the database of hashes. If the new message rates as "similar", then it is also assumed to be spam, classified and sorted as such.

Notes

ssdeep (based upon spamsum) is a maintained utility (in debian testing at time of writing for eg) that can be used for some testing of effectiveness. It also appears to perform considerably better than spamsum, no doubt due to optimisations. It cannot accept data from STDIN except limitedly via /dev/stdin
The hashDB should be rotated. New scores are appended at the end, so approx a month of scores to be kept. (note however that this is a flat text file with no internal dates (would spamsum mind if dates are munged in as "not formatted correctly spamsum scores"?), so rotation would likely have to be performed in a "number of lines" manner. Say, 'keep the last 10,000 scores' and rotate daily.
filtered messages should be kept for sanity checking

Lexicon used here

spamsum: the algorithm that generates spamsum hashes.

Note that 'spamsum' is also the name of the first program that implemented the spamsum algorithm. I will endeavour to make it clear which I mean by context.
ssdeep - Another program to implement spamsum.

hashDB: the database of spamsum scores. (currently implemented as a flat file: spamsum_scores)
spamcatches: archive of emails which were caught by being too similar to a pre-scored message
spamhashes: archive of emails that generated the hashDB

Scope

SMTP Phase	post-DATA
CPU Use	medium (needs more testing)
Memory Use	low (needs more testing)
False Positives	low (assumed, to be tested)
Maintenance	low
Effectiveness	good (stabilising at 90-95% on a spam honeypot with spamsum threshold at 25. Yet to formally test on a mixed feed)

I expect spamsum to be equally usable at a personal and an organisation level. However, at a personal level I expect (awaiting testing) that it can be used on ALL email, on the expectation that the ONLY emails likely to match similarity to a previous email will be spam, thus allowing all ham through. This is not true for organisations, so a more dedicated spampot must be setup to create ONLY spam hashes. The accuracy of this spampot is likely to have a large impact on the false positives rate seen, though in part the same 'ham isn't duplicated' rule will apply.

Pros

spamsum is quick, saves message being filtered by heavier bayesian/etc filters
dynamically reacts to new spam as soon as a single message is seen.

Cons

spampot address requires accepting messages we consider to be known spam. ie, is post-DATA - bandwidth for the spam is already consumed
- however: could the spams caught then be datamined to generate dynamic blacklists of servers?
one false positive (esp messages which may be repeated - for eg: a forwarded meme or mailing list) can lead to others

Testing

Algorithm performance

In which I test the performance of hash methods: (spamsum, ssdeep, md5) on large files (large by size and/or count)

Writeup here: SpamHash/BinaryTest

Conclusion Summary

Spamsum is REALLY SLOW on a big file.
TODO: test on multiple small files

Self-training on 100% spam feed, naive start

In which I feed a spam honeypot feed into a spamsum setup. Live.

Writeup here: SpamHash/NaiveSpamTest

Conclusion Summary

At threshold 25, 90-95% of incoming messages are caught (pretty good imho)

Self-training on 100% ham corpus, naive start

In which I feed a ham corpus into a spamsum setup. Simulated over 15 years of ham

Writeup here: SpamHash/NaiveHamTest

Conclusion summary

At threshold of 75, false positive rate is an unnacceptible 1% (approx). Lower thresholds (25 and 50) are so bad as to be not worth testing

Thoughts

Our test filter only determines is a spam is similar to a previously scored email. We don't know how similar. ie, we don't know how much our effectiveness would change with a different score.
- Test this by running every spamcaught message over the hashDB and analysing scores resulting
- Test also the spamsum cache messages to see how many more we could be capturing? (for each message in the cache, generate a spamsum, then grep -v that out from the hashDB (spamsum_scores file) so we don't get a perfect match, and generate a spamsum similarity score. Analyse... (alt: do this test with ssdeep)
Greater efficiency: spamsum as a daemon?
Different type of use: could the algorithm be altered to produce a hash which can validate partial messages. That way if a message was 50% recieved, but already was matching 100% score for that amount of data, we could close the connection early? (would development and in-use resource overhead be worth it just to move the scoring to the SMTP "DATA" phase?
should all spams caught be spamsum'd also - ensuring a wider net? performance overhead with a blossoming hashDB? Benefit = not-quoite-close-enough to be caught spams may be close enough to other messages caught.
Because this naturally blocks self-similar spam, might this in fact work AGAINST the effectiveness of downstream bayesian filters?
- I think (hope) not. Or not too much anyway. Bayesian notices fine-grained similarities, whilst spamsum only notices entire-content sized similarities.
can web add a header to the email with the spamsum hash? (noting that it's very existance will change the hash unless we use a no-headers option on spamsum). Header addition done by formail, IF we can get it out of spamsum without having to rehash it again?!)

TODO

test spamsum memory and CPU usage over
- hashing perforance on large files of several sizes in a testing environment,
- hashing performance on MANY small (email sized) files.
- LARGE hashDB dataset (say with 1000, 10,000, 100,000 results precomputed)
- all the above at once, and also all the aboves with comparisons against ssdeep/md5/sha1/etc
- Expectation: memory use? no idea (spamsum does not have to hold the entire file in memory to generate the spamsum, but if the input is STDIN, it has to store it to seek within it!), CPU will get relatively high (yay data processing), and throughput will be limited by disk IO. (when run through procmail, the file is piped in - from disk or from ram?). Spamsum does take longer than md5 and similar cryptographic hashes, due to the nature of the hash generated. Comparing against a list of n hashes in hashDB, it is O(n) time. (see performance details in Kornblum's paper linked below)
Check the spampot results for the timeframe that similar emails show up. (graph subject lines against dates?)
investigate feasibility of datamining the spampot for servers that could be dynamically blacklisted, as well as generating dynamic header_checks and body_checks. Caution: apply such dynamism with extreme caution!
graph the delivery rate to both caches (this should show spamsum cache delivery rate drop over time and level off - giving a good idea of appropriate retention policy
test EVERYTHING with and without the no-headers option, and with forced blocksize (what is most common? and/or smallest block allowed (3).
devise some tests to compare against bayesian methods? (resource performance? effectiveness performance?). spambayes/spamassassin/bogofilter/etc... remember also that performance on the same file x will change according to n the number of spamsums in the hashDB, or m the number of tokens in a bayes system!
- given 12 (24? 36?) months email, simulate SpamHash from naive start. Simulate bayesian on same email given training on 100% of message from the first month and the first week (to show how (I expect) bayesian performance degrades over time without maintenance. Then simulate "maintenance" by requiring retraining on all messages in a day every [a] month, [b] whenever success rate drops below a certain threshold.
  - note that none of these tests are using an independant spam honeypot. Could that also be tested?
  - Measure (how?) CPU/memory/resource performance on all these
- simulate bayesian results when self-trained on a honeypot?
  - will bayesian do well without any ham training? or when trained multiple times on repeat messages?
measuring how long a particular spam message is sent for? ie, for what timeframe is a SpamHash usefully valid for?
- Generate a spamhash on each message in a corpus (one email per file, and maybe ignore headers?) and store the message date and spamhash into a database. Analyse.
  - Expectation from prior testing: a spam is repeated for a week or so and then dropped. Holding a spamhash for a month is more than sufficient. (unfortunately, flat text file handling means FIFO method of dropping old hashes. A more formal database would be able to use a LRU filter.

Links

Further spam reading here:

A plan for Spam - Paul Graham
http://www.acme.com/mail_filtering/ - This guy is the God of spam filterers
http://code.google.com/p/pyspamsum/ - This is the version of spamsum I use (though I don't use the python wrapper)
http://www.samba.org/ftp/unpacked/junkcode/spamsum/ - Tridge's original spamsum
http://www.dfrws.org/2006/proceedings/12-Kornblum.pdf - a paper on identifying similar files. Develops ssdeep from spamsum
- http://ssdeep.sourceforge.net/ - unlike spamsum which is effectively abandoned, this is in active development (and even in Debian!) :)
- http://www.forensicswiki.org/wiki/Ssdeep - Discussion of ssdeep as a forensics tool

SpamHash

Contents

A Solution to spam problem?

NADS: Nemo's Approach to Destroying Spam

What does it do?

Notes

Lexicon used here

Scope

Pros

Cons

Testing

Algorithm performance

Conclusion Summary

Self-training on 100% spam feed, naive start

Conclusion Summary

Self-training on 100% ham corpus, naive start

Conclusion summary

Thoughts

TODO

Links

Personal tools

Namespaces

Variants

Views

Actions

Search

Navigation

meta navigation

More thorx

Tools