SpamHash/NaiveSpamTest
(moved to here from main SpamHash page) |
Revision as of 11:59, 1 April 2009
Self-training on 100% spam feed, naive start
Some email addresses known to recieve spam were directed into the spamsum spampot and self-filtered (the procmail config below). After approx an hour, the spamcatches was already at 1:1 ratio of messages with the spamhashes. Remember, this is a self-teaching algorithm which at time zero has ZERO effectiveness. After the first day, it has rarely dropped below 70% for a given hour. After 4 days, it stabilised between 85% and 90%, and has been trending towards 92% in the second week. (After 4 days I have a history of 3000 messages, only 450 are hashed. A spam history of a month or two (perhaps 10,000 messages?) which should improve the scoring? Additionally, If ALL caught spams are also hashed and added to the database, then not-quite-close enough variants may be caught before needing to be spamsummed
Remember however that this is on a mailbox which is assumed to be recieving 100% spam - so the possibility of false positives is not being tested yet, as such this cannot be claimed to be a 'scientific' test.
Configuration
My .procmailrc file within my spampot address
# all mail to this user is assumed to be spam. Nothing legit comes here. # ...thus, generate a spamsum score on EVERYTHING # note that spamhashes.d and spamcatches.d are directories (hence .d suffix) # why? procmail will save each message as a file, allowing for easier rollover # and also testing of messages SHELL=/bin/sh DROPPRIVS=yes VERBOSE=on LOGFILE=emailtmp/procmail.log # (this comes first so spampot messages aren't spamsum'd twice :0 Wc | /usr/local/bin/spamsum -T25 -d emailtmp/spamsum_scores -C - # 'a' means the previous recipe ran successfull. ie, this message is similar # to a previously found spam in the spamsum score. So, we pull it now. :0 a emailtmp/spamcatches.d :0 { # if the message wasn't previously caught as being spam, then let's # mark it as potential spam now with spamsum scoring :) :0 c | /usr/local/bin/spamsum - >> emailtmp/spamsum_scores # and since it's a spampot, we save it seperate for now (for testing) :0 emailtmp/spamhashes.d } # note that all deliveries could be to /dev/null as all messages are assumed to be spam. # safer will be to remove old messages from the caches after a short period (week?)