SpamHash/NaiveSpamTest
Latest revision as of 13:50, 1 April 2009
Self-training on 100% spam feed, naive start
Some email addresses known to receive spam were directed into the spamsum spampot and self-filtered (using the procmail config below). After approximately an hour, the number of messages in spamcatches had already reached a 1:1 ratio with those in spamhashes. Remember, this is a self-teaching algorithm which at time zero has ZERO effectiveness. After the first day, the catch rate rarely dropped below 70% for a given hour. After 4 days, it stabilised between 85% and 90%, then trended towards 92% in the second week and towards 94% in the third week.
After 4 days I had a history of 3000 messages, of which only 450 were hashed. A spam history of a month or two (perhaps 10,000 messages?) should improve the scoring, though there is a time limit to the usefulness of a hash (what is that time limit?). Additionally, if ALL caught spams are also hashed and added to the database, then some not-quite-close-enough variants may be caught before needing to be spamsummed.
Remember, however, that this is on a mailbox which is assumed to be receiving 100% spam, so the possibility of false positives is not being tested yet. As such, this cannot be claimed to be a 'scientific' test.
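The self-training loop described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `difflib.SequenceMatcher` stands in for spamsum's fuzzy hash comparison, the `0.75` cutoff is an arbitrary placeholder that is not equivalent to spamsum's `-T25`, and the names `SelfTrainingFilter`, `similarity` and `THRESHOLD` are hypothetical. It does show why the catch rate starts at zero and can only rise as history accumulates.

```python
from difflib import SequenceMatcher

# Assumed cutoff for "similar enough"; NOT equivalent to spamsum's -T25.
THRESHOLD = 0.75


def similarity(a: str, b: str) -> float:
    """Stand-in for a spamsum comparison: returns a ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


class SelfTrainingFilter:
    """Naive-start filter: knows nothing at time zero, learns from misses."""

    def __init__(self, threshold: float = THRESHOLD) -> None:
        self.threshold = threshold
        self.known = []   # plays the role of the spamsum score database
        self.caught = 0   # messages that matched an earlier one (spamcatches)
        self.hashed = 0   # messages recorded as new spam (spamhashes)

    def process(self, message: str) -> bool:
        """Return True if the message matched earlier spam, else learn it."""
        for seen in self.known:
            if similarity(message, seen) >= self.threshold:
                self.caught += 1
                return True
        # No match: remember this message so later variants are caught.
        self.known.append(message)
        self.hashed += 1
        return False

    def catch_rate(self) -> float:
        """Fraction of all processed messages that were caught."""
        total = self.caught + self.hashed
        return self.caught / total if total else 0.0
```

Feeding messages through `process()` one at a time and reading `catch_rate()` gives the kind of percentage quoted above. For comparison, 450 hashed messages out of a history of 3000 would leave 2550 caught, or 85%, if every message ends up either caught or hashed.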
Configuration
My .procmailrc file within my spampot address:
# all mail to this user is assumed to be spam. Nothing legit comes here.
# ...thus, generate a spamsum score on EVERYTHING
# note that spamhashes.d and spamcatches.d are directories (hence .d suffix)
# why? procmail will save each message as a file, allowing for easier rollover
# and also testing of messages

SHELL=/bin/sh
DROPPRIVS=yes
VERBOSE=on
LOGFILE=emailtmp/procmail.log

# (this comes first so spampot messages aren't spamsum'd twice)
:0 Wc
| /usr/local/bin/spamsum -T25 -d emailtmp/spamsum_scores -C -

# 'a' means the previous recipe ran successfully, i.e. this message is similar
# to a previously found spam in the spamsum scores. So, we pull it now.
:0 a
emailtmp/spamcatches.d

:0
{
    # if the message wasn't previously caught as being spam, then let's
    # mark it as potential spam now with spamsum scoring :)
    :0 c
    | /usr/local/bin/spamsum - >> emailtmp/spamsum_scores

    # and since it's a spampot, we save it separate for now (for testing)
    :0
    emailtmp/spamhashes.d
}

# note that all deliveries could be to /dev/null as all messages are assumed to be spam.
# safer will be to remove old messages from the caches after a short period (week?)
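The closing comment in the config suggests removing old messages from the caches after a short period. A minimal sketch of such a cleanup follows, assuming a one-week cutoff (the config only says "week?") and using file modification time as the message age. The function name `prune_old_messages` and the constant `MAX_AGE_SECONDS` are hypothetical. It only deletes saved message files; it does not expire entries in the spamsum score file.

```python
import time
from pathlib import Path

# Assumed retention period: one week, per the "(week?)" guess in the config.
MAX_AGE_SECONDS = 7 * 24 * 60 * 60


def prune_old_messages(directory, max_age=MAX_AGE_SECONDS, now=None):
    """Delete regular files in `directory` older than `max_age` seconds.

    Age is judged by modification time. Returns the sorted names of the
    files that were removed. Subdirectories are left untouched.
    """
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).iterdir():
        if path.is_file() and now - path.stat().st_mtime > max_age:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

It could be run periodically (for example from cron) against `emailtmp/spamcatches.d` and `emailtmp/spamhashes.d`, with the paths adjusted to wherever procmail resolves them on the system in question.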