SpamHash/NaiveSpamTest

Latest revision as of 13:50, 1 April 2009

Self-training on 100% spam feed, naive start

Some email addresses known to receive spam were directed into the spamsum spampot and self-filtered (see the procmail config below). After approximately an hour, spamcatches was already at a 1:1 ratio of messages with spamhashes. Remember, this is a self-teaching algorithm which at time zero has ZERO effectiveness. After the first day, the hourly catch rate rarely dropped below 70%. After 4 days, it stabilised between 85% and 90%, then trended towards 92% in the second week and towards 94% in the third week.
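
The self-teaching behaviour described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `difflib.SequenceMatcher` stands in for spamsum's fuzzy-hash comparison, the 0.8 cut-off is an arbitrary value for that stand-in (it is not the -T25 used in the procmail config), and the names `similar`, `run_spampot` and the sample feed are hypothetical.

```python
from difflib import SequenceMatcher

# Illustrative cut-off for the stand-in measure; NOT spamsum's -T25 threshold.
THRESHOLD = 0.8


def similar(a, b):
    """Stand-in similarity check; spamsum compares fuzzy hashes instead."""
    return SequenceMatcher(None, a, b).ratio() >= THRESHOLD


def run_spampot(messages):
    """Feed messages through the self-training loop; return (catches, hashes)."""
    known = []    # plays the role of emailtmp/spamsum_scores
    catches = 0   # messages that resembled something already recorded
    hashes = 0    # messages that were new and therefore got recorded
    for msg in messages:
        if any(similar(msg, seen) for seen in known):
            catches += 1
        else:
            known.append(msg)
            hashes += 1
    return catches, hashes


# Hypothetical sample feed: two spam campaigns with small variations.
feed = [
    "Cheap watches, buy now at example.test/a1",
    "Cheap watches, buy now at example.test/b2",
    "You have won the lottery, reply with your bank details",
    "Cheap watches, buy now at example.test/c3",
    "You have won the lottery, reply with your bank detail",
]
catches, hashes = run_spampot(feed)
print(catches, hashes)
```

The first message of each kind can never be caught, because nothing similar has been recorded yet. That is why effectiveness starts at zero and climbs as the history grows.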

After 4 days I had a history of 3000 messages, of which only 450 were hashed. A spam history of a month or two (perhaps 10,000 messages?) should improve the scoring, though there is a time limit to the usefulness of a hash (what is that time limit?). Additionally, if ALL caught spams are also hashed and added to the database, then some not-quite-close-enough variants may be caught before needing to be spamsummed.
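
As a rough cross-check of those figures, and assuming that the 3000 messages include the 450 hashed ones and that every message ends up either caught or hashed (as the procmail recipes below suggest), the overall catch rate can be worked out as follows. The function name `catch_rate` is just for illustration.

```python
def catch_rate(total_messages, hashed_messages):
    """Percentage of messages caught, assuming each one was either caught or hashed."""
    if total_messages <= 0:
        raise ValueError("total_messages must be positive")
    caught = total_messages - hashed_messages
    return 100.0 * caught / total_messages


print(catch_rate(3000, 450))  # 85.0
```

Under those assumptions the cumulative figure is in line with the 85% to 90% hourly range reported for the fourth day.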

Remember, however, that this is a mailbox which is assumed to be receiving 100% spam, so the possibility of false positives is not being tested yet. As such, this cannot be claimed to be a 'scientific' test.

[Figure: hourly results of catches/hashes over the first fortnight]

Configuration

The .procmailrc file for my spampot address:

# all mail to this user is assumed to be spam. Nothing legit comes here.
# ...thus, generate a spamsum score on EVERYTHING

# note that spamhashes.d and spamcatches.d are directories (hence .d suffix)
# why? procmail will save each message as a file, allowing for easier rollover
# and also testing of messages

SHELL=/bin/sh
DROPPRIVS=yes
VERBOSE=on
LOGFILE=emailtmp/procmail.log

# (this comes first so spampot messages aren't spamsum'd twice)
:0 Wc
| /usr/local/bin/spamsum -T25 -d emailtmp/spamsum_scores -C -
# 'a' means the previous recipe ran successfully, i.e. this message is similar
# to a previously found spam in the spamsum scores. So, we pull it now.
:0 a
emailtmp/spamcatches.d

:0
{
       # if the message wasn't previously caught as being spam, then let's
       # mark it as potential spam now with spamsum scoring :)
       :0 c
       | /usr/local/bin/spamsum - >> emailtmp/spamsum_scores
       # and since it's a spampot, we save it separately for now (for testing)
       :0
       emailtmp/spamhashes.d
}

# note that all deliveries could be to /dev/null as all messages are assumed to be spam.
# it would be safer to remove old messages from the caches after a short period (a week?)
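
The idea of clearing old messages out of the caches after about a week could be handled by a small script run from cron. The following is a minimal sketch, not part of the original setup: the function name `prune_old_messages`, the 7-day default and the use of file modification times are all assumptions.

```python
import time
from pathlib import Path


def prune_old_messages(directory, max_age_days=7.0, now=None):
    """Delete regular files in `directory` older than `max_age_days`.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # age limit expressed in seconds
    removed = []
    for path in Path(directory).iterdir():
        # Age is judged by modification time (an assumption of this sketch).
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)


if __name__ == "__main__":
    # Directory names taken from the procmail config above.
    for cache in ("emailtmp/spamcatches.d", "emailtmp/spamhashes.d"):
        if Path(cache).is_dir():
            print(cache, prune_old_messages(cache))
```

This only prunes the saved message files. It does not touch emailtmp/spamsum_scores, so expiring old hashes would need extra bookkeeping, such as recording when each hash was added.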
