SpamHash/NaiveHamTest
(why ignoring headers might be good) |
(rework the page, starting stats again and doing better :)) |
||
Line 1: | Line 1: | ||
− | Testing SpamHash on a ham stream, naive start. |
+ | = Testing SpamHash on a ham stream, naive start. = |
− | Why? |
+ | '''Why?''' |
− | Because ham is hoped to be, by its very nature, unique, and thus immune to the similarity testing that spamsum performs. If this is so, then spamhash could be run natively on an incoming mail stream, and not require a spampot. |
+ | I expect ham to be, by its very nature, unique, and thus immune to the similarity testing that spamsum performs. If this can be shown, then spamhash could be run natively on an incoming mail stream without requiring a spampot. |
− | So, an approx 1 GB corpus of ham (~120000 messages), spanning some 15 years, was run through spamsum chronologically with a threshold of 25. |
+ | = Ham Corpus = |
+ | The ham corpus is my personal email archive. Most mail folders were added (the only notable exception being approx 50000 messages received via an nntp->smtp gateway, as these cannot be considered representative of genuine email). All other mail (personal, email lists, etc.) was added, then all sent mail was filtered out (via mutt's $alternates). Some obvious spam which had crept in (mainly via one buggy list) was also removed, but this filtering was not comprehensive. The resulting archive was then cropped to the 11-year window of 1998 to 2008 inclusive. These filters combined to reduce my personal mail archive from approx 120000 messages to 61277. Finally, the mail was saved to [[Maildir]] format with modified filenames. |
||
− | Here is what we find |
+ | = Test procedure = |
− | <pre> |
+ | The ham corpus was run through a script to simulate chronological delivery and filtering via procmail. |
− | By message count: |
||
− | hashes: 58192 |
||
− | catches: 61745 |
||
− | by size: |
+ | * This was run several times: |
− | 726M hashes |
+ | ** each subsequent test twice: once with and once without spamsum's "-H" option (ignore email headers) |
− | 366M catches |
+ | ** threshold scores of 25, 50 and 75 (and possibly others, dependent on the results seen here and in equivalent spam corpus testing) |
− | Time to process: |
+ | = Results = |
− | start time: Mon Mar 30 21:15:44 EST 2009 |
||
− | stop time: Tue Mar 31 20:54:05 EST 2009 |
||
− | real 1418m20.848s |
+ | = Analysis = |
− | user 1008m43.698s |
||
− | sys 45m46.752s |
||
− | </pre> |
||
− | |||
− | As we can see, more than half the messages were caught as similar (but only a third by size, indicating that larger messages were more likely to be unique). Indeed, a review of the first few catches showed that they are small messages which have similar headers (esp. From:, To:, Subject:, Received:, Return-Path:). With spamsum's ignore-headers option, I hope that most of these similar messages become dissimilar. |
||
− | |||
− | Additionally, ssdeep was then run to compare the hashes and catches directories, to attempt to determine what measure of similarity the bell curve was centered around (noting that this comparison is not chronologically accurate like the original test). The resulting bell curve was centered on 35. |
||
− | |||
− | == TODO == |
||
− | * Process again but ignore headers, then both styles with thresholds of 50 and 75 |
||
− | * graph results |
Revision as of 02:37, 3 April 2009
Testing SpamHash on a ham stream, naive start.
Why?
I expect ham to be, by its very nature, unique, and thus immune to the similarity testing that spamsum performs. If this can be shown, then spamhash could be run natively on an incoming mail stream without requiring a spampot.
Ham Corpus
The ham corpus is my personal email archive. Most mail folders were added (the only notable exception being approx 50000 messages received via an nntp->smtp gateway, as these cannot be considered representative of genuine email). All other mail (personal, email lists, etc.) was added, then all sent mail was filtered out (via mutt's $alternates). Some obvious spam which had crept in (mainly via one buggy list) was also removed, but this filtering was not comprehensive. The resulting archive was then cropped to the 11-year window of 1998 to 2008 inclusive. These filters combined to reduce my personal mail archive from approx 120000 messages to 61277. Finally, the mail was saved to Maildir format with modified filenames.
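For illustration only, the two mechanical filters described above (dropping sent mail and cropping to 1998 through 2008) could be sketched roughly as follows. This is not the procedure actually used, which relied on mutt. The name `keep_message` and the address in `MY_ADDRESSES` are hypothetical, and treating "sent mail" as "From: address is one of mine" is an assumption.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical stand-in for the addresses listed in mutt's $alternates.
MY_ADDRESSES = {"me@example.org"}

def keep_message(raw, first_year=1998, last_year=2008):
    """Return True if a raw message should stay in the ham corpus."""
    msg = message_from_string(raw)
    # Drop sent mail: anything whose From: address is one of my own.
    _, sender = parseaddr(msg.get("From", ""))
    if sender.lower() in MY_ADDRESSES:
        return False
    # Drop mail whose Date: header is missing or cannot be parsed.
    try:
        sent = parsedate_to_datetime(msg.get("Date", ""))
    except (TypeError, ValueError):
        return False
    # Keep only the window of first_year to last_year inclusive.
    return first_year <= sent.year <= last_year
```

The nntp->smtp gateway folder and the partial spam cleanup are not covered here, since those were manual selections rather than rules.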
Test procedure
The ham corpus was run through a script to simulate chronological delivery and filtering via procmail.
- This was run several times:
  - each subsequent test twice: once with and once without spamsum's "-H" option (ignore email headers)
  - threshold scores of 25, 50 and 75 (and possibly others, dependent on the results seen here and in equivalent spam corpus testing)
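The shape of the test matrix above can be sketched as below. This is a rough outline, not the actual procmail script: `similarity` uses Python's `difflib` as a stand-in for spamsum scoring (it is NOT the spamsum algorithm), `strip_headers` only approximates what "-H" does, and counting a score equal to the threshold as a catch is an assumption. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product

def strip_headers(raw):
    """Return only the body: everything after the first blank line.

    Assumes LF line endings; a message with no blank line yields "".
    """
    _, _, body = raw.partition("\n\n")
    return body

def similarity(a, b):
    """Stand-in 0-100 similarity score; NOT the spamsum algorithm."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def simulate(messages, threshold, ignore_headers):
    """Deliver messages in the given order; return (hashes, catches) counts."""
    seen = []      # messages that were not similar to anything earlier
    catches = 0    # messages scoring >= threshold against an earlier one
    for raw in messages:
        text = strip_headers(raw) if ignore_headers else raw
        if any(similarity(text, old) >= threshold for old in seen):
            catches += 1
        else:
            seen.append(text)
    return len(seen), catches

def run_matrix(messages, thresholds=(25, 50, 75)):
    """Run every threshold both with and without headers."""
    results = {}
    for threshold, ignore_headers in product(thresholds, (False, True)):
        results[(threshold, ignore_headers)] = simulate(
            messages, threshold, ignore_headers)
    return results
```

The caller is expected to pass `messages` already sorted chronologically, since each message is only compared against those delivered before it. Comparing every message against every earlier one is quadratic, so this form is only practical on small samples.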