SpamHash/NaiveHamTest

From ThorxWiki
Revision as of 02:41, 3 April 2009 by Nemo (Talk | contribs)

Testing SpamHash on a ham stream, naive start.

Why?

I expect ham to be, by its very nature, unique and thus immune to the similarity testing that spamsum performs. If this can be demonstrated, then spamhash could be run natively on an incoming mail stream, without requiring a spampot.

Ham Corpus

The ham corpus is my personal email archive. Most mail folders were added (the only notable exception being approx 50000 messages received via an nntp->smtp gateway, as these cannot be considered representative of genuine email). All other mail (personal, email lists, etc) was added in, then all sent mail was filtered out (via mutt's $alternates). Some obvious spam which had crept in (mainly via one buggy list) was removed, but this filtering was not comprehensive. The resulting archive was then cropped to the 11-year window of 1998 to 2008 inclusive. These filters combined to reduce my personal mail archive from approx 120000 messages to 61277. Finally, the mail was saved to Maildir format with modified filenames.
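The date-window and sent-mail filters described above could be sketched roughly as follows. This is an illustration only, not the script actually used: MY_ADDRESSES is a hypothetical stand-in for whatever mutt's $alternates matches, and keep_message is an invented helper name.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Hypothetical stand-in for the addresses matched by mutt's $alternates.
MY_ADDRESSES = {"me@example.org"}


def keep_message(msg, first_year=1998, last_year=2008, my_addresses=MY_ADDRESSES):
    """Return True if msg falls inside the year window and was not sent by me."""
    date_header = msg.get("Date")
    if date_header is None:
        return False  # no usable date, so it cannot be placed in the window
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False  # unparseable Date header
    if not (first_year <= sent.year <= last_year):
        return False
    sender = parseaddr(msg.get("From", ""))[1].lower()
    return sender not in my_addresses
```

A message would be passed in as a parsed email object, for example keep_message(message_from_string(raw_text)).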

Test procedure

The ham corpus was run through a script to simulate chronological delivery and filtering via procmail.

  • This was run several times:
    • each test twice: once with and once without spamsum's "-H" option (ignore email headers)
    • at threshold scores of 25, 50 and 75 (and possibly others, dependent on the results seen here and in the equivalent spam corpus testing)
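The run matrix and the chronological replay can be sketched in Python as below. This is a rough sketch under stated assumptions, not the actual procmail setup: the similarity function uses difflib as a plainly named stand-in for spamsum's 0 to 100 match score, strip_headers only approximates what "-H" does, and treating a score at or above the threshold as a match is an assumption. All function names are invented for illustration.

```python
from difflib import SequenceMatcher
from itertools import product

THRESHOLDS = (25, 50, 75)
IGNORE_HEADERS = (False, True)  # without and with spamsum's "-H"


def test_runs():
    """Enumerate every (threshold, ignore_headers) combination to be run."""
    return [{"threshold": t, "ignore_headers": h}
            for t, h in product(THRESHOLDS, IGNORE_HEADERS)]


def strip_headers(raw):
    """Rough approximation of "-H": keep only the text after the first blank line."""
    _head, sep, body = raw.partition("\n\n")
    return body if sep else raw


def similarity(a, b):
    """Stand-in 0-100 score using difflib; this is NOT the spamsum algorithm."""
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))


def replay(messages, threshold, score=similarity):
    """Replay (timestamp, text) pairs in time order and return those that
    score at or above the threshold against any earlier message."""
    seen = []
    flagged = []
    for stamp, text in sorted(messages, key=lambda m: m[0]):
        if any(score(text, earlier) >= threshold for earlier in seen):
            flagged.append((stamp, text))
        seen.append(text)
    return flagged
```

Each dictionary from test_runs() describes one pass over the corpus; for passes with ignore_headers set, the message text would be passed through strip_headers before replay.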

Results

Analysis
