SpamHash/NaiveHamTest

From ThorxWiki

Revision as of 10:55, 4 April 2009


Testing SpamHash on a ham stream, naive start.

Why?

I expect ham to be, by its very nature, unique and thus immune to the similarity testing that spamsum performs. If this can be shown to hold, then spamhash could be run directly on an incoming mail stream, without requiring a spampot.

Ham Corpus

The ham corpus is my personal email archive.

I combined most of the mail folders I currently have. The only notable exception is the archive from an nntp->smtp gateway, as this cannot be considered representative of genuine email (the nntp archive alone represented some 50,000 messages!). All other mail (personal, email lists, etc.) was added in, then sent mail was filtered out (via mutt's $alternates). Some obvious spam which had crept in (mainly via one buggy list) was also removed, but this removal cannot be said to be comprehensive. The resulting archive was then cropped to the 11-year window of 1998 to 2008 inclusive.

Together, these filtering steps reduced my personal mail archive from approximately 120000 messages to 61277 (637MB in mbox format).

Finally, the mail was saved to Maildir format with a modified filename. The format used is: "YYYYMMDD-HH:MM:SS.<string>.8charmd5:2,". This allows for simple commandline listing of messages in chronological order. We set <string> to name the corpus, which allows for future mixing of ham, spam and spampot corpora without losing track of which corpus each message came from.
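As an illustration of the naming scheme above, here is a minimal Python sketch. The function name corpus_filename is hypothetical, and two details are assumptions rather than facts from this page: that the timestamp is the message's delivery time, and that "8charmd5" means the first 8 hex characters of the MD5 of the raw message.

```python
import hashlib
from datetime import datetime


def corpus_filename(delivered: datetime, corpus: str, raw_message: bytes) -> str:
    """Build a name of the form 'YYYYMMDD-HH:MM:SS.<string>.8charmd5:2,'.

    Assumption: the 8-character field is the first 8 hex digits of the
    MD5 of the raw message bytes; the page does not say what is hashed.
    """
    stamp = delivered.strftime("%Y%m%d-%H:%M:%S")
    digest = hashlib.md5(raw_message).hexdigest()[:8]
    return f"{stamp}.{corpus}.{digest}:2,"


if __name__ == "__main__":
    names = [
        corpus_filename(datetime(2008, 12, 31, 23, 59, 59), "ham", b"later message"),
        corpus_filename(datetime(1998, 1, 1, 0, 0, 0), "ham", b"earlier message"),
    ]
    # The timestamp leads the name, so a plain lexical sort is chronological.
    for name in sorted(names):
        print(name)
```

Because the fixed-width timestamp comes first, sorting the names as plain strings (as ls does by default) lists the messages in chronological order regardless of which corpus they belong to.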

Test procedure

The ham corpus was run through a script to simulate chronological delivery and filtering via procmail.

  • This was run several times:
    • each test was run twice: once with and once without spamsum's "-H" option (ignore email headers)
    • with threshold scores of 25, 50 and 75 (and possibly others, dependent on the results seen here and in equivalent spam corpus testing)
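The test plan above can be sketched in Python as follows. This is not the script that was actually used: spamsum_option_sets and replay are hypothetical names, and deliver stands in for whatever hands each message to procmail. Only the "-H" and "-Txx" options are taken from this page.

```python
import itertools
import os

THRESHOLDS = (25, 50, 75)


def spamsum_option_sets(thresholds=THRESHOLDS):
    """Return one spamsum option list per planned run.

    Each threshold is paired with both header settings: headers
    included (no extra flag) and headers ignored ("-H").
    """
    runs = []
    for threshold, skip_headers in itertools.product(thresholds, (False, True)):
        options = ["-H"] if skip_headers else []
        options.append(f"-T{threshold}")
        runs.append(options)
    return runs


def replay(maildir_cur, deliver):
    """Feed every message in a directory to deliver() in filename order.

    The corpus filenames start with a timestamp, so filename order is
    chronological order. deliver is a caller-supplied (hypothetical)
    callable taking the filename and the raw message bytes.
    """
    for name in sorted(os.listdir(maildir_cur)):
        with open(os.path.join(maildir_cur, name), "rb") as handle:
            deliver(name, handle.read())


if __name__ == "__main__":
    for options in spamsum_option_sets():
        print(" ".join(options))
```

Each of the six option sets corresponds to one cell of the results table, and each run would replay the whole corpus from the start.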

Test machine

933MHz Pentium III

Results

Spamsum threshold (-Txx) | Including headers | Excluding headers (-H)
25                       | to be tested      | see results below
50                       | to be tested      | to be tested
75                       | to be tested      | to be tested

Results for threshold 25, excluding headers (-H -T25):

Process time:

18878.18user 1146.58system 6:56:08elapsed
80%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs
(63875major+75779350minor)pagefaults 0swaps

  • hashes: 30181
  • catches: 31096

One word: terrible
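To put the -H -T25 figures in proportion, the short Python calculation below works through the arithmetic. It rests on an assumption: that "hashes" counts messages whose hash was newly recorded and "catches" counts messages that matched an earlier hash. The page does not define these terms, but the two counts do sum to the 61277 messages in the corpus.

```python
MESSAGES = 61277   # size of the ham corpus
HASHES = 30181     # assumed: messages whose hash was newly recorded
CATCHES = 31096    # assumed: messages that matched an earlier hash


def elapsed_seconds(text):
    """Convert an 'H:MM:SS' elapsed time, as printed by time, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


catch_rate = CATCHES / MESSAGES
elapsed = elapsed_seconds("6:56:08")
seconds_per_message = elapsed / MESSAGES
cpu_share = (18878.18 + 1146.58) / elapsed  # (user + system) / wall clock

if __name__ == "__main__":
    print(f"hashes + catches = {HASHES + CATCHES}")
    print(f"catch rate: {catch_rate:.1%}")
    print(f"wall-clock time per message: {seconds_per_message:.2f}s")
    print(f"CPU share: {cpu_share:.0%}")
```

Under that reading, roughly half of the ham messages were flagged as similar to earlier mail, which is the opposite of the uniqueness the test set out to show.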

Raw result data available on request

Analysis
