SpamHash/NaiveHamTest

From ThorxWiki
{{TOCright}}

= Testing SpamHash on a ham stream, naive start. =
== Why? ==

I expect ham to be, by its very nature, unique, and thus immune to the similarity testing that spamsum performs. If this can be proven, then spamhash could be run natively on an incoming mail stream, and would not require a spampot.
=== Original notes ===

* Is a spampot even necessary? Couldn't this simply be run on a complete email dataset? After all, it works by allowing through the first instance of every unique email anyway, and ham tends to be relatively unique, whilst spam tends to come in repetitive sets...
** Yes... in simple testing, simply quoting an email in response makes it quite dissimilar, and the reply to that (which should be the next message spamsum sees) will have two levels of quoting! (TODO: get numbers)
** TODO: test simply by feeding a week's corpus of ALL my regular email through spamsum, simulating this.
*** Do this twice: once naively, once with a pre-learnt hashDB from the spampot.
*** Then do it another way: over a known 100% ham corpus (save a corpus of ham messages to MH or maildir format).
*** Expectation: this will be effective, except possibly for email memes. If the same funny picture is sent to you twice, even by different people, it will be base64 encoded the same way and thus show up as EXTREMELY similar. (How common this is should show up in the 100% ham corpus test.)
= Ham Corpus =

The ham corpus is my personal email archive.
I combined most mailfolders I currently have, the only notable exception being the archive from an nntp->smtp gateway, as this cannot be considered representative of genuine email (the nntp archive represented some 50,000 messages!). All other mail (personal, email lists, etc.) was added in, then sent mail was filtered out (via mutt's $alternates). Some obvious spam which had crept in (mainly via one buggy list) was also removed, but this cannot be said to be comprehensive. The resulting archive was then cropped to the 11-year window of 1998 to 2008 inclusive.
This filtering combined to reduce my personal mail archive from approx 120000 messages to 61277 (637MB in mbox format).
Finally, the mail was saved to [[Maildir]] format with a modified filename. The format used is: "''YYYYMMDD-HH:MM:SS.<string>.8charmd5:2,''". This allows for simple command-line listing of messages in chronological order. We set ''<string>'' to name the corpus, allowing for future mixing of ham, spam and spampot corpuses without losing each message's original corpus membership.
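The exact renaming script is not shown on this page; a minimal sketch of building such a filename (the function name and header parsing here are my own assumptions, not the original tooling) might look like:

```python
import hashlib
from email.utils import parsedate_to_datetime

def maildir_name(raw_message: bytes, corpus: str) -> str:
    """Build a "YYYYMMDD-HH:MM:SS.<string>.8charmd5:2," style filename
    so a plain sorted directory listing is chronological. `corpus`
    fills the <string> slot, tagging which corpus the message is from.
    """
    # Take the timestamp from the message's own Date: header.
    headers = raw_message.split(b"\n\n", 1)[0].decode("utf-8", "replace")
    date_line = next(line for line in headers.splitlines()
                     if line.lower().startswith("date:"))
    when = parsedate_to_datetime(date_line[5:].strip())
    stamp = when.strftime("%Y%m%d-%H:%M:%S")
    digest = hashlib.md5(raw_message).hexdigest()[:8]  # 8-char md5
    return f"{stamp}.{corpus}.{digest}:2,"
```

Because the date comes first, a lexical sort of these names is a chronological sort, which is what the test run below relies on.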
= Test procedure =
The ham corpus was run through a script to simulate chronological delivery and filtering via procmail.
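The script itself is not reproduced here; the hashes/catches bucketing it performs can be sketched as below, with a stand-in `similarity` function in place of the real spamsum invocation (all names are illustrative, not the original script's):

```python
def simulate_delivery(messages, similarity, threshold=25):
    """Simulate chronological delivery: each incoming message is scored
    against every previously kept message. A best score at or above the
    threshold files it as a catch (seen-before mail); otherwise it is
    novel and joins the hashes bucket for future comparisons.

    `messages` is an iterable of (name, text) pairs already in
    chronological order; `similarity` stands in for spamsum's 0-100
    similarity score between two messages.
    """
    hashes, catches = [], []
    for name, text in messages:
        best = max((similarity(text, kept) for _, kept in hashes),
                   default=0)
        if best >= threshold:
            catches.append(name)
        else:
            hashes.append((name, text))
    return hashes, catches
```

Note that every message is compared against the whole hashes bucket, so the run slows down as that bucket grows — which matters for the timing results below.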
* This was run several times:
** each test twice: once with and once without spamsum's "-H" option (ignore email headers)
** threshold scores of 25, 50 and 75 (and possibly others, dependent on the results seen here and in the equivalent spam corpus testing)
= Test machine =

== Hardware ==

* 933MHz PIII
* 2x 6GB QUANTUM FIREBALL
* 382MB RAM

== Software ==

Note that the first HD has Windows XP installed; the test system runs from the second drive.

* Debian GNU/Linux 4.0
* Kernel: 2.6.18-4-686
* 900MB swap partition
= Results =
{| class="wikitable" cellpadding="5" cellspacing="0" border="1" width="100%"
! Spamsum threshold (-Txx) !! Including headers !! Excluding headers (-H)
|-
| '''25''' ||
/usr/bin/time:
18878.18user 1146.58system
6:56:08elapsed 80%CPU
(0avgtext+0avgdata 0maxresident)k
0inputs+0outputs
(63875major+75779350minor)pagefaults
0swaps
* hashes: 30181
* catches: 31096
One-word summary: abysmal
|| to be tested
|-
| '''50''' ||
/usr/bin/time:
44472.83user 1264.18system
15:32:24elapsed 4%CPU
(0avgtext+0avgdata 0maxresident)k
0inputs+0outputs
(64143major+75657421minor)pagefaults
0swaps
* hashes: 44070
* catches: 17207
One-word summary: terrible
|| to be tested
|-
| '''75''' ||
/usr/bin/time:
85659.36user 1381.94system
27:06:11elapsed 1%CPU
(0avgtext+0avgdata 0maxresident)k
0inputs+0outputs
(63674major+75251824minor)pagefaults
0swaps
* hashes: 60080
* catches: 1197
One-line summary: 2% false positives = not acceptable
||
/usr/bin/time:
36731.38user 1341.55system
13:23:14elapsed 78%CPU
(0avgtext+0avgdata 0maxresident)k
0inputs+0outputs
(65056major+76551410minor)pagefaults
0swaps
* hashes: 52189
* catches: 9088
One-line summary: 14% false positives = always use header protection
|-
| '''90''' || to be tested || to be tested
|}
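The one-line verdicts can be double-checked from the table's own counts: on a pure-ham corpus every catch is a false positive, so the rate is simply catches / (hashes + catches). A quick check (counts copied from the table):

```python
def fp_rate(hashes: int, catches: int) -> float:
    """False-positive rate on a pure-ham corpus: every catch is a
    legitimate message wrongly flagged as similar to earlier mail."""
    return catches / (hashes + catches)

# Counts from the results table; each run covers the same 61277 messages.
runs = {
    "-T25":    (30181, 31096),
    "-T50":    (44070, 17207),
    "-T75":    (60080, 1197),
    "-T75 -H": (52189, 9088),
}
for flags, (h, c) in runs.items():
    print(f"{flags}: {fp_rate(h, c):.1%} of ham flagged")
```

-T75 flags about 2.0% of ham, and -T75 -H about 14.8%, which is where the 2% and 14% figures above come from.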
Raw result data available on request.

= Analysis =
* Higher threshold runs take longer to process. This is because the higher the threshold, the more mail is saved to the hashes bucket. Since every mail is hashed and compared against the hashes bucket, this comparison time increases with the threshold. (The parent script then in turn has less relative time spent on itself, and so %CPU falls as the threshold increases.)

* -T25 and -T50 are terrible. Even -T75 is not acceptable when run over a ham corpus. This increasingly indicates that a spampot is required for an effective SpamHash system.

Latest revision as of 01:32, 8 April 2009
