SpamHash/NaiveHamTest

From ThorxWiki

Revision as of 20:27, 2 April 2009

Testing SpamHash on a ham stream, naive start.

Why?

Because ham is hoped to be, by its very nature, unique, and thus immune to the similarity testing that spamsum performs. If this is so, then SpamHash could be run directly on an incoming mail stream without requiring a spampot.

So, a corpus of approximately 1 GB of ham (~120,000 messages), spanning some 15 years, was run through spamsum in chronological order with a threshold of 25.
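
A minimal sketch of the kind of chronological pass described above. It is not the actual SpamHash code: the scoring function is a stand-in built on Python's difflib (spamsum itself is not called), the names `stand_in_score` and `split_stream` are hypothetical, and treating a score equal to the threshold as a catch is an assumption.

```python
from difflib import SequenceMatcher


def stand_in_score(a: str, b: str) -> int:
    """Stand-in similarity score in 0..100 (difflib ratio, NOT spamsum)."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def split_stream(messages, threshold=25, score=stand_in_score):
    """Walk messages in order and return (hashes, catches).

    A message is a catch if it scores >= threshold (assumed semantics)
    against any earlier message that was kept as unique; otherwise it is
    added to the list of known uniques ("hashes").
    """
    hashes, catches = [], []
    for msg in messages:
        if any(score(msg, known) >= threshold for known in hashes):
            catches.append(msg)
        else:
            hashes.append(msg)
    return hashes, catches
```

Because each message is compared against every unique seen so far, the cost grows roughly with the square of the corpus size, which fits with the long run time reported below.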

Here is what we find:

By message count:
hashes: 58192
catches: 61745

By size:
726M    hashes
366M    catches

Time to process:
start time:  Mon Mar 30 21:15:44 EST 2009
stop time:   Tue Mar 31 20:54:05 EST 2009

real    1418m20.848s
user    1008m43.698s
sys     45m46.752s

As we can see, more than half the messages were caught as similar (but only a third by size, indicating that larger messages were more likely to be unique). Indeed, a review of the first few catches showed that they are small messages with similar headers (especially From:, To:, Subject:, Received:, Return-Path:). With spamsum's ignore-headers option, I hope that most of these similar messages become dissimilar.
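
The proportions quoted above can be checked with a few lines of arithmetic. This sketch uses only the counts and sizes reported in the results; treating the "M" figures as MiB and ignoring filesystem overhead are assumptions.

```python
# Figures copied from the results above.
hashes_count, catches_count = 58192, 61745
hashes_mb, catches_mb = 726, 366  # assumed to be MiB

total_count = hashes_count + catches_count
total_mb = hashes_mb + catches_mb

# Share of the corpus that was caught as similar.
catch_share_by_count = catches_count / total_count
catch_share_by_size = catches_mb / total_mb

# Average message size in KiB for each group.
avg_unique_kb = hashes_mb * 1024 / hashes_count
avg_catch_kb = catches_mb * 1024 / catches_count

print(f"caught by count: {catch_share_by_count:.1%}")
print(f"caught by size:  {catch_share_by_size:.1%}")
print(f"avg unique: {avg_unique_kb:.1f} KiB, avg catch: {avg_catch_kb:.1f} KiB")
```

The catches come to about 51% of messages but only about 34% of bytes, so the average caught message is roughly half the size of the average unique one.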

Additionally, ssdeep was then run to compare the hashes and catches directories, to determine where the bell curve of similarity scores is centered (note that this comparison is not chronologically accurate like the original test). The resulting bell curve was centered at 35.
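
One way to locate the center of such a curve is to bin the scores and take the fullest bin. The sketch below assumes the comparison output is saved as text with one match per line, each ending in a parenthesised score such as `(35)`. That line shape and the helper names `score_histogram` and `peak_bin` are assumptions for illustration, not the exact procedure used here.

```python
import re
from collections import Counter

# Assumed line shape: "<file A> matches <file B> (NN)"
SCORE_RE = re.compile(r"\((\d+)\)\s*$")


def score_histogram(lines, bin_width=5):
    """Count match scores per bin; lines without a trailing score are skipped."""
    counts = Counter()
    for line in lines:
        m = SCORE_RE.search(line)
        if m:
            score = int(m.group(1))
            counts[score - score % bin_width] += 1  # lower edge of the bin
    return counts


def peak_bin(counts):
    """Return the lower edge of the most populated bin."""
    return max(counts, key=counts.get)
```

The same histogram could also feed the planned graph of results.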

TODO

  • Process again ignoring headers, then run both styles (with and without headers) with thresholds of 50 and 75
  • Graph results
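
For the header-free run, the idea is that only the message body feeds the hash. A minimal sketch of stripping headers with Python's standard email module follows. It illustrates the idea only and is not how spamsum's own option works internally; the helper name `body_only` is hypothetical.

```python
from email import message_from_string


def body_only(raw_message: str) -> str:
    """Return the message text with all headers removed.

    For multipart mail, the payloads of the leaf parts are joined;
    payloads are left as stored (no transfer decoding is attempted).
    """
    msg = message_from_string(raw_message)
    if msg.is_multipart():
        parts = [p.get_payload() for p in msg.walk() if not p.is_multipart()]
        return "\n".join(str(p) for p in parts)
    return msg.get_payload()
```

Two short messages that share From:, To: and Received: lines but differ in their bodies would then be compared on the bodies alone.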