SpamHash/NaiveHamTest

From ThorxWiki
Revision as of 21:00, 31 March 2009 by Nemo (Talk | contribs)


Testing SpamHash on a ham stream, naive start.

Why?

Because ham is hoped to be, by its very nature, unique, and thus immune to the similarity testing that spamsum performs. If this is so, then SpamHash could be run natively on an incoming mail stream, without requiring a spampot.

So a corpus of roughly 1 GB of ham (~120000 messages), spanning some 15 years, was run through spamsum in chronological order with a threshold of 25.
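The chronological pass described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used for the test: `split_stream` and `similarity` are hypothetical names, the scorer is a stand-in built on Python's `difflib` rather than spamsum, and it is assumed that a message counts as a catch when its score against any earlier unmatched message is at or above the threshold.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in scorer returning 0-100. This is NOT spamsum, just a placeholder."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def split_stream(messages, threshold=25, score=similarity):
    """Split a chronologically ordered stream into 'hashes' and 'catches'.

    Assumption: a message is a catch if it scores >= threshold against
    any previously seen message that was itself not caught.
    """
    hashes = []   # messages with no sufficiently similar predecessor
    catches = []  # messages that matched an earlier entry in `hashes`
    for msg in messages:
        if any(score(msg, known) >= threshold for known in hashes):
            catches.append(msg)
        else:
            hashes.append(msg)
    return hashes, catches
```

Note that this naive version compares each message against every stored entry, so the work grows with the size of the hashes set; a real run would swap in spamsum scores through the `score` parameter.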

Here is what we find:

By message count:
hashes: 58192
catches: 61745

By size:
726M    hashes
366M    catches

Time to process:
start time:  Mon Mar 30 21:15:44 EST 2009
stop time:   Tue Mar 31 20:54:05 EST 2009

real    1418m20.848s
user    1008m43.698s
sys     45m46.752s
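As a rough cross-check of the timing figures, the arithmetic below derives the elapsed time from the start and stop stamps and the per-message cost from the message counts reported above. It is only a sketch; the helper name `to_seconds` is made up for the example.

```python
from datetime import datetime


def to_seconds(minutes, seconds):
    """Convert a `time`-style 'XmY.Zs' pair into seconds."""
    return minutes * 60 + seconds


# Wall-clock span from the start/stop stamps (both EST, so no offset needed).
span = datetime(2009, 3, 31, 20, 54, 5) - datetime(2009, 3, 30, 21, 15, 44)

real = to_seconds(1418, 20.848)
user = to_seconds(1008, 43.698)
sys_time = to_seconds(45, 46.752)
messages = 58192 + 61745  # hashes + catches

per_message = real / messages
cpu_share = (user + sys_time) / real

print(f"elapsed by timestamps: {span}")
print(f"{per_message:.2f} s per message")
print(f"{messages / real:.2f} messages per second")
print(f"CPU busy for {cpu_share:.0%} of wall-clock time")
```

The timestamps give 23 hours 38 minutes 21 seconds, which agrees with the `real` figure, and the run works out to about 0.71 seconds per message.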

As we can see, more than half the messages (about 51%) were caught as similar, but only about a third of the total by size, indicating that larger messages were more likely to be unique.
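The proportions can be checked directly from the figures above. This is a small arithmetic sketch; it assumes the "M" sizes are mebibytes, which the page does not state.

```python
hashes_count, catches_count = 58192, 61745
hashes_mb, catches_mb = 726, 366  # assumed to be MiB

total_count = hashes_count + catches_count
total_mb = hashes_mb + catches_mb

caught_by_count = catches_count / total_count
caught_by_size = catches_mb / total_mb

# Average message size in each group, in KiB.
avg_hash_kb = hashes_mb * 1024 / hashes_count
avg_catch_kb = catches_mb * 1024 / catches_count

print(f"total messages:  {total_count}")          # 119937
print(f"caught by count: {caught_by_count:.1%}")  # 51.5%
print(f"caught by size:  {caught_by_size:.1%}")   # 33.5%
print(f"average unique message: {avg_hash_kb:.1f} KiB")   # 12.8 KiB
print(f"average caught message: {avg_catch_kb:.1f} KiB")  # 6.1 KiB
```

On these numbers the average unique message is roughly twice the size of the average caught one, which is consistent with larger messages being more likely to be unique.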

Additionally, ssdeep was then run to compare the hashes and catches directories, to determine what similarity score the matches cluster around (note that this comparison is not chronologically accurate like the original test). The resulting bell curve was centred on 35.
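One way to find the centre of such a distribution is to bin the pairwise scores and look for the fullest bin. The sketch below assumes the scores have already been collected as integers from 0 to 100; `score_histogram` and `modal_bin` are hypothetical helper names, and the sample scores are made up for illustration, not taken from the test.

```python
from collections import Counter


def score_histogram(scores, bin_width=5):
    """Count 0-100 scores into bins keyed by each bin's lower edge."""
    return Counter((s // bin_width) * bin_width for s in scores)


def modal_bin(hist):
    """Return the lower edge of the bin holding the most scores."""
    return max(hist, key=hist.get)


# Made-up scores, purely to show the shape of the output.
toy_scores = [28, 33, 35, 36, 37, 40, 52]
hist = score_histogram(toy_scores)

# Crude text plot: one '#' per score in each bin.
for edge in sorted(hist):
    print(f"{edge:3d}-{edge + 4:3d} {'#' * hist[edge]}")
```

The same histogram could also feed a plotting tool for the graphing item in the TODO list.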

TODO

  • Process again with thresholds of 50 and 75
  • Graph results