SpamHash/NaiveHamTest
Testing SpamHash on a ham stream, naive start.
Why?
Because ham is hoped to be, by its very nature, unique, and thus immune to the similarity testing that spamsum performs. If this is so, then SpamHash could be run natively on an incoming mail stream, without requiring a spampot.
So, given a corpus of approximately 1 GB of ham (~120000 messages) spanning some 15 years, the messages were run through spamsum in chronological order with a threshold of 25.
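A minimal sketch of what such a naive pass might look like, assuming each message is compared against everything kept so far and is counted as a catch when any score reaches the threshold. The names `similarity` and `naive_pass` are hypothetical, and `difflib.SequenceMatcher` is only a stand-in scorer so the example runs without extra packages; it is not the spamsum algorithm, and the sketch keeps whole messages where the real test would keep spamsum signatures.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Stand-in score from 0 to 100. This is NOT spamsum, only a placeholder."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def naive_pass(messages, threshold=25, score=similarity):
    """Split a chronological stream into 'hashes' and 'catches'.

    Assumption: a message is a catch if it scores >= threshold against
    any previously kept message; otherwise it is kept as a new entry.
    """
    hashes = []   # messages not similar to anything seen earlier
    catches = []  # messages similar to at least one kept message
    for msg in messages:
        if any(score(msg, seen) >= threshold for seen in hashes):
            catches.append(msg)
        else:
            hashes.append(msg)
    return hashes, catches


if __name__ == "__main__":
    stream = [
        "From: a@example.org\nSubject: lunch\n\nsee you at noon",
        "From: a@example.org\nSubject: lunch\n\nsee you at one",
        "zzzz qqqq",
    ]
    kept, caught = naive_pass(stream, threshold=25)
    print(len(kept), "hashes,", len(caught), "catches")
```

In this sketch every message is compared against every kept entry, so the work per message grows as the kept list grows.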
Here is what we found:
By message count:
    hashes: 58192
    catches: 61745
By size:
    hashes: 726M
    catches: 366M
Time to process:
    start time: Mon Mar 30 21:15:44 EST 2009
    stop time: Tue Mar 31 20:54:05 EST 2009
    real 1418m20.848s
    user 1008m43.698s
    sys 45m46.752s
As we can see, more than half of the messages were caught as similar (but only about a third by size, indicating that larger messages were more likely to be unique). Indeed, a review of the first few catches showed that they are small messages with similar headers (esp. From:, To:, Subject:, Received:, Return-Path:). With spamsum's ignore-headers option, I hope that most of these similar messages will become dissimilar.
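The proportions quoted above can be checked from the reported totals. The short calculation below is only a sketch of that arithmetic; the variable names are made up for illustration, and the size ratio is unit-independent, so it does not matter exactly what "M" stands for.

```python
# Totals as reported in the results above.
hashes_count, catches_count = 58192, 61745
hashes_size_m, catches_size_m = 726, 366


def caught_fraction(kept, caught):
    """Fraction of the total that ended up in the catches pile."""
    return caught / (kept + caught)


by_count = caught_fraction(hashes_count, catches_count)
by_size = caught_fraction(hashes_size_m, catches_size_m)

# Average size of a kept message relative to a caught one.
size_ratio = (hashes_size_m / hashes_count) / (catches_size_m / catches_count)

print(f"caught by count: {by_count:.1%}")   # 51.5%
print(f"caught by size:  {by_size:.1%}")    # 33.5%
print(f"kept messages are about {size_ratio:.1f}x larger on average")  # 2.1x
```

The last figure is consistent with the observation that the catches are mostly small messages.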
Additionally, ssdeep was then run to compare the hashes and catches directories, to determine what similarity score the bell curve was centered on (noting that this comparison is not chronologically accurate like the original test). The resulting bell curve was centered on 35.
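One way the center of such a distribution might be found is to bucket the match scores and look for the fullest bin. The sketch below assumes the scores have already been extracted from the ssdeep output as integers from 0 to 100; `score_histogram` and `modal_bin` are hypothetical helper names, and the example scores are made up purely for illustration, not taken from this test.

```python
from collections import Counter

BIN_WIDTH = 5


def score_histogram(scores, bin_width=BIN_WIDTH):
    """Count scores per bin, keyed by the lower edge of each bin."""
    counts = Counter((s // bin_width) * bin_width for s in scores)
    return dict(sorted(counts.items()))


def modal_bin(hist):
    """Lower edge of the bin holding the most scores."""
    return max(hist, key=hist.get)


if __name__ == "__main__":
    # Made-up scores for illustration only; NOT data from this test.
    example_scores = [27, 31, 33, 35, 36, 38, 41, 44]
    hist = score_histogram(example_scores)
    for low, n in hist.items():
        print(f"{low:3d}-{low + BIN_WIDTH - 1:3d} {'#' * n}")
    print("fullest bin starts at", modal_bin(hist))
```

The text bars double as a rough first graph of the results.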
TODO
- Process again but ignore headers, then run both styles with thresholds of 50 and 75
- Graph results