SpamHash/BinaryTest
Revision as of 04:19, 23 June 2010
Measuring the performance of various hashing methods
spamsum is one method to create a hash on a file.
Other methods include
- md5 (very well known and designed cryptographically)
- sha1 (crypto stronger than md5)
- spamsum (designed to find similarity in files, based on rsync)
- ssdeep (a different spamsum implementation)
Test 1: Big file
I ran spamsum, ssdeep and md5sum over my 500+ MB procmail.log file, three times each to account for caching effects. Note that procmail.log was live during the test and grew by about 78 KB over the runs (from 562,874,616 to 562,952,378 bytes). While this is not benchmark-quality testing, I believe the results are distinct enough to be clear.
- spamsum averaged about 17 minutes per run, using between 23% and 31% CPU, and up to 5 minutes of user time.
- ssdeep averaged about 5 minutes per run, using between 35% and 52% CPU, and up to 2 minutes 20 seconds of user time.
- md5sum took at worst 44 seconds, using at most 2% CPU, and less than 0.5 seconds of user time.
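The procedure above (each tool run three times over the same file, with wall-clock, user and system time recorded) could be scripted roughly as follows. This is a minimal sketch, not the method actually used for the numbers on this page: the function names `time_command` and `benchmark` are hypothetical, and it assumes a Unix-like system where `os.times()` reports CPU time for child processes.

```python
import os
import subprocess
import time


def time_command(argv):
    """Run argv once and return (wall, user, system) seconds for the child."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # children_user/children_system accumulate the CPU time of child
    # processes that have been waited for, so the difference is this run.
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return wall, user, system


def benchmark(commands, runs=3):
    """Time every command once per pass, for `runs` passes.

    `commands` maps a label to an argv list. The tools are interleaved
    within each pass (tool A, tool B, tool C, then again) rather than
    run back to back.
    """
    results = {label: [] for label in commands}
    for _ in range(runs):
        for label, argv in commands.items():
            results[label].append(time_command(argv))
    return results
```

Assuming the three tools are on the PATH and each accepts a file name as its only argument, a call might look like `benchmark({"spamsum": ["spamsum", "procmail.log"], "ssdeep": ["ssdeep", "procmail.log"], "md5sum": ["md5sum", "procmail.log"]})`.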
Results: Raw data
-rw------- 1 nemo nemo 562874616 Mar 17 08:42 procmail.log
spamsum procmail.log 294.56s user 4.48s system 31% cpu 15:40.24 total
ssdeep procmail.log 137.96s user 3.97s system 47% cpu 4:57.03 total
md5sum procmail.log 0.41s user 0.15s system 1% cpu 44.168 total
spamsum procmail.log 284.00s user 4.16s system 28% cpu 16:54.68 total
ssdeep procmail.log 120.85s user 3.84s system 35% cpu 5:52.64 total
md5sum procmail.log 0.31s user 0.13s system 0% cpu 44.070 total
spamsum procmail.log 264.34s user 4.19s system 23% cpu 19:22.42 total
ssdeep procmail.log 139.07s user 4.04s system 52% cpu 4:33.28 total
md5sum procmail.log 0.47s user 0.16s system 2% cpu 26.408 total
-rw------- 1 nemo nemo 562952378 Mar 17 09:57 procmail.log
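To check the summary figures against the raw data, the timing lines can be parsed and averaged. The sketch below is illustrative only: the names `LINE`, `to_seconds` and `summarise` are hypothetical, and the regular expression assumes the timing format seen in the raw data above (user, system, CPU percentage, then a total that may be plain seconds or minutes:seconds).

```python
import re
from collections import defaultdict

RAW = """\
spamsum procmail.log 294.56s user 4.48s system 31% cpu 15:40.24 total
ssdeep procmail.log 137.96s user 3.97s system 47% cpu 4:57.03 total
md5sum procmail.log 0.41s user 0.15s system 1% cpu 44.168 total
spamsum procmail.log 284.00s user 4.16s system 28% cpu 16:54.68 total
ssdeep procmail.log 120.85s user 3.84s system 35% cpu 5:52.64 total
md5sum procmail.log 0.31s user 0.13s system 0% cpu 44.070 total
spamsum procmail.log 264.34s user 4.19s system 23% cpu 19:22.42 total
ssdeep procmail.log 139.07s user 4.04s system 52% cpu 4:33.28 total
md5sum procmail.log 0.47s user 0.16s system 2% cpu 26.408 total
"""

LINE = re.compile(
    r"^(?P<tool>\S+)\s+\S+\s+(?P<user>[\d.]+)s user\s+"
    r"(?P<system>[\d.]+)s system\s+(?P<cpu>\d+)% cpu\s+"
    r"(?P<total>[\d:.]+) total$"
)


def to_seconds(clock):
    """Convert 'SS.ss', 'M:SS.ss' or 'H:MM:SS.ss' to seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def summarise(raw):
    """Group timing lines by tool and compute simple per-tool statistics."""
    runs = defaultdict(list)
    for line in raw.splitlines():
        match = LINE.match(line.strip())
        if match:
            runs[match["tool"]].append(
                (float(match["user"]), int(match["cpu"]), to_seconds(match["total"]))
            )
    return {
        tool: {
            "mean_wall": sum(r[2] for r in rows) / len(rows),
            "max_user": max(r[0] for r in rows),
            "cpu_range": (min(r[1] for r in rows), max(r[1] for r in rows)),
        }
        for tool, rows in runs.items()
    }


summary = summarise(RAW)
for tool, stats in summary.items():
    print(tool, round(stats["mean_wall"], 1), stats["max_user"], stats["cpu_range"])
```

With this data the mean wall-clock times work out to roughly 1039 s (about 17 minutes) for spamsum, 308 s (about 5 minutes) for ssdeep and 38 s for md5sum.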
Test 2: Multiple small files
TODO: Test over multiple (10,000?) small files, with a more email-like size range (approximately 1k to 50k).
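One possible way to set up this test is sketched below. It is only a starting point under loud assumptions: the files are filled with random bytes rather than real email, `make_corpus` and `time_md5` are hypothetical helper names, and Python's `hashlib.md5` stands in for the md5sum binary. Timing spamsum or ssdeep would need their own command-line runs over the same files.

```python
import hashlib
import os
import random
import time


def make_corpus(directory, count=10_000, min_size=1_000, max_size=50_000, seed=0):
    """Write `count` files of random bytes, each min_size..max_size bytes long."""
    rng = random.Random(seed)  # the seed fixes the sizes, not the contents
    paths = []
    for i in range(count):
        size = rng.randint(min_size, max_size)
        path = os.path.join(directory, f"msg{i:05d}.bin")
        with open(path, "wb") as handle:
            handle.write(os.urandom(size))
        paths.append(path)
    return paths


def time_md5(paths):
    """Hash every file with MD5; return (elapsed seconds, list of hex digests)."""
    start = time.perf_counter()
    digests = []
    for path in paths:
        with open(path, "rb") as handle:
            digests.append(hashlib.md5(handle.read()).hexdigest())
    return time.perf_counter() - start, digests
```

Random data is a poor model of email text, so any numbers from such a corpus would only indicate per-file overhead, not behaviour on realistic messages.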
Conclusions
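On large files, the spamsum algorithm appears to be roughly an order of magnitude slower than md5sum in wall-clock time (about 5 to 17 minutes, depending on the implementation, versus under 45 seconds), and the gap in user CPU time is far larger (minutes versus under half a second). The original spamsum is itself roughly three times slower than ssdeep, which has presumably been optimised somewhat in the intervening years.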

