SpamHash/BinaryTest

From ThorxWiki
Revision as of 10:20, 23 June 2010 by Nemo (Talk | contribs)

Measuring the performance of various hashing methods

spamsum is one method of creating a hash of a file.

Hashing methods include:

  • md5 (very well known, designed as a cryptographic hash)
  • sha1 (cryptographically stronger than md5)
  • spamsum (designed to find similarity between files, based on ideas from rsync)
  • ssdeep (a later implementation of the spamsum algorithm)

Test 1: Big file

I ran spamsum, ssdeep and md5sum over my 500+ MB procmail.log file, three times each to account for caching effects. Note that procmail.log was live during the test and grew by about 78k (77762 bytes) over the runs. While this is not benchmark-quality testing, I believe the results are distinct enough to be clear.

  • spamsum averaged about 17 minutes per run (15:40 to 19:22), using 23% to 31% CPU and up to about 5 minutes of user time.
  • ssdeep took about 5 minutes per run (4:33 to 5:52), using 35% to 52% CPU and up to 2:20 of user time.
  • md5sum took at worst 44 seconds, using at most 2% CPU and less than 0.5 seconds of user time.
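
A minimal sketch of how such a run could be scripted, assuming a Unix-like system where os.times() reports child CPU time. The helper names time_command and benchmark are hypothetical; the command lines are the same ones shown in the raw data (tool name followed by the file name).

```python
import os
import subprocess
import sys
import time


def time_command(argv):
    """Run argv once and return (wall, user, system) seconds for the child."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(argv, stdout=subprocess.DEVNULL, check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # children_* accumulate CPU time of finished child processes (Unix only).
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return wall, user, system


def benchmark(commands, runs=3):
    """Time every command `runs` times, cycling through the tools each round."""
    results = {name: [] for name in commands}
    for _ in range(runs):
        for name, argv in commands.items():
            results[name].append(time_command(argv))
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    target = sys.argv[1]  # e.g. procmail.log
    tools = {name: [name, target] for name in ("spamsum", "ssdeep", "md5sum")}
    for name, timings in benchmark(tools).items():
        for wall, user, system in timings:
            print(f"{name}: {user:.2f}s user {system:.2f}s system {wall:.2f}s total")
```

Cycling through the tools within each round, rather than running one tool three times in a row, means every tool sees a similar cache state in each round.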

Results: raw data

-rw------- 1 nemo nemo 562874616 Mar 17 08:42 procmail.log

spamsum procmail.log  294.56s user 4.48s system 31% cpu 15:40.24 total
ssdeep procmail.log  137.96s user 3.97s system 47% cpu 4:57.03 total
md5sum procmail.log  0.41s user 0.15s system 1% cpu 44.168 total

spamsum procmail.log  284.00s user 4.16s system 28% cpu 16:54.68 total
ssdeep procmail.log  120.85s user 3.84s system 35% cpu 5:52.64 total
md5sum procmail.log  0.31s user 0.13s system 0% cpu 44.070 total

spamsum procmail.log  264.34s user 4.19s system 23% cpu 19:22.42 total
ssdeep procmail.log  139.07s user 4.04s system 52% cpu 4:33.28 total
md5sum procmail.log  0.47s user 0.16s system 2% cpu 26.408 total

-rw------- 1 nemo nemo 562952378 Mar 17 09:57 procmail.log
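
The per-tool averages can be recomputed from the timing lines above. The sketch below assumes the lines follow the shell `time` layout shown (user, system, CPU percentage, then a total given either as M:SS.ss or as plain seconds); the names RAW, to_seconds and summarise are hypothetical.

```python
import re
from collections import defaultdict
from statistics import mean

RAW = """\
spamsum procmail.log  294.56s user 4.48s system 31% cpu 15:40.24 total
ssdeep procmail.log  137.96s user 3.97s system 47% cpu 4:57.03 total
md5sum procmail.log  0.41s user 0.15s system 1% cpu 44.168 total
spamsum procmail.log  284.00s user 4.16s system 28% cpu 16:54.68 total
ssdeep procmail.log  120.85s user 3.84s system 35% cpu 5:52.64 total
md5sum procmail.log  0.31s user 0.13s system 0% cpu 44.070 total
spamsum procmail.log  264.34s user 4.19s system 23% cpu 19:22.42 total
ssdeep procmail.log  139.07s user 4.04s system 52% cpu 4:33.28 total
md5sum procmail.log  0.47s user 0.16s system 2% cpu 26.408 total
"""

LINE = re.compile(
    r"^(\S+) \S+\s+([\d.]+)s user ([\d.]+)s system (\d+)% cpu (\S+) total$"
)


def to_seconds(stamp):
    """Convert 'M:SS.ss' or 'SS.sss' into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def summarise(raw):
    """Return {tool: (mean user seconds, mean wall-clock seconds)}."""
    runs = defaultdict(list)
    for line in raw.splitlines():
        match = LINE.match(line.strip())
        if match:
            tool, user, _system, _cpu, total = match.groups()
            runs[tool].append((float(user), to_seconds(total)))
    return {
        tool: (mean(u for u, _ in r), mean(t for _, t in r))
        for tool, r in runs.items()
    }


summary = summarise(RAW)
base_user, base_wall = summary["md5sum"]
for tool, (user, wall) in summary.items():
    print(f"{tool}: {wall:.1f}s wall ({wall / base_wall:.1f}x md5sum), "
          f"{user:.2f}s user ({user / base_user:.0f}x md5sum)")
```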


Test 2: Multiple small files

TODO: Test over multiple (10,000?) small files, with a more email-like size range (approx. 1k to 50k).
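
One possible way to build such a test set is sketched below. It is only a starting point: the function name make_corpus and the file naming are hypothetical, and the files are filled with random bytes, which match the proposed size range but not the content of real email. A sample of real messages would be a better corpus for a similarity hash.

```python
import os
import random
import tempfile


def make_corpus(directory, count=10_000, min_size=1_024, max_size=50 * 1_024, seed=0):
    """Write `count` files of random bytes, each min_size..max_size bytes long."""
    rng = random.Random(seed)  # fixed seed so the corpus is reproducible
    paths = []
    for i in range(count):
        size = rng.randint(min_size, max_size)
        path = os.path.join(directory, f"msg{i:05d}.bin")
        with open(path, "wb") as handle:
            # getrandbits + to_bytes yields exactly `size` pseudo-random bytes.
            handle.write(rng.getrandbits(size * 8).to_bytes(size, "little"))
        paths.append(path)
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        files = make_corpus(workdir, count=100)
        total = sum(os.path.getsize(p) for p in files)
        print(f"wrote {len(files)} files, {total} bytes in total")
```

Each tool could then be timed over the whole directory, either one invocation per file or one invocation with all file names, since process start-up cost is likely to matter much more here than in the big-file test.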


Conclusions

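Over LARGE files, the spamsum algorithm appears to be about an order of magnitude slower than md5sum in wall-clock time: ssdeep averaged roughly 8 times the md5sum run time, and the original spamsum roughly 27 times. In user CPU time the gap is far larger, several hundred times in both cases. The original spamsum itself is significantly slower than ssdeep (about 3 times in wall-clock time, about 2 times in user time), which has presumably been optimised somewhat in the intervening years.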
