http://wiki.thorx.net/mediawiki/index.php?title=SpamHash/BinaryTest&feed=atom&action=historySpamHash/BinaryTest - Revision history2024-03-28T15:42:24ZRevision history for this page on the wikiMediaWiki 1.19.20+dfsg-0+deb7u3http://wiki.thorx.net/mediawiki/index.php?title=SpamHash/BinaryTest&diff=5569&oldid=prevNemo: Reverted edits by 200.201.22.20 (Talk); changed back to last version by Nemo2010-06-23T00:20:44Z<p>Reverted edits by <a href="/wiki/Special:Contributions/200.201.22.20" title="Special:Contributions/200.201.22.20">200.201.22.20</a> (<a href="/mediawiki/index.php?title=User_talk:200.201.22.20&action=edit&redlink=1" class="new" title="User talk:200.201.22.20 (page does not exist)">Talk</a>); changed back to last version by <a href="/wiki/User:Nemo" title="User:Nemo">Nemo</a></p>
Nemohttp://wiki.thorx.net/mediawiki/index.php?title=SpamHash/BinaryTest&diff=5568&oldid=prev200.201.22.20: /* Measuring the performance of various hashing methods */2010-06-22T18:19:23Z<p><span dir="auto"><span class="autocomment">Measuring the performance of various hashing methods</span></span></p>
200.201.22.20http://wiki.thorx.net/mediawiki/index.php?title=SpamHash/BinaryTest&diff=4491&oldid=prevNemo: this test moved here to subpage. slight touchups from original content too2009-03-18T00:19:01Z<p>this test moved here to subpage. slight touchups from original content too</p>
<p><b>New page</b></p><div>{{TOCright}}<br />
== Measuring the performance of various hashing methods ==<br />
spamsum is one method of computing a hash of a file.<br />
<br />
The methods considered here are:<br />
* md5 (very well known; designed as a cryptographic hash)<br />
* sha1 (cryptographically stronger than md5)<br />
* spamsum (designed to find similarity between files, based on rsync)<br />
* ssdeep (a different spamsum implementation)<br />
<br />
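For reference, the two cryptographic hashes in the list above can be computed from Python's standard `hashlib` module. The following is only a minimal sketch: the function name `hash_file` and the 1 MiB chunk size are assumptions made for illustration, and the tests on this page used the `md5sum` command-line tool rather than this code.

```python
import hashlib


def hash_file(path, algorithm="md5", chunk_size=1 << 20):
    """Return the hex digest of a file, reading it in fixed-size chunks.

    Reading in chunks keeps memory use flat even for very large files.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `hash_file("procmail.log")` would give an md5 digest, and `hash_file("procmail.log", "sha1")` a sha1 digest. Neither says anything about similarity between files, which is what spamsum and ssdeep are designed for.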
== Test 1: Big file ==<br />
<br />
I ran spamsum, ssdeep and md5sum over my 500+ MB procmail.log file, three times each to account for caching effects. Note that procmail.log was live during the test and grew by about 78 KB over the runs (from 562874616 to 562952378 bytes). While this is not benchmark-quality testing, I believe the results are distinct enough to be clear.<br />
<br />
* spamsum averaged about 17 minutes per run, using between 23% and 31% CPU and up to 5 minutes of user time.<br />
* ssdeep averaged about 5 minutes per run, using between 35% and 52% CPU and up to 2 min 20 s of user time.<br />
* md5sum took at most 44 seconds per run, using at most 2% CPU and less than 0.5 seconds of user time.<br />
<br />
=== Results: Raw data ===<br />
<br />
<pre><br />
-rw------- 1 nemo nemo 562874616 Mar 17 08:42 procmail.log<br />
<br />
spamsum procmail.log 294.56s user 4.48s system 31% cpu 15:40.24 total<br />
ssdeep procmail.log 137.96s user 3.97s system 47% cpu 4:57.03 total<br />
md5sum procmail.log 0.41s user 0.15s system 1% cpu 44.168 total<br />
<br />
spamsum procmail.log 284.00s user 4.16s system 28% cpu 16:54.68 total<br />
ssdeep procmail.log 120.85s user 3.84s system 35% cpu 5:52.64 total<br />
md5sum procmail.log 0.31s user 0.13s system 0% cpu 44.070 total<br />
<br />
spamsum procmail.log 264.34s user 4.19s system 23% cpu 19:22.42 total<br />
ssdeep procmail.log 139.07s user 4.04s system 52% cpu 4:33.28 total<br />
md5sum procmail.log 0.47s user 0.16s system 2% cpu 26.408 total<br />
<br />
-rw------- 1 nemo nemo 562952378 Mar 17 09:57 procmail.log<br />
</pre><br />
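The per-tool averages can be rechecked from the raw timing lines. The sketch below assumes the nine timing lines are pasted into a string exactly as shown above. The names `RAW`, `to_seconds` and `mean_wall_times` are hypothetical helpers, not part of any tool used in the test.

```python
import re
from collections import defaultdict

RAW = """\
spamsum procmail.log 294.56s user 4.48s system 31% cpu 15:40.24 total
ssdeep procmail.log 137.96s user 3.97s system 47% cpu 4:57.03 total
md5sum procmail.log 0.41s user 0.15s system 1% cpu 44.168 total
spamsum procmail.log 284.00s user 4.16s system 28% cpu 16:54.68 total
ssdeep procmail.log 120.85s user 3.84s system 35% cpu 5:52.64 total
md5sum procmail.log 0.31s user 0.13s system 0% cpu 44.070 total
spamsum procmail.log 264.34s user 4.19s system 23% cpu 19:22.42 total
ssdeep procmail.log 139.07s user 4.04s system 52% cpu 4:33.28 total
md5sum procmail.log 0.47s user 0.16s system 2% cpu 26.408 total
"""

# One timing line: tool, file, user time, system time, CPU share, wall-clock total.
LINE = re.compile(
    r"^(?P<tool>\S+)\s+\S+\s+(?P<user>[\d.]+)s user\s+(?P<system>[\d.]+)s system\s+"
    r"(?P<cpu>\d+)% cpu\s+(?P<total>[\d:.]+) total$"
)


def to_seconds(stamp):
    """Convert 'SS.sss', 'MM:SS.ss' or 'H:MM:SS' into seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def mean_wall_times(raw):
    """Return {tool: mean wall-clock seconds} over all matching lines."""
    totals = defaultdict(list)
    for line in raw.splitlines():
        match = LINE.match(line.strip())
        if match:
            totals[match["tool"]].append(to_seconds(match["total"]))
    return {tool: sum(times) / len(times) for tool, times in totals.items()}


means = mean_wall_times(RAW)
for tool, mean in means.items():
    ratio = mean / means["md5sum"]
    print(f"{tool}: mean {mean:.1f}s wall-clock, {ratio:.1f}x md5sum")
```

Only wall-clock totals are averaged here. The user-time and CPU columns are captured by the pattern and could be summarised the same way.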
<br />
<br />
== Test 2: Multiple small files ==<br />
TODO: Test over multiple (10,000?) small files, with a more email-like size range (approximately 1k to 50k each).<br />
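One possible way to set this test up is sketched below in Python. Everything in it is an assumption for illustration and none of it has been run as part of the results on this page. The files are filled with random bytes (real email would behave differently), sizes are drawn uniformly from 1 KB to 50 KB, and `hashlib` md5 stands in for the tool under test. Timing spamsum or ssdeep themselves would mean replacing the hashing step with a call to the external program.

```python
import hashlib
import os
import random
import tempfile
import time


def make_small_files(directory, count, min_size=1024, max_size=50 * 1024, seed=0):
    """Create `count` files of random bytes with sizes in [min_size, max_size]."""
    rng = random.Random(seed)  # seeded so the size distribution is repeatable
    paths = []
    for index in range(count):
        path = os.path.join(directory, f"msg{index:05d}.bin")
        with open(path, "wb") as handle:
            handle.write(os.urandom(rng.randint(min_size, max_size)))
        paths.append(path)
    return paths


def time_hashing(paths, algorithm="md5"):
    """Hash every file once; return (elapsed seconds, list of hex digests)."""
    start = time.perf_counter()
    digests = []
    for path in paths:
        with open(path, "rb") as handle:
            digests.append(hashlib.new(algorithm, handle.read()).hexdigest())
    return time.perf_counter() - start, digests


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # The TODO suggests 10,000 files; 1,000 keeps this demo quick.
        files = make_small_files(workdir, count=1000)
        elapsed, _ = time_hashing(files)
        print(f"md5 over {len(files)} files: {elapsed:.2f}s")
```

As in Test 1, each measurement would need to be repeated several times, since the files are likely to still be in the page cache straight after being written.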
<br />
<br />
== Conclusions ==<br />
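Over LARGE files, the spamsum algorithm appears to be roughly an order of magnitude slower than md5sum in wall-clock time: averaged over the three runs, ssdeep took about 8 times as long as md5sum and the original spamsum about 27 times as long.<br />
The original spamsum is itself significantly slower than ssdeep (more than 3 times by the same measure), which has presumably been optimised somewhat in the intervening years.</div>Nemo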