http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&feed=atom&action=history
SpamHash - Revision history
2024-03-29T05:55:52Z
Revision history for this page on the wiki
MediaWiki 1.19.20+dfsg-0+deb7u3
http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=5814&oldid=prev
Nemo: .
2011-03-03T02:32:13Z
<p>.</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 02:32, 3 March 2011</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 84:</td>
<td colspan="2" class="diff-lineno">Line 84:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* At threshold of 75, false positive rate is an unacceptable 1% (approx). Lower thresholds (25 and 50) are so bad as to be not worth further testing</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* At threshold of 75, false positive rate is an unacceptable 1% (approx). Lower thresholds (25 and 50) are so bad as to be not worth further testing</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>== Testing of hash longevity</div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>== Testing of hash longevity<span class="diffchange diffchange-inline"> ==</span></div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>In which I attempt to determine how long it is worthwhile to keep the hash for a given spam. To do so I will have to create a daily (or hourly?) diff to know which hashes were NEW in that period (and save it to a file). Then I test each incoming message against each old file in turn, noting which one it matched. </div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>In which I attempt to determine how long it is worthwhile to keep the hash for a given spam. To do so I will have to create a daily (or hourly?) diff to know which hashes were NEW in that period (and save it to a file). Then I test each incoming message against each old file in turn, noting which one it matched. </div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>=== Prediction ===</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>=== Prediction ===</div></td>
</tr>
</table>
Nemo
http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=5388&oldid=prev
Nemo: future thought and LRU method
2009-11-10T07:03:35Z
<p>future thought and LRU method</p>
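<p>This revision contrasts LRU expiry of hashes with the simpler FIFO approach. As a rough illustration only, here is a minimal Python sketch of an LRU-expiring hash store. The class name <code>LruHashStore</code>, its methods, and the <code>is_similar</code> callback are all hypothetical; <code>is_similar</code> stands in for whatever spamsum comparison and threshold are actually used, and none of this is taken from the SpamHash setup itself.</p>

```python
from collections import OrderedDict


class LruHashStore:
    """Hypothetical in-memory store of spam hashes with LRU expiry."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Keys are hashes; order runs from least to most recently used.
        self._hashes = OrderedDict()

    def add(self, spam_hash):
        """Record a hash as freshly used, evicting the stalest if over capacity."""
        self._hashes[spam_hash] = True
        self._hashes.move_to_end(spam_hash)
        while len(self._hashes) > self.capacity:
            self._hashes.popitem(last=False)  # drop least recently used

    def match(self, spam_hash, is_similar):
        """Scan from most to least recently used; refresh and return the first hit."""
        for known in list(reversed(self._hashes)):
            if is_similar(known, spam_hash):
                self._hashes.move_to_end(known)  # the in-memory 'touch'
                return known
        return None

    def hashes(self):
        """Return stored hashes, least recently used first."""
        return list(self._hashes)
```

<p>With a capacity of two, adding "a" then "b", matching "a", and then adding "c" evicts "b". A FIFO store would have evicted "a" instead, even though it had just matched.</p>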
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 07:03, 10 November 2009</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 92:</td>
<td colspan="2" class="diff-lineno">Line 92:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>== hash corpus size ==</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>== hash corpus size ==</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>How large of a hashDB is it worth keeping? noting that performance is O(n). (see Kornblum's paper linked below)</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>How large of a hashDB is it worth keeping? noting that performance is O(n). (see Kornblum's paper linked below)</div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td colspan="2" class="diff-lineno">Line 125:</td>
<td colspan="2" class="diff-lineno">Line 124:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** simulate bayesian results when self-trained on a honeypot?</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** simulate bayesian results when self-trained on a honeypot?</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>*** will bayesian do well without any ham training? or when trained multiple times on repeat messages?</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>*** will bayesian do well without any ham training? or when trained multiple times on repeat messages?</div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div><span class="diffchange diffchange-inline">* measuring how long a particular spam message is sent for? ie, for what timeframe is a SpamHash usefully valid for?</span></div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div></div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div><span class="diffchange diffchange-inline">**</span> <span class="diffchange diffchange-inline">Generate</span> <span class="diffchange diffchange-inline">a spamhash on each message in a corpus (one email per file, and maybe ignore headers?) and store the message date and spamhash into a database. Analyse.</span></div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div><span class="diffchange diffchange-inline">==</span> <span class="diffchange diffchange-inline">Future</span> <span class="diffchange diffchange-inline">==</span></div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div><span class="diffchange diffchange-inline">***</span> <span class="diffchange diffchange-inline">Expectation</span> <span class="diffchange diffchange-inline">from</span> <span class="diffchange diffchange-inline">prior</span> <span class="diffchange diffchange-inline">testing:</span> a <span class="diffchange diffchange-inline">spam</span> <span class="diffchange diffchange-inline">is</span> <span class="diffchange diffchange-inline">repeated for</span> a week or <span class="diffchange diffchange-inline">so</span> <span class="diffchange diffchange-inline">and</span> <span class="diffchange diffchange-inline">then</span> <span class="diffchange diffchange-inline">dropped.</span> <span class="diffchange diffchange-inline">Holding</span> <span class="diffchange diffchange-inline">a</span> <span class="diffchange diffchange-inline">spamhash</span> <span class="diffchange diffchange-inline">for</span> <span class="diffchange diffchange-inline">a</span> <span class="diffchange diffchange-inline">month</span> <span class="diffchange diffchange-inline">is</span> <span class="diffchange diffchange-inline">more</span> <span class="diffchange diffchange-inline">than</span> <span class="diffchange diffchange-inline">sufficient</span>. <span class="diffchange diffchange-inline">(unfortunately</span>,<span class="diffchange diffchange-inline"> flat text file handling means</span> FIFO <span class="diffchange diffchange-inline">method</span> <span class="diffchange diffchange-inline">of</span> <span class="diffchange diffchange-inline">dropping old hashes. A more formal database would</span> be <span class="diffchange diffchange-inline">able to use a LRU filter</span>.<span class="diffchange diffchange-inline"> </span></div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div><span class="diffchange diffchange-inline">Whilst</span> <span class="diffchange diffchange-inline">each</span> <span class="diffchange diffchange-inline">hash</span> <span class="diffchange diffchange-inline">may</span> <span class="diffchange diffchange-inline">only have</span> a <span class="diffchange diffchange-inline">useful</span> <span class="diffchange diffchange-inline">life</span> <span class="diffchange diffchange-inline">of</span> a week or <span class="diffchange diffchange-inline">two</span> <span class="diffchange diffchange-inline">(expected</span> <span class="diffchange diffchange-inline">so</span> <span class="diffchange diffchange-inline">anyway),</span> <span class="diffchange diffchange-inline">the</span> <span class="diffchange diffchange-inline">best</span> <span class="diffchange diffchange-inline">method</span> <span class="diffchange diffchange-inline">would</span> <span class="diffchange diffchange-inline">be</span> <span class="diffchange diffchange-inline">to</span> <span class="diffchange diffchange-inline">expire</span> <span class="diffchange diffchange-inline">hashes</span> <span class="diffchange diffchange-inline">via</span> <span class="diffchange diffchange-inline">LRU</span>. <span class="diffchange diffchange-inline">Instead</span>, FIFO <span class="diffchange diffchange-inline">is</span> <span class="diffchange diffchange-inline">likely</span> <span class="diffchange diffchange-inline">to</span> be <span class="diffchange diffchange-inline">easier</span>.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>* LRU could be handled with current software (procmail and spamsum) as follows: generate .hash files with one hash in each, then test each incoming mail against each .hash file _separately_, from newest to oldest, and 'touch' the file that matched. This, however, is a horribly, horribly inefficient method and has no real-world usefulness ;)</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
</table>
Nemo
http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=5387&oldid=prev
Nemo: expand a bit
2009-11-10T03:44:53Z
<p>expand a bit</p>
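<p>This revision adds a plan for testing hash longevity: keep a per-day record of newly seen hashes, test each incoming message against the older records in turn, and note which one matched. A minimal Python sketch of that bookkeeping follows. The function name <code>match_age_histogram</code>, the data layout, and the <code>is_similar</code> callback (a stand-in for a spamsum comparison against a threshold) are assumptions made for illustration, not the author's actual procedure.</p>

```python
from collections import Counter
from datetime import date


def match_age_histogram(daily_new_hashes, incoming, is_similar):
    """Count, per age in days, how many incoming messages matched a stored hash.

    daily_new_hashes: dict mapping a date to the set of hashes first seen that day.
    incoming: iterable of (date_received, message_hash) pairs.
    is_similar: callable(known_hash, message_hash) returning True on a match.
    Messages that match nothing are counted under the key None.
    """
    ages = Counter()
    for seen_on, msg_hash in incoming:
        age = None
        # Walk the daily records from newest to oldest, skipping future days.
        candidates = sorted((d for d in daily_new_hashes if seen_on >= d), reverse=True)
        for day in candidates:
            if any(is_similar(known, msg_hash) for known in daily_new_hashes[day]):
                age = (seen_on - day).days
                break
        ages[age] += 1
    return ages
```

<p>If the prediction in this revision holds, the resulting counts should fall off sharply for ages beyond about a week.</p>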
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 03:44, 10 November 2009</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 83:</td>
<td colspan="2" class="diff-lineno">Line 83:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>=== Conclusion summary ===</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>=== Conclusion summary ===</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* At threshold of 75, false positive rate is an unacceptable 1% (approx). Lower thresholds (25 and 50) are so bad as to be not worth further testing</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* At threshold of 75, false positive rate is an unacceptable 1% (approx). Lower thresholds (25 and 50) are so bad as to be not worth further testing</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>== Testing of hash longevity</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>In which I attempt to determine how long it is worthwhile to keep the hash for a given spam. To do so I will have to create a daily (or hourly?) diff to know which hashes were NEW in that period (and save it to a file). Then I test each incoming message against each old file in turn, noting which one it matched. </div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>=== Prediction ===</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>I expect most caught spam will be similar to a message from that day, or the day or two previously. I expect to see a strong dropoff of similarity after about a week, and as such, that it will be unlikely to be worthwhile keeping a hash for more than approx two weeks.</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>* note that in 2005(?) I ran similar spam similarity tests based on subject line alone, and found that common subject lines rarely persisted longer than about 5 days. My prediction here is based on that information. </div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>== hash corpus size ==</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>How large of a hashDB is it worth keeping? noting that performance is O(n). (see Kornblum's paper linked below)</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td colspan="2" class="diff-lineno">Line 91:</td>
<td colspan="2" class="diff-lineno">Line 101:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Greater efficiency: spamsum as a daemon?</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Greater efficiency: spamsum as a daemon?</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Different type of use: could the algorithm be altered to produce a hash which can validate partial messages? That way if a message was 50% received, but already was matching 100% score for that amount of data, we could close the connection early? (would development and in-use resource overhead be worth it just to move the scoring to the SMTP "DATA" phase?)</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Different type of use: could the algorithm be altered to produce a hash which can validate partial messages? That way if a message was 50% received, but already was matching 100% score for that amount of data, we could close the connection early? (would development and in-use resource overhead be worth it just to move the scoring to the SMTP "DATA" phase?)</div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>* should all spams caught be spamsum'd also - ensuring a wider net? performance overhead with a blossoming hashDB? Benefit = not-<span class="diffchange diffchange-inline">quoite</span>-close-enough to be caught spams may be close enough to other messages caught.</div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>* should all spams caught be spamsum'd also - ensuring a wider net? performance overhead with a blossoming hashDB? Benefit = not-<span class="diffchange diffchange-inline">quite</span>-close-enough to be caught spams may be close enough to other messages caught.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Because this naturally blocks self-similar spam, might this in fact work AGAINST the effectiveness of downstream bayesian filters?</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Because this naturally blocks self-similar spam, might this in fact work AGAINST the effectiveness of downstream bayesian filters?</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** I think (hope) not. Or not too much anyway. Bayesian notices fine-grained similarities, whilst spamsum only notices entire-content sized similarities. </div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** I think (hope) not. Or not too much anyway. Bayesian notices fine-grained similarities, whilst spamsum only notices entire-content sized similarities. </div></td>
</tr>
</table>
Nemo
http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=5329&oldid=prev
Nemo: Reverted edits by 58.16.7.49 (Talk) to last revision by Nemo
2009-09-25T04:04:19Z
<p>Reverted edits by <a href="/wiki/Special:Contributions/58.16.7.49" title="Special:Contributions/58.16.7.49">58.16.7.49</a> (<a href="/mediawiki/index.php?title=User_talk:58.16.7.49&action=edit&redlink=1" class="new" title="User talk:58.16.7.49 (page does not exist)">Talk</a>) to last revision by <a href="/wiki/User:Nemo" title="User:Nemo">Nemo</a></p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 04:04, 25 September 2009</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 10:</td>
<td colspan="2" class="diff-lineno">Line 10:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* spam hash!</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* spam hash!</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div><span class="diffchange diffchange-inline">3dPfRE</span> <span class="diffchange diffchange-inline"><a</span> <span class="diffchange diffchange-inline">href="http://wjlkjrfrqihc.com/">wjlkjrfrqihc</a>,</span> <span class="diffchange diffchange-inline">[url=http://sbeisiwgkcdb.com/]sbeisiwgkcdb[/url],</span> <span class="diffchange diffchange-inline">[link</span>=<span class="diffchange diffchange-inline">http://hejzruatymur.com/]hejzruatymur[/link], http://aexyzsqcdeye.com/</span></div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div><span class="diffchange diffchange-inline">=</span> <span class="diffchange diffchange-inline">NADS:</span> <span class="diffchange diffchange-inline">Nemo's</span> <span class="diffchange diffchange-inline">Approach</span> <span class="diffchange diffchange-inline">to Destroying Spam</span> =</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>That is, this is a NemProject, named in accordance with [[NINS]].</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>(currently I use 'NADS' and 'SpamHash' fairly interchangeably. If in the future I work on any other antispam measures, then they would be part of the NADS family alongside SpamHash. At this time however, SpamHash is the only member in the family.)</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>= What does it do? =</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>= What does it do? =</div></td>
</tr>
</table>
Nemo
http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=5327&oldid=prev
58.16.7.49: /* NADS: Nemo's Approach to Destroying Spam */
2009-09-25T00:16:23Z
<p><span dir="auto"><span class="autocomment">NADS: Nemo's Approach to Destroying Spam</span></span></p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 00:16, 25 September 2009</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 10:</td>
<td colspan="2" class="diff-lineno">Line 10:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* spam hash!</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* spam hash!</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div><span class="diffchange diffchange-inline">=</span> <span class="diffchange diffchange-inline">NADS:</span> <span class="diffchange diffchange-inline">Nemo's</span> <span class="diffchange diffchange-inline">Approach</span> <span class="diffchange diffchange-inline">to Destroying Spam</span> =</div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div><span class="diffchange diffchange-inline">3dPfRE</span> <span class="diffchange diffchange-inline"><a</span> <span class="diffchange diffchange-inline">href="http://wjlkjrfrqihc.com/">wjlkjrfrqihc</a>,</span> <span class="diffchange diffchange-inline">[url=http://sbeisiwgkcdb.com/]sbeisiwgkcdb[/url],</span> <span class="diffchange diffchange-inline">[link</span>=<span class="diffchange diffchange-inline">http://hejzruatymur.com/]hejzruatymur[/link], http://aexyzsqcdeye.com/</span></div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>That is, this is a NemProject, named in accordance with [[NINS]].</div></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>(currently I use 'NADS' and 'SpamHash' fairly interchangeably. If in the future I work on any other antispam measures, then they would be part of the NADS family alongside SpamHash. At this time however, SpamHash is the only member in the family.)</div></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>= What does it do? =</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>= What does it do? =</div></td>
</tr>
</table>
58.16.7.49
http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=4601&oldid=prev
Nemo: minor clarify on the ham test
2009-04-07T01:28:21Z
<p>minor clarify on the ham test</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 01:28, 7 April 2009</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 79:</td>
<td colspan="2" class="diff-lineno">Line 79:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>== Self-training on 100% ham corpus, naive start ==</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>== Self-training on 100% ham corpus, naive start ==</div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>In which I feed a ham corpus into a spamsum setup. Simulated <span class="diffchange diffchange-inline">over 15</span> years of ham</div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>In which I feed a ham corpus into a spamsum setup. Simulated <span class="diffchange diffchange-inline">11</span> years of ham</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Writeup here: [[SpamHash/NaiveHamTest]]</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Writeup here: [[SpamHash/NaiveHamTest]]</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>=== Conclusion summary ===</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>=== Conclusion summary ===</div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>* At threshold of 75, false positive rate is an unacceptable 1% (approx). Lower thresholds (25 and 50) are so bad as to be not worth testing</div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>* At threshold of 75, false positive rate is an unacceptable 1% (approx). Lower thresholds (25 and 50) are so bad as to be not worth<span class="diffchange diffchange-inline"> further</span> testing</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
</table>
Nemo
http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=4600&oldid=prev
Nemo: compare with DCC
2009-04-07T00:52:39Z
<p>compare with DCC</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 00:52, 7 April 2009</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 105:</td>
<td colspan="2" class="diff-lineno">Line 105:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Check the spampot results for the timeframe that similar emails show up. (graph subject lines against dates?)</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Check the spampot results for the timeframe that similar emails show up. (graph subject lines against dates?)</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* compare with razor/pyzor which uses an internal hash too! !!!</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* compare with razor/pyzor which uses an internal hash too! !!!</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>* Also compare with Distributed Checksum Clearinghouse: http://www.rhyolite.com/dcc/</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* investigate feasibility of datamining the spampot for servers that could be dynamically blacklisted, as well as generating dynamic header_checks and body_checks. Caution: apply such dynamism with extreme caution!</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* investigate feasibility of datamining the spampot for servers that could be dynamically blacklisted, as well as generating dynamic header_checks and body_checks. Caution: apply such dynamism with extreme caution!</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* graph the delivery rate to both caches (this should show the spamsum cache delivery rate drop over time and level off, giving a good idea of an appropriate retention policy)</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* graph the delivery rate to both caches (this should show the spamsum cache delivery rate drop over time and level off, giving a good idea of an appropriate retention policy)</div></td>
</tr>
</table>Nemo http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=4591&oldid=prev Nemo: compare with pyzor 2009-04-06T01:00:55Z<p>compare with pyzor</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 01:00, 6 April 2009</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 104:</td>
<td colspan="2" class="diff-lineno">Line 104:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** Expectation: memory use? No idea (spamsum does not have to hold the entire file in memory to generate the spamsum, but if the input is STDIN, it has to store it to seek within it!). CPU use will get relatively high (yay data processing), and throughput will be limited by disk I/O. (When run through procmail, the file is piped in: from disk or from RAM?) Spamsum does take longer than MD5 and similar cryptographic hashes, due to the nature of the hash generated. Comparing a message against a list of n hashes in hashDB takes O(n) time. (See performance details in Kornblum's paper linked below.)</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** Expectation: memory use? No idea (spamsum does not have to hold the entire file in memory to generate the spamsum, but if the input is STDIN, it has to store it to seek within it!). CPU use will get relatively high (yay data processing), and throughput will be limited by disk I/O. (When run through procmail, the file is piped in: from disk or from RAM?) Spamsum does take longer than MD5 and similar cryptographic hashes, due to the nature of the hash generated. Comparing a message against a list of n hashes in hashDB takes O(n) time. (See performance details in Kornblum's paper linked below.)</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Check the spampot results for the timeframe that similar emails show up. (graph subject lines against dates?)</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Check the spampot results for the timeframe that similar emails show up. (graph subject lines against dates?)</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>* compare with razor/pyzor, which also use an internal hash!</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* investigate feasibility of datamining the spampot for servers that could be dynamically blacklisted, as well as generating dynamic header_checks and body_checks. Caution: apply such dynamism with extreme caution!</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* investigate feasibility of datamining the spampot for servers that could be dynamically blacklisted, as well as generating dynamic header_checks and body_checks. Caution: apply such dynamism with extreme caution!</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* graph the delivery rate to both caches (this should show the spamsum cache delivery rate drop over time and level off, giving a good idea of an appropriate retention policy)</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* graph the delivery rate to both caches (this should show the spamsum cache delivery rate drop over time and level off, giving a good idea of an appropriate retention policy)</div></td>
</tr>
<tr>
<td colspan="2" class="diff-lineno">Line 127:</td>
<td colspan="2" class="diff-lineno">Line 128:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** http://ssdeep.sourceforge.net/ - unlike spamsum which is effectively abandoned, this is in '''active development''' (and even in Debian!) :)</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** http://ssdeep.sourceforge.net/ - unlike spamsum which is effectively abandoned, this is in '''active development''' (and even in Debian!) :)</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** http://www.forensicswiki.org/wiki/Ssdeep - Discussion of ssdeep as a forensics tool</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>** http://www.forensicswiki.org/wiki/Ssdeep - Discussion of ssdeep as a forensics tool</div></td>
</tr>
<tr>
<td colspan="2" class="diff-empty"> </td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>** pyzor - http://pyzor.sourceforge.net/</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>[[Category:NemProject]]</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>[[Category:NemProject]]</div></td>
</tr>
</table>Nemo http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=4590&oldid=prev Nemo: update HamTest summary 2009-04-06T00:51:28Z<p>update HamTest summary</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 00:51, 6 April 2009</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 82:</td>
<td colspan="2" class="diff-lineno">Line 82:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Writeup here: [[SpamHash/NaiveHamTest]]</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Writeup here: [[SpamHash/NaiveHamTest]]</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>=== Conclusion summary ===</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>=== Conclusion summary ===</div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>* <span class="diffchange diffchange-inline">A</span> threshold <span class="diffchange diffchange-inline">25</span>, <span class="diffchange diffchange-inline">approx</span> <span class="diffchange diffchange-inline">50%</span> <span class="diffchange diffchange-inline">of</span> <span class="diffchange diffchange-inline">incoming</span> <span class="diffchange diffchange-inline">messages</span> <span class="diffchange diffchange-inline">are</span> <span class="diffchange diffchange-inline">caught!</span> (<span class="diffchange diffchange-inline">horrifically</span> <span class="diffchange diffchange-inline">bad</span>)</div></td>
<td class="diff-marker">+</td>
<td style="background: #cfc; color:black; font-size: smaller;"><div>* <span class="diffchange diffchange-inline">At</span> threshold <span class="diffchange diffchange-inline">of 75</span>, <span class="diffchange diffchange-inline">false</span> <span class="diffchange diffchange-inline">positive</span> <span class="diffchange diffchange-inline">rate</span> <span class="diffchange diffchange-inline">is</span> <span class="diffchange diffchange-inline">an</span> <span class="diffchange diffchange-inline">unacceptable</span> <span class="diffchange diffchange-inline">1%</span> (<span class="diffchange diffchange-inline">approx).</span> <span class="diffchange diffchange-inline">Lower thresholds (25 and 50</span>)<span class="diffchange diffchange-inline"> are so bad as to be not worth testing</span></div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"></td>
</tr>
</table>Nemo http://wiki.thorx.net/mediawiki/index.php?title=SpamHash&diff=4586&oldid=prev Nemo: -hamtest thoughts 2009-04-04T23:52:51Z<p>-hamtest thoughts</p>
<table class='diff diff-contentalign-left'>
<col class='diff-marker' />
<col class='diff-content' />
<col class='diff-marker' />
<col class='diff-content' />
<tr valign='top'>
<td colspan='2' style="background-color: white; color:black;">← Older revision</td>
<td colspan='2' style="background-color: white; color:black;">Revision as of 23:52, 4 April 2009</td>
</tr><tr>
<td colspan="2" class="diff-lineno">Line 91:</td>
<td colspan="2" class="diff-lineno">Line 91:</td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Greater efficiency: spamsum as a daemon?</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Greater efficiency: spamsum as a daemon?</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Different type of use: could the algorithm be altered to produce a hash which can validate partial messages? That way, if a message was 50% received but already matched with a 100% score for that amount of data, we could close the connection early. (Would the development and in-use resource overhead be worth it just to move the scoring to the SMTP "DATA" phase?)</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Different type of use: could the algorithm be altered to produce a hash which can validate partial messages? That way, if a message was 50% received but already matched with a 100% score for that amount of data, we could close the connection early. (Would the development and in-use resource overhead be worth it just to move the scoring to the SMTP "DATA" phase?)</div></td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>* Is a spampot even necessary? Couldn't this simply be run on a complete email dataset? After all, it works by allowing through the first instance of every unique email anyway, and ham tends to be relatively unique, whilst spam tends to come in repetitive sets...</div></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>** Yes... in simple testing, simply quoting an email in response makes it quite dissimilar, and their reply (which should be the next that spamsum sees) will have two levels of reply! (TODO: get numbers)</div></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>** TODO: test simply by feeding a week's corpus of ALL my regular email through spamsum, simulating this.</div></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>*** Do this twice: Once naively, once with pre-learnt hashDB from the spampot</div></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>*** Then do it another way: over a known 100% ham corpus? (save a corpus of ham messages to MH or maildir format)</div></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker">−</td>
<td style="background: #ffa; color:black; font-size: smaller;"><div>*** Expectation: this will be effective, except possibly for email memes. (If the same funny picture is sent to you twice, even by different people, both copies will be base64 encoded the same and thus show up as being EXTREMELY similar; how common this is should show up in the 100% ham corpus test.)</div></td>
<td colspan="2" class="diff-empty"> </td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Should all spams caught be spamsum'd also, ensuring a wider net? Performance overhead with a blossoming hashDB? Benefit: spams not quite close enough to be caught may be close enough to other messages caught.</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Should all spams caught be spamsum'd also, ensuring a wider net? Performance overhead with a blossoming hashDB? Benefit: spams not quite close enough to be caught may be close enough to other messages caught.</div></td>
</tr>
<tr>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Because this naturally blocks self-similar spam, might this in fact work AGAINST the effectiveness of downstream Bayesian filters?</div></td>
<td class="diff-marker"> </td>
<td style="background: #eee; color:black; font-size: smaller;"><div>* Because this naturally blocks self-similar spam, might this in fact work AGAINST the effectiveness of downstream Bayesian filters?</div></td>
</tr>
</table>Nemo