MB-01/Implementation

From ThorxWiki
Latest revision as of 02:04, 25 October 2007

At the moment, Moby is a first-order Markov chaining bot.

After some consideration, I *don't* need an SQL backend for a Markov bot. I don't even need a DBM backend. I can make do with a text-file of tab-separated values. The only three columns I need are "current word", "next word", and "speaker".
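As a rough illustration of how little machinery the tab-separated format needs, here is a minimal sketch of reading and writing such a file. The function names `dump_rows` and `parse_rows` are made up for this example and are not Moby's actual code; it also assumes that words never contain tabs or newlines.

```python
def dump_rows(rows):
    """Serialise (current word, next word, speaker) tuples as tab-separated lines."""
    return "".join("\t".join(row) + "\n" for row in rows)


def parse_rows(text):
    """Parse tab-separated lines back into a list of 3-tuples."""
    rows = []
    for line in text.split("\n"):
        if not line:
            continue  # skip the empty string left after the final newline
        curr_word, next_word, speaker = line.split("\t")
        rows.append((curr_word, next_word, speaker))
    return rows


if __name__ == "__main__":
    rows = [("roses", "are", "Screwtape"), ("roses", "stink.", "jordanb")]
    text = dump_rows(rows)
    print(text, end="")
    print(parse_rows(text) == rows)
```

Writing `dump_rows(rows)` to a file and reading it back through `parse_rows` round-trips the data without any database layer.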

At first, I stored these as an ordinary Python list of Python tuples. This worked well up until about 40,000 rows, at which point my 300 MHz PC spent about ten seconds at full CPU usage to generate one chain.

Then I had the bright idea of putting the data in a DBM file, and saving much time and energy, and re-using code from dagny's infobot module. I quickly added code to add rows to the DBM file, started Moby back up, and waited.

Next morning, I woke up to find a 78MB data file, and Moby crashed with an unspecified error (the tab-delimited file, stored on disk, is only 1.1MB now, and 302k zipped).

Currently, I still store data tab-delimited on disk and in a Python list in memory, but now I have two hash-tables - one keyed off "current word" and one keyed off "next word" - to help speed the chaining process. It looks something like this:

self.dataByCurrWord["roses"] = [
       ("roses","are","Screwtape"),
       ("roses","stink.","jordanb"),
       ("roses","seem","bbz")
]

where [] denotes a Python list, and () denotes a Python tuple.
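A minimal sketch of how two such hash-tables could be built from the in-memory list of rows. The helper name `build_indexes` and the use of `collections.defaultdict` are assumptions made for this example, not necessarily how Moby does it.

```python
from collections import defaultdict


def build_indexes(rows):
    """Index (current word, next word, speaker) tuples two ways.

    Returns (by_curr_word, by_next_word); each maps a word to the list
    of full rows that have that word in the corresponding column.
    """
    by_curr_word = defaultdict(list)
    by_next_word = defaultdict(list)
    for row in rows:
        curr_word, next_word, _speaker = row
        by_curr_word[curr_word].append(row)
        by_next_word[next_word].append(row)
    return by_curr_word, by_next_word


if __name__ == "__main__":
    rows = [
        ("roses", "are", "Screwtape"),
        ("roses", "stink.", "jordanb"),
        ("roses", "seem", "bbz"),
    ]
    by_curr_word, by_next_word = build_indexes(rows)
    print(by_curr_word["roses"])
    print(by_next_word["are"])
```

Looking up a word is then a single dictionary access instead of a scan over every row, which is where the speed-up over the plain list comes from.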

For the most part, Markov chains, even chains for a particular person (an un-indexed column), are instant. Sometimes there is lag; I suspect that is due either to Moby saving his current data to disk (still an expensive operation, which occurs every 60 seconds) or to Moby being swapped to disk at that point.

For "markov by <speaker>", each time Moby builds a list of potential next words, he goes through that list and filters out words that weren't spoken by <speaker> before choosing a next word. A mite slower, but meh.
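The chaining loop with the per-speaker filter might look roughly like the sketch below. Several things here are assumptions for the sake of a runnable example: the function name `generate_chain`, the `max_words` cap, and the convention that an empty string marks both the start and the end of a line.

```python
import random
from collections import defaultdict

BOUNDARY = ""  # assumption: empty word marks the start and end of a line


def generate_chain(by_curr_word, speaker=None, max_words=50):
    """Walk the current-word index, optionally keeping only one speaker's rows."""
    words = []
    current = BOUNDARY
    while len(words) < max_words:
        candidates = by_curr_word.get(current, [])
        if speaker is not None:
            # The speaker column is not indexed, so filter the candidates here.
            candidates = [row for row in candidates if row[2] == speaker]
        if not candidates:
            break
        _curr, next_word, _speaker = random.choice(candidates)
        if next_word == BOUNDARY:
            break
        words.append(next_word)
        current = next_word
    return " ".join(words)


if __name__ == "__main__":
    rows = [
        ("", "roses", "Screwtape"), ("roses", "are", "Screwtape"),
        ("are", "red", "Screwtape"), ("red", "", "Screwtape"),
        ("", "roses", "jordanb"), ("roses", "stink.", "jordanb"),
        ("stink.", "", "jordanb"),
    ]
    by_curr_word = defaultdict(list)
    for row in rows:
        by_curr_word[row[0]].append(row)
    print(generate_chain(by_curr_word))
    print(generate_chain(by_curr_word, speaker="jordanb"))
```

Because the filter only touches the handful of rows sharing the current word, the extra cost per step stays small even though the speaker column has no index of its own.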
