MB-01/Implementation
Revision as of 22:29, 28 April 2002
At the moment, Moby is a first-order Markov chaining bot.
After some consideration, I *don't* need an SQL backend for a Markov bot. I don't even need a DBM backend. I can make do with a text-file of tab-separated values. The only three columns I need are "current word", "next word", and "speaker".
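As a rough sketch of what reading and writing such a file could look like (this is not Moby's actual code; the function names `parse_rows` and `format_rows` are made up for illustration, and using an empty string to mark the start or end of a chain is an assumption):

```python
def parse_rows(lines):
    """Turn tab-separated lines into (current word, next word, speaker) tuples.

    Assumption: an empty "current word" marks the start of a chain and an
    empty "next word" marks the end, so only the newline is stripped
    (a plain strip() would eat a leading tab).
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:  # silently skip malformed lines
            rows.append(tuple(fields))
    return rows


def format_rows(rows):
    """Turn the tuples back into tab-separated lines, one per row."""
    return ["\t".join(row) + "\n" for row in rows]
```

Both functions work on any iterable of lines, so an open file object can be passed straight to `parse_rows`, and the result of `format_rows` can be handed to a file's `writelines`.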
For the moment, I'm storing these as an ordinary Python list of Python tuples. This worked well up until about 40,000 rows, when my 300MHz PC spent about ten seconds of full CPU usage to generate one chain.
Then I had the bright idea of putting the data in a DBM file, and saving much time and energy, and re-using code from dagny's infobot module. I quickly added code to add rows to the DBM file, started Moby back up, and waited.
The next morning, I woke up to find a 78MB data file, and Moby had crashed with an unspecified error. (For comparison, the tab-delimited file, stored on disk, is only 1.1MB now, and 302k zipped.)
Currently, I still store data tab-delimited on disk and in a Python list in memory, but now I have two hash-tables - one keyed off "current word" and one keyed off "next word" - to help speed the chaining process. It looks something like this:
 self.dataByCurrWord["roses"] = [
     ("roses", "are", "Screwtape"),
     ("roses", "stink.", "jordanb"),
     ("roses", "seem", "bbz"),
 ]
where [] denotes a Python list, and () denotes a Python tuple.
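Building the two hash-tables from the flat list might look roughly like the sketch below. It is an illustration rather than Moby's real code: `build_indexes` and the local names are hypothetical, and `collections.defaultdict` is simply a convenient modern way to get "a dict of lists".

```python
from collections import defaultdict


def build_indexes(rows):
    """Index (current word, next word, speaker) tuples two ways.

    Returns two dicts of lists: one keyed off the current word and one
    keyed off the next word. The tuples themselves are shared, not copied.
    """
    by_curr_word = defaultdict(list)
    by_next_word = defaultdict(list)
    for row in rows:
        curr_word, next_word, _speaker = row
        by_curr_word[curr_word].append(row)
        by_next_word[next_word].append(row)
    return by_curr_word, by_next_word
```

Looking up every row that follows "roses" is then a single dictionary access instead of a scan over the whole list, which is where the speed-up comes from.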
For the most part, Markov chains, even chains for a particular person (an un-indexed column), are instant. Sometimes there is lag; I suspect that is due to Moby saving his current data to disk (still an expensive operation, which occurs every 60 seconds), or Moby being swapped to disk at that point.
For "markov by <speaker>", each time Moby builds a list of potential next words, he then goes through that list and filters out words that weren't spoken by <speaker> before choosing a next word. A mite slower, but meh.
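The chaining loop, including the per-speaker filter, could be sketched like this. Again, this is a guess at the shape of the thing rather than the real implementation: `generate_chain`, the `max_words` cap, and the use of an empty string as the start and end marker are all assumptions.

```python
import random


def generate_chain(by_curr_word, speaker=None, max_words=50):
    """Walk the "current word" index to build one first-order Markov chain.

    by_curr_word maps a word to a list of (current, next, speaker) tuples.
    Assumptions: "" as the current word marks the start of a chain, ""
    as the next word marks the end, and max_words guards against loops.
    """
    words = []
    current = ""
    while len(words) < max_words:
        candidates = by_curr_word.get(current, [])
        if speaker is not None:
            # The speaker column is not indexed, so filter the short
            # candidate list on every step.
            candidates = [row for row in candidates if row[2] == speaker]
        if not candidates:
            break
        _curr, next_word, _who = random.choice(candidates)
        if next_word == "":
            break
        words.append(next_word)
        current = next_word
    return " ".join(words)
```

Because the filter only ever touches the rows for one current word, the extra cost per step stays small even when the whole data set is large.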

