Hello all, I'm building a Markov chain handler with a couple of twists over the usual naive handlers. First of all, I consider both words and non-words (separators), so the string scanner uses the regexp /\w+|\W/u This gives much better handling of punctuation and so, but it has the downside that I need twice the order of a naive handler to get the same context (word-punctuation- word-punctuation instead of word-word). Secondly, it's bidirectional, so it can create a sentence by starting from the middle and adding both 'next' and 'previous' words. This obviously doubles the memory requirements, as we store two chains instead of one. My prototype implementation is available on http://oblomov.dnsalias.org/git in the rbot-mark.git repository. The structure is rather simple. I have a ChanceHash class to abstract the collection of possible next (or previous) words, with corresponding chances, and the MarkovChainer class which abstracts the engine: the chains ar stored in @mkv, which is a simple hash where values are {:prev => ChanceHash, :next => ChanceHash} and keys are arrays of symbols. nil is used to mark the begin/end of a sentence. Although this works remarkably well, there are huge memory issues: learning my test text (210K+ words, with 20K+ unique words) is unfeasible for orders > 4 (at order 4 it takes about half a gigabyte of memory). Now, the final application (this is a plugin for the ircbot rbot) will store the data in a bdb, which will enormously ease the memory load, but I was wondering if it was possible to optimize memory usage in a way that would allow the code to be used 'live' without some on-disk backend, while still being able to go up to order 6 (at least). Any suggestions?

on 2007-07-30 16:16

on 2007-07-30 17:28

Oblomov wrote: > I need twice the order of a naive handler to get the same context > in the rbot-mark.git repository. > > optimize memory usage in a way that would allow the code to be used > 'live' without > some on-disk backend, while still being able to go up to order 6 (at > least). > > Any suggestions? Well, I'm not an expert on natural language processing by any stretch of the imagination, so I don't know if this sort of processing is normal in that line of work, and if it is, how the memory requirements grow with the order of Markov chain you're using. For openers, though, you might want to find a C++ or Java library that does what you're trying to do *efficiently* and interface your Ruby code to one, using regular Ruby for a C++ library or jRuby for a Java library. There is lots of open source NLP code out there in both languages, so it shouldn't be too difficult to find something.

on 2007-07-30 20:17

Dear Giuseppe, > Although this works remarkably well, there are huge memory issues: > learning my test > text (210K+ words, with 20K+ unique words) is unfeasible for orders > > 4 (at order 4 > it takes about half a gigabyte of memory). If you have 20000 unique words and want to represent the probabilities of an n-chain of them, you'll need (20000)^n values, i.e., 1.6*10^17 for n=4, so the sparsity of your data is such that there is maybe one in a billion 4-chains actually represented in your probability hash ( a hash entry will actually need more than a byte, and that increases the sparsity). So I don't think you'll be able to reduce the need in memory a great deal ... I could think of two things (not necessarily disjunct). 1.) Group individual words together to get classes of words, and calculate the probabilities for those fewer classes. I'm not quite sure how well this will work for you, but I'd try to use some kind of Huffman coding concept on the n-chains: http://en.wikipedia.org/wiki/Huffman_coding#n-ary_... I'd then break off building the tree somewhere and put the remaining words into a class of seldom words ... This would introduce some overhead of getting back to words from the classes, though, and there will be many words that have different meanings that are nevertheless equally seldom :( 2.) Be less strict with the probability values -- you could say that the probability of any particular 6-chain is the product of the probabilities of the five successive two-chains involved in it (independence hypothesis), unless you can reject that hypothesis by a statistical test (I'd use the chi-square test - the Ruby code can be found in the package statistics2 by Shin-ichiro Hara). So keep only the two-chain probabilities and those longer chain probabilities which deviate considerably from the independence hypothesis, leading to rejection in the chi-square test. You can influence the amount of data thus created by changing the alpha value of the test. There is a book by Jeremy Spinrad about "Efficient Graph Representations" where you can find further ideas about how to represent a graph . And, something else: What is the probability of an n-chain of words that wasn't actually encountered in the training set ? It's not zero, even though the Hash created from the training data says so ..... If you think of the incidence matrix A=(a_ij) representing the probabilities of finding a 2-chain of words i-j, all entries with no such Markov chain are zero. Now, in a technique called "Latent Semantic Analysis" http://en.wikipedia.org/wiki/Latent_semantic_indexing this matrix is reconstructed from a product of matrices of lower rank, introducing non-zero entries in the previously zero entries. Thus all of a sudden, there is a small probability of finding that particular chain now. I join Ed Borasky in suggesting some number-crunching computer language + Ruby interface for dealing with big amounts of data, but maybe you can reduce it all a bit ? Best regards, Axel

on 2007-07-31 14:11

On Jul 30, 6:15 pm, "Axel E." <removed_email_address@domain.invalid> wrote: > > If you have 20000 unique words and want to represent the probabilities > of an n-chain of them, you'll need (20000)^n values, i.e., > 1.6*10^17 for n=4, so the sparsity of your data is such that there is > maybe one in a billion 4-chains actually represented in your probability > hash ( a hash entry will actually need more than a byte, and that > increases the sparsity). > > So I don't think you'll be able to reduce the need in memory a great > deal ... D'oh I was afraid of that :P > words into a class of seldom words ... This would introduce > some overhead of getting back to words from the classes, though, > and there will be many words that have different meanings that are > nevertheless equally seldom :( Well, this project is mostly for fun (a plugin for an irc bot, as I mentioned), so even if 'odd' words are chosen, it just adds to the fun. I like this 'word class' idea. I could put words with very few occurrences into the 'seldom' class, and then pick them based on an auxiliary letter-based markov chain (which would pick as many characters as necessary to isolate a word in the group. An interesting challenge. > You can influence the amount of data thus created by changing the > alpha value of the test. > > There is a book by Jeremy Spinrad about "Efficient Graph Representations" > where you can find further ideas about how to represent a graph . This sounds like an excellent idea. I'll look up these references, thanks. > > this matrix is reconstructed from a product of matrices of lower > rank, introducing non-zero entries in the previously zero entries. > Thus all of a sudden, there is a small probability of finding that > particular chain now. Aha, excellent link, thanks. This is really interesting. I wonder if it's possible to take it to higher orders and write the n-th level tensor as a product of lower level tensors ... wonder what kind of speed hit it would take. > I join Ed Borasky in suggesting some number-crunching computer language > + Ruby interface for dealing with big amounts of data, but maybe > you can reduce it all a bit ? Well, going out of Ruby would sort-of defeat some of the (learning) purposes of this project, but for serious application it's something I should really look into, especially to do matrix manipulations as suggested in your last points. Thanks a lot for the suggestions,

on 2007-08-01 12:43

Dear Giuseppe, > > > as a product of lower level tensors ... wonder what kind of speed hit > it would take. > What you would need in order to generalize LSA to higher orders is a generalization of the singular value decomposition (SVD) to tensors. There are plenty of links, such as this one: http://www.cs.cornell.edu/cv/OtherPdf/William.pdf to do that. If you'd consider doing statistical tests to find longer chains that occur significantly more often than they should given the product of shorter chains, you can make use of the following: 1.) For the chi-square test, always, if at all, the hypothesis of independence for an entire sample of chains (of different length) gets rejected, if it contains chains whose occurrence probabilities is much different from the product of shorter chains which make it up. But it's tiresome to consider all the subsets of some set of cardinality (20000)^n -- and none of the number-crunching languages is fast enough to do that for you anyway ;-) You see which ones are the tricky candidates when you look at how much they contribute to the test statistic - the big contributors make the test fail for the whole sample. 2.) As a product of big numbers is always bigger than a product of small numbers, you can find long chains with deviant probability iteratively: they will always contain subchains with deviant probability. So, with 20000 words, I'd start looking at the probabilities of 2-chains, filter out possible candidates, then consider 3-chains etc ... > > I join Ed Borasky in suggesting some number-crunching computer language > > + Ruby interface for dealing with big amounts of data, but maybe > > you can reduce it all a bit ? > > Well, going out of Ruby would sort-of defeat some of the (learning) > purposes of this project, but for serious application it's something > I should really look into, especially to do matrix manipulations > as suggested in your last points. Mmmh, it is often argued that Ruby is slow in comparison to other languages, when you consider the execution time of some pre-defined task. Entire webpages are dedicated to this claim. But rarely ever do these comparisons take into account how much time you need to THINK and code. I've often found the execution time of some task infinite with other languages, because they don't leave you enough time to think about the problem, as you always bother about a million implementation details that obscure the real memory and computing needs. I now think that it is perfectly feasible to do a 6 or longer Markov chain handler entirely in Ruby, and fast enough, whereas many brute-force approaches with some "fast" language will take an infinite amount of both space and time. > Thanks a lot for the suggestions, Glad to be of help :) Best regards, Axel

on 2007-08-01 18:43

Axel E. wrote: > If you'd consider doing statistical tests to find longer > chains that occur significantly more often than they should > given the product of shorter chains, you can make use > of the following: [snip] > both space and time. Well ... OK ... but ... A couple of years back I got interested in chatbots for a brief period. Google for AIML and the Loebner Prize and you'll probably find some of my thoughts on the subject. Giuseppe IIRC wants to build a chatbot and is doing it with Markov chains. I argued, as you appear to be arguing, that the "right way" to build a chatbot involves statistical analysis of semantics, and the AIML people argued that their way was a lot simpler and worked just as well. I conversed with a number of chatbots over a few months and attempted to train a couple, one using AIML and one using what appears to be the most credible technology, unfortunately proprietary. It was an exercise in frustration. I ended up concluding that statistical analysis of semantics was the right way to do it, but didn't see any payoff for investing the effort in trying to build a working code. So two comments: 1. Check out AIML -- they believe in it even if I don't, and I think there is a Ruby interface now. There is a Java interface, so perhaps jRuby can be the glue. http://www.alicebot.org/ 2. Check out the chatbot I though was the most credible, and which won the Loebner Prize a couple of times. http://www.jabberwacky.com/

on 2007-08-02 03:21

-------- Original-Nachricht -------- Datum: Wed, 1 Aug 2007 23:42:10 +0900 Von: "M. Edward (Ed) Borasky" <removed_email_address@domain.invalid> An: removed_email_address@domain.invalid Betreff: Re: Optimizing large-scale Markov chain handler > > > > Entire webpages are dedicated to this claim. > > both space and time. > > there is a Ruby interface now. There is a Java interface, so perhaps > jRuby can be the glue. http://www.alicebot.org/ > > 2. Check out the chatbot I though was the most credible, and which won > the Loebner Prize a couple of times. http://www.jabberwacky.com/ Dear Ed, I have had a look at both bots ... I can't say I'd find any one particularly appealing as a substitute for a human conversation partner, e.g., to solve personal problems, even though I've read reports about psychiatrists for whom even Eliza passed the Turing test in the Sixties already. In the discussion with Giuseppe, I was mainly concerned with reducing the amount of data in the n-chains he wants to use ... as one first application of statistics. Now that you ask, I wonder how much statistics there actually is in the part of latent semantic analysis I find appealing ;-) I like the fact that you merely do an SVD of the word-in-sentence count matrix, then cut off after the first n eigenvalues ( giving n concepts) and multiply everything up together. One can interpret the resulting matrix as a linear mapping and each sentence as a vector, with entries describing the words in it, which gets mapped to some lower-dimensional subspace when multiplied with the matrix. This image subspace is a "subspace of concepts". In a general text about pets, sentences containing "dog" and "cat" will probably get mapped to very close points in the subspace ... but you could introduce separating hyper-planes to distinguish between anything doggish and cattish. I like this way of constructing relatively few concepts each time anew from the actual context of the available text rather than trying to find > 40000 categories that will make up the whole world, as seems to be the case in ALICE and similar bots. It might be that one day, I want to discuss cats as mean predators eating poor mice, another day, as nice pets . Then, a pre-configured categorization might have difficulties ... and the number of categories is way too big in my taste. Some neuroscientist said that humans can only deal with about seven concepts at once: http://www.musanim.com/miller1956/ In Ruby, there is the classifier gem by Lucas Carlson and David Fayram II that will do LSA, text classification, Bayesian analysis in the context of general texts. Best regards, Axel