Optimizing large-scale Markov chain handler

Hello all,

I’m building a Markov chain handler with a couple of twists over the
usual naive handlers.

First of all, I consider both words and non-words (separators), so the
string scanner uses the regexp /\w+|\W/u. This gives much better
handling of punctuation and so on, but it has the downside that I need
twice the order of a naive handler to get the same context
(word-punctuation-word-punctuation instead of word-word).

Secondly, it’s bidirectional, so it can create a sentence by starting
from the middle and
adding both ‘next’ and ‘previous’ words. This obviously doubles the
memory requirements,
as we store two chains instead of one.

My prototype implementation is available on
http://oblomov.dnsalias.org/git
in the rbot-mark.git repository.

The structure is rather simple. I have a ChanceHash class to abstract
the collection of possible next (or previous) words, with their
corresponding chances, and the MarkovChainer class which abstracts the
engine: the chains are stored in @mkv, which is a simple hash whose
values are {:prev => ChanceHash, :next => ChanceHash} and whose keys
are arrays of symbols. nil is used to mark the beginning/end of a
sentence.
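
To make that concrete, here is a stripped-down sketch of the structure
(hypothetical code, not the actual rbot-mark classes: ChanceHash is
reduced to a plain hash of counts, and the class name is made up):

```ruby
class TinyChainer
  def initialize(order)
    @order = order
    # context (array of symbols, nil marks a sentence boundary) =>
    # { :prev => counts, :next => counts }; the count hashes stand in
    # for ChanceHash
    @mkv = Hash.new do |h, k|
      h[k] = { :prev => Hash.new(0), :next => Hash.new(0) }
    end
  end

  # tokens as produced by text.scan(/\w+|\W/u)
  def learn(tokens)
    syms = Array.new(@order) + tokens.map { |t| t.to_sym } + Array.new(@order)
    syms.each_cons(@order + 1) do |window|
      @mkv[window[0, @order]][:next][window[@order]] += 1  # forward chain
      @mkv[window[1, @order]][:prev][window[0]]      += 1  # backward chain
    end
  end

  # the ChanceHash-equivalent of possible successors for a context
  def successors(context)
    @mkv.key?(context) ? @mkv[context][:next] : {}
  end
end

mc = TinyChainer.new(2)
mc.learn("Hello, world!".scan(/\w+|\W/u))
p mc.successors([:Hello, :","])   # => {:" "=>1}
```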

Although this works remarkably well, there are huge memory issues:
learning my test
text (210K+ words, with 20K+ unique words) is unfeasible for orders >
4 (at order 4
it takes about half a gigabyte of memory).

Now, the final application (this is a plugin for the IRC bot rbot) will
store the data in a bdb (Berkeley DB),
which will enormously ease the memory load, but I was wondering if it
was possible to
optimize memory usage in a way that would allow the code to be used
‘live’ without
some on-disk backend, while still being able to go up to order 6 (at
least).
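
For what it’s worth, the on-disk part itself is simple enough; a
minimal sketch using the stdlib DBM as a stand-in for the BDB backend
(assuming the dbm extension is available; the key/value layout below is
made up for the example):

```ruby
require 'dbm'   # stand-in for bdb; DBM keys and values must be strings

db = DBM.open('markov-order2')

context = Marshal.dump([:hello, :world])             # serialize the context array
db[context] = Marshal.dump({ :next => { :"," => 1 }, :prev => {} })

entry = Marshal.load(db[context])                    # only touched entries live in RAM
p entry[:next]   # => {:","=>1}

db.close
```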

Any suggestions?

Oblomov wrote:

I need twice the order of a naive handler to get the same context
[…]
I was wondering if it was possible to optimize memory usage in a way
that would allow the code to be used ‘live’ without some on-disk
backend, while still being able to go up to order 6 (at least).

Any suggestions?

Well, I’m not an expert on natural language processing by any stretch of
the imagination, so I don’t know if this sort of processing is normal in
that line of work, and if it is, how the memory requirements grow with
the order of Markov chain you’re using. For openers, though, you might
want to find a C++ or Java library that does what you’re trying to do
efficiently and interface your Ruby code to one, using regular Ruby
for a C++ library or JRuby for a Java library. There is lots of open
source NLP code out there in both languages, so it shouldn’t be too
difficult to find something.

Dear Giuseppe,

Although this works remarkably well, there are huge memory issues:
learning my test
text (210K+ words, with 20K+ unique words) is unfeasible for orders >
4 (at order 4
it takes about half a gigabyte of memory).

If you have 20000 unique words and want to represent the probabilities
of an n-chain of them, you’ll need (20000)^n values, i.e.,
1.6*10^17 for n=4, so the sparsity of your data is such that there is
maybe one in a billion 4-chains actually represented in your probability
hash (a hash entry will actually need more than a byte, and that
increases the sparsity).
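
Spelled out (just the back-of-the-envelope arithmetic, nothing from the
actual code):

```ruby
vocab    = 20_000
possible = vocab ** 4           # => 160_000_000_000_000_000, i.e. 1.6e17 possible 4-chains
half_gb  = 512 * 1024 * 1024    # bytes available
# even at an (impossible) one byte per hash entry:
puts half_gb.to_f / possible    # => ~3.4e-09, a few stored 4-chains per billion possible ones
```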

So I don’t think you’ll be able to reduce the memory requirements a
great deal …

I could think of two things (not necessarily disjoint).

1.) Group individual words together to get classes of words,
and calculate the probabilities for those fewer classes. I’m not
quite sure how well this will work for you, but I’d try to use some
kind of Huffman-coding concept on the n-chains: build the coding tree
from the word frequencies, then break off building the tree somewhere
and put the remaining words into a class of seldom words … This would
introduce
some overhead of getting back to words from the classes, though,
and there will be many words that have different meanings that are
nevertheless equally seldom :frowning:

2.) Be less strict with the probability values – you could say that
the probability of any particular 6-chain is the product of the
probabilities of the five successive two-chains involved in it
(independence hypothesis), unless you can reject that hypothesis by a
statistical test (I’d use the chi-square test - the Ruby code can be
found in the package statistics2 by Shin-ichiro Hara).

So keep only the two-chain probabilities and those longer chain
probabilities which deviate considerably from the independence
hypothesis,
leading to rejection in the chi-square test.
You can influence the amount of data thus created by changing the
alpha value of the test.
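
A sketch of what that pruning could look like (my own rendering of the
idea, not existing code; statistics2, mentioned above, provides the
proper chi-square functions, here the 95% critical value for one degree
of freedom is simply hard-coded to keep the sketch self-contained):

```ruby
CRITICAL = 3.84   # chi-square, 1 d.o.f., alpha = 0.05; a smaller alpha keeps fewer chains

# counts : observed counts of the longer chains, e.g. { [:a, :b, :c] => 12, ... }
# pair_p : 2-chain probabilities,                e.g. { [:a, :b] => 0.01, ... }
# total  : number of chain observations in the training text
def significant_chains(counts, pair_p, total)
  counts.select do |chain, observed|
    # independence hypothesis: the chain's probability is the product of
    # the probabilities of its successive 2-chains
    prob     = chain.each_cons(2).inject(1.0) { |p, pair| p * pair_p.fetch(pair, 0.0) }
    expected = total * prob
    next false if expected <= 0
    (observed - expected) ** 2 / expected > CRITICAL   # keep only the big deviators
  end
end
```

Testing each chain on its own with one degree of freedom is a
simplification of the full sample-wide test, but it captures the
“keep only the strong deviators” idea.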

There is a book by Jeremy Spinrad about “Efficient Graph
Representations” where you can find further ideas about how to
represent a graph.

And, something else:
What is the probability of an n-chain of words that wasn’t actually
encountered in the training set? It’s not zero, even though the
Hash created from the training data says so …
If you think of the incidence matrix A=(a_ij) representing the
probabilities of finding a 2-chain of words i-j, all entries with
no such Markov chain are zero. Now, in a technique called “Latent
Semantic Analysis”, this matrix is reconstructed from a product of
matrices of lower
rank, introducing non-zero entries in the previously zero entries.
Thus all of a sudden, there is a small probability of finding that
particular chain now.
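
To make the low-rank idea concrete, here is a toy version in plain Ruby
(a real implementation would use a proper SVD routine; this one just
extracts the dominant singular triple with a few power iterations):

```ruby
# Rank-1 approximation A1 = sigma * u * v^T of a small count matrix.
def rank1_approx(a, iterations = 50)
  rows, cols = a.size, a.first.size
  v = Array.new(cols, 1.0)
  iterations.times do
    u = a.map { |row| row.zip(v).map { |x, y| x * y }.inject(:+) }               # u = A v
    v = (0...cols).map { |j| (0...rows).map { |i| a[i][j] * u[i] }.inject(:+) }  # v = A^T u
    norm = Math.sqrt(v.map { |x| x * x }.inject(:+))
    v = v.map { |x| x / norm }
  end
  u = a.map { |row| row.zip(v).map { |x, y| x * y }.inject(:+) }
  sigma = Math.sqrt(u.map { |x| x * x }.inject(:+))
  u = u.map { |x| x / sigma }
  u.map { |ui| v.map { |vj| sigma * ui * vj } }   # outer product sigma * u * v^T
end

# toy 2-chain count matrix: the (1,3) and (3,1) cells were never observed
counts = [[4, 2, 0],
          [2, 2, 1],
          [0, 1, 1]]
rank1_approx(counts).each do |row|
  puts row.map { |x| format('%5.2f', x) }.join(' ')
end
# the corner cells that were zero in the data come back as roughly 0.55 here
```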

I join Ed Borasky in suggesting some number-crunching computer language
plus a Ruby interface for dealing with big amounts of data, but maybe
you can reduce it all a bit?

Best regards,

Axel

Dear Giuseppe,

I wonder if it’s possible to take it to higher orders and write the
n-th level tensor as a product of lower level tensors … wonder what
kind of speed hit it would take.

What you would need in order to generalize
LSA to higher orders is a generalization of the singular value
decomposition (SVD) to tensors. There are plenty of links on how to
do that, such as this one:

http://www.cs.cornell.edu/cv/OtherPdf/William.pdf

If you’d consider doing statistical tests to find longer
chains that occur significantly more often than they should
given the product of shorter chains, you can make use
of the following:

1.) For the chi-square test, the hypothesis of independence for an
entire sample of chains (of different lengths) gets rejected, if at
all, when the sample contains chains whose occurrence probabilities
differ a lot from the product of the shorter chains that make them up.
But it’s tiresome to consider all the subsets of some
set of cardinality (20000)^n – and none of the number-crunching
languages is fast enough to do that for you anyway :wink:

You see which ones are the tricky candidates when you look
at how much they contribute to the test statistic - the
big contributors make the test fail for the whole sample.

2.) As a product of big numbers is always bigger than a product
of small numbers, you can find long chains with deviant probability
iteratively: they will always contain subchains with deviant
probability.
So, with 20000 words, I’d start looking at the probabilities
of 2-chains, filter out possible candidates, then consider 3-chains, etc.
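
In code, that bottom-up filtering could look roughly like this (my
phrasing of the idea, nothing that exists; chain_counts and
deviant_pairs are assumed inputs):

```ruby
require 'set'

# deviant_pairs : 2-chains already flagged by the chi-square check
# chain_counts  : { length => { [sym, sym, ...] => count } }
def grow_candidates(deviant_pairs, chain_counts, max_len)
  keep = deviant_pairs.to_set
  (3..max_len).each do |len|
    keep = chain_counts[len].keys.select { |chain|
      # only chains containing an already-deviant (len-1)-subchain are worth testing
      chain.each_cons(len - 1).any? { |sub| keep.include?(sub) }
    }.to_set
    # ... here each surviving chain would get its own chi-square check
    # before being carried over to the next length
  end
  keep
end
```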

I join Ed Borasky in suggesting some number-crunching computer language
plus a Ruby interface for dealing with big amounts of data, but maybe
you can reduce it all a bit?

Well, going out of Ruby would sort-of defeat some of the (learning)
purposes of this project, but for serious application it’s something
I should really look into, especially to do matrix manipulations
as suggested in your last points.

Mmmh, it is often argued that Ruby is slow in comparison to other
languages, when you consider the execution time of some pre-defined
task.
Entire webpages are dedicated to this claim.
But rarely ever do these comparisons take into account how much time you
need to THINK and code. I’ve often found the execution
time of some task infinite with other languages, because they don’t
leave
you enough time to think about the problem, as you always have to
bother about a million implementation details that obscure the real
memory and computing needs.
I now think that it is perfectly feasible to do an order-6 or longer
Markov chain handler entirely in Ruby, and fast enough, whereas many
brute-force approaches in some “fast” language will take an infinite
amount of both space and time.

Thanks a lot for the suggestions,

Glad to be of help :slight_smile:

Best regards,

Axel

On Jul 30, 6:15 pm, “Axel E.” [email protected] wrote:

If you have 20000 unique words and want to represent the probabilities
of an n-chain of them, you’ll need (20000)^n values, i.e.,
1.6*10^17 for n=4, so the sparsity of your data is such that there is
maybe one in a billion 4-chains actually represented in your probability
hash (a hash entry will actually need more than a byte, and that
increases the sparsity).

So I don’t think you’ll be able to reduce the memory requirements a
great deal …

D’oh I was afraid of that :stuck_out_tongue:

words into a class of seldom words … This would introduce
some overhead of getting back to words from the classes, though,
and there will be many words that have different meanings that are
nevertheless equally seldom :frowning:

Well, this project is mostly for fun (a plugin for an irc bot,
as I mentioned), so even if ‘odd’ words are chosen, it just adds
to the fun.

I like this ‘word class’ idea. I could put words with very few
occurrences into the ‘seldom’ class, and then pick them based
on an auxiliary letter-based Markov chain (which would pick as many
characters as necessary to isolate a word in the group). An interesting
challenge.
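
Something along those lines might look like this (purely hypothetical
code: the threshold, the helper names and the first-order letter chain
are all my own choices, not anything from rbot-mark):

```ruby
SELDOM_THRESHOLD = 3   # words seen fewer times than this go into the seldom class

def split_vocab(word_counts)
  seldom, common = word_counts.partition { |_, n| n < SELDOM_THRESHOLD }
  [seldom.map { |w, _| w }, common.map { |w, _| w }]
end

# first-order letter chain over the seldom words, :start/:end as markers
def letter_chain(words)
  chain = Hash.new { |h, k| h[k] = Hash.new(0) }
  words.each do |w|
    letters = w.chars.to_a + [:end]
    chain[:start][letters.first] += 1
    letters.each_cons(2) { |a, b| chain[a][b] += 1 }
  end
  chain
end

# Walk the letter chain, narrowing the candidate set after each letter,
# until a single seldom word is isolated (or we cannot narrow further).
def pick_seldom(chain, seldom_words)
  prefix, state, candidates = '', :start, seldom_words
  while candidates.size > 1
    choices = chain[state].reject { |ch, _| ch == :end }
    break if choices.empty?
    r = rand(choices.values.inject(:+))
    state = choices.find { |_, n| (r -= n) < 0 }.first   # weighted letter pick
    prefix << state
    narrowed = candidates.select { |w| w.start_with?(prefix) }
    break if narrowed.empty?   # the chain wandered off every real word
    candidates = narrowed
  end
  candidates.first
end

# seldom, common = split_vocab(counts)
# pick_seldom(letter_chain(seldom), seldom)   # => one concrete seldom word
```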

You can influence the amount of data thus created by changing the
alpha value of the test.

There is a book by Jeremy Spinrad about “Efficient Graph Representations”
where you can find further ideas about how to represent a graph.

This sounds like an excellent idea. I’ll look up these references,
thanks.

this matrix is reconstructed from a product of matrices of lower
rank, introducing non-zero entries in the previously zero entries.
Thus all of a sudden, there is a small probability of finding that
particular chain now.

Aha, excellent link, thanks. This is really interesting. I wonder if
it’s
possible to take it to higher orders and write the n-th level tensor
as a product of lower level tensors … wonder what kind of speed hit
it would take.

I join Ed Borasky in suggesting some number-crunching computer language
plus a Ruby interface for dealing with big amounts of data, but maybe
you can reduce it all a bit?

Well, going out of Ruby would sort-of defeat some of the (learning)
purposes of this project, but for serious application it’s something
I should really look into, especially to do matrix manipulations
as suggested in your last points.

Thanks a lot for the suggestions,

Axel E. wrote:

If you’d consider doing statistical tests to find longer
chains that occur significantly more often than they should
given the product of shorter chains, you can make use
of the following:

[snip]

both space and time.

Well … OK … but …

A couple of years back I got interested in chatbots for a brief period.
Google for AIML and the Loebner Prize and you’ll probably find some of
my thoughts on the subject. Giuseppe IIRC wants to build a chatbot and
is doing it with Markov chains. I argued, as you appear to be arguing,
that the “right way” to build a chatbot involves statistical analysis of
semantics, and the AIML people argued that their way was a lot simpler
and worked just as well.

I conversed with a number of chatbots over a few months and attempted to
train a couple, one using AIML and one using what appears to be the most
credible technology, unfortunately proprietary. It was an exercise in
frustration. I ended up concluding that statistical analysis of
semantics was the right way to do it, but didn’t see any payoff for
investing the effort in trying to build a working code.

So two comments:

  1. Check out AIML – they believe in it even if I don’t, and I think
    there is a Ruby interface now. There is a Java interface, so perhaps
    JRuby can be the glue. http://www.alicebot.org/

  2. Check out the chatbot I thought was the most credible, and which won
    the Loebner Prize a couple of times. http://www.jabberwacky.com/

-------- Original Message --------
Date: Wed, 1 Aug 2007 23:42:10 +0900
From: “M. Edward (Ed) Borasky” [email protected]
To: [email protected]
Subject: Re: Optimizing large-scale Markov chain handler

Entire webpages are dedicated to this claim.
both space and time.

there is a Ruby interface now. There is a Java interface, so perhaps
JRuby can be the glue. http://www.alicebot.org/

  2. Check out the chatbot I thought was the most credible, and which won
    the Loebner Prize a couple of times. http://www.jabberwacky.com/

Dear Ed,

I have had a look at both bots … I can’t say I find either one
particularly appealing as a substitute for a human conversation
partner, e.g., for solving personal problems, even though I’ve read
reports about psychiatrists for whom even Eliza passed the Turing
test back in the Sixties.

In the discussion with Giuseppe, I was mainly concerned with
reducing the amount of data in the n-chains he wants to use …
as a first application of statistics.
Now that you ask, I wonder how much statistics there actually
is in the part of latent semantic analysis I find appealing :wink:

I like the fact that you merely do an SVD of the word-in-sentence
count matrix, then cut off after the first n singular values (giving
n concepts) and multiply everything back together.

One can interpret the resulting matrix as a linear mapping
and each sentence as a vector, with entries describing the words in it,
which gets mapped to some lower-dimensional subspace when
multiplied by the matrix.
This image subspace is a “subspace of concepts”. In a general text
about pets, sentences containing “dog” and “cat” will probably get
mapped to very close points in the subspace … but you could introduce
separating hyperplanes to distinguish between anything doggish and
cattish.

I like this way of constructing relatively few concepts each time anew
from the actual context of the available text, rather than trying to
find > 40000 categories that will make up the whole world, as seems to
be the case in ALICE and similar bots. It might be that one day I want
to discuss cats as mean predators eating poor mice, and another day as
nice pets. Then a pre-configured categorization might have
difficulties … and the number of categories is way too big for my
taste.

The psychologist George Miller famously argued that humans can only
deal with about seven concepts at once:

http://www.musanim.com/miller1956/

In Ruby, there is the classifier gem by
Lucas Carlson and David Fayram II that will do LSA, text classification,
and Bayesian analysis in the context of general texts.
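
For the record, basic usage looks roughly like this (from memory of the
gem’s README, so take the exact method names with a grain of salt):

```ruby
require 'classifier'   # gem install classifier (needs a stemmer gem; GSL speeds up the LSI part)

lsi = Classifier::LSI.new
lsi.add_item "Dogs are loyal pets that love to play fetch.", :dog
lsi.add_item "Cats are independent pets that hunt mice.",    :cat
lsi.add_item "Mice fear cats and other predators.",          :mouse

lsi.classify "My cat caught a mouse today"      # => :cat (tiny corpus, so results are shaky)
lsi.find_related "Dogs love playing fetch", 2   # the two items closest in concept space

bayes = Classifier::Bayes.new 'Pets', 'Predators'
bayes.train_pets      "dogs and cats are nice pets"
bayes.train_predators "cats are mean predators eating poor mice"
bayes.classify "poor mice"                      # => "Predators"
```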

Best regards,

Axel