Forum: Ruby Markov Chains (#74)

Announcement (2017-05-07): is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see and for other Rails- und Ruby-related community platforms.
4299e35bacef054df40583da2d51edea?d=identicon&s=25 James Gray (bbazzarrakk)
on 2006-04-13 14:51
(Received via mailing list)
I owe myself a signed copy of my book.  Err, I mean, this was a popular
That's always great because I have so many wonderful selections of code
to talk
about in the summary, but it can also be hard because I have so many
selections of code to talk about in the summary.  All of the solutions
terrificly educational, even the ones I don't have time to discuss
Everyone who reads this summary is bound by Ruby Law (can be tried in
the Ruby
Courts) to go look through the rest of the submissions.

Now that we have the legal formalities out of the way, let's dig into a
solution.  Jon Egil Strand wrote in with descriptions of several
applications for Markov Chains and this bit of code:

	class MarkovChain
	  def initialize(text)
	    @words =
	    wordlist = text.split
	    wordlist.each_with_index do |word, index|
	      add(word, wordlist[index + 1]) if index <= wordlist.size - 2

	  def add(word, next_word)
	    @words[word] = if !@words[word]
	    @words[word][next_word] += 1

	  def get(word)
	    return "" if !@words[word]
	    followers = @words[word]
	    sum = followers.inject(0) {|sum,kv| sum += kv[1]}
	    random = rand(sum)+1
	    partial_sum = 0
	    next_word = followers.find do |word, count|
	      partial_sum += count
	      partial_sum >= random

	mc ="Agatha Christie - The Mysterious Affair at Styles.txt")

	sentence = ""
	word = "Murder"
	until sentence.count(".") == 4
	  sentence << word << " "
	  word = mc.get(word)
	puts sentence << "\n\n"

Let's examine the MarkovChain object first, as it is where the real work
done.  In initialize() you can see that a Hash is created, the text is
on whitespace, and then each pair of adjacent words are passed to the

In add(), the first word is used to key the Hash we saw created in
adding a frequency Hash as the value the first time a new word is seen.
following word is then added to the counts in the frequency Hash.
Obviously, we
are just working with a first order chain here, since we only ever track
word that immediately preceded the current word.

The real work is done in the get() method, where we pass the current
word and
are returned a reasonable follow-up word.  The word is checked to make
sure we
have seen it, then the frequency Hash is pulled for that word.  Next,
frequencies of all possible follow-up words are summed and a random
selection is
made in that range.  Finally, the frequency Hash is walked one last
time, and
the random number used to select and return the indicated word.

The rest of the solution code follows the class.  First a MarkovChain is
constructed from an Agatha Christie novel.  Then a variable is prepared
to hold
the output (called sentence, but it will actually hold a few sentences)
and a
starting word appropriate to the content is hand-picked.  The script
then loops,
using get() to choose words one by one until it has collected at least
sentences of output, which are dumped to the user before we exit.

The above code does not deal with punctuation at all, which can be both
good and
bad.  The upside is that you will see some natural punctuation, like
commas and
they will generally even appear in reasonable places since they are
attached to
the word from the original text.  The downside is that you may get
like an opening quote, but never get the closing quote.  Sometimes it
works out
naturally, and sometimes it doesn't.

Now, in this whole discussion, I was loosely throwing around this
Hash" term that I never bothered to define.  Allow me to fix that now.
frequency Hash is obviously just a Hash with keys being the words that
follow and values being a count of how many times that word was seen to
in the original text.  Dominik Bathon included excellent examples of how
looks in his submission email.  Using Dominik's trivial example text of:

	q s a w s. a s w q s. q w s a a w. s q a w s.

Jon's code will produce a Hash tree that looks like:

	{ "w." => {"s"  => 1},
	  "w"  => {"s." => 2, "q" => 1, "s"  => 1},
	  "a"  => {"w." => 1, "a" => 1, "w"  => 2, "s" => 1},
	  "s." => {"a"  => 1, "q" => 1},
	  "q"  => {"a"  => 1, "w" => 1, "s." => 1, "s" => 1},
	  "s"  => {"w"  => 1, "a" => 2, "q"  => 1} }

You can really see how punctuation changes things there, since "w" and
"w." are
different words.

The right hand side of that (the values of the top level Hash), shows
frequency breakdown.  "w" occurs 2 times after an "a", but "s" occurs
just once
following the same word.

Domink then went on to show how a person might change that
representation.  Here
are two significant changes:

	{ "w" => { "q" => ["s"],
	           "s" => [".", "a", "."] },
	  "a" => { "a" => ["w"],
	           "w" => ["s", ".", "s"],
	           "s" => ["w"] },
	  "q" => { "a" => ["w"],
	           "w" => ["s"],
	           "s" => ["a", "."] },
	  "s" => { "w" => ["q"],
	           "a" => ["w", "a"],
	           "q" => ["a"] } }

Note that punctuation is handled differently here, because I'm now using
Dominik's examples and his code used a different implementation.  In
code, periods, exclamation marks, and question marks are treated as
their own
words and most other punctuation is ignored.

Now try to ignore the tree structure of the above for now and just focus
on that right-hand side.  This is still just a frequency Hash, though it
is now
disguised as an Array.  If you have the Hash {"a" => 1, "." => 2}, you
select the next word just as we saw Jon do earlier.  However, the Array
"a", "."] represents the same data and we can select a word just by
selecting a member.  The duplication ensures that the odds are still the

The other obvious change is that we are now taking into account an order
for the
chain.  This works out pretty naturally, just by nesting Hashes to the
depth of
the order.  We then walk the tree to find out what comes next.  For
example, if
we have a "w" followed by a "a", we get to our now familiar Array of
choices for
the next word [".", "a", "."].

But wait, Dominik goes one step further:

	{ ["q", "a"] => ["w"],
	  ["q", "w"] => ["s"],
	  ["s", "q"] => ["a"],
	  ["w", "q"] => ["s"],
	  ["s", "w"] => ["q"],
	  ["a", "s"] => ["w"],
	  ["w", "s"] => [".", "a", "."],
	  ["s", "a"] => ["w", "a"],
	  ["q", "s"] => ["a", "."],
	  ["a", "a"] => ["w"],
	  ["a", "w"] => ["s", ".", "s"] }

This is exactly the same as the tree above, except that the Hash tree of
preceding words has been flattened into a single Array of words, used to
key the
Hash.  This is possible because Ruby's Arrays are hashable.  Isn't that
The gain is speed.  Since we only have to make one Hash lookup to get to
frequency Array, it works a little quicker.

Now that we have gone through the trouble of understanding Dominik's
structure, the code should be a breeze.  Let's take a look:

	# ...

	if $0 == __FILE__
	  order = 2
	  n = 10
	  while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
	    if $1 == "o"
	      order = Integer($2)
	      n = Integer($2)
	  mc =
	  n.times {
	    puts mc.generate_sentence

This is the last little bit of Dominik's code, but it kicks the process
You can see that it selects defaults for the order and number of
sentences to
produce, but then updates them from the provided command-line options.
Next, a
MarkovChainer is built and fed the combined contents of ARGF.  Finally,
requested number of sentences are generated with a simple method call
inside of an iterator.

Let's look at the pieces of MarkovChainer used to read text:

	class MarkovChainer
	  attr_reader :order
	  def initialize(order)
	    @order = order
	    @beginnings = []
	    @freq = {}

	  def add_text(text)
	    # make sure each paragraph ends with some sentence terminator
	    text.gsub!(/\n\s*\n/m, ".")
	    text << "."
	    seps = /([.!?])/
	    sentence = ""
	    text.split(seps).each { |p|
	      if seps =~ p
	        add_sentence(sentence, p)
	        sentence = ""
	        sentence = p

	  # ...


	  def add_sentence(str, terminator)
	    words = str.scan(/[\w']+/)
	    return unless words.size > order # ignore short sentences
	    words << terminator
	    buf = []
	    words.each { |w|
	      buf << w
	      if buf.size == order + 1
	        (@freq[buf[0..-2]] ||= []) << buf[-1]
	    @beginnings << words[0, order]

	  # ...


The initialize() method shows a familiar setup, save that new Array we
talked about yet.  We will tackle that when it comes up.

add_text() isn't much more complicated.  It ensures the document it is
with contains complete sentences, then breaks it on the end-of-line
characters I mentioned earlier.  The trick here is that you need to
notice the
parentheses in the Regexp handed to split().  Ordinarily, split() will
the match used to divide the chunks, but when the match includes
parentheses, those captures are returned as items of their own.  So, the
iterator of the method reads the sentence then the punctuation that
followed it
and hands those over to add_sentence().

add_sentence() is where the data structure we analyzed earlier gets
built.  The
sentence is separated into words, including the end-of-line punctuation
forget.  Then those words are pushed into a buffer one at a time.  When
buffer has enough words stored to satisfy the order, it starts filling
the Hash
of frequency Arrays for this chain.  Finally, we see what that extra
Array is
used for.  It holds beginnings, or groups of words used to start a
(when we wouldn't have previous words to check frequencies for).  You
can see
those being stored in the last line of this method.

Here's the other side of that class, text writing:

	class MarkovChainer
	  # ...

	  def generate_sentence
	    res = @beginnings[rand(@beginnings.size)]
	    loop {
	      unless nw = next_word_for(res[-order, order])
	        return res[0..-2].join(" ") + res.last
	      res << nw


	  # ...

	  def next_word_for(words)
	    arr = @freq[words]
	    arr && arr[rand(arr.size)]


Not much here, as you can see.  With the data structure built, the
practically solves itself.

The interface method, generate_sentence(), starts by selecting a
beginning from
the Array of possible beginnings.  It then iterates using
next_word_for() to
grab follow-up words until a nil is returned (after an end-of-line
since we didn't give follow-ups for them).  Whatever we have at that
becomes the returned sentence, after adding a sprinkle of whitespace
between the

The helper method, next_word_for(), just does the lookup from the
Array.  Don't let that fancy last line throw you, it's just a shortcut
to keep
from calling []() when arr is nil.

A huge thank you to all who came out of hiding to get the quiz rolling
again.  I
just knew I would find a problem you couldn't resist eventually.

The response to the Ruby Quiz contest has been strong and tomorrow we
will start
our run of contributed quizzes with an offering from Pat Eyler...
This topic is locked and can not be replied to.