Forum: Ruby Markov Chains (#74)

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
James G. (Guest)
on 2006-04-13 16:51
(Received via mailing list)
I owe myself a signed copy of my book.  Err, I mean, this was a popular
quiz!
That's always great because I have so many wonderful selections of code
to talk
about in the summary, but it can also be hard because I have so many
wonderful
selections of code to talk about in the summary.  All of the solutions
were
terrificly educational, even the ones I don't have time to discuss
below.
Everyone who reads this summary is bound by Ruby Law (can be tried in
the Ruby
Courts) to go look through the rest of the submissions.

Now that we have the legal formalities out of the way, let's dig into a
solution.  Jon Egil S. wrote in with descriptions of several
practical
applications for Markov Chains and this bit of code:

	class MarkovChain
	  def initialize(text)
	    @words = Hash.new
	    wordlist = text.split
	    wordlist.each_with_index do |word, index|
	      add(word, wordlist[index + 1]) if index <= wordlist.size - 2
	    end
	  end

	  def add(word, next_word)
	    @words[word] = Hash.new(0) if !@words[word]
	    @words[word][next_word] += 1
	  end

	  def get(word)
	    return "" if !@words[word]
	    followers = @words[word]
	    sum = followers.inject(0) {|sum,kv| sum += kv[1]}
	    random = rand(sum)+1
	    partial_sum = 0
	    next_word = followers.find do |word, count|
	      partial_sum += count
	      partial_sum >= random
	    end.first
	    next_word
	  end
	end

	mc = MarkovChain.new(
	  File.read("Agatha Christie - The Mysterious Affair at Styles.txt")
	)

	sentence = ""
	word = "Murder"
	until sentence.count(".") == 4
	  sentence << word << " "
	  word = mc.get(word)
	end
	puts sentence << "\n\n"

Let's examine the MarkovChain object first, as it is where the real work
is
done.  In initialize() you can see that a Hash is created, the text is
split()
on whitespace, and then each pair of adjacent words are passed to the
add()
method.

In add(), the first word is used to key the Hash we saw created in
initialize(),
adding a frequency Hash as the value the first time a new word is seen.
The
following word is then added to the counts in the frequency Hash.
Obviously, we
are just working with a first order chain here, since we only ever track
the
word that immediately preceded the current word.

The real work is done in the get() method, where we pass the current
word and
are returned a reasonable follow-up word.  The word is checked to make
sure we
have seen it, then the frequency Hash is pulled for that word.  Next,
the
frequencies of all possible follow-up words are summed and a random
selection is
made in that range.  Finally, the frequency Hash is walked one last
time, and
the random number used to select and return the indicated word.

The rest of the solution code follows the class.  First a MarkovChain is
constructed from an Agatha Christie novel.  Then a variable is prepared
to hold
the output (called sentence, but it will actually hold a few sentences)
and a
starting word appropriate to the content is hand-picked.  The script
then loops,
using get() to choose words one by one until it has collected at least
four
sentences of output, which are dumped to the user before we exit.

The above code does not deal with punctuation at all, which can be both
good and
bad.  The upside is that you will see some natural punctuation, like
commas and
they will generally even appear in reasonable places since they are
attached to
the word from the original text.  The downside is that you may get
something
like an opening quote, but never get the closing quote.  Sometimes it
works out
naturally, and sometimes it doesn't.

Now, in this whole discussion, I was loosely throwing around this
"frequency
Hash" term that I never bothered to define.  Allow me to fix that now.
The
frequency Hash is obviously just a Hash with keys being the words that
can
follow and values being a count of how many times that word was seen to
follow
in the original text.  Dominik B. included excellent examples of how
this
looks in his submission email.  Using Dominik's trivial example text of:

	q s a w s. a s w q s. q w s a a w. s q a w s.

Jon's code will produce a Hash tree that looks like:

	{ "w." => {"s"  => 1},
	  "w"  => {"s." => 2, "q" => 1, "s"  => 1},
	  "a"  => {"w." => 1, "a" => 1, "w"  => 2, "s" => 1},
	  "s." => {"a"  => 1, "q" => 1},
	  "q"  => {"a"  => 1, "w" => 1, "s." => 1, "s" => 1},
	  "s"  => {"w"  => 1, "a" => 2, "q"  => 1} }

You can really see how punctuation changes things there, since "w" and
"w." are
different words.

The right hand side of that (the values of the top level Hash), shows
the
frequency breakdown.  "w" occurs 2 times after an "a", but "s" occurs
just once
following the same word.

Domink then went on to show how a person might change that
representation.  Here
are two significant changes:

	{ "w" => { "q" => ["s"],
	           "s" => [".", "a", "."] },
	  "a" => { "a" => ["w"],
	           "w" => ["s", ".", "s"],
	           "s" => ["w"] },
	  "q" => { "a" => ["w"],
	           "w" => ["s"],
	           "s" => ["a", "."] },
	  "s" => { "w" => ["q"],
	           "a" => ["w", "a"],
	           "q" => ["a"] } }

Note that punctuation is handled differently here, because I'm now using
Dominik's examples and his code used a different implementation.  In
Dominik's
code, periods, exclamation marks, and question marks are treated as
their own
words and most other punctuation is ignored.

Now try to ignore the tree structure of the above for now and just focus
again
on that right-hand side.  This is still just a frequency Hash, though it
is now
disguised as an Array.  If you have the Hash {"a" => 1, "." => 2}, you
could
select the next word just as we saw Jon do earlier.  However, the Array
[".",
"a", "."] represents the same data and we can select a word just by
randomly
selecting a member.  The duplication ensures that the odds are still the
same.

The other obvious change is that we are now taking into account an order
for the
chain.  This works out pretty naturally, just by nesting Hashes to the
depth of
the order.  We then walk the tree to find out what comes next.  For
example, if
we have a "w" followed by a "a", we get to our now familiar Array of
choices for
the next word [".", "a", "."].

But wait, Dominik goes one step further:

	{ ["q", "a"] => ["w"],
	  ["q", "w"] => ["s"],
	  ["s", "q"] => ["a"],
	  ["w", "q"] => ["s"],
	  ["s", "w"] => ["q"],
	  ["a", "s"] => ["w"],
	  ["w", "s"] => [".", "a", "."],
	  ["s", "a"] => ["w", "a"],
	  ["q", "s"] => ["a", "."],
	  ["a", "a"] => ["w"],
	  ["a", "w"] => ["s", ".", "s"] }

This is exactly the same as the tree above, except that the Hash tree of
preceding words has been flattened into a single Array of words, used to
key the
Hash.  This is possible because Ruby's Arrays are hashable.  Isn't that
handy?
The gain is speed.  Since we only have to make one Hash lookup to get to
the
frequency Array, it works a little quicker.

Now that we have gone through the trouble of understanding Dominik's
data
structure, the code should be a breeze.  Let's take a look:

	# ...

	if $0 == __FILE__
	  order = 2
	  n = 10
	  while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
	    if $1 == "o"
	      order = Integer($2)
	    else
	      n = Integer($2)
	    end
	    ARGV.shift
	  end
	  mc = MarkovChainer.new(order)
	  mc.add_text(ARGF.read)
	  n.times {
	    puts mc.generate_sentence
	  }
	end

This is the last little bit of Dominik's code, but it kicks the process
off.
You can see that it selects defaults for the order and number of
sentences to
produce, but then updates them from the provided command-line options.
Next, a
MarkovChainer is built and fed the combined contents of ARGF.  Finally,
the
requested number of sentences are generated with a simple method call
repeated
inside of an iterator.

Let's look at the pieces of MarkovChainer used to read text:

	class MarkovChainer
	  attr_reader :order
	  def initialize(order)
	    @order = order
	    @beginnings = []
	    @freq = {}
	  end

	  def add_text(text)
	    # make sure each paragraph ends with some sentence terminator
	    text.gsub!(/\n\s*\n/m, ".")
	    text << "."
	    seps = /([.!?])/
	    sentence = ""
	    text.split(seps).each { |p|
	      if seps =~ p
	        add_sentence(sentence, p)
	        sentence = ""
	      else
	        sentence = p
	      end
	    }
	  end

	  # ...

	  private

	  def add_sentence(str, terminator)
	    words = str.scan(/[\w']+/)
	    return unless words.size > order # ignore short sentences
	    words << terminator
	    buf = []
	    words.each { |w|
	      buf << w
	      if buf.size == order + 1
	        (@freq[buf[0..-2]] ||= []) << buf[-1]
	        buf.shift
	      end
	    }
	    @beginnings << words[0, order]
	  end

	  # ...

	end

The initialize() method shows a familiar setup, save that new Array we
haven't
talked about yet.  We will tackle that when it comes up.

add_text() isn't much more complicated.  It ensures the document it is
dealing
with contains complete sentences, then breaks it on the end-of-line
punctuation
characters I mentioned earlier.  The trick here is that you need to
notice the
parentheses in the Regexp handed to split().  Ordinarily, split() will
remove
the match used to divide the chunks, but when the match includes
capturing
parentheses, those captures are returned as items of their own.  So, the
final
iterator of the method reads the sentence then the punctuation that
followed it
and hands those over to add_sentence().

add_sentence() is where the data structure we analyzed earlier gets
built.  The
sentence is separated into words, including the end-of-line punctuation
don't
forget.  Then those words are pushed into a buffer one at a time.  When
the
buffer has enough words stored to satisfy the order, it starts filling
the Hash
of frequency Arrays for this chain.  Finally, we see what that extra
Array is
used for.  It holds beginnings, or groups of words used to start a
sentence
(when we wouldn't have previous words to check frequencies for).  You
can see
those being stored in the last line of this method.

Here's the other side of that class, text writing:

	class MarkovChainer
	  # ...

	  def generate_sentence
	    res = @beginnings[rand(@beginnings.size)]
	    loop {
	      unless nw = next_word_for(res[-order, order])
	        return res[0..-2].join(" ") + res.last
	      end
	      res << nw
	    }
	  end

	  private

	  # ...

	  def next_word_for(words)
	    arr = @freq[words]
	    arr && arr[rand(arr.size)]
	  end

	end

Not much here, as you can see.  With the data structure built, the
problem
practically solves itself.

The interface method, generate_sentence(), starts by selecting a
beginning from
the Array of possible beginnings.  It then iterates using
next_word_for() to
grab follow-up words until a nil is returned (after an end-of-line
character,
since we didn't give follow-ups for them).  Whatever we have at that
point
becomes the returned sentence, after adding a sprinkle of whitespace
between the
words.

The helper method, next_word_for(), just does the lookup from the
frequency
Array.  Don't let that fancy last line throw you, it's just a shortcut
to keep
from calling []() when arr is nil.

A huge thank you to all who came out of hiding to get the quiz rolling
again.  I
just knew I would find a problem you couldn't resist eventually.

The response to the Ruby Q. contest has been strong and tomorrow we
will start
our run of contributed quizzes with an offering from Pat Eyler...
This topic is locked and can not be replied to.