Here is my solution to this very interesting quiz.
My MarkovChainer can work with any order and is sentence-oriented: I
split the input text with /([.!?])/ and then scan each sentence with
/[\w']+/, so all punctuation (except ".", "!" and "?") is ignored.
It also stores the first `order` words of each sentence, to use them as
start words when generating sentences.
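The splitting trick relies on a detail of Ruby's String#split: when the pattern contains a capture group, the captured separators are kept in the result, interleaved with the pieces. A small sketch with made-up input:

```ruby
text = "q s a w s. a s w q s!"

# The capture group keeps the terminators in the result array:
parts = text.split(/([.!?])/)
# => ["q s a w s", ".", " a s w q s", "!"]

# Each sentence piece is then scanned into words:
words = parts[0].scan(/[\w']+/)
# => ["q", "s", "a", "w", "s"]
```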
The script accepts command-line arguments for the order (-o) and the
number of sentences to generate (-n); the defaults are -o2 -n10. It then
uses ARGF.read as the input text and prints n generated sentences.
Now let's talk a bit about the data structure:
My first implementation used trees of hashes, and because symbols are
faster as hash keys than strings, I stored the words as symbols. For the
text "q s a w s. a s w q s. q w s a a w. s q a w s.", that looked like:
{:q=>{:s=>{:"."=>1, :a=>1}, :a=>{:w=>1}, :w=>{:s=>1}},
:s=>{:q=>{:a=>1}, :w=>{:q=>1}, :a=>{:a=>1, :w=>1}},
:w=>{:q=>{:s=>1}, :s=>{:"."=>2, :a=>1}},
:a=>{:s=>{:w=>1}, :a=>{:w=>1}, :w=>{:"."=>1, :s=>2}}}
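Such a tree can be built with a short sketch (build_tree is a hypothetical helper for illustration, not part of the final script; it uses the same sentence splitting and word scanning as described above):

```ruby
# Build an order-2 frequency tree: {first => {second => {third => count}}}.
def build_tree(text, order = 2)
  tree = {}
  text.split(/([.!?])/).each_slice(2) do |sentence, terminator|
    next if terminator.nil? # trailing text without a terminator
    words = sentence.scan(/[\w']+/).map(&:to_sym)
    next unless words.size > order # ignore short sentences
    words << terminator.to_sym
    words.each_cons(order + 1) do |*firsts, last|
      # walk/create the nested hashes for the first `order` words
      node = firsts.reduce(tree) { |h, w| h[w] ||= {} }
      node[last] = (node[last] || 0) + 1
    end
  end
  tree
end

tree = build_tree("q s a w s. a s w q s. q w s a a w. s q a w s.")
tree[:w][:s]  # => {:"."=>2, :a=>1}
```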
Then I thought it might be faster to put all the last words into one
array instead of building a frequency hash, because I built that array
anyway when looking for the next word. That looked like:
{:q=>{:s=>[:a, :"."], :a=>[:w], :w=>[:s]},
:s=>{:q=>[:a], :a=>[:w, :a], :w=>[:q]},
:a=>{:s=>[:w], :a=>[:w], :w=>[:s, :".", :s]},
:w=>{:q=>[:s], :s=>[:".", :a, :"."]}}
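A nice side effect of storing every occurrence in the array: picking uniformly from it automatically reproduces the frequency weighting that the count hash encoded. A small sanity check:

```ruby
# Followers of the pair (w, s) in the example above:
# "." occurred twice, "a" once.
followers = [:".", :a, :"."]

# Picking uniformly over occurrences weights by frequency:
counts = Hash.new(0)
10_000.times { counts[followers[rand(followers.size)]] += 1 }
# counts[:"."] ends up roughly twice counts[:a]
```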
And it also was faster. Then I wanted to know if symbols really are
faster than strings:
{"w"=>{"q"=>["s"], "s"=>[".", "a", "."]},
"a"=>{"a"=>["w"], "w"=>["s", ".", "s"], "s"=>["w"]},
"q"=>{"a"=>["w"], "w"=>["s"], "s"=>["a", "."]},
"s"=>{"w"=>["q"], "a"=>["w", "a"], "q"=>["a"]}}
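The claim can be checked with a minimal benchmark sketch along these lines (the hashes and iteration count are made up for illustration, and the numbers will vary by Ruby version and machine):

```ruby
require 'benchmark'

sym_freq = { [:w, :s] => [:".", :a, :"."] }
str_freq = { ["w", "s"] => [".", "a", "."] }
n = 500_000

# With symbols, each generated word must be converted back to a string
# for output; with strings the looked-up word can be used directly.
t_sym = Benchmark.measure { n.times { sym_freq[[:w, :s]].sample.to_s } }
t_str = Benchmark.measure { n.times { str_freq[["w", "s"]].sample } }
puts format("symbols: %.3fs  strings: %.3fs", t_sym.real, t_str.real)
```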
It turned out that strings are faster. I think that is because
converting a symbol to a string (when generating the sentences) needs
one hash lookup and generates a new String object, so the faster hash
lookup of symbols doesn't pay off.
And finally I wanted to see if at least the hash tree is worth it. I
tried the following:
{["q", "a"]=>["w"],
["q", "w"]=>["s"],
["s", "q"]=>["a"],
["w", "q"]=>["s"],
["s", "w"]=>["q"],
["a", "s"]=>["w"],
["w", "s"]=>[".", "a", "."],
["s", "a"]=>["w", "a"],
["q", "s"]=>["a", "."],
["a", "a"]=>["w"],
["a", "w"]=>["s", ".", "s"]}
Well, the hash tree also turned out to be slower than this simple flat
hash (multiple hash lookups are slower than computing the hash of one
multi-element array and doing a single lookup).
(I also tried joined strings as keys (e.g. "q a"=>["w"]), but they are
slower than the array keys, because more transformations are needed.)
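The flat hash works because Ruby arrays hash by content, not identity, so a freshly built two-word array finds the existing entry. A small sketch:

```ruby
freq = {}
(freq[["q", "s"]] ||= []) << "a"
(freq[["q", "s"]] ||= []) << "."

# A new array with the same contents is the same key:
key = ["q", "s"]
freq[key]  # => ["a", "."]
```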
So, the moral of all this:
- don’t use symbols if they have to be converted to a string often
- hash lookups might be slower than you think
- premature optimization…
Dominik
class MarkovChainer
  attr_reader :order

  def initialize(order)
    @order = order
    @beginnings = []
    @freq = {}
  end

  def add_text(text)
    # make sure each paragraph ends with some sentence terminator
    text.gsub!(/\n\s*\n/m, ".")
    text << "."
    seps = /([.!?])/
    sentence = ""
    text.split(seps).each { |p|
      if seps =~ p
        add_sentence(sentence, p)
        sentence = ""
      else
        sentence = p
      end
    }
  end

  def generate_sentence
    # dup, so the stored beginning isn't modified as we append words
    res = @beginnings[rand(@beginnings.size)].dup
    loop {
      unless nw = next_word_for(res[-order, order])
        # join the words, then append the terminator without a space
        return res[0..-2].join(" ") + res.last
      end
      res << nw
    }
  end

  private

  def add_sentence(str, terminator)
    words = str.scan(/[\w']+/)
    return unless words.size > order # ignore short sentences
    words << terminator
    buf = []
    words.each { |w|
      buf << w
      if buf.size == order + 1
        (@freq[buf[0..-2]] ||= []) << buf[-1]
        buf.shift
      end
    }
    @beginnings << words[0, order]
  end

  def next_word_for(words)
    arr = @freq[words]
    arr && arr[rand(arr.size)]
  end
end
if $0 == __FILE__
  order = 2
  n = 10
  while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
    if $1 == "o"
      order = Integer($2)
    else
      n = Integer($2)
    end
    ARGV.shift
  end
  mc = MarkovChainer.new(order)
  mc.add_text(ARGF.read)
  n.times {
    puts mc.generate_sentence
  }
end