Here is my solution to this very interesting quiz.
My MarkovChainer can work with any order and is sentence-oriented: I
split the input text with /([.!?])/ and then scan each sentence with
/[\w']+/, so all punctuation (except ".", "!" and "?") is ignored.
It also stores the first `order` words of each sentence, to use them as
start words when generating sentences.
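The splitting trick relies on a detail of Ruby's String#split: when the pattern contains a capture group, the captured separators are kept in the result, interleaved with the pieces. A small sketch with made-up input:

```ruby
text = "q s a w s. a s w q s!"

# The capture group keeps the terminators in the result array:
parts = text.split(/([.!?])/)
# => ["q s a w s", ".", " a s w q s", "!"]

# Each sentence piece is then scanned into words:
words = parts[0].scan(/[\w']+/)
# => ["q", "s", "a", "w", "s"]
```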
The script accepts command-line arguments for the order (-o) and the
number of sentences to generate (-n); the defaults are -o2 -n10. It then
uses ARGF.read as the input text and prints n generated sentences.
Now let's talk a bit about the data structure:
My first implementation used trees of hashes, and because symbols are
faster as hash keys than strings, I stored the words as symbols. For the
text "q s a w s. a s w q s. q w s a a w. s q a w s.", that looked like:
{:q=>{:s=>{:"."=>1, :a=>1}, :a=>{:w=>1}, :w=>{:s=>1}},
:s=>{:q=>{:a=>1}, :w=>{:q=>1}, :a=>{:a=>1, :w=>1}},
:w=>{:q=>{:s=>1}, :s=>{:"."=>2, :a=>1}},
:a=>{:s=>{:w=>1}, :a=>{:w=>1}, :w=>{:"."=>1, :s=>2}}}
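Such a tree can be built with a short sketch (build_tree is a hypothetical helper for illustration, not part of the final script; it uses the same sentence splitting and word scanning as described above):

```ruby
# Build an order-2 frequency tree: {first => {second => {third => count}}}.
def build_tree(text, order = 2)
  tree = {}
  text.split(/([.!?])/).each_slice(2) do |sentence, terminator|
    next if terminator.nil? # trailing text without a terminator
    words = sentence.scan(/[\w']+/).map(&:to_sym)
    next unless words.size > order # ignore short sentences
    words << terminator.to_sym
    words.each_cons(order + 1) do |*firsts, last|
      # walk/create the nested hashes for the first `order` words
      node = firsts.reduce(tree) { |h, w| h[w] ||= {} }
      node[last] = (node[last] || 0) + 1
    end
  end
  tree
end

tree = build_tree("q s a w s. a s w q s. q w s a a w. s q a w s.")
tree[:w][:s]  # => {:"."=>2, :a=>1}
```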
Then I thought it might be faster to put all the last words into one
array instead of building a frequency hash, because I built that array
anyway when looking for the next word. That looked like:
{:q=>{:s=>[:a, :"."], :a=>[:w], :w=>[:s]},
:s=>{:q=>[:a], :a=>[:w, :a], :w=>[:q]},
:a=>{:s=>[:w], :a=>[:w], :w=>[:s, :".", :s]},
:w=>{:q=>[:s], :s=>[:".", :a, :"."]}}
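A nice side effect of storing every occurrence in the array: picking uniformly from it automatically reproduces the frequency weighting that the count hash encoded. A small sanity check:

```ruby
# Followers of the pair (w, s) in the example above:
# "." occurred twice, "a" once.
followers = [:".", :a, :"."]

# Picking uniformly over occurrences weights by frequency:
counts = Hash.new(0)
10_000.times { counts[followers[rand(followers.size)]] += 1 }
# counts[:"."] ends up roughly twice counts[:a]
```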
And it also was faster. Then I wanted to know if symbols really are
faster than strings:
{"w"=>{"q"=>["s"], "s"=>[".", "a", "."]},
"a"=>{"a"=>["w"], "w"=>["s", ".", "s"], "s"=>["w"]},
"q"=>{"a"=>["w"], "w"=>["s"], "s"=>["a", "."]},
"s"=>{"w"=>["q"], "a"=>["w", "a"], "q"=>["a"]}}
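The claim can be checked with a minimal benchmark sketch along these lines (the hashes and iteration count are made up for illustration, and the numbers will vary by Ruby version and machine):

```ruby
require 'benchmark'

sym_freq = { [:w, :s] => [:".", :a, :"."] }
str_freq = { ["w", "s"] => [".", "a", "."] }
n = 500_000

# With symbols, each generated word must be converted back to a string
# for output; with strings the looked-up word can be used directly.
t_sym = Benchmark.measure { n.times { sym_freq[[:w, :s]].sample.to_s } }
t_str = Benchmark.measure { n.times { str_freq[["w", "s"]].sample } }
puts format("symbols: %.3fs  strings: %.3fs", t_sym.real, t_str.real)
```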
It turned out that strings are faster. I think that is because
converting a symbol to a string (when generating the sentences) needs
one hash lookup and generates a new String object, so the faster hash
lookup of symbols doesn't pay off.
And finally I wanted to see if at least the hash tree is worth it. I
tried the following:
{["q", "a"]=>["w"],
["q", "w"]=>["s"],
["s", "q"]=>["a"],
["w", "q"]=>["s"],
["s", "w"]=>["q"],
["a", "s"]=>["w"],
["w", "s"]=>[".", "a", "."],
["s", "a"]=>["w", "a"],
["q", "s"]=>["a", "."],
["a", "a"]=>["w"],
["a", "w"]=>["s", ".", "s"]}
Well, the hash tree also turned out to be slower than this simple flat
hash (multiple hash lookups are slower than computing the hash of one
multi-element array and doing a single lookup).
(I also tried joined strings as keys (e.g. "q a"=>["w"]), but they are
slower than the array keys, because more transformations are needed.)
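The flat hash works because Ruby arrays hash by content, not identity, so a freshly built two-word array finds the existing entry. A small sketch:

```ruby
freq = {}
(freq[["q", "s"]] ||= []) << "a"
(freq[["q", "s"]] ||= []) << "."

# A new array with the same contents is the same key:
key = ["q", "s"]
freq[key]  # => ["a", "."]
```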
So, the moral of all this:
- don’t use symbols if they have to be converted to a string often
- hash lookups might be slower than you think
- premature optimization…
Dominik
class MarkovChainer
  attr_reader :order

  def initialize(order)
    @order = order
    @beginnings = []
    @freq = {}
  end

  def add_text(text)
    # make sure each paragraph ends with some sentence terminator
    text.gsub!(/\n\s*\n/m, ".")
    text << "."
    seps = /([.!?])/
    sentence = ""
    text.split(seps).each { |p|
      if seps =~ p
        add_sentence(sentence, p)
        sentence = ""
      else
        sentence = p
      end
    }
  end

  def generate_sentence
    # dup, so the stored beginning isn't modified as we append words
    res = @beginnings[rand(@beginnings.size)].dup
    loop {
      unless nw = next_word_for(res[-order, order])
        # join the words, then append the terminator without a space
        return res[0..-2].join(" ") + res.last
      end
      res << nw
    }
  end

  private

  def add_sentence(str, terminator)
    words = str.scan(/[\w']+/)
    return unless words.size > order # ignore short sentences
    words << terminator
    buf = []
    words.each { |w|
      buf << w
      if buf.size == order + 1
        (@freq[buf[0..-2]] ||= []) << buf[-1]
        buf.shift
      end
    }
    @beginnings << words[0, order]
  end

  def next_word_for(words)
    arr = @freq[words]
    arr && arr[rand(arr.size)]
  end
end
if $0 == __FILE__
  order = 2
  n = 10
  while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
    if $1 == "o"
      order = Integer($2)
    else
      n = Integer($2)
    end
    ARGV.shift
  end
  mc = MarkovChainer.new(order)
  mc.add_text(ARGF.read)
  n.times {
    puts mc.generate_sentence
  }
end