Forum: Ruby Markov Chains (#74)

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
4299e35bacef054df40583da2d51edea?d=identicon&s=25 James Gray (bbazzarrakk)
on 2006-04-07 15:04
(Received via mailing list)
The three rules of Ruby Quiz:

1.  Please do not post any solutions or spoiler discussion for this quiz
until
48 hours have passed from the time on this message.

2.  Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3.  Enjoy!

Suggestion:  A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby Talk follow the discussion.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is about text generation.  That's right, we're
going to
teach your computer to weave tall tales.

At its most basic level, a solution might be:

	>> (1..30).map { (("a".."z").to_a + [" "] * 10)[rand(36)] }.join
	=> "fb mcr hhluesjbhtf swm eehokmi"

However, let's make our goal to get as close to English looking
sentences as
possible.  One way you might do this is using a technique called Markov
Chains.

To use Markov Chains, you read some text document(s), making note of
which
characters commonly follow which characters or which words commonly
follow other
words (it works for either scale).  Then, when generating text, you just
select
a character or word to output, based on the characters or words that
came before
it.

The number of previous characters or words considered is called the
"order" and
you can adjust that to try and find a natural feel.  For example, here
is some
generated text using a second order word chain derived from the Sherlock
Holmes
novel "The Hound of the Baskervilles" by Arthur Conan Doyle:

	The stars shone cold and bright, while a crushing weight of
responsibility
	from my shoulders. Suddenly my thoughts with sadness. Then on the
lady's
	face. "What can I assist you?"

If you need text's to prime your program with, I suggest searching
Project
Gutenberg:

	http://www.gutenberg.org/
89d967359903c639d31e4cad4569f537?d=identicon&s=25 Charlie Bowman (Guest)
on 2006-04-07 15:56
(Received via mailing list)
I'm pumped! I've been reading this mailing list for the last 3 months
and I finally feel ready to try out a quiz.  This one sounds fun!
4299e35bacef054df40583da2d51edea?d=identicon&s=25 James Gray (bbazzarrakk)
on 2006-04-07 16:21
(Received via mailing list)
On Apr 7, 2006, at 8:53 AM, Charlie Bowman wrote:

> I'm pumped! I've been reading this mailing list for the last 3 months
> and I finally feel ready to try out a quiz.  This one sounds fun!

I'm glad.

These can be quite a bit of fun, depending on what text you prime
them with...  ;)

James Edward Gray II
A24e072d6092870feff0d5016ff2cdd0?d=identicon&s=25 Aaron Patterson (Guest)
on 2006-04-07 17:07
(Received via mailing list)
On Fri, Apr 07, 2006 at 10:02:28PM +0900, Ruby Quiz wrote:
> possible.  One way you might do this is using a technique called Markov Chains.
I think it would be really neat to apply this to music composition.  I'm
just not sure how you would get a corpus for that (maybe parsing
lilypond files?).

--Aaron
89d967359903c639d31e4cad4569f537?d=identicon&s=25 Charlie Bowman (Guest)
on 2006-04-07 17:16
(Received via mailing list)
or ascii based tab files for us banjo pickers!
D4e51fd9554030ab55c379fdc1a34826?d=identicon&s=25 Keith Lancaster (klancaster)
on 2006-04-07 17:43
(Received via mailing list)
On 4/7/06 10:14 AM, "Charlie Bowman" <charlie@castlebranch.com> wrote:

> or ascii based tab files for us banjo pickers!
>

Banjo? I'm pretty sure he said "music", so I'm not sure why you are
bringing
up banjos. :-)

What's the difference between a banjo and a(n)?

Chain Saw:

   1. a chain saw has a dynamic range.
   2. you can turn a chain saw off.

   3. South American Macaw: one is loud, obnoxious, and noisy; and the
other
is a bird.

   4. Harley Davidson Motorcycle: you can tune a Harley.

   5. Onion: no one cries when you cut up a banjo.

   6. Trampoline: you take your shoes off to jump on a trampoline.

   7. Uzi: an uzi only repeats forty times.

Sorry  - couldn't help myself on this one :-)

Keith
352493919fcca0d211e4ca4458ff045f?d=identicon&s=25 blackcat (Guest)
on 2006-04-07 18:42
(Received via mailing list)
I would like Ruby support of serial ports
for the following three platforms.
#1 Windows
#2 XP
#3 Mac OS X

Please respond offline if you can
show me how to add any or all of these

Gus

rubySPS@oh-bear.com
Be3f276f615e29a7c605efd1d3bc3cae?d=identicon&s=25 Mike (Guest)
on 2006-04-07 18:48
(Received via mailing list)
> I would like Ruby support of serial ports for the following
> three platforms.
> #1 Windows
> #2 XP
> #3 Mac OS X
>

Can you differentiate between/clarify #1 and #2?

-M
89d967359903c639d31e4cad4569f537?d=identicon&s=25 Charlie Bowman (Guest)
on 2006-04-07 19:37
(Received via mailing list)
Banjo players spend half their lives tuning and the other half playing
out of tune.

Banjos are definitely an acquired taste.  I think I've figured out why I
love them so much.  They are as close instrument to a digital signal as
you can get (other than a drum).  The sound is so staccato and you pluck
so many notes in a second that it is very near to being digital :)
896cfc242a7762467c2a0b2af86598e5?d=identicon&s=25 Simon Strandgaard (Guest)
on 2006-04-07 20:11
(Received via mailing list)
On 4/7/06, James Edward Gray II <james@grayproductions.net> wrote:
[snip]
> These can be quite a bit of fun, depending on what text you prime
> them with...  ;)


input_text := ruby_talk;   (* hehe *)
Cff9eed5d8099e4c2d34eae663aae87e?d=identicon&s=25 Jacob Fugal (Guest)
on 2006-04-07 21:12
(Received via mailing list)
On 4/7/06, Simon Strandgaard <neoneye@gmail.com> wrote:
> On 4/7/06, James Edward Gray II <james@grayproductions.net> wrote:
> [snip]
> > These can be quite a bit of fun, depending on what text you prime
> > them with...  ;)
>
> input_text := ruby_talk;   (* hehe *)

I've been considering collecting all the Ilias posts (filtering out
other people's comments that are quoted, for purity) and seeing what
it prints out. ;)

Jacob Fugal
9d0c4e1ee48f3e5a57388eafcb22abcb?d=identicon&s=25 David Brady (Guest)
on 2006-04-08 00:18
(Received via mailing list)
Ruby Quiz wrote:
> 	The stars shone cold and bright, while a crushing weight of responsibility
> 	from my shoulders. Suddenly my thoughts with sadness. Then on the lady's
> 	face. "What can I assist you?"
>


I can't tell if this is a bug, or success beyond my wildest dreams:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
She wore handily the woven craft of lips.  Besotted!  My anglican parrot
wenches freely.  Three dozen Iceland begin to unwind.  Hope.  Charity.
My outrage cannot succumb to rainswept chance.  Now chance he the
broadly winding spar.

This stock is on the move!  You'll never fail to satisfy her again!
Home loans from 0.025%!

V t A G R A from $ 3.50 per item
C 1 @ L I S from $3.75 per item
V Ax L 1 U M from $1.20 per item

Premier PARMACY - your   c e r t i f i e d   online parmacy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I took usenet for my corpus, then used a genetic algorithm with
SpamAssassin for scoring.

And, uh, I wrote it 6 days ago....  ;-)

-dB
E34b5cae57e0dd170114dba444e37852?d=identicon&s=25 Logan Capaldo (Guest)
on 2006-04-08 00:45
(Received via mailing list)
On Apr 7, 2006, at 10:18 AM, James Edward Gray II wrote:

> James Edward Gray II
Indeed, I just finished my solution (first Ruby Quiz, I've done, yay)
and it's all sorts of fun
A9b6a93b860020caf9d2d1d58c32478f?d=identicon&s=25 Ross Bamford (Guest)
on 2006-04-08 01:03
(Received via mailing list)
On Sat, 2006-04-08 at 03:07 +0900, Simon Strandgaard wrote:
> On 4/7/06, James Edward Gray II <james@grayproductions.net> wrote:
> [snip]
> > These can be quite a bit of fun, depending on what text you prime
> > them with...  ;)
>
>
> input_text := ruby_talk;   (* hehe *)

I had to try this out :) With 500 messages from the past couple of days,
order of 3, I found lots of cryptic wisdom when using a small word
limit, such as:

===
windows :\ i guess clarity, just isn't a interesting.

thanks for making def say_stuff puts

bereflected to the foo to design a gui by open source path.

in summary, to get wrote: pointers, or care what you mean.

example, c# guys (including even worse - perl.

warning: pointer targets in boese wrote: use an i'll go digging

+0900, james edward gray would help with reading v_mod/1 m = 'm'

to figure this out? tashiro wrote: you need own license, possibly

april fool's joke. v.- interface and doscommand which next implemented
anobjective-c frontend set of programming
===

I've attached the quick and dirty script I used to get cleaned-up input
text from ruby-talk archive.

This is a seriously cool quiz :) Thanks all concerned.
E34b5cae57e0dd170114dba444e37852?d=identicon&s=25 Logan Capaldo (Guest)
on 2006-04-08 01:22
(Received via mailing list)
On Apr 7, 2006, at 7:02 PM, Ross Bamford wrote:

> days,
> in summary, to get wrote: pointers, or care what you mean.
> anobjective-c frontend set of programming
> <rubytalk2text.rb>
Thanks for the script, I note this interesting result with my
implementation:
% ruby markov.rb rt.text 12
Alternate austin austin austin austin halostatue gmail com alternate
austin austin austin
81d609425e306219d54d793a0ad98bce?d=identicon&s=25 Matthew Moss (Guest)
on 2006-04-08 04:51
(Received via mailing list)
I went and grabbed a bunch of Grimm's fairy tales from Gutenberg.  I
still need to tweak my algorithm some; the most annoying thing I find
is punctuation, especially quoted strings (i.e. speech) so that the
output text looks like there is some valid speech in there and not
just randomly scattered quote marks.

Anyway, here's a sample I produced with my first attempt:

There was a man who kills seven at one blow? I leapt over the tree
because the huntsmen are shooting down there in the closet on the
porch.' The miller said: 'The Devil must go out,' and opened the door
and opened it. As she went in, a little dwarf no bigger than my
finger. And before her stood princes, and dukes, and earls: and the
fisherman went up to her palace all of shining gold; and told her
mother that she must be the handsomest lady in the land; and she went
to the fire and growled contentedly and comfortably. It was not long
in saying 'Yes' to all this; and as they were entering the village,
the son followed the fox's counsel, and without looking about him went
to the other anvil. The old man led him back into the water. The girls
came just in time; they held him fast and tried to free his beard from
the line, but all in vain; he heard or saw nothing of Jorinda. At last
he thought to himself: 'How the old woman is snoring! I must just see
if she wants anything.'


and:

An aged count once lived in Switzerland, who had an only son, but he
was now grown too old to work; so the farmer would give him his only
daughter to her bed-side, and said, 'Always be a good girl, and I will
never by myself leave the path, to run into the wood, till at last
they danced out at the gate together. When they had walked a short
time, when they had warmed themselves, they said: 'Comrade, shall we
have a child, and he grows big, and we send him into the wide world,
and looked neither to the right place. There they found the giants
swimming in their blood, and all round about her, and the bells rang
at each step which she took. Then she was in a great hurry. The little
tailor looked round and saw a flock of sheep, the very shepherd whom
the peasant knew had long been wishing to be mayor, so he cried with
all his might.
352493919fcca0d211e4ca4458ff045f?d=identicon&s=25 blackcat (Guest)
on 2006-04-08 18:42
(Received via mailing list)
Correction to earlier post  AGS Calabrese
00000000000000000000000000000000000000000000000000
I would like Ruby support of serial ports
for the following three platforms.
#1 Windows
#2 Linux
#3 Mac OS X   Tiger

Please respond offline if you can
show me how to add serial support to any or all of these

Gus

rubySPS@oh-bear.com

my spam filter has been eating messages so I may miss your message.
I am working on the problem
000000000000000000000000000000000000000000000000000



On 2006-Apr 07, at 10:46 AM, Mike wrote:

> I would like Ruby support of serial ports for the following
> three platforms.
> #1 Windows
> #2 XP
> #3 Mac OS X
>

Can you differentiate between/clarify #1 and #2?

-M
0aca516785b12ea4ffd4fec43d69f586?d=identicon&s=25 Albert Vernon Smith (Guest)
on 2006-04-09 15:47
(Received via mailing list)
Hi all-

Here is my first attempt at a Ruby Quiz  (attached file:
'markovtext.rb').  I'm new to Ruby, mostly paying the bills over the
past years with Perl work.  This quiz was fun, and I learned a lot
from doing this.  Coming from Perl, I often rely upon auto-
vivification, so I needed to figure out how to work around this.
Perhaps there are improved ways of going about this, and I'd
appreciate any feedback on how to go about it better.

My solution is fairly straight forward, using a second order word
chain, which I imagine many others will be implementing in similar
ways.  As noted in the comments, I only start making my paragraph on
the second sentence, in order to get greater diversity in how the
paragraph starts.  I'm also striping out some of the punctuation from
the end result, as some texts give orphaned punctuation marks.

I had the most fun from the quiz running my script against various
texts, and seeing the results.  My favorite has been getting results
from "Huckleberry Finn", though the grammar gets a bit off at times,
but that relates to the input text.  For comparison, one can see much
more typical sentence structure from "Journey to the Center of the
Earth".

Example results:

==

Huck Finn via 'markovtext.rb':

And above all, don't you try to run in the dark he says: Yes, Mars
Sid, A dog. Cur'us dog, too. Does you want it, you can go and write
poetry after them out loud. One bill said, The celebrated Dr. Armand
de Montalban, of Paris, would lecture on the floor and tied up the
bank. I couldn't help it, and she snatched a kiss of him, and never
think about it; and I went along up the smoothness and the judge for
him, being a mystery, and he'd cipher out a speech, all full of it,
and I would kill me, dey sk'yers me so.

==

Journey to the Center of the Earth via 'markovtext.rb':

I gave way to the south-east. We have passed through the treatises of
Cassanion, and all his time at Altona, staying with a translation?
This IS the Icelandic Professor. At this rate we shall see. So says
the Professor, no doubt, was pursuing his observations or taking
notes, for in three days we must go back to the dull thuds of the
country might be held up by the descent commenced. I can but try
Spanish, French, Italian, Greek, or Hebrew.

==

Cheers,
-albert
A9b6a93b860020caf9d2d1d58c32478f?d=identicon&s=25 Ross Bamford (Guest)
on 2006-04-09 18:51
(Received via mailing list)
Here is my solution. It was so much fun I couldn't help but keep making
small tweaks. Here are a few of my favourite bits of output with varying
word limits, from various sources (including rubytalk):

====
thou wast question'd my algorithm? -- posted delight hath.

FasterCSV can also reason, having methods 2006 08:58 am, after lunch.

April 1st joke. I'm subscribed to ruby-dev and the other hand, I've
thought of it...hrrmmm... And now I fear, that the Mozilla Foundation
relicensed everything as MPL and GNU GPL provides. It is unfortunate
that it should be able to check against multiple types as well Hi.

Ruby Quiz: 1. Please do not hesitate dive in to it, on the C:\ruby\lib
\ruby\1.8\cgi\session.rb file, $350 convoluted I'm licenses David 11:26
install the 'ruby-db2-0.4' package and all that...

I'll blow this police whistle from my shoulders. Suddenly and makes it
incompatible. But I might have agreed tosuch without having to support
meta-progarmming. Perhaps an example how to handle initialize_copy: The
mixin manages some attributes and the other hand, I've thought of
it...hrrmmm...
====

Before using it, you have to give it some knowledge - the -l option
accepts a filename or URL to learn from (omit the filename for stdin).
After chomping through it, the knowledge will get dumped out to a file
ready to be used for generation. You can do this multiple times to
incrementally teach it.

If you want to use a large body of input I recommend storing each dump
in it's own file, which you can do with the -f option. Then, specify one
or more -f options when running generation to use the specified
knowledge file(s).

Once you've got some input, run markov.rb, specifying your -f options if
you're not using the default 'chainstore' file, and optionally
generation parameters such as -w N to limit to at most N words.

There's a few options you can pass to control both learning and
generation, there's a bit of doc available from the -h option.

I posted a few chomped input files at:
	http://roscopeco.co.uk/code/ruby-quiz-entries/74/
	http://roscopeco.co.uk/code/ruby-quiz-entries/74/README

Thanks again for this quiz, I've gotten a fair few lol moments out of
it :)
74361169979a843b3a5c12d81000debc?d=identicon&s=25 Jon Egil Strand (Guest)
on 2006-04-09 19:19
(Received via mailing list)
Good quiz this one.

Markov Chains are highly practical creations, we use them extensively
when
calulating life insurance and pensions schemes in Mercer.

Wikipedia has a nice summary of other applications, I recommend it
hihgly:
http://en.wikipedia.org/wiki/Markov_chains#Scienti...

At University, they taught us Markov Chains multiple times for various
uses, but none of them used this linguistic approach. It's very
intuitive,
and so much more fun than the hard-core introductions we were given. So
if
I ever teach, I'll teach this quiz.


Here we go:

As primer for this writeup I've used Agatha Cristie's "The mysterious
affair at Styles", a 57_000 words english novel starring Poirot, her
famous detective.

To establish the chain of words I use a Hash of Hashes, where the first
is
a simple lookup of words, and the second containt the words that folow.
The second also counts how many times each word follows the leader.

Example:

Murder of Emily Agnes...
Murder against Alfred Inglethorp ...
Murder against some person ...
Murder against him right ...

@words["Murder"]        >> {"of"=>1, "against"=>3}


The counts are to weigh the randomness when selecting the next word. I
wanted a selection process where rare followers in the real world are
rear
in this world as well. After 'Murder', 'of' will pop up every forth
time.
This is done here

    sum = candidates.inject(0) {|sum,kv| sum += kv[1]}
    random = rand(sum)+1
    partial_sum = 0
    next_word = candidates.find do |word, count|
      partial_sum += count
      partial_sum >= random
    end.first

sum is the total number of followers (4 for 'Murder'). I pick a random
number 0..3, add +1 for indexing and iterate my way up the list of
followers, and keep track of how many words I have passed on my way.
When
we reach (or pass for the first time) the random limit, the word at hand
is the word we want. Since the block finds the first ['word', count]
pair
that satifies our condition, (f.ex ['against', 3]), I use .first to get
it.

That's it. Now just build a sentence by starting with a word and move
along it's path. My syntectic rules are very simple, a sentence is
complete when it containts a period ".", thereby opening up for som
silly
grammar. But it doesn't really matter, grammar processing isn't the text
of the day.

The results are good enoughby far, and quite entertaining too. But alas,
even Mr Poirot would be hard pressed to solve this case.


Output # 1

Murder of sight. Mary Cavendish was vital to connect the charred paper.
"A
great risk, it was a book?" "Yes, often. I could certainly had already
been tested!" I was before I asked.


Output # 2

Murder against Alfred Inglethorp, I fancied that, mon ami, it is broken
bell, and the matter, hoping that a note." Miss Howard had a day we came
to warn him more, had kindly offered to make amends; which weighed with
the Hall in the company.  Miss Howard was that?" "She made it was the
Hall--you'n a penknife, and went round Styles Court in some time I
voiced
the room?" "I was inclined to call for some friends one side. His words
of
horrors business to make a very fond of her--yes, I made a message to
the
slips of dismissal.


=== code ===


class MarkovChain
  def initialize(text)
    @words = Hash.new
    wordlist = text.split
    wordlist.each_with_index do |word, index|
      add(word, wordlist[index + 1]) if index <= wordlist.size - 2
    end
  end

  def add(word, next_word)
    @words[word] = Hash.new(0) if !@words[word]
    @words[word][next_word] += 1
  end

  def get(word)
    return "" if !@words[word]
    followers = @words[word]
    sum = followers.inject(0) {|sum,kv| sum += kv[1]}
    random = rand(sum)+1
    partial_sum = 0
    next_word = followers.find do |word, count|
      partial_sum += count
      partial_sum >= random
    end.first
    #puts "Selected #{next_word} from #{candidates.inspect}"
    next_word
  end
end

mc = MarkovChain.new(File.read("Agatha Christie - The Mysterious Affair
at
Styles.txt"))

sentence = ""
word = "Murder"
until sentence.count(".") == 4
  sentence << word << " "
  word = mc.get(word)
end
puts sentence << "\n\n"

=== code ===
86216345f72b5c078098b6597b232ee4?d=identicon&s=25 Himadri Choudhury (Guest)
on 2006-04-09 19:43
(Received via mailing list)
Great quiz. It will be interesting to see how others have solved this
problem.
Here is my submission.

To use run the following:

$ cat <some_text_file> | ./rand_text.rb

Or you can give options

$ cat <some_text_file> | ./rand_text.rb -o 2 -n 10

-o : the order. Which is the number of previous words to consider
-n : the number of sentences to output

I used an hash of arrays to keep track of the possible state
transitions.
The key is the current state, and the contents of the array is the
possible
next states. When generating the output I randomly select elements from
this
array. I always start with the first 'n' number of words in the original
text, where 'n' is the order.

There is a sample output where <some_text_file> is Moby Dick and using
the
default parameters of order = 2, and number of sentences = 10.



==
Call me Ishmael.
Some years ago--never mind how long precisely --having little or no
money in
my soul; whenever I find myself involuntarily pausing before coffin
warehouses, and bringing up the rear of every kind whatsoever.
It is a damp, drizzly November in my purse, and nothing particular to
interest me on shore, I thought I would sail about a little and see the
mummies of those creatures in their huge bake-houses the pyramids.
No, when I go to sea, I go to sea as a Commodore, or a Captain, or a
Cook.
I abandon the glory and distinction of such offices to those who like
them.
For my part, I abominate all honorable respectable toils, trials, and
tribulations of every kind whatsoever.
It is quite as much as I can.
This is my substitute for pistol and ball.
With a philosophical flourish Cato throws himself upon his sword; I
quietly
take to the royal mast-head.
True, they rather order me about--however they may thump and punch me
about,
I have of driving off the spleen, and regulating the circulation
==
E34b5cae57e0dd170114dba444e37852?d=identicon&s=25 Logan Capaldo (Guest)
on 2006-04-09 21:49
(Received via mailing list)
Here's my solutions. I wrote this 2.5 times. The first version
worked, but it was slow as molasses. The second one is much much
faster. The 2.5th inherits from the second one and produces more
readable, complete sentences

first, slow version:
% cat markov.rb
class WordTable
   def initialize(src, separator = /\W+/, word_shape = /\w+/)
     @src = src
     @separator = separator
     @word_shape = word_shape
   end

   def build_table
     @table = Hash.new { |h, k| h[k] = [] }
     pos = 0
     while line = @src.gets
       line = line.chomp.downcase
       words = line.split(@separator).select { |potential_word|
         potential_word.match(@word_shape)
       }
       words.each do |word|
         @table[word] << pos
         pos += 1
       end
     end
     self
   end
   def words
     @table.keys
   end

   def positions_for(word)
     @table[word]
   end

   def followers_for(word)
     positions = positions_for(word)
     followers_positions = positions.map { |i| i + 1 }
     following_words = self.words.select { |key_word|
       followers_positions.any? { |pos| positions_for
(key_word).include?(pos) }
     }
     following_words
   end
end

class ChainBuilder
   attr_accessor :chain_length
   def initialize(initial_word, word_table, chain_length = 10)
     @initial_word =  initial_word
     @word_table = word_table
     @chain_length = chain_length
   end

   def distributed_followers(word)
     distd_followers = []
     followers = @word_table.followers_for(word)
     positions_of_word = @word_table.positions_for(word)
     followers.each do |follower|
       follower_positions =  @word_table.positions_for(follower)
       count = follower_positions.select { |pos|
         prec =  pos - 1
         positions_of_word.include?(prec)
       }.length
       distd_followers += [follower] * count
     end
     distd_followers
   end
   def randomized_next_word(word)
     choices = distributed_followers(word)
     choices[ rand(choices.length) ]
   end
   def chain
     res_chain = [@initial_word]
     (self.chain_length - 1).times do
       res_chain << randomized_next_word(res_chain.last)
     end
     res_chain
   end
end



if $0 == __FILE__
   if ARGV[0].nil?
     STDERR.puts "Please provide a corpus."
     STDERR.puts "#$0 usage: #$0 <corpus file name> [chain length]
[initial word]"
     exit 1
   end
   chain_len = (ARGV[1]  || 10).to_i
   wt = nil

   File.open(ARGV[0]) do |file|
     #wt = WordTable.new(file, //, /./) # try by characters
     wt = WordTable.new(file)
     wt.build_table
   end
   init_word = (ARGV[2] || wt.words[ rand(  wt.words.length )  ] )

   chain_builder = ChainBuilder.new(init_word, wt, chain_len)
   chain = chain_builder.chain
   puts chain.join(' ').capitalize
end

Second, quicker version:
% cat markov2.rb
class MarkovChainBuilder
   class Entry < Struct.new(:word_index, :successors)
   end
   def initialize(src, separator = /\W+/, word_shape = /\w+/,
downcase = true)
     @src = src
     @separator = separator
     @word_shape = word_shape
     @downcase = downcase
     build_tables
   end
   def build_tables
     @word_list = []
     @successor_lists = {}
     last_word = nil
     @src.each do |line|
       line = line.downcase if @downcase
       line = line.chomp
       words_for_line = line.split(@separator).select{ |w| w.match
(@word_shape)}
       words_for_line.each do |word|
         unless @successor_lists.has_key? word
           entry = @successor_lists[word] = Entry.new
           @word_list << word
           entry.word_index = @word_list.length - 1
           entry.successors = []
         end

         unless last_word.nil?
           @successor_lists[last_word].successors << @successor_lists
[word].word_index
         end
         last_word = word
       end
     end
   end
   def distributed_successors_for(word)
     @successor_lists[word].successors.map { |index| @word_list[index] }
   end
   def randomized_next_for(word)
     succs = distributed_successors_for(word)
     succs[ rand(succs.length) ]
   end
   def markov_chain(initial_word, len = 10)
     res = [initial_word]
     (len - 1).times do
       res << randomized_next_for(res.last)
     end
     res
   end
   def words
     @word_list
   end
   private :build_tables
end

if $0 == __FILE__
   if ARGV[0].nil?
     STDERR.puts "Please provide a corpus."
     STDERR.puts "#$0 usage: #$0 <corpus file name> [chain length]
[initial word]"
     exit 1
   end
   chain_len = (ARGV[1]  || 10).to_i
   mc = nil

   File.open(ARGV[0]) do |file|
     mc = MarkovChainBuilder.new(file)
   end
   init_word = (ARGV[2] || mc.words[ rand(  mc.words.length )  ] )

   chain = mc.markov_chain(init_word, chain_len)
   puts chain.join(' ').capitalize
end

And now the third version which is basically the second version, but
looks for sentences:
% cat smarter_markov.rb
require 'markov2'
class SmarterMarkovChainBuilder < MarkovChainBuilder
   def initialize(src)
     super(src, /\s+/, /\S+/, false)
   end
   def markov_chain(initial_word, len = 10)
     initial_chain = super
     index_of_last_word = nil
     initial_chain.each_index do |idx|
       if initial_chain[idx] =~ /\.|\?|!/
         index_of_last_word = idx
         break
       end
     end
     if index_of_last_word
       initial_chain[0..index_of_last_word]
     else
       initial_chain
     end
   end
end
if $0 == __FILE__
   mc = nil
   File.open(ARGV[0]) do |file|
     mc = SmarterMarkovChainBuilder.new(file)
   end
   start_words = mc.words.select { |w| w =~ /^[A-Z]/ }
   init_word = start_words[ rand(start_words.length) ]
   puts mc.markov_chain(init_word, 200).join(' ').gsub(/"/,'')
end
A9c4658e9e475e13d790ae419acf01b6?d=identicon&s=25 Simon Kröger (Guest)
on 2006-04-09 22:05
(Received via mailing list)
Hi all,

here is my solution. Its a rather stupid approach, but as many simple
and short solutions its better than i had expected.

It uses some gsubs to simplify the input text and surround each word
and punctuation by exactly one space character.

Scan is used to find occurrences of the last <order> words and
store the next words respectively. (even multiple times)

From this array one is choose randomly (which preserves the original
frequency of occurrence because the words are stored multiple times
in the array, increasing the chance of being picked)

Thats it. It works well for files of several MB and gets even faster
if the order is higher.

usage: quiz74.rb <order> [input files]
--------------------------------------------------------------------
before, line, order = ['.'], '', ARGV.shift.to_i

txt = ARGF.read.tr('"/*\\()[]', ' ').downcase
txt = txt.gsub(/[^'\w]/, ' \0 ').gsub(/\s+/, ' ')

500.times do
  ary = txt.scan(/ #{before.join(' ')} (\S+)/).transpose.first
  before << ary[rand(ary.size)]
  before.shift if before.length > order
  (puts line; line = '') if line.length > 50 &&  /\w/ =~ before.last
  line << ( /\w/ =~ before.last && !line.empty? ? ' ' : '') <<
before.last
end
--------------------------------------------------------------------

cheers

Simon
C363de83c027342b5a3a2c946a6bd4ce?d=identicon&s=25 Barry Dmytro (Guest)
on 2006-04-10 02:18
(Received via mailing list)
My approach isn't terribly complex, but it was fun to write.  It
requires that you have a plain text file "book.txt" in the directory of
execution.  I used http://www.gutenberg.org/dirs/etext91/alice30.txt for
all of my tests and my favorite phrase that it came up with was:

Beautiful soup beauootiful soooop soooop of evidence said the queen
turning to the dormouse.

How is works is loads each word into a hash stripping punctuation and
capitalization associating each word with an array of common words that
it is followed by.  One word is picked by random to start us off then it
picks randomly commonly chained words from there till it ends up with a
word that was at the end of sentence in the original document.

Kind Regards,
Bd0203dc8478deb969d72f52e741bd4f?d=identicon&s=25 Daniel Baird (Guest)
on 2006-04-10 02:36
(Received via mailing list)
I think someone needs to take all the explanations from this thread and
run
their markov chainer on that.

:D aniel


--
Daniel Baird
http://danielbaird.com (TiddlyW;nks! :: Whiteboard Koala :: Blog ::
Things
That Suck)
[[My webhost uptime is ~ 92%.. if no answer pls call again later!]]
730d81a72f3ae2beb5f575c3bc7541a3?d=identicon&s=25 Brian Ardrey (Guest)
on 2006-04-10 04:37
(Received via mailing list)
Here is my first ruby quiz solution submission. Thanks for posting
this one, it provided a fun break from my school projects.

I tried to make the code as general as possible, with variable order
of both words and letters. It uses a hash of arrays for storage. I was
planning on unifying the MarkovWords and MarkovLetters classes into
one, but decided against it, mostly to keep them simpler. These
examples use an english translation of The Odyssey to derive the text.


  Usage...

[brian@spica] [~/rubyquiz]
$ ruby markov.rb -h
Usage: ruby markov.rb [options] <filename>
  -h, --help         show this usage message
  -o, --order        set markov chain order
  -s, --sentences    set number of sentences to print
  -w, --words        set number of words to print
  -l, --letters      use letters as the basic unit


  Print 5 sentences of order 2, using words as the base unit...

[brian@spica] [~/rubyquiz]
$ ruby markov.rb -s 5 -o 2 ../etext/dyssy10.txt
My men came out of your wits? If Apollo and the whole world, neither
rich nor poor, who is handsome and clean built, whereas I am so lost
in a beautiful golden ewer, and poured it over with a lie; 'Neptune,'
said I, 'of escaping Charybdis, and at the base of her maids brought
the heifer down with my tears during the darkness of death itself. A
poor unfortunate tramp has come to pass." And Penelope answered,
"Stranger, you must have gone off to bring the contest to an end."
Melanthius lit the fire for her, but she wishes to hear the enchanting
sweetness of their grey hair.


  Print 1 sentence of order 1, using words as the base unit...

[brian@spica] [~/rubyquiz]
$ ruby markov.rb -s 1 -o 1 ../etext/dyssy10.txt
Proserpine had got into a mission to be scandalised at last of them,
and lay a hard task, no more fond of wind blew a man talked and will
leave their sport known that stalks about his wife, and the store for
he stretched out their own house, which I shall be willing to the
house to you, Telemachus.


  Print 50 words of order 3, using letters as the base unit...

[brian@spica] [~/rubyquiz]
$ ruby markov.rb -w 50 -o 3 -l ../etext/dyssy10.txt
the of it into beautiful nose kill as did ther people the wood and him
all the meanthrountry stilltrese ffere peak the the the eleide oly
with blooked ent back the himself over s his hey glarge tood welcomind
where and mannot been spoke aloud valitterince one of his


  Print 50 words of order 2, using letters as the base unit, with text
  from The Aeneid (latin version)...

[brian@spica] [~/rubyquiz]
$ ruby markov.rb -w 50 -o 2 -l ../etext/anidl10.txt
quora desi aras ta acerummarmallenstos es mihinus vade imurbethinc
plia caum turnit que re et sum fuit ade re restaeteri invicaviam
ceuctitur hos aectalisubsillo parmeis suntereffunc meo que
clachilliaeque mul rut moropula sonitotuspiumque terrequebraherautras
nos ad la ciem atque part dubstra repononibus comet ad num undo cisque
retrangeneferat simas obla




[brian@spica] [~/rubyquiz]
$ cat markov.rb
#!/usr/bin/ruby

class MarkovWords

  def initialize(filename, order)
    @order = order
    @words = Hash.new
    @state = Array.new(@order)
    previous = Array.new(@order)
    File.foreach(filename) do |line|
      line.split(/\s+/).each do |word|
        unless previous.include?(nil)
          p = previous.join(' ')
          unless @words.has_key?(p)
            @words[p] = Array.new
          end
          @words[p] << word
        end
        previous.shift
        previous.push(word)
      end
    end
  end

  def print_words(n = 50)
    word = next_word(true)
    1.upto(n) do |i|
      print word, i == n ? "\n" : " "
      word = next_word()
    end
    print "\n"
  end

  # sentences start with a capital or quoted capital and end with
  # punctuation or quoted punctuation
  def print_sentences(n = 5)
    sentences = 0
    word = next_word(true)
    while word !~ /^['"`]?[A-Z]/
      word = next_word()
    end
    begin
      print word
      if word =~ /[?!.]['"`]?$/
        sentences += 1
        if sentences == n
          print "\n"
        else
          print " "
        end
      else
        print " "
      end
      word = next_word()
    end until sentences == n
  end

  def next_word(restart = false)
    if restart or @state.include?(nil)
      key = @words.keys[rand(@words.length)]
      @state = key.split(/\s+/)
    end
    key ||= @state.join(' ')
    # restart if we hit a dead end, rare unless text is small
    if @words[key].nil?
      next_word(true)
    else
      word = @words[key][rand(@words[key].length)]
      @state.shift
      @state.push(word)
      word
    end
  end

end

class MarkovLetters

  def initialize(filename, order)
    @order = order
    @letters = Hash.new
    @state = Array.new(@order)
    previous = Array.new(@order)
    File.foreach(filename) do |line|
      line.strip!
      line << ' ' unless line.length == 0
      line.gsub!(/\s+/, ' ')
      line.gsub!(/[^a-z ]/, '')
      line.split(//).each do |letter|
        unless previous.include?(nil)
          p = previous.join('')
          unless @letters.has_key?(p)
            @letters[p] = Array.new
          end
          @letters[p] << letter
        end
        previous.shift
        previous.push(letter)
      end
    end
  end

  # words begin after a space and end before a space
  def print_words(n = 50)
    letter = next_letter(true)
    while letter != ' '
      letter = next_letter()
    end
    letter = next_letter()
    words = 0
    while words < n
      words += 1 if letter == ' '
      print letter
      letter = next_letter()
    end
    print "\n"
  end

  def next_letter(restart = false)
    if restart or @state.include?(nil)
      key = @letters.keys[rand(@letters.length)]
      @state = key.split(//)
    end
    key ||= @state.join('')
    # restart if we hit a dead end, rare unless text is small
    if @letters[key].nil?
      next_letter(true)
    else
      word = @letters[key][rand(@letters[key].length)]
      @state.shift
      @state.push(word)
      word
    end
  end

end

if $0 == __FILE__
  require 'getoptlong'

  def usage()
    $stderr.puts "Usage: ruby #{$0} [options] <filename>",
                 "  -h, --help         show this usage message",
                 "  -o, --order        set markov chain order",
                 "  -s, --sentences    set number of sentences to
print",
                 "  -w, --words        set number of words to print",
                 "  -l, --letters      use letters as the basic unit"
  end

  order = 2
  sentences = 5
  words = nil
  letters = false

  opts = GetoptLong.new(["--help",      "-h", GetoptLong::NO_ARGUMENT],
                        ["--order",     "-o",
GetoptLong::REQUIRED_ARGUMENT],
                        ["--sentences", "-s",
GetoptLong::REQUIRED_ARGUMENT],
                        ["--words",     "-w",
GetoptLong::REQUIRED_ARGUMENT],
                        ["--letters",   "-l", GetoptLong::NO_ARGUMENT])

  opts.each do |opt, arg|
    case opt
    when "--help"
      usage
      exit 0
    when "--order"
      order = arg.to_i
    when "--sentences"
      sentences = arg.to_i
      words = nil
    when "--words"
      words = arg.to_i
      sentences = nil
    when "--letters"
      letters = true
      words = 50 if words.nil?
    end
  end

  if ARGV.length < 1
    usage
    exit 1
  end

  ARGV.each do |arg|
    begin
      if letters
        m = MarkovLetters.new(arg, order)
        m.print_words(words)
      else
        m = MarkovWords.new(arg, order)
        m.print_words(words) unless words.nil?
        m.print_sentences(sentences) unless sentences.nil?
      end
    rescue
      $stderr.puts $!
    end
  end
end
280b41a88665fd8c699e83a9a25ef949?d=identicon&s=25 Stephen Waits (Guest)
on 2006-04-10 06:11
(Received via mailing list)
On Apr 9, 2006, at 5:35 PM, Daniel Baird wrote:

> I think someone needs to take all the explanations from this thread
> and run
> their markov chainer on that.

Or everything why's ever said.

--Steve
A9b6a93b860020caf9d2d1d58c32478f?d=identicon&s=25 Ross Bamford (Guest)
on 2006-04-10 13:09
(Received via mailing list)
On Mon, 2006-04-10 at 13:08 +0900, Stephen Waits wrote:
> On Apr 9, 2006, at 5:35 PM, Daniel Baird wrote:
>
> > I think someone needs to take all the explanations from this thread
> > and run
> > their markov chainer on that.
>
> Or everything why's ever said.

I got a few interesting bits using just this year's redhanded posts.
Thanks to the small input body and a low variance setting there's almost
as much between the lines as in the original posts :)

=====
Ten-Sided is an alias for html. xhtml_strict does the same, but with
root beer and cream soda lip balms instead of Roofies or Vitamin K.
Actually, red.trust_sphere.each patches up batsmanâ??s user by then
reflect ACTIVERECORD.

Service Overrides Adding stuff like sessioning and authentication
demands hooks on the web. Skeptical? Iâ??d say you invented tumblelogging
way back in the above equation start out equal. Try seeding the
rankings. Maybe you want trust from Matz to be full projects.

And, of course, he really can be seen as a not entirely bad reason for
actually providing a time of 1.815 seconds, compared with St.
Valentineâ??s YARV which will format the date according to a classâ?? page.

I think Marcel has fixed this in Edge Rails. We get on these little
kicks. As weâ??re all snooping around. Little protocols or obscure version
control systems. Or domino games or something. Trust metric, man. It
just strikes me all the time.

We have, right now, an over-night offsite meeting at a Chupei hackathon
to get Perlâ??s YAML::Syck module on its legs and with a gem, but you
never know. Radicals often portray future animal executions in the above
code and also a follow-up with darcs and switchtower.

Install FuseFS. Run rubyfs.rb. Weâ??ve been through this day before we
talk about this at CampingSessions on the wiki. Good thing gabriele is
out there, because Iâ??ve been playing with a timezone setting, right?
These are tough issues, with a rewrite of miniature file-sharing and I
really appreciate him.

The textarea technique is very hard to get even with net auctions. Donâ??t
get me wrong. It does not mean the book is bad.
=====
188ff29b8682ec3a04e88d85a427300d?d=identicon&s=25 Tom Rauchenwald (Guest)
on 2006-04-10 14:11
(Received via mailing list)
Hi,

this is the first time i participate in a ruby-quiz. It was quite
funny programming this, and testing it out.
I know that my solution is far from perfect, but since I
invested some time in it (I'm quite new to ruby, so had to look up a
lot of stuff) I'm submitting it.
It does nothing fancy, and it misses some error-checking. The main data
structure is a Hash, it's explained in the code.
If I did something completely stupid it would be nice if someone tells
me :-)

Thanks,
Tom
4299e35bacef054df40583da2d51edea?d=identicon&s=25 James Gray (bbazzarrakk)
on 2006-04-10 16:08
(Received via mailing list)
On Apr 9, 2006, at 8:45 AM, Albert Vernon Smith wrote:

> Coming from Perl, I often rely upon auto-vivification, so I needed
> to figure out how to work around this.  Perhaps there are improved
> ways of going about this, and I'd appreciate any feedback on how to
> go about it better.

Your code looks very clean to me!  A minor thing I noticed is that
you could simplify:

   .gsub(/[\"\(\)]/,"")

to:

   .delete('"()')

Can you show an example using Perl's auto-vivification that feels
funny in Ruby?  Perhaps we would have better ideas after seeing it...

Thanks for the submission!

James Edward Gray II
4299e35bacef054df40583da2d51edea?d=identicon&s=25 James Gray (bbazzarrakk)
on 2006-04-10 16:21
(Received via mailing list)
On Apr 10, 2006, at 7:08 AM, Tom Rauchenwald wrote:

> If I did something completely stupid it would be nice if someone
> tells me :-)

Nothing stupid no.  Here are a few extremely minor suggestions.

         File.open(filename) do |io|
             io.each do |line|
		...
             end
         end

That pattern can be simplified to:

   File.foreach(filename) do |line| ... end

And...

             if @frequencies[tupel].include?(word)
                 @frequencies[tupel][word]+=1
             else
                 @frequencies[tupel][word]=1
             end

If you set the Hash to default to a value of 0, you could simplify
the above to just:

   @frequencies[tupel][word]+=1

All in all, you're code is quite nice.

Thanks for the submission.

James Edward Gray II
18ca239ffade6df0b839d26062f173fb?d=identicon&s=25 Dominik Bathon (Guest)
on 2006-04-10 19:16
(Received via mailing list)
Here is my solution to this very interesting quiz.

My MarkovChainer can work with any order and it is sentence oriented: I
split the input text with /([.!?])/ and then scan each sentence with
/[\w']+/, so all punctuation (except ".", "!" and "?") is ignored.
It also stores the first <order> words of each sentence, to use them as
start words when generating sentences.

The script accepts command line arguments for the order (-o) and the
number of sentences to generate (-n), the defaults are -o2 -n10. Then it
uses ARGF.read as input text and prints n generated sentences.

Now lets talk a bit about the data structure:

My first implementation used trees of hashes and because symbols are
faster as hash keys than strings, I stored the words as symbols:

For the text "q s a w s. a s w q s. q w s a a w. s q a w s.", that
looked
like:

{:q=>{:s=>{:"."=>1, :a=>1}, :a=>{:w=>1}, :w=>{:s=>1}},
  :s=>{:q=>{:a=>1}, :w=>{:q=>1}, :a=>{:a=>1, :w=>1}},
  :w=>{:q=>{:s=>1}, :s=>{:"."=>2, :a=>1}},
  :a=>{:s=>{:w=>1}, :a=>{:w=>1}, :w=>{:"."=>1, :s=>2}}}

Then I thought it might be faster to put all the last words into one
array
instead of building a frequency hash, because I built that array anyway,
when I was looking for the next word: That looked like:

{:q=>{:s=>[:a, :"."], :a=>[:w], :w=>[:s]},
  :s=>{:q=>[:a], :a=>[:w, :a], :w=>[:q]},
  :a=>{:s=>[:w], :a=>[:w], :w=>[:s, :".", :s]},
  :w=>{:q=>[:s], :s=>[:".", :a, :"."]}}

And it also was faster. Then I wanted to know if symbols really are
faster
then strings:

{"w"=>{"q"=>["s"], "s"=>[".", "a", "."]},
  "a"=>{"a"=>["w"], "w"=>["s", ".", "s"], "s"=>["w"]},
  "q"=>{"a"=>["w"], "w"=>["s"], "s"=>["a", "."]},
  "s"=>{"w"=>["q"], "a"=>["w", "a"], "q"=>["a"]}}

It turned out that strings are faster. I think that is because
converting
a symbol to a string (when generating the sentences) needs one hash
lookup
and generates a new String object. So the faster hash lookup of symbols
doesn't pay off.

And finally I wanted to see if at least the hash tree is worth it. I
tried
the following:

{["q", "a"]=>["w"],
  ["q", "w"]=>["s"],
  ["s", "q"]=>["a"],
  ["w", "q"]=>["s"],
  ["s", "w"]=>["q"],
  ["a", "s"]=>["w"],
  ["w", "s"]=>[".", "a", "."],
  ["s", "a"]=>["w", "a"],
  ["q", "s"]=>["a", "."],
  ["a", "a"]=>["w"],
  ["a", "w"]=>["s", ".", "s"]}

Well, the hash tree also turned out to be slower than this simple flat
hash (multiple hash lookups are slower than computing the hash of a
multi
element array and one lookup).
(I also tried joined strings as keys (e.g. "q a"=>["w"]), but they are
slower than the array keys, because more transformations are needed)

So, the morale of all this:
- don't use symbols if they have to be converted to a string often
- hash lookups might be slower than you think
- premature optimization...


Dominik


class MarkovChainer
   attr_reader :order
   def initialize(order)
     @order = order
     @beginnings = []
     @freq = {}
   end

   def add_text(text)
     # make sure each paragraph ends with some sentence terminator
     text.gsub!(/\n\s*\n/m, ".")
     text << "."
     seps = /([.!?])/
     sentence = ""
     text.split(seps).each { |p|
       if seps =~ p
         add_sentence(sentence, p)
         sentence = ""
       else
         sentence = p
       end
     }
   end

   def generate_sentence
     res = @beginnings[rand(@beginnings.size)]
     loop {
       unless nw = next_word_for(res[-order, order])
         return res[0..-2].join(" ") + res.last
       end
       res << nw
     }
   end

   private

   def add_sentence(str, terminator)
     words = str.scan(/[\w']+/)
     return unless words.size > order # ignore short sentences
     words << terminator
     buf = []
     words.each { |w|
       buf << w
       if buf.size == order + 1
         (@freq[buf[0..-2]] ||= []) << buf[-1]
         buf.shift
       end
     }
     @beginnings << words[0, order]
   end

   def next_word_for(words)
     arr = @freq[words]
     arr && arr[rand(arr.size)]
   end

end


if $0 == __FILE__
   order = 2
   n = 10
   while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
     if $1 == "o"
       order = Integer($2)
     else
       n = Integer($2)
     end
     ARGV.shift
   end
   mc = MarkovChainer.new(order)
   mc.add_text(ARGF.read)
   n.times {
     puts mc.generate_sentence
   }
end
6dab365a82517fb694650a57ee88e4a4?d=identicon&s=25 joey__ (Guest)
on 2006-04-10 20:54
class WordHash
  attr_accessor :words
  def initialize(words)
    @words = {}
    words.each_with_index{|x,i|self.add(x,words[i-1],words[i+1])}
  end

  def add(word,prev,nex)
    @words[word] ||= {}
    @words[word]['prev'] ||= Hash.new{|k,v|k[v]=0}
    @words[word]['prev'][prev] += 1
    @words[word]['next'] ||= Hash.new{|k,v|k[v]=0}
    @words[word]['next'][nex] += 1
  end
end
f =  ARGF.read.join.tr('"/*\\()[]\'', ' ').downcase
words = f.gsub(/[^'\w]/, ' \0 ').gsub(/\s+/, ' ').split(/\W/)
words = words.map{|x|x.strip}
w = WordHash.new(words)
@k = (a=w.words).keys[rand(a.size)]
wi = [@k.capitalize]
100.times do
  c=(a=w.words[@k]['next']).keys[rand(a.keys.size)]
  if c != ' '
  wi << c
  @k =  w.words.keys.find{|y|y==c}||rand(a.keys.size)
  else
    next
  end
end
puts wi.join(" ").gsub(/\s+/,' ')+'.'
0aca516785b12ea4ffd4fec43d69f586?d=identicon&s=25 Albert Vernon Smith (Guest)
on 2006-04-10 21:19
(Received via mailing list)
On 10.4.2006, at 14:05, James Edward Gray II wrote:

> On Apr 9, 2006, at 8:45 AM, Albert Vernon Smith wrote:
>
>> Coming from Perl, I often rely upon auto-vivification, so I needed
>> to figure out how to work around this.  Perhaps there are improved
>> ways of going about this, and I'd appreciate any feedback on how
>> to go about it better.
>
>
> Can you show an example using Perl's auto-vivification that feels
> funny in Ruby?  Perhaps we would have better ideas after seeing it...

Didn't feel funny, I just wasn't use to having to declare objects
(hashes and arrays) as I added them as branches on my initial hash.
I had to use a little more discipline, which probably is a good thing
coming from Perl-land.

-a
A9b6a93b860020caf9d2d1d58c32478f?d=identicon&s=25 Ross Bamford (Guest)
on 2006-04-10 21:22
(Received via mailing list)
On Tue, 2006-04-11 at 02:15 +0900, Dominik Bathon wrote:

> So, the morale of all this:
> - don't use symbols if they have to be converted to a string often
> - hash lookups might be slower than you think
> - premature optimization...

I used symbols in my solution, but primarily because I was trying to
make the hash itself require as little memory as possible to allow for
vast bodies of input (which I admit I've not really tried yet) and I
figured a hash on symbols would be smaller than a hash on strings.

I guess this means that GC is pushed by lots of string objects created
and destroyed during the generation run, but I tended to aim for memory
efficiency over speed for this one, as long as it was running 'fast
enough'.
4feed660d3728526797edeb4f0467384?d=identicon&s=25 Bill Kelly (Guest)
on 2006-04-10 21:47
(Received via mailing list)
Hi,

From: "Albert Vernon Smith" <smithav@cshl.edu>
>
> On 10.4.2006, at 14:05, James Edward Gray II wrote:
>
>> Can you show an example using Perl's auto-vivification that feels
>> funny in Ruby?  Perhaps we would have better ideas after seeing it...
>
> Didn't feel funny, I just wasn't use to having to declare objects
> (hashes and arrays) as I added them as branches on my initial hash.
> I had to use a little more discipline, which probably is a good thing
> coming from Perl-land.

Incidentally, if you ever need an auto-vivifying hash-of-hashes
(as opposed to mixture of hashes and arrays that is possible
because of Perl syntax), you can do the hash-of-hashes
auto-vivify in Ruby:

HashFactory = lambda { Hash.new {|h,k| h[k] = HashFactory.call} }

irb(main):181:0> x = HashFactory.call
=> {}
irb(main):182:0> x['abc']['def']['ghi'] = 123
=> 123
irb(main):183:0> x
=> {"abc"=>{"def"=>{"ghi"=>123}}}


Regards,

Bill
18ca239ffade6df0b839d26062f173fb?d=identicon&s=25 Dominik Bathon (Guest)
on 2006-04-11 00:02
(Received via mailing list)
On Mon, 10 Apr 2006 21:20:07 +0200, Ross Bamford
<rossrt@roscopeco.co.uk>
wrote:

> figured a hash on symbols would be smaller than a hash on strings.
Tha hash itself will have the same size, because it only stores VALUEs
(which are just longs). But each of the string VALUEs will "point" to
another "object" on the heap, while the symbol VALUEs don't have an
extra
"object" attached.

> I guess this means that GC is pushed by lots of string objects created
> and destroyed during the generation run, but I tended to aim for memory
> efficiency over speed for this one, as long as it was running 'fast
> enough'.

Yes, memory efficiency was another reason for me to first try the "hash
tree of symbols".
But on the other side: the string objects are created anyway (before
they
are converted to symbols). They can be collected after conversion to
symbol, but when you generate the sentence you are again creating many
new
string objects (Symbol#to_str generates a new string object every time,
ruby stores only c-strings for the symbols), which wouldn't be
necessary,
if you had kept the strings in the first place.
So, yes, symbols are more memory efficient for storing big frequency
hashes, but they are slower for generating, so it's a tradeoff.

By the way here are some numbers:

order  first   final
2      7.380s  1.973s
4      6.279s  2.002s
6      8.031s  1.972s

Those runs are for a 700K text file and 1000 sentences generated. I
didn't
measure the memory.

Dominik
This topic is locked and can not be replied to.