Buiding anamgrams

andrea · December 26, 2008, 10:36am

I’m studying ruby (and I really love it even more than python) and I
want to do a little exercise.

I want to create an efficient module to generate every possible
anagrams of a word, of any given size…

I already done this in python, I want to see how can it be elegant in
ruby…
I need a good permutation algorithm (this

works fine) and I’d like to use enumerators in the best and most
efficient way.

I also looked at callcc which is great for backtracking, could be used
in this case??
thanks for any help

A basic structure like that could be fine?
class Anagrams
attr :word
include Enumerable
include Comparable

def <=>(other)
self.word <=> other.word
end

def each(&block)
@word.chars
end

def initialize(word)
@word = word
end

end

andrea · December 26, 2008, 11:50am

On 26.12.2008 10:30, andrea wrote:

I’m studying ruby (and I really love it even more than python) and I
want to do a little exercise.

Welcome to the pack! What a nice Christmas occupation.

I want to create an efficient module to generate every possible
anagrams of a word, of any given size…

Do you really mean “anagrams” or rather “permutations”? To decide on
anagrams you likely need some kind of dictionary which is significantly
more difficult than just permuting.

I already done this in python, I want to see how can it be elegant in
ruby…
I need a good permutation algorithm (this
Johnson-Trotter Algorithm: Listing All Permutations
works fine) and I’d like to use enumerators in the best and most
efficient way.

From 1.8.7 on you can do this

irb(main):009:0> “abc”.scan(/./).permutation {|x| p x.join}
“abc”
“acb”
“bac”
“bca”
“cab”
“cba”
=> [“a”, “b”, “c”]
irb(main):010:0> “abc”.chars.to_a.permutation {|x| p x.join}
“abc”
“acb”
“bac”
“bca”
“cab”
“cba”
=> [“a”, “b”, “c”]
irb(main):011:0>

Now you only need a filter that properly decides which of those is a
proper word and use that to decide which words to print and which not.

I also looked at callcc which is great for backtracking, could be used
in this case??

I would try to avoid it as it makes code very hard to read - at least
for me.

end

def each(&block)
@word.chars
end

def initialize(word)
@word = word
end

end

What exactly should be the responsibility of this class? Especially,
why do you call it “anagrams” but store only one word in it?

Cheers

robert

andrea · December 26, 2008, 12:22pm

That’s it. In ruby 1.8.6 you cant use this form , only for 1.8.7 and
prior
…
I tested here and the code works perfectly.
About your class:
class Anagrams

end

def initialize(word)
@word = word
end

end
i really dont understant what it means. Explain better.
King regards
tiago

On Fri, 26 Dec 2008 19:50:00 +0900, Robert K.

andrea · December 27, 2008, 10:01am

Thanks a lot everyone, now I already posted but don’t know why it
doesn’t appear my message.
Anyway you’re right Axel, this is the first “solution” to the problem
but it’s really too slow for any reasonable large size of the string.

This is the code, any hint about style or whatever is welcome:
http://pastie.textmate.org/private/rp4ivw36vefuz07twjlw3g

I’ll have a look to the Markov Chain (which I’m studying right now at
uni).

andrea · December 27, 2008, 1:27am

Dear Andrea,

in this case??
Permutations become quite unwieldy very quickly … maybe you’d have a
look
at Hidden Markov Models instead. There are several algorithms associated
to these models. I think what you’d want to do is to apply the “forward
algorithm”
to estimate how probable a given word is, given the probability of
letter x2 succeeding
letter x1 in the language you are trying to form anagrams in ( …
download some novels
from project Gutenberg, split them into pairs of letters and count…)
.

For the forward algorithm, look at the concrete example here:

There once was a Ruby quiz about Markov chains to help you with Ruby:

http://www.rubyquiz.com/quiz74.html

To test whether a given word exists in language X, you could use raspell
in conjunction with
aspell:

http://blog.evanweaver.com/files/doc/fauna/raspell/files/README.html

Best regards,

Axel

andrea · December 27, 2008, 7:47pm

On Dec 27, 2008, at 2:59 AM, andrea wrote:

Thanks a lot everyone, now I already posted but don’t know why it
doesn’t appear my message.
Anyway you’re right Axel, this is the first “solution” to the problem
but it’s really too slow for any reasonable large size of the string.

This is the code, any hint about style or whatever is welcome:
http://pastie.textmate.org/private/rp4ivw36vefuz07twjlw3g

I think a faster approach, particularly to find multiple anagrams, is
to create a signature or hash with the property that each word in a
set of anagrams will have the same hash code. Finding anagrams then
becomes a problem of building lists of words with the hash hash code.

A convenient hash is simply to take all the letters in each work and
sort them, so that “dog” becomes “dgo” and “god” also becomes “dgo”.
Because they have the same hash, they are anagrams.

Here’s the code from the PickAxe[1] that uses this to find anagrams:

http://pastie.textmate.org/private/yttznlgzcf9czdkc7vxea

and here are some simple tests:

http://pastie.textmate.org/private/stv3sxaecc6z8uwxclldba

In fact, I use this as an example of Gem packaging, so you can
actually do

gem install anagram

and get the whole thing.

Cheers

Dave

[1] http://pragprog.com/titles/ruby3

andrea · December 28, 2008, 2:14am

andrea wrote:

uni).

Just for fun/reference, this is what I came up with after reading your
post earlier. Note that it doesn’t do the “of any length” requirement,
though.

require ‘rubygems’
require ‘raspell’

def anagrams word
checker = Aspell.new
word.split(//).permutation.select { |w| checker.check w.join
}.map { |w| w.join }
end

puts anagrams “words”

-Justin

andrea · January 2, 2009, 9:15am

Thanks to everyone for your very nice answers, now I wrote another
version of the program.

http://pastie.textmate.org/private/ryxlzlqle8txaorswiynea

Which implments all 3 methods discussed here, the nicest of course is
the third.

It creates signature only if really needed, otherwise it just loads
the dict with Marshall. (other ways?)
Is correct the use of md5?? I think it is as it gives the same result
of the openssl md5, but it looks a little bit too fast for such a big
file.
dict_md5 = Digest::MD5.hexdigest(open(@dict).read)
conf_md5 = open($CONF).read

then they must be equal and I have the correct

signatures for my dictionary

if dict_md5 == conf_md5
puts “we already have the right signatures” if $DEBUG

loading the precomputed signature file

@signatures = Marshal.load(open($SIGNATURE))
else

It’s not bad at all, the only problem is that with a huge dictionary
it takes a long time to load it (around 38seconds for a almost 1-
million word dictionary) and takes a big amount of ram.

Could I maybe load it incrementally? Maybe divide the signs by size
and load just what I need to check?
Is it possible using only one file?

Thanks a lot