Shirt Reader (#140)

bbazzarrakk · September 27, 2007, 2:44pm

This quiz is surprisingly challenging. Our brain is capable of making
some
pretty complex associations between sounds to make sense of these
puzzles quite
quickly. To code that up is a non-trivial task.

All of the solutions used some variant of the Metaphone algorithm to
match words
with similar sounds. The solution from steve d is probably the most
complete.
It pulls from a pronunciation dictionary, uses the afore mentioned
Metaphone
algorithm, refines matches using a Levenshtein edit distance algorithm,
and even
inlines some C for speed. I do recommend everyone have a look at that
code.

I’m going to go through one of Eugene K.'s solutions in this
summary
though. It gets pretty good answers in the spot checking I did, it’s
performance isn’t too bad, and it’s a bit smaller than steve’s code. It
starts
by pulling in some external resources:

require ‘rubygems’
require ‘text’
include Text::Metaphone
include Text::Levenshtein
load ‘expectations.rb’

…

Most of this code is about loading the Text gem. For those not familiar
with
it, Text includes a slew of algorithms useful in processing text. This
code
will use the Metaphone and the improved Double Metaphone algorithms to
compare
words by sound. It also uses the Levenshtein distance algorithm to see
how far
apart two possible words are.

The expectations.rb file just fills a global Hash with some test cases.

Another thing most of the solutions did was to perform some translation
of the
provided words to make them more likely to match sounds. One particular
problem
point seemed to be “ee” sounds at the end of words, typically ending in
y.
Performing some phonetic substitutions can increase the matching
accuracy in
these cases. Here’s Eugene’s code for that:

…

subs={‘1’=>‘wan’,‘2’=>‘to’,‘3’=>‘tre’,‘4’=>‘for’,‘5’=>‘five’,‘6’=>‘six’,
‘7’=>‘seven’,‘8’=>‘ate’,‘9’=>‘nine’,‘10’=>‘ten’,
‘c’=>‘see’,‘h’=>‘eich’,‘j’=>‘jey’,‘k’=>‘key’,‘q’=>‘que’,‘r’=>‘ar’}
subsy={}
%w[b c d g p t v z].each {|l| subsy[l]=l+‘y’}
%w[b c d g p t v z].each {|l| subs[l]=l+‘ee’}
%w[f l m n s x].each{|l| subs[l]=‘e’+l}

…

Here we see Hashes constructed to provide the matching of numbers and
improve
the matching of certain letters. Note that these are just used to match
standalone chunks of the input. Substitutions will not be made inside
words.

Next we have some wrappers over the algorithms provided in the Text gem:

…

def metadist(str1,str2)
2*distance(metaphone(str1),metaphone(str2))+
distance(str1,str2)
end

def short_double_metaphone(word)
m1,m2=double_metaphone(word)
[m1[0,2],m2 ? m2[0,2] : nil]
end

…

The first method is an enhanced distance function for checking how
similar two
words are. It works by applying the sound algorithm to each words then
building
and edit distance between those and adding that to the normal edit
distance for
the words. The sound distance is weighted double in the overall result.

The second wrapper is for building a word index using a Double Metaphone
algorithm. It returns shortened versions of the provides sounds for use
as Hash
keys. The code that builds that Hash is what we want to look at next:

…

hash=Hash.new{|h,k|h[k]=[]}

File.open("/usr/share/dict/words") {|f| f.readlines}.each do |w|
word=w.downcase.delete("^a-z")
m1,m2=short_double_metaphone(word)
hash[m1]<<word
hash[m2]<<word if m2
end
$expectations.values.each { |word|
m1,m2=short_double_metaphone(word)
hash[m1]<<word
hash[m2]<<word if m2
}

hash.each_key{|k| hash[k].uniq!}

…

This code just loops over the built-in dictionary (on Unix), building
sound keys
for each word and populating the Hash. Note that the File loop can be
simplified to File.foreach(…) { |w| … }.

One point of interest here is that the code loops over test cases adding
the
solution words to the dictionary. This raises a good point: the
quality of the
dictionary will have a big impact on your results. The built-in
dictionary is
convenient to test with, but you probably want to trade up to a better
word list
when quality really counts.

The input handling code is trivial:

…

inputs=[]
if (ARGV.empty?)
inputs=$expectations.keys
else
inputs << ARGV
end

…

This just supports the quiz interface, or defaults to running the
included test
cases.

All we have left is the application code:

inputs.each { |rebus|
y_ed=rebus[0…-2]<<(subsy[rebus[-1]] || rebus[-1])
word=y_ed.map{|w| subs[w] || w }.join.downcase.gsub(/[^a-z0-9]/,’’)
m1,m2=short_double_metaphone(word)
results=hash[m1]
results+=hash[m2] if m2 && m2!=m1
res=results.uniq.sort_by{|a| [metadist(word,a),a.length]}.first(5)
print “’#{rebus.join(’ ‘)}’ => #{res[0]}”
expected=$expectations[rebus]
print “, expected ‘#{expected}’ is at position
#{res.index(expected)}”
if expected
puts
}

The process here isn’t too hard to follow. The inputs are combined into
a list,
with the last element undergoing the “ee” sound substitutions we
discussed
earlier. After that, all elements go through the general letter and
number
substitutions and get combined into a word.

The sound Hash keys are pulled for the resulting word and used to lookup
a
result set of possible matches. That result set is then ordered by the
enhanced
distance function and and pruned to the top five matches.

The rest of the code just handles the printing.

My thanks to all who managed to get surprisingly accurate results for a
tough
problem. As always, you’ve taught me new tricks.

It’s pretty likely that tomorrow’s problem will center around
probability…

bbazzarrakk · September 27, 2007, 4:19pm

It’s pretty likely that tomorrow’s problem will center around probability…
Nice 1
Robert