Metaphone analysis

Not sure how much this will interest people but I don’t have a blog so
I’m
posting something I threw together today cause I think it might be
useful.

In what little free time I have I’ve been wanting to put together a
Rails/Ferret based restful dictionary. So I finally got a chance to get
started today so the first thing I wanted to do was implement a
metaphone
analyzer and filter.

Some links for more info on the metaphone algorithms:

The jist of it is that it breaks words down into its phonetic parts. For
example, the word ‘cool’ and ‘kewl’ both become ‘KL’ in the
double-metaphone
algorithm. Indexing dictionary words in this manner is almost essential
so
that users can find the proper spelling of a word by spelling it how it
sounds.

The first thing I did was create a MetaphoneFilter class that would run
the
metaphone algorithm over a token stream. It’s a fairly simple class, but
does
require the ‘Text’ gem be installed.

require ‘ferret’
require ‘text’

module Curtis
module Analysis

TODO write tests!

class MetaphoneFilter < Ferret::Analysis::TokenStream
def initialize(token_stream, version = :double)
@input = token_stream
@version = version
end

def next
t = @input.next

return nil if t.nil?

t.text = @version.eql?(:double) ?
  Text::Metaphone.double_metaphone(t.text) :
  Text::Metaphone.metaphone(t.text)

end
end

end
end

Second I created a MetaphoneAnalyzer class that would use the
MetaphoneFilter
created above. The MetaphoneAnalyzer also makes use of the StemFilter so
that
words like “eat” and “eating” both equal to “eat”.

require ‘ferret’

TODO write tests

module Curtis
module Analysis

class MetaphoneAnalyzer < Ferret::Analysis::Analyzer
include Ferret::Analysis

def initialize(version = :double, stop_words = ENGLISH_STOP_WORDS)
@stop_words = stop_words
@version = version
end

def token_stream(field, str)

MetaphoneFilter.new(StemFilter.new(StopFilter.new(LowerCaseFilter.new(StandardTokenizer.new(str)),
@stop_words)), @version)
end
end

end
end

I saved both of these files, ‘metaphone_filter.rb’ and
‘metaphone_analyzer.rb’
to RAILS_ROOT/extras. Next I added the following line to my
‘config/environments.rb’ file:
config.load_paths += %W{ #{RAILS_ROOT}/extras }

after that i fired up script/console to test it all out:

require ‘metaphone_analyzer’
=> true
include Curtis::Analysis
=> Object
ts = MetaphoneAnalyzer.new.token_stream(nil, “the quick brown fox jumped
over the lazy dog”)
=> $Curtis::Analysis::Metaphonefilter......@version=:double
while token = ts.next
p token
end
[“KK”, nil]
[“PRN”, nil]
[“FKS”, nil]
[“JMP”, “AMP”]
[“AFR”, nil]
[“LS”, nil]
[“TK”, nil]
=> nil

As you can see it has been metaphoned. Now if someone were to search but
inadvertently type ‘qwick’ instead of ‘quick’ it would still match
because
‘qwick’ metaphoned also becomes ‘KK’.

Still a lot to do, such as test it with AAF, and see how it interacts
with
using slop (which measures the Levenshtein distance,
Levenstein - Wikipedia, between two terms) so that I
can
put in a “Did you mean xxx” feature (where xxx is a list of terms within
a
certain distance of the original query). Plus many other ideas also,
such as
thesaurus searching.

Hopefully this has been informative. Wanted to show how to create new
Analyzers and Filters for anyone who was curious (I know I was until
today),
as well as give a general idea for how I’m going to put them to use.

I’d be happy to hear any questions or comments on the above.

Oh, one last thing… the MetaphoneAnalyzer and MetaphoneFilter default
to the
double-metaphone algorithm. Just pass pass nil (or anything other
than :double when constructing the analyzer to use just the metaphone
algorithm.

Thanks,
Curtis

On 25.11.2006, at 22:38, Curtis Hatter wrote:

As you can see it has been metaphoned. Now if someone were to
search but
inadvertently type ‘qwick’ instead of ‘quick’ it would still match
because
‘qwick’ metaphoned also becomes ‘KK’.

You can achieve almost the same result using Ferret’s built-in
FuzzyQuery. It works even better for misspellings than phonetic
algorithms, and it’s language-neutral.

Consider:

i = Ferret::I.new
i << “the quick brown fox”

i.search(“quikc~”).total_hits
=> 1
i.search(“qwick~”).total_hits
=> 1

Whereas metaphone yields:

Text::Metaphone.double_metaphone(“quick”)
=> [“KK”, nil]
Text::Metaphone.double_metaphone(“quikc”)
=> [“KKK”, nil]

Cheers,
Andy

On Saturday 25 November 2006 17:21, Andreas K. wrote:

Whereas metaphone yields:

Text::Metaphone.double_metaphone(“quick”)
=> [“KK”, nil]
Text::Metaphone.double_metaphone(“quikc”)
=> [“KKK”, nil]

I’m looking at trying to use both. My reason:

i = Ferret::I.new
i << “The quick brown fox”

i.search(“qwik~”).total_hits
=> 0

Where as double metaphoning “quick” or “qwik” both become “KK”.

What I’m thinking might be a good solution is to index the word and it’s
double-metaphone equivalent. Then search for exact hits against the
metaphone
and fuzzy hits against the word field. Then sort based on score, with
hopefully exact matches being 100.

Still investigating the best solutions. Thanks for the ideas,

Curtis

Curtis

Yep. In the same way as ‘bag’, ‘pack’, ‘back’, ‘poke’ and ‘pike’ all
become ‘PK’. I think the accurracy of this particular phonetic
algorithm is disputable.

true… and had I not been introduced to guitar hero 2 this weekend I
think I
might have realized that myself. I haven’t good success with the Soundex
algorithm either. Metaphone seemed good at first but I think you’ve
convinced
me otherwise. ‘Pike’ and ‘bag’ should not be the same.

i << “quick”

quikc 0.88
quick 0.53
quake 0.53
quid 0.44
quack 0.35
quark 0.35
quiche 0.35

As you can see, the exact match ranks highest.

I think I’ll try this approach first and add in a phonetic algorithm if
necessary.

At least I discovered how to write filters for Ferret, which was much
easier
that I would have imagined.

Thanks for the information, is nice to learn a bit more about the things
Ferret can already do so well.

Curtis

On 27.11.2006, at 04:26, Curtis Hatter wrote:

Yep. In the same way as ‘bag’, ‘pack’, ‘back’, ‘poke’ and ‘pike’ all
become ‘PK’. I think the accurracy of this particular phonetic
algorithm is disputable.

true… and had I not been introduced to guitar hero 2 this weekend
I think I
might have realized that myself.

Sounds like a lot of FN.

I think I’ll try this approach first and add in a phonetic
algorithm if
necessary.

It depends on what you want to achieve. If you want to compensate for
typos, FuzzyQuery is probably better. A simple letter-swap can easily
trick a phonetic algorithm:

metaphone => MTFN
metahpone => MTPN

FuzzyQuery will catch this even at a relatively low sensitivity of
0.75 (‘metahpone~0.75’)

I used FuzzyQuery to build a doublet detection into my Rails app.
When a user creates a new Person record, a fuzzy search is run in the
background as the form fields are filled out. Possible doublets are
displayed next to the form in a “Did you mean…” fashion.

For example, if the user enters “Rachel Welsh”, the doublet detection
would find “Raquel Welch”. Before Ferret I tried to achieve this with
MySQL’s SOUNDEX function, which didn’t work quite as well. (Although
SOUNDEX, which is based on the algorithm of the same name, still
works way better than metaphone.)

At least I discovered how to write filters for Ferret, which was
much easier
that I would have imagined.

Yep. It’s great that Ferret can be extended and customized in so many
ways.

Thanks for the information, is nice to learn a bit more about the
things
Ferret can already do so well.

Ferret is the single best Ruby library I’ve come across in the past
two years. It just rocks. Period. Thanks David for giving it to us!

Cheers,
Andreas

On 26.11.2006, at 22:35, Curtis Hatter wrote:

Text::Metaphone.double_metaphone(“quick”)
i.search(“qwik~”).total_hits
=> 0

Which is OK I guess, since ‘qwik’ and ‘quick’ are quite different.
Still, you can adjust the tolerance of FuzzyQuery if desired:

i.search(“qwik~0.4”).total_hits
=> 1

Where as double metaphoning “quick” or “qwik” both become “KK”.

Yep. In the same way as ‘bag’, ‘pack’, ‘back’, ‘poke’ and ‘pike’ all
become ‘PK’. I think the accurracy of this particular phonetic
algorithm is disputable.

What I’m thinking might be a good solution is to index the word and
it’s
double-metaphone equivalent. Then search for exact hits against the
metaphone
and fuzzy hits against the word field. Then sort based on score, with
hopefully exact matches being 100.

You should in any case index the actual terms, because the metaphones
alone would make exact matches impossible.

If you use FuzzySearch, you don’t need an extra field and you
autmatically get a score based on how close the match is.

Example:

i = Ferret::I.new

i << “quick”
i << “quikc”
i << “quack”
i << “quake”
i << “quark”
i << “quid”
i << “quiche”

i.search_each(“quikc~0.3”) do |doc, score|
printf “%6s %1.2f\n”, i[doc][:id], score
end

quikc 0.88
quick 0.53
quake 0.53
quid 0.44
quack 0.35
quark 0.35
quiche 0.35

As you can see, the exact match ranks highest.


Andy