Not sure how much this will interest people, but I don't have a blog so I'm
posting something I threw together today because I think it might be useful.
In what little free time I have, I've been wanting to put together a
Rails/Ferret-based RESTful dictionary. I finally got a chance to get
started today, and the first thing I wanted to do was implement a
metaphone analyzer and filter.
Some links for more info on the metaphone algorithms:
The gist of it is that it breaks a word down into its phonetic parts. For
example, the words 'cool' and 'kewl' both become 'KL' under the
double-metaphone algorithm. Indexing dictionary words in this manner is
almost essential so that users can find the proper spelling of a word by
spelling it how it sounds.
The first thing I did was create a MetaphoneFilter class that runs the
metaphone algorithm over a token stream. It's a fairly simple class, but it
does require the 'Text' gem to be installed.
require 'ferret'
require 'text'

module Curtis
  module Analysis
    # TODO: write tests!
    class MetaphoneFilter < Ferret::Analysis::TokenStream
      def initialize(token_stream, version = :double)
        @input = token_stream
        @version = version
      end

      def next
        t = @input.next
        return nil if t.nil?
        # double_metaphone returns a [primary, secondary] pair,
        # metaphone a single code
        t.text = @version.eql?(:double) ?
          Text::Metaphone.double_metaphone(t.text) :
          Text::Metaphone.metaphone(t.text)
        t # return the token itself, not the result of the assignment
      end
    end
  end
end
Second, I created a MetaphoneAnalyzer class that uses the MetaphoneFilter
created above. The MetaphoneAnalyzer also makes use of the StemFilter, so
that words like "eat" and "eating" both reduce to "eat".
require 'ferret'
require 'metaphone_filter'

# TODO: write tests!
module Curtis
  module Analysis
    class MetaphoneAnalyzer < Ferret::Analysis::Analyzer
      include Ferret::Analysis

      def initialize(version = :double, stop_words = ENGLISH_STOP_WORDS)
        @stop_words = stop_words
        @version = version
      end

      def token_stream(field, str)
        MetaphoneFilter.new(
          StemFilter.new(
            StopFilter.new(
              LowerCaseFilter.new(StandardTokenizer.new(str)),
              @stop_words)),
          @version)
      end
    end
  end
end
I saved both of these files, 'metaphone_filter.rb' and
'metaphone_analyzer.rb', to RAILS_ROOT/extras. Next I added the following
line to my 'config/environment.rb' file:
config.load_paths += %W{ #{RAILS_ROOT}/extras }
After that I fired up script/console to test it all out:

>> require 'metaphone_analyzer'
=> true
>> include Curtis::Analysis
=> Object
>> ts = MetaphoneAnalyzer.new.token_stream(nil, "the quick brown fox jumped over the lazy dog")
=> #<Curtis::Analysis::MetaphoneFilter ... @version=:double>
>> while token = ts.next
>>   p token
>> end
[“KK”, nil]
[“PRN”, nil]
[“FKS”, nil]
[“JMP”, “AMP”]
[“AFR”, nil]
[“LS”, nil]
[“TK”, nil]
=> nil
As you can see, everything has been metaphoned. Now if someone searches but
inadvertently types 'qwick' instead of 'quick', it will still match,
because 'qwick' metaphoned also becomes 'KK'.
There's still a lot to do, such as testing it with AAF (acts_as_ferret),
and seeing how it interacts with using slop (which measures the
Levenshtein distance between two terms) so that I can put in a "Did you
mean xxx" feature (where xxx is a list of terms within a certain distance
of the original query). Plus many other ideas as well, such as thesaurus
searching.
Hopefully this has been informative. I wanted to show how to create new
Analyzers and Filters for anyone who was curious (I know I was until
today), as well as give a general idea of how I'm going to put them to use.
I'd be happy to hear any questions or comments on the above.
Oh, one last thing… the MetaphoneAnalyzer and MetaphoneFilter default to
the double-metaphone algorithm. Just pass nil (or anything other than
:double) when constructing the analyzer to use the plain metaphone
algorithm.
Thanks,
Curtis