Fuzzy searching


#1

Hi, everyone,

Just wondering if someone had come up with a good way to do fuzzy
searches if you use MySQL as your database (we tried switching to
PostgreSQL, but that ended up adding even more problems). Ferret sounds
great, but reading through the discussion it looks like we need to solve
the problem of write conflicts. I just wrote a post in ruby-talk about
using KirbyBase maybe to solve this. Of course, I’m hoping someone has
come up with an even better solution! :wink:

Anyway, any advice would be appreciated.

Thanks!

Jen


#2

Here’s a simple fuzzy-search algorithm you can plug in to whatever you
decide to use.
Most of this code was stolen from php: http://us2.php.net/similar_text
(in comments section).
O(m*n), where m and n are lengths of the two target strings.
You may want to rewrite it to make it look more Rubyish.

this is the main algorithm.

compute longest common subsequence btn strings s1 and s2

def lcs_length(s1, s2)

m = [s1.length,128].min
n = [s2.length,128].min
x = 1 + [m,n].max

this table will be used to compute the LCS-Length,

only the first 128 chars per string are considered

lcs_tbl = []

initialize 2-dimensional array to 0

(0…x).each { || lcs_tbl << Array.new(x,0) }

(1…m).each { |i|
(1…n).each { |j|
if (s1[i-1] == s2[j-1])
lcs_tbl[i][j] = lcs_tbl[i-1][j-1] + 1
elsif (lcs_tbl[i-1][j] >= lcs_tbl[i][j-1])
lcs_tbl[i][j] = lcs_tbl[i-1][j]
else
lcs_tbl[i][j] = lcs_tbl[i][j-1]
end
}
}

return lcs_tbl[m][n]
end

def get_lcs(s1, s2, verbose=false)

ok, now replace all spaces and weird characters with ‘normal’

ascii-7
s1 = normalize_string(s1);
s2 = normalize_string(s2);

check for pathological cases

return 0 if [s1.length, s2.length].max < 1

puts “comparing strings #{s1} and #{s2}” if verbose
lcs = lcs_length(s1,s2) # compute longest common subsequence

average string length

ms = (s1.length + s2.length) / 2.0

(puts “ms: #{ms} lcs: #{lcs}”;puts) if verbose
return (lcs*100)/ms
end

code stolen from one of the ruby MLs.

def normalize_string(foo)
foo = foo.downcase.strip
foo.gsub!(/[Ã?ÁÃ?Ã?âäà ãáäåāÄ?Ä?Ç?Ç?ǡǻȁÈ?ȧẵặ]/,‘a’)
foo.gsub!(/[ëêéèẽÄ?Ä?Ä?ẻÈ?È?ẹȩÄ?á¸?á¸?ềếá»?á»?á¸?á¸?á»?ḝ]/,‘e’)
foo.gsub!(/[Ã?ÍÃ?ĨÏiìíîĩīĭïá»?ǐá»?įÈ?È?ḭɨḯ]/,‘i’)
foo.gsub!(/[Ã?Ã?Ã?Ã?Ã?òóôõōŏȯöỏÅ?Ç?ȍȏơǫọɵøá»?á»?á»?á»?ȱȫȭṍṏá¹?á¹?ờá»?ỡá»?ợǭá»?Ç¿]/,‘o’)
foo.gsub!(/[Ã?Ã?Ã?ŨÃ?ùúûũūŭüủůűÇ?È?È?ưụṳųṷṵṹṻÇ?Ç?Ç?Ç?Ç?ừứữửự]/,‘u’)
foo.gsub!(/[ỳýŷỹȳẏÿỷáº?ƴỵ]/,‘y’)
foo.gsub!(/[Å?]/,‘oe’)
foo.gsub!(/[Ã?ǼǢæ]/,‘ae’)
foo.gsub!(/[ñǹÅ?]/,‘n’)
foo.gsub!(/[Ã?ç]/,‘c’)
foo.gsub!(/[Ã?]/,‘b’)
foo.gsub!(/[Å?]/,‘oe’)
foo.gsub!(/[ij]/,‘ij’)
foo.gsub!(/[\s’"\/?.=+&%]/,’’)
foo.gsub!(/
+/,’_’)
foo
end

a little test

[
[“a”,“a”],[“ab”,“ab”],[“abc”,“abc”],
[“a”,“b”],[“ab”,“bb”],[“abc”,“abb”],
[“a”,“1”],[“ab”,"_b"],[“abcd”,“a cd”],
[“abcd”,“abcde”],[“abcd”,“abcdefghi”],[“abcde”,“abcdf”],
[“abcd”,“zbcd”],[“abcd”,“efghiabcd”],[“abcde”,“fabcd”],
[“abc”,“acb”],[“abcde”,“abzde”],[“abcde”,“abde”],
[“abc”,“ac”],[“abcde”,“ab1de”],[“abcde”,“ab de”]
].each { |e|
puts get_lcs(e[0],e[1],true)
puts
}


#3

On 11/26/05, Lou V. removed_email_address@domain.invalid wrote:

Here’s a simple fuzzy-search algorithm you can plug in to whatever you decide to use.
Most of this code was stolen from php: http://us2.php.net/similar_text (in comments section).
O(m*n), where m and n are lengths of the two target strings.
You may want to rewrite it to make it look more Rubyish.

I believe that Jen was looking for a solution for doing the fuzzy
search on the MySQL server itself.

Sincerely,

Tom L.
removed_email_address@domain.invalid
http://AllTom.com/


#4

MySQL does not have a fuzzy search capability, nor does the stored
procedure language have arrays, so fuzzy search embedded w/in
the MySQL database is not possible to do efficiently (although you
can fake arrays either with temp tables or gobs of string ops).
Which leaves you with downloading the data and performing the
match outside of the db. On the surface this appears inefficient,
because, after all, the data has to be moved to the client each
time a search is processed. But considering that the search itself
is O(m*n), data transfer is not the overriding concern (unless you
have an unbearably slow transfer speed).
In fact, many fuzzy search algorithms have a much worse running-time
order [O(n^3)], so pulling the data off the server so that a more
efficient algorithm can be used may even be faster than doing the
computation centrally on the server.


#5

I may try that algorithm Lou posted. I’m going to choose between that
and using Ferret to do this … David B. said that locking in
Ferret should be able to handle this without a server process except for
really high concurrency applications. Since I’m dealing with less than
10,000 records, I don’t think this will be a big issue.

Thanks!

Jen


#6

Scott W. wrote:

Not sure how easy it would be to fix, or if it even needs fixing. If
anyone wants a simple example, download:
http://www.butlerpress.com/sti.zip

What about explicitly using the ‘model :x’ statements?

–J


#7

I ran across some odd behavior with multiple subclasses of
ActiveRecord. Sometimes find() and find_all() would find subclass
records and sometimes not, depending on whether a subclass had been
loaded or not. I can imagine why this happens, and how it might be
difficult to fix, but it was surprising and took me a while to figure
out.

Here’s a concrete example. Say I have three classes that inherit from
ActiveRecord, and I have one instance of each in my ‘animals’ table:
Dog < Mammal < Animal < ActiveRecord::Base

Run these command in order and you get:
Mammal.count() = 1
Dog.count() = 1
Mammal.count() = 2!

Here’s the SQL:
SHOW FIELDS FROM animals
SELECT COUNT() FROM animals WHERE ( (animals.type = ‘Mammal’ ) )
SHOW FIELDS FROM animals
SELECT COUNT(
) FROM animals WHERE ( (animals.type = ‘Dog’ ) )
SELECT COUNT(*) FROM animals WHERE ( (animals.type = ‘Mammal’ OR
animals.type = ‘Dog’ ) )

Sure looks like Mammal/Rails doesn’t know about Dog until I use Dog.
There’s a workaround:
Dog
Mammal.count() = 2
Dog.count() = 1
Mammal.count() = 2

Not sure how easy it would be to fix, or if it even needs fixing. If
anyone wants a simple example, download:
http://www.butlerpress.com/sti.zip

Scott