Fuzzy Searching a Database

How would you recommend implementing something like a fuzzy search with
Ruby and a SQL database. For example, I want to search a field in a
table for anything that is similar to “Frank Smithe”. I would want an
array of matches with some sort of score telling me how good of a match
the item received (“frank smithe” 100%, “frank smith” 90%).

Am I making sense? Is there a third party app I would need?

Thanks,
Tom

At 3:37 AM +0900 12/5/05, TomRossi7 wrote:

How would you recommend implementing something like a fuzzy search with
Ruby and a SQL database. For example, I want to search a field in a
table for anything that is similar to “Frank Smithe”. I would want an
array of matches with some sort of score telling me how good of a match
the item received (“frank smithe” 100%, “frank smith” 90%).

Soundex is a phonetic algorithm for indexing names by their sound when
pronounced in English. The basic aim is for names with the same
pronunciation to be encoded to the same string so that matching can
occur
despite minor differences in spelling. Soundex is the most widely
known of
all phonetic algorithms and is often used (incorrectly) as a synonym
for
“phonetic algorithm”.

-- http://en.wikipedia.org/wiki/Soundex

If you process your database’s names into a column of Soundex encodings,
you should be able to search this for an encoded target word. You could
then post-process the results to get the “score” for each one. Note
that
you may want to look around for alternative encodings (eg, Metaphone).

-r

Just make Kit screen all of the queries. Get her a blackberry and email
them to her.

this was discussed on both this ML and the Rails ML about a week ago.
The
threads have the word ‘fuzzy’ in the subject.

On Dec 4, 2005, at 10:58 AM, Rich M. wrote:

If you process your database’s names into a column of Soundex
encodings,
you should be able to search this for an encoded target word.

Also worth mentioning is that many database systems have Soundex
support built-in.

–Steve

Lou V. wrote:

Am I making sense? Is there a third party app I would need?

Thanks,
Tom

I needed to do something similar with my office database. I used the
Levenshtein distance. The routine was also
quite useful in finding name differences between my office data base and
the billing company’s.
#------------------------------------------------------

Determine Levenshtein distance of two strings

def Ld(s,t)

#s string #1
#t string #2

n = s.size
m = t.size
a = Array.new

if n != 0 && m != 0

    #2 create array
    r = Array.new
    rz = Array.new

    0.upto(m) {|x| r.push(0)}

    0.upto(n) {|x|a.push(r.dup)}
    a.each_index {|x| a[x][0] = x}
    0.upto(m) {|x| a[0][x] = x}

    cost =  0
    1.upto(n) do |i|
        1.upto(m) do |j|
            if s[i] == t[j]
                cost =0
            else
                cost = 1
            end
            a[i][j] = [a[ i- 1][j] +1,a[i][j - 1] + 1,a[i - 1][j -

1] + cost].min
end
end
a[n][m]
else
0
end
end

/How would you recommend implementing something like a fuzzy search with
Ruby and a SQL database. For example, I want to search a field in a
table for anything that is similar to “Frank Smithe”. I would want an
array of matches with some sort of score telling me how good of a match
the item received (“frank smithe” 100%, “frank smith” 90%)./

Less direct, but maybe useful…
The pyLucene project is the apache java lucene library implemented in C
and SWIG’d to python (i think).
You might be able to create ruby bindings for the full text search C
library used there.

http://pylucene.osafoundation.org/

http://lucene.apache.org/java/docs

check http://ferret.davebalmain.com/api/files/TUTORIAL.html

Ferret is a Lucene port for Ruby.

Monday 05 December 2005 11:10ã?Ben Stroud ã?ã??はæ?¸ãã¾ã?ã?:

see fulltext searching at mysql.com.

from http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html

The following example is more complex. The query returns the relevance
values and it also sorts the rows in order of decreasing relevance. To
achieve this result, you should specify MATCH() twice: once in the
SELECT list and once in the WHERE clause. This causes no additional
overhead, because the MySQL optimizer notices that the two MATCH()
calls are identical and invokes the full-text search code only once.

mysql> SELECT id, body, MATCH (title,body) AGAINST
-> (‘Security implications of running MySQL as root’) AS score
-> FROM articles WHERE MATCH (title,body) AGAINST
-> (‘Security implications of running MySQL as root’);
±—±------------------------------------
±----------------+
| id | body |
score |
±—±------------------------------------
±----------------+
| 4 | 1. Never run mysqld as root. 2. … |
1.5219271183014 |
| 6 | When configured properly, MySQL … | 1.3114095926285 |
±—±------------------------------------
±----------------+
2 rows in set (0.00 sec)

http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html