Detecting similar strings

dmarkow · August 1, 2006, 8:25am

Is there a way to take two strings, and decide if they are “similar.”
I’m creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.

Is there a way to check if 2 strings are “identical” up to a certain
percentage, such as only having 1 or 2 characters different?

dmarkow · August 1, 2006, 8:45am

On 01/08/06, Dylan M. [email protected] wrote:

Is there a way to take two strings, and decide if they are “similar.”
I’m creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.

Is there a way to check if 2 strings are “identical” up to a certain
percentage, such as only having 1 or 2 characters different?

Here’s a Perl module that does something similar. You might try
porting it over to Ruby.

Farrel

dmarkow · August 1, 2006, 8:48am

On 01/08/06, Farrel L. [email protected] wrote:

porting it over to Ruby.
String::Similarity - calculate the similarity of two strings - metacpan.org

Farrel

And here is a nice description of the Levenstein Distance:

Farrel

dmarkow · August 1, 2006, 9:24am

sender: “Dylan M.” date: “Tue, Aug 01, 2006 at 03:25:59PM +0900” <<<EOQ
Is there a way to take two strings, and decide if they are “similar.”
I’m creating a contact system in Rails, and am having a large problem
with my users punching in duplicate entries with the last names spelled
slightly different.

Is there a way to check if 2 strings are “identical” up to a certain
percentage, such as only having 1 or 2 characters different?
Looks like there is a soundex implementation for Ruby:
http://raa.ruby-lang.org/search.rhtml?search=soundex

If by any chance you are using MySQL you could use the soundex function
builtin into it as well.

Good luck,
Alex

dmarkow · August 1, 2006, 9:37am

Looks like there is a soundex implementation for Ruby:
http://raa.ruby-lang.org/search.rhtml?search=soundex

If by any chance you are using MySQL you could use the soundex function
builtin into it as well.

If you check on Google under “ruby soundex module”, you’ll find a few
likely hits:
ruby soundex module - Google Search

I remember a friend saying that there are superior algorithms to soundex
that are no more complex, so you might want to explore about a little.

dmarkow · August 1, 2006, 10:35am

On 01/08/06, Farrel L. [email protected] wrote:

And here is a nice description of the Levenstein Distance:
http://en.wikipedia.org/wiki/Levenshtein_distance

You know, there are lots of implementations there, but the Ruby one
seems to be missing [There’s no reason to restrict it to working on
strings. If you duck, it’ll work just as nicely on arrays of what have
you.]

dmarkow · August 1, 2006, 10:23am

On 8/1/06, [email protected] [email protected] wrote:

Looks like there is a soundex implementation for Ruby:
that are no more complex, so you might want to explore about a little.

If the names might be from different languages I would rather use
Levenstein than soundex. Levenstein is probably good at describing
typos as “very close” but soundex might be somewhat language specific.

But I haven’t tried so I might be wrong.

Thanks

Michal

dmarkow · August 1, 2006, 3:56pm

Here is an implementation of levenstein distance. It seems to me that
there is another in ruby gems somewhere but I don’t remember where.

Determine Levenshtein distance of two strings

def Ld(s,t)

#s string #1
#t string #2

n = s.size
m = t.size
a = Array.new

if n != 0 && m != 0

    #2 create array
    r = Array.new
    rz = Array.new

    0.upto(m) {|x| r.push(0)}

    0.upto(n) {|x|a.push(r.dup)}
    a.each_index {|x| a[x][0] = x}
    0.upto(m) {|x| a[0][x] = x}

#a.each {|x| p x}

    cost =  0
    1.upto(n) do |i|
        1.upto(m) do |j|
            if s[i] == t[j]
                cost =0
            else
                cost = 1
            end
            a[i][j] = [a[ i- 1][j] +1,a[i][j - 1] + 1,a[i - 1][j -

1] + cost].min
end
end
a[n][m]
#a.each {|x| p x}
else
0
end
end

dmarkow · August 1, 2006, 3:56pm

Dylan M. wrote:

Is there a way to take two strings, and decide if they are “similar.”
I’m creating a contact system in Rails, and am having a large problem

Interesting your name is almost “Markov” ;-}

Anyway, besides algorithms mentioned here, I have notes on Double
Metaphone, NYSIIS, Phonex. For sequence analysis in general, google
McIlroy-Hunt, Ratcliff/Obershelp:

http://docs.python.org/lib/module-difflib.htmlhttp://docs.python.org/lib/module-difflib.html

dmarkow · August 1, 2006, 4:06pm

Gene T. wrote:

Dylan M. wrote:

Is there a way to take two strings, and decide if they are “similar.”

http://homepages.cs.ncl.ac.uk/brian.randell/Genealogy/NameMatching.pdf