[OT] finding fuzzy duplicate data

I have a Rails app with a contact list that needs to interface with
Outlook and at least one other external data source. My fear is that as
my client throws CSVs at my Web app, I will have subtly different items
referring to the same contact.

I’ve come up with several alternate approaches to this but thought I’d
ask if anyone else has already faced this problem. FWIW, here were two
approaches I felt might work:

  1. Tag contacts that have already been sync’ed with Outlook. Strangely,
     Outlook does not provide any unique identifier with its contact
     information, so this would have to be done in some custom field. Ack!

  2. Use a proximity or fuzzy match to determine whether the same contact
     is being updated. So, “Sam Smith” and “Sammy Smith” might be the same
     person, but “Sam Jones” would not be. The user could then manually
     resolve possible duplicates.

Regarding (2), ferret seems like a good way to get a Levenshtein
distance for my existing data, as the data can be indexed as added,
economizing on the matching hassle later.
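
A minimal sketch of the fuzzy match in (2), written in plain Ruby rather
than with ferret. The helper name possible_duplicate? and the threshold
of 3 are assumptions made for illustration only; a real cutoff would
need tuning against actual contact data.

```ruby
# Classic dynamic-programming Levenshtein (edit) distance.
# Comparison is case-insensitive.
def levenshtein(a, b)
  a = a.downcase
  b = b.downcase
  return b.length if a.empty?
  return a.length if b.empty?

  # prev holds the distances for the previous row of the DP table.
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1,            # deletion
               curr[j - 1] + 1,        # insertion
               prev[j - 1] + cost].min # substitution
    end
    prev = curr
  end
  prev.last
end

# Hypothetical helper: flag a pair for manual review when the names are
# within max_distance edits of each other (threshold is an assumption).
def possible_duplicate?(name_a, name_b, max_distance = 3)
  levenshtein(name_a, name_b) <= max_distance
end

possible_duplicate?("Sam Smith", "Sammy Smith") # => true  (distance 2)
possible_duplicate?("Sam Smith", "Sam Jones")   # => false (distance 5)
```

Pairs that come back true would go to the user for manual resolution
rather than being merged automatically.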

Anyone have any thoughts or experience with this?

Thanks

View this message in context:
http://www.nabble.com/-OT--finding-fuzzy-duplicate-data-tf2633426.html#a7350065
Sent from the RubyOnRails Users mailing list archive at Nabble.com.

Steve R. wrote:

Anyone have any thoughts or experience with this?

The only other thought that comes to mind is maybe using a Soundex
algorithm (see the Creativyst how-to, “SoundEx - Phonetic Search
Algorithms”) in combination with your other ideas.
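
For reference, a rough sketch of American Soundex in plain Ruby. This is
one common variant of the algorithm (h and w are skipped, vowels
separate repeated codes), not code taken from the linked page, and the
constant and method names are illustrative.

```ruby
# Consonant-to-digit table used by American Soundex.
SOUNDEX_CODES = {
  "b" => "1", "f" => "1", "p" => "1", "v" => "1",
  "c" => "2", "g" => "2", "j" => "2", "k" => "2",
  "q" => "2", "s" => "2", "x" => "2", "z" => "2",
  "d" => "3", "t" => "3",
  "l" => "4",
  "m" => "5", "n" => "5",
  "r" => "6"
}.freeze

def soundex(name)
  letters = name.downcase.gsub(/[^a-z]/, "")
  return "" if letters.empty?

  result = letters[0].upcase
  last_code = SOUNDEX_CODES[letters[0]]
  letters[1..-1].each_char do |ch|
    next if ch == "h" || ch == "w" # h and w do not separate equal codes
    code = SOUNDEX_CODES[ch]       # nil for vowels and y
    result << code if code && code != last_code
    last_code = code               # a vowel resets last_code to nil
    break if result.length == 4
  end
  result.ljust(4, "0")             # pad short codes with zeros
end

soundex("Sam")   # => "S500"
soundex("Sammy") # => "S500"
soundex("Smith") # => "S530"
soundex("Smyth") # => "S530"
```

Because the code is only four characters, it is probably best treated
as a cheap first-pass filter ahead of a stricter comparison.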

c.

It looks like the text gem has a lot of good implementations of these
algorithms. I’m going to throw a bunch of test data at this and see what
happens. Metaphone, soundex and levenshtein all seem promising…

I’m concerned that if the change were, say, a change of address or a
change of phone number, the test mentioned below would fail erroneously,
allowing a duplicate into the database. However, names often stay the
same. Eeeek, except when people change them because of marriage or
personal preference. Hmmmmm. Am I overthinking this?

Thanks,

Steve

Outlook does not provide any unique identifier with its contact information

Anyone have any thoughts or experience with this?

There’s also Metaphone (see the Wikipedia article on Metaphone), which
is supposed to be better than Soundex. Not sure how Levenshtein fits in.

I also don’t remember how well any of them do with names as opposed to
similar sounding, but normal words.

Couldn’t you match the phone numbers up? Odds are if home, business, and
cell all match, it’s the same person…
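
A rough sketch of that phone comparison, assuming contacts arrive as
hashes with :home, :business and :cell keys (hypothetical field names)
and that an 11-digit number starting with 1 carries a US country code.
Both assumptions would need adjusting for real Outlook exports.

```ruby
# Strip formatting so "(555) 123-4567" and "555.123.4567" compare equal.
# Assumption: an 11-digit number starting with "1" has a US country code.
def normalize_phone(raw)
  digits = raw.to_s.gsub(/\D/, "")
  digits = digits[1..-1] if digits.length == 11 && digits.start_with?("1")
  digits
end

# Hypothetical field names for the three numbers mentioned above.
PHONE_FIELDS = [:home, :business, :cell].freeze

# Count the phone fields that are present on both contacts and match.
def matching_phone_count(a, b)
  PHONE_FIELDS.count do |field|
    x = normalize_phone(a[field])
    y = normalize_phone(b[field])
    !x.empty? && x == y
  end
end

existing = { home: "(555) 123-4567", business: "555.987.6543", cell: nil }
incoming = { home: "555-123-4567", business: "1-555-987-6543", cell: "" }
matching_phone_count(existing, incoming) # => 2
```

Blank fields are not counted as matches, so two contacts with no phone
numbers at all are never treated as the same person on this basis.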

-philip