Duplicate Record Detection

How are others handling duplicate record detection? I’m not finding
many solutions or methods online. I found one called SimString, but not
much else. Similar to suggested items, I would like to show records that
may match or have similar attributes.

validates_uniqueness_of might be of help?

That would help slow the duplication, but if someone fills out a form
and submits fname, lname, ss#, and they typo the ss#, I would have a
duplicate. I would like to show admin users that this record has a
related link or is similar, much like how Google finds duplicates in
your contacts and merges them.

On Mon, Jul 11, 2011 at 9:42 AM, Justin S. [email protected]
wrote:

That would help slow the duplication, but if someone fills out a form and
submits fname, lname, ss#, and they typo the ss# I would have a duplicate.

“slow the duplication”?? No, ensuring that the SSNs are unique via
validations and unique indexes would prevent duplicates, period.

What is it about that you don’t like?


Hassan S. ------------------------ [email protected]

twitter: @hassan

It would stop duplication of that unique string, but not fix a typo like
transposed numbers. Besides, that was a simple example, not meant to be
challenged. It’s the process of detecting duplicates I’m looking for. I
know how to validate and key tables. Maybe another example is loading a
million records from an external source, where you need to find the
duplicates. I’m just asking if anyone has seen APIs or Ruby utilities
that perform this function. SimString, for instance, seems to compare
how close two strings are to matching.

On Mon, Jul 11, 2011 at 4:02 PM, Hassan S. <

I don’t believe you’re going to find a magic formula for what you’re
suggesting. The same thing could be said about last or first names as
you are suggesting could happen with SSNs. What if somebody misspells
Smith as Smit, for example? But worse yet, what if it is not a
misspelling and the Smit is actually Smit? The same is true for SSNs:
two swapped final digits do not have to be a typo, it could just be
that 2 different people have the same name and very similar SSNs. You
have to draw a line somewhere, I think.

You could use auto-complete fields and then provide options based on
records found using the ‘LIKE’ option in the where clause using the
information currently being entered. That might help but I think
you’ll find it’s not worth the effort.

Yes, this is all very true. I was thinking that if a comparison were
done on multiple attributes, that would help when just one of them, such
as a name, is wrong. I’m not looking for magic, just wondering how
others find duplicated records. I could see this being used to detect
data that links or is similar in nature.
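
For what it’s worth, the multi-attribute idea could be sketched roughly
like this. It is only an illustration, not an existing library: the
field names, the weights, and the helper names (trigrams, similarity,
match_score) are all made up, and the string measure is a plain
character-trigram Jaccard overlap, which is one of several measures a
tool like SimString could be configured with.

```ruby
require 'set'

# Character trigrams of a normalized string. The padding is an
# assumption that lets short strings still produce a few trigrams.
def trigrams(text)
  padded = "  #{text.to_s.downcase.strip} "
  (0..padded.length - 3).map { |i| padded[i, 3] }.to_set
end

# Jaccard overlap of the two trigram sets: 1.0 for identical strings,
# 0.0 when no trigram is shared.
def similarity(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  union = (ta | tb).size
  union.zero? ? 0.0 : (ta & tb).size.to_f / union
end

# Hypothetical weights; in practice these would come from whoever
# defines what "similar enough" means for the data.
WEIGHTS = { fname: 0.3, lname: 0.3, ssn: 0.4 }.freeze

# Weighted average of the per-field similarities of two records.
def match_score(rec_a, rec_b, weights = WEIGHTS)
  weights.sum { |field, weight| weight * similarity(rec_a[field], rec_b[field]) }
end

a = { fname: "Justin", lname: "Smith", ssn: "123-45-6789" }
b = { fname: "Justin", lname: "Smith", ssn: "123-45-6798" } # transposed digits
c = { fname: "Maria",  lname: "Lopez", ssn: "987-65-4321" }

puts match_score(a, b) # high: only the SSN differs slightly
puts match_score(a, c) # low: nothing lines up
```

Records scoring above some cutoff could then be flagged for an admin to
review rather than merged automatically. Note that two blank fields
count as a perfect match in this sketch, which may not be what you want.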

Found this as well.

On Mon, Jul 11, 2011 at 1:56 PM, pepe [email protected] wrote:

Whenever I have worked on similar projects it ended up being the
customer’s idea of what a “close approximation” was that made a
possible duplicate.

Exactly – if they’re not identical they’re not “duplicates”.

On the other hand if you define “similarity” to some degree you can use
e.g. the Levenshtein gem to measure how “different” 2 given fields are.

Levenshtein.distance("Hassan Schroeder", "Hassan A. Schroeder")
=> 3
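
If you would rather not depend on a gem, the same measure is short
enough to write by hand. The following is a minimal sketch of the
standard dynamic-programming edit distance; the method names
edit_distance and normalized_distance are just illustrative, not part
of any library.

```ruby
# Edit distance via the classic dynamic-programming recurrence,
# keeping only one row of the table at a time.
def edit_distance(a, b)
  a_chars = a.chars
  b_chars = b.chars
  prev = (0..b_chars.length).to_a
  a_chars.each_with_index do |ca, i|
    curr = [i + 1]
    b_chars.each_with_index do |cb, j|
      cost = ca == cb ? 0 : 1
      # deletion, insertion, substitution (or match when cost is 0)
      curr << [prev[j + 1] + 1, curr[j] + 1, prev[j] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Distance scaled by the longer string, so values from fields of
# different lengths can be compared (0.0 means identical).
def normalized_distance(a, b)
  longest = [a.length, b.length].max
  longest.zero? ? 0.0 : edit_distance(a, b).to_f / longest
end

puts edit_distance("Hassan Schroeder", "Hassan A. Schroeder") # => 3
puts edit_distance("123456789", "123456798")                  # => 2
```

Note that two transposed digits count as 2 edits here, since plain
Levenshtein has no transposition operation.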

HTH!

Hassan S. ------------------------ [email protected]

twitter: @hassan

Whenever I have worked on similar projects it ended up being the
customer’s idea of what a “close approximation” was that made a
possible duplicate. It was usually something like:
same birth date
same last name
same city (optional)
same state (optional)
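
Rules like these translate fairly directly into a grouping key. Here is
a rough sketch in plain Ruby; the records, field names, and the helper
names duplicate_key and possible_duplicates are invented for
illustration, and a real application would probably do the grouping in
the database instead.

```ruby
require 'date'

# Hypothetical sample data standing in for rows loaded from a table.
records = [
  { id: 1, last_name: "Smith",  birth_date: Date.new(1980, 5, 17), city: "Austin", state: "TX" },
  { id: 2, last_name: "smith ", birth_date: Date.new(1980, 5, 17), city: "Austin", state: "TX" },
  { id: 3, last_name: "Smith",  birth_date: Date.new(1975, 1, 2),  city: "Dallas", state: "TX" }
]

# Build the comparison key: birth date and last name always, plus
# whichever optional fields the customer's rules call for.
def duplicate_key(record, optional_fields = [])
  fields = [:birth_date, :last_name] + optional_fields
  fields.map { |field| record[field].to_s.strip.downcase }
end

# Group records by key and keep only groups with more than one member.
def possible_duplicates(records, optional_fields = [])
  records.group_by { |record| duplicate_key(record, optional_fields) }
         .values
         .select { |group| group.size > 1 }
end

possible_duplicates(records, [:city, :state]).each do |group|
  puts "Possible duplicates: #{group.map { |r| r[:id] }.join(', ')}"
end
```

The key is normalized (trimmed and lowercased) so that stray whitespace
or capitalization does not hide an otherwise exact match.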

Since there is no such thing as a tried and true method for deciding
what a duplicate record is, I believe you’ll just need to do some manual
work. My advice would be to ask your customer/boss what the rules are.

Yes, this is very nice.

On Mon, Jul 11, 2011 at 5:04 PM, Hassan S. <