String matching/comparing, statistical similarity

crenner · July 8, 2008, 5:09pm

Hi there,

I am trying to find a solution for this problem:

I am generating a large amount of data from different data feeds (CSV).

I have a short Ruby program that parses all of the CSV files (input),
manipulates the data, and then writes the data to a new CSV file,
combined_data.csv

My problem is that the data feeds I get are formatted differently.

Example, let’s use iPods:

For my combined_data.csv, I might want to combine the data from the 3
feeds like so:

combined_data.csv
iPod Video 20 GB | black, white | $300 | sony | iPod touch - Official Apple Support

Thus, were I have 3 relevant rows of data, I want to combine them into
one. Of course, the data isn’t in the same order in the different CSV
files.

The only “common element” is the name (product name in this case), but
as you will see above, there are usually some differences. So I have to
use the name to combine the data.

lowercase and removing whitespaces won’t work because words can be
different.

I basically have no idea how this can be accomplished (especially since
there are 100s of products), apart from manually adding a common ID,
e.g.
if n == “IPOD video 20Gigs”
id = “ipodv20”
end

Second thing I thought of, maybe there’s a way in Ruby to compare
strings and get back a percentage of how similar the strings are?

If you have any ideas, I’d appreciate your input.

Cheers, Chris

crenner · July 8, 2008, 5:31pm

Chris C. wrote:

Hi there,

I am trying to find a solution for this problem:

(…)

Second thing I thought of, maybe there’s a way in Ruby to compare
strings and get back a percentage of how similar the strings are?

I don’t think this can be done. You may be able to come up with
something but you’ll never know if it’s correct. Anyway, Levenshtein
distance is available:
http://raa.ruby-lang.org/project/levenshtein/

Regards,

Siep

crenner · July 8, 2008, 7:08pm

Second thing I thought of, maybe there’s a way in Ruby to
compare strings and get back a percentage of how similar the
strings are?

There’s a very fast C-implementation of the Levenshtein
distance. This module has two methods, of which one is called
normalized_distance. It does exactly what you mentioned:

Levenshtein.normalized_distance(s1, s2, threshold=nil)

Download and documentation:

$ gem install levenshtein

http://rubyforge.org/projects/levenshtein/
levenshtein (0.2.2)

gegroet,
Erik V. - http://www.erikveen.dds.nl/

crenner · July 9, 2008, 1:25pm

This is a major problem that search engines have to solve. There are no
easy answers but there are lots of possible techniques, all of which
suffer from false positives and false negatives. You might like to look
at the information retrieval literature to find out what’s been tried.

A simple metric for how alike two documents are is the Jaccard
similarity measure. If you can break your formatted files into words,
the Jaccard similarity between two documents is (no. of words in both
docs)/(no. of words in either doc). In mathematics, if you have two sets
A and B, the Jaccard similarity is the cardinality of the intersection
of A and B divided by the cardinality of the union of A and B: |A n B| /
|A u B|. This lies in [0, 1].

If you’re using words as features, it’s usual to eliminate common
function words such as “the”, “a”, “an”, etc. Of course you can use
other features than words, for example phrases.

Another possibility is the cosine similarity measure. You form a vector
for each document by counting the number of times each feature occurs,
and take the cosine of the angle between them in high-dimensional space.
This lies in [-1, 1].

Levenstein distance is not used by search engines, as two documents can
have very similar content but a large edit distance.

Dave