Problem with comparing a huge number of strings


#1

Hello everyone,

I have a problem grouping rows in a database by similarity. What I’m
trying to achieve is the following. I have a table that looks like this
(just an example):

ID  Companyname        Group
 1  Mickeysoft Ltd.    NULL
 2  Mickysoft LTD      NULL
 3  Mickeysft Limited  NULL
 4  Aple Inc           NULL
 5  APPLE INC          NULL

and so on; you get the point. Group should be 1 for IDs 1 to 3 and 2
for IDs 4 and 5.

At the moment I compare two strings by lowercasing them, deleting dots
etc., deleting the substrings ‘Inc’, ‘Ltd’ etc., and then computing the
Levenshtein distance of the metaphone keys of the two strings.
It doesn’t work very well and is quite slow, but it’s okay and the best
I could figure out. (Nevertheless, your hints on that are welcome too.)
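Roughly, the comparison described above looks like this in plain Ruby (a sketch: the normalization rules and the threshold of 2 are illustrative assumptions, and a pure-Ruby Levenshtein stands in here for the metaphone-based variant):

```ruby
# Lowercase, strip punctuation, and drop common legal suffixes.
# (The exact rules here are illustrative assumptions.)
def normalize(name)
  name.downcase
      .gsub(/[.,]/, '')
      .gsub(/\b(inc|ltd|limited|llc)\b/, '')
      .strip
end

# Classic single-row dynamic-programming Levenshtein distance.
def levenshtein(a, b)
  row = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    prev = row[0]
    row[0] = i
    b.each_char.with_index(1) do |cb, j|
      cur = row[j]
      row[j] = [row[j] + 1, row[j - 1] + 1, prev + (ca == cb ? 0 : 1)].min
      prev = cur
    end
  end
  row[b.length]
end

# Two names "match" if their normalized forms are close enough.
def similar?(a, b, threshold = 2)
  levenshtein(normalize(a), normalize(b)) <= threshold
end
```

With this, similar?('Mickeysoft Ltd.', 'Mickysoft LTD') holds while similar?('Mickeysoft Ltd.', 'APPLE INC') does not, but every call still costs a full distance computation.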

My problem is that I don’t know how to apply my compare method in an
efficient way. What I’m doing now is selecting the first row where
Group is NULL and then, in a loop, selecting every other row where
Group is NULL, comparing it with the first selected string and setting
Group to a certain number if the compare method says they match.

That takes a long time and, worse, my code looks very ugly, even to a
beginner like me.

So if any of you has a good idea how to improve this, some docs on how
to implement such a thing, or even a totally different approach to get
the results I want, I would be very glad.

Jan


#2

On Mon, Oct 6, 2008 at 1:03 AM, Jan F. removed_email_address@domain.invalid wrote:

certain number if the compare method says they match.

The problem with relying on a compare function for grouping is that it
leads naturally to an O(n^2) solution. Ideally, what you want is some
sort of “characteristic” function, so that two entries with the same
characteristic are likely to be in the same grouping. Of course,
coming up with the characteristic is not always easy, and may not even
be possible, but it’s a useful direction to think in. Look up fuzzy
searching techniques, too, and see if they have anything adaptable.
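To make the idea concrete, here is a deliberately crude sketch in Ruby, where the “characteristic” is simply the first four letters of a normalized name (an assumption for illustration, not a recommendation):

```ruby
# Toy characteristic: lowercase, drop non-letters and legal suffixes,
# keep the first four letters. One pass over the data, so O(n) keys.
def characteristic(name)
  name.downcase
      .gsub(/[^a-z ]/, '')
      .sub(/\b(inc|ltd|limited)\b/, '')
      .strip[0, 4]
end

rows = ['Mickeysoft Ltd.', 'Mickysoft LTD', 'Aple Inc', 'APPLE INC']

# Rows sharing a key land in one bucket; expensive comparisons then
# run only within each (small) bucket instead of over all pairs.
groups = rows.group_by { |name| characteristic(name) }
```

Both “Mickeysoft” variants share the key “mick”, but “Aple” and “APPLE” land in different buckets (“aple” vs. “appl”), which is exactly why picking a good characteristic is the hard part.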

martin


#3

2008/10/6 Martin DeMello removed_email_address@domain.invalid:

loop again, comparing with the first selected string and setting Group to a
certain number if the compare method says they match.

The problem with relying on a compare function for grouping is that it
leads naturally to an O(n^2) solution. Ideally, what you want is some
sort of “characteristic” function, so that two entries with the same
characteristic are likely to be in the same grouping. Of course,
coming up with the characteristic is not always easy, and may not even
be possible, but it’s a useful direction to think in. Look up fuzzy
searching techniques, too, and see if they have anything adaptable.

Also, look into your database’s feature set. Most modern RDBMSs come
with text-searching capabilities, and dealing with large volumes of
data is where those beasts excel. If such a characteristic function
can be found, chances are you can implement it as a function in the
DB and resolve the issue via database queries / updates.

Kind regards

robert


#4

Martin DeMello wrote:

On Mon, Oct 6, 2008 at 1:03 AM, Jan F. removed_email_address@domain.invalid wrote:

certain number if the compare method says they match.

The problem with relying on a compare function for grouping is that it
leads naturally to an O(n^2) solution. Ideally, what you want is some
sort of “characteristic” function, so that two entries with the same
characteristic are likely to be in the same grouping.

Soundex on the first word, perhaps? (See Wikipedia.)

I don’t know how appropriate it would be for your dataset, but it’s
easy to try out. Maybe at least it will give you some smaller groups
which in turn can be subjected to more detailed analysis. If it works
well, you can add an extra column for this value, with an index, so
your DB can search and group on it efficiently.
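A simplified Soundex is only a few lines of Ruby (a sketch: this variant treats h, w and y as plain separators, so it can deviate from strict Soundex in rare cases):

```ruby
# Simplified Soundex: keep the first letter, map the remaining letters
# to digit classes, collapse adjacent duplicates, then pad/truncate to
# three digits. (Strict Soundex also ignores h/w between same-coded
# letters, which this sketch does not.)
SOUNDEX_CODES = {}
{ 'bfpv' => '1', 'cgjkqsxz' => '2', 'dt' => '3',
  'l' => '4', 'mn' => '5', 'r' => '6' }.each do |letters, digit|
  letters.each_char { |c| SOUNDEX_CODES[c] = digit }
end

def soundex(name)
  word = name.downcase.gsub(/[^a-z]/, '')
  return '' if word.empty?
  digits = []
  prev = nil
  word.each_char do |c|
    d = SOUNDEX_CODES[c]          # nil for vowels, h, w, y
    digits << d if d && d != prev
    prev = d
  end
  digits.shift if SOUNDEX_CODES[word[0]]  # drop the first letter's own code
  word[0].upcase + digits.join.ljust(3, '0')[0, 3]
end
```

On the sample data it does reasonably well: all three “Mickeysoft” variants code to the same key, as do both “Apple” variants.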


#5

Jan F. wrote:

Hello everyone,

I have a problem grouping rows in a database by similarity. What I’m
trying to achieve is the following. I have a table that looks like this
(just an example):

ID  Companyname        Group
 1  Mickeysoft Ltd.    NULL
 2  Mickysoft LTD      NULL
 3  Mickeysft Limited  NULL
 4  Aple Inc           NULL
 5  APPLE INC          NULL

and so on; you get the point. Group should be 1 for IDs 1 to 3 and 2
for IDs 4 and 5.

At the moment I compare two strings by lowercasing them, deleting dots
etc., deleting the substrings ‘Inc’, ‘Ltd’ etc., and then computing the
Levenshtein distance of the metaphone keys of the two strings.
It doesn’t work very well and is quite slow, but it’s okay and the best
I could figure out. (Nevertheless, your hints on that are welcome too.)

My problem is that I don’t know how to apply my compare method in an
efficient way. What I’m doing now is selecting the first row where
Group is NULL and then, in a loop, selecting every other row where
Group is NULL, comparing it with the first selected string and setting
Group to a certain number if the compare method says they match.

That takes a long time and, worse, my code looks very ugly, even to a
beginner like me.

So if any of you has a good idea how to improve this, some docs on how
to implement such a thing, or even a totally different approach to get
the results I want, I would be very glad.

Jan

First, at a minimum, if you apply Levenshtein distance to each pair of
keys, you might want to implement it as a custom database function. For
example, in MySQL you can create custom user-defined functions:
http://dev.mysql.com/doc/refman/6.0/en/adding-functions.html

The naive implementation of comparing each key with every other one is
quadratic. You need a way of “blocking”/“chunking” the dataset.

  1. For example, can you assume that the first letter will never be
    misspelled? Then you can group by first letter and create a smaller
    search space.
  2. Or, if you don’t accept a Levenshtein distance > 2, maybe you can
    first sort the names by length and only compute the distance on
    those which differ in length by less than 3?
  3. Perhaps the record has other fields which can be used, individually
    or in combination, to generate a sort key?
  4. There are more complicated blocking schemes, like n-grams (chunk by
    N common contiguous characters) or even sampling-based ones.

A good blocking algorithm will be the key to scalability.
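Combining ideas 1 and 2 above, a blocking pass might look like this in Ruby (a sketch under those stated assumptions, not a full solution):

```ruby
# Bucket names by first letter, then emit candidate pairs only inside
# each bucket, and only for names whose lengths differ by less than 3.
# The expensive pairwise comparison then runs on far fewer pairs.
def candidate_pairs(names)
  pairs = []
  names.group_by { |n| n.downcase[0] }.each_value do |block|
    block.combination(2) do |a, b|
      pairs << [a, b] if (a.length - b.length).abs < 3
    end
  end
  pairs
end

names = %w[mickeysoft mickysoft mickeysft aple apple]
candidate_pairs(names).size  # 4 candidate pairs instead of all 10
```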

There is lots of literature out there on this… search for sorted
neighborhood, record linkage, merge/purge, and hardening soft databases.

–Cheers
–Ragav


#6

On Mon, Oct 6, 2008 at 12:12 PM, Ragav S. removed_email_address@domain.invalid
wrote:

  1. There are more complicated blocking schemes, like n-grams (chunk by
    N common contiguous characters) or even sampling-based ones.

n-grams were my first thought - indeed, I’d started writing out a
bigram-based scheme, then I realised that it’d fail badly if there was
a common word like “systems” or “computers” that a lot of the entries
had. Maybe some sort of multipass scheme: first reduce each entry to a
characteristic word, then do n-gram frequency analysis on those words.
(My idea was this: pass 1: build a frequency table of bigrams; pass 2:
characterise each entry by the presence/multiplicity of the six or so
most common ones.)
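A rough sketch of that two-pass idea in Ruby (three most common bigrams instead of six, purely so the toy data stays meaningful; the names are made up):

```ruby
# Split a string into overlapping two-character chunks.
def bigrams(s)
  s = s.downcase.gsub(/[^a-z]/, '')
  (0..s.length - 2).map { |i| s[i, 2] }
end

# Pass 1: tally bigram frequencies over all entries.
# Pass 2: fingerprint each entry by which of the most common bigrams
# it contains; entries with equal fingerprints become grouping candidates.
def fingerprints(entries, top_n = 3)
  freq = Hash.new(0)
  entries.each { |e| bigrams(e).each { |bg| freq[bg] += 1 } }
  common = freq.sort_by { |_, n| -n }.first(top_n).map(&:first)
  entries.to_h { |e| [e, common.map { |bg| bigrams(e).include?(bg) }] }
end
```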

martin


#7

Jan F. wrote:

Hello everyone,

I have a problem grouping rows in a database by similarity. What I’m
trying to achieve is the following. I have a table that looks like this
(just an example):

ID  Companyname        Group
 1  Mickeysoft Ltd.    NULL
 2  Mickysoft LTD      NULL
 3  Mickeysft Limited  NULL
 4  Aple Inc           NULL
 5  APPLE INC          NULL

and so on; you get the point. Group should be 1 for IDs 1 to 3 and 2
for IDs 4 and 5.

If you are trying to catch every kind of typo, then I do not think
there is an algorithm that will suffice.

You said they are in a database. What if you were to do a

select distinct Companyname from theTable
order by Companyname

and then just go over it yourself? Perhaps you can add a field called
newID or something like that, and enter a value in that field to
identify which ones go together. Perhaps you can even add another field
that holds the correct spelling. Then, when you have gone through it
all, you can write something that uses the correct spellings and have
users go through a pick list to prevent spelling errors.

On the other hand, depending on how the table is set up, you could just
mark all but one for deletion.

You need to do this by hand IMO. GL!


#8

Martin DeMello wrote:

On Mon, Oct 6, 2008 at 12:12 PM, Ragav S. removed_email_address@domain.invalid
wrote:

  1. There are more complicated blocking schemes, like n-grams (chunk by
    N common contiguous characters) or even sampling-based ones.

n-grams were my first thought - indeed, I’d started writing out a
bigram-based scheme, then I realised that it’d fail badly if there was
a common word like “systems” or “computers” that a lot of the entries
had. Maybe some sort of multipass scheme: first reduce each entry to a
characteristic word, then do n-gram frequency analysis on those words.
(My idea was this: pass 1: build a frequency table of bigrams; pass 2:
characterise each entry by the presence/multiplicity of the six or so
most common ones.)

martin

Yes indeed. The choice of blocking scheme depends largely on what kinds
of key variations are expected and on the size of the database.

  1. As a first stage of normalization, the OP mentioned he was
    stripping off Inc, LLC, Limited and the like. As you mentioned
    earlier, it might be possible to use a single token… so a second
    token like Systems, Computers etc. can be stripped off during
    blocking and only applied in phase II, when comparing intra-block
    records with a more exact scheme like Levenshtein distance.

  2. Before we even go to a bigram scheme, simple schemes like Soundex,
    first-two-character truncation, NYSIIS etc. should be tried.

  3. Better yet, two or more of these should be applied in multiple
    passes and the blocks built out of the union of the blocks produced
    by each method.

  4. If bigrams are used, then I would go with full bigram indexing (not
    just the six most common) with a threshold value and reverse
    indexes: http://datamining.anu.edu.au/publications/2003/kdd03-3pages.pdf
    (sec 2.1). (This paper shows Python code, so it could probably be
    converted to Ruby easily.)
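A toy version of such a reverse bigram index in Ruby (the Dice-style similarity and the 0.6 threshold are illustrative assumptions; see the paper for the real scheme):

```ruby
# Unique bigrams of a name, letters only.
def bigrams(s)
  s = s.downcase.gsub(/[^a-z]/, '')
  (0..s.length - 2).map { |i| s[i, 2] }.uniq
end

# Reverse index: bigram => list of record ids containing it.
def build_index(records)
  index = Hash.new { |h, k| h[k] = [] }
  records.each { |id, name| bigrams(name).each { |bg| index[bg] << id } }
  index
end

# A query only touches records sharing at least one bigram with the
# name, then keeps those whose Dice coefficient clears the threshold.
def candidates(index, records, name, threshold = 0.6)
  q = bigrams(name)
  shared = Hash.new(0)
  q.each { |bg| index[bg].each { |id| shared[id] += 1 } }
  shared.select do |id, n|
    2.0 * n / (q.size + bigrams(records[id]).size) >= threshold
  end.keys
end

records = { 1 => 'Mickeysoft Ltd', 2 => 'Mickysoft LTD',
            3 => 'Aple Inc', 4 => 'APPLE INC' }
index = build_index(records)
candidates(index, records, 'Mickeysoft Ltd')  # => [1, 2]
```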

–Cheers
–Ragav


#9

Hi, thank you very much, Ragav and Martin. You both helped me a lot!
I followed up on your posts in the other subthread. (MId:
gchq3f$1ai$removed_email_address@domain.invalid)

Jan


#10

Hello again,

Robert K. wrote on 6.10.08 at 13:05:

Also, look into your database’s feature set. Most modern RDBMSs come
with text-searching capabilities, and dealing with large volumes of
data is where those beasts excel. If such a characteristic function
can be found, chances are you can implement it as a function in the
DB and resolve the issue via database queries / updates.

thank you very much for your hints. I think you both suggest using
something like the soundex function to characterize each single row in
a certain way and then grouping by this characteristic. (Like Brian
Chandler suggests in another post too - thanks, Brian.)
The fact is that I already tried the “soundex solution”, but the
results don’t even come close to what I expected. I also tried building
the metaphone key of every string and grouping by that, but that wasn’t
enough either.

I think a mix of your hints and what Ragav Satish suggests in his post
will probably lead me in the right direction:
The first step is to “normalize” my strings (strip “Ltd.” etc.), build
the metaphone key and then write that back into the database instead of
recalculating everything for each comparison. That should be O(n),
right? (I read about Landau notation for the first time…)
After this step I’ll try to group my strings by other information -
maybe by country of origin, zip code (if available) or first letter -
before I compare every string to each other.
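In code, that first step might look roughly like this (`normalize_key` is a hypothetical stand-in for the strip-suffixes-plus-metaphone key; the table and column names are made up):

```ruby
# Compute the key once per row and store it, instead of recomputing it
# inside every comparison: one pass over the table, O(n) key computations.
def normalize_key(name)
  name.downcase
      .gsub(/[.,]/, '')
      .gsub(/\b(inc|ltd|limited)\b/, '')
      .strip
end

rows = [{ id: 1, name: 'Mickeysoft Ltd.' },
        { id: 2, name: 'Mickysoft LTD' }]

rows.each { |r| r[:key] = normalize_key(r[:name]) }
# in the real thing, write it back, e.g.:
# UPDATE companies SET name_key = ? WHERE id = ?
```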

I have to be honest: I had never heard of n-gram or bigram schemes, so
I think I have to read up on them first. But it sounds very interesting
and I’m very thankful for these hints.

Kind regards
Jan


#11

On 08.10.2008 10:13, Jan F. wrote:

will probably lead me in the right direction:

Yep, I agree.

The first step is to “normalize” my strings (strip “Ltd.” etc.), build
the metaphone key and then write that back into the database instead of
recalculating everything for each comparison. That should be O(n),
right?

Yes, kind of. In databases, big-O analysis is not so important because
SQL is declarative, not procedural. In other words, you have no control
(in reality there is some control you can exert, but it is important to
keep in mind that SQL is not a procedural programming language) over
which operations the DB will execute in order to give you the result
set you requested. It is more important to care about access paths, so
you definitely want to index the result of this function.

You did not state which RDBMS you are using. I know Oracle very well
and SQL Server pretty well, so I will speak only about those two;
chances are that you can do similar things with other products. In both
cases you can define a function for this. In SQL Server you can then
store a computed column which uses this function, and in Oracle you can
simulate the computed column with a trigger or create a function-based
index. Then you can query that computed column / function efficiently
and group by or order by it. If you add analytic SQL to the mix you can
even do some nice additional tricks, like counting per column value.

Cheers

robert