I got a problem grouping rows in a Database by similarity. What I try to
reach is the following:
I have a table looking like (just an example):
ID Companyname Group
1 Mickeysoft Ltd. NULL
2 Mickysoft LTD NULL
3 Mickeysft Limited NULL
4 Aple Inc NULL
5 APPLE INC NULL
and so on, you get the point. Group should be 1 for the IDs 1 to 3 and 2
for the IDs 4 and 5.
At the moment I compare two strings by making them lowercase, deleting
dots etc., deleting the substrings ‘Inc’, ‘Ltd’ etc. and then building
the Levenshtein-Distance of the metaphone-key of the two strings.
Works not really good and is damn slow, but it’s okay and best I could
figure out. (Nevertheless your hints on that are welcome too.)
My problem is, that I don’t know how to apply my compare-method in an
efficient way. What I’m doing now is selecting the first row where Group
is NULL and then selecting each row (where Group is NULL) in my database
in a loop again, comparing with the first selected string and setting
Group to a certain number if comapare method says they match.
That lasts a lang time and - worse - my code looks very ugly, even to a
very beginner as I am.
So if somebody of you has a good idea how to improve that, some docs on
how to implement such a thing or even a totally different approach to
get the results I want, I would be very glad.