Help with string matching algorithm

t-rex · August 9, 2007, 2:28pm

Hello,

I would be so grateful to ANYONE who can help me!

I’m writing a program in C++.
A part of my program is to compare two strings and as a result I have to
get a number (range: 0-1) which represents the similarity between those two
strings.

Which algorithm to use?
I searched the Web, but there’s a million of them, and I don’t know which to
use.
I don’t need a solution in C++, just a hint which algorithm to use to
implement this concept.

Thanx!

t-rex · August 9, 2007, 2:32pm

Tomislav K. wrote:

I’m writing a program in C++.
A part of my program is to compare two strings and as a result I have to
get a number (range: 0-1) which represents the similarity between those two
strings.

Which algorithm to use?
I searched the Web, but there’s a million of them, and I don’t know which to
use.
I don’t need a solution in C++, just a hint which algorithm to use to
implement this concept.

Thanx!

strcmp()

Todd

t-rex · August 9, 2007, 5:24pm

2007/8/9, Tomislav K. wrote:

I searched the Web, but there’s a million of them, and I don’t know which to
use.
I don’t need a solution in C++, just a hint which algorithm to use to
implement this concept.

There is no general answer to your question. It depends on what you
want to do with the result. There must be some requirements or at
least more information about the nature of your problem. There is no
general definition of the term “similarity” for text strings - it
really depends on the application case.

Kind regards

robert

t-rex · August 9, 2007, 5:56pm

On Aug 9, 6:28 am, Tomislav K. wrote:

A part of my program is to compare two strings and as a result I have to
get a number (range: 0-1) which represents the similarity between those two
strings.

You might want to google for string “correlation” algorithms.
(Depending on what sort of similarity you want.)

t-rex · August 9, 2007, 5:40pm

Tomislav K. wrote:

I searched the Web, but there’s a million of them, and I don’t know which to
use.
I don’t need a solution in C++, just a hint which algorithm to use to
implement this concept.

Thanx!

That sounds like a fun problem. As someone already said in a reply, the
algorithm depends on your requirements, and what type of similarity you
want. For example, if you wanted to use this information to attempt a
new type of sorting, the algorithm I’m about to suggest would be
useless.

That being said, here’s what I would do. It’s conceptually very simple:

1. Find the “longest common subsequence”. What distinguishes this from
   “longest common substring” (and makes it harder) is that the matching
   letters don’t need to be adjacent. For example, the longest common
   subsequence of “aaaacccceeee” and “aaaabbbbccc” is “aaaaccc”. This is
   best calculated with dynamic programming, and you can probably find
   guidance on that on the internet.
2. Compare the length of this subsequence with the lengths of the two
   original strings. Perhaps something simple like “similarity =
   (length of common subsequence) / (average length of the two original
   strings)”, which always falls between 0 and 1.
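The two steps above can be sketched in a few lines. This is only a rough illustration in Python (easy to port to C++), not a tuned implementation. The function names `lcs_length` and `lcs_similarity` are made up for this example, and treating two empty strings as identical is an assumption of mine.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    # prev[j] is the LCS length of the part of `a` handled so far and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]  # LCS with an empty prefix of b is always 0
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Matching characters extend the LCS of both shorter prefixes.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one character from either string.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length divided by the average length of the two strings."""
    if not a and not b:
        return 1.0  # assumption: two empty strings count as identical
    return lcs_length(a, b) / ((len(a) + len(b)) / 2)


print(lcs_similarity("kitten", "sitting"))
```

It keeps only two rows of the dynamic programming table, so memory use grows with the length of one string rather than with the product of both lengths.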

Hope this helps,
Dan

t-rex · August 9, 2007, 7:46pm

On 8/9/07, Robert K. wrote:

The problem description made me think of bioinformatics - especially
comparing genetic distances. You can measure similarity as the number
of changes needed to transform one string into another. If that
sounds like the type of similarity you need, look up Levenshtein
distance (Wikipedia has an article on it).

In fact, there was a Ruby Quiz dealing with a similar problem: Word
Chains (#44). The difference was that
the quiz only allowed changes that resulted in valid dictionary words.

-Adam

t-rex · August 14, 2007, 11:10am

I searched the Web, but there’s a million of them, and I don’t know which to
use.
I don’t need a solution in C++, just a hint which algorithm to use to
implement this concept.

Thanx!

Your problem is similar to finding the edit distance between two strings.
Have a look at the Wikipedia articles on edit distance and string metrics.
I don’t know which one could give a result in the range [0,1], however.

t-rex · August 9, 2007, 6:16pm

Phrogz wrote:

On Aug 9, 6:28 am, Tomislav K. wrote:

A part of my program is to compare two strings and as a result I have to
get a number (range: 0-1) which represents the similarity between those two
strings.

You might want to google for string “correlation” algorithms.
(Depending on what sort of similarity you want.)

You might try: http://amatch.rubyforge.org/

lopex

t-rex · August 14, 2007, 11:23am

Which algorithm to use?
I don’t know which one could give a result in the range [0,1], however.

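I think using the Levenshtein distance this way should do the trick:
levenshtein_distance(a, b) / max(a.size, b.size)

since the result of the Levenshtein distance is at most the length of
the longer string. Note that this gives 0 for identical strings and 1 for
completely different ones, so subtract it from 1 if you want 1 to mean
"identical" (and use floating-point division).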
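Here is a small sketch of that idea, in Python just to keep it short. The names `levenshtein_distance` and `levenshtein_similarity` are illustrative and not taken from any library. Two choices here are my own assumptions: the normalized distance is subtracted from 1 so that identical strings score 1, and two empty strings are treated as identical to avoid dividing by zero.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b."""
    # prev[j] is the distance between the part of `a` handled so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(
                prev[j] + 1,         # delete ch_a
                curr[j - 1] + 1,     # insert ch_b
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / length of the longer string, a value in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_distance("kitten", "sitting"))  # prints 3
```

The division is done in floating point on purpose. With integer division the result would collapse to 0 for almost every pair of strings.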

t-rex · August 14, 2007, 1:00pm

As others already said, it depends on your needs.
But nobody has mentioned Soundex yet:
http://raa.ruby-lang.org/project/soundex/
which may or may not be what you want.

Han H.