Finding the closest match?

benjohnson · August 12, 2006, 5:29am

I am trying to do a search for a crazy problem. Anyways, the search can
only contain one word and the database it’s searching contains one word
per row. Basically if someone tries to search for a word I want to find
the closest match.

Ex:

If someone types “cat” and “cat” is not in my database, but “bat” is,
how could I select bat?

Also, if someone types “cat” and I dont have anything remotely close to
that then I want to return nothing.

Lastly, these are not all english words either.

Any help is greatly appreciated.

benjohnson · August 12, 2006, 9:49am

Ben J. wrote:

I am trying to do a search for a crazy problem. Anyways, the search can
only contain one word and the database it’s searching contains one word
per row. Basically if someone tries to search for a word I want to find
the closest match.

Ex:

If someone types “cat” and “cat” is not in my database, but “bat” is,
how could I select bat?

Also, if someone types “cat” and I dont have anything remotely close to
that then I want to return nothing.

Compute some kind of signatures for each word (preferably numeric),
store them in the database with each word. These signatures should have
such property, that for similar (whatever that means in your context)
words they have similar values.
Then just compute signature of the word you want to look for and search
based on this.

The problem of course is with finding a suitable signature generation
algorithm. But the problem isn’t new and you should be able to find
something on that topic (I have never done such a thing but have heard
of this kind of solution).

benjohnson · August 12, 2006, 10:38am

Marcin S. wrote:

Ben J. wrote:

I am trying to do a search for a crazy problem. Anyways, the search can
only contain one word and the database it’s searching contains one word
per row. Basically if someone tries to search for a word I want to find
the closest match.

Ex:

If someone types “cat” and “cat” is not in my database, but “bat” is,
how could I select bat?

Also, if someone types “cat” and I dont have anything remotely close to
that then I want to return nothing.

Compute some kind of signatures for each word (preferably numeric),
store them in the database with each word. These signatures should have
such property, that for similar (whatever that means in your context)
words they have similar values.
Then just compute signature of the word you want to look for and search
based on this.

The problem of course is with finding a suitable signature generation
algorithm. But the problem isn’t new and you should be able to find
something on that topic (I have never done such a thing but have heard
of this kind of solution).

That would actually be perfect, because I could easily find the term
that is closest to that number. The only problem is that I have
absolutely no idea how to write that algorithm. Any idea what to search
for on google or do you know where I can start with this? I searched
Google for a good hour and basically ended up with nothing.

Thanks.

benjohnson · August 12, 2006, 4:45pm

Mysql does soundex, which isn’t exactly what you described, but might be
close enough:
http://www.madirish.net/tech.php?section=4&article=85

benjohnson · August 12, 2006, 10:46am

Marcin S. wrote:

Ben J. wrote:

I am trying to do a search for a crazy problem. Anyways, the search can
only contain one word and the database it’s searching contains one word
per row. Basically if someone tries to search for a word I want to find
the closest match.

Ex:

If someone types “cat” and “cat” is not in my database, but “bat” is,
how could I select bat?

Also, if someone types “cat” and I dont have anything remotely close to
that then I want to return nothing.

Compute some kind of signatures for each word (preferably numeric),
store them in the database with each word. These signatures should have
such property, that for similar (whatever that means in your context)
words they have similar values.
Then just compute signature of the word you want to look for and search
based on this.

The problem of course is with finding a suitable signature generation
algorithm. But the problem isn’t new and you should be able to find
something on that topic (I have never done such a thing but have heard
of this kind of solution).

Also, I guess what I’m talking about is very similar to a spell check
algorithm, but my words that I’m suggesting are not words from the
dictionary, just random strings in a database.

benjohnson · August 15, 2006, 2:17pm

Ben J. [email protected]
writes:

Also, if someone types “cat” and I dont have anything remotely close to
that then I want to return nothing.

Lastly, these are not all english words either.

Any help is greatly appreciated.

a simple brute force solution would be. Search for cat. Then search
for, ‘.at’ (notice regexp), then ‘c.t’, etc.

definitely above can be made smarter.

HTH.

Surendra S.
http://ssinghi.kreeti.com, http://www.kreeti.com
Read my blog at: http://cuttingtheredtape.blogspot.com/
,----
| “O thou my friend! The prosperity of Crime is like unto the lightning,
| whose traitorous brilliancies embellish the atmosphere but for an
| instant, in order to hurl into death’s very depths the luckless one
| they have dazzled.” – Marquis de Sade
`----

benjohnson · August 15, 2006, 2:17pm

On Saturday 12 August 2006 05:29, Ben J. wrote:

I am trying to do a search for a crazy problem. Anyways, the search
can only contain one word and the database it’s searching contains
one word per row. Basically if someone tries to search for a word I
want to find the closest match.

If you don’t need “closest match”, but “match from an indordinately
large set” will do, then look at Bloom filters:

http://dataspill.org/pages/projects/ruby-bloomfilter

Michael

–
Michael S.
mailto:[email protected]
http://www.schuerig.de/michael/

benjohnson · August 15, 2006, 2:17pm

Hi Ben,

you might have a look at the fuzzy search capabilities of ferret (
http://ferret.davebalmain.com) or at least the fuzzy search algorithm
that
lucene / ferret are using.

Cheers,
Jan

benjohnson · August 12, 2006, 7:26pm

Juan Felipe G. wrote:

Mysql does soundex, which isn’t exactly what you described, but might be
close enough:
Mad Irish :: Using Soundex with MySQL

I actually found this, which is pretty good:

The only problem is performance. I have to run this algorithm against
every word in the db. So as the first poster mentioned, if I could have
an algorithm that basically stored each word as an integer I could
simply find words that are similar with a range. So if I get “bat” and
let’s say it returns 20. I could do:

select word from words where rating < 30 and rating > 10

Anyways, that would be the optimal solution, if I could just find the
algorithm.

benjohnson · August 15, 2006, 2:17pm

That would actually be perfect, because I could easily find the term
that is closest to that number. The only problem is that I have
absolutely no idea how to write that algorithm. Any idea what to search
for on google or do you know where I can start with this? I searched
Google for a good hour and basically ended up with nothing.

The problem you describe has not yet been completely solved in computer
sciences - at least to my knowledge.

You are dealing with a set of words and search the “closest” match.
Thus,
first you have to define what “close” means. In mathematics and CS, the
“function” that returns the numeric distance between two elements of a
set
is called the set’s “metric”. An example is the euclidian metric
sqrt(sqr(x) + sqr(y)). Try to search the internet for “metric” and “word
metric” to find out about metrics in general.

The problem that presents itself now is to index a metric space without
much more special properties (as vector spaces would have). The only
index
structure I know is called “M Tree”.

Why index? Well, consider the following approach: Query all datasets and
then for each data set compute the difference between the query word and
the word in the data set. Sort the differences and select the 5
smallest.
Sloooow for large sets. So the indexing has to take place in the
database.

I do not know any database that implements M trees as of now and
extending
PostgreSQL, for example, to support M Tree indexes would propably break
your project’s scope.

So what to do? Settle for a heuristic solution!

The easiest thing here would be either to use the full text index engine
of the RDBMS of your choice or fall back to a separate server process
running on your server like Lucene or something that uses the
BloomFilters
or a similar approach. For each word you enter into your database, you
add
the word with the primary key of that entry to your word database server
process.

Another possibility would be to “factorize” similar words, i.e. similar
sounding words and/or letters are changed to the same. For example,
“similar” and “cimilar” could be considered the same.

Hope That Helps,

Manuel