Search for a string in another string allowing mismatches

Hi all,

I’m sure this is a very common problem. But can’t find a simple &
efficient way of solving it.

Basically, I want to find out if a string contains my search string as a
substring. However, a certain amount of mismatches has to be allowed.
Here’s an example:

query string:
“acgt”

subject string
“acctaggg”

If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match “acct”.
If 2 mismatches were allowed, the query string would match “acct” and
“aggg”.

If possible, I would like to do this using regular expressions, as my
search string can also contain ambiguous characters. So a real search
pattern might look something like this: /[ac][c][cg][t][acgt][c][gt][t]/

Unfortunately, performance is also very important, as I will have to
perform thousands of searches in strings that contain several million
characters.

I’d be very grateful for any suggestions!

Cheers,
Janus

On Sep 21, 2010, at 11:07 , Janus B. wrote:

Hi all,

I’m sure this is a very common problem. But can’t find a simple &
efficient way of solving it.

Hmm. I wouldn’t think this is common at all. I have dealt with a
somewhat similar situation in the past in two ways:

  1. Let the user do partial searches.

  2. Using the ‘soundex’ tools in a database

Neither of those matches your situation, which appears to be scanning
for near-matches in genetic material.

Unfortunately, performance is also very important, as I will have to
perform thousands of searches in strings that contain several million
characters.

This certainly isn’t my field of expertise, but I don’t expect a
fuzzy-logic search and ‘performance’ to be particularly compatible. Hope
I’m wrong!

If possible, I would like to do this using regular expressions, as my
search string can also contain ambiguous characters. So a real search
pattern might look something like this: /[ac][c][cg][t][acgt][c][gt][t]/

Gah.

What came to my mind was something like

fuzzysearch(“agct”, 1)

where “1” is the number of mismatches allowed. The code in question
would run the following searches:
/.gct/
/a.ct/
/ag.t/
/agc./

for fuzzysearch(“agct”,2) it would run
/…ct/
/.g.t/
/.gc./
/a…t/
/a.c./
/ag…/

I have no idea what the overhead of wildcards are; if there’s some
ultra-simple ultra-fast string matching tool somewhere that you can call
from Ruby or invoke through a command line, then you could take
advantage of the tiny set of possible alternatives, and break the fuzzy
search into 13 static searches:
agct
cgct
ggct
tgct
aact
acct
atct
.
.
.
agca
agcc
agcg

On Wed, Sep 22, 2010 at 3:07 AM, Janus B. [email protected]
wrote:

If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match “acct”.
If 2 mismatches were allowed, the query string would match “acct” and
“aggg”.

This does not use regular expressions.
It does not tell you which characters match. It tells you how many.

I don’t know if it is useful or fast enough but maybe it will give you
some ideas.

str = “acctaggg”

(0…str.length-4).each do |y|
p str.unpack(“@#{y}a4”)[0].split(//).zip(“acgt”.split(//)).map{|r|
r[0]<=>r[1]}.select{|f| f==0}.size
end

Harry

On Wed, Sep 22, 2010 at 9:47 PM, Harry K. [email protected]
wrote:

“acctaggg”
I don’t know if it is useful or fast enough but maybe it will give you

Harry

A little modification of my previous post.
Again, (lack of) speed may be a problem.

class String
def fuz(m,n) # (string to match, number of matches)
a = []
(0…size-4).each do |y|
t = self[y,4]
if t.split(//).zip(m.split(//)).map{|r| r[0]<=>r[1]}.select{|f|
f==0}.size>=n
a << t
end
end
a
end
end

str = “acctaggg”

p str.fuz(“acgt”,2) #> [“acct”, “aggg”]

Harry