I’m sure this is a very common problem. But can’t find a simple &
efficient way of solving it.
Basically, I want to find out if a string contains my search string as a
substring. However, a certain amount of mismatches has to be allowed.
Here’s an example:
query string:
“acgt”
subject string
“acctaggg”
If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match “acct”.
If 2 mismatches were allowed, the query string would match “acct” and
“aggg”.
If possible, I would like to do this using regular expressions, as my
search string can also contain ambiguous characters. So a real search
pattern might look something like this: /[ac][c][cg][t][acgt][c][gt][t]/
Unfortunately, performance is also very important, as I will have to
perform thousands of searches in strings that contain several million
characters.
I’m sure this is a very common problem. But can’t find a simple &
efficient way of solving it.
Hmm. I wouldn’t think this is common at all. I have dealt with a
somewhat similar situation in the past in two ways:
Let the user do partial searches.
Using the ‘soundex’ tools in a database
Neither of those matches your situation, which appears to be scanning
for near-matches in genetic material.
Unfortunately, performance is also very important, as I will have to
perform thousands of searches in strings that contain several million
characters.
This certainly isn’t my field of expertise, but I don’t expect a
fuzzy-logic search and ‘performance’ to be particularly compatible. Hope
I’m wrong!
If possible, I would like to do this using regular expressions, as my
search string can also contain ambiguous characters. So a real search
pattern might look something like this: /[ac][c][cg][t][acgt][c][gt][t]/
Gah.
What came to my mind was something like
fuzzysearch(“agct”, 1)
where “1” is the number of mismatches allowed. The code in question
would run the following searches:
/.gct/
/a.ct/
/ag.t/
/agc./
for fuzzysearch(“agct”,2) it would run
/…ct/
/.g.t/
/.gc./
/a…t/
/a.c./
/ag…/
I have no idea what the overhead of wildcards are; if there’s some
ultra-simple ultra-fast string matching tool somewhere that you can call
from Ruby or invoke through a command line, then you could take
advantage of the tiny set of possible alternatives, and break the fuzzy
search into 13 static searches:
agct
cgct
ggct
tgct
aact
acct
atct
.
.
.
agca
agcc
agcg
If no mismatch was allowed, there would be 0 hits.
If 1 mismatch was allowed, the query string would match “acct”.
If 2 mismatches were allowed, the query string would match “acct” and
“aggg”.
This does not use regular expressions.
It does not tell you which characters match. It tells you how many.
I don’t know if it is useful or fast enough but maybe it will give you
some ideas.
str = “acctaggg”
(0…str.length-4).each do |y|
p str.unpack(“@#{y}a4”)[0].split(//).zip(“acgt”.split(//)).map{|r|
r[0]<=>r[1]}.select{|f| f==0}.size
end
“acctaggg”
I don’t know if it is useful or fast enough but maybe it will give you
Harry
A little modification of my previous post.
Again, (lack of) speed may be a problem.
class String
def fuz(m,n) # (string to match, number of matches)
a = []
(0…size-4).each do |y|
t = self[y,4]
if t.split(//).zip(m.split(//)).map{|r| r[0]<=>r[1]}.select{|f|
f==0}.size>=n
a << t
end
end
a
end
end
str = “acctaggg”
p str.fuz(“acgt”,2) #> [“acct”, “aggg”]
Harry
This forum is not affiliated to the Ruby language, Ruby on Rails framework, nor any Ruby applications discussed here.