Matching against a zillion patterns

georgeuoa · October 15, 2009, 8:24am

i have some script in which i would like to match a string against
‘many’ regular expressions patterns.

def group(string)
if string=~ /pattern1 |patter2|pattern3|pattern(N)/ #where N =>100
group =1
else
…
end
end

My worry is the amount of patterns that i have (exceeding 400) and the
efficiency and sanity of such an approach.What would you advice?

Thank you.

georgeuoa · October 15, 2009, 10:15am

George G. wrote:

My worry is the amount of patterns that i have (exceeding 400) and the
efficiency and sanity of such an approach.What would you advice?

Thank you.

Are the matches randomly distributed across the patterns? If not,
perhaps you could arrange for the most common patterns in the first
search, the next most common in a subsequent search and so on.

Also recall that compiled regexps can be assigned to variables or
constants, including arrays.

Regards

Steve.

georgeuoa · October 15, 2009, 11:13am

2009/10/15 steve [email protected]:

end
most common in a subsequent search and so on.

Also recall that compiled regexps can be assigned to variables or constants,
including arrays.

Yep, this could be used to do something like

PATTERNS = {
1 => [/p1|p2|p3/, /p4|p5|p6/],
2 => [/p7|p8/],
}

def group(s)
PATTERNS.each do |gr, pats|
return gr if pats.any? {|rx| rx =~ s} # short circuit!
end
nil # or exception
end

Obviously you want the most frequent patterns in the beginning of
arrays and the least frequent at the end.

I’m not sure though about the sanity aspect of this. George, can you
disclose more details about the nature of your matching? Maybe
there’s a better and / or more efficient solution.

Kind regards

robert

georgeuoa · October 15, 2009, 11:35am

George G. wrote:

My worry is the amount of patterns that i have (exceeding 400) and the
efficiency and sanity of such an approach.What would you advice?

I would advise you measure it, with your real regexp and your real data.

I would expect that the regexp engine (written in C) would be much
faster than something you could cobble together yourself in Ruby. Even
400 times “match C1, skip match C1, match C2, skip, match C1, …” in C
will be very fast.

Ideally, see if you can find a true regular expression engine that
compiles to a DFA (deterministic finite state automaton). This will
never backtrack, looking at each incoming character no more than once.

georgeuoa · October 15, 2009, 11:20am

On Thu, Oct 15, 2009 at 8:24 AM, George G.
[email protected] wrote:

My worry is the amount of patterns that i have (exceeding 400) and the
efficiency and sanity of such an approach.What would you advice?

Your mega-pattern will be quite slow if many strings doesn’t match any
pattern, and even slower if many strings matches some patterns
partially, since the regexp engine would end up backtracking a lot.

Are you sure you can’t construct a more general pattern and test for
values of the match data after? I find it hard to imagine 400 useful
patterns without any similar structure. For example,

GROUPS = {
“somegroup” => 1,
“othergroup” => 2
}
if string =~ /^(\w+)\s+(\d+)$/
group = GROUPS[$1.downcase]
end

Another strategy is divide and conquer. See if you can group your
patterns into groups that are similar and construct a more general
regexp which use can use as a initial filter to determine which of the
actual patterns you need to test against. E.g.

if string =~ /superpattern1/
if string =~ /pattern1|pattern2|pattern3/
group = 1
end
end
if string =~ /superpattern2/
if string =~ /pattern4|pattern5|pattern6/
…

georgeuoa · October 15, 2009, 12:37pm

Hi,

Am Donnerstag, 15. Okt 2009, 15:24:35 +0900 schrieb George G.:

i have some script in which i would like to match a string against
‘many’ regular expressions patterns.

def group(string)
if string=~ /pattern1 |patter2|pattern3|pattern(N)/ #where N =>100
group =1
else
…
end
end

How about this:

p = [ /a/, /b/, /c/]
p.find { |q| “x” =~ q } #=> nil
p.find { |q| “a” =~ q } #=> /a/
p.find { |q| “c” =~ q } #=> /c/

Bertram

georgeuoa · October 15, 2009, 12:36pm

Take a look at the ragel language and parser generator.

You can generate your parser as a ruby c plugin.
Ragel compiles to an DFA, deterministic finite automata, it does not
look on a character twice.

I got great success on a similar problem…

http://www.complang.org/ragel/

georgeuoa · October 15, 2009, 12:48pm

On Oct 14, 2009, at 11:24 PM, George G. wrote:

My worry is the amount of patterns that i have (exceeding 400) and the
efficiency and sanity of such an approach.What would you advice?

Regarding efficiency and sanity, I’d have to echo the previous
responders that it depends a lot on what exactly you’re trying to do.
The one application that comes to mind where you’d have a large number
of patterns to match is a parser. If you can redefine your problem so
that it looks more like a parser, then that would open a whole world
of potential solutions.

Just for fun, taking the simple case where your patterns are actually
simple strings, one strategy would be to construct a tree from the
patterns. I’ll “borrow” the dataset from stocknames.info:

cat ./check_guest_list
#!/usr/bin/env ruby1.9

require “hpricot”
require “open-uri”

stock_names = open(“http://stocknames.info”) { |page| Hpricot(page) }
GUEST_LIST = stock_names.search(“//li”).map(&:innerHTML)

tree = {}
GUEST_LIST.each do |name|
cur_leaf = tree
name.each_char do |char|
cur_leaf = cur_leaf[char] ||= {}
end
cur_leaf[:greeting] = “puts ‘Hello, #{name}, welcome to the party!’”
end
print "Name please: "
guest = $stdin.gets.chomp

cur_leaf = tree
guest.each_char do |char|
cur_leaf = cur_leaf[char]
break unless cur_leaf
end

if cur_leaf
eval cur_leaf[:greeting]
else
puts “Sorry, you’re not on the list.”
end

./check_guest_list
Name please: Josh Ballanco
Sorry, you’re not on the list.

./check_guest_list
Name please: Ed Vanders
Hello, Ed Vanders, welcome to the party!

Hope that helps some!

Cheers,

Josh

georgeuoa · October 15, 2009, 4:17pm

2009/10/15 George G. [email protected]:

depending on which pattern the string contains.
Looks like this is tricky to become right. There seem to be some
people around that do bioinformatics, maybe some of them do have a
solution already.

Things you could do off the top of my head: since it seems your
patterns are only strings you could optimize the pattern by building a
trie from it and then creating a RX from that. I’ve done this before
(see below). Advantage is that your regular expression gets smaller
and matching becomes more efficient (because of eliminated
backtracking for NDA based RX engines).

There might be other options using one of the string matching
algorithms around (boyer moore and the like).

Kind regards

robert

def treeify(strings)
root = TextTreeNode.new
strings.each {|s| root.put s}
root.to_s
end

TextTreeNode = Struct.new :children, :char, :term do
def initialize(chr = nil)
self.char = chr
self.children = Hash.new {|h,k| h[k] = self.class.new(k)}
end

def put(s)
node = self
s.each_byte do |char|
node = node.children[char]
end
node.term = true
self
end

def to_s
dump(“”)
end

def dump(out)
out << Regexp.escape(char.chr) if char
ch = children.values

unless ch.empty?
  first = ch.shift
  bracket = term || !ch.empty?

  out << "(?:" if bracket
  first.dump(out)

  ch.each do |v|
    out << "|"
    v.dump(out)
  end

  out << ")" if bracket
  out << "?" if term
end

out

end

puts treeify %w{foo fort bar baz battery}

georgeuoa · October 16, 2009, 4:16am

Robert, let me disclose some more information about the nature of the
problem. Actually I have Protein sequences (strings of variable length
composed of a 20 letter alphabet) for example “CAARGNDLYSKNIG” can be
considered as a protein sequence. basically it’s just a string.

Have you looked at the BioRuby project? They may have classes that can
help.
http://bioruby.org/

georgeuoa · October 15, 2009, 1:12pm

I’m not sure though about the sanity aspect of this. George, can you
disclose more details about the nature of your matching? Maybe
there’s a better and / or more efficient solution.

Kind regards

robert

Thank you so much for all the replies and suggestions!

Robert, let me disclose some more information about the nature of the
problem. Actually I have Protein sequences (strings of variable length
composed of a 20 letter alphabet) for example “CAARGNDLYSKNIG” can be
considered as a protein sequence. basically it’s just a string.

In my case each of the strings that i have may fall into 2 distinct
classes or groups depending on whether they match a collection of about
400 distict patterns in the first instance or another 200 or so patterns
in the second instance. I do not have information on which is the most
common pattern.

To know which of the two groups that each of my sequences fall into, i
resulted in to writing the code that i had presented ealier but realised
that it may be inferior, inefficient or buggy and would like to improve
it.

I hope this helps to define the problem.

In a nutshell; Given a set of strings A, each of which may belong to a
group and where a group is characterised by a huge set of patterns(to be
matched against), create a method that classifies each of the strings
depending on which pattern the string contains.

Thank you.

georgeuoa · October 16, 2009, 6:59am

Robert, let me disclose some more information about the nature of the
problem. Actually I have Protein sequences (strings of variable length
composed of a 20 letter alphabet) for example “CAARGNDLYSKNIG” can be
considered as a protein sequence. basically it’s just a string.

I happen to be a molecular biologist myself, so this of course gets my
attention

I assume there is a particular reason why you are looking for sequence
features using regexp? Because it seems somewhat inefficient - which of
course got you here in the first place. There is no way that you can use
distance measures of sequences to cluster them, for example (like Blast

MCL)? If you are looking to group your seqs, that is. Also, databases
like ProDom do basically just that - looking for particular sequence
features in protein sequences. Sure, they focus on domains, but
depending of the nature of your regexps, their tools may be applicable
regardless. Then there is the new CS-BLAST, which uses scoring matrices

which may perhaps be derivable from your regexps, dunno. Well I
suppose I could offer more help if the idea behind this
fishing-experiment was a little clearer (i.e. what is it that you want
to find out).

In any case, I suppose depending on your search space, ruby may really
not be the ideal approach. Hope it works out in the end!

georgeuoa · October 16, 2009, 6:25am

Robert K. wrote:

2009/10/15 George G. [email protected]:

depending on which pattern the string contains.
Looks like this is tricky to become right. There seem to be some
people around that do bioinformatics, maybe some of them do have a
solution already.

Things you could do off the top of my head: since it seems your
patterns are only strings you could optimize the pattern by building a
trie from it and then creating a RX from that. I’ve done this before
(see below). Advantage is that your regular expression gets smaller
and matching becomes more efficient (because of eliminated
backtracking for NDA based RX engines).

Thank you so much Robert! I have tried your approach and compressed the
patterns to a single regular expression. (removed backtracking). Seems
like a parser solution approach may also help. e.g. the one that Josh
had illustrated

Thomas, I have looked at bioruby and I regulary use it and relatively
know it quite well, but we dont have a solution that i am aware of. I
might repost this on the bioruby list, together with a reference to this
thread.
Thank you so much.

georgeuoa · October 16, 2009, 8:23am

Marc H. wrote:

I assume there is a particular reason why you are looking for sequence
features using regexp? Because it seems somewhat inefficient - which of
course got you here in the first place. There is no way that you can use
distance measures of sequences to cluster them, for example (like Blast

MCL)?

First i cannot use any of the local alignment programs to do the
clustering since my sequences are known to undergo lots of
recombination, secondly i question the idea of BLAST BLOSUM matrices
that have been derived from unbiased genomes. Am working with a pathogen
that is famous for its high A/T content in its genome.

Then why not use adjusted matrices? Yes we can, but then The high
recombinations in the sequences renders that approach unusable. MCL is
good for grouping proteins but also depends on pairwise local alignments
and may be applicable for looking at unsupervised clustering or
grouping. Secondly there is the issue of a relevant BLAST e-cut off as
well as Inflation parameters to use with MCL.(Actually I have tried that
approach already with diff’ I values) and we are analysing the results.

Also, databases
like ProDom do basically just that - looking for particular sequence
features in protein sequences. Sure, they focus on domains, but
depending of the nature of your regexps, their tools may be applicable
regardless.

Their tools may be applicable but we already know the motifs that we are
searching for and they are not in ProDom or swissprot domains.

Then there is the new CS-BLAST, which uses scoring matrices

which may perhaps be derivable from your regexps, dunno.

Again BLAST and any alignment method for highly recombining sequences is
not of much use. Instead we are using alignment free approaches to infer
relationships in the sequences.

The exact patterns that we have, are known(from our experimental
evidence) to be associated with pathogenicity of a particular type and
that is why am looking for exact matches. Else i could just have aligned
the patterns and generated a HMM profile which i can again use to search
in a generic way for the groups. Actually that is the next step. But
first, I have opted to go for simple approaches first.

ruby may really not be the ideal approach
True Ruby may not be suited for some jobs and applications but again, so
is C++, java,php etc. The most important thing i think is the approach
and not the choice of programming language. Secondly i don’t like the
idea of black boxes that i don’t understand.

Please feel free to contact me at georgkam address hosted at google’s
email domain. We don’t want to hijack this excellent mailing list that
is meant for Ruby with molecular biology discussions. I have re-posted
this thread to bioruby and we can discuss it there.
For now Robert’s suggestion seems to have generated something that i
wanted, a non redundant motif that is free from backtracking.

Thank you.

georgeuoa · October 16, 2009, 1:57pm

2009/10/16 George G. [email protected]:

trie from it and then creating a RX from that. I’ve done this before
(see below). Advantage is that your regular expression gets smaller
and matching becomes more efficient (because of eliminated
backtracking for NDA based RX engines).

Thank you so much Robert! I have tried your approach and compressed the
patterns to a single regular expression. (removed backtracking).

And what did it do to performance? How many patterns do you need now
or were you able to squeeze everything into one?

Kind regards

robert

georgeuoa · October 16, 2009, 4:03pm

Jan Arts approach to the problem. forwarded from the bioruby list;
Hey George,

So if I understand correctly you’ve got a huge number of aminoacid
sequences (how many?) and about 400 regular expressions. And for each of
the aminoacid sequences: if they match just one of the regular
expressions they are put in box A and if they match none of the regexps,
they go into box B. Correct?

It just happens that something very similar was the subject of Jim
Tisdall’s (from Beginning Perl for Bioinformatics fame) talk at the
bioinformatics course we’re teaching at the moment

First thing: avoid loops. You don’t want to take loop over all regexps
for each AA sequences, or the other way around.

Are all regexps of the same length? Would be nice if they are, but not
critical. My approach would be to go over the data just once. So suppose
the regexps all are of the same length.

A. Prepare your data:
a. “Decode” the regexps into literal strings: e.g. /A[BC]D/ become
“ABD” and “ACD”.
b. Create a hash that contains all those things as keys.
c. Concatenate all AA sequences together, joined with a non-AA, let’s
say a semicolon “;”. E.g. CAARGNDLYSKNIG;GGARGNDLYSKNIG;KKARGNDLYSKNIG

B. Do the actual search
a. If the length of the strings to match (what used to be the regexps,
and are now the keys in the hash) is 5: take the first 5 characters of
your concatenated AA string and check if that substring exists as a key
in the hash. If so: you know that the AA sequence between the
surrounding “;” characters should go in box A.
b. Advance 1 position: take AAs 2 to 6.
c. Go back to a.

You might have to tweak this approach to exactly fit your requirements,
but if your code used to take a very long time, this might speed things
up immensely.

(George: can you forward this to the ruby mailing list it was discussed
on initially? Cheers)

Good luck,
jan.

georgeuoa · October 18, 2009, 2:08pm

Lars C. [email protected] (2009-10-15) schrieb:

Your mega-pattern will be quite slow if many strings doesn’t match any
pattern, and even slower if many strings matches some patterns
partially, since the regexp engine would end up backtracking a lot.

Why would it do try to backtrack here?

mfg, simon … l

georgeuoa · October 16, 2009, 9:37pm

On 10/16/2009 08:23 AM, George G. wrote:

Please feel free to contact me at georgkam address hosted at google’s
email domain. We don’t want to hijack this excellent mailing list that
is meant for Ruby with molecular biology discussions. I have re-posted
this thread to bioruby and we can discuss it there.

I don’t mind having the discussion on this list - the topic is really
interesting and provides some insights into classes of problems we are
rarely confronted with.

For now Robert’s suggestion seems to have generated something that i
wanted, a non redundant motif that is free from backtracking.

Sounds good! I’m glad the approach proves useful.

Cheers

robert