Fastest way to search for substrings in strings

Hello,

I’m building a PoC DNA sequence alignment tool. I need to be able to
search substrings in large strings quickly. What’s the best to approach
this? I’'m thinking of using simply regular expressions. Is there any
external library that deals with this? Other than the fact that the
sample string contains the sub-string, I need to know the exact position
of the substring (where it starts where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

email: [email protected]
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

“As you set out for Ithaca, hope the voyage is a long one, full of
adventure, full of discovery […]” - C. P. Cavafy

On Thu, Oct 9, 2014 at 4:42 PM, Panagiotis A.
[email protected] wrote:

I’m building a PoC DNA sequence alignment tool. I need to be able to search
substrings in large strings quickly. What’s the best to approach this? I’'m
thinking of using simply regular expressions. Is there any external library that
deals with this?

There are some fast string search algorithms out there (Boyer-Moore
etc.) but I guess in this case regex is the fastest than implementing
any of those in Ruby. (May be different with a C extension though.)

IIRC there are some libs for bioinformatics. Maybe here:
https://rubygems.org/gems?letter=D&page=91
https://rubygems.org/search?utf8=✓&query=dna

Other than the fact that the sample string contains the sub-string, I need to
know the exact position of the substring (where it starts where it ends).

Nitpick: if you know the start you automatically know the end because
you have the length of the substring already. :slight_smile:

Kind regards

robert

Hi Panagiotis,

There are some options for your problem, depending on the details, some
may
be better suited for your application than others. I do not know what
the
acronym ‘PoC’ stand for, so if something special is implied that I’ve
overlooked, sorry.

DNA sequence alignment is a problem with many tools to solve it.
However,
using ruby there are few general options, like using the BioRuby
framework
for parsing output of alignment programs. Not sure this is the best
option
for you, but maybe worth looking into.

You mentioned that you want to “… search substrings in large
strings…”,
this sounds like you want to perform fitting alignment [1]. If this is
the
case, I’m not aware of many options, which is why I made a gem for this
purpose [2]. It should be reasonably fast, since it uses the C++ SeqAn
library [3] for the actual alignments. Please note, I made this gem for
my
own purposes, and I don’t believe anyone else has actually used it, so
use
at your own risk.

If in fact you are doing something more specific, such as if you are
trying
to map reads to a reference genome, I’d suggest using a proper read
mapper,
like bowtie [4]. You can then parse the output in ruby. There may be
some
gems to help with this, but it’s simple enough to write one for yourself
(that’s what I did). I just found another gem [5], but have never tried
it.
This may also work for your purpose.

Hope this helps.

Cheers.

References

[1] ROSALIND | Glossary | Fitting alignment
[2] bioseqalign | RubyGems.org | your community gem host
[3] http://www.seqan.de/
[4] Bowtie: An ultrafast, memory-efficient short read aligner
[5] bio-bwa | RubyGems.org | your community gem host

On Thu, Oct 9, 2014 at 7:42 AM, Panagiotis A.
[email protected]

On Sun, Oct 19, 2014 at 7:03 AM, Stefano B.
[email protected]
wrote:

I do not know what the acronym ‘PoC’ stand for,

“Proof of Concept”

Peter

On Oct 9, 2014, at 5:42 PM, Panagiotis A. [email protected]
wrote:

I’m building a PoC DNA sequence alignment tool. I need to be able to search
substrings in large strings quickly. What’s the best to approach this? I’'m
thinking of using simply regular expressions. Is there any external library that
deals with this? Other than the fact that the sample string contains the
sub-string, I need to know the exact position of the substring (where it starts
where it ends).

It really depends on the subject strings and the search substrings.

However, in general, with the very long subject strings, and substring
that dont have easily detectable delimiters, you can easily find
yourself in exponential hell with regular expressions. You would have to
exercise extra care to minimize the need for backtracking in your
patterns. I find that this often yields long and hard-to-read regular
expressions.

I might be worthwhile to look at a scanner/FSM generator. I personally
like ragel[1], which outputs ruby code and is extremely fast. It has
several options for fine tuning the output for code-size/performance
tradeoffs.

It appears that PoC could also mean Point-of-Care in the context of DNA
data apps.

Regards,
Ammar

[1] Ragel State Machine Compiler

On Oct 19, 2014, at 9:06 PM, Robert K. [email protected]
wrote:

Before resorting something more complicated I would prove that the
simple thing is actually bad. Otherwise you’ll end up complicating
things unnecessary. :slight_smile:

I completely agree. :slight_smile:

Not having seen what the data in question looks like, I just threw
another idea out there. If a simpler approach with less external
dependencies and moving parts does the job, then hallelujah!

Cheers,
Ammar

On Sun, Oct 19, 2014 at 12:36 PM, Ammar A. [email protected]
wrote:

It really depends on the subject strings and the search substrings.

However, in general, with the very long subject strings, and substring that
don’t have easily detectable delimiters, you can easily find yourself in
exponential hell with regular expressions. You would have to exercise extra care
to minimize the need for backtracking in your patterns. I find that this often
yields long and hard-to-read regular expressions.

I might be worthwhile to look at a scanner/FSM generator. I personally like
ragel[1], which outputs ruby code and is extremely fast. It has several options
for fine tuning the output for code-size/performance tradeoffs.

Before resorting something more complicated I would prove that the
simple thing is actually bad. Otherwise you’ll end up complicating
things unnecessary. :slight_smile:

Kind regards

robert

Ciao Stefano!

I get from the name that you’re Italian. My mother is from Sicily, my
second “mother tongue” is Italian :slight_smile:

On 19 Oct 2014, at 08:03, Stefano B. [email protected] wrote:

Hi Panagiotis,

There are some options for your problem, depending on the details, some may be
better suited for your application than others. I do not know what the acronym
‘PoC’ stand for, so if something special is implied that I’ve overlooked, sorry.

As Peter said, stands for ‘Proof of Concept’.

DNA sequence alignment is a problem with many tools to solve it. However, using
ruby there are few general options, like using the BioRuby framework for parsing
output of alignment programs. Not sure this is the best option for you, but maybe
worth looking into.

I’m creating a proof of concept Sinatra application to perform basic
sequence alignment for my thesis (I’m about to get a degree in
Pharmacy). I’m using the bioruby library for translation (DNA to
protein). The tool will only deal with bacterias to avoid complications
such as slicing and huge sequences (e.g. human genome).

You mentioned that you want to “… search substrings in large strings…”, this
sounds like you want to perform fitting alignment [1]. If this is the case, I’m
not aware of many options, which is why I made a gem for this purpose [2]. It
should be reasonably fast, since it uses the C++ SeqAn library [3] for the actual
alignments. Please note, I made this gem for my own purposes, and I don’t believe
anyone else has actually used it, so use at your own risk.

Thanks for the pointers! I didn’t knew about the term ‘fitting
alignment’ although I’ve read several papers about history of sequencing
(from the 70s till today). That’s exactly what I need :slight_smile: Thanks for the
the gem, didn’t knew about it! Will this gem work with protein sequences
too? Say I’m translating a DNA sequence to proteins (find ORF’s on every
strand with length > X, then compare findings with my sample?).

If in fact you are doing something more specific, such as if you are trying to
map reads to a reference genome, I’d suggest using a proper read mapper, like
bowtie [4]. You can then parse the output in ruby. There may be some gems to help
with this, but it’s simple enough to write one for yourself (that’s what I did). I
just found another gem [5], but have never tried it. This may also work for your
purpose.

Hope this helps.

I don’t think I will need something as specific as bootie. Thanks for
your help and gem!

email: [email protected] mailto:[email protected]
URL: http://www.convalesco.org http://www.convalesco.org/
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu http://pgp.mit.edu/ --recv-keys 1A7BFEC5

“As you set out for Ithaca, hope the voyage is a long one, full of adventure,
full of discovery […]” - C. P. Cavafy

Panagiotis (atmosx) Atmatzidis

email: [email protected]
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

“As you set out for Ithaca, hope the voyage is a long one, full of
adventure, full of discovery […]” - C. P. Cavafy

Hello,

I might be worthwhile to look at a scanner/FSM generator. I personally like
ragel[1], which outputs ruby code and is extremely fast. It has several options
for fine tuning the output for code-size/performance tradeoffs.

It appears that PoC could also mean Point-of-Care in the context of DNA data
apps.

Regards,
Ammar

[1] Ragel State Machine Compiler

Thanks for the hint, Ill take a look at ragel!

Panagiotis (atmosx) Atmatzidis

email: [email protected]
URL: http://www.convalesco.org
GnuPG ID: 0x1A7BFEC5
gpg --keyserver pgp.mit.edu --recv-keys 1A7BFEC5

“As you set out for Ithaca, hope the voyage is a long one, full of
adventure, full of discovery […]” - C. P. Cavafy

On Sun, Oct 19, 2014 at 12:58 PM, Panagiotis A.
<[email protected]

wrote:

Ciao Stefano!

I get from the name that you’re Italian. My mother is from Sicily, my
second “mother tongue” is Italian :slight_smile:

:slight_smile:

As Peter said, stands for ‘Proof of Concept’.
I’m using the bioruby library for translation (DNA to protein). The tool
will only deal with bacterias to avoid complications such as slicing and
huge sequences (e.g. human genome).

I see, nice.

Thanks for the pointers! I didn’t knew about the term ‘fitting alignment’
although I’ve read several papers about history of sequencing (from the 70s
till today). That’s exactly what I need :slight_smile: Thanks for the the gem, didn’t
knew about it! Will this gem work with protein sequences too? Say I’m
translating a DNA sequence to proteins (find ORF’s on every strand with
length > X, then compare findings with my sample?).

Yes, the gem will work with protein sequences, you can specify
match/mismatch/indel penalties. Really though, for protein sequences you
should use a BLOSUM matrix or the like
(BLOSUM - Wikipedia),
something that isn’t currently supported. You should be able to modify
it
to work fairly easily, using seqan.

I don’t think I will need something as specific as bootie. Thanks for your
help and gem!

Ok.

Cheers.

On 09/10/2014 15:42, Panagiotis A. wrote:

Hello,

I’m building a PoC DNA sequence alignment tool. I need to be able to search
substrings in large strings quickly. What’s the best to approach this? I’'m
thinking of using simply regular expressions. Is there any external library that
deals with this? Other than the fact that the sample string contains the
sub-string, I need to know the exact position of the substring (where it starts
where it ends).

Thanks,

Panagiotis (atmosx) Atmatzidis

I’m fairly new to Ruby but from a naive understanding what’s wrong with
String#index ?

gvim