Finding longest repeated base sequence in one string

Hello all,

I need to make a script to find the longest sequence of DNA bases that
appears within a sequence of the genome of a fly that I have copied onto
a text file. It’s made of up of one continuous string of letters (A, C,
G, T).
I have so far only found solutions to find the longest repeated
substring in 2 or more arrays.

If any of you could give an insight on how I could go about doing it for
one single string made up of only the aforementioned letters.

Thank you in advance for all the help!

Please note my knowledge of Ruby is very limited, write as if to a
dummy.

Please also note the txt file attached is on the larger side (approx
3Mb).

EDIT: the substring has to appear at least 3 times.

I don’t quite understand the question. You have a string with arbitrary
letters, and want to find the longest substring in it which consists
only of the letters A, C, G and T. Is this correct?

A minimum of 3 times? This makes it a slightly different problem. In
particular, the solution is not unique, because more than one substring
of a certain length can occur 3 times. I guess the 3 occurances of the
substring must not overlap?

I found this solution helpful (stealing from
regex - How do I get the match data for all occurrences of a Ruby regular expression in a string? - Stack Overflow):

Assume that your string to search is stored in variable t, you get an
array of all those “maximum strings” by

t.to_enum(:scan,/([acgt]+).*(\1).*(\1)/).map { Regexp.last_match[1] 

}

Note the backreferences (\1) in the regular expression.

For example, if t has the value

"acgtcfactagtcagcatatgcaacgtacgacgccccccccc"

this returns

["acg", "ccc"]

because we have two solutions.

HOWEVER, I’m not sure how well this approach scales, if you have huge
data. The DNA of your fly is 3MB, which means that Ruby has to search
about 1 Million times. I strongly suggest that you first debug your
program with smaller input data.

Ronald F. wrote in post #1179957:

A minimum of 3 times? This makes it a slightly different problem. In
particular, the solution is not unique, because more than one substring
of a certain length can occur 3 times. I guess the 3 occurances of the
substring must not overlap?

I found this solution helpful (stealing from
regex - How do I get the match data for all occurrences of a Ruby regular expression in a string? - Stack Overflow):

Assume that your string to search is stored in variable t, you get an
array of all those “maximum strings” by

t.to_enum(:scan,/([acgt]+).*(\1).*(\1)/).map { Regexp.last_match[1]

}

Note the backreferences (\1) in the regular expression.

For example, if t has the value

"acgtcfactagtcagcatatgcaacgtacgacgccccccccc"

this returns

["acg", "ccc"]

because we have two solutions.

HOWEVER, I’m not sure how well this approach scales, if you have huge
data. The DNA of your fly is 3MB, which means that Ruby has to search
about 1 Million times. I strongly suggest that you first debug your
program with smaller input data.

Thanks. I’ll try to make it work with a smaller sequence first. And yes,
Considering it’s a long genetic sequence, there most likely will be many
results.

Thanks again for the help!

Suggestion for performance improvement: If you can give a reasonable
upper bound for the longest subsequence, and this upper bound is much
less than the total length of the DNA (maybe 10000 or 20000 characters),
and also can give a reasonable lower bound, you can make it faster by
letting it look only for substrings between these two lengths.

Ronald F. wrote in post #1179949:

I don’t quite understand the question. You have a string with arbitrary
letters, and want to find the longest substring in it which consists
only of the letters A, C, G and T. Is this correct?

That is correct. Sorry for the cryptic post. The longest substring that
appears a minimum of 3 times, I edited the main post.
The string in the txt file only has the letters ACGT, which are the
bases for DNA.

Marcel Diallo wrote in post #1179942:

Hello all,

I need to make a script to find the longest sequence of DNA bases that
appears within a sequence of the genome of a fly that I have copied onto
a text file. It’s made of up of one continuous string of letters (A, C,
G, T).
I have so far only found solutions to find the longest repeated
substring in 2 or more arrays.

If any of you could give an insight on how I could go about doing it for
one single string made up of only the aforementioned letters.

Thank you in advance for all the help!

Please note my knowledge of Ruby is very limited, write as if to a
dummy.

Please also note the txt file attached is on the larger side (approx
3Mb).

EDIT: the substring has to appear at least 3 times.


A candidate of length 301 follows, is not the best solution, is just to
show you how to approximate a solution.

content of file mosca.rb follows (mosca = fly in spanish). mosca =
Drosof.


u = File.open(‘mosca.txt’,‘r’).read
h = Hash.new(0)

1.upto(u.length - 300) do |i|
if u[i] == ‘A’
w = u[i … i+300]
h[w] += 1
end
end

puts “Frequencies: #{h.values.uniq}”

maxfreq = h.values.max

k1 = h.keys.find {|k| h[k] = maxfreq}

puts "A good candidate of length #{k1.length} is "
puts k1
puts “Its frequency is #{h[k1]}”

Output:

ruby mosca.rb

Frequencies: [1, 2, 3, 4, 6, 89, 90, 88, 5]
A good candidate of length 301 is
AATTCGTCAGAAATGAGCTAAACAAATTTAAATCATTAAATGCGAGCGGCGAATCCGGAAACAGCAACTTCAAACCAGTCACTCTGGCTGAACTAAATGGCCTGATAAACTCACTGGAATTAAAGAAAGCCCCAGGAACTGACAATCTTAACAACAAGACCATAATAAACTTACCTACAAAGGCCAGAATATATTTAATACTTATTTATAACAACATCCTGAGAACTGGACATTTCCCGAACAAATGGAAGCACGCTAGCATCTCAATGATTCCCAAACCAGGGAAATCACCATTTGCTCT
Its frequency is 90

Ronald F. wrote in post #1179957:

A minimum of 3 times? This makes it a slightly different problem. In
particular, the solution is not unique, because more than one substring
of a certain length can occur 3 times. I guess the 3 occurances of the
substring must not overlap?

I found this solution helpful (stealing from
regex - How do I get the match data for all occurrences of a Ruby regular expression in a string? - Stack Overflow):

Assume that your string to search is stored in variable t, you get an
array of all those “maximum strings” by

t.to_enum(:scan,/([acgt]+).*(\1).*(\1)/).map { Regexp.last_match[1]

}

Note the backreferences (\1) in the regular expression.

For example, if t has the value

"acgtcfactagtcagcatatgcaacgtacgacgccccccccc"

this returns

["acg", "ccc"]

because we have two solutions.

HOWEVER, I’m not sure how well this approach scales, if you have huge
data. The DNA of your fly is 3MB, which means that Ruby has to search
about 1 Million times. I strongly suggest that you first debug your
program with smaller input data.

I manage to make this work as long as I have the variable t as a string
of text, as you have done. But when I assign t to the text file, it
seems to run without error, but I get no result, the terminal just skips
a line and starts again.

Thanks for the help. Question: is there a way to limit the frequency to
more than 3? Say only get results that are more than 3 and nothing
under.

I don’t get it. The solution I gave, finds those cases where you have 3
(or more) occurances. When you want to find those which have more than 3
occurances, just modify the regexp so that it catches 4 or more, not 3
or more.

Ronald

Daniel T. wrote in post #1179969:


A candidate of length 301 follows, is not the best solution, is just to
show you how to approximate a solution.

content of file mosca.rb follows (mosca = fly in spanish). mosca =
Drosof.


Thanks for the help. Question: is there a way to limit the frequency to
more than 3? Say only get results that are more than 3 and nothing
under.

Marcel Diallo wrote in post #1179971:

I manage to make this work as long as I have the variable t as a string
of text, as you have done. But when I assign t to the text file, it
seems to run without error, but I get no result

Of course not. What sense does it make to apply a regexp against a
file???

Your file is around 3 MB. Ruby has a maximum string length of about 2GB,
according to

, so the only limit could be a limit imposed by your hardware (main
memory) and operating system. Still, 3 MB doesn’t sound that big, so
why don’t you just try and read it into memory?

Ronald

Ronald F. wrote in post #1179974:

Thanks for the help. Question: is there a way to limit the frequency to
more than 3? Say only get results that are more than 3 and nothing
under.

I don’t get it. The solution I gave, finds those cases where you have 3
(or more) occurances. When you want to find those which have more than 3
occurances, just modify the regexp so that it catches 4 or more, not 3
or more.

Ronald

Sorry if I misquoted, the answer was for the other gentleman who
answered.

Marcel Diallo wrote in post #1179942:

Hello all,

I need to make a script to find the longest sequence of DNA bases that
appears within a sequence of the genome of a fly that I have copied onto
a text file. It’s made of up of one continuous string of letters (A, C,
G, T).
I have so far only found solutions to find the longest repeated
substring in 2 or more arrays.

If any of you could give an insight on how I could go about doing it for
one single string made up of only the aforementioned letters.

Thank you in advance for all the help!

Please note my knowledge of Ruby is very limited, write as if to a
dummy.

Please also note the txt file attached is on the larger side (approx
3Mb).

EDIT: the substring has to appear at least 3 times.

I am pleased to inform you that I could find a very long sequence, it
consists of 10283 bases and it appears exactly 3 times.

You can find that sequence at starting positions 17159, 28665 and 40336
in your file. I only used the 500_000 first bases of your file to make
shorter the time required for the computation of such a sequence. The
code for the computation will not be disclosed, sorry about that.
Anyway,
one can check easily that the described sequence of length 10283 appears
three times as required and it is located in the given starting
positions.

Daniel T. wrote in post #1179995:

Marcel Diallo wrote in post #1179942:

Hello all,

I need to make a script to find the longest sequence of DNA bases that
appears within a sequence of the genome of a fly that I have copied onto
a text file. It’s made of up of one continuous string of letters (A, C,
G, T).
I have so far only found solutions to find the longest repeated
substring in 2 or more arrays.

If any of you could give an insight on how I could go about doing it for
one single string made up of only the aforementioned letters.

Thank you in advance for all the help!

Please note my knowledge of Ruby is very limited, write as if to a
dummy.

Please also note the txt file attached is on the larger side (approx
3Mb).

EDIT: the substring has to appear at least 3 times.

I am pleased to inform you that I could find a very long sequence, it
consists of 10283 bases and it appears exactly 3 times.

You can find that sequence at starting positions 17159, 28665 and 40336
in your file. I only used the 500_000 first bases of your file to make
shorter the time required for the computation of such a sequence. The
code for the computation will not be disclosed, sorry about that.
Anyway,
one can check easily that the described sequence of length 10283 appears
three times as required and it is located in the given starting
positions.

The results and code follows, but it is coded in Lisp. Results: The
longest sequence has length 11505, it appears 3 times at the positions
given and the three subsequences are not overlapping. The naive way I
used sorting in Ruby make the computation takes several minutes, so and
I used another language, Lisp, for exploring the solutions.

  • (load “mosquito.lisp”)

longest-common-substr for k = 11505 is (11508 53395)
longest-common-substr for k = 11506 is 2639
Hence, the greatest k such that f(k) >= k is 11505
The substrings are located at positions 27441, 15935 and 39112

Code follows in Lisp, it takes about 30 seconds. Using sorting in ruby
with
1.upto(bases.length).to_a.sort {|x,y| bases[x … -1] <=> bases[y …
-1]}

takes a lot of time (several minutes, while lisp (sbcl) takes 30
seconds).

From the documentation of the function to compute the longest sequence:
"Compute the longest common substring,
The parameter k is used to guarantee that each pair of substring are
at a distance greater than k. The condition f(k)>=k guarantees that
there is not overlapping so that the solution obtained is feasible.

Finally, since f(k) is a decreasing function of k, the
longest feasible solution is the greatest k such that f(k) >= k. Such
k can be found with a loop such as: while f(k)>=k increase k, but is
better to look for k using binary search.

Note that the ordering used for the indices implies that
mismatch(i, i+p) = k0 then mismatch(i,j)>= k0 for each j with
i<=j<=k0.
"

(defparameter bases (with-open-file (f “mosca.txt”
:direction :input
:element-type 'base-char)
(read-line f)))

(defun ordena()
(let ((anum (coerce (loop for i below (length bases) collect i)
'vector)))
(sort anum (lambda(i j) (string< bases bases :start1 i :start2
j)))))

(defparameter anum (ordena))

(defun mis(i j)
(let ((i1 (aref anum i))
(j1 (aref anum j)))
(- (mismatch bases bases :start1 i1 :start2 j1) i1)))

(defun longest-common-substr(k)
"Compute the longest common substring,
The parameter k is used to guarantee that each pair of substring are
at a distance greater than k. The condition f(k)>=k guarantees that
there is not overlapping so that the solution obtained is feasible.

Finally, since f(k) is a decreasing function of k, the
longest feasible solution is the greatest k such that f(k) >= k. Such
k can be found with a loop such as: while f(k)>=k increase k, but is
better to look for k using binary search.

Note that the ordering used for the indices implies that
mismatch(i, i+p) = k0 then mismatch(i,j)>= k0 for each j with
i<=j<=k0.
"

(loop with maxi = 0 with pos = 0 with aux = 0
for i upfrom 0 below (- (length anum) 3)
for a = (aref anum i)
for b = (aref anum (1+ i))
for c = (aref anum (+ 2 i))
when (and (> (abs (- a b)) k)
(> (abs (- a c)) k)
(> (abs (- b c)) k))
do
(progn
(setq aux (- (mismatch bases bases :start1 a :start2 c) a))
(when (> aux maxi)
(setq maxi aux
pos i)))
finally (return (values maxi pos))))

(multiple-value-bind (maxlen location)
(longest-common-substr 11505)
(format t “longest-common-substr for k = 11505 is ~S~%” (list maxlen
location))
(format t “longest-common-substr for k = 11506 is ~S~%”
(longest-common-substr 11506))
(format t “Hence, the greatest k such that f(k) >= k is 11505~%”)
(format t “The substrings are located at positions ~D, ~D and ~D~%”
(aref anum location) (aref anum (1+ location)) (aref anum (+ 2
location))))

Daniel T. wrote in post #1179969:

A candidate of length 301 follows, is not the best solution, is just to
show you how to approximate a solution.

content of file mosca.rb follows (mosca = fly in spanish). mosca =
Drosof.


u = File.open(‘mosca.txt’,‘r’).read
h = Hash.new(0)

1.upto(u.length - 300) do |i|
if u[i] == ‘A’
w = u[i … i+300]
h[w] += 1
end
end

puts “Frequencies: #{h.values.uniq}”

maxfreq = h.values.max

k1 = h.keys.find {|k| h[k] = maxfreq}

puts "A good candidate of length #{k1.length} is "
puts k1
puts “Its frequency is #{h[k1]}”

Output:

ruby mosca.rb

Frequencies: [1, 2, 3, 4, 6, 89, 90, 88, 5]
A good candidate of length 301 is

AATTCGTCAGAAATGAGCTAAACAAATTTAAATCATTAAATGCGAGCGGCGAATCCGGAAACAGCAACTTCAAACCAGTCACTCTGGCTGAACTAAATGGCCTGATAAACTCACTGGAATTAAAGAAAGCCCCAGGAACTGACAATCTTAACAACAAGACCATAATAAACTTACCTACAAAGGCCAGAATATATTTAATACTTATTTATAACAACATCCTGAGAACTGGACATTTCCCGAACAAATGGAAGCACGCTAGCATCTCAATGATTCCCAAACCAGGGAAATCACCATTTGCTCT

Its frequency is 90

Size = 50

text = IO.read(“mosca.txt”) ;
h = Hash.new(0)

(text.size - Size + 1).times{|i|
if “A” == text[i,1]
h[ text[i, Size] ] += 1
end
}

puts “Counts: #{ h.values.uniq.sort }”
best = h.max_by(&:last)
puts “Best of length #{Size} is:”
puts best[0]
puts “It occurred #{ best[1] } times.”

Working with a file of length 500_000, the output is:

Counts: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 90, 91, 92, 93,
94, 95, 96, 97, 98, 99, 100, 101, 102, 105, 106]
Best of length 50 is:
ATTTTCTCAATAGCATTAAAAATAACTGTCCAAGAGCGAAATGCCATACC
It occurred 106 times.

Where can I find mosca.txt ?

I wanna participate too!

Edit: Oops nevermind, I missed the attachment before. Got it now.

Robert H. wrote in post #1180062:

Where can I find mosca.txt ?

I wanna participate too!

Edit: Oops nevermind, I missed the attachment before. Got it now.

A solution in ruby was posted here
https://groups.google.com/d/msg/comp.lang.lisp/q8AORIttkf4/7ZwtXMaoAgAJ

The key ingredient was sorting the indexes of the bases. The sorting
took 17 seconds in ruby and about 0.19 seconds in Lisp. The algorithm
described about is able to search for repeated patterns quickly (less
than a minute) and it could be used with multiple threads to improve the
performance since the problem of sorting and looking for candidates can
be decomposed if it is required. Don’t know if blast or another genome
related packages can solve quickly this kind of problems, but probably
it should be so.