Longest Repeated Substring (#153)

On Jan 18, 2008, at 3:42 PM, Rados³aw Bu³at wrote:

My question is (I’m not familiar with RubyQuiz too much :)): episode
focus on algorithm (speed) or source code (readable)?

Hopefully both. :slight_smile:

How long input string I could expect?

Anybody found a good long text to work with yet? The text of the U.S.
Constitution perhaps? Tristram Shandy?

I would say that you should handle the biggest text you possibly
can. :slight_smile:

James Edward G. II

‘aaaaaa’.longest_repeating_substring #=> ‘aaaaa’

the quiz did not say that the two strings could not overlap.

is this correct james?

No, it isn’t. Look at example from James:

Example:

   $ echo banana | ruby longest_repeated_substring.rb
   an

   OR

   $ echo banana | ruby longest_repeated_substring.rb
   na

Following your reasoning ‘banana’ would give ‘ana’, rather than ‘an’
(or ‘na’).

Rados³aw Bu³at

http://radarek.jogger.pl - mój blog

On Jan 18, 2008, at 5:20 PM, James G. wrote:

“Repeated substrings may not overlap.”

sorry. missed that…

cheers.

a @ http://drawohara.com/

On Jan 18, 2008, at 4:06 PM, James G. wrote:

Anybody found a good long text to work with yet? The text of the
U.S. Constitution perhaps? Tristram Shandy?

I would say that you should handle the biggest text you possibly
can. :slight_smile:

James Edward G. II

http://www.fordham.edu/halsall/ancient/homer-illiad.txt

a @ http://codeforpeople.com/

On Jan 18, 8:12 pm, “ara.t.howard” [email protected] wrote:

My question is (I’m not familiar with RubyQuiz too much :)):
I would say that you should handle the biggest text you possibly
h.h. the 14th dalai lama
For The Illiad, I got:
’ who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives’ (168 characters)

On Sat, 2008-01-19 at 11:12 +0900, ara.t.howard wrote:

How long input string I could expect?

a @ http://codeforpeople.com/

we can deny everything, except that we have the possibility of being
better. simply reflect on that.
h.h. the 14th dalai lama

There’s also War and Peace:
http://www.gutenberg.org/dirs/etext01/wrnpc12.txt

$ wc wrnpc12.txt
67403 564514 3285165 wrnpc12.txt

On Jan 18, 8:27 pm, fw [email protected] wrote:

http://www.fordham.edu/halsall/ancient/homer-illiad.txt
67403 564514 3285165 wrnpc12.txt
For War & Peace:

The Project Gutenberg Literary Archive Foundation has been ’ (63
characters)

That one took awhile - 5m29s real time vs 1m2s real time for the
Illiad - though the increase is close to linear given the number of
characters in each text (3285165 vs. 806921).

2008/1/19 yermej [email protected]:

James Edward G. II

Illiad
So we need to take care of unicode too?

  • though the increase is close to linear given the number of
    characters in each text (3285165 vs. 806921).


http://ruby-smalltalk.blogspot.com/


Whereof one cannot speak, thereof one must be silent.
Ludwig Wittgenstein

On Jan 19, 2008, at 5:29 AM, Robert D. wrote:

So we need to take care of unicode too?

As always, it’s your choice. That’s a nice bonus though.

James Edward G. II

On Jan 18, 2008, at 9:59 PM, yermej wrote:

That one took awhile - 5m29s real time vs 1m2s real time for the
Illiad - though the increase is close to linear given the number of
characters in each text (3285165 vs. 806921).

I’ve got the same output for both examples. Current times are 2.767s
and 12.894s elapsed on a 3GHz processor.

Ruby’s clever copy-on-write string handling really shines here.

Dave

2008/1/18, James G. [email protected]:

James Edward G. II

My solution offers speed nor readability; I went for a one-liner (if you
discount the require):

2008/1/20, Raf C. [email protected]:

Hopefully both. :slight_smile:

James Edward G. II

My solution offers speed nor readability; I went for a one-liner (if you
discount the require):

(Oops, pushed the wrong button.)

$ cat rq153.rb
#!/usr/bin/ruby -w

require ‘enumerator’

p(
ARGV.
join( ’ ').
reverse.
to_enum( :each_byte).
inject( []){ |a, b| a << (a.last || “”) + b.chr}.
map{ |a| a.reverse}.
inject( []){ |a, b| a << b.match( /^(.+).*\1/).to_a.pop }.
flatten.
compact.
sort_by{ |a| a.size}.
last
)

$ ./rq153.rb banana
“an”
$ ./rq153.rb aaabaaa
“aaa”
$ ./rq153.rb aaaaaa
“aaa”
$ ./rq153.rb ambidextrous
nil

Don’t even think of running it on a reasonably-sized text, lest you want
your PC to grind to a halt.
Morale: ruby code can be ugly if you put your mind to it.

Regards,
Raf

On 19 Jan 2008, at 16:21, Dave T. wrote:

Ruby’s clever copy-on-write string handling really shines here.

Hmm, I was happy with my solution (similiar times to yermej on 1.8GHz)
until I saw Dave’s times. I’m really looking forward to seeing the code.

/dh

Here’s my solution to the quiz. It’s has O(n) behaviour and takes
about 1 minute on the illiad and 5 minutes on war and peace on my
1.8GHz linux box (much longer on my powerbook).

It works by constructing a map from every letter in the text to an
array of its locations. It then iterates, increasing each string
(which sometimes creates splits) and removing strings which don’t have
at least 2 locations. Thus, for each length n, the algorithm only has
to deal with the patterns which already matched with length n-1. This
is easiest to see by running it with the verbose option:

$ echo banana | ruby -v find_repeats.rb
ruby 1.8.6 (2007-09-23 patchlevel 110) [powerpc-darwin9.0.0]
Initial: {“a”=>[1, 3, 5], “b”=>[0], “n”=>[2, 4], “\n”=>[6]}
Filter (len=1): {“a”=>[1, 3, 5], “n”=>[2, 4]}
Grow (len=2): {“an”=>[1, 3], “na”=>[2, 4], “a\n”=>[5]}
Filter (len=2): {“an”=>[1, 3], “na”=>[2, 4]}
Grow (len=3): {“na\n”=>[4], “nan”=>[2], “ana”=>[1, 3]}
Filter (len=3): {}
an

Cheers,
Denis

find_repeats.rb:

#!/usr/bin/env ruby

text = ARGF.read
size = text.size

Build a map from each (1-character) string to a list of its positions

roots = {}
size.times do |o|
s = text[o,1]
if roots.has_key? s
roots[s] << o
else
roots[s] = [o]
end
end

puts “Initial: #{roots.inspect}” if $VERBOSE
len = 1
first = nil
while true do

Remove entries which don’t have at least 2 non-overlapping

occurances
roots.delete_if do |s,offsets|
count = 0
last = nil
offsets.each do |o|
next if last && last+len > o
last = o
count += 1
end
count < 2
end
puts “Filter (len=#{len}): #{roots.inspect}” if $VERBOSE
break if roots.size == 0
first = roots[roots.keys[0]][0]

Increase len by 1 and replace each existing root with the set of

longer roots
len += 1
new_roots = {}
roots.each do |s,offsets|
offsets.each do |o|
next if o > size - len
s = text[o,len]
if new_roots.has_key? s
new_roots[s] << o
else
new_roots[s] = [o]
end
end
end
roots = new_roots
puts “Grow (len=#{len}): #{roots.inspect}” if $VERBOSE
end

exit if first == nil

puts text[first,len-1]

Here is my solution. It is easy to follow but breaks down on larger
inputs:

Find first non-overlapping repeated substring contained in the input

string.

Search space is smaller for longer substrings, so search for longest

ones
first.

Returns - Longest repeated substring, or nil if none

def longest_repeated_substring(input)
len = input.size / 2 # Max size is half total length, since strings
cannot
overlap

while len > 0
# Find all substrings of given length
sub_strings = {}
for i in 0…input.size-len
sub_str = input[i…i+len]

  if not sub_strings.has_key?(sub_str)
    sub_strings[sub_str] = i+len # Add to list, track end pos for

overlaps
elsif sub_strings[sub_str] < i
return sub_str # First non-overlapping match ties for longest
end
end

len -= 1

end

nil
end

puts longest_repeated_substring(ARGV[0])

Thanks,

Justin

Here is my solution using Suffix arrays

class SuffixArray
attr_accessor :suffix_array
def initialize(the_string)
@the_string = the_string
@suffix_array = Array.new
#build the suffixes
last_index = the_string.length-1
(0…last_index).each do |i|
the_suffix = the_string[i…last_index]
the_position = i
# << is the append (or push) operator for arrays in Ruby
@suffix_array << { :suffix=>the_suffix, :position=>the_position }
end

#sort the suffix array
@suffix_array.sort! { |a,b| a[:suffix] <=> b[:suffix] }

end

end

text = STDIN.read

highest_count = 0
longest_string = “”
sa = SuffixArray.new(text)
sa.suffix_array.each_with_index do |s,i|
j = 1
if sa.suffix_array[i+1]
while sa.suffix_array[i][:suffix][0,j] ==
sa.suffix_array[i+1][:suffix][0,j]
if j > highest_count
highest_count = j
longest_string = sa.suffix_array[i][:suffix][0,j]
end
j += 1
end
end

end
p longest_string


I get the following times

ILLIAD :
$ time cat homer-illiad.txt | ruby suffix.rb
" who are\nperishing and coming to a bad end. We will, however, since
you
so\nbid us, refrain from actual fighting, but we will make
serviceable\nsuggestions to the Argives"

real 1m15.566s
user 1m14.810s
sys 0m0.410s

WAR AND PEACE
$ time cat wrnpc12.txt | ruby suffix.rb
"\r\n\r\nThe Project Gutenberg Literary Archive Foundation has been "

real 4m55.145s
user 4m49.979s
sys 0m1.951s

On Jan 20, 12:04 pm, Dave T. [email protected] wrote:

I wouldn’t be surprised if the idea of searching only 1/2 of the
second string to prevent overlaps is wrong… :slight_smile:

I think you’re right in that it’s wrong. :wink:

If you submit the string “ababab” to your program, it comes back with
“aba” as the longest non-overlapping substring, but the answer should
be “ab”. When you compare the consecutive sorted suffixes “abab” and
“ababab”, you allow strings up to 3 (“ababab”.size / 2) to be used,
but in fact, they overlap in all but 2 characters.

I’ll post my solution in a reply, which is very similar to your except
in the overlap prevention code, which, I have to admit, is pretty
ugly. And I’m not even convinced that I got it right!

Eric

My hack at the substring problem (nice one, James) is based on suffix
trees.

Suffix trees and suffix arrays and a great tool for substring
operations. The idea is to create a list of all the possible suffixes
in the original string. For “banana” this would be

banana
anana
nana
ana
na
a

Because the list contains an entry starting at every position in the
string, we know that all possible substrings of the original string
must occur at the start of one of these list elements. The substring
“nan”, for example, occurs at the start of the 3rd entry.

Now sort them

a
ana
anana
banana
na
nana

Now we know that all substrings that start with the same sequence of
characters must occur at the start of items in the list that are
adjacent. Searching through to list to find these is a linear operation.

A suffix array is basically this data structure. For efficiency,
though, it doesn’t store the actual strings. Instead it stores the
offset of the start of the string (for our example, this would be [6,
4, 2, 1, 5, 3]). You’ll find a whole bunch of stuff on using suffix
arrays for searching and string matching, particularly in the area of
DNA sequencing.

It turns out that in Ruby, you don’t always have to go to the trouble
of constructing the suffix array. When you take a substring in Ruby,
it doesn’t copy the string data. Instead, it just constructs a new
string object (just a few words of memory) that references into the
original string. Only when either string is modified does the string
content get copied. This means the code for constructing the suffix
list is both simple and relatively efficient:

def initialize(text)
@suffix_list = []
len = text.length
len.times { |i| @suffix_list << text[i, len] } # the ,len] is a
hack…
@suffix_list.sort!
end

The only ugly part is the use of [i, len] to create the substring.
Technically it should be 1…len, but that constructs a new,
unnecessary range object. Using ‘len’ as the final index does the same
thing, because the substring stops at the end of the input string, but
it can be misleading to read.

The code to scan the suffix list looks at each adjacent pair, and sees
if it can find a longer matching substring at the start of each that
it has previously found. Because James disallowed overlapping
substrings, you have to be careful to look at no more that 1/2 of the
longest string. The basic loop looks like this:

 s1 = @suffix_list[0]

 1.upto(@suffix_list.size - 1) do |i|

   s2 = @suffix_list[i]
   max_possible = s2.length / 2   # stop them overlapping

   while  # quick sanity check - saves doing the more expensive

substring if it fails
s1[max_length_so_far] == s2[max_length_so_far] &&
# stop strings from overlapping
max_length_so_far < max_possible &&
# brute force comparison
s1[0,max_plus_one] == s2[0,max_plus_one]

     max_length_so_far = max_plus_one
     max_plus_one += 1
     found_at = i
   end
   s1 = s2
 end

The while loop has three conditions. The last one is the money earner:
it looks for a match at the start of the two strings which is one
longer that the longest match found so far. If it finds it, it records
this as the new longest match, and then tries for a even longer one
with the current pair.

The second condition on the while loop stops us considering more than
1/2 the second string. Because our strings are sorted, if thereis
overlap, the second string will always be the one that contains the
overlapping segment of the first.

The first condition is just an optimization: it stops us doing the
more expensive substring comparison if the last characters of the two
substrings we’re comparing don’t match.

So, put it all together, and we have

- - - - - - - - - - - - -

class CommonSeq

def initialize(text)
@suffix_list = []
len = text.length
len.times { |i| @suffix_list << text[i, len] } # the ,len] is a
hack…
@suffix_list.sort!
end

def find_substrings
max_length_so_far = 0
max_plus_one = 1 # save a little math in the loop
found_at = nil

 # Look at all adjacent pairs of suffices.
 s1 = @suffix_list[0]

 1.upto(@suffix_list.size - 1) do |i|

   s2 = @suffix_list[i]
   max_possible = s2.length / 2   # stop them overlapping

   while  # quick sanity check - saves doing the more expensive

substring if it fails
s1[max_length_so_far] == s2[max_length_so_far] &&
# stop strings from overlapping
max_length_so_far < max_possible &&
# brute force comparison
s1[0,max_plus_one] == s2[0,max_plus_one]

     max_length_so_far = max_plus_one
     max_plus_one += 1
     found_at = i
   end
   s1 = s2
 end

 if found_at
   suffix = @suffix_list[found_at]
   [max_length_so_far, suffix[0, max_length_so_far]]
 else
   nil
 end

end
end

if FILE == $0
seq = CommonSeq.new(STDIN.read.chomp)
puts seq.find_substrings
end

- - - - - - - - - - - - -

Run times are shown for the Illiad and War and Peace:

dave[tmp/seq] time ruby seq.rb <homer.txt
168
who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives
ruby seq.rb < homer.txt 2.82s user 0.05s system 99% cpu 2.872 total

dave[tmp/seq] time ruby seq.rb <war.txt
63

The Project Gutenberg Literary Archive Foundation has been
ruby seq.rb < war.txt 12.49s user 0.17s system 99% cpu 12.737 total

Here’s the somewhat basic set of tests I used

- - - - - - - - - - - - -

require ‘seq’
require ‘test/unit’

class CommonSeq
attr_reader :suffix_list
end

class TestSeq < Test::Unit::TestCase

def test_basic_suffix_creation
cs = CommonSeq.new(“banana”)
assert_equal(%w{a ana anana banana na nana}, cs.suffix_list)
end

def test_empty
cs = CommonSeq.new("")
assert_nil cs.find_substrings
end

def test_length_one
cs = CommonSeq.new(“a”)
assert_nil cs.find_substrings
end

def test_length_two_no_match
cs = CommonSeq.new(“ab”)
assert_nil cs.find_substrings
end

def test_length_two_with_match
cs = CommonSeq.new(“aa”)
assert_equal [ 1, “a”], cs.find_substrings
end

def test_length_three_no_match
cs = CommonSeq.new(“abc”)
assert_nil cs.find_substrings
end

def test_length_three_adjacent_match
cs = CommonSeq.new(“aab”)
assert_equal [ 1, “a”], cs.find_substrings
end

def test_length_three_separated_match
cs = CommonSeq.new(“aba”)
assert_equal [ 1, “a”], cs.find_substrings
end

def test_does_not_find_overlapping_match_length_one
cs = CommonSeq.new(“aaa”)
assert_equal [ 1, “a”], cs.find_substrings
end

def test_does_not_find_overlapping_match_length_three
cs = CommonSeq.new(“aaaa”)
assert_equal [ 2, “aa”], cs.find_substrings
end

def test_does_not_find_overlapping_match_length_two
cs = CommonSeq.new(“ababa”)
assert_equal [ 2, “ab”], cs.find_substrings
end

def test_banana
cs = CommonSeq.new(“banana”)
assert_equal [ 2, “an”], cs.find_substrings
end
end

- - - - - - - - - - - - -

I wouldn’t be surprised if the idea of searching only 1/2 of the
second string to prevent overlaps is wrong… :slight_smile:

Dave

Oh, and the run times of my solution for The Illiad and War and
Peace
data sets are 9.179s and 43.875s respectively. This is running
on a 2.33GHz processor.

Eric

On Jan 20, 2:23 pm, “Eric I.” [email protected] wrote:

On Jan 20, 12:04 pm, Dave T. [email protected] wrote:

I wouldn’t be surprised if the idea of searching only 1/2
of the second string to prevent overlaps is wrong… :slight_smile:

I think you’re right in that it’s wrong. :wink:

…snip

I’ll post my solution in a reply, which is very similar to your
except in the overlap prevention code, which, I have to admit, is
pretty ugly. And I’m not even convinced that I got it right!

Dave’s code can be corrected by realizing that since all suffix
strings end at the same place (the end of the string), then of the two
adjacent strings being tested, one is a suffix of the other.

This means that to detect overlap, the following test can be used:

if prefix.length + s1.length > s2.length then
# Overlap
end

where “prefix” is the current prefix being checked in the two adjacent
suffix strings.

Here is a picture. Pretend in the “ababab” case, we are checking the
adjacent strings “abab” and “ababab”. Since one is a suffix of the
other, they can be lined up as they appeared in the original string
(in your mind):

abab
ababab

Now, the prefix being checked might be “aba”. It matches both strings,
but if you check “aba”.length + s1.length (7), it’s too long to fit in
s2.length (6). In other words, they line up like this:

ababab # s2
abab # s1
aba # prefix, lined up with s2
^
`---- # overlap exists because the prefix
# as lined up with s2 overlaps with s1
# when s1 is lined up with s2 as they
# appear in the original string. In other
# words, the “aba” in s2 goes past the
# beginning of the “aba” in s1.

Adding this test (instead of the s2.length / 2 test) and also testing
adjacent strings that start with the prefix currently being searched
(to find later matches if earlier ones overlap) would correct Dave’s
solution and shouldn’t be much more complicated.

-JJ