Longest Repeated Substring (#153)

A solution to RubyQuiz #153.

Finds the longest, non-overlapping repeated substring in its input.

See Ruby Quiz - Longest Repeated Substring (#153) for details.

The latest version of this solution can also be found at

http://learnruby.com/examples/ruby-quiz-153.shtml .

When run, the input can be on the command line, come from standard

input, or come from a file:

ruby lrs.rb banana

ruby lrs.rb “Madam I’m Adam.”

ruby lrs.rb -f homer-illiad.txt

cat homer-illiad.txt | ruby lrs.rb

The basic technique used by this solution is to create an array of

all suffixes of the data. So if the input were “banana”, the array

would contain [“banana”, “anana”, “nana”, “ana”, “na”, “a”]. Then

we sort this array, so it would now contain [“a”, “ana”, “anana”,

“banana”, “na”, “nana”]. Finally we can compare neighboring entries

in the array to see if they share a long enough prefix to beat the

current best.

Extra care must be taken if the substrings are not allowed to

overlap. Consider the input “ananana”; the longest non-overlapping

substring is “ana”. The array of sorted suffixes of is [“a”, “ana”,

“anana”, “ananana”, “na”, “nana”, “nanana”]. The 2nd and 3rd items

can only have a match of “an” because the “ana” would overlap, and

the same is true with the 3rd and 4th items. However by comparing

the 2nd and 4th items we can get the desired result of “ana”. So

under certain circumstances we have to compare an item with more

than just its immediate predecessor.

This program seems to run reasonably fast. It should run in O(n *

log n) time in most cases, assuming that Array’s sort method

provides that performance. Due to the rare cases when the program

cannot just compare an item and its immediate predecessor, there may

be some strange cases where it requires O(n ** 2). Because Ruby

allows a computed substring to share the data with the original

string (until one of the strings is altered, i.e., “copy on write”),

the memory used is linear to the input size.

returns the maximum of the two parameters

def max(a, b)
a >= b ? a : b
end

Return the longest common prefix between two strings. If max is

specified then the longest common prefix cannot exceed it

def longest_common_prefix(s1, s2, max = nil)
l1, l2 = s1.size, s2.size
min = l1 < l2 ? l1 : l2
min = min < max ? min : max if max
min.times do |i|
return s1.slice(0, i) if s1[i] != s2[i]
end
return s1.slice(0, min)
end

Returns the longest repeated substring in a given string.

def longest_repeated_substring(string)
size = string.length

put every possible suffix into an array

suffixes = Array.new(size)
size.times do |i|
suffixes[i] = string.slice(i, size)
end

sort the array of suffixes, so common substrings (i.e., prefixes

of suffixes) will be found in neighboring elements of the array

suffixes.sort!

best = “”
at_least_size = 1 # the size to meet or exceed to be the new best
distance = nil
neighbors_to_check = 1

compare pairs of consecutive suffixes and see how much initial

commonality there is

(size - 1).times do |i|

(1…size).each do |i|
# p [i, neighbors_to_check]
s1 = suffixes[i]

# generally we will only need to compare the ith item and the one
# preceding it; however if we were in a position to reject a long
# enough common substring due to overlap issues, then we may have
# to compare an ith item with additional preceding items;
# neighbors_to_check tracks how many neighbors we need to check
neighbors_to_check.downto(1) do |neighbor|
  s2 = suffixes[i - neighbor]

  # make sure that these to suffixes further apart than the size
  # of the current best; we don't explicitly track the index of
  # these suffixes, but since all suffixes go to the end of the
  # initial string, the size can be used as a proxy
  distance = (s1.size - s2.size).abs
  if distance < at_least_size
    if s1.size >= at_least_size &&
        s2.size >= at_least_size &&
        s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
      neighbors_to_check = max(neighbors_to_check, neighbor + 1)
    else
      neighbors_to_check = neighbor
    end
    next
  end

  # if neighboring suffixes don't at least match as far as the

best,
# no need to check more carefully
unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
neighbors_to_check = neighbor
next
end

  # get the longest common prefix that's no larger than distance,
  # since at that point the substrings overlap
  best = longest_common_prefix(s1, s2, distance)
  at_least_size = best.size + 1
  if best.size == distance
    neighbors_to_check = max(neighbors_to_check, neighbor + 1)
  else
    neighbors_to_check = neighbor
  end
end

end

best
end

if $0 == FILE
string = nil
if ARGV[0] == “-f”
open(ARGV[1]) do |f|
string = f.read
end
elsif ARGV.size == 0
string = STDIN.read
elsif ARGV[0] =~ /^-/ || ARGV.size > 1
STDERR.puts “usage:”
STDERR.puts " #{$0} (note: input comes from standard input)"
STDERR.puts " #{$0} string"
STDERR.puts " #{$0} -f filename"
exit
else
string = ARGV[0]
end

result = longest_repeated_substring(string)
puts result && “"#{result}" (#{result.length} characters)” ||
“none”
end

On Jan 18, 2008 9:23 PM, Ruby Q. [email protected] wrote:

Your program will be passed some text on STDIN and is expected to print the
longest repeated substring within that text to STDOUT.

I’m guessing that someone wants to break a Vigenere cipher. The
longest repeated substring in text enciphered with a Vigenere cipher
has a close relationship to the length of the key, and once the key
length is determined, the cipher is halfway broken. At least that’s
one application for this sort of algorithm…

On Jan 20, 2008 1:24 PM, Eric I. [email protected] wrote:

input, or come from a file:

“banana”, “na”, “nana”]. Finally we can compare neighboring entries

under certain circumstances we have to compare an item with more

l1, l2 = s1.size, s2.size
def longest_repeated_substring(string)
suffixes.sort!
# p [i, neighbors_to_check]
# make sure that these to suffixes further apart than the size
neighbors_to_check = neighbor
end
end
string = f.read
string = ARGV[0]
end

result = longest_repeated_substring(string)
puts result && “"#{result}" (#{result.length} characters)” ||
“none”
end

Nice code. From the “Lost World” by Sir Arthur Conan Doyle, I get…

“ve had the escape of your life, young fellah my lad”

Todd

How many memory do you guys use when running the code? I run of of
memory
when read illiad … I have 1G ddr.

Thanks,
Jan

2008/1/18 yermej [email protected]:

For The Illiad, I got:
’ who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives’ (168 characters)

Using Eric’s code, for “The Iliad” downloaded from the Gutenberg
project…

"On this the rest of the Acheans with one voice were for respecting
the priest and taking the ransom that he offered; but not so
Agamemnon, who spoke fiercely to him and sent him roughly away. " (193
characters)

Todd

Hi,

With some delay, here is my solution. I started on Friday with my
take
on making suffix trees respect overlaps but then stopped. I have to
admit that it was Eric’s solution that made take another look on
this.
The current version is about 3 times slower than Eric’s submission.
It
passes all my test cases though. It also takes the idea from (IIRC)
Dave
Thomas’s submission to calculate the original start position on the
basis of the string’s length.

BTW it seems to me that ruby19 (cygwin, win xp) uses a lot more
memory for this.

Regards,
Thomas.

#!/usr/bin/env ruby

class String
def longest_substring_st4
suffixes = Array.new(size) {|i| self[i…size]}
suffixes.sort!
common = ‘’
comlen = 0
suffixes.each_with_index do |curr, index|
next if index == 0
curridx = size - curr.size
pindex = index - 1
pindex.downto(pindex - comlen) do |i|
pred = suffixes[i]
psize = pred.size
predidx = size - psize
maxlen = [(predidx - curridx).abs, psize].min - 1
next if maxlen < comlen
prefix = pred[0 … comlen]
break if prefix != curr[0…comlen]
(comlen + 1 … maxlen).each do |i|
p = pred[i]
c = curr[i]
if p == c
prefix << p
else
break
end
end
common = prefix
comlen = common.size
break if comlen <= maxlen
end
end
return common
end
end

if FILE == $0
if ARGV.empty?
puts String.new(STDIN.read).longest_substring_st4
else
ARGV.each {|f| puts
String.new(File.read(f)).longest_substring_st4}
end
end

On 21 Jan 2008, at 15:12, Todd B. wrote:

Using Eric’s code, for “The Iliad” downloaded from the Gutenberg
project…

"On this the rest of the Acheans with one voice were for respecting
the priest and taking the ransom that he offered; but not so
Agamemnon, who spoke fiercely to him and sent him roughly away. " (193
characters)

Todd

In my copy of the Illiad (there’s a phrase I never thought I’d use!),
those two sections differ in whitespace / linebreaks:

On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but not
so Agamemnon, who spoke fiercely to him and sent him roughly away.
“Old man,” said he, "let me not find you tarrying about our ships, nor
yet coming hereafter.

and

"On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so
Agamemnon, who spoke fiercely to him and sent him roughly away. So
he went back in anger, and Apollo, who loved him dearly, heard his
prayer.

Did you get yours from
http://www.fordham.edu/halsall/ancient/homer-illiad.txt
?

/dh

I had the idea to use suffix trees too, but I literally built a tree.
Each node holds a start index and length for a suffix.
As I add more substrings, I count the number of matching characters
with any previous suffixes, and record the max as needed (after
handling overlap).

Unfortunately, this turns out to be slower that the ones that just
sort all the strings,
and runs out of memory on the Illiad…

-Adam


class String
#helper - counts number of matching characters.
def matchlen other
i=0
other.each_byte{|b|
break if b!=self[i]
i+=1
}
i
end
end

class Node
def initialize start,len,tail=nil
@i=start
@l=len
@h=tail ? tail : {}
end

def insert idx,len,matched=0
match = @h[$STR[idx]]
#add or expand child
if match
match.expand(idx,len,matched)
else
@h[$STR[idx]]=Node.new(idx,len)
end
end

def expand idx,len,matched
#count matching characters
matchlen = $STR[@i,@l].matchlen($STR[idx,len])

updateMax(idx-matched, matchlen+matched) if matchlen+matched > $max
if matchlen < @l
  #split if partial match
  split_at(matchlen, idx,len)
else
  #else add remainder of unmatched characters
  insert(idx+@l, len-@l, matchlen+matched)
end

end

def split_at point, idx,len
#one child contains the remainder of the original string(s)
newchild = Node.new(@i+point,@l-point, @h)
@h={}
@h[$STR[@i+point]]=newchild
#the other child has the remainder of the new str
@h[$STR[idx+point]]=Node.new(idx+point, len-point)
@l=point #shorten our length
end

def updateMax idx,matchlen
#if our string ends past the beginining of the match,
# discount the overlap
overlap = @i+@l-idx-1
matchlen-=overlap if overlap > 0
if matchlen>$max
$max = matchlen
$maxpt = idx
end
end
end

$STR=ARGF.read
$max=0
$maxpt=0
slen = $STR.length
half= (slen/2.0).ceil
@root=Node.new(0,0)

#we don’t have to look at any substrings longer than half the input
(0…half).each{|i|
@root.insert(i,half)
}
#and the ones that start in the second half of the input are shorter
still
(half+1…slen).each{|i|
len = slen-i
break if len<$max
@root.insert(i,len)
}

puts $STR[$maxpt,$max].inspect
puts “\n#{$max} chars”

On Jan 20, 7:04 pm, Dave T. [email protected] wrote:

ana
a
ana
anana
banana
na
nana

Now we know that all substrings that start with the same sequence of
characters must occur at the start of items in the list that are
adjacent. Searching through to list to find these is a linear operation.

Hi Dave,

Thanks for the brilliant idea–this time I really feel like I’ve
learned something new! :slight_smile:

I even took my time to implement it in C (80 lines) and I must admit
that ruby performs very well in this case: my C version is only about
6 times faster than your ruby…

However, with the algorithm presented, I don’t see how one can obtain
the correct result for the “aaaa” case. Lets see; the suffix list
would be:

aaaa
aaa
aa
a

now, sorted:

a
aa
aaa
aaaa

And the catch is–in order to obtain the correct result, that is “aa”,
one needs to observe “aa” and “aaaa”, but these are not adjacent. In
this case two pairs of adjacent suffixes (“aa” and “aaa”, “aaa” and
“aaaa”) might give “aa”, but they do overlap.

So do we need a special-case treatment here?

On Jan 21, 2008 9:49 AM, Denis H. [email protected] wrote:

Todd
and

"On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so
Agamemnon, who spoke fiercely to him and sent him roughly away. So
he went back in anger, and Apollo, who loved him dearly, heard his
prayer.

Did you get yours from http://www.fordham.edu/halsall/ancient/homer-illiad.txt

http://www.gutenberg.org/dirs/etext00/iliad10.txt

It goes to show that any text on the “web” is spurious :slight_smile:

Side note and totally off topic… why do people like to use two el’s
in “The Iliad”?

?

/dh

Todd

Hi Alex,

Have the same problem with you :slight_smile:

I think we should understand the ‘adjacent’ as ‘not overlapped adjacent’
here. Each time we take two adjacent suffix, we check whether they
overlapped first - if so, make s2 point to next suffix, until we found a
not
overlapped one.

For aaaa: on the step of comparing aa and aaa, we found they overlapped,
so
we keep s1 = aa, but make s2 = aaaa, the one next to aaa.

Thanks
Jan

Suffix tree is brilliant method, but has been invented long time ago,
besides, others submitted ‘suffix tree’ solution already.
It was challenge!

Please take a look at my solution (attached),
in my experiments it’s faster than Eric’s code.

BRs
Sergey V.


Times

[vsv@gx Q153]$ ruby -v
ruby 1.8.6 (2007-03-13 patchlevel 0) [i686-linux]

Eric’s code

[vsv@gx Q153]$ time ruby Q153_EricI.rb <homer-illiad.txt
" who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives" (168 characters)

real 0m14.263s
user 0m14.185s
sys 0m0.076s

My solution

[vsv@gx Q153]$ time ruby Q153_SerVo.rb <homer-illiad.txt
who are
perishing and coming to a bad end. We will, however, since you so
bid us, refrain from actual fighting, but we will make serviceable
suggestions to the Argives
[168]

real 0m7.676s
user 0m7.456s
sys 0m0.220s

War and Peace

[vsv@gx Q153]$ time ruby Q153_SerVo.rb <wrnpc12.txt

The Project Gutenberg Literary Archive Foundation has been
[63]

real 0m30.511s
user 0m29.546s
sys 0m0.424s

Bonus feature:

With linebreaks removed from paragraph body

[vsv@gx Q153]$ time ruby Q153_SerVo.rb -t <homer-illiad.txt
a messenger from Jove, who, though he be not near, yet takes thought
for
you and pities you. He bids you get the Achaeans instantly under arms,
for
you shall take Troy. There are no longer divided counsels among the
gods;
Juno has brought them over to her own mind, and woe betides the Trojans
at
the hands of Jove. Remember this
[330]

real 0m14.672s
user 0m14.129s
sys 0m0.544s

Hi there,

thought first of two loops, one that moves the start position
of the candidate substring, and another that makes it longer.
It turned out that one loop that combines both is sufficient.

Without considering searching (text.index) the loop is O(n).
However, the index method is O(n). Thus, worst case (for my
program) is O(n^2).

Can one do it in O(n*log(n))?

#!/usr/bin/ruby

ruby quiz 153 - longest repeated substring

print the first one found if several such substrings exist

Urs Meyer, 2008-01-22

read text from STDIN, convert to lower case, as you like

text = STDIN.read.tr("\nA-Z"," a-z")

start position, determines the remaining search space

start = 0
longest_substring = “”

search while remaining search space is at least twice the

the size of the currently longest substring

while (2 * longest_substring.size) < (text.length - start)

# generate substring to search for with size is one bigger
# than longest found so far
substring = text[start...(start+longest_substring.size+1)]

# search for it
i = text.index(substring, start+substring.size)

if i.nil?
    # nothing found, advance start position
    start += 1
else
    # found a longer one, record it
    longest_substring = substring
end

end

puts longest_substring

$ echo banana | ruby lrs.rb
an
$ echo aa | ruby lrs.rb
a
$ echo aaa | ruby lrs.rb
a
$ echo aaaa | ruby lrs.rb
aa
$

some timings…

using random characters:
$ ruby -e “puts Array.new(100000){(‘a’[0]+rand(26)).chr}.join” | ( time
ruby lrs.rb > /dev/null )
21.015u 0.046s 0:21.39 98.4% 0+0k 0+0io 1636pf+0w

with just a bunch of 'a’s the job is easier:
$ ruby -e “puts ‘a’*100000” | ( time ruby lrs.rb > /dev/null )
3.390u 0.015s 0:03.50 97.1% 0+0k 0+0io 3705pf+0w

lets see whether the result is correct:
$ ruby -e “puts ‘a’*100000” | ruby lrs.rb | wc -c
50001

oops, but
$ echo “a” | wc -c
2

o.k. we are fine

if you are curious, I ran it a few times on 100000 random chars:
lnxmfqu
nvpban
ibhwilj
eggfur
mbljqx
gzvxhw

~~
Regards
Urs

On Jan 21, 9:28 am, Jan [email protected] wrote:

[Note: parts of this message were removed to make it a legal post.]

How many memory do you guys use when running the code? I run of of memory
when read illiad … I have 1G ddr.

Hi Jan,

I have 2G. I’m surprised that 1G would be a problem, though. I can
run my code on the War and Peace text, which is 4 times larger than
The Illiad.

The problem with pasting code here is that there can be line break
issues depending on the reader you’re using. I’m wondering whether a
bad line break broke something. One way around this potential issue
might be to pull the code from:

http://learnruby.com/examples/ruby-quiz-153.shtml

Eric

On Jan 20, 5:18 pm, JJ [email protected] wrote:

if prefix.length + s1.length > s2.length then
# Overlap
end

where “prefix” is the current prefix being checked in the two adjacent
suffix strings.

Adding this test (instead of the s2.length / 2 test) and also testing
adjacent strings that start with the prefix currently being searched
(to find later matches if earlier ones overlap) would correct Dave’s
solution and shouldn’t be much more complicated.

That’s very close to what I do, but I go in the opposite direction. I
(try to) track the size window of how many previous sorted suffixes
must be checked against the last in the sequence.

So if I’m comparing substrings a and b in […, a, b, …], and if the
amount of non-overlap is smaller than the longest found substring so
far, then I enlarge the window. So now when I’m focusing on c in
[…, a, b, c, …] I’ll compare it to both a and b.

There needs to be a way to shrink the window as well, so it doesn’t
grow monotonically, and so if the a and c don’t have the same prefix
which exceeds the length of the longest found substring so far, the
window shrinks.

Part of my complexity comes from trying to detect cases where I can
avoid the full comparison early, and so every time I try to do the
avoid (by using “next” to start the next iteration), I must also do
some window management.

Maybe re-orienting the code in your way would simplify things. If you
have the time, I’d be very curious to see what you would come up with.

Eric

I noticed a huge difference in memory use on whether I run the code
with ruby 1.8 or 1.9.

With ruby 1.8, the suffix-array-based/-like solutions require about
120mb for the Iliad. With 1.9 … I get an “failed to allocate memory
(NoMemoryError)” error too (virtual memory being disabled, 1GB).

Thomas.

On 22 Jan 2008, at 00:55, Todd B. wrote:

Side note and totally off topic… why do people like to use two el’s
in “The Iliad”?

Long version: because the capital-I looks identical to the lower case
L so people think they’ve seen two 'L’s. Of course they also heard an
‘I’ so they insert that first.

Short version: Iggnorance.

/dh

On Jan 22, 2008 7:06 AM, Denis H. [email protected] wrote:

On 22 Jan 2008, at 00:55, Todd B. wrote:

Side note and totally off topic… why do people like to use two el’s
in “The Iliad”?

Long version: because the capital-I looks identical to the lower case
L so people think they’ve seen two 'L’s. Of course they also heard an
‘I’ so they insert that first.

Short version: Iggnorance.

Okay, this is an ignorant reply. I’m really sorry to the people that
actually want to program. Just delete this :slight_smile:

Your previous link has two “L’s”. And many of the other posts do as
well. And, I am pretty sure it has nothing to do with fonts.

Todd

Your previous link has two “L’s”. And many of the other posts do as
well. And, I am pretty sure it has nothing to do with fonts.

I think, it’s “Ilias” in ancient Greek anyway. It’s quite a common
mistake/transliteration.

On Jan 22, 2008 3:25 PM, Todd B. [email protected] wrote:

Short version: Iggnorance.

Okay, this is an ignorant reply. I’m really sorry to the people that
actually want to program. Just delete this :slight_smile:

I meant “this will be an ignorant reply”. It is hard to throw candor
correctly over the internet.

Has anyone figured out why my text result differs from others?
Spacing/newlines, obviously, is a factor, but I get the same result on
every machine. But then, if you think about it, then to use something
like this, you really have to trust the data.

Todd