Longest Repeated Substring (#153)

bbazzarrakk · January 24, 2008, 5:13pm

I first saw this problem years ago. It was one of the Perl’s Quiz of
the Week
problems. I remember being surprised at how the solution was sort of
counter-intuitive to me. I mean that my first thought was: start at
half the
string size, try to find a string of that length that occurs twice in
the full
text, reduce the length by one character and recheck until you find a
match.
Yoan Blanc coded that strategy up for us:

text = STDIN.read

(text.length/2).downto 1 do |l|
match = Regexp.new("(.{#{l}})\1").match(text)
if match
puts text[match.offset(1)[0]…(match.offset(1)[1]-1)]
break
end
end

Technically the regular expression engine will choke on this solution
for a
significantly large text. The reason is that quantifiers in {â€¦} limits
are
only allowed to be so large. The theory is as I described though.

However, even if it could match any size, this solution turns out to be
far too
slow to be practical. The reason is that given a megabyte or more of
text, the
longest repeated substring is typically less than 200 bytes. That means
you do
a whole lot of wasted checks before you are even in the range that a
match can
occur.

Yoan realized this and submitted a second solution that flipped the
search
around. Starting with a match of zero characters and finding the next
biggest
puts you in the right neighborhood from the get go.

The other key realized in Yoan’s second solution is that if we know
there is a
four character repeated substring at some index, we can also count on
the fact
that there are one, two, and three character repeated substrings at the
same
index. Given that, you can keep checking at the same index until you
don’t find
a match for a given size. The size found just before that was the
biggest at
that index.

The Perl guys found further means to optimize this approach, by trimming
the
search space. However, this too breaks down in the face of
significantly large
inputs. That’s why we have suffix trees.

The idea of a suffix tree is to build a list of every suffix that occurs
in the
text. Thus the quiz example “banana” breaks down to:

banana
anana
nana
ana
na
a

That may not look like a lot of help yet, but watch what happens if we
sort the
entries:

a
ana
anana
banana
na
nana

See how the common prefixes group together? We can now compare adjacent
entries
of our suffix tree for prefixes they have in common. In this case, the
second
and third element share an “an” and that’s one of the possible answers.
Note
that the second and third share “ana” as well, but selecting that one
causes
overlap:

[ana]na
[ana]

There’s a catch though. Consider this example input from Eric I.:
ananana.
The suffix tree is:

ananana
nanana
anana
nana
ana
na
a

That sorts into:

a
ana
anana
ananana
na
nana
nanana

In this case comparing just the second and third entries gives us “an”,
because
“ana” would overlap. But, if you compare the second and fourth entries,
“ana”
becomes a valid answer. That tells us that it’s sometimes necessary to
look
ahead more than just one step.

Enough explaining, let’s read some code. I’m going to show Eric’s
solution
below, because it was very well commented and that helped me understand
it.
However, I’m going to remove those comment in the following listings in
the
interests of space. I also made a couple of tiny edits that I felt
simplified
the code a touch.

First we have a simple helper method:

def longest_common_prefix(s1, s2, max = nil)
min = [s1.size, s2.size, max].compact.min
min.times do |i|
return s1.slice(0, i) if s1[i] != s2[i]
end
return s1.slice(0, min)
end

…

Given two Strings, this code will find the longest prefix they share.
It begins
by finding the smallest length among the passed Strings and an optional
provided
maximum. It then walks those indexes looking for mismatched characters.

When it finds a mismatch, it returns what came before. How it does that
is a
little clever though, since it seems to use the same numbers. Consider
that we
are comparing “abc” and “abd” on the third character. That means i = 2
and
s1[i] != s2[i]. That will cause the code to return s1.slice(0, i), but
remember
that i is used as a length there, not an index. That’s why we get the
first two
characters back (“ab”).

If no mismatches arise during the search, they match all the way to our
limit
and that is returned.

The next method is where all the action is and it’s a doozy, so we will
take it
in pieces. The first chunk builds and sorts the suffix tree:

…

def longest_repeated_substring(string)
size = string.length

suffixes = Array.new(size)
size.times do |i|
suffixes[i] = string.slice(i, size)
end

suffixes.sort!

…

This code might seem wildly inefficient (memory-wise) at first glance,
but Ruby
will surprise you here. When you slice() a substring like this, Ruby
cheats and
doesn’t actually make a copy of the text. Instead, the new object is a
pointer
into the old object. Now, if you change either String down the road,
Ruby must
and will finish the job, turning it into a full copy. We call this
behavior
“Copy on Write.” Since we won’t be changing these Strings though, just
examining them, we’re safe with the tiny pointers for the duration.

Note that the size variable is always used as the substring length here,
though
it’s too long in all but the first case. Ruby will stop at the end of
the
String even if you provide a longer length, so it does the same thing
and avoids
unneeded math or Range object construction.

OK, let’s begin the search through the tree:

…

best = “”
at_least_size = 1 # the size to meet or exceed to be the new best
distance = nil
neighbors_to_check = 1

(1…size).each do |i|
s1 = suffixes[i]

 neighbors_to_check.downto(1) do |neighbor|
   s2 = suffixes[i - neighbor]

   distance = (s1.size - s2.size).abs
   if distance < at_least_size
     if s1.size >= at_least_size &&
         s2.size >= at_least_size &&
         s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
       neighbors_to_check = [neighbors_to_check, neighbor + 1].max
     else
       neighbors_to_check = neighbor
     end
     next
   end

   # ...

First, some variables are set to hold the best match so far, the size a
match
would need to be to beat it, the distance between the substrings, and
how far
down the tree we need to search. After, that we enter a simple loop
trying each
suffix.

The neighbors_to_check iteration is our look ahead for similar suffixes
that
share our prefix. We usually only need to look one step ahead, but
these checks
watch for overlap cases and extends our search when needed.

Note that we figure a distance between substrings here, even though we
didn’t
remember where they started. That’s possible because all suffixes are
anchored
to the far edge of our original text. Knowing that, comparing lengths
is the
same as comparing starting indexes.

A distance below our needed match size warns us that there is overlap
and we may
need to look further ahead in this case. The if conditions just ensure
the
Strings are of the length we should even consider and that they match at
least
that much. If all of that turns out to be true, our search ahead is
extended.

Now we are ready to do the actual comparisons:

   # ...

   unless s1.slice(0, at_least_size) == s2.slice(0, at_least_size)
     neighbors_to_check = neighbor
     next
   end

   best = longest_common_prefix(s1, s2, distance)
   at_least_size = best.size + 1
   if best.size == distance
     neighbors_to_check = [neighbors_to_check, neighbor + 1].max
   else
     neighbors_to_check = neighbor
   end
 end

end

best
end

…

The unless check we see here is just a short circuit optimization.
There’s no
need to check for common prefixes if the Strings aren’t a match at least
as long
as our current best.

If we’ve made it this far, we know the current two Strings share a
prefix that
meets or exceeds our current best. A hand-off is made to
longest_common_prefix() to grab our new best and the rest of the
iterator resets
the search parameters.

The best we found in the entire search is then returned as the solution.

Here’s the interface code that wraps this method:

…

if $0 == FILE
string = nil
if ARGV[0] == “-f”
string = File.read(ARGV[1])
elsif ARGV.size == 0
string = STDIN.read
elsif ARGV[0] =~ /^-/ || ARGV.size > 1
STDERR.puts “usage:”
STDERR.puts " #{$0} (note: input comes from standard input)"
STDERR.puts " #{$0} string"
STDERR.puts " #{$0} -f filename"
exit
else
string = ARGV[0]
end

result = longest_repeated_substring(string)
puts %Q{"#{result}" (#{result.length} characters)}
end

Most of the above code just sorts out the command-line arguments. The
checks
determine whether to read the text from a file, STDIN, or the arguments
themselves.

The last two lines call the workhorse method we just finished examining
and
print details about the best match found.

My thanks to all who stretched my brain with all of the excellent
discussion
about this week’s problem.

Tomorrow we will try Ruby’s hand at coin counting…