Determining the common prefix for several strings


#1

I have several strings to analyze. The strings are the names of a list
of things that were once ordered and now are not. A typical array of
names might look like this:

item001
item004
item002
item002b
item002C
item 5
item 10
itemize this

I want to parse these names and determine that “item” is the common
prefix.

I know how to armstrong my way through it a character at a time, but I’m
thinking that Ruby probably has a very elegant way of doing this that
I’m not familiar with yet.

Thanks, Todd


#2

On Aug 2, 2007, at 19:31 , Todd B. wrote:

item 5
item 10
itemize this

I want to parse these names and determine that “item” is the common
prefix.

I know how to armstrong my way through it a character at a time,
but I’m
thinking that Ruby probably has a very elegant way of doing this that
I’m not familiar with yet.

I don’t know of a particularly elegant way to do this in Ruby, but my
initial thoughts are

list = <<-EOS
item001
item004
item002
item002b
item002C
item 5
item 10
itemize this
EOS

sort by length: prefix can’t be longer than shortest string.

items = list.split("\n").sort_by { |i| i.length }

use shortest as prefix

first = items.shift
items.inject(first) do |prefix, item|
next prefix if item.match(/^#{prefix}/)

have to go char by char

n = 0
for i in 0…(prefix.length - 1) do
break if prefix[i] != item[i]
n = i
end
prefix[0…n]
end

Michael G.
grzm seespotcode net


#3

El Aug 3, 2007, a las 2:31 AM, Todd B.
escribió:

item 5
item 10
itemize this

I want to parse these names and determine that “item” is the common
prefix.

I know how to armstrong my way through it a character at a time,
but I’m
thinking that Ruby probably has a very elegant way of doing this that
I’m not familiar with yet.

This is a solution that avoids going char by char:

items = [
‘item001’,
‘item004’,
‘item002’,
‘item002b’,
‘item002C’,
‘item 5’,
‘item 10’,
‘itemize this’
]

min_length = items.map {|i| i.length}.min
cuts = items.map {|i| i.split(//)[0, min_length]}
cuts.transpose.each_with_index do |t, i|
if t.uniq.length != 1
puts cuts.first[0, i].join(’’)
break
end
end

We first check which is the length of the shortest string in the
array, then split by chars to store those many ones, and then
transpose the matrix. Note that transposing is OK because we cut the
strings. That’s longer to explain than to write it in Ruby as you
see. Once you get the transpose you can go row by row to detect the
one that has more than one distinct element. Again that’s easy
delegating the job to uniq.

– fxn


#4

El Aug 3, 2007, a las 4:41 AM, Xavier N.
escribió:

end
Ah, and the prefix can be extracted from the original string instead
of the split:

splits = items.map {|i| i.split(//)}
(0…splits.length).each do |i|
if splits.map {|s| s[i]}.uniq.length != 1
puts items.first[0, i]
break
end
end

– fxn


#5

El Aug 3, 2007, a las 4:46 AM, Xavier N.
escribió:

splits = items.map {|i| i.split(//)}
(0…splits.length).each do |i|
if splits.map {|s| s[i]}.uniq.length != 1
puts items.first[0, i]
break
end
end

There’s a bug in the upper limit of the range, it should be the
length of some element of the splits

splits = items.map {|i| i.split(//)}
(0…splits.first.length).each do |i|
if splits.map {|s| s[i]}.uniq.length != 1
puts items.first[0, i]
break
end
end

We could compute the length of the shortest string and use that finer
upper limit, but that’s not necessary because even in the case that
the shortest string is the common prefix, we would get some nil in
the map through columns in the next iteration, and thus at least two
distinct values. On the other hand there’s the corner case when all
strings are equal, that’s not handled in the above code because they
way to do it depends on the details.

Please forgive those many revisions, I guess I need some sleep at
5:22 AM.

– fxn


#6

El Aug 3, 2007, a las 4:26 AM, Xavier N.
escribió:

min_length = items.map {|i| i.length}.min
cuts = items.map {|i| i.split(//)[0, min_length]}
cuts.transpose.each_with_index do |t, i|
if t.uniq.length != 1
puts cuts.first[0, i].join(’’)
break
end
end

Hmmm, too convoluted. I realized you don’t need the transpose and
thus the min length and cuts if you go by columns in the loop. This
is the same idea only more simple:

splits = items.map {|i| i.split(//)}
(0…splits.length).each do |i|
if splits.map {|s| s[i]}.uniq.length != 1
puts splits.first[0, i].join(’’)
break
end
end

– fxn


#7

On 8/3/07, Todd B. removed_email_address@domain.invalid wrote:

I want to parse these names and determine that “item” is the common
prefix.

If I understood you correctly, I think this will do it.

r = [“itemz14”,“item004”,“item002”,“item002b”,“item 5”,“itemize this”]

s = []
a ,b = r.max.split(//), r.min.split(//)
a.each_with_index{|x,y|s << y if x != b[y]}
(0…s[0]).each {|x| print a[x]}

Harry


#8

Holy Cow! You guys are geniuses! Y’all are using these methods like
I’ve never seen or even thought of using them before.

Problem is solved - script is working - THANKS!!!

Todd


#9
  puts items.first[0, i]
  puts items.first[0, i]

depends on the details.

Please forgive those many revisions, I guess I need some sleep at 5:22 AM.

– fxn

Here is yet another way to solve this problem.

min_len = items.min{ |i,j| i.length <=> j.length }.length
common_prefix = “”
(0…min_len).each do |i|
col = items.inject(""){ |sum,elem| sum << elem[i] }.squeeze
if col.length > 1
puts common_prefix
break
end
common_prefix << col
end

Ron


#10

item 5
item 10
itemize this

I want to parse these names and determine that “item” is the common
prefix.

Haha! Regexp magic!

list = <<-EOS
item001
item004
item002
item002b
item002C
item 5
item 10
itemize this
EOS

/\A(.).(\n\1.)\Z/.match(list)[1]
=> “item”

:smiley:
Daniel


#11

On Behalf Of Todd B.:

Holy Cow! You guys are geniuses! Y’all are using these

methods like. I’ve never seen or even thought of using them before.

don’t count me pls. i’m not holy nor genius, (possibly a cow :slight_smile: but just
a nuby learning ruby fr the marvelous ruby community (thanks to matz,
nobu, guydecoux, mauricio, dblack, …ml-list.size ).

ff is another example (of my nubiness). it shows niftiness of bang
methods, chains, and the underutilized all?,

CC:\family\ruby>cat test.rb
list = %w(itemone itemtwo item222 item0033 item11 itexthis
itemthisandthat itemt
hose)

min = list.sort!{|a,b| a.size <=> b.size}.first
break if list.all?{|a| a =~ /^#{min}/} while min.chop!

puts min
C:\family\ruby>ruby test.rb
ite

C:\family\ruby>

i luv my family. i luv ruby :slight_smile:

kind regards -botp


#12

2007/8/3, Daniel DeLorme removed_email_address@domain.invalid:

item002C
item001
=> “item”
Amazing! I wonder how this performs for large input sets (because of
the two “." and the "()”). Could generate a lot of backtracking
(apart from the string length itself).

This is probably better in terms of individual comparisons. The idea
is to reduce the prefix length whenever a mismatch is found
(attached). That solution also avoids sorting and multiple traversal
of the input in general. So it’s suited for stream based processing
(i.e. for reading large files in one go).

Kind regards

robert


#13

El Aug 3, 2007, a las 6:23 AM, Harry K.
escribió:

If I understood you correctly, I think this will do it.

r = [“itemz14”,“item004”,“item002”,“item002b”,“item 5”,“itemize this”]

s = []
a ,b = r.max.split(//), r.min.split(//)
a.each_with_index{|x,y|s << y if x != b[y]}
(0…s[0]).each {|x| print a[x]}

Good, so lexicographic order indeed helps. I would prefer to order
the array only once though, and construct the prefix right ahead as a
string so that the code reads more direct:

prefix = ‘’
min, max = items.sort.values_at(0, -1)
min.split(//).each_with_index do |c, i|
break if c != max[i, 1]
prefix << c
end
puts prefix

My approach with splits removed and handling of the corner case where
all items are equal is:

prefix = items.first
(0…items.first.length).each do |i|
if items.map {|item| item[i, 1]}.uniq.length != 1
prefix = prefix[0, i]
break
end
end
puts prefix

That solution does not order the array, but explicitly loops over all
the columns in the prefix plus one via map, and uniqs them. The
approach based on sort seems a bit better to me.

– fxn


#14

Robert K. wrote:

2007/8/3, Daniel DeLorme removed_email_address@domain.invalid:

/\A(.).(\n\1.)\Z/.match(list)[1]
=> “item”

Amazing! I wonder how this performs for large input sets (because of
the two “." and the "()”). Could generate a lot of backtracking
(apart from the string length itself).

I think it’s ok because the .* pattern is always anchored to line
endings. In this context, .\n is the equivalent of [^\n]\n which
leaves no room for backtracking.

Daniel


#15

Robert K. wrote:

I believe you are wrong. Look at “\A(.).” - there are multiple ways
this can match (including matching \n inside any of the .* parts).
Anchoring does not help much here IMHO.

Yes, there is backtracking for the first part (there has to be, the
whole point being to find the longest matching string), but \n CANNOT
match inside any of the .* parts unless you specify the m modifier.

Daniel


#16

2007/8/3, Daniel DeLorme removed_email_address@domain.invalid:

endings. In this context, .\n is the equivalent of [^\n]\n which
leaves no room for backtracking.

I believe you are wrong. Look at “\A(.).” - there are multiple ways
this can match (including matching \n inside any of the .* parts).
Anchoring does not help much here IMHO.

Kind regards

robert


#17

El Aug 3, 2007, a las 1:35 PM, Xavier N.
escribió:

prefix = ‘’
min, max = items.sort.values_at(0, -1)
min.split(//).each_with_index do |c, i|
break if c != max[i, 1]
prefix << c
end
puts prefix

Yet another iteration: the prefix can be extracted with a regexp,
it’s shorter although perhaps more obscure:

min, max = items.sort.values_at(0, -1)
puts min+max =~ /(.).{#{min.length-1}}(?!\1)/m ? $` : min

I don’t like the ternary and the fact that we look for the first
mismatch instead of extracting directly the prefix in one shot.

– fxn


#18

2007/8/3, Daniel DeLorme removed_email_address@domain.invalid:

Robert K. wrote:

I believe you are wrong. Look at “\A(.).” - there are multiple ways
this can match (including matching \n inside any of the .* parts).
Anchoring does not help much here IMHO.

Yes, there is backtracking for the first part (there has to be, the
whole point being to find the longest matching string), but \n CANNOT
match inside any of the .* parts unless you specify the m modifier.

Correct. I always confuse Ruby’s /m with other language’s variants.
Thanks for the reeducation!

Still it would be interesting to do benchmarks with larger numbers of
words to see what happens… I have attached a test. It seems the RX
approach is faster very long.

Kind regards

robert


#19

I think that this thread explained to me just why so many people love
Ruby so much. There are wildly different approaches to solving this
that all work. People are very different in how they think and how they
solve problems. Ruby can be seen here as SO flexible that, no matter
how you think things through, Ruby still covers all the bases making it
a comfortable shoe for everyone. This may be a unique example of one
size actually fitting all. That is just amazing and borders on magic.

While I am quite good in a few languages, I cannot wait to get good at
Ruby.


#20

El Aug 3, 2007, a las 3:39 PM, Xavier N.
escribió:

it’s shorter although perhaps more obscure:

min, max = items.sort.values_at(0, -1)
puts min+max =~ /(.).{#{min.length-1}}(?!\1)/m ? $` : min

I don’t like the ternary and the fact that we look for the first
mismatch instead of extracting directly the prefix in one shot.

This is a direct way:

min, max = items.sort.values_at(0, -1)
puts (min+max).match(/\A(.).(?=.{#{max.length}}\z)\1/m)[1]

Of course if you can rely on some character that does no belong to
the items that’s easier, it just takes a simplification of Daniel’s
in the case of “\n”:

min, max = items.sort.values_at(0, -1)
puts “#{min}\n#{max}”.match(/^(.).\n\1/)[1]

Albeit those positive approaches are direct they backtrack quite a
lot, perhaps the fixed-length look-ahead with \z in the latter may
get some optimization, I don’t knw. On the contrary the negative
approach, the one that looks for a mismatch, does not backtrack at
all, in that sense that’s the direct one.

Anyway, I doubt I’d use a regexp approach in production code.

– fxn