Determining the common prefix for several strings

toadkicker · August 4, 2007, 12:36am

The obligatory solution using zip…

data contains a list of strings

aa = data.map{ |x| x.split(//)}
bz = aa[0].zip(*(aa[1 … -1])).map{ |x| x.join("")}
target = bz.index( bz.find{ |x| x.squeeze.size != 1})
answer = (target.nil? ? aa[0] :
target >= 1 ? aa[0][0 … target - 1] :
[]).join("")

Eric

toadkicker · August 5, 2007, 6:41pm

On Aug 3, 3:30 pm, Eric P. [email protected] wrote:

Eric
Doh! This isn’t correct. It’s hard to improve over
Michael’s routine back in the initial response. You
need to find any one of the shortest strings, remove it,
and use that instead of aa[0] as above.

toadkicker · August 4, 2007, 12:28am

PeÃ±a wrote:

[…]
ff is another example (of my nubiness). it shows niftiness of bang methods, chains, and the underutilized all?,

don’t forget the any?

puts r.first[0…(0…r.first.size).find{|i|
r.any?{|s|r.first[0…i]!=s[0…i]}}]

cheers

Simon

toadkicker · August 6, 2007, 8:42am

Wolfgang NÃ¡dasi-donner wrote:
EOS

s = list.split("\n").sort{|a,b|a.length<=>b.length}
pre =
s.collect{|e|e.split(’’)[0…(s[0].length)]}.transpose.collect{|e|e.uniq}.reject{|e|e.length>1}.flatten.join(’’)
p pre # => “item”

Wolfgang N#adasi-Donner

:oops: - I made an error!

The corrected version is…

list = <<-EOS
item0aa
item1aa
item3a
EOS
s = list.split("\n").sort{|a,b|a.length<=>b.length}
pre =
s.collect{|e|e.split(’’)[0…(s[0].length)]}.transpose.collect{|e|e.uniq}.
inject(’’){|r,c|(c.length>1)?(break r):(r<<c[0])}
p pre # => “item”

Wolfgang NÃ¡dasi-Donner

toadkicker · August 5, 2007, 10:46pm

Xavier N. wrote:

El Aug 3, 2007, a las 2:31 AM, Todd B.
escribiï¿½:
item 5

item 10
itemize this

I want to parse these names and determine that “item” is the common
prefix.

I know how to armstrong my way through it a character at a time,
but I’m
thinking that Ruby probably has a very elegant way of doing this that
I’m not familiar with yet.

This is a solution that avoids going char by char:

items = [
‘item001’,
‘item004’,
‘item002’,
‘item002b’,
‘item002C’,
‘item 5’,
‘item 10’,
‘itemize this’
]

min_length = items.map {|i| i.length}.min
cuts = items.map {|i| i.split(//)[0, min_length]}
cuts.transpose.each_with_index do |t, i|
if t.uniq.length != 1
puts cuts.first[0, i].join(’’)
break
end
end

We first check which is the length of the shortest string in the
array, then split by chars to store those many ones, and then
transpose the matrix. Note that transposing is OK because we cut the
strings. That’s longer to explain than to write it in Ruby as you
see. Once you get the transpose you can go row by row to detect the
one that has more than one distinct element. Again that’s easy
delegating the job to uniq.

– fxn

A variant of this is:

list = <<-EOS
item001
item004
item002
item002b
item002C
item 5
item 10
itemize this
EOS
s = list.split("\n").sort{|a,b|a.length<=>b.length}
pre =
s.collect{|e|e.split(’’)[0…(s[0].length)]}.transpose.collect{|e|e.uniq}.reject{|e|e.length>1}.flatten.join(’’)
p pre # => “item”

Wolfgang N#adasi-Donner

toadkicker · August 6, 2007, 9:21am

On Aug 3, 10:31 am, Todd B. [email protected] wrote:

I want to parse these names and determine that “item” is the common
prefix.

I know this has been done to death, but here’s my preferred code:

list = %{
item001
item004
item002
item002b
item002C
item 5
item 10
itemize this
}.strip.map { |line| line.strip }.sort_by { |str| str.length }

prefix = list.first.dup

until prefix.empty? or list.all? { |str| str =~ /^#{prefix}/ }
prefix.chop!
end

p prefix

I didn’t think of using chop! until someone else posted a solution
with that in it. It’s the first really good use of chop! I’ve seen.

This solution also demonstrates the lovely #emtpy? and #all? methods.
(I know #all? has been seen before in this thread, but it’s such a
handy method.)

Cheers,
Gavin

toadkicker · August 6, 2007, 9:26am

On Aug 3, 5:58 pm, Peña, Botp [email protected] wrote:

min = list.sort!{|a,b| a.size <=> b.size}.first
break if list.all?{|a| a =~ /^#{min}/} while min.chop!

I just realised that the solution I just posted elsewhere in this
thread is very very similar to this one.

Be careful that using bang methods in chains is sometimes problematic,
as they can return nil if nothing changes (sort! doesn’t have this
problem). For interested readers, find the problem with this code:

title = “the moon in june”
title.downcase!.capitalize!

Cheers,
Gavin

toadkicker · August 6, 2007, 1:51pm

2007/8/6, PeÃ±a, Botp [email protected]:

indeed. as long as the array has no nil elements, w’re ok w sort!
=> [1, 1, 1]

ruby power, indeed.
All these solutions have the issue that they modify the list which
might or might not be ok. I tend to prefer to view the item list as
read only. So I’d do:

min = list.min{|a,b| a.size <=> b.size}.dup
min.chop! until list.all?{|a| a =~ /^#{min}/}

I have added some interesting benchmarking:

$ ./prefix.rb
user system total
real
1000 * list.rx 0.016000 0.000000 0.016000 (
0.014000)
1000 * list.rx with conversion 0.031000 0.000000 0.031000 (
0.030000)
1000 * list.one_pass 0.219000 0.000000 0.219000 (
0.218000)
1000 * list.min 0.328000 0.000000 0.328000 (
0.337000)
1000 * list.first.dup 0.640000 0.000000 0.640000 (
0.641000)
1 * large.rx 0.000000 0.000000 0.000000 (
0.000000)
1 * large.rx with conversion 0.000000 0.000000 0.000000 (
0.000000)
1 * large.one_pass 25.985000 0.000000 25.985000 (
26.298000)
1 * large.min 34.953000 0.000000 34.953000 (
35.098000)
1 * large.first.dup 27.719000 0.000000 27.719000 (
27.732000)

(see attachment for code)

Kind regards

robert

toadkicker · August 6, 2007, 5:12pm

Todd B. wrote:

I have several strings to analyze. The strings are the names of a list
of things that were once ordered and now are not. A typical array of
names might look like this:

item001
item004
item002
item002b
item002C
item 5
item 10
itemize this

I want to parse these names and determine that “item” is the common
prefix.

I know how to armstrong my way through it a character at a time, but I’m
thinking that Ruby probably has a very elegant way of doing this that
I’m not familiar with yet.

Thanks, Todd

Wow, I’m surprised nobody had the idea of simply using Array#abbrev,
which is included in the stdlib.

Regards
Stefan

toadkicker · August 6, 2007, 9:12pm

Stefan R. wrote:

Wow, I’m surprised nobody had the idea of simply using Array#abbrev,
which is included in the stdlib.

Regards
Stefan

Well that looks simple enough! Looks like I would take the shortest
key returned and chop! the last character to get the common prefix. If
the shortest key length was 1, there is no common prefix.

I had never used Abbrev before - thanks for pointing it out. This
wasn’t a contest, but if it were - I think you won!

Todd

toadkicker · August 6, 2007, 9:27pm

El Aug 6, 2007, a las 5:13 PM, Stefan R.
escribió:

Wow, I’m surprised nobody had the idea of simply using Array#abbrev,
which is included in the stdlib.

There’s no guarantee you can obtain the maximum common prefix from
its output, because it computes unambiguous prefixes:

%w{ item1 item2 }.abbrev
=> {“item1”=>“item1”, “item2”=>“item2”}

See, no “item” there, you end up basically with the information you
had before the call.

– fxn

toadkicker · August 6, 2007, 10:42am

From: Gavin S. [mailto:[email protected]]

On Aug 3, 5:58 pm, Peña, Botp [email protected] wrote:

> min = list.sort!{|a,b| a.size <=> b.size}.first

> break if list.all?{|a| a =~ /^#{min}/} while min.chop!

I just realised that the solution I just posted elsewhere in this

thread is very very similar to this one.

Be careful that using bang methods in chains is sometimes problematic,

as they can return nil if nothing changes (sort! doesn’t have this

problem).

indeed. as long as the array has no nil elements, w’re ok w sort!

irb(main):009:0> s=“test”
=> “test”
irb(main):010:0> s.downcase!.downcase!
NoMethodError: undefined method `downcase!’ for nil:NilClass
from (irb):10
from :0
irb(main):011:0> a=[1,1,1]
=> [1, 1, 1]
irb(main):012:0> a.sort!.sort!
=> [1, 1, 1]

i realized just now that i posted the first code. That should be,

min = list.sort!{|a,b| a.size <=> b.size}.shift # <–shift instead of
first
break if list.all?{|a| a =~ /^#{min}/} while min.chop!

but anyway, after a couple of tests, the sort does not help much, so

min = list.min{|a,b| a.size <=> b.size}
break if list.all?{|a| a =~ /^#{min}/} while min.chop!

hmm… Gavin, your until code just gave me a hint. if we reverse logic,
we can save the break and make it more readable too.

min = list.min{|a,b| a.size <=> b.size}
min.chop! until list.all?{|a| a =~ /^#{min}/}

and we can use switch w #any? anytime too

min = list.min{|a,b| a.size <=> b.size}
min.chop! while list.any?{|a| a !~ /^#{min}/}

ruby power, indeed.

kind regards -botp

toadkicker · August 7, 2007, 3:17am

From: Robert K. [mailto:[email protected]]

I have added some interesting benchmarking:

$ ./prefix.rb

user system total

real

1000 * list.rx 0.016000 0.000000 0.016000 (

0.014000)

1000 * list.rx with conversion 0.031000 0.000000 0.031000 (

0.030000)

1 * large.rx 0.000000 0.000000 0.000000 (

0.000000)

1 * large.rx with conversion 0.000000 0.000000 0.000000 (

0.000000)

Robert, thanks for the benchmark. I included Brian|Martin’s solution in
the mix but still the regex scheme is way a lot better w regards to
speed.

Could some kind soul enlighten me why the regex solution works better,
and even best using large strings?

kind regards -botp

toadkicker · August 7, 2007, 4:26am

From: Logan C. [mailto:[email protected]]

I’ve edited out some false starts. Please feel free to point

out any flaws

that I’ve missed

Abbrev.abbrev(items).sort_by { |s,| s.length }.first.first[0…-2]

I cannot prove it wrong, hacker Logan

i just put some cosmetics though by getting the extract fr min,

irb(main):099:0> list=[
irb(main)1* “item two of those”,
irb(main):101:1* “item two”,
irb(main):102:1* “item1”,
irb(main):103:1* “item22”,
irb(main):104:1* “item333”,
irb(main):105:1* “item 444”,
irb(main):106:1* “item 5555”,
irb(main):107:1* “item6a”,
irb(main):108:1* “item6b”,
irb(main):109:1* “item—x”,
irb(main):110:1* “itemy”,
irb(main):111:1* “item whatever :)”,
irb(main):112:1* ]
=> [“item two of those”, “item two”, “item1”, “item22”, “item333”, “item
444”, “item 5555”, “item6a”, “item6b”, “item—x”, “itemy”, “item
whatever :)”]
irb(main):113:0>
list.abbrev.min{|(a1,),(b1,)|a1.length<=>b1.length}.first[0…-2]
=> “item”

so,
list.abbrev.min{|(a1,),(b1,)|a1.length<=>b1.length}.first[0…-2]

your magic number “0…2” is an amazingly simple hack (it looks like 42
:). but i’m not sure how this abbrev scheme stands stands against the
regex solutions though.

kind regards -botp

toadkicker · August 7, 2007, 3:48am

On 8/6/07, Xavier N. [email protected] wrote:

=> {“item1”=>“item1”, “item2”=>“item2”}

See, no “item” there, you end up basically with the information you
had before the call.

irb(main):027:0> items = [
irb(main):028:1* ‘item001’,
irb(main):029:1* ‘item004’,
irb(main):030:1* ‘item002’,
irb(main):031:1* ‘item002b’,
irb(main):032:1* ‘item002C’,
irb(main):033:1* ‘item 5’,
irb(main):034:1* ‘item 10’,
irb(main):035:1* ‘itemize this’
irb(main):036:1> ]
irb(main):039:0> Abbrev.abbrev(items).sort_by { |s,|
s.length}.first.first[0…-
2]
=> “item”
irb(main):040:0> items = [“happy”, “happening”, “hapless”]
=> [“happy”, “happening”, “hapless”]
irb(main):041:0> Abbrev.abbrev(items).sort_by { |s,|
s.length}.first.first[0…-
2]
=> “hap”
irb(main):042:0> items = [“c”, “b”, “a”]
=> [“c”, “b”, “a”]
irb(main):043:0> Abbrev.abbrev(items).sort_by { |s,|
s.length}.first.first[0…-
2]
=> “”
irb(main):044:0>

I’ve edited out some false starts. Please feel free to point out any
flaws
that I’ve missed

Abbrev.abbrev(items).sort_by { |s,| s.length }.first.first[0…-2]

toadkicker · August 7, 2007, 7:16am

On 07.08.2007 03:16, Peña wrote:

Robert, thanks for the benchmark. I included Brian|Martin’s solution in the mix but still the regex scheme is way a lot better w regards to speed.

Could some kind soul enlighten me why the regex solution works better, and even best using large strings?

My theory is that it’s completely done in C and there is only one
invocation. All other solutions need some form of iteration implemented
in Ruby.

Kind regards

robert

toadkicker · August 7, 2007, 1:11pm

On Aug 7, 2007, at 3:47 AM, Logan C. wrote:

=> {“item1”=>“item1”, “item2”=>“item2”}
irb(main):031:1* ‘item002b’,
=> [“happy”, “happening”, “hapless”]
irb(main):044:0>

I’ve edited out some false starts. Please feel free to point out
any flaws
that I’ve missed

If I understand correctly that approach that says: “since I can get
some absolute shortest (perhaps not unique as fas as length is
concerned) unambiguous minimum, belonging to string S, that shortest
minus one is ambiguous”. That’s the key idea, and I think there’s
something into it.

What happens is that, as far as I can see, you can only conclude
there’s ambiguity with some other string, which is not the same than
saying that you’ve got the maximum common prefix. See for example:

%w{abc acb abd acd}.abbrev.sort_by {|s,| s.length}.first.first[0…-2]
=> “ac”

– fxn

PS: I think abbrev does too much work compared to more specific
solutions, but it is worth exploring for the sake of thinking
nonetheless.

toadkicker · August 8, 2007, 4:20am

From: Xavier N. [mailto:[email protected]]

%w{abc acb abd acd}.abbrev.sort_by {|s,|

s.length}.first.first[0…-2]

=> “ac”

duh, i should read further on what abrrev really does. Thanks for the
enlightenment.

PS: I think abbrev does too much work compared to more specific

solutions, but it is worth exploring for the sake of thinking

nonetheless.

yes. this has been an enlightening thread nonetheless.
kind regards -botp