Remove duplicates from an array of objects based on an attribute

hi all,
how do I remove duplicates from an array of objects based on an
attribute of the object? For example:
I have an array of ruby beans named diagnoses, and I want to
remove duplicates from it based on the diagnosis id. Assume the diagnoses
have the attributes id and weightage. So for two diagnoses with the same id
but different weightage, the diagnosis with the lower weightage should be
removed.
Can anyone help me?

On 3/6/07, senthil [email protected] wrote:

       I have an array of ruby beans named diagnoses, and I want to
       remove duplicates from it based on the diagnosis id. Assume the
       diagnoses have the attributes id and weightage. So for two diagnoses
       with the same id but different weightage, the diagnosis with the lower
       weightage should be removed.
       Can anyone help me?

From: http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/228538

module Enumerable
  # Group elements by the block's result; returns an array of groups.
  # instance_eval runs the block with the element as self (and as block arg).
  def group_by(&b)
    h = Hash.new{ |hash, key| hash[key] = [] }
    each{ |x| h[x.instance_eval(&b)] << x }
    h.values
  end
end

old_diagnoses = [
  {:id => 1, :w => 30},
  {:id => 2, :w => 20},
  {:id => 3, :w => 10},
  {:id => 1, :w => 10},
  {:id => 1, :w => 40},
  {:id => 2, :w => 50},
  {:id => 4, :w => 60},
  {:id => 4, :w => 30},
  {:id => 2, :w => 20},
  {:id => 3, :w => 10}
]
new_diagnoses = []

groups = old_diagnoses.group_by{ |d| d[:id] }

groups.each do |group|
  # ascending sort by weight, so .last is the entry with the highest :w
  new_diagnoses << group.sort_by{ |g| g[:w] }.last
end

p old_diagnoses
p new_diagnoses

[{:w=>30, :id=>1}, {:w=>20, :id=>2}, {:w=>10, :id=>3}, {:w=>10, :id=>1},
{:w=>40, :id=>1}, {:w=>50, :id=>2}, {:w=>60, :id=>4}, {:w=>30, :id=>4},
{:w=>20, :id=>2}, {:w=>10, :id=>3}]

[{:w=>40, :id=>1}, {:w=>50, :id=>2}, {:w=>10, :id=>3}, {:w=>60, :id=>4}]
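
(Aside: on newer Rubies, 1.8.7 and later, Enumerable#group_by is built in
and returns a Hash keyed by the block result, so the monkey-patch above,
which returns an array of groups, isn't needed. A minimal sketch of the
same dedup with the built-ins, assuming the patch is not loaded:)

# group_by{...}.values gives the groups; max_by keeps the entry
# with the highest :w in each group.
new_diagnoses = old_diagnoses.group_by{ |d| d[:id] }.values.map{ |group|
  group.max_by{ |d| d[:w] }
}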

On Mar 6, 7:03 am, senthil [email protected] wrote:

hi all,
how do I remove duplicates from an array of objects based on an
attribute of the object? For example:
I have an array of ruby beans named diagnoses, and I want to
remove duplicates from it based on the diagnosis id. Assume the diagnoses
have the attributes id and weightage. So for two diagnoses with the same id
but different weightage, the diagnosis with the lower weightage should be
removed.

Here’s my best shot at it:

require 'set'
class Array
  def uniq_by
    seen = Set.new
    # Set#add? returns nil if the value was already present,
    # so select keeps only the first element for each key
    select{ |x| seen.add?( yield( x ) ) }
  end
end

a = [ {:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3} ]
p a, a.uniq, a.uniq_by{ |h| h[:a] }
#=> [{:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3}]
#=> [{:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3}]
#=> [{:a=>1, :d=>1}, {:b=>2}]

(Note how :b=>2 and :c=>3 have the same value for :a (nil), so only
one is included.)
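
(If you would rather keep every entry that lacks the key, a hedged
workaround, not from the thread, is to fall back to something unique
per object:)

# Hypothetical variant: fall back to the object's own identity when :a
# is absent, so keyless entries are never collapsed together.
p a.uniq_by{ |h| h.key?(:a) ? h[:a] : h.object_id }
#=> [{:a=>1, :d=>1}, {:b=>2}, {:c=>3}]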

Here’s another (assumedly slower) version that doesn’t rely on Set:

class Array
  def uniq_by
    seen = {}
    select{ |x|
      v = yield(x)
      # true (keep) only the first time the key v is seen
      !seen[v] && (seen[v] = true)
    }
  end
end

senthil, please don’t take this personally, your question is OK, but the
following sounds so very wrong:

       I have an array of ruby beans (...)

All we have in Ruby are objects. No beans, POJOs, EJBs, and all this
cruft.

Regards,
Pit

On Mar 6, 7:27 am, “Phrogz” [email protected] wrote:

Here’s another (assumedly slower) version that doesn’t rely on Set:

Huh…actually, the hash-based one seems faster than the Set-based
one:

require 'set'
class Array
  def uniq_by1
    seen = Set.new
    select{ |x| seen.add?( yield( x ) ) }
  end

  def uniq_by2
    seen = {}
    select{ |x| !seen[v = yield(x)] && (seen[v] = true) }
  end
end

require 'benchmark'
a = [ {:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3},
      {:a=>2, :e=>7}, {:a=>3, :b=>2}, {:a=>1}, {:a=>4}, {:f=>6} ]
N = 10_000
Benchmark.bmbm{ |x|
  x.report( 'with_set' ){
    N.times{
      a.uniq_by1{ |h| h[:a] }
      a.uniq_by1{ |h| h[:b] }
    }
  }
  x.report( 'with_hash' ){
    N.times{
      a.uniq_by2{ |h| h[:a] }
      a.uniq_by2{ |h| h[:b] }
    }
  }
}

#=> Rehearsal ---------------------------------------------
#=> with_set    1.840000   0.030000   1.870000 (  2.401238)
#=> with_hash   1.270000   0.030000   1.300000 (  1.701307)
#=> ------------------------------------ total: 3.170000sec
#=>
#=>                 user     system      total        real
#=> with_set    1.820000   0.020000   1.840000 (  2.187477)
#=> with_hash   1.250000   0.020000   1.270000 (  1.555490)

(Yes, my laptop is rather old and slow.)

And here’s the inevitable one-liner… :}

(But I do prefer the group_by version…)

gegroet,
Erik V. - http://www.erikveen.dds.nl/


################################################################

arr = [
  {:id => 1, :w => 30},
  {:id => 2, :w => 20},
  {:id => 3, :w => 10},
  {:id => 1, :w => 10},
  {:id => 1, :w => 40},
  {:id => 2, :w => 50},
  {:id => 4, :w => 60},
  {:id => 4, :w => 30},
  {:id => 2, :w => 20},
  {:id => 3, :w => 10}
]

################################################################

res1 = arr.inject({}){ |h, o| (h[o[:id]] ||= []) << o; h }.values.map{ |a|
  a.sort_by{ |o| o[:w] }.pop
}

################################################################

res2 =
  arr.inject({}) do |h, o|
    (h[o[:id]] ||= []) << o ; h
  end.values.collect do |a|
    a.sort_by do |o|
      o[:w]
    end.pop
  end

################################################################

module Enumerable
  # Build a hash mapping each block result to the array of matching elements.
  def hash_by(&block)
    inject({}){ |h, o| (h[block.call(o)] ||= []) << o ; h }
  end

  # Return just the groups, as an array sorted by key.
  def group_by(&block)
    hash_by(&block).sort.transpose.pop
  end
end

res3 =
  arr.group_by do |o|
    o[:id]
  end.collect do |a|
    a.sort_by do |o|
      o[:w]
    end.pop
  end

################################################################

p res1
p res2
p res3

################################################################

Erik V. wrote:

And here’s the inevitable one-liner… :}

Not that we’re golfing, but I like this one better in terms of one-
linedness:
Hash[ *map{ |o| [ o[:id], o ] }.flatten ].values

On Mar 6, 1:47 pm, “Phrogz” [email protected] wrote:

Erik V. wrote:

And here’s the inevitable one-liner… :}

Not that we’re golfing, but I like this one better in terms of one-
linedness:
Hash[ *map{ |o| [ o[:id], o ] }.flatten ].values

Oops, I meant:
Hash[ *a.map{ |o| [ o[:id], o ] }.flatten ].values

Hash[ *a.map{ |o| [ o[:id], o ] }.flatten ].values

Not bad…

How does this ensure that the maximum :w is used?

gegroet,
Erik V. - http://www.erikveen.dds.nl/

On Mar 6, 7:40 am, “Phrogz” [email protected] wrote:

On Mar 6, 7:27 am, “Phrogz” [email protected] wrote:

Here’s another (assumedly slower) version that doesn’t rely on Set:

Huh…actually, the hash-based one seems faster than the Set-based
one:

And faster still, by a hair, is a last-in approach. Upon reflection,
all these techniques rely only on methods already in Enumerable, so
they can be put there instead of being Array-specific.

module Enumerable
  require 'set'

  def uniq_by1
    seen = Set.new
    select{ |x| seen.add?( yield( x ) ) }
  end

  def uniq_by2
    seen = {}
    select{ |x| !seen[v = yield(x)] && (seen[v] = true) }
  end

  def uniq_by3
    Hash[ *map{ |x| [ yield(x), x ] }.flatten ].values
  end

  def uniq_by4
    # fastest; preserves the last-seen value for a key
    h = {}
    each{ |x| h[yield(x)] = x }
    h.values
  end

  def uniq_by5
    # near-fastest; preserves the first-seen value for a key
    h = {}
    each{ |x| v = yield(x); h[v] = x unless h.include?(v) }
    h.values
  end
end

a = [ {:a=>1, :d=>1}, {:b=>2}, {:c=>3}, {:a=>1, :d=>3},
      {:a=>2, :e=>7}, {:a=>3, :b=>2}, {:a=>1}, {:a=>4}, {:f=>6} ]

require 'benchmark'
N = 20_000
Benchmark.bmbm{ |x|
  x.report( 'with set' ){
    N.times{
      a.uniq_by1{ |h| h[:a] }
      a.uniq_by1{ |h| h[:b] }
    }
  }
  x.report( 'with hash' ){
    N.times{
      a.uniq_by2{ |h| h[:a] }
      a.uniq_by2{ |h| h[:b] }
    }
  }
  x.report( 'Hash.[].values' ){
    N.times{
      a.uniq_by3{ |h| h[:a] }
      a.uniq_by3{ |h| h[:b] }
    }
  }
  x.report( '#values (last in)' ){
    N.times{
      a.uniq_by4{ |h| h[:a] }
      a.uniq_by4{ |h| h[:b] }
    }
  }
  x.report( '#values (first in)' ){
    N.times{
      a.uniq_by5{ |h| h[:a] }
      a.uniq_by5{ |h| h[:b] }
    }
  }
}

#=> Rehearsal ------------------------------------------------------
#=> with set             2.500000   0.016000   2.516000 (  2.547000)
#=> with hash            1.312000   0.000000   1.312000 (  1.313000)
#=> Hash.[].values       2.453000   0.000000   2.453000 (  2.453000)
#=> #values (last in)    1.110000   0.000000   1.110000 (  1.109000)
#=> #values (first in)   1.296000   0.000000   1.296000 (  1.297000)
#=> --------------------------------------------- total: 8.687000sec
#=>
#=>                          user     system      total        real
#=> with set             2.000000   0.000000   2.000000 (  1.999000)
#=> with hash            1.297000   0.000000   1.297000 (  1.297000)
#=> Hash.[].values       2.531000   0.000000   2.531000 (  2.532000)
#=> #values (last in)    1.125000   0.015000   1.140000 (  1.140000)
#=> #values (first in)   1.344000   0.000000   1.344000 (  1.344000)

On 3/6/07, Erik V. [email protected] wrote:

Hash[ *a.map{ |o| [ o[:id], o ] }.flatten ].values

Not bad…

How does this ensure that the maximum :w is used?

It doesn't, by itself: Hash[] simply keeps the last pair it sees for each
repeated key, so the plain version returns whichever entry comes last in
the array:

Hash[ *a.map{ |o| [ o[:id], o ] }.flatten ].values
=> [{:id=>1, :w=>40}, {:id=>2, :w=>20}, {:id=>3, :w=>10}, {:id=>4, :w=>30}]

Sorting first, so that the heaviest entry for each id comes last, fixes that
(sorting by [id, w] rather than by id alone, since sort_by makes no
stability guarantee):

Hash[ *a.sort_by{ |z| [z[:id], z[:w]] }.map{ |o| [o[:id], o] }.flatten ].values
=> [{:id=>1, :w=>40}, {:id=>2, :w=>50}, {:id=>3, :w=>10}, {:id=>4, :w=>60}]
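
For the record, in senthil's own terms the whole thread boils down to
something like this (a minimal sketch, assuming diagnosis objects that
respond to #id and #weightage as described in the original question):

# Keep, for each id, the diagnosis with the highest weightage,
# in a single pass and without sorting.
def dedup(diagnoses)
  best = {}
  diagnoses.each do |d|
    cur = best[d.id]
    best[d.id] = d if cur.nil? || d.weightage > cur.weightage
  end
  best.values
end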