Finding duplicates in an array and its index number

Gurdipe_D · April 19, 2012, 11:53pm

Hi,

I have following array and I am trying to find if there are duplicates
in the array. I also want to find out their index.

e.g.

a=[1,2,3,1,3]

Here 1 is a duplicate which comes twice and its index is 0 and 3 in the
array.

Any suggestions?

Thanks

newtoruby_ca · April 20, 2012, 12:08am

I’m curious what the purpose is. If it’s simply to remove the duplicates
you can call .uniq on the array to remove duplicates.

newtoruby_ca · April 20, 2012, 12:09am

I dont have to remove the duplicates but just need to find them with
their index

newtoruby_ca · April 20, 2012, 12:10am

If you really want to know how it finds the duplicates, look at the
source
code for the .uniq method as a starting point.

newtoruby_ca · April 20, 2012, 12:49am

Hi,

As always, there are many ways to do this.

A highlevel approach: (short but not really inefficient)

x = [1, 2, 3, 2, 4, 5, 4, 4]

a = x.each_with_index.group_by(&:first).inject({}) do |result, (val,
group)|
next result if group.length == 1
result.merge val => group.map {|pair| pair[1]}
end

p a

A more efficient lowlevel solution would be to iterate over the array,
compare the values and save the indices in case of duplicates.

newtoruby_ca · April 20, 2012, 1:03am

Jan E. wrote in post #1057484:

A more efficient lowlevel solution would be to iterate over the array,
compare the values and save the indices in case of duplicates.

… like this:

x = [1, 2, 3, 2, 4, 5, 4, 4]

duplicates = {}
x.each_with_index do |value, i|
(i + 1).upto x.length - 1 do |j|
if x[j] == value
duplicates[value] = [i] if duplicates[value].nil?
duplicates[value] << j
break
end
end
end

p duplicates

newtoruby_ca · April 20, 2012, 12:58am

On 20/04/12 09:53, newto ruby wrote:

array.

Any suggestions?

Thanks

How about;

1.9.3p125 :001 > foo = [1,2,3,1,3]
=> [1, 2, 3, 1, 3]
1.9.3p125 :002 > dup_indices = Hash.new {|h,k| h[k]=[]}
=> {}
1.9.3p125 :003 > foo.each_index {|i| dup_indices[foo[i]] << i unless 1
== foo.count(foo[i])}
=> [1, 2, 3, 1, 3]
1.9.3p125 :004 > dup_indices
=> {1=>[0, 3], 3=>[2, 4]}

Sam

newtoruby_ca · April 20, 2012, 12:34pm

I do a lot of indexing of data. Usually they’re multi-field records, and
each data set will have multiple indexes. The technique I use is below.
This indexes all elements of the array. So, then to find duplicates, you
have to loop again to find array counts > 1. Anyway, it’s a slightly
different way to do it in the event an index of all entries would be
useful for lookups.

data = [1, 2, 3, 1, 3, 4, 6, 5, 4]
dataIndex = {}

data.each_with_index do |value, indx|

index_key = value
index_value = indx

if !dataIndex.has_key?(index_key)
dataIndex.store(index_key, [])
end

dataIndex[index_key].push(index_value)
end

p dataIndex

dataIndex.each do |key, indxs|
if indxs.count > 1
# do something
puts “#{key} appears at #{indxs.join(’, ')}”
end
end

– gw

newtoruby_ca · April 20, 2012, 12:47pm

On 04/19/2012 11:53 PM, newto ruby wrote:

Hi, I have following array and I am trying to find if there are
duplicates in the array. I also want to find out their index.

Generate a hash with unique items as keys, and arrays of indices as
values,
then select only those with more than one index:

array = [1, 2, 3, 4, 5, 3, 6, 7, 2, 8, 1, 9]

array.each_with_index.reduce({}) { |hash, (item, index)|
hash[item] = (hash[item] || []) << index
hash
}.select { |key, value|
value.size > 1
}

=> {1=>[0, 10], 2=>[1, 8], 3=>[2, 5]}

newtoruby_ca · April 20, 2012, 6:55pm

Greg W. wrote in post #1057552:

I do a lot of indexing of data. Usually they’re multi-field records, and
each data set will have multiple indexes. The technique I use is below.
This indexes all elements of the array. So, then to find duplicates, you
have to loop again to find array counts > 1. Anyway, it’s a slightly
different way to do it in the event an index of all entries would be
useful for lookups.

DOH. That first loop only needed to be:

data.each_with_index do |value, indx|
if !dataIndex.has_key?(key)
dataIndex.store(key, [])
end
dataIndex[key].push(value)
end

That’s what I get for doing a quick transcribe of project code.

– gw

newtoruby_ca · April 21, 2012, 1:48am

On Apr 20, 2012, at 03:46 , Lars H. wrote:

hash
}.select { |key, value|
value.size > 1
}

=> {1=>[0, 10], 2=>[1, 8], 3=>[2, 5]}

Don’t use reduce/inject for non-reductive applications. Use something
more appropriate, like a plain each.

Also use Hash to its full capabilities:

hash = Hash.new { |h,k| h[k] = [] }

then it is as clean as:

array.each_with_index do |val, idx|
hash[val] << idx
end

Notice how you’re not constantly re-assigning hash for no good reason in
my version? That adds up, but more importantly it obfuscates the
original intent.

newtoruby_ca · April 22, 2012, 1:46pm

OK folks,

news from the service department. Here’s the shootout:

gist.github.com

https://gist.github.com/rklemme/2463600

find-duplicates.rb

require 'benchmark'

REP = 100_000

dat = [1, 2, 3, 2, 4, 5, 4, 4]

Benchmark.bmbm 25 do |x|
  x.report "Jan low level" do
    REP.times do
      dups = {}

This file has been truncated. show original

output.txt

$ ruby19 -v
ruby 1.9.3p125 (2012-02-16 revision 34643) [i686-linux]
$ ruby19 b.rb 
Rehearsal -------------------------------------------------------------
Jan low level               1.320000   0.000000   1.320000 (  1.373991)
Jan multi Hash              2.250000   0.000000   2.250000 (  2.395521)
Sam                         1.630000   0.000000   1.630000 (  2.000988)
Lars                        1.590000   0.000000   1.590000 (  1.675986)
Lars simpler                1.550000   0.000000   1.550000 (  1.616905)
Lars faster?                1.400000   0.000000   1.400000 (  1.467628)

This file has been truncated. show original

Turns out quickest is one of the less arcane solutions:

  dups = {}

  dat.each_with_index do |val, idx|
    (dups[val] ||= []) << idx
  end

  dups.delete_if {|k,v| v.size == 1}

A few remarks:

It’s amazing how many forms even the same algorithm can take in Ruby.
Just the different options for adding to a Hash (#default_proc,
#fetch, ||=) can make a huge visual difference although the underlying
strategy of all these algorithms is the same.

Downside of Jan’s first approach is that it has effort O(n**2) in
terms of elements in the array because of the nested iteration. Sam’s
approach has a similar quality although less obvious (it’s in the
#count).

Allocating multiple Hash instances is what makes Jan’s second approach
slow.

As always, #inject (or #reduce) approaches are slower than others.

The best strategy is to create a Hash with elements as keys and Arrays
of indexes as values. This approach won’t suffer for large inputs
from the O(n**2) issue.

For one off scripts and infrequent execution it doesn’t really matter
much.

Kind regards

robert