I need to remove duplicates from an array of arrays. I can’t use
Array#uniq because some fields are different and not part of the
“key.” Here’s an example where the first 3 elements of each sub array
are the “key” and determine uniqueness. I want to keep only the first
one I get.
This works. However, it is super slow when operating on my dataset.
My arrays contain hundreds of thousands of sub arrays. The unique key
for each sub array is the first 12 (of 18) elements. It is taking many
seconds to produce each intermediate array (“dupes” in the example
above), so deduping the entire thing would likely take days.
I need to remove duplicates from an array of arrays. I can’t use
Array#uniq because some fields are different and not part of the “key.”
Here’s an example where the first 3 elements of each sub array are the
“key” and determine uniqueness. I want to keep only the first one I get.
Here is my first attempt at solving the problem:
?> dedup a
=> [[1, 2, 3, 4, 5]]
This works. However, it is super slow when operating on my dataset. My
arrays contain hundreds of thousands of sub arrays. The unique key for each
sub array is the first 12 (of 18) elements. It is taking many seconds to
produce each intermediate array (“dupes” in the example above), so deduping
the entire thing would likely take days.
Anyone have a superior and faster solution?
See if this speeds it up meaningfully (and make sure I’ve got the
logic right):
def dedup(ary)
uniq = {}
ary.each do |line|
uniq[ary[0…2]] ||= line
end
uniq.values
end
(And Joel, I have presorted the array prior to removing the
dupes so I have already taken care of the ordering issue.)
I think what Joel was referring to was that in Ruby 1.8 a Hash doesn’t
maintain insertion order when traversed (Ruby 1.9 does maintain
insertion order):
On Jul 17, 2009, at 5:39 PM, David A. Black wrote:
a = [[1, 2, 3, 4, 5], [1, 2, 3, 9, 4], [1, 2, 3, 4, 4]]
?> dupes = ary.select { |row| row[0…2] == line[0…2] }
is taking many seconds to produce each intermediate array (“dupes”
ary.each do |line|
uniq[ary[0…2]] ||= line
end
uniq.values
end
David and Joel,
you both provided the same solution. I will test this to see what kind
of performance I get. It will be hell on memory, but I assumed any
solution likely would be. (And Joel, I have presorted the array prior
to removing the dupes so I have already taken care of the ordering
issue.)
“key” and determine uniqueness. I want to keep only the first one I get.
ary.map do |line|
each sub array is the first 12 (of 18) elements. It is taking many seconds
ary.each do |line|
dupes so I have already taken care of the ordering issue.)
I believe the version you had originally, where you do a mapping of
the whole array, will typically use much more memory than the hash
version. Let’s say your original array has 1000 inner arrays, with 10
that are considered unique. The mapping will be a new array, also of
1000 elements. The hash will have 10 key/value pairs – thus much
smaller.
On Jul 17, 2009, at 6:14 PM, David A. Black wrote:
doesn’t
irb(main):004:0> h[“sadf”] = 3
ary.each do |line|
key = line[0…2]
next if uniq[key]
res << (uniq[key] = line)
end
res
end
David
I missed the beginning of this thread, but here is an implementation
I’ve used successfully:
def uniq_by(subject, &block)
h = {}
a = []
subject.each do |s|
comparator = yield(s)
unless h[comparator]
a.push(s)
h[comparator] = s
end
end
a
end
Usage:
u = uniq_by(ary|{ |item| item.element }
Basically, what this allows you to do is specify what exactly about an
array item must be unique. It also preserves the original array order,
with a “first entry wins” approach to duplicate elimination.
On Jul 17, 2009, at 7:55 PM, David A. Black wrote:
I believe the version you had originally, where you do a mapping of
the whole array, will typically use much more memory than the hash
version. Let’s say your original array has 1000 inner arrays, with 10
that are considered unique. The mapping will be a new array, also of
1000 elements. The hash will have 10 key/value pairs – thus much
smaller.
Oh yes, my version had terrible execution performance and memory
performance. I was trying to figure out how to use a hash but did not
make the leap to the ||= construction on my own. I knew I was missing
something obvious… all of your rapid responses proved it.
FYI, the dedup code you provided performs quite admirably. I’ll take a
look at its memory footprint when I get in the office Monday and
report back.
I was trying to figure out how to use a hash but did not
make the leap to the ||= construction on my own.
I wouldn’t recommend freezing one’s knowledge or leap abilities at a
particular point, though I’m certainly in sympathy with being
suspicious of punctuation-heavy stuff, but in the case of ||= it’s
such a common idiom, and so easy to learn, that it seems like a bit of
an artificial hardship not to learn it. Still, if will work too
The ||= idiom should work
fine
…until it doesn’t (I think you know what I’m refering to).
The hash default thing? I don’t think that comes into play here, does
it?