Uniq with count; better way?

Ok, I tried some benchmarks. We now have even more variables, since they
also depend on the "maxval" of the dataset.

maxval = 1000
ar = [].tap { |a| 1_000_000.times { a << rand(maxval) } }

b.report("Meier:") {
  n.times {
    hist = Array.new(maxval + 1, 0)
    ar.each { |x| hist[x] += 1 }
    result = Hash.new(0)
    0.upto(maxval) { |i| result[i] = hist[i] unless hist[i] == 0 }
    result
  }
}
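For reference, here is a self-contained sketch of the two counting strategies under comparison: my array histogram versus counting straight into a Hash with a default of 0 (my assumption of what the hash-based entries do; the exact code of the other entries is not shown here). Both must produce the same value-to-count mapping:

```ruby
maxval = 100
ar = Array.new(10_000) { rand(maxval) }

# Array-histogram approach: index directly into a pre-sized Array,
# then copy the non-zero slots into a Hash.
hist = Array.new(maxval + 1, 0)
ar.each { |x| hist[x] += 1 }
array_counts = {}
0.upto(maxval) { |i| array_counts[i] = hist[i] unless hist[i] == 0 }

# Hash approach: count straight into a Hash with a default of 0.
hash_counts = Hash.new(0)
ar.each { |x| hash_counts[x] += 1 }

array_counts == hash_counts  # same result either way
```

The array version trades memory proportional to maxval for cheaper per-element updates; the Hash version only pays for values that actually occur.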

On my JRuby and my Windows MRI 1.8.7, my algorithm was fastest for a
maxval of 10, 100 or 10000, for example:

SIZE:   1000000
MAXVAL: 10000
                     user     system      total        real
Ralph Shneiver:  0.533000   0.000000   0.533000 (  0.518000)
Meier:           0.312000   0.000000   0.312000 (  0.312000)
Keinich #1       0.814000   0.000000   0.814000 (  0.814000)

(I don't have 1.9.3 on my Windows PC yet, so results may differ there.)

But here are two observations:

  1. The speed-up is not as big as I expected. In C, I would expect an
    array lookup to be factors faster than a hash calculation (followed by
    an array lookup in the hash table…). In Ruby it does not seem to be
    that much faster, but the speedup grows with bigger values of maxval.

  2. My algorithm sometimes runs much, much slower when Keinich #1 had run
    before mine. :frowning: It seems to get worse with big values of maxval,
    but not with the jruby --1.9 option. The array allocation itself is not
    the problem.
    Is it possible that group_by changes the internal array structure, so
    that I get a non-contiguous array?
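One way to probe observation 2 is to time the histogram pass on a fresh array, then again after a group_by pass over the same array (my assumption being that Keinich #1 is group_by-based), with a GC run in between to rule out pending garbage as the cause. A minimal sketch:

```ruby
require 'benchmark'

maxval = 1_000
ar = Array.new(100_000) { rand(maxval) }

# The histogram pass whose timing seems order-dependent.
histogram = lambda do
  hist = Array.new(maxval + 1, 0)
  ar.each { |x| hist[x] += 1 }
  hist
end

before = Benchmark.realtime { histogram.call }
ar.group_by { |x| x }   # the pass suspected of side effects
GC.start                # rule out pending garbage as the cause
after = Benchmark.realtime { histogram.call }

puts format("before group_by: %.4fs  after: %.4fs", before, after)
```

If the second timing is consistently worse even after GC.start, that would point at something group_by does to the receiver (or to the heap) rather than at garbage pressure.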

Regards

Karsten Meier