Uniq with count; better way?

Kenichi, let’s add the word “histogram” to the search.

Perhaps naming the method something “histogram” related would be a good
choice.

Abinoam Jr.

On Tue, Jan 17, 2012 at 20:02, Abinoam Jr. [email protected] wrote:

Kenichi, let’s add the word “histogram” to the search.

Perhaps naming the method something “histogram” related would be a good
choice.

A histogram is a diagram that visually represents frequencies. A hash is
not a histogram.

On 01/17/2012 12:07 PM, Adam P. wrote:

It’s a reasonable suggestion.

http://en.wikipedia.org/wiki/Histogram:

"""
In a more general mathematical sense, a histogram is a function m_i that
counts the number of observations that fall into each of the disjoint
categories (known as bins), whereas the graph of a histogram is merely
one way to represent a histogram.
"""

On Tue, Jan 17, 2012 at 12:08 PM, Peter V.
[email protected] wrote:

I assumed (incorrectly, it turns out, for Ruby 1.9.3) that for large data sets
there would be a significant performance difference, because determining the
“bins” of uniq values is essentially a form of “sorting”, which can be O(n^2)
if one is not careful.

There is no sorting going on. Some tests explicitly use a Hash, others
use Enumerable#group_by, which also uses a Hash (you can see it from
the return value).
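
For illustration, the two flavors boil down to something like this (just a
sketch, with made-up sample data):

ar = [4, 5, 6, 4, 5, 6, 6, 7]

# explicit Hash with a default value of 0
counts = Hash.new(0)
ar.each { |x| counts[x] += 1 }
counts # => {4=>2, 5=>2, 6=>3, 7=>1}

# Enumerable#group_by builds a Hash internally and returns one
grouped = ar.group_by { |x| x } # => {4=>[4, 4], 5=>[5, 5], 6=>[6, 6, 6], 7=>[7]}
Hash[grouped.map { |k, v| [k, v.size] }] # => {4=>2, 5=>2, 6=>3, 7=>1}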

Questions:

Why would ruby 1.9.3 be so much better at this than ruby 1.8.7?
Could it be because the Hash is now “ordered”, so it can use an efficient
algorithm when adding an entry to a bin?

No. Hashes in 1.9.* only maintain insertion order - nothing else.

Or is it object creation?

The difference more likely has to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.

Why are certain methods leading to non-linear behavior?

I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.

Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.
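
For example (a rough sketch of how I would set it up, names are arbitrary):

require 'benchmark'

ar = Array.new(1_000_000) { rand(100) }

# bmbm runs every block twice: a rehearsal pass for warm-up, then the timed pass
Benchmark.bmbm(15) do |b|
  b.report("Hash counter:") { h = Hash.new(0); ar.each { |x| h[x] += 1 }; h }
  b.report("group_by:") { Hash[ar.group_by { |x| x }.map { |k, v| [k, v.size] }] }
end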

I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub

Cheers

robert

On Tue, Jan 17, 2012 at 21:29, Joel VanderWerf
[email protected]wrote:

It’s a reasonable suggestion.

http://en.wikipedia.org/wiki/Histogram:

Hm! I haven’t heard that use before. Still, would that be well-understood?
If most people think of it as the visual representation of the frequency,
the method name might seem off. But thanks for pointing that one out. :-)

On Wed, Jan 18, 2012 at 2:57 PM, Robert K.
[email protected]wrote:

There is no sorting going on.
I beg to differ …

I would assume the only task with an order higher than O(n) is the
“binning”.

That is

  • finding the hash key for which the value needs to be incremented
  • or finding the set to which the entry needs to be added

I would expect those searches to go dramatically faster if the “keys”
are indexed, so the algorithm can find the key in O(log(n)). So there
the fundamental mechanism of a sorting algorithm is used (deciding where
to insert the new element, or adding it to the bin if it already exists).
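
Roughly, the two variants I have in mind look like this (only a sketch):

# variant 1: find the key whose counter needs to be incremented
counts = Hash.new(0)
ar.each { |x| counts[x] += 1 }

# variant 2: find the bin (a collection) to which the entry is added
bins = Hash.new { |h, k| h[k] = [] }
ar.each { |x| bins[x] << x }
Hash[bins.map { |k, v| [k, v.size] }] # counts recovered from the bin sizes

In both variants the per-element work is the key lookup, which is where I
suspected the O(log(n)) would come from.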

But maybe, because I chose a constant and low number of only
100 bins, this effect does not come out clearly …

I changed the number of “bins” from 100 to 1_000_000 and for certain
algorithms it has a larger impact than for others (but none are dramatic
in Ruby 1.9.3).

Binning 1_000_000 numbers over different numbers of bins · GitHub

The initial “Ralph Shneiver” algorithm is least affected.

Some tests explicitly use a Hash others

No. Hashes in 1.9.* only maintain insertion order - nothing else.

Do they have a sort of b-tree index on the keys internally?

Or is it object creation?

The difference more likely has to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.

OK, understood.

Why are certain methods leading to non-linear behavior?

I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.

“Ralph Shneiver:” => { result = Hash.new(0); ar.each { |x| result[x] += 1 } }

Could it be that the reason the “Ralph Shneiver” method remains stable and linear in all tests is that it only assigns max 100 entries in the hash and then does nothing more than increase a number (an “in place” action that does not allocate new memory and triggers no GC)?

Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.

Thanks.

I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub

If I understand correctly, you tested on MRI Ruby 1.9.3 and found
similar results?

Thanks,

Peter

Have you had a chance to look at Yura Sokolov’s Hash optimization
feature request?

Feature #5903: Optimize st_table (take 2) - Ruby master - Ruby Issue Tracking System

Jon


Fail fast. Fail often. Fail publicly. Learn. Adapt. Repeat.
http://thecodeshop.github.com | http://jonforums.github.com/
twitter: @jonforums

Peter,

Wednesday, January 18, 2012, 11:22:49 AM, you wrote:

PV> On Wed, Jan 18, 2012 at 2:57 PM, Robert K.
PV> [email protected]wrote:

There is no sorting going on.
PV> I beg to differ …

I agree with you. Obviously there is sorting going on.

PV> I would assume the only task with an order higher than O(n) is the
PV> “binning”.

PV> That is
PV> * finding the hash key for which the value needs to be incremented
PV> * or finding the set to which the entry needs to be added

PV> I would expect those searches to go dramatically faster if the “keys”
PV> are indexed, so the algorithm can find the key in O(log(n)). So there
PV> the fundamental mechanism of a sorting algorithm is used (deciding where
PV> to insert the new element, or adding it to the bin if it already exists).

I’m looking at the C code for uniq. I think I understand what it is
doing.

The C code makes a call to ary_make_hash. I can guess what it does. I
am wondering if there is an equivalent Ruby call.

uniq and what we are doing here are closely related.

The basic idea for uniq (in the case where there is no block given)
appears to be

  1. Create a hash from an array (i.e ary_make_hash). I’m guessing that
    duplicates are ignored by the nature of creating a hash.
  2. For each element of the original array, attempt to delete that key
    from the created hash. If you are able to do it, add that key to the
    uniq array.

Clever. I’m guessing doing it that way is faster than manipulating the
output from Hash#to_a.
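
In Ruby the idea would look roughly like this (my guess at what the C code
does, not the actual implementation):

def uniq_sketch(ary)
  hash = {}
  ary.each { |e| hash[e] = e } # roughly ary_make_hash; duplicates simply collapse
  result = []
  ary.each do |e|
    if hash.key?(e) # key still present => this is the first occurrence
      hash.delete(e)
      result << e
    end
  end
  result
end

uniq_sketch([4, 5, 6, 4, 5, 6, 6, 7]) # => [4, 5, 6, 7]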

PV> But maybe, because I choose a constant and low number of only
PV> 100 bins, this effect does not come out clearly …

PV> I changed the number of “bins” from 100 to 1_000_000 and for certain
PV> algorithms it has a larger impact then others (but none are dramatic
PV> in Ruby 1.9.3).

PV> Binning 1_000_000 numbers over different numbers of bins · GitHub

PV> The initial “Ralph Shneiver” algorithm is least affected.

It’s Shenlvar not Shnelver! :-)

PV> Some tests explicitly use a Hash others

No. Hashes in 1.9.* only maintain insertion order - nothing else.

PV> Do they have a sort of b-tree index on the keys internally?

Or is it object creation?

The difference more likely has to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.

PV> OK, understood.

Why are certain methods leading to non-linear behavior?

I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.

“Ralph Shneiver:” => { result = Hash.new(0); ar.each { |x| result[x] += 1 } }

PV> Could it be that the reason the “Ralph Shneiver” method remains stable
PV> and linear in all tests is that it only assigns max 100 entries in the
PV> hash and then does nothing more than increase a number (an “in place”
PV> action that does not allocate new memory and triggers no GC)?

I suspect your speculation is quite correct.

I am going to paste in stuff from another post for the reader’s
convenience. I am quoting AJ

AJ> n = 100_000
AJ> Benchmark.bm(15) do |b|
AJ>   b.report("Ralph Shneiver:") { n.times { a = [4,5,6,4,5,6,6,7];
AJ>     result = Hash.new(0); a.each { |x| result[x] += 1 }; result } }
AJ>   b.report("Sigurd:") { n.times {
AJ>     [4,5,6,4,5,6,6,7].inject(Hash.new(0)) { |res, x| res[x] += 1; res } } }
AJ>   b.report("Keinich #1") { n.times {
AJ>     Hash[a.group_by { |n| n }.map { |k, v| [k, v.size] }] } }
AJ>   b.report("Keinich #2") { n.times {
AJ>     Hash.new(0).tap { |h| a.each { |n| h[n] += 1 } } } }
AJ>   b.report("Magnus H.:") { n.times {
AJ>     [4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 } } }
AJ>   b.report("Abinoam #1:") { n.times {
AJ>     Hash[[4,5,6,4,5,6,6,7].sort.chunk { |n| n }.map { |ix, els| [ix, els.size] }] } }
AJ> end

“Sigurd” creates a new object, “res”, for each iteration.

“Keinich #1” appends stuff to a hash table value every time a new value
is detected. This has got to be more work.

“Keinich #2” is quite close to “Shnelvar” and the timings show this.

“Abinoam #1” seems to be doing a lot of work. At least two sorts. The
first an explicit one and the second an implied one in the creation of a
hash table.

Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.

PV> Thanks.

I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub

PV> If I understand correctly, you tested on MRI Ruby 1.9.3 and found
PV> similar results?

PV> Thanks,

PV> Peter

On Wed, Jan 18, 2012 at 9:26 PM, Ralph S. [email protected]
wrote:

PV> The initial “Ralph Shneiver” algorithm is least affected.

It’s Shenlvar not Shnelver! :-)

Very sorry for that … This is the second time on short notice
that I err with names :-/

I am going to paste in stuff from another post for the reader’s

convenience. I am quoting AJ

AJ> n = 100_000
AJ> Benchmark.bm(15) do |b|
AJ> b.report(“Ralph Shneiver:”) { n.times { a = [4,5,6,4,5,6,6,7];

I am afraid the misreading of your name started here …

Peter

On Wed, Jan 18, 2012 at 10:24 PM, Ryan D.
[email protected]wrote:

% grep -i sort hash.c st.c
%

Not conclusive, I know.

You’re more than welcome to read the two files. They’re less than 5k loc
in sum.

I stand corrected … (should have looked instead of speculated).

Do I understand correctly from st_lookup in st.c that:

  • if (table->entries_packed) there is a linear search in bins
  • else FIND_ENTRY uses a hashing mechanism to keep the cost of
    look-up essentially linear.

I had speculated that a b-tree with sorting was used to keep the
look-up cost low (but I presume now that is not the case)!

Thanks !

Peter

On Wed, Jan 18, 2012 at 10:50 PM, Peter V.
[email protected] wrote:

On Wed, Jan 18, 2012 at 10:24 PM, Ryan D. [email protected]wrote:

On Jan 18, 2012, at 12:26 , Ralph S. wrote:

I agree with you. Obviously there is sorting going on.

I stand corrected … (should have looked instead of speculated).

Do I understand correctly from st_lookup in st.c that:

  • if (table->entries_packed) there is a linear search in bins
  • else FIND_ENTRY uses a hashing mechanism to keep the cost of
    look-up essentially linear.

I had speculated that a b-tree with sorting was used to keep the
look-up cost low (but I presume now that is not the case)!

Peter, a Hash uses hashing to keep lookup cost low. This has
complexity O(1).
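
Grossly simplified, the mechanism looks like this (a toy sketch, not the
real st.c):

# a hash table is just an Array of small bins; the key's #hash picks the bin
bins = Array.new(11) { [] }

def store(bins, key, value)
  bins[key.hash % bins.size] << [key, value]
end

def fetch(bins, key)
  pair = bins[key.hash % bins.size].find { |k, _| k.eql?(key) }
  pair && pair[1]
end

store(bins, "foo", 1)
fetch(bins, "foo") # => 1

No comparison with < or > happens anywhere, so the keys never need to be
sortable.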

Cheers

robert

On Jan 18, 2012, at 12:26 , Ralph S. wrote:

I agree with you. Obviously there is sorting going on.

The nice thing about open source is you guys don’t need to sit around
and blindly speculate…

% cd RUBY19
/Users/ryan/Links/RUBY19
% grep -i sort hash.c st.c
%

Not conclusive, I know.

You’re more than welcome to read the two files. They’re less than 5k loc
in sum.

On Thu, Jan 19, 2012 at 9:10 AM, Robert K.
[email protected]wrote:

I stand corrected … (should have looked instead of speculated).
Peter, a Hash uses hashing to keep lookup cost low. This has complexity
O(1).

Hash table - Wikipedia

Thank you, indeed.

Actually, by now I understand it would be impossible to generically
use a b-tree for the keys of a hash, since objects that have no < or >
operator (and are thus not sortable) can still be used as keys …

1.9.3-p0 :001 > a = Object.new # => #<Object:0x9d88660>
1.9.3-p0 :002 > b = Object.new # => #<Object:0x9d814c8>
1.9.3-p0 :003 > a < b
NoMethodError: undefined method `<' for #<Object:0x9d88660>
1.9.3-p0 :004 > h = {a => '10', b => '100'}
=> {#<Object:0x9d88660>=>"10", #<Object:0x9d814c8>=>"100"}
1.9.3-p0 :005 > h[a]
=> "10"

Learned a lot, sorry for the fuss …

Peter

Robert,

RK> Peter, a Hash uses hashing to keep lookup cost low. This has
RK> complexity O(1).

Oh god, I feel so stupid.

Thanks for pointing that out.

RK> Hash table - Wikipedia

On Wed, Jan 18, 2012 at 6:25 PM, Peter V.
[email protected] wrote:

Peter

Sorry Ralph Shenlvar!

If your data items are integers, and from a rather small range (compared
to computer memory…), then you can use an array instead of a hash:

maxval = 10
result = Array.new(maxval + 1, 0)
ar.each { |x| result[x] += 1 }

This returns an array and not a hash.
[0, 0, 0, 0, 2, 2, 3, 1, 0, 0, 0]

To make a histogram, that data structure is even better. Otherwise you
need to transform it to a hash again. But for large data sets I still
expect it to be faster:
Your CPU does not need to calculate a hash key for every single data
item, because the data item is already a perfect index into the array. Also,
no hash key collisions can occur.
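
Reading the counts back out is straightforward, e.g. (just an illustration;
“histogram” is an arbitrary name):

result.each_with_index do |count, value|
  puts "#{value} occurs #{count} times" unless count.zero?
end

# or, if a Hash is wanted after all
histogram = {}
result.each_with_index { |count, value| histogram[value] = count unless count.zero? }
histogram # => {4=>2, 5=>2, 6=>3, 7=>1}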

Regards

Karsten Meier

On Mon, Jan 23, 2012 at 11:15 AM, Karsten Meier
[email protected] wrote:

[0, 0, 0, 0, 2, 2, 3, 1, 0, 0, 0]

To make a histogram, that data structure is even better. Otherwise you
need to transform it to a hash again. But for large data sets I still
expect it to be faster:

Don’t expect, measure. There’s Benchmark…

Your CPU does not need to calculate a hash key for every single data
item, because the data item is already a perfect index into the array. Also,
no hash key collisions can occur.

But if only a few of the numbers in the range are used you waste a
potentially large Array for just a few entries.

Kind regards

robert

On Mon, Jan 23, 2012 at 3:08 PM, Karsten Meier
[email protected] wrote:

result = Hash.new(0)
0.upto(maxval){|i| result[i] = hist[i] unless hist[i] == 0}

You could also do

hist.each_with_index { |c, i| result[i] = c if c.nonzero? }

result
}
}

That’s just part of the testing code, isn’t it? Why not share the
complete code?

Keinich #1 0.814000 0.000000 0.814000 ( 0.814000)
before mine. :-( It seems to get worse with big values of maxval, but not
with the jruby --1.9 option. It is not the array allocation itself that is
the problem.
Is it possible that group_by changes the internal array structure, so I
get a non-contiguous array?

No. It’s more likely that you are hit by GC I’d say. You could also
try Benchmark.bmbm for warm up before the test.

Kind regards

robert

On Mon, Jan 23, 2012 at 3:08 PM, Karsten Meier <
[email protected]> wrote:

result = Hash.new(0)
SIZE
1000000
MAXVAL
10000
user system total real
Ralph Shneiver: 0.533000 0.000000 0.533000 ( 0.518000)
Meier: 0.312000 0.000000 0.312000 ( 0.312000)
Keinich #1 0.814000 0.000000 0.814000 ( 0.814000)

(I have no 1.9.3 yet on my Windows PC, so it may be different there)

Interesting.

I added your algorithm to the list and tested on ruby 1.9.3

$ ruby -v
ruby 1.9.3p0 (2011-10-30 revision 33570) [i686-linux]

SIZE
1000000
MAXVAL
1000
user system total real
Ralph Shneiver: 0.370000 0.000000 0.370000 ( 0.369229)
Sigurd: 0.420000 0.000000 0.420000 ( 0.418634)
Meier: 0.270000 0.000000 0.270000 ( 0.274136)
Keinich #1 0.320000 0.000000 0.320000 ( 0.320962)
Keinich #2 0.380000 0.000000 0.380000 ( 0.372422)
Magnus H.: 0.420000 0.000000 0.420000 ( 0.423316)
Abinoam #1: 0.600000 0.000000 0.600000 ( 0.597028)

And I also retested in the latest jruby-head (1.7.0.dev)

$ ruby -v
jruby 1.7.0.dev (ruby-1.8.7-p357) (2012-01-23 f80ab05) (Java HotSpot™
Server VM 1.6.0_26) [linux-i386-java]

SIZE
1000000
MAXVAL
1000
user system total real
Ralph Shneiver: 0.492000 0.000000 0.492000 ( 0.476000)
Sigurd: 0.473000 0.000000 0.473000 ( 0.473000)
Meier: 0.287000 0.000000 0.287000 ( 0.287000)
Keinich #1 0.308000 0.000000 0.308000 ( 0.308000)
Keinich #2 7.374000 0.000000 7.374000 ( 7.374000)
Magnus H.: NoMethodError: undefined method `each_with_object' for
#<Array:0x19c6163>
file at sb.rb:30
times at org/jruby/RubyFixnum.java:261

So, at least for these 2 cases, your algorithm seems somewhat
faster.

As long as the array is not “sparsely” populated, this
approach certainly makes sense.

HTH,

Peter

On Mon, Jan 23, 2012 at 3:38 PM, Peter V.
[email protected]wrote:

I added your algorithm to the list and tested on ruby 1.9.3

The full code for my recent tests is here:

new benchmark with a "Meier" method added · GitHub

Peter