Uniq with count; better way?

Kenichi, let’s add the word “histogram” to the search.

Perhaps naming the method something “histogram” related would be a good
choice.

Abinoam Jr.

On Tue, Jan 17, 2012 at 20:02, Abinoam Jr. [email protected] wrote:

Kenichi, let’s add the word “histogram” to the search.

Perhaps naming the method something “histogram” related would be a good
choice.

A histogram is a diagram that visually represents frequencies. A hash is
not a histogram.

On 01/17/2012 12:07 PM, Adam P. wrote:

It’s a reasonable suggestion.

http://en.wikipedia.org/wiki/Histogram:

"""
In a more general mathematical sense, a histogram is a function m_i that
counts the number of observations that fall into each of the disjoint
categories (known as bins), whereas the graph of a histogram is merely
one way to represent a histogram.
"""

On Tue, Jan 17, 2012 at 12:08 PM, Peter V.
[email protected] wrote:

I assumed (incorrectly, it turns out, for Ruby 1.9.3) that for large data sets
there would be a significant performance difference, because determining the
“bins” of uniq values is essentially a form of “sorting”, which can be O(n^2)
if one is not careful.

There is no sorting going on. Some tests explicitly use a Hash, others
use Enumerable#group_by, which also uses a Hash (you can see it from
the return value).
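
For illustration, the two flavors boil down to something like this (just a
sketch, with made-up sample data):

ar = [4, 5, 6, 4, 5, 6, 6, 7]

# explicit Hash with a default value of 0
counts = Hash.new(0)
ar.each { |x| counts[x] += 1 }
counts # => {4=>2, 5=>2, 6=>3, 7=>1}

# Enumerable#group_by builds a Hash internally and returns one
grouped = ar.group_by { |x| x } # => {4=>[4, 4], 5=>[5, 5], 6=>[6, 6, 6], 7=>[7]}
Hash[grouped.map { |k, v| [k, v.size] }] # => {4=>2, 5=>2, 6=>3, 7=>1}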

Questions:

Why would ruby 1.9.3 be so much better at this than ruby 1.8.7?
Could it be because the Hash is now “ordered”, so it can use an efficient
algorithm when adding an entry to a bin?

No. Hashes in 1.9.* only maintain insertion order - nothing else.

Or is it object creation?

The difference more likely has to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.

Why are certain methods leading to non-linear behavior?

I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.

Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.
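
For example (a rough sketch of how I would set it up, names are arbitrary):

require 'benchmark'

ar = Array.new(1_000_000) { rand(100) }

# bmbm runs every block twice: a rehearsal pass for warm-up, then the timed pass
Benchmark.bmbm(15) do |b|
  b.report("Hash counter:") { h = Hash.new(0); ar.each { |x| h[x] += 1 }; h }
  b.report("group_by:") { Hash[ar.group_by { |x| x }.map { |k, v| [k, v.size] }] }
end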

I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub

Cheers

robert

On Tue, Jan 17, 2012 at 21:29, Joel VanderWerf
[email protected]wrote:

It’s a reasonable suggestion.

http://en.wikipedia.org/wiki/Histogram:

Hm! I haven’t heard that use before. Still, would that be well-understood?
If most people think of it as the visual representation of the frequency,
the method name might seem off. But thanks for pointing that one out. :-)

On Wed, Jan 18, 2012 at 2:57 PM, Robert K.
[email protected]wrote:

There is no sorting going on.
I beg to differ …

I would assume the only task with an order higher than O(n) is the
“binning”.

That is

  • finding the hash key for which the value needs to be incremented
  • or finding the set to which the entry needs to be added

I would expect those searches to go dramatically faster if the “keys”
are indexed, so the algorithm can find the key in O(log(n)). So there
the fundamental mechanism of a sorting algorithm is used (deciding where
to insert the new element, or adding it to the bin if it already exists).
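
Roughly, the two variants I have in mind look like this (only a sketch):

# variant 1: find the key whose counter needs to be incremented
counts = Hash.new(0)
ar.each { |x| counts[x] += 1 }

# variant 2: find the bin (a collection) to which the entry is added
bins = Hash.new { |h, k| h[k] = [] }
ar.each { |x| bins[x] << x }
Hash[bins.map { |k, v| [k, v.size] }] # counts recovered from the bin sizes

In both variants the per-element work is the key lookup, which is where I
suspected the O(log(n)) would come from.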

But maybe, because I chose a constant and low number of only
100 bins, this effect does not come out clearly …

I changed the number of “bins” from 100 to 1_000_000 and for certain
algorithms it has a larger impact than for others (but none are dramatic
in Ruby 1.9.3).

Binning 1_000_000 numbers over different numbers of bins · GitHub

The initial “Ralph Shneiver” algorithm is least affected.

Some tests explicitly use a Hash others

No. Hashes in 1.9.* only maintain insertion order - nothing else.

Do they have a sort of b-tree index on the keys internally?

Or is it object creation?

The difference more likely has to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.

OK, understood.

Why are certain methods leading to non-linear behavior?

I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.

“Ralph Shneiver:” => { result = Hash.new(0); ar.each { |x| result[x] += 1 } }

Could it be that the reason the “Ralph Shneiver” method remains stable and linear in all tests is that it only assigns max 100 entries in the hash and then does nothing more than increase a number (an “in place” action that does not allocate new memory and triggers no GC)?

Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.

Thanks.

I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub

If I understand correctly, you tested on MRI Ruby 1.9.3 and found
similar results?

Thanks,

Peter

Have you had a chance to look at Yura Sokolov’s Hash optimization
feature request?

Feature #5903: Optimize st_table (take 2) - Ruby master - Ruby Issue Tracking System

Jon


Fail fast. Fail often. Fail publicly. Learn. Adapt. Repeat.
http://thecodeshop.github.com | http://jonforums.github.com/
twitter: @jonforums

Peter,

Wednesday, January 18, 2012, 11:22:49 AM, you wrote:

PV> On Wed, Jan 18, 2012 at 2:57 PM, Robert K.
PV> [email protected]wrote:

There is no sorting going on.
PV> I beg to differ …

I agree with you. Obviously there is sorting going on.

PV> I would assume the only task with an order higher than O(n) is the
PV> “binning”.

PV> That is
PV> * finding the hash key for which the value needs to be incremented
PV> * or finding the set to which the entry needs to be added

PV> I would expect those searches to go dramatically faster if the “keys”
PV> are indexed, so the algorithm can find the key in O(log(n)). So there
PV> the fundamental mechanism of a sorting algorithm is used (deciding where
PV> to insert the new element, or adding it to the bin if it already exists).

I’m looking at the C code for uniq. I think I understand what it is
doing.

The C code makes a call to ary_make_hash. I can guess what it does. I
am wondering if there is an equivalent Ruby call.

uniq and what we are doing here are closely related.

The basic idea for uniq (in the case where there is no block given)
appears to be

  1. Create a hash from an array (i.e ary_make_hash). I’m guessing that
    duplicates are ignored by the nature of creating a hash.
  2. For each element of the original array, attempt to delete that key
    from the created hash. If you are able to do it, add that key to the
    uniq array.

Clever. I’m guessing doing it that way is faster than manipulating the
output from Hash#to_a.
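
In Ruby the idea would look roughly like this (my guess at what the C code
does, not the actual implementation):

def uniq_sketch(ary)
  hash = {}
  ary.each { |e| hash[e] = e } # roughly ary_make_hash; duplicates simply collapse
  result = []
  ary.each do |e|
    if hash.key?(e) # key still present => this is the first occurrence
      hash.delete(e)
      result << e
    end
  end
  result
end

uniq_sketch([4, 5, 6, 4, 5, 6, 6, 7]) # => [4, 5, 6, 7]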

PV> But maybe, because I choose a constant and low number of only
PV> 100 bins, this effect does not come out clearly …

PV> I changed the number of “bins” from 100 to 1_000_000 and for certain
PV> algorithms it has a larger impact then others (but none are dramatic
PV> in Ruby 1.9.3).

PV> Binning 1_000_000 numbers over different numbers of bins · GitHub

PV> The initial “Ralph Shneiver” algorithm is least affected.

It’s Shenlvar not Shnelver! :-)

PV> Some tests explicitly use a Hash others

No. Hashes in 1.9.* only maintain insertion order - nothing else.

PV> Do they have a sort of b-tree index on the keys internally?

Or is it object creation?

The difference more likely has to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.

PV> OK, understood.

Why are certain methods leading to non-linear behavior?

I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.

“Ralph Shneiver:” => { result = Hash.new(0); ar.each { |x| result[x] += 1 } }

PV> Could it be that the reason the “Ralph Shneiver” method remains stable
PV> and linear in all tests is that it only assigns max 100 entries in the
PV> hash and then does nothing more than increase a number (an “in place”
PV> action that does not allocate new memory and triggers no GC)?

I suspect your speculation is quite correct.

I am going to paste in stuff from another post for the reader’s
convenience. I am quoting AJ

AJ> n = 100_000
AJ> Benchmark.bm(15) do |b|
AJ>   b.report("Ralph Shneiver:") { n.times { a = [4,5,6,4,5,6,6,7];
AJ>     result = Hash.new(0); a.each { |x| result[x] += 1 }; result } }
AJ>   b.report("Sigurd:") { n.times {
AJ>     [4,5,6,4,5,6,6,7].inject(Hash.new(0)) { |res, x| res[x] += 1; res } } }
AJ>   b.report("Keinich #1") { n.times {
AJ>     Hash[a.group_by { |n| n }.map { |k, v| [k, v.size] }] } }
AJ>   b.report("Keinich #2") { n.times {
AJ>     Hash.new(0).tap { |h| a.each { |n| h[n] += 1 } } } }
AJ>   b.report("Magnus H.:") { n.times {
AJ>     [4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 } } }
AJ>   b.report("Abinoam #1:") { n.times {
AJ>     Hash[[4,5,6,4,5,6,6,7].sort.chunk { |n| n }.map { |ix, els| [ix, els.size] }] } }
AJ> end

“Sigurd” creates a new object, “res”, for each iteration.

“Keinich #1” appends stuff to a hash table value every time a new value
is detected. This has got to be more work.

“Keinich #2” is quite close to “Shnelvar” and the timings show this.

“Abinoam #1” seems to be doing a lot of work. At least two sorts. The
first an explicit one and the second an implied one in the creation of a
hash table.

Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.

PV> Thanks.

I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub

PV> If I understand correctly, you tested on MRI Ruby 1.9.3 and found
PV> similar results?

PV> Thanks,

PV> Peter

On Wed, Jan 18, 2012 at 9:26 PM, Ralph S. [email protected]
wrote:

PV> The initial “Ralph Shneiver” algorithm is least affected.

It’s Shenlvar not Shnelver! :-)

Very sorry for that … This is the second time on short notice
that I err with names :-/

I am going to paste in stuff from another post for the reader’s

convenience. I am quoting AJ

AJ> n = 100_000
AJ> Benchmark.bm(15) do |b|
AJ> b.report(“Ralph Shneiver:”) { n.times { a = [4,5,6,4,5,6,6,7];

I am afraid the misreading of your name started here …

Peter

On Wed, Jan 18, 2012 at 10:24 PM, Ryan D.
[email protected]wrote:

% grep -i sort hash.c st.c
%

Not conclusive, I know.

You’re more than welcome to read the two files. They’re less than 5k loc
in sum.

I stand corrected … (should have looked instead of speculated).

Do I understand correctly from st_lookup in st.c that:

  • if (table->entries_packed) there is a linear search in bins
  • else FIND_ENTRY uses a hashing mechanism to keep the cost of
    look-up essentially linear.

I had speculated that a b-tree with sorting was used to keep the
look-up cost low (but I presume now that is not the case)!

Thanks !

Peter

On Wed, Jan 18, 2012 at 10:50 PM, Peter V.
[email protected] wrote:

On Wed, Jan 18, 2012 at 10:24 PM, Ryan D. [email protected]wrote:

On Jan 18, 2012, at 12:26 , Ralph S. wrote:

I agree with you. Obviously there is sorting going on.

I stand corrected … (should have looked instead of speculated).

Do I understand correctly from st_lookup in st.c that:

  • if (table->entries_packed) there is a linear search in bins
  • else FIND_ENTRY uses a hashing mechanism to keep the cost of
    look-up essentially linear.

I had speculated that a b-tree with sorting was used to keep the
look-up cost low (but I presume now that is not the case)!

Peter, a Hash uses hashing to keep lookup cost low. This has
complexity O(1).
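
Grossly simplified, the mechanism looks like this (a toy sketch, not the
real st.c):

# a hash table is just an Array of small bins; the key's #hash picks the bin
bins = Array.new(11) { [] }

def store(bins, key, value)
  bins[key.hash % bins.size] << [key, value]
end

def fetch(bins, key)
  pair = bins[key.hash % bins.size].find { |k, _| k.eql?(key) }
  pair && pair[1]
end

store(bins, "foo", 1)
fetch(bins, "foo") # => 1

No comparison with < or > happens anywhere, so the keys never need to be
sortable.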

Cheers

robert

On Jan 18, 2012, at 12:26 , Ralph S. wrote:

I agree with you. Obviously there is sorting going on.

The nice thing about open source is you guys don’t need to sit around
and blindly speculate…

% cd RUBY19
/Users/ryan/Links/RUBY19
% grep -i sort hash.c st.c
%

Not conclusive, I know.

You’re more than welcome to read the two files. They’re less than 5k loc
in sum.

On Thu, Jan 19, 2012 at 9:10 AM, Robert K.
[email protected]wrote:

I stand corrected … (should have looked instead of speculated).
Peter, a Hash uses hashing to keep lookup cost low. This has complexity
O(1).

Hash table - Wikipedia

Thank you, indeed.

Actually, by now I understand it would be impossible to generically
use a b-tree for the keys of a hash, since objects that have no < or >
operator (and are thus not sortable) can still be used as keys …

1.9.3-p0 :001 > a = Object.new # => #<Object:0x9d88660>
1.9.3-p0 :002 > b = Object.new # => #<Object:0x9d814c8>
1.9.3-p0 :003 > a < b
NoMethodError: undefined method `<' for #<Object:0x9d88660>
1.9.3-p0 :004 > h = {a => '10', b => '100'}
=> {#<Object:0x9d88660>=>"10", #<Object:0x9d814c8>=>"100"}
1.9.3-p0 :005 > h[a]
=> "10"

Learned a lot, sorry for the fuss …

Peter

Robert,

RK> Peter, a Hash uses hashing to keep lookup cost low. This has
RK> complexity O(1).

Oh god, I feel so stupid.

Thanks for pointing that out.

RK> Hash table - Wikipedia

On Wed, Jan 18, 2012 at 6:25 PM, Peter V.
[email protected] wrote:

Peter

Sorry Ralph Shenlvar!

If your data items are integers, and from a rather small range (compared
to computer memory…), then you can use an array instead of a hash:

maxval = 10
result = Array.new(maxval + 1, 0)
ar.each { |x| result[x] += 1 }

This returns an array and not a hash.
[0, 0, 0, 0, 2, 2, 3, 1, 0, 0, 0]

To make a histogram, that data structure is even better. Otherwise you
need to transform it to a hash again. But for large data sets I still
expect it to be faster:
Your CPU does not need to calculate a hash key for every single data
item, because the data item is already a perfect index into the array. Also,
no hash key collisions can occur.
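
Reading the counts back out is straightforward, e.g. (just an illustration;
“histogram” is an arbitrary name):

result.each_with_index do |count, value|
  puts "#{value} occurs #{count} times" unless count.zero?
end

# or, if a Hash is wanted after all
histogram = {}
result.each_with_index { |count, value| histogram[value] = count unless count.zero? }
histogram # => {4=>2, 5=>2, 6=>3, 7=>1}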

Regards

Karsten Meier

On Mon, Jan 23, 2012 at 11:15 AM, Karsten Meier
[email protected] wrote:

[0, 0, 0, 0, 2, 2, 3, 1, 0, 0, 0]

To make a histogram, that data structure is even better. Otherwise you
need to transform it to a hash again. But for large data sets I still
expect it to be faster:

Don’t expect, measure. There’s Benchmark…

Your CPU does not need to calculate a hash key for every single data
item, because the data item is already a perfect index into the array. Also,
no hash key collisions can occur.

But if only a few of the numbers in the range are used you waste a
potentially large Array for just a few entries.

Kind regards

robert

On Mon, Jan 23, 2012 at 3:08 PM, Karsten Meier
[email protected] wrote:

result = Hash.new(0)
0.upto(maxval){|i| result[i] = hist[i] unless hist[i] == 0}

You could also do

hist.each_with_index { |c, i| result[i] = c if c.nonzero? }

result
}
}

That’s just part of the testing code, isn’t it? Why not share the
complete code?

Keinich #1 0.814000 0.000000 0.814000 ( 0.814000)
before mine. :-( It seems to get worse with big values of maxval, but not
with the jruby --1.9 option. It is not the array allocation itself that is
the problem.
Is it possible that group_by changes the internal array structure, so I
get a non-contiguous array?

No. It’s more likely that you are hit by GC I’d say. You could also
try Benchmark.bmbm for warm up before the test.

Kind regards

robert

On Mon, Jan 23, 2012 at 3:08 PM, Karsten Meier <
[email protected]> wrote:

result = Hash.new(0)
SIZE
1000000
MAXVAL
10000
user system total real
Ralph Shneiver: 0.533000 0.000000 0.533000 ( 0.518000)
Meier: 0.312000 0.000000 0.312000 ( 0.312000)
Keinich #1 0.814000 0.000000 0.814000 ( 0.814000)

(I have no 1.9.3 yet on my Windows PC, so it may be different there)

Interesting.

I added your algorithm to the list and tested on ruby 1.9.3

$ ruby -v
ruby 1.9.3p0 (2011-10-30 revision 33570) [i686-linux]

SIZE
1000000
MAXVAL
1000
user system total real
Ralph Shneiver: 0.370000 0.000000 0.370000 ( 0.369229)
Sigurd: 0.420000 0.000000 0.420000 ( 0.418634)
Meier: 0.270000 0.000000 0.270000 ( 0.274136)
Keinich #1 0.320000 0.000000 0.320000 ( 0.320962)
Keinich #2 0.380000 0.000000 0.380000 ( 0.372422)
Magnus H.: 0.420000 0.000000 0.420000 ( 0.423316)
Abinoam #1: 0.600000 0.000000 0.600000 ( 0.597028)

And I also retested in the latest jruby-head (1.7.0.dev)

$ ruby -v
jruby 1.7.0.dev (ruby-1.8.7-p357) (2012-01-23 f80ab05) (Java HotSpot™
Server VM 1.6.0_26) [linux-i386-java]

SIZE
1000000
MAXVAL
1000
user system total real
Ralph Shneiver: 0.492000 0.000000 0.492000 ( 0.476000)
Sigurd: 0.473000 0.000000 0.473000 ( 0.473000)
Meier: 0.287000 0.000000 0.287000 ( 0.287000)
Keinich #1 0.308000 0.000000 0.308000 ( 0.308000)
Keinich #2 7.374000 0.000000 7.374000 ( 7.374000)
Magnus H.: NoMethodError: undefined method `each_with_object' for
#<Array:0x19c6163>
file at sb.rb:30
times at org/jruby/RubyFixnum.java:261

So, at least for these 2 cases, your algorithm seems somewhat
faster.

As long as the array is not “sparsely” populated, this
approach certainly makes sense.

HTH,

Peter

On Mon, Jan 23, 2012 at 3:38 PM, Peter V.
[email protected]wrote:

I added your algorithm to the list and tested on ruby 1.9.3

The full code for my recent tests is here:

new benchmark with a "Meier" method added · GitHub

Peter