Kenichi let’s add the word “histogram” to the search.
Perhaps naming the method something “histogram” related would be a good
choice.
Abinoam Jr.
On Tue, Jan 17, 2012 at 20:02, Abinoam Jr. [email protected] wrote:
Kenichi let’s add the word “histogram” to the search.
Perhaps naming the method something “histogram” related would be a good
choice.
A histogram is a diagram, visually representing frequencies. A hash is
not a histogram.
On 01/17/2012 12:07 PM, Adam P. wrote:
It’s a reasonable suggestion.
http://en.wikipedia.org/wiki/Histogram:
"""
In a more general mathematical sense, a histogram is a function mi that
counts the number of observations that fall into each of the disjoint
categories (known as bins), whereas the graph of a histogram is merely
one way to represent a histogram.
"""
On Tue, Jan 17, 2012 at 12:08 PM, Peter V.
[email protected] wrote:
I assumed (incorrectly, it turns out, for Ruby 1.9.3) that for large
data sets there would be a significant performance difference, because
determining the “bins” of unique values is essentially a form of
“sorting”, which can be O(n^2) if one is not careful.
There is no sorting going on. Some tests explicitly use a Hash; others
use Enumerable#group_by, which also uses a Hash (you can see it from
the return value).
Questions:
Why would Ruby 1.9.3 be so much better at this than Ruby 1.8.7?
Could it be because the Hash is now “ordered”, so it can use an efficient
algorithm when adding an entry to a bin?
No. Hashes in 1.9.* only maintain insertion order - nothing else.
Or is it object creation?
The difference is more likely to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.
Why are certain methods leading to non-linear behavior?
I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.
Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.
I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub
Cheers
robert
On Tue, Jan 17, 2012 at 21:29, Joel VanderWerf
[email protected]wrote:
It’s a reasonable suggestion.
http://en.wikipedia.org/wiki/Histogram:
Hm! I haven’t heard that use before. Still, would that be
well-understood? If most people think of it as the visual representation
of the frequency, the method name might seem off. But thanks for
pointing that one out.
On Wed, Jan 18, 2012 at 2:57 PM, Robert K.
[email protected]wrote:
There is no sorting going on.
I beg to differ …
I would assume the only task with an order higher than O(n) is the
“binning”. That is:
* finding the hash key for which the value needs to be incremented
* or finding the set to which the entry needs to be added
I would expect those searches to go dramatically faster if the “keys”
are indexed, so the algorithm can find the key in O(log(n)). So there
the fundamentals of a sorting algorithm are used (where to insert the
new element, or add it to the bin if it already exists).
But maybe, because I chose a constant and low number of only
100 bins, this effect does not come out clearly …
I changed the number of “bins” from 100 to 1_000_000 and for certain
algorithms it has a larger impact than others (but none are dramatic
in Ruby 1.9.3).
Binning 1_000_000 numbers over different numbers of bins · GitHub
The initial “Ralph Shneiver” algorithm is least affected.
Some tests explicitly use a Hash others
No. Hashes in 1.9.* only maintain insertion order - nothing else.
Do they have a sort of b-tree index on the keys internally?
Or is it object creation?
The difference is more likely to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.
OK, understood.
Why are certain methods leading to non-linear behavior?
I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.
"Ralph Shneiver:" => { result = Hash.new(0); ar.each { |x| result[x] += 1 } }
Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.
Thanks.
I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub
If I understand correctly, you tested for MRI Ruby 1.9.3 and found
similar results?
Thanks,
Peter
Have you had a chance to look at Yura Sokolov’s Hash optimization
feature request?
Feature #5903: Optimize st_table (take 2) - Ruby master - Ruby Issue Tracking System
Jon
Fail fast. Fail often. Fail publicly. Learn. Adapt. Repeat.
http://thecodeshop.github.com | http://jonforums.github.com/
twitter: @jonforums
Peter,
Wednesday, January 18, 2012, 11:22:49 AM, you wrote:
PV> On Wed, Jan 18, 2012 at 2:57 PM, Robert K.
PV> [email protected]wrote:
There is no sorting going on.
PV> I beg to differ …
I agree with you. Obviously there is sorting going on.
PV> I would assume the only task with an order higher than O(n) is the
PV> “binning”.
PV> That is
PV> * finding the hash key for which the value needs to be incremented
PV> * or finding the set to which the entry needs to be added
PV> I would expect those searches to go dramatically faster if the “keys”
PV> are indexed, so the algorithm can find the key in O(log(n)). So there
PV> the fundamental of a sorting algorithm is used (where to insert the
PV> new element, or add it to the bin if it already exists).
I’m looking at the C code for uniq. I think I understand what it is
doing.
The C code makes a call to ary_make_hash. I can guess what it does. I
am wondering if there is an equivalent Ruby call.
uniq and what we are doing here are closely related.
The basic idea for uniq (in the case where there is no block given)
appears to be
Clever. I’m guessing doing it that way is faster than manipulating the
output from Hash#to_a.
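A Ruby-level sketch (my guess at the idea, not the actual C from array.c) of the hash-based approach to uniq being described:

```ruby
# Hash-based uniq, roughly the idea ary_make_hash enables in C (a sketch):
# use each element as a hash key, then take the keys -- duplicates collapse,
# and no comparison sort is needed.
def uniq_via_hash(ary)
  h = {}
  ary.each { |x| h[x] = true }
  h.keys  # in 1.9+, keys come back in insertion order
end

uniq_via_hash([4, 5, 6, 4, 5, 6, 6, 7])  # => [4, 5, 6, 7]
```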
PV> But maybe, because I chose a constant and low number of only
PV> 100 bins, this effect does not come out clearly …
PV> I changed the number of “bins” from 100 to 1_000_000 and for certain
PV> algorithms it has a larger impact than others (but none are dramatic
PV> in Ruby 1.9.3).
PV> Binning 1_000_000 numbers over different numbers of bins · GitHub
PV> The initial “Ralph Shneiver” algorithm is least affected.
It’s Shnelvar, not Shneiver!
PV> Some tests explicitly use a Hash others
No. Hashes in 1.9.* only maintain insertion order - nothing else.
PV> Do they have a sort of b_tree index on the keys internally ?
Or is it object creation?
The difference is more likely to do with the dramatically changed
implementation of MRI between 1.8.* and 1.9.*. Maybe there were also
some Hash implementation tweaks but the ordering is certainly not
responsible.
PV> OK, understood.
Why are certain methods leading to non-linear behavior?
I assume that this might have to do with memory allocation and GC.
Choosing n=1 for repetition is too low to yield proper results IMHO
since GC for the previous test might kick in etc.
"Ralph Shneiver:" => { result = Hash.new(0); ar.each { |x| result[x] += 1 } }
PV>
PV> Could it be that the reason that the “Ralph Shneiver” method remains
PV> stable and linear in all tests, is that it only assigns max 100 entries
PV> in the hash and then does nothing more than increase a number (an
PV> “in place” action that does not assign new memory and triggers no GC).
PV>
I suspect your speculation is quite correct.
I am going to paste in stuff from another post for the reader’s
convenience. I am quoting AJ
AJ> n = 100_000
AJ> Benchmark.bm(15) do |b|
AJ>   b.report("Ralph Shneiver:") { n.times { a = [4,5,6,4,5,6,6,7];
AJ>     result = Hash.new(0); a.each { |x| result[x] += 1 }; result } }
AJ>   b.report("Sigurd:") { n.times {
AJ>     [4,5,6,4,5,6,6,7].inject(Hash.new(0)) {|res, x| res[x] += 1; res } } }
AJ>   b.report("Keinich #1") { n.times {
AJ>     Hash[a.group_by{|n|n}.map{|k, v|[k, v.size]}] } }
AJ>   b.report("Keinich #2") { n.times {
AJ>     Hash.new(0).tap{|h|a.each{|n|h[n] += 1}} } }
AJ>   b.report("Magnus H.:") { n.times {
AJ>     [4,5,6,4,5,6,6,7].each_with_object(Hash.new(0)) { |x, res| res[x] += 1 } } }
AJ>   b.report("Abinoam #1:") { n.times { Hash[
AJ>     [4,5,6,4,5,6,6,7].sort.chunk {|n| n}.map {|ix, els| [ix, els.size] } ] } }
AJ> end
“Sigurd” creates a new object, “res”, for each iteration of each
“Keinich #1” appends stuff to a hash table value every time a new value
is detected. This has got to be more work.
“Keinich #2” is quite close to “Shnelvar” and the timings show this.
“Abinoam #1” seems to be doing a lot of work: at least two sorts, the
first an explicit one and the second an implied one in the creation of a
hash table.
Btw. with Benchmark.bmbm you can have a warmup phase before the real
measurement.
PV> Thanks.
I tried to tweak a bit but not much gain (~1% at most):
Couting occurrencies in an Enumerable · GitHub
PV> If I understand correctly, you tested for MRI Ruby 1.9.3 and found
PV> similar results?
PV> Thanks,
PV> Peter
On Wed, Jan 18, 2012 at 9:26 PM, Ralph S. [email protected]
wrote:
PV> The initial “Ralph Shneiver” algorithm is least affected.
It’s Shnelvar, not Shneiver!
Very sorry for that … This is the second time on short notice
that I err with names :-/
I am going to paste in stuff from another post for the reader’s
convenience. I am quoting AJ
AJ> n = 100_000
AJ> Benchmark.bm(15) do |b|
AJ> b.report(“Ralph Shneiver:”) { n.times { a = [4,5,6,4,5,6,6,7];
I am afraid the misreading of your name started here …
Peter
On Wed, Jan 18, 2012 at 10:24 PM, Ryan D.
[email protected]wrote:
% grep -i sort hash.c st.c
%
Not conclusive, I know.
You’re more than welcome to read the two files. They’re less than 5k loc
in sum.
I stand corrected … (should have looked instead of speculated).
Do I understand correctly from st_lookup in st.c that:
- if (table->entries_packed) there is a linear search in bins
- else FIND_ENTRY uses a hashing mechanism to keep the cost of
look-up essentially linear.
I had speculated that a b-tree with sorting was used to keep the
look-up cost low (but I presume now that is not the case)!
Thanks !
Peter
On Wed, Jan 18, 2012 at 10:50 PM, Peter V.
[email protected] wrote:
On Wed, Jan 18, 2012 at 10:24 PM, Ryan D. [email protected]wrote:
On Jan 18, 2012, at 12:26 , Ralph S. wrote:
I agree with you. Obviously there is sorting going on.
…
I stand corrected … (should have looked instead of speculated).
Do I understand correctly from st_lookup in st.c that:
- if (table->entries_packed) there is a linear search in bins
- else FIND_ENTRY uses a hashing mechanism to keep the cost of
look-up essentially linear.
I had speculated that a b-tree with sorting was used to keep the
look-up cost low (but I presume now that is not the case)!
Peter, a Hash uses hashing to keep lookup cost low. This has
complexity O(1).
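To illustrate Robert's point, here is a toy bucketed hash table (my own sketch, far simpler than the real st.c): the key's hash selects a bin directly, and only that one short bin is scanned, so average lookup cost stays O(1) regardless of table size.

```ruby
# Toy bucketed hash table -- a simplified illustration, not the real st.c.
class TinyHash
  NBINS = 8

  def initialize
    @bins = Array.new(NBINS) { [] }  # each bin holds [key, value] pairs
  end

  def []=(key, val)
    bin = @bins[key.hash % NBINS]    # the hash value picks the bin directly
    if (pair = bin.find { |k, _| k == key })
      pair[1] = val
    else
      bin << [key, val]
    end
  end

  def [](key)
    # lookup scans only one bin, never the whole table
    pair = @bins[key.hash % NBINS].find { |k, _| k == key }
    pair && pair[1]
  end
end
```

No sort order is ever consulted, which fits Ryan's `grep -i sort` finding nothing.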
Cheers
robert
On Jan 18, 2012, at 12:26 , Ralph S. wrote:
I agree with you. Obviously there is sorting going on.
The nice thing about open source is you guys don’t need to sit around
and blindly speculate…
% cd RUBY19
/Users/ryan/Links/RUBY19
% grep -i sort hash.c st.c
%
Not conclusive, I know.
You’re more than welcome to read the two files. They’re less than 5k loc
in sum.
On Thu, Jan 19, 2012 at 9:10 AM, Robert K.
[email protected]wrote:
I stand corrected … (should have looked instead of speculated).
Peter, a Hash uses hashing to keep lookup cost low. This has complexity
O(1).
Thank you, indeed.
Actually, by now I understand it would be impossible to generically
use a b-tree for the keys of a hash, since some objects that can be used
as a key have no < or > operator and are thus not sortable …
1.9.3-p0 :001 > a = Object.new # => #<Object:0x9d88660>
1.9.3-p0 :002 > b = Object.new # => #<Object:0x9d814c8>
1.9.3-p0 :003 > a < b
NoMethodError: undefined method `<' for #<Object:0x9d88660>
1.9.3-p0 :004 > h = {a => '10', b => '100'}
=> {#<Object:0x9d88660>=>"10", #<Object:0x9d814c8>=>"100"}
1.9.3-p0 :005 > h[a]
=> "10"
Learned a lot, sorry for the fuss …
Peter
Robert,
RK> Peter, a Hash uses hashing to keep lookup cost low. This has
complexity O(1).
Oh god, I feel so stupid.
Thanks for pointing that out.
If your data items are integers, and from a rather small range (compared
to computer memory…), then you can use an array instead of a hash:
maxval = 10
result = Array.new(maxval + 1, 0)
ar.each { |x| result[x] += 1 }
This returns an array and not a hash:
[0, 0, 0, 0, 2, 2, 3, 1, 0, 0, 0]
To make a histogram, that data structure is even better. Otherwise you
need to transform it to a hash again. But for large data sets I still
expect it to be faster:
Your CPU does not need to calculate a hash key for every single data
item, because the data item is already a perfect key into the array.
Also, no hash key collisions can occur.
Regards
Karsten Meier
On Mon, Jan 23, 2012 at 11:15 AM, Karsten Meier
[email protected] wrote:
[0, 0, 0, 0, 2, 2, 3, 1, 0, 0, 0]
To make a histogram, that data structure is even better. Otherwise you
need to transform it to a hash again. But for large data sets I still
expect it to be faster:
Don’t expect, measure. There’s Benchmark…
Your cpu does not need to calculate a hash key of every single data
item, because the data item is already a perfect key for the array. Also
no hash key collisions can occur.
But if only few of the numbers in the range are used you waste a
potentially large Array for just a few entries.
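A quick sketch (my own, not from the thread) of the tradeoff Robert describes: with sparse data the array approach allocates a counter for every value in the range, while the hash stores only the keys actually seen.

```ruby
# Sparse data: only two distinct values, but one of them is large.
ar = [1_000_000, 1_000_000, 5]

# Array approach: must allocate max+1 counters up front.
arr = Array.new(ar.max + 1, 0)
ar.each { |x| arr[x] += 1 }

# Hash approach: stores only the keys that actually occur.
h = Hash.new(0)
ar.each { |x| h[x] += 1 }

arr.size  # => 1000001 slots for just 2 distinct values
h.size    # => 2
```

So the array wins when the value range is dense and small, the hash when it is sparse.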
Kind regards
robert
On Mon, Jan 23, 2012 at 3:08 PM, Karsten Meier
[email protected] wrote:
result = Hash.new(0)
0.upto(maxval){|i| result[i] = hist[i] unless hist[i] == 0}
You could also do
result.each_with_index {|c,i| result[i] = c if c.nonzero?}
result
}
}
That’s just part of the testing code, isn’t it? Why not share the
complete code?
Keinich #1 0.814000 0.000000 0.814000 ( 0.814000)
before mine. It seems to get worse with big values of maxval, but not
with the jruby --1.9 option. It is not the array allocation itself that is
the problem.
Is it possible that group_by changes the internal array structure, so I
get a non-contiguous array?
No. It’s more likely that you are hit by GC I’d say. You could also
try Benchmark.bmbm for warm up before the test.
Kind regards
robert
On Mon, Jan 23, 2012 at 3:08 PM, Karsten Meier <
[email protected]> wrote:
result = Hash.new(0)
SIZE
1000000
MAXVAL
10000
user system total real
Ralph Shneiver: 0.533000 0.000000 0.533000 ( 0.518000)
Meier: 0.312000 0.000000 0.312000 ( 0.312000)
Keinich #1 0.814000 0.000000 0.814000 ( 0.814000)
(I have no 1.9.3 yet on my Windows PC, so it may be different there)
Interesting.
I added your algorithm to the list and tested on ruby 1.9.3
$ ruby -v
ruby 1.9.3p0 (2011-10-30 revision 33570) [i686-linux]
SIZE
1000000
MAXVAL
1000
user system total real
Ralph Shneiver: 0.370000 0.000000 0.370000 ( 0.369229)
Sigurd: 0.420000 0.000000 0.420000 ( 0.418634)
Meier: 0.270000 0.000000 0.270000 ( 0.274136)
Keinich #1 0.320000 0.000000 0.320000 ( 0.320962)
Keinich #2 0.380000 0.000000 0.380000 ( 0.372422)
Magnus H.: 0.420000 0.000000 0.420000 ( 0.423316)
Abinoam #1: 0.600000 0.000000 0.600000 ( 0.597028)
And I also retested in the latest jruby-head (1.7.0.dev)
$ ruby -v
jruby 1.7.0.dev (ruby-1.8.7-p357) (2012-01-23 f80ab05) (Java HotSpot™
Server VM 1.6.0_26) [linux-i386-java]
SIZE
1000000
MAXVAL
1000
user system total real
Ralph Shneiver: 0.492000 0.000000 0.492000 ( 0.476000)
Sigurd: 0.473000 0.000000 0.473000 ( 0.473000)
Meier: 0.287000 0.000000 0.287000 ( 0.287000)
Keinich #1 0.308000 0.000000 0.308000 ( 0.308000)
Keinich #2 7.374000 0.000000 7.374000 ( 7.374000)
Magnus H.: NoMethodError: undefined method `each_with_object' for
#<Array:0x19c6163>
file at sb.rb:30
times at org/jruby/RubyFixnum.java:261
…
So, at least for these 2 cases, your algorithm seems somewhat
faster.
As long as the array is not “sparsely” populated, this
approach certainly makes sense.
HTH,
Peter
On Mon, Jan 23, 2012 at 3:38 PM, Peter V.
[email protected]wrote:
I added your algorithm to the list and tested on ruby 1.9.3
The full code for my recent tests is here:
new benchmark with a "Meier" method added · GitHub
Peter