Indexing hash with longer strings faster?

Hi, I have some weird results measuring Ruby’s hash operations,
particularly indexing a hash using a two-member string array. Have a look:

require 'benchmark'
=> false

puts(Benchmark.measure do
  h = {}
  1000000.times do |i|
    i1 = rand(100)
    i2 = rand(100)
    a = [i1.to_s, i2.to_s]
    h[a] ||= 0
    h[a] += 1
  end
  puts h.size
end)
10000
 12.900000   0.050000  12.950000 ( 13.088721)
=> nil

puts(Benchmark.measure do
  h = {}
  1000000.times do |i|
    i1 = rand(100)
    i2 = rand(100)
    a = [i1.to_s + '00000', i2.to_s + '00000']
    h[a] ||= 0
    h[a] += 1
  end
  puts h.size
end)
10000
  7.700000   0.040000   7.740000 (  7.858381)
=> nil

In other words, I just made the array members slightly longer, and working
with the hash became considerably faster.

Why’s that happening?

BTW,

jablan-mbp:~ $ ruby -v
ruby 1.8.6 (2008-08-11 patchlevel 287) [universal-darwin9.0]

On Tue, Jul 28, 2009 at 6:50 PM, jablan [email protected] wrote:

Benchmarking is tricky business. I am not sure my benchmarks are worth
much, but they do not confirm your results. BTW using: ruby 1.9.1p243
(2009-07-16 revision 24175) [i686-linux]

First, you really should not benchmark inside irb. Second, you should use
bmbm for rehearsals, but even with that you could get bitten by tons of
things happening on your machine.
Nevertheless, for what it's worth - and because I adore benchmarking :wink:

Here are my code and results:

require 'benchmark'

N = 100_000
5.times do
  Benchmark.bmbm do |bm|
    h = Hash::new( 0 )
    bm.report( "2 chars" ) do
      N.times do
        h[ [rand(100).to_s, rand(100).to_s] ] += 1
      end
    end
    h = Hash::new( 0 )
    bm.report( "7 chars" ) do
      N.times do
        h[ [rand(100).to_s + "00000", rand(100).to_s + "00000"] ] += 1
      end
    end
  end
end
robert@siena:~/log/ruby/ML 19:50:13
519/25 > ruby bm1.rb
Rehearsal -------------------------------------------
2 chars 0.700000 0.010000 0.710000 ( 0.953483)
7 chars 0.960000 0.010000 0.970000 ( 1.367117)
---------------------------------- total: 1.680000sec

          user     system      total        real

2 chars 0.740000 0.020000 0.760000 ( 1.031102)
7 chars 1.010000 0.020000 1.030000 ( 1.323072)
Rehearsal -------------------------------------------
2 chars 0.720000 0.000000 0.720000 ( 0.955180)
7 chars 0.980000 0.010000 0.990000 ( 1.320693)
---------------------------------- total: 1.710000sec

          user     system      total        real

2 chars 0.710000 0.000000 0.710000 ( 0.923019)
7 chars 0.900000 0.010000 0.910000 ( 1.262338)
Rehearsal -------------------------------------------
2 chars 0.700000 0.000000 0.700000 ( 0.926117)
7 chars 0.990000 0.020000 1.010000 ( 1.356475)
---------------------------------- total: 1.710000sec

          user     system      total        real

2 chars 0.710000 0.000000 0.710000 ( 0.940901)
7 chars 0.880000 0.020000 0.900000 ( 1.269465)
Rehearsal -------------------------------------------
2 chars 0.700000 0.010000 0.710000 ( 1.016537)
7 chars 0.990000 0.010000 1.000000 ( 1.360516)
---------------------------------- total: 1.710000sec

          user     system      total        real

2 chars 0.720000 0.010000 0.730000 ( 0.920533)
7 chars 0.920000 0.000000 0.920000 ( 1.198681)
Rehearsal -------------------------------------------
2 chars 0.720000 0.010000 0.730000 ( 0.946056)
7 chars 1.020000 0.010000 1.030000 ( 1.343575)
---------------------------------- total: 1.760000sec

          user     system      total        real

2 chars 0.720000 0.010000 0.730000 ( 1.023185)
7 chars 0.930000 0.000000 0.930000 ( 1.211184)


Toutes les grandes personnes ont d’abord été des enfants, mais peu
d’entre elles s’en souviennent.

All adults have been children first, but not many remember.

[Antoine de Saint-Exupéry]

First, thanks a lot for your benchmarking tips - I'm new to this
business - and thanks for your BM-friendly rewrite of my code.

However, I still keep getting unexpected results:

jablan-mbp:dev $ ruby bm1.rb
Rehearsal -------------------------------------------
2 chars 0.840000 0.000000 0.840000 ( 0.863001)
7 chars 0.590000 0.010000 0.600000 ( 0.597103)
---------------------------------- total: 1.440000sec

          user     system      total        real

2 chars 0.900000 0.000000 0.900000 ( 0.909053)
7 chars 0.620000 0.000000 0.620000 ( 0.644989)
Rehearsal -------------------------------------------
2 chars 0.910000 0.000000 0.910000 ( 0.924786)
7 chars 0.620000 0.010000 0.630000 ( 0.631793)
---------------------------------- total: 1.540000sec

          user     system      total        real

2 chars 0.900000 0.000000 0.900000 ( 0.909935)
7 chars 0.570000 0.000000 0.570000 ( 0.578445)
Rehearsal -------------------------------------------
2 chars 0.880000 0.010000 0.890000 ( 0.892464)
7 chars 0.650000 0.000000 0.650000 ( 0.672226)
---------------------------------- total: 1.540000sec

          user     system      total        real

2 chars 0.910000 0.010000 0.920000 ( 0.947815)
7 chars 0.580000 0.000000 0.580000 ( 0.593960)
Rehearsal -------------------------------------------
2 chars 0.890000 0.010000 0.900000 ( 0.896618)
7 chars 0.650000 0.000000 0.650000 ( 0.660125)
---------------------------------- total: 1.550000sec

          user     system      total        real

2 chars 0.900000 0.000000 0.900000 ( 0.919042)
7 chars 0.590000 0.010000 0.600000 ( 0.595202)
Rehearsal -------------------------------------------
2 chars 0.900000 0.000000 0.900000 ( 0.913919)
7 chars 0.660000 0.010000 0.670000 ( 0.682484)
---------------------------------- total: 1.570000sec

          user     system      total        real

2 chars 0.920000 0.000000 0.920000 ( 0.950286)
7 chars 0.590000 0.010000 0.600000 ( 0.601025)

I will try to run the code on other platforms and/or Ruby versions and
will post the findings… The bad thing is that I'm not doing this just
for kicks - I need to optimise a nasty script… :frowning:

On Tue, Jul 28, 2009 at 9:16 PM, Robert D. [email protected]
wrote:

can you try this in 1.8
 …map{ |x| x.hash }.uniq.size

Sorry, I forgot two things.
First, you did not produce "00", "01", etc., thus:
[*0..99].map( &:to_s ).map( &:hash )
And secondly, a negative result does not say anything about collisions;
only a positive one (result < 100) would. In that case we would need to
look into the source (I do not think profiling would show internal hash
behavior).
Cheers
Robert
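The zero-padding point can be checked with a quick sketch (mine, not from the thread): `Integer#to_s` never yields "00"-style strings, whereas a String range starting at "00" does.

```ruby
# Integer#to_s gives unpadded strings:
unpadded = (0..99).map(&:to_s)
# while a String range starting at "00" is zero-padded:
padded = ("00".."99").to_a

p unpadded.first(3)   # => ["0", "1", "2"]
p padded.first(3)     # => ["00", "01", "02"]
```

So hashing `[*0..99].map(&:to_s)` and hashing `[*"00".."99"]` exercise two different key sets.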




You pointed in the right direction here!

Although your example gives me 100 as the result (all the hashes are
unique), here is what suggests that hash collisions are indeed the reason
for the slowness:

[*0..99].map{|i| [*0..99].map{|j| [i.to_s,j.to_s].hash}}.flatten.uniq.size
=> 1756

[*0..99].map{|i| [*0..99].map{|j| [i.to_s + '00000',j.to_s + '00000'].hash}}.flatten.uniq.size
=> 10000

Can you please just post the ruby 1.9 results here?

Thanks!
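The collision structure behind numbers like those can be inspected by grouping the keys by their hash value - a sketch of mine, runnable on any Ruby; the chain lengths it reports depend on the interpreter's Array#hash (a 1.8-era interpreter reports far fewer distinct values, modern ones should report none shared):

```ruby
# Build the same 10_000 two-element string keys and group them by hash value.
keys = (0..99).flat_map { |i| (0..99).map { |j| [i.to_s, j.to_s] } }
buckets = keys.group_by(&:hash)

# Every key whose bucket holds more than one entry costs extra eql? scans.
longest = buckets.values.map(&:size).max
puts "#{keys.size} keys, #{buckets.size} distinct hash values, longest chain: #{longest}"
```

With only 1756 distinct values for 10000 keys, the average chain is over five keys long, which is where the extra time goes.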

2009/7/28 Mladen Jablanović [email protected]:
Is this 1.8?
Given your results it is not completely unthinkable that the hash
values for "00".."99" have more collisions, and if collision handling
is done with linked lists, that might slow things down.
A priori I would not believe that, but why not check?
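The linked-list effect is easy to demonstrate with a contrived key class whose hash is constant, so every entry lands in one bucket (BadKey is my invention, purely to illustrate the degradation):

```ruby
require 'benchmark'

# A key whose hash never varies: every entry collides, so lookups and
# inserts degenerate into a linear scan of a single chain.
class BadKey
  attr_reader :v
  def initialize(v)
    @v = v
  end
  def hash
    0   # constant: all keys share one bucket
  end
  def eql?(other)
    other.is_a?(BadKey) && other.v == v
  end
end

n = 2_000
bad  = Benchmark.realtime { h = {}; n.times { |i| h[BadKey.new(i)] = i } }
good = Benchmark.realtime { h = {}; n.times { |i| h[i] = i } }
printf "constant hash: %.4fs, integer keys: %.4fs\n", bad, good
```

Inserting n constant-hash keys costs on the order of n²/2 eql? calls, so the first timing dwarfs the second.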

I get in 1.9 the expected:
[*"00".."99"].map( &:hash ).uniq.size
=> 100

can you try this in 1.8
…map{ |x| x.hash }.uniq.size

?

HTH
Robert


2009/7/29 Mladen Jablanović [email protected]:

You pointed in the right direction here!

Although your example gives me 100 as the result (all the hashes are
unique), here is what suggests that hash collisions are indeed the reason
for the slowness:

[*0..99].map{|i| [*0..99].map{|j| [i.to_s,j.to_s].hash}}.flatten.uniq.size
=> 1756
[*0..99].map{|i| [*0..99].map{|j| [i.to_s + '00000',j.to_s + '00000'].hash}}.flatten.uniq.size
=> 10000

OMG my code was sloppy, but you got it right, here is the 1.9 code
506/18 > cat hashes.rb && ruby -v hashes.rb
#!/usr/local/bin/ruby -w
# encoding: utf-8
# file: /home/robert/log/ruby/ML/hashes.rb

p [*0..99].map{|i| [*0..99].map{|j| [i.to_s,j.to_s].hash}}.flatten.uniq.size
p [*0..99].map{|i| [*0..99].map{|j| [i.to_s + '00000',j.to_s + '00000'].hash}}.flatten.uniq.size

# vim: sts=2 sw=2 ft=ruby expandtab nu :

ruby 1.9.1p243 (2009-07-16 revision 24175) [i686-linux]
10000
10000

as expected :).
Cheers
Robert
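Since the original goal was optimising a 1.8 script, here is one possible workaround sketch (my suggestion, not from the thread): use a single joined string as the key instead of a two-element array - the thread's checks showed the individual string hashes were all unique even where the array hashes collided. The "-" separator is an arbitrary choice; it only has to keep pairs unambiguous.

```ruby
h = Hash.new(0)

100_000.times do
  i1 = rand(100)
  i2 = rand(100)
  # One string key instead of a two-element array key; "#{i1}-#{i2}" stays
  # unambiguous because "-" cannot appear inside either number.
  h["#{i1}-#{i2}"] += 1
end

puts h.size   # at most 10_000 distinct keys
```

If the original pair is needed later, it can be recovered with `key.split("-")`.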


