Sane #hash implementation?


#1

I have a Set subclass, Block. I need two blocks to be considered the
same by Set iff they have the same elements. From what I gathered, Set
treats two objects as equal when they are eql? and have the same hash.

Implementing Block#eql? seems to be trivial, as Set#== is there to help:

class Block < Set
def eql? other
self == other
end
end

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

What’s the proper way to go at it? Adding the elements’ hashes together
is obviously wrong, as four different elements might produce the same
sum when paired; also, the docs say, ‘any hash value that exceeds the
capacity of a Fixnum will be truncated before being used,’ so hash can’t
exceed the (word - 1 bit) size.

– Shot


#2

On Tue, 30 Jan 2007 07:38:20 +0900, Shot (Piotr S.) wrote:

end

– Shot

There’s nothing about hash codes that requires that
a.hash != b.hash if a != b
You want to avoid collisions as much as possible, but
that’s an issue of performance, not correctness, so you’d be perfectly
correct in adding up the hash codes of the members to create the has
code.

Nor is there anything that requires you to keep your hash code within
the
capacity of a Fixnum when you generate it. That’s why it’s automatically
truncated for you. The only reason you need to know is so that your hash
function doesn’t hash things in such a way that the truncated number has
lots of collisions. (Once again a performance issue).

You may want to look at http://www.partow.net/programming/hashfunctions/
which discusses some relatively important hash functions. You may have
to
experiment to see what works best. Wikipedia may also have good
information.

–Ken


#3

On Jan 29, 2007, at 14:38, Shot (Piotr S.) wrote:

I have a Set subclass, Block. I need two blocks to be considered the
same by Set iff they have the same elements. From what I gathered, Set
treats two objects as equal when they are eql? and have the same hash.

I’m stuck when it comes to Block#hash, though; I need these to be
true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

Try:

class Block
def hash
to_a.hash
end
end


#4

Ken B.:

There’s nothing about hash codes that requires that
a.hash != b.hash if a != b
You want to avoid collisions as much as possible, but that’s an issue
of performance, not correctness, so you’d be perfectly correct in
adding up the hash codes of the members to create the has code.

True, but I do want to be able to key Hashes with my Blocks, so I do
want to avoid collisions as much as possible. I should’ve tested the
add-up-the-members’-hashes approach before/instead of crossing it out,
though.

You may want to look at
http://www.partow.net/programming/hashfunctions/
which discusses some relatively important hash functions.
You may have to experiment to see what works best. Wikipedia
may also have good information.

Thanks for the pointers! I knew about hash functions before, I just
wasn’t sure whether there’s an idiomatic/popular/proper Ruby approach
to writing #hash.

– Shot


#5

Eric H.:

On Jan 29, 2007, at 14:38, Shot (Piotr S.) wrote:

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

Try:

class Block
def hash
to_a.hash
end
end

Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :slight_smile:

– Shot


#6

On Jan 29, 2007, at 23:26, Shot (Piotr S.) wrote:

def hash
to_a.hash
end
end

Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :slight_smile:

You may want to define #eql? in terms of #to_a as well…


#7

Shot (Piotr S.) schrieb:

class Block
def hash
to_a.hash
end
end

Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :slight_smile:

Shot, you should be aware that this works for your example given above,
but not for the following:

Block.new([1,12]).hash # => 57
Block.new([12,1]).hash # => 23

If this is a problem, you have to change the implementation to

class Block
def hash
sort.hash
end
end

Regards,
Pit


#8

On Tue, Jan 30, 2007 at 05:10:51PM +0900, Eric H. wrote:

class Block
def hash
to_a.hash
end
end

Thanks a lot, Eric! This is the e$B!Fe(Bde$B!Ge(Boh!e$B!Ge(B solution I was looking for.
:slight_smile:

You may want to define #eql? in terms of #to_a as well…

That would fail for the same reason as your Block#hash.


#9

[I’m resending this since the ML server is dropping some of my messages to this thread for some reason.]

On Tue, Jan 30, 2007 at 04:26:18PM +0900, Shot (Piotr S.) wrote:

class Block
def hash
to_a.hash
end
end

Thanks a lot, Eric! This is the d’oh! solution I was looking for. :slight_smile:

It’s not a correct one though.

require ‘set’
a = Set.new(0…1000)
b = Set.new((0…1000).sort_by{rand})
a == b # => true
a.to_a.hash # => 31141761
b.to_a.hash # => 374826672

(The order of the elements in Hash#to_a can change.)

Try this, it’s O(n ln n) but actually works

require ‘set’
class Block < Set
def hash
map{|x| x.hash}.sort.hash
end
end
a = Block.new(0…1000)
b = Block.new((0…1000).sort_by{rand})
a == b # => true
a.hash # => 944434529
b.hash # => 944434529


#10

Mauricio F.:

(The order of the elements in Hash#to_a can change.)

Thanks for pointing this out, I fixed my specs and my code.

Try this, it’s O(n ln n) but actually works

Ok, I guess I have a question on what does exactly Benchmark results
mean. Doesn’t the below mean hash_map_sort is actually a bit slower than
hash_sort?

Also, to ask two basic, follow-up questions: which Benchmark time is
the most significant one and is it more-or-less independent from other
system load? (I guess the two runs wouldn’t differ as much any of the
times was other-load-independent…)

$ cat block_hash.rb
require ‘benchmark’
require ‘set’

class Block < Set
def hash_sort
sort.hash
end
def hash_map_sort
map {|a| a.hash}.sort.hash
end
end

srand 1979
array = (0…1_000_000).sort_by {rand}

Benchmark.bmbm do |b|
b.report ‘hash_sort’ do
Block.new(array).hash_sort
end
b.report ‘hash_map_sort’ do
Block.new(array).hash_map_sort
end
end

$ ruby block_hash.rb
Rehearsal -------------------------------------------------
hash_sort 8.000000 1.366667 9.366667 ( 6.208513)
hash_map_sort 9.750000 1.500000 11.250000 ( 7.179749)
--------------------------------------- total: 20.616667sec

                user     system      total        real

hash_sort 9.666667 1.100000 10.766667 ( 6.677047)
hash_map_sort 9.566667 1.550000 11.116667 ( 6.809388)

$ ruby block_hash.rb
Rehearsal -------------------------------------------------
hash_sort 8.133333 1.316667 9.450000 ( 5.746571)
hash_map_sort 9.650000 1.766667 11.416667 ( 7.065570)
--------------------------------------- total: 20.866667sec

                user     system      total        real

hash_sort 9.316667 1.100000 10.416667 ( 6.684088)
hash_map_sort 9.450000 1.566667 11.016667 ( 6.796299)

– Shot


#11

Eric H.:

You may want to define #eql? in terms of #to_a as well…

Thanks, but (considering Mauricio F.’s fix) what would be the
benefit of defining #eql? in terms of Set#sort instead of Set#==?

Te docs say on Set#==, ‘Returns true if two sets are equal. The equality
of each couple of elements is defined according to Object#eql?.’

– Shot


#12

Shot (Piotr S.) schrieb:

Ok, I guess I have a question on what does exactly Benchmark results
mean. Doesn’t the below mean hash_map_sort is actually a bit slower than
hash_sort?

This is what I would expect. The advantage of Mauricio’s solution is
that you can use it even if the Set elements are not comparable to each
other. My proposal was meant for your use case with integers as Set
elements.

end
With this benchmark, you also measure the time needed to construct the
Block instances. If you want to compare the two #hash methods, you
should change your benchmark to something like:

block = Block.new(array)

Benchmark.bmbm do |b|
b.report ‘hash_sort’ do
block.hash_sort
end
b.report ‘hash_map_sort’ do
block.hash_map_sort
end
end

From your benchmark, you can see that it takes a lot of time to compute
the hash values. Will your sets contain 1_000_000 elements? How often
will they change? How are they created? Maybe you need to cache the hash
values or implement a different hash algorithm.

Regards,
Pit


#13

Pit C.:

The advantage of Mauricio’s solution is that you can use it even if
the Set elements are not comparable to each other. My proposal was
meant for your use case with integers as Set elements.

Ah, right. Still, blocks will contain integers only, so
Array#sort-based Block#hash and Set#==-based Block#eql?
should suffice.

With this benchmark, you also measure the time needed to construct
the Block instances. If you want to compare the two #hash methods,
you should change your benchmark to something like:

D’oh! Why did I think moving the array creation
out was the most I can get to unbias the benchmark?

From your benchmark, you can see that it takes a lot of time to
compute the hash values. Will your sets contain 1_000_000 elements?
How often will they change? How are they created? Maybe you need to
cache the hash values or implement a different hash algorithm.

I don’t know, yet. I was just curious whether the elements’-hash-based
hash algorithm would be any faster than the just-sort-based one (given
that from a quick glance it seemed to incorporate additional work); now
I understand its advantage lies elsewhere.

Thanks a lot for your input!

– Shot


#14

On Wed, Jan 31, 2007 at 06:37:14PM +0900, Shot (Piotr S.) wrote:

Eric H.:

You may want to define #eql? in terms of #to_a as well…

Thanks, but (considering Mauricio F.’s fix) what would be the
benefit of defining #eql? in terms of Set#sort instead of Set#==?

“None”. You cannot just
def eql?(o); sort == o.sort end
either, since there’s no guarantee there’s a total order associated to
the
set (it seems my reply to Pit where I pointed this out was dropped too):

require ‘set’
Set.new([‘a’, 1]).sort # =>

~> -:2:in `sort’: comparison of String with 1 failed (ArgumentError)

~> from -:2

This is why the hash implementation I gave you does
map{|x| x.hash}.sort.hash # sorting hash values, i.e. Integers
instead of
sort.map{|x| x.hash}.hash

Even when the elements can be ordered, sort would be O(n ln n) vs. #=='s
O(n)

Of course, the former is a “fast NlnN”, being written in C, so it would
seem
that it can keep the speed advantage up to large values of N.

However, #sort will process the whole array before it can start
comparing
values, and #== can begin right away. If there are many differing
elements,
the probability of finding one of them in the first comparisons is very
high.

Somewhat surprisingly, #== can beat #sort even in the most favorable
case for
the latter (only one differing element, the first in the sorted array
but not
in Hash#to_a, small sets):

require ‘benchmark’
require ‘set’
size = 64
elm = -10
elm2 = -2
s1 = Set.new((0…size).to_a << -1)
s2 = Set.new((0…size).to_a << elm)

GC.disable # don’t let this interfere
def bm(s1, s2, elm, iterations = 5000)
puts “#== will stop after #{1 + s2.to_a.index(elm)} .include?
tests”
Benchmark.bmbm(10) do |bm|
bm.report(“Set#sort”) { iterations.times{ s1.sort == s2.sort } }
bm.report(“Set#==”) { iterations.times{ s1 == s2 } }
end
end

bm(s1, s2, elm)
puts “-” * 40
bm(s1, Set.new((0…size).to_a << elm2), elm2)

>> #== will stop after 43 .include? tests

>> Rehearsal ---------------------------------------------

>> Set#sort 0.700000 0.050000 0.750000 ( 0.768818)

>> Set#== 0.370000 0.010000 0.380000 ( 0.381117)

>> ------------------------------------ total: 1.130000sec

>>

>> user system total real

>> Set#sort 0.680000 0.060000 0.740000 ( 0.736052)

>> Set#== 0.380000 0.010000 0.390000 ( 0.380995)

>> ----------------------------------------

>> #== will stop after 7 .include? tests

>> Rehearsal ---------------------------------------------

>> Set#sort 0.620000 0.010000 0.630000 ( 0.637358)

>> Set#== 0.170000 0.000000 0.170000 ( 0.175028)

>> ------------------------------------ total: 0.800000sec

>>

>> user system total real

>> Set#sort 0.750000 0.050000 0.800000 ( 0.807318)

>> Set#== 0.170000 0.010000 0.180000 ( 0.176633)


#15

Mauricio F.:

On Wed, Jan 31, 2007 at 06:37:14PM +0900, Shot (Piotr S.) wrote:

what would be the benefit of defining #eql?
in terms of Set#sort instead of Set#==?

“None”. You cannot just
def eql?(o); sort == o.sort end
either

But I can

class Block < Set
def eql? other
self == other
end
end

can’t I? Set#== says that ‘the equality of each couple of elements is
defined according to Object#eql?’ and that, coupled with Block#hash,
is enough for me to create a set of blocks acting the way I need it.

Note: I need a set of blocks to consider two blocks the same if the
blocks contain the same integers, i.e., I need two such blocks to be
eql? and have the same hashes; the above seems to do it for me quite
nicely…

(Thaks a lot for your detailed explanation snipped here,
I appreciate it very much and learned a lot from it!)

GC.disable # don’t let this interfere

I’m learning new things every day, this one will come in very handy.

– Shot