Forum: Ruby Sane #hash implementation?

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
Shot (Piotr S.) (Guest)
on 2007-01-30 00:39
(Received via mailing list)
I have a Set subclass, Block. I need two blocks to be considered the
same by Set iff they have the same elements. From what I gathered, Set
treats two objects as equal when they are eql? and have the same hash.

Implementing Block#eql? seems to be trivial, as Set#== is there to help:

class Block < Set
  def eql? other
    self == other
  end
end

I’m stuck when it comes to Block#hash, though; I need these to be true:
Block.new.hash == Block.new.hash
Block.new([1,2]).hash == Block.new([1,2]).hash

What’s the proper way to go at it? Adding the elements’ hashes together
is obviously wrong, as four different elements might produce the same
sum when paired; also, the docs say, ‘any hash value that exceeds the
capacity of a Fixnum will be truncated before being used,’ so hash can’t
exceed the (word - 1 bit) size.

-- Shot
Ken B. (Guest)
on 2007-01-30 04:11
(Received via mailing list)
On Tue, 30 Jan 2007 07:38:20 +0900, Shot (Piotr S.) wrote:

> end
>
> -- Shot

There's nothing about hash codes that requires that
  a.hash != b.hash if a != b
You want to avoid collisions as much as possible, but
that's an issue of performance, not correctness, so you'd be perfectly
correct in adding up the hash codes of the members to create the has
code.

Nor is there anything that requires you to keep your hash code within
the
capacity of a Fixnum when you generate it. That's why it's automatically
truncated for you. The only reason you need to know is so that your hash
function doesn't hash things in such a way that the truncated number has
lots of collisions. (Once again a performance issue).

You may want to look at http://www.partow.net/programming/hashfunctions/
which discusses some relatively important hash functions. You may have
to
experiment to see what works best. Wikipedia may also have good
information.

--Ken
Eric H. (Guest)
on 2007-01-30 06:48
(Received via mailing list)
On Jan 29, 2007, at 14:38, Shot (Piotr S.) wrote:
> I have a Set subclass, Block. I need two blocks to be considered the
> same by Set iff they have the same elements. From what I gathered, Set
> treats two objects as equal when they are eql? and have the same hash.
>
> I’m stuck when it comes to Block#hash, though; I need these to be
> true:
> Block.new.hash == Block.new.hash
> Block.new([1,2]).hash == Block.new([1,2]).hash

Try:

class Block
   def hash
     to_a.hash
   end
end
Shot (Piotr S.) (Guest)
on 2007-01-30 09:24
(Received via mailing list)
Ken B.:

> There's nothing about hash codes that requires that
>   a.hash != b.hash if a != b
> You want to avoid collisions as much as possible, but that's an issue
> of performance, not correctness, so you'd be perfectly correct in
> adding up the hash codes of the members to create the has code.

True, but I do want to be able to key Hashes with my Blocks, so I do
want to avoid collisions as much as possible. I should’ve tested the
add-up-the-members’-hashes approach before/instead of crossing it out,
though.

> You may want to look at
> http://www.partow.net/programming/hashfunctions/
> which discusses some relatively important hash functions.
> You may have to experiment to see what works best. Wikipedia
> may also have good information.

Thanks for the pointers! I knew about hash functions before, I just
wasn’t sure whether there’s an idiomatic/popular/proper Ruby approach
to writing #hash.

-- Shot
Shot (Piotr S.) (Guest)
on 2007-01-30 09:27
(Received via mailing list)
Eric H.:

> On Jan 29, 2007, at 14:38, Shot (Piotr S.) wrote:

>> I’m stuck when it comes to Block#hash, though; I need these to be true:
>> Block.new.hash == Block.new.hash
>> Block.new([1,2]).hash == Block.new([1,2]).hash

> Try:

> class Block
>   def hash
>     to_a.hash
>   end
> end

Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :)

-- Shot
Eric H. (Guest)
on 2007-01-30 10:11
(Received via mailing list)
On Jan 29, 2007, at 23:26, Shot (Piotr S.) wrote:
>>   def hash
>>     to_a.hash
>>   end
>> end
>
> Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :)

You may want to define #eql? in terms of #to_a as well...
Mauricio F. (Guest)
on 2007-01-30 10:21
(Received via mailing list)
On Tue, Jan 30, 2007 at 05:10:51PM +0900, Eric H. wrote:
> >>class Block
> >>  def hash
> >>    to_a.hash
> >>  end
> >>end
> >
> >Thanks a lot, Eric! This is the $B!F(Bd$B!G(Boh!$B!G(B solution I was looking 
for.
> >:)
>
> You may want to define #eql? in terms of #to_a as well...

That would fail for the same reason as your Block#hash.
Pit C. (Guest)
on 2007-01-30 10:23
(Received via mailing list)
Shot (Piotr S.) schrieb:
>> class Block
>>   def hash
>>     to_a.hash
>>   end
>> end
>
> Thanks a lot, Eric! This is the ‘d’oh!’ solution I was looking for. :)

Shot, you should be aware that this works for your example given above,
but not for the following:

   Block.new([1,12]).hash  # => 57
   Block.new([12,1]).hash  # => 23

If this is a problem, you have to change the implementation to

   class Block
     def hash
       sort.hash
     end
   end

Regards,
Pit
Mauricio F. (Guest)
on 2007-01-30 13:49
(Received via mailing list)
[I'm resending this since the ML server is dropping some of my messages
to
this thread for some reason.]

On Tue, Jan 30, 2007 at 04:26:18PM +0900, Shot (Piotr S.) wrote:
> > class Block
> >   def hash
> >     to_a.hash
> >   end
> > end
>
> Thanks a lot, Eric! This is the d'oh! solution I was looking for. :)

It's not a correct one though.

  require 'set'
  a = Set.new(0..1000)
  b = Set.new((0..1000).sort_by{rand})
  a == b                                             # => true
  a.to_a.hash                                        # => 31141761
  b.to_a.hash                                        # => 374826672

(The order of the elements in Hash#to_a can change.)


Try this, it's O(n ln n) but actually works

  require 'set'
  class Block < Set
    def hash
      map{|x| x.hash}.sort.hash
    end
  end
  a = Block.new(0..1000)
  b = Block.new((0..1000).sort_by{rand})
  a == b                                             # => true
  a.hash                                             # => 944434529
  b.hash                                             # => 944434529
Shot (Piotr S.) (Guest)
on 2007-01-31 12:44
(Received via mailing list)
Mauricio F.:

> (The order of the elements in Hash#to_a can change.)

Thanks for pointing this out, I fixed my specs and my code.

> Try this, it's O(n ln n) but actually works

Ok, I guess I have a question on what does exactly Benchmark results
mean. Doesn’t the below mean hash_map_sort is actually a bit slower than
hash_sort?

Also, to ask two basic, follow-up questions: which Benchmark time is
the most significant one and is it more-or-less independent from other
system load? (I guess the two runs wouldn’t differ as much any of the
times was other-load-independent…)



$ cat block_hash.rb
require 'benchmark'
require 'set'

class Block < Set
  def hash_sort
    sort.hash
  end
  def hash_map_sort
    map {|a| a.hash}.sort.hash
  end
end

srand 1979
array = (0..1_000_000).sort_by {rand}

Benchmark.bmbm do |b|
  b.report 'hash_sort' do
    Block.new(array).hash_sort
  end
  b.report 'hash_map_sort' do
    Block.new(array).hash_map_sort
  end
end



$ ruby block_hash.rb
Rehearsal -------------------------------------------------
hash_sort       8.000000   1.366667   9.366667 (  6.208513)
hash_map_sort   9.750000   1.500000  11.250000 (  7.179749)
--------------------------------------- total: 20.616667sec

                    user     system      total        real
hash_sort       9.666667   1.100000  10.766667 (  6.677047)
hash_map_sort   9.566667   1.550000  11.116667 (  6.809388)



$ ruby block_hash.rb
Rehearsal -------------------------------------------------
hash_sort       8.133333   1.316667   9.450000 (  5.746571)
hash_map_sort   9.650000   1.766667  11.416667 (  7.065570)
--------------------------------------- total: 20.866667sec

                    user     system      total        real
hash_sort       9.316667   1.100000  10.416667 (  6.684088)
hash_map_sort   9.450000   1.566667  11.016667 (  6.796299)



-- Shot
Pit C. (Guest)
on 2007-01-31 13:08
(Received via mailing list)
Shot (Piotr S.) schrieb:
> Ok, I guess I have a question on what does exactly Benchmark results
> mean. Doesn’t the below mean hash_map_sort is actually a bit slower than
> hash_sort?

This is what I would expect. The advantage of Mauricio's solution is
that you can use it even if the Set elements are not comparable to each
other. My proposal was meant for your use case with integers as Set
elements.

> end
With this benchmark, you also measure the time needed to construct the
Block instances. If you want to compare the two #hash methods, you
should change your benchmark to something like:

   block = Block.new(array)

   Benchmark.bmbm do |b|
     b.report 'hash_sort' do
       block.hash_sort
     end
     b.report 'hash_map_sort' do
       block.hash_map_sort
     end
   end

 From your benchmark, you can see that it takes a lot of time to compute
the hash values. Will your sets contain 1_000_000 elements? How often
will they change? How are they created? Maybe you need to cache the hash
values or implement a different hash algorithm.

Regards,
Pit
Shot (Piotr S.) (Guest)
on 2007-01-31 13:18
(Received via mailing list)
Eric H.:

> You may want to define #eql? in terms of #to_a as well...

Thanks, but (considering Mauricio F.’s fix) what would be the
benefit of defining #eql? in terms of Set#sort instead of Set#==?

Te docs say on Set#==, ‘Returns true if two sets are equal. The equality
of each couple of elements is defined according to Object#eql?.’

-- Shot
Shot (Piotr S.) (Guest)
on 2007-01-31 17:27
(Received via mailing list)
Pit C.:

> The advantage of Mauricio's solution is that you can use it even if
> the Set elements are not comparable to each other. My proposal was
> meant for your use case with integers as Set elements.

Ah, right. Still, blocks will contain integers only, so
Array#sort-based Block#hash and Set#==-based Block#eql?
should suffice.

> With this benchmark, you also measure the time needed to construct
> the Block instances. If you want to compare the two #hash methods,
> you should change your benchmark to something like:

D’oh! Why did I think moving the array creation
out was the most I can get to unbias the benchmark?

> From your benchmark, you can see that it takes a lot of time to
> compute the hash values. Will your sets contain 1_000_000 elements?
> How often will they change? How are they created? Maybe you need to
> cache the hash values or implement a different hash algorithm.

I don’t know, yet. I was just curious whether the elements’-hash-based
hash algorithm would be any faster than the just-sort-based one (given
that from a quick glance it seemed to incorporate additional work); now
I understand its advantage lies elsewhere.

Thanks a lot for your input!

-- Shot
Shot (Piotr S.) (Guest)
on 2007-09-26 01:00
(Received via mailing list)
Mauricio F.:

> On Wed, Jan 31, 2007 at 06:37:14PM +0900, Shot (Piotr S.) wrote:

>> what would be the benefit of defining #eql?
>> in terms of Set#sort instead of Set#==?

> "None". You cannot just
>   def eql?(o); sort == o.sort end
> either

But I can

class Block < Set
  def eql? other
    self == other
  end
end

can’t I? Set#== says that ‘the equality of each couple of elements is
defined according to Object#eql?’ and that, coupled with Block#hash,
is enough for me to create a set of blocks acting the way I need it.

Note: I need a set of blocks to consider two blocks the same if the
blocks contain the same integers, i.e., I need two such blocks to be
eql? and have the same hashes; the above seems to do it for me quite
nicely…

(Thaks a lot for your detailed explanation snipped here,
I appreciate it very much and learned a lot from it!)

>   GC.disable # don't let this interfere

I’m learning new things every day, this one will come in very handy.

-- Shot
Mauricio F. (Guest)
on 2007-09-26 01:10
(Received via mailing list)
On Wed, Jan 31, 2007 at 06:37:14PM +0900, Shot (Piotr S.) wrote:
> Eric H.:
>
> > You may want to define #eql? in terms of #to_a as well...
>
> Thanks, but (considering Mauricio F.’s fix) what would be the
> benefit of defining #eql? in terms of Set#sort instead of Set#==?

"None". You cannot just
  def eql?(o); sort == o.sort end
either, since there's no guarantee there's a total order associated to
the
set (it seems my reply to Pit where I pointed this out was dropped too):

  require 'set'
  Set.new(['a', 1]).sort                             # =>
  # ~> -:2:in `sort': comparison of String with 1 failed (ArgumentError)
  # ~>   from -:2

This is why the hash implementation I gave you does
  map{|x| x.hash}.sort.hash  # sorting hash values, i.e. Integers
instead of
  sort.map{|x| x.hash}.hash

Even when the elements can be ordered, sort would be O(n ln n) vs. #=='s
O(n)

Of course, the former is a "fast NlnN", being written in C, so it would
seem
that it can keep the speed advantage up to large values of N.

However, #sort will process the whole array before it can start
comparing
values, and #== can begin right away. If there are many differing
elements,
the probability of finding one of them in the first comparisons is very
high.

Somewhat surprisingly, #== can beat #sort even in the most favorable
case for
the latter (only one differing element, the first in the sorted array
but not
in Hash#to_a, small sets):

  require 'benchmark'
  require 'set'
  size = 64
  elm  = -10
  elm2 = -2
  s1 = Set.new((0..size).to_a << -1)
  s2 = Set.new((0..size).to_a << elm)

  GC.disable # don't let this interfere
  def bm(s1, s2, elm, iterations = 5000)
      puts "#== will stop after #{1 + s2.to_a.index(elm)} .include?
tests"
      Benchmark.bmbm(10) do |bm|
        bm.report("Set#sort") { iterations.times{ s1.sort == s2.sort } }
        bm.report("Set#==")   { iterations.times{ s1 == s2 } }
      end
  end

  bm(s1, s2, elm)
  puts "-" * 40
  bm(s1, Set.new((0..size).to_a << elm2), elm2)

  # >> #== will stop after 43 .include? tests
  # >> Rehearsal ---------------------------------------------
  # >> Set#sort    0.700000   0.050000   0.750000 (  0.768818)
  # >> Set#==      0.370000   0.010000   0.380000 (  0.381117)
  # >> ------------------------------------ total: 1.130000sec
  # >>
  # >>                 user     system      total        real
  # >> Set#sort    0.680000   0.060000   0.740000 (  0.736052)
  # >> Set#==      0.380000   0.010000   0.390000 (  0.380995)
  # >> ----------------------------------------
  # >> #== will stop after 7 .include? tests
  # >> Rehearsal ---------------------------------------------
  # >> Set#sort    0.620000   0.010000   0.630000 (  0.637358)
  # >> Set#==      0.170000   0.000000   0.170000 (  0.175028)
  # >> ------------------------------------ total: 0.800000sec
  # >>
  # >>                 user     system      total        real
  # >> Set#sort    0.750000   0.050000   0.800000 (  0.807318)
  # >> Set#==      0.170000   0.010000   0.180000 (  0.176633)
This topic is locked and can not be replied to.