String#hash changed in Ruby 1.9?

dpitman · May 4, 2009, 4:24pm

Hi all,
in ruby 1.8.7:
david@trince ~$ ruby -e 'puts "abc".hash'
833038373
david@trince ~$ ruby -e 'puts "abc".hash'
833038373
david@trince ~$ ruby -e 'puts "abc".hash'
833038373

[always the same number]

in ruby 1.9.1:
david@trince ~$ ruby -e 'puts "abc".hash'
402929305
david@trince ~$ ruby -e 'puts "abc".hash'
-403532784
david@trince ~$ ruby -e 'puts "abc".hash'
-650364342

What happened? Is this intentional? Rationale? Any tips on how to
replace it?

dpitman · May 4, 2009, 4:41pm

On Monday, 04 May 2009 16:22:01, David P. wrote:

in ruby 1.9.1:
david@trince ~$ ruby -e 'puts "abc".hash'
402929305
david@trince ~$ ruby -e 'puts "abc".hash'
-403532784
david@trince ~$ ruby -e 'puts "abc".hash'
-650364342

What happened? Is this intentional?

1.9 uses MurmurHash (http://murmurhash.googlepages.com/) with a random
seed, which is generated once per run of the application.
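
A minimal sketch of what a per-run seed means in practice. The helper name `hash_in_fresh_interpreter` is made up for illustration, and whether the values actually differ between runs depends on the Ruby version in use:

```ruby
require "rbconfig"

# Within a single process, equal strings hash to the same value;
# this is the only guarantee that Hash lookups rely on.
first  = "abc".hash
second = "abc".dup.hash
puts first == second

# Hypothetical helper: start a fresh interpreter and return the value
# it computes for the same string. With a per-run seed this is not
# guaranteed to match the value seen in the current process.
def hash_in_fresh_interpreter(str)
  IO.popen([RbConfig.ruby, "-e", "print ARGV[0].hash", str], &:read).to_i
end
```

Comparing `hash_in_fresh_interpreter("abc")` with `"abc".hash` is a quick way to check how a given interpreter behaves.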

Any tips on how to replace it?

What does it hurt if the hash value of a string does not remain constant
between runs of the application?

HTH,
Sebastian

dpitman · May 4, 2009, 4:49pm

Any tips on how to replace it?

What does it hurt if the hash value of a string does not remain constant
between runs of the application?

In my case it’s pretty bad. I use it in a command line utility to cache
rake tasks. I create one cachefile for each directory, naming them using
the String#hash of the full path (Dir.pwd.hash). If the hash is
different the next time the program runs the cache lookup fails (and I
get a new cache file instead of the old one).

So, I don’t need anything fancy, just an equivalent to Dir.pwd.hash that
stays consistent. Do I need to MD5 it? Feels like overkill. Why was this
changed in the first place?
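
One way to get a name that stays the same between runs is to use a digest from the standard library instead of String#hash. The sketch below is only an illustration: the helper names and the ".cache" suffix are assumptions, not part of the utility described above.

```ruby
require "digest/md5"
require "zlib"

# Hypothetical helper: build a cache file name from a directory path.
# MD5 depends only on the bytes of the path, so the result is the same
# in every run and on every Ruby version.
def cache_file_name(dir)
  "#{Digest::MD5.hexdigest(dir)}.cache"
end

# Lighter alternative: CRC32 from zlib. It is also stable between runs,
# but with only 32 bits, collisions between paths are more likely.
def cache_file_name_crc(dir)
  format("%08x.cache", Zlib.crc32(dir))
end

puts cache_file_name(Dir.pwd)
puts cache_file_name_crc(Dir.pwd)
```

Neither variant rules out collisions entirely, so a cache that must never mix up directories would still need to store the full path somewhere and check it.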

dpitman · May 4, 2009, 10:58pm

On Tue, 5 May 2009 02:15:09 +0900, Robert K. wrote:

Hm… But you do admit that this is a bit abusive, don’t you?
Especially since there are no guarantees that you won’t have any
collisions with a hash value like the one returned by #hash.

Oh yes, it’s just a quick and convenient way of doing it. Dunno if I’d
call it “abusive”, but it’s sure not military grade programming…

How about storing your cache files with a fixed name in the original
directory? Or have a file with metadata (mapping from path to cache
file name)?

Fixed name won’t work; most directories are scm tracked so it’d be a
mess to keep the cache files out of the way. One big(ish) cache file
might work. Maybe even a sqlite db. Have to run some benchmark on that.
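
A rough sketch of the single-cache-file idea, assuming the cached data is a list of task names and that JSON is an acceptable on-disk format. The class name `TaskCache` and its methods are hypothetical:

```ruby
require "json"

# Hypothetical single-file cache: one JSON file that maps absolute
# directory paths to the task names cached for that directory.
class TaskCache
  def initialize(index_path)
    @index_path = index_path
    # Load the existing index if there is one, else start empty.
    @entries = File.exist?(index_path) ? JSON.parse(File.read(index_path)) : {}
  end

  # Returns the cached task list for dir, or nil if nothing is cached.
  def fetch(dir)
    @entries[dir]
  end

  # Records the task list for dir and rewrites the whole index file.
  def store(dir, tasks)
    @entries[dir] = tasks
    File.write(@index_path, JSON.generate(@entries))
  end
end
```

Because the full path is the key, no hash value is involved at all, and collisions cannot happen. The cost is that the whole file is read and rewritten each time, which is where a benchmark would help.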

that stays consistent. Do I need to MD5 it? Feels like overkill. Why
was this changed in the first place?

That’s an interesting question. I’m curious as well. Maybe the
changes are just a side effect of a new - supposedly better - hashing
algorithm.

The link Sebastian provided (http://murmurhash.googlepages.com/) was
interesting but not exhaustive, and I still don’t know when/how/why the
behaviour was changed. Perhaps the mailing list archives will tell?

dpitman · May 4, 2009, 7:15pm

On 04.05.2009 16:48, David P. wrote:

Any tips on how to replace it?
What does it hurt if the hash value of a string does not remain constant
between runs of the application?

In my case it’s pretty bad. I use it in a command line utility to cache rake tasks. I create one cachefile for each directory, naming them using the String#hash of the full path (Dir.pwd.hash). If the hash is different the next time the program runs the cache lookup fails (and I get a new cache file instead of the old one).

Hm… But you do admit that this is a bit abusive, don’t you? Especially
since there are no guarantees that you won’t have any collisions with a
hash value like the one returned by #hash.

How about storing your cache files with a fixed name in the original
directory? Or have a file with metadata (mapping from path to cache
file name)?

So, I don’t need anything fancy, just an equivalent to Dir.pwd.hash that stays consistent. Do I need to MD5 it? Feels like overkill. Why was this changed in the first place?

That’s an interesting question. I’m curious as well. Maybe the changes
are just a side effect of a new - supposedly better - hashing algorithm.

Kind regards

robert