Google_hash v 0.7.0 released. Faster hashes for MRI

aris · October 10, 2012, 9:01pm

Hello.
Pleased to announce v 0.7.0 of the google_hash gem has been released,
mostly thanks to a patch from rolve.

Changes:
fixed building on Linux with newer GCC versions, fixed building on Windows
with broken system command (?)
bumped internal google_hash version to 0.8.2

README/teaser:

The goal: a better hash for Ruby, either one that is faster or more
space efficient than ruby’s default.
To attempt to accomplish this, this library wraps the google hash
sparse and dense hashes [1], which seem to perform much better
at least for the #each method. It also creates some “specialized”
hashes, for instance, those that take an integer for their key,
for even better performance.

ruby 1.9.3p194 (2012-04-20 revision 35410) [i686-linux]
inserting 400000 objects

Ruby Standard Hash
populate integer 0.324
#each 0.660
lookup int 0.083

GoogleHashDenseIntToRuby
populate integer 0.114
#each 0.050
lookup int 0.080
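
For anyone who wants to reproduce numbers like these, here is a rough sketch of a timing harness. It is not the gem's own benchmark script: `time_hash_ops` is a hypothetical helper, and it assumes only that the class under test responds to `new`, `[]=`, `[]` and `each`. `GoogleHashDenseIntToRuby` is measured only if the gem happens to load.

```ruby
require 'benchmark'

# Optional: the gem may not be installed, so fall back to Hash alone.
begin
  require 'google_hash'
rescue LoadError
  # google_hash not available; only the standard Hash is measured
end

# Hypothetical helper: times populate, #each and integer lookup for any
# hash-like class that supports new, []=, [] and each.
def time_hash_ops(klass, count)
  hash = klass.new
  populate = Benchmark.realtime { count.times { |i| hash[i] = i } }
  each     = Benchmark.realtime { hash.each { |_k, _v| } }
  lookup   = Benchmark.realtime { count.times { |i| hash[i] } }
  { populate: populate, each: each, lookup: lookup }
end

classes = [Hash]
classes << GoogleHashDenseIntToRuby if defined?(GoogleHashDenseIntToRuby)

classes.each do |klass|
  puts klass.name
  time_hash_ops(klass, 400_000).each do |label, seconds|
    puts format('  %-8s %.3f', label, seconds)
  end
end
```

Timings will differ by Ruby version and machine, so treat the output as relative rather than comparable to the figures above.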

See the rdp/google_hash repository on GitHub (a wrapper for Google's hash functions, for Ruby) for more.
-roger-

rogerdpack · October 10, 2012, 10:56pm

On Oct 10, 2012, at 12:00, Roger P. wrote:

The goal: a better hash for Ruby, either one that is faster or more
space efficient than ruby’s default.
To attempt to accomplish this, this library wraps the google hash
sparse and dense hashes [1], which seem to perform much better
at least for the #each method. It also creates some “specialized”
hashes, for instance, those that take an integer for their key,
for even better performance.

From the readme:

These also use significantly less memory, because (if you specify IntToInt) it
stores only 4 bytes per int, instead of Ruby’s usual 20 bytes. This also frees
up Ruby so it doesn’t have to garbage collect as much. Yea!

20 bytes?? What exactly is this referring to?

rogerdpack · October 11, 2012, 5:43pm

From the readme:

These also use significantly less memory, because (if you specify IntToInt) it
stores only 4 bytes per int, instead of Ruby’s usual 20 bytes. This also frees
up Ruby so it doesn’t have to garbage collect as much. Yea!

20 bytes?? What exactly is this referring to?

That is the normal size (or at least it used to be) of a “ruby object” in 32-bit MRI.
I believe it’s RObject:
http://fossies.org/dox/ruby-1.9.3-p194/structRObject.html
The size may be bigger in 64-bit MRI; I’m not sure.
-r

rogerdpack · October 11, 2012, 11:41pm

Roger P. wrote:

From the readme:

These also use significantly less memory, because (if you specify IntToInt)
it stores only 4 bytes per int, instead of Ruby’s usual 20 bytes. This also
frees up Ruby so it doesn’t have to garbage collect as much. Yea!

20 bytes?? What exactly is this referring to?

I thought this was referring to st_table_entry size in st.c, but that’s
6 words (24 bytes in 32-bit, 48 bytes in 64-bit) on MRI 1.9 with ordered
hashes (unpacked).

Normal size (or at least used to be) of a “ruby object” in 32 bit MRI.
I believe it’s RObject:
http://fossies.org/dox/ruby-1.9.3-p194/structRObject.html
The size may be bigger in 64 bit MRI, I’m not sure.

RObject is 40 bytes on 64-bit MRI. On the plus side with 64-bit,
embedded strings can be up to 23 bytes (vs 11 bytes for 32-bit) so
there’s a better chance of avoiding malloc() overhead with strings.
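
The sizes quoted in this exchange all come down to counting machine words, which a few lines of Ruby can sketch. The 6-word figure for st_table_entry is taken from the post above. The 5-word RObject layout and the "3 words minus one byte" rule for embedded strings are assumptions inferred from the 20/40 and 11/23 byte figures, not something stated in the thread. `struct_bytes` and `max_embedded_string` are hypothetical helper names.

```ruby
# Bytes per machine word (pointer / VALUE) on each architecture.
WORD_BYTES = { '32-bit' => 4, '64-bit' => 8 }

# Size of a struct made only of word-sized fields.
def struct_bytes(words, word_bytes)
  words * word_bytes
end

# Assumed rule: three words of inline storage, minus one terminator byte.
def max_embedded_string(word_bytes)
  3 * word_bytes - 1
end

WORD_BYTES.each do |arch, bytes|
  puts "#{arch}: st_table_entry=#{struct_bytes(6, bytes)} " \
       "RObject=#{struct_bytes(5, bytes)} " \
       "embedded string max=#{max_embedded_string(bytes)}"
end
```

Under those assumptions the arithmetic matches every figure mentioned so far: 24 and 48 bytes per st_table_entry, 20 and 40 bytes per RObject, 11 and 23 bytes for embedded strings.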

rogerdpack · October 12, 2012, 12:13am

On Oct 11, 2012, at 14:40, Eric W. wrote:

Normal size (or at least used to be) of a “ruby object” in 32 bit MRI.
I believe it’s RObject:
http://fossies.org/dox/ruby-1.9.3-p194/structRObject.html
The size may be bigger in 64 bit MRI, I’m not sure.

RObject is 40 bytes on 64-bit MRI. On the plus side with 64-bit,
embedded strings can be up to 23 bytes (vs 11 bytes for 32-bit) so
there’s a better chance of avoiding malloc() overhead with strings.

But we’re talking ints here… so they just take up the space of VALUE.
I don’t know why that’s being compared against RObject.
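
One way to check this on MRI is `ObjectSpace.memsize_of` from the `objspace` standard library extension. This is only a sketch: the method is implementation-specific, its result is a hint rather than an exact accounting, and what it reports for ordinary objects differs between Ruby versions. The variable names are illustrative.

```ruby
require 'objspace'

small_int = 42          # a small integer, stored directly in the VALUE
heap_obj  = Object.new  # an ordinary object, allocated in a heap slot

int_size = ObjectSpace.memsize_of(small_int)
obj_size = ObjectSpace.memsize_of(heap_obj)

puts "Integer 42: #{int_size} bytes reported"
puts "Object.new: #{obj_size} bytes reported"
```

A small integer should report zero, since there is no separate heap object behind it, which is the point being made here.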

rogerdpack · October 12, 2012, 2:38am

RObject is 40 bytes on 64-bit MRI. On the plus side with 64-bit,
embedded strings can be up to 23 bytes (vs 11 bytes for 32-bit) so
there’s a better chance of avoiding malloc() overhead with strings.

But we’re talking ints here… so they just take up the space of VALUE. I don’t
know why that’s being compared against RObject.

I presumed they were encapsulated within an RObject, so would take up
as much space, though I suppose they might not be.

Anyway, it does decrease the time to do a GC from 0.1s to 0.002s with a
hash of 2M integers, so that’s worth something.

GoogleHashDenseIntToInt
“dense” “took” “0.002”

“ruby hash” “took” “3.381” “0.103”
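
A measurement of this kind can be sketched roughly as follows. This is not the script that produced the output above: `time_full_gc` is a hypothetical helper, and it is shown with the standard Hash only. Any class that responds to `new` and `[]=` could be passed in its place.

```ruby
require 'benchmark'

# Hypothetical helper: builds a hash of `count` integer pairs, then times
# one forced full GC while the hash is still reachable.
def time_full_gc(klass, count)
  hash = klass.new
  count.times { |i| hash[i] = i }
  GC.start # settle any garbage created while populating
  seconds = Benchmark.realtime { GC.start }
  [hash, seconds] # returning the hash keeps it live during the timing
end

_hash, seconds = time_full_gc(Hash, 2_000_000)
puts format('ruby hash GC took %.3f', seconds)
```

Results depend heavily on the Ruby version and its collector, so numbers from a current interpreter will not line up with the 1.9.3 figures quoted here.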

But by all means, if it doesn’t improve your throughput, don’t use it.
-roger-