Forum: Ruby google_hash v 0.7.0 released. Faster hashes for MRI.

Posted by Roger Pack (Guest)
on 2012-10-10 21:01
(Received via mailing list)
Hello.
Pleased to announce v 0.7.0 of the google_hash gem has been released,
mostly thanks to a patch from rolve.

Changes:
  fixed building in linux with newer GCC's, fixed building in windows
with broken system command (?)
  bump internal google_hash version to 0.8.2

README/teaser:

The goal: a better hash for Ruby, either one that is faster or more
space efficient than ruby's default.
To attempt to accomplish this, this library wraps the google hash
sparse and dense hashes [1], which seem to perform much better
at least for the #each method.  It also creates some "specialized"
hashes, for instance, those that take an integer for their key,
for even better performance.

ruby 1.9.3p194 (2012-04-20 revision 35410) [i686-linux]
inserting 400000 objects

Ruby Standard Hash
populate integer        0.324
#each                   0.660
lookup int              0.083

GoogleHashDenseIntToRuby
populate integer        0.114
#each                   0.050
lookup int              0.080

see https://github.com/rdp/google_hash for more.
-roger-
Posted by Ryan Davis (Guest)
on 2012-10-10 22:56
(Received via mailing list)
On Oct 10, 2012, at 12:00 , Roger Pack <rogerdpack2@gmail.com> wrote:

> The goal: a better hash for Ruby, either one that is faster or more
> space efficient than ruby's default.
> To attempt to accomplish this, this library wraps the google hash
> sparse and dense hashes [1], which seem to perform much better
> at least for the #each method.  It also creates some "specialized"
> hashes, for instance, those that take an integer for their key,
> for even better performance.

From the readme:

> These also use significantly less memory, because (if you specify IntToInt, it 
stores only 4 bytes per int, instead of Ruby's usual 20 bytes).  This also frees 
up Ruby so it doesn't hvae to garbage collect as much.  Yea!

20 bytes?? What exactly is this referring to?
Posted by Roger Pack (Guest)
on 2012-10-11 17:43
(Received via mailing list)
> From the readme:
>
>> These also use significantly less memory, because (if you specify IntToInt, it 
stores only 4 bytes per int, instead of Ruby's usual 20 bytes).  This also frees 
up Ruby so it doesn't hvae to garbage collect as much.  Yea!
>
> 20 bytes?? What exactly is this referring to?

Normal size (or at least used to be) of a "ruby object" in 32 bit MRI.
 I believe it's RObject:
http://fossies.org/dox/ruby-1.9.3-p194/structRObject.html
The size may be bigger in 64 bit MRI, I'm not sure.
-r
Posted by Eric Wong (Guest)
on 2012-10-11 23:41
(Received via mailing list)
Roger Pack <rogerdpack2@gmail.com> wrote:
> > From the readme:
> >
> >> These also use significantly less memory, because (if you specify IntToInt, 
it stores only 4 bytes per int, instead of Ruby's usual 20 bytes).  This also 
frees up Ruby so it doesn't hvae to garbage collect as much.  Yea!
> >
> > 20 bytes?? What exactly is this referring to?

I thought this was referring to st_table_entry size in st.c, but that's
6 words (24 bytes in 32-bit, 48 bytes in 64-bit) on MRI 1.9 with ordered
hashes (unpacked).

> Normal size (or at least used to be) of a "ruby object" in 32 bit MRI.
>  I believe it's RObject:
> http://fossies.org/dox/ruby-1.9.3-p194/structRObject.html
> The size may be bigger in 64 bit MRI, I'm not sure.

RObject is 40 bytes on 64-bit MRI.  On the plus side with 64-bit,
embedded strings can be up to 23 bytes (vs 11 bytes for 32-bit) so
there's a better chance of avoiding malloc() overhead with strings.
Posted by Ryan Davis (Guest)
on 2012-10-12 00:13
(Received via mailing list)
On Oct 11, 2012, at 14:40 , Eric Wong <normalperson@yhbt.net> wrote:

>
>> Normal size (or at least used to be) of a "ruby object" in 32 bit MRI.
>> I believe it's RObject:
>> http://fossies.org/dox/ruby-1.9.3-p194/structRObject.html
>> The size may be bigger in 64 bit MRI, I'm not sure.
>
> RObject is 40 bytes on 64-bit MRI.  On the plus side with 64-bit,
> embedded strings can be up to 23 bytes (vs 11 bytes for 32-bit) so
> there's a better chance of avoiding malloc() overhead with strings.

But we're talking ints here... so they just take up the space of VALUE. 
I don't know why that's being compared against RObject.
Posted by Roger Pack (Guest)
on 2012-10-12 02:38
(Received via mailing list)
>> RObject is 40 bytes on 64-bit MRI.  On the plus side with 64-bit,
>> embedded strings can be up to 23 bytes (vs 11 bytes for 32-bit) so
>> there's a better chance of avoiding malloc() overhead with strings.
>
> But we're talking ints here... so they just take up the space of VALUE. I don't 
know why that's being compared against RObject.

I presumed they were encapsulated within an RObject, so would take up
as much space, though I suppose they might not be.

Anyway it does decrease the time to do a GC from 0.1s to 0.002 with a
hash of 2M integers, so that's worth something :)

GoogleHashDenseIntToInt
"dense"
"took"
"0.002"

"ruby hash"
"took" "3.381"
"0.103"

But by all means, if it doesn't improve your throughput, don't use it :)
-roger-
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.