[ANN] google_hash 0.1.1 -- it has a #each!

rogerdpack · December 16, 2009, 12:47am

Pleased to announce the initial release of a “google_hash” gem.

Its goal. To boldly be faster than any hash hash before (cue star trek
TNG theme).

Or basically a better hash, either one that is faster or more space
efficient than ruby’s default. To attempt this we wrap the google
sparse and dense hashes [1].

Speed results (populating/iterating over 500000 integers):

1.9.1p376 (mingw):

Hash (Ruby default)
0.359375 (populate)
1.1875 (each)

GoogleHashDense
0.1875 (populate)
0.078125 (each)

GoogleHashSparse
0.53125 (populate)
0.078125 (each)

Usage:

a = GoogleHashSparse.new
b = GoogleHashDense.new # or just GoogleHash.new

a[3] = 4
b[4] = ‘abc’
a[‘abc’.hash] = ‘some complex object’

it only accepts int’s currently–only because I’m too lazy to add more

types yet.
a.each{|k, v| }

Installation:

gem install google_hash (if on doze, you’ll need the devkit installed)

Both these classes are currently more space efficient than a hash,
because they store keys as “native” ints, so the keys no longer affect
GC time, as well as only use 4 bytes instead of 20 (or 8 instead of 40,
on 64 bit). This should release some stress on the GC. In terms of
total memory usage, GoogleHashDense uses more (more buckets), and is
more speedy, and GoogleHashSparse uses less space, and is much more
memory efficient (2 bits per entry, or so I’m told).

This is meant to be one more tool in the rubyists toolbelt when trying
to optimize speed-wise, and plans to expand to more types, but at least
with this release it has a #each method.

If you have a desired use case let me know and I might well be able to
code it up for you.

Enjoy.
-r

[1] Google Code Archive - Long-term storage for Google Code Project Hosting.

rogerdpack · December 16, 2009, 5:02pm

Roger P. wrote:

Pleased to announce the initial release of a “google_hash” gem.

Both these classes are currently more space efficient than a hash,
because they store keys as “native” ints, so the keys no longer affect
GC time, as well as only use 4 bytes instead of 20 (or 8 instead of 40,
on 64 bit). This should release some stress on the GC. In terms of
total memory usage, GoogleHashDense uses more (more buckets), and is
more speedy, and GoogleHashSparse uses less space, and is much more
memory efficient (2 bits per entry, or so I’m told).

[1] Google Code Archive - Long-term storage for Google Code Project Hosting.

Neat. Does it keep the order in which the elements were entered? Is it
sortable?

rogerdpack · December 16, 2009, 5:35pm

Roger P.:

Pleased to announce the initial release of a “google_hash” gem.

Its goal. To boldly be faster than any
hash before (cue star trek TNG theme).

Or basically a better hash, either one that is faster or more space
efficient than ruby’s default. To attempt this we wrap the google
sparse and dense hashes [1].

[â€¦]

it only accepts int’s currently–only

because I’m too lazy to add more types yet.

[â€¦]

If you have a desired use case let me know and
I might well be able to code it up for you.

Wow, great! I have a very big need for a fast Set of Integers (Fixnums
or Bignums) â€“ Iâ€™m yet to look at your implementation, but am I right
that it should be more-or-less simply doable with your approach?

â€” Shot

rogerdpack · December 17, 2009, 12:53am

Aldric G. wrote:

Roger P. wrote:

Pleased to announce the initial release of a “google_hash” gem.

Neat. Does it keep the order in which the elements were entered? Is it
sortable?

Nope. What would be the requirements for something you’re looking for?

Shot wrote:

I have a big need for a fast Set of integers/bignums

You just need fast iteration? Fast insertion?

Yeah it should be doable.

The google hash lib itself has both Hashes and Sets–I just have focused
on the Hash stuff as of yet, but hope to do them all (with more types)
eventually.

With just int (or double) you could reasonably easily store your keys
and/oor values as native.

With Bignum I’m not totally sure how they’re stored in memory, but with
some hacking it is probably possible to “store them away” (i.e. store a
copy of them on entry so that Ruby is no longer memory tracking them).

What would the ideal be in terms of requirements? (oh and feel free to
checkout the code/fork/file issues – GitHub - rdp/google_hash: wrapper for google's hash functions, for ruby
)

Enjoy.
-r

rogerdpack · December 17, 2009, 2:52pm

Roger P. wrote:

Aldric G. wrote:

Roger P. wrote:

Pleased to announce the initial release of a “google_hash” gem.

Neat. Does it keep the order in which the elements were entered? Is it
sortable?

Nope. What would be the requirements for something you’re looking for?

In Ruby 1.9, hashes remember in which order the data was entered (as
opposed to being “somehow sorted”. Basically, a data structure with the
capabilities of the Dictionary in the Facets gem.

rogerdpack · December 17, 2009, 8:09am

Hi,
I’m not sure, but Judy (http://judy.sourceforge.net/) and it’s Ruby
bindings (http://rjudy.sourceforge.net/) seem to be another candidate
for efficient hash implementations. I tried purple
(http://purple.rubyforge.org/), which is built upon Judy, too.

hth
ralf

rogerdpack · December 20, 2009, 12:52am

Roger P.:

Shot wrote:

I have a big need for a fast Set of integers/bignums

You just need fast iteration? Fast insertion?

Iteration, mostly. NaÃ¯ve profiling of my Set-heavy code says it spends
most of the time in Hash#each_key (as a result of calls to Set#each).

The google hash lib itself has both Hashes and Sets–I just have
focused on the Hash stuff as of yet, but hope to do them all (with
more types) eventually.

Ah, perfect. Iâ€™ll definitely take a look
(at both your gem and Googleâ€™s variants).

With just int (or double) you could reasonably
easily store your keys and/oor values as native.

Yeah; unfortunately, in most cases (definitely in the slow ones)
I need Bignum support; Iâ€™m storing sets of (usually large) integers
as turned-on Bignumsâ€™ bits.

With Bignum I’m not totally sure how they’re stored in memory,
but with some hacking it is probably possible to “store them away”
(i.e. store a copy of them on entry so that Ruby is no longer memory
tracking them).

One of the routes I can take is writing my own IntSet class in (possibly
inline) C or D, but Iâ€™d rather prototype it in Ruby first, and rewrite
only if necessary. It seems I should at least try to make it use the
Googleâ€™s hash/set implementation.

What would the ideal be in terms of requirements?

Hm, I guess I should implement and profile the skelatal IntSet first
to be sure, but something with fast #each and #combination(2) calls
would already make wonders in my case (Set#each calls Hash#each_key,
while my current Set#pairs implementation consists of calling
to_a.combination(2), so probably thereâ€™s a lot of space for improvement
already right there).

BTW â€“ Iâ€™m not sure why, but got the below under
both 1.8.7p202 and 1.9.1p376 (both 64-bit):

shot@devielle:~$ gem install google_hash
ERROR: Error installing google_hash:
sane requires os (>= 0, runtime)
shot@devielle:~$ gem install sane
ERROR: Error installing sane:
sane requires os (>= 0, runtime)
shot@devielle:~$ gem install os
Successfully installed os-0.3.0
1 gem installed
Installing ri documentation for os-0.3.0…
Installing RDoc documentation for os-0.3.0…
shot@devielle:~$ gem install google_hash
Building native extensions. This could take a while…
Successfully installed sane-0.16.0
Successfully installed google_hash-0.1.1
2 gems installed
Installing ri documentation for sane-0.16.0…
Installing ri documentation for google_hash-0.1.1…
Updating class cache with 2317 classes…
Installing RDoc documentation for sane-0.16.0…
Installing RDoc documentation for google_hash-0.1.1…
shot@devielle:~$

â€” Shot

rogerdpack · December 23, 2009, 1:40am

Or basically a better hash, either one that is faster or more space
efficient than ruby’s default. To attempt this we wrap the google
sparse and dense hashes [1].

As a note, also released 0.2.1 recently.

Changes:

Added RubyToRuby Hash [drop in replacement for the default hash].
added #keys and #values and #keys_combination_2 methods.

In general the :int => Ruby hashes are much faster than Ruby’s hash
lookups.
The RubyToRuby Hash is a bit slower for insertion/lookup and far
faster for #each than the default hash.

new usage:

a = GoogleHashDenseRubyToRuby.new # or GoogleHash.new
b = GoogleHashDenseLongToRuby.new # a hash that is only :int => Ruby
b = GoogleHashSparseLongToRuby.new # a hash that is only :int => Ruby,
and uses less memory [Sparse]

a[3] = 4
b[4] = ‘abc’
b[‘abc’.hash] = ‘some complex object’

a.each{|k, v| … }

a.keys => Array
a.values => Array

speed comparison:

1.9 mingw results

http://pastie.org/752318

1.9.2 linux:

http://pastie.org/752333

Enjoy.
-r

rogerdpack · December 27, 2009, 2:05am

Pleased to announce the initial release of a “google_hash” gem.

Speed results (populating/iterating over 500000 integers):

1.9.1p376 (mingw):

Hash (Ruby default)
0.359375 (populate)
1.1875 (each)

GoogleHashDense
0.1875 (populate)
0.078125 (each)

Released another update…

v 0.3.0

ChangeLog:

Now has several new types of hashes, ex:

GoogleHashDenseRubyToRuby # just like a normal Hash–you can store Ruby
values and Ruby keys.

GoogleHashDenseIntToInt # :int => :int only – this one stores all its
keys and values a native, so ruby’s GC doesn’t ever touch them!

also has some slight speed improvements.

See GitHub - rdp/google_hash: wrapper for google's hash functions, for ruby

for more information.

Merry Christmas.
-r

rogerdpack · December 27, 2009, 3:26am

Roger P. [email protected] wrote:

Pleased to announce the initial release of a “google_hash” gem.

Its goal. To boldly be faster than any hash hash before (cue star trek
TNG theme).

Or basically a better hash, either one that is faster or more space
efficient than ruby’s default. To attempt this we wrap the google
sparse and dense hashes [1].

Both these classes are currently more space efficient than a hash,
because they store keys as “native” ints, so the keys no longer affect
GC time, as well as only use 4 bytes instead of 20 (or 8 instead of 40,
on 64 bit). This should release some stress on the GC. In terms of
total memory usage, GoogleHashDense uses more (more buckets), and is
more speedy, and GoogleHashSparse uses less space, and is much more
memory efficient (2 bits per entry, or so I’m told).

Hi Roger,

Any chance of having one of these can become the default MRI Hash
implementation, or even replace most MRI-internal uses of st.c?
Much of the st.c stuff inside MRI uses numeric keys.

rogerdpack · December 27, 2009, 4:05am

Hi Roger,

Any chance of having one of these can become the default MRI Hash
implementation, or even replace most MRI-internal uses of st.c?
Much of the st.c stuff inside MRI uses numeric keys.

I’m sure it would be possible–and possibly not even terribly hard–the
only concerns would be licensing, and the fact that it’s written in C++
instead of C.

That being said, I could create a wrapper for it that takes a file and
converts its hashes “automagically” into GoogleHash.new’s…
if desired.
-r