Memory-mapped persistent hash?

kelleysislander · May 17, 2007, 8:50am

Hi,

Essentially I’m looking for something that acts like a ruby Hash,
is nearly as fast as a ruby Hash, but is persistent. It should
also be able to remember gigabytes of information on disk, while
not using a ton of resident RAM in the process.

Also, when I restart the ruby process, the overhead when
initially “connecting to or opening” this persistent hash should
be very minimal before I can start using it.

Does such a beast exist?

From what I’ve read, it seems like the Berkeley DB might fit the
bill; although I’m a little worried I couldn’t find the cost of
a commercial license listed anywhere on the website.

Are there other options besides BDB?

Thanks,

Regards,

Bill

kelleysislander · May 17, 2007, 9:04am

Bill K. schrieb:

Essentially I’m looking for something that acts like a ruby Hash,
is nearly as fast as a ruby Hash, but is persistent. It should
also be able to remember gigabytes of information on disk, while
not using a ton of resident RAM in the process.

Also, when I restart the ruby process, the overhead when
initially “connecting to or opening” this persistent hash should
be very minimal before I can start using it.

Yes, I would love such a thing, too.

Sven

kelleysislander · May 17, 2007, 9:11am

On 17/05/07, Bill K. [email protected] wrote:

Regards,

Bill

SQLite?

Farrel

kelleysislander · May 17, 2007, 9:43am

On Thu, May 17, 2007 at 03:48:07PM +0900, Bill K. wrote:

From what I’ve read, it seems like the Berkeley DB might fit the
bill; although I’m a little worried I couldn’t find the cost of
a commercial license listed anywhere on the website.

If you need it for a software product you will sell, then ask them -
they’ll
happily discuss pricing. BDB is used in many embedded applications, so I
don’t expect it would be prohibitatively expensive.

Sure, BDB is now owned by Oracle - but then again, even a full Oracle
install is free (for a single process with up to 1GB of RAM and 4GB of
tablespace)

The suggestion someone else made of sqlite3 is a good one, and you can
always put an OR mapping like ActiveRecord in front of it, but clearly
there’s an overhead of converting your DB requests into SQL and back
again.

Regards,

Brian.

kelleysislander · May 17, 2007, 10:51am

From: “Farrel L.” [email protected]

SQLite?

Thanks; I’ll explain more below…

From: “Brian C.” [email protected]

On Thu, May 17, 2007 at 03:48:07PM +0900, Bill K. wrote:

From what I’ve read, it seems like the Berkeley DB might fit the
bill; although I’m a little worried I couldn’t find the cost of
a commercial license listed anywhere on the website.

If you need it for a software product you will sell, then ask them - they’ll
happily discuss pricing. BDB is used in many embedded applications, so I
don’t expect it would be prohibitatively expensive.

It can’t hurt to ask… I saw a (rather large) figure reported on
the web, which was way out of my price range; but it may not have
been accurate or current.

The suggestion someone else made of sqlite3 is a good one, and you can
always put an OR mapping like ActiveRecord in front of it, but clearly
there’s an overhead of converting your DB requests into SQL and back again.

Indeed; actually I’m already using sqlite3 for relational data.
I like it a lot.

It just seemed to me that using memory-mapped I/O, it might be
possible to get maybe an order of magnitude faster performance
than sqlite3 for raw key/value style fetches and stores, as long
as lots of the relevant pages were being cached by the VMM.

Certainly if the VMM were able to cache all the pages representing
the hash database file, I’d expect performance could approach an
in-memory hash–which appears to be around two orders of magnitude
faster than sqlite3. Although, once the hash database file
significantly exceeds RAM size, one would expect performance to
degrade to I/O speeds given totally random access patterns; however
if just a subset of the total keys were being read repeatedly, I’d
expect performance to stay high. (The latter being a case I’m
interested in.)

So I figured there might conceivably already be several existing
implementations of such a thing. (But had trouble finding them
with web searches, if they exist–apart from BDB.)

Regards,

Bill

kelleysislander · May 17, 2007, 1:20pm

On Thu, May 17, 2007 at 05:50:11PM +0900, Bill K. wrote:

degrade to I/O speeds given totally random access patterns; however
if just a subset of the total keys were being read repeatedly, I’d
expect performance to stay high. (The latter being a case I’m
interested in.)

So I figured there might conceivably already be several existing
implementations of such a thing. (But had trouble finding them
with web searches, if they exist–apart from BDB.)

Well, there’s ruby-mmap. Looking at the docs, this effectively gives you
a
very large string, which is not what you’re looking for, but maybe you
could
layer something on top of this.

If your keys are small but your objects are large, you could keep all
the
keys in RAM (e.g. using something like Madeleine) and just store
pointers to
the data. Then you only(*) have to worry about fragmentation and garbage
collection within your mmap’d file.

Or you could just keep keys in RAM and use separate disk files for each
object.

Regards,

Brian.

(*)

kelleysislander · May 17, 2007, 6:07pm

Brian C. wrote:
…

Or you could just keep keys in RAM and use separate disk files for each
object.

That’s sounding sorta like fsdb[1]. I’m not sure I can recommend it in
this case, though, since you pay for Marshal.dump every time you write
an object into the database, plus locking, plus IO. I’d expect
FSDB::Database::[]= will be orders of magnitude slower than Hash#[]= .

It will be interesting to hear any solution using mmap…

[1] FSDB

kelleysislander · May 17, 2007, 8:57pm

From: “Tim P.” [email protected]

consolidating unused mmap regions; etc. All the standard MMU stuff
you normally don’t have to deal with in Ruby.

It would be much easier to implement if all the objects being stored
in the Hash were guaranteed to be the same size. Then you would just
need an free/allocated array to keep track of what can go where in the
mmap.

Agreed. But ironically, what gets me, is that with a modern
VMM, this is exactly what is already going on with Ruby’s hash
in memory. Except that the backing store is the system swap
file, and so, not persistent.

In principle, I just want to change the backing store to a
memory mapped file, instead.

I’ve wondered what would happen if one took a nice malloc
implementation, made it operate inside a heap that was
memory-mapped onto a file, and then took something like the
STL hash_map (or ruby’s hash) and wired it to the malloc
allocating from the memory-mapped file.

Intuitively, it seems it would have no choice but to perform
fantastically, as long as the whole file could be mapped
into memory.

However, once the file size exceeded available memory, I
can imagine that it might degrade to sub-optimal performace.

Along these lines, I’ve also wondered if one could get a
ruby application to persist similarly, (in principle!)
by wiring ruby’s memory allocation functions to a malloc
that was allocating from a memory-mapped file. Of course
the tricky part would be dealing with all the objects
containing system resources that can’t be persisted,
such as file handles, etc. etc. Probably a nightmare in
practical terms, unless the language and its libraries
were designed that way from the start…

Ah, well. In talking about this, it seems there are really
two scenarios for memory-mapped persistent hashes. One when
all the pages can fit in RAM; and the other when the filesize
would greately exceed available RAM (and even worse, when
the filesize exceeds the maximum contiguous block that
can be even mapped into the process address space at all.)

Hmm…

Regards,

Bill

kelleysislander · May 17, 2007, 7:14pm

On 5/17/07, Joel VanderWerf [email protected] wrote:

It will be interesting to hear any solution using mmap…

You could do a mmap solution. Modify Hash such that []= does a
Marshal.dump of your object, stores the object into the mmap, and then
that memory location is stored in the Hash instead of the object.

[] must also be modified to take the memory location, Marshal.load the
object from the mmap, and then return the object.

The hard part of this is doing the memory management of the mmap –
when an object is deleted from the hash, removing it from the mmap;
consolidating unused mmap regions; etc. All the standard MMU stuff
you normally don’t have to deal with in Ruby.

It would be much easier to implement if all the objects being stored
in the Hash were guaranteed to be the same size. Then you would just
need an free/allocated array to keep track of what can go where in the
mmap.

Blessings,
TwP

kelleysislander · May 17, 2007, 9:57pm

On May 17, 2007, at 12:48 AM, Bill K. wrote:

Regards,

Bill

cfp:~ > cat a.rb
require ‘yaml’
require ‘time’

*nix AND windblows. built-in to ruby

DB =
begin
require ‘sdbm’
SDBM
rescue LoadError
require ‘dbm’
DBM
end

db = DB.new ‘db’

last_time = db[‘last_time’]
this_time = Time.now.iso8601(2)

y ‘last_time’ => last_time, ‘this_time’ => this_time

db[‘last_time’] = this_time
cfp:~ > ruby a.rb

this_time: “2007-05-17T13:56:13.97-06:00”
last_time:
cfp:~ > ruby a.rb

this_time: “2007-05-17T13:56:15.59-06:00”
last_time: “2007-05-17T13:56:13.97-06:00”
cfp:~ > ruby a.rb

this_time: “2007-05-17T13:56:16.91-06:00”
last_time: “2007-05-17T13:56:15.59-06:00”

-a

kelleysislander · May 17, 2007, 10:52pm

From: “ara.t.howard” [email protected]

be very minimal before I can start using it.
cfp:~ > cat a.rb
DBM
end

db = DB.new ‘db’

Hi ara,

Thanks, I’ve never tried ‘dbm’. I’ll give it a shot.

I’ve tried ‘sdbm’ on both Perl and Ruby, and it was horribly
horribly bad. Even with about 20,000 keys, using small-sized
keys and values, the database would bloat up to several
gigabytes and start failing to store new keys. When using
larger sized values, it would fail more rapidly. And the
max-sized values that would work at all seemed to be around
a couple KBytes. (These issues occurred on both windows and
unix… however it seemed to have additional serious issues
on windows. This was a few years ago…)

Hopefully ‘dbm’ is better. I’ll test it out tonight.

Thanks,

Bill

kelleysislander · May 18, 2007, 7:01am

On May 17, 2007, at 2:51 PM, Bill K. wrote:

require ‘time’

Hi ara,

Thanks, I’ve never tried ‘dbm’. I’ll give it a shot.

on linux it’s basically just a tiny layer on bdb - but it gets built
by ruby automatically

Hopefully ‘dbm’ is better. I’ll test it out tonight.

oh. it has to work!? picky!

at least it has cross-platform failure going for it!

-a

kelleysislander · May 18, 2007, 4:31pm

On May 17, 2007, at 11:01 PM, Bill K. wrote:

QDBM runs on *nix and windows, handles large databases,
is LGPL, has outstanding documentation, comes with
ruby bindings. And it’s fast.

there are so dang many of the xxxDBM libs i’d forgotten about that
one, but i figured one of them would be the key to the realm. this
whole thing did not support my first rule of software delevelopment
which is that

“it’s written. it’s free. and it’s the first link on google.”

i guess this time it’s more like

“it’s written. it’s free. and will require several iterations of
posting to the mailing list of a cutting edge language and source
reading.”

at least the first two were true!

cheers.

-a

kelleysislander · May 18, 2007, 7:03am

initially “connecting to or opening” this persistent hash should
require 'dbm'
DBM
end

db = DB.new ‘db’

Thanks for the tip, ara!

In looking at ruby’s dbm extension extconf.rb, I learned about
qdbm:

This appears to be an amazing library. I think its author,
Mikio Hirabayashi is going to be my hero of 2007.

QDBM runs on *nix and windows, handles large databases,
is LGPL, has outstanding documentation, comes with
ruby bindings. And it’s fast.

Thanks,

Regards,

Bill

Memory-mapped persistent hash?

*nix AND windblows. built-in to ruby

db[‘last_time’] = this_time cfp:~ > ruby a.rb

this_time: “2007-05-17T13:56:13.97-06:00” last_time: cfp:~ > ruby a.rb

this_time: “2007-05-17T13:56:15.59-06:00” last_time: “2007-05-17T13:56:13.97-06:00” cfp:~ > ruby a.rb

db[‘last_time’] = this_time
cfp:~ > ruby a.rb

this_time: “2007-05-17T13:56:13.97-06:00”
last_time:
cfp:~ > ruby a.rb

this_time: “2007-05-17T13:56:15.59-06:00”
last_time: “2007-05-17T13:56:13.97-06:00”
cfp:~ > ruby a.rb