Re: ruby sdbm library reliable?

BergerS_Daniel · May 16, 2006, 4:48pm

-----Original Message-----
From: Jamey C. [mailto:[email protected]]
Sent: Tuesday, May 16, 2006 8:37 AM
To: ruby-talk ML
Subject: ruby sdbm library reliable?

Does anyone have any recent experience with using Ruby’s sdbm
library?

The dbm/gdbm/sdbm libraries are toys, imho. I wouldn’t trust them to
store more than a few thousand records, let alone a million. I’ve seen
them fall down horribly on Windows. Just ask Hal F… As far as I
can tell, the dbm/gdbm/sdbm wrappers were included because thats what
Perl does.

I would be looking at storing up to a million records, with
each record
being up to 200 bytes, and each key being up to 7 bytes.

I’m very curious Jamey - why aren’t you using KirbyBase?

Anyway, if you want something more robust yet still simple, I recommend
sqlite.

Regards,

Dan

This communication is the property of Qwest and may contain confidential
or
privileged information. Unauthorized use of this communication is
strictly
prohibited and may be unlawful. If you have received this communication
in error, please immediately notify the sender by reply e-mail and
destroy
all copies of the communication and any attachments.

BergerS_Daniel · May 17, 2006, 2:07pm

On 5/16/06, Berger, Daniel [email protected] wrote:

The dbm/gdbm/sdbm libraries are toys, imho. I wouldn’t trust them to
store more than a few thousand records, let alone a million. I’ve seen
them fall down horribly on Windows. Just ask Hal F… As far as I
can tell, the dbm/gdbm/sdbm wrappers were included because thats what
Perl does.

They also consistently fail tests on Gentoo GNU/Linux for me (and a few
others).

Michal

This message is my property! Do not read it, and immediately destroy
it!!!

BergerS_Daniel · May 16, 2006, 5:13pm

Berger, Daniel wrote:

The dbm/gdbm/sdbm libraries are toys, imho. I wouldn’t trust them to
store more than a few thousand records, let alone a million. I’ve seen
them fall down horribly on Windows. Just ask Hal F… As far as I
can tell, the dbm/gdbm/sdbm wrappers were included because thats what
Perl does.

Thanks, Dan. That’s the kind of info I was looking for before I spent a
lot of time testing.

I’m very curious Jamey - why aren’t you using KirbyBase?

Well, here’s the deal.

I’m looking at trying to improve KirbyBase’s performance, to make it
scale better. One way I am doing this is I have started playing around
with using Skiplists for KirbyBase indexes, rather than arrays. For a
lot of queries, this has greatly speeded up the part of the query that
grabs all matching keys.

But, I’m still running into the bottleneck of taking those keys from the
query and using them to pull the actual records in from the flat text
file where KirbyBase stores the actual data.

So, I’ve been looking at ways to get around this. I would basically
look at replacing the back-end flat file format of KirbyBase with
something faster. The user wouldn’t see any difference in how they
interacted with KirbyBase. One of the possibilities was to store the
data records in some kind of dbm-like file (i.e. gdbm, qdbm, etc.) and
sdbm looked attractive because it is included with Ruby. The downside
to this is that the data is no longer kept in easily-editable plain-text
files.

Another option would be to keep the entire table in memory as a hash and
use some kind of journal file to handle updates/deletes. This might be
an issue on memory-constrained systems, but I’m thinking this is less
and less of an issue on modern pcs.

I’m just kind of brainstorming right now. KirbyBase performs pretty
well, but I would like to see it scale better.

Jamey

Confidentiality Notice: This email message, including any attachments,
is for the sole use of the intended recipient(s) and may contain
confidential and/or privileged information. If you are not the intended
recipient(s), you are hereby notified that any dissemination,
unauthorized review, use, disclosure or distribution of this email and
any materials contained in any attachments is prohibited. If you receive
this message in error, or are not the intended recipient(s), please
immediately notify the sender by email and destroy all copies of the
original message, including attachments.

BergerS_Daniel · May 17, 2006, 2:35pm

I’ve been interested for a long time in doing a fast system for datasets
up
to a few million rows with journalled updates. (I wrote the
journal-arena
already, I use it for fast access to log entries.) If you like, I’d be
happy
to help out also.

BergerS_Daniel · May 17, 2006, 3:49pm

Francis C. wrote:

I’ve been interested for a long time in doing a fast system for
datasets up
to a few million rows with journalled updates. (I wrote the journal-arena
already, I use it for fast access to log entries.) If you like, I’d be
happy
to help out also.
Sure thing!

I’ve been doing some thinking about Logan’s idea of offering a choice of
multiple backends. That sounds intriguing. I could see that idea,
coupled with something like qdbm, to provide different options for the
user.

The default backend could remain the plain-text files that KirbyBase
currently uses. But, if the user has qdbm (as an example) installed,
KirbyBase could optionally use that as the backend resulting in a
noticeable (I would hope) performance increase.

By the way, does anyone have experience using qdbm? It looks like the
library is well maintained and there is plenty of documentation. I’m
curious if it is reliable and does it seem to scale.

As far as the indexing goes, I’ve recently become infatuated with
Skiplists. In some informal testing, my very crude skiplist
implementation seemed to be a nice balance between a hash and an array
for general lookups. I’ve also thought about trying out a B-Tree
implementation for the indexing, with the added possibility of
storing/accessing the indexes from disk as opposed to loading them
completely into memory before use. I think Logan mentioned that he had
some thoughts in that direction.

I would definitely be interested in seeing your work with journaling. I
was thinking of possibly using journaling in conjunction with having the
entire table in memory. Updates and deletes would be made directly to
the table in memory as well as written to a journal file, so that, when
it came time to write the table back out to disk, I would only have to
update the table’s disk file with the contents of the journal file as
opposed to having to write out the memory-based table. Like I said,
this is just brainstorming. It may not even make sense to try to keep
the entire table in memory.

Jamey

Confidentiality Notice: This email message, including any attachments,
is for the sole use of the intended recipient(s) and may contain
confidential and/or privileged information. If you are not the intended
recipient(s), you are hereby notified that any dissemination,
unauthorized review, use, disclosure or distribution of this email and
any materials contained in any attachments is prohibited. If you receive
this message in error, or are not the intended recipient(s), please
immediately notify the sender by email and destroy all copies of the
original message, including attachments.

BergerS_Daniel · May 17, 2006, 4:19pm

So much depends on the application. And it’s important to decide whether
or
not we’re willing to write fairly specialized implementations for
particular
application classes. I’ve done a fair bit of that and I’m not afraid to
do
it, but it’s axiomatically an inefficient process. But if you go the
opposite direction, toward creating a general-purpose lightweight data
manager, you might as well just use sqllite. If we put our heads
together,
we ought to be able to find a middle ground: what, for example, would be
a
generic architecture for a lightweight database? What are the essential
pieces and the optional pieces, and precisely what interfaces do they
expose? Done right, that would enable optimizations for particular apps
by
swapping out components.

The two big problems with journaling data-storage are: collecting the
garbage, and rebuilding the indices when the run starts (because there’s
no
graceful place to put them in the journal). One of the apps I often deal
with is indexed searching on logfiles. A journalling arena implemented
in
virtual memory works well here because garbage collection is automatic.
But
fast index generation would be a huge plus. These apps generally index
whole
words, Lucene-style.

Another thing I do a lot is LDAP directories augmented with presence
data.
Here, journalling is a tougher fit because most of the data is not
garbage,
but some of it becomes garbage at very high rates. The indexing is also
a
serious problem because LDAP permits partial-word searches.

Skiplists look really interesting, thanks for pointing them out.

BergerS_Daniel · May 17, 2006, 5:31pm

On Wednesday 17 May 2006 7:48 am, Jamey C. wrote:

I’ve been doing some thinking about Logan’s idea of offering a choice of
multiple backends. That sounds intriguing. I could see that idea,
coupled with something like qdbm, to provide different options for the
user.

I wrote something a while back that might be helpful (or not). I wanted
a
lightweight mechanism to persist data to disk via a hashlike interface.
I
didn’t want to use pstore because one pays a pretty high cost for each
transaction once the data store gets any significant amount of data in
it,
and one has to go through extra effort to make pstore multiprocess safe.
I
also wanted it to have flat RAM usage, regardless of the number of items
in
the data store.

What I ended up with is something that can be used as an LRU cache or
simply a
hash. It is completely disk based and is multithread/multiprocess safe.
It
still needs some love, but it works.

It’s slower than pstore (when starting a transaction, performaing an
action,
closing a transaction) for small numbers of items in the data store, and
much
faster than pstore once the data store has a few thousand items.

I’m planning on packaging it up as a seperate micropackage in the IOWA
area on
Rubyforge. It’s probably no faster than Kirbybase is now, but if it
might be
useful in some capacity, I’ll bump that up on my priority list.

Thanks

Kirk H.

BergerS_Daniel · May 17, 2006, 7:41pm

Kirk H. wrote:

still needs some love, but it works.

It’s slower than pstore (when starting a transaction, performaing an action,
closing a transaction) for small numbers of items in the data store, and much
faster than pstore once the data store has a few thousand items.

I’m planning on packaging it up as a seperate micropackage in the IOWA area on
Rubyforge. It’s probably no faster than Kirbybase is now, but if it might be
useful in some capacity, I’ll bump that up on my priority list.

This sounds really interesting. Thanks, Kirk!

Jamey

Confidentiality Notice: This email message, including any attachments,
is for the sole use of the intended recipient(s) and may contain
confidential and/or privileged information. If you are not the intended
recipient(s), you are hereby notified that any dissemination,
unauthorized review, use, disclosure or distribution of this email and
any materials contained in any attachments is prohibited. If you receive
this message in error, or are not the intended recipient(s), please
immediately notify the sender by email and destroy all copies of the
original message, including attachments.

BergerS_Daniel · May 17, 2006, 7:00pm

Kirk H. wrote:

transaction once the data store gets any significant amount of data in it,
faster than pstore once the data store has a few thousand items.

I’m planning on packaging it up as a seperate micropackage in the IOWA area on
Rubyforge. It’s probably no faster than Kirbybase is now, but if it might be
useful in some capacity, I’ll bump that up on my priority list.

Sounds interesting. I wrote FSDB with some of the same ideas in mind,
but probably we used different approaches. Looking forward to seeing it
when it comes out.