Looking for a Fast Persistent Store

Kirk H. wrote:

How fast is fast enough, and what kind of transaction support do you
need?

> I have a class that is part of the suite of caches I provide in IOWA
> that I have recently been working on.
>
> I'm currently expanding the test cases for it and cleaning up some
> bugs, and will go ahead and release it as a standalone package this
> weekend. Maybe it will help you.

Francis C. wrote:

> Kirk, how did you do this? Are you storing immutable objects and
> turning older versions into garbage somehow? If you have no indexing,
> then how do you enforce a consistent view across processes without
> incurring disk i/o (at least having to read the inode)? Or are you
> managing an arena that is backed by shared memory and you convert the
> key values deterministically into offsets?
>
> Sorry for all the questions, but I've just had to fight with this
> problem too many times and it would be great if you've come up with a
> better mousetrap.

I’m interested in this, too…please make an announcement on the list
when you release it? :slight_smile:

-Justin

On Sat, 12 Aug 2006, Francis C. wrote:

With all the times I’ve reinvented this wheel, I’ve never tried
storing millions of data elements each in its own file. Do you have
performance metrics you can share? Did you tune it for a particular
filesystem?

Not really. I haven't saved the results of any tests to date. The
largest test, however, created a million-item cache, and I had no
problems.

The main consideration, if one plans to store large numbers of
elements, is making sure the filesystem is built with large numbers of
small files in mind.

However, a lot of files can fit into a surprisingly non-messy directory
structure.

Consider a key whose SHA512 hash has these first few characters:

dd232b224979c0e

If I am running a cache with a bucket depth of 2 and a bucket width of
2, the file for that key is going to go into the directory

dd/23/

Given an even distribution of hashes, with a million records, one should
have 256 directories at the top level, 256 at the second level, and 15
or 16 files under each of those.

That’s pretty easy to handle efficiently.
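
For illustration, here is a minimal Ruby sketch of that bucketing scheme
(hypothetical code, not Kirk's implementation; the bucket_path name and
the depth/width defaults are just labels for the example above):

    # Map a key to a bucketed path, assuming a bucket depth of 2 and
    # a bucket width of 2.
    require 'digest/sha2'

    def bucket_path(root, key, depth = 2, width = 2)
      hex     = Digest::SHA512.hexdigest(key)
      buckets = (0...depth).map { |i| hex[i * width, width] }
      File.join(root, *buckets, hex)
    end

    # For a key whose hash begins with dd232b..., this returns
    # "<root>/dd/23/dd232b...".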

Kirk H.

I was so interested in this idea that I cobbled up a quick test on a
workstation machine with a medium-bandwidth IO channel. 100,000 1K files
brought the shared-memory filesystem to its knees almost immediately. On
an oxide-based journalling filesystem, it seemed to do 10 to 20 thousand
files a second pretty consistently. That's a decent but not huge
fraction of the available channel bandwidth. Haven't tried a million
files. I'm still interested though.

On Aug 11, 2006, at 2:03 PM, [email protected] wrote:

I’m currently expanding the test cases for it and cleaning up some
bugs, and will go ahead and release it as a standalone package this
weekend. Maybe it will help you.

Oh, this sounds very cool! Even if it doesn’t help with this
particular situation, I have a pretty good idea where it will come in
handy.

Looking forward to it.

Cheers,
Bob

Kirk H.


Bob H. – blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. – http://www.recursive.ca/
Raconteur – http://www.raconteur.info/
xampl for Ruby – http://rubyforge.org/projects/xampl/

On 8/11/06, Bob H. [email protected] wrote:

External fragmentation can be
a problem.

Clearly a directory hierarchy is the only way to make this work, and
Kirk's idea of partitioning the space by segmenting the hash seems as
reasonable as any. So far I'm very surprised to find that performance
may be reasonable.

But what do you mean by “external fragmentation”? If you're thinking of
NFS, I'd rather cut myself than attempt something like this on NFS,
even though Kirk says it's supported.

On Aug 11, 2006, at 7:00 PM, Francis C. wrote:

interested though.
In the Java thing we wrote we had hundreds of thousands of files in a
directory, nothing shared though. No problems until you did an ‘ls’ by
mistake :slight_smile: Backups are also a problem. External fragmentation can be
a problem. I know we experimented with the kind of thing that I think
Kirk was suggesting. By building a hierarchy you can keep the numbers
of files/directories down, and this avoids the backup (and ‘ls’)
problems.

Cheers,
Bob


Bob H. – blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. – http://www.recursive.ca/
Raconteur – http://www.raconteur.info/
xampl for Ruby – http://rubyforge.org/projects/xampl/

[email protected] writes:

I’m working on creating a second version that strips out all of the
LRU cache code to provide a simple, fast data persistence
implementation without any of that extra overhead. I will soon be
better able to report on just how much of a performance hit that
overhead brings with it.

(How) do you handle object references?

On 8/12/06, Bob H. [email protected] wrote:

collection, and so on).

Cheers,
Bob

That’s a pretty constant tradeoff, speed for space, and it shows up in
many different ways. I’m convinced that’s why there’s no single answer
to this problem: every application will need different optimizations.
However, this is somewhat less true than it once was, for two reasons:
journaling filesystems, and the fact that disk storage is still getting
cheaper every year at a rapid rate (it’s the only element in the
computing chain that still is). So my inclination (subject to change
depending on the particular case) is to waste space if it will save
time (development time and/or runtime).

On Aug 11, 2006, at 8:12 PM, Francis C. wrote:

any. So far I’m very surprised to find that performance may be
reasonable.

So was I, but it actually is quite good (I’ll publish some
unscientific benchmarks in the next couple of days).

But what do you mean by “external fragmentation”? If you’re
thinking of NFS,
I’d rather cut myself than attempt something like this on NFS, even
though
Kirk says it’s supported.

Sorry, jargon from 3 decades ago. External fragmentation, in this
case, is the disk space wasted due to small files being put into a
larger segment (e.g. a 1k file in a 4k disk segment is a 3k loss of
space). Things like Perst, qdbm, Purple, and
SQLite-as-persistent-store will avoid almost all external
fragmentation, at the cost of internal fragmentation. In the case of
the file system, there isn’t any space wasted within the 1k file,
but the persistent stores all have this problem (and you’ll see them
talking about free space management, best-fit, first-fit, garbage
collection, and so on).

Cheers,
Bob


Bob H. – blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. – http://www.recursive.ca/
Raconteur – http://www.raconteur.info/
xampl for Ruby – http://rubyforge.org/projects/xampl/

On 8/9/06, Bob H. [email protected] wrote:

These kinds of stores are normally implemented as persistent hashes
or BTrees (or both). I know that Sleepycat’s Berkeley DB <http://
www.sleepycat.com/> does this, and I’ve used Java based systems that
do this. I also know of some C based things but they don’t have Ruby
wrappers. I can’t find anything in the Ruby world, and I don’t care
too much if it isn’t pure Ruby.

SQLite has full Ruby support and is free.

http://sqlite-ruby.rubyforge.org/
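
For reference, a minimal sketch of the SQLite route (my own
illustration, assuming the sqlite3 bindings; the table name and schema
are not from the thread):

    # A tiny key/value table, with a transaction around a batch of
    # writes so the batch is all-or-nothing.
    require 'sqlite3'

    db = SQLite3::Database.new('store.db')
    db.execute('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v TEXT)')

    db.transaction do
      100.times do |i|
        db.execute('INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)',
                   ["key-#{i}", "value #{i}"])
      end
    end

    puts db.get_first_value('SELECT v FROM kv WHERE k = ?', 'key-42')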

On Sun, 13 Aug 2006, Christian N. wrote:

(How) do you handle object references?

It uses Marshal.
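
A hypothetical sketch of what that can look like (not the IOWA code):
Marshal serializes the whole object graph reachable from a value, so
references between objects inside one stored value survive the round
trip, while references to objects stored under other keys do not,
because each key is marshalled independently.

    # Hypothetical Marshal-backed store/fetch; dir is an existing
    # cache directory, and the key is hashed into a filename.
    require 'digest/sha2'

    def store(dir, key, value)
      File.open(File.join(dir, Digest::SHA512.hexdigest(key)), 'wb') do |f|
        f.write(Marshal.dump(value))     # dumps the whole object graph
      end
    end

    def fetch(dir, key)
      path = File.join(dir, Digest::SHA512.hexdigest(key))
      Marshal.load(File.binread(path)) if File.exist?(path)
    end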

Kirk H.

From: “Francis C.” [email protected]

I was so interested in this idea that I cobbled up a quick test on a
workstation machine with a medium-bandwidth IO channel. 100,000 1K files
brought the shared-memory filesystem to its knees almost immediately. On an
oxide-based journalling filesystem, it seemed to do 10 to 20 thousand files
a second pretty consistently. That’s a decent but not huge fraction of the
available channel bandwidth. Haven’t tried a million files. I’m still
interested though.

Hi,

What is being measured? Access time for files that already exist?
Creation of new files? Scanning the directory structure for a list
of existing files?

At a prior gig, we used to split a couple hundred thousand
encyclopedia articles up into a 12/34/56.xxx sort of format. It worked
adequately for our needs (our batch-oriented processing was expected
to run overnight anyway), but my impression was that as long as the
filename was known, accessing file 12/34/56.xxx seemed quick,
whereas directory scans to enumerate the existing filenames were
pretty slow.

(This was on SunOS/Solaris systems circa 2000… I don’t know if
the filesystem was journalled or not.)

Regards,

Bill

On 8/12/06, Bill K. [email protected] wrote:

filename was known, accessing file 12/34/56.xxx seemed quick,
whereas directory scans to enumerate the existing filenames were
pretty slow.

I wanted to put a rough lower bound on the performance of the
approach, just to decide if it’s worth pursuing at all. (Kirk
obviously thinks it is, but I don’t know if the class of applications
that interests him is anything like the class that interests me.) So I
measured creation of 1k files and re-writing of 1k files. (The former
involves creating an inode and data pages; the latter involves
touching an inode and creating data pages.) I didn't test reads of the
files (which would involve touching an inode) because I didn’t feel
like optimizing the test (by remounting my filesystem with the
inode-touch for last-access turned off). The test box was a
medium-powered Linux workstation with a 2.6 kernel, a single SATA
drive, and ext3 filesystems.
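
For concreteness, a rough sketch of that kind of test (not the actual
script; the file count and directory are arbitrary):

    # Time creating N 1K files, then re-writing the same N files.
    require 'benchmark'
    require 'fileutils'

    N    = 10_000
    DIR  = ARGV[0] || 'bench_files'   # e.g. a path on /dev/shm or on disk
    DATA = 'x' * 1024

    FileUtils.mkdir_p(DIR)

    %w[create rewrite].each do |label|
      t = Benchmark.realtime do
        N.times do |i|
          File.open(File.join(DIR, "f#{i}"), 'wb') { |f| f.write(DATA) }
        end
      end
      puts format('%s %d 1K files: %.2fs', label, N, t)
    end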

I’d expect under normal conditions to get maybe 15 megabytes per
second of disk-write bandwidth from this system, although with tuning
I could probably get a lot more. But again, I was going for a smell
test here. This whole approach is attractive because it’s easy to
code, so I’d want to use it for light-duty applications on non-tuned
hardware. For an application with more stringent requirements, I’d
make a different tradeoff in development time and probably use a
different approach.

Anyway, I first tried it on /dev/shm. That worked really nicely for
10,000 files: about 0.9 seconds consistently to create new files
and 0.6 seconds to re-write them. The same test with 100,000 files
totally de-stabilized the machine. I didn’t want to reboot it so I
waited. Fifteen minutes later it was back. But what a strange journey
that must have been. Obviously this approach doesn’t make a lot of
sense on a shm device anyway, but I had to know.

With a disk-based filesystem, things were a lot better. For 10,000
files, about 1.6 seconds to create them and 1.2 to rewrite them. Those
numbers were consistent across many trials. Similar results at 30,000
and 60,000 files, just scale upward. At 100,000 files things got
screwy. The create-time got variable, ranging from 5 seconds to almost
15 seconds from run to run. During all the runs, the machine didn’t
appear to destabilize and remained responsive. Obviously processor
loads were very low. I didn’t make a thorough study of page faults and
swapping activity though. But notice that the implied throughput is an
interestingly high fraction of my notional channel bandwidth of 15
megabytes/sec. And the journalling FS means that I don’t even think
about the movement of the R/W head inside the disk drive anymore. (Of
course that may matter a great deal on Windows, but if I’m using
Windows then everything about the project costs vastly more anyway,
so who cares?)

So I’m claiming without further study (and without trying to explain
the results) that the lower bound on performance is in the region of
5000 writes a second. That’s just at the fuzzy edge of being worth
doing. I usually think in terms of a “budget” for any operation that
must be repeated for a continuously-running server application. In
general, I want to be able to do a bare minimum of 1000 “useful
things” per second on a sustained basis, on untuned hardware. (“Useful
things” might be dynamic web-pages generated, or guaranteed-delivery
messages processed, etc.) So this approach uses nearly 20% of my
budget. It’s a big number. (Just to show how I apply this kind of
analysis: I never worry about adding a SHA-1 hash calculation to any
critical code path, because I know I can do 250,000 of those per
second without breaking a sweat.)

What about writing a thin C extension for interfacing with Perst?
Going back and forth from C to Java may be tricky, though… But,
then, what about any of the non-Java or -Ruby options? Just write a
quick C extension/wrapper for Ruby, or use SWIG to do it for you.

M.T.

On 8/12/06, Matt T. [email protected] wrote:

What about writing a thin C extension for interfacing with Perst?
Going back and forth from C to Java may be tricky, though… But,
then, what about any of the non-Java or -Ruby options? Just write a
quick C extension/wrapper for Ruby, or use SWIG to do it for you.

M.T.

For that matter, you could just try using JRuby, which would let you
call Java code directly. www.jruby.org

On Aug 12, 2006, at 10:40 PM, Matt T. wrote:

What about writing a thin C extension for interfacing with Perst?
Going back and forth from C to Java may be tricky, though… But,
then, what about any of the non-Java or -Ruby options? Just write a
quick C extension/wrapper for Ruby, or use SWIG to do it for you.

M.T.

That’s a good point. I don’t think I want to involve Java in this if
I can help it. The fellow who wrote Perst, Konstantin Knizhnik
(http://www.garret.ru/~knizhnik/databases.html), has also written
GOODS, FastDB, and GigaBASE. If I were going to write a C extension
I’d go with one of those. Konstantin has also written something
called DyBASE which is specifically for dynamic languages, like Ruby
and Python, and comes with bindings for Ruby 1.6.x. I’ve asked
Konstantin about the state of DyBASE and am trying to work out
whether it is worth updating to Ruby 1.8.4.

Cheers,
Bob


Bob H. – blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. – http://www.recursive.ca/
Raconteur – http://www.raconteur.info/
xampl for Ruby – http://rubyforge.org/projects/xampl/

Hi,

On Aug 12, 2006, at 8:27 AM, Francis C. wrote:

year at a rapid rate- it’s the only element in the computing chain
that
still is. So my inclination (subject to change depending on the
particular
case) is to waste space if it will save time (development time and/or
runtime).

I absolutely agree, which is why I’ve implemented the filesystem route
first. There is no performance problem with it. The problem is that
I’ve got to write a bunch of files out, and if for some reason even
one file failed to write I’d have a consistency problem. There are
easy ways to mitigate these risks, but they can take an excruciatingly
long time to run. Moving to a transactional store (like Perst in
Java, or SQLite in Ruby) gives me the guarantees of consistency that
I’m looking for, but SQLite in Ruby is 3-5 times slower than going
direct to the filesystem on OS X with journaling on (unless there’s
something wrong in my testing… silly benchmarks forthcoming). In
Java, Perst is much faster than the filesystem.

Now, I’ve thought of implementing the same kind of techniques that
some persistent stores use to assure consistency, but directly in the
filesystem. The only quick-enough techniques that I can think of
would require a recovery step after failure, but that shouldn’t be too
big a problem (especially since the frequency of failure will be low,
judging from my current experience, where I’ve not had any failures yet).
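
One hypothetical shape for that (an illustration, not Bob's design):
stage a batch of files under a temporary directory, fsync them, and
rename the directory into place as the commit step. rename(2) is atomic
within a filesystem, so a crash mid-batch only leaves a stale *.tmp
directory for a recovery pass to sweep.

    require 'fileutils'

    # files is a hash of { "name" => "data" }; batch_id names the batch.
    def commit_batch(store_dir, batch_id, files)
      tmp = File.join(store_dir, "#{batch_id}.tmp")
      FileUtils.mkdir_p(tmp)
      files.each do |name, data|
        File.open(File.join(tmp, name), 'wb') { |f| f.write(data); f.fsync }
      end
      File.rename(tmp, File.join(store_dir, batch_id.to_s))  # commit point
    end

    # Recovery after a crash: anything still named *.tmp was never committed.
    def recover(store_dir)
      Dir.glob(File.join(store_dir, '*.tmp')).each { |d| FileUtils.rm_rf(d) }
    end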


Bob H. – blogs at <http://www.recursive.ca/hutch/>
Recursive Design Inc. – http://www.recursive.ca/
Raconteur – http://www.raconteur.info/
xampl for Ruby – http://rubyforge.org/projects/xampl/

On 8/13/06, Bob H. [email protected] wrote:

something wrong in my testing… silly benchmarks forthcoming). In
Java, Perst is much faster than the filesystem.

Ok, did some reading up on Perst: no real magic there. They simply put
together a lot of the good techniques you would expect to find in a
virtual arena manager, and they made some guesses about the handful of
tough questions you have to answer (allocation-grain size, page size,
etc.). I used the same kind of bitmap strategy for the persistent
GD-message system I mentioned upthread, but Perst makes a different
choice: they sequentially search the bitmap pages for a big-enough
free space when storing an object. Even though they use the obvious
optimization (storing a max-available-size on each page), I remember
being phobic about that approach and did something much more elaborate
(but reliably O(1)). But given your comments about Perst's speed, they
have obviously made the right choices overall!
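
For what it's worth, a toy Ruby sketch of that allocation idea as I
read it (my illustration, not Perst's code): each bitmap page keeps its
longest free run, so the allocator can skip pages that cannot possibly
satisfy a request and only scans bits on candidate pages.

    class BitmapAllocator
      PAGE_BITS = 1024                  # grains tracked per bitmap page

      Page = Struct.new(:bits, :max_run)

      def initialize(pages)
        @pages = Array.new(pages) { Page.new(Array.new(PAGE_BITS, false), PAGE_BITS) }
      end

      # Returns the global grain index of a free run of n grains, or nil.
      def allocate(n)
        @pages.each_with_index do |page, pi|
          next if page.max_run < n      # the max-available-size shortcut
          start = find_run(page.bits, n)
          next unless start
          n.times { |i| page.bits[start + i] = true }
          page.max_run = longest_free_run(page.bits)
          return pi * PAGE_BITS + start
        end
        nil
      end

      private

      def find_run(bits, n)             # first run of n free grains
        run = 0
        bits.each_with_index do |used, i|
          run = used ? 0 : run + 1
          return i - n + 1 if run == n
        end
        nil
      end

      def longest_free_run(bits)
        best = run = 0
        bits.each { |used| run = used ? 0 : run + 1; best = run if run > best }
        best
      end
    end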

I think something like Perst could be done in Ruby, possibly augmented
with a tiny C extension to access virtual-memory features.

On 8/13/06, Francis C. [email protected] wrote:

to run overnight anyway–but my impression was that as long as the
case involves creating an inode and data pages, the second involves
touching an inode and creating data pages.) I didn’t test reads of the
files (which would involve touching an inode) because I didn’t feel
like optimizing the test (by remounting my filesystem with the
inode-touch for last-access turned off). The test box was a
medium-powered Linux workstation with a 2.6 kernel, a single SATA
drive, and ext3 filesystems.

You are rebuilding many of the features of git.
http://git.or.cz/index.html

Git is the SCM designed by Linus to replace BitKeeper. C source code
is freely available.

The git file store hashes files into an array of 256 directories and
keeps them initially as small files. Later on you can run a pass which
transparently converts them into single-file packs. A pack may hold
many thousands of files that can be accessed via their hash names. Packs
recover the space lost to block fragmentation. Git directory objects
are used to map hash names to human-readable strings.

When packs are created there is code that searches for similarities
between files in the pack. If similar files are found they are stored
as deltas from the original file. All objects stored in the system,
files or packs, are zlib compressed.

It wouldn’t be a lot of trouble to take the indexing code out of
cLucene and adjust it to use the git file store.

Transactions are achieved by first adding all of your files to the
file store; then a single ‘commit’ makes them visible. If you
never do the commit, there are tools for pruning dangling objects out
of the db. But the dangling objects aren’t visible, so it’s just a
cleanup step.
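
A minimal way to picture that pattern (my sketch, not Jon's setup):
files that have been written and added with 'git add' are already in
the object store, but nothing becomes visible on the branch until the
single commit at the end succeeds.

    require 'fileutils'

    REPO = 'file_store'
    FileUtils.mkdir_p(REPO)

    Dir.chdir(REPO) do
      system('git', 'init', '-q') unless File.directory?('.git')

      100.times do |i|
        File.write("item-#{i}", "payload #{i}")
        system('git', 'add', "item-#{i}")   # stored as an object, not yet visible
      end

      # The single "make it visible" step; if it never runs, git gc/prune
      # can sweep the dangling objects later.
      system('git', 'commit', '-q', '-m', 'add batch')
    end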

Git is an archival system: objects in the database cannot be changed;
they can only be superseded by a later version. Cryptographic hashes
will immediately detect if something in the db has been altered.

I have multi-gigabyte git file stores containing millions of objects.
Performance is pretty constant until your working set won’t fit into
memory. The killer feature of these file stores is support for
distributed replication, with multiple people adding objects to their
distributed copies.

On 8/13/06, Jon S. [email protected] wrote:

You are rebuilding many of the features of git.
http://git.or.cz/index.html

Write-once/read-many is the optimization profile for archival storage,
but not for a persistent session cache. You’re proving yet again that
there are almost as many valid and useful approaches to this problem
as there are applications.