Faster Marshaling?

I’m exploring options, wondering if there’s anything that can replace
marshaling that’s similar in usage (dump & load to/from a disk file), but
faster than the native implementation in Ruby 1.8.6.

I can explain some details if necessary, but in short:

  • I need to marshal,

  • I need to swap data sets often enough that performance
    will be a problem (currently it can take several seconds to restore
    some marshaled data – way too long)

  • the scaling is such that more RAM per box is costly enough to pay for
    development of a more RAM efficient design

  • faster Marshaling performance is worth asking about, to see how much
    it’ll get me.

I’m hoping there’s something that’s as close to a memory-space dump &
restore as possible – no need to “reconstruct” data piece by piece,
which Ruby seems to be doing now. It takes < 250ms to load an 11MB raw
data file via readlines, and 2 seconds to load a 9MB Marshal file,
so clearly Ruby is busy rebuilding stuff rather than just pumping a RAM
block with a binary image.
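
For reference, the pattern I’m timing is nothing fancier than the
dump/load below (the data shape and file name are just placeholders):

  require 'benchmark'

  # Placeholder data, roughly the shape I'm using: an array of hashes.
  data = Array.new(45_000) { |i| { 'id' => i, 'value' => "row #{i}" } }

  # Dump once...
  File.open('data.marshal', 'wb') { |f| Marshal.dump(data, f) }

  # ...then time the reload, which is the part that hurts.
  time = Benchmark.measure do
    File.open('data.marshal', 'rb') { |f| Marshal.load(f) }
  end
  puts time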

TIA for any ideas.

– gw

On Saturday 26 July 2008 21:58:22 Greg W. wrote:

  • I need to swap data sets often enough that performance
    will be a problem (currently it can take several seconds to restore
    some marshaled data – way too long)

Why do you need to do this yourself?

  • the scaling is such that more RAM per box is costly enough to pay for
    development of a more RAM efficient design

What about more swap per box?

It might be slower, maybe not, but it seems like the easiest thing to
try.

Another possibility would be to use something like ActiveRecord – though
you probably want something much more lightweight (suggestions? I keep
forgetting what’s out there…) – after all, you probably aren’t operating
on the whole dataset at once, so what you really want is something
reasonably fast at loading/saving individual objects.
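
For instance, something as simple as one Marshal file per record might
do it. A rough sketch only; the class and the id-based file naming are
made up for illustration:

  # Rough sketch: one Marshal file per record, keyed by id, so you never
  # have to reload the whole data set at once. Assumes `dir` already exists.
  class ObjectStore
    def initialize(dir)
      @dir = dir
    end

    def save(id, obj)
      File.open(File.join(@dir, "#{id}.marshal"), 'wb') { |f| Marshal.dump(obj, f) }
    end

    def load(id)
      File.open(File.join(@dir, "#{id}.marshal"), 'rb') { |f| Marshal.load(f) }
    end
  end

  # store = ObjectStore.new('cache')
  # store.save(42, { 'part_no' => 'A100' })
  # record = store.load(42)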

On Jul 26, 2008, at 19:58 PM, Greg W. wrote:

…currently it can take several seconds to restore some marshaled data –
way too long…

I’m hoping there’s something that’s as close to a memory-space dump &
restore as possible – no need to “reconstruct” data piece by piece,
which Ruby seems to be doing now. It takes < 250ms to load an 11MB raw
data file via readlines, and 2 seconds to load a 9MB Marshal file,

readlines? not read? readlines should be used for text, not binary
data. Also, supplying an IO to Marshal.load instead of a pre-read
String adds about 30% overhead for constant calls to getc.
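
In other words, something along these lines (the file name is just an
example):

  # Read the whole dump into a String first, then hand that to Marshal.load,
  # instead of passing the open File and paying per-character IO costs.
  dump = File.open('data.marshal', 'rb') { |f| f.read }
  data = Marshal.load(dump)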

9MB seems like a lot of data to load, how many objects are in the
dump? Do you really need to load a set of objects that large?

so clearly Ruby is busy rebuilding stuff rather than just pumping a RAM
block with a binary image.

Ruby is going to need to call allocate for each object in order to
register with the GC and build the proper object graph. I doubt
there’s a way around this without extensive modification to ruby.

Eric H. wrote:

On Jul 26, 2008, at 19:58 PM, Greg W. wrote:

…currently it can take several seconds to restore some marshaled data –
way too long…

I’m hoping there’s something that’s as close to a memory-space dump &
restore as possible – no need to “reconstruct” data piece by piece,
which Ruby seems to be doing now. It takes < 250ms to load an 11MB raw
data file via readlines, and 2 seconds to load a 9MB Marshal file,

Ruby is going to need to call allocate for each object in order to
register with the GC and build the proper object graph. I doubt
there’s a way around this without extensive modification to ruby.

Hmm, makes sense of course; I was just hoping someone had a clever
replacement.

I’ll just have to try clever code that minimizes the frequency of
re-loads.

If you’re curious about the back story, I’ve explained it more below.

readlines? not read? readlines should be used for text, not binary
data. Also, supplying an IO to Marshal.load instead of a pre-read
String adds about 30% overhead for constant calls to getc.

Wasn’t using it on binary data – I was just making a note that an 11MB
tab file (about 45,000 lines) took all of 250ms (actually 90ms on my
server drives) to read into an array using readlines, whereas loading a
marshaled version of that same data (reorganized, and saved as an array
of hashes) from a file that happened to be 9MB took almost 2 seconds –
so there’s clearly a lot of overhead in restoring a marshaled object.
That was my point.

9MB seems like a lot of data to load, how many objects are in the
dump? Do you really need to load a set of objects that large?

Yes, and that’s not the largest, but it’s about average. The range is
1 MB to 30 MB of raw data per file. A few are 100+ MB; one is 360 MB on
its own, but it’s an exception.

This is a data aggregation framework. One generic framework will run as
multiple app-specific instances where each application has a data set of
4-8GB of raw text data (from 200-400 files). That raw data is loaded,
reorganized into standardized structures, and one or more indexes
generated per original file.

One application instance per server. The server is used as a workgroup
intranet web server by day (along with its redundant twin), and as an
aggregator by night.

That 9MB Marshaled file is the result of one data source of 45,000 lines
being re-arranged, each data element cleansed and transformed, and then
stored as an array of hashes. An index is stored as a separate Marshaled
file so it can be loaded independently.
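
Schematically, the per-source output looks something like this (file
names and fields are simplified/invented):

  # Schematic only: cleansed rows and their index go to separate Marshal
  # files so the index can be restored without touching the row data.
  rows  = [ { 'part_no' => 'A100', 'desc' => 'widget' } ]   # ~45,000 per source
  index = { 'A100' => 0 }                                   # key => position in rows

  File.open('source_042.data.marshal',  'wb') { |f| Marshal.dump(rows,  f) }
  File.open('source_042.index.marshal', 'wb') { |f| Marshal.dump(index, f) }

  # Later the index alone can be reloaded:
  idx = File.open('source_042.index.marshal', 'rb') { |f| Marshal.load(f) }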

Those 300 or so original files, having been processed and indexed, are
now searched and combined in a complex aggregation (sadly, not just
simple mergers) which nets a couple dozen tab files for LOAD DATA into a
database for the web app.

Based on a first version of this animal, spikes on faster hardware, and
accounting for new algorithms and growth in data sizes, this process
will take several hours on a new Intel server even with everything
loaded into RAM. And that’s before we start to add a number of new
tasks to the application.

Of course, we’re looking at ways to split the processing to take
advantage of multiple cores, but that just adds more demand on memory
(DRb is way too slow, by a couple orders of magnitude, to consider using
as a common “memory space” for all cores).

The aggregation is complex enough that in a perfect world, I’d have the
entire data set in RAM all at once, because any one final data table
pulls its data from numerous sources, and from alternate sources if the
primary doesn’t have it, on a field-by-field basis: field1 comes from
sourceX, field2 from sourceA, and from sourceB if A doesn’t have it. It
gets hairy :)
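
A toy sketch of that fallback logic (the field and source names are
invented):

  # Toy version of the field-by-field fallback: each output field lists the
  # sources to try, in order. `sources` maps source name => { key => record }.
  FIELD_SOURCES = {
    'field1' => ['sourceX'],
    'field2' => ['sourceA', 'sourceB']
  }

  def build_record(key, sources)
    record = {}
    FIELD_SOURCES.each do |field, names|
      names.each do |name|
        row = sources[name][key]
        if row && row[field]
          record[field] = row[field]
          break
        end
      end
    end
    record
  end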

Unlike a massive web application, where any one transaction can take as
long as 1 second or even 2 to complete and you throw more machines at it
to handle an increase in requests, this is a task trying to get tens of
millions of field transformations and millions of hash reads completed
linearly, as quickly as possible. So DRb and similar approaches carry
too much overhead to be good enough.

David M. wrote:

  • I need to swap data sets often enough that performance
    will be a problem (currently it can take several seconds to restore
    some marshaled data – way too long)

Why do you need to do this yourself?

As a test, I took that one 9MB sample file mentioned above and loaded
it as 6 unique objects to see how long that would take and how much RAM
would get used – Ruby ballooned into using 500MB of RAM. In theory I
would like to have every one of those 300 files in memory, but
logistically I can easily get away with 50 to 100 at once. But if Ruby
is going to balloon that massively, I won’t even get close to 50 such
data sets in RAM at once. So, I “need” to be able to swap data sets in &
out of RAM as needed (hopefully with an algorithm that minimizes the
swapping by processing batches which all reference the same loaded data
sets).
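
Something along these lines is what I have in mind; purely a sketch,
with the eviction policy and file layout as placeholders:

  # Sketch of the "swap data sets in & out of RAM" idea: keep at most
  # MAX_LOADED marshaled sets in memory and evict the least recently used.
  class DataSetCache
    MAX_LOADED = 50

    def initialize(dir)
      @dir    = dir
      @loaded = {}   # name => data
      @order  = []   # least recently used first
    end

    def fetch(name)
      if @loaded.key?(name)
        @order.delete(name)
      else
        evict if @loaded.size >= MAX_LOADED
        path = File.join(@dir, "#{name}.marshal")
        @loaded[name] = File.open(path, 'rb') { |f| Marshal.load(f) }
      end
      @order << name
      @loaded[name]
    end

    private

    def evict
      @loaded.delete(@order.shift)
    end
  end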

  • the scaling is such that more RAM per box is costly enough to pay for
    development of a more RAM efficient design

What about more swap per box? It might be slower, maybe not, but it seems
like the easiest thing to try.

More “swap”? You mean virtual memory? I may be wrong, but I am assuming
regardless of how effective VM is, I can easily saturate real RAM, and
it’s been my experience that systems just don’t like all of their real
RAM full.

Unless there are some Ruby commands to tell it to specifically push
objects into the OS’s VM, I think I’m stuck having to manage RAM
consumption on my own. ??

Another possibility would be to use something like ActiveRecord –

Using the db, especially through AR, would be glacial. We have a db-based
process now, and need something faster.

– gw

On Sunday 27 July 2008 00:07:10 Greg W. wrote:

  • the scaling is such that more RAM per box is costly enough to pay for
    development of a more RAM efficient design

What about more swap per box? It might be slower, maybe not, but it seems
like the easiest thing to try.

More “swap”? You mean virtual memory? I may be wrong, but I am assuming
regardless of how effective VM is, I can easily saturate real RAM, and
it’s been my experience that systems just don’t like all of their real
RAM full.

In general, yes. However, if this is all the system is doing, I’m
suggesting that it may be useful – assuming there isn’t something else
that makes this impractical, like garbage collection pulling everything
out of RAM to see if it can be collected. (I don’t know enough about how
Ruby garbage collection works to know if this is a problem.)

But then, given the sheer size problem you mentioned earlier, it
probably wouldn’t work well.

Another possibility would be to use something like ActiveRecord –

Using the db, especially through AR, would be glacial. We have a db-based
process now, and need something faster.

I specifically mean something already designed for this purpose – not
necessarily a traditional database. Something like berkdb, or “stone” (I
think that’s what it was called) – or splitting it into a bunch of
files, on a decent filesystem.

David M. wrote:

On Sunday 27 July 2008 00:07:10 Greg W. wrote:

Another possibility would be to use something like ActiveRecord –

Using the db, especially through AR, would be glacial. We have a db-based
process now, and need something faster.

I specifically mean something already designed for this purpose – not
necessarily a traditional database. Something like berkdb, or “stone” (I
think that’s what it was called) – or splitting it into a bunch of
files, on a decent filesystem.

Berkeley DB has been sucked up by Oracle, and I don’t think it ever ran
on OS X anyway.

We have talked about skipping Marshaling and going straight to standard
text files on disk and then using read commands that point to a specific
file line.

We haven’t spiked that yet, and I don’t see it being significantly
faster than using a local db (especially since the db cache might be
useful), but it’s something we’ll probably at least investigate just to
prove its comparative performance. It might be faster just because we
can keep all indexes in RAM, get some 15,000 rpm drives, and probably
implement some caching to reduce disk reads.

So, yeah, maybe that or even sqlite might be suitable if the RAM thing
just gets too obnoxious to solve. Something that would prove to be
faster than MySQL.

– gw

On Sunday 27 July 2008 00:33:56 Greg W. wrote:

We have talked about skipping Marshaling and going straight to standard
text files on disk and then using read commands that point to a specific
file line.

If the files aren’t changing, you probably want to seek to a specific
byte offset in the file, rather than a line – the latter requires you to
read through the entire file up to that line.
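
Roughly like this (the file name is just an example): build the offset
table once, then seek straight to the line you want.

  # Build a byte-offset index in one linear pass...
  offsets = []
  File.open('data.tab', 'rb') do |f|
    until f.eof?
      offsets << f.pos
      f.gets
    end
  end

  # ...then read any line directly by seeking to its recorded offset.
  def read_line_at(path, offset)
    File.open(path, 'rb') do |f|
      f.seek(offset)
      f.gets
    end
  end

  line = read_line_at('data.tab', offsets[100])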

We haven’t spiked that yet, but I don’t see it being significantly
faster than using a local db (especially since db cache might be
useful),

More useful than the FS cache?

So, yeah, maybe that or even sqlite might be suitable if the RAM thing
just gets too obnoxious to solve. Something that would prove to be
faster than MySQL.

For what it’s worth, ActiveRecord does work on SQLite. So does Sequel,
and I bet DataMapper does, too.

I mentioned BerkDB because I assumed it would be faster than SQLite –
but that’s a completely uninformed guess.

Greg W. wrote:

  • the scaling is such that more RAM per box is costly enough to pay for
    development of a more RAM efficient design

What about more swap per box? It might be slower, maybe not, but it seems
like the easiest thing to try.

More “swap”? You mean virtual memory? I may be wrong, but I am assuming
regardless of how effective VM is, I can easily saturate real RAM, and
it’s been my experience that systems just don’t like all of their real
RAM full.

More swap might help, if you assign one ruby process per data set. Then
switching data sets means just letting the vm swap in a different
process, if it needs to.

On 27.07.2008 07:44, David M. wrote:

On Sunday 27 July 2008 00:33:56 Greg W. wrote:

We have talked about skipping Marshaling and going straight to standard
text files on disk and then using read commands that point to a specific
file line.

If the files aren’t changing, you probably want to seek to a specific byte
offset in the file, rather than a line – the latter requires you to read
through the entire file up to that line.

Array#pack and String#unpack come to mind. But IMHO this is still
inferior to using a relational database because in the end it comes down
to reimplementing the same mechanisms that are present there already.
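
For instance (the record layout here is invented purely for
illustration):

  # Fixed-width binary records via pack/unpack: random access by record
  # number becomes a seek plus a single read.
  RECORD_FORMAT = 'Na16'   # 4-byte big-endian id + 16-byte name (null padded)
  RECORD_SIZE   = 20

  File.open('records.bin', 'wb') do |f|
    f.write([42, 'widget'].pack(RECORD_FORMAT))
    f.write([43, 'gadget'].pack(RECORD_FORMAT))
  end

  n = 1   # jump straight to record n without parsing anything before it
  id, name = File.open('records.bin', 'rb') do |f|
    f.seek(n * RECORD_SIZE)
    f.read(RECORD_SIZE).unpack(RECORD_FORMAT)
  end
  # name comes back null padded; strip with name.delete("\0") if needed.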

For what it’s worth, ActiveRecord does work on SQLite. So does Sequel, and I
bet DataMapper does, too.

But keep in mind that AR and the like introduce some overhead of
themselves. It might be faster to just use plain SQL to get at the
data.

But given the problem description I would definitely go for a
relational or other database system. There is no point in reinventing
the wheel (aka fast indexing of large data volumes on disk) yourself.
You might even check RAA for an implementation of B-trees.

Kind regards

robert

On Jul 27, 2008, at 12:33 AM, Greg W. wrote:

So, yeah, maybe that or even sqlite might be suitable if the RAM thing
just gets too obnoxious to solve.

I would be shocked if SQLite can’t be made to solve the problem well
with the right planning. That little database is always surprising
me. Don’t forget to look into the following two features as it sounds
like they may be helpful in this case:

  • In memory databases
  • Attaching multiple SQLite files to perform queries across them
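
For example, with the sqlite3 gem both look roughly like this (the table
and file names are invented):

  require 'sqlite3'

  # An in-memory database...
  db = SQLite3::Database.new(':memory:')

  # ...with per-source SQLite files attached so one query can span them.
  db.execute("ATTACH DATABASE 'source_001.db' AS src1")
  db.execute("ATTACH DATABASE 'source_002.db' AS src2")

  rows = db.execute(
    'SELECT a.part_no, b.price ' +
    'FROM src1.parts a JOIN src2.prices b ON a.part_no = b.part_no'
  )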

James Edward G. II

On 27.07.2008 12:21, Robert K. wrote:

But given the problem description I would definitely go for a
relational or other database system. There is no point in reinventing
the wheel (aka fast indexing of large data volumes on disk) yourself.
You might even check RAA for an implementation of B-trees.

Just after sending I remembered a thread in another newsgroup. The
problem sounds a bit related to yours and eventually the guy ended up
using CDB:

http://cr.yp.to/cdb.html

There’s even a Ruby binding:

http://raa.ruby-lang.org/project/cdb/

His summary is here, the problem description is at the beginning of the
thread:

http://groups.google.com/group/comp.unix.programmer/msg/420c2cef773f5188

Kind regards

robert

On Sun, Jul 27, 2008 at 11:32:35PM +0900, James G. wrote:

  • In memory databases
  • Attaching multiple SQLite files to perform queries across them

Yup, SQLite is a lovely piece of work. If you go that route, you might
give Amalgalite a try; it embeds sqlite3 inside a ruby extension using
the SQLite amalgamation source. I’d love to see if it can stand up to
the demands of your system.

enjoy,

-jeremy

Greg W. wrote:

Berkeley DB has been sucked up by Oracle, and I don’t think it ever ran
on OS X anyway.

OS X, eh? First, check the MacPorts project (http://macports.org) and
search for the db44 or db46 ports for your Berkeley DB. Second, even
though it’s in an early development stage, you might want to look into
MacRuby (http://ruby.macosforge.org). In MacRuby, every Ruby class is
also a subclass of NSObject, so in theory you should be able to use all
of the same NSData read/write operations as Cocoa objects. It may well
be that restoring MacRuby objects written out in this way doesn’t
currently work (haven’t had a chance to try it myself…yet), but in
that case, you could at least file a bug with the project.