Eric H. wrote:
On Jul 26, 2008, at 19:58, Greg W. wrote:
will be a problem (currently it can take several seconds to restore
some marshaled data – way too long). I want as direct a restore as
possible – no need to “reconstruct” data piece by piece, which Ruby
seems to be doing now. It takes < 250ms to load an 11MB raw data file
via readlines, and 2 seconds to load a 9MB Marshal file.
Ruby is going to need to call allocate for each object in order to
register with the GC and build the proper object graph. I doubt
there’s a way around this without extensive modification to ruby.
Hmm, makes sense of course – I was just hoping someone had a clever
replacement. I’ll just have to write code that minimizes the frequency
of re-loads.
If you’re curious about the back story, I’ve explained it more below.
readlines? not read? readlines should be used for text, not binary
data. Also, supplying an IO to Marshal.load instead of a pre-read
String adds about 30% overhead for constant calls to getc.
Wasn’t using it on binary data – I was just noting that an 11MB tab
file (about 45,000 lines) took all of 250ms (actually 90ms on my server
drives) to read into an array using readlines, whereas loading a
marshaled version of that same data (reorganized and saved as an array
of hashes) from a file that happened to be 9MB took almost 2 seconds –
so there’s clearly a lot of overhead in restoring a marshaled object.
That was my point.
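To put those numbers in context, here’s roughly the kind of comparison
I ran, folded together with Eric’s point about handing Marshal.load a
pre-read String rather than an IO. The file names are made up; the real
files are a tab source and its Marshal dump.

require 'benchmark'

tab_path  = 'source_0001.tab'       # ~11MB, ~45,000 tab-delimited lines (name made up)
dump_path = 'source_0001.marshal'   # ~9MB Marshal dump of the reorganized data (name made up)

Benchmark.bm(20) do |bm|
  # Plain text: one pass to read and split into lines -- mostly IO-bound.
  bm.report('readlines:')           { File.readlines(tab_path) }
  # Marshal has to allocate every object in the graph and wire it up,
  # which is where the extra time goes.
  bm.report('Marshal from String:') { Marshal.load(File.read(dump_path)) }
  # Handing Marshal an IO makes it pull bytes from the stream as it
  # parses, adding per-read overhead on top of that.
  bm.report('Marshal from IO:')     { File.open(dump_path, 'rb') { |io| Marshal.load(io) } }
end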
9MB seems like a lot of data to load, how many objects are in the
dump? Do you really need to load a set of objects that large?
Yes, and that’s not the largest, but it’s about average. The range is
1 MB to 30 MB of raw data per file. A few are 100+ MB, and one is 360 MB
on its own, but that’s an exception.
This is a data aggregation framework. One generic framework will run as
multiple app-specific instances, where each application has a data set
of 4-8GB of raw text data (from 200-400 files). That raw data is loaded
and reorganized into standardized structures, and one or more indexes
are generated per original file.
One application instance runs per server. The server is used as a
workgroup intranet web server by day (along with its redundant twin),
and as an aggregator by night.
That 9MB Marshaled file is the result of one data source of 45,000 lines
being re-arranged, each data element cleansed and transformed, and then
stored as an array of hashes. An index is stored as a separate Marshaled
file so it can be loaded independently.
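Schematically, the per-file processing ends up looking something like
this – the field names and file names below are just placeholders, and
the real cleansing/transforming is far more involved:

# Read the raw tab file, reorganize it into an array of hashes, build an
# index on the lookup key, and dump the data and the index to separate
# Marshal files so the index can be loaded without the data.
rows = File.readlines('source_0001.tab').map { |line| line.chomp.split("\t") }

records = rows.map do |fields|
  { :id => fields[0], :name => fields[1].to_s.strip, :value => fields[2].to_f }
end

index = {}
records.each_with_index { |rec, i| index[rec[:id]] = i }

File.open('source_0001.data.marshal',  'wb') { |f| Marshal.dump(records, f) }
File.open('source_0001.index.marshal', 'wb') { |f| Marshal.dump(index, f) }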
Those 300 or so original files, having been processed and indexed, are
now searched and combined in a complex aggregation (sadly, not just
simple mergers) which nets a couple dozen tab files for LOAD DATA into a
database for the web app.
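The output end is nothing exotic: each final table just gets written as
tab-delimited lines that LOAD DATA INFILE can take directly. The table
and columns here are invented for illustration.

# Write one aggregated result table as a tab file for MySQL's LOAD DATA.
aggregated_rows = [
  [1, 'widget', 9.95],
  [2, 'gadget', 14.50]
]

File.open('products.tab', 'w') do |f|
  aggregated_rows.each { |cols| f.puts cols.join("\t") }
end

# Then, on the MySQL side:  LOAD DATA INFILE 'products.tab' INTO TABLE products;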
Based on a first version of this animal, spikes on faster hardware, and
accounting for new algorithms and growth in data sizes, this process
will take several hours on a new Intel server even with everything
loaded into RAM. And that’s before we start adding a number of new
tasks to the application.
Of course, we’re looking at ways to split the processing to take
advantage of multiple cores, but that just adds more demand on memory
(DRb is way too slow, by a couple orders of magnitude, to consider using
as a common “memory” space for all cores).
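To be concrete about what I mean by DRb as a common “memory” space, the
shape of what we tested was essentially this (the URI and data are
placeholders) – and it’s the per-lookup round trip that kills it:

require 'drb/drb'

SERVER_URI = 'druby://localhost:8787'   # placeholder port

if ARGV.first == 'server'
  # The would-be "common memory" process: expose one loaded data set.
  shared = { 'k1' => 'v1' }   # stand-in for a real Marshal-loaded hash
  DRb.start_service(SERVER_URI, shared)
  DRb.thread.join
else
  # A worker process: every [] call on the proxy is a network round trip
  # plus a marshal/unmarshal of the arguments and result, which is why
  # millions of these lookups end up orders of magnitude slower than
  # reading a Hash that lives in the same process.
  DRb.start_service
  shared = DRbObject.new_with_uri(SERVER_URI)
  puts shared['k1']
end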
The aggregation is complex enough that in a perfect world I’d have the
entire data set in RAM all at once, because any one final data table
pulls its data from numerous sources, falling back to alternate sources
on a field-by-field basis if the primary doesn’t have a value: field1
comes from sourceX, field2 from sourceA (or from sourceB if A doesn’t
have it). It gets hairy.
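In code, the per-field fallback amounts to something like this – the
source and field names are invented for the example:

# Build one output row by pulling each field from its preferred source,
# falling back to an alternate source when the primary doesn't have it.
# source_x / source_a / source_b stand in for loaded, indexed data sets.
source_x = { 'part-42' => { :field1 => 'from X' } }
source_a = { 'part-42' => { :field2 => nil } }
source_b = { 'part-42' => { :field2 => 'from B' } }

key = 'part-42'
row = {
  :field1 => source_x[key] && source_x[key][:field1],
  :field2 => (source_a[key] && source_a[key][:field2]) ||
             (source_b[key] && source_b[key][:field2])
}
# => { :field1 => 'from X', :field2 => 'from B' }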
Unlike a massive web application, where any one transaction can take as
long as 1 or even 2 seconds to complete and you throw more machines at
it to handle an increase in requests, this is a task trying to get tens
of millions of field transformations and millions of hash reads
completed linearly, as quickly as possible. So the overhead of DRb and
similar approaches just isn’t good enough.
David M. wrote:
- I need to swap data sets often enough that performance
will be a problem (currently it can take several seconds to restore
some marshaled data – way too long)
Why do you need to do this yourself?
As a test, I took that one 9MB sample file mentioned above and loaded
it as 6 unique objects to see how long that would take and how much RAM
would get used – Ruby ballooned to 500MB of RAM. In theory I would like
to have every one of those 300 files in memory, but logistically I can
easily get away with 50 to 100 at once. But if Ruby is going to balloon
that massively, I won’t even get close to 50 such data sets in RAM at
once. So, I “need” to be able to swap data sets in & out of RAM as
needed (hopefully with an algorithm that minimizes the swapping by
processing batches which all reference the same loaded data sets).
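What I have in mind for the swapping is a small cache along these lines
– a rough sketch only, with the cap, eviction policy, and file layout as
placeholders:

# Keep at most max_sets marshaled data sets in RAM; load on demand and
# drop the least recently used one when the cap is hit, so batches that
# reference the same sources don't trigger constant re-loads.
class DataSetCache
  def initialize(dir, max_sets = 50)
    @dir, @max_sets = dir, max_sets
    @sets  = {}   # name => loaded object
    @order = []   # least recently used first
  end

  def [](name)
    if @sets.has_key?(name)
      @order.delete(name)
    else
      evict while @sets.size >= @max_sets
      path = File.join(@dir, "#{name}.marshal")
      @sets[name] = File.open(path, 'rb') { |f| Marshal.load(f) }
    end
    @order << name
    @sets[name]
  end

  private

  def evict
    @sets.delete(@order.shift)
  end
end

cache   = DataSetCache.new('/data/processed', 50)   # path is hypothetical
records = cache['source_0001']                      # loads from disk on first use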
- the scaling is such that more RAM per box is costly enough to pay for
development of a more RAM efficient design
What about more swap per box? It might be slower, maybe not, but it seems
like the easiest thing to try.
More “swap”? You mean virtual memory? I may be wrong, but I’m assuming
that regardless of how effective VM is, I can easily saturate real RAM,
and it’s been my experience that systems just don’t like having all of
their real RAM full.
Unless there’s some Ruby command to tell it to specifically push objects
into the OS’s VM, I think I’m stuck having to manage RAM consumption on
my own. ??
Another possibility would be to use something like ActiveRecord –
Using the db, especially through AR, would be glacial. We have a
db-based process now and need something faster.
– gw