Dealing with large chunks of data

In a site I’m working on, I have to generate statistics on demand.

This requires going through substantially all of the three tables in the
database, with about 1,500 records in table A owning about 12,000
records in tables B and C. This is after the site’s been up about three
weeks, so I know it will only get worse.

There’s no issue with memory (it’s a dedicated server), and the database
operation itself is fast enough (0.2 seconds or so), but the number
crunching is killing me, presently taking about 16 seconds to iterate
through all the data and produce the hash that eventually gets displayed.

I’ve gotten to the point where I’m thinking about rewriting the routine
in Perl or even C. I’d much prefer to avoid that.

I would greatly appreciate suggestions from anybody who’s confronted a
similar situation.

Thanks,

–Al Evans

On Nov 14, 2006, at 7:41 PM, Al Evans wrote:

…the number crunching is killing me, presently taking about 16 seconds
to iterate through all the data and produce the hash that eventually
gets displayed.

I’d suspect that the “data” you’re iterating through is in the form
of ActiveRecord models, yes? If so, you might get a big win if you
can avoid that overhead completely by dropping into SQL and using
something like select_all [1] (to get hashes back rather than model
objects).
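
For example, something along these lines (the table and column names
here, b_records, a_id and value, are placeholders, not your actual
schema) skips building a model object per row:

  # Fetch plain hashes instead of instantiating an ActiveRecord
  # object for each of the ~12,000 detail rows.
  rows = ActiveRecord::Base.connection.select_all(
    "SELECT a_id, value FROM b_records")

  totals = Hash.new(0)
  rows.each { |row| totals[row["a_id"]] += row["value"].to_i }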

Perhaps a bit of code would let us help you better.

-Rob

[1] DatabaseStatements.html#M004819 (the select_all entry in the
ActiveRecord API documentation)

Rob B. http://agileconsultingllc.com
[email protected]

Al Evans wrote:

…the number crunching is killing me, presently taking about 16 seconds
to iterate through all the data and produce the hash that eventually
gets displayed.

I’ve gotten to the point where I’m thinking about rewriting the routine
in Perl or even C. I’d much prefer to avoid that.

I would greatly appreciate suggestions from anybody who’s confronted a
similar situation.

Can you store precomputed stats in the database, updating them as
each piece of data is added?
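
For example, something like this (the Stat model and the column names
are placeholders) keeps a running count and total current at write time:

  # Bump the precomputed per-A totals whenever a B record is created,
  # so the stats page never has to iterate the detail rows.
  class BRecord < ActiveRecord::Base
    belongs_to :a_record

    after_create :update_precomputed_stats

    private

    def update_precomputed_stats
      Stat.update_all(
        ["record_count = record_count + 1, value_total = value_total + ?", value],
        ["a_record_id = ?", a_record_id])
    end
  end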

Or can you automatically calculate and save the stats once a day,
and update these on demand using only the data added since?
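
A rough sketch of that variant (again with placeholder names, and
assuming a created_at column on the detail table):

  # Take a snapshot once a day (e.g. via cron and script/runner),
  # then answer queries as snapshot total + delta since the snapshot.
  class StatSnapshot < ActiveRecord::Base
    def self.take
      create(:value_total => BRecord.sum(:value), :taken_at => Time.now)
    end

    # Assumes at least one snapshot has already been taken.
    def self.current_total
      last = find(:first, :order => "taken_at DESC")
      delta = BRecord.sum(:value,
                :conditions => ["created_at > ?", last.taken_at])
      last.value_total + delta
    end
  end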


We develop, watch us RoR, in numbers too big to ignore.

Calle D. wrote:

On 11/15/06, Al Evans [email protected] wrote:

I’ve gotten to the point where I’m thinking about rewriting the routine
in Perl or even C. I’d much prefer to avoid that.

Before you do that, try doing as much of the calculating as possible
in SQL. I had a similar problem to yours recently, and brought the
statistics generation time down from 45 seconds to 0.6 by letting the
database do the computation.

Thanks! This is what I did. The code is uglier now, but it works almost
10 times as fast.
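
Something along these lines (table and column names are placeholders)
gives the general idea: one GROUP BY query in place of the per-row
Ruby loop:

  # Let the database aggregate; Ruby only reshapes the result rows.
  rows = ActiveRecord::Base.connection.select_all(<<-SQL)
    SELECT a.id         AS a_id,
           COUNT(b.id)  AS b_count,
           SUM(b.value) AS value_total
    FROM a_records a
    LEFT JOIN b_records b ON b.a_record_id = a.id
    GROUP BY a.id
  SQL

  stats = {}
  rows.each do |row|
    stats[row["a_id"]] = { :count => row["b_count"].to_i,
                           :total => row["value_total"].to_i }
  end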

–Al Evans
