Best way to go - query large database?


#1

Hi, looking for the best solution here. This is the situation: I have a
large database (at least 100,000 records). I have to find all the
records for a group of people from a certain date, and after that I
have to query and make calculations with a set of those records for
every member of that group.
Looping through every member of the group and calculating would take
very long, I’m afraid. I was wondering if it is possible to do the
first query and then loop (query) through the resulting array without
going to the database again; that is, to run a series of queries on a
(much smaller) in-memory array.
Is that possible? If so, how? Is that the best way then?
If it is not possible, what would be the fastest way to go?
Or is there another way to look at this problem?
Many thanks for your thoughts and suggestions.


#2

One possibility would be to use CREATE TEMPORARY TABLE and create your
smaller recordset that way.

Another way would be to use SQL to do the calculations on the fly, if
possible, using only the first query (a find_by_sql call).

Yet another way could be to pull all the data out with
joins/associations (assuming you have multiple tables here) and then
loop through it in code, performing the calculations that way without
going back to the database, since you can hopefully get all your data
out with one query.

The third one would probably be my choice, but I don’t know how the
data is stored.
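
A rough sketch of that third option in plain Ruby, with invented
sample rows standing in for the result of the single up-front query
(model and column names are assumptions, not from the original poster):

```ruby
# Invented sample data: each hash stands in for one record
# returned by the single up-front query.
rows = [
  { user_id: 1, points: 3 },
  { user_id: 1, points: 5 },
  { user_id: 2, points: 4 },
]

# Group once in memory and average per user, with no
# further trips to the database.
averages = rows.group_by { |r| r[:user_id] }.map do |user_id, recs|
  [user_id, recs.sum { |r| r[:points] }.to_f / recs.size]
end.to_h
# averages => {1 => 4.0, 2 => 4.0}
```

With ActiveRecord objects the same group_by/sum approach works
unchanged; only the accessors differ.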

Hopefully that could get you started.

Fredrik


#3

On Tuesday 16 January 2007 19:35, Rudy wrote:

Is that possible? If so, how?

You may be able to pull in all the required objects through eagerly
loaded associations.

Is that the best way then?

For more specific suggestions you need to put more specifics into your
question. It may even be sensible to offload everything you’re trying
to do to a single, probably complicated, SQL statement.

Michael


Michael S.
mailto:removed_email_address@domain.invalid
http://www.schuerig.de/michael/


#4

Let me try to describe the situation a bit more clearly: I have 20 users
who have all given points to each other. Each ‘answer/point’ is stored
in a record (= 20 x 20 = 400 records). What I need is to calculate the
average of the points for every user, and list them. (Because I have
maybe 100 users in total, who are giving points on a lot of occasions,
the database will be growing rapidly.) What I’m looking for is the
most stable (speed-wise) way to get the appropriate records and do the
calculations. I’d prefer to avoid a situation where results would
slowly start to take more time to show up as the database got bigger.
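
The arithmetic per user is trivial; the only real cost is how many
records get touched. With the numbers above (sample point values
invented for illustration):

```ruby
# 20 users each rating all 20 members gives the record count
# mentioned above.
record_count = 20 * 20
# record_count => 400

# A user's listed score is just the mean of the points they
# received (sample points invented).
points_received = [6, 8, 7, 9, 5]
average = points_received.sum.to_f / points_received.size
# average => 7.0
```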
Thanks for any suggestion anyway.


#5

On Jan 16, 2007, at 1:35 PM, Rudy wrote:

Hi, looking for the best solution here. This is the situation: I
have a large database (at least 100000 records). I have to find all
the records from a group of people from a certain date, and after
that I have to query and make calculations with a set of those
records from every member of that group.

Are the latter calculations just an attempt to further filter the
results, or are you calculating new data? If the former, you can
probably just use more complicated conditions to find the data. If
the latter, you can do the calculation on the database using a more
complex find_by_sql statement, or in Rails by executing Ruby code on
the result set. The latter will probably take longer if the system
is small, but be much faster when you have multiple clients hitting
the server to do that kind of operation.

-faisal


#6

Roderick van Domburg wrote:

If I interpret this correctly, the following should quite simply work:
Answers.find(:all, :select => 'user_id, AVG(points)', :group =>
'user_id')

Obviously that should have been “Answer” instead of “Answers”, excuse my
monkey hand-coding.

  • Roderick

#7

Rudy wrote:

Let me try to describe the situation a bit more clearly: I have 20 users
who have all given points to each other. Each ‘answer/point’ is stored
in a record (= 20 x 20 = 400 records). What I need is to calculate the
average of the points for every user, and list them. (Because I have
maybe 100 users in total, who are giving points on a lot of occasions,
the database will be growing rapidly.) What I’m looking for is the
most stable (speed-wise) way to get the appropriate records and do the
calculations. I’d prefer to avoid a situation where results would
slowly start to take more time to show up as the database got bigger.
Thanks for any suggestions anyway.

If I interpret this correctly, the following should quite simply work:
Answers.find(:all, :select => 'user_id, AVG(points)', :group =>
'user_id')
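
For reference, that finder corresponds to SQL along these lines
(table and column names assumed from the thread; exact quoting depends
on the database adapter):

```ruby
# Built as a plain string for illustration only; ActiveRecord
# generates the real statement from the :select and :group options.
sql = "SELECT user_id, AVG(points) AS avg_points " \
      "FROM answers GROUP BY user_id"
```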

  • Roderick

#8

On Jan 16, 2007, at 4:27 PM, Rudy wrote:

Let me try to describe the situation a bit more clearly: I have 20 users
who have all given points to each other. Each ‘answer/point’ is stored
in a record (= 20 x 20 = 400 records). What I need is to calculate the
average of the points for every user, and list them.

You can presumably do this with a join, nested select, and some sort
of average (e.g. sum(points)/count(points), assuming that the points
are stored as floats).

That said, it’s probably cleaner, easier, more portable, and faster*
to select all the records you want and then use Ruby to calculate the
averages. SQL will probably do a faster job of calculating the data
for small data sets, but as the data grows and the number of clients
grows it’s going to be progressively slower. By putting it in Ruby
you move the progressive slowness onto the part of the system that
can be easily replicated through the addition of cheap hardware.

*Grain of salt: I haven’t tested this. I haven’t even tried doing
it. As with all estimations of performance, you should treat this as
made up until you’ve profiled it.

-faisal


#9

On Jan 16, 2007, at 5:40 PM, Roderick van Domburg wrote:

Answer.find(:all, :select => 'user_id, AVG(points)', :group =>
'user_id')

I take back my earlier claim that it would be easier to do this in
Ruby. Also, if you do it this way in the database and your points are
integers, PostgreSQL (at least) will return the average as a numeric
(non-truncated) value, so there appears to be no drawback to doing it
this way, at least at first.
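
The same integer-versus-float caution applies if the averaging is
done on the Ruby side instead (a two-line illustration):

```ruby
points = [3, 4]

points.sum / points.size       # => 3 (Integer division truncates)
points.sum.to_f / points.size  # => 3.5
```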

-faisal


#10

I would let the RDBMS do its job and do the calculation in an SQL
statement. It should be faster (if you have the right indices) and it
reduces the traffic between your SQL machine and the webserver.