Speeding up ActiveRecord Querying on very large joins?

When I first started my project, I thought long and hard about the exact
information I would need. Because my site relies largely on statistical
analysis, and virtual matchups can be built from any number of scenarios,
it is very difficult to come up with a common solution.

The simple fact is that at any time, any two teams can be compared
against one another. The data used for the comparison is very extensive.
When I started with one large join, the return time was so long that at
times it would freeze up my development machine. So I carefully and
systematically broke it down into several smaller joins to ease the load.

This worked fairly well, considering the type of data I'm pulling. That
said, the best time I can get when running this particular query is
540 ms. As more and more workload hits the test server, hours later this
lone query can reach up to 14,000 ms (14 seconds is just too long).

So, I'm trying to find a way to speed up my query, optimize it, or
figure out what else I can do to improve this result. The model method
I'm about to show you is in fact enormous, but if you take it apart
piece by piece you'll see that I'm using selects for specific columns
and joins for specific tables, and trying to run several smaller queries
to speed things up.

My question is what can I do to improve this?

def self.compareteams(teamone, teamtwo)
  # Compare two teams using the stats compiled for the current week.
  week_start_date = Time.now.beginning_of_week.to_date.strftime('%Y-%m-%d')
  week_end_date   = Time.now.end_of_week.to_date.strftime('%Y-%m-%d')
  compiled_on     = week_start_date..week_end_date

  tsrs = Team.find(teamone, teamtwo,
    :select => "teams.name, teams.mascot, teams.formal_name, teams.image, tsrs_ratings.tsrs",
    :joins => [:tsrs_ratings],
    :conditions => { :tsrs_ratings => { :compiled_on => compiled_on } },
    :order => "teams.id")

  tsos_offense = Team.find(teamone, teamtwo,
    :select => "tsos_offenses.totoff, tsos_offenses.rushoff, tsos_offenses.passoff, tsos_offenses.scoroff, tsos_offenses.rzonoff, tsos_offenses.fumlost, tsos_offenses.passhint, tsos_offenses.tolost, tsos_offenses.sacksall, tsos_offenses.tackflossall, tsos_offenses.passeff, tsos_offenses.firdwns, tsos_offenses.thrdwncon, tsos_offenses.fthdwncon, tsos_offenses.totals",
    :joins => [:tsos_offenses],
    :conditions => { :tsos_offenses => { :compiled_on => compiled_on } },
    :order => "teams.id")

  tsos_defense = Team.find(teamone, teamtwo,
    :select => "tsos_defenses.totdef, tsos_defenses.rushdef, tsos_defenses.passdef, tsos_defenses.scordef, tsos_defenses.rzondef, tsos_defenses.fumgain, tsos_defenses.passint, tsos_defenses.togain, tsos_defenses.sacks AS allsacks, tsos_defenses.tackfloss, tsos_defenses.passeffdef, tsos_defenses.firdwnsdef, tsos_defenses.thrdwncondef, tsos_defenses.fthdwncon, tsos_defenses.totals",
    :joins => [:tsos_defenses],
    :conditions => { :tsos_defenses => { :compiled_on => compiled_on } },
    :order => "teams.id")

  tsos_steams = Team.find(teamone, teamtwo,
    :select => "tsos_steams.kickret, tsos_steams.puntret, tsos_steams.netpunt, tsos_steams.kickretdef, tsos_steams.puntretdef, tsos_steams.totals",
    :joins => [:tsos_steams],
    :conditions => { :tsos_steams => { :compiled_on => compiled_on } },
    :order => "teams.id")

  # Raw offensive stats: one join per category table, each restricted to this week.
  offense = Team.find(teamone, teamtwo,
    :select => "total_offenses.ydspgm AS offtotal, rushing_offenses.ydspg AS offrush, passing_offenses.ydspgm AS offpass, scoring_offenses.avg AS offscoring, red_zone_offenses.pct AS offrzone, fumbles_losts.fumbles_lost AS fumlost, passes_had_intercepteds.passes_had_intercepted AS passhint, sacks_alloweds.pg AS sacksall, tackles_for_loss_alloweds.pg AS tflossall, passing_efficiencies.rating AS pasoffeff, first_downs_offenses.pg AS firdwnoff, third_down_conversion_percentages.pct AS thirdwnpctoff, fourth_down_conversion_percentages.pct AS fourdwnpctoff",
    :joins => [:total_offenses, :rushing_offenses, :passing_offenses, :scoring_offenses, :red_zone_offenses, :fumbles_losts, :passes_had_intercepteds, :sacks_alloweds, :tackles_for_loss_alloweds, :passing_efficiencies, :first_downs_offenses, :third_down_conversion_percentages, :fourth_down_conversion_percentages],
    :conditions => { :total_offenses => { :compiled_on => compiled_on }, :rushing_offenses => { :compiled_on => compiled_on }, :passing_offenses => { :compiled_on => compiled_on }, :scoring_offenses => { :compiled_on => compiled_on }, :red_zone_offenses => { :compiled_on => compiled_on }, :fumbles_losts => { :compiled_on => compiled_on }, :passes_had_intercepteds => { :compiled_on => compiled_on }, :sacks_alloweds => { :compiled_on => compiled_on }, :tackles_for_loss_alloweds => { :compiled_on => compiled_on }, :passing_efficiencies => { :compiled_on => compiled_on }, :first_downs_offenses => { :compiled_on => compiled_on }, :third_down_conversion_percentages => { :compiled_on => compiled_on }, :fourth_down_conversion_percentages => { :compiled_on => compiled_on } },
    :order => "teams.id")

  # Raw defensive stats, same pattern as the offense query.
  defense = Team.find(teamone, teamtwo,
    :select => "total_defenses.ydspgm AS deftotal, rushing_defenses.ydspg AS defrush, pass_defenses.ydspgm AS defpass, scoring_defenses.ptspgm AS defscoring, red_zone_defenses.pct AS defrzone, fumbles_recovereds.fumbles_recovered AS fumrec, passes_intercepteds.passes_intercepted AS passint, sacks.pg AS sacksdef, tackles_for_losses.pg AS tfloss, pass_efficiency_defenses.rating AS pasdefeff, first_downs_defenses.pg AS firdwndef, third_down_percentage_defenses.pct AS thirdwnpctdef, fourth_down_percentage_defenses.pct AS fourdwnpctdef",
    :joins => [:total_defenses, :rushing_defenses, :pass_defenses, :scoring_defenses, :red_zone_defenses, :fumbles_recovereds, :passes_intercepteds, :sacks, :tackles_for_losses, :pass_efficiency_defenses, :first_downs_defenses, :third_down_percentage_defenses, :fourth_down_percentage_defenses],
    :conditions => { :total_defenses => { :compiled_on => compiled_on }, :rushing_defenses => { :compiled_on => compiled_on }, :pass_defenses => { :compiled_on => compiled_on }, :scoring_defenses => { :compiled_on => compiled_on }, :red_zone_defenses => { :compiled_on => compiled_on }, :fumbles_recovereds => { :compiled_on => compiled_on }, :passes_intercepteds => { :compiled_on => compiled_on }, :sacks => { :compiled_on => compiled_on }, :tackles_for_losses => { :compiled_on => compiled_on }, :pass_efficiency_defenses => { :compiled_on => compiled_on }, :first_downs_defenses => { :compiled_on => compiled_on }, :third_down_percentage_defenses => { :compiled_on => compiled_on }, :fourth_down_percentage_defenses => { :compiled_on => compiled_on } },
    :order => "teams.id")

  turnovers = Team.find(teamone, teamtwo,
    :select => "turnover_margins.margin AS tomargin, tsos_turnover_margins.margin AS tomarginrating",
    :joins => [:turnover_margins, :tsos_turnover_margins],
    :conditions => { :turnover_margins => { :compiled_on => compiled_on }, :tsos_turnover_margins => { :compiled_on => compiled_on } },
    :order => "teams.id")

  special_teams = Team.find(teamone, teamtwo,
    :select => "kickoff_returns.avg AS kickydsper, kickoff_return_yardage_defenses.ydsperret AS kickydsperdef, punt_returns.avg AS puntydsper, punt_return_yardage_defenses.ydsperret AS puntydsperdef, net_puntings.net_avg AS netpunt",
    :joins => [:kickoff_returns, :kickoff_return_yardage_defenses, :punt_returns, :punt_return_yardage_defenses, :net_puntings],
    :conditions => { :kickoff_returns => { :compiled_on => compiled_on }, :kickoff_return_yardage_defenses => { :compiled_on => compiled_on }, :punt_returns => { :compiled_on => compiled_on }, :punt_return_yardage_defenses => { :compiled_on => compiled_on }, :net_puntings => { :compiled_on => compiled_on } },
    :order => "teams.id")

  return tsrs, tsos_offense, tsos_defense, tsos_steams, offense, defense, turnovers, special_teams
end

Eight variables are returned.

The bottleneck begins to occur on offense = … (around 30 ms) and then
deepens on defense = … (around 460 ms).

These are indeed large but everything listed here is absolutely required
for my app to work. What can I do to optimize this?

thanks.

For readability, this is better to look at:

http://pastie.org/574570

On Thu, 2009-08-06 at 22:42 +0200, Alpha B. wrote:

For readability, this is better to look at:

http://pastie.org/574570

Are you already using Memcached? If not, you'll want to. Your data
doesn't change very often, so one strategy you might consider is kicking
off a job that runs your queries in the 'wee hours' of the morning so
they're cached for your users.

HTH,
Bill
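
A minimal sketch of that caching idea, assuming Rails 2.1 or later with a memcached cache store (the key scheme and the cached_compareteams wrapper are illustrative, not the original code):

# config/environments/production.rb -- point the Rails cache at memcached
# (assumes the memcache-client gem is installed and memcached is running)
config.cache_store = :mem_cache_store, 'localhost:11211'

# In the Team model: wrap the existing comparison in a cache lookup.
# The key includes the week and both team ids, so each matchup is computed
# at most once per week and then served from memcached.
def self.cached_compareteams(teamone, teamtwo)
  week = Time.now.beginning_of_week.to_date
  Rails.cache.fetch("compare/#{week}/#{[teamone, teamtwo].sort.join('-')}") do
    compareteams(teamone, teamtwo)
  end
end

Caching lazily like this also sidesteps having to precompute every possible matchup up front; a pairing is only computed the first time someone asks for it in a given week.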

On Aug 6, 2009, at 1:36 PM, Alpha B. wrote:

My question is what can I do to improve this?

As someone else mentioned: if you can cache it, cache it.

That said… am I reading that right that in one of your queries you
are joining on 13 different tables?

  • Find the SQL produced for that query and run it through your
    database using whatever is necessary to get the query plan (i.e.
    EXPLAIN SELECT…). Make sure you've got indexes on the columns that
    need them.

  • Play around with breaking that query up into smaller pieces. It may
    be that splitting them into individual queries and "building the join"
    yourself turns out to be faster (a rough sketch of this follows after
    this message). This is particularly true if your DB isn't picking up
    the indexes and if you have massive tables, but are only returning a
    very small set of data (and again your DB isn't picking up the
    indexes and lopping things off from the get go).

If you’re using MySQL it can be very frustrating when you have the
right indexes, but it chooses not to use them. Look to see if you can
force it to or give it hints.
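
As a rough illustration of the second bullet, here is one way the offense query from the original method could be split in two and stitched back together in Ruby, inside the Team model. The association and column names come from the post; the helper method, the split point, and the merge step are only a sketch, not the author's code:

def self.compiled_conditions(assocs, compiled_on)
  # Build the per-table { :compiled_on => ... } conditions hash for a list of associations.
  assocs.inject({}) { |hash, assoc| hash.merge(assoc => { :compiled_on => compiled_on }) }
end

def self.offense_split(teamone, teamtwo, compiled_on)
  first_assocs  = [:total_offenses, :rushing_offenses, :passing_offenses,
                   :scoring_offenses, :red_zone_offenses, :fumbles_losts]
  second_assocs = [:passes_had_intercepteds, :sacks_alloweds, :tackles_for_loss_alloweds,
                   :passing_efficiencies, :first_downs_offenses,
                   :third_down_conversion_percentages, :fourth_down_conversion_percentages]

  first_half = Team.find(teamone, teamtwo,
    :select => "teams.id, total_offenses.ydspgm AS offtotal, rushing_offenses.ydspg AS offrush, passing_offenses.ydspgm AS offpass, scoring_offenses.avg AS offscoring, red_zone_offenses.pct AS offrzone, fumbles_losts.fumbles_lost AS fumlost",
    :joins => first_assocs,
    :conditions => compiled_conditions(first_assocs, compiled_on),
    :order => "teams.id")

  second_half = Team.find(teamone, teamtwo,
    :select => "teams.id, passes_had_intercepteds.passes_had_intercepted AS passhint, sacks_alloweds.pg AS sacksall, tackles_for_loss_alloweds.pg AS tflossall, passing_efficiencies.rating AS pasoffeff, first_downs_offenses.pg AS firdwnoff, third_down_conversion_percentages.pct AS thirdwnpctoff, fourth_down_conversion_percentages.pct AS fourdwnpctoff",
    :joins => second_assocs,
    :conditions => compiled_conditions(second_assocs, compiled_on),
    :order => "teams.id")

  # Stitch the two halves back together by team id; each element is one
  # team's merged attribute hash.
  second_by_id = second_half.index_by(&:id)
  first_half.map { |team| team.attributes.merge(second_by_id[team.id].attributes) }
end

Whether two six- or seven-table joins beat one thirteen-table join depends entirely on what EXPLAIN shows, so it is worth timing both.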

bill walton wrote:

On Thu, 2009-08-06 at 22:42 +0200, Alpha B. wrote:

For readability, this is better to look at:

http://pastie.org/574570

Are you already using Memcached? If not, you'll want to. Your data
doesn't change very often, so one strategy you might consider is kicking
off a job that runs your queries in the 'wee hours' of the morning so
they're cached for your users.

HTH,
Bill

Hi Bill,

thanks for the response mate. I haven’t checked out memcached and will
go look at that now.

The problem with running my queries ahead of time is that there are 120
teams, so that would mean approximately 14,400 queries, since any two
teams can be matched up and compared to one another. I'm not sure that
would work. Even without any kind of pause between them, I'm probably
looking at around 90 minutes to run all those queries?

Maybe I can do something like that. Beyond that, I can't quite think of
anything further I can do to optimize it (unless memcached will work).
The only other thing I think would help would be a multi-tiered server
setup, with the database sitting on a server by itself, nginx as a front
end on another server, and my content on yet another server.

I’ve been contemplating this but right now it would be fairly expensive
and I don’t have enough time to implement it before the start of this
season.

On Aug 6, 9:36 pm, Alpha B. [email protected]
wrote:

The bottleneck begins to occur on offense = … (around 30 ms) and then
deepens on defense = … (around 460 ms).

These are indeed large but everything listed here is absolutely required
for my app to work. What can I do to optimize this?

First off, use EXPLAIN … in your database's query browser/console to
find out how the database is executing your query. You may find that
it's doing something really stupid or that there are indices missing.

Fred
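
If it's more convenient to stay inside script/console than a MySQL client, the same plan can be pulled through the ActiveRecord connection (a sketch; paste in whatever SQL Rails logged for the slow query):

# Paste the logged SELECT from log/development.log in place of the placeholder.
sql = "PASTE THE LOGGED SELECT ... FROM teams INNER JOIN ... HERE"
# Ask MySQL for the execution plan of that exact statement.
ActiveRecord::Base.connection.select_all("EXPLAIN #{sql}").each do |row|
  puts row.inspect  # check possible_keys, key, rows and Extra for each table
end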

Phillip and Fred,

Many thanks for those suggestions as well. I’m on it as we speak.

For the query where the bottleneck is deepest, here is the EXPLAIN
output:

http://pastie.org/574630

You need to add quite a few indexes to those tables then.
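
Something along these lines is one way to add them in a migration. This is a sketch: it assumes each stats table references teams through a team_id column, and only a few of the tables are listed:

class AddStatsIndexes < ActiveRecord::Migration
  # A composite index on (team_id, compiled_on) lets MySQL jump straight to
  # the one row per team and week instead of scanning the table.
  TABLES = [:tsrs_ratings, :total_offenses, :rushing_offenses, :passing_offenses,
            :total_defenses, :rushing_defenses, :pass_defenses]  # ...plus the remaining stats tables

  def self.up
    TABLES.each do |table|
      add_index table, [:team_id, :compiled_on], :name => "idx_#{table}_team_week"
    end
  end

  def self.down
    TABLES.each do |table|
      remove_index table, :name => "idx_#{table}_team_week"
    end
  end
end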

Yeah,

I'm using InnoDB, so I thought the indexes were automatically set up
when specifying _id on any column. I was mistaken.

Alpha B. wrote:

Yeah,

I'm using InnoDB, so I thought the indexes were automatically set up
when specifying _id on any column. I was mistaken.

Um, yeah. :)

Also, consider switching to PostgreSQL. You’ll get better performance
without sacrificing referential integrity.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

For the query where the bottleneck is deepest, here is the EXPLAIN
output:

http://pastie.org/574630

Philip H. wrote:

That said… am I reading that right that in one of your queries you
are joining on 13 different tables?

  • Find the SQL produced for that query and run it through your
    database using whatever is necessary to get the query plan (i.e.
    EXPLAIN SELECT…). Make sure you've got indexes on the columns that
    need them.

  • Play around with breaking that query up into smaller pieces. It may
    be that splitting them into individual queries and “building the join”
    yourself turns out to be faster. This is particularly true if your DB
    isn’t picking up the indexes and if you have massive tables, but are
    only returning a very small set of data (and again your DB isn’t
    picking up the indexes and lopping things off from the get go).

If you’re using MySQL it can be very frustrating when you have the
right indexes, but it chooses not to use them. Look to see if you can
force it to or give it hints.

Here’s the new pastie:

http://pastie.org/574754

It appears that the indexes are being used now? I see the rows estimate
is down to 1-3 rows, which is much better than seeing 400+ rows for each
table in the first segment.

However, in the Extra column it says the rest are using WHERE, but it
doesn't specifically say it's using the index. I can see the key it
shows for each table, but shouldn't it be saying "Using where; Using
index"?

The time appears to be a bit better - it went down from 14 seconds to 5
seconds, but it still bottlenecks a bit. I might have to do as you said
and break up the two queries that each join 14 tables…

Offense joined with 14 tables (bottleneck begins to occur)
Defense joined with 14 tables (bottleneck is massively apparent)

All the other joins are 0.0 or 0.1 ms. So, I think my best bet is to
break these up into smaller queries…

I broke the offense and defense joins up in half and ran a new test…

20 ms (winner)…

I still get all my data, so with the indexes in place and the joins
split it looks good - I'll have to do some extensive testing, though, to
see how it holds up under load in production…

Many thanks everyone. I really appreciate the help with this
optimization.

If you set up foreign key constraints, then the indexes will
automatically be created on the [parent]_id columns. (And InnoDB
doesn't let you remove them even if you wanted to.)
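
For example (a sketch: Rails 2 has no built-in foreign key helper, so this drops to raw SQL, and the table and column names are assumptions based on the queries above):

class AddTeamForeignKeys < ActiveRecord::Migration
  def self.up
    # InnoDB requires an index on the referencing column and creates one
    # automatically if none exists yet.
    execute "ALTER TABLE total_offenses ADD CONSTRAINT fk_total_offenses_team FOREIGN KEY (team_id) REFERENCES teams (id)"
  end

  def self.down
    execute "ALTER TABLE total_offenses DROP FOREIGN KEY fk_total_offenses_team"
  end
end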

On Aug 6, 2:35 pm, Alpha B. [email protected] wrote:

Maybe not directly relevant, but have you considered that you may have
normalized your database into nonfunctionality? From the queries
you’ve shown, most of the tables have one data field(?!), a date, and
a foreign key field for the team. This would be an appropriate
structure for describing a system where each type of statistic might
be generated at different times, but it seems completely inappropriate
for your application where all the stats are (obviously) generated
simultaneously.

Why did you decide to go this way, vs a single ‘stats’ table for
everything?

–Matt J.

Matt J. wrote:

Maybe not directly relevant, but have you considered that you may have
normalized your database into nonfunctionality? From the queries
you’ve shown, most of the tables have one data field(?!), a date, and
a foreign key field for the team. This would be an appropriate
structure for describing a system where each type of statistic might
be generated at different times, but it seems completely inappropriate
for your application where all the stats are (obviously) generated
simultaneously.

Why did you decide to go this way, vs a single ‘stats’ table for
everything?

–Matt J.

Hi Matt,

There are approx. 37 different categories of statistics for NCAA
football. Each category houses up to 14 columns. I would never attempt
to build one table that housed 518 fields.

All of the stats are not generated simultaneously. They are built into
separate weeks. Each statistics table houses 120 rows of data paired to
the given week it was compiled. So, if there’s say 17 weeks (counting
bowls) for college football, then there will be 17 weeks worth of data
(17 x 120 rows).

Yes, I could simply place all of that into one giant table (one stats
table with 518 columns that contains 2,040 rows per year). That's not
going to happen. It's inefficient and difficult to manage. If the table
gets corrupted (very likely given that type of scenario), it kills my
entire site at once. You also have to account for 3x /tmp space for the
largest table in your database. Given the nature of this, and the fact
that these stats will not be deleted for years, that can hit a server
pretty hard.

My tables are normalized properly.

Thanks.

Hi Matt,

Again, you bring up some solid points, and some I agree with and others
I don’t. However, at this stage in my app (at least for this season) I
can’t alter or test any of what you speak about.

I will definitely test a single table in development as the season
progresses and see how much more optimized it is. I’m just skeptical
but I’m sure I’m not the only one that is this way when it comes to a
table this large.

I know it’s not the biggest table ever, but a table that size still has
me concerned. However, you could be right! Maybe it’s the real way to
go. I don’t know at this point.

I just wanted to let you know that I do hear what you are saying and
that if you are correct in everything you say, it would help me for each
new season.

So, my thanks to you for the input and feedback, and most of all - for
following up on your comments with some solid rebuttals!

Take care mate.

On Aug 7, 2:26 pm, Alpha B. [email protected]
wrote:

Why did you decide to go this way, vs a single ‘stats’ table for
everything?

–Matt J.

Hi Matt,

There are approx. 37 different categories of statistics for NCAA
football. Each category houses up to 14 columns. I would never attempt
to build one table that housed 518 fields.

Why not? Even MySQL's fairly restrictive limits are more than 5 times
that (see
http://dev.mysql.com/doc/refman/5.0/en/column-count-limit.html).
The general argument against it is that tables with that many columns
aren't correctly separating concerns (see this post:
http://forums.digitalpoint.com/showpost.php?s=c6f4b98064317f2ad75a532f0e1268ee&p=8255322&postcount=11
for a prime example of doing it wrong).

However, in this case, a single event (a game) generates all 518
values.

All of the stats are not generated simultaneously. They are built into
separate weeks. Each statistics table houses 120 rows of data paired to
the given week it was compiled. So, if there’s say 17 weeks (counting
bowls) for college football, then there will be 17 weeks worth of data
(17 x 120 rows).

Nothing controversial there.

Yes, I could simply place all of that into one giant table (one stats
table with 518 columns that contains 2,040 rows per year). That's not
going to happen. It's inefficient and difficult to manage. If the table
gets corrupted (very likely given that type of scenario), it kills my
entire site at once. You also have to account for 3x /tmp space for the
largest table in your database. Given the nature of this, and the fact
that these stats will not be deleted for years, that can hit a server
pretty hard.

Some quick math:

518 columns x 8 bytes per column (could be less, but assume there’s
some overhead) → 4144 bytes per row.
MySQL’s limit is just shy of 65k, so no issue there.

2040 rows x 4144 bytes → roughly 8.1 MB.

Given that there are live applications that manage hundreds if not
thousands of GBs of data in MySQL, I think this is an exceptionally
premature optimization, unless you plan on running the server
unaltered for the next 500 or so years. 3x /tmp space is similarly
beneath consideration, unless you're serving from an embedded
microcontroller or something.

Worrying about “table corruption” makes even less sense - if ANY data
gets corrupted, it’s time to restore from backup.

Finally, I’ll point out that while a combined row is way under the
limit, the mondo-JOIN query you’re headed for is getting close to
another limit - the maximum 61 tables allowed in a join clause.

–Matt J.
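
To make the comparison concrete, under the single-table design being argued for here the whole comparison might reduce to something like this (entirely hypothetical: a team_stats table with one row per team per week and the 500-odd stat columns):

# One row per team per week; every stat is just a column on team_stats.
def self.compareteams(teamone, teamtwo)
  week = Time.now.beginning_of_week.to_date..Time.now.end_of_week.to_date
  TeamStat.find(:all,
    :conditions => { :team_id => [teamone, teamtwo], :compiled_on => week },
    :order => "team_id")
end

Two rows come back and no joins are needed at all; whether that layout is acceptable is exactly the design question being debated here.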

Alpha B. wrote:

Hi Matt,

Again, you bring up some solid points, and some I agree with and others
I don’t. However, at this stage in my app (at least for this season) I
can’t alter or test any of what you speak about.

Poppycock. You might not be willing to alter your production DB this
season, but you can (and probably should) set up a testing environment
to see what you can do.

I will definitely test a single table in development as the season
progresses and see how much more optimized it is. I’m just skeptical
but I’m sure I’m not the only one that is this way when it comes to a
table this large.

I think a 518-field table sounds odd, but perhaps right for what you’re
doing. However, a large number of single-column tables sounds far
odder.

I know it’s not the biggest table ever, but a table that size still has
me concerned.

Why?

However, you could be right! Maybe it’s the real way to
go. I don’t know at this point.

I just wanted to let you know that I do hear what you are saying and
that if you are correct in everything you say, it would help me for each
new season.

So, my thanks to you for the input and feedback, and most of all - for
following up on your comments with some solid rebuttals!

Take care mate.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Alpha B. wrote:
[…]

I mentioned in paragraph two that I will be setting up a test
environment for it.

OK; I missed that.

But, again, I’m not going to alter my production DB
at this time. The season is just a couple weeks away and it wouldn’t
allow for enough time to implement, test, and then complete.

Sure it would. This is not a big change from the point of view of your
app.

I won’t
risk any issues.

Why?

I talked about table corruption previously. Matt simply said, well, you
can restore from backup. If I had 50,000 subscribers all expecting a
service, and a table this large, which would basically hold all of the
data the service provides, failed, there would be a critical outage.
Table corruption has a larger chance of occurring on larger tables.

Um, what? Table corruption is simply not an issue in any DB environment
I have ever worked with. And I’m not sure why you say it has a greater
chance of occurring on larger tables. If it’s simply that there’s more
data, well, that’s true of many small tables too.

Basically, if you can’t trust your DB server not to corrupt your tables,
then don’t use that DB server. It’s that simple.

I like to ensure that whatever I provide works 100%. My current
situation is much improved (query times are really good right now), and
with caching it will only get better.

What I'm doing right now is not wrong; it can just be made more
efficient later on (as Matt pointed out).

Perhaps. I haven't examined your queries in detail.

At this point in time, we can agree to disagree on this topic.

But we shouldn’t. That’s a sign that more discussion is necessary, and
should not simply be ignored. It’s lazy, and not in the Larry Wall
sense. With what you’re doing, I doubt that you can afford that kind of
laziness.

Thanks.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]