Optimizing query

Hi!

I’m trying to speed up some queries in my app, running Rails 3.2, and need some help. The app runs on Heroku with PostgreSQL. I’m new to PostgreSQL and need some help optimizing a query so it effectively uses indices.

This is the query I’m currently working on:
http://snipt.net/Linuus/rails-query?key=ee73173643e8d21a5a487d8a329c7a26

In the link above you can also see the full EXPLAIN output (query plan) as well as my current indices. It doesn’t seem to use any of my indices at the moment…

So, any help is very appreciated.

Regards
Linus

It would probably help if you posted the generated SQL as well as the output of EXPLAIN ANALYZE, which shows not just what the planner thinks the costs are, but what the costs actually were.

You could also try running ANALYZE on the tables in question, or
ANALYZE VERBOSE and checking if the estimates are pretty close to
reality, e.g.

ANALYZE VERBOSE categories;

Jim Crate

Thank you. I have added the generated SQL to the snipt in my first post.

Regarding the EXPLAIN ANALYZE, can I somehow run that on Heroku? I’m using their shared database so I don’t have access to any psql console…
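Even without psql, you can usually run raw SQL through ActiveRecord from a `heroku console` session. A minimal sketch (the SELECT below is a placeholder, not your actual query; paste the generated SQL from your log instead):

```ruby
# Sketch: run EXPLAIN ANALYZE via ActiveRecord instead of psql.
# The statement here is a placeholder; substitute your generated SQL.
explain_sql = "EXPLAIN ANALYZE " \
              "SELECT * FROM categories ORDER BY name ASC"

# Inside a Rails (heroku) console you would then do something like:
#   ActiveRecord::Base.connection.select_all(explain_sql)
#     .each { |row| puts row["QUERY PLAN"] }
```

This prints the same per-node actual timings you would see in psql, so you can compare the Heroku plan against your local one.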

I did run it on my local machine though. The thing is that it differs in database version (local is PostgreSQL 9.1.2 while Heroku still uses 8.3 or something) and also in size.

Anyway, the EXPLAIN ANALYZE from my local machine can be found here:
http://snipt.net/Linuus/postgresql?key=e4728f444b24839e3f80adf3829bcba9

Regards
Linus

On Wed, Jan 25, 2012 at 8:20 PM, Linus P. <[email protected]> wrote:

Thank you. I have added the generated SQL to the snipt in my first post.

Regarding the EXPLAIN ANALYZE, can I somehow run that on Heroku? I’m using
their shared database so I don’t have access to any psql console…

I did run it on my local machine though. The thing is that it differs in database version (local is PostgreSQL 9.1.2 while Heroku still uses 8.3 or something) and also in size.

On Debian/Ubuntu you typically have a choice of an 8.4 or a 9.1 …

Try

$ apt-cache search postgresql-8.4

and then you could install that alongside 9.1.

It will run on a different port
(one on 5432 and one on 5433 IIRC, check $ netstat -atn).

That said, I have no reason to expect the optimization to differ for such a “simple” query.

Anyway, the EXPLAIN ANALYZE from my local machine can be found here:
http://snipt.net/Linuus/postgresql?key=e4728f444b24839e3f80adf3829bcba9

Just out of curiosity… what problem are you trying to solve? Did you actually see any slow queries? I don’t see obvious problems with the query, and I’m not sure this is the area that deserves your first attention w.r.t. performance optimization. There is this whole story about “premature optimization” (Google it to find many rants about it).

HTH,

Peter

Well, maybe it’s not necessary… It is the slowest of my queries as far as I can see anyway.

I’m using Rails 3.2 and I have enabled automatic EXPLAIN on slow queries (>0.5s). This query is currently being “explained” in my console when I run it on my localhost: “SQL (556.5ms) …”

556.5ms seems a bit slow to me, doesn’t it?

Then a bunch of other stuff is happening and I get this:
“Completed 200 OK in 2737ms (Views: 813.5ms | ActiveRecord: 703.2ms)”

So, 550ms of the total 703ms is the above query.

Maybe I’m just picky? :)

Hi

I tested removing the .order(…) part and indeed, the query time goes down to ~100ms. However, adding indices doesn’t help, at least not as I did it :)

add_index :categories, :name
add_index :subcategories, :name

Did some more testing, and if I keep the .order… but don’t join the products table I get a query that runs at about ~55ms. So the bottleneck seems to be the products table. The query that I’m running looks like this:

Category.eager_load(:subcategories)
  .joins("INNER JOIN products AS p ON resellercategories.id = p.resellercategory_id")
  .order("categories.name ASC, subcategories.name ASC")

(Skipping the gender here…)

What I have is Categories and Subcategories. They are related to each other through a Resellercategories table. Products are related to Resellercategories. So, the reason that I want to join the products as well is that I only want to show categories and subcategories that actually have some products (there are some empty categories/subcategories still).

So the above query is what we came up with in another thread here in the group.

  • Maybe there is a better way to check if a category/subcategory has
    products without joining the entire products table?
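One common alternative worth trying (a sketch only, using the table names described in this thread) is an EXISTS subquery: the planner can stop at the first matching product per resellercategory instead of materializing the whole join:

```ruby
# Sketch: a semi-join condition instead of INNER JOIN products.
# Table/column names are taken from the thread's description of the schema.
exists_sql =
  "EXISTS (SELECT 1 FROM products p " \
  "WHERE p.resellercategory_id = resellercategories.id)"

# In the Rails query this condition would replace the products join,
# assuming resellercategories is already in scope via the eager_load:
#   Category.eager_load(:subcategories)
#           .where(exists_sql)
#           .order("categories.name ASC, subcategories.name ASC")
```

Whether this beats the join depends on the data distribution, so it needs checking with EXPLAIN ANALYZE on your actual tables.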

BTW, in my dev database there are about 8,700 rows in the products table, 2,200 rows in resellercategories, 5 rows in categories and 45 rows in subcategories.

On Wed, Jan 25, 2012 at 8:51 PM, Linus P. <[email protected]> wrote:

Well, maybe it’s not necessary… It is the slowest of my queries as far as I can see anyway.

I’m using Rails 3.2 and I have enabled automatic EXPLAIN on slow queries (>0.5s). This query is currently being “explained” in my console when I run it on my localhost: “SQL (556.5ms) …”

556.5ms seems a bit slow to me, doesn’t it?

Indeed.

TL;DR

  • On my system, a similar query seems a lot faster
  • you may have a bad “sort” on categories.name, subcategories.name
    without an index?

I had a similar database structure here, filled it with 10,000 records (you seem to have around 2,200?), PostgreSQL 9.1 on Ubuntu, and ran a similar query:

cl1 = ContactLine.includes(:contact) ; nil
cl2 = cl1.includes(:contact => :person) ; nil
cl3 = cl2.where("people.first_name = 'M' OR people.first_name = 'longer_word'") ; nil # 8 times 'M' and 8 times 'longer_word'
cl4 = cl3.order("contacts.email") ; nil

167:0> puts cl4.explain
SQL (16.9ms) SELECT "contact_lines"."id" AS …
FROM "contact_lines"
LEFT OUTER JOIN "contacts" ON "contacts"."id" = "contact_lines"."contact_id"
LEFT OUTER JOIN "people" ON "people"."id" = "contacts"."person_id"
WHERE (people.first_name = 'M' OR people.first_name = 'longer_word')
ORDER BY contacts.email
EXPLAIN (1.4ms) EXPLAIN SELECT "contact_lines"."id" …
EXPLAIN for: SELECT "contact_lines"."id" …
                                  QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=691.05..691.06 rows=1 width=2699)
   Sort Key: contacts.email
   ->  Hash Join  (cost=455.53..691.04 rows=1 width=2699)
         Hash Cond: (contact_lines.contact_id = contacts.id)
         ->  Seq Scan on contact_lines  (cost=0.00..198.00 rows=10000 width=50)
         ->  Hash  (cost=455.52..455.52 rows=1 width=2649)
               ->  Hash Join  (cost=229.01..455.52 rows=1 width=2649)
                     Hash Cond: (contacts.person_id = people.id)
                     ->  Seq Scan on contacts  (cost=0.00..189.00 rows=10000 width=41)
                     ->  Hash  (cost=229.00..229.00 rows=1 width=2608)
                           ->  Seq Scan on people  (cost=0.00..229.00 rows=1 width=2608)
                                 Filter: (((first_name)::text = 'M'::text) OR ((first_name)::text = 'longer_word'::text))
(12 rows)
=> nil

The log on a development server (plain rails s) is:

Started GET "/contact_lines" for 127.0.0.1 at 2012-01-25 22:26:41 +0100
Processing by ContactLinesController#index as HTML
SQL (8.6ms) SELECT "contact_lines"."id" AS t0_r0…
… ORDER BY contacts.email
Rendered contact_lines/index.html.haml within layouts/application (99.4ms)
Completed 200 OK in 105ms (Views: 89.5ms | ActiveRecord: 14.7ms)

About indexes… The above is without explicitly set indexes.

Assuming that the first_name::text filtering would be the most expensive part, I added this migration:

class AddIndexFirstName < ActiveRecord::Migration
  def change
    add_index :people, :first_name
  end
end

This worked:

\d people

                                 Table "public.people"
   Column   |          Type          |                      Modifiers
------------+------------------------+-----------------------------------------------------
 id         | integer                | not null default nextval('people_id_seq'::regclass)
 full_name  | character varying(255) |
 first_name | character varying(255) |
Indexes:
    "people_pkey" PRIMARY KEY, btree (id)
    "index_people_on_first_name" btree (first_name)

The EXPLAIN changed:

SQL (12.7ms) SELECT "contact_lines"."id" AS t0_r0, …

                                  QUERY PLAN
--------------------------------------------------------------------------------
 Sort  (cost=474.59..474.59 rows=1 width=2699)
   Sort Key: contacts.email
   ->  Hash Join  (cost=239.07..474.58 rows=1 width=2699)
         Hash Cond: (contact_lines.contact_id = contacts.id)
         ->  Seq Scan on contact_lines  (cost=0.00..198.00 rows=10000 width=50)
         ->  Hash  (cost=239.05..239.05 rows=1 width=2649)
               ->  Hash Join  (cost=12.54..239.05 rows=1 width=2649)
                     Hash Cond: (contacts.person_id = people.id)
                     ->  Seq Scan on contacts  (cost=0.00..189.00 rows=10000 width=41)
                     ->  Hash  (cost=12.53..12.53 rows=1 width=2608)
                           ->  Bitmap Heap Scan on people  (cost=8.52..12.53 rows=1 width=2608)
                                 Recheck Cond: (((first_name)::text = 'M'::text) OR ((first_name)::text = 'longer_word'::text))
                                 ->  BitmapOr  (cost=8.52..8.52 rows=1 width=0)
                                       ->  Bitmap Index Scan on index_people_on_first_name  (cost=0.00..4.26 rows=1 width=0)
                                             Index Cond: ((first_name)::text = 'M'::text)
                                       ->  Bitmap Index Scan on index_people_on_first_name  (cost=0.00..4.26 rows=1 width=0)
                                             Index Cond: ((first_name)::text = 'longer_word'::text)
(17 rows)

But the performance was the same. A log of a hit to the server:

Processing by ContactLinesController#index as HTML
SQL (11.7ms) SELECT "contact_lines"."id" AS t0_r0, …
WHERE (people.first_name = 'M' OR people.first_name = 'longer_word')
Rendered contact_lines/index.html.haml within layouts/application (104.1ms)
Completed 200 OK in 112ms (Views: 90.8ms | ActiveRecord: 19.8ms)

Maybe it is related to something in your query plan that I do not see here and that looks suspicious:

from
http://snipt.net/Linuus/postgresql?key=e4728f444b24839e3f80adf3829bcba9

Sort  (cost=3267.32..3272.25 rows=1972 width=2104) (actual time=39.308..39.468 rows=1880 loops=1)
  Sort Key: categories.name, subcategories.name
  Sort Method: quicksort  Memory: 313kB    ####### ??

from
http://snipt.net/Linuus/rails-query?key=ee73173643e8d21a5a487d8a329c7a26

you seem not to have an index on those 2 name columns?

What happens to performance when you remove the sort?

Category.eager_load(:subcategories)
  .joins("INNER JOIN products AS p ON resellercategories.id = p.resellercategory_id")
  # WITHOUT SORT: .order("categories.name ASC, subcategories.name ASC")
  .where("p.gender = 'unisex' OR p.gender = 'male'")

If that makes a significant difference, what happens when you add indices on those 2 columns (and possibly remove some indices on other columns)?

Then a bunch of other stuff is happening and I get this:
“Completed 200 OK in 2737ms (Views: 813.5ms | ActiveRecord: 703.2ms)”

So, 550ms of the total 703ms is the above query.

Maybe I’m just picky? :)

Those indeed seem “large” numbers for a new/fresh application…

Peter

On Thu, Jan 26, 2012 at 6:15 PM, Linus P. <[email protected]> wrote:

Maybe there is a better way to check if a category/subcategory has products without joining the entire products table?

It is possible to add a :counter_cache, but then you need to make sure you use the proper methods for each product that you add to or remove from the association.

As an alternative to the default counter cache (from Rails), you could build your own logic, as in:

  • has_male_products

changing your query to …

Category.eager_load(:subcategories).
  where(:has_male_products => true).
  order(…)

Then you would need to set the cache on the appropriate categories in an after_save on the product you are creating/updating/deactivating/(deleting?).

Both ideas would probably be faster for querying, but certainly more complex for making sure that cache is always correct.
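A rough sketch of the second idea, with model and association names assumed from this thread (Product belongs_to Resellercategory, which belongs_to Category; `category.products` is an assumed has_many :through):

```ruby
# Sketch: keep a cached "has_male_products" boolean on Category current.
# All names here are assumptions based on the schema described in the thread.
class Product < ActiveRecord::Base
  belongs_to :resellercategory
  after_save :refresh_category_flag

  private

  def refresh_category_flag
    category = resellercategory.category
    # exists? issues a LIMIT 1 query, so this check stays cheap
    male = category.products.where(:gender => ['male', 'unisex']).exists?
    # update_column skips validations/callbacks, avoiding recursion
    category.update_column(:has_male_products, male)
  end
end
```

A destroy callback (and equivalents for the other genders) would be needed as well for the cache to stay correct.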

HTH,

Peter

On Thu, Jan 26, 2012 at 6:37 PM, Peter V. <[email protected]> wrote:

Both ideas would probably be faster for querying, but certainly more complex for making sure that cache is always correct.

Sorry to reply to my own post.

TL;DR Is there pagination? Then a smaller set may return much faster.

I was thinking over my reply and may have forgotten a fundamental aspect… You say 2,200 categories, 8,000 products.

How many entries does your query return? (replace .all with .count at the end)

How many do you need?

What happens when you add .limit(20) to your query?

By which “primary object” do you want to sort and paginate? (I will assume ‘Product’ in the discussion below.)

With the includes that are currently implemented, you may have to redo the whole query as 2 queries…

  1. for fetching the “primary objects” (e.g. exactly 20 Products, no
     additional “has_many” data, because that would increase the number of
     returned rows for 1 product and make proper pagination in the database
     impossible; including “belongs_to” here is no problem)

  2. a second query for fetching eventual “has_many” data on those 20
     “primary products” (is that “Reification”?)

If the performance problem can be solved by taking the pagination into account, that would be a much better solution than building cache columns in this early phase of your project.
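The two-query approach above could look roughly like this (a sketch; model and association names are assumptions, and `pluck` requires Rails 3.2):

```ruby
# Sketch of paginating on the primary object, then preloading associations.
# Query 1: one page of primary objects only -- no has_many joins, so
# LIMIT counts real products, not join-multiplied rows.
page_ids = Product.order("products.name ASC").limit(20).offset(0).pluck(:id)

# Query 2: refetch exactly those 20 rows with association data preloaded
# (separate queries per association, no row multiplication).
products = Product.where(:id => page_ids)
                  .includes(:resellercategory => [:category, :subcategory])
```

`includes` here issues separate SELECTs per association rather than one big join, which is usually what you want once pagination is involved.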

HTH,

Peter

Also, another reason for the query that I forgot to mention is that if a user filters the products for female products only, for instance, it should only show categories and subcategories that contain products for that gender.

The gender is specified in the products table…

This query is only to get the appropriate Categories & Subcategories. There are 5 Categories and 45 Subcategories.

My products are imported from webshops which use different categories for the same things (“tshirt”, “t-shirts”, “t-shirt”, “short sleeved shirts” may all be the same). To cope with this issue I have the “Resellercategories” that I relate to a specific Category and Subcategory.

Category/Subcategory -> Resellercategory -> Product

(Resellercategories are never seen in the frontend)

If I filter the products for “male” I only want to display (non-empty)
Categories/Subcategories that have male products.

To answer your question, there is no pagination on the Categories and Subcategories that I’m fetching here with this query. The query only returns the appropriate Categories/Subcategories. So if filtered by “male” it returns 3 Categories and 12 Subcategories (in my dev environment).

I am using pagination for the actual products though, but that is a
separate query.

On Thu, Jan 26, 2012 at 8:51 PM, Linus P. <[email protected]> wrote:

(Resellercategories are never seen in the frontend)

If I filter the products for “male” I only want to display (non-empty)
Categories/Subcategories that have male products.

To answer your question, there is no pagination on the Categories and
Subcategories that I’m fetching here with this query. The query only
returns the appropriate Categories/Subcategories. So if filtered by “male”
it returns 3 Categories and 12 Subcategories (In my dev environment).

OK, I understand (a small number of categories, many more products).

If you find no other way, then caching the “has_male_products” etc. in the Category/Subcategory may be the best remaining way.

I was also thinking, maybe it is acceptable that the set of Categories with “male_products” etc. is only updated every hour or so. That would avoid the complexity of real-time updates of that cache column. On the other hand, an after_save on products is not that difficult.
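If the hourly route is acceptable, one sketch (all table/column names are assumptions from this thread) is a rake task run from cron that recomputes the flag in a single bulk UPDATE:

```ruby
# Sketch: recompute the cached flag in bulk, e.g. hourly via cron,
# instead of per-save callbacks. Names are assumptions from the thread.
namespace :cache do
  desc "Recompute has_male_products on categories"
  task :refresh_category_flags => :environment do
    ActiveRecord::Base.connection.execute(<<-SQL)
      UPDATE categories
      SET has_male_products = EXISTS (
        SELECT 1
        FROM resellercategories rc
        JOIN products p ON p.resellercategory_id = rc.id
        WHERE rc.category_id = categories.id
          AND p.gender IN ('male', 'unisex')
      )
    SQL
  end
end
```

One statement keeps the whole refresh atomic, so readers never see a half-updated set of flags.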

Just as a test, does your performance improve significantly if you add
a “has_male_products” column on categories and filter on that?

HTH,

Peter