MySQL query optimization for ORDER BY condition

Hello everyone! I have a question that I hope someone here will be able to answer. Today I was writing a simple test for a method that takes the last "n" (where n is an integer) users that signed up on my web app.

I know it's a simple method, but then I started to think about the optimal way to query the DB, and I tried different options. The first thing I did was create 1000 users through a simple:

1000.times {Factory(:student)}

The difference between the first and the last records is about 2 minutes.

first created_at: 2010-12-15 20:35:41
last created_at: 2010-12-15 20:37:51

And this is what I've got on the console when using ActiveRecord::Base.uncached:

User.order("created_at DESC").limit(5).reverse
User Load (7.4ms)  SELECT "users".* FROM "users" ORDER BY created_at DESC LIMIT 5

User.order("created_at ASC").last(5)
User Load (1179.1ms)  SELECT "users".* FROM "users" ORDER BY created_at ASC

User.last(5)  # I didn't even know I could add a parameter to last
User Load (1171.2ms)  SELECT "users".* FROM "users"

User.all.last(5)
User Load (1177.3ms)  SELECT "users".* FROM "users"

Well, there's obviously a huge performance difference between the first query and the rest. There are some things I would love to understand.

  • If the users are created in order, the last user is the last one that signed up. That means the first query would have had to order all 1000 users by date, take only 5, and then reverse the array, right? So why is it so fast?

  • The second query, in this case, would not have to order the users, since by default they are already in order, right? So it would only have to take the last 5 elements of the array. But why is it slower than the first one?

  • The third and fourth queries, I guess, load all the users into memory and then return only the last 5.

Any information about this would be really helpful. Thanks.

Daniel A. wrote in post #968711:

Hello everyone! I have a question that I hope someone here will be able to answer. Today I was writing a simple test for a method that takes the last "n" (where n is an integer) users that signed up on my web app.

I know it's a simple method, but then I started to think about the optimal way to query the DB, and I tried different options. The first thing I did was create 1000 users through a simple:

1000.times {Factory(:student)}

The difference between the first and the last records is about 2 minutes.

first created_at: 2010-12-15 20:35:41
last created_at: 2010-12-15 20:37:51

And this is what I've got on the console when using ActiveRecord::Base.uncached:

User.order("created_at DESC").limit(5).reverse
User Load (7.4ms)  SELECT "users".* FROM "users" ORDER BY created_at DESC LIMIT 5

User.order("created_at ASC").last(5)
User Load (1179.1ms)  SELECT "users".* FROM "users" ORDER BY created_at ASC

User.last(5)  # I didn't even know I could add a parameter to last
User Load (1171.2ms)  SELECT "users".* FROM "users"

User.all.last(5)
User Load (1177.3ms)  SELECT "users".* FROM "users"

Well, there's obviously a huge performance difference between the first query and the rest. There are some things I would love to understand.

Herewith my speculations.

  • If the users are created in order, the last user is the last one that signed up. That means the first query would have had to order all 1000 users by date, take only 5, and then reverse the array, right? So why is it so fast?

Probably because most of the work is being done in the database, which
is then only returning 5 records for Ruby to work with in memory. (If
you were using PostgreSQL, you could make it even faster by adding an
index on created_at. Come to think of it, maybe Rails already does
this, though I doubt it. However, MySQL is too stupid to do backward
scans of indices, IIRC.)
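
For what it's worth, adding that index yourself is a one-line migration. A minimal sketch (the migration class name is just illustrative; it assumes a users table with a created_at column):

class AddIndexOnUsersCreatedAt < ActiveRecord::Migration
  def self.up
    # Lets the database satisfy ORDER BY created_at ... LIMIT n from the index
    # instead of sorting the whole table.
    add_index :users, :created_at
  end

  def self.down
    remove_index :users, :created_at
  end
end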

  • The second query, in this case, would not have to order the users, since by default they are already in order, right? So it would only have to take the last 5 elements of the array. But why is it slower than the first one?

Well, the DB has to do the same operation (sort all the users by date),
though this time in reverse. However, then it returns a bigger result
set to Ruby, which means that Ruby has more data to manipulate in
memory. I’m surprised that the time difference is that big, though.

  • The third and fourth queries, I guess, load all the users into memory and then return only the last 5.

Yes – but they won’t do what you want. Since there’s no order
specified in your query, the DB will return records in an unpredictable
order, and so taking the last 5 of them won’t necessarily yield the 5
newest records.

Any information about this would be really helpful. Thanks.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Thanks Marnen.

I think the times only reflect the time taken to execute the SQL query, not the time to load the results into memory and sort the array.

In the first case, I'm curious about how MySQL can sort all the elements so fast. I've read (Ruby Algorithms: Sorting, Trie & Heaps - igvita.com) that Ruby's Array#sort implementation is pretty good; it uses quicksort. I tried sorting an already-sorted array of strings and another of integers in irb, and the result was almost instantaneous. Maybe MySQL uses quicksort too…
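
(For reference, that irb experiment was something along these lines, sketched here with an integer array:)

require 'benchmark'

sorted = (1..100_000).to_a          # an array that is already in order
Benchmark.realtime { sorted.sort }  # returns the elapsed time in seconds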

About the second query: you say it has to sort the users in the opposite order, in which case MySQL should take about the same time as before, but that is not the case. I see what you mean about Ruby, but I think that time is only for MySQL.

About the third and fourth queries: I've compared both results with queries one and two and I get the same array. Why would the DB return the records in an unpredictable order?

In Wikipedia (http://en.wikipedia.org/wiki/Quicksort), the sidebar on the right explains why quicksort is slow when the set is already sorted. Oh, this is also a pretty nice website (http://www.sorting-algorithms.com/reversed-initial-order); it has animations of different sorting algorithms.

Please quote when replying.

Daniel A. wrote in post #968730:

Thanks Marnen.

I think the times only reflect the time taken to execute the SQL query, not the time to load the results into memory and sort the array.

You’re right that the times don’t include in-memory sorting. But on
reflection, those times are ActiveRecord load times. That means they
most likely include the time to retrieve records from the DB – and then
to instantiate Ruby objects for each record!

Try this experiment: run those two queries from your favorite database
client, outside of Rails. See if the time difference persists.
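
Another way to separate the two costs without leaving the Rails console is to compare raw rows against instantiated models. A minimal sketch (select_all returns plain row hashes, while find_by_sql builds a User object per row):

require 'benchmark'

sql = 'SELECT * FROM users ORDER BY created_at ASC'

Benchmark.bm(12) do |x|
  x.report('raw rows')   { ActiveRecord::Base.connection.select_all(sql) }
  x.report('AR objects') { User.find_by_sql(sql) }
end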

In the first case, I'm curious about how MySQL can sort all the elements so fast.

Why not look up the MySQL source code and find out?

I've read (Ruby Algorithms: Sorting, Trie & Heaps - igvita.com) that Ruby's Array#sort implementation is pretty good; it uses quicksort. I tried sorting an already-sorted array of strings and another of integers in irb, and the result was almost instantaneous. Maybe MySQL uses quicksort too…

Perhaps. It probably has other tricks.

About the second query: you say it has to sort the users in the opposite order, in which case MySQL should take about the same time as before, but that is not the case. I see what you mean about Ruby, but I think that time is only for MySQL.

See above.

About the third and fourth queries: I've compared both results with queries one and two and I get the same array.

Sure, you will sometimes, but there’s no guarantee.

Why would the DB return the records in an unpredictable order?

Because that’s the way SQL databases work. The records within each
table have no intrinsic ordering, and the DB is allowed to return them
to you in whatever order it likes, unless you specify otherwise. If you
need a particular order, then you always, without exception, need to
specify what that order is.
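
To make that concrete, you can ask ActiveRecord for the SQL it will run. A rough sketch (to_sql is available on relations in Rails 3; the exact quoting in the output depends on your adapter):

User.limit(5).to_sql
# roughly: SELECT ... FROM users LIMIT 5  -- no ORDER BY, so the DB picks the order
User.order('created_at DESC').limit(5).to_sql
# roughly: SELECT ... FROM users ORDER BY created_at DESC LIMIT 5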

In Wikipedia (http://en.wikipedia.org/wiki/Quicksort), the sidebar on the right explains why quicksort is slow when the set is already sorted. Oh, this is also a pretty nice website (http://www.sorting-algorithms.com/reversed-initial-order); it has animations of different sorting algorithms.

I’ll check those out. I do not think, however, that the difference in
time you’re seeing is due to sort algorithms as such.

Best,

Marnen Laibow-Koser
http://www.marnen.org
[email protected]

Sent from my iPhone

Thanks! That makes much more sense.

Daniel A. wrote in post #968711:

Hello everyone! I have a question that I hope someone here will be able to answer. Today I was writing a simple test for a method that takes the last "n" (where n is an integer) users that signed up on my web app.

I know it's a simple method, but then I started to think about the optimal way to query the DB, and I tried different options. The first thing I did was create 1000 users through a simple:

1000.times {Factory(:student)}

The difference between the first and the last records is about 2 minutes.

first created_at: 2010-12-15 20:35:41
last created_at: 2010-12-15 20:37:51

And this is what I've got on the console when using ActiveRecord::Base.uncached:

User.order("created_at DESC").limit(5).reverse
User Load (7.4ms)  SELECT "users".* FROM "users" ORDER BY created_at DESC LIMIT 5

User.order("created_at ASC").last(5)
User Load (1179.1ms)  SELECT "users".* FROM "users" ORDER BY created_at ASC

User.last(5)  # I didn't even know I could add a parameter to last
User Load (1171.2ms)  SELECT "users".* FROM "users"

User.all.last(5)
User Load (1177.3ms)  SELECT "users".* FROM "users"

Well, there's obviously a huge performance difference between the first query and the rest.

Only the first of your four tested queries has a "LIMIT 5" in the SQL.

With that SQL, only 5 records are transferred to the Ruby array. This is because the limit(5) function does that (I assume on an ActiveRecord::Relation object) and can thus modify the SQL query.

When last(5) is called on the resulting ActiveRecord::Relation object, it acts more like the Array#last method and operates on data that has already been fully loaded from the database. So this would transfer all 1000 records to a Ruby array and then, in Ruby, select 5 records out of that array.

I presume that is the problem. For a database, processing 1000 records should not be a real problem, but transferring them to the Ruby program may take some time.
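
Roughly, the contrast looks like this (a sketch, assuming Rails 3 relation behaviour):

User.order('created_at DESC').limit(5)  # LIMIT 5 goes into the SQL; only 5 rows cross the wire
User.order('created_at ASC').last(5)    # no LIMIT in the SQL; all rows are loaded, then Array#last picks 5 in Ruby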

In almost all cases, the construct that triggers a LIMIT on the database is the better one, provided the ORDER BY can be expressed in the database (you can use database functions there to add some flexibility). This is typically used when you want to paginate through a very large dataset, because you do not want to load all the records into Ruby, only the small subset needed to render that one paginated page.
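
For example, page 3 of a listing at 20 users per page could look like this (a sketch; the page size and offset arithmetic are just illustrative):

per_page = 20
page     = 3
User.order('created_at DESC').limit(per_page).offset((page - 1) * per_page)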

So, I see two proper solutions, where you can clearly see that both the order and the limit are applied:

User.all(:order => "created_at DESC", :limit => 5)
User.order("created_at DESC").limit(5)

They both result in:

User Load (0.5ms)  SELECT "users".* FROM "users" ORDER BY created_at DESC LIMIT 5
User Load (0.3ms)  SELECT "users".* FROM "users" ORDER BY created_at DESC LIMIT 5
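
If you want this behind the kind of method the original post describes, a minimal sketch could be (the name latest is just illustrative):

class User < ActiveRecord::Base
  # Newest n users, newest first; the LIMIT keeps the work in the database.
  def self.latest(n)
    order('created_at DESC').limit(n)
  end
end

User.latest(5)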