Handling Hot Data with ActiveRecord

Sadaf_N · July 21, 2015, 4:31pm

Hello!

We are thinking of different ways to handle hot data in our system. We
get a lot of information that’s very relevant for periods of time. So,
for example, we have some data that we query on the basis of

counts over last 30 minutes
data collected in the last 30 minutes
items collected today
items collected within the past week
…and so on.

We are looking at different strategies to keep this data updated and to
manage scaling the database. So, we are looking at horizontal sharding
[splitting data into multiple tables that inherit from a master and have
exclusion constraints] and are also considering having tables that hold
data as:

All the data
Data for the last 24 hours
Data for the last 2 hours
That way, when we want something recent, we would just query the most
recent table, but in the few occasions that we need something more, we
go to the larger tables that are sharded.

Just for reference, we are doing something like this:
select count(*), count(distinct group_id) from
(select group_id, recorded_on from data_store_v2 order by id DESC limit
900000) td
where recorded_on >= (now() - ‘24 hr’::INTERVAL);

We are getting 800,000 data items a day right now and the above query
takes around 14 seconds on a single table that is not sharded and has
around 268million records. Every week, this table becomes slightly
slower since we add close to 6 million records every week.

I’ve read this:

and am looking at ways that everything is managed under Rails, if
possible.

So, the questions (and thanks for reading this far) are:

What is a good way to do this while still working within
ActiveRecord?
This is a capability that we’d like to attach to any model that might
need it. What would be a good way to approach a gem for it?

I’m sure there will be a few more questions as we progress on this.

Thanks,
Mohit.

mohits · July 21, 2015, 9:34pm

First I would work to learn patterns in time increments, which occur
most
often, and plan from there. Then I would design with regard to DB code
and
servers, optimal Master-slave replication. Pretty sure that in
replication
WHERE clauses like (where recorded_on >= (now() - ‘24 hr’::INTERVAL)
could be eliminated at the user level. But I am not certain.

mohits · July 22, 2015, 3:32pm

Hi Elizabeth,

Thank you for replying.

On 22/7/2015 3:33 AM, Elizabeth McGurty wrote:

First I would work to learn patterns in time increments, which occur
most often, and plan from there.

We do know that - we have a few peak hours when more data comes in, and
then there are hours (like night time) where the data drops to a trickle
for some of the records. However, there are some status emails, etc.
that come at a fixed frequency - few times an hour, irrespective of the
hour.

Then I would design with regard to DB code and servers, optimal
Master-slave replication. Pretty sure that in replication WHERE
clauses like (where recorded_on >= (now() - ‘24 hr’::INTERVAL) could
be eliminated at the user level. But I am not certain.

While this is something that might help, I’m more looking at solutions
that involve having hot data tables in the manner:

All the data
Data for the last 24 hours
Data for the last 2 hours
That way, when we want something recent, we would just query the most
recent table, but in the few occasions that we need something more, we
go to the larger tables that are sharded.

Best Regards,
Mohit.

mohits · July 22, 2015, 11:33pm

Amen! Amen! To what Scott has written. I actually wanted to offer an
alternative SQL procedure but thought that I shouldn’t as out of context
with ROR…

mohits · July 23, 2015, 5:11am

Hi Scott,

Thanks for your email. Your inputs are certainly useful.

On 22/7/2015 9:54 PM, Scott R. wrote:

On Jul 21, 2015, at 8:30 AM, Mohit S. [email protected] wrote:

select count(*), count(distinct group_id) from
(select group_id, recorded_on from data_store_v2 order by id DESC limit 900000)
td
where recorded_on >= (now() - ‘24 hr’::INTERVAL);
Some suggestions in a slightly different direction:

rewrite without that completely unnecessary subquery, and see if the query
time improves

We have tried this and the query is quite a bit slower. Filtering to
the last 900k records before doing the recorded_on part helped speed it
up.

check that appropriate indexes exist and are being used by the optimizer

really analyze the query execution plan, and look for more advanced
opportunities, for instance, order by recorded_on instead of id, since,
presumably, the query will already access the rows by an index on ordered_on;
consider dropping the limit & order altogether; take that query to a PostgreSQL
list…

Your email got me going back to look at all the parts again since
obviously the query should be using the index and it was still slow.
Further search last night made me realize that it’s not the indexes that
are a problem. The problem is the count(distinct group_id) part which
seems to be quite slow in PostgreSQL. This is a lot faster:
select count(*) from
(select distinct group_id from
data_store_v2 where recorded_on >= ‘2015-06-06’) td;
than:
select count(distinct group_id) from
data_store_v2 where recorded_on >= ‘2015-06-06’;

as explained here:

So, I guess the real problem was being masked by something else and an
incorrect assumption on my part

put some effort into learning SQL better; OK, we all make mistakes sometimes
and maybe this is just that; but it sure looks to me like someone who doesn’t
really understand SQL struggling at throwing together various clauses until the
correct answer pops out (note that in addition to the issue I pointed out in 1,
the subquery is selecting a column which is completely unused–probably doesn’t
affect anything, but just another sign that the person writing the query did not
understand it).

Thanks for the analysis
I do understand SQL and I thought I’m not throwing things together…
for my understanding, which column was unnecessary? I thought we needed
all:

group_id for counting the distinct group_id
recorded_on for the subsequent query on it
id only for getting the most recent records

Most of my personal work is with SQLite3 on embedded platforms, but this
discussion resolves the problem for now. It now moves away from being a
Rails issue to being a PostgreSQL issue.

Best Regards,
Mohit.

mohits · July 23, 2015, 5:55am

On Jul 22, 2015, at 9:10 PM, Mohit S. [email protected] wrote:

We have tried this and the query is quite a bit slower. Filtering to the last
900k records before doing the recorded_on part helped speed it up.

I don’t understand how that could possibly be the case if there’s an
index on recorded_on.

So, I guess the real problem was being masked by something else and an incorrect
assumption on my part
I would expect the select count(distinct…) to be a major contributor
to the time taken by the query, just given the amount of work it must
do. The select count(*) from (select distinct…) alternative is a nice
tip

Thanks for the analysis
I do understand SQL and I thought I’m not throwing things together… for my
understanding, which column was unnecessary? I thought we needed all:

group_id for counting the distinct group_id
recorded_on for the subsequent query on it
id only for getting the most recent records

I apologize–I misread the query structure. I got it into my head as:

select … from (select … from … where recorded_on … order by …
limit …)

I think you can see how THAT query would have better fit my description
of being poorly constructed.

I’m glad that my somewhat off-base pontification still managed to point
you in a useful direction!

–
Scott R.
[email protected]
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice

mohits · July 22, 2015, 3:55pm

On Jul 21, 2015, at 8:30 AM, Mohit S. [email protected] wrote:

select count(*), count(distinct group_id) from
(select group_id, recorded_on from data_store_v2 order by id DESC limit 900000)
td
where recorded_on >= (now() - ‘24 hr’::INTERVAL);

Some suggestions in a slightly different direction:

rewrite without that completely unnecessary subquery, and see if the
query time improves
check that appropriate indexes exist and are being used by the
optimizer
really analyze the query execution plan, and look for more advanced
opportunities, for instance, order by recorded_on instead of id, since,
presumably, the query will already access the rows by an index on
ordered_on; consider dropping the limit & order altogether; take that
query to a PostgreSQL list…
put some effort into learning SQL better; OK, we all make mistakes
sometimes and maybe this is just that; but it sure looks to me like
someone who doesn’t really understand SQL struggling at throwing
together various clauses until the correct answer pops out (note that in
addition to the issue I pointed out in 1, the subquery is selecting a
column which is completely unused–probably doesn’t affect anything, but
just another sign that the person writing the query did not understand
it).

–
Scott R.
[email protected]
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice

mohits · July 23, 2015, 2:13pm

On Jul 22, 2015, at 11:27 PM, Mohit S. [email protected] wrote:

Hi Scott,

On 23/7/2015 11:54 AM, Scott R. wrote:

On Jul 22, 2015, at 9:10 PM, Mohit S. [email protected] wrote:

We have tried this and the query is quite a bit slower. Filtering to the last
900k records before doing the recorded_on part helped speed it up.
I don’t understand how that could possibly be the case if there’s an index on
recorded_on.

Because it was the count(distinct x) that was the problem
Doing only a count(*) is faster without the subquery… and is what we have
switched to.

So does the limit reduce it to less than 1 day’s rows?

–
Scott R.
[email protected]
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice

mohits · July 23, 2015, 7:29am

Hi Scott,

On 23/7/2015 11:54 AM, Scott R. wrote:

On Jul 22, 2015, at 9:10 PM, Mohit S. [email protected] wrote:

We have tried this and the query is quite a bit slower. Filtering to the last
900k records before doing the recorded_on part helped speed it up.
I don’t understand how that could possibly be the case if there’s an index on
recorded_on.

Because it was the count(distinct x) that was the problem
Doing only a count(*) is faster without the subquery… and is what we
have switched to.

So, I guess the real problem was being masked by something else and an
incorrect assumption on my part
I would expect the select count(distinct…) to be a major contributor to the
time taken by the query, just given the amount of work it must do. The select
count(*) from (select distinct…) alternative is a nice tip

…and that is what the cause was.

I’m glad that my somewhat off-base pontification still managed to point you in a
useful direction!

Yes, thanks again.

Best Regards,
Mohit.

mohits · July 23, 2015, 10:47pm

On Jul 23, 2015, at 7:14 AM, Mohit S. [email protected] wrote:

SELECT count(*) only with a limit of 900k records based on ID DESC, followed
by the recorded_on part = 700ms

SELECT count(*) on the whole table using only recorded_on in the WHERE = 350ms

That’s kind of what I expect. That adding the count(distinct…) makes
them slower is not a surprise, but I was surprised that adding the
count(distinct…) inverts the relative performance of those two.

So, anyway, it sounds like you might have gotten it fast enough. If not,
bounce the discussion over to pg’s general mail list, and we can talk
about how to maintain a summary table without going through the hassle
of full-on sharding

–
Scott R.
[email protected]
http://www.elevated-dev.com/
https://www.linkedin.com/in/scottribe/
(303) 722-0567 voice

mohits · July 23, 2015, 3:16pm

Hi Scott,

On 23/7/2015 8:12 PM, Scott R. wrote:

Because it was the count(distinct x) that was the problem
Doing only a count(*) is faster without the subquery… and is what we have
switched to.
So does the limit reduce it to less than 1 day’s rows?

Yes, the idea of the limit (900k records) was to limit it to the records
within 1 day and then run the query on the smaller record set. However,
after your comments, when I was looking around, I resolved that the
slowness is because of the count (distinct x) and not because of the
indexes, etc.

For reference, on a 230million record table, the numbers were roughly
along these lines:

SELECT count(*), count (distinct group_id) with a limit of 900k
records based on ID DESC, followed by the recorded_on part = 8.6 seconds
SELECT count(*), count (distinct group_id) on the whole table using
only recorded_on in the WHERE = 14 seconds
SELECT count(*) only with a limit of 900k records based on ID DESC,
followed by the recorded_on part = 700ms
SELECT count(*) on the whole table using only recorded_on in the WHERE
= 350ms
–> Clearly, the culprit was the count (distinct group_id) - that
benefits a lot by using a subquery to limit the number of records it
considers
SELECT count(*) from (select distinct group_id from data_store_v2
where recorded_on >= ‘’) --> takes around 900ms

So, we are combining these in the final query now… this takes around
900ms to get both values (count and count distinct)
– get the fields from 2 different subqueries
select * from
– first field is got for the count()
(select count() AS all_count from data_store_v2 where recorded_on >=
(now() AT TIME ZONE ‘Asia/Singapore’ - ‘24 hr’::INTERVAL)) as t1,
– and Joining in the second one for the count (distinct group_id)
(select count(*) from
– this is yet another subquery
(select distinct drive_id from data_store_v2 where recorded_on >=
(now() AT TIME ZONE ‘Asia/Singapore’ - ‘24 hr’::INTERVAL)) td ) as t2;

But as I mentioned, this is now a PostrgreSQL question not an
ActiveRecord or Rails question

Thanks for digging with me!

Best Regards,
Mohit.