So many users, so little time - looking for a better way

OK, time to come clean. Right now my app is doing something really
cheesy. The first step is always admitting that you have a problem,
right?

So here’s the deal. One of the features of my app is that, based on
some user settings, it is supposed to send out “reminder” emails at
certain times of the day. Each user has their own time settings for
these reminders.

Right now I have a background task that, once every 15 mins, looks at
every gosh darn user and decides whether or not to send them a reminder.
Once every 15 mins I hit every single user row in the database.
brrrrrrr. Yeah, I know, gives me the willies too.

So is there some sort of “best practice” established for doing some
caching/intelligent scheduling to overcome this horrendous misuse of
the database? Is there some sort of plugin that would let me sort by
that time-of-day column and intelligently find just those rows within a
certain range of times?

I’m very nearly an SQL virgin, so if this is painfully simple and I’m
just a moron, be gentle.

thanks,
jp

You should be able to construct a query (or use find with :conditions)
and let the database do the work. Post the relevant tables, fields, and
classes and I’m sure you’ll get many offers of help.

Peace.

On Mon, Oct 20, 2008 at 7:33 PM, Jeff P.
[email protected] wrote:

Right now I have a background task that, once every 15 mins, looks at
every gosh darn user and decides whether or not to send them a reminder.
Once every 15 mins I hit every single user row in the database.
brrrrrrr. Yeah, I know, gives me the willies too.

So, gosh, maybe you’re running your server on an ancient Intel 286
box on Craplix “Muddy Mammoth” – 15 minutes? Even that box
executed, oh, eleventeen gazillion instructions in that time period.

Plus /yet another/ SQL query. Brutal. I’ll bet your phone is ringing off
the wall with customer complaints.

When your traffic gets twitteresque, please call back. Operators are
standing by.

/* see also “premature optimization” */ :slight_smile:

Hassan S. ------------------------ [email protected]

Build yourself a carousel with buckets some time (15 min?) apart.
Install each user’s task in its bucket and rotate the carousel with
the clock. As a bucket comes up, service any tasks it contains and
move them to the appropriate future bucket.

This is fairly easy to do with a doubly linked list - I’m not too sure
yet how it maps into the MVC world view.
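The carousel above is what systems folk call a timing wheel. A minimal pure-Ruby sketch, assuming 15-minute buckets and a hypothetical `Task` struct standing in for a user's reminder job (a plain array works as well as a doubly linked list here):

```ruby
# A tiny timing wheel: 96 buckets of 15 minutes each cover one day.
# "Task" is a hypothetical stand-in for a user's reminder job.
Task = Struct.new(:name, :minute_of_day)

BUCKET_MINUTES = 15
BUCKETS = 24 * 60 / BUCKET_MINUTES # 96 buckets per day

wheel = Array.new(BUCKETS) { [] }

# Install a task in the bucket that covers its scheduled minute.
def install(wheel, task)
  wheel[task.minute_of_day / BUCKET_MINUTES] << task
end

# One clock tick: empty the current bucket, service its tasks, and
# re-install each one for its next run (here, the same time tomorrow;
# in general, "the appropriate future bucket").
def service(wheel, minute_of_day)
  bucket = minute_of_day / BUCKET_MINUTES
  due = wheel[bucket]
  wheel[bucket] = []
  due.each { |task| install(wheel, task) }
  due.map(&:name)
end

install(wheel, Task.new("alice", 9 * 60))  # reminder at 09:00
install(wheel, Task.new("bob",   21 * 60)) # reminder at 21:00
serviced = service(wheel, 9 * 60)          # the 09:00 tick
```

Each tick only touches one bucket, so the cost per tick is proportional to the tasks actually due, not to the total user count.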

On Oct 20, 4:33 pm, Jeff P. [email protected]

Hi Jeff,

Depending on how you have the data stored, yes, you should be able to
do something in SQL to make the search better.

Does each user just have one time of day assigned to them? If so, then
you can just put a condition on the find query.

Something like user.reminder_time > start_range AND user.reminder_time <
end_range and let the DB do the grunt work for you.

If the user notification is something like ‘every 4 hours’ then you
may have to process each user individually to work out whether the
current time fits that range. Or have a ‘next_notification’ column that
you update each time you send a notification to that user, then do the
db query on the next_notification field.

Cheers
Simon

On Tue, 21 Oct 2008 10:33:11 +0800, Jeff P.

On Mon, Oct 20, 2008 at 10:08 PM, Bsytlz [email protected] wrote:

There is no such thing as premature optimization, maybe only
mis-prioritized optimization (there may be more impactful things to
address first), but if you wait until you “have to” optimize, it is
too late.

And the OP presented zero evidence of an actual problem, or hint
of any potential problem. Running one SQL query every 15 minutes
seems, uh, not too much to worry about. IMO.

There is, however, such a thing as a premature, jerky, smart-ass
response like the one above…

Thanks for the evaluation. And your qualifications on constructing
and running high-traffic web sites? Just curious.


Hassan S. ------------------------ [email protected]

When your traffic gets twitteresque, please call back. Operators are
standing by.

/* see also “premature optimization” */ :slight_smile:

There is no such thing as premature optimization, maybe only
mis-prioritized optimization (there may be more impactful things to
address first), but if you wait until you “have to” optimize, it is
too late. Especially if you consider that optimization may require
rethinking table structure and/or other fundamental architecture.
There is, however, such a thing as a premature, jerky, smart-ass
response like the one above…

That said, a couple of questions first:

  • How is your query structured now? find(:all, ???)?
  • What restrictions do you have on the reminder period, if any? Can
    they choose any time of the day, any frequency? Or is it restricted
    to “every 15 minutes”, “every hour”, etc.?
  • How many users do you have now? How many do you expect to need to
    scale to?
  • How long are the queries taking now? Too long?

These are questions that you should answer/ask when making a decision
on the need and approach for optimization. For instance, restricting
the notification times/intervals to a few fixed values lends itself to
multiple improvements, at a caching/queueing level as well as at an SQL
query level.

A few VERY general ideas on how to approach this are:

  1. SQL optimization: use direct SQL connections and optimize the query
    as appropriate for your DB type, using constructs such as SORT, GROUP
    BY, creating indexes, etc.
  2. Do the sort/search in “object space” using a partitioning approach.
  3. Fake a crude “bucketing” system using additional look-up tables, or
    more specifically has_many relationships, to partition based on fixed
    intervals: for example, an hours_of_day model that
    has_and_belongs_to_many :reminders, with one row for each of the 24
    hours of the day. You could then relate each reminder to one of the
    rows. Your background process then asks for all records associated
    with the row representing the current hour…convoluted, I know. The
    performance benefit may be questionable because of the JOINs required
    for a habtm look-up, but it should be less brute force than a find
    all.
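Idea 3 can be emulated in plain Ruby with a Hash standing in for the hours_of_day look-up table (names are hypothetical; the real version would be a has_and_belongs_to_many between an hours_of_day model and reminders):

```ruby
# reminder id => hour of day (0..23) it should fire at; the ids and
# hours here are made-up illustration data.
reminder_hours = { 101 => 9, 102 => 14, 103 => 9 }

# Build the look-up: hour => ids of the reminders attached to that row.
# A Hash with a default block plays the part of the join table.
hours_of_day = Hash.new { |h, k| h[k] = [] }
reminder_hours.each { |id, hour| hours_of_day[hour] << id }

# The background process only ever touches the current hour's bucket,
# instead of scanning every reminder row.
current_hour = 9
due = hours_of_day[current_hour].sort
```

The same shape carries over to SQL: the habtm JOIN restricts the scan to the one hours_of_day row for the current hour.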

I have quite a few other ideas, but will leave it at this…

There are up to two per day for each user, but they can be handled
separately. Yes, currently using the disgusting “find(:all)” approach.

No, with just 100 users so far there is no sign of a problem. I just
consider the current method to be very inelegant and would rather do it
in a more intelligent way. Even if, once we get to a few thousand
users, it still isn’t causing a “problem”, it will be using significant
resources, thus reducing the responsiveness of the app. That said, some
up-front profiling of the current method might convince me that I’m
wasting my time. I’ll do that first. Point taken (even if made with
more than a touch of venom).

So is it generally the case that having the db hunt down the few records
we’re after each time will execute substantially faster than returning
all of the records and doing the comparison “outside the database”? As
a novice db guy, the two approaches sound roughly similar, since both
cases involve doing essentially the same comparison on the same number
of records. Perhaps there is a lot of overhead involved in returning
records to the app.

I would think that some method of caching the info such that it didn’t
have to actually go to the db every time would make the biggest
difference in execution time. Unfortunately, such a solution would also
be more complex than just using a better query.

thanks for all the responses.

best,
jp


Jeff P. wrote:

So is it generally the case that having the db hunt down the few records
we’re after each time will execute substantially faster than returning
all of the records and doing the comparison “outside the database”? As
a novice db guy, the two approaches sound roughly similar, since both
cases involve doing essentially the same comparison on the same number
of records. Perhaps there is a lot of overhead involved in returning
records to the app.

Yes, the database is going to be faster at figuring out which records
need to be returned, ASSUMING proper indexing, table structure, and a
generally sound approach to querying. If you don’t use indexes, your
data is not structured well, and you’re querying on half the fields in
the database, it may well be faster to just pull it all back into Rails
and work through it.

As to why this is so, two major things come to mind: marshaling the data
from the db server to the application and object creation. Suppose you
have your thousand users in your system, but only 20 of them have
reminders to send. If the database finds them for you, it returns just
twenty and your application creates just twenty objects (and you have no
more decisions to make). If you pull all one thousand back first, the
database has to send more data back to you and then you’re going to
create 1000 objects in your application. If you need related data for
the decision making, you’re going to be making more than 1000 objects,
and possibly more database queries (unless you’ve used eager loading,
which increases the amount of data sent back to the app).

Whatever creates the data should be the fastest at ripping through it.
The database engine is optimized for data storage and retrieval. When
used properly, nothing is going to be faster than the db engine’s native
access to the data.

Peace.

Did you read the original post?

It’s not just one SQL query; it is one SQL query and then a hit on
every user object returned to see if it is time to send the email.
Anyone can see that eventually this is going to cause a problem, and I
don’t know about you, but I would rather find that sort of thing now,
before my customers notice a problem, than wait until it actually
causes one.

Oh, and you don’t actually need a qualification to be able to spot a
premature, jerky response, just some experience on the internet.

On Tue, 21 Oct 2008 13:28:22 +0800, Hassan S.

Jeff,
A few pointers, since I am maintaining and enhancing a rather large
application with similar functionality. It seems to work well (most
of the time):

I am firing an SQL query every minute to check for reminders, instead
of every 15 minutes, so the actual query is not an issue. Here is how
it is constructed: you may want to create a separate table called
“reminders” (or something similar) which stores reminder records for
only those users who want them. That way, if the majority of your
users do not use this functionality, you are not incurring unnecessary
overhead. This table will have an id, as every Rails table does, and
will have an index on id. It should also have the foreign keys, at
least for the users (user_id); make sure that it is indexed as well,
since you will be joining the reminders and users tables on this
foreign key. You can have other foreign keys as well, for example
event_id if the reminders are for particular events. That is how a
one-to-many relationship works. You may want to refer to any of the
online tutorials available on basic database design, which is a
prerequisite to making good Rails applications.
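The table Bharat describes might be sketched as a Rails migration. This is a non-runnable fragment, and the column and index names (remind_at, event_id) are illustrative assumptions, not his actual schema:

```ruby
# Hypothetical migration for a reminders table along the lines above.
class CreateReminders < ActiveRecord::Migration
  def self.up
    create_table :reminders do |t|
      t.integer  :user_id,  :null => false  # FK to users; indexed below for the join
      t.integer  :event_id                  # optional FK, if reminders target events
      t.datetime :remind_at                 # when the reminder should fire
    end
    # Index the foreign key you join on and the column you query every minute.
    add_index :reminders, :user_id
    add_index :reminders, :remind_at
  end

  def self.down
    drop_table :reminders
  end
end
```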

The scheduling of these recurring tasks is also important. You want
to use a cron-style scheduler that does not load the Rails environment
every time the task is run. If you use the unix/linux cron to
schedule a rails runner, it will load the Rails environment every time
you call the task, which is a very slow and demanding process. My
application uses BackgrounDRb. I am looking for alternatives which
are simpler, easier to debug, and more reliable.

The point of all this is that you are not taking on a trivial task,
but rather a substantial one, if you want to better architect your
application, so you may want to think it through carefully. Having
said that, you are thinking of the right things. Maybe you want to
experiment with a prototype which is not in your critical path?

Hope this helps.

Regards,

Bharat

On Tue, Oct 21, 2008 at 8:36 PM, Bharat [email protected] wrote:

The scheduling of these recurring tasks is also important. You want
to use a cron-style scheduler that does not load the Rails environment
every time the task is run. If you use the unix/linux cron to
schedule a rails runner, it will load the Rails environment every time
you call the task, which is a very slow and demanding process. My
application uses BackgrounDRb. I am looking for alternatives which
are simpler, easier to debug, and more reliable.

Can you file a bug report with a reliable way to reproduce them? I am
sure we can iron it out.

On Tue, Oct 21, 2008 at 12:55 AM, Jeff P.
[email protected]

There are up to two per day for each user, but they can be handled
separately. Yes, currently using the disgusting “find(:all)” approach.

Since you mentioned using a background task for this, I assumed a
straight SQL query. My bad.

No, with just 100 users so far there is no sign of a problem. I just
consider the current method to be very inelegant and would rather do it
in a more intelligent way. Even if, once we get to a few thousand
users, it still isn’t causing a “problem”, it will be using significant
resources, thus reducing the responsiveness of the app.

My point, which I attempted – apparently unsuccessfully! – to make
humorously, is that running a single SQL query against a properly
indexed database, even with a row count in the millions, is trivial.
Really. And if it’s being done from a background app, it should have
little or no effect on your Rails app’s performance from the user’s
perspective anyway.

So is it generally the case that having the db hunt down the few records
we’re after each time will execute substantially faster than returning
all of the records and doing the comparison “outside the database”?

Absolutely – that’s what an RDBMS is designed for.

I would think that some method of caching the info such that it didn’t
have to actually go to the db every time would make the biggest
difference in execution time. Unfortunately, such a solution would also
be more complex than just using a better query.

The execution time to retrieve the list and build the queue is trivial
compared to the amount of time to actually process the mail queue,
particularly as the user count grows.

Still, for the sake of argument – build your “mailing list” in memory,
and update it when a change event occurs in the app (user submits
a changed profile, new user joins, etc.). Though you’ll still incur at
least one update/insert into the DB, which is more expensive than a
select.
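For the sake of that same argument, the in-memory “mailing list” might look like this in plain Ruby (a hypothetical class; a real app would also guard it with a mutex and rebuild it when the process restarts):

```ruby
# An in-memory mailing list kept in sync by change events from the app.
class MailingList
  def initialize
    @times = {} # user_id => reminder time, as seconds past midnight
  end

  # Call on every change event: profile edit, new user joins, etc.
  def update(user_id, seconds_past_midnight)
    @times[user_id] = seconds_past_midnight
  end

  def remove(user_id)
    @times.delete(user_id)
  end

  # Ids of the users due in the window [from, from + length).
  def due(from, length)
    @times.select { |_id, t| t >= from && t < from + length }.keys.sort
  end
end

list = MailingList.new
list.update(1, 9 * 3600)         # user 1: 09:00
list.update(2, 21 * 3600 + 900)  # user 2: 21:15
morning = list.due(9 * 3600, 900)
```

Note the trade-off Hassan points out: the profile write still hits the database, so this only saves the periodic read, at the cost of keeping two copies of the data consistent.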

One last venom-free comment regarding premature optimization –
if you “fix” something today that /isn’t/ broken, you may be making it
harder to fix a /real/ bottleneck that crops up tomorrow.

Just sayin’ …

Best of luck,

Hassan S. ------------------------ [email protected]

Hemant K. wrote:

Can you file a bug report with a reliable way to reproduce them? I am
sure, we can iron it out.

OK Hemant, I will try to do it as constructively as possible. You have
been most helpful. Just the thought of debugging BackgrounDRb requires
a lot of determination and will-power. Is there a simple guide to
debugging a malfunctioning BackgrounDRb worker anywhere? I searched
long and hard on the web and did not come up with much. If you could
publish a series of tutorials on debugging, that would be most helpful.
The Advanced Rails Recipes book recipe is of no use now, since the API
has changed.

I have done everything by the book and my app still produces obscure
debugging messages in backgroundrb_debug.log files, though things seem
to be working. Some decent troubleshooting guides are what the
community sorely needs.

Regards,

Bharat

First of all, this is not an easy task.

There are two main parts to fulfilling it: gathering the users that
need to be reminded, and sending the actual reminders.

Depending on your application’s data structure, the selection of users
should be done by the database engine, as databases are built for that
kind of task.
You should build an appropriate SQL statement, optimize indexes for that
query, and try to gain as much performance from it as possible.

Sending the reminders is a different issue, and I would probably use a
set of daemons to perform it.

However, there are some additional issues you have to address.

Like:

  • What if, for any reason, the reminders are not sent at the
    appropriate time? Should you ignore the unsent reminder or not?
  • How would you handle an increased number of reminders if your system
    cannot send them all in one cycle? How will you detect such a
    situation?
  • If you use multiple threads to send the reminders, how will they
    share the same cache of selected users?

This is one of those interesting problems that arise from a seemingly
easy new feature like sending a reminder to the user.

If we have more information, perhaps our observations could be more
concrete.

Jeff,

You could use the same approach as Ryan in his last background job
screencast: #129 Custom Daemon - RailsCasts.

Or take a look at a scheduler for the task (passing along the
specific user information).

I spent some time researching background tasks, my notes might help
you:

BACKGROUND PROCESSES

http://wiki.rubyonrails.org/rails/pages/HowToRunBackgroundJobsInRails

== Asynchronous Tasks

Launched from the Rails request/response cycle + optional Message Queueing

  • Options
    . ar_mailer (only for emails) => GOOD (only email)
    . starling + workling => EXCELLENT
    . Ap4r
    . Spawn / background => SIMPLE
    . ActiveMessaging
    . BackgroundFu
    . WorkerQueue
    . Rake in Background => SIMPLE
    (#127 Rake in Background - RailsCasts)

== Recurring Jobs (cron-style)

Scheduled or permanently running in the background

  • Options
    . daemon_generator (rails_cron) => SIMPLE
    #129 Custom Daemon - RailsCasts
    . BackgrounDRb
    . rufus-scheduler => GOOD
    . Taskr
    . UNIX Cron and Runner (simple, but consumes resources and you
    can’t control it well)

Cheers, Sazima

On Oct 21, 12:33 am, Jeff P. [email protected]

On Wed, Oct 22, 2008 at 12:43 AM, Bharat R.
[email protected] wrote:

hard on the web and did not come up with much. If you could publish a
series of tutorials on debugging that would be most helpful. The
Advanced Rails Recipes book recipe is of no use now since the API has
changed.

I have done everything by the book and my app still produces obscure
debugging messages in backgroundrb_debug.log files, though things seem
to be working. Some decent troubleshooting guides are what the
community sorely needs.

Okay, I will see what can be done about this. In the meantime, you
can upgrade to the 1.1 version, where debugging is more sanitized.