Production deployment speed "wakeup" issue

The deployment scenario…

Apache2 on shared host, proxying to lighttpd, which has 3 external
fcgis running on localhost. The fcgis are managed by spinner/spawner.

We’re noticing a definite speed issue on “first requests” to this site.

For example:

  • Hit the site a few times, paying no attention to load time
  • Wait x period of time (haven’t quite narrowed this down yet, but
    probably 5-10 mins)
  • Hit site again once - this request will take anywhere from 5 - 30
    or so seconds
  • Reload site a few times - these requests will be very quick - less
    than one second

These load times are reflected not just in the “feel” we get from
using the site; they’re confirmed by the timings in production.log.

What’s odd is that where the time goes is inconsistent. The DB
portion is always very, very small, even on the long “first request”
hits. The overall completed time might be, say, 10 seconds. Sometimes
the ‘Rendering’ component accounts for 6-7 seconds of that, but
sometimes it’s very small (under 1 second) and the other 8-9 seconds
that aren’t explained by either Rendering or DB time are lost
to…something?
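
For context, a Rails 1.x production.log completion line looks roughly
like this (the numbers here are made up, just to illustrate the kind
of breakdown I mean - Rendering and DB together explain almost none of
the total):

Completed in 10.21343 (0 reqs/sec) | Rendering: 0.41277 (4%) | DB: 0.01945 (0%) | 200 OK [http://example.com/]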

So, basically…

  • Has anyone seen this issue before and know what the problem is?
  • Are there settings in any of apache2, lighttpd or rails itself that
    I’m unaware of which might cure this?

The app uses the Globalize plugin, but is otherwise pretty standard.
We’ve tried most combinations of switching between Rails 1.0 and
Rails 1.1.2, tweaking ActionController::Base.allow_concurrency (we
were also getting the “dropped mysql conn” errors in dev mode…),
tweaking ActionView::Base.cache_template_loading (we thought that
might be slowing views down?), and so on, all to no avail.
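
For the record, this is the sort of thing we were toggling in the
environment config (just a sketch of the two settings mentioned above,
not a recommendation - neither made any difference for us):

# e.g. after the initializer block in config/environment.rb,
# or in config/environments/production.rb
ActionController::Base.allow_concurrency = false
ActionView::Base.cache_template_loading = true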

Thoughts?

-Matt

Matt Jankowski wrote:

Thoughts?

-Matt

I ran across this in Apache’s proxy docs…

If you’re using the ProxyBlock directive, hostnames’ IP addresses are
looked up and cached during startup for later match test. This may take
a few seconds (or more) depending on the speed with which the hostname
lookups occur.

Matt Jankowski wrote:

probably 5-10 mins)

  • Hit site again once - this request will take anywhere from 5 - 30 or
    so seconds
  • Reload site a few times - these requests will be very quick - less
    than one second

This has come up a number of times on the list. It may be that your
sleeping fcgi processes are swapped out, and take time to be brought
back to life. Various people have recommended using cron and wget (or
curl) to request a dynamic page every few minutes to keep response times
short.
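
A crontab entry along these lines is enough (the /ping path here is
hypothetical - any cheap dynamic action will do):

*/5 * * * * curl -s -o /dev/null http://www.example.com/ping

or, with wget:

*/5 * * * * wget -q -O /dev/null http://www.example.com/ping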

either Rendering or DB time are lost to…something?

The slow rendering is more puzzling than the “missing time” - Rails
couldn’t measure the time taken to swap a process back in.

regards

Justin

Matt Jankowski wrote:

Thoughts?

No thoughts, but here’s a hack:-) I was seeing this problem (shared
host, running Apache/fcgi), and the occasional long connect times made
me think my fcgi dispatcher was getting swapped out. So I added this to
one of my controllers:

# cheap dynamic action to keep the dispatcher warm
def ping
  render :text => "Ping!"
end

And I run this script on my desktop machine:


require 'open-uri'

# Fetch the URL and return the response body; print any error and
# return an empty string instead.
def pingit(url)
  stuff = ''
  begin
    open(url) do |f|
      stuff = f.read
    end
  rescue Exception => e
    puts("#{e.class}: #{e.message} in #{url}\n")
  end
  stuff
end

# Hit the URL given on the command line every ten minutes, forever.
while true
  puts Time.new
  s = pingit(ARGV[0])

  puts s

  sleep(600)
end


This way I can say "./Pingit.rb http://domain/controller/ping" and it
will hit the site every ten minutes, showing me any errors or failures
to connect. It seems to work fairly well – the site responds pretty
consistently in a second or two – but this is a totally heuristic
approach.

–Al Evans

Just following up on my own post from a while back, with a report on
how the issue described above resolved itself.

Biggest issues we found:

  • HUGE problem - the Linux kernel the machine was running was a
    release from the 2.4 series with big VM/swap issues. This machine -
    which had been running a J2EE app at decent speed and under very
    little load for the past year - was recently repurposed to host a
    few Rails applications. I have no idea why the Rails apps brought
    out demons that the J2EE app had not, but they did. Moral of the
    story: make sure your kernel release is up to date.

  • Remember to index your DB! Maybe it’s because I’m thinking in terms
    of models and not in terms of DB tables/rows, but I consistently
    forget to add indexes to my tables while using migrations to create
    the DB. Needless to say, going back in and indexing frequently used
    associations provided a HUGE speedup for the application (see the
    migration sketch after this list).

  • Lighttpd / Apache issue - we found a strange condition with Apache
    proxying back to lighty where, on certain requests (usually asset
    files - js, css, images, etc.) over ~20k in size, we’d get a lockup
    between Apache and lighty. With a high timeout on the proxy, this
    leads to a scenario where the browser has the entire HTML page but
    is waiting on some assets to render, and sits there until the proxy
    has timed out. We’ve since switched to
    Apache 2.2.2 / mod_proxy_balancer / mongrel, and aren’t particularly
    interested in tracking down what the actual issue was.
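
On the indexing point, a migration along these lines is all it takes
(table and column names here are hypothetical - index the foreign keys
behind whatever associations you traverse on every request):

class AddAssociationIndexes < ActiveRecord::Migration
  def self.up
    # index the foreign key columns used by frequently traversed associations
    add_index :line_items, :order_id
    add_index :orders, :user_id
  end

  def self.down
    remove_index :line_items, :order_id
    remove_index :orders, :user_id
  end
end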

So, in conclusion, the mongrel / Apache / mod_proxy_balancer setup
(along with rewrite rules in Apache to serve static requests) is
absolutely great, and easy to manage with Capistrano and
mongrel_cluster. With the kernel fix, the removal of lighty, and the
DB indexing, the application is much, much quicker and the “first page
wait” issue is essentially gone.
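
For anyone setting up the same thing, the Apache 2.2 side of it looks
roughly like this (ports and the balancer name are hypothetical -
adjust to match your mongrel_cluster config):

<Proxy balancer://myapp_cluster>
  BalancerMember http://127.0.0.1:8000
  BalancerMember http://127.0.0.1:8001
</Proxy>

RewriteEngine On
# serve anything that exists as a file under public/ straight from Apache
RewriteCond %{DOCUMENT_ROOT}%{REQUEST_FILENAME} -f
RewriteRule .* - [L]
# everything else goes to the mongrel cluster
RewriteRule ^/(.*)$ balancer://myapp_cluster%{REQUEST_URI} [P,QSA,L]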

  • Hit the site a few times, paying no attention to load time
  • Wait x period of time (haven’t quite narrowed this down yet, but
    probably 5-10 mins)
  • Hit site again once - this request will take anywhere from 5 - 30
    or so seconds

I can top that: with lighttpd, don’t hit the site for a few hours and
the next request comes back as a 500. Press F5, and then it’s fine.
Nothing decidedly interesting in the logs, other than that the fastcgi
process decided to disappear.

I’m going to try mongrel when I get around to deploying…