Qrp followup, deployment results

I posted about qrp (http://qrp.rubyforge.org/) many weeks ago, but only
deployed it to a live site a few weeks ago (after a few bug fixes
leading up to qrp v0.4.0).

qrp was deployed late on 2008-03-27 to roughly half our servers, and
then fully deployed on 2008-03-28. So far, the results have been
fairly decent (see below).

The only change I needed to make to Mongrel was the following
patch, which silences logging that became excessive once I disabled
concurrency in Mongrel:

--- a/mongrel.rb 2008-03-03 16:42:04.000000000 -0800
+++ b/mongrel.rb 2008-04-17 15:30:57.313952784 -0700
@@ -210,7 +210,7 @@
     # after the reap is done. It only runs if there are workers to reap.
     def reap_dead_workers(reason='unknown')
       if @workers.list.length > 0
-        STDERR.puts "#{Time.now}: Reaping #{@workers.list.length} threads for slow workers because of '#{reason}'"
+        #STDERR.puts "#{Time.now}: Reaping #{@workers.list.length} threads for slow workers because of '#{reason}'"
         error_msg = "Mongrel timed out this thread: #{reason}"
         mark = Time.now
         @workers.list.each do |worker|
@@ -278,7 +278,7 @@
             worker_list = @workers.list
 
             if worker_list.length >= @num_processors
-              STDERR.puts "Server overloaded with #{worker_list.length} processors (#@num_processors max). Dropping connection."
+              #STDERR.puts "Server overloaded with #{worker_list.length} processors (#@num_processors max). Dropping connection."
               client.close rescue nil
               reap_dead_workers("max processors")
             else

As far as I know, we’re the only Rails site running qrp in our
configuration, but it should be safe now that we’re doing it ;)

Results:

While I don’t have hard numbers for average response time and standard
deviation, they have not changed much since qrp was deployed.

However, our metric for requests taking over 10 seconds has improved
greatly since qrp was deployed[1].

The dates are shifted by one day (so the report that I received
on 2008-03-02 was actually for the previous day’s traffic).

While a few hundredths of one percent doesn’t sound like a lot, that’s
still a fair number of unhappy users who get bogged down.
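
For anyone wanting to track the same metric, here is a minimal sketch of
how such a percentage could be computed. It is not the reporting code we
actually use; the method name and the `durations` array (response times
in seconds) are made up for illustration.

```ruby
# Percentage (0-100) of requests slower than a threshold.
# `durations` is a hypothetical array of response times in seconds.
def slow_request_percentage(durations, threshold = 10.0)
  return 0.0 if durations.empty?
  slow = durations.count { |d| d > threshold }
  100.0 * slow / durations.length
end

# One slow request out of 10,000 works out to 0.01%.
sample = Array.new(9_999, 0.2) + [12.5]
puts format('%.4f', slow_request_percentage(sample))
```

At that scale, a drop of a few hundredths of a percent is a handful of
slow requests out of every 10,000.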

Date       | % of requests taking >10s (0-100)
2008-03-01 | 0.1192 | *****************
2008-03-02 | 0.1537 | ***********************
2008-03-03 | 0.0634 | *********
2008-03-04 | 0.1094 | ****************
2008-03-05 | 0.1241 | ******************
2008-03-06 | 0.1075 | ****************
2008-03-07 | 0.1086 | ****************
2008-03-08 | 0.1664 | ************************
2008-03-09 | 0.1647 | ************************
2008-03-10 | 0.0705 | **********
2008-03-11 | 0.1190 | *****************
2008-03-12 | 0.1754 | **************************
2008-03-13 | 0.1202 | ******************
2008-03-14 | 0.1351 | ********************
2008-03-15 | 0.1463 | *********************
2008-03-16 | 0.1468 | **********************
2008-03-17 | 0.1425 | *********************
2008-03-18 | 0.1271 | *******************
2008-03-19 | 0.1260 | ******************
2008-03-20 | 0.1209 | ******************
2008-03-21 | 0.1438 | *********************
2008-03-23 | 0.1139 | *****************
2008-03-24 | 0.0916 | *************
2008-03-25 | 0.1469 | **********************
2008-03-26 | 0.1316 | *******************
2008-03-26 | 0.1323 | *******************
2008-03-27 | 0.1397 | ********************
2008-03-28 | 0.0927 | *************
2008-03-29 | 0.0425 | ******
2008-03-30 | 0.0440 | ******
2008-03-31 | 0.0461 | ******
2008-04-01 | 0.0357 | *****
2008-04-02 | 0.0319 | ****
2008-04-03 | 0.0325 | ****
2008-04-04 | 0.0314 | ****
2008-04-05 | 0.0664 | *********
2008-04-05 | 0.0652 | *********
2008-04-06 | 0.0823 | ************
2008-04-07 | 0.0605 | *********
2008-04-08 | 0.0553 | ********
2008-04-09 | 0.0537 | ********
2008-04-10 | 0.1166 | *****************
2008-04-11 | 0.0512 | *******
2008-04-12 | 0.0546 | ********
2008-04-13 | 0.0619 | *********
2008-04-14 | 0.0519 | *******
2008-04-15 | 0.0421 | ******
2008-04-16 | 0.0441 | ******
2008-04-17 | 0.0409 | ******

We had some internal problems on 2008-04-10 so things went to hell that
day.
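
If you want to compare the before and after periods yourself, something
like the sketch below would do it. The helper names are made up, and the
cutoff is my assumption: with the one-day shift, the report dated
2008-03-29 is the first to cover a fully deployed day. Only a few rows
from the table are pasted in here; feed it the whole table for real
numbers.

```ruby
require 'date'

# Parse lines of the form "2008-03-29 | 0.0425 | ******" into [Date, Float].
# Lines that do not start with a YYYY-MM-DD date (headers, blanks) are skipped.
def parse_report(text)
  text.each_line.map do |line|
    date, pct, = line.split('|').map(&:strip)
    next unless date =~ /\A\d{4}-\d{2}-\d{2}\z/
    [Date.parse(date), pct.to_f]
  end.compact
end

# Mean percentage for rows before the cutoff and for rows on or after it.
# Returns nil for a side that has no rows.
def mean_before_and_after(rows, cutoff)
  before, after = rows.partition { |date, _| date < cutoff }
  mean = ->(set) { set.empty? ? nil : set.sum { |_, pct| pct } / set.length }
  [mean.call(before), mean.call(after)]
end

report = <<~REPORT
  2008-03-26 | 0.1323 | *******************
  2008-03-27 | 0.1397 | ********************
  2008-03-29 | 0.0425 | ******
  2008-03-30 | 0.0440 | ******
REPORT

before, after = mean_before_and_after(parse_report(report), Date.new(2008, 3, 29))
puts format('before: %.4f  after: %.4f', before, after)
```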

Once again, qrp is needed for a Rails site I work on because:

a) we unfortunately use a web service run by folks who suck at the
Internet. Tech folks like myself have little control over this.

b) one of our internal backend services has some pathologically
bad corner cases we occasionally hit. Eliminating them
isn’t possible due to strange business requirements (and some
of the troublesome backend code is proprietary, so we can’t
improve it).

[1] yes, I realize that saying that the number of >10s responses has
dropped is like saying we’ve won the Special Olympics :)