The fair proxy balancer

This is primarily aimed at Grzegorz N., the author of the “fair”
proxy balancer patch for Nginx, but I’m posting this here in case
others want to chip in.

I posted the following on Ezra Z.'s blog recently, in
conjunction with the announcement of the patch. After I posted this,
we have been seeing some rather more extreme examples of non-uniform
request distributions, with some mongrels piling up lots and lots of
connections while others sit completely idle.

We have been running this patch on a live Rails site for a couple of
weeks. We switched from Lighttpd + FastCGI to Nginx + Mongrel for a
couple of technical reasons I won't go into here. Overall, performance has been worse, but I have been unable to pin down exactly what's wrong. From what I can see, the fair patch is not working consistently: a large portion of requests go to a mongrel that is already processing a request. Here is output from "ps" on one of our boxes:

1003 11941 33.7 2.7 131484 112716 ? Rl 15:52 2:24 mongrel_rails [10000/1/89]: handling 127.0.0.1: GET /kalender/liste/2007/10/28/1
1003 11944 1.2 0.8 54336 35580 ? Sl 15:52 0:05 mongrel_rails [10002/0/1]: idle
1003 11947 4.3 2.8 135804 116924 ? Sl 15:52 0:18 mongrel_rails [10008/0/9]: idle
1003 11950 3.1 0.9 58508 39684 ? Sl 15:52 0:13 mongrel_rails [10011/0/374]: idle
1003 11953 3.3 0.9 58196 39428 ? Sl 15:52 0:14 mongrel_rails [10013/0/370]: idle
1003 11957 3.5 1.3 74784 55944 ? Sl 15:52 0:15 mongrel_rails [10001/0/10]: idle
1003 11961 3.3 0.9 58472 39700 ? Sl 15:52 0:14 mongrel_rails [10012/0/390]: idle
1004 5891 2.3 6.8 302544 283032 ? Rl 15:52 2:24 mongrel_rails [10010/3/26]: handling 127.0.0.1: GET /bulletin/show/26916
1003 11970 40.7 2.8 138408 119528 ? Sl 15:52 2:52 mongrel_rails [10004/1/75]: handling 127.0.0.1: GET /
1003 11974 40.7 5.1 233756 214824 ? Sl 15:52 2:52 mongrel_rails [10007/2/68]: handling 127.0.0.1: GET /feed/messages/rss/963/2722
1003 11978 32.1 2.7 133924 115088 ? Sl 15:52 2:15 mongrel_rails [10009/1/79]: handling 127.0.0.1: GET /kategori/liste/Revival
1003 11990 28.6 2.9 141688 122916 ? Sl 15:52 2:00 mongrel_rails [10005/1/85]: handling 127.0.0.1: GET /generelt/search
1003 11998 27.1 2.8 136816 118020 ? Sl 15:52 1:53 mongrel_rails [10006/1/78]: handling 127.0.0.1: GET /kalender/liste/2007/9/26
1003 12002 31.8 2.7 131552 112732 ? Sl 15:52 2:13 mongrel_rails [10010/0/89]: idle

Mongrel is running with a custom extension I have written that extends the process title with status information. The three numbers are the port, the number of concurrent requests, and the total number of requests processed during the mongrel's lifetime; for example, [10011/0/374] means port 10011, no requests currently in flight, and 374 requests served so far. What is apparent from this output is that a bunch of the mongrels are hardly used at all. That would not be a problem if several other mongrels were not being forced to handle multiple concurrent requests. Because of the giant Rails lock, those requests are queued one behind the other, which hurts response times. (We have a lot of fairly slow requests, in the 5-10-second range.)

Alexander.

Hi!

On Mon, Dec 03, 2007 at 02:10:44PM +0100, Alexander S. wrote:

This is primarily aimed at Grzegorz N., the author of the “fair”
proxy balancer patch for Nginx, but I’m posting this here in case
others want to chip in.

Well, here I am. Bullseye :)

I posted the following on Ezra Z.'s blog recently, in
conjunction with the announcement of the patch. After I posted this,
we have been seeing some rather more extreme examples of non-uniform
request distributions, with some mongrels piling up lots and lots of
connections while others sit completely idle.

The standard question – have you tried the latest snapshot? :) (though it might not be any different, asking just in case). Also, as you mention 5-10 second requests, please increase:

#define FS_TIME_SCALE_OFFSET 1000

in the file src/http/modules/ngx_http_upstream_fair_module.c (line 407 in my copy) to e.g. 20000. I'll make it configurable without recompiling nginx, too (or remove it altogether, if I find an elegant solution). Requests running this long may confuse the module, which might well result in the behaviour you're seeing.
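
In other words, the edit I'm suggesting is simply this (20000 is just an example value):

/* src/http/modules/ngx_http_upstream_fair_module.c, around line 407 */
#define FS_TIME_SCALE_OFFSET 20000   /* raised from 1000 to cover long-running requests */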

If increasing FS_TIME_SCALE_OFFSET does not help, could you please
compile nginx --with-debug and gather the debug_http data?


Best regards,
Grzegorz N.

On 12/3/07, Grzegorz N. [email protected] wrote:

The standard question – have you tried the latest snapshot? :) (though it might not be any different, asking just in case). Also, as you mention 5-10 second requests, please increase:

I was using an older snapshot (there was no new snapshot at the time I
wrote my comment, I’m pretty sure).

The new version, in combination with FS_TIME_SCALE_OFFSET set to 60000 for good measure, seems to produce a more uniform distribution, so that helps a lot. Thanks!

Even so, sometimes the balancer seems to go into a state where it’s
not using all mongrels:

1003 23335 66.0 8.6 384908 360700 ? Rl 10:31 29:28 mongrel_rails [10010/433/358]: handling 127.0.0.1: HEAD /feed/calendar/global/91/6de4
1003 29040 28.3 5.7 257764 238764 ? Rl Dec04 278:29 mongrel_rails [10011/18/3465]: handling 127.0.0.1: HEAD /feed/calendar/circle/2289/9441
1003 18917 4.3 7.9 350384 330632 ? Sl Dec04 87:32 mongrel_rails [10012/0/1707]: idle
1003 19334 7.5 4.6 211640 192568 ? Sl Dec04 152:54 mongrel_rails [10013/0/3666]: idle

The first mongrel, as you can see, has 433 requests queued. This happened overnight.

Some of these requests time out; some of them are very expensive legacy feeds that have never been optimized. Does the balancer penalize upstreams that time out a lot, by any chance? And is there a way to force the algorithm not to use any weighting, but to always schedule connections to the upstream with the fewest queued requests?
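
To be clear about what I mean by that last question, here is a standalone sketch of the selection policy I have in mind. It is not the fair module's actual code; the types and names are made up, and the queue depths are just the ones from the listing above:

/* Illustration only -- always dispatch to the upstream with the
 * fewest requests currently queued on it. */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char *name;    /* e.g. "127.0.0.1:10010" */
    unsigned    queued;  /* requests currently being handled by this backend */
} backend_t;

static backend_t *pick_least_busy(backend_t *backends, size_t n)
{
    backend_t *best = NULL;
    size_t     i;

    for (i = 0; i < n; i++) {
        if (best == NULL || backends[i].queued < best->queued) {
            best = &backends[i];
        }
    }
    return best;   /* NULL only if n == 0 */
}

int main(void)
{
    /* queue depths taken from the ps listing above */
    backend_t mongrels[] = {
        { "127.0.0.1:10010", 433 },
        { "127.0.0.1:10011", 18 },
        { "127.0.0.1:10012", 0 },
        { "127.0.0.1:10013", 0 },
    };

    backend_t *next = pick_least_busy(mongrels, sizeof(mongrels) / sizeof(mongrels[0]));
    printf("next request goes to %s\n", next->name);
    return 0;
}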

If increasing FS_TIME_SCALE_OFFSET does not help, could you please
compile nginx --with-debug and gather the debug_http data?

I will do this.

Alexander.

On Mon, 3 Dec 2007 19:33:44 +0100
Grzegorz N. [email protected] wrote:

Well, here I am. Bullseye :)

Just wondering - do you know mod_backhand?
http://www.backhand.org/mod_backhand/

I really like the way it recalculates on the fly where the request should go, via its “candidacy functions”. Of course, I would love to see something similar in nginx.

That would require a method of communication between upstream servers
(php-cgi, mongrels, whatever) and nginx, as upstreams should be able to
inform nginx about their state and the state of the machine they’re on.
Anyone interested in implementing something in this direction?

Jure Pečar
http://jure.pecar.org/

On 12/3/07, Alexander S. [email protected] wrote:

Mongrel is running with a custom extension I have written that extends
the process title with status information.

In case anyone is interested in this extension, here it is:

http://purefiction.net/mongrel_proctitle/

Alexander.