Load balancer performance problems

I am running a rails app on an up-to-date gentoo linux box. I have 4
mongrel servers, and I’ve tried using both nginx and pound as my
front-end/load-balancer.

Everything runs very quickly for a while, and then requests start to
take a really long time to complete.

The thing that confuses me is that, if I point my browser to port 80 on
the server to hit the load-balancer, I run into these performance
problems, but if I point my browser directly to any of the mongrel
servers, things work real fast.

Does anyone have any idea what sorts of errors I could have made, either
in coding or setup, that would cause the load-balancer to bog down? I
can usually send the request to each of the individual mongrel servers
and have it complete before the request sent to the load balancer does
anything at all.

thanks,
Will

Are you using sendfile? Do you have the sendfile gem installed?

Do both nginx and pound slow down?

Are you serving up static content through some nice rewrite rules
(see Ezra’s nginx setup, or the mongrel docs for pound)?
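Something along these lines in the nginx config is what I mean (a rough
sketch only; the root path and upstream name are placeholders, and
Ezra’s real config is more thorough):

    # serve files straight off disk when they exist (static assets,
    # page-cached pages) and only proxy to the mongrels otherwise;
    # "mongrels" is an upstream block defined elsewhere in the config
    location / {
        root /var/www/myapp/public;
        if (!-f $request_filename) {
            proxy_pass http://mongrels;
        }
    }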



Charles Brian Q.
self-promotion: www.seebq.com
highgroove studios: www.highgroove.com
slingshot rails hosting: www.slingshothosting.com

i thought sendfile isn’t needed anymore with Mongrel? i thought Zed
posted that in one of the more recent releases…

i’m having the same issue as Will and I don’t have sendfile installed
(because I removed it). the app will serve thousands of requests at 0.2 -
0.8 seconds and then the render times will just shoot up to 4-8 seconds
for no reason… the database isn’t the issue (at least not according to
the “DB” vs “Render” times in the Rails log)

ed


Ed Hickey
Developer
Litmus Media
816-533-0409
[email protected]
A Member of Think Partnership, Inc
www.ThinkPartnership.com
Amex ticker symbol: THK

Are the two of you that are seeing this problem running in
production mode? And as you say that this happens with pound, it
might be a mongrel or rails issue, as pound proxies everything, making
mongrel serve static files too, which it shouldn’t be doing in a
production environment. With nginx, are you using a good config file
[1] that does the correct rewrites to make sure nginx serves all
static and rails page-cached files?

Also, what kind of server environment are you running on? Does the
site sit idle for a while before this happens? Maybe it’s being
swapped out to disk and then needs to be swapped back in? If you can
provide more details I’m sure that we can help you figure out what it
is.

-Ezra

[1] Ruby on Rails Blog / What is Ruby on Rails for?

i should have given more information:

in our production env:
mongrel (0.3.13.4)
mongrel_cluster (0.2.0)
F5 load balancer (balancing 10 mongrel processes)

the ‘site’ is more of an XML service (though it doesn’t use AWS). it
serves back XML exclusively over HTTP: no sessions, no rhtml, etc.
we’re currently getting only around 120 requests per minute, but the
site is hardly ever idle.
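fwiw, the cluster config is just the stock mongrel_cluster.yml shape,
roughly like this (placeholder paths and ports for illustration, not
our actual values):

    ---
    # illustrative only -- placeholder paths and ports
    cwd: /var/www/myapp/current
    address: 127.0.0.1
    port: "8000"
    servers: 10
    environment: production
    pid_file: log/mongrel.pid

started and stopped with mongrel_rails cluster::start / cluster::stop.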

most of the requests get served in <= 0.5s but occasionally the render
time jumps way up (>4s) for no visible reason. the Rails logs report
quick DB times; it’s just the render times that are high. this is
interesting because the DB query is quite large (6 joins, with some
tables having almost 1 million rows) but the result set is very small
(maybe 5 rows).

i’m using the to_xml method, not rxml templates.
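the actions are all basically this shape (the names here are made up,
but it’s a fair sketch of what we do):

    # made-up controller/model names; the real thing is the same shape:
    # a big filtered query, a tiny result set, rendered via to_xml
    class FeedController < ApplicationController
      def show
        @items = Item.find(:all,
                           :conditions => ['updated_at > ?', params[:since]],
                           :limit => 5)
        render :xml => @items.to_xml
      end
    end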

ed


Ed Hickey
Developer
Litmus Media
816-533-0409
[email protected]
A Member of Think Partnership, Inc
www.ThinkPartnership.com
Amex ticker symbol: THK

Will, you might like to read this:

http://www.mail-archive.com/[email protected]/msg01593.html

Altho I don’t quite grok it, it seems relevant,
Vish

My problem seems to be different from Ed’s –

At the moment I’m using nginx, with a configuration file based on the
one you cited below.

Let me describe what the server is doing:

The app is basically a collaborative office type environment for a
handful of people. At the moment I have, at maximum, 5 people logged on
simultaneously. Usually I have 2. It has instant messaging, so every
5 seconds it calls an action that completes in ~.0025 seconds and every
10 seconds calls a different action which completes in ~.005 seconds.
When there is only one person logged in, the problem crops up less
often, and it never happens when the server has been idle for a while,
only in the middle of sustained use.
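The polling is done with the standard Prototype helpers, roughly like
this (the action names here are changed):

    <%# roughly what the polling looks like; action names are made up %>
    <%= periodically_call_remote :url => { :action => 'check_messages' },
                                 :frequency => 5 %>
    <%= periodically_call_remote :url => { :action => 'refresh_user_list' },
                                 :frequency => 10 %>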

If one is using the application, everything seems very fast for a while,
and then, suddenly, a request will take 15 seconds or more to complete.
The strange thing about this is that I can hit all of the servers
separately just fine while this request is stalled, by pointing my
browser straight at them rather than going through the load-balancer on
port 80. Moreover, if I tail production.log (I am running in production
mode), I can see that the stalled request takes no more time than usual
to complete, once mongrel sees it.

At first I thought that I had just written crappy code, and I spent a
bunch of time locating slow actions and speeding them up, making my
session smaller (I am using Stefan K.'s sql_session_store as my
container), and speeding up some of my DB queries. This improved my
normal performance quite a bit, but it didn’t do anything to lower the
frequency of these stalls.

If I go down to 1 mongrel, performance is abysmal when people are logged
on and chat request/user list polling (described above) is happening,
but is perfectly reasonable otherwise. If I go up to 10 mongrels,
performance through the load balancer is worthless, with at least half of
its requests stalling, but performance for any of the individual
mongrels is great, if I point to them directly.

I have plenty (280MB) of free RAM, and my mongrels all stabilize at
using about 30MB. According to top, my cpu is ~98% idle all of the
time.

that’s all I’ve got for now, thanks for listening,
Will


I gave this a read, and it didn’t seem related to my problem, because
I don’t have any particularly slow actions, at least not so far as the
rails log tells me.

Maybe my question boils down to this: if, according to my production
log, all of my rails actions complete in at most .3 sec, and I am
basically never getting more than 1-2 req/sec, and when I am getting
more requests than that they are requests that complete in .001 sec or
less, what can be causing my app to occasionally stall and take 10
seconds or more to complete?


On Sep 25, 2006, at 7:50 AM, Will wrote:

what can be causing my app to occasionally stall and take 10 seconds
or more to complete?
Have you looked at networking statistics to see if you’re getting
high retransmits and such?

Perhaps there’s a bad piece of networking gear somewhere?
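For example, something as simple as this on the server, run before and
during one of the stalls (just a sketch; exact counter names vary by
OS):

    # snapshot TCP retransmit counters; compare across a stall
    netstat -s | grep -i retrans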


– Tom M.

I’m not quite sure where to look to debug this sort of problem, but it
seems likely that the problem is either with the router, or with
something higher up in the university IT administration that I don’t
know about:

The machine is in the DMZ for the router. I set up my server in nginx
to listen on both port 80 and port 4000. Nginx then proxies to a
mongrel cluster. If I point httperf to our domain name or the IP
address of our router, at one of my slowest actions on port 80, I get an
average of .2 requests/sec, whereas if I do it on port 4000 I get an
average of 3 requests/sec. If I point it, instead, either at localhost
or at 192.168.0.101 on ports 80 or 4000, I get 12 requests/sec.
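For reference, the httperf runs look roughly like this (the host and
path here are placeholders):

    # the sort of run described above; host and URI are placeholders
    httperf --server example.edu --port 80 --uri /slow_action \
            --num-conns 100 --rate 2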

I guess I’ll get rid of the router and see how that works.

thanks guys,
Will
[email protected]


On Mon, 25 Sep 2006 22:40:30 +0200
Will [email protected] wrote:


You need to break down the performance of each component using a series
of small experiments. Start from one end of the chain and run a series
of tests on it going back through, then test each piece of the chain as
it’s connected to the next piece. Tedious but it’ll force you to
examine each part and will help find the problem.
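Even something as simple as timing the same URL against each hop will
tell you a lot. A sketch (substitute your real addresses and ports):

    # hit the balancer and one mongrel directly with the same request
    # and compare the totals; addresses/ports are placeholders
    for target in http://192.168.0.101:80/ http://192.168.0.101:8000/; do
      curl -o /dev/null -s -w "$target took %{time_total}s\n" "$target"
    done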

You should also check out mtr (matt’s traceroute) and do a traceroute
to/from various locations. You might have something messed up and it’s
dropping packets along the way.
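A report run like this will show you per-hop loss (the host is a
placeholder):

    # 100-cycle report; watch the Loss% column for each hop
    mtr --report --report-cycles 100 example.edu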

Last thing, go grab Wireshark/Ethereal, take a laptop and you can hook
it up between different components and grab chunks of TCP between them.
It’s got a nice graphical front end, is really easy to use, and it will
even decode the TCP streams so you can see what’s going on. Main thing
you’re looking for is really insane timings (look at the timestamps on
the left and find big differences) and you’re looking for bad packets
(they show up in red/brown).

Otherwise, if you don’t have the tools or expertise to figure it out,
I’d recommend hiring someone who’s qualified/certified in your router
equipment to come and look at it. Someone good would probably know
right away what is wrong, and it’ll be loads cheaper than you doing it
for the next month.


Zed A. Shaw, MUDCRAP-CE Master Black Belt Sifu

http://mongrel.rubyforge.org/
http://www.lingr.com/room/3yXhqKbfPy8 – Come get help.