Mongrel hanging with high CPU usage

I’m running a mongrel cluster behind Apache + mod_proxy. Several times
a day, one Mongrel will spike to 99-101% CPU usage and freeze there.
The standard mongrel_rails restart command won’t affect it, nor will
it respond to anything short of a kill -6. Memory usage remains low
when this happens. There doesn’t seem to be a pattern behind it;
sometimes it’ll happen several times in quick succession, other times
it’ll go for hours without a problem. Usage is pretty light, as the
site is pre-launch and just in use by a few testers. It’s not tied to
any specific action like file uploads, nor any particular URL. Even
when I hit it with a heavy stress test via apache bench I am unable to
duplicate the problem reliably.

So far I’ve tried the following: reinstalling the native C MySQL gem
(v2.7); checking open file descriptors (lsof generally reports about
60 files under the hung process); strace and gdb (no glaring errors
jump out at me, but I haven’t used these tools in the past so I may be
missing something); setting SetEnv force-proxy-request-1.0 1 and
proxy-nokeepalive 1; setting ActiveRecord::Base.verification_timeout to a
lower setting; and leaving my shoes by the door for the server gnomes
to fill with candy. Nothing seems to have worked.

My environment is:

Mongrel 1.1.5
MongrelCluster 1.0.5
Apache 2.2.3, using mod_proxy
Ruby 1.8.6
RHEL 5.1

Has anyone seen this behavior before? Does anyone have any other debug
tips that might be useful? At this point I’m pretty lost, and not sure
at all if the problem is in my application or in the server stack.
Thanks for any insight you might have.

j

On Jun 18, 2008, at 4:30 PM, Josh F. wrote:

Even when I hit it with a heavy stress test via apache bench I am unable
to duplicate the problem reliably. [...] Has anyone seen this behavior
before? Does anyone have any other debug tips that might be useful? At
this point I'm pretty lost, and not sure at all if the problem is in my
application or in the server stack. Thanks for any insight you might have.

Next time this happens, get the PID of the errant process and run this
on it:

$ strace -p <PID>

And make a pastie of some of the output. Try to see what it’s stuck
on doing.
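
In case it helps anyone reading along, a rough sketch of doing that; <PID>
is a placeholder for whatever you pull out of ps or the mongrel pid file,
and the flags are ordinary strace options, nothing Mongrel-specific:

$ ps aux | grep mongrel_rails
$ strace -p <PID> -tt -s 256 -o /tmp/mongrel.strace

-tt adds microsecond timestamps and -s 256 dumps longer strings. Let it
run for ten or twenty seconds, Ctrl-C, and pastie the log file. If the
process is stuck in a pure C-level loop (a runaway regexp, say) you may
see no syscalls at all, which is itself a useful clue.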

Cheers-

On Wed, 2008-06-18 at 16:30 -0700, Josh F. wrote:

I’m running a mongrel cluster behind Apache + mod_proxy. Several times
a day, one Mongrel will spike to 99-101% CPU usage and freeze there.
The standard mongrel_rails restart command won’t affect it, nor will
it respond to anything short of a kill -6.

Do what Ezra says and use strace, but in the meantime, you can use “god”
or “monit” to monitor the process and restart it when this happens.
Certainly better to track down the root cause if possible, though…
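
Tom didn't post a config, so purely as a sketch of the kind of god watch
he means; the name, port, paths and thresholds here are all made up:

God.watch do |w|
  w.name     = "mongrel-8000"
  w.interval = 30.seconds
  w.pid_file = "/var/run/mongrel.8000.pid"
  w.start    = "mongrel_rails cluster::start -C /etc/mongrel_cluster/app.yml --only 8000"
  w.stop     = "mongrel_rails cluster::stop -C /etc/mongrel_cluster/app.yml --only 8000"
  w.behavior(:clean_pid_file)

  w.restart_if do |restart|
    # restart if the process sits above 90% CPU for 3 of the last 5 samples
    restart.condition(:cpu_usage) do |c|
      c.above = 90.percent
      c.times = [3, 5]
    end
  end
end

One wrinkle given what Josh describes: a truly wedged Mongrel ignores
mongrel_rails stop, so the stop command here would probably need to fall
back to a kill -9 on the pid file for the restart to actually take.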

Yours,

Tom

On Jun 18, 2008, at 7:34 PM, Josh F. wrote:

I've also run strace on the process. While I'm not exactly sure how to
interpret it, I'm not seeing any obvious errors (system calls returning
-1 for instance.) Naturally this is one of the times when it runs fine
for hours, but I'll pastie what I find when it happens next. Any general
pointers on what sort of things I should be looking for in there?

Thanks,
Josh

Generally if you catch a process that is spinning at 100% cpu it will
be stuck in some kind of loop, so catching it in action is important.
Lots of times it will be looping and blocking on the database, or some
C extension gone off into the weeds to die. So seeing what strings it might
be writing to any files or sockets can help trace it down to a section
of code sometimes.
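
One concrete way to do that (the flags are plain strace options; <PID> is
whatever pid you grabbed):

$ strace -p <PID> -f -s 512 -e trace=read,write,sendto,recvfrom -o /tmp/spin.strace

-s 512 shows more of each string and -e trace=... cuts the noise down to
the calls that carry data. The first number in each read(5, ...) or
write(5, ...) is a file descriptor; lsof -p <PID> maps those back to the
actual files and sockets, which is usually enough to tell MySQL traffic
from log writes from a C extension off in the weeds.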

Cheers-

  • Ezra

Thanks for the tips. I’ve been using god, but when one of the Mongrels
gets into this unresponsive state a standard mongrel_rails
cluster::restart on the port in question fails to restart the process.
I have to shell in and issue a kill with -6 or -9. I can’t rely on a
restart condition on CPU usage, as usage remains within normal bounds
right up until it jumps to 100% and jams.

I’ve also run strace on the process. While I’m not exactly sure how to
interpret it, I’m not seeing any obvious errors (system calls
returning -1 for instance.) Naturally this is one of the times when it
runs fine for hours, but I’ll pastie what I find when it happens next.
Any general pointers on what sort of things I should be looking for in
there?

Thanks,
Josh

Thanks guys – that was enough to point me in the right direction. (A
regexp was getting stuck on some gnarly markup.) For future Googlers,
I also found this post helpful:
http://weblog.jamisbuck.org/2006/9/22/inspecting-a-live-ruby-process
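
For anyone wondering how a regexp can peg a CPU like that: the usual
culprit is catastrophic backtracking, where nested quantifiers give the
engine exponentially many ways to re-split a string that can't match.
This isn't the regexp from my app, just the textbook illustration:

# spins the CPU: nested quantifiers plus a guaranteed failure at the end
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!" =~ /^(a+)+$/

# same intent, unambiguous, returns instantly
"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!" =~ /^a+$/

And the gist of the gdb trick from that post, from memory, so double-check
the article itself (<PID> is the stuck Mongrel):

$ gdb -p <PID>
(gdb) call rb_backtrace()
(gdb) detach
(gdb) quit

rb_backtrace() prints the Ruby-level backtrace to the process's own stdout
(i.e. the mongrel log when it's daemonized), not to your gdb session, so
look there for the file and line it's stuck on. Poking a wedged interpreter
from gdb can occasionally crash it, but if the alternative was kill -9
anyway there's little to lose.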

Thanks for your help!

Josh

Josh,

Make God (or whatever monitoring you set up) run

$ strace -p <PID>

when the CPU% spike hits, and have the output mailed to you or logged.
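
One low-tech way to wire that up, as a sketch only, is a cron'd watchdog
script. The paths, port, threshold, address, and the assumption that
mail/mailx is configured are all made up here:

#!/bin/sh
# sample the Mongrel's CPU; if it's spinning, grab ~15s of strace and mail it
PID=`cat /var/run/mongrel.8000.pid`
CPU=`ps -o %cpu= -p $PID | awk '{print int($1)}'`
if [ "${CPU:-0}" -ge 90 ]; then
  strace -p $PID -tt -s 256 -o /tmp/mongrel-$PID.strace &
  TRACER=$!
  sleep 15
  kill $TRACER
  mail -s "mongrel $PID spinning at ${CPU}%" you@example.com < /tmp/mongrel-$PID.strace
fi

Run it every minute from cron and it stays quiet until the spike actually
hits, which gets around not being at the keyboard when it happens.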


Aníbal Rojas

http://anibal.rojas.com.ve

Ric For wrote:

Ezra, having the same issue, and this is the closest forum listing I see.
Mongrel_rails running up to 100% CPU after 10-12 hours. Restarting the
processes with monit solves it for a while but isn't the real fix.

A pastie of the stack trace:
http://pastie.org/643236

Same stack trace here. Ric? Anyone?

Ezra, having the same issue, and this is the closest forum listing I see.
Mongrel_rails running up to 100% CPU after 10-12 hours. Restarting the
processes with monit solves it for a while but isn't the real fix.

A pastie of the stack trace:

http://pastie.org/643236

Any help appreciated.

Ric

Ezra Z. wrote:

Next time this happens, get the PID of the errant process and run this
on it:

$ strace -p <PID>

And make a pastie of some of the output. Try to see what it’s stuck
on doing.

Cheers-