Ruby/FastCGI going into uninterruptible sleep after random periods of time

Hello there -

I manage some servers that run Ruby on Rails on top of Apache 2.0 and
FastCGI. I’ve been running them for a little over a year now; recently I
upgraded the OS, Ruby, and FastCGI, and have hit an unexpected problem
that I’ve been having trouble tracing down.

The servers have 8 cores and 8GB of memory, and run about a dozen Ruby
on Rails applications; each application has a dedicated Apache
instance with its own FastCGI processes.

Original configuration (stable, despite some memory leaks):
Fedora Core 4 32-bit
Apache 2.0.54
mod_fastcgi 2.4.2
Ruby 1.8.4

  • a wide assortment of approx 25 gems and other ruby add-ons
    (I can provide a list if needed)

The applications are all in-house apps. I didn’t write them; I
just manage the environment. In this configuration the servers
are rock solid stable, though the most active application leaks
memory like crazy and is auto-restarted once or twice a day.

New configuration:
CentOS 4.5 32-bit
Apache 2.0.52
mod_fastcgi 2.4.2
Ruby 1.8.5

  • the same wide assortment of approx 25 gems and other add-ons,
    recompiled against the above version of Ruby.

The problem is that after random periods of time (3-20 hours), two of
the busiest Ruby apps on the boxes (they account for 80% of the
load) start going into uninterruptible sleep (“D”) for no
apparent reason. There is no I/O problem: the boxes run at
~20% CPU, have very fast local disk subsystems, and most of what
these apps do is talk to databases (on other systems). There are
no errors in the logs. One of the apps does a lot of work over
NFS, but the NFS server is fine (it’s been running straight for
the past 6 months without a glitch), so I don’t suspect the
infrastructure, since the problem only happens on the updated
software. The failures don’t happen at the same time, either:
one server will go bad at one point, then maybe an hour or two
later another one will go. All 4 systems are load balanced and
get an even distribution of hits.

One of the four systems hasn’t had a problem in the roughly 60
hours since I fixed a mod_fastcgi issue: our automation
system (CFEngine) was pushing out a mod_fastcgi binary
compiled against Fedora Core 4 to the CentOS systems.
Once I fixed that and pushed out a mod_fastcgi compiled against
CentOS, two of the systems seemed to stabilize (one’s been
running fine ever since; the other ran for about 18 hours
before having an issue). Prior to that mod_fastcgi change,
both systems were going out at least once every 6 hours.
So: two systems have the symptoms really bad, one is hardly
affected at all, and one hasn’t shown anything in more than
2 days. All are identical (logic suggests there must be
something different, but our installation processes are so
automated that there really isn’t much chance of anything
being different).

I don’t know if it’s mod_fastcgi; maybe it’s Ruby, or maybe
it’s one of the modules… We’ve been running the same config
in a test environment for more than a month without the
slightest problem. Then again, we also ran 64-bit Ruby
in that test environment for more than a month, and it
lasted less than an hour in production before we had to
roll back (the system performed a good 60-75% slower under
load than in 32-bit).

I’ve worked on this a good 6-8 hours over the past week,
trying to troubleshoot it and searching the net for anyone
else who might have had the same problem, with no luck.

In the meantime I’ve been playing with mod_fcgid. I got
it running, but some of the basic ‘smoke tests’ that run
against our app fail on a particular part of the app with
mod_fcgid, though not with mod_fastcgi. I haven’t had much
time to look into it in depth; Ruby just reports ‘transaction
aborted’ for some reason (even after raising timeouts in
mod_fcgid).

I’m sure this is a lot to take in, but I’m grasping at
straws here, hoping that maybe someone has some insight.
Of all platforms, I would have thought the newer one would
have been more solid than the older one.

While I’ve been managing the systems that run these Ruby
apps, I’m by no means a Ruby developer, though I have
been using Linux and related tools for about 12 years and
consider myself fairly adept at it. I haven’t encountered
this type of situation before, and neither has my other,
very experienced system admin.

And before anyone suggests ditching mod_fastcgi: I’ve already
thought about it, and we probably will at some point.
Until now, though, it’s worked surprisingly well (I’ve
read lots of complaints about setting it up; for
us it’s a breeze). Worst case, I can roll back to the
older OS and config in a couple of hours, but I really
want to ditch that old distribution…

I plan to post to the fastcgi mailing list as well; I’m still
waiting to get subscribed to it…

thanks in advance for any insights…

nate

nate wrote:

are rock solid stable, though most active application leaks

The problem is after random periods of time(3-20 hours) two of
the most busy ruby apps on the boxes(account for 80% of the
load) start going into uninterruptible sleep(“D”) for no
apparent reason. There is no I/O problem, the boxes run at
~20% CPU, have real fast local disk subsystems, most of what
these apps do are talk to databases(on other systems).

Processes in the “D” state for significant periods of time tell me that
there is an I/O problem. Are these Ruby processes? Are other
processes/kernel threads, for example, “kswapd” or “bdflush” also in a
“D” state?
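
A quick way to check is with ps; this is a sketch assuming GNU/Linux procps:

```shell
# List every process currently in uninterruptible sleep ("D"), along
# with the kernel function it is blocked in (wchan). If kswapd or
# bdflush show up next to the Ruby workers, the problem is system-wide.
ps -eo state,pid,ppid,wchan:30,comm | awk '$1 ~ /^D/'
```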

If you haven’t already done so, install the “sysstat” RPM. That will
give you a marvelous tool called “iostat”. Run “iostat -x -t” with
ten-second samples and collect the logs in a file. When your system
starts getting “D” state processes, look at the time stamp and compare
it with the “iostat” logs. If you see one or more disks with a
utilization of 100 percent, you’re waiting for disk.
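
A minimal collection loop might look like this (the log path is just an example):

```shell
# Append timestamped extended iostat samples, 10 seconds apart, to a
# log file so later "D"-state incidents can be matched against the
# per-disk %util column. Requires the sysstat package.
nohup iostat -x -t 10 >> /tmp/iostat-samples.log 2>&1 &
```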

Another thing to look at is those memory leaks. I don’t remember what
Fedora Core 4 used for a kernel, but CentOS 4.5 is a 2.6.9 kernel IIRC.
The memory leaks may be starving the system of memory. “vmstat” will
tell you that – if you see lots of swap in and swap out, your memory
leaks are making your system thrash.
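
For example (procps vmstat, six 10-second samples):

```shell
# Watch the si (swap-in) and so (swap-out) columns; sustained non-zero
# values across samples mean the box is thrashing.
vmstat 10 6
```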

In the mean time I’ve been playing with mod_fcgid, got
of been more solid than the older one.
Well … I wouldn’t say grasping at straws so much as trying too many
things at once to really isolate the problem. You’ve got way too many
variables and factors at play here. If you’ve got one system /
configuration that’s stable, clone it and quit futzing with it for a
week or so. :) Then make one change at a time, so you know what broke
it!

Incidentally, memory leaks are defects – period. Either the
infrastructure or the application code is broken. If it’s your
application code, get your developers busy fixing it. If it’s open
source code, bring it to the attention of the relevant community.

And before anyone suggest ditch mod_fastcgi, I’ve already
thought about it and we probably will at some point,
though, until now it’s worked surprisingly well(I’ve
read about lots of complaints about setting it up, for
us it’s a breeze). Worst case I can roll back to the
older OS and config in a couple of hours but I really
want to ditch that old distribution…

I think that’s exactly what you need to do – roll back to a stable
configuration and introduce changes one at a time.

Good luck!

On Nov 8, 2007, at 9:31 PM, M. Edward (Ed) Borasky wrote:

with a utilization of 100 percent, you’re waiting for disk.

Another thing to look at is those memory leaks. I don’t remember
what Fedora Core 4 used for a kernel, but CentOS 4.5 is a 2.6.9
kernel IIRC. The memory leaks may be starving the system of memory.
“vmstat” will tell you that – if you see lots of swap in and swap
out, your memory leaks are making your system thrash.

didn’t know about ‘D’ ed - thx.

a @ http://codeforpeople.com/

M. Edward (Ed) Borasky wrote:

Processes in the “D” state for significant periods of time tell me that
there is an I/O problem. Are these Ruby processes? Are other
processes/kernel threads, for example, “kswapd” or “bdflush” also in a
“D” state?

Normally I would agree, but no, there are not; running iostat on the box
also says the disks are 99.99% or 100% idle. Swap usage is 0 as well,
which is what has me so dumbfounded. The I/O subsystems on these
boxes are fast as hell (waaaaaay overkill), and the apps hardly use
disk at all, short of writing log files (and sometimes Ruby session
files).

If you haven’t already done so, install the “sysstat” RPM. That will
give you a marvelous tool called “iostat”. Run “iostat -x -t” with
ten-second samples and collect the logs in a file. When your system
starts getting “D” state processes, look at the time stamp and compare
it with the “iostat” logs. If you see one or more disks with a
utilization of 100 percent, you’re waiting for disk.

Agree there too, and it’s already done; sorry if I didn’t clarify that
in the original email! Top shows a lot of I/O wait, as does sar (as far
as CPU I/O wait goes), but it seems to be just an artifact of the stuck
processes, as there is no other sign that I/O is being used.

Another thing to look at is those memory leaks. I don’t remember what
Fedora Core 4 used for a kernel, but CentOS 4.5 is a 2.6.9 kernel IIRC.
The memory leaks may be starving the system of memory. “vmstat” will
tell you that – if you see lots of swap in and swap out, your memory
leaks are making your system thrash.

Agree there too, and there are no signs of swapping; the systems have 8GB
of memory and seem to run in the 3-5GB range. The biggest app does have
a memory leak, and the system auto-restarts it once the memory usage for
the app exceeds 5GB.
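
As an aside, the total that threshold effectively watches can be eyeballed with something like this (assuming the workers all show up as a `ruby` command):

```shell
# Sum the resident set size (RSS, in kB) of every ruby process; this is
# roughly the figure the 5GB auto-restart threshold is tracking.
ps -C ruby -o rss= | awk '{ total += $1 } END { printf "%d kB\n", total }'
```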

Well … I wouldn’t say grasping at straws so much as trying too many
things at once to really isolate the problem. You’ve got way too many
variables and factors at play here. If you’ve got one system /
configuration that’s stable, clone it and quit futzing with it for a
week or so. :) Then make one change at a time, so you know what broke it!

Yeah, that’s a good point. Honestly, though, even having been managing
systems for as long as I have, this sort of change seemed fairly
trivial; I mean, the OS is almost the same, same major version of
Apache, same version of FastCGI, etc. And it worked fine in test
environments for a month. The developers are pushing hard to
upgrade to Ruby 1.8.6 in the next couple of weeks, so that’ll be
another change. But in general, yes, I do like the approach of
changing one thing at a time. I was just getting so frustrated with
still having such an obsolete platform in our environment (when
most everything else has been moved off), and after the failure to
deploy 64-bit Ruby (due to load issues), I just wanted to get it
done.

Worst case, I can roll back to the other OS in a couple of hours;
however, rebuilding another version of Ruby and the associated
gems/add-ons, then packaging them as RPMs and adjusting CFEngine
to push them out, is about a 16-24 hour job, not something I
look forward to repeating very often (I just did it for Ruby 1.8.6).

Incidentally, memory leaks are defects – period. Either the
infrastructure or the application code is broken. If it’s your
application code, get your developers busy fixing it. If it’s open
source code, bring it to the attention of the relevant community.

Agree again there; good points all around. The developers have been
attentive to the memory leaks but so far haven’t done a whole lot.
They do get the emails about the auto restarts of the apps when
they exhaust their memory. One developer here brought up a
big memory leak fix in the hash(?) routines in Ruby 1.8.6, which
is part of the push to upgrade, to see if that helps. In
investigating this problem, I dug into our version of Ruby
1.8.5, and it turns out Red Hat backported that particular fix
to 1.8.5; so far the memory usage curves are similar to 1.8.4’s,
so it seems that particular leak didn’t impact us as much as we
might have hoped.

We’ve also spent some time on a memory leak profiler (I forget the
name offhand, but it runs under 1.8.6). To my knowledge we haven’t
had a lot of success getting it working (I’m not on the dev side of
things, so they may have made progress since I last got an update).

I think that’s exactly what you need to do – roll back to a stable
configuration and introduce changes one at a time.

That may be what I end up doing, really there isn’t much that has
changed:

  • Apache same version (2 minor revs earlier)
  • OS update (fairly significant forklift)
  • Ruby update
  • Fastcgi same version (compiled against new OS)

So I could go the route of running Ruby 1.8.5 on the older OS, but
I dread the amount of work that entails, especially when we’re
being pushed to 1.8.6 in a matter of 2, maybe 3, weeks.

I appreciate the thoughts; they mostly confirm what I already
suspected. I was hoping for a miracle that would save me a lot more
work, and I knew it probably wasn’t going to happen, but you
(hopefully) can’t blame me for hoping :)

nate

M. Edward (Ed) Borasky wrote:

Now that’s bizarre – I/O wait without disk utilization. I wonder if
you need to check your kernel memory split and high memory I/O
configuration. You may not be getting all the page cache space you can
have with an 8 GB box.

It’s using the default SMP kernel, no special parameters.

5 GB occupied in an 8 GB machine? Do you mean you have 3 GB of free
memory? That would be a big problem. What kernel versions did the old
and new systems have?

Well, it’s 5GB spread across all the Ruby processes for that app (~20),
not one process with 5GB. I don’t believe memory usage is an issue
either; if memory were constrained, the system would swap. We have
a Java process that runs on the same systems that consumes about 1GB
of memory in a single process, and it hasn’t even blinked.

On one system, I restarted the apps about 20 minutes ago, and
within about 10 minutes one process was already in the “D” state
(unusual; it usually takes a few hours). Memory didn’t even get to
5GB after a day or so of normal traffic load; Saturdays are really
low traffic.

Old kernel: 2.6.17-1.2142_FC4smp (last release for Fedora Core 4)
New kernel: 2.6.9-55.0.9.ELsmp (2nd most current release for CentOS 4.5 /
Red Hat Enterprise 4.0 update 5)

I believe I’ve traced down the problem in mod_fcgid, so I’m continuing
the test in the test environment with the expectation that it will
pass now, and I’ll try it out in production on Monday. The problem
with mod_fcgid was that it seems to ignore a configuration parameter
for timeouts and defaults to aborting the Ruby process after 40
seconds. One particular URL can take longer than 40 seconds due to
some bad SQL (the SQL alone takes up to 30 seconds). I found in the
source that the default IPC timeout is 40 seconds, which matches
exactly with all of the failures in the Apache logs (every one stops
at 40 seconds), so I changed the source, recompiled, and pushed out
the updated version to the test servers.
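
For reference, the directive that is supposed to control this in the mod_fcgid 2.x era is, to my knowledge, IPCCommTimeout; a sketch of the intended configuration, with directive names that should be verified against the installed version:

```apache
# Raise mod_fcgid's IPC read timeout well above the slowest known
# request (~55 s for the bad-SQL URL), instead of the 40 s default.
<IfModule mod_fcgid.c>
    IPCCommTimeout 120
</IfModule>
```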

thanks!

nate

nate wrote:

apache logs(every one stops at 40 seconds), changed the source, recompiled
and pushed out the updated version to the test servers.

Yay! I’ve confirmed that my fix for mod_fcgid worked. After 6 test
runs, the 7th one generated a query that took 55 seconds, and it
succeeded, so that’s 7 complete test passes in a row. I’ll let it run
for the rest of the weekend and see if any problems pop up; if not,
then we’ll go to production with mod_fcgid in the hope it may clear
the issue. (I want to go to mod_fcgid anyway, because it can
auto-restart the Ruby processes after processing X number of requests,
dramatically reducing the impact of memory leaks, and it’s more
actively maintained.)
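
The recycling behavior described above maps to a single directive; this is a hypothetical sketch using the mod_fcgid 2.x-era name (verify against your version’s docs):

```apache
# Retire each FastCGI process after it has served 1000 requests, so a
# leaking Ruby worker is replaced long before it can grow to gigabytes.
<IfModule mod_fcgid.c>
    MaxRequestsPerProcess 1000
</IfModule>
```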

nate

nate wrote:

Agree there too, and already done, sorry if I didn’t clarify in the
original email! Top shows a lot of I/O wait as does sar(as far as
CPU I/O wait) but it seems just to be an artifact of the defunct
processes as there is no other signs that I/O is being used.

Now that’s bizarre – I/O wait without disk utilization. I wonder if
you need to check your kernel memory split and high memory I/O
configuration. You may not be getting all the page cache space you can
have with an 8 GB box.

Agree there too and there is no signs of swapping, the systems have 8GB
of memory and seem to run in the 3-5GB range. The biggest app does have
a memory leak and the system auto restarts it once the memory usage for
the app exceeds 5GB.

5 GB occupied in an 8 GB machine? Do you mean you have 3 GB of free
memory? That would be a big problem. What kernel versions did the old
and new systems have?

nate wrote:

memory? That would be a big problem. What kernel versions did the old
and new systems have?

Well 5GB spread across all the ruby processes for that app(~20), not
1 process with 5GB. I don’t believe memory usage to be an issue
either, if the memory was constrained the system would swap. We have
a java process that runs on the same systems that consume about 1GB
of memory in a single process and it hasn’t even blinked.

Sorry … I wasn’t clear. Free memory in large quantities is a
problem, not the fact that you’ve got 5 GB full out of 8 GB. Linux uses
memory for lots of things, but the two biggest are application programs
(code and data space) and page cache (where stuff read from and written
to disk goes). It tries as hard as it can to fill all available memory
with those two things and to balance the applications’ demand between
the two of them.

The way you figure this out is to look at the first few lines of “top”.
You’ll see something like this (from a 1 GB machine):

Mem: 901256k total, 890708k used, 10548k free, 178360k buffers
Swap: 996020k total, 0k used, 996020k free, 341112k cached

Ignore the swap – my machine isn’t swapping and you say yours isn’t
either. “used” + “free” should equal “total”. “used” - “buffers” -
“cached” is what’s taken up by the code and data. So on my machine:

  10548  free
 178360  buffers
 341112  cached
 371236  used - buffers - cached = code + data
 ------
 901256  total

Since you’re not swapping, any free memory you have above the minimum
set by the kernel should be going into “buffers” and “cached” to
reduce the amount of I/O you need to do.
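
The same arithmetic can be read straight out of /proc/meminfo; a small sketch:

```shell
# Reproduce the top(1) arithmetic above: total - free - buffers - cached
# approximates the memory held by code and data.
awk '/^MemTotal:/ { t = $2 }
     /^MemFree:/  { f = $2 }
     /^Buffers:/  { b = $2 }
     /^Cached:/   { c = $2 }
     END { printf "code+data: %d kB\n", t - f - b - c }' /proc/meminfo
```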

Old kernel: 2.6.17-1.2142_FC4smp (last release for fedora core 4)
New kernel: 2.6.9-55.0.9.ELsmp (2nd most current release for CentOS 4.5 /
Red Hat enterprise 4.0 update 5)

Well, when you get all the other things humming, an upgrade to CentOS 5
will give you a newer kernel, Ruby, Apache, MySQL and PostgreSQL. But
there have been lots of “enhancements” between 4.5 and 5.0 in the area
of security – I found the learning curve for all the firewall stuff
rather steep. It’s worth it, of course, but it wasn’t a bullet I had the
energy to bite at the time.

I believe I traced the problem in mod_fcgid down, so I’m continuing the
test in in the test environment with the expectation it will pass now,
and try it out in production on Monday. The problem with mod_fcgid was
that it seems to be ignoring a configuration parameter for timeouts,
and defaults to aborting the ruby process after 40 seconds, one particular
url can take longer than 40 seconds due to some bad SQL(SQL alone takes
up to 30 seconds), I found in the source that the default IPC timeout
is 40 seconds, which matches exactly with all of the failures in the
apache logs(every one stops at 40 seconds), changed the source, recompiled
and pushed out the updated version to the test servers.

Interesting … you actually fixed something in software rather than
throwing hardware at it. Most folks would have moved the database to
something that could complete the query in 3 seconds and sent the bill
to their clients. ;)

M. Edward (Ed) Borasky wrote:

Since you’re not swapping, any free memory you have above the minimum
set by the kernel should be going into “buffers” and “cached” to
reduce the amount of I/O you need to do.

Yep, I understand all that. To illustrate, here’s one of the servers’
memory usage:

http://portal.aphroland.org/~aphro/memory-usage.png

Well, when you get all the other things humming, an upgrade to CentOS 5
will give you a newer kernel, Ruby, Apache, MySQL and PostgreSQL. But
there have been lots of “enhancements” between 4.5 and 5.0 in the area
of security – I found the learning curve for all the firewall stuff
rather steep. It’s worth it, of course, but it wasn’t a bullet I had the
energy to bite at the time.

Fortunately, the firewall isn’t something I’ll have to deal with; these
machines are part of a larger network behind shared network
infrastructure, firewalls, load balancers, etc., so no firewalls are
needed on the local systems.

Interesting … you actually fixed something in software rather than
throw hardware at it. Most folks would have moved the database to
something that could complete the query in 3 seconds and sent the bill
to their clients. ;)

HAH, yeah, my last job was like that to some degree. A phrase the
VP of engineering used to say on outage conference calls was, “is there
anything I can buy that would make this problem go away?” It was a
good place to play with new stuff, always buying… The servers would
average about 1.5 transactions a second before falling over, and their
main OLTP database is probably in the 50TB range now (orders of
magnitude larger than any other OLTP database in the world). Pretty
poor software… it led to lots of problems…

Fortunately, the developers at my current place are much more, how
should I say… customer oriented (with me being the customer, I’d
say), which makes my job easier…

And that bad query is supposed to be fixed soon by adding another
kind of index. There is already an index on the column, but the query
forces the search to be all lower case, so the DB ends up doing a
full scan across about 4.5 million rows (Oracle 10g R2).
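
The usual fix for that pattern on Oracle is a function-based index; a hypothetical sketch (table and column names invented for illustration):

```sql
-- Index the lowercased value so a WHERE LOWER(name) = :val predicate
-- can do an index lookup instead of a full scan; Oracle 10g supports
-- function-based indexes.
CREATE INDEX idx_accounts_lower_name ON accounts (LOWER(name));
```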

nate

nate wrote:

Interestingly enough, “older” is a misnomer, since the Fedora kernel
was something like 2.6.17-xx and the CentOS/RHEL 4.5 kernel was
2.6.9-xx. :) But Red Hat tweaks their kernels mercilessly, and
considering some of the talented folks they have and the close ties
they have with the server hardware vendors, an RHEL kernel or its
CentOS clone is usually better than one from Fedora.

Does anyone here know what version of Ruby and what OS/DBMS/webserver
the BaseCamp people run in production? BaseCamp, after all, is the
mother of all Rails applications. :)

nate wrote:

M. Edward (Ed) Borasky wrote:

Since you’re not swapping, any free memory you have above the minimum
set by the kernel should be going into “buffers” and “cached” to
reduce the amount of I/O you need to do.

Yep I understand all that, to illustrate, one of the server’s memory
usage -

I rolled 3 of the 4 servers back to the older OS last night. This
morning the 4th system freaked out again, and it wasn’t caught straight
away; when I logged in, I noticed another Ruby process was in “D”,
and this one wasn’t being served by Apache: it was a Ruby cron job.
So there must be some bug in Ruby, or an interaction between
this version of Ruby and the OS. I’m going to upgrade to a newer rev
of Ruby 1.8.5 (-114) in the test environment today, hoping to roll it
to the prod systems tomorrow to see if that helps.

nate