Mongrel server file descriptor leak?

jngo · February 4, 2009, 3:20pm

Excuse my ignorance; I’m a RoR newbie.

We have a number of RoR applications running on a server with a cPanel
installation. One of our clients’ sites went belly up with the error
message:
Errno::EMFILE
Too many open files - socket(2)

Digging a bit, I found that the mongrel server had a large number of
sockets open and appears to be leaking them fairly frequently. The
lsof output looks like this:

mongrel_r 11184 username 193u unix 0xda4fa480 637209350 socket
mongrel_r 11184 username 194u unix 0xf685a680 637408911 socket
mongrel_r 11184 username 195u unix 0xcc2ea3c0 637684747 socket
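To watch whether the socket count keeps growing, the descriptors can also be tallied from Ruby instead of eyeballing lsof. This is only a rough sketch that assumes a Linux host with /proc mounted; `fd_summary` and its `proc_root` parameter are hypothetical names, not part of Mongrel or Rails.

```ruby
# Hypothetical helper: summarize a process's open descriptors by kind,
# based on the symlinks in /proc/<pid>/fd (Linux only).
# Returns nil when the fd directory is missing or is not a directory.
def fd_summary(pid, proc_root = "/proc")
  fd_dir = File.join(proc_root, pid.to_s, "fd")
  return nil unless File.directory?(fd_dir)

  summary = Hash.new(0)
  Dir.foreach(fd_dir) do |entry|
    next if entry == "." || entry == ".."
    begin
      target = File.readlink(File.join(fd_dir, entry))
    rescue SystemCallError
      next # descriptor was closed (or is unreadable) while scanning
    end
    # Targets look like "socket:[637209350]" or "pipe:[123]"; anything
    # without that "kind:[" prefix is counted as a regular file path.
    kind = target[/\A(\w+):\[/, 1] || "file"
    summary[kind] += 1
  end
  summary
end

# Example: p fd_summary(11184)   # pid taken from the lsof output above
```

Running it every few minutes (as the process owner or root) and comparing the "socket" count should show whether the leak tracks request volume or grows steadily.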

The application doesn’t do anything explicitly with sockets. As far as
I know the RoR installation is standard. From what I understand, the
hosting company’s support staff installed it, and RoR is only
semi-supported by cPanel, so they’re somewhat reluctant (or lack the
knowledge) to be of much assistance.

I did Google searches on every combination of
ruby/rails/socket/leak/cPanel/file descriptor/etc. that I could think of
and didn’t find anything. I’m not exactly sure what versions of
ruby/rails/mongrel we have installed, but I didn’t find anything about
this being a known issue that had been fixed. I’d hazard a guess that
we’re not running the bleeding edge and are likely a few versions
behind, but again I’m not entirely sure how ruby is installed/managed
under cPanel/WHM.

Any ideas on where to look further or what the issue could be?

jngo · February 4, 2009, 7:36pm

Can you post the content of your mongrel_cluster.yml (or equivalent
command line options in use) & mongrel version…?

Also the output of “ulimit -a” for the mongrel process owner.


jngo · February 10, 2009, 9:06pm

Benjamin Grant wrote:

Can you post the content of your mongrel_cluster.yml (or equivalent
command line options in use) & mongrel version…?

Also the output of “ulimit -a” for the mongrel process owner.

–

The mongrels are started under cPanel and we’re not using clusters.

From ps it looks like they’re just started with:
/usr/bin/ruby /usr/bin/mongrel_rails start -p 12005 -d -e production -P log/mongrel.pid

It’s running mongrel 1.1.5
ruby 1.8.7 (2008-06-20 patchlevel 22) [i686-linux]

I’m not sure exactly how many file descriptors cause it to die. I’ve
seen it running fine with 400+, so I’m guessing it dies at 512.

The ulimit settings don’t seem to have been changed at all.

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
file size (blocks, -f) unlimited
pending signals (-i) 1024
max locked memory (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 73728
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
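Since `ulimit -a` reports the limits of the shell it is run from, it may be worth confirming what the Ruby process itself inherited. A minimal sketch using the standard `Process.getrlimit` call; the `nofile_limits` helper name is just for illustration.

```ruby
# Illustrative helper: report the open-files limits as seen from inside
# the current Ruby process (soft limit is the one that triggers EMFILE).
def nofile_limits
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  { :soft => soft, :hard => hard }
end

# Example: STDERR.puts nofile_limits.inspect
```

Logging this once at startup (for example from an initializer) would show whether the daemonized mongrel really runs with the 1024 shown above.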

jngo · February 10, 2009, 10:07pm

This is what Mongrel invokes when the open-files limit is hit:

def reap_dead_workers(reason='unknown')
  if @workers.list.length > 0
    STDERR.puts "#{Time.now}: Reaping #{@workers.list.length} threads for slow workers because of '#{reason}'"
    error_msg = "Mongrel timed out this thread: #{reason}"
    mark = Time.now
    @workers.list.each do |worker|
      # Threads without a start time are stamped now so they age from here.
      worker[:started_on] = Time.now if not worker[:started_on]

      # Kill any worker that has run longer than @timeout + @throttle.
      if mark - worker[:started_on] > @timeout + @throttle
        STDERR.puts "Thread #{worker.inspect} is too old, killing."
        worker.raise(TimeoutError.new(error_msg))
      end
    end
  end
end

So check your mongrel logs and see if mongrel is attempting (and
possibly failing) to reap workers.
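To see what that reaping amounts to outside of Mongrel, here is a stand-alone imitation of the same idea. It is only a sketch: `reap_old_threads` and `max_age` are made-up names standing in for the method above and for `@timeout + @throttle`, and it raises `Timeout::Error` rather than Mongrel's own error.

```ruby
require "timeout"

# Imitation of the reaping logic quoted above, for experimenting in irb.
# `workers` is a ThreadGroup; `max_age` is in seconds.
# Returns the number of threads that were sent a Timeout::Error.
def reap_old_threads(workers, max_age, now = Time.now)
  reaped = 0
  workers.list.each do |worker|
    # Stamp threads that have no start time yet, as Mongrel does.
    worker[:started_on] ||= now
    if now - worker[:started_on] > max_age
      # Thread#raise interrupts the thread wherever it is currently blocked.
      worker.raise(Timeout::Error.new("timed out after #{max_age}s"))
      reaped += 1
    end
  end
  reaped
end
```

Note that the exception only interrupts the thread; any socket the worker had open is released only if the code it interrupts closes it (for example in an `ensure` block), which is one place a leak could hide.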

You could also slap this into application.rb:

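require 'timeout'

# Run the wrapped block, raising Timeout::Error if it takes longer than
# APPCONTROLLER_TIMEOUT seconds.
def timeout
  Timeout.timeout(APPCONTROLLER_TIMEOUT) do
    yield
  end
end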

Where APPCONTROLLER_TIMEOUT is the number of seconds you want any given
controller action to run for. This will effectively force timeouts to
occur independent of host/OS limits and of mongrel’s handling of the
exceptions thrown when reaching them.

I’ve observed good results with that in some cases.
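For trying the idea outside Rails first, a plain-Ruby sketch of the same wrapper follows. The one-second value and the `with_action_timeout` name are illustrative only, and returning `:timed_out` is just a placeholder for whatever error handling you prefer. Inside a Rails 2 controller the method would presumably be registered with `around_filter :timeout`, though that wiring is my assumption and not shown in the snippet above.

```ruby
require "timeout"

APPCONTROLLER_TIMEOUT = 1 # seconds; illustrative value only

# Run the given block, giving up after `limit` seconds.
# Returns the block's value, or :timed_out if the limit was exceeded.
def with_action_timeout(limit = APPCONTROLLER_TIMEOUT)
  Timeout.timeout(limit) { yield }
rescue Timeout::Error
  :timed_out
end

# Example:
#   with_action_timeout { 42 }             # block finishes in time
#   with_action_timeout(0.1) { sleep 1 }   # block is interrupted
```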

→ But, see also: http://ph7spot.com/articles/system_timer for more on
timeouts and how effective (or ineffective) they can be.

Meanwhile, raising that open files limit might buy you some time to
bug-hunt, depending on how frequently the problem is triggered. Also
have a look at HAProxy and/or NGINX if you’re not already running one
or both as a proxy tier in front of mongrel.
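If changing the limit in the startup environment is awkward under cPanel, the soft limit can also be lifted from inside the process, up to the hard limit. A hedged sketch using the standard `Process.setrlimit`; the helper name and the idea of calling it early at boot are my assumptions, and raising the hard limit itself still needs root.

```ruby
# Illustrative helper: raise the soft open-files limit toward `target`,
# never beyond the hard limit and never lowering it.
# Returns the soft limit in effect afterwards.
def raise_nofile_soft_limit(target)
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  new_soft = [target, hard].min
  Process.setrlimit(Process::RLIMIT_NOFILE, new_soft, hard) if new_soft > soft
  Process.getrlimit(Process::RLIMIT_NOFILE).first
end

# Example: raise_nofile_soft_limit(4096)
```

This only postpones the failure if descriptors really are leaking, so it is a stopgap while tracking down the leak.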
