FastCGI processes sometimes 'hang'


#1

I am running a RoR application on Apache 1.3/RedHat 7.3/MySQL 3.1.23
(old versions, I know, but upgrading to the latest versions is not
practical for a number of reasons). There are 5 RoR FastCGI processes
configured using FastCgiServer.

What I am finding is that, after a while, some of the FastCGI processes
seem to ‘hang’. They no longer process requests, and the only way to
remove them is to use “kill -9”.

When all 5 FastCGI processes enter this state, my production site no
longer works.

Has anyone else had a similar problem? Is there an elegant work-around
that I can use to detect these dead processes and kill them?
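One crude detection approach, sketched below under some loud assumptions: sample each dispatch.fcgi process's accumulated CPU time twice, a few minutes apart, and `kill -9` any process whose CPU time did not advance. The `hung_pids` and `sample_fcgi_cpu` names, the `ps` parsing, and the five-minute window are all illustrative, not anything from Rails or mod_fastcgi:

```ruby
# Sketch of a cron-driven watchdog for hung FastCGI processes.
# hung_pids is pure logic; sample_fcgi_cpu and the interval are
# assumptions about the deployment.

# Given two { pid => cpu_seconds } samples, return the pids present in
# both whose CPU time did not advance (they did no work in between).
def hung_pids(earlier, later)
  later.keys.select { |pid| earlier.key?(pid) && earlier[pid] == later[pid] }
end

# Collect { pid => cpu_seconds } for every dispatch.fcgi process via ps.
def sample_fcgi_cpu
  sample = {}
  `ps -eo pid,cputime,comm`.each_line do |line|
    pid, cputime, comm = line.split
    next unless comm.to_s.include?('dispatch.fcgi')
    # Fold [hh:]mm:ss into seconds (day-long cputimes are out of scope here).
    sample[pid.to_i] = cputime.split(':').map { |t| t.to_i }
                              .inject(0) { |total, part| total * 60 + part }
  end
  sample
end

# Usage, e.g. from a script run by cron:
#   earlier = sample_fcgi_cpu
#   sleep 300
#   hung_pids(earlier, sample_fcgi_cpu).each { |pid| Process.kill(9, pid) }
```

Caveat: a busy site is assumed. An idle-but-healthy worker also accrues no CPU, so on a quiet site this recycles idle workers along with hung ones.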


#2

I should probably also mention that I suspect that the problem has
something to do with garbage collection.

My reason for thinking this is that I initially had the garbage
collector configured to clean up every 25 requests:

RailsFCGIHandler.process! nil, 25

But when I changed it back to automatic GC,

RailsFCGIHandler.process!

I found that the processes ran for a significantly longer amount of
time.
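For anyone unfamiliar with that second argument: it tells the handler to run a forced GC pass every N requests instead of letting Ruby collect whenever it likes. The class below is an illustrative stand-in for that countdown logic (the names are made up; the real code lives in Rails' fcgi_handler.rb):

```ruby
# Illustrative stand-in for RailsFCGIHandler's periodic-GC countdown.
class GCCountdown
  attr_reader :collections  # number of forced GC runs so far

  def initialize(period)
    @period = period        # e.g. 25, as in process!(nil, 25)
    @left = period
    @collections = 0
  end

  # Call once per completed request; forces a collection every @period calls.
  def request_served
    @left -= 1
    return if @left > 0
    @collections += 1       # stands in for: GC.enable; GC.start; GC.disable
    @left = @period
  end
end
```

With a period of 25, a hundred requests trigger four forced collections; with the no-argument form, Ruby's GC simply runs on its own schedule, which is apparently what kept the processes alive longer here.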


#3

On 1/12/06, John J. removed_email_address@domain.invalid wrote:

When all 5 FastCGI processes enter this state, my production site no
longer works.

Has anyone else had a similar problem? Is there an elegant work-around
that I can use to detect these dead processes and kill them?

I am experiencing something similar. Apache at my hosting provider is
configured to send a USR1 signal to the fcgi processes every four
hours in order to make them exit and restart. What seems to be
happening is that the FCGI process receives the USR1 and doesn’t exit
until the next request. Meanwhile, Apache thinks it has killed the
process and doesn’t send it any more requests. After a while I reach
my process limit, with processes stuck in this state. kill -9 will kill
them and get things going again.

I have been playing around with changes to dispatch.fcgi, here’s my
current code but it isn’t always working correctly.

if ENV["RAILS_ENV"] == "production"

  ENV['GEM_PATH'] = '/home/jonsmirl/gems'

  class MyRailsFCGIHandler < RailsFCGIHandler

    def initialize(log_file_path = nil, gc_request_period = nil)
      super(log_file_path, gc_request_period)
      trap('TERM', method(:exit_now_handler).to_proc)
    end

    def process!(provider = FCGI)
      # Make a note of $" so we can safely reload this instance.
      mark!

      run_gc! if gc_request_period

      usr1 = trap("USR1", "DEFAULT")
      provider.each_cgi do |cgi|
        trap("USR1", usr1)
        process_request(cgi)

        case when_ready
          when :reload
            reload!
          when :restart
            close_connection(cgi)
            restart!
          when :exit
            close_connection(cgi)
            break
        end

        gc_countdown
        trap("USR1", "DEFAULT")
      end

      GC.enable
      dispatcher_log :info, "terminated gracefully"

    rescue SystemExit => exit_error
      dispatcher_log :info, "terminated by explicit exit"

    rescue Object => fcgi_error
      # Retry on errors that would otherwise have terminated the FCGI
      # process, but only if they occur more than 10 seconds apart.
      if !(SignalException === fcgi_error) && Time.now - @last_error_on > 10
        @last_error_on = Time.now
        dispatcher_error(fcgi_error, "almost killed by this error")
        retry
      else
        dispatcher_error(fcgi_error, "killed by this error")
      end
    end

    def exit_now_handler(signal)
      dispatcher_log :info, "ignoring request to terminate immediately"
    end

  end

  MyRailsFCGIHandler.process! nil, 50

else
  RailsFCGIHandler.process! nil, 50
end


Jon S.
removed_email_address@domain.invalid


#4

Jon S. wrote:

I am experiencing something similar. Apache at my hosting provider is
configured to send a USR1 signal to the fcgi processes every four
hours in order to make them exit and restart. What seems to be
happening is that the FCGI process receives the USR1 and doesn’t exit
until the next request. Meanwhile, Apache thinks it has killed the
process and doesn’t send it any more requests. After a while I reach
my process limit, with processes stuck in this state. kill -9 will kill
them and get things going again.

Thanks for the hint Jon. I had thought about modifying the
RailsFCGIHandler so that the process exits after (say) 25 requests
instead of invoking the garbage collector. I was not, however, aware of
the USR1 signal thing. I think I will play around with the
RailsFCGIHandler and see if I get more reliability.


#5

On 1/15/06, John J. removed_email_address@domain.invalid wrote:

Thanks for the hint Jon. I had thought about modifying the
RailsFCGIHandler so that the process exits after (say) 25 requests
instead of invoking the garbage collector. I was not, however, aware of
the USR1 signal thing. I think I will play around with the
RailsFCGIHandler and see if I get more reliability.

In the ruby fcgi gem there is a file called README.signals. It
describes what needs to be done to make Apache fcgi work correctly.
The problem is that Rails fcgi_handler.rb is not implementing what
that file says to do.

My hosting provider is doing a ‘graceful’ Apache restart every four
hours. Apache sends out the USR1 signals as described in
README.signals. Without changing the Rails fcgi_handler code, the USR1
signal gets queued and the process doesn’t exit, since Rails has
registered a USR1 handler. Queuing is what Ruby is supposed to do if the
main thread is stuck in the select(); the USR1 signal will be dequeued
and handled when the select() completes. But Apache has restarted and
is disconnected from the process, so the select never completes and
the process never exits. After a while these build up and I reach my
process limit. At that point all of the processes will be running, but
they are disconnected from Apache - then you start getting a
permanent Error 500.
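The trap-toggling trick from the dispatch.fcgi above can be seen in isolation (a sketch; `log` and the `graceful` proc are illustrative stand-ins for the Rails handler). Ruby’s trap returns the handler it replaced, so the code can install the default disposition while idle - meaning USR1 kills the process outright, as README.signals wants - and restore the graceful handler only while a request is in flight:

```ruby
log = []
graceful = proc { log << :graceful }  # stand-in for the Rails USR1 handler

trap("USR1", graceful)             # installed at startup
usr1 = trap("USR1", "DEFAULT")     # going idle: default = die immediately;
                                   # trap returns the handler it replaced
trap("USR1", usr1)                 # request arrived: restore graceful handler
Process.kill("USR1", Process.pid)  # simulate Apache's restart signal mid-request
sleep 0.1                          # give the trap handler a chance to run
```

While the DEFAULT disposition is in place (i.e. while the process sits in select() waiting for Apache), the same signal would terminate the process immediately instead of being queued behind a select that never returns.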

Another way I am getting Error 500 at DreamHost is via SIGTERM. They
seem to have a supervisor process out there that looks for ‘extra’
FCGI processes and sends them a SIGTERM. TERM is bad because it makes
fcgi exit even if it is in the middle of processing a request. That’s
a guaranteed way to get an intermittent Error 500.

After a while I end up in a steady dance of disconnected processes
getting TERMed, which works - but the TERM is also hitting random
good processes. Hence the random Error 500 behavior at DreamHost:
the site keeps running, but it is really broken.

My solution to this is to disable the kill from TERM and make USR1
work correctly. This seems to be working, but this is a slow,
long-term problem and it is hard to tell if I have really eliminated
the Error 500s.
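The “soften TERM” half of that can be sketched as follows (names are illustrative; in the real handler the flag is checked via when_ready after each request). Trap TERM to set a flag, finish the current request, then break out of the accept loop instead of dying mid-request:

```ruby
exit_requested = false
trap("TERM") { exit_requested = true }  # no immediate exit on TERM

served = []
[:a, :b, :c].each do |req|              # stand-in for provider.each_cgi
  served << req                         # stand-in for process_request(cgi)
  Process.kill("TERM", Process.pid) if req == :b  # supervisor strikes mid-run
  sleep 0.05                            # let the trap handler run
  break if exit_requested               # exit only *between* requests
end
```

Request :b completes even though TERM arrived while it was being served; only :c is never started. That turns the intermittent mid-request Error 500 into a clean worker recycle.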

One part I don’t understand is why the selects don’t complete for the
disconnected processes after the Apache graceful restart. It seems
like these sockets should be getting closed, causing the select to
return nil, but this doesn’t happen. I haven’t figured out whether
Apache isn’t closing the socket or Ruby is failing to complete the
select when the socket closes. If the select() completed, the queued
USR1 signal would run and the process would exit. I’ve tried playing
with the code in this area and I keep getting the processes stuck in
zombie state.


Jon S.
removed_email_address@domain.invalid