I am running a RoR application on Apache 1.3/RedHat 7.3/MySQL 3.1.23 (old versions, I know, but upgrading to the latest versions is not practical for a number of reasons). There are 5 RoR FastCGI processes configured using FastCgiServer.

What I am finding is that, after a while, some of the FastCGI processes seem to 'hang'. They no longer process requests, and the only way to remove them is to use "kill -9". When all 5 FastCGI processes enter this state, my production site no longer works.

Has anyone else had a similar problem? Is there an elegant work-around that I can use to detect these dead processes and kill them?
on 2006-01-13 02:09
on 2006-01-13 02:31
I should probably also mention that I suspect that the problem has something to do with garbage collection. My reason for thinking this is that I initially had the garbage collector configured to clean up every 25 requests, i.e.:

    RailsFCGIHandler.process! nil, 25

But when I changed it back to automatic GC,

    RailsFCGIHandler.process!

I found that the processes ran for a significantly longer amount of time.
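To make the "every 25 requests" setting concrete, here is a rough sketch of what a GC request period amounts to. It is not the Rails source: the class name GcCountdown and its methods are invented for illustration, and it assumes the handler keeps automatic GC switched off between forced collections.

```ruby
# Hypothetical illustration only; GcCountdown is not part of Rails.
class GcCountdown
  attr_reader :collections

  def initialize(period)
    @period = period          # nil means "leave GC on automatic"
    @remaining = period
    @collections = 0
    GC.disable if @period     # assumption: no automatic GC between forced runs
  end

  # Call once after every request.
  def tick
    return unless @period
    @remaining -= 1
    run_gc if @remaining <= 0
  end

  # Call when the request loop ends, to restore normal GC behaviour.
  def finish
    GC.enable
  end

  private

  def run_gc
    GC.enable
    GC.start
    GC.disable
    @collections += 1
    @remaining = @period
  end
end

countdown = GcCountdown.new(25)
60.times { countdown.tick }   # stand-in for handling 60 requests
countdown.finish
puts countdown.collections    # collections were forced at requests 25 and 50
```

With a period of nil the object does nothing, which corresponds to calling process! with no arguments and leaving GC automatic.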
on 2006-01-15 04:07
On 1/12/06, John J. <firstname.lastname@example.org> wrote:
> longer works.
>
> Has anyone else had a similar problem? Is there an elegant work-around
> that I can use to detect these dead processes and kill them?

I am experiencing something similar. Apache at my hosting provider is configured to send a signal -USR1 to the fcgi processes every four hours in order to make them exit and restart. What seems to be happening is that the FCGI process receives the USR1 and doesn't exit until the next request. Meanwhile Apache thinks it has killed the process and doesn't send it any more requests. After a while I reach my process limit with processes stuck in this state. kill -9 will kill them and get things going again.

I have been playing around with changes to dispatch.fcgi, here's my current code but it isn't always working correctly.

    if ENV["RAILS_ENV"] == "production"
      ENV['GEM_PATH']='/home/jonsmirl/gems'

      class MyRailsFCGIHandler < RailsFCGIHandler
        def initialize(log_file_path = nil, gc_request_period = nil)
          super(log_file_path, gc_request_period)
          trap('TERM', method(:exit_now_handler).to_proc);
        end

        def process!(provider = FCGI)
          # Make a note of $" so we can safely reload this instance.
          mark!
          run_gc! if gc_request_period

          usr1 = trap("USR1", "DEFAULT")
          provider.each_cgi do |cgi|
            trap("USR1", usr1)
            process_request(cgi)

            case when_ready
            when :reload
              reload!
            when :restart
              close_connection(cgi)
              restart!
            when :exit
              close_connection(cgi)
              break
            end

            gc_countdown
            trap("USR1", "DEFAULT")
          end

          GC.enable
          dispatcher_log :info, "terminated gracefully"

        rescue SystemExit => exit_error
          dispatcher_log :info, "terminated by explicit exit"

        rescue Object => fcgi_error
          # retry on errors that would otherwise have terminated the FCGI process,
          # but only if they occur more than 10 seconds apart.
          if !(SignalException === fcgi_error) && Time.now - @last_error_on > 10
            @last_error_on = Time.now
            dispatcher_error(fcgi_error, "almost killed by this error")
            retry
          else
            dispatcher_error(fcgi_error, "killed by this error")
          end
        end

        def exit_now_handler(signal)
          dispatcher_log :info, "ignoring request to terminate immediately"
        end
      end

      MyRailsFCGIHandler.process! nil, 50
    else
      RailsFCGIHandler.process! nil, 50
    end

--
Jon S.
email@example.com
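The trap calls in process! above depend on Signal.trap returning whatever handler was installed before, so that it can be put back later. The following standalone sketch shows only that save-and-restore pattern. The handler body and variable names are invented for illustration, and it assumes a Unix-like system where USR1 exists.

```ruby
received = []

# Install a "graceful exit" style handler for USR1 (here it only records the signal).
graceful = proc { received << :usr1 }
Signal.trap("USR1", graceful)

# While idle: switch to the OS default action and remember the old handler.
# Under the default action, USR1 would normally terminate the process at once.
saved = Signal.trap("USR1", "DEFAULT")

# While handling a request: put the saved handler back so USR1 is deferred.
Signal.trap("USR1", saved)
Process.kill("USR1", Process.pid)

# Give Ruby a bounded amount of time to run the handler.
20.times do
  break if received.any?
  sleep 0.05
end

Signal.trap("USR1", "DEFAULT")   # leave the process in its default state
puts received.inspect
```

The point of the toggle is that a USR1 arriving while the process is idle takes the default action immediately, while one arriving mid-request goes to the Ruby handler instead.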
on 2006-01-15 08:34
Jon S. wrote:
>
> I am experiencing something similar. Apache at my hosting provider is
> configured to send a signal -USR1 to the fcgi processes every four
> hours in order to make them exit and restart. What seems to be
> happening is that the FCGI process receives the USR1 and doesn't exit
> until the next request. Meanwhile Apache thinks it has killed the
> process and doesn't send it any more requests. After a while I reach
> my process limit with processes stuck in this state. kill -9 will kill
> them and get things going again.
>

Thanks for the hint Jon. I had thought about modifying the RailsFCGIHandler so that the process exits after (say) 25 requests instead of invoking the garbage collector. I was not, however, aware of the USR1 signal thing. I think I will play around with the RailsFCGIHandler and see if I get more reliability.
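The "exit after (say) 25 requests" idea can be sketched as below. This is a hypothetical outline rather than a change to RailsFCGIHandler: serve_with_limit and its arguments are invented names, and a plain array stands in for the stream of requests that FCGI.each_cgi would supply. It also assumes the FastCGI process manager starts a fresh process when one exits.

```ruby
# Hypothetical sketch; serve_with_limit is not part of Rails or the fcgi gem.
# Handles requests from any enumerable source and stops after max_requests,
# so the script can fall off the end and let a fresh process take over.
def serve_with_limit(requests, max_requests)
  handled = 0
  requests.each do |request|
    yield request
    handled += 1
    break if handled >= max_requests
  end
  handled
end

incoming = (1..100).to_a                       # stand-in for incoming requests
handled = serve_with_limit(incoming, 25) { |request| request * 2 }
puts "handled #{handled} requests, exiting"
```

Compared with forcing GC every N requests, this trades a periodic process start-up cost for a clean heap on every restart.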
on 2006-01-15 19:09
On 1/15/06, John J. <firstname.lastname@example.org> wrote:
> Thanks for the hint Jon. I had thought about modifying the
> RailsFCGIHandler so that the process exits after (say) 25 requests
> instead of invoking the garbage collector. I was not, however, aware of
> the USR1 signal thing. I think I will play around with the
> RailsFCGIHandler and see if I get more reliability.

In the ruby fcgi gem there is a file called README.signals. It describes what needs to be done to make Apache fcgi work correctly. The problem is that Rails fcgi_handler.rb is not implementing what that file says to do.

My hosting provider is doing a 'graceful' Apache restart every four hours. Apache sends out the USR1 signals as described in README.signals. Without changing the Rails fcgi_handler code, the USR1 signal gets queued and the process doesn't exit, since Rails has registered a USR1 handler. Queuing is what ruby is supposed to do if the main thread is stuck in the select(). The USR1 signal will be dequeued and handled when the select() completes. But Apache has restarted and is disconnected from the process, so the select never completes and the process never exits. After a while these build up and I reach my process limit. At that point all of the processes will be running but they are disconnected from Apache - then you start getting a permanent Error 500.

Another way I am getting Error 500 at dreamhost is via sigTERM. They seem to have a supervisor process out there that looks for 'extra' FCGI processes and sends them a sigTERM. TERM is bad because it makes fcgi exit even if it is in the middle of processing a request. That's a guaranteed way to get an intermittent Error 500. After a while I end up in a steady dance of disconnected processes getting TERM to kill them; that works. But the TERM is also hitting random good processes too. Thus the random Error 500 behavior at dreamhost. The site keeps running but it is really broken.
My solution to this is to disable kill from TERM and make USR1 work correctly. This seems to be working, but this is a slow, long-term problem and it is hard to tell if I really have eliminated the Error 500s.

One part I don't understand is why the selects don't complete in the disconnected processes after the Apache graceful restart. It seems like these sockets should be getting closed, causing the select to return nil, but this doesn't happen. I haven't figured out if Apache isn't closing the socket or if Ruby is broken on completing the select when the socket closes. If the select() completed, the queued USR1 signal would run and the process would exit. I've tried playing with the code in this area and I keep getting the processes stuck in zombie state.

--
Jon S.
email@example.com
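One generic way around a wait that never completes is to have the signal handler set a flag and to give the wait a timeout, so the loop gets a chance to look at the flag even if no request ever arrives. The sketch below shows only that pattern, with a pipe standing in for the FastCGI socket. The variable names are invented, and whether the fcgi library's accept loop can be driven this way is an assumption that is not established in this thread.

```ruby
exit_requested = false
previous = Signal.trap("USR1") { exit_requested = true }

# Nothing is ever written to this pipe, which mimics a process that the
# web server has stopped talking to.
reader, writer = IO.pipe

# Simulate the graceful-restart signal arriving while the loop is waiting.
signaller = Thread.new do
  sleep 0.1
  Process.kill("USR1", Process.pid)
end

polls = 0
deadline = Time.now + 5                        # safety net for this demo only
until exit_requested || Time.now > deadline
  ready = IO.select([reader], nil, nil, 0.2)   # wake up at least every 0.2 s
  polls += 1
  # A real loop would accept and process a request here when ready is non-nil.
end

signaller.join
Signal.trap("USR1", previous || "DEFAULT")
reader.close
writer.close
puts exit_requested ? "exit flag seen after #{polls} poll(s)" : "timed out"
```

The same flag could be checked after each request, so a signal that arrives mid-request lets the request finish before the process exits.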