Why not ignore stale PID files?

Hi,

I have an application which is dying horrible deaths
(i.e. segmentation faults) in mid-flight, in production… And of
course, I should fix it. But while hunting for the bugs, I found
something I think should behave differently. I can work on submitting
a patch, as it is quite simple, but I might be missing something in my
reasoning.

When Mongrel segfaults, it does not -obviously- get to clean up after
itself, so it does not remove the PID files. As an example:

$ sudo /etc/init.d/mongrel-cluster start
Starting mongrel-cluster: Starting all mongrel_clusters…
mongrel-cluster.
$ sudo cat tmp/pids/mongrel.8203.pid | xargs kill -9
$ sudo /etc/init.d/mongrel-cluster status
(…)
found pid_file: tmp/pids/mongrel.8203.pid
missing mongrel_rails: port 8203
(…)
$ sudo /etc/init.d/mongrel-cluster restart
Restarting mongrel-cluster: Restarting all mongrel_clusters…
** !!! PID file tmp/pids/mongrel.8203.pid already exists. Mongrel could
be running already. Check your log/mongrel.8203.log for errors.
** !!! Exiting with error. You must stop mongrel and clear the .pid
before I’ll attempt a start.
mongrel-cluster.

So, what’s the solution? I must manually do:

$ sudo rm tmp/pids/mongrel.8203.pid
$ sudo /etc/init.d/mongrel-cluster restart

And now it works.

What should happen? Well, ‘status’ already found that there is a stale
PID. Of course, the ‘status’ action means exactly that: Get the
status, do nothing else. But the ‘stop’ action should clean up the PID
files whose processes no longer exist, and the ‘start’ action should
check whether the process with that PID is alive, and ignore the PID
file if it’s not. At least, this behaviour should be specifiable via
the configuration file.

What do you think?


Gunnar W. - [email protected] - (+52-55)5623-0154 / 1451-2244
PGP key 1024D/8BB527AF 2001-10-23
Fingerprint: 0C79 D2D1 2C4E 9CE4 5973 F800 D80E F35A 8BB5 27AF

use the mongrel_cluster --clean option

On Thu, 5 Jun 2008 16:08:06 -0500
Gunnar W. [email protected] wrote:

What should happen? Well, ‘status’ already found that there is a stale
PID. Of course, the ‘status’ action means exactly that: Get the
status, do nothing else. But the ‘stop’ action should clean up the PID
files whose processes no longer exist, and the ‘start’ action should
check whether the process with that PID is alive, and ignore the PID
file if it’s not. At least, this behaviour should be specifiable via
the configuration file.

That would be the ideal situation, but Ruby doesn’t have good enough
process management APIs to do this portably. To make it work you’d
have to portably be able to take a PID and see if there’s a mongrel
running with that PID.

You can’t use /proc or /sys because that’s Linux only. You can’t use
ps because the OSX morons changed everything, Solaris has a different
output format, etc.

If you were to do this, you’d have to dip into C code to pull it off.

Now, if you’re only on linux then you could write yourself a small
little hack to the mongrel_rails script that did this with info out
of /proc.
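
For a Linux-only setup, a minimal sketch of such a check might read
/proc/<pid>/cmdline and look for mongrel_rails among the arguments. The
method name process_matches? and its proc_root parameter are made up
for illustration; they are not part of Mongrel.

```ruby
# Linux-only sketch: does the process with this PID look like a mongrel?
# `proc_root` is a hypothetical parameter, overridable so the check can
# be exercised against a fake directory tree instead of the real /proc.
def process_matches?(pid, pattern = /mongrel_rails/, proc_root = '/proc')
  # /proc/<pid>/cmdline holds the argv entries separated by NUL bytes.
  cmdline = File.binread(File.join(proc_root, pid.to_s, 'cmdline'))
  cmdline.split("\0").any? { |arg| arg.match?(pattern) }
rescue Errno::ENOENT, Errno::ESRCH
  # No such entry: the process is gone, so its PID file would be stale.
  false
end
```

Something like process_matches?(File.read(pid_file).to_i) could then
decide whether an existing PID file should block a start. Matching on
the command line also guards against the PID having been reused by an
unrelated process.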


Zed A. Shaw


Mongrel-users mailing list
[email protected]
http://rubyforge.org/mailman/listinfo/mongrel-users

kill -0 `cat pid_file` >& /dev/null

more like

kill -0 $(<pid_file) >& /dev/null

regards,
Istvan

Zed A. Shaw wrote [Fri, Jun 06, 2008 at 01:01:32AM -0400]:

Now, if you’re only on linux then you could write yourself a small
little hack to the mongrel_rails script that did this with info out
of /proc.

Oh, silly me… I thought Ruby’s Process class dealt with the
architectural incompatibilities… What I wrote to check for the
status is quite straightforward:

------------------------------------------------------------
#!/usr/bin/ruby
# Remove stale mongrel PID files and restart the cluster if any were found.
require 'yaml'

confdir = '/etc/mongrel-cluster/sites-enabled'
restart_cmd = '/etc/init.d/mongrel-cluster restart'
needs_restart = false

(Dir.entries(confdir) - ['.', '..']).each do |site|
  conf = YAML.load_file(File.join(confdir, site))
  # Turn the configured pid_file (….pid) into a glob (…*.pid) so that it
  # matches the per-port PID files
  pid_location = File.join(conf['cwd'], conf['pid_file']).sub(/\.pid$/, '*.pid')

  Dir.glob(pid_location).each do |pidf|
    pid = File.read(pidf).to_i
    begin
      # Raises Errno::ESRCH when no process with this PID exists
      Process.getpgid(pid)
    rescue Errno::ESRCH
      warn "Process #{pid} (cluster #{site}) is dead!"
      File.unlink pidf
      needs_restart = true
    end
  end
end

system(restart_cmd) if needs_restart
------------------------------------------------------------

(periodically run via cron)

I guess this works in any Unixy environment… I have no idea
whether Windows implements something similar to Process.getpgid, or
for that matter, anything about Windows’ process management.

Greetings,


Gunnar W.

Gunnar W. [email protected] wrote:

If you were to do this, you’d have to dip into C code to pull it off.

I guess this works in any Unixy environment… I have no idea
whether Windows implements something similar to Process.getpgid, or
for that matter, anything about Windows’ process management.

Process.kill(0, pid) also works and is (in my experience) more
widely used.
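
A minimal sketch of that approach, assuming a Unix-like system. Signal
0 delivers nothing and only performs the existence and permission
checks. The helper names process_alive? and remove_stale_pid_file are
hypothetical, not something Mongrel provides.

```ruby
# True if a process with this PID currently exists.
def process_alive?(pid)
  Process.kill(0, pid) # signal 0: existence/permission check only
  true
rescue Errno::ESRCH
  false                # no such process
rescue Errno::EPERM
  true                 # process exists but belongs to another user
end

# Hypothetical helper: delete the PID file if its process is gone.
# Returns true when a stale file was removed, false otherwise.
def remove_stale_pid_file(path)
  pid = File.read(path).to_i
  # to_i yields 0 for an empty or garbled file; treat that as stale
  # rather than signalling PID 0, which addresses our own process group.
  return false if pid > 0 && process_alive?(pid)

  File.unlink(path)
  true
end
```

Note that this only tells you that some process has that PID, not that
it is a mongrel; a PID reused by an unrelated process would still count
as alive.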

Tikhon Bernstam wrote [Thu, Jun 05, 2008 at 07:29:22PM -0700]:

use the mongrel_cluster --clean option

Very good addition to the overall logic, keeps things cleaner :)


Gunnar W.