Not all SIGCHLD signals received

leksito · July 14, 2020, 1:07am

Hi!
I tried to run this script:

processes_count = 5

@rout_objects = []

trap('SIGCHLD') do
  pid = Process.wait(-1, Process::WNOHANG)
  puts pid.inspect
end

processes_count.times do |index|
  puts "creating new process #{index}"
  rout, wout = IO.pipe

  pid = fork {
    $stdout.reopen wout
    rout.close

    process_name = "Process ##{index}"
    Process.setproctitle process_name

    2.times do |i|
      puts "#{process_name}: Message ##{i}"
      sleep 1
    end
    exit 0
  }
  @rout_objects << rout
  wout.close
end

loop do
  out_ready, _, _ = IO.select(@rout_objects, nil, nil)
  out_ready.each do |rout|
    begin
      puts rout.read_nonblock(100)
    rescue EOFError
      nil
    end
  end
end

It spawns 5 processes that print some lines and exit. The main process traps SIGCHLD to print the children's PIDs. But sometimes I see that not all SIGCHLD signals are trapped. Example output:

creating new process 0
creating new process 1
creating new process 2
creating new process 3
creating new process 4
Process #0: Message #0
Process #1: Message #0
Process #2: Message #0
Process #3: Message #0
Process #4: Message #0
Process #0: Message #1
Process #1: Message #1
Process #2: Message #1
Process #3: Message #1
Process #4: Message #1
6951
6952
6953
6954

Here we can see that only 4 SIGCHLD signals were received and one was lost.
I’m using Ruby 2.3.8 on Ubuntu 19.04. Why aren't all SIGCHLD signals caught?

I don’t want to use waitall() or waitpid(pid_of_child) because I do not want to block the main process.

specious · July 14, 2020, 4:56pm

I was able to reproduce your results on a quad-core processor. It looks like a race condition in your trap. I managed to get it to work correctly by changing the trap to

trap('SIGCHLD') do
  pid = Process.wait(-1, Process::WUNTRACED)
  puts "Child #{pid} ended"
end

I also changed your delay to sleep rand(1..5) in the children just to add some variety, and it seems to work ok. Although that’s the problem with race conditions, you can never really prove that it’s fixed, only that it isn’t occurring any more…

Also, the final read loop is endless, although it doesn’t actually loop forever because it eventually blocks forever on the IO.select call once the child processes go away. It might be an idea to add a timeout to the IO.select call.
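
One way the timeout suggestion could look, as a rough sketch rather than a definitive fix. The helper name `drain_pipes` and its `timeout:` keyword are made up for this illustration; only `IO.select`, `read_nonblock` and `IO.pipe` come from the original script. Note that select reports a pipe as readable once its write end is closed, so the sketch also drops each reader when it hits EOF, which lets the loop end on its own when every child has finished.

```ruby
# Hypothetical helper (not from the original script): read from a set of
# pipe read-ends until every one reaches EOF, or until no data arrives
# within `timeout` seconds.
def drain_pipes(readers, timeout: 5)
  open_readers = readers.dup
  output = String.new

  until open_readers.empty?
    # The fourth argument is the timeout in seconds; IO.select returns nil
    # when it expires without any reader becoming ready.
    ready, = IO.select(open_readers, nil, nil, timeout)
    break if ready.nil?

    ready.each do |io|
      begin
        output << io.read_nonblock(100)
      rescue IO::WaitReadable
        next                      # spurious wakeup, try again later
      rescue EOFError
        io.close                  # writer is gone: stop watching this pipe
        open_readers.delete(io)
      end
    end
  end

  output
end

# Usage: three children each write one line into their own pipe.
readers = 3.times.map do |index|
  rout, wout = IO.pipe
  fork do
    rout.close
    $stdout.reopen(wout)
    puts "Process ##{index}: done"
  end
  wout.close
  rout
end

print drain_pipes(readers)
Process.waitall
```

With this shape the parent leaves the loop either when all pipes are closed or when the children stay silent for longer than the timeout, instead of waiting indefinitely.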

leksito · July 14, 2020, 6:14pm

Thank you for your reply. Your answer helped me find the solution. All signals are sent to the main process, but trap does not catch all of them, because they are not pushed onto a “queue”: each new signal overrides the previous one. This means that when SIGCHLD is received, we should call waitpid repeatedly to collect all dead child processes. The SIGCHLD handler should be:

trap('SIGCHLD') do
  while pid = Process.wait(-1, Process::WNOHANG)  rescue nil
    puts pid
  end
end
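
The same idea can be spelled out a little more explicitly. The following is only a sketch under a few assumptions: `reap_children` is a hypothetical helper name, and it rescues `Errno::ECHILD` (raised by `Process.wait` when no children are left) instead of using a blanket `rescue nil`, so other errors are not silently swallowed. The deadline in the usage part is there purely so the demo cannot hang.

```ruby
# Hypothetical helper: collect every child that has already exited,
# without blocking the caller.
def reap_children
  reaped = []
  loop do
    pid = Process.wait(-1, Process::WNOHANG)
    break if pid.nil?   # children still exist, but none has exited yet
    reaped << pid
  end
  reaped
rescue Errno::ECHILD
  reaped                # no child processes remain at all
end

# Usage: one SIGCHLD delivery may stand for several exited children,
# so the handler drains all of them each time it runs.
reaped = []
trap('SIGCHLD') { reaped.concat(reap_children) }

pids = 5.times.map { fork { exit 0 } }

# Wait (with a safety deadline) until every child has been collected.
deadline = Time.now + 5
sleep 0.05 until reaped.size == pids.size || Time.now > deadline

puts "reaped #{reaped.size} of #{pids.size} children"
```

Because the handler loops until `Process.wait` has nothing more to report, it does not matter how many signals were merged into a single delivery.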

specious · July 14, 2020, 9:09pm

Nice solution. It makes sense too: if another signal arrives while the trap is already executing, how else could it work?