Having problems with open4 and stuck forked processes

Tim_U · September 22, 2010, 2:38pm

I am running a batch process which uses the wkhtmltoimage-i386 binary
to make screenshots of urls. Unfortunately this is in beta and it
frequently hangs up and takes up 100% of one of the CPUs on the
machine.

I have the following code to try and detect the hung process and kill
it but it doesn’t always work and I was wondering if anybody has a
better idea of how to do this. When I run it by testing simple
commands like sleep it works perfectly. In production with this binary
it doesn’t seem to always work.

def Util.shell_with_timeout(cmd, seconds = 3600)
  # the default timeout is an hour. That's probably way too long

  Timeout::timeout(seconds) {
    @pid, @stdin, @stdout, @stderr = Open4.popen4(cmd)
    ignored, @status = Process::waitpid2 @pid
    if @status.exitstatus != 0
      raise "Exit Status not zero"
    end
  }

  @stdout ? @stdout.read.strip : ''

rescue Timeout::Error
  Process.detach @pid
  Process.kill 'SIGKILL', @pid
  raise "Process Timed out"
rescue => e
  msg = @stderr ? @stderr.read.strip : ''
  msg += e.to_s
  raise "Error during execution of command #{cmd}\n #{msg}"
end

Tim_U · September 22, 2010, 5:34pm

On Wed, Sep 22, 2010 at 2:31 PM, Tim U. wrote:

I am running a batch process which uses the wkhtmltoimage-i386 binary
to make screenshots of urls. Unfortunately this is in beta and it
frequently hangs up and takes up 100% of one of the CPUs on the
machine.

I have the following code to try and detect the hung process and kill
it but it doesn’t always work and I was wondering if anybody has a
better idea of how to do this. When I run it by testing simple
commands like sleep it works perfectly. In production with this binary
it doesn’t seem to always work.

What do you mean by that? Does the timeout go undetected? Can’t you
kill the process? Are there any unexpected error messages or
exceptions?

  @stdout ? @stdout.read.strip : ''
rescue Timeout::Error
  Process.detach @pid
  Process.kill 'SIGKILL', @pid
  raise "Process Timed out"
rescue => e
  msg = @stderr ? @stderr.read.strip : ''
  msg += e.to_s
  raise "Error during execution of command #{cmd}\n #{msg}"
end

A frequent problem with #popen methods is to not read file descriptors
which can make the client hang (i.e. if it writes more than fits into
a pipe). That could be something to check since you are not reading
any of the streams.
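
To make the point about unread descriptors concrete, here is a minimal sketch of draining both streams while the child runs. It uses Open3 from the standard library rather than the open4 gem (my substitution, not what the original code uses), and run_and_drain is a made-up helper name. The seq command is only there to produce more output than a pipe buffer usually holds.

```ruby
require 'open3'

# Hypothetical helper (not from the original post): runs cmd and reads
# stdout and stderr in separate threads, so the child never blocks on a
# full pipe while we wait for it to exit.
def run_and_drain(cmd)
  Open3.popen3(cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close                                # we send no input
    out_reader = Thread.new { stdout.read }    # drain stdout until EOF
    err_reader = Thread.new { stderr.read }    # drain stderr until EOF
    status = wait_thr.value                    # Process::Status of the child
    [out_reader.value, err_reader.value, status]
  end
end

# seq prints 50000 lines, far more than fits into one pipe buffer
out, err, status = run_and_drain("seq 1 50000")
puts "read #{out.lines.size} lines, success=#{status.success?}"
```

If the reads only happened after waiting for the child, as in the original method, a child that writes this much would likely sit blocked in write() forever.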

Kind regards

robert

Tim_U · September 23, 2010, 1:59am

What do you mean by that? Does the timeout go undetected? Can’t you
kill the process? Are there any unexpected error messages or
exceptions?

Obviously the timeout is not detected. I am not sure about the
exceptions as it happens when I am not looking but I will ramp up the
logging and see if I can trap anything.

A frequent problem with #popen methods is to not read file descriptors
which can make the client hang (i.e. if it writes more than fits into
a pipe). That could be something to check since you are not reading
any of the streams.

If you have any pointers to documentation about this I would really
appreciate it. I know so little about unix processes and pipes and
such.

Tim_U · September 23, 2010, 9:46am

Thread.abort_on_exception = true

at the beginning of your script.

Euhm (asking this because I honestly don’t know) will this work for
Processes ? (he’s not using Thread)

Elise

Tim_U · September 23, 2010, 10:09am

elise huard wrote:

Euhm (asking this because I honestly don’t know) will this work for
Processes ? (he’s not using Thread)

Timeout::timeout uses a thread internally - and it raises an exception
asynchronously in the main thread, which makes it unsafe in just about
any application you can think of for it.

It would be safer to use select() on the data coming from the child to
wait for the process to terminate (when you read end-of-file)

Tim_U · September 23, 2010, 8:24am

On 23.09.2010 01:59, Tim U. wrote:

What do you mean by that? Does the timeout go undetected? Can’t you
kill the process? Are there any unexpected error messages or
exceptions?

Obviously the timeout is not detected.

I don’t find that obvious at all from your initial description.

I am not sure about the
exceptions as it happens when I am not looking but I will ramp up the
logging and see if I can trap anything.

You could start by doing

Thread.abort_on_exception = true

at the beginning of your script.

A frequent problem with #popen methods is to not read file descriptors
which can make the client hang (i.e. if it writes more than fits into
a pipe). That could be something to check since you are not reading
any of the streams.

If you have any pointers to documentation about this I would really
appreciate it. I know so little about unix processes and pipes and
such.

I don’t have anything handy but I guess Google will help.

A pipe is basically what it looks like: a piece of pipe which you
write to at one end and read from at the other. At the read end
there is a valve. If nobody reads, the valve stays closed, and once the
pipe is full you can’t fill in more at the write end. If you use
blocking IO, your process blocks in the write system call and won’t
become active again until someone reads from the other end. (This is a
bit simplistic because it leaves threads and interpreter implementation
out of the picture, but this is basically what happens.)
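
A small sketch of the "valve" in plain Ruby, using a pipe inside one process instead of a real child: non-blocking writes are used so the script reports when the pipe is full rather than hanging. The actual capacity depends on the operating system, so no particular number is assumed here.

```ruby
reader, writer = IO.pipe
chunk = "x" * 1024
written = 0

begin
  # Keep writing until the kernel refuses: nobody is reading, so the pipe fills up
  loop { written += writer.write_nonblock(chunk) }
rescue IO::WaitWritable
  # A blocking write would hang at this point until the reader drains some data
end
puts "pipe accepted #{written} bytes before it would have blocked"

# Reading from the other end opens the valve again
drained = reader.read_nonblock(written).bytesize
accepted_again = writer.write_nonblock(chunk)
puts "drained #{drained} bytes, then wrote #{accepted_again} more"
```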

Kind regards

robert

Tim_U · September 24, 2010, 9:12am

It would be safer to use select() on the data coming from the child to
wait for the process to terminate (when you read end-of-file)

Do you know of any examples on how to do that? I am willing to rewrite
my code obviously.

Tim_U · September 25, 2010, 12:25am

It would be safer to use select() on the data coming from the child to
wait for the process to terminate (when you read end-of-file)

But what if the process hangs?

Wouldn’t I need to use timeout to check for that anyway?

Tim_U · September 25, 2010, 10:15am

On 25.09.2010 00:25, Tim U. wrote:

It would be safer to use select() on the data coming from the child to
wait for the process to terminate (when you read end-of-file)

But what if the process hangs?

Wouldn’t I need to use timeout to check for that anyway?

Select can be called with a timeout which guarantees that the call
returns in time regardless whether there is any data available.
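
A minimal sketch of that behaviour, again with a bare pipe standing in for the child's stdout: IO.select returns nil when the timeout passes with nothing to read, and an array of ready IOs otherwise. The 0.2 second timeout is arbitrary.

```ruby
reader, writer = IO.pipe

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
ready = IO.select([reader], nil, nil, 0.2)   # nothing was written yet
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
timed_out = ready.nil?                       # nil means the timeout expired

writer.write("hello")
ready_now = IO.select([reader], nil, nil, 0.2)
readable = ready_now ? ready_now[0] : []     # first element lists readable IOs

puts "timed out: #{timed_out}, readable afterwards: #{readable.include?(reader)}"
```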

Kind regards

robert

Tim_U · September 23, 2010, 10:19am

On Thu, Sep 23, 2010 at 9:42 AM, elise huard wrote:

Thread.abort_on_exception = true

at the beginning of your script.

Euhm (asking this because I honestly don’t know) will this work for
Processes ? (he’s not using Thread)

But he uses Timeout which AFAIK uses threads internally for monitoring.

Cheers

robert

Tim_U · October 4, 2010, 3:25am

On Sat, Sep 25, 2010 at 9:05 PM, Robert K. wrote:

in time regardless whether there is any data available.

Hey guys, I want to revisit this issue because I can’t seem to find any
documentation on how to do this.

What I want to do seems simple enough. I want to shell out to a process
which sometimes gets stuck. It won’t return at all. It just sits there
taking up 100% of the CPU (one of the cores anyway). I just want to
make sure that if the process does not end in a reasonable amount of
time, it gets killed.

So far I have tried wrapping it in a timeout block but that doesn’t
always trigger for some reason. I have plenty of error handling and
have an ensure block which says to kill the process if it exists but
nothing I do seems to work. Sooner or later I get a stuck process that
hangs around forever till I kill it by hand.

Surely there is a simple way to do this.

Here is the code I have so far.

gist.github.com

https://gist.github.com/timuckun/609119

gistfile1.rb

module Util
  def Util.not_implemented
    raise "not implemented yet"
  end

  def Util.kill_process_if_exists pid
    running = true
    begin
      Process.kill(0, pid)
      #if we get here the process is alive

(The file is truncated here.)

Tim_U · October 4, 2010, 4:39am

On 10/3/10, Tim U. wrote:

always trigger for some reason. I have plenty of error handling and
have an ensure block which says to kill the process if it exists but
nothing I do seems to work. Sooner or later I get a stuck process that
hangs around forever till I kill it by hand.

Timeout::timeout is kind of a hack. It’s probably better to avoid it.

Surely there is a simple way to do this.

Here is the code I have so far.

gist:609119 · GitHub

Your problem may be that you’re sending signal 0; you should pass
“TERM” or (if that won’t work) “KILL” as the first parameter to
Process.kill. Signal 0 just checks whether the process exists and can
be signaled; it doesn’t deliver anything.
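
A short sketch of the difference, using sleep as a stand-in for the stuck binary (an assumption for illustration only): signal 0 leaves the process running, TERM actually ends it, and once the child has been reaped a further signal 0 raises Errno::ESRCH.

```ruby
pid = Process.spawn("sleep", "30")

# Signal 0 only checks that the pid exists; the process keeps running
probe_result = Process.kill(0, pid)

# TERM asks the process to exit; "KILL" is the fallback if TERM is ignored
Process.kill("TERM", pid)
_, status = Process.waitpid2(pid)            # reap the child, no zombie left
killed_by_term = status.signaled? && status.termsig == Signal.list["TERM"]

alive_after =
  begin
    Process.kill(0, pid)
    true
  rescue Errno::ESRCH
    false                                    # no such process any more
  end

puts "probe=#{probe_result} killed_by_term=#{killed_by_term} alive_after=#{alive_after}"
```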

If you want to use select instead of timeout, then instead of this:

Timeout::timeout(seconds) {

  @pid, @stdin, @stdout, @stderr = Open4.popen4(cmd)

  ignored, @status = Process::waitpid2 @pid

  if @status.exitstatus != 0
    raise "Exit Status not zero"
  end
}

You should use something like this: (UNTESTED)

  @pid, @stdin, @stdout, @stderr = Open4.popen4(cmd)

  # IO.select returns nil if the timeout expires with nothing to read
  if IO::select([@stdout],nil,nil,seconds).nil?
    Util.kill_process_if_exists @pid
  else
    fail 'unexpected data on stdout'
  end

  ignored, @status = Process::waitpid2 @pid

  if @status.exitstatus != 0
    raise "Exit Status not zero"
  end

Except, if the external process actually prints something to stdout,
then you need to call select in a loop until select returns nil, with
decreasing timeouts depending on how much time has passed.
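
Here is a rough sketch of such a loop, with several assumptions of my own: it uses Open3.popen2e from the standard library instead of open4 (so stderr is merged into stdout), run_with_deadline is a made-up name, and the command is passed as an argv array so that no shell sits between Ruby and the child. It is meant as a starting point, not a drop-in replacement.

```ruby
require 'open3'

# Hypothetical helper (not from the thread). cmd is an argv array;
# seconds is the total time budget for the child.
def run_with_deadline(cmd, seconds)
  clock = -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) }
  stdin, out, wait_thr = Open3.popen2e(*cmd)   # stderr merged into out
  stdin.close
  deadline = clock.call + seconds
  output = String.new

  give_up = lambda do
    begin
      Process.kill("KILL", wait_thr.pid)
    rescue Errno::ESRCH
      # the child was already gone
    end
    wait_thr.join                              # reap it
    raise "timed out after #{seconds}s: #{cmd.join(' ')}"
  end

  loop do
    remaining = deadline - clock.call          # shrinks on every pass
    # select returns nil when the remaining time runs out with nothing to read
    give_up.call if remaining <= 0 || IO.select([out], nil, nil, remaining).nil?
    chunk = out.read_nonblock(4096, exception: false)
    break if chunk.nil?                        # EOF: child closed its output
    output << chunk unless chunk == :wait_readable
  end

  # EOF does not prove the child has exited, so bound the final wait too
  give_up.call unless wait_thr.join([deadline - clock.call, 0].max)
  [output, wait_thr.value]
ensure
  out.close if out && !out.closed?
end

output, status = run_with_deadline(["echo", "hello"], 5)
puts "#{output.strip} (exit #{status.exitstatus})"
```

One caveat: if the command is started through a shell, KILL hits the shell and the real program may survive, which is why the sketch takes an argv array.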

Unfortunately, ‘ri Kernel#select’ seems to be broken… it just
refers you back to Kernel#select. I hope somebody fixes that. Check
what it says in the pickaxe instead. (There’s a free version available
online if you don’t own a copy yourself.)

Tim_U · October 14, 2010, 2:13pm

Except, if the external process actually prints something to stdout,
then you need to call select in a loop until select returns nil, with
decreasing timeouts depending on how much time has passed.

Well I tried to go a different route and ran into a strange issue.

I found a shell script on the net and modified it a bit; see this gist:

gist.github.com

https://gist.github.com/timuckun/626072

gistfile1.txt

#!/bin/bash


DIR=$(dirname $(readlink -f $0))
BINARY=${DIR}/wkhtmltoimage-i386
OPTIONS=''
SCRIPT="${BINARY} ${OPTIONS} $1 $2"

TIMEOUT=$3

(The file is truncated here.)

This shell script works perfectly when I run it from bash, but it
behaves strangely when I call it with backticks in Ruby.

Basically, the backticks don’t return until the timeout has expired,
no matter what happens.

It’s the weirdest thing.

Does anybody have an explanation for that?
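
One possible explanation, offered as a guess since the script above is truncated: backticks read the command's stdout until end-of-file, and a background process (for example a watchdog sleep) that inherits stdout keeps the pipe open until it exits. The sketch below reproduces that effect with a plain sh one-liner; the commands are illustrative and not taken from the gist.

```ruby
# Measures how long a block takes, in seconds
def timed
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

# The background sleep inherits stdout, so backticks wait for it to finish
slow = timed { `sh -c 'sleep 2 & echo done'` }

# With the background job's output redirected, the pipe closes right away
fast = timed { `sh -c 'sleep 2 >/dev/null 2>&1 & echo done'` }

puts format("inherited stdout: %.1fs, redirected: %.1fs", slow, fast)
```

If that is what is going on, redirecting the output of whatever the script puts in the background should make the backticks return as soon as the main command finishes.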