Process.fork weirdness

Hi all,

I’ve run into some strange problems when forking child processes and
logging it in a file. Here’s a simplified example:

test = open("output.txt", "w")
1.upto(10) do |i|
  test.print(i.to_s + ",")
  Process.fork { sleep(10) }
  test.puts i
end
Process.waitall

Now I would expect this as output:

1,1
2,2
3,3
4,4
5,5
6,6
7,7
8,8
9,9
10,10

However, the above script produces this instead:

1,1,1
2,1,1
2,2
3,1,1
2,2
3,3
4,1,1
2,2
3,3
4,4
5,1,1
2,2
3,3
4,4
5,5
6,6
7,1,1
2,2
3,3
4,4
5,5
6,1,1

It goes on until both values are 10, as if there were two nested loops.
Note that 7,1,1 appears before 6,1,1, though. I have absolutely no idea
how this output is produced, but it must be caused by the forking: if I
remove Process.fork I get the expected output. I also get the expected
output if I open and close the file around each write:

1.upto(10) do |i|
  test = open("output.txt", "a")
  test.print(i.to_s + ",")
  test.close
  Process.fork { sleep(10) }
  test = open("output.txt", "a")
  test.puts i
  test.close
end
Process.waitall

I spent a lot of time figuring out that my algorithm was in fact working
fine, and that it was my way of writing the output that was causing the
problems. I’d really like to know the reason for this behaviour. Any
suggestions?

Regards,
janus

On Jul 12, 2009, at 3:04 PM, Janus B. wrote:

It goes on until both values are 10, as if there were two nested loops.
Note that 7,1,1 appears before 6,1,1, though. I have absolutely no idea
how this output is produced, but it must be caused by the forking: if I
remove Process.fork I get the expected output.

You need to flush the output stream before forking so that the child
process doesn’t inherit any buffered data. Otherwise, every time a child
process exits, it flushes its duplicate copy of that data out to the
‘output.txt’ file.
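
Something like this (an untested sketch of the fix) should give the
expected output:

test = open("output.txt", "w")
test.sync = true    # or: call test.flush just before each fork
1.upto(10) do |i|
  test.print(i.to_s + ",")
  Process.fork { sleep(10) }   # the child now inherits an empty buffer
  test.puts i
end
Process.waitall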

Gary W. wrote:

You need to flush the output stream before forking so that the child
process doesn’t inherit any buffered data. Otherwise, every time a child
process exits, it flushes its duplicate copy of that data out to the
‘output.txt’ file.

Huh, I never knew that. That’s dangerous if the parent process is
multithreaded, isn’t it? Because even if the forking thread flushes all
open files, other threads can keep filling buffers before the fork
happens.

In process.c, rb_f_fork() flushes stdout and stderr before forking, so
at least these two buffers are protected from other ruby threads. It
seems like there should be an option to flush everything.
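
For example (an untested sketch; the /tmp path is arbitrary), a plain
file handle shows the duplication while stdout does not:

f = File.open("/tmp/fork_test.txt", "w")
f.print "duplicated "            # sits in f's buffer; the child gets a copy
$stdout.print "written once\n"   # fork flushes stdout, so no duplicate

fork { sleep 1 }   # child exits normally, flushing its copy of f's buffer
Process.waitall
f.close            # parent flush: the file now contains "duplicated " twice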

On Jul 13, 2009, at 1:18 PM, Joel VanderWerf wrote:

Huh, I never knew that. That’s dangerous if the parent process is
multithreaded, isn’t it? Because even if the forking thread flushes
all open files, other threads can keep filling buffers before the
fork happens.

It is probably a bad idea to have multiple threads writing to a shared
file descriptor without proper synchronization. With proper
synchronization you should be able to safely flush and fork.

Joel VanderWerf wrote:

In process.c, rb_f_fork() flushes stdout and stderr before forking, so
at least these two buffers are protected from other ruby threads. It
seems like there should be an option to flush everything.

Maybe the best policy is to do any one of the following:

  1. Avoid buffered output (puts, write), and instead use unbuffered
     calls (syswrite, send) or set io.sync = true.

  2. Don’t mix threads and fork, and flush everything before forking.

  3. Create children only as

       fork { do_stuff; exit! }
       fork { do_stuff; exec … }
       system "…"

     and so on. (And be careful that the do_stuff doesn’t flush any
     buffered IO that belongs to the parent.)

Most libraries aren’t going to follow #1, and #2 seems impractical to
enforce.

Actually, I guess #3 is why I haven’t seen this problem before… it’s
often a good idea to use exit! in fork to avoid re-running the parent’s
at_exit() handlers anyway. I wish there were a method that the child
could call before anything else to clear (not flush) all buffers and
also to clear the at_exit handlers.
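
For example (untested):

at_exit { puts "cleanup in pid #{Process.pid}" }

fork { sleep 1 }          # this child re-runs the at_exit handler
fork { sleep 1; exit! }   # exit! skips at_exit handlers and flushing
Process.waitall
# "cleanup" is printed twice (parent and first child), not three times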

Joel VanderWerf wrote:

But what about a library using IO#write to talk http, a rest
client, let’s say? This would not want IO#sync=true. It might not
really be possible to make the library aware of the need for
synchronization around buffered writes.

I don’t think you’d have stdio-type buffering on a socket. At least, I
don’t find that I have to do sock.flush every time I send a message.

Gary W. wrote:

On Jul 13, 2009, at 1:18 PM, Joel VanderWerf wrote:

Huh, I never knew that. That’s dangerous if the parent process is
multithreaded, isn’t it? Because even if the forking thread flushes
all open files, other threads can keep filling buffers before the fork
happens.

It is probably a bad idea to have multiple threads writing to a shared
file descriptor without proper synchronization. With proper
synchronization you should be able to safely flush and fork.

I was thinking of situations in which one program component runs in a
worker thread and manages its own files, not shared with the main thread
(but the descriptors are shared in the same process of course). If the
files are log files, they would probably have IO#sync=true, so no danger
there. But what about a library using IO#write to talk http, a rest
client, let’s say? This would not want IO#sync=true. It might not
really be possible to make the library aware of the need for
synchronization around buffered writes. In this case, maybe it would be
better to “firewall” the http library in its own process, and
communicate over unix sockets.

Anyway, thanks for making me aware of one more thing to worry about!

On Jul 14, 2009, at 12:23 PM, Joel VanderWerf wrote:

Ah, that makes sense. It would be only the libc and ruby buffers (in
user space) that would get copied in the fork, not network stack
buffers in kernel space.

That is correct. IO data buffered in the kernel is not duplicated by a
fork.

But wait… Net::HTTP uses BufferedIO#writeline on sockets. That
class doesn’t itself buffer writes (only reads), but it is defined
in terms of IO#write, which is buffered, isn’t it?

Ruby marks IO objects associated with sockets as ‘synchronized’ (i.e.
unbuffered). Net::HTTP wraps the underlying ruby socket with
BufferedIO, but my quick inspection of the code seems to indicate that
BufferedIO doesn’t introduce a write buffer, just some debugging
scaffolding and a helper method for writing lines terminated with
‘\r\n’.
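
You can check this directly (untested; assumes some reachable host):

require 'socket'

TCPSocket.open("www.example.com", 80) do |sock|
  p sock.sync   # => true: ruby creates the socket in sync mode
end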

Brian C. wrote:

Joel VanderWerf wrote:

But what about a library using IO#write to talk http, a rest
client, let’s say? This would not want IO#sync=true. It might not
really be possible to make the library aware of the need for
synchronization around buffered writes.

I don’t think you’d have stdio-type buffering on a socket. At least, I
don’t find that I have to do sock.flush every time I send a message.

Ah, that makes sense. It would be only the libc and ruby buffers (in
user space) that would get copied in the fork, not network stack buffers
in kernel space.

But wait… Net::HTTP uses BufferedIO#writeline on sockets. That class
doesn’t itself buffer writes (only reads), but it is defined in terms of
IO#write, which is buffered, isn’t it?

I guess the reason buffering makes sense with Net::HTTP, as opposed to
most other socket code, is that it writes header lines one at a time.

Gary W. wrote:

But wait… Net::HTTP uses BufferedIO#writeline on sockets. That class
doesn’t itself buffer writes (only reads), but it is defined in terms
of IO#write, which is buffered, isn’t it?

Ruby marks IO objects associated with sockets as ‘synchronized’ (i.e.
unbuffered).

Good to know, thanks. I always just used send/recv to be sure, but now I
know that’s not necessary.

There really don’t seem to be many cases in which fork is going to cause
buffer corruption, then, which explains why the issue doesn’t come up
much. We can rule out the cases of sockets (sync=true), log files
(sync=true), most binary data files (syswrite), and pipes (usually a
good idea to set sync=true, AFAIK). That leaves batch filtering of
files, which is most often performed on stdin/stdout, and these
descriptors are flushed in ruby’s #fork wrapper.

It’s a real joy to retrace the steps of ruby developers, and realize how
many wise decisions they made.

One thing left to watch out for is race conditions in (for example):

File.open(…, "w") do |f|
  YAML.dump(…, f)
  # before the file is closed, suppose another thread forks and exits
end

This example shows it in action:

require 'yaml'

a = [1,2,3]

th = Thread.new do
  File.open("/tmp/foo.yml", "w") do |f|
    YAML.dump a, f
    sleep 3
  end
end

fork do
  sleep 1
end

th.join

The result is that foo.yml contains two concatenated copies of the
correct YAML string.

This doesn’t happen with Marshal.dump. Probably, since it is core ruby,
this possibility was anticipated, though I don’t see where marshal.c
calls fflush, and sync is false after the #dump.
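
For comparison, the Marshal version of the same script (untested) leaves
a single copy in the file:

a = [1,2,3]

th = Thread.new do
  File.open("/tmp/foo.dat", "w") do |f|
    Marshal.dump a, f   # the file ends up with one copy, not two
    sleep 3
  end
end

fork do
  sleep 1
end

th.join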

On Jul 14, 2009, at 4:29 PM, Joel VanderWerf wrote:

That leaves batch filtering of files, which is most often performed
on stdin/stdout, and these descriptors are flushed in ruby’s #fork
wrapper.

Interesting. I didn’t realize that Kernel#fork flushes stdout and
stderr. That should probably be in the rdoc text for fork.

One thing left to watch out for is race conditions in (for example):

File.open(…, "w") do |f|
  YAML.dump(…, f)
  # before the file is closed, suppose another thread forks and exits
end

I can think of three ways to avoid these sorts of race conditions:

– do all forking before opening files
– do all forking before creating any threads
– use a synchronization mechanism to control the timing of
  forks in a multi-threaded process (see the sketch below)

In all cases I think it requires library code to be aware of these
constraints. I’m not sure there is a way to protect buffered IO
objects against a library that can fork at indeterminate times.
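
As a sketch of that third option (untested; FORK_MUTEX is a made-up
name):

require 'thread'   # Mutex, on 1.8
require 'yaml'

FORK_MUTEX = Mutex.new

# the writer holds the lock until its buffered data is flushed:
writer = Thread.new do
  FORK_MUTEX.synchronize do
    File.open("/tmp/foo.yml", "w") do |f|
      YAML.dump [1,2,3], f
    end   # the block form closes (and flushes) f before unlocking
  end
end

# forks take the same lock, so they can never copy an unflushed buffer:
FORK_MUTEX.synchronize do
  fork { sleep 1; exit! }
end

writer.join
Process.waitall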

Gary W.

Thanks for the explanation!