Ruby Forum Ruby-core > making FileUtils.rm_rf robust: is anyone interested?

Posted by Jim Meyering (Guest)
on 04.10.2006 18:26
(Received via mailing list)
Hello,

I see that FileUtils.rm_rf cannot handle a tree containing a
relative name longer than PATH_MAX.

These commands create a hierarchy a/0...0/0...0/...
where the name specifying the deepest directory has length 4097.
That is usually greater than PATH_MAX.

    ( mkdir a && cd a &&
      for i in $(seq 16); do d=$(printf %0255d 0); mkdir $d && cd $d; done )
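
The length quoted above can be checked with a little arithmetic. The
following is only a sketch: ASSUMED_PATH_MAX is an illustrative constant
set to the Linux value of 4096, and other systems use other limits.

```ruby
# Rebuild the deepest relative name produced by the shell loop: "a",
# followed by 16 components of 255 zeros each, joined by "/".
components = ["a"] + Array.new(16) { "0" * 255 }
deepest = components.join("/")

ASSUMED_PATH_MAX = 4096 # assumption: the Linux value; other systems differ

puts deepest.length                     # 1 + 16 * (1 + 255) = 4097
puts deepest.length > ASSUMED_PATH_MAX  # true under the assumption above
```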

This shows that rm_rf doesn't remove "a":

    ruby -r fileutils -e 'FileUtils.rm_rf("a")'
    test -d a && echo failed to remove a

It prints this:

    failed to remove a

It is not at all trivial to fix this "properly".
By "properly," I mean in a way that rm_rf can remove an arbitrarily
deep hierarchy securely while remaining efficient and thread safe.
Modulo hard-coded diagnostics, the C implementation in the GNU coreutils
package (src/remove.c) should be appropriate.

Jim
Posted by Nobuyoshi Nakada (Guest)
on 05.10.2006 11:20
(Received via mailing list)
Hi,

At Thu, 5 Oct 2006 01:25:00 +0900,
Jim Meyering wrote in [ruby-core:08999]:
> It is not at all trivial to fix this "properly".
> By "properly," I mean in a way that rm_rf can remove an arbitrarily
> deep hierarchy securely while remaining efficient and thread safe.
> Modulo hard-coded diagnostics, the C implementation in the GNU coreutils
> package (src/remove.c) should be appropriate.

It doesn't feel appropriate to chdir inside a library since it affects
the whole process.
Posted by Hugh Sasse (Guest)
on 05.10.2006 11:40
(Received via mailing list)
On Thu, 5 Oct 2006, Nobuyoshi Nakada wrote:

> Hi,
> 
> At Thu, 5 Oct 2006 01:25:00 +0900,
> Jim Meyering wrote in [ruby-core:08999]:
> > It is not at all trivial to fix this "properly".
> > By "properly," I mean in a way that rm_rf can remove an arbitrarily
> > deep hierarchy securely while remaining efficient and thread safe.
> > Modulo hard-coded diagnostics, the C implementation in the GNU coreutils
> > package (src/remove.c) should be appropriate.

After a quick look, I couldn't figure that out, however:
> 
> It doesn't feel appropriate to chdir inside a library since it affects
> the whole process.

isn't it possible to pass a block to chdir, so that after executing the
block one is back where one was?
> 
> -- 
> Nobu Nakada
> 

        Hugh
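
For readers following along, here is a small sketch of what the block
form of Dir.chdir does and does not give you. The helper name
cwd_seen_by_other_thread_during_chdir is made up for this illustration.
The block form restores the previous directory when the block ends, but
while the block runs the change is still visible to every thread in the
process.

```ruby
require "tmpdir"

# Hypothetical helper: change into `dir` with the block form of
# Dir.chdir and report the working directory that a *different*
# thread observes while the block is still running.
def cwd_seen_by_other_thread_during_chdir(dir)
  Dir.chdir(dir) do
    Thread.new { Dir.pwd }.value
  end
end

Dir.mktmpdir do |tmp|
  before = Dir.pwd
  seen   = cwd_seen_by_other_thread_during_chdir(tmp)

  # The other thread saw the temporary directory, not `before`.
  puts File.realpath(seen) == File.realpath(tmp)
  # After the block, the original directory is back.
  puts Dir.pwd == before
end
```

So the block form answers the "get back to where one was" part, but not
the concern that chdir affects the whole process in the meantime.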
Posted by Jim Meyering (Guest)
on 05.10.2006 15:26
(Received via mailing list)
"Nobuyoshi Nakada" <nobu@ruby-lang.org> wrote:
> At Thu, 5 Oct 2006 01:25:00 +0900,
> Jim Meyering wrote in [ruby-core:08999]:
>> It is not at all trivial to fix this "properly".
>> By "properly," I mean in a way that rm_rf can remove an arbitrarily
>> deep hierarchy securely while remaining efficient and thread safe.
>> Modulo hard-coded diagnostics, the C implementation in the GNU coreutils
>> package (src/remove.c) should be appropriate.
>
> It doesn't feel appropriate to chdir inside a library since it affects
> the whole process.

Hello,

You are right that calling chdir (or fchdir) is not appropriate in
a library:  it would render the caller thread-*un*safe.

However, given sufficient O/S support, the implementation in
coreutils/src/remove.c is indeed robust and thread-safe.  As of
coreutils-6.0 (the latest is coreutils-6.3), "rm -r" can remove an
arbitrarily deep hierarchy in a thread-safe manner on a system with
support for openat-like functions (Linux-2.6.16 and newer and Solaris 10).
I have taken great pains to ensure that the code degrades gracefully,
so that it works as well (but sacrifices thread safety) on systems that
have neither openat nor sufficient /proc support.  If you require
thread safety even without openat support, then currently you must
compromise on robustness: i.e., the code must once again be subject
to the PATH_MAX limitation.

However a robust, efficient, *and* always-thread-safe implementation
is possible: if the PATH_MAX limitation is encountered, incur the cost
of a single fork and then perform the remaining operations (including
f/chdir calls) from a separate process.
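
A minimal Ruby sketch of that idea follows. It is not the coreutils
code and not a proposed FileUtils patch: the names remove_relative and
remove_tree_in_child are hypothetical, it forks unconditionally rather
than only after hitting the PATH_MAX limit, it relies on Kernel#fork
(so it cannot run where fork is unavailable), and it makes no attempt
to be secure against a tree that is modified while it is being removed.

```ruby
# Hypothetical helper: remove `name`, an entry of the current directory,
# using only single-component relative names, so that no file name
# passed to the OS ever approaches PATH_MAX. It changes the working
# directory, so it is meant to run only in a throwaway child process.
def remove_relative(name)
  if File.lstat(name).directory?
    Dir.chdir(name)
    (Dir.entries(".") - %w[. ..]).each { |entry| remove_relative(entry) }
    Dir.chdir("..")
    Dir.rmdir(name)
  else
    File.unlink(name) # plain files and symlinks alike
  end
end

# Hypothetical helper: pay for one fork, do all the chdir calls in the
# child, and leave the parent's working directory untouched.
# Returns true if the child reported success.
def remove_tree_in_child(parent_dir, name)
  pid = fork do
    begin
      Dir.chdir(parent_dir)
      remove_relative(name)
      exit!(0) # skip at_exit handlers inherited from the parent
    rescue SystemCallError
      exit!(1)
    end
  end
  _, status = Process.wait2(pid)
  status.success?
end

# Usage (assuming "a" was created by the shell loop shown earlier):
#   remove_tree_in_child(Dir.pwd, "a")
```

The recursion here uses one Ruby stack frame per directory level, so a
hierarchy that is extremely deep would need an iterative version.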

If you are interested, a viable alternative may involve using the fts
implementation from the coreutils (the same one that's in gnulib).
Then, once that version of fts has the proposed additional feature,
ruby's rm_rf will "just work".

FYI, fts is the file system traversing tool that is used by chmod,
chgrp, chown, and du.  It too takes advantage of openat, when possible,
and degrades gracefully.  However, it also has an option to make it use
the existing approach of accessing each operand via its full, relative
file name.  The version in coreutils/gnulib was initially based on
the one from *BSD and glibc, but I have changed its ABI slightly in order
to make it work for arbitrarily deep hierarchies.  For example, those
programs can now process hierarchies a million levels deep or more.

Jim
Posted by Charles Oliver Nutter (Guest)
on 05.10.2006 16:18
(Received via mailing list)
Jim Meyering wrote:
> to the PATH_MAX limitation.
I'm not sure how much weight this will carry, but since we ship Ruby's
libraries with JRuby we're hoping the same logic described above will be
implementable in Java. Also ...

> 
> However a robust, efficient, *and* always-thread-safe implementation
> is possible: if the PATH_MAX limitation is encountered, incur the cost
> of a single fork and then perform the remaining operations (including
> f/chdir calls) from a separate process.
> 

We would strongly prefer to avoid any implementation that requires fork,
since we can't really support fork in JRuby. Also, wouldn't a fork
preclude this method from working on Windows?

Wouldn't it perhaps be better to support chdir at a per-thread level?
Posted by Joel VanderWerf (Guest)
on 05.10.2006 17:51
(Received via mailing list)
Charles Oliver Nutter wrote:
...
> Wouldn't it perhaps be better to support chdir at a per-thread level?

Then ruby's thread scheduler would have to chdir for each context
switch. Is there any other reason not to do this? It's hard to see how
existing code could usefully depend on the working dir being global
rather than per-thread.
Posted by Charles Nutter (headius)
on 05.10.2006 17:53
(Received via mailing list)
On 10/5/06, Joel VanderWerf <vjoel@path.berkeley.edu> wrote:
> Charles Oliver Nutter wrote:
> ...
> > Wouldn't it perhaps be better to support chdir at a per-thread level?
>
> Then ruby's thread scheduler would have to chdir for each context
> switch. Is there any other reason not to do this? It's hard to see how
> existing code could usefully depend on the working dir being global
> rather than per-thread.

I agree that it makes more sense for the current dir to be
thread-specific, but I can't speak to the complexity of supporting
this behavior in C Ruby. For JRuby, it would be a trivial change,
since current directory is only emulated with a per-JRuby-runtime
variable. We would simply move that variable into a per-thread
context, and chdir would then be thread-safe.
Posted by Nobuyoshi Nakada (Guest)
on 05.10.2006 18:13
(Received via mailing list)
Hi,

At Thu, 5 Oct 2006 22:26:24 +0900,
Jim Meyering wrote in [ruby-core:09008]:
> However, given sufficient O/S support, the implementation in
> coreutils/src/remove.c is indeed robust and thread-safe.  As of
> coreutils-6.0 (the latest is coreutils-6.3), "rm -r" can remove an
> arbitrarily deep hierarchy in a thread-safe manner on a system with
> support for openat-like functions (Linux-2.6.16 and newer and Solaris 10).

Thank you, I'll consider it later.

> However a robust, efficient, *and* always-thread-safe implementation
> is possible: if the PATH_MAX limitation is encountered, incur the cost
> of a single fork and then perform the remaining operations (including
> f/chdir calls) from a separate process.

I thought about it too.

Another idea, suggested by akr, is to rename overly long path names
to shorter ones before traversing.
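
One possible reading of the renaming idea, sketched in Ruby. This is an
interpretation, not akr's actual proposal: the method name
remove_by_hoisting and the "h<N>" naming scheme are invented here. Each
subdirectory found two levels below the root is renamed to a short name
directly under the root, so no file name used is ever more than two
components longer than the root itself. It needs neither chdir nor
fork, but it assumes the whole tree lives on one file system and it is
not safe against concurrent modification.

```ruby
# Hypothetical sketch: flatten the tree under `root` by renaming, then
# remove it. Every path used is root/child or root/child/grandchild.
def remove_by_hoisting(root)
  counter = 0
  loop do
    children = Dir.entries(root) - %w[. ..]
    break if children.empty?

    children.each do |child|
      child_path = File.join(root, child)
      unless File.lstat(child_path).directory?
        File.unlink(child_path)
        next
      end

      (Dir.entries(child_path) - %w[. ..]).each do |grandchild|
        src = File.join(child_path, grandchild)
        if File.lstat(src).directory?
          # Pick an unused short name directly under root and move the
          # subdirectory up to it; it is handled on a later pass.
          dst = nil
          loop do
            counter += 1
            dst = File.join(root, "h#{counter}")
            break unless File.exist?(dst) || File.symlink?(dst)
          end
          File.rename(src, dst)
        else
          File.unlink(src)
        end
      end
      Dir.rmdir(child_path) # now empty
    end
  end
  Dir.rmdir(root)
end
```

The cost is one rename per directory, and an interrupted run leaves the
tree rearranged rather than merely partly deleted.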
Posted by Jim Meyering (Guest)
on 06.10.2006 14:43
(Received via mailing list)
[resend]
Charles Oliver Nutter <Charles.O.Nutter@Sun.COM> wrote:
> Jim Meyering wrote:
...
>> However a robust, efficient, *and* always-thread-safe implementation
>> is possible: if the PATH_MAX limitation is encountered, incur the cost
>> of a single fork and then perform the remaining operations (including
>> f/chdir calls) from a separate process.
>
> We would strongly prefer to avoid any implementation that requires fork,
> since we can't really support fork in JRuby. Also, wouldn't a fork
> preclude this method from working on Windows?

With WOE, it wouldn't perform a "fork" per se.
There, rm_rf could use "spawnvp" to execute a new command
to handle the unusual event that it encounters the PATH_MAX limit.
The gnulib execute module provides a portable way to do that:
http://cvs.savannah.gnu.org/viewcvs/gnulib/lib/execute.c?root=gnulib&view=markup
But I'm no Windows expert, so take this with a big grain of salt.

> Wouldn't it perhaps be better to support chdir at a per-thread level?

Do you know how to do that portably, so that it affects only
rm_rf?  What if some other concurrently-running code requires
the process-wide semantics of chdir?  So imagine that there is
a new function with the thread-local semantics.  Maybe...
But is it available now?
Posted by Charles Oliver Nutter (Guest)
on 06.10.2006 20:25
(Received via mailing list)
Jim Meyering wrote:
> With WOE, it wouldn't perform a "fork" per se.
> There, rm_rf could use "spawnvp" to execute a new command
> to handle the unusual event that it encounters the PATH_MAX limit.
> The gnulib execute module provides a portable way to do that:
> http://cvs.savannah.gnu.org/viewcvs/gnulib/lib/execute.c?root=gnulib&view=markup
> But I'm no Windows expert, so take this with a big grain of salt.

Either way we wouldn't really be able to support it, and we'd have to
hack our own version of FileUtils that doesn't spawn or fork anything :(

> 
>> Wouldn't it perhaps be better to support chdir at a per-thread level?
> 
> Do you know how to do that portably, so that it affects only
> rm_rf?  What if some other concurrently-running code requires
> the process-wide semantics of chdir?  So imagine that there is
> a new function with the thread-local semantics.  Maybe...
> But is it available now?
> 

I do not know of any such feature in the C domain, but that's not my
area. We emulate chdir support in JRuby by keeping a separate variable
for cwd. When operations that are directory-sensitive are called, we
provide the cwd for them, normalizing paths manually as necessary. So
far it has worked fairly well for us, and a thread-specific approach
would be a logical next step.

In Java, we don't really even have the ability to chdir, so this
emulation was the only safe way to support it.
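
To make the emulation idea concrete, here is a rough sketch of a
per-thread current directory kept in a variable, with paths resolved
against it by hand. It is not JRuby's code: the module name EmulatedCwd
and its methods are hypothetical, and it does not check that the target
directory exists.

```ruby
# Hypothetical sketch: the "current directory" is just a thread-local
# string. The process-wide directory (Dir.pwd) is never changed.
module EmulatedCwd
  KEY = :emulated_cwd

  # This thread's emulated directory, falling back to the real one.
  def self.pwd
    Thread.current.thread_variable_get(KEY) || Dir.pwd
  end

  # Change the emulated directory for the calling thread only.
  def self.chdir(dir)
    Thread.current.thread_variable_set(KEY, resolve(dir))
  end

  # Turn a possibly relative path into an absolute one, relative to
  # this thread's emulated directory.
  def self.resolve(path)
    File.expand_path(path, pwd)
  end
end

# Usage: directory-sensitive operations take the resolved path, e.g.
#   EmulatedCwd.chdir("/some/dir")
#   File.exist?(EmulatedCwd.resolve("file.txt"))
```

Because every operation ends up using a full path name, this kind of
emulation keeps chdir thread-safe but does not by itself get around
the PATH_MAX limit discussed earlier in the thread.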