File.unlink(nonwestern_filename) ---> Error on Windows

Hi!

I use Ruby on Windows, and tried to remove all files in a directory
with the code given below. But if the directory contains files with
filenames having non-western characters the operation fails.

I first encountered this problem when using FileUtils.rm_r, and that
method also fails (for the same reason I guess). This makes FileUtils
quite useless in some situations. We have for example Subversion
projects that contain files with Japanese characters (for testing that
our product works with such characters), and I also tried with Arabic
characters (stored in Unicode in NTFS in both cases).

Is it possible to get Ruby to work with filenames containing
non-western characters at all on Windows? If so, what should I do?

/Johan H.


Dir.chdir "nonwestern-files"

for entry in Dir.entries(".")
  next if entry == "."
  next if entry == ".."
  n = File.unlink(entry)
  puts "failed to delete #{entry}" if n == 0
end
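
One thing worth noting about the loop above: File.unlink raises an exception (a SystemCallError subclass such as Errno::ENOENT) when a file cannot be removed, so the `n == 0` check is unlikely to ever fire. A minimal sketch of a variant that rescues per entry and reports which names failed follows. The helper name `unlink_all` is made up for illustration, and this only makes the failures visible; it does not fix the underlying filename problem.

```ruby
# Hypothetical helper: tries to delete every non-directory entry directly
# inside +dir+ and returns [name, error class] pairs for the ones that failed.
def unlink_all(dir)
  failed = []
  Dir.entries(dir).each do |entry|
    next if entry == "." || entry == ".."
    path = File.join(dir, entry)
    next if File.directory?(path)   # leave subdirectories alone
    begin
      File.unlink(path)             # raises on failure instead of returning 0
    rescue SystemCallError => e
      failed << [entry, e.class]
    end
  end
  failed
end
```

Calling `unlink_all("nonwestern-files").each { |name, err| puts "failed to delete #{name.inspect} (#{err})" }` would then list the problematic names without aborting on the first one.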

On Jul 5, 2007, at 8:00 AM, [email protected] wrote:

our product works with such characters), and I also tried with Arabic
for entry in Dir.entries(".")
  next if entry == "."
  next if entry == ".."
  n = File.unlink(entry)
  puts "failed to delete #{entry}" if n == 0
end

First make sure you set the KCODE

Try using the chars class from ActiveSupport (yes it is a gem that is
part of Rails but it provides a great deal of utf-8 processing)

On 7/5/07, John J. [email protected] wrote:

On Jul 5, 2007, at 8:00 AM, [email protected] wrote:

I use Ruby on Windows, and tried to remove all files in a directory
with the code given below. But if the directory contains files with
filenames having non-western characters the operation fails.

First make sure you set the KCODE

Using KCODE does not change anything. I have tried:

$ ruby -Ke rm-files.rb
$ ruby -Ks rm-files.rb
$ ruby -Ku rm-files.rb
$ ruby -Ka rm-files.rb
$ ruby -Kn rm-files.rb

The problematic files are stored with a name that is a 16-bit
character string in NTFS (what I called Unicode in my earlier mail,
perhaps one should call it “almost UTF-16” or UCS-2, I don’t know the
finer details). Anyway, I don’t think setting KCODE solves my problem.

Try using the chars class from ActiveSupport (yes it is a gem that is
part of Rails but it provides a great deal of utf-8 processing)

See above. I don’t think NTFS stores Unicode filenames in UTF-8.

My assumption when I started looking at this problem was that a
filename that I got from one function (Dir.entries) would be directly
usable in another function (File.unlink). That was quite naive, I
realize :-)

But it is still a real problem. As it is now, FileUtils.rm_r does not
work on an arbitrary file-tree. As soon as it contains a file with
“wrong” filename it fails. Maybe this is just a consequence of the way
Ruby is ported to Windows.

/Johan H.

On 7/5/07, [email protected] [email protected] wrote:

character string in NTFS (what I called Unicode in my earlier mail,
filename that I got from one function (Dir.entries) would be directly

I’m not sure, but maybe win32-utils (win32-file perhaps?) has a way
to save the day. http://rubyforge.org/projects/win32utils/

Hi,

On 05/07/07, [email protected] [email protected] wrote:

The problematic files are stored with a name that is a 16-bit
character string in NTFS (what I called Unicode in my earlier mail,
perhaps one should call it “almost UTF-16” or UCS-2, I don’t know the
finer details). Anyway, I don’t think setting KCODE solves my problem.

I haven’t used Windows for a long while, but unless something has
changed in the newest releases, Ruby uses the Windows legacy code page
for interacting with the system, which is by default Windows-1252 on
English systems, Shift_JIS on Japanese systems, etc.

Internally, Windows is all Unicode, as is NTFS (I think it’s UTF-16,
but that’s not really important for this discussion), but applications
using legacy code pages can’t communicate strings outside that code
page to the OS.

That means that if you set the legacy code page to Shift_JIS, you can
read and write Japanese file names, but not Arabic ones. If you set it
to Windows-1252, you can use acute accents, but can’t touch Japanese
files.
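
The code page limitation described above can be illustrated with a small sketch. It assumes Ruby 1.9 or later (String#encode did not exist in the Ruby versions current when this thread was written), and the file names and the helper `representable?` are made up for the example. It only checks whether a name can be expressed in a given code page; it does not touch the file system.

```ruby
# Assumes Ruby 1.9+ for String#encode. The names are illustrative only and
# are written with \u escapes, which makes them UTF-8 strings.
JAPANESE_NAME = "\u65E5\u672C\u8A9E.txt"        # Japanese characters
ARABIC_NAME   = "\u0639\u0631\u0628\u064A.txt"  # Arabic characters
ACCENTED_NAME = "caf\u00E9.txt"                 # Latin letter with an acute accent

# Hypothetical helper: true if every character of +name+ exists in +code_page+.
def representable?(name, code_page)
  name.encode(code_page)
  true
rescue EncodingError
  false
end

[JAPANESE_NAME, ARABIC_NAME, ACCENTED_NAME].each do |name|
  %w[Shift_JIS Windows-1252].each do |code_page|
    puts "#{name.inspect} in #{code_page}: #{representable?(name, code_page)}"
  end
end
```

The Japanese name converts to Shift_JIS but not to Windows-1252, the accented name converts to Windows-1252, and the Arabic name converts to neither, which matches the behaviour described for the legacy code pages.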

I am led to believe that there is a UTF-8 code page in Windows, and it
is possible to set the legacy code page on an
application-by-application basis, at least on XP (though you might
need a separate Power Toy or similar to do it). If you can get that to
work, it might be possible to manipulate files via the UTF-8
representation of their name. I’ve never seen it done, though, so this
is entirely hypothetical.

Paul.

On Jul 5, 2007, at 11:32 AM, [email protected] wrote:

perhaps one should call it “almost UTF-16” or UCS-2, I don’t know the
finer details). Anyway, I don’t think setting KCODE solves my problem.

Translating between UTF-16 and UTF-8 shouldn’t be a problem.
Check out unicode.org for more on this than you could ever want, or
there is a nice blog article on Joel on Software.
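
As a rough sketch of that translation, assuming Ruby 1.9 or later (String#encode) and a made-up file name: the same name is converted from UTF-8 to UTF-16LE (the little-endian 16-bit form Windows uses) and back. Converting the string is lossless, but it does not by itself change which encoding Ruby passes to the operating system.

```ruby
# Assumes Ruby 1.9+; the file name is illustrative only.
UTF8_NAME  = "\u65E5\u672C\u8A9E.txt"         # three Japanese characters plus ".txt"
UTF16_NAME = UTF8_NAME.encode("UTF-16LE")     # 16-bit code units, little-endian
ROUND_TRIP = UTF16_NAME.encode("UTF-8")       # back to UTF-8

puts UTF8_NAME.bytesize                       # 3 bytes per kanji + 4 ASCII bytes
puts UTF16_NAME.bytesize                      # 2 bytes per character here
puts ROUND_TRIP == UTF8_NAME                  # the conversion loses nothing
```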

But it is still a real problem. As it is now, FileUtils.rm_r does not
work on an arbitrary file-tree. As soon as it contains a file with
“wrong” filename it fails. Maybe this is just a consequence of the way
Ruby is ported to Windows.

/Johan H.

Some of the file utilities are specifically non-Windows, which may be
part of the problem you are having.
Many of them are Ruby versions of utilities found on *nix systems.
Sorry about that.
Much of this is documented in the second half of the pickaxe book
(v.2). (Sorry again, I’m not saying RTFM, just that it is noted
there.)

The win32utils will hopefully do the job. Let us know what works!
This kind of problem is common for lots of people.