Dir.entries and UTF-8

Hi,

What’s going on here? This is on Mac OS X 10.4.4. It looks like
Dir#entries returns strings in some encoding I didn’t
expect. How can I convert the string to UTF-8?

$KCODE='UTF8'
require 'jcode'
s="äöüßÄÖÜ"
puts s.split(//).inspect

=> ["ä", "ö", "ü", "ß", "Ä", "Ö", "Ü"]

test_dir="/tmp/test"
`mkdir #{test_dir}`
`touch #{test_dir}/#{s}`
f=Dir.entries(test_dir).last
puts f.split(//).inspect

=> ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]

Timo

Hi,

On 1/13/06, Timo H. wrote:

> What’s going on here? This is on Mac OS X 10.4.4. It looks like
> Dir#entries returns strings in some encoding I didn’t
> expect. How can I convert the string to UTF-8?

You have got a correct UTF-8 string. Unlike Windows XP, Mac OS X
decomposes character components as much as possible (sorry, I forgot
the correct term for this policy). So what you got:

=> ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]

is the decomposed form of your string: a + umlaut, o + umlaut, etc.
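
To make the decomposition concrete, here is a minimal sketch that prints the code points of one decomposed and one precomposed "ä". It assumes Ruby 1.9 or later (for "\u" escapes and String#codepoints), which is newer than the interpreter used in this thread; the variable names are made up for illustration.

```ruby
# Decomposed "ä": the base letter "a" followed by U+0308 COMBINING DIAERESIS.
decomposed = "a\u0308"
# Precomposed "ä": the single code point U+00E4.
precomposed = "\u00E4"

# Format each code point as U+XXXX so the difference is visible.
decomposed_codepoints  = decomposed.codepoints.map { |cp| format("U+%04X", cp) }
precomposed_codepoints = precomposed.codepoints.map { |cp| format("U+%04X", cp) }

p decomposed_codepoints   # ["U+0061", "U+0308"]
p precomposed_codepoints  # ["U+00E4"]

# Both are valid UTF-8, but they differ in length and in bytes.
p decomposed.bytesize     # 3
p precomposed.bytesize    # 2
```

Splitting the decomposed string character by character therefore yields the base letter and the combining mark as separate elements, which matches the output shown above.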

matz.

On 1/12/06, Yukihiro M. wrote:

> the correct term for this policy). So what you got:
>
> => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]
>
> is the decomposed form of your string: a + umlaut, o + umlaut, etc.
>
> matz.

Matz is referring to Unicode Normalization Form D (NFD). According to
Apple’s technical note on the HFS Plus Volume Format:

“HFS Plus stores strings fully decomposed and in canonical order. HFS
Plus compares strings in a case-insensitive fashion. Strings may
contain Unicode characters that must be ignored by this comparison.
For more details on these subtleties, see Unicode Subtleties.”
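
As a rough illustration of what "fully decomposed" means in practice, the sketch below converts a precomposed string to NFD and back. It assumes Ruby 2.2 or later, where String#unicode_normalize is part of the core library; that method did not exist when this thread was written, and the variable names are only for illustration.

```ruby
# "äöü" written with precomposed code points (NFC).
nfc = "\u00E4\u00F6\u00FC"

# Convert to Normalization Form D: each letter becomes base + combining mark.
nfd = nfc.unicode_normalize(:nfd)

# A plain comparison sees different code point sequences.
same_without_normalizing = (nfc == nfd)

# After bringing both to the same form, the strings compare equal.
same_after_normalizing = (nfc == nfd.unicode_normalize(:nfc))

p nfc.length                 # 3
p nfd.length                 # 6
p same_without_normalizing   # false
p same_after_normalizing     # true
```

So a name typed in precomposed form and the same name read back in decomposed form will not compare equal unless one of them is normalized first.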

-A

On 12/01/06, Yukihiro M. wrote:

> On 1/13/06, Timo H. wrote:
>
> > What’s going on here? This is on Mac OS X 10.4.4. It looks like
> > Dir#entries returns strings in some encoding I didn’t
> > expect. How can I convert the string to UTF-8?
>
> You have got a correct UTF-8 string. Unlike Windows XP, Mac OS X
> decomposes character components as much as possible (sorry, I forgot
> the correct term for this policy). So what you got:

IIRC, that’s the correct term. (Decomposed.)

-austin

> > How can I convert the string to UTF-8?
>
> You have got a correct UTF-8 string. Unlike Windows XP, Mac OS X
> decomposes character components as much as possible (sorry, I forgot
> the correct term for this policy). So what you got:
>
> => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]
>
> is the decomposed form of your string: a + umlaut, o + umlaut, etc.

Hi Matz, Austin and A.

Thanks for the clarification. Unicode is more complex than it seems at
first…

Nevertheless, that doesn’t solve my current problem. What I’m trying
to do is organize the files in a directory into subfolders based
on the first N characters of the file name. Here’s my code (w/o error
handling), which works fine for 8-bit characters but doesn’t work for
e.g. umlauts:

$KCODE='UTF8'
require 'jcode'
require 'pathname'
require 'fileutils'
wd, len = Pathname.new(ARGV[0]), ARGV[1].to_i
files=wd.children.reject{|f| f.directory?}
files.each do |f|
  dir = wd + Pathname.new(f.basename.to_s.split(//)[0...len-1].join)
  dir.mkdir unless dir.exist?
  FileUtils.mv f, dir
end

I guess I have to recompose the decomposed filename somehow. Are
there any tools for that in the standard library or somewhere else?

Thanks for your help,

Timo

Hi,

to answer my own question, here’s a solution. Use the ‘unicode’ gem
and change the line

dir = wd + Pathname.new(f.basename.to_s.split(//)[0...len-1].join)

to

dir = wd + Pathname.new(Unicode::compose(f.basename.to_s).split(//)[0...len-1].join)

Then it works.
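
For readers on a newer Ruby, the same idea can be sketched without the gem. This is only an approximation under stated assumptions: it relies on String#unicode_normalize (core since Ruby 2.2) in place of Unicode::compose, and bucket_name is a hypothetical helper, not part of the script above. Note that it keeps exactly len characters, whereas the exclusive range 0...len-1 in the script keeps len-1.

```ruby
# Hypothetical helper: derive the subfolder name from a file name by
# recomposing it (NFC) and keeping the first `len` characters.
def bucket_name(file_name, len)
  file_name.unicode_normalize(:nfc).chars.first(len).join
end

# A decomposed name, as Dir.entries might return it on HFS Plus.
decomposed_name = "a\u0308o\u0308u\u0308.txt"

# Without recomposing, two "characters" are just "a" plus its combining mark.
naive = decomposed_name.chars.first(2).join
p naive.codepoints.map { |cp| format("U+%04X", cp) }
# ["U+0061", "U+0308"]

# With recomposing, the first two characters are "ä" and "ö".
composed = bucket_name(decomposed_name, 2)
p composed.codepoints.map { |cp| format("U+%04X", cp) }
# ["U+00E4", "U+00F6"]
```

Whether the directory ends up stored in composed or decomposed form is still up to the file system; the normalization only makes sure the prefix is cut at letter boundaries.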

Timo