Forum: Ruby Dir.entries and UTF-8

Timo H. (Guest)
on 2006-01-12 17:23
(Received via mailing list)
Hi,

What's going on here? This is on MacOS X 10.4.4. Looks like
Dir#entries returns strings encoded with some encoding I didn't
expect. How can I convert the string to UTF8?

$KCODE='UTF8'
require 'jcode'
s="äöüßÄÖÜ"
puts s.split(//).inspect
# => ["ä", "ö", "ü", "ß", "Ä", "Ö", "Ü"]
test_dir="/tmp/test"
`mkdir #{test_dir}`
`touch #{test_dir}/#{s}`
f=Dir.entries(test_dir).last
puts f.split(//).inspect
# => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]

Timo
Yukihiro M. (Guest)
on 2006-01-12 18:18
(Received via mailing list)
Hi,

On 1/13/06, Timo H. <removed_email_address@domain.invalid> wrote:

> What's going on here? This is on MacOS X 10.4.4. Looks like
> Dir#entries returns strings encoded with some encoding I didn't
> expect. How can I convert the string to UTF8?

You have got a correct UTF-8 string.  Unlike Windows XP, Mac OS X
decomposes character components as much as possible (Sorry I forgot
the correct term for this policy).  So what you got:

> # => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]

is the decomposed form of your string: a+umlaut, o+umlaut, etc.

matz.
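
For readers on a current Ruby, the decomposition described above can be made visible by listing code points. This is only a sketch and assumes Ruby 2.2 or later, where String#unicode_normalize is built in; that method did not exist when this thread was written, and the variable names are illustrative.

```ruby
# Assumes Ruby 2.2+ (String#unicode_normalize is not available in Ruby 1.8).
composed   = "\u00E4"                          # "ä" as one precomposed code point
decomposed = composed.unicode_normalize(:nfd)  # "a" followed by a combining diaeresis

# Format each code point as U+XXXX so the invisible combining mark shows up.
composed_codepoints   = composed.codepoints.map { |c| format("U+%04X", c) }
decomposed_codepoints = decomposed.codepoints.map { |c| format("U+%04X", c) }

p composed_codepoints    # ["U+00E4"]
p decomposed_codepoints  # ["U+0061", "U+0308"]
```

Both strings are valid UTF-8 and render the same; they differ only in how the character is spelled out in code points.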
A LeDonne (Guest)
on 2006-01-12 18:33
(Received via mailing list)
On 1/12/06, Yukihiro M. <removed_email_address@domain.invalid> wrote:
> the correct term for this policy).  So what you got:
>
> > # => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]
>
> is the decomposed form of your string: a+umlaut, o+umlaut, etc.
>
> matz.
>

Matz refers to Unicode Normalization Form D (NFD). According to
http://developer.apple.com/technotes/tn/tn1150.html (HFS Plus Volume
Format):

"HFS Plus stores strings fully decomposed and in canonical order. HFS
Plus compares strings in a case-insensitive fashion. Strings may
contain Unicode characters that must be ignored by this comparison.
For more details on these subtleties, see Unicode Subtleties."

-A
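
One practical consequence of decomposed storage is that a name read back from the file system may not be byte-equal to the name that was typed. A minimal sketch of a normalization-aware comparison follows, assuming Ruby 2.2 or later; same_name? is a hypothetical helper and not part of any library mentioned in this thread.

```ruby
# Hypothetical helper: compare two names regardless of whether they are
# composed (NFC) or decomposed (NFD). Assumes Ruby 2.2+.
def same_name?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

typed   = "\u00FC.txt"   # "ü.txt" with a precomposed u-umlaut
on_disk = "u\u0308.txt"  # the same name in decomposed form

raw_equal        = (typed == on_disk)          # false: the code points differ
normalized_equal = same_name?(typed, on_disk)  # true: equal once both are NFC
```

Normalizing both sides to the same form before comparing avoids depending on which form a particular file system returns.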
Austin Z. (Guest)
on 2006-01-12 18:33
(Received via mailing list)
On 12/01/06, Yukihiro M. <removed_email_address@domain.invalid> wrote:
> On 1/13/06, Timo H. <removed_email_address@domain.invalid> wrote:
>> What's going on here? This is on MacOS X 10.4.4. Looks like
>> Dir#entries returns strings encoded with some encoding I didn't
>> expect. How can I convert the string to UTF8?
> You have got a correct UTF-8 string.  Unlike Windows XP, Mac OS X
> decomposes character components as much as possible (Sorry I forgot
> the correct term for this policy).  So what you got:

IIRC, that's the correct term. (Decomposed.)

-austin
Timo H. (Guest)
on 2006-01-13 11:42
(Received via mailing list)
>>  How can I convert the string to UTF8?
>
> You have got a correct UTF-8 string.  Unlike Windows XP, Mac OS X
> decomposes character components as much as possible (Sorry I forgot
> the correct term for this policy).  So what you got:
>
>> # => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]
>
> is the decomposed form of your string: a+umlaut, o+umlaut, etc.

Hi Matz, Austin and A.

Thanks for the clarification. Unicode is more complex than it seems at
first...

Nevertheless that doesn't solve my current problem. What I'm trying
to do is to organize files within a directory into subfolders based
on the first N characters of the file name. Here's my code (w/o error
handling) which works fine for 8bit characters, but doesn't work for
e.g. umlauts:

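$KCODE='UTF8'   # Ruby 1.8: treat strings as UTF-8
require 'jcode'
require 'pathname'
require 'fileutils'
# ARGV[0]: directory to organize, ARGV[1]: prefix length N
wd, len = Pathname.new(ARGV[0]), ARGV[1].to_i
files = wd.children.reject { |f| f.directory? }
files.each do |f|
  # Subfolder name: the first N characters of the file name
  dir = wd + Pathname.new(f.basename.to_s.split(//)[0..len-1].join)
  dir.mkdir unless dir.exist?
  FileUtils.mv f, dir
end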

I guess I have to recompose the decomposed filename somehow. Are
there any tools for that in the standard library or somewhere else?

Thanks for your help,

Timo
Timo H. (Guest)
on 2006-01-17 15:33
(Received via mailing list)
Hi,

to answer my own question, here's a solution. Use the 'unicode' gem
and change the line

>   dir = wd + Pathname.new(f.basename.to_s.split(//)[0..len-1].join)

to

dir = wd + Pathname.new(Unicode::compose(f.basename.to_s).split(//)[0..len-1].join)

Then it works.

Timo
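
On a current Ruby the same idea can be sketched without the 'unicode' gem, $KCODE, or jcode. This assumes Ruby 2.2 or later and swaps in the built-in String#unicode_normalize for Unicode::compose; prefix_for is a hypothetical helper name, not something from the script above.

```ruby
# Hypothetical helper: the first `len` characters of a file name, taken
# after composing it to NFC. Assumes Ruby 2.2+.
def prefix_for(name, len)
  name.unicode_normalize(:nfc).chars.first(len).join
end

decomposed_name = "A\u0308rger.txt"  # "Ärger.txt" in decomposed form

naive    = decomposed_name.chars.first(1).join  # "A": the combining mark is cut off
composed = prefix_for(decomposed_name, 1)       # "Ä" as the single code point U+00C4
```

Composing first only helps when a precomposed form exists. For combinations without one, splitting with String#grapheme_clusters (Ruby 2.5 or later) may be the more robust choice.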