Forum: Ruby Dir.entries and UTF-8

Timo Hoepfner (thoepfner)
on 2006-01-12 16:23
(Received via mailing list)
Hi,

What's going on here? This is on Mac OS X 10.4.4. Looks like
Dir#entries returns strings encoded with some encoding I didn't
expect. How can I convert the string to UTF8?

$KCODE='UTF8'
require 'jcode'
s="äöüßÄÖÜ"
puts s.split(//).inspect
# => ["ä", "ö", "ü", "ß", "Ä", "Ö", "Ü"]
test_dir="/tmp/test"
`mkdir #{test_dir}`
`touch #{test_dir}/#{s}`
f=Dir.entries(test_dir).last
puts f.split(//).inspect
# => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]

Timo
Yukihiro Matsumoto (Guest)
on 2006-01-12 17:18
(Received via mailing list)
Hi,

On 1/13/06, Timo Hoepfner <th-dev@onlinehome.de> wrote:

> What's going on here? This is on Mac OS X 10.4.4. Looks like
> Dir#entries returns strings encoded with some encoding I didn't
> expect. How can I convert the string to UTF8?

You have got a correct UTF-8 string.  Unlike Windows XP, Mac OS X
decomposes character components as much as possible (Sorry I forgot
the correct term for this policy).  So what you got:

> # => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]

is the decomposed form of your string: a+umlaut, o+umlaut, etc.

matz.
A LeDonne (Guest)
on 2006-01-12 17:33
(Received via mailing list)
On 1/12/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> the correct term for this policy).  So what you got:
>
> > # => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]
>
> is the decomposed form of your string: a+umlaut, o+umlaut, etc.
>
> matz.
>

Matz refers to Unicode Normalization Form D (NFD). According to
http://developer.apple.com/technotes/tn/tn1150.html (HFS Plus Volume
Format):

"HFS Plus stores strings fully decomposed and in canonical order. HFS
Plus compares strings in a case-insensitive fashion. Strings may
contain Unicode characters that must be ignored by this comparison.
For more details on these subtleties, see Unicode Subtleties."
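
The decomposition described here can be reproduced in a current Ruby (2.2 or later, which postdates this thread) with String#unicode_normalize. This is only a minimal sketch: the variable names are illustrative, and the strings are written with \u escapes so that an editor cannot silently change their form.

```ruby
# Precomposed (NFC) form: one code point per umlaut.
composed = "\u00E4\u00F6\u00FC"   # "äöü"

# Decomposed (NFD) form: base letter followed by U+0308 COMBINING DIAERESIS.
decomposed = composed.unicode_normalize(:nfd)

# Render code points as "U+XXXX" for inspection.
composed_codepoints   = composed.codepoints.map { |c| format("U+%04X", c) }
decomposed_codepoints = decomposed.codepoints.map { |c| format("U+%04X", c) }

p composed_codepoints
p decomposed_codepoints

# Both strings are valid UTF-8, but they are different code point sequences,
# so a plain comparison fails until both are brought to the same form.
same_as_is     = (composed == decomposed)
same_after_nfc = (composed == decomposed.unicode_normalize(:nfc))
```

Splitting the decomposed string into characters yields the base letter and the combining mark as separate elements, which matches the output in the original post.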

-A
Austin Ziegler (Guest)
on 2006-01-12 17:33
(Received via mailing list)
On 12/01/06, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> On 1/13/06, Timo Hoepfner <th-dev@onlinehome.de> wrote:
>> What's going on here? This is on Mac OS X 10.4.4. Looks like
>> Dir#entries returns strings encoded with some encoding I didn't
>> expect. How can I convert the string to UTF8?
> You have got a correct UTF-8 string.  Unlike Windows XP, Mac OS X
> decomposes character components as much as possible (Sorry I forgot
> the correct term for this policy).  So what you got:

IIRC, that's the correct term. (Decomposed.)

-austin
Timo Hoepfner (thoepfner)
on 2006-01-13 10:42
(Received via mailing list)
>>  How can I convert the string to UTF8?
>
> You have got a correct UTF-8 string.  Unlike Windows XP, Mac OS X
> decomposes character components as much as possible (Sorry I forgot
> the correct term for this policy).  So what you got:
>
>> # => ["a", "", "o", "", "u", "", "ß", "A", "", "O", "", "U", ""]
>
> is the decomposed form of your string: a+umlaut, o+umlaut, etc.

Hi Matz, Austin and A.

Thanks for the clarification. Unicode is more complex than it seems
at first...

Nevertheless that doesn't solve my current problem. What I'm trying
to do is to organize files within a directory into subfolders based
on the first N characters of the file name. Here's my code (without error
handling), which works fine for 8-bit characters but doesn't work for,
e.g., umlauts:

$KCODE='UTF8'
require 'jcode'
require 'pathname'
require 'fileutils'
wd, len = Pathname.new(ARGV[0]), ARGV[1].to_i
files=wd.children.reject{|f| f.directory?}
files.each do |f|
   dir = wd + Pathname.new(f.basename.to_s.split(//)[0..len-1].join)
   dir.mkdir unless dir.exist?
   FileUtils.mv f, dir
end

I guess I have to recompose the decomposed filename somehow. Are
there any tools for that in the standard library or somewhere else?

Thanks for your help,

Timo
Timo Hoepfner (thoepfner)
on 2006-01-17 14:33
(Received via mailing list)
Hi,

To answer my own question, here's a solution: use the 'unicode' gem
and change the line

>   dir = wd + Pathname.new(f.basename.to_s.split(//)[0..len-1].join)

to

dir = wd + Pathname.new(Unicode::compose(f.basename.to_s).split(//)[0..len-1].join)

Then it works.
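
For readers on a newer Ruby (2.2 or later), the same recomposition can probably be done without the gem, using the built-in String#unicode_normalize. The sketch below is an assumption about how that might look, not the code from this thread; prefix_for is a hypothetical helper name, and the sample file name is made up.

```ruby
# Hypothetical helper: the first `len` characters of a file name, taken after
# recomposing it to NFC so that "a" + U+0308 counts as the single character "ä".
def prefix_for(name, len)
  name.unicode_normalize(:nfc)[0, len]
end

# A decomposed name, as an HFS Plus volume would return it for "äöü.txt".
decomposed_name = "a\u0308o\u0308u\u0308.txt"

# Without recomposition, two "characters" are only "a" plus its combining mark.
naive_prefix = decomposed_name[0, 2]

# With recomposition, two characters are the two composed umlauts.
recomposed_prefix = prefix_for(decomposed_name, 2)

p naive_prefix.codepoints
p recomposed_prefix.codepoints
```

Note that unicode_normalize only works on strings in a Unicode encoding, so names read from a file system that reports another encoding would need converting first.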

Timo