2009/12/28 Brian C.:
not matching UTF-8.
puts "Name: #{fn.inspect}"
puts "Encoding: #{fn.encoding}"
puts "Chars: #{fn.chars.to_a.inspect}"
puts "Codepoints: #{fn.codepoints.to_a.inspect}"
puts "Bytes: #{fn.bytes.to_a.inspect}"
puts
end
then post the results for this file here. Then also post what you think
the true filename is.
The true filename is (from the Finder and Terminal):
-rw-r--r--@ 1 benoitdaloze staff 3758 Jul 17 2008 español.lng
So, with the ‘ñ’.
I don’t know what encoding HFS+ uses for filenames. According to
Wikipedia it is UTF-16, with decomposition:
“names which are also character encoded in UTF-16
(http://en.wikipedia.org/wiki/UTF-16) and normalized to a form very
nearly the same as Unicode Normalization Form D (NFD)
(http://en.wikipedia.org/wiki/Unicode_normalization) (which means that
precomposed characters like é are decomposed in the HFS+ filename and
therefore count as two characters)”
So the problem is probably in how Dir.[] handles the encoding of the
filename.
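To make the decomposition point concrete, here is a minimal sketch comparing the two spellings of the name. It assumes Ruby 2.2 or later for String#unicode_normalize (which the 1.9.2dev build in this thread does not have), and same_name? is a hypothetical helper, not something from the original script.

```ruby
# Two spellings of the same visible filename.
COMPOSED   = "espa\u00F1ol.lng"   # ñ as a single codepoint (U+00F1)
DECOMPOSED = "espan\u0303ol.lng"  # n followed by a combining tilde (U+0303)

# Hypothetical helper: compare two names after normalizing both to NFC.
# Requires Ruby >= 2.2 (String#unicode_normalize).
def same_name?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

puts COMPOSED == DECOMPOSED            # String#== compares the raw bytes
puts same_name?(COMPOSED, DECOMPOSED)  # compares the normalized forms
puts COMPOSED.length
puts DECOMPOSED.length
```

Both strings are valid UTF-8; they differ only in normalization form, which is why plain equality fails even though the encodings match.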
I changed the script a little, to compare the filename with a String
hard-coded inside the script (rn = "español.lng"):
ruby 1.9.2dev (2009-12-11 trunk 26067) [x86_64-darwin10.2.0]
Source encoding: UTF-8
External encoding: UTF-8
Format:
String in the code
filename from Dir[]
String equality: false
Name:
"español.lng"
"español.lng"
Encoding:
UTF-8
UTF-8
Chars:
["e", "s", "p", "a", "ñ", "o", "l", ".", "l", "n", "g"]
["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
Codepoints:
[101, 115, 112, 97, 241, 111, 108, 46, 108, 110, 103]
[101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103]
Bytes:
[101, 115, 112, 97, 195, 177, 111, 108, 46, 108, 110, 103]
[101, 115, 112, 97, 110, 204, 131, 111, 108, 46, 108, 110, 103]
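For reference, a rough sketch of a script that prints a report in the format above. The helpers describe and report are hypothetical names, and the second string is written with an explicit \u0303 escape as a stand-in for what Dir[] returned, so the sketch runs without an HFS+ volume.

```ruby
# Hypothetical helper: collect the fields shown in the report.
def describe(str)
  {
    name:       str.inspect,
    encoding:   str.encoding.to_s,
    chars:      str.chars.to_a,
    codepoints: str.codepoints.to_a,
    bytes:      str.bytes.to_a
  }
end

# Hypothetical helper: print each field for both strings,
# the String from the code first, then the filename from Dir[].
def report(literal, from_dir)
  puts "String equality: #{literal == from_dir}"
  a = describe(literal)
  b = describe(from_dir)
  a.each_key do |field|
    puts "#{field.to_s.capitalize}:"
    [a[field], b[field]].each do |value|
      puts(value.is_a?(String) ? value : value.inspect)
    end
  end
end

rn = "espa\u00F1ol.lng"   # String in the code (precomposed ñ)
fn = "espan\u0303ol.lng"  # assumption: stand-in for the name from Dir[]
report(rn, fn)
```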
Then you can see whether: (1) Dir.[] is returning the correct sequence
of bytes for the filename or not; and (2) Dir.[] is tagging the string
with the correct encoding or not.
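(1) Dir[] seems to return a valid UTF-8 String, yet it is different
(!!) from the same String written inside the script, which is also UTF-8.
Looking at the codepoints and bytes, they really are different: the ‘ñ’
is one codepoint (241) in the script, but ‘n’ followed by 771 in the name
from Dir[].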
(2) That’s probably the case; let’s check by forcing the encoding to
MacRoman:
Or not … that only gives crazy results like "espan\xCC\x83ol.lng" or
"espan\u0303ol.lng"
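A small sketch of why forcing MacRoman cannot help here, assuming the name really is decomposed UTF-8 as the byte dump suggests. force_encoding only changes the label on the bytes, so the two tilde bytes (0xCC, 0x83) end up being read as two separate MacRoman characters.

```ruby
name = "espan\u0303ol.lng"  # assumption: stand-in for the name from Dir[]

# force_encoding relabels the bytes; it never changes them.
relabeled = name.dup.force_encoding("macRoman")

# Transcoding the relabeled string back to UTF-8 maps each of the
# 13 bytes to its own character, so the combining tilde is lost.
mojibake = relabeled.encode("UTF-8")

puts relabeled.bytes == name.bytes
puts mojibake.length
puts mojibake == name
```

The bytes were already correct UTF-8, so no relabeling can make the two strings equal; only normalization does.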
Well, this is beyond my poor knowledge of encodings, I’m afraid.
The most frustrating part is that both strings print the same…
P.S.: I also got filenames containing "\r", quite weird, no? ("Target
Application Alias\r", where the "\r" is shown as "?" in the Terminal)
(This is one of the thousands of cases I did not document in
ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\[\]{}.,\r_-"
File.rename(f, File.dirname(f) + '/' + name)
Yes, tr! returns nil for name.tr!('ñ', 'n'), but it would work on a String
written inside the script (e.g. "eño".tr!('ñ', 'n'))
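A possible workaround, sketched under the assumption that Ruby 2.2 or later is available (String#unicode_normalize). tr! returns nil because the decomposed name contains no precomposed ñ (U+00F1) to replace; normalizing first, or stripping the combining marks, sidesteps that. ascii_n and strip_marks are hypothetical helper names.

```ruby
name = "espan\u0303ol.lng"  # assumption: decomposed name, as Dir[] returned it

# No U+00F1 in the decomposed string, so tr! changes nothing and returns nil.
p name.dup.tr!("\u00F1", "n")

# Option 1: compose to NFC first, then transliterate ñ to n.
def ascii_n(str)
  str.unicode_normalize(:nfc).tr("\u00F1", "n")
end

# Option 2: decompose to NFD and drop every combining mark (\p{Mn}).
def strip_marks(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

puts ascii_n(name)
puts strip_marks(name)
```

Option 2 handles any accented letter that decomposes to a base letter plus marks, not just ñ, but it also removes marks you might want to keep.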