Encoding issue for special characters on Windows


#1

Hi,

I am facing an issue with special characters handling inside a Ruby
script running on Windows and am sure some of you could help me on
this.

This script copies files such as “<English_name>.txt” to
“<Other_language_name>.txt”. Once translated, the new filename may
contain special characters, ‘ä’ for instance.

Running
puts ‘ä’
in a Ruby script gives
‘õ’
as output, whereas the same code in irb gives
‘ä’

There must be an encoding issue at some point in my script, but I
didn’t manage to fix it (I tried different values of ‘#encoding:’
without success). Any clue?

Many thanks in advance
Best regards

Nicolas


#2

On 9 January 2009 at 10:10, Nicolas G. wrote:

There must be an encoding issue at some point in my script, but I
didn’t manage to fix it (I tried different values of ‘#encoding:’
without success). Any clue?

It depends. If you are trying to echo something to the console, you’ll
have to use CP850.

The code for ä is 228 (0xE4) in the ISO 8859-1 [1] encoding that your
file seems to use, and that same byte corresponds to the õ character in
CP850 [2].
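
The mismatch described above can be reproduced directly. This is only a minimal sketch, assuming Ruby 1.9 or later (for force_encoding and encode); the variable names are illustrative and not taken from the original script:

```ruby
# encoding: utf-8
# Assumption: Ruby 1.9+ (String#force_encoding and String#encode).

# One raw byte with value 228 (0xE4), as stored in an ISO-8859-1 file.
raw = [228].pack("C")

# Interpret the same byte under each encoding, then convert to UTF-8
# so both results can be compared and printed the same way.
as_latin1 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
as_cp850  = raw.dup.force_encoding("CP850").encode("UTF-8")

puts as_latin1 == "\u00E4"  # true: U+00E4 is ä
puts as_cp850  == "\u00F5"  # true: U+00F5 is õ
```

The byte never changes; only the table used to look it up does, which is why the file contents can be fine while the console shows the wrong glyph.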

Now, if you’re only writing to the screen as a check or for debugging
while manipulating files, don’t convert the output that goes into your
resulting file to CP850! You’d better stay in ISO 8859-1, or maybe even
in UTF-8, depending on what your real goal is (website, internal
application, database, etc.).
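
One way to follow this advice is to transcode only the copy of the string that is displayed. The sketch below assumes Ruby 1.9 or later and a console using CP850; console_text is a hypothetical helper name, not an existing library method:

```ruby
# encoding: utf-8
# Assumption: Ruby 1.9+; console_text is a hypothetical helper name.

# Return a copy of text transcoded for a console using the given code page.
# Characters the code page cannot represent become "?".
def console_text(text, console_encoding = "CP850")
  text.encode(console_encoding,
              :invalid => :replace, :undef => :replace, :replace => "?")
end

filename = "\u00E4.txt"            # stays UTF-8 for file operations
shown    = console_text(filename)  # CP850 bytes, for display only

p filename.encoding  # the original string is untouched
p shown.bytes.to_a   # first byte is 132 (0x84), the CP850 code for ä
```

The original filename would still be the one passed to the file operations; only the transcoded copy goes to the screen. The active console code page may differ from CP850 (running chcp in the console reports it), so the default here is an assumption.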

Fred
[1]: http://en.wikipedia.org/wiki/ISO/IEC_8859-1
[2]: http://en.wikipedia.org/wiki/Code_page_850


#3

Nicolas G. removed_email_address@domain.invalid writes:

Running
puts ‘ä’
in a Ruby script gives
‘õ’
as an output, whereas the same code in irb gives
‘ä’

There must be an encoding issue at some point in my script, but I
didn’t manage to fix it (I tried different values of ‘#encoding:’
without success). Any clue?

I use Emacs. In Emacs, you’d just put:

#!/usr/bin/ruby
# -*- coding:utf-8 -*-
puts "ä"

to have the script encoded in UTF-8 and therefore outputting a UTF-8
byte stream.
Then of course, you have to have a UTF-8 terminal:

[pjb@simias :0.0 tmp]$ chmod 755 test.rb
[pjb@simias :0.0 tmp]$ export LC_CTYPE=en_US.UTF-8
[pjb@simias :0.0 tmp]$ ./test.rb
ä
[pjb@simias :0.0 tmp]$ cat test.rb
#!/usr/bin/ruby
# -*- coding:utf-8 -*-
puts "ä"
[pjb@simias :0.0 tmp]$

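Notice that in irb, with a UTF-8 terminal, "ä".length == 2, since ä
takes two bytes in UTF-8 (this holds on Ruby 1.8, where length counts
bytes).

Of course, you can choose to use iso-8859-1 or iso-8859-15 instead; just
substitute it for utf-8.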
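
The byte counts behind that remark can be checked with a short sketch. It assumes Ruby 1.9 or later, where length counts characters and bytesize counts bytes; on Ruby 1.8, length itself returns the byte count, which is the 2 reported above. The variable names are illustrative only:

```ruby
# encoding: utf-8
# Assumption: Ruby 1.9+, where length counts characters, not bytes.

s = "\u00E4"  # ä written as an escape, so it is UTF-8 regardless of editor

utf8_bytes   = s.bytesize                       # 2
char_count   = s.length                         # 1
latin1_bytes = s.encode("ISO-8859-1").bytesize  # 1

puts utf8_bytes, char_count, latin1_bytes
```

The same character is two bytes in UTF-8 but one byte in ISO-8859-1, so the choice of script encoding changes the byte stream the script writes.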


#4

On 10 Jan, 16:24, “F. Senault” removed_email_address@domain.invalid wrote:

It depends. If you are trying to echo something to the console, you’ll
have to use CP850.

The character for ä is 228 in the ISO8859-1 [1] encoding that your file
seems to use, and that corresponds to the õ character in CP850 [2].

Now, if you’re writing something on the screen as a means of control or
debug while manipulating files, don’t convert your output to CP850 in
your resulting file! You’d better stay in ISO

Hi and sorry for the delay,

You were right. The screen output was the only thing affected by the
issue. The result in the filesystem was all right. So everything is
working as expected, since I have no need to display the filenames once
in production.

Thanks to both of you