Basically, the to_yaml method seems to use some strange hex escape
sequences which do not correspond to ANSI, UTF-8 or windows-1252…
The funny part is that when I load the same string back from YAML, it is
displayed correctly in the console. This would be fine, except that
when I tried to save it to a file the international characters were not
displayed properly (or rather, they were converted to the corresponding
ANSI/UTF-8 characters). What’s going on here? What encoding does
to_yaml use to escape international characters?
According to the docs it should be UTF-8, but apparently it is not.
I figured out the other half, unfortunately: the reason why I can’t
view the characters in ANSI or UTF-8 is because I’m inputting from DOS,
which means, unfortunately, “Code Page 437”
(http://en.wikipedia.org/wiki/Code_page_437).
IIRC the various YAML implementations in each language can choose
to output either UTF-8 or Unicode-escaped ASCII. I think a YAML
implementation has to be able to read either.
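To illustrate the point about reading either form, here is a minimal sketch using Ruby's bundled yaml library. It assumes a reasonably current Ruby (where yaml is Psych and strings carry an encoding), which is not what the original poster was running; the sample text "la carte" is made up for the example.

```ruby
require 'yaml'

# A YAML document that spells the accented letter as a Unicode escape,
# so the document itself is pure ASCII. Single quotes keep the backslash
# for the YAML parser instead of letting Ruby expand it.
escaped_doc = '"\u00E0 la carte"'

# The same scalar written as raw UTF-8 (here Ruby expands the escape
# before the YAML parser ever sees the text).
utf8_doc = "\"\u00E0 la carte\""

from_escaped = YAML.load(escaped_doc)
from_utf8    = YAML.load(utf8_doc)

# Both documents should load to the same Ruby string.
puts from_escaped == from_utf8
puts from_escaped.encoding
```

If both forms load to the same string, the escaping seen in the dumped YAML is only a choice of representation, not a loss of data.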
à gets translated to \205
ù gets translated to \227
But why?
I guess that your understanding is just wrong. I’m not really sure
where your program gets those accented chars that are translated to
those specific escaped octal sequences. But if you’re specifying them in
string constants in your program, then it all depends on which
encoding your editor uses to display and save them.
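One way to check where those octal sequences come from is to encode the characters in a candidate charset and print the resulting bytes in octal. The sketch below tries Code Page 437, the charset mentioned earlier in the thread; it assumes a current Ruby with String#encode, and the hash keys are just labels invented for the example.

```ruby
# Encode each character as Code Page 437 and print its byte in octal.
# The characters are written as \u escapes so the script does not depend
# on the encoding of the source file itself.
chars = {
  "a grave" => "\u00E0",
  "u grave" => "\u00F9",
  "e acute" => "\u00E9"
}

octal_escapes = chars.transform_values do |ch|
  ch.encode("IBM437").bytes.map { |b| format("\\%03o", b) }.join
end

octal_escapes.each { |name, esc| puts "#{name}: #{esc}" }
# a grave: \205
# u grave: \227
# e acute: \202
```

These are the same \205, \227 and \202 values reported in the thread, which suggests the bytes in the strings were Code Page 437 bytes all along.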
The program gets the accented characters from a dBaseIV file, a MS
Access File and some YAML files.
I use Komodo Edit as my editor and it does handle UTF-8 correctly.
I know! That’s why I did not understand why I got \202. I do not know
which charset Ruby uses to convert the characters. I tried iconv and
jcode but ended up with the same results. At first I thought it was
because of the library I used (builder, for example). The only
explanation I found was in that Python tutorial.
Thanks.
Jamal
Jamal Bengeloun wrote:
Thanks a lot for your help. I thought I was going mad with this. I
thought it had something to do with Ruby being C based (I saw something
on the internet about the difference between Python and JPython, where
the accented characters were encoded in UTF-8 and not HTML-escaped).
What if the end rendering engine is not a browser (I checked and you’re
absolutely right, it does work in a browser)? How do I get true UTF-8
encoded characters instead of HTML-escaped ones? I am using builder to
generate XML files from the data I get.
Thanks a lot for your explanation (it really did enlighten me) and your
help.
Jamal
Konrad M. wrote:
Quoth Jamal Bengeloun:
characters into utf-8
…
Does someone have an explanation?
Does anyone know how to get those characters into the final xml files?
The program gets the accented characters from a dBaseIV file, a MS
Access File and some YAML files.
I use Komodo Edit as my editor and it does handle UTF-8 correctly.
I know! That’s why I did not understand why I got \202. I do not know
which charset Ruby uses to convert the characters.
When Ruby converts accented characters to \nnn, it doesn’t use any
encoding. It just represents the verbatim non-ASCII bytes it sees in the
string it gets. Encoding/decoding happens mainly when you input accented
chars on your keyboard and they get converted to some byte (sequence)
to be stored in a string, and again when those strings are displayed and
the bytes are converted back to printable characters.
Problems arise when the displaying code interprets the same string
according to a different charset than the one it was encoded with.
For example, when you puts a string in irb, it’s your terminal’s
current charset which determines how the bytes in the string are
actually displayed. In contrast, when you use p (or, for that matter,
inspect), non-ASCII characters get escaped as \nnn.
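A small sketch of that mismatch, assuming a current Ruby (1.9 or later) where strings carry an encoding. Ruby 1.8, which this thread is about, printed raw bytes in octal (\205); current versions print the same byte in hex (\x85). The two charsets chosen here are only examples.

```ruby
# One byte, 0x85 (octal 205), with no charset attached to it.
raw = "\x85".b

# inspect (which is what p prints) shows the byte itself, not a character.
puts raw.inspect

# Interpret the very same byte under two different charsets.
as_cp437  = raw.dup.force_encoding("IBM437").encode("UTF-8")
as_cp1252 = raw.dup.force_encoding("Windows-1252").encode("UTF-8")

puts as_cp437   # "a" with a grave accent
puts as_cp1252  # a horizontal ellipsis, a completely different character
```

The byte never changes; only the charset used to interpret it does, which is why the same string can look right in one place and wrong in another.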
But in the end I simply used the ADODB wrapper to open my dbf file and I
did not have any character encoding problems after that.
Thanks a lot.
Jamal Abdou-Karim Bengeloun
mortee wrote:
When Ruby converts accented characters to \nnn, it doesn’t use any
encoding. It just represents the verbatim non-ASCII bytes it sees in the
string it gets. Encoding/decoding happens mainly when you input accented
chars on your keyboard and they get converted to some byte (sequence)
to be stored in a string, and again when those strings are displayed and
the bytes are converted back to printable characters.
Problems arise when the displaying code interprets the same string
according to a different charset than the one it was encoded with.
For example, when you puts a string in irb, it’s your terminal’s
current charset which determines how the bytes in the string are
actually displayed. In contrast, when you use p (or, for that matter,
inspect), non-ASCII characters get escaped as \nnn.
I am not sure about that, I’ll have to check. What I noticed though is
that builder converted the accented characters correctly when they came
from YAML files, but had problems (it did convert them but… See my
previous post) getting those coming from the dbf file to hit the target.
It seems that the problem was coming from the code page encoding.
Thanks
Jamal Abdou-Karim Bengeloun
Konrad M. wrote:
If I’m not mistaken, HTML and XML encoding is the same. So you’re good
for those &#xxxxxx; chars.
HTH,
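If the consumer of the file really needs literal UTF-8 rather than numeric character references, one option is to post-process the generated text. The helper below is a hypothetical sketch and not part of builder; decode_numeric_refs is a name invented here, and it assumes the XML text is already a UTF-8 string. Any conforming XML parser treats both forms as the same character, so this only matters for tools that read the file without parsing it.

```ruby
# Hypothetical helper: replace decimal (&#224;) and hexadecimal (&#xE0;)
# numeric character references with the UTF-8 character they stand for.
# References to ASCII characters are left alone, so escaped markup
# characters such as &#60; (the "<" sign) stay escaped.
def decode_numeric_refs(xml)
  xml.gsub(/&#(x\h+|\d+);/i) do
    ref  = Regexp.last_match(1)
    code = ref[0].downcase == "x" ? ref[1..-1].to_i(16) : ref.to_i
    next Regexp.last_match(0) if code < 128
    [code].pack("U")
  end
end

sample  = "<name>Bj&#246;rk &#xE0; &#60;</name>"
decoded = decode_numeric_refs(sample)
puts decoded
```

The output file should then be written as UTF-8, and the XML declaration should name that encoding.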