To_yaml and international characters

h3rald · October 23, 2007, 2:46pm

Hello,

I noticed some weird behavior when converting a string containing
international characters to YAML:

irb(main):002:0> 'test òùè'.to_yaml
=> "--- \"test \\x95\\x97\\x8A\"\n"
irb(main):003:0>

…but:

irb(main):001:0> 'test òùè'
=> "test \225\227\212"

Basically, the to_yaml method seems to use some strange hex escape
sequences which do not correspond to ANSI, UTF-8 or windows-1252.
The funny part is that when I load the same string back from YAML, it
is displayed correctly in the console. This would be fine, except that
when I save it to a file, the international characters are not
displayed properly (or rather, they are converted to the corresponding
ANSI/UTF-8 characters). What’s going on here? What encoding does
to_yaml use to escape international characters?
According to the docs it should be UTF-8, but apparently it is not.

Ruby version: 1.8.6
OS: Windows XP

Any ideas?

h3rald · October 23, 2007, 3:06pm

On 10/23/07, h3raLd wrote:

=> "test \225\227\212"
\225\227\212 is the same as \x95\x97\x8A, the former in octal, and the
latter in hex.

irb(main):002:0> 0x95.to_s(8)
=> "225"
irb(main):003:0> 0x97.to_s(8)
=> "227"
irb(main):004:0> 0x8a.to_s(8)
=> "212"
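
The same check can be made on the whole string at once. This is only a sketch, and it assumes a Ruby newer than the 1.8.6 used in this thread (String#b appeared in Ruby 2.0); the variable names are made up for illustration.

```ruby
# Octal and hex escapes are two notations for the same byte values.
octal = "\225\227\212".b   # .b returns a binary (ASCII-8BIT) copy
hex   = "\x95\x97\x8A".b

same_bytes = (octal == hex)
as_octal   = hex.bytes.map { |b| b.to_s(8) }
as_hex     = hex.bytes.map { |b| b.to_s(16) }

puts same_bytes         # true
puts as_octal.inspect   # ["225", "227", "212"]
puts as_hex.inspect     # ["95", "97", "8a"]
```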

Bye

h3rald · October 23, 2007, 3:47pm

On Oct 23, 3:05 pm, "Luis P." wrote:

irb(main):004:0> 0x8a.to_s(8)
=> "212"

Bye

--
Luis P.
http://ktulu.com.ar/blog/

Thanks a lot, this solves part of the mystery!

I figured out the other half, unfortunately: the reason why I can’t
view the characters in ANSI or UTF-8 is that I’m inputting from DOS,
which means “Code Page 437”
(http://en.wikipedia.org/wiki/Code_page_437).
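
To illustrate, the bytes from the first post decode back to the typed characters once they are read as code page 437. This is a hedged sketch that assumes a modern Ruby (1.9 or later, where String#force_encoding and String#encode exist); Ruby 1.8.6 would need Iconv instead.

```ruby
# Bytes produced by typing 'test òùè' in a DOS console (code page 437).
raw = "test \x95\x97\x8A".b

# Label the bytes as IBM437 (Ruby's name for CP437), then transcode to UTF-8.
text = raw.force_encoding("IBM437").encode("UTF-8")

puts text  # test òùè (on a UTF-8 terminal)
```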

h3rald · October 23, 2007, 3:54pm

On 10/23/07, h3raLd wrote:

Hello,

I noticed some weird behavior when converting a string containing
international characters to YAML:

irb(main):002:0> 'test òùè'.to_yaml
=> "--- \"test \\x95\\x97\\x8A\"\n"
irb(main):003:0>

IIRC the various YAML implementations in each language can choose
to output UTF-8, or unicode-escaped ASCII. I think a YAML implementation
has to be able to read either.

h3rald · October 30, 2007, 2:56am

Quoth Jamal Bengeloun:

characters into utf-8

…

Does someone have an explanation?

Does anyone know how to get those characters into the final xml files?

Any help would be greatly appreciated.

Jamal

In short, you’re asking what the difference is between “\303\251”,
“é”, and “‚”.

The first is an octal escape sequence embedded in a string (it happens
to be the same as UTF-8 ‘é’). The second is also UTF-8 ‘é’. These two
are the same string ("\303\251" == "é"). The last, ‘‚’, is the
HTML-escaped notation for an ‘é’ (I’m trusting your email for the
correct number here). That is, literally “‚” != “é”, but they should
render the same in a browser capable of displaying UTF-8.

HTH,

h3rald · October 30, 2007, 3:16am

Jamal Bengeloun wrote:

à gets translated to \205
ù gets translated to \227

But why?

I guess that your understanding is just wrong. I’m not really sure
where your program gets the accented chars that are translated to
those specific escaped octal sequences. But if you’re specifying them in
string constants in your program, then it all depends on which
encoding your editor uses to save and display them.

For instance, I usually edit my scripts as UTF-8 text files, and I treat
my string constants that way too. In that case, if I put an é in a string
constant, it gets interpreted as \303\251, and not as \202. It’s just
the octal representation of the byte(s) your editor displays as a
specific accented character.
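
A small sketch of that point: the same character yields different bytes depending on the encoding it is stored in. It assumes a modern Ruby with String#encode, and IBM850 is used here only as an example of a DOS code page.

```ruby
# One character, two encodings, two different byte sequences.
char = "\u00E9"  # é, written as an escape so the source file encoding doesn't matter

utf8_octal  = char.encode("UTF-8").bytes.map { |b| format("\\%o", b) }.join
cp850_octal = char.encode("IBM850").bytes.map { |b| format("\\%o", b) }.join

puts utf8_octal   # \303\251
puts cp850_octal  # \202
```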

mortee

h3rald · October 30, 2007, 1:27am

Sorry but I do not get it. Plus I am not sure it is only related to
YAML.

I am working on something similar and the only answers I can relate are
those in Python (such as:
http://www.reportlab.com/i18n/python_unicode_tutorial.html). I mean I
got so far as understanding that:

é gets translated to \202
è gets translated to \212
à gets translated to \205
ç gets translated to \207
â gets translated to \203
ê gets translated to \210
î gets translated to \214
ô gets translated to \223
û gets translated to \226
ä gets translated to \204
ë gets translated to \211
ï gets translated to \213
ö gets translated to \224
ù gets translated to \227

But why?

The app I am working on gets its data from different sources (yaml
files, dBaseIV files, MS Access files) and then produces xml files (via
builder).

When using print you get the original character. When using p, you get
the escaped equivalent.

And that’s only the start of your problems! When trying to get those
characters into utf-8

é gets translated to \202 that then gets translated to ‚
è gets translated to \212 that then gets translated to Š
à gets translated to \205 that then gets translated to …
ç gets translated to \207 that then gets translated to ‡
â gets translated to \203 that then gets translated to ƒ
ê gets translated to \210 that then gets translated to ˆ
î gets translated to \214 that then gets translated to Œ
ô gets translated to \223 that then gets translated to “
û gets translated to \226 that then gets translated to –
ä gets translated to \204 that then gets translated to „
ë gets translated to \211 that then gets translated to ‰
ï gets translated to \213 that then gets translated to ‹
ö gets translated to \224 that then gets translated to ”
ù gets translated to \227 that then gets translated to —
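
One way to see how a table like this can arise is to decode the same byte under two different code pages. The sketch below assumes a modern Ruby (String#encode), and it assumes the source bytes are CP850 while the consumer treats them as Windows-1252; decode_byte is a hypothetical helper, not part of any library.

```ruby
# Decode one raw byte under a given single-byte code page, returning UTF-8.
def decode_byte(byte_value, code_page)
  [byte_value].pack("C").force_encoding(code_page).encode("UTF-8")
end

[0o202, 0o212, 0o227].each do |byte_value|
  as_cp850  = decode_byte(byte_value, "IBM850")
  as_cp1252 = decode_byte(byte_value, "Windows-1252")
  puts format("\\%o  CP850: %s  Windows-1252: %s", byte_value, as_cp850, as_cp1252)
end
```

Read as CP850, byte \202 is é; read as Windows-1252, the very same byte is the single low-9 quotation mark.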

Does someone have an explanation?

Does anyone know how to get those characters into the final xml files?

Any help would be greatly appreciated.

Jamal


h3rald · October 30, 2007, 12:05pm

Probably. I am a beginner in Ruby.

The program gets the accented characters from a dBase IV file, an MS
Access file and some YAML files.

I use Komodo Edit as my editor and it does handle UTF-8 correctly.

I know! That’s why I did not understand why I got \202. I do not know
which charset ruby uses to convert the characters. I tried iconv and
jcode but ended up with the same results. At first I thought it was
because of the library I used (builder for example). The only
explanation I found was on that python tutorial.

Thanks.

Jamal


h3rald · October 30, 2007, 1:30pm

Jamal Bengeloun wrote:
Thanks a lot for your help. I thought I was going mad with this. I
thought it had something to do with Ruby being C based (I saw something
on the internet about the difference between Python and JPython, where
the accented characters were encoded in UTF-8 and not HTML escaped).

What if the end rendering engine is not a browser (I checked and you’re
absolutely right, it does work in a browser)? How to get true UTF-8
encoded characters instead of HTML escaped ones? I am using builder to
generate XML files from the data I get.

Thanks a lot for your explanation (it really did enlighten me) and your
help.

Jamal

It should be possible to convert CP437
(http://en.wikipedia.org/wiki/Code_page_437) to UTF-8 using iconv.

iconv -l | grep -i CP437 # => 437 CP437 IBM437 CSPC8CODEPAGE437

“How to get true UTF-8 encoded characters instead of HTML escaped ones?”

This should be doable with http://htmlentities.rubyforge.org .
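
For numeric references only, the idea can be sketched without any library. This is a hand-rolled illustration, not the htmlentities API; decode_numeric_refs is a hypothetical helper, and named entities such as &eacute; are deliberately left untouched.

```ruby
# Replace numeric character references (&#233; or &#xE9;) with real UTF-8 characters.
def decode_numeric_refs(text)
  text.gsub(/&#(x[0-9a-fA-F]+|\d+);/) do
    ref = Regexp.last_match(1)
    codepoint = ref.start_with?("x") ? ref[1..-1].to_i(16) : ref.to_i
    [codepoint].pack("U")  # pack("U") builds the UTF-8 encoding of the code point
  end
end

decoded = decode_numeric_refs("caf&#233; &#xE9;")
puts decoded  # café é
```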

(For a Ruby & UTF-8 snippet btw see
http://snippets.dzone.com/posts/show/4527 ).

Cheers,

j. k.

h3rald · October 30, 2007, 12:09pm

Thanks a lot for your help. I thought I was going mad with this. I
thought it had something to do with Ruby being C based (I saw something
on the internet about the difference between Python and JPython, where
the accented characters were encoded in UTF-8 and not HTML escaped).

What if the end rendering engine is not a browser (I checked and you’re
absolutely right, it does work in a browser)? How to get true UTF-8
encoded characters instead of HTML escaped ones? I am using builder to
generate XML files from the data I get.

Thanks a lot for your explanation (it really did enlighten me) and your
help.

Jamal


h3rald · October 30, 2007, 3:42pm


If I’m not mistaken, HTML and XML encoding is the same. So you’re good
for those &#xxxxxx; chars.

HTH,

h3rald · October 30, 2007, 3:25pm

Thanks, I am going to try with html entities.

However, I rechecked with my browsers:

when the accented character comes from a YAML file, it is correctly HTML
encoded; however, when it comes from the dBase file, it again goes
through:

é gets translated to \202 that then gets translated to ‚
è gets translated to \212 that then gets translated to Š
à gets translated to \205 that then gets translated to …
ç gets translated to \207 that then gets translated to ‡
â gets translated to \203 that then gets translated to ƒ
ê gets translated to \210 that then gets translated to ˆ
î gets translated to \214 that then gets translated to Œ
ô gets translated to \223 that then gets translated to “
û gets translated to \226 that then gets translated to –
ä gets translated to \204 that then gets translated to „
ë gets translated to \211 that then gets translated to ‰
ï gets translated to \213 that then gets translated to ‹
ö gets translated to \224 that then gets translated to ”
ù gets translated to \227 that then gets translated to —

like the behavior seen on this page (Python behavior, however):
http://www.reportlab.com/i18n/python_unicode_tutorial.html

For example:

[dBase > XML] é gets translated to \202 that then gets translated to
‚ (single low-9 quotation mark)
[YAML > XML] é gets translated to é

Thanks for your help!

Jamal


h3rald · October 30, 2007, 9:04pm

Jamal Bengeloun wrote:

Probably. I am a beginner in ruby.

The program gets the accented characters from a dBaseIV file, a MS
Access File and some YAML files.

I use Komodo Edit as my editor and it does handle UTF-8 correctly.

I know! That’s why I did not understand why I got \202. I do not know
which charset ruby uses to convert the characters.

When Ruby converts accented characters to \nnn, it doesn’t use any
encoding. It just represents, verbatim, the non-ASCII bytes it sees in
the string it gets. Encoding/decoding happens mainly when you type
accented chars on your keyboard and they get converted to some byte (or
byte sequence) to be stored in a string, and when those strings are
displayed and the bytes are converted back to printable characters.

Problems arise when the displaying code interprets a string according
to a different charset than the one it was encoded with.

For example, when you puts a string in irb, it’s your terminal’s
current charset that determines how the bytes in the string are
actually displayed. In contrast, when you use p (or, for that matter,
inspect), non-ASCII characters get escaped as \nnn.
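
A rough illustration of that difference, assuming a modern Ruby: there, inspect escapes raw bytes in hex (\x82) rather than in the octal (\202) that Ruby 1.8 printed, but the principle is the same. IBM850 is an assumed code page for the example.

```ruby
# A raw byte with no charset attached: inspect can only escape it.
raw = "\x82".b
escaped = raw.inspect

# Once the charset is known, the byte can be turned into a printable character.
decoded = raw.dup.force_encoding("IBM850").encode("UTF-8")

puts escaped  # "\x82"
puts decoded  # é (on a UTF-8 terminal)
```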

mortee

h3rald · November 13, 2007, 10:33am

Sorry for the delay,

You are totally right. In order to get what I want I used

formated_value = Iconv.new('UTF-8', 'CP850').iconv(input.to_s)

And… It worked.
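
Iconv was later removed from Ruby's standard library, so on a current Ruby the same conversion would probably be written with String#encode. A minimal sketch under that assumption; dbf_value_to_utf8 is a hypothetical helper name and the sample bytes are invented for the example.

```ruby
# Convert a value read from a CP850-encoded source (e.g. a dbf field) to UTF-8.
def dbf_value_to_utf8(input, source_encoding = "IBM850")
  # dup so the caller's string is not relabelled in place
  input.to_s.dup.force_encoding(source_encoding).encode("UTF-8")
end

formatted_value = dbf_value_to_utf8("\x82t\x82".b)
puts formatted_value  # été (on a UTF-8 terminal)
```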

But at the end I simply used the ADODB wrapper to open my dbf file and I
did not have any character encoding problems after that.

Thanks a lot.

Jamal Abdou-Karim Bengeloun


h3rald · November 13, 2007, 10:37am

I am not sure about that, I’ll have to check. What I noticed though is
that builder converted the accented characters correctly when coming
from yaml files, but had problems (it did convert them but… See my
previous post) getting those coming from the dbf file to hit the target.

It seems that the problem was coming from the code page encoding.

Thanks

Jamal Abdou-Karim Bengeloun
