Issue with accents (UTF-8) - is it supposed to work?

thbar · March 1, 2009, 12:07pm

Hi,

I’m not sure if it’s an oddity in my code, a bug or a non-implemented feature
in IronRuby or Mono, so I’m reporting it here. When using accents inside
strings (“Barrère”) that I pass to either buttons or datagridviews, they
translate into “BarrA¨re”. Here’s a sample (also available on GitHub:
http://github.com/thbar/ironruby-labs/blob/ca47f06024e936690d427d297909c9a78b0481e6/ui/006_datagridview.rb):

form = Magic.build do
  form(:text => "DataGridView sample", :width => 800, :height => 600) do
    # nifty - current Magic.build makes it possible to reuse the control
    # that has been added
    @grid = data_grid_view :dock => DockStyle.fill
    @grid.column_count = 2
    @grid.columns[0].name = "First name"
    @grid.columns[1].name = "Last name"

    # using my name with its nasty accent - utf-8 ?
    @grid.rows.add("Thibaut", "Barrère")
  end
end

After editing the datagridview, I noticed a log on stdout from mono:

009-03-01 11:48:36.927 mono[5512:10b] WARNING:
CFSTR("Barr\37777777703\37777777603\37777777702\37777777650re") has non-7
bit chars, interpreting using MacOS Roman encoding for now, but this will
change. Please eliminate usages of non-7 bit chars (including escaped
characters above \177 octal) in CFSTR().

So I guess the issue probably boils down to non-MacOS Roman support in Mono.

What do you think?

– Thibaut

thbar · March 3, 2009, 3:45pm

Bumping this one: do you have any idea what’s happening there?
Is it a Mono-related issue?

– Thibaut

thbar · March 3, 2009, 4:10pm

No, it’s not a Mono-related issue. I get the same results when I run your
sample on Windows with MS.NET.
It must be an encoding thing. When I set $KCODE to “UTF-8” it still has
the same behavior, which is weird I guess.

thbar · March 3, 2009, 7:04pm

I’ll take a look.

Tomas

thbar · March 3, 2009, 7:52pm

If I run this in Ruby 1.8.6:

ruby -Ku uni.rb

And uni.rb is UTF-8 encoded w/o BOM:

puts $KCODE
puts 'hèllo'.size

I’ll get this output:
UTF-8
6

So that clearly doesn’t work as one might expect. String literals in MRI
1.8 are always binary (i.e. the accented character is stored as any other
2 bytes in the string).
AFAIK $KCODE only affects some built-in and library methods, for
example String#inspect, regular expressions, conversion libraries, etc.

Although IronRuby stores string literals in UTF-16 .NET strings, to be
fully compatible with MRI 1.8 we use a custom BinaryEncoding for these
strings. When a string is converted to an array of bytes using this
encoding, only 8 bits of each character are used (the other bits are
required to be 0). This works fine for encodings that use a single byte
per character. It’s broken for multi-byte encodings, but that’s a problem
with Ruby 1.8 in general.
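
To make the byte-per-character effect concrete, here is a small sketch in plain Ruby. It assumes Ruby 1.9 or later (where strings carry an encoding), and one_char_per_byte is a hypothetical helper written for this illustration, not IronRuby's BinaryEncoding. It treats every UTF-8 byte as its own Latin-1 character, which is roughly how a two-byte "è" ends up displayed as two separate characters:

```ruby
# Hypothetical helper: reinterpret each byte of a UTF-8 string as one
# Latin-1 character, then re-encode the result as UTF-8 for display.
def one_char_per_byte(text)
  text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

word = "h\u00E8llo"            # "hèllo", written with an escape for portability

puts word.bytesize             # 6 (what 1.8's String#size reports)
puts word.size                 # 5 (character count under 1.9+ semantics)
puts one_char_per_byte(word)   # the 6-character garbled form
```

The garbled result has six characters because the two bytes of "è" (0xC3, 0xA8) each become a character of their own.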

If you want to use Unicode you should not use 1.8 semantics. You should
use the -19 switch to run your script in 1.9 mode and add either a UTF-8 BOM
preamble or a Ruby encoding magic comment:

#encoding: UTF-8
puts 'hèllo'.size

ruby19 uni.rb
5

ir.exe -19 uni.rb
5

In a hosted app you can set 1.9 compat mode when creating the
ScriptEngine/Runtime:

var ruby = IronRuby.Ruby.CreateEngine((setup) => {
    setup.Options["Compatibility"] = RubyCompatibility.Ruby19;
});

Tomas

thbar · March 3, 2009, 9:25pm

Hi Tomas,

thanks for your two messages and the in-depth explanation. Working
with -19 and #encoding: UTF-8 indeed solves the issue (tested on
Mono).

> [...] We will also attach the KCODE encoding to the MutableString at
> creation time. This doesn’t affect Ruby 1.8 functionality, it only affects
> conversions to CLR string. So if you use KCODE = “U” the CLR strings should
> be correctly encoded (they are not now as you are experiencing). I’ll
> implement this feature as soon as possible.

I think affecting strings only when conversion occurs to CLR is a
pretty neat idea.

I like that a lot more than having to add #encoding and -19 (also
because I’m not sure what the impact would be to use -19 just for
that).

Because I was curious, I had a look at Rails (2.2.2) output for some
of these operations:

Loading development environment (Rails 2.2.2)
>> "hèllo".size
=> 6
>> "hèllo".chars
=> #<ActiveSupport::Multibyte::Chars:0x2378348 @wrapped_string="hèllo">
>> "hèllo".chars.size
=> 5
>> '€2.99'[0,1]
=> "\342"
>> '€2.99'.first
=> "€"

So raw access through [] indexing is purely byte-based, while .first
takes multi-byte characters into account.
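
For comparison, the same byte-versus-character distinction can be sketched in plain Ruby without Rails. This assumes Ruby 1.9.3 or later, where String#byteslice exists and [] indexes by character; first_byte and first_char are hypothetical names used only for this example:

```ruby
PRICE = "\u20AC2.99"   # the euro sign followed by "2.99"

# Byte-oriented access, comparable to what [0,1] returns on Ruby 1.8.
def first_byte(text)
  text.byteslice(0, 1)
end

# Character-oriented access, comparable to ActiveSupport's .first.
def first_char(text)
  text[0, 1]
end

p first_byte(PRICE).bytes   # [226], i.e. octal 342
p first_char(PRICE).bytes   # [226, 130, 172], the whole euro sign
p PRICE.bytesize            # 7
p PRICE.size                # 5
```

The single byte 226 is the "\342" shown in the console session above: only the first third of the euro sign.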

I think the spirit of what you suggest is somewhat close to that.

I like it, and will test it once you have it implemented.

cheers and thanks for your idea,

– Thibaut

thbar · March 3, 2009, 9:15pm

Actually the 1.8 parser is somewhat influenced by the current $KCODE.
Multi-byte characters could be part of identifiers and also the decision
of where a string literal ends needs to deal with multi-byte characters.
However, the resulting literals are just plain byte arrays with no
knowledge of encoding so String#size method is still broken.

To achieve a better .NET interop in IronRuby, we will honor KCODE when
creating MutableStrings. The representation of the string will be byte[]
if it contains any non-ascii characters and KCODE is set to a non-ascii
encoding. We will also attach the KCODE encoding to the MutableString at
creation time. This doesn’t affect Ruby 1.8 functionality, it only
affects conversions to CLR string. So if you use KCODE = “U” the CLR
strings should be correctly encoded (they are not now as you are
experiencing). I’ll implement this feature as soon as possible.
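
As a rough illustration of the idea (not IronRuby's actual implementation), the sketch below keeps the string as raw bytes and only picks an encoding at conversion time, based on a KCODE-style letter. It assumes Ruby 1.9 or later; KCODE_TO_ENCODING and decode_with_kcode are hypothetical names, a UTF-8 Ruby string stands in for the CLR string, and Latin-1 stands in for the one-character-per-byte fallback:

```ruby
# Hypothetical mapping from the first letter of a 1.8-style KCODE value
# to an encoding used when the bytes are turned into text.
KCODE_TO_ENCODING = {
  "U" => Encoding::UTF_8,
  "E" => Encoding::EUC_JP,
  "S" => Encoding::Shift_JIS,
  "N" => Encoding::ISO_8859_1   # stand-in for "one character per byte"
}.freeze

# Decode raw bytes according to the KCODE in effect; the byte array itself
# is never modified, only its interpretation changes.
def decode_with_kcode(bytes, kcode)
  encoding = KCODE_TO_ENCODING.fetch(kcode[0].upcase)
  bytes.pack("C*").force_encoding(encoding).encode(Encoding::UTF_8)
end

raw = [0x68, 0xC3, 0xA8, 0x6C, 0x6C, 0x6F]   # UTF-8 bytes of "hèllo"

puts decode_with_kcode(raw, "U").size   # 5, the accent survives
puts decode_with_kcode(raw, "N").size   # 6, the accent splits in two
```

With "U" the two bytes of the accented letter decode to one character; with "N" they come out as two, which matches the garbled text reported at the start of this thread.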

Tomas
