Forum: Ruby-core Character encodings - a radical suggestion

Posted by Michael Selig (Guest)
on 2008-09-17 03:28
(Received via mailing list)
Hi,

You might at first glance think that this post should go to ruby-dev,
but please read to the end!

I have been pulling my hair out trying to convert a relatively simple
app to support m17n under Ruby 1.9 to see what is involved. I need to
support all common locales worldwide, and data can also be stored in
UTF-8 or UTF-16. I was hoping that Ruby 1.9 was going to take the hard
work out of this for me. It has to a certain extent, but UTF-16 is the
problem - it breaks so many things, due to its "ASCII incompatibility"
(using Ruby's definition). I can't even do simple things like pull out
fields and substitute into another string without testing "encoding
compatibility". Something as simple as:

  puts "The value is #{val}"

fails if val is UTF-16 data.
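
To reproduce the failure, here is a minimal sketch, assuming a current Ruby with UTF-8 source encoding. The sample value and variable names are illustrative and not taken from the app described above.

```ruby
# A sample value held in UTF-16LE, which Ruby treats as ASCII-incompatible.
val = "caf\u00E9".encode("UTF-16LE")

begin
  message = "The value is #{val}"
rescue Encoding::CompatibilityError => e
  error = e   # raised because the literal and val cannot be combined
end

# Transcoding the value to UTF-8 first makes the interpolation succeed.
fixed = "The value is #{val.encode("UTF-8")}"
```

Note that an empty UTF-16 string does not trigger the error, so the problem may only appear once real data arrives.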

At one stage I got so frustrated that I was even thinking about going
back to Python :-(
So I have ended up transcoding any UTF-16 data to UTF-8, and now things
are going much better.

Maybe I am doing something wrong - if so please suggest something I can
do other than transcode the UTF-16.

But this has led me to look back at the issues with UTF-16 I have hit,
and to think about all the internal code in Ruby to handle "ASCII
incompatible" encodings, and the overhead involved with supporting it.

And I think that other Ruby programmers may end up doing what I have
done - avoid using UTF-16 internally because it is too hard.

So my radical suggestion is this:

Remove internal support for non-ASCII encodings completely, and when
reading/writing UTF-16 (and UTF-32) files automatically transcode
to/from UTF-8.

My reasons:

- String & Regexp operations should just "work" without the programmer
worrying about encoding compatibility (I think!)
- The programmer only has to think about character encodings at the
"interfaces" (files, network interfaces) not throughout the program
logic
- To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible"
encodings as Ruby defines it
- To my knowledge no one actually uses UTF-16 or UTF-32 as a locale
- I would avoid having to use ugly modes to open a file like
"r:UTF-16LE:UTF-8" (very minor)
- Ruby's internal code would be simpler & cleaner and therefore probably
faster and easier to maintain

Maybe I have got this all wrong - I am relatively new to m17n!

Cheers
Mike
Posted by James Gray (bbazzarrakk)
on 2008-09-17 04:45
(Received via mailing list)
On Sep 16, 2008, at 8:20 PM, Michael Selig wrote:

>
>   puts "The value is #{val}"
>
> fails if val is UTF-16 data.

I'm not sure I support the pull-them out strategy, but I can confirm
that supporting UTF-16 in CSV has eaten about a week of my time and
counting.  I keep thinking I have it and finding new problem…

James Edward Gray II
Posted by Michael Selig (Guest)
on 2008-09-17 04:54
(Received via mailing list)
Hi,

In my previous mail, I think I made a mistake. I am too used to working
in a UTF-8 locale, and I forgot about the situations where your locale
is not UTF-8 or ASCII. Sorry!!

So unfortunately in a general sense you can never simply ignore encoding
compatibility.
Therefore I can either hope that a user's UTF-8 & UTF-16 data is
compatible with their locale, or I have to transcode everything to UTF-8.
What else can I do?

However, I think support for UTF-16 & UTF-32 internally is not
particularly useful and that support for them may not really be
justifiable.

Mike
Posted by James Gray (bbazzarrakk)
on 2008-09-17 04:59
(Received via mailing list)
On Sep 16, 2008, at 8:20 PM, Michael Selig wrote:

> …but UTF-16 is the problem - it breaks so many things, due to its  
> "ASCII incompatibility" (using Ruby's definition). I can't even do  
> simple things like pull out fields and substitute into another  
> string without testing "encoding compatibility". Something as simple  
> as:
>
>   puts "The value is #{val}"
>
> fails if val is UTF-16 data.

How ironic…  I ran into this issue about five minutes ago.  It's
killing the CSV implementation I thought I finally had right.  :(

How would you save this?  Instead of:

   %Q{"#{val}"}  # boom for UTF-16!

Can we do:

   ['"', val, '"'].map { |s| s.encode("UTF-16BE") }.join

?  Yeah, that seems to work.  It sucks, but it works.
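
The workaround can be wrapped in a small helper. This is a minimal sketch: `quote_field` is a hypothetical name, and using `val.encoding` instead of a fixed "UTF-16BE" is an assumption made here so that the same code handles other encodings too.

```ruby
# Hypothetical helper: wrap a field in double quotes while staying in the
# field's own encoding, so no cross-encoding concatenation takes place.
def quote_field(val)
  ['"', val, '"'].map { |s| s.encode(val.encoding) }.join
end

val    = "caf\u00E9".encode("UTF-16BE")
quoted = quote_field(val)
```

The quote characters are transcoded on every call, which is the cost that makes this approach feel clumsy.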

James Edward Gray II
Posted by Michael Selig (Guest)
on 2008-09-17 06:31
(Received via mailing list)
On Wed, 17 Sep 2008 12:51:14 +1000, James Gray
<james@grayproductions.net> wrote:

> Can we do:
>
>    ['"', val, '"'].map { |s| s.encode("UTF-16BE") }.join
>
> ?  Yeah, that seems to work.  It sucks, but it works.

Yep, it sure sucks!

I have been doing some more thinking about these ongoing issues....

<soapbox>

Using Ruby SHOULD be making our lives easier, not harder. Other
languages like Python have taken an easier route to m17n - represent all
strings internally as unicode codepoints. Then there should never be a
need to check encoding compatibility, right? I am not saying that this
is a perfect solution either, by the way. But having to work around this
"Encoding Compatibility Error" all the time is just a pain for apps
which need to work in different countries with different locales.
Unfortunately it is leading me towards the path of having to transcode
everything to UTF-8, even though in 99% of cases all the data IS going
to be compatible and be in the user's locale. I don't want so much of my
time taken up, and be forced to write ugly code to take care of the
remaining 1%. Maybe the problem is that Ruby is being too generous
supporting all these different encodings internally! That was one reason
why I raised the idea of removing UTF-16 & 32 support - at least that
way I know that the ASCII strings from my program can work with any user
data. But then the further problem: What if you need to work with (or at
least take into account the possibility of) 2 or more non-ascii (but
ascii compatible) encodings (eg: the user's locale & UTF-8)?

What may solve this issue is if Ruby itself would automatically encode
incompatible strings in a compatible encoding (UTF-8 I guess). The only
time you should then get "Encoding Compatibility Errors" is when writing
data to a file or network stream in a certain encoding and a character
cannot be represented. That's it.

Just a thought...

</soapbox>

Mike
Posted by Tanaka Akira (Guest)
on 2008-09-17 10:25
(Received via mailing list)
In article <op.uhlqb7b29245dp@kool>,
  "Michael Selig" <michael.selig@fs.com.au> writes:

> - To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible" as  
> Ruby defines it

ISO-2022-JP is another example.

> - To my knowledge no one actually uses UTF-16 or UTF-32 as a locale

They are not usable as locale encoding.

http://www.opengroup.org/onlinepubs/007908799/xbd/...
| * The encoded values associated with the members of the portable character
|   set are each represented in a single byte.
Posted by Robert Klemme (Guest)
on 2008-09-17 10:28
(Received via mailing list)
Disclaimer: I haven't used 1.9 encoding stuff so far. Nevertheless my
0.02EUR:

2008/9/17 Michael Selig <michael.selig@fs.com.au>:
> <soapbox>
>
> Using Ruby SHOULD be making our lives easier, not harder. Other languages
> like Python have taken an easier route to m17n - represent all strings
> internally as unicode codepoints.

Which is also what Java does.  I have always found Java's approach to
encodings very clean and workable.  But if I remember correctly Matz
once said that Unicode does not cover all Asian symbols so it might
not be a very good choice for internal representation.

> That was one reason why I raised the idea of removing UTF-16 & 32 support -
>
> Just a thought...
>
> </soapbox>

I believe that one reason for the difficulties we encounter now is the
fact that String is historically used for binary and text data.  So
there is no clear separation between the two and this bears potential
for confusion and bugs.

A clean solution would probably involve having a character type which
is capable of representing *all* possible symbols and model String as
sequence of those characters.  Encoding would then be done during
input and output only.  Questions I see

1. Is this feasible, i.e. is there something similar to Unicode
without its limitations?

2. Is it fast enough for the general case?

3. What happens to binary Strings? or more generally:

4. What happens to old (pre 1.9) code?

i18n is a nasty beast...

Kind regards

robert
Posted by Michal Suchanek (Guest)
on 2008-09-17 12:30
(Received via mailing list)
On 17/09/2008, Robert Klemme <shortcutter@googlemail.com> wrote:

>  1. Is this feasible, i.e. is there something similar to Unicode
>  without its limitations?

What is "similar to Unicode without its limitations"? You mean it
contains every character anybody would ever want to write on a
computer in a single encoding? So there would have to be a central
committee to which everybody submits any character they think of so
that it gets a codepoint assigned? And that committee assigns
codepoints to those submissions unquestioningly so that no
special-purpose encodings are ever needed?

And you even try to ask if this is feasible?

Well, my answer is that it is not feasible.

Thanks

Michal
Posted by James Gray (bbazzarrakk)
on 2008-09-17 14:59
(Received via mailing list)
On Sep 16, 2008, at 11:21 PM, Michael Selig wrote:

> in the user's locale.
I believe Matz has said in the past that transcoding is what they are
trying to avoid in general.  You can lose data that way and thus the
core team doesn't favor it.  (I hope I got that right.  It's from
memory, so don't blame me for putting words in Matz's mouth.)

Besides, I'm not sure if it's the characters I have tried or just that
Ruby's transcoding still needs work, but I've tried converting some
Shift_JIS to UTF-8 that it just couldn't handle.  We would have to
have a better conversion rate to support a strategy like this.

James Edward Gray II
Posted by NARUSE, Yui (Guest)
on 2008-09-17 15:48
(Received via mailing list)
Hi,

James Gray wrote:
>> transcode everything to UTF-8, even though in 99% of cases all the 
> a better conversion rate to support a strategy like this.
We can convert "all Shift_JIS characters" to Unicode now.
But the current problem is that there are several mappings for
conversion between Shift_JIS and Unicode.
Once you convert data from Shift_JIS to Unicode, the true meaning of
some characters may be lost forever. (e.g. YEN SIGN Problem)

If we develop "a better" conversion, this problem will be more complex.
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-17 16:36
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18640] Character encodings - a radical suggestion"
    on Wed, 17 Sep 2008 10:20:13 +0900, "Michael Selig" <michael.selig@fs.com.au> writes:

|So my radical suggestion is this:
|
|Remove internal support for non-ASCII encodings completely, and when  
|reading/writing UTF-16 (and UTF-32) files automatically transcode to/from  
|UTF-8.

What happens with non Unicode text under your suggestion?

My conservative suggestion is that:

Put "r:UTF-16BE:UTF-8" for mode when you open a UTF-16 file to read,
so that your internal strings are all UTF-8 encoding.
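
A minimal sketch of this suggestion, assuming a UTF-16BE file without a BOM. The temp file and its contents are sample data created only for the illustration.

```ruby
require "tempfile"

# Create a small UTF-16BE file to read back (sample data, not from the thread).
file = Tempfile.new("utf16-sample")
file.close
File.binwrite(file.path, "caf\u00E9\n".encode("UTF-16BE"))

# External encoding UTF-16BE, internal encoding UTF-8: Ruby transcodes on read.
line = File.open(file.path, "r:UTF-16BE:UTF-8") { |f| f.gets }

# The string is now ordinary UTF-8, so interpolation works as usual.
message = "The value is #{line.chomp}"
file.unlink
```

With this mode the encoding decision is made once, where the file is opened, and the rest of the program only sees UTF-8.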

|My reasons:
|
|- String & Regexp operations should just "work" without the programmer  
|worrying about encoding compatibility (I think!)
|- The programmer only has to think about character encodings at the  
|"interfaces" (files, network interfaces) not throughout the program logic

My "suggestion" satisfies the above two.

|- To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible" as  
|Ruby defines it

As akr stated this is wrong.

|- To my knowledge no one actually uses UTF-16 or UTF-32 as a locale

Yes.

|- I would avoid having to use ugly modes to open a file like  
|"r:UTF-16LE:UTF-8" (very minor)

This is ugly indeed.  We might add more Unicode support in the
future.  But we are in no hurry.

|- Ruby's internal code would be simpler & cleaner and therefore probably  
|faster and easier to maintain

Dropping UTF-{16,32} is not enough.  Unless we abandon non-Unicode
encoding support altogether, it won't be THAT simple.  And I am not
going to remove their support.  I use them every day.

              matz.
Posted by Matthias Wächter (Guest)
on 2008-09-17 16:48
(Received via mailing list)
On 9/17/2008 3:39 PM, NARUSE, Yui wrote:
> We can convert "all Shift_JIS characters" to Unicode now.
> But the current problem is that there are several mappings for
> conversion between Shift_JIS and Unicode.
> Once you convert data from Shift_JIS to Unicode, the true meaning of
> some characters may be lost forever. (e.g. YEN SIGN Problem)
> 
> If we develop "a better" conversion, this problem will be more complex.

Is there a complete characterization of this whole problem? It seems
to be the main reason for sticking to non-UTF-8 character sets in
Ruby these days, and concluding from what I have read about it, a
solution could be the addition of missing characters/codepoints to
Unicode. Why does no-one consider going that way, but instead builds
a complicated stack of functions for conversions on top level?

To some extent, it looks like 'some' people like insisting on the
status quo as it makes them feel special, swimming upstream the
Unicode waterfall, retaining on regional locales instead of solving
the issue. I do explicitly not refer to Ruby or the developers, they
just accept these special needs more than other computer language
designers with less sympathy for this anomaly.

Nevertheless, a persisting fix is needed, and I think writing more
and more crutches for encoding conversion goes the wrong way. This
might still be needed for legacy file support, but day-to-day work
should not have to deal with this issue so prominently.

cheers,
- Matthias
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-17 17:09
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18663] Re: Character encodings - a radical suggestion"
    on Wed, 17 Sep 2008 23:09:32 +0900, Matthias Wächter <matthias@waechter.wiz.at> writes:

|Is there a complete characterization of this whole problem? It seems
|to be the main reason for sticking to non-UTF-8 character sets in
|Ruby these days, and concluding from what I have read about it, a
|solution could be the addition of missing characters/codepoints to
|Unicode. Why does no-one consider going that way, but instead builds
|a complicated stack of functions for conversions on top level?

Just because it's impossible.  History sucks.  We have mixed up YEN
SIGN and REVERSE SOLIDUS for a long time.  They cannot be distinguished
without context information.  Technically 0x5c should mean REVERSE
SOLIDUS, but not always so for humans.
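
The ambiguity can be seen in a minimal sketch, assuming Ruby's bundled Shift_JIS transcoder, which to my knowledge passes the 0x00-0x7F range through unchanged. The variable names are illustrative.

```ruby
# Byte 0x5C in a Shift_JIS string. Many Japanese fonts draw it as a yen
# sign, but the converter cannot know which meaning the author intended.
sjis_byte = [0x5C].pack("C*").force_encoding("Shift_JIS")
as_utf8   = sjis_byte.encode("UTF-8")

yen_sign        = "\u00A5"   # YEN SIGN
reverse_solidus = "\u005C"   # REVERSE SOLIDUS (backslash)

# The transcoder picks REVERSE SOLIDUS, so text that meant "yen" silently
# changes its meaning once it is in Unicode.
is_backslash = (as_utf8 == reverse_solidus)
```

Either mapping would be wrong for some documents, which is why a better conversion table cannot settle the question on its own.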

Besides that, Unicode is not a panacea.  Some character set
(e.g. GB18030 for Chinese characters) is even bigger than Unicode.
In fact, GB18030 is a super set of Unicode.

|To some extent, it looks like 'some' people like insisting on the
|status quo as it makes them feel special, swimming upstream the
|Unicode waterfall, retaining on regional locales instead of solving
|the issue. I do explicitly not refer to Ruby or the developers, they
|just accept these special needs more than other computer language
|designers with less sympathy for this anomaly.
|
|Nevertheless, a persisting fix is needed, and I think writing more
|and more crutches for encoding conversion goes the wrong way. This
|might still be needed for legacy file support, but day-to-day work
|should not have to deal with this issue so prominently.

You are free to feel so, but it's us who take up the burden.  However,
we are open to complaints about usability, e.g. about
"r:UTF-16LE:UTF-8".

              matz.
Posted by NARUSE, Yui (Guest)
on 2008-09-17 18:54
(Received via mailing list)
Hi,

Yukihiro Matsumoto wrote:
> Besides that, Unicode is not a panacea.  Some character set
> (e.g. GB18030 for Chinese characters) is even bigger than Unicode.
> In fact, GB18030 is a super set of Unicode.

Emacs-Mule is another encoding which is bigger than Unicode and which
Ruby supports.

And pictographs which are used by Japanese Mobile Phones are also not in
Unicode.
Posted by Michal Suchanek (Guest)
on 2008-09-17 19:02
(Received via mailing list)
On 17/09/2008, James Gray <james@grayproductions.net> wrote:
> substitute into another string without testing "encoding compatibility".
> Something as simple as:
> >
> >        puts "The value is #{val}"
> >
> > fails if val is UTF-16 data.
> >
>
>  I'm not sure I support the pull-them out strategy, but I can confirm that
> supporting UTF-16 in CSV has eaten about a week of my time and counting.  I
> keep thinking I have it and finding new problem…

For your own program you could override String.+ to automagically
convert its parameters. I thought this is good enough but you cannot
do that for libraries - ruby does not provide any way of bolting on
such feature and hiding it from users of the library so that they get
the standard behaviour.

Still there are multiple ways of combining strings, and these could be
used to distinguish different encoding handling.

So my suggestion is to make
 - String.+ do the conversion if possible (it creates a new string so
it can be different)
 - String.<< to only append compatible strings
 - I am not sure about string interpolation - it technically creates a
new string each time so it could just convert but this could get
complex if many strings are included in the interpolation.

Note that even with automatic conversion you get cases when strings
cannot be converted to some superset so somebody could break your
application that seems to work OK by supplying input in an exotic
encoding.

There are other string functions, though. It is unclear what
Object.inspect should do. It is generally used to show stuff to the
user. But should it convert the string to the user locale, show it in
hex with locale information appended, or what?

IO could be configurable to either do the necessary conversion or
not. Like

STDOUT.autoconvert=true

then you could write any strings to stdout without problems (as long
as the stdout encoding is known and can handle all your strings).
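
The `autoconvert` setter is a proposal and not an existing method. For comparison, here is a minimal sketch of what Ruby's IO already does, as far as I know: a stream opened with an external encoding transcodes each string written to it. The temp file and sample strings are assumptions for illustration, and the sketch assumes a platform without newline translation on write.

```ruby
require "tempfile"

file = Tempfile.new("utf16-out")
file.close

utf8_text   = "caf\u00E9\n"                        # UTF-8
latin1_text = "na\u00EFve\n".encode("ISO-8859-1")  # ISO-8859-1

# "w:UTF-16LE" sets the external encoding; each write is transcoded to it.
File.open(file.path, "w:UTF-16LE") do |out|
  out.write(utf8_text)
  out.write(latin1_text)
end

# Compare the raw bytes on disk with a manual conversion of the same text.
written  = File.binread(file.path)
expected = (utf8_text + latin1_text.encode("UTF-8")).encode("UTF-16LE").b
file.unlink
```

Strings in two different source encodings end up in the file's encoding without any explicit conversion in the program logic.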

Also Array.join could perhaps accept some parameter that either
specifies the desired encoding of the result or specifies that the
strings should be converted so that they can actually be concatenated.

Generally I can imagine the automatic conversion working like this
(either as part of core or as an addon):

1) each encoding has a list of compatible supersets

2) each encoding has a list of (incompatible) equivalents [optional] -
typical for legacy 8bit encodings which have several variants with the
characters reordered in different ways

3) each encoding has a list of incompatible (without conversion) supersets

Then string operations could be performed this way:

1) an operation on two strings where one is compatible superset of the
other is done without conversion, and the result has encoding of the
superset. This is basically the extension of the ASCII-compatible
concept to other encodings that could have this feature.

If conversion is not allowed and 1) is not applicable (note that each
encoding is a compatible superset of itself) an exception is raised.

If conversion is allowed the autoconversion could follow:

2) if the strings are encoded in incompatible but equivalent encodings
convert one to the encoding of the other based on some order of
preference.

3) if there is the same incompatible superset for both strings (or
superset of superset ..) convert both strings to this superset. If
multiple supersets are available consult order of preference.

If neither 2) nor 3) are applicable raise an exception.
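
The rules above can be sketched roughly as follows. Everything here is hypothetical: Ruby has no such registry of supersets, the two tables hold a few sample entries only, `combine` and `compatible_superset?` are made-up names, and rule 2 (equivalent encodings) is left out.

```ruby
# Hypothetical relation tables; a real registry would need carefully
# researched entries for every encoding.
COMPATIBLE_SUPERSETS = {
  Encoding::US_ASCII => [Encoding::UTF_8, Encoding::ISO_8859_1],
}.freeze

CONVERTIBLE_SUPERSETS = {
  Encoding::ISO_8859_1 => [Encoding::UTF_8],
  Encoding::UTF_16LE   => [Encoding::UTF_8],
}.freeze

# Every encoding counts as a compatible superset of itself.
def compatible_superset?(sup, sub)
  sup == sub || COMPATIBLE_SUPERSETS.fetch(sub, []).include?(sup)
end

def combine(a, b, allow_conversion: false)
  ea, eb = a.encoding, b.encoding

  # Rule 1: join the raw bytes and label the result with the superset.
  [[ea, eb], [eb, ea]].each do |sup, sub|
    return (a.b + b.b).force_encoding(sup) if compatible_superset?(sup, sub)
  end

  unless allow_conversion
    raise Encoding::CompatibilityError, "no compatible superset for #{ea} and #{eb}"
  end

  # Rule 3: convert both operands to the first shared superset, using
  # table order as the "order of preference".
  candidates_a = [ea] + CONVERTIBLE_SUPERSETS.fetch(ea, [])
  candidates_b = [eb] + CONVERTIBLE_SUPERSETS.fetch(eb, [])
  target = (candidates_a & candidates_b).first
  raise Encoding::CompatibilityError, "no common superset for #{ea} and #{eb}" unless target

  a.encode(target) + b.encode(target)
end
```

For example, `combine("abc".encode("US-ASCII"), "\u00E9")` falls under rule 1, while an ISO-8859-1 string plus a UTF-16LE string needs `allow_conversion: true` and ends up in UTF-8. The hard part, as the follow-ups in this thread point out, is deciding what belongs in the tables.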

I am not sure that 2) would ever apply. Some iso encodings should be
generally equivalent to some dos or windows codepages but there might
be one or two different characters that make the encodings
non-equivalent. Perhaps the strings could be checked for these
characters but then just converting to a superset might be easier.

Thanks

Michal
Posted by Tim Bray (Guest)
on 2008-09-17 19:07
(Received via mailing list)
On Sep 17, 2008, at 9:45 AM, NARUSE, Yui wrote:

> And pictographs which are used by Japanese Mobile Phones are also not
> in Unicode.

They have Private Use Area codepoints for emoji, e.g. 
http://www.au.kddi.com/ezfactory/tec/spec/img/typeD.pdf

Seems reasonable. -Tim
Posted by Michal Suchanek (Guest)
on 2008-09-17 19:51
(Received via mailing list)
On 17/09/2008, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>  |Unicode. Why does no-one consider going that way, but instead builds
>  In fact, GB18030 is a super set of Unicode.
>

I wonder how people who suggest Unicode as the single internal
encoding would react if GB18030 was suggested instead ;-)

Thanks

Michal
Posted by NARUSE, Yui (Guest)
on 2008-09-17 19:51
(Received via mailing list)
Tim Bray wrote:
> On Sep 17, 2008, at 9:45 AM, NARUSE, Yui wrote:
> 
>> And pictographs which are used by Japanese Mobile Phones are also not
>> in Unicode.
> 
> They have Private Use Area codepoints for emoji, e.g. 
> http://www.au.kddi.com/ezfactory/tec/spec/img/typeD.pdf
> 
> Seems reasonable. -Tim

Yes, they have private use area codepoints, but I think they are not
reasonable.

The first reason is that the area is Private Use Area.

Moreover there are several mobile phone carriers in Japan and each
defines its own emoji.
And their PUA codepoints conflict.
http://creation.mb.softbank.jp/web/web_pic_about.html

So they can't be 'Uni'code yet.
Posted by NARUSE, Yui (Guest)
on 2008-09-17 20:27
(Received via mailing list)
Hi,

Michal Suchanek wrote:
>  - String.+ do the conversion if possible (it creates a new string so
> it can be different)

The problem is not "can convert" or "cannot convert".
Different mappings and information lost in conversion is the true problem.
So they can't be avoided and we can't use automatic conversion.

> Note that even with automatic conversion you get cases when strings
> cannot be converted to some superset so somebody could break your
> application that seems to work OK by supplying input in an exotic
> encoding.

Definition of superset is a difficult problem.

> There are other string functions, though. It is unclear what
> Object.inspect should do. It is generally used to show stuff to the
> user. But should it convert the string to the user locale, show it in
> hex with locale information appended, or what?

In previous conversation, Object#inspect should be dependent from locale.

> STDOUT.autoconvert=true

this seems non-thread-safe.

> Generally I can imagine the automatic conversion working like this
> (either as part of core or as an addon):
> 
> 1) each encoding has a list of compatible supersets

Define "compatible" is this problem.
And what is "incompatible"?

> 2) each encoding has a list of (incompatible) equivalents [optional] -
> typical for legacy 8bit encodings which have several variants with the
> characters reordered in different ways

Such extension sometimes breaks compatibility.
For example, U+9AD8 is assigned in 0x3962 in Shift_JIS.
http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...

this character has a variation, which has a codepoint U+9AD9 in Unicode.
http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...
But this is unified in Shift_JIS(JIS X 0208).

So 0x3962 includes U+9AD8 and U+9AD9 but once it is converted to
Unicode, it can only be U+9AD8.

Moreover Windows Code Page 932 includes U+9AD9...

So this is not easy.
# ISO-8859-X may be easy
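
A minimal sketch of this case, assuming Ruby's bundled Shift_JIS and Windows-31J converters behave as described here. The byte pair 0x8D 0x82 is the Shift_JIS byte form of JIS X 0208 code 0x3962, and the variable names are illustrative.

```ruby
# JIS X 0208 code 0x3962 in its Shift_JIS byte form.
sjis_char = [0x8D, 0x82].pack("C*").force_encoding("Shift_JIS")
unified   = sjis_char.encode("UTF-8")   # the converter has to pick one codepoint

variant = "\u9AD9"   # the variant form with its own Unicode codepoint

# Plain Shift_JIS has no separate code for the variant...
begin
  variant.encode("Shift_JIS")
  shift_jis_has_variant = true
rescue Encoding::UndefinedConversionError
  shift_jis_has_variant = false
end

# ...while Windows-31J (code page 932) does, so it survives a round trip.
cp932_round_trip = variant.encode("Windows-31J").encode("UTF-8")
```

Two encodings that look like minor variants of each other therefore disagree on which Unicode text they can represent.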

> 3) each encoding has a list of incompatible (without conversion) supersets
> 
> Then string operations could be performed this way:
> 
> 1) an operation on two strings where one is compatible superset of the
> other is done without conversion, and the result has encoding of the
> superset. This is basically the extension of the ASCII-compatible
> concept to other encodings that could have this feature.

The problem is we don't know what encodings are compatible as ASCII.

> If conversion is allowed the autoconversion could follow:

implement "switch to allow the autoconversion" seems difficult... anyway

> 2) if the strings are encoded in incompatible but equivalent encodings
> convert one to the encoding of the other based on some order of
> preference.

This means,
when charset C includes A and B, string in A + string in B => string in C ?
When that conversion doesn't lose any information, this is reasonable.

> 3) if there is the same incompatible superset for both strings (or
> superset of superset ..) convert both strings to this superset. If
> multiple supersets are available consult order of preference.

What does "incompatible superset" mean?

> I am not sure that 2) would ever apply. Some iso encodings should be
> generally equivalent to some dos or windows codepages but there might
> be one or two different characters that make the encodings
> non-equivalent. Perhaps the strings could be checked for these
> characters but then just converting to a superset might be easier.

Theoretically it is yes.
But practical encodings seem dirty.
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-17 20:34
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18668] Re: Character encodings - a radical suggestion"
    on Thu, 18 Sep 2008 00:55:40 +0900, "Michal Suchanek" <hramrach@centrum.cz> writes:

|I wonder how people who suggest Unicode as the single internal
|encoding would react if GB18030 was suggested instead ;-)

Yeah!  We should do that!

...no, please.  Its encoding scheme is horrific.  You have to read two
bytes to tell how many bytes the code point occupies (up to 4 bytes).

              matz.
Posted by Tim Bray (Guest)
on 2008-09-17 21:14
(Received via mailing list)
On Sep 17, 2008, at 8:01 AM, Yukihiro Matsumoto wrote:

> Besides that, Unicode is not a panacea.  Some character set
> (e.g. GB18030 for Chinese characters) is even bigger than Unicode.
> In fact, GB18030 is a super set of Unicode.

Also, Mojikyo is much larger than Unicode.

So, there are two reasonable goals:
1. Provide useful features for strings and characters in the general
case, including non-Unicode data
2. Provide good support for Unicode

The problem is that the people who care about #1 don't care about
#2 very much, and vice versa.  The people who spend all their time
doing Web/Internet stuff mostly only care about #2.  It seems that
Unicode is important so Ruby should have some special low-level
language facilities, especially for efficient access to codepoints.
If we can do that without harming #1, then everyone should be happy.

  -T
Posted by Michal Suchanek (Guest)
on 2008-09-17 21:53
(Received via mailing list)
On 17/09/2008, NARUSE, Yui <naruse@airemix.jp> wrote:
> > Still there are multiple ways of combining strings, and these could be
> > used to distinguish different encoding handling.
> >
> > So my suggestion is to make
> >  - String.+ do the conversion if possible (it creates a new string so
> > it can be different)
> >
>
>  The problem is not "can convert" or "cannot convert".
>  Different mappings and information lost in conversion is the true problem.
>  So they can't be avoided and we can't use automatic conversion.

Yes, even for the "common www encodings" one cannot convert some
Japanese encodings with the Yen vs backslash confusion safely. And
there are other problems I am sure.

>
..
> > STDOUT.autoconvert=true
> >
>
>  this seems non-thread-safe.

Using a single IO in multiple threads is non-safe so this API does not
introduce any new problem. It is also similar to the other IO
properties that can be already set in non-thread-safe way.

>
>
> > Generally I can imagine the automatic conversion working like this
> > (either as part of core or as an addon):
> >
> > 1) each encoding has a list of compatible supersets
> >
>
>  Define "compatible" is this problem.
>  And what is "incompatible"?

Compatible here means that 7bit ASCII is a compatible subset of utf-8 or
any (most?) of the iso-8859-x encodings. You can join the strings
without any conversion. Similarly BCDIC could be considered a
compatible subset of the EBCDIC codepages if these are ever
implemented.

>
>  this character has a variation, which has a codepoint U+9AD9 in Unicode.
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...
>  But this is unified in Shift_JIS(JIS X 0208).
>
>  So 0x3962 includes U+9AD8 and U+9AD9 but once it is converted to Unicode,
>  this is only be U+9AD8.
>
>  Moreover Windows Code Page 932 include U+9AD9...
>
>  So this is not easy.

If there are characters that are different in one encoding and mapped
to a single codepoint  in another encoding these are not equivalent.
Strings in those encodings could be considered equivalent as long as
they do not contain such characters but it is questionable if scanning
the string is desired. On the other hand, the conversion would process
the whole string anyway so it could be attempted for such cases, and
aborted if such character is encountered.

>
>  The problem is we don't know what encodings are compatible as ASCII.

That's always possible to know only by looking at the codepoint table,
the same as the ascii-compatible encodings were defined.

>
>
> > If conversion is allowed the autoconversion could follow:
> >
>
>  implement "switch to allow the autoconversion" seems difficult... anyway

I did not mean to implement a switch - I wanted to define converting
and non-converting operations. However, for non-string objects that
use strings such switch would be indeed needed.

>
>
> > 2) if the strings are encoded in incompatible but equivalent encodings
> > convert one to the encoding of the other based on some order of
> > preference.
> >
>
>  This means,
>  when charset C includes A and B, string in A + string in B => string in C ?
>  When that conversion doesn't lose any information, this is reasonable.

Here I wanted to distinguish two cases but they are in fact pretty
much the same:
 - conversion into an encoding that has the same number of codepoints,
just reordered
 - conversion into an encoding with larger number of codepoints

This should be probably handled by encoding preference.

When strings in iso-8859-x and the corresponding windows codepage
should be added, and the windows codepage is preferred over Unicode
encodings and iso-8859 encodings the codepage should be used. On the
other hand, if utf-8 is preferred utf-8 should be used for the result.

I am not sure how that preference would be set, though.

You could set general preference at program start but setting
preference for each operation  would make the system complicated. But
for a single operation the preference could be enforced by converting
the operands manually.

>
>
> > 3) if there is the same incompatible superset for both strings (or
> > superset of superset ..) convert both strings to this superset. If
> > multiple supersets are available consult order of preference.
> >
>
>  What does "incompatible superset" mean?

That means that the string would have to be converted to be
represented in the "superset encoding". However, the conversion should
be unambiguous.

>  But practical encodings seem dirty.
>

Thanks

Michal
Posted by Michael Selig (Guest)
on 2008-09-18 02:11
(Received via mailing list)
Hi,

Thanks for all the replies - I am not an expert on all these encodings,
and I (obviously mistakenly!) assumed that all other encodings could be
converted to Unicode.

When I first looked at Ruby 1.9's encoding support I thought "that's
neat - I think it will solve my m17n problems". However as I got into it
I soon discovered that it wasn't nearly this easy!

Here is a summary of my issues:

- Non "ASCII-compatible" data is almost impossible to work with. Just
take a look at what James Gray was proposing to do for CSV.

- When developing standard classes & mixins that could be installed in
any country, virtually all methods that handle more than 1 string are
going to have to worry about the possibility of dealing with
incompatible encodings. This is a major overhead to a programmer - it
may not be acceptable to let it raise an error.

- Other alternative languages to Ruby which represent all strings as
Unicode don't have this problem. Although they may not be a 100%
solution in Japan & China, they would certainly be fine for me to use.

- As my application is under my control, I can make the decision to
transcode everything to UTF-8 if I want to. I was hoping not to, but I
think the extra code I would have to write to test encoding
compatibility would not be worthwhile as it would be in so many places.
And yes, I could write a

- For people like James who are trying to modify a standard library like
CSV, which on the surface looks like a simple task, it is really quite
daunting.


My "ideal" would be that Ruby automatically converted to a common
encoding rather than raising an Encoding Compatibility Error. And
although Unicode apparently may not cope with every character on the
planet at present, I guess it will one day, and it seems to me to be the
sensible thing to use as the "common encoding" - or UTF-8 to be precise.

That way, in the 99% of cases where the encodings ARE compatible, Ruby
would work exactly as it does now.

But it also means that I can write methods and not have to worry about
them blowing up because of encoding incompatibility.

It *does* mean that strings may "magically" be converted to UTF-8, but I
don't see this as a big deal as long as when they are output they are
converted back to the necessary encoding (which I think Ruby does with
files now). If the "magic" conversion is a problem, maybe there should
be a switch to turn it on & off.
This auto-convert policy should also be used with non-destructive
methods like String#== etc so the programmer needn't worry whether the
same character has a different representation on each side of the "==".
The ASCII-8BIT encoding should be reserved as a "special case" and not
be subject to auto-conversion, because it is going to be mainly used for
"byte strings".
Yes, there may be a performance overhead doing this. But is this a big
deal if it only happens in 1% of cases?

Sure there are issues with this, like what to do with text that cannot
be encoded to Unicode (now that I know it exists!), and also the
implementation of these suggestions may not be easy, but I think *not*
doing something about these issues may make the dev community have a
negative impression of Ruby, which would be a great, great shame.

Cheers
Mike

Posted by Austin Ziegler (austin)
on 2008-09-18 02:50
(Received via mailing list)
On Wed, Sep 17, 2008 at 10:09 AM, Matthias Wächter
<matthias@waechter.wiz.at> wrote:
> Is there a complete characterization of this whole problem? It seems
> to be the main reason for sticking to non-UTF-8 character sets in
> Ruby these days, and concluding from what I have read about it, a
> solution could be the addition of missing characters/codepoints to
> Unicode. Why does no-one consider going that way, but instead builds
> a complicated stack of functions for conversions on top level?

While there is a private use plane, it's not generally interoperable
to use the private use plane in Unicode. Adding glyphs to Unicode is a
lengthy process that requires going through a standards body. The
Unicode standard is updated every few years, but the Unicode
consortium is much more likely to listen to the Japanese standards
bodies than Ruby programmers.

> To some extent, it looks like 'some' people like insisting on the
> status quo as it makes them feel special, swimming upstream the
> Unicode waterfall, retaining on regional locales instead of solving
> the issue. I do explicitly not refer to Ruby or the developers, they
> just accept these special needs more than other computer language
> designers with less sympathy for this anomaly.

The reality is that Unicode *doesn't* completely represent all Asian
languages well (see the discussions around Han unification for a brief
primer on the issues involved). The problem is exacerbated in the
academic arena where people want to be able to represent ancient
characters accurately, but it's not limited to that. Just because you
and I can represent our words in under one hundred characters doesn't
mean that it's appropriate to do the same with others' languages.

It's getting better, but it's still not perfect.

> Nevertheless, a persisting fix is needed, and I think writing more
> and more crutches for encoding conversion goes the wrong way. This
> might still be needed for legacy file support, but day-to-day work
> should not have to deal with this issue so prominently.

Day-to-day work *doesn't*.

Deal with all of your stuff in a single encoding (UTF-8, UTF-16,
whatever) and you don't even have to think about it.

If you *ever* deal with more than one encoding, you're going to run
into this problem in *any* language.

Sorry.

-austin, still working on a blog post about a .NET Unicode/XML bug
Posted by Urabe Shyouhei (Guest)
on 2008-09-18 03:30
(Received via mailing list)
Michael Selig wrote:
> should be a switch to turn it on & off. 
Have you read Matz's post about the yen sign problem?  Converter IS a
problem; you cannot make a converter over (Encoding A -> Unicode ->
Encoding A).  That must lose some input.  Data loss is the worst thing
to introduce, so ruby asks you to take the risk by explicitly calling a
conversion method.

Problems on character encodings are sourced from complexities of human
activities.  I can hardly believe there are any simple, perfect, and/or
"neat" solution.
Posted by Michael Selig (Guest)
on 2008-09-18 04:40
(Received via mailing list)
----- Original Message -----
From: "Urabe Shyouhei" <shyouhei@ruby-lang.org>
> Have you read Matz's post about the yen sign problem?  Converter IS a
> problem; you cannot make a converter over (Encoding A -> Unicode ->
> Encoding A).  That must lose some input.  Data loss is the worst thing
> to introduce, so ruby asks you to take the risk by explicitly calling a
> conversion method.

Yes I have read it.
Do you refuse to drive a car because you may have a crash?

I was only suggesting conversion to Unicode as a way of preventing an
error being raised.
If you need to work with encodings that are not Unicode compatible, I
was suggesting that Ruby works exactly as it does at the moment. The
conversion was only suggested when you deal with incompatible encodings,
which is not going to be common, but is something that programmers whose
software is used internationally have to deal with. Also I suggested
that there should be a way of turning it off, just in case you are
worried that the conversion might happen accidentally.

For my application (and I think for many other people's too) it is *far*
better to possibly screw up a character or two than to have to write
lots of ugly code to cope with the incompatibility. The alternative is
transcoding to UTF-8, which will mean those characters will be screwed
up anyhow.

Mike
Posted by James Gray (bbazzarrakk)
on 2008-09-18 05:51
(Received via mailing list)
On Sep 17, 2008, at 9:32 PM, Michael Selig wrote:

> I was only suggesting conversion to Unicode as a way of preventing  
> an error being raised.

Hey, I know character encodings are hard.  I'm still trying to get CSV
completely converted.  I'm getting closer all the time, but it's been
tricky for sure.

I'm sure it's a bit of Ruby's fault.  The m17n code is still a little
raw and I ran into several issues just exploring it.  All of our
efforts here are making things better though.  Look at how many bugs
were fixed in the last week just due to emails from you and me.

The other thing that's very important to remember is that character
encodings are just plain hard to get right.  I think it's a pretty big
testament that Ruby makes it possible for us to support all these
encodings now.

I'm definitely in the self-centered-universe camp that thought Unicode
was best for most things.  I know I would still recommend it in many
cases, because it's pretty easy to implement and it does work in many
cases.

However, our Japanese friends are trying to tell us it's not a
universal solution.  It doesn't always work well for them in
particular, so they would prefer we make something better.  I for one
am grateful for them teaching me this new lesson, hard or not.

And if you prefer to do the UTF-8 everywhere strategy, you can,
right?  Transcode everything to UTF-8 when it comes in and then you
can pretend it's all UTF-8 (because it is!), right?  Don't we have the
best of both worlds now?

James Edward Gray II
Posted by Tim Bray (Guest)
on 2008-09-18 05:58
(Received via mailing list)
On Sep 17, 2008, at 8:43 PM, James Gray wrote:

> And if you prefer to do the UTF-8 everywhere strategy, you can,  
> right?  Transcode everything to UTF-8 when it comes in and then you  
> can pretend it's all UTF-8 (because it is!), right?  Don't we have  
> the best of both worlds now?

Well, yes, as long as Ruby will let me get at the codepoints
efficiently.  Oh, and in an ideal world, use Unicode properties like I
can in Perl (\p{Lu} for example).  I think that would make all us
Unicode whiners shut up.  -T
Posted by James Gray (bbazzarrakk)
on 2008-09-18 06:03
(Received via mailing list)
On Sep 17, 2008, at 10:49 PM, Tim Bray wrote:

> On Sep 17, 2008, at 8:43 PM, James Gray wrote:
>
>> And if you prefer to do the UTF-8 everywhere strategy, you can,  
>> right?  Transcode everything to UTF-8 when it comes in and then you  
>> can pretend it's all UTF-8 (because it is!), right?  Don't we have  
>> the best of both worlds now?
>
> Well, yes, as long as Ruby will let me get at the codepoints  
> efficiently.

Is unpack("U*") not meeting that need?  I'm not trying to be a jerk,
I'm seriously asking.

James Edward Gray II
Posted by Tim Bray (Guest)
on 2008-09-18 06:17
(Received via mailing list)
On Sep 17, 2008, at 8:55 PM, James Gray wrote:

> Is unpack("U*") not meeting that need?  I'm not trying to be a jerk,  
> I'm seriously asking.

In fact, that produces the correct answer, and it's what I actually
use in my RX code
(http://www.tbray.org/ongoing/When/200x/2008/06/10/RX-Work).
The problem is that it could be a lot more efficient.  It means I
have to take care of organizing the input into chunks and being
careful that I haven't chunked in the middle of a UTF-8 character and
so on, when what I really want, when x is an IO, is

x.each_codepoint do |u|
   # u is a Fixnum (the codepoint)
end

with the buffering and utf-8 unpacking being done at a low level
without wasting memory. -T
Posted by Michael Selig (Guest)
on 2008-09-18 06:18
(Received via mailing list)
----- Original Message -----
From: "Austin Ziegler" <halostatue@gmail.com>
To: <ruby-core@ruby-lang.org>
Sent: Thursday, September 18, 2008 10:42 AM

> If you *ever* deal with more than one encoding, you're going to run
> into this problem in *any* language.

I nearly agree with you, except that these days "day-to-day" work can
involve using data from web sites, from email, from RPC servers etc.
Unless you know that your HTTP, SMTP classes etc are going to return
data in your locale's encoding, you may very well be dealing with more
than one encoding.

Cheers
Mike
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-18 07:11
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18681] Re: Character encodings - a radical suggestion"
    on Thu, 18 Sep 2008 09:03:35 +0900, "Michael Selig" <michael.selig@fs.com.au> writes:

|Thanks for all the replies - I am not an expert on all these encodings,  
|and I (obviously mistakenly!) assumed that all other encodings could be  
|converted to Unicode.
|
|When I first looked at Ruby 1.9's encoding support I thought "that's neat  
|- I think it will solve my m17n problems". However as I got into it I soon  
|discovered that it wasn't nearly this easy!

I am sorry that life is not that easy.

|Here is a summary of my issues:
|
|- Non "ASCII-compatible" data is almost impossible to work with. Just take  
|a look at what James Gray was proposing to do for CSV.

Yes, basically support for UTF-{16,32} is very limited, so
I believe it is OK for libraries to omit them.  We should document that
clearly, but note that 1.9.1 has not been released yet.

|- Other alternative languages to Ruby which represent all strings as  
|Unicode don't have this problem. Although they may not be a 100% solution  
|in Japan & China, they would certainly be fine for me to use.

Ruby does not prohibit you from doing the same thing as alternative
languages - converting back and forth at the surface.  The point is, I
think, we haven't yet provided a nifty API to do so.  If you can live
with Python's open-read-and-decode, I think you are able to stand
Ruby's "r:UTF-16:UTF-8" or open-read-and-encode.
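
For readers following along, here is a minimal sketch of the two styles
just mentioned: the external:internal open mode and the explicit
read-then-encode. It uses "r:UTF-16LE:UTF-8" (the form from the original
post, with an explicit byte order). The temp file, the sample text, and
the variable names are assumptions made up for illustration.

```ruby
require "tempfile"

# Sample text with one non-ASCII character (e with acute accent).
text = "The value is \u00E9"

# Create a small UTF-16LE file so the example is self-contained.
tmp = Tempfile.new("utf16-demo")
tmp.close
File.binwrite(tmp.path, text.encode("UTF-16LE"))

# Style 1: let the IO layer transcode while reading (external:internal).
via_mode = File.open(tmp.path, "r:UTF-16LE:UTF-8") { |f| f.read }

# Style 2: read raw bytes, label them as UTF-16LE, then transcode.
raw = File.binread(tmp.path)
via_encode = raw.force_encoding("UTF-16LE").encode("UTF-8")

# Both results are UTF-8 strings, so interpolation works as usual.
puts "mode:   #{via_mode}"
puts "encode: #{via_encode}"

tmp.unlink
```

Either way the UTF-16 data only exists at the file boundary, and the
rest of the program sees UTF-8.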

If we need something more, it should be a better API to reduce the cost
of Unicode-based applications, not making the language Unicode centric.

Let me rephrase, it's OK for you to make your application/library
Unicode centric, but not the language itself.  One can declare one's
library to support only ASCII compatible text, or UTF-8 text.  The
users must take care of converting non-conforming text.

|- When developing standard classes & mixins that could be installed in any  
|country, virtually all methods that handle more than 1 string are going to  
|have to worry about the possibility of dealing with incompatible  
|encodings. This is a major overhead to a programmer - it may not be  
|acceptable to let it raise an error.

For any serious application/library, there are three choices:

(a) choose US-ASCII
(b) choose UTF-8 (or any specific encoding)
(c) choose to live with multiple encodings

But the last one is not an easy way, indeed.  I don't want to force
the hard way on any Ruby users.  Users should choose anything they want.
But I don't want to deny the possibility.

|It *does* mean that strings may "magically" be converted to UTF-8, but I  
|don't see this as a big deal as long as when they are output they are  
|converted back to the necessary encoding (which I think Ruby does with  
|files now). If the "magic" conversion is a problem, maybe there should be  
|a switch to turn it on & off.
|This auto-convert policy should also be used with non-destructive methods  
|like String#== etc so the programmer needn't worry whether the same  
|character has a different representation on each side of the "==".
|The ASCII-8BIT encoding should be reserved as a "special case" and not be  
|subject to auto-conversion, because it is going to be mainly used for  
|"byte strings".

If you can do implicit conversion at I/O, why do you have to care
about encoding mixing?  Your program should treat single encoding
anyway.  Auto-conversion is bad, believe me.

              matz.
Posted by Michael Selig (Guest)
on 2008-09-18 07:14
(Received via mailing list)
Hi,

----- Original Message -----
From: "James Gray" <james@grayproductions.net>
To: <ruby-core@ruby-lang.org>
Sent: Thursday, September 18, 2008 1:43 PM

> The other thing that's very important to remember is that character
> encodings are just plain hard to get right.  I think it's a pretty big
> testament that Ruby makes it possible for us to support all these
> encodings now.

> However, our Japanese friends are trying to tell us it's not a
> universal solution.  It doesn't always work well for them in
> particular, so they would prefer we make something better.  I for one
> am grateful for them teaching me this new lesson, hard or not.

I agree with you. And if I am coming across as being a jerk or too
dogmatic, I don't mean to be!
I was trying to make some constructive suggestions (some may have been
misguided :-) so that Ruby can meet my needs better, and I think my
needs may be quite common as software and data sources become more
international.
The intent is to make Ruby's character encoding issues as transparent as
possible without losing the specific Japanese/Chinese/Welsh(?)
requirements (now that I understand them a bit better), and of course to
provoke further discussion about the issue.

I think other people will soon face the sorts of problems you and I have
been hitting over the past couple of weeks. It has been a very good
learning experience for me also!

> And if you prefer to do the UTF-8 everywhere strategy, you can,
> right?  Transcode everything to UTF-8 when it comes in and then you
> can pretend it's all UTF-8 (because it is!), right?  Don't we have the
> best of both worlds now?

Nearly! Even if I transcode to UTF-8, I still have to make sure I do it
at every interface, and that includes to Ruby's standard classes as well
- not just IO. So I'll still have to check encodings of strings returned
from network classes, and that's something I don't think I need to do
with other languages that support Unicode, because there is only one
internal string representation. Also testing that I have got it all
right may be a nightmare (not that I am anywhere near that stage yet!).
It would be so much nicer to have Ruby handle most of this for me.

I am a relatively recent convert to Ruby, mainly from Python. This means
I am constantly thinking "could I do this easier/better with Python?".
And the answer to that question for this latest project seems to be
leaning towards "yes" unfortunately, and I'd like to say a definitive
"no", because I like so many things about Ruby a great deal!

Cheers
Mike.
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-18 07:18
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18687] Re: Character encodings - a radical suggestion"
    on Thu, 18 Sep 2008 12:49:56 +0900, Tim Bray <Tim.Bray@Sun.COM> writes:

|Well, yes, as long as Ruby will let me get at the codepoints  
|efficiently.  Oh, and in an ideal world, use Unicode properties like I  
|can in Perl (\p{Lu} for example).  I think that would make all us
|Unicode whiners shut up.  -T

OK, now Ruby 1.9 has String#each_codepoint and understands \p{Lu} for
regular expression.  I hope all Unicode whiners would complain no
longer.
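
A small sketch of both features on a UTF-8 string, for anyone who wants
to try them. The sample word and the variable names are assumptions
chosen for illustration; unpack("U*") is included only for comparison
with the approach discussed earlier in the thread.

```ruby
# "Ecole" with an accented capital E; the \u escape makes this UTF-8.
word = "\u00C9cole"

# Walk the string one codepoint (an Integer) at a time.
codepoints = []
word.each_codepoint { |cp| codepoints << cp }

# For UTF-8 data, unpack("U*") yields the same integers in one array.
unpacked = word.unpack("U*")

# \p{Lu} matches uppercase letters, including non-ASCII ones.
uppercase = word.scan(/\p{Lu}/)

puts codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
puts uppercase.inspect
```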

              matz.
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-18 07:31
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18678] Re: Character encodings - a radical suggestion"
    on Thu, 18 Sep 2008 02:17:32 +0900, Tim Bray <Tim.Bray@Sun.COM> writes:

|Also, Mojikyo is much larger than Unicode.

Indeed.  I've heard some people used a prototype of M17N Ruby to process
Mojikyo text.  Since the Mojikyo character set is not compatible with
Unicode, it was their only way to process Mojikyo text using a scripting
language.  That is one of my primary motivations for the current M17N
design, although 1.9.1 does not support the Mojikyo encoding yet.

              matz.

|So, there are two reasonable goals:
|1. Provide useful features for strings and characters in the general  
|case, including non-Unicode data
|2. Provide good support for Unicode
|
|The problem is that the people who care about #1 don't care about
|#2 very much, and vice versa.  The people who spend all their time
|doing Web/Internet stuff mostly only care about #2.  It seems that  
|Unicode is important so Ruby should have some special low-level  
|language facilities, especially for efficient access to codepoints.   
|If we can do that without harming #1, then everyone should be happy.

We have just started to care about #2 lately, since we are about to
finish #1.

              matz.
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-18 07:32
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18693] Re: Character encodings - a radical suggestion"
    on Thu, 18 Sep 2008 14:06:46 +0900, "Michael Selig" <michael.selig@fs.com.au> writes:

|Nearly! Even if I transcode to UTF-8, I still have to make sure I do it at 
|every interface, and that includes to Ruby's standard classes as well - not 
|just IO. So I'll still have to check encodings of strings returned from 
|network classes, and that's something I don't think I need to do with other 
|languages that support Unicode, because there is only one internal string 
|representation. Also testing that I have got it all right may be a nightmare 
|(not that I am anywhere near that stage yet!). It would be so much nicer to 
|have Ruby handle most of this for me.

You have pointed out an important issue here, I think.  Let me think
about it, although I still don't believe auto-conversion is the way
to go.

              matz.
Posted by Michael Selig (Guest)
on 2008-09-18 07:35
(Received via mailing list)
Hi,

----- Original Message -----
From: "Yukihiro Matsumoto" <matz@ruby-lang.org>
To: <ruby-core@ruby-lang.org>
Sent: Thursday, September 18, 2008 3:03 PM

> If you can do implicit conversion at I/O, why do you have to care
> about encoding mixing?  Your program should treat single encoding
> anyway.  Auto-conversion is bad, believe me.

I/O is only one way I can get data.
For example, when I get data via HTTP, how do I control what encoding
Ruby's libraries are going to use to return data? I assume I'll get
whatever encoding is specified in the HTTP header, so I'll have to
remember to convert to UTF-8. It is dangerous to assume that I'll always
get UTF-8 data via HTTP.
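
To make the "remember to convert" step concrete, here is a rough sketch
of normalizing at the boundary. It assumes the library hands back raw
bytes plus a charset name taken from the header; the helper name
to_utf8 and the sample bodies are hypothetical, and no real HTTP call
is made.

```ruby
# Hypothetical helper: relabel raw bytes with the charset the sender
# declared, then transcode to UTF-8. dup keeps the caller's string intact.
def to_utf8(raw_bytes, charset)
  raw_bytes.dup.force_encoding(charset).encode("UTF-8")
end

# The same word as it might arrive from two different servers.
latin1_body = "caf\xE9".b      # bytes declared as charset=ISO-8859-1
utf8_body   = "caf\u00E9".b    # bytes declared as charset=UTF-8

a = to_utf8(latin1_body, "ISO-8859-1")
b = to_utf8(utf8_body, "UTF-8")

# After normalization the two strings compare equal and can be mixed.
puts a == b
puts "#{a} / #{b}"
```

An unknown charset name makes force_encoding raise ArgumentError, so a
real program would still need to decide what to do in that case.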

Mike
Posted by Urabe Shyouhei (Guest)
on 2008-09-18 08:33
(Received via mailing list)
Michael Selig wrote:
> ----- Original Message ----- From: "Urabe Shyouhei"
> <shyouhei@ruby-lang.org>
>> Have you read Matz's post about the yen sign problem?  Converter IS a
>> problem; you cannot make a converter over (Encoding A -> Unicode ->
>> Encoding A).  That must lose some input.  Data loss is the worst thing
>> to introduce, so ruby asks you to take the risk by explicitly calling a
>> conversion method.
>
> Yes I have read it.
> Do you refuse to drive a car because you may have a crash?

We are *designing* a car now.  Don't you need a seatbelt because you'll
never crash your car? That's insane.

Current design might not be perfect.  There must be a better practice.
But that "better" should be a synonym of "safer".
Posted by James Gray (bbazzarrakk)
on 2008-09-18 14:45
(Received via mailing list)
On Sep 18, 2008, at 12:10 AM, Yukihiro Matsumoto wrote:

> |can in Perl (\p{Lu} for example).  I think that would make all us
> |Unicode whiners shut up.  -T
>
> OK, now Ruby 1.9 has String#each_codepoint and understands \p{Lu} for
> regular expression.  I hope all Unicode whiners would complain no
> longer.

Thanks for putting up with all our whining about encodings Matz.  We
know you are working hard to make things better for all of us.

James Edward Gray II
Posted by Tim Bray (Guest)
on 2008-09-18 22:31
(Received via mailing list)
On Sep 17, 2008, at 10:10 PM, Yukihiro Matsumoto wrote:

> OK, now Ruby 1.9 has String#each_codepoint and understands \p{Lu} for
> regular expression.  I hope all Unicode whiners would complain no
> longer.

The community of Unicode whiners says "thank you" to the community of
Ruby implementors.  I'll grab this code and see how it works for XML
parsing.  -Tim
Posted by Martin Duerst (Guest)
on 2008-09-19 10:19
(Received via mailing list)
At 10:20 08/09/17, Michael Selig wrote:
>Hi,
>
>You might at first glance think that this post should go to ruby-dev, but  
>please read to the end!

If it's in English, it should be ruby-core, not ruby-dev, as far as I
understand.

>       puts "The value is #{val}"
>
>fails if val is UTF-16 data.

I think in this case, the reason why you see the problem only for
UTF-16 is that your string, other than the interpolated data, is
currently all US-ASCII. But imagine that sooner or later you
(or somebody) are going to localize your application. Then the
string might be in any encoding, and you'll get many more
"encoding compatibility" exceptions.

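A short sketch of the three cases described in the paragraph above: a
UTF-16 value, an ISO-8859-1 value in an ASCII-only template, and the
same value in a "localized" template. The helper attempt, the sample
strings, and the German template text are assumptions for illustration.

```ruby
val_utf16  = "abc".encode("UTF-16LE")
val_latin1 = "caf\u00E9".encode("ISO-8859-1")

# Demo helper: return the block's result, or the exception it raised.
def attempt
  yield
rescue Encoding::CompatibilityError => e
  e
end

# UTF-16 is not ASCII-compatible, so even an ASCII-only template fails.
utf16_result = attempt { "The value is #{val_utf16}" }

# An ASCII-only template combines fine with ISO-8859-1 data.
latin1_result = attempt { "The value is #{val_latin1}" }

# Once the template itself has non-ASCII UTF-8 text, it fails as well.
localized_result = attempt { "Der Wert betr\u00E4gt #{val_latin1}" }

[utf16_result, latin1_result, localized_result].each do |r|
  puts r.is_a?(Exception) ? "error: #{r.class}" : "ok"
end
```
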

>At one stage I got so frustrated that I was even thinking about going back  
>to Python :-(
>So I have ended up transcoding any UTF-16 data to UTF-8, and now things  
>are going much better.
>
>Maybe I am doing something wrong - if so please suggest something I can do  
>other than transcode the UTF-16.

I think your problem is more general, and you should transcode other
encodings to UTF-8, too, if you're not sure you'll be in a situation
with a single encoding.


>But this has led me to look back at the issues with UTF-16 I have hit,  
>and to think about all the internal code in Ruby to handle "ASCII  
>incompatible" encodings, and the overhead involved with supporting it.
>
>And I think that other Ruby programmers may end up doing what I have done  
>- avoid using UTF-16 internally because it is too hard.

I agree that all non-ASCII encodings should come with a sticker with
a big warning on it, at least.


>So my radical suggestion is this:
>
>Remove internal support for non-ASCII encodings completely, and when  
>reading/writing UTF-16 (and UTF-32) files automatically transcode to/from  
>UTF-8.

I can understand the former part. Providing something half-baked
can have advantages and disadvantages.


>My reasons:
>
>- String & Regexp operations should just "work" without the programmer  
>worrying about encoding compatibility (I think!)

See below.

>- The programmer only has to think about character encodings at the  
>"interfaces" (files, network interfaces) not throughout the program logic

This is desirable/good architecture. Ruby 1.9 will force you to do that,
or come up with some other architecture, but won't handle things
automatically for you.

>- To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible" as  
>Ruby defines it

No, there are others, such as iso-2022-jp. But they are not really the
main issue. You can get an encoding incompatibility error for any two
ASCII-compatible encodings. E.g. iso-8859-1 and iso-8859-2, or any two
others. The reason that you currently don't is that one of your strings
(or a regexp) always is ASCII-only, even if it's labeled as something
else.
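
As a rough illustration of that last point, the sketch below combines
an ISO-8859-1 string with two ISO-8859-2 strings, one ASCII-only and one
not. The sample words and variable names are assumptions.

```ruby
latin1     = "caf\u00E9".encode("ISO-8859-1")              # has a non-ASCII byte
latin2     = "\u0142\u00F3d\u017A".encode("ISO-8859-2")    # has non-ASCII bytes
ascii_only = "abc".encode("ISO-8859-2")                    # labeled, but ASCII-only

# Works: one operand contains only ASCII, whatever its label says.
combined = latin1 + ascii_only

# Fails: both operands contain non-ASCII bytes in different encodings.
mixed_error = begin
  latin1 + latin2
  nil
rescue Encoding::CompatibilityError => e
  e
end

puts combined.encoding
puts mixed_error.class
```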

>- To my knowledge no one actually uses UTF-16 or UTF-32 as a locale

True.

>- I would avoid having to use ugly modes to open a file like  
>"r:UTF-16LE:UTF-8" (very minor)

Telling Ruby what encoding you expect from the outside is kind of
unavoidable. But it would indeed help if it would suffice to tell
a Ruby application only once that you want to handle everything
internally in a certain encoding.

>- Ruby's internal code would be simpler & cleaner and therefore probably  
>faster and easier to maintain

If everything is done in UTF-8 all the time, yes. But I don't think
we will go there soon (I wouldn't mind). Speed isn't too much of
an issue, but of course the code would be quite a bit simpler.

Regards,   Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Martin Duerst (Guest)
on 2008-09-19 10:20
(Received via mailing list)
At 10:21 08/09/18, Urabe Shyouhei wrote:

>Have you read Matz's post about the yen sign problem?

The yen sign problem is indeed a big problem. It's similar to the
Y2K problem (people knew they shouldn't use just two digits, but they
did, and people know they shouldn't use 0x5c for the Japanese
currency anymore, but they still do), except that there is no deadline,
and so there is not enough pressure to fix it.

>Converter IS a
>problem; you cannot make a converter over (Encoding A -> Unicode ->
>Encoding A).

Sorry, but for all the encodings in daily use, including those in
Japan, round-tripping via Unicode works fine. Unicode was explicitly
designed to do that (at the expense of introducing quite a bit of
what some people might call garbage). This very much includes the
Yen/Backslash. The problems may start when you try to do some processing.
(Many kinds of processing are not affected, but some are.)
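
A quick way to check a round trip for a particular string is sketched
below. It only covers the three kanji used here (an assumption picked
for illustration) and says nothing about the yen/backslash case or
about every character in every encoding.

```ruby
# "Nihongo" (three kanji) converted to Shift_JIS bytes.
original = "\u65E5\u672C\u8A9E".encode("Shift_JIS")

# Encoding A -> Unicode (UTF-8) -> Encoding A
via_unicode = original.encode("UTF-8")
back        = via_unicode.encode("Shift_JIS")

# Compare raw bytes so the check does not depend on encoding labels.
round_trip_ok = (back.bytes == original.bytes)
puts round_trip_ok
```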

>That must lose some input.  Data loss is the worst thing
>to introduce, so ruby asks you to take the risk by explicitly calling a
>conversion method.

Taking the risk explicitly is fine. But some people may feel that it's
easier to do that application-by-application than string by string.


>Problems on character encodings are sourced from complexities of human
>activities.

Very much so indeed.

>I can hardly believe there are any simple, perfect, and/or
>"neat" solution.

Who said Unicode is neat? It's just that sometimes one messy
solution is better than a mess of many solutions :-(.

Regards,   Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp      mailto:duerst@it.aoyama.ac.jp
Posted by Martin Duerst (Guest)
on 2008-09-19 10:20
(Received via mailing list)
[most of this mail, and some others, was written Wednesday,
and so may repeat some of what Matz and others have said,
but I had big problems getting mail out.]

At 13:21 08/09/17, Michael Selig wrote:

>I have been doing some more thinking about these ongoing issues....
>
><soapbox>
>
>Using Ruby SHOULD be making our lives easier, not harder.

Very much so.

>Other languages  
>like Python have taken an easier route to m17n - represent all strings  
>internally as unicode codepoints. Then there should never be a need to  
>check encoding compatibility, right?

Yes. The requirement is that you have to make sure your application
knows what encoding it's dealing with, and that you have to make sure
you can convert everything, even 'private use' characters appearing
with a certain frequency in East Asian encodings.

>I am not saying that this is a  
>perfect solution either, by the way. But having to work around this  
>"Encoding Compatibility Error" all the time is just a pain for apps which  
>need to work in different countries with different locales. Unfortunately  
>it is leading me towards the path of having to transcode everything to  
>UTF-8, even though in 99% of cases all the data IS going to be compatible  
>and be in the user's locale. I don't want so much of my time taken up, and  
>be forced to write ugly code to take care of the remaining 1%.

In my view, you either have a true single-encoding situation, in which
case Ruby should work great, or you have a mixed-encoding situation.
And even 1% of "other" encodings means a mixed situation.

In a mixed situation, going "Unicode inside" (which for Ruby means
"UTF-8 inside") is the best thing to do in most cases. Unicode inside
is a model that many, many applications and several programming
languages have chosen for many good reasons. Ruby currently supports
it, but not
as seamlessly as it could. Getting more input about where things
hurt most is very helpful.

There are probably two things that differ from "all Unicode inside"
programming languages such as Perl, Python, and Java:

- Because Ruby allows you to use all kinds of non-Unicode encodings,
  it may give the impression that things work with mixed encodings,
  and lets you postpone some necessary cleanup that you'd otherwise do
  upfront.

- When reading data, in Java and friends, you only have to indicate
  the external encoding. In Ruby, you have to mention UTF-8, too,
  because otherwise the encoding is used just as a label, without
  conversion. For a "Unicode inside" application, that's an additional
  burden. [I'm glad to see that Matz thinks that's ugly, too,
  and wants to do something about it in the future.]
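
The difference between "label" and "conversion" in the second point can
be seen on a plain string, without any file involved. This is only a
sketch; the single byte 0xE9 and the variable names are assumptions.

```ruby
# One byte, 0xE9, which is e-acute in ISO-8859-1.
bytes = "\xE9".b

# Labeling: the bytes stay the same, only the encoding tag changes.
labeled = bytes.dup.force_encoding("ISO-8859-1")

# Conversion: the bytes are rewritten to represent the same character.
converted = labeled.encode("UTF-8")

puts labeled.bytes.inspect     # still the single original byte
puts converted.bytes.inspect   # the UTF-8 byte sequence for e-acute
```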

I have suggested that we introduce some kind of
"encoding policy" that lets some things happen "automagically".
(see http://www.sw.it.aoyama.ac.jp/2007/pub/IUC31-ruby/Paper.html,
Section 6). One such policy could be "whenever you might get an
exception due to an encoding mismatch, try to transcode (e.g., to
UTF-8)". Another could be "transcode all input to UTF-8 unless
there is a specific indication that another encoding is wanted".
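
Nothing like this exists in Ruby itself, but the first policy can be
approximated by hand in application code. The sketch below is one
possible shape, not a proposal for the real API: the method name
concat_with_utf8_fallback and the sample strings are made up.

```ruby
# Hypothetical helper: try the normal concatenation first, and only
# transcode both sides to UTF-8 when Ruby reports a mismatch.
def concat_with_utf8_fallback(a, b)
  a + b
rescue Encoding::CompatibilityError
  a.encode("UTF-8") + b.encode("UTF-8")
end

latin1 = "caf\u00E9".encode("ISO-8859-1")
latin2 = "\u0142\u00F3d\u017A".encode("ISO-8859-2")

# Same encoding on both sides: nothing is transcoded.
same = concat_with_utf8_fallback(latin1, latin1)

# Mismatch: both operands are transcoded and the result is UTF-8.
mixed = concat_with_utf8_fallback(latin1, latin2)

puts same.encoding
puts mixed.encoding
```

Binary (ASCII-8BIT) strings with high bytes would make the fallback
itself raise, which roughly matches the idea of keeping byte strings
out of any automatic conversion.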

The main problem with such an approach is that it's very difficult
to do this globally, because libraries may have very different
assumptions or restrictions, and Ruby doesn't have a 'per library'
concept.

My understanding is that similar problems can happen with class
extensions (two different libraries adding or changing methods
with the same name in the same class,..., or one library depending
on a change where another depends on having nothing changed,...),
and that some solution to this problem is one of the things that
Matz mentioned when talking about Ruby 2.0. If such a solution
would indeed happen, I guess it wouldn't be too difficult to
also use that solution for dealing with "encoding policies".
But all this is currently just some vague feeling, none of it
exists in actual code.


Regards,   Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp     mailto:duerst@it.aoyama.ac.jp
Posted by Martin Duerst (Guest)
on 2008-09-19 10:20
(Received via mailing list)
At 23:28 08/09/17, Yukihiro Matsumoto wrote:

>    on Wed, 17 Sep 2008 10:20:13 +0900, "Michael Selig" 
><michael.selig@fs.com.au> writes:

>|- I would avoid having to use ugly modes to open a file like  
>|"r:UTF-16LE:UTF-8" (very minor)
>
>This is ugly indeed.  We might add more Unicode support in the
>future.  But we are no hurry.

The problem here is that it would be much better if we could
avoid forcing many people to use such ugly stuff for a few
years. But I have to admit that I don't know exactly what
a better solution would look like.

Regards,   Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Martin Duerst (Guest)
on 2008-09-19 10:21
(Received via mailing list)
At 00:01 08/09/18, Yukihiro Matsumoto wrote:
>|Unicode. Why does no-one consider going that way, but instead builds
>|a complicated stack of functions for conversions on top level?
>
>Just because it's impossible.  History sucks.  We have mixed up YEN
>SIGN and REVERSE SOLIDUS for long time.  They cannot be distinguished
>without context information.  Technically 0x5c should mean REVERSE
>SOLIDUS, but not always so for humans.

Thanks for putting it so bluntly. The Europeans did similar things
in the ISO 646 age (7-bit encodings with national variants), but were
fortunate enough to go through an intermediate stage of 8-bit encodings
before going multibyte.


>Besides that, Unicode is not a panacea.

Definitely not. But it makes a lot of things a lot easier for a
lot of people.


>Some character set
>(e.g. GB18030 for Chinese characters) is even bigger than Unicode.
>In fact, GB18030 is a super set of Unicode.

How exactly? I know that the Chinese government is requiring
GB 18030 support for software sold in China, and that the Unicode
Consortium and all the companies involved have been working hard to
make sure that this requirement is met by converting from and to
Unicode so that applications can use Unicode internally.


>|should not have to deal with this issue so prominently.
>
>You are free to feel so, but it's us who take up the burden.

I can't speak for Matz, but I think anybody who wants to share
some burden by providing patches and such is also very welcome,
although some of the issues discussed on this list are not yet
at the level where somebody could write a patch.


Regards,    Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-19 10:33
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18728] Re: Character encodings - a radical 
suggestion"
    on Fri, 19 Sep 2008 17:11:53 +0900, Martin Duerst 
<duerst@it.aoyama.ac.jp> writes:

|>Besides that, Unicode is not a panacea.
|
|Definitely not. But it makes a lot of things a lot easier for a
|lot of people.

Indeed.  And we'd like to make people's lives easier (using Unicode).
Our M17N API is not the final cut (although it's almost fixed for
1.9.1).  If there's any idea to make Ruby's Unicode support better,
let us hear.  But no thanks in advance for the proposals to abandon
what we have now. ;-)

              matz.
Posted by Michael Selig (Guest)
on 2008-09-19 11:42
(Received via mailing list)
On Fri, 19 Sep 2008 18:24:41 +1000, Yukihiro Matsumoto
<matz@ruby-lang.org> wrote:


> If there's any idea to make Ruby's Unicode support better,
> let us hear.  But no thanks in advance for the proposals to abandon
> what we have now. ;-)
>

I assume that you are referring to my suggestion to remove support for
"non-ASCII compatible" encodings? I don't think that it is an unreasonable
proposal given the fact that it is so difficult to handle them. I have
often pulled out features in software I have written that seemed a "good
idea at the time" but later turned out not to be. I am not saying that
they are a bad idea, but I currently cannot see their value as a
"fully-fledged" internal encoding, and I was concerned that supporting
them may make a "rod for your own back" when it comes to handling them in
libraries.

Of course you are the boss when it comes to Ruby, but I feel that you
could have phrased this statement ("But no thanks in advance....") a
little better. It and the "Unicode whiners" comment are rather offensive, I
feel. We are only trying to help.

Mike.
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-19 12:01
(Received via mailing list)
Oops, I misfired my mail reader; the following is the right one:

In message "Re: [ruby-core:18732] Re: Character encodings - a radical 
suggestion"
    on Fri, 19 Sep 2008 18:34:21 +0900, "Michael Selig" 
<michael.selig@fs.com.au> writes:

|I assume that you are referring to my suggestion to remove support for  
|"non-ASCII compatible" encodings?

No, I was referring to past proposals like "abandon all M17N and
choose Unicode as unified internal character set, like other `major'
languages do" as such.

|Of course you are the boss when it comes to Ruby, but I feel that you  
|could have phrased this statement ("But no thanks in advance....") a  
|little better.

I am sorry if my phrase appeared offensive.  UTF-16 is a nasty beast,
but as I stated we have other beasts (dummy encodings), so that simply
removing UTF-16 would help us little.  We have to do it consistently,
if we do.

              matz.
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-19 12:25
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18732] Re: Character encodings - a radical 
suggestion"
    on Fri, 19 Sep 2008 18:34:21 +0900, "Michael Selig" 
<michael.selig@fs.com.au> writes:

|I assume that you are referring to my suggestion to remove support for  
|"non-ASCII compatible" encodings?

No, I was referring to past proposals like "abandon all M17N and
choose Unicode as unified internal character set, like other `major'
languages do" as such.

|Of course you are the boss when it comes to Ruby, but I feel that you  
|could have phrased this statement ("But no thanks in advance....") a  
|little better.

I am sorry that my phrase appeared offensive.

|It and the "Unicode whiners" comment are rather offensive I
|feel. We are only trying to help.
|
|Mike.
|
Posted by James Gray (bbazzarrakk)
on 2008-09-19 14:39
(Received via mailing list)
On Sep 19, 2008, at 4:34 AM, Michael Selig wrote:

> It and the "Unicode whiners" comment are rather offensive I feel.

I took that as a joke and laughed when I read it.  I think matz has a
good sense of humor.  :)

James Edward Gray II
Posted by Austin Ziegler (austin)
on 2008-09-19 14:44
(Received via mailing list)
On Fri, Sep 19, 2008 at 8:30 AM, James Gray <james@grayproductions.net> 
wrote:
> On Sep 19, 2008, at 4:34 AM, Michael Selig wrote:
>> It and the "Unicode whiners" comment are rather offensive I feel.
> I took that as a joke and laughed when I read it.  I think matz has a good
> sense of humor.  :)

GMail tells me that, at least in this thread, Tim Bray is the one who
used it first to describe himself. ;)

-austin
Posted by Dave Thomas (Guest)
on 2008-09-19 15:48
(Received via mailing list)
On Sep 19, 2008, at 4:52 AM, Yukihiro Matsumoto wrote:

>  UTF-16 is a nasty beast,
> but as I stated we have other beasts (dummy encodings), so that simply
> removing UTF-16 would help us little.  We have to do it consistently,
> if we do.

I'm no expert in any of this, but I wonder if part of the problem
might be that Ruby tries to support all encodings both internally and
externally. Might it be easier to support the full set externally, but
to have a more limited set internally? For example, you could support
UTF-16<any endian> as an external encoding, but transcode to UTF-8 on
the way in. You could still support a rich variety of internal
encodings, including the Asian ones you need. But you wouldn't have to
deal with UTF-16 when implementing Regexp#escape :)  So, keep the
current set of encodings, but only allow a reasonable (ASCII-
compliant) subset as internal encodings.



Dave
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-19 16:40
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18742] Re: Character encodings - a radical 
suggestion"
    on Fri, 19 Sep 2008 22:40:20 +0900, Dave Thomas <dave@pragprog.com> 
writes:

|I'm no expert in any of this, but I wonder if part of the problem  
|might be that Ruby tries to support all encodings both internally and  
|externally. Might it be easier to support the full set externally, but  
|to have a more limited set internally?

That's what we thought.  Limited support for UTF-16 and dummy encodings
is our result, which seems to be imperfect.

|For example, you could support  
|UTF-16<any endian> as an external encoding, but transcode to UTF-8 on  
|the way in. You could still support a rich variety of internal  
|encodings, including the Asian ones you need. But you wouldn't have to  
|deal with UTF-16 when implementing Regexp#escape :)  So, keep the  
|current set of encodings, but only allow a reasonable (ASCII- 
|compliant) subset as internal encodings.

I think you've suggested something valuable, but I cannot imagine the
detail (yet).  Magically transcode text in unsupported encoding to
certain supported encoding.  Hmm.

UTF-16 and UTF-8 are an easier set, since they are semantically the same.
But how should we treat ISO-2022-JP for example?
# No, I am asking myself, not you, Dave.

              matz.
Posted by Tim Bray (Guest)
on 2008-09-19 17:42
(Received via mailing list)
On Sep 19, 2008, at 2:52 AM, Yukihiro Matsumoto wrote:

> I am sorry if my phrase appeared offensive.  UTF-16 is a nasty beast,
> but as I stated we have other beasts (dummy encodings), so that simply
> removing UTF-16 would help us little.  We have to do it consistently,
> if we do.

Actually, I think it would be perfectly OK to remove runtime support
for UTF-16, if you make input and output possible.  I can't think of
any practical advantages to handling multiple UCS encodings, for
regexing or parsing or splitting or matching.  UTF-16 is horrible; C#
and Java will be paying the price for choosing 16-bit characters long
after we're dead.  -Tim
Posted by Tim Bray (Guest)
on 2008-09-19 17:42
(Received via mailing list)
On Sep 19, 2008, at 6:40 AM, Dave Thomas wrote:

> I'm no expert in any of this, but I wonder if part of the problem  
> might be that Ruby tries to support all encodings both internally  
> and externally. Might it be easier to support the full set  
> externally, but to have a more limited set internally? For example,  
> you could support UTF-16<any endian> as an external encoding, but  
> transcode to UTF-8 on the way in. You could still support a rich  
> variety of internal encodings, including the Asian ones you need.  
> But you wouldn't have to deal with UTF-16 when implementing  
> Regexp#escape :)  So, keep the current set of encodings, but only  
> allow a reasonable (ASCII-compliant) subset as internal encodings.

+1  -Tim
Posted by Tim Bray (Guest)
on 2008-09-19 17:42
(Received via mailing list)
On Sep 19, 2008, at 2:34 AM, Michael Selig wrote:

> . It and the "Unicode whiners" comment are rather offensive I feel.  
> We are only trying to help.

To be fair, I think the "Unicode whiners" thing is a joke that I
started.  I am proud to be a Unicode whiner.  Also preacher, advocate,
dialectician, and polemicist.  -Tim
Posted by Michael Selig (Guest)
on 2008-09-20 03:08
(Received via mailing list)
On Fri, 19 Sep 2008 19:52:30 +1000, Yukihiro Matsumoto
<matz@ruby-lang.org> wrote:

>
> I am sorry if my phrase appeared offensive.  UTF-16 is a nasty beast,
> but as I stated we have other beasts (dummy encodings), so that simply
> removing UTF-16 would help us little.  We have to do it consistently,
> if we do.

No problem - it appears I misunderstood you, sorry. Easy to happen with
email, unfortunately :-(

Perhaps we need to go back to basics with this discussion. As a mere
English speaker, I do not fully understand the issues that are faced by
Japanese and other encodings. What I have gathered from this discussion is
(please tell me if I am wrong):

- There are characters that Ruby needs to support which cannot be uniquely
mapped to Unicode
- In fact there are entire character sets that we want to support in Ruby
that are not supported in Unicode
- There are ambiguous characters in some character sets - same code for
different characters

I think it would be a benefit if we all got to understand a bit more:

- How the character ambiguity (eg: Yen/ backslash) issue is handled at the
moment - generally, not just with Ruby. ie: how do you know that a printer
or screen is going to show the right character?
- How the various "non-ascii compatible" encodings are used in practice.
eg: it is my understanding that UTF-7 is really only used in email, and
that it would be straightforward to immediately transcode it to/from UTF-8
in a POP/IMAP library, so UTF-7 could be avoided completely as an
"internal" encoding in Ruby. It's as if we were treating UTF-7 like
base64 - just a transformation of a "real" encoding. (In fact UTF-16 & 32
could be considered the same sort of thing, except they may be used more
widely.)
- How a Japanese programmer would handle the situation of dealing with a
combination of a Japanese non-Unicode compatible character set, and say a
UTF-8 encoding which included non-ascii characters, and non-Japanese ones.
ie: Is there a reasonable alternative to encoding both to Unicode &
somehow dealing with the "difficult characters" as special cases?

Could someone out there please succinctly explain these things to us
westerners? Then perhaps our thinking about this issue may be more aligned.

Thanks
Mike
Posted by Martin Duerst (Guest)
on 2008-09-20 03:21
(Received via mailing list)
At 17:20 08/09/17, Robert Klemme wrote:
>encodings very clean and workable.  But if I remember correctly Matz
>once said that Unicode does not cover all Asian symbols so it might
>not be a too good choice for internal representation.

In some sense, this is true, but then this is true for any other encoding
(in particular all those used in Asia), too. So that's not really an
argument (apart from the fact that if you really need, you can always
use the huge private-use areas provided by Unicode; not that I would
suggest that though myself).


>I believe that one reason for the difficulties we encounter now is the
>fact that String is historically used for binary and text data.  So
>there is no clear separation between the two and this bears potential
>for confusion and bugs.

That's a part of the problem, but not too big a part.


>A clean solution would probably involve having a character type which
>is capable of representing *all* possible symbols and model String as
>sequence of those characters.

Well, yes, that could be done, e.g. model a character as the union
of Unicode codepoints and any other odd objects. Users could then
define their own objects for their own characters, e.g. with lots
of metadata, or font information, or what not. Such ideas have been
around for a long time, but in contrast to Ruby's current model,
which in some ways is on the edge, but still doable, such a model
quickly becomes way more complex and hopelessly slow.


>Encoding would then be done during
>input and output only.  Questions I see
>
>1. Is this feasible, i.e. is there something similar to Unicode
>without its limitations?

Unicode is as good as it gets. And it gets better and better
(not all historic or minority scripts are encoded yet, but
that work is ongoing). The conclusion is: If you want something
better than Unicode (in particular something with more scripts
and characters covered), the best thing is to contribute to
Unicode.

Regards,   Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Martin Duerst (Guest)
on 2008-09-20 04:04
(Received via mailing list)
At 09:42 08/09/18, Austin Ziegler wrote:
>to use the private use plane in Unicode.
Very much agreed. Private use areas (a small area in the BMP
(Base Multilingual Plane) and planes 15 and 16) are free-for-all,
which means you are never really sure what you get there.

(for those who want some more terse background reading, I
recommend http://www.w3.org/TR/charmod/#sec-PrivateUse)
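
For concreteness, the three private use ranges mentioned above can be checked as in the following sketch. `private_use?` and `PRIVATE_USE_RANGES` are hypothetical names introduced here for illustration; the ranges themselves are the ones defined by Unicode.

```ruby
# Unicode private use ranges: one in the BMP, plus planes 15 and 16
# (each plane minus its last two code points, which are noncharacters).
PRIVATE_USE_RANGES = [
  0xE000..0xF8FF,
  0xF0000..0xFFFFD,
  0x100000..0x10FFFD
].freeze

# Hypothetical helper: true if the first character of +char+ is a
# private use code point.
def private_use?(char)
  code_point = char.ord
  PRIVATE_USE_RANGES.any? { |range| range.cover?(code_point) }
end

puts private_use?("\u{E000}")
puts private_use?("A")
```

A check like this can tell you that a character is private use, but not what any particular sender meant by it, which is the problem described above.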

>Adding glyphs to Unicode is a
>lengthy process that requires going through a standards body. The
>Unicode standard is updated every few years, but the Unicode
>consortium is much more likely to listen to the Japanese standards
>bodies than Ruby programmers.

Well, yes, first because the relevant Japanese standards body
is a member of ISO/IEC JTC1/SC2/WG2, the group responsible for
ISO 10646, which is in sync with Unicode. And second because
Ruby programmers as a group don't have any particular character
encoding needs.


>The reality is that Unicode *doesn't* completely represent all Asian
>languages well

True. There are still many (minor) scripts that are not yet encoded,
and most of them are used in Asia, in the same way as most of the
scripts already encoded are used in Asia.
(For more details, please see http://unicode.org/roadmaps/
and the links from there to the roadmaps for various parts
of Unicode.)

>(see the discussions around Han unification for a brief
>primer on the issues involved).

Complaints about Han unification are mostly unjustified. The discussion
e.g. around Internationalized Domain Names has shown that unification
has significant advantages. You get into problems when e.g. a Latin
'A', a Cyrillic 'A', and a Greek 'A' are encoded separately (as they
currently are, not the least because they are encoded separately in
some important East Asian standards).
I do not want to imagine the mess we would have if there were separate
codes for Chinese/Japanese/Korean (and maybe Vietnamese, Taiwanese,...)
"variants" of Han characters such as '一' (one), '二' (two), '三' (three),
and so on.


>The problem is exacerbated in the
>academic arena where people want to be able to represent ancient
>characters accurately, but it's not limited to that.

Yes, and if you look at academic use, the same can be said for
the Western World. As a simple example, Unicode doesn't contain
codepoints for all the many ligatures used in the Gutenberg bible.
The only difference may be that researchers in the West are
more ready to use an additional layer (e.g. some XML markup or so)
for this, whereas in Asia, the fact that there is already
such a huge number of characters makes it very easy for people
to think that just adding more characters is the solution
for these problems.


>Just because you
>and I can represent our words in under one hundred characters doesn't
>mean that it's appropriate to do the same with others' languages.

Of course not. And Unicode definitely hasn't done that, quite to
the contrary.


Korean got more than 11,000 characters, of which by all accounts
less than 3000 are actually used, the only purpose of the rest being
to complete a nice-looking three-dimensional table.
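
Assuming the "three-dimensional table" refers to the algorithmic layout of the precomposed Hangul syllables block (19 leading consonants by 21 vowels by 28 trailing-consonant slots, starting at U+AC00), the count and the layout can be sketched like this. `hangul_syllable` is a hypothetical helper name.

```ruby
HANGUL_BASE = 0xAC00
LEADS  = 19   # leading consonants
VOWELS = 21   # vowels
TAILS  = 28   # trailing consonants, including "none" at index 0

# Hypothetical helper: compose a precomposed syllable from its indices.
def hangul_syllable(lead, vowel, tail = 0)
  [HANGUL_BASE + (lead * VOWELS + vowel) * TAILS + tail].pack("U")
end

puts LEADS * VOWELS * TAILS    # size of the whole table
puts hangul_syllable(0, 0)     # first cell of the table
puts hangul_syllable(18, 0, 4)
```

Every cell of the table gets a code point, whether or not the syllable occurs in practice.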

Han characters currently count around 70,000, of which the majority
is mainly used in dictionaries, and many of them with entries of the
form (freely translated): "A: variant/misprint for B, see B."

Mind you, there are still a lot of Han characters (the core being about
21,000) that are really useful because they are supported on everyday
computer systems in China, Japan, Korea, and so on. And a smaller subset
of these (around 2000-3000 for Japanese, less for Korean, more for Chinese)
is what people actually use day in day out.


>It's getting better, but it's still not perfect.

Very much so indeed.

Regards,   Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Austin Ziegler (austin)
on 2008-09-20 07:09
(Received via mailing list)
2008/9/19 Martin Duerst <duerst@it.aoyama.ac.jp>:
> codes for Chinese/Japanese/Korean (and maybe Vietnamese, Taiwanese,...)
> "variants" of Han characters such as '一' (one), '二' (two), '三' (three),
> and so on.

I'm not disagreeing with you in principle, but even if the complaints
are unjustified, the fact is that they exist and they slowed adoption
of Unicode in Asian countries pretty significantly.

> to think that just adding more characters is the solution
> for these problems.

It's also a little different for the Asian researchers because
different characters are different words. It may also be a display
problem; most Western language ligatures can be approximated on
computer displays with just a little tweaking of the display of two
characters, even if a separate glyph in a font is always better. This
isn't always possible with Asian language characters, by my
understanding.

Still, I am encouraged to see Ruby keeping m17n yet improving its
Unicode support.

-austin
Posted by Martin Duerst (Guest)
on 2008-09-20 07:37
(Received via mailing list)
As far as I understand, that was the original plan.

The question is how exactly to distinguish internal
and external encodings. Should we e.g. allow "UTF-16BE"
in a mode when opening a file, but not as an argument to
String#encode? But then what if you want to convert to
UTF-16BE and then use some compression (gzip,...) on
output?
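
The "convert, then compress" case might look like the sketch below in a current Ruby. The helper names are hypothetical. It relies on String#encode accepting "UTF-16BE" as a target, which is exactly what a strict internal/external split would forbid.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: transcode to UTF-16BE, then gzip the raw bytes.
def gzip_as_utf16be(text)
  io = StringIO.new(String.new)
  gz = Zlib::GzipWriter.new(io)
  gz.write(text.encode("UTF-16BE").b)  # .b hands the compressor plain bytes
  gz.finish
  io.string
end

# Hypothetical helper: the reverse direction, back to UTF-8.
def gunzip_from_utf16be(data)
  raw = Zlib::GzipReader.new(StringIO.new(data)).read
  raw.b.force_encoding("UTF-16BE").encode("UTF-8")
end

compressed = gzip_as_utf16be("h\u00E9llo")
puts compressed.bytesize
puts gunzip_from_utf16be(compressed)
```

Here the UTF-16BE string exists only between the conversion and the compressor, so no string operations are ever performed on it.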

And I think once we were at that point, what happened
was that to whatever extent it was easy to support an
encoding, it was done. As an example, Oniguruma supported
UTF-16(BE/LE) and so on, so that's usable now.

The alternative, which is suggested by this discussion,
is that we decide on a (pretty high) minimum standard for
support for an encoding. All encodings that don't reach
that standard are simply declared dummy and behave as
such (i.e. the same as binary, or even with less functionality).
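
For reference, Ruby exposes this distinction through Encoding#dummy? and Encoding#ascii_compatible?. The sketch below only prints what a given interpreter reports. The results reflect the Ruby release it is run on and are not necessarily those of the 1.9.0 snapshot under discussion.

```ruby
encodings = [
  Encoding::UTF_8,
  Encoding::Shift_JIS,
  Encoding::UTF_16BE,
  Encoding::UTF_7,
  Encoding::ISO_2022_JP
]

# Print how the running interpreter classifies each encoding.
encodings.each do |enc|
  puts format("%-12s dummy=%-5s ascii_compatible=%s",
              enc.name, enc.dummy?, enc.ascii_compatible?)
end
```

Under the alternative described above, more encodings would move into the dummy column.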

This would force at least those who understand the issues
to use conversion. But there would still be those that
might do operations on a string labeled "UTF-16BE" under
the impression that this actually works.

It would also mean that each application has to do some
work to distinguish 'really supported' and 'dummy label'
encodings. Or that (as you suggest) conversion would be
automatic, which should work for Unicode-based encodings,
but which might bring up very subtle issues e.g. when
converting from iso-2022-jp to euc-jp (or do you choose
shift_jis?).

Regards,    Martin.

At 22:40 08/09/19, Dave Thomas wrote:
>externally. Might it be easier to support the full set externally, but  
>
#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-20 10:32
(Received via mailing list)
Jo.

In message "Re: [ruby-core:18753] Re: Character encodings - a radical 
suggestion"
    on Sat, 20 Sep 2008 14:28:25 +0900, Martin Duerst 
<duerst@it.aoyama.ac.jp> writes:

|The alternative, which is suggested by this discussion,
|is that we decide on a (pretty high) minimum standard for
|support for an encoding. All encodings that don't reach
|that standard are simply declared dummy and behave as
|such (i.e. the same as binary, or even with less functionality).

I feel you've suggested something important.  Can you be more concrete
on the idea?  Making UTF-{16,32} dummy again?  Or something else?

              matz.
Posted by Tanaka Akira (Guest)
on 2008-09-20 11:17
(Received via mailing list)
In article <E1Kgh9x-0001GF-5j@x61.netlab.jp>,
  Yukihiro Matsumoto <matz@ruby-lang.org> writes:

> UTF-16 and UTF-8 are an easier set, since they are semantically the same.
> But how should we treat ISO-2022-JP for example?

I defined stateless-ISO-2022-JP which is semantically the same
as ISO-2022-JP.  (Actually, it is a subset of Emacs-Mule.)

% ruby -ve '
p Encoding::Converter.asciicompat_encoding("ISO-2022-JP")
p Encoding::Converter.asciicompat_encoding("UTF-16BE")
p Encoding::Converter.asciicompat_encoding("UTF-16LE")
'
ruby 1.9.0 (2008-09-15 revision 19356) [i686-linux]
#<Encoding:stateless-ISO-2022-JP>
#<Encoding:UTF-8>
#<Encoding:UTF-8>

They are required to insert an error notice for conversion,
for example.
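
Building on Encoding::Converter.asciicompat_encoding as shown above, a "transcode on the way in" step could be sketched as follows. `to_ascii_compatible` is a hypothetical helper, not part of Ruby. It converts only when Ruby reports an ASCII-compatible counterpart for the string's encoding.

```ruby
# Hypothetical helper: return +str+ in an ASCII-compatible encoding,
# leaving it untouched when no counterpart is reported (nil).
def to_ascii_compatible(str)
  target = Encoding::Converter.asciicompat_encoding(str.encoding)
  target ? str.encode(target) : str
end

val = "42 \u00B5m".encode("UTF-16LE")
puts "The value is #{to_ascii_compatible(val)}"
```

With the UTF-16 value normalized first, the interpolation from the start of this thread goes through.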
Posted by Martin Duerst (Guest)
on 2008-09-20 11:53
(Received via mailing list)
At 02:13 08/09/18, NARUSE, Yui wrote:


>Yes, they have private use area codepoints, but I think they are not reasonable.
>
>The first reason is that the area is Private Use Area.
>
>Moreover there are some mobile phone careers in Japan and they define own emoji.
>And their PUA codepoints is conflicted.
>http://creation.mb.softbank.jp/web/web_pic_about.html
>
>So they can't be 'Uni'code yet.

Just for everybody's information, the Unicode Technical Committee
(UTC) is working on encoding these into Unicode. But this is not at
all an easy job. Contrary to most characters, most emoji are in color.
Contrary to all previous characters, some emoji are actually
animations. Also, some may have copyright or trademark issues.
So a lot of issues have to be considered carefully.
The UTC is also working on Japanese TV symbols and on emoticons
(presumably from various instant messaging services and the like).

On the other hand, the mappings given by the companies are also
not worked out. As an example, on
http://creation.mb.softbank.jp/web/web_pic_03.html, the copyright
and (R) sign, the zodiac signs, and heart/diamond/clover/...
at least can easily be mapped to pre-existing Unicode characters
without problems.

It would be good if the Japanese mobile carriers and others
would recognize that inventing new "characters" isn't the right
way to satisfy the needs of their customers for fancy little
images, but as with most other stupidities in character encoding,
it takes a while for the people involved directly to figure this
out.

Anyway, if we were really serious in Ruby about handling these,
we would have to introduce encodings such as Shift_JIS-NTT-Docomo,
Shift_JIS-Softbank, and Shift_JIS-kddi-au or so, to make sure
there is no mixup in the original encoding.

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Vincent Isambart (Guest)
on 2008-09-20 13:38
(Received via mailing list)
> It would be good if the Japanese mobile carriers and others
> would recognize that inventing new "characters" isn't the right
> way to satisfy the needs of their customers for fancy little
> images, but as with most other stupidities in character encoding,
> it takes a while for the people involved directly to figure this
> out.
Thankfully I think they started to understand it. Recent phones have
now lots of emojis that are just GIFs, and when you use them in mails,
the mails are sent as HTML mail with attachments and can be read in
any normal mail client. It also allows the users to easily add (or
even create) new ones. However if the mail is sent to an older phone
the image may be badly displayed.

Anyway, the problem with the existing emoji characters will probably
stay for a long time... (unfortunately it would not be the first
problem due to the need to keep compatibility with the existing...)
Posted by mathew (Guest)
on 2008-09-20 17:11
(Received via mailing list)
On Fri, Sep 19, 2008 at 10:03 AM, Tim Bray <Tim.Bray@sun.com> wrote:

> Actually, I think it would be perfectly OK to remove runtime support for
> UTF-16, if you make input and output possible.  I can't think of any
> practical advantages to handling multiple UCS encodings, for regexing or
> parsing or splitting or matching.  UTF-16 is horrible; C# and Java will be
> paying the price for choosing 16-bit characters long after we're dead.  -Tim
>

I agree with this. I haven't looked at Ruby 1.9, but I don't see any point
in supporting UTF-16, UTF-16BE, and UTF-16LE as internal encodings. So long
as you can read and write them, I say keep the complexity out of Ruby's
internals and convert to UTF-8 internally.


mathew
Posted by Yukihiro Matsumoto (Guest)
on 2008-09-20 18:14
(Received via mailing list)
Hi,

In message "Re: [ruby-core:18751] Re: Character encodings - a radical 
suggestion"
    on Sat, 20 Sep 2008 10:00:24 +0900, "Michael Selig" 
<michael.selig@fs.com.au> writes:

|Perhaps we need to go back to basics with this discussion. As a mere  
|English speaker, I do not fully understand the issues that are faced by  
|Japanese and other encodings. What I have gathered from this discussion is  
|(please tell me if I am wrong):
|
|- There are characters that Ruby needs to support which cannot be uniquely  
|mapped to Unicode

Yes, even though they are minor.

|- In fact there are entire character sets that we want to support in Ruby  
|that are not supported in Unicode

Yes, I know two of them.  The first is Mojikyo, which refuses character
unification.  The character set contains 170,000 characters.  At the
time I first heard it, that number was huge, but Unicode is approaching
pretty close (it now has more than 100,000 characters).

The second is GB18030, defined by the Chinese government.  I don't know
the details, but I've heard it officially contains Unicode as its subset.
But the encoding scheme for GB18030 is up to 4 bytes per codepoint, so I
am not sure how it can hold a 21-bit Unicode codepoint in it.
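
On the size question, a back-of-the-envelope count may help. It assumes the four-byte GB18030 form uses the byte ranges 0x81-0xFE, 0x30-0x39, 0x81-0xFE, 0x30-0x39. `gb18030_bytes` is a hypothetical helper that returns nil if the running Ruby cannot perform the conversion.

```ruby
# Number of distinct four-byte sequences under the assumed byte ranges.
four_byte_slots = (0x81..0xFE).size * (0x30..0x39).size *
                  (0x81..0xFE).size * (0x30..0x39).size

# Number of Unicode code points, U+0000 through U+10FFFF.
unicode_code_points = 0x10FFFF + 1

puts four_byte_slots
puts unicode_code_points

# Hypothetical helper: GB18030 bytes for one character, or nil if the
# conversion is unavailable or undefined in this Ruby.
def gb18030_bytes(char)
  char.encode("GB18030").bytes
rescue EncodingError
  nil
end

p gb18030_bytes("A")
```

Under these assumptions the four-byte space alone is larger than the whole Unicode code space, before counting the one-byte and two-byte forms.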

|- There are ambiguous characters in some character sets - same code for  
|different characters

Yes.

|I think it would be a benefit if we all got to understand a bit more:
|
|- How the character ambiguity (eg: Yen/ backslash) issue is handled at the  
|moment - generally, not just with Ruby. ie: how do you know that a printer  
|or screen is going to show the right character?

Either by avoiding conversion (operating on bytes), or by selecting the
proper encoding scheme (out of many very similar encodings, such as
Shift_JIS, CP932, Windows-31J for example).  The conversion tables from
unicode.org are carefully designed to ensure round trips, although that
is the very reason we have so many similar encodings.  If we can choose
(or negotiate) to use the same conversion table at both ends, mojibake
problems are unlikely.
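
A small sketch of why the choice of table matters: the same two bytes are labeled first as Shift_JIS and then as Windows-31J before converting to UTF-8. The byte pair is sample data chosen for illustration, and the code points printed depend on the conversion tables shipped with the Ruby in use.

```ruby
bytes = "\x81\x60".b  # one double-byte character in both encodings

as_sjis  = bytes.dup.force_encoding("Shift_JIS").encode("UTF-8")
as_cp932 = bytes.dup.force_encoding("Windows-31J").encode("UTF-8")

printf("Shift_JIS   -> U+%04X\n", as_sjis.ord)
printf("Windows-31J -> U+%04X\n", as_cp932.ord)

# Converting back with the same table restores the original bytes.
p as_sjis.encode("Shift_JIS").bytes  == bytes.bytes
p as_cp932.encode("Windows-31J").bytes == bytes.bytes
```

Each table round-trips on its own. The trouble starts when one end converts with one table and the other end converts back with a different one.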

|- How the various "non-ascii compatible" encodings are used in practice.  
|eg: it is my understanding that UTF-7 is really only used in email, and  
|that it would be straightforward to immediately transcode it to/from UTF-8  
|in an POP/IMAP library, so UTF-7 could be avoided completely as an  
|"internal" encoding in Ruby. It's as if were were treating UTF-7 like  
|base64 - just a transformation of a "real" encoding. (In fact UTF-16 & 32  
|could be considered the same sort of thing, except they may be used more  
|widely.)

UTF-{16,32}{BE,LE} are non-ascii compatible, but they are safe to
convert into UTF-8 since their difference only lies in the encoding
scheme.  They represent the same character set anyway.  ISO-2022 is
often used in mail and on the web.  The situation is a little bit more
complicated, but basically it can be converted into Unicode as well
(with a slight risk of the yen sign problem).  You can ignore UTF-7.
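
The "same character set, different encoding scheme" point can be checked with a round trip. This is a minimal sketch with made-up sample text that includes a character outside the BMP.

```ruby
text = "Ruby \u{1F48E} \u65E5\u672C\u8A9E"

%w[UTF-16BE UTF-16LE UTF-32BE UTF-32LE].each do |name|
  encoded = text.encode(name)
  back    = encoded.encode("UTF-8")
  puts "#{name}: #{encoded.bytesize} bytes, " \
       "same code points: #{encoded.codepoints == text.codepoints}, " \
       "round trip ok: #{back == text}"
end
```

Only the byte layout changes between the forms, so the conversion loses nothing in either direction.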

|- How a Japanese programmer would handle the situation of dealing with a  
|combination of a Japanese non-Unicode compatible character set, and say a  
|UTF-8 encoding which included non-ascii characters, and non-Japanese ones.  
|ie: Is there a reasonable alternative to encoding both to Unicode &  
|somehow dealing with the "difficult characters" as special cases?

Unicode is getting better each day.  So it now covers almost all
day-to-day problems.  Some cellphone problems are covered by using
private area.

              matz.
Posted by Martin Duerst (Guest)
on 2008-09-21 10:09
(Received via mailing list)
At 01:05 08/09/21, Yukihiro Matsumoto wrote:
>|(please tell me if I am wrong):
>unification. The character set contains 170,000 characters.
Just for general information, this doesn't specifically refer to
CJK unification (i.e. unification of the same ideograph from
China, Japan, Korea, and so on) but is more about general glyph
(dis)unification. This means that minor differences in how exactly
to write a character are given separate codepoints. This may help
in historical research (some variants are more used by some writers
or in some centuries than others,...), but in general isn't helpful,
on the contrary, it will make data processing more difficult.

However, even in daily life, there is some need to distinguish
some (ideographic) glyph variants in certain cases. For this,
Unicode contains variation selectors (U+FE00-FE0F and U+E0100-E01EF).
These are used after a base character, based on a registration in the
Ideographic Variation Database (http://www.unicode.org/ivd/).
There is currently only the Adobe-Japan1 collection registered, see
http://www.unicode.org/ivd/data/2007-12-14/IVD_Charts.pdf.
For glyph variants, it would be no problem (although quite some work,
of course) for Mojikyo to register them as Ideographic Variations
in this database. This would make all these Variations usable
in Unicode.
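
To make the mechanism concrete: a variation sequence is simply a base
character followed by one selector code point. The sketch below only
classifies code points using the two ranges quoted above; the helper
name and the sample pair are illustrative, and the pair has not been
checked against the IVD.

```ruby
# Variation selector ranges as quoted in the text above.
VARIATION_SELECTORS = [0xFE00..0xFE0F, 0xE0100..0xE01EF].freeze

# True if the code point falls into one of the selector ranges.
def variation_selector?(codepoint)
  VARIATION_SELECTORS.any? { |range| range.cover?(codepoint) }
end

# Illustrative sample: an ideograph followed by a supplementary selector.
sequence = "\u845B\u{E0100}"
sequence.each_codepoint do |cp|
  kind = variation_selector?(cp) ? "variation selector" : "base character"
  puts format("U+%04X  %s", cp, kind)
end
```

Note that such a sequence counts as two characters for String#length,
which is one reason glyph variants make data processing harder.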

From http://www.mojikyo.com/info/konjaku/index.html, we can also
see the following:

                          Mojikyo  Unicode
漢字 (kanji)              150,366  A bit more than double of what Unicode
                                   has. In my guess mostly glyph variants,
                                   but there sure are a few not yet
                                   encoded characters, too.

非漢字 (non-kanji)          2,256  Kana variants could be encoded with
                                   variation selectors

梵字 (bonji)                1,875  Don't know, but because these are of
                                   Indic origin, my guess is that Unicode
                                   would use a different encoding model
                                   with much less characters

甲骨文字 (oracle bone)      3,364  space tentatively allocated
(http://www.internationalscientific.org/CharacterAS...)
                                   (U+32000-327FF), see
                                   http://unicode.org/roadmaps/tip/

西夏文字/Tangut             6,000  under consideration for encoding

水族文字                      145  did not find any info, but I'm quite
                                   sure a well-written proposal would be
                                   accepted

篆書 (seal characters)     10,969  Very old style, but most of them with
(http://www.internationalscientific.org/CharacterAS...)
                                   clear equivalents to modern ideographs.
                                   Still used on seals. To unify or not
                                   to unify is the big question.

It seems that Mojikyo is currently handled from two sides:
www.mojikyo.org for the non-commercial side, and www.mojikyo.com for
the commercial side (with various products published by Kinokuniya, a
big Japanese publisher). That leads to somewhat complicated usage
conditions (you can use some fonts for free for yourself, but have to
pay if you use them in a paper you publish,...), not only for the fonts
(would be quite understandable) but also for some of the data.

>At the
>time I first heard that number was huge, but Unicode is approaching
>pretty close (it now has more than 100,000 characters).

Conclusion: If the Mojikyo people wanted, they could get most if
not all of their stuff into Unicode in one way or another. But
similar to all other work of serious character encoding, it
would be a lot of work.


>GB18030, defined by Chinese government.  I don't know the detail, but
>I've heard it officially contains Unicode as its subset.  But encoding
>scheme for GB18030 is upto 4bytes per codepoint, so I am not sure how
>it can holds 21bit Unicode codepoint in it.

4 bytes raw would be 32 bits, so that should be enough to hold 21 bits.
Because some characters use only one or two bytes, the overall code
space is smaller, about 1,600,000 codepoints. This is still larger than
Unicode (around 1,100,000 codepoints), but the difference is currently
not used at all.
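
These round figures can be reproduced with a little arithmetic. The
byte ranges below are the ones usually given for GB 18030 (one byte
0x00-0x7F; two bytes with lead 0x81-0xFE and trail 0x40-0x7E or
0x80-0xFE; four bytes alternating 0x81-0xFE and 0x30-0x39). They are
stated from memory, not taken from this thread, so please check them
against the standard.

```ruby
# Assumed GB 18030 byte ranges (see the note above the code).
ONE_BYTE  = (0x00..0x7F).size
TWO_BYTE  = (0x81..0xFE).size * ((0x40..0x7E).size + (0x80..0xFE).size)
FOUR_BYTE = ((0x81..0xFE).size * (0x30..0x39).size)**2

# Total number of distinct byte sequences the scheme can express.
def gb18030_code_space
  ONE_BYTE + TWO_BYTE + FOUR_BYTE
end

# Unicode code points run from U+0000 to U+10FFFF.
UNICODE_CODE_SPACE = 0x10FFFF + 1

puts "GB 18030: #{gb18030_code_space}"
puts "Unicode:  #{UNICODE_CODE_SPACE}"
puts "Unused:   #{gb18030_code_space - UNICODE_CODE_SPACE}"
```

Under these assumptions the totals come to 1,611,668 and 1,114,112,
which matches the "about 1,600,000" and "around 1,100,000" above.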

For more details, please see
http://www.icu-project.org/docs/papers/unicode-gb1...
and http://unicode.org/faq/han_cjk.html#23.
(I was of the impression that GB 18030 contains a few characters
similar to the Japanese せ゜ and friends in JIS X 0213, but I haven't
found any such information anymore, so it may not be true).

So I don't think there is any real problem for GB 18030 and Unicode.


>
>Either avoiding conversion (operation based on bytes), or selecting
>proper encoding scheme (out of many very similar encodings, such as
>Shift_JIS, CP932, Windows-31J for example).  Conversion table from
>unicode.org is carefully designed to ensure roundtrip, although that
>is the very reason we have so many similar encoding.  If we can choose
>(or negotiate) to use same conversion table at both ends, it is
>unlikely to have mojibake problems.

Yes, roundtrip is easy if you use the same conversion tables, but
unfortunately, the major vendors (Microsoft, Apple, IBM,...) messed
up with minor variations (usually just a few codepoints out of
several thousand).
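
One commonly cited instance of such a variation, as far as I know, is
the byte pair 0x81 0x60, which the JIS-based Shift_JIS table and the
Microsoft Windows-31J table map to different Unicode code points. The
sketch below just asks Ruby's own transcoder what it does; the helper
name is made up, and the result depends on the tables shipped with your
Ruby.

```ruby
# The two bytes in question, kept as raw binary data.
WAVE_DASH_BYTES = "\x81\x60".b

# Interpret the bytes in the named encoding and return the Unicode
# code point that Ruby's transcoder produces for them.
def decode_codepoint(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).encode("UTF-8").ord
end

%w[Shift_JIS Windows-31J].each do |name|
  cp = decode_codepoint(WAVE_DASH_BYTES, name)
  puts format("%-12s 0x8160 -> U+%04X", name, cp)
end
```

If the two lines differ, data labeled with the "wrong" one of the two
encodings will not round-trip, even though the bytes are identical.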

As for how you know that a printer or screen is going to show the
right character, you simply don't, in particular e.g. on the Web.
0x5C will show as a Yen sign on Japanese systems with fonts tweaked
for Japanese, but will show as a backslash otherwise. Japanese
IT professionals have to just learn about this.


>convert into UTF-8 since their difference only lies in encoding
>scheme.  They represent same character set anyway.  ISO-2022 is used
>often in mails and web. 

That would be iso-2022-JP. ISO 2022 is a standard that defines a set
of tools to create encodings, not an encoding in and by itself.
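
For illustration, this is roughly what iso-2022-JP looks like from
Ruby: the encoding is one of the "dummy" encodings, and the bytes are
7-bit text bracketed by escape sequences (ESC $ B to switch to JIS X
0208, ESC ( B to switch back to ASCII). The helper name is made up;
the sample is the two-character word "kanji".

```ruby
# Return the raw ISO-2022-JP bytes for a UTF-8 string.
def iso2022jp_bytes(text)
  text.encode("ISO-2022-JP").b
end

kanji = "\u6F22\u5B57"

p Encoding::ISO_2022_JP.dummy?   # stateful, so Ruby treats it as a dummy encoding
p iso2022jp_bytes(kanji)         # escape sequences are visible in the output
```

Because the state lives in those escape sequences, String operations
cannot safely work on such data directly, which is why converting it at
the boundary is the usual approach.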

Regards,    Martin.

>Unicode is getting better each day.  So it now covers almost all
>day-to-day problems.  Some cellphone problems are covered by using
>private area.
>
>                                                       matz.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Michael Selig (Guest)
on 2008-09-22 02:05
(Received via mailing list)
On Sun, 21 Sep 2008 02:05:30 +1000, Yukihiro Matsumoto
<matz@ruby-lang.org> wrote:

> |- How a Japanese programmer would handle the situation of dealing with a
> |combination of a Japanese non-Unicode compatible character set, and say a
> |UTF-8 encoding which included non-ascii characters, and non-Japanese ones.
> |ie: Is there a reasonable alternative to encoding both to Unicode &
> |somehow dealing with the "difficult characters" as special cases?
>
> Unicode is getting better each day.  So it now covers almost all
> day-to-day problems.  Some cellphone problems are covered by using
> private area.

I infer from this that really Unicode is the only (imperfect) solution
for true m17n where we have a mixture of completely different character
sets (eg: Japanese & Arabic)?
What I think this means is that there is no "one size fits all"
solution, unfortunately.

So I have an alternate suggestion. Maybe I should rename this thread
"Character encodings - a less radical suggestion" :-)

Ruby already has "Encoding::default_external", so why not also have
"default_internal"? This option would either be left unset (or nil I
guess) or set to an encoding, likely to be UTF-8 in practice, but maybe
there would be a use for it to choose say one of the Japanese encodings
if you have a variety of Japanese encodings to handle.

When "default_internal" is nil, Ruby will work as it does now:
- Ruby libraries such as I/O & network libraries will by default return
character data in the external encoding
- No transcoding will take place unless specifically requested by the
Ruby program
- The Ruby program is responsible for ensuring that the encodings are
what it expects and that strings passed to & from Ruby libraries are in
the encoding the library expects; "Encoding Compatibility Errors" will
occur if it is not careful.

When "default_internal" is set to an encoding "E":
- Ruby libraries such as I/O & networking libraries will by default
transcode to/from internal encoding E (unless specifically overridden by
an option to the class)
- A Ruby program can then be confident that all strings it handles will
be in encoding E, so it doesn't have to worry about encoding
compatibility. For example it can be sure that if "s" is "abc" then
"s == 'abc'" is true, no matter where the string "s" originated from.
- Assuming that E is an "ascii-compatible" encoding, the Ruby programmer
doesn't have to face issues like "The value is #{val}" substitution
failing because "val" is non-ascii compatible.
- The "downside" as pointed out by a number of people is that not all
characters may be transcoded cleanly or even be supported (driving
without a seat-belt? :-)), but then programs requiring this level of
control should probably not use this feature.
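
The substitution problem mentioned in the list above is easy to
reproduce. This is a minimal sketch; the method name describe is made
up, and the second call shows the manual workaround (transcoding to an
ASCII-compatible encoding first) that a default internal encoding is
meant to make unnecessary.

```ruby
# Interpolate a value into a UTF-8 template, reporting the failure
# instead of letting the exception propagate.
def describe(val)
  "The value is #{val}"
rescue Encoding::CompatibilityError => e
  "failed: #{e.class}"
end

utf16 = "caf\u00E9".encode("UTF-16LE")

puts describe(utf16)                  # UTF-16 data: interpolation fails
puts describe(utf16.encode("UTF-8"))  # transcoded first: works
```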

Consequences of this suggestion:
- Don't have to change the current implementation of encodings, String
or Regexp
- Avoids "automagical transcoding" within String & Regexp methods
- Responsibility of implementing "default_internal" lies with a certain
set of Ruby libraries like IO & networking

Hope this makes sense.
Mike
Posted by Martin Duerst (Guest)
on 2008-09-22 04:44
(Received via mailing list)
Hello Michael,

Many thanks for your proposal. Earlier, when I proposed some
general "encoding policies" to deal with this and similar
problems, the main problem brought up was that it would
interoperate badly with libraries. But looking at your
concrete proposal, it seems to me that overall, the problems
wouldn't actually be that serious.

Therefore, I think we should seriously consider this proposal,
and hopefully implement it before Sept. 25th. In terms of
implementation, I don't think it should be that difficult,
but it may be quite a bit of work to check
Encoding::default_internal in all the affected methods.

In terms of potential problems, I see the following:
- A library sets Encoding::default_internal. That would lead
  to serious problems, and should be clearly advised against
  in the documentation. Libraries either have to be written
  in a general way, or have to document that they only work
  with certain values of Encoding::default_internal
  (this proposal would therefore help you, but not e.g.
   James Gray for the CSV library)
- Encoding::default_internal is set to some dummy or non-ASCII-
  compatible encoding, which may lead to some hiccups.
  We may want to make that impossible or advise against it.
  (the main use is UTF-8 anyway)
- We should think through various scenarios for output.
  I can't think of any problems just now, I just noticed
  the absence of considerations for output below.

The advantages that I see with this proposal are:
- It gets rid of the bad usability for "r:UTF-16LE:UTF-8"
  (matz, ruby-core:18666)
- It clearly helps "Unicode inside" applications, but is
  not limited to any encoding and may be helpful for other
  encodings as well.
- It fits well within the rest of the naming scheme and the
  overall idea of having several specific encodings to make
  the work of the user easier. If we wouldn't have
  Encoding::default_external, using Ruby with a single
  local encoding would be a big pain. Introducing
  Encoding::default_internal makes using Ruby with
  "Unicode inside" much less of a pain.

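For readers who have not used it, the "r:UTF-16LE:UTF-8" form names the
external and the internal encoding explicitly on every open. A minimal
sketch follows; the method name is made up, the file is a temporary one,
and "rb" is used because a non-ASCII-compatible external encoding needs
binary mode on some platforms.

```ruby
require "tempfile"

# Write +text+ to a temporary file as UTF-16LE bytes, then read it back
# through an "external:internal" mode string so that the caller only
# ever sees UTF-8.
def read_utf16le_as_utf8(text)
  Tempfile.create("utf16le-demo") do |tmp|
    tmp.binmode
    tmp.write(text.encode("UTF-16LE"))
    tmp.flush
    File.open(tmp.path, "rb:UTF-16LE:UTF-8", &:read)
  end
end

result = read_utf16le_as_utf8("caf\u00E9")
p result.encoding
p result == "caf\u00E9"
```

With a default internal encoding, the ":UTF-8" part would no longer
have to be repeated at every call site.
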

At 08:56 08/09/22, Michael Selig wrote:
>> Unicode is getting better each day.  So it now covers almost all
>> day-to-day problems.  Some cellphone problems are covered by using
>> private area.
>
>I infer from this that really Unicode is the only (imperfect) solution for  
>true m17n where we have a mixure of completely different character sets  
>(eg: Japanese & Arabic)?
>What I think this means is that there is no "one size fits all" solution,  
>unfortunately.

Yes. Unicode fits most of the time, some local encoding fits in many
cases (in particular small scripts), and for some very special jobs,
you may have to use something else (a special encoding such as Mojikyo,
the Unicode private areas, an additional level of markup,...).

>So I have an alternate suggestion. Maybe I should rename this thread  
>"Character encodings - a less radical suggestion" :-)

I just did :-).

Regards,    Martin.

>program
>in encoding E, so it doesn't have to worry about encoding compatibility.  
>Consequences of this suggestion:
>
#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by James Gray (bbazzarrakk)
on 2008-09-22 04:59
(Received via mailing list)
On Sep 21, 2008, at 9:35 PM, Martin Duerst wrote:

> In terms of potential problems, I see the following:
> - A library sets Encoding::default_internal. That would lead
>  to serious problems, and should be clearly advised against
>  in the documentation. Libraries either have to be written
>  in a general way, or have to document that they only work
>  with certain values of Encoding::default_internal
>  (this proposal would therefore help you, but not e.g.
>   James Gray for the CVS library)

I really think this is a bigger minus than this implies.  I can name a
lot of libraries that just flat out expect UTF-8 and choke and die on
anything else.  Ruby 1.8 has trained us to think this way for many
years.

Now, if someone were to change Encoding.default_internal then all these
libraries will unexpectedly have data changed on them.  I'm pretty
sure that would cause massive damage.

James Edward Gray II
Posted by Vincent Isambart (Guest)
on 2008-09-22 05:03
(Received via mailing list)
> - Encoding::default_internal is set to some dummy or non-ASCII-
>  compatible encoding, which may lead to some hickups.
>  We may want to make that impossible or advise against.
>  (the main use is UTF-8 anyway)
I'd say we should make it impossible. If you are playing with dummy or
non-ASCII-compatible encodings anyway you must know what you do. So
not being able to rely on default_internal in this case would make
perfect sense to me.

> - We should think through various scenarios for output.
>  I can't think of any problems just now, I just noticed
>  the absence of considerations for output below.
I have not thought about it much, but logically:
Input: default_external
↓ conversion (if default_internal != default_external and encoding not
  specified)
Internal: default_internal
↓ conversion (if default_internal != default_external and encoding not
  specified)
Output: default_external
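
The same flow can be written out by hand with String#encode. This is
only a sketch of the idea, not of how IO would implement it; the method
name and keyword arguments are made up.

```ruby
# Decode raw input from +external+ into +internal+, hand it to the
# block for processing, and encode the block's result back to +external+.
def with_internal_encoding(raw_bytes, external:, internal: Encoding::UTF_8)
  text = raw_bytes.dup.force_encoding(external)
  text = text.encode(internal) unless external == internal
  result = yield(text)
  external == internal ? result : result.encode(external)
end

input  = "\u3042\u3044".encode("Shift_JIS").b   # bytes as they arrive from outside
output = with_internal_encoding(input, external: Encoding::Shift_JIS) do |text|
  text.reverse                                  # the program only sees UTF-8
end

p output.encoding
```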

Vincent Isambart
Posted by Vincent Isambart (Guest)
on 2008-09-22 05:10
(Received via mailing list)
> I really think this a bigger minus than this implies.  I can name a lot of
> libraries that just flat out expect UTF-8 and choke and die on anything
> else.  Ruby 1.8 has trained us to think this way for many years.

As we said before the main use of default_internal is for
Unicode-inside applications so most of the time it would be UTF-8
anyway...

> Now, if someone were to change Encoding.default_internal then all these
> libraries will unexpectedly having data changed on them.  I'm pretty sure
> that would cause massive damage.

This makes me think that there should be _at least_ a warning (or
an outright error) if some code tries to change default_internal when
it was already set (if it's set to the same encoding, we could just
ignore it).
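
A sketch of that rule, using the stricter "forbid" variant. It is
written against a made-up module, EncodingPolicy, so that it does not
touch the real Encoding class.

```ruby
# Hypothetical stand-in for the proposed setting; not part of Ruby.
module EncodingPolicy
  class AlreadySetError < StandardError; end

  @default_internal = nil

  class << self
    attr_reader :default_internal

    # First assignment wins. Re-assigning the same encoding is ignored;
    # assigning a different one raises.
    def default_internal=(encoding)
      encoding = Encoding.find(encoding) unless encoding.is_a?(Encoding)
      if @default_internal.nil?
        @default_internal = encoding
      elsif @default_internal != encoding
        raise AlreadySetError, "default_internal is already #{@default_internal}"
      end
    end
  end
end

EncodingPolicy.default_internal = "UTF-8"
EncodingPolicy.default_internal = Encoding::UTF_8   # same value: ignored
begin
  EncodingPolicy.default_internal = "Shift_JIS"     # different value: refused
rescue EncodingPolicy::AlreadySetError => e
  puts "refused: #{e.message}"
end
```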
Posted by Michael Selig (Guest)
on 2008-09-22 05:21
(Received via mailing list)
On Mon, 22 Sep 2008 12:51:19 +1000, James Gray 
<james@grayproductions.net>
wrote:

>
> I really think this a bigger minus than this implies.  I can name a lot  
> of libraries that just flat out expect UTF-8 and choke and die on  
> anything else.  Ruby 1.8 has trained us to think this way for many years.

That is very true, but the situation exists independent of
"default_internal". The mere fact that Ruby supports multiple encodings
means that every library *should* check the encodings of strings passed
to it and do the appropriate thing, unless it is acceptable to just let
an "Encoding Compatibility Error" be raised.

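As a rough idea of what "check and do the appropriate thing" could look
like for a library that works in UTF-8 internally (the method name and
the choice to reject binary and invalid data are assumptions for this
sketch):

```ruby
# Normalize a caller-supplied string to UTF-8 at the library boundary.
def to_utf8(input)
  raise ArgumentError, "binary data is not text" if input.encoding == Encoding::ASCII_8BIT
  raise ArgumentError, "invalid #{input.encoding} byte sequence" unless input.valid_encoding?

  input.encoding == Encoding::UTF_8 ? input : input.encode(Encoding::UTF_8)
end

p to_utf8("\u3042".encode("EUC-JP")).encoding
p to_utf8("abc".encode("UTF-16LE"))
```

A library written this way behaves the same whatever the application's
default encodings are.
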
> Now, if someone were to change Encoding.default_interal then all these  
> libraries will unexpectedly having data changed on them.  I'm pretty  
> sure that would cause massive damage.

Also very true, but people can do all sorts of stupid things that will
make things break. Normally you would expect that "default_internal"
would be set once at the very start, but who are we to enforce that - one
day someone will want to change it mid-stream and probably for a good
reason!

Mike.
Posted by Michael Selig (Guest)
on 2008-09-22 05:34
(Received via mailing list)
On Mon, 22 Sep 2008 12:35:49 +1000, Martin Duerst 
<duerst@it.aoyama.ac.jp>
wrote:

>
> Therefore, I think we should seriously consider this proposal,
> and hopefully implement it before Sept. 25th. In terms of
> implementation, I don't think it should be that difficult,
> but it may be quite a bit of work to check
> Encoding::default_internal in all the affected methods.

Wow, that is rather ambitious - 3 days?
The bulk of the implementation will be in the libraries, and I think
many of them need updating to cope with non-ASCII encodings anyhow.

> - We should think through various scenarios for output.
>   I can't think of any problems just now, I just noticed
>   the absence of considerations for output below.

I did think about output to a certain extent, and one good thing is that
IO already seems to automatically transcode to the "external" encoding
at the moment. As for other classes, again I think most need updating to
support multiple encodings anyhow. They will at a minimum need a way of
having the user pass the "external" encoding (defaulting to
"default_external"), and do the transcode as necessary, based on the
encoding of the data to be output. However, as with IO, this behaviour
probably should happen no matter whether "default_internal" is
implemented or not.

Cheers
Mike
Posted by Martin Duerst (Guest)
on 2008-09-22 05:39
(Received via mailing list)
At 11:51 08/09/22, James Gray wrote:
>
>I really think this a bigger minus than this implies.  I can name a  
>lot of libraries that just flat out expect UTF-8 and choke and die on  
>anything else.  Ruby 1.8 has trained us to think this way for many  
>years.

Having a library expect UTF-8 is fine, if it's well known.
The idea with Encoding::default_internal, the way I understand it,
is not to force any particular working style on anybody.

But in order to work for 1.9, these libraries will have to
write something like "r:UTF-16LE:UTF-8" anyway for their
i/o, both with and without Encoding::default_internal.
And for data passed internally to the library (method attributes,...),
that data will either be in UTF-8 or not, again independent
of Encoding::default_internal.

The only thing that Encoding::default_internal helps (but this is
significant) is that it makes things easier for an application
programmer who wants to use a single encoding inside. If that
encoding is chosen to be UTF-8, it will also significantly
increase the chances that the application and the aforementioned
libraries will work together well. In this sense, it is helpful
for libraries working only in e.g. UTF-8.


>Now, if someone were to change Encoding.default_interal then all these  
>libraries will unexpectedly having data changed on them.  I'm pretty  
>sure that would cause massive damage.

I disagree, but maybe you have a scenario in mind that I didn't
think about. Can you be more specific?

Regards,    Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Michael Selig (Guest)
on 2008-09-22 05:44
(Received via mailing list)
While I think about it, there is at least one more issue with
"default_internal" - support for ASCII-8BIT aka "BINARY".

I imagine that most people who use this encoding actually want to do bit
or byte manipulation, not character. Some other languages have a
separate class for "byte strings" to handle this situation.
Therefore I think if you use an "external" encoding of ASCII-8BIT,
transcoding to/from the "default_internal" encoding should not happen.
This behaviour may be a bit confusing, but I cannot immediately think of
a better idea.

Mike
Posted by Martin Duerst (Guest)
on 2008-09-22 10:43
(Received via mailing list)
At 12:35 08/09/22, Michael Selig wrote:
>While I think about it, there is at least one more issue with  
>"default_internal" - support for ASCII-8BIT aka "BINARY".
>
>I imagine that most people who use this encoding actually want to do bit  
>or byte manipulation, not character. Some other languages have a separate  
>class for "byte strings" to handle this situation.
>Therefore I think if you use an "external" encoding of ASCII-8BIT,  
>transcoding to/from the "default_internal" encoding should not happen.
>This behaviour may be a bit confusing, but I cannot immediately think of a  
>better idea.

Very good point. Anything other than pure ASCII won't convert anyway.
And ASCII itself will work fine when labeled as ASCII-8BIT, even
together with UTF-8.
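
Both statements can be checked directly. In this sketch the helper name
is made up; String#b simply returns a copy labeled ASCII-8BIT.

```ruby
# Try to transcode to UTF-8, returning the error class on failure.
def try_encode_utf8(str)
  str.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  e.class
end

ascii_binary = "abc".b    # pure ASCII bytes labeled as binary
high_binary  = "\xFF".b   # a byte outside the ASCII range

p try_encode_utf8(ascii_binary)          # converts without trouble
p try_encode_utf8(high_binary)           # no defined conversion
p((ascii_binary + "\u3042").encoding)    # ASCII-only binary mixes with UTF-8
```

So skipping the transcoding step for ASCII-8BIT loses nothing: the only
data that could be converted is data that already works as it is.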

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by James Gray (bbazzarrakk)
on 2008-09-22 15:02
(Received via mailing list)
On Sep 21, 2008, at 10:29 PM, Martin Duerst wrote:

> At 11:51 08/09/22, James Gray wrote:
>
>> Now, if someone were to change Encoding.default_interal then all  
>> these
>> libraries will unexpectedly having data changed on them.  I'm pretty
>> sure that would cause massive damage.
>
> I disagree, but maybe you have a scenario in mind that I didn't
> think about. Can you be more specific?

Encoding.default_internal = "Shift_JIS"

James Edward Gray II
Posted by James Gray (bbazzarrakk)
on 2008-09-22 15:08
(Received via mailing list)
On Sep 21, 2008, at 10:01 PM, Vincent Isambart wrote:

>> Now, if someone were to change Encoding.default_internal then all  
>> these
>> libraries will unexpectedly having data changed on them.  I'm  
>> pretty sure
>> that would cause massive damage.
>
> This makes me think that there should be _at least_ a warning (or
> completely forbid) if some code tries to change default_internal and
> it was already set (if it's set to the same encoding, we could just
> ignore it).

I just can't stop thinking this is too dangerous.  It's really just a
big global variable that affects everything and we know that's usually
bad, right?

Perl has always had a variable that allowed you to change the starting
index of arrays.  If you didn't like the fact that arrays counted from
zero, you could switch it to one.  If you do though, pretty much all
of the libraries that ship with Perl as well as those on the CPAN
start having issues.  I really feel like this would be the same
thing.  By the way, this "feature" is so evil, I believe it's finally
being removed in Perl 6.

James Edward Gray II
Posted by James Gray (bbazzarrakk)
on 2008-09-22 15:12
(Received via mailing list)
On Sep 21, 2008, at 9:35 PM, Martin Duerst wrote:

> In terms of potential problems, I see the following:
> - A library sets Encoding::default_internal. That would lead
>  to serious problems, and should be clearly advised against
>  in the documentation. Libraries either have to be written
>  in a general way, or have to document that they only work
>  with certain values of Encoding::default_internal
>  (this proposal would therefore help you, but not e.g.
>   James Gray for the CVS library)

I really think we need to avoid any solution that means we will need
to change all existing libraries, even just to declare their supported
encodings.  Enough libraries are already broken on 1.9 without us
adding to that and so many great libraries are no longer maintained at
all.

The current situation is probably that we have to be very careful what
we pass into these Unicode only libraries to get them to work.  That's
far from ideal but, it's better than having the library fail to load
at all due to some global setting I may not have even created
(assuming I required code that made the change).

James Edward Gray II
Posted by Michael Selig (Guest)
on 2008-09-23 05:07
(Received via mailing list)
On Mon, 22 Sep 2008 23:03:12 +1000, James Gray 
<james@grayproductions.net>
wrote:

>
> I really think we need to avoid any solution that means we will need to  
> change all existing libraries, even just to declare their supported  
> encodings.  Enough libraries are already broken on 1.9 without us adding  
> to that and so many great libraries are no longer maintained at all.
>
> The current situation is probably that we have to be very careful what  
> we pass into these Unicode only libraries to get them to work.  That's  
> far from ideal but, it's better than having the library fail to load at  
> all due to some global setting I may not have even created (assuming I  
> required code that made the change).

As long as "default_internal" is used sanely, I actually think that it
may IMPROVE the library support situation, because its use will make
"encoding compatibility errors" less likely to rear their ugly heads.

As long as IO obeys default_internal's setting, I think most other
libraries should just work. I quickly checked "OpenURI", for example,
and (assuming I understand the code correctly) it calls IO#set_encoding
passing the charset read from the HTTP header, setting the "external
encoding" of the socket. So as long as IO leaves the "internal encoding"
set to the default_internal setting, open-uri should work as required,
returning the data in the default_internal encoding.
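
The mechanism described here can be tried without a network connection
by using a pipe in place of the socket. This is a sketch only: the
method name is made up, the internal encoding is passed explicitly
rather than coming from any default, and it does not reproduce what
open-uri actually does.

```ruby
# Feed raw bytes through a pipe and read them back after declaring the
# external charset (as a library would after parsing a header) together
# with the internal encoding the program wants to work in.
def read_transcoded(raw_bytes, charset, internal = "UTF-8")
  reader, writer = IO.pipe
  writer.binmode
  writer.write(raw_bytes)
  writer.close
  reader.set_encoding(charset, internal)
  reader.read
ensure
  [reader, writer].each { |io| io.close if io && !io.closed? }
end

body = "\u3042\u3044".encode("EUC-JP").b   # bytes "from the wire"
text = read_transcoded(body, "EUC-JP")
p text.encoding
```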

By "sanely" I mean that default_internal is set at the start of the
program, and not changed (or at least not changed between reads of a
file, for instance). Also if libraries supporting only Unicode are used
then it should either NOT be set (and the Ruby program must then be
careful about what it passes to them) or be set to UTF-8. Similarly if
the library only supports ASCII, you wouldn't want to set
default_internal to a non-ascii compatible encoding (very unlikely I
think).

I guess if the possibility of changing "default_internal" seems too
problematic, it could be implemented the way "default_external" is -
read-only and set either via a command line flag or to a default.
Perhaps the default should simply be the encoding of the ruby program
itself. But this idea would mean that for Ruby to behave as it does at
the moment, you would have to specifically turn it off somehow.

Mike
Posted by Austin Ziegler (austin)
on 2008-09-23 06:05
(Received via mailing list)
On Mon, Sep 22, 2008 at 11:04 PM, Michael Selig 
<michael.selig@fs.com.au> wrote:
> As long as "default_internal" is used sanely, I actually think that it may
> IMPROVE the library support situation, because its use will make "encoding
> compatibility errors" less likely to rear their ugly heads.

What if it's "set once"? It's treated as essentially frozen after it's
set for the first time. Something like:

  # Sketch: allow exactly one assignment, then refuse further changes.
  def Encoding.default_internal=(encoding)
    raise "Internal Encoding Already set" if @default_internal_set
    @default_internal_set = true
    @default_internal = encoding
  end

It would be treated as read-only if set by a command-line parameter or
after the first time it's set to an explicit value.

This would discourage people from setting it in libraries (it would
break automatically).

-austin
Posted by Meinrad Recheis (Guest)
on 2008-09-23 11:44
(Received via mailing list)
On Tue, Sep 23, 2008 at 5:04 AM, Michael Selig 
<michael.selig@fs.com.au>wrote:

>>>  with certain values of Encoding::default_internal
>> pass into these Unicode only libraries to get them to work.  That's far from
> libraries should just work. I quickly checked "OpenURI", for example, and
> what it passes to it) or be set to UTF-8. Similarly if the library only
> supports ASCII, you wouldn't want to set default_internal to a non-ascii
> compatible encoding (very unlikely I think).


We also have to consider the fact that in a multi-threaded application
changing a global variable that affects all threads is potentially
dangerous. Thus, if some library were to change the default_internal
encoding temporarily, this might have unforeseeable consequences in
other libraries or user code in other threads. I'd advise against having
it changeable in code at all, but only by a command line switch.

-- henon
Posted by Michael Selig (Guest)
on 2008-09-24 01:39
(Received via mailing list)
On Mon, 22 Sep 2008 12:35:49 +1000, Martin Duerst 
<duerst@it.aoyama.ac.jp>
wrote:

> Therefore, I think we should seriously consider this proposal,
> and hopefully implement it before Sept. 25th.

I guess all you guys are busy on other things at the moment, so I am
happy to implement "default_internal" at least at the Ruby internal C
level, but unfortunately it wouldn't be before the weekend.

Given the various problems raised about changing default_internal, I now
agree that it is probably for the best if it were implemented like
default_external - set only at the start, with
Encoding::default_internal read-only, no "Encoding::default_internal=".

So I would do the following:

- extend the current -E command line option to have an optional setting
for default_internal in the form "ext:int" - same format as the encoding
"mode" options in IO.

- modify IO to use "default_internal" if no internal encoding is
specified, and the file's external encoding is *not* ASCII-8BIT (to
allow for "binary" I/O).

- add class method "default_internal" to Encoding
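
The IO rule in the second item can be written down as a small decision
function. This is a sketch of the proposal as described, not of Ruby's
actual io.c; the method and parameter names are made up, and treating
"default_internal equals the external encoding" as "no transcoding" is
an added assumption.

```ruby
# Decide which internal encoding a newly opened IO should transcode to.
# Returns nil when no transcoding should take place.
def effective_internal(external:, explicit_internal: nil, default_internal: nil)
  return explicit_internal if explicit_internal      # caller's choice wins
  return nil if external == Encoding::ASCII_8BIT     # "binary" I/O is left alone
  return nil if default_internal.nil? || default_internal == external

  default_internal
end

p effective_internal(external: Encoding::Shift_JIS, default_internal: Encoding::UTF_8)
p effective_internal(external: Encoding::ASCII_8BIT, default_internal: Encoding::UTF_8)
p effective_internal(external: Encoding::Shift_JIS)
```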

The only question I still have is "what should it default to if not
specified on the command line"?

I think it should default to the encoding specified in the main ruby
source file. That means that the internal encoding will match the
encoding used by the programmer. Is this sensible? Would that break
anything? Once set, default_internal can't be changed, so that means
there would be no way of turning it off if the encoding is specified in
the ruby source, unless another option is introduced. Is this a problem?

If no encoding is specified in the main ruby source file, then what? Set
it to the same as "default_external"? Set it to nil (ie: no default
transcoding)?

This is not backward compatible with Ruby 1.9.0 which has no IO
transcoding by default no matter what the -E & encoding in the source
are set to. Any better ideas?

Have I left anything out?
Please let me know what you would like me to do, if anything.

Cheers
Mike
Posted by Martin Duerst (Guest)
on 2008-09-24 13:02
Attachment: patch_default_internal.txt (3 KB)
(Received via mailing list)
At 12:25 08/09/22, Michael Selig wrote:
>Wow, that is rather ambitious - 3 days?
Well, that's the deadline for feature changes for 1.9.1.
It would be a real pity to wait for 2.0 for this.
The feature freeze wiki at
http://redmine.ruby-lang.org/wiki/ruby/DevelopersM...
says that default_internal is currently pending, but that
this should be discussed/settled this week.

Anyhow, I had a look at the code, and it doesn't seem to be that
difficult. The function io_extract_encoding_option in io.c
seems to be central. I'm attaching a patch, which I hope is
a good start. I'm also writing to ruby-dev (in Japanese)
because that's where the real experts are.
The patch isn't as strict as your proposal with respect
to re-setting, but I'm fine either way.

I have tested this patch with code like the following
(called with -Eutf-8, -Eshift_jis, -Eeuc-jp, and without -E
option, in all combinations)

>>>>
# Internal encoding under test.
Encoding.default_internal = 'utf-8'
      # tested with 'utf-8', 'shift_jis', and 'euc-jp'

# Write the same five hiragana in three different external encodings.
s = "\u3042\u3044\u3046\u3048\u304A"
File.open('testout1.txt', 'w:shift_jis') do |f| f.write s end
File.open('testout2.txt', 'w:euc-jp') do |f| f.write s end
File.open('testout3.txt', 'w:utf-8') do |f| f.write s end

# Read each file back with an explicit external encoding and print the
# encoding of the resulting string.
File.open('testout1.txt', 'r:shift_jis') do |f| s = f.read; p s.encoding end
File.open('testout2.txt', 'r:euc-jp') do |f| s = f.read; p s.encoding end
File.open('testout3.txt', 'r:utf-8') do |f| s = f.read; p s.encoding end
File.open('testout3.txt', 'r:ASCII-8BIT') do |f| s = f.read; p s.encoding end

# for next line, change file number to pick up default_internal
File.open('testout3.txt', 'r') do |f| s = f.read; p s.encoding end
>>>>
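
The same check can be written so that it cleans up after itself and returns results instead of printing them. This is a sketch only: it assumes a Ruby that ships Encoding.default_internal= (the setter under discussion in this thread), and the helper name read_back_encodings is made up for illustration.

```ruby
require 'tmpdir'

# Hypothetical helper: writes the same text once per external encoding,
# reads it back with that external encoding, and returns a hash of
# { external_name => encoding of the string that was read }.
def read_back_encodings(internal, externals)
  saved = Encoding.default_internal
  Encoding.default_internal = internal
  Dir.mktmpdir do |dir|
    text = "\u3042\u3044\u3046\u3048\u304A"
    externals.each_with_object({}) do |ext, seen|
      path = File.join(dir, "testout_#{ext}.txt")
      File.open(path, "w:#{ext}") { |f| f.write(text) }
      seen[ext] = File.open(path, "r:#{ext}") { |f| f.read.encoding }
    end
  end
ensure
  # Restore the global setting so other code is not affected.
  Encoding.default_internal = saved
end

p read_back_encodings('utf-8', %w[shift_jis euc-jp utf-8 ASCII-8BIT])
```

Restoring the previous value in the ensure clause matters because default_internal is global state shared by the whole process.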

>The bulk of the implementation will be in the libraries, and I think many  
>of them need updating to cope with non-ASCII encodings anyhow.

Yes. I'm not sure how libraries are affected by the feature
freeze, but they have to be fixed anyhow, completely independently
of default_internal. And I agree that this cannot be done in 3 days.

Regards,    Martin.

>encoding of the data to be output. However, as with IO, this behaviour  
>probably should happen no matter whether "default_internal" is implemented  
>or not.
>
>Cheers
>Mike
>


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Michael Selig (Guest)
on 2008-09-24 14:34
(Received via mailing list)
On Wed, 24 Sep 2008 21:02:14 +1000, Martin Duerst <duerst@it.aoyama.ac.jp>
wrote:

> Well, that's the deadline for feature changes for 1.9.1.
> It would be a real pity to wait for 2.0 for this.
> The feature freeze wiki at
> http://redmine.ruby-lang.org/wiki/ruby/DevelopersM...
> says that default_internal is currently pending, but that
> this should be discussed/settled this week.

Sorry, I am new here, so I didn't know about that URL, nor about the
release procedures, nor did I know whether one of the other developers was
working on this. In my previous post I asked whether I should proceed, but
got no reply. I didn't think it was worthwhile my spending time on it if
someone else has done it, or almost has.

> option, in all combinations)
I am not sure if your patch also works correctly with IO#set_encoding.
This is absolutely necessary for HTTP data to be transcoded correctly in
OpenURI, for example. Please see my previous post. That post also suggests
NOT implementing default_internal=, but rather extending the -E command
line flag to be -E "ext:int", and if not defined there to use the "source
encoding".
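
To illustrate why IO#set_encoding matters here: with HTTP the charset is only known after the headers have been read, so the encoding has to be attached to a stream that is already open. The sketch below imitates that with a temporary file opened in binary mode; the helper name read_with_late_charset is hypothetical and this is not how OpenURI itself is written.

```ruby
require 'tempfile'

# Hypothetical helper: opens the file as raw bytes, then applies the
# external/internal encoding pair once the charset is "discovered".
def read_with_late_charset(path, charset, internal = 'utf-8')
  File.open(path, 'rb') do |f|
    # At this point the stream is binary; nothing is transcoded yet.
    f.set_encoding(charset, internal)
    f.read
  end
end

text = "\u3042\u3044\u3046"
Tempfile.create('set_encoding_demo') do |tmp|
  tmp.binmode
  tmp.write(text.encode('shift_jis'))
  tmp.flush
  result = read_with_late_charset(tmp.path, 'shift_jis')
  p result.encoding
  p result == text
end
```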

I guess your patch may "get the feature in" by the required date, but I
feel that it may require a little more thought to get it right.

Regards
Mike
Posted by Martin Duerst (Guest)
on 2008-09-25 03:19
(Received via mailing list)
At 21:34 08/09/24, Michael Selig wrote:
>On Wed, 24 Sep 2008 21:02:14 +1000, Martin Duerst <duerst@it.aoyama.ac.jp>  
>wrote:

>> option, in all combinations)
>
>I am not sure if your patch also works correctly with IO#set_encoding.

I'm quite sure it doesn't.

>This is absolutely necessary for HTTP data to be transcoded correctly in  
>OpenURI, for example. Please see my previous post. That post also suggests  
>NOT implementing default_internal=, but rather extending the -E command  
>line flag to be -E "ext:int", and if not defined there to use the "source  
>encoding".

I have read that mail, but I disagree with using the "source encoding".
I think it should be possible to use e.g. UTF-8 as a source encoding
without default_internal being automatically set.

Also, I think it should be possible to set default_internal independent
of default_external. I can imagine writing an application where I
always want UTF-8 to be default_internal, but I want it to work with
all kinds of external encodings. In that case, with your proposal,
the only alternative would be to have the user write ruby -E "ext:UTF-8"
(which means that the user has to figure out his/her external encoding,
which many may not be experts on), because I cannot use a #! line.

Well, in a Japanese mail (ruby-dev:36523), matz made this work
with -E :utf-8. I guess that's why he is a language designer,
and I'm not :-).
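
The three argument shapes mentioned in this thread for -E ("ext", "ext:int" and ":int") can be told apart with a few lines of Ruby. This is a sketch of the idea only; the real option parsing lives in Ruby's C sources, and the method name parse_e_option is hypothetical.

```ruby
# Hypothetical helper: splits a -E argument into [external, internal].
# A missing or empty part is returned as nil, meaning "not specified".
def parse_e_option(arg)
  ext, int = arg.split(':', 2)
  ext = nil if ext.nil? || ext.empty?
  int = nil if int.nil? || int.empty?
  [ext, int]
end

p parse_e_option('shift_jis')
p parse_e_option('shift_jis:utf-8')
p parse_e_option(':utf-8')
```

With the ":int" form the external encoding stays unspecified, so it can still come from the user's locale.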


>I guess your patch may "get the feature in" by the required date, but I  
>feel that it may require a little more thought to get it right.

Definitely it needs some more work and polishing. This is now being
discussed seriously on ruby-dev, but please keep your ideas and comments
coming.

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst@it.aoyama.ac.jp
Posted by Michael Selig (Guest)
on 2008-09-25 04:08
(Received via mailing list)
----- Original Message -----
From: "Martin Duerst" <duerst@it.aoyama.ac.jp>
To: <ruby-core@ruby-lang.org>
Sent: Thursday, September 25, 2008 11:16 AM

> I have read that mail, but I disagree with using the "source encoding".
> I think it should be possible to use e.g. UTF-8 as a source encoding
> without default_internal being automatically set.

My reasoning is this: if a Ruby programmer puts "Encoding: XXX" at the top
of her main program, she is saying that she will be using encoding XXX for
the "constant" strings and regexps in her code, right? If
default_internal is different from this then she will get encoding
compatibility problems unless she is careful. This is what
"default_internal" is aiming to prevent, or at least reduce.

My feeling is that if they then really want to go on to read a different
encoding without transcoding, they can always open that file with mode
"r:ext:ext". Yes it's a bit ugly, but I think this is going to be the
exception rather than the rule. I guess the mode could be extended to
support something like "r:ext:-" where the "-" in the internal field
indicates no transcoding. Perhaps that would be more tolerable.
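
The "r:ext:ext" escape hatch described above can be tried out on a Ruby that ships Encoding.default_internal=. The sketch below assumes such a Ruby; the helper name is hypothetical, and the setter is used only to simulate a default_internal configured at startup.

```ruby
require 'tempfile'

# Hypothetical helper: with default_internal set to UTF-8, reads the same
# file once with only an external encoding and once with "r:ext:ext".
# Returns the encodings of the two strings that were read.
def encodings_with_and_without_transcoding(path, ext)
  saved = Encoding.default_internal
  Encoding.default_internal = Encoding::UTF_8
  transcoded = File.open(path, "r:#{ext}") { |f| f.read.encoding }
  untouched  = File.open(path, "r:#{ext}:#{ext}") { |f| f.read.encoding }
  [transcoded, untouched]
ensure
  # Restore the global setting so other code is not affected.
  Encoding.default_internal = saved
end

Tempfile.create('no_transcode_demo') do |tmp|
  tmp.binmode
  tmp.write("\u3042\u3044".encode('shift_jis'))
  tmp.flush
  p encodings_with_and_without_transcoding(tmp.path, 'shift_jis')
end
```

Depending on the Ruby version, the "r:ext:ext" open may print a warning that the internal encoding is being ignored; the data is still returned in the external encoding.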

My current thinking is to set default_internal to:
- The -E command line option
- The source encoding if not specified in -E
- Leave it nil (no transcoding) if neither is specified
- Also if "default_internal" is US-ASCII it is reset to UTF-8 automatically
(can't do any harm and will cope better if they specify an external encoding
but no internal encoding when opening a file, or their default_external is
not US-ASCII)
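
The precedence in the list above can be written down as a small function. It is a sketch of the proposal as stated here, not of anything that exists in Ruby; the name resolve_default_internal and its two arguments are hypothetical.

```ruby
# Hypothetical helper implementing the proposed order:
#   1. the internal encoding from -E, if given
#   2. otherwise the source encoding of the main script, if declared
#   3. otherwise nil (no transcoding)
# A result of US-ASCII is promoted to UTF-8.
def resolve_default_internal(cli_internal, source_encoding)
  chosen = cli_internal || source_encoding
  return nil if chosen.nil?
  chosen == Encoding::US_ASCII ? Encoding::UTF_8 : chosen
end

p resolve_default_internal(Encoding::EUC_JP, Encoding::UTF_8)
p resolve_default_internal(nil, Encoding::Shift_JIS)
p resolve_default_internal(nil, Encoding::US_ASCII)
p resolve_default_internal(nil, nil)
```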

Can you please outline a scenario where you feel that this would be
unacceptable?

The other problem I have with "default_internal=" is that its use may be
confusing to the Ruby programmer. Does it immediately cause the next read of
a file to use the new value or does it just apply to the next open? (I think
any implementation would probably be the latter).
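
On a Ruby that ships the setter, the "next read or next open?" question can be tested directly. The sketch below is one way to do that, under the assumption that Encoding.default_internal= is available; the helper name is hypothetical.

```ruby
require 'tempfile'

# Hypothetical helper: opens a file, then changes default_internal, then
# opens the file again. Returns the encodings of the strings read from
# the handle opened before the change and the one opened after it.
def open_time_or_read_time(path, ext)
  saved = Encoding.default_internal
  Encoding.default_internal = nil
  already_open = File.open(path, "r:#{ext}")
  Encoding.default_internal = Encoding::UTF_8
  opened_later = File.open(path, "r:#{ext}")
  [already_open.read.encoding, opened_later.read.encoding]
ensure
  already_open&.close
  opened_later&.close
  # Restore the global setting so other code is not affected.
  Encoding.default_internal = saved
end

Tempfile.create('open_vs_read_demo') do |tmp|
  tmp.binmode
  tmp.write("\u3042\u3044".encode('shift_jis'))
  tmp.flush
  p open_time_or_read_time(tmp.path, 'shift_jis')
end
```

If the two encodings differ, the setting was picked up when the file was opened rather than when it was read.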

> and I'm not :-).
Yes, I was assuming you could pass either "ext", "ext:int" or ":int" to the
"-E" option. Sorry if I didn't make that clear.

> Definitely it needs some more work and polishing. This is now being
> discussed seriously on ruby-dev, but please keep your ideas and comments
> coming.
Shame I don't read Japanese!

I know it's after the freeze, but I'll have a go at producing a patch myself
over this weekend.

Cheers
Mike.