Hi,
You might at first glance think that this post should go to ruby-dev,
but please read to the end!
I have been pulling my hair out trying to convert a relatively simple
app to support m17n under Ruby 1.9 to see what is involved. I need to
support all common locales worldwide, and data can also be stored in
UTF-8 or UTF-16. I was hoping that Ruby 1.9 was going to take the hard
work out of this for me. It has to a certain extent, but UTF-16 is the
problem - it breaks so many things, due to its "ASCII incompatibility"
(using Ruby's definition). I can't even do simple things like pull out
fields and substitute into another string without testing "encoding
compatibility". Something as simple as:
puts "The value is #{val}"
fails if val is UTF-16 data.
At one stage I got so frustrated that I was even thinking about going
back to Python :-(
So I have ended up transcoding any UTF-16 data to UTF-8, and now things
are going much better.
Maybe I am doing something wrong - if so, please suggest something I
can do other than transcode the UTF-16.
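The failure and the transcoding workaround described above can be sketched as follows. This is a minimal illustration, assuming a current Ruby with UTF-8 source encoding; the sample value `val` is made up for the demo.

```ruby
# Hypothetical sample value: UTF-16LE text as it might arrive from a file.
val = "caf\u00E9".encode("UTF-16LE")

# Interpolating it into a UTF-8 literal raises, because UTF-16LE is not
# ASCII-compatible in Ruby's sense.
begin
  message = "The value is #{val}"
rescue Encoding::CompatibilityError => e
  message = "failed: #{e.class}"
end

# Workaround from the post: transcode to UTF-8 first, then interpolate
# as usual.
utf8_val = val.encode("UTF-8")
fixed = "The value is #{utf8_val}"
```

Transcoding once, where the data enters the program, keeps the rest of the string handling free of encoding checks.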
But this has led me to look back at the issues with UTF-16 I have hit,
and to think about all the internal code in Ruby to handle "ASCII
incompatible" encodings, and the overhead involved with supporting it.
And I think that other Ruby programmers may end up doing what I have
done - avoid using UTF-16 internally because it is too hard.
So my radical suggestion is this:
Remove internal support for non-ASCII encodings completely, and when
reading/writing UTF-16 (and UTF-32) files automatically transcode
to/from UTF-8.
My reasons:
- String & Regexp operations should just "work" without the programmer
worrying about encoding compatibility (I think!)
- The programmer only has to think about character encodings at the
"interfaces" (files, network interfaces), not throughout the program
logic
- To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible"
encodings as Ruby defines it
- To my knowledge no one actually uses UTF-16 or UTF-32 as a locale
- I would avoid having to use ugly modes to open a file like
"r:UTF-16LE:UTF-8" (very minor)
- Ruby's internal code would be simpler & cleaner and therefore probably
faster and easier to maintain
Maybe I have got this all wrong - I am relatively new to m17n!
Cheers
Mike
on 2008-09-17 03:28
on 2008-09-17 04:45
On Sep 16, 2008, at 8:20 PM, Michael Selig wrote:
>
> puts "The value is #{val}"
>
> fails if val is UTF-16 data.
I'm not sure I support the pull-them-out strategy, but I can confirm
that supporting UTF-16 in CSV has eaten about a week of my time and
counting. I keep thinking I have it and finding new problem…
James Edward Gray II
on 2008-09-17 04:54
Hi,
In my previous mail, I think I made a mistake. I am too used to working
in a UTF-8 locale, and I forgot about the situations where your locale
is not UTF-8 or ASCII. Sorry!!
So unfortunately, in a general sense, you can never simply ignore
encoding compatibility. Therefore I can either hope that a user's UTF-8
& UTF-16 data is compatible with their locale, or I have to transcode
everything to UTF-8. What else can I do?
However, I think support for UTF-16 & UTF-32 internally is not
particularly useful and that support for them may not really be
justifiable.
Mike
on 2008-09-17 04:59
On Sep 16, 2008, at 8:20 PM, Michael Selig wrote:
> …but UTF-16 is the problem - it breaks so many things, due to its
> "ASCII incompatibility" (using Ruby's definition). I can't even do
> simple things like pull out fields and substitute into another
> string without testing "encoding compatibility". Something as simple
> as:
>
> puts "The value is #{val}"
>
> fails if val is UTF-16 data.
How ironic… I ran into this issue about five minutes ago. It's killing
the CSV implementation I thought I finally had right. :(
How would you save this? Instead of:
%Q{"#{val}"} # boom for UTF-16!
Can we do:
['"', val, '"'].map { |s| s.encode("UTF-16BE") }.join
? Yeah, that seems to work. It sucks, but it works.
James Edward Gray II
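The quoting workaround above can be tried in isolation. This is only a sketch on a current Ruby; the field value `val` is a made-up sample, and the real CSV code is not shown in this thread.

```ruby
# Hypothetical field value, already in UTF-16BE.
val = "h\u00E9llo".encode("UTF-16BE")

# Encode every piece (the ASCII quote marks included) to the target
# encoding first, then join with no separator.
quoted = ['"', val, '"'].map { |s| s.encode("UTF-16BE") }.join

# Transcode a copy to UTF-8 only to inspect the result.
readable = quoted.encode("UTF-8")
```

Every literal that touches the data has to be encoded this way, which is why the approach gets tedious quickly.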
on 2008-09-17 06:31
On Wed, 17 Sep 2008 12:51:14 +1000, James Gray
<james@grayproductions.net> wrote:
> Can we do:
>
> ['"', val, '"'].map { |s| s.encode("UTF-16BE") }.join
>
> ? Yeah, that seems to work. It sucks, but it works.
Yep, it sure sucks!
I have been doing some more thinking about these ongoing issues....
<soapbox>
Using Ruby SHOULD be making our lives easier, not harder. Other
languages like Python have taken an easier route to m17n - represent
all strings internally as Unicode codepoints. Then there should never
be a need to check encoding compatibility, right? I am not saying that
this is a perfect solution either, by the way.
But having to work around this "Encoding Compatibility Error" all the
time is just a pain for apps which need to work in different countries
with different locales. Unfortunately it is leading me towards the path
of having to transcode everything to UTF-8, even though in 99% of cases
all the data IS going to be compatible and be in the user's locale. I
don't want so much of my time taken up, and be forced to write ugly
code to take care of the remaining 1%.
Maybe the problem is that Ruby is being too generous supporting all
these different encodings internally!
That was one reason why I raised the idea of removing UTF-16 & 32
support - at least that way I know that the ASCII strings from my
program can work with any user data.
But then the further problem: What if you need to work with (or at
least take into account the possibility of) 2 or more non-ASCII (but
ASCII-compatible) encodings (e.g. the user's locale & UTF-8)?
What may solve this issue is if Ruby itself would automatically encode
incompatible strings in a compatible encoding (UTF-8 I guess). The only
time you should then get "Encoding Compatibility Errors" is when
writing data to a file or network stream in a certain encoding and a
character cannot be represented. That's it.
Just a thought...
</soapbox>
Mike
on 2008-09-17 10:25
In article <op.uhlqb7b29245dp@kool>,
"Michael Selig" <michael.selig@fs.com.au> writes:
> - To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible" as
> Ruby defines it
ISO-2022-JP is another example.
> - To my knowledge no one actually uses UTF-16 or UTF-32 as a locale
They are not usable as locale encoding.
http://www.opengroup.org/onlinepubs/007908799/xbd/...
| * The encoded values associated with the members of the portable character
| set are each represented in a single byte.
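Ruby can be asked directly which encodings it treats as ASCII-compatible. A small sketch, assuming a current Ruby; the list of names is just a sample chosen for this thread.

```ruby
# Ask Ruby for its own notion of "ASCII compatible" for a few encodings.
names = %w[UTF-8 Shift_JIS UTF-16LE UTF-32BE ISO-2022-JP]
report = names.map { |n| [n, Encoding.find(n).ascii_compatible?] }.to_h

# Encodings from the sample that Ruby considers ASCII-incompatible.
incompatible = report.reject { |_name, ok| ok }.keys
```

With this sample, `incompatible` contains the UTF-16 and UTF-32 entries and also ISO-2022-JP, which matches the correction above.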
on 2008-09-17 10:28
Disclaimer: I haven't used 1.9 encoding stuff so far. Nevertheless my
0.02EUR:
2008/9/17 Michael Selig <michael.selig@fs.com.au>:
> <soapbox>
>
> Using Ruby SHOULD be making our lives easier, not harder. Other languages
> like Python have taken an easier route to m17n - represent all strings
> internally as unicode codepoints.
Which is also what Java does. I have always found Java's approach to
encodings very clean and workable. But if I remember correctly Matz
once said that Unicode does not cover all Asian symbols, so it might
not be too good a choice for internal representation.
> That was one reason why I raised the idea of removing UTF-16 & 32 support -
>
> Just a thought...
>
> </soapbox>
I believe that one reason for the difficulties we encounter now is the
fact that String is historically used for binary and text data. So
there is no clear separation between the two and this bears potential
for confusion and bugs.
A clean solution would probably involve having a character type which
is capable of representing *all* possible symbols and modelling String
as a sequence of those characters. Encoding would then be done during
input and output only. Questions I see:
1. Is this feasible, i.e. is there something similar to Unicode without
its limitations?
2. Is it fast enough for the general case?
3. What happens to binary Strings?
or more generally:
4. What happens to old (pre 1.9) code?
i18n is a nasty beast...
Kind regards
robert
on 2008-09-17 12:30
On 17/09/2008, Robert Klemme <shortcutter@googlemail.com> wrote:
> 1. Is this feasible, i.e. is there something similar to Unicode
> without its limitations?
What is "similar to Unicode without its limitations"?
You mean it contains every character anybody would ever want to write
on a computer in a single encoding?
So there would have to be a central committee to which everybody
submits any character they think of so that it gets a codepoint
assigned? And that committee assigns codepoints to those submissions
unquestioningly so that no special-purpose encodings are ever needed?
And you even try to ask if this is feasible?
Well, my answer is that it is not feasible.
Thanks
Michal
on 2008-09-17 14:59
On Sep 16, 2008, at 11:21 PM, Michael Selig wrote:
> in the user's locale.
I believe Matz has said in the past that transcoding is what they are
trying to avoid in general. You can lose data that way and thus the
core team doesn't favor it. (I hope I got that right. It's from
memory, so don't blame me for putting words in Matz's mouth.)
Besides, I'm not sure if it's the characters I have tried or just that
Ruby's transcoding still needs work, but I've tried converting some
Shift_JIS to UTF-8 that it just couldn't handle. We would have to
have a better conversion rate to support a strategy like this.
James Edward Gray II
on 2008-09-17 15:48
Hi,
James Gray wrote:
>> transcode everything to UTF-8, even though in 99% of cases all the
> a better conversion rate to support a strategy like this.
We can convert "all Shift_JIS characters" to Unicode now. But the
current problem is that there are several different mappings for
conversion between Shift_JIS and Unicode. Once you convert data from
Shift_JIS to Unicode, the true meaning of some characters may be lost
forever (e.g. the YEN SIGN problem).
If we develop "a better" conversion, this problem will become more
complex.
on 2008-09-17 16:36
Hi,
In message "Re: [ruby-core:18640] Character encodings - a radical
suggestion"
on Wed, 17 Sep 2008 10:20:13 +0900, "Michael Selig"
<michael.selig@fs.com.au> writes:
|So my radical suggestion is this:
|
|Remove internal support for non-ASCII encodings completely, and when
|reading/writing UTF-16 (and UTF-32) files automatically transcode to/from
|UTF-8.
What happens with non Unicode text under your suggestion?
My conservative suggestion is that:
Pass "r:UTF-16BE:UTF-8" as the mode when you open a UTF-16 file to
read, so that your internal strings are all UTF-8 encoded.
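A self-contained sketch of that mode string, assuming a Unix-like system (on Windows, Ruby may require binary mode for ASCII-incompatible external encodings). The file name and contents are made up for the demo.

```ruby
require "tmpdir"

text = Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.txt")

  # Write raw UTF-16BE bytes so there is something to read back.
  File.binwrite(path, "caf\u00E9\n".encode("UTF-16BE"))

  # "r:EXTERNAL:INTERNAL": decode the file as UTF-16BE and hand the
  # program UTF-8 strings.
  File.open(path, "r:UTF-16BE:UTF-8") { |f| f.read }
end
```

After this, `text` behaves like any other UTF-8 string, so interpolation and regexps need no special handling.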
|My reasons:
|
|- String & Regexp operations should just "work" without the programmer
|worrying about encoding compatibility (I think!)
|- The programmer only has to think about character encodings at the
|"interfaces" (files, network interfaces) not throughout the program logic
My "suggestion" satisfies the above two.
|- To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible" as
|Ruby defines it
As akr stated this is wrong.
|- To my knowledge no one actually uses UTF-16 or UTF-32 as a locale
Yes.
|- I would avoid having to use ugly modes to open a file like
|"r:UTF-16LE:UTF-8" (very minor)
This is ugly indeed. We might add more Unicode support in the
future. But we are in no hurry.
|- Ruby's internal code would be simpler & cleaner and therefore probably
|faster and easier to maintain
Dropping UTF-{16,32} is not enough. Unless we abandon non-Unicode
encoding support altogether, it won't be THAT simple. And I am not
going to remove their support. I use them every day.
matz.
on 2008-09-17 16:48
On 9/17/2008 3:39 PM, NARUSE, Yui wrote:
> We can convert "all Shift_JIS characters" to Unicode now.
> But current problem is, there are some mappings Shift_JIS and Unicode
> conversion.
> Once you convert data from Shift_JIS to Unicode, true meaning of some
> characters
> may be lost forever. (e.g. YEN SIGN Problem)
>
> If we develop "a better" conversion, this problem will be more complex.
Is there a complete characterization of this whole problem? It seems
to be the main reason for sticking to non-UTF-8 character sets in
Ruby these days, and concluding from what I have read about it, a
solution could be the addition of missing characters/codepoints to
Unicode. Why does no-one consider going that way, but instead builds
a complicated stack of functions for conversions on top level?
To some extent, it looks like 'some' people like insisting on the
status quo as it makes them feel special, swimming upstream the
Unicode waterfall, retaining on regional locales instead of solving
the issue. I do explicitly not refer to Ruby or the developers, they
just accept these special needs more than other computer language
designers with less sympathy for this anomaly.
Nevertheless, a persisting fix is needed, and I think writing more
and more crutches for encoding conversion goes the wrong way. This
might still be needed for legacy file support, but day-to-day work
should not have to deal with this issue so prominently.
cheers,
- Matthias
on 2008-09-17 17:09
Hi,
In message "Re: [ruby-core:18663] Re: Character encodings - a radical
suggestion"
on Wed, 17 Sep 2008 23:09:32 +0900, Matthias Wächter
<matthias@waechter.wiz.at> writes:
|Is there a complete characterization of this whole problem? It seems
|to be the main reason for sticking to non-UTF-8 character sets in
|Ruby these days, and concluding from what I have read about it, a
|solution could be the addition of missing characters/codepoints to
|Unicode. Why does no-one consider going that way, but instead builds
|a complicated stack of functions for conversions on top level?
Just because it's impossible. History sucks. We have mixed up YEN
SIGN and REVERSE SOLIDUS for a long time. They cannot be distinguished
without context information. Technically 0x5c should mean REVERSE
SOLIDUS, but not always so for humans.
Besides that, Unicode is not a panacea. Some character sets
(e.g. GB18030 for Chinese characters) are even bigger than Unicode.
In fact, GB18030 is a superset of Unicode.
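The 0x5c ambiguity can be seen with a single byte. This is a sketch of how a current Ruby treats that byte when it is tagged as Shift_JIS; other converters and code pages may choose differently, which is the whole problem.

```ruby
# The single byte 0x5C, tagged as Shift_JIS.
byte = "\x5C".dup.force_encoding("Shift_JIS")

# Ruby passes the ASCII range through unchanged when transcoding, so
# the byte becomes U+005C REVERSE SOLIDUS.
as_unicode = byte.encode("UTF-8")
codepoints = as_unicode.codepoints
```

Whether the author of the text meant a backslash or a yen sign is not recorded anywhere in the byte, so no converter can recover it afterwards.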
|To some extent, it looks like 'some' people like insisting on the
|status quo as it makes them feel special, swimming upstream the
|Unicode waterfall, retaining on regional locales instead of solving
|the issue. I do explicitly not refer to Ruby or the developers, they
|just accept these special needs more than other computer language
|designers with less sympathy for this anomaly.
|
|Nevertheless, a persisting fix is needed, and I think writing more
|and more crutches for encoding conversion goes the wrong way. This
|might still be needed for legacy file support, but day-to-day work
|should not have to deal with this issue so prominently.
You are free to feel so, but it's us who take up the burden. However,
we are open to complaints about usability, e.g. about
"r:UTF-16LE:UTF-8".
matz.
on 2008-09-17 18:54
Hi,
Yukihiro Matsumoto wrote:
> Besides that, Unicode is not a panacea. Some character set
> (e.g. GB18030 for Chinese characters) is even bigger than Unicode.
> In fact, GB18030 is a super set of Unicode.
Emacs-Mule is another encoding which is bigger than Unicode and which
Ruby supports.
And the pictographs used by Japanese mobile phones are also not in
Unicode.
on 2008-09-17 19:02
On 17/09/2008, James Gray <james@grayproductions.net> wrote:
> > substitute into another string without testing "encoding compatibility".
> > Something as simple as:
> >
> > puts "The value is #{val}"
> >
> > fails if val is UTF-16 data.
>
> I'm not sure I support the pull-them out strategy, but I can confirm that
> supporting UTF-16 in CSV has eaten about a week of my time and counting. I
> keep thinking I have it and finding new problem…
For your own program you could override String.+ to automagically
convert its parameters. I thought this was good enough, but you cannot
do that for libraries - Ruby does not provide any way of bolting on
such a feature and hiding it from users of the library so that they get
the standard behaviour.
Still there are multiple ways of combining strings, and these could be
used to distinguish different encoding handling.
So my suggestion is to make
- String.+ do the conversion if possible (it creates a new string so
it can be different)
- String.<< only append compatible strings
- I am not sure about string interpolation - it technically creates a
new string each time so it could just convert, but this could get
complex if many strings are included in the interpolation.
Note that even with automatic conversion you get cases when strings
cannot be converted to some superset, so somebody could break your
application that seems to work OK by supplying input in an exotic
encoding.
There are other string functions, though. It is unclear what
Object.inspect should do. It is generally used to show stuff to the
user. But should it convert the string to the user locale, show it in
hex with locale information appended, or what?
IO could be configurable to either do the necessary conversion or not.
Like STDOUT.autoconvert=true, then you could write any strings to
stdout without problems (as long as the stdout encoding is known and
can handle all your strings).
Also Array.join could perhaps accept some parameter that either
specifies the desired encoding of the result or specifies that the
strings should be converted so that they can actually be concatenated.
Generally I can imagine the automatic conversion working like this
(either as part of core or as an addon):
1) each encoding has a list of compatible supersets
2) each encoding has a list of (incompatible) equivalents [optional] -
typical for legacy 8bit encodings which have several variants with the
characters reordered in different ways
3) each encoding has a list of incompatible (without conversion)
supersets
Then string operations could be performed this way:
1) an operation on two strings where one is a compatible superset of
the other is done without conversion, and the result has the encoding
of the superset. This is basically the extension of the
ASCII-compatible concept to other encodings that could have this
feature.
If conversion is not allowed and 1) is not applicable (note that each
encoding is a compatible superset of itself) an exception is raised.
If conversion is allowed the autoconversion could follow:
2) if the strings are encoded in incompatible but equivalent encodings
convert one to the encoding of the other based on some order of
preference.
3) if there is the same incompatible superset for both strings (or
superset of superset ..) convert both strings to this superset. If
multiple supersets are available consult order of preference.
If neither 2) nor 3) is applicable raise an exception.
I am not sure that 2) would ever apply. Some ISO encodings should be
generally equivalent to some DOS or Windows codepages but there might
be one or two different characters that make the encodings
non-equivalent. Perhaps the strings could be checked for these
characters but then just converting to a superset might be easier.
Thanks
Michal
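A much simplified sketch of the converting concatenation proposed above. The helper name `concat_with_conversion` and its `fallback:` parameter are hypothetical and not part of Ruby; the superset and preference lists are reduced to one hard-coded fallback encoding.

```ruby
# Hypothetical helper, not part of Ruby: concatenate two strings,
# converting only when Ruby considers them incompatible.
def concat_with_conversion(a, b, fallback: Encoding::UTF_8)
  # Step 1: compatible operands are joined without any conversion.
  return a + b if Encoding.compatible?(a, b)

  # Step 3, simplified: convert both operands to one common encoding.
  # A fuller version would consult a preference list; here Ruby simply
  # raises if a character cannot be represented in the fallback.
  a.encode(fallback) + b.encode(fallback)
end

utf16  = "d\u00E9f".encode("UTF-16LE")
joined = concat_with_conversion("abc", utf16)

latin1 = "\u00E9".encode("ISO-8859-1")
mixed  = concat_with_conversion(latin1, "\u00FC")
```

The sketch leaves the objection raised elsewhere in the thread untouched: a conversion that succeeds may still have changed the meaning of some characters.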
on 2008-09-17 19:07
On Sep 17, 2008, at 9:45 AM, NARUSE, Yui wrote:
> And pictographs which are used by Japanese Mobile Phones are also not
> in Unicode.
They have Private Use Area codepoints for emoji, e.g.
http://www.au.kddi.com/ezfactory/tec/spec/img/typeD.pdf
Seems reasonable.
-Tim
on 2008-09-17 19:51
On 17/09/2008, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |Unicode. Why does no-one consider going that way, but instead builds
> In fact, GB18030 is a super set of Unicode.
>
I wonder how people who suggest Unicode as the single internal encoding
would react if GB18030 was suggested instead ;-)
Thanks
Michal
on 2008-09-17 19:51
Tim Bray wrote:
> On Sep 17, 2008, at 9:45 AM, NARUSE, Yui wrote:
>
>> And pictographs which are used by Japanese Mobile Phones are also not
>> in Unicode.
>
> They have Private Use Area codepoints for emoji, e.g.
> http://www.au.kddi.com/ezfactory/tec/spec/img/typeD.pdf
>
> Seems reasonable. -Tim
Yes, they have Private Use Area codepoints, but I think they are not
reasonable.
The first reason is that the area is a Private Use Area.
Moreover, there are several mobile phone carriers in Japan and each
defines its own emoji, and their PUA codepoints conflict.
http://creation.mb.softbank.jp/web/web_pic_about.html
So they can't be 'Uni'code yet.
on 2008-09-17 20:27
Hi,
Michal Suchanek wrote:
> - String.+ do the conversion if possible (it creates a new string so
> it can be different)
The problem is not "can convert" or "cannot convert". Different
mappings and information lost in conversion are the true problem.
These can't be avoided, so we can't use automatic conversion.
> Note that even with automatic conversion you get cases when strings
> cannot be converted to some superset so somebody could break your
> application that seems to work OK by supplying input in an exotic
> encoding.
The definition of superset is a difficult problem.
> There are other string functions, though. It is unclear what
> Object.inspect should do. It is generally used to show stuff to the
> user. But should it convert the string to the user locale, show it in
> hex with locale information appended, or what?
In a previous conversation, it was said that Object#inspect should
depend on the locale.
> STDOUT.autoconvert=true
This seems non-thread-safe.
> Generally I can imagine the automatic conversion working like this
> (either as part of core or as an addon):
>
> 1) each encoding has a list of compatible supersets
Defining "compatible" is the problem here. And what is "incompatible"?
> 2) each encoding has a list of (incompatible) equivalents [optional] -
> typical for legacy 8bit encodings which have several variants with the
> characters reordered in different ways
Such an extension sometimes breaks compatibility.
For example, U+9AD8 is assigned to 0x3962 in Shift_JIS.
http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...
This character has a variant, which has the codepoint U+9AD9 in Unicode.
http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...
But the two are unified in Shift_JIS (JIS X 0208).
So 0x3962 covers both U+9AD8 and U+9AD9, but once it is converted to
Unicode, it can only be U+9AD8.
Moreover, Windows Code Page 932 includes U+9AD9...
So this is not easy.
# ISO-8859-X may be easy
> 3) each encoding has a list of incompatible (without conversion) supersets
>
> Then string operations could be performed this way:
>
> 1) an operation on two strings where one is compatible superset of the
> other is done without conversion, and the result has encoding of the
> superset. This is basically the extension of the ASCII-compatible
> concept to other encodings that could have this feature.
The problem is that we don't know which encodings are compatible in
the way ASCII is.
> If conversion is allowed the autoconversion could follow:
Implementing a "switch to allow the autoconversion" seems difficult...
anyway
> 2) if the strings are encoded in incompatible but equivalent encodings
> convert one to the encoding of the other based on some order of
> preference.
This means, when charset C includes A and B,
string in A + string in B => string in C ?
When those conversions don't lose any information, this is reasonable.
> 3) if there is the same incompatible superset for both strings (or
> superset of superset ..) convert both strings to this superset. If
> multiple supersets are available consult order of preference.
What does "incompatible superset" mean?
> I am not sure that 2) would ever apply. Some iso encodings should be
> generally equivalent to some dos or windows codepages but there might
> be one or two different characters that make the encodings
> non-equivalent. Perhaps the strings could be checked for these
> characters but then just converting to a superset might be easier.
Theoretically, yes. But practical encodings seem dirty.
on 2008-09-17 20:34
Hi,
In message "Re: [ruby-core:18668] Re: Character encodings - a radical
suggestion"
on Thu, 18 Sep 2008 00:55:40 +0900, "Michal Suchanek"
<hramrach@centrum.cz> writes:
|I wonder how people who suggest Unicode as the single internal
|encoding would react if GB18030 was suggested instead ;-)
Yeah! We should do that!
...no, please. Its encoding scheme is horrific. You have to read two
bytes to tell how many bytes the code point occupies (up to 4 bytes).
matz.
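The "read two bytes to find the length" rule can be sketched as below. The helper `gb18030_char_length` is hypothetical and written from the published GB18030 byte ranges as I understand them; it only classifies the lead bytes and does not validate the third and fourth bytes of a four-byte sequence.

```ruby
# Hypothetical helper: length in bytes of the GB18030 character that
# starts at bytes[0], or nil if the bytes cannot start a character.
def gb18030_char_length(bytes)
  first, second = bytes[0], bytes[1]
  return nil if first.nil?
  return 1 if first <= 0x7F                      # single-byte ASCII range
  return nil unless (0x81..0xFE).cover?(first) && second
  return 4 if (0x30..0x39).cover?(second)        # four-byte form
  return 2 if (0x40..0xFE).cover?(second) && second != 0x7F
  nil
end

lengths = [[0x41], [0xD6, 0xD0], [0x81, 0x30, 0x81, 0x30]].map do |bytes|
  gb18030_char_length(bytes)
end
```

In UTF-8, by contrast, the first byte alone is enough to tell the length of the sequence.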
on 2008-09-17 21:14
On Sep 17, 2008, at 8:01 AM, Yukihiro Matsumoto wrote:
> Besides that, Unicode is not a panacea. Some character set
> (e.g. GB18030 for Chinese characters) is even bigger than Unicode.
> In fact, GB18030 is a super set of Unicode.
Also, Mojikyo is much larger than Unicode.
So, there are two reasonable goals:
1. Provide useful features for strings and characters in the general
case, including non-Unicode data
2. Provide good support for Unicode
The problem is that the people who care about #1 don't care about #2
very much, and vice versa. The people who spend all their time doing
Web/Internet stuff mostly only care about #2.
It seems that Unicode is important, so Ruby should have some special
low-level language facilities, especially for efficient access to
codepoints. If we can do that without harming #1, then everyone should
be happy.
-T
on 2008-09-17 21:53
On 17/09/2008, NARUSE, Yui <naruse@airemix.jp> wrote:
> > Still there are multiple ways of combining strings, and these could be
> > used to distinguish different encoding handling.
> >
> > So my suggestion is to make
> > - String.+ do the conversion if possible (it creates a new string so
> > it can be different)
>
> The problem is not "can convert" or "cannot convert".
> Different mappings and information lost in conversion is the true problem.
> So they can't be avoided and we can't use automatic conversion.
Yes, even for the "common www encodings" one cannot safely convert some
Japanese encodings, because of the Yen vs backslash confusion. And
there are other problems, I am sure.
> ..
> > STDOUT.autoconvert=true
>
> this seems non-thread-safe.
Using a single IO in multiple threads is non-safe, so this API does not
introduce any new problem. It is also similar to the other IO
properties that can already be set in a non-thread-safe way.
> > Generally I can imagine the automatic conversion working like this
> > (either as part of core or as an addon):
> >
> > 1) each encoding has a list of compatible supersets
>
> Define "compatible" is this problem.
> And what is "incompatible"?
Compatible here means that 7bit ASCII is a compatible subset of utf-8
or any (most?) of the iso-8859-x encodings. You can join the strings
without any conversion. Similarly, BCDIC could be considered a
compatible subset of the EBCDIC codepages if these are ever
implemented.
> this character has a variation, which has a codepoint U+9AD9 in Unicode.
> http://www.unicode.org/cgi-bin/GetUnihanData.pl?co...
> But this is unified in Shift_JIS(JIS X 0208).
>
> So 0x3962 includes U+9AD8 and U+9AD9 but once it is converted to Unicode,
> this is only be U+9AD8.
>
> Moreover Windows Code Page 932 include U+9AD9...
>
> So this is not easy.
If there are characters that are different in one encoding and mapped
to a single codepoint in another encoding, these encodings are not
equivalent. Strings in those encodings could be considered equivalent
as long as they do not contain such characters, but it is questionable
if scanning the string is desired. On the other hand, the conversion
would process the whole string anyway, so it could be attempted for
such cases, and aborted if such a character is encountered.
> The problem is we don't know what encodings are compatible as ASCII.
It is always possible to know that just by looking at the codepoint
table, the same way the ASCII-compatible encodings were defined.
> > If conversion is allowed the autoconversion could follow:
>
> implement "switch to allow the autoconversion" seems difficult... anyway
I did not mean to implement a switch - I wanted to define converting
and non-converting operations. However, for non-string objects that use
strings such a switch would indeed be needed.
> > 2) if the strings are encoded in incompatible but equivalent encodings
> > convert one to the encoding of the other based on some order of
> > preference.
>
> This means,
> when charset C includes A and B, string in A + string in B => string in C ?
> When those conversion doesn't lost any information, this is reasonable.
Here I wanted to distinguish two cases but they are in fact pretty much
the same:
- conversion into an encoding that has the same number of codepoints,
just reordered
- conversion into an encoding with a larger number of codepoints
This should probably be handled by encoding preference. When strings in
iso-8859-x and the corresponding windows codepage are to be added, and
the windows codepage is preferred over Unicode encodings and iso-8859
encodings, the codepage should be used. On the other hand, if utf-8 is
preferred, utf-8 should be used for the result.
I am not sure how that preference would be set, though. You could set a
general preference at program start, but setting a preference for each
operation would make the system complicated. But for a single operation
the preference could be enforced by converting the operands manually.
> > 3) if there is the same incompatible superset for both strings (or
> > superset of superset ..) convert both strings to this superset. If
> > multiple supersets are available consult order of preference.
>
> What is incompatible superset mean?
That means that the string would have to be converted to be
represented in the "superset encoding". However, the conversion should
be unambiguous.
> But practical encodings seem dirty.
>
Thanks
Michal
on 2008-09-18 02:11
Hi,
Thanks for all the replies - I am not an expert on all these encodings,
and I (obviously mistakenly!) assumed that all other encodings could be
converted to Unicode.
When I first looked at Ruby 1.9's encoding support I thought "that's
neat - I think it will solve my m17n problems". However, as I got into
it I soon discovered that it wasn't nearly this easy!
Here is a summary of my issues:
- Non "ASCII-compatible" data is almost impossible to work with. Just
take a look at what James Gray was proposing to do for CSV.
- When developing standard classes & mixins that could be installed in
any country, virtually all methods that handle more than 1 string are
going to have to worry about the possibility of dealing with
incompatible encodings. This is a major overhead to a programmer - it
may not be acceptable to let it raise an error.
- Other alternative languages to Ruby which represent all strings as
Unicode don't have this problem. Although they may not be a 100%
solution in Japan & China, they would certainly be fine for me to use.
- As my application is under my control, I can make the decision to
transcode everything to UTF-8 if I want to. I was hoping not to, but I
think the extra code I would have to write to test encoding
compatibility would not be worthwhile as it would be in so many places.
And yes, I could write a
- For people like James who are trying to modify a standard library
like CSV, which on the surface looks like a simple task, it is really
quite daunting.
My "ideal" would be that Ruby automatically converted to a common
encoding rather than raising an Encoding Compatibility Error. And
although Unicode apparently may not cope with every character on the
planet at present, I guess it will one day, and it seems to me to be
the sensible thing to use as the "common encoding" - or UTF-8 to be
precise.
That way, in the 99% of cases where the encodings ARE compatible, Ruby
would work exactly as it does now. But it also means that I can write
methods and not have to worry about them blowing up because of encoding
incompatibility. It *does* mean that strings may "magically" be
converted to UTF-8, but I don't see this as a big deal as long as when
they are output they are converted back to the necessary encoding
(which I think Ruby does with files now). If the "magic" conversion is
a problem, maybe there should be a switch to turn it on & off.
This auto-convert policy should also be used with non-destructive
methods like String#== etc so the programmer needn't worry whether the
same character has a different representation on each side of the "==".
The ASCII-8BIT encoding should be reserved as a "special case" and not
be subject to auto-conversion, because it is going to be mainly used
for "byte strings".
Yes, there may be a performance overhead doing this. But is this a big
deal if it only happens in 1% of cases?
Sure there are issues with this, like what to do with text that cannot
be encoded to Unicode (now that I know it exists!), and also the
implementation of these suggestions may not be easy, but I think *not*
doing something about these issues may make the dev community have a
negative impression of Ruby, which would be a great, great shame.
Cheers
Mike
On Thu, 18 Sep 2008 00:28:03 +1000, Yukihiro Matsumoto
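The comparison part of this wish list can be sketched in user code. The helper `same_text?` is hypothetical and not part of Ruby; it assumes both strings can be transcoded to UTF-8 and leaves ASCII-8BIT strings alone, as suggested above.

```ruby
# Hypothetical helper: compare two strings by content after bringing
# both to UTF-8. Binary (ASCII-8BIT) strings are compared as they are.
def same_text?(a, b)
  return a == b if [a, b].any? { |s| s.encoding == Encoding::BINARY }
  a.encode("UTF-8") == b.encode("UTF-8")
end

utf8   = "caf\u00E9"
latin1 = utf8.encode("ISO-8859-1")

plain_equal = (utf8 == latin1)         # Ruby's built-in comparison
text_equal  = same_text?(utf8, latin1) # comparison after transcoding
```

Here `plain_equal` is false while `text_equal` is true, which is the gap the proposal wants String#== itself to close.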
on 2008-09-18 02:50
On Wed, Sep 17, 2008 at 10:09 AM, Matthias Wächter <matthias@waechter.wiz.at> wrote:
> Is there a complete characterization of this whole problem? It seems
> to be the main reason for sticking to non-UTF-8 character sets in
> Ruby these days, and concluding from what I have read about it, a
> solution could be the addition of missing characters/codepoints to
> Unicode. Why does no-one consider going that way, but instead builds
> a complicated stack of functions for conversions on top level?
While there is a private use plane, it's not generally interoperable to use the private use plane in Unicode. Adding glyphs to Unicode is a lengthy process that requires going through a standards body. The Unicode standard is updated every few years, but the Unicode consortium is much more likely to listen to the Japanese standards bodies than Ruby programmers.
> To some extent, it looks like 'some' people like insisting on the
> status quo as it makes them feel special, swimming upstream the
> Unicode waterfall, retaining on regional locales instead of solving
> the issue. I do explicitly not refer to Ruby or the developers, they
> just accept these special needs more than other computer language
> designers with less sympathy for this anomaly.
The reality is that Unicode *doesn't* completely represent all Asian languages well (see the discussions around Han unification for a brief primer on the issues involved). The problem is exacerbated in the academic arena where people want to be able to represent ancient characters accurately, but it's not limited to that. Just because you and I can represent our words in under one hundred characters doesn't mean that it's appropriate to do the same with others' languages. It's getting better, but it's still not perfect.
> Nevertheless, a persisting fix is needed, and I think writing more
> and more crutches for encoding conversion goes the wrong way. This
> might still be needed for legacy file support, but day-to-day work
> should not have to deal with this issue so prominently.
Day-to-day work *doesn't*. Deal with all of your stuff in a single encoding (UTF-8, UTF-16, whatever) and you don't even have to think about it. If you *ever* deal with more than one encoding, you're going to run into this problem in *any* language. Sorry.
-austin, still working on a blog post about a .NET Unicode/XML bug
on 2008-09-18 03:30
Michael Selig wrote:
> should be a switch to turn it on & off.
Have you read Matz's post about yen sign problem? Converter IS a
problem; you cannot make a converter over (Encoding A -> Unicode ->
Encoding A). That must lose some input. Data loss is the worst thing
to introduce, so ruby asks you to take the risk by explicitly calling a
conversion method.
Problems on character encodings are sourced from complexities of human
activities. I can hardly believe there are any simple, perfect, and/or
"neat" solution.
on 2008-09-18 04:40
----- Original Message -----
From: "Urabe Shyouhei" <shyouhei@ruby-lang.org>
> Have you read Matz's post about yen sign problem? Converter IS a
> problem; you cannot make a converter over (Encoding A -> Unicode ->
> Encoding A). That must lose some input. Data loss is the worst thing
> to introduce, so ruby asks you to take the risk by explicitly calling a
> conversion method.
Yes I have read it.
Do you refuse to drive a car because you may have a crash?
I was only suggesting conversion to Unicode as a way of preventing an error being raised. If you need to work with encodings that are not Unicode compatible, I was suggesting that Ruby works exactly as it does at the moment. The conversion was only suggested when you deal with incompatible encodings, which is not going to be common, but is something that programmers whose software is used internationally have to deal with. Also I suggested that there should be a way of turning it off, just in case you are worried that the conversion might happen accidentally.
For my application (and I think for many other people's too) it is *far* better to possibly screw up a character or two than to have to write lots of ugly code to cope with the incompatibility. The alternative is transcoding to UTF-8, which will mean those characters will be screwed up anyhow.
Mike
on 2008-09-18 05:51
On Sep 17, 2008, at 9:32 PM, Michael Selig wrote:
> I was only suggesting conversion to Unicode as a way of preventing
> an error being raised.
Hey, I know character encodings are hard. I'm still trying to get CSV completely converted. I'm getting closer all the time, but it's been tricky for sure.
I'm sure it's a bit of Ruby's fault. The m17n code is still a little raw and I ran into several issues just exploring it. All of our efforts here are making things better though. Look at how many bugs were fixed in the last week just due to emails from you and me.
The other thing that's very important to remember is that character encodings are just plain hard to get right. I think it's a pretty big testament that Ruby makes it possible for us to support all these encodings now.
I'm definitely in the self-centered-universe camp that thought Unicode was best for most things. I know I would still recommend it in many cases, because it's pretty easy to implement and it does work in many cases.
However, our Japanese friends are trying to tell us it's not a universal solution. It doesn't always work well for them in particular, so they would prefer we make something better. I for one am grateful for them teaching me this new lesson, hard or not.
And if you prefer to do the UTF-8 everywhere strategy, you can, right? Transcode everything to UTF-8 when it comes in and then you can pretend it's all UTF-8 (because it is!), right? Don't we have the best of both worlds now?
James Edward Gray II
on 2008-09-18 05:58
On Sep 17, 2008, at 8:43 PM, James Gray wrote:
> And if you prefer to do the UTF-8 everywhere strategy, you can,
> right? Transcode everything to UTF-8 when it comes in and then you
> can pretend it's all UTF-8 (because it is!), right? Don't we have
> the best of both worlds now?
Well, yes, as long as Ruby will let me get at the codepoints efficiently. Oh, and in an ideal world, use Unicode properties like I can in Perl (\p{Lu} for example). I think that would make all us Unicode whiners shut up.
-T
on 2008-09-18 06:03
On Sep 17, 2008, at 10:49 PM, Tim Bray wrote:
> On Sep 17, 2008, at 8:43 PM, James Gray wrote:
>
>> And if you prefer to do the UTF-8 everywhere strategy, you can,
>> right? Transcode everything to UTF-8 when it comes in and then you
>> can pretend it's all UTF-8 (because it is!), right? Don't we have
>> the best of both worlds now?
>
> Well, yes, as long as Ruby will let me get at the codepoints
> efficiently.
Is unpack("U*") not meeting that need? I'm not trying to be a jerk, I'm seriously asking.
James Edward Gray II
on 2008-09-18 06:17
On Sep 17, 2008, at 8:55 PM, James Gray wrote:
> Is unpack("U*") not meeting that need? I'm not trying to be a jerk,
> I'm seriously asking.
In fact, that produces the correct answer, and it's what I actually use in my RX code (http://www.tbray.org/ongoing/When/200x/2008/06/10/RX-Work). The problem is that it could be a lot more efficient. It means I have to take care of organizing the input into chunks and being careful that I haven't chunked in the middle of a UTF-8 character and so on, when what I really want, when x is an IO, is
  x.each_codepoint do |u|
    # u is a Fixnum
  end
with the buffering and utf-8 unpacking being done at a low level without wasting memory.
-T
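What Tim asks for can be approximated with methods that already handle multi-byte boundaries. The following is a rough sketch, not an existing IO method: `each_codepoint_from` is a hypothetical helper, and it assumes the stream's encoding is already UTF-8 (for a File, that would mean opening it with "r:UTF-8"). A StringIO stands in for a real IO so the example is self-contained.

```ruby
require "stringio"

# Hypothetical helper: yields each character of the stream as an
# Integer codepoint. The IO layer does the buffering and keeps
# multi-byte characters intact, so no manual chunking is needed.
def each_codepoint_from(io)
  return enum_for(:each_codepoint_from, io) unless block_given?
  io.each_char { |ch| yield ch.ord }
end

io = StringIO.new("a\u00E9\u2192")   # "a", e-acute, rightwards arrow
codepoints = each_codepoint_from(io).to_a
```

Compared with `unpack("U*")`, nothing here requires the whole input to be in memory at once.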
on 2008-09-18 06:18
----- Original Message -----
From: "Austin Ziegler" <halostatue@gmail.com>
To: <ruby-core@ruby-lang.org>
Sent: Thursday, September 18, 2008 10:42 AM
> If you *ever* deal with more than one encoding, you're going to run
> into this problem in *any* language.
I nearly agree with you, except that these days "day-to-day" work can involve using data from web sites, from email, from RPC servers etc. Unless you know that your HTTP, SMTP classes etc are going to return data in your locale's encoding, you may very well be dealing with more than one encoding.
Cheers
Mike
on 2008-09-18 07:11
Hi,
In message "Re: [ruby-core:18681] Re: Character encodings - a radical
suggestion"
on Thu, 18 Sep 2008 09:03:35 +0900, "Michael Selig"
<michael.selig@fs.com.au> writes:
|Thanks for all the replies - I am not an expert on all these encodings,
|and I (obviously mistakenly!) assumed that all other encodings could be
|converted to Unicode.
|
|When I first looked at Ruby 1.9's encoding support I thought "that's neat
|- I think it will solve my m17n problems". However as I got into it I soon
|discovered that it wasn't nearly this easy!
I am sorry that life is not that easy.
|Here is a summary of my issues:
|
|- Non "ASCII-compatible" data is almost impossible to work with. Just take
|a look at what James Gray was proposing to do for CSV.
Yes, basically support for UTF-{16,32} is very limited, so
I believe it is OK for libraries to omit them. We should document that
clearly, but note that 1.9.1 has not been released yet.
|- Other alternative languages to Ruby which represent all strings as
|Unicode don't have this problem. Although they may not be a 100% solution
|in Japan & China, they would certainly be fine for me to use.
Ruby does not prohibit you from doing the same thing as the alternative
languages - converting back and forth at the surface. The point is, I
think, we haven't yet provided a nifty API to do so. If you can live
with Python's open-read-and-decode, I think you are able to stand
Ruby's "r:UTF-16:UTF-8" or open-read-and-encode.
If we need something more, it should be a better API to reduce the cost
of Unicode-based applications, not making the language Unicode centric.
Let me rephrase: it's OK for you to make your application/library
Unicode centric, but not the language itself. One can declare one's
library to support only ASCII-compatible text, or UTF-8 text. The
users must take care of converting non-conforming text.
|- When developing standard classes & mixins that could be installed in any
|country, virtually all methods that handle more than 1 string are going to
|have to worry about the possibility of dealing with incompatible
|encodings. This is a major overhead to a programmer - it may not be
|acceptable to let it raise an error.
For any serious application/library, there are three choices:
(a) choose US-ASCII
(b) choose UTF-8 (or any specific encoding)
(c) choose to live with multiple encoding
But the last one is not an easy way, indeed. I don't want to force
any Ruby users the hard way. Users should choose anything they want.
But I don't want to deny the possibility.
|It *does* mean that strings may "magically" be converted to UTF-8, but I
|don't see this as a big deal as long as when they are output they are
|converted back to the necessary encoding (which I think Ruby does with
|files now). If the "magic" conversion is a problem, maybe there should be
|a switch to turn it on & off.
|This auto-convert policy should also be used with non-destructive methods
|like String#== etc so the programmer needn't worry whether the same
|character has a different representation on each side of the "==".
|The ASCII-8BIT encoding should be reserved as a "special case" and not be
|subject to auto-conversion, because it is going to be mainly used for
|"byte strings".
If you can do implicit conversion at I/O, why do you have to care
about encoding mixing? Your program should treat single encoding
anyway. Auto-conversion is bad, believe me.
matz.
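The two options Matz mentions, transcoding on open and open-read-and-encode, might look like this in practice. This is a sketch under assumptions: it uses UTF-16LE (an explicit byte order, no BOM) rather than plain "UTF-16", writes a throwaway temp file so it is self-contained, and assumes a Ruby recent enough to have `Tempfile.create` and `File.binwrite`.

```ruby
require "tempfile"

original = "h\u00E9llo w\u00F6rld"
via_open_mode = nil
via_read_then_encode = nil

Tempfile.create("utf16-sample") do |tmp|
  tmp.close
  # Store the text on disk as raw UTF-16LE bytes.
  File.binwrite(tmp.path, original.encode("UTF-16LE"))

  # Option 1: external encoding UTF-16LE, internal encoding UTF-8.
  # Ruby transcodes while reading.
  File.open(tmp.path, "r:UTF-16LE:UTF-8") do |f|
    via_open_mode = f.read
  end

  # Option 2: read the bytes, label them, then convert explicitly.
  raw = File.binread(tmp.path)
  via_read_then_encode = raw.force_encoding("UTF-16LE").encode("UTF-8")
end
```

Either way the rest of the program only ever sees UTF-8 strings, which is the single-encoding situation Matz recommends.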
on 2008-09-18 07:14
Hi,
----- Original Message -----
From: "James Gray" <james@grayproductions.net>
To: <ruby-core@ruby-lang.org>
Sent: Thursday, September 18, 2008 1:43 PM
> The other thing that's very important to remember is that character
> encodings are just plain hard to get right. I think it's a pretty big
> testament that Ruby makes it possible for us to support all these
> encodings now.
> However, our Japanese friends are trying to tell us it's not a
> universal solution. It doesn't always work well for them in
> particular, so they would prefer we make something better. I for one
> am grateful for them teaching me this new lesson, hard or not.
I agree with you. And if I am coming across as being a jerk or too dogmatic, I don't mean to be! I was trying to make some constructive suggestions (some may have been misguided :-) so that Ruby can meet my needs better, and I think my needs may be quite common as software and data sources become more international. The intent is to make Ruby's character encoding issues as transparent as possible without losing the specific Japanese/Chinese/Welsh(?) requirements (now that I understand them a bit better), and of course to provoke further discussion about the issue. I think other people will soon face the sorts of problems you and I have been hitting over the past couple of weeks. It has been a very good learning experience for me also!
> And if you prefer to do the UTF-8 everywhere strategy, you can,
> right? Transcode everything to UTF-8 when it comes in and then you
> can pretend it's all UTF-8 (because it is!), right? Don't we have the
> best of both worlds now?
Nearly! Even if I transcode to UTF-8, I still have to make sure I do it at every interface, and that includes to Ruby's standard classes as well - not just IO. So I'll still have to check encodings of strings returned from network classes, and that's something I don't think I need to do with other languages that support Unicode, because there is only one internal string representation. Also testing that I have got it all right may be a nightmare (not that I am anywhere near that stage yet!). It would be so much nicer to have Ruby handle most of this for me.
I am a relatively recent convert to Ruby, mainly from Python. This means I am constantly thinking "could I do this easier/better with Python?". And the answer to that question for this latest project seems to be leaning towards "yes" unfortunately, and I'd like to say a definitive "no", because I like so many things about Ruby a great deal!
Cheers
Mike.
on 2008-09-18 07:18
Hi,
In message "Re: [ruby-core:18687] Re: Character encodings - a radical
suggestion"
on Thu, 18 Sep 2008 12:49:56 +0900, Tim Bray <Tim.Bray@Sun.COM>
writes:
|Well, yes, as long as Ruby will let me get at the codepoints
|efficiently. Oh, and in an ideal world, use Unicode properties like I
|can in Perl (\p{Lu} for example). I think that would make all us
|Unicode whiners shut up. -T
OK, now Ruby 1.9 has String#each_codepoint and understands \p{Lu} for
regular expression. I hope all Unicode whiners would complain no
longer.
matz.
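For readers who have not tried the two features Matz mentions, a short sketch (the sample text is arbitrary and written with escapes so it does not depend on the source file's encoding):

```ruby
text = "Ruby \u00C9lan \u03A9"   # "Ruby", E-acute "lan", Greek capital omega

# String#each_codepoint yields each character as an Integer.
codepoints = text.each_codepoint.to_a

# \p{Lu} matches any character with the Unicode "uppercase letter" property.
uppercase = text.scan(/\p{Lu}/)
```

Here `uppercase` picks up the non-ASCII capitals as well as the plain "R", which a character class like `[A-Z]` would miss.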
on 2008-09-18 07:31
Hi,
In message "Re: [ruby-core:18678] Re: Character encodings - a radical
suggestion"
on Thu, 18 Sep 2008 02:17:32 +0900, Tim Bray <Tim.Bray@Sun.COM>
writes:
|Also, Mojikyo is much larger than Unicode.
Indeed. I've heard some people used a prototype of M17N Ruby to process
Mojikyo text. Since the Mojikyo character set is not compatible with
Unicode, it was their only way to process Mojikyo text using a scripting
language. That is one of my primary motivations for the current M17N
design, although 1.9.1 does not support Mojikyo encoding yet.
matz.
|So, there are two reasonable goals:
|1. Provide useful features for strings and characters in the general
|case, including non-Unicode data
|2. Provide good support for Unicode
|
|The problem is that the people who care about #1 don't care about
|#2 very much, and vice versa. The people who spend all their time
|doing Web/Internet stuff mostly only care about #2. It seems that
|Unicode is important so Ruby should have some special low-level
|language facilities, especially for efficient access to codepoints.
|If we can do that without harming #1, then everyone should be happy.
We have just started to care about #2 lately, since we are about to
finish #1.
matz.
on 2008-09-18 07:32
Hi,
In message "Re: [ruby-core:18693] Re: Character encodings - a radical
suggestion"
on Thu, 18 Sep 2008 14:06:46 +0900, "Michael Selig"
<michael.selig@fs.com.au> writes:
|Nearly! Even if I transcode to UTF-8, I still have to make sure I do it at
|every interface, and that includes to Ruby's standard classes as well - not
|just IO. So I'll still have to check encodings of strings returned from
|network classes, and that's something I don't think I need to do with other
|languages that support Unicode, because there is only one internal string
|representation. Also testing that I have got it all right may be a nightmare
|(not that I am anywhere near that stage yet!). It would be so much nicer to
|have Ruby handle most of this for me.
You have pointed out an important issue here, I think. Let me think
about it. Although I still don't believe auto-conversion is the way
to go.
matz.
on 2008-09-18 07:35
Hi,
----- Original Message -----
From: "Yukihiro Matsumoto" <matz@ruby-lang.org>
To: <ruby-core@ruby-lang.org>
Sent: Thursday, September 18, 2008 3:03 PM
> If you can do implicit conversion at I/O, why do you have to care
> about encoding mixing? Your program should treat single encoding
> anyway. Auto-conversion is bad, believe me.
I/O is only one way I can get data. For example, when I get data via HTTP, how do I control what encoding Ruby's libraries are going to use to return data? I assume I'll get whatever encoding is specified in the HTTP header, so I'll have to remember to convert to UTF-8. It is dangerous to assume that I'll always get UTF-8 data via HTTP.
Mike
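The "remember to convert" step Mike describes could be collected in one place. The helper below is purely hypothetical (`to_utf8` is not part of Net::HTTP or any standard library), and the charset name is assumed to have been parsed from the response headers already; no network access is involved in the sketch.

```ruby
# Hypothetical boundary helper: label raw bytes with the charset the
# peer declared, then transcode to UTF-8 for use inside the program.
def to_utf8(raw_bytes, declared_charset)
  encoding = Encoding.find(declared_charset)   # ArgumentError if the name is unknown
  raw_bytes.dup.force_encoding(encoding).encode(Encoding::UTF_8)
end

body = "caf\xE9".b                  # bytes as they might arrive with charset=ISO-8859-1
text = to_utf8(body, "ISO-8859-1")
```

This still has to be called at every interface, which is exactly the burden Mike is pointing out; the sketch only shows that each call site can be a one-liner.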
on 2008-09-18 08:33
Michael Selig wrote:
> ----- Original Message ----- From: "Urabe Shyouhei"
> <shyouhei@ruby-lang.org>
>> Have you read Matz's post about yen sign problem? Converter IS a
>> problem; you cannot make a converter over (Encoding A -> Unicode ->
>> Encoding A). That must lose some input. Data loss is the worst thing
>> to introduce, so ruby asks you to take the risk by explicitly calling a
>> conversion method.
>
> Yes I have read it.
> Do you refuse to drive a car because you may have a crash?
We are *designing* a car now. Don't you need a seatbelt because you'll never crash your car? That's insane.
Current design might not be perfect. There must be a better practice. But that "better" should be a synonym of "safer".
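The "seatbelt" in question is visible in how String#encode behaves. A small sketch, using US-ASCII as the target only because it makes the loss obvious (it is not one of the Japanese encodings under discussion):

```ruby
text = "price: \u00A5100 \u2192 ok"   # contains YEN SIGN (U+00A5) and an arrow (U+2192)

# By default, a character with no mapping in the target encoding raises.
strict_error =
  begin
    text.encode("US-ASCII")
    nil
  rescue Encoding::UndefinedConversionError => e
    e
  end

# A lossy conversion has to be requested explicitly by the caller.
lossy = text.encode("US-ASCII", undef: :replace, replace: "?")
```

The data loss in `lossy` is something the programmer opted into, which is the design Urabe is defending.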
on 2008-09-18 14:45
On Sep 18, 2008, at 12:10 AM, Yukihiro Matsumoto wrote:
> |can in Perl (\p{Lu} for example). I think that would make all use
> |Unicode whiners shut up. -T
>
> OK, now Ruby 1.9 has String#each_codepoint and understands \p{Lu} for
> regular expression. I hope all Unicode whiners would complain no
> longer.
Thanks for putting up with all our whining about encodings Matz. We know you are working hard to make things better for all of us.
James Edward Gray II
on 2008-09-18 22:31
On Sep 17, 2008, at 10:10 PM, Yukihiro Matsumoto wrote:
> OK, now Ruby 1.9 has String#each_codepoint and understands \p{Lu} for
> regular expression. I hope all Unicode whiners would complain no
> longer.
The community of Unicode whiners says "thank you" to the community of Ruby implementors. I'll grab this code and see how it works for XML parsing.
-Tim
on 2008-09-19 10:19
At 10:20 08/09/17, Michael Selig wrote:
>Hi,
>
>You might at first glance think that this post should go to ruby-dev, but
>please read to the end!
If it's in English, it should be ruby-core, not ruby-dev, as far as I understand.
> puts "The value is #{val}"
>
>fails if val is UTF-16 data.
I think in this case, the reason why you see the problem only for UTF-16 is that your string, other than the interpolated data, is currently all US-ASCII. But imagine that sooner or later you (or somebody) is going to localize your application. Then the string might be in any encoding, and you'll get much more "encoding compatibility" exceptions.
>At one stage I got so frustrated that I was even thinking about going back
>to Python :-(
>So I have ended up transcoding any UTF-16 data to UTF-8, and now things
>are going much better.
>
>Maybe I am doing something wrong - if so please suggest something I can do
>other than transcode the UTF-16.
I think your problem is more general, and you should transcode other encodings to UTF-8, too, if you're not sure you'll be in a situation with a single encoding.
>But this has lead me to look back at the issues with UTF-16 I have hit,
>and to think about all the internal code in Ruby to handle "ASCII
>incompatible" encodings, and the overhead involved with supporting it.
>
>And I think that other Ruby programmers may end up doing what I have done
>- avoid using UTF-16 internally because it is too hard.
I agree that all non-ASCII encodings should come with a sticker with a big warning on it, at least.
>So my radical suggestion is this:
>
>Remove internal support for non-ASCII encodings completely, and when
>reading/writing UTF-16 (and UTF-32) files automatically transcode to/from
>UTF-8.
I can understand the former part. Providing something half-baked can have advantages and disadvantages.
>My reasons:
>
>- String & Regexp operations should just "work" without the programmer
>worrying about encoding comaptibility (I think!)
See below.
>- The programmer only has to think about character encodings at the
>"interfaces" (files, network interfaces) not throughout the program logic
This is desirable/good architecture. Ruby 1.9 will force you to do that, or come up with some other architecture, but won't handle things automatically for you.
>- To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible" as
>Ruby defines it
No, there are others, such as iso-2022-jp. But they are not really the main issue. You can get an encoding incompatibility error for any two ASCII-compatible encodings. E.g. iso-8859-1 and iso-8859-2, or any two others. The reason that you currently don't is that one of your strings (or a regexp) always is ASCII-only, even if it's labeled as something else.
>- To my knowledge no one actually uses UTF-16 or UTF-32 as a locale
True.
>- I would avoid having to use ugly modes to open a file like
>"r:UTF-16LE:UTF-8" (very minor)
Telling Ruby what encoding you expect from the outside is kind of unavoidable. But it would indeed help if it would suffice to tell a Ruby application only once that you want to handle everything internally in a certain encoding.
>- Ruby's internal code would be simpler & cleaner and therefore probably
>faster and easier to maintain
If everything is done in UTF-8 all the time, yes. But I don't think we will go there soon (I wouldn't mind). Speed isn't too much of an issue, but of course the code would be quite a bit simpler.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
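Martin's point about two ASCII-compatible encodings can be checked with `Encoding.compatible?`. A minimal sketch with made-up variable names; the same character (U+00E9) exists in both ISO-8859-1 and ISO-8859-2, which keeps the comparison fair:

```ruby
ascii_latin1 = "abc".encode("ISO-8859-1")      # ASCII-only content
e_latin1     = "\u00E9".encode("ISO-8859-1")   # non-ASCII content
e_latin2     = "\u00E9".encode("ISO-8859-2")   # non-ASCII content, other encoding

# An ASCII-only string is compatible with any ASCII-compatible encoding.
ascii_case = Encoding.compatible?(ascii_latin1, e_latin2)

# Two strings with non-ASCII content in different encodings are not.
mixed_case = Encoding.compatible?(e_latin1, e_latin2)

concat_error =
  begin
    e_latin1 + e_latin2
    nil
  rescue Encoding::CompatibilityError => e
    e
  end
```

So the error is not specific to UTF-16; it simply shows up sooner there because no UTF-16 string, even one holding only ASCII characters, counts as ASCII-only.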
on 2008-09-19 10:20
At 10:21 08/09/18, Urabe Shyouhei wrote:
>Have you read Matz's post about yen sign problem?
The yen sign problem is indeed a big problem. It's similar to the Y2K problem (people knew they shouldn't use just two digits, but they did, and people know they shouldn't use 0x5c for the Japanese currency anymore, but they still do), except that there is no deadline, and so there is not enough pressure to fix it.
>Converter IS a
>problem; you cannot make a converter over (Encoding A -> Unicode ->
>Encoding A).
Sorry, but for all the encodings in daily use, including those in Japan, round-tripping via Unicode works fine. Unicode was explicitly designed to do that (at the expense of introducing quite a bit of what some people might call garbage). This very much includes the Yen/Backslash. The problems may start when you try to do some processing. (many kinds of processing are not affected, but some are)
>That must lose some input. Data loss is the worst thing
>to introduce, so ruby asks you to take the risk by explicitly calling a
>conversion method.
Taking the risk explicitly is fine. But some people may feel that it's easier to do that application-by-application than string by string.
>Problems on character encodings are sourced from complexities of human
>activities.
Very much so indeed.
>I can hardly believe there are any simple, perfect, and/or
>"neat" solution.
Who said Unicode is neat? It's just that sometimes one messy solution is better than a mess of many solutions :-(.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-19 10:20
[most of this mail, and some others, was written Wednesday, and so may repeat some of what Matz and others have said, but I had big problems getting mail out.]
At 13:21 08/09/17, Michael Selig wrote:
>I have been doing some more thinking about these ongoing issues....
>
><soapbox>
>
>Using Ruby SHOULD be making our lives easier, not harder.
Very much so.
>Other languages
>like Python have taken an easier route to m17n - represent all strings
>internally as unicode codepoints. Then there should never be a need to
>check encoding compatibility, right?
Yes. The requirement is that you have to make sure your application knows what encoding it's dealing with, and that you have to make sure you can convert everything, even 'private use' characters appearing with a certain frequency in East Asian encodings.
>I am not saying that this is a
>perfect solution either, by the way. But having to work around this
>"Encoding Compatibility Error" all the time is just a pain for apps which
>need to work in different countries with different locales. Unfortunately
>it is leading me towards the path of having to transcode everything to
>UTF-8, even though in 99% of cases all the data IS going to be compatible
>and be in the user's locale. I don't want so much of my time taken up, and
>be forced to write ugly code to take care of the remaining 1%.
In my view, you either have a true single-encoding situation, in which case Ruby should work great, or you have a mixed-encoding situation. And even 1% of "other" encodings means a mixed situation. In a mixed situation, going "Unicode inside" (which for Ruby means "UTF-8 inside") is the best thing to do in most cases.
Unicode inside is a model that many, many applications and several programming languages have chosen for many good reasons. Ruby currently supports it, but not as seamlessly as it could. Getting more input about where things hurt most is very helpful.
There are probably two things that differ from "all Unicode inside" programming languages such as Perl, Python, and Java:
- Because Ruby allows you to use all kinds of non-Unicode encodings, it may give the impression that things work with mixed encodings, and lets you postpone some necessary cleanup that you'd otherwise do upfront.
- When reading data, in Java and friends, you only have to indicate the external encoding. In Ruby, you have to mention UTF-8, too, because otherwise the encoding is used just as a label, without conversion. For a "Unicode inside" application, that's an additional burden. [I'm glad to see that Matz thinks that's ugly, too, and wants to do something about it in the future.]
I have suggested that we introduce some kind of "encoding policy" that lets some things happen "automagically". (see http://www.sw.it.aoyama.ac.jp/2007/pub/IUC31-ruby/Paper.html, Section 6). One such policy could be "whenever you might get an exception due to an encoding mismatch, try to transcode (e.g., to UTF-8)". Another could be "transcode all input to UTF-8 unless there is a specific indication that another encoding is wanted".
The main problem with such an approach is that it's very difficult to do this globally, because libraries may have very different assumptions or restrictions, and Ruby doesn't have a 'per library' concept. My understanding is that similar problems can happen with class extensions (two different libraries adding or changing methods with the same name in the same class,..., or one library depending on a change where another depends on having nothing changed,...), and that some solution to this problem is one of the things that Matz mentioned when talking about Ruby 2.0. If such a solution would indeed happen, I guess it wouldn't be too difficult to also use that solution for dealing with "encoding policies". But all this is currently just some vague feeling, none of it exists in actual code.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
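The first policy Martin mentions ("try to transcode when there would be a mismatch") does not exist in Ruby, but at application level it could be approximated roughly as follows. `concat_with_utf8_fallback` is a hypothetical name, and the sketch covers a single operation only, not the global or per-library switch the mail is about.

```ruby
# Hypothetical policy helper; nothing like this is built into Ruby.
# Normal behaviour is tried first, so compatible strings are untouched.
def concat_with_utf8_fallback(left, right)
  left + right
rescue Encoding::CompatibilityError
  # Only on a mismatch: move both sides to UTF-8 and retry.
  left.encode(Encoding::UTF_8) + right.encode(Encoding::UTF_8)
end

latin1_a = "\u00E9".encode("ISO-8859-1")
latin1_b = "\u00E8".encode("ISO-8859-1")
utf16    = "\u00FC".encode("UTF-16LE")

same_encoding = concat_with_utf8_fallback(latin1_a, latin1_b)
mixed         = concat_with_utf8_fallback(latin1_a, utf16)
```

ASCII-8BIT strings containing high bytes have no mapping to UTF-8, so for them the fallback itself raises; that lines up with the earlier suggestion in this thread to keep byte strings out of any auto-conversion.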
on 2008-09-19 10:20
At 23:28 08/09/17, Yukihiro Matsumoto wrote:
> on Wed, 17 Sep 2008 10:20:13 +0900, "Michael Selig"
><michael.selig@fs.com.au> writes:
>|- I would avoid having to use ugly modes to open a file like
>|"r:UTF-16LE:UTF-8" (very minor)
>
>This is ugly indeed. We might add more Unicode support in the
>future. But we are no hurry.
The problem here is that it would be much better if we could avoid forcing many people to use such ugly stuff for a few years. But I have to admit that I don't know exactly how a better solution would look like.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-19 10:21
At 00:01 08/09/18, Yukihiro Matsumoto wrote:
>|Unicode. Why does no-one consider going that way, but instead builds
>|a complicated stack of functions for conversions on top level?
>
>Just because it's impossible. History sucks. We have mixed up YEN
>SIGN and REVERSE SOLIDUS for long time. They cannot be distinguished
>without context information. Technically 0x5c should mean REVERSE
>SOLIDUS, but not always so for humans.
Thanks for putting it so bluntly. The Europeans did similar things in the ISO 646 age (7-bit encodings with national variants), but were fortunate enough to go through an intermediate stage of 8-bit encodings before going multibyte.
>Besides that, Unicode is not a panacea.
Definitely not. But it makes a lot of things a lot easier for a lot of people.
>Some character set
>(e.g. GB18030 for Chinese characters) is even bigger than Unicode.
>In fact, GB18030 is a super set of Unicode.
How exactly? I know that the Chinese government is requiring GB 18030 support for software sold in China, and that the Unicode Consortium and all the companies involved have been working hard to make sure that this requirement is met by converting from and to Unicode so that applications can use Unicode internally.
>|should not have to deal with this issue so prominently.
>
>You are free to feel so, but it's us who take up the burden.
I can't speak for Matz, but I think anybody who wants to share some burden by providing patches and such is also very welcome, although some of the issues discussed on this list are not yet at the level where somebody could write a patch.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-19 10:33
Hi,
In message "Re: [ruby-core:18728] Re: Character encodings - a radical
suggestion"
on Fri, 19 Sep 2008 17:11:53 +0900, Martin Duerst
<duerst@it.aoyama.ac.jp> writes:
|>Besides that, Unicode is not a panacea.
|
|Definitely not. But it makes a lot of things a lot easier for a
|lot of people.
Indeed. And we'd like to help make people's lives easier (using Unicode).
Our M17N API is not the final cut (although it's almost fixed for
1.9.1). If there's any idea to make Ruby's Unicode support better,
let us hear. But no thanks in advance for the proposals to abandon
what we have now. ;-)
matz.
on 2008-09-19 11:42
On Fri, 19 Sep 2008 18:24:41 +1000, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> If there's any idea to make Ruby's Unicode support better,
> let us hear. But no thanks in advance for the proposals to abandon
> what we have now. ;-)
>
I assume that you are referring to my suggestion to remove support for "non-ASCII compatible" encodings? I don't think that it is an unreasonable proposal given the fact that it is so difficult to handle them. I have often pulled out features in software I have written that seemed a "good idea at the time" but later turned out not to be. I am not saying that they are a bad idea, but I currently cannot see their value as a "fully-fledged" internal encoding, and I was concerned that supporting them may make a "rod for your own back" when it comes to handling them in libraries.
Of course you are the boss when it comes to Ruby, but I feel that you could have phrased this statement ("But no thanks in advance....") a little better. It and the "Unicode whiners" comment are rather offensive I feel. We are only trying to help.
Mike.
on 2008-09-19 12:01
Oops, I misfired my mail reader; the following is the right one:
In message "Re: [ruby-core:18732] Re: Character encodings - a radical
suggestion"
on Fri, 19 Sep 2008 18:34:21 +0900, "Michael Selig"
<michael.selig@fs.com.au> writes:
|I assume that you are referring to my suggestion to remove support for
|"non-ASCII compatible" encodings?
No, I was referring to past proposals such as "abandon all M17N and
choose Unicode as the unified internal character set, like other
`major' languages do".
|Of course you are the boss when it comes to Ruby, but I feel that you
|could have phrased this statement ("But no thanks in advance....") a
|little better.
I am sorry if my phrase appeared offensive. UTF-16 is a nasty beast,
but as I stated we have other beasts (dummy encodings), so that simply
removing UTF-16 would help us little. We have to do it consistently,
if we do.
matz.
on 2008-09-19 12:25
Hi,
In message "Re: [ruby-core:18732] Re: Character encodings - a radical
suggestion"
on Fri, 19 Sep 2008 18:34:21 +0900, "Michael Selig"
<michael.selig@fs.com.au> writes:
|I assume that you are referring to my suggestion to remove support for
|"non-ASCII compatible" encodings?
No, I was referring the past proposal like "abandon all M17N and
choose M17N as unified internal character set, like other `major'
languages do" as such.
|Of course you are the boss when it comes to Ruby, but I feel that you
|could have phrased this statement ("But no thanks in advance....") a
|little better.
I am sorry that my phrase appear offensive.
It and the "Unicode whiners" comment are rather offensive I
|feel. We are only trying to help.
|
|Mike.
|
on 2008-09-19 14:39
On Sep 19, 2008, at 4:34 AM, Michael Selig wrote:
> It and the "Unicode whiners" comment are rather offensive I feel.
I took that as a joke and laughed when I read it. I think matz has a
good sense of humor. :)
James Edward Gray II
on 2008-09-19 14:44
On Fri, Sep 19, 2008 at 8:30 AM, James Gray <james@grayproductions.net> wrote:
> On Sep 19, 2008, at 4:34 AM, Michael Selig wrote:
>> It and the "Unicode whiners" comment are rather offensive I feel.
> I took that as a joke and laughed when I read it. I think matz has a good
> sense of humor. :)
GMail tells me that, at least in this thread, Tim Bray is the one who used it first to describe himself. ;)
-austin
on 2008-09-19 15:48
On Sep 19, 2008, at 4:52 AM, Yukihiro Matsumoto wrote:
> UTF-16 is a nasty beast,
> but as I stated we have other beasts (dummy encodings), so that simply
> removing UTF-16 would help us little. We have to do it consistently,
> if we do.
I'm no expert in any of this, but I wonder if part of the problem might be that Ruby tries to support all encodings both internally and externally. Might it be easier to support the full set externally, but to have a more limited set internally?
For example, you could support UTF-16<any endian> as an external encoding, but transcode to UTF-8 on the way in. You could still support a rich variety of internal encodings, including the Asian ones you need. But you wouldn't have to deal with UTF-16 when implementing Regexp#escape :) So, keep the current set of encodings, but only allow a reasonable (ASCII-compliant) subset as internal encodings.
Dave
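A minimal sketch of the "transcode on the way in" idea Dave describes, assuming a Ruby (1.9 or later) whose File.open accepts an "external:internal" encoding pair in the mode string. The temp-file prefix and the sample text are made up for illustration; this is not code from the thread.

```ruby
require "tempfile"

# Hypothetical sample data: a small UTF-16LE file, written as raw bytes.
text = "The value is 42\n"
file = Tempfile.new("utf16-sample")
file.binmode
file.write(text.encode("UTF-16LE"))
file.close

# "r:UTF-16LE:UTF-8" = external encoding UTF-16LE, internal encoding UTF-8.
# Ruby transcodes while reading, so the program only ever sees UTF-8.
line = File.open(file.path, "r:UTF-16LE:UTF-8") { |f| f.read }

puts line.encoding           # encoding tag of the string that was read
puts "Read back: #{line}"    # interpolation works: both sides are ASCII-compatible
file.unlink
```

With this approach UTF-16 stays an external concern: nothing after the File.open call needs to know what the file on disk looked like.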
on 2008-09-19 16:40
Hi,
In message "Re: [ruby-core:18742] Re: Character encodings - a radical
suggestion"
on Fri, 19 Sep 2008 22:40:20 +0900, Dave Thomas <dave@pragprog.com>
writes:
|I'm no expert in any of this, but I wonder if part of the problem
|might be that Ruby tries to support all encodings both internally and
|externally. Might it be easier to support the full set externally, but
|to have a more limited set internally?
That's what we thought. Limited support for UTF-16 and dummy encodings
is our result, which seems to be imperfect.
|For example, you could support
|UTF-16<any endian> as an external encoding, but transcode to UTF-8 on
|the way in. You could still support a rich variety of internal
|encodings, including the Asian ones you need. But you wouldn't have to
|deal with UTF-16 when implementing Regexp#escape :) So, keep the
|current set of encodings, but only allow a reasonable (ASCII-
|compliant) subset as internal encodings.
I think you've suggested something valuable, but I cannot imagine the
detail (yet). Magically transcode text in unsupported encoding to
certain supported encoding. Hmm.
UTF-16 and UTF-8 are easier set, since they are semantically same.
But how should we treat ISO-2022-JP for example?
# No, I am asking myself, not you, Dave.
matz.
on 2008-09-19 17:42
On Sep 19, 2008, at 2:52 AM, Yukihiro Matsumoto wrote:
> I am sorry if my phrase appeared offensive. UTF-16 is a nasty beast,
> but as I stated we have other beasts (dummy encodings), so that simply
> removing UTF-16 would help us little. We have to do it consistently,
> if we do.
Actually, I think it would be perfectly OK to remove runtime support for UTF-16, if you make input and output possible. I can't think of any practical advantages to handling multiple UCS encodings, for regexing or parsing or splitting or matching. UTF-16 is horrible; C# and Java will be paying the price for choosing 16-bit characters long after we're dead.
-Tim
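To make the "16-bit characters" remark concrete, here is a small sketch (my illustration, not Tim's) of a character outside the Basic Multilingual Plane. It assumes Ruby 1.9 or later; the choice of U+1D11E is arbitrary.

```ruby
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
clef = "\u{1D11E}"

# Split the UTF-16BE form into 16-bit code units ("n" = big-endian uint16).
utf16_units = clef.encode("UTF-16BE").unpack("n*")
utf8_bytes  = clef.bytes

puts clef.length                                          # characters
puts utf16_units.length                                   # 16-bit code units
puts utf16_units.map { |u| format("0x%04X", u) }.join(" ")
puts utf8_bytes.length                                    # bytes in UTF-8
```

One character needs two 16-bit units (a surrogate pair), so code that treats a 16-bit unit as "a character" breaks on such input.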
on 2008-09-19 17:42
On Sep 19, 2008, at 6:40 AM, Dave Thomas wrote:
> I'm no expert in any of this, but I wonder if part of the problem
> might be that Ruby tries to support all encodings both internally
> and externally. Might it be easier to support the full set
> externally, but to have a more limited set internally? For example,
> you could support UTF-16<any endian> as an external encoding, but
> transcode to UTF-8 on the way in. You could still support a rich
> variety of internal encodings, including the Asian ones you need.
> But you wouldn't have to deal with UTF-16 when implementing
> Regexp#escape :) So, keep the current set of encodings, but only
> allow a reasonable (ASCII-compliant) subset as internal encodings.
+1
-Tim
on 2008-09-19 17:42
On Sep 19, 2008, at 2:34 AM, Michael Selig wrote:
> . It and the "Unicode whiners" comment are rather offensive I feel.
> We are only trying to help.
To be fair, I think the "Unicode whiners" thing is a joke that I started. I am proud to be a Unicode whiner. Also preacher, advocate, dialectician, and polemicist.
-Tim
on 2008-09-20 03:08
On Fri, 19 Sep 2008 19:52:30 +1000, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
>
> I am sorry if my phrase appeared offensive. UTF-16 is a nasty beast,
> but as I stated we have other beasts (dummy encodings), so that simply
> removing UTF-16 would help us little. We have to do it consistently,
> if we do.
No problem - it appears I misunderstood you, sorry. Easy to happen with email, unfortunately :-(
Perhaps we need to go back to basics with this discussion. As a mere English speaker, I do not fully understand the issues that are faced by Japanese and other encodings. What I have gathered from this discussion is (please tell me if I am wrong):
- There are characters that Ruby needs to support which cannot be uniquely mapped to Unicode
- In fact there are entire character sets that we want to support in Ruby that are not supported in Unicode
- There are ambiguous characters in some character sets - same code for different characters
I think it would be a benefit if we all got to understand a bit more:
- How the character ambiguity (eg: Yen/backslash) issue is handled at the moment - generally, not just with Ruby. ie: how do you know that a printer or screen is going to show the right character?
- How the various "non-ascii compatible" encodings are used in practice. eg: it is my understanding that UTF-7 is really only used in email, and that it would be straightforward to immediately transcode it to/from UTF-8 in a POP/IMAP library, so UTF-7 could be avoided completely as an "internal" encoding in Ruby. It's as if we were treating UTF-7 like base64 - just a transformation of a "real" encoding. (In fact UTF-16 & 32 could be considered the same sort of thing, except they may be used more widely.)
- How a Japanese programmer would handle the situation of dealing with a combination of a Japanese non-Unicode compatible character set, and say a UTF-8 encoding which included non-ascii characters, and non-Japanese ones. ie: Is there a reasonable alternative to encoding both to Unicode & somehow dealing with the "difficult characters" as special cases?
Could someone out there please succinctly explain these things to us westerners? Then perhaps our thinking about this issue may be more aligned.
Thanks
Mike
on 2008-09-20 03:21
At 17:20 08/09/17, Robert Klemme wrote:
>encodings very clean and workable. But if I remember correctly Matz
>once said that Unicode does not cover all Asian symbols so it might
>not be a too good choice for internal representation.
In some sense, this is true, but then this is true for any other encoding (in particular all those used in Asia), too. So that's not really an argument (apart from the fact that if you really need, you can always use the huge private-use areas provided by Unicode; not that I would suggest that though myself).
>I believe that one reason for the difficulties we encounter now is the
>fact that String is historically used for binary and text data. So
>there is no clear separation between the two and this bears potential
>for confusion and bugs.
That's a part of the problem, but not too big a part.
>A clean solution would probably involve having a character type which
>is capable of representing *all* possible symbols and model String as
>sequence of those characters.
Well, yes, that could be done, e.g. model a character as the union of Unicode codepoints and any other odd objects. Users could then define their own objects for their own characters, e.g. with lots of metadata, or font information, or what not. Such ideas have been around for a long time, but in contrast to Ruby's current model, which in some ways is on the edge, but still doable, such a model quickly becomes way more complex and hopelessly slow.
>Encoding would then be done during
>input and output only. Questions I see
>
>1. Is this feasible, i.e. is there something similar to Unicode
>without its limitations?
Unicode is as good as it gets. And it gets better and better (not all historic or minority scripts are encoded yet, but that work is ongoing). The conclusion is: If you want something better than Unicode (in particular something with more scripts and characters covered), the best thing is to contribute to Unicode.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-20 04:04
At 09:42 08/09/18, Austin Ziegler wrote:
>to use the private use plane in Unicode.
Very much agreed. Private use areas (a small area in the BMP (Basic Multilingual Plane) and planes 15 and 16) are free-for-all, which means you are never really sure what you get there. (For those who want some more terse background reading, I recommend http://www.w3.org/TR/charmod/#sec-PrivateUse)
>Adding glyphs to Unicode is a
>lengthy process that requires going through a standards body. The
>Unicode standard is updated every few years, but the Unicode
>consortium is much more likely to listen to the Japanese standards
>bodies than Ruby programmers.
Well, yes, first because the relevant Japanese standards body is a member of ISO/IEC JTC1/SC2/WG2, the group responsible for ISO 10646, which is in sync with Unicode. And second because Ruby programmers as a group don't have any particular character encoding needs.
>The reality is that Unicode *doesn't* completely represent all Asian
>languages well
True. There are still many (minor) scripts that are not yet encoded, and most of them are used in Asia, in the same way as most of the scripts already encoded are used in Asia. (For more details, please see http://unicode.org/roadmaps/ and the links from there to the roadmaps for various parts of Unicode.)
>(see the discussions around Han unification for a brief
>primer on the issues involved).
Complaints about Han unification are mostly unjustified. The discussion e.g. around Internationalized Domain Names has shown that unification has significant advantages. You get into problems when e.g. a Latin 'A', a Cyrillic 'A', and a Greek 'A' are encoded separately (as they currently are, not the least because they are encoded separately in some important East Asian standards). I do not want to imagine the mess we would have if there were separate codes for Chinese/Japanese/Korean (and maybe Vietnamese, Taiwanese,...) "variants" of Han characters such as '一' (one), '二' (two), '三' (three), and so on.
>The problem is exacerbated in the
>academic arena where people want to be able to represent ancient
>characters accurately, but it's not limited to that.
Yes, and if you look at academic use, the same can be said for the Western World. As a simple example, Unicode doesn't contain codepoints for all the many ligatures used in the Gutenberg bible. The only difference may be that researchers in the West are more ready to use an additional layer (e.g. some XML markup or so) for this, whereas in Asia, the fact that there is already such a huge number of characters makes it very easy for people to think that just adding more characters is the solution for these problems.
>Just because you
>and I can represent our words in under one hundred characters doesn't
>mean that it's appropriate to do the same with others' languages.
Of course not. And Unicode definitely hasn't done that, quite to the contrary. Korean got more than 11,000 characters, of which by all accounts less than 3000 are actually used, the only purpose of the rest being to complete a nice-looking three-dimensional table. Han characters currently count around 70,000, of which the majority is mainly used in dictionaries, and many of them with entries of the form (freely translated): "A: variant/misprint for B, see B."
Mind you, there are still a lot of Han characters (the core being about 21,000) that are really useful because they are supported on everyday computer systems in China, Japan, Korea, and so on. And a smaller subset of these (around 2000-3000 for Japanese, less for Korean, more for Chinese) is what people actually use day in day out.
>It's getting better, but it's still not perfect.
Very much so indeed.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-20 07:09
2008/9/19 Martin Duerst <duerst@it.aoyama.ac.jp>:
> codes for Chinese/Japanese/Korean (and maybe Vietnamese, Taiwanese,...)
> "variants" of Han characters such as '一' (one), '二' (two), '三' (three),
> and so on.
I'm not disagreeing with you in principle, but even if the complaints are unjustified, the fact is that they exist and they slowed adoption of Unicode in Asian countries pretty significantly.
> to think that just adding more characters is the solution
> for these problems.
It's also a little different for the Asian researchers because different characters are different words. It may also be a display problem; most Western language ligatures can be approximated on computer displays with just a little tweaking of the display of two characters, even if a separate glyph in a font is always better. This isn't always possible with Asian language characters, by my understanding.
Still, I am encouraged to see Ruby keeping m17n yet improving its Unicode support.
-austin
on 2008-09-20 07:37
As far as I understand, that was the original plan. The question is how exactly to distinguish internal and external encodings. Should we e.g. allow "UTF-16BE" in a mode when opening a file, but not as an argument to String#encode? But then what if you want to convert to UTF-16BE and then use some compression (gzip,...) on output?
And I think once we were at that point, what happened was that to whatever extent it was easy to support an encoding, it was done. As an example, Oniguruma supported UTF-16(BE/LE) and so on, so that's usable now.
The alternative, which is suggested by this discussion, is that we decide on a (pretty high) minimum standard for support for an encoding. All encodings that don't reach that standard are simply declared dummy and behave as such (i.e. the same as binary, or even with less functionality).
This would force at least those who understand the issues to use conversion. But there would still be those that might do operations on a string labeled "UTF-16BE" under the impression that this actually works. It would also mean that each application has to do some work to distinguish 'really supported' and 'dummy label' encodings. Or that (as you suggest) conversion would be automatic, which should work for Unicode-based encodings, but which might bring up very subtle issues e.g. when converting from iso-2022-jp to euc-jp (or do you choose shift_jis?).
Regards, Martin.
At 22:40 08/09/19, Dave Thomas wrote:
>externally. Might it be easier to support the full set externally, but
>
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
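The work of telling 'really supported' from 'dummy label' encodings could look roughly like this. It is only a sketch: it assumes a Ruby where Encoding#dummy? and Encoding#ascii_compatible? are available (both are in current releases), and the method name support_level and its three labels are hypothetical, chosen for this illustration.

```ruby
# Classify an encoding by how much String/Regexp support a program can expect.
# The three labels are illustrative names, not Ruby terminology.
def support_level(enc)
  enc = Encoding.find(enc) unless enc.is_a?(Encoding)
  if enc.dummy?
    :dummy                # a label only; character handling is not implemented
  elsif enc.ascii_compatible?
    :ascii_compatible     # can be mixed freely with ASCII literals
  else
    :ascii_incompatible   # handled, but cannot be combined with ASCII strings
  end
end

%w[UTF-8 EUC-JP Shift_JIS UTF-16LE UTF-32BE ISO-2022-JP UTF-7].each do |name|
  puts format("%-12s %s", name, support_level(name))
end
```

An application following the "minimum standard" idea would convert anything that is not :ascii_compatible before doing string work on it.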
on 2008-09-20 10:32
Jo.
In message "Re: [ruby-core:18753] Re: Character encodings - a radical
suggestion"
on Sat, 20 Sep 2008 14:28:25 +0900, Martin Duerst
<duerst@it.aoyama.ac.jp> writes:
|The alternative, which is suggested by this discussion,
|is that we decide on a (pretty high) minimum standard for
|support for an encoding. All encodings that don't reach
|that standard are simply declared dummy and behave as
|such (i.e. the same as binary, or even with less functionality).
I feel you've suggested something important. Can you be more concrete
on the idea? Making UTF-{16,32} dummy again? Or something else?
matz.
on 2008-09-20 11:17
In article <E1Kgh9x-0001GF-5j@x61.netlab.jp>,
Yukihiro Matsumoto <matz@ruby-lang.org> writes:
> UTF-16 and UTF-8 are easier set, since they are semantically same.
> But how should we treat ISO-2022-JP for example?
I defined stateless-ISO-2022-JP, which is semantically the same as ISO-2022-JP. (Actually, it is a subset of Emacs-Mule.)
% ruby -ve '
p Encoding::Converter.asciicompat_encoding("ISO-2022-JP")
p Encoding::Converter.asciicompat_encoding("UTF-16BE")
p Encoding::Converter.asciicompat_encoding("UTF-16LE")
'
ruby 1.9.0 (2008-09-15 revision 19356) [i686-linux]
#<Encoding:stateless-ISO-2022-JP>
#<Encoding:UTF-8>
#<Encoding:UTF-8>
They are required to insert an error notice for conversion, for example.
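Building on Encoding::Converter.asciicompat_encoding as shown above, a program could normalise incoming strings before mixing them with ASCII literals. This is a sketch under the assumption that the method also accepts an Encoding object and returns nil when no counterpart exists; the helper name to_ascii_compatible is hypothetical.

```ruby
# Return str unchanged if its encoding is already ASCII-compatible; otherwise
# transcode it to the ASCII-compatible counterpart that Ruby suggests.
def to_ascii_compatible(str)
  return str if str.encoding.ascii_compatible?
  target = Encoding::Converter.asciicompat_encoding(str.encoding)
  raise ArgumentError, "no ASCII-compatible counterpart for #{str.encoding}" unless target
  str.encode(target)
end

val = "42".encode("UTF-16LE")
# Interpolating val directly would raise Encoding::CompatibilityError.
puts "The value is #{to_ascii_compatible(val)}"
```

The explicit conversion keeps the decision visible in the program, in contrast to the automatic transcoding discussed elsewhere in the thread.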
on 2008-09-20 11:53
At 02:13 08/09/18, NARUSE, Yui wrote:
>Yes, they have private use area codepoints, but I think they are not reasonable.
>
>The first reason is that the area is Private Use Area.
>
>Moreover there are some mobile phone careers in Japan and they define own emoji.
>And their PUA codepoints is conflicted.
>http://creation.mb.softbank.jp/web/web_pic_about.html
>
>So they can't be 'Uni'code yet.
Just for everybody's information, the Unicode Technical Committee (UTC) is working on encoding these into Unicode. But this is not at all an easy job. Contrary to most characters, most emoji are in color. Contrary to all previous characters, some emoji are actually animations. Also, some may have copyright or trademark issues. So a lot of issues have to be considered carefully. The UTC is also working on Japanese TV symbols and on emoticons (presumably from various instant messaging services and the like).
On the other hand, the mappings given by the companies are also not worked out. As an example, on http://creation.mb.softbank.jp/web/web_pic_03.html, the copyright and (R) sign, the zodiac signs, and heart/diamond/clover/... at least can easily be mapped to pre-existing Unicode characters without problems.
It would be good if the Japanese mobile carriers and others would recognize that inventing new "characters" isn't the right way to satisfy the needs of their customers for fancy little images, but as with most other stupidities in character encoding, it takes a while for the people involved directly to figure this out.
Anyway, if we were really serious in Ruby about handling these, we would have to introduce encodings such as Shift_JIS-NTT-Docomo, Shift_JIS-Softbank, and Shift_JIS-kddi-au or so, to make sure there is no mixup in the original encoding.
Regards, Martin.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-20 13:38
> It would be good if the Japanese mobile carriers and others
> would recognize that inventing new "characters" isn't the right
> way to satisfy the needs of their customers for fancy little
> images, but as with most other stupidities in character encoding,
> it takes a while for the people involved directly to figure this
> out.
Thankfully I think they started to understand it. Recent phones have now lots of emojis that are just GIFs, and when you use them in mails, the mails are sent as HTML mail with attachments and can be read in any normal mail client. It also allows the users to easily add (or even create) new ones. However if the mail is sent to an older phone the image may be badly displayed.
Anyway, the problem with the existing emoji characters will probably stay for a long time... (unfortunately it would not be the first problem due to the need to keep compatibility with the existing...)
on 2008-09-20 17:11
On Fri, Sep 19, 2008 at 10:03 AM, Tim Bray <Tim.Bray@sun.com> wrote:
> Actually, I think it would be perfectly OK to remove runtime support for
> UTF-16, if you make input and output possible. I can't think of any
> practical advantages to handling multiple UCS encodings, for regexing or
> parsing or splitting or matching. UTF-16 is horrible; C# and Java will be
> paying the price for choosing 16-bit characters long after we're dead. -Tim
>
I agree with this. I haven't looked at Ruby 1.9, but I don't see any point in supporting UTF-16, UTF-16BE, and UTF-16LE as internal encodings. So long as you can read and write them, I say keep the complexity out of Ruby's internals and convert to UTF-8 internally.
mathew
on 2008-09-20 18:14
Hi,
In message "Re: [ruby-core:18751] Re: Character encodings - a radical
suggestion"
on Sat, 20 Sep 2008 10:00:24 +0900, "Michael Selig"
<michael.selig@fs.com.au> writes:
|Perhaps we need to go back to basics with this discussion. As a mere
|English speaker, I do not fully understand the issues that are faced by
|Japanese and other encodings. What I have gathered from this discussion is
|(please tell me if I am wrong):
|
|- There are characters that Ruby needs to support which cannot be uniquely
|mapped to Unicode
Yes, even though they are minor.
|- In fact there are entire character sets that we want to support in Ruby
|that are not supported in Unicode
Yes, I know two of them: Mojikyo, which refuses character
unification. The character set contains 170,000 characters. At the
time I first heard that number was huge, but Unicode is approaching
pretty close (it now has more than 100,000 characters).
GB18030, defined by Chinese government. I don't know the detail, but
I've heard it officially contains Unicode as its subset. But encoding
scheme for GB18030 is up to 4 bytes per codepoint, so I am not sure how
it can hold 21-bit Unicode codepoints in it.
|- There are ambiguous characters in some character sets - same code for
|different characters
Yes.
|I think it would be a benefit if we all got to understand a bit more:
|
|- How the character ambiguity (eg: Yen/ backslash) issue is handled at the
|moment - generally, not just with Ruby. ie: how do you know that a printer
|or screen is going to show the right character?
Either avoiding conversion (operation based on bytes), or selecting
proper encoding scheme (out of many very similar encodings, such as
Shift_JIS, CP932, Windows-31J for example). Conversion table from
unicode.org is carefully designed to ensure roundtrip, although that
is the very reason we have so many similar encodings. If we can choose
(or negotiate) to use the same conversion table at both ends, it is
unlikely to have mojibake problems.
|- How the various "non-ascii compatible" encodings are used in practice.
|eg: it is my understanding that UTF-7 is really only used in email, and
|that it would be straightforward to immediately transcode it to/from UTF-8
|in a POP/IMAP library, so UTF-7 could be avoided completely as an
|"internal" encoding in Ruby. It's as if we were treating UTF-7 like
|base64 - just a transformation of a "real" encoding. (In fact UTF-16 & 32
|could be considered the same sort of thing, except they may be used more
|widely.)
UTF-{16,32}{BE,LE} are non-ascii compatible, but they are safe to
convert into UTF-8 since their difference only lies in encoding
scheme. They represent the same character set anyway. ISO-2022 is used
often in mails and web. The situation is a little bit more complicated,
but basically it can be converted into Unicode as well (with slight
risk of yen sign problem). You can ignore UTF-7.
|- How a Japanese programmer would handle the situation of dealing with a
|combination of a Japanese non-Unicode compatible character set, and say a
|UTF-8 encoding which included non-ascii characters, and non-Japanese ones.
|ie: Is there a reasonable alternative to encoding both to Unicode &
|somehow dealing with the "difficult characters" as special cases?
Unicode is getting better each day. So it now covers almost all
day-to-day problems. Some cellphone problems are covered by using
private area.
matz.
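The point about similar encodings and round trips can be tried out with a short sketch. It assumes a Ruby (2.0 or later, for String#b) that ships both the Shift_JIS and Windows-31J converters; the helper name round_trip and the choice of the byte pair 0x81 0x60 are mine. On the Ruby versions I know of, the two labels map these bytes to different code points (WAVE DASH versus FULLWIDTH TILDE), but treat that as an expectation to verify rather than a guarantee.

```ruby
# The same two bytes, interpreted under two closely related labels.
bytes = "\x81\x60".b

# Returns [code point after decoding, whether encoding back restores the bytes].
def round_trip(bytes, name)
  original = bytes.dup.force_encoding(name)
  utf8     = original.encode("UTF-8")
  [utf8.ord, utf8.encode(name).bytes == original.bytes]
end

%w[Shift_JIS Windows-31J].each do |name|
  codepoint, restored = round_trip(bytes, name)
  puts format("%-12s -> U+%04X, round trip restores bytes: %s", name, codepoint, restored)
end
```

Each table round-trips with itself; the mojibake risk arises when the two ends of a conversation pick different tables.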
on 2008-09-21 10:09
At 01:05 08/09/21, Yukihiro Matsumoto wrote:
>|(please tell me if I am wrong):
>unification. The character set contains 170,000 characters.
Just for general information, this doesn't specifically refer to CJK unification (i.e. unification of the same ideograph from China, Japan, Korea, and so on) but is more about general glyph (dis)unification. This means that minor differences in how exactly to write a character are given separate codepoints. This may help in historical research (some variants are more used by some writers or in some centuries than others,...), but in general isn't helpful; on the contrary, it will make data processing more difficult.
However, even in daily life, there is some need to distinguish some (ideographic) glyph variants in certain cases. For this, Unicode contains variation selectors (U+FE00-FE0F and U+E0100-E01EF). These are used after a base character, based on a registration in the Ideographic Variation Database (http://www.unicode.org/ivd/). There is currently only the Adobe-Japan1 collection registered, see http://www.unicode.org/ivd/data/2007-12-14/IVD_Charts.pdf.
For glyph variants, it would be no problem (although quite some work, of course) for Mojikyo to register them as Ideographic Variations in this database. This would make all these Variations usable in Unicode.
From http://www.mojikyo.com/info/konjaku/index.html, we can also see the following:
Mojikyo: 漢字 (kanji) 150,366
Unicode: A bit more than double of what Unicode has. In my guess mostly glyph variants, but there sure are a few not yet encoded characters, too.
Mojikyo: 非漢字 (non-kanji) 2,256
Unicode: Kana variants could be encoded with variation selectors
Mojikyo: 梵字 (bonji) 1,875
Unicode: Don't know, but because these are of Indic origin, my guess is that Unicode would use a different encoding model with far fewer characters
Mojikyo: 甲骨文字 (oracle bone) 3,364 (http://www.internationalscientific.org/CharacterAS...)
Unicode: space tentatively allocated (U+32000-327FF), see http://unicode.org/roadmaps/tip/
Mojikyo: 西夏文字/Tangut 6,000
Unicode: under consideration for encoding
Mojikyo: 水族文字 145
Unicode: did not find any info, but I'm quite sure a well-written proposal would be accepted
Mojikyo: 篆書 (seal characters) 10,969 (http://www.internationalscientific.org/CharacterAS...)
Unicode: Very old style, but most of them with clear equivalents to modern ideographs. Still used on seals. To unify or not to unify is the big question.
It seems that Mojikyo is currently handled from two sides: www.mojikyo.org for the non-commercial side, and www.mojikyo.com for the commercial side (with various products published by Kinokuniya, a big Japanese publisher). That leads to somewhat complicated usage conditions (you can use some fonts for free for yourself, but have to pay if you use them in a paper you publish,...), not only for the fonts (would be quite understandable) but also for some of the data.
>At the
>time I first heard that number was huge, but Unicode is approaching
>pretty close (it now has more than 100,000 characters).
Conclusion: If the Mojikyo people wanted, they could get most if not all of their stuff into Unicode in one way or another. But similar to all other work of serious character encoding, it would be a lot of work.
>GB18030, defined by Chinese government. I don't know the detail, but
>I've heard it officially contains Unicode as its subset. But encoding
>scheme for GB18030 is up to 4 bytes per codepoint, so I am not sure how
>it can hold 21-bit Unicode codepoints in it.
4 bytes raw would be 32 bits, so that should be enough to hold 21 bits. Because some characters use only one or two bytes, the overall code space is smaller, about 1,600,000 codepoints. This is still larger than Unicode (around 1,100,000 codepoints), but the difference is currently not used at all. For more details, please see http://www.icu-project.org/docs/papers/unicode-gb1... and http://unicode.org/faq/han_cjk.html#23.
(I was of the impression that GB 18030 contains a few characters similar to the Japanese せ゜ and friends in JIS X 0213, but I haven't found any such information anymore, so it may not be true). So I don't think there is any real problem for GB 18030 and Unicode.
>
>Either avoiding conversion (operation based on bytes), or selecting
>proper encoding scheme (out of many very similar encodings, such as
>Shift_JIS, CP932, Windows-31J for example). Conversion table from
>unicode.org is carefully designed to ensure roundtrip, although that
>is the very reason we have so many similar encodings. If we can choose
>(or negotiate) to use the same conversion table at both ends, it is
>unlikely to have mojibake problems.
Yes, roundtrip is easy if you use the same conversion tables, but unfortunately, the major vendors (Microsoft, Apple, IBM,...) messed up with minor variations (usually just a few codepoints out of several thousand).
As for how you know that a printer or screen is going to show the right character, you simply don't, in particular e.g. on the Web. 0x5C will show as a Yen sign on Japanese systems with fonts tweaked for Japanese, but will show as a backslash otherwise. Japanese IT professionals have to just learn about this.
>convert into UTF-8 since their difference only lies in encoding
>scheme. They represent the same character set anyway. ISO-2022 is used
>often in mails and web.
That would be iso-2022-JP. ISO 2022 is a standard that defines a set of tools to create encodings, not an encoding in and by itself.
Regards, Martin.
>Unicode is getting better each day. So it now covers almost all
>day-to-day problems. Some cellphone problems are covered by using
>private area.
>
> matz.
#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
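The code-space figures quoted above (about 1,600,000 versus around 1,100,000) can be reproduced with a little arithmetic. The byte ranges below are my assumption of the GB 18030 layout (one-byte 0x00-0x7F; two-byte with lead 0x81-0xFE and trail 0x40-0x7E or 0x80-0xFE; four-byte alternating 0x81-0xFE and 0x30-0x39); check them against the standard before relying on the exact totals. Range#size on integer ranges needs Ruby 2.0 or later.

```ruby
# Assumed GB 18030 byte ranges (see the note above).
one_byte  = (0x00..0x7F).size
lead      = (0x81..0xFE).size                              # first byte of multi-byte codes
two_byte  = lead * ((0x40..0x7E).size + (0x80..0xFE).size)
digit     = (0x30..0x39).size                              # second and fourth byte
four_byte = lead * digit * lead * digit

gb18030_space = one_byte + two_byte + four_byte
unicode_space = 17 * 0x10000                               # planes 0 through 16

puts gb18030_space
puts unicode_space
puts gb18030_space - unicode_space                         # positions with no Unicode counterpart
```

Under these assumptions the surplus is purely structural: it counts available byte sequences, not assigned characters.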
on 2008-09-22 02:05
On Sun, 21 Sep 2008 02:05:30 +1000, Yukihiro Matsumoto <matz@ruby-lang.org> wrote:
> |- How a Japanese programmer would handle the situation of dealing with a
> |combination of a Japanese non-Unicode compatible character set, and say a
> |UTF-8 encoding which included non-ascii characters, and non-Japanese ones.
> |ie: Is there a reasonable alternative to encoding both to Unicode &
> |somehow dealing with the "difficult characters" as special cases?
>
> Unicode is getting better each day. So it now covers almost all
> day-to-day problems. Some cellphone problems are covered by using
> private area.

I infer from this that really Unicode is the only (imperfect) solution for true m17n where we have a mixture of completely different character sets (eg: Japanese & Arabic)? What I think this means is that there is no "one size fits all" solution, unfortunately.

So I have an alternate suggestion. Maybe I should rename this thread "Character encodings - a less radical suggestion" :-)

Ruby already has "Encoding::default_external", so why not also have "default_internal"? This option would either be left unset (or nil I guess) or set to an encoding, likely to be UTF-8 in practice, but maybe there would be a use for it to choose say one of the Japanese encodings if you have a variety of Japanese encodings to handle.

When "default_internal" is nil, Ruby will work as it does now:
- Ruby libraries such as I/O & network libraries will by default return character data in the external encoding
- No transcoding will take place unless specifically requested by the Ruby program
- The Ruby program is responsible for ensuring that the encodings are what it expects, that strings passed to & from Ruby libraries are in the encoding the library expects, and that "Encoding Compatibility Errors" will occur if it is not careful etc.

When "default_internal" is set to an encoding "E":
- Ruby libraries such as I/O & networking libraries will by default transcode to/from internal encoding E (unless specifically overridden by an option to the class)
- A Ruby program can then be confident that all strings it handles will be in encoding E, so it doesn't have to worry about encoding compatibility. For example it can be sure that if "s" is "abc" then "s == 'abc'" is true, no matter where the string "s" originated from.
- Assuming that E is an "ascii-compatible" encoding, the Ruby programmer doesn't have to face issues like "The value is #{val}" substitution failing because "val" is non-ascii compatible.
- The "downside" as pointed out by a number of people is that not all characters may be transcoded cleanly or even be supported (driving without a seat-belt? :-)), but then programs requiring this level of control should probably not use this feature.

Consequences of this suggestion:
- Don't have to change the current implementation of encodings, String or Regexp
- Avoids "automagical transcoding" within String & Regexp methods
- Responsibility of implementing "default_internal" lies with a certain set of Ruby libraries like IO & networking

Hope this makes sense.

Mike
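For readers following along, here is a minimal sketch of the behaviour proposed above. It assumes a Ruby release that provides the Encoding.default_internal= setter (the very thing under discussion here), and the file name and sample text are made up for illustration. Only the external encoding is named when the file is opened; the internal encoding comes from the global default.

```ruby
require 'tmpdir'

# Save process-wide settings so the sketch can restore them afterwards.
saved_internal = Encoding.default_internal
saved_verbose  = $VERBOSE
$VERBOSE = nil   # Ruby may warn in verbose mode when default_internal is reassigned

text = "\u3042\u3044\u3046"   # three hiragana characters in a UTF-8 literal
from_file_encoding = nil
equal_to_literal   = nil
interpolated       = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 'sample_sjis.txt')        # hypothetical file name
  File.binwrite(path, text.encode('Shift_JIS'))   # data stored in a non-UTF-8 encoding

  begin
    Encoding.default_internal = Encoding::UTF_8
    # Only the external encoding is given; default_internal supplies the target.
    from_file = File.open(path, 'r:Shift_JIS') { |f| f.read }
    from_file_encoding = from_file.encoding
    equal_to_literal   = (from_file == text)
    interpolated       = "The value is #{from_file}"
  ensure
    Encoding.default_internal = saved_internal
    $VERBOSE = saved_verbose
  end
end

puts from_file_encoding
puts equal_to_literal
```

The program logic after the read never mentions an encoding, which is the point of the suggestion: the conversion happens at the file boundary.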
on 2008-09-22 04:44
Hello Michael,

Many thanks for your proposal. Earlier, when I proposed some general "encoding policies" to deal with this and similar problems, the main problem brought up was that it would interoperate badly with libraries. But looking at your concrete proposal, it seems to me that overall, the problems wouldn't actually be that serious.

Therefore, I think we should seriously consider this proposal, and hopefully implement it before Sept. 25th. In terms of implementation, I don't think it should be that difficult, but it may be quite a bit of work to check Encoding::default_internal in all the affected methods.

In terms of potential problems, I see the following:
- A library sets Encoding::default_internal. That would lead to serious problems, and should be clearly advised against in the documentation. Libraries either have to be written in a general way, or have to document that they only work with certain values of Encoding::default_internal (this proposal would therefore help you, but not e.g. James Gray for the CVS library)
- Encoding::default_internal is set to some dummy or non-ASCII-compatible encoding, which may lead to some hiccups. We may want to make that impossible or advise against it. (the main use is UTF-8 anyway)
- We should think through various scenarios for output. I can't think of any problems just now, I just noticed the absence of considerations for output below.

The advantages that I see with this proposal are:
- It gets rid of the bad usability for "r:UTF-16LE:UTF-8" (matz, ruby-core:18666)
- It clearly helps "Unicode inside" applications, but is not limited to any encoding and may be helpful for other encodings as well.
- It fits well within the rest of the naming scheme and the overall idea of having several specific encodings to make the work of the user easier. If we didn't have Encoding::default_external, using Ruby with a single local encoding would be a big pain. Introducing Encoding::default_internal makes using Ruby with "Unicode inside" much less of a pain.

At 08:56 08/09/22, Michael Selig wrote:
>> Unicode is getting better each day. So it now covers almost all
>> day-to-day problems. Some cellphone problems are covered by using
>> private area.
>
>I infer from this that really Unicode is the only (imperfect) solution for
>true m17n where we have a mixure of completely different character sets
>(eg: Japanese & Arabic)?
>What I think this means is that there is no "one size fits all" solution,
>unfortunately.

Yes. Unicode fits most of the time, some local encoding fits in many cases (in particular small scripts), and for some very special jobs, you may have to use something else (a special encoding such as Mojikyo, the Unicode private areas, an additional level of markup,...).

>So I have an alternate suggestion. Maybe I should rename this thread
>"Character encodings - a less radical suggestion" :-)

I just did :-).

Regards, Martin.

>program
>in encoding E, so it doesn't have to worry about encoding compatibility.
>Consequences of this suggestion:
>

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-22 04:59
On Sep 21, 2008, at 9:35 PM, Martin Duerst wrote:
> In terms of potential problems, I see the following:
> - A library sets Encoding::default_internal. That would lead
> to serious problems, and should be clearly advised against
> in the documentation. Libraries either have to be written
> in a general way, or have to document that they only work
> with certain values of Encoding::default_internal
> (this proposal would therefore help you, but not e.g.
> James Gray for the CVS library)

I really think this is a bigger minus than this implies. I can name a lot of libraries that just flat out expect UTF-8 and choke and die on anything else. Ruby 1.8 has trained us to think this way for many years.

Now, if someone were to change Encoding.default_internal then all these libraries will unexpectedly have data changed on them. I'm pretty sure that would cause massive damage.

James Edward Gray II
on 2008-09-22 05:03
> - Encoding::default_internal is set to some dummy or non-ASCII-
> compatible encoding, which may lead to some hickups.
> We may want to make that impossible or advise against.
> (the main use is UTF-8 anyway)

I'd say we should make it impossible. If you are playing with dummy or non-ASCII-compatible encodings anyway you must know what you do. So not being able to rely on default_internal in this case would make perfect sense to me.

> - We should think through various scenarios for output.
> I can't think of any problems just now, I just noticed
> the absence of considerations for output below.

I have not thought about it much, but logically:

Input: default_external
  ↓ conversion (if default_internal != default_external and encoding not specified)
Internal: default_internal
  ↓ conversion (if default_internal != default_external and encoding not specified)
Output: default_external

Vincent Isambart
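The input/internal/output flow sketched above can be tried out with explicit per-file encodings, which Ruby's IO already accepts. This is only an illustration under stated assumptions: the two local variables stand in for default_external and default_internal, and the file names are invented.

```ruby
require 'tmpdir'

external = 'Shift_JIS'   # stands in for default_external in this sketch
internal = 'UTF-8'       # stands in for default_internal in this sketch

bytes_in = "\u3042\u3044\u3046".encode(external).b   # raw bytes as stored on disk
bytes_out = nil
working_encoding = nil

Dir.mktmpdir do |dir|
  src = File.join(dir, 'in.txt')    # hypothetical file names
  dst = File.join(dir, 'out.txt')
  File.binwrite(src, bytes_in)

  # Input: external -> internal, converted by IO while reading.
  data = File.open(src, "r:#{external}:#{internal}") { |f| f.read }
  working_encoding = data.encoding

  # Internal: the program manipulates strings in a single encoding.
  processed = data.reverse

  # Output: internal -> external, converted by IO while writing.
  File.open(dst, "w:#{external}") { |f| f.write(processed) }
  bytes_out = File.binread(dst)
end

puts working_encoding
puts bytes_out.unpack1('H*')   # hex dump of what ended up on disk
```

With a global default_internal, the idea is that the explicit ":#{internal}" part of the read mode would no longer need to be spelled out.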
on 2008-09-22 05:10
> I really think this a bigger minus than this implies. I can name a lot of > libraries that just flat out expect UTF-8 and choke and die on anything > else. Ruby 1.8 has trained us to think this way for many years. As we said before the main use of default_internal is for Unicode-inside applications so most of the time it would be UTF-8 anyway... > Now, if someone were to change Encoding.default_internal then all these > libraries will unexpectedly having data changed on them. I'm pretty sure > that would cause massive damage. This makes me think that there should be _at least_ a warning (or completely forbid) if some code tries to change default_internal and it was already set (if it's set to the same encoding, we could just ignore it).
on 2008-09-22 05:21
On Mon, 22 Sep 2008 12:51:19 +1000, James Gray <james@grayproductions.net> wrote:
>
> I really think this a bigger minus than this implies. I can name a lot
> of libraries that just flat out expect UTF-8 and choke and die on
> anything else. Ruby 1.8 has trained us to think this way for many years.

That is very true, but the situation exists independent of "default_internal". The mere fact that Ruby supports multiple encodings means that every library *should* check the encodings of strings passed to it and do the appropriate thing, unless it is acceptable to just let an "Encoding Compatibility Error" be raised.

> Now, if someone were to change Encoding.default_internal then all these
> libraries will unexpectedly having data changed on them. I'm pretty
> sure that would cause massive damage.

Also very true, but people can do all sorts of stupid things that will make things break. Normally you would expect that "default_internal" would be set once at the very start, but who are we to enforce that - one day someone will want to change it mid-stream and probably for a good reason!

Mike.
on 2008-09-22 05:34
On Mon, 22 Sep 2008 12:35:49 +1000, Martin Duerst <duerst@it.aoyama.ac.jp> wrote:
>
> Therefore, I think we should seriously consider this proposal,
> and hopefully implement it before Sept. 25th. In terms of
> implementation, I don't think it should be that difficult,
> but it may be quite a bit of work to check
> Encoding::default_internal in all the affected methods.

Wow, that is rather ambitious - 3 days? The bulk of the implementation will be in the libraries, and I think many of them need updating to cope with non-ASCII encodings anyhow.

> - We should think through various scenarios for output.
> I can't think of any problems just now, I just noticed
> the absence of considerations for output below.

I did think about output to a certain extent, and one good thing is that IO already seems to automatically transcode to the "external" encoding at the moment. As for other classes, again I think most need updating to support multiple encodings anyhow. They will at a minimum need a way of having the user pass the "external" encoding (defaulting to "default_external"), and do the transcode as necessary, based on the encoding of the data to be output. However, as with IO, this behaviour probably should happen no matter whether "default_internal" is implemented or not.

Cheers
Mike
on 2008-09-22 05:39
At 11:51 08/09/22, James Gray wrote:
>
>I really think this a bigger minus than this implies. I can name a
>lot of libraries that just flat out expect UTF-8 and choke and die on
>anything else. Ruby 1.8 has trained us to think this way for many
>years.

Having a library expect UTF-8 is fine, if it's well known. The idea with Encoding::default_internal, the way I understand it, is not to force any particular working style on anybody.

But in order to work for 1.9, these libraries will have to write something like "r:UTF-16LE:UTF-8" anyway for their i/o, both with and without Encoding::default_internal. And for data passed internally to the library (method attributes,...), that data will either be in UTF-8 or not, again independent of Encoding::default_internal.

The only thing that Encoding::default_internal helps (but this is significant) is that it makes things easier for an application programmer who wants to use a single encoding inside. If that encoding is chosen to be UTF-8, it will also significantly increase the chances that the application and the aforementioned libraries will work together well. In this sense, it is helpful for libraries working only in e.g. UTF-8.

>Now, if someone were to change Encoding.default_internal then all these
>libraries will unexpectedly having data changed on them. I'm pretty
>sure that would cause massive damage.

I disagree, but maybe you have a scenario in mind that I didn't think about. Can you be more specific?

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-22 05:44
While I think about it, there is at least one more issue with "default_internal" - support for ASCII-8BIT aka "BINARY". I imagine that most people who use this encoding actually want to do bit or byte manipulation, not character. Some other languages have a separate class for "byte strings" to handle this situation. Therefore I think if you use an "external" encoding of ASCII-8BIT, transcoding to/from the "default_internal" encoding should not happen. This behaviour may be a bit confusing, but I cannot immediately think of a better idea. Mike
on 2008-09-22 10:43
At 12:35 08/09/22, Michael Selig wrote: >While I think about it, there is at least one more issue with >"default_internal" - support for ASCII-8BIT aka "BINARY". > >I imagine that most people who use this encoding actually want to do bit >or byte manipulation, not character. Some other languages have a separate >class for "byte strings" to handle this situation. >Therefore I think if you use an "external" encoding of ASCII-8BIT, >transcoding to/from the "default_internal" encoding should not happen. >This behaviour may be a bit confusing, but I cannot immediately think of a >better idea. Very good point. Anything other than pure ASCII won't convert anyway. And ASCII itself will work fine when labeled as ASCII-8BIT, even together with UTF-8. Regards, Martin. #-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University #-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
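The two observations above can be checked directly. The sketch below assumes a Ruby release with the Encoding.default_internal= setter and uses an invented file name; it shows that an external encoding of ASCII-8BIT is left untranscoded, that ASCII-only binary data combines with UTF-8, and that binary data with high bytes does not.

```ruby
require 'tmpdir'

ascii_bytes = "abc".b            # ASCII-only data labelled ASCII-8BIT
high_bytes  = "\xE3\x81\x82".b   # raw bytes above 0x7F, labelled ASCII-8BIT
utf8_text   = "\u3042"

# ASCII-only binary data is compatible with UTF-8 text.
joined_encoding = (ascii_bytes + utf8_text).encoding

# Binary data with high bytes is not.
mix_error = begin
  high_bytes + utf8_text
  nil
rescue Encoding::CompatibilityError => e
  e.class
end

# Reading with an external encoding of ASCII-8BIT skips transcoding,
# even when a default internal encoding is set.
saved_internal = Encoding.default_internal
saved_verbose  = $VERBOSE
$VERBOSE = nil   # Ruby may warn in verbose mode when default_internal is reassigned
binary_read_encoding = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 'blob.bin')   # hypothetical file name
  File.binwrite(path, high_bytes)
  begin
    Encoding.default_internal = Encoding::UTF_8
    binary_read_encoding = File.open(path, 'r:ASCII-8BIT') { |f| f.read }.encoding
  ensure
    Encoding.default_internal = saved_internal
    $VERBOSE = saved_verbose
  end
end

puts joined_encoding
puts mix_error.inspect
puts binary_read_encoding
```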
on 2008-09-22 15:02
On Sep 21, 2008, at 10:29 PM, Martin Duerst wrote:
> At 11:51 08/09/22, James Gray wrote:
>
>> Now, if someone were to change Encoding.default_internal then all
>> these
>> libraries will unexpectedly having data changed on them. I'm pretty
>> sure that would cause massive damage.
>
> I disagree, but maybe you have a scenario in mind that I didn't
> think about. Can you be more specific?

Encoding.default_internal = "Shift_JIS"

James Edward Gray II
on 2008-09-22 15:08
On Sep 21, 2008, at 10:01 PM, Vincent Isambart wrote: >> Now, if someone were to change Encoding.default_internal then all >> these >> libraries will unexpectedly having data changed on them. I'm >> pretty sure >> that would cause massive damage. > > This makes me think that there should be _at least_ a warning (or > completely forbid) if some code tries to change default_internal and > it was already set (if it's set to the same encoding, we could just > ignore it). I just can't stop thinking this is too dangerous. It's really just a big global variable that affects everything and we know that's usually bad, right? Perl has always had a variable that allowed you to change the starting index of arrays. If you didn't like the fact that arrays counted from zero, you could switch it to one. If you do though, pretty much all of the libraries that ship with Perl as well as those on the CPAN start having issues. I really feel like this would be the same thing. By the way, this "feature" is so evil, I believe it's finally being removed in Perl 6. James Edward Gray II
on 2008-09-22 15:12
On Sep 21, 2008, at 9:35 PM, Martin Duerst wrote: > In terms of potential problems, I see the following: > - A library sets Encoding::default_internal. That would lead > to serious problems, and should be clearly advised against > in the documentation. Libraries either have to be written > in a general way, or have to document that they only work > with certain values of Encoding::default_internal > (this proposal would therefore help you, but not e.g. > James Gray for the CVS library) I really think we need to avoid any solution that means we will need to change all existing libraries, even just to declare their supported encodings. Enough libraries are already broken on 1.9 without us adding to that and so many great libraries are no longer maintained at all. The current situation is probably that we have to be very careful what we pass into these Unicode only libraries to get them to work. That's far from ideal but, it's better than having the library fail to load at all due to some global setting I may not have even created (assuming I required code that made the change). James Edward Gray II
on 2008-09-23 05:07
On Mon, 22 Sep 2008 23:03:12 +1000, James Gray <james@grayproductions.net> wrote: > > I really think we need to avoid any solution that means we will need to > change all existing libraries, even just to declare their supported > encodings. Enough libraries are already broken on 1.9 without us adding > to that and so many great libraries are no longer maintained at all. > > The current situation is probably that we have to be very careful what > we pass into these Unicode only libraries to get them to work. That's > far from ideal but, it's better than having the library fail to load at > all due to some global setting I may not have even created (assuming I > required code that made the change). As long as "default_internal" is used sanely, I actually think that it may IMPROVE the library support situation, because its use will make "encoding compatibility errors" less likely to rear their ugly heads. As long as IO obeys default_internal's setting, I think most other libraries should just work. I quickly checked "OpenURI", for example, and (assuming I understand the code correctly) it calls IO#set_encoding passing the charset read from the HTTP header, setting the "external encoding" of the socket. So as long as IO leaves the "internal encoding" set to the default_internal setting, open-uri should work as required, returning the data in the default_internal encoding. By "sanely" I mean that default_internal is set at the start of the program, and not changed (or at least not changed between reads of a file, for instance). Also if libraries supporting only Unicode are used then it should either NOT be set (and the Ruby program must then be careful about what it passes to it) or be set to UTF-8. Similarly if the library only supports ASCII, you wouldn't want to set default_internal to a non-ascii compatible encoding (very unlikely I think). 
I guess if the possibility of changing "default_internal" seems too problematic, it could be implemented the way "default_external" is - read-only and set either via a command line flag or to a default. Perhaps the default should simply be the encoding of the ruby program itself. But this idea would mean that for Ruby to behave as it does at the moment, you would have to specifically turn it off somehow. Mike
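As a rough illustration of the open-uri scenario described above, the sketch below uses IO#set_encoding on an ordinary file that stands in for a socket. The charset variable and file name are made up, and both encodings are passed explicitly; under the proposal, the second argument is what default_internal would supply.

```ruby
require 'tmpdir'

body = "\u3042\u3044\u3046"
declared_charset = 'Shift_JIS'   # stands in for a charset read from an HTTP header
result   = nil
reported = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, 'response_body.txt')   # hypothetical stand-in for a socket
  File.binwrite(path, body.encode(declared_charset))

  File.open(path, 'r') do |io|
    # External encoding comes from the data source; internal is what the program wants.
    io.set_encoding(declared_charset, 'UTF-8')
    reported = [io.external_encoding, io.internal_encoding]
    result = io.read
  end
end

puts reported.map(&:name).join(' -> ')
puts result.encoding
```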
on 2008-09-23 06:05
On Mon, Sep 22, 2008 at 11:04 PM, Michael Selig <michael.selig@fs.com.au> wrote:
> As long as "default_internal" is used sanely, I actually think that it may
> IMPROVE the library support situation, because its use will make "encoding
> compatibility errors" less likely to rear their ugly heads.

What if it's "set once"? It's treated as essentially frozen after it's set for the first time. Something like:

  def Encoding.default_internal=(encoding)
    raise "Internal Encoding Already set" if @default_internal_set
    @default_internal_set = true
    @default_internal = encoding
  end

It would be treated as read-only if set by a command-line parameter or after the first time it's set to an explicit value. This would discourage people from setting it in libraries (it would break automatically).

-austin
on 2008-09-23 11:44
On Tue, Sep 23, 2008 at 5:04 AM, Michael Selig <michael.selig@fs.com.au> wrote:
>>> with certain values of Encoding::default_internal
>> pass into these Unicode only libraries to get them to work. That's far from
> libraries should just work. I quickly checked "OpenURI", for example, and
> what it passes to it) or be set to UTF-8. Similarly if the library only
> supports ASCII, you wouldn't want to set default_internal to a non-ascii
> compatible encoding (very unlikely I think).

We also have to consider the fact that in a multi-threaded application, changing a global variable that affects all threads is potentially dangerous. Thus, if some library were to change the default_internal encoding temporarily, this might have unforeseeable consequences in other libraries or user code in other threads.

I'd advise against having it changeable in code at all, but only by a command line switch.

-- henon
on 2008-09-24 01:39
On Mon, 22 Sep 2008 12:35:49 +1000, Martin Duerst <duerst@it.aoyama.ac.jp> wrote: > Therefore, I think we should seriously consider this proposal, > and hopefully implement it before Sept. 25th. I guess all you guys are busy on other things at the moment, so I am happy to implement "default_internal" at least at the Ruby internal C level, but unfortunately it wouldn't be before the weekend. Given the various problems raised about changing default_internal, I now agree that it is probably for the best if it were implemented like default_external - set only at the start, with Encoding::default_internal read-only, no "Encoding::default_internal=". So I would do the following: - extend the current -E command line option to have an optional setting for default_internal in the form "ext:int" - same format as the encoding "mode" options in IO. - modify IO to use "default_internal" if no internal encoding is specified, and the file's external encoding is *not* ASCII-8BIT (to allow for "binary" I/O). - add class method "default_internal" to Encoding The only question I still have is "what should it default to if not specified on the command line"? I think it should default to the encoding specified in the main ruby source file. That means that the internal encoding will match the encoding used by the programmer. Is this sensible? Would that break anything? Once set, default_internal can't be changed, so that means there would be no way of turning it off if the encoding is specified in the ruby source, unless another option is introduced. Is this a problem? If no encoding is specified in the main ruby source file, then what? Set it to the same as "default_external"? Set it to NIL (ie: no default transcoding)? This is not backward compatible with Ruby 1.9.0 which has no IO transcoding by default no matter what the -E & encoding in the source are set to. Any better ideas? Have I left anything out? Please let me know what you would like me to do, if anything. Cheers Mike
on 2008-09-24 13:02
At 12:25 08/09/22, Michael Selig wrote:
>Wow, that is rather ambitious - 3 days?

Well, that's the deadline for feature changes for 1.9.1. It would be a real pity to wait for 2.0 for this. The feature freeze wiki at http://redmine.ruby-lang.org/wiki/ruby/DevelopersM... says that default_internal is currently pending, but that this should be discussed/settled this week.

Anyhow, I had a look at the code, and it doesn't seem to be that difficult. The function io_extract_encoding_option in io.c seems to be central. I'm attaching a patch, which I hope is a good start. I'm also writing to ruby-dev (in Japanese) because that's where the real experts are. The patch isn't as strict as your proposal with respect to re-setting, but I'm fine either way.

I have tested this patch with code like the following (called with -Eutf-8, -Eshift_jis, -Eeuc-jp, and without -E option, in all combinations)

>>>>
Encoding.default_internal = 'utf-8' # tested with 'utf-8', 'shift_jis', and 'euc-jp'
s = "\u3042\u3044\u3046\u3048\u304A"
File.open('testout1.txt', 'w:shift_jis') do |f|
  f.write s
end
File.open('testout2.txt', 'w:euc-jp') do |f|
  f.write s
end
File.open('testout3.txt', 'w:utf-8') do |f|
  f.write s
end
File.open('testout1.txt', 'r:shift_jis') do |f|
  s = f.read; p s.encoding
end
File.open('testout2.txt', 'r:euc-jp') do |f|
  s = f.read; p s.encoding
end
File.open('testout3.txt', 'r:utf-8') do |f|
  s = f.read; p s.encoding
end
File.open('testout3.txt', 'r:ASCII-8BIT') do |f|
  s = f.read; p s.encoding
end
# for next line, change file number to pick up default_internal
File.open('testout3.txt', 'r') do |f|
  s = f.read; p s.encoding
end
>>>>

>The bulk of the implementation will be in the libraries, and I think many
>of them need updating to cope with non-acsii encodings anyhow.

Yes. I'm not sure how libraries are affected by the feature freeze, but they have to be fixed anyhow, completely independently of default_internal. And I agree that this cannot be done in 3 days.

Regards, Martin.

>encoding of the data to be output. However, as with IO, this behaviour
>probably should happen no matter whether "default_internal" is implemented
>or not.
>
>Cheers
>Mike
>

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-24 14:34
On Wed, 24 Sep 2008 21:02:14 +1000, Martin Duerst <duerst@it.aoyama.ac.jp> wrote: > Well, that's the deadline for feature changes for 1.9.1. > It would be a real pity to wait for 2.0 for this. > The feature freeze wiki at > http://redmine.ruby-lang.org/wiki/ruby/DevelopersM... > says that default_internal is currently pending, but that > this should be discussed/settled this week. Sorry, I am new here, so I didn't know about that URL, nor about the release procedures, nor did I know whether one of the other developers was working on this. In my previous post I asked whether I should proceed, but got no reply. I didn't think it was worthwhile my spending time on it if someone else has done it, or almost has. > option, in all combinations) I am not sure if your patch also works correctly with IO#set_encoding. This is absolutely necessary for HTTP data to be transcoded correctly in OpenURI, for example. Please see my previous post. That post also suggests NOT implementing default_internal=, but rather extending the -E command line flag to be -E "ext:int", and if not defined there to use the "source encoding". I guess your patch may "get the feature in" by the required date, but I feel that it may require a little more thought to get it right. Regards Mike
on 2008-09-25 03:19
At 21:34 08/09/24, Michael Selig wrote:
>On Wed, 24 Sep 2008 21:02:14 +1000, Martin Duerst <duerst@it.aoyama.ac.jp>
>wrote:
>> option, in all combinations)
>
>I am not sure if your patch also works correctly with IO#set_encoding.

I'm quite sure it doesn't.

>This is absolutely necessary for HTTP data to be transcoded correctly in
>OpenURI, for example. Please see my previous post. That post also suggests
>NOT implementing default_internal=, but rather extending the -E command
>line flag to be -E "ext:int", and if not defined there to use the "source
>encoding".

I have read that mail, but I disagree with using the "source encoding". I think it should be possible to use e.g. UTF-8 as a source encoding without default_internal being automatically set.

Also, I think it should be possible to set default_internal independent of default_external. I can imagine writing an application where I always want UTF-8 to be default_internal, but I want it to work with all kinds of external encodings. In that case, with your proposal, the alternative would be to have the user write 'ruby -E "ext:UTF-8"' (which means that the user has to figure out his/her external encoding, which many may not be experts on), because I cannot use a #!.

Well, in a Japanese mail (ruby-dev:36523), matz made this work with -E :utf-8. I guess that's why he is a language designer, and I'm not :-).

>I guess your patch may "get the feature in" by the required date, but I
>feel that it may require a little more thought to get it right.

Definitely it needs some more work and polishing. This is now being discussed seriously on ruby-dev, but please keep your ideas and comments coming.

Regards, Martin.

#-#-# Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
on 2008-09-25 04:08
----- Original Message -----
From: "Martin Duerst" <duerst@it.aoyama.ac.jp>
To: <ruby-core@ruby-lang.org>
Sent: Thursday, September 25, 2008 11:16 AM

> I have read that mail, but I disagree to use the "source encoding".
> I think it should be possible to use e.g. UTF-8 as a source encoding
> without default_internal being automatically set.

My reasoning is this: if a Ruby programmer puts "Encoding: XXX" at the top of their main program, they are saying that they will be using encoding XXX for the "constant" strings and regexps in their code, right? If the default_internal is different from this then they will get encoding compatibility problems unless they are careful. This is what "default_internal" is aiming to prevent, or at least reduce.

My feeling is that if they then really want to go on to read a different encoding without transcoding, they can always open that file with mode "r:ext:ext". Yes it's a bit ugly, but I think this is going to be the exception rather than the rule. I guess the mode could be extended to support something like "r:ext:-" where the "-" in the internal field indicates no transcoding. Perhaps that would be more tolerable.

My current thinking is to set default_internal to:
- The -E command line option
- The source encoding if not specified in -E
- Leave it nil (no transcoding) if neither is specified
- Also if "default_internal" is US-ASCII it is reset to UTF-8 automatically (can't do any harm and will cope better if they specify an external encoding but no internal encoding when opening a file, or their default_external is not US-ASCII)

Can you please outline a scenario where you feel that this would be unacceptable?

The other problem I have with "default_internal=" is that its use may be confusing to the Ruby programmer. Does it immediately cause the next read of a file to use the new value or does it just apply to the next open? (I think any implementation would probably be the latter).

> and I'm not :-).

Yes, I was assuming you could pass either "ext", "ext:int" or ":int" to the "-E" option. Sorry if I didn't make that clear.

> Definitely it needs some more work and polishing. This is now being
> discussed seriously on ruby-dev, but please keep your ideas and comments
> comming.

Shame I don't read Japanese! I know it's after the freeze, but I'll have a go at producing a patch myself over this weekend.

Cheers
Mike.
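For anyone wanting to try the "ext:int" and ":int" forms of -E mentioned above, here is a small sketch that launches a child interpreter and prints the two defaults it ends up with. It assumes the installed Ruby accepts these forms of the option (later releases document -E as taking an external and an optional internal encoding); the helper name defaults_for is invented for this example.

```ruby
require 'rbconfig'

# Child script: print the names of both defaults; default_internal may be nil.
SCRIPT = 'print [Encoding.default_external, Encoding.default_internal]' \
         '.map { |e| e ? e.name : "nil" }.join(" ")'

# Hypothetical helper: run a fresh interpreter with the given -E argument.
def defaults_for(e_option)
  IO.popen([RbConfig.ruby, '-E', e_option, '-e', SCRIPT], &:read)
end

both          = defaults_for('Shift_JIS:UTF-8')   # external and internal
internal_only = defaults_for(':UTF-8')            # internal only, external from the locale

puts both
puts internal_only
```

The second call leaves the external encoding to the environment, which matches the use case of an application that always wants UTF-8 inside without asking the user what their locale encoding is.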