Brian C. wrote:
Marnen Laibow-Koser wrote:
Huh? Normalization transformations should be pretty easy to implement.
But the point is, you can’t do anything useful with this until you
transcode it anyway, which you can do using Iconv (in either 1.8 or
1.9).
Wrong. Normalization transformations are useful within one Unicode
encoding. In fact, they have little use in transcoding as I understand.
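To make that concrete, here is a quick irb sketch. (Note: String#unicode_normalize only appeared in Ruby 2.2, well after the 1.9 versions under discussion; on 1.8/1.9 you'd reach for a gem such as unicode_utils instead.)

```ruby
# Composed (NFC) and decomposed (NFD) forms of the same abstract text.
composed   = "\u00F1"   # "ñ" as a single precomposed codepoint
decomposed = "n\u0303"  # "n" followed by U+0303 COMBINING TILDE

composed == decomposed                          # => false: different codepoints
composed.unicode_normalize(:nfd) == decomposed  # => true
decomposed.unicode_normalize(:nfc) == composed  # => true
```

No transcoding happens anywhere here; both strings are UTF-8 throughout. Normalization just moves between equivalent codepoint sequences within the encoding.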
[…]
Notice that the n-accent is displayed as a single character by the
terminal, even though it is two codepoints (110 and 771)
I don’t think it’s meaningful to say that something is displayed as a
single character. You can’t see characters – they’re abstract ideas.
All you can see is the glyphs that represent those characters.
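The codepoint/glyph distinction is easy to see in irb:

```ruby
decomposed = "n\u0303"  # "n" + U+0303 COMBINING TILDE
decomposed.codepoints   # => [110, 771]  (two codepoints...)
decomposed.length       # => 2           (...but one glyph on screen)
```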
So you could argue that Dir[] on the Mac is at fault here, for tagging
the string as UTF-8 when it should be UTF-8-MAC.
But you’d be wrong, because UTF-8-MAC is valid UTF-8.
But you still need to transcode to UTF-8 before doing anything useful
with this string. Consider a string containing decomposed characters
tagged as UTF-8-MAC:
(1) The regexp /./ should match a series of decomposed codepoints as a
single ‘character’
I am not sure I agree with you.
str[n] should fetch the nth ‘character’;
Yes, but a combining sequence is not conceptually a character in many
cases.
and so on.
I don’t think this would be easy to implement, since a character
boundary is no longer the same as a codepoint boundary.
Sure it is. You are confusing characters and combining sequences.
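The regexp engine already makes this distinction, in fact: Onigmo's \X matches an extended grapheme cluster (i.e. a full combining sequence), while /./ matches a single character/codepoint. (\X is available from Ruby 2.0 on; I don't believe 1.9's Oniguruma had it.)

```ruby
decomp = "espan\u0303ol"  # decomposed "español"

decomp.scan(/./).length   # => 8  codepoint-sized matches
decomp.scan(/\X/).length  # => 7  grapheme clusters
decomp.scan(/\X/)[4]      # => "n\u0303" — the whole combining sequence
```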
What you actually get is this:
decomp.split(//)
=> ["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
Aside: note that "̃ is actually a single character,
It is nothing of the kind. It is a single combining sequence composed
of two characters. I would expect it to be matched by /…/ .
a double quote with
the accent applied!
Right.
(2) The OP wanted to match the regexp containing a single codepoint /ñ/
against the decomposed representation, which isn’t going to work anyway.
That is, ruby 1.9 does not automatically transcode strings so they are
compatible; it just raises an exception if they are not.
But UTF-8 NFC and UTF-8 NFD are compatible – they’re not even really
separate encodings. At this point I strongly suggest that you read the
article (I think it’s UAX #15) on Unicode normalization.
/ñ/ =~ decomp
Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8
regexp with UTF8-MAC string)
from (irb):5
from /usr/local/bin/irb19:12:in `<main>'
If the only difference between UTF-8 and UTF-8-MAC is normalization,
then this is brain-dead.
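Since the bytes of a UTF-8-MAC string are already valid UTF-8, the sane workaround is to retag and normalize rather than transcode. A sketch of what I mean (using force_encoding plus unicode_normalize, the latter being Ruby 2.2+; the UTF8-MAC tagging here just simulates what Dir[] does on a Mac):

```ruby
decomp = "espan\u0303ol.lng"               # NFD bytes, as Dir[] returns them
decomp = decomp.force_encoding("UTF8-MAC") # simulate Dir[]'s tagging

# Retag as UTF-8 (same bytes, so this is safe) and normalize to NFC:
nfc = decomp.force_encoding("UTF-8").unicode_normalize(:nfc)
/ñ/ =~ nfc  # => 4 — the match works, no Iconv needed
```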
(3) Since ruby 1.9 has a UTF-8-MAC encoding, it should be able to
transcode it to UTF-8 without using Iconv. However this is simply
broken, at least in the version I’m trying here.
/ñ/ =~ decomp.encode("UTF-8")
=> nil
decomp.encode("UTF-8")
=> "espa\xB1\x00ol.lng"
decomp.encode("UTF-8").codepoints.to_a
ArgumentError: invalid byte sequence in UTF-8
from (irb):10:in `codepoints'
from (irb):10:in `each'
from (irb):10:in `to_a'
from (irb):10
from /usr/local/bin/irb19:12:in `<main>'
RUBY_VERSION
=> "1.9.2"
RUBY_PATCHLEVEL
=> -1
RUBY_REVISION
=> 24186
Yikes! That’s bad.
(4) If general support for decomposed form would be added as further
‘Encodings’, there would be an explosion of encodings: UTF-8-D,
UTF-16LE-D, UTF-16BE-D etc, and that’s ignoring the “compatible” versus
“canonical” composed and decomposed forms.
Right. Different normal forms really aren’t separate encodings in the
usual sense.
(5) It is going to be very hard (if not impossible) to make a source
code string or regexp literal containing decomposed “n” and “̃” to be
distinct from a literal containing a composed “ñ”. Try it and see.
And that’s probably a good thing. In fact, that’s the point of
normalization.
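(If you really do need distinct literals — say, for a test case — \u escapes pin down the exact codepoints regardless of what your editor or input method normalizes to:)

```ruby
composed   = "\u00F1"   # precomposed ñ — one codepoint
decomposed = "n\u0303"  # n + U+0303   — two codepoints

composed == decomposed  # => false: String#== compares codepoints, so the
                        # literals stay distinct even though they render the same
```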
(In the above paragraph, the decomposed accent is applied to the
double-quote; that is, "̃ is actually a single ‘character’).
Combining sequence.
Most
editors are going to display both the composed and decomposed forms
identically.
And at least in the case of ñ versus n + combining ~, they normalize to
the same thing in all normal forms (precomposed ñ in C and KC; a 2-char
combining sequence in D and KD). Thus, under any normalization, they
are equivalent and should be treated as such.
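Checking that claim directly (again via unicode_normalize, Ruby 2.2+):

```ruby
composed   = "\u00F1"
decomposed = "n\u0303"

# Under every normal form the two normalize to the same string:
%i[nfc nfd nfkc nfkd].all? do |form|
  composed.unicode_normalize(form) == decomposed.unicode_normalize(form)
end
# => true

composed.unicode_normalize(:nfkc).length  # => 1 (precomposed in C and KC)
composed.unicode_normalize(:nfkd).length  # => 2 (combining sequence in D and KD)
```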
I think this just shows that ruby 1.9’s complexity is not helping in the
slightest. If you have to transcode to UTF-8 composed form, then ruby
1.8 does this just as well (and then you only need to tag the regexp as
UTF-8 using //u).
Normalization really isn’t transcoding in the usual sense.
Best,
Marnen Laibow-Koser
http://www.marnen.org
[email protected]