On Aug 6, 2009, at 11:44 AM, Brian C. wrote:
Hmm. Could you try replacing ‘LANG’ with ‘LC_ALL’ globally? A
reread of the setlocale(3) manpage under Linux shows that LANG is only
tried as a last resort, so perhaps your Mac has a higher-priority
environment variable set.
I bet the issue is this line in my .bashrc:
export LC_CTYPE=en_US.UTF-8
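That would explain it. As I read setlocale(3), the lookup order for the LC_CTYPE category is roughly the following (a sketch of the documented precedence, not the libc source):

```ruby
# Rough sketch of the lookup order setlocale(3) describes for the
# LC_CTYPE category: LC_ALL, if set, wins; then the specific LC_*
# variable; LANG is only tried as a last resort.
def effective_ctype(env)
  env["LC_ALL"] || env["LC_CTYPE"] || env["LANG"] || "C"
end

effective_ctype("LC_CTYPE" => "en_US.UTF-8")  # => "en_US.UTF-8"
effective_ctype("LC_CTYPE" => "en_US.UTF-8",
                "LC_ALL"   => "POSIX")        # => "POSIX"
```

So an exported LC_CTYPE would shadow LANG on the Mac even though LANG is unset.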
7-bit), but will fail if they are 8-bit.
Maybe this could be fixed by making the ASCII-8BIT encoding be
compatible with everything, and always give an ASCII-8BIT result. But
that would be saying, in essence, an ASCII-8BIT String is one class of
object, and everything else is another class.
I think I understand what you are saying here. You have a good point
that it would be annoying to have the Encoding of the JPEG you are
building up change from ASCII-8BIT to UTF-8.
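To make that concrete, here is how I understand the current 1.9 concatenation rules to play out (a sketch of observed behavior, not a proposal; the JPEG bytes are made up):

```ruby
# Sketch of how I understand the 1.9 concatenation rules.
jpeg = "\xFF\xD8".force_encoding("ASCII-8BIT")  # pretend JPEG magic bytes

# Appending 7-bit UTF-8 data to a BINARY buffer leaves it BINARY:
jpeg << "comment"
jpeg.encoding                      # => #<Encoding:ASCII-8BIT>

# An *empty* BINARY buffer, though, silently picks up the other
# operand's Encoding, which is the annoyance in question:
buf = "".force_encoding("ASCII-8BIT")
(buf + "héllo").encoding           # => #<Encoding:UTF-8>

# Once the buffer holds 8-bit bytes, mixing in non-ASCII UTF-8 raises:
begin
  jpeg << "héllo"
rescue Encoding::CompatibilityError => e
  e  # incompatible character encodings: ASCII-8BIT and UTF-8
end
```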
- Working with other people’s libraries.
Take REXML as an example. Suppose I decide I want to do this:
doc = REXML::Document.new(src)
Under 1.8, I could do this without worrying.
Really?
What did it do under Ruby 1.8 when fed an XML document that was UTF-16
encoded? Will it read it? When I do searches for content, will it
hand me UTF-16 or UTF-8? These are just some questions that jump to
my mind.
As you’ve said, about the best I can think of is to test it and find
out, only this is Ruby 1.8 I’m talking about here.
Let’s see how it works:
$ ruby -r rexml/document -e 'REXML::Document.new(ARGF.read)' utf16_with_bom.xml
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:92:in `parse': #<Iconv::InvalidCharacter: "\340\250\274\347\215\257\346\265\245\347\221\241\346\234\276\345\215\257\346\265\245\342\201\203\346\275\256\347\221\245\346\271\264\343\260\257\347\215\257\346\265\245\347\221\241\346\234\276", ["\n"]> (REXML::ParseException)
/usr/local/lib/ruby/1.8/rexml/encodings/ICONV.rb:7:in `conv'
/usr/local/lib/ruby/1.8/rexml/encodings/ICONV.rb:7:in `decode'
/usr/local/lib/ruby/1.8/rexml/source.rb:57:in `encoding='
/usr/local/lib/ruby/1.8/rexml/parsers/baseparser.rb:213:in `pull'
/usr/local/lib/ruby/1.8/rexml/parsers/treeparser.rb:22:in `parse'
/usr/local/lib/ruby/1.8/rexml/document.rb:227:in `build'
/usr/local/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
-e:1:in `new'
-e:1
...
"\n"
Line:
Position:
Last 80 unconsumed characters:
<sometag>Some Content</sometag>
from /usr/local/lib/ruby/1.8/rexml/document.rb:227:in `build'
from /usr/local/lib/ruby/1.8/rexml/document.rb:43:in `initialize'
from -e:1:in `new'
from -e:1
Ah, it just tells me my data is invalid. It’s not though:
$ iconv -f UTF-16BE -t UTF-8 < utf16_with_bom.xml
<?xml version="1.0" encoding="UTF-16BE"?>
<sometag>Some Content</sometag>
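Incidentally, for anyone following along at home, a test file like my utf16_with_bom.xml can be recreated from Ruby itself. A quick sketch (the filename and content just mirror the transcript above):

```ruby
xml = "<?xml version='1.0' encoding='UTF-16BE'?><sometag>Some Content</sometag>"

File.open("utf16_with_bom.xml", "wb") do |f|
  f << "\xFE\xFF".force_encoding("ASCII-8BIT")  # the UTF-16BE byte order mark
  f << xml.encode("UTF-16BE")                   # transcode the UTF-8 literal
end
```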
Ruby 1.9 can read it:
$ ruby_dev -r rexml/document -e 'puts REXML::Document.new(ARGF.read.force_encoding("BINARY")).to_s' utf16_with_bom.xml
<?xml version='1.0' encoding='UTF-16BE'?>
<sometag>Some Content</sometag>
It looks like it’s supposed to work in Ruby 1.8 too and I’ve just hit a
bug. At least, if I’m reading the source right. I had to check.
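As far as I can tell, the fix boils down to checking the byte order mark on the raw bytes before any transcoding happens. A hypothetical sniffer (my sketch; not REXML's actual code, and the method name is made up):

```ruby
BOM_BE = "\xFE\xFF".force_encoding("ASCII-8BIT")
BOM_LE = "\xFF\xFE".force_encoding("ASCII-8BIT")

# Hypothetical BOM check; REXML's real detection lives in its Source
# and encoding support code.
def sniff_encoding(data)
  raw = data.dup.force_encoding("ASCII-8BIT")
  if    raw.start_with?(BOM_BE) then "UTF-16BE"
  elsif raw.start_with?(BOM_LE) then "UTF-16LE"
  else  "UTF-8"  # no BOM: trust the XML declaration, or default
  end
end

sniff_encoding("\xFE\xFF\x00<")         # => "UTF-16BE"
sniff_encoding("<?xml version='1.0'?>") # => "UTF-8"
```

Which is why forcing the input to BINARY first, as in the 1.9 command above, is the safe move: the check has to see bytes, not characters.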
Anyway, the point of all this is that it really isn’t any easier, for
me, to reason about Ruby 1.8 encoding behavior. Ruby 1.9 didn’t
invent character encodings, it just started paying attention to them
as we all should have been doing all along. That’s all my opinion, of
course.
James Edward G. II