Forum: Ruby-core not use system for default encoding

Posted by Roger Pack (Guest)
on 2010-08-30 17:11
(Received via mailing list)
It strikes me as a bit "scary" to use system locale settings to
*arbitrarily* set Encoding.default_external

For example, I develop on windows (def: IBM437).
This means that if I want this to work cross platform I have to
specify IBM437 for every File.read (et al) that I use in my library.
So it is a bit scary.

Suggestion: default to UTF-8 *no matter where* then allow the user to
change it if they want something else.
Or even default to BINARY (ASCII-8BIT) unless they specify.  Most
users don't want/need encoding until they run into it--they can handle
it then.
Thoughts?
Thanks!
-r
Posted by Run Paint Run Run (Guest)
on 2010-08-30 22:56
(Received via mailing list)
> It strikes me as a bit "scary" to use system locale settings to *arbitrarily*
> set Encoding.default_external

What do you mean by “arbitrarily”? The algorithm used
(i.e. http://goo.gl/soW7) is pretty straightforward. Presumably
a user’s locale encoding reflects that in which he prefers to work.

> For example, I develop on windows (def: IBM437).  This means that if I want
> this to work cross platform I have to specify IBM437 for every File.read (et
> al) that I use in my library.  So it is a bit scary.

Well, firstly, that’s a pretty odd encoding to use. In general, you’ll have a
far easier time if you use UTF-8 for everything, and legacy encodings only when
necessary or, possibly, writing in a CJK script. In any case, even if you
continue using that encoding, the external encoding only needs to be specified
explicitly if the files contain non-ASCII characters. And if they do, and
interoperability is your goal, then why are you using IBM437 in the first
place?

> Suggestion: default to UTF-8 *no matter where* then allow the user to change
> it if they want something else.

That stands in opposition to the design goals of Ruby’s M17N. See Naruse’s
article at http://goo.gl/Xy20 for the background. Or, to put it another way,
Ruby’s system was designed by encoding experts, including Unicode Consortium
member Martin J. Dürst, and still explicitly rejects defaulting to UTF-8.

> Or even default to BINARY (ASCII-8BIT) unless they specify.  Most users don't
> want/need encoding until they run into it--they can handle it then.

That was the philosophy of English-centric software development for many years,
but global distribution and the web discredited it. Users don’t think they care
about encoding, but then they process input containing non-ASCII byte
sequences, and everything blows up. For example, consider the following “binary”
representations of e acute:

  Encoding.list.map{|n| "é".encode(n).dump rescue nil}.compact.uniq
  #=> [""\\u{e9}"", ""\\x88m"", ""\\xA0\\xC1"", ""\\x8F\\xAB\\xB1"",
  ""\\xA8\\xA6"", ""\\xE9"", ""\\x00\\xE9".force_encoding("UTF-16BE")",
  ""\\xE9\\x00".force_encoding("UTF-16LE")",
  ""\\x00\\x00\\x00\\xE9".force_encoding("UTF-32BE")",
  ""\\xE9\\x00\\x00\\x00".force_encoding("UTF-32LE")", ""\\x82"", ""\\x8E"",
  ""e\\xCC\\x81"", ""\\xC3\\xA9""]

If you store text as byte sequences without associated encodings, how will you
display it? How do you pattern match against it? You can’t because the strategy
is "\\110\\97\\105\\118\\130", as IBM437 users would say, or, if you speak
GB2312, "\\110\\97\\105\\118\\168\\166". Ultimately, the closest you can get to
ignoring encodings while at the same time remaining interoperable, is by
storing data in a Unicode-compatible encoding (UTF-8 being the obvious general
choice), then transcoding your input into UTF-8. Even this requires that either
the input is tagged with an encoding, or you’re willing to use imperfect
heuristic algorithms to detect it.
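
As a rough illustration of "tag the input, then transcode to UTF-8", here is a minimal sketch. The byte values are the two sequences quoted above; the variable names are made up for the example, and it assumes the Ruby build ships the IBM437 and GB2312 transcoders.

```ruby
# Raw bytes as they might arrive from a file or socket, with no encoding attached.
ibm437_bytes = [110, 97, 105, 118, 130].pack("C*")
gb2312_bytes = [110, 97, 105, 118, 168, 166].pack("C*")

# Tag each byte sequence with the encoding it was written in,
# then transcode to UTF-8 so the two can be compared and combined.
from_ibm437 = ibm437_bytes.force_encoding("IBM437").encode("UTF-8")
from_gb2312 = gb2312_bytes.force_encoding("GB2312").encode("UTF-8")

puts from_ibm437 == from_gb2312   # the same text once both are UTF-8
puts from_ibm437.bytes.inspect    # e acute is now the two-byte UTF-8 sequence
```

Without the force_encoding step Ruby has no way of knowing which of the two interpretations is meant, which is the point made above.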
Posted by Roger Pack (Guest)
on 2010-08-31 01:30
(Received via mailing list)
> Well, firstly, that’s a pretty odd encoding to use. In general, you’ll have a
> far easier time if you use UTF-8 for everything, and legacy encodings only when
> necessary or, possibly, writing in a CJK script. In any case, even if you
> continue using that encoding, the external encoding only needs to be specified
> explicitly if the files contain non-ASCII characters. And if they do, and
> interoperability is your goal, then why are you using IBM437 in the first
> place?

So are you saying that regardless of what Encoding.default_external is
set to, if I read a file that only has ascii characters, it will
always show up as the same string, regardless?
Thanks!
-r
Posted by Run Paint Run Run (Guest)
on 2010-08-31 02:43
(Received via mailing list)
>> Well, firstly, that’s a pretty odd encoding to use. In general, you’ll have
>> a far easier time if you use UTF-8 for everything, and legacy encodings only
>> when necessary or, possibly, writing in a CJK script. In any case, even if
>> you continue using that encoding, the external encoding only needs to be
>> specified explicitly if the files contain non-ASCII characters. And if they
>> do, and interoperability is your goal, then why are you using IBM437 in the
>> first place?

> So are you saying that regardless of what
> Encoding.default_external is set to, if I read a file that only has ascii
> characters, it will always show up as the same string, regardless?
> Thanks!

Almost always. :-) Specifically, the encodings both need to be
ASCII-compatible, meaning their repertoire should consist of a superset of
ASCII. You can confirm with the Encoding#ascii_compatible? predicate, e.g.:

  >> Encoding::IBM437.ascii_compatible?  #=> true
  >> Encoding::UTF_8.ascii_compatible?   #=> true

The implication is that a file containing only ASCII is valid in any of these
encodings, i.e. a given ASCII character is represented by the same byte
sequence when transcoded to any ASCII-compatible encoding.

In practise, most of the encodings supported by Ruby are ASCII-compatible, with
the main exceptions being UTF-16 and UTF-32, in both their big- and
little-endian variants.

  Encoding.list.reject(&:ascii_compatible?)
  #=> [#<Encoding:UTF-16BE>, #<Encoding:UTF-16LE>, #<Encoding:UTF-32BE>,
  #<Encoding:UTF-32LE>, #<Encoding:ISO-2022-JP (dummy)>,
  #<Encoding:ISO-2022-JP-2 (dummy)>, #<Encoding:CP50220 (dummy)>,
  #<Encoding:CP50221 (dummy)>, #<Encoding:UTF-7 (dummy)>,
  #<Encoding:ISO-2022-JP-KDDI (dummy)>]

For example, the first three encodings below are ASCII-compatible, so an
ASCII-only string is represented by the same set of bytes; the last three are
not ASCII-compatible, so the byte sequences differ:

  >> s="y%$;"
  >> pp %w{ibm437 utf-8 ascii utf-16le utf-16be utf-32le}.
  >>    map{|e| [e, *s.encode(e).bytes]}
  [
    ["ibm437", 121, 37, 36, 59],
    ["utf-8", 121, 37, 36, 59],
    ["ascii", 121, 37, 36, 59],
    ["utf-16le", 121, 0, 37, 0, 36, 0, 59, 0],
    ["utf-16be", 0, 121, 0, 37, 0, 36, 0, 59],
    ["utf-32le", 121, 0, 0, 0, 37, 0, 0, 0, 36, 0, 0, 0, 59, 0, 0, 0]
  ]
Posted by NARUSE, Yui (Guest)
on 2010-08-31 18:58
(Received via mailing list)
(2010/08/31 0:04), Roger Pack wrote:
> Or even default to BINARY (ASCII-8BIT) unless they specify.  Most
> users don't want/need encoding until they run into it--they can handle
> it then.
> Thoughts?
> Thanks!
> -r

Japanese version of Windows uses CP932 (a.k.a. SJIS or Windows-31J).
And its command prompt uses CP932; it's not UTF-8 and can't use UTF-8.
So we must follow locale information.
(almost always locale reflects terminal's encoding)
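
To see what "following the locale" resolves to on a given machine, a small sketch like the following can be run. It only prints values, since they differ from system to system; the variable names are illustrative.

```ruby
# Where Ruby's defaults come from on the current machine.
locale_name = Encoding.locale_charmap     # charmap name reported by the OS
locale_enc  = Encoding.find("locale")     # the Encoding Ruby derives from it
external    = Encoding.default_external   # applied to IO opened without an explicit encoding

puts "locale charmap:   #{locale_name}"
puts "locale encoding:  #{locale_enc}"
puts "default_external: #{external}"
```

On a CP932 command prompt the first two lines would be expected to name that code page, whereas a typical Linux terminal reports UTF-8.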
Posted by Roger Pack (Guest)
on 2010-09-02 21:24
(Received via mailing list)
> Japanese version of Windows uses CP932 (a.k.a. SJIS or Windows-31J).
> And its command prompt uses CP932; it's not UTF-8 and can't use UTF-8.
> So we must follow locale information.
> (almost always locale reflects terminal's encoding)

Good to know.

I don't mean this as argumentative, but I still have some feedbacks.
You can just gloss over them if you're done with the discussion :)

I guess my only concern is that if you read in binary data, users of
1.9 *must* specify binary mode.
(which is reasonable), but here is my confusion:

$ cat test.rb
p File.read('other_file_unknown_encoding').encoding
p 'abc'.encoding

$ ruby test.rb
#<Encoding:IBM437>
#<Encoding:US-ASCII>

for better or worse, even though 'other_file_unknown_encoding' has
only ASCII characters, its encoding is set to my system default.  This
means that if I package up the file "other_file_unknown_encoding" in
my gem, and read it later, I *must* specify its encoding when I read
it.  So it will succeed locally and *fail* on another box, which to me
adds extra confusion.

Suggestion in this regard: for file input do not use the system to
determine encoding.  Force users to specify one at least once in code
somewhere so they realize the implications of what is happening.

The other surprising thing to me is that it assigns IBM437 not only to
terminal input but also to file input.  *Nobody* edits files in
IBM437.  It feels inappropriate to use that for the encoding, or,
really, to have a default for file encoding.

Suggestion in this regard:  It should force me to assign an explicit
encoding for reading files, or (if I don't) it should default to some
less desirable default, like BINARY.  How else can you really know
what a file encoding was, if it isn't specified?  If this were a word
processor, I would expect local files to be written in the default
locale, but this is reading arbitrary files, so I think it should,
again, force people to *at least once* specify their own default
external encoding, so that they realize what is going on behind the
scenes.
Feedbacks?
Thanks.
-r
Posted by Run Paint Run Run (Guest)
on 2010-09-03 01:35
(Received via mailing list)
> means that if I package up the file "other_file_unknown_encoding" in
> my gem, and read it later, I *must* specify its encoding when I read
> it.  So it will succeed locally and *fail* on another box, which to me
> adds extra confusion.

No. You can show this is not the case yourself by specifying a different
default external encoding, then reading in a file. If the file is ASCII-only,
it can be read under any ASCII-compatible external encoding without you having
to specify a thing. Your file is valid, and identical, in any of these
encodings, so the precise one chosen is immaterial.

  run@paint:/tmp →  echo "aheh&^^Gah*" >ascii
  run@paint:/tmp →  irb
  >> File.read('ascii').hash
  => -971653016
  >> File.read('ascii', encoding: 'ibm437').hash
  => -971653016
  >> File.read('ascii', encoding: 'utf-8').hash
  => -971653016
  >> File.read('ascii', encoding: 'big5').hash
  => -971653016
  >> Encoding.default_external='shift_jis'
  => "shift_jis"
  >> File.read('ascii').encoding
  => #<Encoding:Shift_JIS>
  >> File.read('ascii').hash
  => -971653016

> The other surprising thing to me is that it assigns IBM437 not only to
> terminal input but also to file input.  *Nobody* edits files in
> IBM437.  It feels inappropriate to use that for the encoding, or,
> really, to have a default for file encoding.

I know little of Windows matters, but surely if that’s your system locale, it
is indeed the default encoding for editing files. Certainly, your text editor
has presumably been configured to use something more sensible, but if not
wouldn’t it fall back to IBM437?
Posted by NARUSE, Yui (Guest)
on 2010-09-03 04:27
(Received via mailing list)
2010/9/2 Roger Pack <rogerdpack2@gmail.com>:
> I guess my only concern is that if you read in binary data, users of
>
> for better or worse, even though 'other_file_unknown_encoding' has
> only ASCII characters, its encoding is set to my system default.  This
> means that if I package up the file "other_file_unknown_encoding" in
> my gem, and read it later, I *must* specify its encoding when I read
> it.  So it will succeed locally and *fail* on another box, which to me
> adds extra confusion.

If the encoding of a file is unknown, you need to specify its encoding.
An application should get the encoding of an external file from its users
and use it when the app opens the file.

Yes, you *must* specify its encoding.

> Suggestion in this regard: for file input do not use the system to
> determine encoding.  Force users to specify one at least once in code
> somewhere so they realize the implications of what is happening.

Yeah, if you don't want dynamic decision with locale
you can specify File.read("foo.txt", encoding:"UTF-8").

> The other surprising thing to me is that it assigns IBM437 not only to
> terminal input but also to file input.  *Nobody* edits files in
> IBM437.  It feels inappropriate to use that for the encoding, or,
> really, to have a default for file encoding.

try echo "foo bar" > foo.txt.

> Suggestion in this regard:  It should force me to assign an explicit
> encoding for reading files, or (if I don't) it should default to some
> less desirable default, like BINARY.  How else can you really know
> what a file encoding was, if it isn't specified?

specify it.

> If this were a word
> processor, I would expect local files to be written in the default
> locale, but this is reading arbitrary files, so I think it should,
> again, force people to *at least once* specify their own default
> external encoding, so that they realize what is going on behind the
> scenes.

You can use -E or -U or Encoding.default_external=.
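
A minimal sketch of the two in-code options mentioned here: an explicit encoding on a single read, and a process-wide Encoding.default_external. The temporary file and the variable names are only for illustration.

```ruby
require "tempfile"

explicit = implicit = nil

Tempfile.create("encoding-demo") do |f|
  f.binmode
  f.write("caf\xC3\xA9\n".b)   # "café" written as raw UTF-8 bytes
  f.flush

  # Per call: the result does not depend on the machine's locale.
  explicit = File.read(f.path, encoding: "UTF-8")

  # Process-wide: comparable to passing -E on the command line.
  saved = Encoding.default_external
  begin
    Encoding.default_external = "Shift_JIS"
    implicit = File.read(f.path)
  ensure
    Encoding.default_external = saved
  end
end

puts explicit.encoding   # the encoding requested in the call
puts implicit.encoding   # whatever default_external was at read time
puts explicit.bytes == implicit.bytes   # the bytes are identical, only the tag differs
```

Reading does not transcode in either case; the setting only decides which encoding the resulting string is tagged with.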
Posted by Roger Pack (Guest)
on 2010-09-03 05:10
(Received via mailing list)
>> my gem, and read it later, I *must* specify its encoding when I read
>> it.  So it will succeed locally and *fail* on another box, which to me
>> adds extra confusion.
>
> No. You can show this is not the case yourself by specifying a different
> default external encoding, then reading in a file. If the file is ASCII-only,
> it can be read under any ASCII-compatible external encoding without you having
> to specify a thing. Your file is valid, and identical, in any of these
> encodings, so the precise one chosen is immaterial.

True.  I was referring more to the case where I did have non-ASCII
characters.  Using file.read on said file will then work on my box but
not necessarily on yours, which I think is surprising.  It's basically
an extra "gotcha"... (though I'll admit I've never run into it myself,
I've seen people mention it before).  It's not very portable...

>> The other surprising thing to me is that it assigns IBM437 not only to
>> terminal input but also to file input.  *Nobody* edits files in
>> IBM437.  It feels inappropriate to use that for the encoding, or,
>> really, to have a default for file encoding.
>
> I know little of Windows matters, but surely if that’s your system locale, it
> is indeed the default encoding for editing files. Certainly, your text editor
> has presumably been configured to use something more sensible, but if not
> wouldn’t it fallback to IBM437?

Surprisingly, the console edit.exe command actually appears to edit in
IBM437.  Nothing else that I'm aware of.  It's a mistake to assume
that encoding for File I/O.  Requiring it to be more explicit would
help people realize this, at least for me :P
Thanks!
-r
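
The portability gotcha with non-ASCII data can be sketched as follows. It takes the single byte that IBM437 uses for e acute and tags it the way two differently configured machines would; the names are hypothetical.

```ruby
raw = "\x82".b   # the one byte IBM437 uses for e acute

as_ibm437 = raw.dup.force_encoding("IBM437")   # what a reader on an IBM437 box gets
as_utf8   = raw.dup.force_encoding("UTF-8")    # what a reader on a UTF-8 box gets

puts as_ibm437.valid_encoding?   # a complete character in IBM437
puts as_utf8.valid_encoding?     # a stray continuation byte in UTF-8

puts as_ibm437.encode("UTF-8")   # transcodes cleanly

begin
  as_utf8.encode("UTF-16LE")     # the mis-tagged copy fails as soon as it is transcoded
rescue Encoding::InvalidByteSequenceError => e
  puts "#{e.class}: #{e.message}"
end
```

The read itself succeeds on both machines; the failure only appears later, when the string is used as text.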
Posted by Roger Pack (Guest)
on 2010-11-03 10:54
(Received via mailing list)
>> Yeah, if you don't want dynamic decision with locale
>> you can specify File.read("foo.txt", encoding:"UTF-8").
>
> Yeah I'm just afraid that newbies won't realize this, because it is
> implicit (hidden) on their local machines.

My latest thought is that files' default encoding can (should?) be
binary.  Then there's no confusion, and if they want to read it with
an encoding, they can specify so.
For console input, it makes sense to have a default encoding, but not
really for files.  At least from my perspective :)

I also noted that a few others agree that having a "system dependent"
file default encoding seems bad form [1].
Cheers!
-r
[1] http://github.com/candlerb/string19/blob/master/soapbox.rb#L75
(the whole thing is a good read :)
Posted by NARUSE, Yui (Guest)
on 2010-11-03 12:35
(Received via mailing list)
(2010/11/03 1:51), Roger Pack wrote:
> really for files.  At least from my perspective :)
Which we adopt, "no confusion" or "almost always works", was a very difficult problem.
Some people argue about automatic conversion when concatenating ASCII strings,
and others say that Ruby should convert strings even if both of those strings
have non-ASCII characters.
Decisions around them are hard and need many use cases.

Yes, I know you may think default encoding should be binary.
Yes, I know you may think default encoding should be UTF-8.
Yes, I know you may think default encoding should be its locale.

To decide on a spec we need to consider many situations: environments whose
locale is C, en_US.UTF-8, Windows-1252, CP932, and so on.
Their files are written in UTF-8, US-ASCII, CP932, EUC-JP, or are binary files.
A reasonable suggestion includes thoughts around such situations.
A good suggestion includes thoughts around situations we didn't imagine.

> For console input, it makes sense to have a default encoding, but not
> really for files.  At least from my perspective :)

Yes, you are right.
But in most situations we assumed those files are written in the same encoding
as the locale; that is the reason for the current answer.

> I also noted that a few others agree that having a "system dependent"
> file default encoding seems bad form [1].
> [1] http://github.com/candlerb/string19/blob/master/soapbox.rb#L75
> (the whole thing is a good read :)

I read it.
If he proposed a reasonable suggestion, we might adopt it.
Posted by Brian Candler (candlerb)
on 2010-11-15 05:45
(Received via mailing list)
> > I also noted that a few others agree that having a "system dependent"
> > file default encoding seems bad form [1].
> > [1] http://github.com/candlerb/string19/blob/master/so...
> > (the whole thing is a good read :)
>
> I read it.
> If he proposed a reasonable suggestion, we might adoptit.

That depends on your definition of "reasonable" :-)

I have just reworked this document:
https://github.com/candlerb/string19/blob/master/alternatives.markdown

where I outline three options. You can skip option 1 because I very much
doubt you will consider it.

Regards,

Brian.
Posted by NARUSE, Yui (Guest)
on 2010-11-15 11:07
(Received via mailing list)
(2010/11/14 20:40), Brian Candler wrote:
>>> I also noted that a few others agree that having a "system dependent"
>>> file default encoding seems bad form [1].
>>> [1] http://github.com/candlerb/string19/blob/master/so...
>>> (the whole thing is a good read :)
>>
>> I read it.
>> If he proposed a reasonable suggestion, we might adoptit.
>
> That depends on your definition of "reasonable" :-)

Of course, but a proposal is always such thing.

Anyway, a "reasonable" proposal is always based on many facts and use cases.
If a proposal doesn't consider real use cases, it will be rejected.

> I have just reworked this document:
> https://github.com/candlerb/string19/blob/master/alternatives.markdown
>
> where I outline three options. You can skip option 1 because I very much
> doubt you will consider it.

== Option 1

This is what Japanese people often say: "Americans don't consider non-ASCII".

You say String needs an encoding only when it gives substrings.
But it needs one for each_char, each_line, ==, length, succ, index, inspect,
upcase, codepoints, reverse, ord, end_with, chop, delete, tr, and related methods.

We Japanese hacked with regexps when we wanted to do such things, like:
"\xE3\x81\x82\xE3\x81\x84".scan(/[\w\W]/u).length for String#length
We don't want to go back to those old days.
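
To illustrate why these methods need an encoding, here is a small sketch comparing the same six bytes tagged as UTF-8 text and as binary. The two characters are the ones from the regexp example above, written with \u escapes so the snippet does not depend on the source encoding.

```ruby
text  = "\u3042\u3044"   # two hiragana characters, stored as six UTF-8 bytes
bytes = text.b           # the same bytes, tagged as binary

puts text.length         # counts characters
puts bytes.length        # counts bytes
puts text.chars.inspect  # splits on character boundaries
puts text.reverse        # reverses by character

# Reversing byte by byte cuts the characters apart.
mangled = bytes.reverse.force_encoding("UTF-8")
puts mangled.valid_encoding?
```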

== Option 2

> Have a universally-compatible "BINARY" encoding.
> Any operation between BINARY and FOO gives encoding BINARY,
> and transcoding between BINARY and any other encoding is a null operation.

This will hide unexpectedly mixed BINARY strings.
You'll find such strings hard to debug.
This is why wycats requested that ASCII-8BIT be ASCII-incompatible.

> Open all files in BINARY mode, except where explicitly asked:
>   File.open("/etc/passwd","r:locale")

I want to do this if people accept this, but I think they don't.

> Treat invalid characters in the same way as String#[] does,
> i.e. never raise an exception. In particular, regexp matching always succeeds.

This will raise a security issue.

== Option 3

> Everything is compatible with everything else, by definition.

UTF-16BE is the easy one because it can be converted to UTF-8 without any loss of information.
Other encodings, which use different character sets, are the problem.

How would it work for something like:

"\xA1".force_encoding("ISO-8859-1") + "\xA1".force_encoding("ISO-8859-2")

Yeah, automatic conversion feels like a very good solution.
But on further consideration, you'll find many edge cases.
Such a huge number of "edge" cases makes encodings difficult, but we can't escape from them.
A reasonable proposal must consider them.
We considered them when we designed the current implementation.
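
For reference, this is a sketch of what the ISO-8859-1 plus ISO-8859-2 case does under the current rules, and of one explicit way to combine the two strings. The variable names are illustrative.

```ruby
latin1 = "\xA1".b.force_encoding("ISO-8859-1")   # byte 0xA1 is U+00A1 in ISO-8859-1
latin2 = "\xA1".b.force_encoding("ISO-8859-2")   # the same byte is U+0104 in ISO-8859-2

begin
  latin1 + latin2   # both sides are non-ASCII and the encodings differ
rescue Encoding::CompatibilityError => e
  puts "#{e.class}: #{e.message}"
end

# Transcoding both sides to a common encoding first makes the intent explicit.
combined = latin1.encode("UTF-8") + latin2.encode("UTF-8")
puts combined.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```

The same byte stands for two different characters, so any automatic rule has to decide which character set wins, or transcode both.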
Posted by Roger Pack (Guest)
on 2010-11-15 18:10
(Received via mailing list)
A few reactions...

> == Option 2
>
>> Have a universally-compatible "BINARY" encoding.
>> Any operation between BINARY and FOO gives encoding BINARY,
>> and transcoding between BINARY and any other encoding is a null operation.

Is setting Encoding.default_external to BINARY the same effect as
this? (all strings will be binary, thus they will all have every
operation result in binary...)

> This will hide unexpectedly mixed BINARY string.
> You'll realize hard to debug such strings.

True hard to debug if several different encodings are merged.  Perhaps
you could make it "opt-in" to use this universal encoding proposed?

>> Open all files in BINARY mode, except where explicitly asked:
>> File.open("/etc/passwd","r:locale")
>
> I want to do this if people accept this, but I think they don't.

I would accept it :)
Thanks for the feedback.  I think it only helps ruby get better.
Maybe another option is a command line switch like --binary_only
(avoids all encodings, does only ASCII-8BIT) or make it so that
setting Encoding.default_external and default_internal to ASCII-8BIT
would have the same effect...
Just thinking out loud.
Cheers!
-r
Posted by NARUSE, Yui (Guest)
on 2010-11-15 20:25
(Received via mailing list)
(2010/11/16 2:03), Roger Pack wrote:
> operation result in binary...)
A universally-compatible "BINARY" and setting Encoding.default_external to
BINARY are different problems.
Setting Encoding.default_external to BINARY without making BINARY
universally compatible is possible.

>> This will hide unexpectedly mixed BINARY string.
>> You'll realize hard to debug such strings.
>
> True hard to debug if several different encodings are merged.  Perhaps
> you could make it "opt-in" to use this universal encoding proposed?

What is "opt-in"?

>>> Open all files in BINARY mode, except where explicitly asked:
>>>   File.open("/etc/passwd","r:locale")
>>
>> I want to do this if people accept this, but I think they don't.
>
> I would accept it :)

As with Ruby's source encoding, it should default to US-ASCII mode
and raise an exception when the data includes non-ASCII.

> Thanks for the feedback.  I think it only helps ruby get better.
> Maybe another option is a command line switch like --binary_only
> (avoids all encodings, does only ASCII-8BIT) or make it so that
> setting Encoding.default_external and default_internal to ASCII-8BIT
> would have the same effect...
> Just thinking out loud.

By setting Encoding.default_external or LC_ALL=C and playing with UTF-8 data,
you can simulate what would happen.
Of course you need some non-ASCII sample text.
Posted by "Martin J. Dürst" <duerst@it.aoyama.ac.jp> (Guest)
on 2010-11-16 02:52
(Received via mailing list)
Hello Roger,

On 2010/11/16 2:03, Roger Pack wrote:
> operation result in binary...)
No, as Yui explained.

>> This will hide unexpectedly mixed BINARY string.
>> You'll realize hard to debug such strings.
>
> True hard to debug if several different encodings are merged.  Perhaps
> you could make it "opt-in" to use this universal encoding proposed?

If you mean "BINARY" by universal encoding, then that would indeed have
to be a heavily guarded opt-in. I'm not sure why you want everything to
be BINARY, and every string in another encoding to combine easily with
BINARY.

Going back to 1.8 and solving the encoding problems isn't the same
thing. Just using BINARY in most cases is just GIGO (garbage in, garbage
out) [unless the only thing you are doing is to actually manipulate
binary data, which exists but is rather rare (and in that case, just
declaring your input files,... to be BINARY is the right way to go)].
Such programs will pass tests, but they won't actually work the way
they are intended to work, because encodings may easily be mixed up. You
may get very much the same results, the programs may work with data that
is all or mostly ASCII, but then in circumstances where there is lots of
data beyond ASCII, it will produce garbage.

I think if we want to work towards improving what we currently have, we
might have to actually make the encoding checks more severe. One of the
main problems is that currently, ASCII data just works. This was done
because Matz and the others working on Ruby 1.9 didn't want to require
an encoding declaration at the start of every Ruby source file, and for
every file opening,..., and didn't want to upset the major user bases
too much.

One part of a solution might be a kind of "test mode", where checks are
stricter, but it would give you a greater probability that the program
would work in a wider range of production settings. Some of that might
be fairly easy. But the fact that US-ASCII and BINARY (==ASCII8BIT)
currently mix easily will need some very careful thought and work (it
has been discussed many times in great detail on ruby-dev). The short
summary of why US-ASCII and BINARY currently mix easily is that if you
want to check the magic number at the start of a GIF file, you want to
write that
    ~= /^GIF8/
and not
    ~= /^\x47\x49\x46\x38/b
(where the 'b', which I just invented, declares the regular expression
as binary, because it would otherwise still be US-ASCII (or whatever the
script encoding is)).

Another part of a solution is to come up with some test harness that
makes it easy to check a wide range of encoding circumstances.

Regards,   Martin.
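
A short sketch of the "ASCII mixes with BINARY" behaviour described above. The header bytes are a made-up stand-in for the start of a GIF file.

```ruby
header = "GIF89a\x01\x00\xFF".b   # hypothetical first bytes of a GIF, tagged as binary

puts header.encoding
puts((header =~ /\AGIF8/).inspect)   # an ASCII-only regexp can be matched against binary data
puts header.start_with?("GIF8")      # so can an ASCII-only string literal

# Concatenating ASCII-only text with binary data is allowed; the result is binary.
label  = "magic: ".dup.force_encoding("US-ASCII")
joined = label + header
puts joined.encoding
```

None of these operations raises, even though the operands carry different encodings, because one side is always pure ASCII.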
Posted by James Tucker (Guest)
on 2010-11-16 17:04
(Received via mailing list)
On 31 Aug 2010, at 09:54, NARUSE, Yui wrote:

>> change it if they want something else.
> (almost always locale reflects terminal's encoding)
I think the biggest real issue in this discussion is that irb crashes in
Unicode terminals on Windows:

cmd /u
chcp 65001
ruby -e "puts Encoding.default_external"
UTF-8
irb # => crash

IRB has other bugs in this area, for example, irb assumes UTF-8 in all cases
and doesn't use Encoding defaults.

192p0 mingw
Posted by Brian Candler (candlerb)
on 2010-11-17 13:06
(Received via mailing list)
NARUSE, Yui wrote on 2010-11-15 11:07:
> This is what Japanese people often say "Americans don't consider
> non-ASCII

Sure, many people want to handle non-ASCII text. But:

* I would consider all non-Unicode character sets to be legacy, do you
disagree?
* ruby 1.9's model doesn't handle stateful encodings like ISO-2022-JP,
so these need transcoding at the edge anyway
* hence why not just transcode everything that's not Unicode into Unicode?

I would choose UTF-8 as the internal Unicode representation, since the
majority of external Unicode data is already UTF-8.  (*)

Then you end up with the design used by both Python 3.0 and Erlang: you have
two data types, one for binary strings, and one for UTF-8 text.  (I should
add this to the document as an explicit alternative)

This would wipe out most of the complexity associated with ruby 1.9 at a
stroke. What you would lose is:

* the ability to handle things like EUC-JP or GB2312 "natively", that is,
without transcoding them to UTF-8 and back
* the ability to write ruby programs in non-UTF-8 character sets

How big a loss is that?

(*) There's an argument which says use UTF-16 or UTF-32 internally as it's
better suited to character indexing.  I would say that this is outweighed by
the extra RAM bandwidth used, and the fact that most data is UTF-8 so would
have to be transcoded.
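
Transcoding "at the edge" can already be approximated in ruby 1.9 with an external:internal pair in the open mode. The sketch below assumes an EUC-JP input file, which is created here as a temporary file purely for the demonstration.

```ruby
require "tempfile"

utf8_text = nil

Tempfile.create("legacy") do |f|
  f.binmode
  f.write("\xA4\xA2\xA4\xA4".b)   # two hiragana characters encoded as EUC-JP
  f.flush

  # "r:EUC-JP:UTF-8" means: the file is EUC-JP, hand back UTF-8 strings.
  utf8_text = File.open(f.path, "r:EUC-JP:UTF-8", &:read)
end

puts utf8_text.encoding
puts utf8_text.length

# Going back out, transcode again at the boundary.
legacy_bytes = utf8_text.encode("EUC-JP").bytes
puts legacy_bytes.inspect
```

Inside the program everything is UTF-8; the legacy encoding only exists at the two I/O boundaries, which is the trade-off being asked about.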

> > Have a universally-compatible "BINARY" encoding.
> > Any operation between BINARY and FOO gives encoding BINARY,
> > and transcoding between BINARY and any other encoding is a null operation.
>
> This will hide unexpectedly mixed BINARY string.
> You'll realize hard to debug such strings.

I would much rather have a program which outputs a plausible binary string
from its inputs than one which crashes given unexpected data.  ruby 1.9
hugely magnifies the number of unit test cases to achieve coverage of these
edge cases.

> > Treat invalid characters in the same way as String#[] does,
> > i.e. never raise an exception. In particular, regexp matching always succeeds.
>
> This will raise security issue.

In what way is it a security issue?  Why is it not a security issue that
String#[] doesn't error?  Why is it not a security issue that 'sed' handles
such files successfully?

Roger Pack wrote:
> Maybe another option is a command line switch like --binary_only
> (avoids all encodings, does only ASCII-8BIT) or make it so that
> setting Encoding.default_external and default_internal to ASCII-8BIT
> would have the same effect...
> Just thinking out loud.

I think that's a bad idea, because then your program behaves in
different ways depending on how it's launched, and that breaks libraries
which depend on this global flag being set a particular way, and applications
which are run in different environments. This problem already affects 1.8:
see http://www.ruby-forum.com/topic/216511

I think if you wanted this behaviour to be opt-in you'd have a completely
different encoding, call it 'true-binary' for sake of argument.  Then you'd
say:

  # encoding:true-binary

at the top of the source file to make all your string literals be this
universal binary encoding.

Martin J. Dürst wrote:
> But the fact that US-ASCII and BINARY (==ASCII8BIT)
> currently mix easily will need some very careful thought and work (it
> has been discussed many times in great detail on ruby-dev). The short
> summary of why US-ASCII and BINARY currently mix easily is that if you
> want to check the magic number at the start of a GIF file, you want to
> write that
>     ~= /^GIF8/
> and not
>     ~= /^\x47\x49\x46\x38/b

I think you mean:

      =~ /\AGIF8/

Since this is binary data, you'll have read it using 'read' not 'gets'.
This demonstrates a far more dangerous default behaviour in Ruby: ^ doesn't
only match start of string, so the regexp you wrote will match things you
didn't expect.  (This isn't really relevant to discussion in hand, except
where people have said it's somehow dangerous to allow binary strings to
interact with strings in different encodings without raising an error)

Regards,

Brian.
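
The difference between ^ and \A can be checked with a small sketch; the payload is a made-up example in which the magic number only appears at the start of the second line.

```ruby
payload = "not an image\nGIF89a".b   # hypothetical data, magic number on line 2

puts((payload =~ /^GIF8/).inspect)    # ^ matches at the start of any line
puts((payload =~ /\AGIF8/).inspect)   # \A matches only at the start of the string
```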
Posted by Brian Candler (candlerb)
on 2010-11-17 13:55
(Received via mailing list)
NARUSE, Yui (Guest) wrote on 2010-11-15 11:07
> > Open all files in BINARY mode, except where explicitly asked:
> >   File.open("/etc/passwd","r:locale")
>
> I want to do this if people accept this, but I think they don't.

I agree that the difference between "r" and "rb" already exists for this
purpose.

However I think there is an inconsistency between the strict rules for
source code (you must specify the encoding), and lax rules for data (ruby
will make a guess).

To me, either of the following resolutions makes sense.

(1) For files opened as text ("r" or "w", as opposed to "rb" or "wb"),
    make the encoding default to US-ASCII.

    If users want something else they can open as "r:utf-8" or
    "r:locale" - the latter would guess the encoding from the environment
    as is done today.

    This forces people to make an explicit choice of behaviour, in the same
    way that the #encoding tag forces an explicit choice for source files.

(2) Make the default encoding for both text files and source code be UTF-8.
    This covers a large proportion of use cases. If you are writing a
    program which reads a UTF-8 template then you're excused from declaring
    it, but if it reads (say) a GB2312 template then you must say so.

    IMO this is better than writing a program which reads your GB2312
    template successfully on your machine, but crashes when run on someone
    else's machine.

Behaviour of STDIN can be argued, but I'd go for consistency (fixed US-ASCII
or UTF-8).  Scripts which want to be environment-sensitive can add:
    STDIN.set_encoding "locale"
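
As a rough sketch of what the explicit choices look like with today's API
(the temp file and its "café" content are made up for the example, and it
assumes Encoding.default_internal is left unset), the same bytes can be
opened under each of the modes discussed above:

```ruby
require "tempfile"

# Hypothetical sample file containing "café\n" as UTF-8 bytes.
file = Tempfile.new("encoding-demo")
file.binmode
file.write("caf\xC3\xA9\n".b)
file.close

# Explicit external encodings, as in proposal (1)'s "r:utf-8" form.
as_utf8   = File.open(file.path, "r:utf-8")    { |f| f.read }
as_ascii  = File.open(file.path, "r:us-ascii") { |f| f.read }
# Binary mode: no encoding is assumed at all.
as_binary = File.open(file.path, "rb")         { |f| f.read }

# The environment-derived encoding that "locale" refers to.
locale_encoding = Encoding.find("locale")

puts as_utf8.encoding           # UTF-8
puts as_utf8.valid_encoding?    # true
puts as_ascii.encoding          # US-ASCII
puts as_ascii.valid_encoding?   # false: the data has non-ASCII bytes
puts as_binary.encoding         # ASCII-8BIT
puts locale_encoding            # depends on the machine

file.unlink
```

Reading as US-ASCII does not raise by itself; the string is only tagged, and
the mismatch shows up later through valid_encoding? or in operations that
check the bytes.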

Regards,

Brian.
Posted by NARUSE, Yui (Guest)
on 2010-11-17 14:42
(Received via mailing list)
(2010/11/17 21:05), Brian Candler wrote:
> NARUSE, Yui wrote on 2010-11-15 11:07:
>> This is what Japanese people often say "Americans don't consider
>> non-ASCII
>
> Sure, many people want to handle non-ASCII text. But:
>
> * I would consider all non-Unicode character sets to be legacy, do you
> disagree?

Yes, they are legacy.
But whether we can throw them away is a different problem.

> * ruby 1.9's model doesn't handle stateful encodings like ISO-2022-JP,
> so these need transcoding at the edge anyway

Yes.
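
For illustration only, a small sketch of "transcoding at the edge" for
ISO-2022-JP. The sample text is arbitrary; the point is that Ruby registers
this encoding as a dummy encoding, so strings are converted to another
encoding before they are processed:

```ruby
# Arbitrary sample text (three kanji), written with \u escapes.
text = "\u65E5\u672C\u8A9E"

# Convert to ISO-2022-JP on the way out ...
wire = text.encode("ISO-2022-JP")
# ... and back to UTF-8 on the way in.
back = wire.encode("UTF-8")

# Dummy encodings can be transcoded but not processed character by character.
is_dummy  = Encoding::ISO_2022_JP.dummy?
# ISO-2022-JP is a 7-bit encoding that switches character sets with escapes.
seven_bit = wire.bytes.all? { |b| b < 0x80 }

p is_dummy
p seven_bit
p back == text
```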

> * hence why not just transcode everything that's not Unicode into Unicode?

The conversion tables between non-Unicode encodings and Unicode are not
well defined. The "XML Japanese Profile" describes this confusion.
http://www.w3.org/Submission/japanese-xml/


> I would choose UTF-8 as the internal Unicode representation, since the
> majority of external Unicode data already UTF-8.  (*)

"already" or "more and more" is arguable, but I almost agree.

> Then you end up with the design used by both Python 3.0 and Erlang: you have
> two data types, one for binary strings, and one for UTF-8 text.  (I should
> add this to the document as an explicit alternative)

Python 3.0's internal representation is UTF-16/UTF-32.
I don't know Erlang.

How we move between binary strings and Unicode strings is a big design
problem. I can't evaluate the proposal before that has been designed.

> better suited to character indexing.  I would say that this is outweighed by
> the extra RAM bandwidth used, and the fact that most data is UTF-8 so would
> have to be transcoded.

The reason Rails 3 still supports legacy encodings may answer that.

> edge cases.
Characters have a huge number of edge cases.
Those edges will still be sharp even if the language only supports Unicode.

>>> Treat invalid characters in the same way as String#[] does,
>>> i.e. never raise an exception. In particular, regexp matching always succeeds.
>>
>> This will raise security issue.
>
> In what way is it a security issue?  Why is it not a security issue that
> String#[] doesn't error?  Why is it not a security issue that 'sed' handles
> such files successfully?

See http://www.infoq.com/news/2009/09/rails-vulnerabilities
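
For reference, a small sketch of the String#[] behaviour mentioned in the
quoted text. The sample string is made up, and String#scrub needs Ruby 2.1
or later; this only shows what the methods do and takes no side on the
security question:

```ruby
# Hypothetical input: UTF-8 text with one stray 0xFF byte in the middle.
s = "abc\xFFdef"

valid    = s.valid_encoding?   # the string is tagged UTF-8 but is not valid
bad_char = s[3]                # indexing returns the bad byte, no exception
scrubbed = s.scrub("?")        # replace invalid bytes explicitly

p valid
p bad_char.bytes
p scrubbed
```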
Posted by NARUSE, Yui (Guest)
on 2010-11-17 15:53
(Received via mailing list)
(2010/11/17 21:54), Brian Candler wrote:
> source code (you must specify the encoding), and lax rules for data (ruby
>
>      This forces people to make an explicit choice of behaviour, in the same
>      way that the #encoding tag forces an explicit choice for source files.

This seems strict and clear, but in practice it is difficult.

> (2) Make the default encoding for both text files and source code be UTF-8.
>      This covers a large proportion of use cases. If you are writing a
>      program which reads a UTF-8 template then you're excused from declaring
>      it, but if it reads (say) a GB2312 template then you must say so.
>
>      IMO this is better than writing a program which reads your GB2312
>      template successfully on your machine, but crashes when run on someone
>      else's machine.

Ruby 2.0 may go this way.

> Behaviour of STDIN can be argued, but I'd go for consistency (fixed US-ASCII
> or UTF-8).  Scripts which want to be environment-sensitive can add:
>      STDIN.set_encoding "locale"

On (2), it should be UTF-8.


Anyway, Matz said his team is developing an embedded Ruby.
It doesn't have M17N and only supports US-ASCII/UTF-8.
http://www.slideshare.net/yukihiro_matz/rubyconf-2010-keynote-by-matz

So it may be fun to imagine the details of its behavior.
I think how to go back and forth between binary and UTF-8 is the key problem.
Posted by James Edward Gray II (Guest)
on 2010-11-17 16:10
(Received via mailing list)
On Nov 17, 2010, at 6:05 AM, Brian Candler wrote:

> (*) There's an argument which says use UTF-16 or UTF-32 internally as it's
> better suited to character indexing.

That's a myth.  Due to Unicode's combining characters, it's not that much
better.
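
A quick sketch of the combining-character point (the sample string is
arbitrary, and String#grapheme_clusters needs Ruby 2.5 or later). Even in a
fixed-width encoding, one user-visible character can span several code
points:

```ruby
# "e" followed by U+0301 COMBINING ACUTE ACCENT: displayed as one character.
decomposed = "e\u0301"

utf32 = decomposed.encode("UTF-32LE")

code_points = decomposed.length                   # counts code points
utf32_units = utf32.length                        # still two units in UTF-32
utf32_bytes = utf32.bytesize                      # 4 bytes per code point
first_unit  = utf32[0].encode("UTF-8")            # just the bare "e"
graphemes   = decomposed.grapheme_clusters.length # what a reader perceives

p code_points
p utf32_units
p utf32_bytes
p first_unit
p graphemes
```

So indexing by code point is constant-time in UTF-32, but index 0 still gives
only the base letter, not the accented character the user sees.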

James Edward Gray II
Posted by mathew (Guest)
on 2010-11-17 17:38
(Received via mailing list)
On Tue, Nov 16, 2010 at 10:03, James Tucker <jftucker@gmail.com> wrote:

> I think the biggest problem that is a real issue in this discussion is that
> irb crashes in unicode terminals on windows:
>
> cmd /u
> chcp 65001
> ruby -e "puts Encoding.default_external"
> UTF-8
> irb # => crash
>

I don't think a Windows-specific irb bug is a good reason to change Ruby's
character set handling.

irb works fine in a UTF-8 terminal on Linux. That's how I run it all the
time.

$ ruby -e "puts Encoding.default_external"
UTF-8
$ irb
irb(main):001:0>


mathew
Posted by NARUSE, Yui (Guest)
on 2010-11-20 07:08
(Received via mailing list)
(2010/11/18 1:35), mathew wrote:
>
> I don't think a Windows-specific irb bug is a good reason to change Ruby's
> character set handling.
>
> irb works fine in a UTF-8 terminal on Linux. That's how I run it all the time.
>
> $ ruby -e "puts Encoding.default_external"
> UTF-8
> $ irb
> irb(main):001:0>

See also [ruby-core:33162].
Posted by Brian Candler (candlerb)
on 2010-11-24 13:13
(Received via mailing list)
NARUSE, Yui (Guest) wrote on 2010-11-17 15:53
> Anyway, Matz said his team is developing an embedded Ruby.
> It doesn't have M17N and only supports US-ASCII/UTF-8.
> http://www.slideshare.net/yukihiro_matz/rubyconf-2...

Interesting: that may be a forward path for me, as long as it can handle
binary too.

One of the joys of ruby 1.8 is that it is so small: the source tarball is
about 1/3rd of the size of Perl, and yet it comes with a much more complete
set of libraries.  It runs on tiny machines like OpenWrt boxes with 4MB of
flash, and the kernel and the rest of the O/S has to fit in that space too
:-)

It would be a shame if most existing libraries and frameworks wouldn't work
with embedded ruby though.

Regards,

Brian.
Posted by Roger Pack (Guest)
on 2010-11-24 17:39
(Received via mailing list)
> Python 3.0's internal representation is UTF-16/UTF-32.
> I don't know Erlang.

I suppose that would be another proposal suggestion :)
Cheers!
-r