A question about Ruby 1.9's "external encoding"

I have the following program:

p Encoding.default_external
File.open('testing', 'w') do |f|
  p f.external_encoding
end

and when I run it I get the following output:

#<Encoding:UTF-8>
nil

In other words, the file’s “external encoding” is nil. What does this
mean? Shouldn’t this be “UTF-8”, the default external encoding?

BTW, “ruby1.9.1 -v” gives me:

ruby 1.9.1p378 (2010-01-10 revision 26273) [i486-linux]

I’m using Ubuntu 10.04.1, and that’s the most updated version of Ruby
1.9.1.

On 03/20/2011 01:38 AM, Albert S. wrote:

nil

In other words, the file’s “external encoding” is nil. What does this
mean? Shouldn’t this be “UTF-8”, the default external encoding?

--------------------------------------------------- IO#external_encoding
io.external_encoding => encoding

  From Ruby 1.9.1

  Returns the Encoding object that represents the encoding of the
  file. If io is write mode and no encoding is specified, returns
  +nil+.

I’d say it means that the default encoding is used.

BTW, “ruby1.9.1 -v” gives me:

ruby 1.9.1p378 (2010-01-10 revision 26273) [i486-linux]

I’m using Ubuntu 10.04.1, and that’s the most updated version of Ruby
1.9.1.

irb(main):001:0> Encoding.default_external
=> #<Encoding:UTF-8>
irb(main):002:0> Encoding.default_internal
=> nil
irb(main):003:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):004:0> File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):005:0>

Apparently the file is encoded in UTF-8 because I can read it without
errors and get what I expect.

Kind regards

robert

Albert S. wrote in post #988363:

I have the following program:

p Encoding.default_external
File.open('testing', 'w') do |f|
  p f.external_encoding
end

and when I run it I get the following output:

#<Encoding:UTF-8>
nil

In other words, the file’s “external encoding” is nil. What does this
mean? Shouldn’t this be “UTF-8”, the default external encoding?

Depends what you mean by “shouldn’t be”. The rules for encodings in ruby
1.9 are (IMO) arbitrary and inconsistent.

In the case of external encodings: yes, they default to nil for files
opened in write mode. This means that no transcoding is done on output.
For example, if you have a String which happens to contain binary, or
ISO-8859-1, it will be written out unchanged (i.e. the sequence of bytes
in the String is the same sequence of bytes which will end up in the
file).

If you want to transcode on output, you have to set the external
encoding explicitly.
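For example (a sketch of my own; "latin.txt" is just a scratch file name):

```ruby
utf = "aä"                             # UTF-8 source: bytes 97, 195, 164

File.open("latin.txt", "w:ISO-8859-1") do |f|
  p f.external_encoding                # => #<Encoding:ISO-8859-1>
  f.write(utf)                         # transcoded on the way out
end

bytes = File.open("latin.txt", "rb") { |f| f.read }.bytes.to_a
p bytes                                # => [97, 228] -- 0xE4, not UTF-8's 0xC3 0xA4
```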

Since none of this is documented anywhere officially, I attempted to
reverse engineer it. I’ve documented about 200 behaviours here:

For my own code, I still use ruby 1.8 exclusively.

Robert K. wrote in post #988404:

--------------------------------------------------- IO#external_encoding
io.external_encoding => encoding

  From Ruby 1.9.1

  Returns the Encoding object that represents the encoding of the
  file. If io is write mode and no encoding is specified, returns
  +nil+.

I’d say it means that the default encoding is used.

No, it doesn’t.

Apparently the file is encoded in UTF-8 because I can read it without
errors

ruby 1.9 does not give errors if you read a file which is not UTF-8
encoded when the external encoding is UTF-8. You will just get strings
with valid_encoding? false.

It will give errors if you attempt UTF-8 regexp matches on the data
though.

The rules for which methods give errors and which don’t are pretty odd.
For example, string[n] doesn’t give an exception, even if the string is
invalid.
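A quick illustration of both behaviours (my own example, not from the
original post):

```ruby
# Invalid UTF-8: 0xFF can never start a UTF-8 sequence.
s = "\xFF\xFEabc".force_encoding("UTF-8")

p s.valid_encoding?    # => false
p s[0].bytes.to_a      # => [255] -- indexing returns the bad byte, no exception

begin
  s =~ /abc/           # the regexp engine must decode the bytes ...
rescue ArgumentError => e
  p e.message          # ... and raises "invalid byte sequence in UTF-8"
end
```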

On 20.03.2011 14:19, Brian C. wrote:

I’d say it means that the default encoding is used.

No, it doesn’t.

So, which encoding is used then? An encoding has to be used because
you cannot write to a file without a particular encoding. There needs
to be a defined mapping between character data and bytes in the file.

Apparently the file is encoded in UTF-8 because I can read it without
errors

ruby 1.9 does not give errors if you read a file which is not UTF-8
encoded when the external encoding is UTF-8. You will just get strings
with valid_encoding? false.

I could see in the console that the file was read properly. Also:

irb(main):001:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):002:0> s = File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):003:0> s.valid_encoding?
=> true
irb(main):004:0>

It will give errors if you attempt UTF-8 regexp matches on the data
though.

The rules for which methods give errors and which don’t are pretty odd.
For example, string[n] doesn’t give an exception, even if the string is
invalid.

I would concede that encodings in Ruby are pretty complex. It’s easier
in Java where String never has a particular encoding and only reading
and writing uses encodings. However, Java’s Strings were not capable of
handling all Asian symbols as I have learned on this list. Since 1.5
they managed to increase the range of Unicode codepoints which can be
covered - at the cost of making String handling a mess:

http://download.oracle.com/javase/6/docs/api/java/lang/String.html#codePointAt(int)

Now suddenly String.length() no longer returns the length in real
characters (code points) but rather the length in chars. I figure,
Ruby’s solution might not be so bad after all.

Kind regards

robert

Robert K. wrote in post #988429:

On 20.03.2011 14:19, Brian C. wrote:

I’d say it means that the default encoding is used.

No, it doesn’t.

So, which encoding is used then?

None.

An encoding has to be used because
you cannot write to a file without a particular encoding.

Untrue. In Unix, read() and write() just work on sequences of bytes, and
have no concept of encoding.

Perhaps you are thinking of a language like Python 3, where there is a
distinction between “characters” and “bytes representing those
characters” (maybe Java has that distinction too; I don’t know enough
about Java to say).

In ruby 1.9, every String is a bunch of bytes plus an encoding tag. When
you write this out to a file, and the external encoding is nil, then
just the bytes are written, and the encoding is ignored.
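A compact way to see that (my sketch; "raw" is a scratch file name):

```ruby
latin = "aä".encode("ISO-8859-1")      # two bytes: 97, 228

File.open("raw", "w") do |f|           # no encoding given at open time
  p f.external_encoding                # => nil
  f.write(latin)                       # bytes pass through untouched
end

bytes = File.open("raw", "rb") { |f| f.read }.bytes.to_a
p bytes                                # => [97, 228] -- still the ISO-8859-1 bytes
```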

I could see in the console that the file was read properly.

What you see in the console in irb does not necessarily mean much in
ruby 1.9, because STDOUT.external_encoding is nil by default too.

irb(main):001:0> File.open("x","w"){|io| p io.external_encoding; io.puts "aä"}
nil
=> nil
irb(main):002:0> s = File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):003:0> s.valid_encoding?
=> true

Now, that’s more complex, and does show that the data is valid UTF-8.
(I wasn’t arguing that it wasn’t; I was arguing that your logic was
flawed, because even if the data were not valid UTF-8, your program
would have run without raising an error. Therefore the fact that it runs
without error is insufficient to show that the data is valid UTF-8)

[In Java]

Now suddenly String.length() no longer returns the length in real
characters (code points) but rather the length in chars. I figure,
Ruby’s solution might not be so bad after all.

Of course, even in Unicode, the number of code points is not necessarily
the same as the number of glyphs or “printable characters”.
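For example (a sketch): a combining accent makes one glyph out of two
code points, so even code-point length can diverge from what the user
sees on screen.

```ruby
precomposed = "\u00E9"      # é as one code point (LATIN SMALL LETTER E WITH ACUTE)
combining   = "e\u0301"     # e + COMBINING ACUTE ACCENT: same glyph, two code points

p precomposed.length        # => 1
p combining.length          # => 2
p precomposed == combining  # => false, although they render identically
```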

On Sun, Mar 20, 2011 at 6:39 PM, Brian C. [email protected]
wrote:

Robert K. wrote in post #988429:

On 20.03.2011 14:19, Brian C. wrote:

I’d say it means that the default encoding is used.

No, it doesn’t.

So, which encoding is used then?

None.

Even if no encoding is used explicitly, an encoding must be used
nevertheless (see below).

In ruby 1.9, every String is a bunch of bytes plus an encoding tag. When
you write this out to a file, and the external encoding is nil, then
just the bytes are written, and the encoding is ignored.

Which basically means that the string’s own encoding is used. If you
have a number of bytes and want to interpret them as characters you
must use an encoding, even if it is 8 bit ASCII and there is no
conversion going on. There is no such thing as a text file without
encoding whether applied explicitly or not. On one side there are
bytes and on the other side there are character codes (or Unicode code
points).

io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "aä\n"
irb(main):003:0> s.valid_encoding?
=> true

Now, that’s more complex, and does show that the data is valid UTF-8.
(I wasn’t arguing that it wasn’t; I was arguing that your logic was
flawed, because even if the data were not valid UTF-8, your program
would have run without raising an error. Therefore the fact that it runs
without error is insufficient to show that the data is valid UTF-8)

So what we learn here is that since my original string had encoding
UTF-8 the encoding of the file happened to be UTF-8 as well. That
basically means that by accident we can get a file with mixed encoding
content. Shudder.

Here’s the test:

s = "aä"
=> "aä"
s.encoding
=> #<Encoding:UTF-8>
s = s.encode 'ISO-8859-1'
=> "a\xE4"
s.encoding
=> #<Encoding:ISO-8859-1>
Encoding.default_external
=> #<Encoding:UTF-8>
$stdout.external_encoding
=> nil
File.open("x","w"){|io| p io.external_encoding; io.puts(s)}
nil
=> nil
t = File.open("x","r:UTF-8"){|io| p io.external_encoding; io.read}
#<Encoding:UTF-8>
=> "a\xE4\n"
t.encoding
=> #<Encoding:UTF-8>
t.valid_encoding?
=> false
t.length
=> 3

Now let’s fix it

t.force_encoding 'ISO-8859-1'
=> "a\xE4\n"
t.encoding
=> #<Encoding:ISO-8859-1>
t.valid_encoding?
=> true

Output:

$stdout.external_encoding
=> nil
$stdout.puts t
a▒
=> nil
$stdout.set_encoding($stdin.external_encoding)
=> #<IO:<STDOUT>>
$stdout.external_encoding
=> #<Encoding:UTF-8>
$stdout.puts t
aä
=> nil

For me this boils down to these rules:

  1. Strings are sequences of bytes

  2. Strings have an associated encoding which does not need to match
    the actual encoding of the binary content

  3. In the absence of a target (external or internal, depending on
    direction) encoding, IO operations use a String’s binary data as is;
    otherwise they try to convert between encodings and raise an error if
    that is not possible.
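The second half of rule 3 (raise if conversion is impossible) can be
triggered directly. A sketch, reusing "x" as a scratch file:

```ruby
euro = "€"   # U+20AC (UTF-8); ISO-8859-1 has no representation for it

begin
  File.open("x", "w:ISO-8859-1") { |io| io.puts(euro) }  # transcoding attempted ...
rescue Encoding::UndefinedConversionError => e
  p e.class  # ... and fails: => Encoding::UndefinedConversionError
end
```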

Cheers

robert

On 21.03.2011 16:00, Robert K. wrote:

that is not possible.

Is it just me or does this, especially point 2, sound highly confusing
if not dangerous?

  • Markus

On Wed, Mar 23, 2011 at 12:24 AM, Markus F. [email protected]
wrote:

otherwise they try to convert between encodings and raise an error if
that is not possible.

Is it just me or does this, especially point 2, sound highly confusing
if not dangerous?

The rule as such is pretty clear IMHO. It does not meet “naive”
expectations and as such probably violates POLS (although Matz’s
expectations are almost certainly different from ours - especially
since his native language has a much richer set of symbols than
western languages).

What I find slightly puzzling is this:

irb(main):001:0> s1 = "a"
=> "a"
irb(main):002:0> s1.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> s2 = s1.encode 'ISO-8859-1'
=> "a"
irb(main):004:0> s2.encoding
=> #<Encoding:ISO-8859-1>
irb(main):005:0> s1 == s2
=> true
irb(main):006:0> s1.eql? s2
=> true
irb(main):007:0> [s1.hash, s2.hash]
=> [1003075638, 1003075638]
irb(main):008:0> [s1.hash, s2.hash].uniq
=> [1003075638]
irb(main):009:0> s1.encoding == s2.encoding
=> false

Apparently only the byte representation is used for equivalence checks
and the encoding is ignored. I guess this is a pragmatic optimization
for speed since

  1. string comparisons are very frequent

  2. often strings with different encodings also have different
    binary representations (the fact that UTF-8 and ISO-8859-1 share
    the 7-bit ASCII subset might be viewed as a special case).

irb(main):010:0> s1 = "ä"
=> "ä"
irb(main):011:0> s2 = s1.encode 'ISO-8859-1'
=> "\xE4"
irb(main):012:0> s1 == s2
=> false
irb(main):013:0> s1.eql? s2
=> false
irb(main):014:0> [s1.hash, s2.hash].uniq
=> [-276501091, 359342273]

If you included the encoding in the equivalence check, “s1 == s2” would
yield false in the first case (IRB line 005), although both strings
actually represent the same character sequence. The proper solution,
of course, would be to compare the two strings at the character level,
but since this would make decoding the byte sequences necessary,
performance would be worse and we collide with item 1 above.
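If one does want a character-level comparison, transcoding one operand
first seems the straightforward workaround. A sketch of my own
(chars_equal? is a hypothetical helper name, not a Ruby method):

```ruby
# Compare at the character level by converting both sides to one encoding.
def chars_equal?(a, b)
  a.encode("UTF-8") == b.encode("UTF-8")
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false  # not convertible -> treat as different
end

utf   = "aä"                      # UTF-8
latin = utf.encode("ISO-8859-1")  # same characters, different bytes

p utf == latin             # => false (different bytes, neither side 7-bit)
p chars_equal?(utf, latin) # => true
```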

I think you can write proper locale aware programs in Ruby (mostly be
specifying internal and external encodings). But, as in all
languages, you must be aware of the fact that you need to explicitly
deal with encodings. The fact remains that i18n is a complex topic
because human cultures and languages are so vastly different. And the
complexity does not go away because it is inherent in the matter - no
matter what technical solutions you invent. Given that, the possible
discrepancy between the byte data and the encoding (which manifests
itself in the existence of String#valid_encoding?) does look a lot
smaller already. :-)

For even more information and detail I recommend James’s excellent
article at
http://blog.grayproductions.net/articles/miscellaneous_m17n_details

And there’s more to be found here
http://blog.grayproductions.net/categories/character_encodings

Oh, and while we’re at it, maybe we should add a method like this to
String:

class String
  def ensure_encoding
    raise Encoding::InvalidByteSequenceError, "Wrong encoding for %p" % self unless valid_encoding?
    self
  end
end

Then we can do something like

puts s.ensure_encoding.length

or other String operations and be sure that the encoding is proper.
Does anybody have a better (shorter) name for such a method?
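A quick self-contained check of both paths (repeating the definition so
the snippet runs on its own):

```ruby
class String
  def ensure_encoding
    raise Encoding::InvalidByteSequenceError, "Wrong encoding for %p" % self unless valid_encoding?
    self
  end
end

p "aä".ensure_encoding.length   # => 2 -- valid string passes through

begin
  "\xFF".force_encoding("UTF-8").ensure_encoding
rescue Encoding::InvalidByteSequenceError => e
  p e.class                     # => Encoding::InvalidByteSequenceError
end
```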

Kind regards

robert

On Wed, Mar 23, 2011 at 12:59 PM, Albert S. [email protected]
wrote:

=> #<Encoding:ISO-8859-1>

same bytes
=> [[215, 144], [215, 144]]
irb(main):048:0> [utf.valid_encoding?, latin.valid_encoding?] # And are
ok
=> [true, true]
irb(main):046:0> utf == latin # But they aren’t equal
=> false

Thanks for the interesting example! I noticed:

irb(main):008:0> utf.length
=> 1
irb(main):009:0> latin.length
=> 2

In your case it’s good the strings are considered equal: we want to know
if the letters are all the same. “a” is “a”… no matter what encoding.

Turns out the encoding is considered in comparison (read bottom up):

int
rb_str_comparable(VALUE str1, VALUE str2)
{
    int idx1, idx2;
    int rc1, rc2;

    if (RSTRING_LEN(str1) == 0) return TRUE;
    if (RSTRING_LEN(str2) == 0) return TRUE;
    idx1 = ENCODING_GET(str1);
    idx2 = ENCODING_GET(str2);
    if (idx1 == idx2) return TRUE;
    rc1 = rb_enc_str_coderange(str1);
    rc2 = rb_enc_str_coderange(str2);
    if (rc1 == ENC_CODERANGE_7BIT) {
        if (rc2 == ENC_CODERANGE_7BIT) return TRUE;
        if (rb_enc_asciicompat(rb_enc_from_index(idx2)))
            return TRUE;
    }
    if (rc2 == ENC_CODERANGE_7BIT) {
        if (rb_enc_asciicompat(rb_enc_from_index(idx1)))
            return TRUE;
    }
    return FALSE;
}

/* expect tail call optimization */
static VALUE
str_eql(const VALUE str1, const VALUE str2)
{
    const long len = RSTRING_LEN(str1);

    if (len != RSTRING_LEN(str2)) return Qfalse;
    if (!rb_str_comparable(str1, str2)) return Qfalse;
    if (memcmp(RSTRING_PTR(str1), RSTRING_PTR(str2), len) == 0)
        return Qtrue;
    return Qfalse;
}

VALUE
rb_str_equal(VALUE str1, VALUE str2)
{
    if (str1 == str2) return Qtrue;
    if (TYPE(str2) != T_STRING) {
        if (!rb_respond_to(str2, rb_intern("to_str"))) {
            return Qfalse;
        }
        return rb_equal(str2, str1);
    }
    return str_eql(str1, str2);
}
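Translated back into Ruby, the comparability rules above predict (my
reading of the C code; worth double-checking in irb):

```ruby
# Same bytes, both ASCII-compatible encodings, left side is pure 7-bit:
p "ab" == "ab".force_encoding("ISO-8859-1")  # => true

# Same bytes, but UTF-16BE is not ASCII-compatible -> not comparable:
p "ab" == "ab".force_encoding("UTF-16BE")    # => false

# Neither side is 7-bit and the encodings differ (byte lengths differ too):
p "aä" == "aä".encode("ISO-8859-1")          # => false
```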

Now, everything is clear. ;-)

Cheers

robert

Robert K. wrote in post #988839:

What I find slightly puzzling is this:

irb(main):001:0> s1 = "a"
=> "a"
irb(main):002:0> s1.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> s2 = s1.encode 'ISO-8859-1'
=> "a"
irb(main):004:0> s2.encoding
=> #<Encoding:ISO-8859-1>
irb(main):005:0> s1 == s2
=> true
irb(main):006:0> s1.eql? s2
=> true
irb(main):007:0> [s1.hash, s2.hash]
=> [1003075638, 1003075638]
irb(main):008:0> [s1.hash, s2.hash].uniq
=> [1003075638]
irb(main):009:0> s1.encoding == s2.encoding
=> false

Apparently only the byte representation is used for equivalence checks
and the encoding is ignored.

I don’t think this is true:

irb(main):043:0> utf = "\u05D0" # Alef
=> "א"
irb(main):044:0> latin = utf.dup; latin.force_encoding 'ISO-8859-1'
=> "×\x90"
irb(main):045:0> [utf.bytes.to_a, latin.bytes.to_a] # They have the
same bytes
=> [[215, 144], [215, 144]]
irb(main):048:0> [utf.valid_encoding?, latin.valid_encoding?] # And are
ok
=> [true, true]
irb(main):046:0> utf == latin # But they aren't equal
=> false

In your case it’s good the strings are considered equal: we want to know
if the letters are all the same. “a” is “a”… no matter what encoding.