On Wed, Mar 23, 2011 at 12:24 AM, Markus F. wrote:
otherwise they try to convert between encodings and raise an error if
that is not possible.
Is it just me or does this, especially point 2, sound highly confusing
if not dangerous?
The rule as such is pretty clear IMHO. It does not meet “naive”
expectations and as such probably violates POLS (although Matz’s
expectations are almost certainly different than ours - especially
since his native language has a much richer set of symbols than
western languages).
What I find slightly puzzling is this:
irb(main):001:0> s1 = "a"
=> "a"
irb(main):002:0> s1.encoding
=> #<Encoding:UTF-8>
irb(main):003:0> s2 = s1.encode 'ISO-8859-1'
=> "a"
irb(main):004:0> s2.encoding
=> #<Encoding:ISO-8859-1>
irb(main):005:0> s1 == s2
=> true
irb(main):006:0> s1.eql? s2
=> true
irb(main):007:0> [s1.hash, s2.hash]
=> [1003075638, 1003075638]
irb(main):008:0> [s1.hash, s2.hash].uniq
=> [1003075638]
irb(main):009:0> s1.encoding == s2.encoding
=> false
Apparently only the byte representation is used for equivalence checks
and the encoding is ignored. I guess this is a pragmatic optimization
for speed since
- string comparisons are very frequent
- strings with different encodings often also have different
  binary representations (the fact that UTF-8 and ISO-8859-1 share the
  common subset of 7-bit ASCII might be viewed as a special case).
irb(main):010:0> s1 = "ä"
=> "ä"
irb(main):011:0> s2 = s1.encode 'ISO-8859-1'
=> "\xE4"
irb(main):012:0> s1 == s2
=> false
irb(main):013:0> s1.eql? s2
=> false
irb(main):014:0> [s1.hash, s2.hash].uniq
=> [-276501091, 359342273]
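One case the two sessions above do not cover is identical bytes carrying different encoding tags. The following is only a small sketch (the variable names are mine, and picking ISO-8859-15 as the second encoding is an assumption for illustration), but it suggests that the encoding is not ignored entirely once non-ASCII bytes are involved:

```ruby
# The same single byte 0xE4, tagged with two different single-byte encodings.
# String#b returns a fresh binary copy, so force_encoding never touches a literal.
latin1 = "\xE4".b.force_encoding("ISO-8859-1")
latin9 = "\xE4".b.force_encoding("ISO-8859-15")

same_bytes = latin1.bytes == latin9.bytes   # identical byte sequences
equal      = latin1 == latin9               # yet the strings do not compare equal

# Pure ASCII content still compares equal across ASCII-compatible encodings.
ascii_equal = "a".encode("ISO-8859-1") == "a".encode("ISO-8859-15")

puts "same bytes: #{same_bytes}, equal: #{equal}, ascii equal: #{ascii_equal}"
```

So the ASCII-only case from IRB line 005 looks like the special one: there the encodings are treated as compatible and only the bytes matter.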
If the encoding were included in the equivalence check, "s1 == s2"
would yield false in the first case (IRB line 005) although both
strings actually represent the same character sequence. The proper
solution, of course, would be to compare two strings on the character
level, but since that would require decoding the byte sequences,
performance would be worse and we would collide with the first point
above (string comparisons are very frequent).
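If you do need a comparison on the character level, you can do the decoding yourself. Here is a minimal sketch, assuming that transcoding both sides to UTF-8 is acceptable; same_characters? is a hypothetical helper name, not something Ruby provides:

```ruby
# Hypothetical helper: compare two strings by characters rather than bytes
# by transcoding both to a common encoding (UTF-8 here) first.
def same_characters?(a, b)
  a.encode(Encoding::UTF_8) == b.encode(Encoding::UTF_8)
rescue EncodingError
  # Invalid bytes or characters that cannot be converted: treat as not equal.
  false
end

utf8   = "\u00E4"                    # "ä" as UTF-8 (two bytes)
latin1 = utf8.encode("ISO-8859-1")   # "ä" as ISO-8859-1 (one byte)

puts utf8 == latin1                  # false
puts same_characters?(utf8, latin1)  # true
```

The price is exactly the transcoding work that a plain == avoids.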
I think you can write proper locale-aware programs in Ruby (mostly by
specifying internal and external encodings). But, as in all
languages, you must be aware of the fact that you need to explicitly
deal with encodings. The fact remains that i18n is a complex topic
because human cultures and languages are so vastly different. And the
complexity does not go away because it is inherent in the matter - no
matter what technical solutions you invent. Given that, the possible
discrepancy between the byte data and the encoding (which manifests
itself in the existence of String#valid_encoding?) does look a lot
smaller already.
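To make the point about external and internal encodings a bit more concrete, here is a small sketch. The temporary file and the variable names are assumptions for illustration only. It reads the same byte once with the correct external encoding and once with a wrong label:

```ruby
require "tempfile"

# Write the single byte 0xE4 ("ä" in ISO-8859-1) to a temporary file.
file = Tempfile.new("latin1")
file.binmode
file.write("\xE4".b)
file.close

# External encoding ISO-8859-1, internal encoding UTF-8:
# Ruby transcodes the data while reading.
text = File.open(file.path, "r:ISO-8859-1:UTF-8") { |io| io.read }
puts text.encoding          # the internal encoding
puts text.valid_encoding?

# Claiming the same byte is UTF-8 gives a string whose bytes and
# encoding tag disagree; String#valid_encoding? reports that.
mislabeled = File.open(file.path, "r:UTF-8") { |io| io.read }
puts mislabeled.valid_encoding?

file.unlink
```

The second read raises no error. The mismatch only shows up if you ask for it or when a later operation trips over the invalid byte.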
For even more information and detail I recommend James’s excellent
article at
http://blog.grayproductions.net/articles/miscellaneous_m17n_details
And there’s more to be found here
http://blog.grayproductions.net/categories/character_encodings
Oh, and while we’re at it, maybe we should add a method like this to
String:
class String
  # Returns self if the bytes are valid for the string's encoding,
  # raises otherwise.
  def ensure_encoding
    unless valid_encoding?
      raise Encoding::InvalidByteSequenceError, "Wrong encoding for %p" % self
    end
    self
  end
end
Then we can do something like
puts s.ensure_encoding.length
or other String operations and be sure that the encoding is proper.
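As a quick illustration of why the check helps, here is a sketch that repeats the proposed method so that it runs on its own; the sample strings are made up. Note that String#length on a mislabeled string does not complain by itself:

```ruby
# The method proposed above, repeated so the sketch is self-contained.
class String
  def ensure_encoding
    unless valid_encoding?
      raise Encoding::InvalidByteSequenceError, "Wrong encoding for %p" % self
    end
    self
  end
end

good = "\u00E4"                           # valid UTF-8
bad  = "\xE4".b.force_encoding("UTF-8")   # a lone 0xE4 is not valid UTF-8

puts good.ensure_encoding.length          # passes through unchanged
puts bad.length                           # no error without the check

begin
  bad.ensure_encoding.length
rescue Encoding::InvalidByteSequenceError => e
  puts "rejected: #{e.message}"
end
```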
Does anybody have a better (shorter) name for such a method?
Kind regards
robert