Nathan B. wrote in post #1077055:
I’m aware of the implementation details of Ruby 1.9’s String. What
I’ve been trying to figure out for a bit now is all of the
idiosyncrasies of the standard library APIs.
Ah, well that’s an open-ended question. Ruby’s standard library is very
large, and none of the encoding-related behaviour is documented. But
File.open / getc are pretty fundamental to encoding behaviour.
As such, I’m very curious
about these details, so I performed a few more experiments.
Interestingly I’m still not seeing this behavior. Could this have
changed at some point between 1.9.0 and 1.9.3-p194? Am I running into
something OS-specific?
I don’t think so, unless it’s behaviour of irb. I think you are just
misinterpreting the results, and being confused by String#inspect.
Look at
c.bytes.to_a
and
c.unpack("H*")
to see what’s really in the String.
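For instance, a minimal sketch of those two calls. I'm building the string by hand with force_encoding here (my assumption, just to avoid needing your file); the variable names raw_bytes and hex_dump are only for illustration.

```ruby
# One byte, 0x80, labelled as Windows-1252 (no file involved).
c = "\x80".dup.force_encoding("Windows-1252")

raw_bytes = c.bytes.to_a    # the byte values as integers
hex_dump  = c.unpack("H*")  # the same bytes as one hex string, inside an array

p raw_bytes  # [128]
p hex_dump   # ["80"]
```

Neither call goes through String#inspect's escaping rules, so what you see is what is actually stored.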
I ran two tests. One with the inverted exclamation character, which is
code point U+00A1 and the euro sign, which is code point U+20AC. I
used these two characters, as the inverted exclamation has the same
code point value in Unicode, ISO-8859-1 and Windows-1252, but the byte
value is two bytes in UTF-8 and one byte in Windows-1252; the euro
sign is in Unicode and Windows-1252, but at different code points.
For the euro sign, I created a windows-1252 text file with a single
byte of 0x80 (the code point value) and then opened up IRB and ran the
following.
1.9.3-p194 :001 > f = File.open('euro_win1252.txt', 'r:windows-1252')
=> #<File:euro_win1252.txt>
1.9.3-p194 :002 > c = f.getc
=> "\x80"
That’s the single byte you expected. However String#inspect has some
hard-coded behaviour which treats bytes in the range 0x80-0x9f (I think)
as unprintable, and therefore substitutes hex representation. “puts c”
will squirt the string directly at the terminal, and because your
terminal is UTF-8 but the string is invalid UTF-8, it will be
unprintable. Your terminal will probably substitute some special
character.
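To see why the terminal chokes on that byte, you can check it yourself. This is only a sketch: relabelling the string with force_encoding stands in for "a UTF-8 terminal receiving the raw byte", which is my simplification, and as_utf8 is just an illustrative name.

```ruby
c = "\x80".dup.force_encoding("Windows-1252")

# Same single byte, but labelled as UTF-8. force_encoding never changes bytes.
as_utf8 = c.dup.force_encoding("UTF-8")

puts c.valid_encoding?        # true: 0x80 is a legal Windows-1252 byte
puts as_utf8.valid_encoding?  # false: a lone 0x80 is not valid UTF-8
puts as_utf8.bytes.to_a == c.bytes.to_a  # true: the data itself is identical
```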
1.9.3-p194 :003 > c.encoding
=> #<Encoding:Windows-1252>
1.9.3-p194 :004 > ct = c.encode('utf-8')
=> "€"
You’ve transcoded it. Now ct contains three bytes (E2 82 AC), which is
the UTF-8 representation of that character. Then you’ve sent it to the
screen. By default ruby does no transcoding on output (i.e. it does not
take into account the encoding of your terminal). Your terminal is in
fact UTF-8, and so those three bytes get displayed as the one character
you’re sending.
(Because you’re running OSX, your terminal is almost certainly UTF-8;
mine is anyway)
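You can confirm what the transcoding did to the bytes with something like the following. Again a sketch with a hand-built string rather than your file.

```ruby
c  = "\x80".dup.force_encoding("Windows-1252")
ct = c.encode("UTF-8")   # real transcoding: the bytes change

p c.bytes.to_a    # [128]
p ct.bytes.to_a   # [226, 130, 172], i.e. E2 82 AC
p ct.encoding     # #<Encoding:UTF-8>
p ct == "\u20AC"  # true: it is the euro sign, U+20AC
```

Note the difference from force_encoding, which only changes the label and leaves the bytes alone.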
For the inverted exclamation point, I created a windows-1252 text file
with a single byte of 0xA1 (the code point value) and then opened up
IRB and ran the following.
1.9.3-p194 :001 > f = File.open('inverted_win1252.txt', 'r:windows-1252')
=> #<File:inverted_win1252.txt>
1.9.3-p194 :002 > c = f.getc
=> "\xA1"
Again that’s one byte; for some reason String#inspect or irb is showing
it in hex representation and I don’t know why in this case. But if it
didn’t, it would be unprintable on a UTF-8 terminal.
1.9.3-p194 :003 > c.encoding
=> #<Encoding:Windows-1252>
1.9.3-p194 :004 > ct = c.encode('utf-8')
=> "¡"
Now ct contains 2 bytes, the UTF-8 representation of that character, and
your terminal displays it properly.
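If it helps, String#ord makes the code point side of this visible for both of your test characters. A small sketch, with the strings built by hand and the names bang and euro chosen only for illustration:

```ruby
bang = "\xA1".dup.force_encoding("Windows-1252")  # inverted exclamation mark
euro = "\x80".dup.force_encoding("Windows-1252")  # euro sign

bang_utf8 = bang.encode("UTF-8")
euro_utf8 = euro.encode("UTF-8")

# Same code point in both encodings, but a different byte sequence.
p bang.ord             # 161 (0xA1)
p bang_utf8.ord        # 161 (U+00A1)
p bang_utf8.bytes.to_a # [194, 161], i.e. C2 A1

# Different code point in each encoding.
p euro.ord             # 128 (0x80)
p euro_utf8.ord        # 8364 (U+20AC)
```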
Anyway, try running something like this from the command line and see if
it’s any clearer, because it eliminates any possible interaction with
irb.
# Write a single raw byte, 0xA1, in binary mode.
File.open("inverted_win1252.txt", "wb") do |f|
  f.write "\xA1"
end
# Read it back, tagging the data as Windows-1252.
File.open("inverted_win1252.txt", "r:windows-1252") do |f|
  c = f.getc
  puts c.bytes.to_a
  puts c.unpack("H*")
  puts c.encoding
  puts c.inspect
  puts c
  # Transcode to UTF-8 and print the same details again.
  ct = c.encode("utf-8")
  puts ct.bytes.to_a
  puts ct.unpack("H*")
  puts ct.encoding
  puts ct.inspect
  puts ct
end
Regards,
Brian.