$ cat test.rb
File.open('test.in', 'r').each_line do |line|
  puts line
end
test = "Der gro\xdfe BilderSauger"
puts test
The result is
Der gro\xdfe BilderSauger
Der gro?e BilderSauger
I have tried specifying an encoding in File.open() and calling
line.encode(), without success. The ‘\’ is read as a literal ‘\’, not
as the beginning of a hex escape sequence.
How can I get \xdf to be recognized as ß when reading from a file?
On Feb 10, 2011, at 12:34 PM, Gilles Gilles wrote:
How can I get \xdf to be recognized as ß when reading from a file?
Unicode codepoint 00df needs to be written in a particular encoding.
I’ll choose UTF-8.
\xdf is the single byte DF, and your ‘test’ string will have encoding
ASCII-8BIT.
You’d use \u00df instead if you are using UTF-8 encoding (where that
character is encoded as two bytes).
Otherwise, if you are using ISO-8859-1 (say), where that codepoint is
the single byte DF, then you’ll need to open your file with that
encoding specified, for example with the mode string 'r:ISO-8859-1'.
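As a minimal sketch of what opening a file with an explicit encoding looks like, the snippet below writes the single byte DF to a scratch file and reads it back as ISO-8859-1, transcoding to UTF-8 on the way in. The file name latin1_sample.txt is only an illustration and is not part of the original question.

```ruby
require 'tmpdir'

# Hypothetical scratch file; the thread's own file is test.in.
path = File.join(Dir.tmpdir, 'latin1_sample.txt')

# Write "Der gro", the raw byte 0xDF, then "e": this is ISO-8859-1 text.
File.binwrite(path, "Der gro\xDFe\n".b)

# 'r:ISO-8859-1:UTF-8' means external encoding ISO-8859-1, internal UTF-8,
# so Ruby transcodes each line while reading it.
line = File.open(path, 'r:ISO-8859-1:UTF-8') { |f| f.gets.chomp }

puts line.encoding        # UTF-8
puts line.bytes.inspect   # byte 223 has become the two bytes 195, 159
puts line == "Der gro\u00dfe"
```

This only helps when the file really contains the byte DF; it does nothing for a file that holds the four ASCII characters of a written-out escape.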
Of course, test.in should not contain the four-character sequence “\”
“x” “d” “f”, but rather the single byte DF (or the two bytes C3 9F, if
you are using UTF-8 encoding).
Probably the simplest solution is to open a text editor and type in ß.
Then use hexdump -C on the file to see what it contains, byte by byte.
The literal "Der gro\xdfe BilderSauger" in test.rb is a double-quoted
string, so Ruby translates some of its contents. A common example is \n
meaning “newline”; likewise, \xNN means the byte with hex code NN. So
when you call each_byte, that’s what you get: a single byte.
Change the double-quotes to single-quotes and you’ll actually get the
four separate characters.
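A small sketch of that difference, using a shortened string rather than the full line from the question:

```ruby
double = "gro\xdfe"   # \xdf is translated to the single byte 0xDF
single = 'gro\xdfe'   # backslash, x, d, f stay as four separate characters

puts double.bytesize        # 5
puts single.bytesize        # 8
puts single.bytes.inspect   # [103, 114, 111, 92, 120, 100, 102, 101]
```

The bytes 92, 120, 100, 102 in the single-quoted case are the same four values that show up when the line is read from the file.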
But when read from a file:
…
l.each_byte {|b| puts b}
…
92 <- Here
120 <- we
100 <- are
102 <- as 4 ASCII chars ‘\xdf’
That proves that the file actually contains the four characters
‘\’, ‘x’, ‘d’, ‘f’. If you want further proof, try
hexdump -C test.in
to take Ruby out of the loop completely.
So there’s neither UTF-8 nor ISO-8859-1 in that file, just plain ASCII
characters.
If you want to turn this into something else, you would have to process
it. For example:
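One possible way to do that processing is sketched below. It is an illustration under stated assumptions, not code from the thread: the helper name unescape_hex is hypothetical, and it assumes that every \xNN in the file names a byte of ISO-8859-1 text.

```ruby
# Hypothetical helper (not from the thread): turn each literal \xNN
# sequence into the byte it names, then label the result as ISO-8859-1
# and transcode it to UTF-8. The ISO-8859-1 default is an assumption.
def unescape_hex(text, source_encoding = 'ISO-8859-1')
  # Work on a binary copy so that bytes above 0x7F can be inserted safely.
  raw = text.b.gsub(/\\x(\h{2})/) { $1.hex.chr }
  raw.force_encoding(source_encoding).encode('UTF-8')
end

# Single quotes: 25 plain ASCII characters, just like the line in test.in.
line = 'Der gro\xdfe BilderSauger'

converted = unescape_hex(line)
puts converted          # Der große BilderSauger
puts converted.length   # 22
```

The same call could be applied to each line inside the each_line block. If the escapes were meant as bytes of some other encoding, the second argument would have to change accordingly.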
$ cat test.rb
a = "Der gro\xdfe BilderSauger"
a.each_byte { |b| puts b }
$ ruby test.rb
68
101
114
32
103
114
111
223 <- Here I am
101
32
66
105
108
100
101
114
83
97
117
103
101
114
You can also see that the length is 22, not 25.
Also, if I run
puts a.encode('UTF-8', 'ISO-8859-1')
I see the proper character (ß) in my terminal.
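For reference, a self-contained sketch of that two-argument encode call. The .b call is an addition here, used so the literal is treated as raw bytes regardless of the source file's encoding:

```ruby
a = "Der gro\xdfe BilderSauger".b   # raw bytes, with 0xDF at index 7

# encode(destination, source): read the bytes as ISO-8859-1, emit UTF-8.
utf8 = a.encode('UTF-8', 'ISO-8859-1')

puts utf8.length     # 22 characters
puts utf8.bytesize   # 23 bytes, because ß takes two bytes in UTF-8
puts utf8
```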
But when read from a file:
$ cat test.rb
File.open('test.in', 'r:ISO-8859-1').each_line do |l|
  puts l
  puts '***'
  puts l.length
  puts '***'
  l.each_byte { |b| puts b }
end
$ ruby test.rb
Der gro\xdfe BilderSauger
***
25
***
68
101
114
32
103
114
111
92 <- Here
120 <- we
100 <- are
102 <- as 4 ASCII chars ‘\xdf’
101
32
66
105
108
100
101
114
83
97
117
103
101
114
I also tried putting UTF-8 codepoints in the file and reading it as
UTF-8, without luck. It seems escape sequences are not interpreted when
reading from a stream, which I can understand.
What I can’t figure out is how to interpret these escape sequences when
reading them from a file.