Hi,
I have the following test case:
$ cat test.in
Der gro\xdfe BilderSauger
$ cat test.rb
File.open('test.in', 'r').each_line do |line|
  puts line
  test = "Der gro\xdfe BilderSauger"
  puts test
end
The result is
Der gro\xdfe BilderSauger
Der gro?e BilderSauger
I have tried to put an encoding in the File.open() or line.encode()
without success. The '\' is recognized as a real '\', not as the
beginning of a hex escape sequence.
How can I get \xdf to be recognized as ß when reading from a file?
Thanks
--Gilles
on 2011-02-10 21:34
on 2011-02-11 01:54
On Feb 10, 2011, at 12:34 PM, Gilles Gilles wrote:
>
> The result is
> Der gro\xdfe BilderSauger
> Der gro?e BilderSauger
>
> I have tried to put an encoding in the File.open() or line.encode()
> without success. the '\' is recognized as a real '\', not as the
> beginning of an hex escape sequence.
>
> How can I get \xdf to be recognized as ß when reading from a file?

Unicode codepoint 00df needs to be written in a particular encoding.
I'll choose UTF-8.

$ echo 'Der große BilderSauger' > test.in
$ hexdump -C test.in
00000000  44 65 72 20 67 72 6f c3  9f 65 20 42 69 6c 64 65  |Der gro..e Bilde|
00000010  72 53 61 75 67 65 72 0a                           |rSauger.|
00000018

In test.rb you also need to set UTF-8 for this to work:

$ cat test.rb
# coding: UTF-8
File.open('test.in', 'r').each_line do |line|
  puts line
  test = "Der große BilderSauger"
  puts test
end
$ ruby19 test.rb
Der große BilderSauger
Der große BilderSauger
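To check the UTF-8 route without touching a real test.in, something like the following sketch may help. It assumes a Ruby recent enough to have Tempfile and the "\u00df" escape (1.9 or later); the helper name utf8_round_trip and the scratch file are made up for illustration and are not part of the original test case.

```ruby
require 'tempfile'

# Writes text to a scratch file as UTF-8 bytes and reads the first line back.
# The helper name and the Tempfile stand-in for test.in are assumptions.
def utf8_round_trip(text)
  tmp = Tempfile.new('test_in')
  tmp.binmode                       # write the bytes exactly as given
  tmp.write(text.encode('UTF-8'))
  tmp.close
  File.open(tmp.path, 'r:UTF-8') { |io| io.gets.chomp }
ensure
  tmp.unlink if tmp
end

line = utf8_round_trip("Der gro\u00dfe BilderSauger\n")
puts line
# Hex view of the bytes, comparable to the hexdump output above.
puts line.bytes.map { |b| format('%02x', b) }.join(' ')
```

The hex line should contain the pair c3 9f where the ß sits, matching the hexdump.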
on 2011-02-12 11:50
Gilles Devaux wrote in post #980948:
> test = "Der gro\xdfe BilderSauger"
\xdf is the single byte DF, and your 'test' string will have encoding
ASCII-8BIT.
You'd use \u00df instead if you are using UTF-8 encoding (where that
character is encoded as two bytes).
Otherwise, if you are using ISO-8859-1 (say), where that codepoint *is*
the single byte DF, then you'll need to open your file with that
encoding specified.
Of course, test.in should not contain the character sequence "\" "x" "d"
"f", but rather the single byte (or two bytes, if you are using UTF-8
encoding).
Probably the simplest solution is to open a text editor and type in ß.
Then use hexdump -C on the file to see what it contains, byte by byte.
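For the ISO-8859-1 case, here is a rough sketch of opening a file with the encoding specified. It assumes the file really holds the single byte DF, and it uses the 'r:ISO-8859-1:UTF-8' mode string so that Ruby transcodes to UTF-8 while reading. The helper name read_latin1_line and the scratch file are hypothetical.

```ruby
require 'tempfile'

# Reads the first line of a file whose bytes are ISO-8859-1 and returns it
# transcoded to UTF-8. The helper name is an assumption for this sketch.
def read_latin1_line(path)
  File.open(path, 'r:ISO-8859-1:UTF-8') { |io| io.gets.chomp }
end

# Scratch file containing the raw byte DF (not the four characters \xdf).
tmp = Tempfile.new('latin1')
tmp.binmode
tmp.write("Der gro\xDFe BilderSauger\n".b)
tmp.close

line = read_latin1_line(tmp.path)
puts line            # shows the ß on a UTF-8 terminal
puts line.encoding
puts line.bytesize   # one byte more than the file's line, since ß is two bytes in UTF-8
tmp.unlink
```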
on 2011-02-15 21:05
Sorry I haven't responded earlier but it seems I'm not notified by email.

The thing is I do not control the input, it is the browscap file found
here and is in ISO-8859-1
http://browsers.garykeith.com/downloads.asp

\xdf (decimal 223) is a valid ISO-8859-1 code point
(http://en.wikipedia.org/wiki/ISO/IEC_8859-1)
it appears as '?' because my terminal is UTF-8 but the bytes are there:

$ cat test.rb
a = "Der gro\xdfe BilderSauger"
a.each_byte { |b| puts b }
$ ruby test.rb
68
101
114
32
103
114
111
223 <- Here I am
101
32
66
105
108
100
101
114
83
97
117
103
101
114

You can also see that the length is 22, not 25.
Also if I puts a.encode('UTF-8', 'ISO-8859-1') I see the proper
character in my terminal

But when read from a file:

$ cat test.rb
File.open('test.in', 'r:ISO-8859-1').each_line do |l|
  puts l
  puts '***'
  puts l.length
  puts '***'
  l.each_byte {|b| puts b}
end
$ ruby test.rb
Der gro\xdfe BilderSauger
***
25
***
68
101
114
32
103
114
111
92 <- Here
120 <- we
100 <- are
102 <- as 4 ASCII chars '\xdf'
101
32
66
105
108
100
101
114
83
97
117
103
101
114

I also tried to put UTF-8 codepoints and read as UTF-8 without luck.

It seems there is no escape sequence when reading from a stream, which I
can understand. What I can't figure out is how to interpret these escape
sequences when reading them from a file.

--Gilles
on 2011-02-15 21:31
You're doing two different things.

> $ cat test.rb
> a = "Der gro\xdfe BilderSauger"

That's a double-quoted string, and so Ruby is doing some translation of
the contents. A common example is \n meaning "newline"; in this case,
\xNN means the byte with hex code NN. So when you do each_byte, that's
what you get, a single byte.

Change the double-quotes to single-quotes and you'll actually get the
four separate characters.

> But when read from a file:
...
> l.each_byte {|b| puts b}
...
> 92 <- Here
> 120 <- we
> 100 <- are
> 102 <- as 4 ASCII chars '\xdf'

That proves that the file actually contains the four characters '\',
'x', 'd', 'f'. If you want further proof, try hexdump -C test.in to take
Ruby out of the loop completely.

So there's neither UTF-8 nor ISO-8859-1 in that file, just plain ASCII
characters. If you want to turn this into something else, you would have
to process it. For example:

l.gsub!(/\\x([0-9a-f]{2})/i) { $1.hex.chr }
# or in ruby 1.9, if you want to tag the encoding:
l.gsub!(/\\x([0-9a-f]{2})/i) { $1.hex.chr("ISO-8859-1") }
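Wrapped up as a helper, that substitution could look like the sketch below. It assumes Ruby 1.9 or later and that the bytes of the line are ISO-8859-1 (or plain ASCII); the name unescape_hex and the final transcoding to UTF-8 are additions for illustration, not something the input file requires.

```ruby
# Replaces literal "\xNN" text (backslash, x, two hex digits) with the
# ISO-8859-1 character it names, then transcodes the result to UTF-8.
# Assumes the bytes of `line` are ISO-8859-1; the helper name is made up.
def unescape_hex(line)
  latin1 = line.dup.force_encoding('ISO-8859-1')
  latin1.gsub(/\\x([0-9a-f]{2})/i) { $1.hex.chr('ISO-8859-1') }.encode('UTF-8')
end

# Single quotes keep the backslash literal, like a line read from test.in.
escaped = 'Der gro\xdfe BilderSauger'
puts escaped.length                 # 25, the escape is four characters
decoded = unescape_hex(escaped)
puts decoded
puts decoded.length                 # 22, the escape became one character
```

Working on a copy (dup plus gsub rather than gsub!) leaves the original line untouched, which is handy if you still want the raw text for debugging.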
on 2011-02-17 01:07
Stupid of me. The file is indeed ISO-8859-1 (some other characters are
encoded this way) just not this one, it's escaped.
This:
l.gsub!(/\\x([0-9a-f]{2})/i) { $1.hex.chr }
# or in ruby 1.9, if you want to tag the encoding:
l.gsub!(/\\x([0-9a-f]{2})/i) { $1.hex.chr("ISO-8859-1") }
is exactly what I want.
Thanks.