In Ruby 1.9.3-p429, I am trying to parse plain-text files in various
encodings that will ultimately be converted to UTF-8 strings. Non-ASCII
characters work fine with a file encoded as UTF-8, but problems come up
with non-UTF-8 files.
Simplified example:
File.open(file) do |io|
  io.set_encoding("#{charset.upcase}:#{Encoding::UTF_8}")
  line, char = "", nil
  # Build the line one character at a time, stopping once a newline
  # or carriage return has been read (or at end of file)
  until io.eof? || char == ?\n || char == ?\r
    char = io.readchar
    puts "Character #{char} has #{char.each_codepoint.count} codepoints"
    puts "SLICE FAIL" unless char == char.slice(0, 1)
    line << char
  end
  line
end
Both files contain just the single string áÁð, encoded appropriately. I
have checked that the files are encoded correctly via "$ file -i
<file_name>".
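For reference, the two test files could be produced like this (the file names here are hypothetical):

File.write("utf8.txt", "áÁð\n")                                # written as UTF-8
File.write("latin1.txt", "áÁð\n".encode(Encoding::ISO_8859_1)) # transcoded to ISO-8859-1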
With a UTF-8 file, I get back:
Character á has 1 codepoints
Character Á has 1 codepoints
Character ð has 1 codepoints
With an ISO-8859-1 file:
Character á has 2 codepoints
SLICE FAIL
Character Á has 2 codepoints
SLICE FAIL
Character ð has 2 codepoints
SLICE FAIL
The way I am interpreting this is that readchar is returning characters
in an incorrectly converted encoding, which in turn causes slice to
return incorrect results.
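To narrow this down, a diagnostic line like the following could be added inside the loop right after readchar (my addition, not part of the original code); it prints the encoding tag and the raw bytes of each character readchar returns:

puts "#{char.inspect} encoding=#{char.encoding} bytes=#{char.bytes.to_a.inspect}"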
Is this behavior correct? Or am I specifying the file's external encoding
incorrectly? I would rather not rewrite this process, so I am hoping I am
making a mistake somewhere. There are reasons why I am parsing files
this way, but I don't think they are relevant to my question.
Specifying the internal and external encodings as options to File.open
yielded the same results.
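For reference, the option forms I tried were along these lines (file and charset as above):

File.open(file, external_encoding: charset.upcase, internal_encoding: Encoding::UTF_8) do |io|
  io.readchar
end

File.open(file, "r:#{charset.upcase}:#{Encoding::UTF_8}") do |io|
  io.readchar
end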