ArgumentError - invalid byte sequence in UTF-8

luislavena · July 21, 2011, 5:47am

Hi,

Every now and then I get errors relating to UTF-8 encodings, and each
time I fail to find (or guess) the right combination of words to get Ruby
1.9.2 to play nice with some string it doesn’t like.

Right now I want to open a log file and read it, but some script kiddie
has decided to connect using some crazy non ASCII characters, and this
line in my script

File.readlines(logfile, :encoding => "UTF-8" )

Now spits out the error:

ArgumentError - invalid byte sequence in UTF-8

when encountering lines like this:

83.44.178.124 - - [19/Jul/2011:19:15:00 +0100]
?.???S\x08\x02?N~],>~Q?~@6\x15`ҷ?~Vg?'dR\x1C??\x08?F\x06w?~H?~F?\x08P~V?\x0Bf\x22?\x17~M^??{??j\x1E??p?~AU~\
"400" 166 "-" "-" "-"

I’d really like to know how to fix this without dropping 1.9. Does
anyone know the magic words that will get this logfile read? These are
my best efforts

File.readlines(logfile, :encoding => "UTF-8" ).map{|e| e.force_encoding('UTF-8')}

File.readlines(logfile, :encoding => "UTF-8" ).map{|e| e.encode('UTF-8', undef: :replace, replace: "??")}

File.readlines(logfile, :encoding => "UTF-8" ).map{|e| e.encode('iso-8859-1', undef: :replace, replace: "??")}

They all fail on this file, although they do read a logfile that
contains only valid UTF-8. Any help is much appreciated.

Regards,
Iain

tensaiji · July 21, 2011, 6:54am

No matter what you do, there can always be an invalid byte sequence.

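#encoding: utf-8

puts RUBY_VERSION

str = "m€, ¥ou"

File.open('text.txt', 'w') do |f|
  f.puts str
end

# The second positional argument of IO.foreach is the line separator,
# so the encoding is passed in the options hash instead.
IO.foreach('text.txt', :encoding => 'UTF-8') do |line|
  p line.encoding.name
  p line
end

--output:--
1.9.2
"UTF-8"
"m€, ¥ou\n"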

#encoding: utf-8

puts RUBY_VERSION

str = "m€, ¥ou"

File.open('text.txt', 'w') do |f|
  f.puts str
end

lines = IO.readlines('text.txt', :encoding => 'UTF-8')

lines.each do |line|
  p [line.encoding.name, line]
end

--output:--
1.9.2
["UTF-8", "m€, ¥ou\n"]

puts RUBY_VERSION

str = "me, you"

File.open('text.txt', 'w') do |f|
  f.puts str
end

lines = IO.readlines('text.txt', :encoding => 'UTF-8')

lines.each do |line|
  p [line.encoding.name, line]
end

--output:--
1.9.2
["UTF-8", "me, you\n"]

#encoding: utf-8

puts RUBY_VERSION

str = "m€, ¥ou"

File.open('text.txt', 'w') do |f|
  f.puts str
end

lines = IO.readlines('text.txt', :encoding => 'ISO-8859-1')

lines.each do |line|
  p [line.encoding.name, line]
end

--output:--
1.9.2
["ISO-8859-1", "m\xE2\x82\xAC, \xC2\xA5ou\n"]

tensaiji · July 24, 2011, 12:37pm

On 21 Jul 2011, at 13:26, Brian C. wrote:

> of ruby 1.9, but in many cases you can read a string which has invalid
> [...]
> ArgumentError: invalid byte sequence in UTF-8
> [...]
> I’d suggest that BINARY mode is the way to go for you. If your objective
> [...]

Thanks. I am running some regex on the lines later, so that is where the
script is actually choking. I’ll just have to put up with this I
suppose.

Again, many thanks.

Regards,
Iain

tensaiji · July 21, 2011, 2:26pm

Iain B. wrote in post #1012004:

File.readlines(logfile, :encoding => "UTF-8" )
Now spits out the error:

ArgumentError - invalid byte sequence in UTF-8

Are you sure it’s that particular line which spits out the error?

There are no hard-and-fast rules, because of the whole incoherent design
of ruby 1.9, but in many cases you can read a string which has invalid
encodings, but you get an error later on when you try to do things like
regexp matches on it.

irb(main):002:0> File.open("zzz1","wb") { |f| f.write("\xdd\xdd") }
=> 2
irb(main):003:0> File.readlines("zzz1")
=> ["\xDD\xDD"]
irb(main):004:0> File.readlines("zzz1", :encoding=>"UTF-8")
=> ["\xDD\xDD"]
irb(main):005:0> File.readlines("zzz1", :encoding=>"UTF-8")[0] =~ /./
ArgumentError: invalid byte sequence in UTF-8
        from (irb):5
        from /usr/local/bin/irb192:12:in `<main>'
irb(main):006:0>

You can of course set :encoding=>"BINARY" (or "ASCII-8BIT") when you
read the file. Or you could open the file in binary mode ("rb"), which I
don’t think File.readlines supports directly, but File.open does. The
two are not exactly the same; binary mode also prevents CRLF newline
translation on non-Unix platforms.

I’d suggest that BINARY mode is the way to go for you. If your objective
is to read in some log lines, chomp them, and write them out again,
whilst allowing arbitrary byte sequences, this will Just Work [TM], just
like it would in ruby 1.8.
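
As a rough sketch of that read, chomp, and write-out round trip in binary mode (the helper name copy_lines_binary and the file paths are made up for illustration, not something from the original script):

```ruby
# Hypothetical helper: copy a log file line by line, chomping each line,
# without ever interpreting the bytes as text in some encoding.
def copy_lines_binary(src, dst)
  File.open(src, "rb") do |input|        # "rb": lines come back as ASCII-8BIT
    File.open(dst, "wb") do |output|     # "wb": bytes are written untouched
      input.each_line do |line|
        # chomp strips a trailing "\n" (or "\r\n"); we then add "\n" back,
        # so CRLF line endings would be normalised to LF by this sketch.
        output.write(line.chomp + "\n")
      end
    end
  end
end
```

Something like copy_lines_binary("access.log", "access.copy.log") should then pass arbitrary byte sequences through unchanged, since no transcoding or validity check is involved at any point.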

However, regexp matches will be against individual bytes of the string,
rather than entire UTF-8 characters.
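
To make the byte-versus-character difference concrete, here is a small sketch (dot_matches is a hypothetical helper, used only for this illustration). It counts how many times a bare dot matches the same three bytes under the two encodings:

```ruby
# Hypothetical helper: count how many units a bare dot matches in a string.
# With /m the dot also matches newlines, so every character (or byte) counts.
def dot_matches(str)
  str.scan(/./m).length
end

euro_utf8  = "\u20AC"                                # one character, bytes E2 82 AC
euro_bytes = euro_utf8.dup.force_encoding("BINARY")  # same bytes, tagged as binary

puts dot_matches(euro_utf8)    # 1 (one UTF-8 character)
puts dot_matches(euro_bytes)   # 3 (three individual bytes)
```

Patterns made only of ASCII, such as /\d+/ or a literal IP address, behave the same either way; the difference only shows up for . and character classes that would otherwise span a multi-byte character.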

It’s strange how in ruby 1.9, str[x] works just fine with invalid
encodings, but str=~/./ does not. But that’s only one of many strange
things about ruby 1.9.
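
One possible way to live with this, sketched under the assumption that the patterns involved are plain ASCII (safe_match is a made-up name, not a Ruby built-in): check String#valid_encoding? first, and fall back to a byte-wise match only for the lines that are broken.

```ruby
# Hypothetical helper: run a pattern against a line, falling back to a
# byte-wise match when the line is not valid in its tagged encoding.
# Assumes the pattern is plain ASCII with no fixed encoding (e.g. /\d+/).
def safe_match(line, pattern)
  # Work on a copy so the caller's string keeps its original encoding tag.
  line = line.dup.force_encoding("BINARY") unless line.valid_encoding?
  pattern.match(line)
end

# Two invalid bytes followed by ordinary ASCII, tagged as UTF-8.
bad = ([0xDD, 0xDD].pack("C*") + " 400 166").force_encoding("UTF-8")

m = safe_match(bad, /\d+/)
puts m[0] if m
```

Valid lines keep full UTF-8 character semantics this way, and only the junk lines are matched as raw bytes.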