I’m reading a CSV file that has some non-US-ASCII characters. I want
to parse each value in each row and strip any leading/trailing
whitespace.
However, when I come across some unusual characters, I get "invalid
byte sequence in UTF-8".
Here is an example:
irb(main):041:0* a = "\xFF"
=> "\xFF"
irb(main):042:0> a.encoding
=> #<Encoding:UTF-8>
irb(main):043:0> a.strip
ArgumentError: invalid byte sequence in UTF-8
from (irb):43:in `strip'
from (irb):43
from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands/console.rb:44:in `start'
from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands/console.rb:8:in `start'
from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands.rb:23:in `<top (required)>'
from script/rails:6:in `require'
from script/rails:6:in `<main>'
So now I’m going to try to change the encoding, but this doesn’t work
either:
irb(main):044:0> a.encode!("ASCII-8BIT", undef: :replace)
Encoding::InvalidByteSequenceError: "\xFF" on UTF-8
from (irb):44:in `encode!'
from (irb):44
from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands/console.rb:44:in `start'
from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands/console.rb:8:in `start'
from /usr/local/lib/ruby/gems/1.9.1/gems/railties-3.0.3/lib/rails/commands.rb:23:in `<top (required)>'
from script/rails:6:in `require'
from script/rails:6:in `<main>'
Is there any way to strip out these characters while staying with
UTF-8 encoding?
> I’m reading a CSV file that has some non-US-ASCII characters. I want
> to parse each value in each row and strip any leading/trailing
> whitespace.
> However, when I come across some unusual characters, I get "invalid
> byte sequence in UTF-8".
I guess it’s not genuinely UTF-8.
If you think it is broken UTF-8 that contains stray \xFF bytes
for some reason, then you could force the encoding to binary, remove the
\xFF bytes, then force the encoding back to UTF-8.
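A minimal sketch of that binary round trip, assuming the only bad bytes are 0xFF (a byte that never occurs in valid UTF-8, so removing it cannot damage valid characters). The helper name drop_ff_bytes is made up for illustration:

```ruby
# Hypothetical helper: remove every 0xFF byte and relabel the result as UTF-8.
def drop_ff_bytes(str)
  # force_encoding only changes the label; the bytes are not transcoded.
  bytes = str.dup.force_encoding(Encoding::BINARY)
  ff = [0xFF].pack("C")                  # a one-byte ASCII-8BIT string
  bytes.delete(ff).force_encoding(Encoding::UTF_8)
end

a = "\xFF hello \xFF"
clean = drop_ff_bytes(a)

# Other invalid bytes may remain, so check before using UTF-8-aware methods.
stripped = clean.strip if clean.valid_encoding?
```

If valid_encoding? still returns false afterwards, the data contains other invalid bytes and is probably not UTF-8 at all.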
More likely, I’d have thought, it is a single-byte encoding (ISO-8859-1,
perhaps). But in any case, if you’re just doing CSV parsing, you can
quite legitimately treat UTF-8 as binary, since all you need to do is
recognise commas and double quotes; everything else just gets passed
through.
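As a rough sketch of the binary approach, using the standard csv library on made-up sample data (for a real file, File.binread(path) would give you a binary string directly):

```ruby
require "csv"

# Hypothetical sample row: one valid UTF-8 field, one quoted field with a stray 0xFF.
raw = "  caf\xC3\xA9 ,\" \xFF quoted, with comma \"\n"

# Relabel as binary so nothing tries to validate the bytes as UTF-8.
data = raw.dup.force_encoding(Encoding::BINARY)

# Parse, then strip ASCII whitespace from every non-empty field.
rows = CSV.parse(data).map do |row|
  row.map { |field| field.nil? ? nil : field.strip }
end
```

The fields come back as raw bytes. If you later need UTF-8 strings, you can force_encoding each field and test it with valid_encoding? before relying on it.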
More info at
Or just use ruby 1.8.