Invalid byte sequence in UTF-8 (ArgumentError) - Ruby - how to handle invalid bytes at runtime

musicdenotation · January 31, 2014, 9:14pm

Basically, I have data in a file as below:

Cote  0.56  0.6  0.71  0.93  0.08  0.21  0.98  0.96  CÙte d'Ivoire

CÙte d'Ivoire 20.15 0.002 0.002 0.003 0

The problem is caused by the Ù.

I was getting an exception every time.

So I came up with the code below:

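File.open('/home/kirti/workspace/Ruby/project_free/data/2013_diamonds.txt') do |file|
  file.readlines.map do |i|
    begin
      # Strip double quotes; raises ArgumentError if the line is not valid UTF-8
      i = i.gsub(/[\u0022]/, '')
    rescue
      p $!
      p i
      # Round-trip through UTF-16, dropping every invalid byte
      p i.encode('UTF-16', :invalid => :replace, :replace => '').encode('utf-8')
    end
  end
end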

# >> #<ArgumentError: invalid byte sequence in UTF-8>
# >> "Cote\t0.56\t0.6\t0.71\t0.93\t0.08\t0.21\t0.98\t0.96\tC\xD9te d'Ivoire\t\tC\xD9te d'Ivoire\t20.15\t0.002\t0.002\t0.003\t0\r\n"
# >> "Cote\t0.56\t0.6\t0.71\t0.93\t0.08\t0.21\t0.98\t0.96\tCte d'Ivoire\t\tCte d'Ivoire\t20.15\t0.002\t0.002\t0.003\t0\r\n"

Now the problem is that the line i.encode('UTF-16', :invalid => :replace, :replace => '').encode('utf-8') handles the error, but it
replaces each invalid byte with "". As you can see, I got Cte d'Ivoire\t\tCte.., where the character Ù is missing. Can something be
placed, with some logic, in ...:replace => ''? Instead of "", I am looking
for a dynamic character derived from the byte that caused the error, so that
after some processing the replacement character would be that Ù or any …
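
One possible sketch of the dynamic replacement asked about here. It assumes Ruby 2.1 or later (where String#scrub is available) and assumes the stray bytes are really ISO-8859-1, which the thread has not confirmed at this point. The helper name repair_line is made up for illustration.

```ruby
# Hypothetical helper: replaces each invalid UTF-8 byte sequence with the
# character those bytes stand for in ISO-8859-1 (an assumption about the data).
def repair_line(line)
  line.scrub do |bad_bytes|
    # encode(dst, src) reinterprets the raw bytes as ISO-8859-1 and
    # returns them transcoded to UTF-8.
    bad_bytes.encode('UTF-8', 'ISO-8859-1')
  end
end

broken = "C\xD9te d'Ivoire\t20.15"  # tagged UTF-8, but \xD9 is not valid UTF-8
p broken.valid_encoding?            # false
p repair_line(broken)               # the \xD9 byte comes back as Ù
```

Unlike :replace => '', the block sees the offending bytes, so the replacement can depend on them. Valid parts of the line are left untouched.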

my-ruby · February 1, 2014, 4:37am

Dear Arup R.,

I have a (humble) guess about your problem…
I think you are trying to open an ISO_8859_1 encoded file as if it were
UTF-8.
If so, you don’t need to recreate the character/encoding translation
logic the way you are trying to. Ruby will kindly do that for you.
You just have to open the file telling Ruby that it is an ISO_8859_1
file.

Try…
File.open('/home/kirti/workspace/Ruby/project_free/data/2013_diamonds.txt',
external_encoding: Encoding::ISO_8859_1) do |file|

I think no error will be raised.

This worked for me, at least with the two lines that you gave as
example.

Just tell me if it worked with the whole file.

You can read more in the Ruby documentation for Encoding, IO and File (see
internal and external encoding).
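
A small self-contained sketch of the suggestion, using a temporary file instead of the real data file. The sample bytes and the helper name read_first_line are assumptions made for illustration only.

```ruby
require 'tempfile'

# Hypothetical helper: writes raw bytes to a temp file, then reads the first
# line back using the given external encoding.
def read_first_line(raw_bytes, external)
  Tempfile.create('encoding_demo') do |tmp|
    tmp.binmode
    tmp.write(raw_bytes)
    tmp.close
    File.open(tmp.path, external_encoding: external) { |file| file.readline }
  end
end

raw = "C\xD9te d'Ivoire".b   # assumed sample: 0xD9 is Ù in ISO-8859-1

as_utf8   = read_first_line(raw, Encoding::UTF_8)
as_latin1 = read_first_line(raw, Encoding::ISO_8859_1)

p as_utf8.valid_encoding?     # false: gsub on this string would raise
p as_latin1.valid_encoding?   # true
p as_latin1.encode('UTF-8')   # transcoded copy containing Ù
```

When no internal encoding is set, the string read this way is only tagged as ISO-8859-1; the explicit encode('UTF-8') at the end is what produces a UTF-8 string.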

Best regards,
Abinoam Jr.

my-ruby · February 1, 2014, 7:16am

Abinoam Jr. wrote in post #1135206:

Dear Arup R.,

I have a (humble) guess about your problem…

Try…
File.open('/home/kirti/workspace/Ruby/project_free/data/2013_diamonds.txt',
external_encoding: Encoding::ISO_8859_1) do |file|

Very good suggestion indeed.

One topic in Ruby has always troubled me: the rationale behind this
encoding handling. When should I think of internal_encoding and
external_encoding? Why not only encoding? Sometimes in this
situation I also used force_encoding… I did all of this by trial
and error. I wasn’t really aware of what I was doing. My goal
was to fix the error: first try encoding, then try force_encoding…

Can you shed some light on this topic?

my-ruby · February 1, 2014, 3:16pm

Abinoam Jr. wrote in post #1135236:

Feel free to ask if it’s not clear yet.
The encoding problem is complex (but the solution we have in Ruby is
simple IMHO).
I’m a native Portuguese speaker and have to rely on good encoding
support.

One line from a file once gave me a lot of trouble. I fixed it as
below:

data_line.chomp.force_encoding('windows-1252').encode('utf-8')

But before doing this, I first tried

(a) data_line.chomp.encoding('utf-8')
(b) data_line.chomp.force_encoding('utf-8')

Then finally

data_line.chomp.force_encoding('windows-1252').encode('utf-8') worked.

Why didn’t attempts (a) and (b) work? As I told you earlier, I have always
fixed this using the “trial and error” method.

Can you explain this? Maybe with your help I can make my foundation much
stronger on such encoding-related issues.

my-ruby · February 2, 2014, 12:44am

Can you give me the line?
I can try to help you.
Most of the problems come from the following: one opens a file telling
Ruby (implicitly or explicitly) the encoding, for example UTF-8, but the
byte representation inside it is in another encoding.
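
A minimal sketch of that mismatch, which may also explain why attempts (a) and (b) above did nothing. It assumes (a) was meant as encode('utf-8'), and the sample bytes are invented for illustration.

```ruby
# Raw bytes as they might sit in a Windows-1252 file (assumed sample data).
raw = "C\xD9te".b

# (b) force_encoding only relabels the bytes; it never changes them.
relabeled = raw.dup.force_encoding('UTF-8')
p relabeled.valid_encoding?                  # false

# (a) encode to the encoding the string is already tagged with performs no
# conversion, so the invalid byte is still there afterwards.
p relabeled.encode('UTF-8').valid_encoding?  # false

# Relabel with the encoding the bytes are really in, then transcode.
fixed = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
p fixed.valid_encoding?                      # true
p fixed                                      # the 0xD9 byte is now Ù
```

In short, force_encoding answers "what are these bytes?" and encode answers "give me the same characters in another encoding". The second only works once the first is answered correctly.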

my-ruby · February 1, 2014, 12:28pm

Dear Arup,

To understand the rationale of it, just relax and think about why you
(not me) were trying to “encoding” or “force_encoding” a string coming
from a file that has a different encoding from the internal one in your
program.

Perhaps you will notice that you are receiving data (external data) in
an encoding different from the one used internally.

Ruby does exactly what you were trying to accomplish, but in a more
elegant way ;-).

When you set the external encoding of a file and an internal encoding is
also in effect, Ruby translates all data coming from the file from the
external encoding to the internal one. (With no internal encoding set,
the strings are simply tagged with the external encoding.)
And when you write to the file, it does the reverse.
That way you can preserve the original encoding of the file and don’t
have to worry about encoding compatibility inside your program.
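
The round trip described here can be sketched as follows. The file is a temporary one and the helper name round_trip is hypothetical; the mode strings 'w:ISO-8859-1' and 'r:ISO-8859-1:UTF-8' use the standard "mode:external:internal" form.

```ruby
require 'tempfile'

# Hypothetical helper: writes UTF-8 text to an ISO-8859-1 file, then reads it
# back as UTF-8. Returns the raw bytes on disk and the line that was read.
def round_trip(text)
  Tempfile.create('roundtrip_demo') do |tmp|
    tmp.close

    # Writing: the UTF-8 string is converted to the external encoding.
    File.open(tmp.path, 'w:ISO-8859-1') { |file| file.write(text) }
    stored = File.binread(tmp.path)

    # Reading: "external:internal" asks Ruby to transcode on the way in.
    line = File.open(tmp.path, 'r:ISO-8859-1:UTF-8') { |file| file.readline }

    [stored, line]
  end
end

stored, line = round_trip("C\u00D9te d'Ivoire")
p stored.bytes.include?(0xD9)   # true: Ù is stored as the single byte 0xD9
p line.encoding                 # the string handed to the program is UTF-8
p line == "C\u00D9te d'Ivoire"  # true
```

The program only ever handles UTF-8 strings, while the file on disk stays in ISO-8859-1.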

Go for at

Feel free to ask if it’s not clear yet.
The encoding problem is complex (but the solution we have in Ruby is
simple IMHO).
I’m a native Portuguese speaker and have to rely on good encoding
support.
As Ruby has its roots among Japanese programmers, I think they are really
concerned with good encoding support and a rich set of features to deal
with it.

Kind regards,
Abinoam Jr.