How can I make a Ruby 1.9 string ignore invalid UTF-8 byte sequences in split?


Sorry, I just mis-sent the half-typed mail via a shortcut in Gmail.

I just resent a mail describing the problem.

Best wishes,
Stanley Xu

On Tue, Mar 22, 2011 at 3:30 PM, Stanley Xu [email protected] wrote:

Sorry, I just mis-sent the half-typed mail via a shortcut in Gmail.

I just resent a mail describing the problem.

Did you? I can’t seem to find it.

Cheers

robert

Anyway, let me resend it again.

Dear buddies,

I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
Unfortunately, we have some dirty data with invalid byte sequences in
the input. So while running things like

line.chomp.split("\t")

I will get errors like

:in `split': invalid byte sequence in UTF-8 (ArgumentError)

I searched a little bit and tried to use Iconv to ignore the invalid
sequences:

if !line.valid_encoding?
  ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
  line = ic.iconv(line)
end

It resolves most of the invalid lines, but a couple of lines will still
raise the same error.
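For what it's worth, Iconv's //IGNORE flag has a known quirk: it can
choke on an invalid sequence at the very end of the input, which may
explain the remaining failures. A common workaround is to append a
throwaway character and trim it off afterwards; a sketch, assuming the
iconv library shipped with 1.9:

require 'iconv'

# //IGNORE handles invalid bytes in the middle of the string, but an
# incomplete sequence at the very end can still trip it up. Appending
# a space and trimming it afterwards covers that edge case.
def clean_utf8(line)
  Iconv.new('UTF-8//IGNORE', 'UTF-8').iconv(line + ' ')[0..-2]
end

fields = clean_utf8(line).chomp.split("\t")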

I am wondering if there is a way I could make String#split work in
Ruby 1.9 with invalid character sequences?
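For the record, on 1.9 String#encode is a no-op when the source and
target encodings are the same, so :invalid => :replace does nothing for
a UTF-8 to UTF-8 conversion. A sketch of the usual workaround:
round-trip through another encoding so the transcoder actually runs.

# UTF-8 to UTF-8 transcoding is a no-op in 1.9, so go through UTF-16BE
# to make the converter strip the invalid sequences.
clean = line.encode('UTF-16BE', :invalid => :replace, :undef => :replace,
                    :replace => '').encode('UTF-8')
fields = clean.chomp.split("\t")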

Thanks in advance

Best wishes,
Stanley Xu


Maybe you should not transcode the data from its external encoding to
UTF-8. I was once trapped by an encoding problem where some GBK
characters could not be transformed to UTF-8.

# encoding: utf-8

File.open('file.txt', 'r:gbk').each_line do |line| # not 'r:gbk:utf-8'
  arr = line.chomp.split("\t".encode('gbk'))       # encode "\t" to GBK

  # blah blah

end

Joey

I am working with Chinese character radicals.
I came across a radical which has the codepoint "\uE839".


# encoding: utf-8

[ STDIN, STDOUT, STDERR ].each do |stdio|
  stdio.set_encoding('gbk', 'utf-8')
end
char = "\uE839"
puts char # Encoding::UndefinedConversionError


f.rb:7:in `write': U+E839 from UTF-8 to GBK (Encoding::UndefinedConversionError)
	from f.rb:7:in `puts'
	from f.rb:7:in `puts'
	from f.rb:7:in `<main>'

But Perl works:


use utf8;
use open ":encoding(gbk)", ":std";

$char = "\N{U+E839}";
print $char;

It prints out what I want: a Chinese character radical.
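If the immediate goal is just to avoid the exception (Ruby's GBK
converter, unlike Perl's, appears to have no mapping for the
private-use codepoint U+E839), IO#set_encoding accepts the same
conversion options as String#encode, so undefined characters can be
substituted instead of raising. A sketch under that assumption:

# encoding: utf-8

# :undef => :replace substitutes characters that have no GBK mapping
# instead of raising Encoding::UndefinedConversionError. It avoids the
# crash, but prints '?' rather than the radical Perl manages to emit.
[ STDIN, STDOUT, STDERR ].each do |stdio|
  stdio.set_encoding('gbk', 'utf-8', :undef => :replace, :replace => '?')
end
char = "\uE839"
puts char # => ?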

This sounds like it might be a legitimate bug. Can you file a ticket on
Redmine with this code sample?

Hi Joey,

I don't think that's the problem. It is probably a file of UTF-8
characters. Like, a million lines could be split well, but a thousand
of them will get the "invalid byte sequence" error.

Now I have a temporary solution like the following:

if !line.valid_encoding?
  # Reinterpret each raw byte as a codepoint: always yields valid
  # UTF-8, but garbles any multibyte characters.
  line = line.unpack('C*').pack('U*')
end
fields = line.chomp.split("\t")

But I really doubt it is a good solution, for the invalid bytes might
form a valid sequence in GBK or something like that.

Isn't there a way I could split the string in Ruby 1.9 in the old 1.8
"dirty way"?
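The closest analogue to the old 1.8 behaviour is to re-tag the line as
raw bytes before splitting. A minimal sketch, safe as long as the
delimiter is plain ASCII:

# Re-tag the string as raw bytes, which is how 1.8 effectively treated
# it. Splitting on "\t" is safe here: no multibyte UTF-8 sequence can
# contain the 0x09 byte.
fields = line.force_encoding('ASCII-8BIT').chomp.split("\t")
# Tag the pieces back as UTF-8 if they are to be handled as text again:
fields.each { |f| f.force_encoding('UTF-8') }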

Best wishes,
Stanley Xu