How can I make a Ruby 1.9 string ignore invalid UTF-8 byte sequences in split?

Dear buddies,

I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
Unfortunately, we have some dirty data with invalid byte sequences in
the input. So while running things like


I will get

Best wishes,
Stanley Xu

Sorry, I accidentally sent the half-typed mail with a Gmail shortcut.

I have just re-sent a mail that describes the problem.

Best wishes,
Stanley Xu

On Tue, Mar 22, 2011 at 3:30 PM, Stanley Xu [email protected] wrote:

Sorry, I accidentally sent the half-typed mail with a Gmail shortcut.

I have just re-sent a mail that describes the problem.

Did you? I can’t seem to find it.



Anyway, let me resend it again.

Dear buddies,

I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
Unfortunately, we have some dirty data with invalid byte sequences in
the input. So while running things like


I will get errors like
:in `split': invalid byte sequence in UTF-8 (ArgumentError)

I searched a little and tried to use iconv to ignore the invalid
byte sequences:

if !line.valid_encoding?
  ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')  # drop bytes that are not valid UTF-8
  line = ic.iconv(line)
end

It fixes most of the invalid lines, but a couple of lines still
raise the same error.

I am wondering if there is a way to make String#split work in
Ruby 1.9 on strings with invalid character sequences?

Thanks in advance

Best wishes,
Stanley Xu
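
For readers who land on this thread: a minimal sketch of cleaning a line without Iconv, assuming it is acceptable to silently drop the offending bytes. The helper name drop_invalid_utf8 and the sample line are made up for illustration. String#scrub only exists from Ruby 2.1 on, so the fallback branch round-trips through UTF-16BE, because a UTF-8 to UTF-8 encode is a no-op on 1.9.

```ruby
# Hypothetical helper: return +line+ with invalid UTF-8 bytes removed.
def drop_invalid_utf8(line)
  return line if line.valid_encoding?
  if line.respond_to?(:scrub)
    line.scrub('')                         # Ruby 2.1 and later
  else
    # Ruby 1.9/2.0: force a real conversion so :invalid => :replace applies.
    line.encode('UTF-16BE', 'UTF-8', :invalid => :replace, :replace => '')
        .encode('UTF-8')
  end
end

# Made-up input: \xFF can never appear in valid UTF-8.
dirty = "a\tb\xFFc\td\n".dup.force_encoding('UTF-8')
fields = drop_invalid_utf8(dirty).chomp.split("\t")
puts fields.inspect
```

Dropping bytes loses data, so this only suits fields where the damaged characters do not matter.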

On Tue, Mar 22, 2011 at 11:09 PM, Robert K.

Maybe you should not transcode the data from its external_encoding to
UTF-8. I was once caught by the same kind of encoding problem: some GBK
characters cannot be converted to UTF-8.

# encoding: utf-8
File.open('file.txt', 'r:gbk').each_line do |line|  # not 'r:gbk:utf-8'
  arr = line.chomp.split("\t".encode('gbk'))        # encode "\t" to gbk
  # blah blah
end
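
A self-contained variant of the same idea, as a sketch only: the GBK line is built in memory to stand in for a line read with 'r:gbk', and the sample text is invented. It splits while the data is still GBK, then converts each field to UTF-8 separately, so an unmappable character only damages one field instead of aborting the whole line.

```ruby
# Stand-in for a line read from a file opened with 'r:gbk'.
gbk_line = "\u4E2D\u6587\t\u6570\u636E\n".encode('GBK')

# Split in GBK. A tab byte (0x09) can never be part of a GBK
# double-byte character, so splitting on it is safe.
fields = gbk_line.chomp.split("\t".encode('GBK'))

# Convert field by field, replacing anything that has no UTF-8 mapping.
utf8_fields = fields.map do |f|
  f.encode('UTF-8', :invalid => :replace, :undef => :replace)
end

puts utf8_fields.inspect
```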



I am working with Chinese character radicals.
I came across a radical which has a codepoint “\uE839”.

# encoding: utf-8

[ STDIN, STDOUT, STDERR ].each do |stdio|
  stdio.set_encoding('gbk', 'utf-8')
end
char = "\uE839"
puts char # Encoding::UndefinedConversionError

f.rb:7:in `write': U+E839 from UTF-8 to GBK (Encoding::UndefinedConversionError)
from f.rb:7:in `puts'
from f.rb:7:in `puts'
from f.rb:7:in `<main>'

But Perl works

use utf8;
use open ":encoding(gbk)", ":std";

$char = "\N{U+E839}";
print $char;

It prints what I want: a Chinese character radical.

This sounds like it might be a legitimate bug. Can you file a ticket on
redmine with this code sample?
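
Until that is settled, one possible stopgap on the Ruby side is a lossy conversion, assuming a placeholder character is acceptable in the output. This is only a sketch: to_gbk_lossy is a made-up helper name, and the sample uses U+1F600, which lies outside the BMP and therefore cannot be represented in GBK. Whether U+E839 itself converts depends on the conversion table shipped with the Ruby build.

```ruby
# Hypothetical helper: convert text to GBK, substituting '?' for
# characters that have no GBK mapping instead of raising
# Encoding::UndefinedConversionError.
def to_gbk_lossy(text)
  text.encode('GBK', :invalid => :replace, :undef => :replace, :replace => '?')
end

sample = "abc\u{1F600}"          # the last character has no GBK mapping
converted = to_gbk_lossy(sample)
puts converted.bytes.to_a.inspect
```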

Hi Joey,

I don't think that's the problem. The file is most likely UTF-8:
something like 1 million lines split fine, but about 1000 of them
raise the "invalid byte sequence" error.

Now I have a temporary solution like the following:

if !line.valid_encoding?
  # reinterpret every byte as its own code point so the string becomes valid
  line = line.unpack('C*').pack('U*')
end
fields = line.chomp.split("\t")

But I really doubt it is a good solution, because the invalid bytes
may well be a valid sequence in GBK or something like that.

Isn't there a way to split the string in Ruby 1.9 the old 1.8 way?

Best wishes,
Stanley Xu
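
One way to get close to the 1.8 behaviour, sketched under the assumption that the separator is a plain ASCII tab: tag the line as binary before splitting, so Ruby compares raw bytes and never validates the encoding. The helper name split_bytes and the sample line are invented for illustration.

```ruby
# Hypothetical helper: split on a separator byte-wise, ignoring encoding.
def split_bytes(line, sep = "\t")
  # dup so the caller's string keeps its original encoding tag
  line.dup.force_encoding('BINARY').chomp.split(sep)
end

# Made-up input: \xFF makes the line invalid as UTF-8.
dirty = "a\tb\xFFc\td\n".dup.force_encoding('UTF-8')
fields = split_bytes(dirty)

# The fields come back tagged ASCII-8BIT with their bytes untouched.
# Retag them if later code expects UTF-8; invalid bytes stay invalid.
fields.each { |f| f.force_encoding('UTF-8') }
puts fields.length
```

Unlike the unpack/pack approach, no byte is rewritten, so a field that is really GBK can still be decoded correctly later.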