How to use String#split to split a mixed-encoding string (part encoded in GBK, part encoded in UTF-8)

luislavena · March 23, 2011, 4:54am

Dear Buddies,

Yesterday I sent a mail asking how to make split ignore invalid UTF-8
byte sequences. I have since checked the string I want to parse in Java
and found out that part of the string is encoded in GBK and part of it
is encoded in UTF-8.

I am wondering if I could still split the string with the split method
and then call force_encoding on the parts that might be encoded in GBK
to resolve the problem.

Is there a way to do so without getting the “invalid byte sequence”
error?

Thanks.

Best wishes,
Stanley Xu

xuwenhao · March 23, 2011, 10:33am

On Wed, Mar 23, 2011 at 4:53 AM, Stanley Xu [email protected] wrote:

> [...] without the "invalid byte sequence" error?

A string with mixed encodings is difficult to handle. I think you
have these options:

1. Ensure that the string does not contain mixed encodings (this
   would be the first and best choice IMHO).
2. If you can’t because you get the data from somewhere else, use
   the BINARY encoding as a detour:

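# Treat the data as raw bytes so that split does not validate the encoding
mixed_content.force_encoding(Encoding::BINARY)
chunks = mixed_content.split(/\t/)
# Tag each chunk with the encoding it is actually in
chunks[0].force_encoding(Encoding::UTF_8)
chunks[1].force_encoding(Encoding::GBK)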

or

# Same idea, with the two fields assigned directly
mixed_content.force_encoding(Encoding::BINARY)
a, b = mixed_content.split(/\t/)
a.force_encoding(Encoding::UTF_8)
b.force_encoding(Encoding::GBK)
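
For anyone who wants to try this end to end, here is a minimal sketch of the same approach. It assumes a line with exactly two tab-separated fields, the first in UTF-8 and the second in GBK. The sample data and the helper name split_mixed are made up for illustration and are not from the original post.

```ruby
# Hypothetical sample data: a UTF-8 field, a tab, then a GBK field.
utf8_part = "héllo"
gbk_part  = "中文".encode(Encoding::GBK)

# Glue the raw bytes together, then tag the result as UTF-8 to mimic
# input that was read with the wrong (single) encoding.
mixed_content = (utf8_part.b + "\t".b + gbk_part.b).force_encoding(Encoding::UTF_8)

# split_mixed is an illustrative helper, not a library method.
# Assumption: field 1 is UTF-8 and field 2 is GBK.
def split_mixed(line)
  # Work on a binary copy so split only sees bytes and the caller's
  # string is left untouched.
  raw = line.dup.force_encoding(Encoding::BINARY)
  first, second = raw.split(/\t/, 2)

  # Re-tag each field with its real encoding, then transcode the GBK
  # field so both results end up as UTF-8.
  [
    first.force_encoding(Encoding::UTF_8),
    second.force_encoding(Encoding::GBK).encode(Encoding::UTF_8)
  ]
end

a, b = split_mixed(mixed_content)
puts a
puts b
```

Splitting on the tab byte at the byte level is safe here because 0x09 never occurs inside a multibyte character in either UTF-8 or GBK. With a different separator or a different encoding mix, that would need to be checked first.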

Kind regards

robert

xuwenhao · March 23, 2011, 3:06pm

Thanks a lot, Robert. Your solution really helps.

Best wishes,
Stanley Xu

On Wed, Mar 23, 2011 at 5:32 PM, Robert K.