How to use String.split to split a mixed-encoding string (part encoded in GBK, part encoded in UTF-8)

Dear Buddies,

Yesterday I sent a mail asking how to make split ignore invalid UTF-8
byte sequences. I have since checked the string I wanted to parse in Java
and found that part of the string is encoded in GBK and part of it is
encoded in UTF-8.

I am wondering if I could still split the string with the split
method and then call force_encoding on the part of the string that is
encoded in GBK to resolve the problem.

Is there a way to do so without getting the “invalid sequence” error?


Best wishes,
Stanley Xu

On Wed, Mar 23, 2011 at 4:53 AM, Stanley Xu wrote:

> I am wondering if there is a way I could do so without the “invalid
> sequence” error?

A string with mixed encodings is difficult to handle. I think you
have these options:

  1. Ensure that the string does not contain mixed encodings (this
    would be the first and best choice IMHO).

  2. If you can’t because you get the data from somewhere else, use
    encoding BINARY as a detour:

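# Tag the raw bytes as BINARY so that split does not validate them
mixed_content.force_encoding(Encoding::BINARY)
chunks = mixed_content.split(/\t/)
# Re-tag each chunk with its real encoding (no bytes are converted)
chunks[0].force_encoding(Encoding::UTF_8)
chunks[1].force_encoding(Encoding::GBK)

or, with multiple assignment:

mixed_content.force_encoding(Encoding::BINARY)
a, b = mixed_content.split(/\t/)
a.force_encoding(Encoding::UTF_8)
b.force_encoding(Encoding::GBK)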
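
For anyone who wants to try this out, here is a minimal, self-contained sketch of the same idea. The sample text, the assumption that the first field is UTF-8 and the second is GBK, and the helper name decode_chunk are all made up for illustration; they are not from the original data. The helper is only a heuristic for the case where you do not know which field uses which encoding: some GBK byte sequences happen to be valid UTF-8 as well, so it can guess wrong.

```ruby
# Build a sample line: a UTF-8 field, a tab, then a GBK field.
# "\u4E2D\u6587" is sample text chosen for this illustration only.
utf8_part = "\u4E2D\u6587"
gbk_part  = utf8_part.encode(Encoding::GBK)

# Simulate input that arrives tagged as UTF-8 but is not valid UTF-8.
mixed_content = (utf8_part.b + "\t" + gbk_part.b).force_encoding(Encoding::UTF_8)
is_valid_before = mixed_content.valid_encoding?

# Re-tag a copy as BINARY so that split works on raw bytes.
# The tab byte (0x09) never appears inside a multibyte character in
# UTF-8 or GBK, so splitting on it cannot cut a character in half.
raw = mixed_content.dup.force_encoding(Encoding::BINARY)
a, b = raw.split(/\t/)

# force_encoding only changes the tag; the bytes stay as they are.
a.force_encoding(Encoding::UTF_8)
b.force_encoding(Encoding::GBK)

# Transcode the GBK field so that both fields end up as UTF-8.
b_utf8 = b.encode(Encoding::UTF_8)

# Hypothetical helper: guess the encoding of a chunk when the field
# order is unknown. Tries UTF-8 first and falls back to GBK.
def decode_chunk(bytes)
  as_utf8 = bytes.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?
  bytes.dup.force_encoding(Encoding::GBK).encode(Encoding::UTF_8)
end

decoded = raw.split(/\t/).map { |chunk| decode_chunk(chunk) }
```

Note that force_encoding never converts anything, so the explicit encode call is what actually turns the GBK bytes into UTF-8. If a chunk is neither valid UTF-8 nor valid GBK, encode will raise, which is usually what you want for bad input.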

Kind regards


Thanks a lot, Robert. Your solution really helps.

Best wishes,
Stanley Xu

On Wed, Mar 23, 2011 at 5:32 PM, Robert K.