Dear Buddies,
Yesterday I sent a mail asking how to make split ignore invalid UTF-8
byte sequences. I have since checked the string I want to parse in Java
and found that the string is encoded in GBK, while part of it is
encoded in UTF-8.
Is there a way to still split the string with the split method without
getting the "invalid byte sequence" error? I could then call
force_encoding on the parts of the string that might be encoded in GBK
and resolve the problem.
Thanks.
Best wishes,
Stanley Xu
On Wed, Mar 23, 2011 at 4:53 AM, Stanley Xu [email protected] wrote:
A string with mixed encodings is difficult to handle. I think you
have these options:
- Ensure that the string does not contain mixed encodings (this
  would be the first and best choice IMHO).
- If you can't, because you get the data from somewhere else, use
  encoding BINARY as a workaround:
    mixed_content.force_encoding(Encoding::BINARY)
    chunks = mixed_content.split(/\t/)
    chunks[0].force_encoding(Encoding::UTF_8)
    chunks[1].force_encoding(Encoding::GBK)
  or
    mixed_content.force_encoding(Encoding::BINARY)
    a, b = mixed_content.split(/\t/)
    a.force_encoding(Encoding::UTF_8)
    b.force_encoding(Encoding::GBK)
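For anyone who wants to try this end to end, here is a minimal, self-contained sketch of the second option. The sample data is hypothetical: it assumes a tab-separated record whose first field is UTF-8 and whose second field is GBK, which may not match the real layout. It also adds two steps that are not in the snippet above: a valid_encoding? check (force_encoding only relabels the bytes, it never verifies them) and a transcode of the GBK chunk to UTF-8.

```ruby
# Hypothetical sample: the same two characters, once as UTF-8 bytes and
# once as GBK bytes, joined by a tab. The field layout is an assumption.
utf8_bytes = "\u4E2D\u6587".dup.force_encoding(Encoding::BINARY)
gbk_bytes  = "\u4E2D\u6587".encode(Encoding::GBK).force_encoding(Encoding::BINARY)
mixed_content = utf8_bytes + "\t" + gbk_bytes

# Treat the data as raw bytes so that split does not validate characters.
mixed_content.force_encoding(Encoding::BINARY)
a, b = mixed_content.split(/\t/)

# Re-tag each chunk with the encoding it is assumed to be in.
a.force_encoding(Encoding::UTF_8)
b.force_encoding(Encoding::GBK)

# force_encoding only changes the label; valid_encoding? exposes a wrong guess.
[a, b].each do |chunk|
  raise ArgumentError, "not valid #{chunk.encoding}" unless chunk.valid_encoding?
end

# Transcode the GBK chunk so that both chunks end up as UTF-8 text.
b_utf8 = b.encode(Encoding::UTF_8)
puts a == b_utf8
```

If a chunk fails the valid_encoding? check, the assumed layout is probably wrong for that record, and it is safer to stop there than to transcode mislabeled bytes.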
Kind regards
robert
Thanks a lot, Robert. Your solution really helps.
Best wishes,
Stanley Xu
On Wed, Mar 23, 2011 at 5:32 PM, Robert K.