On Wed, Mar 23, 2011 at 4:53 AM, Stanley Xu <wenhao.xu / gmail.com> wrote:
> Yesterday I sent a mail about making split ignore invalid UTF-8 byte
> sequences. I have since checked the string I want to parse in Java and
> found that the string is encoded in GBK while part of it is encoded in
> UTF-8.
>
> I am wondering whether I can still split the string with the split
> method and then call force_encoding on the parts of the string that might
> be encoded in GBK to resolve the problem.
>
> Is there a way I could do so without the "invalid byte sequence"
> error?

A string with mixed encodings is difficult to handle.  I think you
have these options:

1. Ensure that the string does *not* contain mixed encodings (this
would be the first and best choice IMHO).

2. If you can't, because you get the data from somewhere else, use
encoding BINARY as a detour: split the raw bytes first, then tag each
chunk with the encoding it really has:

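# Treat the data as raw bytes so split does not validate it as UTF-8.
mixed_content.force_encoding(Encoding::BINARY)
chunks = mixed_content.split(/\t/)
# Tag each chunk with the encoding it actually uses.
chunks[0].force_encoding(Encoding::UTF_8)
chunks[1].force_encoding(Encoding::GBK)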

or

mixed_content.force_encoding(Encoding::BINARY)
a, b = mixed_content.split(/\t/)
a.force_encoding(Encoding::UTF_8)
b.force_encoding(Encoding::GBK)
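
If you do not know in advance which chunk uses which encoding, something
along these lines might work.  It is only a sketch: split_mixed is a
made-up helper name, and it assumes that every chunk which is not valid
UTF-8 is GBK.  That is a heuristic, since some GBK byte sequences also
happen to be valid UTF-8 and would be read as UTF-8.

```ruby
# Hypothetical helper (not part of Ruby): split on raw bytes, then decide
# per chunk whether it is UTF-8 or GBK, and return everything as UTF-8.
def split_mixed(raw, separator = "\t")
  # Work on binary copies so split never validates multi-byte sequences.
  bytes = raw.dup.force_encoding(Encoding::BINARY)
  sep   = separator.dup.force_encoding(Encoding::BINARY)

  bytes.split(sep).map do |chunk|
    as_utf8 = chunk.dup.force_encoding(Encoding::UTF_8)
    if as_utf8.valid_encoding?
      as_utf8
    else
      # Assumption: whatever is not valid UTF-8 is GBK.  encode raises an
      # EncodingError subclass if the bytes are not valid GBK either.
      chunk.dup.force_encoding(Encoding::GBK).encode(Encoding::UTF_8)
    end
  end
end

# Usage: a UTF-8 field and a GBK field joined by a tab.
gbk_field = "\u4E2D\u6587".encode(Encoding::GBK)
mixed = "caf\u00E9".dup.force_encoding(Encoding::BINARY) +
        "\t".dup.force_encoding(Encoding::BINARY) +
        gbk_field.dup.force_encoding(Encoding::BINARY)
fields = split_mixed(mixed)
```

Transcoding every chunk to UTF-8 means the rest of the program only has
to deal with one encoding.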

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/