Anyway, let me resend it.


Dear buddies,

I am using Ruby to run some MapReduce jobs in Hadoop Streaming.
Unfortunately, some of our input is dirty data containing invalid byte
sequences. So when running something like

line.chomp.split("\t")

I get errors like
:in `split': invalid byte sequence in UTF-8 (ArgumentError)
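
A minimal sketch of how the error can be reproduced, assuming a made-up
line with a stray 0xFF byte (a byte that is never valid in UTF-8). The
sample data is hypothetical and not taken from the real job input:

```ruby
# Hypothetical dirty line: "\xFF" can never appear in valid UTF-8.
line = "good\tba\xFFd\tfield\n"

# The string is tagged UTF-8 but its bytes are not valid UTF-8.
puts line.valid_encoding?

begin
  fields = line.chomp.split("\t")
  puts fields.length
rescue ArgumentError => e
  # On Ruby 1.9 this is where "invalid byte sequence in UTF-8" is reported.
  puts e.message
end
```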

I searched a little bit and tried to use Iconv to drop the invalid sequences
with

require 'iconv'

unless line.valid_encoding?
  # Convert UTF-8 to UTF-8, asking iconv to discard invalid bytes
  ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
  line = ic.iconv(line)
end

It fixes most of the invalid lines, but a couple of lines still produce
the same error.

I am wondering if there is a way to make String#split work in
Ruby 1.9 on strings that contain invalid byte sequences?
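
One possible direction, sketched under the assumption that the fields only
need to be separated on the tab byte and can be handled as raw bytes
afterwards: tag the line as binary before splitting, so that no byte
sequence counts as invalid. The helper name split_fields is made up for
illustration.

```ruby
# Hypothetical helper: split a tab-separated line without validating it
# as UTF-8.
def split_fields(line)
  # dup so the caller's string keeps its original encoding tag;
  # force_encoding only relabels the bytes, it does not change them.
  raw = line.dup.force_encoding(Encoding::BINARY)
  raw.chomp.split("\t")
end

fields = split_fields("good\tba\xFFd\tfield\n")
puts fields.length
```

The returned fields are tagged as binary (ASCII-8BIT), so any field that
is later treated as text would still need to be checked or cleaned.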

Thanks in advance

Best wishes,
Stanley Xu



On Tue, Mar 22, 2011 at 11:09 PM, Robert Klemme
<shortcutter / googlemail.com> wrote:

> On Tue, Mar 22, 2011 at 3:30 PM, Stanley Xu <wenhao.xu / gmail.com> wrote:
> > Sorry, I just mis-sent the half-typed mail by a short-cut in gmail.
> >
> > I just resent a mail to describe the problem.
>
> Did you?  I can't seem to find it.
>
> Cheers
>
> robert
>
> --
> remember.guy do |as, often| as.you_can - without end
> http://blog.rubybestpractices.com/
>
>