On 6/15/10, R.. Kumar <sentinel1879 / gmail.com> wrote:
> I download the page http://www.ruby-forum.com/forum/4 using wget. Then i
> cat the file and pipe to gsub.
>
> I get: -e:1:in `gsub': invalid byte sequence in US-ASCII (ArgumentError)
>
>
> wget -q -k -O index11.html http://www.ruby-forum.com/forum/4
>
> cat index11.html | ruby -pe 'gsub(/href=a\/"/,"href=\"'${base}'")' >
> ofile
>
> (The value of base is http://www.ruby-forum.com/)
>
> So what must i do so this command can run. It runs fine with another
> site.
> If i replace ruby with perl -pe 's|....|g' that works fine.
>
> I actually run this in a loop with various URLS from cron.

Handling this kind of thing properly means tracking encodings
properly, which means you'd have to extract the encoding from the
HTTP session and then mark the input as that encoding in your Ruby
script, and then deal with the inevitable incompatible-encoding
errors that would crop up.
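
For what it's worth, here is a rough sketch of that "proper" route. It
assumes you have already captured the Content-Type header value from
the HTTP response somehow; the helper name charset_from, the example
header, and the sample bytes are all made up for illustration.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type header value.
# Falls back to a default when no charset is declared or Ruby doesn't know it.
def charset_from(content_type, default = Encoding::BINARY)
  name = content_type.to_s[/charset=["']?([^\s;"']+)/i, 1]
  name ? Encoding.find(name) : default
rescue ArgumentError
  # Encoding.find raises ArgumentError for names it doesn't recognize.
  default
end

header = "text/html; charset=ISO-8859-1"  # assumed example header value
body   = "caf\xE9".b                      # raw bytes, as read from the saved page

# Mark the bytes with the declared encoding, then transcode to UTF-8,
# replacing anything that can't be converted instead of raising.
enc  = charset_from(header)
text = body.force_encoding(enc).encode("UTF-8", invalid: :replace, undef: :replace)
puts text
```

Once everything is UTF-8, the regexps and replacement strings are all in
compatible encodings, which is what sidesteps most of the errors.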

It sounds to me, though, like what you have in this case is just
some hacky little scripts, and it would be acceptable for them to be
imperfect. So, in that case, I suggest trying to set the encoding for
your source file(s) to BINARY. That's a hack, but it ought to be
effective.
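
A minimal sketch of why the BINARY trick should help, assuming the
failure comes from non-ASCII bytes in input that Ruby has tagged as
US-ASCII. The sample line and the pattern below are invented for
illustration; they are not taken from your actual command.

```ruby
# Stand-in for one line of the downloaded page: it contains non-ASCII bytes
# but is tagged US-ASCII, as input can be under a C/POSIX locale (e.g. cron).
line = "<a href=\"/forum/4\">caf\xC3\xA9</a>\n".dup.force_encoding("US-ASCII")
base = "http://www.ruby-forum.com/"
pattern = /href="\//   # hypothetical pattern, for illustration only

raised = false
begin
  line.gsub(pattern, "href=\"#{base}")
rescue ArgumentError => e
  # Matching a regexp against a string with invalid bytes raises.
  raised = true
  warn "tagged US-ASCII: #{e.message}"
end

# Retag the same bytes as BINARY; no bytes change, only the label.
binary = line.dup.force_encoding("BINARY")
fixed  = binary.gsub(pattern, "href=\"#{base}")
puts fixed
```

For a -pe one-liner the rough equivalent would be to pass an external
encoding on the command line with the -E switch (for example
-E BINARY), though I haven't tried that against your exact command.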

Alternatively, you could drop back to the 1.8 interpreter, as Brian
suggests, which more or less uses BINARY as the default source
encoding.