On Wed, Mar 30, 2011 at 7:35 AM, ctdev <ctdev421 / gmail.com> wrote:
>> What is the encoding of your input HTML file?
>
> Opening one of the files in IRB and checking external_encoding.name
> returns "UTF-8".

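That doesn't detect the file's actual encoding. external_encoding only
reports the encoding Ruby assumes for the stream, which comes from your
locale (see the examples below). Your file is either in a different
encoding or corrupt, hence the invalid byte sequence.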

http://blog.grayproductions.net/articles/ruby_19s_three_default_encodings

ruby -v -e 'puts File.open("/etc/passwd").external_encoding'
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
US-ASCII

LC_CTYPE=ja_JP.sjis ruby -v -e 'puts File.open("/etc/passwd").external_encoding'
ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
Shift_JIS
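
To check whether the bytes are really valid in the encoding Ruby has
assumed, something like the following might help. This is only a
sketch: the sample bytes and the valid_as? helper are made up for
illustration, and in your case the bytes would come from reading the
HTML file in binary mode.

```ruby
# Hypothetical sample: "café" encoded as ISO-8859-1. The byte 0xE9 on its
# own is not a valid UTF-8 sequence.
bytes = [0x63, 0x61, 0x66, 0xE9].pack("C*")

# Hypothetical helper: relabel a copy of the bytes with the given encoding
# and ask Ruby whether they form valid characters in it. force_encoding
# only changes the label, it never converts the bytes.
def valid_as?(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).valid_encoding?
end

puts valid_as?(bytes, "UTF-8")       # false
puts valid_as?(bytes, "ISO-8859-1")  # true
```

Note that a true result only means the bytes are legal in that
encoding, not that it is the encoding the author used. ISO-8859-1
accepts any byte sequence at all.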

> I wonder if I might have better luck building the scraping portion of
> my app in a different language (though I'd rather stick with Ruby).

Well, another language might silently ignore the invalid bytes, so it
would look like it worked fine, but your output could still be wrong.
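
You can see the same trade-off in Ruby itself. The sketch below uses
made-up sample bytes and assumes, purely for illustration, that the
real encoding is ISO-8859-1; your file may well be something else.

```ruby
# Hypothetical sample: "cafés" encoded as ISO-8859-1.
raw = [0x63, 0x61, 0x66, 0xE9, 0x73].pack("C*")

# Lossy: pretend the bytes are UTF-8 and replace whatever is invalid.
# Going through UTF-16BE and back makes sure a conversion really runs.
lossy = raw.dup.force_encoding("UTF-8").
  encode("UTF-16BE", :invalid => :replace, :replace => "?").
  encode("UTF-8")

# Faithful: label the bytes with their real encoding, then transcode.
faithful = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts lossy     # caf?s
puts faithful
```

Both results are valid UTF-8 and neither raises an error, but only the
second one still contains the character that was in the file.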