On Wed, Mar 30, 2011 at 1:35 PM, ctdev <ctdev421 / gmail.com> wrote:

>> What is the encoding of your input HTML file?
>
> Opening one of the files in IRB and checking external_encoding.name
> returns "UTF-8".

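That was not the question.  He wanted to know the encoding of the
_file_ itself.  external_encoding only reports the encoding Ruby
assumes when it reads the file (by default taken from your locale),
not how the bytes were actually written.  You should be able to
identify the real encoding from the Content-Type header of the HTTP
response.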
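
For illustration, here is a minimal sketch of pulling the charset out
of a Content-Type header value.  The helper name
charset_from_content_type is made up for this example, and the regexp
is a simplification that assumes a well-formed header:

```ruby
# Hypothetical helper (not part of any library): extract the charset
# parameter from a Content-Type header value such as
# "text/html; charset=windows-1252".  Returns nil when none is declared.
def charset_from_content_type(header)
  return nil if header.nil?

  # Accept optional whitespace around "=" and optional double quotes.
  match = header.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1]
end

# Possible usage with net/http (the URL is a placeholder):
#   require 'net/http'
#   response = Net::HTTP.get_response(URI("http://example.com/"))
#   charset  = charset_from_content_type(response["Content-Type"])
```

If the server declares no charset at all the helper returns nil, and
you would have to guess or look inside the document itself.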

> This is from a group of pages I scraped with Hpricot (before switching
> to Nokogiri) and saved locally.
>
> The site itself comes from a Microsoft environment and there seems to
> be much weirdness in the files. I'll need to anticipate and
> accommodate that in my code.

Weirdness with regard to encodings or other weirdness?

> I wonder if I might have better luck building the scraping portion of
> my app in a different language (though I'd rather stick with Ruby).

IMHO it is usually simpler to stay in one ecosystem.  If the server
sends the correct encoding I would expect Hpricot and Nokogiri to
treat the file properly.  If you fetched the files with a pre-1.9
version of Ruby, you may have to refetch them.
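
Before refetching, you could check what the saved bytes really are.
A rough sketch for Ruby 1.9 or later: the helper name to_utf8 and the
Windows-1252 fallback are my assumptions (a guess based on the
Microsoft origin of the site), not something I know about your files:

```ruby
# Hypothetical helper: return the given bytes as a UTF-8 string.
# If the bytes are already valid UTF-8 they are only re-tagged.
# Otherwise they are ASSUMED to be in `fallback` and transcoded.
# Note: encode raises for the few byte values Windows-1252 leaves
# undefined (for example 0x81).
def to_utf8(bytes, fallback = "Windows-1252")
  candidate = bytes.dup.force_encoding("UTF-8")
  return candidate if candidate.valid_encoding?

  bytes.dup.force_encoding(fallback).encode("UTF-8")
end

# Possible usage (the file name is a placeholder):
#   html = to_utf8(File.binread("page.html"))
```

Reading with File.binread avoids Ruby tagging the data with the
default external encoding before you have had a chance to inspect it.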

Cheers

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/