On Wed, Mar 30, 2011 at 1:35 PM, ctdev <ctdev421 / gmail.com> wrote:
>> What is the encoding of your input HTML file?
>
> Opening one of the files in IRB and checking external_encoding.name
> returns "UTF-8".

That was not the question. He wanted to know the encoding of the
_file_. You should be able to identify this from the HTTP response.

> This is from a group of pages I scraped with Hpricot (before switching
> to Nokogiri) and saved locally.
>
> The site itself comes from a Microsoft environment and there seems to
> be much weirdness in the files. I'll need to anticipate and
> accommodate that in my code.

Weirdness with regard to encodings or other weirdness?

> I wonder if I might have better luck building the scraping portion of
> my app in a different language (though I'd rather stick with Ruby).

IMHO it is usually simpler to stay in one ecosystem. If the server
sends the correct encoding I would expect Hpricot and Nokogiri to
treat the file properly. If you fetched the files with a pre-1.9
version then maybe you have to refetch them.

Cheers

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
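
For reference, the point about taking the encoding from the HTTP response can be sketched in Ruby 1.9 or later roughly as follows. external_encoding only reports the encoding Ruby assumes when it opens the file; it does not inspect the bytes. This is an illustration under stated assumptions, not the approach either poster used: the helper names, the sample Content-Type header, and the sample bytes are all hypothetical and not taken from the scraped site.

```ruby
# Hypothetical helper: pull the charset out of a Content-Type header value.
# Returns nil when the header is missing or declares no charset.
def charset_from_content_type(header)
  return nil if header.nil?
  match = header.match(/charset\s*=\s*"?([^\s;"]+)"?/i)
  match && match[1]
end

# Hypothetical helper: tag raw bytes with the declared encoding and report
# whether the bytes are actually valid in that encoding.
def tag_with_encoding(bytes, charset)
  tagged = bytes.dup.force_encoding(Encoding.find(charset))
  [tagged, tagged.valid_encoding?]
end

# Assumed sample data: a header a Windows-hosted site might send, and the
# raw bytes of "café" as that site would encode them.
header = "text/html; charset=windows-1252"
raw    = "caf\xE9".b

charset  = charset_from_content_type(header)
text, ok = tag_with_encoding(raw, charset)

# Transcode to UTF-8 only if the bytes are valid in the declared encoding.
utf8 = text.encode("UTF-8") if ok

# The same bytes are not valid UTF-8, which is why the label Ruby attaches
# on open says nothing about what is really in the file.
looks_like_utf8 = raw.dup.force_encoding("UTF-8").valid_encoding?
```

The design choice here is to label the bytes first with force_encoding (which changes no bytes) and only then transcode with encode, so a wrong declared charset shows up as an invalid string rather than as silently mangled text.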