On Tue, Jul 07, 2009 at 10:18:50PM +0900, Patrick Lajeunesse wrote:
> Hi,
> I'm trying to scrape links using Mechanize. Sometimes accented characters
> (on French pages) are corrupt once Ruby gets them. To see what I mean, check
> this:
> 
> require 'mechanize'
> a = WWW::Mechanize.new
> page = a.get('http://www.agr.gc.ca/cb/index_f.php?s1=n&s2=index&page=2009_07')
> page.links.each do |a_link|
>   puts a_link
> end
> 
> Of course, it's only the accents that are entered in plain text (i.e.,
> without entities) that have this problem. But in an imperfect world, I can't
> always count on accents being entered properly.
> 
> Is there anything I can do about this? I've tried using Iconv to convert the
> strings to UTF-8, but that just resulted in a different (but still wrong)
> character in place of the broken ones.

Which versions of nokogiri and mechanize do you have installed?  I ran your
code and was able to see the accents:

  http://skitch.com/aaron.patterson/bs4qt/terminal-bash-80x24

Most of the time, these encoding issues are due to the server
incorrectly identifying the encoding of the content.  Is this content
supposed to be ISO-8859-1?
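
If the bytes do turn out to be ISO-8859-1 that got labeled as something
else, relabeling and then transcoding them is one thing to try.  This is
only a rough sketch, assuming a Ruby with built-in string encodings (1.9
or later; `String#b` needs 2.0).  The helper name `latin1_to_utf8` and the
sample string are made up for illustration and are not part of mechanize
or nokogiri:

```ruby
# Hypothetical helper (not a mechanize or nokogiri API): treat the raw
# bytes as ISO-8859-1, then transcode them to UTF-8.
def latin1_to_utf8(raw)
  # force_encoding only relabels the bytes; encode does the real conversion.
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# Made-up sample: "\xE9" is the single ISO-8859-1 byte for e-acute.
# String#b gives a binary copy, standing in for bytes read off the wire.
raw = "Communiqu\xE9s".b

fixed = latin1_to_utf8(raw)
puts fixed
```

If the page is really in some other encoding, the same two steps apply
with a different source encoding; guessing the wrong one is what produces
a different but still wrong character.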

-- 
Aaron Patterson
http://tenderlovemaking.com/