Issue #2567 has been updated by Eric Hodel.


The problem is not so much forcing the user to figure out how to get the correct encoding (charset), but making sure the encoding returned is accurate.  If we can add this feature to Net::HTTP in a way that works for most cases, that's great.

Unfortunately, for websites outside the US, guessing the encoding correctly is a big problem and usually requires an attempt at parsing the document first.  Most of the bugs in mechanize about setting the encoding correctly came from people parsing non-English and non-Latin websites (where assuming UTF-8 or ISO-8859-1 won't work).

If we can do this without needing to parse the document, that's great, but I think that is very difficult to do.  Having a broken or inaccurate way of choosing the encoding would be worse than having no way at all.
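
As a rough sketch of the header-first, document-second detection I mean (this is not an existing Net::HTTP API; the guess_encoding helper, the 1024-byte sniff window, and the regexes are all just illustrative assumptions for a Net::HTTPResponse):

  def guess_encoding(response)
    # 1. charset parameter from the Content-Type header, if the server sent one
    if response['content-type'].to_s =~ /charset=([^\s;"']+)/i
      begin
        return Encoding.find($1)
      rescue ArgumentError
        # unknown charset name; fall through and look at the document itself
      end
    end

    # 2. crude peek at the top of the body for a charset/encoding declaration
    #    (<meta charset=...>, an http-equiv content attribute, or an XML prolog)
    if response.body.to_s[0, 1024] =~ /(?:charset|encoding)=["']?([A-Za-z0-9._-]+)/i
      begin
        return Encoding.find($1)
      rescue ArgumentError
      end
    end

    # 3. nothing trustworthy found; stay binary rather than guess wrong
    Encoding::ASCII_8BIT
  end

A caller would then do response.body.force_encoding(guess_encoding(response)).  Even this only helps when a declaration is present and honest; pages with a missing or wrong declaration still need real parsing or deeper heuristics, which is the part I think is hard to get right.
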
----------------------------------------
Feature #2567: Net::HTTP does not handle encoding correctly
http://redmine.ruby-lang.org/issues/2567

Author: Ryan Sims
Status: Assigned
Priority: Low
Assignee: Yui NARUSE
Category: lib
Target version: 1.9.x
ruby -v: ruby 1.9.1p376 (2009-12-07 revision 26041) [i686-linux]


=begin
 A string returned by an HTTP GET does not have its encoding set according to the charset field, nor does content_type report the charset. Example code demonstrating the incorrect behavior is below.
 
 #!/usr/bin/ruby -w
 # encoding: UTF-8
 
 require 'net/http'
 
 uri = URI.parse('http://www.hearya.com/feed/')
 result = Net::HTTP.start(uri.host, uri.port) {|http|
     http.get(uri.request_uri)
 }
 
 p result['content-type']     # "text/xml; charset=UTF-8" <- correct
 p result.content_type        # "text/xml" <- incorrect; truncates the charset field
 puts result.body.encoding    # ASCII-8BIT <- incorrect encoding, should be UTF-8
=end
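
For anyone hitting this today, a manual workaround is possible in caller code (the regex-based charset extraction below is only illustrative; Net::HTTP itself does none of this):

  charset = result['content-type'].to_s[/charset=([^\s;]+)/i, 1]
  result.body.force_encoding(charset) if charset

  puts result.body.encoding   # UTF-8 for the example above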



-- 
http://redmine.ruby-lang.org