Issue #15933 has been updated by gareth (Gareth Adams).


Thanks Matthew,

I've now paid more attention to which RFCs are obsolete and which are still active.

phluid61 (Matthew Kerwin) wrote:
> > So it seems if you're making a change, it should be: ignore the protocol, and default to UTF-8 for `text/csv`.
> 
> Or rather: ignore the protocol; and consult the IANA registry to see what the individual `text/...` types have as their default, and use UTF-8 as a final fallback.  Which is unpleasant.

The [IANA registry](https://www.iana.org/assignments/media-types/media-types.xhtml#text) isn't in a machine-readable format. So even if it were acceptable to depend on a gem like [mime-types-data](https://github.com/mime-types/mime-types-data) as a curated source of these values (I realise stdlib can't depend on gems), that data isn't currently available there.

Looking through the registry manually, most text subtypes make no mention of a charset (either because they predate RFC6838 or because its recommendation to make charset required wasn't enforced) or specify UTF-8 explicitly. Only 5 (by my reading) mention a required charset parameter that is different to UTF-8, and in my opinion none of these are incompatible with using UTF-8 as a default.

* `text/sgml`: specifies US-ASCII, but references an obsoleted RFC [RFC1521] to justify that.
* `text/troff`: specifies US-ASCII, but cites "this will be the default 'US-ASCII'", and this specification predates RFC6838, which changed the default to UTF-8.
* `text/uri-list`: see below.
* `text/vnd.a`: specifies UTF-8 only "if 8 bit bytes are encountered", US-ASCII otherwise.
* `text/vnd.si.uricatalogue` [obsoleted by author request]: specifies US-ASCII always.

The [uri-list registration](https://www.iana.org/assignments/media-types/text/uri-list) states (as of 1999):

> Currently, URIs can be represented using US-ASCII. However, there
> are many non-standard URIs which use special character sets.
> Discussion of how to best achieve internationalization of URIs is
> underway. This registration will be updated with a discussion of the
> URI charsets once that discussion has concluded.

The registration was never updated, even though [IRIs](https://www.w3.org/International/articles/idn-and-iri/) were defined in 2005 by [RFC3987](https://tools.ietf.org/html/rfc3987) to use UTF-8 or the ASCII transformation Punycode.

It seems to me that changing the default to UTF-8 and extending the check to match "https" URIs is:

* Correct in all cases except for a minuscule number of edge cases
* Compatible even in those edge cases
* Overridable by defining exceptions inline (as opposed to using a dependency like mime-types-data) if anyone raises issues with this default

My suggestion that we could override it (e.g. with a Hash of `subtype => default_charset`) is only a contingency. There's no need for it at the moment, and since this code hasn't needed to change in nearly 20 years, I'm not worried that it is volatile.
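
The contingency described above could be sketched roughly as follows. This is an illustration only and not part of the attached patch: the names `TEXT_CHARSET_OVERRIDES` and `default_text_charset` are hypothetical, and the single table entry is just an example taken from the registry notes earlier in this comment.

```ruby
# Hypothetical override table (media type => default charset).
# The comment above argues that no entries are needed today; the single
# entry below is only an example, not a recommendation.
TEXT_CHARSET_OVERRIDES = {
  "text/vnd.si.uricatalogue" => "us-ascii",
}.freeze

# Returns the assumed default charset for a media type that arrived
# without an explicit charset parameter, or nil for non-text types.
def default_text_charset(media_type)
  type = media_type.to_s.strip.downcase
  return nil unless type.start_with?("text/")

  # UTF-8 is the final fallback for any text/* type not listed above.
  TEXT_CHARSET_OVERRIDES.fetch(type, "utf-8")
end

default_text_charset("text/csv")                 # falls back to "utf-8"
default_text_charset("text/vnd.si.uricatalogue") # uses the override
default_text_charset("image/png")                # nil: not a text type
```

Keeping the table inline like this would avoid any gem dependency, at the cost of having to update it by hand if a registration ever changes.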

If there are no objections, I'll follow up with a replacement patch using this as a plan.

----------------------------------------
Bug #15933: OpenURI: Assign default charset for HTTPS as well as HTTP
https://bugs.ruby-lang.org/issues/15933#change-78669

* Author: gareth (Gareth Adams)
* Status: Assigned
* Priority: Normal
* Assignee: akr (Akira Tanaka)
* Target version: 
* ruby -v: 
* Backport: 2.4: UNKNOWN, 2.5: UNKNOWN, 2.6: UNKNOWN
----------------------------------------
Using `open-uri` to load a document in the following circumstances:

* The `Content-Type` header is `text/*` and *doesn't* specify a charset, e.g. `Content-Type: text/csv`
* The document is loaded from an `https://` URL

will cause the resulting string to have `ASCII-8BIT` encoding.
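
For readers unfamiliar with what `ASCII-8BIT` means in practice, here is a small offline sketch. It does not call open-uri; the byte string merely stands in for a response body, and the variable names are made up for illustration.

```ruby
# Stand-in for a body returned with no charset applied: raw bytes tagged
# as ASCII-8BIT (binary). The bytes happen to be the UTF-8 form of "café".
body = "caf\xC3\xA9".b

body.encoding == Encoding::ASCII_8BIT  # true
body.length                            # 5: counts bytes, not characters

# If the caller knows the real charset, the same bytes can be relabelled.
text = body.dup.force_encoding(Encoding::UTF_8)
text.valid_encoding?                   # true
text.length                            # 4: now counts characters
```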

As the [documentation for OpenURI#charset](https://github.com/ruby/ruby/blob/trunk/lib/open-uri.rb#L538-L560) mentions, [RFC2616/3.7.1](https://tools.ietf.org/html/rfc2616#section-3.7.1) says:

> When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.

OpenURI takes this literally, only assigning ISO-8859-1 if `@base_uri.scheme` is *exactly* "http". This check was written [17 years ago](https://github.com/ruby/ruby/commit/3a20ed532b57da1e58287a5c53abe14400a085f4#diff-0f19cb99597e5fb90bfb937b22143b51R264) in 2002, before TLS 1.1 was even defined and well before HTTPS was common.

I believe this check should now also match the scheme "https". As [RFC2818/2](https://tools.ietf.org/html/rfc2818#section-2) says:

> Conceptually, HTTP/TLS is very simple. Simply use HTTP over TLS precisely as you would use HTTP over TCP
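
As a rough illustration of the proposed change (this is neither the attached patch nor a copy of the open-uri source), the scheme test could accept both schemes. The method name `implied_text_charset` and its arguments are assumptions made for this sketch; only the ISO-8859-1 default and the "http"/"https" schemes come from the text above.

```ruby
require "uri"

# Hypothetical helper sketching the default-charset decision described
# above; it is a simplified stand-in for the check in open-uri.
def implied_text_charset(content_type, base_uri)
  # Only text/* types get a default under RFC2616 3.7.1.
  return nil unless content_type.to_s.match?(%r{\Atext/}i)

  scheme = URI(base_uri.to_s).scheme
  # Accept "https" as well as "http", case-insensitively.
  return nil unless scheme && scheme.match?(/\Ahttps?\z/i)

  "iso-8859-1"
end

implied_text_charset("text/csv", "http://example.com/data.csv")  # "iso-8859-1"
implied_text_charset("text/csv", "https://example.com/data.csv") # "iso-8859-1"
implied_text_charset("text/csv", "ftp://example.com/data.csv")   # nil
```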

1. Is this a suitable change to make?

2. I have a patch to fix the functionality (attached). What else do I need to specify in terms of documentation/tests? I'm happy to put more work into this, but it's my first contribution to Ruby core and I'd like some pointers. I've read through https://bugs.ruby-lang.org/projects/ruby/wiki/HowToReport

---Files--------------------------------
ruby-changes.patch (1.21 KB)

