Issue #12852 has been updated by Matthew Kerwin.


Olivier Lacan wrote:
> 
> It's common for OAuth authentication flows to store a destination URI to return to when the handshake process is completed. This URI can be stored without first being processed by a web sever that will encode it in the way Rails does for submitted forms since it's not meant to be processed  that is until it comes back to the origin server.
> 

You keep using the word "URI" to refer to these data objects, and by specification a URI data object cannot contain non-ASCII characters (and even some ASCII characters are forbidden.) If we agree that `"haha\nlol"` is a String that cannot be parsed as a URI, we should agree the same for `"http://example.org/\u{2713}".force_encoding('UTF-8')`

> 
> I opened this due to an issue I encountered in an OAuth provider handshake procedure. You could argue that I should be expected to `URI.encode` any URI set as a destination query parameter to prevent this issue from occurring, surely.
> 

That's what I'm saying, but from the opposite direction. If you're storing a String, and want to ensure that it encodes a valid URI, you should `URI.encode` the parts before storing them in the String.

By analogy, if we replace "URI" with "JSON" in this discussion the same holds true: `"{\"foo\":\"\n\"}"` holds a String that looks a lot like JSON, but isn't valid [ [RFC 7159](https://tools.ietf.org/html/rfc7159#section-7)], and `JSON.parse` correctly raises an exception on it.

If a network peer is sending you a message that includes bytes that look like a URI but with UTF-8-encoded Unicode characters and *not* ASCII-compatible percent-encoded octets (i.e. it's sending an IRI), then one of two things is happening:

1. the protocol you're using is built on IRIs, not URIs, and you are responsible for any transformations to/from URIs (including `URI.encode`); or

2. the peer is in violation of a spec, and you should throw an error back at it. (In this case the specs are usually quiet clear on exactly what error to throw, too.)

> 
> Do you not agree that URI.parse should accept unicode entities in URIs? It wasn't clear from your response. 
> 

I think `URI.parse` correctly raises an exception when it encounters characters that are forbidden by RFC3986.

> 
> I'm not aware of any IRI-compatible API in MRI that could allow me to directly parse URIs containing non-ASCII characters with Ruby, whether they match the strict definition of a URI or not.

Your thinking here seems confused. If a String contains non-ASCII characters then it's not a URI. If it is a URI then it strictly matches the definition of a URI. If a String contains a valid IRI, then yeah, you're not going to get much help from Ruby; but IRIs are not commonly used in the real world anyway.

----------------------------------------
Bug #12852: URI.parse can't handle non-ascii URIs
https://bugs.ruby-lang.org/issues/12852#change-60941

* Author: Olivier Lacan
* Status: Open
* Priority: Normal
* Assignee: akira yamada
* ruby -v: 
* Backport: 2.1: UNKNOWN, 2.2: UNKNOWN, 2.3: UNKNOWN
----------------------------------------
Given a return URL path like: `/search?utf8=\u{2713}&q=foo`, `URI.parse` raises the following exception: 

```ruby
URI.parse "/search?utf8=\u{2713}&q=foo"
URI::InvalidURIError: URI must be ascii only "/search?utf8=\u{2713}&q=foo"
```

This `\u{2713}` character is commonly used by web frameworks like Rails to enforce UTF-8 in forms: https://github.com/rails/rails/blob/92703a9ea5d8b96f30e0b706b801c9185ef14f0e/actionview/lib/action_view/helpers/form_tag_helper.rb#L823-L830

```ruby
"\u{2713}"
=> ""
```

Is it unreasonable to expect non-ascii portion of URIs to be handled by URI.parse? The way to circumvent this issue is to call URI.encode on the URI string prior to passing it to URI.parse:

```ruby
URI.parse URI.encode("/search?utf8=\u{2713}&q=foo")
=> #<URI::Generic /search?utf8=%E2%9C%93&q=foo>
```

By comparison, a library like Addressable parses this URI without issue.

```
require "addressable/uri"
=> #<Addressable::URI:0x3feffa84158c URI:/search?utf8=&q=foo>
```

This is how Addressable implements parsing:
https://github.com/sporkmonger/addressable/blob/a15b7045a09911bcc47b106200554809c879a5f6/lib/addressable/uri.rb#L75-L145

PS: Tried under MRI 2.3.1 and 2.4.0-preview1



-- 
https://bugs.ruby-lang.org/

Unsubscribe: <mailto:ruby-core-request / ruby-lang.org?subject=unsubscribe>
<http://lists.ruby-lang.org/cgi-bin/mailman/options/ruby-core>