Issue #11522 has been updated by Martin Dürst.


Nobuyoshi Nakada wrote:
> It has no hints for encoding.

In theory, that's correct. In practice, there are several better possibilities.

1) We can add an additional parameter that indicates the encoding.

2) We can default to UTF-8. That's because most URIs that contain non-ASCII byte values these days are based on UTF-8, and their percentage is increasing steadily.

3) We can check whether using UTF-8 makes sense or not. If the bytes are valid UTF-8, then the chance that they are anything other than UTF-8 is virtually 0.
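
The three possibilities above could be combined in one routine. What follows is only a minimal sketch, not how the URI library works: `percent_decode` is a hypothetical helper name, and handing back the raw bytes as a binary string when they are not valid in the requested encoding is an assumption made for illustration.

~~~ruby
# Hypothetical helper, not part of the URI library.
# 1) the caller may pass an encoding, 2) it defaults to UTF-8,
# 3) the result is only tagged with that encoding if the bytes are valid in it.
def percent_decode(str, encoding = Encoding::UTF_8)
  # Work on a binary copy so that bytes >= 0x80 can be substituted safely.
  bytes = str.b.gsub(/%\h\h/) { |m| m[1, 2].hex.chr }
  decoded = bytes.dup.force_encoding(encoding)
  # Assumed fallback: return the undecorated bytes if the guess does not fit.
  decoded.valid_encoding? ? decoded : bytes
end

percent_decode("%C5%93~%C2%AC").valid_encoding?  # => true
~~~

With this sketch, `percent_decode("%E9")` would come back as a one-byte binary string rather than as an invalid UTF-8 string, and '+' is left untouched.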

1) and 2) are already done by CGI.unescape. But 3) isn't. Also, CGI.unescape changes '+' to ' ', which is desirable in some contexts (query parts in http(s) URIs), but not in others (e.g. mailto URIs).
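
As a small illustration of the '+' behavior and of the encoding parameter, here is a sketch using `CGI.unescape`. The mail address is a made-up example, and pre-escaping '+' as `%2B` is just one possible workaround, not something the library documents as recommended.

~~~ruby
require 'cgi'

# In a query part, '+' stands for a space, so this is what we want:
CGI.unescape("a+b%20c")                    # => "a b c"

# In a mailto URI, '+' is a literal character, so the same call loses it:
address = "user+tag@example.com"           # made-up example address
CGI.unescape(address)                      # => "user tag@example.com"

# One possible workaround: protect '+' before unescaping.
CGI.unescape(address.gsub("+", "%2B"))     # => "user+tag@example.com"

# The optional second argument selects the encoding of the result.
latin1 = CGI.unescape("%E9", "ISO-8859-1")
latin1.encoding                            # => #<Encoding:ISO-8859-1>
~~~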

----------------------------------------
Bug #11522: URI::decode returns incorrectly encoding strings
https://bugs.ruby-lang.org/issues/11522#change-54190

* Author: Charlie Anderson
* Status: Rejected
* Priority: Normal
* Assignee: akira yamada
* ruby -v: ruby 2.2.2p95 (2015-04-13 revision 50295) [x86_64-linux]
* Backport: 2.0.0: UNKNOWN, 2.1: UNKNOWN, 2.2: UNKNOWN
----------------------------------------
When given Unicode characters to encode and decode, the URI module returns a string with an invalid encoding.

~~~
irb(main):026:0* unicode = 'œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬'
=> "œ´å∑®´ß∂†≈©ƒç˙©√∆˙∫˚∆~¬"
irb(main):027:0> unicode.encoding
=> #<Encoding:UTF-8>
irb(main):028:0> unicode.valid_encoding?
=> true
irb(main):029:0> encoded = URI::encode(unicode)
=> "%C5%93%C2%B4%C3%A5%E2%88%91%C2%AE%C2%B4%C3%9F%E2%88%82%E2%80%A0%E2%89%88%C2%A9%C6%92%C3%A7%CB%99%C2%A9%E2%88%9A%E2%88%86%CB%99%E2%88%AB%CB%9A%E2%88%86~%C2%AC"
irb(main):030:0> encoded.encoding
=> #<Encoding:US-ASCII>
irb(main):031:0> encoded.valid_encoding?
=> true
irb(main):032:0> decoded = URI::decode(encoded)
=> "\xC5\x93\xC2\xB4\xC3\xA5\xE2\x88\x91\xC2\xAE\xC2\xB4\xC3\x9F\xE2\x88\x82\xE2\x80\xA0\xE2\x89\x88\xC2\xA9\xC6\x92\xC3\xA7\xCB\x99\xC2\xA9\xE2\x88\x9A\xE2\x88\x86\xCB\x99\xE2\x88\xAB\xCB\x9A\xE2\x88\x86~\xC2\xAC"
irb(main):033:0> decoded.encoding
=> #<Encoding:US-ASCII>
irb(main):034:0> decoded.valid_encoding?
=> false
~~~

I would expect decoded to have a valid encoding - probably as UTF-8?


