Issue #15033 has been reported by stevecheckoway (Stephen Checkoway).

----------------------------------------
Bug #15033: Encoding fallback uses wrong character when multiple conversions are required
https://bugs.ruby-lang.org/issues/15033

* Author: stevecheckoway (Stephen Checkoway)
* Status: Open
* Priority: Normal
* Assignee: 
* Target version: 
* ruby -v: ruby 2.5.1p57 (2018-03-29 revision 63029) [x86_64-darwin17]
* Backport: 2.3: UNKNOWN, 2.4: UNKNOWN, 2.5: UNKNOWN
----------------------------------------
When converting a string from one encoding to another requires multiple intermediate conversions, and one of the later conversions fails, the proc passed to `encode` as `:fallback` is called with the character in the intermediate encoding rather than the original character.

For example,

```irb
> "\u016f".encode('ISO-2022-JP', fallback: proc { |c| "&\#x#{c.ord.to_s(16)};" }).encode('UTF-8')
=> "&#x8fabeb;"
```

Here, the ordinal passed to the proc was 0x8fabeb rather than 0x16f.

If I use the `:xml` option instead of `:fallback`, I get the expected result:

```irb
> "\u016f".encode('ISO-2022-JP', xml: :text).encode('UTF-8')
=> "&#x16F;"
```

The cause seems clear. Converting from UTF-8 to ISO-2022-JP proceeds in three steps: from UTF-8 to EUC-JP, then to stateless-ISO-2022-JP, then to ISO-2022-JP. The first conversion succeeds:

```irb
> "\u016f".encode('EUC-JP')
=> "\x{8FABEB}"
```

but the second fails:

```irb
> "\u016f".encode('EUC-JP').encode('stateless-ISO-2022-JP')
Traceback (most recent call last):
        3: from /opt/local/bin/irb2.5:11:in `<main>'
        2: from (irb):10
        1: from (irb):10:in `encode'
Encoding::UndefinedConversionError ("\x8F\xAB\xEB" from EUC-JP to stateless-ISO-2022-JP)
```
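
The chain of intermediate conversions described above can be inspected directly. The following is a minimal sketch using `Encoding::Converter.search_convpath`, which returns the list of `[source, destination]` encoding pairs Ruby chains together. The variable name `path` is only illustrative, and the exact steps printed may depend on the Ruby build.

```ruby
# Ask Ruby which transcoding steps it chains together for this conversion.
path = Encoding::Converter.search_convpath('UTF-8', 'ISO-2022-JP')

# Each element is a [source_encoding, destination_encoding] pair.
path.each_with_index do |(from, to), i|
  puts "step #{i + 1}: #{from} -> #{to}"
end
```

If the path has more than one step, a failure in any step after the first happens on bytes that are no longer in the original encoding.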

In this situation, I believe the proc passed to `encode` should be called with the original character that could not be converted, not an intermediate representation of it.
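
Until that is the case, one possible workaround (a sketch only, not something proposed in this report) is to test each character individually and build the replacement from the original character before the full conversion runs. The helper name `encode_with_char_refs` is hypothetical, and the approach assumes the input is valid in its source encoding; it trades speed for simplicity.

```ruby
# Hypothetical helper: replace every character that cannot be converted to
# +target+ with a hexadecimal character reference built from the ORIGINAL
# code point, then convert the whole string in one pass.
def encode_with_char_refs(str, target)
  safe = str.each_char.map do |ch|
    begin
      ch.encode(target) # trial conversion of a single character
      ch                # convertible: keep the original character
    rescue Encoding::UndefinedConversionError
      "&#x#{ch.ord.to_s(16)};" # not convertible: use the original ordinal
    end
  end.join

  safe.encode(target)
end

result = encode_with_char_refs("a\u016fb", 'ISO-2022-JP')
puts result.encode('UTF-8')
```

Because the reference is computed from `ch` itself, the intermediate EUC-JP bytes never reach the replacement logic.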


