Issue #12052 has been updated by duerst (Martin Dürst).


jeremyevans0 (Jeremy Evans) wrote in #note-3:

> It looks like this issue occurs when using both multibyte source and destination encoding.  If either the source or destination encoding is not multibyte, the issue doesn't occur:
> 
> ```ruby
> # Multibyte source, single-byte destination
> "<\0>\0".encode("utf-8", "utf-16le", xml: :text).bytes
> => [38, 108, 116, 59, 38, 103, 116, 59]
> 
> # Single-byte source, multibyte destination
> "<>".encode("utf-16le", "utf-8", xml: :text).bytes
> => [38, 0, 108, 0, 116, 0, 59, 0, 38, 0, 103, 0, 116, 0, 59, 0]
> 
> # Multibyte source, multibyte destination
> "<\0>\0".encode("utf-16le", "utf-16le", xml: :text).bytes
> => [38, 108, 116, 59, 0, 38, 103, 116, 59, 0]
> ``` 

True, except that usually the term "multibyte encoding" includes encodings such as UTF-8, and we are speaking here about encodings with code units longer than one byte.

But thinking about it, it may also include encodings such as EBCDIC (IBM037), Shift_JIS, and ISO-2022-JP. For EBCDIC, I get:
```Ruby
"<>".encode("IBM037")
=> "\x4C\x6E"
"<>".encode("IBM037").encode("IBM037", xml: :text)
=> "\x4C\x6E"
"<>".encode("IBM037").force_encoding("US-ASCII")
=> "Ln"
```
This is explained rather easily: '<' and '>' are \x4C and \x6E in EBCDIC, but because the `xml: :text` processing runs in ASCII, these are interpreted as 'L' and 'n' and left alone. Shift_JIS actually seems safe, because the characters to be converted are encoded as plain ASCII, and because they all fall into the range 0x20..0x3F, which isn't used as a second byte in Shift_JIS.
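
As a rough check of the Shift_JIS argument, the sketch below compares the ASCII values of the characters that the `xml:` options rewrite against the Shift_JIS trail-byte ranges. The ranges 0x40..0x7E and 0x80..0xFC are written in by hand (an assumption about the usual Shift_JIS definition, not something queried from Ruby), and the constant and method names are made up for illustration.

```ruby
# ASCII values of the characters that xml: :text and xml: :attr rewrite.
XML_SPECIAL_BYTES = ["&", "<", ">", "\""].map(&:ord) # 0x26, 0x3C, 0x3E, 0x22

# Assumed Shift_JIS trail-byte ranges (hand-written for this sketch).
SJIS_TRAIL_RANGES = [0x40..0x7E, 0x80..0xFC]

# True if the byte could appear as the second byte of a Shift_JIS character.
def sjis_trail_byte?(byte)
  SJIS_TRAIL_RANGES.any? { |range| range.cover?(byte) }
end

p XML_SPECIAL_BYTES.all? { |b| (0x20..0x3F).cover?(b) } # => true
p XML_SPECIAL_BYTES.any? { |b| sjis_trail_byte?(b) }    # => false
```
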

For ISO-2022-JP, we are not so lucky. Take the string `"<>湿"` (the Kanji stands for 'wet', which is appropriate here in Japan because we are in the Rainy Season now :-; it's there because in ISO-2022-JP, its value is encoded with the same bytes as "<>", after switching with the necessary escape sequences):

```Ruby
"<>湿".encode("ISO-2022-JP")
=> "\x3C\x3E\e\x24\x42\x3C\x3E\e\x28\x42"
"<>湿".encode("ISO-2022-JP").force_encoding("US-ASCII")
=> "<>\e$B<>\e(B"
"<>湿".encode("ISO-2022-JP").encode("ISO-2022-JP", xml: :text)
=> "\x26\x6C\x74\x3B\x26\x67\x74\x3B\e\x24\x42\x26\x6C\x74\x3B\x26\x67\x74\x3B\e\x28\x42"
"<>湿".encode("ISO-2022-JP").encode("ISO-2022-JP", xml: :text).force_encoding("US-ASCII")
=> "&lt;&gt;\e$B&lt;&gt;\e(B"
```
Trying to further transcode the result of `encode("ISO-2022-JP", xml: :text)` leads to an encoding error. 
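
To make the collision easier to see, here is a small sketch that strips the escape sequences from the ISO-2022-JP form of the Kanji and compares what is left with ASCII "<>". The helper name is hypothetical, and it assumes the input is a single run of JIS X 0208 characters, so that the result starts with `ESC $ B` and ends with `ESC ( B` as in the output above.

```ruby
# The Kanji from the example above, written as an escape so that the
# snippet does not depend on the source file encoding.
WET = "\u6E7F"

# Hypothetical helper: returns the raw bytes between the leading ESC $ B
# and the trailing ESC ( B, assuming a single JIS X 0208 run.
def jis_payload(str)
  raw = str.encode("ISO-2022-JP").b
  raw[3...-3]
end

p jis_payload(WET)            # the two bytes that encode the Kanji
p jis_payload(WET) == "<>".b  # true if they match ASCII "<>"
```

Any byte-oriented `xml: :text` pass therefore cannot tell these two bytes apart from a real "<>".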

> So a possible way to work around the issue until it can be properly fixed would be to detect the case where both source and destination are multibyte, switch the destination to UTF-8, then encode the result of that to the desired destination encoding.

The condition seems to be slightly narrower. Even if both encodings have code units of more than one byte, things work as long as the encodings are not the same, most probably because these cases already get transcoded via UTF-8:
```Ruby
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-16BE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-32BE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-32LE", xml: :text)
=> "&lt;&gt;"
"<\0>\0".force_encoding("UTF-16LE").encode("UTF-16LE", xml: :text)
=> "\u6C26\u3B74\u2600\u7467;"
```
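
The workaround quoted above could be sketched roughly as follows. This is only an illustration, not the patch under discussion: the helper name is hypothetical, and for simplicity it always goes through UTF-8 instead of first detecting the problematic case. It relies on the observation from earlier in this thread that `xml: :text` behaves correctly when the destination is UTF-8.

```ruby
# Hypothetical helper: apply xml: :text on the way to UTF-8, then do a
# plain transcode to the wanted destination (the source encoding by default).
def xml_text_via_utf8(str, dst = str.encoding)
  str.encode(Encoding::UTF_8, xml: :text).encode(dst)
end

src = "<\0>\0".dup.force_encoding("UTF-16LE")
p xml_text_via_utf8(src).bytes
# => [38, 0, 108, 0, 116, 0, 59, 0, 38, 0, 103, 0, 116, 0, 59, 0]
```

The extra round trip costs a second transcoding pass, but it avoids the same-encoding path that produces the wrong result.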

I'll have a look at your patch later, but just wanted to get this out. Sorry to be quicker with encodings than with the actual code :-(.

----------------------------------------
Bug #12052: String#encode with xml option returns wrong result for totally non-ASCII-compatible encodings
https://bugs.ruby-lang.org/issues/12052#change-92654

* Author: nobu (Nobuyoshi Nakada)
* Status: Open
* Priority: Normal
* Assignee: akr (Akira Tanaka)
* Backport: 2.0.0: REQUIRED, 2.1: REQUIRED, 2.2: REQUIRED, 2.3: REQUIRED
----------------------------------------
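`String#encode` returns a strange result when it is called with the `xml:` option to convert from an ASCII-incompatible encoding to the same encoding.
It appears to convert the string as if it were binary.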

```ruby
p "<\0>\0".encode("utf-16le", "utf-16le", xml: :text)
#=> "\u6C26\u3B74\u2600\u7467;"
```



-- 
https://bugs.ruby-lang.org/