Issue #15995 has been updated by duerst (Martin Dürst).


Issue #15931 mentions both https://www.unicode.org/reports/tr26/tr26-4.html and https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/types.html#modified_utf_8_strings as definitions of CESU-8, but they are not identical.

The difference is in how they treat U+0000 (NULL) characters: UTR 26 does not treat it in any special way (i.e. it is encoded as "\x00"), but the Java definition treats it specially, encoding it as "\xC0\x80". The IANA registration refers to the Unicode definition (see https://www.iana.org/assignments/charset-reg/CESU-8). UTR 26 explains that "CESU-8 is useful in 8-bit processing environments where binary collation with UTF-16 is required." For this to work, U+0000 has to be encoded as "\x00".
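
To make the difference concrete, here is a minimal sketch of a CESU-8 encoder in plain Ruby. The method name `to_cesu8_bytes` and its `java_modified:` keyword are hypothetical, introduced only for illustration; this is not how Ruby implements the encoding. It encodes supplementary characters as two 3-byte surrogate sequences and, only when asked, encodes U+0000 the way the Java definition does.

```ruby
# Hypothetical helper (not part of Ruby): encode a string as CESU-8 bytes.
# With java_modified: true, U+0000 becomes "\xC0\x80" as in the Java definition;
# otherwise it stays "\x00" as in UTR 26.
def to_cesu8_bytes(str, java_modified: false)
  out = String.new(encoding: Encoding::BINARY)
  str.encode(Encoding::UTF_8).each_codepoint do |cp|
    if cp == 0 && java_modified
      out << "\xC0\x80".b
    elsif cp > 0xFFFF
      # Supplementary character: split into a UTF-16 surrogate pair and
      # encode each surrogate as its own 3-byte sequence.
      v = cp - 0x10000
      [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)].each do |s|
        out << [0xE0 | (s >> 12), 0x80 | ((s >> 6) & 0x3F), 0x80 | (s & 0x3F)].pack("C*")
      end
    else
      # BMP character: identical to ordinary UTF-8.
      out << [cp].pack("U").b
    end
  end
  out
end

# Byte-wise comparison of the results follows UTF-16 code unit order:
# U+10000 (D800 DC00 in UTF-16) sorts before U+FFFF.
to_cesu8_bytes("\u{10000}") < to_cesu8_bytes("\u{FFFF}")

# With the Java treatment, U+0000 sorts after U+0001, which breaks that order.
to_cesu8_bytes("\u0000", java_modified: true) > to_cesu8_bytes("\u0001")
```

The last two comparisons show why the treatment of U+0000 matters for binary collation: encoding it as "\xC0\x80" moves it out of the position it has in UTF-16.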

Issue #15931 currently implements CESU-8 as defined in UTR 26:

```
$ ruby -e 'puts "\xC0\x80".force_encoding("cesu-8").valid_encoding?'
false

$ ruby -e 'puts "\x00".force_encoding("cesu-8").valid_encoding?'
true
```

It is unclear whether this is what the originator of issue #15931 wanted; his use case seems to be Java.

----------------------------------------
Feature #15995: Add encoding conversion for CESU-8 from and to UTF-8
https://bugs.ruby-lang.org/issues/15995#change-79294

* Author: duerst (Martin Dürst)
* Status: Open
* Priority: Normal
* Assignee: duerst (Martin Dürst)
* Target version: 

----------------------------------------
As discussed in issue #15931, encoding conversion (transcoding) from/to CESU-8 is missing, so we should add it. We can then hopefully make CESU-8 a dummy encoding.
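
As a rough illustration of what the CESU-8 to UTF-8 direction involves, the following pure-Ruby sketch recombines 3-byte surrogate pairs into supplementary characters and copies all other bytes unchanged. The method name `cesu8_to_utf8` is hypothetical, and the sketch follows the UTR 26 definition (no special handling of "\xC0\x80"); it is not the transcoder proposed for Ruby itself, and it does not validate its input beyond recognizing surrogate pairs.

```ruby
# Hypothetical helper (not part of Ruby): convert CESU-8 bytes to a UTF-8 string.
def cesu8_to_utf8(bytes)
  b = bytes.bytes
  out = String.new(encoding: Encoding::BINARY)
  i = 0
  while i < b.size
    # A surrogate pair is ED A0..AF xx followed by ED B0..BF xx,
    # where xx is a continuation byte (80..BF).
    if i + 5 < b.size &&
       b[i] == 0xED && (0xA0..0xAF).cover?(b[i + 1]) && (0x80..0xBF).cover?(b[i + 2]) &&
       b[i + 3] == 0xED && (0xB0..0xBF).cover?(b[i + 4]) && (0x80..0xBF).cover?(b[i + 5])
      hi = 0xD000 | ((b[i + 1] & 0x3F) << 6) | (b[i + 2] & 0x3F)
      lo = 0xD000 | ((b[i + 4] & 0x3F) << 6) | (b[i + 5] & 0x3F)
      cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
      out << [cp].pack("U").b   # emit the 4-byte UTF-8 form
      i += 6
    else
      out << b[i].chr           # everything else is already UTF-8
      i += 1
    end
  end
  out.force_encoding(Encoding::UTF_8)
end
```

A caller can check the result with `valid_encoding?`, since malformed input such as an unpaired surrogate is passed through as-is.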


