Nobuyoshi Nakada wrote:
> Hi,
> 
> At Tue, 30 Oct 2007 15:30:30 +0900,
> Martin Duerst wrote in [ruby-core:13082]:
>> Please don't. If you really want, you might use \x{...} for a big-
>> endian representation of the underlying byte sequence for all encodings,
>> including UTF-8. This would mean e.g. the following:
> 
> In [ruby-dev:16603], Matz said that `codepoint isn't a byte
> representation but is a "number"'.
> 
>> Directly encoded string: "中田 伸悦"
>>
>> Using \x for UTF-8: "\xE4\xB8\xAD\xE7\x94\xB0 \xE4\xBC\xB8\xE6\x82\xA6"
>> Using \x for Shift_JIS: "\x92\x86\x93\x63 \x90\x4c\x89\x78"
>>
>> Using \x{...} for UTF-8: "\x{E4B8AD}\x{E794B0} \x{E4BCB8}\x{E682A6}"
>> Using \x{...} for Shift_JIS: "\x{9286}\x{9363} \x{904c}\x{8978}"
>>
>> Using \u (currently only UTF-8): "\u4E2D\u7530 \u4F38\u60A6"
>> Using \u (in the future potentially for Shift_JIS and others):
>>                                  "\u4E2D\u7530 \u4F38\u60A6"
> 
> Rather, "\x{4366 4544} \x{3f2d 3159}" for both of Shift_JIS and
> EUC-JP which are based on JIS0212, and "\x{4E2D 7530} \x{4F38
> 60A6}" for UTF-8, I'd expect.

So the \x{...} escape identifies the codepoint, but not the character
set or encoding it belongs to. Its interpretation as a sequence of bytes
must therefore depend on the current primary encoding (or perhaps on the
script encoding). So the same string literal can have different meanings
when run in different locales (or when cut-and-pasted between programs).
The \u escape is very different: it always names a Unicode codepoint, so
the string literal always comes out right even if the script is run in a
different locale or the encoding of the script itself is changed.
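
To make the distinction concrete, here is a minimal sketch. It assumes
the String#encode and String#force_encoding methods as they exist in
Ruby 1.9 and later, and a UTF-8 source file; NAME and hex_bytes are
names made up for this example. It shows that a \u literal pins down the
characters, while raw \x bytes only mean something once an encoding is
chosen for them.

```ruby
# NAME and hex_bytes are illustrative names, not part of any proposal.
NAME = "\u4E2D\u7530"   # Unicode codepoints U+4E2D U+7530

# Render a string's bytes as space-separated upper-case hex.
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

# The same two characters, three different byte sequences.
puts hex_bytes(NAME.encode("UTF-8"))      # E4 B8 AD E7 94 B0
puts hex_bytes(NAME.encode("Shift_JIS"))  # 92 86 93 63
puts hex_bytes(NAME.encode("EUC-JP"))     # C3 E6 C5 C4

# Raw byte escapes carry no character identity of their own: these four
# bytes spell NAME only when they are interpreted as Shift_JIS.
RAW = "\x92\x86\x93\x63"
puts RAW.dup.force_encoding("Shift_JIS").encode("UTF-8") == NAME
puts RAW.dup.force_encoding("UTF-8").valid_encoding?
```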

If I'm understanding this correctly, it seems like these \x{} escapes
would be very dangerous.  (Though I suppose you could argue that they
are no more dangerous or non-portable than regular \x byte escapes.)

I think I can understand how this \x{...} proposal would be useful for
JIS X 0208 codepoints: you could create string literals that would be
portable to both EUC-JP and Shift_JIS encodings.  Is that the use case
that is driving this?  Would Japanese programmers actually write their
string literals that way?  If so, why not call this \j (for JIS) or \k
(for Kanji) instead of \x?  And if it is renamed to \j, it should
probably cause an error when used with a primary encoding that is not
based on JIS X 0208.
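
As a rough sketch of what such a \j escape would have to do, the
following helper maps a two-byte JIS codepoint (like the 4366 and 4544
values quoted above) to bytes for a given target encoding and raises for
anything else. The name jis_escape and its signature are hypothetical;
nothing like it exists in Ruby. The arithmetic is the usual JIS to
EUC-JP and JIS to Shift_JIS conversion, written for Ruby 1.9 or later.

```ruby
# Hypothetical helper illustrating a "\j"-style escape; not a Ruby API.
# code is a two-byte JIS codepoint such as 0x4366.
def jis_escape(code, encoding)
  j1 = code >> 8
  j2 = code & 0xFF
  case encoding
  when "EUC-JP"
    # EUC-JP sets the high bit of each JIS byte.
    [j1 | 0x80, j2 | 0x80].pack("C*").force_encoding("EUC-JP")
  when "Shift_JIS"
    # Shift_JIS folds two JIS rows into one lead byte.
    s1 = ((j1 + 1) >> 1) + (j1 < 0x5F ? 0x70 : 0xB0)
    s2 = if j1.odd?
           j2 + (j2 < 0x60 ? 0x1F : 0x20)  # skips the 0x7F slot
         else
           j2 + 0x7E
         end
    [s1, s2].pack("C*").force_encoding("Shift_JIS")
  else
    # The error case suggested above: encoding is not JIS-based.
    raise ArgumentError, "JIS codepoint has no meaning in #{encoding}"
  end
end

# One codepoint, two byte sequences, same character in both.
sjis = jis_escape(0x4366, "Shift_JIS")
euc  = jis_escape(0x4366, "EUC-JP")
puts sjis.encode("UTF-8") == euc.encode("UTF-8")
```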

	David