At 16:46 07/10/29, Nobuyoshi Nakada wrote:
>Hi,
>
>At Mon, 29 Oct 2007 14:11:45 +0900,
>David Flanagan wrote in [ruby-core:13032]:
>> >> In a related matter, Should String.inspect be modified to support \u 
>> >> escapes?  Perhaps when the primary encoding is not unicode it should do 
>> >> this?
>> > 
>> > \x for other encodings?  It means rb_encoding has to know which
>> > escape to be used.
>> > 
>> 
>> Hmm.  I assumed that it would be the inspect method that generated the 
>> escapes, not the encoding objects. But yes, I was being Unicode-centric, 
>> since that is the only encoding we're considering a special escape for...
>
>Instead, how about to use \x for all encodings, with making
>\x{...} to represent codepoint?

Please don't. If you really want \x{...}, use it for the big-endian
representation of a character's underlying byte sequence, in every
encoding, including UTF-8. That would give, for example, the following:

Directly encoded string: "中田 伸悦"

Using \x for UTF-8: "\xE4\xB8\xAD\xE7\x94\xB0 \xE4\xBC\xB8\xE6\x82\xA6"
Using \x for Shift_JIS: "\x92\x86\x93\x63 \x90\x4C\x89\x78"

Using \x{...} for UTF-8: "\x{E4B8AD}\x{E794B0} \x{E4BCB8}\x{E682A6}"
Using \x{...} for Shift_JIS: "\x{9286}\x{9363} \x{904C}\x{8978}"

Using \u (currently only UTF-8): "\u4E2D\u7530 \u4F38\u60A6"
Using \u (in the future potentially for Shift_JIS and others):
                                 "\u4E2D\u7530 \u4F38\u60A6"

As you can see, and as discussed earlier, \x{} is very shallow syntactic
sugar over the actual binary representation, and therefore not really
necessary. It is slightly more readable than a sequence of \x bytes,
but I don't think that matters much, because I don't expect it to be
used very often: most people who work in a specific legacy encoding have
the fonts and editing tools they need to enter the characters directly.
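
To make "shallow syntactic sugar" concrete, here is a rough sketch that
produces both \x forms and then turns one into the other by text
substitution alone. All three helper names are invented for illustration,
\x{...} is only the notation proposed in this thread, and the sketch
assumes a Ruby in which String#encode conversion is available.

```ruby
# Illustrative helpers only: none of these exist in Ruby, and \x{...} is
# the notation proposed in this thread, not real Ruby syntax.

# One \xHH per byte of each non-ASCII character.
def x_escapes(str)
  str.each_char.map { |c|
    c.ascii_only? ? c : c.bytes.map { |b| format('\\x%02X', b) }.join
  }.join
end

# One \x{...} per non-ASCII character, holding its bytes in order.
def x_brace_escapes(str)
  str.each_char.map { |c|
    hex = c.bytes.map { |b| format('%02X', b) }.join
    c.ascii_only? ? c : format('\\x{%s}', hex)
  }.join
end

# Purely textual expansion of \x{...} into plain \x escapes.
def expand_x_braces(text)
  text.gsub(/\\x\{((?:\h\h)+)\}/) do
    Regexp.last_match(1).scan(/\h\h/).map { |pair| "\\x#{pair}" }.join
  end
end

utf8 = "\u4E2D\u7530 \u4F38\u60A6"   # the example string, as UTF-8
sjis = utf8.encode('Shift_JIS')      # same characters, Shift_JIS bytes

[utf8, sjis].each do |s|
  braced = x_brace_escapes(s)
  puts braced                                   # the \x{...} form
  puts expand_x_braces(braced)                  # the plain \x form
  puts expand_x_braces(braced) == x_escapes(s)  # same information
end
```

No knowledge of the encoding is needed for the expansion step, which is
the sense in which \x{...} adds nothing beyond grouping.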

\u, on the other hand, is something different: it identifies the same
character across potentially many different encodings. That makes it
great for writing scripts that may be run in different encodings
with the same character semantics. This is not yet possible, but as soon
as we have conversion, it won't be too difficult to implement. For UTF-8
it is already a huge help, because it makes the computer do the code
point to byte calculations instead of dumping them on humans.
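
A minimal sketch of both points, assuming conversion between encodings
is available as String#encode. The helpers u_escapes and hex_bytes are
made up for this example and are not part of Ruby. The bytes differ per
encoding, the \u form does not, and the last lines show the arithmetic
that \u saves a human from doing for a three-byte UTF-8 character.

```ruby
# Hypothetical helper, not part of Ruby: one \uHHHH per non-ASCII
# character, taken from the Unicode code point after conversion.
def u_escapes(str)
  str.encode('UTF-8').each_char.map { |c|
    c.ascii_only? ? c : format('\\u%04X', c.ord)
  }.join
end

# Hypothetical helper: the raw bytes of a string as spaced hex.
def hex_bytes(str)
  str.bytes.map { |b| format('%02X', b) }.join(' ')
end

utf8 = "\u4E2D\u7530 \u4F38\u60A6"   # the example string, as UTF-8
sjis = utf8.encode('Shift_JIS')      # same characters, Shift_JIS bytes

puts hex_bytes(utf8)   # => E4 B8 AD E7 94 B0 20 E4 BC B8 E6 82 A6
puts hex_bytes(sjis)   # => 92 86 93 63 20 90 4C 89 78
puts u_escapes(utf8)   # => \u4E2D\u7530 \u4F38\u60A6
puts u_escapes(sjis)   # => \u4E2D\u7530 \u4F38\u60A6

# The calculation \u4E2D hides: UTF-8 bytes for a code point in the
# range U+0800..U+FFFF (three-byte form).
code_point = 0x4E2D
by_hand = [
  0xE0 | (code_point >> 12),
  0x80 | ((code_point >> 6) & 0x3F),
  0x80 | (code_point & 0x3F)
]
puts by_hand == "\u4E2D".bytes   # => true
```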

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp