At 06:31 07/10/26, David Flanagan wrote:
>Yukihiro Matsumoto wrote:
>> Hi,
>> In message "Re: \u escapes in string literals: proof of concept   implementation"
>>     on Tue, 23 Oct 2007 16:53:57 +0900, David Flanagan <david / davidflanagan.com> writes:
>> |I like the \Uxxxxxx escape instead of \u{}.  Would you consider this, Matz?
>> Actually I hate counting digits.  When I am forced to put sufficient
>> number of preceding zeros to specify non-BMP character, I'd go mad.
>> Is there any reason \U<8ditits> is better than \u{}?  If it's
>> sufficient reason, it's OK to allow \U as well.
>>                                                      matz.
>
>I've been meaning to ask you about the 8 digits.  Unicode only uses 6 digits currently: the highest allowed codepoint is 10FFFF.  So even if Unicode grew to have 16 times the number of codepoints, 6 hex digits would still be enough.

As for \U, I think we should stay with the forms we have currently,
and we can still introduce new ones if there is a great demand.

Please note that although the codepoints with five or six hex digits
are in the majority when compared to those with only four digits,
most of that area is completely empty, and the characters assigned
there are extremely rare, so the chance of actually using one of these
escapes is extremely low.
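
To make the two forms concrete, here is a tiny sketch (this assumes the
\u / \u{} syntax from the proof-of-concept implementation this thread is
about, so read it as an illustration rather than final behavior):

  snowman = "\u2603"      # BMP character: exactly four hex digits
  g_clef  = "\u{1D11E}"   # supplementary character: braces, no zero-padding
  p snowman, g_clef       # both are just ordinary one-character strings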

Also, please note that at the speed the Unicode consortium and
the corresponding ISO WG are working, it will take centuries to
fill up all that space, if ever. In my opinion, this space will
only ever be filled up if we get invaded by some extraterrestrial
beings that happen to have a writing system with far more characters
than the most complex writing system on Earth (Han ideographs).


>What I was proposing was \U with exactly 6 digits after it.  And you'd only use it for those rare codepoints with 5 or 6 digits.  Without the curly braces it is shorter.  I don't actually feel strongly about \u{} versus \U, however.  And reducing the number of special characters after the backslash is probably a good thing.
>
>Unless I'm missing the point, however, I don't think there is any reason to allow 4-byte codepoints.  I read somewhere that although the UTF-8 encoding scheme can be extended to encode 32 bits in 6 bytes, this is actually forbidden by the UTF-8 spec.  (I haven't verified that, but I think I saw it on Wikipedia.)  So if Ruby allows \u{xxxxxxxx} (8 hex digits) it will generate invalid codepoints in an illegal extension of UTF-8.

Yes indeed. You don't need to go to Wikipedia for this; you can
go straight to the source. As I posted earlier, it's in Section 3.9
of the Unicode Standard. Look at pages 103 and 104 of
http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf.
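
To put that limit in Ruby terms, here is a small sketch (this assumes the
\u{} escape syntax discussed in this thread and Ruby 1.9's String API, so
take it as an illustration rather than a description of current behavior):

  p "\u{10FFFF}".bytesize   # => 4; the highest legal codepoint needs four bytes
  # "\u{110000}"            # no such UTF-8 sequence exists per Section 3.9,
                            # so an escape like this should simply be rejected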

I'm very sure that ISO 10646 will adjust to this as well,
and probably already has done so. I can contact the
Convener of the responsible WG or the editor of the
spec directly if you want; I know both of them well.

The IETF also has RFC 3629, and again, the definition is
the same as in Unicode, although it's worded differently.
(http://www.ietf.org/rfc/rfc3629.txt; please note that
this is a full IETF Standard, which is very rare in the
IETF.)

In Ruby, I have found relevant code in pack.c and in enc/utf8.c.
The decoding code in pack.c is slightly better than the rest
in that it checks and rejects overlong sequences, but otherwise,
all the code is in pretty bad shape with respect to the standards.
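
For reference, this is the kind of overlong sequence I mean: "\xC0\x80" is
a two-byte "encoding" of U+0000, which UTF-8 forbids. A sketch of how the
pack.c decoder treats it (the exact error message may differ):

  begin
    "\xC0\x80".unpack("U*")
  rescue ArgumentError => e
    puts "rejected: #{e.message}"   # the decoder refuses the overlong form
  end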

I'd gladly provide patches for both the above files, and anything
else if necessary. pack.c is really simple. enc/utf8.c is a bit
more complex, because it essentially belongs to Oniguruma, and
because it even has #defines to pass through 0xFE and 0xFF bytes
if one wishes to do so (I hope Ruby doesn't allow these).
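
0xFE and 0xFF can never occur anywhere in well-formed UTF-8, so with proper
checks in place something like the following should fail validation (this
assumes the 1.9 String#force_encoding / #valid_encoding? API, which is
still settling, so the sketch is only indicative):

  bad = "\xFE\xFF".force_encoding("UTF-8")
  p bad.valid_encoding?   # => false; these bytes have no place in UTF-8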

I think we should go ahead and tighten these pieces of code, because
some of these limitations are related to security issues, and we
should make sure we follow the standards. If bytes not conforming
to the standards accidentally creep in, then this is a mistake anyway.
If somebody wants to include such bytes on purpose, they have to
use the binary encoding.
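
That escape hatch might look like this (again assuming the 1.9 encoding API):

  raw = "\xFE\xC0\x80".force_encoding("ASCII-8BIT")
  p raw.valid_encoding?   # => true; binary data, no UTF-8 rules applied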

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp