Matz:
>|>I'm not going to choose UCS-2.  UCS-2 is obsolete.

Schulman:
>|Do you mean that it has been superseded by UTF-16? Or what?

Matz:
>That's what I mean.

Good. Both UCS-2 and UTF-16 have the same 16-bit encoding
for the 49,194 presently defined characters used in most of
the languages of the world. UTF-16 is a superset of UCS-2,
adding the possibility of surrogates. Just out of
curiosity, though, how important is the surrogate extension
to users in Japan?
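
For reference, a surrogate pair carries one supplementary
character in two 16-bit code units. A sketch of what that
looks like, assuming a Ruby with encoding-aware strings
(String#encode is an assumption here, not the Ruby of
today; U+20BB7 is an arbitrary character above U+FFFF):

    ch = [0x20BB7].pack("U")          # build the character (UTF-8)
    utf16 = ch.encode("UTF-16BE")
    utf16.bytesize                    # => 4: two 16-bit units
    utf16.unpack("n*").map { |u| "%04X" % u }
                                      # => ["D842", "DFB7"], the
                                      #    high/low surrogate pair

Any character inside the BMP occupies a single 16-bit unit
instead.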

Matz:
>|>But I'm going to add M17N feature to the next version Ruby.
>|>The future Ruby should handle Unicode as well as other encodings.

What exactly is the "M17N feature" that you plan to add? 

Matz:
>Unicode 3.0 is really an improvement.  Most Japanese can accept it
>except for its time and space efficiency.
>...
>    By using UTF-8, most Japanese characters take 3 bytes each.  It
>    would be 1.5 times bigger than current.  Imagine all of your text
>    data growing 50% bigger.

I agree. I'm not partial to UTF-8 either. In my earlier
post, I recommended UCS-2, which is a two-byte encoding for
both the Western languages and the CJK languages. As far as
DBCS Japanese goes, UCS-2 introduces no changes in storage
or processing requirements. The same is true for the
superset UTF-16, assuming surrogates are not required.

In converting to UTF-16, it's the Western languages that
would suffer a "hit" in terms of storage and processing
time. UTF-8, accordingly, will probably remain common in
Western end-user shops for some time to come, but not, I
hope, as the internal encoding of system software.
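
The tradeoff in both directions is easy to quantify. A
sketch, again assuming encoding-aware Ruby strings (the
literals are just examples):

    jp = "日本語"                      # three kanji
    jp.encode("UTF-8").bytesize       # => 9  (3 bytes per char)
    jp.encode("UTF-16LE").bytesize    # => 6  (2 bytes per char)

    en = "hello"                      # plain ASCII
    en.encode("UTF-8").bytesize       # => 5  (1 byte per char)
    en.encode("UTF-16LE").bytesize    # => 10 (2 bytes per char)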

My own experience in developing international software is
that it is MUCH easier to work in an environment in which
UCS-2 or UTF-16 is the internal storage norm rather than
UTF-8. Accordingly, I seek out operating systems, databases,
and language providers that standardize on either of these
as their normative internal encoding.

It is necessary, of course, to provide secondary
transformation routines into the two other Unicode
transformation formats (UTF-8 and UTF-32), as well as
various legacy encodings. 
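
With a UTF-16 internal norm, those conversions reduce to
something like the following sketch (Shift_JIS standing in
for "a legacy encoding"; the encoding-aware String#encode is
again an assumption):

    internal = "日本語".encode("UTF-16LE")  # the internal norm
    internal.encode("UTF-8")               # for interchange
    internal.encode("UTF-32LE")            # fixed-width form
    internal.encode("Shift_JIS")           # legacy round-trip; fails
                                           # for characters Shift_JIS
                                           # cannot represent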

Matz:
>Using Unicode as an internal universal character
>set covers 98% of M17N, but I want to cover ALL of the cases, and
>from my personal experience (Ruby Japanization), I think it's
>efficiently possible.

What is the 2% that isn't covered by Unicode's UTF-16
encoding (which provides for about 1 million code points, if
one includes the surrogate facility)?
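
The arithmetic behind "about 1 million", for the record:
1,024 high surrogates paired with 1,024 low surrogates
address the supplementary planes, on top of the BMP.

    0xDBFF - 0xD800 + 1               # => 1024 high surrogates
    0xDFFF - 0xDC00 + 1               # => 1024 low surrogates
    1024 * 1024                       # => 1048576 supplementary
                                      #    code points beyond the BMP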

Richard Schulman