Hi,

In message "[ruby-talk:7436] Unicode Issues (was: "A Java Developer's Wish List for Ruby")"
    on 00/12/17, Richard A.Schulman <RichardASchulman / att.net> writes:

|>|Do you mean that it has been superseded by UTF-16? Or what?
|>That's what I mean.
|
|Good. Both UCS-2 and UTF-16 have the same 16-bit encoding
|for the 49,194 presently defined characters used in most of
|the languages of the world. UTF-16 is a superset of UCS-2,
|adding in the possibility of surrogates. Just out of
|curiosity, though, how important is the surrogate extension
|to users in Japan?

Not important yet.  But future additions to the JIS standard will
probably be covered by the surrogate extension (yes, the KANJI set is
still growing).  So ignoring surrogates will cause great trouble in the
future.
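
To make the surrogate mechanism concrete, here is a minimal sketch of
how a code point above U+FFFF becomes two 16-bit units in UTF-16.  The
helper name to_surrogates is made up for illustration, U+20000 is only
an example value, and the cross-check assumes a Ruby with String#encode
(1.9 or later).

```ruby
# Split a supplementary code point (U+10000..U+10FFFF) into its
# UTF-16 surrogate pair.  to_surrogates is a hypothetical helper.
def to_surrogates(codepoint)
  unless (0x10000..0x10FFFF).cover?(codepoint)
    raise ArgumentError, "not a supplementary code point"
  end
  offset = codepoint - 0x10000
  high = 0xD800 + (offset >> 10)    # upper 10 bits
  low  = 0xDC00 + (offset & 0x3FF)  # lower 10 bits
  [high, low]
end

high, low = to_surrogates(0x20000)
printf("U+20000 -> %04X %04X\n", high, low)   # U+20000 -> D840 DC00

# Cross-check against Ruby's own UTF-16 encoder.
units = [0x20000].pack("U").encode("UTF-16BE").unpack("n*")
puts units == [high, low]                     # true
```

Software that assumes one 16-bit unit per character would see two
"characters" here.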

|Matz:
|>|>But I'm going to add M17N feature to the next version Ruby.
|>|>The future Ruby should handle Unicode as well as other encodings.
|
|What exactly is the "M17N feature" that you plan to add? 

Each string and regex object will be able to carry information about
its encoding.  Matching, indexing, etc. will be based on that
information.
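
A rough sketch of what that could mean in practice.  It assumes an
interface like String#encoding and String#force_encoding as found in
Ruby 1.9 and later; the description above does not fix any API, and
char_count is a hypothetical helper.

```ruby
# The same nine bytes, interpreted under two different encodings.
BYTES = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E"   # three KANJI in UTF-8

# Hypothetical helper: tag a copy of the bytes with an encoding
# and count characters according to that tag.
def char_count(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).length
end

utf8 = BYTES.dup.force_encoding("UTF-8")
raw  = BYTES.dup.force_encoding("BINARY")

puts utf8.encoding               # UTF-8
puts char_count(BYTES, "UTF-8")  # 3
puts char_count(BYTES, "BINARY") # 9
puts utf8[0] == "\u65E5"         # true: indexing is per character
puts raw[0].bytes.inspect        # [230]: indexing is per byte

# Matching follows the tag as well: "." is one character.
puts((utf8 =~ /\A...\z/).inspect)  # 0
puts((raw  =~ /\A...\z/).inspect)  # nil
```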

|Matz:
|>Unicode 3.0 is really an improvement.  Most Japanese can accept it
|>except time and space efficiency.
|>...
|>    With UTF-8, most Japanese characters take 3 bytes each.  That
|>    is 1.5 times bigger than now.  Imagine all of your text
|>    data growing 50% bigger.
|
|I agree. I'm not partial to UTF-8 either. In my earlier
|post, I recommended UCS-2, which is a two byte encoding for
|both the Western languages and the CJK languages. As far as
|DBCS Japanese goes, UCS-2 introduces no changes in storage
|or processing requirements. The same is true for the
|superset UTF-16, assuming surrogates are not required.

It doubles the space needed for ASCII, though (multibyte text is often
a mixture of ASCII and KANJI characters).
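
The trade-off can be checked by counting bytes.  This is only a sketch:
it assumes String#encode (Ruby 1.9 or later) with the EUC-JP and
Shift_JIS converters available, and the helper name sizes and the
sample strings are made up for illustration.

```ruby
# Byte counts of the same text under several encodings.
# sizes is a hypothetical helper.
def sizes(text)
  %w[EUC-JP Shift_JIS UTF-8 UTF-16BE].map { |enc|
    [enc, text.encode(enc).bytesize]
  }.to_h
end

kanji = "\u65E5\u672C\u8A9E"   # three KANJI characters
ascii = "Ruby"                 # four ASCII characters
mixed = "Ruby 1.6 " + kanji    # the typical mixture

p sizes(kanji)  # {"EUC-JP"=>6, "Shift_JIS"=>6, "UTF-8"=>9, "UTF-16BE"=>6}
p sizes(ascii)  # {"EUC-JP"=>4, "Shift_JIS"=>4, "UTF-8"=>4, "UTF-16BE"=>8}
p sizes(mixed)  # {"EUC-JP"=>15, "Shift_JIS"=>15, "UTF-8"=>18, "UTF-16BE"=>24}
```

For pure KANJI, UTF-8 is the 1.5 times case; for pure ASCII, UTF-16 is
the doubling case.  Real text falls somewhere in between.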

|In converting to UTF-16, it's the Western languages that
|would suffer a "hit" in terms of storage and processing
|time. UTF-8, accordingly, will probably remain common in
|Western end-user shops for some time to come but not, I
|hope, as the internal encoding of system software.

Why not?  Despite its variable-length nature, I think UTF-8 is good as
an internal encoding too.  E.g.

  * ASCII superset
  * no NUL (\0) bytes in strings
  * no endian problems

Plus, UTF-16 is variable length anyway (as I mentioned above, we can't
ignore surrogates).

I think that's the reason Perl and Python chose UTF-8 as their internal
encoding.  I'd choose UTF-8 too if I could stick with Unicode.
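
The three UTF-8 properties listed above can be observed directly.  A
small sketch, assuming String#encode from Ruby 1.9 or later; the helper
nul_bytes? and the sample text are made up for illustration.

```ruby
# Hypothetical helper: does the encoded form contain a zero byte?
def nul_bytes?(text, encoding_name)
  text.encode(encoding_name).bytes.include?(0)
end

text  = "Ruby \u65E5\u672C\u8A9E"
ascii = "Ruby"

# ASCII superset: ASCII text has the same bytes in both encodings.
puts ascii.encode("UTF-8").bytes == ascii.encode("US-ASCII").bytes  # true

# No NUL bytes (unless the text itself contains U+0000), so C-style
# string functions keep working.  UTF-16 stores "R" as 00 52.
puts nul_bytes?(text, "UTF-8")     # false
puts nul_bytes?(text, "UTF-16BE")  # true

# No endian problems: UTF-16 has two byte orders, UTF-8 has one.
puts text.encode("UTF-16BE").bytes == text.encode("UTF-16LE").bytes  # false
```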

|My own experience in developing international software is
|that it is MUCH easier to work in an environment in which
|UCS-2 or UTF-16 is the internal storage norm rather than
|UTF-8. Accordingly, I seek out operating systems, databases,
|and language providers that standardize on either of these
|as their normative, internal coding.

If the following conditions can be fulfilled, it's easy to develop
I18N software using UTF-16, as you said.

  * surrogates can be ignored
  * all characters can be converted into Unicode

These conditions hold for many, many applications.  But I cannot
FORCE these conditions on ALL applications written in Ruby.

|>Using Unicode as an internal universal character
|>sets covers 98% of M17N, but I want to cover ALL of the cases, and
|>from my personal experience (Ruby Japanization), I think it's
|>efficiently possible.
|
|What is the 2% that isn't covered by Unicode's UTF-16
|encoding (which provides for about 1 mn code points, if one
|includes the surrogate facility)?

Don't take the numbers literally.  98% is a synonym for "almost all". ;-)

I was thinking of applications that process a big character set
(e.g. the Mojikyo set) which is not covered by Unicode.  I don't know
exactly how many code points it has.  But I've heard it's pretty big,
possibly consuming half of the surrogate space.  And they want to
process those characters now.  I think they don't want to wait for the
Unicode Consortium to assign code points for their characters.
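
For scale, the size of the surrogate space itself is simple arithmetic.
The constant and method names below are made up for illustration, and
nothing here says how big the Mojikyo set actually is.

```ruby
HIGH_SURROGATES = 0xD800..0xDBFF   # 1024 high (leading) values
LOW_SURROGATES  = 0xDC00..0xDFFF   # 1024 low (trailing) values

# Number of code points reachable through surrogate pairs.
def surrogate_space
  HIGH_SURROGATES.size * LOW_SURROGATES.size
end

puts surrogate_space       # 1048576
puts surrogate_space / 2   # 524288, "half of the surrogate space"
```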

							matz.