Matz:
>|>I'm not going to choose UCS-2. UCS-2 is obsolete.

Schulman:
>|Do you mean that it has been superseded by UTF-16? Or what?

Matz:
>That's what I mean.

Good. Both UCS-2 and UTF-16 have the same 16-bit encoding for the
49,194 presently defined characters used in most of the languages of
the world. UTF-16 is a superset of UCS-2, adding in the possibility of
surrogates. Just out of curiosity, though, how important is the
surrogate extension to users in Japan?

Matz:
>|>But I'm going to add M17N feature to the next version Ruby.
>|>The future Ruby should handle Unicode as well as other encodings.

What exactly is the "M17N feature" that you plan to add?

Matz:
>Unicode 3.0 is really an improvement. Most Japanese can accept it
>except time and space efficiency.
>...
> By using UTF-8, most of Japanese character takes 3 bytes each. It
> would be 1.5 time bigger than current. Imagine all of your text
> data grows 50% bigger.

I agree. I'm not partial to UTF-8 either. In my earlier post, I
recommended UCS-2, which is a two-byte encoding for both the Western
languages and the CJK languages. As far as DBCS Japanese goes, UCS-2
introduces no changes in storage or processing requirements. The same
is true for the superset UTF-16, assuming surrogates are not required.

In converting to UTF-16, it's the Western languages that would suffer
a "hit" in terms of storage and processing time, since text now held
in one byte per character would need two. UTF-8, accordingly, will
probably remain common in Western end-user shops for some time to come
but not, I hope, as the internal encoding of system software.

My own experience in developing international software is that it is
MUCH easier to work in an environment in which UCS-2 or UTF-16 is the
internal storage norm rather than UTF-8. Accordingly, I seek out
operating systems, databases, and language providers that standardize
on either of these as their normative, internal coding.
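
To put rough numbers on the storage argument above, here is a minimal sketch. It assumes a Ruby that provides String#encode (1.9 or later), which is not the Ruby under discussion in this thread, and byte_sizes is a hypothetical helper of my own, not part of Ruby or of any planned M17N design. It counts the bytes the same text needs in UTF-8 and in UTF-16, and then shows the two 16-bit units that UTF-16 uses for a character outside the 16-bit range.

```ruby
# Sketch only: assumes a Ruby with String#encode (1.9 or later).
# byte_sizes is a hypothetical helper, not part of Ruby itself.
def byte_sizes(text)
  {
    utf8:  text.encode("UTF-8").bytesize,
    utf16: text.encode("UTF-16BE").bytesize # BE form, so no byte-order mark is counted
  }
end

japanese  = "\u65E5\u672C\u8A9E"  # three kanji, all inside the 16-bit range
western   = "Ruby"                # four ASCII characters
surrogate = "\u{20BB7}"           # one character beyond the 16-bit range

[japanese, western, surrogate].each do |text|
  sizes = byte_sizes(text)
  puts "#{text.length} chars: UTF-8 #{sizes[:utf8]} bytes, UTF-16 #{sizes[:utf16]} bytes"
end

# The character beyond the 16-bit range becomes a surrogate pair:
# two 16-bit code units, read here as big-endian unsigned shorts.
units = surrogate.encode("UTF-16BE").unpack("n*")
puts units.map { |u| format("%04X", u) }.join(" ")
```

The three kanji take 9 bytes in UTF-8 against 6 in UTF-16, which is the 1.5x growth Matz describes, while the four ASCII characters go from 4 bytes to 8, which is the Western "hit". The last character costs 4 bytes in either format, because UTF-16 has to spend a surrogate pair on it.
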
It is necessary, of course, to provide secondary transformation
routines into the two other Unicode transformation formats (UTF-8 and
UTF-32), as well as various legacy encodings.

>Using Unicode as an internal universal character
>sets covers 98% of M17N, but I want to cover ALL of the cases, and
>from my personal experience (Ruby Japanization), I think it's
>efficiently possible.

What is the 2% that isn't covered by Unicode's UTF-16 encoding (which
provides for about 1 million code points, if one includes the
surrogate facility)?

Richard Schulman