On Fri, 19 Sep 2008 19:52:30 +1000, Yukihiro Matsumoto  
<matz / ruby-lang.org> wrote:

> |I assume that you are referring to my suggestion to remove support for
> |"non-ASCII compatible" encodings?
>
> No, I was referring to past proposals such as "abandon all M17N and
> choose Unicode as the unified internal character set, like other `major'
> languages do".
>
> |Of course you are the boss when it comes to Ruby, but I feel that you
> |could have phrased this statement ("But no thanks in advance....") a
> |little better.
>
> I am sorry if my phrasing appeared offensive.  UTF-16 is a nasty beast,
> but as I stated we have other beasts (dummy encodings), so simply
> removing UTF-16 would help us little.  We have to do it consistently,
> if we do.

No problem - it appears I misunderstood you; sorry. That is easy to do  
with email, unfortunately :-(

Perhaps we need to go back to basics with this discussion. As a mere  
English speaker, I do not fully understand the issues faced by users of  
Japanese and other encodings. What I have gathered from this discussion is  
(please tell me if I am wrong):

- There are characters that Ruby needs to support which cannot be uniquely  
mapped to Unicode (a sketch of this problem follows the list)
- In fact there are entire character sets that we want to support in Ruby  
that are not supported in Unicode
- There are ambiguous characters in some character sets - the same code is  
used for different characters
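
To make the first point concrete, here is my (possibly wrong) understanding
of the classic wave-dash case, sketched with Ruby 1.9's String#encode. The
byte pair 0x81 0x60 is one code in Shift_JIS, but the standard JIS tables
map it to U+301C while Microsoft's variant maps it to U+FF5E - the details
of Ruby's bundled conversion tables are an assumption on my part:

    # Sketch only - assumes Ruby 1.9's transcoding tables follow the
    # usual JIS vs. Microsoft mappings for the byte pair 0x81 0x60.
    wave = "\x81\x60".force_encoding("Shift_JIS")
    wave.encode("UTF-8")    # => U+301C WAVE DASH

    same = "\x81\x60".force_encoding("Windows-31J")
    same.encode("UTF-8")    # => U+FF5E FULLWIDTH TILDE

    # The same two bytes become different Unicode characters depending on
    # which variant you assume, so a round trip through the "wrong" table
    # silently changes the text.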

I think it would be a benefit if we all got to understand a bit more:

- How the character ambiguity issue (e.g. Yen/backslash) is handled at the  
moment - generally, not just with Ruby. i.e. how do you know that a printer  
or screen is going to show the right character? (See the first sketch  
after this list.)
- How the various "non-ASCII compatible" encodings are used in practice.  
e.g. it is my understanding that UTF-7 is really only used in email, and  
that it would be straightforward to transcode it to/from UTF-8 immediately  
in a POP/IMAP library, so UTF-7 could be avoided completely as an  
"internal" encoding in Ruby. It's as if we were treating UTF-7 like  
base64 - just a transformation of a "real" encoding. (In fact, UTF-16 and  
UTF-32 could be considered the same sort of thing, except that they may be  
used more widely.) (See the second sketch after this list.)
- How a Japanese programmer would handle dealing with a combination of a  
Japanese character set that is not Unicode-compatible and, say, UTF-8 text  
containing non-ASCII, non-Japanese characters. i.e. is there a reasonable  
alternative to transcoding both to Unicode and somehow dealing with the  
"difficult characters" as special cases? (See the third sketch after this  
list.)
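
On the Yen/backslash question, here is my current (hedged) understanding,
sketched in Ruby 1.9 - the point being that the bytes themselves do not
resolve the ambiguity, the font or locale does:

    # Sketch - byte 0x5C in Shift_JIS text.  Ruby's tables, like most,
    # map it to U+005C (backslash); whether a screen or printer shows a
    # backslash or a Yen sign depends on the font/locale, not on the data.
    s = "\x5C".force_encoding("Shift_JIS")
    s.encode("UTF-8")   # => "\\" (U+005C), even though many Japanese
                        #    environments render this byte as a Yen sign

On the UTF-7 point, this is the sort of "transcode at the boundary" usage I
have in mind, assuming a UTF-7 converter is available (UTF-7 is a dummy
encoding in Ruby 1.9, so converting it on the way in and out seems to be
the only sensible operation anyway); the mail fragment is hypothetical:

    # Sketch - decode UTF-7 the moment it leaves the mail library,
    # exactly as one would Base64-decode a body part.
    raw  = "Hello +AOk-"                 # hypothetical fragment off the wire
    text = raw.encode("UTF-8", "UTF-7")  # => "Hello é"
    # From here on, the rest of the program never sees UTF-7 at all.

And on the mixed-encoding question, a sketch of the behaviour I believe
Ruby 1.9 already has: each string carries its own encoding, concatenating
incompatible strings raises, and the "difficult characters" surface as
conversion errors that must be handled explicitly:

    # Sketch - mixing Shift_JIS and UTF-8 data in one program.
    sjis = "\x93\xFA\x96\x7B\x8C\xEA".force_encoding("Shift_JIS")  # 日本語
    utf8 = "français"

    # sjis + utf8               # raises Encoding::CompatibilityError

    both = sjis.encode("UTF-8") + utf8   # fine: transcode first

    # Going the other way runs into the "difficult characters":
    # utf8.encode("Shift_JIS")  # raises Encoding::UndefinedConversionError
    safe = utf8.encode("Shift_JIS", :undef => :replace)  # "fran?ais", lossy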
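Is the UTF-7 sketch above roughly how mail libraries would actually do it,
or is there a reason UTF-7 text needs to survive further into a program?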
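Likewise, is :undef => :replace (or an equivalent explicit policy) what a
Japanese programmer would really reach for, or are the special cases
normally handled some other way?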

Could someone out there please succinctly explain these things to us  
westerners? Then perhaps our thinking about this issue may be more aligned.

Thanks
Mike