Hi,

In message "Re: [ruby-core:18751] Re: Character encodings - a radical  suggestion"
    on Sat, 20 Sep 2008 10:00:24 +0900, "Michael Selig" <michael.selig / fs.com.au> writes:

|Perhaps we need to go back to basics with this discussion. As a mere  
|English speaker, I do not fully understand the issues faced by users of  
|Japanese and other encodings. What I have gathered from this discussion is  
|(please tell me if I am wrong):
|
|- There are characters that Ruby needs to support which cannot be uniquely  
|mapped to Unicode

Yes, although they are minor cases.

|- In fact there are entire character sets that we want to support in Ruby  
|that are not supported in Unicode

Yes, I know two of them.  One is Mojikyo, which refuses character
unification.  That character set contains about 170,000 characters.
When I first heard it, that number seemed huge, but Unicode is
getting pretty close (it now has more than 100,000 characters).

The other is GB18030, defined by the Chinese government.  I don't
know the details, but I've heard it officially contains Unicode as a
subset.  The GB18030 encoding scheme is up to 4 bytes per code point,
though, so I am not sure how it can hold 21-bit Unicode code points.
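
For what it's worth, a rough capacity check (a sketch in Ruby, using
the published byte ranges of the GB18030 4-byte form) suggests there
is enough room:

    # 1st/3rd bytes range over 0x81..0xFE, 2nd/4th bytes over 0x30..0x39.
    four_byte_forms = (0xFE - 0x81 + 1) * 10 * (0xFE - 0x81 + 1) * 10
    p four_byte_forms             # => 1587600
    p 0x110000                    # => 1114112 code points, U+0000..U+10FFFF
    p four_byte_forms > 0x110000  # => true, room for every 21-bit code point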

|- There are ambiguous characters in some character sets - same code for  
|different characters

Yes.

|I think it would be a benefit if we all got to understand a bit more:
|
|- How the character ambiguity (eg: Yen/backslash) issue is handled at the  
|moment - generally, not just with Ruby. ie: how do you know that a printer  
|or screen is going to show the right character?

Either by avoiding conversion (operating on bytes), or by selecting
the proper encoding scheme (out of many very similar encodings, such
as Shift_JIS, CP932, and Windows-31J).  The conversion tables from
unicode.org are carefully designed to ensure round-trips, although
that is the very reason we have so many similar encodings.  If both
ends can choose (or negotiate) to use the same conversion table,
mojibake problems are unlikely.
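
As a concrete illustration (a minimal sketch, assuming Ruby 1.9's
String#encode/force_encoding and the unicode.org-derived tables), the
very same bytes decode to different Unicode characters depending on
which table you pick:

    bytes = "\x81\x60"  # one JIS X 0208 character as Shift_JIS bytes
    p bytes.dup.force_encoding("Shift_JIS").encode("UTF-8")
    # => "〜" (U+301C WAVE DASH, per unicode.org's SHIFTJIS.TXT)
    p bytes.dup.force_encoding("Windows-31J").encode("UTF-8")
    # => "～" (U+FF5E FULLWIDTH TILDE, per the CP932 table)

Each table round-trips cleanly on its own; mojibake appears when the
two ends use different tables.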

|- How the various "non-ascii compatible" encodings are used in practice.  
|eg: it is my understanding that UTF-7 is really only used in email, and  
|that it would be straightforward to immediately transcode it to/from UTF-8  
|in a POP/IMAP library, so UTF-7 could be avoided completely as an  
|"internal" encoding in Ruby. It's as if we were treating UTF-7 like  
|base64 - just a transformation of a "real" encoding. (In fact UTF-16 & 32  
|could be considered the same sort of thing, except they may be used more  
|widely.)

UTF-{16,32}{BE,LE} are not ASCII-compatible, but they are safe to
convert to UTF-8, since they differ only in encoding scheme; they
represent the same character set anyway.  ISO-2022 is often used in
mail and on the web.  That situation is a little more complicated,
but basically it can be converted to Unicode as well (with a slight
risk of the yen sign problem).  You can ignore UTF-7.
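
A small round-trip sketch (again assuming Ruby 1.9's String#encode;
the sample text is arbitrary hiragana, which is in JIS X 0208):

    s = "こんにちは"
    p s.encode("UTF-16BE").encode("UTF-8") == s    # => true, lossless
    p s.encode("ISO-2022-JP").encode("UTF-8") == s # => true for JIS X 0208 text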

|- How a Japanese programmer would handle the situation of dealing with a  
|combination of a Japanese non-Unicode-compatible character set and, say, a  
|UTF-8 encoding that includes non-ASCII, non-Japanese characters.  
|ie: Is there a reasonable alternative to encoding both to Unicode &  
|somehow dealing with the "difficult characters" as special cases?

Unicode is getting better every day, so it now covers almost all
day-to-day needs.  Some cellphone problems are handled by using the
Private Use Area.

							matz.