Hi,

In message "Re: [ruby-core:18681] Re: Character encodings - a radical suggestion"
    on Thu, 18 Sep 2008 09:03:35 +0900, "Michael Selig" <michael.selig / fs.com.au> writes:

|Thanks for all the replies - I am not an expert on all these encodings,  
|and I (obviously mistakenly!) assumed that all other encodings could be  
|converted to Unicode.
|
|When I first looked at Ruby 1.9's encoding support I thought "that's neat  
|- I think it will solve my m17n problems". However as I got into it I soon  
|discovered that it wasn't nearly this easy!

I am sorry that life is not that easy.

|Here is a summary of my issues:
|
|- Non "ASCII-compatible" data is almost impossible to work with. Just take  
|a look at what James Gray was proposing to do for CSV.

Yes, basically support for UTF-{16,32} is very limited, so I believe
it is OK for libraries to omit them.  We should document that clearly,
but note that 1.9.1 has not been released yet.

|- Other alternative languages to Ruby which represent all strings as  
|Unicode don't have this problem. Although they may not be a 100% solution  
|in Japan & China, they would certainly be fine for me to use.

Ruby does not prohibit you from doing the same thing as the alternative
languages - converting back and forth at the surface.  The point is, I
think, that we haven't yet provided a nifty API to do so.  If you can
live with Python's open-read-and-decode, I think you can live with
Ruby's "r:UTF-16:UTF-8" or open-read-and-encode.
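
A minimal sketch of the two styles mentioned above: transcoding while
reading, and reading the bytes first and encoding them explicitly.  The
helper name read_as_utf8 and the sample file are made up for
illustration, and UTF-16LE is used instead of plain UTF-16 on the
assumption that the file carries no BOM:

```ruby
require "tmpdir"

# Hypothetical helper: return the file's text as UTF-8 in two ways.
def read_as_utf8(path, external = "UTF-16LE")
  # Style 1: let the IO layer transcode from external to internal encoding.
  transcoded = File.open(path, "r:#{external}:UTF-8") { |f| f.read }

  # Style 2: open-read-and-encode, i.e. read raw bytes and convert by hand.
  encoded = File.binread(path).force_encoding(external).encode("UTF-8")

  [transcoded, encoded]
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "sample.txt")
  # "hello" with an e-acute, stored on disk as UTF-16LE without a BOM.
  File.binwrite(path, "h\u00e9llo".encode("UTF-16LE"))

  a, b = read_as_utf8(path)
  puts a.encoding   # UTF-8
  puts a == b       # true
end
```

Either way the rest of the program only ever sees UTF-8 strings.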

If we need something more, it should be a better API to reduce the cost
of Unicode-based applications, not making the language Unicode-centric.

Let me rephrase: it's OK for you to make your application/library
Unicode-centric, but not the language itself.  One can declare that
one's library supports only ASCII-compatible text, or only UTF-8 text.
Its users must then take care of converting non-conforming text.
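
As a rough sketch of such a declaration, a library could check the
encoding at its entry points and leave conversion to the caller.  The
module Utf8OnlyLibrary and its word_count method are hypothetical, not
part of any real library:

```ruby
# Hypothetical library that declares "UTF-8 text only".
module Utf8OnlyLibrary
  # US-ASCII is accepted too, since it is a subset of UTF-8.
  ACCEPTED = [Encoding::UTF_8, Encoding::US_ASCII].freeze

  module_function

  # Count whitespace-separated words; refuse text in any other encoding.
  def word_count(text)
    unless ACCEPTED.include?(text.encoding) && text.valid_encoding?
      raise ArgumentError, "expected UTF-8 text, got #{text.encoding}"
    end
    text.split.size
  end
end

# The caller owns the conversion of non-conforming text.
# The sample is two Japanese words ("nihongo tekisuto") in Shift_JIS.
sjis = "\u65E5\u672C\u8A9E \u30C6\u30AD\u30B9\u30C8".encode("Shift_JIS")
puts Utf8OnlyLibrary.word_count(sjis.encode("UTF-8"))   # 2
```

Passing sjis directly would raise ArgumentError, which is the point:
the library states its contract and the user converts.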

|- When developing standard classes & mixins that could be installed in any  
|country, virtually all methods that handle more than 1 string are going to  
|have to worry about the possibility of dealing with incompatible  
|encodings. This is a major overhead to a programmer - it may not be  
|acceptable to let it raise an error.

For any serious application/library, there are three choices:

(a) choose US-ASCII
(b) choose UTF-8 (or any specific encoding)
(c) choose to live with multiple encodings

But the last one is not an easy way, indeed.  I don't want to force
the hard way on any Ruby user.  Users should choose whichever they want.
But I don't want to deny the possibility.
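
To give an idea of what choice (c) costs, here is a small sketch of a
method that has to handle two strings in possibly different encodings.
The helper safe_concat is hypothetical; Encoding.compatible? and
String#encode are the standard calls it relies on:

```ruby
# Hypothetical helper for choice (c): join two strings that may differ
# in encoding instead of letting String#+ raise.
def safe_concat(a, b)
  if Encoding.compatible?(a, b)
    a + b
  else
    # May still raise Encoding::UndefinedConversionError when b contains
    # characters that a's encoding cannot represent.
    a + b.encode(a.encoding)
  end
end

utf8 = "\u00e9"                                   # e-acute, UTF-8
sjis = "\u65E5\u672C\u8A9E".encode("Shift_JIS")   # "nihongo" in Shift_JIS

begin
  utf8 + sjis
rescue Encoding::CompatibilityError => e
  puts "plain + failed: #{e.class}"
end

puts safe_concat(utf8, sjis).encoding   # UTF-8
```

Every method that takes more than one string needs this kind of care,
which is why (a) or (b) is the easier road for most libraries.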

|It *does* mean that strings may "magically" be converted to UTF-8, but I  
|don't see this as a big deal as long as when they are output they are  
|converted back to the necessary encoding (which I think Ruby does with  
|files now). If the "magic" conversion is a problem, maybe there should be  
|a switch to turn it on & off.
|This auto-convert policy should also be used with non-destructive methods  
|like String#== etc so the programmer needn't worry whether the same  
|character has a different representation on each side of the "==".
|The ASCII-8BIT encoding should be reserved as a "special case" and not be  
|subject to auto-conversion, because it is going to be mainly used for  
|"byte strings".

If you can do implicit conversion at I/O, why do you have to care
about encoding mixing?  Your program should deal with a single encoding
anyway.  Auto-conversion is bad, believe me.
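
A sketch of what conversion at I/O looks like in practice, assuming
the data on disk is Shift_JIS and the program works in UTF-8 only.
The method upcase_file is made up for illustration:

```ruby
require "tmpdir"

# Hypothetical filter: Shift_JIS on disk, UTF-8 inside the program.
def upcase_file(src, dst, external = "Shift_JIS")
  # Input boundary: transcode external -> UTF-8 while reading.
  text = File.open(src, "r:#{external}:UTF-8") { |f| f.read }

  # The program itself deals with a single encoding.
  result = text.upcase

  # Output boundary: strings are converted to the external encoding on write.
  File.open(dst, "w:#{external}") { |f| f.write(result) }
end

Dir.mktmpdir do |dir|
  src = File.join(dir, "in.txt")
  dst = File.join(dir, "out.txt")
  # "ruby" followed by a Japanese word, stored as Shift_JIS.
  File.binwrite(src, "ruby \u65E5\u672C\u8A9E".encode("Shift_JIS"))

  upcase_file(src, dst)
  puts File.binread(dst) == "RUBY \u65E5\u672C\u8A9E".encode("Shift_JIS").b
end
```

No string inside upcase_file ever needs to be compared or joined with
a string in another encoding, so no implicit conversion is required.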

							matz.