At 10:20 08/09/17, Michael Selig wrote:
>Hi,
>
>You might at first glance think that this post should go to ruby-dev, but  
>please read to the end!

If it's in English, it should be ruby-core, not ruby-dev, as far as I
understand.

>I have been pulling my hair out trying to convert a relatively simple app  
>to support m17n under Ruby 1.9 to see what is involved. I need to support  
>all common locales worldwide, and data can also be stored in UTF-8 or  
>UTF-16. I was hoping that Ruby 1.9 was going to take the hard work out of  
>this for me. It has to a certain extent, but UTF-16 is the problem - it  
>breaks so many things, due to its "ASCII incompatibility" (using Ruby's  
>definition). I can't even do simple things like pull out fields and  
>substitute into another string without testing "encoding compatibility".  
>Something as simple as:
>
>       puts "The value is #{val}"
>
>fails if val is UTF-16 data.

I think in this case, the reason why you see the problem only for
UTF-16 is that your string, other than the interpolated data, is
currently all US-ASCII. But imagine that sooner or later you
(or somebody else) are going to localize your application. Then the
string might be in any encoding, and you'll get many more
"encoding compatibility" exceptions.
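
To make this concrete, here is a rough sketch. The helper concat_ok?
and the sample strings are made up for illustration, and the behaviour
is as I understand the current implementation:

```ruby
# Hypothetical helper: true if the two strings can be concatenated,
# false if Ruby raises an encoding compatibility error.
def concat_ok?(a, b)
  a + b
  true
rescue Encoding::CompatibilityError
  false
end

ascii_only = "The value is "            # ASCII-only literal
localized  = "Der Wert betr\u00e4gt "   # same message, localized (non-ASCII)

latin1 = "caf\u00e9".encode("ISO-8859-1")
utf16  = "caf\u00e9".encode("UTF-16LE")

puts concat_ok?(ascii_only, latin1)  # true:  left side is ASCII-only
puts concat_ok?(localized, latin1)   # false: both sides contain non-ASCII
puts concat_ok?(ascii_only, utf16)   # false: UTF-16 is not ASCII-compatible
```

The second case is the one you have not hit yet, because your
literals happen to be ASCII-only today.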


>At one stage I got so frustrated that I was even thinking about going back  
>to Python :-(
>So I have ended up transcoding any UTF-16 data to UTF-8, and now things  
>are going much better.
>
>Maybe I am doing something wrong - if so please suggest something I can do  
>other than transcode the UTF-16.

I think your problem is more general, and you should transcode other
encodings to UTF-8, too, unless you are sure that you will only ever
have to deal with a single encoding.
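
Something along these lines is what I have in mind. This is only a
sketch; to_utf8 is a made-up helper name, and it assumes the incoming
string is correctly labeled with its real encoding:

```ruby
# Hypothetical helper: normalize any incoming string to UTF-8.
# Assumes str is correctly labeled; mislabeled or binary data
# would make String#encode raise.
def to_utf8(str)
  str.encoding == Encoding::UTF_8 ? str : str.encode(Encoding::UTF_8)
end

val_utf16  = "caf\u00e9".encode("UTF-16LE")
val_latin2 = "\u0142\u00f3d\u017a".encode("ISO-8859-2")

# After normalization, interpolation no longer depends on where
# the data came from.
puts "The value is #{to_utf8(val_utf16)}"
puts "The value is #{to_utf8(val_latin2)}"
```

Applied once, right where the data enters the program, this keeps the
rest of the code free of compatibility checks.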


>But this has led me to look back at the issues with UTF-16 I have hit,  
>and to think about all the internal code in Ruby to handle "ASCII  
>incompatible" encodings, and the overhead involved with supporting it.
>
>And I think that other Ruby programmers may end up doing what I have done  
>- avoid using UTF-16 internally because it is too hard.

I agree that all non-ASCII-compatible encodings should come with a
sticker with a big warning on it, at least.


>So my radical suggestion is this:
>
>Remove internal support for non-ASCII encodings completely, and when  
>reading/writing UTF-16 (and UTF-32) files automatically transcode to/from  
>UTF-8.

I can understand the former part. Providing something half-baked
can have advantages and disadvantages.


>My reasons:
>
>- String & Regexp operations should just "work" without the programmer  
>worrying about encoding compatibility (I think!)

See below.

>- The programmer only has to think about character encodings at the  
>"interfaces" (files, network interfaces) not throughout the program logic

This is a desirable architecture. Ruby 1.9 will force you to adopt it,
or to come up with some other architecture, but won't handle things
automatically for you.

>- To my knowledge UTF-16 & UTF-32 are the only "non-ASCII compatible"  
>encodings as Ruby defines it

No, there are others, such as iso-2022-jp. But they are not really the
main issue. You can get an encoding incompatibility error for any two
different ASCII-compatible encodings, e.g. iso-8859-1 and iso-8859-2,
as soon as both strings contain non-ASCII characters. The reason that
you currently don't is that one of your strings (or a regexp) is always
ASCII-only, even if it's labeled as something else.
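
You can check both points yourself. The following is a sketch using
Encoding#ascii_compatible? and Encoding.compatible?; the helper
mixable? is just a made-up name for this mail:

```ruby
# Which encodings does Ruby treat as ASCII-compatible?
puts Encoding::UTF_16LE.ascii_compatible?     # false
puts Encoding::ISO_2022_JP.ascii_compatible?  # false
puts Encoding::ISO_8859_1.ascii_compatible?   # true

# Hypothetical helper: can these two strings be combined?
def mixable?(a, b)
  !Encoding.compatible?(a, b).nil?
end

l1_text  = "\u00e9".encode("ISO-8859-1")  # non-ASCII, labeled iso-8859-1
l2_text  = "\u0142".encode("ISO-8859-2")  # non-ASCII, labeled iso-8859-2
l1_ascii = "abc".encode("ISO-8859-1")     # ASCII-only, labeled iso-8859-1

puts mixable?(l1_text, l2_text)   # false: both contain non-ASCII
puts mixable?(l1_ascii, l2_text)  # true:  one side is ASCII-only
```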

>- To my knowledge no one actually uses UTF-16 or UTF-32 as a locale

True.

>- I would avoid having to use ugly modes to open a file like  
>"r:UTF-16LE:UTF-8" (very minor)

Telling Ruby what encoding you expect from the outside is kind of
unavoidable. But it would indeed help if it were enough to tell
a Ruby application only once that you want to handle everything
internally in a certain encoding.
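
Until then, you can get close to "only once" in your own code. Here is
a sketch; INTERNAL_ENCODING and read_text are made-up names for
illustration, not anything Ruby provides:

```ruby
require "tempfile"

# Application-wide choice, stated in exactly one place (hypothetical).
INTERNAL_ENCODING = Encoding::UTF_8

# Hypothetical helper: the caller names only the external encoding;
# the "r:EXTERNAL:INTERNAL" mode string is built here.
def read_text(path, external)
  File.open(path, "r:#{external}:#{INTERNAL_ENCODING}") { |f| f.read }
end

Tempfile.create("utf16-demo") do |tmp|
  tmp.close
  # Write raw UTF-16LE bytes to stand in for an external file.
  File.binwrite(tmp.path, "caf\u00e9".encode("UTF-16LE"))

  text = read_text(tmp.path, "UTF-16LE")
  puts text.encoding                 # UTF-8
  puts "The value is #{text}"
end
```

The external encoding still has to be given per file, but the ugly
three-part mode string appears in only one place.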

>- Ruby's internal code would be simpler & cleaner and therefore probably  
>faster and easier to maintain

If everything is done in UTF-8 all the time, yes. But I don't think
we will go there soon (I wouldn't mind). Speed isn't too much of
an issue, but of course the code would be quite a bit simpler.

Regards,   Martin.



#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp