On Wednesday 14 June 2006 06:01 am, Juergen Strobel wrote:
> For my personal vision of "proper" Unicode support, I'd like to have
> UTF-8 the standard internal string format, and Unicode Points the
> standard character code, and *all* String functions to just work
> intuitively "right" on a character base rather than byte base. Thus
> the internal String encoding is a technical matter only, as long as it
> is capable of supporting all Unicode characters, and these internal
> details are not exposed via public methods.

Maybe Juergen is saying the same thing I'm going to say, but since I don't
understand / recall exactly what UTF-8 encoding is, I'll say it anyway:

I'm beginning to think (with a newbie sort of perspective) that Unicode is too 
complicated to deal with inside a program.  My suggestion would be that 
Unicode be an external format...

What I mean is, when you have a program that must handle international text, 
convert the Unicode to a fixed width representation for use by the program.   
Do the processing based on these fixed width characters.  When it's complete, 
convert it back to Unicode for output.

It seems to me that would make a lot of things easier.
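
A minimal Ruby sketch of that round trip, assuming the external text arrives
as UTF-8. The helper names to_code_points and to_utf8 are hypothetical, made
up for illustration; only String#unpack and Array#pack with the "U*"
directive are real Ruby calls. Each array element is one whole character
(code point), so the processing step never sees partial bytes:

```ruby
# Decode UTF-8 text into an array of integer code points.
# Every element is one character, so indexing is fixed width.
# (to_code_points is a hypothetical helper name.)
def to_code_points(utf8_text)
  utf8_text.unpack("U*")
end

# Encode an array of code points back into a UTF-8 string.
# (to_utf8 is a hypothetical helper name.)
def to_utf8(code_points)
  code_points.pack("U*")
end

# "Jürgen" followed by three Japanese characters, written with escapes
# so the example does not depend on the source file's encoding.
text = "J\u00FCrgen \u65E5\u672C\u8A9E"

points   = to_code_points(text)      # external format -> fixed width
reversed = to_utf8(points.reverse)   # process, then convert back for output

puts text.bytesize   # 17 bytes in UTF-8
puts points.length   # 10 characters
puts reversed
```

The byte count and the character count differ, which is why the processing
step here works on the array of code points rather than on the raw bytes.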

Then I might have two basic "types" of programs--programs that can handle any 
text (i.e., international), and other programs that can handle only English 
(or maybe only European languages that can work with an 8 bit byte).  (I 
suggest these two types of programs because I suspect those that have to 
handle the international character set will be slower than those that don't.)

Aside: What would it take to handle all the characters / ideographs (is that
what they call the Japanese, Chinese, ... characters?) presently in use
in the world? IIRC, 16 bits (2**16) didn't cut it for Unicode. Would 32 bits?
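
A quick Ruby check of the arithmetic, assuming the Unicode code space as
currently defined, which ends at U+10FFFF. The variable names are only for
illustration. The sketch suggests that 16 bits is indeed too small, and that
a 32-bit unit has room to spare:

```ruby
# Highest code point in the Unicode code space.
max_code_point = 0x10FFFF

# Number of bits needed to represent it as an unsigned integer.
bits_needed = max_code_point.bit_length

# U+20000 is a CJK ideograph that lies outside the 16-bit range.
ideograph = [0x20000].pack("U")

puts bits_needed                              # 21
puts ideograph.unpack("U*").first > 0xFFFF    # true: does not fit in 16 bits
puts max_code_point < 2**32                   # true: fits in 32 bits
puts ideograph.bytesize                       # 4 bytes when stored as UTF-8
```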

Randy Kramer

> I/O and String functions should be able to convert to and from
> different external encodings, via plugin modules. Note I don't require
> non Unicode String classes, just the possibility to do I/O with
> foreign characters sets, or conversion to byte arrays. Strings should
> consist of characters, not just be a sequence of bytes meaningless
> without external information about their encoding.
>
> No ruby apps or libraries should break because they are surprised by
> (Unicode) Strings, or it should be obvious the fault is with them.
>
> Optionally, additional String classes with different internal Unicode
> encodings might be a boon for certain performance sensitive
> applications, and they should all work together much like Numbers of
> different kinds do.
>
> While I want ruby source files to be UTF-8 encoded, in no way do I
> want identifiers to consist of additional national characters. I like
> names in APIs everyone can actually type, but literal Strings are a
> different matter.
>
> I know this is a bit vague on the one hand, and might demand intrusive
> changes on the other one.  Java history shows proper Unicode support
> is no trivial matter, and I don't feel qualified to give advice how to
> implement this. It's just my vision of how Strings ideally would be.
>
> And of course for my personal vision to become perfect, everyone
> outside Ruby should adopt Unicode too.
>
> Jürgen