On 6/14/06, Yukihiro Matsumoto <matz / ruby-lang.org> wrote:
> Hi,
>
> In message "Re: Unicode roadmap?"
>     on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev" <vshepelev / imho.com.ua> writes:
>
> |From: Pete [mailto:pertl / gmx.org]
> |Sent: Wednesday, June 14, 2006 1:58 AM
> |> As I am German the 'missing' unicode support is one of the greatest
> |> obstacles for me (and probably all other Germans doing their stuff
> |> seriously)...
> |
> |The same is for Russians/Ukrainians. In our programming communities question
> |"does the programming language supports Unicode as 'native'?" has very high
> |priority.
>
> Alright, then what specific features are you (both) missing?  I don't
> think it is a method to get number of characters in a string.  It
> can't be THAT crucial.  I do want to cover "your missing features" in
> the future M17N support in Ruby.
>

What I want is for all methods to work seamlessly with Unicode
strings, so that I do not have to think about the encoding.

Regexps do work with UTF-8 strings if $KCODE is set to 'u' (but it
defaults to 'n' even when the locale uses UTF-8).
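For example (a sketch in terms of per-string encodings rather than the
global $KCODE, but the character-vs-byte contrast is the same one that
'u' vs 'n' gives):

```ruby
# Encoding-aware matching sees characters; treating the same bytes as
# binary (as under KCODE=n) sees individual bytes instead.
utf8 = "žluť"                  # 4 characters, 6 bytes in UTF-8
p utf8.scan(/./).length        # => 4  (characters)
p utf8.b.scan(/./).length      # => 6  (raw bytes)
```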

String searches should probably work, but they would return the wrong
position (a byte offset rather than a character index). Things like
split should work for UTF-8; the encoding is pretty well defined.
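To show the position problem concretely (again sketched with
per-string encodings standing in for $KCODE behaviour):

```ruby
# index reports a character offset when the encoding is known, but a
# byte offset when the string is handled as raw bytes.
s = "žluť"
p s.index("ť")        # => 3  (character position)
p s.b.index("ť".b)    # => 4  (byte position; "ž" occupies two bytes)
```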

But one might also want length and [] to work on characters. That can
be simulated with unicode_string = string.scan(/./), but the result is
no longer a string: it is an array, and it holds characters only as
long as I assign single characters through []=. The string functions
should do the right thing even for UTF-8, though I guess UTF-32 is
more useful for working with strings this way.
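The workaround looks like this, and the example shows why the result
is awkward to use in place of a real string:

```ruby
s = "žluť"
chars = s.scan(/./)    # an Array of one-character strings, not a String
p chars.length         # => 4  (character count at last)
p chars[0]             # => "ž" (character indexing works)
chars[3] = "t"         # []= accepts anything, not just single characters...
p chars.join           # => "žlut"  ...and it must be re-joined to get a String
```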

It might be a good idea to stick encoding information into strings (it
is probably the only way internationalization can be done while the
sanity of all involved is preserved). The functions for comparison,
etc. could then do the right thing even when strings come in several
encodings, i.e. cp1251 from the system, UTF-8 from a web page, ...
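A minimal sketch of the idea. EncodedString is a made-up name, not an
existing API, and String#encode stands in for whatever conversion
layer Ruby would provide:

```ruby
# Hypothetical: a string that carries its encoding, so comparison can
# convert both sides to a common encoding instead of comparing bytes.
class EncodedString
  attr_reader :raw
  def initialize(bytes, encoding)
    @raw = bytes.dup.force_encoding(encoding)
  end

  # Compare by content, not by raw bytes, even across encodings.
  def ==(other)
    raw.encode("UTF-8") == other.raw.encode("UTF-8")
  end
end

cp1251 = EncodedString.new("\xEF\xF0\xE8\xE2\xE5\xF2".b, "Windows-1251") # "привет" from the system
utf8   = EncodedString.new("привет", "UTF-8")                            # same word from a web page
p cp1251 == utf8   # => true, despite completely different bytes
```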

Functions like open could convert strings correctly according to the
locale. One should be able to set the encoding information explicitly
(i.e. for a web page title once the content-type meta tag is found in
the page), and to remove it to suppress string conversion. It should
also be possible to convert the string outright (i.e. to UTF-32 to
speed up character access).
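A sketch of those three operations in terms of the encoding-aware API
Ruby later grew (the file name is only for the demo):

```ruby
require "tmpdir"

path = File.join(Dir.tmpdir, "m17n-demo.txt")
File.write(path, "žluť")

# 1. open() decoding according to a declared (or locale) encoding:
text = File.open(path, "r:UTF-8") { |f| f.read }

# 2. setting or removing the encoding tag without touching the bytes,
#    e.g. once a page's content-type meta tag has been seen:
bytes = text.dup.force_encoding("ASCII-8BIT")  # tag removed: no conversion
bytes.force_encoding("UTF-8")                  # tag set once it is known

# 3. converting outright, i.e. to UTF-32 for constant-time character access:
wide = text.encode("UTF-32LE")
p wide.bytesize / 4   # => 4 -- one fixed 4-byte unit per character
```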

Things like <=>, upcase, downcase, etc. make sense only in the context
of a locale (language); the encoding alone does not define them.
I guess the default <=> is based on the binary representation of the
string, which would mean the same strings sort differently in
different encodings. Sorting by Unicode code point would at least be
the same for any encoding.
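Concretely, "ё" and "я" already swap order bytewise between cp1251 and
UTF-8, while their code points give a single answer:

```ruby
yo, ya = "ё", "я"
cp1251_yo = yo.encode("Windows-1251")
cp1251_ya = ya.encode("Windows-1251")

# Bytewise <=> disagrees between the two encodings:
p(yo.b <=> ya.b)                 # =>  1  ("ё" after "я" as UTF-8 bytes)
p(cp1251_yo <=> cp1251_ya)       # => -1  ("ё" before "я" as cp1251 bytes)

# By Unicode code point the order is the same whatever the encoding:
p(yo.ord <=> ya.ord)             # =>  1  (U+0451 > U+044F)
```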

Thanks

Michal