On 6/14/06, Yukihiro Matsumoto <matz / ruby-lang.org> wrote:
> Hi,
>
> In message "Re: Unicode roadmap?"
>     on Wed, 14 Jun 2006 08:11:49 +0900, "Victor Shepelev" <vshepelev / imho.com.ua> writes:
>
> |From: Pete [mailto:pertl / gmx.org]
> |Sent: Wednesday, June 14, 2006 1:58 AM
> |> As I am German the 'missing' unicode support is one of the greatest
> |> obstacles for me (and probably all other Germans doing their stuff
> |> seriously)...
> |
> |The same is for Russians/Ukrainians. In our programming communities question
> |"does the programming language supports Unicode as 'native'?" has very high
> |priority.
>
> Alright, then what specific features are you (both) missing?  I don't
> think it is a method to get number of characters in a string.  It
> can't be THAT crucial.  I do want to cover "your missing features" in
> the future M17N support in Ruby.
>

What I want is for all string methods to work seamlessly with Unicode strings, so that I do not have to think about the encoding.

Regexps do work with UTF-8 strings if KCODE is set to u (but it defaults to n even when the locale uses UTF-8). String searches should probably work, but they would return the wrong position (a byte offset rather than a character offset). Things like split should work for UTF-8; the encoding is pretty well defined.

But one might want to use length and [] to work with strings. This can be simulated with unicode_string = string.scan(/./), but the result is no longer a string. It is an array, and it stays composed of single characters only as long as I assign only single characters using []=. The string functions should do the right thing even for UTF-8, though I guess UTF-32 is more useful for working with strings this way.

It might be a good idea to stick encoding information into strings (it is probably the only way internationalization can be done while preserving the sanity of all involved). The functions for comparison etc. could use it to do the right thing even if strings come in several encodings, e.g. cp1251 from the system, UTF-8 from a web page, ...
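
For illustration, a minimal sketch of the scan workaround and of a search that reports character positions instead of byte positions. The helpers chars_of and char_index are made up for this example and are not part of String; the /u flag is used so the regexp is UTF-8 aware without relying on KCODE. The sample word is arbitrary.

```ruby
# Hypothetical helpers for illustration only, not part of Ruby's String API.

# Split a UTF-8 string into an array of one-character strings.
# /u makes the regexp UTF-8 aware; /m lets . match newlines too.
def chars_of(str)
  str.scan(/./mu)
end

# Character (not byte) position of needle in haystack, or nil if absent.
def char_index(haystack, needle)
  hay = chars_of(haystack)
  pin = chars_of(needle)
  (0..hay.length - pin.length).each do |i|
    return i if hay[i, pin.length] == pin
  end
  nil
end

word       = "Grüße"
chars      = chars_of(word)            # array of characters, no longer a String
byte_count = word.unpack('C*').length  # 7 bytes in UTF-8
char_count = chars.length              # 5 characters

puts "bytes: #{byte_count}, characters: #{char_count}"
puts "character position of 'ße': #{char_index(word, 'ße')}"
```

A search that counts bytes would put "ße" at offset 4 here, because "ü" takes two bytes in UTF-8, while the character position is 3.
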
Functions like open could convert the string correctly according to the locale. One should be able to set the encoding information (e.g. for a web page title, once the meta tag for the content type is found in the page), and to remove it to suppress string conversion. It should also be possible to convert the string (e.g. to UTF-32 to speed up character access).

Things like <=>, upcase, downcase, etc. make sense only in the context of a locale (language). The encoding alone does not define them. I guess the default <=> is based on the binary representation of the string. This would mean that the same strings sort differently in different encodings. Sorting by Unicode code point would at least be the same for any encoding.

Thanks

Michal
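
P.S. A rough sketch of the sorting point. It assumes an interpreter where strings carry an encoding and where String#encode and a Windows-1251 (cp1251) converter exist, which is true of Ruby 1.9 and later but not of 1.8. The helper names bytes_in and code_points are made up for the example.

```ruby
# Assumption: Ruby 1.9 or later (String#encode, encoding-aware strings).
# bytes_in and code_points are hypothetical helpers for illustration.

# Raw bytes of a word after converting it to the named encoding.
def bytes_in(word, encoding)
  word.encode(encoding).unpack('C*')
end

# Unicode code points of a word, whatever encoding it arrives in.
def code_points(word)
  word.encode("UTF-8").unpack('U*')
end

words = ["я", "ё"]  # UTF-8 literals: U+044F and U+0451

# Binary comparison depends on the encoding of the data.
by_utf8_bytes   = words.sort_by { |w| bytes_in(w, "UTF-8") }
by_cp1251_bytes = words.sort_by { |w| bytes_in(w, "Windows-1251") }

# Code point comparison gives one order, even for cp1251 input.
cp1251_words  = words.map { |w| w.encode("Windows-1251") }
by_code_point = cp1251_words.sort_by { |w| code_points(w) }
                            .map { |w| w.encode("UTF-8") }

puts "UTF-8 bytes:  #{by_utf8_bytes.join(' ')}"
puts "cp1251 bytes: #{by_cp1251_bytes.join(' ')}"
puts "code points:  #{by_code_point.join(' ')}"
```

In cp1251 "ё" is the byte 0xB8 and "я" is 0xFF, so the byte order is the reverse of the code point order. Neither order is a proper alphabetical sort for Russian; that still needs the locale.
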