On 15-jun-2006, at 2:11, Charles O Nutter wrote:

> Every time these unicode discussions come up my head spins like a top. You
> should see it.
>
> We would love to be able to support unicode in JRuby, but there's always
> that nagging question of what it should look like and what would mesh well
> with the Ruby community at large. With the underlying platform already rich
> with unicode support, it would not take much effort to modify JRuby. So then
> there's a simple question:

Yukihiro Matsumoto wrote:

> Define "proper Unicode support" first.
>
> I'm planning enhancing Unicode support in 1.9 in a year or so
> (finally). But I'm not sure that conforms your definition of "proper
> Unicode support". Note that 1.8 handles Unicode (UTF-8) if your
> string operations are based on Regexp.

Hello everyone, and sorry for chiming in so fiercely. I got into some
confusion with the ML controls; I just joined the list after seeing the
subject pop up once more.

I am doing Unicode-aware apps in Rails and Ruby right now and it hurts.
I'll try to define "proper Unicode support" as I see it (and dream of it
at night).

1. All string indexing (length, index, slice, insert) works with
characters instead of bytes, however many bytes each character occupies.
String methods (index or =~) should _never_ return offsets that will
damage the string's characters if employed for slicing - you shouldn't
have to manually translate a byte offset of 2 into a character offset of 1
because the second character is multibyte. Simple example:

    # Ruby 1.8: String#[] indexes bytes, so translate a byte offset
    # (as returned by =~ or index) into a character offset.
    def translate_offset(str, byte_offset)
      chunk = str[0..byte_offset]
      begin
        chunk.unpack("U*").length - 1
      rescue ArgumentError
        # This offset is just wrong (it falls inside a multibyte
        # character)! Shift upwards and retry.
        chunk = str[0..(byte_offset += 1)]
        retry
      end
    end

I think it's unnecessarily painful for something as easy as
string =~ /pattern/. Yes, you can take the offset you receive from =~,
slice the string up to it and then split that slice again with /./mu to
get the character count, etc...

2.
Case-insensitive regexes actually work. Even in my Oniguruma-enabled
builds of 1.8.2 this was not true (it may have changed by now).

3. At least "Unicode general" collation casefolding (such a thing exists)
available built-in on every platform.

4. Locale-aware sorting, including multibyte charsets, if provided by the
OS.

5. Preferably a separate (and strictly purposed) Bytestring that you get
out of Sockets and use in Servers etc. - or the ability to "force" all
strings received from external resources to be flagged uniformly as being
of a certain encoding in _your_ program, not somewhere in someone's
library. If flags have to be set by libraries, they won't be set, because
most developers sadly don't care:

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

6. Unicode-aware strip that deals with the odd whitespace characters
(hair space, thin space etc.)

7. And no, as I mentioned - 1.8 doesn't handle Unicode properly even
through Regexp, because the /i modifier is broken, and to do without it
you need to downcase BOTH the regexp and the string itself. It's a vicious
circle - you go and get the Unicode gem with its tables.

All of this can be controlled either per String (then 99 out of 100
libraries I use will be getting it wrong - see above) or by a global
setting such as $KCODE.

An example of something that is ridiculously backwards to do in Ruby
right now is this (I spent some time refactoring it today):

http://dev.rubyonrails.org/browser/trunk/actionpack/lib/action_view/helpers/text_helper.rb#L44

Here you have a major problem because the /i flag doesn't do anything
(Ruby is incapable of Unicode-aware casefolding), and using offsets means
that you are always one step away from damaging someone's text. It's just
wrong that it has to be so painful.

Python3000, IMO, gets this right (as does Java) - a byte array and a
String are completely separate, and String operates with characters and
characters only. That's what I would expect.
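To make point 1 a bit more concrete, here is a minimal sketch of the behaviour I mean. It assumes a Ruby whose strings carry an encoding and offer String#byteslice and String#valid_encoding? (1.9.3 or later, so not the 1.8 discussed here), and byte_to_char_offset is a hypothetical helper name, not a core method. In such a Ruby, index already answers in characters, and a byte offset that lands inside a character can be rejected instead of silently damaging the text:

```ruby
# Sketch only: assumes Ruby 1.9.3+ (String#byteslice, String#valid_encoding?).
# byte_to_char_offset is a hypothetical helper, not part of core Ruby.
def byte_to_char_offset(str, byte_offset)
  prefix = str.byteslice(0, byte_offset)
  # A prefix that ends in the middle of a multibyte character is not
  # valid in the string's encoding, so refuse to use that offset.
  unless prefix.valid_encoding?
    raise ArgumentError, "byte offset #{byte_offset} falls inside a character"
  end
  prefix.length # length counts characters, not bytes
end

# "héllo wörld": 11 characters, 13 bytes in UTF-8
text = "h\u00E9llo w\u00F6rld"

char_offset_of_w = text.index("w")                 # character offset
translated       = byte_to_char_offset(text, 7)    # "w" starts at byte 7
```

Here both char_offset_of_w and translated come out as the same character offset, while a byte offset of 2 (the middle of the "é") raises instead of cutting the character in half.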
Hope this makes sense somewhat :-)

--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl