On 15-jun-2006, at 2:11, Charles O Nutter wrote:

> Every time these unicode discussions come up my head spins like a  
> top. You
> should see it.
>
> We would love to be able to support unicode in JRuby, but there's  
> always
> that nagging question of what it should look like and what would  
> mesh well
> with the Ruby community at large. With the underlying platform  
> already rich
> with unicode support, it would not take much effort to modify  
> JRuby. So then
> there's a simple question:

Yukihiro Matsumoto wrote:

>
> Define "proper Unicode support" first.
>
> I'm planning enhancing Unicode support in 1.9 in a year or so
> (finally).  But I'm not sure that conforms your definition of "proper
> Unicode support".  Note that 1.8 handles Unicode (UTF-8) if your
> string operations are based on Regexp.
>

Hello everyone, and sorry for chiming in so fiercely. I got somewhat
confused by the ML controls.

I just joined the list after seeing the subject pop up once more. I am
doing Unicode-aware apps in Rails and Ruby right now and it hurts.
I'll try to define "proper Unicode support" as I see it (and dream of
it at night).

1. All string indexing (length, index, slice, insert) works with
characters instead of bytes, however many bytes each character
takes.
String methods (index or =~) should _never_ return offsets that will
damage the string's characters if employed for slicing - you
shouldn't have to manually translate a byte offset of 2 to a
character offset of 1 because the character before it is multibyte.

Simple example:

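     # Ruby 1.8: str[] indexes bytes, so convert a byte offset
     # into a character offset by decoding the leading chunk.
     def translate_offset(str, byte_offset)
       chunk = str[0..byte_offset]
       begin
         chunk.unpack("U*").length - 1
       rescue ArgumentError
         # The offset falls inside a character: shift upwards and retry.
         chunk = str[0..(byte_offset += 1)]
         retry
       end
     end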

I think it's unnecessarily painful for something as easy as string
=~ /pattern/. Yes, you can take the offset you receive from =~, slice
the string up to it, and then split that slice with /./mu to count
the characters and arrive at the same number, etc...
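
To make the mismatch concrete, here is a minimal sketch. It assumes a
later Ruby with encoding-aware strings (String#b and String#byteslice,
so 1.9.3 or newer), not the 1.8 discussed in this thread; the binary
copy from String#b stands in for 1.8's byte-oriented view, and
byte_to_char_offset is a hypothetical helper name.

```ruby
# Sketch only: assumes Ruby 1.9.3+ (String#b, String#byteslice), not 1.8.
# byte_to_char_offset is a hypothetical helper name.
def byte_to_char_offset(str, byte_offset)
  # Take the bytes before the offset and count the characters they encode.
  prefix = str.byteslice(0, byte_offset)
  raise ArgumentError, "offset splits a character" unless prefix.valid_encoding?
  prefix.length
end

text = "héllo wörld"                # UTF-8; "é" and "ö" take two bytes each
byte_offset = text.b.index("w".b)   # byte-oriented search, as 1.8 would do
char_offset = text.index("w")       # character-oriented search

puts byte_offset                             # 7
puts char_offset                             # 6
puts byte_to_char_offset(text, byte_offset)  # 6
```

Slicing the text at byte offset 2 would cut "é" in half, which is why
the helper refuses such offsets instead of guessing.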

2. Case-insensitive regexes actually work. Even in my
Oniguruma-enabled builds of 1.8.2 this was not the case (maybe it has
changed now). At least "Unicode general" collation casefolding (such
a thing exists) should be available built-in on every platform.
4. Locale-aware sorting, including multibyte charsets, if provided by  
the OS
5. Preferably a separate (and strictly purposed) Bytestring that you
get out of Sockets and use in Servers etc. - or the ability to
"force" all strings received from external resources to be flagged
uniformly as being of a certain encoding in _your_ program, not
somewhere in someone's library. If flags have to be set by libraries,
they won't be set because most developers sadly don't care:

http://www.zackvision.com/weblog/2005/11/mt-unicode-mysql.html
http://thraxil.org/users/anders/posts/2005/11/01/unicodification/

6. Unicode-aware strip dealing with weirdo whitespaces (hair space,  
thin space etc.)
7. And no, as I mentioned - 1.8 doesn't handle Unicode properly even
with Regexp-based string operations, because the /i modifier is
broken, and to do without it you need to downcase BOTH the regexp and
the string itself. A vicious circle - you go and get the Unicode gem
with tables.
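
For item 6, a minimal sketch of what a Unicode-aware strip could look
like. It assumes a later Ruby (1.9 or newer) with UTF-8 strings, where
the [[:space:]] bracket class matches Unicode whitespace; unicode_strip
is a hypothetical helper, not a core method.

```ruby
# Sketch only: assumes Ruby 1.9+ with UTF-8 strings.
# unicode_strip is a hypothetical helper, not a core method.
def unicode_strip(str)
  # [[:space:]] matches Unicode whitespace in a UTF-8 string,
  # whereas String#strip only removes ASCII whitespace and NUL.
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

padded = "\u200A\u2009text\u00A0"   # hair space, thin space, no-break space
p padded.strip == padded            # true: the built-in strip leaves it alone
p unicode_strip(padded)             # "text"
```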

All of this can be controlled either per String (then 99 out of 100  
libraries I use will be getting it wrong - see above) or by a global  
setting such as $KCODE.

An example of something that is ridiculously backwards to do in
Ruby now is this (I spent some time refactoring it today):
http://dev.rubyonrails.org/browser/trunk/actionpack/lib/action_view/helpers/text_helper.rb#L44

Here you have a major problem because the /i flag doesn't do anything  
(Ruby is incapable of Unicode-aware casefolding), and using offsets  
means that you are always one step from damaging someone's text. It's  
just wrong that it has to be so painful.
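
The "fold both sides" workaround can be sketched as follows. This
assumes a much later Ruby (2.4 or newer), where String#downcase
accepts the :fold option; in 1.8 you would need external case tables
instead. include_folded? is a hypothetical helper name.

```ruby
# Sketch only: assumes Ruby 2.4+, where String#downcase accepts :fold.
# include_folded? is a hypothetical helper name.
def include_folded?(text, phrase)
  # Fold both the text and the phrase instead of relying on /i.
  text.downcase(:fold).include?(phrase.downcase(:fold))
end

puts include_folded?("Привет, МИР", "мир")    # true
puts include_folded?("Привет, МИР", "война")  # false
```

This only answers whether the phrase occurs. Case folding can change
the length of a string, so offsets found in the folded copy do not
necessarily map back onto the original text.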

Python3000, IMO, gets this right (as does Java) - a byte array and a
String are completely separate, and String operates on characters
and characters only.

That's what I would expect. Hope this makes sense somewhat :-)
--
Julian 'Julik' Tarkhanov
please send all personal mail to
me at julik.nl



