On 7/28/06, Matt Todd <chiology / gmail.com> wrote:
> So, the problem with Unicode support in Ruby is that the code
> currently assumes that each letter is one byte, instead of multiple?
> This includes presumably search algorithms (for Regexs, et al), then?
>
> Or is my understanding warped and wrong?

Regexes in 1.8 can handle UTF-8: with the /u flag (or $KCODE set to
'u') they match whole characters rather than bytes.
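
A minimal sketch of what that looks like, assuming the string holds
valid UTF-8. The helper name utf8_chars and the sample string are made
up for illustration; the accented letter is written as raw bytes so
the snippet does not depend on the encoding of the source file.

```ruby
# Split a UTF-8 string into characters. With the u flag, . matches one
# whole UTF-8 character instead of one byte; m lets . match newlines too.
def utf8_chars(str)
  str.scan(/./mu)
end

# "héllo", with the é written as its two UTF-8 bytes (0xC3 0xA9).
sample = "h\xC3\xA9llo"

puts utf8_chars(sample).length      # number of characters
puts sample.unpack('C*').length     # number of bytes
```

The two counts differ because the é takes two bytes but is matched as
one character.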

>
> _Why, et al, if you could break down the actual difficulties with
> implementing Unicode support into Ruby 1.8, I think that might clear
> up the questions we have as to whether a library eradicates all
> problems (obviously, some problems can't be fixed, but merely hacked
> or worked around).

The problem is with compatibility. In 1.8 strings are expected to be
arrays of bytes. You can split them into characters with a regex or
convert them into a sequence of codepoints. But no standard library or
function would understand the result (except the single one that is
there for undoing the transformation).
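
A rough sketch of the codepoint route, assuming valid UTF-8 input.
Reading "the single one that undoes the transformation" as
Array#pack('U*') is my assumption, and the helper names to_codepoints
and from_codepoints are invented for the example.

```ruby
# Convert a UTF-8 string into an array of integer codepoints.
def to_codepoints(str)
  str.unpack('U*')
end

# Undo the transformation: pack the codepoints back into UTF-8.
def from_codepoints(codepoints)
  codepoints.pack('U*')
end

sample = "h\xC3\xA9llo"            # "héllo", é as bytes 0xC3 0xA9
points = to_codepoints(sample)
p points                           # an Array of Integers, not a String
p from_codepoints(points) == sample
```

The intermediate value is a plain Array of Integers, so anything that
expects a String has to be handed the packed form again.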

So one choice is to work with UTF-8 strings and regexes, and to
convert a string into characters whenever you actually need
characters.
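
As an illustration of converting only at the point of need, here is a
sketch of a character-wise reverse. In 1.8 the built-in String#reverse
works on bytes and would scramble the two bytes of a multibyte
character. The name utf8_reverse is hypothetical, and valid UTF-8
input is assumed.

```ruby
# Reverse a UTF-8 string character by character. The string is split
# into characters only for this operation and joined straight back.
def utf8_reverse(str)
  str.scan(/./mu).reverse.join
end

sample = "h\xC3\xA9llo"            # "héllo", é as bytes 0xC3 0xA9
puts utf8_reverse(sample)
```

The result is an ordinary string again, so it can go straight back
into any standard function.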

Or you can use a special Unicode string class (such as the one from
icu4r) that no standard functions understand. Some may be able to call
to_s on it, but then you get a normal string back.
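
To make the trade-off concrete, here is a toy wrapper class. It is
purely hypothetical and is not icu4r's API; the class name
CodepointString and its three methods are invented for this sketch,
which assumes valid UTF-8 input.

```ruby
# Hypothetical wrapper, NOT icu4r's API: stores codepoints, not bytes.
class CodepointString
  def initialize(utf8)
    @codepoints = utf8.unpack('U*')
  end

  # Length in characters (codepoints), not bytes.
  def length
    @codepoints.length
  end

  # Character at the given index as a plain UTF-8 String, or nil.
  def [](index)
    cp = @codepoints[index]
    cp && [cp].pack('U')
  end

  # Back to a plain String; the character-level view is lost here.
  def to_s
    @codepoints.pack('U*')
  end
end

wrapped = CodepointString.new("h\xC3\xA9llo")
puts wrapped.length
puts wrapped.to_s.class
```

The object is not a String, so code that checks for one will not
accept it, and whatever calls to_s is back to working with a normal
string.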

Or you can change the strings to handle utf-8 (or any other multibyte)
characters, and probably break most of the standard functions.

None of these is completely satisfactory, because each is far from
_transparent_ Unicode support in the standard String class. That is
planned for 2.0.

Thanks

Michal