On 7/28/06, Matt Todd <chiology / gmail.com> wrote:
> So, the problem with Unicode support in Ruby is that the code
> currently assumes that each letter is one byte, instead of multiple?
> This includes presumably search algorithms (for Regexs, et al), then?
>
> Or is my understanding warped and wrong?

Regexes in 1.8 can do utf-8.

>
> _Why, et al, if you could break down the actual difficulties with
> implementing Unicode support into Ruby 1.8, I think that might clear
> up the questions we have as to whether a library eradicates all
> problems (obviously, some problems can't be fixed, but merely hacked
> or worked around).

The problem is with compatibility. In 1.8, strings are expected to be
arrays of bytes. You can split them into characters with a regex, or
convert them into a sequence of codepoints. But no standard library or
function would understand the result (except the single one that is
there for undoing the transformation).

So you have the choice to work with utf-8 strings and regexes, and
convert the strings whenever you need to get at individual characters.

Or you can use a special Unicode string class (such as the one from
icu4r) that no standard functions understand. Some may be able to call
to_s on it, but then you get a normal string back.

Or you can change the strings to handle utf-8 (or any other multibyte)
characters, and probably break most of the standard functions.

None of these is completely satisfactory, because each is far from
_transparent_ Unicode support in the standard string class. That is
planned for 2.0.

Thanks

Michal
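
P.S. A minimal sketch of the first option (plain utf-8 byte strings, converted on demand), in case it helps. The helper names are made up for illustration and are not part of any library. Only scan, unpack and pack are built-ins, and I am assuming the input is valid utf-8.

```ruby
# Hypothetical helper names; only scan, unpack and pack are Ruby built-ins.

# Split a utf-8 byte string into characters with a utf-8 regex.
def utf8_chars(str)
  str.scan(/./mu)
end

# Convert a utf-8 byte string into an array of codepoints.
def utf8_codepoints(str)
  str.unpack('U*')
end

# Undo the conversion: codepoints back to a utf-8 byte string.
def codepoints_to_utf8(codepoints)
  codepoints.pack('U*')
end

s = "na\xC3\xAFve"                      # "naive" with i-diaeresis, which takes two bytes
puts s.unpack('C*').size                # 6 bytes
puts utf8_chars(s).size                 # 5 characters
puts utf8_codepoints(s).inspect
puts codepoints_to_utf8(utf8_codepoints(s)) == s
```

The arrays you get back are ordinary arrays, so nothing in the standard library treats them as text. You have to pack or join them again before handing them to anything that expects a string.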