David Garamond wrote:

> If someone could summarize the recent Unicode/multibyte string
> discussion on a wiki, that would be nice (and _very_ useful). It will
> help programmers prepare their code for Unicode support and backward
> compatibility in the future. Topics should include:

Note that much of this was recently discussed in [ruby-core:04146]. I'll try to answer the questions as accurately as possible.

> - how will strings be stored in memory (which will probably be different
> between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

AFAIK just the raw bytes, as before. (UTF-8 and similar encodings can use multiple bytes for one character.) Note that Ruby's RString record will get a new field for the encoding.

> - how to check a string's charset, encoding;

String#encoding. It will return a String.

> - how to do various operations in the new multibyte string, especially
> those which will be done differently compared to the classic string;

Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.

> - what will happen to the classic string (e.g. will it perhaps be
> renamed to ByteArray or something);

The String interface will remain the same. Strings will just gain the encoding facilities, but will remain largely backwards compatible AFAIK.

> - comparison rules for cross-encoding and cross-charset strings;

Strings that have the same encoding and the same bytes are equivalent. Strings that have different but ASCII-compatible encodings and contain only ASCII characters are also equivalent. Everything else is different. I think there will be ways to convert from one encoding to another, but I don't know the details.

> - regexes;

Regexp#encoding is introduced, and matching uses rules similar to those for String comparison.

> - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
> string support (especially since Ruby is a pretty latecomer in the
> Unicode scene);

I can't really do an in-depth comparison here, because I don't know the other languages.
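
To make the comparison rules above concrete, here is a minimal sketch. It assumes a Ruby where the encoding support has actually landed (1.9 or later), so details may differ from what is described in this mail; in particular, String#encoding there returns an Encoding object rather than a String. The method name comparison_demo is made up for illustration.

```ruby
# Sketch only: assumes Ruby 1.9 or later. comparison_demo is a hypothetical helper.
def comparison_demo
  utf8   = "abc".encode("UTF-8")
  latin1 = "abc".encode("ISO-8859-1")

  e_utf8   = "\u00e9"                     # e with acute accent, two bytes in UTF-8
  e_latin1 = e_utf8.encode("ISO-8859-1")  # same character, one byte in Latin-1

  # The same single byte 0xE9, tagged with two different encodings.
  raw_latin1 = [0xE9].pack("C").force_encoding("ISO-8859-1")
  raw_latin2 = [0xE9].pack("C").force_encoding("ISO-8859-2")

  {
    # Different ASCII-compatible encodings, ASCII-only content: equal.
    ascii_only: utf8 == latin1,
    # Same character, different encodings, different bytes: not equal.
    non_ascii: e_utf8 == e_latin1,
    # Same bytes, different encodings, non-ASCII content: not equal.
    same_bytes: raw_latin1 == raw_latin2,
    # After converting to a common encoding the strings compare equal again.
    converted: e_latin1.encode("UTF-8") == e_utf8
  }
end

comparison_demo.each { |name, result| puts "#{name}: #{result}" }
```

Only the ascii_only and converted cases come out true, which matches the "everything else is different" rule.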
Note that str[0] will return a one-character String and that ?x will do the same. There will be a new method like String#codepoint for getting the underlying raw bytes. I think the one-character Strings can still be optimized fairly easily later on, so that they can be immediate Objects.
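
A small sketch of the indexing behaviour described above, again assuming Ruby 1.9 or later. String#ord and String#bytes are the methods that exist in those versions; they are not necessarily the method this mail anticipates. The helper name first_char_demo is hypothetical.

```ruby
# Sketch only: assumes Ruby 1.9 or later. first_char_demo is a hypothetical helper.
def first_char_demo(str)
  first = str[0]                # a one-character String, not an Integer
  {
    char: first,
    length: first.length,       # counted in characters
    codepoint: first.ord,       # integer code point of the character
    bytes: first.bytes.to_a     # the underlying raw bytes
  }
end

p first_char_demo("\u00e9t\u00e9")  # first character is U+00E9, two bytes in UTF-8
p ?x                                # the ?x literal is a one-character String as well
```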