David Garamond wrote:

> If someone could summarize the recent Unicode/multibyte string 
> discussion on a wiki, that would be nice (and _very_ useful). It will 
> help programmers prepare their code for Unicode support and backward 
> compatibility in the future. Topics should include:

Note that lots of this was recently discussed in [ruby-core:04146]. I'll 
try to answer the questions as accurately as possible.

> - how will strings be stored in memory (which will probably be different 
> between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);

AFAIK just the raw bytes, as before. (UTF-8 and similar encodings can use 
multiple bytes for one character.) Note that Ruby's RString struct will 
get a new field for the encoding.
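
To make the byte/character distinction concrete, here is a small sketch written against Ruby 1.9 and later, where this design eventually shipped. The variable names are only illustrative, and the string is written with an escape so the example does not depend on the file's own encoding.

```ruby
# "café": four characters, but the last one takes two bytes in UTF-8.
s = "caf\u00E9"

chars = s.length     # number of characters
bytes = s.bytesize   # number of raw bytes held in memory
raw   = s.bytes      # the bytes themselves, as Integers

puts "#{chars} characters, #{bytes} bytes: #{raw.inspect}"
```

The point is only that the stored data is still a plain byte sequence; the encoding tag decides how those bytes are grouped into characters.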

> - how to check a string's charset, encoding;

String#encoding. It will return a String.
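
For reference, a minimal sketch of how this looks in Ruby 1.9 and later. There the method is indeed called String#encoding, though it returns an Encoding object rather than a String; the name is available through Encoding#name. The variable names below are just for illustration.

```ruby
s = "caf\u00E9"

enc        = s.encoding         # an Encoding object
name       = enc.name           # its name as a String
ascii_only = s.ascii_only?      # does the string contain only ASCII characters?
valid      = s.valid_encoding?  # are the bytes well-formed for this encoding?

puts "#{name} ascii_only=#{ascii_only} valid=#{valid}"
```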

> - how to do various operations in the new multibyte string, especially 
> those which will be done differently compared to the classic string;

Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
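
A short sketch of what "just like before" means in practice, again using Ruby 1.9 and later as the reference point: the familiar methods keep their names, but work per character instead of per byte. The sample strings are my own.

```ruby
s = "na\u00EFve caf\u00E9"   # "naïve café"

# Each non-ASCII character is replaced by exactly one "?",
# not by one "?" per byte.
masked = s.gsub(/[^a-z ]/, "?")

# Reversal keeps multibyte characters intact.
reversed = "caf\u00E9".reverse

puts masked
puts reversed
```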

> - what will happen to the classic string (e.g. will it perhaps be 
> renamed to ByteArray or something);

The String interface will remain the same. Strings will simply gain the 
encoding facilities, and will remain largely backwards compatible AFAIK.

> - comparison rules for cross-encoding and cross-charset strings;

Strings that have the same encoding and the same bytes are equivalent.
Strings that have different but ASCII-compatible encodings and contain 
only ASCII characters are equivalent.
Everything else is different.
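
The rules above can be checked with a small sketch in Ruby 1.9 and later, which behaves this way. The variable names are illustrative; String#b (Ruby 2.0+) is used only to get a fresh copy of the bytes that can be retagged.

```ruby
# ASCII-only content, two different ASCII-compatible encodings.
ascii_utf8   = "abc".encode("UTF-8")
ascii_latin1 = "abc".encode("ISO-8859-1")

# Identical bytes (C3 A9), but tagged with different encodings.
e_utf8   = "\u00E9"
e_latin1 = e_utf8.b.force_encoding("ISO-8859-1")

same_ascii = (ascii_utf8 == ascii_latin1)      # ASCII-only: equal
same_bytes = (e_utf8.bytes == e_latin1.bytes)  # the raw bytes match
same_str   = (e_utf8 == e_latin1)              # but the Strings do not

puts [same_ascii, same_bytes, same_str].inspect
```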

I think there will be ways to convert from one encoding to another, but 
I don't know the details.
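
For what it's worth, in Ruby 1.9 and later conversion is done with String#encode. A minimal sketch, with illustrative variable names:

```ruby
utf8 = "caf\u00E9"

# Transcode: same characters, different bytes.
latin1     = utf8.encode("ISO-8859-1")
round_trip = latin1.encode("UTF-8")

puts "UTF-8: #{utf8.bytesize} bytes, ISO-8859-1: #{latin1.bytesize} bytes"
puts round_trip == utf8
```

Unlike retagging the bytes, this rewrites the data, so the character count stays the same while the byte count may change.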

> - regexes;

Regexp#encoding will be introduced; matching uses rules similar to those 
for String comparison.
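
A sketch of how this behaves in Ruby 1.9 and later, as I understand it: an ASCII-only regexp matches strings in any ASCII-compatible encoding, while a regexp containing non-ASCII characters refuses to match a non-ASCII string in a different encoding. The names below are illustrative.

```ruby
ascii_re = /caf/
utf8_re  = Regexp.new("caf\u00E9")
latin1   = "caf\u00E9".encode("ISO-8859-1")

ascii_name = ascii_re.encoding.name
utf8_name  = utf8_re.encoding.name

# ASCII-only pattern against a Latin-1 string: matches at index 0.
pos = ascii_re =~ latin1

# UTF-8 pattern against a non-ASCII Latin-1 string: incompatible.
begin
  utf8_re =~ latin1
  mismatch = false
rescue Encoding::CompatibilityError
  mismatch = true
end

puts [ascii_name, utf8_name, pos, mismatch].inspect
```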

> - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte 
> string support (especially since Ruby is a pretty latecomer in the 
> Unicode scene);

I can't really do an in-depth comparison here, because I don't know the 
other languages.

Note that str[0] will return a one-character String and that ?x will do 
the same. There will be a new method, something like String#codepoint, 
for getting the underlying numeric value. I think the one-character 
Strings can still be optimized fairly easily later on, so that they 
become immediate objects.
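
As a sketch of the indexing behaviour, here is how it works in Ruby 1.9 and later. The method for the numeric value ended up being called String#ord there, with String#codepoints and String#bytes for whole strings; the variable names are my own.

```ruby
s = "\u00E9t\u00E9"   # "été"

first = s[0]          # a one-character String, not an Integer
lit   = ?x            # also a one-character String

cp  = first.ord       # code point of that character
cps = s.codepoints    # code points of every character
raw = first.bytes     # the raw bytes behind the first character

puts [first, lit, cp, cps, raw].inspect
```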