Florian Gross wrote:
> David Garamond wrote:
>
>> If someone could summarize the recent Unicode/multibyte string
>> discussion on a wiki, that would be nice (and _very_ useful). It will
>> help programmers prepare their code for Unicode support and backward
>> compatibility in the future. Topics should include:
>
> Note that lots of this was recently discussed in [ruby-core:04146]. I'll
> try to answer the questions as accurately as possible.
>
>> - how will strings be stored in memory (which will probably be different
>> between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);
>
> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple
> bytes for one character.) Note that the RString record of Ruby will get
> a new field for the encoding.
>
>> - how to check a string's charset, encoding;
>
> String#encoding. It will return a String.
>
>> - how to do various operations in the new multibyte string, especially
>> those which will be done differently compared to the classic string;
>
> Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
>
>> - what will happen to the classic string (e.g. will it perhaps be
>> renamed to ByteArray or something);
>
> The String interface will remain the same. Strings will just gain the
> encoding facilities, but will remain largely backwards compatible
> AFAIK.
>
>> - comparison rules for cross-encoding and cross-charset strings;
>
> Strings that have the same encoding and the same bytes are equivalent.
> Strings that have ASCII compatible, but different encodings and only
> ASCII characters are equivalent.
> Everything else is different.
>
> I think there will be ways for converting from one encoding to another
> one, but I don't know the details.
>
>> - regexes;
>
> Regexp#encoding is introduced, matching uses similar rules as String
> comparison.
>
>> - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte
>> string support (especially since Ruby is a pretty latecomer in the
>> Unicode scene);
>
> I can't really do an in-depth comparison here, because I don't know the
> other languages.
>
> Note that str[0] will return a one-character String and that ?x will do
> the same. There will be a new method like String#codepoint for getting
> the underlying raw bytes. I think the one-character Strings can later
> still be optimized fairly easily so that they can be immediate Objects.

An addition and two questions:

The encoding of the source file will be indicated with the same approach
as Python:

#!/usr/bin/ruby
# -*- coding: <encoding name> -*-

or with a command line option (maybe -K), or through compile-time
configuration.

But I wonder: why can't we keep using $KCODE for this, instead of having
to use that ugly magic string?

Also, not that I am an expert, but is localization supposed to work?
I.e. are accented letters, which are common in European languages,
supposed to be capitalizable and so on? Isn't this related to a charset
property of the string, distinct from its encoding? IIRC in Parrot-land
a string is a <stream of bytes>+<encoding>+<charset>+<language>; how come
we only care about one of these things?

Also, given that this seems like a huge amount of work... will it spin
off into a proper independent libm17n library? :)
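
To make the comparison rules and the str[0] / ?x behaviour quoted above a bit more concrete, here is a small sketch. It is an assumption on my part that it matches the plan discussed in this thread: it shows how Ruby 1.9 and later behave, as far as I can tell, and there String#encoding returns an Encoding object rather than a String. The helper name encoding_demo is made up for illustration.

```ruby
# -*- coding: utf-8 -*-
# Sketch only: shows Ruby 1.9+ behaviour, which may differ from the design
# discussed in the thread. encoding_demo is a hypothetical helper name.

def encoding_demo
  # ASCII-only content in two different ASCII-compatible encodings.
  ascii_utf8   = "abc".encode("UTF-8")
  ascii_latin1 = "abc".encode("ISO-8859-1")

  e_utf8       = "\u00E9"                     # "e" with acute accent, two bytes in UTF-8
  e_latin1     = e_utf8.encode("ISO-8859-1")  # same character, one byte in Latin-1
  # Same two bytes as e_utf8, but labelled as Latin-1.
  e_relabelled = e_utf8.dup.force_encoding("ISO-8859-1")

  word = "h\u00E9llo"

  {
    ascii_equal:      ascii_utf8 == ascii_latin1,           # ASCII only, compatible encodings
    transcoded_equal: e_utf8 == e_latin1,                   # same character, different bytes
    relabelled_equal: e_utf8 == e_relabelled,               # same bytes, different encoding
    roundtrip_equal:  e_latin1.encode("UTF-8") == e_utf8,   # explicit conversion first
    second_char:      word[1],                              # one-character String
    char_literal:     ?x,                                   # also a one-character String
    chars_and_bytes:  [word.length, word.bytesize],
    encoding_name:    e_utf8.encoding.name,
    codepoint:        e_utf8.ord,
    raw_bytes:        e_utf8.bytes.to_a
  }
end

encoding_demo.each { |key, value| puts "#{key}: #{value.inspect}" }
```

Only the first and fourth comparisons come out equal, which lines up with "everything else is different": non-ASCII strings have to be converted to a common encoding before they compare as equal.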