Florian Gross wrote:
> David Garamond wrote:
> 
>> If someone could summarize the recent Unicode/multibyte string 
>> discussion on a wiki, that would be nice (and _very_ useful). It will 
>> help programmers prepare their code for Unicode support and backward 
>> compatibility in the future. Topics should include:
> 
> 
> Note that lots of this was recently discussed in [ruby-core:04146]. I'll 
> try to answer the questions as accurately as possible.
> 
>> - how will strings be stored in memory (which will probably be different 
>> between CRuby, JRuby, Ruby-on-Parrot, Ruby-on-dotnet, etc);
> 
> 
> AFAIK just the raw bytes as before. (And UTF8 and so on can use multiple 
> bytes for one character.) Note that the RString record of Ruby will get 
> a new field for the encoding.
> 
>> - how to check a string's charset, encoding;
> 
> 
> String#encoding. It will return a String.
> 
>> - how to do various operations on the new multibyte string, especially 
>> those which will be done differently compared to the classic string;
> 
> 
> Just like before, AFAIK. E.g. String#downcase, String#gsub and so on.
> 
>> - what will happen to the classic string (e.g. will it perhaps be 
>> renamed to ByteArray or something);
> 
> 
> The String interface will remain the same. Strings will just gain 
> the encoding facilities, but will remain largely backwards compatible 
> AFAIK.
> 
>> - comparison rules for cross-encoding and cross-charset strings;
> 
> 
> Strings that have the same encoding and the same bytes are equivalent.
> Strings that have ASCII compatible, but different encodings and only 
> ASCII characters are equivalent.
> Everything else is different.
> 
> I think there will be ways for converting from one encoding to another 
> one, but I don't know the details.
> 
>> - regexes;
> 
> 
> Regexp#encoding is introduced, matching uses similar rules as String 
> comparison.
> 
>> - how will Ruby differ from Perl/Python/Java/PHP in Unicode/multibyte 
>> string support (especially since Ruby is a pretty latecomer in the 
>> Unicode scene);
> 
> 
> I can't really do an in-depth comparison here, because I don't know the 
> other languages.
> 
> Note that str[0] will return a one-character String and that ?x will do 
> the same. There will be a new method like String#codepoint for getting 
> the underlying raw bytes. I think the one-character Strings can later 
> still be optimized fairly easily so that they can be immediate Objects.
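
For readers who want to try the rules quoted above, here is a minimal sketch. It assumes a released Ruby (1.9 or later), whose behaviour is not guaranteed to match the proposal under discussion; for instance, String#encoding there returns an Encoding object rather than a String. The helper name same_text? is made up for illustration.

```ruby
# Sketch only: behaviour of released Ruby (1.9+), not of the proposal itself.

# Hypothetical helper: plain String#== already applies the comparison rules.
def same_text?(a, b)
  a == b
end

utf8   = "abc"
latin1 = "abc".encode("ISO-8859-1")    # same bytes, different encoding tag
puts utf8.encoding.inspect             # an Encoding object, not a String
puts same_text?(utf8, latin1)          # ASCII-only in ASCII-compatible encodings

accented = "\u00E9"                    # e with acute accent, two bytes in UTF-8
puts same_text?(accented, accented.b)  # same bytes, but tagged as binary
puts accented.bytes.inspect            # the underlying raw bytes
puts accented.ord                      # the code point of the first character
puts "h\u00E9llo"[1] == accented       # indexing yields a one-character String
puts ?x == "x"                         # and so does the ?x literal
```

Converting between encodings is done with String#encode in this sketch; the thread itself leaves that API open.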


An addition and two questions: the encoding of the source file will be 
indicated with the same approach as Python:
  #!/usr/bin/ruby
  # -*- coding: <encoding name> -*-

or with a command-line option (maybe -K), or with a compile-time setting.
But I wonder: why can't we keep using $KCODE for this, instead of having 
to use that ugly magic string?
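
As a rough illustration of the magic comment, the sketch below assumes a released Ruby (1.9 or later), where the comment has to sit on the first line, or on the second line after a shebang, and where __ENCODING__ reports the encoding the parser used for the file. The method name source_encoding_name is invented for the example.

```ruby
#!/usr/bin/ruby
# -*- coding: utf-8 -*-
# Sketch only: assumes Ruby 1.9+, where the comment above sets the
# encoding of this source file.

# Hypothetical helper: the name of the encoding the parser assumed here.
def source_encoding_name
  __ENCODING__.name
end

puts source_encoding_name      # what the parser used for this file
puts "abc".encoding.name       # plain literals inherit the source encoding
puts "\u00E9".encoding.name    # \u escapes always produce UTF-8 strings
```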

Also, not that I am an expert, but is localization supposed to work?
I.e., are accented letters, which are common in European languages, 
supposed to be capitalizable and so on?
Isn't this related to a charset property of the string, distinct from its 
encoding?
IIRC in Parrot-land a string is a <stream of 
bytes>+<encoding>+<charset>+<language>; how come we only care 
about one of these things?
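
To make the question concrete, here is a small sketch of case mapping on accented letters. It assumes Ruby 2.4 or later, where String#upcase and String#downcase apply Unicode case mapping and accept a :turkic option; nothing in this thread promises that behaviour, and the helper names are invented.

```ruby
# Sketch only: assumes Ruby >= 2.4 (Unicode-aware case mapping).

# Hypothetical helper: upcase using the default Unicode rules.
def shout(text)
  text.upcase
end

# Hypothetical helper: downcase with the Turkic rules, a language-dependent
# mapping that the encoding alone cannot select.
def turkic_downcase(text)
  text.downcase(:turkic)
end

puts shout("\u00E9t\u00E9") == "\u00C9T\u00C9"   # accented e is capitalized
puts "I".downcase == "i"                         # default mapping
puts turkic_downcase("I") == "\u0131"            # dotless i under Turkic rules
```

The language is passed per call here rather than stored on the string, which is one possible answer to the <language> part of the question.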

Also, given that this seems like a huge amount of work... will it be spun 
off into a proper independent libm17n library? :)