On 26.6.2006, at 20:37, Michal Suchanek wrote:

>> However, whether you use an encoding or not, you still get a String
>> back. Consider:
>>
>>   s1 = File.open("file.txt", "rb") { |f| f.read }
>>   s2 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
>>
>>   s1.class == s2.class # true
>>   s1.encoding == s2.encoding # false
>>
>> But that doesn't mean I have to keep treating s1 as a raw data byte
>> array -- or even convert it.
>>
>>   s1.encoding = :utf8
>>   s1.encoding == s2.encoding # true
>>
>> I think that the fundamental difference here is whether you view
>> encoded strings as fundamentally different objects, or whether you
>> view the encodings as *lenses* on how to interpret the object data.
>> I prefer the latter view.
>
> If you consider
>   s3 = File.open('legacy.txt','rb',:iso885915) { |f| f.read }
> without autoconversion you would have to immediately do
>   s3.recode :utf8
> otherwise s1 + s3 would not work.

Yes. This shows that without autoconversion, the programmer will
always need to recode to a common application encoding if the
application is to work without problems. And if we always need to
recode strings that we receive from third-party classes/libraries,
encoding handling will either consume half of the program's lines or
people won't do it and programs will be full of errors. As experience
with other languages (and Ruby) shows, the second option will prevail
and we will be in a mess not much better than today.
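
To illustrate the recoding burden, here is a minimal sketch. It
assumes String#encode, String#force_encoding and
Encoding::CompatibilityError as they exist in Ruby 1.9 and later, not
the File.open(..., encoding:) / recode API discussed in this thread;
the variable names are made up for the example.

```ruby
# Assumption: Ruby 1.9+ encoding API, not the API proposed in this thread.
utf8   = "caf\u00E9"                                   # UTF-8, one non-ASCII character
legacy = [0xA4].pack("C").force_encoding("ISO-8859-15") # euro sign as one ISO-8859-15 byte

# Without autoconversion, joining the two strings fails outright.
begin
  utf8 + legacy
  mixed_ok = true
rescue Encoding::CompatibilityError
  mixed_ok = false
end

# The caller has to recode to the common application encoding first.
recoded = legacy.encode("UTF-8")
joined  = utf8 + recoded
```

Every string arriving from a library in a different encoding would
need that extra encode step before it can be combined with anything
else.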

Therefore m17n without autoconversion (as in Matz's current proposal)
gains us almost nothing. If we have no autoconversion, my vote goes
to a Unicode internal encoding (because it implicitly handles
autoconversion problems).

On the topic of ByteArray: my concern is that the distinction between
bytes and characters will not be clear, and therefore we need to
introduce ByteArray to separate bytes from characters, to ensure
reliability and predictability of code like
  result = File.open("file") { |f| f.read 1000 }
(now tell me what 'result' is?).

If there are clear and simple rules, such as "IO always returns
binary strings if not given an encoding parameter", then this
distinction will not need to be additionally enforced by separate
classes. One String class will do.
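
A rough sketch of what such a rule looks like in practice. It relies
on behaviour found in Ruby 1.9 and later (IO#read with a length
returns a binary-tagged string), which is an assumption relative to
the proposals in this thread; the file contents and variable names
are invented for the example.

```ruby
require "tempfile"

# Assumption: Ruby 1.9+ IO semantics, not the API proposed in this thread.
partial, whole = Tempfile.create("bytes_vs_chars") do |f|
  f.write("\u00E9t\u00E9")   # 3 characters, 5 bytes in UTF-8
  f.flush
  [
    # No encoding given: a fixed number of *bytes*, tagged as binary.
    File.open(f.path, "rb") { |io| io.read(4) },
    # Encoding given: the whole file as *characters* in that encoding.
    File.open(f.path, "r:UTF-8") { |io| io.read }
  ]
end

partial.encoding                                  # binary (ASCII-8BIT)
partial.bytesize                                  # 4 bytes, ends mid-character
partial.dup.force_encoding("UTF-8").valid_encoding? # false: not usable as text
whole.length                                      # 3 characters
```

With a rule like this the answer to "what is result?" is always
"bytes", and a single String class carries the distinction in its
encoding tag.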

On the other hand, if there are all kinds of automatic encoding
tagging for the convenience of simple-script writers, then we need
ByteArray to prevent error-prone code with undefined results.
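
For concreteness, a minimal sketch of what I mean by a separate class.
This ByteArray is purely hypothetical: it is not part of Ruby or of
any proposal in this thread, and its method names (size, [], decode)
are invented here. It is built on String methods from Ruby 1.9 and
later.

```ruby
# Hypothetical ByteArray: raw bytes that never silently act as text.
class ByteArray
  def initialize(bytes)
    # Keep a private, binary-tagged copy of the data.
    @bytes = bytes.dup.force_encoding(Encoding::BINARY).freeze
  end

  # Number of bytes, never characters.
  def size
    @bytes.bytesize
  end

  # Byte value (0..255) at the given index.
  def [](index)
    @bytes.getbyte(index)
  end

  # The only way to get text out: name the encoding explicitly.
  def decode(encoding)
    str = @bytes.dup.force_encoding(encoding)
    raise ArgumentError, "data is not valid #{encoding}" unless str.valid_encoding?
    str
  end
end

data = ByteArray.new("\u00E9")   # two bytes in UTF-8
text = data.decode("UTF-8")      # explicit step from bytes to characters

# ByteArray defines no to_str, so mixing it with a String is an error.
begin
  "prefix" + data
  mixed_ok = true
rescue TypeError
  mixed_ok = false
end
```

The point is that bytes cannot leak into string operations by
accident; the programmer has to say which encoding applies.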

izidor