On 18/12/2008, at 7:24 PM, Brian Candler wrote:

> On Thu, Dec 18, 2008 at 08:43:45AM +0900,  
> danielcavanagh / aanet.com.au wrote:
>> concatenation could be extended too. if one string is a superset of the
>> other then no actual conversion needs to be done
>
> Concatenation raises interesting issues. For example:
>
>   data = "".force_encoding("UTF-8")
>   while chunk = file.read(1024)
>     data << chunk
>   end
>   # what is data.encoding ?
>
> Here each chunk of 1024 bytes may have split multibyte characters at start
> or end. However it's OK to concatenate them, and as long as the file is read
> to the end, the result would be valid UTF-8.
>
> Ruby's current behaviour is to do the concatenation bytewise, but it
> downgrades the encoding to binary when concatenating binary onto the end of
> UTF-8 (and File#read with a length argument returns binary):
>
> irb(main):001:0> data = "".force_encoding("UTF-8")
> => ""
> irb(main):002:0> data.encoding
> => #<Encoding:UTF-8>
> irb(main):003:0> data << "\x61"
> => "a"
> irb(main):004:0> data.encoding
> => #<Encoding:UTF-8>
> irb(main):005:0> data << "\xc3"
> => "a\xC3"
> irb(main):006:0> data << "\x9f"
> => "a\xC3\x9F"
> irb(main):007:0> data.encoding
> => #<Encoding:ASCII-8BIT>
> irb(main):008:0> data.force_encoding("UTF-8")
> => "a

well, we know how to solve that, don't we? make read() read characters,
not bytes ;)
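
a rough sketch of what reading characters instead of bytes could look
like with what ruby already has, using each_char. StringIO stands in for
a real file here, the sample text is arbitrary, and read_by_character is
just a name made up for this example, not an existing method:

```ruby
require "stringio"

# Hypothetical helper: accumulate an IO's content one character at a time.
# each_char yields whole characters in the IO's encoding, so a multibyte
# sequence is never split across two appends.
def read_by_character(io)
  data = String.new.force_encoding("UTF-8")
  io.each_char { |char| data << char }
  data
end

# Stand-in for a file on disk: "a", U+00DF and U+20AC take 1, 2 and 3
# bytes in UTF-8, so byte-sized chunks could easily cut one in half.
source = StringIO.new("a\u00DF\u20AC")
data = read_by_character(source)

puts data.encoding
puts data.valid_encoding?
```

appending one character at a time is slow for big files, so this is only
meant to show the idea, not how read() would actually do it.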

honestly, that seems to be the only proper solution. it makes no sense to
work with characters everywhere but then read only bytes. reading only
bytes should set the string's encoding to binary, and only when the
programmer is sure the string is valid utf-8 should he change the
encoding. the other options seem to be to continue doing what you
describe above (which is less than desirable) or raise an exception,