On 18/12/2008, at 7:24 PM, Brian Candler wrote:

> On Thu, Dec 18, 2008 at 08:43:45AM +0900,
> danielcavanagh / aanet.com.au wrote:
>> concatenation could be extended too. if one string is a superset of the
>> other then no actual conversion needs to be done
>
> Concatenation raises interesting issues. For example:
>
>   data = "".force_encoding("UTF-8")
>   while chunk = file.read(1024)
>     data << chunk
>   end
>   # what is data.encoding ?
>
> Here each chunk of 1024 bytes may have split multibyte characters at start
> or end. However it's OK to concatenate them, and as long as the file is read
> to the end, the result would be valid UTF-8.
>
> Ruby's current behaviour is to do the concatenation bytewise, but
> downgrades the encoding to binary when concatenating binary onto the end of
> UTF-8 (and File#read returns binary)
>
> irb(main):001:0> data = "".force_encoding("UTF-8")
> => ""
> irb(main):002:0> data.encoding
> => #<Encoding:UTF-8>
> irb(main):003:0> data << "\x61"
> => "a"
> irb(main):004:0> data.encoding
> => #<Encoding:UTF-8>
> irb(main):005:0> data << "\xc3"
> => "a\xC3"
> irb(main):006:0> data << "\x9f"
> => "a\xC3\x9F"
> irb(main):007:0> data.encoding
> => #<Encoding:ASCII-8BIT>
> irb(main):008:0> data.force_encoding("UTF-8")
> => "aß"

well, we know how to solve that, don't we? make read() read characters, not
bytes ;)

honestly, that seems to be the only proper solution. it makes no sense to
work with characters everywhere but then read only bytes. reading only
bytes should set the string's encoding to binary, and only when the
programmer is sure the string is valid utf-8 should he change the encoding.

the other options seem to be to continue doing what you describe above
(which is less than desirable) or raise an exception,
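
to make the "read bytes as binary, retag only when sure" idea concrete, here is a minimal sketch. read_all_utf8 is a hypothetical helper name, not anything in ruby itself, and a StringIO stands in for the file so the example is self-contained. it assumes the whole input fits in memory and that raising on invalid data is the behaviour you want.

```ruby
require "stringio"

# Hypothetical helper (not part of Ruby): read an IO in fixed-size byte
# chunks, keep the buffer binary while it may still be incomplete, and
# only tag it as UTF-8 once everything has been read and checked.
def read_all_utf8(io, chunk_size = 1024)
  data = String.new.force_encoding(Encoding::BINARY)  # binary buffer
  while (chunk = io.read(chunk_size))
    data << chunk            # bytewise append; a chunk may split a character
  end
  data.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "input is not valid UTF-8" unless data.valid_encoding?
  data
end

# "a" followed by U+00DF is three bytes (61 C3 9F); a chunk size of 2
# deliberately splits the two-byte character across reads.
io = StringIO.new("a\xC3\x9F".force_encoding(Encoding::BINARY))
text = read_all_utf8(io, 2)
```

the buffer is never labelled utf-8 while it might end in half a character, so nothing downstream can mistake a partial read for text. the encoding changes exactly once, after valid_encoding? has had its say.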