On Dec 18, 2008, at 5:00 AM, Daniel Cavanagh wrote:

> On 18/12/2008, at 7:24 PM, Brian Candler wrote:
>
>> On Thu, Dec 18, 2008 at 08:43:45AM +0900,
>> danielcavanagh / aanet.com.au wrote:
>>> concatenation could be extended too. if one string is a superset
>>> of the other then no actual conversion needs to be done
>>
>> Concatenation raises interesting issues. For example:
>>
>>   data = "".force_encoding("UTF-8")
>>   while chunk = file.read(1024)
>>     data << chunk
>>   end
>>   # what is data.encoding ?
>>
>> Here each chunk of 1024 bytes may have split multibyte characters
>> at start or end. However it's OK to concatenate them, and as long
>> as the file is read to the end, the result would be valid UTF-8.
>>
>> Ruby's current behaviour is to do the concatenation bytewise, but
>> downgrades the encoding to binary when concatenating binary onto
>> the end of UTF-8 (and File#read returns binary)
>>
>> irb(main):001:0> data = "".force_encoding("UTF-8")
>> => ""
>> irb(main):002:0> data.encoding
>> => #<Encoding:UTF-8>
>> irb(main):003:0> data << "\x61"
>> => "a"
>> irb(main):004:0> data.encoding
>> => #<Encoding:UTF-8>
>> irb(main):005:0> data << "\xc3"
>> => "a\xC3"
>> irb(main):006:0> data << "\x9f"
>> => "a\xC3\x9F"
>> irb(main):007:0> data.encoding
>> => #<Encoding:ASCII-8BIT>
>> irb(main):008:0> data.force_encoding("UTF-8")
>> => "a
>
> well we know how to solve that don't we? make read() read characters
> not bytes ;)
>
> honestly, that seems to be the only proper solution. it makes no sense
> to work with characters everywhere but then read only bytes. reading
> only bytes should set the string's encoding to binary, and only when
> the programmer is sure the string is valid utf-8 should he change
> the encoding.

You can certainly do that.  CSV does:

  #
  # Builds a String in <tt>@encoding</tt>.  All +chunks+ will be transcoded to
  # that encoding.
  #
  def encode_str(*chunks)
    chunks.map { |chunk| chunk.encode(@encoding.name) }.join
  end

  #
  # Reads at least +bytes+ from <tt>@io</tt>, but will read up to 10 bytes ahead if
  # needed to ensure the data read is valid in the encoding of that data.  This
  # should ensure that it is safe to use regular expressions on the read data,
  # unless it is actually a broken encoding.  The read data will be returned in
  # <tt>@encoding</tt>.
  #
  def read_to_char(bytes)
    return "" if @io.eof?
    data = @io.read(bytes)
    begin
      encoded = encode_str(data)
      raise unless encoded.valid_encoding?
      return encoded
    rescue  # encoding error or my invalid data raise
      if @io.eof? or data.size >= bytes + 10
        return data
      else
        data += @io.read(1) until data.valid_encoding? or
                                  @io.eof?             or
                                  data.size >= bytes + 10
        retry
      end
    end
  end

> the other options seem to be to continue to do what you describe above
> (which is less than desirable) or raise an exception, which would be
> annoying to have to check for but possibly better than the current
> solution. or maybe not...

Exceptions will be raised if you try to do something like match a regular
expression against data with a broken encoding.

James Edward Gray II
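
For anyone who wants to try the read-ahead idea outside of CSV, here is a minimal standalone sketch. It is not the CSV code: read_whole_chars, its slack parameter, and the StringIO sample data are all made up for illustration, and it only checks validity in one assumed encoding rather than transcoding.

```ruby
require "stringio"

# Hypothetical helper (not part of CSV or core Ruby): reads +bytes+ bytes
# from +io+, then up to +slack+ extra bytes, one at a time, until the data
# is valid in +encoding+. Returns the data tagged with +encoding+, which may
# still be invalid if the input really is broken.
def read_whole_chars(io, bytes, encoding = Encoding::UTF_8, slack = 10)
  data = io.read(bytes)                     # binary string, or nil at EOF
  return String.new(encoding: encoding) if data.nil?

  data.force_encoding(encoding)             # relabel the bytes, no conversion
  until data.valid_encoding? || io.eof? || data.bytesize >= bytes + slack
    data << io.read(1).force_encoding(encoding)
  end
  data
end

# Sample input: "a", U+00DF and U+20AC as raw UTF-8 bytes (61 C3 9F E2 82 AC).
io = StringIO.new("a\u00DF\u20AC".b)

# Two-byte reads would normally cut both multibyte characters in half.
chunks = []
chunks << read_whole_chars(io, 2) until io.eof?

p chunks.map(&:valid_encoding?)
p chunks.join.encoding
```

Because every chunk comes back already tagged with the target encoding, appending chunks with << should keep that encoding instead of falling back to binary as in the irb session quoted above.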