On Dec 18, 2008, at 5:00 AM, Daniel Cavanagh wrote:

> On 18/12/2008, at 7:24 PM, Brian Candler wrote:
>
>> On Thu, Dec 18, 2008 at 08:43:45AM +0900, danielcavanagh / aanet.com.au wrote:
>>> concatenation could be extended too. if one string is a superset of
>>> the other then no actual conversion needs to be done
>>
>> Concatenation raises interesting issues. For example:
>>
>>  data = "".force_encoding("UTF-8")
>>  while chunk = file.read(1024)
>>    data << chunk
>>  end
>>  # what is data.encoding ?
>>
>> Here each chunk of 1024 bytes may have split multibyte characters at
>> start or end. However it's OK to concatenate them, and as long as the
>> file is read to the end, the result would be valid UTF-8.
>>
>> Ruby's current behaviour is to do the concatenation bytewise, but
>> downgrades the encoding to binary when concatenating binary onto the
>> end of UTF-8 (and File#read returns binary)
>>
>> irb(main):001:0> data = "".force_encoding("UTF-8")
>> => ""
>> irb(main):002:0> data.encoding
>> => #<Encoding:UTF-8>
>> irb(main):003:0> data << "\x61"
>> => "a"
>> irb(main):004:0> data.encoding
>> => #<Encoding:UTF-8>
>> irb(main):005:0> data << "\xc3"
>> => "a\xC3"
>> irb(main):006:0> data << "\x9f"
>> => "a\xC3\x9F"
>> irb(main):007:0> data.encoding
>> => #<Encoding:ASCII-8BIT>
>> irb(main):008:0> data.force_encoding("UTF-8")
>> => "aß"
>
> well we know how to solve that don't we? make read() read characters not bytes ;)
>
> honestly, that seems to be the only proper solution. it makes no sense
> to work with characters everywhere but then read only bytes. reading
> only bytes should set the string's encoding to binary, and only when
> the programmer is sure the string is valid utf-8 should he change the
> encoding.

You can certainly do that.  CSV does:

   #
   # Builds a String in <tt>@encoding</tt>.  All +chunks+ will be
   # transcoded to that encoding.
   #
   def encode_str(*chunks)
     chunks.map { |chunk| chunk.encode(@encoding.name) }.join
   end

   #
   # Reads at least +bytes+ from <tt>@io</tt>, but will read up to 10
   # bytes ahead if needed to ensure the data read is valid in the
   # encoding of that data.  This should ensure that it is safe to use
   # regular expressions on the read data, unless it is actually a
   # broken encoding.  The read data will be returned in <tt>@encoding</tt>.
   #
   def read_to_char(bytes)
     return "" if @io.eof?
     data = @io.read(bytes)
     begin
       encoded = encode_str(data)
       raise unless encoded.valid_encoding?
       return encoded
     rescue  # encoding error or the invalid-data raise above
       if @io.eof? or data.size >= bytes + 10
         return data
       else
         data += @io.read(1) until data.valid_encoding? or
                                   @io.eof?             or
                                   data.size >= bytes + 10
         retry
       end
     end
   end
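
For what it's worth, the same read-ahead idea works outside CSV too.  A
minimal standalone sketch (the method name `read_to_valid` and the
StringIO test data are mine, not CSV's):

```ruby
require "stringio"

# Read at least +bytes+ from +io+, then pull in single bytes until the
# buffer is valid in +encoding+, giving up after 10 extra bytes, the
# same cutoff CSV's read_to_char uses.
def read_to_valid(io, bytes, encoding = "UTF-8")
  data = io.read(bytes) or return ""
  data.force_encoding(encoding)
  until data.valid_encoding? or io.eof? or data.bytesize >= bytes + 10
    extra = io.read(1)
    data << extra.force_encoding(encoding) if extra
  end
  data
end

io = StringIO.new("a\u00DF" * 3)  # "ß" is the two bytes C3 9F
chunk = read_to_valid(io, 2)      # a plain read(2) would split the "ß"
chunk.valid_encoding?             # => true; the split byte was pulled in
```

The trick is the same as in read_to_char: never hand back a buffer that
ends mid-character unless the read-ahead budget is exhausted.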

> the other options seem to be to continue to do what you describe above
> (which is less than desirable) or raise an exception, which would be
> annoying to have to check for but possibly better than the current
> solution. or maybe not...

Exceptions will be raised if you try to do something like match a
regular expression against data with a broken encoding.
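
For example (an illustrative snippet of mine, not from CSV):

```ruby
# A string whose bytes are not valid UTF-8: a truncated multibyte
# character, just like the "a\xC3" from the irb session above.
broken = "a\xC3".dup.force_encoding("UTF-8")
broken.valid_encoding?   # => false

begin
  broken =~ /a/          # matching has to walk the string by character...
rescue ArgumentError => e
  puts e.message         # ...so it raises "invalid byte sequence in UTF-8"
end
```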

James Edward Gray II