On Dec 18, 2008, at 5:00 AM, Daniel Cavanagh wrote:

> On 18/12/2008, at 7:24 PM, Brian Candler wrote:
>
>> On Thu, Dec 18, 2008 at 08:43:45AM +0900, danielcavanagh / aanet.com.au wrote:
>>> concatenation could be extended too. if one string is a superset of the
>>> other then no actual conversion needs to be done
>>
>> Concatenation raises interesting issues. For example:
>>
>>  data = "".force_encoding("UTF-8")
>>  while chunk = file.read(1024)
>>    data << chunk
>>  end
>>  # what is data.encoding ?
>>
>> Here each chunk of 1024 bytes may have split multibyte characters at start
>> or end. However it's OK to concatenate them, and as long as the file is read
>> to the end, the result would be valid UTF-8.
>>
>> Ruby's current behaviour is to do the concatenation bytewise, but downgrades
>> the encoding to binary when concatenating binary onto the end of UTF-8 (and
>> File#read returns binary):
>>
>> irb(main):001:0> data = "".force_encoding("UTF-8")
>> => ""
>> irb(main):002:0> data.encoding
>> => #<Encoding:UTF-8>
>> irb(main):003:0> data << "\x61"
>> => "a"
>> irb(main):004:0> data.encoding
>> => #<Encoding:UTF-8>
>> irb(main):005:0> data << "\xc3"
>> => "a\xC3"
>> irb(main):006:0> data << "\x9f"
>> => "a\xC3\x9F"
>> irb(main):007:0> data.encoding
>> => #<Encoding:ASCII-8BIT>
>> irb(main):008:0> data.force_encoding("UTF-8")
>> => "aß"
>
> well we know how to solve that don't we? make read() read characters
> not bytes ;)
>
> honestly, that seems to be the only proper solution. it makes no sense
> to work with characters everywhere but then read only bytes. reading
> only bytes should set the string's encoding to binary, and only when
> the programmer is sure the string is valid utf-8 should he change
> the encoding.

You can certainly do that.  CSV does:

   #
   # Builds a String in <tt>@encoding</tt>.  All +chunks+ will be transcoded
   # to that encoding.
   #
   def encode_str(*chunks)
     chunks.map { |chunk| chunk.encode(@encoding.name) }.join
   end

   #
   # Reads at least +bytes+ from <tt>@io</tt>, but will read up to 10 bytes
   # ahead if needed to ensure the data read is valid in the encoding of
   # that data.  This should ensure that it is safe to use regular
   # expressions on the read data, unless it is actually a broken encoding.
   # The read data will be returned in <tt>@encoding</tt>.
   #
   def read_to_char(bytes)
     return "" if @io.eof?
     data = @io.read(bytes)
     begin
       encoded = encode_str(data)
       raise unless encoded.valid_encoding?
       return encoded
     rescue  # encoding error or my invalid data raise
       if @io.eof? or data.size >= bytes + 10
         return data
       else
         data += @io.read(1) until data.valid_encoding? or
                                   @io.eof?             or
                                   data.size >= bytes + 10
         retry
       end
     end
   end

> the other options seem to be to continue to do what you describe above
> (which is less than desirable) or raise an exception, which would be
> annoying to have to check for but possibly better than the current
> solution. or maybe not...

Exceptions will be raised if you try to do something like match a
regular expression against data with a broken encoding.
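For reference, a small sketch of that failure mode as I understand the 1.9 behaviour (the truncated "\xC3" byte mimics a split multibyte character):

```ruby
# A truncated multibyte character leaves the string's bytes invalid
# for the UTF-8 encoding forced onto it.
data = "a\xC3".force_encoding("UTF-8")
data.valid_encoding?   # false

error = begin
  data =~ /a/          # regexp match against broken UTF-8
  nil
rescue ArgumentError => e
  e                    # "invalid byte sequence in UTF-8"
end
```

So code that reads raw bytes and forces an encoding without checking `valid_encoding?` will find out at match time, not at read time.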

James Edward Gray II