On 18/12/2008, at 7:24 PM, Brian Candler wrote:

> On Thu, Dec 18, 2008 at 08:43:45AM +0900,
> danielcavanagh / aanet.com.au wrote:
>> concatenation could be extended too. if one string is a superset of the
>> other then no actual conversion needs to be done
>
> Concatenation raises interesting issues. For example:
>
>   data = "".force_encoding("UTF-8")
>   while chunk = file.read(1024)
>     data << chunk
>   end
>   # what is data.encoding ?
>
> Here each chunk of 1024 bytes may have split multibyte characters at start
> or end. However it's OK to concatenate them, and as long as the file is read
> to the end, the result would be valid UTF-8.
>
> Ruby's current behaviour is to do the concatenation bytewise, but downgrades
> the encoding to binary when concatenating binary onto the end of UTF-8 (and
> File#read returns binary)
>
> irb(main):001:0> data = "".force_encoding("UTF-8")
> => ""
> irb(main):002:0> data.encoding
> => #<Encoding:UTF-8>
> irb(main):003:0> data << "\x61"
> => "a"
> irb(main):004:0> data.encoding
> => #<Encoding:UTF-8>
> irb(main):005:0> data << "\xc3"
> => "a\xC3"
> irb(main):006:0> data << "\x9f"
> => "a\xC3\x9F"
> irb(main):007:0> data.encoding
> => #<Encoding:ASCII-8BIT>
> irb(main):008:0> data.force_encoding("UTF-8")
> => "aß"

well we know how to solve that don't we? make read() read characters
not bytes ;)
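
For what it's worth, Ruby's IO objects already have character-level reads (getc, readchar, each_char) that respect the stream's encoding, so something close to a "read N characters" call can be sketched on top of them. This is only a rough illustration: read_chars is a made-up helper name, not an existing method, and StringIO stands in for a real file opened with a UTF-8 external encoding.

```ruby
require "stringio"

# Hypothetical helper (not part of Ruby): read up to `count` whole
# characters from `io`, never stopping in the middle of a multibyte one.
def read_chars(io, count)
  chars = []
  count.times do
    ch = io.getc          # getc returns one character, or nil at end of stream
    break if ch.nil?
    chars << ch
  end
  chars.join
end

io = StringIO.new("a\u00DFb")   # "a", then a two-byte UTF-8 character, then "b"
first = read_chars(io, 2)       # two characters, three bytes
rest  = read_chars(io, 5)       # whatever is left before end of stream
```

Going one character at a time is likely slower than a bytewise read(1024), so this is a sketch of the semantics rather than a suggestion for how it should be implemented.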

honestly, that seems to be the only proper solution. it makes no sense to
work with characters everywhere but then read only bytes. reading only
bytes should set the string's encoding to binary, and only when the
programmer is sure the string is valid utf-8 should he change the
encoding. the other options seem to be to continue doing what you
describe above (which is less than desirable) or to raise an exception,
which would be annoying to have to check for but possibly better than
the current solution. or maybe not...
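
A minimal sketch of the "keep it binary until you're sure" approach, assuming the whole stream is read before anything is relabelled. read_utf8 is a hypothetical helper name and the choice of ArgumentError is arbitrary; force_encoding and valid_encoding? are the existing String methods.

```ruby
require "stringio"

# Hypothetical helper (not part of Ruby): accumulate fixed-size binary
# chunks, then label the result as UTF-8 only if the bytes validate.
def read_utf8(io, chunk_size = 1024)
  data = String.new               # String.new with no arguments is ASCII-8BIT
  while (chunk = io.read(chunk_size))
    data << chunk                 # bytewise append; a chunk may end mid-character
  end
  data.force_encoding(Encoding::UTF_8)   # relabel only, no bytes are changed
  raise ArgumentError, "input is not valid UTF-8" unless data.valid_encoding?
  data
end

# A chunk size of 2 splits the two-byte character across reads on purpose.
text = read_utf8(StringIO.new("a\u00DF"), 2)
```

The encoding is changed in exactly one place, after the last chunk has arrived, so a character split across two chunks never gets checked halfway through.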