On 6/26/06, Julian 'Julik' Tarkhanov <listbox / julik.nl> wrote:
> On 26-jun-2006, at 15:27, Michal Suchanek wrote:
>> Why would you read 4096 bytes in the first place?
> This is a pattern. If a file has no line endings, but just one (very
> long) stream of characters - can you really use gets?
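
An aside: with a stream like that you would not use gets at all; you
would read fixed-size chunks. A minimal sketch, purely byte-oriented and
not encoding-aware (`handle' is just a stand-in for whatever you do with
each chunk):

  File.open("one-long-line.txt", "rb") do |io|
    while chunk = io.read(4096)   # read returns nil at EOF
      handle(chunk)               # up to 4096 raw bytes per pass
    end
  end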

>> If you knew the file is in some weird multibyte encoding you should
>> have set it for the stream, and read something meaningful.
> Or there should be a facility that protects you from reading
> incomplete strings. But is it implied that if I set IO.encoding = foo
> the IO objects will prevent me? Will they go out to the provider of
> the IO and get the missing remaining bytes? In the case of Unicode the
> absolute, rigorous minimum is to NEVER EVER slice into a codepoint,
> and it can go anywhere you want in terms of complexity (because
> slicing between codepoints is also not the way).

Anyone who wants to set all IO operations to a particular encoding is
making a huge mistake. Individual IO operations or handles could be set
to a particular encoding, but forcing all IO to use your encoding would
very likely break code external to you that does any IO of its own.
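
Per-handle is the shape I would expect. A purely hypothetical sketch
(io.encoding= is not a real API today; this is only what such an
interface might look like):

  File.open("data.txt", "rb") do |io|
    io.encoding = "UTF-8"   # hypothetical: affects only this handle
    line = io.gets          # would come back tagged as UTF-8
  end

Nothing global gets touched, so library code doing its own IO elsewhere
keeps whatever encoding it expects.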

>> If it is "ascii compatible" (ISO-8859-*, cp*, utf-8, .. ) you can
>> just use gets.
>>
>> Otherwise there is no meaningful string content.
>>
>> Note that 4096 bytes is always OK for UTF-32 (or similar plain wide
>> character encodings),
> Of which UTF-32 is the only one that is relevant for Unicode, and if
> you investigated the subject a little you would know that slicing
> Unicode strings at codepoint boundaries is often NOT enough. That way
> you can cut a part of a compound character, a modifier codepoint or an
> RTL override remarkably easily, which will just give you a different
> character altogether (or alter your string display in a particularly
> nasty way - that is, _reverse_ your string display for the remaining
> output of your program if you remove an RTL override terminator).

Oh, I understand that very well. At least as well as you do. However,
that is independent of whether IO works on encoded or unencoded values.
It's easy enough to check the validity of your encoding, too. And if
you're not checking external input for taintedness, you're already doing
something silly. One *cannot* hide too much of the complexity of
Unicode, because doing so will increase the chance that programmers not
as smart as you are will, well, screw the pooch royally.
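
Checking well-formedness in today's Ruby is short, for example by
leaning on the fact that unpack('U*') raises on malformed UTF-8:

  def valid_utf8?(str)
    str.unpack('U*')   # raises ArgumentError on a malformed sequence
    true
  rescue ArgumentError
    false
  end

That only catches broken byte sequences, of course; it says nothing
about splitting a combining sequence or an RTL override, which is a
separate and harder problem.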

>> and may at worst get you half of a surrogate pair for UTF-16.
>> And strings will have to handle incomplete characters anyway - they
>> may result from some delays/buffering in network IO or such.
> This is exactly why the notion of having strings both as byte buffers
> and character vectors seems a little difficult. 90 percent of my use
> cases for Ruby need characters, not bytes - and I would love to hint
> it specifically should that be needed. The problem is that Ruby does
> not distinguish these at the moment.

Yes, and that's where your opposition to maintaining this is
persistently misguided. Ruby *will* distinguish between a String without
an encoding and a String with an encoding. You're basing your opposition
to tomorrow's behaviour on today's (known bad) behaviour. Please, stop
doing that.

And while most of your use cases deal with characters, code that I've
written deals with both bytes and characters in equal measure.
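
To make the distinction concrete, a purely hypothetical sketch of
tomorrow's behaviour (none of this is how today's Ruby works):

  raw  = File.open("blob.bin", "rb") { |f| f.read }  # a byte buffer
  text = "r\303\251sum\303\251"                      # tagged as, say, UTF-8

  raw.encoding    # => nil (or "binary"); length and slicing count bytes
  text.encoding   # => "UTF-8"; length and slicing count characters
  text.length     # => 6 characters, not 8 bytes

Both kinds of String need to exist; which one you want depends on
whether you are shovelling data around or manipulating text.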

-austin
-- 
Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
               * austin / halostatue.ca * http://www.halostatue.ca/feed/
               * austin / zieglers.ca