On 26-jun-2006, at 15:27, Michal Suchanek wrote:

> Why would you read 4096 bytes in the first place?

This is a pattern. If a file has no line endings, but just one (very long) stream of characters - can you really use gets?

> If you knew the file is in some weird multibyte encoding you should
> have set it for the stream, and read something meaningful.

Or there should be a facility that protects you from reading incomplete strings. But is it implied that if I set IO.encoding = foo the IO objects will prevent me from doing so? Will they go out to the provider of the IO and get the missing remaining bytes? In the case of Unicode the absolute, rigorous minimum is to NEVER EVER slice into a codepoint, and from there it can get as complex as you want (because slicing between codepoints is also not the way).

> If it is "ascii compatible" (ISO-8859-*, cp*, utf-8, .. ) you can
> just use gets.
>
> Otherwise there is no meaningful string content.
>
> Note that 4096 bytes is always OK for UTF-32 (or similar plain wide
> character encodings),

Of which UTF-32 is the only one that is relevant for Unicode, and if you investigated the subject a little you would know that slicing Unicode strings at codepoint boundaries is often NOT enough. That way you can cut off part of a compound character, a modifier codepoint or an RTL override remarkably easily, which will just give you a different character altogether (or alter your string display in a particularly nasty way - that is, _reverse_ your string display for the remaining output of your program if you remove an RTL override terminator).

> and may at worst get you half of a surrogate
> character for UTF-16. And strings will have to handle incomplete
> characters anyway - they may result from some delays/buffering in
> network IO or such.

This is exactly why the notion of having strings both as byte buffers and character vectors seems a little difficult.
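
To make both failure modes concrete, here is a minimal sketch, assuming a Ruby whose strings carry an encoding (1.9 or later). The sample strings are made up for illustration, and a 3-byte read stands in for the 4096-byte one.

```ruby
require 'stringio'

# Hypothetical sample data: U+00E9 three times, 2 bytes each in UTF-8.
text = "\u00e9" * 3
io = StringIO.new(text.dup.force_encoding(Encoding::BINARY))

# Read by BYTE count, the same pattern as read(4096) on a real stream.
chunk = io.read(3)
chunk.force_encoding(Encoding::UTF_8)

# The third byte is only the first half of a character.
chunk_valid = chunk.valid_encoding?

# Cutting between codepoints is not safe either:
# "e" followed by U+0301 (combining acute) displays as ONE character.
composed = "e\u0301"
first_codepoint = composed[0]        # only the bare "e", the accent is gone
clusters = composed.scan(/\X/)       # what a reader would call characters

puts "chunk valid?     #{chunk_valid}"
puts "codepoints:      #{composed.length}"
puts "visible chars:   #{clusters.length}"
```

The first half shows a byte-count read leaving an incomplete codepoint at the end of the buffer; the second shows that a cut on a codepoint boundary can still change the character the reader sees.
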
90 percent of my use cases for Ruby need characters, not bytes - and I would love to ask for bytes explicitly should they be needed. The problem is that Ruby does not distinguish between the two at the moment.

-- 
Julian 'Julik' Tarkhanov
please send all personal mail to me at julik.nl