On 6/26/06, Julian 'Julik' Tarkhanov <listbox / julik.nl> wrote:
>
> On 26-jun-2006, at 3:11, Austin Ziegler wrote:
>
>
> >
> > Okay. What class should I get here?
> >
> >  data = File.open("file.txt", "rb") { |f| f.read }
> >
> > Under the people who want separate ByteVector and String class, I'll
> > need *two* APIs:
> >
> >  st = File.open("file.txt", "rb") { |f| f.read_string }
> >  bv = File.open("file.txt", "rb") { |f| f.read_bytes }
> >
> > Stupid, stupid, stupid, stupid. If I have guessed wrong about the
> > contents of file.txt, I have to rewind and read it again. Better to
> > *always* read as bytes and then say, "this is actually UTF-8". This
> > would be as stupid in C++, Java, or C#:
>
> Not so fast, let's say you read from a file:
>
> >  st = File.open("file.txt", "rb") { |f| f.read(4056) }
>
> and you receive a PART of a Unicode string (because you cannot know
> where to stop reading before you look into the structure).
> The only way to make what you read valid now is to slide along the
> byte length and try to catch the bytes that you skipped.
> Should I continue?

Why would you read 4096 bytes in the first place?

If you knew the file was in some weird multibyte encoding, you should
have set that encoding on the stream and read something meaningful.
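
A minimal sketch of what setting the encoding on the stream could look
like. It assumes a Ruby with per-stream encodings (the "rb:ENCODING"
mode string of Ruby 1.9 and later, which is not the Ruby of this
thread); the helper name read_text and the sample text are made up for
illustration:

```ruby
require "tempfile"

# Hypothetical helper: read a whole file as text in the given encoding.
# "rb:ENCODING" opens in binary mode and tags what is read with ENCODING.
def read_text(path, encoding)
  File.open(path, "rb:#{encoding}") { |f| f.read }
end

Tempfile.create("sample") do |tmp|
  # Sample data, assumed for the demo: "hello" with an e-acute, as UTF-16LE.
  File.binwrite(tmp.path, "h\u00e9llo".encode("UTF-16LE"))

  text = read_text(tmp.path, "UTF-16LE")
  puts text.encoding          # the string carries the stream's encoding
  puts text.encode("UTF-8")   # transcode only when another form is needed
end
```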

If it is "ASCII compatible" (ISO-8859-*, cp*, UTF-8, ...), you can just use gets.
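
Roughly why gets is enough there: in UTF-8 and the single-byte
ISO-8859-* sets the newline byte 0x0A never occurs inside a multibyte
character, so splitting on it cannot cut a character in half. A small
sketch, assuming UTF-8 input and a Ruby with force_encoding (1.9 and
later); read_lines_as_utf8 is a hypothetical name:

```ruby
require "tempfile"

# Hypothetical helper: read a file line by line as raw bytes, then label
# each line as UTF-8. No decoding happens; the bytes are only tagged.
def read_lines_as_utf8(path)
  lines = []
  File.open(path, "rb") do |f|
    while (line = f.gets)
      lines << line.force_encoding("UTF-8")
    end
  end
  lines
end

Tempfile.create("sample") do |tmp|
  # Sample data, assumed for the demo: two lines with multibyte characters.
  File.binwrite(tmp.path, "na\u00efve\n\u65e5\u672c\u8a9e\n")

  read_lines_as_utf8(tmp.path).each do |line|
    puts "#{line.chomp} valid=#{line.valid_encoding?}"
  end
end
```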

Otherwise there is no meaningful string content.

Note that 4096 bytes is always OK for UTF-32 (or similar plain wide
character encodings), since 4096 is a multiple of 4, and may at worst
get you half of a surrogate pair for UTF-16. And strings will have to
handle incomplete characters anyway: they may result from delays or
buffering in network IO and the like.
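
One way to cope with such incomplete characters, sketched for UTF-8:
read fixed-size byte chunks and carry a cut-off trailing character over
to the next chunk. This assumes well-formed UTF-8 input and a Ruby with
String#b and byteslice (2.0 and later); both helper names are
hypothetical, not an existing API:

```ruby
require "stringio"

# Hypothetical helper: split raw bytes into [complete, leftover], where
# leftover is a UTF-8 character cut off at the end (possibly empty).
def split_incomplete_utf8(bytes)
  bytes = bytes.b
  size = bytes.bytesize
  # A cut-off character is at most 3 bytes long, so only the tail matters.
  1.upto([3, size].min) do |back|
    byte = bytes.getbyte(size - back)
    next if (byte & 0xC0) == 0x80 # continuation byte, keep looking back
    # The lead byte tells how many bytes the whole character needs.
    need = if byte >= 0xF0 then 4
           elsif byte >= 0xE0 then 3
           elsif byte >= 0xC0 then 2
           else 1
           end
    cut = need > back ? size - back : size
    return [bytes.byteslice(0, cut), bytes.byteslice(cut, size - cut)]
  end
  [bytes, "".b]
end

# Hypothetical helper: yield UTF-8 strings that never end mid-character.
def each_utf8_chunk(io, chunk_size = 4096)
  carry = "".b
  while (chunk = io.read(chunk_size))
    complete, carry = split_incomplete_utf8(carry + chunk.b)
    yield complete.force_encoding("UTF-8")
  end
  # Anything still carried at EOF is a truncated character in the input.
  yield carry.force_encoding("UTF-8") unless carry.empty?
end

# Usage: a tiny chunk size forces splits in the middle of characters.
text = "a\u00e9\u65e5\u{1F600}" * 3
each_utf8_chunk(StringIO.new(text.b), 7) do |part|
  puts "#{part.bytesize} bytes, valid=#{part.valid_encoding?}"
end
```

The same carry-over idea applies to UTF-16, where the only thing a
4096-byte read can split is a surrogate pair.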

Thanks

Michal