On Fri, 19 Sep 2008 12:47:11 +1000, James Gray <james / grayproductions.net>  
wrote:

> I'm still struggling with this issue in the CSV code.  I read ahead to  
> find your line endings, but when I do, it can cause an error.  The reason is  
> that, with some encodings, I may read a partial character.  Then if I  
> later hit that String with a Regexp, Ruby blows up on the malformed data.

So far I have found that the best thing to do when reading multi-byte text  
is to avoid IO#read completely, unless:
- you know exactly how many bytes you need to read (e.g. you have a  
Content-Length value, as in an HTTP header), or
- you are going to read the entire file in one go.
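
To illustrate why a length-limited IO#read is risky, here is a minimal sketch. The helper name read_prefix and the sample text are made up for this example; the only assumption is that the file is UTF-8. Reading 4 bytes of text made of 3-byte characters leaves one stray byte at the end of the chunk:

```ruby
require "tempfile"

# Hypothetical helper (not part of any library): write +text+ to a
# temporary file and return its first +nbytes+ bytes via IO#read.
def read_prefix(text, nbytes)
  Tempfile.create("read_demo") do |f|
    File.binwrite(f.path, text)
    File.open(f.path, "rb") { |io| io.read(nbytes) }
  end
end

# Three characters that each take 3 bytes in UTF-8, then a comma.
text = "\u65E5\u672C\u8A9E,abc\n"

# IO#read with a length returns a binary string, so tag it as UTF-8
# before asking whether it is well-formed.
chunk = read_prefix(text, 4).force_encoding("UTF-8")

puts chunk.bytesize         # 4: one whole character plus one stray byte
puts chunk.valid_encoding?  # false: the second character was cut in half
```

A chunk like this is the kind of malformed data that a later Regexp match can choke on.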

IO#gets and IO#each_line are what I typically use. Can't you use them  
for CSV? Maybe I'm being too simplistic, but if you pass the CSV line  
ending to gets or each_line, won't that just work? The line terminator  
("sep") parameter can be an arbitrary string, and the "limit" parameter,  
although specified in bytes, will always round to a character boundary,  
so a chunk never ends in the middle of a multi-byte character.  
Matz confirmed this behaviour. I think you can even set the line  
terminator to nil (not an empty string, which selects paragraph mode)  
and just use the limit, which means that gets works almost like read,  
except that it never splits characters.
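
Here is a rough sketch of both ideas, not a drop-in fix for the CSV code. The helper name chunks_via_gets and the sample data are invented for illustration, and the file is assumed to be UTF-8. Note that the "no separator" case is written as nil here, because an empty string puts gets into paragraph mode instead:

```ruby
require "tempfile"

# Hypothetical helper: write +text+ to a temporary file, then read it
# back piece by piece with IO#gets(sep, limit) instead of IO#read.
def chunks_via_gets(text, sep, limit)
  Tempfile.create("gets_demo") do |f|
    File.binwrite(f.path, text)
    # "rb:UTF-8" = no newline translation, strings tagged as UTF-8.
    File.open(f.path, "rb:UTF-8") do |io|
      pieces = []
      while (piece = io.gets(sep, limit))
        pieces << piece
      end
      pieces
    end
  end
end

# Two CRLF-terminated rows containing 3-byte UTF-8 characters.
text = "\u65E5\u672C\u8A9E,abc\r\n\u30C6\u30B9\u30C8,def\r\n"

# No separator, limit only: behaves much like read(4), but each piece
# should still end on a character boundary.
pieces = chunks_via_gets(text, nil, 4)
puts pieces.all?(&:valid_encoding?)  # true
puts pieces.join == text             # true: nothing lost or duplicated

# Explicit row terminator, with a generous limit as a safety net.
rows = chunks_via_gets(text, "\r\n", 1024)
puts rows.length                     # 2
```

The first call is the read-ahead case: grab a limited chunk to sniff the line ending without risking a half character. The second is the normal parsing case once the terminator is known.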

Cheers
Mike