On Sep 14, 2008, at 6:48 PM, Michael Selig wrote:

> On Mon, 15 Sep 2008 04:51:55 +1000, James Gray <james / grayproductions.net> wrote:
>
>>
>>> Do you really need to convert encodings in CSV? I would have  
>>> thought that as long as your separator characters are in a  
>>> compatible encoding to the CSV data, everything should work  
>>> without having to worry about the encodings.
>>
>> I believe a conversion is required because:
>>
>> * I have to incorporate whatever separators they give me into my  
>> ASCII regular expressions.  I'm not sure how I would do that  
>> without conversions if they gave me UTF-16 separators, for example.
>> * I couldn't reasonably provide defaults without any transcoding.   
>> For example, a comma and quote are useless for UTF-16.
>>
>
> I think I understand what you are saying. You are right, you will need  
> transcoding in some cases.

Good to hear I'm not totally crazy.  :)

I really appreciate you talking these issues out with me.  I've really  
felt like I'm on my own in uncharted waters as I've worked on these  
issues.  Thanks for questioning me and giving me so many great ideas to  
try.

> I suggest:
> Say your regular expression (as a string) is "r". Test whether the  
> encoding of "r" is compatible with the input file's encoding (if you  
> are about to read from a file) or the encoding of the input (if it  
> is a string) using Encoding.compatible?, and if not, then encode "r"  
> to the input's encoding.

Interesting.  I wasn't aware of Encoding::compatible?().  Thanks for  
pointing that out.

It's a bit of a hassle for me to use in the CSV library because it  
compares the actual data (String or Regexp) instead of the Encoding  
objects.  It's easier for me to check all of this in the setup, before  
I'm actually reading the data.  I wish it could compare Encoding  
objects.

I could make it work though.  It's a thought.

Of course, I doubt there's much of a downside to doing the encoding  
for compatible encodings.  A little extra work, but it's a one time  
price paid when CSV sets itself up to read.
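
A minimal sketch of that setup-time check, assuming a hypothetical helper named separator_for and a small sample of the data to compare against (neither is part of CSV):

```ruby
# Hypothetical helper, not part of the CSV library: return the separator
# in an encoding that can be combined with the data being parsed.
def separator_for(separator, sample)
  # Encoding.compatible? returns the Encoding the two strings could share,
  # or nil when they cannot be combined as they are.
  return separator if Encoding.compatible?(separator, sample)

  # Otherwise transcode the separator into the data's encoding.
  separator.encode(sample.encoding)
end

utf16_data = "a,b,c\n".encode("UTF-16LE")
separator  = separator_for(",", utf16_data)
puts separator.encoding
```

Skipping the compatibility test and always calling encode() should give the same separator bytes whenever the encodings already agree, which is why the extra work looks harmless.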

> This encoding may fail, if for example for some odd reason the  
> separator is a multi-byte UTF-16 character, and the default encoding  
> is ASCII, but then this is probably an error anyhow.

Yeah, I figure I can't help there and it's OK to toss errors then.
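
Roughly what tossing the error could look like, as a sketch only. The transcode_separator name and the snowman separator are assumptions made up for illustration:

```ruby
# Hypothetical helper: transcode a user-supplied separator, turning a
# failed conversion into a clearer error for the caller.
def transcode_separator(separator, target_encoding)
  separator.encode(target_encoding)
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError => error
  raise ArgumentError,
        "separator cannot be used with #{target_encoding}: #{error.message}"
end

# A non-ASCII separator in UTF-16 has no US-ASCII equivalent.
snowman = "\u2603".encode("UTF-16LE")
begin
  transcode_separator(snowman, "US-ASCII")
rescue ArgumentError => error
  puts error.message
end
```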

> You will probably need to do a similar thing when building the  
> regexp string in the first place if it includes separators that the  
> user can specify in any encoding.

Yes, the issue I seem stuck on now is that Regexp::escape() is not  
encoding safe.  See my earlier post on this.

In fact, I'm not sure what is safe to pass into this method.  I  
originally thought I might be able to transcode their separators to  
UTF-8, Regexp::escape() them and then transcode them to the needed  
encoding, but I suspect I can even come up with some UTF-8 sequence it  
would mangle.  And I hate the idea of double conversion.
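
For reference, the double conversion I'm describing would look something like this. It is only a sketch of the mechanics, escaped_separator is a made-up name, and it says nothing about whether Regexp::escape() is safe for every input:

```ruby
# Hypothetical helper: escape a separator by way of UTF-8, then transcode
# the escaped text into the encoding the parser needs.
def escaped_separator(separator, target_encoding)
  # First conversion: bring the separator into UTF-8 before escaping.
  escaped = Regexp.escape(separator.encode("UTF-8"))
  # Second conversion: move the escaped text into the target encoding.
  escaped.encode(target_encoding)
end

pipe = escaped_separator("|", "UTF-16LE")
puts pipe.encoding
puts pipe.encode("UTF-8")
```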

I'm thinking I may need to roll my own encoding safe Regexp::escape(),  
but I'm hoping this is just a sign of my advanced encoding paranoia.   
Thoughts?

James Edward Gray II