On Mon, 15 Sep 2008 04:51:55 +1000, James Gray <james / grayproductions.net>  
wrote:

>
>> Do you really need to convert encodings in CSV? I would have thought  
>> that as long as your separator characters are in a compatible encoding  
>> to the CSV data, everything should work without having to worry about  
>> the encodings.
>
> I believe a conversion is required because:
>
> * I have to incorporate whatever separators they give me into my ASCII  
> regular expressions.  I'm not sure how I would do that without  
> conversions if they gave me UTF-16 separators, for example.
> * I couldn't reasonably provide defaults without any transcoding.  For  
> example, a comma and quote are useless for UTF-16.
>

I think I understand what you are saying. You are right, you will need  
transcoding in some cases.

I suggest:
Say your regular expression (as a string) is "r". Test whether the  
encoding of "r" is compatible with the input file's encoding (if you are  
about to read from a file) or the encoding of the input (if it is a  
string) using Encoding.compatible?, and if not, then encode "r" to the  
input's encoding. This encoding may fail if, for example, the separator  
is a UTF-16 character with no ASCII equivalent and the default encoding  
is ASCII, but then this is probably an error anyhow.
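
A minimal sketch of that check, assuming Ruby 1.9 or later. The helper name
pattern_for is hypothetical and not part of any CSV library, and the input
here is a string; for a file you would compare against its external
encoding instead.

```ruby
# Hypothetical helper: return "r" in an encoding usable with "input".
def pattern_for(r, input)
  # Encoding.compatible? returns the common encoding, or nil if there is none.
  return r if Encoding.compatible?(r, input)

  # May raise Encoding::UndefinedConversionError when a character in "r"
  # has no equivalent in the input's encoding.
  r.encode(input.encoding)
end

utf16_row = "a,b".encode("UTF-16LE")
sep = pattern_for(",", utf16_row)
puts sep.encoding # the separator now carries the row's encoding
```

When the two are already compatible, "r" is returned untouched, so the
common ASCII and UTF-8 cases pay nothing for the check.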

You will probably need to do a similar thing when building the regexp  
string in the first place if it includes separators that the user can  
specify in any encoding.
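
Something along these lines might do for building the pattern, again only as
a sketch. The method name build_splitter, its parameters, and the choice of
UTF-8 as the working encoding are all assumptions for illustration.

```ruby
# Hypothetical builder: bring user-supplied separators into one working
# encoding before interpolating them into the regexp source.
def build_splitter(col_sep, quote_char, working = Encoding::UTF_8)
  parts = [col_sep, quote_char].map do |s|
    # Transcode first, then escape, so characters such as "|" are matched
    # literally rather than treated as regexp syntax.
    Regexp.escape(s.encode(working))
  end
  Regexp.new(parts.join("|"))
end

# The separator arrives as UTF-16LE, the quote as a plain literal.
splitter = build_splitter("|".encode("UTF-16LE"), '"')
p "a|b".split(splitter)
```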

By the way, this transcoding shouldn't be needed in many cases, as many  
character encodings are ASCII-compatible.
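
If you want to see which ones, Encoding#ascii_compatible? will tell you. The
list of encodings below is just a sample I picked, and the helper name is
made up for the example.

```ruby
# Hypothetical helper: map each encoding name to its ASCII compatibility.
def ascii_compatibility(encodings)
  encodings.map { |enc| [enc.name, enc.ascii_compatible?] }.to_h
end

sample = [Encoding::UTF_8, Encoding::ISO_8859_1,
          Encoding::Shift_JIS, Encoding::UTF_16LE]
ascii_compatibility(sample).each do |name, ok|
  puts "#{name}: #{ok}"
end
```

Only the UTF-16 entry reports false, which is why plain comma and quote
defaults work against the others without any conversion.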

Hope this makes sense.

Mike.