On Sep 14, 2008, at 2:49 AM, Michael Selig wrote:

> On Sun, 14 Sep 2008 14:48:47 +1000, James Gray
> <james / grayproductions.net> wrote:
>
>> On Sep 13, 2008, at 5:39 PM, James Gray wrote:
>>
>>> * What is the proper way to build a regular expression in some  
>>> encoding I have in a variable?
>>
>> According to the new Pickaxe, Regexp is supposed to pick up the
>> encoding of a passed-in String.  I'm not finding that to be totally
>> accurate though, because this code:
>>
>>   ascii_str = <<-END_PARSER
>>   \\G(?:\\A|,)     # anchor the match
>>   (?: "( (?>[^"]*) # find quoted fields
>>          (?> ""
>>          [^"]* )* )"
>>       |            # ... or ...
>>       ([^",]*)     # unquoted fields
>>       )
>>   (?=,|\\z)        # ensure we are at field's end
>>   END_PARSER
>>   p ascii_str.encoding
>>   ascii_re = Regexp.new(ascii_str)
>>   p ascii_re.encoding
>>
>>   sjis_str = ascii_str.encode("SJIS")
>>   p sjis_str.encoding
>>   sjis_re = Regexp.new(sjis_str)
>>   p sjis_re.encoding
>>
>> prints:
>>
>>   #<Encoding:US-ASCII>
>>   #<Encoding:US-ASCII>
>>   #<Encoding:Shift_JIS>
>>   #<Encoding:US-ASCII>
>
> This is because Shift-JIS is a superset of ASCII (well, nearly), and
> all characters in "sjis_str" were actually ASCII. So Ruby optimizes
> this to a US-ASCII Regexp, because (presumably) it is more efficient
> to match on. If the string that you convert to a Regexp contains some
> 2-byte characters, you should find that the encoding stays as
> Shift-JIS. Matching a US-ASCII Regexp against a Shift-JIS string
> should work transparently.
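
A quick way to check the behavior described above is a sketch like the following. It assumes Ruby 1.9 or later, and the sample strings are made up for illustration. The 2-byte characters are written as \u escapes so the result does not depend on the source file's encoding.

```ruby
# Two Shift_JIS strings: one with ASCII-only content, one with 2-byte characters.
ascii_only = "abc".encode("Shift_JIS")
two_byte   = "\u3042\u3044".encode("Shift_JIS")   # hiragana "a" and "i"

ascii_re = Regexp.new(ascii_only)
sjis_re  = Regexp.new(two_byte)

p ascii_re.encoding   # falls back to US-ASCII
p sjis_re.encoding    # stays Shift_JIS

# The US-ASCII Regexp still matches against Shift_JIS data that
# contains 2-byte characters; =~ reports a character offset.
data = "\u3042\u3044abc".encode("Shift_JIS")
p ascii_re =~ data
```

If this behaves as expected, the first Regexp reports US-ASCII, the second reports Shift_JIS, and the match succeeds after the two leading 2-byte characters.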

Yeah, I think I understand this now.  But is a US-ASCII Regexp
equivalent to an ASCII-8BIT Regexp?  I ask because I can't find a way
to force the latter.

> Do you really need to convert encodings in CSV? I would have thought
> that as long as your separator characters are in an encoding
> compatible with the CSV data, everything should work without having
> to worry about the encodings.

I believe a conversion is required because:

* I have to incorporate whatever separators they give me into my ASCII
regular expressions.  I'm not sure how I would do that without
conversions if they gave me UTF-16 separators, for example.
* I couldn't reasonably provide defaults without any transcoding.  For
example, an ASCII comma and quote are useless against UTF-16 data.
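
To make the two points above concrete, here is a minimal sketch of the mismatch. It assumes Ruby 1.9 or later and uses UTF-16LE rather than plain "UTF-16", since the latter is only a dummy encoding in Ruby. The transcode_separator helper and the sample data are hypothetical and exist only for this illustration.

```ruby
# Hypothetical helper: bring a separator into the data's encoding.
def transcode_separator(separator, data_encoding)
  separator.encode(data_encoding)
end

data        = "a,b,c".encode("UTF-16LE")
default_sep = ","                     # an ASCII default

# The same comma is a different byte sequence in UTF-16LE.
p default_sep.bytes
p transcode_separator(default_sep, data.encoding).bytes

# Searching UTF-16LE data with the untranscoded default fails outright.
incompatible =
  begin
    data.index(default_sep)
    false
  rescue Encoding::CompatibilityError
    true
  end
p incompatible

# After transcoding the separator, splitting works as usual.
sep    = transcode_separator(default_sep, data.encoding)
fields = data.split(sep)
p fields.map { |f| f.encode("UTF-8") }
```

The same idea would apply to the quote character and to any separators the caller supplies: transcode them to the data's encoding (or transcode the data) before comparing.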

Please tell me if you see flaws in my logic though.  I would love to  
hear this is easier than I believe it is.

James Edward Gray II