On Sun, 14 Sep 2008 14:48:47 +1000, James Gray <james / grayproductions.net>  
wrote:

> On Sep 13, 2008, at 5:39 PM, James Gray wrote:
>
>> * What is the proper way to build a regular expression in some encoding  
>> I have in a variable?
>
> According to the new Pickaxe, Regexp is supposed to pick up the encoding
> of a passed-in String.  I'm not finding that to be totally accurate
> though, because this code:
>
>    ascii_str = <<-END_PARSER
>    \\G(?:\\A|,)     # anchor the match
>    (?: "( (?>[^"]*) # find quoted fields
>           (?> ""
>           [^"]* )* )"
>        |            # ... or ...
>        ([^",]*)     # unquoted fields
>        )
>    (?=,|\\z)        # ensure we are at field's end
>    END_PARSER
>    p ascii_str.encoding
>    ascii_re = Regexp.new(ascii_str)
>    p ascii_re.encoding
>
>    sjis_str = ascii_str.encode("SJIS")
>    p sjis_str.encoding
>    sjis_re = Regexp.new(sjis_str)
>    p sjis_re.encoding
>
> prints:
>
>    #<Encoding:US-ASCII>
>    #<Encoding:US-ASCII>
>    #<Encoding:Shift_JIS>
>    #<Encoding:US-ASCII>

This is because Shift_JIS is (very nearly) a superset of ASCII, and every
character in "sjis_str" is actually ASCII. Ruby therefore gives the Regexp
the US-ASCII encoding, presumably because such a pattern is more efficient
to match with and stays usable against strings in any ASCII-compatible
encoding. If the string you convert to a Regexp contains some two-byte
characters, you should find that the encoding stays Shift_JIS. Matching a
US-ASCII Regexp against a Shift_JIS string should work transparently.
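
A minimal sketch of that behaviour, assuming Ruby 1.9 or later with the
Shift_JIS transcoder available. The variable names are only illustrative,
and the two-byte characters are hiragana written as \u escapes so the
script does not depend on the source file's encoding:

```ruby
# An all-ASCII pattern, transcoded to Shift_JIS.
ascii_only = "a,b".encode("Shift_JIS")
ascii_re   = Regexp.new(ascii_only)

# A pattern containing two-byte characters, also in Shift_JIS.
multibyte = "\u3042\u3044".encode("Shift_JIS")
sjis_re   = Regexp.new(multibyte)

p ascii_only.encoding        # the String itself is still Shift_JIS
p ascii_re.encoding          # the Regexp built from it reports US-ASCII
p ascii_re.fixed_encoding?   # and is not tied to a single encoding
p sjis_re.encoding           # non-ASCII content keeps Shift_JIS
p sjis_re.fixed_encoding?

# The US-ASCII Regexp can still be matched against Shift_JIS data.
sjis_data = "\u3042a,b".encode("Shift_JIS")
p(ascii_re =~ sjis_data)
```

Regexp#fixed_encoding? is the quickest way to tell the two cases apart: it
is false for the all-ASCII pattern and true once the pattern contains
non-ASCII characters.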

>
> I tried to test with UTF-16 as well, since I think that's a good edge  
> case.  However, we don't seem to have a converter for that:
>
>    $ ruby_dev ~/Desktop/regexp_encoding.rb
>    #<Encoding:US-ASCII>
>    #<Encoding:US-ASCII>
>    /Users/james/Desktop/regexp_encoding.rb:15:in `encode': code converter not found (US-ASCII to UTF-16) (Encoding::NoConverter)
>    	from /Users/james/Desktop/regexp_encoding.rb:15:in `<main>'
>
> I guess that means I need to be using Iconv anyway, to increase the
> number of encodings I can support.  Right?

Do you really need to convert encodings in CSV? I would have thought that
as long as your separator characters are in an encoding compatible with the
CSV data, everything should work without your having to worry about
encodings.
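
As a rough illustration of what I mean by "compatible" (a sketch only, not
how the CSV library does it; the names are made up, and I use UTF-16LE for
the contrast case on the assumption that a converter for it is available):

```ruby
# One CSV line in Shift_JIS; the two fields are single hiragana characters.
sjis_line = "\u3042,\u3044".encode("Shift_JIS")
separator = ","   # plain ASCII literal, left in the source encoding

# A non-nil result means Ruby considers the two strings safe to combine.
p Encoding.compatible?(sjis_line, separator)

# Splitting on the ASCII separator works, and the fields stay Shift_JIS.
fields = sjis_line.split(separator)
p fields.size
p fields.map(&:encoding)

# UTF-16 is not ASCII-compatible, so an ASCII separator does not mix with it.
utf16_line = "a,b".encode("UTF-16LE")
p Encoding.compatible?(utf16_line, separator)
```

So for the ASCII-compatible encodings the separator can stay as it is; only
something like UTF-16 would need the separator transcoded to match the data.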

Cheers
Mike