On Sep 14, 2008, at 2:49 AM, Michael Selig wrote:

> On Sun, 14 Sep 2008 14:48:47 +1000, James Gray
> <james / grayproductions.net> wrote:
>
>> On Sep 13, 2008, at 5:39 PM, James Gray wrote:
>>
>>> * What is the proper way to build a regular expression in some
>>>   encoding I have in a variable?
>>
>> According to the new Pickaxe, Regexp is supposed to pick up the
>> encoding of a passed-in String. I'm not finding that to be totally
>> accurate though, because this code:
>>
>>   ascii_str = <<-END_PARSER
>>   \\G(?:\\A|,)                     # anchor the match
>>   (?: "( (?>[^"]*)                 # find quoted fields
>>          (?> ""
>>              [^"]* )* )"
>>       |                            # ... or ...
>>       ([^",]*)                     # unquoted fields
>>       )
>>   (?=,|\\z)                        # ensure we are at field's end
>>   END_PARSER
>>   p ascii_str.encoding
>>   ascii_re = Regexp.new(ascii_str)
>>   p ascii_re.encoding
>>
>>   sjis_str = ascii_str.encode("SJIS")
>>   p sjis_str.encoding
>>   sjis_re = Regexp.new(sjis_str)
>>   p sjis_re.encoding
>>
>> prints:
>>
>>   #<Encoding:US-ASCII>
>>   #<Encoding:US-ASCII>
>>   #<Encoding:Shift_JIS>
>>   #<Encoding:US-ASCII>
>
> This is because Shift-JIS is a superset of ascii (well nearly), and
> all characters in "sjis_str" were actually ascii. So Ruby optimizes
> this to an ascii Regexp, because (presumably) it is more efficient
> to match on. If you have some 2 byte characters in the string that
> you convert to a Regexp, you should find that the encoding stays as
> Shift-JIS. Regexp matching of an ascii regexp on a Shift-JIS string
> should work transparently.

Yeah, I think I understand this now. But is a US-ASCII Regexp
equivalent to an ASCII-8BIT Regexp? I ask because I can't find a way
to force the latter.

> Do you really need to convert encodings in CSV? I would have thought
> that as long as your separator characters are in a compatible
> encoding to the CSV data, everything should work without having to
> worry about the encodings.

I believe a conversion is required because:

* I have to incorporate whatever separators they give me into my
  ASCII regular expressions. I'm not sure how I would do that without
  conversions if they gave me UTF-16 separators, for example.
* I couldn't reasonably provide defaults without any transcoding. For
  example, an ASCII comma and quote are useless against UTF-16 data.

Please tell me if you see flaws in my logic though. I would love to
hear this is easier than I believe it is.

James Edward Gray II
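
For anyone who wants to poke at this, here is a minimal sketch of the two points above: a Regexp built from ASCII-only text is not pinned to the String's encoding, while a 2 byte character pins it, and an ASCII separator has to be transcoded before it can be used against UTF-16 data. This is not CSV's actual code. The separator_for helper and all variable names are made up for illustration, and the results noted in comments are what I would expect from a recent Ruby.

```ruby
# Sketch only: illustrates the encodings discussed in this thread.

# ASCII-only source text: the Regexp is not pinned to Shift_JIS.
ascii_re = Regexp.new(",".encode("Shift_JIS"))
p ascii_re.encoding         # expected: #<Encoding:US-ASCII>
p ascii_re.fixed_encoding?  # expected: false

# Source text with a 2 byte character: the encoding sticks.
sjis_re = Regexp.new("\u3042,".encode("Shift_JIS"))
p sjis_re.encoding          # expected: #<Encoding:Shift_JIS>
p sjis_re.fixed_encoding?   # expected: true

# The ASCII Regexp still matches Shift_JIS data transparently.
sjis_data = "\u3042,b".encode("Shift_JIS")
p sjis_data =~ ascii_re     # character offset of the comma

# UTF-16 is not ASCII-compatible, so a plain "," cannot be mixed in.
utf16_data = "a,b".encode("UTF-16LE")
p Encoding.compatible?(",", utf16_data)  # expected: nil

# Hypothetical helper: transcode a separator into the data's encoding.
def separator_for(sep, data)
  sep.encode(data.encoding)
end

utf16_sep = separator_for(",", utf16_data)
p utf16_sep.bytes           # two bytes per character in UTF-16LE

# Split on the transcoded separator, then convert fields for display.
fields = utf16_data.split(utf16_sep).map { |f| f.encode("UTF-8") }
p fields
```

Going the other way, transcoding the data to UTF-8 once and keeping the separators in ASCII, would avoid building anything in UTF-16 at all, at the cost of converting every field back afterwards.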