On Sep 14, 2008, at 8:42 PM, Michael Selig wrote:

> On Mon, 15 Sep 2008 10:45:52 +1000, James Gray
> <james / grayproductions.net> wrote:
>
>> I really appreciate you talking these issues out with me.  I've  
>> really felt like I'm on my own in uncharted waters as I've worked  
>> on these issues.  Thanks for questioning me and giving me so many  
>> great ideas to try.
> No problem. I am just starting to get into this myself.
>
>> Interesting.  I wasn't aware of Encoding::compatible?().  Thanks  
>> for pointing that out.
>>
>> It's a bit of a hassle for me to use in the CSV library because it  
>> compares the actual data (String or Regexp) instead of the Encoding  
>> objects.  It's easier for me to check all of this in the setup,  
>> before I'm actually reading the data.  I wish it could compare  
>> Encoding objects.
>
> I agree completely.
> And it turns out that you can't simply do
> Encoding.compatible?("".force_encoding(enc1), "".force_encoding(enc2)),
> because Ruby seems to optimize the null string somehow (I think), and
> can say that they are compatible when the actual encodings are not!
> Would be much cleaner if Encoding.compatible? accepted Encoding  
> objects as well as strings. Would this be a problem to implement?
>
>> Yes, the issue I seem stuck on now is that Regexp::escape() is not  
>> encoding safe.  See my earlier post on this.
>>
>> In fact, I'm not sure what is safe to pass into this method.  I  
>> originally thought I might be able to transcode their separators to  
>> UTF-8, Regexp::escape() them and then transcode them to the needed  
>> encoding, but I suspect I can even come up with some UTF-8 sequence  
>> it would mangle.  And I hate the idea of double conversion.
>>
>> I'm thinking I may need to roll my own encoding-safe  
>> Regexp::escape(), but I'm hoping this is just a sign of my advanced  
>> encoding paranoia.  Thoughts?
> Your email spurred me on to play with Regexps on non-ASCII-compatible  
> encodings, and unless I have misunderstood something, I  
> just can't get Regexps to work properly at all.
>
> Try:
> Regexp.new("abc".force_encoding("UTF-16BE"))
> ==> RegexpError: invalid multibyte character: /abc/
>
> Bug?
>
> I am fairly sure that Regexp methods including escape will work  
> properly on UTF-8.

I was worried about a UTF-8 character where the trailing byte looked  
like something that needed escaping.  Hopefully the engine does  
account for that though, yes.
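
For what it's worth, here is a minimal sketch of why that worry should
not apply to UTF-8 itself: every byte of a multibyte UTF-8 character has
its high bit set, so no lead or trailing byte can collide with an ASCII
metacharacter.  The separator below is a made-up example, and I'm
assuming a Ruby where \u escapes produce UTF-8 strings.

```ruby
# Hypothetical separator mixing non-ASCII characters with regex
# metacharacters ("." and "|"). The \u escapes make it a UTF-8 string.
separator = "\u00E9.\u2192|"

# Collect the bytes of the non-ASCII characters only.
multibyte_bytes = separator.each_char.reject(&:ascii_only?).flat_map(&:bytes)

# In UTF-8 every one of those bytes is >= 0x80, so none of them can be
# mistaken for "." (0x2E), "|" (0x7C) or "\" (0x5C).
all_high = multibyte_bytes.all? { |b| b >= 0x80 }

# Only the real ASCII metacharacters get a backslash.
escaped = Regexp.escape(separator)
matches = !!(Regexp.new(escaped) =~ "a\u00E9.\u2192|b")
```

This says nothing about encodings such as Shift_JIS, where a trailing
byte can fall in the ASCII range, so it only covers the UTF-8 case.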

However, if I can't make a regex work with a non-ASCII encoding, is  
there any point?  My all-ASCII regular expressions will work on pretty  
much everything else, right?  If you pass me non-ASCII separators,  
there's nothing I can do anyway, right?  Seems like this isn't  
possible, which is a big disappointment for me.
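
To make the distinction concrete, here is a rough sketch of how an
all-ASCII pattern behaves against an ASCII-compatible encoding versus
UTF-16BE.  I'm assuming a Ruby that has Encoding#ascii_compatible?, and
the transcoding step at the end is only one possible workaround, not
something this thread has settled on.

```ruby
ascii_regex = /,/                      # an all-ASCII pattern
line        = "caf\u00E9,th\u00E9"     # sample data (UTF-8)

# ASCII-compatible encoding: the ASCII pattern matches as-is.
latin1        = line.encode("ISO-8859-1")
latin1_fields = latin1.split(ascii_regex)

# Non-ASCII-compatible encoding: the match is refused outright.
utf16 = line.encode("UTF-16BE")
utf16_error =
  begin
    utf16 =~ ascii_regex
    nil
  rescue Encoding::CompatibilityError => e
    e
  end

# Encoding objects can be checked during setup, before any data is read.
utf16_ok  = Encoding::UTF_16BE.ascii_compatible?
latin1_ok = Encoding::ISO_8859_1.ascii_compatible?

# Possible workaround (an assumption, with a conversion cost):
# transcode the data to UTF-8 first, then match.
utf8_fields = utf16.encode("UTF-8").split(ascii_regex)
```
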

James Edward Gray II