On Mon, 15 Sep 2008 10:45:52 +1000, James Gray <james / grayproductions.net>  
wrote:

> I really appreciate you talking these issues out with me.  I've really  
> felt like I'm on my own in uncharted waters as I've worked on these  
> issues.  Thanks for questioning me and giving me so many great ideas to  
> try.
No problem. I am just starting to get into this myself.

> Interesting.  I wasn't aware of Encoding::compatible?().  Thanks for  
> pointing that out.
>
> It's a bit of a hassle for me to use in the CSV library because it  
> compares the actual data (String or Regexp) instead of the Encoding  
> objects.  It's easier for me to check all of this in the setup, before  
> I'm actually reading the data.  I wish it could compare Encoding objects.

I agree completely.
And it turns out that you can't simply do  
Encoding.compatible?("".force_encoding(enc1), "".force_encoding(enc2)),  
because Ruby seems to special-case the empty string somehow (I think), and  
can say that they are compatible when the actual encodings are not!
It would be much cleaner if Encoding.compatible? accepted Encoding objects  
as well as strings. Would this be a problem to implement?
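
To show what I mean, here is a small sketch. The sample byte strings are my own picks for illustration, and I only check nil versus non-nil, since I am not certain which encoding comes back in the empty case.

```ruby
# Two empty strings labelled with encodings that cannot generally be mixed.
# .dup keeps this working even if string literals are frozen.
empty_result = Encoding.compatible?("".dup.force_encoding("UTF-16BE"),
                                    "".dup.force_encoding("UTF-8"))

# Non-empty, non-ASCII data in two different encodings (illustrative bytes).
latin1 = "\xA1".dup.force_encoding("ISO-8859-1")
eucjp  = "\xA1\xA1".dup.force_encoding("EUC-JP")
data_result = Encoding.compatible?(latin1, eucjp)

p empty_result  # non-nil: the empty strings are reported as compatible
p data_result   # nil: real data in these two encodings is not
```

So the empty-string trick cannot stand in for a check on the Encoding objects themselves.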

> Yes, the issue I seem stuck on now is that Regexp::escape() is not  
> encoding safe.  See my earlier post on this.
>
> In fact, I'm not sure what is safe to pass into this method.  I  
> originally thought I might be able to transcode their separators to  
> UTF-8, Regexp::escape() them and then transcode them to the needed  
> encoding, but I suspect I can even come up with some UTF-8 sequence it  
> would mangle.  And I hate the idea of double conversion.
>
> I'm thinking I may need to roll my own encoding safe Regexp::escape(),  
> but I'm hoping this is just a sign of my advanced encoding paranoia.   
> Thoughts?
Your email spurred me on to play with Regexps on non-ASCII-compatible  
encodings, and unless I have misunderstood something, I just can't get  
Regexps to work properly at all.

Try:
Regexp.new("abc".force_encoding("UTF-16BE"))
==> RegexpError: invalid multibyte character: /abc/

Bug?
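
One thing I should rule out first (a guess on my part, not confirmed against the interpreter source): force_encoding only relabels the bytes, so the three bytes of "abc" are not well-formed UTF-16BE, whereas encode actually transcodes. The sketch below compares the two. I have left the Regexp.new outcomes unannotated because I have not verified what it does with the transcoded string.

```ruby
# force_encoding keeps the same 3 bytes and only changes the label.
relabelled = "abc".dup.force_encoding("UTF-16BE")
# encode converts the characters, giving 2 bytes per character here.
transcoded = "abc".encode("UTF-16BE")

p relabelled.valid_encoding?  # false: 3 bytes are not whole 16-bit units
p transcoded.valid_encoding?  # true
p transcoded.bytesize         # 6

# Try both sources and record what happens instead of assuming it.
results = [relabelled, transcoded].map do |source|
  begin
    Regexp.new(source)
    :ok
  rescue StandardError => e
    e.class
  end
end
p results
```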

I am fairly sure that Regexp methods, including escape, will work properly  
on UTF-8.
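
If that holds, the round trip you described could be sketched roughly as below. The helper name escape_via_utf8 is hypothetical, and this assumes the separator can be transcoded to UTF-8 and back without loss. It still pays the double conversion you dislike.

```ruby
# Hypothetical helper: escape a separator by going through UTF-8.
# Assumes the separator's encoding converts to and from UTF-8 cleanly.
def escape_via_utf8(separator)
  escaped = Regexp.escape(separator.encode("UTF-8"))
  escaped.encode(separator.encoding)
end

pipe = "|".encode("UTF-16BE")
escaped_pipe = escape_via_utf8(pipe)

p escaped_pipe.encoding                       # the original UTF-16BE label
p escaped_pipe == "\\|".encode("UTF-16BE")    # backslash added, then re-encoded
```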

Mike.