On Sep 15, 2008, at 3:05 AM, Michael Selig wrote:

> On Mon, 15 Sep 2008 13:09:31 +1000, James Gray <james / grayproductions.net> wrote:
>
>
>> However, if I can't make a regex work with a non-ASCII encoding, is  
>> there any point?  My all ASCII regular expressions will work on  
>> pretty much everything else, right?  If you pass me non-ASCII  
>> separators, there's nothing I can do anyway, right?  Seems like  
>> this isn't possible, which is a big disappointment for me.
>
> I don't think it is quite that bad. Regexp's appear to be broken on  
> UTF-16 & UTF-32. UTF-8 for example certainly seems OK to me.

And it looks like we've now seen this wasn't really the issue.  It  
should be possible to make the regular expressions work for these  
encodings.

Thus, my final issue is Regexp::escape().  I must escape separators  
that are passed in so a common separator like | doesn't change the  
meaning of my Regexp.  I see two options here:

1.  Transcode incoming separators to UTF-8, call Regexp::escape() on  
them, then transcode the result back to the data encoding.
2.  Hand roll an encoding-safe Regexp::escape().
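
A rough sketch of what the two options might look like, assuming a current Ruby with String#encode and String#each_char. The helper names escape_via_utf8 and escape_by_hand and the metacharacter list are hypothetical, not part of any library, and the sketch only produces the escaped separator string; it does not build a Regexp from it.

```ruby
# Option 1 (hypothetical helper): round-trip the separator through UTF-8.
# String#encode raises if the separator cannot be represented in UTF-8,
# which is exactly the extra requirement this option imposes.
def escape_via_utf8(separator, data_encoding)
  utf8 = separator.encode("UTF-8")
  Regexp.escape(utf8).encode(data_encoding)
end

# Option 2 (hypothetical helper): escape by hand without leaving the
# separator's own encoding. The metacharacter list is an assumption and
# only covers punctuation, not whitespace.
META_CHARS = '.*?+^$|\\()[]{}-#'.chars

def escape_by_hand(separator)
  enc       = separator.encoding
  meta      = META_CHARS.map { |c| c.encode(enc) }
  backslash = "\\".encode(enc)
  out       = String.new.force_encoding(enc)
  separator.each_char do |char|
    out << backslash if meta.include?(char)  # prefix metacharacters
    out << char
  end
  out
end

sep = "|".encode("UTF-16LE")
escape_via_utf8(sep, "UTF-16LE")  # escaped separator, still UTF-16LE
escape_by_hand(sep)               # same result for this separator
```

Unlike Regexp::escape(), the hand-rolled version above does not rewrite whitespace such as newlines or tabs, which is one example of the detail that would need care.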

I'm not in love with either of these options, but they are the best  
ideas I have.  The first one adds the arbitrary requirement that  
separators must transcode cleanly to UTF-8, and the second just sounds  
tricky to get right.  Any thoughts on these choices?

James Edward Gray II