James Gray wrote:
> On Apr 13, 2009, at 6:12 PM, NARUSE, Yui wrote:
> 
>> And these are set Regexp::FIXEDENCODING.
>> This raise exceptions on strings with other encodings
>> even if the regexp contains only 7-bit.
>> The constant Regexp::FIXEDENCODING is defined in 1.9.2
>> but the value is also used in 1.9.1.
> 
> I'm sorry, but I don't think I understood this.  I tried to check it in 
> irb, but that confused me even more:
> 
> $ irb_dev
> irb(main):001:0> Regexp::FIXEDENCODING
> => 16
> 
> Can you explain what the magic 16 means here please?

irb(main):005:0> Regexp::IGNORECASE
=> 1
irb(main):006:0> Regexp::EXTENDED
=> 2
irb(main):007:0> Regexp::MULTILINE
=> 4
irb(main):008:0> Regexp::FIXEDENCODING
=> 16

Regexp::FIXEDENCODING is a constant for specifying a regexp option, like Regexp::IGNORECASE, Regexp::EXTENDED and Regexp::MULTILINE.  So the magic 16 has no special meaning; it is simply the same value as ARG_ENCODING_FIXED (in re.c).  This value may differ on other implementations.
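
To illustrate that these constants are plain bit flags, here is a minimal sketch (assuming MRI with Regexp::FIXEDENCODING defined; the variable names are only for this example).  Options are combined with | and a single flag is tested with & on Regexp#options:

```ruby
# Combine option constants with bitwise OR, like any other flag set.
re = Regexp.new("a", Regexp::IGNORECASE | Regexp::FIXEDENCODING)

# Regexp#options returns the flag bits; test one flag with bitwise AND.
ignorecase = (re.options & Regexp::IGNORECASE) != 0
fixed      = (re.options & Regexp::FIXEDENCODING) != 0
multiline  = (re.options & Regexp::MULTILINE) != 0

puts "ignorecase=#{ignorecase} fixed=#{fixed} multiline=#{multiline}"
```

Testing the bit rather than comparing against 16 keeps the code independent of the actual numeric value.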

>>> * A / literal that would be US-ASCII due to the source Encoding or 
>>> /n will be upgraded to ASCII-8BIT by hex, octal, control, meta, or 
>>> control-meta byte escapes (as discussed in [ruby-core:23184])
>> similar to above, /n raises warnings on strings other than ASCII-8BIT.
> 
> I'm not sure I understand.  What wouldn't be valid in ASCII-8BIT?

See the following difference:

irb(main):016:0> /a/n =~ "a\u3042"
(irb):16: warning: regexp match /.../n against to UTF-8 string
=> 0
irb(main):017:0> Regexp.new("a".force_encoding("ASCII-8BIT")) =~ "a\u3042"
=> 0

This means /n sets a flag on the regexp.  Internally this flag is called ARG_ENCODING_NONE and its value is 32 (in re.c).  It works the same way as ARG_ENCODING_FIXED.

The reason /n sets ARG_ENCODING_NONE and only emits a warning instead of raising an exception is that the usage of /n has been more ambiguous than that of /u, /s and /e.

This value is not yet provided as a Ruby constant.
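
A small sketch of how the /n flag can be observed from Ruby.  It assumes that Regexp#options exposes the flag bit, and it uses Regexp::NOENCODING where a later Ruby defines it, falling back to the re.c value 32 otherwise (both the fallback and the variable names are assumptions of this example):

```ruby
# Use the constant if this Ruby defines it; otherwise fall back to 32,
# the ARG_ENCODING_NONE value from re.c (an assumption for older versions).
none_flag = defined?(Regexp::NOENCODING) ? Regexp::NOENCODING : 32

re = /a/n

# The /n flag is recorded on the regexp, but the encoding is not fixed.
has_none_flag = (re.options & none_flag) != 0
is_fixed      = re.fixed_encoding?

# Matching a non-ASCII UTF-8 string still succeeds; MRI prints a warning
# to stderr instead of raising.
result = (re =~ "a\u3042")

puts "none_flag=#{has_none_flag} fixed=#{is_fixed} result=#{result.inspect}"
```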

>>> * A / literal will receive a UTF-8 Encoding if it includes \u escapes
>>> * Regexp objects constructed with Regexp::new() receive the Encoding 
>>> of the String passed containing the regular expression
>>> Am I right so far?  Am I missing any variations?
>>> Am I right that Regexp's favor US-ASCII because it maximizes their 
>>> compatibility?  It makes it so you can use them on any ASCII 
>>> compatible String instead of just a String in the source Encoding, 
>>> right?
>>
>> Yes, and if you set Regexp::FIXEDENCODING the regexp will match only the
>> same encoding.
> 
> Again, I'm not sure how I set this.

The same way you set Regexp::IGNORECASE:

irb(main):077:0> Regexp.new("a".force_encoding("iso-8859-1"))=~"a\u3042"
=> 0
irb(main):078:0> Regexp.new("a".force_encoding("iso-8859-1"),Regexp::FIXEDENCODING)=~"a\u3042"
Encoding::CompatibilityError: incompatible encoding regexp match (ISO-8859-1 regexp with UTF-8 string)
        from (irb):78:in `=~'
        from (irb):78
        from /usr/local/bin/irb19:12:in `<main>'
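
The same comparison as a script, as a rough sketch (variable names are only for this example).  It uses Regexp#fixed_encoding? to show which regexp is fixed, and rescues the exception so both cases can be compared side by side:

```ruby
# Pattern source tagged ISO-8859-1 but containing only ASCII characters.
source = "a".dup.force_encoding("ISO-8859-1")
utf8   = "a\u3042"  # UTF-8 string with a non-ASCII character

loose = Regexp.new(source)
fixed = Regexp.new(source, Regexp::FIXEDENCODING)

puts "loose: fixed_encoding?=#{loose.fixed_encoding?}"
puts "fixed: fixed_encoding?=#{fixed.fixed_encoding?}"

# Without the flag the 7-bit pattern matches the UTF-8 string.
loose_result = (loose =~ utf8)

# With the flag the encodings must be compatible, so the match raises.
fixed_result = begin
  fixed =~ utf8
rescue Encoding::CompatibilityError => e
  e.class
end

puts "loose result: #{loose_result.inspect}"
puts "fixed result: #{fixed_result.inspect}"
```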

-- 
NARUSE, Yui  <naruse / airemix.jp>