James Gray wrote:
> On Apr 13, 2009, at 6:12 PM, NARUSE, Yui wrote:
>
>> And these are set Regexp::FIXEDENCODING.
>> This raise exceptions on strings with other encodings
>> even if the regexp contains only 7-bit.
>> The constant Regexp::FIXEDENCODING is defined in 1.9.2
>> but the value is also used in 1.9.1.
>
> I'm sorry, but I don't think I understood this. I tried to check it in
> irb, but that confused me even more:
>
> $ irb_dev
> irb(main):001:0> Regexp::FIXEDENCODING
> => 16
>
> Can you explain what the magic 16 means here please?

irb(main):005:0> Regexp::IGNORECASE
=> 1
irb(main):006:0> Regexp::EXTENDED
=> 2
irb(main):007:0> Regexp::MULTILINE
=> 4
irb(main):008:0> Regexp::FIXEDENCODING
=> 16

Regexp::FIXEDENCODING is a constant for specifying a regexp option, like
Regexp::IGNORECASE, Regexp::EXTENDED and Regexp::MULTILINE.
So the magic 16 has no special meaning; it is simply the same value as
ARG_ENCODING_FIXED (in re.c). This value may differ in other implementations.

>>> * A / literal that would be US-ASCII due to the source Encoding or
>>> /n will be upgraded to ASCII-8BIT by hex, octal, control, meta, or
>>> control-meta byte escapes (as discussed in [ruby-core:23184])
>> similar to above, /n raise warnings on other than ASCII-8BIT strings.
>
> I'm not sure I understand. What wouldn't be valid in ASCII-8BIT?

See the following difference:

irb(main):016:0> /a/n =~ "a\u3042"
(irb):16: warning: regexp match /.../n against to UTF-8 string
=> 0
irb(main):017:0> Regexp.new("a".force_encoding("ASCII-8BIT")) =~ "a\u3042"
=> 0

This means /n sets a flag on the regexp. Internally the flag is called
ARG_ENCODING_NONE and its value is 32 (in re.c); it follows the same logic
as ARG_ENCODING_FIXED.
The reason /n sets ARG_ENCODING_NONE and produces a warning rather than
raising an exception is that the usage of /n was more ambiguous than that
of /u, /s and /e.
This value is not yet provided as a Ruby constant.
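To make the point about option constants concrete, here is a minimal sketch (not taken from the exchange above) showing that these constants are ordinary bit flags: they combine with bitwise OR and can be read back through Regexp#options. The helper name option_set? and the pattern "a" are hypothetical, chosen only for illustration, and the sketch assumes a Ruby version in which Regexp::FIXEDENCODING is defined (1.9.2 or later).

```ruby
# Hypothetical helper (not part of Ruby): true if the option bit `flag`
# is set on the regexp `re`.
def option_set?(re, flag)
  (re.options & flag) != 0
end

# Option constants are bit flags, so they combine with bitwise OR.
re = Regexp.new("a", Regexp::IGNORECASE | Regexp::FIXEDENCODING)

puts option_set?(re, Regexp::IGNORECASE)     # bit 1
puts option_set?(re, Regexp::FIXEDENCODING)  # bit 16
puts option_set?(re, Regexp::MULTILINE)      # bit 4, not requested here
puts re.fixed_encoding?                      # built-in query for the same flag
```

Testing against the named constants rather than the literal 16 keeps the code independent of the numeric value, which may differ in other implementations.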
>>> * A / literal will receive a UTF-8 Encoding if it includes \u escapes
>>> * Regexp objects constructed with Regexp::new() receive the Encoding
>>> of the String passed containing the regular expression
>>> Am I right so far? Am I missing any variations?
>>> Am I right that Regexp's favor US-ASCII because it maximizes their
>>> compatibility? It makes it so you can use them on any ASCII
>>> compatible String instead of just a String in the source Encoding,
>>> right?
>>
>> Yes, and if you set Regexp::FIXEDENCODING the regexp will match only the
>> same encoding.
>
> Again, I'm not sure how I set this.

You set it the same way you set Regexp::IGNORECASE, by passing it as the
second argument to Regexp.new:

irb(main):077:0> Regexp.new("a".force_encoding("iso-8859-1"))=~"a\u3042"
=> 0
irb(main):078:0> Regexp.new("a".force_encoding("iso-8859-1"),Regexp::FIXEDENCODING)=~"a\u3042"
Encoding::CompatibilityError: incompatible encoding regexp match (ISO-8859-1 regexp with UTF-8 string)
        from (irb):78:in `=~'
        from (irb):78
        from /usr/local/bin/irb19:12:in `<main>'

-- 
NARUSE, Yui <naruse / airemix.jp>
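For use in a script rather than in irb, the same comparison can be sketched as below. This is only an illustration under stated assumptions: try_match is a hypothetical helper, not a Ruby method, and String#encode is used in place of force_encoding so that the string literal itself is never modified.

```ruby
# Hypothetical helper (not part of Ruby): returns the match position, nil,
# or :incompatible when the regexp cannot be matched against the string's
# encoding.
def try_match(re, str)
  re =~ str
rescue Encoding::CompatibilityError
  :incompatible
end

latin1_a = "a".encode("ISO-8859-1")  # pattern source containing only 7-bit bytes
utf8_str = "a\u3042"                 # UTF-8 string with a non-ASCII character

loose = Regexp.new(latin1_a)                         # no encoding flag
fixed = Regexp.new(latin1_a, Regexp::FIXEDENCODING)  # pinned to ISO-8859-1

p try_match(loose, utf8_str)   # => 0
p try_match(fixed, utf8_str)   # => :incompatible
p try_match(fixed, latin1_a)   # => 0
```

Rescuing Encoding::CompatibilityError in one place keeps the calling code free of begin/rescue blocks when a fixed-encoding regexp may meet strings in several encodings.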