James Gray wrote:
> I'm trying to document the Encoding Regexp objects receive for the m17n
> series on my blog. This is how I think it works:
>
> * A / literal is given a US-ASCII Encoding if it contains only 7-bit
>   characters
> * A / literal receives the current source Encoding when it contains
>   8-bit characters
> * The old /u and /n style modifiers still work to force a UTF-8 or
>   US-ASCII Encoding

There are also /e (EUC-JP) and /s (Windows-31J). These set
Regexp::FIXEDENCODING, which raises an exception on strings with other
encodings even if the regexp contains only 7-bit characters.
The constant Regexp::FIXEDENCODING is defined in 1.9.2, but the value is
also used in 1.9.1.

> * A / literal that would be US-ASCII due to the source Encoding or /n
>   will be upgraded to ASCII-8BIT by hex, octal, control, meta, or
>   control-meta byte escapes (as discussed in [ruby-core:23184])

Similar to the above, /n raises warnings on strings other than ASCII-8BIT.

> * A / literal will receive a UTF-8 Encoding if it includes \u escapes
> * Regexp objects constructed with Regexp::new() receive the Encoding of
>   the String passed containing the regular expression

> Am I right so far? Am I missing any variations?
>
> Am I right that Regexp's favor US-ASCII because it maximizes their
> compatibility? It makes it so you can use them on any ASCII compatible
> String instead of just a String in the source Encoding, right?

Yes, and if you set Regexp::FIXEDENCODING the regexp will match only
strings in the same encoding.

P.S.
If you write about regexp, the difference in /i and character classes
between Unicode and non-Unicode encodings may be a topic.

-- 
NARUSE, Yui <naruse / airemix.jp>
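
The rules discussed in this thread can be checked with a short script. This is only a sketch, assuming a UTF-8 source file and Ruby 1.9.2 or later (so that Regexp::FIXEDENCODING and Regexp::NOENCODING are defined). SAMPLES and try_match are names made up for the illustration, not part of Ruby. The non-matching string deliberately contains a non-ASCII character, since that is the case where a fixed-encoding regexp is expected to raise.

```ruby
# encoding: utf-8
# Sketch: inspect the Encoding attached to several Regexp objects.
# SAMPLES and try_match are illustrative names, not Ruby API.

SAMPLES = {
  "7-bit literal"       => /abc/,
  "\\u escape"          => /\u3042/,
  "/u modifier"         => /abc/u,
  "/e modifier"         => /abc/e,
  "/s modifier"         => /abc/s,
  "/n, 7-bit only"      => /abc/n,
  "/n with byte escape" => /\xff/n,
  # Regexp.new takes its Encoding from a String with non-ASCII content.
  "Regexp.new, EUC-JP"  => Regexp.new("\u3042".encode("EUC-JP")),
}

# Returns the match position (or nil), or the exception class when the
# regexp and string encodings are incompatible.
def try_match(re, str)
  re =~ str
rescue Encoding::CompatibilityError => e
  e.class
end

SAMPLES.each do |label, re|
  fixed = (re.options & Regexp::FIXEDENCODING) != 0
  puts "#{label}: #{re.encoding.name} (fixed: #{fixed})"
end

utf8_text = "\u3042abc"                 # UTF-8 String with a non-ASCII character
puts try_match(/abc/,  utf8_text).inspect  # US-ASCII regexp, not fixed
puts try_match(/abc/e, "abc").inspect      # fixed EUC-JP regexp, 7-bit String
puts try_match(/abc/e, utf8_text).inspect  # fixed EUC-JP regexp, non-ASCII UTF-8 String
```

Regexp#fixed_encoding? reports the same flag as the options check above, so either can be used to see whether a regexp is restricted to its own encoding.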