Daniel DeLorme wrote:
> Greg Willits wrote:
>> But... this fails /^[a-zA-Z\xE4-\xE6]*?&/u
>> But... this works /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>>
>> I've boiled the experiments down to realizing I can't define a range
>> with \x
>
> Let me try to explain that in order to redeem myself from my previous
> angry post. :-)
> Basically, \xE4 is counted as the byte value 0xE4, not the Unicode
> character U+00E4. And in a range expression, each escaped value is taken
> as one character within the range. Which results in not-immediately
> obvious situations:
>
> >> 'a¥Æ¥¥bvH¥Æ¥«gt¥Æ¡¦wH¥Æ©§uG'.scan(/[\303\251]/u)
> => []
> >> 'a¥Æ¥¥bvH¥Æ¥«gt¥Æ¡¦wH¥Æ©§uG'.scan(/[#{"\303\251"}]/u)
> => ["é"]

OK, I see the Oniguruma docs refer to \x as an encoded byte value and
\x{} as a character code point. With your explanation I can finally tie
together what that means.

Took me a second to recognize the #{} as Ruby and not some new regex
syntax I'd never seen :-P And I realize now too that I wasn't picking up
on the use of octal vs decimal.

Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant?

> What is happening in the first case is that the string does not contain
> characters \303 or \251 because those are invalid utf8 sequences. But
> when the value "\303\251" is *inlined* into the regex, that is
> recognized as the utf8 character "é" and a match is found.
>
> So ranges *do* work in utf8 but you have to be careful:
>
> >> "àâäçèéêîïôü".scan(/[ä-î]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
> >> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
> => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
> "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
> "\264", "\303", "\274"]
> >> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
>
> Hope this helps.

Yes!

-- gw

--
Posted via http://www.ruby-forum.com/.
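
For anyone following along, here is a minimal sketch of the byte-versus-code-point distinction discussed above. It makes several assumptions: it targets Ruby 1.9 or later with a UTF-8 source file, where ranges can be written with \u escapes; the sample string is reconstructed from the byte output quoted in the thread; and all variable names are illustrative. The byte-wise reading of [\303\244-\303\256] is emulated with plain integer comparisons rather than a regex, so it does not depend on how a particular Ruby version parses octal escapes inside a character class.

```ruby
# Sample text: an assumption, reconstructed from the byte output quoted above.
text = "àâäçèéêîïôü"

# "é" is one code point (U+00E9) but two UTF-8 bytes: 0xC3 0xA9,
# which are \303 and \251 when written in octal.
e_acute     = "é"
e_bytes     = e_acute.bytes.map { |b| format("\\%o", b) }
e_codepoint = format("U+%04X", e_acute.ord)

# A range written with code points covers whole characters (ä through î).
by_codepoint = text.scan(/[\u00E4-\u00EE]/)

# Interpolating the UTF-8 text of the range, as in the quoted #{} example.
range_text       = "ä-î"
by_interpolation = text.scan(/[#{range_text}]/)

# Byte-wise reading of [\303\244-\303\256]: the single byte \303,
# the byte range \244..\303, and the single byte \256.
by_bytes = text.bytes.select do |b|
  b == 0303 || (0244..0303).cover?(b) || b == 0256
end.map { |b| format("\\%o", b) }

puts e_bytes.inspect
puts e_codepoint
puts by_codepoint.inspect
puts by_interpolation.inspect
puts by_bytes.inspect
```

The byte-wise selection picks up every \303 lead byte plus any trailing byte that happens to fall between \244 and \303, which is why the quoted result is a list of single bytes rather than characters.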