Daniel DeLorme wrote:
> Greg Willits wrote:

>> But... this fails  /^[a-zA-Z\xE4-\xE6]*?&/u
>> But... this works  /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>> 
>> I've boiled the experiments down to realizing I can't define a range
>> with \x

> Let me try to explain that in order to redeem myself from my previous
> angry post.

:-)

> Basically, \xE4 is counted as the byte value 0xE4, not the unicode
> character U+00E4. And in a range expression, each escaped value is taken
> as one character within the range. Which results in not-immediately
> obvious situations:
> 
>  >> 'aébvHögtåwH??FuG'.scan(/[\303\251]/u)
> => []
>  >> 'aébvHögtåwH??FuG'.scan(/[#{"\303\251"}]/u)
> => ["é"]

OK, I see the Oniguruma docs refer to \x as an encoded byte value and 
\x{} as a character code point. With your explanation I can finally tie 
together what that means.
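
For anyone else following along, here is a minimal sketch of the byte vs. code point distinction. It assumes a newer Ruby than the one in this thread (2.0 or later, where String#bytes and String#codepoints return arrays); the variable names are just mine for illustration.

```ruby
# Assumes Ruby 2.0 or later, where strings carry an encoding.
e_acute = "\u00E9"                     # the single character "é", stored as UTF-8

byte_values = e_acute.bytes            # the encoded bytes
octal_bytes = byte_values.map { |b| b.to_s(8) }
code_points = e_acute.codepoints       # the Unicode code point(s)

puts "bytes (decimal): #{byte_values.inspect}"   # [195, 169]
puts "bytes (octal):   #{octal_bytes.inspect}"   # ["303", "251"]
puts "code points:     #{code_points.inspect}"   # [233], i.e. 0xE9
```

So "\303\251" is the two encoded bytes of "é", while U+00E9 is its one code point.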

Took me a second to recognize the #{} as Ruby and not some new regex I'd 
never seen :-P

And I realize now too I wasn't picking up on the use of octal vs 
decimal.

Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant?
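
As a hedged aside: in later Ruby versions (1.9 and up, so not the setup discussed here), a \uHHHH escape inside a regexp names a code point rather than a byte, which gives the kind of range I was after. The sketch below is only an illustration under that assumption; the \A and \z anchors and the sample names are my own choices, not part of the original pattern.

```ruby
# Assumes Ruby 1.9 or later; \uHHHH in a regexp names a code point.
letters = /\A[a-zA-Z\u00E4-\u00E6]*\z/

name_in_range  = "Bj\u00E6rn"   # "Bjærn": æ is U+00E6, inside the range
name_out_range = "Bj\u00F6rn"   # "Björn": ö is U+00F6, outside the range

in_range_ok  = !!(name_in_range  =~ letters)
out_range_ok = !!(name_out_range =~ letters)

puts in_range_ok    # true
puts out_range_ok   # false
```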


> What is happening in the first case is that the string does not contain
> characters \303 or \251 because those are invalid utf8 sequences. But
> when the value "\303\251" is *inlined* into the regex, that is
> recognized as the utf8 character "é" and a match is found.
> 
> So ranges *do* work in utf8 but you have to be careful:
> 
>  >> "??âäçèéêîïôü".scan(/[ä-î]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
>  >> "??âäçèéêîïôü".scan(/[\303\244-\303\256]/u)
> => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
> "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
> "\264", "\303", "\274"]
>  >> "??âäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
> => ["ä", "ç", "è", "é", "ê", "î"]
> 
> Hope this helps.

Yes!
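
To convince myself, here is a small sketch of the inlining trick as it might look on a newer Ruby (2.0 or later with a UTF-8 source file, which is an assumption and not what this thread was run on). The octal escapes build the raw bytes of "ä-î", and interpolating that string into the regexp lets the range be read as characters.

```ruby
# Assumes Ruby 2.0 or later with the default UTF-8 source encoding.
text  = "\u00E2\u00E4\u00E7\u00E8\u00E9\u00EA\u00EE\u00EF\u00F4\u00FC"  # "âäçèéêîïôü"
range = "\303\244-\303\256"   # the UTF-8 bytes of "ä", then "-", then "î"

# The byte escapes form valid UTF-8, so the string equals "ä-î".
same_as_chars = (range == "\u00E4-\u00EE")

# Interpolating the string puts real characters into the character class.
matches = text.scan(/[#{range}]/)

puts same_as_chars     # true
puts matches.inspect   # ["ä", "ç", "è", "é", "ê", "î"]
```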

-- gw
-- 
Posted via http://www.ruby-forum.com/.