On Dec 3, 12:47 pm, Greg Willits <li... / gregwillits.ws> wrote:
> Daniel DeLorme wrote:
> > Greg Willits wrote:
> >> But... this fails  /^[a-zA-Z\xE4-\xE6]*?&/u
> >> But... this works  /^[a-zA-Z\xE4\xE5\xE6]*?&/u
>
> >> I've boiled the experiments down to realizing I can't define a range
> >> with \x
> > Let me try to explain that in order to redeem myself from my previous
> > angry post.
>
> :-)
>
> > Basically, \xE4 is counted as the byte value 0xE4, not the unicode
> > character U+00E4. And in a range expression, each escaped value is taken
> > as one character within the range. Which results in not-immediately
> > obvious situations:
>
> >  >> 'aébvHögtåwHÅFuG'.scan(/[\303\251]/u)
> > => []
> >  >> 'aébvHögtåwHÅFuG'.scan(/[#{"\303\251"}]/u)
> > => ["é"]
>
> OK, I see oniguruma docs refer to \x as encoded byte value and \x{} as a
> character code point -- which with your explanation I can finally tie
> together what that means.
>
> Took me a second to recognize the #{} as Ruby and not some new regex I'd
> never seen :-P
>
> And I realize now too I wasn't picking up on the use of octal vs
> decimal.
>
> Seems like Ruby doesn't like to use the hex \x{7HHHHHHH} variant?
>
>
>
> > What is happening in the first case is that the string does not contain
> > characters \303 or \251 because those are invalid utf8 sequences. But
> > when the value "\303\251" is *inlined* into the regex, that is
> > recognized as the utf8 character "é" and a match is found.
>
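
For anyone reading along on a newer Ruby: the byte-versus-character distinction quoted above can be checked directly. This is only a sketch, assuming Ruby 1.9 or later (where strings carry an encoding); the constant and helper names are mine, not from the thread.

```ruby
# Assumption: Ruby 1.9+; E_ACUTE and utf8_bytes_octal are illustrative names.
E_ACUTE = [0xE9].pack("U")   # the character U+00E9, encoded as UTF-8

# Render each byte of a string as an octal escape, the notation used in the thread.
def utf8_bytes_octal(str)
  str.bytes.map { |b| format("\\%o", b) }.join
end

puts utf8_bytes_octal(E_ACUTE)   # prints \303\251

# One byte of the pair on its own is not a valid UTF-8 character...
LONE_BYTE = [0xC3].pack("C").force_encoding("UTF-8")
puts LONE_BYTE.valid_encoding?   # prints false

# ...but the two bytes together are exactly the one character.
BOTH_BYTES = [0xC3, 0xA9].pack("C*").force_encoding("UTF-8")
puts BOTH_BYTES == E_ACUTE       # prints true
```

So \303 and \251 are the two halves of a single character, which is why treating each as its own class member finds nothing in a string that is being read as utf8.
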
> > So ranges *do* work in utf8 but you have to be careful:
>
> >  >> "àâäçèéêîïôü".scan(/[ä-î]/u)
> > => ["ä", "ç", "è", "é", "ê", "î"]
> >  >> "àâäçèéêîïôü".scan(/[\303\244-\303\256]/u)
> > => ["\303", "\303", "\303", "\244", "\303", "\247", "\303", "\250",
> > "\303", "\251", "\303", "\252", "\303", "\256", "\303", "\257", "\303",
> > "\264", "\303", "\274"]
> >  >> "àâäçèéêîïôü".scan(/[#{"\303\244-\303\256"}]/u)
> > => ["ä", "ç", "è", "é", "ê", "î"]
>
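
A side note on the range examples quoted above: on Ruby 1.9 or later (an assumption, since this thread is about the older /u regex flag), the range can be written with \u code point escapes instead of interpolating the raw bytes. A minimal sketch, with constant names of my own choosing:

```ruby
# Assumption: Ruby 1.9+, where regex literals accept \uHHHH escapes.
# SAMPLE is the same test string as in the thread, built from its code points:
# U+00E0 U+00E2 U+00E4 U+00E7 U+00E8 U+00E9 U+00EA U+00EE U+00EF U+00F4 U+00FC
SAMPLE = [0xE0, 0xE2, 0xE4, 0xE7, 0xE8, 0xE9, 0xEA, 0xEE, 0xEF, 0xF4, 0xFC].pack("U*")

# A character class covering code points U+00E4 through U+00EE.
CODE_POINT_RANGE = /[\u00E4-\u00EE]/

MATCHES = SAMPLE.scan(CODE_POINT_RANGE)
p MATCHES.map { |ch| format("U+%04X", ch.ord) }
# prints ["U+00E4", "U+00E7", "U+00E8", "U+00E9", "U+00EA", "U+00EE"]
```

That is the same six characters as the interpolated version quoted above, just selected by code point rather than by byte sequence.
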
> > Hope this helps.
>
> Yes!
>
> -- gw
> --
> Posted via http://www.ruby-forum.com/.