On Nov 30, 4:00 pm, Greg Willits <li... / gregwillits.ws> wrote:
> Dale Martenson wrote:
> > On Nov 30, 2:18 pm, Greg Willits <li... / gregwillits.ws> wrote:
>
> >> So, what's the secret to using unicode character ranges in Ruby regex
> >> (or Rails validations)?
>
> > Tim Bray gave a great talk about I18N, M17N and Unicode at the 2006
> > Ruby Conference. His presentation can be found at:
>
> > http://www.tbray.org/talks/rubyconf2006.pdf
>
> > He described how many member functions have trouble dealing with these
> > character sets. He made special reference to regular expressions.
>
> That's just beyond sad.
>
> I've been using Lasso for several years now, and since *2003* it has provided
> complete support for Unicode. I know there's some esoterics it may not
> deal with, but for all practical purposes we can round-trip data in
> western and eastern languages with Lasso quite easily.
>
> How can all these other languages be so far behind?
>
> Pretty bad if I can't even allow Mr. Muños or Göran to enter their names
> in a web form with proper server side validations. Aargh.
>
> -- gw
> --
> Posted via http://www.ruby-forum.com/.

Ruby 1.8 doesn't have Unicode support (1.9 is starting to get it).
Every string in Ruby 1.8 is a byte string, so /./ matches one byte at a time:

irb(main):001:0> 'aébvHögtåwHÅFuG'.scan(/./)
=> ["a", "\303", "\251", "b", "v", "H", "\303", "\266", "g", "t",
"\303", "\245", "w", "H", "\303", "\205", "F", "u", "G"]
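
For comparison, here is a small sketch of the same string seen as characters and as bytes. It assumes Ruby 1.9 or later (not the 1.8 used in the transcript above), and `octal_bytes` is a hypothetical helper written only to mimic the octal escapes that 1.8 prints.

```ruby
# Assumes Ruby 1.9 or later. The \u escapes make the literal UTF-8
# regardless of the source file's encoding.
SAMPLE = "a\u00E9bvH\u00F6gt\u00E5wH\u00C5FuG"  # aébvHögtåwHÅFuG

# Hypothetical helper: show ASCII bytes as themselves and every other
# byte as an octal escape, similar to what Ruby 1.8 displays.
def octal_bytes(str)
  str.bytes.map { |b| b < 0x80 ? b.chr : format("\\%o", b) }
end

p SAMPLE.scan(/./).length     # 15 characters
p octal_bytes(SAMPLE).length  # 19 bytes
p octal_bytes("\u00E9")       # the two bytes of é in octal
```

Each of the four accented letters takes two bytes in UTF-8, which accounts for the difference between 15 and 19.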

So your character class is matching the first byte of each multibyte
character (\303 in octal, \xc3 in hex) and skipping the second byte (since
it's below the range). You probably want something like...

reg = /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
'aébvHögtåwHÅFuG'.scan(reg)

irb(main):006:0* reg = /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
=> /[\xc0-\xd6\xd9-\xf6\xf9-\xff][\x80-\xbc]/
irb(main):007:0> 'aébvHögtåwHÅFuG'.scan(reg)
=> ["\303\251", "\303\266", "\303\245", "\303\205"]
irb(main):008:0> "å" == "\303\245"
=> true

P.S. I'm not entirely sure the range in the second character class is
right.
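
On that doubt: UTF-8 continuation bytes run from \x80 to \xbf, so [\x80-\xbc] stops three values short, and the first class accepts lead bytes other than \xc3. Below is a hedged sketch of two alternatives, assuming Ruby 1.9 or later and assuming the intent is the code points U+00C0-U+00D6, U+00D9-U+00F6 and U+00F9-U+00FF (the same numeric ranges as the first class above). The constant and method names are made up for illustration.

```ruby
# Assumes Ruby 1.9 or later; all names here are illustrative.
SAMPLE_TEXT = "a\u00E9bvH\u00F6gt\u00E5wH\u00C5FuG"  # aébvHögtåwHÅFuG

# Character-level class: the ranges are code points, not bytes.
ACCENTED = /[\u00C0-\u00D6\u00D9-\u00F6\u00F9-\u00FF]/

# Byte-level equivalent. Every code point in U+00C0..U+00FF encodes as
# \xC3 followed by (code point - 0x40), so each range shifts down by 0x40.
ACCENTED_BYTES = /\xC3[\x80-\x96\x99-\xB6\xB9-\xBF]/n

def accented_chars(str)
  str.scan(ACCENTED)
end

def accented_chars_bytewise(str)
  raw = str.dup.force_encoding(Encoding::BINARY)
  # Matches come back as binary strings; relabel them as UTF-8.
  raw.scan(ACCENTED_BYTES).map { |m| m.force_encoding(Encoding::UTF_8) }
end

p accented_chars(SAMPLE_TEXT)
p accented_chars_bytewise(SAMPLE_TEXT) == accented_chars(SAMPLE_TEXT)
```

The byte-level pattern is the closer analogue of what a byte-string Ruby needs, while the character-level one is easier to read where the regex engine understands the string's encoding.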

Regards,
Jordan