On Dec 4, 3:07 am, Daniel DeLorme <dan... / dan42.com> wrote:

> Ok, then how do you explain this:
>  >> $KCODE='u'
> => "u"
>  >> "abc\303\244".scan(/./)
> => ["a", "b", "c", "]
>
> This doesn't require any libraries, and it seems to my eyes that ruby is
> converting 5 bytes into 4 characters. It shows an awareness of utf8. If
> that's not *some* kind of unicode support then please tell me what it
> is. It seems we're disagreeing on some basic definition of what "unicode
> support" means.

I guess we were talking about different things then. I never meant to
imply that the regexp engine can't match unicode characters (its
implementation is "dumb", though; it basically only knows that bytes
above 127 can have more bytes following and should be grouped together
as candidates for a match; that's slightly simplified, but basically
accurate).
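That byte-grouping rule can be sketched in a few lines. This is just my
own illustration of the simplified model above (the group_utf8_bytes
helper is mine, not anything from ruby's internals), written against
current ruby so it can be run directly:

```ruby
# My simplified model of the 1.8 engine's rule: a byte below 0x80 is
# ASCII, a byte in 0xC0..0xFF leads a multi-byte sequence, and a byte
# in 0x80..0xBF is a continuation that attaches to the previous group.
def group_utf8_bytes(bytes)
  bytes.each_with_object([]) do |b, chars|
    if b < 0x80 || b >= 0xc0
      chars << [b]        # ASCII or lead byte starts a new "character"
    else
      chars.last << b     # continuation byte joins the current one
    end
  end
end

# "abcä" is five bytes but groups into four candidate characters:
p group_utf8_bytes("abc\u00e4".bytes)
# => [[97], [98], [99], [195, 164]]
```

Nothing more than that grouping is needed to explain the scan(/./)
result above: five bytes, four candidates.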

I, like Charles (and I think most people), was referring to the
ability to index into strings by characters, find their lengths in
characters, to compose and decompose composite characters, to
normalize characters, convert them to other encodings like shift-jis,
and other such things. Ruby 1.9 has started adding such support, while
ruby 1.8 lacks it. It can be hacked together with regular expressions
(e.g., the link Jimmy posted), or even as a real, compiled extension
[1], but merely saying that *you* the programmer can implement it
using ruby 1.8, is not the same thing as saying ruby 1.8 supports it
(just like I could build a python VM in ruby, but that doesn't mean
that the ruby interpreter runs python bytecode). Anyhow, I guess it's
just a difference of opinion. I don't mind being wrong (happens a
lot! ;) I just don't like being accused of spreading FUD about ruby,
which to my mind implies malice aforethought rather than simple
mistake.

[1] http://rubyforge.org/projects/char-encodings/
    http://git.bitwi.se/ruby-character-encodings.git/
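For contrast, here's the kind of thing I mean, using the
character-aware API that 1.9 started adding (shown on current ruby;
unicode_normalize arrived later still, in 2.2):

```ruby
s = "abc\u00e4"              # "abcä": four characters, five bytes
p s.length                    # => 4 (characters, not bytes)
p s.bytesize                  # => 5
p s[3]                        # => "ä" (indexing by character)

# Decompose "ä" into "a" plus a combining diaeresis (NFD):
p "\u00e4".unicode_normalize(:nfd).length  # => 2

# Convert to another encoding, e.g. shift-jis:
p "\u3042".encode("Shift_JIS").bytes       # => [130, 160]
```

On 1.8, length and [] work on bytes, and none of the rest exists
without an extension or regexp hacks.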

> > Secondly, as I said in my first post to this thread, the characters
> > trying to be matched are composite characters, which requires you to
> > match both bytes. You can try using a unicode regexp, but then you
> > run into the problem you mention--the regexp engine expects the pre-
> > composed, one-byte form...
>
> > ".scan(/[\303\262]/u) # => []
> > ".scan(/[\xf2]/u) # => ["\303\262"]
>
> Wow, I never knew that second one could work. Unicode support is
> actually better than I thought! You learn something new every day.
>
> > ...which is why I said it's more robust to use something like the the
> > regexp that Jimmy linked to and I reposted, instead of a unicode
> > regexp.
>
> I'm not sure what makes that huge regexp more robust than a simple
> unicode regexp.
>
> Daniel

Well, I won't claim that you can't get a unicode regexp to match the
same. And I only saw that large regexp when it was posted here, so
I've not tested it to any great length. Interestingly, 1.9 uses this
regexp (originally from jcode.rb in stdlib) to classify a string as
containing utf-8: '[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]'.
My thought was that without knowing all of the minute
intricacies of unicode and how ruby strings and regexps work with
unicode values (which I don't, and assume the OP doesn't), I think the
huge regexp is more likely to Just Work in more cases than a
home-brewed unicode regexp. But like I said, that's just an initial
conclusion; I don't claim it's absolutely correct.
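For what it's worth, that classifier can be tried directly. The
constant name here is mine, and note the regexp only covers 2- and
3-byte sequences, so it's a heuristic rather than a full validator
(on current ruby it needs the /n flag and a binary string to avoid
encoding complaints):

```ruby
# jcode-style utf-8 hint: a 2-byte (0xC0-0xDF lead) or 3-byte
# (0xE0-0xEF lead) sequence followed by continuation bytes.
UTF8_HINT = /[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf][\x80-\xbf]/n

p "abc\u00e4".b =~ UTF8_HINT    # => 3 (the \xc3\xa4 pair for "ä")
p "plain ascii".b =~ UTF8_HINT  # => nil
```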

Regards,
Jordan