MonkeeSage wrote:
> Firstly, ruby doesn't have unicode support in 1.8, since unicode *IS*
> a standard mapping of bytes to *characters*. That's what unicode is.
> I'm sorry you don't like that, but don't lie and say ruby 1.8 supports
> unicode when it knows nothing about that standard mapping and treats
> everything as individual bytes (and any byte with a value greater than
> 126 just prints an octal escape)

Ok, then how do you explain this:
 >> $KCODE='u'
=> "u"
 >> "abc\303\244".scan(/./)
=> ["a", "b", "c", "ä"]

This doesn't require any libraries, and it seems to my eyes that ruby is 
converting 5 bytes into 4 characters. It shows an awareness of utf8. If 
that's not *some* kind of unicode support then please tell me what it 
is. It seems we're disagreeing on some basic definition of what "unicode 
support" means.
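
To make the byte/character distinction concrete, here is a minimal sketch using only core String methods. It assumes the string is the same one as in the irb session above; the variable names are just illustrative. It counts the raw bytes, decodes them as UTF-8 code points, and splits them with a /u regexp:

```ruby
# "abc" followed by the two UTF-8 bytes (0xC3 0xA4) that encode U+00E4.
s = "abc\303\244"

# "C*" unpacks one unsigned integer per byte.
bytes = s.unpack("C*")

# "U*" decodes the bytes as UTF-8 and yields one integer per code point.
codepoints = s.unpack("U*")

# The /u flag tells the regexp engine to treat the input as UTF-8,
# so /./ matches a whole character rather than a single byte.
chars = s.scan(/./u)

puts "bytes:      #{bytes.length}"
puts "codepoints: #{codepoints.inspect}"
puts "characters: #{chars.length}"
```

The point is only that the same five bytes come back as four items once something UTF-8-aware looks at them, and the last code point is 228 (0xE4) rather than the two bytes 195 and 164.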

> Secondly, as I said in my first post to this thread, the characters
> trying to be matched are composite characters, which requires you to
> match both bytes. You can try using a unicode regexp, but then you
> run into the problem you mention--the regexp engine expects the pre-
> composed, one-byte form...
> 
> "ò".scan(/[\303\262]/u) # => []
> "ò".scan(/[\xf2]/u) # => ["\303\262"]

Wow, I never knew that second one could work. Unicode support is 
actually better than I thought! You learn something new every day.

> ...which is why I said it's more robust to use something like the
> regexp that Jimmy linked to and I reposted, instead of a unicode
> regexp.

I'm not sure what makes that huge regexp more robust than a simple 
unicode regexp.

Daniel