Hi,

I've found some strange and unexpected behaviour to do with pattern 
matching when I use Unicode.  My example code follows and contains 
comments to suggest what I think should happen:


$KCODE = 'u'
require 'jcode'

text = "\xa3A\nB\n\xa3C\nxD\nE"

# This pattern finds all lines that intuitively should match it.
puts "Pattern includes \"(?:x|\xa3)?\":"
text.scan(/^(?:x|\xa3)?[A-Z]$/).each {|s| puts s }

# This pattern finds all lines except the one containing the C, which is
# contrary to my intuition.  I'd expect it to match all lines or, if I
# were really paranoid about Unicode, I *might* expect it to match all
# but the lines containing A and C.
puts "Pattern includes \"[x\xa3]\":"
text.scan(/^[x\xa3]?[A-Z]$/).each {|s| puts s }


The output of this is:


Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
xD
E


Without the first two (Unicode-specifying) lines, the output is what I 
expect:


Pattern includes "(?:x|ú)?":
úA
B
úC
xD
E
Pattern includes "[xú]":
úA
B
úC
xD
E


(Notice the extra line in the second half.)  What I find bizarre is
that with Unicode enabled, the ú in the character class matches ONLY
where it's the very first thing in the string.

Is there something funny about Unicode characters when using character 
classes?  Is this a known issue, or is it something weird and/or 
ignorant that I'm doing?
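
One thing that may be relevant, though I can't say it is the cause: the 
single byte \xa3 is not a complete UTF-8 character on its own.  Here is 
a minimal sketch of how to check that.  It assumes Ruby 1.9 or later 
(String#force_encoding and String#valid_encoding? do not exist in 1.8), 
and it assumes the byte was meant as ISO-8859-1, which is my guess and 
not something the script above states.  The variable names are only 
illustrative.

```ruby
# Assumption: Ruby 1.9+, where strings carry an encoding.
# Tag the lone byte 0xA3 as UTF-8 and ask whether it is well-formed.
lone_byte = "\xa3".dup.force_encoding("UTF-8")
puts lone_byte.valid_encoding?

# Assumption: the byte was intended as ISO-8859-1 (Latin-1).
# Transcode it to UTF-8 and show the resulting bytes in hex.
pound_utf8 = "\xa3".dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts pound_utf8.bytes.map { |b| "%02x" % b }.join(" ")
```

If the first check reports the byte as invalid, then writing the 
two-byte UTF-8 form in both the text and the pattern might be worth 
trying.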

Thanks!

Richard
