Issue #4044 has been updated by duerst (Martin Dürst).


Hello Yui,

We discussed this issue at today's developers' meeting in Akihabara.

There was wide consensus among the attendees that it is very strange to have 'k' and 's' included in the set of non-word (\W) characters. Therefore we are sorry, but we don't agree with your comment at https://bugs.ruby-lang.org/issues/4044#note-7.

duerst (Martin Dürst) wrote:
> My current proposal is that we analyse what casing data is being used in what places when using /i (case insensitive matching) in regular expressions, and that we then fix that.

We have discussed this a bit. The first question is what \w should refer to in Ruby. I personally would hope that in the long term, we can extend it to include all word characters (i.e. also non-ASCII Latin, other scripts, Hiragana, Katakana, Kanji, ...). But the general opinion today was that we should keep it ASCII-only for now. Anyway, this bug is independent of that question, because in both cases \w includes 'k' and 's', and therefore in both cases \W must include neither 'k' nor 's'.
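
To make the ASCII-only point concrete, here is a minimal sketch comparing \w with the Unicode property \p{Word} on a few sample characters. The helper names and the choice of samples are mine, purely for illustration; the output describes whatever Ruby you run it on.

```ruby
# Sample characters, written as \u escapes so the source file encoding
# does not matter. The selection is an illustrative assumption.
SAMPLES = {
  "k"      => "LATIN SMALL LETTER K",
  "\u00E9" => "LATIN SMALL LETTER E WITH ACUTE",
  "\u3042" => "HIRAGANA LETTER A"
}

# True if the character matches the plain \w shorthand.
def ascii_word?(char)
  !(char =~ /\w/).nil?
end

# True if the character matches the Unicode "Word" property.
def unicode_word?(char)
  !(char =~ /\p{Word}/).nil?
end

SAMPLES.each do |char, name|
  puts format("%-32s \\w: %-5s \\p{Word}: %s",
              name, ascii_word?(char), unicode_word?(char))
end
```

Either way, 'k' and 's' are word characters under both definitions, which is why the bug does not depend on how this question is settled.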

Also, we noted that regular expression components such as \w or \W should be independent of whether /i is set or not. The reason for that is that \w already takes care of combining lower- and upper-case characters. So there's nothing a /i can improve, and it should not make things worse.
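
One way to check this invariant is to list, for each pattern, the lowercase letters whose match result changes when /i is added. This is only a sketch: the method name and the pattern list are my own choices. On a Ruby without the bug every list should be empty; on an affected version I would expect 'k' and 's' to show up for the \W character-class forms.

```ruby
# Patterns built from \w and \W, inside and outside character classes.
PATTERNS = ['\w', '\W', '[\w]', '[\W]', '[^\w]', '[^\W]']

# Returns the letters a..z for which adding /i changes whether the
# pattern matches. An empty array means /i made no difference.
def letters_changed_by_i(source)
  plain  = Regexp.new(source)
  folded = Regexp.new(source, Regexp::IGNORECASE)
  ('a'..'z').select { |c| plain.match(c).nil? != folded.match(c).nil? }
end

REPORT = PATTERNS.each_with_object({}) do |source, result|
  result[source] = letters_changed_by_i(source)
end

REPORT.each do |source, letters|
  puts format("%-6s changed by /i: %s", source, letters.inspect)
end
```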

> By the way, can somebody explain the following difference:
> 
> $ ruby -e "puts /[\W]|\u1234/i.match('k').inspect"
> #<MatchData "k">
> 
> $ ruby -e "puts /\W|\u1234/i.match('k').inspect"
> nil
> 
> (|\u1234 is there just to force the regexp to be in UTF-8.)

I suspect that this is due to the fact that \W in character classes gets expanded to an actual list of characters (or ranges) before case-extension (/i), whereas \W outside character classes does not get affected by case-extension.
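
The suspected mechanism can be modelled in a few lines. This is a toy model, not the regex engine's actual code. I am assuming that the two relevant case-folding pairs are U+212A KELVIN SIGN with 'k' and U+017F LATIN SMALL LETTER LONG S with 's' (both are in the Unicode case folding data); the tiny universe of characters and all names below are made up for the illustration.

```ruby
# Assumed non-ASCII characters that case-fold to an ASCII letter.
FOLD_PAIRS = {
  "\u212A" => "k", # KELVIN SIGN
  "\u017F" => "s"  # LATIN SMALL LETTER LONG S
}

# Toy universe: ASCII letters plus the two special characters.
UNIVERSE   = ('a'..'z').to_a + ('A'..'Z').to_a + FOLD_PAIRS.keys
ASCII_WORD = /\A[a-zA-Z0-9_]\z/

# All characters considered equal to +char+ under case-insensitive matching.
def case_variants(char)
  variants = [char]
  variants << char.swapcase if char =~ /\A[a-zA-Z]\z/
  FOLD_PAIRS.each do |special, ascii|
    variants << ascii << ascii.upcase if char == special
    variants << special if char == ascii || char == ascii.upcase
  end
  variants.uniq
end

# Step 1: expand \W to an explicit list of characters.
non_word = UNIVERSE.reject { |c| c =~ ASCII_WORD }

# Step 2: apply case-extension (/i) to that list.
expanded = non_word.flat_map { |c| case_variants(c) }.uniq

# Word characters that ended up in the expanded \W class.
LEAKED = expanded.select { |c| c =~ ASCII_WORD }.sort
puts LEAKED.inspect
```

In this model, only the upper- and lower-case forms of 'k' and 's' leak into the class, which would match what the original report observed.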

Given the above, I have reopened this bug. I hope to be able to help you over the next two weeks, but I hope you can take the lead.

Regards,   Martin.

----------------------------------------
Bug #4044: Regex matching errors when using \W character class and /i option
https://bugs.ruby-lang.org/issues/4044#change-25109

Author: ben_h (Ben Hoskings)
Status: Feedback
Priority: Normal
Assignee: naruse (Yui NARUSE)
Category: core
Target version: 1.9.2
ruby -v: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]


=begin
 Hi all,
 
 Josh Bassett and I just discovered an issue with regex matches on ruby-1.9.2p0. (We reduced it while we were hacking on gemcutter.)
 
 The case-insensitive (/i) option combined with the non-word character class (\W) matches inconsistently against the alphabet. Specifically, the regex doesn't match properly against the letters 'k' and 's'.
 
 The following expression demonstrates the problem in irb:
 
     puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/i] ].inspect }
 
 As a reference, the following two expressions are working properly:
 
     puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/] ].inspect }
     puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[\w]/i] ].inspect }
 
 Cheers
 Ben Hoskings & Josh Bassett
=end



-- 
http://bugs.ruby-lang.org/