Issue #4044 has been updated by Matthew Kerwin.


Martin Dürst wrote:
> On 2016/02/03 12:21, matthew / kerwin.net.au wrote:
>  
>  > I want to write a spec for this, but some of the details are unclear to me. Can we confirm whether each of the following are spec?
>  
>  Please don't just assume that the current behavior is spec.

Indeed, that's why I asked.

>  If it 
>  doesn't match with common sense in any way, it's very clear that we have 
>  to fix it. There may be borderline cases that are up for discussion, but 
>  at least most of the examples I have seen don't meet that criterion.
>

Confusion abounds. I thought that if there was a formal spec, at least that would give a solid grounding to start from. As it is, we rely on implementations to describe what should (or does) happen, which is imperfect and lets us confuse bugs with spec.

(Right now I'm particularly interested in why `/[\W]/i =~ 'k' #=> nil`)

>  My understanding was that Ken Takata fixed the problem with r47598, but 
>  I'll try to have another look at that.
>  
>  When I looked at Ken's solution last time
>  (the details are at the following link, in Japanese
>  https://github.com/k-takata/Onigmo/issues/4), it included some aspects 
>  related to ASCII, which keeps confusing me.
>

I've looked at that issue, but I'm afraid I can't read Japanese (and Google Translate only gets me so far). I think I get the gist of it, but any subtlety is probably lost on me.

>  The relevant specification is Unicode Technical Standard #18, Unicode 
>  Regular Expressions, in particular 
>  http://www.unicode.org/reports/tr18/#Simple_Loose_Matches. There are 
>  various choices at the end of that section that are relevant to this issue.
>  
>  My personal preference among the choices A-D is B. As far as I 
>  understand it, it would mean that while a /i option would change how 
>  literal characters are matched, it would not affect 
>  properties such as \W.
>

I suppose we're in choice D at the moment (which would explain why `/\W/i` and `/[\W]/i` match differently), but exactly which "specific properties and/or explicit character classes" are covered remains unclear. Documenting those (and writing a spec) would help.
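For anyone who wants to see how their own build treats the two forms, here is a rough sketch that lists which lowercase ASCII letters each pattern matches. The `PATTERNS` table and the `matching_letters` helper are names made up for illustration, and the output for the `/i` variants is expected to vary between Ruby (Onigmo) versions, so no particular result is assumed.

```ruby
# Patterns from this thread. What the /i variants match depends on the
# Ruby (Onigmo) version, so nothing here assumes a particular answer.
PATTERNS = {
  '/\W/'    => /\W/,
  '/[\W]/'  => /[\W]/,
  '/\W/i'   => /\W/i,
  '/[\W]/i' => /[\W]/i
}

# Hypothetical helper: the lowercase ASCII letters that a pattern matches.
def matching_letters(pattern)
  ('a'..'z').select { |c| pattern =~ c }
end

# Print one line per pattern so the four results can be compared by eye.
PATTERNS.each do |source, pattern|
  puts "#{source.ljust(8)} matches #{matching_letters(pattern).inspect}"
end
```

Without `/i`, both forms should report an empty list; any difference between the last two lines is the behaviour in question.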

>  My justification for this is as follows: If I want e.g. a word 
>  character, then that already should include all the necessary 
>  characters, both upper and lower case (and title case just in case you 
>  forgot about it :-). It's difficult to see why I'd want the set of 
>  characters to change when adding /i. The same argument can be applied to 
>  \W and most if not all similar cases.
>

When we were discussing it on Ruby Talk the other day I came up with this:

* the 'ﬀ' ligature is a non-word character
* it has a case conversion, so is affected by the `//i` flag

So:

* `/ﬀ/` is a subset of `/\W/`
* `/ﬀ/i` matches 'ﬀ', 'FF', 'ff', 'fF', and 'Ff'
* therefore `/\W/i` should match all of the above
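The chain of reasoning above can be probed with a short script. This is only a sketch: the `LIGATURE` and `CANDIDATES` constants and the `report` helper are invented here, the ligature is written as the escape `"\uFB00"` (LATIN SMALL LIGATURE FF) to sidestep source-encoding problems, and what the `/i` patterns return is precisely what is under discussion, so the script prints results rather than asserting them.

```ruby
# LATIN SMALL LIGATURE FF, written as an escape to avoid encoding issues.
LIGATURE = "\uFB00"

# The five strings from the list above.
CANDIDATES = [LIGATURE, 'FF', 'ff', 'fF', 'Ff']

# Hypothetical helper: pairs of [candidate, matched?] for one pattern.
def report(pattern)
  CANDIDATES.map { |s| [s, !!(pattern =~ s)] }
end

checks = {
  'literal ligature'    => Regexp.new(LIGATURE),
  'literal ligature /i' => Regexp.new(LIGATURE, Regexp::IGNORECASE),
  '\W'                  => /\W/,
  '\W /i'               => /\W/i
}

# Results for the /i rows may differ between Ruby versions.
checks.each do |label, pattern|
  puts "#{label.ljust(20)} #{report(pattern).inspect}"
end
```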

The first two dot points are where I see the contention. If I were to make a general rule, I'd say that "\W" should not be expanded for case-folding, since 'case' is a property of word characters. (If anything matches "\W" it is, by definition, not a word character, so should not be subject to word-type operations like case-folding.)

If that were so, `/ﬀ/i` (and therefore `/\W/i`) would match 'ﬀ' but not 'FF'.

That would, I think, make `\W` a perfect complement to `\w` (identical to `[^\w]`); which seems to be what people expect.
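One way to test the "perfect complement" expectation is to look for characters on which `\W` and `[^\w]` disagree. The sketch below is an illustration only: `SAMPLE` and `disagreements` are made-up names, and the three non-ASCII sample characters (KELVIN SIGN U+212A, LATIN SMALL LETTER LONG S U+017F, and the ff ligature U+FB00) are my own picks because they case-fold to ASCII letters; they are not taken from the bug report.

```ruby
# Sample characters: ASCII letters, a few ASCII others, and three non-ASCII
# characters chosen (as an assumption) because they case-fold to ASCII.
SAMPLE = ('a'..'z').to_a + ('A'..'Z').to_a +
         ['_', '0', ' ', '-', "\u212A", "\u017F", "\uFB00"]

# Hypothetical helper: sample characters on which two patterns disagree.
def disagreements(a, b)
  SAMPLE.reject { |c| !!(a =~ c) == !!(b =~ c) }
end

puts "without /i: #{disagreements(/\W/, /[^\w]/).inspect}"
puts "with /i:    #{disagreements(/\W/i, /[^\w]/i).inspect}"
```

An empty list on the second line would mean the two forms behave as complements under `/i` for this sample; a non-empty list shows where they diverge on the Ruby being run.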

I think that means you and I are saying the same thing, in different ways.

>  The case that I think can be up for discussion is explicit character 
>  classes, such as [a-z]. Here, in effect automatically adding A-Z (and 
>  some other case equivalents) may indeed make sense.

Certainly; I use `/[0-9a-f]/i` myself for matching hexadecimal numbers (and similar patterns for similar things). However, where would that leave us with `/[a-e\W]/i`?
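As a concrete instance of the explicit-class case, here is a minimal sketch of the hexadecimal check. The anchors, the `HEX` constant and the `hex?` helper are additions for illustration; only the `[0-9a-f]` class with `/i` comes from the discussion.

```ruby
# Whole-string hexadecimal check: the explicit range a-f is widened to
# include A-F by the /i flag.
HEX = /\A[0-9a-f]+\z/i

# Hypothetical helper returning true or false.
def hex?(str)
  !!(HEX =~ str)
end

puts hex?('DEADbeef')
puts hex?('xyz')
```

This is the uncontroversial use of `/i` on a bracket expression; the mixed class `/[a-e\W]/i` is where the explicit-range rule and the property rule would have to be reconciled.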

----------------------------------------
Bug #4044: Regex matching errors when using \W character class and /i option
https://bugs.ruby-lang.org/issues/4044#change-56886

* Author: Ben Hoskings
* Status: Closed
* Priority: Normal
* Assignee: Yui NARUSE
* ruby -v: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]
* Backport: 
----------------------------------------
=begin
 Hi all,
 
 Josh Bassett and I just discovered an issue with regex matches on ruby-1.9.2p0. (We reduced it while we were hacking on gemcutter.)
 
 The case-insensitive (/i) option together with the non-word character class (\W) match inconsistently against the alphabet. Specifically the regex doesn't match properly against the letters 'k' and 's'.
 
 The following expression demonstrates the problem in irb:
 
     puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/i] ].inspect }
 
 As a reference, the following two expressions are working properly:
 
     puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[^\W]/] ].inspect }
     puts ('a'..'z').to_a.map {|c| [c, c.ord, c[/[\w]/i] ].inspect }
 
 Cheers
 Ben Hoskings & Josh Bassett
=end




-- 
https://bugs.ruby-lang.org/
