Issue #5871 has been updated by Martin Dürst.

Status changed from Rejected to Open

Shouhei Urabe writes:

> Quite generally speaking you are advised not to use /i in Unicode.

Are there other examples where /i is advised against? If yes, please let's look at them and try to fix them, too.

> The reason? because Babylonians did something wrong.

Many problems can be (figuratively) blamed on the Babylonians, but not this one.

> In this specific case the [\W], which equals to [^A-Za-z], includes K and ß.  So /[\W]/i includes k and SS.

Let's look at this in detail. At https://bugs.ruby-lang.org/issues/4044#note-9, Yui Naruse writes:

> Unicode ignore case breaks it.
> http://unicode.org/reports/tr21/

That link says "Superseded Unicode Standard Annex". It gives three locations for the information: http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G33992, http://www.unicode.org/versions/Unicode4.0.0/ch04.pdf#G124722, and http://www.unicode.org/versions/Unicode4.0.0/ch05.pdf#G21180. In the archival version of tr21, at http://www.unicode.org/reports/tr21/tr21-5.html, the word "ignore" appears only twice, and I found no definition of "ignore case". Can somebody tell me exactly what is meant?

I don't assume that the Unicode Standard would define or imply that 'k' or 'S' are non-word characters. However, if indeed there is some data or text in the Unicode Standard that defines or implies this, then that would need to be fixed urgently, and I'd like to help.

> 212A; C; 006B; # KELVIN SIGN
> 00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
> http://www.unicode.org/Public/UNIDATA/CaseFolding.txt

> \W includes U+212A and U+00DF
> /i adds U+006B (k) and U+0073 (S) to [\W]
> ^ reverses the class; it doesn't include k & S.
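
To make the mechanism described in this quote concrete, here is a toy model in Ruby. It does not call the regexp engine at all; it only mimics the described steps with sets. The two folding entries are copied from the CaseFolding.txt lines quoted above. Everything else (the names FOLDS, UNIVERSE, naive_ignore_case, and the idea of adding the upper-case partner of each folded letter) is my own simplification for illustration, not how the engine is actually written.

```ruby
require "set"

# The two CaseFolding.txt entries quoted above.
FOLDS = {
  "\u212A" => ["k"],      # KELVIN SIGN, status C
  "\u00DF" => ["s", "s"]  # LATIN SMALL LETTER SHARP S, status F
}.freeze

# Toy universe: ASCII letters plus the two problem characters.
UNIVERSE = (("A".."Z").to_a + ("a".."z").to_a + FOLDS.keys).to_set

ASCII_WORD = UNIVERSE.select { |c| c.match?(/\A[A-Za-z]\z/) }.to_set
NON_WORD   = UNIVERSE - ASCII_WORD # what \W contains in this toy model

# Assumed (simplified) one-way expansion: for each member of the class,
# add the characters of its case folding and their upper-case partners.
def naive_ignore_case(set)
  expanded = set.dup
  set.each do |ch|
    FOLDS.fetch(ch, []).each do |folded|
      expanded << folded
      expanded << folded.upcase
    end
  end
  expanded
end

# Word characters that end up inside the "non-word" class.
leaked = naive_ignore_case(NON_WORD) & ASCII_WORD
puts leaked.to_a.sort.join(" ") # K S k s
```

In this model the expanded class contains exactly the four letters that disappear in the original report.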

Because of "the Babylonians", it is frequently the case that some property that applies in a limited character set (e.g. the character set of US-ASCII) doesn't apply directly in a wider character set (e.g. the Unicode character set). In that case, rather than blaming the problem on "the Babylonians", what needs to be done is: 1) Analyse the problem, to figure out what assumptions are no longer guaranteed. 2) Think about what programmers/users would most reasonably expect. 3) Figure out how to fix the implementation so that expectations are met even without the previously valid assumptions.

In our case, we have the assumption that the negation of a character class does not include any characters of that class. For ASCII, that's true. For Unicode, as currently implemented, it's not true, but that's only because the Unicode case tables haven't been used correctly. When it comes to "the Babylonians", there isn't a one-to-one case mapping, and as a consequence, one-way case mapping and case equivalence behave somewhat differently. I think the \w (word character) class should be defined on round-trip case equivalence (which would include U+212A and U+00DF), not on one-way case mappings, as apparently is currently the case. The use of round-trip case equivalence may also be appropriate for other operations in the regular expression implementation, but this needs to be checked.
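
As a rough sketch of what I mean, again as a set model in Ruby rather than engine code: first close the word class under case equivalence, and only then take the complement. The rule used here (a character counts as a word character under /i when every character of its case folding is an ASCII word character) is one possible reading of "round-trip case equivalence" and is an assumption on my part, as are the names CASE_FOLD, SAMPLE and fold.

```ruby
require "set"

# Hypothetical mini folding table: the same two entries as in CaseFolding.txt.
CASE_FOLD = { "\u212A" => "k", "\u00DF" => "ss" }.freeze

# Sample universe: ASCII letters, the two problem characters, two real non-word characters.
SAMPLE = (("A".."Z").to_a + ("a".."z").to_a + CASE_FOLD.keys + ["!", "-"]).to_set

# Case folding for the sample: table lookup first, plain downcase otherwise.
def fold(ch)
  CASE_FOLD.fetch(ch) { ch.downcase }
end

ascii_word = SAMPLE.select { |c| c.match?(/\A[A-Za-z]\z/) }.to_set

# Assumed rule: word character under /i if its whole folding consists of word characters.
word_ci = SAMPLE.select { |c| fold(c).each_char.all? { |f| ascii_word.include?(f) } }.to_set

# The negated class is built as the complement of the closed class,
# so it cannot contain any word character.
non_word_ci = SAMPLE - word_ci

puts (non_word_ci & ascii_word).empty? # true
puts non_word_ci.to_a.sort.join(" ")   # ! -
```

With this ordering, U+212A and U+00DF land in the word class, and 'k' and 's' can no longer be pulled into its negation.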

Anyway, an implementation that claims that 'k' and 'S' are non-word characters is fundamentally broken, and we have to fix it. I have therefore reopened the bug. (Sorry, I was not aware of https://bugs.ruby-lang.org/issues/4044, otherwise I'd have explained things then.)

The question of whether to use round-trip case equivalence (which is appropriate e.g. for search) or only some more limited case operation also comes up in other circumstances. As an example, IDNA 2003 defines that ß (U+00DF) maps to 'ss', but in the context of domain names, that turned out to be the wrong choice, because it means that it is impossible to use ß in internationalized domain names. This was fixed in IDNA 2008.



----------------------------------------
Bug #5871: regexp \W matches some word characters when inside a case-insensitive character class
https://bugs.ruby-lang.org/issues/5871

Author: Gareth Adams
Status: Open
Priority: Normal
Assignee: 
Category: 
Target version: 
ruby -v: ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-darwin10.8.0]


=begin
The following replacement, which should do nothing, has removed the upper- and lower-case "K"s and "S"s from the result:

    > "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".gsub(/[\W]/i,"")
    => "ABCDEFGHIJLMNOPQRTUVWXYZabcdefghijlmnopqrtuvwxyz"

The result is correct (the same as the input string) if I remove either the character class:
 
    > "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".gsub(/\W/i,"")
    => "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz" 

or the case insensitive flag:

    > "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz".gsub(/[\W]/,"")
    => "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

This has been observed in two separate ruby 1.9 installs:

* ruby 1.9.2p290 (2011-07-09 revision 32553) [x86_64-darwin10.8.0]
* ruby 1.9.3p0 (2011-10-30 revision 33570) [x86_64-darwin11.2.0]
  
but works correctly in 1.8
=end



-- 
http://bugs.ruby-lang.org/