Issue #18013 has been updated by duerst (Martin Dürst).


Just a question: What's the purpose of nested character classes?

I didn't even know that there was such a thing as nested character classes.

Depending on the purpose of nested character classes, the right way to handle things may differ. This is just a wild guess, but if there's no difference between usual character classes and nested character classes, then there isn't really a purpose for nested character classes.

----------------------------------------
Bug #18013: Unexpected results when mixing negated character classes and case-folding
https://bugs.ruby-lang.org/issues/18013#change-92691

* Author: jirkamarsik (Jirka Marsik)
* Status: Open
* Priority: Normal
* ruby -v: ruby 3.0.1p64 (2021-04-05 revision 0fb782ee38) [x86_64-linux]
* Backport: 2.6: UNKNOWN, 2.7: UNKNOWN, 3.0: UNKNOWN
----------------------------------------
```
irb(main):001:0> /[^a-c]/i.match("A")
=> nil
irb(main):002:0> /[[^a-c]]/i.match("A")
=> #<MatchData "A">
```

The two regular expressions above match different strings, because the character classes denote different sets of characters. In order for `/[^a-c]/i` to produce correct results, Oniguruma provided a fix that is still easy to spot in the code, as it is guarded by an always-on preprocessor flag (`CASE_FOLD_IS_APPLIED_INSIDE_NEGATIVE_CCLASS`, https://github.com/ruby/ruby/blob/9eae8cdefba61e9e51feb30a4b98525593169666/regparse.c#L5528). The idea of the fix is to first case-fold a character class and only then apply the negation (essentially moving the case-fold operator *inside* the negation).

In the case of our first regular expression, `[a-c]` is case-folded into `[a-cA-C]` and that is then inverted into `[^a-cA-C]`, which is the expected result. However, this case-folding logic is currently only applied to the top-most character class, so if we use a nested negated character class, the order of the operations is switched.

With our second regular expression, `[a-c]` will first be negated to yield `[^a-c]`, which will then be case-folded into `.`, the set of all characters (since `[^a-c]` contains `A-C`, which case-fold into `a-c`).
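The difference between the two orderings can be sketched with plain sets, independently of the regexp engine. This is only a toy model under stated assumptions: the universe is limited to printable ASCII, and `case_fold` and `negate` are hypothetical helpers written for this illustration, not Oniguruma functions.

```ruby
require "set"

# Assumption: a character class is modelled as a Set over printable ASCII only.
UNIVERSE = (0x20..0x7e).map(&:chr).to_set

# Hypothetical helper: add the other-case variant of every character.
def case_fold(set)
  set | set.map(&:swapcase).to_set
end

# Hypothetical helper: complement relative to the toy universe.
def negate(set)
  UNIVERSE - set
end

a_to_c = ("a".."c").to_set

# Order used for the top-most class: fold first, then negate.
fold_then_negate = negate(case_fold(a_to_c))

# Order that results for a nested class: negate first, then fold.
negate_then_fold = case_fold(negate(a_to_c))

puts fold_then_negate.include?("A")   # false
puts negate_then_fold.include?("A")   # true
puts negate_then_fold == UNIVERSE     # true
```

In this model, negating first leaves `A-C` in the set, and folding then brings `a-c` back, so the result is the whole universe.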

A way to fix this would be to apply case-folding for nested character classes as well, so that the nested character classes behave the same as the top-most character class. Then, we would get the same semantics for both expressions.
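One way to check whether a given Ruby build gives both expressions the same semantics is to compare them character by character. The following is a rough sketch; `disagreements` and `SAMPLE` are hypothetical names introduced here, and the sample is restricted to ASCII letters for brevity.

```ruby
# Assumption: ASCII letters are a sufficient sample for this particular pair.
SAMPLE = (("a".."z").to_a + ("A".."Z").to_a).freeze

# Hypothetical helper: characters on which the two patterns give different answers.
def disagreements(top, nested, chars)
  chars.reject { |ch| top.match?(ch) == nested.match?(ch) }
end

top    = /[^a-c]/i
nested = /[[^a-c]]/i

diff = disagreements(top, nested, SAMPLE)

# An empty list means the two forms agree on the sample; on an affected
# build one would expect the letters a-c and A-C to show up here.
puts diff.inspect
```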


