Issue #17837 has been updated by duerst (Martin Dürst).


nobu (Nobuyoshi Nakada) wrote in #note-22:
> I made a patch for `Regexp#backtrack_limit=`.
> There seems to be no significant performance difference.
>
> https://github.com/ruby/ruby/compare/master...nobu:Regexp.backtrack_limit?expand=1

I have looked at this patch. I think this is the general direction to go. I also think that the interface/API looks good; maybe having a keyword argument on Regexp.new, too, would be a good addition.
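
To make the discussion concrete, here is a small sketch of how I picture the API being used. The class-level setter and `MatchData#backtrack_count` come from the patch; the keyword argument on `Regexp.new` is only my suggestion and purely hypothetical, and I have not checked what exactly happens when the limit is exceeded.
```Ruby
# Class-level limit, as provided by the patch:
Regexp.backtrack_limit = 1_000_000 if Regexp.respond_to? :backtrack_limit

# Per-match counter added by the patch:
m = "aaa".match(/a?a?a?aaa/)
p m.backtrack_count

# Purely hypothetical, NOT in the patch: a per-Regexp keyword argument.
# r = Regexp.new('a?' * 20 + 'a' * 20, backtrack_limit: 1_000_000)
```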

I installed the patch and ran some experiments. I started with a very slow Regexp that I use to show my students. It can be made of any length n, but gets really slow when n grows, on the order of O(2**n). I realized that it actually may be some kind of worst case, because it's not really doing much except backtracking. That means that the overhead of counting the backtracks will show very clearly. Any more 'average' example should slow down quite a bit less.

So here is the program I used:
```Ruby
HAS_BACKTRACK_LIMIT = Regexp.respond_to? :backtrack_limit

def slow_find(n)
  s = 'a' * n
  r = Regexp.compile('a?' * n + s)   # worst-case pattern /a?a?...a?aa...a/
  m = nil
  t1 = Time.now
  10.times { m = s.match(r) }        # match 10 times, report the total time
  t2 = Time.now
  print "n: #{n}, time: #{t2-t1}/10"
  print ", backtrack_count: #{m.backtrack_count}" if HAS_BACKTRACK_LIMIT
  puts
end

(25..29).each do |i|
  slow_find i
end
```

You can easily adjust this by changing the part (25..29) (the range of n's used) and the two instances of '10' (the number of times a match is run in a row).

Here are the results for the patch:
```
duerst@Kloentalersee:~/nobuRuby$ ./ruby try_backtrack_limit.rb
n: 25, time: 2.8453695/10, backtrack_count: 33554431
n: 26, time: 5.6392941/10, backtrack_count: 67108863
n: 27, time: 11.3532755/10, backtrack_count: 134217727
n: 28, time: 24.1388335/10, backtrack_count: 268435455
n: 29, time: 49.084651/10, backtrack_count: 536870911
duerst@Kloentalersee:~/nobuRuby$ ./ruby try_backtrack_limit.rb
n: 25, time: 2.7971486/10, backtrack_count: 33554431
n: 26, time: 5.9609293/10, backtrack_count: 67108863
n: 27, time: 12.126138/10, backtrack_count: 134217727
n: 28, time: 24.7895166/10, backtrack_count: 268435455
n: 29, time: 49.6923646/10, backtrack_count: 536870911
duerst@Kloentalersee:~/nobuRuby$ ./ruby try_backtrack_limit.rb
n: 25, time: 2.8213545/10, backtrack_count: 33554431
n: 26, time: 6.1295964/10, backtrack_count: 67108863
n: 27, time: 12.1948968/10, backtrack_count: 134217727
n: 28, time: 24.6284841/10, backtrack_count: 268435455
n: 29, time: 48.6898231/10, backtrack_count: 536870911
```

And here are the results without the patch:
```
duerst@Kloentalersee:~/ruby3$ ./ruby ../nobuRuby/try_backtrack_limit.rb
n: 25, time: 2.6384167/10
n: 26, time: 5.2395088/10
n: 27, time: 11.3225276/10
n: 28, time: 23.289667/10
n: 29, time: 45.9637488/10
duerst@Kloentalersee:~/ruby3$ ./ruby ../nobuRuby/try_backtrack_limit.rb
n: 25, time: 2.5845849/10
n: 26, time: 5.2094378/10
n: 27, time: 10.5159888/10
n: 28, time: 22.5549276/10
n: 29, time: 45.600226/10
duerst@Kloentalersee:~/ruby3$ ./ruby ../nobuRuby/try_backtrack_limit.rb
n: 25, time: 2.5993792/10
n: 26, time: 5.2897985/10
n: 27, time: 11.2203586/10
n: 28, time: 23.1157868/10
n: 29, time: 45.0094087/10
```

These results were obtained on a WSL2/Ubuntu install on Windows 10. All other user programs were switched off, which on Windows doesn't mean there's nothing else running, of course. It should be clear from the above results that the difference is around 5%, maybe a bit higher, but not 10%.

As I already said, that's for what I think is pretty much the worst case. All this Regexp does is backtrack in a binary tree of depth n, testing out all the combinations of choosing 'a' or not 'a' in the first half of the Regexp (which is just "a?a?a?a?...."). Every time it looks for an 'a' in that part, it finds one. But then (except for the very last try) it cannot match the second part of the Regexp (just n "a"s) to the rest of the string (which is also just n "a"s). For that, it actually doesn't need much time, because this part is optimized with a Boyer-Moore algorithm, which means it just checks that the last "a" in the Regexp would fall beyond the end of the actual string and so there's no match. This can be seen from the debug output when compiling Ruby with
```
#define ONIG_DEBUG_PARSE_TREE
#define ONIG_DEBUG_COMPILE
```
in regint.h, which results in the following:

```
$ ./ruby -e 'Regexp.new("a?a?a?aaa")'
`RubyGems' were not loaded.
`error_highlight' was not loaded.
`did_you_mean' was not loaded.

PATTERN: /a?a?a?aaa/ (US-ASCII)
<list:55e37d443dc0>
   <quantifier:55e37d443d80>{0,1}
      <string:55e37d443d40>a
   <quantifier:55e37d443e40>{0,1}
      <string:55e37d443e00>a
   <quantifier:55e37d445030>{0,1}
      <string:55e37d444ff0>a
   <string:55e37d4450b0>aaa
optimize: EXACT_BM
  anchor: []
  sub anchor: []

exact: [aaa]: length: 3
code length: 26
0:[push:(+2)] 5:[exact1:a] 7:[push:(+2)] 12:[exact1:a] 14:[push:(+2)]
19:[exact1:a] 21:[exact3:aaa] 25:[end]

```

So this should give everybody some indication of the worst-case slowdown with this new backtrack_limit feature. Results for some more "average" scenarios would also help; one possible benchmark along those lines is sketched below.
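
This is just an untested sketch, using the ReDoS pattern from the issue description; the range of n's is a guess and would need adjusting, and because the match fails there is no MatchData to ask for a backtrack count, so only the timing is reported.
```Ruby
# Untested sketch of a second, "different" benchmark using the pattern
# from the issue description. The match always fails here, so only timing
# is reported (no MatchData, hence no backtrack_count).
def redos_find(n)
  s = "A" + "C" * n + "X"
  r = /A(B|C+)+D/
  t1 = Time.now
  10.times { r.match(s) }
  t2 = Time.now
  puts "n: #{n}, time: #{t2 - t1}/10"
end

(22..26).each { |i| redos_find i }
```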

----------------------------------------
Feature #17837: Add support for Regexp timeouts
https://bugs.ruby-lang.org/issues/17837#change-92884

* Author: sam.saffron (Sam Saffron)
* Status: Open
* Priority: Normal
----------------------------------------
### Background

ReDoS is a very common security issue. At Discourse we have seen a few through the years. https://owasp.org/www-community/attacks/Regular_expression_Denial_of_Service_-_ReDoS

In a nutshell, there are hundreds of ways this can happen in production apps; the key is for an attacker (or possibly an innocent person) to supply either a problematic Regexp or a bad string to test it with.

```
/A(B|C+)+D/ =~ "A" + "C" * 100 + "X"
```

Having a problematic Regexp somewhere in a large app is a universal constant; it will happen as long as you are using Regexps.

Currently the only feasible way of supplying a consistent safeguard is by using `Thread.raise` and managing all execution (a rough sketch of this pattern follows below). This kind of pattern requires usage of a third-party implementation. There are possibly issues with JRuby and Truffle when taking approaches like this.
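
For illustration, here is a minimal sketch of that watchdog pattern using only core Ruby. It is not taken from any particular gem, and whether the raise actually interrupts a long-running match depends on the Ruby implementation (hence the caveat above).
```Ruby
# Minimal sketch of the Thread.raise watchdog pattern (illustration only;
# real third-party implementations are considerably more careful).
class RegexpTimeoutError < StandardError; end

def match_with_timeout(re, str, seconds)
  worker   = Thread.current
  watchdog = Thread.new do
    sleep seconds
    worker.raise RegexpTimeoutError, "regexp ran longer than #{seconds}s"
  end
  re.match(str)
ensure
  watchdog.kill
end

# match_with_timeout(/A(B|C+)+D/, "A" + "C" * 40 + "X", 1)
```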

### Prior art

.NET provides a `MatchTimeout` property per: https://docs.microsoft.com/en-us/dotnet/api/system.text.regularexpressions.regex.matchtimeout?view=net-5.0

Java has nothing built in as far as I can tell: https://stackoverflow.com/questions/910740/cancelling-a-long-running-regex-match

Node has nothing built in as far as I can tell: https://stackoverflow.com/questions/38859506/cancel-regex-match-if-timeout


Golang and Rust use RE2, which is not vulnerable to ReDoS because it limits features (available in Ruby via the RE2 gem):

```
irb(main):003:0> r = RE2::Regexp.new('A(B|C+)+D')
=> #<RE2::Regexp /A(B|C+)+D/>
irb(main):004:0> r.match("A" + "C" * 100 + "X")
=> nil
```

### Proposal

Implement `Regexp.timeout`, which allows us to specify a global timeout for all Regexp operations in Ruby.

A per-Regexp setting would require massive application changes; almost all web apps would do just fine with a 1-second Regexp timeout.

If `timeout` is set to `nil`, everything would work as it does today; when set to a number of seconds, a "monitor" thread would track running regexps and time them out according to the global value.
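
A sketch of the intended usage (this API does not exist yet; the error class raised on timeout is not specified here and is only assumed for illustration):
```Ruby
# Proposed usage only; Regexp.timeout does not exist at the time of writing,
# and the error raised on timeout is unspecified (StandardError as a stand-in).
Regexp.timeout = 1.0   # global limit in seconds for all Regexp operations

begin
  /A(B|C+)+D/ =~ "A" + "C" * 100 + "X"
rescue StandardError => e
  warn "regexp timed out: #{e.message}"
end
```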

### Alternatives


I recommend against a "per Regexp" API, as this decision is at the application level. You want to apply it to all regular expressions in all the gems you are consuming.

I recommend against a move to RE2 at the moment, as way too much would break.



### See also:

https://people.cs.vt.edu/davisjam/downloads/publications/Davis-Dissertation-2020.pdf
https://levelup.gitconnected.com/the-regular-expression-denial-of-service-redos-cheat-sheet-a78d0ed7d865




