On 10/2/10, Ammar Ali <ammarabuali / gmail.com> wrote:
> On Sun, Oct 3, 2010 at 1:37 AM, Caleb Clausen <vikkous / gmail.com> wrote:
>> Wow, that's pretty cool, actually. I like it. A couple of criticism
>> (meant as constructive):
>
> Thanks Caleb. Your criticisms and comments were indeed constructive
> and very welcome. Much appreciated.
>
>
>> I'm a little concerned about what will happen with this regexp:
>> /(?:foo){1,10000}/. Currently, Regexp handles this just fine, but with
>> your lib I fear there might be a performance problem.
>
> One problem I saw with large repetitions under 1.8.x is exceeding the
> maximum number of allowed grouped sub-expressions, which is hard set
> at >= 254 (or ((1<<8)-1) to quote the code). So the maximum number of
> grouped repetitions under 1.8 will always be 253. I initially thought
> it was a buffer exhaustion issue, but after checking, it turns out
> there's room for up to 64K under 1.8. In the case of
> /(?:foo){1,10000)/, the group limit was being reached when the regex
> was only at about 1.7K. With 1.9.x there doesn't appear to be a limit,
> or at least I didn't hit it. I stopped trying at 250_000. Yet another
> great reason to move to 1.9.

Seems like performance is limited by the constraints of the underlying
regexp engine, then. In other words, no problem. Good. Maybe I
shouldn't have worried.

> As for performance, I haven't done any testing, but expect expressions
> with greedy parts that require a lot of backtracking (already poorer
> performers with a potential for exponential time) to get worse when
> they get expanded. A benchmark is in order for sure. I would like to
> also gauge how helpful atomic groups (?>) can be in situations like
> this.

I would think that there would actually be no difference in the
pathological backtracking cases, or maybe only a small one because the
expanded regexp will blow out the data cache more quickly.

> I'm not sure about the backslashes. What would you expect from this:
>
>>> re = MetaRegexp.new '\d\\d\\\d\\\\d'
> => /\d\d\\d\\d/
>
> The standard Regexp produces the same results:
>
>>> re = Regexp.new '\d\\d\\\d\\\\d'
> => /\d\d\\d\\d/
>
> Am I missing something? I will double check.

When you create a regexp from a string literal, there are actually two
levels of backslash interpretation. The string literal itself does
backslash interpretation first, then the regexp engine does a second
level. So, your examples above are not working quite the way you seem
to think...

So, the string literal '\d\\d\\\d\\\\d' is equivalent to
"\\d\\d\\\\d\\\\d"... in other words, you're still only testing the
cases of one and two backslashes before something else. A better
string to test with is '\\d\\\\d\\\\\\d\\\\\\\\d'. Doubling all the
backslashes defeats the first level of interpretation.

I just tried your parser out against that string, at it does appear to
be doing the right thing. I didn't inspect your code closely enough
the first time to be confident of it, but after looking at it again, I
think you are handling multiple backslashes correctly.