On Sun, Oct 3, 2010 at 1:37 AM, Caleb Clausen <vikkous / gmail.com> wrote:
> Wow, that's pretty cool, actually. I like it. A couple of criticism
> (meant as constructive):

Thanks Caleb. Your criticisms and comments were indeed constructive
and very welcome. Much appreciated.


> I'm a little concerned about what will happen with this regexp:
> /(?:foo){1,10000}/. Currently, Regexp handles this just fine, but with
> your lib I fear there might be a performance problem.

One problem I saw with large repetitions under 1.8.x is exceeding the
maximum number of allowed grouped sub-expressions, which is hard set
at >= 254 (or ((1<<8)-1) to quote the code). So the maximum number of
grouped repetitions under 1.8 will always be 253. I initially thought
it was a buffer exhaustion issue, but after checking, it turns out
there's room for up to 64K under 1.8. In the case of
/(?:foo){1,10000)/, the group limit was being reached when the regex
was only at about 1.7K. With 1.9.x there doesn't appear to be a limit,
or at least I didn't hit it. I stopped trying at 250_000. Yet another
great reason to move to 1.9.

As for performance, I haven't done any testing, but expect expressions
with greedy parts that require a lot of backtracking (already poorer
performers with a potential for exponential time) to get worse when
they get expanded. A benchmark is in order for sure. I would like to
also gauge how helpful atomic groups (?>) can be in situations like
this.


> Your Regexp parser looks to me to be a little naive. I gave it only
> the briefest of scannings, but it looks like you're not handling
> special characters escaped by [ and ], nor some of the other types of
> ways of using backslashes, like \C-x, for example. I'm not entirely
> clear what happens if there are two or more backslashes in a row. (A
> character is only escaped by backslashes if preceded by an odd number
> of them.)

You are right, the parser is naive and incomplete. It doesn't handle
several cases, including character classes. I didn't even consider
other escape sequences like control/meta characters or even \x hex
values. Thanks for pointing that omission out.

I'm not sure about the backslashes. What would you expect from this:

>> re = MetaRegexp.new '\d\\d\\\d\\\\d'
=> /\d\d\\d\\d/

The standard Regexp produces the same results:

>> re = Regexp.new '\d\\d\\\d\\\\d'
=> /\d\d\\d\\d/

Am I missing something? I will double check.


> I wrote a more complete (at least I hope so) regexp parser (but also
> specialized for my purposes) as part of my lib sequence. See the
> group_anchors method in this file:
> http://github.com/coatl/sequence/blob/master/lib/sequence/stringlike.rb
>
> It would be nice to have a Regexp parser as a standalone library, actually.

Cool. Thanks for sharing that. Even after a quick read I learned a
couple of interesting things. I like the way you use split and the
handling of the escape sequences.

Also, I really like the idea of a standalone regex parser. Hmmm...
very tempting.