Hi 

On Wed, Jul 19, 2006 at 08:49:45PM +0900, Hugh Sasse wrote:
> > I've experienced a rather strange problem with regular-expressions
> > recently. I used a RE to capture some input from a cisco-device, like:
> 
> so you need the brackets....
> > 
>         [...]
> >  4. ...  And a '#' at the end of the prompt #
> 
> Is the prompt at the end of what you are looking for, or does text
> follow the prompt?

Yes, there's nothing behind the #, but it doesn't make difference if you
add a $. And well, in the original case, using a multiline RE, the \A and
\Z sometimes even worsed the problem (I don't have examples for that
anymore).

> > | # "x" * length + "ABC" + "y" * 10 
> > | -> takes forever (if LENGHT is big enough), as you can see here the tag
> > | 'ABC' occurred, but the ending of the second variable part hasn't.
> > 
> > After some experiments I could even simplify the problem:
> > LENGTH = 20000
> > text = "x" * LENGTH + "ABC" + "y" * 10
> > re = /(.*)ABC[y]#/
> > -> takes forever...
> 
> which is similar.

The [y] is non-sense, but it shouldn't cause such a performance loss.

> > I'm not sure if this is a problem of the ruby RE-implementation 
> > or if this a problem of RE in general (Keyword: back-tracking). Some
> 
> Yes, it's backtracking.  Hence my question above.  Try
>   re = /(.*)ABC[y]#$/
> if the prompt is the last thing you are looking for.  That will
> seriously help, because it is saying "If there's no # at the end of
> the string. give up now."
> 
> that [y]# looks odd to me -- do you mean
>   re = /(.*)ABC[y]*#$/

It was meant to be odd... Im not sure, but when there can be only one
possible solution after the tag 'ABC' he shouldn't start the
backtrack... but perhaps I'm simplyfing here something.

The final $ doesn't make a difference:

| ## output of [1] ##
| /(.*)ABC[y]\#$/
| no
| 3.839027
| /(.*)ABC[y]\#/
| no
| 3.796321
| /(.*)ABCy\#$/
| no
| 0.000176
| /(.*)ABCy\#/
| no
| 0.000226

> And you might want to try out *? since you have 2 stars in that case.
> That would make it non-greedy.  And having a character class [] with 
> only one char in it looks odd too.  Maybe you need a non-capturing bracket
> instead  
>   re = /(.*?)ABC(?:y*?)#$/

Doesn't make a difference :(, same program as above:

| /(.*?)ABC(?:y*?)\#$/
| no
| 3.613218

> > tests with Perl showed to me that Perl usually reacts better to such
> > regular-expressions.
> 
> Concrete examples and timings might be of use to the regexp developers,
> but a statement like the above makes it hard to test improvements.
> I don't develop the RE code in Ruby, so you'll need to wait for 
> someone who does to show interest in this detail.

We should pick the best example and then run some benchmarks. The question is
only which regular-expression does best represent this problem? It should be 
simple and all other factors should be removed. That's also the reason why I
tried this [y] RE, it doesn't make sense but it's simpler and it also triggers
this problem.


Regards,
Reto Sch?ttel

1)
LENGTH = 5000

text = "x" * LENGTH + "ABC" + "y" * 10

res = [ /(.*)ABC[y]\#$/,
        /(.*)ABC[y]\#/,
        /(.*)ABCy\#$/,
        /(.*)ABCy\#/ ]

res.each do |re|
  t = Time.now
  puts re.inspect
  puts re =~ text ? "yes" : "no"
  puts Time.now - t
end