On 05.03.2007 08:08, Bill Kelly wrote:
> From: <cppasm / gmail.com>
> On 35, 227ʬ, "Bill Kelly" <b... / cts.com> wrote:
>> > The problem is, the regexp involves nested variable-length matches,
>> > so when no match is found, the engine keeps backtracking and trying
>> > to find other possible matches, across all combinations of
>> > (?:\d{1,3}[,\.]?)+\d{3}.{0,20}
>> > ..of which there are exponential possibilities within a string of
>> > digits...
>> >
>> > Hopefully, it may be possible to rewrite your regexp to fail more
>> > quickly on bad data.If you could tell us more about the exact
>> > specification of the data you're trying to match, people might
>> > be able to help further.
>>
>> Thank you so much!
>> I'm trying to rewrite SpamAssassin in ruby just for fun :-). The
>> regexp mentioned is just a rule in SpamAssassin's rule files.It's used
>> to match against every line of a mail message.In fact I tested using
>> the value and regexp mentioned above in Java(JDK, oro and regexp of
>> Apache),perl,python and ruby.It finished very soon in perl and
>> python,while the same thing happened in Java(whichever library I
>> use,jdk,oro or regexp) as in ruby,so I guess maybe the underlying
>> matchings mechanism between them are quite different?
> 
> Yes, although not so much the underlying engine, but some
> regex matchers are optimized to try to short-circuit certain
> patterns to fail fast.  From what I've read, a lot of code has
> been added to the Perl regexp matcher over the years to make it
> smarter at failing fast on various nested variable-length
> expressions.
> 
> Still, it is often possible to translate an expression into
> something that will fail fast without requiring engine-specific
> optimizations.
> 
> If you HAVE to parse SpamAssassin rule files without modification,
> then, you will likely run into more situations like this where a
> pattern that failed fast due to optimizations in the Perl matcher,
> runs too slowly in a regexp engine without the equivalent shortcut
> optimization.  :-(

Hm, one quick manual optimization is to use

value =~ /\b(?:pics|pictures|images|photos|movies)/i &&
   value =~ 
/\b(?:\d{1,3}[,\.]?)+\d{3}.{0,20}\b(?:pics|pictures|images|photos|movies)/i

i.e. precheck the texts and use short circuit evaluation.  That at least 
reduces the number of patterns dramatically where the exponential 
runtime hits you.

Ah, I found a better one:

value =~ 
/\b\d{1,3}(?:[,.]?\d{3})*.{0,20}\b(?:pics|pictures|images|photos|movies)/i

That one's really fast on your value because it involves much less 
backtracking.

I don't have my "Mastering Regular Expressions" handy but there you can 
also find techniques for "loop unrolling" which should help in this case.

Btw, you don't need to escape the dot in the character class.  It does 
not have special meaning there.

Kind regards

	robert