Ara.T.Howard wrote:

> On Sat, 3 Sep 2005, Nikolai Weibull wrote:

> > Ara.T.Howard wrote:

> > >  - anchor every single regular expression : performance degrades
> > >    exponentially otherwise

> > That's an overstatement.  True, use anchors whenever possible, as
> > they are great hints to the engine and they certainly limit the
> > number of possible matches very often, but it's not like every
> > regular expression without anchors takes O(N^M),

> i don't think it is an overstatement.  i'm not saying every regex w/o
> anchor __is__ O(N^M) - but that they easily __degrade__ to that worst
> case more easily w/o them.  a single misplaced '.*' can do this.  one
> time i was examining a processing machine that uses regexes to match
> keys of items that arrive over the network and are inserted into a
> giant in memory circular queue
> (http://my.unidata.ucar.edu/content/software/ldm/archive/index.html)
> because it's cpu usage had shot up from around 5 to 50% for no
> apparent reason.  the keys were not large (10-100 bytes) and the
> regexes ldm is configured with tend to be pretty simple, things like
> 
>   /F10.*OLS/ /F12.*OLS/
> 
> and these are matched against strings like
> 
>   F10200008100242.OLS

Well, what you said was that "performance degrades exponentially
otherwise".

I don't have your whole dataset, but it looks to me that you could be
using a wild-card matcher instead.  Another good solution would be to
not use Ruby's built-in regular expression matcher, as it's implemented
using backtracking, which, you are correct, has a tendency to go haywire
far too often.  That's the price you pay for having crap like
backreferencing and look-around I'm afraid.  

(Sidenote: I'm releasing a regular expression matcher for Ruby in a near
future that is quite nice.)

Anyway, your advice is sound; the more information you give a
backtracking NFA implementation, the better,
        nikolai

-- 
Nikolai Weibull: now available free of charge at http://bitwi.se/!
Born in Chicago, IL USA; currently residing in Gothenburg, Sweden.
main(){printf(&linux["\021%six\012\0"],(linux)["have"]+"fun"-97);}