There's a whole section in Mastering Regular Expressions that goes into the
differences between regexp engines.

Summary: it makes a big difference how you set up your patterns!
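For what it's worth, here is a rough, untested sketch of that kind of comparison
in Ruby (the text, keywords, and iteration count are made up just to illustrate;
the actual numbers depend on your regexp engine and your data):

require 'benchmark'

# Made-up sample data purely for illustration.
long_text = ("lots of page text " * 5_000) + "ruby and rails appear near the end"
key1, key2 = "ruby", "rails"

# One pattern with the alternation inside it...
combined = Regexp.new("#{key1}.*#{key2}|#{key2}.*#{key1}")
# ...versus two simpler patterns with the 'or' done in Ruby.
r1 = Regexp.new("#{key1}.*#{key2}")
r2 = Regexp.new("#{key2}.*#{key1}")

Benchmark.bm(8) do |b|
  b.report("normal:") { 100.times { long_text =~ combined } }
  b.report("fast:")   { 100.times { long_text =~ r1 or long_text =~ r2 } }
end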

On 11/22/05, Robert Klemme <bob.news / gmx.net> wrote:
>
> Horacio Sanson wrote:
> > I have this little script that takes a list of keyword sets; each set
> > has only two keywords, and for each set the script creates a
> > regular expression like this:
> >
> > Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
> >
> > then I match it against a string that contains the long text fetched
> > from a web page.
> >
> > a more complete pseudo-code
> >
> > #########################################
> > long_text = get_web_page(url)
> >
> > keyword_hash = load_keyword_array_from_database
> >
> > keyword_hash.each_pair { |id, value|
> >
> > key1 = value[0]
> > key2 = value[1]
> >
> > r = Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
> > return id if long_text =~ r
> > }
> >
> > return -1
> > ###########################################
> >
> >
> > Now this code works perfectly; the problem is that the keyword_hash has
> > more than 300 elements, and running this code can take between 50 and
> > 120 seconds per page. Since I am processing more than 1000 pages with
> > this code, it takes forever.
> >
> >
> > I solved this problem by replacing the regular expression match with
> >
> > r1 = Regexp.new("#{key1}\.*#{key2}")
> > r2 = Regexp.new("#{key2}\.*#{key1}")
> >
> > return id if long_text =~ r1 or long_text =~ r2
> >
> >
> > I simply moved the 'or' (alternation) outside the regular expression, and
> > the time per page went from 50~120 seconds down to about 0.40 seconds.
> >
> >
> > Using the Benchmark class and running some tests I got:
> >
> >               user     system      total        real
> > normal:  27.688000   0.015000  27.703000 ( 27.765000)
> > fast:     0.469000   0.000000   0.484000 (  0.954000)
> >
> >
> > The difference in speed is huge.
> >
> > Is this expected when using regular expressions?
>
> One obvious optimization is to create all regexps during
> load_keyword_array_from_database() and not during iteration of the hash.
> That way you only have to do it once and can reuse those regexps across
> the multiple pages you check.
>
> Another possible optimization is to take your approach of splitting the
> regexps a bit further and create two regexps - one for each keyword - and
> return the id if both match. This only works correctly if (i) keywords
> don't overlap or (ii) you can use \b to ensure matching on word
> boundaries.
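For example, roughly (untested fragment; the \b anchors assume the keywords
are whole words):

key1_re = Regexp.new("\\b#{key1}\\b")
key2_re = Regexp.new("\\b#{key2}\\b")

# Two simple scans of the page, one per keyword, instead of
# backtracking through "key1.*key2|key2.*key1".
return id if long_text =~ key1_re and long_text =~ key2_re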
>
> Kind regards
>
> robert
>
>
>
>


--

'There was an owl lived in an oak.
The more he heard, the less he spoke.
The less he spoke, the more he heard.'

Christian Leskowsky
christian.leskowsky / gmail.com
