There's a whole section in Mastering Regular Expressions that goes into the
differences between regexp engines. Summary: how you set up your patterns
makes a big difference!

On 11/22/05, Robert Klemme <bob.news / gmx.net> wrote:
>
> Horacio Sanson wrote:
> > I have this little script that takes a list of keyword sets, each set
> > has only two keywords and for each one of them the script creates a
> > regular expression like this:
> >
> >    Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
> >
> > then I match it to a string that contains a long text fetched from a
> > web page.
> >
> > a more complete pseudo-code
> >
> > #########################################
> > long_text = get_web_page(url)
> >
> > keyword_hash = load_keyword_array_from_database
> >
> > keyword_hash.each_pair { |id, value|
> >
> >    key1 = value[0]
> >    key2 = value[1]
> >
> >    r = Regexp.new("#{key1}\.*#{key2}|#{key2}\.*#{key1}")
> >    return id if long_text =~ r
> > }
> >
> > return -1
> > ###########################################
> >
> >
> > Now this code works perfectly, the problem is that the keyword_hash has
> > more than 300 elements and running this code can take between 50 and
> > 120 seconds. Since I am processing more than 1000 pages with this
> > code it takes forever.
> >
> >
> > I solved this problem by replacing the regular expression match with
> >
> >    r1 = Regexp.new("#{key1}\.*#{key2}")
> >    r2 = Regexp.new("#{key2}\.*#{key1}")
> >
> >    return id if long_text =~ r1 or long_text =~ r2
> >
> >
> > I simply put the or statement outside the regular expression and the
> > speedup was from 50~120sec to 0.40 secs per page.
> >
> >
> > using the Benchmark class and running some tests I got
> >
> > normal:
> >  27.688000   0.015000  27.703000 ( 27.765000)
> > fast:
> >   0.469000   0.000000   0.484000 (  0.954000)
> >
> >
> > the speeds are totally different.
> >
> > Is this expected when using regular expressions??
>
> One obvious optimization is to create all regexps during
> load_keyword_array_from_database() and not during iteration of the hash.
> That way you just have to do it once and can reuse those regexps with
> multiple pages you check.
>
> Another possible optimization is to take your approach of splitting the
> regexps a bit further and create two regexps - one for each keyword - and
> return the id if both match. This only works correctly if (i) keywords
> don't overlap or (ii) you can use \b to ensure matching on word
> boundaries.
>
> Kind regards
>
>     robert
>

--
'There was an owl lived in an oak. The more he heard, the less he spoke.
The less he spoke, the more he heard.'

Christian Leskowsky
christian.leskowsky / gmail.com
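
For what it's worth, here is a rough sketch of Robert's two suggestions
combined: build the regexps once, and use one regexp per keyword with \b
instead of a single alternation. This is only an illustration under some
assumptions: KEYWORDS is a made-up stand-in for whatever
load_keyword_array_from_database returns, find_id is a hypothetical helper
name, and the keywords are assumed to start and end with word characters so
that \b behaves as expected.

```ruby
# Hypothetical stand-in for the keyword data loaded from the database:
# id => [key1, key2]
KEYWORDS = {
  1 => ["ruby", "regexp"],
  2 => ["owl", "oak"],
}

# Build the patterns once, up front, instead of on every page.
# Regexp.escape keeps characters such as "." or "+" in a keyword literal,
# and \b restricts each match to a whole word.
COMPILED = KEYWORDS.map do |id, (key1, key2)|
  [id,
   /\b#{Regexp.escape(key1)}\b/,
   /\b#{Regexp.escape(key2)}\b/]
end

# Returns the id of the first keyword pair whose two keywords both occur
# somewhere in the text (in either order), or -1 if no pair matches.
def find_id(long_text, compiled = COMPILED)
  compiled.each do |id, r1, r2|
    return id if long_text =~ r1 && long_text =~ r2
  end
  -1
end
```

With the sample data above, find_id("an owl sat in an oak") returns 2 and
find_id("nothing here") returns -1. One difference from the original
patterns to keep in mind: without the /m option, "." does not match a
newline, so key1.*key2 only matches when both keywords sit on the same
line, while the two separate regexps match them anywhere in the page.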