Emmanuel <emmanuel / lijasdoblea.com> writes:

> http://swtch.com/~rsc/regexp/regexp1.html
>
> I became a little worried since I'm making heavy use of regexes in a
> program of mine that shortly I'll have to run on each of thousands of
> text files.

I don't know about proposed plans for the regexp engine for ruby, but I would
say not to be overly concerned at this stage. 

From reading that article, one thing I noted was that even the author
acknowledges that regular expression performance is significantly affected by
how the regexps are written. If you keep the basic concepts in mind and
construct your regexps accordingly, you will likely see performance differences
that outweigh the difference between the two approaches outlined in that
article - or, to put it another way, a poorly specified RE will perform badly
regardless of the algorithm used. What matters is doing things like anchoring,
using as precise a specification as possible, and taking advantage of any
knowledge you have about the data you are processing.
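For instance (a hypothetical sketch - the log format and sample data here are
made up), anchoring a pattern and spelling out its fixed parts lets the engine
reject non-matching lines quickly instead of retrying the match at every
position:

```ruby
lines = ["2007-03-01 ERROR disk full",
         "2007-03-01 INFO  all quiet"]

# Loose: unanchored, so the engine attempts a match starting at every
# character position before giving up on a non-matching line.
loose = /ERROR\s+(.*)/

# Tighter: anchored at both ends with the fixed date prefix spelled out,
# so non-matching lines are rejected almost immediately.
tight = /\A\d{4}-\d{2}-\d{2} ERROR (.+)\z/

errors = lines.grep(tight) { |line| line.match(tight)[1] }
# errors => ["disk full"]
```

Both patterns find the error line here; the difference only shows up as less
work per non-matching line once you feed it thousands of files.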

I've not done large (i.e. gigabytes of data) processing with REs under ruby,
but I have done so with Perl and the performance was quite acceptable.

There is no point worrying about optimisation until you know there is a
performance issue. For all you know, using the ruby RE engine for your task may
fall well within your performance requirements. 
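One quick way to find out (a sketch only - the pattern and the sample input
are placeholders for your own) is to time the actual regexp against a
representative sample before deciding anything:

```ruby
require 'benchmark'

# Placeholder input: substitute a representative sample of your real files.
sample = ("some text without the pattern\n" * 50_000) + "needle\n"
pattern = /^needle$/

# Benchmark.realtime returns the elapsed wall-clock seconds for the block.
time = Benchmark.realtime do
  sample.scan(pattern)
end

puts "scan took #{time} seconds"
```

If the measured time is well inside your requirements, the algorithm question
is moot for your task.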

The other way to look at this is to consider what you would use as an
alternative. I've also used a lot of Tcl, which according to that article uses
the faster algorithm, yet I've never really noticed any great performance
difference between Perl and Tcl. So your choice is either to continue, see if
there is a problem and deal with it if/when it occurs, or to jump now and start
writing your program in Tcl, awk or grep (maybe even calling grep from within
ruby, though I suspect any performance gain would be lost in passing the data
between ruby and the grep process).

I've seen many people choose a technology because they have read somewhere that
x is faster than y. Often, I've then seen something created which is flaky,
takes 10x as long to develop or simply doesn't work, when in reality the aspect
they were concerned about wasn't even relevant to their situation. Recently, I
had an argument with one of our sys admins who wanted to use ReiserFS rather
than Ext3 as the file system on a new server. His argument was that ReiserFS
had better performance characteristics and the system would perform better. My
argument was that file system performance was not a significant bottleneck for
the server and we would be better off sticking with a file system that had a
better track record, more robust tools and represented a technology more sys
admins were familiar with. I lost the argument, initially. The server was
configured with ReiserFS and after a few months, we came in one morning to find
massive disk corruption problems. The server was changed to Ext3 and has not
missed a beat since. More to the point, the performance using Ext3 is still
well within acceptable performance metrics. My point isn't that ReiserFS may
not be a good file system - it probably is, and is possibly even "better" than
Ext3. My point is that speed is not the only issue to consider.

Something which the article doesn't address is complexity and correctness. The
algorithm used by Perl et al. may not be fast compared to the alternative, but
it is relatively simple. Just as important as performance is correctness. An
extremely fast RE engine is not a better solution if it is only correct 95% of
the time, or is so complex that it is difficult to maintain without bugs
creeping in after version updates etc.

Tim

-- 
tcross (at) rapttech dot com dot au