Emmanuel <emmanuel / lijasdoblea.com> writes:

> http://swtch.com/~rsc/regexp/regexp1.html
>
> I became a little worried since i'm making heavy use of regexes in a
> program of mine that shortly i'll have to run on each of thousands of
> text files.

I don't know about proposed plans for the regexp engine for Ruby, but I would say not to be overly concerned at this stage. From reading that article, one thing I noted was that even the author acknowledges that the performance of regular expressions is significantly affected by how the regexps are defined. If you keep the basic concepts in mind and write your regexps accordingly, you will likely get performance differences that outweigh the differences between the two approaches outlined in that article. Putting it another way, a poorly specified RE will perform badly regardless of the algorithm used. What is important is to do things like anchoring, making the specification as precise as possible and taking advantage of any knowledge you have about the data you are processing.

I've not done large-scale (i.e. gigabytes of data) processing with REs under Ruby, but I have done so with Perl and the performance was quite acceptable. There is no point worrying about optimisation until you know there is a performance issue. For all you know, using the Ruby RE engine for your task may fall well within your performance requirements.

The other way to look at this is to consider what you would use as an alternative. I've also used a lot of Tcl and, according to that article, Tcl uses the faster algorithm, yet I've never really noticed any great performance difference between Perl and Tcl. So your choice is either to continue, see if there is a problem and deal with it if and when it occurs, or to jump now and start writing your program in Tcl or awk, or using grep (maybe even calling grep from within Ruby, but I suspect all performance gains would be lost in passing the data between Ruby and the grep process).

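As a rough sketch of what I mean by anchoring and by measuring before worrying, something like the following could be used. The sample lines, the two patterns and the count_matches helper are made up for illustration and are not taken from your program; the only library used is Ruby's standard Benchmark module.

```ruby
require 'benchmark'

# Hypothetical sample data: log-like lines, one in ten starting with "ERROR".
LINES = Array.new(20_000) do |i|
  (i % 10).zero? ? "ERROR #{i}: disk full" : "INFO #{i}: " + "x" * 200
end

# Loosely specified: unanchored, with a leading .* that matches anywhere.
LOOSE = /.*ERROR \d+:/

# More precise: we know the marker is at the start of the line, so anchor it.
ANCHORED = /\AERROR \d+:/

# Count how many lines the given pattern matches.
def count_matches(lines, pattern)
  lines.count { |line| line.match?(pattern) }
end

# Time both patterns over the same data. On this sample they select the
# same lines, but in general LOOSE also accepts "ERROR" in mid-line.
Benchmark.bm(10) do |bm|
  bm.report('loose:')    { count_matches(LINES, LOOSE) }
  bm.report('anchored:') { count_matches(LINES, ANCHORED) }
end
```

Run something like this against a sample of your real files before deciding anything. The numbers depend on the data, the patterns and the Ruby version, so treat the output as a measurement of your own case rather than a general result.
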
I've seen many people choose a technology because they read somewhere that x is faster than y. Often, I've then seen something created which is flaky, takes 10x as long to develop or simply doesn't work, when in reality the aspect they were concerned about wasn't even relevant to the situation they were working in.

Recently, I had an argument with one of our sys admins who wanted to use ReiserFS rather than Ext3 as the file system on a new server. His argument was that ReiserFS had better performance characteristics and the system would perform better. My argument was that file system performance was not a significant bottleneck for the server and we would be better off sticking with a file system that had a better track record, more robust tools and represented a technology more sys admins were familiar with. I lost the argument, initially. The server was configured with ReiserFS and after a few months, we came in one morning to find massive disk corruption problems. The server was changed to Ext3 and has not missed a beat since. More to the point, the performance using Ext3 is still well within acceptable limits.

My point isn't that ReiserFS is a bad file system - it is probably a good one and possibly even "better" than Ext3. My point is that speed is not the only issue to consider. Something which the article doesn't address is complexity and correctness. The algorithm used by Perl et al. may not be fast compared to the alternative, but it is relatively simple. Correctness is as important as performance. An extremely fast RE engine is not a better solution if it is only correct 95% of the time, or is so complex that it is difficult to maintain without bugs creeping in after version updates etc.

Tim

--
tcross (at) rapttech dot com dot au