gotoken / math.sci.hokudai.ac.jp (GOTO Kentaro) wrote:
Forgot to make a comment about one point.
[...]
>Ruby's regexp is extended rather than the theoretical one.
[...]
For those who don't understand what he meant, the theory
of regular expressions started with a theory of finite
state engines. The finite state engine approach (used in
grep for instance) gives limited functionality but also
can determine whether or not there is a match very
efficiently.
Most modern "regular expression" engines actually are not
regular. Rather than tracing a finite state engine through
the string, they recursively attempt to match the pattern
to the string. Any lanugage (like Ruby) which is able to
find backreferences is using recursion internally.
However as soon as you add backreferences to regular
expressions, determining whether or not there is a match
is an NP-complete problem. Which means that the engine must
have disasterous failures. (Note that Perl and Ruby promise
not just a match, but a specific one. Which is the
difference between .* and .*? - as far as I know this is
actually an NP-hard problem.) An example of a disasterous
failure is the following:
/^(\s*yada\s*)+$/
If you have a long set of yada's which breaks off in the
middle of a word, this will not match but Ruby will not be
able to figure this out in a reasonable time. For a full
explanation of why not, how to spot problems, and what to
do about them I strongly recommend Jeffrey Friedl's book.
Which brings up two points.
First of all I think it would be very good if there was a
mode for Ruby's Regexp class where when it tried to do a
match it would not stop at the first match, but rather try
to find all matches and then return the first. The only
purpose for this is a pragma to turn on for testing so that
if you have a disasterous RE you will find it out easily.
Secondly the above disaster does not actually have any
backreferences in it. A long time ago I figured out how to
(in theory) optimize a recursive RE engine so that it could
not fall into disaster unless there actually are
backreferences. That is one of those good ideas that I am
unlikely to do anything about for a few years at least.
But if anyone is interested, you can see the discussion at
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2000-05/msg00853.html
The idea is that you create new states in the RE engine which
are mapped to ordered sets of states in the original. (The
order is the order that you would have entered those states
on the way to failing to match.) Except for that twist this
is the classic subset algorithm for turning an NFA into a DFA
but that twist turns it from an NFA back into an NFA which
happens to be more efficient to process.
This single optimization should in theory cover a large number
of special case optimizations, but to the best of my knowledge
has not actually been tried.
Anyways if anyone feels motivated, go for it!
Cheers,
Ben
_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com