On Wed, May 23, 2007 at 01:00:04AM +0900, Hans Fugal wrote:
> Well that works for \w+ and \s+, but what if you want to match /01+0/? 
> You'd get a syntax error on 0111 even though it's a valid partial match.

OK, I see the problem - it's not detecting the end of the expression, it's
saying that this expression *might* match but only if the right characters
were appended to the end of the source.

In the general case I think you'd have to turn each RE into one which
matches all possible prefixes, perhaps something like

  /(0(1+(0)?)?)/   # (note *)

However, if you can guarantee that no individual valid token is going to be
longer than a certain size (let's say 200 characters) then it would be
simpler to ensure that you read-ahead at least 200 characters into a buffer
and then match against that.
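
A rough Ruby sketch of that read-ahead idea. The names MAX_TOKEN and
each_token are made up for illustration, the 200 is just the figure from
above, and token_re is assumed to be anchored with \A. Note that io.read
blocks until it has the requested amount or hits EOF, so an interactive
stream would need something else here.

```ruby
require 'stringio'

MAX_TOKEN = 200  # assumed upper bound on the length of any single token

# Yields each token read from io. Before every match the buffer is topped
# up to at least MAX_TOKEN characters (or to EOF), so a token that fits
# within MAX_TOKEN is never cut short by the end of the buffer.
def each_token(io, token_re)
  buf = String.new
  loop do
    # io.read(n) returns nil at EOF, which ends the top-up loop.
    while buf.length < MAX_TOKEN && (chunk = io.read(MAX_TOKEN))
      buf << chunk
    end
    break if buf.empty?

    m = token_re.match(buf)
    # An empty match would never consume input, so treat it as an error too.
    raise "syntax error near #{buf[0, 20].inspect}" if m.nil? || m[0].empty?

    yield m[0]
    buf = m.post_match
  end
end

tokens = []
each_token(StringIO.new("0110 010 01110"), /\A(?:01+0|\s+)/) { |t| tokens << t }
```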

Alternatively: perhaps only a few of your token REs have unlimited variable
length. Those you can code in the prefix form like that shown above. The
remainder (of fixed or limited length) can just be matched in the simple way
against a large enough read-ahead buffer.

Regards,

Brian.

(*) Hmm, this isn't quite right, since it also matches the leading 01111 of
011112, which can never become a valid token. You could check for a partial
match (i.e. $3 == nil) and allow it only if it consumes the whole string.
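
Something like this, as a sketch only. PREFIX_RE and classify are names
I've invented here, and I've added a \A anchor on the assumption that the
token has to start at the beginning of the buffer.

```ruby
# Prefix form of /01+0/: group 3 is the closing 0, and is nil when the
# match stopped before reaching it.
PREFIX_RE = /\A(0(1+(0)?)?)/

# Classifies the text at the start of buf as :complete (a whole 01+0 token),
# :partial (a prefix that more input could still complete) or :error.
def classify(buf)
  m = PREFIX_RE.match(buf)
  return :error unless m
  return :complete if m[3]                    # closing 0 was seen
  return :partial if m.end(0) == buf.length   # ran out of input mid-token
  :error                                      # stopped early on a bad character
end
```

With this, 0111 is a partial match because the match reaches the end of the
string, while 011112 is rejected because the match stops before the 2.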

Alternatively, the RE itself needs to say "must be followed by X or end of
string". This works, but it's a bit ugly:

  /(0(\z|1+(\z|0)))/

I can't think of a better formulation off the top of my head though.
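
For what it's worth, a quick sketch to try that second form out. The names
PREFIX_OR_EOS and viable? are invented for the example, and the \A anchor is
my addition, assuming the token must start at the beginning of the buffer.

```ruby
# Matches either a complete 01+0 token, or a prefix of one that runs all the
# way to the end of the string (\z).
PREFIX_OR_EOS = /\A(0(\z|1+(\z|0)))/

# True if buf starts with a complete token, or is a prefix of one that more
# input could still complete.
def viable?(buf)
  !PREFIX_OR_EOS.match(buf).nil?
end
```

Here viable?("0111") is true and viable?("011112") is false. To tell a
complete token from a partial one you would still have to look at the third
group, which is "0" only when the closing 0 was actually matched.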