On Fri, May 25, 2007 at 11:25:09PM +0900, Hans Fugal wrote:
>     # productions
>     expr: term              {. expr0  = term  .}
>         (   '+'  term       {. expr0 += term3 .}
>           | '-'  term       {. expr0 -= term4 .}
>         )*
>         ;
> 
>     term: fact              {. term0  = fact1 .}
>         (   '*' fact        {. term0 *= fact2 .}
>           | '/' fact        {. term0 /= fact3 .}
>         )*
>         ;
> 
>     fact: ['+'] const       {. fact0 =  const1.to_f .}
>         |  '-'  const       {. fact0 = -const2.to_f .}
>         |  '(' expr ')'     {. fact0 =  expr1 .}
>         ;
> 
>     # terminals
>     const: /\d+[\.\d+]/ = '0';

I imagine it should be:

  const: /\d+(\.\d+)?/
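
To spell out the difference: the square brackets in the quoted pattern form a character class, so [\.\d+] matches exactly one character that is a dot, a digit or a literal plus sign. A rough Ruby check follows. The \A and \z anchors, the constant names and the matches? helper are my own additions for illustration, not part of your grammar:

```ruby
# Anchored copies of the two patterns so that whole strings are compared.
# ORIGINAL, CORRECTED and matches? are hypothetical names for this sketch.
ORIGINAL  = /\A\d+[\.\d+]\z/    # character class: one of ".", a digit, or "+"
CORRECTED = /\A\d+(\.\d+)?\z/   # digits, then an optional fractional part

# True if the whole of +text+ matches +pattern+.
def matches?(pattern, text)
  !pattern.match(text).nil?
end

%w[7 42 3.14 1+ 3.].each do |text|
  puts format('%-5s original=%-5s corrected=%s',
              text, matches?(ORIGINAL, text), matches?(CORRECTED, text))
end
```

The original form rejects "7" and "3.14" but accepts "1+", which is presumably not what was intended.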

> It's easy 
> enough if you have all of the input, or "a lot" which is reasonably 
> expected to be longer than any token, or if you can count on tokens not 
> crossing a guard (such as a newline), but in general you need to do 
> partial matching.

All your tokens above are one character long, so a one-character lookahead
is fine for everything apart from 'const'.

If you read your input in 4096 byte blocks, reading a new block when your
buffer is less than 4096 bytes full, then you'll have somewhere between 4K
and 8K of lookahead. You'd do that for efficiency anyway, I'd hope.
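
Something along these lines is what I have in mind. It is only a sketch: BLOCK_SIZE and refill are names I made up, and it assumes the lexer removes consumed bytes from the front of the buffer itself.

```ruby
require 'stringio'

BLOCK_SIZE = 4096   # assumed block size, as in the text above

# Hypothetical helper: if fewer than BLOCK_SIZE bytes are buffered, append one
# more block from +io+. After a full read the buffer holds between 4K and 8K.
def refill(buffer, io)
  if buffer.bytesize < BLOCK_SIZE
    chunk = io.read(BLOCK_SIZE)   # returns nil at end of input
    buffer << chunk if chunk
  end
  buffer
end
```

The lexer would call refill(buffer, io) before trying to match each token, and slice the matched text off the front of the buffer afterwards.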

So this leaves the case of any terminal regexps which might be required to
match more than 4K of data as a single token. If you're parsing a language
like that, then I'd agree that having partial matching makes your code a bit
simpler. But otherwise, you can write your regexp to match partially:

  const: /\d+(\.(\d+)?)?/

and if the match runs right up to the end of the buffered input, treat it as
possibly partial: read more input and match again.
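
Roughly, in Ruby, and under my own assumptions: STRICT, PARTIAL and next_const are hypothetical names, the \A anchors are added so matching starts at the front of the buffer, and the final token is cut with the strict pattern so that a trailing "." is not swallowed.

```ruby
require 'stringio'

STRICT  = /\A\d+(\.\d+)?/       # the terminal as it should finally match
PARTIAL = /\A\d+(\.(\d+)?)?/    # also accepts a prefix such as "3."

# Hypothetical helper: removes and returns the const token at the front of
# +buffer+, or returns nil if the buffer does not start with one. While the
# partial match reaches the end of the buffer, more input is read from +io+.
def next_const(buffer, io, block_size = 4096)
  eof = false
  loop do
    partial = PARTIAL.match(buffer)
    matched = partial ? partial[0].bytesize : 0
    # The token cannot grow if the match stopped early or the input is exhausted.
    if eof || matched < buffer.bytesize
      strict = STRICT.match(buffer)
      return nil unless strict
      return buffer.slice!(0, strict[0].bytesize)
    end
    chunk = io.read(block_size)   # returns nil at end of input
    if chunk
      buffer << chunk
    else
      eof = true
    end
  end
end
```

With a deliberately tiny block size, next_const(String.new, StringIO.new("3.14+2"), 2) has to read three blocks before it can be sure that the token is "3.14".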

Note: the worst case is that the entire file consists of a single token - in
which case, you *will* end up reading the whole file into memory anyway.

Regards,

Brian.