Brian Candler wrote:
> On Fri, May 25, 2007 at 11:25:09PM +0900, Hans Fugal wrote:
>>     # productions
>>     expr: term              {. expr0  = term  .}
>>         (   '+'  term       {. expr0 += term3 .}
>>           | '-'  term       {. expr0 -= term4 .}
>>         )*
>>         ;
>>
>>     term: fact              {. term0  = fact1 .}
>>         (   '*' fact        {. term0 *= fact2 .}
>>           | '/' fact        {. term0 /= fact3 .}
>>         )*
>>         ;
>>
>>     fact: ['+'] const       {. fact0 =  const1.to_f .}
>>         |  '-'  const       {. fact0 = -const2.to_f .}
>>         |  '(' expr ')'     {. fact0 =  expr1 .}
>>         ;
>>
>>     # terminals
>>     const: /\d+[\.\d+]/ = '0';
> 
> I imagine it should be
> const: /\d+(\.\d+)?/

Yes, of course. Sorry.

> 
>> It's easy 
>> enough if you have all of the input, or "a lot" which is reasonably 
>> expected to be longer than any token, or if you can count on tokens not 
>> crossing a guard (such as a newline), but in general you need to do 
>> partial matching.
> 
> All your tokens above are one character, so a one character lookahead is
> fine, apart from 'const'

My example is not meant to be pathological. It's not hard to imagine a 
set of terminal regexes where you need much more than one-character 
lookahead.
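For instance (a hypothetical pair of terminals, not the grammar above): with INT as /\d+/ and FLOAT as /\d+\.\d+/, a buffer ending in "12." can't be tokenized yet, because the decision depends on what comes after the dot:

```ruby
buf = "12."
# If the next character is a digit, this is the start of a FLOAT;
# otherwise it's an INT followed by a '.' operator. One character of
# lookahead past the digits isn't enough to decide.
buf.match(/\A\d+\.\d+/)   # no FLOAT match yet
buf.match(/\A\d+/)[0]     # INT matches "12", but committing may be wrong
```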

> 
> If you read your input in 4096 byte blocks, reading a new block when your
> buffer is less than 4096 bytes full, then you'll have somewhere between 4K
> and 8K of lookahead. You'd do that for efficiency anyway, I'd hope.

When was the last time you typed 4K faster than the computer could 
process it? The biggest reason for stream parsing is when there's a 
human on the other end. In practice, you can usually get away with 
assuming a newline never occurs inside a token, because stdin won't 
deliver anything until the user presses enter. But that's not a 
guarantee, and there are other situations (like a network stream) 
where it may not hold.
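For what it's worth, the newline-guard case is easy to sketch (a hypothetical toy lexer for the calculator's terminals, not generated code; StringIO stands in for an interactive stream):

```ruby
require 'stringio'

# Sketch, assuming no terminal can ever span a newline: each line
# pulled off the stream is then a safe unit to lex completely.
input = StringIO.new("1+2\n3*4\n")
tokens = []
while line = input.gets
  tokens.concat(line.scan(/\d+(?:\.\d+)?|[-+*\/()]/))
end
# tokens is now ["1", "+", "2", "3", "*", "4"]
```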

> 
> So this leaves the case of any terminal regexps which might be required to
> match more than 4K of data as a single token. If you're parsing a language
> like that, then I'd agree that having partial matching makes your code a bit
> simpler. But otherwise, you can write your regexp to match partially:
> 
>   const: /\d+(\.(\d+)?)?/
> 
> and if a partial match is detected, keep eating more input as necessary.
> 

Easy enough in a hand-written lexer, but I wouldn't want to ask users of 
a parser generator to supply partial-match regexes themselves. And 
automatically analyzing a regex to derive its partial-match counterpart 
is not an easy task.
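For the record, here is roughly what Brian's approach looks like hand-written (a sketch with a hypothetical next_const helper; it reads one character at a time for clarity, where a real lexer would buffer):

```ruby
require 'stringio'

# Brian's partial-match const terminal: it matches every prefix of a
# complete constant, so a match that reaches the end of the buffer
# means the token might still grow.
CONST = /\A\d+(\.(\d+)?)?/

# Hypothetical helper: pull one const token off an IO-like stream.
def next_const(io)
  buf = io.read(1) or return nil
  loop do
    m = CONST.match(buf)
    # Only keep reading while the match runs all the way to the end.
    break unless m && m.end(0) == buf.size
    c = io.read(1) or break
    buf << c
  end
  m = CONST.match(buf) or return nil
  [m[0], buf[m.end(0)..-1]]   # token text, plus any overread input
end

tok, rest = next_const(StringIO.new("3.14+2"))
# tok is "3.14"; rest is the overread "+" (the "2" is still unread)
```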