On Thu, 14 Jun 2001, Dave Thomas wrote:

> Robert Feldt <feldt / ce.chalmers.se> writes:
>
> > Any comments or ideas? Which solution would you prefer if you'd get to
> > choose?
>
> I know next to nothing about parsing and lexing, but it seems to me
> that you'd spend an inordinate amount of effort trying to produce a
> lexer that could deal with the kind of things that people dream up for
> tokens. Identifying regular expressions is a particularly hairy case
>
Yes, sounds reasonable.

> that comes to mind: /[[]/ and friends are all special cases.
>
I'm not sure we're talking about the same thing here but wouldn't

  /\/((\\\/)|[^\/])*\/[iomx]*/

cut it? That's what I use to find regexps in rockit grammars so I hope I'm
not too far off the mark...

> So, I'd be in favor of providing simple hooks for adding my own code
> to the lexer. If this isn't language independent, then have a
> provision to add lexer chunks in multiple languages: you can get it
> all working in Ruby, then when you want to produce your C-based
> parser, give a command line option and it will choose your C-based
> lexer chunk.
>
Thanks for your opinion; this is close to what I feel is the right
thing. I'm thinking something like:

  Grammar Ruby
    Tokenizers (Ruby)  # First one is default. No name needed if only one.
      ...
    Tokenizers (C)
      ...
    Tokens
      ...

And it's a good thing not to spend too much time on this issue since
people will not very often work with non-regular languages. I'm glad I
asked for your opinion.

> In the code example you showed for this (S2), you had a dedicated call
> per symbol. I assume this means that these chunks must be called often
> on a trial and error basis, looking for a match. You might be able to
> make this more efficient by (a) providing some context as part of the call
> and/or (b) allowing the lexing chunk to return the type of symbol
> found:
>
Yes, the API needs more thinking.
However, the penalty might not be as high as you'd think since what tokens
might match can be inferred from the parsing context (the current
production/rule being applied). Often this will limit the number of tokens
that can match. It's a good thing though to encourage that only the minimum
is taken care of by a tokenizer. If we know that there must be a leading %
we can generate faster lexers.

>   Tokenizers
>     def delimited_string
>       s, cp = @string, @current_position
>       type = case s[cp]
>              when 'q' then DelimQString
>              when 'Q' then DelimIString
>              when 'x' then DelimXString
>              when 'r' then Regexp
>              else return nil
>              end
>       # .. more stuff
>       return type, end_pos, position_of_next_unconsumed_char
>     end
>
Yes, that's better than my example. Thanks.

Thanks,

Robert
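
For illustration only, here is a minimal runnable sketch of the two ideas discussed above: the tokenizer chunk returns the type of symbol it found, and the caller may pass in the set of token types the current production can accept. The method signature, the `expected` parameter, the symbol names and the `[type, end_pos, next_pos]` return shape are all assumptions made for this example, not rockit's actual API. Nested delimiters are deliberately not handled.

```ruby
# Hypothetical sketch; none of these names come from rockit itself.
DELIM_TYPES = {
  "q" => :DelimQString,
  "Q" => :DelimIString,
  "x" => :DelimXString,
  "r" => :Regexp
}.freeze

# Bracket-style openers and their closers; any other opener closes itself.
CLOSERS = { "(" => ")", "[" => "]", "{" => "}", "<" => ">" }.freeze

# Tries to lex a %-literal starting at pos.
# expected: token types the parser can accept here (nil means "any").
# Returns [type, end_pos, next_pos], or nil when nothing matches.
def delimited_string(string, pos, expected = nil)
  return nil unless string[pos] == "%"      # cheap rejection on the leading %
  type = DELIM_TYPES[string[pos + 1]]
  return nil if type.nil?
  return nil if expected && !expected.include?(type)

  opener = string[pos + 2]
  return nil if opener.nil?
  closer = CLOSERS.fetch(opener, opener)

  i = pos + 3
  while i < string.length
    case string[i]
    when "\\" then i += 1                   # skip the escaped character
    when closer then return [type, i, i + 1]
    end
    i += 1
  end
  nil                                       # unterminated literal
end
```

A call such as `delimited_string("%q(abc) rest", 0)` yields the type together with the position of the closing delimiter and of the next unconsumed character, while passing `[:Regexp]` as `expected` makes the same call return `nil` without scanning the body.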