(This is a bit long...)
Hi all,
I'm contemplating different ways for how to extend rockit to handle
non-regular features in the languages to be tokenized/lexed/scanned and
parsed. This is needed to tokenize things that cannot be expressed with
regular expressions. An example (referred to as E1 below) from the Ruby
grammar is parentheses-delimited nested strings such as
%q(a (nested) string)
which cannot be expressed in a RE since there is no way to count and
balance the parentheses. Other examples of non-regular languages are
E2, Hollerith strings in Fortran: nHa1,a2,...,an where leading number
specifies the number of numbers to follow after the H.
I'd appreciate your input on possible solutions to the problem (or
alternative solutions you might have):
S1. Do nothing and let people handle this by writing their own lexer
You can plug your own lexers into rockit today so this is the current
approach. However, I really think there is added value in having a
formalism and support for using it. The main reason for using hand-written
lexers are speed but there are actually lexer generators that generate
code that is better or very close to the performance of hand-coded
lexers. And if you really crave control why aren't you hand-coding your
regexps today, eh? And why are you interested in using a parser
generator? (Well maybe you aren't; sorry for this mail... ;-))
S2. "Hooks" for hand-coded tokenizers
Allow an easy way for people to plug in their custom tokenizers into
the existing regular expression (RE) based lexing framework.
An example for E1 above would be
Grammar Ruby
Tokenizers
def delimited_string(single)
s, cp = @string, @current_position
return nil unless s[cp,1] == "%" and s[cp+2,0] =~ /(\(|\[|{|<)/
return nil unless s[cp+1,1] == (single ? 'q' : 'Q')
lp = $1
rp = {"(" => ")", "[" => "]", "{" => "}", "<" => ">"}[lp]
# ... continue here to find the position of the balancing rp
# Return lexeme and position of next unmatched char
return s[(cp+3)...balancing_rp_pos], balancing_rp_pos+1
end
Tokens
Blank = /\s+/ :skip:
...
DelimQString = delimited_qstring(true)
DelimIString = delimited_qstring(false)
Productions
Program -> ...
...
Literal -> String
| ...
String -> DelimQString
| DelimIString
| ...
Pro: Power of a full programming language.
People already know Ruby so no need to learn new language.
No need to implement more advanced lexer generator.
Con: Language-dependent => cannot generate C version for speed etc
I like this one though I think its a bit ugly. But its simple and
powerful.
S3. Do nothing and let people handle this post parsing
For E2 this would mean using a regexp like /\d+H(\d+(,\d+)*)?/
and then checking after the parse if the string is really valid.
Pro: Simple
Con: Why continue with parse-related activities after parsing finished?
Cannot handle all non-regular features for example not E1 above?
S4. Extend regexp that can be used in rockit grammars so that non-regular
things can be expressed
Add ways to count the number of chars from a char class that has been
matched so far.
Something like, for E1,
/%q(.)(?while count(\1) != count(balancing(\1)))\1/
where count(c) is the number of c chars seen during this match,
balancing is
def balancing(paren)
{"(" => ")", "[" => "]", "{" => "}", "<" => ">"}[paren]
end
and the (?while condition) construct means "consume characters while
the condition is true".
For E2 this would be
/(\d+)H(\d+(,\d+)(?times $1.to_i-1))/
where the re(?times cnt) construct means "match the re regexp cnt
times".
Pro: Blends nicely into the existing regexp framework?
Con: What if more constructs than while, times and count is needed? Are
they powerful enough?
Unclear how to implement on top of existing framework. Probably
need to roll our own regexp engine with new stuff added => takes time.
S5. ANTLR-style lexing
In ANTLR you use the same way to describe the lexing as you do the
parsing, ie. you have the power of CFG and lookahead etc
Balancing strings can be lexed with something like
Balanced : '(' (Balanced | /[^)]/)* ')'
Pro: Only needs to learn one formalism. Generation procedures similar =>
less code to maintain.
Con: New stuff to look into and learn => takes time
Maybe its a valid question to ask why I'm developing rockit at all;
maybe I should have focused on RAntlr (!) in the spirit of RBison...
The obvious reason is that I don't know Antlr and LL-parsing but
know some things about regexps and LR-parsing. And the Antlr code is
large. But maybe its all about NIH... ;-)
S6. Combining lexing and parsing with merging AST building
We can use something similar to S5 to solve E1 by combining lexing and
parsing (ie. specifying delimited strings in the productions instead of as
tokens). If we add a way to specify that lexemes of matched elements
should be merged we can build the proper AST. Something like
ParenDelimQString -> '%q' Balanced [QString: _,lexeme]
Balanced -> '(' InnerBalanced* ')' [^: +,+,+]
InnerBalanced -> Balanced [^]
| /[^)]+/ [^]
where ^ means "lift this thing up one level in the AST" and + means
"merge the lexeme with the lexemes of the other +-marked elements".
Pro: Blends very nicely into the existing framework
Con: Cannot handle E2
Performance
We need one Balanced production for each paren that can be a
delimiter ('(', '[', '{' and '<') => much for specifying little
Maybe we can solve the last con by allowing back-references in
productions? But we'll also need parameterized productions, like
Balanced(p, bp) -> p InnerBalanced(p, bp)* bp
etc. Might be cool...
Any comments or ideas? Which solution would you prefer if you'd get to
choose?
Regards,
Robert