
On Feb 7, 2008 8:08 AM, Ruby Quiz <james / grayproductions.net> wrote:

> We saw a large variety of solutions for this week's problem.  Many of them
> used
> a parser generator to construct their parser.  You do that by defining a
> grammar
> that describes the syntax you need to read.  The parser generator then
> translates your grammar into parsing code that will match the described
> syntax.
> Of the generators used, Treetop was definitely the most popular and is
> surely
> worth a look if you want to do some grammar based parsing.


But not nearly the fastest...  My very old Grammar class already generates
parsers 5-10X faster than Treetop, and my next release will give another
5-10X.  Using ruby2cext will give yet another 5-10X (it generates code that
optimizes well with ruby2cext).  Only a pure-C parser would beat it.  I've
had this stuff sitting around locally for too long and need to release it.

... i'll get off my (biased) soapbox now ...

<snip>

> Eric Mahurin sent in a hand-rolled solution that has quite a few advantages.
> First, it is trivial to adapt so that the entire content to be parsed
> doesn't
> need to be read into memory.  It's also some very efficient code.


Computationally, I believe this is the simplest (most efficient) type -
LL(1).  It parses "L"eft-to-right and derives the "L"eftmost part of the
grammar using (1) token/character of lookahead.  When the token is a single
character, it becomes very easy to parse from a file (you just need a
variable to hold the next character in the file, previously read with
IO.getc).
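
To illustrate (a hypothetical sketch of the technique, not the actual quiz
solution), here is an LL(1) parser for nested lists like "(a,(b,c))" that
streams from an IO, holding only the single lookahead character from getc:

```ruby
require 'stringio'

# LL(1) sketch: one character of lookahead, read ahead with IO#getc.
class LL1Parser
  def initialize(io)
    @io = io
    @la = @io.getc   # the single lookahead character
  end

  # consume the current lookahead and refill it from the stream
  def advance
    ch = @la
    @la = @io.getc
    ch
  end

  def expect(ch)
    raise "expected #{ch.inspect}, got #{@la.inspect}" unless @la == ch
    advance
  end

  # list : '(' item (',' item)* ')'
  def list
    expect('(')
    items = [item]
    while @la == ','
      advance
      items << item
    end
    expect(')')
    items
  end

  # item : list | single letter -- one lookahead char decides which
  def item
    @la == '(' ? list : advance
  end
end

p LL1Parser.new(StringIO.new("(a,(b,c))")).list  # => ["a", ["b", "c"]]
```

Note the whole input never needs to be in memory; only @la is held.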


> Unfortunately, that makes it a touch less obvious if you aren't already
> familiar
> with parsing techniques.


Yep.  When rolling by hand, Regexp makes it a bit easier to write recursive
descent parsers, since you can easily match more than just a character at a
time.  If no Regexp has unlimited match length, you'll have an LL(k) parser,
where k is the maximum match length.  You could adapt this to read from a
file by keeping a string buffer of the next (k) characters in the file.  If
you have a Regexp that can have unlimited match length, I think you might
call that an LL(*) parser.  You'll have more problems parsing from a file in
that case, unless perhaps you can keep all matches contained within a line
(IO.gets would act as a mini-parser looking for the newline).
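
For example (a hypothetical sketch, all names mine), keeping a buffer of at
least k characters is enough to try any anchored Regexp whose matches never
exceed k characters, without reading the whole file into memory:

```ruby
require 'stringio'

# LL(k) buffering sketch: refill a k-character buffer from the stream,
# then match bounded-length Regexps against it.
class BufferedScanner
  def initialize(io, k)
    @io, @k, @buf = io, k, ""
  end

  # top the buffer back up to k characters (or until EOF)
  def fill
    while @buf.size < @k && !@io.eof?
      @buf << (@io.read(@k - @buf.size) || "")
    end
  end

  # try an anchored match (re must never match more than k chars);
  # consume and return the matched text, or nil on failure
  def scan(re)
    fill
    if m = /\A#{re}/.match(@buf)
      @buf = m.post_match
      m[0]
    end
  end
end

s = BufferedScanner.new(StringIO.new("foo123bar"), 4)
s.scan(/[a-z]{3}/)  # => "foo"
s.scan(/\d{1,3}/)   # => "123"
```

An unbounded pattern like /\d+/ would break the k-character guarantee,
which is exactly the LL(*) complication described above.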

Look here if you want to see more about LL parsers:

http://en.wikipedia.org/wiki/LL_parser


> The AST definition also merits a bit of discussion.  It would have been
> nice to
> just have each method return the Ruby objects for the JSON it parsed.
>  However,
> false and nil (null in JSON) are legal JSON parses in some places.  If I
> return
> those, I won't be able to tell if my parse succeeded or failed.  To get
> around
> that, all parsed JSON values are wrapped in a trivial abstract syntax tree
> object.  Returning this object is always true and, after I've verified
> that the
> parse worked, it's just one more method call to retrieve the actual value
> it
> parsed.


The approach I like to take is to have the caller pass in the AST to be
modified.  I just use a sequence (Array or even String) to represent the
sequence of branches at that level.  Then each grammar method can append
anywhere from zero to an arbitrary number of branches to the AST, which
gives much more flexibility.  The grammar method just returns true or false
for whether it matched, independent of the AST.  Some parsers don't even
need an AST (or need only a sparse one), since things might be done on the
fly as parsing occurs.  Passing the AST separately works in all of these
cases.
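
Here's a toy sketch of that style (illustrative only, not my actual Grammar
class): each grammar method returns true/false for the match and appends
any branches onto the AST sequence it was handed:

```ruby
# Caller-passed-AST sketch: match status is the return value,
# branches go into the sequence supplied by the caller.
class DigitsParser
  def initialize(s)
    @s, @pos = s, 0
  end

  # digit : '0'..'9' -- on success, pushes one branch onto ast
  def digit(ast)
    ch = @s[@pos]
    return false unless ch =~ /\d/
    @pos += 1
    ast << ch
    true
  end

  # digits : digit+ -- pushes one branch per digit matched
  def digits(ast)
    return false unless digit(ast)
    nil while digit(ast)
    true
  end
end

ast = []
p DigitsParser.new("42x").digits(ast)  # => true
p ast                                  # => ["4", "2"]
```

A caller that doesn't need a tree can pass a throwaway Array (or an object
that acts on each branch immediately), which is the on-the-fly case above.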
