------art_20999_12435193.1202383926571
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

On Feb 6, 2008 9:42 AM, James Gray <james / grayproductions.net> wrote:

> Here's a revised version of this parser that passes all of the tests:
>
> #!/usr/bin/env ruby -wKU
>

Hi James,

I applied these fixes to my optimization of your StringScanner recursive
descent parser and applied a few more.  I also got it working in ruby1.9.
It is at the end of this message.  I couldn't figure out why yours wasn't
working with 1.9.

Also, I did some more benchmarking.  This time I ran 4 passes for each
dataset and I picked the best bandwidth (ch/s) among the 2 longest running
datasets.  I limited the depth to 10 because only the fastest 2 parsers need
deeper (to not parse <1s) and ruby seems to do strange things with garbage
collection.  All of the fastest parsers are shown with results from the
depth dataset.

I ran everything I could easily on 1.9 (didn't load any gems for 1.9).

Here are the results with this benchmarking that I think is more reliable:

   1.8    1.9 F E gzip author/gem
------    --- - - ---- ----------
     -      - 5 0  545 Pawel Radecki (RE, recursive descent)
    NL      - 6 0  796 James Koppel (RE+StringIO, recursive descent)
    NL     NL 5 1  683 Justin Ethier (RE lexer, ruby eval, fixes)
    NL      - 3 2 1074 James Edward Gray II (peggy)
  4662     SF 0 0  608 Eric Mahurin (Grammar0, no code-gen)
  5264      - 6 2  588 ghostwheel (ghostwheel, fixes)
  5764      - 2 0  706 Eric I (Treetop, unicode broken)
  8246      - 2 0  631 Steve (Treetop, mismatches in benchmark)
 11856      - 1 1  545 Clifford Heath (Treetop)
 25841      - 0 0  842 Alexander Stedile (RE, recursive descent)
 57217      - 0 0  723 Eric Mahurin (Grammar, v0.5)
152205     NL 2 1  660 Paolo Bonzini (RE, recursive descent)
     - 191529 2 1  445 Thomas Link (RE lexer, ruby eval)
202090      - 0 0  774 James Edward Gray II (RE, recursive descent)
233646      - 0 0  653 Eric Mahurin (Grammar, unreleased)
270508 211266 1 0    - json
299259      - 6 0    - fjson (w/ C ext)
351163 235726 0 0  607 James & Eric M (RE, recursive)
359665 279539 3 0  405 Thomas & Paolo (RE + eval)
430251 106285 0 0  837 Eric Mahurin (LL(1), recursive descent)
2079190     - 0 0  653 Eric Mahurin (Grammar, unreleased, ruby2cext)
8534416 3217453 0 0 - json (w/ C ext)

NL: non-linear, SF: seg-fault

Here are some things to note:

* on closer inspection, some of the slower solutions are nonlinear (usually
quadratic runtime).  ch/s is not meaningful since it just gets worse with
larger data sets.

* ruby1.9 killed my LL(1) parser performance.  The reason for this is that
characters in 1.9 are full objects, where in 1.8 they are immediate
Fixnum's.  4X slower is not good.  I need to go ask if we can get some
immediate 1-character Strings in ruby 1.9 to solve this.  Anything that
worked with characters in 1.8 is going to see the same problem.  My Grammar
classes generate similiar LL(1) parsers and will be affected in the same
way.  It works so well in 1.8 because there is so little object creation
(using Regexp's creates more objects).

* Dominick Baton's ruby2cext is giving my dev Grammar class a 10X
performance boost.  Woohoo!  Only 4X away from the best hand-crafted C (not
bad considering the ruby and then the C were autogenerated).  I generate
low-level stuff that optimizes well.  High-level Regexp's will have little
benefit from ruby2cext.

Here is the benchmark code I used (along with the previously posted
RandomJSON class):

parser  SONParser.new
generator  andomJSON.new((ARGV[1]||0).to_i)
bandwidth  
bandwidth0  
t0  
l  il
t  il
max_depth  0
(max_depth+1).times { |depth|
    tree  enerator.value(depth, depth%2)
    s  ree.inspect
    s.gsub!(/, ':')
    s.gsub!(/nil/, 'null')
    tree2  il
    l  .length
    t  il
    4.times {
        Benchmark.bm { |b|
            GC.start
            t1  .report("#{depth} #{l} ") { tree2  arser.parse(s) }
            GC.start
            raise("#{tree2.inspect}!tree.inspect}") if tree2!ee
            GC.start
            if (!t or t1.real<t.real)
                t  1
            end
        }
    }
    bandwidth  /t.real
    puts "#{bandwidth.to_i} chars/second"
    break if (t.real>
RGV[0]||1).to_f or depth>
x_depth)
    if (t.real>t0)
        bandwidth0  andwidth
        t0  .real
    end
}
bandwidth  andwidth0 if (bandwidth0>bandwidth)

puts "\n#{bandwidth.to_i} chars/second"


Here is another optimization of James' solution:


require "strscan"

class JSONParser

  def parse(input)
    @input  tringScanner.new(input)
    @input.scan(/\s*/)
    parse_value(out)
    @input.eos? or error("Unexpected data")
    out[0]
  end

  private

  def parse_value(out)
    if @input.scan(/\{\s*/)
      object  }
      kv  ]
      until @input.scan(/\}\s*/)
        object.empty? or @input.scan(/,\s*/) or error("Expected ,")
        kv.clear
        @input.scan(/"/) or error("Expected string")
        parse_string(kv)
        @input.scan(/:\s*/) or error("Expecting object separator")
        parse_value(kv)
        object[kv[0]]  v[1]
      end
      out << object
    elsif @input.scan(/\[\s*/)
      array  ]
      until @input.scan(/\]\s*/)
        array.empty? or @input.scan(/,\s*/) or error("Expected ,")
        parse_value(array)
      end
      out << array
    elsif @input.scan(/"/)
      parse_string(out)
    elsif @input.scan(/-?(?:0|[1-9]\d*)(?:\.\d+)?(?:[eE][+-]?\d+)?\s*/)
      out << eval(@input.matched)
    elsif @input.scan(/true\s*/)
      out << true
    elsif @input.scan(/false\s*/)
      out << false
    elsif @input.scan(/null\s*/)
      out << nil
    else
      error("Illegal JSON value")
    end
  end

  def parse_string(out)
    string  "
    while true
      if @input.scan(/[^\\"]+/)
        string.concat(@input.matched)
      elsif @input.scan(%r{\\["\\/]})
        string << @input.matched[-1]
      elsif @input.scan(/\\[bfnrt]/)
        string << eval(%Q{"#{@input.matched}"})
      elsif @input.scan(/\\u[0-9a-fA-F]{4}/)
        string << @input.matched[2..-1].to_i(16)
      else
        break
      end
    end
    @input.scan(/"\s*/) or error("Unclosed string")
    out << string
  end

  def error(message)
    if @input.eos?
      raise "Unexpected end of input."
    else
      raise "#{message}:  #{@input.peek(@input.string.length)}"
    end
  end
end

------art_20999_12435193.1202383926571--