Simon Kröger wrote:
> William James wrote:
>
> > Simon Kröger wrote:
> >
> >>Robert Klemme wrote:
> >>
> >>
> >>>[...]
> >>>Hi Glenn,
> >>>
> >>>this sounds like a challenge!  I'd love to get my hands on the input and
> >>>the spec of what you want to do with the data to see whether I can find an
> >>>even faster Ruby implementation.  If your data is not private, that is.
> >>>Alternatively, you could maybe anonymize it...
> >>>
> >>>Kind regards
> >>>
> >>>   robert
> >>
> >>second that!
> >>
> >>cheers
> >>
> >>Simon
> >
> >
> > I'd like to give it a shot.
>
> Hi,
>
> As I was curious, I wrote some little test scripts.
> First I create a test file (100000 rows, 9 values per row: 5 ints and 4
> strings). I then read it using different methods:
>
>                           user     system      total        real
> just read             0.062000   0.031000   0.093000 (  0.094000)
> just readlines        0.219000   0.016000   0.235000 (  0.234000)
> readlines-split       3.468000   0.078000   3.546000 (  3.562000)
> read-scan            10.953000   0.047000  11.000000 ( 11.016000)
> read-scan-block      11.485000   0.031000  11.516000 ( 11.563000)
> read-split-whole      6.234000   0.047000   6.281000 (  6.312000)
>
> "just read" and "just readlines" are only for reference.
> The file is approx. 12MB, and I think 3.5s is a good starting point.
>
> Here is the code:
>
> ----------------------------------------------------------------------
> require 'benchmark'
>
> s=' ' * 21
> open('testfile.cvs', 'wb') do |file|
>    100000.times do |l|
>      line = l.to_s;
>      4.times do
>        # .ord/.chr also works where ?A is a String (Ruby 1.9 and later)
>        21.times{|i|s[i] = (?A.ord + rand(26)).chr}
>        line << ', ' << s << ', ' << rand(10000).to_s
>      end
>      file.puts(line)
>    end
> end
>
> a1, a2, a3, a4 = nil
> Benchmark.bm 20 do |bm|
>    bm.report("just read") do
>      a = IO.read('testfile.cvs')
>    end
>
>    bm.report("just readlines") do
>      IO.readlines('testfile.cvs')
>    end
>
>    bm.report("readlines-split") do
>      a3 = IO.readlines('testfile.cvs').map!{|l| l.split(', ')}
>      a3.each{|b| b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i;
>                  b[6] = b[6].to_i; b[8] = b[8].to_i}
>    end
>
>    bm.report("read-scan") do
>      a1 = IO.read('testfile.cvs').scan(/^(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?)$/)
>      a1.each{|b| b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i;
>                  b[6] = b[6].to_i; b[8] = b[8].to_i}
>    end
>
>    bm.report("read-scan-block") do
>      a2 = []
>      IO.read('testfile.cvs').scan(/^(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?)$/) do |b|
>        a2 << [b[0].to_i, b[1], b[2].to_i, b[3], b[4].to_i, b[5],
>               b[6].to_i, b[7], b[8].to_i]
>      end
>    end
>
>    bm.report("read-split-whole") do
>      counter = 0;
>      a4 = Array.new(100000) {Array.new(9)}
>      IO.read('testfile.cvs').split(/\n|, /).each do |f|
>        a4[counter / 9][counter % 9] = ((counter % 9) % 2).zero? ? f.to_i : f
>        counter += 1
>      end
>    end
> end
>
> puts a1 == a2 && a2 == a3 && a3 == a4
> ----------------------------------------------------------------------

On my computer, it's a tiny bit faster without the .map! and .each,
chaining two .map calls instead.

  bm.report( "readlines-split" ) do
    a3 = IO.readlines('testfile.cvs').map{ |l|
      l.split(', ') }.map{ |b|
        b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i
        b[6] = b[6].to_i; b[8] = b[8].to_i
        b
      }
  end
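
For anyone who wants to check that the two readlines-split variants return
the same data before timing them, here is a minimal sketch. The two-row
sample file, the constant SAMPLE and the helper names convert_ints!,
parse_map_bang and parse_map_chain are my own assumptions for illustration;
they are not part of Simon's script.

```ruby
require 'tempfile'

# Hypothetical sample data in the same layout as the test file:
# ints in columns 0, 2, 4, 6, 8 and strings in columns 1, 3, 5, 7.
SAMPLE = Tempfile.new(['sample', '.cvs'])
SAMPLE.write("0, AAA, 1, BBB, 2, CCC, 3, DDD, 4\n" \
             "1, EEE, 5, FFF, 6, GGG, 7, HHH, 8\n")
SAMPLE.flush

# Hypothetical helper: convert the even-numbered columns to integers in place.
# The trailing newline on the last field is ignored by to_i.
def convert_ints!(row)
  0.step(8, 2) { |i| row[i] = row[i].to_i }
  row
end

# Variant from the quoted script: map! to split, then each to convert.
def parse_map_bang(path)
  rows = IO.readlines(path).map! { |l| l.split(', ') }
  rows.each { |b| convert_ints!(b) }
  rows
end

# Variant from this post: two chained map calls.
def parse_map_chain(path)
  IO.readlines(path).map { |l| l.split(', ') }.map { |b| convert_ints!(b) }
end

puts parse_map_bang(SAMPLE.path) == parse_map_chain(SAMPLE.path)
```

Both variants build one intermediate array of split rows; the only
difference is whether the conversion reuses that array (each) or allocates a
second one (map), so any timing gap between them should be small.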