William James wrote:

> Simon Kröger wrote:
> 
>>Robert Klemme wrote:
>>
>>
>>>[...]
>>>Hi Glenn,
>>>
>>>this sounds like a challenge!  I'd love to get my hands on the input and
>>>the spec of what you want to do with the data to see whether I can find an
>>>even faster Ruby implementation.  If your data is not private that is.
>>>Alternatively you could maybe anonymize it...
>>>
>>>Kind regards
>>>
>>>   robert
>>
>>second that!
>>
>>cheers
>>
>>Simon
> 
> 
> I'd like to give it a shot.

Hi,

As I was curious, I wrote a few little test scripts.
First I create a test file (100000 rows of 9 values each: 5 ints and 4
strings per row). Then I read it back using different methods:

                          user     system      total        real
just read             0.062000   0.031000   0.093000 (  0.094000)
just readlines        0.219000   0.016000   0.235000 (  0.234000)
readlines-split       3.468000   0.078000   3.546000 (  3.562000)
read-scan            10.953000   0.047000  11.000000 ( 11.016000)
read-scan-block      11.485000   0.031000  11.516000 ( 11.563000)
read-split-whole      6.234000   0.047000   6.281000 (  6.312000)

"just read" and "just readlines" are only for reference.
The file is approx. 12MB, and I think 3.5s is a good starting point.

Here is the code:

----------------------------------------------------------------------
require 'benchmark'

s=' ' * 21
open('testfile.cvs', 'wb') do |file|
   100000.times do |l|
     line = l.to_s;
     4.times do
       # overwrite s with 21 random uppercase letters
       21.times{|i| s[i] = (?A.ord + rand(26)).chr}
       line << ', ' << s << ', ' << rand(10000).to_s
     end
     file.puts(line)
   end
end

a1, a2, a3, a4 = nil
Benchmark.bm 20 do |bm|
   bm.report("just read") do
     a = IO.read('testfile.cvs')
   end

   bm.report("just readlines") do
     IO.readlines('testfile.cvs')
   end

   bm.report("readlines-split") do
     a3 = IO.readlines('testfile.cvs').map!{|l| l.split(', ')}
     a3.each{|b| b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i;
                 b[6] = b[6].to_i; b[8] = b[8].to_i}
   end

   bm.report("read-scan") do
     a1 = IO.read('testfile.cvs').scan(/^(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?)$/)
     a1.each{|b| b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i;
                 b[6] = b[6].to_i; b[8] = b[8].to_i}
   end

   bm.report("read-scan-block") do
     a2 = []
     IO.read('testfile.cvs').scan(/^(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?)$/) do |b|
       a2 << [b[0].to_i, b[1], b[2].to_i, b[3], b[4].to_i, b[5],
              b[6].to_i, b[7], b[8].to_i]
     end
   end

   bm.report("read-split-whole") do
     counter = 0;
     a4 = Array.new(100000) {Array.new(9)}
     IO.read('testfile.cvs').split(/\n|, /).each do |f|
       a4[counter / 9][counter % 9] = ((counter % 9) % 2).zero? ? f.to_i : f
       counter += 1
     end
   end
end

puts a1 == a2 && a2 == a3 && a3 == a4
----------------------------------------------------------------------
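
To see what the three parsing strategies in the benchmark actually produce, here is a minimal sketch that runs them on a tiny in-memory sample instead of the 12MB file. The sample data and the names SAMPLE, ROW_RE, convert_row, parse_by_lines, parse_by_scan and parse_by_flat_split are made up for illustration and are not part of the benchmark script; it assumes Ruby 1.8.7 or later.

```ruby
# Hypothetical two-row sample in the same layout as the test file:
# 9 fields per row, ints at the even positions, strings at the odd ones.
SAMPLE = "0, AAA, 1, BBB, 2, CCC, 3, DDD, 4\n" \
         "1, EEE, 5, FFF, 6, GGG, 7, HHH, 8\n"

# Same pattern as in the benchmark: nine lazily matched fields per line.
ROW_RE = /^(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?)$/

# Convert the even-indexed fields of one row to integers.
def convert_row(fields)
  fields.each_with_index.map { |f, i| i.even? ? f.to_i : f }
end

# "readlines-split": cut the text into lines, then split each line on ', '.
def parse_by_lines(text)
  text.each_line.map { |line| convert_row(line.chomp.split(', ')) }
end

# "read-scan": one regexp scan over the whole text, one match per row.
def parse_by_scan(text)
  text.scan(ROW_RE).map { |fields| convert_row(fields) }
end

# "read-split-whole": split everything into one flat field list,
# then regroup it into rows of 9.
def parse_by_flat_split(text)
  text.split(/\n|, /).each_slice(9).map { |fields| convert_row(fields) }
end

rows = parse_by_lines(SAMPLE)
puts rows == parse_by_scan(SAMPLE) && rows == parse_by_flat_split(SAMPLE)  # prints true
```

All three return the same array of rows, so the timing differences come from how the text is cut up, not from what is built at the end.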

It would be nice if someone could rewrite the scanf package in C...

cheers

Simon