William James wrote:
> Simon Kröçer wrote:
>
>>Robert Klemme wrote:
>>
>>>[...]
>>>Hi Glenn,
>>>
>>>this sounds like a challenge! I'd love to get my hands on the input and
>>>the spec what you want do do with the data to see whether I can find an
>>>even faster Ruby implementation. If your data is not private that is.
>>>Alternatively you could maybe anonymize it...
>>>
>>>Kind regards
>>>
>>> robert
>>
>>second that!
>>
>>cheers
>>
>>Simon
>
> I'd like to give it a shot.

Hi,

I was curious, so I wrote a small test script. First I create a test file
(100000 rows, 9 values per row: 5 ints and 4 strings). Then I read it back
in using several different methods:

                          user     system      total        real
just read             0.062000   0.031000   0.093000 (  0.094000)
just readlines        0.219000   0.016000   0.235000 (  0.234000)
readlines-split       3.468000   0.078000   3.546000 (  3.562000)
read-scan            10.953000   0.047000  11.000000 ( 11.016000)
read-scan-block      11.485000   0.031000  11.516000 ( 11.563000)
read-split-whole      6.234000   0.047000   6.281000 (  6.312000)

"just read" and "just readlines" do no parsing and are only there for
reference. The file is approx. 12MB, and I think 3.5s (readlines-split) is
a good starting point.
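To make the row layout concrete, here is a minimal sketch of the readlines-split idea on a tiny in-memory sample. It is an illustration only, not part of the benchmark: the two sample rows are made up, `parse_rows` is a hypothetical helper name, and it assumes Ruby 2.3 or later for the `<<~` heredoc.

```ruby
# Two made-up rows in the layout described above:
# int, string, int, string, int, string, int, string, int
SAMPLE = <<~ROWS
  0, ABCDEFGHIJKLMNOPQRSTU, 17, BCDEFGHIJKLMNOPQRSTUV, 4242, CDEFGHIJKLMNOPQRSTUVW, 7, DEFGHIJKLMNOPQRSTUVWX, 9999
  1, EFGHIJKLMNOPQRSTUVWXY, 305, FGHIJKLMNOPQRSTUVWXYZ, 12, ABCDEFGHIJKLMNOPQRSTU, 8801, BCDEFGHIJKLMNOPQRSTUV, 64
ROWS

# parse_rows is a hypothetical helper, not something from the benchmark.
# It splits each line on ", " and converts the even-indexed columns
# (0, 2, 4, 6, 8) to integers, leaving the odd-indexed columns as strings.
def parse_rows(text)
  text.each_line.map do |line|
    fields = line.chomp.split(', ')
    fields.each_with_index.map { |field, i| i.even? ? field.to_i : field }
  end
end

rows = parse_rows(SAMPLE)
p rows.first
```

The benchmark itself converts the five integer columns by explicit index instead of using each_with_index; the result per row should be the same.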
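Here is the code:

----------------------------------------------------------------------
require 'benchmark'

# Create the test file: 100000 rows, each an int followed by four
# (21-letter string, int) pairs, separated by ", ".
s = ' ' * 21
open('testfile.cvs', 'wb') do |file|
  100000.times do |l|
    line = l.to_s
    4.times do
      # Fill s with 21 random uppercase letters. The original
      # `s[i] = ?A + rand(26)` only works on Ruby 1.8, where ?A is an
      # Integer; the form below runs on 1.8.7 and on 1.9+.
      21.times { |i| s[i] = (?A.ord + rand(26)).chr }
      line << ', ' << s << ', ' << rand(10000).to_s
    end
    file.puts(line)
  end
end

a1, a2, a3, a4 = nil

Benchmark.bm 20 do |bm|
  # Reference: read the whole file into one string, no parsing.
  bm.report("just read") do
    a = IO.read('testfile.cvs')
  end

  # Reference: read the file into an array of lines, no parsing.
  bm.report("just readlines") do
    IO.readlines('testfile.cvs')
  end

  # Split every line on ", ", then convert the int columns in place.
  bm.report("readlines-split") do
    a3 = IO.readlines('testfile.cvs').map!{|l| l.split(', ')}
    a3.each{|b| b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i; b[6] = b[6].to_i; b[8] = b[8].to_i}
  end

  # Scan the whole file with a nine-group regexp, then convert the ints.
  bm.report("read-scan") do
    a1 = IO.read('testfile.cvs').scan(/^(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?)$/)
    a1.each{|b| b[0] = b[0].to_i; b[2] = b[2].to_i; b[4] = b[4].to_i; b[6] = b[6].to_i; b[8] = b[8].to_i}
  end

  # Same regexp, but build each converted row inside the scan block.
  bm.report("read-scan-block") do
    a2 = []
    IO.read('testfile.cvs').scan(/^(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?),\s*(.*?)$/) do |b|
      a2 << [b[0].to_i, b[1], b[2].to_i, b[3], b[4].to_i, b[5], b[6].to_i, b[7], b[8].to_i]
    end
  end

  # Split the whole file into one flat field list and fill a
  # preallocated 100000 x 9 array by running index.
  bm.report("read-split-whole") do
    counter = 0
    a4 = Array.new(100000) {Array.new(9)}
    IO.read('testfile.cvs').split(/\n|, /).each do |f|
      a4[counter / 9][counter % 9] = ((counter % 9) % 2).zero? ? f.to_i : f
      counter += 1
    end
  end
end

# Sanity check: all four parsing methods must produce identical data.
puts a1 == a2 && a2 == a3 && a3 == a4
----------------------------------------------------------------------

It would be nice if someone could rewrite the scanf package in C...

cheers

Simon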