2007/11/27, Raul Parolari <raulparolari / gmail.com>:
> Robert Klemme wrote:
>
> > What about this one?
> >
> > def get_file_names_3
> >    Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
> > end
>
> Robert
>
>   I had not tested case 1 in the benchmark (I dropped the digit '1' in
> the call); it turns out that Find, as you had thought, is what causes
> most of the inefficiency. Without repeating all the code, here are the
> results, including your last suggestion.
>
> # uses Find, and selects files manually
> def get_file_names
>   fn=[]
>   ..
>   fn
> end
>
> # 1) uses Dir.glob and builds array with a loop
> def get_file_names1
>   fn=[ ]
>   all_files = Dir.glob("*")
>   ..
>   fn
> end
>
> # 2) uses Dir.glob and grep
> def get_file_names2
>   all_files = Dir.glob("*")
>   my_files  = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
> end

We completely forgot about Dir.entries()...  That might be a tad
faster because supposedly no globbing is going on.
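
A minimal sketch of that idea, not run against the benchmark above. The
method name get_file_names_entries and its dir parameter are made up for
illustration; the regexp is the one from solution 2, anchored with \A and \z:

```ruby
# Hypothetical variant: read the directory once, then filter in Ruby.
# The method name and the dir parameter are assumptions, not part of
# the benchmark quoted above.
def get_file_names_entries(dir = ".")
  # Dir.entries also returns "." and "..", which the pattern rejects.
  Dir.entries(dir).grep(/\A(CAL|NCPH|GOH)\d{6}\.xls\z/)
end
```

Whether this actually beats Dir.glob("*") would have to be measured.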

> # 3) variation of solution 2
> def get_file_names3
>    Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
> end
>
> # with 30 files
> Benchmark.bm(5) do |timer|
>   timer.report('get_file_names')  {10_000.times {get_file_names}  }
>   timer.report('get_file_names1') {10_000.times {get_file_names1} }
>   timer.report('get_file_names2') {10_000.times {get_file_names2} }
>   timer.report('get_file_names3') {10_000.times {get_file_names3} }
> end
>
>                       user     system      total        real
> get_file_names   14.640000   9.080000  23.720000 ( 23.778029)
> get_file_names1   1.690000   1.200000   2.890000 (  2.903737)
> get_file_names2   1.370000   1.210000   2.580000 (  2.581539)
> get_file_names3   1.430000   3.530000   4.960000 (  4.968951)
>
> Solution 2) is, as we saw before, the winner: about 9 times faster than
> the original solution. But the grep only improves on solution 1 by about
> 10%; nearly all of the gain comes from removing Find (as Robert had
> guessed)!
>
> Regarding the last one (called solution 3) from you:
> >    Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
>
> This turned out to be about 2 times slower than solution 2. Checking
> whether a character is in the range 0-9 is apparently a good deal slower
> than checking for 'digit'.
>
> In conclusion:
> Solution 2 is the fastest; but the reason is not the grep (as I had
> theorized), which only shaves about 10% off solution 1; nearly all of
> the improvement comes from removing Find, as Robert had guessed.

It's an old truth: access to external storage is at least an order of
magnitude slower than access to main memory. Ergo, if you can reduce
the number of accesses to that storage you typically achieve big
improvements.  :-)
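
To make the "number of accesses" point concrete, here is a rough sketch
that counts how many paths each approach looks at. The directory layout
(one matching file plus an "archive" subdirectory) and the method name
count_paths_visited are assumptions for illustration only; the original
benchmark directory may well have had no subdirectories:

```ruby
require "find"
require "tmpdir"
require "fileutils"

# Builds a throwaway tree (hypothetical layout) and compares how many
# paths each approach has to look at.
def count_paths_visited
  Dir.mktmpdir do |root|
    FileUtils.touch(File.join(root, "CAL123456.xls"))
    sub = File.join(root, "archive")
    FileUtils.mkdir(sub)
    3.times { |i| FileUtils.touch(File.join(sub, "old#{i}.xls")) }

    via_find = 0
    # Find.find yields root itself, then walks everything below it.
    Find.find(root) { via_find += 1 }
    # A single directory read; subtract "." and "..".
    via_entries = Dir.entries(root).size - 2
    { find: via_find, entries: via_entries }
  end
end
```

Find also has to stat each path to decide whether to descend into it, so
the gap in file system work is larger than the raw path counts suggest.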

> Last consideration: as the number of files increases, the weight of
> 'grep' in the improvement increases (but enough benchmarks for today :-).
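
For anyone who wants to check that, a possible harness is sketched below.
Everything in it is an assumption: the method name compare_loop_and_grep,
its parameters, the generated file names, and in particular the loop body,
since the loop in get_file_names1 was elided above:

```ruby
require "benchmark"
require "tmpdir"
require "fileutils"

# Times a loop-based and a grep-based selection over a temporary directory
# holding file_count matching files plus as many non-matching ones.
# Names, parameters, and the loop body are illustrative assumptions.
def compare_loop_and_grep(file_count, iterations = 1_000)
  pattern = /\A(CAL|NCPH|GOH)\d{6}\.xls\z/
  Dir.mktmpdir do |dir|
    file_count.times do |i|
      FileUtils.touch(File.join(dir, format("CAL%06d.xls", i)))
      FileUtils.touch(File.join(dir, format("other%06d.txt", i)))
    end
    Dir.chdir(dir) do
      loop_time = Benchmark.realtime do
        iterations.times do
          fn = []
          Dir.glob("*").each { |f| fn << f if f =~ pattern }
        end
      end
      grep_time = Benchmark.realtime do
        iterations.times { Dir.glob("*").grep(pattern) }
      end
      # Wall-clock seconds for each variant.
      { loop: loop_time, grep: grep_time }
    end
  end
end
```

Calling it with, say, 30, 300 and 3000 files and comparing the two timings
would show how the share of grep changes with directory size.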

Thanks for taking the time to go through this.

:-)

Kind regards

robert

-- 
use.inject do |as, often| as.you_can - without end