On 26.11.2007 20:39, Raul Parolari wrote:
> Mark Woodward wrote:
>> I've benchmarked this and 'get_file_names2' is considerably quicker!
>>
>> Could someone explain why? ie what is it about get_file_names2 that
>> makes it so much quicker? What is it about get_file_names that slows it
>> down?
> 
> Robert Klemme wrote:
>> In _2 you do a dir glob in *one* directory only while find recursively
>> descends a *directory tree*.  Even though you prune in some cases it
>> comes as no surprise that this is slower.
> 
> Actually, I think that most of the gain in the final solution comes from 
> this:
> a) in the original solution, the result array is built inserting 
> manually each file that matches the pattern in the array..
> b) in the final solution (get_file_names2) not only Find is replaced by 
> Dir.glob, but the result array is built by grep.
> 
> This may sound surprising, but map (collect) and grep are highly 
> optimized C mechanisms; moreover they can do smart guesses on the size 
> of the result array (in particular, grep can inmediately conclude that 
> the size of the result array will be less than the array given to him). 
> In the manual solution instead, every item added to the array probably 
> requires some memory reallocation.
> 
> Let's verify if this is true, benchmarking 3 variants:
> 
> 0) get_file_names : Find     + manual array build
> 1) get_file_names1: Dir.glob + manual array build
> 2) get_file_names2: Dir.glob + grep
> 
> # uses Find + manual array build
> def get_file_names
>   fn=[]
>   Find.find('./') do |path|
>     Find.prune if File.basename(path) == 'sent'
>     curr_file = File.basename(path)
>     if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
>       fn << curr_file
>     end
>   end
>   fn
> end
> 
> # uses Dir.glob + manual array build
> def get_file_names1
>   all_files = Dir.glob("*")
>   my_files.each do |path|
>     curr_file = File.basename(path)
>     if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x
>       fn << curr_file
>     end
>   end
>   fn
> end
> 
> # uses Dir.glob + grep
> def get_file_names2
>   all_files = Dir.glob("*")
>   my_files  = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x)
> end

What about this one?

def get_file_names_3
   Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"]
end

> # with 10 files matching the pattern
> Benchmark.bm(5) do |timer|
>   timer.report('get_file_names')  {10000.times {get_file_names}}
>   timer.report('get_file_names1') {10000.times {get_file_names}}
>   timer.report('get_file_names2') {10000.times {get_file_names2}}
> end
> 
>                  user       system      total      real
> get_file_names   4.680000   3.870000   8.550000 ( 8.567400)
> get_file_names1  4.670000   3.870000   8.540000 ( 8.564753)
> get_file_names2  0.690000   0.780000   1.470000 ( 1.470924)
> 
> The 580% improvement in the final solution comes almost exclusively from 
> the grep.
> 
> ---
> I add this for Mark:
> it is best in any case to avoid the unnecessary 'Find', for several 
> reasons; the first is that what you need is the simple Dir.glob.
> 
> But let me cite a more insidious problem in that code: the idea of 
> pruning explicitly 'sent'. Even if you think that the only subdirectory 
> present will be 'sent', what happens if 8 months from now another person 
> (or yourself, after some more thousands lines of code!) needs to add 
> another subdir with files matching that pattern (perhaps just to save a 
> version of them)? will he remember the assumption done months before?
> 
> Moreover, as Stefano pointed out at the beginning of the thread, this 
> would cause subtle file corruptions, as the code is not set up to deal 
> with the path correctly (a testing nightmare would follow, until 
> somebody remembers the 'assumption'; some people have seen this hundreds 
> of times..:-).
> 
> So, the best is not to code 'assumptions' in the code; and of course to 
> use the right tools for the problem at hand: in this case, Dir.glob and 
> the beautiful grep

Good point!  That's basically the same reason why it's better to 
explicitely open ports in firewall configs than to close "dangerous" ports.

Kind regards

	robert