On 26.11.2007 20:39, Raul Parolari wrote: > Mark Woodward wrote: >> I've benchmarked this and 'get_file_names2' is considerably quicker! >> >> Could someone explain why? ie what is it about get_file_names2 that >> makes it so much quicker? What is it about get_file_names that slows it >> down? > > Robert Klemme wrote: >> In _2 you do a dir glob in *one* directory only while find recursively >> descends a *directory tree*. Even though you prune in some cases it >> comes as no surprise that this is slower. > > Actually, I think that most of the gain in the final solution comes from > this: > a) in the original solution, the result array is built inserting > manually each file that matches the pattern in the array.. > b) in the final solution (get_file_names2) not only Find is replaced by > Dir.glob, but the result array is built by grep. > > This may sound surprising, but map (collect) and grep are highly > optimized C mechanisms; moreover they can do smart guesses on the size > of the result array (in particular, grep can inmediately conclude that > the size of the result array will be less than the array given to him). > In the manual solution instead, every item added to the array probably > requires some memory reallocation. > > Let's verify if this is true, benchmarking 3 variants: > > 0) get_file_names : Find + manual array build > 1) get_file_names1: Dir.glob + manual array build > 2) get_file_names2: Dir.glob + grep > > # uses Find + manual array build > def get_file_names > fn=[] > Find.find('./') do |path| > Find.prune if File.basename(path) == 'sent' > curr_file = File.basename(path) > if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x > fn << curr_file > end > end > fn > end > > # uses Dir.glob + manual array build > def get_file_names1 > all_files = Dir.glob("*") > my_files.each do |path| > curr_file = File.basename(path) > if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x > fn << curr_file > end > end > fn > end > > # uses Dir.glob + grep > def get_file_names2 > all_files = Dir.glob("*") > my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} \.xls $}x) > end What about this one? def get_file_names_3 Dir["{CAL,NCPH,GOH}[0-9][0-9][0-9][0-9][0-9][0-9].xls"] end > # with 10 files matching the pattern > Benchmark.bm(5) do |timer| > timer.report('get_file_names') {10000.times {get_file_names}} > timer.report('get_file_names1') {10000.times {get_file_names}} > timer.report('get_file_names2') {10000.times {get_file_names2}} > end > > user system total real > get_file_names 4.680000 3.870000 8.550000 ( 8.567400) > get_file_names1 4.670000 3.870000 8.540000 ( 8.564753) > get_file_names2 0.690000 0.780000 1.470000 ( 1.470924) > > The 580% improvement in the final solution comes almost exclusively from > the grep. > > --- > I add this for Mark: > it is best in any case to avoid the unnecessary 'Find', for several > reasons; the first is that what you need is the simple Dir.glob. > > But let me cite a more insidious problem in that code: the idea of > pruning explicitly 'sent'. Even if you think that the only subdirectory > present will be 'sent', what happens if 8 months from now another person > (or yourself, after some more thousands lines of code!) needs to add > another subdir with files matching that pattern (perhaps just to save a > version of them)? will he remember the assumption done months before? > > Moreover, as Stefano pointed out at the beginning of the thread, this > would cause subtle file corruptions, as the code is not set up to deal > with the path correctly (a testing nightmare would follow, until > somebody remembers the 'assumption'; some people have seen this hundreds > of times..:-). > > So, the best is not to code 'assumptions' in the code; and of course to > use the right tools for the problem at hand: in this case, Dir.glob and > the beautiful grep Good point! That's basically the same reason why it's better to explicitely open ports in firewall configs than to close "dangerous" ports. Kind regards robert