Mark W. wrote:
I’ve benchmarked this and ‘get_file_names2’ is considerably quicker!
Could someone explain why? ie what is it about get_file_names2 that
makes it so much quicker? What is it about get_file_names that slows it
down?
Robert K. wrote:
In _2 you do a dir glob in one directory only while find recursively
descends a directory tree. Even though you prune in some cases it
comes as no surprise that this is slower.
Actually, I think that most of the gain in the final solution comes from
this:
a) in the original solution, the result array is built inserting
manually each file that matches the pattern in the array…
b) in the final solution (get_file_names2) not only Find is replaced by
Dir.glob, but the result array is built by grep.
This may sound surprising, but map (collect) and grep are highly
optimized C mechanisms; moreover they can do smart guesses on the size
of the result array (in particular, grep can inmediately conclude that
the size of the result array will be less than the array given to him).
In the manual solution instead, every item added to the array probably
requires some memory reallocation.
Let’s verify if this is true, benchmarking 3 variants:
- get_file_names : Find + manual array build
- get_file_names1: Dir.glob + manual array build
- get_file_names2: Dir.glob + grep
uses Find + manual array build
def get_file_names
fn=[]
Find.find(’./’) do |path|
Find.prune if File.basename(path) == ‘sent’
curr_file = File.basename(path)
if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} .xls $}x
fn << curr_file
end
end
fn
end
uses Dir.glob + manual array build
def get_file_names1
all_files = Dir.glob("*")
my_files.each do |path|
curr_file = File.basename(path)
if curr_file =~ %r{^ (CAL|NCPH|GOH) \d{6} .xls $}x
fn << curr_file
end
end
fn
end
uses Dir.glob + grep
def get_file_names2
all_files = Dir.glob("*")
my_files = all_files.grep(%r{^ (CAL|NCPH|GOH) \d{6} .xls $}x)
end
with 10 files matching the pattern
Benchmark.bm(5) do |timer|
timer.report(‘get_file_names’) {10000.times {get_file_names}}
timer.report(‘get_file_names1’) {10000.times {get_file_names}}
timer.report(‘get_file_names2’) {10000.times {get_file_names2}}
end
user system total real
get_file_names 4.680000 3.870000 8.550000 ( 8.567400)
get_file_names1 4.670000 3.870000 8.540000 ( 8.564753)
get_file_names2 0.690000 0.780000 1.470000 ( 1.470924)
The 580% improvement in the final solution comes almost exclusively from
the grep.
I add this for Mark:
it is best in any case to avoid the unnecessary ‘Find’, for several
reasons; the first is that what you need is the simple Dir.glob.
But let me cite a more insidious problem in that code: the idea of
pruning explicitly ‘sent’. Even if you think that the only subdirectory
present will be ‘sent’, what happens if 8 months from now another person
(or yourself, after some more thousands lines of code!) needs to add
another subdir with files matching that pattern (perhaps just to save a
version of them)? will he remember the assumption done months before?
Moreover, as Stefano pointed out at the beginning of the thread, this
would cause subtle file corruptions, as the code is not set up to deal
with the path correctly (a testing nightmare would follow, until
somebody remembers the ‘assumption’; some people have seen this hundreds
of times…:-).
So, the best is not to code ‘assumptions’ in the code; and of course to
use the right tools for the problem at hand: in this case, Dir.glob and
the beautiful grep
Regards
Raul