Optimizing Hash deletions

mschwab · June 26, 2007, 4:37am

I have a recursive file-compare program that I run on 2 directories that
each contain about 10,000 files, 99% of which don’t change. I read all
the
filenames into hashes and then (first) delete all the matches, using
this
code:

def RemoveIdenticalDirs ( old_fe_list , new_fe_list )
# this function is necessary because we have to do a
case-insensitive
match for directories, but not for files
$stderr.puts “Detecting identical dirs…”
old_fe_list.keys.each do | oldfile |
new_fe_list.keys.each do | newfile | # can’t use .has_key?
because
we have to do a case-insensitive match
if old_fe_list[oldfile].is_dir and
new_fe_list[newfile].is_dir
and oldfile.downcase == newfile.downcase
old_fe_list.delete ( oldfile )
new_fe_list.delete ( newfile )
break
end
end
end
end

def RemoveIdenticalFiles ( old_fe_list , new_fe_list , comparetype )
$stderr.puts “Detecting identical files…”
old_fe_list.keys.each do | file |
if new_fe_list.has_key? ( file )
if !(old_fe_list[file].is_dir) and
!(new_fe_list[file].is_dir)
and FilesAreIdentical ( comparetype , old_fe_list[file] ,
new_fe_list[file]
)
old_fe_list.delete ( file )
new_fe_list.delete ( file )
end
end
end
end

Note that I’ve added an attribute to each hash called .is_dir which just
holds a Boolean value.

When I run the program (under WinXP using Ruby 185-21, if it matters),
it
takes about 5-10 seconds to execute the above 2 functions. But that’s
not
the worst part - it chews up memory big time. The machine I’m running it
on
has 768MB of RAM (with no swap file), and Windows gives me a warning
that
it’s out of memory as the above code runs. However, neither Windows nor
the
program crashes. I’m guessing that all the .delete’s are causing lots of
memory usage and the garbage collector starts running.

So my question is - is there a less memory-intensive way of doing the
above?
Would setting elements to nil instead of deleting them make any
difference?
Or maybe copying entries to a new hash somehow?

Mike S.

mschwab · June 26, 2007, 5:21am

From: “Mike S.” [email protected]

old_fe_list.keys.each do | oldfile |
end
end
has 768MB of RAM (with no swap file), and Windows gives me a warning that
it’s out of memory as the above code runs. However, neither Windows nor the
program crashes. I’m guessing that all the .delete’s are causing lots of
memory usage and the garbage collector starts running.

So my question is - is there a less memory-intensive way of doing the above?
Would setting elements to nil instead of deleting them make any difference?
Or maybe copying entries to a new hash somehow?

You’ve got an O(n**2) search in RemoveIdenticalDirs. How long
does RemoveIdenticalDirs take to execute, as compared to
RemoveIdenticalFiles?

Also, what does FilesAreIdentical() do ?

Regards,

Bill

mschwab · June 26, 2007, 4:18pm

If the files mostly never change, there may be faster ways.

If your code does the changing, then you can log whenever that gets
done. This makes the list much shorter and no pruning is needed.

Alternatively, you could touch the files so that they all have the same
timestamp. Then, you could just assemble the list of files that have a
timestamp that is different and you would not need to do all that
pruning.

mschwab · June 26, 2007, 3:39pm

Dear Mike,

why don’t you first create an Array of files that you want
to delete, and do the deletion after that ?
You can start the garbage collection by force also … maybe
like this:

class String
def compare_dir(other_dir)
list1 = Dir.entries(self)
list2=Dir.entries(other_dir)
in_first_but_not_in_second_dir=list1-list2
# remove directories from the list
in_first_but_not_in_second_dir.delete_if{|x|
FileTest::directory?(x)}
return in_first_but_not_in_second_dir
end
end
class Array
def delete_files
self.each{|x|
File.delete(x)
GC.start
}
end
end

I compared just the names of roughly 3500 files in two
directories in a fraction of a second like this,
but I haven’t tried to erase them…
Best regards,

Axel

t=Time.now
p “/home/axel”.compare_dir(“/home/axel/ruby”).length
p Time.now-t

-------- Original-Nachricht --------
Datum: Tue, 26 Jun 2007 11:35:54 +0900
Von: “Mike S.” [email protected]
An: [email protected]
Betreff: optimizing Hash deletions

mschwab · June 26, 2007, 6:14pm

FilesAreIdentical() either compares the files based on size/date, or on
contents (depending on the comparetype parameter). I’m currently running
it
with size/date (and it’s still slow and hogs memory).

Mike S.

mschwab · June 27, 2007, 1:14am

RemoveIdenticalDirs takes about 2-3 times as long to run as
RemoveIdenticalFiles.

FilesAreIdentical compares 2 files based on size/date and returns
true/false.

Mike S.

mschwab · June 27, 2007, 1:18am

The files are changed by other programs, so I can’t log those.

The speed really isn’t that much of an issue, since this program isn’t
run
too often. It’s mostly the memory usage (I estimate about 600MB peak
memory
used to compare 10,000 files).

Mike S.