Slow Find.find - real problem

Hey,

I have a real headache and would really appreciate any advice.

I have a folder with 100,000’s of files within it and I need to run a
rename process across these.

def renameFile(oldID, newID)
  Find.find(@pdfDIR) do |curItem|
    if File.file?(curItem) and curItem[/\.pdf$/i] and
        File.basename(curItem)[/^T/i]
      # The file is valid, is a PDF and has already been renamed with T on
      # the front, so there is nothing to do
    else
      if File.basename(curItem.to_s.strip) == oldID
        File.rename(curItem.to_s.strip, "#{@pdfDIR}\\T#{newID}.pdf")
        # rename the exported PDF file to the new filename with T on the front
      end
    end
  end
end

My main code basically reads two pieces of data from a CSV file: first
the current filename of the file, then the new filename I want to
rename the file to. In summary, I grab the old and new filenames from the
CSV file one pair at a time, then call the method above, which searches the
given directory for a filename that matches the old name; when it
finds it, it renames it to the new name.

My issue is this is very very slow and is taking about an hour to do 10
file renames. Is there any way I can speed this up?

Many thanks

Stuart

Stuart C. wrote:

My issue is this is very very slow and is taking about an hour to do 10
file renames. Is there any way I can speed this up?

The slow bit will be scanning the filesystem.

I suggest you run a single Find.find across the whole tree, building a
suitable data structure (e.g. a hash of {basename => full filename} for
every file which is eligible to be renamed)

Then when you read in the CSV, you can just cross-reference to this data
structure and locate the files immediately.

If there is a possibility of more than one file with the same basename
existing in multiple directories, then build your data structure
appropriately, e.g.
{basename => [fullpath1, fullpath2, fullpath3, …]}
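A minimal sketch of that multi-path structure (the directory layout here is a throwaway tree built in a temp dir, not the poster's real one):

```ruby
require 'find'
require 'tmpdir'
require 'fileutils'

# Stand-in tree: the same basename appears in two different subdirectories.
root = Dir.mktmpdir
FileUtils.mkdir_p(File.join(root, "a"))
FileUtils.mkdir_p(File.join(root, "b"))
File.write(File.join(root, "a", "T123.pdf"), "")
File.write(File.join(root, "b", "T123.pdf"), "")

# One scan of the tree: map each basename to every full path carrying it.
base2paths = Hash.new { |h, k| h[k] = [] }
Find.find(root) do |path|
  base2paths[File.basename(path)] << path if File.file?(path)
end

base2paths["T123.pdf"].length  # => 2
```

If every basename is known to be unique, the plain `{basename => path}` hash is enough and the array wrapper can be dropped.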

Alternatively, do this the other way round: read in the whole CSV (which
is likely to be fast) into a suitable data structure, and then scan
across the filesystem once; while you scan it, for every file you find
check whether the CSV had a rename instruction for it.

If the CSV is too big to fit into memory then consider loading it into a
database of some sort instead.

HTH,

Brian.

Thanks for getting back to me. At present I read the CSV file, get the
first line of information, then search the file system in the method
above.

Are you saying I should hash the file system, loop through the hash,
get the first entry, then search the CSV and rename?

Could you provide an example please? The basenames are all unique!

Thanks a lot

Brian C. wrote:

Stuart C. wrote:

My issue is this is very very slow and is taking about an hour to do 10
file renames. Is there any way I can speed this up?

The slow bit will be scanning the filesystem.

I suggest you run a single Find.find across the whole tree, building a
suitable data structure (e.g. a hash of {basename => full filename} for
every file which is eligible to be renamed)

Then when you read in the CSV, you can just cross-reference to this data
structure and locate the files immediately.

If there is a possibility of more than one file with the same basename
existing in multiple directories, then build your data structure
appropriately, e.g.
{basename => [fullpath1, fullpath2, fullpath3, …]}

Alternatively, do this the other way round: read in the whole CSV (which
is likely to be fast) into a suitable data structure, and then scan
across the filesystem once; while you scan it, for every file you find
check whether the CSV had a rename instruction for it.

If the CSV is too big to fit into memory then consider loading it into a
database of some sort instead.

HTH,

Brian.

Could you provide an example please? The basenames are all unique!

Well, I guess you are doing a ‘find’ because the files are in various
directories, and you don’t know which file is in which directory.
Otherwise, you’d be able to construct the path directly.

So at the simplest I’m thinking of something like this:

require 'find'

base2path = {}
Find.find("/etc") do |path|
  base2path[File.basename(path)] = path
end

Have a look at what you've built:

require 'pp'
pp base2path

You could add some logic to skip files you know are not of interest if
you like.
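That skip logic could look something like this (a sketch; the conditions mirror the poster's original filter, and `eligible?` is a name I've made up):

```ruby
# Hypothetical helper: true only for regular files ending in .pdf that
# have not already been given the "T" prefix.
def eligible?(path)
  name = File.basename(path)
  File.file?(path) && name =~ /\.pdf\z/i && name !~ /\AT/i
end
```

Then the Find.find loop would only add paths for which `eligible?(path)` is true.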

Then if the CSV says it wants to rename foo.pdf, you look up
base2path["foo.pdf"] to find its full path instantly. If you get nil,
then the file doesn't exist.

The disadvantage of what I’ve proposed above is that you have to scan
the whole filesystem once (very slow) before you start doing any work -
and that’s probably the point where you find there are bugs in the rest
of your script, then you have to start again.

So it may be quicker to debug if you write the program to read the CSV
first, and then start doing the renames as you walk across the
filesystem:

idmap = {}
csvfile.each do |old_id, new_id|
  idmap[old_id] = new_id
end

Find.find("/etc") do |path|
  old_id = File.basename(path)
  new_id = idmap[old_id]
  next unless new_id
  File.rename(path, File.join(File.dirname(path), new_id))
end

That’s completely untested, but is just to give you an idea.

HTH,

Brian.

On Sat, Sep 4, 2010 at 7:24 PM, Stuart C.
[email protected] wrote:

I have a folder with 100,000’s of files within it and I need to run a
rename process across these.

ok. one folder, many files.

def renameFile(oldID, newID)
  Find.find(@pdfDIR) do |curItem|

hmm. careful w this command, it recurses. and watch for symlinks too.

    if File.file?(curItem) and curItem[/\.pdf$/i] and
        File.basename(curItem)[/^T/i]
      # The file is valid, is a PDF and has already been renamed with T on
      # the front, so there is nothing to do
    else
      if File.basename(curItem.to_s.strip) == oldID
        File.rename(curItem.to_s.strip, "#{@pdfDIR}\\T#{newID}.pdf")
        # rename the exported PDF file to the new filename with T on the front
      end
    end
  end
end

My main code basically reads two pieces of data from a CSV file: first
the current filename of the file, then the new filename I want to
rename the file to. In summary, I grab the old and new filenames from the
CSV file one pair at a time, then call the method above, which searches the
given directory for a filename that matches the old name; when it
finds it, it renames it to the new name.

My issue is this is very very slow and is taking about an hour to do 10
file renames.

of course, there is really something wrong in there. let's fix it :-)

Is there any way I can speed this up?

you are scanning the folder for every file, so 10 file renames result
in 10 (re)scans of those 100k files.

correct me if i'm wrong, but i think you do not need to scan the
folder. just chdir to that folder, loop on your csv and do the inner
rename. let the os do the search and the rename. i bet 10 renames won't
even take 10 seconds :-)

something like this (untested):

Dir.chdir folder_to_change
csvfile.each do |old, new|
  File.rename(old, new)
end
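One caveat worth sketching: File.rename raises Errno::ENOENT when the old name is missing, so a guard keeps one bad CSV row from aborting the whole run (the folder and filenames below are made up for the demo):

```ruby
require 'tmpdir'
require 'fileutils'

# Throwaway folder with one real file; the second CSV row has no match.
folder = Dir.mktmpdir
File.write(File.join(folder, "old_a.pdf"), "")

rows = [["old_a.pdf", "new_a.pdf"], ["missing.pdf", "new_b.pdf"]]

Dir.chdir(folder) do
  rows.each do |old_name, new_name|
    next unless File.exist?(old_name)  # skip rows whose file is absent
    File.rename(old_name, new_name)
  end
end

File.exist?(File.join(folder, "new_a.pdf"))  # => true
```

The skipped rows could also be logged, so missing files show up in the run's output instead of vanishing silently.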

best regards -botp

On 04.09.2010 13:24, Stuart C. wrote:

I have a real head ache and would really appreciate any advice.

I have a folder with 100,000’s of files within it and I need to run a
rename process across these.

My issue is this is very very slow and is taking about an hour to do 10
file renames. Is there any way I can speed this up?

Brian gave excellent advice already. I’d just want to point out that
keeping hundred thousands of files in a single directory (do you?) is
generally slow on most file systems. If possible distributing them
across several directories generally helps speed up things.
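For what it's worth, one common way to do that split is to bucket each file by a couple of hash characters of its name; `sharded_path` below is a made-up helper to illustrate, not anything from the thread:

```ruby
require 'digest'

# Bucket by the first two hex digits of an MD5 of the filename:
# 256 buckets, so ~100,000 files become roughly 400 per directory.
def sharded_path(root, filename)
  bucket = Digest::MD5.hexdigest(filename)[0, 2]
  File.join(root, bucket, filename)
end

sharded_path("/data", "T123.pdf")  # e.g. "/data/ab/T123.pdf"
```

Because the bucket is computed from the name, lookups build the path directly instead of scanning, and writes land in a small directory.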

Kind regards

robert

On Sat, Sep 4, 2010 at 2:10 PM, Stuart C.
[email protected] wrote:

Thanks for getting back to me. At present I read the CSV file, get the
first line of information, then search the file system in the method
above.

Are you saying I should hash the file system, loop through the hash,
get the first entry, then search the CSV and rename?

Could you provide an example please? The basenames are all unique!

If you are starting with a unique basename, and then traversing your
folders to find it, and those folders have a lot of files, then you
will be traversing your folders many times, when you can probably
cache the structure in memory. That’s what Brian was proposing. So you
could start with reading all the folders and files and mapping each
basename to its full path in a hash:

require 'find'

h = {}
Find.find "/home/jesus/ruby" do |file|
  h[File.basename(file)] = file
end

and then, read each line of the CSV file and search for the basename
in the hash:

require 'csv'

CSV.foreach("files.csv") do |old, new|
  if h[old]
    # rename h[old] to new, keeping it in the same directory
    File.rename(h[old], File.join(File.dirname(h[old]), new))
  end
end

Hope this helps,

Jesus.

Robert K. wrote:

On 04.09.2010 13:24, Stuart C. wrote:

I have a real head ache and would really appreciate any advice.

I have a folder with 100,000’s of files within it and I need to run a
rename process across these.

My issue is this is very very slow and is taking about an hour to do 10
file renames. Is there any way I can speed this up?

Brian gave excellent advice already. I’d just want to point out that
keeping hundred thousands of files in a single directory (do you?) is
generally slow on most file systems. If possible distributing them
across several directories generally helps speed up things.

Kind regards

robert

I agree, Brian gave some great code and the speed increase was very
impressive, so many thanks. The final block of code is given below.
Regarding the large number of files in a directory, I agree it's tough on
the file system; however, that is out of my control.

pdfFiles = '\data'

Find.find(pdfFiles) do |curPath|
  oldID = File.basename(curPath)
  newID = nameMap[oldID]
  next unless newID
  File.rename(curPath.to_s.strip, "#{pdfFiles}\\#{newID}.pdf")
end
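For completeness, the nameMap used above would be loaded from the CSV beforehand; a sketch with a throwaway CSV file (the column order, old name then new name, matches the thread):

```ruby
require 'csv'
require 'tempfile'

# Throwaway two-column CSV standing in for the real rename list.
csv = Tempfile.new(["renames", ".csv"])
csv.write("T111.pdf,T999.pdf\nT222.pdf,T888.pdf\n")
csv.close

# old basename => new basename, built once before the single scan
nameMap = {}
CSV.foreach(csv.path) do |old_name, new_name|
  nameMap[old_name] = new_name
end

nameMap["T111.pdf"]  # => "T999.pdf"
```

With the map in memory, each file met during the scan costs one hash lookup rather than another pass over the CSV.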

Many thanks again