Counting the files in a directory

I’m writing some scripts to help manage a mail scanner used at my
work. Being a mail scanner, it’s got huuuuUUUge quarantine
directories.

Now, I know I can do something along the lines of:

Dir.open("/foo").collect.length-2 #if you’re wondering, the -2 is to
ignore . and …

to get a count of what’s in a directory, but the problem there is,
it’s rather slow when you run it in a directory with a few thousand
files on a server under a heavy load (4.5 > load average > 2).

After perusing the Dir, Find and Stat classes, I haven’t seen a better
way.
I thought that perhaps there was some sort of system call, at least in
Real OSes™ (Linux, *BSD, Unix, etc.), that would return the number of
files inside a directory. Something that would hopefully return in a
quarter or an eighth of a second, rather than in 4 or 8 (or 20…) seconds.

Any clues?

Thanks,
Kyle

On 11.01.2008 16:19, Kyle S. wrote:

I’m writing some scripts to help manage a mail scanner used at my
work. Being a mail scanner, it’s got huuuuUUUge quarantine
directories.

Now, I know I can do something along the lines of:

Dir.open("/foo").collect.length-2 #if you’re wondering, the -2 is to
ignore . and …

You could as well do

count = Dir.entries("/foo").size - 2

Any clues?

The major cost will be IO, and that cannot be changed, I guess. You could
however do some form of caching: read the size and the last modification
date of each directory you are interested in and store that in a Hash (and
write it to disk via Marshal between invocations if your process terminates
in between). Then you only need to check whether the modification date has
changed, and only read the directory if it has. The disadvantage is one
extra IO operation - albeit one that pulls just a single block, so it might
pay off.
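
Untested sketch of what I mean (the cache file location is only an example):

CACHE_FILE = "/var/tmp/dircount.cache"   # illustrative location

cache = File.exist?(CACHE_FILE) ? Marshal.load(File.read(CACHE_FILE)) : {}

def count_entries(dir, cache)
  mtime = File.mtime(dir)
  hit = cache[dir]
  if hit && hit[:mtime] == mtime
    hit[:count]                          # directory unchanged, reuse the cache
  else
    count = Dir.entries(dir).size - 2    # read the directory only when needed
    cache[dir] = { :mtime => mtime, :count => count }
    count
  end
end

puts count_entries("/foo", cache)

# Persist the cache between invocations.
File.open(CACHE_FILE, "wb") { |f| f.write(Marshal.dump(cache)) }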

Kind regards

robert

Dir.entries seems fairly identical to collect, and it does look
nicer…
but yeah, still slow.

The problem with caching is that we only keep quarantine directories
around for 10 days, due to their size and the relative rarity of us
needing to pull something out of them. One reason for writing this as a
script is that we recover rarely enough that whoever is doing it has
forgotten how to recover. Still, it’s often enough that we want to be
able to do it easily.

In other cases than a mail system, caching would be a very good idea
though.

I’ll try to read more of the C stuff for handling files/directories
in Unix. I can hold out hope for a while.

Thanks,
Kyle

On 11.01.2008 19:14, Kyle S. wrote:

Dir.entries seems fairly identical to collect, and it does look nicer…
but yeah, still slow.

As I said: it’s the IO for crowded directories (see also Mike’s reply).

The problem with caching is that we only keep quarantine directories
around for 10 days, due to their size and the relative rarity of us
needing to pull something out of them. One reason for writing this as a
script is that we recover rarely enough that whoever is doing it has
forgotten how to recover. Still, it’s often enough that we want to be
able to do it easily.

In other cases than a mail system, caching would be a very good idea though.

I am not sure I understand why you think it is a bad idea. If you only
cache the number of files per directory where is the issue? Or is this
script not invoked regularly? Probably I am missing a bit of your use
case.

I’ll try to read more of the C stuff for handling files/directories
in Unix. I can hold out hope for a while.

Won’t help. It’s really the size of the directory. Maybe you can give a
little more detail about your script and when it’s used, so we can come
up with better suggestions.

Cheers

robert

Kyle S. wrote:

Dir.entries seems fairly identical to collect, and it does look
nicer…
but yeah, still slow.

The problem with caching is that we only keep quarantine directories
around for 10 days, due to their size and the relative rarity of us
needing to pull something out of them. One reason for writing this as a
script is that we recover rarely enough that whoever is doing it has
forgotten how to recover. Still, it’s often enough that we want to be
able to do it easily.

If there’s a large number of files in these directories, that’s probably
the source of the slowness, not the method used to get the list of
entries.

Many filesystems (some less than others) don’t behave as well when you
get a “large” number of files in one directory. I think the rule of
thumb I’ve used for ext2 filesystems is you’ll start to notice a delay
when you get a few hundred entries, and you’ll start to feel it when you
have thousands.

One way around this (short of installing or upgrading to an underlying
filesystem that handles these cases better, XFS for example) is to split
files out into a directory tree based either on the filename directly or
on a hash made from the real filename: say, take an MD5 hex string of the
filename and make two levels of directories based on the first 4 hex
digits, 00/00, 00/01, …, ff/fe, ff/ff, so that 00/00 contains all files
whose hashed filename begins “0000…”, etc. The downside of this is that
you either have to walk the entire tree to see the contents, or keep an
external index of the contents (which would eliminate your needing to do
what you’re trying to do, and the justification for splitting things up,
but . . . :).
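
For example (untested; the paths here are only illustrative):

require 'digest/md5'
require 'fileutils'

# Two directory levels taken from the first four hex digits of the MD5
# of the file name.
def hashed_path(base, filename)
  hex = Digest::MD5.hexdigest(filename)
  File.join(base, hex[0, 2], hex[2, 2], filename)   # e.g. base/3f/a9/filename
end

src  = "/quarantine/incoming/some_message"
dest = hashed_path("/quarantine/tree", File.basename(src))
FileUtils.mkdir_p(File.dirname(dest))
FileUtils.mv(src, dest)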

regards,

Siep

Should be read as:

require 'win32ole'
fso = WIN32OLE.new("Scripting.FileSystemObject")
folder = fso.GetFolder("C:/WINDOWS/system32")
puts folder.Files.count

Kyle S. wrote:
(…)

I thought that perhaps there was some sort of system call, at least in
Real OSes™ (Linux, *BSD, Unix, etc.), that would return the number of
files inside a directory. Something that would hopefully return in a
quarter or an eighth of a second, rather than in 4 or 8 (or 20…) seconds.

Any clues?

Thanks,
Kyle

On Windows there is such a call. With Ruby you have to take a bit of a
(known) detour to get there:

require 'win32ole'
fso = WIN32OLE.new("Scripting.FileSystemObject")
folder = fso.GetFolder("C:/WINDOWS/system32
puts folder.Files.count

regards,

Siep

2008/1/12, Siep K. [email protected]:

puts folder.Files.count

Did you verify that this is faster? I am skeptical, because this call
does basically the same thing: it gets the list of files in the directory
and counts them. I would expect a speedup only if there were an API
function that directly returned the number of files.

Kind regards

robert

2008/1/14, Siep K. [email protected]:

robert

fso_40 | 0.020000 0.080000 0.100000 ( 0.110000)
Dir_40 | 0.050000 0.071000 0.121000 ( 0.120000)

Apparently, for small directories it doesn’t make much difference; for
large directories it does. (Filesystem: NTFS.)

Amazing. So there is probably some room for improvement of the
Windows build of Ruby. :)

Btw, you did not do the subtraction of two - does GetFolder not return
“.” and “..”?

Kind regards

robert

Robert K. wrote:

2008/1/12, Siep K. [email protected]:

puts folder.Files.count

Did you verify that this is faster? I am skeptical, because this call
does basically the same thing: it gets the list of files in the directory
and counts them. I would expect a speedup only if there were an API
function that directly returned the number of files.

Kind regards

robert

require 'win32ole'
require 'benchmark'

ldirname = "C:/WINDOWS/system32"               # 2500+ files
mdirname = "C:/ruby/lib/ruby/1.8"              # 800+ files
sdirname = "C:/ruby/lib/ruby/1.8/i386-mswin32" # 40+ files
@fso = WIN32OLE.new("Scripting.FileSystemObject")
n = 500

Benchmark.bmbm do |x|

  x.report("fso_mixed |"){n.times{fso(ldirname);fso(mdirname);fso(sdirname)}}
  x.report("Dir_mixed |"){n.times{dir(ldirname);dir(mdirname);dir(sdirname)}}
  x.report("fso_2500  |"){n.times{fso(ldirname)}}
  x.report("Dir_2500  |"){n.times{dir(ldirname)}}
  x.report("fso_800   |"){n.times{fso(mdirname)}}
  x.report("Dir_800   |"){n.times{dir(mdirname)}}
  x.report("fso_40    |"){n.times{fso(sdirname)}}
  x.report("Dir_40    |"){n.times{dir(sdirname)}}

  def fso(dirname)
    folder = @fso.GetFolder(dirname)
    count = folder.Files.count
  end

  def dir(dirname)
    count = Dir.entries(dirname).size - 2
  end

end

results in:

              user     system      total        real

fso_mixed | 0.360000 1.222000 1.582000 ( 1.673000)
Dir_mixed | 3.635000 1.382000 5.017000 ( 5.157000)
fso_2500 | 0.271000 1.071000 1.342000 ( 1.362000)
Dir_2500 | 3.305000 1.282000 4.587000 ( 4.697000)
fso_800 | 0.040000 0.120000 0.160000 ( 0.160000)
Dir_800 | 0.170000 0.100000 0.270000 ( 0.281000)
fso_40 | 0.020000 0.080000 0.100000 ( 0.110000)
Dir_40 | 0.050000 0.071000 0.121000 ( 0.120000)

Apparently, for small directories it doesn’t make much difference; for
large directories it does. (Filesystem: NTFS.)

Siep

Robert,
The script itself won’t be run as routinely as the
directories are rotated. The directories have a daily rotation, so
there are only the most recent 10 days available at once, but the
script itself may only be invoked once or twice a month, at most.

I understand that the size of the directory itself is a problem, but I
was hoping that somehow there was a way to get a simple, more
efficient count. I know the b-tree based file systems are somewhat
new in Unix &amp; Unix-like systems; I was just hoping there was some more
efficient way :)

The script itself (as it stands now, albeit slower than I would have
liked) does the following:
With no arguments, it lists the number of quarantined and spam messages
being held for each day.
With a date, it lists the file names of the quarantined messages, as well
as their recipients.
With a date and the file name of a quarantined message, it warns the
user, asks them whether they want to continue, then moves the message
back into the appropriate queue to be delivered.
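
Roughly, the dispatch could look something like this (a simplified sketch;
the helper methods are only placeholders for the real quarantine-directory
code):

def list_days
  puts "counts of quarantined and spam messages, per day..."
end

def list_messages(date)
  puts "file names and recipients for #{date}..."
end

def recover_message(date, file)
  puts "moving #{file} from #{date} back into the delivery queue..."
end

case ARGV.size
when 0
  list_days
when 1
  list_messages(ARGV[0])
when 2
  date, file = ARGV
  print "Really re-queue #{file} from #{date}? [y/N] "
  recover_message(date, file) if $stdin.gets.to_s.strip.downcase == "y"
else
  abort "usage: quarantine [DATE [FILENAME]]"
end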

Thanks

–Kyle

Robert K. wrote:

Btw, you did not do the subtraction of two - does GetFolder not return
“.” and “..”?

Kind regards

robert

require 'win32ole'

dir = WIN32OLE.new("Scripting.FileSystemObject").GetFolder("C:/ruby/")
dir.Files.each{|file| puts file.name}

no dots here

puts
dir.SubFolders.each{|subdir| puts subdir.name}

still no dots

I don’t know where the dots have gone, but I don’t miss them in this
context.

regards,

Siep

On Jan 14, 2008 10:16 AM, Siep K. [email protected] wrote:

Siep,
I’ve had a bit of experience with the Win32OLE objects in ruby
before (at my last job). You’re right, they are a detour, though
sometimes win32ole may feel more like a byway where your car breaks
down and the only place for you to stay that night is the Bates
Motel…

Humm. There must be some way…

–Kyle

On 14.01.2008 18:12, Kyle S. wrote:

On Jan 14, 2008 10:16 AM, Siep K. [email protected] wrote:

Siep,
I’ve had a bit of experience with the Win32OLE objects in ruby
before (at my last job). You’re right, they are a detour, though
sometimes win32ole may feel more like a byway where your car breaks
down and the only place for you to stay that night is the Bates
Motel…

You sure mean the “Gates Motel”, don’t you? :)

robert

On Jan 11, 2008 1:06 PM, Mike F. [email protected]
wrote:

able to do it easily.
have thousands.
external index of the contents (which would eliminate your needing to do
what you’re trying to do and the justification for splitting things up,
but . . . :).


Mike,
I’ve been an advocate of using the right file system for the
job for ages now, but the sad fact is, this is running on a rather
old version of RedHat, which doesn’t support anything real other than
ext2 &amp; 3. As for our possible upgrade paths for this box, it would
still be RedHat, or a clone (CentOS). From what I can see, they still
don’t support modern file systems by default. Admittedly I’m tempted
to add the support myself (it’s not hard), but then it would bring up the
“it’s a production system” argument here.

sigh
–Kyle

On Sat, 12 Jan 2008 08:54:42 +0900, Siep K. wrote:

require 'win32ole'
fso = WIN32OLE.new("Scripting.FileSystemObject")
folder = fso.GetFolder("C:/WINDOWS/system32")
puts folder.Files.count

Oh, that’s quite an interesting way to work with files and folders :)

-Thufir

Kyle S. wrote:

I’ll try and read more of the C stuff for handling files/directories
in unix. I can hold out hope for awhile.

Thanks,
Kyle

You may have already gotten here…
What kind of times does this give? (The first run will include the
initial compilation time.)
You can modify it to meet your needs (if you have questions, just post
back) - see man scandir.
You can set up a filter function so it returns counts only for specific
file matches (a rough sketch of such a filter follows the snippet below).
As is, it returns a count for all entries, visible and hidden.

for rubyinline see:

https://rubyforge.org/projects/rubyinline

-----------snip dircount.rb--------------------------------
require 'inline'

class DirCount
inline do | builder |
builder.include '<dirent.h>'
builder.include '<stdio.h>'
builder.c "
int count() {
struct dirent **namelist;
int n;
int count;

            count = n = scandir(\".\", &namelist, 0, 0);
            if (n < 0)
                perror(\"scandir\");
            else {
                while(n--) {
                /* printf(\"%s\n\", namelist[n]->d_name);*/
                free(namelist[n]);
                }
                free(namelist);
            }

            return (count);
        }"
end

end

dc = DirCount.new()
puts dc.count()
-----------snip--------------------
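
If you want to skip “.” and “..” (or count only entries matching some
pattern), scandir() takes a filter function as its third argument. A rough,
untested sketch of the extra bits to add inside the same inline block:

builder.include '<string.h>'
builder.prefix "
    static int skip_dots(const struct dirent *entry) {
        return strcmp(entry->d_name, \".\") != 0 &&
               strcmp(entry->d_name, \"..\") != 0;
    }"

# ...and then in count(), pass the filter instead of the first 0:
#     count = n = scandir(\".\", &namelist, skip_dots, 0);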

Thufir wrote:

On Sat, 12 Jan 2008 08:54:42 +0900, Siep K. wrote:

require 'win32ole'
fso = WIN32OLE.new("Scripting.FileSystemObject")
folder = fso.GetFolder("C:/WINDOWS/system32")
puts folder.Files.count

Oh, that’s quite an interesting way to work with files and folders :)

-Thufir

Yes, indeed; in fact it’s quite efficient according to the benchmark
tests above!
By the way, is there a link or documentation listing the classes/methods
that can be used, like GetFolder, GetFile, etc.? So far I only know about
these 2 methods…

What do you want to do with it?