Moving a large number of files (1,750,000+)

abeansits · November 9, 2008, 6:07pm

Hello fellow Rubyists!

I’m trying to impress my boss and co-workers with Ruby so that we
can hopefully start using it at work more often. I was given
the task of moving a large repository of images from one
source to the next. The repository consists of around 1,750,000
images and requires around 350 GB of space.
I thought this would be no match for Ruby!
Even though it proved no match for Ruby, it was a large match for me. =)

I have attached the source code to this post.
Please be gentle with me, I’m quite new to Ruby. =D

So far I have run a test on my local machine, and it took around 47 s to
copy 4,211 items. Calculating with this speed, it would take around
13 hours to copy the whole repository. That’s a lot of time.
If I present this to my co-workers, I know they will instantly blame Ruby
for this, even though I am the one to blame.

My question is this: How do I speed up my application?
I reused my file handle and skipped the printing to the console,
but it is still taking time.

Also, if anyone has previous experience of handling this many files,
any kind of tips are welcome. I’m quite worried that the array
containing the paths to all the files will flood the stack.
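
One way to sidestep the worry about a huge array of paths is never to build it. The sketch below is not the attached script (which is not shown in this thread); it assumes Ruby's standard `find` library, and `each_file` is a hypothetical helper name. It hands each path to a block as soon as it is discovered. As a side note, a Ruby array lives on the heap rather than the stack, so a full list would cost memory but should not overflow the stack.

```ruby
require "find"

# Yield every regular file under +root+, one path at a time, so that a list
# of all paths never has to be held in memory at once.
# (each_file is a hypothetical name, not taken from the original script.)
def each_file(root)
  Find.find(root) do |path|
    yield path if File.file?(path)
  end
end
```

A caller would use it as `each_file("/path/to/images") { |path| ... }`, where the path is a placeholder.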

Thanks in advance and my regards.
//Sebastian

abeansits · November 9, 2008, 7:15pm

On 09.11.2008 18:04, Sebastian N. wrote:

Hello fellow Rubyists!

I’m trying to impress my boss and co-workers with Ruby so that we
can hopefully start using it at work more often. I was given
the task of moving a large repository of images from one
source to the next. The repository consists of around 1,750,000
images and requires around 350 GB of space.

My question is this: How do I speed up my application?
I reused my file handle and skipped the printing to the console,
but it is still taking time.

Also, if anyone has previous experience of handling this many files,
any kind of tips are welcome. I’m quite worried that the array
containing the paths to all the files will flood the stack.

Sorry to disappoint you but this amount of copying won’t be really fast
regardless of programming language. You do not mention what a “source”
in your case is, what operating systems are involved and what transport
media you are intending to use (local, network). If you need to
transport using a network in my experience tar with a pipe works pretty
well. But no matter what you do, the slowest link will determine your
throughput: you cannot go faster than network speed or the speed that
your “sources” can read or write.

Here’s the tar variant, since you copy images I assume data is
compressed and does not need compression (on your favorite Unix shell
prompt):

$> ( cd "$source" && tar cf - . ) | ( ssh user@target "cd '$target' && tar xf -" )

If you can physically move the source disk to the target host and then
do a local copy with cp -a, that’s probably the fastest you can go,
unless the physical move takes ages (e.g. to the moon or other remote
locations).

Kind regards

robert

abeansits · November 9, 2008, 7:38pm

On Sunday 09 November 2008 01:12 pm, Robert K. wrote:

Sorry to disappoint you but this amount of copying won’t be really fast
regardless of programming language. You do not mention what a “source”
in your case is, what operating systems are involved and what transport
media you are intending to use (local, network). If you need to
transport using a network in my experience tar with a pipe works pretty

If you can physically move the source disk to the target host and then
do a local copy with cp -a, that’s probably the fastest you can go,
unless the physical move takes ages (e.g. to the moon or other remote
locations).

I agree with Robert, but before I saw his response I did some
calculations. Assuming all the images are the same size (about 200
KB), moving 4,211 of them in 47 seconds is a data rate close to 18
MB/sec. That’s faster than 100 Mb/sec Ethernet can carry, even before
counting any overhead due to collisions.
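
The arithmetic behind that estimate can be checked with a few lines of Ruby. This is only a sketch: the method names are made up for illustration, and the 200 KB average file size is the same assumption as above (350 GB divided by 1,750,000 images).

```ruby
# Throughput in MB/s for +files+ files of +avg_bytes+ each copied in +seconds+.
# (Hypothetical helper; decimal megabytes, i.e. 1 MB = 1,000,000 bytes.)
def throughput_mb_per_sec(files, avg_bytes, seconds)
  files * avg_bytes / seconds.to_f / 1_000_000
end

# Raw ceiling of a network link in MB/s: 8 bits per byte, no overhead counted.
def link_ceiling_mb_per_sec(link_mbit)
  link_mbit / 8.0
end

measured = throughput_mb_per_sec(4_211, 200_000, 47)
ceiling  = link_ceiling_mb_per_sec(100)

puts format("measured %.1f MB/s, 100 Mb/s Ethernet ceiling %.1f MB/s",
            measured, ceiling)
# prints "measured 17.9 MB/s, 100 Mb/s Ethernet ceiling 12.5 MB/s"
```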

That’s pretty fast for most channels. Are you moving data from one disk
to another on the same computer? Or over a high speed connection
between two computers? What is the raw hardware speed of the
interconnect?

I wouldn’t be too worried about the 13 hours, you’ve got a lot of data
to move.

Randy K.

abeansits · November 9, 2008, 10:15pm

On Sunday 09 November 2008 01:35 pm, Randy K. wrote:

I wouldn’t be too worried about the 13 hours, you’ve got a lot of data
to move.

PS: I wish I had added: Since all you’re doing is copying files, do it
from the CLI (as Robert suggested). There is no need to involve a
programming language, which just adds overhead. Then let us know how
many hours it takes that way, for comparison.

Randy K.

abeansits · November 10, 2008, 8:55am

First of all, thanks for your quick answer!
I was a bit tired when I asked the question, so I’m sorry
for the missing information.

Robert K. wrote:

On 09.11.2008 18:04, Sebastian N. wrote:

Hello fellow Rubyists!

I’m trying to impress my boss and co-workers with Ruby so that we
can hopefully start using it at work more often. I was given
the task of moving a large repository of images from one
source to the next. The repository consists of around 1,750,000
images and requires around 350 GB of space.

My question is this: How do I speed up my application?
I reused my file handle and skipped the printing to the console,
but it is still taking time.

Also, if anyone has previous experience of handling this many files,
any kind of tips are welcome. I’m quite worried that the array
containing the paths to all the files will flood the stack.

Sorry to disappoint you but this amount of copying won’t be really fast
regardless of programming language. You do not mention what a “source”
in your case is, what operating systems are involved and what transport
media you are intending to use (local, network). If you need to
transport using a network in my experience tar with a pipe works pretty
well. But no matter what you do, the slowest link will determine your
throughput: you cannot go faster than network speed or the speed that
your “sources” can read or write.

The target system I will use is a virtual Windows 2003 server with a
mounted network drive. Unfortunately I have no access to any of the
hardware.
But I know there is at least a 100 Mbit Ethernet connection between the
server and the mounted disk.
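
For a sense of scale, a best-case transfer time over such a link can be estimated as below. It is a rough sketch that ignores protocol overhead and per-file latency, and `min_transfer_hours` is a hypothetical helper name.

```ruby
# Best-case hours needed to push +total_bytes+ through a link of +link_mbit+
# Mbit/s, ignoring protocol overhead and per-file latency.
# (Hypothetical helper; decimal units, 1 GB = 10**9 bytes.)
def min_transfer_hours(total_bytes, link_mbit)
  bytes_per_sec = link_mbit * 1_000_000 / 8.0
  total_bytes / bytes_per_sec / 3600
end

puts min_transfer_hours(350 * 10**9, 100).round(1)
# prints 7.8
```

So even a fully saturated 100 Mbit link would need most of a working day for 350 GB, whatever tool does the copying.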

Here’s the tar variant, since you copy images I assume data is
compressed and does not need compression (on your favorite Unix shell
prompt):

$> ( cd "$source" && tar cf - . ) | ( ssh user@target "cd '$target' && tar xf -" )

Thanks for your tips, but it’s a Windows system.

If you can physically move the source disk to the target host and then
do a local copy with cp -a, that’s probably the fastest you can go,
unless the physical move takes ages (e.g. to the moon or other remote
locations).

Since our company outsourced the hardware maintenance, the moon or across
the street makes no difference. =(

Kind regards

robert

What I meant to ask was, in what way can I change my source code to be
more efficient?
Thanks a lot for your time.
//Sebastian

abeansits · November 10, 2008, 9:13am

Randy K. wrote:

On Sunday 09 November 2008 01:35 pm, Randy K. wrote:

I wouldn’t be too worried about the 13 hours, you’ve got a lot of data
to move.

You’re probably right. I will start the job on a Friday evening and let it
take its time.

PS: I wish I had added: Since all you’re doing is copying files, do it
from the CLI (as Robert suggested). There is no need to involve a
programming language, which just adds overhead. Then let us know how
many hours it takes that way, for comparison.

Randy K.

You’re probably right about this as well, but I can’t back out of the Ruby
corner now. I have already opened my mouth about Ruby too much; if I
change my method now it will make Ruby look really bad. =(

This is what I succeeded with:

I removed all of the console prints for each file. (This lowered the
time by about 20 s! I had no idea that output was so demanding.)
I kept the file handle open for writing to the process.log.
I also removed every line of unnecessary code in the critical part of my
application.
This lowered the time to around 17 s. I will now try to run the test in
the right environment.
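
A copy loop along those lines might look like the sketch below. It is not the attached script (which is not reproduced in this thread); `copy_tree` and its parameters are hypothetical names, and it assumes Ruby's standard `find`, `fileutils` and `pathname` libraries. The log file is opened once for the whole run and nothing is printed per file.

```ruby
require "fileutils"
require "find"
require "pathname"

# Copy every regular file under +source+ into +target+, keeping the relative
# directory layout. A single log handle stays open for the whole run and each
# copied file's relative path is written to it. Returns the number of files
# copied. (copy_tree is a hypothetical name.)
def copy_tree(source, target, log_path)
  root = Pathname.new(source)
  copied = 0
  File.open(log_path, "a") do |log|
    Find.find(source) do |path|
      next unless File.file?(path)

      relative = Pathname.new(path).relative_path_from(root).to_s
      destination = File.join(target, relative)
      FileUtils.mkdir_p(File.dirname(destination))
      FileUtils.cp(path, destination)
      log.puts(relative) # log only, no console output per file
      copied += 1
    end
  end
  copied
end
```

Keeping the log outside the source tree avoids copying the log into the target along with the images.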

Of course I will post the results here for you guys to see.
Thanks again for your time.
//Sebastian

abeansits · November 10, 2008, 9:01am

Thank you as well Kramer! I will try to clarify…

Randy K. wrote:

On Sunday 09 November 2008 01:12 pm, Robert K. wrote:

Sorry to disappoint you but this amount of copying won’t be really fast
regardless of programming language. You do not mention what a “source”
in your case is, what operating systems are involved and what transport
media you are intending to use (local, network). If you need to
transport using a network in my experience tar with a pipe works pretty

If you can physically move the source disk to the target host and then
do a local copy with cp -a, that’s probably the fastest you can go,
unless the physical move takes ages (e.g. to the moon or other remote
locations).

I agree with Robert, but before I saw his response I did some
calculations. Assuming all the images are the same size (about 200
KB), moving 4,211 of them in 47 seconds is a data rate close to 18
MB/sec. That’s faster than 100 Mb/sec Ethernet can carry, even before
counting any overhead due to collisions.

That’s pretty fast for most channels. Are you moving data from one disk
to another on the same computer? Or over a high speed connection
between two computers? What is the raw hardware speed of the
interconnect?

I know it is a very rough estimate, and the tests I performed were on
my MacBook Pro, from one folder to another. Of course when I run this
live, the environment will be very different. I just wanted to estimate
a minimum time for the copy.

I wouldn’t be too worried about the 13 hours, you’ve got a lot of data
to move.

Randy K.

abeansits · November 10, 2008, 9:31am

Sebastian N. [2008-11-10 17:11:08 +0900]:

language which is just added overhead. Then let us know how many hours
[...]
time by about 20 s! I had no idea that output was so demanding.)

I kept the file handle open for writing to the process.log.

I also removed every line of unnecessary code in the critical part of my
application.
This lowered the time to around 17 s. I will now try to run the test in
the right environment.

Of course I will post the results here for you guys to see.
Thanks again for your time.
//Sebastian

This may be a naive suggestion. It may be worthwhile to see if there
is a benefit in parallelizing the process using threads (split the
transfer jobs among multiple threads?).

saji

Saji N. Hameed

APEC Climate Center +82 51 668 7470
National Pension Corporation Busan Building 12F
Yeonsan 2-dong, Yeonje-gu, BUSAN 611705 [email protected]
KOREA

abeansits · November 10, 2008, 2:16pm

2008/11/10 Sebastian N.:

compressed and does not need compression (on your favorite Unix shell
prompt):

$> ( cd "$source" && tar cf - . ) | ( ssh user@target "cd '$target' && tar xf -" )

Thanks for your tips, but it’s a Windows system.

The command above works on a cygwin shell. Alternatively you can use
XCOPY or directly use the Windows Shell (Explorer).

If you can physically move the source disk to the target host and then
do a local copy with cp -a, that’s probably the fastest you can go,
unless the physical move takes ages (e.g. to the moon or other remote
locations).

Since our company outsourced the hardware maintenance, the moon or across
the street makes no difference. =(

What I meant to ask was, in what way can I change my source code to be
more efficient?

And the answer is and was: don’t bother too much because your transfer
is IO bound regardless of programming language or tool used.

Cheers

robert

abeansits · November 10, 2008, 1:51pm

On Mon, Nov 10, 2008 at 09:28, Saji N. Hameed wrote:

This may be a naive suggestion. It may be worthwhile to see if there
is a benefit in parallelizing the process using threads (split the
transfer jobs among multiple threads?).

I guess Ara Howard’s threadify
(http://codeforpeople.com/lib/ruby/threadify/) might be handy.

The usefulness of more threads depends on network saturation: measure
your network/disk throughput using plain system copy (maybe several
parallel ones), then measure what your script does.
I’m afraid if you’re going over Ethernet, one thread would be enough.
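
If the measurements do suggest that several copies in flight help, the work could be split over a few threads roughly as follows. This is a sketch using only Ruby's built-in `Thread` and `Queue` plus `fileutils`, not threadify; `parallel_copy` and its `workers:` keyword are hypothetical names, and `Queue#close` requires Ruby 2.3 or later.

```ruby
require "fileutils"

# Copy each [from, to] pair using +workers+ threads that pull jobs from a
# shared queue. (parallel_copy is a hypothetical name.)
def parallel_copy(pairs, workers: 4)
  queue = Queue.new
  pairs.each { |pair| queue << pair }
  queue.close # once the queue is closed and empty, pop returns nil

  threads = Array.new(workers) do
    Thread.new do
      while (pair = queue.pop)
        from, to = pair
        FileUtils.mkdir_p(File.dirname(to))
        FileUtils.cp(from, to)
      end
    end
  end
  threads.each(&:join) # join re-raises any error from a worker
end
```

Wrapping a call in `Benchmark.realtime { ... }` from the standard `benchmark` library, once with `workers: 1` and once with more, is one way to see whether the extra threads buy anything on a given link.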

I’d also suggest using File.directory? for testing whether the file is a
directory, instead of searching for ‘.’
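
To illustrate the difference: the first method below is only a guess at what a name-based check might look like (the original script is not shown here), and both method names are hypothetical. The second asks the filesystem via `File.directory?`.

```ruby
# Guess from the name alone: anything containing a dot is taken to be a file.
# (Assumed reading of the check being replaced; hypothetical name.)
def file_by_name?(path)
  File.basename(path).include?(".")
end

# Ask the filesystem instead of inspecting the name.
def file_on_disk?(path)
  !File.directory?(path)
end
```

The name-based guess gets a directory called `photos.2008` and a file called `README` wrong; the filesystem check handles both.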

Jano

abeansits · November 10, 2008, 3:10pm

Robert K. wrote:

2008/11/10 Sebastian N.:

compressed and does not need compression (on your favorite Unix shell
prompt):

$> ( cd "$source" && tar cf - . ) | ( ssh user@target "cd '$target' && tar xf -" )

Thanks for your tips, but it’s a Windows system.

The command above works on a cygwin shell. Alternatively you can use
XCOPY or directly use the Windows Shell (Explorer).

OK, great tip. I will keep it as a backup plan.
The thing is, I need logging of all files being transferred so I know if
something is missing.
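
Whatever tool does the copying, a separate pass can report what is missing afterwards. The sketch below assumes both trees are reachable as ordinary paths from one machine; `missing_files` is a hypothetical name, and it only compares existence and size, not contents.

```ruby
require "find"
require "pathname"

# Return the relative paths of regular files under +source+ that are absent
# from +target+ or differ in size there. (missing_files is a hypothetical
# name; file contents are not compared.)
def missing_files(source, target)
  root = Pathname.new(source)
  missing = []
  Find.find(source) do |path|
    next unless File.file?(path)

    relative = Pathname.new(path).relative_path_from(root).to_s
    copy = File.join(target, relative)
    missing << relative unless File.file?(copy) && File.size(copy) == File.size(path)
  end
  missing
end
```

An empty result means every source file has a same-sized counterpart in the target.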

If you can physically move the source disk to the target host and then
do a local copy with cp -a, that’s probably the fastest you can go,
unless the physical move takes ages (e.g. to the moon or other remote
locations).

Since our company outsourced the hardware maintenance, the moon or across
the street makes no difference. =(

What I meant to ask was, in what way can I change my source code to be
more efficient?

And the answer is and was: don’t bother too much because your transfer
is IO bound regardless of programming language or tool used.

OK! I will listen to your tips.
Thanks for all your input Robert.
Best regards
//Sebastian

Cheers

robert

abeansits · November 10, 2008, 3:11pm

Jano S. wrote:

On Mon, Nov 10, 2008 at 09:28, Saji N. Hameed wrote:

This may be a naive suggestion. It may be worthwhile to see if there
is a benefit in parallelizing the process using threads (split the
transfer jobs among multiple threads?).

I guess Ara Howard’s threadify
(http://codeforpeople.com/lib/ruby/threadify/) might be handy.

The usefulness of more threads depends on network saturation - measure
your network/disk throughput
using plain system copy (maybe several parallel ones), then measure
what your script does.
I’m afraid if you’re going over ethernet, one thread would be enough.

That’s what I thought too. Thanks for confirming this.

I’d also suggest using File.directory? for testing if the file is
directory, instead of searching for ‘.’

I will definitely do this.
Thanks for your input Jano.
Best regards
//Sebastian

Jano

abeansits · November 10, 2008, 3:13pm

Saji N. Hameed wrote:

Sebastian N. [2008-11-10 17:11:08 +0900]:

language which is just added overhead. Then let us know how many hours
[...]
time by about 20 s! I had no idea that output was so demanding.)

I kept the file handle open for writing to the process.log.

I also removed every line of unnecessary code in the critical part of my
application.
This lowered the time to around 17 s. I will now try to run the test in
the right environment.

Of course I will post the results here for you guys to see.
Thanks again for your time.
//Sebastian

This may be a naive suggestion. It may be worthwhile to see if there
is a benefit in parallelizing the process using threads (split the
transfer jobs among multiple threads?).

saji

Thanks for your input. If the time won’t go down any more, I will
definitely try this.
Best regards
//Sebastian

abeansits · November 10, 2008, 9:01pm

Robert K. wrote:

2008/11/10 Sebastian N.:

The command above works on a cygwin shell. Alternatively you can use
XCOPY or directly use the Windows Shell (Explorer).

OK, great tip. I will keep it as a backup plan.
The thing is, I need logging of all files being transferred so I know if
something is missing.

xcopy writes all filenames to the console. You can easily redirect
this to a file.

xcopy from to /e /i > log

For tar just add letter “v” for output of file names, e.g.,

( cd "$source" && tar cvf - . 2>copied_files ) | ( ssh user@target "cd '$target' && tar xf -" )

Cheers

robert

xcopy has (or used to have) problems with long pathnames. Microsoft
provides “robocopy” (standard in Vista, downloadable for other versions).
It has features like tolerance for network outages and the ability to copy
ACLs on NTFS. See the Wikipedia article on Robocopy.

I’ve no experience with xxcopy (an improved xcopy). It looks good too.

hth,

Siep

abeansits · November 10, 2008, 4:31pm

2008/11/10 Sebastian N.:

The command above works on a cygwin shell. Alternatively you can use
XCOPY or directly use the Windows Shell (Explorer).

OK, great tip. I will keep it as a backup plan.
The thing is, I need logging of all files being transferred so I know if
something is missing.

xcopy writes all filenames to the console. You can easily redirect
this to a file.

xcopy from to /e /i > log

For tar just add letter “v” for output of file names, e.g.,

( cd "$source" && tar cvf - . 2>copied_files ) | ( ssh user@target "cd '$target' && tar xf -" )

Cheers

robert