Fast alternatives to "File" and "IO" for large numbers of files?

People,

I have a script that does:

  • statistical processing of data in 50x32x20 (32,000) large input
    files

  • writes a small text file (22 lines with one or more columns of
    numbers) for each input file

  • reads all the small files back in again for final processing.

Profiling shows that IO is taking up more than 60% of the time - short
of making fewer, larger files for the data (which is inconvenient for
random viewing/processing of individual results), are there other
alternatives to using the “File” and “IO” classes that would be faster?

Thanks,

Phil.

Hi, could you be more specific about what you do with the small files
- do you read/write them line by line or as whole files? Many rapid
file operations may be slow anyway because of file system overhead, so
try to do fewer file operations; for example, writing each file as a
single string serves the IO cache well. Alternatively, do the many
file writes/reads in a separate thread so that the IO does not
interfere with your non-IO calculations, if you have any.
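
A minimal sketch of the single-write idea (the variable and file names
here are made up):

  # Build the whole 22-line result in memory, then write it in one go
  # instead of 22 separate writes.
  lines = rows.map { |row| row.join(" ") }   # `rows` stands in for one result set
  File.open("result_0001.txt", "w") { |f| f.puts(lines) }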

On Thu, 24 Feb 2011 12:09:48 +0900, Philip R. wrote:


I can think of two approaches here.

First, you can write one large file (perhaps creating it in memory
first) and then split it afterwards.

Second, if you’re on *nix, you can write your output files to a tmpfs.

Both should reduce the number of seeks and improve performance.
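
A rough sketch of the first approach (directory and file names are
made up; on many Linux systems /dev/shm is a tmpfs mount, which covers
the second approach as well):

  require "stringio"

  # Accumulate every per-result section in one in-memory buffer...
  buf = StringIO.new
  results.each do |key, rows|                 # assumes a { key => rows } structure
    buf.puts "== #{key} =="
    rows.each { |row| buf.puts row.join(" ") }
  end

  # ...then write it to disk (or to "/dev/shm/all_results.txt") in a single call.
  File.open("all_results.txt", "w") { |f| f.write(buf.string) }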

If you read in all the data files and build a single Ruby data structure
which contains all the data you’re interested in, you can dump it out
like this:

File.open("foo.msh", "wb") { |f| Marshal.dump(myobj, f) }

And you can reload it in another program like this:

myobj = File.open("foo.msh", "rb") { |f| Marshal.load(f) }

This is very fast.

Philip R. wrote in post #984112:

If you read in all the data files and build a single Ruby data
structure which contains all the data you’re interested in, you can
dump it out like this:

File.open("foo.msh", "wb") { |f| Marshal.dump(myobj, f) }

I did read up about this stuff but I have to have human readable files.

You can use YAML.dump and .load too. Not as fast, and rather buggy(*),
but it would do the job.

(*) There are various strings which ruby’s default YAML implementation
(syck) cannot serialize and deserialize back to the same string. These
might have been fixed, or you could use a different YAML implementation.
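
A minimal sketch of the YAML round trip (the file name is made up; on
Ruby 1.8/1.9 this goes through syck unless the psych engine is
selected):

  require "yaml"

  # Dump the aggregated data structure to a human-readable file...
  File.open("results.yml", "w") { |f| YAML.dump(myobj, f) }

  # ...and read it back for the final processing pass.
  myobj = YAML.load_file("results.yml")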

On Thu, Feb 24, 2011 at 4:09 AM, Philip R. [email protected]
wrote:

making fewer, larger files for the data (which is inconvenient for random
viewing/ processing of individual results) are there other alternatives to
using the “File” and “IO” classes that would be faster?

I think whatever you do, as long as you do not get rid of the IO or
improve the IO access patterns, your performance gains will only be
marginal. Even a C extension would not help you if you stick with
the same IO patterns.

We should probably learn more about the nature of your processing but
considering that you only write 32,000 * 22 * 80 (estimated line
length) = 56,320,000 bytes (~ 54MB) NOT writing those small files is
probably an option. Burning 54MB of memory in a structure suitable
for later processing (i.e. you do not need to parse all those small
files) is a small price compared to the large amount of IO you need to
do to read that data back again (plus the CPU cycles for parsing).
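
A minimal sketch of that first option (all the names here -
`input_files`, `process`, `finalize` - are placeholders for your own
code):

  results = {}

  input_files.each do |path|           # the 32,000 large input files
    stats = process(path)              # hypothetical per-file statistics step
    key   = path.sub(/\.dat\z/, "")    # key mirrors what the small file would be called
    results[key] = stats               # keep the 22-line result in memory
  end

  # Final processing works straight off the hash - nothing to read back or parse.
  results.each { |key, stats| finalize(key, stats) }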

The second best option would be to keep the data in memory as before
but still write those small files if you really need them (for example
for later processing). In this case you could put this in a separate
thread so your main processing can continue on the state in memory.
That way you’ll gain another improvement.
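
A minimal sketch of such a writer thread (assuming the same in-memory
`results` hash as above; the queue hands the slow writes off to a
dedicated thread):

  require "thread"

  queue  = Queue.new

  # Writer thread: pops (path, text) pairs and does the file IO.
  writer = Thread.new do
    while (job = queue.pop)
      path, text = job
      File.open(path, "w") { |f| f.write(text) }
    end
  end

  # Main thread keeps computing and just enqueues the finished results.
  results.each do |key, stats|
    queue << ["#{key}.txt", stats.map { |row| row.join(" ") }.join("\n")]
  end

  queue << nil    # tell the writer there is no more work
  writer.join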

For reading of the large files I would use at most two threads because
I assume they all reside on the same filesystem. With two threads one
can do calculations (e.g. parsing, aggregating) while the other thread
is doing IO. If you have more threads you’ll likely see a slowdown
because you may introduce too many seeks etc.
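
And a rough sketch of the two-thread split for reading (again,
`large_files`, `parse` and `aggregate` are placeholders): one thread
streams the files in while the main thread parses and aggregates.

  require "thread"

  queue  = SizedQueue.new(4)   # small buffer keeps the reader just ahead of the parser

  reader = Thread.new do
    large_files.each { |path| queue << [path, File.read(path)] }
    queue << nil
  end

  while (item = queue.pop)
    path, data = item
    aggregate(path, parse(data))   # hypothetical parsing/aggregation steps
  end

  reader.join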

Kind regards

robert

People,

Thanks to all who responded - I have concatenated the replies for ease
of response:

On 2011-02-24 19:15, pp wrote:

do the many file writes/reads in a separate thread so that the IO
does not interfere with your non-IO calculations

Each individual small file is written in one go, i.e. the file is
opened, written to and closed - there is no re-opening and further
writing. See later for the current approach.

On 2011-02-24 19:19, Peter Z. wrote:

I can think of two approaches here.

First, you can write one large file (perhaps creating it in memory
first) and then split it afterwards.

Second, if you’re on *nix, you can write your output files to a
tmpfs.

Both should reduce the number of seeks and improve performance.

After staying up all night, I eventually settled on a hash table
output via YAML to ONE very large file. I need a human-friendly form
for spot checking of statistical calculations, so I have used a hash
table whose keys let me find a particular calculation in the big file
in the same way I would have found it in the similarly named
subdirectories. I haven’t actually implemented this on the full system
yet, so it will be interesting to see if Vim can handle opening a
32,000 x 23 line file (and bigger, actually, if an individual small
file is bigger than a 23x1 array).
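
For concreteness, a minimal sketch of what I mean (names made up; each
key mirrors the old subdirectory/file name, so a particular result can
still be found by searching the big YAML file):

  require "yaml"

  # Collect every result under a key that mirrors the old path,
  # e.g. "run07/sensor12/case03" instead of run07/sensor12/case03.txt.
  all = {}
  each_result do |run, sensor, case_id, stats|   # hypothetical iterator over results
    all["#{run}/#{sensor}/#{case_id}"] = stats
  end

  File.open("all_results.yml", "w") { |f| YAML.dump(all, f) }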

On 2011-02-24 19:52, Robert K. wrote:

I think whatever you do, as long as you do not get rid of the IO or
improve the IO access patterns, your performance gains will only be
marginal. Even a C extension would not help you if you stick with
the same IO patterns.

Right.

We should probably learn more about the nature of your processing
but considering that you only write 32,000 * 22 * 80 (estimated line
length) = 56,320,000 bytes (~ 54MB) NOT writing those small files is
probably an option. Burning 54MB of memory in a structure suitable
for later processing (i.e. you do not need to parse all those small
files) is a small price compared to the large amount of IO you need
to do to read that data back again (plus the CPU cycles for
parsing).

Yep - I came to that conclusion too and went for one big hash table and
one file.

The second best option would be to keep the data in memory as before
but still write those small files if you really need them (for
example for later processing). In this case you could put this in a
separate thread so your main processing can continue on the state in
memory. That way you’ll gain another improvement.

Interesting idea, but I’m not sure how to actually implement that; I
will see how the hash table/one file approach goes first.

For reading of the large files I would use at most two threads
because I assume they all reside on the same filesystem. With two
threads one can do calculations (e.g. parsing, aggregating) while the
other thread is doing IO. If you have more threads you’ll likely see
a slowdown because you may introduce too many seeks etc.

OK, this idea might help for the next stage.

On 2011-02-24 20:02, Brian C. wrote:

If you read in all the data files and build a single Ruby data
structure which contains all the data you’re interested in, you can
dump it out like this:

File.open("foo.msh", "wb") { |f| Marshal.dump(myobj, f) }

I did read up about this stuff but I have to have human readable files.

And you can reload it in another program like this:

myobj = File.open("foo.msh", "rb") { |f| Marshal.load(f) }

This is very fast.

I might check this out as an exercise!

Thanks to all again!

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]