Repeatedly open file or save entire file to memory?

I want to make sure I do what is most efficient when dealing with
multiple and potentially large files.

I need to take row(n) and row(n+1) from a file and use the data to do
things in other parts of my program. Then the program will iterate by
incrementing n. I may have up to 30 files, each having 50,000 rows.

My question is should I read row(n) and row(n+1), accessing the file
again and again on each iteration of the main program? Or should I just
read the whole file into memory (say, an array) then just grab items
from the array by index in the main program?
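
In code, the two options I am weighing look roughly like this (a minimal
sketch; "data.txt" and the loop body are placeholders):

  # Option 1: hit the disk on every lookup. Each call re-scans the file
  # from the top, so iterating over a whole file this way re-reads
  # earlier rows again and again.
  def row_pair(path, n)
    pair = []
    File.foreach(path).with_index do |line, i|
      pair << line.chomp if i == n || i == n + 1
      break if i > n + 1
    end
    pair
  end

  # Option 2: slurp the whole file into an array once, then index into it.
  # 30 files x 50,000 rows of modest width fits comfortably in memory.
  lines = File.readlines("data.txt").map { |l| l.chomp }
  lines.each_cons(2) do |row_n, row_next|
    # ... use row_n and row_next in the rest of the program ...
  end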

2009/9/17 Jason L. [email protected]:

> from the array by index in the main program?
Other schemes can be devised too:

  1. read the file once, remembering indexes for every file and row
     (IO#tell), and then access rows via IO#seek (see the sketch below)

  2. since you are incrementing n, read row n, remember the position,
     read row n+1; next time round, #seek to the position and continue
     reading

  3. as 2, but remember line n+1 so you do not have to read it again

  4. if the access pattern to files is not round robin but different,
     you might get better results by storing more information in memory
     for least recently accessed files

  5. read files in chunks of x lines and remember them in memory, thus
     reducing file accesses

It really depends on what you do with those files, what your access
patterns are, etc.
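
For illustration, scheme 1 might be sketched like this (a minimal
sketch; the file name is a placeholder and error handling is omitted):

  # Build a byte-offset index for every row in a single pass.
  offsets = []
  file = File.open("data.txt")     # placeholder file name
  until file.eof?
    offsets << file.tell           # byte offset where row i starts
    file.gets                      # consume row i
  end

  # Later, jump straight to any row without rescanning the file.
  def read_row(file, offsets, n)
    file.seek(offsets[n], IO::SEEK_SET)
    file.gets.chomp
  end

  n = 4711                         # arbitrary example index
  row_n    = read_row(file, offsets, n)
  row_next = read_row(file, offsets, n + 1)
  file.close

The index costs one full pass per file up front, but every access after
that is a single seek plus one read.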

Kind regards

robert

Jason,

> I want to make sure I do what is most efficient when dealing with
> multiple and potentially large files.

you can use ruby-prof for profiling of your code. It’s available as
a gem.
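
A minimal session looks roughly like this (the profiled block is only a
stand-in for your real code):

  require 'ruby-prof'

  result = RubyProf.profile do
    # ... the file-handling code you want to measure ...
    File.readlines("data.txt").each_cons(2) { |a, b| }
  end

  # Flat report: which methods the time was actually spent in.
  RubyProf::FlatPrinter.new(result).print(STDOUT)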

Best regards,

Axel

2009/9/17 Axel E. [email protected]:

> Jason,
>
>> I want to make sure I do what is most efficient when dealing with
>> multiple and potentially large files.
>
> you can use ruby-prof for profiling of your code. It’s available as
> a gem.

I consider Jason’s question a design-level question. That’s not
something a profiler can really help with. Of course you can code up
alternatives and measure performance. But that can only tell you which
of several versions is fastest - it cannot tell you how you should
change your design to improve it.

In this case the performance bottlenecks are more in the area of disk
IO, and all a profiler can tell you is how much of your time you spend
in IO - but not how to minimize that.

Kind regards

robert

Robert K. wrote:

> In this case the performance bottlenecks are more in the area of disk
> IO, and all a profiler can tell you is how much of your time you
> spend in IO - but not how to minimize that.

I agree, although that argument doesn’t make much sense.

A profiler can never tell you how to minimize anything; it can only
show you where to look for optimizations. In this case, of course,
that’s futile, since we already know where to optimize: the IO.

Greetz!

You could put the data into a database, which should be performant
enough and still very easy to use, even if your lookup pattern changes
in the future.
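
For example, with the sqlite3 gem a one-time load might look roughly
like this (table layout and file names are only an illustration):

  require 'sqlite3'

  db = SQLite3::Database.new("rows.db")
  db.execute("CREATE TABLE IF NOT EXISTS rows " \
             "(file TEXT, line_no INTEGER, content TEXT)")

  # One-time load; a transaction makes the bulk insert much faster.
  db.transaction do
    Dir.glob("data/*.txt").each do |path|
      File.foreach(path).with_index do |line, i|
        db.execute("INSERT INTO rows VALUES (?, ?, ?)",
                   [path, i, line.chomp])
      end
    end
  end

  # Fetch rows n and n+1 of one file with a single query.
  n = 4711
  pair = db.execute("SELECT content FROM rows WHERE file = ? " \
                    "AND line_no IN (?, ?) ORDER BY line_no",
                    ["data/a.txt", n, n + 1])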

Greetz!

Fabian S. wrote:

> You could put the data into a database, which should be performant
> enough and still very easy to use, even if your lookup pattern
> changes in the future.
>
> Greetz!

That is a good idea. Do you recommend ruby DBI or ActiveRecord? I need
ease of use and simplicity. My interface is the command line.

Actually, I like DataMapper the most. It’s very intuitive. You should
check it out: http://datamapper.org/doku.php

I definitely like the way DataMapper handles things better than
ActiveRecord, but that’s a matter of taste.
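
A minimal DataMapper sketch along those lines (the model, its
properties, and the SQLite connection string are assumptions, not
anything agreed in this thread; it needs dm-core and the SQLite adapter
gems installed):

  require 'rubygems'
  require 'dm-core'
  require 'dm-migrations'   # for auto_migrate! on newer DataMapper

  DataMapper.setup(:default, "sqlite3://#{Dir.pwd}/rows.db")

  class Row
    include DataMapper::Resource

    property :id,      Serial    # auto-incrementing primary key
    property :file,    String
    property :line_no, Integer
    property :content, Text
  end

  DataMapper.auto_migrate!      # creates the table (drops existing data!)

  # Load a file once ...
  File.foreach("data/a.txt").with_index do |line, i|
    Row.create(:file => "data/a.txt", :line_no => i,
               :content => line.chomp)
  end

  # ... then fetch rows n and n+1 by index.
  n = 4711
  pair = Row.all(:file => "data/a.txt", :line_no => (n..n + 1),
                 :order => [:line_no.asc])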

Greetz!

-------- Original Message --------

Date: Fri, 18 Sep 2009 15:30:38 +0900
From: Robert K. <[email protected]>
To: [email protected]
Subject: Re: repeatedly open file or save entire file to memory?

> 2009/9/17 Axel E. [email protected]:
>> Jason,
>>
>>> I want to make sure I do what is most efficient when dealing with
>>> multiple and potentially large files.
>>
>> you can use ruby-prof for profiling of your code. It’s available as
>> a gem.

Dear Robert,

> I consider Jason’s question a design-level question. That’s not
> something a profiler can really help with. Of course you can code up
> alternatives and measure performance. But that can only tell you
> which of several versions is fastest - it cannot tell you how you
> should change your design to improve it.

I agree with you. I proposed this precisely to see how long several
alternatives take. One always has to think about design oneself :-)

Best regards,

Axel