Optimizing writes to large files

Hello,
I have data to process and write to files progressively. The data
files end up very large, but I append to them in small strings. I
suppose buffering the strings before appending to the file would be
faster. I don’t need the files to be written before the end of the whole
process (i.e. I don’t use their content along the way).

I’ve searched for information about how File buffers its data, but it
seems we cannot configure anything about this. Did I miss something?
My first idea was to buffer everything myself, appending lines to a
string or an array of strings, and writing when I reach a big enough
amount of data. But if File uses a buffer anyway, that would be a waste
of time, I suppose?
Do you have any advice for optimizing the writing of large files?
Thanks!

Hi,

Ruby, glibc, the kernel, etc. are already doing buffering.
There is usually no need for explicit buffering from Ruby.

You can test this for yourself: try writing the same string a
million times in a loop. Not every write triggers a disk transaction.

Regards,

Markus
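
For anyone who wants to try the test suggested above, here is a minimal
sketch. It is not from the thread: the file names, the LINE and COUNT
constants and the write_lines helper are made up for illustration. It
writes the same string a million times, once with Ruby's default
buffering and once with IO#sync set to true, which makes Ruby hand every
write to the operating system immediately.

```ruby
require 'benchmark'
require 'tmpdir'

LINE  = "some small string\n" # hypothetical payload
COUNT = 1_000_000             # "a million times", as suggested above

# Writes LINE to path `count` times. With sync: true, Ruby's own write
# buffer is bypassed and each write is passed straight to the OS.
def write_lines(path, count = COUNT, sync: false)
  File.open(path, 'w') do |f|
    f.sync = sync
    count.times { f.write(LINE) }
  end
end

Dir.mktmpdir do |dir|
  buffered   = File.join(dir, 'buffered.txt')
  unbuffered = File.join(dir, 'unbuffered.txt')

  Benchmark.bm(12) do |x|
    x.report('buffered')    { write_lines(buffered, sync: false) }
    x.report('sync = true') { write_lines(unbuffered, sync: true) }
  end

  # Both runs must produce identical files; only the timing differs.
  puts "same size: #{File.size(buffered) == File.size(unbuffered)}"
end
```

I would expect the second run to be slower on most systems, because
every write becomes a system call, but even then the kernel is still
free to delay the actual transfer to disk.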

As mentioned, the file writes are already being buffered by lower
layers; however, if you are closing and reopening the files throughout
your processing, the buffers aren’t helping you much. Try to ensure
that you open each file only once and keep those file references around
to use until you know you’re permanently done writing to each one.
Unless you have a large number of files to open, you shouldn’t have to
worry about resource constraints on the number of concurrently open
files.

-Jeremy
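
The advice above (open each file once and keep the handle around) could
be sketched roughly like this. The OutputFiles class, its method names
and the file names are hypothetical, chosen only for the example; the
thread does not say how the original program manages its files.

```ruby
require 'tmpdir'

# Hypothetical helper: opens each output file lazily, exactly once, and
# reuses the handle for every later append.
class OutputFiles
  def initialize(dir)
    @dir = dir
    @handles = Hash.new do |hash, name|
      hash[name] = File.open(File.join(@dir, name), 'a')
    end
  end

  # Appends text to the named file, opening it on first use.
  def append(name, text)
    @handles[name] << text
  end

  # Closes every handle; closing also flushes Ruby's write buffer.
  def close_all
    @handles.each_value(&:close)
    @handles.clear
  end
end

Dir.mktmpdir do |dir|
  out = OutputFiles.new(dir)
  begin
    10_000.times do |i|
      out.append(i.even? ? 'even.txt' : 'odd.txt', "#{i}\n")
    end
  ensure
    out.close_all # runs even if processing raises
  end
  puts File.readlines(File.join(dir, 'even.txt')).size
end
```

The ensure block matters here: data still sitting in Ruby's buffer only
reaches the file once the handle is flushed or closed.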

You’re right, doing the buffering myself does not make it faster. For
writing 10 million lines with an array of strings, with one string, and
with no homemade buffer (code is attached):
Buffer array : 11.141s
Buffer string : 9.748s
No buffer : 10.344s

Don’t you think using more RAM before writing to disk could make the
process faster? I thought so, so I’d like to tell File how much RAM
it can use to speed things up, because I can use a lot of RAM.

Regards

No, more RAM does not help more. With modern operating systems you never
write directly through to the disk.* The OS is buffering your writes
anyway. Even worse: using up a lot of memory in the process to hold the
whole file can make your program slower because of the overhead of
memory allocation. In the worst case your program is paged out to disk.
Don’t worry too much about this.

* Note: there are some circumstances in which you do write directly to
  disk (or rather, the write operation returns only after the disk has
  acknowledged the data). This is sometimes called “direct IO”. It
  makes sense in special circumstances only (some RDBMS can do it).
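
To illustrate the footnote above: in Ruby, the closest everyday tool for
"the write returns only after the disk acknowledged the data" is
IO#fsync, which asks the operating system to push the file's data to the
device before returning. Strictly speaking this is synchronous writing
rather than true direct IO (which bypasses the OS cache and is not shown
here). A minimal sketch, where the durable_append helper and the file
name are hypothetical:

```ruby
require 'tmpdir'

# Hypothetical helper: appends text and returns only after the OS
# reports that the data has been handed to the storage device.
def durable_append(path, text)
  File.open(path, 'a') do |f|
    f.write(text)
    f.fsync # flushes Ruby's buffer, then asks the OS to sync the file
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'journal.log')
  durable_append(path, "record 1\n")
  durable_append(path, "record 2\n")
  puts File.read(path)
end
```

Calling fsync after every small write is expected to be much slower than
plain buffered writes, which is why it is normally reserved for cases
such as database journals.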

Attachments:
http://www.ruby-forum.com/attachment/6191/test_write.rb

You can make your life easier by using Benchmark for this.

require 'benchmark'

Benchmark.bm 20 do |x|
  x.report "a test" do
    # code to measure goes here
  end

  x.report "another test" do
    # alternative code to measure goes here
  end
end

Kind regards

robert
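
Filling in that Benchmark skeleton, a comparison like the one reported
earlier in the thread might look as follows. This is not the attached
test_write.rb (its contents are not shown here); the constants, helper
names and flush threshold are assumptions made for the sketch, and the
line count is kept small so that it runs quickly.

```ruby
require 'benchmark'
require 'tmpdir'

TEXT        = "a line of data\n" # hypothetical payload
LINES       = 200_000            # far fewer than the 10 million above
FLUSH_EVERY = 10_000             # lines held before a homemade flush

# No homemade buffer: rely on Ruby and the OS.
def write_direct(path, n)
  File.open(path, 'w') { |f| n.times { f << TEXT } }
end

# Homemade buffer kept as an array of strings.
def write_array_buffer(path, n)
  File.open(path, 'w') do |f|
    buf = []
    n.times do
      buf << TEXT
      next if buf.size < FLUSH_EVERY
      f << buf.join
      buf.clear
    end
    f << buf.join # write whatever is left over
  end
end

# Homemade buffer kept as one growing string.
def write_string_buffer(path, n)
  limit = FLUSH_EVERY * TEXT.bytesize
  File.open(path, 'w') do |f|
    buf = ''.dup
    n.times do
      buf << TEXT
      next if buf.bytesize < limit
      f << buf
      buf.clear
    end
    f << buf # write whatever is left over
  end
end

Dir.mktmpdir do |dir|
  Benchmark.bm(14) do |x|
    x.report('no buffer')     { write_direct(File.join(dir, 'a'), LINES) }
    x.report('array buffer')  { write_array_buffer(File.join(dir, 'b'), LINES) }
    x.report('string buffer') { write_string_buffer(File.join(dir, 'c'), LINES) }
  end
end
```

Timings will vary by machine and Ruby version, so treat any single run
as a rough indication only.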

Thanks for your answers, I’ll let the OS optimize this on its own then
;-)

IMHO the primary speed bottleneck is the disk drive itself, along with
any file-system fragmentation.

RAM just lets the operating system schedule the writes as optimally as
possible. The effect of drastically more RAM won’t be more than 1-5%.

If you use a ramdisk, the difference will be much larger ;-)

But if you are worried about file persistence, you should not do this
*g*

I do not know any details about your use case, but there are other
possibilities:

  • writing directly to the block device, bypassing the file system
  • mirroring ramdisk writes to other machines for persistence
  • ?

Markus S.