Reasons to use a buffer in IO::read?

Hi Ruby people,

I’m wondering what the functional and performance differences might be
between the two statements below? Assume ‘io’ is an IO instance with
gobs of data in it. Assume ‘file’ is an open file instance with write
access:

    until io.eof? do
      file.write(io.read(10485760))
    end

    buffer = ''
    until io.eof? do
      buffer = io.read(10485760)
      file.write(buffer)
    end

I see that Ruby provides for a buffer and I’m wondering what the
reason is? I read this article but am still not clear on the benefit
of a buffer at all:

http://rcoder.net/content/fast-ruby-io

I’m wondering if providing a buffer might reduce malloc issues and
speed things up? I can’t see any other reason to use one…

Thanks in advance for any information!

Steve

On Dec 5, 6:59 pm, Steve M. [email protected] wrote:


$ ri IO#read

IO#read
    ios.read([length [, buffer]])    => string, buffer, or nil

    Reads at most _length_ bytes from the I/O stream, or to the end of
    file if length is omitted or is +nil+. length must be a
    non-negative integer or nil. If the optional buffer argument is
    present, it must reference a String, which will receive the data.

    At end of file, it returns +nil+ or +""+ depend on _length_.
    +_ios_.read()+ and +_ios_.read(nil)+ returns +""+.
    +_ios_.read(_positive-integer_)+ returns nil.

    f = File.new("testfile")
    f.read(16)   #=> "This is line one"

So…

    buffer = ""
    file.write(io.read(nil, buffer))
    print "I read this stuff ", buffer, "\n"
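To make the ri text concrete, here is a small sketch of the return values it describes. It assumes a StringIO as a stand-in for any IO object, and the sample data is made up for the example. It shows that the buffer argument is the very object that read returns, and what read gives back at end of file with and without a length.

```ruby
require 'stringio'

# StringIO stands in for any IO here (an assumption for a self-contained sketch).
io = StringIO.new("This is line one\nThis is line two\n")

buffer = String.new                   # an empty, unfrozen String to receive the data
result = io.read(16, buffer)          # read at most 16 bytes into buffer

same_object = result.equal?(buffer)   # read returns the buffer itself
first_chunk = buffer.dup              # copy it, because the next read overwrites buffer

io.read(nil, buffer)                  # nil length: read the rest of the stream into buffer
rest = buffer.dup

eof_with_length    = io.read(16)      # at end of file with a positive length: nil
eof_without_length = io.read          # at end of file with no length: ""

puts first_chunk
p eof_with_length
p eof_without_length
```

Because each read overwrites the buffer, anything that must survive the next call has to be copied out first, as the dup calls do here.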

Regards,
Jordan

On Dec 5, 5:56 pm, MonkeeSage [email protected] wrote:


Thanks Jordan. How is your code different (if at all) from:

    buffer = io.read
    file.write(buffer)
    print "I read this stuff ", buffer, "\n"

Am I missing something? I just don’t see why buffer is useful - is it
a performance benefit or some kind of syntax improvement that I’m
missing? The only thing I can see is that it has some kind of low
level malloc optimization if the same string size is passed in
repeatedly during partial writes.

Steve

2007/12/7, Steve M. [email protected]:

    until io.eof? do
      file.write(io.read(10485760))
    end

    buffer = ''

This line above is completely superfluous.


Thanks Jordan. How is your code different (if at all) from:

    buffer = io.read
    file.write(buffer)
    print "I read this stuff ", buffer, "\n"

Am I missing something? I just don’t see why buffer is useful - is it
a performance benefit or some kind of syntax improvement that I’m
missing?

Yes, the string referenced by buffer is reused. This leads to
improved performance for the typical application which is like this:

    buffer = ""
    while ( io.read(1024, buffer) )
      file.write buffer
    end
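A runnable sketch of that loop, assuming StringIO objects as stand-ins for the thread's 'io' and 'file' and a made-up 5000-byte payload. It records the object_id of the buffer on every iteration to show that a single String carries every chunk.

```ruby
require 'stringio'

# Stand-ins for the thread's 'io' and 'file' (assumptions for a runnable sketch).
io   = StringIO.new("x" * 5000)
file = StringIO.new

buffer     = String.new   # one String, reused for every chunk
object_ids = []

while io.read(1024, buffer)        # returns nil at end of file, which ends the loop
  object_ids << buffer.object_id   # record which object holds this chunk
  file.write(buffer)
end

puts "chunks read:      #{object_ids.size}"
puts "distinct buffers: #{object_ids.uniq.size}"
puts "bytes written:    #{file.string.bytesize}"
```

Note that the loop needs no eof? check: read with a positive length returns nil once the stream is exhausted.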

The only thing I can see is that it has some kind of low
level malloc optimization if the same string size is passed in
repeatedly during partial writes.

Exactly (see above). Note that reading with a chunk size as large as
the one in your original posting (10 MB) is very inefficient. If you
want to read the whole file, you can simply use io.read.

Kind regards

robert

On Dec 7, 2007 6:25 AM, MonkeeSage [email protected] wrote:


    file.write(io.read(nil, buffer))
    print "I read this stuff ", buffer, "\n"

…looks the same as this code…

    file.write(buffer = io.read)
    print "I read this stuff ", buffer, "\n"

Regards,
Jordan

I’d assume the former saves you a bunch of allocations when looping
through a file (I assume the buffer is reused instead of allocating a
new one for each iteration).

i.e.

    buffer = ""
    File.open('xxx', 'r') do |f|
      while f.read(1024, buffer) do
        process(buffer)
      end
    end

vs.

    File.open('xxx', 'r') do |f|
      while true do
        buffer = f.read(1024)
        break if buffer.nil?   # read(1024) returns nil at end of file, not ""
        process(buffer)
      end
    end
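One way to check this assumption is to count object allocations for both loops. The sketch below assumes CRuby 2.2 or later, where GC.stat(:total_allocated_objects) is available; it uses a StringIO with a made-up payload instead of a real file and leaves process out. The helper name count_allocations is hypothetical.

```ruby
require 'stringio'

chunk_size = 1024
payload    = "x" * (chunk_size * 1000)   # hypothetical input: 1000 full chunks

# Hypothetical helper: number of objects allocated while the given work runs.
# Relies on the CRuby-specific GC.stat key :total_allocated_objects.
count_allocations = lambda do |work|
  before = GC.stat(:total_allocated_objects)
  work.call
  GC.stat(:total_allocated_objects) - before
end

# Loop 1: a single buffer is passed to every read call.
reused = count_allocations.call(lambda do
  io = StringIO.new(payload)
  buffer = String.new
  while io.read(chunk_size, buffer)
    # process(buffer) would go here
  end
end)

# Loop 2: every read call returns a newly created String.
fresh = count_allocations.call(lambda do
  io = StringIO.new(payload)
  while (chunk = io.read(chunk_size))
    # process(chunk) would go here
  end
end)

puts "objects allocated with a reused buffer: #{reused}"
puts "objects allocated with fresh strings:   #{fresh}"
```

The exact counts depend on the Ruby version, so only the relative difference is meaningful: the second loop creates at least one String per chunk, while the first does not.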

On Dec 6, 10:31 pm, Steve M. [email protected] wrote:

    buffer = io.read
    file.write(buffer)
    print "I read this stuff ", buffer, "\n"

Am I missing something? I just don’t see why buffer is useful - is it
a performance benefit or some kind of syntax improvement that I’m
missing? The only thing I can see is that it has some kind of low
level malloc optimization if the same string size is passed in
repeatedly during partial writes.

Steve

I don’t know if there is any optimization in the back end, but it lets
you pass the result of io.read to another method and put it in buffer
at the same time. But since you can do that with assignment, I don’t
really see any point to it (I was just trying to give an example as
the docs describe). To me, unless, as you say, there is some
optimization going on in the backend, this code…

    buffer = ""
    file.write(io.read(nil, buffer))
    print "I read this stuff ", buffer, "\n"

…looks the same as this code…

    file.write(buffer = io.read)
    print "I read this stuff ", buffer, "\n"

Regards,
Jordan

On Dec 7, 4:56 am, MonkeeSage [email protected] wrote:


Oh…wait…I’m completely dense. Duh! io_read() is going to create /
re-initialize a new string anyway to put its results in. So if I create
a new string independently to store the return value of IO#read, then
I’m causing an extra allocation and copy. Sorry for wasting space.
Have pity on mentally handicapped people like me. :P

Regards,
Jordan

On Dec 7, 3:29 am, Jano S. [email protected] wrote:

I’d assume the former saves you a bunch of allocations when looping
through a file
(I assume the buffer is reused instead of allocating a new one for
each iteration).

I’m not the smartest C programmer (or the smartest anything
programmer), but I’m not seeing any optimization in the actual C code.
Please correct me if I’m wrong.

First, io_read() is the function called in the backend from IO#read.
The relevant lines are:

====
    rb_scan_args(argc, argv, "02", &length, &str);

    if (NIL_P(length)) {
        if (!NIL_P(str)) StringValue(str);
        GetOpenFile(io, fptr);
        rb_io_check_readable(fptr);
        return read_all(fptr, remain_size(fptr), str);
    }
    len = NUM2LONG(length);
    if (len < 0) {
        rb_raise(rb_eArgError, "negative length %ld given", len);
    }

    if (NIL_P(str)) {
        str = rb_tainted_str_new(0, len);
    }
    else {
        StringValue(str);
        rb_str_modify(str);
        rb_str_resize(str,len);
    }

So we see that we get a new string from rb_tainted_str_new if buffer
is not passed in to IO#read; otherwise str is used and we call
StringValue on it.
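The branches in that C excerpt can be observed from Ruby. The sketch below assumes a current CRuby and uses a temporary file with made-up contents; the file name prefix and the results hash are only there for the example. It exercises the no-buffer branch, the buffer branch, the negative-length check, and the string conversion of the buffer argument.

```ruby
require 'tempfile'

results = {}

Tempfile.create('io_read_demo') do |f|
  f.write("0123456789")
  f.rewind

  # No buffer given: read hands back a String it created itself.
  results[:fresh] = f.read(4)

  # Buffer given: the same object comes back, resized to the bytes read.
  buffer   = String.new("previous contents, longer than four bytes")
  returned = f.read(4, buffer)
  results[:same_object] = returned.equal?(buffer)
  results[:buffer]      = buffer.dup

  # A negative length is rejected before any reading happens.
  begin
    f.read(-1)
  rescue ArgumentError => e
    results[:negative_length] = e.class
  end

  # A buffer that cannot be converted to a String is rejected as well.
  begin
    f.read(4, 123)
  rescue TypeError => e
    results[:non_string_buffer] = e.class
  end
end

p results
```

The old contents of the buffer are discarded entirely, which matches the resize to len in the else branch.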

So what is StringValue? A macro defined in ruby.h:

====
#define StringValue(v) rb_string_value(&(v))

And what is rb_string_value()? A function from string.c:

====
    static char *null_str = "";

    VALUE
    rb_string_value(ptr)
        volatile VALUE *ptr;
    {
        VALUE s = *ptr;
        if (TYPE(s) != T_STRING) {
            s = rb_str_to_str(s);
            *ptr = s;
        }
        if (!RSTRING(s)->ptr) {
            FL_SET(s, ELTS_SHARED);
            RSTRING(s)->ptr = null_str;
        }
        return s;
    }

So if it’s not a string, we convert it to one; and if its data pointer
is NULL, we point it at a shared empty string.

But the interesting lines are back up in io_read():

====
    rb_str_modify(str);
    rb_str_resize(str,len);

Now rb_str_modify() (string.c) is called with our string. And
it in turn calls str_make_independent():

====
    static void
    str_make_independent(str)
        VALUE str;
    {
        char *ptr;

        ptr = ALLOC_N(char, RSTRING(str)->len+1);
        if (RSTRING(str)->ptr) {
            memcpy(ptr, RSTRING(str)->ptr, RSTRING(str)->len);
        }
        ptr[RSTRING(str)->len] = 0;
        RSTRING(str)->ptr = ptr;
        RSTRING(str)->aux.capa = RSTRING(str)->len;
        FL_UNSET(str, STR_NOCAPA);
    }

And finally, rb_str_resize is called:

====
    VALUE
    rb_str_resize(str, len)
        VALUE str;
        long len;
    {
        if (len < 0) {
            rb_raise(rb_eArgError, "negative string size (or size too big)");
        }

        rb_str_modify(str);
        if (len != RSTRING(str)->len) {
            if (RSTRING(str)->len < len || RSTRING(str)->len - len > 1024) {
                REALLOC_N(RSTRING(str)->ptr, char, len+1);
                if (!FL_TEST(str, STR_NOCAPA)) {
                    RSTRING(str)->aux.capa = len;
                }
            }
            RSTRING(str)->len = len;
            RSTRING(str)->ptr[len] = '\0'; /* sentinel */
        }
        return str;
    }
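The reallocation test inside rb_str_resize is easier to reason about when written out on its own. The following is only a Ruby model of the two nested conditions in the C code above, not a Ruby API; the lambda name realloc_needed is hypothetical.

```ruby
# Ruby model of the REALLOC_N condition in the rb_str_resize excerpt above.
# current_len is the string's length before the call, new_len the requested one.
realloc_needed = lambda do |current_len, new_len|
  next false if new_len == current_len   # lengths match: the outer branch is skipped
  current_len < new_len || current_len - new_len > 1024
end

puts realloc_needed.call(0, 1024)          # empty buffer grows to a full chunk => true
puts realloc_needed.call(1024, 1024)       # same size on the next iteration    => false
puts realloc_needed.call(1024, 452)        # short final chunk, shrink of 572   => false
puts realloc_needed.call(10_485_760, 16)   # shrink by more than 1024 bytes     => true
```

Read this way, a buffer that is resized to the same chunk length on every iteration only reaches REALLOC_N the first time it grows.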

Now, like I said, I’m not the greatest C programmer…but I fail to
see how, if I’m reading the code above correctly, passing in a buffer
string to IO#read is any more optimal than creating a new string (even
when looping many times), since it appears to me to be doing the same
thing (compare str_new from string.c, which is what rb_tainted_str_new
calls).

Regards,
Jordan


References:

http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/io.c
http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/ruby.h
http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/string.c

2007/12/7, MonkeeSage [email protected]:


Oh…wait…I’m completely dense. Duh! io_read() is going to create /
re-initialize a new string anyway to put its results in. So if I create
a new string independently to store the return value of IO#read, then
I’m causing an extra allocation and copy. Sorry for wasting space.
Have pity on mentally handicapped people like me. :P

LOL

Also, allocating a String instance involves not only the raw malloc of
the memory but also the bookkeeping needed for GC. So it is more
expensive than a simple resize. Note also that if you loop with code
like the one I showed, the length of the string instance is adjusted
only once, because all chunks have the same length or are shorter
(potentially the last one).
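A short sketch of the chunk lengths in such a loop, assuming a StringIO with a made-up 2500-byte input in place of a real file. It records the size of the buffer after every read.

```ruby
require 'stringio'

io      = StringIO.new("x" * 2500)   # hypothetical 2500-byte input
buffer  = String.new
lengths = []

while io.read(1024, buffer)
  lengths << buffer.bytesize         # size of the buffer after each read
end

p lengths   # => [1024, 1024, 452]
```

Every chunk but the last fills the buffer to the same length, and the last one is shorter by less than a full chunk.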

Kind regards

robert