String copy-on-write question

larsch · May 5, 2008, 5:40pm

Hello group,

Ruby implements copy-on-write for strings, so you can do stuff like
this very cheaply:

str = 0.chr * (2**24) # 16MiB allocated
str[100..-1] # this costs only a small amount of memory

How come this optimization does not apply in this case?

str[100..-2] # this costs around 16MiB of memory
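
One way to look at the two cases side by side is sketched below. It assumes a CRuby build where the objspace extension and ObjectSpace.memsize_of are available; memsize_of is only a hint and its numbers differ between versions and implementations, so the script prints them rather than promising particular values. The variable names tail and middle are made up for the illustration.

```ruby
require 'objspace'

str    = 0.chr * (2**24)   # 16 MiB buffer
tail   = str[100..-1]      # runs to the end of str
middle = str[100..-2]      # stops one byte before the end

# memsize_of is an implementation-specific hint, so the values are
# printed for comparison and not asserted.
tail_mem   = ObjectSpace.memsize_of(tail)
middle_mem = ObjectSpace.memsize_of(middle)

puts "tail:   #{tail.bytesize} bytes long, memsize_of = #{tail_mem}"
puts "middle: #{middle.bytesize} bytes long, memsize_of = #{middle_mem}"
```

If the tail substring shares its parent's buffer, it should report far less memory than the middle one, even though the two differ in length by a single byte.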

As a side effect, if using regexps on a large string, the pre-match
and post-match variables behave differently:

s = 0.chr * (2**23) + "Hello" + 0.chr * (2**23) # About 16MiB allocated (after GC)
s.scan(/Hello/) { |m| p m } # This is free
p $'.size # This is free
p $`.size # This costs another 8MiB.

Any insights?

Lars

larsch · May 5, 2008, 6:21pm

On 05.05.2008 18:07, ts wrote:

p $`.size # This costs another 8MiB.

same reason here.

Interesting. Do you also happen to know why an additional field that
stores the length is not used? Is the reason perhaps the use of C library
string functions that work on zero-terminated strings?

Cheers

robert

larsch · May 5, 2008, 6:08pm

Lars C. wrote:

Well, it’s best if you look at rb_str_substr() in string.c

str[100..-1] # this costs only a small amount of memory

Ruby just needs to adjust the pointer and the length in the new
object

str[100..-2] # this costs around 16MiB of memory

One character is missing from the end compared with the previous
substring. If Ruby did the same thing as before, it would have to

adjust the pointer
adjust the length
add \0 at the end

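Because the buffer is shared, writing that \0 would overwrite the last
character of the original string. This means it would inevitably modify
the original, which is why it duplicates the data instead.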
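
The rule described above can be sketched as a toy model in plain Ruby. This is not the code in string.c: ToyString and all of its methods are hypothetical names, and the model only assumes that every buffer must end in "\0" and that a substring may share its parent's buffer only when it ends exactly where the parent ends.

```ruby
# Toy model of a C-style string object: a buffer that always ends in "\0",
# plus an offset and a length. All names here are hypothetical.
class ToyString
  attr_reader :buffer, :offset, :length

  def initialize(buffer, offset, length)
    @buffer = buffer
    @offset = offset
    @length = length
  end

  def self.from(text)
    new((text + "\0").freeze, 0, text.bytesize)
  end

  # Share the buffer only when the substring runs to the end of the
  # parent, so the existing "\0" still terminates it. Otherwise copy
  # the bytes and give the copy its own terminator.
  def substr(start, len)
    if start + len == length
      ToyString.new(buffer, offset + start, len)
    else
      copy = buffer.byteslice(offset + start, len) + "\0"
      ToyString.new(copy.freeze, 0, len)
    end
  end

  def shares_buffer_with?(other)
    buffer.equal?(other.buffer)
  end

  def to_s
    buffer.byteslice(offset, length)
  end
end

parent = ToyString.from("hello world")
tail   = parent.substr(6, 5)   # "world", ends where the parent ends
middle = parent.substr(6, 4)   # "worl", stops one byte early

puts tail.shares_buffer_with?(parent)    # true
puts middle.shares_buffer_with?(parent)  # false
```

The tail case only needs a new offset and length, while the middle case has nowhere to put its terminator without touching the parent's bytes, so it pays for a copy.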

p $'.size # This is free
p $`.size # This costs another 8MiB.

same reason here.

Guy Decoux

larsch · May 5, 2008, 6:50pm

On 05.05.2008 18:33, ts wrote:

Robert K. wrote:

Interesting. Do you also happen to know why an additional field that
stores the length is not used?

I don't understand: it has a field that gives the length of
the string. For example, with

Ah, ok. This is what happens when one is too lazy to look into the source.
Somehow I had assumed that the length was not stored because you made
the point that the \0 could not be inserted without altering the
original. I concluded that there is no length field.

str = '0' * 200
str[100 .. -1]

the first object (in str) will have 200 for its length
the length field in the new object will have the value 100

Is the reason perhaps the use of C library string functions that work on
zero-terminated strings?

Only matz knows this.

Well, maybe he’ll stop by and enlighten us.

Kind regards

robert

larsch · May 5, 2008, 6:35pm

Robert K. wrote:

Interesting. Do you also happen to know why an additional field that
stores the length is not used?

I don't understand: it has a field that gives the length of
the string. For example, with

str = '0' * 200
str[100 .. -1]

the first object (in str) will have 200 for its length
the length field in the new object will have the value 100
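
Seen from the Ruby side, the two lengths in this example can be checked directly. The small sketch below (the variable name sub is just for illustration) also appends to the substring to show that, whatever buffer sharing happens underneath, a write to one object does not show through in the other.

```ruby
str = '0' * 200
sub = str[100..-1]

puts str.length   # 200
puts sub.length   # 100

# Writing to the substring must not affect the original string.
sub << 'x'
puts str.length   # 200
puts sub.length   # 101
```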

Is the reason perhaps the use of C library string functions that work on
zero-terminated strings?

Only matz knows this.

Guy Decoux