Implementation of String or Array #slice and similar ops?

ronjeffries · December 2, 2005, 11:30am

As some may recall, I’m fiddling with some string manipulation in a
recent
series of articles. I’m wondering about the specific implementation of
operations like slice, on String (and on Array).

Specifically, does slicing a String copy the bytes, or point into the
old
string, and similarly for Array? I see that if I modify the slice, the
original
is not modified, and vice versa, but have the bytes already been copied
when the
slice is created? Or not until later?

Similar questions for Array, all around, and for operations like <<.

Bottom line, are all the methods of String and Array directly allocating
storage, or are any of them “virtual”?

My guess is that the bytes are copied. If so, is anyone aware of a class
or
classes that address substrings and such as virtual copies?

Thanks,

ronjeffries · December 2, 2005, 12:35pm

On Fri, Dec 02, 2005 at 07:27:32PM +0900, Ron J. wrote:
[…]

My guess is that the bytes are copied. If so, is anyone aware of a class or
classes that address substrings and such as virtual copies?

Ruby does use COW for arrays and strings, so internal pointers are
reused and the data they point to duplicated when actually needed.

somearray[start, size] will always reuse the internal array.
If I’m reading rb_str_substr correctly, somestring[start, rsize] will
copy the bytes unless start’+rsize == somestring.size (start’ being
the adjusted offset), amongst other conditions.

ronjeffries · December 2, 2005, 3:31pm

On 12/2/05, Mauricio Fernández [email protected] wrote:

On Fri, Dec 02, 2005 at 07:27:32PM +0900, Ron J. wrote:
Ruby does use COW for arrays and strings, so internal pointers are
reused and the data they point to duplicated when actually needed.

somearray[start, size] will always reuse the internal array.
If I’m reading rb_str_substr correctly, somestring[start, rsize] will
copy the bytes unless start’+rsize == somestring.size (start’ being
the adjusted offset), amongst other conditions.

What does “COW” mean?

A serious problem occurs when you go modify the original (likely
large) array after taking a slice. It allocates a new area for the
original array and leaves the slice (probably small) pointing the the
original area with the entire array. This can cause serious
performance issues (memory and run-time). Here is a simple example
that shows the problem:

time ruby -e ‘n=2**13; a=(1…n).to_a; n.times { |i| a.push(a[i,1]) };
IO.readlines(“/proc/#$$/status”).grep(/VmSize/).display’
VmSize: 593020 kB
50.44user 0.65system 1:00.24elapsed 84%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (1957major+121801minor)pagefaults 0swaps

You’ll find that both memory and runtime are O(n**2). vs. the
equivalent without element sharing (O(n)):

time ruby -e ‘n=2**13; a=(1…n).to_a; n.times { |i| a.push([a[i]]) };
IO.readlines(“/proc/#$$/status”).grep(/VmSize/).display’
VmSize: 3384 kB
0.01user 0.00system 0:00.22elapsed 7%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (1major+558minor)pagefaults 0swaps

Back in September, I provided a patch to 1.9 for arrays. It addressed
this issue plus made operations near the beginning of the array (i.e.
shift/unshift) just as fast as operations near the end (i.e.
pop/push). Here was my post:

http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-core/5861

This patch hasn’t been accepted to the ruby head (yet). The guys
doing the Sydney fork of ruby (1.8 derivative) do plan on integrating
it. I’m not sure where they are on it.

I haven’t had time to do the same for strings.

ronjeffries · December 2, 2005, 3:51pm

Eric M. wrote:

What does “COW” mean?

Copy On Write.

ronjeffries · December 2, 2005, 4:52pm

On Dec 2, 2005, at 9:28 AM, Eric M. wrote:

What does “COW” mean?

copy on write

Gary W.

ronjeffries · December 4, 2005, 2:51am

Thanks to all for the info on this. It sounds like with strings I won’t
get any
advantage from COW at all on slicing, and that even with arrays I’ll not
be much
better off.

For my current purposes, I don’t see how to use arrays anyway, since I’m
interested primarily in large chunks of contiguous memory.

Any pointers relating to what I’m doing in my Extended Set Theory
articles will
be gratefully received. A ping via email will help, as I don’t get back
here as
frequently as I probably should.

Thanks,

ronjeffries · December 2, 2005, 4:56pm

On 12/2/05, [email protected] [email protected] wrote:

On Dec 2, 2005, at 9:28 AM, Eric M. wrote:

What does “COW” mean?

copy on write

Copy-on-write - Wikipedia

Thanks. It looks like the normal COW usage from this wiki is when the
shared resource starts and ends at the same memory locations for all
uses. I think the conventional COW as described above would apply to
Array#replace and Array#clone, but not Array#slice. The way ruby does
COW is clearly a win for Array#replace and Array#clone, but not for
Array#slice (and many others) as demonstrated in my example.

ronjeffries · December 5, 2005, 2:45am

On Sat, 3 Dec 2005 19:12:20 -0700, [email protected] wrote:

can you use narray? it stores the elements as continuous memory. i’ve got a
patch that allows it’s use with mmap to - so a file can be mapped, manipulated
as an narray, and the bytes are persisted to file automagically.

Well, I wouldn’t rule it out. Seems like it might have a lot of stuff
that I
wouldn’t need, being numerically oriented. Its base functionality might
be
interesting.

Are there docs for it anywhere handy? I didn’t find any with a quick
goog.

Thanks,

ronjeffries · December 4, 2005, 3:23am

On Sun, 4 Dec 2005, Ron J. wrote:

Thanks to all for the info on this. It sounds like with strings I won’t get
any advantage from COW at all on slicing, and that even with arrays I’ll not
be much better off.

For my current purposes, I don’t see how to use arrays anyway, since I’m
interested primarily in large chunks of contiguous memory.

can you use narray? it stores the elements as continuous memory. i’ve
got a
patch that allows it’s use with mmap to - so a file can be mapped,
manipulated
as an narray, and the bytes are persisted to file automagically.

-a