Zero-copy String Handling

I’m writing a C extension that involves fast scanning through and
parsing of tab-delimited files. Basically, I mmap the file, figure out
where the row and column boundaries are, and for each row end up with
an array of strings (pointer and length) that I then pass on to other
C or Ruby code. The array and its strings are not supposed
to be modified by the callees, only read, and I can also live with the
callees being required to make their own copies of the strings and
arrays if they need to keep the data accessible after the call, if I can
figure out some way to enforce that.

It appears to me that this means I don’t really have any need to
copy the data; I ought to just be able to set up a bunch of (likely
frozen) String objects and then tweak the ptr and len on them and pass
them around, avoiding any allocations or data copies. From a bit of
experimentation, I can see that dropping several calls to rb_str_new for
each row results in an enormous speed increase (about ten-fold) in how
fast I can scan through the file.

Does anybody have any suggestions on a reasonably safe way to do this?

cjs

On Dec 10, 10:52 am, Curt S. [email protected] wrote:

> It appears to me that this means I don’t really have any need to
> copy the data; I ought to just be able to set up a bunch of (likely
> frozen) String objects and then tweak the ptr and len on them and pass
> them around, avoiding any allocations or data copies. From a bit of
> experimentation, I can see that dropping several calls to rb_str_new for
> each row results in an enormous speed increase (about ten-fold) in how
> fast I can scan through the file.
>
> Does anybody have any suggestions on a reasonably safe way to do this?

I’ve read (although I’ve never verified this by looking at the source
code) that Ruby’s strings are copy-on-write. So if you have code like
this:

s1 = "This is some string."
s2 = s1[5, 7]   # => "is some"
s3 = s1[5..11]  # => "is some"

no copies are made and the data is shared … until you modify the data,
at which point a copy is made.

If you wanted to enforce the read-only aspect, perhaps you could
create a subclass of String where all the methods and operators that
modify the character data are re-implemented to raise exceptions.
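That idea can be sketched in plain Ruby. The class name is made up and the list of overridden mutators is illustrative, not exhaustive (a real version would need to cover every destructive method):

```ruby
# Hypothetical read-only String subclass: the mutating methods we care
# about are overridden to raise, while non-destructive methods still work.
class ReadOnlyString < String
  MUTATORS = [:<<, :concat, :replace, :insert, :[]=,
              :upcase!, :downcase!, :gsub!, :sub!, :slice!, :clear]
  MUTATORS.each do |name|
    define_method(name) do |*_args|
      raise TypeError, "can't modify read-only string"
    end
  end
end

s = ReadOnlyString.new("is some")
s.upcase            # non-destructive calls still work (they copy)
begin
  s << "!"          # any mutator raises instead of writing
rescue TypeError
  # rejected; the shared buffer is untouched
end
```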

Eric


On Dec 10, 2007, at 8:52 AM, Curt S. wrote:

> Does anybody have any suggestions on a reasonably safe way to do this?

this code shares memory between an mmap and narray

http://codeforpeople.com/lib/ruby/nmap/

look at na_str.c, it uses rb_str_new4 to avoid copying.

a @ http://codeforpeople.com/

On 10.12.2007 16:52, Curt S. wrote:

> It appears to me that this means I don’t really have any need to
> copy the data; I ought to just be able to set up a bunch of (likely
> frozen) String objects and then tweak the ptr and len on them and pass
> them around, avoiding any allocations or data copies. From a bit of
> experimentation, I can see that dropping several calls to rb_str_new for
> each row results in an enormous speed increase (about ten-fold) in how
> fast I can scan through the file.
>
> Does anybody have any suggestions on a reasonably safe way to do this?

This is what I’d do: create a single string per line and use substring
(aka #[]) to create strings that represent the portion needed; the byte
buffer will then be shared. You don’t even need to freeze them because
of copy on write.
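Concretely, something like this (the offsets here are hard-coded for the example; a real scanner would compute them):

```ruby
# One String per line; each field is carved out with #[], which can
# share the underlying byte buffer (copy-on-write) rather than copying it.
line = "alpha\tbeta\tgamma"
offsets = [[0, 5], [6, 4], [11, 5]]            # [offset, length] per field
fields = offsets.map { |off, len| line[off, len] }
fields  # => ["alpha", "beta", "gamma"]
```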

Kind regards

robert

Curt S. wrote:

> reduce my object creation load at all. I seem to recall last time I was
> playing around with this sort of thing and using a profiler, GC was an
> enormous cost for me. This probably isn’t surprising given the nature of
> the problem; a typical file might be ten million rows of fifty elements
> each, which would be 500 million object creations and collections.
>
> cjs

For a problem of this scale, it seems like it would make sense to use a
custom class that had some of the methods of String - enough for the
callees to treat it like a String - but not in fact String. Give it a
.to_s method to convert to a real String when it’s really necessary.
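A rough Ruby-level sketch of that shape (the class and method names are hypothetical; the real thing would live in C and point into the mmap):

```ruby
# Hypothetical pseudo-string: records a shared buffer plus offset and
# length, and only materializes a real String when #to_s is called.
class Slice
  def initialize(buf, off, len)
    @buf, @off, @len = buf, off, len
  end

  def length
    @len
  end

  def ==(other)
    to_s == other.to_s
  end

  def to_s
    @buf[@off, @len]   # the only place a copy is made
  end
end

row = "alpha\tbeta\tgamma"
field = Slice.new(row, 6, 4)
field.to_s    # => "beta"
field.length  # => 4
```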

On 2007-12-11 08:50 +0900 (Tue), Tim H. wrote:

> For a problem of this scale, it seems like it would make sense to use a
> custom class that had some of the methods of String - enough for the
> callees to treat it like a String - but not in fact String. Give it a .to_s
> method to convert to a real String when it’s really necessary.

That’s a good idea. In fact, I could use a struct of the same shape as
RString, and just invoke methods in string.c to do the real work for the
methods I decide to implement. I could also leave the objects in existence
after the backing pages have been unmapped (as I do now with the String
objects), but tweak them so that they throw an exception when you try
to use them after that, as opposed to the current situation where the
interpreter dumps core.

Thanks for the idea.

cjs

On 2007-12-11 02:44 +0900 (Tue), Robert K. wrote:

> This is what I’d do: create a single string per line and use substring
> (aka #[]) to create strings that represent the portion needed; byte
> buffer will be shared then. You don’t even need to freeze them because
> of copy on write.

This was attractive for a couple of seconds, until I realized that not
only does it still add a copy of the entire row of data (albeit as one
large allocation rather than many small ones), but it also doesn’t
reduce my object creation load at all. I seem to recall last time I was
playing around with this sort of thing and using a profiler, GC was an
enormous cost for me. This probably isn’t surprising given the nature of
the problem; a typical file might be ten million rows of fifty elements
each, which would be 500 million object creations and collections.

cjs

Curt S. wrote:

> after the backing pages have been unmapped (as I do now with the String
> objects), but tweak them so that they throw an exception when you try
> to use them after that, as opposed to the current situation where the
> interpreter dumps core.
>
> Thanks for the idea.
>
> cjs

Works for me. RMagick 2 has the ability to “destroy” an image. A
destroyed image is like a closed file. The object still exists but any
attempt to use it raises a DestroyedImage exception.
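That pattern can be sketched in a few lines of Ruby (the class and exception names here are made up, not RMagick's):

```ruby
# Hypothetical "destroyable" field: once destroy! is called, any use
# raises a clean exception instead of touching unmapped memory.
class MappedField
  class DestroyedFieldError < RuntimeError; end

  def initialize(data)
    @data = data
  end

  def destroy!
    @data = nil
    self
  end

  def to_s
    raise DestroyedFieldError, "backing pages unmapped" unless @data
    @data
  end
end

f = MappedField.new("value")
f.to_s       # => "value"
f.destroy!
# f.to_s now raises DestroyedFieldError instead of dumping core
```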

On 2007-12-11 17:22 +0900 (Tue), Robert K. wrote:

> 2007/12/11, Tim H. [email protected]:
>
>> For a problem of this scale, it seems like it would make sense to use a
>> custom class that had some of the methods of String - enough for the
>> callees to treat it like a String - but not in fact String. Give it a
>> .to_s method to convert to a real String when it’s really necessary.
>
> But you still have the GC overhead - regardless whether you create
> String or Object.

Not necessarily. If you set the contract, as I have, that the object
is valid only for the duration of execution of the callback to which
it’s passed, and that validity expires once the callback returns, you
can allocate 50 (or whatever) of these read-only pseudo-String objects
at the start of reading the file, and re-use them for each row. The key
is that anything that the callback wants to save for use after this
particular call has exited will have to be copied.
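That contract can be sketched in Ruby (the class name and the tab-splitting are stand-ins for the real C scanner):

```ruby
# Hypothetical fixed pool of field holders, rewritten in place for each
# row; a callback must #dup anything it wants to keep past the yield.
class FieldPool
  def initialize(size)
    @fields = Array.new(size) { String.new }   # allocated once, reused
  end

  def each_row(lines)
    lines.each do |line|
      parts = line.split("\t")
      parts.each_with_index { |p, i| @fields[i].replace(p) }
      yield @fields[0, parts.size]
    end
  end
end

pool = FieldPool.new(50)
kept = nil
pool.each_row(["a\tb", "c\td"]) { |row| kept ||= row[0].dup }
kept  # => "a"  (the dup survives; the pool slot itself now holds "c")
```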

> I.e., that way you would create a single String per line only while
> allowing the caller to create substring instances if needed. If the
> client code needs to do that anyway you could as well create those
> String instances yourself because it does not make any difference. If
> not, you save a factor of 50 creations

When you’re talking about processing twenty- or thirty-million-row
files, even with the factor-of-50 savings you’ve still got a problem.
(Though I should benchmark this to see how it compares. A quick
comparison on my machine:

a = "aaaa"
n = 10_000_000
n10 = 100_000_000
n.times { }                  # 2 seconds
n.times { rand(n10) }        # 16 seconds
n.times { a + 'b' }          # 24 seconds
n.times { rand(n10).to_s }   # 27 seconds

[Oops, looks like “a + ‘b’” isn’t a no-copy operation after the first
one.])

I should perhaps explain that the common case is that the strings
are examined, but not modified. Particularly common is a relational
restriction: abort processing that row (for that particular chain of
Conditions and Results) if a data item doesn’t match a particular value,
or fall in between some particular values, or whatever. Actually needing
to copy a value is relatively rare. More frequent would be incrementing
a counter if a value matches some condition, or adding the value to a
total.

I should also explain that having these pseudo-strings as actual Ruby
objects immediately after parsing is partly a convenience; in something
that needed to be really fast, the initial Conditions, at least, and
later ones and Results, if they match frequently enough, would be in C,
and I suppose I could delay creating the String objects until I had to
call a Ruby routine.

But as it stands, even the callbacks to Ruby are surprisingly fast, if
they don’t do a lot. On my machine, for an 800,000 line file (entirely
cached in memory), a scan running only C code takes 1.8 seconds of
CPU. Doing an rb_str_new on each line brings it up to 2.3 seconds;
doing instead a call into a Ruby function that does a simple string
comparison against an instance variable takes 3.4 seconds.

> But I’d say chances are that client code will do more complex
> manipulations…

As above; in many cases the initial comparisons to discard the line can
be done by simple C code.

> Here’s another variant: if you mmap the file and it fits into mem…

Sometimes not my case, unfortunately; I have to be able to deal with
files several gigabytes long on a 32-bit machine.

Thanks for all of the useful advice.

cjs

2007/12/11, Tim H. [email protected]:

> For a problem of this scale, it seems like it would make sense to use a
> custom class that had some of the methods of String - enough for the
> callees to treat it like a String - but not in fact String. Give it a
> .to_s method to convert to a real String when it’s really necessary.

But you still have the GC overhead - regardless whether you create
String or Object.

Another idea would be to change the interface. “Pseudo” code:

io.each do |line|
  line.freeze
  some_smart_parsing do |line, start, stop|
    # line does not change; start and stop are integer indexes
  end
end
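One possible shape for the hypothetical some_smart_parsing, yielding integer offsets rather than substrings:

```ruby
# Sketch: scan a frozen line and yield (line, start, stop) for each
# tab-delimited field; no per-field Strings are created by the scanner.
def some_smart_parsing(line)
  start = 0
  while (stop = line.index("\t", start))
    yield line, start, stop
    start = stop + 1
  end
  yield line, start, line.length
end

fields = []
some_smart_parsing("a\tbb\tccc".freeze) do |l, s, e|
  fields << l[s...e]   # materialize a String only when actually needed
end
fields  # => ["a", "bb", "ccc"]
```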

I.e., that way you would create a single String per line only while
allowing the caller to create substring instances if needed. If the
client code needs to do that anyway you could as well create those
String instances yourself because it does not make any difference. If
not, you save a factor of 50 creations (ints are treated differently).
But I’d say chances are that client code will do more complex
manipulations and in that case the question is whether you have the
right (i.e. fast enough) tool at all. Because these blocks will do
some calculation and those will likely also create objects etc. I’d
say you either have to live with the overhead or use a different tool
altogether.

Here’s another variant: if you mmap the file and it fits into mem you
could as well do

some_smart_parsing do |s, start, stop, field_index|
  # s is the whole file and does not change; start and stop are
  # integer indexes
end

The field index could be a flag indicating first or not first field in
a record. But this interface starts to get contrived and you still
have the overhead in the block.

Kind regards

robert

On 12/11/07, Curt S. [email protected] wrote:

> are examined, but not modified. Particularly common is a relational
> […]
> later ones and Results, if they match frequently enough, would be in C,
> and I suppose I could delay creating the String objects until I had to
> call a Ruby routine.

Would the immutable Rope implementation from RubyQuiz 137 help here?