Avoiding memcpy() during String creation in a C extension

I’m working on a fast wikitext-to-HTML translator written in C and am
trying to optimize it.

Profiling shows that a fair portion of the time is being spent
instantiating temporary strings and much of that time is spent in
memcpy.

In much of the code I could avoid many String instantiations and just
prepare everything in a buffer at the raw byte level, not touching any
Ruby objects, and do a rb_str_new() at the end.

The problem with this approach is that Ruby will alloc space for the
string and then memcpy the buffer, which is wasteful seeing as I’ve
already done the alloc and I just want to relinquish control of that
block of memory and hand it over to Ruby.

Basically what I would like is a String instantiation method which
could take a pointer to a buffer and basically take ownership of that
buffer instead of calling memcpy() on it.

I’ve looked in string.c to see if such a function exists and it
appears that it does not. I know I could write my own such function
but that is starting to meddle with the internals a little bit too
much for my liking, and I’d need a separate version for Ruby 1.8 and
another for 1.9. It would be very brittle and could break if changes
are made to Ruby in the future.

Any ideas? Or should I just resign myself to that redundant memcpy()?
We’re not talking about a lot of memory, but it’s something that can
run thousands of times per second and is currently processing wikitext
at about 6 megabytes per second and I’d like to make it even faster.

Cheers,
Wincent

Wincent C. wrote:

Basically what I would like is a String instantiation method which
could take a pointer to a buffer and basically take ownership of that
buffer instead of calling memcpy() on it.

I’ve looked in string.c to see if such a function exists and it
appears that it does not.

Instead of putting your data into a buffer and then asking Ruby to make
a String out of it, could you use rb_str_buf_new() to create a String
with a specific capacity and then put your data into the String’s
buffer?

On 6 feb, 18:46, Tim H. [email protected] wrote:

a String out of it, could you use rb_str_buf_new() to create a String
with a specific capacity and then put your data into the String’s
buffer?

Thanks, Tim.

That would indeed work I think. Didn’t know about that function but it
will come in very handy I think. And if the buffer needs to grow as I
append output to it then rb_str_cat should be relatively efficient (I
certainly noticed a huge speed-up when I started using rb_str_cat with
raw buffers rather than creating new Ruby Strings and using
rb_str_append).

Cheers,
Wincent

On 6 feb, 20:43, Phlip [email protected] wrote:

DMA that back those up…
I’m not sure what you mean. Can you clarify a bit? The translator
consists of an extremely faster Ragel tokenizer/lexer/scanner that
emits tokens one at a time, and a parser that decides what to do with
them. The tokens themselves don’t contain copies of the corresponding
substrings, but pointers into the input string. The parser is not a
recursive descent parser but an on-the-fly translator that decides
what to emit as it sees each token (thus avoiding function call
overhead; we don’t need the same level of structured parsing that a
recursive descent parser would give us because we’re only transforming
wikitext into HTML, not parsing a programming language; wikitext can
be imperfect and have syntax errors whereas a strictly defined
programming language must not).

But when it comes time to emit the output you generally have to
instantiate some strings. As an example, consider parsing the
wikitext, ‘‘foo’’, which should be translated into foo. At
the Ruby level you want a simple interface, where you pass in a
standard String object and get back a standard String object:

parser.parse “‘‘foo’’”
=> “foo

So under the covers in the C extension this is what happens:

  1. Initialize empty output buffer, currently just an empty Ruby
    string
  2. Provide tokenizer with a pointer to the start of the input string
    (no copies are made)
  3. Ask tokenizer for the next token
  4. Get an “EM” token (no copy of token text made, just a pointer
    into the input string
  5. Parser appends “” to output buffer using rb_str_cat (avoiding
    instantiating of Ruby String)
  6. Ask tokenizer for next token
  7. Get a “PRINTABLE” token (no copy of token text made, just a
    pointer into the input string)
  8. Parser appends “foo” to output buffer using rb_str_cat (again
    avoiding instantiating a Ruby String)
  9. Ask tokenizer for next token
  10. Get an “EM” token
  11. Parser appends " to output buffer using rb_str_cat
  12. Return finished String to caller

So in a general sense it is a “streaming” transformer (meaning that it
transforms the input on the fly without doing any lookahead except in
some special circumstances where it has to ask the lexer for another
token in order to decide what to emit), but I gather from your comment
that you’re referring to something else.

So in looking at the code there are now very few String/Object
instantiations that I can avoid because I am already doing as much as
I can at the raw byte level (with rb_str_cat and memcpy itself).
Nevertheless it looks like those few sites are now the lowest-hanging-
fruit reported in profiling and in order to make it faster that’s what
I have to attack. So I sent my original email in this thread because
there are still some temporary instantiations that I could avoid on
occasions where the parser calls out to other functions (which are
actually mostly inlined), and being able to pass in a buffer without
having Ruby memcpy it would be a great help.

Cheers,
Wincent