Memcache C extension

I’ve decided I want to become more proficient in C, so the project I
chose is a Ruby memcache C extension. I found the following site with
a hash algorithm that seems to be an improvement on the default one;
can anyone think of a reason not to use it?

http://www.last.fm/user/RJ/journal/2007/04/10/392555/

Also, since the extension isn’t going to be thread safe (I think I’d
have to create a new native thread for each instance of the class if I
wanted that), I was thinking the best way to structure it would be to
use a global variable for the C struct, which can only be created
once. Then in the initialize function I can make sure the C function
mc_new() is only called once.

For example:

struct memcache *mc = NULL;

static VALUE class_initialize(VALUE self) {
  if (mc != NULL)
    rb_raise(rb_eRuntimeError, "client already initialized");
  mc = mc_new();
  return self;
}

Also, I have a particular need for caching database queries, where I
buffer the returned rows and then send the whole thing to memcache.
Would it be more efficient to implement a string buffer in C rather
than use an array or concatenate strings in Ruby? I’m thinking about
adding this as an additional feature: you create a buffer, fill it up,
then when you want to do a memcache add/set, you tell it to add/set
the contents of the buffer.
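For comparison on the Ruby side, here is a rough benchmark sketch of the usual buffering approaches (the row data is made up for illustration):

```ruby
require 'benchmark'

rows = Array.new(10_000) { |i| "row-#{i}|some,column,data" }

Benchmark.bm(12) do |b|
  # Collect rows in an array, join once at the end.
  b.report('Array#join') { rows.join("\n") + "\n" }
  # Append each row to a single growing string (in-place append).
  b.report('String#<<')  { buf = String.new; rows.each { |r| buf << r << "\n" }; buf }
  # Naive += allocates a new string on every append.
  b.report('String#+=')  { buf = ''.dup; rows.each { |r| buf += r + "\n" }; buf }
end
```

Array#join and in-place `<<` tend to be close; the one to avoid is `+=`, which reallocates the whole buffer on each append.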

Thoughts/feedback appreciated.

Chris

I meant Mc not mc, Mc being global.

Chris

snacktime wrote the following on 17.07.2007 21:45:

> I’ve decided I want to become more proficient in C, so the project I
> chose is a Ruby memcache C extension. I found the following site with
> a hash algorithm that seems to be an improvement on the default one;
> can anyone think of a reason not to use it?
>
> http://www.last.fm/user/RJ/journal/2007/04/10/392555/

Two problems I can see with this algorithm:

  • very poor distribution with few servers,
  • no control of weight between servers (i.e. if I have a memcache server
    with 2GB and one with 128MB, nothing prevents the second from getting 10
    times as much access as the first one…).

This algorithm benefits very large server farms with identical server
configurations where having poor performance over a small part of the
cache isn’t a big deal.

If you want both good controlled distribution and high cache hit rates
with a low number of servers, on first thought I’d implement a
double-query system where cache clients are responsible for moving data
across servers based on an old configuration and a new one.
When the hit ratio on the new configuration is good and the ratio on
the old one is bad, deactivate queries on the old configuration.
Assuming two gets and a put to memcache servers are faster than direct
access to the data, this is probably a win.

With that kind of system you could even handle server failures
automatically by creating new configurations on the fly (and using hit
ratio stats from the two previous configurations to decide which one is
the better new ‘old one’).
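A minimal Ruby sketch of that double-query idea, assuming hypothetical pool objects built from the old and new configurations, each exposing a memcache-like get/set API:

```ruby
# Client-side migration sketch: try the new server layout first,
# fall back to the old one, and copy the value forward on a hit.
class MigratingCache
  def initialize(old_pool, new_pool)
    @old_pool = old_pool   # client built from the old configuration
    @new_pool = new_pool   # client built from the new configuration
  end

  def get(key)
    value = @new_pool.get(key)
    return value unless value.nil?
    value = @old_pool.get(key)
    # Move the entry to the server the new configuration picks,
    # so later gets hit the new layout directly.
    @new_pool.set(key, value) unless value.nil?
    value
  end

  def set(key, value)
    @new_pool.set(key, value)  # writes always go to the new layout
  end
end
```

Once the hit ratio on `@new_pool` is high enough, the wrapper can be dropped and the old pool retired.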

But then again it’s only a first thought :-)

Lionel.

On 7/17/07, snacktime [email protected] wrote:

> I’ve decided I want to become more proficient in C, so the project I
> chose is a Ruby memcache C extension. I found the following site with
> a hash algorithm that seems to be an improvement on the default one;
> can anyone think of a reason not to use it?
>
> http://www.last.fm/user/RJ/journal/2007/04/10/392555/

For what it’s worth, there is a pure Ruby implementation of this idea.
It’s the MemCacheWithConsistentHashing gem.

Info can be found in this thread:
http://groups.google.com/group/acts_as_cached/browse_thread/thread/a01d22ce0fc9ccec

On Jul 17, 2007, at 12:45, snacktime wrote:

> I’ve decided I want to become more proficient in C, so the project I
> chose is a Ruby memcache C extension. I found the following site with
> a hash algorithm that seems to be an improvement on the default one;
> can anyone think of a reason not to use it?
>
> http://www.last.fm/user/RJ/journal/2007/04/10/392555/

A consistent hashing library is completely orthogonal to a
memcached library.

You’d be better off wrapping apr_memcache or libmemcache than writing
yet another C memcached client.

> Also, since the extension isn’t going to be thread safe (I think I’d
> have to create a new native thread for each instance of the class if I
> wanted that), I was thinking the best way to structure it would be to
> use a global variable for the C struct, which can only be created
> once. Then in the initialize function I can make sure the C function
> mc_new() is only called once.

Only one thread can be in C code at a time (unless you call
rb_thread_select), so writing ruby-thread-safe C libraries is
probably easier than you suspect.

> Also, I have a particular need for caching database queries, where I
> buffer the returned rows and then send the whole thing to memcache.
> Would it be more efficient to implement a string buffer in C rather
> than use an array or concatenate strings in Ruby? I’m thinking about
> adding this as an additional feature: you create a buffer, fill it up,
> then when you want to do a memcache add/set, you tell it to add/set
> the contents of the buffer.

It will be a better use of your time to use an existing memcached
library and optimize your entire task using benchmarks and
profiling. memcached overhead is probably not going to be your
bottleneck.

Lionel B. wrote the following on 17.07.2007 22:18:

> But then again it’s only a first thought :-)

Lionel.

Second thought.

Instead of choosing the server by a simple integer comparison of hash
values, you could compute a XOR distance between the hash value of the
key and the hash value of the server. Then you can adjust the distance
by a weight.

Example (using 8-bit hash keys for readability, distance being the
number of set bits in the XOR):

  key1 has a hash key of 01010101
  server1 has a hash key of 11000111
  server2 has a hash key of 10011011

  dist(key1, server1) = popcount(01010101 XOR 11000111) = popcount(10010010) = 3
  dist(key1, server2) = popcount(01010101 XOR 10011011) = popcount(11001110) = 5

server1 wins (on equality, give priority to the first server in the
list).
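A Ruby sketch of that selection rule, using the popcount of the XOR as the distance (the 32-bit MD5-prefix hash is just an illustrative stand-in, not from any particular library):

```ruby
require 'digest'

# Illustrative 32-bit hash: first 4 bytes of MD5, big-endian.
def hash32(s)
  Digest::MD5.digest(s).unpack1('N')
end

# Number of set bits, i.e. the Hamming distance once applied to an XOR.
def popcount(x)
  x.to_s(2).count('1')
end

# Pick the server whose hash is nearest to the key's hash in XOR
# distance; ties go to the earlier server in the list.
def pick_server(key, servers)
  kh = hash32(key)
  servers.each_with_index
         .min_by { |srv, i| [popcount(kh ^ hash32(srv)), i] }
         .first
end
```

A per-server weight could then be folded in by subtracting a correction factor from each server’s distance before taking the minimum.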

Now if you want to use weights, you’ll have to adjust the distances by
adding a correction factor.

You can compute this correction factor by first computing the
distribution of distances between 0 and size(hash). Then you can derive
the probability of

  random_distance < another_random_distance + correction_factor

by comparing 1 (the sum of all distance probabilities) with 1 minus the
probabilities of each possible distance that is < correction_factor.

You now have a mapping of correction_factor -> relative weight; invert
the map and you can use it to give relative weights to servers (this
obviously works better with long hashes, as you can’t have huge
weights with small hashes, and the nearer the weight is to the maximum
allowed, the less even the distribution amongst servers is).

This has the advantages of:

  • handling additions/removals of servers well (as good as the
    solution in the original URL),
  • supporting weights between servers,
  • a distribution that isn’t ideal, but is orders of magnitude better
    than the other algorithm’s.

Lionel.

On 7/17/07, Eric H. [email protected] wrote:

> You’d be better off wrapping apr_memcache or libmemcache than writing
> yet another C memcached client.

I am wrapping libmemcache, but it looked easy enough to provide your
own hashing algorithm. This is actually just part of some other work
I’m doing on a query caching layer in ActiveRecord; memcache was
taking a significant amount of time in that caching layer, plus I
wanted to hone my C skills anyway.

Chris