Which Hash methods are thread safe?

thekog · July 13, 2007, 8:52pm

When I’ve written multithreaded code in Ruby in the past I’ve been quite
diligent about synchronizing access to shared resources. However, now
I’m
working on a project where oversynchronization may actually be costly
and
I’m curious about how much I can actually do with core datastructures
before running into thread safety issues.

I’ve seen the issue of thread safety come up a couple times on the list
before but I’ve not been able to find concrete answers as to what
exactly
is safe.

Thus, a few questions:

Will Hash#[] always return either a previously assigned value, or
the
default value, despite interleaved calls to Hash#[]= and
Hash#delete?

Or is it possible for Hash#[] in such a scenario to result in a wild
pointer access that would raise an exception or worse?
Will interleaved calls to Hash#[]= do something sensible? That is,
when all interleaved calls are finished, will a lookup always return
one of the previously assigned values (even if it’s not the most
“recent”)?
Can something “bad” happen when a hash is modified during execution
of
Hash#each? I understand that such a call may yield elements that
were added since the start of its execution, or may not yield
elements
that were delete since the start of its execution. But, at least,
will
all key/value pairs be yielded that weren’t modified during
execution?

Will each pair be yielded only once? I’m thinking a rehash during
#each
could result in disaster.
In general, can the lack of Ruby-level synchronization actually
corrupt
a datastructure in some really unfortunate way? For example, could
interfering mutations to middle elements of an array accidentally
lop
off the entire end of an array?
Does the answer to any of the above change for ruby 1.9?

Basically, in short, I’m willing to give up synchronized access to data
structures for trivial operations (i.e., lookups) where the result is
only
valid for an infinitessimally short peiord anyways, so long as I know
that
it won’t result in wild application behavior like exceptions or a
segfault.

A semi-related question: Are there readers/writers locks for Ruby?
I’ve
not seen one.

Thanks!

thekog · July 13, 2007, 10:54pm

On Sat, 14 Jul 2007 03:51:06 +0900, Mike K.
[email protected] wrote:

I’m working on a project where oversynchronization may actually be costly

Have you measured to see whether this is actually the case?

I’m curious about how much I can actually do with core datastructures
before running into thread safety issues.

Not much at all. Reading a data structure from multiple threads is
fine as long as nothing ever writes to it once it becomes shared, and
as long as each thread besides the one creating/initializing the object
obtained its original reference to the data structure through
synchronized
means.

I’ve seen the issue of thread safety come up a couple times on the list
before but I’ve not been able to find concrete answers as to what exactly
is safe.

Giving up synchronization means that (depending on the particulars of
the
Ruby implementation) you lose a lot of important guarantees like memory
operations being visible in a logical order. It isn’t so much a
question
of what is and isn’t safe: once you drop synchronization, all bets are
off.

(Generally a Ruby implementation will try to make sure you can’t crash
the interpreter, but beyond that you’re on your own.)

The best way to get good threaded performance in a Ruby program is:

minimize the amount of synchronization required

a. as much as possible, try to only share “immutable” objects
between
threads

b. make thread-local copies of things

c. let one thread have responsibility for an object at a time,
passing objects between threads via a synchronized communication
mechanism like Queue
load fastthread after thread.rb (but rescue LoadError if it’s not
available)

If you’d like to describe what you’re doing with the shared Hash, I may
also be able to give more specific advice.

A semi-related question: Are there readers/writers locks for Ruby? I’ve
not seen one.

There is sync.rb in stdlib, but I wouldn’t recommend using it because
the
implementation is very poor.

-mental

thekog · July 14, 2007, 12:26am

On Sat, Jul 14, 2007 at 05:53:37AM +0900, MenTaLguY wrote:

Have you measured to see whether this is actually the case?

Not yet. Chances are that synchronization won’t actually have a runtime
performance impact. It’s just the first time I’ve worked on a project
where I didn’t feel comfortable with requiring so much synchronization
for reading, when in practice, rarely will data actually be mutated.

There’s also a scalability concern, however I imagine that the rest of
the application will see scalability issues long before.

If you’d like to describe what you’re doing with the shared Hash, I may
also be able to give more specific advice.

The one situation I’m working on right now is basically a pub/sub
mechanism on top of drb. Basically a bunch of clients register a
callback object in a hash (keyed by some client identifier) on a server.
When a message is published, it gets sent to each client registered in
the hash.

The problem with synchronization is that only one message can be sent at
a time. I could make a copy of the hash value array so that drb calls
don’t have to made while synchronized, but that feels like I’d have to
make a lot of unnecessary copies.

Basically the point is that messages are published often, subscriptions
rarely change. If there was a good r/w lock implementation with
writer’s priority, that would solve the problem rather easily.

I guess I could also use Thread.critical= when making subscription
updates, but isn’t that deprecated?

thekog · July 14, 2007, 1:08am

Mike K. wrote:

The one situation I’m working on right now is basically a pub/sub
mechanism on top of drb. Basically a bunch of clients register a
callback object in a hash (keyed by some client identifier) on a server.
When a message is published, it gets sent to each client registered in
the hash.

Seems you are implementing an MoM (Message oriented Middleware) on your
own.

Threads may be a little overkill for such a system. I’m currently
working on stompserver and the “topic” system might be a good choice
(depending on the constraints on you client side).

Basically, stompserver is a Ruby MoM based on the Stomp protocol. Each
client connecting to a stompserver can:

publish in queues or topics,
subscribe to queues or topics and listen for incoming messages.

On queues one message is delivered to one and only one client subscribed
to the queue. On topics, the message is dispatched to each and every
client subscribed. The whole server is based on EventMachine which used
a simple select loop (with 0.8.0 they just switched to epoll on Linux
but the idea is the same: one processus, one thread, event driven
server).

As ruby currently doesn’t support native threads, event based servers
are probably the best fit to get performance (not to mention that often
event based servers outperform thead based servers, especially on a
single CPU system and sometimes even on SMP) and to avoid thread related
complexity.

Stompserver isn’t the only solution around. ReliableMessaging is another
that might even be better suited to your needs as it is based on Drb
too, see
http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/Ruby/ReliableMessaging

Lionel

thekog · July 14, 2007, 2:04am

On Fri, 13 Jul 2007 16:59:28 -0700, MenTaLguY [email protected] wrote:

The problem with synchronization is that only one message can be sent at
a time.

It also establishes an ordering between delivery and (un)subscription,
which might be important. On the other hand, if your clients are robust against
unexpected messages, you can certainly take advantage of extra concurrency
to get better throughput.

note: here, I’m comparing synchronized approaches which hold the lock
during
message delivery with synchronized approaches which do not. The example
code I gave belongs to the latter category.

-mental

thekog · July 14, 2007, 2:00am

On Sat, 14 Jul 2007 06:54:00 +0900, Mike K.
[email protected] wrote:

On Sat, Jul 14, 2007 at 05:53:37AM +0900, MenTaLguY wrote:
It’s just the first time I’ve worked on a project
where I didn’t feel comfortable with requiring so much synchronization
for reading, when in practice, rarely will data actually be mutated.

Resist those feelings until they’re backed up with real measurements;
you’ll be happier and make less work for yourself.

The one situation I’m working on right now is basically a pub/sub
mechanism on top of drb. Basically a bunch of clients register a
callback object in a hash (keyed by some client identifier) on a server.
When a message is published, it gets sent to each client registered in
the hash.

The problem with synchronization is that only one message can be sent at
a time.

It also establishes an ordering between delivery and (un)subscription,
which
might be important. On the other hand, if your clients are robust
against
unexpected messages, you can certainly take advantage of extra
concurrency
to get better throughput.

I could make a copy of the hash value array so that drb calls
don’t have to made while synchronized, but that feels like I’d have to
make a lot of unnecessary copies.

Basically the point is that messages are published often, subscriptions
rarely change.

Why not make the copies at subscription time rather than delivery time?

def initialize
@lock = Mutex.new
@subscriptions = Hash.new { |h,k| h[k] = [].freeze }
end

def publish(publisher, message)
@lock.synchronize do
@subscriptions[publisher]
end.each do |subscriber|
subscriber.deliver message
end
self
end

def subscribe(subscriber, publisher)
@lock.synchronize do
subscriptions = @subscriptions[publisher].dup
subscriptions.push subscriber
@subscriptions[publisher] = subscriptions.freeze
end
self
end

def unsubscribe(subscriber, publisher)
@lock.synchronize do
subscriptions = @subscriptions[publisher].dup
subscriptions.delete subscriber
@subscriptions[publisher] = subscriptions.freeze
end
self
end

The freezing isn’t really necessary, but it makes it a little
clearer what’s happening (and catches you if you mess up).

If there was a good r/w lock implementation with writer’s priority,
that would solve the problem rather easily.

Possibly, yes. You’d want to measure, though, especially as a Ruby
implementation of an R/W lock would typically introduce a lot of
extra method calls (which can be very expensive)

I guess I could also use Thread.critical= when making subscription
updates, but isn’t that deprecated?

I wouldn’t recommend using it.

-mental

thekog · July 14, 2007, 2:16am

On Sat, 14 Jul 2007 08:59:24 +0900, MenTaLguY [email protected] wrote:

def publish(publisher, message)
@lock.synchronize do
@subscriptions[publisher]
end.each do |subscriber|
subscriber.deliver message
end
self
end

Also, note that this allows the delivery of several messages from
the same publisher to be interspersed (as would an R/W lock). You
may need to think hard about whether that is okay for your purposes.

-mental

thekog · July 14, 2007, 7:57pm

On Sat, Jul 14, 2007 at 08:59:24AM +0900, MenTaLguY wrote:

Why not make the copies at subscription time rather than delivery time?

That’s a great idea.

Thanks for the code, it’s given me some ideas on an approach if I
continue the “doing it on my own” route.

thekog · July 14, 2007, 7:55pm

On Sat, Jul 14, 2007 at 08:07:01AM +0900, Lionel B. wrote:

Seems you are implementing an MoM (Message oriented Middleware) on your own.

It’s creeping into that territory, yes.

Threads may be a little overkill for such a system.

Agreed. However I wanted to go the path of least resistance, and
building something on top of drb which uses threads internally was the
right direction to go. This, of course, is all assuming nobody else has
produced a system that’s adaptable to my needs.

I’m currently working on stompserver and the “topic” system might be
a good choice (depending on the constraints on you client side).
…
Stompserver isn’t the only solution around. ReliableMessaging is another
that might even be better suited to your needs as it is based on Drb
too, see
http://trac.labnotes.org/cgi-bin/trac.cgi/wiki/Ruby/ReliableMessaging

Thanks for the pointers, I’ll definitely take a look at these and see if
they’re workable alternatives. I really don’t want to reinvent the
wheel if I don’t have to.