Inter-Process Messaging

Daniel DeLorme wrote:

  • DRb

Frankly I’m still not entirely clear on what are the advantages of each,
both in terms of features and in terms of performance. I guess I’ll just
have to write some code to find out…

Daniel

  1. You missed Rinda, a Linda derivative layered on top of DRb (a minimal
    sketch follows this list).

  2. There are basically two ways to do concurrency/parallelism: shared
    memory and message passing. I’m not sure what the real tradeoffs are
    – I’ve pretty much had my brain hammered with “shared memory bad –
    message passing good”, but there obviously must be some
    counter-arguments, or shared memory wouldn’t exist. :slight_smile:

  3. System V IPC has three components – message queues, semaphores, and
    shared memory segments. So … if you’ve found a System V IPC library,
    you’ve found a shared memory library.

Thanks for all the answers. I now have:

primitives:

  • TCPServer/TCPSocket
  • UNIXServer/UNIXSocket
  • IO.pipe (see the sketch after this list)
  • named pipes
  • shared memory (can’t find low-level lib?)
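
A minimal sketch of the simplest primitive on that list, IO.pipe across a
fork (Unix-only, since it relies on fork); the message text is just an
example.

```ruby
# A fork-parent and child talking over IO.pipe.
reader, writer = IO.pipe

pid = fork do
  reader.close                      # the child only writes
  writer.puts "result from child #{Process.pid}"
  writer.close
end

writer.close                        # the parent only reads
puts reader.read                    # returns once the child closes its end
reader.close
Process.wait(pid)
```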

libraries: (after some digging on rubyforge)

  • DRb
  • Event Machine
  • ActiveMessaging
  • Slave
  • BackgroundDRb
  • System V IPC
  • POSIXIPC (not released)
  • reliable-msg
  • MPI Ruby
  • stomp / stompserver / stompmessage
  • AP4R (Asynchronous Processing for Ruby)

Frankly I’m still not entirely clear on what are the advantages of each,
both in terms of features and in terms of performance. I guess I’ll just
have to write some code to find out…

Daniel

From: “M. Edward (Ed) Borasky” [email protected]

Daniel DeLorme wrote:

Thanks for all the answers. I now have:

primitives:

  • TCPServer/TCPSocket
  • UNIXServer/UNIXSocket
  • IO.pipe
  • named pipes
  • shared memory (can’t find low-level lib?)

possibly mmap: http://moulon.inra.fr/ruby/mmap.html

  1. You missed Rinda, a Linda derivative layered on top of DRb.

  2. There are basically two ways to do concurrency/parallelism: shared
    memory and message passing. I’m not sure what the real tradeoffs are
    – I’ve pretty much had my brain hammered with “shared memory bad –
    message passing good”, but there obviously must be some
    counter-arguments, or shared memory wouldn’t exist. :slight_smile:

The way I look at it, Re: trade-offs, message passing is nice
because the API doesn’t change whether the receiver is on localhost
or over the wire.

Shared memory is nice for very high bandwidth between processes;
say process A decodes a 300MB TIFF image into memory and wants
to make the bitmap available to process B.
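
A small DRb sketch of that point, with an invented EchoService and port:
the calling code stays the same whether the URI names localhost or a
remote host.

```ruby
require 'drb/drb'

# Serving side (wherever it happens to run):
class EchoService
  def call(msg)
    "echo: #{msg}"
  end
end
DRb.start_service('druby://0.0.0.0:9000', EchoService.new)

# Calling side: identical for a local or a remote receiver; only the URI changes.
svc = DRbObject.new_with_uri('druby://localhost:9000')            # same machine
# svc = DRbObject.new_with_uri('druby://render-box.example:9000') # over the wire
puts svc.call('hello')
```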

  3. System V IPC has three components – message queues, semaphores, and
    shared memory segments. So … if you’ve found a System V IPC library,
    you’ve found a shared memory library.

Regards,

Bill

Bill K. wrote:

From: “M. Edward (Ed) Borasky” [email protected]

Shared memory is nice for very high bandwidth between processes;
say process A decodes a 300MB TIFF image into memory and wants
to make the bitmap available to process B.

And shared memory might also be nice if you don’t want queueing, you
just want the latest data. Sensors in a real-time system, for example.
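
A tiny in-process sketch of that distinction (threads and a Mutex standing
in for separate processes and real shared memory): a queue retains every
sample, while a single overwritten slot keeps only the most recent one.

```ruby
require 'thread'

lock   = Mutex.new
latest = nil          # shared-memory style: one slot, overwritten in place
queue  = Queue.new    # message-queue style: every sample is retained

producer = Thread.new do
  10.times do |i|
    sample = { :time => i, :reading => 20.0 + i }
    lock.synchronize { latest = sample }   # older readings are simply lost
    queue.push(sample)                     # a consumer would have to drain all of these
    sleep 0.01
  end
end
producer.join

puts "latest slot holds   : #{lock.synchronize { latest.inspect }}"
puts "samples still queued: #{queue.size}"
```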

From: “Joel VanderWerf” [email protected]

The way I look at it, Re: trade-offs, message passing is nice
because the API doesn’t change whether the receiver is on localhost
or over the wire.

(Just wanted to correct myself and say the API need not change, rather
than does not change… certainly some message passing APIs only work on
the local machine, but others are network-transparent. :slight_smile:)

Shared memory is nice for very high bandwidth between processes;
say process A decodes a 300MB TIFF image into memory and wants
to make the bitmap available to process B.

And shared memory might also be nice if you don’t want queueing, you
just want the latest data. Sensors in a real-time system, for example.

Ah, yeah. Makes sense.

Indeed, this can also apply over the wire; for example: dealing with
a client->server->client remote mouse click-drag-release operation.
(Where, for example, a client is dragging a scrollbar thumb where
the UI is hosted on the server, and may be being rendered to multiple
clients.) One might transmit the initial click coordinate over TCP,
then the subsequent real-time drag telemetry over UDP, and finally the
terminating click-release coordinate again over TCP. (That is, TCP is
used for reliable messages, and UDP for best-effort, unreliable
streaming of real-time telemetry where one only wants the latest data.)
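
A rough, self-contained sketch of that split using Ruby's socket classes,
with a thread standing in for the remote UI host; the ports and message
format are made up.

```ruby
require 'socket'

HOST, TCP_PORT, UDP_PORT = 'localhost', 4000, 4001   # made-up ports

tcp_server = TCPServer.new(HOST, TCP_PORT)
udp_server = UDPSocket.new
udp_server.bind(HOST, UDP_PORT)

# "Server" side: click and release must not be lost, so they arrive on TCP;
# drag samples are best-effort, so only the newest one that got through matters.
server = Thread.new do
  ctrl = tcp_server.accept
  puts "reliable : #{ctrl.gets.strip}"        # CLICK x y
  puts "reliable : #{ctrl.gets.strip}"        # RELEASE x y (blocks until it arrives)
  latest = nil
  begin
    loop { latest, _ = udp_server.recvfrom_nonblock(64) }
  rescue Errno::EAGAIN, Errno::EWOULDBLOCK
  end
  puts "newest drag sample that arrived: #{latest.inspect}"
end

# "Client" side: one reliable click, a stream of disposable drag samples,
# one reliable release.
ctrl = TCPSocket.new(HOST, TCP_PORT)
udp  = UDPSocket.new
ctrl.puts 'CLICK 10 20'
25.times { |i| udp.send("DRAG #{10 + i} 20", 0, HOST, UDP_PORT) }
sleep 0.1                                      # give the datagrams time to land
ctrl.puts 'RELEASE 35 20'
ctrl.close
server.join
```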

Regards,

Bill

On 10/12/07, Daniel DeLorme [email protected] wrote:

Frankly I’m still not entirely clear on what are the advantages of each,
both in terms of features and in terms of performance. I guess I’ll just
have to write some code to find out…

What no one has yet asked you is: what kind of data do you have to pass
between this fork-parent and child, and by what protocol?

Are they simple commands and responses (as in HTTP)? Is there a
state-machine (as in SMTP)? Or is the data-transfer full-duplex without
a protocol?

Are the data-flows extremely large? Exactly what are your performance
requirements?

Do you have one parent and one child? Or one parent and a great many
children?

Francis C. wrote:

What no one has yet asked you is: what kind of data do you have to pass
between this fork-parent and child, and by what protocol?

Are they simple commands and responses (as in HTTP)? Is there a
state-machine (as in SMTP)? Or is the data-transfer full-duplex without
a protocol?

Keeping in mind that this project is mainly intended as a learning
experience, my specific idea is a http server architecture with
generalist and specialist worker processes. The dispatcher would
partition requests to generalists (as in HTTP) who in turn might
dispatch to specialists (as in SMTP) for particular sub-tasks.

Simple example: 100 requests for a thumbnail come at the same time, are
split among N generalist workers, each of which asks the thumbnail
specialist process to generate the thumbnail. The specialist catches the
100 simultaneous requests, generates the thumbnail once and sends the
result back to the N generalists, who render it.
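
A sketch of just the coalescing part of that idea, with threads standing
in for the generalist/specialist worker processes; Coalescer and the
thumbnail block are invented for illustration.

```ruby
require 'thread'

# The first caller for a given key does the expensive work once; callers
# that arrive while it is running block and then share the same result.
class Coalescer
  Entry = Struct.new(:done, :result)

  def initialize
    @lock    = Mutex.new
    @cond    = ConditionVariable.new
    @entries = {}
  end

  def fetch(key)
    entry = nil
    owner = false
    @lock.synchronize do
      entry = @entries[key]
      unless entry
        entry = @entries[key] = Entry.new(false, nil)
        owner = true
      end
    end

    if owner
      result = yield                  # e.g. actually render the thumbnail
      @lock.synchronize do
        entry.done, entry.result = true, result
        @entries.delete(key)          # a later burst of requests starts fresh
        @cond.broadcast
      end
    else
      @lock.synchronize { @cond.wait(@lock) until entry.done }
    end
    entry.result
  end
end

coalescer = Coalescer.new
requests  = 100.times.map do
  Thread.new { coalescer.fetch('photo.jpg') { sleep 0.1; 'thumbnail bytes' } }
end
p requests.map(&:value).uniq.size   # => 1 distinct result shared by all callers
```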

Are the data-flows extremely large? Exactly what are your performance
requirements?

Good questions, but ultimately my true purpose is to educate myself
about parallel processing; that’s the only requirement I have. So while
I’d say data-flows are unlikely to be large in this case, I’d still like
to know how to handle large data-flows.

Hmmm, any good books to recommend?

Francis C. wrote:

Sounds like you’ve already read all the books you need to read. You have the
standard lingo down pat!

I’m afraid that must be a freak accident :wink:

But here goes: you picked the wrong project to demonstrate parallel
processing. Fast handling of network I/O is best done in an event-driven
way, and not in parallel. The parallelism that this problem exhibits arises
from the inherent nondeterminacy of having many independent clients
operating simultaneously. This pattern does expose capturable intramachine
latencies, but they’re due to timing differentials, not to processing
inter-dependencies.

Could you elaborate what you mean by “timing differentials” and
“processing inter-dependencies”? For regular webapps, time spent
querying the database most certainly exposes capturable intramachine
latencies. Event-driven sounds good, but doesn’t that require that
all I/O be non-blocking? If you have blocking I/O in, say, a
third-party lib, you’re toast.

It’s intuitively attractive to structure a network server as a set of
parallel processes or threads, but it doesn’t add anything in terms of
performance or scalability. As regards multicore architectures, they add
little to a network server because the size of the incoming network pipe
typically dominates processor bandwidth in such applications.

You mean to say it’s the network that is usually the bottleneck, not the
CPU? Well, in my experience the database is usually the bottleneck, but
let’s not forget that ruby is particularly demanding on the CPU.

You may rejoin: “but how about an HTTP server that does a massive amount of
local processing to fulfill each request?” Now that’s more interesting. Just
get rid of the HTTP part and concentrate on how to parallelize the
processing. That’s a huge and well-studied problem in itself, and the net is
full of good resources on it.

While optimizing for CPU speed is fine, I’m also interested in process
isolation. If you have a monster lib that takes 1 minute to initialize
and requires 1 GB of resident memory but is used only occasionally, do
you really want to load it in all of your worker processes?
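
One hedged sketch of that kind of isolation: load the hypothetical
monster library in a single dedicated process and let the small workers
reach it over DRb (the file names, URI, and MonsterLib calls are
placeholders).

```ruby
# Dedicated "monster" process: the only one that pays the load-time and
# memory cost.
require 'drb/drb'
# require 'monster_lib'   # hypothetical: 1 GB resident, 1 minute to initialize

class MonsterFrontend
  def process(request)
    # MonsterLib.process(request) would go here; a stub keeps the sketch runnable.
    "processed: #{request.inspect}"
  end
end

DRb.start_service('druby://localhost:9999', MonsterFrontend.new)
DRb.thread.join

# Each small worker process never requires the library at all; it just
# holds a DRb proxy to the one process that did:
#
#   require 'drb/drb'
#   monster = DRbObject.new_with_uri('druby://localhost:9999')
#   monster.process('the occasional heavy request')
```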

On 10/12/07, Daniel DeLorme [email protected] wrote:

Keeping in mind that this project is mainly intended as a learning…

Sounds like you’ve already read all the books you need to read. You
have the standard lingo down pat!

I’m probably the wrong person to ask, because I’ve been doing
high-performance parallel processing for many, many years, and my advice
(being experience-based) will assuredly fly in the face of orthodoxy.

But here goes: you picked the wrong project to demonstrate parallel
processing. Fast handling of network I/O is best done in an event-driven
way, and not in parallel. The parallelism that this problem exhibits
arises from the inherent nondeterminacy of having many independent
clients operating simultaneously. This pattern does expose capturable
intramachine latencies, but they’re due to timing differentials, not to
processing inter-dependencies.

It’s intuitively attractive to structure a network server as a set of
parallel processes or threads, but it doesn’t add anything in terms of
performance or scalability. As regards multicore architectures, they add
little to a network server because the size of the incoming network pipe
typically dominates processor bandwidth in such applications.

You may rejoin: “but how about an HTTP server that does a massive amount
of local processing to fulfill each request?” Now that’s more
interesting. Just get rid of the HTTP part and concentrate on how to
parallelize the processing. That’s a huge and well-studied problem in
itself, and the net is full of good resources on it.

On 10/12/07, Daniel DeLorme [email protected] wrote:

Could you elaborate what you mean by “timing differentials” and
“processing inter-dependencies”? For regular webapps, time spent
querying the database most certainly exposes capturable intramachine
latencies. Event-driven sounds good, but doesn’t that require that
all I/O be non-blocking? If you have blocking I/O in, say, a
third-party lib, you’re toast.

Event-driven application frameworks often deal with library calls that
necessarily involve blocking network or intramachine I/O (DBMS calls
being the only really common example) by using a thread pool. For
example, EventMachine provides the #defer method to do this. It manages
an internal thread pool that runs outside of the main reactor loop.
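
A minimal EM.defer sketch of that pattern, assuming the eventmachine gem
is installed; the blocking call is simulated with sleep.

```ruby
require 'rubygems'
require 'eventmachine'

EM.run do
  blocking_call = proc do
    sleep 2            # stands in for a blocking third-party call, e.g. a DBMS query
    'query result'
  end

  on_result = proc do |result|
    puts "got #{result} back on the reactor thread"
    EM.stop
  end

  # EM.defer runs the first proc on an internal thread pool and hands its
  # return value to the second proc back inside the reactor loop.
  EM.defer(blocking_call, on_result)

  EM.add_periodic_timer(0.5) { puts 'reactor is still servicing other events' }
end
```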

In general, the problem of architecting a high-performance web server
that includes external dependencies like databases, legacy applications,
SOAP, message-queueing systems, etc etc, is a very big problem with no
simple answer. It’s also been intensely studied, so there are resources
for you all over the web.

You mean to say it’s the network that is usually the bottleneck, not the
CPU? Well, in my experience the database is usually the bottleneck, but
let’s not forget that ruby is particularly demanding on the CPU.

I made that remark in relation to efforts to make network servers run
faster by hosting them on multiprocessor or multicore machines. You’ll
usually find that a single computer with one big network pipe attached
to it won’t be able to process the I/O fast enough to keep all the cores
busy. You might then be tempted to host the DBMS on the same machine,
but that’s rarely a good idea. Simpler is better.

While optimizing for CPU speed is fine, I’m also interested in process
isolation. If you have a monster lib that takes 1 minute to initialize
and requires 1 GB of resident memory but is used only occasionally, do
you really want to load it in all of your worker processes?

You’re assuming that worker processes are a good idea in the first
place, which I haven’t granted :-). But seriously, if this is the road
you want to go down, then I’d recommend you look at message-queueing
technologies. I’m not in favor of distributed objects for a lot of
reasons, but something like DRb will give you the ability to avoid
marshalling and unmarshalling data. SOAP is another alternative, but
having been there and done that, I’d rather chew glass.

Francis C. wrote:

In general, the problem of architecting a high-performance web server that
includes external dependencies like databases, legacy applications, SOAP,
message-queueing systems, etc etc, is a very big problem with no simple
answer. It’s also been intensely studied, so there are resources for you all
over the web.

In other words, Robert Heinlein’s TANSTAAFL principle holds up for this
domain, like many others: “There Ain’t No Such Thing As A Free Lunch!”

Ironically, I was invited to a seminar a couple of weeks ago about
concurrency titled, “The Free Lunch Is Over”. What was ironic about it
was that I couldn’t attend because I had a prior commitment – a service
anniversary cruise with my employer at which I received a free lunch. :slight_smile:

I made that remark in relation to efforts to make network servers run faster
by hosting them on multiprocessor or multicore machines. You’ll usually find
that a single computer with one big network pipe attached to it won’t be
able to process the I/O fast enough to keep all the cores busy. You might
then be tempted to host the DBMS on the same machine, but that’s rarely a
good idea. Simpler is better.

Up to a point, yes, simpler is better. But the goal of system
performance engineering of this type is to have, as much as possible, a
balanced system – network, disk and processor utilizations
approximately equal and none of them saturated. That’s the “sweet spot”
where you get the highest throughput for the lowest cost.

If your workload is well-behaved, you can sometimes get here “on the
average over a workday”. But web server workloads are anything but
well-behaved, even in the absence of deliberate denial of service
attacks. :slight_smile:

On Fri, 12 Oct 2007 15:18:26 +0900, M. Edward (Ed) Borasky wrote:

  2. There are basically two ways to do concurrency/parallelism: shared
    memory and message passing. I’m not sure what the real tradeoffs are
    – I’ve pretty much had my brain hammered with “shared memory bad –
    message passing good”, but there obviously must be some
    counter-arguments, or shared memory wouldn’t exist. :slight_smile:

(Putting on my fake beard) In the old days…

Shared memory is (probably) always the fastest solution; in fact, on
some OS’s, local message passing is implemented as a layer on top of
shared memory.

But, of course, if you implement concurrency in terms of shared memory,
you have to worry about lock contention, queue starvation, and all the
other things that generally get handled for you if you use a
higher-level messaging protocol. And your software is now stuck with
assuming that the sender and receiver are on the same machine; most
other messaging libraries will work equally well on the same or
different machines.

Back when machines were much slower, I had an application that already
used shared memory for caching, but was bogging down on system-level
message passing calls. There were three worker servers that would make
requests to the cache manager to load a record from the database into
the shared-memory cache for the worker to use. (This serialized database
access and reduced the number of open file handles.)

So I changed them to stop using the kernel-level message-queuing
routines; instead, they’d store their requests in a linked list that was
kept in a different shared memory region. The cache manager would unlink
the first request from the list, process it, and link that same request
structure back onto the “reply” list with a return code. The
requests/replies were very small, stayed in processor cache, etc., and
there was much less context-switching in and out of kernel mode since
the queuing was now all userland. This also saved a lot of memory
allocate/frees, another expensive operation at the time; most
message-passing involves at least one full copy of the message.

Occasionally, the request list would empty out, in which case we had to
use the kernel to notify the cache manager to wake up and check its
list, but that was a rare occasion, and “notify events” on that system
were still much cheaper than a queue message.

I would doubt that any of this type of optimization applies to Ruby on
modern OS’s, however.

On 10/12/07, Jay L. [email protected] wrote:

I would doubt that any of this type of optimization applies to Ruby on
modern OS’s, however.

At the risk of starting a threadjack, I think Ruby is today not as
well-served as some other development products in the way of native
message-passing systems. There is Assaf Arkin’s very nice reliable-msg
library, which defines a good API for messaging, and has some support
for persistence. And of course there are several libraries which support
Stomp, making it easier to work with products like Java’s AMQ.

I’d like to see a full-featured, high-performance MQ system for Ruby,
however. By “for Ruby” I don’t necessarily mean “in Ruby,” but rather
tightly integrated and easy/intuitive to use. Doing such a thing was the
original motivation for creating EventMachine, by the way.

Francis C. wrote:

original motivation for creating EventMachine, by the way.

“If you build it, they will come.” :slight_smile: