Process-group gem - concurrent processes with fibers

Hi Everyone,

Process::Group is a class for coordinating and managing multiple
processes which execute concurrently in fibers.

In some of my testing scripts, multiple processes need to run. In the past, I’ve just done this sequentially. However, I’ve been modernising some scripts and I’ve bundled up the code into this gem.

Previously:

Process.spawn("some-task")
process_and_email_results

Process.spawn("some-other-task --foobar")
process_and_email_results

Now I can run like this:

group = Process::Group.new

Fiber.new do
  group.spawn("some-task")
  process_and_email_results
end.resume

Fiber.new do
  group.spawn("some-other-task --foobar")
  process_and_email_results
end.resume

group.wait

Process::Group allows you to run the two tasks concurrently, and in these cases it was an easy way to modernise existing scripts. You can call spawn multiple times in a fiber and it will work as expected. You can also kill the entire group of processes if you wish.

Examples, documentation and code: https://github.com/ioquatix/process-group

Kind regards,
Samuel

On 03/11/2014 12:17 AM, Samuel W. wrote:

Process::Group allows you to run the two tasks concurrently and in these
cases it was an easy option to modernise existing scripts. You can call
spawn multiple times in a fiber and it will work as expected. You can
also kill the entire group of processes if you wish.

Examples, documentation and code: https://github.com/ioquatix/process-group

Kind regards,
Samuel

What’s the advantage over using threads, the old school way?

Thread.new do
  system("some-task")
  process_and_email_results
end

On Tue, Mar 11, 2014 at 6:56 PM, Joel VanderWerf [email protected] wrote:

What’s the advantage over using threads, the old school way?

Or a system like http://celluloid.io which provides both threads and
fibers
and can integrate with things like I/O reactors…

Celluloid looks pretty interesting - I’ve seen it pop up quite a few times. A unix process group and a set of actors are two completely different things (e.g. signal handling). I wanted something dead simple and specific to what I was trying to do. I’ve also got some use-cases for which celluloid feels too heavy.

Threads are good but I felt like I wanted something more predictable. Also, not all implementations of Ruby use green threads, so you might have synchronisation issues if you use (either directly or indirectly through a gem/library) shared global state.

On 03/12/2014 06:22 AM, Samuel W. wrote:

Threads are good but I felt like I wanted something more predictable. Also, not all implementations of Ruby use green threads, so you might have synchronisation issues if you use (either directly or indirectly through a gem/library) shared global state.

Even green threads have this danger, don’t they?

Taking over manual scheduling seems a bit awkward compared to using some
kind of concurrency control (mutexes, queues, actors). What happens if
application code inside the fiber (process_and_email_results in the
example) makes a blocking IO call?

Manual scheduling with fibers is great for testing concurrent code which would otherwise run in threads, because you can force a certain kind of contention in a predictable way. I’m working on extracting a library for doing this from a project where it’s been a useful technique.

Still wondering how you handle blocking IO in fibers.

If all of the code inside the fiber is under your control, you can use
non-blocking operations, and Fiber.yield if the operation would block.
(See example below.)

But I get the impression you are dealing with various third-party libs
which might just open a socket and start talking? Couldn’t that block
the fiber and therefore the whole thread?

This has always seemed to me to be the compelling feature of ruby’s
threads: you just let the thread scheduler manage blocking.

For anyone else who’s reading and hasn’t played with fibers, here’s what
you can do to avoid blocking the whole thread while one fiber waits for
input:


require 'socket'
require 'fiber'

s1, s2 = UNIXSocket.pair

f = Fiber.new do
  loop do
    begin
      puts "Fiber checking for available data"
      data = s1.read_nonblock(10)
      puts "Fiber received #{data.inspect}"
    rescue IO::WaitReadable
      puts "Fiber yielding"
      Fiber.yield
      puts "Fiber resuming"
      unless IO.select([s1], [], [], 0)
        puts "...even though no data is available"
      end
      retry
    rescue => ex
      puts ex
    end
  end
end

f.resume
f.resume

puts "writing to socket"
s2.write "123456"

f.resume
f.resume

puts "writing to socket"
s2.write "abcdef"

f.resume
f.resume

Even green threads have this danger, don’t they?

Yes, but in this context, I’m actually not sure I’d call the manual scheduling a danger. While it could be referred to as explicit scheduling, I prefer to look at it as providing a specific, well-defined, non-blocking API with explicit synchronisation points.

(I think what I really like about fibers is they make it very easy to compose concurrent code in a predictable way. For all intents and purposes, the code is still sequential with very little overhead.)

Taking over manual scheduling seems a bit awkward compared to using some
kind of concurrency control (mutexes, queues, actors).

I would have said the opposite. Code using threads is typically very hard to reason about compared to sequential code (like the API I’ve proposed).

Except in specific situations (e.g. game engines, data processing/access, algorithms/compression), I find threading causes more problems than it solves (e.g. http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them). Even debugging code with threads can be a nightmare - why is there a deadlock, why is there memory corruption, etc. The only situation where I’ve seen this working well in general is in languages/environments designed from the ground up to support parallel processing (e.g. Haskell, Clojure, etc). Everything else seems like a hack that requires careful analysis to verify correctness, and the path to the dark side is always just one (poorly chosen) line of code away…

Anyway, basically, I really like fibers - if you want to run concurrent
unix processes, this gem is a good starting point.

Thanks for your thoughts and input.

Kind regards,
Samuel

On Wed, Mar 12, 2014 at 5:56 PM, Joel VanderWerf [email protected] wrote:

Still wondering how you handle blocking IO in fibers.

This is a genuine concern with this sort of library. For it to really be useful, you need to be able to do things like I/O concurrently. In fact, if it can’t do I/O, it’s not particularly helpful, because Fibers are useless for CPU-bound tasks by default. I/O is one of the biggest use cases of fibers.

If you’re curious how Celluloid handles it, it provides a Celluloid::IO companion library which has duck types of things like TCPSocket, UDPSocket, and UNIXSocket which interact with Celluloid’s scheduling and can suspend/resume fibers when they make “blocking” calls. I/O multiplexing is handled by a central reactor/event loop (provided by nio4r).

Still wondering how you handle blocking IO in fibers.

That wasn’t an important feature for the intended purpose of the gem, therefore there is no explicit support for it at the moment… that might seem like a cop-out, but it is exactly what I wanted (minimal features, specific use-case).

But I get the impression you are dealing with various third-party libs which might just open a socket and start talking? Couldn’t that block the fiber and therefore the whole thread?

That is the same problem you’d have for any sequential code, whether it is running in a fiber or in an actor - calling something that blocks indefinitely - but I think as a user you’d be aware of this. I’m not proposing a solution to this problem, I think that’s probably impossible anyway.

This has always seemed to me to be the compelling feature of ruby’s
threads: you just let the thread scheduler manage blocking.

The thread scheduler may seem like a good idea in theory, but in practice event-driven code that works with OS primitives (select, epoll, kevent) is generally more efficient. I think there are good arguments either way (e.g. Sun UltraSPARC chips seemed to be designed for thread-based workloads, running up to 64 threads in parallel, a bit like HyperThreading in x86), but event-driven systems generally seem easier to reason about, give more predictable behaviour, better-defined resource usage, etc. Also, as mentioned, while some implementations use green threads, not all implementations do. That means that if you use threads, you need to deal with reentrancy and contention issues - at least as complex as dealing with fibers, if not more so (e.g. calling fork might break everything when using threads, as mentioned).

Thanks for the example code. I’m sure that can be done more efficiently and cleanly by having one function calling #select and resuming the correct fiber.

Thanks for your ideas and feedback.

Kind regards,
Samuel

This is a genuine concern with this sort of library. For it to really be
useful

This library is VERY useful for me in its current form. If you want concurrent I/O, yes, don’t use this library. If you just want to run processes to completion concurrently, this library is perfect. I’m using it to retrofit existing sequential scripts and also in another project similar to make, which doesn’t care about IO, just running compilers/linkers, etc.

You can avoid fibers/threads entirely, too. Just a hash, lambdas, and waitpid2. tasks is a hash which maps pids to lambdas (callbacks):

tasks = {
  Process.spawn("some-task") => lambda do |status|
    process_and_email_results(status, "some task done!")
  end,
  Process.spawn("some-other-task --foobar") => lambda do |status|
    process_and_email_results(status, "some other task done!")
  end,
}

until tasks.empty?
  pid, status = Process.waitpid2(-1)
  if callback = tasks.delete(pid)
    callback.call(status)
  else
    warn "reaped unknown process: #{status.inspect}"
  end
end