I’m looking at replacing our homebrew job queue with something better,
and
I’m wondering what’s out there. If there’s something that meets our
needs
I’d like to use it; otherwise I may end up building our next gen job
queue
by assembling various components that exist already. I’ve seen a couple
dozen solutions for this in the Ruby world, but haven’t scrutinized many
of
them in depth and am unsure if there are any that fully meet our needs.
We have many different kinds of jobs, however the main ones I’m worried
about are rather CPU intensive, take a long time to execute, and operate
on
large amounts of input and output data.
-
Distributed: we certainly have too many jobs to run on a single
computer.
We need to distribute them across a number of workers. Beyond that we’d
like to automatically provision more workers when the backlog of jobs
gets
too large, and shut down workers if too many of them are idle. -
Fault Tolerant: as with any distributed system, fault tolerance is
important. An evil monkey should be able to futz with any part of the
system, crashing computers willy nilly, and it should continue to Just
Work. If a job breaks down and is somehow lost anywhere along the way,
the
system should eventually detect this and retry the job. From a design
perspective, I would like to see as much of the system as stateless as
possible. The agents executing the jobs and message queues should be
stateless. Ideally the entirety of the state of the system is kept in
the
database, and that’s the only part of the system we need to ensure
recovering state from after a crash. If a worker or message queue goes
down
we shouldn’t need to worry about recovering jobs-in-flight, they should
simply be retried. -
Idempotent: going along with a stateless approach to fault tolerance,
if
the system does misdetect a failed job and ends up executing it twice,
the
system should detect this and discard the redundant results. Having the
same job accidentally complete multiple times should not screw up the
state
of the system. -
Support for Temporary Failures: sometimes a job fails in such a way
that
it should be retried (i.e. external resources needed to complete a job
are
temporarily unavailable). The system should support retrying these jobs
in
a sensible way, and it’d be nice to specify on a job-by-job basis how
many
attempts should be made before the system should give up and consider it
a
permanent failure. -
Support for Live Upgrades: we’d like to be able to add new types of
jobs
to the system . So along with this, agents should know what types of
jobs
they support, and not request unsupported jobs from the message queue.
I suppose some of my comments are presupposing the architecture: a list
of
jobs stored in the database, command and control processes which load
jobs
into and read results from the message queues, and agents which pull
jobs
from and report jobs to the message queues, and the message queues
themselves. I’d be open to other designs, so long as they have the
above
properties.
Given that, is there an existing job queue that meets my needs that I
should
be checking out?