Parallelizing a Ruby task

I have a long-running batch job that I would like to speed up.
Currently it uses only one CPU core, but the server I have in mind for
this has 16 cores, and I want to take advantage of all of them.

I’m thinking of one of three possibilities:

  1. JRuby, where the threads are native OS threads
  2. A message queue (e.g. ActiveMQ + Stomp), where workers run
    as separate processes, thus using all cores.
  3. A MapReduce implementation (e.g. hadoop)

I would like to see if anyone has gone down this road and can weigh in
on these options.

– Mark.

On Jun 29, 2009, at 2:00 PM, Mark T. wrote:

I would like to see if anyone has gone down this road and can weigh in
on these options.

How difficult is your task? If you were able to reduce (heh) it to a
MapReduce problem, you could use something like Skynet or Starfish.
For even simpler forking, check out Ara Howard’s threadify or my
forkify for simple parallel processing.
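
If you end up rolling your own, the underlying trick is just fork plus
a pipe per chunk. A rough sketch in plain stdlib Ruby (this is not the
forkify API, and the chunking and Marshal plumbing are only illustrative):

# Run a block over slices of an array, one child process per slice,
# and collect the results back through pipes.
def parallel_map(items, workers = 4)
  per_slice = [(items.size.to_f / workers).ceil, 1].max
  readers = items.each_slice(per_slice).map do |slice|
    reader, writer = IO.pipe
    fork do                              # child: do the work for this slice
      reader.close
      writer.write(Marshal.dump(slice.map { |item| yield(item) }))
      writer.close
    end
    writer.close                         # parent keeps only the read end
    reader
  end
  results = readers.flat_map { |r| data = r.read; r.close; Marshal.load(data) }
  Process.waitall                        # reap the children
  results
end

p parallel_map((1..16).to_a, 4) { |n| n * n }   # squares, computed in 4 processes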

  • Lee

I would like to see if anyone has gone down this road and can weigh in
on these options.

How difficult is your task? If you were able to reduce (heh) it to a
MapReduce problem, you could use something like Skynet or Starfish.
For even simpler forking, check out Ara Howard’s threadify or my
forkify for simple parallel processing.

Yes, it fits a MapReduce problem, but most of the MapReduce
implementations I came across seemed like overkill. I wasn’t aware of
Skynet or Starfish; they look promising, thanks. The file interface of
Starfish may in fact be just what I’m looking for.

I’ll check out threadify and forkify too.

Thanks again.
– Mark.

Just throwing my 2 cents out here:

What if you just created a daemon controller that threaded each process
on a different core o_O?

Would speed things up greatly whilst keeping control over each process.
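
In plain Ruby that controller is little more than a fork loop; the OS
scheduler spreads the child processes across the cores for you
(do_work below is just a stand-in for however you split the batch up):

# controller.rb - fork one worker per core and wait for them all
WORKERS = 16

WORKERS.times do |i|
  fork do
    # each child is a separate OS process, so the kernel can schedule
    # it on its own core
    do_work(i)          # placeholder: process the i-th slice of the batch
  end
end

Process.waitall          # block until every worker has exited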

  • Mac

On 30 Jun 2009, at 00:17, Michael L. wrote:

Just throwing my 2 cents out here:

What if you just created a daemon controller that threaded each process
on a different core o_O?

Would speed things up greatly whilst keeping control over each process.

Yep, multiple processes are your friend - especially on Unix :)

Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net

raise ArgumentError unless @reality.responds_to? :reason

Mark T. wrote:

  1. A message queue (e.g. ActiveMQ + Stomp), where workers run
    as separate processes, thus using all cores.

    I would like to see if anyone has gone down this road and can weigh in
    on these options.

I have gone down option 2, and it works well.

Depending on your application, you may not need the sophistication of a
“real” queue manager. You could just create a Queue object (from
thread.rb), running in its own process, and share it using DRb. Multiple
reader processes can pop messages from the queue, and will block until a
message is available. Writers can push messages into the queue as
required. There is also SizedQueue which will block the writers if the
queue gets too full.
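
For example, the server side can be as small as this (the druby port
and the trivial jobs are placeholders):

# queue_server.rb - share a plain Queue over DRb
require 'drb/drb'
require 'thread'

queue = Queue.new                 # from thread.rb
DRb.start_service('druby://localhost:12345', queue)
100.times { |i| queue.push(i) }   # enqueue the work items
DRb.thread.join

Each worker then just connects and pops; pop blocks until a message is
available, so idle workers simply wait:

# worker.rb - run one of these per core
require 'drb/drb'

DRb.start_service
queue = DRbObject.new_with_uri('druby://localhost:12345')
loop do
  job = queue.pop                 # blocks until there is work
  # ... do the real processing with `job` here ...
end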

A “real” queue manager like RabbitMQ may make sense if you need your
subtasks to persist in the queue in the event of a system crash. But for
a simple worker-farm type of application, this usually isn’t necessary.

Regards,

Brian.

On Jun 29, 7:17 pm, Michael L. wrote:

Just throwing my 2 cents out here:

What if you just created a daemon controller that threaded each process
on a different core o_O?

Would speed things up greatly whilst keeping control over each process.

That’s the idea behind using a message queue – it does that kind of
stuff for you. Workers are processes that will be distributed among the
cores. The only thing I’m unsure about in an MQ architecture is the
collating of answers from all the workers, i.e. the Reduce part
of MapReduce.
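
The simplest thing I can think of is a second shared queue for the
answers: workers push results onto it, and the controller pops until it
has them all and then does the reduce step itself. Roughly (the port,
the jobs and the final sum are only placeholders):

# master.rb - a job queue and a result queue, reduce at the end
require 'drb/drb'
require 'thread'

jobs    = Queue.new.extend(DRb::DRbUndumped)   # pass by reference over DRb
results = Queue.new.extend(DRb::DRbUndumped)
DRb.start_service('druby://localhost:12346', { :jobs => jobs, :results => results })

100.times { |i| jobs.push(i) }                 # "map": hand out the work

answers = Array.new(100) { results.pop }       # blocks until all 100 are back
puts answers.inject(:+)                        # "reduce": collate the answers

# worker.rb
require 'drb/drb'

DRb.start_service
front = DRbObject.new_with_uri('druby://localhost:12346')
jobs, results = front[:jobs], front[:results]
loop { results.push(jobs.pop * 2) }            # process a job, send the answer back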