Using multicore CPUs in parallel tasks

Hi,

I’ve been reading around a bit but couldn’t find a solution that worked,
so here goes:

I am running ruby 1.8 and want to make full use of a quad core CPU
(64bit, Ubuntu) in a task that lends itself to multithreading/multicore
use.

It’s basically an array of objects that are each used in a fairly CPU
intensive job, so I figured I could have 4 of them run at the same time,
one on each CPU.

BUT…

The only reasonably understandable suggestion looked something like:


threads = 4
my_array = [something_here]

threads.times do
Process.fork(a_method(my_array.shift))
end

my_array.each do |object|
Process.wait(0)
Process.fork(a_method(object))
end

But this still only used one CPU (and looks a bit ugly…). Is that some
limitation of ruby (v 1.8 specifically) or am I doing something wrong?

Cheers,

Marc

On Thu, Oct 29, 2009 at 10:56 AM, Marc H.
[email protected] wrote:

intensive job, so I figured I could have 4 of them run at the same time
, one on each CPU.

You might want to check out Pure and Tiamat and talk to James Lawrence
(see links). He seems to have something like what you are asking for. I don’t
know much about these two projects, they came onto my radar a few days ago,
but I think it’s cool what James is working on!

== Links

== Author

Posted via http://www.ruby-forum.com/.


Kind Regards,
Rajinder Y.

http://DevMentor.org

Do Good! - Share Freely, Enrich and Empower people to Transform their
lives.

On Thu, Oct 29, 2009 at 8:56 AM, Marc H.
[email protected] wrote:

You are going to want Ruby 1.9 for this. In 1.8 threads are “green”:
they exist only as threads inside the VM, so you still only hit one core,
and any blocking system I/O will block all of your threads.


“Hey brother Christian with your high and mighty errand, Your actions
speak
so loud, I can’t hear a word you’re saying.”

-Greg Graffin (Bad Religion)

On Thu, Oct 29, 2009 at 11:48 AM, Glen H. [email protected]
wrote:

You are going to want Ruby 1.9 for this. In 1.8 threads are “green”,
basically they only exists as threads inside the VM so you still only hit
one core and any blocking system I/O will block all of your threads.

Ruby 1.9 isn’t going to help you when using threads to distribute
computation across CPU cores. The Global VM Lock ensures that simultaneous
computation is still limited to one core.

JRuby, on the other hand, does not have this limitation. On MRI/1.9 I would
recommend using multiple processes.

Marc,

How long lived is each of these tasks? Are we talking seconds or weeks?
Is there a user-facing aspect to this or is throughput the variable
that you’re wanting to optimize?

When you say “fairly CPU intensive”, does this mean that when one of
these tasks runs you see (from sar/mpstat) that one of your CPUs is
pinned?

Peter

On 10/29/2009 09:04 PM, Tony A. wrote:

On Thu, Oct 29, 2009 at 11:48 AM, Glen H. [email protected] wrote:

You are going to want Ruby 1.9 for this. In 1.8 threads are “green”,
basically they only exists as threads inside the VM so you still only hit
one core and any blocking system I/O will block all of your threads.

Ruby 1.9 isn’t going to help you when using threads to distribute
computation across CPU cores. The Global VM Lock ensures that simultaneous
computation is still limited to one core.

Are you saying that the global VM lock even extends to several
processes? Because Marc did not want to use threads for distribution
but rather processes.

Kind regards

robert

On Thu, Oct 29, 2009 at 2:04 PM, Tony A. [email protected] wrote:

computation is still limited to one core.

JRuby, on the other hand, does not have this limitation. On MRI/1.9 I would
recommend using multiple processes.


Tony A.
Medioh/Nagravision

Ah, I did not know that.



On 10/29/2009 03:56 PM, Marc H. wrote:

I believe you are not using Process.fork properly. In fact, I am
surprised that you do not get an exception:

irb(main):001:0> Process.fork("foo")
ArgumentError: wrong number of arguments (1 for 0)
        from (irb):1:in `fork'
        from (irb):1
        from :0

Basically, your code first does the calculation (a_method(object)) in the
parent process and only then tries to create a process. No surprise that
only one CPU is busy.
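
To make the evaluation order concrete, here is a minimal sketch (not Marc's actual code). `a_method` is a hypothetical stand-in that only reports which process ran it. An argument expression is evaluated in the parent before `fork` is ever reached, whereas a block passed to `fork` runs in the child:

```ruby
# Hypothetical stand-in for the real job: it returns the PID of the
# process that executed it.
def a_method(_object)
  Process.pid
end

parent_pid = Process.pid

# Argument form: a_method runs here, in the parent, before any fork.
# (Process.fork accepts no arguments, so handing it the result would
# raise ArgumentError anyway.)
eager_pid = a_method(:job)

# Block form: the block body runs in the child process. The child
# reports its PID back to the parent through a pipe.
reader, writer = IO.pipe
child_pid = fork do
  reader.close
  writer.puts a_method(:job)
  writer.close
end
writer.close
reported_pid = reader.read.to_i
reader.close
Process.wait(child_pid)

puts "eager call ran in #{eager_pid} (parent is #{parent_pid})"
puts "block ran in #{reported_pid} (child is #{child_pid})"
```

The pipe is only there so the parent can observe what happened in the child; the real job would not need it unless results have to come back.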

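Here’s something else that you could do:

processes = 4

# Round the slice size up so that at most `processes` children are
# forked, and never pass 0 to each_slice.
slice_size = [(my_array.size.to_f / processes).ceil, 1].max

my_array.each_slice(slice_size) do |tasks|
  fork do
    tasks.each do |task|
      a_method(task)
    end
  end
end

Process.waitall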
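
A note on choosing the slice size for this approach: rounding down (plain integer division) can yield more slices than processes, and gives a slice size of 0 when there are fewer tasks than processes, while rounding up avoids both. A small sketch with made-up task values:

```ruby
tasks = (1..10).to_a   # made-up task list
processes = 4

# Rounding down: 10 / 4 == 2, so each_slice produces 5 slices,
# i.e. one more child than intended.
floor_size   = tasks.size / processes
floor_slices = tasks.each_slice(floor_size).to_a

# Rounding up: ceil(10 / 4.0) == 3, so there are 4 slices (3, 3, 3, 1).
ceil_size   = [(tasks.size.to_f / processes).ceil, 1].max
ceil_slices = tasks.each_slice(ceil_size).to_a

puts "slice size #{floor_size}: #{floor_slices.size} slices"
puts "slice size #{ceil_size}: #{ceil_slices.size} slices"
```

With fewer tasks than processes the rounded-down size is 0, and `each_slice(0)` raises ArgumentError, which is why the sketch clamps the size to at least 1.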

The drawback is that one of those processes might happen to get all the
easy tasks, so the CPUs are not utilized optimally. Here’s another
solution that does not have that issue:

processes = 4
count = 0

my_array.each do |task|
  if count == processes
    Process.wait
    count -= 1
  end

  fork do
    a_method(task)
  end
  count += 1
end

Process.waitall

You can see that it works with this example:

processes = 4
count = 0

10.times do |task|
  if count == processes
    Process.wait
    count -= 1
  end

  fork do
    printf "%-20s start %4d %4d\n", Time.now, $$, task
    sleep rand(5) + 2
    printf "%-20s end %4d %4d\n", Time.now, $$, task
  end
  count += 1
end

Process.waitall
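
The counting loop above can also be wrapped in a method that reports how each child ended. This is only a sketch: `run_in_pool` is a name invented here, and the job in the usage example is made up. It relies on `Process.wait` setting `$?` and on `Process.waitall` returning `[pid, status]` pairs:

```ruby
# Run the block once per task in a forked child, keeping at most `limit`
# children alive at a time. Returns the Process::Status of every child.
def run_in_pool(tasks, limit)
  running = 0
  statuses = []

  tasks.each do |task|
    if running == limit
      Process.wait            # block until any one child exits
      statuses << $?
      running -= 1
    end

    fork { yield task }       # the job runs in the child
    running += 1
  end

  # Reap whatever is still running.
  Process.waitall.each { |_pid, status| statuses << status }
  statuses
end

# Made-up CPU-bound job: sum a range whose size depends on the task.
statuses = run_in_pool((1..10).to_a, 4) do |n|
  (1..n * 20_000).inject(0) { |acc, x| acc + x }
end

succeeded = statuses.select { |status| status.success? }.size
puts "#{statuses.size} children finished, #{succeeded} successfully"
```

Checking the exit statuses is a cheap way to notice a child that died from an exception, which the bare loop would silently ignore.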

Kind regards

robert

On Thu, Oct 29, 2009 at 4:05 PM, Robert K.
[email protected] wrote:

processes.

No, if you look over my post again it specifically mentions the GVL applies
to threads and suggests using processes.

2009/10/29 Tony A. [email protected]:

Are you saying that the global VM lock even extends to several processes?
Because Marc did not want to use threads for distribution but rather
processes.

No, if you look over my post again it specifically mentions the GVL applies
to threads and suggests using processes.

I figured as much. The thread discussion does not help Marc, because
he explicitly wanted to use processes for core utilization. Basically
Glen sent us in the wrong direction though. :-)

Cheers

robert

Robert K. wrote:

processes = 4
count = 0

my_array.each do |task|
  if count == processes
    Process.wait
    count -= 1
  end

  fork do
    a_method(task)
  end
  count += 1
end

Process.waitall

Another option:

Tiamat.open_local(4) {
  pure do
    fun_map :result => my_array do |elem|
      a_method(elem)
    end
  end.compute.result
}

This lets you distribute across N physical machines without a change to
the code.

Robert K. wrote:

I believe you are not using Process.fork properly. In fact, I am
surprised that you do not get an exception:

irb(main):001:0> Process.fork("foo")
ArgumentError: wrong number of arguments (1 for 0)
        from (irb):1:in `fork'
        from (irb):1
        from :0

Yes, quite possible - I didn’t really look up the exact code, just wrote
it down from memory, sorry about that…

processes = 4
count = 0

my_array.each do |task|
  if count == processes
    Process.wait
    count -= 1
  end

  fork do
    a_method(task)
  end
  count += 1
end

Process.waitall

That works like a charm, thanks a lot!

Tony A. wrote:

Ruby 1.9 isn’t going to help you when using threads to distribute
computation across CPU cores. The Global VM Lock ensures that
simultaneous computation is still limited to one core.

JRuby, on the other hand, does not have this limitation. On MRI/1.9
I would recommend using multiple processes.

I’m not so sure jruby does this effectively.

require 'tiamat/autoconfig'
require 'pure/dsl'
require 'benchmark'

mod = pure do
  def total(left, right)
    left + right
  end

  def left
    (1...5_000_000).inject(0) { |acc, n| acc + n }
  end

  def right
    (1...5_000_000).inject(0) { |acc, n| acc + n }
  end
end

Benchmark.bmbm { |bm|
  bm.report("1 thread, 1 interpreter") {
    mod.compute(1).total
  }
  bm.report("2 threads, 1 interpreter") {
    mod.compute(2).total
  }

  # this part removed for jruby bench
  bm.report("2 threads, 2 interpreters") {
    Tiamat.open_local(2) {
      mod.compute.total
    }
  }
}

== ruby 1.9.2dev (2009-10-18 trunk 25393) [i386-darwin9.8.0]
Rehearsal -------------------------------------------------------------
1 thread, 1 interpreter     4.370000   0.020000   4.390000 (  4.389990)
2 threads, 1 interpreter    4.360000   0.030000   4.390000 (  4.385111)
2 threads, 2 interpreters   0.010000   0.010000   4.700000 (  2.460661)
--------------------------------------------------- total: 13.480000sec

                                user     system      total        real
1 thread, 1 interpreter     4.360000   0.020000   4.380000 (  4.376050)
2 threads, 1 interpreter    4.360000   0.030000   4.390000 (  4.380982)
2 threads, 2 interpreters   0.010000   0.010000   4.710000 (  2.465925)

== jruby 1.4.0RC3 (ruby 1.8.7 patchlevel 174) (2009-10-30 1d7de2d) (Java
HotSpot™ Client VM 1.5.0_20) [i386-java]
Rehearsal ------------------------------------------------------------
1 thread, 1 interpreter     6.060000   0.000000   6.060000 (  6.060000)
2 threads, 1 interpreter    7.629000   0.000000   7.629000 (  7.629000)
-------------------------------------------------- total: 13.689000sec

                                user     system      total        real
1 thread, 1 interpreter     6.080000   0.000000   6.080000 (  6.080000)
2 threads, 1 interpreter    7.288000   0.000000   7.288000 (  7.288000)

On Fri, Oct 30, 2009 at 11:07 AM, James M. Lawrence
[email protected] wrote:

Tiamat.open_local(4) {
  pure do
    fun_map :result => my_array do |elem|
      a_method(elem)
    end
  end.compute.result
}

This lets you distribute across N physical machines without a change to
the code.

This is just elegant =) … it’s funny how I observe something and then
more of what I observed comes into the fold! I was hoping you would
reply to the thread ;-)



Kind Regards,
Rajinder Y.


On Fri, Oct 30, 2009 at 10:14 AM, James M. Lawrence
[email protected] wrote:

2 threads, 2 interpreters  0.010000  0.010000  4.710000 (  2.465925)
1 thread, 1 interpreter   6.080000  0.000000  6.080000 (  6.080000)
2 threads, 1 interpreter  7.288000  0.000000  7.288000 (  7.288000)

JRuby benchmarking:

  • Use Java 6+

Java 6 is much faster than Java 5. Java 7 is faster still in many cases.

  • Pass --server if -v output says “client” VM

The Hotspot JVM has two modes: “server” and “client”. The “server” VM
does runtime-profiled optimizations and can be 2x or more faster than
the “client” VM.
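
Which VM is in use can also be checked from inside a script. A small sketch, assuming JRuby's Java integration (`require 'java'`) and the standard `java.vm.name` system property; `vm_description` is a made-up helper name, and on any other Ruby it simply says so:

```ruby
# Describe the VM the current interpreter runs on. Under JRuby this
# returns the JVM name (which includes "Client" or "Server" for
# HotSpot); elsewhere it returns a note that this is not JRuby.
def vm_description
  if defined?(JRUBY_VERSION)
    require 'java'
    java.lang.System.getProperty('java.vm.name')
  else
    "not running on JRuby (#{RUBY_DESCRIPTION})"
  end
end

puts vm_description
```

If the name contains “Client”, rerunning the benchmark with --server as described above is worth trying before comparing numbers.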

Results on my system (core 2 duo 2.66GHz):

ruby 1.9.2dev (2009-07-23 trunk 24248) [i386-darwin9.7.1]
Rehearsal -------------------------------------------------------------
1 thread, 1 interpreter     3.370000   0.020000   3.390000 (  3.516261)
2 threads, 1 interpreter    3.330000   0.020000   3.350000 (  3.412460)
2 threads, 2 interpreters   0.010000   0.000000   3.590000 (  2.133313)
--------------------------------------------------- total: 10.330000sec

                                user     system      total        real
1 thread, 1 interpreter     3.350000   0.010000   3.360000 (  3.415410)
2 threads, 1 interpreter    3.350000   0.020000   3.370000 (  3.423560)
2 threads, 2 interpreters   0.000000   0.010000   3.630000 (  2.302965)

jruby 1.5.0.dev (ruby 1.8.7 patchlevel 174) (2009-10-30 eaa9e7f) (Java
HotSpot™ 64-Bit Server VM 1.6.0_15) [x86_64-java]
Rehearsal ------------------------------------------------------------
1 thread, 1 interpreter     2.373000   0.000000   2.373000 (  2.373000)
2 threads, 1 interpreter    1.733000   0.000000   1.733000 (  1.733000)
--------------------------------------------------- total: 4.106000sec

                                user     system      total        real
1 thread, 1 interpreter     2.145000   0.000000   2.145000 (  2.145000)
2 threads, 1 interpreter    1.840000   0.000000   1.840000 (  1.840000)

It would probably improve more with a longer run, but this is pretty
good.

- Charlie

On Fri, Oct 30, 2009 at 2:06 AM, Robert K.
[email protected] wrote:

I figured as much. The thread discussion does not help Marc, because
he explicitly wanted to use processes for core utilization. Basically
Glen sent us in the wrong direction though. :-)

I’ve always worked best as a diversion.



Charles Oliver N.:

This does not match my results. Are you sure both cores are being used?

I am certain. I tried to head off this question when I said: all
applications are closed save Terminal; top reports 0% CPU usage
beforehand; top reports java at 100% CPU during the 1-thread test;
185% CPU during the 2-thread test; top was not running during the
posted benchmarks.

I should also mention this is my mp3 player co-opted into a Mac dev
machine–a Mac Mini. Maybe Java balks at the specs. System Profiler:

Model Name: Mac mini
Model Identifier: Macmini2,1
Processor Name: Intel Core 2 Duo
Processor Speed: 1.83 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 2 MB
Memory: 1 GB
Bus Speed: 667 MHz

Darwin jl.local 9.8.0 Darwin Kernel Version 9.8.0: Wed Jul 15 16:55:01
PDT 2009; root:xnu-1228.15.4~1/RELEASE_I386 i386

It would be nice to match jruby versions. Can you try master 55366a1
or push eaa9e7f to a remote branch?

[quoting the rest in full due to ruby-forum gateway breakage]

Also does not match my results:

Rehearsal ---------------------------------------------
1 thread    4.795000   0.000000   4.795000 (  4.739000)
2 threads   3.072000   0.000000   3.072000 (  3.072000)
------------------------------------ total: 7.867000sec

                user     system      total        real
1 thread    4.081000   0.000000   4.081000 (  4.081000)
2 threads   2.966000   0.000000   2.966000 (  2.966000)

I’d love to hear from others trying this benchmark, since the results
you’ve given don’t match my results on any of the systems I’m testing.

Charles Nutter wrote:

JRuby benchmarking:

  • Use Java 6+

Java 6 is much faster than Java 5. Java 7 is faster still in many cases.

  • Pass --server if -v output says “client” VM

I didn’t consider it because the behavior I showed looks wrong for
either Java 5 or Java 6 in either client or server mode. Indeed I
obtained the same results with Java 6 Server VM.

A computation split into two parallel threads takes more time than the
same computation with one thread. ‘top’ reports 185% CPU and 100% CPU
respectively.

I was not concerned with comparing MRI and jruby. MRI was a baseline
to demonstrate that Pure’s parallelism was working in the first place.

I was unable to find your eaa9e7f commit so I grabbed the latest
master branch.

jruby 1.5.0.dev (ruby 1.8.7 patchlevel 174) (2009-11-02 55366a1) (Java
HotSpot™ 64-Bit Server VM 1.6.0_15) [x86_64-java]

Core 2 Duo 1.83GHz; all apps closed except Terminal; benchmarks made
without ‘top’ running.

Rehearsal ------------------------------------------------------------
1 thread, 1 interpreter     3.422000   0.000000   3.422000 (  3.422000)
2 threads, 1 interpreter    4.008000   0.000000   4.008000 (  4.008000)
--------------------------------------------------- total: 7.430000sec

                                user     system      total        real
1 thread, 1 interpreter     2.942000   0.000000   2.942000 (  2.942000)
2 threads, 1 interpreter    3.595000   0.000000   3.595000 (  3.595000)

Results are the same with Pure removed:

require 'benchmark'

def left
  (1...10_000_000).inject(0) { |acc, n| acc + n }
end

def right
  (1...10_000_000).inject(0) { |acc, n| acc + n }
end

Benchmark.bmbm { |bm|
  bm.report("1 thread") {
    Thread.new {
      [left, right]
    }.value
  }
  bm.report("2 threads") {
    [
      Thread.new { left },
      Thread.new { right },
    ].map { |t| t.value }
  }
}

Rehearsal ---------------------------------------------
1 thread    6.726000   0.000000   6.726000 (  6.726000)
2 threads   7.478000   0.000000   7.478000 (  7.478000)
----------------------------------- total: 14.204000sec

                user     system      total        real
1 thread    6.636000   0.000000   6.636000 (  6.636000)
2 threads   8.196000   0.000000   8.196000 (  8.196000)
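
For comparison, the same two-way split can be done with processes instead of threads, which is what the earlier replies recommend for MRI. A rough sketch only, with no timings claimed: `fork_compute` and `range_sum` are names invented here, the range is shortened, and each result is passed back to the parent through a pipe with Marshal:

```ruby
require 'benchmark'

# Same kind of CPU-bound work as left/right above, with a shorter range.
def range_sum(limit)
  (1...limit).inject(0) { |acc, n| acc + n }
end

# Run the block in a forked child. Returns the child's PID and the read
# end of a pipe that will carry the marshalled result.
def fork_compute
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.write(Marshal.dump(yield))
    writer.close
  end
  writer.close   # the parent keeps only the read end
  [pid, reader]
end

limit = 1_000_000
results = nil

elapsed = Benchmark.realtime do
  # Both children are started before either result is read, so they can
  # run on separate cores.
  jobs = [fork_compute { range_sum(limit) }, fork_compute { range_sum(limit) }]
  results = jobs.map do |pid, reader|
    value = Marshal.load(reader.read)
    reader.close
    Process.wait(pid)
    value
  end
end

puts "2 processes: #{results.inspect} in #{'%.3f' % elapsed}s"
```

Unlike threads, the children share no memory with the parent, so anything the parent needs afterwards has to be sent back explicitly, here via the pipe.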

On Mon, Nov 2, 2009 at 11:47 AM, James M. Lawrence
[email protected] wrote:

Rehearsal ------------------------------------------------------------
1 thread, 1 interpreter   3.422000  0.000000  3.422000 (  3.422000)
2 threads, 1 interpreter  4.008000  0.000000  4.008000 (  4.008000)
--------------------------------------------------- total: 7.430000sec

                               user    system     total        real
1 thread, 1 interpreter   2.942000  0.000000  2.942000 (  2.942000)
2 threads, 1 interpreter  3.595000  0.000000  3.595000 (  3.595000)

This does not match my results. Are you sure both cores are being used?

Rehearsal ---------------------------------------------
1 thread   6.726000  0.000000  6.726000 (  6.726000)
2 threads  7.478000  0.000000  7.478000 (  7.478000)
----------------------------------- total: 14.204000sec

               user    system     total        real
1 thread   6.636000  0.000000  6.636000 (  6.636000)
2 threads  8.196000  0.000000  8.196000 (  8.196000)

Also does not match my results:

Rehearsal ---------------------------------------------
1 thread    4.795000   0.000000   4.795000 (  4.739000)
2 threads   3.072000   0.000000   3.072000 (  3.072000)
------------------------------------ total: 7.867000sec

                user     system      total        real
1 thread    4.081000   0.000000   4.081000 (  4.081000)
2 threads   2.966000   0.000000   2.966000 (  2.966000)

I’d love to hear from others trying this benchmark, since the results
you’ve given don’t match my results on any of the systems I’m testing.

- Charlie