Why not call Thread.join?

thefed · December 31, 2007, 6:02am

Take this code from the Ruby Cookbook:

module Enumerable
  def each_simultaneously
    threads = []
    each { |e| threads << Thread.new { yield e } }
    return threads
  end
end

It is used on an array so that you may do this:
[1,2,3].each_simultaneously do |i|
  sleep 5
  puts i
end

And it works!

But why don’t I need to call threads.each {|t| t.join }?

And if I did, would it slow it down?

Thanks,
Ari
-------------------------------------------|
Nietzsche is my copilot

thefed · December 31, 2007, 6:50am

On Dec 30, 9:02 pm, thefed wrote:

It is used on an array so that you may do this:
[1,2,3].each_simultaneously do |i|
  sleep 5
  puts i
end

And it works!

What did you expect to happen?
The example you provided will do nothing but create threads and
exit.

But why don’t I need to call threads.each {|t| t.join }?

Any running threads are killed when the program exits.

And if I did, would it slow it down?

Generally speaking, the only thing it would slow down (stop really) is
the execution path of the main thread.

Now if for some reason your main thread has to do other work, a join
would delay that, of course.

thefed · December 31, 2007, 4:10pm

On 31.12.2007 06:45, Skye Shaw!@#$ wrote:

The example you provided will do nothing but create threads and
exit.

Generally speaking, the only thing it would slow down (stop really) is
the execution path of the main thread.

Now if for some reason your main thread has to do other work, a join
would delay that, of course.

Nevertheless it’s good practice to join. If main has other work to do
then you should join once that is done, i.e. at the end of the script.
If those threads have terminated already you basically only have the
overhead of the Threads Array iteration - but you get robustness in
return, i.e. you ensure that all those Threads can terminate properly
(assuming that they are written in a way to do that eventually).
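
A minimal sketch of the "join at the end" advice, assuming three short worker threads. The results queue and the 0.1-second sleep are illustrative stand-ins, not part of the original example:

```ruby
# Hypothetical illustration: workers push into a thread-safe Queue.
results = Queue.new

threads = [1, 2, 3].map do |i|
  Thread.new do
    sleep 0.1          # stand-in for real work
    results << i
  end
end

# ... the main thread could do other work here ...

# Join at the end so that no worker is killed when the script exits.
threads.each(&:join)

puts "finished: #{results.size} of #{threads.size}"
```

If the workers have already terminated by the time the joins run, each join returns at once, so the only cost is the iteration over the array.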

Kind regards

robert

thefed · December 31, 2007, 5:15pm

On 31.12.2007 17:02, thefed wrote:

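OK, I understand it better. But why does each {|t| t.join} join them
all at the same time (ish), and not wait for the first one to finish
executing before joining the others?

They are not joined at the same time but one after the other.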

Cheers

robert

thefed · December 31, 2007, 5:03pm

On Dec 31, 2007, at 12:49 AM, Skye Shaw!@#$ wrote:

Generally speaking, the only thing it would slow down (stop really) is
the execution path of the main thread.

Now if for some reason your main thread has to do other work, a join
would delay that, of course.

OK, I understand it better. But why does each {|t| t.join} join them
all at the same time (ish), and not wait for the first one to finish
executing before joining the others?

thefed · December 31, 2007, 6:41pm

On Mon, 31 Dec 2007 00:02:10 -0500, thefed wrote:

[1,2,3].each_simultaneously do |i|
  sleep 5
  puts i
end

When I ran this (not in IRB) it didn’t work. The interpreter terminated
before any of the threads finished sleeping for 5 seconds. In any case,
you want to join each thread so that the next statement will only
execute after all of the threads have finished their work (otherwise
your next statement will see an undetermined intermediate view of the
array).

OK, I understand it better. But why does each {|t| t.join} join them
all at the same time (ish), and not wait for the first one to finish
executing before joining the others?

It joins them one at a time in order. But while your main thread is
waiting for a specific thread to finish, any other thread is also
allowed to execute, and possibly terminate. If thread b terminates
while thread a is joined, then you call join on thread b, join will
return immediately since there’s nothing to wait for. Hence,
each{|t| t.join} finishes practically immediately when the longest
running thread finishes.
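
A small sketch of this, assuming three threads with different (made-up) sleep durations. The joins happen strictly in order, yet the total time tracks the longest thread rather than the sum:

```ruby
durations = [0.3, 0.1, 0.2]   # illustrative values, in seconds

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
threads = durations.map { |d| Thread.new { sleep d } }

# Joined one after the other. The first join waits for the 0.3 s thread;
# by then the other two have finished, so their joins return at once.
threads.each(&:join)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts format("elapsed: %.2f s (longest: %.1f s, sum: %.1f s)",
            elapsed, durations.max, durations.sum)
```

The exact figure printed will vary from run to run, but it should sit just above the longest duration.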

–Ken

thefed · December 31, 2007, 7:02pm

module Enumerable
  def each_simultaneously
    threads = []
    each { |e| threads << Thread.new { yield e } }
    return threads
  end
end

Sorry all, THIS is the fixed-up version of each_simultaneously: it
appends with threads << rather than threads >>. Turns out the Ruby
Cookbook has errors, too!

thefed · December 31, 2007, 6:57pm

On Dec 31, 2007, at 11:15 AM, Robert K. wrote:

On 31.12.2007 17:02, thefed wrote:

OK, I understand it better. But why does each {|t| t.join} join
them all at the same time (ish), and not wait for the first one
to finish executing before joining the others?

They are not joined at the same time but one after the other.

But then why doesn’t this take 15 seconds? t.join is called in the
main thread, so shouldn’t the next Thread#join not get called until
the first one finishes?

module Enumerable
def each_simultaneously
threads = []
each { |e| threads >> Thread.new { yield e } }
return threads
end
end

start_time = Time.now
[7,8,9].each_simultaneously do |e|
  sleep(5) # Simulate a long, high-latency operation
  print "Completed operation for #{e}!\n"
end

Completed operation for 8!

Completed operation for 7!

Completed operation for 9!

Time.now - start_time # => 5.009334

thefed · December 31, 2007, 9:47pm

module Enumerable
[...]
print "Completed operation for #{e}!\n"
end

Completed operation for 8!

Completed operation for 7!

Completed operation for 9!

Time.now - start_time # => 5.009334

try looking at the crude timeline below…

sec  0         1         2         3         4         5         6         7
     |---------|---------|---------|---------|---------|---------|---------|
main ====@=================================================
t[1] ===================================================
t[2] ===================================================
t[3] ===================================================

The @ on the main thread represents when the t.join gets called. It
waits in this simple case for t[1] to finish its work (sleeping for 5
seconds), then waits for t[2]. As t[2] has also been doing work all
this time, it only blocks the main thread for another 0.1 sec before
finishing. Same for t[3]. So in this contrived example it takes 5
seconds + whatever overhead for starting threads.

You could throw more instrumentation in there if you wish and do
things like adding additional calls to sleep to simulate extra thread
overhead to make it more obvious.
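
One possible way to add that instrumentation, as a rough sketch: time how long the main thread is blocked inside each join. The now helper and the 0.2-second sleep are assumptions made for illustration:

```ruby
# Hypothetical helper: monotonic timestamp in seconds.
def now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

threads = Array.new(3) { Thread.new { sleep 0.2 } }

# Record how long the main thread is blocked inside each join.
waits = threads.map do |t|
  before = now
  t.join
  now - before
end

waits.each_with_index do |w, i|
  puts format("join on t[%d] blocked main for %.3f s", i + 1, w)
end
```

The first join should account for nearly all of the waiting, with the later joins returning almost immediately, which matches the timeline.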

thefed · December 31, 2007, 10:54pm

On Dec 31, 2007, at 3:46 PM, Craig B. wrote:

The @ on the main thread represents when the t.join gets called. It
waits in this simple case for t[1] to finish its work (sleeping
for 5 seconds), then waits for t[2]. As t[2] has also been doing
work all this time, it only blocks the main thread for another 0.1
sec before finishing. Same for t[3]. So in this contrived example it
takes 5 seconds + whatever overhead for starting threads.

You could throw more instrumentation in there if you wish and do
things like adding additional calls to sleep to simulate extra
thread overhead to make it more obvious.

Thank you SO MUCH! This really clears threading up for me. In
retrospect it was less than obvious, but evident nonetheless. But
this timeline really made the difference for me. Thank you!

Ari

thefed · January 1, 2008, 3:25am

Craig B. wrote:

module Enumerable
[...]
print "Completed operation for #{e}!\n"
end

Completed operation for 8!

Completed operation for 7!

Completed operation for 9!

Time.now - start_time # => 5.009334

try looking at the crude timeline below…

sec  0         1         2         3         4         5         6         7
     |---------|---------|---------|---------|---------|---------|---------|
main ====@=================================================
t[1] ===================================================
t[2] ===================================================
t[3] ===================================================

The @ on the main thread represents when the t.join gets called. It
waits in this simple case for t[1] to finish its work (sleeping for 5
seconds), then waits for t[2]. As t[2] has also been doing work all
this time, it only blocks the main thread for another 0.1 sec before
finishing. Same for t[3]. So in this contrived example it takes 5
seconds + whatever overhead for starting threads.

You could throw more instrumentation in there if you wish and do
things like adding additional calls to sleep to simulate extra thread
overhead to make it more obvious.

To me the important point in addition to the parallelism is that, when
run in batch mode, say with SciTE, main takes less than a second and
kills all the threads. Hence the messages are never seen. To see
the reports you have to do something like

start_time = Time.now
[7,8,9].each_simultaneously do |e|
  sleep(5) # Simulate a long, high-latency operation
  print "Completed operation for #{e}!\n"
end
sleep 5 #######main must take at least 5 seconds!!!

Completed operation for 8!

Completed operation for 7!

Completed operation for 9!

Time.now - start_time # => 5.009334

to guarantee that the threads have 5 seconds to finish
their operation. Or you can use

module Enumerable
  def each_simultaneously
    collect {|e| Thread.new {yield e}}.each {|t| t.join}
  end
end

which guarantees that the threads will finish before
control is returned to main.

In reality it is also important that threads spend a large
part of their operation just waiting when there is only one
CPU.

I think the problem arose because the example on page 760
of the Ruby Cookbook does not mention the necessity of the
main thread lasting long enough and does not show code to
make it happen.

I realize that much of this may have been obvious to some
who replied, but as a newbie it wasn’t to me until I read
the section and played with the code.

Ian

thefed · January 1, 2008, 6:00pm

Robert K. wrote:

I prefer the solution that does not join in the method but returns
Threads. If you think about it, that version is significantly more
flexible. You can join those threads immediately

an_enum.each_simultaneously {|e| … }.each {|th| th.join}

but you can as well do some work in between

threads = an_enum.each_simultaneously {|e| … }
do_some_work
…
threads.each {|th| th.join}

Thanks. That helps both with my understanding the significance
of collect and threads.

Ian

thefed · January 1, 2008, 1:00pm

On 01.01.2008 03:25, Ian W. wrote:

sleep 5 #######main must take at least 5 seconds!!!

Sorry to say that, but this is a bogus solution. Using sleep for this
is not a good idea: if tasks take longer, you will lose output or even
risk that some tasks are not finished properly; if all tasks finish
much faster, you’ll waste time.

The thread killing is the exact reason why #each_simultaneously was
built to return an Array of Thread objects. That way you can join all
the threads.

module Enumerable
  def each_simultaneously
    collect {|e| Thread.new {yield e}}.each {|t| t.join}
  end
end

which guarantees that the threads will finish before
control is returned to main.

I prefer the solution that does not join in the method but returns
Threads. If you think about it, that version is significantly more
flexible. You can join those threads immediately

an_enum.each_simultaneously {|e| … }.each {|th| th.join}

but you can as well do some work in between

threads = an_enum.each_simultaneously {|e| … }
do_some_work
…
threads.each {|th| th.join}
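
A runnable sketch of that pattern, assuming a collect-based each_simultaneously that returns the Thread objects without joining them (equivalent in effect to the array-building version earlier in the thread). The done queue and the arithmetic standing in for do_some_work are hypothetical:

```ruby
module Enumerable
  # Start one thread per element and return the Thread objects
  # without joining them, so the caller decides when to wait.
  def each_simultaneously
    collect { |e| Thread.new { yield e } }
  end
end

done = Queue.new   # hypothetical: collects the finished items

threads = [7, 8, 9].each_simultaneously do |e|
  sleep 0.1        # stand-in for a long, high-latency operation
  done << e
end

main_result = (1..1000).sum   # hypothetical stand-in for do_some_work

threads.each(&:join)          # wait only once the main work is done
puts "main computed #{main_result}; workers finished #{done.size} items"
```

Because the joins come last, the main thread's own work overlaps with the workers instead of waiting behind them.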

I realize that much of this may have been obvious to some
who replied, but as a newbie it wasn’t to me until I read
the section and played with the code.

When I was initially confronted with multithreading it also took me a
while. For me at the time it was difficult to not confuse Thread
objects with threads. This was in Java which decouples Thread object
creation and thread execution, which probably makes it a bit easier to
grasp the concepts.

It is important to keep this distinction in mind: a Thread object is,
in a way, an object like any other, with the added twist that it may be
associated with an independent thread of execution. In Java it is not
associated until the thread starts, and no longer after the thread
terminates. In Ruby the association is there right from the start,
because threads are started immediately, and it lasts until the thread
terminates.
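
A short sketch of the distinction in Ruby, using only Thread#alive?, Thread#join and Thread#value. The 0.1-second sleep and the :done return value are made up for illustration:

```ruby
t = Thread.new { sleep 0.1; :done }

alive_while_running = t.alive?   # true here: the thread started on creation
t.join
alive_after_join = t.alive?      # false: the thread of execution is gone

# The Thread object outlives its thread and still holds the block's result.
puts "alive while running? #{alive_while_running}"
puts "alive after join? #{alive_after_join}, value: #{t.value.inspect}"
```

The object t remains an ordinary Ruby object after the join and can still be queried like any other.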

Kind regards

robert