I am trying to expand my web crawler to use multiple threads (with
mechanize), and I am having some trouble. It seems that each thread is
not creating a local variable, but rather they are all sharing the
“index” variable below:
threads = []
mutex = Mutex.new
10.times do |i|
  threads[i] = Thread.new(i) { |index|
    while index < @will_visit.size
      current_link = @will_visit[index]
      begin
        index += 10
        puts current_link
        page = @agent.get(current_link)
        if page.kind_of?(WWW::Mechanize::Page)
          page.links.each do |link|
            mutex.synchronize do
              @will_visit.push(link.href) if validLink?(link)
            end
          end
        end
        puts "Currently visiting page #{index} of #{@will_visit.size}"
      rescue Exception => msg
        puts "Error with " + current_link
        puts msg
        puts msg.backtrace
      end
    end
  }
end
threads.each { |t| t.join }
From what I have read on Google, the ‘index’ variable should be
independent between threads, but it seems that it is shared. The problem
may also be with the fact that @agent is shared, but I am not sure.
‘index’ is indeed independent here; @agent being shared, however, is
very likely to cause problems. As far as I know, WWW::Mechanize agents
are not safe for use by multiple threads. Each thread will need its
own agent.
That agrees completely with my direct experience. WWW::Mechanize is,
sadly, not threadsafe - it reuses a buffer for each request, and if
you start a new request before another completes, the new request
will clobber the input buffer. I gladly share my painfully won
experience with you.
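The buffer-clobbering Judson describes can be illustrated without Mechanize at all. FakeAgent below is a made-up stand-in for a client that reuses one internal buffer per instance; this is only a sketch of the failure mode, not Mechanize's actual internals:

```ruby
# FakeAgent is a hypothetical stand-in for a non-thread-safe client
# that reuses a single internal buffer per instance.
class FakeAgent
  def get(url)
    @buffer = "response for #{url}"  # one buffer, reused every request
    sleep 0.01                       # other threads can run here...
    @buffer                          # ...and may have overwritten @buffer
  end
end

# Shared agent: a thread may read back another thread's response,
# because every request writes into the same @buffer.
shared = FakeAgent.new
mixed = Array.new(5)
5.times.map { |n| Thread.new { mixed[n] = shared.get("page-#{n}") } }
       .each(&:join)

# Per-thread agents: each thread's buffer is private, so every
# response matches its own request.
clean = Array.new(5)
5.times.map { |n| Thread.new { clean[n] = FakeAgent.new.get("page-#{n}") } }
       .each(&:join)

clean.each_with_index do |r, n|
  raise "mismatch" unless r == "response for page-#{n}"
end
puts "per-thread agents: all 5 responses matched"
```

In the shared case the results may come back mismatched (last write wins); with one agent per thread the mismatch cannot happen, which is the fix suggested above.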
Judson
That would explain the crazy output. I will try modifying it a bit and
see how it works. Thanks!
That seemed to make a big difference… also, do you think I need to put
a mutex around the ‘out.puts’ call (i.e., the file output)?
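One way to serialize those writes with the mutex pattern already used above, sketched with StringIO standing in for the real output file (‘out’ and the surrounding loop are assumptions, not the crawler's actual code):

```ruby
require 'stringio'

out = StringIO.new   # stand-in for the shared output file
io_mutex = Mutex.new

threads = 10.times.map do |n|
  Thread.new do
    # Serialize writes so lines from different threads cannot interleave.
    io_mutex.synchronize { out.puts "visited page #{n}" }
  end
end
threads.each(&:join)

puts out.string.lines.count  # => 10
```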
threads = []
10.times do |i|
  threads[i] = Thread.new(i) do |index|
    if index == 0
      index += 100
    end
    puts index
  end
end
threads.each do |t|
  t.join
end
–output:–
100
1
2
3
4
5
6
7
8
9
If the threads shared the index variable, then each line of the output
would be 100. Instead, i gets assigned to index for each thread, and
nothing a thread does to its index variable has any effect on another
thread’s index variable.
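For contrast, a variable captured from the enclosing scope, rather than passed through Thread.new, really is shared across threads. This sketch shows the behavior the original post was worried about:

```ruby
shared = 0
mutex = Mutex.new

threads = 10.times.map do
  Thread.new do
    # Every thread closes over the SAME `shared` variable.
    mutex.synchronize { shared += 1 }
  end
end
threads.each(&:join)

puts shared  # => 10, because all ten threads mutated one variable
```

Passing a value as an argument to Thread.new, as in the example above, is what gives each thread its own independent copy.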