I am trying to expand my web crawler to use multiple threads (with
mechanize), and I am having some trouble. It seems that each thread is
not creating a local variable, but rather they are all sharing the
“index” variable below:
threads = []
mutex = Mutex.new
10.times do |i|
  threads[i] = Thread.new(i) { |index|
    while index < @will_visit.size
      current_link = @will_visit[index]
      begin
        index += 10
        puts current_link
        page = @agent.get(current_link)
        if page.kind_of?(WWW::Mechanize::Page)
          page.links.each do |link|
            mutex.synchronize do
              @will_visit.push(link.href) if validLink?(link)
            end
          end
        end
        puts "Currently visiting page #{index} of #{@will_visit.size}"
      rescue Exception => msg
        puts "Error with " + current_link
        puts msg
        puts msg.backtrace
      end
    end
  }
end
threads.each { |t| t.join }
From what I have read on Google, the ‘index’ variable should be
independent between threads, but it seems that it is shared. The problem
may also be with the fact that @agent is shared, but I am not sure.
‘index’ is indeed independent here; @agent being shared, however, is
very likely to cause problems. As far as I know, WWW::Mechanize agents
are not safe for use by multiple threads. Each thread will need its
own agent.
That agrees completely with my direct experience. WWW::Mechanize is,
sadly, not threadsafe - it reuses a buffer for each request, and if
you start a new request before another completes, the new request
will clobber the input buffer. I gladly share my painfully won
experience with you.
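The buffer-clobbering Judson describes can be illustrated without Mechanize at all. FakeAgent below is a made-up stand-in for a client that reuses one internal buffer per instance; this is only a sketch of the failure mode, not Mechanize's actual internals:

```ruby
# FakeAgent is a hypothetical stand-in for a non-thread-safe client
# that reuses a single internal buffer per instance.
class FakeAgent
  def get(url)
    @buffer = "response for #{url}"  # one buffer, reused every request
    sleep 0.01                       # other threads can run here...
    @buffer                          # ...and may have overwritten @buffer
  end
end

# Shared agent: a thread may read back another thread's response,
# because every request writes into the same @buffer.
shared = FakeAgent.new
mixed = Array.new(5)
5.times.map { |n| Thread.new { mixed[n] = shared.get("page-#{n}") } }
       .each(&:join)

# Per-thread agents: each thread's buffer is private, so every
# response matches its own request.
clean = Array.new(5)
5.times.map { |n| Thread.new { clean[n] = FakeAgent.new.get("page-#{n}") } }
       .each(&:join)

clean.each_with_index do |r, n|
  raise "mismatch" unless r == "response for page-#{n}"
end
puts "per-thread agents: all 5 responses matched"
```

In the shared case the results may come back mismatched (last write wins); with one agent per thread the mismatch cannot happen, which is the fix suggested above.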
Judson
That would explain the crazy output. I will try modifying it a bit and
see how it works. Thanks!
That seemed to make a big difference… also, do you think I need to put
a mutex around the ‘out.puts’ call (i.e., the file output)?
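One way to serialize those writes with the mutex pattern already used above, sketched with StringIO standing in for the real output file (‘out’ and the surrounding loop are assumptions, not the crawler's actual code):

```ruby
require 'stringio'

out = StringIO.new   # stand-in for the shared output file
io_mutex = Mutex.new

threads = 10.times.map do |n|
  Thread.new do
    # Serialize writes so lines from different threads cannot interleave.
    io_mutex.synchronize { out.puts "visited page #{n}" }
  end
end
threads.each(&:join)

puts out.string.lines.count  # => 10
```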
threads = []
10.times do |i|
  threads[i] = Thread.new(i) do |index|
    if index == 0
      index += 100
    end
    puts index
  end
end
threads.each do |t|
  t.join
end
–output:–
100
1
2
3
4
5
6
7
8
9
If the threads shared the index variable, then each line of the output
would be 100. Instead, i gets assigned to index for each thread, and
nothing a thread does to its index variable has any effect on another
thread’s index variable.
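For contrast, a variable captured from the enclosing scope, rather than passed through Thread.new, really is shared across threads. This sketch shows the behavior the original post was worried about:

```ruby
shared = 0
mutex = Mutex.new

threads = 10.times.map do
  Thread.new do
    # Every thread closes over the SAME `shared` variable.
    mutex.synchronize { shared += 1 }
  end
end
threads.each(&:join)

puts shared  # => 10, because all ten threads mutated one variable
```

Passing a value as an argument to Thread.new, as in the example above, is what gives each thread its own independent copy.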