Mechanize MySQL and threads - deadlock?

First of all: I’m still new to Ruby.

So pointing me to documentation or books is fine.

Use case:

Use Mechanize to gather information. Because there are many pages, I’d
like to run multiple threads, each fetching pages. The fetched data
should be written to a MySQL database.

Can you point me to information telling me how to do this?

The failure looks like this now:

/pr/tasks/get_data_ruby/tasks.rb:364:in `join': deadlock detected (fatal)
from /pr/tasks/get_data_ruby/tasks.rb:364:in `block in run_tasks_wait'
from /pr/tasks/get_data_ruby/tasks.rb:364:in `each'
from /pr/tasks/get_data_ruby/tasks.rb:364:in `run_tasks_wait'
from get-data.rb:37:in `<mai

What is causing such deadlocks at all?

Details about my implementation:

Ruby version: ruby 1.9.1p378 (2010-01-10 revision 26273) [x86_64-linux]
sequel-3.8.0
mysqlplus-0.1.1

Because things always go wrong, I’d like to store state in the database
so the script can resume work where it failed.

To keep things simple, I tried giving each thread its own agent and DB
connection:

def newDBConnection
  Sequel.connect(
    :adapter => 'mysql',
    :user => 'root',
    :host => 'localhost',
    :database => 'get_data',
    :password => 'XXX')
end

# share one agent and db connection per thread
class MyThread < Thread
  def agent
    if !@agent
      @agent = Mechanize.new
      @agent.max_history = 1
    end
    @agent
  end

  def db
    @dbCache ||= newDBConnection
  end
end

Next I defined a task which reuses the db and Mechanize agent from the
thread which is running the task:

class Task
  def run
    # override
    @thread = Thread.current
    task
  end

  def agent
    @agent ||= @thread.agent
  end

  def db
    @dbCache ||= @thread.db
  end
end

Next I wrote a simple function taking a list of tasks and a thread class
(MyThread). It spawns parallel threads, each getting a task from the task
list (a Queue). They all may add more tasks to the queue.
The script should run until all tasks are done.

# t: class extending Thread
# tasks: type Queue.new
# parallel: num of threads used to run those tasks
def run_tasks_wait(t, tasks, parallel)
  working = 0
  threads = []

  # run 3 threads
  (1...parallel).each {|i|
    threads << t.new {
      firstTime = true
      while working > 0 || firstTime
        firstTime = false
        while task = tasks.pop
          working += 1
          $log.debug("starting task #{task.to_s}")
          $log.catchAndLog "caught exception in main worker thread" do
            task.run if !task.nil?
          end
          $log.debug("finished task #{task.to_s} threads-working: #{working}")
          working -= 1
        end
        # even if there is nothing left in queue keep thread running if
        # there is one thread running
        # this thread may push additional tasks to the queue
        sleep 1
      end
    } }
  # wait for threads
  threads.each {|t| t.join() }
end

Thanks for any pointers
Marc W.

Replacing the Queue by an Array seems to fix the issue.
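
A likely explanation, as far as I can tell: Array#pop returns nil on an
empty array, so `while task = tasks.pop` simply ends, whereas Queue#pop
suspends the calling thread until something is pushed. Once every worker
is waiting on the empty queue and the main thread is waiting in join,
nothing can ever wake up, and Ruby reports a deadlock. Below is a minimal
sketch of that difference plus a worker loop that keeps the Queue but
uses the non-blocking `pop(true)`. The names blocks_on_empty_queue?,
try_pop and run_tasks_sketch are made up for illustration, and tasks are
assumed to respond to `call` (e.g. lambdas) instead of `run`.

```ruby
require 'thread' # needed for Queue on Ruby 1.9; harmless on newer versions

# Queue#pop on an empty queue suspends the caller. The timed join returns
# nil because the worker is still waiting when the timeout expires.
def blocks_on_empty_queue?
  worker = Thread.new { Queue.new.pop }
  still_waiting = worker.join(0.2).nil?
  worker.kill
  still_waiting
end

# Queue#pop(true) is the non-blocking form: it raises ThreadError when the
# queue is empty. This helper turns that into nil, like Array#pop.
def try_pop(queue)
  queue.pop(true)
rescue ThreadError
  nil
end

# Hypothetical variant of the worker loop: each worker polls the queue
# without blocking and only exits once the queue is empty AND no other
# worker is busy (a busy worker may still push follow-up tasks).
def run_tasks_sketch(tasks, parallel)
  lock = Mutex.new
  busy = 0 # number of workers currently running a task, guarded by lock

  workers = (1..parallel).map do
    Thread.new do
      loop do
        # Take a task and mark this worker busy in one atomic step.
        task = lock.synchronize do
          t = try_pop(tasks)
          busy += 1 if t
          t
        end

        if task
          begin
            task.call
          ensure
            lock.synchronize { busy -= 1 }
          end
        else
          break if lock.synchronize { busy.zero? && tasks.empty? }
          sleep 0.05 # nothing to do right now; another worker may add tasks
        end
      end
    end
  end

  workers.each { |w| w.join }
end
```

The counter is updated under a Mutex because `working += 1` from several
threads is not atomic. Array is also not documented as thread-safe, so the
switch to Array may only hide the problem rather than solve it.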

Marc