Hi, I’m relatively new to Ruby and to threading in general, and I’m
trying to get the following code to work. The program scrapes data from
a site whose first page holds a list of URLs (more list pages can be
reached by hitting next), and each URL in the list needs to be followed
as well.

What I would like to do is run five concurrent threads connected by
queues. Thread 1 downloads a list page as a Mechanize page object and
queues it, then hits next, downloads the following list page, queues it,
and so on. Thread 2 pops pages from Thread 1’s queue, extracts the
required data (the URLs) and pushes it onto a second queue. Thread 3
pops from Thread 2’s queue, downloads the page each URL points to as a
Mechanize page object and pushes it onto a third queue. Thread 4 pops
from Thread 3’s queue, extracts and formats the necessary data and
pushes it onto a fourth queue. Finally, Thread 5 pops from Thread 4’s
queue and writes the data to a file.

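As a minimal sketch of that hand-off (three stages instead of five, with
plain integers standing in for Mechanize page objects), the idea looks
roughly like this. The DONE marker, the sample data and the thread names
are assumptions made up for illustration; nothing here touches Mechanize
or the real site.

```ruby
require 'thread'  # Queue lives here on Ruby 1.8; harmless on newer versions

DONE = :done  # hypothetical end-of-stream marker passed down the pipeline

pages_queue = Queue.new
items_queue = Queue.new
results     = []

# Stage 1: produce "pages" (integers stand in for downloaded pages).
producer = Thread.new do
  [1, 2, 3].each { |page| pages_queue << page }
  pages_queue << DONE  # tell the next stage that nothing more is coming
end

# Stage 2: turn each page into a list of "items" and pass it on.
extractor = Thread.new do
  # Queue#pop blocks until something is available.
  while (page = pages_queue.pop) != DONE
    items_queue << [page * 10, page * 10 + 1]
  end
  items_queue << DONE
end

# Stage 3: consume the items (a real program would download or write here).
consumer = Thread.new do
  while (items = items_queue.pop) != DONE
    items.each { |item| results << item }
  end
end

# Wait for every stage; otherwise the main program may exit first.
[producer, extractor, consumer].each { |t| t.join }

p results  # prints [10, 11, 20, 21, 30, 31]
```

Pushing a marker through each queue lets every stage stop on its own
once its input is exhausted, rather than depending on whether the
upstream thread happens to be alive at the moment of the check.
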
At least, that is the theory. I wrote the following program, but only
the first thread seems to be queueing anything and the rest don’t work.
Please let me know if I’m missing something.

require 'thread'  # provides Queue on Ruby 1.8

rank_pages_queue = Queue.new
items_queue = Queue.new
item_pages_queue = Queue.new
finished_items_queue = Queue.new

mech_page = get_page(url)

# Thread 1: queue each list page, following "next" until there is none.
rank_page_download = Thread.new do
  while mech_page
    rank_pages_queue << mech_page
    mech_page = hit_next(mech_page)
  end
end

# Thread 2: extract the items from each list page and queue them.
rank_page_extract = Thread.new do
  while rank_page_download.alive?
    rank_page = rank_pages_queue.pop
    items_queue << get_rank_name_url(rank_page)
  end
end

# Thread 3: download the page behind each item's url (item_arr[2]).
item_page_download = Thread.new do
  while rank_page_extract.alive?
    items = items_queue.pop
    items.each do |item_arr|
      item_pages_queue << [item_arr, get_page(item_arr[2])]
    end
  end
end

Thanks in advance. I’m running Ruby 1.8.7 on Snow Leopard.