Forum: Ruby Newbie on Threads

Nabs Kahn (nabusman)
on 2009-05-26 10:20
I'm creating screen-scraping software and I want to have X (let's say
10 for example) threads running simultaneously doing the scraping. The
program will read a text file containing an unknown number of URLs and
then scrape each one.

My question is: how do I set up the threads so that once a thread finishes
execution it picks up another URL and starts executing again?

Thanks,

Nabs
Robert Klemme (Guest)
on 2009-05-26 10:48
(Received via mailing list)
2009/5/26 Nabs Kahn <nabusman@gmail.com>:
> I'm creating screen-scraping software and I want to have X (let's say
> 10 for example) threads running simultaneously doing the scraping. The
> program will read a text file containing an unknown number of URLs and
> then scrape each one.
>
> My question is: how do I set up the threads so that once a thread finishes
> execution it picks up another URL and starts executing again?

Create a Queue (require 'thread') and have your worker threads read
URLs from it.
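
An untested sketch — scrape stands in for whatever your scraping code is:

require 'thread'

queue = Queue.new

# ten workers; each keeps pulling URLs until it sees the nil end marker
workers = (1..10).map do
  Thread.new do
    while (url = queue.deq)
      scrape(url)
    end
  end
end

IO.foreach("urls.txt") { |line| queue.enq(line.chomp) }
10.times { queue.enq(nil) }   # one nil end marker per worker
workers.each { |t| t.join }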

Kind regards

robert
Nabs Kahn (nabusman)
on 2009-05-26 13:58
Thanks for the quick response. This is what I wrote, but it doesn't seem
to work: no errors, but the program just finishes without doing
anything. Where did I go wrong?

require 'thread'

buffer = SizedQueue.new(10)

producer = Thread.new do
  File.open("urls.txt").each do |url|
    buffer << url
  end
end

consumer = Thread.new do
  while buffer.num_waiting != 0
    url = buffer.pop
    #do screen scraping with url here
  end
end

consumer.join



Robert Klemme (Guest)
on 2009-05-26 16:55
(Received via mailing list)
2009/5/26 Nabs Kahn <nabusman@gmail.com>:
> Thanks for the quick response. This is what I wrote, but it doesn't seem
> to work: no errors, but the program just finishes without doing
> anything. Where did I go wrong?

I am not sure what you expect: your program does not do anything with
the URL.  Did you try printing it?

> require 'thread'
>
> buffer = SizedQueue.new(10)
>
> producer = Thread.new do
>  File.open("urls.txt").each do |url|
>    buffer << url

For better readability I suggest using buffer.enq.

>  end
> end
>
> consumer = Thread.new do
>  while buffer.num_waiting != 0

Rather use buffer.deq, which is a blocking call.  Also, num_waiting is
the number of threads *waiting* on the queue, not the number of items
in it, so your while condition is false right from the start and the
consumer exits without doing anything.  You want a more reliable
termination detection.  Usually I put a special value into the queue
(more precisely, as many special values as there are threads).  In
your case a symbol would work.
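
For example (untested; :END_OF_WORK is just an arbitrary marker), with
ten workers the producer would finish with

 10.times { buffer.enq(:END_OF_WORK) }

and each worker would loop with

 until (url = buffer.deq) == :END_OF_WORK
   # scrape url here
 end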

>    url = buffer.pop
>    #do screen scraping with url here
>  end
> end
>
> consumer.join

Kind regards

robert
Nabs Kahn (nabusman)
on 2009-05-26 17:05
> I am not sure what you expect: your program does not do anything with
> the URL.  Did you try printing it?

I left that part out; I figured it would make the example unnecessarily
complex.

Thanks for the tips.

Nabs
Thomas B. (tpreal)
on 2009-05-26 20:37
Nabs Kahn wrote:
> require 'thread'
>
> buffer = SizedQueue.new(10)
>
> producer = Thread.new do
>   File.open("urls.txt").each do |url|
>     buffer << url
>   end
> end
>
> consumer = Thread.new do
>   while buffer.num_waiting != 0
>     url = buffer.pop
>     #do screen scraping with url here
>   end
> end
>
> consumer.join
>

Hello. You said that you want X threads, but in your example you have
only one scraping thread. I think this is more like what you intended
(not tested):

require 'thread'

x = 10  # number of scraping threads
buffer = SizedQueue.new(10)

producer = Thread.new do
  File.open("urls.txt").each do |url|
    buffer << url
  end
  x.times { buffer << nil }  # one nil per consumer so their loops can end
end

consumers = Array.new(x) {
  Thread.new do
    while (url = buffer.deq)  # a nil from the producer ends the loop
      # do screen scraping with url here
    end
  end
}

Now if you want the program to stop after the producer finishes, you
should add

producer.join
consumers.each{|c| c.join}

to make sure all processing is finished.
Nabs Kahn (nabusman)
on 2009-05-27 18:03
This is what I ended up doing, similar to what was suggested.
(definition of screenScrape method not included)

bufferSize = 10
buffer = SizedQueue.new(bufferSize)
threads = []

producer = Thread.new do
  File.open("urls.txt").each do |url|
    buffer.enq url
  end
  bufferSize.times {buffer.enq(:END_OF_WORK)}
end

bufferSize.times do
  threads << Thread.new do
    url = nil
    while(url != :END_OF_WORK)
      url = buffer.deq
      screenScrape(url)
    end
  end
end

producer.join
threads.each do |thr|
  thr.join
end

Robert Klemme (Guest)
on 2009-05-27 18:12
(Received via mailing list)
2009/5/27 Nabs Kahn <nabusman@gmail.com>:
> This is what I ended up doing, similar to what was suggested.
> (definition of screenScrape method not included)

Thanks for the update!

>
> bufferSize.times do
>  threads << Thread.new do
>    url = nil
>    while(url != :END_OF_WORK)
>      url = buffer.deq
>      screenScrape(url)
>    end
>  end
> end

The loop above does not work properly because you will hand off
:END_OF_WORK to screenScrape().  Rather do

 threads << Thread.new do
   while ((url = buffer.deq) != :END_OF_WORK)
     screenScrape(url)
   end
 end

or

 threads << Thread.new do
   until ((url = buffer.deq) == :END_OF_WORK)
     screenScrape(url)
   end
 end

> producer.join
> threads.each do |thr|
>  thr.join
> end

Btw, you do not need a separate producer thread.  You can simply do
that in the main thread.  But of course you must start worker threads
before you start to fill the queue; otherwise the main thread would
block forever once the SizedQueue is full, with no consumer to drain it.
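
I.e. something like this (untested, reusing your names):

 threads = []
 bufferSize.times do
   threads << Thread.new do
     until ((url = buffer.deq) == :END_OF_WORK)
       screenScrape(url)
     end
   end
 end

 # the main thread does the queue filling now
 File.open("urls.txt").each do |url|
   buffer.enq url
 end
 bufferSize.times { buffer.enq(:END_OF_WORK) }

 threads.each { |thr| thr.join }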

Kind regards

robert
Nabs Kahn (nabusman)
on 2009-05-27 18:24
> Btw, you do not need a separate producer thread.  You can simply do
> that in the main thread.  But of course you must start worker threads
> before you start to fill the queue; otherwise the main thread would
> block forever once the SizedQueue is full, with no consumer to drain it.

Oh, I thought it would cause the main thread to pause until there was
room in the queue, since there would be more URLs in the file than the
10 slots in the queue, so I created a separate producer thread.
Thanks for the information.
Robert Klemme (Guest)
on 2009-05-27 19:46
(Received via mailing list)
On 27.05.2009 18:24, Nabs Kahn wrote:
>> Btw, you do not need a separate producer thread.  You can simply do
>> that in the main thread.  But of course you must start worker threads
>> before you start to fill the queue; otherwise the main thread would
>> block forever once the SizedQueue is full, with no consumer to drain it.
>
> Oh, I thought it would cause the main thread to pause until there was
> room in the queue, since there would be more URLs in the file than the
> 10 slots in the queue, so I created a separate producer thread.

Well, it will.  But so will your producer thread, and eventually the
main thread, since it joins on the producer.  A producer thread only
brings an advantage if you need to do other things in the main thread.
Since you don't in the code you presented, you may as well do the queue
filling in the main thread.

Btw, I just noticed one thing: you don't chomp the lines read from the
file, so your URLs will still contain the trailing line feed.
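
I.e. in the producer:

 File.open("urls.txt").each do |url|
   buffer.enq url.chomp  # strip the trailing "\n"
 end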

> Thanks for the information.

You're welcome!

Kind regards

  robert
Nabs Kahn (nabusman)
on 2009-05-27 21:35
> Well, it will.  But so will your producer thread, and eventually the
> main thread, since it joins on the producer.  A producer thread only
> brings an advantage if you need to do other things in the main thread.
> Since you don't in the code you presented, you may as well do the queue
> filling in the main thread.

I see, that makes sense.


> Btw, I just noticed one thing: you don't chomp the lines read from the
> file, so your URLs will still contain the trailing line feed.

I am actually chomping them inside the screenScrape method.

Nabs
Robert Klemme (Guest)
on 2009-05-27 23:46
(Received via mailing list)
On 27.05.2009 21:35, Nabs Kahn wrote:
>> file, so your URLs will still contain the trailing line feed.
>
> I am actually chomping them inside the screenScrape method.

I wouldn't do that.  Preparation of the input should be done outside of
your scraping method.  You put too much knowledge about the environment
into the method.

Cheers

  robert
Nabs Kahn (nabusman)
on 2009-05-28 10:41
> I wouldn't do that.  Preparation of the input should be done outside of
> your scraping method.  You put too much knowledge about the environment
> into the method.

Right, I see your point; I'll change it. Thanks again.

Nabs