Forum: Ruby Pipelined Processing

bob.news (Guest)
on 2005-11-20 17:04
(Received via mailing list)
Hi,

this came up recently on IRC: the question was how to chain processing so
that each step runs concurrently with the other steps.  While I don't see a
real benefit as long as there are no native threads in Ruby, I played around
a bit and this is the result (attached).  There's certainly room for
improvement.  Do with this whatever you like.
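(The attachment itself is not reproduced here. A minimal sketch of the kind of stage-per-thread pipeline described above might look like the following; the Pipeline class, its methods and the DONE sentinel are made-up names, not the code from the attachment.)

require 'thread'

# Each stage runs in its own thread, popping items from the previous
# stage's queue and pushing results to the next stage's queue.
class Pipeline
  DONE = Object.new   # sentinel marking the end of the stream

  def initialize
    @stages = []
  end

  # Register one processing step; the block transforms a single element.
  def stage(&block)
    @stages << block
    self
  end

  # Push the elements of +enum+ through all stages and collect the output.
  def run(enum)
    queues  = Array.new(@stages.size + 1) { Queue.new }
    threads = @stages.each_with_index.map do |blk, i|
      Thread.new do
        while (item = queues[i].pop) != DONE
          queues[i + 1].push(blk.call(item))
        end
        queues[i + 1].push(DONE)
      end
    end

    enum.each { |e| queues.first.push(e) }
    queues.first.push(DONE)

    results = []
    while (item = queues.last.pop) != DONE
      results << item
    end
    threads.each(&:join)
    results
  end
end

# Three stages running concurrently on a small input.
p Pipeline.new.stage { |x| x * 2 }.stage { |x| x + 1 }.stage { |x| x.to_s }.run(1..5)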

Kind regards

    robert
pesterhazy (Guest)
on 2005-11-20 23:03
(Received via mailing list)
>> this came up recently on IRC: the question was how to chain processing
>> so that each step runs concurrently with the other steps.  While I don't
>> see a real benefit as long as there are no native threads in Ruby, I
>> played around a bit and this is the result (attached).  There's certainly
>> room for improvement.  Do with this whatever you like.

Hello,

it's funny you should bring this up because I just cooked up something
similar (albeit less sophisticated and robust). What I needed was several
tasks carried out in parallel. In this particular instance I download
several pages using open-uri and output a chunk to WEBrick as soon as its
processing is finished (as responsiveness is critical). For network I/O,
Ruby's pseudo-threads seem to work well, though I agree native threads
would be much better.

Attached the ParallelEnumerate class along with a trivial test case.
Bear with me, it's my first ruby script.
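(The attachment is not reproduced here. Going by the discussion that follows, one thread per source collection feeding a shared queue, a ParallelEnumerate class along those lines might look roughly like this; the implementation is a guess, not Paulus's actual code.)

require 'thread'

class ParallelEnumerate
  include Enumerable

  DONE = Object.new   # per-thread end-of-stream marker

  def initialize(*enums)
    @enums = enums
  end

  def each
    queue = Queue.new
    @enums.each do |enum|
      Thread.new do
        enum.each { |e| queue.push(e) }
        queue.push(DONE)
      end
    end
    finished = 0
    while finished < @enums.size
      item = queue.pop
      item == DONE ? finished += 1 : yield(item)
    end
  end
end

# Trivial usage: elements from both sources are yielded as they arrive.
ParallelEnumerate.new(1..3, 'a'..'c').each { |e| p e }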

Cheers,
Paulus
bob.news (Guest)
on 2005-11-20 23:27
(Received via mailing list)
Paulus E. <removed_email_address@domain.invalid> wrote:
> several tasks being carried out in parallel.
It's not really similar: my program runs several *stages* of processing in
parallel (a pipeline), while you are running several similar tasks in
parallel.  At least that's what I understood from your explanation and code.

> In this particular
> instance I download several pages using open-uri and output a chunk
> to webrick as soon as its processing is finished (as responsiveness
> is critical). For network I/O, ruby's pseudo-threads seem to work
> well, though I agree native threads would be much better.

True, as soon as slow I/O is involved Ruby threads do OK, as long as the
processing doesn't use too many resources.

> Attached the ParallelEnumerate class along with a trivial test case.
> Bear with me, it's my first ruby script.

You try to tackle multithreading with your first script?  Wow!  I don't
exactly understand why you use a thread per collection just to fill a
queue.  Maybe I'm missing something here but it looks a bit strange.  Are
you sure this actually downloads in parallel?

If I had done this I'd have taken a different approach (but maybe I'm
missing some of your requirements): I'd create a queue which receives
URLs (or whatever tasks you have).  Then I'd set up n threads (n>0,
probably depending on user input) and each thread reads elements from
the queue and processes them in parallel.  You might as well combine
both approaches, i.e. if a URL has been downloaded, the content is
pushed onto another queue from which another number of threads (possibly
just 1) reads and processes.
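(A rough sketch of the two-queue approach just described: a queue of URLs, n downloader threads, and a second queue read by a single consumer. The URLs, the thread count and the "processing" are made-up placeholders.)

require 'thread'
require 'open-uri'

urls     = Queue.new
contents = Queue.new
n        = 4

downloaders = Array.new(n) do
  Thread.new do
    while (url = urls.pop) != :done
      contents.push(URI.open(url).read)
    end
  end
end

consumer = Thread.new do
  while (page = contents.pop) != :done
    puts "got #{page.size} bytes"   # stand-in for the real processing
  end
end

%w[http://example.com http://example.org].each { |u| urls.push(u) }
n.times { urls.push(:done) }        # one end marker per downloader
downloaders.each(&:join)
contents.push(:done)                # tell the consumer we are done
consumer.join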

Kind regards

    robert
pesterhazy (Guest)
on 2005-11-21 00:08
(Received via mailing list)
Robert K. schrieb:
>> Attached the ParallelEnumerate class along with a trivial test case.
>> Bear with me, it's my first ruby script.
> You try to tackle multithreading with your first script?  Wow!  I don't
> exactly understand why you use a thread per collection just to fill a
> queue. Maybe I'm missing something here but it looks a bit strange.  Are
> you sure this actually downloads in parallel?

Yes it works for the purpose. The "collections" look like this:

require 'open-uri'

class Source
  include Enumerable
  def each
    @data = open("http://...").read
    while element = next_element
      yield element
    end
  end
  def next_element
    # process @data and return the next element, or nil when exhausted
  end
end

It would probably have been cleaner to do this using a thread pool for
downloading the pages and processing the data in the main thread, as you
suggest - separating the stages. I used an enumerator because it's
convenient - I wrap the enumerator in a pseudo-IO object which I return
to WEBrick (which expects an object that supports the method "read").
That way, I get a kind of simple asynchronous data processing.
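(A rough sketch of such a pseudo-IO wrapper; the class name and the exact read signature are assumptions, not the original code. Every call to read hands back the next chunk from the enumerator, and nil signals end of stream the way an IO does.)

class EnumeratorIO
  def initialize(enumerable)
    @enum = enumerable.to_enum(:each)
  end

  # Length and buffer arguments are accepted but ignored here.
  def read(_length = nil, _buffer = nil)
    @enum.next
  rescue StopIteration
    nil
  end
end

# Hypothetical consumer reading chunks until EOF, the way a server
# streaming a response body might.
io = EnumeratorIO.new(["chunk 1\n", "chunk 2\n"])
while chunk = io.read
  print chunk
end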

>
> If I had done this I'd have taken a different approach (but maybe I'm
> missing some of your requirements): I'd create a queue which receives
> URLs (or whatever tasks you have).  Then I'd set up n threads (n>0,
> probably depending on user input) and each thread reads elements from
> the queue and processes them in parallel.  You might as well combine
> both approaches, i.e. if a URL has been downloaded, the content is
> pushed onto another queue from which another number of threads (possibly
> just 1) reads and processes.

Thanks for the comment,
Paulus
Sean O. (Guest)
on 2005-11-21 00:44
(Received via mailing list)
On 11/20/05, Robert K. <removed_email_address@domain.invalid> wrote:
>
> Hi,
>
> this came up recently on IRC: the question was how to chain processing so that
> each step runs concurrently with the other steps.  While I don't see a real
> benefit as long as there are no native threads in Ruby

One idea that springs to mind is to keep a responsive monitor thread
(e.g. GUI, console, remote) while performing a batch process by
splitting up an intensive computation into environmentally
thread-friendly chunks, represented by the blocks.
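(A small sketch of that idea: the heavy work is split into chunks so the thread scheduler can keep a monitor thread responsive. The chunk size, the fake workload and the progress output are arbitrary placeholders.)

progress = 0
total    = 100

worker = Thread.new do
  total.times do |i|
    10_000.times { Math.sqrt(i + 1) }   # stand-in for one chunk of real work
    progress = i + 1
    Thread.pass                         # give other threads a chance to run
  end
end

monitor = Thread.new do
  until progress == total
    puts "processed #{progress}/#{total} chunks"
    sleep 0.1
  end
end

worker.join
monitor.join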

> Do with this whatever you like.
>
> Kind regards
>
>    robert

Thanks! I will :)

Regards,

Sean
ptkwt (Guest)
on 2005-11-21 08:55
(Received via mailing list)
In article <removed_email_address@domain.invalid>,
Robert K. <removed_email_address@domain.invalid> wrote:
True, without native threads you won't really gain any performance, but
what if (to improve performance) you were to either:
1) launch new processes instead of threads?
  or
2) set things up so that different stages of the pipeline can run on
different machines? (maybe using DRb? see the sketch below)

...of course the amount of information passed between stages of the
pipeline would need to be small so that the communication overhead would
stay low.
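(A sketch of the second option: one process exposes its work queue over DRb so a pipeline stage on another machine can pull items from it. The URI, port and payloads are made-up examples, and both halves run in one process here only to keep the sketch self-contained.)

require 'drb/drb'
require 'thread'

uri  = 'druby://localhost:9000'
work = Queue.new
10.times { |i| work.push("small work item #{i}") }
DRb.start_service(uri, work)        # expose the queue to the network

# On the remote machine this part would be a separate script.
remote = DRbObject.new_with_uri(uri)
3.times { puts remote.pop }         # each pop is a remote call over the wire
DRb.stop_service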


Phil