Pipelined Processing


#1

Hi,

this came up recently on IRC: the question was how to chain processing so
that each step runs concurrently with the other steps. While I don’t see a
real benefit as long as there are no native threads in Ruby, I played around
a bit and this is the result (attached). There’s certainly room for
improvement. Do with this whatever you like.
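A minimal sketch of the chaining idea (the actual attachment is not reproduced here, so the Stage class and all names below are illustrative): each stage is a thread that reads from an input queue, applies a block, and writes to an output queue, with a sentinel object marking the end of the stream.

```ruby
# Each pipeline stage runs in its own thread, connected by queues.
class Stage
  DONE = Object.new   # sentinel marking end of stream

  attr_reader :output

  def initialize(input, &block)
    @output = Queue.new
    @thread = Thread.new do
      while !(item = input.pop).equal?(DONE)
        @output << block.call(item)
      end
      @output << DONE   # propagate end-of-stream to the next stage
    end
  end
end

# Build a two-stage pipeline: double, then add one.
source = Queue.new
stage1 = Stage.new(source) { |x| x * 2 }
stage2 = Stage.new(stage1.output) { |x| x + 1 }

[1, 2, 3].each { |x| source << x }
source << Stage::DONE

results = []
while !(r = stage2.output.pop).equal?(Stage::DONE)
  results << r
end
p results   # => [3, 5, 7]
```

Each stage only ever blocks on its own input queue, so the stages overlap whenever the Ruby scheduler switches threads.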

Kind regards

robert

#2

this came up recently on IRC: the question was how to chain processing so
that each step runs concurrently with the other steps. While I don’t see a
real benefit as long as there are no native threads in Ruby, I played around
a bit and this is the result (attached). There’s certainly room for
improvement. Do with this whatever you like.

Hello,

it’s funny you should bring this up because I just cooked up something
similar (albeit less sophisticated and robust). What I needed was
several tasks being carried out in parallel. In this particular instance
I download several pages using open-uri and output a chunk to webrick as
soon as its processing is finished (as responsiveness is critical). For
network I/O, ruby’s pseudo-threads seem to work well, though I agree
native threads would be much better.

Attached is the ParallelEnumerate class along with a trivial test case.
Bear with me, it’s my first ruby script.
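The attachment itself is not reproduced here; the following is only a guessed reconstruction of what such a ParallelEnumerate might look like, based on the thread-per-collection design discussed below. All names and details are assumptions.

```ruby
# Interleaves elements from several enumerables, one feeder thread each.
class ParallelEnumerate
  include Enumerable

  DONE = Object.new   # per-thread end-of-stream sentinel

  def initialize(*enumerables)
    @enumerables = enumerables
  end

  # Yields elements from all collections in arrival order.
  def each
    queue = Queue.new
    threads = @enumerables.map do |enum|
      Thread.new do
        enum.each { |item| queue << item }
        queue << DONE
      end
    end
    finished = 0
    while finished < @enumerables.size
      item = queue.pop
      if item.equal?(DONE)
        finished += 1
      else
        yield item
      end
    end
    threads.each(&:join)
  end
end

p ParallelEnumerate.new([1, 2], [3, 4]).sort   # order varies; sorted => [1, 2, 3, 4]
```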

Cheers,
Paulus


#3

Paulus E. removed_email_address@domain.invalid wrote:

several tasks being carried out in parallel.
It’s not really similar. While my program executes several stages of
processing in parallel, you are actually doing similar things in parallel.
At least that’s what I understood from your explanation and code.

In this particular
instance I download several pages using open-uri and output a chunk
to webrick as soon as its processing is finished (as responsiveness
is critical). For network I/O, ruby’s pseudo-threads seem to work
well, though I agree native threads would be much better.

True, as soon as slow I/O is involved, Ruby threads do OK, as long as the
processing doesn’t use too many resources.

Attached the ParallelEnumerate class along with a trivial test case.
Bear with me, it’s my first ruby script.

You try to tackle multithreading with your first script? Wow! I don’t
exactly understand why you use a thread per collection just to fill a
queue. Maybe I’m missing something here, but it looks a bit strange. Are
you sure this actually downloads in parallel?

If I had done this I’d have taken a different approach (but maybe I’m
missing some of your requirements): I’d create a queue which receives URLs
(or whatever tasks you have). Then I’d set up n threads (n > 0, probably
depending on user input), and each thread reads elements from the queue and
processes them in parallel. You might as well combine both approaches, i.e.
if a URL has been downloaded, the content is pushed onto another queue from
which another number of threads (possibly just 1) reads and processes.
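The two-queue approach might be sketched like this, with dummy strings standing in for the open-uri downloads and the real processing; all names and the worker count are illustrative:

```ruby
tasks     = Queue.new   # receives URLs (or whatever tasks you have)
downloads = Queue.new   # receives downloaded content
N = 3                   # number of download workers, e.g. from user input

downloaders = N.times.map do
  Thread.new do
    while (url = tasks.pop) != :done
      downloads << "content of #{url}"   # stands in for an open-uri fetch
    end
  end
end

results = []
processor = Thread.new do               # second stage: possibly just 1 thread
  loop do
    content = downloads.pop
    break if content == :done
    results << content.upcase           # stands in for real processing
  end
end

%w[a b c d].each { |u| tasks << u }
N.times { tasks << :done }              # one sentinel per download worker
downloaders.each(&:join)
downloads << :done
processor.join
p results.sort
```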

Kind regards

robert

#4

Robert K. schrieb:

Attached is the ParallelEnumerate class along with a trivial test case.
Bear with me, it’s my first ruby script.
You try to tackle multithreading with your first script? Wow! I don’t
exactly understand why you use a thread per collection just to fill a
queue. Maybe I’m missing something here but it looks a bit strange. Are
you sure this actually downloads in parallel?

Yes, it works for the purpose. The “collections” look like this:

class Source
  include Enumerable

  def each
    @data = open_url("http://…").read
    while (element = next_element)
      yield element
    end
  end

  def next_element
    # process @data; returns nil when exhausted
  end
end

It would probably have been cleaner to do this using a thread pool for
downloading the pages and processing the data in the main thread, as you
suggest - separating the stages. I used an enumerator because it’s
convenient - I wrap the enumerator in a pseudo IO object which I return
to webrick (which expects an object that supports the method “read”).
That way, I get a kind of simple asynchronous data processing.
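The pseudo-IO wrapper might look roughly like this. It is a minimal sketch: EnumIO and the chunk data are invented, and WEBrick may pass a length argument to read, which this version simply ignores; returning nil signals end-of-stream.

```ruby
# Wraps any Enumerable in an object exposing the read method WEBrick
# expects from a response body.
class EnumIO
  def initialize(enum)
    @enum = enum.to_enum(:each)
  end

  # Called repeatedly by the consumer; nil means EOF.
  def read(*_len)
    @enum.next.to_s
  rescue StopIteration
    nil
  end
end

io = EnumIO.new(["chunk1 ", "chunk2 ", "chunk3"])
body = +""
while (chunk = io.read)
  body << chunk
end
p body   # => "chunk1 chunk2 chunk3"
```

Because Enumerator#next only advances the underlying each on demand, the producer’s work is interleaved with the consumer’s reads.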

If I had done this I’d have taken a different approach (but maybe I’m
missing some of your requirements): I’d create a queue which receives
URLs (or whatever tasks you have). Then I’d set up n threads (n > 0,
probably depending on user input), and each thread reads elements from
the queue and processes them in parallel. You might as well combine
both approaches, i.e. if a URL has been downloaded, the content is
pushed onto another queue from which another number of threads (possibly
just 1) reads and processes.

Thanks for the comment,
Paulus


#5

On 11/20/05, Robert K. removed_email_address@domain.invalid wrote:

Hi,

this came up recently on IRC: the question was how to chain processing so
that each step runs concurrently with the other steps. While I don’t see a
real benefit as long as there are no native threads in Ruby

One idea that springs to mind is to keep a responsive monitor thread
(e.g. GUI, console, remote) while performing a batch process by
splitting up an intensive computation into environmentally
thread-friendly chunks, represented by the blocks.
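That chunking idea could be sketched like this: a toy computation split into slices, with Thread.pass yielding to a monitor thread between chunks. The monitor body and chunk sizes are purely illustrative.

```ruby
progress = []
monitor = Thread.new do
  loop do
    progress << :tick   # stands in for updating a GUI/console/remote view
    sleep 0.001
  end
end

total = 0
(1..100).each_slice(10) do |chunk|   # the "thread-friendly chunks"
  total += chunk.sum                 # a slice of the intensive computation
  Thread.pass                        # give the monitor thread a chance to run
  sleep 0.001
end
monitor.kill

p total   # => 5050
```

Without the explicit yields, a tight CPU-bound loop can starve the other green threads for noticeable stretches.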

Do with this whatever you like.

Kind regards

robert

Thanks! I will :-)

Regards,

Sean


#6

In article removed_email_address@domain.invalid,
Robert K. removed_email_address@domain.invalid wrote:

True, without native threads you won’t really gain any performance, but
what if (to improve performance) you were to either:

  1. launch new processes instead of threads?
    or
  2. set things up so that different stages of the pipeline can run on
    different machines? (maybe using DRb?)

…of course the amount of information passed between stages of the
pipeline would need to be small so that the communication overhead would
stay low.
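Option 1 might be sketched with fork and pipes (POSIX only; the doubling stage is a placeholder for real work). Each stage becomes a separate OS process, and the pipes are the small communication channels between them:

```ruby
r1, w1 = IO.pipe   # parent -> stage
r2, w2 = IO.pipe   # stage  -> parent

pid = fork do
  w1.close; r2.close
  # The stage: read lines, double each number, write the result on.
  r1.each_line { |line| w2.puts(line.to_i * 2) }
  w2.close
end

r1.close; w2.close
[1, 2, 3].each { |n| w1.puts n }
w1.close                          # EOF tells the stage the stream is done

doubled = r2.each_line.map(&:to_i)
Process.wait(pid)
p doubled   # => [2, 4, 6]
```

For stages on different machines the pipes would be replaced by sockets or DRb proxies, but the shape stays the same: small messages between independent workers.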

Phil