Dear all,
I conducted a simple experiment (using GRC) to test the TPB scheduler’s performance and, following a search here, I cannot find any definitive information that would explain the observed behaviour. I would appreciate your thoughts on the matter:
Three flow graphs were created in separate GRC documents. No graph uses throttling. Tests were run on a dual-core Linux machine using a 3.3git release.
(1) One graph: a high-rate signal source connected to a resampler, which is in turn connected to a null sink.
(2) Two identical, disconnected sub-graphs: each contains a high-rate signal source connected to a resampler, which is in turn connected to a null sink (i.e. as above, just twice).
(3) One graph: a single high-rate signal source whose output is connected to the inputs of two separate resamplers, each of which is connected to its own null sink.
‘High-rate’ means a few Msps, and the resamplers output data at a similar rate (e.g. 8 Msps with decim/interp = 4:3).
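For concreteness, graph (1) corresponds roughly to the following if written directly in Python rather than GRC (block names are from the 3.3-era Python API; the source frequency, amplitude and sample rates are placeholder values I have chosen to match the description above, so treat this purely as a sketch):

#!/usr/bin/env python
# Rough sketch of graph (1): high-rate signal source -> rational resampler -> null sink.
# No throttle anywhere; values are placeholders (~10.67 Msps in, ~8 Msps out at 4:3).
from gnuradio import gr, blks2

class graph_one(gr.top_block):
    def __init__(self, samp_rate=10.667e6):
        gr.top_block.__init__(self, "graph_one")
        src = gr.sig_source_c(samp_rate, gr.GR_SIN_WAVE, 100e3, 1.0)
        resamp = blks2.rational_resampler_ccc(interpolation=3, decimation=4)
        sink = gr.null_sink(gr.sizeof_gr_complex)
        self.connect(src, resamp, sink)

if __name__ == '__main__':
    tb = graph_one()
    tb.start()    # runs flat-out, one thread per block (TPB)
    raw_input('Running; press Enter to stop: ')
    tb.stop()
    tb.wait()

Graph (2) is simply this chain instantiated twice within the same top_block, and graph (3) is the fan-out variant sketched a little further below.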
Thanks to the TPB scheduler, (2) uses 100% CPU (max load on both cores), as the sub-graphs are disconnected.
However, when running (1) and (3), only 50% utilisation is observed. I also placed ‘Copy’ and ‘Kludge Copy’ blocks before the resampler inputs in (3), but this did not increase performance (which makes sense given the assumed flow model below).
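A corresponding sketch of the fan-out case (3), including the optional ‘Copy’ blocks, might look like this (same caveats and placeholder values as above):

# Sketch of graph (3): one source fanned out to two resamplers, each feeding
# its own null sink. use_copy=True puts a 'Copy' block in front of each
# resampler input, as in the test described above.
from gnuradio import gr, blks2

class graph_three(gr.top_block):
    def __init__(self, samp_rate=10.667e6, use_copy=False):
        gr.top_block.__init__(self, "graph_three")
        src = gr.sig_source_c(samp_rate, gr.GR_SIN_WAVE, 100e3, 1.0)
        for i in range(2):
            resamp = blks2.rational_resampler_ccc(interpolation=3, decimation=4)
            sink = gr.null_sink(gr.sizeof_gr_complex)
            if use_copy:
                # 'Copy' block before each resampler input
                self.connect(src, gr.copy(gr.sizeof_gr_complex), resamp, sink)
            else:
                self.connect(src, resamp, sink)

It can be started and stopped in the same way as the sketch above; with use_copy=True it reproduces the ‘Copy’-before-each-resampler variant, which, as noted, made no difference to utilisation.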
I am not aware of the intricacies of the asynchronous flow model used, or of the TPB scheduler (I have only skimmed the source), but I wonder: why do (1) and (3) not use more than 50% CPU?
Please excuse any gaps in my understanding, but my thoughts are as follows:
Asynchronous producer/consumer and push/pull graphs are obviously quite complicated to get right in all circumstances (I pulled my hair out designing one), and there are a number of ways data can be passed between blocks; needless to say, GR generally does an excellent job of this. In the particular scenario of (1) and (3), though, is the performance bottleneck the manner in which that data is passed around, and how/when the blocks’ production/consumption state and thread state are changed? I am not sure whether a push or a pull model is used in the absence of a clock or throttle, but does the signal source block (i.e. stall) because it must wait until its own internal production buffer has been consumed by the resampler, so that the currently running thread switches back and forth between the signal source and the resampler? This (in my mind) rests on the assumption that the buffer (memory region) passed to the resampler’s general_work actually lives inside the signal source block, and that there is no direct control over how much of that buffer is consumed in one iteration of the connected block’s (in this case the resampler’s) general_work, other than indirectly via forecast in the connected block. Or is that not the case?
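To make concrete what I mean by ‘indirectly via forecast’, here is a minimal pass-through general block showing the only two levers I am aware of: forecast() to tell the scheduler how many input items are wanted per call, and consume() inside general_work() to report how many were actually used. (I have written it against the Python block API found in newer trees purely for brevity; in 3.3 this contract lives in the C++ gr_block class, and the exact Python forecast() signature varies between releases, so this is illustrative only.)

import numpy
from gnuradio import gr

class passthrough(gr.basic_block):
    """Copies input to output; exists only to illustrate the forecast/consume contract."""
    def __init__(self):
        gr.basic_block.__init__(self,
            name="passthrough",
            in_sig=[numpy.complex64],
            out_sig=[numpy.complex64])

    def forecast(self, noutput_items, ninput_items_required):
        # The only (indirect) way to influence how much upstream data the
        # scheduler offers us: request one input item per output item.
        for i in range(len(ninput_items_required)):
            ninput_items_required[i] = noutput_items

    def general_work(self, input_items, output_items):
        n = min(len(input_items[0]), len(output_items[0]))
        output_items[0][:n] = input_items[0][:n]
        # Report how much of the upstream buffer we actually used; we never
        # manage that buffer ourselves (it is, I assume, owned upstream).
        self.consume(0, n)
        return n

if __name__ == '__main__':
    from gnuradio import blocks  # vector source/sink for a quick sanity check
    tb = gr.top_block()
    src = blocks.vector_source_c([complex(i, 0) for i in range(1000)], False)
    dst = blocks.vector_sink_c()
    tb.connect(src, passthrough(), dst)
    tb.run()
    print(len(dst.data()))  # expect 1000

Nothing in that contract, as far as I can see, gives the downstream block direct control over the upstream buffer, which is the assumption behind the question above.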
This (empirical and thought) experiment should be framed with a view to pipelining. Ideally, as the graph is not throttled, the threads should seldom block: utilisation for (1) should be close to 100%, and for (3) slightly less on a dual-core machine (because, in the best case, only the signal source and one resampler can run at any one time). This would rely on produced data living either ‘on the wire’ (the connection) between blocks or in the input stage of the connected block; of course this comes with restrictions and overheads (I am not sure what the base-class block does with regard to managing the data buffers passed to/from general_work). For (3), the data (memory block) produced by the signal source would be read-only, and could therefore be processed simultaneously by the two resampler blocks on separate cores, achieving greater throughput.
Is a major architectural change required to realise this? Or, if it has already been considered, are the overheads potentially so large that it would degrade performance?
Thanks for your thoughts,
Balint