I conducted a simple experiment (using GRC) to test the TPB scheduler’s
performance, and following a search here, I cannot find any definitive
information that would explain the observed behaviour. I kindly request
thoughts on the matter:
Three flow graphs were created in separate GRC documents. No graph uses throttling. Tests were run on a dual-core Linux machine using a 3.3git build:
(1) One graph: a high-rate signal source connected to a resampler, which is in turn connected to a null sink.

(2) Two identical disconnected sub-graphs: each contains a high-rate signal source connected to a resampler, which is in turn connected to a null sink (i.e. as above, just twice).

(3) One graph: one high-rate signal source whose output is connected to the input of two separate resamplers, each of which is connected to its own null sink.

‘High-rate’ means a few Msps, and the resamplers output data at a comparable rate (e.g. 8MHz, decim/interp=4:3).
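For reference, graph (3) looks roughly like this when written out by hand (3.3-era Python API from memory rather than the exact GRC-generated code, so block names and parameters are approximate):

#!/usr/bin/env python
# Rough equivalent of graph (3): one signal source fanned out into two
# resampler -> null sink branches. Graph (1) is the same with a single
# branch; graph (2) is two completely disconnected copies of graph (1).
from gnuradio import gr, blks2

class fanout_graph(gr.top_block):
    def __init__(self):
        gr.top_block.__init__(self)
        samp_rate = 8e6  # 'a few Msps'
        src = gr.sig_source_c(samp_rate, gr.GR_SIN_WAVE, 100e3, 1.0)
        for _ in range(2):
            resamp = blks2.rational_resampler_ccc(interpolation=3,
                                                  decimation=4)
            sink = gr.null_sink(gr.sizeof_gr_complex)
            self.connect(src, resamp, sink)

if __name__ == '__main__':
    fanout_graph().run()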
Thanks to the TPB scheduler, (2) uses 100% CPU (max load on both cores), because the sub-graphs are disconnected.
However, when running (1) and (3), only 50% utilisation is observed. I placed ‘Copy’ and ‘Kludge Copy’ blocks before the resampler inputs, but this did not increase performance (which makes sense given the flow model discussed below).
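In graph (3), for instance, the copy-block variant was along these lines (again from memory, so take the exact block names with a grain of salt):

from gnuradio import gr, blks2

class fanout_with_copies(gr.top_block):
    def __init__(self):
        gr.top_block.__init__(self)
        src = gr.sig_source_c(8e6, gr.GR_SIN_WAVE, 100e3, 1.0)
        for _ in range(2):
            # The hope was that copying into each branch first would
            # decouple the branch from the source's output buffer.
            cp = gr.kludge_copy(gr.sizeof_gr_complex)  # also tried gr.copy
            resamp = blks2.rational_resampler_ccc(interpolation=3,
                                                  decimation=4)
            sink = gr.null_sink(gr.sizeof_gr_complex)
            self.connect(src, cp, resamp, sink)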
I am not aware of the intricacies of the asynchronous flow model used by the TPB scheduler (I only skimmed the source), but I wonder why (1) and (3) do not use more than 50% CPU?
Please excuse any gaps in my understanding, but my thoughts are as follows:
Asynchronous producer/consumer and push/pull graphs are obviously quite complicated to get right in all circumstances (I pulled my hair out designing one), and there are a number of ways data can be passed between blocks - needless to say, GR generally does an excellent job of this. In the particular scenario of (1) and (3), though, is the performance limited by the manner in which that data is passed around, and by how/when the blocks’ production/consumption state, and thread state, is changed? I’m not sure whether a push or pull model is used without a clock or throttle, but does the signal source block (i.e. stall) because it must wait until its own internal output buffer is consumed by the resampler, so that the currently running thread switches back and forth between the signal source and the resampler? This (in my mind) rests on the assumption that the buffer (memory region) passed to the general_work of the resampler actually lives inside the signal source block, and that there is no direct control over how much of that buffer is consumed in one iteration of the connected block’s (in this case the resampler’s) general_work, aside from indirectly via forecast in the connected block. Or is that not the case?
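To make the forecast point concrete, my mental model of the contract is roughly the following (sketched against the Python block API of later GNU Radio releases purely for illustration - in 3.3 the equivalents are gr_block::forecast/general_work/consume on the C++ side, and the exact Python signatures vary between versions): the scheduler asks the downstream block via forecast how much input it needs for a given number of outputs, hands general_work a window into the upstream block’s output buffer, and the block then reports what it actually took with consume().

import numpy
from gnuradio import gr

class toy_resampler(gr.basic_block):
    # Illustrative only: consumes 4 inputs for every 3 outputs, like a
    # decim/interp = 4:3 resampler, but merely copies samples.
    def __init__(self):
        gr.basic_block.__init__(self, name="toy_resampler",
                                in_sig=[numpy.complex64],
                                out_sig=[numpy.complex64])

    def forecast(self, noutput_items, ninput_items_required):
        # The downstream block's only (indirect) say in how much upstream
        # data it is offered: declare the inputs needed per requested output.
        for i in range(len(ninput_items_required)):
            ninput_items_required[i] = (noutput_items * 4) // 3 + 1

    def general_work(self, input_items, output_items):
        inp, out = input_items[0], output_items[0]
        n_out = min(len(out), (len(inp) * 3) // 4)
        n_in = (n_out * 4) // 3
        out[:n_out] = inp[:n_out]  # stand-in for actual resampling
        # Whatever is not consumed here stays in the upstream block's
        # output (ring) buffer until a later call.
        self.consume(0, n_in)
        return n_out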
This (empirical and thought) experiment should be framed with a view to pipelining. Ideally, as the graph is not throttled, the threads should seldom block and utilisation for (1) should be close to 100%, with (3) only slightly less on a dual-core machine (because in the best case only the signal source and one resampler can run at any one time). This would rely on the produced data either living on-the-wire (connection) between blocks, or in the input stage of a connected block - of course this comes with restrictions and overheads (I’m not sure what the base-class block does with regard to managing the data buffers passed to/from general_work). For (3), the data (memory block) produced by the signal source would then be available to both branches, and therefore could be simultaneously processed by the two resamplers on separate cores, thus achieving greater throughput.
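One way to take the guesswork out of reading CPU meters would be to cap each graph with a head block and time a fixed number of samples through it; the measured sample rate makes the comparison between (1), (2) and (3) more direct than CPU percentages. A rough sketch (same caveats about 3.3-era block names as above):

import time
from gnuradio import gr, blks2

N = 50 * 1000 * 1000  # samples to push through before stopping

tb = gr.top_block()
src = gr.sig_source_c(8e6, gr.GR_SIN_WAVE, 100e3, 1.0)
head = gr.head(gr.sizeof_gr_complex, N)  # bounds the otherwise unthrottled run
resamp = blks2.rational_resampler_ccc(interpolation=3, decimation=4)
sink = gr.null_sink(gr.sizeof_gr_complex)
# Graph (1); for graph (3), fan two resampler branches out of 'head' instead.
tb.connect(src, head, resamp, sink)

t0 = time.time()
tb.run()
print "%.1f Msps through the source" % (N / (time.time() - t0) / 1e6)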
Is a major architectural change required to realise this? Or if it has
already been considered, are the overheads potentially so large that it
would degrade performance?
Thanks for your thoughts,