TPB scheduler fills block buffers

Hi,

I am writing an application for which I want to keep the latency to a
minimum, and this involves trying to keep the buffers between the blocks
as empty as possible. Basically, I have a source block with an element
size of 512 bytes that accepts data through a public member function. If
there is no data to transmit it will produce one empty packet to keep the
USRP busy. The scheduler keeps asking for 64 items and I give it one.
This goes on until its write buffer is full. The processing latency (from
the source block to the USRP) of the first items is a few ms, but this
grows to well over a second due to the large amounts of buffered data.

Looking at the behavior of the single threaded scheduler, it seems it is
trying to keep the latency low by only requesting data from source blocks
when the other blocks fail to produce anything. However, the thread per
block scheduler does not seem to care about whether a block is a source
block or not. Is there any way I can configure it to do this? Is there
any other approach for solving this problem?

Thankful for any help,
Anton Blad

On Mon, Nov 29, 2010 at 08:42:14AM +0100, antonb wrote:

Hi Anton,

There’s been some discussion about this over the past few months.
Probably the easiest way to fix this problem is to limit the usable
size of the buffers in the processing chain. This is a relatively
small change that would allow an application developer to control the
worst case amount of buffering that is used in the graph. By default,
we allocate buffers on the order of 32KiB, then double that size so
that we can double buffer under the TPB scheduler. (Optimizing for
throughput.)

The current implementation requires that the physical size of the
buffers be a multiple of the page size. The fix I’m proposing leaves
that constraint in place (it’s hard to remove for a variety of
reasons), but allows the specification of a limit on the total number
of samples allowed in the buffer. Thus, if you want low latency, you
could specify a limit of 512 bytes (or less for that matter), and the
total buffering would be drastically reduced from what’s currently
used.

I won’t have time to look at this until after the new year, but if
you’re interested in looking at it, the primary place to look is
gnuradio-core/src/lib/runtime/gr_buffer.{h,cc}. Basically you’d want
to limit the amount of space that it reports is available for writing
to min(the space currently available, the user-specified limit).
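
Something along these lines, as a toy illustration of where the min()
would go (this is not the real gr_buffer code; set_max_items and
d_max_items are just made-up names for the new knob):

    #include <algorithm>
    #include <cstddef>

    // Toy stand-in for the buffer, only to show where the cap is applied.
    class toy_buffer
    {
      size_t d_bufsize;          // physical size (a multiple of the page size)
      size_t d_items_in_buffer;  // items written but not yet read
      size_t d_max_items;        // 0 == no limit, i.e., today's behavior

    public:
      toy_buffer(size_t bufsize)
        : d_bufsize(bufsize), d_items_in_buffer(0), d_max_items(0) {}

      // New knob an application could set for low latency (e.g., 1 item).
      void set_max_items(size_t n) { d_max_items = n; }

      size_t space_available() const
      {
        size_t space = d_bufsize - d_items_in_buffer;  // physical free space
        if (d_max_items > 0)
          space = std::min(space, d_max_items);        // user-imposed cap
        return space;
      }
    };

The physical buffer stays page-aligned; only the space reported to the
scheduler is capped, which is what bounds the worst-case buffering.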

Eric

On Mon, 29 Nov 2010 09:30:11 -0800, Eric B. [email protected] wrote:

Hi Eric,

Thanks for your reply. There are two obvious drawbacks with the simple
fix: the latency will still be higher than necessary, and processing
large chunks will not be possible. I have the following alternative
suggestion:

  • Create a new policy governing class (gr_tpb_policy_governor?) with the
    purpose of keeping track of which blocks are making progress. The class
    contains a condition variable that source blocks wait on in case the
    scheduling policy disallows adding more samples to the flowgraph.

  • Create an instance of gr_tpb_policy_governor in the gr_scheduler_tpb
    constructor and tell it the number of blocks in the flattened flowgraph.

  • Add a reference to the gr_tpb_policy_governor instance in the
    gr_tpb_detail blocks. Update the states of the blocks in
    gr_tpb_detail::set_{input,output}_changed and in
    gr_tpb_thread_body::gr_tpb_thread_body depending on
    gr_block_executor::state.

The policies I can think of being useful are (a rough sketch of the
governor follows the list):

  • flooding: the current policy, for performance

  • active_blocks: block sources if more than a minimum number of blocks
    are processing concurrently, in order to reduce latency. Could be set
    to 1 to give the behavior of the single threaded scheduler.

  • num_samples: block sources if more than a minimum number of samples
    are currently being processed, in order to reduce latency while still
    ensuring acceptable performance.
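
As a rough sketch of what I have in mind for the governor (Boost
threading, only the active_blocks policy shown; the class and member
names are just suggestions):

    #include <boost/thread/mutex.hpp>
    #include <boost/thread/condition_variable.hpp>

    // Rough sketch only: sources wait while more than d_min_active other
    // blocks are making progress.
    class gr_tpb_policy_governor
    {
      boost::mutex              d_mutex;
      boost::condition_variable d_cond;
      int                       d_min_active;  // active_blocks threshold
      int                       d_active;      // blocks currently doing work

    public:
      gr_tpb_policy_governor(int min_active)
        : d_min_active(min_active), d_active(0) {}

      // Called from gr_tpb_detail / gr_tpb_thread_body when a block
      // starts or stops making progress.
      void block_started_work()
      {
        boost::mutex::scoped_lock lock(d_mutex);
        ++d_active;
      }

      void block_finished_work()
      {
        boost::mutex::scoped_lock lock(d_mutex);
        --d_active;
        d_cond.notify_all();   // sources may now be allowed to run
      }

      // Called by source blocks before producing more samples.
      void wait_until_sources_allowed()
      {
        boost::mutex::scoped_lock lock(d_mutex);
        while (d_active > d_min_active)
          d_cond.wait(lock);
      }
    };

With d_min_active set to 1, sources would only run when at most one
other block is making progress, which should approximate the single
threaded scheduler's behavior.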

I guess that the main drawback of this proposal is that the
gr_tpb_policy_governor will contain a very heavily used mutex.

Comments? If I make these changes, will the GNU Radio team be interested
in a patch?

Anton

On Tue, Dec 07, 2010 at 11:36:19AM +0100, Anton Blad wrote:

If it can be done cleanly and simply in a way that doesn’t reduce the
performance too much (say 3%) for those who are using “flooding”, then
I think we’d consider it. The measure of performance should be total
CPU time (== user + sys).

I do have some questions about how you’d tune the “active_blocks” and
“num_samples” cases…

Eric