OMP Data. GNURadio overhead?

Rafa_F · August 14, 2015, 8:04pm

Sorry for the HTML…

I have done some work applying OpenMP to GNURadio and collected
some data. This data was collected WITHOUT GNURadio overhead.
Specifically, I interfaced directly with my detector, passing 30 seconds
(30 seconds * 10 Msps) of data in a buffer (i.e., I allocated and filled
gr_vector_const_void_star, etc.), and calculated the performance at
different sensitivities. Herein lies one problem.

When applying OpenMP against a buffer there has to be enough data to
make it worthwhile, but GNURadio buffers are fairly small. I don’t see
a reasonable way to increase buffer sizes for a single source->block
without modifying the constant in flat_flowgraph.cc, which has the side
effect of changing the default size for all buffers. Yes?

I am looking for a way to measure GNURadio overhead. There is a certain
amount of overhead depending on the number of blocks, set() functions,
GUI Sinks, etc. and I’d like to know what that overhead is. Ideas?

One thought is to set a hardware pin low in the source block, set it
high in the detector block, and then measure with a scope. The problem is
these pins often incur kernel overhead by opening something in /dev,
writing a string, then closing the device and waiting for the kernel to
get around to actually toggling the pin. Measurements showed this is
wildly unpredictable. Another option is to toggle a pin on an SDR but
the same problem exists with additional USB transaction delays.

Anyway, in the data below a “signal” buffer is defined as ~1200 samples
(i.e., MAXimum message size), or 2x1200=2400 complex number “chunks”. I
found 2xMAX a reasonable value because it is within a reasonable buffer
amount from GNURadio with my alteration to flat_flowgraph.cc. The
OpenMP code really looks like this:

#pragma omp parallel for if( work_list.size() > 1 ) num_threads(ncpu) schedule(dynamic,1)
for( size_t i = 0; i < work_list.size(); ++i ) {
    do work…
}

Pretty simple.
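
As a compilable illustration of that loop shape, here is a minimal sketch. The names process_chunk and process_all are hypothetical stand-ins (the detector's real work is not shown in this thread), and the "work" is just a sum of squares. The if() clause keeps a single-item list on the calling thread, and schedule(dynamic,1) hands out one chunk at a time so uneven chunks do not stall the other threads.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one unit of detector work: energy of one chunk.
static float process_chunk(const std::vector<float>& chunk)
{
    float energy = 0.0f;
    for (float x : chunk)
        energy += x * x;
    return energy;
}

// Processes every chunk in work_list, in parallel when there is more than one.
// If the compiler is run without OpenMP support, the pragma is ignored and the
// loop simply runs serially with the same result.
std::vector<float> process_all(const std::vector<std::vector<float>>& work_list,
                               int ncpu)
{
    const std::size_t n = work_list.size();
    std::vector<float> results(n);

    #pragma omp parallel for if(n > 1) num_threads(ncpu) schedule(dynamic, 1)
    for (std::size_t i = 0; i < n; ++i) {
        // Each iteration writes only its own slot, so no locking is needed.
        results[i] = process_chunk(work_list[i]);
    }
    return results;
}
```

With GCC or Clang this would typically be built with -fopenmp; calling process_all(work_list, 5) mirrors the five-thread setup described in this post.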

That said, unless GNURadio can provide a selective and reasonably large
amount of samples to process then the value of applying OpenMP is
probably moot.

Below, the term “sensitivity” is a bit of a misnomer because as
sensitivity increases signal rejection increases; but the text is what
it is. More specifically, there are roughly twelve sets of criteria
that need to be met before signal presence is declared. Some of those
criteria involve std::log10() and std::pow(10.0,x) operations but
interestingly those math operations are a very small amount of the
detection effort (1.02% worst case).

The numbers in the first block below are the rates, in samples/second,
at which I can process samples. For example, “Baseline” “45.0” is
“1,662,769” samples per second.

From an OpenMP perspective, I have eight cores but limited the effort
to five with the idea GNURadio overhead and other blocks have three
cores to do their thing, worst case. OpenMP gave me a >300% performance
gain in “OMP 5core, 2xMAX” but the theoretical gain is 400%. Not perfect
but I’ll take it. What these numbers tell me is OpenMP can have
significant value in the context of GNURadio.

My code was run on an AMD 9590 at 4.7GHz, 5GHz boost – my development
platform. For reasons, I also ran it on a CubieBoard4 (ARM
architecture). I should also mention I have seen NO side effects of
running OpenMP within GNURadio other than considerable amusement.

Sensitivity:                       45.0           50.0           55.0           60.0           65.0
Baseline (no OMP)          1,662,769.54   3,478,927.50  12,272,150.99  19,503,025.53  20,680,179.97
Load divided by 5cores     6,689,934.90  13,861,829.90  52,729,762.32  84,351,801.42  89,637,372.72
OMP 1core                  1,445,098.78   2,146,371.22  12,476,405.09  19,413,352.39  20,517,069.07
OMP 5core, 2xMAX           6,966,915.10  14,678,631.59  55,022,397.02  87,443,925.07  92,651,283.37
OMP 5core, 4xMAX           6,992,352.24  14,907,933.76  55,214,642.02  87,609,463.65  92,778,048.16
OMP 5core, 8xMAX           6,956,344.76  14,660,481.24  55,202,186.93  87,681,575.73  92,898,798.06
CubieBoard 5core, 2xMAX    1,805,831.74   3,794,176.16  14,182,461.42  24,444,325.85  26,173,792.25

Performance difference from Baseline

Sensitivity:                       45.0           50.0           55.0           60.0           65.0
Baseline (no OMP)                 0.00%          0.00%          0.00%          0.00%          0.00%
Load divided by 5cores          302.34%        298.45%        329.67%        332.51%        333.45%
OMP 1core                       -13.09%        -38.30%          1.66%         -0.46%         -0.79%
OMP 5core, 2xMAX                318.99%        321.93%        348.35%        348.36%        348.02%
OMP 5core, 4xMAX                320.52%        328.52%        349.92%        349.21%        348.63%
OMP 5core, 8xMAX                318.36%        321.41%        349.82%        349.58%        349.22%
CubieBoard 5core, 2xMAX           8.60%          9.06%         15.57%         25.34%         26.56%

Dennis_G · August 14, 2015, 9:18pm

On Fri, Aug 14, 2015 at 2:03 PM, Dennis G. wrote:

> When applying OpenMP against a buffer there has to be enough data to make
> it worth while but GNURadio buffers are fairly small. I don’t see a
> reasonable way to increase buffer sizes for a single source->block without
> modifying the constant in flat_flowgraph.cc which has the side effect of
> the default size for all buffers. Yes?

Is this a sink block (i.e. no outputs)? In general, IIRC, you have more
control on output buffer size (since you own them, and input buffers are
owned by upstream blocks). You can call
set_output_multiple()/set_min_output_buffer(…)/set_min_noutput_items(…)
to influence output buffer size (and for a sync_block, and therefore a
sync_decimator/sync_interpolator, that has a corresponding influence on
the input buffer size). Others may correct me on how much influence sink
blocks have in current releases…

> I am looking for a way to measure GNURadio overhead. There is a certain
> amount of overhead depending on the number of blocks, set() functions, GUI
> Sinks, etc. and I’d like to know what that overhead is. Ideas?

What exactly are you interested in measuring when you say ‘overhead’?
Are you talking about memory usage? CPU usage? Latency (and if you’re
interested in latency, do you mean one-way, two-way)?

> One thought is to set a hardware pin low in the source block and set it
> high in the detector block then measuring with a scope. The problem is
> these pins often incur kernel overhead by opening something in /dev,
> writing a string, then closing the device and waiting for the kernel to get
> around to actually toggling the pin. Measurements showed this is wildly
> unpredictable. Another option is to toggle a pin on an SDR but the same
> problem exists with additional USB transaction delays.

This sounds like you are interested in one-way latency… maybe?

So using OpenMP inside a work function is a perfectly reasonable way to
try to accelerate (via parallelization) that particular work function -
obviously at some point you are fighting against the thread-per-block
parallelization of GNURadio, so telling OMP to use fewer cores than your
machine has is a reasonable way to deal with this. My experience with
OMP has indicated that the thread-spawning that happens each time you
enter the work() function has a cost, and therefore instantiating a
thread-pool in the block constructor may give better results, but in the
end the real question you have to ask is: what amount of work is
required to achieve the task at hand.
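
To make the thread-pool idea concrete, here is a minimal sketch of a pool that is created once (for example as a member constructed in the block's constructor) and reused on every work() call. WorkerPool and parallel_for are hypothetical names, not GNURadio or OpenMP APIs, and whether a given OpenMP runtime really respawns threads on each parallel region depends on the implementation, so this is worth measuring rather than assuming.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical persistent worker pool: threads start once and sleep between jobs.
// parallel_for() must only be called from one thread at a time.
class WorkerPool {
public:
    explicit WorkerPool(unsigned nthreads)
    {
        if (nthreads == 0) nthreads = 1;  // always keep at least one worker
        for (unsigned t = 0; t < nthreads; ++t)
            workers_.emplace_back([this] { worker_loop(); });
    }

    ~WorkerPool()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_start_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // Runs fn(i) for every i in [0, n) on the workers; returns when all are done.
    void parallel_for(std::size_t n, const std::function<void(std::size_t)>& fn)
    {
        std::unique_lock<std::mutex> lock(m_);
        fn_ = &fn;
        n_ = n;
        next_ = 0;
        active_ = workers_.size();
        ++generation_;                    // signals a new job to every worker
        cv_start_.notify_all();
        cv_done_.wait(lock, [this] { return active_ == 0; });
    }

private:
    void worker_loop()
    {
        std::size_t seen = 0;             // last job generation this worker handled
        for (;;) {
            std::unique_lock<std::mutex> lock(m_);
            cv_start_.wait(lock, [&] { return stop_ || generation_ != seen; });
            if (stop_) return;
            seen = generation_;
            const std::function<void(std::size_t)>* fn = fn_;
            const std::size_t n = n_;
            lock.unlock();
            // Dynamic scheduling: each worker grabs the next unclaimed index.
            for (std::size_t i = next_.fetch_add(1); i < n; i = next_.fetch_add(1))
                (*fn)(i);
            lock.lock();
            if (--active_ == 0) cv_done_.notify_one();
        }
    }

    std::mutex m_;
    std::condition_variable cv_start_, cv_done_;
    std::vector<std::thread> workers_;
    const std::function<void(std::size_t)>* fn_ = nullptr;
    std::size_t n_ = 0;
    std::atomic<std::size_t> next_{0};
    std::size_t active_ = 0;
    std::size_t generation_ = 0;
    bool stop_ = false;
};
```

A work() function would then call something like pool.parallel_for(work_list.size(), [&](std::size_t i) { /* process item i */ }); with no thread creation on the hot path.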

For example, collapsing the functions of multiple blocks into a single,
larger (super)block can increase performance because you aren’t
shuffling data in-between blocks. Implementing custom thread pools is
another strategy. Writing lots of hand-optimized SIMD code (preferably
inside VOLK!) can help as well. Ultimately the question is: what is the
least amount of work required to make the thing do what you need
‘good-enough’, where good-enough is some measure(s) of performance on
the target platform. Basically what I’m saying is, there isn’t a single
answer to the question of ‘what is the overhead of GNURadio’, because
not only is that a moving target, but it depends on what platform you’re
targeting, and it depends on what measure of ‘overhead’ you really care
about. Not to mention the various knobs (e.g. the different ways blocks
can influence buffer-size etc.) you have to control, e.g. one-way
latency, or computational load.

Doug

Dennis_G · August 15, 2015, 6:22am

On Fri, 2015-08-14 at 15:17 -0400, Douglas G. wrote:

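> Is this a sink block (i.e. no outputs)? In general, IIRC, you have more
> control on output buffer size (since you own them, and input buffers are
> owned by upstream blocks). You can call
> set_output_multiple()/set_min_output_buffer(…)/set_min_noutput_items(…)
> to influence output buffer size (and for a sync_block, and
> therefore a sync_decimator/sync_interpolator, that has a
> corresponding influence on the input buffer size). Others may correct
> me on how much influence sink blocks have in current releases…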

hackRF/BladeRF Source -> Preamble Detector (the block) -> multiple
blocks.
If the preamble detector detects a signal it forwards that set of
samples to a “framer” and a GUI Sink.

> > I am looking for a way to measure GNURadio overhead. There is a certain amount
> > of overhead depending on the number of blocks, set() functions, GUI Sinks, etc.
> > and I’d like to know what that overhead is. Ideas?
>
> What exactly are you interested in measuring when you say ‘overhead’? Are you
> talking about memory usage? CPU usage? Latency (and if you’re interested in
> latency, do you mean one-way, two-way)?

CPU. Latency. One way.
I have samples coming in at a certain rate into a limited-size buffer.
I need to know the servicing interval (latency) in an attempt to
architect a solution to prevent or reduce SDR overruns.
There is a scheduler decision process to release the block for
execution and the overhead before calling general_work() (e.g., in
block_executor.cc), such as: update read/write pointers, are tags
present, maybe service performance counters, etc. At high sample rates
that can reduce the processing rate of samples.

Something to chew on. Thanks.
Not being a DSP person, the math is interesting, and not believable, so
I am assuming I have something totally screwed up.
10 Msps = 1e-7 seconds per sample.
5 GHz = 2e-10 seconds per clock.
That is an average of 500 CPU clocks between samples. Assuming 8 clocks
per instruction (an arbitrary and unsupported number) with zero overhead
(e.g., memory access), that is 62 instructions between samples on a
single core. Assuming that math is somewhere near correct, I can’t
really be ashamed of my low-end value of 7 Msps processing rate across
five cores, but a best-case rejection rate of ~1e-8 seconds per sample?
That would be one heck of a fast if() and virtual function call; and I
don’t believe it, but I haven’t (yet) found anything to debunk my numbers.
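
The budget arithmetic above can be written out as a short sketch. The helper names are made up for illustration, and the 8 clocks per instruction is the same arbitrary assumption as in the text. The third helper spreads an aggregate rate over several cores, which is one thing to check before comparing a five-core rate against a single-core clock budget.

```cpp
// Clock cycles available per sample on one core.
constexpr double clocks_per_sample(double clock_hz, double sample_rate_hz)
{
    return clock_hz / sample_rate_hz;
}

// Instruction budget per sample, given an assumed clocks-per-instruction figure.
constexpr double instructions_per_sample(double clock_hz, double sample_rate_hz,
                                         double clocks_per_instruction)
{
    return clocks_per_sample(clock_hz, sample_rate_hz) / clocks_per_instruction;
}

// Clocks each core spends per sample when `cores` cores together
// sustain `aggregate_rate_hz` samples/second.
constexpr double clocks_per_sample_per_core(double clock_hz,
                                            double aggregate_rate_hz, int cores)
{
    return clock_hz * cores / aggregate_rate_hz;
}
```

clocks_per_sample(5e9, 10e6) gives the 500 clocks quoted above, and instructions_per_sample(5e9, 10e6, 8.0) gives 62.5.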
The test code is pretty simple and matches the observed wall-clock time:
int noutput_items = (int)s.get()->size();
gr_vector_int ninput_items { (int)s.get()->size() };
// One input and one output buffer, each sized for the whole sample set
gr_vector_const_void_star input_items = {
    malloc( sizeof(complex) * noutput_items )
};
gr_vector_void_star output_items = {
    malloc( sizeof(complex) * noutput_items )
};
…
for( float level : full_range ) {
    preamble->set_gain( level );
    t_start = std::chrono::high_resolution_clock::now();
    // Call general_work() directly, bypassing the GNURadio scheduler
    for( int loop=0; loop < LOOPS; ++loop ) {
        (void)preamble->general_work( noutput_items, ninput_items,
                                      input_items, output_items );
    }
    t_stop = std::chrono::high_resolution_clock::now();
    // duration<double> yields elapsed time in fractional seconds
    t_span = std::chrono::duration_cast<std::chrono::duration<double>>
                 ( t_stop - t_start );
    std::cout << "Rx Detect Elapsed: "
              << t_span.count() << " sec"
              << ", samp=" << s.get()->size()
              << ", samp/sec=" << (s.get()
                     ->size()*LOOPS/t_span.count())
              << ", gain=" << preamble->gain()
              << std::endl;
}

Doug
