Sustainable gnuradio MFLOPS for streaming processing

addis_a · July 11, 2013, 9:58pm

Hi Folks,

The sustainable processing rate of GNU Radio, I guess in terms of
MFLOPS, is important when designing an application. Suppose I have a FIR
filter with a given number of taps operating on a sample stream of x
Msps; this translates to a certain number of required MFLOPS, and that
number needs to be below the sustainable GNU Radio MFLOPS limit. Hence
my question: on, say, a 1 GHz processor, what is the GNU Radio MFLOPS
limit?

Thanks,

LD

LD_Zhang · July 11, 2013, 10:40pm

On 07/11/2013 03:57 PM, LD Zhang wrote:

Thanks,

LD

Nobody is going to be able to give you a definitive answer. Much
depends on the processor in question – internal details including
pipelining
efficiency, cache sizes, etc, etc.

But having said that, a FIR filter is, from a high level, simply a series
of multiply-accumulate operations, one per tap. Conventionally, one
“FLOP” is one multiply-accumulate so, to a very sloppy first order, a
100-tap FIR filter requires about 100 FLOPs per output sample. Now,
unfortunately, a FIR filter doesn’t live in “isolation”; it has much
surrounding scaffolding and infrastructure.

The only way to really get a handle on this stuff is to build test
cases, and measure on the target platform.
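
As a rough illustration of the "build a test case and measure" advice, here is a minimal sketch in plain Python. It is not GNU Radio code: `naive_fir`, `required_mmacs` and `measured_mmacs` are hypothetical helper names, and the pure-Python filter is only a stand-in for whatever implementation you actually run on the target. The idea is to compare the multiply-accumulate rate the design needs against the rate the platform actually sustains.

```python
import time


def naive_fir(samples, taps):
    """Direct-form FIR: one multiply-accumulate per tap per output sample."""
    n = len(taps)
    return [
        sum(taps[k] * samples[i + n - 1 - k] for k in range(n))
        for i in range(len(samples) - n + 1)
    ]


def required_mmacs(num_taps, sample_rate_sps):
    """Millions of multiply-accumulates per second a non-decimating FIR needs."""
    return num_taps * sample_rate_sps / 1e6


def measured_mmacs(fir_fn, num_taps=10, num_samples=20000):
    """Time fir_fn on synthetic data; return the achieved rate in MMAC/s."""
    taps = [1.0 / num_taps] * num_taps
    samples = [float(i % 7) for i in range(num_samples)]
    start = time.perf_counter()
    out = fir_fn(samples, taps)
    elapsed = time.perf_counter() - start
    return num_taps * len(out) / elapsed / 1e6


if __name__ == "__main__":
    need = required_mmacs(num_taps=100, sample_rate_sps=1e6)
    have = measured_mmacs(naive_fir)
    print(f"required: {need:.1f} MMAC/s, measured here: {have:.2f} MMAC/s")
```

The measured figure only describes the function that was timed on the machine that ran it. For a real answer, swap in the actual filter (or the whole flowgraph) and run it on the target hardware.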

LD_Zhang · July 11, 2013, 11:08pm

On Thu, Jul 11, 2013 at 3:57 PM, LD Zhang wrote:

Thanks,

LD

The situation is a little more complicated than this. A few
considerations:

- What’s the order of your FIR filter? Mo’ taps mo’ flops (mo’ money
  mo’ problems?)
- As you mentioned, sampling rate is also a factor.
- Although clock speed is one very easy to use metric (and an
  important one!) there’s more to processors than that.
- Related to the last one, is the filter all complex? all real? real
  taps, complex input? All will likely have different performance.
- There’s probably more than just an FIR filter in your flowgraph;
  everything eats up a chunk of total processing capacity.

You can probably get a rough estimate of the lower limit of your
processor’s ability to do something like an FIR filter with some simple
calculations:
A 10-point FIR filter needs to do 10 multiplies and 9 additions.
Blissfully ignoring branching, that’s 19 instructions for each output.
So let’s say we’ve got a simple FIR filter that outputs the same
sample rate as it inputs.

Using the table on the Wikipedia page "Instructions per second", it
looks like modern CPUs clocked around 1GHz should expect between 2-5
instructions per clock cycle. Pick your favorite (I’m guessing you’re
on older hardware so let’s go with 2). 2 * 1GHz = 2000 MIPS. So you can
process 1M samples through a very poorly implemented 10-tap FIR filter.
That in itself is also a pretty poor estimate. I see Marcus just replied
as well and as he said, the best way to know is just to try it out on
your hardware; there’s no substitute for that.

LD_Zhang · July 12, 2013, 12:14am

Hi, please see my comment below:

You can probably get a rough estimate of the lower limit of your
processor’s ability to do something like an FIR filter with some simple
calculations:
A 10-point FIR filter needs to do 10 multiplies and 9 additions.
Blissfully ignoring branching, that’s 19 instructions for each output.
So let’s say we’ve got a simple FIR filter that outputs the same
sample rate as it inputs.

Using the table on the Wikipedia page "Instructions per second", it
looks like modern CPUs clocked around 1GHz should expect between 2-5
instructions per clock cycle. Pick your favorite (I’m guessing you’re
on older hardware so let’s go with 2). 2 * 1GHz = 2000 MIPS. So you can
process 1M samples through a very poorly implemented 10-tap FIR filter.
That in itself is also a pretty poor estimate. I see Marcus just replied
as well and as he said, the best way to know is just to try it out on
your hardware; there’s no substitute for that.

I am confused: a 10-tap FIR according to the above needs 19 instructions
per output, so 1M samples per second corresponds to 19 MIPS, far below
the 2000 MIPS limit?
Am I missing something?

LD

LD_Zhang · July 12, 2013, 1:18am

Great, these discussions actually help a lot. I am going to initially
design it to be a factor of 10 below the theoretical limit.

There is another question: in the case of no floating point operations
at all, there must be a limit on how fast data can stream through the
GNU Radio environment. So is the limit like 10 Msps, or like 50 Msps? A
1 Msps data stream fed through 10 parallel ports is like a 10 Msps data
stream, correct?

Thanks,

LD

LD_Zhang · July 12, 2013, 12:35am

On Thu, Jul 11, 2013 at 6:13 PM, LD Zhang wrote:

Using the table in here:

I am confused: a 10-tap FIR according to the above needs 19 instructions
per output, so 1M samples per second corresponds to 19 MIPS, far below
the 2000 MIPS limit?
Am I missing something?

LD

Hmm, I guess you’re right. It’s not too important because the
estimate wouldn’t be anywhere close to what you would actually see.
The point is there is no easy answer (other than just running
something to see if it works), but you might be able to come up with a
rough estimate if you really need to and your application is really
simple. You should probably ignore my lousy attempt :-P. I came up
with it on the fly… There’s also the issue of how long it takes
those instructions to execute.

-Nathan
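
For reference, the back-of-envelope numbers in this exchange can be worked through in a few lines. This is only a sketch of the arithmetic, using the thread's own assumptions (N multiplies plus N-1 additions per output, 2 instructions per clock cycle at 1 GHz). The function names are made up for illustration, and the result is an optimistic ceiling that ignores branching, memory traffic and scheduler overhead.

```python
def instructions_per_output(num_taps):
    """N multiplies plus N-1 additions, ignoring branching and loads/stores."""
    return num_taps + (num_taps - 1)


def required_mips(num_taps, sample_rate_sps):
    """Millions of instructions per second needed at a given sample rate."""
    return instructions_per_output(num_taps) * sample_rate_sps / 1e6


def max_sample_rate_sps(num_taps, clock_hz, instr_per_cycle):
    """Highest sample rate the instruction budget allows, as a ceiling."""
    return clock_hz * instr_per_cycle / instructions_per_output(num_taps)


if __name__ == "__main__":
    # 10 taps at 1 Msps: 19 instructions per output, so 19 MIPS.
    print(required_mips(10, 1e6))
    # 2000 MIPS budget divided by 19 instructions: roughly 105 Msps.
    print(max_sample_rate_sps(10, 1e9, 2) / 1e6)
```

So under these assumptions the 2000 MIPS budget corresponds to roughly 105 Msps through a 10-tap filter, not 1 Msps, which is the discrepancy LD pointed out. Real throughput will be well below that ceiling.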

LD_Zhang · July 16, 2013, 11:43pm

Fantastic! Let us know if there are docs/links and code examples now. Or
maybe we will wait till the August presentation?

LD_Zhang · July 17, 2013, 4:14pm

On Tue, Jul 16, 2013 at 5:42 PM, LD Zhang wrote:

Fantastic! Let us know if there are docs/links and code examples now. Or
maybe we will wait till the August presentation?

Pushed the documentation yesterday, actually. If you update from git
and rebuild the Doxygen manual locally, it now contains pages on
ControlPort and the Performance Counters.

Tom

LD_Zhang · July 16, 2013, 11:31pm

On Thu, Jul 11, 2013 at 7:17 PM, LD Zhang wrote:

LD

Being able to calculate the cost of an SDR application running on a
GPP would be fantastic, but it would only ever be an approximation.

Instead, we’ve introduced the Performance Counters and a Performance
Monitor application with GNU Radio 3.7.0 that simply measures the
amount of time a block takes during a call to work. The paper that
I’ll be presenting at the SRIF workshop in August introduces this
concept. I’ve also just written some documentation that will go into
the manual soon that describes them better.

The point of this tool is to see how well your application is running
and to identify which blocks might be using too much CPU time to be
singled out for optimization. It could also potentially be used to
develop an understanding of how each block might work on your machine,
which you could then use to extrapolate how much your system could
process.

Tom
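
To make the idea concrete, here is a minimal sketch of the same measurement approach: wrap each block's work function, accumulate the time spent per call, and rank blocks by cost. This is not the GNU Radio Performance Counters API. `WorkTimer` and everything in it are hypothetical names used only to illustrate the concept described above.

```python
import time
from collections import defaultdict


class WorkTimer:
    """Hypothetical helper: accumulates wall-clock time per named block."""

    def __init__(self):
        self.total_s = defaultdict(float)
        self.calls = defaultdict(int)

    def timed(self, name, work_fn):
        """Return a wrapper that times every call to work_fn under `name`."""
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return work_fn(*args, **kwargs)
            finally:
                self.total_s[name] += time.perf_counter() - start
                self.calls[name] += 1
        return wrapper

    def average_s(self, name):
        """Mean time per call for one block, in seconds."""
        return self.total_s[name] / self.calls[name]

    def report(self):
        """Block names sorted by total time, most expensive first."""
        return sorted(self.total_s, key=self.total_s.get, reverse=True)


if __name__ == "__main__":
    timer = WorkTimer()
    scale = timer.timed("scale", lambda xs: [2.0 * x for x in xs])
    square = timer.timed("square", lambda xs: [x * x for x in xs])
    data = [float(i) for i in range(10000)]
    for _ in range(20):
        square(scale(data))
    for name in timer.report():
        print(name, f"{timer.average_s(name) * 1e6:.1f} us per call")
```

Dividing the number of samples handled per call by the average time per call gives a per-block throughput figure for that machine, which is the kind of number you could extrapolate from.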