Performance on ARM Cortex-A8

luislavena · July 13, 2011, 10:41am

Hi all,

I compiled DAB demodulation for ARM Cortex-A8 (TI OMAP 3). It
successfully demodulates DAB+ but spends 13 seconds decoding 1 second of
radio baseband (USRP file).

I used all the optimized code for Cortex-A8 like dotprod_ccf_armv7_a.c.
My compilation flags are: -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon
-O2. I used fftw-3.2.2.

Why is GNU Radio so slow at demodulating DAB+? Do you have any figures for
CPU consumption on ARM Cortex cores? Is there some optimization I missed
for the platform?

Here is the output of demodulation program:

--> receiving from file: DAB+_202-928Mhz_d32.cfile.2048
--> creating DAB parameter object
--> DAB parameters self check ok
--> updating DAB parameters
--> creating RX parameter object
autocorrect_sample_rate = false
--> using soft bits
ofdm_ffe_all_in_one: d_estimated_error: -2.092835 (-333.09 Hz)
ofdm_ffe_all_in_one: d_estimated_error: -2.092918 (-333.10 Hz)
ofdm_ffe_all_in_one: d_estimated_error: -2.093476 (-333.19 Hz)
ofdm_ffe_all_in_one: d_estimated_error: -2.095631 (-333.53 Hz)
ofdm_ffe_all_in_one: d_estimated_error: -2.094431 (-333.34 Hz)
ofdm_ffe_all_in_one: d_estimated_error: -2.093980 (-333.27 Hz)
ofdm_ffe_all_in_one: d_estimated_error: -2.094858 (-333.41 Hz)
ofdm_ffe_all_in_one: d_estimated_error: -2.094019 (-333.27 Hz)
ofdm_ffe_all_in_one: d_estimated_error: -2.094400 (-333.33 Hz)
ofdm_ffe_all_in_one: d_estimated_error: -2.094204 (-333.30 Hz)

Best regards,

Riadh.

Riadh_Elloumi · July 15, 2011, 9:57pm

On Wed, Jul 13, 2011 at 1:40 AM, Riadh Elloumi
[email protected] wrote:

Hi all,

I compiled DAB demodulation for ARM Cortex-A8 (TI OMAP 3). It
successfully demodulates DAB+ but spends 13 seconds decoding 1 second of
radio baseband (USRP file).

I tested the demodulator with similar results. Decoding a file that
takes a few seconds on my desktop takes over a minute on the ARM.

Why is GNU Radio so slow at demodulating DAB+? Do you have any figures for
CPU consumption on ARM Cortex cores? Is there some optimization I missed
for the platform?

The performance limitation in this case largely comes from floating
point math performance, though that’s a vague and not particularly
useful conclusion.

OProfile report:

http://ttsou.github.com/gnuradio/dab_demod_oprof.txt

OProfile report with symbols:

http://ttsou.github.com/gnuradio/dab_demod_symbols.txt

Thomas

Profiling through timer interrupt
          TIMER:0|
  samples|      %|
------------------
     3542 28.4727 libm-2.12.2.so
     2976 23.9228 libgnuradio-core-3.4.1git.so.0.0.0
     1637 13.1592 no-vmlinux
     1151  9.2524 libgcc_s.so.1
     1130  9.0836 libgnuradio-dab-3.3.0.so.0.0.0
      758  6.0932 libfftw3f.so.3.2.4
      549  4.4132 libc-2.12.2.so
      407  3.2717 libpthread-2.12.2.so
      120  0.9646 libpython2.6.so.1.0
       82  0.6592 libboost_thread.so.1.45.0
       72  0.5788 python
                  TIMER:0|
          samples|      %|
        ------------------
               72 100.000 [vectors] (tgid:8032 range:0xffff0000-0xffff1000)
        6  0.0482 ld-2.12.2.so
        5  0.0402 busybox
        3  0.0241 time.so
        2  0.0161 oprofiled

Thomas

Riadh_Elloumi · July 15, 2011, 10:28pm

On 07/13/2011 04:40 AM, Riadh Elloumi wrote:

Hi all,

I compiled DAB demodulation for ARM Cortex-A8 (TI OMAP 3). It
successfully demodulates DAB+ but spends 13 seconds decoding 1 second of
radio baseband (USRP file).

I used all the optimized code for Cortex-A8 like dotprod_ccf_armv7_a.c.
My compilation flags are: -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon
-O2. I used fftw-3.2.2.

What does -mfloat-abi=softfp do? Does that cause software
floating-point to be used?
If it does, then your floating-point performance is going to be
completely awful.

A good test for comparing oranges/oranges would be to construct a simple
C program that does, let’s say, 10e6 single-precision floating-point
multiply/accumulate operations, and compare among platforms with similar
clock speeds, etc.

Why is GNU Radio so slow at demodulating DAB+? Do you have any figures for
CPU consumption on ARM Cortex cores? Is there some optimization I missed
for the platform?

–
Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

Riadh_Elloumi · July 15, 2011, 10:42pm

On Fri, Jul 15, 2011 at 1:24 PM, Marcus D. Leech [email protected]
wrote:

On 07/13/2011 04:40 AM, Riadh Elloumi wrote:

I used all the optimized code for Cortex-A8 like dotprod_ccf_armv7_a.c.
My compilation flags are: -mcpu=cortex-a8 -mfloat-abi=softfp -mfpu=neon
-O2. I used fftw-3.2.2.

What does -mfloat-abi=softfp do? Does that cause software floating-point to
be used?
If it does, then your floating-point performance is going to be completely
awful.

No, it’s one way of specifying hardware instructions.

“`softfp’ allows the generation of code using hardware floating-point
instructions, but still uses the soft-float calling conventions.”

“Use -mfloat-abi=softfp with the appropriate -mfpu option to allow the
compiler to generate code that makes use of the hardware
floating-point capabilities for these CPUs.”

Thomas

Riadh_Elloumi · July 15, 2011, 10:58pm

On 07/15/2011 04:24 PM, Marcus D. Leech wrote:

What does -mfloat-abi=softfp do? Does that cause software floating-point
to be used?
If it does, then your floating-point performance is going to be
completely awful.

No, that chooses the soft float ABI only. Basically, return values
cannot be in NEON registers. This is not too bad, since we normally are
passing pointers to arrays.

We can compile the entire system with the hard float ABI, but it is not
a big win and adds some complexity for people using certain binary only
libraries (which are usually built with soft float).

A good test for comparing oranges/oranges would be to construct a simple
C program that does, let’s say, 10e6 single-precision floating-point
multiply/accumulate operations, and compare among platforms with similar
clock speeds, etc.

From a quick look at Tom’s oprofile results, first find out who is
calling into libm and see if you can change the block to stop calling
libm. For example, calculate sin/cos via a table approximation (I think
GNU Radio already does that).

Then look at the signal processing blocks that are next in usage and do
some NEON optimizations using ORC.

Philip

Riadh_Elloumi · July 15, 2011, 10:50pm

On Fri, 2011-07-15 at 16:24 -0400, Marcus D. Leech wrote:

What does -mfloat-abi=softfp do? Does that cause software
floating-point to be used?
If it does, then your floating-point performance is going to be
completely awful.

Counterintuitively, that flag doesn’t mean “use emulated fp”. However,
gcc is notoriously bad at vectorizing code for the NEON vfpu. The
upcoming Volkification of gnuradio-core will hopefully do a lot to
improve E100 performance on CPU-intensive flowgraphs.

–n

Riadh_Elloumi · July 15, 2011, 11:06pm

On 07/15/2011 04:42 PM, Philip B. wrote:

From a quick look at Tom’s oprofile results, first find out who is
calling into libm and see if you can change the block to stop calling
libm. For example, calculate sin/cos via a table approximation (I
think GNU Radio already does that).

I think libm already does that, too (table-based SIN/COS), not sure.

–
Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

Riadh_Elloumi · July 21, 2011, 6:59am

On Fri, Jul 15, 2011 at 1:47 PM, Marcus D. Leech [email protected]
wrote:

On 07/15/2011 04:42 PM, Philip B. wrote:

From a quick look at Tom’s oprofile results, first find out who is calling
into libm and see if you can change the block to stop calling libm. For
example, calculate sin/cos via a table approximation (I think GNU Radio
already does that).

I think libm already does that, too (table-based SIN/COS), not sure.

More symbols.

http://ttsou.github.com/gnuradio/dab_demod_more_symbols.txt

CPU: CPU with timer interrupt, speed 0 MHz (estimated)
Profiling through timer interrupt
samples  %        image name      symbol name
2013     16.0642  no-vmlinux      /no-vmlinux
1329     10.6057  libm-2.12.2.so  __kernel_cosf
1250      9.9753  libm-2.12.2.so  __kernel_sinf
1108      8.8421  libgcc_s.so.1   /lib/libgcc_s.so.1
520       4.1497  libm-2.12.2.so  __ieee754_rem_pio2f

Thomas

Riadh_Elloumi · July 18, 2011, 5:43pm

Hi all,

Thank you for your help.

After deeper investigation, software-emulated FP is performed from within
GNU libstdc++, because it is compiled with software FP and does some
arithmetic on complex values (add, multiply), which are structs of floats.

I will recompile libstdc++ with hard FP and measure again the
performance.

Best regards,

Riadh.
