Complex Short/INT16 type

Hi all -

I’m getting limited by the slow ARM processor in the E100 and I want to
modify parts of gr-digital and gnuradio-core to support complex
short/INT16 types in the modulation schemes. I suspect that it won’t be
as trivial as defining “typedef std::complex<short> gr_complexs;” in
gnuradio-core/src/lib/runtime/gr_complex.h and doing a find-and-replace
in the relevant source files. There are probably issues with dynamic
range that I’ll have to deal with in addition to having to implement
filters using fixed-point math.
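
Something like this is what I have in mind for gr_complex.h (a rough sketch; gr_complexs is just a name I made up and does not exist in the tree):

    // Sketch of a possible addition to gnuradio-core/src/lib/runtime/gr_complex.h
    #include <complex>

    typedef std::complex<float>  gr_complex;    // existing
    typedef std::complex<double> gr_complexd;   // existing
    typedef std::complex<short>  gr_complexs;   // proposed complex int16 type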

Questions:

  1.  Do you think I'd save anything by doing all the modulation &
      filtering in complex float32 and then converting at the very end? This
      will reduce the bandwidth requirement to the FPGA by two, but I’m afraid
      the float math is the true limitation.

  2.  Why is there a gr_complex_to_interleaved_short block but not a
      gr_complex_to_complex_short block? Would it be better if I rolled my own
      or just hooked up a gr_complex_to_interleaved_short block and then a
      deinterleave block (see the sketch after this list)? Or alternatively,
      split the complex float vector into two streams and feed them to a USRP
      sink block using COMPLEX.INT16?

  3.  What specific parts of the modulation examples or gnuradio-core
      do you think I need to change to support complex short ints?

Thanks so much for your help.

Sean

On 11/07/2011 02:15 PM, Nowlan, Sean wrote:

Hi all -

I’m getting limited by the slow ARM processor in the E100 and I want
to modify parts of gr-digital and gnuradio-core to support complex
short/INT16 types in the modulation schemes. I suspect that it won’t
be as trivial as defining “typedef std::complex<short> gr_complexs;”
in gnuradio-core/src/lib/runtime/gr_complex.h and doing a
find-and-replace in the relevant source files.

It may be that simple for some blocks, like the symbol table in BPSK.
There are probably issues with dynamic range that I’ll have to deal with
in addition to having to implement filters using fixed-point math.

Often blocks will need to have scale factors. Fortunately, with a FIR
filter, you get a free scale factor in the “filter taps”.
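
For example, something like this (a sketch, not code from the tree) folds a Q15 scale into the taps so the taps themselves become int16 and carry the scaling for free:

    // Convert float taps to Q15 int16 taps, with rounding and saturation.
    #include <vector>
    #include <cmath>

    std::vector<short> taps_to_q15(const std::vector<float> &taps)
    {
        std::vector<short> itaps(taps.size());
        for (size_t i = 0; i < taps.size(); i++) {
            float v = std::floor(taps[i] * 32767.0f + 0.5f);
            if (v >  32767.0f) v =  32767.0f;
            if (v < -32768.0f) v = -32768.0f;
            itaps[i] = (short)v;
        }
        return itaps;
    }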

Questions:

  1.  Do you think I'd save anything by doing all the modulation &
      filtering in complex float32 and then converting at the very end?

It’s good to make the conversion part of an operation that does something
useful rather than doing it for the sake of converting, like a filter
that takes in floats and spits out shorts.
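
A rough sketch of what I mean, just the inner computation with the block plumbing omitted; the scale factor and saturation behavior are assumptions:

    // Filter that takes in floats and spits out shorts: the float->int16
    // conversion (with saturation and a scale factor) rides along with the
    // dot product instead of living in a separate conversion block.
    #include <math.h>

    static inline short sat16(float x)
    {
        long v = lrintf(x);
        if (v >  32767) v =  32767;
        if (v < -32768) v = -32768;
        return (short)v;
    }

    short fir_float_in_short_out(const float *in, const float *taps,
                                 unsigned ntaps, float scale)
    {
        float acc = 0.0f;
        for (unsigned i = 0; i < ntaps; i++)
            acc += in[i] * taps[i];
        return sat16(acc * scale);
    }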

This will reduce the bandwidth requirement to the FPGA by two, but
I’m afraid the float math is the true limitation.

The format going into the FPGA is always integer. If you pass floats
into the UHD, they are copy-converted from host buffer to memory mapped
buffers.
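
To illustrate, assuming a UHD version with the streamer API (the exact calls may differ on the E100 image):

    // The host-side format (cpu_format) and over-the-wire format (otw_format)
    // are chosen independently; the wire format is always an integer type,
    // and UHD copy-converts between them.
    #include <uhd/usrp/multi_usrp.hpp>

    void make_sc16_stream()
    {
        uhd::usrp::multi_usrp::sptr usrp =
            uhd::usrp::multi_usrp::make(uhd::device_addr_t());
        uhd::stream_args_t args("sc16", "sc16");   // cpu_format, otw_format
        uhd::tx_streamer::sptr tx = usrp->get_tx_stream(args);
        // with cpu_format "fc32" instead, UHD does the float->int16 conversion
    }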

  2.  Why is there a gr_complex_to_interleaved_short block but not
      a gr_complex_to_complex_short block? Would it be better if I rolled
      my own or just hooked up a gr_complex_to_interleaved_short block and
      then a deinterleave block? Or alternatively, split the complex float
      vector into two streams and feed them to a USRP sink block using
      COMPLEX.INT16?

The interleaved short block is a strange hold-over from ancient times. I
would ignore it. I think a block such as “gr_complex_to_complex_short”
is a good idea.

  3.  What specific parts of the modulation examples or
      gnuradio-core do you think I need to change to support complex short
      ints?

Probably some new sc16 filter blocks for the matched filters. I have
mentioned the importance of volk before.
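
The core of such a filter would be something like this sketch; a real Volk kernel would do the same multiply-accumulate with NEON intrinsics. It assumes Q15 taps (see the tap-scaling sketch above) and ignores accumulator headroom and rounding:

    // Inner MAC of an sc16 (complex int16) FIR tap loop: 32-bit accumulators
    // and a final Q15 shift back down to int16.
    #include <complex>
    #include <stdint.h>

    std::complex<short> sc16_dot_product(const std::complex<short> *in,
                                         const std::complex<short> *taps,
                                         unsigned ntaps)
    {
        int32_t acc_re = 0, acc_im = 0;
        for (unsigned i = 0; i < ntaps; i++) {
            acc_re += (int32_t)in[i].real() * taps[i].real()
                    - (int32_t)in[i].imag() * taps[i].imag();
            acc_im += (int32_t)in[i].real() * taps[i].imag()
                    + (int32_t)in[i].imag() * taps[i].real();
        }
        return std::complex<short>((short)(acc_re >> 15), (short)(acc_im >> 15));
    }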

The constellation stuff relies on this new constellation library in
gr-digital. Perhaps Ben can lean in here and offer some advice on how to
modify this for alternative data types.

The recovery stuff in the BPSK uses Tom’s new gri-control-loop to
simplify writing things like FLLs and PLLs. That’s a place to look; see how
the timing recovery blocks make use of it.
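
The basic idea is a second-order loop like this. This is a rough sketch of the concept, not the actual gri-control-loop interface:

    // Integrate an error signal into frequency and phase estimates, with
    // phase wrapping and a frequency clamp.
    #include <math.h>

    struct simple_loop {
        float phase, freq;
        float alpha, beta;        // proportional / integral gains
        float max_freq, min_freq; // frequency limits in rad/sample

        simple_loop(float a, float b, float fmax, float fmin)
          : phase(0), freq(0), alpha(a), beta(b), max_freq(fmax), min_freq(fmin) {}

        void advance(float error)
        {
            freq  += beta * error;
            phase += freq + alpha * error;
            while (phase >  (float)M_PI) phase -= 2 * (float)M_PI;
            while (phase < -(float)M_PI) phase += 2 * (float)M_PI;
            if (freq > max_freq) freq = max_freq;
            if (freq < min_freq) freq = min_freq;
        }
    };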

-Josh

Sean, with all the talk about optimization for ARM, the first thing I
would do is start to integrate Volk with existing floating-point
blocks. Stock GCC is very, very bad at vectorizing for the NEON SIMD
unit – even when hardware floating point is used in GCC, most float
instructions end up allocated to the VFP rather than the NEON unit.
You might find an easy 2x-3x improvement just by doing the heavy
lifting in Volk rather than in C++. All of the Orc functions in Volk
will work for NEON. There’s no FIR filter in Orc right now (need to
get accumulators working properly in Orc), but Philip B. already
wrote NEON FIR filter cores for the _fff and _ccf FIR filters.

This isn’t to say that short complex wouldn’t be a useful addition to
GR. Just that it’s likely going to be more work than making use of the
existing floating-point hardware the E100 already has.

This is work that needs to be done anyway to make ARM platforms as
useful as possible, and we (Josh, Phil, and I) are happy to help you
optimize your application for E100 if you give us details on how your
application works. We’re putting together a “motivating example” using
Volk to show users how to Volkify their own blocks.

–n

3 quick questions - first, does the cmake setup automatically turn on
gcc optimizations, i.e., with “-O3”? Second, is there anything to be
gained (or lost) by turning on “-ftree-vectorize” and
“-funsafe-math-optimizations”? Finally, is the gcc on E100 really
CodeSourcery’s arm-none-eabi-gcc (or an upstream GNU version thereof)?

Thanks,
Sean

So, what needs to be done? I noticed that there are already hooks for
NEON in the volk library but no implementation (or very little… don’t
remember exactly).

My understanding of Orc is that it generates architecture-dependent
vector processor instructions from an Orc abstraction language. Is
integrating Orc into Volk for NEON as simple as linking into liborc with
a compile switch indicating that we want NEON output? Are the smarts
already built into the cmake build process?

Can I drop Philip’s _fff and _ccf filters into volk and hit “go”? (I
know there’s more nuance to it, but if the combination of integrating
Orc code and NEON FIR filter code that’s already written gets me 90% of
the way there, I’d be VERY happy!)

Thanks,
Sean



On 11/08/2011 10:40 PM, Nowlan, Sean wrote:

3 quick questions - first, does the cmake setup automatically turn on gcc
optimizations, i.e., with “-O3”? Second, is there anything to be gained (or lost)
by turning on “-ftree-vectorize” and “-funsafe-math-optimizations”? Finally, is
the gcc on E100 really CodeSourcery’s arm-none-eabi-gcc (or an upstream GNU
version thereof)?

GCC is gcc 4.5.x + Linaro patches. If you try the compile flags, let us
know if there is any improvement. It might be best to write small test
sections and look at the generated asm to see how well they work.
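
For example, something this small is enough to see whether the vectorizer kicks in; the exact flag set for the E100 toolchain is a guess:

    // Tiny test kernel: compile with something like
    //   g++ -O3 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize -S test.cc
    // and look for NEON q-register instructions in the generated test.s.
    void scale_buffer(float *out, const float *in, float k, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * k;
    }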

Philip

On Tue, Nov 8, 2011 at 12:50 PM, Nowlan, Sean [email protected]
wrote:
Orc is actually a little cooler than that – it’s a runtime-compiled
architecture-independent vector assembly language. It’s integrated as one
alternative architecture for implementing Volk functions. Volk has been set up to
automatically select the fastest implementation available for a given function at
runtime, so for the user it’s as simple as #include <volk/volk.h> and then
volk_32f_x2_add_32f_a16(…) to implement an adder. Volk will automatically choose
the fastest implementation at runtime the first time the function is invoked,
after figuring out what architecture it’s running on and what implementations are
available for that given function. If an Orc version of a function is available,
it will be automatically selected and the Orc code will runtime-compile to
vectorized NEON. You don’t have to link against liborc at all, just against
libvolk. We don’t have any native NEON in Volk – we use Orc to provide coverage
on NEON platforms. We’ve found that Orc tends to be around 90% as fast
as good, hand-tuned assembly most of the time, and
sometimes faster. The reason we don’t just use Orc for everything is
that it’s usually possible to do a little better with careful
optimization and compiler intrinsics, and we were “gifted” a large
library of well-optimized SSE DSP routines to use.
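
In practice it looks like this. The kernel name is the one I mentioned above; the (out, in0, in1, num_points) argument order and 16-byte-aligned buffers are assumptions on my part:

    // One Volk call replaces the scalar loop; Volk picks the fastest
    // implementation (Orc/NEON here) at runtime on first use.
    #include <volk/volk.h>

    void add_vectors(float *out, const float *in0, const float *in1,
                     unsigned int num_points)
    {
        volk_32f_x2_add_32f_a16(out, in0, in1, num_points);
    }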

On 11/08/2011 07:40 PM, Nowlan, Sean wrote:

3 quick questions - first, does the cmake setup automatically turn on
gcc optimizations, i.e., with “-O3”? Second, is there anything to be
gained (or lost) by turning on “-ftree-vectorize” and
“-funsafe-math-optimizations”? Finally, is the gcc on E100 really
CodeSourcery’s arm-none-eabi-gcc (or an upstream GNU version
thereof)?

CMake will automatically build in release mode, which gives you -O3.
Other important flags need to be specified; you can do this in one fell
swoop with a toolchain file. One is checked into the cmake/Toolchains
directory; see comments for usage.

-josh