Max and argmax blocks with SIMD instructions

Trond_D · April 23, 2007, 10:50am

Hi everyone,

I’ve written a couple of blocks for GNU Radio, but am not satisfied
with the performance. I am therefore thinking of using SIMD
instructions. However, I am not that familiar with x86 assembly
instructions, and finding the reference manual on Intel’s website was
not easy. I know that DSPs such as the Blackfin have special vector
instructions that would make this very simple, but I am not sure about
x86.

I am also going to write a general purpose multiply and accumulate
block that would benefit much from SIMD instructions.

Any comments are appreciated.

–
Trond D.

Trond_D · April 23, 2007, 4:42pm

On Mon, Apr 23, 2007 at 10:48:58AM +0200, Trond D. wrote:

I am also going to write a general purpose multiply and accumulate
block that would benefit much from SIMD instructions.

Any comments are appreciated.

–
Trond D.

Hi Trond,

Can you point us at your code? Before diving into SIMD, it would be
good to confirm that there isn’t an easier change to make. Have you
run oprofile on your code?

In general, when going for a speedup, you want to pack enough
cycles into the block for it to make a difference. I.e., I’m not sure
that a general purpose multiply-accumulate (MAC) block is going to
solve your problem. However, if you take a look at the gr_fir_.cc
code, you’ll find that at the bottom they call out to SIMD
assembler in {c,}complex_dotprod_.S that implements the kernel of the
FIR filter. In those cases the equivalent of the MAC function is buried
in an unrolled inner loop.
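
As a rough illustration of what a MAC "buried in an unrolled inner loop" looks like, here is a minimal scalar sketch in C. It is not the GNU Radio assembler kernel; the function name dotprod_unrolled and the unroll factor of four are assumptions made purely for the example.

```c
#include <stddef.h>

/* Hypothetical helper (not GNU Radio code): dot product of taps and
 * input, unrolled by four. Each accumulator forms an independent
 * dependency chain, so consecutive multiply-adds do not have to wait
 * on each other. */
float dotprod_unrolled(const float *taps, const float *input, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four multiply-accumulates per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += taps[i + 0] * input[i + 0];
        acc1 += taps[i + 1] * input[i + 1];
        acc2 += taps[i + 2] * input[i + 2];
        acc3 += taps[i + 3] * input[i + 3];
    }

    /* Tail: the zero to three elements left over. */
    for (; i < n; i++)
        acc0 += taps[i] * input[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```

Because the sum is split across four accumulators, the floating-point rounding can differ slightly from a plain sequential loop. A SIMD version follows the same shape, with each accumulator replaced by a vector register.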

With SIMD programming, a lot of the complexity is figuring out how to
schedule the loads and stores, since unless you’re careful, your
performance is dominated by the memory hierarchy and not the math.

Also, on the x86 architecture, there are not enough registers
available to hide the load latencies. On the x86-64 it’s better,
since you’ve got twice as many registers. For comparison, on the
Cell SPE you’ve got 128 (!) 128-bit registers. No shortage of
registers there.

In addition to the “IA-32 Architecture Software Developer’s Manual” (I
suspect that was the one you had trouble finding), you’ll want to look
for the microarchitecture-specific optimization manuals. The one I’ve
got in front of me, “Intel Pentium 4 and Intel Xeon Processor
Optimization Reference Manual” (Order Number: 248966-04) isn’t the
latest, but is an example. I suspect that there’s a new one out that
covers the Pentium M, Core, Core Duo, Core 2 Duo, etc. AMD also has
similar manuals. All these are on the vendor web sites, typically in
the “developer” section somewhere.

Be sure to create meaningful benchmarks to measure the performance of
your code. That’s a whole art unto itself.

When all is said and done, algorithmic changes often result in bigger
wins than SIMD assembler. Be sure to look there first. In our
case, the FFT based FIR code is faster than the hand-coded SIMD code
for pretty much all cases where ntaps >= 20.

Have fun!
Eric

Trond_D · April 23, 2007, 4:47pm

If you don’t want to use assembly, you can use the MMX and SSE intrinsics
that the compiler supports. These are C functions/macros that allow the
use of SSE instructions directly from C/C++. You can start with this
introduction: http://www.codeproject.com/cpp/sseintro.asp . For a
reference, you can go to the MSDN website and search for “intrinsic”.
Even though it is written for Windows, it is still valid for GNU. Have a
look at these files on your computer (in your system include dir):
mmintrin.h → MMX
emmintrin.h → SSE2
xmmintrin.h → SSE
pmmintrin.h → SSE3

If you use them, don’t forget to use compilation macros to offer an
alternative for processors that don’t have these instructions.
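
A minimal sketch of that pattern for the max/argmax case, assuming GCC-style predefined macros (__SSE__) and float input without NaNs. The function names max_float and argmax_float are made up for the example and are not taken from Trond’s blocks:

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_max_ps, ... */
#endif

/* Hypothetical helper: largest value in x[0..n-1]. Requires n >= 1. */
float max_float(const float *x, size_t n)
{
    float best = x[0];
    size_t i = 0;

#if defined(__SSE__)
    if (n >= 4) {
        /* Keep four running maxima, one per vector lane. */
        __m128 vmax = _mm_loadu_ps(x);
        for (i = 4; i + 4 <= n; i += 4)
            vmax = _mm_max_ps(vmax, _mm_loadu_ps(x + i));

        /* Reduce the four lanes to a single scalar. */
        float lanes[4];
        _mm_storeu_ps(lanes, vmax);
        best = lanes[0];
        for (int k = 1; k < 4; k++)
            if (lanes[k] > best)
                best = lanes[k];
    }
#endif

    /* Scalar tail, or the whole array when SSE is not compiled in. */
    for (; i < n; i++)
        if (x[i] > best)
            best = x[i];

    return best;
}

/* Hypothetical helper: index of the first occurrence of the maximum,
 * found with a second scalar pass. Requires n >= 1. */
size_t argmax_float(const float *x, size_t n)
{
    float best = max_float(x, n);
    for (size_t i = 0; i < n; i++)
        if (x[i] == best)
            return i;
    return 0;  /* only reachable if the input contains NaNs */
}
```

The unaligned loads (_mm_loadu_ps) are used so the sketch works on any buffer; aligned loads may be faster when the alignment is known. Note that this only selects the code path at compile time. Choosing at run time which path to use on a given CPU needs a separate check that is not shown here.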

Pascal

PS From a personal point of view, I don’t know if these intrinsics
are as fast as hand-written assembly. But they are quite easy to use,
which improves my code writing speed.

Trond_D · April 24, 2007, 1:12pm

2007/4/23, Eric B. [email protected]:

I am also going to write a general purpose multiply and accumulate
block that would benefit much from SIMD instructions.

Any comments are appreciated.

Hi Trond,

Can you point us at your code? Before diving into SIMD, it would be
good to confirm that there isn’t an easier change to make. Have you
run oprofile on your code?

Thanks a lot for your answer, very enlightening!

The max and argmax blocks can be found here:
ftp://open-gnss.org/pub/opengnss. If you find them useful, I do not mind
them being included in GNU Radio.

I haven’t profiled the code, so really cannot verify that it is the
main problem atm. I was just curious because I know that such
instructions exist for other processors.

–
Trond D.

Trond_D · April 24, 2007, 1:19pm

I have some C/C++ functions that utilize MMX and/or SSE available that
work on 16 bit signed integers. The URL is:

http://www.ngs.noaa.gov/gps-toolbox/Heckler.htm

They work particularly well for building a software correlator :).
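
To give an idea of what such a routine can look like, here is a minimal sketch of a 16-bit signed multiply-accumulate (the core of a correlator) using SSE2 intrinsics with a scalar fallback. This is not the code from the URL above; the function name correlate_s16 and the use of the __SSE2__ macro are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#endif

/* Hypothetical helper: sum of a[i] * b[i] over n 16-bit samples.
 * The 32-bit accumulators can overflow on long or large-valued
 * inputs; the caller is assumed to keep n small enough. */
int32_t correlate_s16(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t sum = 0;
    size_t i = 0;

#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        /* _mm_madd_epi16 multiplies eight 16-bit pairs and adds
         * adjacent products, giving four 32-bit partial sums. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(va, vb));
    }

    /* Horizontal add of the four 32-bit lanes. */
    int32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif

    /* Scalar tail, or the whole input when SSE2 is not compiled in. */
    for (; i < n; i++)
        sum += (int32_t)a[i] * b[i];

    return sum;
}
```

Eight samples are processed per iteration, which is where most of the gain over a scalar loop would come from on 16-bit data.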

Trond_D · April 24, 2007, 1:15pm

2007/4/23, Pascal C. [email protected]:

If you don’t want to use assembly, you can use MMX and SSE intrinsics
compiler support. These are C functions/macros to allow the use of SSE
instructions directly from C/C++.

Thanks a lot, I will definitely look into that, as I would prefer to
stay out of assembly land a little longer.

–
Trond D.

Trond_D · April 25, 2007, 12:27am

Another example of how to do this in a portable way is found in FFTW.
Since FFTW is prerequisite software for GNU Radio, most of us already
have FFTW on our computers.

FFTW supports SSE/SSE2/3dNow!/Altivec and uses whichever is available
and fastest.

— Gregory W Heckler [email protected] wrote:


Chris Albertson
Home: 310-376-1029 [email protected]
Office: 310-336-5189 [email protected]


Trond_D · April 24, 2007, 4:11pm

Trond:

I, as well as GNU radio and others, need your copyright statement and
your grant of a GPL license to this code IN the code and header modules.
It would be best if you added this to your code as soon as possible.

THANK YOU for sharing this.

Bob


–
AMSAT Director and VP Engineering. Member: ARRL, AMSAT-DL,
TAPR, Packrats, NJQRP, QRP ARCI, QCWA, FRC. ARRL SDR WG Chair
“Taking fun as simply fun and earnestness in earnest shows
how thoroughly thou none of the two discernest.” - Piet Hein