Forum: GNU Radio max and argmax blocks with SIMD instructions

Trond Danielsen (Guest)
on 2007-04-23 10:50
(Received via mailing list)
Hi everyone,

I've written a couple of blocks for GNU Radio, but am not satisfied
with the performance. I am therefore thinking of using SIMD
instructions. However, I am not that familiar with x86 assembly
instructions, and finding the reference manual on Intel's website was
not easy. I know that DSPs such as the Blackfin have special vector
instructions that would make this very simple, but I am not sure about
x86.

I am also going to write a general purpose multiply and accumulate
block that would benefit much from SIMD instructions.

Any comments are appreciated.

--
Trond Danielsen
Eric Blossom (Guest)
on 2007-04-23 16:42
(Received via mailing list)
On Mon, Apr 23, 2007 at 10:48:58AM +0200, Trond Danielsen wrote:
> I am also going to write a general purpose multiply and accumulate
> block that would benefit much from SIMD instructions.
>
> Any comments are appreciated.
>
> --
> Trond Danielsen

Hi Trond,

Can you point us at your code?  Before diving into SIMD, it would be
good to confirm that there isn't an easier change to make.  Have you
run oprofile on your code?

In general when going for a speed up, you want to be packaging enough
cycles in the block to have it make a difference.  I.e., I'm not sure
that a general purpose multiply-accumulate (MAC) block is going to
solve your problem.  However, if you take a look at the gr_fir_*.cc
code, you'll find that at the bottom of them they call out to SIMD
assembler in {c,}complex_dotprod_*.S that implements the kernel of the
FIR filter.  In those cases the equivalent of the MAC function is buried
in an unrolled inner loop.
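
A minimal sketch of what "the MAC buried in an unrolled inner loop" can look like in plain C++, assuming float taps and float samples. The name dotprod_unrolled is hypothetical and this is not the GNU Radio assembler kernel; it only illustrates the unrolling idea with four independent accumulators.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not GNU Radio code): dot product of taps[0..n-1]
// and input[0..n-1], with the multiply-accumulate unrolled four times.
float dotprod_unrolled(const float* taps, const float* input, std::size_t n)
{
    assert(n == 0 || (taps != nullptr && input != nullptr));

    // Four independent accumulators, so consecutive MACs do not have to
    // wait for each other's result.
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc0 += taps[i + 0] * input[i + 0];
        acc1 += taps[i + 1] * input[i + 1];
        acc2 += taps[i + 2] * input[i + 2];
        acc3 += taps[i + 3] * input[i + 3];
    }

    float acc = (acc0 + acc1) + (acc2 + acc3);

    // Leftover elements when n is not a multiple of four.
    for (; i < n; ++i)
        acc += taps[i] * input[i];

    return acc;
}
```

A FIR filter would call something like this once per output sample, with input pointing at the oldest sample in the current window. Note that the summation order differs from a simple loop, so floating-point results can differ in the last bits.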

With SIMD programming, a lot of the complexity is figuring out how to
schedule the loads and stores, since unless you're careful, your
performance is dominated by the memory hierarchy and not the math.

Also, on the x86 architecture, there are not enough registers
available to hide the load latencies.  On the x86-64 it's better,
since you've got twice as many registers.  For a comparison, on the
Cell SPE you've got 128 (!) 128-bit registers.  No shortage of
registers there ;)

In addition to the "IA-32 Architecture Software Developer's Manual" (I
suspect that was the one you had trouble finding), you'll want to look
for the microarchitecture-specific optimization manuals.  The one I've
got in front of me, "Intel Pentium 4 and Intel Xeon Processor
Optimization Reference Manual" (Order Number: 248966-04) isn't the
latest, but is an example.  I suspect that there's a new one out that
covers the Pentium M, Core, Core Duo, Core 2 Duo, etc.  AMD also has
similar manuals.  All these are on the vendor web sites, typically in
the "developer" section somewhere.

Be sure to create meaningful benchmarks to measure the performance of
your code.  That's a whole art unto itself.

When all is said and done, algorithmic changes often result in bigger
wins than SIMD assembler.  Be sure to look there first.  In our
case, the FFT based FIR code is faster than the hand-coded SIMD code
for pretty much all cases where ntaps >= 20.

Have fun!
Eric
Pascal Charest (Guest)
on 2007-04-23 16:47
(Received via mailing list)
If you don't want to use assembly, you can use the MMX and SSE intrinsics
that the compiler supports. These are C functions/macros that let you use
SSE instructions directly from C/C++. You can start with this introduction:
http://www.codeproject.com/cpp/sseintro.asp . For a reference, you can
search the MSDN website for "intrinsic". Even though it is written for
Windows, it is still valid for GNU. Have a look at these files on your
computer (in your system include dir):
mmintrin.h -> MMX
emmintrin.h -> SSE2
xmmintrin.h -> SSE
pmmintrin.h -> SSE3

If you use them, don't forget to use compilation macros to offer an
alternative for processors that don't have these instructions.
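
As a rough sketch of that advice applied to the max and argmax case: the code below uses the SSE intrinsics from xmmintrin.h when the compiler defines __SSE__ (GCC and Clang do this when SSE is enabled) and falls back to a plain loop otherwise. The names max_float and argmax_float are hypothetical, and the code assumes n > 0 and no NaNs in the data. The argmax here is a simple two-pass version (SIMD max, then a scalar search for the first match), not a fully vectorized one.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE__)
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_max_ps, _mm_storeu_ps
#endif

// Hypothetical helper: largest value in x[0..n-1]. Requires n > 0, no NaNs.
float max_float(const float* x, std::size_t n)
{
    assert(x != nullptr && n > 0);
    float best = x[0];
    std::size_t i = 0;

#if defined(__SSE__)
    if (n >= 4) {
        // Keep four running maxima, one per SSE lane.
        __m128 vmax = _mm_loadu_ps(x);
        for (i = 4; i + 4 <= n; i += 4)
            vmax = _mm_max_ps(vmax, _mm_loadu_ps(x + i));

        // Reduce the four lanes to a single value.
        float lanes[4];
        _mm_storeu_ps(lanes, vmax);
        best = lanes[0];
        for (int k = 1; k < 4; ++k)
            if (lanes[k] > best) best = lanes[k];
    }
#endif

    // Scalar path: the whole array when SSE is unavailable (or n < 4),
    // otherwise only the leftover tail.
    for (; i < n; ++i)
        if (x[i] > best) best = x[i];

    return best;
}

// Hypothetical helper: index of the first occurrence of the maximum.
std::size_t argmax_float(const float* x, std::size_t n)
{
    const float best = max_float(x, n);
    for (std::size_t i = 0; i < n; ++i)
        if (x[i] == best) return i;
    return 0;  // not reached for NaN-free input
}
```

The unaligned load _mm_loadu_ps is used so the caller does not have to guarantee 16-byte alignment; with aligned buffers, _mm_load_ps could be used instead.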

Pascal

PS From a personal point of view, I don't know if these intrinsics are as
fast as assembly. But they are quite easy to use, which improves my code
writing speed ;-)
Trond Danielsen (Guest)
on 2007-04-24 13:12
(Received via mailing list)
2007/4/23, Eric Blossom <eb@comsec.com>:
> >
> > I am also going to write a general purpose multiply and accumulate
> > block that would benefit much from SIMD instructions.
> >
> > Any comments are appreciated.
>
> Hi Trond,
>
> Can you point us at your code?  Before diving into SIMD, it would be
> good to confirm that there isn't an easier change to make.  Have you
> run oprofile on your code?

Thanks a lot for your answer, very enlightening!

The max and argmax blocks can be found here:
ftp://open-gnss.org/pub/opengnss. If you find them useful, I do not mind
them being included in GNU Radio.

I haven't profiled the code, so really cannot verify that it is the
main problem atm. I was just curious because I know that such
instructions exist for other processors.

--
Trond Danielsen
Trond Danielsen (Guest)
on 2007-04-24 13:15
(Received via mailing list)
2007/4/23, Pascal Charest <c.lacsap@gmail.com>:
> If you don't want to use assembly, you can use MMX and SSE intrinsics
> compiler support. These are C functions/macros to allow the use of SSE
> instructions directly from C/C++.

Thanks a lot, I will definitely look into that, as I would prefer to
stay out of assembly land a little longer.

--
Trond Danielsen
Gregory W Heckler (Guest)
on 2007-04-24 13:19
(Received via mailing list)
I have some C/C++ functions that utilize MMX and/or SSE available that
work on 16 bit signed integers. The URL is:

http://www.ngs.noaa.gov/gps-toolbox/Heckler.htm

They work particularly well for building a software correlator :).
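
To give an idea of why 16-bit integers suit a correlator, here is a small sketch (not taken from the code at the URL above) of a 16-bit multiply-accumulate using the SSE2 intrinsic _mm_madd_epi16 from emmintrin.h, with a scalar fallback when __SSE2__ is not defined. The name correlate_i16 is hypothetical, and the caller is assumed to keep the total within the range of a 32-bit integer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2: __m128i, _mm_madd_epi16, _mm_add_epi32
#endif

// Hypothetical helper: sum of a[i] * b[i] over n 16-bit samples,
// accumulated in 32 bits. The caller must ensure the sum fits in int32.
std::int32_t correlate_i16(const std::int16_t* a, const std::int16_t* b,
                           std::size_t n)
{
    assert(n == 0 || (a != nullptr && b != nullptr));
    std::int32_t acc = 0;
    std::size_t i = 0;

#if defined(__SSE2__)
    __m128i vacc = _mm_setzero_si128();
    for (; i + 8 <= n; i += 8) {
        const __m128i va =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        const __m128i vb =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // Multiplies eight int16 pairs and adds adjacent products,
        // giving four int32 partial sums in one instruction.
        vacc = _mm_add_epi32(vacc, _mm_madd_epi16(va, vb));
    }
    std::int32_t lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), vacc);
    acc = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif

    // Scalar path: everything without SSE2, otherwise the leftover tail.
    for (; i < n; ++i)
        acc += static_cast<std::int32_t>(a[i]) * b[i];

    return acc;
}
```

In a correlator, a would typically hold the received samples and b the local code replica; one call gives the correlation value for one code phase.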
Robert McGwier (Guest)
on 2007-04-24 16:11
(Received via mailing list)
Trond:

I, as well as GNU Radio and others, need your copyright statement and
your grant of a GPL license to this code IN the code and header modules.
It would be best if you added this to your code as soon as possible.

THANK YOU for sharing this.

Bob



--
AMSAT Director and VP Engineering. Member: ARRL, AMSAT-DL,
TAPR, Packrats, NJQRP, QRP ARCI, QCWA, FRC. ARRL SDR WG Chair
"Taking fun as simply fun and earnestness in earnest shows
how thoroughly thou none of the two discernest." - Piet Hine
Chris Albertson (Guest)
on 2007-04-25 00:27
(Received via mailing list)
Another example of how to do this in a portable way is
found in FFTW.
Since FFTW is a prerequisite for GNU Radio, most of us
already have it on our computers.

FFTW supports SSE/SSE2/3DNow!/AltiVec and uses whichever
is available and fastest.
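
The general idea of picking an implementation at run time can be sketched as below. This is not how FFTW does it internally (FFTW has its own planner); it is only an illustration using the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports, which exist on x86 targets. The names sum_fn, sum_generic, sum_sse, select_sum and the macro HAVE_X86_CPU_DETECT are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_CPU_DETECT 1  // hypothetical macro for this sketch
#include <xmmintrin.h>
#endif

using sum_fn = float (*)(const float*, std::size_t);

// Portable version, always available.
float sum_generic(const float* x, std::size_t n)
{
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

#ifdef HAVE_X86_CPU_DETECT
// SSE version; the target attribute lets it compile even when SSE is
// not enabled for the rest of the file.
__attribute__((target("sse")))
float sum_sse(const float* x, std::size_t n)
{
    __m128 vacc = _mm_setzero_ps();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        vacc = _mm_add_ps(vacc, _mm_loadu_ps(x + i));

    float lanes[4];
    _mm_storeu_ps(lanes, vacc);
    float acc = (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);

    for (; i < n; ++i)
        acc += x[i];
    return acc;
}
#endif

// Returns the SSE version if the running CPU reports SSE support,
// otherwise the portable one.
sum_fn select_sum()
{
#ifdef HAVE_X86_CPU_DETECT
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse"))
        return sum_sse;
#endif
    return sum_generic;
}
```

A block would call select_sum() once, for example in its constructor, and keep the returned function pointer for use in its work function.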





Chris Albertson
  Home:   310-376-1029  chrisalbertson90278@yahoo.com
  Office: 310-336-5189  Christopher.J.Albertson@aero.org
