On Mon, Apr 23, 2007 at 10:48:58AM +0200, Trond D. wrote:
> I am also going to write a general purpose multiply and accumulate
> block that would benefit much from SIMD instructions.
> Any comments are appreciated.
> --
> Trond D.
Hi Trond,
Can you point us at your code? Before diving into SIMD, it would be
good to confirm that there isn’t an easier change to make. Have you
run oprofile on your code?
In general, when going for a speedup, you want to pack enough
cycles of work into the block for it to make a difference. I.e., I’m not sure
that a general purpose multiply-accumulate (MAC) block is going to
solve your problem. However, if you take a look at the gr_fir_.cc
code, you’ll find that at the bottom it calls out to SIMD
assembler in {c,}complex_dotprod_.S that implements the kernel of the
FIR filter. In those cases the equivalent of the MAC function is buried
in an unrolled inner loop.
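To make the "MAC buried in an unrolled inner loop" point concrete, here is a rough sketch in plain C. It is not the GNU Radio kernel; the name dotprod_unrolled and the unroll factor of 4 are my own assumptions for illustration. The real thing is SIMD assembler, but the shape is similar: several independent accumulators, with the leftover taps handled in a cleanup loop.

```c
#include <stddef.h>

/*
 * Illustrative only (hypothetical helper, not the GNU Radio code):
 * dot product of n taps with n input samples, unrolled by 4.
 * Each accumulator is an independent multiply-accumulate chain,
 * so one MAC does not have to wait for the previous one to finish.
 */
float dotprod_unrolled(const float *taps, const float *input, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: four MACs per iteration. */
    for (; i + 4 <= n; i += 4) {
        acc0 += taps[i]     * input[i];
        acc1 += taps[i + 1] * input[i + 1];
        acc2 += taps[i + 2] * input[i + 2];
        acc3 += taps[i + 3] * input[i + 3];
    }

    /* Cleanup loop: the remaining 0 to 3 taps. */
    for (; i < n; i++)
        acc0 += taps[i] * input[i];

    return (acc0 + acc1) + (acc2 + acc3);
}
```

Note that the summation order differs from a straight one-accumulator loop, so the floating point result can differ in the last bits.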
With SIMD programming, a lot of the complexity is figuring out how to
schedule the loads and stores, since unless you’re careful, your
performance is dominated by the memory hierarchy and not the math.
Also, on the x86 architecture, there are not enough registers
available to hide the load latencies. On the x86-64 it’s better,
since you’ve got twice as many registers. For a comparison, on the
Cell SPE you’ve got 128 (!) 128-bit registers. No shortage of
registers there.
In addition to the “IA-32 Architecture Software Developer’s Manual” (I
suspect that was the one you had trouble finding), you’ll want to look
for the microarchitecture-specific optimization manuals. The one I’ve
got in front of me, “Intel Pentium 4 and Intel Xeon Processor
Optimization Reference Manual” (Order Number: 248966-04) isn’t the
latest, but is an example. I suspect that there’s a new one out that
covers the Pentium M, Core, Core Duo, Core 2 Duo, etc. AMD also has
similar manuals. All these are on the vendor web sites, typically in
the “developer” section somewhere.
Be sure to create meaningful benchmarks to measure the performance of
your code. That’s a whole art unto itself.
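As a starting point only, a timing harness might look something like the sketch below. All the names (kernel_fn, dotprod_ref, bench_best_seconds) are made up for this example, and it uses the standard C clock() call, which reports CPU time at fairly coarse resolution. It repeats the measurement and keeps the best trial to reduce noise from other activity on the machine. It says nothing about cache effects or realistic data sizes, which is where most of the art lies.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the kernel under test. */
typedef float (*kernel_fn)(const float *, const float *, size_t);

/* Plain reference kernel, here only so there is something to time. */
float dotprod_ref(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];
    return acc;
}

/*
 * Time `reps` calls of fn, repeat that `trials` times, and return the
 * best (smallest) time per call in seconds. The sum of the results is
 * written to *sink so the calls have a visible effect, which makes it
 * harder for the compiler to discard them.
 */
double bench_best_seconds(kernel_fn fn, const float *a, const float *b,
                          size_t n, int reps, int trials, float *sink)
{
    double best = -1.0;
    float sum = 0.0f;

    for (int t = 0; t < trials; t++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++)
            sum += fn(a, b, n);
        clock_t stop = clock();

        double per_call =
            (double)(stop - start) / CLOCKS_PER_SEC / (double)reps;
        if (best < 0.0 || per_call < best)
            best = per_call;
    }

    *sink = sum;
    return best;
}
```

Pick reps large enough that each trial runs well above the clock resolution, and compare kernels on the same input sizes you expect in the real flow graph.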
When all is said and done, algorithmic changes often result in bigger
wins than SIMD assembler. Be sure to look there first. In our
case, the FFT based FIR code is faster than the hand-coded SIMD code
for pretty much all cases where ntaps >= 20.
Have fun!
Eric