Writing SIMD code with SSE

Hi all,

I was following the separate discussion on this list about writing
various trig functions using vector intrinsics. I googled for it, but
the top few results were for “old” processors, from when SIMD
intrinsics were new. The gcc documentation (my version is 4.1.2) has a
list of intrinsics but no description, not even one line per intrinsic.
As there is a need to optimize the codebase for new processors (Conroe,
Barcelona, etc.) anyway, can you please point me to some real
documentation on the subject? I would really appreciate any help.

As a related question, possibly a digression: given that these
extensions are the key to unlocking the full power of new processors
and yet are rather low level (we are still writing trig funcs), is
there any FLOSS library for SIMD math?


Rohit G.
Junior Undergraduate
Department of Physics
IIT Bombay

Hi,

You can find some documentation on the intrinsic functions on MSDN. The
information is also valid for GNU gcc:
http://msdn2.microsoft.com/en-us/library/y0dh78ez(VS.71).aspx

For libraries using SIMD, you can start by looking at BLAS on
Wikipedia. There are some links on that page to libraries that do
optimized linear algebra.
http://en.wikipedia.org/wiki/BLAS
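
For instance, a call into an optimized BLAS looks like this (just a
minimal sketch, assuming ATLAS or another cblas implementation is
installed; scale_add is a made-up wrapper name):

#include <cblas.h>

/* y = a*x + y on single-precision vectors; the BLAS does the SIMD work. */
void scale_add(int n, float a, const float *x, float *y)
{
    cblas_saxpy(n, a, x, 1, y, 1);
}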

Pascal

On Wed, Dec 12, 2007 at 11:51:20PM +0530, Rohit G. wrote:

Hi all,

I was following the separate discussion on this list about writing
various trig functions using vector intrinsics. I googled for it, but
the top few results were for “old” processors, from when SIMD
intrinsics were new. The gcc documentation (my version is 4.1.2) has a
list of intrinsics but no description, not even one line per intrinsic.

I believe those are 1-to-1 with the actual machine instructions.
See the Intel or AMD docs.
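
For instance (a small sketch using the Intel-style intrinsics from
<xmmintrin.h>, which gcc also accepts), each of these calls maps to a
single SSE instruction:

#include <xmmintrin.h>

void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);              /* movups */
    __m128 vb = _mm_loadu_ps(b);              /* movups */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));   /* addps, then movups */
}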

As there is a need to optimize the codebase for new processors (Conroe,
Barcelona, etc.) anyway, can you please point me to some real
documentation on the subject? I would really appreciate any help.

I’m not sure exactly what you’re looking for. Both Intel and AMD
have manuals about optimizing code for their microarchitectures.
You’ll find them somewhere on their developer sites.

Probably the biggest place that needs improvement is the trig functions.
I suggest starting with sin(x), cos(x) and sincos(x) for x a scalar
float, and a related version that computes 4 in parallel for x a
vector of 4 floats. I’d do two versions of each: SSE2 for x86 and
SSE2 for x86_64 (on x86_64 you’ve got twice as many registers to work
with).

We need them with something close to single-precision floating point
accuracy. You’ll need to figure out what input domain you’re willing to
accept; I’d say at a minimum +/- 4*pi.
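
Roughly, the 4-wide variant could look something like this (an
untested sketch only: reduce the argument to [-pi, pi], then evaluate
a polynomial; a real version needs a minimax polynomial and tighter
range reduction to get anywhere near full single precision, especially
close to +/- pi):

#include <emmintrin.h>

/* Approximate sin() of 4 packed floats at once (SSE2). */
static inline __m128 sin4_ps(__m128 x)
{
    const __m128 inv_2pi = _mm_set1_ps(0.15915494f);   /* 1/(2*pi) */
    const __m128 two_pi  = _mm_set1_ps(6.2831853f);

    /* Range reduction: x -= round(x/(2*pi)) * 2*pi, so x lands in [-pi, pi]. */
    __m128 k = _mm_cvtepi32_ps(_mm_cvtps_epi32(_mm_mul_ps(x, inv_2pi)));
    x = _mm_sub_ps(x, _mm_mul_ps(k, two_pi));

    /* Odd Taylor series x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, via Horner. */
    __m128 x2 = _mm_mul_ps(x, x);
    __m128 p  = _mm_set1_ps(1.0f / 362880.0f);
    p = _mm_add_ps(_mm_mul_ps(p, x2), _mm_set1_ps(-1.0f / 5040.0f));
    p = _mm_add_ps(_mm_mul_ps(p, x2), _mm_set1_ps(1.0f / 120.0f));
    p = _mm_add_ps(_mm_mul_ps(p, x2), _mm_set1_ps(-1.0f / 6.0f));
    p = _mm_add_ps(_mm_mul_ps(p, x2), _mm_set1_ps(1.0f));
    return _mm_mul_ps(x, p);
}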

As a related question, possibly a digression: given that these
extensions are the key to unlocking the full power of new processors
and yet are rather low level (we are still writing trig funcs), is
there any FLOSS library for SIMD math?

Not sure. Please check it out and let us know what you find.
There is of course the ATLAS stuff (optimized BLAS).

Eric

Hi!

The intrinsics are more or less C wrapper functions for assembler
instructions. You can find a detailed description here:

http://www.intel.com/products/processor/manuals/index.htm

SSE1 through SSE3 are supported by modern AMD and Intel processors.

There are many possible improvements, but you need to have
processor-specific selection of code.
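
A minimal sketch of such a runtime selection (assuming a gcc recent
enough to ship <cpuid.h>; the two convert_* functions are hypothetical
placeholders for a plain C and an SSE2 implementation):

#include <cpuid.h>

typedef void (*convert_fn)(const short *in, float *out, int n);

void convert_generic(const short *in, float *out, int n);  /* plain C fallback */
void convert_sse2(const short *in, float *out, int n);     /* SSE2 inner loop  */

convert_fn select_convert(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* CPUID leaf 1 returns the feature flags; SSE2 is bit 26 of EDX. */
    if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (edx & bit_SSE2))
        return convert_sse2;
    return convert_generic;
}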

An example for intrinsics:

typedef float v4sf __attribute__ ((vector_size (16)));
typedef short int v8hi __attribute__ ((vector_size (16)));
typedef int v4si __attribute__ ((vector_size (16)));

v4sf * o = static_cast<v4sf *>(buffer->write_pointer());
const v8hi * in = reinterpret_cast<const v8hi *>(usrp_buffer);
for (unsigned i = 0; i < nbytes; i += 16, o += 2, ++in) {
    const v8hi x = *in;

    /* Interleave each 16-bit sample with itself, shift right
       arithmetically by 16 to sign-extend, then convert the four
       32-bit ints to floats. */
    o[0] = __builtin_ia32_cvtdq2ps(
               __builtin_ia32_psradi128(
                   reinterpret_cast<v4si>(
                       __builtin_ia32_punpcklwd128(x,x)),16));
    o[1] = __builtin_ia32_cvtdq2ps(
               __builtin_ia32_psradi128(
                   reinterpret_cast<v4si>(
                       __builtin_ia32_punpckhwd128(x,x)),16));
}

This code snippet quickly converts the shorts the USRP delivers to
floats, using SSE. Note that it ignores byte order and assumes
little-endian. The buffer size is assumed to be a multiple of 16 bytes.

Dominik

Hi,

Thanks for these informative answers.

I believe those are 1-to-1 with the actual machine instructions.

This is exactly my point. Is there a SIMD math library which exports
vectorized functions like sin/cos/exp/log for real or complex numbers?
I mean one for those not wanting to program at the assembly level, like
the one IBM has for its Cell’s SPEs. I am sure Intel’s IPP and AMD’s
ACML provide these (I haven’t checked), but they obviously can’t be
used here.


Rohit G.
Junior Undergraduate
Department of Physics
IIT Bombay

Hi,

Just found a small error.

You should swap these two lines (note the h/l):

__builtin_ia32_punpcklwd128(x,x)),16));
__builtin_ia32_punpckhwd128(x,x)),16));

First …hwd, then …lwd.

Dominik

Hi!

I am currently struggling with the memory/cache performance of the
most-used inner loops in my code. I am pretty sure that most of the
clock cycles it spends are related to cache misses. I used VTune and
Cachegrind to analyze the code, but all I got was the information THAT
I frequently miss the cache; they don’t give a reason.

So, maybe you can point me to a good website, or give me a hint? Is
there a program that can tell me why this happens? E.g. for the Cell
processor, there is a static analysis tool that tells you everything
about your code: when did it stall, why did it stall, how many stall
cycles, etc.

Thanks
Dominik

Rohit G. wrote:

I mean one for those not wanting to program at the assembly level, like
the one IBM has for its Cell’s SPEs. I am sure Intel’s IPP and AMD’s
ACML provide these (I haven’t checked), but they obviously can’t be
used here.

The following might be good starting points:

http://simdx86.sourceforge.net/
http://liboil.freedesktop.org/wiki/

I remember seeing others in the past, but can’t find them now.

Matt