I was following the separate discussion on this list about writing
various trig functions using vector intrinsics. I googled for it; the
top few results I got were for "old" processors, from when SIMD intrinsics
were new. The gcc documentation (my version is 4.1.2) has a list of
intrinsics but no descriptions, not even one line per intrinsic. As
there is a need to optimize the codebase for new processors (Conroe,
Barcelona, etc.) anyway, can you please point me to some real
documentation on the subject? I would really appreciate any help.
As a related question, possibly a digression: given that these
extensions are the key to unlocking the full power of new processors, and
yet are rather low-level (we are still writing trig functions), is there
any FLOSS library for SIMD math?
–
Rohit G.
Junior Undergraduate
Department of Physics
IIT Bombay
For a library using SIMD, you can start by looking at BLAS on Wikipedia.
There are some links on that page to libraries doing optimized linear
algebra. http://en.wikipedia.org/wiki/BLAS
On Wed, Dec 12, 2007 at 11:51:20PM +0530, Rohit G. wrote:
Hi all,
I was following the separate discussion on this list about writing
various trig functions using vector intrinsics. I googled for it; the
top few results I got were for "old" processors, from when SIMD intrinsics
were new. The gcc documentation (my version is 4.1.2) has a list of
intrinsics but no descriptions, not even one line per intrinsic.
I believe those are 1-to-1 with the actual machine instructions.
See the intel or AMD docs.
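To illustrate the 1-to-1 mapping, here is a small sketch (not from the original thread; function and file names are mine) showing how a single SSE intrinsic corresponds to a single machine instruction:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Each intrinsic compiles to (essentially) one instruction:
 * _mm_add_ps below becomes a single addps, adding four
 * packed floats in parallel. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* movups: load 4 floats */
    __m128 vb = _mm_loadu_ps(b);             /* movups */
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* addps + movups */
}
```

So the gcc header is mostly a thin C veneer; the semantics of each intrinsic are documented under the corresponding instruction in the Intel/AMD instruction set references.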
As there is a need to optimize the codebase for new processors (Conroe,
Barcelona, etc.) anyway, can you please point me to some real
documentation on the subject? I would really appreciate any help.
I'm not sure exactly what you're looking for. Both Intel and AMD
have manuals about optimizing code for their microarchitectures.
You'll find them somewhere on their developer sites.
Probably the biggest place that needs improvement is the trig functions.
I suggest starting with sin(x), cos(x) and sincos(x) for x a scalar
float, and related versions that compute 4 in parallel for x a
vector of 4 floats. I'd do two versions of each: SSE2 for x86 and
SSE2 for x86_64 (on the 64-bit target you've got twice as many registers
to work with).
We need them with something close to single-precision floating-point
accuracy. You'll need to figure out what input domain you're willing to
accept; I'd say at a minimum +/- 4*pi.
As a related question, possibly a digression: given that these
extensions are the key to unlocking the full power of new processors, and
yet are rather low-level (we are still writing trig functions), is there
any FLOSS library for SIMD math?
Not sure. Please check it out and let us know what you find.
There is of course the ATLAS stuff (optimized BLAS).
The code snippet quickly converts the shorts the USRP delivers to floats,
using SSE. Actually, it ignores byte order and assumes little-endian.
The buffer size is supposed to be a multiple of 16 bytes.
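The snippet itself is not reproduced here, but a conversion of that shape would look roughly like this (my own hedged reconstruction, not the poster's code): load 8 shorts, sign-extend them to 32-bit integers by unpacking into the high halves of 32-bit lanes and arithmetic-shifting right, then convert to floats.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch: convert int16 samples (assumed little-endian,
 * as the USRP delivers them) to floats, 8 per iteration, i.e. one
 * 16-byte load per loop.  n is assumed to be a multiple of 8. */
static void shorts_to_floats(const int16_t *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(in + i));
        /* widen int16 -> int32 with sign: place each short in the
         * high half of a 32-bit lane, then shift right arithmetically */
        __m128i lo = _mm_srai_epi32(
            _mm_unpacklo_epi16(_mm_setzero_si128(), v), 16);
        __m128i hi = _mm_srai_epi32(
            _mm_unpackhi_epi16(_mm_setzero_si128(), v), 16);
        _mm_storeu_ps(out + i,     _mm_cvtepi32_ps(lo));
        _mm_storeu_ps(out + i + 4, _mm_cvtepi32_ps(hi));
    }
}
```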
I believe those are 1-to-1 with the actual machine instructions.
This is exactly my point. Is there a SIMD math library which exports
vectorized functions like sin/cos/exp/log for real or complex numbers?
I mean for those not wanting to program at the assembly level, like the
one IBM has for its Cell's SPEs. I am sure Intel's IPP and AMD's ACML
provide these (I haven't checked), but they obviously can't be used
here.
–
Rohit G.
Junior Undergraduate
Department of Physics
IIT Bombay
I am currently struggling with the memory/cache performance of the
most-used inner loops in my code. I am pretty sure that most of the
clock cycles it spends are related to cache misses. I used VTune and
Cachegrind to analyze the code, but all I got was the information THAT I
frequently miss the cache; they don't give a reason.
So, maybe you can point me to a good website, or give me a hint? Is
there a program that can tell me why this happens? E.g. for the Cell
processor, there is a static analysis tool that tells you everything
about your code: when it stalled, why it stalled, how many stall
cycles, etc.
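One pattern worth ruling out first (an illustrative example of mine, not from any post in this thread): strided access. Cachegrind will report the misses but not that the traversal order is the cause; the two functions below compute the same sum, but the column-order one touches a new cache line on nearly every load.

```c
#include <stddef.h>

#define N 1024

/* Column-order traversal: consecutive loads are N*8 bytes apart,
 * so nearly every access pulls in a fresh cache line. */
static double sum_by_column(const double a[N][N])
{
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Row-order traversal: unit stride, so each 64-byte cache line
 * serves 8 consecutive loads before the next one is fetched. */
static double sum_by_row(const double a[N][N])
{
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += a[i][j];
    return s;
}
```

Cachegrind's per-line D1/LL miss counts will point at the inner loop of sum_by_column; the "why" is the stride, which you have to read off the access pattern yourself.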