Re-writing blocks using Intel libraries

Hello,

We are working on some systems that require high sampling rates. I am
already using the Intel C++ compiler at the highest optimization level,
but a lot of the blocks are still very slow. It appears that Intel C++
does not properly vectorize the data types.

I have been replacing almost every low-level block with a functional
equivalent built on the Intel Integrated Performance Primitives (IPP).
These libraries are not GPL, but are free for noncommercial use under
Linux ($200 otherwise). At some point, I would like to contribute our
work back to GNU Radio. Would this fit with the gr philosophy? How
should we structure the code? (i.e. have a separate set of files, use
#defines, or …)?

Eugene

On Tue, Dec 11, 2007 at 10:13:32AM -0800, Eugene Grayver wrote:

Hello,

We are working on some systems that require high sampling rates. I am
already using the Intel C++ compiler at the highest optimization level,
but a lot of the blocks are still very slow. It appears that Intel C++
does not properly vectorize the data types.

General curiosity questions:

Are you using oprofile to measure performance?

What h/w platform are you running on / tuning for?

You’re not trying to run your app on a cache-crippled machine like a
Celeron, are you? ;-)

Which blocks are causing you the biggest problem?

Are your problems caused primarily by lack of CPU cycles, cache
misses or mis-predicted branches?

I have been replacing almost every low-level block with a functional
equivalent built on the Intel Integrated Performance Primitives (IPP).
These libraries are not GPL, but are free for noncommercial use under
Linux ($200 otherwise). At some point, I would like to contribute our
work back to GNU Radio. Would this fit with the gr philosophy? How
should we structure the code? (i.e. have a separate set of files, use
#defines, or …)?

Eugene

We would not accept the changes. Part of what we’re up to is building
an ever expanding universe of free code. Instead of using the
non-free IPP code, please consider using a free library such as ATLAS,
or help us find and fix performance challenges in a way that doesn’t
require non-free code. Also, are you sure that your performance
issues can’t be better addressed with an algorithmic change? If
you’re using a lot of very low-level blocks (e.g., add, multiply,
etc.) you’re probably better off writing a block that aggregates some
of the operations into a single block.
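
To illustrate the kind of aggregation I mean (a sketch with made-up
names, not code from the tree): instead of chaining separate multiply
and add blocks, one fused inner loop touches each sample once and
avoids the intermediate buffer entirely.

    #include <complex>

    typedef std::complex<float> gr_complex;  // as in GNU Radio

    // Fused scale-and-offset: one pass over the data instead of two
    // chained blocks with a buffer (and a scheduler hop) in between.
    void scale_and_offset(const gr_complex *in, gr_complex *out,
                          int n, gr_complex scale, gr_complex offset)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[i] * scale + offset;
    }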

Eric

Please see answers in-line.

Thanks!


Eric B. [email protected] wrote on 12/11/2007 02:31 PM:

On Tue, Dec 11, 2007 at 10:13:32AM -0800, Eugene Grayver wrote:

Hello,

We are working on some systems that require high sampling rates. I am
already using the Intel C++ compiler at the highest optimization level,
but a lot of the blocks are still very slow. It appears that Intel C++
does not properly vectorize the data types.

General curiosity questions:

Are you using oprofile to measure performance?

I am a bit of a maverick, and for various reasons am using a pure C++
environment. I hacked my own ‘connect_block’ function (can’t wait for
v3.2, where these will be part of native gr). I am measuring the
performance using a custom block (gr_throughput) that simply reports the
average number of samples processed per second.
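
The guts of it are roughly this (a generic sketch, not the actual
code):

    #include <sys/time.h>
    #include <cstdio>

    // Pass-through helper that reports average samples per second.
    class throughput_meter {
        double    d_t0;      // time of last report
        long long d_count;   // samples since last report
        static double now() {
            struct timeval tv;
            gettimeofday(&tv, 0);
            return tv.tv_sec + tv.tv_usec * 1e-6;
        }
    public:
        throughput_meter() : d_t0(now()), d_count(0) {}

        // Call from the block's work() with the item count it handled.
        void update(int nitems) {
            d_count += nitems;
            double dt = now() - d_t0;
            if (dt >= 1.0) {              // report about once a second
                std::printf("%.3g samples/sec\n", d_count / dt);
                d_t0 = now();
                d_count = 0;
            }
        }
    };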

What h/w platform are you running on / tuning for?

The platform is currently Intel Xeon or Core2 Duo.

You’re not trying to run your app on a cache-crippled machine like a
Celeron, are you? ;-)

No, very high end.

Which blocks are causing you the biggest problem?

I got a 2x improvement on all the filtering blocks. About a 40%
improvement for sine/cosine generation blocks. This includes gr_expj
and gr_rotate.

Are your problems caused primarily by lack of CPU cycles, cache
misses or mis-predicted branches?

I am not sure, since I am not at all a software expert (mostly
dsp/comm). My guess is that the SSE instructions are not being used (or
not used to a full extent). Even the ‘multiply’ block is VERY slow
compared to a vector x vector multiplication in the Intel library. Some
of the gr_blocks process each sample using a separate function call,
e.g.

for (int n = 0; n < noutput_samples; n++)
    scale(in[n]);

Replacing this with a single vectorized function call is much faster.
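
For example, the whole loop above collapses into one IPP call (from
memory, so check ipps.h for the exact signature):

    #include <ipps.h>   // Intel IPP signal-processing primitives

    // One call scales the whole buffer; IPP dispatches internally to
    // the best SSE code path for the CPU it finds at runtime.
    void scale_buffer(const Ipp32f *in, Ipp32f *out, int n, Ipp32f k)
    {
        ippsMulC_32f(in, k, out, n);
    }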

I have been replacing almost every low-level block with a functional
equivalent built on the Intel Integrated Performance Primitives (IPP).
These libraries are not GPL, but are free for noncommercial use under
Linux ($200 otherwise). At some point, I would like to contribute our
work back to GNU Radio. Would this fit with the gr philosophy? How
should we structure the code? (i.e. have a separate set of files, use
#defines, or …)?

Eugene

We would not accept the changes. Part of what we’re up to is building
an ever expanding universe of free code. Instead of using the
non-free IPP code, please consider using a free library such as ATLAS,
or help us find and fix performance challenges in a way that doesn’t
require non-free code. Also, are you sure that your performance
issues can’t be better addressed with an algorithmic change? If
you’re using a lot of very low-level blocks (e.g., add, multiply,
etc.) you’re probably better off writing a block that aggregates some
of the operations into a single block.

That’s what I expected. We’ll try to contribute the more dsp-centric
blocks such as demodulators.

Eric

General curiosity questions:

Are you using oprofile to measure performance?

I am a bit of a maverick, and for various reasons am using a pure C++
environment. I hacked my own ‘connect_block’ function (can’t wait for
v3.2, where these will be part of native gr). I am measuring the
performance using a custom block (gr_throughput) that simply reports
the average number of samples processed per second.

While pure C++ may be desirable for some reasons, performance is not
really one of them. When you use Python, it isn’t running anything that
is really performance-critical.

Which blocks are causing you the biggest problem?

I got a 2x improvement on all the filtering blocks.

That isn’t surprising. I believe our SSE filtering code was optimized
for prior generations of processors, so a new Core2-optimized version
would be useful, and likely competitive with IPP. Also, are you sure
that when you compile our code with Intel’s compiler you are even
getting the SSE versions? Or are the pure C++ versions called?

Another thing, which I believe was mentioned earlier – if you really
care about FIR filter performance, you should be using the FFT versions
of the filters. The difference in performance can be huge, making the
2x you get from IPP insignificant.
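
Rough numbers to make the point (my back-of-envelope, not measured):
a direct FIR costs about ntaps complex MACs per output sample, while an
overlap-save FFT filter of size N costs roughly

    (2 * N * log2(N)) / (N - ntaps + 1)

butterfly-scale operations per sample. For ntaps = 256 and N = 2048
that is 256 versus about (2 * 2048 * 11) / 1793 ~= 25 — an order of
magnitude, which dwarfs a 2x from IPP.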

About a 40% improvement for sine/cosine generation blocks. This
includes gr_expj and gr_rotate.

There is definitely room for improvement here.

for (int n = 0; n < noutput_samples; n++)
    scale(in[n]);

Replacing this with a single vectorized function call is much faster.

Those function calls should be inlined if nothing else.

In any case, GCC is not vectorizing this, but it would be trivial to
write it in SSE or intrinsics, which would allow this to be done in open
source code, without having to resort to IPP. That would be a very
useful contribution.
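
For the scale loop above, something like this (untested sketch; assumes
16-byte aligned buffers and n a multiple of 4 — a real block would also
handle the unaligned head and leftover tail):

    #include <xmmintrin.h>  // SSE intrinsics

    // Four floats per multiply instead of one call per sample.
    void scale_sse(const float *in, float *out, int n, float k)
    {
        __m128 vk = _mm_set1_ps(k);            // broadcast scale factor
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(in + i);    // load 4 samples
            _mm_store_ps(out + i, _mm_mul_ps(v, vk));
        }
    }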

Matt


Eugene Grayver wrote:

Please see answers in-line.
Which blocks are causing you the biggest problem?

I got a 2x improvement on all the filtering blocks. About a 40%
improvement for sine/cosine generation blocks. This includes gr_expj
and gr_rotate.

I should mention that gr_rotate’s performance can be greatly improved
by a simple change that, rather than rescaling the multiplier every
iteration, rescales every k iterations, e.g. k=1000. I think I have an
earlier mailing list post about this. IIRC, the patch didn’t go in
because there seemed to be no consensus about what k to use…
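
In code, the idea looks something like this (illustrative names, not
the actual gr_rotate class):

    #include <complex>

    // Rotator that renormalizes its phasor every k samples instead of
    // every sample, amortizing the magnitude correction.
    class rotator_sketch {
        std::complex<float> d_phase;   // running phasor, nominally |.|=1
        std::complex<float> d_incr;    // per-sample rotation
        int d_count;
        enum { RENORM_EVERY = 1000 };  // the "k" in question
    public:
        rotator_sketch(std::complex<float> incr)
          : d_phase(1.0f, 0.0f), d_incr(incr), d_count(0) {}

        std::complex<float> rotate(std::complex<float> in)
        {
            d_phase *= d_incr;
            if (++d_count == RENORM_EVERY) {
                d_phase /= std::abs(d_phase);   // pull |phase| back to 1
                d_count = 0;
            }
            return in * d_phase;
        }
    };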

That’s what I expected. We’ll try to contribute the more dsp-centric
blocks such as demodulators.

That said, if you put the code (and/or the modified makefiles) up
somewhere, I’m sure there are some users that would benefit even if it
doesn’t make it into the main release.

-Dan

Eric B. wrote:


That would be great! Or if you want to code up an SSE Taylor series
expansion for sine/cosine good to 23-bits or so, we’d love that too ;-)

I am working on this in the little spare time I have.
I already got an SSE Taylor series for atan2 working in GNU Radio.
The atan2 needs some code cleanup and wrapper code to switch
implementations (if processor is x86 and supports SSE2 => optimized,
else generic).
The sin/cos is far from ready.
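
The wrapper logic I have in mind is roughly this (sketch only;
cpu_supports_sse2() stands in for a hypothetical CPUID helper, and the
real SSE2 body is omitted):

    #include <cmath>
    #include <cstddef>

    bool cpu_supports_sse2();  // hypothetical CPUID check (not shown)

    // Portable fallback.
    static void atan2_generic(const float *y, const float *x,
                              float *out, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            out[i] = std::atan2(y[i], x[i]);
    }

    // Stand-in for the SSE2 Taylor-series version.
    static void atan2_sse2(const float *y, const float *x,
                           float *out, size_t n)
    {
        atan2_generic(y, x, out, n);  // would use SSE2 intrinsics
    }

    // Choose an implementation based on the CPU we are running on.
    void fast_atan2(const float *y, const float *x, float *out, size_t n)
    {
    #if defined(__i386__) || defined(__x86_64__)
        if (cpu_supports_sse2()) {
            atan2_sse2(y, x, out, n);
            return;
        }
    #endif
        atan2_generic(y, x, out, n);
    }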

Greetings,
Martin

On Tue, Dec 11, 2007 at 04:03:28PM -0800, Dan H. wrote:

…rather than rescaling the multiplier every iteration, rescales every
k iterations, e.g. k=1000. I think I have an earlier mailing list post
about this. IIRC, the patch didn’t go in because there seemed to be no
consensus about what k to use…

Oops, sorry about that. Let me dig through the archived discussion
and we’ll get it in.

Eric

On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:

Please see answers in-line.

Thanks!

General curiosity questions:

Are you using oprofile to measure performance?

I am a bit of a maverick, and for various reasons am using a pure C++
environment. I hacked my own ‘connect_block’ function (can’t wait for
v3.2, where these will be part of native gr).

The trunk contains C++ code for connect, hier_block2, etc. Some of
the pieces that are still missing include C++ support for the USRP
daughterboards, but Johnathan C. is working on that now.

I am measuring the performance using a custom block (gr_throughput)
that simply reports the average number of samples processed per
second.

I got a 2x improvement on all the filtering blocks.

If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
or the gr_fir_filter* blocks? The FFT ones are much faster, with a
break-even point around 16 taps IIRC.

About a 40% improvement for sine/cosine generation blocks. This
includes gr_expj and gr_rotate.

No surprise there, and that’s a great example of SIMD code that should
be in GNU Radio.

Are your problems caused primarily by lack of CPU cycles, cache
misses or mis-predicted branches?

I am not sure, since I am not at all a software expert (mostly dsp/comm).
My guess is that the SSE instructions are not being used (or not used to a
full extent). Even the ‘multiply’ block is VERY slow compared to a vector
x vector multiplication in the Intel library.

OK.

Some of the gr_blocks process each sample using a separate function
call, e.g.

for (int n = 0; n < noutput_samples; n++)
    scale(in[n]);

Replacing this with a single vectorized function call is much faster.

OK.

We would not accept the changes.

That’s what I expected. We’ll try to contribute the more dsp-centric
blocks such as demodulators.

That would be great! Or if you want to code up an SSE Taylor series
expansion for sine/cosine good to 23-bits or so, we’d love that too ;-)
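
To be concrete about what I’m hoping for (a scalar toy, my own sketch;
the SSE flavor would evaluate the same polynomial on 4 floats at once,
after range reduction into [-pi/4, pi/4]):

    #include <cmath>

    // Taylor sine through the x^9 term: error is a few parts in 1e9
    // on [-pi/4, pi/4], i.e. comfortably better than 23 bits.
    float sin_taylor(float x)   // assumes |x| <= pi/4
    {
        float x2 = x * x;
        // x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, Horner form
        return x * (1.0f + x2 * (-1.0f/6 + x2 * (1.0f/120
                 + x2 * (-1.0f/5040 + x2 * (1.0f/362880)))));
    }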

Thanks for telling us about your experience.

Eric

Tom R. wrote:

I am working on this in the little spare time I have.

Martin,

Bob put in a fast atan function (general/gr_fast_atan2f.cc) about a year
ago. Have you looked at this, and is the Taylor performance better?
The Taylor performance is much better when you get (a multiple of) 4
atan2s at a time, because the SSE Taylor series works with SIMD in
blocks of 4. When you only get one at a time, the performance is still
better, but not by much.
The Taylor series is also more precise than gr_fast_atan2f.cc. I don’t
have the numbers at hand, but I also wrote qa and benchmark code, so
exact numbers on precision and speed can be determined.

As a side note:
I have also been working on a new version of the FFT FIR filter.
This one is more efficient when decimating:
inverse_FFT_size = forward_FFT_size / decimation
This works very well when decimation is 2^n; it also works well for most
other decimation factors, EXCEPT when decimation is a big prime.

This means the theoretical maximum speed improvement is a factor of two
(when decimation is infinite). But when you want multiple parts of the
spectrum, the speed improvement is much better than using a FIR filter
per spectrum part. Then you can use a single forward FFT with multiple
inverse FFTs.
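
The key step, for the curious (a sketch with invented names, not my
actual code): after the frequency-domain multiply, alias the long
spectrum down to N/d points by summing its d segments, then run the
small inverse FFT.

    #include <complex>
    #include <vector>

    // Fold an N-point spectrum to N/d points. Decimating by d in time
    // equals summing the d length-N/d segments of the spectrum (and
    // scaling by 1/d), so the inverse FFT only needs N/d points.
    std::vector<std::complex<float> >
    fold_spectrum(const std::vector<std::complex<float> > &X, int d)
    {
        int n_out = X.size() / d;        // assumes d divides the size
        std::vector<std::complex<float> > Y(n_out);
        for (int k = 0; k < n_out; k++) {
            for (int m = 0; m < d; m++)
                Y[k] += X[k + m * n_out];
            Y[k] /= (float)d;
        }
        return Y;
    }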

Greetings,
Martin

We really need a faster sin/cos. Glad to hear you’re working on it.

Tom

Martin D. wrote:

That would be great! Or if you want to code up an SSE Taylor series
expansion for sine/cosine good to 23-bits or so, we’d love that too ;-)

I am working on this in the little spare time I have.
I already got an SSE Taylor series for atan2 working in GNU Radio.
The atan2 needs some code cleanup and wrapper code to switch
implementations (if processor is x86 and supports SSE2 => optimized,
else generic).
The sin/cos is far from ready.

Greetings,
Martin

Martin,

Bob put in a fast atan function (general/gr_fast_atan2f.cc) about a year
ago. Have you looked at this, and is the Taylor performance better?

We really need a faster sin/cos. Glad to hear you’re working on it.

Tom