Signal processing and GPU

Has anyone tried using a GPU for signal processing?
Does anyone know of a reason why this would not be a good idea?
I am planning on starting on it this week and would appreciate any input.

Thanks
Inderaj

Has anyone tried using a GPU for signal processing?

I had a look a while ago, but didn’t get too far. If I remember correctly,
I started out from a library written by some folks at Stanford, if that’s
any help to you.

Does anyone know of a reason why this would not be a good idea?
I am planning on starting on it this week and would appreciate any input.

I think it sounds useful. I know there is password recovery software
that uses the GPU for calculations. If you get further than I did, please
don’t keep it to yourself ;-)

BR
//Mattias K.

We’ve got some people doing GNU Radio and GPU stuff. So far, we are
having great luck with the combo. I think the only drawback is that our
incoming block size can be no larger than 8191 samples, so we have to
have our own buffers to collate the data before processing. There may
be a workaround for this, but we just haven’t gotten that far yet.
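
For illustration only, here is a minimal numpy sketch of that collating idea
(the class and its parameters are made up for this example, not taken from
their code): small incoming chunks are accumulated into one larger host
buffer, and only a full batch is handed on for GPU processing.

import numpy as np

class Collator:
    """Accumulate small chunks (e.g. <= 8191 samples) into larger batches."""
    def __init__(self, batch_size=65536, dtype=np.complex64):
        self.batch_size = batch_size
        self.buf = np.empty(batch_size, dtype=dtype)
        self.fill = 0

    def push(self, chunk):
        """Append a small chunk; return a full batch when one is ready, else None."""
        n = min(len(chunk), self.batch_size - self.fill)
        self.buf[self.fill:self.fill + n] = chunk[:n]
        self.fill += n
        if self.fill == self.batch_size:
            batch = self.buf.copy()
            # leftover samples from this chunk start the next batch
            rest = chunk[n:]
            self.buf[:len(rest)] = rest
            self.fill = len(rest)
            return batch
        return None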

Isaac

I had a look a while ago, but didn’t get too far. If I remember correctly,
I started out from a library written by some folks at Stanford, if that’s
any help to you.

Have you looked at OpenCL? I think that is the best portable way to go.
I’ve looked at some sample code but haven’t studied it well enough to
understand 100% how it works, but apparently you write the algorithm in a
C-like language and it gets compiled to run on multiple GPU cores and/or
multiple CPU cores, or a combination of whatever you happen to have. Apple
is supporting this now, but there are more vendors pushing it: Nvidia, AMD
and Intel are all on board.
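
As a hedged illustration of that model (kernel written in a C-like language,
compiled at run time for whatever devices are present), here is a small
sketch using the pyopencl bindings; pyopencl is not mentioned in the thread
and is only used here to keep the host code short.

import numpy as np
import pyopencl as cl

# The kernel itself is written in OpenCL's C-like language and compiled at
# run time for the device(s) that happen to be available (GPU or CPU).
kernel_src = """
__kernel void scale(__global const float *src, __global float *dst, const float gain)
{
    int i = get_global_id(0);   /* one work-item per sample */
    dst[i] = gain * src[i];
}
"""

ctx = cl.create_some_context()      # picks a GPU or CPU device
queue = cl.CommandQueue(ctx)
prg = cl.Program(ctx, kernel_src).build()

samples = np.random.randn(8192).astype(np.float32)
result = np.empty_like(samples)

mf = cl.mem_flags
in_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=samples)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, result.nbytes)

# Launch one work-item per sample, then copy the result back to the host.
prg.scale(queue, samples.shape, None, in_buf, out_buf, np.float32(2.0))
cl.enqueue_copy(queue, result, out_buf)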

More info here:

http://www.khronos.org/news/press/releases/khronos_launches_heterogeneous_computing_initiative/

Chris Albertson
Redondo Beach, California

On Mon, 2008-10-20 at 15:03 -0700, [email protected] wrote:

Has anyone tried using a GPU for signal processing?
Does anyone know of a reason why this would not be a good idea?
I am planning on starting on it this week and would appreciate any input.

I have been working on this for quite some time now.
I did a GLSL implementation a few years back, but it didn’t perform that
well and had some severe limitations.

So I started over this year and have reimplemented a major part of
GnuRadio using CUDA.
It is a one-to-one implementation:
every gr_something block is replaced with a cuda_something block.

My work-in-progress code is at:
http://gnuradio.org/trac/browser/gnuradio/branches/developers/nldudok1/gpgpu-wip

The majority of the code is an unmodified GnuRadio checkout from a few
months back.

There are some important changes in gnuradio_core/src/lib/runtime
to support CUDA device memory as an emulated circular buffer.

I also implemented a gr.check_compare block which expects two input
streams and checks whether they output the same data.
I use this to check that my CUDA blocks do exactly the same thing as the
gr blocks.

All the rest of the CUDA code is in gr_cuda.
gr_cuda has to be configured and built separately.
It contains the CUDA reimplementations of some GnuRadio blocks.

Then there are new blocks cuda_to_host and host_to_cuda which copy
memory from and to the GPU device memory.
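
A hedged sketch of how these pieces might be wired together in a flow
graph, comparing a stock gr block against its CUDA counterpart via
gr.check_compare; the `cuda` import path and all constructor arguments for
the branch’s blocks are guesses, not taken from the actual code.

from gnuradio import gr
import cuda   # hypothetical Python module name for the branch's CUDA blocks

tb = gr.top_block()

src  = gr.sig_source_c(1e6, gr.GR_COS_WAVE, 1e3, 1.0)   # test signal
head = gr.head(gr.sizeof_gr_complex, 1000000)           # limit the run
taps = gr.firdes.low_pass(1.0, 1e6, 100e3, 10e3)

cpu_fir = gr.fir_filter_ccf(1, taps)          # reference CPU implementation
gpu_fir = cuda.fir_filter_ccf(1, taps)        # cuda_something counterpart (guess)
to_dev  = cuda.host_to_cuda(gr.sizeof_gr_complex)   # copy samples to GPU memory (guess)
to_host = cuda.cuda_to_host(gr.sizeof_gr_complex)   # copy results back (guess)
check   = gr.check_compare(gr.sizeof_gr_complex)    # constructor signature is a guess

tb.connect(src, head)
tb.connect(head, cpu_fir, (check, 0))                       # CPU path
tb.connect(head, to_dev, gpu_fir, to_host, (check, 1))      # GPU path
tb.run()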

All python scripts to test and use the code are in /testbed.

The code in testbed is changing on a day-by-day basis.

There are several issues to be well aware of when doing SDR on a GPU.

-overhead
  -call overhead
  -copying data from and to the GPU
  You need to do a lot of work on the GPU in one call to have any benefit.
-circular buffers
  -GPU memory can’t be mmapped into a circular buffer
  -solution 1: use copying to emulate a circular buffer (a sketch follows below)
  -solution 2: keep track of all the processing and write your own
   intelligent scheduler which does not need a circular buffer
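
A minimal numpy sketch of "solution 1": since device memory cannot be
double-mapped the way GnuRadio’s host circular buffers are, contiguity
across the wrap point is emulated by copying. In the real branch the copies
would presumably be cudaMemcpy calls on device memory; plain numpy stands in
here, and the class is purely illustrative.

import numpy as np

class EmulatedCircularBuffer:
    """Linear buffer with a slack region that is refilled by copying."""
    def __init__(self, size, slack, dtype=np.complex64):
        self.size = size
        self.buf = np.zeros(size + slack, dtype=dtype)  # slack region at the tail
        self.wr = 0

    def write(self, items):
        n = len(items)
        end = self.wr + n
        if end <= self.size:
            self.buf[self.wr:end] = items
        else:                       # split the write across the wrap point
            first = self.size - self.wr
            self.buf[self.wr:self.size] = items[:first]
            self.buf[0:n - first] = items[first:]
        self.wr = end % self.size

    def read(self, start, n):
        """Return n contiguous items starting at logical index `start`."""
        if start + n > self.size:   # window wraps: copy the head into the slack
            wrapped = start + n - self.size
            self.buf[self.size:self.size + wrapped] = self.buf[0:wrapped]
        return self.buf[start:start + n]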

-threads: with CUDA you can’t access GPU device memory from different
host threads, so make sure you create, use and destroy all device memory
from the same thread. (The standard GnuRadio scheduler does not do it
like this.)

-debugging: Debugging is hard and works quite differently from normal
debugging.

-parallel: The GPU is good at doing calculations in parallel which are
not dependent on each other. For this reason a FIR will perform well,
while an IIR will perform badly: an IIR can only use one processing block
of the GPU, instead of 128.
It can still be beneficial to do the IIR on the GPU when all your other
blocks are running on the GPU, because then you don’t have to copy all the
samples to the CPU, do the IIR on the CPU and copy everything back to the
GPU. (A toy comparison follows below.)
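
A toy numpy comparison of that point: every FIR output depends only on the
input, so outputs can be computed independently (one GPU thread per sample),
while a first-order IIR output depends on the previous output and is
inherently sequential.

import numpy as np

x = np.random.randn(1 << 16).astype(np.float32)

# FIR: y[n] = sum_k h[k] * x[n-k]  -- every y[n] depends only on the input,
# so all outputs can be computed independently (trivially parallel across n).
h = np.array([0.25, 0.5, 0.25], dtype=np.float32)
y_fir = np.convolve(x, h)

# IIR: y[n] = x[n] + a * y[n-1]  -- y[n] needs y[n-1] first, so the
# recurrence cannot simply be split across GPU threads.
a = 0.9
y_iir = np.empty_like(x)
prev = 0.0
for n in range(len(x)):
    prev = x[n] + a * prev
    y_iir[n] = prev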

All that said, I do have a complete WFM receiver which runs entirely on
the GPU (using FIR and/or FFT filters, quadrature_demod and fm-deemph).
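
For a rough idea of what such an all-on-GPU WFM chain could look like as a
flow graph, here is a hedged sketch; the block names only mirror the
cuda_something convention described above, and every constructor argument
(sample rates, decimation, de-emphasis parameters) is a guess rather than
something taken from the branch.

from gnuradio import gr
import cuda   # hypothetical module name from the gpgpu-wip branch

tb = gr.top_block()
src = gr.file_source(gr.sizeof_gr_complex, "wfm_capture.dat")   # placeholder input

to_dev    = cuda.host_to_cuda(gr.sizeof_gr_complex)             # samples go to the device once
chan_filt = cuda.fir_filter_ccf(4, gr.firdes.low_pass(1.0, 1e6, 100e3, 25e3))
demod     = cuda.quadrature_demod_cf(1.0)                       # FM demodulation on the device
deemph    = cuda.fm_deemph(250e3, 75e-6)                        # de-emphasis on the device (guessed args)
to_host   = cuda.cuda_to_host(gr.sizeof_float)                  # only the audio comes back
sink      = gr.null_sink(gr.sizeof_float)                       # audio resampling/sink omitted

tb.connect(src, to_dev, chan_filt, demod, deemph, to_host, sink)
tb.run()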

At the moment it is not running faster than on the CPU, mainly because
of the call overhead (too few work items done per call) and the extra
copying done to emulate circular buffers.

I can increase the amount of work done per call by using
output_multiple, but with the current scheduling code the flow graph can
then hang; this needs work, so the performance will change in the future.
First I want to make sure everything is working as expected.

If I benchmark a single block with a big output_multiple then I do see
performance increases.
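
The output_multiple mechanism referred to here is set_output_multiple():
asking the scheduler to hand a block work only in large multiples so that
each GPU call amortises its launch overhead. A tiny sketch using the Python
block API from later GNU Radio releases (which did not exist when this
thread was written):

import numpy as np
from gnuradio import gr

class batched_passthrough(gr.sync_block):
    """Toy block that asks the scheduler for work in large multiples."""
    def __init__(self, batch=4096):
        gr.sync_block.__init__(self, name="batched_passthrough",
                               in_sig=[np.complex64], out_sig=[np.complex64])
        # The scheduler will now only call work() with a multiple of `batch`
        # output items, so a (hypothetical) GPU kernel launched from work()
        # would amortise its per-call overhead over more samples.
        self.set_output_multiple(batch)

    def work(self, input_items, output_items):
        n = len(output_items[0])
        output_items[0][:] = input_items[0][:n]
        return n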

Greetings,
Martin

— Martin DvH [email protected] wrote:

At the moment it is not running faster than on the CPU, mainly because
of the call overhead (too few work items done per call) and the extra
copying done to emulate circular buffers.

Do share with us which video card you are using and
roughly how much slower the GPU WFM is.

Regards,
Hew