GNURadio and CUDA reprised

Greetings,

I’ve begun to look into accelerating GNURadio applications with Nvidia CUDA
GPUs, and have scanned through the archives of the discussion list. I had two
questions on the topic:

  1. Is the CUDA-GNURadio port done by Martin DvH circa 2008 still
    available and runnable? All links I’ve seen are broken.

  2. Many of the results I’ve seen, both here and elsewhere, suggest that
    CUDA is not typically applicable to general GNURadio applications. It
    has worked in specific cases, but only where the data throughput
    requirements are very high and the algorithms are extremely
    parallelizable. Most of the discussion seems rather dated, with nothing
    after the introduction of Fermi (GTX 400/500 and Tesla 20x0 series)
    cards. Does anyone have any comments on the feasibility of
    these cards for GNURadio applications? Some of the major relevant
    improvements are the ability to concurrently schedule multiple kernels
    and asynchronously perform memory transfers.

Thank you,

Andrew H.

On 11.01.2011 23:13, Andrew H. wrote:

I’ve begun to look into accelerating GNURadio applications with Nvidia CUDA
GPUs, and have scanned through the archives of the discussion list. I had two
questions on the topic:

  1. Is the CUDA-GNURadio port done by Martin DvH circa 2008 still
    available and runnable? All links I’ve seen are broken.

Is CUDA really suitable? There is a certain overhead in data communications.
CUDA is only useful if it can compute complex things without much
communicating, but a data-streaming application needs lots of I/O.
The CPU with SSE is also very fast at things like FFT.
I made some experiments with CUDA, but they were not very successful:
far below the peak FLOPS you get in benchmarks.
But I’m not an experienced programmer …

  2. Much of the results I’ve seen, both here and elsewhere, suggest that
    CUDA is not typically applicable to general GNURadio applications. It
    has worked in specific cases, but only where the data throughput
    requirements are very high and the algorithms are extremely
    parallelizable.
Yes, I had the same experience. I tried to let CUDA do a one-dimensional
FFT. It was slower than on the CPU and had a large communication overhead.
Maybe it would do better with larger FFT sizes, with 2D FFTs, or with
better programming …
In contrast, the sample programs were very fast, but also very specialized:
things like fractal computation, image processing, or particle physics.

Does anyone have any comments on the feasibility of these cards for
GNURadio applications? Some of the major relevant improvements are the
ability to concurrently schedule multiple kernels and asynchronously
perform memory transfers.

I think the important point is that the kernels have to do a lot of
computation relative to the data-transfer work. A 1D FFT is not very
compute-intensive relative to the amount of data shifted around. What
kind of algorithm do you want to port to CUDA?

On 01/12/2011 08:44 AM, Moeller wrote:

On 11.01.2011 23:13, Andrew H. wrote:

I’ve begun to look into accelerating GNURadio applications with Nvidia CUDA
GPUs, and have scanned through the archives of the discussion list. I had two
questions on the topic:

  1. Is the CUDA-GNURadio port done by Martin DvH circa 2008 still
    available and runnable? All links I’ve seen are broken.

Is CUDA really suitable? There is a certain overhead in data communications.
CUDA is only useful if it can compute complex things without communicating.

True.

But with DMA it’s still faster when you compute things like long
filters. Or if you have a wideband signal and you want to split it into
several narrow-band signals, it can compute way faster than SSE.

Another advantage is that it’s all done in parallel with the CPU
(including most data movements if done properly), so you can work on the
demodulation in the CPU and let the GPU do all the pre-filtering/signal
shaping for you.

But as you noted, it’s more for the case where your code already works
“off-line” and you want to make it work in real time on
big data streams.
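That CPU/GPU pipelining can be sketched on the host side with an ordinary queue. This is a toy illustration only: the "kernels" below are stand-ins I made up, not anything from GNURadio or CUDA.

```python
import queue
import threading

def gpu_prefilter(block):
    # Stand-in for a GPU pre-filtering kernel (illustrative only).
    return [s * 0.5 for s in block]

def cpu_demodulate(block):
    # Stand-in for the CPU-side demodulation work.
    return sum(block)

def gpu_stage(inq, outq):
    # Runs concurrently with the main thread, like an async GPU stream.
    while True:
        block = inq.get()
        if block is None:          # shutdown sentinel
            outq.put(None)
            return
        outq.put(gpu_prefilter(block))

inq, outq = queue.Queue(), queue.Queue()
threading.Thread(target=gpu_stage, args=(inq, outq), daemon=True).start()

for i in range(8):
    inq.put([float(i)] * 16)       # feed blocks to the "GPU" stage
inq.put(None)

results = []
while (block := outq.get()) is not None:
    results.append(cpu_demodulate(block))   # CPU works while GPU produces
```

In a real system the queues would be bounded to provide back-pressure, and the "GPU" thread would be asynchronous memcpy + kernel launches on a CUDA stream.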

Cheers,

Sylvain

I have a feeling – from working with OpenCL for a while now (though not
yet in GNU Radio), watching profiling information (how long it takes to
move data around, how long kernels take to get queued and executed) –
that what folks here have written is mostly true: there is -significant-
overhead, so the number of computations must be quite high to make using
a GPU “better” than just the CPU (if one evaluates “better” purely in
terms of throughput, ignoring the fact that the GPU executes
asynchronously w.r.t. the CPU and hence the combined “system” is
generally faster overall than the CPU alone).

I think that if a GPU can be used, it will be most effective in things
like filterbanks, searching for packets (via their unique sync sequence,
so matched filtering), or very large FIR filters – places where a LOT of
computations and data must be processed and can be parallelized easily.
In my initial testing, something “simple” such as “c = a + b” is
probably better left to vector units (e.g., use VOLK once it’s fully
functional) – but, as above, if timing constraints can be met then the
GPU can work in parallel with the CPU and hence increase system
throughput somewhat even for such simple tasks.

More as I understand and program it; really, I’m still at the beginning
of heading down this road … If folks do make progress, I hope they post
to this list for those of us interested in this topic. - MLD
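A back-of-the-envelope version of that trade-off. The figures below are my own illustrative assumptions (roughly ~5 GB/s over PCIe and ~500 GFLOP/s sustained on the GPU), not measurements from this thread:

```python
PCIE_BYTES_PER_SEC = 5e9    # assumed host<->device bandwidth
GPU_FLOPS_PER_SEC = 500e9   # assumed sustained GPU compute rate

def gpu_round_trip(n_samples, flops_per_sample, bytes_per_sample=8):
    """Seconds to copy a block in, compute on it, and copy it back."""
    transfer = 2 * n_samples * bytes_per_sample / PCIE_BYTES_PER_SEC
    compute = n_samples * flops_per_sample / GPU_FLOPS_PER_SEC
    return transfer + compute

# "c = a + b" is ~1 FLOP per sample: the transfers dwarf the compute.
cheap = gpu_round_trip(1_000_000, flops_per_sample=1)

# A 1024-tap complex FIR is ~8192 FLOPs per sample: compute finally
# dominates -- the "LOT of computations" regime described above.
fir = gpu_round_trip(1_000_000, flops_per_sample=8192)
```

With these assumed numbers the cheap kernel spends over 99% of its time on the bus, while the big FIR spends most of it computing, which matches the intuition in the message above.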

Has anyone thought about something like Apple’s Core Image for signal
processing? Core Image lets you express image filters in a C-like filter
language (a subset of GLSL). You chain a set of filters together to
achieve the desired effect and then at runtime Core Image uses an LLVM
compiler to generate optimized code for your GPU or CPU using whatever
vector capabilities it has. LLVM supports a growing list of back-end
hardware. There is even some work targeting FPGAs.

-Marc

On Wed, Jan 12, 2011 at 2:44 AM, Moeller [email protected] wrote:


I’ve done some work with both CUDA and GNURadio, and I think there’s
definitely some potential there for using them jointly, but only for
certain applications, and only if the software is architected
intelligently.

GPUs are incredibly powerful, with 1+ TFLOPS of compute and 100+ GB/s of
memory bandwidth within the GPU. I’ve used GPUs to perform real-time
signal processing on 300+ MHz of continuously-streaming data without
dropping a sample. But the PCI bus bandwidth of ~5 GB/s can sometimes be
a real bottleneck, so you have to design accordingly.

You DON’T want to make individual drop-in CUDA replacements for
multiple GNURadio processing blocks in a chain. It doesn’t make any
sense to send data to the GPU, perform an operation (e.g. filtering),
bring the result back to the host, send some more data to the GPU,
perform a second operation, bring the data back, etc. The PCI transfers
will eat you alive. The key is to send large chunks (tens or hundreds of
MB) of data to the GPU and do as much computation as possible while it’s
there. Large batched FFTs, wideband frequency searches, channelizing –
it’s all gravy. It’s great if you can stream wideband data to the GPU,
have it do some computationally intensive stuff, perform a rate
reduction, then stream the lower-bandwidth data back to the host for
further (annoyingly serial) operations. You could even (if you wanted
to) implement an entire transmitter or receiver within the GPU, with the
CPU solely shuttling data to or from the ADC/DAC.
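The “stream wideband in, reduce the rate on the device, trickle narrowband out” pattern can be mocked up in plain NumPy. This is a crude brick-wall channelizer standing in for the GPU kernels; the function is mine, not from any GNURadio port:

```python
import numpy as np

def extract_channel(wideband, n_channels, channel):
    """Grab one of n_channels equal sub-bands from a large batch:
    FFT the whole block, keep that channel's bins, IFFT at the
    reduced rate. The output is n_channels times smaller than the
    input, so the copy back to the host is correspondingly cheap."""
    spectrum = np.fft.fft(wideband)
    bins = len(wideband) // n_channels
    start = channel * bins
    return np.fft.ifft(spectrum[start:start + bins]) / n_channels

# One big batch in (2**20 complex samples), one narrow channel out (2**16).
x = np.random.randn(1 << 20) + 1j * np.random.randn(1 << 20)
y = extract_channel(x, n_channels=16, channel=3)
```

On an actual GPU the FFT, bin selection, and IFFT would all run on the device, and only `y` would cross the PCI bus back to the host.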

In summary, yes, please do get excited about CUDA/OpenCL – it’s great
technology. When the USRP 9.0 comes out with a gigasample ADC/DAC, GPUs
will be there, ready to do the heavy lifting :-)

-Steven

On Wed, Jan 12, 2011 at 9:56 AM, Steven C. [email protected]
wrote:


Steven,

That’s great information, and roughly along the lines of what I was
going to say (sans the example of doing 300 MHz of processing, since I
haven’t done anything that wide on it).

I wanted to throw out another idea that no one seems to be bringing up,
and it relates to an earlier comment about how CUDA is limited because
of the bus transfers. It’s not CUDA that imposes that limit but the
architecture of the machine, with the host (CPU) and device (GPU)
separated by a bus. That has nothing to do with CUDA as a language.

But I keep thinking about the new Tegra from nVidia and, to a lesser
extent, Sandy Bridge from Intel. These are showing a trend of moving
GPUs and CPUs together onto the same die. Sandy Bridge isn’t really
exciting from this perspective (yet), since its GPU core isn’t very
powerful and (I don’t believe) CUDA-enabled. My point is, though, that the
trend is exciting, and we are starting to see architectures that are
moving away from the bus issues that are the biggest problems with GPU
programming right now. Any effort spent now on working on GPU
programming I think will have legs far into the future as the
architectures become more amenable to our kind of problems.

Currently, though, GPUs still have a place for certain applications,
even in signal processing and radio. They are not a panacea for
improving the performance of all signal processing applications, but
if you understand the limitations and where they benefit you, you can
get some really good gains out of them. I’m excited about anyone
researching and experimenting in this area and very hopeful for the
future use of any knowledge and expertise we can generate now.

Tom

On Wed, Jan 12, 2011 at 11:03 AM, Tom R. [email protected]
wrote:

I wanted to throw out another idea that no one seems to be bringing up,
and it relates to an earlier comment about how CUDA is limited because
of the bus transfers. It’s not CUDA that imposes that limit but the
architecture of the machine, with the host (CPU) and device (GPU)
separated by a bus. That has nothing to do with CUDA as a language.

I think the notion that the language is not the barrier (the hardware
architecture is) is precisely why I personally am more excited about
OpenCL as a language than CUDA per se. CUDA is inherently tied to
nVidia hardware, and while it is conceivable that CUDA will end up being
supported on a wider variety of CPU/GPU architectures (e.g. the
recently announced ‘Project Denver’), I don’t imagine it will ever
find support on non-nVidia hardware. OpenCL, on the other hand, is
enjoying support from a wide variety of hardware vendors (AMD/ATI,
nVidia, IBM, Intel, Apple, etc.), and was designed to run on a wide
variety of architectures (including a mix of CPUs, GPUs,
accelerator/DSP boards, etc.). In the long run it seems to me a much
better environment for dealing with heterogeneous computing, without
raising any serious concerns about being tied to a single vendor.

Currently, though, GPUs still have a place for certain applications,
even in signal processing and radio. They are not a panacea for
improving the performance of all signal processing applications, but
if you understand the limitations and where they benefit you, you can
get some really good gains out of them. I’m excited about anyone
researching and experimenting in this area and very hopeful for the
future use of any knowledge and expertise we can generate now.

Tom

Agreed. Having spent some time working with OpenCL on GPUs to solve a
different sort of problem, I completely agree they are both powerful
and not a silver bullet.

I would like to echo some of the previous comments: replacing single
processing blocks in a flowgraph with a drop-in CUDA/OpenCL replacement
is not likely to lead to any significant gains. It may relieve some of
the work the CPU has to do (and thus be a net gain in terms of total
samples that can be processed without dropping any on the floor), but I
suspect Steven is correct: the big gains will be made either in
applications requiring large filtering/channelizers/etc. or with
complete RX and/or TX chains written in OpenCL, with GNURadio merely
acting as a shuttle between the USRPx/UHD-enabled source/sink and the
smaller trickle of bits coming back out (or going in).

If that is the case, I think the follow-on question becomes: does
GNURadio need to do anything to support OpenCL/CUDA/etc.-enabled
applications, or is everyone doing that sort of work simply writing
their own custom block to interface with their custom OpenCL/CUDA/etc.
kernel, since they are likely going to have to do all sorts of nasty
optimization tricks to get the best performance for their particular
application anyway? Or can a common block serve as a generic interface,
which loads whatever custom kernel needs to be written and works well
enough in 90% of the cases? I’d like to think the latter is true, but I
don’t have any evidence yet either way. Perhaps at a later date I’ll
have something to share that points in one direction or the other.
Doug


Doug G.
[email protected]

On Jan 12, 2011, at 2:56 PM, Moeller wrote:

I’m curious about how much speedup you can achieve for FIR filters
(let’s say large/sharp filters of 1024 taps).

The “very large FIR filters” was a thought, an example of an operation
that might benefit from a GPU, at least when using OpenCL (or CUDA). I
haven’t done testing yet to know whether a GPU can do better than a CPU
using vector instructions … but I’m getting there. If/when I do, I’ll
post my results & thoughts.

Your comment about global versus local memory certainly does seem true
from reading the OpenCL specs. Most modern GPUs have 3 levels of
memory: global (for the whole GPU, across all cores), core (across all
kernel execution units), and kernel – in order of decreasing size,
increasing access speed, and increasing time to move data to/from. I’ve
been playing around with global memory only so far, but I’ll look into
the other levels as well to see what they can provide & the trade-offs
required.

Good & interesting discussion! - MLD

On Wed, Jan 12, 2011 at 3:22 PM, Michael D. [email protected]
wrote:


Since FFTs & IFFTs are so speedy on GPUs (CUFFT is quite good now), a
good way is to filter in the frequency domain: FFT → pointwise multiply
→ IFFT. That way you can have arbitrarily sharp filters.
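For streaming data this recipe is usually implemented as overlap-save fast convolution. A minimal NumPy sketch (the CPU FFT standing in for CUFFT; block size and taps are illustrative):

```python
import numpy as np

def overlap_save(x, taps, nfft=4096):
    """Filter a long signal by fast convolution (overlap-save).
    Matches np.convolve(x, taps)[:len(x)] up to float error."""
    m = len(taps)
    step = nfft - m + 1                        # new samples per block
    H = np.fft.fft(taps, nfft)                 # filter FFT, computed once
    padded = np.concatenate([np.zeros(m - 1), x, np.zeros(step)])
    out = []
    for i in range(0, len(x), step):
        block = padded[i:i + nfft]
        if len(block) < nfft:                  # zero-pad the final block
            block = np.concatenate([block, np.zeros(nfft - len(block))])
        y_block = np.fft.ifft(np.fft.fft(block) * H)
        out.append(y_block[m - 1:m - 1 + step].real)  # discard wrap-around
    return np.concatenate(out)[:len(x)]

taps = np.ones(1024) / 1024        # the 1024-tap example discussed above
x = np.random.randn(50_000)
y = overlap_save(x, taps)
```

On a GPU the per-block FFT/multiply/IFFT is exactly what you would batch, keeping `H` and the blocks resident on the device between iterations.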

-Steven

On 12.01.2011 14:25, Michael D. wrote:

the CPU). I think that if a GPU can be used, it will be most effective in
things like filterbanks, or when searching for packets (via their unique sync
sequence, so matched filtering), or very large FIR filters – places where a LOT
of computations and data must be processed and can be parallelized easily. In my
initial testing, doing something “simple” …

Is there an efficient parallel FIR implementation for CUDA? You need
only a few operations on a large set of data. So isn’t this too much
for the stream-processor local memory? If GPU global memory has to be
used, this would lead to slower concurrent access. And then there is
still the transfer time from/to the computer RAM.

It would be great to have a fast filter, but is it really faster than
an optimized SSE CPU FIR? I had the feeling that the ratio of computing
operations to number of samples has to be high for a significant GPU
vs. CPU speedup.
I’m curious about how much speedup you can achieve for FIR filters
(let’s say large/sharp filters of 1024 taps).

On Jan 12, 2011, at 2:56 PM, Moeller wrote:
The “very large FIR filters” was a thought, as an example of an operation that
might benefit from a GPU at least when using OpenCL (or CUDA). I haven’t done
testing yet to know if a GPU can do better than a CPU using vector instructions
… but I’m getting there. If/when I do get there, I’ll post my results &
thoughts.

Very large FFT filters are also worth looking into. GPUs have been
considered for real-time coherent de-dispersion of radio-astronomy data
streams for pulsar detection. De-dispersion over large bandwidths at
low frequencies requires ferociously large FFT filters, but in order to
make this a viable proposition you likely have to do the detection and
folding on the GPU as well, producing an output data stream that is
several orders of magnitude smaller/slower than the input stream. I
read a paper on this (for the specific case of pulsar detection with
real-time coherent de-dispersion), and they concluded that it’s doable
on the higher-end GPUs, provided that you do detection and folding on
the GPU as well; otherwise you lose due to transfer overhead.

It seems like the only time you ever really “win” with a GPU-based
solution is when you have to suck in large amounts of data, pound on it
furiously, and then produce an output stream that’s relatively modest.
Otherwise, you seem to lose due to data-transfer overhead.


Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

On 13.01.2011 01:49, Tom R. wrote:

From my experiments, I don’t think it’s an “A and B” situation. I think
if you have either (a) a large amount of data OR (b) have to pound on
it furiously, you get a win. Most filters needed for normal comms don’t
involve enough data or computation, but doing, say, a turbo product
code or some other heavy compute task on normal amounts of data (blocks
of around 8k samples), you can get a win.

Even for FFT you have to check it carefully. I really lost time with
the GPU. Some benchmarks only count kernel time without transfer time;
others compare an optimized CUFFT against a non-optimized CPU
implementation. You have to compare GPU time including transfers
against something like FFTW. After that, the speedup is not very high
any more, depending on the transform size.

To really boost your computations, more operations should be done on
the same data set. I think FFT is not very suitable for the GPU because
of the butterfly structure (many data transfers between the blocks).
Things like FEM (finite elements) are more suitable, because the
differential equations are solved only on local and direct-neighbor
data.
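The fair-measurement point can be made concrete. A sketch of a timing harness (NumPy's FFT standing in for FFTW on the CPU side; the GPU side is left as comments because it needs a CUDA runtime and the exact CUFFT calls are not shown here):

```python
import time
import numpy as np

def time_per_fft(n, reps=100):
    """Average wall-clock seconds per forward FFT of size n on the CPU.
    There is no transfer term: the data already lives in host RAM,
    exactly where the CPU needs it."""
    x = (np.random.randn(n) + 1j * np.random.randn(n)).astype(np.complex64)
    start = time.perf_counter()
    for _ in range(reps):
        np.fft.fft(x)
    return (time.perf_counter() - start) / reps

# For a GPU, the timed region must span the whole round trip:
#   start -> memcpy host->device -> FFT kernel -> memcpy device->host -> stop
# Benchmarks that time only the kernel hide exactly the overhead
# discussed in this thread.

cpu_per_fft = time_per_fft(1 << 14)
```

Comparing `cpu_per_fft` against a kernel-only GPU number is the apples-to-oranges mistake described above; the GPU figure must include both memcpys.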

On Wed, Jan 12, 2011 at 3:39 PM, Marcus D. Leech [email protected]
wrote:


Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium
http://www.sbrac.org

From my experiments, I don’t think it’s an “A and B” situation. I think
if you have either (a) a large amount of data OR (b) have to pound on
it furiously, you get a win. Most filters needed for normal comms don’t
involve enough data or computation, but doing, say, a turbo product
code or some other heavy compute task on normal amounts of data (blocks
of around 8k samples), you can get a win.

Tom