Hi everybody,
I have recently had a look at two possibilities for SWRadio-aimed
intensive computing, which I guess are the two main development lanes
for our kind of stuff:
.:. Cell BE platform
.:. CUDA & nVidia GPUs
I think this list is the best place for a discussion on the PROs and
CONs of the two solutions, but I couldn't find any by searching the
mailing list. Has this been discussed already?
regards to all gr-fellows
vincenzo
On Mon, Jun 29, 2009 at 11:21 AM, Vincenzo P. [email protected] wrote:
but couldn’t find any by searching the mailing list.
has this been discussed already?
regards to all gr-fellows
Last I checked, the CUDA stuff requires proprietary software and
drivers. Not an especially good fit for GNURadio.
Cell BE has the advantage of being exceptionally easy to target: I was
able to get a project I work on (http://www.celt-codec.org) running
entirely on the SPU of a PS3 within about 10 minutes of installing the
SDK. (Okay okay, I spent a little time reading the SDK docs while the
ISO downloaded and installed)
On the negative side, the only way mortals can access a Cell BE is by
using a PS3. This is silly and hard to justify to the applicable
financial controller (i.e. spouse or boss). Being able to slap in a
video card is nice, especially since you could plug several into a
system.
Hopefully the Intel Larrabee
(http://en.wikipedia.org/wiki/Larrabee_(microarchitecture)) will combine
these qualities and we'll have something which is open, easily
programmable, scalable, and affordable.
On Mon, Jun 29, 2009 at 05:21:29PM +0200, Vincenzo P. wrote:
but couldn’t find any by searching the mailing list.
has this been discussed already?
There’s been a lot of conversation about this stuff, but mostly off
list.
Many of us are hoping that Larrabee turns out to be a big winner.
The Cell BE is pretty cool, and fun to program, but I’m not sure how
much of a future it has.
I'd say the jury is still out on CUDA with regard to signal
processing applications. From my reading of the CUDA docs, it looks
like you need a very "data parallel" application to take good
advantage of it. Again from reading, it appears that you need at
least 64 elements that you can apply an instruction to, to be in its
target zone. For certain parts of our graphs, this is probably OK
(e.g., FEC decode, FIRs, FFTs), but I'm kind of dubious about
anything with a dependency chain (IIRs, PLLs, equalizers, etc.). I'm
also not sure if you can launch multiple kernels simultaneously
(CUDA-speak). If you could launch multiple kernels, we'd have a
better chance of using the parallelism. That said, more experimenting
should be done with GPUs to see if they can be made useful for signal
processing.
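The FIR-vs-IIR distinction above can be sketched in a few lines of plain Python (my own illustration, not GNU Radio code; the function names are made up for this example):

```python
def fir(x, taps):
    # Each output sample depends only on input samples, so every y[n]
    # could be computed independently -- i.e. "data parallel".
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if n - k >= 0)
            for n in range(len(x))]

def iir_first_order(x, a):
    # y[n] = x[n] + a * y[n-1]: each output depends on the previous
    # output, so the iterations form a serial dependency chain.
    y, prev = [], 0.0
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y
```

In the FIR every list element is independent work that a GPU could farm out across threads; in the IIR the `prev` variable forces the loop to run one step at a time.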
Eric
Thanks
2009/6/29 Eric B. [email protected]
I think this list is the best place for a discussion on PROs and CONs
The Cell BE is pretty cool, and fun to program, but I’m not sure how
much of a future it has.
I'd say the jury is still out on CUDA with regard to signal
processing applications. From my reading of the CUDA docs, it looks
like you need a very "data parallel" application to take good
advantage of it. Again from reading, it appears that you need at
least 64 elements that you can apply an instruction to, to be in its
target zone. For certain parts of our graphs, this is probably OK
(e.g., FEC decode, FIRs, FFTs), but I'm kind of dubious about
anything with a dependency chain (IIRs, PLLs, equalizers, etc.)
With regard to the 64 elements to apply an instruction to, can you say
a bit more about this? Does it mean at least 64 iterations' worth of
computation (like a loop that runs more than 64 times) per one call to
the device?
I’m
also not sure if you can launch multiple kernels simultaneously
(CUDA-speak). If you could launch multiple kernels, we’d have a
better chance of using the parallelism. That said, more experimenting
should be done with GPUs to see if they can be made useful for signal
processing.
I am also trying to experiment to see how GPUs can make GNURadio
faster, but I am progressing very slowly… anyone who could give me a
boost would be great!
Eric B. wrote:
advantage of it. Again from reading, it appears that you need at
least 64 elements that you can apply an instruction to, to be in its
target zone. For certain parts of our graphs, this is probably OK
(e.g., FEC decode, FIRs, FFTs), but I'm kind of dubious about
anything with a dependency chain (IIRs, PLLs, equalizers, etc.)
32 threads in a so-called "warp" execute together in a Single
Instruction Multiple Threads (SIMT) manner on a particular Streaming
Multiprocessor (SM). The control flows among the 32 threads can
diverge, but when that happens, each set of divergent paths is executed
serially. Your observations are correct. At least for now, CUDA's
strength is still quite restricted to computation-intensive data
parallel processing, which is where the other 99% of nVidia's business
lies (graphics processing, of course). But once GPGPU processing takes
off, things could change.
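The serialization cost of warp divergence can be modeled with a toy cost function (entirely my own simplification for illustration; `then_cost` and `else_cost` are made-up parameters, not anything from the CUDA API):

```python
def simt_warp_steps(thread_inputs, then_cost, else_cost):
    # Model of SIMT execution within one warp: if any lane takes the
    # 'then' path, the whole warp walks through it (other lanes masked
    # off); likewise for the 'else' path. Divergence means paying for
    # both paths back to back instead of just one.
    takes_then = [x >= 0 for x in thread_inputs]
    steps = 0
    if any(takes_then):
        steps += then_cost
    if not all(takes_then):
        steps += else_cost
    return steps
```

When every lane agrees on the branch, the warp pays for one path; as soon as one lane disagrees, it pays for both, which is why branchy per-sample logic maps poorly onto a GPU.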
I’m also not sure if you can launch multiple kernels simultaneously
(CUDA-speak). If you could launch multiple kernels, we’d have a
better chance of using the parallelism.
Currently no. But it is possible to execute several parallel tasks
within the same kernel by diverging the control flow and, at the same
time, trying to group each different task (each variant of the control
flow) into groups of 32 threads (with padding, if necessary) to avoid
in-warp divergence. The nvcc compiler will at least take care of
register allocation so that multiple tasks won't use more registers
than the max a single one requires.
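The padding trick described above can be sketched as a thread-assignment calculation (my own naming and layout, just to show the warp-alignment arithmetic):

```python
WARP = 32  # threads per warp on current nVidia hardware

def assign_tasks(task_sizes):
    # Give each task a warp-aligned block of thread ids, padding each
    # task up to a multiple of 32 so that no single warp ever mixes two
    # tasks (and therefore never diverges on the task-selection branch).
    assignments, start = [], 0
    for size in task_sizes:
        padded = -(-size // WARP) * WARP  # round up to a whole warp
        assignments.append(range(start, start + size))
        start += padded
    return assignments, start  # per-task id ranges, total threads to launch
```

A kernel would then branch on which range the thread id falls in; the padding wastes a few idle threads per task but keeps every warp on a single code path.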
Eric
-Yu
On Mon, Jun 29, 2009 at 11:57:56PM -0400, Yu-Hua Y. wrote:
least 64 elements that you can apply an instruction to, to be in its
target zone. For certain parts of our graphs, this is probably OK
(e.g., FEC decode, FIRs, FFTs), but I'm kind of dubious about
anything with a dependency chain (IIRs, PLLs, equalizers, etc.)
With regard to the 64 elements to apply an instruction to, can you say a
bit more about this? Does it mean at least 64 iterations' worth of
computation (like a loop that runs more than 64 times) per one call to
the device?
What little I know about CUDA is primarily based on reading the NVIDIA
CUDA Programming Guide.
Eric