Forum: GNU Radio GnuRadio and CUDA

Bob Keyes (bobk)
on 2008-11-15 01:43
(Received via mailing list)
I've just been given a Nvidia Quadro 5600 and I am thinking of using it
for DSP. Has anyone experimented with USRP & gnuradio & cuda?
Martin DvH (Guest)
on 2008-11-16 23:47
(Received via mailing list)
On Fri, 2008-11-14 at 16:42 -0800, Bob Keyes wrote:
> I've just been given a Nvidia Quadro 5600 and I am thinking of using it
> for DSP. Has anyone experimented with USRP & gnuradio & cuda?
>
I have been working on this for quite some time now.
I did a GLSL implementation a few years back, but it didn't perform
that well and had some severe limitations.

So I started over this year and have reimplemented a major part of
GnuRadio using CUDA.
It is a one-to-one implementation:
every gr_something block is replaced with a cuda_something block.

My work-in-progress code is at:
http://gnuradio.org/trac/browser/gnuradio/branches...

Make sure you read
http://gnuradio.org/trac/browser/gnuradio/branches...

Caleb Phillips made a wiki about my code, you can find it at:
http://www.smallwhitecube.com/php/dokuwiki/doku.ph...


The majority of the gnuradio-core code is an unmodified GnuRadio
checkout from a few months back.

There are some important changes in gnuradio_core/src/lib/runtime
to support CUDA device memory as an emulated circular buffer.

I also implemented a gr.check_compare block which expects two input
streams and checks whether they output the same data.
I use this to verify that my cuda blocks do exactly the same as the gr
blocks.

All the rest of the CUDA code is in gr_cuda, which has to be configured
and built separately.
gr_cuda is where the cuda reimplementations of some GnuRadio blocks
live.

Then there are also a few new blocks, cuda_to_host and host_to_cuda,
which copy memory to and from GPU device memory.
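The idea behind host_to_cuda and cuda_to_host can be sketched as explicit boundary blocks: cross the host/device boundary once on the way in and once on the way out, and keep everything in between device-resident. The snippet below is a toy Python model of that dataflow, not the branch's actual API; the "device" tag merely stands in for CUDA device memory:

```python
def host_to_cuda(samples):
    # Toy stand-in for a host->device copy: tag the data as
    # device-resident.
    return ("device", list(samples))

def cuda_to_host(buf):
    # Toy stand-in for a device->host copy.
    kind, data = buf
    assert kind == "device", "block expects device-resident input"
    return list(data)

def cuda_scale(buf, k):
    # A "device" block: it consumes and produces device-resident data,
    # so no host round-trip is needed between consecutive device blocks.
    kind, data = buf
    assert kind == "device"
    return ("device", [k * x for x in data])

# host -> device once, two device blocks, device -> host once.
out = cuda_to_host(cuda_scale(cuda_scale(host_to_cuda([1, 2]), 2), 3))
assert out == [6, 12]
```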

All python scripts to test and use the code are in /testbed.

The code in testbed is changing on a day-by-day basis.


There are several issues to be well aware of when doing SDR on a GPU.

- overhead
        - call overhead
        - copying data to and from the GPU
        You need to do a lot of work on the GPU in one call to see any
        benefit.
- circular buffers
        - GPU memory can't be mmapped into a circular buffer
                - solution 1: use copying to emulate a circular buffer
                - solution 2: keep track of all the processing and
                write your own intelligent scheduler which does not
                need a circular buffer

- threads: with CUDA you can't access GPU device memory from different
host threads. So make sure you create, use, and destroy all device
memory from the same thread. (The standard GnuRadio scheduler does not
do it like this.)

- debugging: debugging is hard and works quite differently from normal
debugging.

- parallel: the GPU is good at doing calculations in parallel which are
not dependent on each other. For this reason a FIR will perform well,
while an IIR will perform badly. An IIR can only use one processing
block of the GPU, instead of 128.
It can still be beneficial to do the IIR on the GPU when all your other
blocks are running on the GPU, because then you don't have to copy all
samples to the CPU, do the IIR on the CPU, and copy everything back to
the GPU.
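Solution 1 above, emulating a circular buffer by copying, can be sketched in plain Python. This class is a hypothetical illustration (not code from the branch): after each read, the unread tail is copied back to the front of a linear buffer, which is the extra device-to-device copy the emulation pays for, so that readers always see a contiguous region:

```python
class CopyingCircularBuffer:
    """Emulate a circular buffer with a linear buffer plus copying,
    as a GPU buffer that cannot be mmapped twice would require."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.count = 0          # number of unread items

    def write(self, items):
        assert self.count + len(items) <= len(self.buf)
        self.buf[self.count:self.count + len(items)] = items
        self.count += len(items)

    def read(self, n):
        assert n <= self.count
        out = self.buf[:n]
        # The emulation cost: slide the unread data to the front so the
        # next reader sees a contiguous window (a device-to-device copy
        # in the real CUDA implementation).
        self.buf[:self.count - n] = self.buf[n:self.count]
        self.count -= n
        return out

ring = CopyingCircularBuffer(8)
ring.write([1, 2, 3, 4, 5])
assert ring.read(3) == [1, 2, 3]
ring.write([6, 7, 8])
assert ring.read(5) == [4, 5, 6, 7, 8]
```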

All that said, I do have a complete WFM receiver which runs entirely on
the GPU (using FIR and/or FFT filters, quadrature_demod, and FM
de-emphasis).
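For reference, the math that quadrature_demod computes (whether on the CPU or in a CUDA kernel) is the phase difference between consecutive complex samples, i.e. the instantaneous frequency. A minimal Python sketch of that formula, not the GnuRadio implementation itself:

```python
import cmath

def quadrature_demod(samples, gain):
    # out[n] = gain * arg(conj(x[n-1]) * x[n]): the per-sample phase
    # increment of the complex baseband signal.
    out = []
    for prev, cur in zip(samples, samples[1:]):
        out.append(gain * cmath.phase(prev.conjugate() * cur))
    return out

# A constant-frequency complex tone has a constant phase step,
# so the demodulated output is flat.
step = 0.3  # radians per sample
tone = [cmath.exp(1j * step * n) for n in range(6)]
demod = quadrature_demod(tone, gain=1.0)
assert all(abs(d - step) < 1e-9 for d in demod)
```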

The FFT filters use the CUDA-provided FFT (cuFFT).
It shouldn't be too hard to use the FFT for other things;
just look at the code in gr_cuda/src/lib/cuda_fft_*.

At the moment the complete WFM receiver does not run faster than on the
CPU with my 9600GT card, mainly because of the call overhead (too few
work items done per call) and the extra copying done to emulate
circular buffers.

I can increase the amount of work done per call by using
output_multiple, but with the current scheduling code the flow-graph
can then hang. This needs work, so the performance will change in the
future.
First I want to make sure everything works as expected.

If I benchmark a single block with a big output_multiple then I do see
performance increases.
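The overhead argument can be made concrete with a toy cost model. The numbers below are illustrative, not measurements from the branch: with a fixed per-call (kernel launch) cost, the cost per item falls as output_multiple grows, which is why a big output_multiple shows the speedup:

```python
def time_per_item(n_items, call_overhead_us=20.0, per_item_us=0.01):
    # Illustrative model: every GPU call pays a fixed overhead,
    # amortized over however many items that call processes.
    return call_overhead_us / n_items + per_item_us

# Small work units are dominated by launch overhead...
assert time_per_item(100) > 10 * time_per_item(10000)
# ...while large work units approach the asymptotic per-item rate.
assert abs(time_per_item(10**7) - 0.01) < 0.001
```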


Greetings,
Martin
Inderaj Bains (Guest)
on 2008-11-24 19:56
(Received via mailing list)
Hi Martin

1) You seem to be using atan on the host; did you try writing one for
the device?

2) It seems you have each block implemented separately; did you try to
put multiple ones together so that data does not have to travel to the
card multiple times?

3) I don't quite understand the compilation process for the cuda stuff.
Can you give more detail on this? I have an empty cuda block at the
end of the pipeline (details follow).

Thanks
Inderaj

Details of compile and runtime failure

I have the gr_how_to_write_a block calling a cuda function (in
another .cu file) that does malloc/copy/free. I am using this:
.cu.lo:
  $(top_srcdir)/cudalt.py $@ $(NVCC) -c $(NVCCFLAGS) $<

==== RUN FAILURE ==============================================

[root@localhost lib]# ../python/F_fm_cuda.py
Traceback (most recent call last):
  File "../python/F_fm_cuda.py", line 32, in <module> import howto
  File
"/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src/lib/howto.py",
line 6, in <module>
     import _howto
ImportError: /usr/local/lib/python2.5/site-packages/gnuradio/_howto.so:
undefined symbol: cudaFree

==== ENV ===================================================
[root@localhost lib]# export | grep -i cuda
declare -x LD_LIBRARY_PATH=":/usr/local/cuda/lib:/usr/local/cuda/lib"
declare -x
PATH="/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin:.:/usr/libexec/sdcc:/usr/local/cuda/bin:/root/bin:.:/usr/libexec/sdcc:/usr/local/cuda/bin:/root/bin"

[root@localhost lib]# export | grep -i python
declare -x
PYTHONPATH=":/usr/lib/python2.5/site-packages:/usr/local/lib/python2.5/site-packages:/usr/local/lib/python2.5/site-packages/gnuradio:/usr/lib/python2.5/site-packages:/usr/local/lib/python2.5/site-packages"
[root@localhost lib]#

====MAKE TRACE=============================================
[root@localhost gr-howto-write-a-block-3.1.3]# make
make  all-recursive
make[1]: Entering directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3'
Making all in config
make[2]: Entering directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/config'
make[2]: Nothing to be done for `all'.
make[2]: Leaving directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/config'
Making all in src
make[2]: Entering directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src'
Making all in lib
make[3]: Entering directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src/lib'
make  all-am
make[4]: Entering directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src/lib'
/bin/sh ../../libtool --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H
-I. -I../..  -DOMNITHREAD_POSIX=1 -pthread
-I/usr/local/include/gnuradio -I/usr/local/include
-I/usr/include/python2.5    -g -O2 -Wall -Woverloaded-virtual -pthread
-MT howto_square_ff.lo -MD -MP -MF .deps/howto_square_ff.Tpo -c -o
howto_square_ff.lo howto_square_ff.cc
libtool: compile:  g++ -DHAVE_CONFIG_H -I. -I../..
-DOMNITHREAD_POSIX=1 -pthread -I/usr/local/include/gnuradio
-I/usr/local/include -I/usr/include/python2.5 -g -O2 -Wall
-Woverloaded-virtual -pthread -MT howto_square_ff.lo -MD -MP -MF
.deps/howto_square_ff.Tpo -c howto_square_ff.cc  -fPIC -DPIC -o
.libs/howto_square_ff.o
mv -f .deps/howto_square_ff.Tpo .deps/howto_square_ff.Plo
../../cudalt.py cuda_block.lo "nvcc" -c "-D_DEBUG -g -v -keep
-use_fast_math -I. -IUDASDK/common/inc" cuda_block.cu
#$ _SPACE_=
#$ _MODE_=DEVICE
#$ _HERE_=/usr/local/cuda/bin
#$ _THERE_=/usr/local/cuda/bin
#$ TOP=/usr/local/cuda/bin/..
#$
LD_LIBRARY_PATH=/usr/local/cuda/bin/../lib:/usr/local/cuda/bin/../extools/lib::/usr/local/cuda/lib
#$
PATH=/usr/local/cuda/bin/../open64/bin:/usr/local/cuda/bin/../bin:/usr/lib/qt-3.3/bin:/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib/ccache:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin:.:/usr/libexec/sdcc:/usr/local/cuda/bin:/root/bin
#$ INCLUDES="-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"
#$ LIBRARIES=  "-L/usr/local/cuda/bin/../lib" -lcudart
#$ CUDAFE_FLAGS=
#$ OPENCC_FLAGS=
#$ PTXAS_FLAGS=
#$ gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_13_DOUBLE_INTRINSICS
-DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDACC__ -C  -fPIC
-I"." -I"UDASDK/common/inc" -D"_DEBUG" -include "cuda_runtime.h" -m32
-malign-double -g -o "cuda_block.cpp1.ii" "cuda_block.cu"
#$ cudafe --m32 --gnu_version=40102
--diag_error=host_device_limited_call -tused  --gen_c_file_name
"cuda_block.cudafe1.c" --stub_file_name "cuda_block.cudafe1.stub.c"
--stub_header_file_name "cuda_block.cudafe1.stub.h"
--gen_device_file_name "cuda_block.cudafe1.gpu" --include_file_name
cuda_block.fatbin.c "cuda_block.cpp1.ii"
#$ gcc -D__CUDA_ARCH__=100 -E -x c -DCUDA_NO_SM_13_DOUBLE_INTRINSICS
-DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDACC__ -C  -fPIC
-I"." -I"UDASDK/common/inc" -D"_DEBUG" -m32 -malign-double -g -o
"cuda_block.cpp2.i" "cuda_block.cudafe1.gpu"
#$ cudafe --m32 --gnu_version=40102 --c  --gen_c_file_name
"cuda_block.cudafe2.c" --stub_file_name "cuda_block.cudafe2.stub.c"
--stub_header_file_name "cuda_block.cudafe2.stub.h"
--gen_device_file_name "cuda_block.cudafe2.gpu" --include_file_name
cuda_block.fatbin.c "cuda_block.cpp2.i"
#$ gcc -D__CUDA_ARCH__=100 -E -x c -DCUDA_NO_SM_13_DOUBLE_INTRINSICS
-DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -D__CUDABE__
-D__USE_FAST_MATH__  -fPIC -I"." -I"UDASDK/common/inc" -D"_DEBUG" -m32
-malign-double -g -o "cuda_block.cpp3.i" "cuda_block.cudafe2.gpu"
#$ filehash --skip-cpp-directives -s " " "cuda_block.cpp3.i" >
"cuda_block.hash"
#$ nvopencc  -TARG:sm_10  -m32 "cuda_block.cpp3.i"  -o "cuda_block.ptx"
#$ ptxas --key="6b4cfc7a7afd183d"  -arch=sm_10  "cuda_block.ptx"  -o
"cuda_block.cubin"
#$ fatbin --key="6b4cfc7a7afd183d" --source-name="cuda_block.cu"
--usage-mode=" " --embedded-fatbin="cuda_block.fatbin.c"
"--image=profile=sm_10,file=cuda_block.cubin"
#$ cudafe++ --m32 --gnu_version=40102
--diag_error=host_device_limited_call --dep_name  --gen_c_file_name
"cuda_block.cudafe1.cpp" --stub_file_name "cuda_block.cudafe1.stub.c"
--stub_header_file_name "cuda_block.cudafe1.stub.h"
"cuda_block.cpp1.ii"
#$ gcc -D__CUDA_ARCH__=100 -E -x c++ -DCUDA_NO_SM_13_DOUBLE_INTRINSICS
-DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -fPIC -I"."
-I"UDASDK/common/inc" -D"_DEBUG" -m32 -malign-double -g -o
"cuda_block.cu.cpp" "cuda_block.cudafe1.cpp"
#$ gcc -D__CUDA_ARCH__=100 -c -x c++ -DCUDA_NO_SM_13_DOUBLE_INTRINSICS
-DCUDA_NO_SM_12_ATOMIC_INTRINSICS -DCUDA_NO_SM_11_ATOMIC_INTRINSICS
-DCUDA_FLOAT_MATH_FUNCTIONS  "-I/usr/local/cuda/bin/../include"
"-I/usr/local/cuda/bin/../include/cudart"   -I. -fPIC -I"."
-I"UDASDK/common/inc" -D"_DEBUG" -m32 -malign-double -g -o
".libs/cuda_block.o" "cuda_block.cu.cpp"
/bin/sh ../../libtool --tag=CXX   --mode=link g++  -g -O2 -Wall
-Woverloaded-virtual -pthread  -module -avoid-version  -o _howto.la
-rpath /usr/local/lib/python2.5/site-packages/gnuradio howto.lo
howto_square_ff.lo howto_square2_ff.lo cuda_block.lo  -lstdc++
                -L/usr/local/lib -lgnuradio-core -lgromnithread
-lfftw3f -lm
libtool: link: rm -fr  .libs/_howto.la .libs/_howto.lai .libs/_howto.so
libtool: link: g++ -shared -nostdlib
/usr/lib/gcc/i386-redhat-linux/4.1.2/../../../crti.o
/usr/lib/gcc/i386-redhat-linux/4.1.2/crtbeginS.o  .libs/howto.o
.libs/howto_square_ff.o .libs/howto_square2_ff.o .libs/cuda_block.o
-Wl,-rpath -Wl,/usr/local/lib -Wl,-rpath -Wl,/usr/local/lib
-L/usr/local/lib /usr/local/lib/libgnuradio-core.so
/usr/local/lib/libgromnithread.so -lrt /usr/local/lib/libfftw3f.so
-L/usr/lib/gcc/i386-redhat-linux/4.1.2
-L/usr/lib/gcc/i386-redhat-linux/4.1.2/../../.. -lstdc++ -lm -lc
-lgcc_s /usr/lib/gcc/i386-redhat-linux/4.1.2/crtendS.o
/usr/lib/gcc/i386-redhat-linux/4.1.2/../../../crtn.o  -pthread
-pthread -Wl,-soname -Wl,_howto.so -o .libs/_howto.so
libtool: link: ( cd ".libs" && rm -f "_howto.la" && ln -s
"../_howto.la" "_howto.la" )
make[4]: Leaving directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src/lib'
make[3]: Leaving directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src/lib'
Making all in python
make[3]: Entering directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src/python'
make[3]: Nothing to be done for `all'.
make[3]: Leaving directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src/python'
make[3]: Entering directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src'
make[3]: Nothing to be done for `all-am'.
make[3]: Leaving directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src'
make[2]: Leaving directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3/src'
make[2]: Entering directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3'
make[2]: Leaving directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3'
make[1]: Leaving directory
`/root/dev/gnuradio-3.1.3/gr-howto-write-a-block-3.1.3'




On Sun, Nov 16, 2008 at 2:46 PM, Martin DvH
<gnuradiomail@olifantasia.com> wrote:
> [quoted text trimmed]



--
~Inderaj
Martin DvH (Guest)
on 2008-11-25 00:27
(Received via mailing list)
On Mon, 2008-11-24 at 08:47 -0800, Inderaj Bains wrote:
> Hi Martin
>
> 1) You seem to be using atan on host, did you try writing one for device?
>
I use atan on the CUDA device.

Look in
gr-cuda/src/lib/cuda_quadrature_demod_cf_kernel.cu:

__global__ void
cuda_quadrature_demod_cf_kernel (const gr_complex *g_idata,
                                 float *g_odata,
                                 const int noutput_items,
                                 const float gain)

I don't know which test script you are using.
The following is an FM receiver which should run completely on the
device:
testbed/wfm/usrp_wfm_rcv_nogui_cuda.py
This script includes:
testbed/wfm/cuda_wfm_rcv_cuda.py

And this uses
self.fm_demod = cuda.quadrature_demod_cuda_cf (fm_demod_gain)

which uses the device code I started my reply with.

> 2) It seems you have each block implemented separately, did you try to
> put multiple ones together so that data does not have to travel to the
> card multiple times
The data doesn't have to travel to the card multiple times.
I copy it once to the device with a host_to_cuda block.
Then the data is transferred from cuda block to cuda block, all using
cuda device memory.

What does slow things down is that I emulate a circular buffer by
copying the buffer memory after every work call. (This is a "fast"
device-to-device copy which unfortunately is not that fast when used
for small sizes, i.e. a small number of output_items.)
>
> 3) I don't quite understand the compilation process for cuda stuff.
> Can you tell more detail on this. I have an empty block cuda block at
> the end of pipeline (details follow)
>
I am not quite sure what you are trying to do.
Did you make your own cuda block, or are you trying to compile my code?
If you want to use cuda blocks you need to use my gnuradio gpgpu-wip
branch, because the cuda device buffer support is in my
gpgpu-wip/gnuradio-core.

When building your own cuda libraries:
For the cuda compiling to work you have to put cudalt.py in the root
dir and gr_cuda.m4 in config/,
then edit configure.ac and add:

dnl Check for CUDA (required)
GR_CUDA

then you need to run ./bootstrap (do all the autoconf/aclocal/automake
stuff),

after that run ./configure and make sure it finds all the cuda libs
and the nvcc compiler.

Use gr-cuda/src/lib/Makefile.am as an example for how to make a new
cuda library.

make sure you include
.cu.lo:
  $(top_srcdir)/cudalt.py $@ $(NVCC) -c $(NVCCFLAGS) $<

INCLUDES = $(STD_DEFINES_AND_INCLUDES) $(PYTHON_CPPFLAGS) $(CUDA_CFLAGS)

# magic flags
_yourlib_la_LDFLAGS = $(NO_UNDEFINED) -module -avoid-version

# link the library against some common swig runtime code and the
# c++ standard library
_cuda_la_LIBADD =          \
        $(PYTHON_LDFLAGS)  \
        $(CUDA_LIBS)       \
        -lstdc++


Explanation:
CUDA_LIBS adds the needed cuda libraries.
CUDA_CFLAGS adds the needed cuda includes.
The rule

.cu.lo:
  $(top_srcdir)/cudalt.py $@ $(NVCC) -c $(NVCCFLAGS) $<

makes sure a libtool-like library is made for every .cu file, using
nvcc as the compiler.


Very important:
cuda memory can only be used in the same thread it was created in.
Normally GnuRadio instantiates blocks in one thread and runs the
flowgraph in another thread.
You can override the virtual methods start() and stop() of a new block
to create and destroy device memory.
You can also create and destroy device memory in the work() method of
a block.
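The thread-affinity rule can be illustrated in plain Python (the CUDA constraint itself is not reproduced here). CudaLikeBlock is a hypothetical stand-in for a block that overrides start()/work()/stop(): the "device" buffer is allocated, used, and freed entirely on the scheduler thread, never in the constructor, which GnuRadio runs on a different thread:

```python
import threading

class CudaLikeBlock:
    """Sketch of a block whose 'device memory' must be created, used,
    and destroyed by one thread (hypothetical illustration)."""

    def __init__(self):
        # Constructor runs on the instantiating thread: allocate
        # nothing device-side here.
        self.owner = None       # the only thread allowed to touch buf
        self.buf = None
        self.result = None

    def start(self):            # runs on the scheduler thread
        self.owner = threading.current_thread()
        self.buf = [0.0] * 4    # stand-in for cudaMalloc

    def work(self):
        assert threading.current_thread() is self.owner
        self.result = sum(self.buf)

    def stop(self):
        assert threading.current_thread() is self.owner
        self.buf = None         # stand-in for cudaFree

block = CudaLikeBlock()

def scheduler():
    # All device-memory lifecycle calls happen on this one thread.
    block.start()
    block.work()
    block.stop()

t = threading.Thread(target=scheduler)
t.start()
t.join()
assert block.result == 0.0 and block.buf is None
```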

The input and output circular buffers for cuda blocks are automatically
created as device memory when you use GR_BUFFER_CUDA as the buffer type
in the modified gr_io_signature of the input and/or output.
Example:
This block will use cuda device memory as input and as output:
cuda_quadrature_demod_cuda_cf::cuda_quadrature_demod_cuda_cf (float gain)
  : gr_block ("quadrature_demod_cuda_cf",
        gr_make_io_signature (MIN_IN, MAX_IN, sizeof (gr_complex), GR_BUFFER_CUDA),
        gr_make_io_signature (MIN_OUT, MAX_OUT, sizeof (float), GR_BUFFER_CUDA))


This block will use normal host memory as input and as output:

normal_gnuradio_cf::normal_gnuradio_cf (float gain)
  : gr_block ("quadrature_demod_cuda_cf",
        gr_make_io_signature (MIN_IN, MAX_IN, sizeof (gr_complex)),
        gr_make_io_signature (MIN_OUT, MAX_OUT, sizeof (float)))



I hope this helps,
Martin