CUDA-Enabled GNURadio gr_benchmark10 possible improvements

Yu-Hua_Y · June 29, 2009, 11:12am

Hi

Has anyone been able to improve CUDA-Enabled GNURadio’s performance?
At the moment I am very new to this, so I am just reading Martin’s code
without any really solid understanding. I know that
gr_benchmark10_test.py runs slowly on the GPU because of the overhead of
memory transfers between the CPU and the GPU, and that if more
computation is done per call, the GPU can outperform the CPU. However,
looking at the gr_benchmark10 code, it seems that only very trivial
computations are used to compare the CPU and GPU. Specifically:

testblock3= cuda.fir_filter_fff(1,taps)

testblock4= cuda.multiply_const_ff(1.0)
testblock5= cuda.multiply_const_ff(1.0)
testblock6= cuda.multiply_const_ff(1.0)

I attempted to “increase” the GPU performance by passing very large
floating-point numbers as the parameter to cuda.multiply_const_ff and by
changing taps, which is declared as:

taps=range(1,64,1)

In doing so I assumed that I was passing in “more work” to be done, so
the GPU should be faster, but it is not. The CPU still takes a fraction
of a second to complete (even with large floating-point values), while
the GPU takes a little over 1 second.
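
One possible reason, sketched below under simplifying assumptions: a
multiply costs the same whatever the value of the constant, so a larger
constant adds no work, whereas a direct-form FIR filter does one
multiply-add per tap for every output sample. The two helper functions
are hypothetical and only count operations; they ignore scheduling and
transfer costs and are not part of GNU Radio.

```python
def fir_macs_per_sample(taps):
    """Multiply-adds per output sample for a direct-form FIR filter."""
    return len(taps)


def multiply_const_ops_per_sample(constant):
    """One multiply per sample, independent of the constant's value."""
    return 1


taps = list(range(1, 64, 1))            # the benchmark's taps: 63 values
longer_taps = list(range(1, 1024, 1))   # hypothetical longer filter

print(fir_macs_per_sample(taps))              # 63
print(fir_macs_per_sample(longer_taps))       # 1023
print(multiply_const_ops_per_sample(1.0))     # 1
print(multiply_const_ops_per_sample(1.0e30))  # 1
```

Under this view, only the number of taps (or the number of samples
handled per call) changes the amount of arithmetic; the size of the
multiplier does not.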

Following this thread:
[Discuss-gnuradio] Re: GNU Radio GPGPU WIP Branch Status?
I would like to approach the problem by increasing the computational
intensity, which is why I am changing the benchmark parameters, but it
doesn’t seem to work. Am I approaching this correctly?
From this thread:
Re: [Discuss-gnuradio] GPU progress?

If I benchmark a single block with a big output_multiple then I do see
performance increases.

How do I do the above? How have the experts (Martin, Achilleas) been able
to tweak the performance of CUDA-Enabled GNURadio to show that GPU
computing can indeed be faster?
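
To make the output_multiple idea concrete, here is a rough model,
assuming each call pays a fixed overhead (launch plus transfer setup)
and a per-sample cost. All four timing constants are made-up
illustrative values, not measurements from any hardware, and the
function names are hypothetical.

```python
def throughput(samples_per_call, call_overhead_s, per_sample_s):
    """Samples per second when every call pays a fixed overhead."""
    call_time = call_overhead_s + samples_per_call * per_sample_s
    return samples_per_call / call_time


def break_even_samples(gpu_overhead_s, gpu_per_sample_s,
                       cpu_overhead_s, cpu_per_sample_s):
    """Chunk size at which the modelled GPU and CPU call times are equal."""
    return (gpu_overhead_s - cpu_overhead_s) / (cpu_per_sample_s - gpu_per_sample_s)


# Hypothetical costs: the GPU has a high fixed overhead but cheap samples.
GPU_OVERHEAD_S, GPU_PER_SAMPLE_S = 50e-6, 1e-9
CPU_OVERHEAD_S, CPU_PER_SAMPLE_S = 1e-6, 10e-9

for n in (512, 8192, 65536):
    gpu = throughput(n, GPU_OVERHEAD_S, GPU_PER_SAMPLE_S)
    cpu = throughput(n, CPU_OVERHEAD_S, CPU_PER_SAMPLE_S)
    print("%6d samples/call: GPU %.1f MS/s, CPU %.1f MS/s" % (n, gpu / 1e6, cpu / 1e6))

print("break-even near %d samples per call" % break_even_samples(
    GPU_OVERHEAD_S, GPU_PER_SAMPLE_S, CPU_OVERHEAD_S, CPU_PER_SAMPLE_S))
```

In this model the GPU only wins once each call carries enough samples
to amortize the fixed overhead, which is presumably what a big
output_multiple achieves.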

Is there any way to measure the time taken by the memory transfers
between the CPU and the GPU? That way we would know exactly what the
overhead is.
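
A minimal sketch of the kind of measurement meant here: time each stage
separately and compare. The three stage functions are stand-ins that
only copy or sum host memory; in a real test they would be replaced by
the actual host-to-device copy, kernel run, and device-to-host copy.
Kernel launches are asynchronous, so a real measurement would also need
to synchronize before stopping the clock. On the CUDA side,
cudaEventRecord and cudaEventElapsedTime are the usual tools for this.

```python
import time


def time_stages(stages, repeats=10):
    """Return the best wall-clock time in seconds for each named stage."""
    results = {}
    for name, fn in stages:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            fn()
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


# Stand-in workload: these do NOT touch a GPU.
host_buffer = bytearray(4 * 1024 * 1024)


def copy_to_device():
    return bytes(host_buffer)          # placeholder for host -> device copy


def run_kernel():
    return sum(host_buffer[:65536])    # placeholder for the GPU computation


def copy_to_host():
    return bytearray(host_buffer)      # placeholder for device -> host copy


timings = time_stages([
    ("host_to_device", copy_to_device),
    ("kernel", run_kernel),
    ("device_to_host", copy_to_host),
])
for name, seconds in timings.items():
    print("%-15s %.6f s" % (name, seconds))
```

Comparing the copy times with the kernel time would show how much of
the total is transfer overhead.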

Please help!!