CUDA-Enabled GNURadio gr_benchmark10 possible improvements

Hi

Has anyone been able to successfully improve CUDA-Enabled GNURadio’s
performance?
At the moment I am very new to this, so I am just reading Martin’s code
without a really solid understanding of it. I know that
gr_benchmark10_test.py runs slowly on the GPU because of the overhead of
memory transfers between the CPU and the GPU, and that if more
computation is done per call, the GPU can outperform the CPU. However,
looking at the gr_benchmark10 code, it seems that only very trivial
computations are used to compare the CPU and the GPU. Specifically:

testblock3= cuda.fir_filter_fff(1,taps)

testblock4= cuda.multiply_const_ff(1.0)
testblock5= cuda.multiply_const_ff(1.0)
testblock6= cuda.multiply_const_ff(1.0)
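
To make the overhead argument above concrete, here is a toy cost model in
Python. Every number in it is a made-up placeholder, not a measurement from
gr_benchmark10. The only assumption is that each GPU call pays a fixed
transfer cost plus a per-sample cost that is smaller than the CPU's.

```python
# Toy cost model; all timings are hypothetical and expressed in nanoseconds.

def cpu_time_ns(n_samples, cpu_ns_per_sample):
    """CPU cost: no fixed overhead, only per-sample work."""
    return n_samples * cpu_ns_per_sample


def gpu_time_ns(n_samples, overhead_ns, gpu_ns_per_sample):
    """GPU cost: fixed transfer overhead per call plus per-sample work."""
    return overhead_ns + n_samples * gpu_ns_per_sample


def break_even_samples(overhead_ns, cpu_ns_per_sample, gpu_ns_per_sample):
    """Smallest chunk size at which one GPU call is no slower than the CPU.

    Returns None if the GPU is not faster per sample, because then the
    fixed overhead can never be recovered.
    """
    gain = cpu_ns_per_sample - gpu_ns_per_sample
    if gain <= 0:
        return None
    return (overhead_ns + gain - 1) // gain  # integer ceiling division


# Placeholder figures: 1 ms overhead, 10 ns/sample on CPU, 1 ns/sample on GPU.
n = break_even_samples(1_000_000, 10, 1)
print("break-even chunk size:", n)
```

In this model the GPU only wins once the chunk handed over per call is
larger than the break-even size, which matches the observation that small
amounts of work per call favour the CPU.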

I attempted to “increase” the GPU performance by passing very large
floating point numbers as parameters to cuda.multiply_const_ff and also
by experimenting with taps, which is declared by:

taps=range(1,64,1)

In doing so, I assumed that I was giving the GPU “more work” to do, so
the GPU should come out faster, but it does not. The CPU still takes a
fraction of a second to complete (with the large floating point values),
while the GPU takes a little over 1 second.
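
A side note on my assumption, as a rough sketch rather than a description of
how the cuda blocks are actually implemented: multiplying by a constant costs
one multiplication per sample whatever the constant is, whereas a direct-form
FIR filter costs one multiply-add per tap for each output sample. The helper
names below are hypothetical plain Python, not part of GNU Radio.

```python
def multiply_const(samples, k):
    """Scale every sample by k. Returns (output, number_of_multiplies)."""
    out = [k * x for x in samples]
    return out, len(samples)


def fir_filter(samples, taps):
    """Direct-form FIR with decimation 1 and zero initial history.

    Returns (output, number_of_multiply_adds).
    """
    n_taps = len(taps)
    padded = [0.0] * (n_taps - 1) + list(samples)  # zeros stand in for history
    out = []
    macs = 0
    for n in range(len(samples)):
        acc = 0.0
        for i in range(n_taps):
            acc += taps[i] * padded[n + n_taps - 1 - i]
            macs += 1
        out.append(acc)
    return out, macs


samples = [1.0] * 1000
taps = list(range(1, 64, 1))  # same declaration as in the benchmark

_, small_const_ops = multiply_const(samples, 1.0)
_, large_const_ops = multiply_const(samples, 1e30)
_, fir_ops = fir_filter(samples, taps)

print("multiply_const, k=1.0 :", small_const_ops, "multiplies")
print("multiply_const, k=1e30:", large_const_ops, "multiplies")
print("fir_filter,", len(taps), "taps:", fir_ops, "multiply-adds")
```

If this picture is right, a larger constant adds no work at all, and only
the number of taps or the number of samples per call changes the amount of
computation.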

If I benchmark a single block with a big output_multiple then I do see
performance increases.

How do I do the above? How have the experts (Martin, Achilleas) been able
to tweak the performance of CUDA-Enabled GNURadio to show that GPU
computing can indeed be faster?

  • Is there any way to measure the time taken by the memory transfers
    between the CPU and the GPU? That way we would know exactly what the
    overhead is.
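
One generic approach I can think of is to wrap the call of interest in a
wall-clock timer and compare runs with and without the compute step. The
sketch below is plain Python; copy_to_device is a hypothetical stand-in that
just copies a list on the host, not a function from CUDA-Enabled GNURadio.
With a real GPU the call may return before the work has finished, so the
device would need to be synchronized before the clock is read. The CUDA
runtime also provides event timers (cudaEventRecord, cudaEventElapsedTime)
for this purpose.

```python
import time


def time_call(fn, *args, repeats=10):
    """Return the best wall-clock time in seconds over `repeats` runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best


def copy_to_device(host_buffer):
    # Hypothetical stand-in for a host-to-GPU transfer: a plain host copy.
    return list(host_buffer)


def copy_and_scale(host_buffer, k):
    # Transfer followed by a trivial computation, as in multiply_const_ff.
    device_buffer = copy_to_device(host_buffer)
    return [k * x for x in device_buffer]


buf = [float(i) for i in range(100000)]
t_copy = time_call(copy_to_device, buf)
t_total = time_call(copy_and_scale, buf, 2.0)
print("copy only: %.6f s, copy + compute: %.6f s" % (t_copy, t_total))
```

Taking the best of several runs reduces the influence of one-off delays such
as the first call's setup cost.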

Please help!!