On Tue, 2009-06-30 at 02:52 -0400, Yu-Hua Y. wrote:
Upon a closer look into cuda_multiply_const_ff_kernel.cu, there
exist 5 different kernel functions to do the multiplication, where
the default one,
__global__ void
cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
                              const int noutput_items, const float konst)
is completely blank. But regardless, nobody calls these kernel
functions.
cuda_multiply_const_ff was never finished or I accidentally removed some
code here.
Either way, this is clearly a bug.
The empty kernel should call one of the implemented kernels.
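For what it is worth, a minimal sketch of what that kernel could end up doing
once fixed (just a guess at the intended behaviour, not the code that went
missing; the real fix may simply reuse one of the other implemented kernels):

__global__ void
cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
                              const int noutput_items, const float konst)
{
  // One output element per thread, with a grid-stride loop so any
  // grid/block configuration covers the whole buffer.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < noutput_items;
       i += blockDim.x * gridDim.x)
    g_odata[i] = g_idata[i] * konst;
}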
If wanted:
params->num_inputs = 0; //num_outputs_per_grid; //num_outputs;
It is initialized as 0 at the top and the function just returns it. All the
work in between to specify and allocate thread and block sizes does not seem
to matter. I am not
sure why this code exists this way; did someone make an edit, or did
Martin specifically comment out the kernel invocation?
Yes, the kernel invocation is specifically commented out.
The commented out kernel invocation is only here as documentation on
how to do the actual kernel invocation.
There is still a small typo.
It should refer to cuda_multiply_const_ff_kernel and not
cuda_multiply_const_ff_filter_kernel
This method's name is get_cuda_multiply_const_ff_kernel_params.
It does just that: determine the needed params for the kernel
invocation.
The actual kernel invocation is done in:
cuda_multiply_const_ff_work_device
Both get_cuda_multiply_const_ff_kernel_params and
cuda_multiply_const_ff_work_device are called from normal C++ code in
cuda_multiply_const_ff_kernel.cu
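Roughly, the split looks like this (a from-memory sketch; the real params
struct, field names and argument lists in cuda_multiply_const_ff_kernel.cu
will differ):

// Illustration only; the struct and field names are placeholders.
struct kernel_params_t {
  dim3 griddim;
  dim3 blockdim;
  int  num_inputs;
};

// Decides how many threads and blocks to use for a given amount of output.
kernel_params_t get_cuda_multiply_const_ff_kernel_params(int noutput_items)
{
  kernel_params_t params;
  params.blockdim   = dim3(256, 1, 1);
  params.griddim    = dim3((noutput_items + 255) / 256, 1, 1);
  params.num_inputs = noutput_items;   // one input consumed per output
  return params;
}

// Does the actual launch; this is what the commented-out line documents.
void cuda_multiply_const_ff_work_device(const float* g_idata, float* g_odata,
                                        int noutput_items, float konst,
                                        const kernel_params_t &params)
{
  cuda_multiply_const_ff_kernel<<<params.griddim, params.blockdim>>>(
      g_idata, g_odata, noutput_items, konst);
}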
Better examples to look at and benchmark would be
cuda_quadrature_demod_cuda_cf
cuda_fir_filter_fff
These are both used in
testbed/wfm/cuda_wfm_rcv.py,
which calls the complete cuda implementation of a WFM receiver in
cuda_wfm_rcv_cuda.py
You will notice a cuda_multiply_const_ff block is instantiated as
volume_control but not used in the final connect because it didn’t work.
Now I know this is because of the bug you found.
Is this supposed to be this way? I don't see how this can be a proper
benchmark if we are just testing thread and block allocation on the device
and memory access times, but not really doing any computation.
That is not a good benchmark indeed.
cuda_multiply_const_ff should be fixed first.
The fir-filter is, however, the thing you really want to benchmark
(which is also in there).
The more taps the fir-filter has, the more computations it will have to
make.
Note that the cuda implementation is at the moment limited in the number
of taps it can support at a certain decimation factor.
If I remember correctly, decimationfactor*numberoftaps should stay below 2048
(or was it 512).
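To get a feeling for where the computation goes, here is a deliberately naive
decimating FIR kernel (an illustration only, not the gr-cuda implementation;
it just shows that every output costs one multiply-add per tap):

__global__ void
naive_fir_decim_fff_kernel(const float *in, float *out, const float *taps,
                           int ntaps, int decimation, int noutputs)
{
  int o = blockIdx.x * blockDim.x + threadIdx.x;
  if (o >= noutputs)
    return;
  // Each output is the dot product of ntaps input samples with the taps,
  // so the work grows linearly with the number of taps.
  const float *p = in + o * decimation;
  float acc = 0.0f;
  for (int t = 0; t < ntaps; t++)
    acc += p[t] * taps[t];
  out[o] = acc;
}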
Ways of improving the results include having gnuradio do calculations in
bigger chunks.
This can be done in several ways.
One way is to have a block at the end of the processing chain which has
a large output_multiple requirement which needs 1 input for every
output.
In other words: connect a block to the end of the processing chain which
has set_output_multiple(large_number) in its constructor.
You can use gr_test for this.
There are other ways which include changing the scheduler and the buffer
instantiating code.
I hope this helps,
Martin