Re: CUDA-Enabled GNURadio gr_benchmark10 possible improvements

Upon a closer look into cuda_multiply_const_ff_kernel.cu, there are five
different kernel functions to do the multiplication, where the default one,

    __global__ void
    cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
                                  const int noutput_items, const float konst)

is completely blank. But regardless, nobody calls these kernel
functions.
Then, in the same file (whose functions are called from
cuda_multiply_const_ff.cc), there is this function:

    int
    get_cuda_multiply_const_ff_kernel_params (...)
    {
      ...
      unsigned int num_threads_per_block=512;// =
      dim3 threaddim( num_threads_per_block, 1, 1);
      ...
      //cuda_multiply_const_ff_filter_kernel<<< params->griddim, params->threaddim, params->dynamic_shared_mem_size >>>(g_idata, g_odata, params->num_outputs_padded*X,konst);
      return result;
    }

The kernel invocation is completely commented out! The result is initialized
to 0 at the top and simply returned. All the work in between to set up the
thread and block sizes does not seem to matter. Not sure why this code exists
this way; did someone make an edit, or did Martin specifically comment out the
kernel invocation? Is this supposed to be this way? I don't see how this can
be a proper benchmark if all we are testing is thread and block allocation on
the device and memory access times, without doing any actual computation.
I am probably way off here and this doesn't make any sense... someone please
clarify!

On Tue, 2009-06-30 at 02:52 -0400, Yu-Hua Y. wrote:

Upon a closer look into cuda_multiply_const_ff_kernel.cu, there are five
different kernel functions to do the multiplication, where the default
one,

     __global__ void
     cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
                                   const int noutput_items, const float konst)

is completely blank. But regardless, nobody calls these kernel
functions.
cuda_multiply_const_ff was never finished, or I accidentally removed some
code here.
Either way, this is clearly a bug.
The empty kernel should call one of the implemented kernels.
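
For reference, a minimal working body for the default kernel could look
something like this (just a sketch of the straightforward per-thread
multiply; it is not necessarily what the implemented variants in that file
do):

    __global__ void
    cuda_multiply_const_ff_kernel(const float* g_idata, float* g_odata,
                                  const int noutput_items, const float konst)
    {
      // One thread per output item, striding in case the grid is smaller
      // than noutput_items.
      for (int i = blockIdx.x * blockDim.x + threadIdx.x;
           i < noutput_items;
           i += blockDim.x * gridDim.x) {
        g_odata[i] = g_idata[i] * konst;
      }
    }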


The result is initialized to 0 at the top and simply returned. All the
work in between to set up the thread and block sizes does not seem to
matter. Not sure why this code exists this way; did someone make an edit,
or did Martin specifically comment out the kernel invocation?
Yes, the kernel invocation is deliberately commented out.
The commented-out invocation is only there as documentation of how to do
the actual kernel invocation.

There is still a small typo: it should refer to
cuda_multiply_const_ff_kernel and not cuda_multiply_const_ff_filter_kernel.

This method's name is get_cuda_multiply_const_ff_kernel_params, and it does
just that: it determines the params needed for the kernel invocation.
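
Un-commented, and with that typo fixed, the invocation would look roughly
like this (the params fields are taken from the commented-out line; X
stands for whatever factor the surrounding code uses there):

    cuda_multiply_const_ff_kernel<<< params->griddim,
                                     params->threaddim,
                                     params->dynamic_shared_mem_size >>>
        (g_idata, g_odata, params->num_outputs_padded*X, konst);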

The actual kernel invocation is done in:
cuda_multiply_const_ff_work_device

Both get_cuda_multiply_const_ff_kernel_params and
cuda_multiply_const_ff_work_device are called from normal C++ code in
cuda_multiply_const_ff_kernel.cu

Better examples to look at and benchmark would be:
cuda_quadrature_demod_cuda_cf
cuda_fir_filter_fff

These are both used in testbed/wfm/cuda_wfm_rcv.py, which calls the
complete CUDA implementation of a WFM receiver in cuda_wfm_rcv_cuda.py.

You will notice a cuda_multiply_const_ff block is instantiated as
volume_control but not used in the final connect because it didn’t work.
Now I know this is because of the bug you found.

Is this supposed to be this way? I don't see how this can be a proper
benchmark if all we are testing is thread and block allocation on the
device and memory access times, without doing any actual computation.
That is not a good benchmark indeed.
cuda_multiply_const_ff should be fixed first.

The FIR filter is, however, the thing you really want to benchmark
(which is also in there).
The more taps the FIR filter has, the more computation it has to do.
Note that the CUDA implementation is at the moment limited in the number
of taps it can support at a given decimation factor.
If I remember correctly, decimation_factor * number_of_taps should stay
below 2048 (or was it 512?).

Ways of improving the results include having gnuradio do its calculations
in bigger chunks.
This can be done in several ways.
One way is to have a block at the end of the processing chain which has
a large output_multiple requirement and which needs 1 input for every
output.
In other words: connect a block to the end of the processing chain that
calls set_output_multiple(large_number) in its constructor.

You can use gr_test for this.

There are other ways which include changing the scheduler and the buffer
instantiating code.

I hope this helps,

Martin

Thank you very much for your reply. I have some more questions. I
understand you won't be looking at or updating this branch anymore due to
the release of OpenCL, but thanks for helping now!

Before you dive in, one thing I am concerned about is this output, which
appears every time I run CUDA-enabled code:

gr_vmcircbuf_cuda::copy_buf_to_buf() error cudaMemcpy() returned 0x3 = 3
initialization error
gr_vmcircbuf_cuda::copy_buf_to_buf recreating buffer d_base=0x3820000
size=32768 *2

Are these two outputs normal? Are they supposed to happen? If not, how do
I fix this?

2009/6/30 Martin DvH [email protected]


The actual kernel invocation is done in:
cuda_multiply_const_ff_work_device

Both get_cuda_multiply_const_ff_kernel_params and
cuda_multiply_const_ff_work_device are called from normal C++ code in
cuda_multiply_const_ff_kernel.cu

Here you mean called by cuda_multiply_const_ff.cc, right? The .cu file
itself seems to just define the kernel functions, and everything is called
from cuda_multiply_const_ff.cc. Maybe I am wrong, but this leads to my
question: where does cuda_multiply_const_ff.cc call
cuda_multiply_const_ff_work_device? It doesn't seem to, but maybe it's
some OOP mechanism that I missed... anyway, not that important at the
moment...

Ways of improving the results include having gnuradio do its calculations
in bigger chunks.
This can be done in several ways.
One way is to have a block at the end of the processing chain which has
a large output_multiple requirement and which needs 1 input for every
output.
In other words: connect a block to the end of the processing chain that
calls set_output_multiple(large_number) in its constructor.

You can use gr_test for this.

I decided to abandon and comment out all the cuda.multiply_const_ff calls
and concentrate on cuda.fir_filter_fff as suggested. A few
questions/concerns:

  1. I increased output_multiple by doing “options.output_multiple = xxx”,
     and this has no effect on the computing time of either CUDA or the
     CPU. Did I do something wrong?
  2. I increased the taps by doing “taps = range(1,256)” and also increased
     the number of fir_filter blocks in the code, and voila, I can now get
     CUDA to be faster than the CPU alone. However, if I use something like
     “taps = range(1,512)”, the CUDA part becomes extremely slow (~20
     seconds) while the CPU is still fine (~2 sec). Why? This may be
     related to what you were saying about the maximum number of taps,
     although then why is the CPU still able to compute?
  3. I had to increase the number of fir_filter blocks to 14 before I could
     start seeing CUDA outperform the CPU. Experimentally it’s fine, I
     achieved my objective, but how is this “increased computation”
     justified in a normal GNURadio operation? I mean, when would a normal
     GNURadio flowgraph require a chain of 14 fir_filters? I guess this
     goes beyond just “benchmarking” and asks where else I can take
     advantage of CUDA’s computation power in GNURadio in a “normal”
     operation.
  4. Looking at cuda_fir_fff_7_kernel, which I believe is the core of
     cuda_fir_filter, it seems you are using shared memory, right? Just
     making sure we are not using global or local memory, which would
     disastrously slow down the CUDA computations. (I sketched the kind of
     pattern I mean right after this list.)
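
For reference, the generic shared-memory FIR pattern I have in mind is
something like this (only a sketch, not your actual cuda_fir_fff_7_kernel;
NTAPS and BLOCK_SIZE are made-up values, decimation is ignored, the kernel
is launched with BLOCK_SIZE threads per block, and the input is assumed to
hold noutput + NTAPS - 1 valid samples):

    #define NTAPS 64        // illustrative tap count
    #define BLOCK_SIZE 256  // illustrative threads per block

    __global__ void
    fir_fff_shared_kernel(const float *in, float *out, const float *taps,
                          const int noutput)
    {
      // Cache the taps and this block's input window in shared memory so
      // the inner multiply-accumulate loop never reads global memory.
      __shared__ float s_taps[NTAPS];
      __shared__ float s_in[BLOCK_SIZE + NTAPS - 1];

      const int tid  = threadIdx.x;
      const int base = blockIdx.x * blockDim.x;

      // Each block needs BLOCK_SIZE + NTAPS - 1 input samples.
      for (int i = tid; i < BLOCK_SIZE + NTAPS - 1; i += blockDim.x) {
        if (base + i < noutput + NTAPS - 1)
          s_in[i] = in[base + i];
      }
      if (tid < NTAPS)
        s_taps[tid] = taps[tid];
      __syncthreads();

      const int n = base + tid;
      if (n < noutput) {
        float acc = 0.0f;
        for (int k = 0; k < NTAPS; k++)
          acc += s_in[tid + k] * s_taps[k];
        out[n] = acc;
      }
    }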

There are other ways which include changing the scheduler and the buffer
instantiating code.

I would rather not pursue those approaches because of the increased
complexity. I would rather increase the computation; that seems the
fastest/easiest to me, I am just not sure where.

I hope this helps,

Martin

It has been most helpful! Thank you again!

Thanks Martin, for your generous effort to help me.

It appears only one time so I think I am in the clear.


Thank you again for taking the time to help me!

2009/7/1 Martin DvH [email protected]

On Wed, 2009-07-01 at 05:27 -0400, Yu-Hua Y. wrote:

gr_vmcircbuf_cuda::copy_buf_to_buf() error cudaMemcpy() returned 0x3 = 3
initialization error
gr_vmcircbuf_cuda::copy_buf_to_buf recreating buffer d_base=0x3820000
size=32768 *2

If I remember correctly, this error is not a problem if you get it once
per instantiated CUDA block.

It is a problem if you get it multiple times for the same block.

If I remember correctly, this has to do with CUDA not being happy with
multithreaded applications and with access to the same device buffer from
different host threads.

Gnuradio using CUDA should always use the single-threaded scheduler
(this is not the default).
I don't remember whether my code does this automatically or whether you
have to force it with an environment variable.
(You can select the scheduler used at runtime by setting the GR_SCHEDULER
environment variable. E.g., for the multithreaded thread-per-block
scheduler:

    $ GR_SCHEDULER=TPB ./my-gnuradio-application.py ...

or for the single-threaded scheduler:

    $ GR_SCHEDULER=STS ./my-gnuradio-application.py ...
)

If you have to use the environment variable, then this should go into
the gnuradio-CUDA docs (if it is not already there).

Greetings,
Martin