Bidirectional communication between attached blocks

Hi all,
I’d like to establish bidirectional communication between two attached
blocks, without asking the user to write code like msg_connect(). One way
could be: the upstream block generates an id like
ID = typeBlock + pseudorandomNumber
and sends it to the next block using a tag.
After that, the upstream block creates a publish port named ID+“alice”
and subscribes to a port named ID+“bob”.
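Roughly, this is what I have in mind; just an untested sketch (GNU Radio
3.7-style C++, the block name and handler are placeholders):

    #include <gnuradio/sync_block.h>
    #include <gnuradio/io_signature.h>
    #include <pmt/pmt.h>
    #include <boost/bind.hpp>
    #include <cstdlib>
    #include <cstring>
    #include <sstream>

    // Placeholder upstream block: builds the random id, announces it via a
    // stream tag and registers the ID+"alice"/ID+"bob" message ports.
    class tagged_pubsub : public gr::sync_block
    {
        pmt::pmt_t d_id; // sent downstream as the tag value

    public:
        tagged_pubsub()
            : gr::sync_block("tagged_pubsub",
                             gr::io_signature::make(1, 1, sizeof(float)),
                             gr::io_signature::make(1, 1, sizeof(float)))
        {
            std::ostringstream oss;
            oss << "tagged_pubsub" << std::rand(); // ID = typeBlock + pseudorandom number
            const std::string id = oss.str();
            d_id = pmt::intern(id);

            // the two message ports forming the back channel
            message_port_register_out(pmt::mp(id + "alice"));
            message_port_register_in(pmt::mp(id + "bob"));
            set_msg_handler(pmt::mp(id + "bob"),
                            boost::bind(&tagged_pubsub::handle_bob, this, _1));
        }

        void handle_bob(pmt::pmt_t msg) { /* reply from the downstream block */ }

        int work(int noutput_items,
                 gr_vector_const_void_star& input_items,
                 gr_vector_void_star& output_items)
        {
            // announce the id once, on the very first output item
            if (nitems_written(0) == 0)
                add_item_tag(0, 0, pmt::intern("chan_id"), d_id);
            std::memcpy(output_items[0], input_items[0],
                        noutput_items * sizeof(float));
            return noutput_items;
        }
    };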
Can you suggest a better/cleaner way?

I need this bidirectional channel in order to communicate device pointers,
send kernel code (thanks to CUDA 7), etc.

Thanks for your time,
marco

Hi marco,

what you describe as ID already exists: every block has a function
alias(), giving it a string “name”, which can be used with
global_block_registry::block_lookup(name) [1].

You will need to wrap your alias in a pmt::intern to get it into a
stream tag; use that same symbol with block_lookup, and cast the result to
your_block_type::sptr.
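Untested sketch of the lookup side (the concrete block type is a
placeholder, and depending on your GNU Radio version the registry object
may need a gr:: prefix):

    #include <gnuradio/block_registry.h>
    #include <gnuradio/basic_block.h>
    #include <pmt/pmt.h>
    #include <boost/shared_ptr.hpp>
    #include <string>

    // Resolve the alias recovered from the stream tag back to the block behind it.
    gr::basic_block_sptr lookup_by_alias(const std::string& alias)
    {
        // in some versions this is spelled gr::global_block_registry
        return global_block_registry.block_lookup(pmt::intern(alias));
    }

    // Then cast to your concrete type (placeholder name):
    //   your_block_type::sptr up =
    //       boost::dynamic_pointer_cast<your_block_type>(lookup_by_alias(alias));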

Greetings,
Marcus

[1]
http://gnuradio.org/doc/doxygen/classgr_1_1block__registry.html#a67a83c42e2030bba463c99d51e7a8f92

Thank you very much. Your solution is much cleaner.

Have a good day,
Marco

On Mon, Apr 20, 2015 at 09:29 Marcus Müller <[email protected]> wrote:

Hi Marco,

I just realized: things might be even easier than that:

What you describe sounds like a job for a hierarchical block; if you’re not
familiar with that concept, it’s just a “sub-flowgraph”, represented as a
block with inputs and outputs.
If you put both your blocks inside, you’ll always have them together.
And in the constructor of your hierarchical block you can, for example,
first construct your CUDA block, and then hand its pointer to your
“downstream” block’s constructor.

To the user, this will look like one block, though there are two (or
more) inside.
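Untested skeleton, assuming two placeholder blocks my_cuda_block_a and
my_cuda_block_b from your own module:

    #include <gnuradio/hier_block2.h>
    #include <gnuradio/io_signature.h>
    #include <gnuradio/gr_complex.h>
    #include <boost/shared_ptr.hpp>

    // Hierarchical block wrapping the two CUDA blocks; block A is built first
    // and its sptr is passed to block B's constructor, so the pair can share
    // device pointers internally while looking like a single block in GRC.
    class cuda_pair : public gr::hier_block2
    {
    public:
        typedef boost::shared_ptr<cuda_pair> sptr;
        static sptr make() { return gnuradio::get_initial_sptr(new cuda_pair()); }

    private:
        cuda_pair()
            : gr::hier_block2("cuda_pair",
                              gr::io_signature::make(1, 1, sizeof(gr_complex)),
                              gr::io_signature::make(1, 1, sizeof(gr_complex)))
        {
            my_cuda_block_a::sptr a = my_cuda_block_a::make();
            my_cuda_block_b::sptr b = my_cuda_block_b::make(a); // gets the pointer

            connect(self(), 0, a, 0);
            connect(a, 0, b, 0);
            connect(b, 0, self(), 0);
        }
    };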

Greetings,
Marcus

I cannot do it.
For my thesis, I’m trying to bring various parts of GNU Radio over to CUDA.
My idea is to rewrite already existing blocks with CUDA, possibly without
breaking compatibility with the current implementation of GNU Radio. This
way a normal user can use these blocks without problems.

For the moment, I’ve gained more confidence with GNU Radio, built a CUDA FM
receiver and started to port some blocks to CUDA. It is mandatory to
minimize host-device memcpy.
My current approach is: each block loads its code and communicates with its
neighbours using async transfers, streams and so on (so I need to pass
addresses of memory locations, lock bits, etc.).
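Outside GNU Radio, the handoff I have in mind looks roughly like this
(standalone sketch using the CUDA runtime API, error checking omitted):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Hypothetical handoff between two blocks: the upstream block owns a device
    // buffer and a stream; the downstream block only receives these raw handles
    // and enqueues its own work on the same stream, so nothing is copied back to
    // the host in between.
    struct device_handoff {
        float*       d_samples;   // device pointer produced upstream
        size_t       nsamples;
        cudaStream_t stream;      // shared stream that orders the two blocks
    };

    int main()
    {
        device_handoff h;
        h.nsamples = 1 << 20;
        cudaStreamCreate(&h.stream);
        cudaMalloc((void**)&h.d_samples, h.nsamples * sizeof(float));

        // "upstream" work: fill the buffer asynchronously (here just a memset)
        cudaMemsetAsync(h.d_samples, 0, h.nsamples * sizeof(float), h.stream);

        // "downstream" work: uses h.d_samples on the same stream; here it only
        // pulls one sample back to prove the ordering
        float first = -1.0f;
        cudaMemcpyAsync(&first, h.d_samples, sizeof(float),
                        cudaMemcpyDeviceToHost, h.stream);
        cudaStreamSynchronize(h.stream);
        std::printf("first sample after handoff: %f\n", first);

        cudaFree(h.d_samples);
        cudaStreamDestroy(h.stream);
        return 0;
    }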

My next step will be: at the beginning, each block will send its device
code and parameters downstream; the block at the end of the chain will do a
dynamic compilation (CUDA 7). If I have additional time I’ll also use warp
parallelism (reducing global-shared memcpy).
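For the dynamic-compilation step I’m thinking of something along these
lines, i.e. NVRTC plus the driver API (untested sketch, no error checking,
the kernel is just a placeholder):

    #include <nvrtc.h>
    #include <cuda.h>
    #include <vector>
    #include <cstdio>

    // Kernel source assembled at run time (here a trivial scale kernel).
    static const char* kSource =
        "extern \"C\" __global__ void scale(float* x, float a, int n)\n"
        "{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) x[i] *= a;\n"
        "}\n";

    int main()
    {
        // compile the source to PTX with NVRTC (available since CUDA 7)
        nvrtcProgram prog;
        nvrtcCreateProgram(&prog, kSource, "scale.cu", 0, NULL, NULL);
        nvrtcCompileProgram(prog, 0, NULL);

        size_t ptx_size = 0;
        nvrtcGetPTXSize(prog, &ptx_size);
        std::vector<char> ptx(ptx_size);
        nvrtcGetPTX(prog, &ptx[0]);
        nvrtcDestroyProgram(&prog);

        // load the PTX and fetch the kernel handle with the driver API
        cuInit(0);
        CUdevice dev;   cuDeviceGet(&dev, 0);
        CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
        CUmodule mod;   cuModuleLoadData(&mod, &ptx[0]);
        CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale");
        std::printf("kernel compiled and loaded at run time\n");

        cuModuleUnload(mod);
        cuCtxDestroy(ctx);
        return 0;
    }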

Thanks in any case,
marco

On Mon, Apr 20, 2015 at 12:48 Marcus Müller <[email protected]> wrote:

Hi Marco,

If I may recommend something, it would be having a look at VOLK [1].
It’s the optimizations library that comes with GNU Radio.
If you could implement some of these algorithms in CUDA, then every
block currently using VOLK (which is the majority of the arithmetically
challenging blocks at the moment) could automatically make use of your
accelerations, without having to change anything! Also, VOLK comes with
volk_profile, which it uses to test the different implementations that
work on your hardware, looking for the fastest one. That would be the
ultimate benchmark for your kernels, as it directly compares the
efficiency of the “general C” and CPU-SIMD implementations to your CUDA
kernels.
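In case you haven’t called VOLK directly yet, using a kernel is just a
single function call; an untested sketch:

    #include <volk/volk.h>
    #include <cstdio>

    // Adds two float vectors with volk_32f_x2_add_32f; VOLK dispatches to the
    // fastest implementation (generic C, SSE, AVX, ...) found by volk_profile.
    int main()
    {
        const unsigned int n = 8;
        const size_t alignment = volk_get_alignment();
        float* a = (float*)volk_malloc(n * sizeof(float), alignment);
        float* b = (float*)volk_malloc(n * sizeof(float), alignment);
        float* c = (float*)volk_malloc(n * sizeof(float), alignment);
        for (unsigned int i = 0; i < n; i++) { a[i] = i; b[i] = 2.0f * i; }

        volk_32f_x2_add_32f(c, a, b, n); // c = a + b, elementwise

        for (unsigned int i = 0; i < n; i++) std::printf("%f\n", c[i]);
        volk_free(a); volk_free(b); volk_free(c);
        return 0;
    }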

Furthermore, gr-theano is worth a visit [2], because it actually uses
CUDA to accelerate channel models. The point here is that GPUs, with
their high memcpy latency (and CPU cost), aren’t practical for all
problems. If I just want to add a small number of samples, doing it on a
CPU might simply pay off better; gr-theano for example offers an FFT,
which might be one of the algorithms typically working on large vectors
where the CPU/GPU boundary crossing might be worth it.

Best regards,
Marcus

[1] http://nathanwest.us/volk/
[2] http://www.cgran.org/pages/gr-theano.html

On Mon, Apr 20, 2015 at 10:21 AM, Marcus Müller <[email protected]> wrote:

benchmark for your kernels, as it directly compares the efficiency of the
“general C” and CPU-SIMD implementations to your CUDA kernels.

We’ve never been hot on the idea of using VOLK for GPU stuff. VOLK
kernels
tend to do one thing at a time and don’t worry about data movement (too
much) because the SIMD registers are right there. Going to GPUs takes a
lot
longer, so you want to spend more of your time there once you get the
data
moved across. With VOLK, we’d be going back and forth, which is a huge
performance killer.


I’m also not the biggest fan of CUDA for GNU Radio simply because it’s too
hardware specific. I’d be more interested in seeing OpenCL implementations
– but even that has its limitations in terms of support. Theano looks nice
from what I’ve heard (mostly from Tim and his gr-theano work), and I don’t
believe that it’s necessarily CUDA.

Tom

On Mon, Apr 20, 2015 at 16:30 Tom R. <[email protected]> wrote:

We’ve never been hot on the idea of using VOLK for GPU stuff. VOLK kernels
tend to do one thing at a time and don’t worry about data movement (too
much) because the SIMD registers are right there. Going to GPUs takes a lot
longer, so you want to spend more of your time there once you get the data
moved across. With VOLK, we’d be going back and forth, which is a huge
performance killer.

Exactly; unfortunately I cannot do this kind of porting.

I’m also not the biggest fan of CUDA for GNU Radio simply because it’s too
hardware specific. I’d be more interested in seeing OpenCL implementations
– but even that has its limitations in terms of support. Theano looks nice
from what I’ve heard (mostly from Tim and his gr-theano work), and I don’t
believe that it’s necessarily CUDA.

I agree that CUDA is not the best choice, because it’s hardware specific.
But portability doesn’t imply performance.
I had a look at gr-theano. It’s a nice library based on CUDA which
provides interesting blocks like FFT and FIR. One of its limitations, in my
opinion, is the mandatory device-host memcpy.
An interesting library that I was not able to find is GRGPU. I’ve only read
about it in papers/forums, but I didn’t find any repository that still
works, unfortunately.