Efficient data transfer to co-processor (GPGPU using OpenCL)

Hello,

I am experimenting with OpenCL and need to create a GNU Radio processing
block where all the computations will happen inside a discrete GPU. More
precisely, I am writing an OpenCL implementation of the polyphase
channelizer.

The block needs to process 1.6 Gbps (50 Msps @ 32 bit). If I do everything
in one synchronous block, then every time the work() function is called
I have to transfer the samples from host->GPU, compute, and transfer the
results back from GPU->host. This is not optimal: the PCI-E bus can
achieve high throughput, but it also has high latency. The ideal
situation would be feeding and consuming from the GPU in an asynchronous
fashion (using DMA?) so that the GPU doesn’t stop processing during
transfers.
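
To put rough numbers on the paragraph above, here is a back-of-the-envelope
sketch. Only the sample rate and sample width come from the post; the bus
bandwidth, per-transfer latency, kernel time and chunk size are hypothetical
placeholders, and the overlapped case assumes a full-duplex bus in steady
state.

```python
# Figures from the post: 50 Msps at 32 bits per sample.
SAMPLE_RATE = 50e6
BITS_PER_SAMPLE = 32

# Hypothetical values, chosen only for illustration (not measurements).
BUS_BYTES_PER_S = 8e9        # assumed usable PCI-E bandwidth
TRANSFER_LATENCY_S = 20e-6   # assumed fixed cost per transfer
COMPUTE_S = 50e-6            # assumed kernel time per chunk
CHUNK_SAMPLES = 8192         # assumed samples handed to one work() call


def stream_rate_bits_per_s(sample_rate, bits_per_sample):
    """Raw stream rate in bits per second."""
    return sample_rate * bits_per_sample


def transfer_time_s(n_samples, bits_per_sample, bus_bytes_per_s, latency_s):
    """One-way transfer time: fixed latency plus payload over bandwidth."""
    nbytes = n_samples * bits_per_sample / 8
    return latency_s + nbytes / bus_bytes_per_s


def synchronous_time_s(transfer_s, compute_s):
    """Blocking work(): copy in, compute, copy out, one after the other."""
    return transfer_s + compute_s + transfer_s


def overlapped_time_s(transfer_s, compute_s):
    """Steady-state cost per chunk when all three stages run concurrently:
    the pipeline advances at the pace of its slowest stage."""
    return max(transfer_s, compute_s)


rate = stream_rate_bits_per_s(SAMPLE_RATE, BITS_PER_SAMPLE)
xfer = transfer_time_s(CHUNK_SAMPLES, BITS_PER_SAMPLE,
                       BUS_BYTES_PER_S, TRANSFER_LATENCY_S)
budget = CHUNK_SAMPLES / SAMPLE_RATE  # real-time budget for one chunk

print("stream rate (bit/s):", rate)
print("real-time budget per chunk (s):", budget)
print("synchronous cost per chunk (s):", synchronous_time_s(xfer, COMPUTE_S))
print("overlapped cost per chunk (s):", overlapped_time_s(xfer, COMPUTE_S))
```

With placeholder values like these, the fixed latency dominates the payload
time for small chunks, which is why larger chunks and overlapped transfers
both help.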

I know there are some ongoing efforts to better handle co-processors in
GNU Radio (http://gnuradio.org/redmine/projects/gnuradio/wiki/Keystone2,
https://gnuradio.org/redmine/projects/gnuradio/wiki/GRCon13Coprocessor).
Unfortunately, this work is barely usable now and requires
modification of the GNU Radio runtime.
There is also the GREX project, which has all the features I need
(extended buffer API/pinned memory), but the project is discontinued.

There are numerous papers about implementing a polyphase channelizer on
a GPU, and some of them claim to have used GNU Radio, but to my
knowledge none of them provides a reference implementation or explains
how they handled buffer management inside GNU Radio.

As a workaround, I want to create a hier_block2 with one “pinned memory”
DMA sink feeding my GPU and one DMA source. Do you think this option is
valid? What other options do I have?
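
The sink/source structure described above can be sketched roughly as a
three-stage pipeline. This is a pure-Python stand-in, not OpenCL or GNU Radio
code: the function and queue names are hypothetical, the list copy merely
models a host->GPU transfer, and `kernel` is any callable playing the role of
the GPU computation.

```python
import queue
import threading


def run_pipeline(chunks, kernel, depth=2):
    """Push chunks through upload -> compute -> download stages that run
    concurrently, so one chunk can be in flight in each stage."""
    # Bounded queues stand in for a small pool of pinned buffers.
    to_device = queue.Queue(maxsize=depth)
    from_device = queue.Queue(maxsize=depth)
    results = []
    done = object()  # sentinel marking end of stream

    def uploader():
        # Role of the "DMA sink": hand chunks to the device side.
        for chunk in chunks:
            to_device.put(list(chunk))  # the copy models host->GPU transfer
        to_device.put(done)

    def computer():
        # Role of the GPU: process whatever has been uploaded.
        while True:
            item = to_device.get()
            if item is done:
                from_device.put(done)
                return
            from_device.put(kernel(item))

    def downloader():
        # Role of the "DMA source": collect results in arrival order.
        while True:
            item = from_device.get()
            if item is done:
                return
            results.append(item)

    threads = [threading.Thread(target=stage)
               for stage in (uploader, computer, downloader)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Usage: a trivial "kernel" that doubles every sample.
out = run_pipeline([[1, 2], [3, 4], [5]], lambda c: [2 * x for x in c])
print(out)
```

Each stage has a single worker and the queues are FIFO, so chunk order is
preserved. The `depth` parameter bounds how many chunks can be queued between
stages, which is the part a real pinned-buffer pool would have to manage.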

Any opinion on the subject is welcome. Thank you very much!