Question about UHD driver

Mark_McCarron · May 16, 2013, 8:02pm

I am wondering if the UHD driver has the ability to create multiple
copies of the stream in memory???

Let’s say I have a flow-graph that has two branches, the first pushes
complex data to an FFT, whereas the second demodulates a portion of that
data into AM.

Does the driver supply a single stream, which is then copied by the
application? Or does it create two copies of the stream and allow each
branch of the flow-graph to manipulate the data via pointers?

I’m digging into DMA to see if this is possible, I would be surprised if
there was a limitation here.

Regards,

Mark McCarron

Mark_McCarron · May 16, 2013, 8:08pm

The UHD driver provides a single stream to the GNU Radio interface.

Any “branching” of that stream is handled by the GNU Radio scheduler,
completely outside the control or knowledge of UHD.

Mark_McCarron · May 17, 2013, 5:53am

There is no need to create multiple copies. The consuming blocks are each
given a pointer to the same data, and the memory is not freed until all the
consuming blocks indicate they are done with it.

Matt

Mark_McCarron · May 17, 2013, 6:02am

Also, the “contract” that work functions have with the scheduler is that
they don’t write into the input buffer.
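
A minimal sketch of that contract in plain Python, not GNU Radio code. The buffer and the two branch names are hypothetical, and `memoryview.toreadonly()` (Python 3.8+) is only a stand-in for the rule that consumers read but never write their input. In GNU Radio the rule is described in this thread as a contract between work functions and the scheduler, not as a runtime check like the one here.

```python
# Hypothetical illustration: one producer buffer shared by two consumers.
samples = bytearray(b"\x01\x02\x03\x04")     # stands in for the source's output buffer

# Each consumer receives a read-only view of the same memory; nothing is copied.
fft_input = memoryview(samples).toreadonly()
am_input = memoryview(samples).toreadonly()


def try_to_modify(view):
    """Return True if writing through the view succeeds, False if it is refused."""
    try:
        view[0] = 0xFF
        return True
    except TypeError:
        return False


write_allowed = try_to_modify(fft_input)

# Only the producer writes; both consumers observe the change without any copy.
samples[1] = 0x7F
both_see_update = fft_input[1] == 0x7F and am_input[1] == 0x7F
```

Because neither consumer can write, sharing one buffer between the FFT branch and the AM branch cannot cause one branch to disturb the other.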

Mark_McCarron · May 17, 2013, 6:03am

There is a performance issue with this. If your program needs to
manipulate the raw data while also providing that raw data to one or more
other branches, a copy must be made. If this is the case, then it
would make more sense to duplicate the data in parallel as it enters the
system. This should be more efficient than a memcpy.

I am looking into DMA to see if this is possible.

Regards,

Mark McCarron

Discuss-gnuradio mailing list

[email protected]

https://lists.gnu.org/mailman/listinfo/discuss-gnuradio

Mark_McCarron · May 17, 2013, 6:13am

Are you saying that it is better to always make copies of the data rather
than just make copies when you need them?

In any case, I think you misunderstand both how GNU Radio works and how
UHD interacts with it. UHD provides a single copy of data to GNU Radio for
two reasons: first, that is the most efficient thing to do, and second,
UHD can’t possibly know what GNU Radio plans to do with the data. GNU
Radio passes pointers to the data to every block that needs it. Blocks are
not allowed to modify their inputs, only their outputs. This is
fundamental to how GNU Radio operates.

Matt
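
A small sketch of "copy only when you need it", written in plain Python and not with GNU Radio's API. The function name `scale_block` and the buffers are hypothetical. The branch that wants modified samples writes them into its own output buffer, while the FFT branch keeps reading the untouched source buffer.

```python
from array import array


def scale_block(input_items, output_items, gain):
    """Hypothetical work function: reads input_items, writes only output_items."""
    n = min(len(input_items), len(output_items))
    for i in range(n):
        output_items[i] = gain * input_items[i]
    return n  # number of items produced


source_buffer = array("f", [1.0, 2.0, 3.0, 4.0])   # written once by the source
scaled_buffer = array("f", [0.0] * 4)              # output buffer of the scaling branch

fft_view = memoryview(source_buffer)               # FFT branch: shared, unmodified input
produced = scale_block(memoryview(source_buffer), scaled_buffer, gain=0.5)
```

The only new memory is the output of the branch that actually changes the data, which it would need in any case.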

Mark_McCarron · May 17, 2013, 4:54pm

Hi Mark,

For currently available UHD devices with high bandwidth, data comes into
the host computer via a Gigabit Ethernet controller (or USB).
UHD basically talks to the kernel and uses the data provided by the
network card driver and network stack. UHD is therefore a layer
built on top of the hardware drivers; it does not copy the data from the
NIC itself. (The same applies to USB controllers.)
So unless you start rewriting Gigabit Ethernet hardware drivers,
there’s no way to get the same hardware data into the host system via
multiple DMA transfers from the same origin.

Greetings,
Marcus
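
To illustrate the layering described above, here is a hedged sketch using an ordinary UDP socket on the loopback interface. It is not UHD's transport code; the payload, buffer size, and use of loopback are assumptions made for the example. The point is only that a user-space program asks the kernel for a received packet and never drives the NIC's DMA itself.

```python
import socket

# Receiver: a plain UDP socket. By the time recv_into() returns, the kernel's
# driver and network stack have already dealt with the hardware.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))          # port 0: let the OS pick a free port
rx.settimeout(2.0)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = bytes(range(16))         # stand-in for a packet of samples
tx.sendto(payload, rx.getsockname())

buffer = bytearray(2048)           # application-owned buffer, reusable per packet
nbytes = rx.recv_into(buffer)      # the kernel delivers the payload into our buffer
received = bytes(buffer[:nbytes])

tx.close()
rx.close()
```

Everything below the `recv_into()` call belongs to the operating system, which is why duplicating data at DMA time is outside the reach of an application-level library.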


Mark_McCarron · May 17, 2013, 6:42am

Matt,

My area of research is DSP and massive parallelism. Given the structure
of GNU Radio, it is possible to know what data is required upfront.
This opens up the possibility of a performance boost. I know how
GNURadio works, it was discussed earlier when I raised this question.

There is a different way though. Let’s assume we have two branches
coming from the source. The first is going to an FFT, the second to
some form of flow-graph that performs functions on the IQ stream. Also,
we don’t want the changes we make to the IQ stream to be reflected in
the FFT.

Now, we can approach this several ways:

1. Serial - Send the data first to FFT, then to the second portion of
   the flow-graph.
2. Parallel (memcopy) - Copy the data in memory, provide one to the FFT
   and the other to the IQ flow-graph.
3. Parallel (DMA/Driver) - Driver duplicates the data in memory
   according to the needs of the program. This is not a memcopy, but a
   true parallel creation as the stream is extracted from the wire.

The last approach allows us to lighten the load on the application and
CPU by off-loading initial memory allocation to DMA controllers. This
way, we don’t need to manage FIFO streams within the app in relation to
the initial input.

As I said, I am still checking if this is possible, but when working
with multiple branches that require independent copies of the data, this
would be the best-performing way to deliver the data.

Regards,

Mark McCarron


Mark_McCarron · May 17, 2013, 5:39pm

Marcus,

I have been running into that issue as well. It seems that we are in
a transition period with the introduction of multi-core processors and
OpenCL. Bus design has not been modified to cope with the parallel
duplication of data from high speed serial streams.

The problem is that most hardware that does DMA isn’t willing to do so
more than once. You can’t tell the USB chip or the Ethernet chip
“OK, transfer that packet here, and here, and here, and here”. Once
the hardware completes a DMA transfer, it moves on, it’s done.

The network stack will, necessarily, be not-entirely-zero-copy as well,
since it has to deal with network headers, etc, etc.

But the main reason to avoid unnecessary memcopies is, I think, to
reduce pressure on the memory bus. With multi-core CPUs around, it’s
not necessarily the CPU cost that you’re concerned about.

Further, any such costs are usually dwarfed by the overall cost of the
DSP chain. The cost of getting the samples into the system is a very
small part of the overall cost of any non-trivial DSP chain.
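
For a rough sense of scale, the arithmetic below bounds the memory traffic of one extra copy of a Gigabit Ethernet stream. The packet size and, above all, the memory bandwidth figure are assumptions chosen for illustration, not measurements from this thread; substitute numbers for your own machine.

```python
LINK_BITS_PER_S = 1_000_000_000          # Gigabit Ethernet line rate
ASSUMED_PACKET_BYTES = 1500              # assumption: standard Ethernet MTU-sized packets
ASSUMED_MEM_BW_BYTES_PER_S = 10e9        # assumption: illustrative figure, not a measurement

# Upper bound on the sample stream: the line rate, ignoring protocol overhead.
max_stream_bytes_per_s = LINK_BITS_PER_S / 8

# One extra copy reads the stream once and writes it once.
copy_traffic_bytes_per_s = 2 * max_stream_bytes_per_s
fraction_of_assumed_bw = copy_traffic_bytes_per_s / ASSUMED_MEM_BW_BYTES_PER_S
packets_per_s = max_stream_bytes_per_s / ASSUMED_PACKET_BYTES
```

Under these assumptions one extra copy costs at most 250 MB/s of memory traffic, a few percent of the assumed bus bandwidth.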

Until manufacturers start offloading data again, it looks like there will
be some hard limits to software defined radio that will make hardware
defined solutions more cost effective.

SDR isn’t about cost-effectiveness, necessarily. It’s largely about
flexibility. If you have a radio design that is fixed-in-stone, and
you’re going to produce a barjillion of them, you cut an ASIC or two, and
you’re done. WiFi chips are cheap, cheap, cheap.

Consider an example. A wideband FM demodulator chip can be had for a
few tens of cents. But I can emulate said demodulator on a computer
with SDR hardware for only a few orders of magnitude more in hardware
cost.

But if today I want to demodulate wideband FM, and tomorrow, I want to
do radio astronomy, and the next day, I want to listen to police
scanners, and the next day, I want to process ionospheric sounder
data, an SDR platform may well be the way to go.

Similarly, if I have a “production” RF system of some sort where there
will only ever be a small number (for various values of small number)
deployed, an SDR approach may make more economic sense than going down
the path of custom hardware design in ASICs, etc.

Mark_McCarron · May 17, 2013, 6:05pm

Hi Mark,

as interesting as your point is, that’s not something that
can be fixed within the scope of GNU Radio or even UHD…

Anyhow, I’m not really convinced that multiple DMA transfers are always
faster than copying data using memcpy - at least if those DMA transfers
copy only a few kilobytes, as is the case with packets from network
devices.
The fact that packets are of limited size is not really a problem of
current computing architectures; it’s a consequence of having packet
networks. Of course, it would be nice if your hardware were able to
actually stream data into your userland, which somehow had the (zero
overhead) capability to tell the hardware to send the next sample. But
in reality, hardware-to-CPU transfers usually happen en bloc, and that
is just fine for most applications, since there is little use in receiving
samples one at a time; some buffering is therefore always necessary
(and always will be).
For the sake of adaptability, hardware most probably won’t
be built to copy device data to multiple RAM addresses, since the data
from the device usually needs some processing (hence the driver).
So in effect, in most imaginable cases a device will do a single
DMA transfer to RAM.

Greetings,
Marcus

Mark_McCarron · May 18, 2013, 11:38am

I would tend to agree, but if we do not outline what we require from
manufacturers, we will never get it. I would seriously suggest writing
a specification and submitting it to Intel, AMD, etc.

Regards,

Mark McCarron

Mark_McCarron · May 17, 2013, 5:19pm

Marcus,

I have been running into that issue as well. It seems that we are in a
transition period with the introduction of multi-core processors and
OpenCL. Bus design has not been modified to cope with the parallel
duplication of data from high speed serial streams.

This has implications for the performance of DSP, as well as other
fields, on traditional computing platforms. It looks like the entire
architecture of the PC needs a solid rethink at this point. As far as I
can tell, the current architecture choices are cost related and
manufacturers are attempting to software-define all transfers or
incorporate SoC solutions.

It means that we are saturating the CPUs with unnecessary tasks and
creating bottlenecks as a result. Until manufacturers start offloading
data again, it looks like there will be some hard limits to software
defined radio that will make hardware defined solutions more cost
effective.

Regards,

Mark McCarron

Mark_McCarron · May 18, 2013, 12:03pm

I think you’re still missing the very important point that GNU Radio
doesn’t duplicate the data. Blocks that share an input share a ring
buffer, with each block having its own pointers into that ring buffer.
There’s no copying on the input, and work functions have a “contract”
with the scheduler that they don’t modify their inputs, thus guaranteeing
that sharing the input buffer is “safe”. There are some pointers that move
around, but a bifurcation of a stream simply means a bifurcation of
pointers; the data aren’t duplicated.

Mark_McCarron · May 18, 2013, 12:09pm

On Fri, May 17, 2013 at 9:55 AM, Mark McCarron
[email protected] wrote:

In order to support massive parallelism, data must be duplicated as it
comes off the wire and into memory. Not duplicated in FIFO streams in an
application.

There is no duplication of buffer contents in GNU Radio.

To elaborate on what Matt E. described earlier, GNU Radio blocks are
connected via single-producer, multi-consumer FIFOs. Upstream blocks
(including hardware source blocks) write into the FIFO, and multiple
concurrently running downstream-connected blocks have read-only access
to its contents, while writing the output of their individual DSP
functions into new FIFOs for the next stages of the pipeline.

There is no need to pre-copy the data into different memory areas for
multiple consumers to access, and no need to worry that processing in
one block has any side effects on processing in another block.

Johnathan C.
Corgan Labs - SDR Training and Development Services
http://corganlabs.com
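
The single-producer, multi-consumer FIFO described above can be sketched roughly as follows. This is a toy model in plain Python, not GNU Radio's buffer implementation; the class name `SharedRingBuffer` and its methods are hypothetical. A real implementation would hand out pointers into the buffer, whereas this toy returns the items in a list for simplicity.

```python
class SharedRingBuffer:
    """Hypothetical single-producer, multi-consumer ring buffer (toy model)."""

    def __init__(self, capacity, n_readers):
        self.data = [None] * capacity
        self.capacity = capacity
        self.write_count = 0                     # total items ever written
        self.read_counts = [0] * n_readers       # total items consumed, per reader

    def space_available(self):
        # Space is bounded by the slowest reader: nothing is overwritten
        # until every reader has consumed it.
        return self.capacity - (self.write_count - min(self.read_counts))

    def write(self, items):
        if len(items) > self.space_available():
            raise BufferError("slowest reader has not freed enough space")
        for item in items:
            self.data[self.write_count % self.capacity] = item
            self.write_count += 1

    def read(self, reader, max_items):
        start = self.read_counts[reader]
        n = min(max_items, self.write_count - start)
        items = [self.data[(start + i) % self.capacity] for i in range(n)]
        self.read_counts[reader] = start + n     # only this reader's pointer moves
        return items


ring = SharedRingBuffer(capacity=4, n_readers=2)
ring.write([10, 20, 30])
fft_items = ring.read(reader=0, max_items=3)     # FFT branch consumes everything
am_items = ring.read(reader=1, max_items=1)      # AM branch lags behind
space_after_reads = ring.space_available()
```

Both readers see the same stored samples, and the producer can only reuse a slot once the slowest reader has moved past it, which matches the "memory is not freed until all the consuming blocks indicate they are done with it" behaviour described earlier in the thread.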

Mark_McCarron · May 18, 2013, 12:45pm

Guys,

This places a limit on the performance of GNURadio that can be avoided
through a push for a more modern type of DMA.

The ideal scenario is to never copy data and it is achievable, to a
degree, through proper planning. If you look at your argument, you are
essentially saying that it is better to copy than to have a pointer.

I can’t agree with that.

Regards,

Mark McCarron


Mark_McCarron · May 18, 2013, 1:27pm

The ideal scenario is to never copy data and it is achievable, to a
degree, through proper planning.

I have to strongly disagree with that.
You have to realize what a /driver/ is, and why it is needed:
a driver takes whatever resources a piece of hardware offers and makes
these resources usable to actual application software. Thus, a driver is
/necessary/ to convert and transfer data from “the wire” to something
a program can access without having to know how this particular piece of
hardware works.
This conversion has to happen using the CPU power of the host.
Therefore, you either have to let the driver do its work on all copies of
the device data in RAM, or you just do it once, and then copy the data
using the CPU. The latter is way more intelligent, flexible, and
well-performing, and it is what is done in current architectures.

If you look at your argument, you are essentially saying that it is
better to copy than to have a pointer.

In many cases it is.
Example?
You have an arbitrary computer architecture with external memory (this
is desirable unless you want to be limited to microcontrollers):

RAM --- memory bus --- CPU

Gigabytes of RAM aren’t easy to produce cheaply, and are even harder to
access with low latency.
Therefore, modern CPUs have caches:

RAM --- memory bus --- cache --- CPU

Those caches are designed to be fast, but are of limited size (for the
reasons mentioned above).
Now take your DMA transfer: you instruct the memory controller to write
data from your device to RAM.

That automatically invalidates the cache for this RAM region, if it
happens to be cached, which is likely, because we’re in a scenario where
we constantly use data from the device.

Now assume that this data is relevant to the system (otherwise we
wouldn’t argue over performance, would we?).
So, in the next few microseconds, someone is going to access that newly
written data.
Whether the cache/DMA/memory controller updated the cache or not, there
will be one valid copy in the cache soon.
Now, copying that data from RAM address to RAM address is usually a lot
faster than a DMA, because

- the cache can “hide” the copying by reading from the original address
  as long as no writes on either original or copy take place, and
- access to DMA’ed memory only present in RAM is as fast as access to
  the cache at best.

Therefore, zero copy is not always preferable to having a RAM copy,
especially for data that fits into the L2 cache several times over, and
for Ethernet packets in particular.

Hope that mail explained my point of view well enough.
Greetings,
Marcus

Mark_McCarron · May 18, 2013, 11:58am

I think you are missing the point. In order to support massive
parallelism, data must be duplicated as it comes off the wire and into
memory. Not duplicated in FIFO streams in an application. The latter
is a software implementation of a hardware task and is consuming
resources.

It requires hardware and architecture changes to implement properly.

Regards,

Mark McCarron

Mark_McCarron · May 18, 2013, 1:40pm

Marcus,

I was writing the Windows driver for Per Vices Corporation (Phi/Noctar)
last year, I know how drivers work. I should have mentioned that
earlier.

What you are missing is the fact that the DMA must occur first before
anything can get to a cache. So, if we are writing to memory in
parallel, it is always going to be faster as this happens long before
data gets to the CPU.

Also, just to correct some things, the whole point of DMA is to take the
CPU out of the loop, so the CPU is not used to conduct transfers. It
can take part in scheduling, but the data goes from the device into
memory and a pointer is returned. The FIFO buffer in an app makes use
of this pointer.

Regards,

Mark McCarron


Mark_McCarron · May 18, 2013, 3:38pm

Hi Mark,

I wasn’t assuming you didn’t know what a driver is. I was just hoping
you’d realize more clearly that, especially for something like network
packets, you need a hardware driver (and the network stack of the OS)
to make use of your DMA’ed data.

You’re totally right that data from a device needs to be transferred
somewhere before it can be used.
However, I don’t think you’re right that a parallel DMA always makes
your system faster: your second version of the data still has to be
processed by the driver/stack (and therefore by the CPU), so having it
copied into RAM while your machine is processing the first version is not
necessarily faster than copying the processed version.
In fact, under my caching assumptions, that would even be slower on a
single-core system.

Mark_McCarron · May 18, 2013, 6:08pm

So, you think the penalty of processing in the stack outweighs the
performance gained by having duplicate streams?

You do realise they are being processed in parallel in the stack???

By the time you would start the copy, my modified DMA would be ready
under all scenarios.

Regards,

Mark McCarron
