Fwd: Question about UHD driver

alcina · May 17, 2013, 8:53pm

This was actually supposed to go to the list as well.

---------- Forwarded message ----------
From: Hilbert T. [email protected]
Date: Fri, May 17, 2013 at 2:48 PM
Subject: Re: [Discuss-gnuradio] Question about UHD driver
To: Mark McCarron [email protected]

Mark:

First, it’s “copies are bad”, then it’s “copies are good”. Make up your
mind, laddy.

The critical resource here, which drives the need to reduce memcpy-like
operations, isn’t CPU but memory bandwidth. That memory bandwidth gets
chewed up whether it’s the CPU doing it or the DMA controller. There’s
no magic on the bus. It doesn’t care who is doing transactions.
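
To put rough numbers on the bandwidth point: a plain copy reads every byte once and writes it once, so each extra copy of a sample stream costs about twice the stream's data rate in bus traffic, whoever performs it. The sketch below is back-of-envelope arithmetic only. The 25 MS/s rate and 8-byte complex float samples are assumed figures, the function name is made up for illustration, and cache effects such as write-allocate are ignored.

```python
def copy_traffic_bytes_per_s(sample_rate_hz, bytes_per_sample, copies=1):
    """Approximate bus traffic caused by plain copies of a sample stream."""
    # Each copy reads every byte once and writes it once: 2x the data rate.
    return 2 * copies * sample_rate_hz * bytes_per_sample


# Assumed figures: 25 MS/s of complex float32 samples (8 bytes each).
traffic = copy_traffic_bytes_per_s(25_000_000, 8)
print(traffic / 1e6, "MB/s of bus traffic per extra copy")
```

The result is the same whether the CPU or a DMA controller does the moving, which is the point being made above.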

In the land of multi-core CPUs, it’s rather silly to say “but the CPU
has better things to do than X”. So, those CPUs should perhaps spend
their time playing Zork? Or surfing porn?

Again, the “drive” to reduce memory-to-memory copy traffic is to reduce
pressure on memory bus bandwidth, not to save those oh-so-precious,
I-only-have-eight-of-’em CPUs. Since most modern CPUs have microcoded
memory-to-memory copy instructions, the CPU burden is relatively small.
We aren’t back in the dark days of “optimized” memcpy operations being
a series of word-wise copies, followed by a byte-wise “mop up”.

Your argument could well be extended, reductio ad absurdum, to “the CPU
has better things to do than anything you might want to do in a
flow-graph”, which is clearly absurd.

One might re-cast the problem as “any memory-to-memory operation should
have non-zero ‘work’ applied while doing the copy”. So, a memcpy is a
“no useful work” motion of data from one place to another. Other types
of “data motion” have useful work applied as the data are in motion.
This is roughly how GNU Radio works. It doesn’t leverage as many “zero
copy” opportunities as perhaps it should, and Josh Blum’s GRAS work is
a step in the direction of leveraging zero-copy opportunities wherever
possible.
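
The distinction between a “no useful work” copy and data motion with work applied can be sketched in a few lines. This is only an illustration in plain Python, not GNU Radio or GRAS code; both function names and the gain stage are hypothetical.

```python
from array import array


def copy_then_scale(src, gain):
    """Two passes over the data: a plain copy, then the actual work."""
    staged = array("f", src)                       # pass 1: memcpy-like, no useful work
    return array("f", (gain * x for x in staged))  # pass 2: the useful work


def scale_while_moving(src, gain):
    """One pass: the multiply is applied as samples move to the output."""
    return array("f", (gain * x for x in src))


samples = array("f", [0.5, -0.25, 1.0, 0.0])
# Same result either way; the second form touches memory one time fewer.
assert copy_then_scale(samples, 2.0) == scale_while_moving(samples, 2.0)
```

The second form is roughly what a processing block does when it reads its input buffer and writes its output buffer: the data still moves, but the move is not wasted.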

But again, getting the data out of the hardware, while an important
problem, usually constitutes a small fraction of the overall CPU and
memory-bandwidth costs of any kind of non-trivial SDR signal flow.

In the era of multi-core CPUs (with ‘multi’ starting to scale to
“absurd”), the notion of “the CPU shouldn’t be spending its precious
time doing that” is a decreasingly-defensible position to take.

Discuss-gnuradio mailing list
[email protected]
Discuss-gnuradio Info Page

–
Hilbert (Godamn) Transform
[email protected]
Purveyor of fine Hilbert (Godamn) Transforms since 2013

Hilbert_T · May 17, 2013, 9:42pm

I don’t know if I agree with this. I don’t usually have issues with the
memory bus. Every problem I have encountered, in terms of bottlenecks,
is nearly always related to I/O. The CPU is useless at this and that’s
why we have DMA.

With constant streams of real-time data, there is a fixed window in
which to get all the processing done. Thus each stage needs to be
optimized and that begins with I/O. We really should have some
performance metrics for each block, so that when they are combined we
have an estimate of the total end-to-end time.
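
One way such per-block metrics could look, as a minimal sketch: time each stage with a wall-clock timer, sum the results, and compare the total against the time the samples represent. The block list, function names and the 1 MS/s rate are all assumptions for illustration, not an existing GNU Radio facility.

```python
import time


def time_blocks(blocks, data):
    """Run data through each (name, function) block; return output and timings."""
    timings = []
    for name, fn in blocks:
        start = time.perf_counter()
        data = fn(data)
        timings.append((name, time.perf_counter() - start))
    return data, timings


def realtime_budget_s(n_samples, sample_rate_hz):
    """Time available to process n_samples before the next batch arrives."""
    return n_samples / sample_rate_hz


# Two toy stages standing in for real signal-processing blocks.
blocks = [("scale", lambda xs: [2.0 * x for x in xs]),
          ("square", lambda xs: [x * x for x in xs])]
out, timings = time_blocks(blocks, [0.0, 0.5, 1.0])
total = sum(t for _, t in timings)
budget = realtime_budget_s(len(out), 1_000_000)  # assumed 1 MS/s stream
print(timings, "total:", total, "budget:", budget)
```

A single run like this is noisy; averaging over many batches would give a more usable estimate.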

Regards,

Mark McCarron

Date: Fri, 17 May 2013 14:52:09 -0400
From: [email protected]
To: [email protected]
Subject: [Discuss-gnuradio] Fwd: Question about UHD driver

This was actually supposed to go to the list as well.

---------- Forwarded message ----------
From: Hilbert T. [email protected]
Date: Fri, May 17, 2013 at 2:48 PM
Subject: Re: [Discuss-gnuradio] Question about UHD driver
To: Mark McCarron [email protected]

Mark:

First, it’s “copies are bad”, then it’s “copies are good”. Make up your
mind, laddy.

The critical resource here, which drives the need to reduce memcpy-like
operations, isn’t CPU but memory bandwidth. That memory bandwidth gets
chewed up whether it’s the CPU doing it or the DMA controller.

There’s no magic on the bus. It doesn’t care who is doing transactions.

In the land of multi-core CPUs, it’s rather silly to say “but the CPU
has better things to do than X”. So, those CPUs should perhaps spend
their time playing Zork? Or surfing porn?

Again, the “drive” to reduce memory-to-memory copy traffic is to reduce
pressure on memory bus bandwidth, not to save those oh-so-precious,
I-only-have-eight-of-’em CPUs. Since most modern CPUs have microcoded
memory-to-memory copy instructions, the CPU burden is relatively small.
We aren’t back in the dark days of “optimized” memcpy operations being
a series of word-wise copies, followed by a byte-wise “mop up”.

Your argument could well be extended, reductio ad absurdum, to “the CPU
has better things to do than anything you might want to do in a
flow-graph”, which is clearly absurd.

One might re-cast the problem as “any memory-to-memory operation should
have non-zero ‘work’ applied while doing the copy”. So, a memcpy is a
“no useful work” motion of data from one place to another. Other types
of “data motion” have useful work applied as the data are in motion.
This is roughly how GNU Radio works. It doesn’t leverage as many “zero
copy” opportunities as perhaps it should, and Josh B.'s GRAS work is a
step in the direction of leveraging zero-copy opportunities wherever
possible.

But again, getting the data out of the hardware, while an important
problem, usually constitutes a small fraction of the overall CPU and
memory-bandwidth costs of any kind of non-trivial SDR signal flow.

In the era of multi-core CPUs (with ‘multi’ starting to scale to
“absurd”), the notion of “the CPU shouldn’t be spending its precious
time doing that” is a decreasingly-defensible position to take.

On Fri, May 17, 2013 at 2:36 PM, Mark McCarron
[email protected] wrote:

Marcus,

I was writing the Windows driver for Per Vices Corporation (Phi/Noctar)
last year, so I know how drivers work. I should have mentioned that
earlier.

What you are missing is the fact that the DMA must occur first before
anything can get to a cache. So, if we are writing to memory in
parallel, it is always going to be faster as this happens long before
data gets to the CPU.

Also, just to correct some things, the whole point of DMA is to take the
CPU out of the loop, so the CPU is not used to conduct transfers. It
can take part in scheduling, but the data goes from the device into
memory and a pointer is returned. The FIFO buffer in an app makes use
of this pointer.
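
The “pointer instead of a copy” idea, and its main catch, can be mimicked in plain Python. This is only an analogy under stated assumptions: Python cannot drive a real DMA engine, so a bytearray stands in for the DMA target region, a memoryview stands in for the returned pointer, and `device_write` is a hypothetical helper playing the role of the device.

```python
# Stand-in for a DMA target region that the "device" fills in place.
dma_region = bytearray(16)


def device_write(region, offset, payload):
    """Hypothetical helper: the device overwrites part of the region."""
    region[offset:offset + len(payload)] = payload


device_write(dma_region, 4, b"\x01\x02\x03\x04")

view = memoryview(dma_region)[4:8]   # zero-copy: refers to the same memory
snapshot = bytes(dma_region[4:8])    # explicit copy of the same four bytes

# The device reuses the region for the next transfer.
device_write(dma_region, 4, b"\xff\xff\xff\xff")

assert bytes(view) == b"\xff\xff\xff\xff"   # the view now sees the new data
assert snapshot == b"\x01\x02\x03\x04"      # the copy still holds the old data
```

The view costs nothing to create, but its contents are only valid until the region is reused, so the consumer has to be done with the data (or copy it) before then.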

Regards,

Mark McCarron

Date: Fri, 17 May 2013 20:23:34 +0200
Subject: Re: [Discuss-gnuradio] Question about UHD driver
From: [email protected]
To: [email protected]
CC: [email protected]

> The ideal scenario is to never copy data and it is achievable, to a
> degree, through proper planning.

I have to strongly disagree with that.

You have to realize what a /driver/ is, and why it is needed:
A driver takes whatever resources a piece of hardware offers and makes
these resources usable to actual application software. Thus: a driver
is /necessary/ to convert and transfer data from “the wire” to
something a program can access without having to know how this
particular piece of hardware works.
This conversion has to happen using the CPU power of the host.
Therefore, you either have to let the driver do its work on all copies
of the device data in RAM, or you just do it once, and then copy the
data using the CPU, which is way more intelligent, flexible,
well-performing… and what is done in current architectures.
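
As an illustration of the kind of conversion meant here, the sketch below turns raw wire bytes into samples an application can use. The wire format (interleaved big-endian 16-bit I/Q) and the function name are assumptions chosen for the example, not a description of what any particular driver does.

```python
import struct


def wire_to_complex(payload):
    """Convert interleaved big-endian int16 I/Q pairs to complex floats."""
    n_values = len(payload) // 2
    ints = struct.unpack(">%dh" % n_values, payload[:2 * n_values])
    # Pair up I and Q and scale to roughly [-1, 1). This per-sample work
    # is done once by the host CPU, whatever happens to the data afterwards.
    return [complex(i / 32768.0, q / 32768.0)
            for i, q in zip(ints[0::2], ints[1::2])]


# A made-up four-value payload: two I/Q pairs.
packet = struct.pack(">4h", 16384, -16384, 0, -32768)
samples = wire_to_complex(packet)
print(samples)
```

Since every sample is read and rewritten during the conversion anyway, this pass is data motion with useful work applied rather than a bare copy.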

> If you look at your argument, you are essentially saying that it is
> better to copy than to have a pointer.

In many cases it is.
Example?
You have an arbitrary computer architecture with external memory (this
is desirable unless you want to be limited to microcontrollers):

RAM—memory bus—cpu

Gigabytes of RAM aren’t easy to produce cheaply, and are even harder to
access with low latency. Therefore, modern CPUs have caches:

RAM — memory bus — Cache — CPU

Those caches are designed to be fast, but are of limited size (for the
reasons mentioned above).
Now take your DMA transfer: You instruct the memory controller to write
data from your device to RAM. That automatically invalidates the cache
for this RAM region, if that happens to be cached, which is likely,
because we’re in a scenario where we constantly use data from the
device.

Now assume that this data is relevant to the system. (otherwise we
wouldn’t argue over performance, would we?)

So, in the next few microseconds, someone is going to access that newly
written data. Whether the cache/DMA/memory controller updated the cache
or not, there will be one valid copy in the cache soon.
Now, copying that data from RAM address to RAM address is usually a lot
faster than a DMA, because the cache can “hide” the copying by reading
from the original address as long as no writes to either the original
or the copy take place, whereas access to DMA’ed memory that is only
present in RAM is, at best, as fast as access to the cache.

Therefore, zero copy is not always preferable to having a RAM copy,
especially for stuff that fits into L2 cache multiple times, Ethernet
packets in particular.
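
To give a feel for “fits into L2 cache multiple times”, here is the arithmetic with assumed sizes. Both figures are assumptions for illustration (a 256 KiB L2 cache and 1500-byte packet payloads); real cache sizes and frame sizes vary.

```python
def items_that_fit(cache_bytes, item_bytes):
    """How many whole items of a given size fit in a cache of a given size."""
    return cache_bytes // item_bytes


L2_BYTES = 256 * 1024    # assumed L2 cache size
PACKET_BYTES = 1500      # assumed Ethernet payload size

print(items_that_fit(L2_BYTES, PACKET_BYTES))  # prints 174
```

With numbers in this range, a packet and its copy can both sit in cache at once, which is the situation in which the copy is cheap.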

Hope that mail explained my point of view well enough

Greetings,
Marcus

Discuss-gnuradio mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/discuss-gnuradio

–
Hilbert (Godamn) Transform
[email protected]

Purveyor of fine Hilbert (Godamn) Transforms since 2013
