New implementation for fusb_linux without allocs/frees

Stefan_Bruens · February 23, 2009, 7:39am

Hi,

attached is a new version for fusb_linux.cc.

The current implementation uses three std::list lists for free, pending
and completed urbs, so submitting a single urb causes three allocs and
three frees (pushing and popping of the list).

The new implementation uses a circular list for the urbs, where each urb
is marked as free, pending or completed. As the total number of
allocated urbs is constant, no allocs or frees are needed.
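
To illustrate the idea, here is a minimal sketch of a fixed-size ring of
state-tagged slots. It is not the code from the attached fusb_linux.cc:
the names UrbState, UrbSlot and UrbRing are hypothetical, and the int id
merely stands in for the real urb. All storage is allocated once in the
constructor, so submitting and reclaiming only flip a state tag and
advance an index.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical state tag for each slot in the ring.
enum class UrbState { Free, Pending, Completed };

// Hypothetical slot; `id` is a stand-in for the real urb payload.
struct UrbSlot {
  UrbState state = UrbState::Free;
  int id = 0;
};

class UrbRing {
public:
  // The only allocation happens here, once, for all slots.
  explicit UrbRing(std::size_t n) : slots_(n) {
    assert(n > 0);
    for (std::size_t i = 0; i < n; ++i) slots_[i].id = static_cast<int>(i);
  }

  // Take the slot at the head if it is free and mark it pending.
  // Returns nullptr when the ring is full.
  UrbSlot* submit() {
    UrbSlot& s = slots_[head_];
    if (s.state != UrbState::Free) return nullptr;
    s.state = UrbState::Pending;
    head_ = (head_ + 1) % slots_.size();
    return &s;
  }

  // Mark a pending slot as completed (e.g. after it has been reaped).
  void complete(UrbSlot* s) {
    if (s->state == UrbState::Pending) s->state = UrbState::Completed;
  }

  // Recycle the oldest slot if it has completed; nullptr otherwise.
  UrbSlot* reclaim() {
    UrbSlot& s = slots_[tail_];
    if (s.state != UrbState::Completed) return nullptr;
    s.state = UrbState::Free;
    tail_ = (tail_ + 1) % slots_.size();
    return &s;
  }

private:
  std::vector<UrbSlot> slots_;  // fixed size after construction
  std::size_t head_ = 0;        // next slot to submit
  std::size_t tail_ = 0;        // oldest slot still in flight
};
```

With a ring of two slots, two calls to submit() succeed, a third returns
nullptr until the oldest slot has been completed and reclaimed, and no
call after construction touches the allocator.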

Benchmark:
usrp/host/apps/test_usrp_standard_tx -B 512 -N 64 -M 128

The old code needs ~990e6 instructions, the new code ~690e6. The call to
usrp_basic_tx::write goes down from 380e6 to 80e6 (so almost down to a
fifth …); the remaining instructions are the pattern fill for the sample
buffer.

Regards,

Stefan

Stefan_Bruens · February 23, 2009, 3:45pm

On Sun, Feb 22, 2009 at 10:38 PM, Stefan Bruens
[email protected] wrote:

attached is a new version for fusb_linux.cc.

Stefan–THANKS for all the wonderful work you’ve been submitting; we
really appreciate the kinds of optimization work you are doing. We
will likely get all of it into our distribution.

However, it makes it easier for us to review and apply your work if
you follow the guidelines in the below Wiki page:

http://gnuradio.org/trac/wiki/PatchSubmissionGuidelines

Thanks,

Johnathan

Stefan_Bruens · February 23, 2009, 4:06pm

On Mon, Feb 23, 2009 at 07:38:14AM +0100, Stefan Bruens wrote:

constant, no allocs or frees are needed.

Benchmark:
usrp/host/apps/test_usrp_standard_tx -B 512 -N 64 -M 128

old code needs ~990e6 instructions, new code 690e6 instructions. The call to
usrp_basic_tx::write goes down from 380e6 to 80e6 (so almost down to a fifth
…), the remaining instructions are the pattern fill for the sample buffer.

Regards,
Stefan

Thanks!

Stefan, please tell me again how you measured the instruction count?

Also, if you haven’t already, please send in the form that starts the
copyright assignment. Contact me off-list if you’ve got any more
questions about that.

Eric

Stefan_Bruens · February 24, 2009, 4:05pm

Stefan Bruens wrote:

constant, no allocs or frees are needed.
Stefan

Have you tested receive performance, and is it improved?

Bandwidth is my dearest friend in radio astronomy (in the absence of
RFI), so getting the best USB performance that’s possible given CPU
constraints is important to me.

–
Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium

Stefan_Bruens · February 24, 2009, 6:23pm

On Tue, Feb 24, 2009 at 10:03:05AM -0500, Marcus D. Leech wrote:

marked as free, pending or completed. As the total number of allocated urbs is

Stefan

Have you tested receive performance, and is it improved?

Bandwidth is my dearest friend in radio astronomy (in the absence of
RFI), so getting the best USB performance that’s possible given CPU
constraints is important to me.

Marcus, in my experience, USB performance has not been limited by CPU
cycles. It seems to be primarily a function of the design of the host
controller, the firmware in the device, and a reasonable way to get
the data into user mode. In most apps I’ve benchmarked, the overhead
of all usrp related stuff is typically on the order of 5 to 10% of the
total cycles consumed.

Eric

Stefan_Bruens · February 24, 2009, 8:10pm

Eric B. wrote:

Marcus, in my experience, USB performance has not been limited by CPU
cycles. It seems to be primarily a function of the design of the host
controller, the firmware in the device, and a reasonable way to get
the data into user mode. In most apps I’ve benchmarked, the overhead
of all usrp related stuff is typically on the order of 5 to 10% of the
total cycles consumed.

Eric

OK, so improving total USB cycle counts in user mode from 10% to 5%
perhaps wouldn’t noticeably improve things like overruns–is that what
you’re saying?

So, in your experience what is the sh*t-hottest USB controller out
there, and is it available on a PCI or PCI-E card?

–
Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium

Stefan_Bruens · February 24, 2009, 9:01pm

Eric B. wrote:

Once the controller can handle 32MB/s, you’re golden. Pretty much any
of the onboard controllers over the last few years work fine.

Measure twice, cut once. Time be time.

It is important to make sure that your system is not throttling back the
CPU because of heat or power saving modes.

Matt

Stefan_Bruens · February 24, 2009, 8:39pm

On Tue, Feb 24, 2009 at 01:07:34PM -0500, Marcus D. Leech wrote:

OK, so improving total USB cycle counts in user mode from 10% to 5%
perhaps wouldn’t noticeably improve things like overruns–is that what
you’re saying?

Overruns are generally caused because your signal processing can’t
keep up, not that there’s a problem with handling the USB.

So, in your experience what is the sh*t-hottest USB controller out
there, and is it available on a PCI or PCI-E card?

Once the controller can handle 32MB/s, you’re golden. Pretty much any
of the onboard controllers over the last few years work fine.

Measure twice, cut once. Time be time.

Eric

Stefan_Bruens · February 24, 2009, 9:36pm

On Tuesday 24 February 2009 16:03:05 Marcus D. Leech wrote:

Have you tested receive performance, and is it improved?

Hm, if you are going for high bandwidth, you should set the blocksize
quite high (e.g. 4096). I am targeting low latency, so I go for the
smallest possible blocksize (512).

If you are receiving at 32MB/s, the new code saves about 75e6
instructions per second at a blocksize of 512, so for 4096 this should
be about 9e6. For a current generation CPU this should be about 1% of
its throughput; on the other hand, the mallocs are most probably very
unfriendly to branch prediction and cache access, so savings may be
higher. If I find some time, I will run oprofile to get some numbers …
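
The scaling from 75e6 to about 9e6 can be checked with a short sketch.
It assumes the saving is a fixed number of instructions per submitted
block, so that at a constant data rate it shrinks in proportion to the
number of blocks per second. The function name scale_savings is
hypothetical and only illustrates the arithmetic.

```cpp
// Scale a per-second instruction saving measured at one block size to
// another block size. Assumption: the saving is a fixed cost per block,
// so at a constant byte rate it is inversely proportional to block size.
constexpr double scale_savings(double savings_at_ref,
                               double ref_blocksize,
                               double new_blocksize) {
  return savings_at_ref * ref_blocksize / new_blocksize;
}

// 75e6 instructions/s saved at 512-byte blocks, rescaled to 4096 bytes.
constexpr double savings_4096 = scale_savings(75e6, 512.0, 4096.0);
```

Under that assumption savings_4096 evaluates to 9.375e6 instructions per
second, in line with the "about 9e6" figure quoted above.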

Stefan

–
Stefan Brüns / Bergstraße 21 / 52062 Aachen
phone: +49 241 53809034 mobile: +49 151 50412019