Forum: GNU Radio New implementation for fusb_linux without allocs/frees

Stefan Bruens (Guest)
on 2009-02-23 07:39
(Received via mailing list)
Hi,

attached is a new version for fusb_linux.cc.

The current implementation uses three std::list lists for free, pending
and completed urbs, so submitting a single urb causes three allocs and
three frees (from pushing to and popping from the lists).

The new implementation uses a circular list for the urbs, where each urb
is marked as free, pending or completed. As the total number of
allocated urbs is constant, no allocs or frees are needed.
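
For readers without the attachment, the idea can be sketched roughly as
follows. This is a minimal illustration, not the code from
fusb_linux.cc: UrbState, UrbSlot and UrbRing are hypothetical names, the
slot stands in for a real urb and its buffer, and it assumes urbs are
reclaimed in the order they were submitted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical names; a sketch of the technique, not the actual patch.
enum class UrbState { Free, Pending, Completed };

struct UrbSlot {
    UrbState state = UrbState::Free;
    std::size_t index = 0;  // stand-in for the real urb and its buffer
};

class UrbRing {
public:
    // All slots are allocated once here; nothing is allocated afterwards.
    explicit UrbRing(std::size_t n) : slots_(n) {
        assert(n > 0);
        for (std::size_t i = 0; i < n; ++i) slots_[i].index = i;
    }

    // Take the slot at the head if it is free and mark it pending.
    // Returns nullptr when every slot is already in flight.
    UrbSlot* submit() {
        UrbSlot& s = slots_[head_];
        if (s.state != UrbState::Free) return nullptr;
        s.state = UrbState::Pending;
        head_ = next(head_);
        return &s;
    }

    // Mark a pending slot as completed (the role reaping would play).
    void complete(UrbSlot* s) {
        assert(s->state == UrbState::Pending);
        s->state = UrbState::Completed;
    }

    // Hand back the oldest slot if it has completed and mark it free.
    // Returns nullptr if the oldest slot is still pending or unused.
    UrbSlot* reclaim() {
        UrbSlot& s = slots_[tail_];
        if (s.state != UrbState::Completed) return nullptr;
        s.state = UrbState::Free;
        tail_ = next(tail_);
        return &s;
    }

private:
    std::size_t next(std::size_t i) const { return (i + 1) % slots_.size(); }

    std::vector<UrbSlot> slots_;  // fixed size, never resized
    std::size_t head_ = 0;        // next slot to submit
    std::size_t tail_ = 0;        // oldest slot not yet reclaimed
};
```

The state mark on each slot is what distinguishes a full ring from an
empty one when head and tail coincide, so no separate count and no list
nodes are needed.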

Benchmark:
usrp/host/apps/test_usrp_standard_tx -B 512 -N 64 -M 128

The old code needs ~990e6 instructions, the new code 690e6. The call to
usrp_basic_tx::write goes down from 380e6 to 80e6 (so almost down to a
fifth ...); the remaining instructions are the pattern fill for the
sample buffer.

Regards,

Stefan
Johnathan Corgan (Guest)
on 2009-02-23 15:45
(Received via mailing list)
On Sun, Feb 22, 2009 at 10:38 PM, Stefan Bruens
<stefan.bruens@rwth-aachen.de> wrote:

> attached is a new version for fusb_linux.cc.

Stefan--THANKS for all the wonderful work you've been submitting; we
really appreciate the kinds of optimization work you are doing.  We
will likely get all of it into our distribution.

However, it makes it easier for us to review and apply your work if
you follow the guidelines on the Wiki page below:

http://gnuradio.org/trac/wiki/PatchSubmissionGuidelines

Thanks,

Johnathan
Eric Blossom (Guest)
on 2009-02-23 16:06
(Received via mailing list)
On Mon, Feb 23, 2009 at 07:38:14AM +0100, Stefan Bruens wrote:
> constant, no allocs or frees are needed.
>
> Benchmark:
> usrp/host/apps/test_usrp_standard_tx -B 512 -N 64 -M 128
>
> old code needs ~990e6 instructions, new code 690e6 instructions. The call to
> usrp_basic_tx::write goes down from 380e6 to 80e6 (so almost down to a fifth
> ...), the remaining instructions is the pattern fill for the sample buffer.
>
> Regards,
> Stefan

Thanks!

Stefan, please tell me again how you measured the instruction count?

Also, if you haven't already, please send in the form that starts the
copyright assignment.  Contact me off-list if you've got any more
questions about that.

Eric
Marcus D. Leech (Guest)
on 2009-02-24 16:05
(Received via mailing list)
Stefan Bruens wrote:
> constant, no allocs or frees are needed.
> Stefan
>
>
Have you tested receive performance, and is it improved?

Bandwidth is my dearest friend in radio astronomy (in the absence of
RFI), so getting the best USB performance that's possible given CPU
constraints is important to me.

--
Marcus Leech
Principal Investigator, Shirleys Bay Radio Astronomy Consortium
http://www.sbrac.org
Eric Blossom (Guest)
on 2009-02-24 18:23
(Received via mailing list)
On Tue, Feb 24, 2009 at 10:03:05AM -0500, Marcus D. Leech wrote:
> > marked as free, pending or completed. As the total number of allocated urbs is
> >
> > Stefan
> >
> >
> Have you tested receive performance, and is it improved?
>
> Bandwidth is my dearest friend in radio astronomy (in the absence of
> RFI), so getting the best USB performance
>   that's possible given CPU constraints is important to me.
>

Marcus, in my experience, USB performance has not been limited by cpu
cycles.  It seems to be primarily a function of the design of the host
controller, the firmware in the device, and a reasonable way to get
the data into user mode.  In most apps I've benchmarked, the overhead
of all usrp related stuff is typically on the order of 5 to 10% of the
total cycles consumed.

Eric
Marcus D. Leech (Guest)
on 2009-02-24 20:10
(Received via mailing list)
Eric Blossom wrote:
> Marcus, in my experience, USB performance has not been limited by cpu
> cycles.  It seems to be primarily a function of the design of the host
> controller, the firmware in the device, and a reasonable way to get
> the data into user mode.  In most apps I've benchmarked, the overhead
> of all usrp related stuff is typically on the order of 5 to 10% of the
> total cycles consumed.
>
> Eric
>
>
OK, so improving total USB cycle counts in user mode from 10% to 5%
perhaps wouldn't noticeably improve things like overruns--is that what
you're saying?

So, in your experience what is the sh*t-hottest USB controller out
there, and is it available on a PCI or PCI-E card?


--
Marcus Leech
Principal Investigator, Shirleys Bay Radio Astronomy Consortium
http://www.sbrac.org
Eric Blossom (Guest)
on 2009-02-24 20:39
(Received via mailing list)
On Tue, Feb 24, 2009 at 01:07:34PM -0500, Marcus D. Leech wrote:
> >
> OK, so improving total USB cycle counts in user mode from 10% to 5%
> perhaps wouldn't noticeably improve
>   things like overruns--is that what you're saying?

Overruns are generally caused by your signal processing not keeping
up, not by a problem with handling the USB.

> So, in your experience what is the sh*t-hottest USB controller out
> there, and is it available on a PCI or PCI-E card?

Once the controller can handle 32MB/s, you're golden.  Pretty much any
of the onboard controllers over the last few years work fine.

Measure twice, cut once.  Time be time.

Eric
Matt Ettus (Guest)
on 2009-02-24 21:01
(Received via mailing list)
Eric Blossom wrote:
>>>
>
> Once the controller can handle 32MB/s, you're golden.  Pretty much any
> of the onboard controllers over the last few years work fine.
>
> Measure twice, cut once.  Time be time.


It is important to make sure that your system is not throttling back the
CPU because of heat or power saving modes.

Matt
Stefan Bruens (Guest)
on 2009-02-24 21:36
(Received via mailing list)
On Tuesday 24 February 2009 16:03:05 Marcus D. Leech wrote:
> Have you tested receive performance, and is it improved?

Hm, if you are going for high bandwidth, you should set the blocksize
quite high (e.g. 4096). I am targeting low latency, so I go for the
smallest possible blocksize (512).

If you are receiving at 32MB/s, the new code saves about 75e6
instructions per second at a blocksize of 512, so for 4096 this should
be about 9e6. For a current-generation CPU, this should be about 1% of
its throughput. On the other hand, the mallocs are most probably very
unfriendly to branch prediction and cache access, so the savings may be
higher. If I find some time, I will run oprofile to get some numbers ...
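
The scaling from 75e6 to about 9e6 can be checked with a few lines. This
is only a back-of-the-envelope sketch: it assumes the saving is a fixed
number of instructions per block (one urb per block), that "32MB/s"
means 32 * 2^20 bytes per second, and the function names are made up for
the illustration.

```cpp
// Assumption: "32MB/s" is taken as 32 * 2^20 bytes per second.
constexpr double kBytesPerSecond = 32.0 * 1024 * 1024;

// Number of blocks (one urb each) transferred per second.
constexpr double blocks_per_second(double block_size) {
    return kBytesPerSecond / block_size;
}

// Per-block saving implied by the quoted 75e6 instructions per second
// at a blocksize of 512 (assumed constant per block).
constexpr double kSavedPerBlock = 75e6 / blocks_per_second(512.0);

// Instructions saved per second at another blocksize.
constexpr double saved_per_second(double block_size) {
    return kSavedPerBlock * blocks_per_second(block_size);
}
```

With these assumptions, a blocksize of 4096 means one eighth as many
blocks per second as 512, so the saving is one eighth of 75e6, which is
where the figure of about 9e6 comes from.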

Stefan

--
Stefan Brüns  /  Bergstraße 21  /  52062 Aachen
phone: +49 241 53809034     mobile: +49 151 50412019