USB throughput numbers for NetBSD (and Linux)

Hi,

We collected some data comparing the USB throughput we’re getting now
on NetBSD against the throughput on Linux. For those who are
interested in the current performance on NetBSD, I’ve included a
summary. The full set of measurements taken (along with the summary)
is available at:
http://acert.ir.bbn.com/viewvc/adroitgrdevel/adroitgrdevel/radio_test/usb/test-results?view=co

Summary

The following USB throughput results were collected on two systems
with identical hardware, one running NetBSD-current with our ugen
changes and the other running SuSE Linux.

The ugen changes allow specifying the length of the transfer requested
from the host controller, and here the fusb_netbsd testing code was
recompiled with each of the transfer sizes shown. The fusb_linux code
uses 16k requests (and states that this is the largest request
possible). In both cases the USRP library’s default buffer size of 2 MB
was used. The ugen driver can also be changed to avoid a copy into the
driver’s buffer, and the “NetBSD -copy” rows below show how much
performance improves in that case.
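
(For illustration, a user-space reader could select these sizes roughly
as in the sketch below, using for example 64k requests and a 256k
driver buffer, the combination that does best in the tests further
down. The ioctl names and the usb_bulk_ra_wb_opt fields follow the
bulk read-ahead/write-behind interface as documented in ugen(4); treat
the exact names as an assumption about the patch described here, and
note that error handling is omitted.)

#include <fcntl.h>
#include <sys/ioctl.h>
#include <dev/usb/usb.h>

/* Sketch only: enable bulk read-ahead on a ugen endpoint with a 256k
 * driver buffer and 64k transfer requests.  The ioctl and structure
 * names are assumed from the ugen(4) read-ahead/write-behind
 * interface; error checking is omitted for brevity. */
int
open_rx_endpoint(const char *dev)           /* e.g. "/dev/ugen0.06" */
{
    int fd = open(dev, O_RDONLY);
    if (fd < 0)
        return -1;

    struct usb_bulk_ra_wb_opt opt;
    opt.ra_wb_buffer_size  = 256 * 1024;    /* driver-side buffer size  */
    opt.ra_wb_request_size =  64 * 1024;    /* size of each HC transfer */
    ioctl(fd, USB_SET_BULK_RA_OPT, &opt);

    int on = 1;
    ioctl(fd, USB_SET_BULK_RA, &on);        /* turn on read-ahead       */
    return fd;
}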

For reference, here is how interpolation/decimation relates to the
intended data rate:

data rate | decimation | interpolation

16 MB/s   |     16     |      32
18.3 MB/s |     14     |      28
21.3 MB/s |     12     |      24
25.6 MB/s |     10     |      20
32 MB/s   |      8     |      16
42.6 MB/s |      6     |      12
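
These figures follow from the USRP's converter clocks: with 4 bytes per
complex sample (16-bit I and 16-bit Q), the 64 MS/s ADCs give
64e6 / decimation * 4 bytes/s on receive and the 128 MS/s DACs give
128e6 / interpolation * 4 bytes/s on transmit. For example, decimation
16 gives 64e6 / 16 * 4 = 16 MB/s, and decimation 6 gives about 42.67 MB/s.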

benchmark_usb.py (bidirectional test)

driver       | xfer size | maximum (read+write)

NetBSD       |   16k     | 32 MB/s
Linux        |   16k     | 36.57 MB/s
NetBSD       |   64k     | 32 MB/s (usually gets 36.57)
NetBSD       |  128k     | 32 MB/s
NetBSD -copy |   16k     | 32 MB/s
NetBSD -copy |   64k     | 42.6 MB/s
NetBSD -copy |  128k     | 42.6 MB/s

test_usrp_standard_rx

driver       | xfer size | maximum (MB/s)

NetBSD       |   16k     | 21.3
Linux        |   16k     | 32
NetBSD       |   64k     | 25.6
NetBSD       |  128k     | 21.3
NetBSD -copy |   16k     | 25.6
NetBSD -copy |   64k     | 25.6
NetBSD -copy |  128k     | 25.6

test_usrp_standard_tx

driver       | xfer size | maximum (MB/s)

NetBSD       |   16k     | 21.3
Linux        |   16k     | 32
NetBSD       |   64k     | 25.6
NetBSD       |  128k     | 21.3
NetBSD -copy |   16k     | 21.3
NetBSD -copy |   64k     | 25.6
NetBSD -copy |  128k     | 25.6

The Linux numbers suggest that there is about 36 MB/s of bandwidth
available in total (maybe more, but less than 42), and it must be
divided between transmit and receive. So 32 MB/s can be done one-way,
but as soon as bidirectional traffic is needed, neither direction can
do 32. The USRP could probably be set up to split the rates unevenly,
say 25.6 and 8 MB/s between read and write instead of 16 and 16 (the
sum stays under 36), but not 25.6 and 16 (the sum does not).

This follows fairly well from the implementation. On Linux, USRP
reads and writes are all done via a generic request mechanism funneled
through the control endpoint, so the aggregate of reads and writes
appears to be constrained by how fast data can be pushed through this
mechanism.

With our NetBSD implementation, read and write are handled
independently all the way down to the host controller driver (unless
the transactions happen to run in lock-step, in which case one of read
and write must wait while the other’s completion interrupt is being
handled). Therefore the bidirectional numbers are closer to the sum of
the two unidirectional numbers, instead of bidirectional being
essentially equal to unidirectional as we’re seeing with Linux.

The NetBSD numbers demonstrate that 128k transfers perform worse than
64k. As would be expected, 128k transfers aren’t worse with the extra
copy removed but they also aren’t notably better. So while there is
clearly too much cost copying 128k at a time vs. copying 64k, there is
still a lot of cost that’s not in the copy at all, because the numbers
don’t get vastly better when the copy is removed. The latter cost is
what’s preventing us from getting unidirectional rates comparable to
Linux.

Copying to/from user space does not appear to be the bottleneck; the
kernel debug logs clearly show that in these tests user space consumes
and produces data faster than the bus.

Choosing a Good Buffer Size

The previous results all use a buffer size of 2 MB (with fusb_netbsd
that is 2 MB for each of read and write). Also, all reads and writes
from user space were 16k. The following tests indicate that the read
and write length does not matter very much. However, reducing the
buffer size from 2 MB demonstrably helps bidirectional throughput.

Because the highest rate reached is not always the same, these results
include several runs of benchmark_usb.py. The maximum rate is based on
what benchmark_usb.py claimed over five runs, taking into account that
all of the higher transfer rates occasionally report underruns or
overruns.

driver | xfer size | buffer size | maximum rate (MB/s)

NetBSD |   16k     |    2M       | 32
NetBSD |   64k     |    2M       | 32
NetBSD |  128k     |    2M       | 32

NetBSD |   16k     |    1M       | 32
NetBSD |   32k     |    1M       | 36.57
NetBSD |   64k     |    1M       | 36.57
NetBSD |  128k     |    1M       | 32

NetBSD |   32k     |  256k       | 36.57
NetBSD |   64k     |  256k       | 42.6

NetBSD |   32k     |  128k       | 36.57
NetBSD |   64k     |  128k       | 42.6

NetBSD |   32k     |   64k       | 36.57
NetBSD |   64k     |   64k       | 36.57

NetBSD |   16k     |   64k       | 32
NetBSD |    4k     |   64k       | 32
NetBSD |    4k     |   32k       | 32

It appears that the best performance for these tests comes from 64k
transfers and a 256k buffer. The same is true with the copy removed,
although once the copy is gone the larger buffer and transfer sizes do
well too:

driver       | xfer size | buffer size | maximum rate (MB/s)

NetBSD -copy |   16k     |    2M       | 32
NetBSD -copy |   64k     |    2M       | 42.6
NetBSD -copy |  128k     |    2M       | 42.6

NetBSD -copy |   64k     |    1M       | 42.6
NetBSD -copy |  128k     |    1M       | 42.6

NetBSD -copy |   32k     |  256k       | 42.6
NetBSD -copy |   64k     |  256k       | 42.6

NetBSD -copy |   32k     |  128k       | 36.57
NetBSD -copy |   64k     |  128k       | 42.6

On Sat, Jul 22, 2006 at 01:56:15AM -0400, Joanne M Mikkelson wrote:


Great report!

I’m not sure I followed the explanation for why on NetBSD the
unidirectional case isn’t equal to the sum of the bidirectional case.
Could you try explaining again? On second thought, is the problem
that there’s only one request in the h/w endpoint queue for a given
endpoint and direction? If so, I think you could get the completion
interrupt service time out of the critical path by ensuring that there
are always two requests queued in each direction, not just one.

I’d also be interested in seeing how the throughput holds up with
smaller transfer sizes and smaller total amount of buffer space.

For example, in gnuradio-examples/python/gmsk2/tunnel.py (ethernet
over GNU Radio using CSMA MAC) we’re currently running with:

1024 byte transfers, with a total of 16 blocks (16kB) in each
direction.

If we can’t enable real-time scheduling, then we run with

4096 byte transfers, with a total of 16 blocks (64kB) in each
direction.

As you can see, we’ve cut the total buffer allocated way down in
order to minimize the maximum round-trip latency as seen by a software
MAC. These numbers were empirically chosen as “close to the smallest
values that work on Eric’s laptops.” They can be overridden on the
command line, and the defaults can be set in the user prefs file,
~/.gnuradio/config.conf:

[fusb]
rt_block_size = 1024
rt_nblocks = 16
block_size = 4096
nblocks = 16

Currently only tunnel.py observes these settings.

FYI, test_usrp_standard_{tx,rx} support similar command line options:

fprintf (stderr, " [-B <fusb_block_size>] set fast usb block_size\n");
fprintf (stderr, " [-N <fusb_nblocks>] set fast usb nblocks\n");
fprintf (stderr, " [-R] set real time scheduling: SCHED_FIFO; pri = midpoint\n");
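
For example, to try the non-real-time tunnel.py settings by hand, one
could run something like this (hypothetical invocation using the flags
above):

./test_usrp_standard_rx -B 4096 -N 16

and add -R to also request real-time scheduling.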

Thanks again for your efforts and the report!
Eric

Hi,

I’m sorry to say that I can’t share the enthusiasm, as it didn’t make
much difference on a Centrino Duo system running NetBSD-3.99.21.

Does it require GNU Radio current? I’m currently running GNU Radio
release 2.8 as found in the pkgsrc release!

cheerio Berndt


On Saturday 22 July 2006 23:56, Greg T. wrote:

It lives here at the moment:
https://acert.ir.bbn.com/projects/adroitgrdevel

I can certainly believe that you’d get different results, but I would
expect a big improvement.

Thanks, I followed your instructions and now have stutter-free
sampling at max speed even with GUIs running concurrently… AWESOME.

cheerio Berndt



On Wednesday 26 July 2006 23:19, Greg T. wrote:

0.491 41 underruns

So this doesn’t work. Could you try with decimation 10 (or 12, until
you get only a few underruns)?

Here are the values that stop producing overruns:

barossa: {101} ./test_usrp_standard_rx -D 10
xfered 1.34e+08 bytes in 5.24 seconds. 2.56e+07 bytes/sec. cpu time = 0.0399
noverruns = 0

barossa: {106} ./test_usrp_standard_tx -I 20
usb_control_msg failed: error sending control message: Input/output error
xfered 1.34e+08 bytes in 5.28 seconds. 2.542e+07 bytes/sec. cpu time = 0.492
0 underruns

cheerio Berndt

On Wed, Jul 26, 2006 at 09:49:10AM -0400, Greg T. wrote:

Interestingly, the 2MB/sec test fails although the faster speeds are ok.

We’ve noticed that too. Note that the 32 MB/s speed is really 16 MB/s
in each direction.

It would be cool if benchmark_usb.py tried decimation 14, 12, and 10
also, rather than stopping at 16 (and interpolation values with those
rates).

It would be even better if it gave reliable answers ;)

Eric

On Wednesday 26 July 2006 22:22, Greg T. wrote:

I am interested in reports of how well this works on both i386 and
amd64.

It’s pretty clear that getting pipelining closer to the hardware is
needed - this is being pushed upstream since it works and gets ~80% of
the likely speed gain.
[…]

G’day,

for those interested here are a few results from the tests conducted on
a Dell
Inspiron 9400 Centrino Duo @ 2GHz/1GB running NetBSD-3.99.21:

barossa: {29} ./test_usrp_standard_rx

xfered 1.34e+08 bytes in 4.2 seconds. 3.197e+07 bytes/sec. cpu time = 0.04173
noverruns = 41

barossa: {30} ./test_usrp_standard_tx

usb_control_msg failed: error sending control message: Input/output error
xfered 1.34e+08 bytes in 4.64 seconds. 2.894e+07 bytes/sec. cpu time = 0.491
41 underruns

barossa: {33} ./benchmark_usb.py
Testing 2MB/sec... usb_control_msg failed: error sending control message: Input/output error
usb_throughput = 2M
ntotal = 1000000
nright = 947559
runlength = 0
delta = 1000000
FAILED
Testing 4MB/sec... usb_control_msg failed: error sending control message: Input/output error
usb_throughput = 4M
ntotal = 2000000
nright = 1997896
runlength = 1997896
delta = 2104
OK
Testing 8MB/sec... usb_control_msg failed: error sending control message: Input/output error
usb_throughput = 8M
ntotal = 4000000
nright = 3999286
runlength = 3999286
delta = 714
OK
Testing 16MB/sec... usb_control_msg failed: error sending control message: Input/output error
usb_throughput = 16M
ntotal = 8000000
nright = 7997737
runlength = 7997737
delta = 2263
OK
Testing 32MB/sec... usb_control_msg failed: error sending control message: Input/output error
usb_throughput = 32M
ntotal = 16000000
nright = 15999303
runlength = 15999303
delta = 697
OK
Max USB/USRP throughput = 32MB/sec

Interestingly, the 2MB/sec test fails although the faster speeds are ok.

cheerio Berndt

Hi, sorry for my long delay, I was on vacation and then playing
catch-up.

I’m not sure I followed the explanation for why on NetBSD the
unidirectional case isn’t equal to the sum of the bidirectional case.
Could you try explaining again? On second thought, is the problem
that there’s only one request in the h/w endpoint queue for a given
endpoint and direction? If so, I think you could get the completion
interrupt service time out of the critical path by ensuring that there
are always two requests queued in each direction, not just one.

Yes, as the driver is currently implemented, there is only one
request queued for a given endpoint at a time. You’re correct that
having more than one would reduce the interrupt service time’s effect
on performance, but doing this will require changes to more than just
ugen. The ehci driver will need some work in order to work properly
with more than one bulk request queued at a time. We haven’t changed
the ehci driver, so until that happens, the ugen driver will have to
use just a single request at a time.
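
(To make the idea concrete, here is a rough sketch of what keeping two
requests queued per endpoint would look like; submit_request() stands
in for the usual usbd_setup_xfer()/usbd_transfer() sequence, and all
locking and ugen buffer bookkeeping is left out.)

#include <stddef.h>

/* Sketch only: keep two bulk requests outstanding per endpoint so the
 * host controller always has one queued while the other's completion
 * interrupt is being serviced. */
#define NXFERS 2

struct req {
    void   *xfer;   /* usbd transfer handle          */
    char   *buf;    /* data buffer for this transfer */
    size_t  len;    /* transfer length, e.g. 64k     */
};

static struct req reqs[NXFERS];

static void
submit_request(struct req *r)
{
    /* In the real driver this would be usbd_setup_xfer() on the
     * endpoint's pipe with xfer_done() as the callback, followed by
     * usbd_transfer(); stubbed out here. */
    (void)r;
}

static void
xfer_done(struct req *r)
{
    /* Hand r->buf off to the read-ahead buffer, then requeue right
     * away; the other request is still in the controller's queue, so
     * the bus does not sit idle while this handler runs. */
    submit_request(r);
}

static void
start_pipeline(void)
{
    for (int i = 0; i < NXFERS; i++)
        submit_request(&reqs[i]);   /* prime both up front */
}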

I’d also be interested in seeing how the throughput holds up with
smaller transfer sizes and smaller total amount of buffer space.

Because we only have one request at a time, the throughput will
suffer as request sizes get smaller. In my experience the total
buffer space need not be more than a few requests’ worth (and the
numbers showed that having the buffers too large hurts performance),
but this testing wasn’t with much computation load. At least the
latency should still be improved over what we had with ugen before.

Using test_usrp_standard_rx and _tx, a block size of 1024 only works
with decimation 64/interpolation 128 (4 MB/s) and a block size of
4096 works with decimation 16/interpolation 32 (16 MB/s). This is
without real-time scheduling, which isn’t working.

Joanne