Proposed change to ugen to enable USRP to work well on NetBSD

greg2 · May 3, 2006, 8:03pm

At BBN we are working on a project involving teams of
cognitively-controlled software radios, funded by the US government.
As part of this, we will be using GNU Radio on NetBSD. Due to the
current implementation of ugen(4) (generic USB devices), reads from
the USRP are not pipelined and transfer rates top out at around 4
MB/s.

Joanne has written a proposal to modify NetBSD’s ugen(4) to get good
performance for the USRP. We’d like feedback about the technical
approach. (Once we have it working, the changes will be committed to
NetBSD-current.)

http://acert.ir.bbn.com/downloads/adroit/NetBSD-USB-continuous.pdf

–
Greg T. [email protected]

greg2 · May 4, 2006, 4:33am

On Thursday 04 May 2006 03:30, Greg T. wrote:

Joanne has written a proposal to modify NetBSD’s ugen(4) to get good
performance for the USRP. We’d like feedback about the technical
approach. (Once we have it working, the changes will be committed to
NetBSD-current.)

http://acert.ir.bbn.com/downloads/adroit/NetBSD-USB-continuous.pdf

I think the ioctl would be the cleanest approach - it would not break any
existing software.

I wonder if it may be possible to have the ioctl specify a packet size, with
the kernel continuing to read packets of that size into the buffer as long as
it isn’t full. I think that would give you something that looks a lot closer
to a normal device (normal being “what the unix IO model expects”).

It would be nice if you could do a readv() and then
poll/kqueue/select/signal to see when an iovec has been filled, however I
suspect that would require severe modification of the kernel internals.
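
To make the read-ahead idea above concrete, here is a toy Python model of a bounded buffer that keeps accepting fixed-size packets until it is full, while read() only copies out what is already buffered. The class and method names are made up for illustration and do not correspond to any ugen(4) code; the 512-byte packet and 2048-byte capacity in the usage lines are arbitrary example values.

```python
from collections import deque


class ReadAheadBuffer:
    """Toy model of a kernel-side read-ahead buffer (hypothetical, not ugen code)."""

    def __init__(self, packet_size, capacity):
        self.packet_size = packet_size  # size set once, e.g. through an ioctl
        self.capacity = capacity        # total bytes the buffer may hold
        self.packets = deque()

    def free_space(self):
        return self.capacity - len(self.packets) * self.packet_size

    def device_poll(self):
        """Try to queue one more packet; refuse when the buffer is full."""
        if self.free_space() < self.packet_size:
            return False
        self.packets.append(bytes(self.packet_size))  # stand-in for USB data
        return True

    def read(self, nbytes):
        """Userspace read(): copy out whole buffered packets, start no I/O."""
        out = bytearray()
        while self.packets and len(out) + self.packet_size <= nbytes:
            out += self.packets.popleft()
        return bytes(out)


buf = ReadAheadBuffer(packet_size=512, capacity=2048)
queued = [buf.device_poll() for _ in range(5)]  # fifth attempt finds it full
data = buf.read(1024)                           # drains two packets
```

The point of the sketch is the decoupling: filling the buffer and draining it are separate operations, so the reader never has to start a transfer itself.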

greg2 · May 5, 2006, 2:38am

On Thursday 04 May 2006 11:58, Daniel O’Connor wrote:

It would be nice if you could do a readv() and then
poll/kqueue/select/signal to see when an iovec has been filled, however I
suspect that would require severe modification of the kernel internals.

Ah now I think about it, this is called “aio_read”

I don’t know how widely supported it is - in FreeBSD it’s optional (via a
kernel option or KLD).

It does allow you to enqueue read requests and then later check if they have
been completed. IMO this is the best match for the USB IO model, but my USB
fu is fairly weak.

(And it’s not much use if it isn’t widely supported…)

greg2 · May 5, 2006, 7:51pm

“Daniel O’Connor” [email protected] writes:

On Thursday 04 May 2006 11:58, Daniel O’Connor wrote:

It would be nice if you could do a readv() and then
poll/kqueue/select/signal to see when an iovec has been filled, however I
suspect that would require severe modification of the kernel internals.

Ah now I think about it, this is called “aio_read”

I don’t know how widely supported it is - in FreeBSD it’s optional (via a
kernel option or KLD).

This seems to be a POSIX 1003.2 feature. It doesn’t exist in NetBSD.

This is essentially what the Linux USB driver does, but apparently
without using a standard interface.

It does allow you to enqueue read requests and then later check if they have
been completed. IMO this is the best match for the USB IO model, but my USB
fu is fairly weak.

It’s a reasonable match, but it’s far more general than what the USRP
needs. The problem arises because we’re trying to use the USRP like a
traditional data input device, where data arrives and is buffered.
The current NetBSD ugen(4) code, and it seems the Linux code without
fusb, overload the read system call to both move data to userspace and
to initiate a read operation over the USB. aio (like Linux fusb)
allows one to avoid the temporal coupling of this overloading.

I don’t see any reason why continuous read mode and aio are mutually
exclusive - one could still do an async read.

I think it comes down to two issues:

Do we want the user-space program to explicitly schedule all the
hardware reads?
Is continuous read mode easier to implement than AIO?

I find user-space code managing lots of pending reads to be more
complex than it ought to be when we really want to just say “start
reading and keep going”.

I didn’t find a clear explanation of how AIO works with multiple
outstanding reads on the same file descriptor.

I’m not sure how hard AIO is to implement - the kernel already is
doing things async and the read system call is in tsleep. But it’s
harder to do that than continuous read mode.

Thanks for pointing out aio, though - I had forgotten about that.

–
Greg T. [email protected]

greg2 · May 6, 2006, 4:08pm

On Wed, May 03, 2006 at 02:00:23PM -0400, Greg T. wrote:

approach. (Once we have it working, the changes will be committed to
NetBSD-current.)

I would like to run GR under FreeBSD so I am looking forward to your
progress on this.

Under FreeBSD (and probably NetBSD) the ugen(4) device is generic and gets
assigned to any unknown device. I would suggest you look into getting a
number assigned to the USRP and a name assigned for it, say usrp(4). Any
application more serious than experimental would seem to justify it.

The ehci(4) driver is for USB 2.0 and the FreeBSD version contains
comments which indicate it may be the problem. I don’t quite understand
all the intricacies here but the code seems much too complicated for the
job. Ah for the good ol’ days when I/O was less than a hundred lines of
assembler code and two buffers were enough.

–
LRK
[email protected]

greg2 · May 5, 2006, 11:56pm

On Saturday 06 May 2006 03:19, Greg T. wrote:

I don’t know how widely supported it is - in FreeBSD it’s optional (via a
kernel option or KLD).

This seems to be a POSIX 1003.2 feature. It doesn’t exist in NetBSD.

OK, a pity.

This is essentially what the Linux USB driver does, but apparently
without using a standard interface.

Right.

It’s a reasonable match, but it’s far more general than what the USRP
needs. The problem arises because we’re trying to use the USRP like a
traditional data input device, where data arrives and is buffered.
The current NetBSD ugen(4) code, and it seems the Linux code without
fusb, overload the read system call to both move data to userspace and
to initiate a read operation over the USB. aio (like Linux fusb)
allows one to avoid the temporal coupling of this overloading.

OK.

I don’t see any reason why continuous read mode and aio are mutually
exclusive - one could still do an async read.

Indeed, I just wanted to make sure you didn’t implement a hack when the real
thing would be easier
(ie if NetBSD did AIO then no kernel mods would be necessary)

I think it comes down to two issues:

Do we want the user-space program to explicitly schedule all the
hardware reads?

I am not sure, but I think the answer is no (for the USRP case especially)

Is continuous read mode easier to implement than AIO?

Yes
[since AIO has hooks deep in the kernel]

I find user-space code managing lots of pending reads to be more
complex than it ought to be when we really want to just say “start
reading and keep going”.

I didn’t find a clear explanation of how AIO works with multiple
outstanding reads on the same file descriptor.

I’m not sure to be honest, but I believe it’s just a queue of reads.

I’m not sure how hard AIO is to implement - the kernel already is
doing things async and the read system call is in tsleep. But it’s
harder to do that than continuous read mode.

Absolutely, it would also be difficult to ensure standards compliance, etc…
IMO not worth it for this one application. I remember there were plenty of
subtle bugs fixed in the FreeBSD implementation and even then it is still an
optional API due to potential bugs.

Thanks for pointing out aio, though - I had forgotten about that.

I should try using it in FreeBSD and see if it works in practice

Good luck with your implementation!

greg2 · May 7, 2006, 5:02am

On Saturday 06 May 2006 23:35, LRK wrote:

I would like to run GR under FreeBSD so I am looking forward to your
progress on this.

Under FreeBSD (and probably NetBSD) the ugen(4) device is generic and gets
assigned to any unknown device. I would suggest you look into getting a
number assigned to the USRP and a name assigned for it, say usrp(4). Any
application more serious than experimental would seem to justify it.

Creating a kernel driver won’t magically make things go faster, it is also
more complicated as you have to write a driver and the code to talk to it…

IMO the best solution would be AIO but that isn’t very portable, next best is
hacking ugen to not have so many restrictions.

This is because of the USB IO model - the problem is that ugen does not know
how big a block of data to request from the device, and unlike more
traditional devices it does make a difference (since USB allows you to
request certain size transfers as well as allow short transfers). The fix for
ugen is to supply it with a hint to say what size transfer it should queue up
after the current one has been done. This gives you more transfers in flight
and greatly improves throughput.

There have been attempts to fix ugen in FreeBSD to do this but it didn’t work
very well because the API was not extended to provide the hint so it was
backed out.

A kernel driver could issue the multiple reads but it is a fair amount of work
to write one, and a bit of a waste if ugen could be extended instead (hence
benefiting other applications)

greg2 · May 7, 2006, 5:51pm

On Sun, May 07, 2006 at 12:29:15PM +0930, Daniel O’Connor wrote:

Creating a kernel driver won’t magically make things go faster, it is also
after the current one has been done. This gives you more transfers in flight
and greatly improves throughput.

There have been attempts to fix ugen in FreeBSD to do this but it didn’t work
very well because the API was not extended to provide the hint so it was
backed out.

A kernel driver could issue the multiple reads but it is a fair amount of work
to write one, and a bit of a waste if ugen could be extended instead (hence
benefiting other applications)

Obviously it would be neat to extend ugen if the fixes were generic but if
there need to be USRP-specific fixes they would best be done in a different
module. Maybe I’m not understanding this but it looks to me like ugen just
responds with a code saying it will take a device if no other driver wants
it. A copy of ugen named usrp could respond only to being offered a USRP
but the USRP should have a unique number assigned rather than the general
one used now. If there was a driver unique to the USRP, it would not need
to work with other USB devices, thus my suggesting that direction.

It also seems that the USRP tx and rx paths normally use the same block
size after each open. If that is right, the driver could simply use that
block size until the stream is closed, reading ahead on rx and providing
flow control on tx.

It appears the attempts to read the USRP at more than 4 MB/s just lock
and transfer no data. Changing the ‘read’ in libusb to just return as
though it had finished results in the ‘test_usrp_standard_rx’ giving
similar results at all speeds including the pattern of overrun errors.
The transfer rate calculated is very fast so the overrun error count
seems to be a function of the USRP code rather than actual overruns.

I guess this is getting much too complicated for the old guys like me
to comprehend so I’ll offer encouragement and await a solution, sooner
the better.

–
LRK
[email protected]

greg2 · May 7, 2006, 6:09pm

LRK [email protected] writes:

Obviously it would be neat to extend ugen if the fixes were generic
but if there need to be USRP-specific fixes they would best be done
in a different module.

Agreed in general, but not that any USRP-specific changes are needed.

Maybe I’m not understanding this but it looks to me like ugen just
responds with a code saying it will take a device if no other driver
wants it. A copy of ugen named usrp could respond only to being
offered a USRP but the USRP should have a unique number assigned
rather than the general one used now. If there was a driver unique
to the USRP, it would not need to work with other USB devices, thus
my suggesting that direction.

True, but then there’s more forked code to maintain, which is a big
minus.

It also seems that the USRP tx and rx paths normally use the same
block size after each open. If that is right, the driver could
simply use that block size until the stream is closed, reading ahead
on rx and providing flow control on tx.

That’s a good point: write transactions need to happen at some speed,
controllable by the user.

It appears the attempts to read the USRP at more than 4 MB/s just
lock and transfer no data.

What system? Could you be more precise? On NetBSD one gets missing
data according to the test program (presumably due to overruns in the
on-USRP buffer because USB transactions don’t happen fast enough).
But nothing else bad happens.

Changing the ‘read’ in libusb to just return as though it had
finished results in the ‘test_usrp_standard_rx’ giving similar
results at all speeds including the pattern of overrun errors.

You mean if you change the code to just skip the reads? I don’t see
what you are trying to find out from this experiment.

The transfer rate calculated is very fast so the overrun error
count seems to be a function of the USRP code rather than actual
overruns.

I don’t see how this follows.

–
Greg T. [email protected]

greg2 · May 7, 2006, 9:27pm

Some notes on the data pipeline and buffering:

Max length USB2 packets are 512 bytes, and that is all that we use.

The FIFOs in the FPGA are each 8192 bytes (16 packets), one for TX and
one for RX. This is limited by the RAM in the FPGA. At full data speed
of 32 MB/s over the USB, this is 256 µs worth of buffer.

The buffering in the USB controller chip is 2K bytes each for in and
out, or 4 packets worth.

The interface between the FPGA and the FX2 will only transfer full
512-byte chunks.
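
The buffer figures above can be rechecked with a few lines of Python. This is only arithmetic on the numbers quoted in this message (512-byte packets, an 8192-byte FPGA FIFO, 2K of controller buffering, 32 MB/s taken as 32,000,000 bytes per second); the 64 µs figure for the controller chip is derived here the same way and is not stated in the thread.

```python
PACKET_BYTES = 512          # max-length USB2 bulk packet
FPGA_FIFO_BYTES = 8192      # per direction, in the FPGA
FX2_BUFFER_BYTES = 2048     # per direction, in the USB controller chip
FULL_RATE = 32_000_000      # bytes per second at full data speed

fpga_packets = FPGA_FIFO_BYTES // PACKET_BYTES          # packets the FIFO holds
fx2_packets = FX2_BUFFER_BYTES // PACKET_BYTES          # packets the FX2 holds

# Time each buffer can absorb at full rate, in microseconds.
fpga_us = FPGA_FIFO_BYTES * 1_000_000 / FULL_RATE
fx2_us = FX2_BUFFER_BYTES * 1_000_000 / FULL_RATE

print(fpga_packets, fx2_packets, fpga_us, fx2_us)
```

In other words, at full rate the host has only a few hundred microseconds of slack before the on-board FIFO overflows.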

On Sun, 2006-05-07 at 12:15 -0400, Greg T. wrote:

Is the USB transaction size fixed by the USRP firmware? Is it
happy with a range of sizes?

Fixed by several factors.

Does the USRP ever send a short transaction on read? Are there
start/stop issues?

No.

Has an optimal transaction size been determined? This seems to be
a latency/efficiency tradeoff, but the amount of on-board buffering
seems key.

No, we haven’t looked into short packets for better latency.

Matt

Matt E. [email protected]

greg2 · May 7, 2006, 6:18pm

[We’re working on making USRP work well on NetBSD.]

http://acert.ir.bbn.com/downloads/adroit/NetBSD-USB-continuous.pdf

Regarding BSD ugen(4) support for the USRP, I have a few questions
about desired transfer sizes. With ugen now, and with Linux URBs, the
size of each USB transaction is controlled from user space.

Is the USB transaction size fixed by the USRP firmware? Is it
happy with a range of sizes?
Does the USRP ever send a short transaction on read? Are there
start/stop issues?
Has an optimal transaction size been determined? This seems to be
a latency/efficiency tradeoff, but the amount of on-board buffering
seems key.
Is the write size tightly linked to the read size? Do they have to
be the same?

The initial proposal didn’t discuss how to set the read transaction
size, and a way to do that is clearly needed. This could easily be
part of setting the read buffer size. For write, one also needs to
set the write transaction size, and limit the amount of buffered write
data, so perhaps a symmetric ioctl would be good; this would allow
different read and write sizes, and totally decouple reading and
writing.

–
Greg T. [email protected]
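
As a rough illustration of what "symmetric" and "decoupled" could mean here, the Python sketch below models one settings object per direction, each with its own transaction size and buffer size. Everything in it is hypothetical: the class name, the fields, and the rule that a transaction must be a multiple of 512 bytes are assumptions made for the example, not part of the proposal or of any existing ugen(4) interface.

```python
from dataclasses import dataclass

PACKET_BYTES = 512  # bulk packet size the USRP is reported to use


@dataclass(frozen=True)
class EndpointConfig:
    """Hypothetical per-direction settings; not an existing ugen(4) structure."""

    transfer_bytes: int  # size of each USB transaction
    buffer_bytes: int    # kernel buffer dedicated to this direction

    def __post_init__(self):
        # Assumed rule for this sketch: whole packets only.
        if self.transfer_bytes <= 0 or self.transfer_bytes % PACKET_BYTES:
            raise ValueError("transfer size must be a positive multiple of 512")
        if self.buffer_bytes < self.transfer_bytes:
            raise ValueError("buffer must hold at least one transfer")

    @property
    def max_transfers_buffered(self):
        return self.buffer_bytes // self.transfer_bytes


# Read and write are configured independently and need not match.
read_cfg = EndpointConfig(transfer_bytes=4096, buffer_bytes=65536)
write_cfg = EndpointConfig(transfer_bytes=512, buffer_bytes=8192)
```

Keeping the two directions in separate objects mirrors the idea that the read size and write size are set by separate calls and never constrain each other.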

greg2 · May 8, 2006, 2:45am

On Monday 08 May 2006 01:18, LRK wrote:

A kernel driver could issue the multiple reads but it is a fair amount of
work to write one, and a bit of a waste if ugen could be extended instead
(hence benefiting other applications)

Obviously it would be neat to extend ugen if the fixes were generic but if
there need to be USRP-specific fixes they would best be done in a different
module. Maybe I’m not understanding this but it looks to me like ugen just

That’s where you are wrong. Writing a new module means writing (or copying)
new code. If you copy ugen that means that any fixes for it need to get
duplicated. Also, since the USRP isn’t a “main stream” device it is likely OS
changes will break it unless the maintainer is very active.

If you write a driver from scratch (which would be pretty silly given that
ugen is almost exactly what you need) then you will take longer and have to
fix more bugs.

It also seems that the USRP tx and rx paths normally use the same block
size after each open. If that is right, the driver could simply use that
block size until the stream is closed, reading ahead on rx and providing
flow control on tx.

Yes, but I think extending ugen to allow the user program to supply a hint to
say “here is the block size, and keep reads queued” (via ioctl for example) is
probably a lot simpler than creating a new driver.

It appears the attempts to read the USRP at more than 4 MB/s just lock
and transfer no data. Changing the ‘read’ in libusb to just return as
though it had finished results in the ‘test_usrp_standard_rx’ giving
similar results at all speeds including the pattern of overrun errors.
The transfer rate calculated is very fast so the overrun error count
seems to be a function of the USRP code rather than actual overruns.

The problem is that with no back to back transfer you are wasting slots (USB
divides time into slots it will transfer data in). The limit of 4Mbytes/sec
is because of the number of slots per second (divided by 2 I guess) multiplied
by the maximum block size for a bulk transfer.
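
One way to sanity-check the 4 MB/s figure is the small Python calculation below. It assumes the "slots" are USB 2.0 high-speed microframes (125 µs each, so 8000 per second) and that the unpipelined read path completes only one 512-byte bulk packet per microframe. Both are assumptions made for illustration, not something measured in this thread. Under them, no halving is needed to land close to the observed rate.

```python
MICROFRAMES_PER_SECOND = 8000   # 125 us microframes at high speed
BULK_PACKET_BYTES = 512         # max bulk packet size at high speed
TARGET_RATE = 32_000_000        # bytes/s, the USRP full data speed


def bulk_rate(packets_per_microframe):
    """Bytes per second if this many bulk packets move in every microframe."""
    return packets_per_microframe * BULK_PACKET_BYTES * MICROFRAMES_PER_SECOND


one_at_a_time = bulk_rate(1)

# Smallest whole number of packets per microframe that reaches the target.
packets_needed = -(-TARGET_RATE // bulk_rate(1))  # ceiling division

print(one_at_a_time, packets_needed)
```

One packet per microframe gives 4,096,000 bytes/s, which is close to the 4.09e+06 bytes/sec ceiling reported elsewhere in this thread, and reaching 32 MB/s would take about eight packets in every microframe.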

I guess this is getting much too complicated for the old guys like me
to comprehend so I’ll offer encouragement and await a solution, sooner
the better.

It’s OK, USB is complicated for younger minds too

greg2 · May 8, 2006, 12:53am

On Sun, May 07, 2006 at 12:06:31PM -0400, Greg T. wrote:

It appears the attempts to read the USRP at more than 4 MB/s just
lock and transfer no data.

What system? Could you be more precise? On NetBSD one gets missing
data according to the test program (presumably due to overruns in the
on-USRP buffer because USB transactions don’t happen fast enough).
But nothing else bad happens.

This test used FreeBSD 6.1-RC and GR updated from CVS a couple of weeks
ago. 1.8 GHz 686-class CPU, VIA VT6202 USB 2.0 controller on motherboard.

This was discussed before and there seemed to be agreement that it does
not look right:

Running test_usrp_standard_rx with different decimation ( -D ) values.

xfered 1.34e+08 bytes in 68.1 seconds. 1.97e+06 bytes/sec. cpu time = 0.3424
noverruns = 3
xfered 1.34e+08 bytes in 33.6 seconds. 3.998e+06 bytes/sec. cpu time = 0.3356
noverruns = 1
xfered 1.34e+08 bytes in 32.8 seconds. 4.088e+06 bytes/sec. cpu time = 0.3381
noverruns = 167
xfered 1.34e+08 bytes in 32.8 seconds. 4.091e+06 bytes/sec. cpu time = 0.334
noverruns = 83
xfered 1.34e+08 bytes in 32.8 seconds. 4.092e+06 bytes/sec. cpu time = 0.3337
noverruns = 41

Note that the overrun counts go down as the speed should be going up.
The USRP is queried for overruns and answers some with the error code.
How many seems only to depend on the decimation rate. The first two tests
may have different overrun counts but the last three are always the same
as seen above (maybe +/- 1 count).

Changing the ‘read’ in libusb to just return as though it had
finished results in the ‘test_usrp_standard_rx’ giving similar
results at all speeds including the pattern of overrun errors.

You mean if you change the code to just skip the reads? I don’t see
what you are trying to find out from this experiment.

I changed libusb to skip the actual reads. Since this happens fast, one
would expect the USRP to return every call as an error since no read was
actually done. The test doesn’t actually check the data so it thinks it
got the data in the time it took to run the test and calculates the high
transfer rate:

xfered 1.34e+08 bytes in 0.157 seconds. 8.523e+08 bytes/sec. cpu time = 0.01688
noverruns = 614
xfered 1.34e+08 bytes in 0.0817 seconds. 1.642e+09 bytes/sec. cpu time = 0.01491
noverruns = 319
xfered 1.34e+08 bytes in 0.0422 seconds. 3.178e+09 bytes/sec. cpu time = 0.01432
noverruns = 164
xfered 1.34e+08 bytes in 0.0211 seconds. 6.372e+09 bytes/sec. cpu time = 0.01363
noverruns = 82
xfered 1.34e+08 bytes in 0.0207 seconds. 6.475e+09 bytes/sec. cpu time = 0.01324
noverruns = 41

The transfer rate calculated is very fast so the overrun error
count seems to be a function of the USRP code rather than actual
overruns.

I don’t see how this follows.

Since the modified test doesn’t actually transfer data and now gets similar
results for the first two tests, it appears the last three probably never
actually transfer the data.

It also seems that if I command the USRP to send data but never issue a read,
it would return an error on every call to check overrun and the numbers from
the test are not powers of two.

Doesn’t make sense to me so I mentioned it as a possible clue.

–
LRK
[email protected]

greg2 · May 11, 2006, 6:08am

On Sun, May 07, 2006 at 12:15:07PM -0400, Greg T. wrote:

[We’re working on making USRP work well on NetBSD.]

http://acert.ir.bbn.com/downloads/adroit/NetBSD-USB-continuous.pdf

Regarding BSD ugen(4) support for the USRP, I have a few questions
about desired transfer sizes. With ugen now, and with Linux URBs, the
size of each USB transaction is controlled from user space.

Is the USB transaction size fixed by the USRP firmware? Is it
happy with a range of sizes?

We currently always use 512 bytes on the Tx and Rx end points. This
of course only works with USB 2.0. We may want to vary this when we
start adding fixed headers to the packets, but it wouldn’t be the end
of the world if the packets stayed at 512 bytes.

Does the USRP ever send a short transaction on read? Are there
start/stop issues?

We currently don’t ever send a short read.

Has an optimal transaction size been determined? This seems to be
a latency/efficiency tradeoff, but the amount of on-board buffering
seems key.

I haven’t run the experiment.

Is the write size tightly linked to the read size? Do they have to
be the same?

No. No.

The initial proposal didn’t discuss how to set the read transaction
size, and a way to do that is clearly needed. This could easily be
part of setting the read buffer size. For write, one also needs to
set the write transaction size, and limit the amount of buffered write
data, so perhaps a symmetric ioctl would be good; this would allow
different read and write sizes, and totally decouple reading and
writing.

Decoupling them seems like a good idea.

    Greg T. <[email protected]>

Eric

greg2 · May 11, 2006, 6:15am

On Sun, May 07, 2006 at 05:50:48PM -0500, LRK wrote:

This test used FreeBSD 6.1-RC and GR updated from CVS a couple of weeks
xfered 1.34e+08 bytes in 33.6 seconds. 3.998e+06 bytes/sec. cpu time = 0.3356
How many seems only to depend on the decimation rate. The first two tests
may have different overrun counts but the last three are always the same
as seen above (maybe +/- 1 count).

As currently implemented, the overrun/underrun detection is
implemented by the host library polling the USRP at approximately
10Hz, based on sample counting. Thus the absolute number of overruns
returned does not mean what you might expect. You should think of it
as more of a binary value: overruns <= 1 implies things are “good”.

Eric
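
A small Python sketch of how the test output could be read under this rule: parse each run from the test_usrp_standard_rx output and treat the overrun count as a pass/fail flag (overruns <= 1 is "good") rather than as a measure of severity. The sample text is copied from the runs reported earlier in the thread; the function names and the regular expression are illustrative assumptions, not part of GNU Radio.

```python
import re

SAMPLE = """\
xfered 1.34e+08 bytes in 68.1 seconds. 1.97e+06 bytes/sec. cpu time = 0.3424
noverruns = 3
xfered 1.34e+08 bytes in 33.6 seconds. 3.998e+06 bytes/sec. cpu time = 0.3356
noverruns = 1
xfered 1.34e+08 bytes in 32.8 seconds. 4.088e+06 bytes/sec. cpu time = 0.3381
noverruns = 167
"""

# \s+ between tokens so the pattern also survives re-wrapped output lines.
RUN_RE = re.compile(
    r"xfered\s+(\S+)\s+bytes\s+in\s+(\S+)\s+seconds\.\s+(\S+)\s+bytes/sec\."
    r"\s+cpu\s+time\s*=\s*(\S+)\s+noverruns\s*=\s*(\d+)"
)


def parse_runs(text):
    """Return one dict per test run found in the output text."""
    return [
        {
            "bytes": float(m.group(1)),
            "seconds": float(m.group(2)),
            "rate": float(m.group(3)),
            "overruns": int(m.group(5)),
        }
        for m in RUN_RE.finditer(text)
    ]


def looks_good(run):
    """Apply the binary reading: at most one reported overrun is 'good'."""
    return run["overruns"] <= 1


runs = parse_runs(SAMPLE)
verdicts = [looks_good(r) for r in runs]
```

Read this way, only the 3.998e+06 bytes/sec run in the sample counts as clean, and the difference between 167, 83 and 41 overruns in the faster runs carries no extra meaning.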

greg2 · May 8, 2006, 2:45am

On Monday 08 May 2006 04:54, Matt E. wrote:

Some notes on the data pipeline and buffering:
[snip]

Seems like a hint to tell ugen what size block to read ahead with should work
very well.

I wonder what bugs you’ll find in the EHCI code though
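
To see why read-ahead matters with the buffer sizes quoted in this thread, here is a toy Python simulation of the 8192-byte FPGA RX FIFO being filled at 32 MB/s while the host drains whole 512-byte packets. It assumes time advances in 125 µs steps (USB 2.0 high-speed microframes) and that the host moves at most a fixed number of packets per step; that timing model and the function name are simplifications for illustration, not a description of the real hardware or of ugen.

```python
def simulate_fifo(packets_per_microframe, microframes=8000,
                  fill_rate=32_000_000, fifo_bytes=8192, packet=512):
    """Return (bytes delivered to host, bytes dropped on FIFO overflow)."""
    produced_per_step = fill_rate // 8000  # bytes arriving per 125 us step
    level = delivered = dropped = 0
    for _ in range(microframes):
        level += produced_per_step
        if level > fifo_bytes:             # FIFO overflow: excess is lost
            dropped += level - fifo_bytes
            level = fifo_bytes
        # The host only ever takes whole 512-byte packets.
        n = min(packets_per_microframe, level // packet)
        level -= n * packet
        delivered += n * packet
    return delivered, dropped


unpipelined = simulate_fifo(1)   # one packet per microframe
read_ahead = simulate_fifo(8)    # enough transfers queued to keep up
print(unpipelined, read_ahead)
```

With one packet per step the simulated host receives 4,096,000 bytes in a simulated second and most of the data is dropped; with eight packets per step nothing is dropped in this model.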