USB speed data point

Hi -

I’m bringing up a board, the LLRF Evaluation Board
(http://recycle.lbl.gov/llrf4/), with a hardware and software USB stack
based on, and for this purpose equivalent to, the GNU Radio design, and
I have measured its USB data transfer capabilities more carefully than
I have done before. There is a distant possibility someone on this list
might make use of the result, so here it is:

Reading only, on a lightly loaded AMD64 3500+ machine (2.2 GHz,
dual-channel RAM), I can sustain 35.7 MByte/sec without errors.
Attempting 35.8 MByte/sec, packets get dropped left and right.
The host end of the USB is a VT8237 chipset, seemingly run in
EHCI mode by Linux-2.6.16 (Debian sid 2.6.16-2-amd64-k8).

I think the limitation is on the 8051 end. One 512-byte packet takes
8.53 microseconds to cross the USB channel, and the 35.7 MByte/sec
sustained rate implies the 8051 sets up the next packet in only 5.81
microseconds. I don’t think there is any pipelining at this level.

- Larry

On Fri, Oct 27, 2006 at 09:40:06AM -0700, [email protected] wrote:

> Reading only, on a lightly loaded AMD64 3500+ machine (2.2 GHz,
> dual-channel RAM), I can sustain 35.7 MByte/sec without errors.
> Attempting 35.8 MByte/sec, packets get dropped left and right.
> The host end of the USB is a VT8237 Chipset, seemingly run in
> EHCI mode by Linux-2.6.16 (Debian sid 2.6.16-2-amd64-k8).
>
> I think the limitation is on the 8051 end.

I concur (recalling measurements made a long time ago).
As I recall, the limiting factor is the time to get through the main
loop in the FX2. I believe that without much trouble you could cut
that time in half.

E.g., there’s a check in the loop that’s always true and thus could be
removed. Also, the loop body could be recoded in assembler. See the
FIXMEs below.

> One 512-byte packet takes
> 8.53 microseconds to cross the USB channel, and the 35.7 MByte/sec
> sustained rate implies the 8051 sets up the next packet in only 5.81
> microseconds. I don’t think there is any pipelining at this level.
>
> - Larry

static void
main_loop (void)
{
  setup_flowstate_common ();

  // FIXME add this code to ensure loop invariant
  while (!(GPIFTRIG & bmGPIF_IDLE))
    ;

  while (1){

    if (usb_setup_packet_avail ())
      usb_handle_setup_packet ();

    if (GPIFTRIG & bmGPIF_IDLE){   // FIXME This is always true, remove the test

      // OK, GPIF is idle.  Let's try to give it some work.

      // First check for underruns and overruns

      if (UC_BOARD_HAS_FPGA && (USRP_PA & (bmPA_TX_UNDERRUN | bmPA_RX_OVERRUN))){

        // record the under/over run
        if (USRP_PA & bmPA_TX_UNDERRUN)
          g_tx_underrun = 1;

        if (USRP_PA & bmPA_RX_OVERRUN)
          g_rx_overrun = 1;

        // tell the FPGA to clear the flags
        fpga_clear_flags ();
      }

      // Next see if there are any "OUT" packets waiting for our attention,
      // and if so, if there's room in the FPGA's FIFO for them.

      if (g_tx_enable && !(EP24FIFOFLGS & 0x02)){   // USB end point fifo is not empty...

        if (fpga_has_room_for_packet ()){           // ... and FPGA has room for packet

          GPIFTCB1 = 0x01;  SYNCDELAY;
          GPIFTCB0 = 0x00;  SYNCDELAY;

          setup_flowstate_write ();

          SYNCDELAY;
          GPIFTRIG = bmGPIF_EP2_START | bmGPIF_WRITE;   // start the xfer
          SYNCDELAY;

          while (!(GPIFTRIG & bmGPIF_IDLE)){
            // wait for the transaction to complete
          }
        }
      }

      // See if there are any requests for "IN" packets, and if so
      // whether the FPGA's got any packets for us.

      if (g_rx_enable && !(EP6CS & bmEPFULL)){      // USB end point fifo is not full...

        if (fpga_has_packet_avail ()){              // ... and FPGA has packet available

          GPIFTCB1 = 0x01;  SYNCDELAY;
          GPIFTCB0 = 0x00;  SYNCDELAY;

          setup_flowstate_read ();

          SYNCDELAY;
          GPIFTRIG = bmGPIF_EP6_START | bmGPIF_READ;    // start the xfer
          SYNCDELAY;

          while (!(GPIFTRIG & bmGPIF_IDLE)){
            // wait for the transaction to complete
          }

          SYNCDELAY;
          INPKTEND = 6;   // tell USB we filled buffer (6 is our endpoint num)
        }
      }
    }
  }
}

Eric

"ldoolitt" == ldoolitt [email protected] writes:

ldoolitt> Hi - I'm bringing up a board http://recycle.lbl.gov/llrf4/
ldoolitt> with a hardware and software USB stack based on and (for this

Some hints for your part list:

  • consider the Spartan-3E family versus Spartan-3; the 3E is often
    considerably cheaper
  • look at the actual lead time for the INA138

    Uwe Bonnes [email protected]

Institut fuer Kernphysik Schlossgartenstrasse 9 64289 Darmstadt
--------- Tel. 06151 162516 -------- Fax. 06151 164321 ----------

Uwe -

On Fri, Oct 27, 2006 at 08:35:22PM +0200, Uwe Bonnes wrote:

> "ldoolitt" == ldoolitt [email protected] writes:
> ldoolitt> Hi - I'm bringing up a board LLRF Evaluation Board
> ldoolitt> with a hardware and software USB stack based on and (for this
> Some hints for your part list:
>
>   • consider the Spartan 3E family versus Spartan3. 3E is often considerably
>     cheaper
>   • look at the actual lead time for the INA138

Thanks for your comments.

The FPGA accounts for a tiny fraction of the board cost.
It does look like a XC3S1200E-4FTG256C would shave the cost
down a little, and add a few more gates besides. There are
non-trivial changes in the pinout, though. I’m not sure I
want to deal with that on a respin.

The INA138 was in stock at Digi-Key when I laid out the board,
and we managed to get enough to build the first six (although
they were removed, because I made a pin numbering mistake on
the layout). When Digi-Key first ran out of stock, I looked
for alternatives, but couldn’t find any I liked.

- Larry

> I think the limitation is on the 8051 end. One 512-byte packet takes
> 8.53 microseconds to cross the USB channel, and the 35.7 MByte/sec
> sustained rate implies the 8051 sets up the next packet in only 5.81
> microseconds. I don’t think there is any pipelining at this level.

You’re probably right. It’s known that the SSRP (which had no FPGA,
just a USB interface and an ADC) got higher throughput, because it
programmed the USB interface registers to automatically stream data
through without the intervention of the 8051. (It knew all the
traffic was inbound, so didn’t need the option to switch directions on
the bus.)

We could get a similar speedup in the USRP (or LLRF) by having the
8051 jump to different inner loops when doing input-only, output-only,
or both input and output over the USB bus. Think of it as hoisting
one of the invariant decisions out of the inner loop.

John