FX2 firmware

Hi all!

I am studying the FX2 firmware provided by the USRP package, just to
“get a feeling” for this.

There are a few very old mails on the mail archive stating that an
improvement of the USB bandwidth could be possible if the FX2 timing is
tuned. Does anyone know where the current bottleneck is? Is it the main
loop, or the GPIF state machine?

There was an idea about hoisting the loop-invariant checks out of the
main loop, e.g. running a TX-only loop when only the TX chain is
enabled. However, my first quick-and-dirty attempt didn’t change
anything (tested with test_usrp_standard_tx). At least it still works :-)

A pointer in the right direction is enough; I’ll figure out the rest.
It would just be faster if someone could direct me.

Best regards
Dominik

Hi!

A more specific question on the FX2:

do {
    FLOWSTATE = 0x81;
    FLOWLOGIC = 0x2d;
    FLOWEQ0CTL = 0x26;
    FLOWEQ1CTL = 0x00;
    FLOWHOLDOFF = 0x04;
    FLOWSTB = 0x04;
    FLOWSTBEDGE = 0x03;
    FLOWSTBHPERIOD = 0x02;
    GPIFHOLDAMOUNT = 0x00;
} while (0)

If I have reverse-engineered this correctly (gpif.gpf crashes the
current GPIF Designer, and importing gpif.c skips the flow states), this
sets the GPIF to transfer data on both the rising AND the falling edge
while in the flow state. Is this correct?

What I have found is:

State 1 is the flow state (for both waveforms).

Flow-state settings:
  fifowr:
    if TCXpire and TCXpire then - else WEN, BOGUS
    Master Strobe pin “unused”, half period 2 (= 1 clock)
    Holdoff pin “unused”, but holdoff not asserted
  fiford:
    if TCXpire and TCXpire then - else REN, OE, BOGUS
    everything else unchanged from fifowr

DP (decision point):
  fiford:
    if TCXpire and TCXpire then S2 else S1
  etc.

Btw. there is an application note on flow states (I saw that someone
stated that these are barely documented):
http://www.cypress.com/?rID=12951

Best regards
Dominik

Hello,

I was able to increase the USB bandwidth of the RX chain to 40 MB/s if
TX is completely turned off (test_usrp_standard_rx -D 4). However, with
test_usrp_standard_tx -i 8, it won’t get beyond 32.7 MB/s. I am ignoring
under/overruns for now.

Is there a way to test whether this is a limitation of my mainboard, the
program or the USRP?

Best regards
Dominik

> If I have reverse-engineered this correctly (gpif.gpf crashes the
> current GPIF Designer, and importing gpif.c skips the flow states),
> this sets the GPIF to transfer data on both the rising AND the falling
> edge while in the flow state. Is this correct?

I can answer this myself ;-) It took a while …

So, data is transferred on both the falling and the rising edge of the
master strobe (MSTB, which is not connected to the FPGA). The half
period of MSTB is 1 IFCLK cycle (the minimum). Hence, data is actually
transferred once per IFCLK cycle (twice per MSTB cycle). MSTB toggles at
24 MHz. This gives a data rate of 96 MB/s (16 bits per IFCLK cycle, with
IFCLK running at 48 MHz).
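The arithmetic above can be written out as a small sketch. The helper
names are made up for illustration. The code assumes that
FLOWSTBHPERIOD counts half IFCLK cycles (so 0x02 means one clock, as
noted earlier in the thread) and that bits 0 and 1 of FLOWSTBEDGE enable
the rising and falling edge respectively. Please check both against the
FX2 reference manual before relying on them.

```c
#include <stdint.h>

/* MSTB frequency in Hz.
 * Assumption: hperiod_reg is the strobe half period in units of half an
 * IFCLK cycle, so a full MSTB period lasts hperiod_reg IFCLK cycles. */
static uint32_t mstb_hz(uint32_t ifclk_hz, uint32_t hperiod_reg)
{
    return ifclk_hz / hperiod_reg;
}

/* Number of MSTB edges per cycle that move data.
 * Assumption: bit 0 = rising edge, bit 1 = falling edge. */
static uint32_t edges_per_mstb_cycle(uint32_t stbedge_reg)
{
    return (stbedge_reg & 0x01u) + ((stbedge_reg >> 1) & 0x01u);
}

/* Peak GPIF throughput in bytes per second while in the flow state. */
static uint32_t gpif_bytes_per_second(uint32_t ifclk_hz, uint32_t hperiod_reg,
                                      uint32_t stbedge_reg, uint32_t bus_bits)
{
    return mstb_hz(ifclk_hz, hperiod_reg)
         * edges_per_mstb_cycle(stbedge_reg)
         * (bus_bits / 8u);
}
```

With the values from this thread (48 MHz IFCLK, FLOWSTBHPERIOD = 0x02,
FLOWSTBEDGE = 0x03, 16-bit bus) this gives 24 MHz for MSTB and
96,000,000 bytes/s. That is above the 60 MB/s raw signalling rate of
high-speed USB (480 Mbit/s), so under these assumptions the strobe
timing itself would not be the limiting factor.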

IFCLK is generated internally, and output inverted to the FPGA.

Dominik

On Thu, Apr 30, 2009 at 03:51:46PM +0200, Dominik A. wrote:

> Hello,
>
> I was able to increase the USB bandwidth of the RX chain to 40 MB/s if
> TX is completely turned off (test_usrp_standard_rx -D 4). However,
> with test_usrp_standard_tx -i 8, it won’t get beyond 32.7 MB/s. I am
> ignoring under/overruns for now.
>
> Is there a way to test whether this is a limitation of my mainboard,
> the program or the USRP?

It’s hard to say. If you’ve got a logic analyzer you can instrument
the inner loop of the firmware and see if that’s the bottleneck or not.

Eric

Hi Eric,

Thanks for the answer.

> It’s hard to say. If you’ve got a logic analyzer you can instrument
> the inner loop of the firmware and see if that’s the bottleneck or not.

Unfortunately, I don’t have access to a logic analyzer :-(

However, I made progress that I am going to share once it is tested and
cleaned up.

Short summary:
When doing RX only, I am at 45 MB/s (yes! decim=6 works without
underruns). On the TX side, I can’t get above 32.7 MB/s. I now suspect
that this is a host-side bottleneck. On the FX2, when only one direction
is in use, I set the GPIF to loop indefinitely and use GPIFABORT = 0xFF
to switch when the state changes. Hence there is no main loop left that
could be a bottleneck. The TX state machine now consists of two states:
state 1 is the idle state, and state 2 transfers data (one word per
clock, as before). The 8051 core is completely out of the data path
(auto commit etc.).
The same holds for RX, except that a few more states were needed.

When RX and TX are needed, the firmware is still faster, though the same
TX bottleneck appears (which is, of course, no big problem because we
already share USB bandwidth).

Do you maybe have an idea why TX bandwidth is limited? Interestingly
enough, 32.7 MB/s is the limit on both my computer and my notebook. Of
course, I made the TX loop on the host as short as possible, set
SCHED_FIFO with rtprio 49, and played with fusb_nblock/size etc.
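For readers who want to reproduce the scheduling part, here is a minimal
sketch of requesting SCHED_FIFO at priority 49 from inside a process
with the POSIX call sched_setscheduler(). The helper name
request_fifo_priority is made up for this example; it is not taken from
the USRP host code, and the thread does not say how the priority was
actually set there.

```c
#include <errno.h>
#include <sched.h>

/* Ask the kernel to run the calling process under SCHED_FIFO at the
 * given priority. Returns 0 on success, or the errno value on failure
 * (typically EPERM when the process lacks real-time privileges, or
 * EINVAL when the priority is outside the valid range). */
static int request_fifo_priority(int priority)
{
    struct sched_param param = { .sched_priority = priority };

    if (sched_setscheduler(0, SCHED_FIFO, &param) != 0)
        return errno;
    return 0;
}
```

Calling request_fifo_priority(49) early in the transmit program, before
entering the TX loop, is one way to use it. On most Linux systems this
needs root or a suitable rtprio limit, so a non-zero return value should
be handled rather than ignored.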

Dominik

Hi Philip,

http://gnuradio.org/trac/wiki/UsrpFAQ/Gen#USB:480MBitsec32MBytesec
http://gnuradio.org/trac/wiki/UsrpFAQ/FX2

We can get beyond that. See
http://lists.gnu.org/archive/html/discuss-gnuradio/2006-10/msg00340.html
Larry achieved 35 MB/s. I got 40 MB/s when receiving. The SSRP sustains
more than 40 MB/s on the receiver side:
http://oscar.dcarr.org/ssrp/software/firmware/firmware.php

Also:
http://lists.gnu.org/archive/html/discuss-gnuradio/2004-08/msg00011.html

So, there are demo firmwares for the FX2 sustaining 50 MB/s (though I
haven’t found them yet).

Best regards
Dominik

On Wednesday 06 May 2009 18:35:36 Dominik A. wrote:

> Do you maybe have an idea why TX bandwidth is limited? Interestingly
> enough, 32.7 MB/s is the limit on both my computer and my notebook. Of
> course, I made the TX loop on the host as short as possible, set
> SCHED_FIFO with rtprio 49, and played with fusb_nblock/size etc.

Are you transmitting random data, or a constant stream of ones? In the
latter case (IIRC), a single 0 is stuffed after every six consecutive
ones to aid clock recovery, limiting net bandwidth to 6/7 (which is
about 42 MByte/s). Try transmitting random data; a stream of zeros
should be fine too.

For details, have a look at the USB spec.
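To estimate how much a given payload is affected, the stuffing overhead
can be counted directly. Note that in the USB 2.0 specification the
rule is keyed on ones: a 0 is inserted after six consecutive 1 bits,
before NRZI encoding. The sketch below is only an illustration: the
function name is made up, it looks at payload bytes only (sync, PID and
CRC fields are ignored), and it assumes bytes are sent least significant
bit first.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the stuffing bits USB would add to a payload.
 * A 0 is stuffed after every run of six consecutive 1 bits; the run
 * counter restarts after each stuffed bit. */
static size_t usb_stuffed_bits(const uint8_t *data, size_t len)
{
    size_t stuffed = 0;
    unsigned run = 0;                /* length of the current run of ones */

    for (size_t i = 0; i < len; i++) {
        for (unsigned bit = 0; bit < 8; bit++) {   /* LSb first */
            if ((data[i] >> bit) & 1u) {
                if (++run == 6) {    /* six ones in a row: stuff a zero */
                    stuffed++;
                    run = 0;
                }
            } else {
                run = 0;             /* a zero bit ends the run */
            }
        }
    }
    return stuffed;
}
```

A buffer of all 0xFF bytes is the worst case: every six payload bits
cost a seventh bit on the wire, which is where the 6/7 factor comes
from. A buffer of all zeros needs no stuffing at all.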

Stefan



Dominik A. wrote:

> Same for RX, except that a few more states were needed.

Hmmm. My application is RX-only. Using 8-bit samples, that 45 MB/s
gives about 20 Msps. I have a QX9770 system running at 3.7 GHz, but
still get overruns with two channels at 8 Msps per (complex) channel. I
also get overruns at 16 Msps, single-channel.

At 8 Msps dual-channel, my application (an all-mode radio astronomy
receiver system) burns up about 2.75 CPUs on the above-mentioned
QX9770 @ 3.7 GHz (with slower memory that will get upgraded soon!). I
get overruns a couple of times per minute with this setup.

What type of system are you getting reliable 45 MB/s receive throughput
on, and how complicated is your signal processing flowgraph?

Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium

Hi!

> Hmmm. My application is RX-only. Using 8-bit samples, that 45 MB/s
> gives about 20 Msps. I have a QX9770 system running at 3.7 GHz, but
> still get overruns with two channels at 8 Msps per (complex) channel.
> I also get overruns at 16 Msps, single-channel.

You mean your system doesn’t even sustain 32 MB/s?
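The 32 MB/s figure follows from the sample format. As a rough sketch,
assuming that "8-bit samples" means an 8-bit I and an 8-bit Q value per
complex sample (2 bytes in total); the function names are invented for
this example:

```c
/* USB payload rate in bytes per second for a given sampling setup.
 * Assumption: one complex sample is an I/Q pair, so it occupies
 * 2 * bits_per_component / 8 bytes. */
static double usb_bytes_per_second(unsigned channels, double complex_sps,
                                   unsigned bits_per_component)
{
    double bytes_per_sample = 2.0 * bits_per_component / 8.0;
    return channels * complex_sps * bytes_per_sample;
}

/* Highest complex sample rate that fits into a given byte rate. */
static double max_complex_sps(double bytes_per_second,
                              unsigned bits_per_component)
{
    return bytes_per_second / (2.0 * bits_per_component / 8.0);
}
```

Two channels at 8 Msps and one channel at 16 Msps both come out at
32 MB/s with 8-bit components, and 45 MB/s corresponds to 22.5 Msps,
in line with the "about 20 Msps" quoted above.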

> At 8 Msps dual-channel, my application (an all-mode radio astronomy
> receiver system) burns up about 2.75 CPUs on the above-mentioned
> QX9770 @ 3.7 GHz (with slower memory that will get upgraded soon!). I
> get overruns a couple of times per minute with this setup.

It could be a problem with your CPU, too. In our lab, our eight-core
machine has overruns, while my notebook with a Core 2 Duo does not. I
figured out that this is due to inter-processor communication: the
eight cores are made up of two quad-core processors, each of which is
itself two dual-cores on one die. When I restricted the scheduler
(taskset 0x11 app) to two cores that reside in the same dual-core, it
was fine, with no overruns. Adding one more core, in whatever location,
brought the overruns back. However, before noticing this, I had already
brought the CPU usage of that specific app (the transmitter) down to
two cores by aggressive optimization.
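For anyone trying the same experiment: taskset takes a hexadecimal
bitmask in which bit 0 stands for CPU 0, bit 1 for CPU 1, and so on, so
0x11 selects CPUs 0 and 4. A small sketch that decodes such a mask (the
helper name is made up for illustration):

```c
#include <stddef.h>

/* Write the CPU indices selected by a taskset-style bitmask into cpus[]
 * (at most max_cpus entries) and return how many were written. */
static size_t cpus_in_mask(unsigned long mask, unsigned *cpus, size_t max_cpus)
{
    size_t n = 0;

    for (unsigned cpu = 0; mask != 0 && n < max_cpus; cpu++, mask >>= 1) {
        if (mask & 1ul)
            cpus[n++] = cpu;
    }
    return n;
}
```

Which logical CPU numbers actually share a die or a cache depends on the
machine, so the right mask has to be worked out from the local topology
rather than copied from here.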

> What type of system are you getting reliable 45 MB/s receive
> throughput on, and how complicated is your signal processing
> flowgraph?

C2D E6750, 4 GB RAM, ICH9 USB controller.
I am using test_usrp_standard_rx, with no signal processing.

Dominik

It is a saw wave (0-255 per packet, upper 8 bits of each short are
zero). Thanks for the info! I will try sending different data this
evening.

Dominik

On Wed, May 06, 2009 at 06:35:36PM +0200, Dominik A. wrote:

> Short summary:
> When doing RX only, I am at 45 MB/s (yes! decim=6 works without
> underruns).

That’s great!

> When RX and TX are needed, the firmware is still faster, though the
> same TX bottleneck appears (which is, of course, no big problem
> because we already share USB bandwidth).
>
> Do you maybe have an idea why TX bandwidth is limited? Interestingly
> enough, 32.7 MB/s is the limit on both my computer and my notebook.

Not sure. Could be the EHCI controller, or the host driver, etc.

> Of course, I made the TX loop on the host as short as possible, set
> SCHED_FIFO with rtprio 49, and played with fusb_nblock/size etc.

Let us know what else you figure out!

Eric