Large RX Delay

I am currently investigating different USRP delays. Some of them I can
explain, others not. For example, I see an average delay of 6.2ms
while receiving data from the USRP at a sample rate of 8MHz (short
real samples, i.e. I am using the usrp.source_s).
Here is my setup:

  • I have a function generator which generates a 1 Hz square wave. This
    signal is fed into an LFRX on the USRP and into CH1 of an oscilloscope.
  • On the PC side, I modified the null sink to check for a pos/neg
    signal (with some thresholding). When the signal is high, I output 1
    on the parallel port, when it is low, I output 0. The parallel port is
    also hooked up to the oscilloscope. This allows me to measure the
    delay between the two signals. (A rough sketch of such a sink follows
    after this list.)
  • I use fusb_block_size=512, fusb_nblocks=1
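For illustration only: a minimal sketch of the kind of thresholding sink plus parallel-port output described above. This is not Thomas's actual code; the port address 0x378, the threshold, and the function names are assumptions, and a real sink would also have to account for the interleaved I/Q shorts that usrp.source_s delivers, as discussed further down in the thread.

    // Illustrative sketch, not the actual modified null sink. Assumes x86
    // Linux, LPT1 at the usual base address 0x378, and root for ioperm().
    // Build with optimization so glibc's inline outb() is emitted, e.g.
    //   g++ -O2 parport_sink.cc -o parport_sink
    #include <sys/io.h>     // ioperm(), outb()
    #include <cstdio>

    static const unsigned short PARPORT_BASE = 0x378;  // typical LPT1 data register
    static const short THRESHOLD = 1000;               // arbitrary detection threshold

    // Conceptually what the modified sink does with each buffer of shorts:
    // decide whether the square wave is high or low and mirror that level
    // onto the parallel port data lines. Note that usrp.source_s actually
    // delivers interleaved I and Q shorts, so real code must pick the right
    // ones (see Eric's reply below).
    static void process_samples(const short *in, int nsamples)
    {
        static unsigned char last = 0xff;               // force a write on the first call
        for (int i = 0; i < nsamples; i++) {
            unsigned char level = (in[i] > THRESHOLD) ? 1 : 0;
            if (level != last) {                        // only touch the port on edges
                outb(level, PARPORT_BASE);
                last = level;
            }
        }
    }

    int main()
    {
        if (ioperm(PARPORT_BASE, 3, 1) != 0) {          // request port access (needs root)
            perror("ioperm");
            return 1;
        }
        short demo[6] = { 0, 2000, 2500, 100, -300, 3000 };  // stand-in samples
        process_samples(demo, 6);
        return 0;
    }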

Here are some numbers I got:

  • decimation 8: mean delay 6.1ms
  • decimation 32: mean delay 6.4ms
  • decimation 64: mean delay 4.3ms
  • decimation 256: mean delay 14.7ms

First of all, I don’t understand why we have such a high delay.
Shouldn’t it be more in the hundreds of µs instead of in the ms
range? Second, why is the delay shorter for decimation 64, and again
larger for a decimation of 256?

Thomas

On Tuesday 14 November 2006 11:33, Thomas S. wrote:

First of all, I don’t understand why we have such a high delay.
Shouldn’t it be more in the hundreds of µs instead of in the ms
range? Second, why is the delay shorter for decimation 64, and again
larger for a decimation of 256?

How many times a second is your OS switching process context?
How are you frobbing the parallel port? Does it involve a syscall?

I use the outb command to change the parallel port. From what I read,
that command should have a delay of around 1 µs, not more. I am not
sure about the context switches, but I am almost sure that this is not
the problem. I don’t run anything else on the machine (i.e., no other
heavy-load process), and the machine is pretty powerful (dual CPU,
dual core with hyper-threading, which makes 8 CPUs under Linux). If
it helps, I am running the stock Ubuntu 6.10 kernel (2.6.17-10-generic
#2 SMP Fri Oct 13 18:45:35 UTC 2006 i686 GNU/Linux). Additionally, I
was able to measure the TX delay for burst transmission with a similar
technique to be on the order of a couple of hundred µs. This is why
I am wondering about the large RX delay.
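A quick way to sanity-check the "around 1 µs per outb" figure, should anyone want to reproduce it; this is an illustrative standalone test, not part of the measurement setup above, and it assumes x86 Linux with LPT1 at 0x378 and root privileges.

    // Time a large number of outb() calls to the parallel port.
    // Build with optimization, e.g. g++ -O2 outb_timing.cc, and run as root.
    #include <sys/io.h>
    #include <sys/time.h>
    #include <cstdio>

    int main()
    {
        const unsigned short port = 0x378;              // usual LPT1 data register
        const int n = 100000;

        if (ioperm(port, 3, 1) != 0) {
            perror("ioperm");
            return 1;
        }

        struct timeval t0, t1;
        gettimeofday(&t0, 0);
        for (int i = 0; i < n; i++)
            outb(i & 1, port);                          // toggle the lowest data line
        gettimeofday(&t1, 0);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) * 1e-6;
        printf("%.2f us per outb\n", secs / n * 1e6);
        return 0;
    }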

Thomas

On Tuesday 14 November 2006 12:10, Thomas S. wrote:

I use the outb command to change the parallel port. From what I read,
that command should have a delay of around 1 µs, not more. I am not

Hmm, well you probably will incur a few microseconds because you need
to talk to the legacy hardware, which tends to be very slow.

You won’t have any context switch or syscall overhead though.

sure about the context switches, but I am almost sure that this is not
the problem. I don’t run anything else on the machine (i.e., no other
heavy-load process), and the machine is pretty powerful (dual CPU,
dual core with hyper-threading, which makes 8 CPUs under Linux). If
it helps, I am running the stock Ubuntu 6.10 kernel (2.6.17-10-generic
#2 SMP Fri Oct 13 18:45:35 UTC 2006 i686 GNU/Linux). Additionally, I
was able to measure the TX delay for burst transmission with a similar
technique to be on the order of a couple of hundred µs. This is why
I am wondering about the large RX delay.

Not sure hyperthreading is a good idea here. The virtual cores may be
stalled without you really knowing. I wouldn’t imagine it would have
such a large effect as you are seeing, though.

Not really sure what else to try though, sorry…

Hi Eric,

On 11/13/06, Eric B. [email protected] wrote:

On Mon, Nov 13, 2006 at 05:03:17PM -0800, Thomas S. wrote:

I am currently investigating different USRP delays. Some of them I can
explain, others not. For example, I see an average delay of 6.2ms
while receiving data from the USRP at a sample rate of 8MHz (short
real samples, i.e. I am using the usrp.source_s).

With usrp.source_s, you still get 16-bit I & Q, they’re just
interleaved in the single stream. This may be the cause of some
confusion. Thus, if you set decim == 8, you’ll be getting 8 MS/s
complex across the USB. This is 32MB/sec == 16 M shorts/sec.

Ah, yes that makes sense. I will take that into considerations and go
through my calculations again.

I suspect that you are getting overruns with this configuration.
You’re not saying anything about this, but I doubt it’s going to
work well with fusb_nblocks = 1.

Try running it with fusb_nblocks = 4. At the high data rates, this
still isn’t reliable on my Core Duo laptop.

I have to check that. I redirected stderr to a different file to see
my own “debug” output. It might very well be that I am getting a lot
of overruns.

Here are some numbers I got:

  • decimation 8: mean delay 6.1ms
  • decimation 32: mean delay 6.4ms
  • decimation 64: mean delay 4.3ms
  • decimation 256: mean delay 14.7ms

Have you logged the received data to a file?
Remember, you’re not getting “real” samples. You’re getting
interleaved I & Q.

No, I do not log the received data to a file. I record the waveforms
on the oscilloscope and do the post-processing on them in Octave.

First of all, I don’t understand why we have such a high delay.
Shouldn’t it be more in the hundreds of µs instead of in the ms
range?

Yes, at the high input rates.

Second, why is the delay shorter for decimation 64, and again
larger for a decimation of 256?

Overruns and/or incorrect examination of the incoming data?

I think now, after considering your response, that this might be the
problem. I will check both things tomorrow and do the measurements
again. Hopefully I will find smaller delays.

Right now I’m looking at how the received data buffering is done in
the usrp and gr-usrp code. Looks like there may be a
problem/opportunity in the usrp1_source_base.cc.

I’ll get back to you with more info in a bit…

Thank you very much. I am looking forward to your response.

Thomas

On Mon, Nov 13, 2006 at 10:19:22PM -0800, Thomas S. wrote:

I’ll get back to you with more info in a bit…

Thank you very much. I am looking forward to your response.

Thomas

Thomas,

I’ve made a change to usrp1_source_base.cc that might help.
You can either pick it up from my developer branch:

$ svn co http://gnuradio.org/svn/gnuradio/branches/developers/eb/rx-latency

or apply the attached patch.

Please let me know if it helps. I think it will, once the overruns
are taken care of. If it does work, I’ll merge it into the trunk.

Eric

On Mon, Nov 13, 2006 at 05:03:17PM -0800, Thomas S. wrote:

I am currently investigating different USRP delays. Some of them I can
explain, others not. For example, I see an average delay of 6.2ms
while receiving data from the USRP at a sample rate of 8MHz (short
real samples, i.e. I am using the usrp.source_s).

With usrp.source_s, you still get 16-bit I & Q, they’re just
interleaved in the single stream. This may be the cause of some
confusion. Thus, if you set decim == 8, you’ll be getting 8 MS/s
complex across the USB. This is 32MB/sec == 16 M shorts/sec.
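For readers following along, the arithmetic behind Eric's figures, assuming the USRP1's 64 MS/s ADC clock and 4 bytes per complex sample over the USB; the little program below simply restates those numbers for the decimations used in this thread.

    // Back-of-the-envelope USB rates for usrp.source_s at various decimations.
    #include <cstdio>

    int main()
    {
        const double adc_rate = 64e6;                   // USRP1 ADC clock
        const int decims[] = { 8, 16, 32, 64, 256 };
        const int n = sizeof(decims) / sizeof(decims[0]);

        for (int i = 0; i < n; i++) {
            double complex_rate = adc_rate / decims[i];     // complex samples/s
            double bytes_per_sec = complex_rate * 4;        // 16-bit I + 16-bit Q
            double shorts_per_sec = complex_rate * 2;       // what source_s hands you
            printf("decim %3d: %6.2f MS/s complex, %5.1f MB/s, %5.1f M shorts/s\n",
                   decims[i], complex_rate / 1e6, bytes_per_sec / 1e6,
                   shorts_per_sec / 1e6);
        }
        return 0;          // decim 8 -> 8 MS/s complex, 32 MB/s, 16 M shorts/s
    }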

Here is my setup:

  • I have a function generator which generates a 1 Hz square wave. This
    signal is fed into an LFRX on the USRP and into CH1 of an oscilloscope.
  • On the PC side, I modified the null sink to check for a pos/neg
    signal (with some thresholding). When the signal is high, I output 1
    on the parallel port, when it is low, I output 0. The parallel port is
    also hooked up to the oscilloscope. This allows me to measure the
    delay between the two signals.
  • I use fusb_block_size=512, fusb_nblocks=1

I suspect that you are getting overruns with this configuration.
You’re not saying anything about this, but I doubt it’s going to
work well with fusb_nblocks = 1.

Try running it with fusb_nblocks = 4. At the high data rates, this
still isn’t reliable on my Core Duo laptop.

Here are some numbers I got:

  • decimation 8: mean delay 6.1ms
  • decimation 32: mean delay 6.4ms
  • decimation 64: mean delay 4.3ms
  • decimation 256: mean delay 14.7ms

Have you logged the received data to a file?
Remember, you’re not getting “real” samples. You’re getting
interleaved I & Q.

First of all, I don’t understand why we have such a high delay.
Shouldn’t it be more in the hundreds of µs instead of in the ms
range?

Yes, at the high input rates.

Second, why is the delay shorter for decimation 64, and again
larger for a decimation of 256?

Overruns and/or incorrect examination of the incoming data?

Also, the received data (and transmitted data too) is “quad buffered”
in the FX2, so there’s a maximum of 4 512-byte buffers between you and
the data on the receive path. This could be reduced without much
trouble to “double buffered”. But I don’t think this is really the
problem.

With the quad buffered case and decim = 8, the most data that could be
buffered in the FX2 is 4*512 = 2048 bytes --> 2048/32e6 = 64 µs worth.
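The same 4 × 512-byte figure worked out for the other decimations Thomas measured, assuming the 64 MS/s ADC clock and 4 bytes per complex sample; this only quantifies the numbers already given above.

    // Worst-case FX2 ("quad buffered") buffering expressed as signal time.
    #include <cstdio>

    int main()
    {
        const double adc_rate = 64e6;                   // USRP1 ADC clock
        const double fx2_bytes = 4 * 512;               // four 512-byte buffers
        const int decims[] = { 8, 32, 64, 256 };

        for (int i = 0; i < 4; i++) {
            double usb_bytes_per_sec = (adc_rate / decims[i]) * 4;
            printf("decim %3d: FX2 holds at most %6.0f us of signal\n",
                   decims[i], fx2_bytes / usb_bytes_per_sec * 1e6);
        }
        return 0;                    // 64 us at decim 8, roughly 2 ms at decim 256
    }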

Right now I’m looking at how the received data buffering is done in
the usrp and gr-usrp code. Looks like there may be a
problem/opportunity in the usrp1_source_base.cc.

I’ll get back to you with more info in a bit…

Thanks for going to the trouble of making the measurements!

Eric

Hi Eric,

I did new tests today, and you were right: I had a lot of overruns.
Therefore, I increased fusb_nblocks to 8 and fusb_block_size to 2048.
Even with these settings I still got some overruns, but they were
rare. Here are the new numbers (for the code in trunk):
Decimation, Nice, Real_Time, mean [s]
8, 10, no, 0.00058
8, -20, no, 0.00057
8, -20, yes, 0.00058
16, 10, no, 0.00108
16, -20, no, 0.00102
16, -20, yes, 0.00106

Interpret this as a CSV file ;). Nice is the nice value with which
the process ran. Real_Time indicates whether real-time scheduling was
enabled. As you can see, these numbers are way better than the ones I
had yesterday.
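For what it's worth, these means line up quite well with the host-side USB buffering if one assumes, and this is an assumption about the fusb layer rather than something stated in the thread, that roughly fusb_nblocks × fusb_block_size bytes can be queued between the USRP and the application.

    // How much signal time fits in fusb_nblocks * fusb_block_size bytes
    // (an assumption about the fusb layer) at the decimations measured above.
    #include <cstdio>

    int main()
    {
        const double adc_rate = 64e6;                   // USRP1 ADC clock
        const double buffered = 8.0 * 2048;             // fusb_nblocks * fusb_block_size
        const int decims[] = { 8, 16 };

        for (int i = 0; i < 2; i++) {
            double usb_bytes_per_sec = (adc_rate / decims[i]) * 4;
            printf("decim %2d: up to %5.0f us buffered\n",
                   decims[i], buffered / usb_bytes_per_sec * 1e6);
        }
        return 0;   // ~512 us at decim 8 and ~1024 us at decim 16 -- compare the
                    // ~0.58 ms and ~1.0-1.1 ms means measured above
    }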

Now, here are the results for your updated code:
Decimation, Nice, Real_Time, mean [s]
8, 10, no, 0.000635
8, -20, no, 0.000629
8, -20, yes, 0.000627

As you can see, the new code performs worse. I didn’t do more than
these three tests with your code, because I wanted to see a first
result before spending more time collecting data.

Thomas

On Tue, Nov 14, 2006 at 10:04:46PM -0800, Thomas S. wrote:

16, 10, no, 0.00108
8, 10, no, 0.000635
8, -20, no, 0.000629
8, -20, yes, 0.000627

As you can see, the new code performs worse. I didn’t do more than
these three tests with your code, because I wanted to see a first
result before spending more time collecting data.

Thomas

Interesting.

Note that “real time” won’t directly impact the latency; however, it
should allow you to use smaller values for fusb_nblocks and
fusb_block_size without incurring overruns.
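One way to request real-time scheduling on Linux, for anyone reproducing these runs; this is a generic sketch using the raw POSIX call, not necessarily the mechanism Thomas used, and it needs root or an appropriate rtprio limit.

    // Request SCHED_FIFO for the current process before starting the flow graph.
    #include <sched.h>
    #include <cstdio>

    int main()
    {
        struct sched_param sp;
        sp.sched_priority = 50;                         // mid-range RT priority
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");               // not root / no rtprio limit
            return 1;
        }
        printf("running with SCHED_FIFO priority %d\n", sp.sched_priority);
        // ... start the receive flow graph here ...
        return 0;
    }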

I’m somewhat surprised that the new code doesn’t show less latency.
I’ll have to think about that some more.

When you run out of other things to measure, can you try running with
real-time and reducing fusb_block_size and fusb_nblocks ;)

Can you mail me your complete rx program off the list?

Thanks,
Eric

On 11/14/06, Eric B. [email protected] wrote:

Interesting.

Note that “real time” won’t directly impact the latency; however, it
should allow you to use smaller values for fusb_nblocks and
fusb_block_size without incurring overruns.

I’m somewhat surprised that the new code doesn’t show less latency.
I’ll have to think about that some more.

I did the tests again (including recompiling your modifications,
etc.), and this time I took 5000 measurements. The result is the same:
the original code is faster. What I noticed is that I get more
overruns with your modified code than with the original one. This
might be why we see a higher delay.

When you run out of other things to measure, can you try running with
real-time and reducing fusb_block_size and fusb_nblocks ;)

I will try to do that if I find some spare minutes. It will be
interesting to see the result…

Thomas