UHD Announcement - February 25th 2011

Hello list,

In preparation for the coming gnuradio release, and the cut-over from
next to master, changes have been pushed to both the uhd.git master
branch and the gnuradio.git next branch.

http://code.ettus.com/redmine/ettus/projects/uhd/wiki


– highlights of this announcement

  • 2x performance increase
  • addition of sensors api
  • re-clocking support
  • gr-uhd api changes
  • stability + upcoming changes

– Performance improvements

We have profiled (with oprofile) and optimized the send() and recv()
fast paths in UHD. Users can expect about a 2x performance improvement
with the most recent UHD.


– Sensors API

The sensors API provides read-only access to arbitrary properties of
various “things” on a USRP motherboard or daughterboard. Examples: LO
locked, RSSI, reference locked, etc. The sensors API deprecates the
read_rssi() and get_lo_locked() calls.

http://www.ettus.com/uhd_docs/doxygen/html/classuhd_1_1usrp_1_1multi__usrp.html#acd37d327931cec64e3701eb2a5aa7bfb

One can query the available sensors through the API, and in the future
the available sensors will be documented in the daughterboard app
notes. FYI, “lo_locked” is implemented on all daughterboards with an
LO, and “rssi” is implemented on the XCVR2450.
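
As a minimal sketch of reading sensors through the multi_usrp API (the
device address here is illustrative):

    #include <uhd/usrp/multi_usrp.hpp>
    #include <iostream>

    int main(void){
        uhd::usrp::multi_usrp::sptr usrp = uhd::usrp::multi_usrp::make(
            uhd::device_addr_t("addr=192.168.10.2")); //illustrative address

        //list every sensor the RX daughterboard offers
        std::vector<std::string> names = usrp->get_rx_sensor_names(0);
        for (size_t i = 0; i < names.size(); i++){
            std::cout << usrp->get_rx_sensor(names[i], 0).to_pp_string()
                      << std::endl;
        }

        //query one sensor directly, e.g. LO lock
        std::cout << "LO locked: "
                  << usrp->get_rx_sensor("lo_locked", 0).to_bool()
                  << std::endl;
        return 0;
    }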


– re-clocking support

Re-clocking support has been added to the API:
http://www.ettus.com/uhd_docs/doxygen/html/classuhd_1_1usrp_1_1multi__usrp.html#a99254abfa5259b70a020e667eee619b9

On a USRP1 board, you can call usrp->set_master_clock_rate(52e6) so
that the driver knows to use 52 MHz in its calculations. Note that this
does not actually modify the clock rate; it just informs the driver of
the hardware change.

In contrast, when setting the clock rate on the usrp-e100, the driver
will dynamically reprogram the registers on the clock generator to
obtain the desired rate. See application notes:
http://www.ettus.com/uhd_docs/manual/html/usrp_e1xx.html#changing-the-master-clock-rate
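
A minimal sketch of the call (the device args are illustrative; on a
USRP1 this only informs the driver, on a USRP-E100 it actually
reprograms the clock generator):

    #include <uhd/usrp/multi_usrp.hpp>
    #include <iostream>

    int main(void){
        uhd::usrp::multi_usrp::sptr usrp = uhd::usrp::multi_usrp::make(
            uhd::device_addr_t("type=usrp1")); //illustrative device args

        usrp->set_master_clock_rate(52e6); //e.g. after replacing the clock
        std::cout << "master clock rate: "
                  << usrp->get_master_clock_rate() << std::endl;
        return 0;
    }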

Support for modifying the clock rate has been brought into the gr-uhd
blocks as well as the grc wrappers.


– gr-uhd changes

The gnuradio source and sink wrappers have been renamed and cleaned up
to become part of the stable API in the gnuradio master.

There is now only one source and one sink wrapper (gr_uhd_usrp_source
and gr_uhd_usrp_sink). These wrappers handle single- and multi-channel
configurations with one or more USRP devices.

If you compiled against the gr-uhd headers, you will need to change
your code to reflect the new header names and factory function names.
No other changes are necessary. Python code, whether generated by GRC
or hand-written, will continue to work without changes.

The gnuradio-companion blocks have been unified onto a single set of
wrappers as well (the UHD USRP Source and UHD USRP Sink blocks). GRC
flow graphs written with the old single/multi source/sink blocks will
continue to work as long as the old wrappers remain installed.


– On stability and upcoming changes

No new images are required for this release, with the exception of
USRP-E100 users. If you are an embedded user, I recommend waiting for
my next announcement with new images. But if you can’t wait, you can
grab them here:
http://www.ettus.com/downloads/uhd_images/UHD-images-most-recent/

We will be making another announcement with changes to the uhd.git
master branch and a new set of images. At this point, the master branch
will become stable and will only take on fixes and new hardware support.
Expect this announcement around the time gnuradio next cuts over to
master.

We are committing to the current gr-uhd API and expect this to become
part of the new stable gnuradio master in the coming days/weeks.

Josh,

When you say “2x” performance increase, do you mean CPU performance or
send()/recv() latency? Do you mind saying a few words on what changes
you have made?

Andrew

When you say “2x” performance increase, do you mean CPU performance or
send()/recv() latency? Do you mind saying a few words on what changes
you have made?

Much of the performance gain involved moving things out of the
fast-path that only needed to be done once at initialization (forgoing
code simplicity for performance). Examples: a vector of pointers, a
bound callable object; many of these involved calls to malloc and free,
which incur quite a lot of unnecessary overhead.

Fewer CPU cycles and less time are spent in the send()/recv() calls.
This worked out to roughly half the CPU usage when looking at oprofile.
Because of this, the overall latency is reduced. We measured about
250 us RTT from device to host and back to device with the latency
measurement app in the UHD examples.
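
A generic sketch of that kind of change (not the actual UHD code):
allocate once at initialization and reuse in the fast-path.

    #include <vector>
    #include <algorithm>
    #include <cstddef>

    class tx_fast_path_sketch{
    public:
        tx_fast_path_sketch(size_t nchans):
            _buff_ptrs(nchans) //allocated once at initialization
        {
            //before: the equivalent vector was constructed inside send(),
            //costing a malloc and a free on every call
        }

        void send(const std::vector<const void *> &buffs){
            std::copy(buffs.begin(), buffs.end(), _buff_ptrs.begin());
            //... pack the header and hand _buff_ptrs to the transport ...
        }

    private:
        std::vector<const void *> _buff_ptrs; //reused on every send()
    };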

-josh

Josh,

Thanks for sharing the information and your changes sound quite
reasonable.

However, it seems that your changes have introduced some bugs on the
transmitter side. I updated my system with your new code (following
the instructions in your Feb. 24 email titled “Re: GRC + N210 +
RFX2200 + UHD not working”); then I ran the Python-based
benchmark_tx.py. I tested two cases: in the first case, I sent packets
continuously and it worked well; in the second case, I sent packets
every second and the transmitter side could send only about 10~12
packets, then stopped sending data into USRP2 (based on observations
from Wireshark). Both cases used 1500 B for each packet and the
send-buff-size was 100 kB.

Would you please take a look at this?

Andrew


On 02/28/2011 08:21 AM, Feng Andrew Ge wrote:

transmitter side could send only about 10~12 packets, then stopped
sending data into USRP2 (based on observations from Wireshark). Both
cases used 1500 B for each packet and the send-buff-size was 100 kB.

I think this is a problem in gr-uhd:
http://gnuradio.org/cgit/gnuradio.git/tree/gr-uhd/lib/gr_uhd_usrp_sink.cc?h=next#n183

I am putting a time stamp into each packet. This is helpful for
multi-channel continuous transmission, but not for single-channel
non-continuous transmission.

Can you apply the attached diff and let me know if that fixes the
problem?

Thanks,
-Josh

Josh,

Thanks a lot for the explanation.

When you say 90 packets, I assume you mean UDP packets (which contain
samples). Given the default MTU payload of (1500-8-20) B, 2 samples
per symbol, and 4 B per sample, for BPSK or GMSK, 90 packets of
samples correspond to 90*1472/(2*4*8) = 2070 B of user data. If I use
1500 B per user packet, that’s less than 2 packets. For 700 UDP
packets, that’s about 10 user packets. This actually explains what I
observed: after about 10 user packets, my transmission stopped.
According to you, the host blocked first. However, it seemed that the
USRP didn’t send back update packets for some reason, which is
unusual. So it’s likely the timeout was triggered. To help me
understand what caused the above behavior, would you please spend a
little time answering the following questions?

(1) Which parameter (ups_per_fifo or ups_per_sec) corresponds to the
control parameter above (90 transmission packets and 700 packets
update)?
(2) How is the update packet generated on the USRP?
(3) In normal cases, when the host transmits a packet, does it specify
a transmission time for the USRP? If so, it must get the clock of the
USRP first and then leave some margin, which introduces some time
overhead; if not, does the USRP send whatever it receives immediately?
(4) What is the content of the short update packet?

Andrew

Further, I found that at a packet size of 1500 B, the interval between
two packet transmissions must be less than about 20 ms (on an Intel
i5-M20 processor); otherwise, the receive side couldn’t receive any
packets. This interval decreases as the packet size decreases.

Andrew

Thanks a lot for the explanation.

To explain your observations for the curious:

Prior to the fix, a recv buffer would be lost to the transport layer on
each timeout (thanks to an optimization I made earlier).

So, for every 100 ms window (the default timeout) that did not have at
least 90 packets transmitted, a receive buffer was lost. After 32
timeouts, there were no more available buffers and the flow control
throttled back.

-Josh

answers below:

you please spend a little time answering the following questions?

(1) Which parameter (ups_per_fifo or ups_per_sec) corresponds to the
control parameter above (90 transmission packets and 700 packets
update)?
(2) How is the update packet generated on the USRP?
(3) In normal cases, when the host transmits a packet, does it specify
a transmission time for the USRP? If so, it must get the clock of the
USRP first and then leave some margin, which introduces some time
overhead; if not, does the USRP send whatever it receives immediately?
(4) What is the content of the short update packet?

  1. ups_per_fifo

  2. it counts the number of transmitted packets, and sends an update
    packet every nth packet (default n = 90)

  3. a transmission time is optional, when not specified the send is
    immediate

  4. the sequence number of the last transmitted packet
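
For the curious, a simplified sketch of the scheme described by these
answers (the names and window size here are hypothetical, not the
actual UHD implementation):

    #include <stdint.h>
    #include <cstddef>

    struct flow_control_sketch{
        uint32_t next_seq;    //seq number of the next packet to send
        uint32_t last_update; //seq number from the device's update packet
        size_t window;        //how many packets may be in flight

        flow_control_sketch(size_t window_):
            next_seq(0), last_update(0), window(window_){}

        //host side: send() blocks (or times out) while this is false
        bool ready(void) const{
            return size_t(next_seq - last_update) < window;
        }

        //called for each short update packet (sent every nth packet, n = 90)
        void on_update(uint32_t seq){ last_update = seq; }

        uint32_t mark_sent(void){ return next_seq++; }
    };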

On 03/01/2011 12:52 PM, Feng Andrew Ge wrote:

negligible, I think that the interface between UHD and GNU Radio is
introducing some overhead. Do you have any thoughts on this?

The ping time is for talking to the embedded CPU and is not a
reflection of the latency when dealing with data/samples. For a better
explanation, see:
http://lists.ettus.com/pipermail/usrp-users_lists.ettus.com/2011-January/000521.html

Also, make sure you pull the latest gnuradio next branch. I pushed the
diff I sent you earlier regarding the time stamps. With the latest
change, all packets are sent ASAP in the single-channel case.

Would you tell me what threads are running in UHD when
uhd_single_usrp_sink and uhd_single_usrp_source are called? It seems
that at least two threads are started for each.

In reference to what runs on the latest master in uhd:

For a USB device (USRP1) there is a thread running libusb async
transactions.

For a network device (USRP2/N210) there is a thread receiving async
message packets; these include flow control updates and transmit error
notifications.

The other threads you see (and the ones with any major overhead) are
the threads in gnuradio (the thread-per-block scheduler).
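
A hypothetical sketch of such a dedicated thread (the real code blocks
on the socket rather than sleeping):

    #include <boost/thread.hpp>
    #include <boost/bind.hpp>

    class async_msg_task_sketch{
    public:
        async_msg_task_sketch(void):
            _running(true),
            _thread(boost::bind(&async_msg_task_sketch::loop, this)){}

        ~async_msg_task_sketch(void){
            _running = false;
            _thread.join();
        }

    private:
        void loop(void){
            while (_running){
                //recv on the async-message socket, then dispatch
                //flow control updates and transmit error notifications
                boost::this_thread::sleep(boost::posix_time::milliseconds(10));
            }
        }
        volatile bool _running; //declared before _thread so it is set first
        boost::thread _thread;
    };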

Is it right that the maximum amount of data that each socket.send() or
socket.recv() can operate on is dynamically determined by
noutput_items/ninput_items from the general work function in

correct

uhd_single_usrp_*? Originally I thought num_recv_frames had control
over this; but I noticed that the UDP transport document has been
updated: “Note1: num_recv_frames and num_send_frames do not affect
performance.”

Those refer to the number of buffers allocated. But in the UDP
implementation, buffers are used and disposed of synchronously, so you
only ever need a few.

In the libusb implementation, buffers are processed asynchronously, so
you can potentially have all the buffers being emptied/filled in the
background, so altering those values may make sense.
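
For example (a sketch; per the UHD transport app notes linked later in
this thread, these knobs are passed as device address parameters):

    #include <uhd/usrp/multi_usrp.hpp>

    int main(void){
        //meaningful for the asynchronous libusb transport (USRP1);
        //for the synchronous UDP transport it does not affect performance
        uhd::usrp::multi_usrp::sptr usrp = uhd::usrp::multi_usrp::make(
            uhd::device_addr_t("type=usrp1, num_recv_frames=128"));
        (void)usrp; //illustrative only
        return 0;
    }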

-Josh

Josh,

First of all, I am aware of what you pointed out and I did use the code
latency_test.cpp for measuring latency between USRP2 and a host. The
latency is negligible.

I think I was not clear enough in my previous email.
My setting is this: host_1–USRP2_1 talks to host_2–USRP2_2. The
latency I measured is based on the GNU Radio-created wireless network
interface, e.g., gr0. I started tunnel.py and created a digital link
between host_1 and host_2; then I compared ping RTT performance
between using the UHD code and using the Raw_Ethernet code base. UHD
introduced 9 ms of overhead and I am really puzzled about this. Since
USRP2 sends samples out immediately and the latency between the host
and USRP2 is negligible, the likely place I can think of is the
interface between UHD and GNU Radio. For example, can UDP packet
sending be preempted by other threads? How many UDP packets can
possibly be sent out each time?

Another possibility is how you allocate CPU resources in handling UHD
and what the impact might be.

The third possibility is buffer management: how do you handle buffer
management in UHD for sending and receiving? How do data stay in those
buffers and how are data processed, by FIFO or LIFO? If overflow
happens, will newly arriving packets simply get dropped?

Andrew

Josh,

That’s great, thanks.

When using UHD in GNU Radio, I observed huge time overhead: for
example, using the raw_Ethernet code at 500 kb/s, tunnel.py has only
about 8 ms ping RTT between two nodes; now with UHD, I have 17 ms on
average. As I increase the ping payload, the overhead time (excluding
the extra data communication time) increases accordingly. Since USRP2
by default sends data samples immediately and the RTT time between UHD
and USRP2 is negligible, I think that the interface between UHD and
GNU Radio is introducing some overhead. Do you have any thoughts on
this?

Would you tell me what threads are running in UHD when
uhd_single_usrp_sink and uhd_single_usrp_source are called? It seems
that at least two threads are started for each.

Is it right that the maximum amount of data that each socket.send() or
socket.recv() can operate on is dynamically determined by
noutput_items/ninput_items from the general work function in
uhd_single_usrp_*? Originally I thought num_recv_frames had control
over this; but I noticed that the UDP transport document has been
updated: “Note1: num_recv_frames and num_send_frames do not affect
performance.”

Andrew

On 03/01/2011 02:21 PM, Feng Andrew Ge wrote:

Josh,

First of all, I am aware of what you pointed out and I did use the code
latency_test.cpp for measuring latency between USRP2 and a host. The
latency is negligible.

Ok, I see. You were measuring the ping time over the tunnel. :-)

Can you tell me: is this a new problem with UHD since the “February
25th 2011” announcement? That is, was it working properly for you
previously?

I think I was not clear enough in my previous email.
My setting is this: host_1–USRP2_1 talks to host_2–USRP2_2. The
latency I measured is based on the GNU Radio-created wireless network
interface, e.g., gr0. I started tunnel.py and created a digital link
between host_1 and host_2; then I compared ping RTT performance
between using the UHD code and using the Raw_Ethernet code base. UHD
introduced 9 ms of overhead and I am really puzzled about this. Since

I am puzzled as well; 9 ms sounds pretty bad. Is this a port of
tunnel.py to UHD? Can you share it?

USRP2 sends samples out immediately and the latency between the host
and USRP2 is negligible, the likely place I can think of is the
interface between UHD and GNU Radio. For example, can UDP packet
sending be preempted by other threads? How many UDP packets can
possibly be sent out each time?

The work function is called with a randomly sized buffer determined by
the scheduler. The number of packets received or sent depends on the
size of the buffer when the work() function is called.

I think this is exactly the same for the raw_ethernet driver.

It may be helpful to print the number of items in the work function. It
seems to be in the 10s of thousands of samples last I looked.

When you compared UHD vs raw_ethernet driver, it was all the same
version of gnuradio, correct?

Another possibility is how you allocate CPU resources in handling UHD
and what the impact might be.

The third possibility is buffer management: how do you handle buffer
management in UHD for sending and receiving? How do data stay in those
buffers and how are data processed, by FIFO or LIFO? If overflow
happens, will newly arriving packets simply get dropped?

Nothing gets buffered in UHD in the usrp2/n210 implementation.
However, there is a kernel socket buffer on receive that has enough
buffering for a second of samples. Once this buffer fills, newer packets
are dropped.

I also believe that this is the same on the raw ethernet driver.
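
Back-of-the-envelope, using the 4 bytes per sample mentioned later in
this thread (the sample rate here is illustrative):

    #include <iostream>

    int main(void){
        const double samp_rate = 25e6;   //illustrative rate, samples/sec
        const double bytes_per_samp = 4; //over-the-wire sample size
        //about one second of buffering for the kernel receive buffer:
        std::cout << samp_rate * bytes_per_samp << " bytes" << std::endl;
        //then e.g.: sudo sysctl -w net.core.rmem_max=100000000
        return 0;
    }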

-josh

Josh,

Your explanation makes sense. Is there a quick fix for me to bypass
this problem temporarily while you are working on the modification?

Further, following your current bandwidth optimization implementation,
is the code trying to fill the buffer in both uhd_single_usrp_sink
(sending buffer) and uhd_single_usrp_source (receiving buffer)?

When I started uhd_benchmark_tx.py, it also asked for a recv_buff_size
specification; where is it used?

On 03/01/2011 03:25 PM, Feng Andrew Ge wrote:

found time to check GNU Radio changes yet, what might cause such huge
performance drop.

Andrew,

Here is an idea that may explain your problem:

When the raw ethernet source calls into work(), it does not attempt to
fill the entire buffer (noutput_items). Rather, it waits for at least
one packet to become available and then copies only the data that is
immediately available into the buffer.

In contrast, the UHD work function is bandwidth-optimized and tries to
fill the entire buffer. At your sample rate (500 ksps), this will
impose serious delays for very large noutput_items (10s of thousands).
I hope that explains the issue you see.

I am going to attempt a modification to the work function, where we recv
a single packet with timeout, and then anything else that is available
without waiting.
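
In sketch form, using the recv modes named later in this thread
(single channel for brevity; the exact uhd::device::recv() signature
here is assumed from that era's API, so treat it as illustrative):

    #include <uhd/usrp/multi_usrp.hpp>
    #include <complex>
    #include <cstddef>

    size_t recv_low_latency(
        uhd::device::sptr dev, std::complex<float> *buff,
        size_t noutput_items, uhd::rx_metadata_t &md
    ){
        const uhd::io_type_t io_type = uhd::io_type_t::COMPLEX_FLOAT32;

        //phase 1: wait (with timeout) for at least one packet
        size_t num_samps = dev->recv(
            buff, noutput_items, md, io_type,
            uhd::device::RECV_MODE_ONE_PACKET, 0.1);
        if (num_samps == 0) return 0; //timed out

        //phase 2: take only what is already available, without waiting
        num_samps += dev->recv(
            buff + num_samps, noutput_items - num_samps, md, io_type,
            uhd::device::RECV_MODE_FULL_BUFF, 0.0);

        return num_samps; //often < noutput_items, which is fine for latency
    }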

I will let you know when I post a branch with changes.
-Josh

Josh,

Once I start uhd_benchmark_rx.py, USRP2 continuously sends data to the
host. The data rate is the sample rate times 4 (bytes per sample). This
happens even when no transmitter is around. Therefore, I assume that the
ADC just converts noise into samples and USRP2 sends those samples at
the rate specified by the sample rate when uhd_usrp_source is
initialized.

I have one question: is data communication between USRP2 and
uhd_usrp_source “polling” or “pushing”? I thought it was “polling”
because only a UDP socket client exists in uhd_usrp_source. In this
case, data recv is triggered by the scheduler’s work function. Usually
noutput_items varies from time to time. If so, how can USRP2 send
samples at the constant rate of the specified samp_rate (as I
observed)?

If it is “pushing” (meaning that USRP2’s firmware initiates the data
sending to the host), it looks like USRP2 even sends samples of noise
at a constant rate. But if so, would such samples fill the kernel
socket buffer (whose size is determined by “sudo sysctl -w
net.core.rmem_max=”)? Newer packets would then get dropped.

Josh,

As predicted, changing from “RECV_MODE_FULL_BUFF” to
“RECV_MODE_ONE_PACKET” followed by “RECV_MODE_FULL_BUFF” reduced the
latency significantly. My ping RTT was >17 ms on average; now it is
10.5 ms. This is GREAT :-)

Nonetheless, 10.5 ms still includes quite some overhead, because the
communication time for my ping message is less than 3.9 ms in total.
Over 6 ms of delay still exists somewhere.

In your description below, “for the source, the work function waits for
the entire RX buffer” (now it is one UDP packet of samples first and
then the entire RX buffer), do you mean the buffer size determined by
“noutput_items”? Likewise, is the entire TX buffer at the sink
determined by “ninput_items”?

One UDP packet of samples is only 1472/4 = 368 samples, or
1472/4/2/8 = 23 B of user data (given BPSK and 2 SPS). For a ping ICMP
message (42 B, which is tiny), plus 19 B of frame overhead, it still
must “wait for the entire RX buffer”, which may possibly be 10s of
thousands of samples.

Is there some way that we can further optimize the behavior? For
example, can we limit the size of noutput_items? Is this purely the
GNU Radio scheduler’s job?

One previous question I had is this: why is recv_buff_size needed for
benchmark_tx.py, which involves only gr_uhd_usrp_sink.cc? I thought
only send_buff_size is needed.

Andrew

On 03/01/2011 04:39 PM, Feng Andrew Ge wrote:

Josh,

Your explanation makes sense. Is there a quick fix for me to bypass
this problem temporarily while you are working on the modification?

http://gnuradio.org/cgit/jblum.git/commit/?id=75538e12300cb0d593792a986841ba2df9997c54

:-)

Further, following your current bandwidth optimization implementation,
is the code trying to fill the buffer in both uhd_single_usrp_sink
(sending buffer) and uhd_single_usrp_source (receiving buffer)?

For the source, the work function waits for the entire RX buffer to be
filled with samples. I strongly believe this is the cause of the
latency.

For the sink, the work function sends the entire buffer. This is correct
behavior for optimum latency and bandwidth.

When I started uhd_benchmark_tx.py, it also asked for a recv_buff_size
specification; where is it used?

http://www.ettus.com/uhd_docs/manual/html/transport.html#resize-socket-buffers

-Josh

On 03/02/2011 07:37 AM, Feng Andrew Ge wrote:

Josh,

As predicted, changing from “RECV_MODE_FULL_BUFF” to
“RECV_MODE_ONE_PACKET” followed by “RECV_MODE_FULL_BUFF” reduced the
latency significantly. My ping RTT was >17 ms on average; now it is
10.5 ms. This is GREAT :-)

glad to hear improvement!

Nonetheless, 10.5 ms still includes quite some overhead, because the
communication time for my ping message is less than 3.9 ms in total.
Over 6 ms of delay still exists somewhere.

True. Can you run the tunnel application on the same gnuradio install
with the raw ethernet driver? That way we can isolate the problem. It
would be good to know whether there was a regression in the scheduler,
or whether the issue is inside gr-uhd.

In your description below, “for the source, the work function waits for
the entire RX buffer” (now it is one UDP packet of samples first and
then the entire RX buffer),

So, it’s not the entire RX buffer on the “gr_uhd_source_latency_work”
branch. The second call to recv() is non-blocking, so it only handles
data that is already available, without delay. It should return less
than noutput_items. You can confirm this by printing (num_samps <
noutput_items) before work() returns.

do you mean the buffer size determined by

“noutput_items”? Likewise, is the entire TX buffer at the sink
determined by “ninput_items”?

yes (although it’s still called noutput_items thanks to copy/paste)

One UDP packet of samples is only 1472/4 = 368 samples, or
1472/4/2/8 = 23 B of user data (given BPSK and 2 SPS). For a ping ICMP
message (42 B, which is tiny), plus 19 B of frame overhead, it still
must “wait for the entire RX buffer”, which may possibly be 10s of
thousands of samples.

Is there some way that we can further optimize the behavior? For
example, can we limit the size of noutput_items? Is this purely the
GNU Radio scheduler’s job?

We can return less than noutput_items, but not zero, which is what the
changes on that branch attempt to do.

One previous question I had is this: why is recv_buff_size needed for
benchmark_tx.py, which involves only gr_uhd_usrp_sink.cc? I thought
only send_buff_size is needed.

It is not needed; whoever made that application should not have put
those options in. UHD will choose the best settings automatically.

-josh

Josh,

Before I test it with the raw ethernet driver, would you tell me what
the difference is in gr_uhd_usrp_sink between a blocking send and a
non-blocking send? Since the data to send is already in the buffer and
it is sent over UDP, I don’t see any difference. But perhaps you can
point out a difference from a time-delay perspective.

Andrew

Another question is how “UUU” gets generated; I still get many of
them. Further, “UUU” makes sense when GNU Radio works with an audio
device; however, I am not sure about its use when GNU Radio sends
non-continuous data to the USRP. When this “UUU” happens, is it
possible that some samples get held somewhere or even get dropped? I
assume not, but I’d like your confirmation.

Andrew