Threaded model for USRP2 source

dgalindo · July 23, 2009, 5:17pm

In my messing around with a custom USRP2 source block (since I’m trying to
get time-aligned samples out from multiple USRP2s), I’ve run into what is
probably the main issue with getting full-bandwidth samples (-d 4 =>
25MS/s) out of multiple USRP2s: dropped sample frames at the host PC. While
I’m (currently) content to work with lower sample rates to the PC,
eventually I’m going to want the full 25MS/s out of two (or more) USRP2s.
So, I’ve been giving some thought to getting that to work.

On my development machine I’ve got many cores (two quad-core Xeons), and
for the most part I’ve got plenty of CPU left: the thread that’s handling
samples from the USRP2s is using 100% of one of the cores, and any other
blocks I may be using are using some percentage of the other cores. My main
limit is the single thread handling the USRP2 source block.

So, the apparently obvious solution is to split the source block into
multiple threads, right? It seems to me that the obvious thing to do is
have separate threads doing the work of the rx_handler code, which just
constantly check if samples are available and put them in a
queue/vector/something (perhaps a deque would be best, since I can add to
the end and pop off the front…). The main thread of the block then just
polls the queue/etc. for incoming samples, rather than calling rx_handler
only when the scheduler calls the work() function.
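A minimal sketch of that hand-off, assuming one service thread per USRP2 feeding a bounded queue that work() drains without blocking. It uses the C++ standard library's std::mutex and std::deque rather than the Boost equivalents. Frame, FrameQueue, service_loop, and the receive callback are hypothetical names for illustration; none of them are part of libusrp2.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical frame type: a timestamp plus interleaved 16-bit I/Q samples.
struct Frame {
    std::uint32_t timestamp;
    std::vector<std::int16_t> iq;
};

// Bounded FIFO shared between one service thread and the work() thread.
class FrameQueue {
public:
    explicit FrameQueue(std::size_t capacity) : capacity_(capacity) {}

    // Called by the service thread. Returns false (frame dropped) when the
    // queue is full, so a slow consumer never stalls reception.
    bool push(Frame f) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (frames_.size() >= capacity_) return false;
        frames_.push_back(std::move(f));
        return true;
    }

    // Non-blocking pop for work(): returns false if nothing is queued.
    bool try_pop(Frame& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (frames_.empty()) return false;
        out = std::move(frames_.front());
        frames_.pop_front();
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return frames_.size();
    }

private:
    const std::size_t capacity_;
    mutable std::mutex mutex_;
    std::deque<Frame> frames_;
};

// Body of a service thread. `receive` stands in for the real driver call:
// it fills the frame and returns false once the stream ends.
// Returns the number of frames dropped because the queue was full.
std::size_t service_loop(const std::function<bool(Frame&)>& receive,
                         FrameQueue& queue) {
    std::size_t dropped = 0;
    Frame f;
    while (receive(f)) {
        if (!queue.push(std::move(f))) ++dropped;
        f = Frame{};  // reset the moved-from frame before reuse
    }
    return dropped;
}
```

Each USRP2 would get its own FrameQueue and its own thread running service_loop, while work() calls try_pop on every queue. Dropping on a full queue is only one possible policy; blocking the producer or overwriting the oldest frame are alternatives with different trade-offs.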
So, while this makes sense to me, I figured I’d check with the list to see
if anyone else has tried this - and if they have
suggestions/comments/pitfalls to avoid, etc. If not, I’m happy to plow
ahead and find out just how much trouble I’m in for here. The boost library
does seem to have nice functions for handling threads, so I’ve started
messing around with them, but this is a fairly new area for me. Comments
welcome!
Doug

dgalindo · July 23, 2009, 7:04pm

On Thu, Jul 23, 2009 at 08:13, Douglas Geiger <[email protected]> wrote:

> It seems to me that the
> obvious thing to do is have separate threads doing the work of the
> rx_handler code - which just constantly check if samples are available, and
> put them in a queue/vector/something (perhaps a deque would be best - since
> I can add to the end, and pop off the front…), and the main thread of the
> block then just polls the queue/etc. for incoming samples, rather than
> calling rx_handler only when the scheduler calls the work() function.

In your situation, you are using two USRP2s on two different GbE
ports, and trying to time align samples for your output, correct?

You could create two (more won’t help) separate service threads, each
calling into libusrp2 to receive frames and metadata, and have
rx_samples copy them into whatever synchronized data structure you
need. Your block main thread that calls work() can read from your
data structure (again, using synchronization primitives), and copy the
time aligned output samples to the block output buffers. (You’ll need
to decide how to deal with missing frames from either USRP2, but
that’s another conversation.)
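One way the alignment step could look, as a sketch only. It assumes each frame carries a timestamp from a clock common to both USRP2s, that matching frames have exactly equal timestamps, and that timestamp wraparound can be ignored. The policy shown (discard any frame whose partner is missing) is just one answer to the missing-frame question. Frame, AlignedPair, and align_frames are hypothetical names, not libusrp2 API, and the caller is expected to hold whatever lock protects the two queues.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical frame type: a timestamp plus interleaved 16-bit I/Q samples.
struct Frame {
    std::uint32_t timestamp;
    std::vector<std::int16_t> iq;
};

// One frame from each device, both with the same timestamp.
struct AlignedPair {
    Frame a;
    Frame b;
};

// Consumes frames from the fronts of `a` and `b`, pairing those with equal
// timestamps. A frame older than the other queue's front has no partner and
// is discarded; `discarded` is incremented for each such frame. Frames left
// in one queue when the other runs empty stay queued for the next call.
std::vector<AlignedPair> align_frames(std::deque<Frame>& a,
                                      std::deque<Frame>& b,
                                      std::size_t& discarded) {
    std::vector<AlignedPair> out;
    while (!a.empty() && !b.empty()) {
        const std::uint32_t ta = a.front().timestamp;
        const std::uint32_t tb = b.front().timestamp;
        if (ta == tb) {
            out.push_back(AlignedPair{std::move(a.front()),
                                      std::move(b.front())});
            a.pop_front();
            b.pop_front();
        } else if (ta < tb) {
            a.pop_front();  // partner of a's front was never received
            ++discarded;
        } else {
            b.pop_front();  // partner of b's front was never received
            ++discarded;
        }
    }
    return out;
}
```

work() would then copy the samples of each AlignedPair into the two output buffers. Filling the gap with zeros instead of discarding would keep the streams continuous at the cost of inserting fake samples.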

libusrp2 will have a user space thread per USRP2, and your block will
have three threads, so this is five in total. You may need to
experiment with thread placement so the right threads share a
processor/cache to avoid trips to main memory.

Remember, you are trying to move 200 Mbytes/sec around multiple times,
and eventually do math on them - it’s not a trivial task.
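For reference, the 200 Mbytes/sec figure can be reproduced with a few lines of arithmetic. The sketch assumes each complex sample travels as a 16-bit I plus a 16-bit Q value (4 bytes) and counts sample payload only, not Ethernet or framing overhead. The function and constant names are made up for illustration.

```cpp
#include <cstdint>

// Sample payload rate in bytes per second, ignoring framing overhead.
constexpr std::uint64_t payload_bytes_per_second(
    std::uint64_t samples_per_second,
    std::uint64_t bytes_per_sample,
    std::uint64_t num_devices) {
    return samples_per_second * bytes_per_sample * num_devices;
}

constexpr std::uint64_t kSampleRate = 25000000;  // 25 MS/s per USRP2 (-d 4)
constexpr std::uint64_t kBytesPerSample = 4;     // assumed: 16-bit I + 16-bit Q
constexpr std::uint64_t kNumDevices = 2;         // two USRP2s

// 25e6 samples/s * 4 bytes * 2 devices = 200e6 bytes/s.
static_assert(payload_bytes_per_second(kSampleRate, kBytesPerSample,
                                       kNumDevices) == 200000000,
              "two full-rate USRP2 streams carry 200 Mbytes/sec of samples");
```

Every extra copy between the driver, the queues, and the block output buffers moves that same volume again, which is why the number of copies matters.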

Johnathan

dgalindo · July 23, 2009, 7:56pm

John,

On Thu, Jul 23, 2009 at 1:00 PM, Johnathan C. <[email protected]> wrote:

> In your situation, you are using two USRP2s on two different GbE
> ports, and trying to time align samples for your output, correct?

Yes, one GbE per USRP2 (two for the moment, but I expect to be able to
experiment with more soon).

> You could create two (more won’t help) separate service threads, each
> calling into libusrp2 to receive frames and metadata, and have
> rx_samples copy them into whatever synchronized data structure you
> need. Your block main thread that calls work() can read from your
> data structure (again, using synchronization primitives), and copy the
> time aligned output samples to the block output buffers. (You’ll need
> to decide how to deal with missing frames from either USRP2, but
> that’s another conversation.)

Right, ok, this sounds like what I should be aiming for. Currently I just
have a single main service thread calling libusrp2 twice: once for each
USRP2 I want to talk to. Currently my alignment code ends up dropping
samples from one USRP2 if the other had missing frames.

> libusrp2 will have a user space thread per USRP2, and your block will
> have three threads, so this is five in total. You may need to
> experiment with thread placement so the right threads share a
> processor/cache to avoid trips to main memory.

Right - and this is the user space thread that talks to the kernel
ring_buffer, right?
Re: thread placement - are you referring to doing something to ‘pin’ threads
to a certain core? E.g. numactl?

> Remember, you are trying to move 200 Mbytes/sec around multiple times,
> and eventually do math on them - it’s not a trivial task.
>
> Johnathan

Yes, the enormous amount of information I would like to process in real-time
is daunting. I think eventually I would like to move some of it into the
FPGA, but for the time being I find it much easier to experiment in the
world of C++ (I haven’t yet become fluent in Verilog).

I think the main issue I’d like to remedy is this: I know my machine can
handle two simultaneous streams of 25MS/s (i.e. if I spawn two separate
processes, each talking to a separate USRP2), but my time-aligned source
block is currently a bottleneck.
Thanks,
Doug

dgalindo · July 23, 2009, 8:19pm

On Thu, Jul 23, 2009 at 10:51, Douglas Geiger <[email protected]> wrote:

> Right, ok, this sounds like what I should be aiming for. Currently I just
> have a single main service thread calling libusrp2 twice: once for each
> USRP2 I want to talk to.

This is bad. If the first USRP2 has no frames available, your thread
will block, even if the second USRP2 does. Also, while the thread is
servicing one USRP2, the other is being ignored.

> Right - and this is the user space thread that talks to the kernel
> ring_buffer, right?

Yes. FYI, libusrp2 will change in 3.3 (well, it will become libvrt)
and use UDP sockets. This will ultimately allow the kernel to do the
demultiplexing of control and data frames (separate UDP ports), and
we’ll be able to eliminate the service thread and one copy operation.

> Re: thread placement - are you referring to doing something to ‘pin’ threads
> to a certain core? E.g. numactl?

Yes.
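As an illustration of pinning from inside the program rather than through an external tool such as numactl, here is a Linux-only sketch built on sched_setaffinity(), which applies to the calling thread when given a pid of 0. The helper names are hypothetical, and which threads should share a core or cache is something to find by experiment.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* (Linux/glibc)

// Returns the lowest-numbered CPU the calling thread is currently allowed
// to run on, or -1 if the affinity mask cannot be read.
int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) return cpu;
    }
    return -1;
}

// Restricts the calling thread to a single CPU. Returns true on success.
bool pin_calling_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return false;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Each service thread would call pin_calling_thread() once at startup with the core chosen for it, for example a core that shares a cache with the thread consuming its frames.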

> Yes, the enormous amount of information I would like to process in real-time
> is daunting. I think eventually I would like to move some of it into the
> FPGA, but for the time being I find it much easier to experiment in the
> world of C++ (I haven’t yet become fluent in Verilog).

Well, some could happen there, but if you need to manipulate samples
from multiple USRPs, as in MIMO, you’re stuck doing it on the host.

> I think the main issue I’d like to remedy is this: I know my machine can
> handle two simultaneous streams of 25MS/s (i.e. if I spawn two separate
> processes, each talking to a separate USRP2), but my time-aligned source
> block is currently a bottleneck.

It’s the sequential calls into libusrp2 that are the problem.

Johnathan

dgalindo · July 24, 2009, 9:46pm

Johnathan,

On Thu, Jul 23, 2009 at 2:10 PM, Johnathan C. <[email protected]> wrote:

> This is bad. If the first USRP2 has no frames available, your thread
> will block, even if the second USRP2 does. Also, while the thread is
> servicing one USRP2, the other is being ignored.
>
> Johnathan

Thanks - I think a service thread per USRP2, plus a main thread to align
the samples, will work well: I have an initial version now, and it is
already working much better than the single thread (reading from libusrp2
for each USRP2 sequentially).