Parallel programming

dubstep · January 10, 2011, 4:53pm

Hello All,

I’ve been writing my own signal processing blocks and I noticed that
gnuradio only uses one of my cores.

I’m not sure if it is using just one core for my blocks or for all
processing.

Is gnuradio written to take advantage of multicore processing?

I have been writing my blocks in generic C++ code, but now I am looking to
write my blocks using multithreading/multicore processing. However, I am
new to this and would like some advice on how to approach it.

I have an Intel 8-core Xeon in my PC (I don’t know the exact model, but I
believe the clock rate is around 2.8 GHz). What libraries should I use?
I have been looking into Intel Threading Building Blocks, but I am
wondering what people mainly use for gnuradio.

Please let me know.

Thanks

–
View this message in context:
http://old.nabble.com/Parallel-programming-tp30634902p30634902.html
Sent from the GnuRadio mailing list archive at Nabble.com.

sirjanselot · January 10, 2011, 9:12pm

How do I know that my flow-graph is executing in thread per block mode?

As far as I can tell, only 1 core out of the 8 is being used when I run my
flow-graphs. This is what I see when I run the performance monitor (or
whatever it is called) in Ubuntu.

I am currently using gnuradio 3.3.0.

So can I parallelize my block without having to create a meta-block as you
say? I have a lot of for-loops and vector calculations that need to be
optimized (adaptive FIR filters).

Michael D.-3 wrote:


sirjanselot · January 10, 2011, 6:37pm

Assuming you’re using a reasonably recent GIT checkout, then your
flow-graph should be executing in “thread per block” mode by default –
each block you create in your flow-graph will reside in its own unique
thread. You can manually override this setting to be in “single
threaded scheduler” mode instead, where all blocks are executed within a
single thread in a round-robin fashion (roughly; no need for its
complexities here). Those are your 2 choices when using GNU Radio
(without rewriting the scheduler yourself). IIRC the latter (STS) is
being deprecated “real soon now” – someone please correct me if I’m
remembering incorrectly.

Generally, you shouldn’t need to further parallelize beyond what’s
already provided. A specific case where one would want to add another
thread is when data must be transferred non-synchronously (e.g., async
or isync) – for example, the native USB driver for Mac OS X spawns a
new thread to handle the OS-interface part. Otherwise, you can probably
find a clever way to create a “meta-block” that encloses a number of
actual blocks, and then let the “thread per block” scheduler handle
them. GNU Radio uses Intel’s TBB already, so if you feel for some
reason that your particular block(s) need more parallelization, then
that’s probably the best way to go.

My US$0.02, for what it’s worth. - MLD

sirjanselot · January 10, 2011, 9:43pm

On 01/10/2011 03:11 PM, sirjanselot wrote:

optimized (adaptive fir filters).

By default, each block in a flow-graph is assigned its own thread. It’s up
to the kernel to schedule these as it sees fit.

Getting parallelism inside your own custom block is something you’ll
have to deal with yourself.

I’ve done experiments with using the multi-threaded FFTW libraries for
very large transforms, which gives parallelism inside the FFTW library
itself. This works, and doesn’t appear to adversely affect GNU Radio
thread-per-block scheduling.

The thread-per-block scheduler is the default behaviour, so the reason
you may only be seeing one core in use is just
due to the dynamic behaviour of your flowgraph.

–
Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

sirjanselot · January 10, 2011, 9:59pm

All,

I am testing 2x USRP1 with RFX2400 daughter cards with benchmark_rx and
benchmark_tx scripts and have questions.
When I set the MCS to dbpsk, the error rate I got was close to zero.
When I set it to dqpsk or d8psk, the error rate is 100%.
When I set it to gmsk, certain daughter cards had close to 0% error rate,
but others swung from 0% to 100%.

I asked Ettus Research about it, and they told me that it’s likely to be
due to a frequency offset between the two boxes I have.
If that is true:

How can I compensate for the offset?
Has anyone used dqpsk or d8psk in the 2.4 GHz ISM band before? If so, what
extra step is necessary?

Thanks,

Thomas

sirjanselot · January 10, 2011, 9:46pm

Without seeing your GRC implementation or Python script and your block’s
implementation code, mostly what I or anyone else can provide is general
advice. GNU Radio 3.3.0 uses the thread per block (TPB) scheduler by
default; if you’re not doing anything else except running the flow-graph
(meaning: you don’t set special GNU Radio environment variables or use a
GNU Radio configuration file), then that’s what you’re using.

The performance of any flow-graph really depends on how complex the
flow-graph is, how much data you’re trying to push through it, and how
fast your processors are able to perform the blocks’ computations. The
host OS influences execution speed a little, but mostly it’s those listed
factors that make the difference; that said, I haven’t used GNU Radio on
Ubuntu in a long time so I cannot talk about that OS specifically
(Linux, in general, provides very low OS overhead and more time executing
the flow-graph’s computations). It might be that your flow-graph is
running fast enough already to need just 1 core; does it run in “real
time” for what you need? Rewriting a given block to use vector-based
instructions (SSE, Altivec, Neon) often dramatically increases the
computations per unit time for that block.

As for parallelizing your block, without knowing exactly what it is or
does, I would always advise you to break down the computations into
smaller pieces and then implement those as blocks (if they are not
already), then create the “meta-block” (I forget the exact name of it
now; maybe “hier_block2”?) using those. That way, the TPB scheduler will
have more to work with and the flow-graph will end up being executed more
in parallel. If your block has internal data-feedback, then the
meta-block will not work (GNU Radio doesn’t “do” data-flow feedback in
the flow-graph) and you’ll have to find some other way of parallelizing
your algorithm. There are plenty of good books on this subject. - MLD

sirjanselot · January 10, 2011, 10:18pm

On Mon, 2011-01-10 at 15:55 -0500, Thomas H Kim wrote:

be due to frequency offset between the two boxes I have.
If it is true,

how can I compensate the offset?

Make a simple little FFT sink in GRC and use it on one of the USRPs to
determine the received signal offset from the other USRP while it is
transmitting. Or receive a signal from a signal generator of known
frequency and note the offset for both USRPs. Or transmit a signal from
each USRP and receive it using a different receiver and note the
difference between the frequencies of the received signals. Or vary the
frequency of either benchmark_rx or benchmark_tx via trial and error
until you get proper transmission/reception. You will probably find less
than 20kHz of offset at 2.4GHz.
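
If you capture samples of the other USRP's carrier rather than reading the offset off an FFT display, the same measurement can be sketched in a few lines. This is only an illustration and swaps in a phase-difference estimator in place of an FFT; the function name, the 250 kS/s sample rate and the 12 kHz test tone are assumptions for the example, not values from benchmark_rx or benchmark_tx.

```python
import cmath
import math


def estimate_offset_hz(samples, sample_rate):
    """Estimate the frequency of a single complex tone, in Hz.

    Averages the phase advance between consecutive samples. Only valid
    for offsets between -sample_rate/2 and +sample_rate/2.
    """
    acc = 0j
    for prev, cur in zip(samples, samples[1:]):
        acc += cur * prev.conjugate()
    return cmath.phase(acc) * sample_rate / (2 * math.pi)


# Synthetic test tone standing in for the received carrier (assumed values).
SAMPLE_RATE = 250e3
TRUE_OFFSET = 12e3
tone = [cmath.exp(2j * math.pi * TRUE_OFFSET * n / SAMPLE_RATE)
        for n in range(1000)]
estimate = estimate_offset_hz(tone, SAMPLE_RATE)
print(round(estimate))
```

With a noisy or modulated signal the estimate will be rougher, so averaging over more samples (or sticking with the FFT display) is the safer route.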

–n

sirjanselot · January 10, 2011, 10:55pm

On Mon, Jan 10, 2011 at 3:11 PM, sirjanselot [email protected]
wrote:

How do I know that my flow-graph is executing in thread per block mode?

You can run ‘top’ while executing your flow graph and then toggle
threads on (type capital ‘H’). Each thread will be displayed on its
own. With a Python-based GNU Radio application, all you will see is
python; when you toggle threads on, you should see python listed
multiple times; one for each thread.

‘man top’ will tell you how to get a lot out of that program.

Tom

sirjanselot · January 10, 2011, 11:29pm

This topic comes up repeatedly on this list. When you have a radio
“tuned” to a specific frequency, there is nearly always a certain amount
of residual frequency error. Synthesized LOs (local oscillators) have a
frequency offset that is proportional to the PPM tolerance of the
reference oscillator that they’re using.

Let’s say that the oscillator has a 10 PPM tolerance (which is typical; I
don’t know, off the top of my head, what the PPM spec is for the XO in
the USRP1). So, that’s 10 Hz for every MHz of crystal frequency. That
error “carries through” the PLL synthesizer. In this case, we have an LO
frequency of 2.4 GHz, so that’s 2.4 GHz × 10 PPM, which gives us a
potential frequency offset error of 24 kHz. Let’s be charitable and
assume that means anywhere within +/- 12 kHz.
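
The arithmetic above, written out as a small sketch. The function name is made up for illustration, and the 10 PPM figure is the assumed tolerance from the paragraph above, not a USRP1 spec.

```python
def ppm_to_hz(carrier_hz, ppm):
    """Worst-case frequency error in Hz for a tolerance given in PPM."""
    return carrier_hz * ppm * 1e-6


error_hz = ppm_to_hz(2.4e9, 10)   # 10 PPM at a 2.4 GHz LO
print(f"{error_hz / 1e3:.1f} kHz")  # prints 24.0 kHz
```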

For wideband modulations (higher data rates), a small frequency offset
is generally mostly harmless, since most of the modulation energy falls
within your passband, even if the edges of your passband don’t fall where
you think they fall. But for narrowband modulation schemes (low data
rates, or narrow OFDM buckets), frequency offset can be disastrous, since
an offset in the band-edge causes most of the energy to fall outside of
your passband.

In “real” data receivers, particularly narrowband ones, there’s
generally a feedback mechanism that tugs on the LO circuit
a little bit to try and zero-in on the correct frequency. In the
analog world, FM stereo receivers have a so-called AFC
circuit that uses noise estimates to steer the LO. Television does
the same thing (so-called AFT).

In the world of software-defined radio, it’s easy to forget about these
things, because, hey, it’s all digital, and therefore
perfect, right? Wrong.

So, an SDR-based analog data communications system needs to be able to
deal with this. There are a couple of ways of doing this:

  1) Lock your PLL synthesizer to a high-quality reference clock,
     usually improving the PPM error by an order of magnitude or more.
     On the USRP2/N2XX/E100 there are explicit inputs for an external
     10MHz reference clock, and UHD makes it easy to enable this feature
     when you create the UHD source/sink.

  2) Have your demodulator provide feedback to the frequency-setting
     code to tweak the actual LO frequency (or DDC frequency, which is
     usually faster). This is the most general approach, since it makes
     your code work well even on a platform that doesn’t have a
     high-quality external reference. Note this is on the receive side.
     No sense in tweaking the transmitter when you have potentially-many
     receivers. The conventional thing to do is to have the receiver
     track wherever the transmitter is right now.
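
A rough sketch of approach 2, assuming an unmodulated carrier and a software oscillator standing in for the DDC. A real demodulator would derive its error signal from the modulated data (a Costas loop, for example); the function name, loop gain and test values here are all made up for illustration.

```python
import cmath
import math


def track_offset(samples, sample_rate, gain=0.05):
    """First-order frequency-tracking loop.

    Returns the final estimate of the carrier offset in Hz.
    """
    f_hat = 0.0   # current offset estimate, Hz
    phase = 0.0   # phase of the software oscillator, radians
    prev = None   # previous corrected sample
    for x in samples:
        y = x * cmath.exp(-1j * phase)   # remove the estimated offset
        if prev is not None:
            # The phase advance left in the corrected signal is the
            # residual frequency error, in radians per sample.
            err = cmath.phase(y * prev.conjugate())
            f_hat += gain * err * sample_rate / (2 * math.pi)
        prev = y
        phase += 2 * math.pi * f_hat / sample_rate
    return f_hat


# Synthetic carrier that is 12 kHz off, sampled at 250 kS/s (assumed values).
SAMPLE_RATE = 250e3
carrier = [cmath.exp(2j * math.pi * 12e3 * n / SAMPLE_RATE)
           for n in range(2000)]
tracked = track_offset(carrier, SAMPLE_RATE)
print(round(tracked))
```

A smaller gain makes the loop slower but less jumpy when the input is noisy; the value used here is arbitrary.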

This class of problem is in no way unique to SDR hardware. Ham radio
operators on the list will tell you about many adventures
(particularly in the old days) of tweaking the LO performance on
their VHF receivers to allow “full quieting” reception of the local
FM repeater, and as I observed, FM radios and televisions have had
some kind of automatic-fine-tuning for many many decades.

–
Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

sirjanselot · January 10, 2011, 11:56pm

Thanks.

Yes, my block has internal data-feedback [using the signal processing
block’s output to calculate new FIR filter coefficients, a trait common
in adaptive filters]. It runs with 1 FIR filter pretty quickly on 1 core,
no problem, but once I start pushing it to 5 and up, my computer can’t
keep up. At around 2 or 3, the core working on it is really stressed.

I did notice that when I run example flow graphs, or when I create flow
graphs that don’t have any of my custom algorithms, it does really well
at dividing the tasks among separate cores.

Could you point me to a reference on this topic, please? I tried googling
“internal data feedback” and “data-flow feedback” with words like
parallel and c++, and I’m not getting good results.

Thanks.

sirjanselot · January 11, 2011, 12:41am

  2) Have your demodulator provide feedback to the
frequency-setting code to tweak the actual LO frequency (or DDC
frequency, which is usually faster). This is the most general
approach, since it makes your code work well even on a platform that
doesn’t have a high-quality external reference. Note this is on the
receive side. No sense in tweaking the transmitter when you have
potentially-many receivers. The conventional thing to do is to have
the receiver track wherever the transmitter is right now.

Two ideas:

  1) You might need to tweak your transmitter’s frequency in order to
     keep it within your transmission boundaries (e.g. your license), or
     to meet specs for interoperability. For example, when using an
     unmodified USRP as a GSM cell tower with the OpenBTS code, the
     transmitter was too far off frequency spec for some cellphones to
     interoperate with it.

  2) It seems to me we could minimize this problem by writing a small
     program that would tune an over-the-air frequency standard (like one
     of the WWV broadcasts) and compare it to the local oscillator. The
     resulting frequency offset could then be stored as a default setting
     for subsequent GNU Radio runs, so that e.g. if your program asked to
     tune to 250.000 MHz and the USRP’s LO was slow by 50 kHz (0.050 MHz)
     then internally it would know to tune to 250.050 which is probably
     closer to where the real signal will be. Of course the LO would
     shift slightly based on temperature, but if you measured and stored
     the value after warm-up, it would probably be relatively stable.

John

sirjanselot · January 11, 2011, 1:01am

You might need to tweak your transmitter’s frequency in order to
keep it within your transmission boundaries (e.g. your license), or to
meet specs for interoperability. For example, when using an
unmodified USRP as a GSM cell tower with the OpenBTS code, the
transmitter was too far off frequency spec for some cellphones to
interoperate with it.

An excellent point, to be sure. The transmitter carrier frequency
should be adjusted to be as close to spot-on as is practical, given test
equipment accuracy, etc. [Some frequency counters have woefully
inadequate clock crystals in them, so using them for fine-scale tweaking
is just asking for trouble].

But dynamic tweaking of the transmitter frequency often leads to
trouble in a multi-receiver scenario.

John

Probably reasonable as a first-order approach. That assumes that
frequency errors are more-or-less linear.
Further, you want to store the result as a PPM estimate, rather than
an absolute frequency offset. For some
cards, the difference won’t matter, but for something like the WBX,
with a very-wideband synthesizer, it
does matter.
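
A minimal sketch of storing the calibration as a PPM figure instead of an absolute offset, using John's numbers (a request for 250.000 MHz that lands 50 kHz low). The function names are hypothetical, and the model assumes the error scales linearly with frequency, as noted above.

```python
def ppm_error(requested_hz, actual_hz):
    """Signed LO error in PPM; negative means the LO runs slow."""
    return (actual_hz - requested_hz) / requested_hz * 1e6


def corrected_request(target_hz, ppm):
    """Frequency to ask for so that the LO actually lands on target_hz."""
    return target_hz / (1 + ppm * 1e-6)


# Calibration: asked for 250.000 MHz, measured 249.950 MHz (50 kHz slow).
stored_ppm = ppm_error(250.000e6, 249.950e6)

# The same stored value gives a different absolute correction per band.
for target in (250e6, 2.4e9):
    request = corrected_request(target, stored_ppm)
    print(f"target {target / 1e6:.3f} MHz -> request {request / 1e6:.3f} MHz")
```

At 2.4 GHz the same -200 PPM works out to roughly 480 kHz, which is why a stored 50 kHz absolute offset would not carry over to a wideband card.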

FM radio stations also tend to use very-high-quality LOs for their
transmitters, although the wideband nature of their
signal makes it somewhat awkward to do fine tweaking. Hmmm, I
wonder about the audio carrier of a TV signal, that
might also be reasonably stable.

Too bad the galaxy is in rotation, else you could use 1420.4058MHz and a
small dish as a super-accurate frequency standard. [Actually, if you
have precise notions of the rotation curve along your line of sight, you
can correct, but, I digress…]

–
Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

sirjanselot · January 11, 2011, 1:11am

On 1/10/2011 6:58 PM, Marcus D. Leech wrote:

FM radio stations also tend to use very-high-quality LOs for their
transmitters, although the wideband nature of their
signal makes it somewhat awkward to do fine tweaking. Hmmm, I wonder
about the audio carrier of a TV signal, that
might also be reasonably stable.

I haven’t used it, but kalibrate [1] seems to do what we’re all talking
about, using GSM base station clocks (that are required to have an
accuracy of 50 parts per billion). Maybe see what that code is all
about?

[1] http://thre.at/kalibrate

Patrick Yeon
ThinkRF
613-369-5104 x418

sirjanselot · January 11, 2011, 8:23pm

I don’t think there’s much specialized info “out there” on this topic,
since it’s relatively standard programming; experience and knowledge are
your best guides! That said, there are some excellent books on parallel
algorithms and programming out there. Some of my favorites include (in
no particular order):

“Task Scheduling for Parallel Systems” by Oliver Sinnen
“Parallel Algorithms” by Casanova, Legrand, and Robert
“The Art of Multiprocessor Programming” by Herlihy and Shavit
“Fundamentals of Sequential and Parallel Algorithms” by Berman and Paul
“Algorithms Sequential & Parallel, a Unified Approach” by Miller and
Boxer

I’m also reading through and playing with the OpenCL 1.1 spec and OSX 10.6
library right now. Pretty cool stuff, though the documentation is a bit
dry. I do wonder about the portability of the code; it is supposed to be
quite cross-OS compatible (Windows, Linux, OSX 10.6+, various other
UNIX-y flavors) but that can be quite challenging.

Depending on the type of feedback you require, you might be able to use
a standard flow-graph. There are 2 primary types of feedback in a
flow-graph system: data feedback and non-data feedback. One can
specialize the latter into control, block settings, and other types.

GNU Radio will not let you do data-feedback with data that is controlled
by the scheduler – meaning you cannot “connect” the data-flow output of
block (A) to the data-flow input of any block that is before block (A)
in the flow-graph (there are technical terms in Graph Theory for all of
this; hopefully what I wrote is clear enough). Although in theory one
-could- do in-flow-graph data-feedback, it is not done in practice
because this sort of feedback is generally done as a single item or just
a few items – which would be very inefficient in terms of overhead
versus actual number of computations.

On the other hand, if what your block is doing is feeding back just
filter coefficients (a type of non-data feedback), then you -might- be
able to use either direct setting, or even better a msg_queue, between
distinct blocks. I’ve never done either, but I don’t see why they
couldn’t be made to work (anyone?).

For the former, you’d pass your block a reference to the prior block’s
method for setting filter coefficients (or, maybe, just the block itself
depending on your programming style); for the latter, you’d pass in the
msg_queue to your block to use & in the prior block you’d need to add in
handling the msg_queue. You’d need to set default values somewhere
since your block won’t otherwise set those coefficients until after the
prior block has already processed some data. I think to make this case
work you’d just have to be careful that no matter how the coefficients
are set, that your block cannot generate coefficients that make the
flow-graph “unstable” (whatever that might mean in this case). I’d say
that if your block works right now then this is probably the case, but
it is possible to mess things up because there will be an added delay
due to letting GNU Radio handle moving data around. I’m sure someone
can apply some discrete-time control theory here.
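
To make the msg_queue idea a bit more concrete, here is a plain-Python sketch. It uses the standard library's queue.Queue as a stand-in for GNU Radio's msg_queue and does not use the GNU Radio API at all; the class name and its work() method are invented for illustration, and replacement tap sets are assumed to have the same length as the defaults.

```python
import queue
from collections import deque


class FirFilter:
    """Toy FIR filter whose taps can be replaced through a queue."""

    def __init__(self, default_taps):
        # Default taps are used until the first update arrives.
        self.taps = list(default_taps)
        self.updates = queue.Queue()
        self.history = deque([0.0] * len(self.taps), maxlen=len(self.taps))

    def work(self, samples):
        # Pick up pending coefficient updates without blocking;
        # if several are waiting, only the newest one is kept.
        try:
            while True:
                self.taps = list(self.updates.get_nowait())
        except queue.Empty:
            pass
        out = []
        for x in samples:
            self.history.appendleft(x)   # newest sample first
            out.append(sum(t * h for t, h in zip(self.taps, self.history)))
        return out


fir = FirFilter([1.0, 0.0])          # pass-through defaults
first = fir.work([1.0, 2.0])
fir.updates.put([0.0, 1.0])          # the "later" block posts new taps
second = fir.work([3.0, 4.0])        # now a one-sample delay
print(first, second)
```

Note that new taps only take effect at the start of the next work() call, which is the added delay mentioned above.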

If what you are concerned with is the number of filters you’re using
quickly saturating your CPU, you’ll probably want to look into the new
Volk library and/or maybe using a GPU, since both offer speed-ups
depending on just how much computation needs to be performed, and either
can be more efficient than using a generic FIR filter programmed for a
CPU.

Anyway, and again, without knowing what your block actually does or the
details of your flow-graph, that’s about as much as I can comment.