How to utilize a multi-threaded processor

aris · August 27, 2012, 2:01pm

Hi there,

I am currently working on an OFDM transceiver project based on RawOFDM. We
want to implement transmit/receive at 20 MHz bandwidth, but the RawOFDM code
seems to support only narrowband operation (<1 MHz). Once I set the sample
rate above 1 MHz, my program stalls with overrun messages (more details here:
[Discuss-gnuradio] Support wideband(20MHz) OFDM transmitting/receiving u).
I think the reason is that at a 20 MHz sample rate the USRP produces more
data than the PC can process, which exhausts the PC’s computation power.
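To put rough numbers on that suspicion, here is a small back-of-the-envelope sketch. It assumes samples are handled on the host as complex floats (8 bytes each) and that the sample rate equals the occupied bandwidth; the helper name `stream_budget` is made up for illustration and is not part of GNU Radio or RawOFDM.

```python
# Rough per-sample budget for a real-time receiver (illustrative numbers only).

def stream_budget(sample_rate_hz, bytes_per_sample):
    """Return (bytes per second, nanoseconds available per sample)."""
    bytes_per_second = sample_rate_hz * bytes_per_sample
    ns_per_sample = 1e9 / sample_rate_hz
    return bytes_per_second, ns_per_sample

# Assumption: complex float samples on the host (2 x 32-bit float = 8 bytes).
BYTES_PER_COMPLEX_FLOAT = 8

for rate in (1e6, 20e6):
    bps, ns = stream_budget(rate, BYTES_PER_COMPLEX_FLOAT)
    print(f"{rate / 1e6:.0f} MS/s: {bps / 1e6:.0f} MB/s, {ns:.0f} ns per sample")
```

Going from 1 MS/s to 20 MS/s shrinks the time available per sample by a factor of 20, so any block that already needs a sizeable share of its budget at 1 MS/s cannot keep up at the higher rate.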

To boost the speed, I have two questions:

1. My CPU has 8 threads (4 cores). Can I manually dedicate one thread to
   each gr block and turn the flowgraph into a pipelined system? Tom
   mentioned that GNU Radio uses a “thread-per-block” scheduler
   (Re: [Discuss-gnuradio] GNURadio and multi core processors),
   but in my case only two threads are 100% occupied when I run the
   program.
2. Inside some blocks we make extensive use of vector multiplications
   (e.g., precoding, CFO compensation). I’ve heard that SSE can speed up
   vector multiplication. How can I use this technology in my program?

Best regards,

Qing_Y · August 28, 2012, 2:23pm

On Mon, Aug 27, 2012 at 7:07 AM, Qing Y. [email protected] wrote:

Best regards,

Yang, Qing
Information Engineering, CUHK

Qing,

Yes, the default scheduler is thread-per-block, so each block
runs in its own thread, and the OS distributes those threads across
the CPU cores. What you are seeing is probably two blocks in
particular taking a long time to process and starving the others,
so setting CPU affinity won’t help you. From your other posts, it looks
like you are trying to profile the code. That’s the better way to go:
figure out which blocks take the most time and try to optimize
them.
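A toy model of that point, for illustration only: with one thread per block and enough cores, the sustainable rate is set by the slowest block, no matter how the threads are pinned. The block names and per-sample costs below are hypothetical, not measurements, and the model assumes every block touches every sample (no decimation).

```python
# Toy model of a thread-per-block flowgraph: every block runs in its own
# thread, so the sustainable sample rate is limited by the slowest block.

# Hypothetical per-sample costs in nanoseconds (not measured values).
block_cost_ns = {
    "source": 20,
    "sync": 900,
    "fft": 150,
    "demod": 200,
}

def max_sample_rate(costs_ns):
    """Return (bottleneck block name, maximum sustainable samples per second)."""
    name = max(costs_ns, key=costs_ns.get)
    return name, 1e9 / costs_ns[name]

bottleneck, rate = max_sample_rate(block_cost_ns)
print(f"bottleneck: {bottleneck}, max rate about {rate / 1e6:.2f} MS/s")
```

In this model, moving threads between cores changes nothing; only making the bottleneck block cheaper raises the limit.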

Tom

Qing_Y · September 2, 2012, 11:23am

Hi Tom,

We are profiling our code on a Xeon W3530 (4 cores, 8 threads) with 12 GB
of memory and an N210, and have found some interesting issues.

The receiver works well at a 1 MHz sample rate, where the system monitor
shows each core 10%~20% occupied. Once we set the sample rate above 1 MHz
(say 2 MHz), the program stalls (no decoding output) and we see only one
core 100% occupied while the others are idle. Using KCachegrind, we see that
86% of the CPU time is spent in the function “raw_peak_detector_fb::work(…)”.
This function is used by the first module (synchronization) of RawOFDM, and I
think this is the module that chokes the system. My first step is to dig
into this module and try to make it faster.

In the ordinary case (1 MHz), both the transmitter and the receiver call the
function “gr_multiply_cc::work()” frequently, and its cost is quite high
(nearly 18% of the program). I think there are methods to speed up this
function, right? Perhaps the VOLK library will help; I will try it out.
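As a rough way to compare the two candidates, the profile shares can be plugged into Amdahl's law. This is only a sketch: it treats the 86% and 18% figures as fractions of a serial run time, which is a simplification for a multi-threaded flowgraph, and the two figures come from different runs. The 4x block speedup is a hypothetical number.

```python
def overall_speedup(fraction, block_speedup):
    """Amdahl's law: total speedup when `fraction` of the run time
    becomes `block_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / block_speedup)

# Fractions are the profile shares quoted above; the 4x factor is hypothetical.
profile = (
    ("raw_peak_detector_fb::work", 0.86),
    ("gr_multiply_cc::work", 0.18),
)

for name, fraction in profile:
    gain = overall_speedup(fraction, 4.0)
    bound = 1.0 / (1.0 - fraction)  # limit as the block cost goes to zero
    print(f"{name}: 4x faster block -> {gain:.2f}x overall, upper bound {bound:.2f}x")
```

Under these assumptions, even a free complex multiply could not speed up the program by more than about a fifth, while the peak detector leaves far more room.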

Sincerely,

Yang, Qing
Information Engineering, CUHK

2012/8/28 Tom R. [email protected]

Qing_Y · September 2, 2012, 11:13pm

On Sun, Sep 2, 2012 at 5:22 AM, Qing Y. [email protected] wrote:

the first module (synchronization) of RawOFDM, I think this is the module
that choke the system. My first step is to dig into this module and try to
make it faster.

Qing,
Sounds like you’re on the right track: identify the low-performing blocks
and optimize them.

In the ordinary case (1MHz) both the transmitter and receiver call the
function “gr_multiply_cc::work()” frequently, and its cost is quite high
(nearly 18% of the program). I think there are methods to boost this
function, right? Perhaps the VOLK lib will help, I will try it out.

In the current release (since 3.6.0, if I recall), gr_multiply_cc
uses VOLK. So make sure that you’ve run volk_profile on
your machine to select the best version of the kernel to use at
runtime. As it is, you’re probably not going to do any better
than this for the performance of a complex multiply. It’s likely that the
blocks giving you specific problems are those running at the highest
sampling rate. You might think about how to re-engineer the system to
avoid doing the multiply at that rate, or to fold the multiply into
another block’s work function, rather than trying to optimize this
particular block.
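A minimal sketch of what folding the multiply into another block could look like, written in plain Python to show the structure rather than the speed. The function names, the per-sample gains, and the CFO step are all hypothetical; in a real block the fused loop would live in the block's C++ work function.

```python
import cmath

def two_pass(samples, gains, cfo_step):
    """Separate multiply block followed by a separate CFO-correction block."""
    scaled = [s * g for s, g in zip(samples, gains)]  # first block's output buffer
    return [x * cmath.exp(-1j * cfo_step * n) for n, x in enumerate(scaled)]

def fused(samples, gains, cfo_step):
    """Same arithmetic in a single pass, with no intermediate buffer."""
    return [s * g * cmath.exp(-1j * cfo_step * n)
            for n, (s, g) in enumerate(zip(samples, gains))]

# Hypothetical test data.
samples = [complex(1, 1), complex(0, 2), complex(-1, 0.5), complex(3, -1)]
gains = [complex(0.5, 0), complex(0, 1), complex(1, 1), complex(2, 0)]

a = two_pass(samples, gains, 0.01)
b = fused(samples, gains, 0.01)
print("max difference:", max(abs(x - y) for x, y in zip(a, b)))
```

The fused version produces the same samples while reading and writing the stream once instead of twice, which is the kind of saving that matters most for blocks running at the full sample rate.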

Tom