How to utilize a multi-threaded processor

Hi there,

I am currently working on an OFDM transceiver project based on RawOFDM. We
want to implement 20 MHz bandwidth transmit/receive, but the RawOFDM code
seems to support only narrow bandwidths (<1 MHz). Once I set the sample rate
above 1 MHz, my program blocks with overrun messages (more details here:
[Discuss-gnuradio] Support wideband(20MHz) OFDM transmitting/receiving u).
I think the reason is that at a 20 MHz sample rate the USRP produces more
data than the PC can process, exhausting its computational power.
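
For a rough sense of scale, the per-sample time budget at each rate can be worked out directly. This is only a back-of-the-envelope sketch; the 3 GHz clock is an assumed figure for illustration, not a measurement from this setup.

```python
# Rough per-sample processing budget at a given sample rate.
# ASSUMPTION: the 3 GHz clock is illustrative, not measured on this machine.

def per_sample_budget_ns(sample_rate_hz):
    """Average wall-clock time available per sample, in nanoseconds."""
    return 1e9 / sample_rate_hz

def cycles_per_sample(sample_rate_hz, cpu_clock_hz=3.0e9):
    """CPU cycles one core can spend on each sample at the given clock."""
    return cpu_clock_hz / sample_rate_hz

for rate in (1e6, 20e6):
    print(f"{rate / 1e6:.0f} MS/s: {per_sample_budget_ns(rate):.0f} ns, "
          f"{cycles_per_sample(rate):.0f} cycles per sample per core")
```

Going from 1 MHz to 20 MHz shrinks the budget twentyfold, so any block that is already close to its limit at 1 MHz cannot keep up.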

To boost the speed, I have two questions:

  1. My CPU has 8 threads (4 cores). Can I manually dedicate one thread to
    each gr block and make it a pipelined system? Tom mentioned that
    GNU Radio uses a “thread-per-block” scheduler (
    Re: [Discuss-gnuradio] GNURadio and multi core processors),
    but in my case only two threads are 100% occupied when I run the
    program.

  2. Inside some blocks we make extensive use of vector multiplications
    (e.g., precoding, CFO compensation). I’ve heard that SSE can speed up
    vector multiplication. How can I use it in my program?
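
For reference, the kind of vector multiplication meant in question 2 can be sketched as follows. CFO compensation multiplies each received sample by a counter-rotating phasor, and it is this per-sample complex multiply that SIMD instructions such as SSE are meant to speed up. The function name and parameters are hypothetical and are not taken from RawOFDM.

```python
import cmath

def compensate_cfo(samples, cfo_hz, sample_rate_hz):
    """Remove a carrier frequency offset by element-wise complex multiply.

    Hypothetical helper for illustration; not taken from RawOFDM.
    """
    step = -2.0 * cmath.pi * cfo_hz / sample_rate_hz  # phase step per sample
    return [x * cmath.exp(1j * step * n) for n, x in enumerate(samples)]

# A pure tone at a +1 kHz offset should collapse to a constant once compensated.
fs = 1e6
tone = [cmath.exp(2j * cmath.pi * 1e3 * n / fs) for n in range(8)]
fixed = compensate_cfo(tone, cfo_hz=1e3, sample_rate_hz=fs)
print(all(abs(y - 1.0) < 1e-9 for y in fixed))
```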

Best regards,

On Mon, Aug 27, 2012 at 7:07 AM, Qing Y. [email protected]
wrote:

Yang, Qing
Information Engineering, CUHK

Qing,

Yes, the default scheduler is the thread-per-block, so each block
operates in its own thread, and the OS will distribute those across
the CPUs. What you are seeing is probably that two blocks in
particular are taking a long time to process and starving the others.
So CPU affinity won’t help you. From your other posts, it looks like
you are trying to profile the code. That’s the better way to go;
figure out which blocks are taking the most time and try to optimize
them.
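
The starvation effect described here can be illustrated with a toy model. In a thread-per-block pipeline every sample passes through every block, so the sustainable rate is set by the slowest block, however many cores are free. The block names and per-sample costs below are invented for illustration and are not measurements of RawOFDM.

```python
# Toy model of a thread-per-block pipeline.
# ASSUMPTION: block names and per-sample costs (ns) are invented for illustration.

def max_sample_rate_hz(cost_ns_per_sample):
    """Highest rate the pipeline sustains: limited by the slowest block."""
    return 1e9 / max(cost_ns_per_sample.values())

def thread_utilization(cost_ns_per_sample, sample_rate_hz):
    """Fraction of one core each block's thread needs at the given rate."""
    return {name: cost * sample_rate_hz / 1e9
            for name, cost in cost_ns_per_sample.items()}

costs = {"source": 20.0, "sync": 800.0, "fft": 100.0, "demod": 150.0}
rate = max_sample_rate_hz(costs)
print(f"bottleneck: {max(costs, key=costs.get)}, max rate {rate / 1e6:.2f} MS/s")
for name, load in thread_utilization(costs, rate).items():
    print(f"{name}: {load:.0%} of one core")
```

In this model the bottleneck thread sits at 100% while the others are mostly idle, which is why pinning threads to cores does not raise the ceiling; only making the slow block cheaper does.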

Tom

Hi Tom,

We are profiling our code on a Xeon W3530 (4 cores, 8 threads) + 12 GB
memory + N210, and have found some interesting issues.

  1. The receiver works well at a 1 MHz sample rate; the system monitor
    shows each core 10%~20% occupied. Once we set the sample rate above
    1 MHz (say 2 MHz), the program blocks (no decoding output) and only one
    core is 100% occupied while the others are idle. Using KCachegrind, we
    see that 86% of the CPU time is spent in the function
    “raw_peak_detector_fb::work(…)”. This function is used by the first
    module (synchronization) of RawOFDM, so I think this is the module that
    chokes the system. My first step is to dig into this module and try to
    make it faster.

  2. In the ordinary case (1 MHz) both the transmitter and receiver call
    the function “gr_multiply_cc::work()” frequently, and its cost is quite
    high (nearly 18% of the program). I think there are ways to speed up
    this function, right? Perhaps the VOLK library will help; I will try it
    out.
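
The two profile shares above give a quick upper bound on what each optimization could buy, via Amdahl's law. This is a sketch only: it treats the profile as if it were single-threaded CPU time, which is a simplification for a multi-threaded flowgraph, and the speedup factors are hypothetical.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the run time
    is made `local_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

# Shares taken from the profiles above; the speedup factors are hypothetical.
for name, share in (("raw_peak_detector_fb::work", 0.86),
                    ("gr_multiply_cc::work", 0.18)):
    for k in (2.0, 4.0, float("inf")):
        print(f"{name} made {k}x faster -> {overall_speedup(share, k):.2f}x overall")
```

Even an infinitely fast multiply is capped at about 1.22x overall under this model, whereas the peak detector has far more headroom.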

Sincerely,

Yang, Qing
Information Engineering, CUHK

2012/8/28 Tom R. [email protected]

On Sun, Sep 2, 2012 at 5:22 AM, Qing Y. [email protected]
wrote:

Qing,
Sounds like you’re on the right track to identify the low-performing
blocks and optimize them.

  2. In the ordinary case (1MHz) both the transmitter and receiver call the
    function “gr_multiply_cc::work()” frequently, and its cost is quite high
    (nearly 18% of the program). I think there are methods to boost this
    function, right? Perhaps the VOLK lib will help, I will try it out.

In the current release (since 3.6.0, if I recall), the gr_multiply_cc
function uses VOLK. So make sure that you’ve run volk_profile on
your machine to select the best version of the kernel to use at
runtime. As it is, you’re probably not going to be doing any better
than this for performance of a complex multiply. It’s likely that the
blocks giving you specific problems are those running at the highest
sampling rate. You might think about how to re-engineer the system to
avoid doing this or to somehow wrap the multiply into another block’s
function as opposed to trying to optimize this particular block.
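
The idea of wrapping the multiply into another block's function can be sketched as below. With separate blocks, the multiply makes its own pass over the data and writes an intermediate buffer that the next block reads back; the fused version does the multiply inside the downstream loop. The downstream operation (a power computation) and all function names are hypothetical, chosen only to show the shape of the change.

```python
# ASSUMPTION: the "power" stage and all names here are hypothetical examples.

# Two separate "blocks": each makes a full pass, with a buffer in between.
def multiply_block(samples, coeffs):
    return [x * c for x, c in zip(samples, coeffs)]

def power_block(samples):
    return [abs(x) ** 2 for x in samples]

# Fused: the multiply happens inside the downstream block's own loop,
# so there is one pass and no intermediate buffer.
def fused_multiply_power(samples, coeffs):
    return [abs(x * c) ** 2 for x, c in zip(samples, coeffs)]

samples = [1 + 1j, 2 - 1j, -0.5 + 3j]
coeffs = [0.5j, 1 - 1j, 2 + 0j]
two_pass = power_block(multiply_block(samples, coeffs))
one_pass = fused_multiply_power(samples, coeffs)
print(two_pass == one_pass)
```

The results are identical; what changes is the number of passes over the data at the highest sample rate.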

Tom