Oprofile inband code results

Hey all,

I’m a bit new to profiling; I’ve used gprof in the past, but this is my
first time using oprofile. I’m running oprofile on the inband code to
try to better understand the m-block performance problem we are seeing.
Since the first short attempt at profiling that Eric and I made a while
ago showed significant time in the kernel, I’ve enabled full kernel
profiling.

I haven’t specified any particular oprofile options; I just started the
daemon, dumped before running the program, ran it for about a minute,
saved a session, then examined the single application.

I’ve hosted the results to spare the list some traffic:
http://www.andrew.cmu.edu/user/gnychis/inband_profile_4

So it looks like there is a lot of time spent scheduling the threads in
the kernel. We’re not concerned about optimizing the application
itself.

I can certainly run more tests if that would help make things clearer.

  • George

George N. wrote:

So it looks like there is a lot of time spent scheduling the threads in
the kernel. We’re not concerned about optimizing the application itself.

Not really.

Look at just the functions that take up more than 1% of CPU time (the
bottom 21 lines). These total about 70% of the overall time spent. Of
these (approximate numbers):

26% libm, libc or libstdc++
19% libpmt
12% test_usrp_inband_tx
5% libusrp-inband
5% vmlinux or usbcore
3% libmblock

So it doesn’t appear that whatever performance issue you are seeing is
directly related to CPU time spent in the kernel. I have some other
ideas for profiling, but I won’t have time until perhaps later today to
email them to you.


Johnathan C.
Corgan Enterprises LLC
http://corganenterprises.com

Johnathan C. wrote:

email to you.
Ah, you’re right. From what I can remember, it is spending
significantly less time in the kernel compared to the last time we
profiled it in June. I’m going to do a couple more runs on some of the
other inband applications as well and see what happens.

Thanks!
George

Here’s a new and important run. It seems like the application cannot
keep up (Eric knows what I’m talking about) because of the low
interpolation and decimation rates; the memory usage shoots through the
roof, and the system ends up locking up for a period of time:

http://www.andrew.cmu.edu/user/gnychis/inband_tx_rx_3

The PMT library is really sucking face with 23% of the time, surpassing
libm, libc and libstdc++, which are at 22%.

I can certainly think of ways of reducing the PMT usage… there are
numerous cases where I will parse a PMT list and then pass the list to
another method only to re-parse it, because that is more convenient than
passing tons of parameters. :-)
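
Roughly, the pattern looks like this (a sketch with made-up handler
names, using the free-function PMT calls that show up in the profile;
pmt_t is the usual boost::shared_ptr<pmt_base> typedef):

    void handle_tx(pmt_t data);     // hypothetical second handler

    void dispatch(pmt_t data)       // data is a PMT list
    {
      pmt_t invocation_handle = pmt_nth(0, data);  // parse it once here...
      pmt_t samples           = pmt_nth(1, data);
      handle_tx(data);              // ...but pass the whole list along anyway
    }

    void handle_tx(pmt_t data)
    {
      // ...and walk the same list again, just because it's convenient
      pmt_t invocation_handle = pmt_nth(0, data);
      pmt_t samples           = pmt_nth(1, data);
      // real work happens here
    }

Each pmt_nth walks the list from the head, so parsing the same message
twice roughly doubles the pmt_nth/pmt_nthcdr work per message.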

  • George

Hi!

I wonder what PMT is and what the benefit of PMT is? I am not familiar
with your current development, except that it will support MAC layers
and things like that. It seems very interesting, and it sounds like we
will certainly upgrade our system to your mblocks.

Greetings
Dominik

George N. wrote:

I can certainly think of ways of reducing the PMT usage… there are
numerous cases where I will parse a PMT list and then pass the list to
another method only to re-parse it, because that is more convenient than
passing tons of parameters. :-)

  • George

And just to clarify, I’m not saying it would be extremely beneficial to
optimize the PMT use throughout all of the inband code… but it most
certainly would be in the TX and RX chains through the inband code.
Optimizing the parsing of TX and RX messages would seem key.

  • George

On Sat, Oct 06, 2007 at 12:30:43PM -0400, George N. wrote:

saved a session, then examined the single application.
Uhh, you want to reset before running your test case, not dump.
Then run the test app for about a minute, then dump and stop.
Then

opreport -l

I’ve hosted the results to spare the list some traffic:
http://www.andrew.cmu.edu/user/gnychis/inband_profile_4

So it looks like there is a lot of time spent scheduling the threads in
the kernel. We’re not concerned about optimizing the application itself.

Thanks.
If you get a chance, please try again, running oprofile as suggested
above.

Eric

On Sat, Oct 06, 2007 at 06:02:09PM -0400, George N. wrote:

the PMT library is really sucking face with 23% of the time, surpassing
And just to clarify, I’m not saying it would be extremely beneficial to
optimize the PMT use throughout all of the inband code… but it most
certainly would be in the TX and RX chains through the inband code.
Optimizing the parsing of TX and RX messages would seem key.

  • George

George,

Your profile shows:

44459 2.2640 libpmt.so
pmt_cons(boost::shared_ptr<pmt_base>, boost::shared_ptr<pmt_base>)
45992 2.3421 libusrp_inband.so.0.0.0
usrp_usb_interface::handle_message(boost::shared_ptr<mb_message>)
51390 2.6170 libm-2.5.so (no symbols)
56609 2.8828 libmblock.so
mb_msg_accepter_smp::operator()(boost::shared_ptr<pmt_base>,
boost::shared_ptr<pmt_base>, boost::shared_ptr<pmt_base>, unsigned int)
62358 3.1755 libstdc++.so.6.0.8 (no symbols)
65506 3.3358 vmlinux-2.6.20-16-generic schedule
73772 3.7568 libpmt.so pmt_nthcdr(unsigned int,
boost::shared_ptr<pmt_base>)
337319 17.1777 libc-2.5.so (no symbols)

Before even thinking about libpmt, it would be wise to figure out the
much greater CPU consumption in libc. Start with the big offenders.

When that’s sorted out (and we’re confident about your profile
numbers) we can have a discussion about what’s next.

BTW, what is the test case that you’re running?
What is the exact sequence of commands that you are issuing to
generate this trace? I’d like to try to reproduce this on my machines.

Eric

Eric,

Before even thinking about libpmt, it would be wise to figure out the
much greater CPU consumption in libc. Start with the big offenders.

When that’s sorted out (and we’re confident about your profile
numbers) we can have a discussion about what’s next.

Sounds good.

BTW, what is the test case that you’re running?
What is the exact sequence of commands that you are issuing to
generate this trace? I’d like to try to reproduce this on my machines.

So actually I was using reset, but said dump.

Here’s my exact sequence:

sudo opcontrol --start
sudo opcontrol --reset
./test_usrp_inband_underrun (interp/decim @ 16, 30e6 samples)
sudo opcontrol --save=inband_tx_rx_5
cd .libs/
opreport -rl session:inband_tx_rx_5 lt-test_usrp_inband_underrun &>
inband_tx_rx_5

If I run another test, I omit the start. :-P

And here’s the result of doing just that:
http://www.andrew.cmu.edu/user/gnychis/inband_tx_rx_5

Here’s the rbf I’m using; you can change it by editing the underrun
application. The current checked-in copy of my branch is using
nanocell9, just switch it.
http://www.andrew.cmu.edu/user/gnychis/cs1.rbf

A new run with test_usrp_inband_tx:
http://www.andrew.cmu.edu/user/gnychis/inband_tx_6

The key difference between the two is that underrun has both TX and RX
running, while TX has only TX running… and also that underrun generates
one large message of size d_nsamples_to_send, whereas the normal TX
program breaks d_nsamples_to_send down into multiple frames.

  • George

On Sun, Oct 07, 2007 at 06:16:28PM -0400, George N. wrote:

BTW, what is the test case that you’re running?
What is the exact sequence of commands that you are issuing to
generate this trace? I’d like to try to reproduce this on my machines.

So actually I was using reset, but said dump.

Good ;-)

Here’s my exact sequence:

sudo opcontrol --start
sudo opcontrol --reset
./test_usrp_inband_underrun (interp/decim @ 16, 30e6 samples)
sudo opcontrol --save=inband_tx_rx_5
cd .libs/

I think you want a dump here, to ensure that everything is forced to
disk.
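
Something like this, i.e. the same commands you listed, with the dump
added before the save:

    sudo opcontrol --reset
    ./test_usrp_inband_underrun
    sudo opcontrol --dump
    sudo opcontrol --save=inband_tx_rx_5
    opreport -l session:inband_tx_rx_5 lt-test_usrp_inband_underrun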

nanocell9, just switch it.
http://www.andrew.cmu.edu/user/gnychis/cs1.rbf

A new run with test_usrp_inband_tx:
http://www.andrew.cmu.edu/user/gnychis/inband_tx_6

The key difference between the two is that underrun has both TX and RX
running, while TX has only TX running… and also that underrun generates
one large message of size d_nsamples_to_send, whereas the normal TX
program breaks d_nsamples_to_send down into multiple frames.

Thanks!

Eric

I’ve been digging around the oprofile docs and Google for this, but I
haven’t turned up anything yet, as many of the docs show libc symbols in
oprofile… Any ideas why my results are not getting per-symbol analysis
of libc? Instead, it just says “(no symbols)”.

  • George

George N. wrote:

I’ve been digging around the oprofile docs and Google for this, but I
haven’t turned up anything yet, as many of the docs show libc symbols in
oprofile… Any ideas why my results are not getting per-symbol analysis
of libc? Instead, it just says “(no symbols)”.

Finally found it:
sudo apt-get install libc6-dbg

So in that case:
http://www.andrew.cmu.edu/user/gnychis/inband_tx_10

And I’ve stopped reverse sorting :-P heh

So libm’s bulk comes from sincosf, which I’m not worrying about, but
that answers the “where” question for that lib.

When it comes to libc, it’s the frees and mallocs that are killing us.
If you look down the libc entries to even the smaller percentages, it’s
free/malloc related (malloc_consolidate, _int_malloc, _int_free, etc.).

This happens in two major areas that I can think of: one is in the
application, which stores the samples (our sine wave) and passes them
down. The other is in the usrp_server, which allocates new memory for
the USB packets that the samples are then transparently copied into.
Then, of course, this memory gets freed.
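
The usrp_server side boils down to something like this per frame (a
sketch with made-up names, not the actual code):

    #include <cstddef>   // std::size_t
    #include <cstring>   // std::memcpy

    void write_frame_to_usb(const char *pkt, std::size_t len);  // hypothetical

    void send_one_frame(const char *samples, std::size_t n_bytes)
    {
      char *pkt = new char[n_bytes];        // shows up as malloc/_int_malloc
      std::memcpy(pkt, samples, n_bytes);   // samples copied in transparently
      write_frame_to_usb(pkt, n_bytes);     // hand it to the USB layer
      delete [] pkt;                        // shows up as free/_int_free
    }

so every frame pays for an allocation and a free on top of the memcpy.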

Thoughts/comments? :-)

  • George

On Tue, Oct 09, 2007 at 01:33:12PM -0400, George N. wrote:

sudo apt-get install libc6-dbg

So in that case:
http://www.andrew.cmu.edu/user/gnychis/inband_tx_10

And I’ve stopped reverse sorting :-P heh

Glad to see it ;-)

the USB packets that the samples are then transparently copied into.
Then, of course, this memory gets freed.

Thoughts/comments? :-)

The malloc/frees could be coming from the C++ allocations for the pmt
objects. Thanks for investigating!

Eric

Eric B. wrote:

The malloc/frees could be coming from the C++ allocations for the pmt
objects. Thanks for investigating!

No problem! Wasn’t as bad as we thought :-)

Suggestions on where to go from here with the inband code?

Leo is just finishing up mapping the rest of the registers; we got stuck
on a Verilog issue that Brian is helping us with. Then we are going to
do another merge, and we can ping you when that’s ready.

We also implemented the delay functionality, which we’ve verified to be
working by checking the oscope.

  • George

George N. wrote:

Suggestions on where to go from here with the inband code?

Leo is just finishing up mapping the rest of the registers; we got stuck
on a Verilog issue that Brian is helping us with. Then we are going to
do another merge, and we can ping you when that’s ready.

We also implemented the delay functionality, which we’ve verified to be
working by checking the oscope.

Actually, I’m going to partially answer my own question… I thought of
something when I went running.

I’d like to sort out the RSSI issue. I’m going to dig into the RSSI
Verilog to figure out exactly what it’s returning. We’re never getting
any difference in the top half of the 32 bits, only the bottom half.
Then we can decide exactly what we want to return, and whether we want
to allocate any more of the header to it.

But if there’s anything else let me know.

  • George

I really don’t know much about oprofile and haven’t done much
profiling, but I do have a question or two.

Q. Since the profiler attributes the time to the lowest-level function
where it is actually spent, I find it strange that pmt_nthcdr is the
second method listed there. Intuitively, pmt_nthcdr should just run a
tight loop of pmt_cdr, in which case I would assume pmt_cdr would be
higher on the list, but it is not. Same with pmt_nth. What might be
taking so long within those functions that is NOT taking as long within
pmt_cdr/pmt_car? Is something turning into an inline function which
really yields a false profile?
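
For reference, I would expect pmt_nthcdr to have roughly this shape
(my guess at the structure, not the actual source):

    // walk n links down the list, one pmt_cdr per step
    pmt_t pmt_nthcdr(unsigned int n, pmt_t list)
    {
      pmt_t p = list;
      for (unsigned int i = 0; i < n; i++)
        p = pmt_cdr(p);    // if pmt_cdr gets inlined, its samples land here
      return p;
    }

If pmt_cdr is inlined into that loop, oprofile would charge all of the
time to pmt_nthcdr and none to pmt_cdr, which would explain what I see.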

Q. I am surprised to see a destructor (pmt_pair::~pmt_pair())
utilizing so much time. Are there that many pmt_pairs that have to
get destroyed? To answer my own question, I suppose so since every
call to pmt_cons actually creates a new pmt_pair - which might be a
good reason why the malloc and frees are high on the list. Any idea
why so many pmt_cons are used?

Sorry if these seem like naive or silly questions to answer, but they
really just make me wonder.

Thanks,
Brian

As another naive note, since sincosf seems to be the dominating
function by a large factor, this website might be worth a read:

http://www.devmaster.net/forums/showthread.php?t=5784
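
The flavor of thing discussed in threads like that is a low-order
polynomial fit, e.g. something like this (a sketch, not the GNU Radio
fast sine/cosine, and only good to a few percent):

    #include <cmath>

    // parabolic approximation of sin(x) for x in [-pi, pi]
    static inline float fast_sin(float x)
    {
      const float B =  4.0f / float(M_PI);
      const float C = -4.0f / float(M_PI * M_PI);
      return B * x + C * x * std::fabs(x);   // max abs error around 0.06
    }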

I know there is a fast sine and cosine somewhere for filtering, but I
also thought those caused some issues recently with GMSK demodulation?
When converted from floating -> fixed point for transmission, the
error might be swamped by the quantization noise?

The last section of the forum post is really interesting: since we
generally do work on more than one piece of data at a time, the SIMD
version of calculating sines or cosines might come in handy.

Brian

On Tue, Oct 09, 2007 at 05:13:08PM -0400, Brian P. wrote:

really yields a false profile?

Q. I am surprised to see a destructor (pmt_pair::~pmt_pair())
utilizing so much time. Are there that many pmt_pairs that have to
get destroyed? To answer my own question, I suppose so since every
call to pmt_cons actually creates a new pmt_pair - which might be a
good reason why the malloc and frees are high on the list. Any idea
why so many pmt_cons are used?

Because we use a lot of them to construct argument lists.
It would be possible to move to a pmt_vector-based approach, which
would cut this down dramatically. I think we’re still a bit early in
the game to start that kind of modification.
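
To make the difference concrete (a sketch using the free-function PMT
API; the variable names are just illustrative):

    // list style: one heap-allocated pmt_pair per element
    pmt_t args = pmt_cons(invocation_handle,
                          pmt_cons(channel,
                                   pmt_cons(samples, PMT_NIL)));

    // vector style: a single allocation, indexed access, no cdr-walking
    pmt_t argv = pmt_make_vector(3, PMT_NIL);
    pmt_vector_set(argv, 0, invocation_handle);
    pmt_vector_set(argv, 1, channel);
    pmt_vector_set(argv, 2, samples);
    pmt_t ch = pmt_vector_ref(argv, 1);   // O(1), no pmt_nth walk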

I think the first thing I would try is moving to the intrusive
implementation of the boost shared pointers for the pmt types.
Then I’d look at a data-type-specific alloc/free, as well as see how
the default allocator is working across multiple threads. That is,
does it already use a separate allocation pool per thread? If it
doesn’t, we could speed up the allocation/free and reduce the amount
of locking required in the typical case.
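
The intrusive idea, roughly (an illustration of the technique, not a
patch to pmt_base; a real version would need an atomic or locked count
for thread safety):

    #include <boost/intrusive_ptr.hpp>

    class obj {
      int d_count;
    public:
      obj() : d_count(0) {}
      virtual ~obj() {}
      friend void intrusive_ptr_add_ref(obj *p) { ++p->d_count; }
      friend void intrusive_ptr_release(obj *p) {
        if (--p->d_count == 0) delete p;
      }
    };

    typedef boost::intrusive_ptr<obj> obj_sptr;

The refcount lives inside the object itself, so there is no separate
counter allocation the way there is with boost::shared_ptr.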

Before hacking away, I think we need to run the same test cases on
other machines besides the P4 Xeon and gather the oprofile data, as
well as the basic “time” numbers. We may find wildly
different answers as f(microarchitecture). There’s a reason Intel
isn’t featuring the P4 Xeon anymore ;-)

Eric

On Tue, Oct 09, 2007 at 11:32:52PM -0400, George N. wrote:

the default allocator is working across multiple threads. That is,

This sounds good to me. I can contribute numbers for the Core Duo,
which is my X60s laptop. Other than that, I think I only have access to
P4s. Let me organize this in terms of which applications to run for
the testing, write up a little how-to for running the tests, and maybe
others on the list can contribute results with different architectures.

Very good.

Eric

Eric B. wrote:

doesn’t, we could speed up the allocation/free and reduce the amount
of locking required in the typical case.

Before hacking away, I think we need to run the same test cases on
other machines besides the P4 Xeon and gather the oprofile data, as
well as the basic “time” numbers. We may find wildly
different answers as f(microarchitecture). There’s a reason Intel
isn’t featuring the P4 Xeon anymore ;-)

This sounds good to me. I can contribute numbers for the Core Duo,
which is my X60s laptop. Other than that, I think I only have access to
P4s. Let me organize this in terms of which applications to run for
the testing, write up a little how-to for running the tests, and maybe
others on the list can contribute results with different architectures.

  • George
