GNU Radio locking up

Hello all,

I seem to be having an issue where, after about 30-45 minutes of running
normally, my GNU Radio-based Python app will just lock up. It won't respond
to Ctrl-C, it holds all of its existing file handles open but doesn't do
anything with them, and an strace attach shows only:
futex(0xb2ff54f4, FUTEX_WAIT_PRIVATE, 1, NULL

I don't really have any experience troubleshooting this type of thing;
could anyone provide some guidance on what to look for (in gdb, I assume)?

Thanks,
Matt.

On 11/21/2011 10:24 PM, Matt M. wrote:

Hello all,

I seem to be having an issue where, after about 30-45 minutes of running
normally, my GNU Radio-based Python app will just lock up. It won't respond
to Ctrl-C, it holds all of its existing file handles open but doesn't do
anything with them, and an strace attach shows only:
futex(0xb2ff54f4, FUTEX_WAIT_PRIVATE, 1, NULL

I don't really have any experience troubleshooting this type of thing;
could anyone provide some guidance on what to look for (in gdb, I assume)?

Memory leak? Run top in another window and watch the memory usage
numbers.
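
If it helps, something along these lines will watch just that one process
(substitute your app's actual PID for <pid>):

top -p <pid>
# or, for a periodic one-line summary of virtual/resident size:
watch -n 5 'ps -o pid,vsz,rss,cmd -p <pid>'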

Philip

On Tue, Nov 22, 2011 at 12:29 PM, Philip B. [email protected] wrote:

I don't really have any experience troubleshooting this type of thing;
could anyone provide some guidance on what to look for (in gdb, I assume)?

Memory leak? Run top in another window and watch the memory usage numbers.

I’ve seen lockups of this sort when multi-threaded Python processes exit.
You might also like to take a look at what each thread is up to in gdb.
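
A rough recipe, assuming the process is still wedged and you can find its
PID (e.g. with ps or pgrep):

gdb -p <pid>
(gdb) thread apply all bt    # backtrace of every thread
(gdb) detach
(gdb) quit

Attaching stops the process, so detach when you're done poking around.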

Mark

Curiously, at startup Python begins consuming ~1950M of VIRT, but only 46M
of RES and 23M of SHR… No signs of any of those numbers increasing by more
than +/- 5% (although VIRT occasionally drops momentarily to ~160-180 MB
before returning to ~1950 MB, which seems awfully strange).

After about 15 minutes of running, the app has locked up. VIRT shows 364M,
RES=45M, SHR=23M. I did notice that while running there were about 20k
rescheduling interrupts per second in /proc/interrupts (which I believe is
causing 30-40% SYS CPU use).
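
(For anyone who wants to watch the same thing: the counters are on the
"Rescheduling interrupts" line of /proc/interrupts, so something like

watch -n 1 'grep -i rescheduling /proc/interrupts'

makes the rate easy to eyeball.)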

On 22/11/11 10:18 AM, Matt M. wrote:

On Tue, Nov 22, 2011 at 5:29 AM, Philip B. <[email protected]> wrote:

Memory leak? Run top in another window and watch the memory usage
numbers.

What type of machine? OS? How much physical memory? Is your OS
up-to-date?

On Tue, Nov 22, 2011 at 6:52 AM, Mark S. [email protected] wrote:

I’ve seen lockups of this sort when multi-threaded python processes exit.
You might also like to take a look at what each thread is up to in gdb.

I’m not really sure how to get around in GDB, but I’ve captured a backtrace
of each thread in this pastebin: (gdb) bt #0 0xb7873430 in
__kernel_vsyscall () #1 0xb784f245 in sem_wait@@GL… - Pastebin.com
If there's something that would make more sense to look at, let me know.

Ubuntu 10.04 LTS (x86) on a physical desktop (Intel G6950 dual-core CPU),
2 GB physical RAM, 6 GB swap space, OS is up to date per apt. GNU Radio and
UHD are both built from git as of yesterday.

Linux -hostname- 2.6.32-35-generic-pae #78-Ubuntu SMP Tue Oct 11 17:01:12 UTC 2011 i686 GNU/Linux

On 22/11/11 10:30 AM, Matt M. wrote:

What type of machine? OS? How much physical memory? Is your OS
up-to-date?

And this is still the flow-graph that has lock()/unlock() in it? From the
report of very high rescheduling interrupts, I wonder if there's a subtle
bug in the GNU Radio block scheduler around lock()/unlock() that causes
horrible thrashing.

This graph doesn't have any lock()/unlock() calls in the code itself, but
it does use valve blocks (which I believe use lock()/unlock() internally),
which are used to mute/unmute streams (there is probably an average of 2-4
valve state changes per second across the graph's 19 valves).

On 22/11/11 10:44 AM, Matt M. wrote:

This graph doesn't have any lock()/unlock() calls in the code itself, but
it does use valve blocks (which I believe use lock()/unlock() internally),
which are used to mute/unmute streams (there is probably an average of 2-4
valve state changes per second across the graph's 19 valves).

Well, if there were a subtle deadlock/race/non-linearity in
lock()/unlock(), that's the type of graph that would uncover it, for sure.


Principal Investigator
Shirleys Bay Radio Astronomy Consortium

And this is still the flow-graph that has lock()/unlock() in it? From the
report of very high rescheduling interrupts, I wonder if there's a subtle
bug in the GNU Radio block scheduler around lock()/unlock() that causes
horrible thrashing.

It’s pretty easy to get wedged forever if you call lock and unlock a lot
in conjunction with connect and disconnect. Sooner or later, you’ll hit
a race and things will get stuck.

I have a simple reproduction case if anyone is interested. It’ll hang
reliably after a few dozen iterations.

On 22/11/11 10:48 AM, Rachel Kroll wrote:

It’s pretty easy to get wedged forever if you call lock and unlock a lot in
conjunction with connect and disconnect. Sooner or later, you’ll hit a race and
things will get stuck.

I have a simple reproduction case if anyone is interested. It’ll hang reliably
after a few dozen iterations.

That’s the type of information that shouldn’t be withheld from this list,
and by implication, the developers. Don’t assume that because you’ve found
a bug/unexpected-behaviour, the developers know about it and are working
on a fix.


Principal Investigator
Shirleys Bay Radio Astronomy Consortium

On Nov 22, 2011, at 7:56 AM, Marcus D. Leech wrote:

know about it, and are working on a fix.
It’s come up a few times in the mailing list archives. The usual
solution seems to be “add more sleeps”, which of course is not a fix.

Anyway, here’s the reproduction case:

#include <cstdio>   // for fprintf

#include <gnuradio/gr_file_sink.h>
#include <gnuradio/gr_sig_source_f.h>
#include <gnuradio/gr_hier_block2.h>
#include <gnuradio/gr_io_signature.h>
#include <gnuradio/gr_top_block.h>

static void connect(gr_top_block_sptr block, gr_sig_source_f_sptr source,
                    gr_hier_block2_sptr block2) {
  fprintf(stderr, "connect: calling lock, connect, unlock\n");
  block->lock();
  block->connect(source, 0, block2, 0);
  block->unlock();
  fprintf(stderr, "connect: done\n");
}

static void disconnect(gr_top_block_sptr block, gr_sig_source_f_sptr source,
                       gr_hier_block2_sptr block2) {
  fprintf(stderr, "disconnect: calling block->lock\n");
  block->lock();

  fprintf(stderr, "disconnect: calling block->disconnect\n");
  block->disconnect(source, 0, block2, 0);

  fprintf(stderr, "disconnect: calling block->unlock\n");
  block->unlock();  // It usually hangs here.

  fprintf(stderr, "disconnect: done\n");
}

int main(int argc, char** argv) {
  // Inner block: block to sink.
  gr_hier_block2_sptr inner;
  inner = gr_make_hier_block2("inner",
                              gr_make_io_signature(1, 1, sizeof(float)),
                              gr_make_io_signature(0, 0, 0));

  gr_file_sink_sptr sink;
  sink = gr_make_file_sink(sizeof(float), "/dev/null");
  inner->connect(inner, 0, sink, 0);

  // Outer block: signal source to inner block.
  gr_top_block_sptr outer = gr_make_top_block("outer");
  gr_sig_source_f_sptr src = gr_make_sig_source_f(11025, GR_COS_WAVE,
                                                  400, .1, 0);

  // Hook it up and get it going.
  connect(outer, src, inner);
  outer->start();

  // Frob it until we die.
  while (true) {
    disconnect(outer, src, inner);
    fprintf(stderr, "\n\n------------------------\n\n");

    connect(outer, src, inner);
  }

  return 0;
}

On 11/22/2011 11:02 AM, Rachel Kroll wrote:

list, and by implication, the
developers. Don’t assume that because you’ve found a
bug/unexpected-behaviour, that the developers
know about it, and are working on a fix.

It’s come up a few times in the mailing list archives. The usual solution seems
to be “add more sleeps”, which of course is not a fix.

Anyway, here’s the reproduction case:

How do you compile this? I put it in a file and made a couple of quick
stabs at it.

#include <gnuradio/gr_file_sink.h>

This raises a question: the standard search paths find this file, but
the gnuradio headers have lines like:

#include <gr_core_api.h>

which force you to add -I/usr/local/include/gnuradio to the compile
command. I don’t like mixing my include styles and feel searching both
paths can lead to problems.

Philip

On 22/11/11 11:02 AM, Rachel Kroll wrote:

Thanks for this code snippet. Oh, and I read your “Forest for the
Trees” piece. Nice.

I’ve never had a big flow-graph that used dynamic topology reconfig,
because when I first started using GNU Radio back in 2005, the flow-graph
topology had to be static once started. So I tend to structure things
based on that assumption, and generally never run into wedgies.

I’ve had flow-graphs that have run for weeks, and only died when a power
failure tripped out the entire computer.

So, clearly, the lock()/unlock()/connect()/disconnect() logic is subtly
broken. Parallelism tends to break subtly, unfortunately.


Principal Investigator
Shirleys Bay Radio Astronomy Consortium

How do you compile this? I put it in a file and made a couple of quick
stabs at it.

My Makefile is just:

grlock: grlock.cc
	g++ -g -Wall -I/usr/local/include/gnuradio -o grlock grlock.cc \
	    -lgnuradio-core -Xlinker -rpath /usr/local/lib64

You probably won’t need the -Xlinker -rpath stuff unless your machine
has some weird library path issues.

This raises a question, the standard search paths find this file, but
the gnuradio headers have lines like:

#include <gr_core_api.h>

I’m not a fan of that include scheme either, but I work with what I have.

The least I can do is not propagate it into my own code, which is why I
have the leading “gnuradio/” on those paths.

I may have also neglected to mention that this graph, by my count, has
about 197 blocks in it…

So is there anything further I could look at in my app (aside from trying
to eliminate the valve blocks, which I'm attempting to do) to positively
determine the cause of the lockups (and, if it is related to Rachel's test
case, confirm that whatever patch is produced will fix it)?

Thanks all,
Matt.

On 22/11/11 12:58 PM, Matt M. wrote:

The lock()/unlock() is a likely contributing factor, and having such a
large number of blocks is likely to exacerbate any underlying deficiencies
in that area.

That's a lot of blocks, which means there's going to be a lot of data
shuffling going on inside the block scheduler. Try to merge blocks if you
can. For example, adjacent multiplies can be merged into a single
multiply, etc. With that many blocks, you'll be chewing up MFLOPS pretty
quickly.


Principal Investigator
Shirleys Bay Radio Astronomy Consortium

On 11/22/2011 11:31 AM, Rachel Kroll wrote:

How do you compile this? I put it in a file and made a couple of quick
stabs at it.

I can duplicate the hang. Also, it looks like it does not hang using the
single-threaded scheduler (which I guess we expect).

You can use the single threaded scheduler by setting an environment
variable:

export GR_SCHEDULER=STS

and running your program.
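
Or, for a one-off run without exporting it into your whole shell (using
Rachel's grlock test program as the example):

GR_SCHEDULER=STS ./grlock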

Philip

On 11/22/2011 08:28 AM, Philip B. wrote:

I don’t like mixing my include styles and feel searching both
paths can lead to problems.

If you look at the .pc file, the intention was to manually specify the
include path: -I/usr/local/include/gnuradio. This is a fairly common
paradigm. I like the way we handle gruel better, but in any case, I’d
recommend keeping the coding style aligned with the manufacturer’s style.
You will also find that many gr headers that include one another depend
on the headers being in this “flat” search space.
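
For example, assuming the gnuradio-core.pc file from your install is on
PKG_CONFIG_PATH, something like this should pick up the intended flags
without hard-coding the path (untested here, but it's the usual pkg-config
pattern):

g++ -g -Wall -o grlock grlock.cc $(pkg-config --cflags --libs gnuradio-core)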

-Josh