Gnuradio locking up

On Tue, Nov 22, 2011 at 11:28 AM, Philip B. [email protected] wrote:

You can use the single threaded scheduler by setting an environment
variable:

export GR_SCHEDULER=STS
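
Equivalently, a minimal sketch of doing the same from inside the Python script itself, on the assumption that the variable is read when the flow graph is started:

import os
os.environ["GR_SCHEDULER"] = "STS"  # must be set before the top block starts

from gnuradio import gr
# ... build the flow graph and call tb.start() as usual ...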

Gave this a shot; app runs for a while (at 100% CPU with quite a few overruns) then segfaults…
[20077.594080] python[16760]: segfault at 8 ip b7478dc4 sp a862c09c error 6 in libc-2.11.1.so[b741c000+153000]
and
[21548.361453] python[15672]: segfault at 0 ip b74b9330 sp a9b6f0e0 error 4 in libc-2.11.1.so[b745a000+153000]

Just so y’all know what lock/unlock is doing:

When you unlock the flow graph it basically interrupts and joins all
scheduler threads. Then it creates an entirely new scheduler.

wait() is a good candidate for the cause of lockups, that is,
interrupted threads are not exiting. This may be a sign of a bug in a
particular block’s work function.

Now, if you are seeing crashing, that may be the result of an interrupted
block not cleaning itself up properly to be called by work again:
hanging onto old memory, not resetting state variables…

It may be a useful test to not lock/unlock, but to stop/wait/start and
see where the flow graph breaks down.
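
In other words, something along these lines (a rough sketch; tb, src, old_sink and new_sink are placeholders, not names from your flow graph):

tb.stop()                     # ask every block thread to exit
tb.wait()                     # join them; if this hangs, some block's work() is not returning
tb.disconnect(src, old_sink)  # reconfigure the topology while nothing is running
tb.connect(src, new_sink)
tb.start()                    # spin up a fresh scheduler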

-Josh

On Tue, Nov 22, 2011 at 8:30 PM, Marcus D. Leech [email protected]
wrote:

It’s a big flow-graph (you’d mentioned 197 blocks), so it will be a pain
to whittle down exactly which block is causing the segfault–and it’s
provoking it from inside libc, which makes it even less fun.

While it’s a big flow graph, there aren’t very many different types of
blocks (10 of them are instantiated 19 times). However, I’ve seen some
segfaults before that I’ve identified as coming from the wav_file sink…
I haven’t yet had a chance to look at the actual code to see if anything
sticks out (and I’m not sure that I’d even be able to identify something
sticking out :wink: ).

On 22/11/11 10:21 PM, Matt M. wrote:

Gave this a shot; app runs for a while (at 100% CPU with quite a few overruns) then segfaults…
[20077.594080] python[16760]: segfault at 8 ip b7478dc4 sp a862c09c error 6 in libc-2.11.1.so[b741c000+153000]
and
[21548.361453] python[15672]: segfault at 0 ip b74b9330 sp a9b6f0e0 error 4 in libc-2.11.1.so[b745a000+153000]

The segfaults are discouraging–it means there’s a bug in one of the
blocks you’re using, or in the GNU Radio scheduler/data-shuffler. It’s a
big flow-graph (you’d mentioned 197 blocks), so it will be a pain to
whittle down exactly which block is causing the segfault–and it’s
provoking it from inside libc, which makes it even less fun.

The overruns aren’t a big surprise: you were already having some with
TPB scheduling (and thus using both your CPU cores), and with STS you’re
running everything on a single CPU.

On Tue, Nov 22, 2011 at 10:21 PM, Matt M. [email protected] wrote:

Gave this a shot; app runs for a while (at 100% CPU with quite a few overruns) then segfaults…
[20077.594080] python[16760]: segfault at 8 ip b7478dc4 sp a862c09c error 6 in libc-2.11.1.so[b741c000+153000]
and
[21548.361453] python[15672]: segfault at 0 ip b74b9330 sp a9b6f0e0 error 4 in libc-2.11.1.so[b745a000+153000]

Since it’s a segfault, I just added this to the FAQ on the wiki to
explain
how to get more information:
http://gnuradio.org/redmine/projects/gnuradio/wiki/FAQ#How-do-I-debug-GNU-Radio-in-Python

Post the output of the backtrace to help us find out what’s going wrong.

Tom

On 22/11/11 10:41 PM, Matt M. wrote:

blocks (10 of them are instantiated 19 times). However, I’ve seen some
segfaults before that I’ve identified as coming from the wav_file
sink… I haven’t yet had a chance to look at the actual code to see if
anything sticks out (and I’m not sure that I’d even be able to
identify something sticking out :wink: ).

Well, here’s a flow-graph that exercises “valve” with WAV file sinks–8
of them. The valves are operated randomly, with a state-change possible
at a rate of 10 Hz. It doesn’t crash on my machine (currently, an
ancient dual-core laptop).

It uses a signal source, rather than UHD. The signal source operates at
50 kHz.
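
Roughly that shape of graph, as a sketch (not Marcus’s actual script; the valve block here is the one from grc_gnuradio, and all names and parameters are illustrative). As I understand it, flipping a valve reconfigures the graph via lock()/unlock(), which is the code path under discussion:

from gnuradio import gr
from grc_gnuradio import blks2 as grc_blks2  # provides the GRC "valve" block

samp_rate = 50e3
tb = gr.top_block()
src = gr.sig_source_f(samp_rate, gr.GR_SIN_WAVE, 1e3, 0.5)

valves = []
for i in range(8):
    v = grc_blks2.valve(item_size=gr.sizeof_float, open=True)
    snk = gr.wavfile_sink("out_%d.wav" % i, 1, int(samp_rate), 16)
    tb.connect(src, v, snk)
    valves.append(v)

tb.start()
# ... then toggle valves[i].set_open(True/False) at random, up to ~10 times a second ...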

It appears UHD does not like to be watched; reproducibly, if I run with
GDB attached, UHD eventually stops sending data to the downstream blocks
and my screen fills up with:

thread[single-threaded-scheduler]: RuntimeError: Control channel send error
thread[single-threaded-scheduler]: RuntimeError: Control channel send error
thread[single-threaded-scheduler]: RuntimeError: Control channel send error
thread[single-threaded-scheduler]: RuntimeError: Control channel send error
thread[single-threaded-scheduler]: RuntimeError: Control channel send error
thread[single-threaded-scheduler]: RuntimeError: Control channel send error
thread[single-threaded-scheduler]: RuntimeError: Control channel send error
thread[single-threaded-scheduler]: RuntimeError: Control channel send error
thread[single-threaded-scheduler]: RuntimeError: Control channel send error

If I run it without GDB attached, it segfaults.

Matt M. wrote:

Has anyone had time to look further into the unlock() lockup that Rachel
reproduced below? I seem to be running into it left and right for some
reason, and sadly my C++ isn’t anywhere near good enough to go seeking
the cause myself.

This looks like an old boost problem
(https://svn.boost.org/trac/boost/ticket/2330). Is there any chance you
are
using a version of boost older than 1.45?

– Don W.
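
(One quick way to check which Boost headers are installed, as a sketch; /usr/include/boost is an assumed path and varies by distro:)

import re

# BOOST_VERSION is e.g. 104500 for Boost 1.45.0
with open("/usr/include/boost/version.hpp") as f:
    raw = int(re.search(r"#define BOOST_VERSION (\d+)", f.read()).group(1))
print("Boost %d.%d.%d" % (raw // 100000, (raw // 100) % 1000, raw % 100))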

It’s come up a few times in the mailing list archives. The usual

source,
fprintf(stderr, "disconnect: done\n");
sink = gr_make_file_sink(sizeof(float), "/dev/null");

On Wed, Dec 7, 2011 at 10:00 AM, Don W. [email protected]
wrote:

This looks like an old boost problem (
https://svn.boost.org/trac/boost/ticket/2330). Is there any chance you
are using a version of boost older than 1.45?

frowns at ubuntu

ii  libboost1.40-dev  1.40.0-4ubuntu4  Boost C++ Libraries development files

Has anyone had time to look further into the unlock() lockup that Rachel
reproduced below? I seem to be running into it left and right for some
reason, and sadly my C++ isn’t anywhere near good enough to go seeking
the cause myself.

On Wed, Dec 7, 2011 at 3:27 PM, Matt M. [email protected] wrote:

frowns at ubuntu

After updating to Ubuntu 11.10 (which has boost 1.46) I still experience
the lockup.

linux; GNU C++ version 4.6.1; Boost_104601; UHD_003.004.000-7dc76db

On Dec 8, 2011, at 1:05 PM, Matt M. wrote:

After updating to Ubuntu 11.10 (which has boost 1.46) I still experience the
lockup.

For the record, I was testing on this:

morbo:~$ cat /etc/slackware-version; uname -a
Slackware 13.37.0
Linux morbo 2.6.37.6 #3 SMP Sat Apr 9 22:49:32 CDT 2011 x86_64 Intel® Core™ i7-2600 CPU @ 3.40GHz GenuineIntel GNU/Linux
morbo:~$ ls /var/log/packages/boost*
/var/log/packages/boost-1.45.0-x86_64-1

That’s a stock Slackware64 13.37 package.

On Thu, Dec 8, 2011 at 2:05 PM, Matt M. [email protected] wrote:

After updating to Ubuntu 11.10 (which has boost 1.46) I still experience
the lockup.

Also, here’s the Python I’ve been using to reproduce:

and the GDB backtrace after it’s locked up (Pastebin): (gdb) bt #0 0xb7833424 in __kernel_vsyscall () #1 0xb7809619 in __lll_lock_w…

On Thu, Dec 8, 2011 at 21:47, Matt M. [email protected] wrote:

Also, here’s the Python I’ve been using to reproduce (Pastebin): #!/usr/bin/env python … from gnuradio import gr … from gnuradio import uhd … from t…
and the GDB backtrace after it’s locked up (Pastebin): (gdb) bt #0 0xb7833424 in __kernel_vsyscall () #1 0xb7809619 in __lll_lock_w…

The unlock() code, when the number of calls to unlock() reaches the same
as
the number of calls to lock(), initiates the process of reconfiguring
the
running flowgraph with the modified topology created by calls to
connect()
and disconnect() in the interim.
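
In code, that reference counting behaves roughly like this (a sketch; tb, src, old_sink and new_sink are placeholders):

tb.lock()
tb.lock()                      # lock count is now 2
tb.disconnect(src, old_sink)   # topology changes are only recorded for now
tb.connect(src, new_sink)
tb.unlock()                    # count drops to 1: nothing happens yet
tb.unlock()                    # count reaches 0: the running flowgraph is stopped,
                               # reconfigured with the new topology, and restarted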

At a certain point, it stops the running flowgraph. This is shown in
stack
frame #7 in your backtrace.

To stop the graph, it interrupts all the threads in the thread group for
the flowgraph (each block’s work() function runs in its own thread),
shown in stack frame #5.

Stack frames #4-#0 show the Boost library interrupting one of the
threads,
which ultimately results in a kernel syscall that never returns.

It’s possible that whatever thread is being interrupted is somewhere in
an
uninterruptible state, though I’m not sure what that could be. If you
could do an “info threads” in gdb, that might shed some light.

Johnathan C.

On Thu, Dec 8, 2011 at 22:15, Matt M. [email protected] wrote:

With an even simpler version of the app (signal → null sink), just
running lock()/unlock() in a loop, it still locks up; using that version
gives this info from gdb (info threads, and bt on both threads included):

(Pastebin link) (gdb) info threads   Id  Target Id  Frame   2  Thread 0xb25d8b70 (L…
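
For reference, the rough shape of that reproduction (a sketch, not the actual attached script; block parameters are arbitrary):

#!/usr/bin/env python
from gnuradio import gr
import time

tb = gr.top_block()
src = gr.sig_source_f(50e3, gr.GR_SIN_WAVE, 1e3, 1.0)
snk = gr.null_sink(gr.sizeof_float)
tb.connect(src, snk)
tb.start()

# Hammer the reconfiguration path; on an affected Boost version this
# eventually deadlocks inside unlock().
while True:
    tb.lock()
    tb.unlock()
    time.sleep(0.01)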

Any chance you could use git bisect to narrow it down?

Johnathan

On Thu, Dec 8, 2011 at 11:02 PM, Johnathan C. <[email protected]> wrote:

It’s possible that whatever thread is being interrupted is somewhere in an
uninterruptible state, though I’m not sure what that could be. If you
could do an “info threads” in gdb, that might shed some light.

With an even simpler version of the app (signal → null sink), just
running lock()/unlock() in a loop, it still locks up; using that version
gives this info from gdb (info threads, and bt on both threads included):

All,

I’m willing to offer a $100 bounty on a patch that fixes the GNU Radio
lockup in the attached script (basically just running lock()/unlock() in
a loop).

Thanks,
Matt.

---------- Forwarded message ----------
From: Matt M. [email protected]
Date: Fri, Dec 9, 2011 at 9:02 AM
Subject: Re: [Discuss-gnuradio] Gnuradio locking up
To: Don W. [email protected]

Don,

File is attached; if you watch CPU use, on my system it goes to about
40% for 5-10 seconds and then drops to 0%. Once it has dropped to 0%,
it has locked up.

-MM

Matt,

It looks like the problem here lies within boost itself.

Until boost 1.47, a deadlock condition existed during concurrent
condition_wait() and interrupt() calls, which is what we are running
into. This seems to happen most commonly as a TPB thread body is
blocking on an input condition variable while the python/top-level
thread is calling unlock() → interrupt(), which blocks on a thread_info
mutex.

See https://svn.boost.org/trac/boost/ticket/4978

The boost minimum requirement in GNU Radio should probably be raised to
1.47 (or at least some kind of warning added) in order to avoid this
behavior in the future.

test2.py seems to run without any issues or deadlock using boost 1.48
from source.

-Tim

On Thu, Dec 8, 2011 at 11:15 PM, Matt M. [email protected] wrote:

With an even simpler version of the app (signal → null sink), just
running lock()/unlock() in a loop, it still locks up; using that version
gives this info from gdb (info threads, and bt on both threads included):

(Pastebin link) (gdb) info threads   Id  Target Id  Frame   2  Thread 0xb25d8b70 (L…

After compiling with debug symbols, it looks like it’s getting stuck in
line 71 of gr_tpb_thread_body.cc (
http://gnuradio.org/redmine/projects/gnuradio/repository/revisions/b6005d9e5823d871c091f3b4a048ca67cd885821/entry/gnuradio-core/src/lib/runtime/gr_tpb_thread_body.cc#L71
)

Updated Pastebin of the gdb backtrace: (gdb) bt #0 0xb7833424 in __kernel_vsyscall () #1 0xb7809619 in __lll_lock_w…