Data lost when using big file sources

Hello gnuradio fellows,

I have an issue that appears in all gnuradio versions I used lately (I
started with 3.3 and last week I updated to latest from git) and I
thought I should post here before allocating the time to look into it by
myself.

I’m modifying a gnuradio block that is connected from python to 6 file
sources. Everything works fine as long as the files I’m using as source
for data are relatively small (30MB). When the files become large, the
data received in the general_work() function is corrupted. It is not
massively corrupted, but enough to break my work.

To investigate, I ran a small test: I filled the 6 files with known
patterns and printed an error in general_work() whenever the received
data differed from the pattern. With files larger than 100MB I see
50-80 errors in total.

To unblock my work, I observed that if I insert some printing in
general_work() the errors disappear. Going further, inserting a boost
delay of 100 µs also makes the problem go away.

some more data:

  1. I know new blocks should use work() instead of general_work(), but
    general_work() is still supposed to work as long as I call
    consume_each(), right?

  2. I’m using a Core i7 2900K, if that matters.

This error has haunted me for a long time. Now that I have finished my
work, I am thinking of looking for its cause (is it the scheduler, or the
way data is taken from the file sources?) and maybe fixing it.

Thanks,
Bogdan

Hi Bogdan,

On Tue, Apr 10, 2012 at 03:48:11AM -0700, Bogdan D. wrote:

Hello gnuradio fellows,

I have an issue that appears in all gnuradio versions I used lately (I started
with 3.3 and last week I updated to latest from git) and I thought I should post
here before allocating the time to look into it by myself.

I’m modifying a gnuradio block that is connected from python to 6 file sources.
Everything works fine as long as the files I’m using as source for data are
relatively small (30MB). When the files become large I see that the data received
in general_work() function is corrupted. It is not massively corrupted but enough
to screw my work.

I’ve seen this behaviour, too. My workaround was to use a throttle block
after noticing that my code works with USRPs, but not with files.
However, I never managed to trace the bug to the file source. Thanks for
this!

Investigating the problem, I did a small test with having the 6 files filled
with known patterns and printing an error in the general_work() if what is
received is different. The result is that if I use files with sizes over 100MB I
see 50-80 errors in total.

Do you have errors, or are samples missing?
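One way to tell the two cases apart, sketched in plain Python outside of gnuradio: fill the files with a counter pattern (sample i holds i % 256) instead of a constant, then check whether a mismatching buffer lines up with the pattern again once a few samples are skipped. The name classify_mismatch and its parameters are made up for illustration, and a skip can only be recognised modulo the pattern period.

```python
def classify_mismatch(received, start_index, period=256, min_tail=8):
    """Compare `received` (bytes) against the counter pattern i % period.

    `start_index` is the absolute stream index expected for received[0].
    Returns ("ok", None), ("missing", skipped), ("corrupt", offset) or
    ("ambiguous", offset) when too few bytes remain to decide.
    """
    for offset, value in enumerate(received):
        expected = (start_index + offset) % period
        if value == expected:
            continue
        tail = received[offset:]
        if len(tail) < min_tail:
            # A very short tail matches some shift by coincidence.
            return ("ambiguous", offset)
        base = start_index + offset
        # Does the rest of the buffer match the pattern shifted by `skip`
        # samples? If so, samples were dropped rather than altered.
        for skip in range(1, period):
            if all(v == (base + skip + k) % period for k, v in enumerate(tail)):
                return ("missing", skip)
        return ("corrupt", offset)
    return ("ok", None)
```

A single zeroed byte in the middle of a buffer is reported as "corrupt", while a buffer with a gap is reported as "missing" together with the number of skipped samples.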

Now, to unblock my work I did observe that if I insert some printing in
general_work() I will not get the errors. Going further, inserting a boost delay
of 100uS also solves the problem.

some more data:

  1. I know the new blocks should use work() instead of general_work() but is it
    still supposed to work as long as I call consume_each(), right?

That should work since general_work() calls work() and then
consume_each(). Have a look at the code of gr_block (or was it
gr_basic_block?)

MB


Karlsruhe Institute of Technology (KIT)
Communications Engineering Lab (CEL)

Dipl.-Ing. Martin B.
Research Associate

Kaiserstraße 12
Building 05.01
76131 Karlsruhe

Phone: +49 721 608-43790
Fax: +49 721 608-46071
www.cel.kit.edu

KIT – University of the State of Baden-Württemberg and
National Laboratory of the Helmholtz Association

Hi Martin,

thanks for quick reply.

There are errors: data values that differ from the pattern.
It is gr_block.

I never had time to look into this but I always speculated that the code
that takes the data from the files does not expect the data to be
available.

Thanks,
Bogdan

On Tue, Apr 10, 2012 at 04:22, Bogdan D. wrote:

I never had time to look into this but I always speculated that the code that
takes the data from the files does not expect the data to be available.

Can you elaborate on this?

Does the incorrect data you are getting resemble any other
portion of the file, or is it random garbage, flipped bits, or something
else?

Do you have the repeat option set to true or false?

Are you calling consume_each() with the proper individual values for
what was available from each of the inputs and what you processed out
of each?

FYI, I’ve successfully used file sources on tens of gigabyte size
input capture files with no issues. Not saying there isn’t any, but
this is a fairly widely exercised bit of code.

You might want to start printing or writing to a disk file the values
of all the parameters given to the general_work() function when it is
called, the values given to consume_each(), and the return value from
general_work(), then look for a pattern for when there is corrupted
input data.
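As a rough sketch of the analysis side of this suggestion, assuming the block already dumps one record per general_work() call: the Python below collects such records and compares the calls that saw corrupted input with the ones that did not. The field names and both helper functions are hypothetical; in the real block the same values would be printed from the C++ code.

```python
from statistics import mean


def record_call(log, noutput_items, ninput_items, consumed, returned, corrupt):
    """Append one general_work() call to `log` (a plain list of dicts)."""
    log.append({
        "noutput_items": noutput_items,     # argument passed to general_work()
        "ninput_items": list(ninput_items), # items available on each input
        "consumed": consumed,               # value given to consume_each()
        "returned": returned,               # return value of general_work()
        "corrupt": bool(corrupt),           # did the pattern check fail?
    })


def compare_calls(log):
    """Summarise corrupted and clean calls separately to expose a pattern."""
    def stats(rows):
        if not rows:
            return None
        return {
            "calls": len(rows),
            "mean_noutput": mean(r["noutput_items"] for r in rows),
            "mean_min_input": mean(min(r["ninput_items"]) for r in rows),
        }

    bad = [r for r in log if r["corrupt"]]
    good = [r for r in log if not r["corrupt"]]
    return {"corrupt": stats(bad), "clean": stats(good)}
```

If, for example, the corrupted calls cluster around unusually small or unusually large input counts, that narrows down where to look next.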

Johnathan

OK, I did not get many replies to this post, so I thought I should
present what I have investigated so far.

I tried to trace the root of the problem. First I simplified the
use case by filling the source files with a single value
(arbitrarily 0x18) and looking for different values in
gnuradio-core/src/lib/runtime/gr_block_executor.cc:gr_block_executor::run_one_iteration()
that is in the loop for the TPB.

One note: if I change the TPB to the Single Threaded Scheduler the problem
is gone, but the point is to use one thread for each block.

Looking for an error in the run_one_iteration() I see that the output of
the file sinks is not corrupted, the corruption appears only in
run_one_iteration() for the block I’m using; specifically, d->input(i)
at some point holds a corrupted value.

Speaking of the corrupted value, what I see is actually a zero
instead of the expected data (0x18). The funny thing is that if I
print the value twice, the second print shows the correct 0x18 (I
guess the Heisenberg principle plays its part here :) )

Where does d->input(i) come from? It is the gr_block_detail that gets
set up when connecting blocks to each other. This is probably the next
step to investigate.

The results so far: mostly learning gnuradio internals; the problem is
still hidden.

Thanks,
Bogdan

On Fri, 4/13/12, Martin B. wrote:

can you cook up a test case that (sometimes) fails the way you describe
and upload it somewhere?

Hi Martin,

sure, I will craft a small example when I get to my development
machine. The test case is quite simple, though:

  1. generate 6 big files (I’m now using 800MB per file, just to get more
    errors) filled with the same value, e.g. 0x25
  2. modify an existing gr_block based block to support 6 file sources (I
    guess 6 is not important, it could be more or less)
  3. modify forecast() function to return 1:1 for each input
  4. modify general_work() to test each byte in the 6 inputs against the
    0x25 value; any byte that differs is an error
  5. connect the file sources to the module in python and run

That is it in short.
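Steps 1 and 4 can be sketched in plain Python, independent of gnuradio. This is only an illustration under assumptions: the helper names, the file names in the usage note and the chunk size are made up, and in the real test the byte comparison lives in the block’s C++ general_work().

```python
def write_constant_file(path, size_bytes, value=0x25, chunk_size=1 << 20):
    """Write `size_bytes` bytes of `value` to `path`, one chunk at a time."""
    chunk = bytes([value]) * chunk_size
    remaining = size_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            n = min(remaining, chunk_size)
            f.write(chunk[:n])
            remaining -= n


def find_mismatches(buf, value=0x25):
    """Return the offsets in `buf` whose byte differs from `value`."""
    return [i for i, b in enumerate(buf) if b != value]
```

Calling write_constant_file("src%d.bin" % i, 800 * 1024 * 1024) for i in range(6) would produce six files of the size mentioned above; writing in chunks keeps memory use small.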

Thanks and keep in touch
Bogdan

On Thu, Apr 12, 2012 at 11:14:37PM -0700, Bogdan D. wrote:

OK, I did not get many replies to this post, so I thought I should present what I
have investigated so far.

I tried to trace the root of the problem and firstly I simplified the use case
by filling the source files with just the same value (arbitrarily 0x18) and
looking for different values into
gnuradio-core/src/lib/runtime/gr_block_executor.cc:gr_block_executor::run_one_iteration()
that is in the loop for the TPB.

One note: if I change the TPB to the Single Threaded Scheduler the problem is gone,
but the point is to use one thread for each block.

This is also what I have observed in the past.

Looking for an error in the run_one_iteration() I see that the output of the
file sinks is not corrupted, the corruption appears only in run_one_iteration()
for the block I’m using, specifically d->input(i) has at a moment a corrupted
value.

Speaking of the corrupted value, what I see is actually a zero instead of the
expected data (0x18). The funny thing is that if I print the value twice, the
second print shows the correct 0x18 (I guess the Heisenberg principle plays its
part here :) )

Where does d->input(i) come from? It is the gr_block_detail that gets set up
when connecting blocks to each other. This is probably the next step to investigate.

The results so far: mostly learning gnuradio internals; the problem is still hidden.

Bogdan,

can you cook up a test case that (sometimes) fails the way you describe
and upload it somewhere? I’d love to join the hunt.
My initial guess about this problem was that the TPB scheduler is the
source of the randomness, and that the actual bug is a race condition
which only appears when the flow graph runs unthrottled, as fast as it
can. I was really hoping the file source was to blame; that would have
been much easier to debug.

MB

