Dropped stream tags with mm & pfb clock recovery/sync blocks

Hi,

I’ve run into some issues with two of the clock sync/recovery blocks shipped
in the gr-digital package (maint branch), concerning stream tag propagation
and non-deterministic behavior.

I use a file_source block to read previously captured traces and tag some of
the samples before the stream is handled by the clock recovery/sync block.
After the clock block, only a few (the first) samples remain tagged. I tried
using a throttle block, but that did not fix the issue. Since a custom clock
recovery block works fine, I tried to find the cause and potential
workarounds.

I’m sharing my findings so that someone with more knowledge of the clock
recovery blocks might find real fixes.

  1. clock_recovery_mm_ff_impl
    In this block the relative rate (rr) is set by:
    set_relative_rate (1.0 / omega);

Enabling the scheduler to update the rr with
enable_update_rate(true);

causes it to deviate slightly (4th/5th decimal) from the set rate but
fixes the issue of “dropped tags”.
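Concretely, the workaround is a one-line addition next to the existing
set_relative_rate() call in the constructor of clock_recovery_mm_ff_impl
(a sketch of the relevant fragment only; the surrounding code is unchanged):

// existing call: the block declares a fixed relative rate of 1/omega
set_relative_rate(1.0 / omega);
// proposed workaround: let the scheduler refine that rate from the actual
// produced/consumed counts, so tag offsets are scaled with the rate the
// block really achieves rather than the nominal one
enable_update_rate(true);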

  2. pfb_clock_sync_fff_impl.cc
    This one proved to be a little tricky. I think there are two issues with
    this block:

a) Producing outputs without input
The block produces many, many samples without input. Thus, the
scheduler-controlled rr goes through the roof (>70.0) for quite a few calls
to the work function. This really messes up the tag propagation.

b) Non-deterministic behavior:
Small tests with the following topology:
same file -> file source -> pfb -> file sink
run for multiple iterations produce different outputs, so the block itself
seems to be non-deterministic. A throttle block helps most of the time, but
in my opinion it should not be needed.
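For reference, a minimal C++ sketch of that test topology is below. The tap
design and parameter values are placeholders for illustration only, not
necessarily the settings used for the captured trace.

#include <cmath>
#include <vector>
#include <gnuradio/top_block.h>
#include <gnuradio/blocks/file_source.h>
#include <gnuradio/blocks/file_sink.h>
#include <gnuradio/digital/pfb_clock_sync_fff.h>
#include <gnuradio/filter/firdes.h>

int main()
{
  const double sps = 4.0;                    // placeholder samples/symbol
  const float loop_bw = 2.0 * M_PI / 100.0;  // placeholder loop bandwidth
  const unsigned int nfilts = 32;            // number of polyphase arms

  // placeholder prototype matched filter for the polyphase bank
  std::vector<float> taps = gr::filter::firdes::root_raised_cosine(
      nfilts, nfilts * sps, 1.0, 0.35, static_cast<int>(11 * sps * nfilts));

  gr::top_block_sptr tb = gr::make_top_block("pfb_determinism_test");

  gr::blocks::file_source::sptr src =
      gr::blocks::file_source::make(sizeof(float), "in", false);
  gr::digital::pfb_clock_sync_fff::sptr sync =
      gr::digital::pfb_clock_sync_fff::make(sps, loop_bw, taps, nfilts);
  gr::blocks::file_sink::sptr sink =
      gr::blocks::file_sink::make(sizeof(float), "out_0");

  tb->connect(src, 0, sync, 0);
  tb->connect(sync, 0, sink, 0);

  // run to completion on the captured file; repeating this run and comparing
  // the md5 sums of the output files exposes the non-deterministic behavior
  tb->run();
  return 0;
}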

Potential causes and workarounds follow. They fix the tag propagation issue
but do not fully fix the non-deterministic behavior:

I) The “in” pointer should be initialized to the first new tag, not the
beginning of the history, as count+d_out_idx might become negative.
Thus:
out[i+d_out_idx] = d_filters[d_filtnum]->filter(&in[count+d_out_idx]);
might produce bad results.

II) After skimming through a paper on pfb [1], imho the counter “count”
should only be in-/decremented once per over-/underflow using the 1/N
architecture.

III) Is there really a (maybe indirect) check that no samples are produced
if no input is available? Both checks that break the main loop are against
noutput_items.
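For context, the forecast relationship looks roughly like the following
(paraphrased from memory of the float implementation; I may be off on the
exact expression), which is why I wonder where the “no input, no output”
guarantee is supposed to come from:

// Rough shape of the block's forecast(): it tells the scheduler how many
// input items are needed for the requested number of output items, but it
// does not by itself stop general_work() from producing items while the
// count-based input bookkeeping advances by the wrong amount.
void pfb_clock_sync_fff_impl::forecast(int noutput_items,
                                       gr_vector_int &ninput_items_required)
{
  unsigned ninputs = ninput_items_required.size();
  for(unsigned i = 0; i < ninputs; i++)
    ninput_items_required[i] = (noutput_items + history()) * (d_sps / d_osps);
}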

Regards,
Alexander

[1] Harris, F.J. and Rice, M., “Multirate digital filters for symbol
timing synchronization in software defined radios”, IEEE Journal on
Selected Areas in Communications, Vol. 19, No. 12, 2001

Alexander,

once every year or so a message pops up with someone describing
non-deterministic behaviour using file sources, but it never gets
followed up because it’s so hard to reproduce.

Any chance you could provide the actual source files and flow graphs, so
other people can try this and see for themselves?

M

Hello Martin,

as the test source file is about 1.3 MB, I uploaded everything to the
following URL:
http://sys.cs.uos.de/gnuradio/

There you will find 3 files:
in: float trace previously captured
check.py: script to run the tests
check_md5.py: script to compute and compare the resulting md5 hashes

Executing
./check.py --test pfb in
will perform 100 iterations and create 100 output files out_n (each about
220 kB) for the pfb test case.

After it has finished you can run
./check_md5.py
This will calculate the md5 sums of the previously created files. It
will print a summary of unique md5 sums and create a file called md5
containing some details (file, md5, size).

You should get more than one md5 sum/file size as a result after one run,
but you might need to increase the number of iterations or run it 2-3 times.

For reference you can also use the “io” or “mm” test case, which should
produce deterministic results.

Regards
Alexander


By the way, I looked at the example you sent here, and I see what you mean.
The signal you have in that file is pretty simple, though, and we should be
able to get it to lock. But I’m not seeing proper behavior out of the system.
I haven’t spent much more time than verifying things, though, and suspect I
won’t find time in the near future, unfortunately.

In other words, I get your point about the behavior. These problems tend to
result from improper settings of the block. But if you figure out a patch
that works for you, send it along and we’ll look into it.

Tom

On Thu, Aug 21, 2014 at 5:34 PM, Alexander B.
[email protected]
wrote:

Enabling the scheduler to update the rr with
enable_update_rate(true);

causes it to deviate slightly (4th/5th decimal) from the set rate but
fixes the issue of “dropped tags”.

Which is exactly what the enable_update_rate function was designed to do.

  2. pfb_clock_sync_fff_impl.cc
    This one proved to be a little tricky. I think there are two issues with
    this block:

a) Producing outputs without input
The block produces many, many samples without input. Thus, the
scheduler-controlled rr goes through the roof (>70.0) for quite a few calls
to the work function. This really messes up the tag propagation.

What? I don’t think that I understand what you’re saying. That block should
not be producing anything if there are no inputs.

I) The “in” pointer should be initialized to the first new tag, not the
beginning of the history, as count+d_out_idx might become negative. Thus:
out[i+d_out_idx] = d_filters[d_filtnum]->filter(&in[count+d_out_idx]);
might produce bad results.

II) After skimming through a paper on pfb [1], imho the counter “count”
should only be in-/decremented once per over-/underflow using the 1/N
architecture.

It is. What line(s) are you referring to? Lines 435-444 of that file take
care of this case.

III) Is there really a (maybe indirect) check that no samples are produced
if no input is available? Both checks that break the main loop are against
noutput_items.

The forecast function provides a relationship between the number of input
items and the number of output items. I’m at a loss for why you are seeing
this.


A number of us have successfully used this block to capture real, live,
over-the-air data. Some of what you’re suggesting here means that we wouldn’t
be able to get correct data with it in bursty transmissions, if I understand
correctly.

Much like what Martin said about providing an example, have you tried your
modifications? If yes, do you have a patch we could test?

Also, you are using a loop bandwidth value of 0.3, which is about an order of
magnitude higher than it should be. Check out
share/gnuradio/examples/digital/example_timing.py.

Tom

Hi Tom,

answers inline, cropped some lines to maintain readability:

On 22/08/14 16:34, Tom R. wrote:

On Thu, Aug 21, 2014 at 5:34 PM, Alexander B. [email protected]
wrote:
Enabling the scheduler to update the rr with
enable_update_rate(true);
fixes the issue of “dropped tags”.

Which is exactly what the enable_update_rate function was designed to do.

Ok, so I’ll keep that as a fix.

  2. pfb_clock_sync_fff_impl.cc
    a) Producing outputs without input
    The block produces many, many samples without input. Thus, the
    scheduler-controlled rr goes through the roof (>70.0) for quite a few
    calls to the work function. This really messes up the tag propagation.

What? I don’t think that I understand what you’re saying. That block should
not be producing anything if there are no inputs.

My mistake, it should have said “new inputs” to make sense. I think it is
related to II) and III), so please see below (III).

It is. What line(s) are you referring to? Lines 435-444 of that file take
care of this case.

In the float implementation the equivalent lines are 398-407, but of
course the check is the same. My issue with those two loops is that the
count variable is in-/decremented multiple times. From the file:

// If we’ve run beyond the last filter, wrap around and go to next sample
while(d_filtnum >= d_nfilters) {
  d_k -= d_nfilters;
  d_filtnum -= d_nfilters;
  count += 1;
}

Depending on the loop, “count” might be incremented multiple times and
does not necessarily point to the next sample. A quick and dirty
proposal would be something along the following lines:

if(d_filtnum >= d_nfilters) {
  while(d_filtnum >= d_nfilters) {
    d_k -= d_nfilters;
    d_filtnum -= d_nfilters;
  }
  count += 1;
}

Same goes for the second case, stuffing the previous sample by
decrementing “count”.
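Assuming the underflow branch in the shipped code mirrors the overflow loop
above (i.e. a while(d_filtnum < 0) loop with count -= 1 inside), the mirrored
change would look roughly like this:

// Underflow case: wrap the filter index back into range, but step the
// input pointer back by at most one sample, mirroring the fix above.
if(d_filtnum < 0) {
  while(d_filtnum < 0) {
    d_k += d_nfilters;
    d_filtnum += d_nfilters;
  }
  count -= 1;
}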

III) Is there really a (maybe indirect) check that no samples are produced
if no input is available? Both checks that break the main loop are against
noutput_items.

The forecast function provides a relationship between the number of input
items and the number of output items. I’m at a loss for why you are seeing
this.

As the “count” variable is also used for the consume_each() call, it seems
that, in combination with the multiple decrements from II), too many items
might be produced without consuming anything.
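A simple way to see this in numbers would be some hypothetical
instrumentation at the end of general_work() (assuming it ends by consuming
“count” items and returning “i” produced items):

// Hypothetical debug output, not part of the shipped block (needs
// <iostream>): when count stays at or near zero while i keeps growing, the
// scheduler's estimate of the relative rate (roughly produced/consumed)
// explodes, matching the >70.0 values mentioned above.
if(count == 0 && i > 0)
  std::cerr << "pfb_clock_sync: produced " << i
            << " items without consuming any input" << std::endl;
consume_each(count);
return i;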

Adding those workarounds seems to improve the non-deterministic behavior,
but it still does not produce the same results every time. Comparing the
details in the md5 file created by the posted script, it seems that all
“wrong” results have additional samples (bigger file sizes) compared to the
“right” result.

A number of us have successfully used this block to capture real, live,
over-the-air data. Some of what you’re suggesting here means that we wouldn’t
be able to get correct data with it in bursty transmissions, if I understand
correctly.

Of course it is always a possibility that I’m just doing something wrong.
But if not, there might be cases where the block produces usable, but
technically speaking incorrect, data.

Much like what Martin said about providing an example, have you tried your
modifications? If yes, do you have a patch we could test?

I tried my modifications and they fixed the tag propagation issue for both
clock recoveries. They also reduced the non-deterministic behavior of the
pfb block. But as it is still not fully deterministic, I have not created a
patch to test yet.

Also, you are using a loop bandwidth value of 0.3, which is about an order
of magnitude higher than it should be. Check out
share/gnuradio/examples/digital/example_timing.py.

Thanks for the hint, I’ll definitely check out that example.

Tom

Regards
Alexander