Scheduler Issues with Branch Prediction

Hello everyone,

I’ve been having the same issue with the scheduler for the past two months
or so, and I’m starting to unravel what I believe is occurring. Let me
describe my flowgraph first, then what happens when I run it, and finally
some REALLY interesting things I’ve noticed, along with my thoughts. I
apologize for the length of this in advance! :) This isn’t so much a
“help me!” post as it is a “what is happening here, and is there a problem
with the scheduler or with my thinking?” post.

My flowgraph consists of:

File source → Custom Encode → Custom Server Query → Custom Server Query
→ Custom Decode → Bit Error Rate Calculator

The encode scheme I am using is rather simple: it splits a byte into four
“flags”, each representing one of the four possible values of two bits
(00, 01, 10, 11). The first “Custom Server Query” sends these flags to our
server, which stores them and returns a reference number; that reference
number is the output of the block. The second query block performs this
process in reverse. Overall, the actual implementations of these blocks are
extremely simple and unimportant here, except that they all contain
if-statement chains that perform different processing based on the incoming
flag. (This is important later.)
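
To make that concrete, the encode step amounts to roughly this (a
simplified sketch with made-up names, not my actual block code; showing the
2-bit pairs MSB-first is an assumption of the sketch):

// Sketch of the encode idea: each input byte becomes four 2-bit “flags”
// (00, 01, 10, 11), one per bit pair.
void encode_byte(unsigned char byte, unsigned char flags[4])
{
    for (int i = 0; i < 4; i++)
        flags[i] = (byte >> (6 - 2 * i)) & 0x3; // MSB-first pair order
}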

Now, I expect the entire file to transfer over with a BER of 0. What I find
is that only a relatively short stretch transfers successfully before the
data on the other end becomes completely jumbled. I cannot ascertain what
sort of jumbling occurs (i.e., dropping, shifting, delays) because it
appears to change each run and to happen at different file locations.

I don’t have a throttle after the file source because it does not work.
Extreme example: I can set the throttle rate to 1 and it will still process
at the fastest rate possible. If I put any other block between the file
source and the throttle, then the throttle works as it should. It doesn’t
fix the jumbling issue, but the throttle acts as it was designed to,
bursting with an average of the set value. That’s weird thing #1: why does
the throttle work if there is a block between it and the file source, but
not if directly connected?

Next, I determined that the jumbling doesn’t happen for small files (21 kB)
of random content, but it does for larger ones. Why does randomness matter?
Because if I put sequential data through (0x0 0x1 0x2 0x3 0x0 0x1 0x2 0x3
0x4 …), it works perfectly at any file length and any flowgraph speed. This
is a little unexpected, but not unheard of, as it appears to be an artifact
of branch prediction failure (if you are unfamiliar, the Stack Overflow
question “Why is processing a sorted array faster than processing an
unsorted array?” has a good discussion). This would be a factor because my
blocks have if statements keyed on the type of the incoming flag.
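
For anyone unfamiliar with the effect, here is a toy standalone program
(illustrative only, not my real blocks) with an if-chain the same shape as
mine. Sorting the flags first makes it run dramatically faster on my
machines, yet the computed result is identical either way:

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

int main()
{
    std::vector<unsigned char> flags(1 << 22);
    for (std::size_t i = 0; i < flags.size(); i++)
        flags[i] = std::rand() & 0x3;          // random 2-bit flags
    // std::sort(flags.begin(), flags.end());  // uncomment for the “sequential” case

    long acc = 0;
    for (std::size_t i = 0; i < flags.size(); i++) {
        if (flags[i] == 0)      acc += 1;      // if-chain like my blocks use
        else if (flags[i] == 1) acc += 2;
        else if (flags[i] == 2) acc += 3;
        else                    acc += 4;
    }
    std::printf("%ld\n", acc);                 // same answer, sorted or not
    return 0;
}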

After that, I put a sleep() in my encode block to control execution speed
more uniformly for testing. Even then, the lowest sleep value I can use for
random file data before the jumbling starts is around 300 microseconds. For
sequential data I can omit the sleep altogether, meaning it can process at
the fastest possible rate.
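
For reference, the “sleep” is nothing fancier than this, called once per
work invocation in the encode block (the value was found empirically):

#include <unistd.h>

// Temporary rate limiting: a ~300 us floor per iteration keeps random
// data intact in my tests.
static void rate_limit_iteration()
{
    usleep(300); // microseconds
}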

This is the opposite of what I would expect. If this were a buffer overrun
issue, why would the jumbling only happen when the flowgraph runs slower?

After some thinking, I’m starting to become suspicious of the threading
model. I’ll admit that I haven’t fully reviewed the threading code (so: am
I making any obvious blunders in the thinking below?). My idea of the issue
boils down to threads becoming unsynchronized. Executing with sequential
data would perform all encoding and decoding tasks at optimal speed, with
ideally no branch prediction failures, such that each thread’s time slice
of general work would take about the same time to execute. With random
data, branch prediction would surely fail almost every time, but the times
it didn’t fail would speed up those particular iterations greatly. Is it
possible these threads are getting jumbled when this occurs a few times in
a row, creating a “bubble” of fast execution, with the data NOT being
guaranteed FIFO?

Unless there is another feature of the scheduler I am unaware of and I am
overthinking the importance of branch prediction here, I cannot see why
sequential data produces no problems. Even if I am completely wrong in my
reasoning, that doesn’t explain away the fact that the jumbling still
occurs. I’ve tried this on multiple machines with different versions of
Linux, but always with GNU Radio version 3.6.4.1, using both
built-from-source installs and the Ettus binary.

Does anyone have any thoughts on this? Where are the guts of the scheduler
located in the repo? I can find the files for a block’s thread, but not the
file that creates and controls the threads. For now, I’m just using sleep()
in my encode block as a temporary fix.

Best,
Ron S.

On Mon, Jul 15, 2013 at 2:29 PM, Ron S. [email protected] wrote:

Extreme example: I can set the throttle rate to 1 and it will still process
at the fastest rate possible. If I put any other block between the file
source and the throttle, then the throttle works as it should. It doesn’t
fix the jumbling issue, but the throttle acts as it was designed to,
bursting with an average of the set value. That’s weird thing #1: why does
the throttle work if there is a block between it and the file source, but
not if directly connected?

That should be a clue. The throttle block absolutely works right after
a file source. It always has in the thousands of cases I’ve tried and
seen. If it’s not working in this way, I would look to your code for
doing something wrong, possibly interacting with the scheduler
incorrectly (are you consuming and producing at the right rates?).
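
To be explicit about what I mean by rates: in a gr::block-derived block
that maps items 1:1, general_work() should account for every item it
touches, along these lines (an illustrative sketch, not your code;
process() is a stand-in for whatever the block does per item):

#include <algorithm>

int my_block::general_work(int noutput_items,
                           gr_vector_int &ninput_items,
                           gr_vector_const_void_star &input_items,
                           gr_vector_void_star &output_items)
{
    const unsigned char *in = (const unsigned char *)input_items[0];
    unsigned char *out = (unsigned char *)output_items[0];

    // Never touch more items than are available on either side.
    int n = std::min(noutput_items, ninput_items[0]);
    for (int i = 0; i < n; i++)
        out[i] = process(in[i]); // per-item transform (placeholder)

    consume_each(n);             // tell the scheduler how many items were read
    return n;                    // ... and how many were produced
}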

After some thinking, I’m starting to become suspicious of the threading
model. [...] Is it possible these threads are getting jumbled, with the
data NOT being guaranteed FIFO? [...] Where are the guts of the scheduler
located in the repo? I can find the files for a block’s thread, but not the
file that creates and controls the threads. For now, I’m just using sleep()
in my encode block as a temporary fix.

Ron,

I’ve been over the thread-per-block scheduler code a lot in the past few
years. What you are suggesting here is, frankly, impossible. And I don’t
think that branch prediction has anything to do with it. I haven’t looked
at the link you sent, but branch prediction misses should only slow down
the processing; they have no effect on the data.

I think you’ll have to look closer at your own blocks for the problem.
Have you written QA for each of the blocks? Best to verify that each
one works as expected on its own before trying to put them all in line
together.

Tom

Indeed, if branch-prediction misses in the CPU caused actual visible
effects on data integrity, thousands and thousands of pieces of software
would be negatively impacted, not just GNU Radio, and not just
multi-threaded applications.

All of the implementation “tricks” that are buried in the CPU, like branch
prediction, out-of-order execution, and pipelining, are completely and
utterly transparent to executing code. If that weren’t the case, there’d
be chaos.

Indeed, it sounds like a race condition to me. It could just as easily be
due to a cache miss changing the timing as anything else.

Thanks for the responses.

That should be a clue. The throttle block absolutely works right after
a file source. It always has in the thousands of cases I’ve tried and
seen.

I’m still having this problem, even after creating the simplest flow
graph possible:

File source -> Throttle (val = 1) -> Terminal Sink.

Terminal sink has the following code:
const unsigned char *inputByte = (const unsigned char *)input_items[0];
std::cerr << (unsigned char)inputByte[0];
consume_each(1);
return 1;

I understand this is not ideal and could loop over the available items, but
this is the simplest I could reduce the terminal sink to. I assume there’s
something here I’m doing incorrectly, and that’s why it is not working
properly? My original flowgraph does not use the terminal sink, so it would
have to be a problem common to the other blocks, due to a consistent
programming error on my part (embarrassing, but it would explain a lot).

I’ve been over the thread-per-block scheduler code a lot in the past
few years. What you are suggesting here is, frankly impossible.

Totally understand and agree. I’ll admit it was a bit careless to assume
and speculate so much, but I’m simply flummoxed as to why sequential data
works while random data does not. I’ve never heard of any bug or feature
(other than branch prediction) that would care whether the data was random
or not. The fact that file size contributes is confusing as well. I’ve done
much testing on the individual blocks, and everything seems to check out.

I suspect the change in behavior observed by the original poster is
caused by a race condition that manifests itself differently depending
on the relative speed of different flowgraph threads, to which branch
prediction failure rate would contribute.

This was certainly my first thought about what could be happening,
especially given the server communication that’s used and the threading
model it relies on. But after removing the server references from the
flowgraph, the issue still occurs. The encode, decode, and BER blocks do
not share any common variables, and certainly no variables that persist
outside of the work function. Is it possible a race condition is still
occurring, even with nothing shared among the blocks?

On 07/17/2013 10:48 AM, Ron S. wrote:

I’m still having this problem, even after creating the simplest flow
graph possible:

File source -> Throttle (val = 1) -> Terminal Sink.

Terminal sink has the following code:
const unsigned char *inputByte = (const unsigned char *)input_items[0];
std::cerr << (unsigned char)inputByte[0];
consume_each(1);
return 1;

What parent class are you deriving your “Terminal Sink” block from?
Normally, you don’t call consume_each() in anything but a block derived
from gr::block directly. Otherwise, the other parent classes take care
of this for you, based on the return value of work().
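
If you do stay with gr::block for the sink, the usual shape is to consume
everything available on each call, something like the sketch below (keeping
your std::cerr output; as I understand it, on a block with no output ports
the return value mostly just signals completion, and consume_each() is what
actually advances the stream):

#include <iostream>

int terminal_sink::general_work(int noutput_items,
                                gr_vector_int &ninput_items,
                                gr_vector_const_void_star &input_items,
                                gr_vector_void_star &output_items)
{
    const unsigned char *in = (const unsigned char *)input_items[0];
    for (int i = 0; i < ninput_items[0]; i++)
        std::cerr << in[i];          // dump every byte we were given
    consume_each(ninput_items[0]);   // mark all of them consumed
    return 0;                        // a sink produces no output items
}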

On 07/17/2013 10:12 AM, Marcus L. wrote:

Indeed, if branch-prediction misses in the CPU caused actual visible
effects on data integrity, thousands and thousands of pieces of software
would be negatively impacted, not just GNU Radio, and not just
multi-threaded applications.

I suspect the change in behavior observed by the original poster is
caused by a race condition that manifests itself differently depending
on the relative speed of different flowgraph threads, to which branch
prediction failure rate would contribute.

At least that is where I would start debugging efforts.

On Wed, Jul 17, 2013 at 11:36 AM, Ron S. [email protected] wrote:

What parent class are you deriving your “Terminal Sink” block from?
Normally, you don’t call consume_each() in anything but a block derived
from gr::block directly. Otherwise, the other parent classes take care
of this for you, based on the return value of work().

I am deriving from gr_block and overriding general_work. Does the
return value look correct in this case? I understand what the value
means in a typical block, but what does it mean specifically on a sink
block?

Can you just post the header and implementation files?

What parent class are you deriving your “Terminal Sink” block from?
Normally, you don’t call consume_each() in anything but a block derived
from gr::block directly. Otherwise, the other parent classes take care
of this for you, based on the return value of work().

I am deriving from gr_block and overriding general_work. Does the
return value look correct in this case? I understand what the value
means in a typical block, but what does it mean specifically on a sink
block?

I have posted them on Pastebin, here:

Header:

Implementation:

I’m not sure what format is best for posting files to this mailing
list. If Pastebin is unacceptable, please just let me know and I will
not use it again in the future.