On Wed, Jun 25, 2014 at 3:40 AM, Jan Krämer [email protected]
right in time for Midtermevalutaions I have a bigger update for you.
Viterbi Decoder is now fully Volkified, but results are a bit meh
You can read more on my blog
I only recently read your blog, so I may not be familiar with the
overall objectives. But I have some questions and comments.
How essential is it for you to compute full Euclidean metrics?
The general case for convolutional decoding does call for Euclidean
metrics, but most practical FEC use cases will use binary codes and
equal power symbols to which the distance calculation reduces to a
basic dot product. Furthermore, using sign changes, the branch metric
unit simplifies to a max-sum operation and the metrics can be computed
without the use of multipliers. If you look at the SPIRAL code that
Tom mentioned, you’ll see that that multiplication is not used. But,
those are specific cases, and if the objective necessitates full
Euclidean metrics, than one needs to compute Euclidean metrics.
for (int k = 0; k < _PACKETLENGTH; ++k)
obsValLSB = _mm_load1_ps(obsPtr++);
distLB = _mm_sub_ps(obsValLSB,tValLSB);
distLB = _mm_mul_ps(distLB, distLB);
obsValMSB = _mm_load1_ps(obsPtr++);
distHB = _mm_sub_ps(obsValMSB,tValMSB);
distHB = _mm_mul_ps(distHB, distHB);
metricVal = _mm_add_ps(distHB, distLB);
metricPtr += 4;
Regarding the code, the main performance limitation you’ll run into
with this segment is pipeline utilization. If you look at the argument
dependencies, and therefore register dependencies at the instruction
level, each math intrinsic is dependent on the results of the prior
instruction. This creates read-after-write (RAW) hazards, and will
inevitably lead to a pipeline stall at some point. Both the compiler
and the CPU will aggressively attempt to reorder instructions to avoid
this scenario, but there’s very little to reorder in this case.
What may help, then, is to give the compiler and cpu some help by
unrolling the loop such that there is more code between branch points.
That will allow instructions to be rearranged to alleviate the
argument dependencies. Looking again at the SPIRAL example, the entire
vertical pass through the 64-state trellis is fully unrolled leaving
only a single conditional for normalization purposes. SPIRAL is an
extreme example, but the above code will likely benefit from a longer
Hope that helps.