On Wed, Jun 25, 2014 at 3:40 AM, Jan KrĂ¤mer [email protected]

wrote:

right in time for Midtermevalutaions I have a bigger update for you.

Viterbi Decoder is now fully Volkified, but results are a bit meh

You can read more on my blog

http://spectrejan.blogspot.de/

Hi Jan,

I only recently read your blog, so I may not be familiar with the

overall objectives. But I have some questions and comments.

How essential is it for you to compute full Euclidean metrics?

The general case for convolutional decoding does call for Euclidean

metrics, but most practical FEC use cases will use binary codes and

equal power symbols to which the distance calculation reduces to a

basic dot product. Furthermore, using sign changes, the branch metric

unit simplifies to a max-sum operation and the metrics can be computed

without the use of multipliers. If you look at the SPIRAL code that

Tom mentioned, youâll see that that multiplication is not used. But,

those are specific cases, and if the objective necessitates full

Euclidean metrics, than one needs to compute Euclidean metrics.

for (int k = 0; k < _PACKETLENGTH; ++k)

{

obsValLSB = _mm_load1_ps(obsPtr++);

distLB = _mm_sub_ps(obsValLSB,tValLSB);

distLB = _mm_mul_ps(distLB, distLB);

```
obsValMSB = _mm_load1_ps(obsPtr++);
distHB = _mm_sub_ps(obsValMSB,tValMSB);
distHB = _mm_mul_ps(distHB, distHB);
metricVal = _mm_add_ps(distHB, distLB);
_mm_store_ps(metricPtr, metricVal);
metricPtr += 4;
```

}

Regarding the code, the main performance limitation youâll run into

with this segment is pipeline utilization. If you look at the argument

dependencies, and therefore register dependencies at the instruction

level, each math intrinsic is dependent on the results of the prior

instruction. This creates read-after-write (RAW) hazards, and will

inevitably lead to a pipeline stall at some point. Both the compiler

and the CPU will aggressively attempt to reorder instructions to avoid

this scenario, but thereâs very little to reorder in this case.

What may help, then, is to give the compiler and cpu some help by

unrolling the loop such that there is more code between branch points.

That will allow instructions to be rearranged to alleviate the

argument dependencies. Looking again at the SPIRAL example, the entire

vertical pass through the 64-state trellis is fully unrolled leaving

only a single conditional for normalization purposes. SPIRAL is an

extreme example, but the above code will likely benefit from a longer

loop section.

Hope that helps.

-TT