Hi Mostafa,

VOLK is but the Vector-Optimized Library of Kernels.

What you want is basically three operations:

a) finding the maximum absolute value
b) finding the average absolute value
c) dividing these two values

Now, looking closer at a) and b), one notices that both require the
samples to be converted to their magnitudes first. And because we’re in
the business of optimizing things, let’s just use the squared magnitude,
because that usually saves one sqrt per sample. So this boils down to

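a) take mag_squared of input (length N)
b1) find maximum of a)
b2) find sum of a)
c) sqrt(N * b1 / b2), i.e. the square root of "maximum of a)" over
   "mean of a)". Note that the sum of squares gives you the RMS, not
   the plain average of the magnitudes, so this is peak over RMS.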

As you can see, c) is not a vector operation, and thus not a case for
VOLK.

For a) (“Complex to Mag ^ 2”) there is a GNU Radio block that uses VOLK.
That’s the example for using VOLK that I would have recommended reading
anyway.

In other words: if you don’t have to write your own highly optimized
block, don’t use VOLK directly; use the standard GNU Radio blockset.
It’s rather optimized.

Now, for the maximum search b1), things are a bit more complicated.
Searching for a maximum is not *easily* vectorizable, because it is an
inherently sequential operation (think of it as the first pass of a
bubble sort).

Now, you can achieve *awesome* performance by basically turning your
linear search into an N-ary tree, with N (here not the input length)
being the order of parallelism you can achieve by using a
maximum-finding SIMD instruction. But that requires the size of the
problem to be a power of N. That just doesn’t fly well with the usually
more “multiple of 64 bit”-typey alignment restrictions.

You’re, however, highly encouraged to try just that: use the existing
volk_32f_x2_max_32f, which compares two vectors and stores the
element-wise maximum in a third one, to compare the first half of your
mag_squared vector with the second half, and repeat the same with the
first and second half of the result (and so on) until you have a single
maximum value. That’s the comparison tree from above for the N=2 case.
You can employ clever overlapping, i.e. feed some values into the
comparison twice (which doesn’t change a maximum), to virtually extend
your input’s length to a power of two, and then just waltz on.
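
A minimal sketch of that halving scheme, in plain C so it runs without VOLK. The function elementwise_max is a stand-in I made up for this example; it only mirrors the (output, input a, input b, number of points) argument order of volk_32f_x2_max_32f. The helper tree_max and its in-place overlap trick are likewise assumptions of this sketch, not existing library code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the element-wise maximum kernel:
 * out[i] = max(a[i], b[i]) for i in [0, n). */
static void elementwise_max(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] > b[i] ? a[i] : b[i];
}

/* Hypothetical helper: reduces buf[0..n) to its maximum by repeated halving.
 * The contents of buf are overwritten. */
float tree_max(float *buf, size_t n)
{
    assert(n > 0);
    while (n > 1) {
        size_t half = (n + 1) / 2; /* ceil(n / 2) */
        /* Compare the first `half` values with the last `half` values.
         * For odd n the middle element lands in both halves; using a value
         * twice does not change the maximum. Writing the result in place is
         * safe for this sequential stand-in because every element is read
         * before it is overwritten; with a real SIMD kernel a separate
         * output buffer may be the safer choice. */
        elementwise_max(buf, buf, buf + (n - half), half);
        n = half;
    }
    return buf[0];
}
```

This does about log2(n) calls to the element-wise maximum, each on half the data of the previous one, instead of n - 1 dependent scalar comparisons.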

For b2) you can simply use the “integrate” block, which is not VOLK
optimized (possibly because it’s a gengen template and these are *so*
much fun to specialize). But seeing as it is simply an accumulating for
loop, I kind of expect your compiler to make the best of the situation.

However, you can also use the volk_32f_accumulator_s32f VOLK kernel. I
kind of want to use that in integrate, because on my machine the SSE
VOLK kernel is 4 times as fast as the generic implementation, which
nicely matches the four floats that a single SSE SIMD instruction
processes at once.
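
To illustrate where a factor of four can come from, here is a plain-C sketch that keeps four independent partial sums, the way the four lanes of one SSE register would. It is not the VOLK kernel's actual source; the function name sum_four_lanes is made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sums n floats using four independent partial sums,
 * mimicking the four float lanes of an SSE register. */
float sum_four_lanes(const float *in, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    /* Main loop: four additions per iteration that do not depend on each
     * other, so one 4-wide SIMD add can do them all at once. */
    for (; i + 4 <= n; i += 4) {
        lane[0] += in[i];
        lane[1] += in[i + 1];
        lane[2] += in[i + 2];
        lane[3] += in[i + 3];
    }

    /* Horizontal add of the four lanes. */
    float total = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    /* Tail: the up to three leftover values when n is not a multiple of 4. */
    for (; i < n; ++i)
        total += in[i];

    return total;
}
```

Because the additions happen in a different order than in a simple running sum, the result can differ from the generic loop in the last few bits for general float input.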

Greetings,

Marcus