Hi Mostafa,
VOLK is but the Vector-Optimized Library of Kernels, a collection of
accelerated vector routines.
What you want is basically three operations:
a) finding maximum absolute
b) finding average absolute
c) dividing these two values
Now, looking closer at a) and b), one notices that both require the
samples to be converted to their magnitudes first. And because we’re in
the business of optimizing things, let’s just use the squared magnitude,
because that saves one sqrt per sample. So this boils down to
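a) take mag_squared of input (length N)
b1) find maximum of a)
b2) find sum of a)
c) sqrt(N * b1/b2), i.e. the peak divided by the RMS (the mean of the
   squares is not the square of the mean magnitude, so the plain
   average from above becomes an RMS here)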
As you can see, c) is not a vector operation, and thus not a case for
VOLK.
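
To make the a) / b1) / b2) / c) split concrete, here is a minimal plain C
sketch without any VOLK calls. The function name peak_to_average_power is
made up for illustration, and I'm assuming the complex samples arrive as
interleaved I/Q floats. It returns the power ratio N * max / sum; the
single sqrt from c) is left to the caller, which shows how little scalar
work remains at the end.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of VOLK or GNU Radio.
 * iq holds n complex samples as interleaved floats: I0, Q0, I1, Q1, ...
 * Returns N * max(|x|^2) / sum(|x|^2); the square root of that value
 * is the peak-to-RMS amplitude ratio. An all-zero input yields NaN. */
static float peak_to_average_power(const float *iq, size_t n)
{
    assert(n > 0);
    float max_sq = 0.0f; /* b1) running maximum of the squared magnitudes */
    float sum_sq = 0.0f; /* b2) running sum of the squared magnitudes */
    for (size_t i = 0; i < n; ++i) {
        float re = iq[2 * i];
        float im = iq[2 * i + 1];
        float mag_sq = re * re + im * im; /* a) squared magnitude, no sqrt */
        if (mag_sq > max_sq) {
            max_sq = mag_sq;
        }
        sum_sq += mag_sq;
    }
    return (float)n * max_sq / sum_sq; /* c) without the final sqrt */
}
```

For a single sample 3+4j followed by three zeros, this gives 4, so the
amplitude ratio would be 2.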
For a) (“Complex to Mag ^ 2”) there is a GNU Radio block that uses VOLK.
That’s the example for using VOLK that I would have recommended to read,
anyway.
In other words: If you don’t have to write your own highly optimized
block, don’t use VOLK directly, use the standard GNU Radio blockset.
It’s rather optimized.
Now, for the maximum search b1), things are a bit more complicated.
Searching for a maximum is not easily vectorizable, because it is an
inherently sequential operation (think of it as the first step of a
bubble sort).
Now, you can achieve awesome performance by basically turning your
linear search into an N-ary tree, with N (not the vector length from
above) being the order of parallelism you can achieve by using a
maximum-finding SIMD instruction. But that requires the size of the
problem to be a power of N. That just doesn’t fly well with the usually
more “multiple of 64 bit”-typey alignment restrictions.
You’re, however, highly encouraged to try just that: use the existing
volk_32f_x2_max_32f, which compares two vectors, and stores the
element-wise maximum in a third one, to compare the first with the
second half of your mag_squared vector, and repeat the same with the
first and second half of the result (and so on) until you have a single
maximum value. That’s the comparison tree from above for the N=2 case.
You can employ clever overlapping, i.e. feed some values into the
comparison twice, to virtually extend your input’s length to a power of
two, and then just waltz on.
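
A rough sketch of that halving scheme in plain C, so it runs without
VOLK installed: the inner loop is an element-wise maximum of two
vectors, which is the job volk_32f_x2_max_32f would take over. The name
tree_max is made up, and the overlap trick here is just one possible
choice: for an odd length, the two "halves" share the middle element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a VOLK kernel. Overwrites buf while reducing.
 * Each pass compares the first `half` elements with the last `half`
 * elements and keeps the element-wise maximum in the front of buf. */
static float tree_max(float *buf, size_t n)
{
    assert(n > 0);
    while (n > 1) {
        size_t half = (n + 1) / 2; /* round up so odd lengths are covered */
        size_t offset = n - half;  /* second "half"; overlaps by one if n is odd */
        /* Element-wise maximum of buf[0..half) and buf[offset..n).
         * Reads never touch an index that was already overwritten in
         * this pass, because offset + i >= i. */
        for (size_t i = 0; i < half; ++i) {
            float a = buf[i];
            float b = buf[offset + i];
            buf[i] = a > b ? a : b;
        }
        n = half;
    }
    return buf[0];
}
```

If you swap the inner loop for a real VOLK call, I'd write into a
separate scratch buffer rather than in place, since I have not checked
how the kernel behaves when input and output overlap.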
For b2) you can simply use the “integrate” block, which is not VOLK
optimized (possibly because it’s a gengen template and these are so
much fun to specialize). But seeing as it is simply an accumulating for
loop, I kind of expect your compiler to make the best of the situation.
However, you can also use the volk_32f_accumulator_s32f VOLK kernel. I
kind of want to use that in integrate, because on my machine, the SSE
VOLK kernel is 4 times as fast as the generic implementation, which
nicely matches the four floats a single SSE instruction processes at
once.
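
To illustrate where a factor of four can come from, here is a plain C
sketch that keeps four independent partial sums, mimicking the four
float lanes of an SSE register. It is not the VOLK implementation, and
sum_four_lanes is a made-up name. Note that summing in a different order
can change the floating-point rounding slightly compared to a plain
left-to-right loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: accumulates in four independent "lanes", the way
 * a 4-wide SIMD add would, then combines the lanes at the end. */
static float sum_four_lanes(const float *in, size_t n)
{
    assert(in != NULL || n == 0);
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    /* Main loop: four additions per iteration, none depending on another. */
    for (; i + 4 <= n; i += 4) {
        lane[0] += in[i];
        lane[1] += in[i + 1];
        lane[2] += in[i + 2];
        lane[3] += in[i + 3];
    }
    /* Horizontal add of the four lanes. */
    float total = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    /* Leftover samples when n is not a multiple of four. */
    for (; i < n; ++i) {
        total += in[i];
    }
    return total;
}
```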
Greetings,
Marcus