Volk branch on github

dubstep · February 15, 2012, 4:57am

There’s been a ton of work going on in getting us ready to really start
using Volk in GNU Radio blocks. Instead of repeating myself here, you can
see more about the who/what/when/why/how of the changes here:

The basic summary is that I’m seeing amazing performance results and I’m
very excited to get this into our project.

I’m really hoping that people can check out the branch and test it out
against their applications. A number of changes were made inside GNU Radio
and a handful of blocks have been converted to using Volk, and I’d like to
see how the performance compares. My own tests show great results, but I
have a pretty homogeneous setup (Linux/Ubuntu and Intel processors).

I should have another post on my website later this week discussing my
benchmark results for the Volk blocks, but anyone interested in testing it
out on their own should check out gnuradio-examples/python/volk_benchmark.
The README in that directory should help you understand what to do and how
to do it.

We would like to get this merged into GNU Radio master (and therefore
version 3.5.2) as soon as possible, so I would really appreciate early
feedback and bug reports.

Thanks!
Tom

tomrobinson · February 15, 2012, 12:08pm

On Tue, 2012-02-14 at 22:56 -0500, Tom R. wrote:

There’s been a ton of work going on in getting us ready to really
start using Volk in GNU Radio blocks. Instead of repeating myself,
here, you can see more about the who/what/when/why/how of the changes
here:

I think you copied the wrong link.
You probably meant:

Martin

tomrobinson · February 15, 2012, 12:46pm

I think it would make sense to change the volk interface by adding
kernel calls which can handle both alignment cases (aligned and
unaligned).
We would have to add an is_unaligned parameter to the volk kernel calls.

The gnuradio blocks would then change in the following way:

So instead of:

if(is_unaligned()) {
  for(size_t i = 1; i < input_items.size(); i++){
    volk_32fc_x2_multiply_32fc_u(out, out, (gr_complex*)input_items[i], noi);
  }
} else {
  for(size_t i = 1; i < input_items.size(); i++){
    volk_32fc_x2_multiply_32fc_a(out, out, (gr_complex*)input_items[i], noi);
  }
}

You would have:

for(size_t i = 1; i < input_items.size(); i++)
  volk_32fc_x2_multiply_32fc(is_unaligned(), out, out,
                             (gr_complex*)input_items[i], noi);

You halve the amount of code in gnuradio blocks, which in my opinion
makes it much more maintainable.
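
To make the proposal concrete, here is a minimal sketch of a single entry
point that takes the alignment flag. The names multiply_a, multiply_u,
multiply and multiply_all are hypothetical stand-ins, not the Volk API, and
both "kernels" are plain loops rather than SIMD code:

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical stand-ins for the _a and _u kernels. Real kernels would use
// aligned and unaligned SIMD loads; here both just compute c[i] = a[i] * b[i].
static void multiply_a(cfloat* c, const cfloat* a, const cfloat* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] * b[i];
}
static void multiply_u(cfloat* c, const cfloat* a, const cfloat* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] * b[i];
}

// Single entry point: the caller passes the flag, the library picks the kernel.
void multiply(bool is_unaligned, cfloat* c, const cfloat* a, const cfloat* b,
              std::size_t n) {
    if (is_unaligned) multiply_u(c, a, b, n);
    else              multiply_a(c, a, b, n);
}

// What the body of a block's work function could reduce to: the product of
// all input streams, with no alignment branch in the block itself.
void multiply_all(bool is_unaligned, cfloat* out,
                  const std::vector<const cfloat*>& inputs, std::size_t noi) {
    assert(!inputs.empty());
    std::copy(inputs[0], inputs[0] + noi, out);
    for (std::size_t i = 1; i < inputs.size(); i++)
        multiply(is_unaligned, out, out, inputs[i], noi);
}
```

The branch still exists, but it lives in one place instead of being
repeated in every block.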

Martin

tomrobinson · February 16, 2012, 4:07am

On Wed, Feb 15, 2012 at 6:45 AM, Martin DvH [email protected] wrote:

if(is_unaligned()) {

Martin
Martin,

I think that’s a good idea. The only real question is if we can (easily)
implement it with the runtime dynamics of the Volk calls. It basically
moves the decision from GNU Radio into Volk, but since we’re looking at
Volk as behind-the-scenes stuff, it’s more logical to place the
responsibility there than expose it to GR block developers.

Thanks,
Tom

tomrobinson · February 15, 2012, 7:28pm

You would have:

for(size_t i = 1; i < input_items.size(); i++)
  volk_32fc_x2_multiply_32fc(is_unaligned(), out, out,
                             (gr_complex*)input_items[i], noi);

You halve the amount of code in gnuradio blocks which to my opinion
makes it much more maintainable.

Here is a possible solution; I don’t know how viable it is.

Suppose that we have gotten to a head or tail case where the
number of samples isn’t an alignment multiple or the last call to work
ended us on a non-aligned boundary.
In an effort to re-align, the scheduler could memcpy the smallest
possible chunk into aligned memory, pad the length, call work, and
memcpy the result to the output buffer.
Now the next call to work will always be aligned. Also, the work
function never needs to change, and will always use the aligned call.
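
A rough sketch of that copy-through-aligned-memory idea, for a single float
input and output. Everything here is an assumption made for illustration:
the 16-byte alignment, the fixed scratch size, and the names work_aligned
and work_via_scratch are hypothetical and not part of the GNU Radio
scheduler:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kAlign = 16;                            // assumed SIMD alignment (bytes)
constexpr std::size_t kItemsPerVec = kAlign / sizeof(float);  // 4 floats per vector

inline bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlign == 0;
}

// Stand-in for a work function that only has an aligned code path: it needs
// aligned pointers and a length that is a multiple of kItemsPerVec.
void work_aligned(float* out, const float* in, std::size_t n) {
    assert(is_aligned(out) && is_aligned(in) && n % kItemsPerVec == 0);
    for (std::size_t i = 0; i < n; ++i) out[i] = 2.0f * in[i];
}

// Run a short unaligned chunk of n items through aligned scratch buffers:
// copy in, pad the length, call work, copy the valid results back out.
void work_via_scratch(float* out, const float* in, std::size_t n) {
    constexpr std::size_t kMax = 64;  // arbitrary scratch capacity for this sketch
    assert(n <= kMax);
    alignas(kAlign) float scratch_in[kMax] = {};  // zero padding past n
    alignas(kAlign) float scratch_out[kMax];
    const std::size_t padded = (n + kItemsPerVec - 1) / kItemsPerVec * kItemsPerVec;
    std::memcpy(scratch_in, in, n * sizeof(float));
    work_aligned(scratch_out, scratch_in, padded);
    std::memcpy(out, scratch_out, n * sizeof(float));
}
```

The sketch sidesteps the hard parts: multiple ports, mixed item sizes, and
blocks whose output count differs from their input count.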

I tried to implement this in gr block executor, but got confused trying
to handle all of the edge cases, like multiple IO ports with different
data types. And so, how the block would configure the scheduler in the
most generic of cases isn’t clear to me. But, even if it was
oversimplified, I still think it’s the better way to solve 90% of the use
cases.

Thoughts?

-Josh

tomrobinson · February 16, 2012, 8:09pm

Also, you never want to work on the smallest amount of memory possible.
This is covered in my discussion on my blog. Making arbitrarily small calls
to work functions causes much more overhead than just running the unaligned
version of a Volk call. I found this out pretty quickly when I started
looking into things. Better to process a large chunk to get back into
alignment than try to handle calls to small amounts of data.

Perhaps this is because you have a processor that doesn’t penalize you
for unaligned loads/stores.

-Josh

tomrobinson · February 16, 2012, 4:14am

On Wed, Feb 15, 2012 at 1:27 PM, Josh B. [email protected] wrote:

Now the next call to work will always be aligned. Also, the work

-Josh

Josh,
I already tried this approach. It was the first thing that I went after
when working on the alignment issues with the buffers. It becomes way too
big of a hassle to follow through with it in the end. It only really makes
sense to always move the data to an aligned buffer, but that’s too
expensive. Like you ran into, the corner cases and the issues of keeping
track are way too much. It makes the code confusing and fragile.

Also, you never want to work on the smallest amount of memory possible.
This is covered in my discussion on my blog. Making arbitrarily small calls
to work functions causes much more overhead than just running the unaligned
version of a Volk call. I found this out pretty quickly when I started
looking into things. Better to process a large chunk to get back into
alignment than try to handle calls to small amounts of data.
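
One way to picture the "large chunk" approach: when a call starts off a
boundary, size that one unaligned call so it ends on a boundary, taking as
many items as are available. The helper below is a hypothetical sketch of
the arithmetic only, not the scheduler code in the branch:

```cpp
#include <cassert>
#include <cstddef>

// Given the current position in the buffer (offset, in items), the number of
// items available, and the alignment multiple (in items), return the largest
// count <= available for which offset + count lands on a multiple. If no
// boundary is reachable, all available items are returned and the next call
// is still unaligned.
std::size_t items_to_realign(std::size_t offset, std::size_t available,
                             std::size_t multiple) {
    assert(multiple > 0);
    // Last alignment boundary at or before the end of the available data.
    const std::size_t end = (offset + available) / multiple * multiple;
    return end > offset ? end - offset : available;
}
```

With offset 3, 1000 items available and a multiple of 4, this gives 997
items in one unaligned call rather than a tiny 1-item call to reach the
next boundary.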

Tom

tomrobinson · February 16, 2012, 8:34pm

On 02/16/2012 11:24 AM, Tom R. wrote:

alignment than try to handle calls to small amounts of data.

Perhaps this is because you have a processor that doesn’t penalize you
for unaligned loads/stores.

-Josh

I tested this on a handful of different processors: Core2Duo, QuadCore, i7
(first gen), i7 (second gen) and they all told me the same thing. You are

For most if not all recent x86 processors there is no unaligned penalty.
You should be able to always call the unaligned volk routine and see no
difference in performance. I’m wondering about neon for example, which
has a penalty. And I suppose to a lesser extent, older x86 processors. I
don’t have numbers now, but I think the volk profiler can confirm this
about said processors.

-Josh

tomrobinson · February 16, 2012, 8:25pm

On Thu, Feb 16, 2012 at 2:08 PM, Josh B. [email protected] wrote:

Perhaps this is because you have a processor that doesn’t penalize you
for unaligned loads/stores.

-Josh

I tested this on a handful of different processors: Core2Duo, QuadCore, i7
(first gen), i7 (second gen) and they all told me the same thing. You are
still better doing unaligned loads in Volk than doing the generic loop.
Also, the overhead of calling the scheduler functions for small data items
is MUCH more costly than an unaligned load. Seriously, making these
arbitrarily small calls to the work function for alignment reasons, which
allowed me to always run aligned, made things run 4 - 5 times slower than
the non-Volk version of the block.

Tom

tomrobinson · February 16, 2012, 10:31pm

On Thu, Feb 16, 2012 at 2:08 PM, Josh B. [email protected] wrote:

Perhaps this is because you have a processor that doesn’t penalize you
for unaligned loads/stores.

-Josh

Which suggests this decision may need to be made on a
per-arch/processor basis, and therefore it may be most appropriate for
Volk to figure it out rather than the scheduler.

–
Doug G.
[email protected]

tomrobinson · February 16, 2012, 10:48pm

On 02/16/2012 01:30 PM, Douglas G. wrote:

There was some talk about making volk handle head cases (most kernels
already handle tail cases).

This would mean writing a volk_32f_x2_multiply_32f that calls
volk_32f_x2_multiply_32f_a and volk_32f_x2_multiply_32f_u based on the
boundary conditions.
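
As a sketch of what such a wrapper might look like: mul_a and mul_u below
are plain-loop stand-ins for the _a and _u kernels, the 16-byte alignment is
an assumption, and the return value (items handled by the aligned kernel)
exists only to make the behaviour visible. None of this is the actual Volk
implementation:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAlign = 16;  // assumed SIMD alignment in bytes

static std::size_t misalign(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kAlign;
}

// Stand-in for the aligned kernel: insists on aligned pointers.
static void mul_a(float* c, const float* a, const float* b, std::size_t n) {
    assert(misalign(c) == 0 && misalign(a) == 0 && misalign(b) == 0);
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] * b[i];
}
// Stand-in for the unaligned kernel: accepts any pointers.
static void mul_u(float* c, const float* a, const float* b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] * b[i];
}

// Wrapper without a suffix. If all buffers share the same misalignment, a
// short head goes through the unaligned kernel and the rest through the
// aligned one. Otherwise the whole call is unaligned.
std::size_t multiply_32f(float* c, const float* a, const float* b, std::size_t n) {
    const std::size_t m = misalign(c);
    if (misalign(a) != m || misalign(b) != m || m % sizeof(float) != 0) {
        mul_u(c, a, b, n);
        return 0;
    }
    std::size_t head = m ? (kAlign - m) / sizeof(float) : 0;
    if (head > n) head = n;
    mul_u(c, a, b, head);  // head case: items before the first boundary
    if (n > head) mul_a(c + head, a + head, b + head, n - head);
    return n - head;
}
```

The alignment checks are the part a generator would need described per
kernel, since it has to know which parameters are buffers and their item
sizes.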

Such a thing could be generated, so long as we have a way to convey to
the generator something about the parameters. Maybe we just need the
framework… and every time someone wants a volk kernel that handles
head and tail cases, they just fill in a few lines to the generator.

-Josh

tomrobinson · February 16, 2012, 8:40pm

On 02/16/2012 11:32 AM, Josh B. wrote:

unaligned

I tested this on a handful of different processors: Core2Duo, QuadCore, i7
(first gen), i7 (second gen) and they all told me the same thing. You are

For most if not all recent x86 processors there is no unaligned penalty.
You should be able to always call the unaligned volk routine and see no
difference in performance. I’m wondering about neon for example, which
has a penalty. And I suppose to a lesser extent, older x86 processors. I
don’t have numbers now, but I think the volk profiler can confirm this
about said processors.

The answer for neon is probably a case of “don’t do that”. In other
words, keep your blocks fed with aligned multiples, regardless of how
the scheduler handles things.

-Josh

tomrobinson · February 16, 2012, 10:57pm

On Thu, Feb 16, 2012 at 1:47 PM, Josh B. [email protected] wrote:

Which suggests this decision may need to be made on a

Such a thing could be generated, so long as we have a way to convey to
the generator something about the parameters. Maybe we just need the
framework… and every time someone wants a volk kernel that handles
head and tail cases, they just fill in a few lines to the generator.

The side benefit to this approach is it lets us get rid of the tacky _a/_u
suffixes and just use a single function call without the user worrying
about alignment.

–n

tomrobinson · February 17, 2012, 2:44pm

On 02/16/2012 11:39 AM, Josh B. wrote:

This is covered in my discussion on my blog. Making arbitrarily small

don’t have numbers now, but I think the volk profiler can confirm this
about said processors.

The answer for neon is probably a case of the “don’t do that”. In other
words, keep your blocks fed with aligned multiples, regardless of how
the scheduler handles things.

The answer is more like:

Aligned is better

and

if you are forced to choose between aligned loads and aligned stores,
align stores

but

start by using NEON and do not worry about alignment since the penalty
for unaligned access is not dreadful.

Philip