Google Summer of Code 2014 applicant : Optimization with VOLK

musicdenotation · February 23, 2014, 7:53am

Hello,
I have completed a Bachelor’s degree in Electrical Engineering from IIT
Bombay, India and will be joining a master’s program in Computer Science
in August. For the summer, I am interested in participating in GSoC 2014,
and GNU Radio is an organization where my background fits nicely.

I went through the ideas page and was particularly interested in doing
performance optimization with VOLK. After going through some online
documentation about the library and the SDR’12 paper, I realised that
the following areas need work:

1. Profiling GNU Radio code to identify new kernels and implementing
   them for existing Intel SIMD extensions, as well as porting kernels
   to other ISA extensions.
2. Better testing of the effects of more complex scheduler logic on
   larger environments (beyond simple kernels).
3. Exploring extension of VOLK to GPU ISAs, to leverage chips such as
   AMD Fusion (however, this seems to be more research than software
   development).

According to the GSoC proposal, point (1) seems to be the expectation.
Given this, I would like some advice on how to go ahead looking for
potential ideas (and some feedback on feasibility of the other ideas as
well)

My background : C++, Python, Signal Processing, Computer Architecture

Thanks,
Abhishek B.

casper_the_ghost · February 24, 2014, 6:17am

On Sun, Feb 23, 2014 at 12:52 AM, Abhishek B.
[email protected] wrote:

Hello,
I have completed a Bachelor’s degree in Electrical Engineering from IIT
Bombay, India and will be joining a masters program in Computer Science in
August. For the summer, I am interested in participating GSoC 2014 and GNU
Radio is an organization where my background fits nicely.

I went through the ideas page and was particularly interested in doing
performance optimization with VOLK.

Great to hear. Just keep in mind we have another ~13 hours before we,
as an organization, know whether we were accepted or not.

According to the GSoC proposal, point (1) seems to be the expectation. Given
this, I would like some advice on how to go ahead looking for potential
ideas (and some feedback on feasibility of the other ideas as well)

My background : C++, Python, Signal Processing, Computer Architecture

Thanks,
Abhishek B.

Abhishek,

Right, so points 1 and 2 are what I had in mind when I wrote the idea
on our list. Point 3 is technically possible to do in VOLK, but it’s
probably not really worth using GPUs in this way, since the transport
costs would dwarf any acceleration from the current VOLK kernels. That
said, there’s nothing really wrong with a proposal in a research area,
but I do think we would want code and something contributed back to the
community at the end of the project. Also, don’t let that prevent you
from submitting a proposal about GPU programming if that’s what you’re
interested in; it’s probably just not best targeted at VOLK. My
understanding of GSoC proposals is that you can submit any number, so
you can submit one for doing some GPU acceleration and another for
something more related to VOLK.

So for points 1 and 2 it would be good to see a specific algorithm or
module that you think would benefit from moving to VOLK, which would
take some research on your part. I think gr-atsc is a good place to
look for some acceleration gains, and it would be good to see that
application run in real time. One of the things I had in mind is
accelerating OFDM frame sync. Martin’s ofdm_{rx,tx} and gr-80211 are
good examples: they do the sync with blocks that have VOLK kernels in
them, but that’s somewhat inefficient because we move data in and out
of SIMD registers between kernels. Of course there’s the old trade-off
of modularization and code re-use vs. speed. I’d be glad to discuss
this and similar ideas more once we know we are accepted as an org.
There’s one more possibility that recently came up, but I’d like to
wait until things are official before recommending it (and I’ll need to
talk with other interested parties).
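
To illustrate the point about moving data in and out of registers, here
is a minimal sketch in plain C (no intrinsics) of a correlation computed
two ways: as two chained kernels with an intermediate buffer, and as one
fused kernel. All function names are hypothetical and are not existing
VOLK kernels; complex samples are assumed to be stored interleaved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only; not actual VOLK kernels.
 * Complex samples are interleaved: re0, im0, re1, im1, ... */

/* Chained step 1: out[n] = a[n] * conj(b[n]). */
static void mul_conj(float *out, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        out[2 * i]     = ar * br + ai * bi;
        out[2 * i + 1] = ai * br - ar * bi;
    }
}

/* Chained step 2: acc = sum of all complex samples in in[]. */
static void accumulate(float acc[2], const float *in, size_t n)
{
    acc[0] = 0.0f;
    acc[1] = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc[0] += in[2 * i];
        acc[1] += in[2 * i + 1];
    }
}

/* Two kernels: every product is written to scratch[] and read back. */
static void corr_chained(float acc[2], const float *a, const float *b,
                         float *scratch, size_t n)
{
    assert(scratch != NULL);
    mul_conj(scratch, a, b, n);
    accumulate(acc, scratch, n);
}

/* One fused kernel: products stay in local variables (registers). */
static void corr_fused(float acc[2], const float *a, const float *b, size_t n)
{
    float sum_re = 0.0f, sum_im = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        sum_re += ar * br + ai * bi;
        sum_im += ai * br - ar * bi;
    }
    acc[0] = sum_re;
    acc[1] = sum_im;
}
```

Both versions compute the same sum; the fused one makes a single pass
over the inputs and needs no scratch buffer, at the cost of being a more
specialized piece of code.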

At the moment the only mainstream ISA not being targeted is probably
AVX2, which has some nice features for the type of kernels we’re doing.
If you went that route, you would likely need to add protokernels to a
pretty large number of kernels.

Nathan

casper_the_ghost · February 24, 2014, 4:31pm

On Mon, Feb 24, 2014 at 12:15 AM, West, Nathan
[email protected] wrote:

I agree with Nathan that VOLK is probably not the right abstraction
for GPUs. The Fusion concept with the GPU and GPP on the same die is
compelling, but maybe too specific. There is another project called
gr-gpu that’s focusing on the GPU problem more generally.

sync, but that’s somewhat inefficient because we move data in and out
of SIMD registers. Of course
there’s the old trade-off of modularization and code re-use vs. speed.
I’d be glad to discuss this and
similar ideas more once we know we are accepted as an org. There’s one
more possibility that recently
came up, but I’d like to wait until things are official before
recommending it (and I’ll need to talk with
other interested parties).

We’re working with Andrew D. on updating gr-atsc
(https://github.com/glneo/gnuradio/tree/atscfixup). If you decide to
focus on ATSC speedups with VOLK, look into that project instead of
the one inside gnuradio (which will be deprecated).

Tom

casper_the_ghost · February 24, 2014, 9:34pm

On Mon, Feb 24, 2014 at 9:00 PM, Tom R. [email protected] wrote:

Firstly, congratulations on being accepted as a mentor organization.

Thanks for the pointers to gr-atsc and gr-80211. I have started looking
there as a starting point. Are there similar modules which are
undergoing VOLK speedup fixes? I am also trying to meet up with other
people who have been using GNU Radio to identify potential modules for
acceleration. As you are now a mentor organization, I feel it’s a good
time for us to get into detailed discussions.

At the moment the only mainstream ISA not being targeted is probably
AVX2, which has
some nice features for the type of kernels we’re doing. If you went
that route it would likely need add
protokernels to a pretty large number of kernels.

Nathan

This also seems promising, though I guess it would require me to come
up to speed with AVX2 (which I would love to do). Could you please
elaborate a little on the kind of beneficial features you have in mind?
I am concerned that the job of adding proto-kernels might turn out to
be mundane/tedious. Is that a valid concern?

Abhishek

casper_the_ghost · February 25, 2014, 9:48am

On 02/24/2014 09:33 PM, Abhishek B. wrote:

Firstly, congratulations on being accepted as a mentor organization.

Thanks for the pointers to gr-atsc and gr-80211. I have started
looking there as a
starting point. Are there similar modules which are undergoing volk
speedup fixes?

I had started working on accelerating the OFDM sync blocks in-tree…
I’d need to dig around a bit to get that up to date. Contact me off-list
if you want to see about this.

I am also trying to meet up with other people who have been using GNU radio
to identify potential modules for acceleration. As you are now a
mentor organization, I feel it’s a good time for us to get into
detailed discussions.

Absolutely, that goes for you and all other applicants. The ideas list
is just to get you started; for successful participation, you will need
to write a proposal that details what you plan to do for 3 months of
full-time coding.

please elaborate
a little on the kind of beneficial features you have in mind ? I am
concerned that the
job of adding proto-kernels might turn out to be mundane/tedious ? Is
that a valid
concern ?

That’s highly subjective. If you don’t know much about those specific
SIMD instruction sets, it’s probably neither. On the other hand, if
you’re already an expert, it’s not that much work, and you can quickly
benefit the project. I don’t think anyone expects you to do 3 months of
proto-kernel development, so you can balance it in your project
proposal.

M

casper_the_ghost · February 25, 2014, 3:11pm

On Tue, Feb 25, 2014 at 8:21 AM, Bogdan D.
[email protected] wrote:

Hi Abhishek,

When I implemented gr-dvbt (BogdanDIA/gr-dvbt on GitHub: DVB-T
implementation in gnuradio), I used VOLK in many places to speed up the
processing. However, a great deal of speed-up still needs to be achieved
on both Tx and Rx in order to lower CPU cycle consumption, so there are
a lot of challenges in the project from this point of view.

For example, the Viterbi implementation is done using intrinsics instead
of VOLK, just because it was quite slow when I used VOLK, achieving only
16 Mbps of processing per single thread (7-8 Mbps with a plain C
implementation). Using intrinsics raised the speed to 32-37 Mbps per
thread, which is quite good, but the code is not directly portable. So a
good Viterbi decoder that easily achieves over 60 Mbps at its input is
still needed, probably not only in the DVB-T implementation but in other
applications as well. Just to add to the challenge, one may want
readable code besides the necessary speed (the Spiral Viterbi
implementation is on the opposite side).

Bogdan,

Good advice, generally. Just a few issues to point out. First, I think
there’s some confusion between “VOLK” and “using intrinsics.” VOLK
uses intrinsics, so whatever code you wrote with intrinsics could be
done in VOLK. For instance, the fecapi that we are working to bring
into GNU Radio has a convolutional decoder defined as a single VOLK
kernel:

https://github.com/namccart/fecapi/blob/master/volk_fecapi/kernels/volk_fecapi/volk_fecapi_8u_x4_conv_k7_r2_f2048_8u.h

#ifndef INCLUDED_volk_fecapi_8u_x4_conv_k7_r2_f2048_8u_H
#define INCLUDED_volk_fecapi_8u_x4_conv_k7_r2_f2048_8u_H



#if LV_HAVE_SSE3

#include <pmmintrin.h>
#include <emmintrin.h>
#include <xmmintrin.h>
#include <mmintrin.h>
#include <stdio.h>

static inline void volk_fecapi_8u_x4_conv_k7_r2_f2048_8u_spiral(unsigned char* Y, unsigned char* X, const unsigned char* syms, unsigned char* dec, unsigned int framebits, unsigned int excess, unsigned char* Branchtab) {
  int i9;
  for(i9 = 0; i9 < (framebits >> 1) + (excess >> 1); i9++) {
    unsigned char a75, a81;
    int a73, a92;
    short int s20, s21, s26, s27;
    unsigned char  *a74, *a80, *b6;

[The embedded excerpt is truncated here; see the linked file for the rest.]

This is actually Spiral code that was wrapped up into a kernel to make
it portable and usable.

Basically, I’m trying to convey that there is no limit to what we can
define as a kernel in VOLK. In fact, the more complex the kernel, the
better the speedup, because you can keep the data inside the registers
and more tightly control the algorithm. We just want a kernel to
represent some operations that would be usable in other situations,
like a convolutional decoder.
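
As a rough illustration of that layout, the sketch below puts a portable
C implementation and an SSE implementation of one operation side by
side, with the SIMD variant compiled only when the instruction set is
available. The kernel name `example_32f_x2_dot_32f` and the
`EXAMPLE_HAVE_SSE` guard are made up for this example; real VOLK kernels
use `LV_HAVE_*` guards, as in the fecapi excerpt, together with VOLK’s
own selection machinery, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#define EXAMPLE_HAVE_SSE 1 /* hypothetical guard, stands in for LV_HAVE_SSE */
#endif

/* Portable implementation: always compiled, works on any target. */
static void example_32f_x2_dot_32f_generic(float *result, const float *a,
                                           const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    *result = sum;
}

#ifdef EXAMPLE_HAVE_SSE
/* SSE implementation: four floats per iteration, unaligned loads. */
static void example_32f_x2_dot_32f_sse(float *result, const float *a,
                                       const float *b, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(va, vb));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; i++) /* leftover elements */
        sum += a[i] * b[i];
    *result = sum;
}
#endif

/* Compile-time selection between the two implementations. */
static void example_32f_x2_dot_32f(float *result, const float *a,
                                   const float *b, size_t n)
{
    assert(result != NULL && a != NULL && b != NULL);
#ifdef EXAMPLE_HAVE_SSE
    example_32f_x2_dot_32f_sse(result, a, b, n);
#else
    example_32f_x2_dot_32f_generic(result, a, b, n);
#endif
}
```

Callers only see `example_32f_x2_dot_32f`; the same idea scales from a
dot product up to something as large as a convolutional decoder.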

The OFDM synchronization code is also very time consuming, and although
it already uses VOLK, it could benefit greatly from the new AVX2
instructions. Actually, many of the blocks can use new instructions to
speed up the data processing.

Yes, certainly. The synchronization part is a good place for
optimization.

Tom

casper_the_ghost · February 25, 2014, 2:23pm

Hi Abhishek,

When I implemented gr-dvbt (BogdanDIA/gr-dvbt on GitHub: DVB-T
implementation in gnuradio), I used VOLK in many places to speed up the
processing. However, a great deal of speed-up still needs to be achieved
on both Tx and Rx in order to lower CPU cycle consumption, so there are
a lot of challenges in the project from this point of view.

For example, the Viterbi implementation is done using intrinsics instead
of VOLK, just because it was quite slow when I used VOLK, achieving only
16 Mbps of processing per single thread (7-8 Mbps with a plain C
implementation). Using intrinsics raised the speed to 32-37 Mbps per
thread, which is quite good, but the code is not directly portable. So a
good Viterbi decoder that easily achieves over 60 Mbps at its input is
still needed, probably not only in the DVB-T implementation but in other
applications as well. Just to add to the challenge, one may want
readable code besides the necessary speed (the Spiral Viterbi
implementation is on the opposite side).

The OFDM synchronization code is also very time consuming, and although
it already uses VOLK, it could benefit greatly from the new AVX2
instructions. Actually, many of the blocks can use new instructions to
speed up the data processing.

Basically, for DVB-T at its maximum speed, with an 8k OFDM FFT, QAM-64
and puncturing rate 7/8, the video output is 32 Mbps, which means more
than 60 Mbps of processing speed after de-puncturing. A bigger challenge
would be implementing a real-life DVB-S receiver, where the data rate is
over 50 Mbps at the video output.
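
The arithmetic behind the 60 Mbps figure can be sketched as follows,
assuming a rate-1/2 mother convolutional code and treating the 32 Mbps
as the decoder output rate (outer-code overhead is ignored, which is a
simplification). After de-puncturing, the decoder consumes
1/(mother rate) soft symbols per decoded bit, while the channel only
carries 1/(punctured rate) coded bits per decoded bit. The function
names are made up for this illustration.

```c
#include <assert.h>

/* Soft-symbol rate entering the Viterbi decoder after de-puncturing,
 * for a mother code of rate mother_num/mother_den (assumed 1/2 here). */
static double depunctured_rate_mbps(double decoded_mbps,
                                    int mother_num, int mother_den)
{
    assert(mother_num > 0 && mother_den > 0);
    return decoded_mbps * mother_den / mother_num;
}

/* Coded-bit rate on the channel for a punctured code of rate num/den. */
static double punctured_rate_mbps(double decoded_mbps, int num, int den)
{
    assert(num > 0 && den > 0);
    return decoded_mbps * den / num;
}
```

With these assumptions, `depunctured_rate_mbps(32.0, 1, 2)` gives
64 Mbps at the decoder input, while `punctured_rate_mbps(32.0, 7, 8)`
gives roughly 36.6 Mbps on the channel side of the de-puncturer.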

This is just my short insight of challenges one may face when dealing
with speed optimizations in a modern communication project.

Bogdan

casper_the_ghost · February 25, 2014, 4:31pm

Hi Tom,

thanks for your answer. The point I was making was that when I wrote
the Viterbi code, I tried to use the available VOLK functions
(multiplications, subtractions, etc.) and the code was slower than using
intrinsics directly. Implementing a new kernel for the Viterbi decoder
(with intrinsics of course, like the others) was just the next step in
the process.

So, I totally agree that it is worth creating a kernel to completely
solve a problem like a convolutional decoder, as it will make it faster.
The downside, though, is that the next time you want to do something
slightly different you’ll need to create another kernel. But that is the
tradeoff between flexibility and speed.

I see your code uses the Spiral implementation; I will look to see what
speed it gives, as for me this is one of the biggest challenges. I still
believe someone will create a convolutional decoder implementation that
is both readable and fast :). I know, I am speaking from the perspective
of an open source software guy, who inherently needs to understand all
the code.

Bogdan

BTW, from my experience, to speed up de-puncturing it is worth making
the depuncturer part of the decoder, or at least making the decoder
aware of it.
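
For readers unfamiliar with the step: de-puncturing re-inserts a neutral
“erasure” value at every position the transmitter dropped, so that the
decoder sees a full-rate stream. Below is a minimal standalone sketch
with a hypothetical function name and a caller-supplied pattern; it is
not taken from gr-dvbt.

```c
#include <assert.h>
#include <stddef.h>

/* pattern[k] != 0: position k of each period was transmitted.
 * pattern[k] == 0: position k was punctured; emit `erasure` instead.
 * Stops when the input is exhausted or out[] is full.
 * Returns the number of symbols written to out[]. */
static size_t depuncture(unsigned char *out, size_t out_cap,
                         const unsigned char *in, size_t n_in,
                         const unsigned char *pattern, size_t period,
                         unsigned char erasure)
{
    assert(out != NULL && in != NULL && pattern != NULL && period > 0);
    size_t i = 0; /* read index into in[]   */
    size_t o = 0; /* write index into out[] */
    size_t k = 0; /* position inside the puncturing period */
    while (o < out_cap) {
        if (pattern[k]) {
            if (i == n_in)
                break; /* no received symbol left for this position */
            out[o++] = in[i++];
        } else {
            out[o++] = erasure;
        }
        k = (k + 1) % period;
    }
    return o;
}
```

For example, with the illustrative pattern `{1, 1, 0, 1}`, six received
symbols and an erasure value of 128 (the midpoint of an 8-bit soft
symbol), the output has eight symbols with erasures at indices 2 and 6.
One way to read the suggestion above is that a decoder which knows the
pattern could skip the punctured positions in its metric computation
instead of materializing the erasures in a buffer like this.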

casper_the_ghost · February 26, 2014, 12:34am

On Tue, Feb 25, 2014 at 4:37 PM, West, Nathan
[email protected] wrote:

I also see that GNSS-SDR made it to GSoC and they have a VOLK related
project.

casper_the_ghost · February 26, 2014, 7:50am

Thanks everyone. These are quite a few pointers; I will spend some time
digesting it all.

So there are really two approaches: large complex kernels on one hand
and AVX2/AVX/FMA on the other, or a combination of the two.

I guess I should propose identifying and implementing larger complex
kernels and then further accelerating them using AVX2/FMA etc. Doing
both will of course limit the number of applications/algorithms I can
feasibly target. What’s your take on this?

Abhishek

On Wed, Feb 26, 2014 at 5:03 AM, West, Nathan
[email protected] wrote:

I also see that GNSS-SDR made it to GSoC and they have a VOLK related project.

Yeah, I also noticed that. I might submit a proposal to them also.

Abhishek

casper_the_ghost · March 10, 2014, 3:34pm

Hello,
I would like to clarify some things:

1. I feel it is tough to beat Spiral implementations through manual
   vectorization, performance-wise. If so, is readability the prime and
   only reason for using intrinsics manually, and hence of value to the
   community?
2. What is currently the state of adding support for SSE4 and NEON in
   stock VOLK kernels (the project ideas page mentions some work is
   under way)? It would be great if someone who is already working on
   this shared their branch, so that I may know how much work, if any,
   is needed here before moving on to AVX. Of course, new kernels will
   need support for all of them.
3. How feasible/useful does it sound to incorporate the newly added
   idea of a ‘turbo equalizer’ within the OFDM system? Are the
   requirements of the proposed equalizer overkill for the OFDM blocks?

Abhishek

casper_the_ghost · February 25, 2014, 11:38pm

This is a great conversation, and I’ll take the opportunity to plug
the upcoming VOLK working group call
(https://plus.google.com/u/1/events/ch3jrjcvp7mdiqelpismfieg3n0).
Bogdan, your results aren’t particularly surprising, but the feedback
is really good to hear.

Back to GSoC:

Abhishek,

Thanks for the pointers to gr-atsc and gr-80211. I have started
looking there as a
starting point. Are there similar modules which are undergoing volk
speedup fixes?
I am also trying to meet up with other people who have been using GNU radio
to identify potential modules for acceleration. As you are now a
mentor organization, I feel it’s a good time for us to get into
detailed discussions.

From the previous discussion it should be apparent that how algorithms
are implemented will make the biggest difference, and that the new
acceleration is primarily going to come from larger more complex
kernels. At the end of the day it’s going to be your proposal. So far
on the list of places to look we have

in-tree OFDM (contact Martin)
gr-atsc (use Andrew D.’ fork)
gr-dvbt
gr-fecapi

For your proposal I would recommend looking at their code, then
getting in contact with the author(s) of those modules to ask about
their thoughts on accelerating blocks they have written. The reality
of this project is that we are accelerating some signal processing
algorithm, and knowledge of that algorithm is useful for acceleration.
Whatever application you have interest and/or knowledge in (fresh
out of a BS it’s more likely to be interest) should guide your
proposal. If you know anything about error correcting codes then the
latter 2 would be good fits. OFDM frame detection probably has a
gentler learning curve, since at the basic level you’re looking at
convolution, and there are papers you can look for on more involved
algorithms. Other algorithms to look at might include AGC or
equalizers.

If you’re interested in GPU programming, don’t forget to check out
gr-gpu.

This also seems to be promising, though I guess it would require me to
come up to speed with AVX2 (which I would love to do). Could you
please elaborate
a little on the kind of beneficial features you have in mind ? I am
concerned that the
job of adding proto-kernels might turn out to be mundane/tedious ? Is
that a valid concern ?

Right, so as Martin mentioned the answer is sort of relative. I
wouldn’t go so far as to say it’s mundane, especially if you have
little experience with using intrinsics and SIMD instructions. One
reason AVX isn’t so prominently featured (I suspect) is that the
instructions are almost the same as SSE instructions, but the vectors
are twice as long so that is actually mundane. AVX2/FMA extensions
introduce some new features to the amd64 instruction set. The most
obvious being that it looks like Intel and AMD finally settled in on
the same fused multiply-add (there’s also a multiply-subtract that’s
good for complex numbers) implementation. That will likely be able to
speed things up a bit, but I’m also looking forward to seeing gains
from the various load_gathers that have been introduced. They allow
you to do a single load operation that gathers vector elements that
span pretty large ranges. VOLK won’t be so interested in the large
ranges (except maybe decimators), but it could be useful for loading
complex vectors. There are some other math functions we may be able to
leverage, but those are two features that I think would be widely
applicable.
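
As a sketch of the fused multiply-add point, the example below
multiplies interleaved complex float vectors using the `fmaddsub`
intrinsic, which subtracts in even lanes and adds in odd lanes and so
matches the real and imaginary parts of a complex product. It assumes a
compiler that defines `__AVX__` and `__FMA__` (for instance GCC or Clang
with `-mavx -mfma`); otherwise only the generic path is built. The
function names are hypothetical, not VOLK kernels.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__AVX__) && defined(__FMA__)
#include <immintrin.h>
#define EXAMPLE_HAVE_FMA 1 /* hypothetical guard for this sketch */
#endif

/* out[n] = a[n] * b[n]; complex samples interleaved as re, im. */
static void cmul_generic(float *out, const float *a, const float *b,
                         size_t n_complex)
{
    for (size_t i = 0; i < n_complex; i++) {
        float ar = a[2 * i], ai = a[2 * i + 1];
        float br = b[2 * i], bi = b[2 * i + 1];
        out[2 * i]     = ar * br - ai * bi;
        out[2 * i + 1] = ar * bi + ai * br;
    }
}

#ifdef EXAMPLE_HAVE_FMA
/* Four complex products per iteration using 256-bit vectors. */
static void cmul_fma(float *out, const float *a, const float *b,
                     size_t n_complex)
{
    size_t i = 0;
    for (; i + 4 <= n_complex; i += 4) {
        __m256 va   = _mm256_loadu_ps(a + 2 * i);  /* ar0 ai0 ar1 ai1 ... */
        __m256 vb   = _mm256_loadu_ps(b + 2 * i);  /* br0 bi0 br1 bi1 ... */
        __m256 b_re = _mm256_moveldup_ps(vb);      /* br0 br0 br1 br1 ... */
        __m256 b_im = _mm256_movehdup_ps(vb);      /* bi0 bi0 bi1 bi1 ... */
        __m256 a_sw = _mm256_permute_ps(va, 0xB1); /* ai0 ar0 ai1 ar1 ... */
        __m256 cross = _mm256_mul_ps(a_sw, b_im);  /* ai*bi, ar*bi, ...   */
        /* even lanes: ar*br - ai*bi, odd lanes: ai*br + ar*bi */
        __m256 prod = _mm256_fmaddsub_ps(va, b_re, cross);
        _mm256_storeu_ps(out + 2 * i, prod);
    }
    /* leftover samples go through the generic code */
    cmul_generic(out + 2 * i, a + 2 * i, b + 2 * i, n_complex - i);
}
#endif

/* Compile-time selection between the two implementations. */
static void cmul(float *out, const float *a, const float *b, size_t n_complex)
{
    assert(out != NULL && a != NULL && b != NULL);
#ifdef EXAMPLE_HAVE_FMA
    cmul_fma(out, a, b, n_complex);
#else
    cmul_generic(out, a, b, n_complex);
#endif
}
```

The fused instruction replaces a separate multiply and add/subtract
pair; whether that turns into a measurable gain for a given kernel would
have to be profiled.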

In your proposal you should definitely include what ISAs you intend to
use, and if there are features specific to that instruction set then
point out why it’s a good choice. This is mostly important for
choosing between SSE and friends, AVX, AVX2/FMA. It would be good to
see plans that include NEON support for anything you’d add to amd64
platforms, but that’s not a requirement.

Nathan

casper_the_ghost · March 11, 2014, 4:26pm

On Mon, Mar 10, 2014 at 2:32 PM, Abhishek B.
[email protected] wrote:

I am afraid my question didn’t come out correctly. I was referring to the […]

Ah! So there was a slight miscommunication. Yes, porting the
OpenAirInterface SIMD code to VOLK is a good option as well. The turbo
channel coder/decoder is part of that. I’ve briefly looked at the code
to see what is currently there, and it’s my understanding that the work
involved will be to write generic C implementations of vectorized code
where the generic version does not exist. Beyond that, there is porting
to newer/different ISAs (AVX or NEON, depending on your preference and
hardware availability). I think Florian is on the gr-discuss mailing
list, but I’ve CCed him to hopefully provide more details, as he’s more
familiar with the original code base.

casper_the_ghost · March 11, 2014, 11:09pm

Hi Nathan and Abhishek,

On 10/03/2014 23:22, West, Nathan wrote:

I’ve CCed him to
hopefully provide more details as he’s more familiar with the original
code base.

I only joined this mailing list recently, so I probably missed a part of
the discussion. Let me summarize briefly what OpenAirInterface can
provide. We have optimized SIMD (SSE4) implementations of the LTE turbo
encoder and decoder as well as the LTE tail-biting Viterbi encoder and
decoder. We also have the 802.11 Viterbi encoder and decoder. The only
function for which we have a generic non-vectorized functional
equivalent is the LTE turbo decoder.

I am not sure I understand why it is necessary to write generic versions
of the already optimized SIMD code. My idea was to port the optimized
SIMD code from OpenAirInterface to VOLK, so that it can be used by GR
applications. I am not familiar with VOLK (yet), but this might be as
easy as writing a wrapper function.

As Nathan suggested, the more interesting part is probably to upgrade
the code to AVX2 or similar.

Cheers,
Florian.

casper_the_ghost · March 12, 2014, 12:40am

Hi Florian,

The generic implementation serves two purposes, at least in my opinion.
Firstly, if the necessary hardware extension is not available on
the target hardware, there is a default to fall back on.
Secondly, the generic implementation is usually easier to read and
thus preferable as a reference. I know that this is a ‘the code is the
documentation’ argument. But even when the documentation is really
good, sometimes looking at the code just makes things clearer.
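
The two purposes can be illustrated with a minimal sketch. It is written in Python for brevity rather than in VOLK's C, and it is not VOLK's actual dispatcher. All names here (`multiply_conjugate_generic`, `multiply_conjugate_blocked`, `select_impl`, the `"simd"` feature string) are hypothetical. The blocked version only stands in for a vectorized kernel: it handles a fixed number of items per step and hands the leftover tail to the generic code.

```python
def multiply_conjugate_generic(a, b):
    """Readable reference: out[i] = a[i] * conj(b[i]), one item at a time."""
    return [x * y.conjugate() for x, y in zip(a, b)]


def multiply_conjugate_blocked(a, b, width=4):
    """Stand-in for a SIMD kernel: `width` items per step, generic tail."""
    n = min(len(a), len(b))
    full = n - n % width  # number of items covered by whole blocks
    out = []
    for i in range(0, full, width):
        # One "vector" step: all lanes of the block are processed together.
        out.extend(a[i + k] * b[i + k].conjugate() for k in range(width))
    # Items that do not fill a whole block use the generic code path.
    out.extend(multiply_conjugate_generic(a[full:n], b[full:n]))
    return out


def select_impl(available_features):
    """Pick the optimized version if supported, else the generic fallback."""
    if "simd" in available_features:  # hypothetical feature name
        return multiply_conjugate_blocked
    return multiply_conjugate_generic
```

The generic function doubles as the reference in a test: whatever `select_impl` returns should produce the same output as `multiply_conjugate_generic` for the same input.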

happy hacking
Johannes

casper_the_ghost · March 10, 2014, 4:46pm

On 03/10/2014 03:33 PM, Abhishek B. wrote:

way) ? Would be great if someone who is working on this already shares
his branch, so that I may know how much/if any work is needed in this
before moving on to avx. Of course, new kernels will need support for
all.

How feasible/useful does it sound to incorporate the newly added
idea of ‘turbo equalizer’ within the ofdm system ? Are the
requirements of the proposed equalizer overkill for the ofdm blocks?

Turbo equalizers are generally not a good choice for OFDM, because of
the way the OFDM parameters are chosen w.r.t. the channel properties. In
OFDM, you usually have 1-tap equalizers; the only difficulty is the
channel estimation.
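
As a rough illustration of what a 1-tap equalizer amounts to, here is a minimal sketch that assumes a channel estimate is already available for every subcarrier and uses plain zero-forcing (a division per subcarrier). The function name `one_tap_equalize` is made up for this example and is not a GNU Radio or VOLK API.

```python
def one_tap_equalize(received, channel):
    """Zero-forcing 1-tap equalizer: one complex division per subcarrier.

    received: complex frequency-domain symbols, one per subcarrier
    channel:  complex channel estimates for the same subcarriers
    """
    if len(received) != len(channel):
        raise ValueError("need one channel estimate per subcarrier")
    return [y / h for y, h in zip(received, channel)]


# Usage: QPSK-like symbols through a made-up flat-per-subcarrier channel.
sent = [1 + 1j, -1 + 1j, -1 - 1j, 1 - 1j]
chan = [0.9 + 0.1j, 0.5 - 0.4j, 1.2 + 0.0j, 0.3 + 0.7j]
recv = [s * h for s, h in zip(sent, chan)]
equalized = one_tap_equalize(recv, chan)
```

With a perfect estimate and no noise, `equalized` matches `sent` up to floating-point rounding, which is why the estimation step carries all the difficulty.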

MB

casper_the_ghost · March 14, 2014, 7:28pm

Hi,
So, according to some suggestions, I looked into how I can potentially
use better signal processing for the OFDM receiver. I was thinking of an
LS estimator with higher order interpolation or an MMSE estimator for the
channel estimator part, and also an MMSE-DFE or Viterbi equalizer. These
will need matrix operations and other computations, which can potentially
be developed into new volk kernels.
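
For reference, a least-squares (LS) pilot estimate followed by interpolation can be sketched as below. This is only a baseline under stated assumptions: it uses linear interpolation (the simplest case, not the higher-order interpolation mentioned above), holds the nearest pilot value outside the pilot range, and the function names and the pilot layout are invented for the example.

```python
def ls_estimate_at_pilots(received, pilots):
    """LS estimate H[k] = Y[k] / X[k] at each pilot subcarrier k.

    pilots: dict mapping subcarrier index -> known transmitted pilot symbol
    """
    return {k: received[k] / x for k, x in pilots.items()}


def interpolate_linear(pilot_est, n_subcarriers):
    """Fill in all subcarriers by linear interpolation between pilots."""
    idx = sorted(pilot_est)
    h = []
    for k in range(n_subcarriers):
        if k <= idx[0]:
            h.append(pilot_est[idx[0]])   # hold first pilot at the edge
        elif k >= idx[-1]:
            h.append(pilot_est[idx[-1]])  # hold last pilot at the edge
        else:
            lo = max(i for i in idx if i <= k)  # nearest pilot at or below k
            hi = min(i for i in idx if i >= k)  # nearest pilot at or above k
            if lo == hi:
                h.append(pilot_est[lo])
            else:
                t = (k - lo) / (hi - lo)
                h.append((1 - t) * pilot_est[lo] + t * pilot_est[hi])
    return h
```

The per-pilot division and the per-subcarrier weighted sum are both element-wise operations over complex vectors, which is the kind of loop a vectorized kernel targets.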

1. Are the computational complexities involved feasible in the current
framework?
2. Though they can give better BER in adverse channel conditions, do
they deliver more in terms of throughput/performance?
3. Is it a good idea to include such implementations alongside doing new
volk kernels in the same proposal?

Abhishek

On Wed, Mar 12, 2014 at 3:38 AM, Florian Kaltenberger <

casper_the_ghost · March 14, 2014, 11:09pm

On 14.03.2014 19:27, Abhishek B. wrote:

they do deliver more in terms of throughput/performance?
3. Is it a good idea to include such implementations alongside doing new
volk kernels in the same proposal ?

Abhishek,

at this point, please just put together a proposal and upload it so we
can make sure it gets into Melange in time.

M

casper_the_ghost · March 19, 2014, 1:10am

Can you enter this through Melange? It should be sufficient to link to
your PDF/repo on Melange.

It’s good to see you were able to get control port and oprofile results.

On Sat, Mar 15, 2014 at 4:37 AM, Abhishek B.

casper_the_ghost · March 19, 2014, 4:56pm

My current hardware doesn’t support AVX2. How practical is it to develop
software for AVX2 intrinsics using Intel’s SW Development Emulator (and
possibly do performance testing on a remote machine)?

Abhishek

On Wed, Mar 19, 2014 at 5:39 AM, West, Nathan

This is a great conversation, and I’ll take the opportunity to plug
the upcoming VOLK working group call
(https://plus.google.com/u/1/events/ch3jrjcvp7mdiqelpismfieg3n0).
Bogdan, your results aren’t particula