Trying to improve E100's performance at high sample rate

Dear all,

I have been trying to transmit data between a USRP E100 (with an RFX2400
daughterboard) and a non-USRP device, which has a fixed 4M sample rate.
On the E100 side (running in half-duplex mode), although data from the
uhd_source goes through a gr.pwr_squelch_cc block before being
demodulated, the device still continuously overflows at this sample
rate, and judging from the output of the packet_sink, less than half of
the data is correctly demodulated.

To reduce the computation load of the processor, I tried two methods:

  1. Modify the gr.quadrature_demod_cf block, replacing some multiplication
    operations with volk-based operations (the gr.multiply and gr.multiply_const
    modules in gr_blocks);
  2. Use the complex_int16 type instead of complex_float32 at the UHD
    interfaces and in the modulation / demodulation blocks.

But I got unexpected results after the changes. With method 1), the
amount of correctly demodulated data is roughly the same as when using
the original demodulation block. And with method 2), the result is even
worse: less data is correctly demodulated.

Could someone please tell me if I did something wrong, or whether there
are other solutions to this overflow-at-high-sample-rate problem?

On the E100, I burned the latest console file system and use
UHD_003.004.000-6795022 and Josh's next branch of GNU Radio.
UHD setup: center frequency 2.416 GHz, sample rate 4M.

BTW, it took almost twice as long to build both UHD and GNU Radio after
burning the latest console file system, and the initialization of UHD
(when the device information is printed out) took a lot longer than
before as well. But what concerns me more is that far fewer packets
could be received, compared with another E100 which has the same flow
graph setup but previous versions of the tools installed: file system,
UHD_003.004.000-1a25e48, GNU Radio 3.5.0. Could it be caused by some
hardware problem? I will greatly appreciate it if someone could give me
a hint on this.

Thanks in advance,

Terry

To reduce the computation load of the processor, I tried two methods:

  1. Modify the gr.quadrature_demod_cf block, replacing some multiplication
    operations with volk-based operations (the gr.multiply and gr.multiply_const
    modules in gr_blocks);

I like it. Make sure to contribute patches like that back. :-)

  2. Use the complex_int16 type instead of complex_float32 at the UHD
    interfaces and in the modulation / demodulation blocks.

But I got unexpected results after the changes. With method 1), the
amount of correctly demodulated data is roughly the same as when using
the original demodulation block. And with method 2), the result is even
worse: less data is correctly demodulated.

I wouldn't expect it to be worse in general. How did you implement it?
Did you combine the float-to-short conversion into the processing of
another block, or did you add an extra block to convert from float to
short? The extra conversion would definitely make things worse.
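
For what it's worth, "combining" here means doing the conversion inside
the same work() pass that does the math. A minimal sketch, with a made-up
helper name and a VOLK convert kernel whose name may not match the 2012
tree (32768.0 is the assumed sc16 full scale):

    #include <stdint.h>
    #include <vector>
    #include <volk/volk.h>

    // Convert interleaved sc16 I/Q to floats and demodulate in one pass,
    // so no separate short-to-float block (and no extra trip through the
    // scheduler's buffers) is needed.
    static void demod_sc16(const int16_t *in, float *out, int nsamples)
    {
        std::vector<float> iq(2 * nsamples);  // scratch float I/Q
        volk_16i_s32f_convert_32f(&iq[0], in, 32768.0f, 2 * nsamples);
        // ... demodulation math on iq[] goes here, writing to out[] ...
        (void)out;
    }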

Also, you may consider timing a particular operation as a performance
metric, rather than counting the number of demodulated packets.

Could someone please tell me if I did something wrong, or whether there
are other solutions to this overflow-at-high-sample-rate problem?

On the E100, I burned the latest console file system and use
UHD_003.004.000-6795022 and Josh's next branch of GNU Radio.
UHD setup: center frequency 2.416 GHz, sample rate 4M.

BTW, I just updated my next branch, into which I pulled the volk-ified
adder and multipliers. So if any of the code you are using includes a
gr.add, add_const, mult, etc. for floating point or float complex, you
will get the benefit without any code change.

BTW, it took almost twice as long to build both UHD and GNU Radio after
burning the latest console file system, and the initialization of UHD
(when the device information is printed out)

Where did you get the rootfs? When I hear "console" image, I think you
might be using something very, very old. Perhaps you just downgraded by
accident. You can find the latest stuff here:

http://code.ettus.com/redmine/ettus/projects/usrpe1xx/wiki/FAQ#How-do-I-create-re-create-E1xx-SD-Card-Images

http://files.ettus.com/e1xx_images/e1xx-002/

took a lot longer than before as well. But what concerns me more is that
far fewer packets could be received, compared with another E100 which
has the same flow graph setup but previous versions of the tools
installed: file system, UHD_003.004.000-1a25e48, GNU Radio 3.5.0. Could it

Can you isolate some of the changes? Perhaps keep the console image
constant, or the uhd version; basically, to see in which package (uhd,
gnuradio, rootfs, gcc) this performance regression appeared.

I can't recall anything specific in gnuradio or uhd that would have
changed performance. There are honestly no significant E100 changes
between 1a25e48 and current master.

-Josh

On 01/13/2012 12:19 PM, ziyang wrote:

BTW, it took almost twice as long to build both UHD and GNU Radio after
burning the latest console file system, and the initialization of UHD
(when the device information is printed out) took a lot longer than
before as well. But what concerns me more is that far fewer packets
could be received, compared with another E100 which has the same flow
graph setup but previous versions of the tools installed: file system,
UHD_003.004.000-1a25e48, GNU Radio 3.5.0. Could it be caused by some
hardware problem? I will greatly appreciate it if someone could give me
a hint on this.

Assuming you downloaded u-boot from dropbox, we just found a regression
that slipped into u-boot on Dec 6. I have updated the version in the
dropbox to:

http://dl.dropbox.com/u/14618236/u-boot-usrp-e1xx-2011.12-r1.bin

All you need to do is copy u-boot onto the FAT partition and things
should speed up quite a bit.

This makes sure the L2 cache is on before Linux boots. There is some
discussion at the moment of how to properly turn it on at boot in Linux.

This does not affect the factory file system.

Thanks for reporting this!

Philip

On 01/16/2012 05:51 PM, Philip B. wrote:

Assuming you downloaded u-boot from dropbox, we just found a regression

This does not affect the factory file system.

Thanks for reporting this!

Philip

Previously, I got the console image from here:
http://ettus-apps.sourcerepo.com/redmine/ettus/projects/usrpe1xx/wiki/Images

and the 2011.12 version of u-boot.

Now I have changed u-boot to the updated version, and the initialization
time of UHD is normal. After booting up the E100, the time for the
Ethernet connection to come up is normal as well; previously, Ethernet
would be disconnected for a couple of seconds.

Thank you for your help!

Best Regards,

Terry

On 01/13/2012 09:30 PM, Josh B. wrote:

To reduce the computation load of the processor, I tried two methods:

  1. Modify the gr.quadrature_demod_cf block, replacing some multiplication
    operations with volk-based operations (the gr.multiply and gr.multiply_const
    modules in gr_blocks);

I like it. Make sure to contribute patches like that back. :-)

Actually, what I did was write a new quadrature_demod block without the
multiplication and delay operations, and connect extra gr.multiply and
gr.delay blocks in the flow graph instead, because my understanding is
that the volk functions take a vector (multiple values) as input, and I
didn't figure out a way to do the single-item operation in the volk
style.

another block, or did you add an extra block to convert from float to
short? The extra conversion would definitely make things worse.

I wrote a new pair of modulation / demodulation blocks with the same
processing steps but with short types (in the demodulation block, only
one step, the multiplication, is done with short inputs; the values are
then converted to float before the gr_fast_atan2f function). As in
method 1), I added an extra block in Python to do the delay work. So
does taking a processing step out of one block and doing it in a
separate block make things worse?

Also, you may consider timing a particular operation as a performance
metric, rather than counting the number of demodulated packets.

I was wondering if there are examples from which I can learn how to do
this?

complex; you will get the benefit without any code change.

That’s great, thanks!

BTW, it took almost twice as long to build both UHD and GNU Radio after
burning the latest console file system, and the initialization of UHD
(when the device information is printed out)

Where did you get the rootfs? When I hear "console" image, I think you
might be using something very, very old. Perhaps you just downgraded by
accident. You can find the latest stuff here:

http://code.ettus.com/redmine/ettus/projects/usrpe1xx/wiki/FAQ#How-do-I-create-re-create-E1xx-SD-Card-Images

http://files.ettus.com/e1xx_images/e1xx-002/

I got the rootfs from here:

http://ettus-apps.sourcerepo.com/redmine/ettus/projects/usrpe1xx/wiki/Images

I changed u-boot to the updated version as Philip said, and now the
initialization time of UHD is normal.

between 1a25e48 and current master.

-Josh
Actually, before I re-burned the rootfs and built the latest uhd and
gnuradio, the packet demodulation performance on this E100 was worse
than on the other one: with the same flow graph and the same uhd
configuration, fewer packets could be received on this E100. So I got
confused and started to wonder if there were hardware problems. Could
you give me some advice about how to narrow down the problem? Thanks.

Best Regards,

Terry

On 01/16/2012 09:51 AM, ziyang wrote:

I didn’t figure out a way to do the single-item-operation in the volk
style.

I don't recommend using the extra blocks; that would probably cause more
overhead. Looking at gr_quadrature_demod_cf::work, it looks like you can
vectorize the conjugate multiply, then the atan, then the gain scaler.
So that would be one for loop that operates on 4 samples at a time and
calls 3 volk functions.
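
In rough C++, that could look something like the sketch below. The
kernel names are from a later VOLK tree; as noted further down the
thread, there was no FC32 conjugate-multiply kernel and the atan2 kernel
was SSE-only at the time, so treat them as illustrative:

    #include <volk/volk.h>

    // Sketch of the vectorized inner loop: in[] is the input stream,
    // prev[] points one sample behind it (the block keeps a history of
    // 1), scratch[] is a preallocated buffer, d_gain is the demod gain.
    static void quad_demod_volk(const lv_32fc_t *in, const lv_32fc_t *prev,
                                lv_32fc_t *scratch, float *out,
                                float d_gain, unsigned int n)
    {
        volk_32fc_x2_multiply_conjugate_32fc(scratch, in, prev, n); // in[i]*conj(prev[i])
        volk_32fc_s32f_atan2_32f(out, scratch, 1.0f, n);  // atan2; factor 1.0 leaves it unscaled
        volk_32f_s32f_multiply_32f(out, out, d_gain, n);  // gain scaler
    }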

Also, you may consider timing a particular operation as a performance
metric, rather than counting the number of demodulated packets.

I was wondering if there are examples from which I can learn how to do
this?

Sorry, I guess there isn't much in the way of examples.

You can time individual work functions by adding some code before and
after. We have some high-resolution timers in
gruel/include/gruel/high_res_timers.h

I have also seen people time the block in a simple flow graph with a
null source, head, your block, and a null sink. You can time tb.run()
and compare the run duration against the non-vectorized code.

-Josh

On 01/17/2012 07:36 PM, Josh B. wrote:

and gr.delay blocks in the flow graph instead, because my understanding
is that the volk functions take a vector (multiple values) as input, and
I didn't figure out a way to do the single-item operation in the volk
style.

I don't recommend using the extra blocks; that would probably cause more
overhead. Looking at gr_quadrature_demod_cf::work, it looks like you can
vectorize the conjugate multiply, then the atan, then the gain scaler.
So that would be one for loop that operates on 4 samples at a time and
calls 3 volk functions.

Josh, thank you for your advice! Before I tried using gr.multiply
outside the block, I actually implemented a demodulation block in a way
that's similar to your suggestion, but the loop operated on 100 samples
at a time. I don't know if it was the 100-sample vectorization that
caused the bad performance. I will try processing 4 samples at a time.

So I call the timer functions of high_res_timers.h before and after the
operation in the work function, is that right?

I have also seen people time the block in a simple flow graph with a
null source, head, your_block, null_sink. You can time tb.run() and
compare run duration vs the non-vectorized code.

-Josh

I have two questions about this:

  1. Is the "head" block for generating data for the processing block?

  2. The initialization of UHD is done first after tb.run(), so how could
    I isolate the processing time from the time between tb.run() and tb.stop()?

Thanks.

Best Regards,

Terry

You can time individual work functions by adding some code before and
after. We have some high-resolution timers in
gruel/include/gruel/high_res_timers.h

So I call the timer functions of high_res_timers.h before and after the
operation in the work function, is that right?

Yes. It's the standard way to time things. Save the begin time, then the
end time, and average the differences of the begin/end times. It's
probably most practical for you to put the timer operations in the work
function, at the beginning and end (before the return).
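
A minimal sketch of that pattern (the accumulator names are made up; in
a real block they would be member variables):

    #include <gruel/high_res_timers.h>

    // Stamp the start and end of the measured section, accumulate the
    // differences, and normalize by the number of items processed.
    // high_res_timer_now() returns ticks; high_res_timer_tps() is ticks
    // per second.
    static gruel::high_res_timer_type d_total_ticks = 0;
    static unsigned long long d_total_items = 0;

    static void timed_section(float *out, const float *in, unsigned int n)
    {
        const gruel::high_res_timer_type t0 = gruel::high_res_timer_now();
        for (unsigned int i = 0; i < n; i++)   // the work being measured
            out[i] = in[i] * 2.0f;
        d_total_ticks += gruel::high_res_timer_now() - t0;
        d_total_items += n;
        // average ns per item:
        //   1e9 * d_total_ticks / (gruel::high_res_timer_tps() * d_total_items)
    }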

head shuts down the flow graph after it passes N items.

  2. The initialization of UHD is done first after tb.run(), so how could
    I isolate the processing time from the time between tb.run() and tb.stop()?

tb.run() executes the flow graph until completion. This is different
from the start/stop/wait model. So in this case, just time the run()
call; run() will exit when the head block completes.

Second, you don't want to put UHD in this flow graph. The goal is to
time the processing of a single block. The more blocks you add, the less
accurate the measurement of the individual block you are measuring. So
the flow graph looks like: null source, head, quad_demod, null_sink.

-Josh

On Tue, Jan 17, 2012 at 10:36 AM, Josh B. [email protected] wrote:

Looking at gr_quadrature_demod_cf::work, it looks like you can vectorize
the conjugate multiply, then the atan, then the gain scaler. So that
would be one for loop that operates on 4 samples at a time and calls 3
volk functions.

Right now, the Volk atan2 function is only implemented for SSE and only
works if libsimdmath is installed. If not, it will fall back to a
generic implementation which is considerably slower than Gnuradio's LUT
atan2. There's no NEON implementation, so right now the fastest option
on the E100 is to use Gnuradio's built-in atan2.

I spent some quality time a couple of months ago during SDR Forum
writing a vectorized atan2 algorithm in Volk via Orc. I was unable to
get the entire algorithm to fit within the register constraints the Orc
runtime compiler applies. The end goal is to get the entire algorithm
vectorized so it only needs to write out to memory once, which is going
to be far faster than running three vector operations across a large
buffer which won't fit into cache. I'll get back to it one of these
days, but it looks like parts of Orc's compiler will have to be
improved. Terry, if you're interested, Orc code is easily read and looks
like vector pseudocode, so my Orc implementation might be of use if
you're interested in writing a custom NEON implementation for Volk. It's
based on the libsimdmath implementation, which is in turn based on
Cephes, and uses all sorts of Crazy Math Tricks.

–n

On 01/17/2012 07:54 PM, Nick F. wrote:

I spent some quality time a couple of months ago during SDR Forum
writing a vectorized atan2 algorithm in Volk. [...] It's based on the
libsimdmath implementation, which is in turn based on Cephes, and uses
all sorts of Crazy Math Tricks.

–n

Thank you for your help, Nick. Right now, I really want to have a faster
atan implementation, but I use Python and only occasionally C++ most of
the time, so I'm not sure if I can handle a custom NEON implementation,
because Orc / NEON / libsimdmath / Cephes are all completely new to me.

Thanks.

Best Regards,

Terry

On 01/17/2012 08:26 PM, Josh B. wrote:

the beginning and end (before the return).

the processing of a single block. The more blocks you add, the less
accurate the measurement of the individual block you are measuring. So
the flow graph looks like: null source, head, quad_demod, null_sink.

-Josh



Thank you for the explanation, Josh. Just one more question: I haven't
used the head block before; is it a gr block or should I implement a
custom one?

Best Regards,

Terry

On Thu, Jan 19, 2012 at 10:04 AM, ziyang [email protected] wrote:

Since the Volk atan2 function is currently only for SSE as Nick said,
and there is no conjugate-multiply function for FC32 inputs, I use
Gnuradio's built-in conjugate and fast_atan2f functions, plus two volk
multiply functions. [...] it takes the original quadrature_demod_cf
block 0.185 ms but the volk-based block only 0.163 ms to demodulate.

Optimizing an algorithm is a hard and sometimes counterintuitive
process. You might benchmark the following:

  • Gnuradio's atan2 WITHOUT any Volk multiplications (just comment out
    the volk mults in your block)
  • The Volk multiplications WITHOUT Gnuradio's atan2 (just comment out
    the atan2 in your block)

This will let you determine where the bottleneck is. In addition, try
running over a MUCH larger dataset. The clock resolution at <1 ms is not
very good, and the scheduler will have a correspondingly larger effect
at smaller timescales.

I think you’ll find the atan2 part takes vastly longer than the
multiplications do, and that will be where you have to look for
performance
improvements.

–n

I don't recommend using the extra blocks; that would probably cause more
overhead. Looking at gr_quadrature_demod_cf::work, it looks like you can
vectorize the conjugate multiply, then the atan, then the gain scaler.
So that would be one for loop that operates on 4 samples at a time and
calls 3 volk functions.

Hi Josh. I implemented a quadrature_demod_cf block (please find it in
the attachment). Since the Volk atan2 function is currently only for SSE
as Nick said, and there is no conjugate-multiply function for FC32
inputs, I use Gnuradio's built-in conjugate and fast_atan2f functions,
plus two volk multiply functions. The for loop is timed with the
high_res_timer. Besides, the work function of gr_quadrature_demod_cf is
timed for comparison purposes (also attached). Each of these two blocks
is connected to a file_source which provides modulated data.
I tested two blocks individually, firstly on a PC with Intel processor,
then on E100. On PC, it always take volk-based block less time to
demodulate a same-size-buffer of data (i.e. for 4096 input items, it
takes the original quadrature_demod_cf block 0.185 ms but takes
volk-based block only 0.163 ms to demodulate).

However, the results are different on the E100: sometimes the original
block runs faster, sometimes the volk-based block does. I ran the tests
several times; although the recorded time varies by some tens
(occasionally a few hundreds) of nanoseconds, neither block is always
faster than the other.

Now I'm confused by the results, since I expected the volk-ified
demodulator to be faster. Could you give me some help with this issue?
Thanks.

Best Regards,

Terry

running over a MUCH larger dataset. The clock resolution at <1 ms is
not very good, and the scheduler will have a correspondingly larger
effect at smaller timescales.

I think you’ll find the atan2 part takes vastly longer than the
multiplications do, and that will be where you have to look for
performance improvements.

–n

Hi Nick,

Thank you for your advice! I will try benchmarking those two operations
separately and narrow down the problem. Besides, the explanation of the
effect the scheduler has on clock resolution clears up my question about
the inconsistency of the results.

There is just one thing that I'm not sure about. From the results of my
previous tests, I noticed that a different number of items
(ninput_items) is fed to the work function every time (4096, 2048,
etc.), and therefore the time of the operation changes accordingly. So
by "much larger dataset", do you mean providing much more data to the
block and then summing up the operation time for each buffer of data?

Thanks.

Terry

On 01/19/2012 07:13 PM, Nick F. wrote:

not very good and the scheduler will have a correspondingly larger
effect at smaller timescales.

I think you’ll find the atan2 part takes vastly longer than the
multiplications do, and that will be where you have to look for
performance improvements.

–n

Hi Nick,

I have been doing some tests on the demodulation module. As you said,
the atan2 part takes much longer than the multiplication. So, in order
to maximize the performance improvement that volk could bring to the
processing, I took a division and a multiplication out of atan2 and used
the volk-ified divider and multiplier instead. Then I ran tests using a
much larger dataset.

But from the test results, I did not observe a performance improvement.
In fact, the average processing time even increased a little. So I was
wondering whether what I did was not a good way to improve the
performance?

Another issue is that when I ran CMake to build Gnuradio on the E100, it
reported this:
-- Available arches: generic;neon
-- Available machines: generic;neon
-- Did not find liborc and orcc, disabling orc support...

But from an "opkg list-installed | grep orc" check, both orc and liborc
are installed. Could this lack of orc support be part of the reason why
my implementation did not show a performance improvement?

I would appreciate it if you could give me a hand with this. Thanks.

Best Regards,

Terry

Has anybody looked at using the CORDIC approximation for atan2?
Depending on the required accuracy, this may dramatically improve
performance in your C code. Ultimately, you can implement the CORDIC
functions in the FPGA (quasi math-coprocessor style), which would then
give you the fastest possible computation speed.
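
For reference, a minimal (unoptimized) float sketch of vectoring-mode
CORDIC atan2; each iteration adds roughly one bit of accuracy, and a
fixed-point or FPGA version would replace the multiplies by k with
shifts:

    #include <cmath>

    // Sketch of CORDIC atan2 (vectoring mode). Pre-rotate into the right
    // half-plane, then rotate y toward zero while accumulating the angle.
    // The magnitude grows by the CORDIC gain, which atan2 can ignore.
    static float cordic_atan2(float y, float x, int iters)
    {
        const float halfpi = 1.5707963268f;
        float angle = 0.0f;
        if (x < 0.0f) {
            if (y >= 0.0f) { float t = x; x =  y; y = -t; angle =  halfpi; }
            else           { float t = x; x = -y; y =  t; angle = -halfpi; }
        }
        float k = 1.0f;                        // 2^-i; a shift in fixed-point
        for (int i = 0; i < iters; i++, k *= 0.5f) {
            const float step = atanf(k);       // normally a precomputed table
            if (y > 0.0f) { float xn = x + y * k; y -= x * k; x = xn; angle += step; }
            else          { float xn = x - y * k; y += x * k; x = xn; angle -= step; }
        }
        return angle;
    }

With 16 iterations this is accurate to roughly 1e-5 rad, which should be
more than an FM demodulator needs.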

Evan

On Tue, Jan 24, 2012 at 9:56 AM, ziyang [email protected] wrote:

But from the test results, I did not observe a performance improvement.
[...] Could this lack of orc support be part of the reason why my
implementation did not have a performance improvement?

Very likely. Make sure that orcc is somewhere that pkgconfig can find
it, and make sure its version is > 0.4.10.

I haven't used VOLK with the OMAP processor, but from my experience with
the E100, every multiplication and/or division in your flowgraph counts.
When I was working on my C64x+ DSP-based FM receiver on the E100, I was
moving individual blocks one by one from the GPP to the DSP, and almost
every multiplication/division on the GPP caused a buffer overflow. My
impression, at least, is that if you're going for a pure GPP
implementation you need to make use of NEON vector operations, and if
you're using a DSP-based solution you'll need to find a way to speed up
the GPP/DSP buffers, which is something I'm hoping to have more time to
look into.

Almohanad F.
[email protected]

On 01/24/2012 01:43 PM, ziyang wrote:

I don't understand why orc/liborc cannot be detected by CMake. The
options for CMake are:

cmake -DCMAKE_INSTALL_PREFIX=/usr
-DCMAKE_CXX_FLAGS:STRING="-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp
-g" -DCMAKE_C_FLAGS:STRING="-mcpu=cortex-a8 -mfpu=neon
-mfloat-abi=softfp -g" ../

Could you tell me what might be the problem? Thanks.

Add -DENABLE_ORC=ON to the cmake command line.
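
That is, the invocation quoted above becomes:

cmake -DENABLE_ORC=ON -DCMAKE_INSTALL_PREFIX=/usr
-DCMAKE_CXX_FLAGS:STRING="-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp
-g" -DCMAKE_C_FLAGS:STRING="-mcpu=cortex-a8 -mfpu=neon
-mfloat-abi=softfp -g" ../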

Philip

On 01/24/2012 07:12 PM, Nick F. wrote:


But from the "opkg list-installed | grep orc" check, both orc and
liborc are installed. Could this lack of orc support be part of
the reason why my implementation did not have a performance
improvement?

Very likely. Make sure that orcc is somewhere that pkgconfig can find
it, and make sure its version is > 0.4.10.

This is what it shows when I run an "opkg list-installed | grep orc"
check:

liborc-0.4-0 - 0.4.16-r1.0.9
liborc-test-0.4-0 - 0.4.16-r1.0.9
orc - 0.4.16-r1.0.9

I don't understand why orc/liborc cannot be detected by CMake. The
options for CMake are:

cmake -DCMAKE_INSTALL_PREFIX=/usr
-DCMAKE_CXX_FLAGS:STRING="-mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp
-g" -DCMAKE_C_FLAGS:STRING="-mcpu=cortex-a8 -mfpu=neon
-mfloat-abi=softfp -g" ../

Could you tell me what might be the problem? Thanks.

Terry