User experience with E1x0 boards

dubstep · April 22, 2011, 11:44pm

Hi All,

My lab is interested in purchasing some USRPs. It is pretty settled
that some of the boards will be the N2x0 series, but I am interested
to hear from people who have used the E1x0 boards. From what I can
tell, the E1x0 board should have better latency performance than the
N2x0 and should have a better interface with the FPGA (GPIO pins);
also, it has an onboard DSP. It seems that the latency would be the
main motivation for the product, from an experimental point of view.

Has anyone tried integrating processing on the FPGA and DSP to get
better latency results? Your thoughts on the board versus the N2x0?

If you want to take the discussion off list, feel free to.

Thanks,
Colby

Colby_B · April 23, 2011, 12:22am

I haven’t worked with the E100 per se but I’ve worked with the
Beagleboard + USRP1 which is the same thing except the E100 uses SPI to
communicate with the processor versus USB. So I’ve done work on
integrating the OMAP3530 DSP with the GPP within GNU Radio. I feel that
the use of the DSP is a MUST to make the most out of the E100 and with
my Beagleboard + USRP setup I was able to run some FM flowgraphs but one
of my bottlenecks was the USB USRP interface which the E100 would
definitely fix.

Basically, with the E100 I feel that you need to get your hands “dirty”:
you might need to rewrite some GNU Radio components to make use of
the NEON coprocessor, an area where Philip B. has done some good work,
and you would also need to make use of the C64x+ DSP, for which
I can point you to the source code I’ve developed.

The E100 has a lot of untapped potential at the moment and I hope to see
more momentum in working with the NEON and DSP, which is what I’m
focusing on and am willing to help people get started on.

al fayez

Colby_B · April 23, 2011, 12:30am

On 04/22/2011 06:00 PM, Almohanad F. wrote:

I haven’t worked with the E100 per se but I’ve worked with the Beagleboard +
USRP1 which is the same thing except the E100 uses SPI to communicate with the
processor versus USB. So I’ve done work on integrating the OMAP3530 DSP with the
GPP within GNU Radio. I feel that the use of the DSP is a MUST to make the most
out of the E100 and with my Beagleboard + USRP setup I was able to run some FM
flowgraphs but one of my bottlenecks was the USB USRP interface which the E100
would definitely fix.

Al’s points are good, except that the E100 uses the GPMC bus to communicate
with the FPGA: a 16-bit-wide data bus instead of 1.

Philip

Colby_B · April 23, 2011, 1:26am

I’ve always wondered about the design difference between the E100 and
the work you did with Chris Anderson’s board … now I know. BTW where
do you have your driver code posted for the E100 and any documentation,
if it exists yet ? I found slides that you presented on April 13th.

I want to get acquainted with what you did in hopes that I can get hold
of an E100 this summer.

al fayez

Colby_B · April 23, 2011, 3:06pm

On 04/22/2011 07:05 PM, Almohanad F. wrote:

I’ve always wondered about the design difference between the E100 and the work
you did with Chris Anderson’s board … now I know. BTW where do you have your
driver code posted for the E100 and any documentation, if it exists yet ? I
found slides that you presented on April 13th.

Those slides are as recent as it gets. There may be video of that talk in
a few months.

Driver code is here:

Philip

Colby_B · May 3, 2011, 5:26pm

Hi Josh, Philip,

On Sat, Apr 23, 2011 at 17:05, Philip B. wrote:

Driver code is here:

GitHub - balister/linux-omap-philip: Drivers for beagle sdr and maybe some other stuff

We’re seeking to get maximum throughput from the USRP E100. Our goal is to
collect some samples to RAM and then process them offline. Right now
the E100 can’t achieve even 4 MSPS, which is not enough for us. What do
you feel is the limiting element? GPMC should be wide enough
to transfer much more data and RAM should be fast enough too. Is it
IRQ, or user-space processing?

We are considering replacing the Gumstix with a more powerful C6-Integra SoC,
like the TMS320C6A8167. But before we dive into hardware mods we want to
get it working somehow with what we already have.

This is for the open-source WiMAX scanner project:
http://code.google.com/p/wimax-scanner/

–
Regards,
Alexander C…

Colby_B · May 3, 2011, 11:04pm

On 05/03/2011 11:25 AM, Alexander C. wrote:

GPMC should be wide enough to transfer much more data and RAM should be
fast enough too. Is it IRQ, or user-space processing?

First, do you have all the E100 kernel updates from here:

http://ettus-apps.sourcerepo.com/redmine/ettus/projects/usrpe1xx/wiki/Updating_E1XX_Boot_Files_and_Kernel_Modules

The MLO update is very significant as the L3 clock is running slow with
the original MLO. The driver updates help some.

There are a few factors here:

The interface with the FPGA is still asynchronous. This limits the
bus cycle time we can use. We have spent some time looking at a
synchronous interface, but the GPMC controller does not provide a
free-running clock for the FPGA. (The clock is only active during bus
cycles, leaving no clocks available to finish internal FPGA cycles.)

The transfer size is 2048 bytes. Larger sizes are possible, but they
make latency worse. Smaller sizes are better for latency, but max
transfer rate suffers.

There is a delay getting interrupts after the FPGA signals data ready
via GPIO. This is not huge, but for high rates it hurts. I’m not certain
where the delay is (GPIO interrupt controller, or kernel interrupt
handler).

Be sure to tell UHD you want integer samples. I’m thinking even then
UHD has to swap IQ for historical reasons. (Josh, help?)

Anything you can do in the FPGA to reduce the sample rate helps. With
the E100 there is lots of free space in the FPGA for custom processing.

If I was smart about the DSP (I’m not), having the DSP take the data
from the FPGA and reduce the rate should help also.

As always, I am very interested in ideas for improving performance.

Philip

Colby_B · May 4, 2011, 12:34am

Be sure to tell UHD you want integer samples. I’m thinking even then
UHD has to swap IQ for historical reasons. (Josh, help?)

There is a copy-conversion operation between kernel buffer memory and
user memory. If the user requests complex shorts, I believe the
conversion is the equivalent of a 16 bit integer pair swapping. This
routine could be replaced with NEON intrinsics pretty easily.
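
To make the "16 bit integer pair swapping" concrete, here is a minimal Python sketch of what such a conversion would do to a buffer of interleaved 16-bit values. It is not the UHD routine; the function name swap_iq_pairs and the assumption that the buffer is simply interleaved I/Q (or Q/I) shorts are illustrative only.

```python
from array import array


def swap_iq_pairs(samples):
    """Return a copy of interleaved 16-bit values with each pair swapped.

    Assumes the buffer holds pairs of signed 16-bit integers, so
    [a0, b0, a1, b1, ...] becomes [b0, a0, b1, a1, ...].
    """
    src = array("h", samples)  # normalise the input to signed 16-bit
    if len(src) % 2 != 0:
        raise ValueError("expected an even number of 16-bit values")
    swapped = array("h", src)  # same length as src, overwritten below
    swapped[0::2] = src[1::2]  # second element of each pair moves first
    swapped[1::2] = src[0::2]  # first element of each pair moves second
    return swapped


if __name__ == "__main__":
    raw = array("h", [1, -1, 2, -2, 3, -3])
    print(swap_iq_pairs(raw).tolist())
```

Doing this once per sample on the ARM is pure overhead when the goal is only to capture raw data, which is why moving the swap into the FPGA or into a SIMD routine comes up in this thread.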

-Josh

Colby_B · May 3, 2011, 8:56pm

Check top when running a simple data sink or source. If the CPU is
pegged, maybe that is the limiter; if not, there is a memory bottleneck
somewhere.

On Tue, May 3, 2011 at 8:25 AM, Alexander C.

Colby_B · May 4, 2011, 8:02am

Philip,

On Wed, May 4, 2011 at 01:03, Philip B. wrote:

First, do you have all the E100 kernel updates from here:

http://ettus-apps.sourcerepo.com/redmine/ettus/projects/usrpe1xx/wiki/Updating_E1XX_Boot_Files_and_Kernel_Modules

The MLO update is very significant as the L3 clock is running slow with the
original MLO. The driver updates help some.

No, we haven’t updated. Thank you for pointing to this!

There are a few factors here:

The interface with the FPGA is still asynchronous. This limits the bus
cycle time we can use. We have spent some time looking at a synchronous
interface, but the GPMC controller does not provide a free running clock for
the FPGA. (The clock is only active during bus cycles, leaving no clocks
available to finish internal fpga cycles)

Interesting.
What throughput have you achieved with asynchronous GPMC? It is OK for
us if it can push >= 12 MSPS, i.e. if its throughput is 24e6 words/sec
or more.

The transfer size is 2048 bytes. Larger sizes are possible, but they make
latency worse. Smaller sizes are better for latency, but max transfer rate
suffers.

We don’t care about latency at all - we want to capture a lot of
samples to RAM (say, 40Mb of samples) and then slowly process them in
non-real-time. We’re looking into ways to remove this 2048-byte
limitation, because it may help us get higher rates. Could you please
advise us where to look for this? Are FPGA code changes needed? We see
that the kernel driver has no notion of a 2048-byte buffer and can provide
any number of samples - does the limit reside at some higher level?

There is a delay getting interrupts after the fpga signals data ready via
GPIO. This is not huge, but for high rates it hurts. I’m not certain where
the delay is (GPIO interrupt controller, or kernel interrupt handler).

Could we transfer in bigger packets to reduce GPIO overhead?

Be sure to tell UHD you want integer samples. I’m thinking even then UHD
has to swap IQ for historical reasons. (Josh, help?)

Anything you can do in the FPGA to reduce the sample rate helps. With the
E100 there is lots of free space in the FPGA for custom processing.

We thought about processing in the FPGA and even wrote some code, but then
decided to do everything in software - it’s easier to get a powerful
enough DSP than to develop and maintain FPGA code. As I mentioned, we
plan to use the TMS320C6A8167, and then move towards C66x.

If I was smart about the DSP (I’m not), having the DSP take the data from
the FPGA and reduce the rate should help also.

Do you mean C64x DSP in Gumstix? I’m not sure I get this.

As always, I am very interested in ideas for improving performance.

And thank you for your help!

–
Regards,
Alexander C…

Colby_B · May 4, 2011, 2:45pm

On 05/04/2011 02:03 AM, Alexander C. wrote:

routine could be replaced with NEON intrinsics pretty easily.

Ugh. Could you please point us to this code so we can disable it? We
don’t need to convert data in real-time, and if we can avoid copying
it would be great help too. We need to capture raw data to RAM with
maximum possible throughput and then process it in offline mode.

I would do it in the FPGA. It is not a huge deal, but at the rates you
want, every little bit helps.

The floating point translation uses a NEON vrev instruction to do the
swapping, but I think the swap is done in C++ for the int case. (Based
on a quick look last night).

Philip

Colby_B · May 4, 2011, 8:05am

On Wed, May 4, 2011 at 02:33, Josh B. wrote:

Be sure to tell UHD you want integer samples. I’m thinking even then
UHD has to swap IQ for historical reasons. (Josh, help?)

There is a copy-conversion operation between kernel buffer memory and
user memory. If the user requests complex shorts, I believe the
conversion is the equivalent of a 16 bit integer pair swapping. This
routine could be replaced with NEON intrinsics pretty easily.

Ugh. Could you please point us to this code so we can disable it? We
don’t need to convert data in real-time, and if we can avoid copying
it would be great help too. We need to capture raw data to RAM with
maximum possible throughput and then process it in offline mode.

–
Regards,
Alexander C…

Colby_B · May 4, 2011, 3:06pm

We should move this to the usrp-users list since this has no gnuradio
content. I’ve added it to the cc list.

On 05/04/2011 02:01 AM, Alexander C. wrote:

What throughput have you achieved with asynchronous GPMC? It is OK for
us if it can push >= 12 MSPS, i.e. if its throughput is 24e6 words/sec
or more.

I don’t remember off the top of my head. On a loopback test, I see about
2 MSPS, which means 2 MSPS go into the FPGA and 2 come back. There is a
test program that lets you set a decimation and then looks for drops for
testing one-way transfers. 90% of my work has revolved around
correctness to this point.

The read and write cycle times are 17 clocks at the moment (L3 clock
rate is 166 MHz). So that is 102 ns per sample if everything else is
perfect. See arch/arm/mach-omap2/board-overo.c for the GPMC config.
(This setup will move to u-boot at some point.)
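
The 102 ns figure follows directly from the two numbers quoted above. A short Python sketch of the arithmetic, assuming one 16-bit word moves per bus cycle (the thread only says the bus is 16 bits wide) and ignoring interrupt and driver overhead entirely:

```python
L3_CLOCK_HZ = 166e6      # L3 clock rate quoted in the thread
CLOCKS_PER_CYCLE = 17    # current read/write cycle length quoted in the thread


def bus_cycle_ns(clocks, clock_hz):
    """Duration of one bus cycle in nanoseconds."""
    return clocks / clock_hz * 1e9


def max_cycles_per_sec(clocks, clock_hz):
    """Upper bound on bus cycles per second with back-to-back transfers."""
    return clock_hz / clocks


if __name__ == "__main__":
    cycle = bus_cycle_ns(CLOCKS_PER_CYCLE, L3_CLOCK_HZ)
    words = max_cycles_per_sec(CLOCKS_PER_CYCLE, L3_CLOCK_HZ)
    print(f"{cycle:.1f} ns per bus cycle")
    print(f"{words / 1e6:.2f} M 16-bit words/s at best")
    # Assumption: a complex sample is one 16-bit I word plus one 16-bit Q word.
    print(f"{words / 2 / 1e6:.2f} M complex samples/s at best")
```

Whether "per sample" means one 16-bit word or a full I/Q pair is not stated here; the last line shows the more pessimistic reading, which matches the 12 MSPS = 24e6 words/sec accounting used earlier in the thread.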

any number of samples - does it reside at some higher levels?

The driver does have a concept of 2048-byte buffers. This could easily go to
4K buffers since the ring buffer is allocated via get_free_page. It
could be bigger if you allocated contiguous pages so you had larger than
4K physical blocks of memory. The majority of the complexity of the
driver is creating buffers usable by the DMA system that can be mapped
into user space and dealing with cache management.

There is a delay getting interrupts after the fpga signals data ready via
GPIO. This is not huge, but for high rates it hurts. I’m not certain where
the delay is (GPIO interrupt controller, or kernel interrupt handler).

Could we transfer in bigger packets to reduce GPIO overhead?

Yes. This will reduce the percentage of time you are waiting on the
interrupt handlers.
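
A rough way to see why bigger packets help is to model each block as a fixed interrupt wait followed by back-to-back bus cycles. The sketch below is only a toy model: the 20 microsecond interrupt latency is a made-up placeholder (the thread does not give a measured value), and it assumes the wait is not overlapped with the transfer.

```python
L3_CLOCK_HZ = 166e6        # from the thread
CLOCKS_PER_CYCLE = 17      # from the thread
BYTES_PER_CYCLE = 2        # 16-bit data bus
IRQ_LATENCY_US = 20.0      # HYPOTHETICAL value, for illustration only


def transfer_time_us(block_bytes):
    """Time spent on bus cycles for one block, in microseconds."""
    cycles = block_bytes / BYTES_PER_CYCLE
    return cycles * CLOCKS_PER_CYCLE / L3_CLOCK_HZ * 1e6


def wait_fraction(block_bytes, irq_latency_us=IRQ_LATENCY_US):
    """Fraction of each block period spent waiting on the interrupt."""
    return irq_latency_us / (transfer_time_us(block_bytes) + irq_latency_us)


def effective_rate_mb_s(block_bytes, irq_latency_us=IRQ_LATENCY_US):
    """Bytes per microsecond, which is numerically MB/s."""
    return block_bytes / (transfer_time_us(block_bytes) + irq_latency_us)


if __name__ == "__main__":
    for size in (1024, 2048, 4096, 8192):
        print(f"{size:5d} bytes: "
              f"{wait_fraction(size) * 100:5.1f}% waiting, "
              f"{effective_rate_mb_s(size):5.2f} MB/s")
```

In this model, doubling the block size roughly halves the share of time lost to the interrupt, but the rate can never exceed the raw bus cycle limit, so larger blocks alone cannot close a large gap.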

plan to use TMS320C6A8167, and then move towards C66x.

Even though the FPGA is tightly coupled to the OMAP, anything you can do
in the FPGA will help. I know it is harder to do processing in the FPGA
than in a processor, but the FPGA is really good at the high-rate stuff.

If I was smart about the DSP (I’m not), having the DSP take the data from
the FPGA and reduce the rate should help also.

Do you mean C64x DSP in Gumstix? I’m not sure I get this.

Yes. Basically, instead of having the ARM control the interface to the
FPGA, have the DSP (in the OMAP) do it. Then pass processed data to the
ARM.

Philip

Colby_B · May 4, 2011, 3:38pm

On Wed, May 4, 2011 at 17:05, Philip B. wrote:

I don’t remember off the top of my head. On a loopback test, I see about 2
MSPS, which means 2 MSPS go into the FPGA and 2 come back. There is a test
program that lets you set a decimation and then looks for drops for testing
one-way transfers. 90% of my work has revolved around correctness to this
point.

The read and write cycle times are 17 clocks at the moment (L3 clock rate is
166 MHz). So that is 102 ns per sample if everything else is perfect.

If 9 MSPS is a theoretical limit, that’s too bad. Do you know whether it
is the maximum for async GPMC?

See arch/arm/mach-omap2/board-overo.c for the GPMC config. (This setup will
move to u-boot at some point.)

In this repo?

Do you mean C64x DSP in Gumstix? I’m not sure I get this.

Yes. Basically, instead of having the ARM control the interface to the FPGA,
have the DSP (in the OMAP) do it. Then pass processed data to the ARM.

I’m not sure that this is possible. If I read docs correctly,
interrupts from GPMC can go only to ARM and only ARM can control GPMC.
I would be happy to be wrong.

–
Regards,
Alexander C…

Colby_B · May 4, 2011, 5:33pm

On 05/04/2011 09:37 AM, Alexander C. wrote:

If 9 MSPS is a theoretical limit, that’s too bad. Do you know is it
the maximum for async GPMC?

I’m not sure what the max is. The FPGA clock is 64 MHz, so you need to
be able to sync the GPMC signals to that clock. With a sync interface, I
hope you could do transfers in 4-5 L3 clocks.

All this said, I still feel like the best solution is to reduce the
sample rate in the FPGA.
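
To put numbers on "reduce the sample rate in the FPGA", the sketch below finds the smallest integer decimation whose output would fit under the idealized bus limit, for the current 17-clock cycle and for the hoped-for 4-5 clock synchronous cycle. It assumes the output rate is the 64 MHz FPGA clock divided by an integer decimation and that each complex sample costs two 16-bit bus cycles; neither is spelled out in this thread, and interrupt overhead is ignored.

```python
import math

FPGA_CLOCK_HZ = 64e6     # FPGA clock quoted in the thread
L3_CLOCK_HZ = 166e6      # L3 clock quoted in the thread
WORDS_PER_SAMPLE = 2     # ASSUMPTION: 16-bit I plus 16-bit Q per sample


def max_sample_rate(clocks_per_cycle):
    """Idealized complex-sample rate the bus could carry, in samples/s."""
    return L3_CLOCK_HZ / clocks_per_cycle / WORDS_PER_SAMPLE


def min_decimation(clocks_per_cycle):
    """Smallest integer decimation whose output fits under the bus limit."""
    return math.ceil(FPGA_CLOCK_HZ / max_sample_rate(clocks_per_cycle))


if __name__ == "__main__":
    for clocks in (17, 5, 4):
        decim = min_decimation(clocks)
        rate = FPGA_CLOCK_HZ / decim
        print(f"{clocks:2d} clocks/cycle: decimation >= {decim}, "
              f"i.e. at most {rate / 1e6:.2f} MSPS")
```

These are ceilings under ideal conditions, so real achievable rates would be lower once interrupt latency and the user-space copy are included.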

See arch/arm/mach-omap2/board-overo.c for the GPMC config. (This setup will
move to u-boot at some point.)

In this repo?
GitHub - balister/linux-omap-philip: Drivers for beagle sdr and maybe some other stuff

Yes.

Philip

Colby_B · May 6, 2011, 10:55pm

On Wed, May 4, 2011 at 19:32, Philip B. wrote:

What is the difference between this repo and patches in
ettus_oe.git/recipes/linux/linux-usrp-embedded-2.6.35 ?

–
Regards,
Alexander C…

Colby_B · May 4, 2011, 6:12pm

On Wed, May 4, 2011 at 19:32, Philip B. wrote:

All this said, I still feel like the best solution is to reduce the sample
rate in the FPGA.

We need at least 11.2MSPS to get WiMAX going. We could use 8-bit
samples, but this decreases dynamic range. That’s the last resort.
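
The 11.2 MSPS requirement can be turned into a bus cycle budget. The sketch below computes how many 16-bit bus cycles per second are needed for 16-bit and for 8-bit I/Q components, and the longest bus cycle (in L3 clocks) that could sustain that rate. It assumes 8-bit I and Q would be packed into a single 16-bit word, which is an illustrative guess rather than something the FPGA image is known to do, and it ignores all interrupt and copy overhead.

```python
import math

L3_CLOCK_HZ = 166e6       # L3 clock quoted in the thread
BUS_WIDTH_BITS = 16       # GPMC data bus width quoted in the thread
WIMAX_RATE_SPS = 11.2e6   # required sample rate quoted in the thread


def required_cycles_per_sec(sample_rate, bits_per_component):
    """Bus cycles per second needed to move complex samples of this width."""
    bits_per_sample = 2 * bits_per_component  # I and Q
    return sample_rate * bits_per_sample / BUS_WIDTH_BITS


def max_clocks_per_cycle(sample_rate, bits_per_component):
    """Longest bus cycle, in whole L3 clocks, that still keeps up."""
    needed = required_cycles_per_sec(sample_rate, bits_per_component)
    return math.floor(L3_CLOCK_HZ / needed)


if __name__ == "__main__":
    for bits in (16, 8):
        needed = required_cycles_per_sec(WIMAX_RATE_SPS, bits)
        budget = max_clocks_per_cycle(WIMAX_RATE_SPS, bits)
        print(f"{bits:2d}-bit I/Q: {needed / 1e6:.1f} M cycles/s, "
              f"bus cycle must be <= {budget} L3 clocks")
```

Under these assumptions, neither sample width fits within the current 17-clock asynchronous cycle, while both would fit within the 4-5 clock cycle hoped for from a synchronous interface.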

–
Regards,
Alexander C…

Colby_B · May 6, 2011, 11:03pm

On 05/06/2011 04:53 PM, Alexander C. wrote:

Yes.

What is the difference between this repo and patches in
ettus_oe.git/recipes/linux/linux-usrp-embedded-2.6.35 ?

The patches in the recipe come from the e100-update-1 branch in the git
repository. In other words they should give the same result.

Philip