Re: interfacing a DSP array card to USRP2

Hi,

From: Jeff B. [email protected]

Matt-

We’re working on a project at Signalogic to interface one of our DSP array
PCIe cards to the USRP2. This would provide a way for one or more TI DSPs to
“insert” into the data flow and run C/C++ code for low-latency and/or other
high-performance applications. The idea is that we would modify the current
USRP2 driver (or create an alternative) so it would read/write to/from the
PCIe card instead of the Linux (motherboard) GbE.
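As a rough illustration of that idea (every name below is hypothetical; this is not how the actual USRP2 host code is organized), the substitution amounts to hiding the data path behind a small read/write transport interface and plugging in a different backend:

/*
 * Hypothetical sketch only: the real USRP2 host code is not organized
 * exactly like this, and every name here (usrp2_transport, gbe_*,
 * dsp_card_*) is invented for illustration.  The point is that if the
 * driver reads and writes through one small transport interface, swapping
 * GbE for a PCIe DSP card means plugging in a different backend.
 */
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <sys/types.h>

struct usrp2_transport {
    const char *name;
    void       *ctx;                                   /* socket fd, card handle, ... */
    ssize_t   (*send)(void *ctx, const void *buf, size_t len);
    ssize_t   (*recv)(void *ctx, void *buf, size_t len);
};

/* Existing path: raw GbE frames (stubbed out here). */
static ssize_t gbe_send(void *ctx, const void *buf, size_t len)
{ (void)ctx; (void)buf; return (ssize_t)len; }
static ssize_t gbe_recv(void *ctx, void *buf, size_t len)
{ (void)ctx; memset(buf, 0, len); return (ssize_t)len; }

/* Proposed path: DSP array PCIe card (stubbed out here). */
static ssize_t dsp_card_send(void *ctx, const void *buf, size_t len)
{ (void)ctx; (void)buf; return (ssize_t)len; }
static ssize_t dsp_card_recv(void *ctx, void *buf, size_t len)
{ (void)ctx; memset(buf, 0, len); return (ssize_t)len; }

int main(void)
{
    struct usrp2_transport gbe  = { "gbe",      NULL, gbe_send,      gbe_recv };
    struct usrp2_transport card = { "dsp-pcie", NULL, dsp_card_send, dsp_card_recv };
    (void)gbe;

    /* The rest of the driver only ever sees 't'. */
    struct usrp2_transport *t = &card;   /* or &gbe */
    char frame[1500] = { 0 };

    t->send(t->ctx, frame, sizeof frame);
    t->recv(t->ctx, frame, sizeof frame);
    printf("streamed one frame via the '%s' backend\n", t->name);
    return 0;
}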

I want to share my modest experience with DSP PCI cards (not PCIe) with the
community.

The most important thing when working with these cards is how the card
hardware and its software driver work together. Using a PCI card does not
necessarily give you a low-latency system.

To give a clearer picture: about two years ago I worked with a PCI card (from
a respected manufacturer, with four 125 MSPS 14-bit ADCs, four GC5016 DDCs,
and a 4M-gate Xilinx Virtex-II Pro). The card was 64/32-bit (it can work on a
32- or 64-bit PCI bus) and accepts a 66/33 MHz PCI clock. Theoretically it can
transfer up to 528 MByte/sec when hosted on a 64-bit/66 MHz PCI bus (very
difficult to find) and up to 132 MByte/sec on a 32-bit/33 MHz PCI bus (very
common). In real-time testing it gave me about 113 MByte/sec of streaming
data, because my platform was 32-bit/33 MHz.
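For reference, those peak figures follow directly from bus width times clock rate; a quick check (MB = 10^6 bytes):

/* Quick check of the PCI numbers above (MB = 10^6 bytes throughout). */
#include <stdio.h>

int main(void)
{
    double peak_64_66 = (64.0 / 8.0) * 66e6 / 1e6;  /* 8 bytes/cycle * 66 MHz = 528 MB/s */
    double peak_32_33 = (32.0 / 8.0) * 33e6 / 1e6;  /* 4 bytes/cycle * 33 MHz = 132 MB/s */
    double measured   = 113.0;                      /* sustained streaming rate reported above */

    printf("64-bit/66 MHz peak: %.0f MB/s\n", peak_64_66);
    printf("32-bit/33 MHz peak: %.0f MB/s\n", peak_32_33);
    printf("measured:           %.0f MB/s (about %.0f%% of the 32-bit/33 MHz peak)\n",
           measured, 100.0 * measured / peak_32_33);
    return 0;
}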

The card's problem was its transfer latency. It can transfer a data block of
up to 64k at about 350 µs latency (very high), and I could not reduce that
latency significantly even by using a faster multiprocessor PC. The card works
by collecting data in its built-in FIFO, transferring that data to shared PCI
RAM, and then raising a hardware interrupt to notify the OS that data is
available, at which point the driver copies the data into the user's working
space. The card's drivers were for Windows. At first I thought this was a
problem of slow Windows kernel interrupt servicing, but when the card
manufacturer released a Linux driver (after about a year) I carried out the
tests again and the same latency problem persisted.
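To make that sequence concrete, here is a rough user-space mock of the data path just described. It is not the vendor's driver (that work happens in kernel space), and the 64 kB block size is only an assumption about what "64k" means; it just shows the three steps every block goes through:

/*
 * Rough user-space mock of the data path described above; the vendor's real
 * driver does this in kernel space and the names/sizes here are invented.
 * The point is that every block, large or small, pays for all three steps.
 */
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_BYTES (64 * 1024)            /* assumes the "64k" block above means 64 kB */

static uint8_t card_fifo[BLOCK_BYTES];     /* stands in for the on-card FIFO    */
static uint8_t shared_ram[BLOCK_BYTES];    /* stands in for the shared PCI RAM  */
static uint8_t user_buf[BLOCK_BYTES];      /* the application's working buffer  */

static void dma_fifo_to_shared_ram(void)
{
    /* Step 1: the card drains its FIFO into host-visible shared RAM. */
    memcpy(shared_ram, card_fifo, BLOCK_BYTES);
}

static void interrupt_and_wake_reader(void)
{
    /* Step 2: the card raises a hardware interrupt; the OS runs the ISR and
     * wakes whatever is blocked waiting for data.  This fixed cost is paid
     * once per block regardless of the block size. */
}

static void copy_block_to_user(void)
{
    /* Step 3: the driver copies the block from shared RAM into user space. */
    memcpy(user_buf, shared_ram, BLOCK_BYTES);
}

int main(void)
{
    dma_fifo_to_shared_ram();
    interrupt_and_wake_reader();
    copy_block_to_user();
    printf("delivered one %d-byte block; every block repeats all three steps\n",
           BLOCK_BYTES);
    return 0;
}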

I concluded that the PCI transfer mechanism is not efficient for small packet
transfers, although it is very useful for streaming large amounts of data.
Again, these observations were for a PCI card, not PCI Express, and they apply
to the particular card I used in the experiments.

Maybe it was just a poor design philosophy for that card, but I wanted to
share this information with the community.

Best Regards,

Firas

Firas-

A couple of brief comments:

  1. Sounds like this was a high-speed data acq card, optimized for
     streaming, not an accelerator card. How big was the FIFO?

  2. On typical motherboards, PCIe connects to the same (or similar) bridge
     chip as GMII… placing the burden on driver software to be efficient.

-Jeff

Hi,

From: Jeff B. [email protected]

Firas-
A couple of brief comments:

  1. Sounds like this was a high-speed data acq card, optimized for
     streaming, not an accelerator card. How big was the FIFO?

The FIFO was 64MByte.

Best Regards,

Firas

Firas-

The FIFO was 64MByte.

That’s huge… and you mentioned a 64k block transfer, which is much smaller but
still more than 40 times the size of a large Ethernet packet. It sounds to me
like this particular card manufacturer was focused on very high-rate streaming
(without gaps or drops) and not on low-latency, small transfers. I would guess
they didn’t set up their driver to optimize small transfer sizes. Maybe the
board didn’t even support a small size, for example if the FIFO had to contain
a minimum number of channels and/or data length before it could assert “not
empty”.

-Jeff

Hi,

From: Jeff B. [email protected]

I would guess they didn’t set up their driver to optimize small transfer
sizes.

-Jeff

I agree that the driver may not be optimized to transfer small packets. On
this card you can set the number of samples per packet, but the problem was
that whether I configured it to transfer 1024 complex samples or 32K complex
samples, the latency was the same.
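Taking the ~350 µs figure from earlier in the thread at face value, the only thing the block size changes is how that fixed cost is amortized per sample; a quick calculation (the sustained-rate column assumes transfers cannot overlap):

/*
 * If each block costs ~350 us no matter how many samples it carries, the
 * only thing a smaller block buys you is more overhead per sample.
 * (350 us is the figure reported earlier in the thread; the numbers below
 * are just that figure divided out.)
 */
#include <stdio.h>

int main(void)
{
    const double block_latency_s = 350e-6;
    const double sizes[] = { 1024, 32768 };       /* complex samples per block */

    for (int i = 0; i < 2; i++) {
        double per_sample_ns = block_latency_s / sizes[i] * 1e9;
        double rate_msps     = sizes[i] / block_latency_s / 1e6;
        printf("%6.0f samples/block -> %6.1f ns overhead/sample, "
               "~%5.1f MS/s if transfers cannot overlap\n",
               sizes[i], per_sample_ns, rate_msps);
    }
    return 0;
}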

Best Regards,

Firas