USRP packet parsing

On 3/22/07, Thibaud H. wrote:

Yes, I forgot that the packets are ordered by timestamps, which solves
the fragmentation issues. However, I cannot find an Altera RAM
megafunction that provides more than two independent ports. This is not
enough and will prevent the FPGA from processing packets (that have the

You cannot have more than 2 ports on their RAM. It is wired up as a
dual-port RAM at most. Use a FIFO for each channel as they are
independent from each other anyway. Moreover, if you get it working
for 1 channel, instantiating the rest of the chain for 2, 3, N
channels is as easy as a for loop.

I think sending one packet per transmit window is the most common.
However, I remember having read somewhere on the wiki what the maximum
transmit rate is, but I cannot find it again.

I am not really thinking about the rates, but more the mechanism as to
how it will be compared. What will the state machine look like? What
is the length of 1 tick? How will it operate to make sure it can send
everything properly? If a FIFO gets half empty, how quickly can we
get more samples to send out of modulated data?

Things like that.

Brian

On 3/22/07, Thibaud H. wrote:

I can copy the samples to a fifo, but I still have 3 processes that want
to use the RAM at the same time: one to progressively store the packets
coming from the usb bus, one to copy the samples into the corresponding
channel fifo and one to copy the subcommands to be executed now. So, if
I am not mistaken, I will have to find a way to synchronize the last two
processes, right?

The FX2 will be writing into a FIFO that stores all packets
intermingled with each other - be it either channel data or control
data.

A read process on that FIFO can separate out the channel data to be
stored into separate channel FIFOs - one FIFO per channel that is
implemented. That same process can also put the control data into a
control FIFO. Only one of those is needed.

That’s 1 writer and 1 reader on the direct FX2 -> FPGA link.
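
As a rough illustration of the single-reader split described above, here is a minimal Python model (a behavioural sketch, not HDL). The packet representation, a dict with `kind` and `chan` keys, is purely hypothetical; the real header layout is whatever the packet format on the wiki defines.

```python
from collections import deque

def demux(usb_fifo, num_channels):
    """Drain usb_fifo into per-channel FIFOs and one control FIFO.

    Each packet is modelled as a dict; the 'kind' and 'chan' keys are
    illustrative assumptions, not the real USRP header fields.
    """
    chan_fifos = [deque() for _ in range(num_channels)]
    cmd_fifo = deque()
    while usb_fifo:
        pkt = usb_fifo.popleft()      # the single reader on the FX2 -> FPGA FIFO
        if pkt["kind"] == "control":
            cmd_fifo.append(pkt)
        else:
            chan_fifos[pkt["chan"]].append(pkt)
    return chan_fifos, cmd_fifo

usb_fifo = deque([
    {"kind": "data", "chan": 0, "payload": "A0"},
    {"kind": "control", "payload": "C0"},
    {"kind": "data", "chan": 1, "payload": "B0"},
    {"kind": "data", "chan": 0, "payload": "A1"},
])
chan_fifos, cmd_fifo = demux(usb_fifo, num_channels=2)
```

The point of the model is only that one reader touches the shared FIFO, so no RAM with more than two ports is needed.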

On the RX end, the FX2 can just read from a separate RX FIFO - that is
easy enough to do with a mux at the IO pins and tri-stating the pins.

I can write and read from the fifo at the same time, so one process
would be in charge of filling the channel fifo. Two states: either wait
for the timestamp to match the time, or perform a copy of the samples
from the ram to the fifo. The problem is that if there is more than one
channel, then this process can be busy filling in channel 1 fifo while
channel 2 fifo is empty. I don’t know how to solve that.

Just do it in parallel. If you are busy filling in channel 1 FIFO and
channel 2 FIFO is empty, then (inherently) channel 2 didn’t have
anything that needs to go out any time soon - right?

Can you come up with a sequence of channel packets that would be sent
that you are confused about? You should be looking at each channel as
an independent source of information to be sent to each channel.
There is a bandwidth limitation that we all have to funnel from the
same USB spigot, but once it’s separated out - we have a ton of
bandwidth to handle all the rest of the processing.

I can see an issue where you may have channel 1 and channel 2 start
sending both of their packets at the exact same time. In this case,
you would want to interleave the data on the 512-byte boundary of the
packet. Therefore, the USB would receive a TX packet N for channel 1,
TX packet N for channel 2, TX packet N+1 for channel 1, TX packet N+1
for channel 2, …
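
The interleaving order described above can be sketched on the host side as a simple round-robin over whole packets. This is a minimal sketch under the assumption that each channel already has its packets queued as a list; the function name and the string labels are made up for illustration.

```python
from itertools import chain, zip_longest

def interleave(*channel_packets):
    """Round-robin whole packets from each channel, skipping exhausted channels."""
    missing = object()
    rounds = zip_longest(*channel_packets, fillvalue=missing)
    return [pkt for pkt in chain.from_iterable(rounds) if pkt is not missing]

ch1 = ["ch1:N", "ch1:N+1"]
ch2 = ["ch2:N", "ch2:N+1", "ch2:N+2"]
order = interleave(ch1, ch2)
print(order)
```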

Does that make sense?

For instance, the packets for one channel can stack up in ram until it’s
full, preventing any other channel from receiving data.

If you are sending all data in the timed order that they have to be
sent, then you shouldn’t have a super rough time keeping up with the
data samples. 8MSPS is about the speed we can send over USB - which
gives us 8 clock cycles per sample to be able to shuffle them around
and do what we need with them.
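
The "8 clock cycles per sample" figure follows from numbers quoted elsewhere in this thread: a 64 MHz master clock and at most 32 MB/s over USB. The sketch below assumes 4 bytes per sample (16-bit I plus 16-bit Q) and ignores packet header overhead.

```python
MASTER_CLOCK_HZ = 64_000_000     # FPGA master clock quoted in this thread
USB_BYTES_PER_SEC = 32_000_000   # peak USB rate quoted in this thread
BYTES_PER_SAMPLE = 4             # assumption: 16-bit I + 16-bit Q

samples_per_sec = USB_BYTES_PER_SEC // BYTES_PER_SAMPLE
clocks_per_sample = MASTER_CLOCK_HZ // samples_per_sec
print(samples_per_sec, clocks_per_sample)   # 8000000 8
```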

Did you ever compile the design with Quartus II to get a report of how
many resources we can use for these FIFOs? It might be a good idea to
see how much space we have in block RAM to distribute and use between
all the channels (just for an estimated target).

What do you think of all that?

Brian

Brian P. wrote:

data.

A read process on that FIFO can separate out the channel data to be
stored into separate channel FIFOs - one FIFO per channel that is
implemented. That same process can also put the control data into a
control FIFO. Only one of those is needed.

That’s 1 writer and 1 reader on the direct FX2 -> FPGA link.

So the fpga would only use fifos and push the data to the next one. It does
not work if shared RAM is used to avoid copies between buffers; or I am
completely wrong and I fail to understand your point.

Just do it in parallel. If you are busy filling in channel 1 FIFO and
channel 2 FIFO is empty, then (inherently) channel 2 didn’t have
anything that needs to go out any time soon - right?

Ok.

you would want to interleave the data on the 512-byte boundary of the
packet. Therefore, the USB would receive a TX packet N for channel 1,
TX packet N for channel 2, TX packet N+1 for channel 1, TX packet N+1
for channel 2, …

Does that make sense?

Yes it does.

Did you ever compile the design with Quartus II to get a report of how
many resources we can use for these FIFOs? It might be a good idea to
see how much space we have in block RAM to distribute and use between
all the channels (just for an estimated target).

Yes, currently 39 out of 52 M4K blocks are used. A fifo of 8192 bytes
takes up 16 M4K whereas a fifo of 32 bytes takes up 2 M4K.

Then the two unknowns are how many channels there are and how much data
we want to be able to burst.

On 3/22/07, Thibaud H. wrote:

So the fpga would only use fifo and push the data to next one. It does
not work if shared RAM is used to avoid copy between buffers; or I am
completely wrong and I fail to understand your point.

You are correct on that one, unless you set up an internal linked list
of starting packets and their start times; then you can possibly read
the values with slight offsets for clock latencies out of the same
RAM.

This, I would argue, is a bit too complicated for the beginning of the
design but may want to be revisited later on. Maybe I should expand
on this idea.

If we keep a mapping of pointers to packets along with their
respective times of interest to start, even if we had something going
out at the same timestamp, we could theoretically read the data in a
staggered fashion such that if there are multiple packets to process
at once, we read one of the pieces of data at a time for each packet
and send it along the signal chain until that packet no longer needs
servicing. This would cause us to have a finite number of packets we
could process at one time (# of channels + control) since we would
need 1 clock cycle to read the next packet location for each of the
packets, register them all, then send out the “ship these samples off”
signal. For packets that have different sizes, we will have to keep
track of how many bytes of data are left to read and take them out of
the queue when it is all finished. This wouldn’t really be a FIFO,
but really a RAM mapping that we would manipulate pretty heavily.

It’s an interesting idea, and can probably be engineered to be pretty
cool. It will also be conducive to lowering the resolution of the
32-bit timestamp to do things since we’ll have to multiplex the
pulling of data out of the RAM.

I’ll think about this method and what problems there may be, but give
me feedback. What do you think of this? See how you can get the
reads all done in one large FIFO instead of needing all the little
ones? See how much extra complexity it adds to the FSM that reads and
feeds the TX chain?

Yes, currently 39 out of 52 M4K blocks are used. A fifo of 8192 bytes
takes up 16 M4K whereas a fifo of 32 bytes takes up 2 M4K.

How many TX and RX chains does that include? Can we make a table of
how many M4K blocks are used per TX and per RX channel that is added
to the USRP? That would be helpful in defining constraints.

Then the two unknowns are how many channels there are and how much data
we want to be able to burst.

Right - that somewhat becomes a latency issue. How quickly can we get
data down once a flag is sent to the host saying "a FIFO is half full
- requesting more data". Does this control really need to be in
there? How much memory is within the FX2 for USB endpoints? Could
that be a temporary place to store packets and from there is where the
request for more data comes?

Any ideas Eric?

Brian

A quick update while I am understanding your previous message.

It looks like each Rx chain takes up 3 blocks, but the Tx chains do
not use any memory blocks. (Actually Tx0 uses one block but not Tx1).

The Rx and Tx buffers use 16 blocks each.

Thibaud

On Thu, Mar 22, 2007 at 04:34:49PM -0400, Thibaud H. wrote:

Then the two unknowns are how many channels there are and how much data
we want to be able to burst.

First pass (current USRP), assume 2 channels, one for control, one for
data. Design and test it such that we can trivially add additional data
channels.

Eric

On Thu, Mar 22, 2007 at 03:33:08PM -0400, Brian P. wrote:

data.
Agreed.

A read process on that FIFO can separate out the channel data to be
stored into separate channel FIFOs - one FIFO per channel that is
implemented. That same process can also put the control data into a
control FIFO. Only one of those is needed.

That’s 1 writer and 1 reader on the direct FX2 -> FPGA link.

Yes.

That also means that only one of the FIFO’s (the first one) needs to
use two separate clocks, which is good. Thibaud, if you haven’t
already, be sure to take a look at “Cyclone Device Handbook”, Section
III, “Memory”. It spells out the RAM/FIFO/clock configurations
available.

On the RX end, the FX2 can just read from a separate RX FIFO - that is
easy enough to do with a mux at the IO pins and tri-stating the pins.

Yep.

packet. Therefore, the USB would receive a TX packet N for channel 1,
data samples. 8MSPS is about the speed we can send over USB - which
gives us 8 clock cycles per sample to be able to shuffle them around
and do what we need with them.

Did you ever compile the design with Quartus II to get a report of how
many resources we can use for these FIFOs? It might be a good idea to
see how much space we have in block RAM to distribute and use between
all the channels (just for an estimated target).

FWIW, there are 53 M4K blocks in the EP1C12. Each one can implement a
256 x 16 FIFO (or 128 x 32 FIFO), enough to hold a single USB packet.
We use a bit of ram other places, but there should be plenty to go
around.

Assume you’ve got one fifo, perhaps 2x packet length, for the FX2
FIFO, then work it out so that the rest of the channels have at least
say, 2x packet length.
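
To put rough numbers on that sizing rule, here is a back-of-the-envelope block count. It assumes ideal packing (512 data bytes per M4K, no extra blocks for width or control logic), one FX2-facing FIFO, one control FIFO and N data-channel FIFOs; actual Quartus results may differ.

```python
import math

M4K_BYTES = 512       # 256 x 16 bits of data per block, per the figure above
PACKET_BYTES = 512    # one USB packet

def blocks_for(fifo_bytes):
    """M4K blocks needed for a FIFO, assuming ideal packing."""
    return math.ceil(fifo_bytes / M4K_BYTES)

def tx_budget(num_data_channels, packets_per_fifo=2):
    """Blocks for the FX2-facing FIFO, a control FIFO and N data FIFOs."""
    per_fifo = blocks_for(packets_per_fifo * PACKET_BYTES)
    return per_fifo * (2 + num_data_channels)

for n in (1, 2, 4):
    print(n, "data channel(s):", tx_budget(n), "M4K blocks")
```

Under these assumptions an 8192-byte FIFO comes out at 16 blocks, which agrees with the compile report quoted earlier in the thread.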

Eric

On Thu, Mar 22, 2007 at 04:51:10PM -0400, Brian P. wrote:

On 3/22/07, Thibaud H. wrote:

So the fpga would only use fifo and push the data to next one. It does
not work if shared RAM is used to avoid copy between buffers; or I am
completely wrong and I fail to understand your point.

I think I steered us down the wrong path with the shared RAM idea.
We really do want a dedicated read port for each channel in the Tx
direction. This is easy if we assigned a FIFO (or dedicated RAM) per
channel.

It’s an interesting idea, and can probably be engineered to be pretty
cool. It will also be conducive to lowering the resolution of the
32-bit timestamp to do things since we’ll have to multiplex the
pulling of data out of the RAM.

I’m not following why you want to reduce the resolution of the clock
from the full ADC clock speed (64 MHz)…

Right - that somewhat becomes a latency issue. How quickly can we get
data down once a flag is sent to the host saying "a FIFO is half full
- requesting more data". Does this control really need to be in
there? How much memory is within the FX2 for USB endpoints? Could
that be a temporary place to store packets and from there is where the
request for more data comes?

All the TX data comes down a single FX2 endpoint, as does the RX data.
Each endpoint is dual or quad buffered in the FX2. Right now, we’re
quad-buffering, though to reduce latency (useful for some MACs) we may
want to consider going to double buffering.

In the Tx direction, the flow control between the FX2 and the FPGA is
implemented with a single pin, “HAVE_SPACE” (I may have the name
wrong, but I’m close). This is currently asserted by the FPGA
whenever there’s room for the FPGA to receive a 512 byte packet from
the FX2 across the GPIF. When the FX2 sees this pin asserted, and
there’s a packet ready in the FX2, it schedules a 512 byte DMA xfer
across the GPIF.
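
The Tx handshake just described can be modelled in a few lines. This is a toy behavioural sketch: the class and function names are invented for illustration, and the two-packet capacity is only an example value.

```python
from collections import deque

PACKET_BYTES = 512

class FpgaTxFifo:
    """Toy model of the FPGA-side FIFO that the FX2 fills over the GPIF."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.packets = deque()

    @property
    def have_space(self):
        # Asserted only when a whole 512-byte packet still fits.
        used = len(self.packets) * PACKET_BYTES
        return self.capacity - used >= PACKET_BYTES

def fx2_step(fx2_ready, fpga):
    """One FX2 scheduling decision: transfer a packet if both sides allow it."""
    if fpga.have_space and fx2_ready:
        fpga.packets.append(fx2_ready.popleft())
        return True
    return False

fpga = FpgaTxFifo(capacity_bytes=2 * PACKET_BYTES)
fx2_ready = deque(["pkt0", "pkt1", "pkt2"])
transfers = [fx2_step(fx2_ready, fpga) for _ in range(3)]
print(transfers)
```

With room for two packets, the third transfer is held off until the FPGA drains a packet and the pin is asserted again.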

It’s similar in the Rx direction. The FPGA asserts HAVE_PKT_AVAIL
when it has something that it wants the FX2 to pull from the FPGA.
The GPIF is configured such that it is mastered from the FX2 side.

I’m glad we’re having this discussion!

Eric

On Thu, Mar 22, 2007 at 06:44:01PM -0400, Thibaud H. wrote:

A quick update while I am understanding your previous message.

The Rx and Tx buffers use 16 blocks each.

This is probably overkill. As I mentioned earlier, we weren’t using
the memory, so we just assigned it to these buffers.

Eric

On Fri, Mar 23, 2007 at 10:15:12AM -0400, Brian P. wrote:

I don’t know the timing within a Cyclone device, and I know you are
already using Gray counters within the FPGA, so I was getting worried
that a 32-bit accumulator running at 64MHz might cause timing closure
issues. It probably isn’t an issue, but those carry chains are only
so long - and the propagation delay really adds up. I suppose we
could always just run them as smaller adders with a pipelined carry
signal instead of having to have the carry asynchronously propagate
through the chain.

Good observation re pipelining the carry if reqd. I don’t think we’re
going to have many counters running at full speed (1 ?), so I’m not
too worried about this.
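
For reference, the pipelined-carry idea can be checked with a small cycle-level model. This is a sketch, assuming a 32-bit free-running counter split into two 16-bit halves with a registered carry between them; the split point and the class name are illustrative choices, not taken from the USRP design.

```python
class PipelinedCounter32:
    """32-bit counter built from two 16-bit halves and a registered carry."""
    def __init__(self):
        self.low = 0     # lower 16 bits, increments every clock
        self.high = 0    # upper 16 bits, lags the carry by one clock
        self.carry = 0   # registered carry out of the lower half

    def tick(self):
        # All registers update together from the previous cycle's values.
        next_low = (self.low + 1) & 0xFFFF
        next_carry = 1 if self.low == 0xFFFF else 0
        next_high = (self.high + self.carry) & 0xFFFF
        self.low, self.carry, self.high = next_low, next_carry, next_high

    def value(self):
        # Fold in a carry that has been registered but not yet absorbed.
        high = (self.high + self.carry) & 0xFFFF
        return (high << 16) | self.low

counter = PipelinedCounter32()
for _ in range(0x10000):
    counter.tick()
print(hex(counter.value()))   # 0x10000
```

Each adder is only 16 bits wide, so the carry chain is half as long; the cost is that the upper half trails by one clock, which `value()` compensates for.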

All the TX data comes down a single FX2 endpoint, as does the RX data.
Each endpoint is dual or quad buffered in the FX2. Right now, we’re
quad-buffering, though to reduce latency (useful for some MACs) we may
want to consider going to double buffering.

Can you have double-buffering for the RX and quad-buffering for the
TX? In the TX case, I am not sure double-buffering would allow for a
lower latency.

Yes, the buffering is independent on Tx and Rx.

Since everything will be ordered in time, it would seem (to me) to be
better to have queued up more packets on the outgoing stream. Correct
or incorrect?

In both directions, we need enough buffer space to cover any jitter in
the driver and user-space host s/w.

I agree that more on the outgoing stream makes more sense.

Eric

Eric B. wrote:

data.
Yes.

So, let me summarize:

One dual-clock fifo (usb_fifo) to buffer the packets from the FX2. One
process stores the data from the usb while another process reads it
and splits commands (stored in cmd_fifo) from data (stored in
chan0_fifo, chan1_fifo, etc). The data in chanX_fifo would be stored
like this:

:            :
| sample 2   |
| sample 1   |
| #samples   |
| timestamps |
+------------+

For each channel fifo, a process would wait for the timestamps to match
the time register and then write the next <#samples> to the
corresponding channel transmit chain at every tx_clock tick. So we end
up with (#channel + 1) fifos.

The problem I see is that between two blocks of samples in the fifo, there
will be some processing delay to read the next timestamp and the
#samples. That’s why I added a fifo between the usb_fifo and the
chanX_fifo in the diagram on the wiki. Is that a real problem?

That also means that only one of the FIFO’s (the first one) needs to
use two separate clocks, which is good. Thibaud, if you haven’t
already, be sure to take a look at “Cyclone Device Handbook”, Section
III, “Memory”. It spells out the RAM/FIFO/clock configurations
available.

Ok, thanks.

On 3/23/07, Thibaud H. wrote:

| sample 1 |
will be some processing delay to read the next timestamps and the
#samples. That’s why I added a fifo between the usb_fifo and the
chanX_fifo in the diagram on the wiki. Is that a real problem?

I was under the impression that if a timestamp is reached that is
already in the past, but was not able to be sent out, a signal will be
asserted stating that it could not be processed. After that, the
offset of the samples will be added to the read pointer, and the next
timestamp should be read. This shouldn’t take longer than a few clock
cycles at most.

On a related note and to keep processing continuous, should the
original packet header that starts the packet and identifies the first
data samples have the actual number of data samples in the entire
packet to be sent down and not just the current samples length? That
way you can have an uninterrupted stream of samples for each
timestamp.

There probably won’t be any hiccups as long as the output sample rate
is less than 64e6/3 since we will have to go from SENDING_SAMPLES ->
CHECK_NEXT_TIMESTAMP -> READ_SAMPLE_LENGTH -> SENDING_SAMPLES in a
state machine feeding the TX values.
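
A minimal cycle-level sketch of that state machine shows where the factor of 3 comes from. It assumes the channel FIFO holds words laid out as timestamp, #samples, samples (as in the summary earlier in the thread), one FIFO read per clock, and that a timestamp already in the past is simply sent immediately; error signalling is left out.

```python
from collections import deque
from enum import Enum, auto

class State(Enum):
    CHECK_NEXT_TIMESTAMP = auto()
    READ_SAMPLE_LENGTH = auto()
    SENDING_SAMPLES = auto()

def run_channel(fifo_words, max_ticks=1000):
    """Return (clock_tick, sample) pairs emitted by the TX feeder."""
    fifo = deque(fifo_words)
    state = State.CHECK_NEXT_TIMESTAMP
    remaining = 0
    sent = []
    for now in range(max_ticks):
        if not fifo:
            break                               # nothing left to read
        if state is State.CHECK_NEXT_TIMESTAMP:
            if fifo[0] <= now:                  # timestamp reached (or passed)
                fifo.popleft()
                state = State.READ_SAMPLE_LENGTH
        elif state is State.READ_SAMPLE_LENGTH:
            remaining = fifo.popleft()
            state = (State.SENDING_SAMPLES if remaining
                     else State.CHECK_NEXT_TIMESTAMP)
        else:                                   # SENDING_SAMPLES
            sent.append((now, fifo.popleft()))
            remaining -= 1
            if remaining == 0:
                state = State.CHECK_NEXT_TIMESTAMP
    return sent

# Two back-to-back blocks: (t=5, 2 samples) then (t=9, 1 sample).
sent = run_channel([5, 2, 100, 101, 9, 1, 200])
print(sent)
```

In this model the last sample of one block and the first sample of the next are three clocks apart, which is why an output rate below 64e6/3 (about 21.3 MS/s) leaves no visible gap.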

Brian

On 3/23/07, Eric B. wrote:

I think I steered us down the wrong path with the shared RAM idea.
We really do want a dedicated read port for each channel in the Tx
direction. This is easy if we assigned a FIFO (or dedicated RAM) per
channel.

Sounds good. It really should make it much more trivial to instantiate.

I’m not following why you want to reduce the resolution of the clock
from the full ADC clock speed (64 MHz)…

I don’t know the timing within a Cyclone device, and I know you are
already using Gray counters within the FPGA, so I was getting worried
that a 32-bit accumulator running at 64MHz might cause timing closure
issues. It probably isn’t an issue, but those carry chains are only
so long - and the propagation delay really adds up. I suppose we
could always just run them as smaller adders with a pipelined carry
signal instead of having to have the carry asynchronously propagate
through the chain.

All the TX data comes down a single FX2 endpoint, as does the RX data.
Each endpoint is dual or quad buffered in the FX2. Right now, we’re
quad-buffering, though to reduce latency (useful for some MACs) we may
want to consider going to double buffering.

Can you have double-buffering for the RX and quad-buffering for the
TX? In the TX case, I am not sure double-buffering would allow for a
lower latency.

Since everything will be ordered in time, it would seem (to me) to be
better to have queued up more packets on the outgoing stream. Correct
or incorrect?

In the Tx direction, the flow control between the FX2 and the FPGA is
implemented with a single pin, “HAVE_SPACE” (I may have the name
wrong, but I’m close). This is currently asserted by the FPGA
whenever there’s room for the FPGA to receive a 512 byte packet from
the FX2 across the GPIF. When the FX2 sees this pin asserted, and
there’s a packet ready in the FX2, it schedules a 512 byte DMA xfer
across the GPIF.

Neat - the buffering should be very nice for this.

It’s similar in the Rx direction. The FPGA asserts HAVE_PKT_AVAIL
when it has something that it wants the FX2 to pull from the FPGA.
The GPIF is configured such that it is mastered from the FX2 side.

Nice. Thanks for the answers.

Brian

Brian P. wrote:

| sample 2 |
The problem I see is that between to block of samples in the fifo, there
On a related note and to keep processing continuous, should the
original packet header that starts the packet and identifies the first
data samples have the actual number of data samples in the entire
packet to be sent down and not just the current samples length? That
way you can have an uninterrupted stream of samples for each
timestamp.

There probably won’t be any hiccups as long as the output sample rate
is less than 64e6/3 since we will have to go from SENDING_SAMPLES ->
CHECK_NEXT_TIMESTAMP -> READ_SAMPLE_LENGTH -> SENDING_SAMPLES in a
state machine feeding the TX values.

Ok, sounds good. By tomorrow, I will update the wiki page, add the state
machines and figure out the fifo/ram sizes. Is there any other
information that would be useful to state?

Thibaud

On Fri, Mar 23, 2007 at 02:45:54PM -0400, Thibaud H. wrote:

intermingled with each other - be it either channel data or control

:            :
| sample 2   |
| sample 1   |
| #samples   |
| timestamps |
+------------+

I assume that you just copy the received USB packet (including the
8-byte header) into the channel FIFO. The underrun processing really
needs to be handled “next to the DACs” and is a function of the
current state of the channel, the state of the fifo and the S & E
bits in the header.

For each channel fifo, a process would wait for the timestamps to match
the time register and then write the next <#samples> to the
corresponding channel transmit chain at every tx_clock tick.

Actually, the timestamp specifies the time that the sample is supposed
to hit the DACs, not the time that it enters the signal processing pipe.
This may take some thought to implement correctly, since the latency
through the pipe varies depending on the details of the pipeline and
the interpolation factor.

So we end up with (#channel + 1) fifos.

1 usb_fifo
1 cmd_fifo
N chanX_fifo

The problem I see is that between two blocks of samples in the fifo, there
will be some processing delay to read the next timestamp and the
#samples. That’s why I added a fifo between the usb_fifo and the
chanX_fifo in the diagram on the wiki. Is that a real problem?

I don’t think it’s needed. Can you say more about why you think it’s
needed? Remember that the master clock is running at 64MHz, and the
packets are coming in at 32MB/s at most.

Eric

On Fri, Mar 23, 2007 at 03:21:03PM -0700, Eric B. wrote:

the interpolation factor.
We may be able to finesse this by having some readable registers in the
FPGA that allow the host to compute the appropriate offset to apply
to the application-supplied timestamps.

Eric

On Fri, Mar 23, 2007 at 02:59:37PM -0400, Brian P. wrote:

| sample 2 |
The problem I see is that between to block of samples in the fifo, there

On a related note and to keep processing continuous, should the
original packet header that starts the packet and identifies the first
data samples have the actual number of data samples in the entire
packet to be sent down and not just the current samples length? That
way you can have an uninterrupted stream of samples for each
timestamp.

I was thinking that we could handle that with the S + E flags in the
packet header. It seems like less state to deal with, and also allows
the host to begin transmitting a frame before it knows the full
length. Lower latency too.

There probably won’t be any hiccups as long as the output sample rate
is less than 64e6/3 since we will have to go from SENDING_SAMPLES ->
CHECK_NEXT_TIMESTAMP -> READ_SAMPLE_LENGTH -> SENDING_SAMPLES in a
state machine feeding the TX values.

Brian

Eric

Eric B. wrote:

The FX2 will be writing into a FIFO that stores all packets
Yes.
| sample 2 |

What do you mean by “state of the channel” ?
Are the underrun, overrun, S and E flags per channel ? I thought they
were global…

So we end up with (#channel + 1) fifos.

1 usb_fifo
1 cmd_fifo
N chanX_fifo

Yes, you are right.
Actually, if I want to be able to skip outdated packets, the N
chanX_fifo will be implemented as RAM blocks, so I can freely access the
whole contents and skip packets.

The problem I see is that between two blocks of samples in the fifo, there
will be some processing delay to read the next timestamp and the
#samples. That’s why I added a fifo between the usb_fifo and the
chanX_fifo in the diagram on the wiki. Is that a real problem?

I don’t think it’s needed. Can you say more about why you think it’s
needed? Remember that the master clock is running at 64MHz, and the
packets are coming in at 32MB/s at most.

So it should be ok.

Thibaud

On 3/24/07, Thibaud H. wrote:

I updated the wiki page and added state machines.
(http://gnuradio.org/trac/wiki/UsrpTxModifications)

A comment on the USB Block - I believe all the processing going into
the USB FIFO is done within the FX2. Moreover, I don’t know why
you’re keeping the byte_count around - it’s coming in with the packet
header.

Here are the issues still unresolved that I am aware of:

  • Are the overrun/underrun flags per channel or global? From which fifo
    should they be generated?

Each channel should be responsible for returning its own
status back up to the host. Given that, I believe the messages
sent up should identify that “TX Channel 0” or “RX
Channel 1” is having overruns or underruns. That would be helpful,
correct?

  • When a packet is outdated, I still have to walk through it to empty it
    in order to skip it. This can be resolved by using a RAM block rather
    than a fifo for the N chanX_fifo, but will require more coordination
    between the USB block and the data block.

Being able to skip over an entire packet if the over or underrun
happens, is extremely helpful and should be implemented. We can have
a modified FIFO possibly with a way to “skip” a specific number of
packets? That would be interesting and easily implementable within
your state machines.
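
One way to picture such a "skippable" FIFO is a ring buffer over a RAM array whose read pointer can jump forward by a given number of words. This is a minimal sketch: the class name, the 16-word depth and the word layout (timestamp, #samples, samples) are assumptions for illustration only.

```python
class SkippableFifo:
    """Ring buffer over a RAM array whose read pointer can jump ahead."""
    def __init__(self, depth):
        self.ram = [0] * depth
        self.depth = depth
        self.rd = 0
        self.wr = 0
        self.count = 0

    def push(self, word):
        if self.count == self.depth:
            raise OverflowError("FIFO full")
        self.ram[self.wr] = word
        self.wr = (self.wr + 1) % self.depth
        self.count += 1

    def pop(self):
        if self.count == 0:
            raise IndexError("FIFO empty")
        word = self.ram[self.rd]
        self.rd = (self.rd + 1) % self.depth
        self.count -= 1
        return word

    def skip(self, nwords):
        """Discard nwords in one step by adding an offset to the read pointer."""
        if nwords > self.count:
            raise IndexError("cannot skip past the write pointer")
        self.rd = (self.rd + nwords) % self.depth
        self.count -= nwords

fifo = SkippableFifo(depth=16)
for word in [10, 3, 111, 112, 113, 50, 1, 222]:   # timestamp, #samples, samples
    fifo.push(word)

stale_timestamp = fifo.pop()   # assume this packet is already outdated
nsamples = fifo.pop()
fifo.skip(nsamples)            # jump over the stale samples in one step
next_timestamp = fifo.pop()
```

Skipping costs one pointer update regardless of packet length, instead of one read per stale sample.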

  • Now I am assuming that the samples are 16-bit interleaved; how will
    the sample format chosen by the user be reported to the FPGA?

There is a command that is sent down to the FPGA and can be set in a
mux. The “format” should be passed into the processing FSM and
handled there. It’s a good question - no real idea right now how to
handle this.

Thibaud

Good job so far. It’s probably just me, but those FSMs just seem a
little busy and a little confusing. I’ll figure them out soon enough.

Brian

On Sat, Mar 24, 2007 at 01:46:21PM -0400, Thibaud H. wrote:

Eric B. wrote:

On Fri, Mar 23, 2007 at 02:45:54PM -0400, Thibaud H. wrote:

I assume that you just copy the received USB packet (including the
8-byte header) into the channel FIFO. The underrun processing really
needs to be handled “next to the DACs” and is a function of the
current state of the channel, the state of the fifo and the S & E
bits in the header.

What do you mean by “state of the channel” ?
Are the underrun, overrun, S and E flags per channel ? I thought they
were global…

S and E are definitely per channel. They indicate whether or not it’s
OK to underrun on the channel. You could think of S & E as
controlling a per-channel bit of state called “OK to underrun”.

Consider a frame of 1024 samples (16-bit I & Q), broken up into 9 USB
packets:

Underrunning anywhere within the span of the 9 packets is an error.
Underrunning after the 9th packet is not an error. The S & E bits are
used to demarcate the frame boundaries. Note that in a short frame
(one that fits in a single USB packet) S & E would both be set.
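
The 9-packet figure and the flag pattern can be reproduced with a short calculation. This sketch assumes 512-byte USB packets with the 8-byte header mentioned earlier in the thread, 4 bytes per sample (16-bit I and Q), and that the whole remaining payload carries samples; the function names are illustrative.

```python
import math

PACKET_BYTES = 512
HEADER_BYTES = 8        # header size mentioned earlier in the thread
BYTES_PER_SAMPLE = 4    # 16-bit I + 16-bit Q

def frame_to_packets(num_samples):
    """Return (S, E, samples_in_packet) for each USB packet of one frame."""
    per_packet = (PACKET_BYTES - HEADER_BYTES) // BYTES_PER_SAMPLE
    count = math.ceil(num_samples / per_packet)
    packets = []
    for i in range(count):
        n = min(per_packet, num_samples - i * per_packet)
        packets.append((i == 0, i == count - 1, n))
    return packets

def ok_to_underrun_after(packets):
    """Track the per-channel 'OK to underrun' bit after each packet."""
    ok = True                 # idle channel: nothing in flight
    trace = []
    for start, end, _ in packets:
        if start:
            ok = False        # inside a frame: underrun is an error
        if end:
            ok = True         # frame complete: underrun is harmless
        trace.append(ok)
    return trace

packets = frame_to_packets(1024)
print(len(packets), ok_to_underrun_after(packets))
```

Under these assumptions each packet carries 126 samples, so 1024 samples need 9 packets: S is set on the first, E on the ninth, and neither on the seven in between.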

Underrun and overrun are logically per channel, but could be “or’d
together” across all channels without much loss of info.

So we end up with (#channel + 1) fifos.

1 usb_fifo
1 cmd_fifo
N chanX_fifo

Yes, you are right.
Actually, if I want to be able to skip outdated packets, the N
chanX_fifo will be implemented as RAM blocks, so I can freely access the
whole contents and skip packets.

Good. The read and write ports of chanX_fifo could also have
different sizes, e.g., 16 bits on the write port and 32 bits on the
read port.

So it should be ok.
OK.

Thibaud

Eric