Inband timestamp issues

Hi all, I’d really like to help with this. I’ve been poring over the
design (usrp_inband_usb), so my brain is a little mushy at the moment;
hopefully I’m not totally off track here.

  1. Is it true that there are N RX Sample FIFOs? It seems that the
    channels are already muxed by the time they get to the
    packet_builder, no?

  2. Can we just subtract the FIFO size (usedw?) from the timestamp in the
    packet builder?

  3. When checking timestamps for expired command/data packets, is an
    overflow/wraparound on the timestamp_clock handled? (ref
    cmd_reader.v:107, chan_fifo_reader.v:145)

  4. I wholeheartedly agree that it would be nice to have a “read N
    samples at time T” command.

–ets

Steve P. wrote:

I agree with this solution, I think this is what Ketan’s idea was that
I explained terribly in my last email. Hopefully removing the USB
FIFO will allow the extra packet builders room on the chip.

I think that this is the right thing to do. Now we just have to get it
done. I could probably start working on it Monday when I get back.

the usrp_rx mblock, but I’m not confident about that.

This is definitely possible, and would be best implemented over the CS
channel to the USRP. usrp_rx would not need to be changed, as it simply
does a read(). If the USRP is withholding samples, it’s simply going to
block.

  • George

The current inband RX chain looks like:

N RX Sample Streams -> N RX Sample FIFOs -> 1 Packet Builder -> 1
USB FIFO -> FX2 USB Interface

What it should look like (in my opinion):

N RX Sample Streams -> N Packet Builders -> N Packet FIFOs -> N:1
FIFO MUX -> FX2 USB Interface

I agree with this solution, I think this is what Ketan’s idea was that
I explained terribly in my last email. Hopefully removing the USB
FIFO will allow the extra packet builders room on the chip.

On a side note, it might be interesting to have a command that can
turn on the receiver and receive a specific number of inband packets.
For example, if you know you may be receiving a transmission that is
only 2ms long in a specific slot, it might be beneficial to only
schedule 2ms (+/- a guard time) worth of samples to be delivered to
the host, freeing up more CPU cycles for signal processing and using
the USB bandwidth a little more efficiently.

I’m not too concerned about this, although it would certainly improve
the flexibility of the receiver. This might also require redesigning
the usrp_rx mblock, but I’m not confident about that.

Brian

Steve

http://www.gnuradio.org/trac/browser/gnuradio/trunk/usrp/fpga/inband_lib/packet_builder.v

I only see one chan_fifodata input to packet builder:

http://www.gnuradio.org/trac/browser/gnuradio/trunk/usrp/fpga/inband_lib/packet_builder.v#L8

It would appear that the muxing is between the fifos and the packet
builder at:

http://www.gnuradio.org/trac/browser/gnuradio/trunk/usrp/fpga/inband_lib/rx_buffer_inband.v#L205

Not a big difference, but it might make your “N Packet Builders”
solution that much easier.

The N RX sample FIFOs are built here:

I was looking at RTL schematics on a 1Rx1Tx configuration, which oddly
enough only has 1 RX FIFO, go figure! :blush:

  2. Can we just subtract the FIFO size (usedw?) from the timestamp in
    the packet builder?

Sure, but what for? I suppose I am not sure what this is addressing?

I’m assuming that at least some of the rx timestamp inaccuracy is due to
the variable latency introduced by the rx_chan_fifo(s). If usedw
represents the number of samples currently in a FIFO, then the actual
timestamp of the sample read out of the fifo is
current_timestamp - usedw, correct?

I do think that building the packets before buffering is a better idea,
but adjusting the timestamps might be a quick interim fix.
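
The interim correction asked about here can be sketched in a few lines. This is a Python model, not Verilog from the repository. It assumes that usedw counts whole samples still queued, that the timestamp counter is 32 bits wide, and that the number of clock ticks between samples is known; `corrected_timestamp` and `ticks_per_sample` are made-up names. With one sample per tick it reduces to current_timestamp - usedw.

```python
TS_BITS = 32                      # assumed width of timestamp_clock
TS_MASK = (1 << TS_BITS) - 1


def corrected_timestamp(current_timestamp, usedw, ticks_per_sample=1):
    """Estimate the capture time of the sample at the head of the FIFO.

    usedw            -- samples currently queued (assumption: whole samples)
    ticks_per_sample -- clock ticks between samples (hypothetical parameter)
    """
    # Masking keeps the result inside the counter range when the
    # subtraction crosses zero.
    return (current_timestamp - usedw * ticks_per_sample) & TS_MASK
```

If usedw actually counts 16-bit words rather than I/Q pairs, the value would need to be halved first.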

  3. When checking timestamps for expired command/data packets, is an
    overflow/wraparound on the timestamp_clock handled? (ref
    cmd_reader.v:107, chan_fifo_reader.v:145)

I don’t think it’s explicitly being taken care of, but I am also not
sure if the logic just ends up working out. This should probably be
looked at and a testbench written.

If timestamp_clock is near its max (every ~67s), and a packet is
scheduled such that its future time will cause wraparound (making it a
smallish number), then the reader will detect that
pkt_timestamp < timestamp_clock and discard the packet even though in a
“few” ticks timestamp_clock would be 0 again.
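
A small Python model of the failure described above. It assumes a 32-bit counter and a 64 MHz master clock (the clock rate is not stated in this thread, but it matches the ~67 s rollover figure), and it models the expiry check as the plain less-than comparison described in the message; `naive_is_expired` is a made-up name, not code from cmd_reader.v.

```python
TS_MOD = 1 << 32                 # assumed 32-bit timestamp_clock
MASTER_CLOCK_HZ = 64_000_000     # assumed master clock rate

# Time between rollovers of the counter, in seconds (about 67.1).
rollover_period_s = TS_MOD / MASTER_CLOCK_HZ


def naive_is_expired(pkt_timestamp, timestamp_clock):
    """Plain comparison with no wraparound handling."""
    return pkt_timestamp < timestamp_clock


# Clock is 100 ticks short of rollover; packet is scheduled 500 ticks ahead.
now = TS_MOD - 100
pkt = (now + 500) % TS_MOD       # wraps to a small number
wrongly_dropped = naive_is_expired(pkt, now)
```

Here `pkt` wraps to 400, so the comparison reports the packet as expired even though it is only 500 ticks in the future.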

–ets

Brian

–ets

On Thu, Aug 21, 2008 at 4:27 PM, Eric S. wrote:

I’m assuming that at least some of the rx timestamp inaccuracy is due to the
variable latency introduced by the rx_chan_fifo(s). If usedw represents the
number of samples currently in a FIFO, then the actual timestamp of the
sample read out of the fifo is current_timestamp - usedw, correct?

I do think that building the packets before buffering is a better idea, but
adjusting the timestamps might be a quick interim fix.

I think doing the actual fix is better than doing some interim one.

If timestamp_clock is near its max (every ~67s), and a packet is scheduled
such that its future time will cause wraparound (making it a smallish
number), then the reader will detect that pkt_timestamp < timestamp_clock
and discard the packet even though in a “few” ticks timestamp_clock would be
0 again.

Right - looking at the code, I don’t think the logic is there and a
testbench probably should be written to verify the problem/make sure
it works properly anyway.

One idea is that it may be OK to assume that packets will never be
delivered out of date and to always transmit and never drop.
Unfortunately, if one expired packet is delivered, it will be an
entire rollover before it gets back to transmitting - stalling the
pipe for quite a long time.

Another solution would be to have a “rollover” flag which tells the
FSM to first wait for a rollover before checking the timestamp. This
flag could be set by the host when it detects a rollover in timing. I
don’t know if the host keeps track of the last transmission time or
who would be responsible for setting the flag. Maybe it’s getting a
bit complicated?

What do you think of those ideas? Do you have a proposed solution?

Brian

Just thinking out loud here…

Given your suggested solution (which I like):

N RX Sample Streams -> N Packet Builders -> N Packet FIFOs -> N:1 FIFO
MUX -> FX2 USB Interface

The FIFO MUX could be a module that implements a virtual FIFO output,
and automatically selects (and emulates) the input FIFO based on
fullness, so that the fullest FIFO always gets read first. A tie would
default to natural ordering. There should also be a minimum size (128?)
or a channel_ready input to prevent premature FX2 transfer. Sounds
pretty simple and should just drop-in to the existing design.
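
The selection rule proposed above, modelled in Python as a rough sketch. `select_fifo` and `min_words` are hypothetical names, and the default of 128 is only the tentative figure from the message.

```python
def select_fifo(fill_levels, min_words=128):
    """Pick which input FIFO the mux should present next.

    fill_levels -- words currently held by each FIFO, in channel order
    min_words   -- minimum fill before a FIFO is eligible
    Returns the index of the fullest eligible FIFO, or None if none is
    ready. Ties go to the lowest index (natural ordering).
    """
    best = None
    for idx, level in enumerate(fill_levels):
        if level < min_words:
            continue                      # not enough data to transfer yet
        if best is None or level > fill_levels[best]:
            best = idx                    # strict '>' keeps the earlier index on a tie
    return best
```

For example, `select_fifo([10, 200, 150])` picks channel 1, and `select_fifo([200, 200])` picks channel 0.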

What about moving the Packet FIFO into packet_builder? It seems like we
are just wasting cycles by pushing headers to an external FIFO when we
could handle that with some read logic. In this way, the first “FIFO”
reads from packet_builder actually output internally generated/stored
header values, then later the internal FIFO with the channel data. But
maybe it’s more trouble than it’s worth.

–ets

Brian P. wrote:

Another solution would be to have a “rollover” flag which tells the
FSM to first wait for a rollover before checking the timestamp. This
flag could be set by the host when it detects a rollover in timing. I
don’t know if the host keeps track of the last transmission time or
who would be responsible for setting the flag. Maybe it’s getting a
bit complicated?

What do you think of those ideas? Do you have a proposed solution?

I was thinking about this problem when writing the TDMA MAC. I like the
rollover flag idea much better than the assumption everything will be
transmitted. It’s very easy for the host to detect and set a bit like
this. The host is generating the timestamps, and can easily tell
something like this.

But, I’d say let’s take care of the RX issue first…

  • George

On Thu, Aug 21, 2008 at 5:04 PM, Eric S. wrote:

natural ordering. There should also be a minimum size (128?) or a
channel_ready input to prevent premature FX2 transfer. Sounds pretty simple
and should just drop-in to the existing design.

There is already a FIFO which has the master_clock on one side, and
the FX2 on the other side. I was thinking we connect the output of
the packet_builder to those FIFOs. They already have a have_pkt
output which can be fed into the FIFO mux. The mux can cycle through
the have_pkt signals of each of the FIFOs without giving precedence to
one or the other and signal the FX2 there is a packet ready to
transfer - muxing the USB data lines for the DMA transfer. I believe
this is similar to what you’re talking about, but I also think the
packet_builder has to have some changes since it is not a streaming
module.
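
The "cycle through the have_pkt signals without giving precedence" idea amounts to a round-robin arbiter. A minimal Python sketch follows; `next_channel` and `last_served` are made-up names, and the real mux would of course be Verilog clocked in the FX2-side domain.

```python
def next_channel(have_pkt, last_served):
    """Round-robin choice among FIFOs that report a complete packet.

    have_pkt    -- list of booleans, one have_pkt flag per packet FIFO
    last_served -- index of the FIFO that was read most recently
    Returns the next ready index after last_served, wrapping around,
    or None if no FIFO has a packet.
    """
    n = len(have_pkt)
    for offset in range(1, n + 1):
        idx = (last_served + offset) % n
        if have_pkt[idx]:
            return idx
    return None
```

Starting the search just after the last served channel is what keeps a busy channel from starving the others.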

Ideally I’d like to see the packet_builder turn into something that
just inserts headers at the appropriate times in the stream and can
have a mechanism to build an indefinite number of packets (infinite
receive) or a specific number (possibly a 16 or 32-bit count?).

What about moving the Packet FIFO into packet_builder? It seems like we are
just wasting cycles by pushing headers to an external FIFO when we could
handle that with some read logic. In this way, the first “FIFO” reads from
packet_builder actually output internally generated/stored header values,
then later the internal FIFO with the channel data. But maybe it’s more
trouble than it’s worth.

–ets

Brian

On Thu, Aug 21, 2008 at 08:21:34PM -0600, Eric S. wrote:

(way) more than enough. The host can delay longer before sending the packet
if longer delays are required.

The basic logic would be to check the delta, and if greater than 2^31, then
make go/wait/drop decisions assuming a wrap is involved.

Yes. If it looks like it’s more than 2^31 away, then it’s late.
Drop it and report an error.

Eric

-----Original Message-----
From: Brian P.
Sent: Thursday, August 21, 2008 3:09 PM

What do you think of those ideas? Do you have a proposed solution?

Brian

Of the two, I would prefer the rollover flag solution.

If we are willing to put a limit on how far ahead we can schedule then
we should be able to decide implicitly if a wrap has occurred.

2^31, or half the counter range would provide a 32s range, and should be
(way) more than enough. The host can delay longer before sending the
packet if longer delays are required.

The basic logic would be to check the delta, and if greater than 2^31,
then make go/wait/drop decisions assuming a wrap is involved.
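
One way to express that delta check, as a Python sketch under the assumption of a 32-bit counter. `classify` is a hypothetical name; the "go"/"wait"/"drop" labels follow the wording above, and treating a delta of 2^31 or more as late follows the reply quoted earlier in the thread.

```python
TS_MOD = 1 << 32     # assumed 32-bit timestamp_clock
HALF = 1 << 31       # furthest ahead a packet may be scheduled


def classify(pkt_timestamp, timestamp_clock):
    """Decide what to do with a packet, tolerating counter wraparound."""
    # Modular difference: how far ahead of 'now' the packet appears to be.
    delta = (pkt_timestamp - timestamp_clock) % TS_MOD
    if delta == 0:
        return "go"      # scheduled for this tick
    if delta < HALF:
        return "wait"    # in the future, possibly across a rollover
    return "drop"        # looks more than 2^31 ahead, so treat it as late
```

With this rule a packet stamped 400 while the clock reads 2^32 - 100 is classified as "wait" rather than discarded.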

–ets

Regarding the rx_buffer design, is there a reason that there are two
FIFO stages in the current design? It seems that one layer of FIFOs
should do. Either on the channel side or the fx2 side, before or after
packet building, either should work. I haven’t decided which I prefer,
maybe I’ll have something more insightful to say tomorrow…

–ets

Hi Eric,

Most, if not all, of our FPGA work is done by an undergrad here who
knows Verilog. He might not be able to contribute until September, as
our first week of classes is this week and things need to settle.

If you have Verilog experience and want to work with me on tackling the
problem, I’d be more than happy to work with you. We use Quartus tools
to build the FPGA, which are completely free, and there is also a
simulator available for them, which we can (and should) use to build
some test benches.

If you’re a Linux user, like us, we run everything through VMWare with a
Windows XP image and it works just fine.

Here is where you can get the web edition:
http://www.altera.com/products/software/quartus-ii/web-edition/qts-we-index.html

This is the FPGA code branch I work out of:
http://gnuradio.org/trac/browser/gnuradio/branches/developers/gnychis/fpga

We can work together on changes to the branch. Without SVN commit
access, any changes you’d like to make might have to be pushed through
me as patches… but regardless, we can figure out a way.

I’m free to work with you at any time, even when I’m traveling, I just
want to get things done ASAP :slight_smile:

  • George

PS. I pushed this to the list because it’s slightly useful information

I’m geared up and have created a test rbf, all seems well. I’m running
Quartus on a Windows PC, while the USRP sits on Linux/Ubuntu.

It would seem that rx_buffer_inband.v and packet_builder.v will be
seeing major changes, if not complete rewrites.

I need to document the exact interface spec of rx_buffer_inband. I’m
assuming there isn’t any other design documentation for the guts
anywhere, just source, is that correct? Perhaps the wiki would be a
good place for my notes? That way any errors could be easily corrected
in place, and we would have at least a pittance of design docs for the
new stuff…

–ets

Okay, so to elaborate on the design options a little more…

I see two primary topologies: packet push, and packet pull. Both designs
have a slimmed down packet_builder for each channel. The multichannel
multiplexing would be similar in either design.

Packet Push:

Packet buffers only, packets built on channel input.

This design pushes preformed packets into a packet FIFO, simply to be
read out via USB as is.

Multiple rxclk ticks would be required while writing the header
information to the FIFO, which limits our maximum sample speed.
(I think) Without any channel buffering, the maximum sample rate would
be 1/6 master clock, since we need 4 additional master cycles to push
out the header before the first sample of a packet. Maybe that is an
issue, maybe not.

  1. New packet begins, I/Q samples waiting, save metadata, write headerH
  2. Write headerL
  3. Write timestampH
  4. Write timestampL
  5. write I
  6. write Q
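
The write order listed above can be modelled as a word stream. This Python sketch assumes a 32-bit header and a 32-bit timestamp, each written as a high and a low 16-bit word, and one write per clock cycle; `push_words` is a made-up helper, not the actual packet format code.

```python
HEADER_CYCLES = 4        # headerH, headerL, timestampH, timestampL
CYCLES_PER_SAMPLE = 2    # write I, write Q

# Cycles needed from packet start until the first sample is fully written.
first_sample_cycles = HEADER_CYCLES + CYCLES_PER_SAMPLE


def push_words(header, timestamp, samples):
    """Return the 16-bit words of one packet in push-design write order."""
    words = [
        (header >> 16) & 0xFFFF, header & 0xFFFF,        # steps 1-2
        (timestamp >> 16) & 0xFFFF, timestamp & 0xFFFF,  # steps 3-4
    ]
    for i, q in samples:
        words += [i & 0xFFFF, q & 0xFFFF]                # steps 5-6, per sample
    return words
```

The six cycles before the first sample is complete are where the 1/6 master clock estimate comes from when there is no channel buffering to absorb the header writes.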

Packet Pull:

Channel buffering only, packets built on USB read.

This design uses a separate FIFO/s for packet metadata (timestamp, etc)
parallel to the channel FIFO. The actual packets are constructed via
the USB read process. The metadata FIFO/s would be pretty shallow, as
they are 1 write per packet.

This design should theoretically be able to read samples at master_clock
rates. While the USB/host couldn’t stream that, if operated on short
burst reads (less than the FIFO capacities), it could increase our
maximum sampling speeds.

Also, since the header isn’t built until read, there is the possibility
to include information in the header that is not available at the first
sample. E.g.:

  • Packet size: don’t need to know size in advance, read could be
    interrupted, and the size can be adjusted on USB read.
  • RSSI: e.g. peak, avg, last, etc.
  • Padding: padding would be virtual space, not FIFO space. Just send 0,
    don’t store them.
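
A behavioural sketch of the pull topology in Python, to make the two-FIFO arrangement concrete. All names here (`PullChannel`, `write_sample`, `read_packet`, the header fields) are invented for illustration; only `have_pkt` echoes a signal name used in the thread. FIFO depth limits and clock domains are ignored.

```python
from collections import deque


class PullChannel:
    """One RX channel: a sample FIFO plus a shallow metadata FIFO."""

    def __init__(self, samples_per_packet):
        self.samples_per_packet = samples_per_packet
        self.sample_fifo = deque()   # (i, q) pairs, one write per sample
        self.meta_fifo = deque()     # one timestamp per packet
        self._count = 0              # position within the current packet

    def write_sample(self, i, q, timestamp_clock):
        if self._count == 0:
            # First sample of a packet: save the metadata, no header writes.
            self.meta_fifo.append(timestamp_clock)
        self.sample_fifo.append((i, q))
        self._count = (self._count + 1) % self.samples_per_packet

    def have_pkt(self):
        return len(self.sample_fifo) >= self.samples_per_packet

    def read_packet(self):
        """Build the header only now, at (USB) read time."""
        timestamp = self.meta_fifo.popleft()
        payload = [self.sample_fifo.popleft()
                   for _ in range(self.samples_per_packet)]
        header = {"timestamp": timestamp, "payload_len": len(payload)}
        return header, payload
```

Because the header is assembled in `read_packet`, fields such as the payload length could be filled in from what was actually read rather than fixed in advance.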

In general, it seems to me that the packet pull design offers more
performance and flexibility. Unless someone points out some flaws in my
reasoning, I will proceed on that path.

–ets

On Tue, Aug 26, 2008 at 12:03 PM, Eric S. wrote:

This design pushes preformed packets into a packet FIFO, simply to be
read out via USB as is.

Multiple rxclk ticks would be required while writing the header information
to the FIFO, which limits our maximum sample speed.
(I think) Without any channel buffering, the maximum sample rate would be
1/6 master clock, since we need 4 additional master cycles to push out the
header before the first sample of a packet. Maybe that is an issue, maybe
not.

Maximum throughput for 16-bit IQ is a decimation by 8, so 6 clock
cycles for header setup should be fine. Even for decimation rates of
4, which drops IQ down to 8-bits each, the scheme should probably work
fine if you write both I and Q in the same clock cycle.
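
The arithmetic behind those figures, as a small Python sketch. The 64 MHz master clock is an assumption (it is not stated in this thread, though it agrees with the ~67 s rollover of a 32-bit counter mentioned earlier), and the function names are made up.

```python
MASTER_CLOCK_HZ = 64_000_000   # assumed master clock rate


def sample_rate_hz(decimation):
    """Complex sample rate after decimation."""
    return MASTER_CLOCK_HZ // decimation


def usb_bytes_per_s(decimation, bits_per_component):
    """Payload data rate toward the host for I and Q of the given width."""
    bytes_per_iq = 2 * bits_per_component // 8
    return sample_rate_hz(decimation) * bytes_per_iq


def cycles_per_sample(decimation):
    """Master clock cycles available between consecutive samples."""
    return decimation
```

Under that assumption, decimation by 8 with 16-bit I/Q and decimation by 4 with 8-bit I/Q both come to 32,000,000 payload bytes per second, and decimation by 8 leaves 8 clock cycles between samples.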

  1. New packet begins, I/Q samples waiting, save metadata, write headerH
  2. Write headerL
  3. Write timestampH
  4. Write timestampL
  5. write I
  6. write Q

You need the time the first sample comes out of the halfband filter as
the timestamp on the whole packet, so you really have to write the
header, and then wait for the first sample. When the first sample is
strobed in, get and write the timestamp and then write the IQ sample.
Do this until the entire packet has been filled and repeat N times or
infinitely (depending on setting).

Note that you can make 32-bit wide FIFOs - or even a FIFO with 16-bits
on one side and 32-bits on the other (as long as each side is fully
synchronous to each other).

rates. While the USB/host couldn’t stream that, if operated on short burst
reads (less than the FIFO capacities), it could increase our maximum
sampling speeds.

This sounds like a decent idea, but the smallest block RAM in the FPGA
is 256x16. If you wanted to make this just in the fabric, you’ll have
to deal with crossing clock domains and that can get a bit hairy with
just flops used as memory.

Even in the previous design, I don’t think there is a debilitating
limit to the flow of samples to the host. Requiring 2 clocks per
sample (and a small buffer per channel) doesn’t seem that limiting
with regards to performance.

Also, since the header isn’t built until read, there is the possibility to
include information in the header that is not available at the first sample.
E.g.:

  • Packet size: don’t need to know size in advance, read could be
    interrupted, and the size can be adjusted on USB read.
  • RSSI: e.g. peak, avg, last, etc.
  • Padding: padding would be virtual space, not FIFO space. Just send 0,
    don’t store them.

The packet size is always the same size since the ADC is always
running. There won’t ever really be a lack of samples to be pushed to
the host. The RSSI might be interesting, but since it’s reported with
every packet, you can get a granularity at the packet level which
should be pretty sufficient. For the padding, I don’t think that will
be required since the ADC is always running and there isn’t a lack of
samples to send to the host.

This is kind of why I wanted to be able to send down a command to say
“At time=X, receive N packets and then stop” as it builds in the
limiting factor to the command.

In general, it seems to me that the packet pull design offers more
performance and flexibility. Unless someone points out some flaws in my
reasoning, I will proceed on that path.

The last issue I have is when dealing with sample overruns. Can this
scheme easily recover if the sample FIFO is full when a new sample
comes in, but the metadata FIFO has header information pushed into it?
In the packet push situation, there will be a discontinuity in the
timestamps inherently within the system.

I prefer the packet pushing idea, but feel free to do what you feel is
the better idea.

Brian

On Tue, Aug 26, 2008 at 3:21 PM, Eric S. wrote:

– snip –

What do you see as the advantage of the push design?

You’re generating significantly less muxing at the interface to the
FX2. In the pull design, you have 2 muxes per channel and at least 2
channels (one control, one data). Increasing the number of channels
grows significantly faster using this method.

Moreover, you’re not eliminating packet FIFOs - you’re creating 2 sets
of FIFOs. Header FIFOs and Data FIFOs. Your inband metadata goes
into the header FIFO where the samples go into the data FIFO.

In the push design, you have one set of packet FIFOs for each channel
and no data FIFOs since the packets are built as the samples come in.
I believe this design has the most efficient use of BRAM as you will
never use any more than 4 16-bit words per packet depth of data FIFO
in the pull method.

The push design works for the USRP and the inherent limitations of the
platform without increasing the complexity significantly. I don’t
believe your assertion that the pull design can run at full-speed on
both sides. You still have data insertion going on - which slightly
increases your data rate for that period of time. You WILL overflow
at some point in time. For any inband signaling, you can never run at
full speed since you’re generating more data than can ever get sent
over the channel.

But, as I said before, feel free to implement what you believe is the
better design.

Brian

Thanks for the comments Brian, my replies are inline.

-----Original Message-----
From: Brian P.
Sent: Tuesday, August 26, 2008 11:41 AM

Maximum throughput for 16-bit IQ is a decimation by 8, so 6 clock
cycles for header setup should be fine. Even for decimation rates of
4, which drops IQ down to 8-bits each, the scheme should probably work
fine if you write both I and Q in the same clock cycle.

I agree that in normal operations, there shouldn’t be a problem.
However, in the pull design it’s not even a consideration. Max clocks
on both sides, overflows are the only limit.

You need the time the first sample comes out of the halfband filter as
the timestamp on the whole packet, so you really have to write the
header, and then wait for the first sample. When the first sample is
strobed in, get and write the timestamp and then write the IQ sample.
Do this until the entire packet has been filled and repeat N times or
infinitely (depending on setting).

Just to be clear, the pull method does save the timestamp at the
beginning of a packet, it just doesn’t write it into the header until
read via the USB/FX2 interface.

This sounds like a decent idea, but the smallest block RAM in the FPGA
is 256x16. If you wanted to make this just in the fabric, you’ll have
to deal with crossing clock domains and that can get a bit hairy with
just flops used as memory.

A pair of 256x16 FIFOs doesn’t sound exorbitant, do you think this will
be an issue? We are getting rid of the packet buffer altogether.

limiting factor to the command.

I agree that for channel data, a less than full packet is unlikely, nor
does it even need to be supported. Using pull, we could easily support
variable payload sizes, even when we don’t know the final size when we
get the first sample. The benefit is of questionable value, however.

Inband communications would normally have a lot of padding. But I
suppose that the control channel is so different that it could have a
different structure than the rx channels. So I guess padding in rx
channels is mostly a moot point. But I think that the pull design would
be basically the same for both.

More important is the general idea that we could include information in
the header that is not available at the time of the first sample,
because we are not actually constructing it until all the data is
available and it is being read via the fx/usb interface. For example,
how would you push a packet max or average RSSI into a header when the
data hasn’t even arrived yet? I’m not saying that this is something we
want to do, but rather that we could, if desired. Timestamp is a moot
point, as it is always available at the first sample, by definition.
I’m just trying to think beyond this particular issue.

The last issue I have is when dealing with sample overruns. Can this
scheme easily recover if the sample FIFO is full when a new sample
comes in, but the metadata FIFO has header information pushed into it?
In the packet push situation, there will be a discontinuity in the
timestamps inherently within the system.

In either push or pull, it would probably be a good idea to not start a
packet if the receiving buffer doesn’t have enough room to hold all of
it. As soon as there is enough room, start a new packet. Packets would
then always contain contiguous samples. The number of lost samples
could be easily identified by the timestamp on the next packet.
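
Both halves of that proposal fit in a few lines of Python. This is only a sketch: `can_start_packet` and `lost_samples` are hypothetical names, the counter is assumed to be 32 bits wide, and the host is assumed to know the packet size and the number of clock ticks per sample.

```python
def can_start_packet(fifo_depth, usedw, packet_words):
    """Start a new packet only if the whole packet fits in the FIFO."""
    return fifo_depth - usedw >= packet_words


def lost_samples(prev_ts, next_ts, samples_per_packet, ticks_per_sample,
                 ts_bits=32):
    """Host side: samples dropped between two consecutive packets.

    prev_ts, next_ts -- timestamps of the first sample of each packet
    Assumes every packet holds contiguous samples, as proposed above.
    """
    elapsed = (next_ts - prev_ts) % (1 << ts_bits)   # tolerate rollover
    return elapsed // ticks_per_sample - samples_per_packet
```

A result of 0 from `lost_samples` means the two packets are contiguous; anything larger is the size of the gap in samples.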

When to set the overrun flag? On the next complete packet after the
overrun? What about the USB rx_overrun signal? As soon as the overrun
occurs? How does (or will) this affect the good packets that have
already been queued?

I prefer the packet pushing idea, but feel free to do what you feel is
the better idea.

What do you see as the advantage of the push design?

–ets

-----Original Message-----
From: Brian P.

You’re generating significantly less muxing at the interface to the
FX2. In the pull design, you have 2 muxes per channel and at least 2
channels (one control, one data). Increasing the number of channels
grows significantly faster using this method.

I’ll have to ponder this. I don’t see much of a difference. For push,
you are muxing input to the packet FIFO. For pull I’m muxing the output
to usbdata. It seems pretty equivalent to me.

Moreover, you’re not eliminating packet FIFOs - you’re creating 2 sets
of FIFOs. Header FIFOs and Data FIFOs. Your inband metadata goes
into the header FIFO where the samples go into the data FIFO.

The current design has channel and packet FIFOs. Either push or pull
will be an improvement IMO. The pull will have an extra FIFO per
channel, but no packet FIFO. Push, only the packet FIFO.

In the push design, you have one set of packet FIFOs for each channel
and no data FIFOs since the packets are built as the samples come in.
I believe this design has the most efficient use of BRAM as you will
never use any more than 4 16-bit words per packet depth of data FIFO
in the pull method.

I think that you are correct that push is better in terms of memory
utilization, but I’m not sure of the importance. The metadata FIFO
would be minimal. 2 x 16 x 256 / channel would be more than enough,
wider with more metadata like RSSI.

The push design works for the USRP and the inherent limitations of the
platform without increasing the complexity significantly. I don’t
believe your assertion that the pull design can run at full-speed on
both sides. You still have data insertion going on - which slightly
increases your data rate for that period of time. You WILL overflow
at some point in time. For any inband signaling, you can never run at
full speed since you’re generating more data than can ever get sent
over the channel.

In the USRP, there is no way to run at master_clock rates, just because
of the architecture. However, using pull, the rx_buffer_inband module
could, simply because we can always absorb samples as fast as they come
(no pauses to generate headers). Obviously, in order to continuously
stream, the usbdata bus would have to run faster/wider than master_clock
channel rates to deal with the extra header data, which in the USRP it
doesn’t. But that is a limitation of the module’s use, not the module
itself. Even in the USRP, pull could sample at maximum channel rates for
short durations of time, limited only by FIFO capacity.

But, as I said before, feel free to implement what you believe is the
better design.

Just to be clear, I am only trying to figure out the best design. If
pull has problems that I am not accounting for, then I’d rather find
out now, prior to spending the design time. I am the newcomer here, and
greatly value your opinion. I hope I don’t come across otherwise. I
suspect that either design would work well.

–ets