A simple flow-graph that is giving me infinite grief

I’ve attached both the .grc and Python files for a relatively
straightforward flow-graph that is giving me loads of grief.

A few things to observe about it:

o The virtual size of this thing in execution is elephantine
o Manipulating it inside GRC is really, really, really painful
o Running it above 10Msps causes UHD to silently stop sending data
  (or maybe it never gets started)
o Running it at 5Msps or slower works, although it’s still elephantine

So, yeah, the FFT is big, but I tried different-sized FFTs from 32768
upwards, and I still see the same behaviour: at 10Msps or above, I get
no data (after a loooong pause while GNU Radio does some cogitation,
presumably in (elephantine) buffer setup).

Now, this flow-graph (actually a more complex version of it) used to
work, even at 10Msps, up until about 3 weeks ago, when I updated UHD
and GNU Radio.

Some perhaps useful data points:

o 6-core AMD 1090T at 3.2GHz with 4GB of 1333MHz memory
o I'm using gr_vmcircbuf_mmap_shm_open_factory
o The very latest UHD and GNU Radio
o USRP2 + WBX

I’ll make the observation that there’s just got to be a better buffer
allocation policy than the existing one. Should it really take
gigabytes of VM, and hundreds of MB of RSS, to execute the flow-graph
I’ve attached? The largest inter-block objects, the vectors in front of
and behind the FFT, are on the order of a few MB, and even accounting
for some ballooning factor so you can have several such vectors
“in flight”, that doesn’t add up to hundreds of MB of RSS and several
GB of virtual size.

On 29/06/2011 12:12 PM, Johnathan C. wrote:

Going “off the end” of the buffer just results in accessing the start
of the buffer, which is mapped into the next block of memory. Not only
does this make the code and I/O much faster, but the work function code
gets a lot simpler.

I wonder if that assertion is generally true, or only in some cases?
Increment and test shouldn’t be that expensive.

The worst-case item is the FFT, which is typically large, sometimes as
much as 512K points, but it’s always a power-of-two size, and given
that it’s “complex”, the item size in bytes should also be a power of
two, since each complex float takes 8 bytes.

I’m guessing something like this last is what is happening in your
flowgraph. I’d trace through all the blocks in your graph and their
configuration parameters, then manually calculate what the item size
is and the LCM with 4096 that results.

My recollection is that sizes get ballooned outwards based on the
degree of decimation downstream from the source block, which is a
policy I’ve never quite gotten my head around. The FFT in my
flow-graph might get decimated by a factor of 10 or more, which, given
the policy I mentioned, might lead to chunky allocations.

On Wed, Jun 29, 2011 at 09:24, Marcus D. Leech [email protected]
wrote:

Not only does this make the code and I/O much faster, but the work
function code gets a lot simpler.

I wonder if that assertion is generally true, or only in some cases?
Increment and test shouldn’t be that expensive.

I’m sure the benefit varies depending on the situation, including some
where there is no benefit. But modulo increments add a conditional
operation for every pointer increment, which can cause a processor
pipeline stall, and they take up register file space to hold the array
boundaries for every input and output stream. It also forces the author
of the work() function to know about how GNU Radio handles circular
buffers. The double-mapped circular buffer lets the work() function
treat all its inputs and outputs as linear arrays in memory, no matter
the actual case.
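
To illustrate the difference (a throwaway sketch, not actual GNU Radio
code): the first loop is what a work() function would need against a
plain circular buffer, and the second is all it needs against the
double-mapped buffer.

#include <complex>
#include <cstddef>

// Sketch only: gr_complex here just stands in for GNU Radio's
// std::complex<float> item type.
using gr_complex = std::complex<float>;

// Plain circular buffer: every access needs a wrap check (or a
// modulo), and the work() code has to know the buffer length.
void copy_with_wrap(const gr_complex* in, std::size_t in_len,
                    std::size_t read_idx,
                    gr_complex* out, std::size_t nitems)
{
    for (std::size_t i = 0; i < nitems; ++i) {
        out[i] = in[read_idx];
        if (++read_idx == in_len)   // conditional on every increment
            read_idx = 0;
    }
}

// Double-mapped buffer: the same work() body is just a linear walk;
// running past the "end" lands in the second mapping of the same
// physical pages.
void copy_linear(const gr_complex* in, gr_complex* out, std::size_t nitems)
{
    for (std::size_t i = 0; i < nitems; ++i)
        out[i] = in[i];
}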

My recollection is that sizes get ballooned outwards based on the
degree of decimation downstream from the source block, which is a
policy I’ve never quite gotten my head around. The FFT in my
flow-graph might get decimated by a factor of 10 or more, which, given
the policy I mentioned, might lead to chunky allocations.

Ah, I think I remember this discussion a while back. It’s the
gr.keep_one_in_n block used to decimate the sample stream into blocks
that update the FFT sink at the requested screen update rate. Decimator
blocks tell the runtime their decimation ratio via set_relative_rate().
Knowing this, GNU Radio expands the buffer allocation by twice the
decimation rate, to allow enough room for the upstream block thread to
be simultaneously writing a full set of inputs and the downstream block
thread to be reading a full set of inputs:

http://gnuradio.org/redmine/projects/gnuradio/repository/revisions/master/annotate/gnuradio-core/src/lib/runtime/gr_flat_flowgraph.cc#L106
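
As a back-of-the-envelope illustration of how that policy can balloon
(a sketch only, with a made-up function name and example numbers; the
real computation is in gr_flat_flowgraph.cc linked above and may differ
in detail):

#include <cstddef>
#include <cstdio>
#include <numeric>   // std::lcm (C++17)

// Rough sketch of the allocation policy described above, NOT the
// actual gr_flat_flowgraph code: size the buffer for roughly 2x the
// downstream decimation's worth of items, then round up so the buffer
// length is a multiple of both the page size and the item size.
std::size_t sketch_buffer_bytes(std::size_t item_size,
                                std::size_t nominal_items,
                                double decimation,
                                std::size_t page_size = 4096)
{
    std::size_t items =
        static_cast<std::size_t>(2.0 * decimation * nominal_items);
    std::size_t granularity = std::lcm(item_size, page_size);
    std::size_t bytes = items * item_size;
    return ((bytes + granularity - 1) / granularity) * granularity;
}

int main()
{
    // Hypothetical example: a 512K-point complex-float vector
    // (4 MiB item) feeding a keep_one_in_n with n = 10.
    std::printf("%zu bytes\n",
                sketch_buffer_bytes(512 * 1024 * 8, 16, 10.0));
}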

Since gr.keep_one_in_n is simply discarding n-1 input items, and
doesn’t need to actually process them like most decimators would, there
may be an alternative way to handle this for this particular block.

Johnathan

It also forces the author of the work() function to know about how GNU
Radio handles circular buffers. The double-mapped circular buffer lets
the work() function treat all its inputs and outputs as linear arrays
in memory, no matter the actual case.

Suppose that we didn’t have specially mapped memory. I believe the
scheduler could easily ensure that contiguous chunks of memory are
always passed into the work function, with little overhead for both
reads and writes.

Now, I bet there will be a slight problem when you have a block with
“history”, like a FIR filter that needs ntaps worth of samples repeated
into the next work function. So, when the scheduler has to actually
wrap back around to the beginning of the buffer, it will have to memcpy
nsamps to the beginning of the buffer, which shouldn’t be bad when
buffer size >>> history size.
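
Roughly like this (a made-up sketch, not GNU Radio API, assuming the
buffer is much larger than both the history and any single work call):

#include <cstddef>
#include <cstring>
#include <vector>

// Sketch of the "linear buffer + copy-back" idea: when the write
// position gets near the end, memcpy the last `history` items to the
// front so the next work() call still sees its history followed by new
// samples in one contiguous run.
struct linear_buffer {
    std::vector<float> data;
    std::size_t history;   // items a block with history must re-see
    std::size_t write_pos; // next free index

    linear_buffer(std::size_t nitems, std::size_t hist)
        : data(nitems), history(hist), write_pos(hist) {}

    // Return space for `nitems` more items, wrapping (with a copy)
    // when necessary. Assumes nitems + history <= data.size().
    float* reserve(std::size_t nitems)
    {
        if (write_pos + nitems > data.size()) {
            // Wrap: move the trailing history to the buffer's start.
            std::memcpy(data.data(),
                        data.data() + (write_pos - history),
                        history * sizeof(float));
            write_pos = history;
        }
        float* dst = data.data() + write_pos;
        write_pos += nitems;
        return dst;
    }
};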

It sounds possible (so stop me if I missed something big here).
Basically, it would be nice to support a linear memory/buffer mode for
the block executor, for two reasons: 1) switch to linear memory when
allocation of the mapping fails, and 2) swap out the buffer with
something like a memory hole for a DSP, for direct copy.

-Josh

On Tue, Jun 28, 2011 at 21:38, Marcus D. Leech [email protected]
wrote:

I’ll make the observation that there’s just got to be a better buffer
allocation policy than the existing one. Should it really take
gigabytes of VM, and hundreds of MB of RSS, to execute the flow-graph
I’ve attached? The largest inter-block objects, the vectors in front of
and behind the FFT, are on the order of a few MB, and even accounting
for some ballooning factor so you can have several such vectors
“in flight”, that doesn’t add up to hundreds of MB of RSS and several
GB of virtual size.

GNU Radio achieves its streaming performance in part due to a circular
buffer design that maps the same physical memory twice in the virtual
memory map for the process, directly adjacent to each other. This
allows work function code to avoid doing any bounds checking or modular
index arithmetic operations when reading and writing its input and
output buffers. Going “off the end” of the buffer just results in
accessing the start of the buffer, which is mapped into the next block
of memory. Not only does this make the code and I/O much faster, but
the work function code gets a lot simpler.
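
For reference, the trick looks roughly like this on POSIX systems (a
bare-bones sketch of the idea behind the vmcircbuf factories, not the
actual GNU Radio code; error handling is omitted and the shm name is
arbitrary):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Returns `size` bytes of buffer whose pages are mapped twice,
// back-to-back. `size` must be a multiple of the page size.
char* make_double_mapped_buffer(std::size_t size)
{
    // Shared memory object to hold the real pages (name is arbitrary).
    int fd = shm_open("/vmcircbuf-sketch", O_RDWR | O_CREAT | O_EXCL, 0600);
    shm_unlink("/vmcircbuf-sketch");   // the name is no longer needed
    ftruncate(fd, size);

    // Reserve 2*size of contiguous address space...
    char* base = static_cast<char*>(mmap(nullptr, 2 * size, PROT_NONE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));

    // ...then map the same pages into both halves. Accesses past
    // base + size land back at the start of the same physical memory.
    mmap(base, size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);
    mmap(base + size, size, PROT_READ | PROT_WRITE,
         MAP_SHARED | MAP_FIXED, fd, 0);

    close(fd);
    return base;
}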

In order to make this strategy work, however, the buffer memory must
end on an exact multiple of the size of a single stream item. Thus, the
allocated buffer size has to be the LCM (least common multiple) of the
machine page size and the item size. The machine page size is typically
4096 bytes, so in simple cases when the item size is a small power of
two, the minimum buffer size remains 4096 bytes. (In practice, other
things the user might ask of the runtime can increase the size by an
integer factor of this.) The usual GNU Radio buffer in these cases is
32K, which can hold 16K shorts, 8K floats, 4K complex floats, etc.

If your stream item size is larger than 4K, things can get ugly. The
best case is that your item size is a power of two, so the LCM will end
up being the same as the item size. Then GNU Radio can allocate memory
for however many items it needs to hold in the buffer simply as a
multiple of that. The worst case is when the item size is relatively
prime to the page size, and then the LCM is the product of the page
size and the item size. Asking GNU Radio to allocate a buffer for items
of size 4097 results in a minimum buffer size of 4096*4097, or slightly
over 16 Mbytes. The virtual memory footprint will be twice that.

I’m guessing something like this last is what is happening in your
flowgraph. I’d trace through all the blocks in your graph and their
configuration parameters, then manually calculate what the item size is
and the LCM with 4096 that results.
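
The arithmetic is just lcm(page size, item size); for example (a
throwaway C++ sketch with illustrative item sizes, not values taken
from your flowgraph):

#include <cstddef>
#include <cstdio>
#include <numeric>   // std::lcm (C++17)

int main()
{
    const std::size_t page = 4096;

    // Illustrative item sizes: a complex float, a 512K-point
    // complex-float FFT vector, and the pathological 4097-byte item.
    const std::size_t items[] = { 8, 512 * 1024 * 8, 4097 };

    for (std::size_t item : items)
        std::printf("item %zu bytes -> minimum buffer %zu bytes\n",
                    item, std::lcm(page, item));
}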

If you are writing your own blocks, there is a technique where you can
“lie” to the GNU Radio runtime and say your item size is the next
higher power of two, but your own custom-written blocks on either end
of the buffer know the “real” size of the item and how to access them
spaced out in the buffer accordingly. But this hack obviously won’t
work with existing blocks.

Johnathan

It sounds possible (so stop me if I missed something big here).
Basically, it would be nice to support a linear memory/buffer mode for
the block executor, for two reasons: 1) switch to linear memory when
allocation of the mapping fails, and 2) swap out the buffer with
something like a memory hole for a DSP, for direct copy.

-Josh

I am intrigued by your ideas, and wish to subscribe to your newsletter :-)


Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

One should be able to pretty easily do a single buffer (mapped or
not), or a double-mapped buffer. Either way, one needs to check for
wrap.

The former (single buffer), as Josh states, requires a memory copy every
so often (when wrap happens). There is a trade-off between buffer
allocation size and how often memory copies are required: bigger buffer
memory allocations mean less frequent memory copies (wrap less often).
The single buffer can be implemented via a memory map or some other
user-space means, depending on how the application is created (e.g., as
multiple processes or threads). A benefit of having this buffer
available is that not all computation devices support memory maps
(e.g., current GPUs, many DSPs); but the trend is towards support in
the future, if you believe what NVIDIA, Intel, and AMD/ATI are
presenting (shared memory between the various SoC processors, allowing
for dynamic runtime memory maps between any and/or all of them).

The latter avoids the memory copy at wrap time, and the buffer
allocation can be as small as possible. Because the buffer allocation
must be in kernel memory, and hence a multiple of pagesize(), there is
a limit to how small it can be. But avoiding the copy will reduce
overhead time, which was, I bet, the original goal in creating the
double-mapped buffer. - MLD