Inefficient large vectors

Hi!

I have many sync blocks that work with large, fixed-size vectors, e.g. one
that converts a vector of size 12659 into another of size 18353. I have
simply multiplied sizeof(gr_complex) by 12659 and 18353 in the I/O
signature. However, when the flow graph is running, I get a warning about
paging: the circular buffer implementation allocates very large buffers
(e.g. 4096 items each, to meet the paging requirement). I do not really
want such large buffers. I have also implemented the whole thing with
padding, but that becomes really inefficient as well, since whenever you
want to switch between vectors and streams you have to jump through extra
hoops because of the padding. In a previous version I had streams
everywhere, but then there was absolutely no verification of whether I
messed up the sizes of my “virtual vectors”.
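
For reference, the signatures look roughly like this (a sketch, not my exact code):

```cpp
#include <gnuradio/io_signature.h>
#include <gnuradio/gr_complex.h>

// vector-of-12659 in, vector-of-18353 out, one stream each
gr::io_signature::sptr in_sig  = gr::io_signature::make(1, 1, 12659 * sizeof(gr_complex));
gr::io_signature::sptr out_sig = gr::io_signature::make(1, 1, 18353 * sizeof(gr_complex));
```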

So is there a way to work with large, odd-length vectors that does not
have this buffer allocation problem and does not require padding? It seems
to me that it could be supported: regular streams, but with the vector
size verified separately at connection time rather than being used to
multiply the item size. Any advice is appreciated…

Best,
Miklos

On Wednesday, August 21, 2013, Miklos M. wrote:

So is there a way to work with large, odd-length vectors that does not
have this buffer allocation problem and does not require padding? It seems
to me that it could be supported: regular streams, but with the vector
size verified separately at connection time rather than being used to
multiply the item size. Any advice is appreciated…

The best technique here is to round up your item size to the next integer
multiple of the machine page size, typically 4K. You can still operate on
a vector at a time, but you’ll have to do a little math to identify the
start of each vector in the input and output buffers, as they will no
longer be contiguous. It sounds like you might have already tried
something like this.
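
Roughly, the bookkeeping looks like this (just a sketch using the sizes from this thread; the names are illustrative):

```cpp
#include <cstddef>
#include <gnuradio/gr_complex.h>

static const size_t PAGE_SIZE = 4096;                  // machine page size (typical)
static const size_t VEC_LEN   = 12659;                 // samples per logical vector
static const size_t RAW_BYTES = VEC_LEN * sizeof(gr_complex);

// item size rounded up to the next multiple of the page size;
// this is what goes into the io_signature
static const size_t PADDED_BYTES =
    ((RAW_BYTES + PAGE_SIZE - 1) / PAGE_SIZE) * PAGE_SIZE;

// start of the k-th vector inside a work() buffer; everything beyond
// VEC_LEN samples within each padded item is ignored padding
inline const gr_complex *vector_start(const void *buffer, size_t k)
{
    const char *base = static_cast<const char *>(buffer);
    return reinterpret_cast<const gr_complex *>(base + k * PADDED_BYTES);
}
```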

Yes, this is what I am doing, but it is not very nice, and you cannot
easily mix in blocks that want to work at the stream level. What really
bugs me is that I think the scheduler could figure this all out and treat
my vectors as a stream, allocating nice buffers (who cares whether the
vector fits into the buffer an integer number of times). Am I wrong about
this? I think this would be a nice further development… Miklos

The aligned-to-page-size buffer management is due to the way that mmap()
is used to multiply-map these buffers into the address space.
That only “works” if the sizes are multiples of the native page size.
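
A quick back-of-the-envelope sketch of what that implies for the numbers in this thread (plain C++, not GNU Radio code): the buffer has to hold a whole number of items and span a whole number of pages, so the smallest it can be is the least common multiple of the two sizes.

```cpp
#include <cstdio>
#include <numeric>   // std::gcd (C++17); older compilers can hand-roll it

int main()
{
    const unsigned long page = 4096;
    const unsigned long item = 12659UL * 8;  // 12659 gr_complex of 8 bytes each
    const unsigned long lcm  = item / std::gcd(item, page) * page;
    std::printf("minimum buffer: %lu bytes = %lu items = %lu pages\n",
                lcm, lcm / item, lcm / page);
    return 0;
}
```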


Marcus L.
Principal Investigator
Shirleys Bay Radio Astronomy Consortium

On Wed, Aug 21, 2013 at 9:04 PM, Johnathan C. wrote:

On Wed, Aug 21, 2013 at 07:59:37PM +0200, Miklos M. wrote:

So is there a way to work with large, odd-length vectors that does not
have this buffer allocation problem and does not require padding? It seems
to me that it could be supported: regular streams, but with the vector
size verified separately at connection time rather than being used to
multiply the item size. Any advice is appreciated…

Miklos,

if Johnathan’s tips aren’t helping, you might be able to use tags to
delimit vectors and then pass them as streams of scalars. You then have
to keep track of the vector boundaries internally in your block.

Depending on what your application is, this could be a solution, or it
could make things even more inefficient. But it’s worth a try!
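
A minimal sketch of what the producing side could look like (my rough take on the 3.7-style C++ API; the block name, the "vector_start" key and VEC_LEN are just illustrative): a pass-through block that tags the first sample of every vector, so a downstream block can check that consecutive tags are exactly one vector length apart.

```cpp
#include <gnuradio/sync_block.h>
#include <gnuradio/io_signature.h>
#include <gnuradio/gr_complex.h>
#include <pmt/pmt.h>
#include <cstring>

class tag_vectors : public gr::sync_block
{
    static const int VEC_LEN = 12659;   // samples per logical vector
    int d_until_next;                   // samples until the next vector start

public:
    tag_vectors()
      : gr::sync_block("tag_vectors",
                       gr::io_signature::make(1, 1, sizeof(gr_complex)),
                       gr::io_signature::make(1, 1, sizeof(gr_complex))),
        d_until_next(0)
    {}

    int work(int noutput_items,
             gr_vector_const_void_star &input_items,
             gr_vector_void_star &output_items)
    {
        const gr_complex *in = (const gr_complex *) input_items[0];
        gr_complex *out = (gr_complex *) output_items[0];
        std::memcpy(out, in, noutput_items * sizeof(gr_complex));

        // tag the first sample of every vector with its length
        for (int i = d_until_next; i < noutput_items; i += VEC_LEN) {
            add_item_tag(0, nitems_written(0) + i,
                         pmt::string_to_symbol("vector_start"),
                         pmt::from_long(VEC_LEN));
        }
        d_until_next = (d_until_next - noutput_items) % VEC_LEN;
        if (d_until_next < 0)
            d_until_next += VEC_LEN;
        return noutput_items;
    }
};
```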

MB


Karlsruhe Institute of Technology (KIT)
Communications Engineering Lab (CEL)

Dipl.-Ing. Martin B.
Research Associate

Kaiserstraße 12
Building 05.01
76131 Karlsruhe

Phone: +49 721 608-43790
Fax: +49 721 608-46071
www.cel.kit.edu

KIT – University of the State of Baden-Württemberg and
National Laboratory of the Helmholtz Association

Hi Martin, yes, I know about stream tags, but they would just make the
blocks more complicated: right now I can rely on the fact that data comes
in as a multiple of the vector length. For now, padding solves my
immediate needs, but it is not an ideal solution. Miklos

On Wed, Aug 21, 2013 at 11:18 PM, Martin B. (CEL) wrote:

Just to add my two cents:
Depending on your actual application, your large vectors might not
actually quite fit the idea of “streams”; they might, for example, be a
valid, decoded network packet or something of the like. If they don’t
need sample-synchronous handling, using messages to pass them around
might work well.
The downside of that is of course that you can’t use your favourite GR
blocks on messages. You break the sample-synchronous architecture of a
flowgraph with multiple paths from source(s) to sink(s); and if you
convert from message to stream and back, you basically lose the vector
attribute of your data (or run into the same problems as before).
I can’t really tell you much about the computational performance of
passing around large messages, however.

On the other hand, you can reduce your per-block coding overhead for
Martin’s suggested tag-based solution:
Write a base class that implements an input and an output buffer and a
minimal state machine based on stream tag evaluation, and let your blocks
inherit from that. Always copy as many items from your general_work’s
input vector into your input buffer as you can, and copy as many samples
from the output buffer into your general_work’s output vector as possible.
Execute your computation when your input buffer is full and your output
buffer is empty. That way, you’ll get a quasi-fixed relative rate, but get
all the freedom, and the scheduling disadvantages, of a general_work block
with an itemsize of gr_complex (or whatever your data type is).

I know from experience that this might be hard to debug. However, once
your state machine is watertight, you’re not very likely to run into
issues later.

Happy hacking,
Marcus

Hi Marcus,

Yes, I understand the page size limitation. However, if your vector is
1234 bytes, then you can happily allocate 4096 size buffer, but the
the block you always give out the multiple of 1234 byes (i.e. 1, 2 or
3 vectors). The address space wrapping would work fine, so the start
of the vectors would not be always at the same place. I think it could
be done, the question is whether it is worth to do it.

Miklos

Hi Miklos,

with sync blocks and fixed-rate decimators/interpolators, the scheduler
inherently knows how many buffers to allocate, etc., down the signal
processing line to always keep all blocks busy.
With general blocks, this is not possible; calls to forecast are
necessary to determine how much data needs to be supplied to keep the
signal processing chain running.
I’m not quite sure whether there is a performance penalty for blocks that
just forecast a need for as many samples as they’re asked to produce, or
whether they are scheduled identically to sync blocks; documentation on
the original GR scheduler is really, really sparse, and the code itself is
your primary source of help… I really can’t give you any hints, as I’ve
(most of the time) tried to get along without going too deep into the GR
scheduling framework (and I’m hoping for something more readable to come
along in the next major release of GR :-) )

Happy Hacking
Marcus

Hi Marcus,

On Thu, Aug 22, 2013 at 12:00 PM, Marcus Müller wrote:

Execute your computation when your input buffer is full and your output
buffer is empty. That way, you’ll get a quasi-fixed relative rate, but get
all the freedom, and the scheduling disadvantages, of a general_work block
with an itemsize of gr_complex (or whatever your data type is).

I know from experience that this might be hard to debug. However, once
your state machine is watertight, you’re not very likely to run into
issues later.

Thank you for the excellent advice. I had not thought of a generic base
class; that might help me. There is one block that produces one vector of
bytes (the packet data) and one vector of ints (the number of corrected
and uncorrected errors), which would be impossible to solve with a stream
since the output rates are not the same. Other than that, I think your
suggestion would work.

You say that there are scheduling disadvantages to a general_work block.
What are they? Sometimes I run into issues with the scheduler, but it is
not clear to me how it really works, what I should try to do and what I
should avoid. Can you describe it in a few words, or give me a pointer to
where I can read up on the technical (!!) details?

Best,
Miklos