I have discovered that the implemented random number generator in
gnuradio (see file [0]) is almost older than me. As written in the code,
the implementation is taken from ‘Numerical recipes in C’ (see version
from 1992). The problem is that this algorithm is really bad compared to
current algorithms. E.g. the period length is 1e8 compared to an
up-to-date algorithm (Mersenne twister) with a period length of about
1e6000.
I have fixed this [1] using boost.random and the mentioned Mersenne
twister. Furthermore I have written some test-cases and fixed the
transformation to Laplacian random numbers (the current implementation
is wrong). As well, I have added the random class to swig so that you
can use this in python.
Now my question: Before doing a pull request, do you have any concerns
regarding memory use or processing load? Obviously the new
implementation isn’t that light-weight as the ten lines of code before.
But the current implementation can not be used in any serious simulation
or publication, which is highly dependent on good random numbers. Some
information about the performance is given on this page: [2]. Look for
the generator mt19937 in table 24.5.
strange, I was looking at the same code just a few days ago, when I
needed to find out whether I understood filters well enough by pushing
noise through. Here’s my small testcases (thankfully I didn’t restart my
machine since then – these were still in /tmp).
So, in fact, I tested the performance of boost’s mt19937 + boost’s
normal distribution variate, and basically, it’s relatively slow,
indeed; generating 1e8 complex random numbers (meaning 200e8 random
floats) with a single thread takes
g++ (GCC) 5.1.1 20150618 (x86_64)
test_rnd (Boost)
g++ -OXXX test_rnd.cpp
test_gr (gr::random)
g++ $(pkg-config --cflags --libs gnuradio-runtime) -OXXX te_gr.cpp
-O3 (optimized for speed)
1.25 s
5.92 s
-O0 (explicitly unoptimized) (default)
23.62 s
7.13 s
Someone who cares for speed could get a 5x increase of speed by using
boost, because that person would be using optimization, anyways.
However, I can’t really explain what happens with boost and the
unoptimized case; it’s really three times as bad as the current
implementation.
However, asking perf about this, the -O3 version spends nearly all of
its time in main(), whereas the -O0 spends most of it in some boost
functions – which means that -O3 definitely inlines the calculations,
and from the disassembly alone I’d say all the stack operations needed
to jump in and out of routines seem to contribute significantly to the
problem.
Of course, that’s not really a benchmark for all systems. What about 32
bit? What about ARM? What about clang? Can anyone try to make a Windows
build?
a few days ago I stumbled over an RNG implementation [2] that uses some
of the newer AVX2 instructions. I guess this would improve performance
radically. There is also a paper on the matter [1].
for me, these results don’t look that bad. Consider that the quality of
the random numbers is increased enormously and with an ordinary
simulation you loop the period length of the current generator (about
1e8 samples) multiple times. The results should be highly correlated if
you use approximately the same periodicity in your simulation! That is a
bug that you hardly see at first sight.
It is possible to use the mt11213b generator, but this results only in a
performance increase of 7% (see boost reference page). I don’t think
that it is worth using this one.
Can you benchmark the used memory with your code (don’t know how this
works)? I think this isn’t that important, but would be interesting.
Wow, that’s a bit much to read right now. The problem I have with using
AVX2 would be portability, which would be no issue if we wrapped RNG in
a VOLK kernel and offered a good baseline competitor in portable C code.
Point is though that our alternative in RNG seems to be boost::mt19937,
which we could only “extern C{…}” into VOLK. I don’t like that
architecturally, but I think it’s definitely something I’d like to hear
Nathan’s input on.
What I do like, however is the takeaway that,
“we can safely ignore the risk of overlapping subsequences”
for two parallelly running mt19937 sequences that were seeded
differently – which means that we can well scale random number
generation to multiple cores, and especially lets us just divide each
output buffer into n_threads subbuffers and let the different generators
run parallely on that, without wrecking memory coherency bus bandwidth
too much. I guess that we’d ideally just spawn as many threads as there
are 4kB-pages to fill with random samples (meaning, at every 512
gr_complexes), just a wild guess.
hehe, I thought you might have an idea how to benchmark memory
consumption. So I’m stuck with this myself
What I can definitely say is that the -O3 variant of boost definitely
can’t allocate much memory, based on the time it needs – frequent
allocations would probably have caught my eye.
I’m afraid real numbers are impossible to get without the help of tools
like valgrind, which bend/replace libc/syscalls… and hence stretch the
execution duration massively.
However, I think “/usr/bin/time -v” (the ZSH builtin time isn’t as
capable) gives some information, too:
For the boost thing, it says, with little difference between -03 and -O0
(O0 gives 783772 kB, so even a bit more)
Maximum resident set size (kbytes): 783760
For the GNU Radio-random.h based approach:
-O0 = 789352 kB, -O3 = 789380 kB
Considering that 1e8 complex are 781,250 kB
The 5.5MB delta might be explained by the fact that I didn’t link
statically (that would’ve been unfair, I thought), and the gr::random
rtloads a lot libraries.
Now my question: Before doing a pull request, do you have any concerns
regarding memory use or processing load? Obviously the new
implementation isn’t that light-weight as the ten lines of code before.
But the current implementation can not be used in any serious simulation
or publication, which is highly dependent on good random numbers. Some
information about the performance is given on this page: [2]. Look for
the generator mt19937 in table 24.5.
Just a quick reminder that already, our noise source is known to be
pretty slow, which prompted the fast_noise_source (which precaches some
random data and reuses that). So, when you need maximum entropy, you use
the slow one and if you just want some AWGN you use the fast one. Unless
the load is significantly higher, I’d say we rather want better
randomness than faster noise sources – and the fast noise source will
stay unaffected anyway.
On Wed, Sep 2, 2015 at 12:39 PM, Marcus Müller [email protected]
wrote:
What I do like, however is the takeaway that,
Cheers,
Marcus
I agree with Martin here on the speed vs randomness. This is a noise
source
that should primarily be used for simulations where it’s OK to be rate
limited by your noise source (if you’re biggest problem in life is being
rate limited by your noise source then I’m jealous). I’m a huge fan of
Agner Fog and his work is always great, but it uses C++ which can’t be
used
in VOLK kernels. If someone wants to port that to C and stick it in VOLK
I’d accept it.
I’d say we rather want better
randomness than faster noise sources
If that’s the case, I’d recommend that Stefan uses the
normal_distribution variate from boost (Stefan, see my example), rather
than doing his own “normalization” of the RV, and we use that.
Regarding Boost’s mt19937 and the ways Stefan and I make normal
distributed RV out of the uint32_t that this emits:
From getting “raw” uniform uint32_t instead of normal floats through
boost’s normal_distribution(rng) variate is that, to little surprise,
75% of time is spent looking up/interpolating/calculating based on the
uniform integers to normally distributed floats. Blind guess is that
this would be something that someone who really cares about speed might
implement in VOLK with sufficient accuracy, making WGN generation
X3x0-capable.
75% of time is spent looking up/interpolating/calculating based on the
uniform integers to normally distributed floats. Blind guess is that
this would be something that someone who really cares about speed might
implement in VOLK with sufficient accuracy, making WGN generation
X3x0-capable.
Cheers,
Marcus
A call to SHA-256 can return 256 bytes at a time… Map those bytes
into {-1.0,1.0}. That may still not be fast enough, dunno. But the
distribution
should be good.
boost’s normal_distribution(rng) variate is that, to little surprise,
bytes into {-1.0,1.0}. That may still not be fast enough, dunno. But
the distribution
should be good.
Gah, that should be 256 bits, or 32 bytes at a time. Blame my
cats. They keep me awake at night. Yeah, that’s it.
This forum is not affiliated to the Ruby language, Ruby on Rails framework, nor any Ruby applications discussed here.