Upgrade (downgrade?) to a Q6600 CPU

Marcus_DSLeech · December 10, 2008, 7:23pm

I’ve been running my radio astronomy receiver software for a couple of
years on a Dual-core Pentium P25 CPU, with 2GB of 533MHz RAM, with the
CPU clocked at 3.2GHz. This has allowed me to do 8MHz dual-polarization
continuum and spectrum, with very few overruns (uOuOuO).

I “upgraded” this system to a Quad-core Q6600, with 4GB of 667MHz
RAM. It’s only able to do 4.5MHz in situations where the previous
system could handle 8MHz. This is, of course, a big disappointment,
since I was hoping that multiple cores would let me add more complex
signal processing while maintaining the same bandwidth as the old system.

Apart from the obvious hardware upgrades (move to a MOBO that can
overclock the Q6600 into the 3GHz region, and use dual-channel RAM),
are there software tweaks (I’m thinking particularly of USB buffering
or something) that would allow me to get better performance, without
overruns?

Note that I’m using the latest trunk code, as of a few days ago.

–
Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium

Marcus_DSLeech · December 10, 2008, 9:00pm

Are you taking full advantage of the multithreading now available? Are
you running a 64-bit version of Linux?

Have you tried compiling GnuRadio with the following flags?

-O3 -msse -msse3 -mfpmath=sse

The gain from the multiple cores is that more threads get “two thirds of
processor” making the aggregate faster but NOT necessarily the single
threads.

Bob

ARRL SDR Working Group Chair
Member: ARRL, AMSAT, AMSAT-DL, TAPR, Packrats,
NJQRP, QRP ARCI, QCWA, FRC.
“And yes I said, yes I will Yes”, Molly Bloom

Marcus_DSLeech · December 12, 2008, 4:10pm

Bob McGwier wrote:

restructuring your code to take advantage of the multiple cores. This is
not done automatically. It must be done by design in the application.

Bob

Yup, I found all that in the “configure” file, and rebuilt. It improved
performance marginally, from being able to handle 4.5MHz without too
many uOuO to about 5.33MHz with the same level of overruns.

Also, I looked into the process details of the process running
usrp_ra_receiver.py, and found that it had
25 threads. That must mean that each block runs in its own thread
now? I’m intrigued by multi-threaded
FFT (FFTW3 can do this if you ask it to), but I don’t think Gnu Radio
takes advantage of that, and it’s
not clear how much of a win it would be–unless the FFT length is
quite long.
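
For anyone who wants to repeat the thread-count check, here is a minimal sketch. It assumes a Linux system, where /proc/<pid>/status carries a "Threads:" line; the function names are made up for illustration and are not part of GNU Radio.

```python
def parse_thread_count(status_text):
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' line found")


def count_threads(pid):
    """Read the kernel's thread count for a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_thread_count(f.read())


# Hypothetical usage, with the PID of the running receiver process:
# print(count_threads(12345))
```

The same number is what tools like ps and top report as the number of lightweight processes for the PID.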

–
Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium

Marcus_DSLeech · December 12, 2008, 5:24pm

On Thu, Dec 11, 2008 at 1:46 PM, Marcus D. Leech wrote:

Also, I looked into the process details of the process running
usrp_ra_receiver.py, and found that it had
25 threads. That must mean that each block runs in its own thread
now?

Yes. The “thread-per-block” design by Eric allows, in many cases, for
the flowgraph throughput to rise until an individual block consumes
100% of a single core. This is a great improvement over the
single-threaded scheduler that would rate-limit when an entire
flowgraph of blocks consumed a single core.
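
A rough way to picture the difference is the toy model below. It is only a sketch: the per-block costs are invented numbers, and the model ignores scheduling overhead, memory bandwidth and buffering, so it gives an upper bound rather than a prediction.

```python
def sts_max_rate(costs_ns):
    """Single-threaded scheduler: one core does every block's work per item."""
    return 1e9 / sum(costs_ns)


def tpb_max_rate(costs_ns, cores):
    """Thread-per-block: bounded by the slowest block, or by total work / cores."""
    slowest_block_limit = 1e9 / max(costs_ns)
    all_cores_limit = cores * 1e9 / sum(costs_ns)
    return min(slowest_block_limit, all_cores_limit)


# Hypothetical CPU cost per item for each block, in nanoseconds.
costs = [50, 100, 50, 50]

print(sts_max_rate(costs))     # items/second when one core runs everything
print(tpb_max_rate(costs, 4))  # items/second with one thread per block, 4 cores
```

With these made-up costs the single-threaded bound is 4 million items per second, while the thread-per-block bound is 10 million, set entirely by the one block that costs 100 ns per item. That is the "individual block consumes 100% of a single core" limit described above.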

You can switch between the two for comparison:

$ export GR_SCHEDULER=STS # single-threaded-scheduler (old)
$ export GR_SCHEDULER=TPB # thread-per-block (new, default)
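
To run the comparison from a script instead of by hand, something like the following might work. It is a sketch: only the GR_SCHEDULER variable and its STS/TPB values come from this thread; the helper names and the command line are placeholders.

```python
import os
import subprocess


def scheduler_env(name, base_env=None):
    """Return a copy of the environment with GR_SCHEDULER set to STS or TPB."""
    if name not in ("STS", "TPB"):
        raise ValueError("expected 'STS' or 'TPB', got %r" % (name,))
    env = dict(os.environ if base_env is None else base_env)
    env["GR_SCHEDULER"] = name
    return env


def run_with_scheduler(cmd, name):
    """Run a flowgraph command (a list of strings) under the chosen scheduler."""
    return subprocess.call(cmd, env=scheduler_env(name))


# Hypothetical usage; the command line is a placeholder:
# run_with_scheduler(["python", "usrp_ra_receiver.py"], "STS")
```

Setting the variable only in the child's environment keeps the calling shell untouched, so back-to-back runs with STS and TPB cannot leak settings into each other.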

-Johnathan

Marcus_DSLeech · December 12, 2008, 5:29pm

To be completely honest, I have never understood the gain to us provided by
the COST the single threaded scheduler imposed. I cannot find the service
the single thread scheduler provides that cannot be done more easily and
with much greater efficiency in the TPB or even before in a simpler worker
thread pool thing but I am sure this is my ignorance. Of course, it is easy
to criticize when I did not write it and when the author is in the middle of
the Atlantic with a torn sail limping home (take it easy, he was 30 nm from
the end at 7 this morning)!

Well anyway, we are there now with TPB and that is only going to get better.

I would prefer to have enabled and know how to set this in the config file
than to have environment variables alone.

Bob

ARRL SDR Working Group Chair
Member: ARRL, AMSAT, AMSAT-DL, TAPR, Packrats,
NJQRP, QRP ARCI, QCWA, FRC.
“And yes I said, yes I will Yes”, Molly Bloom

Marcus_DSLeech · December 10, 2008, 11:35pm

Bob McGwier wrote:

Are you taking full advantage of the multithreading now available? Are you
running a 64-bit version of Linux?

Is it not true that the latest SVN trunk will automatically turn on
SMP/Multi-threaded support? From what I’m seeing
from CPU usage, it certainly looks like it’s scheduling threads across
multiple CPUs. One CPU seems to always
be busier than the others, which is likely where the bottleneck is.
I’m running an x86_64 system (F10).

Have you tried compiling GnuRadio with the following flags?

-O3 -msse -msse3 -mfpmath=sse

Did that (via CFLAGS="-O3 -msse -msse3 -mfpmath=sse"). That made a
slight improvement. I also added -with-md-arch=x86_64 when I did the
“configure”.

The gain from the multiple cores is that more threads get “two thirds of
processor” making the aggregate faster but NOT necessarily the single
threads.

Bob

Indeed, if the single thread that reads data from the USB can’t quite
keep up, then the rest of the threads won’t have any work to do. I’m
wondering whether larger buffer sizes in that chain could possibly help,
but that’s clutching at straws.
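
How much a bigger buffer could buy can be estimated with simple arithmetic: a buffer of B bytes filling at R bytes per second absorbs a reader stall of at most B/R seconds before it overruns. The sketch below uses made-up buffer sizes and a made-up data rate; none of the numbers are taken from the USRP driver.

```python
def stall_tolerance_ms(buffer_bytes, rate_bytes_per_s):
    """Longest reader stall, in milliseconds, that the buffer can absorb."""
    return 1000.0 * buffer_bytes / rate_bytes_per_s


# Hypothetical incoming data rate: 16 million bytes per second.
rate = 16 * 1000 * 1000

# Hypothetical buffer sizes, in bytes.
for size in (64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
    print(size, stall_tolerance_ms(size, rate))
```

A larger buffer only covers temporary stalls. If the reading thread is slower than the incoming data on average, any finite buffer fills up eventually and the overruns come back.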

–
Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium

Marcus_DSLeech · December 13, 2008, 1:49am

Johnathan C. wrote:

$ export GR_SCHEDULER=STS # single-threaded-scheduler (old)
$ export GR_SCHEDULER=TPB # thread-per-block (new, default)

-Johnathan

The CPU consumption behavior in my application appears to be that one
CPU is loaded more heavily than the other three, by almost 2:1. No
single CPU is consuming 100%, but the total CPU consumption of the
usrp_ra_receiver.py process hovers around 85-90%.

I’m working on getting a new motherboard with dual-channel RAM
capability, and the ability to drive my Q6600
at a higher clock rate.

I think what’s happening is that the I/O thread is going as fast as it
can, but it just isn’t fast enough to service the
data coming off of the USB. Could it also be that my USB subsystem is
just not that good?

–

Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium

Marcus_DSLeech · December 12, 2008, 6:09pm

On Fri, Dec 12, 2008 at 8:19 AM, Bob McGwier wrote:

To be completely honest, I have never understood the gain to us provided by
the COST the single threaded scheduler imposed. I cannot find the service
the single thread scheduler provides that cannot be done more easily and
with much greater efficiency in the TPB or even before in a simpler worker
thread pool thing but I am sure this is my ignorance.

The default is to use the thread-per-block scheduler. It is almost
always better than the single-threaded scheduler, but there are some
pathological cases where it isn’t. One case is where there are simply
too many blocks, as in the case of the wideband gr-pager receiver.
This application creates ~1500 blocks as it is simultaneously decoding
120 FSK channels across 3 MHz. This can exhaust resources in some
environments.

Of course, it is easy
to criticize when I did not write it and when the author is in the middle of
the Atlantic with a torn sail limping home (take it easy, he was 30 nm from
the end at 7 this morning)!

Boo hoo. Sailing the Atlantic for several weeks, fresh salt air
breezes, beautiful sunrises and sunsets, the physical workout–I’m
having a hard time feeling sympathy for a torn sail.

I would prefer to have enabled and know how to set this in the config file
than to have environment variables alone.

Again, the TPB scheduler is the default; you’ve already been using it.

-Johnathan

Marcus_DSLeech · December 28, 2008, 12:26am

Can we implement some blocks in the FPGA to reduce CPU consumption?

Has any work like this been done already?

Thanks

Jing

Marcus_DSLeech · December 15, 2008, 7:02pm

On Fri, Dec 12, 2008 at 4:48 PM, Marcus D. Leech wrote:

I think what’s happening is that the I/O thread is going as fast as it
can, but it just isn’t fast enough to service the
data coming off of the USB. Could it also be that my USB subsystem is
just not that good?

You can try running the usrp_benchmark_usb.py example to see if your
USB is not handling the highest data rate of 32 MB/sec. But it isn’t
likely that the USB is the issue.
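
As a sanity check on the 32 MB/sec figure, the sketch below works out how many bytes per second a given sample rate puts on the USB. It assumes each complex sample is sent as a 16-bit I and a 16-bit Q value (4 bytes); that wire format is an assumption here, not something stated in this thread, and MB is taken as 10^6 bytes.

```python
# The 32 MB/sec figure quoted above, with MB taken as 10**6 bytes.
USB_LIMIT_BYTES_PER_S = 32 * 1000 * 1000


def usb_rate_bytes_per_s(sample_rate_hz, channels=1, bytes_per_component=2):
    """Bytes/second for complex samples: channels * (I + Q) * component size."""
    return sample_rate_hz * channels * 2 * bytes_per_component


def fits_usb(sample_rate_hz, channels=1, bytes_per_component=2):
    """True if the stream stays at or below the quoted USB limit."""
    rate = usb_rate_bytes_per_s(sample_rate_hz, channels, bytes_per_component)
    return rate <= USB_LIMIT_BYTES_PER_S


print(usb_rate_bytes_per_s(8000000))      # one channel at 8 MS/s
print(fits_usb(8000000, channels=2))      # two channels at 8 MS/s each
print(fits_usb(4000000, channels=2))      # two channels at 4 MS/s each
```

Under that assumption a single 8 MS/s complex stream already sits exactly at the quoted limit, so the benchmark result is worth comparing against whatever rate the application actually requests.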

Are you familiar with the ‘oprofile’ profiler?

-Johnathan

Marcus_DSLeech · December 28, 2008, 4:32pm

On Sat, Dec 27, 2008 at 3:25 PM, Jing C. wrote:

Can we implement some blocks in the FPGA to reduce CPU consumption?

Has any work like this been done already?

This has been done in a variety of ways for the USRP1, although there
isn’t much space free in the FPGA for new logic unless you are willing
to sacrifice transmit capability or the number of receiver DDCs.

The USRP2 has a much larger amount of free logic (~50% currently, may
change) and was designed with the idea that people might offload the
high rate portions of the signal processing chain, or even all of it,
and run hostless.

Of course, the effort to write HDL, verify in simulation, verify in
synthesis, and debug with a logic analyzer is a lot more than assembling
blocks into a flowgraph in Python.

-Johnathan