Forum: GNU Radio Upgrade (downgrade?) to a Q6600 CPU

Marcus D. Leech (Guest)
on 2008-12-10 20:23
(Received via mailing list)
I've been running my radio astronomy receiver software for a couple of
years on a dual-core Pentium P25 CPU with 2 GB of 533 MHz RAM, the CPU
clocked at 3.2 GHz.  This has allowed me to do 8 MHz dual-polarization
continuum and spectrum, with very few overruns (uOuOuO).

I "upgraded" this system to a Quad-core Q6600, with 4Gb of 667Mhz
ram.    It's only able to do 4.5Mhz in situations where
  the previous system could handle 8Mhz.  This is, of course, a big
disappointment, since I was hoping that multiple cores
  would increase the ability for me to add more complex signal
processing, but maintain the same bandwidth as the old system.

Apart from the obvious hardware upgrades (move to a motherboard that can
overclock the Q6600 into the 3 GHz region, and use dual-channel RAM), are
there software tweaks (I'm thinking particularly of USB buffering or
something similar) that would let me get better performance, without
overruns?

Note that I'm using the latest trunk code, as of a few days ago.

--
Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium
http://www.sbrac.org
Bob McGwier (Guest)
on 2008-12-10 22:00
(Received via mailing list)
Are you taking full advantage of the multi threading now available?  Are
you
running a 64 bit version of Linux?

Have you tried compiling GnuRadio with the following flags?

-O3 -msse -msse3 -mfpmath=sse



The gain from the multiple cores is that more threads get "two thirds of
processor" making the aggregate faster but NOT necessarily the single
threads.

Bob


ARRL SDR Working Group Chair
Member: ARRL, AMSAT, AMSAT-DL, TAPR, Packrats,
NJQRP, QRP ARCI, QCWA, FRC.
"And yes I said, yes I will Yes", Molly Bloom
Marcus D. Leech (Guest)
on 2008-12-11 00:35
(Received via mailing list)
Bob McGwier wrote:
> Are you taking full advantage of the multi threading now available?  Are you
> running a 64 bit version of Linux?
>
Is it not true that the latest SVN trunk will automatically turn on
SMP/multi-threaded support?  From what I'm seeing of CPU usage, it
certainly looks like it's scheduling threads across multiple CPUs.  One
CPU always seems to be busier than the others, which is likely where the
bottleneck is.  I'm running an x86_64 system (F10).
> Have you tried compiling GnuRadio with the following flags?
>
> -O3 -msse -msse3 -mfpmath=sse
>
>
Did that (via CFLAGS="-O3 -msse -msse3 -mfpmath=sse").  That made a
slight improvement.  I also added --with-md-arch=x86_64 when I did the
"configure".
>
> The gain from the multiple cores is that more threads get "two thirds of
> processor" making the aggregate faster but NOT necessarily the single
> threads.
>
> Bob
>
>
Indeed, if the single thread that reads data from the USB can't quite
keep up, then the rest of the threads won't have any work to do.  I'm
wondering if larger buffer sizes in that chain could possibly help, but
that may be clutching at straws.
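
If I do go down that path, a minimal sketch of what I'd try is below;
the fusb_* keyword names are from memory and should be checked against
the usrp.py wrapper in the tree before anyone takes this as gospel.

from gnuradio import gr, usrp

class ra_rx(gr.top_block):
    def __init__(self):
        gr.top_block.__init__(self)
        # decim 8 on the 64 MS/s ADC -> 8 MS/s complex to the host
        src = usrp.source_c(which=0,
                            decim_rate=8,
                            fusb_block_size=4096,  # assumed knob: bytes per fast-USB transfer
                            fusb_nblocks=32)       # assumed knob: transfers kept in flight
        # a null sink is enough to exercise the USB path while watching for uO
        self.connect(src, gr.null_sink(gr.sizeof_gr_complex))

if __name__ == '__main__':
    ra_rx().run()

If bumping those two numbers moves the uO rate at all, that would at
least tell me the bottleneck is in the USB buffering rather than in the
downstream blocks.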

--
Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium
http://www.sbrac.org
Marcus D. Leech (Guest)
on 2008-12-12 17:10
(Received via mailing list)
Bob McGwier wrote:
> restructuring your code to take advantage of the multiple cores.  This is
> not done automatically.  It must be done by design in the application.
>
> Bob
>
Yup, I found all that in the "configure" file, and rebuilt.  It improved
performance marginally, from being able to handle 4.5 MHz without too
many uOuO to about 5.33 MHz with the same level of overruns.

Also, I looked into the process details of the process running
usrp_ra_receiver.py, and found that it had 25 threads.  That must mean
that each block runs in its own thread now?  I'm intrigued by
multi-threaded FFT (FFTW3 can do this if you ask it to), but I don't
think GNU Radio takes advantage of that, and it's not clear how much of a
win it would be--unless the FFT length is quite long.
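
(For reference, a quick generic way to get that thread count on Linux,
nothing GNU Radio-specific, is just to look at /proc:)

import os, sys

# Count the threads of a running process; pass the PID of the
# usrp_ra_receiver.py process on the command line.
pid = sys.argv[1]
print("process %s has %d threads" % (pid, len(os.listdir("/proc/%s/task" % pid))))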

--
Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium
http://www.sbrac.org
Johnathan C. (Guest)
on 2008-12-12 18:24
(Received via mailing list)
On Thu, Dec 11, 2008 at 1:46 PM, Marcus D. Leech 
<removed_email_address@domain.invalid>
wrote:

> Also, I looked into the process details of the process running
> usrp_ra_receiver.py, and found that it had
>  25 threads.  That must mean that each block runs in its own thread
> now?

Yes.  The "thread-per-block" design by Eric allows, in many cases, for
the flowgraph throughput to rise until an individual block consumes
100% of a single core.  This is a great improvement over the
single-threaded scheduler that would rate-limit when an entire
flowgraph of blocks consumed a single core.

You can switch between the two for comparison:

$ export GR_SCHEDULER=STS       # single-threaded-scheduler (old)
$ export GR_SCHEDULER=TPB       # thread-per-block (new, default)
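
If you'd rather pick the scheduler from inside the script instead of the
shell, setting the variable in os.environ before the flowgraph starts
should behave the same way; a sketch, assuming nothing has created the
scheduler yet at that point:

import os
os.environ["GR_SCHEDULER"] = "STS"   # or "TPB"; must be set before the top block runs

from gnuradio import gr

tb = gr.top_block()
# ... connect your blocks here ...
tb.run()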

-Johnathan
Bob McGwier (Guest)
on 2008-12-12 18:29
(Received via mailing list)
To be completely honest, I have never understood the gain to us that
justified the COST the single-threaded scheduler imposed.  I cannot find
a service the single-threaded scheduler provides that cannot be done more
easily, and with much greater efficiency, in the TPB scheduler, or even
before that in a simpler worker-thread pool, but I am sure this is my
ignorance.  Of course, it is easy to criticize when I did not write it
and when the author is in the middle of the Atlantic with a torn sail,
limping home (take it easy, he was 30 nm from the end at 7 this
morning)!

Well anyway, we are there now with TPB and that is only going to get
better.

I would prefer to be able to enable and set this in the config file,
rather than through environment variables alone.

Bob


ARRL SDR Working Group Chair
Member: ARRL, AMSAT, AMSAT-DL, TAPR, Packrats,
NJQRP, QRP ARCI, QCWA, FRC.
"And yes I said, yes I will Yes", Molly Bloom
Johnathan C. (Guest)
on 2008-12-12 19:09
(Received via mailing list)
On Fri, Dec 12, 2008 at 8:19 AM, Bob McGwier 
<removed_email_address@domain.invalid>
wrote:

> To be completely honest, I have never understood the gain to us that
> justified the COST the single-threaded scheduler imposed.  I cannot find
> a service the single-threaded scheduler provides that cannot be done more
> easily, and with much greater efficiency, in the TPB scheduler, or even
> before that in a simpler worker-thread pool, but I am sure this is my
> ignorance.

The default is to use the thread-per-block scheduler.  It is almost
always better than the single-threaded scheduler, but there are some
pathological cases where it isn't.  One case is where there are simply
too many blocks, as in the case of the wideband gr-pager receiver.
This application creates ~1500 blocks as it is simultaneously decoding
120 FSK channels across 3 MHz.  This can exhaust resources in some
environments.

> Of course, it is easy to criticize when I did not write it and when the
> author is in the middle of the Atlantic with a torn sail, limping home
> (take it easy, he was 30 nm from the end at 7 this morning)!

Boo hoo.  Sailing the Atlantic for several weeks, fresh salt air
breezes, beautiful sunrises and sunsets, the physical workout--I'm
having a hard time feeling sympathy for a torn sail :)

> I would prefer to be able to enable and set this in the config file,
> rather than through environment variables alone.

Again, the tpb is the default; you've already been using it.

-Johnathan
Marcus D. Leech (Guest)
on 2008-12-13 02:49
(Received via mailing list)
Johnathan C. wrote:
> $ export GR_SCHEDULER=STS       # single-threaded-scheduler (old)
> $ export GR_SCHEDULER=TPB       # thread-per-block (new, default)
>
> -Johnathan
>
>
The CPU consumption behavior in my application appears to be that one
CPU is loaded almost 2:1 relative to the other three.  No single CPU is
consuming 100%, but the total CPU consumption of the usrp_ra_receiver.py
process hovers around 85-90%.
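
(A generic way to watch that per-core split, just sampling the counters
in /proc/stat twice, is something like the following; Linux-only and
nothing GNU Radio-specific:)

import time

def cpu_times():
    # Return {cpu_name: (busy_jiffies, total_jiffies)} parsed from /proc/stat.
    times = {}
    for line in open("/proc/stat"):
        fields = line.split()
        if fields[0].startswith("cpu") and fields[0] != "cpu":
            vals = [int(v) for v in fields[1:]]
            idle = vals[3] + (vals[4] if len(vals) > 4 else 0)  # idle + iowait
            times[fields[0]] = (sum(vals) - idle, sum(vals))
    return times

before = cpu_times()
time.sleep(5)
after = cpu_times()
for cpu in sorted(before):
    busy = after[cpu][0] - before[cpu][0]
    total = after[cpu][1] - before[cpu][1]
    print("%s: %5.1f%% busy" % (cpu, 100.0 * busy / max(total, 1)))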

I'm working on getting a new motherboard with dual-channel RAM
capability, and the ability to drive my Q6600 at a higher clock rate.

I think what's happening is that the I/O thread is going as fast as it
can, but it just isn't fast enough to service the data coming off of the
USB.  Could it also be that my USB subsystem is just not that good?

--

Marcus L.
Principal Investigator, Shirleys Bay Radio Astronomy Consortium
http://www.sbrac.org
Johnathan C. (Guest)
on 2008-12-15 20:02
(Received via mailing list)
On Fri, Dec 12, 2008 at 4:48 PM, Marcus D. Leech 
<removed_email_address@domain.invalid>
wrote:

> I think what's happening is that the I/O thread is going as fast as it
> can, but it just isn't fast enough to service the data coming off of the
> USB.  Could it also be that my USB subsystem is just not that good?

You can try running the usrp_benchmark_usb.py example to see whether
your USB subsystem can actually sustain the highest data rate of 32
MB/sec.  But it isn't likely that the USB is the issue.
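
As a back-of-the-envelope sanity check (plain arithmetic, nothing
USRP-specific; the channel counts and sample widths below are only
examples), you can see how close a configuration sits to that ceiling:

def usb_rate_mbytes(sample_rate_hz, nchannels=1, bytes_per_sample=4):
    # bytes_per_sample: 4 for 16-bit I/Q samples, 2 for the 8-bit-sample mode
    return sample_rate_hz * nchannels * bytes_per_sample / 1e6

print("8 MS/s, 1 channel, 16-bit I/Q: %.0f MB/s" % usb_rate_mbytes(8e6))
print("8 MS/s, 2 channels, 8-bit I/Q: %.0f MB/s" % usb_rate_mbytes(8e6, 2, 2))

Both of those land right at 32 MB/s, so there is essentially no headroom
there.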

Are you familiar with the 'oprofile' profiler?

-Johnathan
Jing C. (Guest)
on 2008-12-28 01:26
(Received via mailing list)
Can we implement some blocks in the FPGA to reduce the CPU consumption?

Has any work like this been done already?

Thanks

Jing

On Mon, Dec 15, 2008 at 12:36 PM, Johnathan C.
Johnathan C. (Guest)
on 2008-12-28 17:32
(Received via mailing list)
On Sat, Dec 27, 2008 at 3:25 PM, Jing C. <removed_email_address@domain.invalid>
wrote:

> Can we implement some blocks in the FPGA to reduce the CPU consumption?
>
> Has any work like this been done already?

This has been done in a variety of ways for the USRP1, although there
isn't much space free in the FPGA for new logic unless you are willing
to sacrifice transmit capability or the number of receiver DDCs.

The USRP2 has a much larger amount of free logic (~50% currently, may
change) and was designed with the idea that people might offload the
high rate portions of the signal processing chain, or even all of it,
and run hostless.

Of course, the effort to write HDL, verify in simulation, verify in
synthesis, and debug with a logic analyzer is a lot more than assembling
blocks into a flowgraph in Python.

-Johnathan