QT GUI time sink (float) unnecessary memmove()?

Hi,

Can this memmove() be safely skipped

if ((d_start == 0) || (gr::high_res_timer_now() - d_last_time >
d_update_time))?

I think it can, but I am not sure.

With some high throughput, high sample rate flowgraphs, ps shows my
time_sink_f thread taking up ~41% of a CPU and oprofile shows memmove()
as the number 2 pig on the list:

CPU: Intel Sandy Bridge microarchitecture, speed 3.5e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (No unit mask) count 90000
samples % image name symbol name
24498 33.9825 libvolk.so.0.0.0 volk_32f_convert_64f_u_avx
14278 19.8058 libc-2.18.so __memmove_ssse3_back
7488 10.3870 no-vmlinux /no-vmlinux
2668 3.7009 libgnuradio-qtgui-3.7.7git.so.0.0.0
gr::qtgui::time_sink_f_impl::_test_trigger_slope(float const*) const
2073 2.8756 libpthread-2.18.so pthread_mutex_lock

The volk_32f_convert_64f_u_avx() call is unavoidable as Qwt wants
doubles for plotting and not floats. But it might also be able to be
deferred to the very end when the decision to plot is known for sure.
(But that’s more surgery than I care to take on at the moment.)

Regards,
Andy

On Sat, Mar 28, 2015 at 11:00 AM, Andy W.
[email protected]
wrote:

I think it can, but I am not sure.
14278 19.8058 libc-2.18.so __memmove_ssse3_back
Regards,
Andy

The for loop there is in case we’re triggering with a delay set, so that
sets d_start into the buffers. But we pass the vector of buffers to the
plotting widget, which will start at index 0. There are a couple of
things
that could work here. We add an argument to TimeUpdateEvent that adds
the
index value. We could also only do the memcpy if d_start > 0. But
thinking
about the volk convert function, that’s both copying the data from the
input buffer into the internal buffer as well as performing the
conversion.
We can’t just hold data in the input since we don’t want to back up the
data until we’re ready to plot both with timing and with a full enough
buffer – it’s just sampling a section at a time and drops everything in
between. That part could be converted into a memcpy instead of the volk
convert. Then, when we’re ready to plot, we call the volk convert that
also
does the move from d_start to 0, so it combines those two elements.

Thoughts on those proposals?

Tom

Hi Tom:

On Sat, 2015-03-28 at 11:12 -0700, Tom R. wrote:

On Sat, Mar 28, 2015 at 11:00 AM, Andy W.
[email protected] wrote:

    Can this memmove() be safely skipped

[snip]

    The volk_32f_convert_64f_u_avx() call is unavoidable as Qwt
    wants
    doubles for plotting and not floats. But it might also be able
    to be
    deferred to the very end when the decision to plot is known
    for sure.
    (But that's more surgery than I care to take on at the
    moment.)

The for loop there is in case we’re triggering with a delay set, so
that sets d_start into the buffers. But we pass the vector of buffers
to the plotting widget, which will start at index 0. There are a
couple of things that could work here. We add an argument to
TimeUpdateEvent that adds the index value. We could also only do the
memcpy if d_start > 0.

Modifying TimeUpdateEvent() would work. I don’t know how much ripple
that would cause in forcing other code to change.

BTW, I do use delayed trigger.
I did a hack to only do the memmove() only if d_start != 0 and we were
going to plot in this call to work(). See below for results.

But thinking about the volk convert function, that’s both copying the
data from the input buffer into the internal buffer as well as
performing the conversion. We can’t just hold data in the input since
we don’t want to back up the data until we’re ready to plot both with
timing and with a full enough buffer – it’s just sampling a section
at a time and drops everything in between.

Right.

That part could be converted into a memcpy instead of the volk
convert. Then, when we’re ready to plot, we call the volk convert that
also does the move from d_start to 0, so it combines those two
elements.

Yeah, that’s the surgery part. :slight_smile: It would require adding a new set of
buffers to hold floats objects, and then convert them when a
determination to plot was made.

This also affects the memmove() of the tail for the trigger delay. It
would operate on the new set of float buffers (vs the buffers holding
doubles).

Thoughts on those proposals?

Tom

I just did this quick hack and tested it:

    --- a/gr-qtgui/lib/time_sink_f_impl.cc
    +++ b/gr-qtgui/lib/time_sink_f_impl.cc
    @@ -622,13 +622,15 @@ namespace gr {

           // If we've have a trigger and a full d_size of items in 

the buffers, plo
if((d_triggered) && (d_index == d_end)) {
- // Copy data to be plotted to start of buffers.
- for(n = 0; n < d_nconnections; n++) {
- memmove(d_buffers[n], &d_buffers[n][d_start],
d_sizesizeof(double));
- }
-
// Plot if we are able to update
if(gr::high_res_timer_now() - d_last_time >
d_update_time) {
+ if(d_start != 0) {
+ // Copy data to be plotted to start of buffers.
+ for(n = 0; n < d_nconnections; n++) {
+ memmove(d_buffers[n],
+ &d_buffers[n][d_start],
d_size
sizeof(double));
+ }
+ }
d_last_time = gr::high_res_timer_now();

I got about 2-5% of my CPU back and memmove() fell into the “I don’t
care” part of the oprofile stats. :slight_smile:
Everything seemed ok in the GUI, but my input causes triggers to happen
quite often, so I haven’t tested infrequent trigger cases.

CPU: Intel Sandy Bridge microarchitecture, speed 3.5e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
78158 39.0737 libvolk.so.0.0.0 volk_32f_convert_64f_u_avx
22777 11.3870 no-vmlinux /no-vmlinux
13972 6.9851 libgnuradio-qtgui-3.7.7git.so.0.0.0
gr::qtgui::time_sink_f_impl::_test_trigger_slope(float const*) const
7781 3.8900 libgnuradio-qtgui-3.7.7git.so.0.0.0
gr::qtgui::time_sink_f_impl::_test_trigger_norm(int, std::vector<void
const*, std::allocator<void const*> >)
7236 3.6175 libpthread-2.18.so pthread_mutex_lock
6163 3.0811 libgnuradio-runtime-3.7.7git.so.0.0.0
boost::detail::sp_counted_base::release()
5942 2.9706 libpthread-2.18.so pthread_mutex_unlock
4947 2.4732 libgnuradio-runtime-3.7.7git.so.0.0.0
gr::block_executor::run_one_iteration()
3826 1.9127 libgnuradio-runtime-3.7.7git.so.0.0.0
gr::block_detail::input(unsigned int)
3555 1.7773 libstdc++.so.6.0.19
/usr/lib64/libstdc++.so.6.0.19
3206 1.6028 libc-2.18.so __memmove_ssse3_back

Regards,
Andy

On Sat, Mar 28, 2015 at 12:50 PM, Andy W.
[email protected]
wrote:

would operate on the new set of float buffers (vs the buffers holding

13972 6.9851 libgnuradio-qtgui-3.7.7git.so.0.0.0
3826 1.9127 libgnuradio-runtime-3.7.7git.so.0.0.0
samples % image name symbol name
2091 2.6980 libstdc++.so.6.0.19 /usr/lib64/libstdc++.so.6.0.19
Andy

Andy,

Excellent!

I’ve got a few other minor patches for some things, I’ll put this in
there
to and test on my end as well.

Tom

On Sat, 2015-03-28 at 14:45 -0400, Andy W. wrote:

[snip]

    The volk_32f_convert_64f_u_avx() call is unavoidable as Qwt
    wants
    doubles for plotting and not floats. But it might also be able
    to be
    deferred to the very end when the decision to plot is known
    for sure.
    (But that's more surgery than I care to take on at the
    moment.)

That part could be converted into a memcpy instead of the volk
doubles).

Thoughts on those proposals?

Your proposal for implementing memcpy() and deferring volk_*() to do the
conversion and “memmove” in one step is great! :slight_smile:

I just implemented it, and the time_sink_f thread has gone from 41.5%
CPU down to 29.1% CPU in my tests. :slight_smile: memcpy() now dominates the
thread, but that’s to be expected.

With my initial hack:

4947 2.4732 libgnuradio-runtime-3.7.7git.so.0.0.0
gr::block_executor::run_one_iteration()
3826 1.9127 libgnuradio-runtime-3.7.7git.so.0.0.0
gr::block_detail::input(unsigned int)
3555 1.7773 libstdc++.so.6.0.19 /usr/lib64/libstdc++.so.6.0.19
3206 1.6028 libc-2.18.so __memmove_ssse3_back
[…]

With my implementation of your suggestion:

CPU: Intel Sandy Bridge microarchitecture, speed 3.5e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a
unit mask of 0x00 (No unit mask) count 90000
samples % image name symbol name
27595 35.6051 libc-2.18.so __memcpy_sse2_unaligned
12225 15.7736 no-vmlinux /no-vmlinux
4051 5.2269 libpthread-2.18.so pthread_mutex_lock
3739 4.8243 libgnuradio-runtime-3.7.7git.so.0.0.0
boost::detail::sp_counted_base::release()
3362 4.3379 libpthread-2.18.so pthread_mutex_unlock
2876 3.7108 libgnuradio-runtime-3.7.7git.so.0.0.0
gr::block_executor::run_one_iteration()
2364 3.0502 libgnuradio-runtime-3.7.7git.so.0.0.0
gr::block_detail::input(unsigned int)
2091 2.6980 libstdc++.so.6.0.19
/usr/lib64/libstdc++.so.6.0.19
1388 1.7909 libgnuradio-runtime-3.7.7git.so.0.0.0
gr::tpb_detail::notify_upstream(gr::block_detail*)
1138 1.4683 libc-2.18.so __memmove_ssse3_back
[…]
2 0.0026 libvolk.so.0.0.0 __volk_32f_convert_64f_d
[…]
1 0.0013 libvolk.so.0.0.0 volk_32f_convert_64f_a_avx

Regards,
Andy

On Sat, Mar 28, 2015 at 5:32 PM, Andy W.
[email protected]
wrote:

When testing, I used 5 float streams rumning at over 150 Msps each, with
15 microsecomd bursts of 50 MHz at about 10 microseconds apart. I used
enough x points to see two bursts on the gui. Normal trigger. (Free or auto
trigger moght be too taxing.)

-Regards
Andy

Andy, if you have a chance, can you check out this new branch:

https://github.com/trondeau/gnuradio/tree/qtgui/controlpanel

It adds the fixes that we talked about. I just want to verify that
things
are still looking and behaving well for you.

The other trick of this branch is if you go into the QT GUI Time Sink
properties and turn “Control Panel” to Yes. I wouldn’t mind a quick bit
of
feedback there, either.

Tom

When testing, I used 5 float streams rumning at over 150 Msps each, with
15 microsecomd bursts of 50 MHz at about 10 microseconds apart. I used
enough x points to see two bursts on the gui. Normal trigger. (Free
or auto trigger moght be too taxing.)

-Regards
Andy

On Sun, 2015-03-29 at 17:20 -0700, Tom R. wrote:

On Sat, Mar 28, 2015 at 5:32 PM, Andy W.
[email protected] wrote:

Andy, if you have a chance, can you check out this new branch:

https://github.com/trondeau/gnuradio/tree/qtgui/controlpanel

It adds the fixes that we talked about. I just want to verify that
things are still looking and behaving well for you.

I had time to inspect the change for time_sink_f_impl.* but haven’t had
time to test yet. What you did was very close to what I did (including
naming the new buffers ‘d_fbuffers’ :slight_smile: ), but I will suggest two
improvements:

  1. The size of the allocated double ‘d_buffers’ arrays can now be
    reduced from d_buffer_size to d_size, so they only take up 1/2 as much
    memory.

  2. (I think) You can defer the call to volk_32f_convert_64f() even more,
    to when you know for sure plotting is going to happen in the clause that
    checks the update time. See the patch in line below.
    It works for me.

The other trick of this branch is if you go into the QT GUI Time Sink
properties and turn “Control Panel” to Yes. I wouldn’t mind a quick
bit of feedback there, either.

I’ll have some time on Wednesday to recompile and evaluate.

Tom

Regards,
Andy

diff --git a/gr-qtgui/lib/time_sink_f_impl.cc
b/gr-qtgui/lib/time_sink_f_impl.cc
index 8e45e2c…5aa126d 100644
— a/gr-qtgui/lib/time_sink_f_impl.cc
+++ b/gr-qtgui/lib/time_sink_f_impl.cc
@@ -68,9 +68,12 @@ namespace gr {
d_main_gui = NULL;

   for(int n = 0; n < d_nconnections; n++) {

d_buffers.push_back((double*)volk_malloc(d_buffer_size*sizeof(double),

  • d_fbuffers.push_back((float*)volk_malloc(d_buffer_size*sizeof(float),
    volk_get_alignment()));
  • memset(d_buffers[n], 0, d_buffer_size*sizeof(double));
  • memset(d_fbuffers[n], 0, d_buffer_size*sizeof(float));
  • d_buffers.push_back((double*)volk_malloc(d_size*sizeof(double),

volk_get_alignment()));

  • memset(d_buffers[n], 0, d_size*sizeof(double));
    }

    // Set alignment properties for VOLK
    

@@ -96,6 +99,7 @@ namespace gr {

   // d_main_gui is a qwidget destroyed with its parent
   for(int n = 0; n < d_nconnections; n++) {
  • volk_free(d_fbuffers[n]);
    volk_free(d_buffers[n]);
    }

@@ -329,10 +333,14 @@ namespace gr {

// Resize buffers and replace data
for(int n = 0; n < d_nconnections; n++) {

  • volk_free(d_fbuffers[n]);
  • d_fbuffers[n] = (float*)volk_malloc(d_buffer_size*sizeof(float),
  •                                          volk_get_alignment());
    
  • memset(d_fbuffers[n], 0, d_buffer_size*sizeof(float));
    volk_free(d_buffers[n]);
  • d_buffers[n] = (double*)volk_malloc(d_buffer_size*sizeof(double),
  • d_buffers[n] = (double*)volk_malloc(d_size*sizeof(double),
    volk_get_alignment());
  • memset(d_buffers[n], 0, d_buffer_size*sizeof(double));
  • memset(d_buffers[n], 0, d_size*sizeof(double));
    }

    // If delay was set beyond the new boundary, pull it back.
    

@@ -427,7 +435,7 @@ namespace gr {
int n;
if(d_trigger_delay) {
for(n = 0; n < d_nconnections; n++) {

  •      memmove(d_buffers[n], &d_buffers[n][d_size-d_trigger_delay], 
    

d_trigger_delay*sizeof(double));

  •      memmove(d_fbuffers[n], 
    

&d_fbuffers[n][d_size-d_trigger_delay], d_trigger_delay*sizeof(float));
}

     // Also move the offsets of any tags that occur in the tail

@@ -606,8 +614,7 @@ namespace gr {
// Copy data into the buffers.
for(n = 0; n < d_nconnections; n++) {
in = (const float*)input_items[idx];

  •    volk_32f_convert_64f(&d_buffers[n][d_index],
    
  •                         &in[1], nitems);
    
  •    memcpy(&d_fbuffers[n][d_index], &in[1], nitems*sizeof(float));
    
       uint64_t nr = nitems_read(idx);
       std::vector<gr::tag_t> tags;
    

@@ -622,13 +629,13 @@ namespace gr {

   // If we've have a trigger and a full d_size of items in the 

buffers, plot.
if((d_triggered) && (d_index == d_end)) {

  •    // Copy data to be plotted to start of buffers.
    
  •    for(n = 0; n < d_nconnections; n++) {
    
  •      memmove(d_buffers[n], &d_buffers[n][d_start], 
    

d_size*sizeof(double));

  •    }
    
  •    // Plot if we are able to update
       if(gr::high_res_timer_now() - d_last_time > d_update_time) {
    
  •      // Convert and copy data to be plotted to start of buffers.
    
  •      for(n = 0; n < d_nconnections; n++) {
    
  •        volk_32f_convert_64f(&d_buffers[n][0],
    
  •                             &d_fbuffers[n][d_start], d_size);
    
  •      }
         d_last_time = gr::high_res_timer_now();
         d_qApplication->postEvent(d_main_gui,
                                   new TimeUpdateEvent(d_buffers, 
    

d_size, d_tags));
diff --git a/gr-qtgui/lib/time_sink_f_impl.h
b/gr-qtgui/lib/time_sink_f_impl.h
index 2da1db9…d8e2261 100644
— a/gr-qtgui/lib/time_sink_f_impl.h
+++ b/gr-qtgui/lib/time_sink_f_impl.h
@@ -41,6 +41,7 @@ namespace gr {
int d_nconnections;

   int d_index, d_start, d_end;
  •  std::vector<float*> d_fbuffers;
     std::vector<double*> d_buffers;
     std::vector< std::vector<gr::tag_t> > d_tags;