Help - Custom block generating data @ 1:5115 input to output ratio causes flowgraph to hang -- how t

Hi,
I am posting this question again with a better explanation, as I have not gotten any help yet.

I have a custom C++ block that I use in a modified dbpsk.py modulation
scheme. The block spreads each input data bit with a 1023-chip PN sequence.

The flowgraph connect call looks like this:
self.connect(self,self.bytes2chunks,self.symbol_mapper,self.diffenc,self.CUSTOM_BLOCK,self.chunks2symbols,self.rrc_filter,self)

The CUSTOM_BLOCK outputs 5115 bytes for every input byte read; therefore, in
the flowgraph the input rate at self.chunks2symbols is 5115 times that of
the input at self.CUSTOM_BLOCK. This slows the flowgraph down so much that
I have to force-kill it. I am using benchmark_tx.py to pass data to the
flowgraph.

I implemented the custom block in two different ways, once by inheriting
from gr_block and once from gr_sync_interpolator, but the result is the
same in both cases. What should I do to make it work smoothly?

Thanks

P.S - The work function is shown below when using gr_sync_interpolator.

int
dsss_sync_spread_b::work(int noutput_items, gr_vector_const_void_star
&input_items, gr_vector_void_star &output_items)
{
  const unsigned char *in = (const unsigned char *) input_items[0];
  unsigned char *out = (unsigned char *) output_items[0];
  // interpolation() returns d_length_PN * d_n_pn, which is 1023 * 5
  int data_items = noutput_items / interpolation();
  int nout = 0;
  for (int i = 0; i < data_items; i++) {
    if (in[i] & 0x01) {
      for (int j = 0; j < interpolation(); j++) {
        // d_pn_array1 has datatype 'char' and is of size 1023.
        // d_length_PN = 1023, initialised in the constructor and never changed.
        out[nout] = d_pn_array1[j % d_length_PN];
        nout++;
      }
    }
    else {
      for (int j = 0; j < d_length_PN * d_n_pn; j++) {
        // d_pn_array0 has datatype 'char' and is of size 1023.
        // d_length_PN = 1023, initialised in the constructor and never changed.
        out[nout] = d_pn_array0[j % d_length_PN];
        nout++;
      }
    }
  }
  return noutput_items;
}
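
For context, the gr_sync_interpolator version is constructed roughly like
this (a trimmed sketch; the code that actually fills in the PN tables is
omitted):

dsss_sync_spread_b::dsss_sync_spread_b()
  : gr_sync_interpolator("dsss_sync_spread_b",
                         gr_make_io_signature(1, 1, sizeof(unsigned char)),
                         gr_make_io_signature(1, 1, sizeof(unsigned char)),
                         1023 * 5)   // interpolation factor = d_length_PN * d_n_pn = 5115
{
  d_length_PN = 1023;
  d_n_pn = 5;
  // d_pn_array0 and d_pn_array1 are filled in here (not shown)
}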

The general_work function when using gr_block is shown below:

int
dsss_spreading_b::general_work(int noutput_items, gr_vector_int
&ninput_items, gr_vector_const_void_star &input_items,
gr_vector_void_star &output_items)
{
  const unsigned char *in = (const unsigned char *) input_items[0];
  unsigned char *out = (unsigned char *) output_items[0];
  // d_length_PN = 1023, d_n_pn = 5
  int data_items = noutput_items / (d_length_PN * d_n_pn);
  int nout = 0;
  for (int i = 0; i < data_items; i++) {
    if (in[i] & 0x01) {
      for (int j = 0; j < d_length_PN * d_n_pn; j++) {
        out[nout] = d_pn_array1[j % d_length_PN];
        nout++;
      }
    }
    else {
      for (int j = 0; j < d_length_PN * d_n_pn; j++) {
        out[nout] = d_pn_array0[j % d_length_PN];
        nout++;
      }
    }
  }

  consume(0, data_items);
  return noutput_items;
}

On Wed, Nov 17, 2010 at 9:14 AM, John A. [email protected] wrote:

The CUSTOM_BLOCK outputs 5115 bytes for every input byte read therefore, in

P.S - The work function is shown below when using gr_sync_interpolator.

John,
There’s nothing obvious that I would think would kill your
application, but there are definitely some modifications that I think
could help. See below.

Are you familiar with using oprofile or valgrind --tool=cachegrind? They
can help you isolate areas of particular trouble.

Because you know that you’re creating N items out for every 1 item in,
use the sync_interpolator.

int
dsss_sync_spread_b::work(int noutput_items,gr_vector_const_void_star
&input_items,gr_vector_void_star &output_items)
{
const unsigned char *in = (const unsigned char *)input_items[0];
unsigned char *out = (unsigned char *)output_items[0];
int data_items=noutput_items/interpolation(); // interpolation() returns
d_length_PN * d_n_pn, which is equal to 1023 * 5
int nout=0;
for(int i=0;i<data_items;i++){

Using ‘if’ statements inside the loop here isn’t the best thing to do.
Branches can cause problems, especially if the branch path is
mis-predicted. The branch predictors in modern Intel processors usually
do a good job, though. Still, if you can figure out a way not to branch
on every symbol, that’d be better.

if(in[i]&0x01){

for(int j=0;j<interpolation();j++){
out[nout]=d_pn_array1[j%d_length_PN]; // the array d_pn_array1
has datatype ‘char’ and is of size 1023. d_length_PN = 1023 and is
initialised in the constructor and is never changed
nout++;
}

Use a memcpy here instead of the for loop. Same for below.

}
else{
for(int j=0;j<d_length_PN*d_n_pn;j++){

Why do you use “d_length_PN*d_n_pn” here but “interpolation()” above?
From the comments, these sound like the same value.

out[nout]=d_pn_array0[j%d_length_PN]; // the array d_pn_array0
has datatype ‘char’ and is of size 1023. d_length_PN = 1023 and is
initialised in the constructor and is never changed
nout++;
}
}
}
return noutput_items;
}
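
To make the memcpy suggestion concrete, here is a rough, untested sketch
of the two branches, copying the whole 1023-byte PN table d_n_pn times
instead of indexing with the modulo (this assumes d_pn_array1 and
d_pn_array0 each hold exactly d_length_PN bytes, and needs #include <cstring>):

      if (in[i] & 0x01) {
        // copy the full PN table d_n_pn times; no per-chip modulo
        for (int k = 0; k < d_n_pn; k++) {
          memcpy(&out[nout], d_pn_array1, d_length_PN);
          nout += d_length_PN;
        }
      }
      else {
        for (int k = 0; k < d_n_pn; k++) {
          memcpy(&out[nout], d_pn_array0, d_length_PN);
          nout += d_length_PN;
        }
      }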

Tom

On Wed, Nov 17, 2010 at 09:14:47AM -0800, John A. wrote:

The CUSTOM_BLOCK outputs 5115 bytes for every input byte read; therefore, in
the flowgraph the input rate at self.chunks2symbols is 5115 times that of
the input at self.CUSTOM_BLOCK. This slows the flowgraph down so much that
I have to force-kill it. I am using benchmark_tx.py to pass data to the
flowgraph.

Stating the obvious, you have just increased the workload by a factor
of 5115. You seem surprised that it’s taking 5000 times longer to
run…

for(int j=0;j<interpolation();j++){
out[nout]=d_pn_array1[j%d_length_PN]; // initialised in the constructor and is never changed
nout++;
}

The modulo operator in the inner loop isn’t helping matters. div and
mod are not free. Q: How many cycles does an integer divide take on
the Core 2 microarchitecture?

Have you used oprofile or some other tool to see where you’re actually
spending your cycles?

With a bit of restructuring, you could turn the inner loop into a
memcpy. Left as an exercise…

However, I strongly recommend using oprofile or some other tool to see
where you’re spending your cycles before you change anything.
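
For reference, one possible restructuring of the kind suggested above (an
untested sketch, not drop-in code; d_spread0 and d_spread1 are hypothetical
std::vector<unsigned char> members, not names from the original block, and
this needs #include <cstring> and <vector>): build the two full 5115-byte
spreading sequences once in the constructor, so work() does a single memcpy
per input byte and the per-chip modulo disappears entirely.

// in the constructor, after the PN tables are filled in:
d_spread1.resize(d_length_PN * d_n_pn);
d_spread0.resize(d_length_PN * d_n_pn);
for (int k = 0; k < d_n_pn; k++) {
  memcpy(&d_spread1[k * d_length_PN], d_pn_array1, d_length_PN);
  memcpy(&d_spread0[k * d_length_PN], d_pn_array0, d_length_PN);
}

// in work():
int data_items = noutput_items / interpolation();
for (int i = 0; i < data_items; i++) {
  // pick the precomputed 5115-byte sequence for a 1 or 0 bit
  const unsigned char *seq = (in[i] & 0x01) ? &d_spread1[0] : &d_spread0[0];
  memcpy(out + i * interpolation(), seq, interpolation());
}
return noutput_items;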