Mistral OMAP3530 EVM: C code with NEON intrinsics is 54 times SLOWER than unoptimized code!
Compiler and linker options:
-mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon -Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm

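#include <arm_neon.h>	/* NEON vector types and intrinsics used by fir_NEON */

/* Reference FIR filter: plain scalar multiply-accumulate */
void fir_REF(short * y, const short *x, const short *h, int n_out, int n_coefs)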
{
	int n;
	for (n = 0; n < n_out; n++)
	{
		int k, sum = 0;
		for(k = 0; k < n_coefs; k++)
		{
			sum += h[k] * x[n - n_coefs + 1 + k];
		}
		y[n] = ((sum>>15) + 1) >> 1;
	}
}


void fir_NEON(short * y, const short *x, const short *h, int n_out, int n_coefs)
{
	int n, k;
	int sum;
	int16x4_t h_vec;
	int16x4_t x_vec;
	int32x4_t result_vec;

	for (n = 0; n < n_out; n++)
	{
		/* Clear the scalar and vector sums */
		sum = 0;
		result_vec = vdupq_n_s32(0);
		for(k = 0; k < n_coefs / 4; k++)
		{
			/* Four vector multiply-accumulate operations in parallel */
			h_vec = vld1_s16(&h[k*4]);
			x_vec = vld1_s16(&x[n - n_coefs + 1 + k*4]);
			result_vec = vmlal_s16(result_vec, h_vec, x_vec);
		}

		/* Reduction operation - add each vector lane result to the sum */
		sum += vgetq_lane_s32(result_vec, 0);
		sum += vgetq_lane_s32(result_vec, 1);
		sum += vgetq_lane_s32(result_vec, 2);
		sum += vgetq_lane_s32(result_vec, 3);

		/* Consume the remaining (n_coefs % 4) coefficients using scalar operations */
		if(n_coefs % 4)
		{
			for(k = n_coefs - (n_coefs % 4); k < n_coefs; k++)
			{
				sum += h[k] * x[n - n_coefs + 1 + k];
			}
		}
		/* Store the adjusted result */
		y[n] = ((sum>>15) + 1) >> 1;
	}
}
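
To rule out a correctness problem before looking at speed, the lane-wise accumulation used by fir_NEON can be checked against fir_REF on any machine. The following is only a portable sketch: it does not use NEON at all, but emulates the four 32-bit lanes of result_vec with a plain array. The names fir_ref, fir_lanes and fir_mismatches, and the test data, are made up for illustration. Note that both functions read x[n - n_coefs + 1 + k], so x must point n_coefs - 1 samples into the input buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Scalar reference, same arithmetic as fir_REF in the post. */
void fir_ref(short *y, const short *x, const short *h, int n_out, int n_coefs)
{
    for (int n = 0; n < n_out; n++) {
        int sum = 0;
        for (int k = 0; k < n_coefs; k++)
            sum += h[k] * x[n - n_coefs + 1 + k];
        y[n] = (short)(((sum >> 15) + 1) >> 1);
    }
}

/* Portable emulation of the fir_NEON loop structure (no NEON used):
   lane[0..3] stand in for the four 32-bit lanes of result_vec. */
void fir_lanes(short *y, const short *x, const short *h, int n_out, int n_coefs)
{
    for (int n = 0; n < n_out; n++) {
        int lane[4] = {0, 0, 0, 0};
        const short *xw = &x[n - n_coefs + 1];  /* start of the input window */
        int k;

        /* Four multiply-accumulates per step, one per lane */
        for (k = 0; k + 4 <= n_coefs; k += 4)
            for (int i = 0; i < 4; i++)
                lane[i] += h[k + i] * xw[k + i];

        /* Reduction: add the four lanes, then handle the scalar tail */
        int sum = lane[0] + lane[1] + lane[2] + lane[3];
        for (; k < n_coefs; k++)
            sum += h[k] * xw[k];

        y[n] = (short)(((sum >> 15) + 1) >> 1);
    }
}

/* Hypothetical helper: runs both versions on arbitrary test data and
   returns the number of differing outputs, or -1 if allocation fails. */
int fir_mismatches(int n_out, int n_coefs)
{
    assert(n_out > 0 && n_coefs > 0);
    int hist = n_coefs - 1;                     /* samples needed before x[0] */
    short *buf = malloc((size_t)(hist + n_out) * sizeof *buf);
    short *h   = malloc((size_t)n_coefs * sizeof *h);
    short *y1  = malloc((size_t)n_out * sizeof *y1);
    short *y2  = malloc((size_t)n_out * sizeof *y2);
    int bad = -1;

    if (buf && h && y1 && y2) {
        for (int i = 0; i < hist + n_out; i++)
            buf[i] = (short)((i * 37) % 201 - 100);   /* arbitrary input */
        for (int k = 0; k < n_coefs; k++)
            h[k] = (short)((k * 13) % 101 - 50);      /* arbitrary taps */

        fir_ref(y1, buf + hist, h, n_out, n_coefs);
        fir_lanes(y2, buf + hist, h, n_out, n_coefs);

        bad = 0;
        for (int n = 0; n < n_out; n++)
            bad += (y1[n] != y2[n]);
    }
    free(buf); free(h); free(y1); free(y2);
    return bad;
}
```

Calling fir_mismatches(100, 80) exercises the sizes used in this post; a value of n_coefs that is not a multiple of 4 also exercises the scalar tail loop.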

1. n_out is set to 100 and n_coefs is set to 80.
2. Properties -> QNX C/C++ Project -> Compiler -> Other options is set to '-mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -mfpu=neon -Wa,-mfpu=neon -mfloat-abi=softfp'.
3. Properties -> QNX C/C++ Project -> Options -> Build for profiling (Function instrumentation) is enabled.
4. Run as -> Run configurations -> Tools -> Functions Instrumentation & Single application (default options), etc.
5. When the above application is run, the vector_add_of_n function takes 53.568 ms and the add_of_n function takes 0.997 ms, i.e. the unoptimized code is about 53.7 times faster than the optimized code!
6. The application is compiled for and run on the Mistral OMAP3530 EVM (using the BSP provided by QNX).
7. What am I doing wrong in the above procedure, and why is performance degrading instead of improving?
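
One way to cross-check the profiler figures in step 5 would be to time each function directly over many calls. The following is a minimal sketch, assuming the standard C clock() function is adequate on the target; the helper name time_fir_ms, the repetition count and the test data are hypothetical and not part of the original project. Whether such a measurement would agree with the function-instrumentation numbers is not known here.

```c
#include <assert.h>
#include <time.h>

typedef void (*fir_fn)(short *, const short *, const short *, int, int);

/* Same reference filter as in the post. */
void fir_REF(short *y, const short *x, const short *h, int n_out, int n_coefs)
{
    int n;
    for (n = 0; n < n_out; n++) {
        int k, sum = 0;
        for (k = 0; k < n_coefs; k++)
            sum += h[k] * x[n - n_coefs + 1 + k];
        y[n] = (short)(((sum >> 15) + 1) >> 1);
    }
}

/* Hypothetical helper: average CPU time in milliseconds per call of fn,
   measured over reps calls with n_out = 100 and n_coefs = 80. */
double time_fir_ms(fir_fn fn, int reps)
{
    enum { N_OUT = 100, N_COEFS = 80 };
    static short buf[N_COEFS - 1 + N_OUT];  /* history + input samples */
    static short h[N_COEFS];
    static short y[N_OUT];
    volatile short sink = 0;                /* keeps the results "used" */
    clock_t t0, t1;
    int i, r;

    assert(reps > 0);
    for (i = 0; i < N_COEFS - 1 + N_OUT; i++)
        buf[i] = (short)((i * 37) % 201 - 100);   /* arbitrary input */
    for (i = 0; i < N_COEFS; i++)
        h[i] = (short)((i * 13) % 101 - 50);      /* arbitrary taps */

    t0 = clock();
    for (r = 0; r < reps; r++) {
        /* x points N_COEFS - 1 samples into buf so x[n - n_coefs + 1] is valid */
        fn(y, buf + (N_COEFS - 1), h, N_OUT, N_COEFS);
        sink ^= y[r % N_OUT];
    }
    t1 = clock();
    (void)sink;

    return 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}
```

For example, time_fir_ms(fir_REF, 1000) and the same call with fir_NEON on the target would give two per-call averages that can be compared with the 0.997 ms and 53.568 ms reported by the profiler.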

Momentics :
 Version: 4.6.0
 Build id: I20090510 
 GCC version 4.3.3

Please let me know the cause of this slowdown and how to fix it.