Forum Topic - mistral OMAP3530 EVM:C Code with neon intrinsics is 54 times SLOWER than unoptimized code !!!!!!!!: (1 Item)

Girisha SG

07/16/2010 7:15 AM

post59629

mistral OMAP3530 EVM:C Code with neon intrinsics is 54 times SLOWER than unoptimized code !!!!!!!!

-mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon -Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm

void fir_REF(short * y, const short *x, const short *h, int n_out, int n_coefs)
{
	int n;
	for (n = 0; n < n_out; n++)
	{
		int k, sum = 0;
		for(k = 0; k < n_coefs; k++)
		{
			sum += h[k] * x[n - n_coefs + 1 + k];
		}
		y[n] = ((sum>>15) + 1) >> 1;
	}
}


#include <arm_neon.h>	/* NEON intrinsic types and functions used below */

void fir_NEON(short * y, const short *x, const short *h, int n_out, int n_coefs)
{
	int n, k;
	int sum;
	int16x4_t h_vec;
	int16x4_t x_vec;
	int32x4_t result_vec;

	for (n = 0; n < n_out; n++)
	{
		/* Clear the scalar and vector sums */
		sum = 0;
		result_vec = vdupq_n_s32(0);
		for(k = 0; k < n_coefs / 4; k++)
		{
			/* Four vector multiply-accumulate operations in parallel */
			h_vec = vld1_s16(&h[k*4]);
			x_vec = vld1_s16(&x[n - n_coefs + 1 + k*4]);
			result_vec = vmlal_s16(result_vec, h_vec, x_vec);
		}

		/* Reduction operation - add each vector lane result to the sum */
		sum += vgetq_lane_s32(result_vec, 0);
		sum += vgetq_lane_s32(result_vec, 1);
		sum += vgetq_lane_s32(result_vec, 2);
		sum += vgetq_lane_s32(result_vec, 3);

		/* consume the last few data using scalar operations */
		if(n_coefs % 4)
		{
			for(k = n_coefs - (n_coefs % 4); k < n_coefs; k++)
				sum += h[k] * x[n - n_coefs + 1 + k];
		}
		/* Store the adjusted result */
		y[n] = ((sum>>15) + 1) >> 1;
	}
}
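
Both routines read x[n - n_coefs + 1 + k], so for n = 0 they reach n_coefs - 1 samples before x. A minimal sketch of a caller, assuming the input sits in one buffer with that much history in front of it. The names buf and fir_impulse_demo, the impulse value and the coefficient values are illustrative and not part of the original project:

```c
#include <assert.h>
#include <string.h>

#define N_OUT   100   /* n_out used in the post */
#define N_COEFS 80    /* n_coefs used in the post */

/* Reference FIR, as in the post. */
void fir_REF(short *y, const short *x, const short *h, int n_out, int n_coefs)
{
	int n;
	for (n = 0; n < n_out; n++) {
		int k, sum = 0;
		for (k = 0; k < n_coefs; k++)
			sum += h[k] * x[n - n_coefs + 1 + k];
		y[n] = ((sum >> 15) + 1) >> 1;
	}
}

/* Hypothetical helper: filters a single impulse of value 2^14 placed at x[0]. */
void fir_impulse_demo(short *y, int n_out)
{
	static short buf[N_COEFS - 1 + N_OUT];  /* history, then the new samples */
	static short h[N_COEFS];
	const short *x = buf + (N_COEFS - 1);   /* x[-(N_COEFS - 1)] is buf[0] */
	int k;

	assert(n_out <= N_OUT);
	memset(buf, 0, sizeof buf);
	buf[N_COEFS - 1] = 1 << 14;             /* this element is x[0] */
	for (k = 0; k < N_COEFS; k++)
		h[k] = (short)(4 * (k + 1));        /* illustrative coefficients */

	fir_REF(y, x, h, n_out, N_COEFS);
}
```

With these inputs, y[n] comes out as n_coefs - n for n < n_coefs and 0 afterwards. Running fir_NEON on the same buffers and comparing the outputs is one way to confirm the two routines agree before timing them.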

1. n_out is set to 100 and n_coefs is set to 80.
2. Properties -> QNX C/C++ Project -> Compiler -> Other options is set to '-mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -mfpu=neon -Wa,-mfpu=neon -mfloat-abi=softfp'.
3. Properties -> QNX C/C++ Project -> Options -> Build for profiling (Function instrumentation) is enabled.
4. Run as -> Run configurations -> Tools -> Function Instrumentation & Single application (default options), etc. is selected.

5. When the above application is run, the vector_add_of_n function takes 53.568 ms and the add_of_n function takes 0.997 ms, i.e. the unoptimized code is about 53.7 times faster than the optimized code!
6. I am compiling and running this on the Mistral OMAP3530 EVM (using the BSP provided by QNX).
7. What am I doing wrong in the above procedure, and why is performance degrading instead of improving?
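
The figures in step 5 come from function instrumentation, which adds work at every instrumented function entry and exit. As a cross-check, here is a minimal sketch that times a FIR routine directly with POSIX clock_gettime, assuming CLOCK_MONOTONIC is available on the target. The names fir_fn and time_fir_ms, the test data and the repetition count are illustrative and not part of the original project:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

#define N_OUT   100   /* n_out used in the post */
#define N_COEFS 80    /* n_coefs used in the post */

/* Common signature of fir_REF and fir_NEON. */
typedef void (*fir_fn)(short *y, const short *x, const short *h,
                       int n_out, int n_coefs);

/* Reference FIR, as in the post. */
void fir_REF(short *y, const short *x, const short *h, int n_out, int n_coefs)
{
	int n;
	for (n = 0; n < n_out; n++) {
		int k, sum = 0;
		for (k = 0; k < n_coefs; k++)
			sum += h[k] * x[n - n_coefs + 1 + k];
		y[n] = ((sum >> 15) + 1) >> 1;
	}
}

/* Hypothetical helper: average wall-clock time of one call to fn, in ms. */
double time_fir_ms(fir_fn fn, int reps)
{
	static short buf[N_COEFS - 1 + N_OUT];  /* history, then the new samples */
	static short h[N_COEFS];
	static short y[N_OUT];
	struct timespec t0, t1;
	int i;

	assert(reps > 0);
	for (i = 0; i < N_COEFS - 1 + N_OUT; i++)
		buf[i] = (short)(i % 64);           /* arbitrary small test data */
	for (i = 0; i < N_COEFS; i++)
		h[i] = (short)(i + 1);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < reps; i++)
		fn(y, buf + (N_COEFS - 1), h, N_OUT, N_COEFS);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return ((t1.tv_sec - t0.tv_sec) * 1e3 +
	        (t1.tv_nsec - t0.tv_nsec) / 1e6) / reps;
}
```

Calling time_fir_ms(fir_REF, 1000) and time_fir_ms(fir_NEON, 1000) in a build without function instrumentation would give figures to compare against step 5. Neither option list above includes an optimization level, so repeating the measurement with one (for example -O2) may also be worth trying.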

Momentics :
 Version: 4.6.0
 Build id: I20090510 
 GCC version 4.3.3

Please let me know the reason for this and how to fix it.
