Project Home
Project Home
Trackers
Trackers
Documents
Documents
Wiki
Wiki
Discussion Forums
Discussions
Project Information
Project Info
Forum Topic - Compiler Error due to using Neon Pipeline on OMAP3530: (13 Items)
   
Compiler Error due to using Neon Pipeline on OMAP3530  
Hello,

I already posted the topic in the IDE Forum, I was not sure, so I also post the problem
here.
 
I'm trying to make use of the NEON-pipeline for fast floating point arithmetics on an Cortex-A8. I also have problems 
using the not fully pipelined VFP. 

IDE:     QNX® Momentics® Integrated Development Environment  Version: 4.6.0

The Board is a Gumstix Overo, with an OMAP3530, 256MB PoP Memory (Micron). I've
modified the Mistral OM3530 EVM BSP, it working perfectly so far. We're debugging
via WLAN using a Ralink USB Wifi Module. 

GCC compiler (v4.3.3) is set up with following tags:
 -mtune=cortex-a8 -march=armv7-a -mfpu=neon -ftree-vectorize  -mfloat-abi=softfp 
and I'm also linking to libm-vfp.so (instead of libm.so)

During compiling i get the following errors:
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s: Assembler messages:
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s:130: Error: selected processor does not support `flds s14,[fp,#-

28]'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s:131: Error: selected processor does not support `flds s15,[fp,#-

24]'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s:132: Error: selected processor does not support `fadds s14,s14,
s15'

and so on...

That the simple test code:

//begin code
//This Code compiles fine with the option -mfpu=vfp but not with neon

#include <cstdlib>
#include <iostream>
#include <math.h>

int main(int argc, char *argv[]) {
	float a,b,c;

	a = 8.0f;
	b = 1.4f;

	for(float g=0.0f; g < 50; g+= 1.0f)
	{
		c = a+b+g;
		c += a*b;

		std::printf("g:%f\n",g);
		std::printf("Ergebnis:%f\n",c);
	}

	return EXIT_SUCCESS;
}
//end code

In an ohter project using the option -mfpu=vfp i get following errors:
(The code that yields following errors doesn't compile with either neon or vfp, it works fine with software float 
emulation (as it is set by default))

C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s: Assembler messages:
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:117: Error: D register out of range for selected VFP version -- `

fldd d16,[r1,#0]'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:118: Error: register out of range in list -- `fldmiad ip!,{d17}'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:119: Error: bad instruction `vadd.f32 d16,d16,d17'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:120: Error: register out of range in list -- `fstmiad r1!,{d16}'

etc...

I used this wiki page as guide http://wiki.davincidsp.com/index.php/Cortex-A8_Features


What am I doing wrong?
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
Florian Berger wrote:
> Hello,
> 
> I already posted the topic in the IDE Forum, I was not sure, so I also post the problem
> here.
>  
> I'm trying to make use of the NEON-pipeline for fast floating point arithmetics on an Cortex-A8. I also have problems 
using the not fully pipelined VFP. 
> 
> IDE:     QNX® Momentics® Integrated Development Environment  Version: 4.6.0
> 
> The Board is a Gumstix Overo, with an OMAP3530, 256MB PoP Memory (Micron). I've
> modified the Mistral OM3530 EVM BSP, it working perfectly so far. We're debugging
> via WLAN using a Ralink USB Wifi Module. 
> 
> GCC compiler (v4.3.3) is set up with following tags:
>  -mtune=cortex-a8 -march=armv7-a -mfpu=neon -ftree-vectorize  -mfloat-abi=softfp 
> and I'm also linking to libm-vfp.so (instead of libm.so)

Add -Wa,-mfpu=neon to your compiler options.

Regards,

Ryan Mansfield
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
Hello Ryan,

thank you very much! It's now compiling. But I've discovered a new missbehaviour:

When using arm_neon.h for neon instrinsics, there is following error

C:\QNX641\host\win32\x86\usr\lib\gcc\arm-unknown-nto-qnx6.4.0\4.3.3\include\arm_neon.h:35:2: error: #error You must 
enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use arm_neon.h

Compiler Options:
-march=armv7-a -mtune=cortex-a8  -Wa,-mfpu=neon  -ftree-vectorize -mfloat-abi=softfp

What does -Wa stand for?

Kind regards,

Florian
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
It's also compiling with instrinsics when using:

-march=armv7-a -mtune=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ftree-vectorize  -Wa,-mfpu=neon

Thank you, I hope it is really using NEON, I'll benchmark it later.

Kind regards,

Florian
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
Hi Florian,

what was the final outcome of your benchmarks?
I'm persuing the same issue right now using the same hardware...
Here is what i've got when using the well-known "whetstone" floating-point benchmark (C sources found here: http://www.
netlib.org/benchmark/whetstone.c) and testing for both, Linux and QNX:

qcc compiler flags used: -O3 -mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon -Wa,-mfpu=
neon -mfloat-abi=softfp

Results after two minutes testing:
Linux:  266 MIPS
QNX:  130 MIPS

When i'm using even the "dangerous" compiler flags: -ffast-math -fomit-frame-pointer -funroll-loops

Results after two minutes testing:
Linux: 1700 MIPS
QNX:    134 MIPS

I also figured, that not matter whether you tell the compiler to use NEON (-mfpu=neon) or the VFP (-mfpu=VFP) you get 
the same results when you tell the compiler to try to vectorize the given code.  Now the big question is: why the hell 
the Linux (CodeSourcery2007q3-51, gcc-4.2.1) OS has twice the performance of the QNX system? According to this page (
http://community.qnx.com/sf/wiki/do/viewPage/projects.core_os/wiki/ARMv7_support) NEON should be supported in QNX 6.4.1.


Can you please share your benchmark results? I'm quite curious...

Cheers
Phil
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
On 10-05-20 11:58 AM, Philipp Lutz wrote:
> Hi Florian,
>
> what was the final outcome of your benchmarks?
> I'm persuing the same issue right now using the same hardware...
> Here is what i've got when using the well-known "whetstone" floating-point benchmark (C sources found here: http://www
.netlib.org/benchmark/whetstone.c) and testing for both, Linux and QNX:
>
> qcc compiler flags used: -O3 -mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon -Wa,-mfpu=
neon -mfloat-abi=softfp
>
> Results after two minutes testing:
> Linux:  266 MIPS
> QNX:  130 MIPS

Hi Phil,

Are you using the soft-float libm.so or the libm-vfp.so? If you're using 
the soft-float libm.so, can you replace the libm.so.2 on your target 
with the libm-vfp.so and let me know if you see any performance improvement.

> When i'm using even the "dangerous" compiler flags: -ffast-math -fomit-frame-pointer -funroll-loops
>
> Results after two minutes testing:
> Linux: 1700 MIPS
> QNX:    134 MIPS
>

In the fast math case it looks math.h is redefining sin, cos and a few 
others so the compiler builtins are not being used. Can you extract the 
attached tarball into $QNX_TARGET and recompile/rerun the benchmark?

Regards,

Ryan Mansfield

Attachment: Text fastmath.tar.gz 8.08 KB
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
Hi Ryan,

yes, I've tried to follow this guide: http://www.qnx.com/developers/docs/6.4.1/neutrino/technotes/vector_floating_point.
html#id5 in order to replace the libm.so by the libm-vfp.so. however ALL my programs I'm liking against libm-vfp are 
finally liked against libm (after checking with ldd).
What did I wrong? Is there an existing issue?

Wow, these header files gave me an increase to around 5200 MIPS, but thats rather close to ridiculous than to realistic.
 It seems that these header files activate the fastmath ability of QNX in the first place. What are these compiler 
builtin functions? Is this only a pointer to the compiler to implement the math function in a predefined way which is 
optimized by execution speed?

Another point: Can you explain why I get twice the speed (without fast-math) under linux compared to QNX?

Thank you!

Best Regards
Phil



> Hi Phil,
> 
> Are you using the soft-float libm.so or the libm-vfp.so? If you're using 
> the soft-float libm.so, can you replace the libm.so.2 on your target 
> with the libm-vfp.so and let me know if you see any performance improvement.
> 
> > When i'm using even the "dangerous" compiler flags: -ffast-math -fomit-frame
> -pointer -funroll-loops
> >
> > Results after two minutes testing:
> > Linux: 1700 MIPS
> > QNX:    134 MIPS
> >
> 
> In the fast math case it looks math.h is redefining sin, cos and a few 
> others so the compiler builtins are not being used. Can you extract the 
> attached tarball into $QNX_TARGET and recompile/rerun the benchmark?
> 
> Regards,
> 
> Ryan Mansfield
> 


Re: Compiler Error due to using Neon Pipeline on OMAP3530  
On 10-05-21 06:29 AM, Philipp Lutz wrote:
> Hi Ryan,
>
> yes, I've tried to follow this guide: http://www.qnx.com/developers/docs/6.4.1/neutrino/technotes/
vector_floating_point.html#id5 in order to replace the libm.so by the libm-vfp.so. however ALL my programs I'm liking 
against libm-vfp are finally liked against libm (after checking with ldd).
> What did I wrong? Is there an existing issue?

> Wow, these header files gave me an increase to around 5200 MIPS, but thats rather close to ridiculous than to 
realistic. It seems that these header files activate the fastmath ability of QNX in the first place. What are these 
compiler builtin functions? Is this only a pointer to the compiler to implement the math function in a predefined way 
which is optimized by execution speed?

Yes, in my patch when you compile with -ffast-math there is a compiler 
define __FAST_MATH__ set which redefines the calls to sin, cos, etc with 
explicit calls to __builtin_sin, __builtin_cos, etc. The builtins are 
optimized routines provided by GCC. For more information, see 
http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/Other-Builtins.html. Often 
these compiler builtins can replace a expensive function call with only 
a handful of inline instructions.  Under Linux when you compile with -O1 
or greater the gcc will automatically replace the call to sin/cos/atan 
with their corresponding builtin. When compiling with >=O1 will use 
builtins for cos, sin and atan and by adding fast-math gcc will use 
builtins for log, exp, and sqrt. You'll notice that when you specify -O3 
-ffast-math you no longer have to link against libm. Under QNX our libm 
headers redefine sin, cos to be _Sin which prevents gcc from 
automatically using the compiler builtins. The headers I gave you do the 
redefinition so the optimized builtins are used.

> Another point: Can you explain why I get twice the speed (without fast-math) under linux compared to QNX?

The difference can be mainly attributed to the use of the builtin 
functions. For example, under Linux whetstone compiled with  -O3 
-mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize 
-mfpu=neon -Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm yields 
approximately the same performance under QNX.  The previous patch I 
attached had all the builtins enabled by __FAST_MATH__. I've tweaked the 
patch to use builtins for sin and cos when O1 or greater is used, not 
just -ffast-math is specified.  I also attached a version of whetsone 
for Neutrino that uses Clockcycles for better precision.

Here are my results:

Linux (mainline gcc 4.3.2):

arm-unknown-linux-gnueabi-gcc ~/whetstone.c  -O3 -mtune=cortex-a8 
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon 
-Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm

Loops: 10000, Iterations: 1, Duration: 8 sec. 

C Converted Double Precision Whetstones: 125.0 MIPS

arm-unknown-linux-gnueabi-gcc ~/whetstone.c  -O3 -mtune=cortex-a8 
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon 
-Wa,-mfpu=neon -mfloat-abi=softfp  -lm

Loops: 10000, Iterations: 1, Duration: 4 sec. 

C Converted Double Precision Whetstones: 250.0 MIPS

arm-unknown-linux-gnueabi-gcc ~/whetstone.c  -O3 -mtune=cortex-a8 
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon 
-Wa,-mfpu=neon -mfloat-abi=softfp  -ffast-math -funroll-loops 
-fomit-frame-pointer

Loops: 100000, Iterations: 1, Duration: 4 sec. 

C Converted Double Precision Whetstones: 2500.0 MIPS

QNX (with attached header changes)

qcc -V4.3.3,gcc_ntoarmle whetstone-qnx.c -O3 -mtune=cortex-a8 
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon 
-Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm

Loops: 1000, Iterations: 1, Duration: 0.767395 sec. 

C Converted Double Precision Whetstones: 130.3 MIPS

qcc -V4.3.3,gcc_ntoarmle whetstone-qnx.c -O3 -mtune=cortex-a8 
-march=armv7-a -Wa,-march=armv7-a...
View Full Message
Attachment: Text files.tar.gz 11.27 KB
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
Hi Ryan,

thanks for your effort! I attached my results with comparable compiler-flags. They seem to be quite similar although I 
think running more cycles (in my case 100 000) yields more accurate results regarding the processor pipelining for let's
 say a continuous FP algorithms.

However there is only one thing which I can't explain: the linking of libm-vfp.so
I can do whatever I want, the compiler will not link the libm-vfp.so, it's always using libm.so.
Later on the board I have to force the programs take the libm-vfp by setting a symbolic link from libm-vfp.so.2 to libm.
so.2, but that's only a dirty workaround.

By the way: good idea to use the clockcycle counter for the benchmark ;)

Have you ever had numerical problems using the -ffast-math compiler flag?

Best Regards
Phil
Attachment: PDF whetstone_results.pdf 86.97 KB
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
On 10-06-02 09:49 AM, Philipp Lutz wrote:
> Hi Ryan,
>
> thanks for your effort! I attached my results with comparable compiler-flags. They seem to be quite similar although I
 think running more cycles (in my case 100 000) yields more accurate results regarding the processor pipelining for 
let's say a continuous FP algorithms.

Were the numbers for QNX run with or without my second patch (i.e. some 
of the gcc builtins triggered by __OPTIMIZE__ and not just __FAST_MATH__)?

> However there is only one thing which I can't explain: the linking of libm-vfp.so
> I can do whatever I want, the compiler will not link the libm-vfp.so, it's always using libm.so.
> Later on the board I have to force the programs take the libm-vfp by setting a symbolic link from libm-vfp.so.2 to 
libm.so.2, but that's only a dirty workaround.

If your target(s) always have VFP hardware my suggestion would be just 
cp $QNX_TARGET/armle/lib/libm-vfp.so.2 over top of 
$QNX_TARGET/armle/lib/libm.so.2 on your host, link everything against 
libm.so.2 and rebuild your image.

If you're not using the gcc builtins (i.e. libm) you can also get a 
performance improvement by adding [phys_align=64k] in front of libm.so.2 
in your build file. With this attribute, libm.so.2 will get aligned on a 
64k boundary and then the memmgr can use 64k pages to do contiguous 64k 
mappings of the file which improves TLB usage.

> By the way: good idea to use the clockcycle counter for the benchmark ;)
>
> Have you ever had numerical problems using the -ffast-math compiler flag?

I've been fortunate enough not to have any issues with it personally. 
But I investigated a customer issue a years ago where it -ffast-math was 
causing unexpected behaviour.

Regards,

Ryan Mansfield
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
Hi Ryan

> Were the numbers for QNX run with or without my second patch (i.e. some 
> of the gcc builtins triggered by __OPTIMIZE__ and not just __FAST_MATH__)?

i only used your first patch because I'm not sure if I want to have fast-math as soon as I'm activating some compiler 
optimization flags.

> If your target(s) always have VFP hardware my suggestion would be just 
> cp $QNX_TARGET/armle/lib/libm-vfp.so.2 over top of 
> $QNX_TARGET/armle/lib/libm.so.2 on your host, link everything against 
> libm.so.2 and rebuild your image.

> If you're not using the gcc builtins (i.e. libm) you can also get a 
> performance improvement by adding [phys_align=64k] in front of libm.so.2 
> in your build file. With this attribute, libm.so.2 will get aligned on a 
> 64k boundary and then the memmgr can use 64k pages to do contiguous 64k 
> mappings of the file which improves TLB usage.

Interesting idea, however it only provided around 1-2 WMIPS more.

> > Have you ever had numerical problems using the -ffast-math compiler flag?
> 
> I've been fortunate enough not to have any issues with it personally. 
> But I investigated a customer issue a years ago where it -ffast-math was 
> causing unexpected behaviour.

I'll have a look at this as well in order to find out if it causes problems or even unexpected behaviour for our 
applications.

Cheers
Phil
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
> If your target(s) always have VFP hardware my suggestion would be just 
> cp $QNX_TARGET/armle/lib/libm-vfp.so.2 over top of 
> $QNX_TARGET/armle/lib/libm.so.2 on your host, link everything against 
> libm.so.2 and rebuild your image.

Today I figured that this is not a good idea as the program "random" gives me the following error when i replaced libm 
by libm-vfp:
"""
unknown symbol: __fixsfisi
ldd:FATAL: Could not resolve all symbols
"""

Isn't there another way to link against libm-vfp.so.2 during compile time? Are you able to link with "-lm-vfp" and 
finally see the link to libm-vfp.so.2 when using ldd? Is this a (known) compiler issue?

Thanks for helping!

Regards
Phil
Re: Compiler Error due to using Neon Pipeline on OMAP3530  
Florian Berger wrote:
> Hello Ryan,
> 
> thank you very much! It's now compiling. But I've discovered a new missbehaviour:
> 
> When using arm_neon.h for neon instrinsics, there is following error
> 
> C:\QNX641\host\win32\x86\usr\lib\gcc\arm-unknown-nto-qnx6.4.0\4.3.3\include\arm_neon.h:35:2: error: #error You must 
enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use arm_neon.h
> 
> Compiler Options:
> -march=armv7-a -mtune=cortex-a8  -Wa,-mfpu=neon  -ftree-vectorize -mfloat-abi=softfp
> 
> What does -Wa stand for?

-Wa, passes the option after the comma to the assemlber. In this case, 
-mfpu=neon the assembler needs to be specified.

The second problem is that you need to specify -mfloat-abi=softfp to 
tell gcc to generate hard floating point instructions but to still 
follow the soft float ABI.

Regards,

Ryan Mansfield