foundry27 : Post

Forum Topic - Compiler Error due to using Neon Pipeline on OMAP3530: (13 Items)

View: as

Florian Berger

07/12/2009 11:27 AM

post33603

Compiler Error due to using Neon Pipeline on OMAP3530

Hello,

I already posted the topic in the IDE Forum, I was not sure, so I also post the problem
here.
 
I'm trying to make use of the NEON-pipeline for fast floating point arithmetics on an Cortex-A8. I also have problems 
using the not fully pipelined VFP. 

IDE:     QNX® Momentics® Integrated Development Environment  Version: 4.6.0

The Board is a Gumstix Overo, with an OMAP3530, 256MB PoP Memory (Micron). I've
modified the Mistral OM3530 EVM BSP, it working perfectly so far. We're debugging
via WLAN using a Ralink USB Wifi Module. 

GCC compiler (v4.3.3) is set up with following tags:
 -mtune=cortex-a8 -march=armv7-a -mfpu=neon -ftree-vectorize  -mfloat-abi=softfp 
and I'm also linking to libm-vfp.so (instead of libm.so)

During compiling i get the following errors:
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s: Assembler messages:
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s:130: Error: selected processor does not support `flds s14,[fp,#-

28]'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s:131: Error: selected processor does not support `flds s15,[fp,#-

24]'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s:132: Error: selected processor does not support `fadds s14,s14,
s15'

and so on...

That the simple test code:

//begin code
//This Code compiles fine with the option -mfpu=vfp but not with neon

#include <cstdlib>
#include <iostream>
#include <math.h>

int main(int argc, char *argv[]) {
	float a,b,c;

	a = 8.0f;
	b = 1.4f;

	for(float g=0.0f; g < 50; g+= 1.0f)
	{
		c = a+b+g;
		c += a*b;

		std::printf("g:%f\n",g);
		std::printf("Ergebnis:%f\n",c);
	}

	return EXIT_SUCCESS;
}
//end code

In an ohter project using the option -mfpu=vfp i get following errors:
(The code that yields following errors doesn't compile with either neon or vfp, it works fine with software float 
emulation (as it is set by default))

C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s: Assembler messages:
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:117: Error: D register out of range for selected VFP version -- `

fldd d16,[r1,#0]'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:118: Error: register out of range in list -- `fldmiad ip!,{d17}'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:119: Error: bad instruction `vadd.f32 d16,d16,d17'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:120: Error: register out of range in list -- `fstmiad r1!,{d16}'

etc...

I used this wiki page as guide http://wiki.davincidsp.com/index.php/Cortex-A8_Features


What am I doing wrong?

Ryan Mansfield(deleted)

07/13/2009 8:25 AM

post33620

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Florian Berger wrote:
> Hello,
> 
> I already posted the topic in the IDE Forum, I was not sure, so I also post the problem
> here.
>  
> I'm trying to make use of the NEON-pipeline for fast floating point arithmetics on an Cortex-A8. I also have problems 
using the not fully pipelined VFP. 
> 
> IDE:     QNX® Momentics® Integrated Development Environment  Version: 4.6.0
> 
> The Board is a Gumstix Overo, with an OMAP3530, 256MB PoP Memory (Micron). I've
> modified the Mistral OM3530 EVM BSP, it working perfectly so far. We're debugging
> via WLAN using a Ralink USB Wifi Module. 
> 
> GCC compiler (v4.3.3) is set up with following tags:
>  -mtune=cortex-a8 -march=armv7-a -mfpu=neon -ftree-vectorize  -mfloat-abi=softfp 
> and I'm also linking to libm-vfp.so (instead of libm.so)

Add -Wa,-mfpu=neon to your compiler options.

Regards,

Ryan Mansfield

Florian Berger

07/13/2009 9:33 AM

post33632

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Hello Ryan,

thank you very much! It's now compiling. But I've discovered a new missbehaviour:

When using arm_neon.h for neon instrinsics, there is following error

C:\QNX641\host\win32\x86\usr\lib\gcc\arm-unknown-nto-qnx6.4.0\4.3.3\include\arm_neon.h:35:2: error: #error You must 
enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use arm_neon.h

Compiler Options:
-march=armv7-a -mtune=cortex-a8  -Wa,-mfpu=neon  -ftree-vectorize -mfloat-abi=softfp

What does -Wa stand for?

Kind regards,

Florian

Florian Berger

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Florian Berger

07/13/2009 9:37 AM

post33633

Re: Compiler Error due to using Neon Pipeline on OMAP3530

It's also compiling with instrinsics when using:

-march=armv7-a -mtune=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ftree-vectorize  -Wa,-mfpu=neon

Thank you, I hope it is really using NEON, I'll benchmark it later.

Kind regards,

Florian

Philipp Lutz

05/20/2010 11:58 AM

post55288

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Hi Florian,

what was the final outcome of your benchmarks?
I'm persuing the same issue right now using the same hardware...
Here is what i've got when using the well-known "whetstone" floating-point benchmark (C sources found here: http://www.
netlib.org/benchmark/whetstone.c) and testing for both, Linux and QNX:

qcc compiler flags used: -O3 -mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon -Wa,-mfpu=
neon -mfloat-abi=softfp

Results after two minutes testing:
Linux:  266 MIPS
QNX:  130 MIPS

When i'm using even the "dangerous" compiler flags: -ffast-math -fomit-frame-pointer -funroll-loops

Results after two minutes testing:
Linux: 1700 MIPS
QNX:    134 MIPS

I also figured, that not matter whether you tell the compiler to use NEON (-mfpu=neon) or the VFP (-mfpu=VFP) you get 
the same results when you tell the compiler to try to vectorize the given code.  Now the big question is: why the hell 
the Linux (CodeSourcery2007q3-51, gcc-4.2.1) OS has twice the performance of the QNX system? According to this page (
http://community.qnx.com/sf/wiki/do/viewPage/projects.core_os/wiki/ARMv7_support) NEON should be supported in QNX 6.4.1.


Can you please share your benchmark results? I'm quite curious...

Cheers
Phil

Ryan Mansfield(deleted)

05/20/2010 8:43 PM

post55326

Re: Compiler Error due to using Neon Pipeline on OMAP3530

On 10-05-20 11:58 AM, Philipp Lutz wrote:
> Hi Florian,
>
> what was the final outcome of your benchmarks?
> I'm persuing the same issue right now using the same hardware...
> Here is what i've got when using the well-known "whetstone" floating-point benchmark (C sources found here: http://www
.netlib.org/benchmark/whetstone.c) and testing for both, Linux and QNX:
>
> qcc compiler flags used: -O3 -mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon -Wa,-mfpu=
neon -mfloat-abi=softfp
>
> Results after two minutes testing:
> Linux:  266 MIPS
> QNX:  130 MIPS

Hi Phil,

Are you using the soft-float libm.so or the libm-vfp.so? If you're using 
the soft-float libm.so, can you replace the libm.so.2 on your target 
with the libm-vfp.so and let me know if you see any performance improvement.

> When i'm using even the "dangerous" compiler flags: -ffast-math -fomit-frame-pointer -funroll-loops
>
> Results after two minutes testing:
> Linux: 1700 MIPS
> QNX:    134 MIPS
>

In the fast math case it looks math.h is redefining sin, cos and a few 
others so the compiler builtins are not being used. Can you extract the 
attached tarball into $QNX_TARGET and recompile/rerun the benchmark?

Regards,

Ryan Mansfield

Attachment:

fastmath.tar.gz 8.08 KB

Philipp Lutz

05/21/2010 6:29 AM

post55348

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Hi Ryan,

yes, I've tried to follow this guide: http://www.qnx.com/developers/docs/6.4.1/neutrino/technotes/vector_floating_point.
html#id5 in order to replace the libm.so by the libm-vfp.so. however ALL my programs I'm liking against libm-vfp are 
finally liked against libm (after checking with ldd).
What did I wrong? Is there an existing issue?

Wow, these header files gave me an increase to around 5200 MIPS, but thats rather close to ridiculous than to realistic.
 It seems that these header files activate the fastmath ability of QNX in the first place. What are these compiler 
builtin functions? Is this only a pointer to the compiler to implement the math function in a predefined way which is 
optimized by execution speed?

Another point: Can you explain why I get twice the speed (without fast-math) under linux compared to QNX?

Thank you!

Best Regards
Phil



> Hi Phil,
> 
> Are you using the soft-float libm.so or the libm-vfp.so? If you're using 
> the soft-float libm.so, can you replace the libm.so.2 on your target 
> with the libm-vfp.so and let me know if you see any performance improvement.
> 
> > When i'm using even the "dangerous" compiler flags: -ffast-math -fomit-frame
> -pointer -funroll-loops
> >
> > Results after two minutes testing:
> > Linux: 1700 MIPS
> > QNX:    134 MIPS
> >
> 
> In the fast math case it looks math.h is redefining sin, cos and a few 
> others so the compiler builtins are not being used. Can you extract the 
> attached tarball into $QNX_TARGET and recompile/rerun the benchmark?
> 
> Regards,
> 
> Ryan Mansfield
>

Ryan Mansfield(deleted)

06/01/2010 1:04 PM

post55889

Re: Compiler Error due to using Neon Pipeline on OMAP3530

On 10-05-21 06:29 AM, Philipp Lutz wrote:
> Hi Ryan,
>
> yes, I've tried to follow this guide: http://www.qnx.com/developers/docs/6.4.1/neutrino/technotes/
vector_floating_point.html#id5 in order to replace the libm.so by the libm-vfp.so. however ALL my programs I'm liking
against libm-vfp are finally liked against libm (after checking with ldd).
> What did I wrong? Is there an existing issue?

> Wow, these header files gave me an increase to around 5200 MIPS, but thats rather close to ridiculous than to
realistic. It seems that these header files activate the fastmath ability of QNX in the first place. What are these
compiler builtin functions? Is this only a pointer to the compiler to implement the math function in a predefined way
which is optimized by execution speed?

Yes, in my patch when you compile with -ffast-math there is a compiler
define __FAST_MATH__ set which redefines the calls to sin, cos, etc with
explicit calls to __builtin_sin, __builtin_cos, etc. The builtins are
optimized routines provided by GCC. For more information, see
http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/Other-Builtins.html. Often
these compiler builtins can replace a expensive function call with only
a handful of inline instructions. Under Linux when you compile with -O1
or greater the gcc will automatically replace the call to sin/cos/atan
with their corresponding builtin. When compiling with >=O1 will use
builtins for cos, sin and atan and by adding fast-math gcc will use
builtins for log, exp, and sqrt. You'll notice that when you specify -O3
-ffast-math you no longer have to link against libm. Under QNX our libm
headers redefine sin, cos to be _Sin which prevents gcc from
automatically using the compiler builtins. The headers I gave you do the
redefinition so the optimized builtins are used.

> Another point: Can you explain why I get twice the speed (without fast-math) under linux compared to QNX?

The difference can be mainly attributed to the use of the builtin
functions. For example, under Linux whetstone compiled with -O3
-mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize
-mfpu=neon -Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm yields
approximately the same performance under QNX. The previous patch I
attached had all the builtins enabled by __FAST_MATH__. I've tweaked the
patch to use builtins for sin and cos when O1 or greater is used, not
just -ffast-math is specified. I also attached a version of whetsone
for Neutrino that uses Clockcycles for better precision.

Here are my results:

Linux (mainline gcc 4.3.2):

arm-unknown-linux-gnueabi-gcc ~/whetstone.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon
-Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm

Loops: 10000, Iterations: 1, Duration: 8 sec.

C Converted Double Precision Whetstones: 125.0 MIPS

arm-unknown-linux-gnueabi-gcc ~/whetstone.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon
-Wa,-mfpu=neon -mfloat-abi=softfp -lm

Loops: 10000, Iterations: 1, Duration: 4 sec.

C Converted Double Precision Whetstones: 250.0 MIPS

arm-unknown-linux-gnueabi-gcc ~/whetstone.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon
-Wa,-mfpu=neon -mfloat-abi=softfp -ffast-math -funroll-loops
-fomit-frame-pointer

Loops: 100000, Iterations: 1, Duration: 4 sec.

C Converted Double Precision Whetstones: 2500.0 MIPS

QNX (with attached header changes)

qcc -V4.3.3,gcc_ntoarmle whetstone-qnx.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon
-Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm

Loops: 1000, Iterations: 1, Duration: 0.767395 sec.

C Converted Double Precision Whetstones: 130.3 MIPS

qcc -V4.3.3,gcc_ntoarmle whetstone-qnx.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a...

View Full Message

Attachment:

files.tar.gz 11.27 KB

Philipp Lutz

06/02/2010 9:49 AM

post55930

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Hi Ryan,

thanks for your effort! I attached my results with comparable compiler-flags. They seem to be quite similar although I 
think running more cycles (in my case 100 000) yields more accurate results regarding the processor pipelining for let's
 say a continuous FP algorithms.

However there is only one thing which I can't explain: the linking of libm-vfp.so
I can do whatever I want, the compiler will not link the libm-vfp.so, it's always using libm.so.
Later on the board I have to force the programs take the libm-vfp by setting a symbolic link from libm-vfp.so.2 to libm.
so.2, but that's only a dirty workaround.

By the way: good idea to use the clockcycle counter for the benchmark ;)

Have you ever had numerical problems using the -ffast-math compiler flag?

Best Regards
Phil

Attachment:

whetstone_results.pdf 86.97 KB

Ryan Mansfield(deleted)

06/02/2010 12:36 PM

post55939

Re: Compiler Error due to using Neon Pipeline on OMAP3530

On 10-06-02 09:49 AM, Philipp Lutz wrote:
> Hi Ryan,
>
> thanks for your effort! I attached my results with comparable compiler-flags. They seem to be quite similar although I
 think running more cycles (in my case 100 000) yields more accurate results regarding the processor pipelining for 
let's say a continuous FP algorithms.

Were the numbers for QNX run with or without my second patch (i.e. some 
of the gcc builtins triggered by __OPTIMIZE__ and not just __FAST_MATH__)?

> However there is only one thing which I can't explain: the linking of libm-vfp.so
> I can do whatever I want, the compiler will not link the libm-vfp.so, it's always using libm.so.
> Later on the board I have to force the programs take the libm-vfp by setting a symbolic link from libm-vfp.so.2 to 
libm.so.2, but that's only a dirty workaround.

If your target(s) always have VFP hardware my suggestion would be just 
cp $QNX_TARGET/armle/lib/libm-vfp.so.2 over top of 
$QNX_TARGET/armle/lib/libm.so.2 on your host, link everything against 
libm.so.2 and rebuild your image.

If you're not using the gcc builtins (i.e. libm) you can also get a 
performance improvement by adding [phys_align=64k] in front of libm.so.2 
in your build file. With this attribute, libm.so.2 will get aligned on a 
64k boundary and then the memmgr can use 64k pages to do contiguous 64k 
mappings of the file which improves TLB usage.

> By the way: good idea to use the clockcycle counter for the benchmark ;)
>
> Have you ever had numerical problems using the -ffast-math compiler flag?

I've been fortunate enough not to have any issues with it personally. 
But I investigated a customer issue a years ago where it -ffast-math was 
causing unexpected behaviour.

Regards,

Ryan Mansfield

Philipp Lutz

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Philipp Lutz

06/08/2010 12:49 PM

post56318

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Hi Ryan

> Were the numbers for QNX run with or without my second patch (i.e. some 
> of the gcc builtins triggered by __OPTIMIZE__ and not just __FAST_MATH__)?

i only used your first patch because I'm not sure if I want to have fast-math as soon as I'm activating some compiler 
optimization flags.

> If your target(s) always have VFP hardware my suggestion would be just 
> cp $QNX_TARGET/armle/lib/libm-vfp.so.2 over top of 
> $QNX_TARGET/armle/lib/libm.so.2 on your host, link everything against 
> libm.so.2 and rebuild your image.

> If you're not using the gcc builtins (i.e. libm) you can also get a 
> performance improvement by adding [phys_align=64k] in front of libm.so.2 
> in your build file. With this attribute, libm.so.2 will get aligned on a 
> 64k boundary and then the memmgr can use 64k pages to do contiguous 64k 
> mappings of the file which improves TLB usage.

Interesting idea, however it only provided around 1-2 WMIPS more.

> > Have you ever had numerical problems using the -ffast-math compiler flag?
> 
> I've been fortunate enough not to have any issues with it personally. 
> But I investigated a customer issue a years ago where it -ffast-math was 
> causing unexpected behaviour.

I'll have a look at this as well in order to find out if it causes problems or even unexpected behaviour for our 
applications.

Cheers
Phil

Philipp Lutz

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Philipp Lutz

06/09/2010 10:24 AM

post56390

Re: Compiler Error due to using Neon Pipeline on OMAP3530

> If your target(s) always have VFP hardware my suggestion would be just 
> cp $QNX_TARGET/armle/lib/libm-vfp.so.2 over top of 
> $QNX_TARGET/armle/lib/libm.so.2 on your host, link everything against 
> libm.so.2 and rebuild your image.

Today I figured that this is not a good idea as the program "random" gives me the following error when i replaced libm 
by libm-vfp:
"""
unknown symbol: __fixsfisi
ldd:FATAL: Could not resolve all symbols
"""

Isn't there another way to link against libm-vfp.so.2 during compile time? Are you able to link with "-lm-vfp" and 
finally see the link to libm-vfp.so.2 when using ldd? Is this a (known) compiler issue?

Thanks for helping!

Regards
Phil

Ryan Mansfield(deleted)

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Ryan Mansfield(deleted)

07/13/2009 9:43 AM

post33634

Re: Compiler Error due to using Neon Pipeline on OMAP3530

Florian Berger wrote:
> Hello Ryan,
> 
> thank you very much! It's now compiling. But I've discovered a new missbehaviour:
> 
> When using arm_neon.h for neon instrinsics, there is following error
> 
> C:\QNX641\host\win32\x86\usr\lib\gcc\arm-unknown-nto-qnx6.4.0\4.3.3\include\arm_neon.h:35:2: error: #error You must 
enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use arm_neon.h
> 
> Compiler Options:
> -march=armv7-a -mtune=cortex-a8  -Wa,-mfpu=neon  -ftree-vectorize -mfloat-abi=softfp
> 
> What does -Wa stand for?

-Wa, passes the option after the comma to the assemlber. In this case, 
-mfpu=neon the assembler needs to be specified.

The second problem is that you need to specify -mfloat-abi=softfp to 
tell gcc to generate hard floating point instructions but to still 
follow the soft float ABI.

Regards,

Ryan Mansfield

Return

The text you entered is not a valid object ID
More Information
Object IDs begin with an object prefix and end with a number. For example, if you enter
artf2345
the application will jump directly to an artifact with the ID artf2345. Some valid object prefixes are:
artf	for an artifact
doc	for a document
page	for a project page
topc	for a discussion topic
wiki	for a wiki page