Florian Berger
07/12/2009 11:27 AM
post33603
|
Compiler Error due to using Neon Pipeline on OMAP3530
Hello,
I already posted this topic in the IDE forum, but I wasn't sure it belongs there, so I'm posting the problem here as well.
I'm trying to use the NEON pipeline for fast floating-point arithmetic on a Cortex-A8. I also have problems using the VFP, which is not fully pipelined.
IDE: QNX® Momentics® Integrated Development Environment Version: 4.6.0
The board is a Gumstix Overo with an OMAP3530 and 256MB PoP memory (Micron). I've
modified the Mistral OM3530 EVM BSP, and it's working perfectly so far. We're debugging
via WLAN using a Ralink USB WiFi module.
The GCC compiler (v4.3.3) is set up with the following flags:
-mtune=cortex-a8 -march=armv7-a -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
and I'm also linking to libm-vfp.so (instead of libm.so).
When compiling, I get the following errors:
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s: Assembler messages:
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s:130: Error: selected processor does not support `flds s14,[fp,#-28]'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s:131: Error: selected processor does not support `flds s15,[fp,#-24]'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qccvAhTrc\TestPro.s:132: Error: selected processor does not support `fadds s14,s14,s15'
and so on...
This is the simple test code:
//begin code
//This code compiles fine with the option -mfpu=vfp but not with -mfpu=neon
#include <cstdio>   // std::printf
#include <cstdlib>  // EXIT_SUCCESS
#include <iostream>
#include <math.h>
int main(int argc, char *argv[]) {
    float a, b, c;
    a = 8.0f;
    b = 1.4f;
    for (float g = 0.0f; g < 50; g += 1.0f)
    {
        c = a + b + g;
        c += a * b;
        std::printf("g:%f\n", g);
        std::printf("Ergebnis:%f\n", c);  // "Ergebnis" is German for "result"
    }
    return EXIT_SUCCESS;
}
//end code
In another project using the option -mfpu=vfp I get the following errors.
(The code that produces these errors compiles with neither neon nor vfp; it works fine with software float
emulation, which is the default.)
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s: Assembler messages:
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:117: Error: D register out of range for selected VFP version -- `fldd d16,[r1,#0]'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:118: Error: register out of range in list -- `fldmiad ip!,{d17}'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:119: Error: bad instruction `vadd.f32 d16,d16,d17'
C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\2qcc0yf3eb\Vector.s:120: Error: register out of range in list -- `fstmiad r1!,{d16}'
etc...
I used this wiki page as a guide: http://wiki.davincidsp.com/index.php/Cortex-A8_Features
What am I doing wrong?
|
|
|
Ryan Mansfield(deleted)
07/13/2009 8:25 AM
post33620
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
Florian Berger wrote:
> Hello,
>
> I already posted this topic in the IDE forum, but I wasn't sure it belongs there, so I'm posting the problem here as well.
>
> I'm trying to use the NEON pipeline for fast floating-point arithmetic on a Cortex-A8. I also have problems using the VFP, which is not fully pipelined.
>
> IDE: QNX® Momentics® Integrated Development Environment Version: 4.6.0
>
> The board is a Gumstix Overo with an OMAP3530 and 256MB PoP memory (Micron). I've
> modified the Mistral OM3530 EVM BSP, and it's working perfectly so far. We're debugging
> via WLAN using a Ralink USB WiFi module.
>
> The GCC compiler (v4.3.3) is set up with the following flags:
> -mtune=cortex-a8 -march=armv7-a -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
> and I'm also linking to libm-vfp.so (instead of libm.so).
Add -Wa,-mfpu=neon to your compiler options.
Regards,
Ryan Mansfield
|
|
|
Florian Berger
07/13/2009 9:33 AM
post33632
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
Hello Ryan,
thank you very much! It's compiling now. But I've discovered a new misbehaviour:
When using arm_neon.h for NEON intrinsics, I get the following error:
C:\QNX641\host\win32\x86\usr\lib\gcc\arm-unknown-nto-qnx6.4.0\4.3.3\include\arm_neon.h:35:2: error: #error You must enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use arm_neon.h
Compiler Options:
-march=armv7-a -mtune=cortex-a8 -Wa,-mfpu=neon -ftree-vectorize -mfloat-abi=softfp
What does -Wa stand for?
Kind regards,
Florian
|
|
|
Florian Berger
07/13/2009 9:37 AM
post33633
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
It also compiles with intrinsics when using:
-march=armv7-a -mtune=cortex-a8 -mfloat-abi=softfp -mfpu=neon -ftree-vectorize -Wa,-mfpu=neon
Thank you. I hope it is really using NEON; I'll benchmark it later.
Kind regards,
Florian
|
|
|
Philipp Lutz
05/20/2010 11:58 AM
post55288
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
Hi Florian,
what was the final outcome of your benchmarks?
I'm pursuing the same issue right now using the same hardware...
Here is what I got when using the well-known "whetstone" floating-point benchmark (C sources found here: http://www.netlib.org/benchmark/whetstone.c) and testing on both Linux and QNX:
qcc compiler flags used: -O3 -mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon -Wa,-mfpu=neon -mfloat-abi=softfp
Results after two minutes of testing:
Linux: 266 MIPS
QNX: 130 MIPS
When I additionally use the "dangerous" compiler flags -ffast-math -fomit-frame-pointer -funroll-loops:
Results after two minutes of testing:
Linux: 1700 MIPS
QNX: 134 MIPS
I also found that no matter whether you tell the compiler to use NEON (-mfpu=neon) or the VFP (-mfpu=vfp), you get
the same results when you tell the compiler to try to vectorize the given code. Now the big question is: why the hell
does the Linux system (CodeSourcery2007q3-51, gcc-4.2.1) have twice the performance of the QNX system? According to this page
(http://community.qnx.com/sf/wiki/do/viewPage/projects.core_os/wiki/ARMv7_support), NEON should be supported in QNX 6.4.1.
Can you please share your benchmark results? I'm quite curious...
Cheers
Phil
|
|
|
Ryan Mansfield(deleted)
05/20/2010 8:43 PM
post55326
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
On 10-05-20 11:58 AM, Philipp Lutz wrote:
> Hi Florian,
>
> what was the final outcome of your benchmarks?
> I'm pursuing the same issue right now using the same hardware...
> Here is what I got when using the well-known "whetstone" floating-point benchmark (C sources found here: http://www.netlib.org/benchmark/whetstone.c) and testing on both Linux and QNX:
>
> qcc compiler flags used: -O3 -mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon -Wa,-mfpu=neon -mfloat-abi=softfp
>
> Results after two minutes of testing:
> Linux: 266 MIPS
> QNX: 130 MIPS
Hi Phil,
Are you using the soft-float libm.so or the libm-vfp.so? If you're using
the soft-float libm.so, can you replace the libm.so.2 on your target
with the libm-vfp.so and let me know if you see any performance improvement.
> When I additionally use the "dangerous" compiler flags -ffast-math -fomit-frame-pointer -funroll-loops:
>
> Results after two minutes testing:
> Linux: 1700 MIPS
> QNX: 134 MIPS
>
In the fast math case it looks like math.h is redefining sin, cos and a few
others so the compiler builtins are not being used. Can you extract the
attached tarball into $QNX_TARGET and recompile/rerun the benchmark?
Regards,
Ryan Mansfield
|
|
|
Philipp Lutz
05/21/2010 6:29 AM
post55348
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
Hi Ryan,
yes, I've tried to follow this guide: http://www.qnx.com/developers/docs/6.4.1/neutrino/technotes/vector_floating_point.html#id5 in order to replace libm.so with libm-vfp.so. However, ALL the programs I link against libm-vfp end up linked against libm (checked with ldd).
What did I do wrong? Is there an existing issue?
Wow, these header files gave me an increase to around 5200 MIPS, but that's closer to ridiculous than to realistic.
It seems that these header files activate the fast-math ability of QNX in the first place. What are these compiler
builtin functions? Are they only a hint to the compiler to implement the math functions in a predefined way that is
optimized for execution speed?
Another point: Can you explain why I get twice the speed (without fast-math) under Linux compared to QNX?
Thank you!
Best Regards
Phil
> Hi Phil,
>
> Are you using the soft-float libm.so or the libm-vfp.so? If you're using
> the soft-float libm.so, can you replace the libm.so.2 on your target
> with the libm-vfp.so and let me know if you see any performance improvement.
>
> > When I additionally use the "dangerous" compiler flags -ffast-math -fomit-frame-pointer -funroll-loops:
> >
> > Results after two minutes testing:
> > Linux: 1700 MIPS
> > QNX: 134 MIPS
> >
>
> In the fast math case it looks like math.h is redefining sin, cos and a few
> others so the compiler builtins are not being used. Can you extract the
> attached tarball into $QNX_TARGET and recompile/rerun the benchmark?
>
> Regards,
>
> Ryan Mansfield
>
|
|
|
Ryan Mansfield(deleted)
06/01/2010 1:04 PM
post55889
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
On 10-05-21 06:29 AM, Philipp Lutz wrote:
> Hi Ryan,
>
> yes, I've tried to follow this guide: http://www.qnx.com/developers/docs/6.4.1/neutrino/technotes/vector_floating_point.html#id5 in order to replace libm.so with libm-vfp.so. However, ALL the programs I link against libm-vfp end up linked against libm (checked with ldd).
> What did I do wrong? Is there an existing issue?
> Wow, these header files gave me an increase to around 5200 MIPS, but that's closer to ridiculous than to realistic. It seems that these header files activate the fast-math ability of QNX in the first place. What are these compiler builtin functions? Are they only a hint to the compiler to implement the math functions in a predefined way that is optimized for execution speed?
Yes, in my patch, when you compile with -ffast-math the compiler defines
__FAST_MATH__, which the headers use to replace the calls to sin, cos, etc. with
explicit calls to __builtin_sin, __builtin_cos, etc. The builtins are
optimized routines provided by GCC. For more information, see
http://gcc.gnu.org/onlinedocs/gcc-4.4.2/gcc/Other-Builtins.html. Often
these compiler builtins can replace an expensive function call with only
a handful of inline instructions. Under Linux, when you compile with -O1
or greater, gcc will automatically replace calls to sin, cos and atan
with their corresponding builtins, and adding -ffast-math makes gcc use
builtins for log, exp, and sqrt as well. You'll notice that when you specify -O3
-ffast-math you no longer have to link against libm. Under QNX, our libm
headers redefine sin and cos to be _Sin, which prevents gcc from
automatically using the compiler builtins. The headers I gave you do the
redefinition so that the optimized builtins are used.
> Another point: Can you explain why I get twice the speed (without fast-math) under linux compared to QNX?
The difference can mainly be attributed to the use of the builtin
functions. For example, under Linux, whetstone compiled with -O3
-mtune=cortex-a8 -march=armv7-a -Wa,-march=armv7-a -ftree-vectorize
-mfpu=neon -Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm yields
approximately the same performance as under QNX. The previous patch I
attached had all the builtins enabled by __FAST_MATH__. I've tweaked the
patch to use builtins for sin and cos whenever -O1 or greater is used, not
just when -ffast-math is specified. I also attached a version of whetstone
for Neutrino that uses ClockCycles() for better precision.
Here are my results:
Linux (mainline gcc 4.3.2):
arm-unknown-linux-gnueabi-gcc ~/whetstone.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon
-Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm
Loops: 10000, Iterations: 1, Duration: 8 sec.
C Converted Double Precision Whetstones: 125.0 MIPS
arm-unknown-linux-gnueabi-gcc ~/whetstone.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon
-Wa,-mfpu=neon -mfloat-abi=softfp -lm
Loops: 10000, Iterations: 1, Duration: 4 sec.
C Converted Double Precision Whetstones: 250.0 MIPS
arm-unknown-linux-gnueabi-gcc ~/whetstone.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon
-Wa,-mfpu=neon -mfloat-abi=softfp -ffast-math -funroll-loops
-fomit-frame-pointer
Loops: 100000, Iterations: 1, Duration: 4 sec.
C Converted Double Precision Whetstones: 2500.0 MIPS
QNX (with attached header changes)
qcc -V4.3.3,gcc_ntoarmle whetstone-qnx.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a -ftree-vectorize -mfpu=neon
-Wa,-mfpu=neon -mfloat-abi=softfp -fno-builtin -lm
Loops: 1000, Iterations: 1, Duration: 0.767395 sec.
C Converted Double Precision Whetstones: 130.3 MIPS
qcc -V4.3.3,gcc_ntoarmle whetstone-qnx.c -O3 -mtune=cortex-a8
-march=armv7-a -Wa,-march=armv7-a...
|
|
|
Philipp Lutz
06/02/2010 9:49 AM
post55930
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
Hi Ryan,
thanks for your effort! I attached my results with comparable compiler flags. They seem to be quite similar, although I
think running more cycles (in my case 100 000) yields more accurate results with regard to processor pipelining for, let's
say, a continuous FP algorithm.
However, there is one thing which I can't explain: the linking of libm-vfp.so.
Whatever I do, the compiler will not link libm-vfp.so; it always uses libm.so.
Later on the board I have to force the programs to use libm-vfp by setting a symbolic link from libm-vfp.so.2 to libm.so.2, but that's only a dirty workaround.
By the way: good idea to use the clockcycle counter for the benchmark ;)
Have you ever had numerical problems using the -ffast-math compiler flag?
Best Regards
Phil
|
|
|
Ryan Mansfield(deleted)
06/02/2010 12:36 PM
post55939
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
On 10-06-02 09:49 AM, Philipp Lutz wrote:
> Hi Ryan,
>
> thanks for your effort! I attached my results with comparable compiler flags. They seem to be quite similar, although I think running more cycles (in my case 100 000) yields more accurate results with regard to processor pipelining for, let's say, a continuous FP algorithm.
Were the numbers for QNX run with or without my second patch (i.e. some
of the gcc builtins triggered by __OPTIMIZE__ and not just __FAST_MATH__)?
> However, there is one thing which I can't explain: the linking of libm-vfp.so.
> Whatever I do, the compiler will not link libm-vfp.so; it always uses libm.so.
> Later on the board I have to force the programs to use libm-vfp by setting a symbolic link from libm-vfp.so.2 to libm.so.2, but that's only a dirty workaround.
If your target(s) always have VFP hardware my suggestion would be just
cp $QNX_TARGET/armle/lib/libm-vfp.so.2 over top of
$QNX_TARGET/armle/lib/libm.so.2 on your host, link everything against
libm.so.2 and rebuild your image.
If you're not using the gcc builtins (i.e. libm) you can also get a
performance improvement by adding [phys_align=64k] in front of libm.so.2
in your build file. With this attribute, libm.so.2 will get aligned on a
64k boundary and then the memmgr can use 64k pages to do contiguous 64k
mappings of the file which improves TLB usage.
> By the way: good idea to use the clockcycle counter for the benchmark ;)
>
> Have you ever had numerical problems using the -ffast-math compiler flag?
I've been fortunate enough not to have any issues with it personally.
But I investigated a customer issue a few years ago where -ffast-math was
causing unexpected behaviour.
Regards,
Ryan Mansfield
|
|
|
Philipp Lutz
06/08/2010 12:49 PM
post56318
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
Hi Ryan
> Were the numbers for QNX run with or without my second patch (i.e. some
> of the gcc builtins triggered by __OPTIMIZE__ and not just __FAST_MATH__)?
I only used your first patch because I'm not sure I want fast-math behaviour as soon as I activate some compiler
optimization flags.
> If your target(s) always have VFP hardware my suggestion would be just
> cp $QNX_TARGET/armle/lib/libm-vfp.so.2 over top of
> $QNX_TARGET/armle/lib/libm.so.2 on your host, link everything against
> libm.so.2 and rebuild your image.
> If you're not using the gcc builtins (i.e. libm) you can also get a
> performance improvement by adding [phys_align=64k] in front of libm.so.2
> in your build file. With this attribute, libm.so.2 will get aligned on a
> 64k boundary and then the memmgr can use 64k pages to do contiguous 64k
> mappings of the file which improves TLB usage.
Interesting idea, however it only provided around 1-2 WMIPS more.
> > Have you ever had numerical problems using the -ffast-math compiler flag?
>
> I've been fortunate enough not to have any issues with it personally.
> But I investigated a customer issue a few years ago where -ffast-math was
> causing unexpected behaviour.
I'll have a look at this as well in order to find out if it causes problems or even unexpected behaviour for our
applications.
Cheers
Phil
|
|
|
Philipp Lutz
06/09/2010 10:24 AM
post56390
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
> If your target(s) always have VFP hardware my suggestion would be just
> cp $QNX_TARGET/armle/lib/libm-vfp.so.2 over top of
> $QNX_TARGET/armle/lib/libm.so.2 on your host, link everything against
> libm.so.2 and rebuild your image.
Today I found out that this is not a good idea, as the program "random" gives me the following error when I replace libm
with libm-vfp:
"""
unknown symbol: __fixsfisi
ldd:FATAL: Could not resolve all symbols
"""
Isn't there another way to link against libm-vfp.so.2 during compile time? Are you able to link with "-lm-vfp" and
finally see the link to libm-vfp.so.2 when using ldd? Is this a (known) compiler issue?
Thanks for helping!
Regards
Phil
|
|
|
Ryan Mansfield(deleted)
07/13/2009 9:43 AM
post33634
|
Re: Compiler Error due to using Neon Pipeline on OMAP3530
Florian Berger wrote:
> Hello Ryan,
>
> thank you very much! It's compiling now. But I've discovered a new misbehaviour:
>
> When using arm_neon.h for NEON intrinsics, I get the following error:
>
> C:\QNX641\host\win32\x86\usr\lib\gcc\arm-unknown-nto-qnx6.4.0\4.3.3\include\arm_neon.h:35:2: error: #error You must enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use arm_neon.h
>
> Compiler Options:
> -march=armv7-a -mtune=cortex-a8 -Wa,-mfpu=neon -ftree-vectorize -mfloat-abi=softfp
>
> What does -Wa stand for?
-Wa, passes the option after the comma to the assembler. In this case,
-mfpu=neon needs to be specified for the assembler as well.
The second problem is that you need to specify -mfloat-abi=softfp to
tell gcc to generate hard floating point instructions but to still
follow the soft float ABI.
Regards,
Ryan Mansfield
|
|
|
|