Forum Topic - QNX6.5(M9) performance issue with shared objects and "dlopen": (7 Items)
   
QNX6.5(M9) performance issue with shared objects and "dlopen"  
Hi,
we’ve been testing the QNX6.5(M9) pre-release with our SW and seem to have an ugly performance issue with
libraries generated by the new compiler/linker tool-chain. We’re using shared objects loaded by “dlopen”.
Since load time does not matter for us but run-time is crucial, we use the “RTLD_NOW” mode, expecting all
symbols to be linked at load time.
With QNX6.5M9 we’re now seeing a major performance difference when calling a library function for the very
first time (40us instead of 4us). When compiling/linking the sources with the QNX6.4 tools, there is no such
effect (except for minor caching effects, of course), even when using the new kernel etc.
We already determined that the first call to a library function touches a lot of additional memory areas
(code/data). So, to us it seems that initial function calls now trigger some on-demand linking overhead,
just as if “RTLD_LAZY” was now the fixed and default behavior. “RTLD_LAZY” wasn’t supported with earlier
versions and is a new feature for 6.5. Are there any known performance issues with “dlopen” in the
pre-release?
We’re compiling for PowerPC targets and running on PPC405.

Kind regards,
Thorsten
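
For anyone who wants to reproduce the measurement, here is a minimal sketch in C of timing the first and second call into a library loaded with RTLD_NOW. It is not the code from our project: the helper name time_first_two_calls is made up for illustration, the looked-up symbol is assumed to have the signature void f(void), and CLOCK_MONOTONIC is assumed to be available on the target.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Hypothetical helper: load `path` with RTLD_NOW, look up `symbol`
 * (assumed to have the signature void f(void)) and time the first and
 * the second call. Returns 0 on success, -1 if loading or lookup fails.
 */
int time_first_two_calls(const char *path, const char *symbol,
                         long long *first_ns, long long *second_ns)
{
    assert(symbol != NULL && first_ns != NULL && second_ns != NULL);

    void *handle = dlopen(path, RTLD_NOW);
    if (handle == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlopen failed: %s\n", err ? err : "unknown error");
        return -1;
    }

    void (*fn)(void);
    *(void **)(&fn) = dlsym(handle, symbol);   /* POSIX idiom for function pointers */
    if (fn == NULL) {
        const char *err = dlerror();
        fprintf(stderr, "dlsym failed: %s\n", err ? err : "unknown error");
        dlclose(handle);
        return -1;
    }

    long long t0 = now_ns();
    fn();   /* first call: includes any binding work still left to do */
    long long t1 = now_ns();
    fn();   /* second call: symbols used on this path are already bound */
    long long t2 = now_ns();

    *first_ns = t1 - t0;
    *second_ns = t2 - t1;
    dlclose(handle);
    return 0;
}
```

A large gap between the two values hints at work being deferred to the first call, although cache effects contribute as well, so the numbers are only meaningful when compared across tool-chain versions on the same target.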
Re: QNX6.5(M9) performance issue with shared objects and "dlopen"  
On Tue, 2010-06-29 at 09:42 -0400, Thorsten Brehm wrote:
> Hi,
> we’ve been testing the QNX6.5(M9) pre-release with our SW and seem to
> have an ugly performance issue with libraries generated by the new
> compiler/linker tool-chain. We’re using shared objects loaded by
> “dlopen”. Since for us load time does not matter, but run-time is
> crucial, we use the “RTLD_NOW” mode, expecting all symbols to be
> linked at load-time.
> With QNX6.5M9 we’re now seeing a major performance difference when
> calling a library function for the very first time (40us instead of
> 4us). When compiling/linking the sources with the QNX6.4 tools, there
> is no such effect (except for minor caching effects, of course) – even
> when using the new kernel etc. 
> We already determined that the first call to a library function
> touches a lot of additional memory areas (code/data). So, to us it
> seems that initial function calls now triggers some on-demand linking
> overhead - just as if “RTLD_LAZY” was now the fixed and default
> behavior. “RTLD_LAZY” wasn’t supported with earlier versions and is a
> new feature for 6.5. Are there any known performance issues with
> “dlopen” in the pre-release?
> And we’re compiling for PowerPC targets and running on PPC405.

You will probably get better and more authoritative answers from our
tool experts, but one thing you could try is explicitly setting the
environment variable LD_BIND_NOW=1.  This actually does something
slightly different than RTLD_NOW.  However, it may not have any real
benefit in your particular case.

Another thing you can try to see if you can narrow down the behaviour is
to set LD_DEBUG to see what ldd is up to.  The allowable values for
LD_DEBUG include a comma separated list of any of help, all, libs,
reloc, statistics, lazyload, and debug.

Regards,
Neil
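
Both variables can be tried without changing the application, for example from a small launcher. The following is only a sketch in C: set_linker_env and exec_with_linker_env are hypothetical helper names, and it assumes the runtime linker reads LD_BIND_NOW and LD_DEBUG when a process image starts, which is why they are set before exec rather than inside the already running program.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: put the runtime-linker settings into the current
 * environment so that a subsequently exec'ed program inherits them.
 * bind_now != 0      -> LD_BIND_NOW=1, otherwise the variable is removed.
 * debug_categories   -> comma separated LD_DEBUG list such as "libs,reloc";
 *                       NULL or "" removes the variable.
 * Returns 0 on success, -1 on failure.
 */
int set_linker_env(int bind_now, const char *debug_categories)
{
    if (bind_now) {
        if (setenv("LD_BIND_NOW", "1", 1) != 0)
            return -1;
    } else if (unsetenv("LD_BIND_NOW") != 0) {
        return -1;
    }

    if (debug_categories != NULL && debug_categories[0] != '\0') {
        if (setenv("LD_DEBUG", debug_categories, 1) != 0)
            return -1;
    } else if (unsetenv("LD_DEBUG") != 0) {
        return -1;
    }
    return 0;
}

/*
 * Hypothetical launcher: replace the current process with argv[0], started
 * under the given settings. Only returns (with -1) if something failed.
 */
int exec_with_linker_env(char *const argv[], int bind_now,
                         const char *debug_categories)
{
    if (argv == NULL || argv[0] == NULL)
        return -1;
    if (set_linker_env(bind_now, debug_categories) != 0)
        return -1;
    execvp(argv[0], argv);
    perror("execvp");
    return -1;
}
```

Calling exec_with_linker_env(argv, 1, "libs,reloc") from a launcher's main would start the real application with immediate binding and linker debug output enabled.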
Re: QNX6.5(M9) performance issue with shared objects and "dlopen"  
> You will probably get better and more authoritative answers from our
> tool experts, but one thing you could try is explicitly setting the
> environment variable LD_BIND_NOW=1.  This actually does something
> slightly different than RTLD_NOW.  However, it may not have any real
> benefit in your particular case.
> 
> Another thing you can try to see if you can narrow down the behaviour is
> to set LD_DEBUG to see what ldd is up to.  The allowable values for
> LD_DEBUG include a comma separated list of any of help, all, libs,
> reloc, statistics, lazyload, and debug.
> 
> Regards,
> Neil

Thanks a lot, Neil!

Using LD_DEBUG we indeed saw lazy binding. Initial calls to library functions caused additional symbols to be
resolved. Not the expected behaviour when using "RTLD_NOW" instead of "RTLD_LAZY"…

We now configured LD_BIND_NOW=1 and it *did* solve the issue. Initial function calls are now as fast as
subsequent calls. LD_DEBUG shows all symbols are being resolved immediately once the library is loaded, and no
linking happens during run-time. Nice. And we saw the issue also affected shared libraries linked at build
time, not just the ones loaded by “dlopen”.

I can see the advantage of lazy binding to reduce start-up delays. Not sure it's such a good idea for hard
real-time systems though, especially as a default behaviour. I guess other customers will face similar
migration issues… But it’s working for us now and all is well ;-).

Kind regards,
Thorsten
Re: QNX6.5(M9) performance issue with shared objects and "dlopen"  
On Mon, 2010-07-05 at 11:43 -0400, Thorsten Brehm wrote:
> Using LD_DEBUG we indeed saw lazy binding. Initial calls to library
> functions caused additional symbols to be resolved. Not the expected
> behaviour when using "RTLD_NOW" instead of "RTLD_LAZY"…

I'll check in with our runtime linker expert and check exactly what
references are left unbound in the RTLD_NOW case.  My memory is
dreadful, but I think it has to do primarily with bits of libc (because
that's actually where the linker is embedded and it has to bootstrap
itself in an "unusual" way) and possibly also some runtime support
routines.  But I probably don't have the whole story (or even
necessarily the right story....)

> I can see the advantage of lazy binding to reduce start-up delays. Not
> sure it's such a good idea for hard real-time systems though –
> especially as a default behaviour. I guess other customers will face
> similar migration issues… But it’s working for us now and all is
> well ;-).

Yes, we agonized for a long time about what the default should be.  In
the end we decided to turn on lazy by default, for various reasons, and
added it to the release notes.  Time will tell....  Thank you very much
for your feedback!  Our customers' opinions are highly valued, and we
appreciate knowing if things don't always go smoothly so we can try to
improve.

Regards,
Neil
Re: QNX6.5(M9) performance issue with shared objects and "dlopen"  
There are several ways of changing the default (lazy bind) behaviour:

1) by specifying the LD_BIND_NOW=1 env. variable.
2) on a per-executable basis, by passing the -znow option to the linker
when linking the executable.
3) by passing the -znow option to the linker when linking a shared object.
In this case, only that particular shared object is bound 'now', not the
whole application.

Passing RTLD_NOW/RTLD_LAZY to dlopen does not change the binding strategy
(which is determined as described above). However, even in the bind-now
case, RTLD_LAZY allows a symbol definition to be missing at load time and
defers the resolution of that symbol to the first call. This allows for
scenarios where a shared object explicitly loads the definitions for its
own calls, i.e. it effectively leaves the resolution scope (the list of
objects searched for a definition) open, as opposed to all other cases,
where the resolution scope is immutable and determined at load time.

I hope this clarifies things.

---
Aleksandar

p.s. passing 'help' to LD_DEBUG will list categories of debug output;
multiple categories can be specified, e.g. LD_DEBUG=libs,reloc



Re: QNX6.5(M9) performance issue with shared objects and "dlopen"  
Yes, thanks for clearing this up! This certainly explains the behaviour.

IMHO, maybe you could improve the documentation for dlopen. It says there:

“RTLD_LAZY [..] References to functions aren't relocated until that function is invoked. This improves performance by 
preventing unnecessary relocations. [..]”
“RTLD_NOW  All references are relocated when the object is loaded. This may waste cycles if relocations are performed 
for functions that never get called [..]”

This certainly sounds as if the flag did change the binding strategy…

Admittedly, the 6.5.0 documentation for dlopen/RTLD_LAZY now has a link to the “lazy loading” section in the 
programmer’s guide. And the explanation there is very detailed and consistent with your explanation above.
But then again, I guess we all know the problem with lazy on demand reading... It's better when the initial look-up 
already provides correct information... ;-)

Thorsten
Re: QNX6.5(M9) performance issue with shared objects and "dlopen"  
Ok, technically the docs are correct, but maybe not clear enough. I
probably introduced additional confusion talking about binding strategy.


Let me try again:

* By default, all binding is lazy; this can be overridden by -znow on the
executable, -znow on a shared object, or the LD_BIND_NOW=1 env. var.

* When binding is lazy, RTLD_NOW will change it to "now" for the object
being dlopen-ed and its not-yet-loaded dependencies. RTLD_LAZY makes no
difference.

* When binding is 'now', either through LD_BIND_NOW=1 or -znow on the
executable, using RTLD_LAZY will make the shared object be resolved lazily,
including its not-yet-loaded dependencies. RTLD_NOW will make no
difference.

* When -znow is used on a shared object, binding is always 'now' for that
object. RTLD_LAZY cannot override that.



I believe this is now precise. If not, please do not hesitate to ask
further questions.



Thanks,

Aleksandar
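
The four rules in the last post can be written down as a small decision function. This is only a sketch in C of the rules as stated in this thread, not code from the runtime linker: the type and function names are invented for illustration, and the function only covers the object being loaded and its not-yet-loaded dependencies.

```c
#include <stdbool.h>

/* How an object ends up being bound. */
typedef enum { BIND_LAZY, BIND_NOW } bind_mode;

/* How the object gets loaded (hypothetical names). */
typedef enum {
    LOAD_AT_STARTUP,   /* dependency resolved when the process starts */
    LOAD_DLOPEN_LAZY,  /* dlopen(..., RTLD_LAZY) */
    LOAD_DLOPEN_NOW    /* dlopen(..., RTLD_NOW) */
} load_kind;

/*
 * global_now:  LD_BIND_NOW=1 is set or the executable was linked with -znow.
 * object_znow: the shared object itself was linked with -znow.
 */
bind_mode effective_binding(bool global_now, bool object_znow, load_kind how)
{
    /* -znow on the shared object always wins; RTLD_LAZY cannot override it. */
    if (object_znow)
        return BIND_NOW;

    /* RTLD_NOW turns lazy into now and makes no difference if already now. */
    if (how == LOAD_DLOPEN_NOW)
        return BIND_NOW;

    /* RTLD_LAZY turns a global 'now' into lazy and keeps lazy as lazy. */
    if (how == LOAD_DLOPEN_LAZY)
        return BIND_LAZY;

    /* No dlopen involved: the process-wide setting decides. */
    return global_now ? BIND_NOW : BIND_LAZY;
}
```

Written this way, the rules say that for a dlopen-ed object without -znow the dlopen flag alone decides, while LD_BIND_NOW=1 and -znow on the executable only matter for objects loaded at startup.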