Oleh Derevenko(deleted)
04/17/2009 12:54 PM
post27385
|
Hi All,
Let me start telling my story...
So, today my process has hung on exit. I have attached to it with gdb and found that the thread is infinitely looping in
thread_pool_destroy(). It's 6.3.2, so there is older thread pool implementation (prior to commit #167031) with sleepons
. The thread is looping here
/*
Wait for everyone to die. We should probably not use the
default sleepon here, but it is such a low runner case.
*/
pthread_sleepon_lock();
while(pool->created > 0) {
pthread_sleepon_wait(pool);
}
pthread_sleepon_unlock();
it repeatedly calls _sleepon_wait(), that one allocates _sleepon_entry from heap, tries to initialize a new condvar for
it, pthread_cond_init fails with EBUSY, then it releases memory block back to heap exits pthread_sleepon_wait() and
loops again ad infinitum. All the other pool threads are blocked because the looping thread holds sleepon mutex.
So, the code is now gone and new implementation does not use sleepons. But how could _sleepon_entry be released without
having its condvar destroyed? I checked sleepon.c implementation but could not find any possibilities for that. Were
there any bugs like that fixed in commits after 6.3.2?
The only possibility for the situation to happen would be if SyncTypeCreate_r() for pthread_cond_init() could return
error status but leave condvar actually allocated (or partially allocated).
OK, I checked the kernel. I, actually, have never peeked into the kernel code and do not know it, but on first glance
everything looks fine for condvar allocation. Just in nano_sync.c in line 47
// Allocate a sync entry.
if((syp = object_alloc(prp, &sync_souls)) == NULL) {
kererr(act, EAGAIN);
return(NULL);
}
// Map the virtual address of the users sync to a physical addr.
// (flags & PTHREAD_PROCESSSHARED_MASK) may be useful to memmgr
// Note that we tell vaddr_to_memobj to mark this page as containing a
// sync object.
if((obp = memmgr.vaddr_to_memobj(prp, sync, &addr, 1)) == (OBJECT *)-1) {
<------------------------------------ HERE
kererr(act, EINVAL);
return NULL;
}
// Add the entry to the sync vector.
if(synchash_add(obp, addr, syp) == -1) {
object_free(prp, &sync_souls, syp);
kererr(act, EBUSY);
return(NULL);
}
would not it be good to call object_free() just like it is done in next conditional statement? Anyway, even if there is
a memory leak it happens before binding object to physical address and I doubt it could result in EBUSY from subsequent
calls. So the question "How could sleepon get into the state when it has lost a condvar for a released block?" remains
open.
Also, while examining thread pool code I've found one more cosmetic issue.
In thread_pool_destroy()...
/* destroy pool cond */
pthread_mutex_destroy(&(props->pool_cond));
/* destroy pool mutex */
pthread_mutex_destroy(&(props->inline_lock));
It's great that invocations of both pthread_cond_destroy() and pthread_mutex_destroy() result in the same kernel call,
but still, it would be better to call pthread_cond_destroy() for props->pool_cond as it is a condvar, not a mutex.
|
|
|
Colin Burgess(deleted)
04/17/2009 1:41 PM
post27396
|
If could be that someone forgot to release a synchronisation object. r161301 was such a case - a mutex was not being
destroyed before
the memory it was in was free()d.
Thanks for the pthread_mutex_destroy issue - you're right that it doesn't cause a problem but it's just a "happens
to work" thing, and we'll fix it.
I actually suggested to the libmalloc folks that they add sync object tracking to libmalloc so that when you
do a free it will warn if it contains a 'live' sync object, I'm not sure if they have actioned that yet.
Cheers,
Colin
Oleh Derevenko wrote:
> Hi All,
>
> Let me start telling my story...
> So, today my process has hung on exit. I have attached to it with gdb and found that the thread is infinitely looping
in thread_pool_destroy(). It's 6.3.2, so there is older thread pool implementation (prior to commit #167031) with
sleepons. The thread is looping here
>
> /*
> Wait for everyone to die. We should probably not use the
> default sleepon here, but it is such a low runner case.
> */
> pthread_sleepon_lock();
> while(pool->created > 0) {
> pthread_sleepon_wait(pool);
> }
> pthread_sleepon_unlock();
>
> it repeatedly calls _sleepon_wait(), that one allocates _sleepon_entry from heap, tries to initialize a new condvar
for it, pthread_cond_init fails with EBUSY, then it releases memory block back to heap exits pthread_sleepon_wait() and
loops again ad infinitum. All the other pool threads are blocked because the looping thread holds sleepon mutex.
>
> So, the code is now gone and new implementation does not use sleepons. But how could _sleepon_entry be released
without having its condvar destroyed? I checked sleepon.c implementation but could not find any possibilities for that.
Were there any bugs like that fixed in commits after 6.3.2?
> The only possibility for the situation to happen would be if SyncTypeCreate_r() for pthread_cond_init() could return
error status but leave condvar actually allocated (or partially allocated).
>
> OK, I checked the kernel. I, actually, have never peeked into the kernel code and do not know it, but on first glance
everything looks fine for condvar allocation. Just in nano_sync.c in line 47
>
> // Allocate a sync entry.
> if((syp = object_alloc(prp, &sync_souls)) == NULL) {
> kererr(act, EAGAIN);
> return(NULL);
> }
>
> // Map the virtual address of the users sync to a physical addr.
> // (flags & PTHREAD_PROCESSSHARED_MASK) may be useful to memmgr
> // Note that we tell vaddr_to_memobj to mark this page as containing a
> // sync object.
> if((obp = memmgr.vaddr_to_memobj(prp, sync, &addr, 1)) == (OBJECT *)-1) {
> <------------------------------------ HERE
> kererr(act, EINVAL);
> return NULL;
> }
>
> // Add the entry to the sync vector.
> if(synchash_add(obp, addr, syp) == -1) {
> object_free(prp, &sync_souls, syp);
> kererr(act, EBUSY);
> return(NULL);
> }
>
> would not it be good to call object_free() just like it is done in next conditional statement? Anyway, even if there
is a memory leak it happens before binding object to physical address and I doubt it could result in EBUSY from
subsequent calls. So the question "How could sleepon get into the state when it has lost a condvar for a released block?
" remains open.
>
> Also, while examining thread pool code I've found one more cosmetic issue.
>
> In thread_pool_destroy()...
>
> /* destroy pool cond */
> pthread_mutex_destroy(&(props->pool_cond));
> /* destroy pool mutex */
> pthread_mutex_destroy(&(props->inline_lock));
>
> It's great that invocations of both pthread_cond_destroy() and pthread_mutex_destroy() result in the same kernel call,
but still, it would be better to call...
View Full Message
|
|
|
Oleh Derevenko(deleted)
04/17/2009 2:02 PM
post27398
|
> If could be that someone forgot to release a synchronisation object. r161301
> was such a case - a mutex was not being destroyed before
> the memory it was in was free()d.
>
No, that could not be the case. I only have a single thread pool and I do not deleted it until program exit.
|
|
|
Colin Burgess(deleted)
04/17/2009 2:12 PM
post27400
|
Maybe not the same case, but it could be the same type of issue - the _sleepon_wait does a malloc, and there
is a forgotten live mutex/condvar in the memory returned.
Colin
Oleh Derevenko wrote:
>> If could be that someone forgot to release a synchronisation object. r161301
>> was such a case - a mutex was not being destroyed before
>> the memory it was in was free()d.
>>
>
> No, that could not be the case. I only have a single thread pool and I do not deleted it until program exit.
>
> _______________________________________________
> OSTech
> http://community.qnx.com/sf/go/post27398
>
--
cburgess@qnx.com
|
|
|
Elena Laskavaia
04/17/2009 2:19 PM
post27402
|
We have PR 63103 about supporting it in librcheck, the issue with implementation is to track what is an active object.
If kernel had an API get_all_active_mutexes
that would be easy to implement...
Colin Burgess wrote:
> If could be that someone forgot to release a synchronisation object. r161301 was such a case - a mutex was not being
destroyed before
> the memory it was in was free()d.
>
> Thanks for the pthread_mutex_destroy issue - you're right that it doesn't cause a problem but it's just a "happens
> to work" thing, and we'll fix it.
>
> I actually suggested to the libmalloc folks that they add sync object tracking to libmalloc so that when you
> do a free it will warn if it contains a 'live' sync object, I'm not sure if they have actioned that yet.
>
> Cheers,
>
> Colin
>
> Oleh Derevenko wrote:
>> Hi All,
>>
>> Let me start telling my story...
>> So, today my process has hung on exit. I have attached to it with gdb and found that the thread is infinitely looping
in thread_pool_destroy(). It's 6.3.2, so there is older thread pool implementation (prior to commit #167031) with
sleepons. The thread is looping here
>>
>> /*
>> Wait for everyone to die. We should probably not use the
>> default sleepon here, but it is such a low runner case.
>> */
>> pthread_sleepon_lock();
>> while(pool->created > 0) {
>> pthread_sleepon_wait(pool);
>> }
>> pthread_sleepon_unlock();
>>
>> it repeatedly calls _sleepon_wait(), that one allocates _sleepon_entry from heap, tries to initialize a new condvar
for it, pthread_cond_init fails with EBUSY, then it releases memory block back to heap exits pthread_sleepon_wait() and
loops again ad infinitum. All the other pool threads are blocked because the looping thread holds sleepon mutex.
>>
>> So, the code is now gone and new implementation does not use sleepons. But how could _sleepon_entry be released
without having its condvar destroyed? I checked sleepon.c implementation but could not find any possibilities for that.
Were there any bugs like that fixed in commits after 6.3.2?
>> The only possibility for the situation to happen would be if SyncTypeCreate_r() for pthread_cond_init() could return
error status but leave condvar actually allocated (or partially allocated).
>>
>> OK, I checked the kernel. I, actually, have never peeked into the kernel code and do not know it, but on first glance
everything looks fine for condvar allocation. Just in nano_sync.c in line 47
>>
>> // Allocate a sync entry.
>> if((syp = object_alloc(prp, &sync_souls)) == NULL) {
>> kererr(act, EAGAIN);
>> return(NULL);
>> }
>>
>> // Map the virtual address of the users sync to a physical addr.
>> // (flags & PTHREAD_PROCESSSHARED_MASK) may be useful to memmgr
>> // Note that we tell vaddr_to_memobj to mark this page as containing a
>> // sync object.
>> if((obp = memmgr.vaddr_to_memobj(prp, sync, &addr, 1)) == (OBJECT *)-1) {
>> <------------------------------------ HERE
>> kererr(act, EINVAL);
>> return NULL;
>> }
>>
>> // Add the entry to the sync vector.
>> if(synchash_add(obp, addr, syp) == -1) {
>> object_free(prp, &sync_souls, syp);
>> kererr(act, EBUSY);
>> return(NULL);
>> }
>>
>> would not it be good to call object_free() just like it is done in next conditional statement? Anyway, even if there
is a memory leak it happens before binding object to physical address and I doubt it could result in EBUSY from
subsequent calls. So the question "How could sleepon get into the state when it has lost a condvar for a released block?
" remains open.
>>
>> Also, while examining...
View Full Message
|
|
|
Oleh Derevenko(deleted)
04/18/2009 8:02 AM
post27428
|
> We have PR 63103 about supporting it in librcheck, the issue with
> implementation is to track what is an active object. If kernel had an API
> get_all_active_mutexes
> that would be easy to implement...
Be careful with that. 1st, the sync object list could be quite long. 2nd, only the kernel itself can check for the
object existance in the consistent manner.
This is similar to "why is not there a function to check if the file exist? - that's because that information can not be
reliable: the file could exist right now but in a moment it could already got deleted".
The only possible approach to this is to have a kernel API that would tell if trere are any sync objects within the
given address range.
|
|
|
Colin Burgess(deleted)
04/18/2009 2:25 PM
post27434
|
If you had custom versions of the pthread_mutex*, pthread_cond* functions, etc then
you could track the creation destruction of most of the objects. Objects that are within a region that is munmap()ed
are auto destroyed anyway, so it's only syn objects that are within the malloc arenas that are problematic.
________________________________
From: Oleh Derevenko [mailto:community-noreply@qnx.com]
Sent: Sat 4/18/2009 8:03 AM
To: ostech-core_os
Subject: Re: Multiple issues
> We have PR 63103 about supporting it in librcheck, the issue with
> implementation is to track what is an active object. If kernel had an API
> get_all_active_mutexes
> that would be easy to implement...
Be careful with that. 1st, the sync object list could be quite long. 2nd, only the kernel itself can check for the
object existance in the consistent manner.
This is similar to "why is not there a function to check if the file exist? - that's because that information can not be
reliable: the file could exist right now but in a moment it could already got deleted".
The only possible approach to this is to have a kernel API that would tell if trere are any sync objects within the
given address range.
_______________________________________________
OSTech
http://community.qnx.com/sf/go/post27428
|
|
|
|