foundry27 : Post

Forum Topic - Multiple issues: (7 Items)

View: as

Oleh Derevenko(deleted)

04/17/2009 12:54 PM

post27385

Hi All,

Let me start telling my story...
So, today my process has hung on exit. I have attached to it with gdb and found that the thread is infinitely looping in
 thread_pool_destroy(). It's 6.3.2, so there is older thread pool implementation (prior to commit #167031) with sleepons
. The thread is looping here

	/*
	 Wait for everyone to die.  We should probably not use the
	 default sleepon here, but it is such a low runner case.
	*/
	pthread_sleepon_lock();
	while(pool->created > 0) {
		pthread_sleepon_wait(pool);
	}
	pthread_sleepon_unlock();

it repeatedly calls _sleepon_wait(), that one allocates _sleepon_entry from heap, tries to initialize a new condvar for 
it, pthread_cond_init fails with EBUSY, then it releases memory block back to heap exits pthread_sleepon_wait() and 
loops again ad infinitum. All the other pool threads are blocked because the looping thread holds sleepon mutex.

So, the code is now gone and new implementation does not use sleepons. But how could _sleepon_entry be released without 
having its condvar destroyed? I checked sleepon.c implementation but could not find any possibilities for that. Were 
there any bugs like that fixed in commits after 6.3.2? 
The only possibility for the situation to happen would be if SyncTypeCreate_r() for pthread_cond_init() could return 
error status but leave condvar actually allocated (or partially allocated). 

OK, I checked the kernel. I, actually, have never peeked into the kernel code and do not know it, but on first glance 
everything looks fine for condvar allocation. Just in nano_sync.c in line 47

	// Allocate a sync entry.
	if((syp = object_alloc(prp, &sync_souls)) == NULL) {
		kererr(act, EAGAIN);
		return(NULL);
	}

	// Map the virtual address of the users sync to a physical addr.
	// (flags & PTHREAD_PROCESSSHARED_MASK) may be useful to memmgr
	// Note that we tell vaddr_to_memobj to mark this page as containing a
	// sync object.
	if((obp = memmgr.vaddr_to_memobj(prp, sync, &addr, 1)) == (OBJECT *)-1) {
		<------------------------------------ HERE
		kererr(act, EINVAL);
		return NULL;
	}

	// Add the entry to the sync vector.
	if(synchash_add(obp, addr, syp) == -1) {
		object_free(prp, &sync_souls, syp);
		kererr(act, EBUSY);
		return(NULL);
	}

would not it be good to call object_free() just like it is done in next conditional statement? Anyway, even if there is 
a memory leak it happens before binding object to physical address and I doubt it could result in EBUSY from subsequent 
calls. So the question "How could sleepon get into the state when it has lost a condvar for a released block?" remains 
open.

Also, while examining thread pool code I've found one more cosmetic issue.

In thread_pool_destroy()...

    /* destroy pool cond */
    pthread_mutex_destroy(&(props->pool_cond));
    /* destroy pool mutex */
    pthread_mutex_destroy(&(props->inline_lock));

It's great that invocations of both pthread_cond_destroy() and pthread_mutex_destroy() result in the same kernel call, 
but still, it would be better to call pthread_cond_destroy() for props->pool_cond as it is a condvar, not a mutex.

Colin Burgess(deleted)

04/17/2009 1:41 PM

post27396

Re: Multiple issues

If could be that someone forgot to release a synchronisation object.  r161301 was such a case - a mutex was not being 
destroyed before
the memory it was in was free()d.

Thanks for the pthread_mutex_destroy issue - you're right that it doesn't cause a problem but it's just a "happens
to work" thing, and we'll fix it.

I actually suggested to the libmalloc folks that they add sync object tracking to libmalloc so that when you
do a free it will warn if it contains a 'live' sync object, I'm not sure if they have actioned that yet.

Cheers,

Colin

Oleh Derevenko wrote:
> Hi All,
> 
> Let me start telling my story...
> So, today my process has hung on exit. I have attached to it with gdb and found that the thread is infinitely looping 
in thread_pool_destroy(). It's 6.3.2, so there is older thread pool implementation (prior to commit #167031) with 
sleepons. The thread is looping here
> 
> 	/*
> 	 Wait for everyone to die.  We should probably not use the
> 	 default sleepon here, but it is such a low runner case.
> 	*/
> 	pthread_sleepon_lock();
> 	while(pool->created > 0) {
> 		pthread_sleepon_wait(pool);
> 	}
> 	pthread_sleepon_unlock();
> 
> it repeatedly calls _sleepon_wait(), that one allocates _sleepon_entry from heap, tries to initialize a new condvar 
for it, pthread_cond_init fails with EBUSY, then it releases memory block back to heap exits pthread_sleepon_wait() and 
loops again ad infinitum. All the other pool threads are blocked because the looping thread holds sleepon mutex.
> 
> So, the code is now gone and new implementation does not use sleepons. But how could _sleepon_entry be released 
without having its condvar destroyed? I checked sleepon.c implementation but could not find any possibilities for that. 
Were there any bugs like that fixed in commits after 6.3.2? 
> The only possibility for the situation to happen would be if SyncTypeCreate_r() for pthread_cond_init() could return 
error status but leave condvar actually allocated (or partially allocated). 
> 
> OK, I checked the kernel. I, actually, have never peeked into the kernel code and do not know it, but on first glance 
everything looks fine for condvar allocation. Just in nano_sync.c in line 47
> 
> 	// Allocate a sync entry.
> 	if((syp = object_alloc(prp, &sync_souls)) == NULL) {
> 		kererr(act, EAGAIN);
> 		return(NULL);
> 	}
> 
> 	// Map the virtual address of the users sync to a physical addr.
> 	// (flags & PTHREAD_PROCESSSHARED_MASK) may be useful to memmgr
> 	// Note that we tell vaddr_to_memobj to mark this page as containing a
> 	// sync object.
> 	if((obp = memmgr.vaddr_to_memobj(prp, sync, &addr, 1)) == (OBJECT *)-1) {
> 		<------------------------------------ HERE
> 		kererr(act, EINVAL);
> 		return NULL;
> 	}
> 
> 	// Add the entry to the sync vector.
> 	if(synchash_add(obp, addr, syp) == -1) {
> 		object_free(prp, &sync_souls, syp);
> 		kererr(act, EBUSY);
> 		return(NULL);
> 	}
> 
> would not it be good to call object_free() just like it is done in next conditional statement? Anyway, even if there 
is a memory leak it happens before binding object to physical address and I doubt it could result in EBUSY from 
subsequent calls. So the question "How could sleepon get into the state when it has lost a condvar for a released block?
" remains open.
> 
> Also, while examining thread pool code I've found one more cosmetic issue.
> 
> In thread_pool_destroy()...
> 
>     /* destroy pool cond */
>     pthread_mutex_destroy(&(props->pool_cond));
>     /* destroy pool mutex */
>     pthread_mutex_destroy(&(props->inline_lock));
> 
> It's great that invocations of both pthread_cond_destroy() and pthread_mutex_destroy() result in the same kernel call,
 but still, it would be better to call...

View Full Message

Oleh Derevenko(deleted)

Re: Multiple issues

Oleh Derevenko(deleted)

04/17/2009 2:02 PM

post27398

Re: Multiple issues

> If could be that someone forgot to release a synchronisation object.  r161301 
> was such a case - a mutex was not being destroyed before
> the memory it was in was free()d.
> 

No, that could not be the case. I only have a single thread pool and I do not deleted it until program exit.

Colin Burgess(deleted)

04/17/2009 2:12 PM

post27400

Re: Multiple issues

Maybe not the same case, but it could be the same type of issue - the _sleepon_wait does a malloc, and there
is a forgotten live mutex/condvar in the memory returned.

Colin

Oleh Derevenko wrote:
>> If could be that someone forgot to release a synchronisation object.  r161301 
>> was such a case - a mutex was not being destroyed before
>> the memory it was in was free()d.
>>
> 
> No, that could not be the case. I only have a single thread pool and I do not deleted it until program exit.
> 
> _______________________________________________
> OSTech
> http://community.qnx.com/sf/go/post27398
> 

-- 
cburgess@qnx.com

Elena Laskavaia

Re: Multiple issues

Elena Laskavaia

04/17/2009 2:19 PM

post27402

Re: Multiple issues

We have PR 63103 about supporting it in librcheck, the issue with implementation is to track what is an active object. 
If kernel had an API get_all_active_mutexes
that would be easy to implement...


Colin Burgess wrote:
> If could be that someone forgot to release a synchronisation object.  r161301 was such a case - a mutex was not being 
destroyed before
> the memory it was in was free()d.
> 
> Thanks for the pthread_mutex_destroy issue - you're right that it doesn't cause a problem but it's just a "happens
> to work" thing, and we'll fix it.
> 
> I actually suggested to the libmalloc folks that they add sync object tracking to libmalloc so that when you
> do a free it will warn if it contains a 'live' sync object, I'm not sure if they have actioned that yet.
> 
> Cheers,
> 
> Colin
> 
> Oleh Derevenko wrote:
>> Hi All,
>>
>> Let me start telling my story...
>> So, today my process has hung on exit. I have attached to it with gdb and found that the thread is infinitely looping
 in thread_pool_destroy(). It's 6.3.2, so there is older thread pool implementation (prior to commit #167031) with 
sleepons. The thread is looping here
>>
>> 	/*
>> 	 Wait for everyone to die.  We should probably not use the
>> 	 default sleepon here, but it is such a low runner case.
>> 	*/
>> 	pthread_sleepon_lock();
>> 	while(pool->created > 0) {
>> 		pthread_sleepon_wait(pool);
>> 	}
>> 	pthread_sleepon_unlock();
>>
>> it repeatedly calls _sleepon_wait(), that one allocates _sleepon_entry from heap, tries to initialize a new condvar 
for it, pthread_cond_init fails with EBUSY, then it releases memory block back to heap exits pthread_sleepon_wait() and 
loops again ad infinitum. All the other pool threads are blocked because the looping thread holds sleepon mutex.
>>
>> So, the code is now gone and new implementation does not use sleepons. But how could _sleepon_entry be released 
without having its condvar destroyed? I checked sleepon.c implementation but could not find any possibilities for that. 
Were there any bugs like that fixed in commits after 6.3.2? 
>> The only possibility for the situation to happen would be if SyncTypeCreate_r() for pthread_cond_init() could return 
error status but leave condvar actually allocated (or partially allocated). 
>>
>> OK, I checked the kernel. I, actually, have never peeked into the kernel code and do not know it, but on first glance
 everything looks fine for condvar allocation. Just in nano_sync.c in line 47
>>
>> 	// Allocate a sync entry.
>> 	if((syp = object_alloc(prp, &sync_souls)) == NULL) {
>> 		kererr(act, EAGAIN);
>> 		return(NULL);
>> 	}
>>
>> 	// Map the virtual address of the users sync to a physical addr.
>> 	// (flags & PTHREAD_PROCESSSHARED_MASK) may be useful to memmgr
>> 	// Note that we tell vaddr_to_memobj to mark this page as containing a
>> 	// sync object.
>> 	if((obp = memmgr.vaddr_to_memobj(prp, sync, &addr, 1)) == (OBJECT *)-1) {
>> 		<------------------------------------ HERE
>> 		kererr(act, EINVAL);
>> 		return NULL;
>> 	}
>>
>> 	// Add the entry to the sync vector.
>> 	if(synchash_add(obp, addr, syp) == -1) {
>> 		object_free(prp, &sync_souls, syp);
>> 		kererr(act, EBUSY);
>> 		return(NULL);
>> 	}
>>
>> would not it be good to call object_free() just like it is done in next conditional statement? Anyway, even if there 
is a memory leak it happens before binding object to physical address and I doubt it could result in EBUSY from 
subsequent calls. So the question "How could sleepon get into the state when it has lost a condvar for a released block?
" remains open.
>>
>> Also, while examining...

View Full Message

Oleh Derevenko(deleted)

04/18/2009 8:02 AM

post27428

Re: Multiple issues

> We have PR 63103 about supporting it in librcheck, the issue with 
> implementation is to track what is an active object. If kernel had an API 
> get_all_active_mutexes
> that would be easy to implement...

Be careful with that. 1st, the sync object list could be quite long. 2nd, only the kernel itself can check for the 
object existance in the consistent manner.
This is similar to "why is not there a function to check if the file exist? - that's because that information can not be
 reliable: the file could exist right now but in a moment it could already got deleted". 
The only possible approach to this is to have a kernel API that would tell if trere are any sync objects within the 
given address range.

Colin Burgess(deleted)

04/18/2009 2:25 PM

post27434

RE: Multiple issues

If you had custom versions of the pthread_mutex*, pthread_cond* functions, etc then
you could track the creation destruction of most of the objects.  Objects that are within a region that is munmap()ed 
are auto destroyed anyway, so it's only syn objects that are within the malloc arenas that are problematic.

________________________________

From: Oleh Derevenko [mailto:community-noreply@qnx.com]
Sent: Sat 4/18/2009 8:03 AM
To: ostech-core_os
Subject: Re: Multiple issues

> We have PR 63103 about supporting it in librcheck, the issue with
> implementation is to track what is an active object. If kernel had an API
> get_all_active_mutexes
> that would be easy to implement...

Be careful with that. 1st, the sync object list could be quite long. 2nd, only the kernel itself can check for the 
object existance in the consistent manner.
This is similar to "why is not there a function to check if the file exist? - that's because that information can not be
 reliable: the file could exist right now but in a moment it could already got deleted".
The only possible approach to this is to have a kernel API that would tell if trere are any sync objects within the 
given address range.

_______________________________________________
OSTech
http://community.qnx.com/sf/go/post27428

Attachment:

winmail.dat 3.82 KB

Return

The text you entered is not a valid object ID
More Information
Object IDs begin with an object prefix and end with a number. For example, if you enter
artf2345
the application will jump directly to an artifact with the ID artf2345. Some valid object prefixes are:
artf	for an artifact
doc	for a document
page	for a project page
topc	for a discussion topic
wiki	for a wiki page