Forum Topic - Unblock pulse again: (5 Items)

Oleh Derevenko(deleted)

10/05/2009 3:10 AM

post39347

http://community.qnx.com/sf/discussion/do/listPosts/projects.core_os/discussion.newcode.topc9212

Guys, I've kept silent for quite a long time, but the issue is not solved and it's definitely an issue in the QNX kernel.
I know it's hard to believe, but if you like I can capture a video and upload it to YouTube so you can see for yourselves. In GDB I just create a file wrapper object by calling `new', and in the next step I call a method that invokes open(). That's all. And that open() call returns EINTR immediately, because the server has the unblock pending flag initially set in msginfo for the new OCB and it aborts the open request because of that.
This only happens when the client and server are on different nodes; it never happens when they are on the same node.

Colin Burgess(deleted)

10/05/2009 9:46 AM

post39368

Re: Unblock pulse again

In the log output you sent, there are no instances of KER_EXIT:MSG_RECEIVEV that had (info->flags & _NTO_MI_UNBLOCK_REQ) == _NTO_MI_UNBLOCK_REQ.

I see _NTO_MI_ENDIAN_BIG, _NTO_MI_NET_CRED_DIRTY, and 0.

Can you send me the original log and point out the event number where you see a problem?

-- 
cburgess@qnx.com

Oleh Derevenko(deleted)

10/05/2009 4:36 PM

post39403

Re: Unblock pulse again

Colin, please check your mail.
I've re-created the problem once again and sent the tracelogger files to you.

Thank you for taking a look at this.

> Can you send me the original log and point out the event number where you see 
> a problem?
>

Oleh Derevenko(deleted)

10/06/2009 5:27 PM

post39465

Re: Unblock pulse again

Hi Colin,

I've recreated the problem as you asked. I've also added some TraceEvent() calls. However, this is the first time I've done anything like that, so I hope I did everything correctly.
Check your mail for the file download link.

Oleh Derevenko
-- ICQ: 36361783


----- Original Message ----- 
Subject: Re: Unblock pulse with multithreaded RM


Yes, I do see MsgReceivev exiting with _NTO_MI_UNBLOCK_REQ set.

The fact that the debugger is attached is making the log mostly silent - can you
do the trace with the debugger not attached?

Also, you could add in some user events to annotate the log
by using the TraceEvent() kernel calls.

Tracking send/receive/reply across qnet is tricky in these kernel logs... :-)

Cheers,

Colin

Oleh Derevenko(deleted)

04/06/2010 10:04 AM

post51203

Re: Unblock pulse again

Hi,

> http://community.qnx.com/sf/discussion/do/listPosts/projects.core_os/
> discussion.newcode.topc9212
> 
> Guys, I've been keeping silence for quite a long time but the issue is not 
> solved and it's definitely an issue in the QNX kernel.
> I know it's hard to believe but if you like I can capture video and upload it 
> to youtube for you to see yourselves. In GDB I just create a file wrapper 
> object with calling `new' and with next step I call a method that invokes 
> open(). That's all. And that open() call returns EINTR immediately because the 
> server has unblock pending flag initially set in msginfo for a new OCB and it 
> aborts the open request because of that.
> This only happens when client/server are on different nodes and this never 
> happens on the same one.


This is not exactly the same issue, but it seems to me to be related somehow. At least it is reproduced with similar actions.
This time the kernel seems to lock up threads in network requests when a read/write request is unblocked with a signal and the file is closed right after that.
Roughly, it looks like the following example:
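------------
Worker thread
------------
{
  ...
  read(fd, buf, sizeof(buf));  // blocked in a request to the remote server
  ...
}

------------
Aborting thread
------------
{
  pthread_kill(worker_thread_tid, SIGINT);  // unblock the worker's read()
  close(fd);                                // close right away, without waiting
}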

With this approach the kernel locks up threads very quickly.
I've discovered that waiting for the worker thread to leave the request works around the problem (at least at first glance). That is, if I do it like this, the problem seems to go away.
------------
Worker thread
------------
{
  ...
  pthread_mutex_lock(&request_mutex);
  read(fd, buf, sizeof(buf));
  pthread_mutex_unlock(&request_mutex);
  ...
}

------------
Aborting thread
------------
{
  pthread_kill(worker_thread_tid, SIGINT);
  // Wait until the worker thread has left read()
  pthread_mutex_lock(&request_mutex);
  pthread_mutex_unlock(&request_mutex);
  close(fd);
}

This makes me think that the kernel may not order the unblock pulse and the close request correctly: it may handle the close request before the unblock (in the wrong order), or leave some state not cleaned up properly after the unblock pulse is discarded by the close. This is, of course, pure guesswork, but I thought I'd better let you know.
Also, as I've already mentioned at the very beginning, waiting for the worker thread to leave the request does not solve the original problem of this forum thread. So the two might be related, but they are not exactly the same.

QNX 6.3.0SP3 x86 with patch 630SP2-0284
