Socket stops being serviced...  
So, we have a pretty old version of QNX (6.2.0) which we have been using for quite some time.  We have a multi-platform 
C++ OS interface for all of our common OS calls: socket I/O, file I/O, IPC, timers, the works.  We have been using this 
for many years, so most of the bugs have been worked out pretty well.

So, we have a bug that we have only seen on the QNX side of things, which we have tried to address on a couple of 
occasions.  We have two programs running on the same board: one that drives the system, and one that simulates some I/O 
using a socket interface, injecting signal values into our drive software.  The rate at which we push data into the 
socket is pretty high (25 Hz or so), and the messages are what I would consider medium-sized (maybe 2K?).  We have not 
seen this issue when running our drive software, only in simulation.

What seems to happen is that the drive software stops servicing the socket to the simulation.  The simulation's socket 
send queue fills up until we run out of memory, which causes the application to lock up the board.

We have implemented several "fixes".  I am able to detect the send queue filling, so I have tried closing the socket.  
This generates a broken pipe on the simulation side; sometimes the system is able to recover, and sometimes it locks up 
the board.  It usually takes several hours for the anomaly to occur.
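
One way to notice that the peer has stopped draining the socket, without letting a queue grow until memory runs out, is 
to send in non-blocking mode and count refused sends.  The sketch below assumes a POSIX stream socket; the names 
make_nonblocking, try_send and StallDetector are hypothetical and are not part of the OS interface described above.  It 
is written in plain C++98 on the assumption that the QNX 6.2.0 toolchain is an older compiler.

```cpp
#include <cassert>
#include <cerrno>
#include <stddef.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Outcome of one non-blocking send attempt.
enum SendResult { SEND_OK, SEND_WOULD_BLOCK, SEND_ERROR };

// Put the descriptor into non-blocking mode so send() returns immediately
// instead of waiting on a peer that has stopped reading.
inline bool make_nonblocking(int fd) {
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return false;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Try to send one message.  *sent receives the number of bytes accepted,
// which can be less than len on a stream socket (the caller must resend
// the remainder).  SEND_WOULD_BLOCK means the socket send buffer is full.
inline SendResult try_send(int fd, const char* data, size_t len, size_t* sent) {
    *sent = 0;
    ssize_t n = send(fd, data, len, 0);
    if (n >= 0) {
        *sent = static_cast<size_t>(n);
        return SEND_OK;
    }
    if (errno == EAGAIN || errno == EWOULDBLOCK) return SEND_WOULD_BLOCK;
    return SEND_ERROR;
}

// Hypothetical policy: report a stall after `limit` consecutive refused sends.
class StallDetector {
public:
    explicit StallDetector(int limit) : limit_(limit), consecutive_(0) {}

    // Feed in each send result; returns true once the stall limit is reached.
    bool record(SendResult r) {
        if (r == SEND_WOULD_BLOCK) ++consecutive_;
        else consecutive_ = 0;
        return consecutive_ >= limit_;
    }

private:
    int limit_;
    int consecutive_;
};
```

At 25 Hz, a limit of 25 would flag the stall after roughly one second of refused sends.  At that point the simulation 
can drop or overwrite stale signal values instead of queueing them, which bounds memory use while leaving time to 
collect diagnostics.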

This issue could be caused by a number of things (a mutex/semaphore deadlock in the drive software is one suspect, or 
maybe we are running out of other resources, like socket file handles?).  I can add some debug on the simulation side, 
but I am looking for a good way to debug the issue by adding calls when I detect the send queue filling on the 
simulation side.  What things should I look at?  Is there any way to detect a mutex/semaphore deadlock from outside the 
application?  I obviously can't add much debug in the semaphore/mutex code without generating a ton of debug output, 
since it takes hours to reproduce.

So, what I am looking for is good things to check from the simulation software, or otherwise.  Once we detect the 
condition, we have more than a few seconds (maybe a minute) in which to debug before the board becomes non-functional.
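
Since there is a window of up to a minute once the condition is detected, one option is to have the simulation snapshot 
system state to a log file at that moment.  The sketch below is a minimal helper built on POSIX popen(); the function 
names capture_command and log_snapshot are hypothetical, and it assumes that popen() and a shell are available on the 
target and that enough memory remains to start a child process.

```cpp
#include <cassert>
#include <stdio.h>
#include <string>

// Run a shell command and return everything it wrote to stdout.
// Returns an empty string if the command could not be started.
inline std::string capture_command(const char* cmd) {
    assert(cmd != 0);
    std::string out;
    FILE* pipe = popen(cmd, "r");
    if (!pipe) return out;
    char buf[512];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, pipe)) > 0) {
        out.append(buf, n);
    }
    pclose(pipe);
    return out;
}

// Append a labelled snapshot of a command's output to a log file.
// Returns false if the log file could not be opened or written.
inline bool log_snapshot(const char* log_path, const char* label, const char* cmd) {
    std::string text = capture_command(cmd);
    FILE* log = fopen(log_path, "a");
    if (!log) return false;
    fprintf(log, "==== %s: %s ====\n", label, cmd);
    fwrite(text.data(), 1, text.size(), log);
    fputc('\n', log);
    return fclose(log) == 0;
}
```

On QNX, a call such as log_snapshot("/tmp/stall.log", "threads", "pidin") would record the state of every thread, 
assuming the pidin utility is on the target image.  Threads of the drive software shown blocked in the MUTEX or SEM 
state across repeated snapshots would point towards a deadlock, and a second snapshot using netstat would show the 
socket queues.  The log path here is only an example.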

Thanks