TCP stream socket send() thread safety  
Hello

Sorry if I've chosen the wrong forum and/or if my story is not a priority at this time, but I would like to attract some more attention to the problem.

A few years ago I had a problem reported (PR24873) regarding socket send() usage. If there were several provider threads sending data blocks over a single socket, those blocks might arrive at the receiver with their data intermixed. That is, the data transfer was not atomic. I worked around the problem by locking a mutex around each send() invocation. A year and something later I was happy to find my PR in the resolved list of one of the cumulative pre-SP3 patches (patch 234). Naturally, my support plan had already expired by then.
Well, I removed the mutex and the data did not intermix any more. However, a new problem appeared. When the program was launched and there were numerous data transfers from multiple threads, the whole networking subsystem stalled. Only a few data blocks could get through, and the receiver detected a timeout of 10 seconds(!) and disconnected. Then, after some delay, it tried to reconnect, initiated startup once again, and the same story was repeated over and over. Only after 10-15 minutes could the system stabilize and transition to normal functioning. And if there were load peaks later at runtime, the scenario of networking denial of service and disconnects/reconnects could repeat.

Obviously, the patch was not that good. Perhaps somebody wrapped too much code in a mutex and introduced a bottleneck. But since I did not have a support plan any more, my further letters were happily ignored. So I had to sigh and uncomment my mutex around send() again. This is the way I have been running it to this day in the "fast, robust and reliable operating system".

So, if you are now doing active development in the networking area, maybe somebody could take a look at this issue (if the code has not been completely rewritten yet, of course)?
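
A minimal sketch of the mutex workaround described above (the wrapper name is illustrative; a production version would also loop on partial sends):
=== begin ===
#include <pthread.h>
#include <sys/types.h>
#include <sys/socket.h>

/* One process-wide lock; holding it across the whole send() keeps the
   bytes of one data block contiguous on the wire. */
static pthread_mutex_t send_mutex = PTHREAD_MUTEX_INITIALIZER;

ssize_t locked_send(int sock, const void *buf, size_t len, int flags)
{
    ssize_t rc;

    pthread_mutex_lock(&send_mutex);
    /* Note: send() on a stream socket may accept fewer than len bytes;
       a robust version would loop here until the whole block is queued. */
    rc = send(sock, buf, len, flags);
    pthread_mutex_unlock(&send_mutex);
    return rc;
}
=== end ===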
Re: TCP stream socket send() thread safety  
On Thu, Nov 29, 2007 at 05:58:51AM -0500, Oleh Derevenko wrote:
> [...]

I remember this issue.  tcp is a stream protocol and
therefore is not atomic.  I couldn't find a lot of direction
in any spec in this area at the time so I made it more
'intuitive'.  Two scheduling issues come into play when a
send() has to block.  First, the threads block on a
particular socket; second, when they wake up they are
rescheduled according to the client's priority.  The issue
you were seeing was a fifo vs lifo issue when waking up
threads blocked on a particular socket.  The initial unblock
at the socket level now comes out in the order you expect
but if the requesting threads are of different priority the
higher one can still preempt the lower.  That is, it works
as you expect because all your threads are at the same
priority.  Note this was fixed without any extra mutexes.

Your new issue sounds like you may be exhausting the number
of 'threads' in the stack.  Are threads SEND-blocked on
io-net in this situation?  Is there a sloginfo entry to this
effect (you must be running the latest patch)?  You can increase
the number of stack 'threads' as follows:

# io-net -ptcpip threads_max=400

-seanb
Re: TCP stream socket send() thread safety  
> I remember this issue.  tcp is a stream protocol and
> therefore is not atomic.  I couldn't find a lot of direction
> in any spec in this area at the time so I made it more
> 'intuitive'.  

Well, you should agree that if I send data envelopes from multiple threads, it is natural to expect them to arrive at the client intact (even though their order may be unpredictable). At least the socket implementation in Windows acts like this.
And also, you should agree that serializing access to the socket in the client application is quite inefficient. send() is simply translated into MsgSend(), and I do not see any reason why several MsgSends could not be invoked in parallel and serialized in the server process if necessary.

> higher one can still preempt the lower.  That is, it works
> as you expect because all your threads are at the same
> priority.  

That's pretty sad to find out. :( Why can't you lock a mutex while the data is being put into the output buffer (I do not know how it is implemented, of course)? If a higher-priority thread becomes ready while a lower-priority thread holds the mutex, it will block on the mutex and temporarily raise the priority of the first thread. There would be a small delay for the high-priority thread, but send() would act atomically.
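
For illustration, the POSIX way to request that behaviour is a mutex created with the priority-inheritance protocol (a sketch only; how the stack actually guards its output buffers is unknown here):
=== begin ===
#include <pthread.h>

pthread_mutex_t buffer_mutex;

void init_buffer_mutex(void)
{
    pthread_mutexattr_t attr;

    pthread_mutexattr_init(&attr);
    /* With PTHREAD_PRIO_INHERIT, a low-priority thread holding the
       mutex is boosted to the priority of the highest-priority thread
       blocked on it, so the high-priority sender waits only briefly. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&buffer_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}
=== end ===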

> Your new issue sounds like you may be exhausting the number
> of 'threads' in the stack.  Are threads SEND-blocked on
> io-net in this situation?  Is there a sloginfo entry to this
> effect (you must be running the latest patch)?  You can increase
> the number of stack 'threads' as follows:
> 
> # io-net -ptcpip threads_max=400

It was about 18 months ago and I did not check the state of the threads. The documentation says there is a 200-thread limit by default. My inspections of the sender process showed 120-130 threads running. And last but not least, we had never seen that "denial of service" problem before the patch, even though we had been running without send() serialized for quite a long time before we discovered it could intermix the data.

I can remove the mutex and make some experiments in the next few days to see what the state of the threads is, if you would like.
Re: TCP stream socket send() thread safety  
On Thu, Nov 29, 2007 at 10:29:30AM -0500, Oleh Derevenko wrote:
> > I remember this issue.  tcp is a stream protocol and
> > therefore is not atomic.  I couldn't find a lot of direction
> > in any spec in this area at the time so I made it more
> > 'intuitive'.  
> 
> Well, you should agree that if I send data envelopes from multiple
> threads, it is natural to expect them to arrive at the client intact
> (even though their order may be unpredictable). At least the socket
> implementation in Windows acts like this.
> And also, you should agree that serializing access to the socket in
> the client application is quite inefficient. send() is simply
> translated into MsgSend(), and I do not see any reason why several
> MsgSends could not be invoked in parallel and serialized in the
> server process if necessary.

Since tcp is a stream, the protocol has no concept of
envelopes or boundaries.  Any such concept is built on
top of it at the application level.
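
A minimal sketch of such application-level framing (a simple length prefix; illustrative only, not something the stack provides). Note the sender side must still put each envelope on the wire atomically, e.g. with one serialized send() per envelope:
=== begin ===
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Read exactly len bytes, looping over short reads. */
static int read_full(int sock, void *buf, size_t len)
{
    char *p = buf;
    while (len > 0) {
        ssize_t n = recv(sock, p, len, 0);
        if (n <= 0)
            return -1;            /* error or connection closed */
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Receive one length-prefixed envelope into buf (capacity cap);
   returns the payload length, or -1 on error or oversized envelope. */
ssize_t recv_envelope(int sock, void *buf, size_t cap)
{
    uint32_t netlen;

    if (read_full(sock, &netlen, sizeof netlen) != 0)
        return -1;
    uint32_t len = ntohl(netlen);
    if (len > cap)
        return -1;
    return read_full(sock, buf, len) == 0 ? (ssize_t)len : -1;
}
=== end ===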

> 
> > higher one can still preempt the lower.  That is, it works
> > as you expect because all your threads are at the same
> > priority.  
> 
> That's pretty sad to find out. :( Why can't you lock a mutex while
> the data is being put into the output buffer (I do not know how it
> is implemented, of course)? If a higher-priority thread becomes
> ready while a lower-priority thread holds the mutex, it will block
> on the mutex and temporarily raise the priority of the first thread.
> There would be a small delay for the high-priority thread, but
> send() would act atomically.

The handling of the MsgSends is serialized in the stack.
This situation arises when the send buffer fills up and
the send(), write(), sendto(), sendmsg() in the client has
to block.  When the send buffer drains we can continue
processing requests on this particular socket.  Since tcp
is a stream, which request should be processed first?  Since
sockets have no concept like PIPE_BUF (that I could find),
it seems logical that the highest priority request should
win.

> 
> [...]
> 
> It was about 18 months ago and I did not check the state of the
> threads. The documentation says there is a 200-thread limit by
> default. My inspections of the sender process showed 120-130 threads
> running. And last but not least, we had never seen that "denial of
> service" problem before the patch, even though we had been running
> without send() serialized for quite a long time before we discovered
> it could intermix the data.
> 
> I can remove the mutex and make some experiments in the next few
> days to see what the state of the threads is, if you would like.

I'm pretty confident that the changes for this issue wouldn't
in themselves introduce an issue like this.

-seanb
Re: TCP stream socket send() thread safety  
> The handling of the MsgSends is serialized in the stack.
> This situation arises when the send buffer fills up and
> the send(), write(), sendto(), sendmsg() in the client has
> to block.  When the send buffer drains we can continue
> processing requests on this particular socket.  Since tcp
> is a stream, which request should be processed first?  Since
> sockets have no concept like PIPE_BUF (that I could find),
> it seems logical that the highest priority request should
> win.

It may be logical from the scheduler's point of view, but it is completely senseless from the point of view of the socket's functionality. The client has no possibility of parsing the data successfully after a preemption like that, and this makes the socket inapplicable for multithreaded use. Even though there may not be a concept like PIPE_BUF for the socket, its functionality should be "user friendly". Who wins from strict adherence to priority rules if the data is corrupted as a result? Can you show me at least one use case where there would be a benefit for the client from having unrelated data inserted inside its data block?

> I'm pretty confident that the changes for this issue wouldn't
> in themselves introduce an issue like this.

Well, I'll try to find out what the state of the threads is when the connection is going down. However, I'm running 6.3.0 SP3. I can't ruin my development/testing environment by upgrading it to 6.3.2. I can try the September 6.3.2 release on one node at most. And I can't use the latest M2 build at all, because the kernel crashes after my application is started, even though it is a pure user-mode application without any privileged access to ports or hardware.
Re: TCP stream socket send() thread safety  
On Thu, Nov 29, 2007 at 11:33:51AM -0500, Oleh Derevenko wrote:
> [...]

The fixes for this were local to npm-tcpip.so so you
shouldn't need to do anything except run the latest stack
and get rid of your mutex.

-seanb
Re: TCP stream socket send() thread safety  
Hi, Sean
> 
> The fixes for this were local to npm-tcpip.so so you
> shouldn't need to do anything except run the latest stack
> and get rid of your mutex.

I have downloaded corenet-6.4.0-M0.tar.gz, but there is no npm-tcpip.so in it. Should I use the .so library from the September 2007 build of 6.3.2?

Re: TCP stream socket send() thread safety  
On Fri, Nov 30, 2007 at 06:00:57AM -0500, Oleh Derevenko wrote:
> Hi, Sean
> > 
> > The fixes for this were local to npm-tcpip.so so you
> > shouldn't need to do anything except run the latest stack
> > and get rid of your mutex.
> 
> I have downloaded corenet-6.4.0-M0.tar.gz, but there is no
> npm-tcpip.so in it. Should I use the .so library from the September
> 2007 build of 6.3.2?

The networking project on foundry 27 doesn't contain io-net
code.  Apparently 6.3.2 has the latest npm-tcpip.so that
contains this fix but you can also get it here:

http://www.qnx.com/download/feature.html?programid=13008


-seanb
Re: TCP stream socket send() thread safety  
> The networking project on foundry 27 doesn't contain io-net
> code.  Apparently 6.3.2 has the latest npm-tcpip.so that
> contains this fix but you can also get it here:
> 
> http://www.qnx.com/download/feature.html?programid=13008

I'm running 6.3.0 SP3, and this is a pre-SP3 patch which is already included in SP3.

Re: TCP stream socket send() thread safety  
> The handling of the MsgSends is serialized in the stack.
> This situation arises when the send buffer fills up and
> the send(), write(), sendto(), sendmsg() in the client has
> to block.  When the send buffer drains we can continue
> processing requests on this particular socket.  Since tcp
> is a stream, which request should be processed first?  Since
> sockets have no concept like PIPE_BUF (that I could find),
> it seems logical that the highest priority request should
> win.

And also, please consider that the situation with send buffer overflow is just an implementation limitation. In theory, the send buffer should be considered infinite, and every new thread should just queue its data at the end. So, if an operation that should normally be atomic is suspended because of physical limitations, it should have priority on resume. All the other threads which have not started their operations yet can be judged by their priority (you need not preserve the thread acceptance order for the data), but you must not discriminate against a thread depending on its luck in arriving at an empty buffer rather than at a buffer which is nearly full.
Re: TCP stream socket send() thread safety  
On Thu, Nov 29, 2007 at 12:03:18PM -0500, Oleh Derevenko wrote:
> [...]

SO_SNDBUF is a documented, real limitation.  Currently the
requirement for not having data interleaved when multiple
threads write simultaneously on the same stream socket is to
run them at the same priority.  Note even this might not be
portable, as it is ascribing undocumented characteristics to
stream sockets.  I don't think this is a bug.  If you can
point out some spec, prior art or example (that doesn't work
by accident) that I may have missed I'll be happy to
reevaluate.
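
(For reference, the SO_SNDBUF limit can be queried and raised per socket; a minimal sketch, with an arbitrary example size:)
=== begin ===
#include <stdio.h>
#include <sys/socket.h>

void tune_sndbuf(int sock)
{
    int sndbuf = 0;
    socklen_t optlen = sizeof sndbuf;

    /* Query the current send-buffer limit for this socket. */
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen) == 0)
        printf("SO_SNDBUF = %d bytes\n", sndbuf);

    /* Request a larger buffer (the system may clamp the value). */
    sndbuf = 256 * 1024;
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf);
}
=== end ===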

Regards,

-seanb
Re: TCP stream socket send() thread safety  
> SO_SNDBUF is a documented, real limitation.  Currently the
> requirement for not having data interleaved when multiple
> threads write simultaneously on the same stream socket is to
> run them at the same priority.  Note even this might not be
> portable, as it is ascribing undocumented characteristics to
> stream sockets.  I don't think this is a bug.  If you can
> point out some spec, prior art or example (that doesn't work
> by accident) that I may have missed I'll be happy to
> reevaluate.

OK, I agree, I was wrong. Not because scheduling the highest-priority thread is more important than preserving send block atomicity, and not because I found any specification on this topic. Rather, I realized that it would be questionable to provide such atomicity given the absence of any limit on the maximum length of a send() buffer. If a thread can push many megabytes of data in a single call, it should not block all the other threads waiting until all that data is transmitted.

Perhaps what I need is SOCK_SEQPACKET mode. But it is not implemented. :(
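
For comparison, where SOCK_SEQPACKET is implemented (many systems offer it at least for local sockets), each send() delivers one complete, bounded record; a hypothetical sketch:
=== begin ===
#include <sys/socket.h>

/* Each write on one end arrives as one whole record on the other,
   so records from different threads cannot interleave. */
int make_seqpacket_pair(int sv[2])
{
    return socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv);
}
=== end ===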
Re: TCP stream socket send() thread safety  
By the way, here is an interesting read on the topic, provided to me in one of the newsgroups:
http://www.almaden.ibm.com/cs/people/marksmith/sendmsg.html
Re: TCP stream socket send() thread safety  
> [...]
> 
> I'm pretty confident that the changes for this issue wouldn't
> in themselves introduce an issue like this.

Well, if the thread pool maximum is for the whole system (and it is, since there is only one io-net process, and I would not believe there could be a personal thread pool for every socket :) ), I can assume the possibility of having more than 200 threads across several processes together. But 10 seconds is nearly an infinity! Several threads over 200 can't create a delay like that, considering the fact that each of them sends just one data envelope and blocks afterwards.
Re: TCP stream socket send() thread safety  
On Thu, Nov 29, 2007 at 12:12:14PM -0500, Oleh Derevenko wrote:
> [...]

The 'threads_max' argument to the stack controls how many
blocking operations the stack can service simultaneously.
They may be read(), write(), accept() ...

It's just an educated guess at this point.  Look for clients
SEND blocked on io-net and the aforementioned sloginfo
entry.

Thanks,

-seanb
Re: TCP stream socket send() thread safety  
> The 'threads_max' argument to the stack controls how many
> blocking operations the stack can service simultaneously.
> They may be read(), write(), accept() ...
> 
> It's just an educated guess at this point.  Look for clients
> SEND blocked on io-net and the aforementioned sloginfo
> entry.

So far, I have tried it with my current system (that is, 6.3.0 SP3).
To be able to reproduce it, I had to plug the ethernet back into the 100Mbit hub, just as we were running last year, because with the 1Gbit switch it did not reproduce at once.

When the connections start flashing offline/online, there are no threads in the SEND-blocked state at all.

-----------------------------ONE NODE-----------------------------
# ps -A | grep io-net
     77841 ?        00:00:30 io-net
# pidin -p 376859
     pid tid name               prio STATE       Blocked
  376859   1 ../bin/qnxip.bin    10o MUTEX       376859-06 #1
  376859   2 ../bin/qnxip.bin    10o CONDVAR     8126bb4
  376859   3 ../bin/qnxip.bin    10o CONDVAR     812666c
  376859   4 ../bin/qnxip.bin    10o RECEIVE     2
  376859   5 ../bin/qnxip.bin    10o RECEIVE     5
  376859   6 ../bin/qnxip.bin    10o REPLY       77841
  376859   7 ../bin/qnxip.bin    10o CONDVAR     824f874
  376859   8 ../bin/qnxip.bin    10o REPLY       77841
  376859   9 ../bin/qnxip.bin    10o CONDVAR     81eca14
  376859  10 ../bin/qnxip.bin    10o CONDVAR     81fe80c
  376859  11 ../bin/qnxip.bin    10o CONDVAR     82abef4
  376859  12 ../bin/qnxip.bin    10o CONDVAR     81fe73c
  376859  13 ../bin/qnxip.bin    10o CONDVAR     81ec604
  376859  14 ../bin/qnxip.bin    10o CONDVAR     81ec32c
  376859  15 ../bin/qnxip.bin    10o CONDVAR     81391f4
  376859  16 ../bin/qnxip.bin    10o CONDVAR     81f1944
  376859  17 ../bin/qnxip.bin    10o CONDVAR     81f159c
  376859  18 ../bin/qnxip.bin    10o CONDVAR     81f10bc
  376859  19 ../bin/qnxip.bin    10o CONDVAR     81f1394
  376859  20 ../bin/qnxip.bin    10o CONDVAR     81f1b4c
  376859  21 ../bin/qnxip.bin    10o CONDVAR     8201944
  376859  22 ../bin/qnxip.bin    10o CONDVAR     81febb4
  376859  23 ../bin/qnxip.bin    10o CONDVAR     8201f5c
  376859  24 ../bin/qnxip.bin    10o CONDVAR     8201ae4
  376859  25 ../bin/qnxip.bin    10o CONDVAR     8201c1c
  376859  26 ../bin/qnxip.bin    10o CONDVAR     82161f4
  376859  27 ../bin/qnxip.bin    10o CONDVAR     82234cc
  376859  28 ../bin/qnxip.bin    10o CONDVAR     8223e8c
  376859  29 ../bin/qnxip.bin    10o CONDVAR     8223944
  376859  30 ../bin/qnxip.bin    10o CONDVAR     8216ae4
  376859  31 ../bin/qnxip.bin    10o CONDVAR     8216ef4
  376859  32 ../bin/qnxip.bin    10o CONDVAR     8228124
  376859  33 ../bin/qnxip.bin    10o CONDVAR     8228464
  376859  34 ../bin/qnxip.bin    10o CONDVAR     8235604
  376859  36 ../bin/qnxip.bin    30o CONDVAR     81ec8dc
  376859  37 ../bin/qnxip.bin    10o CONDVAR     8228bb4
  376859  40 ../bin/qnxip.bin    10o CONDVAR     8239bb4
  376859  41 ../bin/qnxip.bin    30o CONDVAR     8139dbc
  376859  44 ../bin/qnxip.bin    10o REPLY       376881
  376859  45 ../bin/qnxip.bin    30o CONDVAR     81bab4c
  376859  46 ../bin/qnxip.bin    10o REPLY       376852
  376859  47 ../bin/qnxip.bin    10o REPLY       376887
  376859  48 ../bin/qnxip.bin    10o REPLY       376884
  376859  50 ../bin/qnxip.bin    10o REPLY       376890
  376859  53 ../bin/qnxip.bin    10o CONDVAR     82639ac
  376859  54 ../bin/qnxip.bin    10o CONDVAR     82666d4
  376859  55 ../bin/qnxip.bin    30o CONDVAR     81ade8c
  376859  56 ../bin/qnxip.bin    10o CONDVAR     82661f4
  376859  63 ../bin/qnxip.bin    10o CONDVAR     827be24
  376859  65 ../bin/qnxip.bin    30o CONDVAR     82b9604
  376859  67 ../bin/qnxip.bin    10o CONDVAR     82ab0bc
  376859  68 ../bin/qnxip.bin    10o CONDVAR     827fd54
  376859  69 ../bin/qnxip.bin    10o CONDVAR     82abb4c
  376859  70 ../bin/qnxip.bin    10o CONDVAR     82af18c
  376859  71...
Re: TCP stream socket send() thread safety  
What is strange about all this is that my sshd connection was quite fine and I did not see any delays. This is an argument against the assumption that the communication problems could be caused by excessive collisions in the hub.
Another thing is that io-net still contains only 10 threads. As far as I understand, it should have allocated more threads in case of throughput problems.
Re: TCP stream socket send() thread safety  
On Sun, Dec 02, 2007 at 08:30:59AM -0500, Oleh Derevenko wrote:
> What is strange about all this is that my sshd connection was quite
> fine and I did not see any delays. This is an argument against the
> assumption that the communication problems could be caused by
> excessive collisions in the hub.

If tcp connections seem fine I'd check your qnet access as
that appeared to be in use in the previous pidin trace.

> Another thing is that io-net still contains only 10 threads. As far
> as I understand, it should have allocated more threads in case of
> throughput problems.

'threads_max' in the stack context is really a misnomer.
What's actually increased is the number of co-routines the
stack allocates to handle message requests.  The use of
the term 'threads' here is really a holdover from QNX4.

-seanb
Re: TCP stream socket send() thread safety  
On Fri, Nov 30, 2007 at 11:01:02AM -0500, Oleh Derevenko wrote:
> [...]

You're close to the default 'threads_max' value on one node
but they don't seem to all be doing socket operations.  You
can check the sloginfo output to be sure but it doesn't look
like you're hitting this limit.

You can try exercising subsystems: does localhost work, can
you ping offnode, does qnet offnode work?  Anything
different in the 'netstat -s' or 'cat /proc/qnetstats'
output from when it works vs when you're in failure mode?

-seanb
Re: TCP stream socket send() thread safety  
Hi

> You're close to the default 'threads_max' value on one node
> but they don't seem to all be doing socket operations.  You
> can check the sloginfo output to be sure but it doesn't look
> like you're hitting this limit.
> 
> You can try exercising subsystems: does localhost work, can
> you ping offnode, does qnet offnode work?  

Well, I can't really know the moment when the problem appears. I only see the consequences: the client software disconnects because of a communication timeout. But after it disconnects, it is too late to check anything. Anyway, as I already said, my SSH terminal is quite fine, and it looks like I can access the network nodes (at least a shell script run from a network node does not terminate).

> Anything
> different in the 'netstat -s' or 'cat /proc/qnetstats'
> output from when it works vs when you're in failure mode?

So, I made an experiment. I ran the following script:
=== begin ===
#!/bin/sh
# Dump thread states plus IP and qnet statistics every second.
out="stats.out"

while true
do
  echo "=" >> "$out"
  date >> "$out"
  echo "=" >> "$out"
  echo "1111111111111111111111111111111111111111111111" >> "$out"
  echo "=" >> "$out"
  pidin -p "$1" >> "$out"
  echo "=" >> "$out"
  echo "2222222222222222222222222222222222222222222222" >> "$out"
  echo "=" >> "$out"
  netstat -s >> "$out"
  echo "=" >> "$out"
  echo "3333333333333333333333333333333333333333333333" >> "$out"
  echo "=" >> "$out"
  cat /proc/qnetstats >> "$out"
  sleep 1
done
=== end ===

It dumps the thread states and the IP+qnet statistics every second.
I'll send you an excerpt from the output that contains a single connect-disconnect period in my next letter, to your personal e-mail.
You can tell that the client is connected if there are threads blocked over Qnet on the other nodes. When the client disconnects, all the threads are blocked locally (the first and the few last passes).
Perhaps it is also important to know that the client is a Windows application and that both sides call setsockopt(..., IPPROTO_TCP, TCP_NODELAY, ...).
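
(For reference, TCP_NODELAY disables Nagle's algorithm, so small envelopes go out immediately instead of being coalesced; spelled out as a sketch, the call is:)
=== begin ===
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int enable_nodelay(int sock)
{
    int on = 1;
    /* Disable Nagle: do not hold back small segments waiting for ACKs. */
    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);
}
=== end ===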
Re: TCP stream socket send() thread safety  
> 
> It dumps the thread states and the IP+qnet statistics every second.
> I'll send you an excerpt from the output that contains a single
> connect-disconnect period in my next letter, to your personal e-mail.

Please use the group so more eyes can see it.

-seanb
Re: TCP stream socket send() thread safety  
> > 
> > It dumps the thread states and the IP+qnet statistics every second.
> > I'll send you an excerpt from the output that contains a single
> > connect-disconnect period in my next letter, to your personal e-mail.
> 
> Please use the group so more eyes can see it.

Could you forward the file to the people you think could help solve the problem? I'm concerned that publishing network data in a general-access newsgroup could be a security threat.

Re: TCP stream socket send() thread safety  
After reviewing your logs there doesn't seem to be anything
all that abnormal therein.  You might want to check whether
the following count correlates with your issue:

4 connections dropped by rexmit timeout

At this point I'd try to instrument the app to see why
connections are being dropped.  Is it normal termination?
You seem to have multiple threads sending on a socket;
could you have multiple threads closing a socket?  Could
socket creation be happening between two closes?

-seanb
Re: TCP stream socket send() thread safety  
> 4 connections dropped by rexmit timeout
> 
> At this point I'd try to instrument the app to see why connections are being dropped.  Is it normal termination? 

Well, I can't answer this question, because I do not have any idea when these connections were dropped, or even whether they were dropped by my process.

> You seem to have multiple threads sending on a socket; could you have multiple threads closing a socket? Could socket creation be happening between two closes?

So, the program works as follows (see the sketch below):
1) There is a thread that listens for incoming connections and accepts them (Thread A).
2) For every accepted socket a new worker thread is created to serve that socket (Thread B).
3) The worker thread reads command envelopes from the socket, and for most types of commands a command processing thread is created (Thread C). The rest of the commands are executed synchronously in Thread B before the next command envelope is read (in particular, heartbeats are answered synchronously).
4) A command processing thread gets as parameters the command input data (read by Thread B beforehand) and the socket (so that it can send the command status/response back to the client).

So Thread C just gets the input data and a socket handle. It does its job, writes the result back to the socket and terminates. The socket is closed by Thread B after the client shuts the connection down or a communication error is detected.
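
A minimal sketch of this structure (handle_commands is a hypothetical stand-in for the Thread B body described above):
=== begin ===
#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical Thread B body: reads command envelopes and spawns
   Thread C workers for most commands. */
extern void handle_commands(int sock);

static void *connection_worker(void *arg)      /* Thread B */
{
    int sock = (int)(intptr_t)arg;

    handle_commands(sock);
    close(sock);        /* only Thread B ever closes its socket */
    return NULL;
}

void accept_loop(int listen_sock)              /* Thread A */
{
    for (;;) {
        int sock = accept(listen_sock, NULL, NULL);
        if (sock < 0)
            continue;

        pthread_t tid;
        pthread_create(&tid, NULL, connection_worker,
                       (void *)(intptr_t)sock);
        pthread_detach(tid);
    }
}
=== end ===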

The answers to the questions are:
> Could you have multiple threads closing a socket?
No, because only Thread B closes the personal socket it was created with, and it does so only after connection termination by the client or a communication error, and after all the related command processing threads (Threads C) have finished sending their responses and terminated.

> Could socket creation be happening between two closes?
Well, if we consider a new connection accept to be a socket creation, then there can be anything: several Thread B's can be closing their sockets while Thread A accepts new connections at the same time.
Re: TCP stream socket send() thread safety  
> > 4 connections dropped by rexmit timeout
> > 
> > At this point I'd try to instrument the app to see why connections
> > are being dropped.  Is it normal termination?
> 
> Well, I can't answer this question, because I do not have any idea
> when these connections were dropped, or even whether they were
> dropped by my process.

I have checked the full dump output, and those "4 connections dropped" remain the same during all 11 minutes I was monitoring the network.