Project Home
Project Home
Wiki
Wiki
Discussion Forums
Discussions
Project Information
Project Info
Forum Topic - TCP Communication Problem: (9 Items)
   
TCP Communication Problem  
I would like to ask for assistance with investigating a communication problem we have with Modbus device over TCP. We 
use 6.3.0 SP3 with all the network patches available for that release. 
The code was running correctly until we raised priorities of communication thread to 52 (it was running with default 
priority of 10 before). Now, once every 15 minutes or something, TCP communication stops for a second (like a packet is 
lost) and then, after a second it continues, but hardware times out without data until that.
I have attached an excerpt from Wireshark capture file with the incident (I can provide full file, if necessary).
So, if to look at packet #13, it's a Modbus query with Seq=37, NextSeq=49. The packet #16 is the response for the query,
  it has Seq=34, NextSeq=45 and it contains Ack=49 confirming that the query has been received by the device. The next 
packet from the node to the device (#17) contains Ack=45, confirming that the response has been received by the stack. 
However its Seq=64 and NextSeq=79 instead of expected Seq=49 and Wireshark states in packet analysis that previous 
segment has been lost. Is not that sequence chain break strange? 
Indeed, the application does not receive the data from packet #16 (the response for Modbus query). I have added receive 
timeout and attempted to resend original query hoping that it will force packet retransmission from Modbus device (user 
level resend - this is what packet #17 and all the next packets from node to device with Transaction Id increasing are).
 But as you can see from the logs, it had no effect. The node and the device keep acknowledging they both have received 
the last packets but it looks like they do not understand each other, as if they speak different languages. At last, 
after a second, the node stack retransmits original packet #13 on its own with packet #31 and the device answers it once
 again with packet #35. The node accepts this answer, confirms it with Ack in packet #36 and device starts answering all
 the 14 user-level retries I've sent during that second of delay (#39, #42, #46, #49 and so on). So, even though the 
communication restores after a second, it's already too late for the hardware and node terminates the socket.
Could someone tell me, please, what was wrong with the first communication attempt (packets #13-#16)? Why did not the 
data reach the application? Which side has lost the packet (the node or the device)? Everything looks exactly the same 
both times except for strange sequence chain break I don't know how to explain.
Thank you for the help and sorry for taking your time.
Attachment: Text comm.pcap 8.29 KB
RE: TCP Communication Problem  
Couple of quick questions:

	What is the command line you use to start the stack (including
the parameter for the priority setting)?
	What driver are you using (is it a custom driver for Modbus)?
	Can you post a pidin / pidin mem output for io-net for us?

	Thanks!
		Robert.

 

-----Original Message-----
From: Oleh Derevenko [mailto:community-noreply@qnx.com] 
Sent: Friday, February 20, 2009 5:58 PM
To: general-networking
Subject: TCP Communication Problem

I would like to ask for assistance with investigating a communication
problem we have with Modbus device over TCP. We use 6.3.0 SP3 with all
the network patches available for that release. 
The code was running correctly until we raised priorities of
communication thread to 52 (it was running with default priority of 10
before). Now, once every 15 minutes or something, TCP communication
stops for a second (like a packet is lost) and then, after a second it
continues, but hardware times out without data until that.
I have attached an excerpt from Wireshark capture file with the incident
(I can provide full file, if necessary).
So, if to look at packet #13, it's a Modbus query with Seq=37,
NextSeq=49. The packet #16 is the response for the query,  it has
Seq=34, NextSeq=45 and it contains Ack=49 confirming that the query has
been received by the device. The next packet from the node to the device
(#17) contains Ack=45, confirming that the response has been received by
the stack. However its Seq=64 and NextSeq=79 instead of expected Seq=49
and Wireshark states in packet analysis that previous segment has been
lost. Is not that sequence chain break strange? 
Indeed, the application does not receive the data from packet #16 (the
response for Modbus query). I have added receive timeout and attempted
to resend original query hoping that it will force packet retransmission
from Modbus device (user level resend - this is what packet #17 and all
the next packets from node to device with Transaction Id increasing
are). But as you can see from the logs, it had no effect. The node and
the device keep acknowledging they both have received the last packets
but it looks like they do not understand each other, as if they speak
different languages. At last, after a second, the node stack retransmits
original packet #13 on its own with packet #31 and the device answers it
once again with packet #35. The node accepts this answer, confirms it
with Ack in packet #36 and device starts answering all the 14 user-level
retries I've sent during that second of delay (#39, #42, #46, #49 and so
on). So, even though the communication restores after a second, it's
already too late for the hardware and node terminates the socket.
Could someone tell me, please, what was wrong with the first
communication attempt (packets #13-#16)? Why did not the data reach the
application? Which side has lost the packet (the node or the device)?
Everything looks exactly the same both times except for strange sequence
chain break I don't know how to explain.
Thank you for the help and sorry for taking your time.

_______________________________________________
General
http://community.qnx.com/sf/go/post22675
Re: RE: TCP Communication Problem  
> Couple of quick questions:
> 
> 	What is the command line you use to start the stack
>              (includingthe parameter for the priority setting)?
Just the default "-ptcpip" without any customization.

> 	What driver are you using (is it a custom driver for Modbus)?
I'm not quite sure what do you mean by this question. .13 is a QNX node with our software communicating over the TCP 
socket. .111 is a Beckhoff module (either http://infosys.beckhoff.com/english.php?content=content/1033/BK9000/HTML/
Bt_BK9000_Introduction.htm or http://www.beckhoff.com/english.asp?bus_terminal/bc9000_bc9050.htm - not sure which one as
 it's in the network of our clients).

> 	Can you post a pidin / pidin mem output for io-net for us?
# pidin -p 77840 mem
     pid tid name               prio STATE           code  data         stack
   77840   1 sbin/io-net         10o SIGWAITINFO       64K 1968K  8192(516K)*
   77840   2 sbin/io-net         20o RECEIVE           64K 1968K  4096(132K)
   77840   3 sbin/io-net         10o RECEIVE           64K 1968K   4096(68K)
   77840   5 sbin/io-net         10o RECEIVE           64K 1968K   4096(68K)
   77840   6 sbin/io-net         21r RECEIVE           64K 1968K  4096(132K)
   77840   7 sbin/io-net         10o RECEIVE           64K 1968K  4096(132K)
   77840   8 sbin/io-net         10o RECEIVE           64K 1968K  4096(132K)
   77840   9 sbin/io-net         10o RECEIVE           64K 1968K   4096(68K)
   77840  10 sbin/io-net         10o RECEIVE           64K 1968K   4096(68K)
            ldqnx.so.2         @b0300000             348K   20K
            npm-tcpip.so       @b8200000             276K   32K
            devn-rtl.so        @b824d000              48K  8192
            npm-qnet.so        @b825b000             160K   36K
#
RE: TCP Communication Problem  
As far as I can tell from the packet log, there suppose to be a segment
from 192.168.10.13 to .111, sequence 49 - 63. This one never reaches the
.111's stack.

.111 report that with a dup ack at pkt# 24, and .13 retransmitted that
segment (with all other outstanding data), and things start to rolling
again.

Is .13 not send that out, or is .111 not receive it? I am not sure.
Where did you get this log. On .13, on .111, or a 3rd sniff machine?

If I had to guess, I would guess the packet is sent out, but .111 failed
to receive it. Because you indicated this only happened after you raise
your application priority to 52, I would guess that is higher enough to
starve the driver's receiving processing thread (typically running at
21, and there is a command line option to adjust it). Depends on what
network card you have, this may causing an overrun in driver.

Something you can try to see:

1) is nicinfo report any errors in driver?
2) Either lower your application to priority 20/21, or increas the
driver priority to 52/53, see if that help?

-xtang

> -----Original Message-----
> From: Oleh Derevenko [mailto:community-noreply@qnx.com]
> Sent: February 20, 2009 5:58 PM
> To: general-networking
> Subject: TCP Communication Problem
> 
> I would like to ask for assistance with investigating a communication
> problem we have with Modbus device over TCP. We use 6.3.0 SP3 with all
the
> network patches available for that release.
> The code was running correctly until we raised priorities of
communication
> thread to 52 (it was running with default priority of 10 before). Now,
> once every 15 minutes or something, TCP communication stops for a
second
> (like a packet is lost) and then, after a second it continues, but
> hardware times out without data until that.
> I have attached an excerpt from Wireshark capture file with the
incident
> (I can provide full file, if necessary).
> So, if to look at packet #13, it's a Modbus query with Seq=37,
NextSeq=49.
> The packet #16 is the response for the query,  it has Seq=34,
NextSeq=45
> and it contains Ack=49 confirming that the query has been received by
the
> device. The next packet from the node to the device (#17) contains
Ack=45,
> confirming that the response has been received by the stack. However
its
> Seq=64 and NextSeq=79 instead of expected Seq=49 and Wireshark states
in
> packet analysis that previous segment has been lost. Is not that
sequence
> chain break strange?
> Indeed, the application does not receive the data from packet #16 (the
> response for Modbus query). I have added receive timeout and attempted
to
> resend original query hoping that it will force packet retransmission
from
> Modbus device (user level resend - this is what packet #17 and all the
> next packets from node to device with Transaction Id increasing are).
But
> as you can see from the logs, it had no effect. The node and the
device
> keep acknowledging they both have received the last packets but it
looks
> like they do not understand each other, as if they speak different
> languages. At last, after a second, the node stack retransmits
original
> packet #13 on its own with packet #31 and the device answers it once
again
> with packet #35. The node accepts this answer, confirms it with Ack in
> packet #36 and device starts answering all the 14 user-level retries
I've
> sent during that second of delay (#39, #42, #46, #49 and so on). So,
even
> though the communication restores after a second, it's already too
late
> for the hardware and node terminates the socket.
> Could someone tell me, please, what was wrong with the first
communication
> attempt (packets #13-#16)? Why did not the data reach the application?
> Which side has lost the packet (the node or the device)? Everything
looks
>...
View Full Message
Re: RE: TCP Communication Problem  
> As far as I can tell from the packet log, there suppose to be a segment
> from 192.168.10.13 to .111, sequence 49 - 63. This one never reaches the
> .111's stack.
Well, but what would you expect to see in that segment? Modbus logical request-response sequence is not broken. .13 
sends a request to Modbus device with packet #13 and waits for a response which comes in packet #16 but does not reach 
the application.

> .111 report that with a dup ack at pkt# 24, and .13 retransmitted that
> segment (with all other outstanding data), and things start to rolling
> again.
#24 contains ACK=49 which was already notified three times with packets #14, #15 and #16 before. Well, I don't really 
know the low level details of TCP protocol - I may be telling something irrelevant.
Wireshark tells that #24 contains duplicate ACK for frame #16. But frames from #17 to #23 all confirm to Modbus device 
that frame #16 has been received (at least "SEQ/ACK analysis" node for pkt #17 clearly says "This is an ACK to the 
segment in frame: 16"). 

> 
> Is .13 not send that out, or is .111 not receive it? I am not sure.
> Where did you get this log. On .13, on .111, or a 3rd sniff machine?
So, for the purpose of tracking the issue down .111 had been separated and plugged into a network hub. QNX network 
contains ~20 QNX nodes connected with switches and is also connected to the same hub. And also there is a Windows 
machine plugged in the hub which runs Wireshark and logs the traffic. Initially, when the problem was first reported 
everything was connected with switches only.

> If I had to guess, I would guess the packet is sent out, but .111 failed
> to receive it. 
.111 is a Beckhoff module. There is no other communication to it except for Modbus traffic. And Modbus messages embedded
 in the packets form a valid communication sequence.

> Because you indicated this only happened after you raise
> your application priority to 52, I would guess that is higher enough to
> starve the driver's receiving processing thread (typically running at
> 21, and there is a command line option to adjust it). Depends on what
> network card you have, this may causing an overrun in driver.
The CPU usage on the node is less than 10%. There might be short peaks for 10-20 milliseconds at high priorities, of 
course, but could that lead to communication problems? We do not put the node into high load for the testing, it just 
runs normal communications in "idle mode" and after some 10-15 minutes the connection breaks.

> Something you can try to see:
> 
> 1) is nicinfo report any errors in driver?
Just a single buffer underrun error.
# nicinfo
RealTek 8139 Ethernet Controller

  Physical Node ID ........................... 00C049 F402FD
  Current Physical Node ID ................... 00C049 F402FD
  Current Operation Rate ..................... 100.00 Mb/s full-duplex
  Active Interface Type ...................... MII
    Active PHY address ....................... 0
  Maximum Transmittable data Unit ............ 1514
  Maximum Receivable data Unit ............... 1514
  Hardware Interrupt ......................... 0xa
  I/O Aperture ............................... 0xe400 - 0xe4ff
  Memory Aperture ............................ 0xd6000000 - 0xd60000ff
  Promiscuous Mode ........................... Off
  Multicast Support .......................... Enabled

  Packets Transmitted OK ..................... 90012419
  Bytes Transmitted OK ....................... 8409723780
  Memory Allocation Failures on Transmit ..... 0

  Packets Received OK ........................ 100714313
  Bytes Received OK .......................... 7954104082
  Memory Allocation Failures on Receive ...... 0

  Single Collisions on Transmit .............. 0
  Transmits aborted (excessive collisions) ... 0
  Transmit Underruns ......................... 1
  No Carrier on Transmit ..................... 0
...
View Full Message
RE: RE: TCP Communication Problem  

> -----Original Message-----
> From: Oleh Derevenko [mailto:community-noreply@qnx.com]
> Sent: February 22, 2009 12:23 PM
> To: general-networking
> Subject: Re: RE: TCP Communication Problem
> 
> > As far as I can tell from the packet log, there suppose to be a
segment
> > from 192.168.10.13 to .111, sequence 49 - 63. This one never reaches
the
> > .111's stack.
> Well, but what would you expect to see in that segment? Modbus logical
> request-response sequence is not broken. .13 sends a request to Modbus
> device with packet #13 and waits for a response which comes in packet
#16
> but does not reach the application.

Sorry I don't have the log in hand, but sequence 49 - 63 is
retransmitted later. So you can see the contents from that retransmit.
So if you look into the contents of that retransmit, the first 15 bytes
is your contents.

> > .111 report that with a dup ack at pkt# 24, and .13 retransmitted
that
> > segment (with all other outstanding data), and things start to
rolling
> > again.
> #24 contains ACK=49 which was already notified three times with
packets
> #14, #15 and #16 before. Well, I don't really know the low level
details
> of TCP protocol - I may be telling something irrelevant.
> Wireshark tells that #24 contains duplicate ACK for frame #16. But
frames
> from #17 to #23 all confirm to Modbus device that frame #16 has been
> received (at least "SEQ/ACK analysis" node for pkt #17 clearly says
"This
> is an ACK to the segment in frame: 16").

These are all standard procedure how TCP recover from a packet lost. It
keeps ACKing the last received seq. The other side may not see the
dupped packet after sending a few more packets, but finally it realized
a particular segment is lost, and re-transmit it.

> > Is .13 not send that out, or is .111 not receive it? I am not sure.
> > Where did you get this log. On .13, on .111, or a 3rd sniff machine?
> So, for the purpose of tracking the issue down .111 had been separated
and
> plugged into a network hub. QNX network contains ~20 QNX nodes
connected
> with switches and is also connected to the same hub. And also there is
a
> Windows machine plugged in the hub which runs Wireshark and logs the
> traffic. Initially, when the problem was first reported everything was
> connected with switches only.

This would be better because we will know if the packet is not send out,
or it is not received.

> > 2) Either lower your application to priority 20/21, or increas the
> > driver priority to 52/53, see if that help?
> We already tried to raise TCP thread priorities and NIC driver
priority to
> values higher than communication thread in application (and even to
values
> higher than any other thread in the system) before posting to the
forum,
> but this does not solve the problem. :(

Hmm... But if your applicaton running at 10, and network driver is
running at default, everything is fine?

Thanks,

-xtang
Re: RE: RE: TCP Communication Problem  
> Sorry I don't have the log in hand, but sequence 49 - 63 is
> retransmitted later. So you can see the contents from that retransmit.
> So if you look into the contents of that retransmit, the first 15 bytes
> is your contents.

Ah, now I see the lost packet. Thank you very much. I have not noticed that data in packets #13 and #17 is different.

I'll also try to setup packet capture at QNX node itself and recompile application back to normal priority tomorrow.
Re: RE: RE: TCP Communication Problem  
> 
> I'll also try to setup packet capture at QNX node itself and recompile 
> application back to normal priority tomorrow.

We are not able to reproduce the problem any more. It was running fine for whole day yesterday. 
I can't find any explanation except for "maybe there was some hardware (a switch?) dropping packets from time to time 
and now it's been reset and started to work fine".
That's a pity we did not set up packet capture at QNX from the very beginning. :-(
Re: RE: RE: TCP Communication Problem  
> I can't find any explanation except for "maybe there was some hardware (a 
> switch?) dropping packets from time to time and now it's been reset and 
> started to work fine".
> That's a pity we did not set up packet capture at QNX from the very beginning.
>  :-(

We finally were able to reproduce communication problem and saved traffic dump on QNX node. Now I clearly see, that 
packet is sent but does not reach the Modbus device. From what was seen in first dump captured on Windows machine via 
hub where there was no packet at all, it's clear that network is losing packets (it's not the Modbus device that fails 
to accept packets on arrival).
Thanks very much to everyone for the help.