Forum Topic - Node down detection protocol in QNet: (3 Items)
   
Node down detection protocol in QNet  
Could anyone give me more detail on how one node detects that another is down when there is outgoing traffic 
to the dead node? Our clients have asked us to clarify how long it takes for a node to be detected as offline.
For an idle channel, the total timeout can be derived from the conn_up_idle and 
conn_up_retries QNet parameters. However, if there is traffic between the nodes, experiments show that the offline 
condition can be detected much sooner than the time calculated from tx_ticks and tx_retries.
So the main question is: if there is outgoing traffic queued to a dead node, how do I calculate the time after which 
sending fails with an error and the node is marked as "offline"?
Re: Node down detection protocol in QNet  
Hiya.  Sorry about the delay responding, I was
out of town.

tx_ticks and tx_retries are definitely what you
want to use at the command line to change the
timeout for detecting a dead node when traffic is
being transmitted.

The defaults work out to 10 seconds, which may
be too conservative for you.  Try dropping tx_retries from 25 to 5 and see if that helps.
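
For reference, the arithmetic behind that suggestion can be sketched as below. This is only a rough model: it assumes the timeout is simply the retry count multiplied by a fixed interval between retries, and it backs that interval out of the "25 retries is about 10 seconds" figure above. The function name and the 0.4 second interval are illustrative assumptions, not documented Qnet values.

```python
def nominal_tx_timeout(tx_retries, per_retry_seconds):
    """Nominal time before a transmit is abandoned: retry count times retry interval."""
    return tx_retries * per_retry_seconds

# Interval implied by "25 retries works out to 10 seconds".
# This is inferred from the post, not taken from Qnet documentation.
implied_interval = 10.0 / 25

default_timeout = nominal_tx_timeout(25, implied_interval)
reduced_timeout = nominal_tx_timeout(5, implied_interval)
print(f"default: {default_timeout:.1f}s, with tx_retries=5: {reduced_timeout:.1f}s")
```

Under those assumptions, dropping tx_retries from 25 to 5 cuts the nominal timeout to a fifth of the default.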

As a test, talk to a node that you know is up, 
via something like "ls /net/fubar" then unplug
your network cable, and run "ls /net/fubar" again.

You should see a delay, then something like
this:
 
  ls: Host is down (/net/fubar/)
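
To put a number on that delay rather than eyeballing it, something like the following could be used. It is a minimal sketch that assumes Python 3 is available on the node you are testing from; time_command is a hypothetical helper, and /net/fubar is just the example node name used above.

```python
import subprocess
import time

def time_command(argv):
    """Run a command and return (elapsed_seconds, exit_code, stderr_text)."""
    start = time.monotonic()
    result = subprocess.run(argv, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return elapsed, result.returncode, result.stderr.strip()

if __name__ == "__main__":
    # Substitute the name of the node you unplugged for "fubar".
    elapsed, code, err = time_command(["ls", "/net/fubar"])
    print(f"exit={code} after {elapsed:.2f}s: {err}")
```

Running it a few times with different tx_retries settings gives measured failure times to compare against the calculated ones.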

At that point, you can do a:

  # cat /proc/qnetstats

which will tell you exactly what happened,
and when.  Hope this helps!
Re: Node down detection protocol in QNet  
In my original post I mentioned that the node is detected as "offline" much faster than tx_ticks * tx_retries would 
suggest. The questions were "why would that happen?" and "how can the time be calculated if multiplying tx_ticks by 
tx_retries and by the tick time does not give the correct value?"

Could it be that other channels are idle and detect the node going offline via conn_up_idle and conn_up_retries, 
which give a much smaller product with my settings?

In short, is tx_retries consecutive packet transmission failures, with a delay of tx_ticks between them, the criterion 
for declaring a node "offline" during data transfer, or is there some other criterion?
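
To make the idle-channel guess concrete: if Qnet ran both checks independently and marked the node down as soon as either one gave up, the observed time would be the smaller of the two timeouts. The sketch below only models that guess. Whether Qnet actually behaves this way is exactly the open question, and the helper name and the numbers are placeholders, not measured or documented values.

```python
def earliest_detection(timeouts):
    """Given named timeouts in seconds, return the (name, seconds) pair that fires first."""
    name = min(timeouts, key=timeouts.get)
    return name, timeouts[name]

# Placeholder figures for illustration only.
example = {
    "tx_ticks * tx_retries": 10.0,
    "conn_up_idle * conn_up_retries": 3.0,
}
first = earliest_detection(example)
print(f"node would be marked offline by: {first[0]} after {first[1]}s")
```

If that model were right, the smaller conn_up_* product would explain a node being marked offline well before the tx_* product has elapsed.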