Warren Deitch(deleted)
|
Re: HAM guardian not taking over when primary fails.
|
03/01/2010 6:47 PM
post48582
|
This session was being logged because I already had three system failures and I was trying to get to the cause.
I had two nodes connected with a patch cable. No other nodes involved. The target node (the one that crashed) was NOT
running Photon. I was using a second node running Photon as my probe node.
The system was fairly idle. Nothing dynamic beyond my testing.
The ham interface (miniboss) reads an initialisation file and loads HAM with the requirements, mainly execute/log
actions. I have extended the interface to allow monitoring of arbitrary paths (e.g. /net/remoteNode, to quickly
determine whether a node or QNET has died) and of non-session-1 tasks (apache2), using self-attached proxies that the
main task manipulates. I can send the package if that would help.
When the ls -FRC ran, the primary HAM vanished from /proc/ and pidin did not show it, BUT the guardian did not take over.
When I tried to find out why, it looked as though the guardian was still blocked on the supposedly terminated primary
task. I made a little awk script that read the pidin output and reported the pid column in hex. After creating a few
extra ksh instances to ensure I was not leaving any gaps in the process slots, I could see that the slot previously
occupied by the main ham was not being re-used.
This was all under 6.4.0, but I was not in a position to repeat the test under 6.4.1.
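For anyone wanting to repeat the slot check, something along these lines would do. This is a rough sketch, not the script actually used: the function name pid_hex is made up for illustration, and it assumes the pid is the first whitespace-separated column of the pidin output and that header lines start with a non-numeric field.

```shell
# pid_hex: read pidin-style output on stdin and print each distinct pid
# in decimal and in hex, in the order first seen.
# Assumptions (not taken from the original script): the pid is column 1,
# and any line whose first field is not a plain number is a header.
pid_hex() {
    awk '$1 ~ /^[0-9]+$/ && !seen[$1]++ { printf "%d 0x%x\n", $1, $1 }'
}

# On the target node this would be run as:
#   pidin | pid_hex
```

Comparing the hex listing before and after starting a few extra shells makes it easier to spot a pid slot that never gets handed out again.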
As I said in the first posting, there are two problems: HAM crashing on an ls, and its termination not allowing
the guardian to take over.