
Forum Topic - HAM guardian not taking over when primary fails.: (5 Items)


Warren Deitch(deleted)

02/11/2010 11:29 PM

post47189

HAM guardian not taking over when primary fails.

First off - some details
# pidin in
CPU:X86 Release:6.4.0  FreeMem:727Mb/1014Mb BootTime:Feb 05 11:33:24 AEDT 2010
Processor1: 1586 Intel ?86 F15M4S1 2796MHz FPU
Processor2: 1586 Intel ?86 F15M4S1 2798MHz FPU

-rw-rw-r--  1 admin     root         761609 Oct 22  2008 procnto-smp-instr

More details including a large ham -vvv log in crash.txt.gz

The short form of the story.
Started ham -vvv 2>/tmp/ham with a task that loads all the desired monitoring (from an .ini file) and handles monitoring
of non-session 1 tasks plus arbitrary paths by creating proxies that self-attach to ham.

ham         2568211
guardian  2572309 

kill -9 2568211       to test failover - it all worked OK

ham now    2572309
guardian     2990099

From another node
# ls -FRC /net/hlu-8-4/proc/ham

/net/hlu-8-4/proc/ham:
.info         httpd/        miniboss/     redundant/

/net/hlu-8-4/proc/ham/httpd:
.info     death/    inform/

/net/hlu-8-4/proc/ham/httpd/death:
.info      restart
ls: No such process (/net/hlu-8-4/proc/ham/httpd/inform/.info)
ls: No such file or directory (/net/hlu-8-4/proc/ham/httpd/inform/log)

2572309 is missing - syslog shows
Feb 12 13:49:50    5    21     0 run fault pid 2572309 tid 10 signal 11 code 1 ip 0xb035a428 usr/sbin/ham
but the guardian has not picked up.

A simple listing from pidin a (see attachment) shows that 2572309 was in slot 0x15, and that slot has still not been
reused by later tasks (even though I created a bunch of ksh's).

It seems the ham guardian (2990099) is still waiting for 2572309 to REALLY die!

# pidin -Pham -nhlu-8-4 fam
     pid name                   sid    pgrp    ppid sibling   child
 2990099 usr/sbin/ham             1 2990099 2572309

# pidin -Pham -nhlu-8-4 
 2990099   1 usr/sbin/ham        10r RECEIVE     2
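
The check behind the output above (a guardian whose ppid no longer appears as a pid) can be sketched in Python. This is
only an illustration: the function name missing_parents is made up, and it assumes whitespace-separated pidin fam output
with ppid as the fifth field. With a -P filter a parent can be absent simply because it was filtered out, so this is
only meaningful on an unfiltered listing or when the parent is known to match the filter.

```python
def missing_parents(fam_output):
    """Return {pid: ppid} for rows whose ppid is not itself listed as a pid.

    Assumes rows look like: pid name sid pgrp ppid [sibling] [child]
    (hypothetical parser, not part of pidin or ham).
    """
    rows = []
    for line in fam_output.splitlines():
        fields = line.split()
        # Skip the header line and anything too short to carry a ppid.
        if len(fields) < 5 or not fields[0].isdigit() or not fields[4].isdigit():
            continue
        rows.append((int(fields[0]), int(fields[4])))
    listed = {pid for pid, _ in rows}
    return {pid: ppid for pid, ppid in rows if ppid not in listed}


sample = """\
     pid name                   sid    pgrp    ppid sibling   child
 2990099 usr/sbin/ham             1 2990099 2572309
"""
print(missing_parents(sample))  # {2990099: 2572309}
```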

Now what! This is all apart from the problem of letting ham start tasks that daemon()ize themselves and hence appear to
die on startup.

Two issues here: 1. HAM died (I have a core dump) when I tried to do an ls, and 2. the kernel has not released the slot!
   Of the two, the kernel problem seems more significant, so what more info can I provide, or has it already been fixed?

Attachment:

crash.txt.gz 7.12 KB

Warren Deitch(deleted)

02/14/2010 4:47 PM

post47291

Re: HAM guardian not taking over when primary fails.

Sometime over the weekend the kernel crashed with
Crash[1,1] at kerext_process line 352.
so I will not be able to track down any more clues.

Warren Deitch(deleted)

02/28/2010 5:41 AM

post48475

Re: HAM guardian not taking over when primary fails.

Ping ??

Shiv Nagarajan(deleted)

03/01/2010 11:08 AM

post48523

Re: HAM guardian not taking over when primary fails.

Sorry for the delayed response. Is this problem reproducible? The kernel crash you encountered subsequently seems to
suggest some sort of issue relating to termination.


In your log:

# ls -FRC /net/hlu-8-4/proc/ham

/net/hlu-8-4/proc/ham:
.info         httpd/        miniboss/     redundant/

/net/hlu-8-4/proc/ham/httpd:
.info     death/    inform/

/net/hlu-8-4/proc/ham/httpd/death:
.info      restart
ls: No such process (/net/hlu-8-4/proc/ham/httpd/inform/.info)
ls: No such file or directory (/net/hlu-8-4/proc/ham/httpd/inform/log)

#

# pidin -Pham -nhlu-8-4 fam
     pid name                   sid    pgrp    ppid sibling   child
 2990099 usr/sbin/ham             1 2990099 2572309
NOTE no pid 2572309 is shown!!

syslog shows

Feb 12 13:49:50    5    21     0 run fault pid 2572309 tid 10 signal 11 code 1 ip 0xb035a428 usr/sbin/ham

The ls of /net is being performed on a remote node, but the ham crash is on the local node. Is there any additional
information you can provide about the overall structure of the system, i.e. how many nodes there are, which process
died, which process was notified, and when the ham crash occurred? Or was it all transient?

shiv

Warren Deitch(deleted)

03/01/2010 6:47 PM

post48582

Re: HAM guardian not taking over when primary fails.

This session was being logged because I already had three system failures and I was trying to get to the cause.

I had two nodes connected with a patch cable. No other nodes involved. The target node (the one that crashed) was NOT
running Photon. I was using a second node running Photon as my probe node.

The system was fairly idle. Nothing dynamic beyond my testing.

The ham interface (miniboss) reads an initialisation file and loads up HAM with the requirements, mainly with
execute/log actions. I have extended the interface to allow monitoring of arbitrary paths (e.g. /net/remoteNode, to
quickly determine if a node or QNET has died) and non-session 1 tasks (apache2) by using self-attached proxies that the
main task manipulates. I can send the package if that would help.

When the ls -FRC ran, the primary HAM vanished from /proc/ and pidin did not show it, BUT the guardian did not take over.
When I tried to find out why, it appeared the guardian was still blocked on the supposedly terminated primary task. I
made a little awk script that read a pidin listing and reported the pid column in hex. After creating a few extra ksh's
to ensure I was not leaving any gaps in the process slots, I could see that the slot previously occupied by the main ham
was not being re-used.
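
The awk script itself was not posted. A rough Python equivalent of the idea might look like the sketch below. The
function name pid_slots and the SLOT_MASK value are assumptions: the 0xFFF mask is inferred only from pid 2572309
(0x274015) being described as slot 0x15, and is not taken from any QNX header.

```python
SLOT_MASK = 0xFFF  # assumption: the low 12 bits of a pid hold the process slot


def pid_slots(pidin_output):
    """Return (pid, pid in hex, slot in hex) for each distinct pid in a pidin listing.

    Assumes the pid is the first whitespace-separated field of each row
    (hypothetical helper, not the original awk script).
    """
    seen = set()
    result = []
    for line in pidin_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and per-thread continuation rows.
        if not fields or not fields[0].isdigit():
            continue
        pid = int(fields[0])
        if pid in seen:
            continue
        seen.add(pid)
        result.append((pid, hex(pid), hex(pid & SLOT_MASK)))
    return result


sample = """\
     pid tid name               prio STATE       Blocked
 2572309   1 usr/sbin/ham        10r RECEIVE     2
 2990099   1 usr/sbin/ham        10r RECEIVE     2
"""
for pid, pid_hex, slot in pid_slots(sample):
    print(pid, pid_hex, slot)
# 2572309 0x274015 0x15
# 2990099 0x2da013 0x13
```

Comparing the set of slots in use before and after starting extra processes would then show whether a given slot, such
as 0x15 here, is ever handed out again.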

This was all under 6.4.0, but I was not in a position to repeat the test under 6.4.1.

As I said in the first posting, there are two problems: ham crashing on an ls, and the termination not allowing
the guardian to take over.
