HAM guardian not taking over when primary fails.
02/11/2010 11:29 PM
post47189
First off - some details
# pidin in
CPU:X86 Release:6.4.0 FreeMem:727Mb/1014Mb BootTime:Feb 05 11:33:24 AEDT 2010
Processor1: 1586 Intel ?86 F15M4S1 2796MHz FPU
Processor2: 1586 Intel ?86 F15M4S1 2798MHz FPU
-rw-rw-r-- 1 admin root 761609 Oct 22 2008 procnto-smp-instr
More details, including a large ham -vvv log, are in crash.txt.gz.
The short form of the story:
Started ham -vvv 2>/tmp/ham with a task that loads all the desired monitoring (.ini file) and handles monitoring of
non-session 1 tasks plus arbitrary paths by creating proxies that self_attach to ham.
ham 2568211
guardian 2572309
kill -9 2568211 to test failover - it all worked OK
ham now 2572309
guardian 2990099
From another node
# ls -FRC /net/hlu-8-4/proc/ham
/net/hlu-8-4/proc/ham:
.info httpd/ miniboss/ redundant/
/net/hlu-8-4/proc/ham/httpd:
.info death/ inform/
/net/hlu-8-4/proc/ham/httpd/death:
.info restart
ls: No such process (/net/hlu-8-4/proc/ham/httpd/inform/.info)
ls: No such file or directory (/net/hlu-8-4/proc/ham/httpd/inform/log)
2572309 is missing - syslog shows
Feb 12 13:49:50 5 21 0 run fault pid 2572309 tid 10 signal 11 code 1 ip 0xb035a428 usr/sbin/ham
but the guardian has not picked up.
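For anyone wanting to pull the dead ham's pid out of syslog automatically, here is a minimal Python sketch that parses a fault line of the shape shown above. The regex is an assumption based on this single sample line, and the names parse_fault and FAULT_RE are made up for illustration; other fault records may be laid out differently.

```python
import re

# Pattern inferred from the one syslog line quoted in this post (assumption).
FAULT_RE = re.compile(
    r"run fault pid (?P<pid>\d+) tid (?P<tid>\d+) "
    r"signal (?P<signal>\d+) code (?P<code>\d+) "
    r"ip (?P<ip>0x[0-9a-fA-F]+) (?P<path>\S+)"
)

def parse_fault(line):
    """Return the fields of a 'run fault' record as a dict, or None if no match."""
    m = FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "pid": int(m.group("pid")),
        "tid": int(m.group("tid")),
        "signal": int(m.group("signal")),
        "code": int(m.group("code")),
        "ip": int(m.group("ip"), 16),
        "path": m.group("path"),
    }

line = ("Feb 12 13:49:50 5 21 0 run fault pid 2572309 tid 10 "
        "signal 11 code 1 ip 0xb035a428 usr/sbin/ham")
fault = parse_fault(line)
print(fault["pid"], fault["signal"], fault["path"])
```

Signal 11 is SIGSEGV, so the record says thread 10 of the active ham took a segmentation fault.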
A simple pidin a listing (see attachment) shows that slot 0x15, which held 2572309, has still not been reused by any
later task (even though I created a bunch of ksh's).
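The slot number can be read straight off the pid. A minimal Python sketch, assuming the low 12 bits of a pid are the process-table slot; the mask width is my assumption, inferred only from the pids in this post, and pid_slot is a made-up helper name:

```python
# Assumption: the low 12 bits of a pid index the process-table slot.
# This width is inferred from the pids quoted in this post, not from documentation.
SLOT_MASK = 0xFFF

def pid_slot(pid):
    """Return the assumed process-table slot for a pid."""
    return pid & SLOT_MASK

pids = {
    "original ham": 2568211,
    "second ham (faulted)": 2572309,
    "current guardian": 2990099,
}

for name, pid in pids.items():
    print(f"{name}: pid {pid} = {pid:#x}, slot {pid_slot(pid):#x}")
```

Under that assumption 2572309 (0x274015) maps to slot 0x15, matching the pidin listing, while the current guardian 2990099 (0x2da013) sits in slot 0x13, the slot the original ham 2568211 (0x273013) had occupied.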
It seems ham guardian (2990099) is still waiting for 2572309 to REALLY die!
# pidin -Pham -nhlu-8-4 fam
pid name sid pgrp ppid sibling child
2990099 usr/sbin/ham 1 2990099 2572309
# pidin -Pham -nhlu-8-4
2990099 1 usr/sbin/ham 10r RECEIVE 2
Now what! This is all apart from the problem of letting ham start tasks that daemon()ize themselves and hence appear to
die on startup.
Two issues here: 1. HAM died (I have a core dump) when I tried to do an ls, and 2. the kernel has not released the slot!
Of the two, the kernel problem seems more significant, so what more info can I provide, or has it already been fixed?