BERGMANN Yannick
11/17/2010 3:42 AM
post74903
|
Periodically (about once every 2-3 months), at many sites, we have problems with io-net under QNX 6.3.0 SP3 (x86). When
this happens, the system freezes completely. We can find core dumps afterwards, but they seem corrupted (coreinfo gives
results, but in gdb the stack trace is very long). We also tried to generate .kev files, but the system does not have
time to write them to disk.
After some research in our labs, we found that killing io-net with SIGKILL (slay -f -9 io-net) causes the system to
freeze; this is almost certainly because SIGKILL bypasses io-net's cleanup. The coreinfo output shows that one io-net
thread received SIGSEGV, so we would now like to know how to prevent this fault.
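To pick the faulting thread out of a coreinfo dump like the one at the end of this post, something along these lines might help when comparing dumps from several sites. It is only a rough sketch: it assumes the text layout shown in this post (a "thread N" header, optionally followed by "SIGNALLED-<signal>", then key=value lines), and the function name and sample variable are made up for illustration.

```python
import re

# Thread header as printed by coreinfo in this post, e.g.
#   thread 2 SIGNALLED-SIGSEGV code=1 MAPERR refaddr=4 fltno=11
# Assumption: other coreinfo versions may format this line differently.
THREAD_RE = re.compile(r"^thread\s+(\d+)(?:\s+SIGNALLED-(\w+)(.*))?$")
KV_RE = re.compile(r"(\w+)=(\S+)")


def find_signalled_threads(coreinfo_text):
    """Return one dict per thread that coreinfo marks as SIGNALLED."""
    results = []
    current = None  # dict of the signalled thread being collected, if any
    for raw in coreinfo_text.splitlines():
        line = raw.strip()
        match = THREAD_RE.match(line)
        if match:
            current = None
            if match.group(2):
                current = {"tid": int(match.group(1)), "signal": match.group(2)}
                # key=value pairs on the header line (code, refaddr, fltno)
                current.update(KV_RE.findall(match.group(3)))
                results.append(current)
        elif current is not None:
            # key=value pairs on the following lines (ip, sp, state, ...)
            current.update(KV_RE.findall(line))
    return results


# Excerpt copied from the coreinfo output in this post.
SAMPLE = """\
thread 1
ip=0xb032e985 sp=0x8047dc8 stkbase=0x7fc7000 stksize=528384
state=SIGWAITINFO flags=80000000 last_cpu=1 timeout=00000000
pri=10 realpri=10 policy=OTHER
thread 2 SIGNALLED-SIGSEGV code=1 MAPERR refaddr=4 fltno=11
ip=0xb031f3d4 sp=0x7fc6e40 stkbase=0x7f73000 stksize=135168
state=STOPPED flags=84000000 last_cpu=1 timeout=00000000
pri=10 realpri=10 policy=OTHER
thread 3
ip=0xb032dd29 sp=0x7fb5f00 stkbase=0x7fa5000 stksize=69632
state=RECEIVE flags=84020000 last_cpu=1 timeout=00000000
"""

if __name__ == "__main__":
    for thread in find_signalled_threads(SAMPLE):
        print(thread["tid"], thread["signal"], thread.get("refaddr"), thread.get("ip"))
```

Values are kept as strings exactly as coreinfo prints them. In our dump the only signalled thread is thread 2, with MAPERR and refaddr=4; that would typically be consistent with a dereference through a NULL pointer at a small offset, but this is only a guess from the dump.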
Using the High Availability Manager, we developed a method for restarting io-net when it dies and, surprisingly, it
prevents the system from freezing when SIGKILL is sent to io-net. On site, however, it didn't work: we waited 3 months
for the crash to happen, and when it did, the system froze.
At this point, we really don't know what to do. The problem is not reproducible (we have never experienced it in our
labs), and we don't have much information apart from the corrupted io-net core dump that is placed in /var/dumps/.
We have no idea how to avoid the system freeze.
Any suggestions?
Here is the coreinfo output, along with some information about io-net's configuration:
>coreinfo io-net.core
io-net.core:
processor=X86 num_cpus=1
cpu 1 cpu=686 name=Intel 686 F6M13S8 speed=1798
flags=0xc0007fff FPU MMU CPUID RDTSC INVLPG WP BSWAP MMX CMOV PSE PGE MTRR SEP SIMD FXSR
cyc/sec=1800291400 tod_adj=1280846434000000000 nsec=7958021569124295 inc=99733
boot=1280846434 epoch=1970 intr=0
rate=838095345 scale=-15 load=119
MACHINE="x86pc" HOSTNAME="localhost"
pid=77841 parent=1 child=0 pgrp=77841 sid=1 flags=0x403210 umask=0 base_addr=0x8048000 init_stack=0x8047f00 ruid=0
euid=0 suid=0 rgid=0 egid=0 sgid=0 ign=0000000006800000 queue=ff00000000008000 pending=0000000000000000
fds=6 threads=8 timers=5 chans=38
thread 1
ip=0xb032e985 sp=0x8047dc8 stkbase=0x7fc7000 stksize=528384
state=SIGWAITINFO flags=80000000 last_cpu=1 timeout=00000000
pri=10 realpri=10 policy=OTHER
thread 2 SIGNALLED-SIGSEGV code=1 MAPERR refaddr=4 fltno=11
ip=0xb031f3d4 sp=0x7fc6e40 stkbase=0x7f73000 stksize=135168
state=STOPPED flags=84000000 last_cpu=1 timeout=00000000
pri=10 realpri=10 policy=OTHER
thread 3
ip=0xb032dd29 sp=0x7fb5f00 stkbase=0x7fa5000 stksize=69632
state=RECEIVE flags=84020000 last_cpu=1 timeout=00000000
pri=10 realpri=10 policy=OTHER
blocked_chid=1
thread 4
ip=0xb032dd29 sp=0x7fa4f00 stkbase=0x7f94000 stksize=69632
state=RECEIVE flags=84020000 last_cpu=1 timeout=00000000
pri=10 realpri=10 policy=OTHER
blocked_chid=1
thread 5
ip=0xb032dd29 sp=0x7f72f00 stkbase=0x7f62000 stksize=69632
state=RECEIVE flags=84020000 last_cpu=1 timeout=00000000
pri=10 realpri=10 policy=OTHER
blocked_chid=1
thread 8
ip=0xb032db45 sp=0x7f1ff70 stkbase=0x7eff000 stksize=135168
state=RECEIVE flags=84000000 last_cpu=1 timeout=00000000
pri=21 realpri=21 policy=OTHER
blocked_chid=24
thread 9
ip=0xb032db45 sp=0x7efef70 stkbase=0x7ede000 stksize=135168
state=STOPPED flags=84000000 last_cpu=1 timeout=00000000
pri=21 realpri=21 policy=OTHER
thread 10
ip=0xb032db45 sp=0x7eddf70 stkbase=0x7ebd000 stksize=135168
state=STOPPED flags=84000000 last_cpu=1 timeout=00000000
pri=21 realpri=21 policy=OTHER
>pidin -p io-net mem irq
pid tid name prio STATE code data stack
77841 1 sbin/io-net 10o SIGWAITINFO 64K 5172K 8192(516K)*
77841 2 sbin/io-net 10o RECEIVE 64K 5172K 4096(132K)
77841 3 sbin/io-net 10o RECEIVE 64K 5172K 24K(68K)
77841 4 sbin/io-net 10o RECEIVE 64K 5172K 4096(68K)
77841 5 sbin/io-net 10o RECEIVE ...