Davide Ancri
|
Kernel dump: S/C/F=5/4/3 (kerext_process@376) [QNX 6.5.0/x86]
|
Davide Ancri
04/29/2015 5:03 AM
post113752
|
Hello there,
I'm running QNX 6.5.0/x86 and I often get a kernel crash whose dump is always similar, with:
- S/C/F=5/4/3
- instruction[f0058068] (kerext_process@376)
Since my only debug device is the screen, I attach here 3 pictures of the kernel dumps.
I just finished reading the "reading a kernel dump" doc page, and found that the S/C/F (signal/code/fault) values mean: SIGTRAP + TRAP_CRASH + FLTBPT.
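For anyone comparing several dump screenshots, the decoding step can be sketched as a small C helper. This is only an illustration: `scf_decode`, `scf_lookup` and the three tables are hypothetical names, and the tables contain just the three values identified above (5, 4, 3), not the full lists from the QNX headers.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One entry of a lookup table: numeric value -> symbolic name. */
struct scf_name { int value; const char *name; };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Assumption: only the values named in this post are listed.
 * The code value (4) is interpreted in the context of SIGTRAP. */
static const struct scf_name signal_names[] = { { 5, "SIGTRAP" } };
static const struct scf_name code_names[]   = { { 4, "TRAP_CRASH" } };
static const struct scf_name fault_names[]  = { { 3, "FLTBPT" } };

/* Return the symbolic name for value, or "?" if it is not in the table. */
static const char *scf_lookup(const struct scf_name *tab, size_t n, int value)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].value == value)
            return tab[i].name;
    }
    return "?";
}

/* Find "S/C/F=a/b/c" in a dump header line and write the decoded
 * names into out. Returns 0 on success, -1 if no triple is found. */
int scf_decode(const char *line, char *out, size_t outlen)
{
    int s, c, f;
    const char *p;

    assert(line != NULL && out != NULL);
    p = strstr(line, "S/C/F=");
    if (p == NULL || sscanf(p, "S/C/F=%d/%d/%d", &s, &c, &f) != 3)
        return -1;

    snprintf(out, outlen, "%s + %s + %s",
             scf_lookup(signal_names, COUNT(signal_names), s),
             scf_lookup(code_names, COUNT(code_names), c),
             scf_lookup(fault_names, COUNT(fault_names), f));
    return 0;
}
```

Calling `scf_decode()` on the header line of the dump fills the buffer with the three names; any value not in the tables is printed as "?".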
I took a quick look at an old QNX 6.4.0 kernel source checkout and found the file trunk/services/system/ker/kerext_process.c. Around line 376, I see a lot of consistency checks performed while destroying a process:
....
362 if(prp->limits && prp->limits->links == ~0U) crash();
363 if(prp->pid) crash();
364 if(prp->cred) crash();
365 if(prp->alarm) crash();
366 if(pril_first(&prp->sig_pending)) crash();
367 if(prp->sig_table) crash();
368 if(prp->nfds) crash();
369 if(prp->chancons.vector) crash();
370 if(prp->fdcons.vector) crash();
371 if(prp->threads.vector) crash();
372 if(prp->timers.vector) crash();
373 if(prp->memory) crash();
374 if(prp->join_queue) crash();
375 // if(prp->session) crash();
376 if(prp->debugger) crash();
377 if(prp->lock) crash();
378 if(prp->num_active_threads) crash();
379 if(prp->vfork_info) crash();
380 // FIX ME - this is not NULL now ... why? if(prp->rsrc_list) crash();
381 if(prp->conf_table) crash();
....
Of course the sources may have changed a lot since 6.4.0, but I guess I'm hitting some kind of kernel consistency assertion.
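To make the guess concrete, here is a user-space sketch of the same pattern: a process entry is expected to be fully torn down before it is freed, and any field still set trips the check. `struct mock_process` and `first_leftover_field` are hypothetical names, and the struct keeps only a few of the fields from the 6.4.0 listing. Instead of calling crash(), the sketch reports which field was left over.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for the kernel's process entry.
 * Only a handful of the fields checked in the 6.4.0 listing are kept. */
struct mock_process {
    int      pid;
    void    *cred;
    void    *memory;
    void    *debugger;
    void    *lock;
    unsigned num_active_threads;
};

/* Check the fields in the same order as the listing. Returns NULL when
 * the entry is clean, otherwise the name of the first field still set.
 * The real kernel code calls crash() at that point instead. */
const char *first_leftover_field(const struct mock_process *prp)
{
    assert(prp != NULL);
    if (prp->pid)                return "pid";
    if (prp->cred)               return "cred";
    if (prp->memory)             return "memory";
    if (prp->debugger)           return "debugger";
    if (prp->lock)               return "lock";
    if (prp->num_active_threads) return "num_active_threads";
    return NULL;
}
```

If the 6.5.0 source still matches the 6.4.0 line numbers (an assumption I cannot verify), kerext_process@376 would correspond to the `debugger` case, i.e. a process being destroyed while its debugger field is still set.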
Here is a brief explanation of my system architecture and of the actions that often cause the kernel dump.
There are several QNX 6.5.0/x86 hosts (say 32, but the dump happens even with only 6), all running io-pkt-v4 and divided symmetrically into two groups.
Each host mounts two custom interfaces; we wrote the io-pkt device driver for both:
- the "mc0" interface connects hosts lying in the same group (like 2 Ethernet segments, each private to its group of hosts)
- the "cl0" interface connects each host to the hosts of the remote group: broadcast packets produced by one host are not forwarded to other hosts in the same group, so qnet discovers only "remote" hosts via the cl0 interface
qnet is bound to both the mc0 and cl0 interfaces on every host.
The kernel dump happens when a script on the first host of the first group (acting as "master" host) spawns in the background ("on -f <host> <script> &") a ksh script on each host in the system. That script collects a lot of information about the host itself: many "pidin" commands with almost all available options, many io-pkt query utilities (ifconfig, netstat, nicinfo, etc.), and some custom utilities for general system monitoring.
The controlling script then blocks in the "wait" command until all the spawned scripts terminate.
The kernel dump happens randomly on the hosts (hosts from both groups).
We have never seen this kernel dump when the system has only one host group (no cl0 interface, qnet bound to mc0 only).
Since both drivers are custom, the root cause may of course lie in our own code: can anyone give me a hint about what kind of driver error can lead to such a kernel dump?
One last piece of information: it seems that the presence of the "pidin rc" command in the script executed in parallel on each host dramatically increases the chance of getting a kernel dump.
I'm running a long-term test without "pidin rc" to confirm this.