wiki1331: Future_Kernel_Introspection_20070420 (Version 3)

Kernel Introspection: Design Meeting 2007-04-20


Who

bstecher, cbugess, adanko, dbailey, mkisel, shiv

Summary


  • we found and fixed a bug in the design for the deadlock smoke alarm
  • new design for CPU hog detection
  • more on reading bulk data from Proc
  • more on the Generic Notifier

What else we need to do:

  • create an authoritative list of the blocking states for which we will set deadlock-detection timers
  • Colin will advise what pathnames to use for the general notifier and bulk proc data transfer.
  • we should all combine lists of things for which we should have notification thresholds, starting with the RLIMITS


Deadlock Detection


Problem

A particular customer is likely to have several threads that are receive-blocked forever. We would be sending their HA controller a sigevent every minute for each of them. Bletch.

Proposal A13

  • Kernel adds default timers to only a carefully selected set of blocks: only those blocking states which can trigger a detectable deadlock.
    • only some deadlocks are detectable. For example, two threads can deadlock on a pair of semaphores, but because any other thread could conceivably release either semaphore, that deadlock cannot be detected with certainty.
  • Implementation will take a set of states, in a parameter, for which we will create default timers.
  • but we must advise the customer to use a set of states that we can authoritatively say will detect all deadlocks that their HA controller can deal with.
  • These timers fire only once. We assume that if the event (a thread being blocked on a thread for 1 minute) is a true symptom of a deadlock, then the customer's HA controller will find and fix it. If it is not a symptom of a true deadlock, then that blocked thread cannot possibly trigger a later deadlock, at least not until it unblocks and blocks again. On the second block, we will set another timer.
  • The sigevent should identify the blocking thread id, so the customer's HA controller can poll for data focused on that thread
  • We are likely to deliver sigevents in bursts to the customer's HA controller. We should advise the customer to collect the sigevents received within some reasonable time (say 30 seconds or a minute) and then process them all at once.
  • we still need an optimized call to read out all the sync information when the customer is triggered to poll.
    • perhaps we should have an API that takes a list of tids as a parameter, and returns the syncs for each of them (i.e. bulk transfer). Note: this is not quite compatible with the current bulk read proposal.
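
The one-shot timer behaviour and the batching advice in the list above can be modelled in a few lines of Python. This is only a toy sketch, not kernel code: the class and function names, the state names, and the contents of the watched set are all hypothetical (the authoritative list of blocking states is still an open action item), and plain (time, tid) tuples stand in for sigevents.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical watched set, for illustration only; not a recommendation.
WATCHED_STATES = frozenset({"MUTEX", "SEND", "REPLY"})
TIMEOUT_SECS = 60  # the "blocked for 1 minute" threshold from the notes


@dataclass
class Thread:
    tid: int
    state: str = "RUNNING"
    deadline: Optional[float] = None  # pending one-shot timer, if any


class DefaultTimerModel:
    """Toy model: one-shot default timers on a selected set of blocks."""

    def __init__(self, watched=WATCHED_STATES, timeout=TIMEOUT_SECS):
        self.watched = watched  # the "set of states, in a parameter"
        self.timeout = timeout
        self.threads = {}
        self.events = []  # (time, tid) pairs standing in for sigevents

    def block(self, tid, state, now):
        t = self.threads.setdefault(tid, Thread(tid))
        t.state = state
        # Arm a timer only when the blocking state is in the watched set.
        t.deadline = now + self.timeout if state in self.watched else None

    def unblock(self, tid):
        t = self.threads[tid]
        t.state = "RUNNING"
        t.deadline = None  # cancel any pending timer

    def tick(self, now):
        for t in self.threads.values():
            if t.deadline is not None and now >= t.deadline:
                self.events.append((now, t.tid))  # event carries the tid
                t.deadline = None  # one-shot: no re-arm until the next block


def batch_events(events, window=30):
    """Group events so that each batch starts at most `window` seconds
    before its last member (the HA controller side of the advice)."""
    batches = []
    for when, tid in sorted(events):
        if batches and when - batches[-1][0][0] <= window:
            batches[-1].append((when, tid))
        else:
            batches.append([(when, tid)])
    return batches
```

In this model a thread that stays blocked in a watched state produces exactly one event, however long it remains blocked, and a thread blocked in an unwatched state produces none. A new event for the same tid appears only after it unblocks and blocks again.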

This design must meet these tests to be useful:

  1. In practice, we will not be sending the customer's HA controller a sigevent every minute or more often. (Otherwise this is no better than polling.)
  2. We will send a sigevent whenever any deadlock (that the customer's deadlock-detection algorithm is capable of dealing with) occurs.
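
As a rough way to make the two tests checkable against a recorded run, they could be written as predicates. This is a sketch under assumptions: the function names are hypothetical, test 1 is read here as "fewer sigevents than one per minute, averaged over the run", and each deadlock is represented as the set of tids involved in it.

```python
def passes_rate_test(num_events, run_secs, period_secs=60):
    """Test 1 (assumed reading): on average, fewer than one sigevent
    per period over the whole run."""
    return num_events < run_secs / period_secs


def passes_coverage_test(reported_tids, deadlocks):
    """Test 2: every deadlock (a set of tids) has at least one member
    thread for which a sigevent was sent."""
    reported = set(reported_tids)
    return all(reported & set(cycle) for cycle in deadlocks)
```

For example, with 5 threads blocked forever over an 8-hour run (480 minutes), a timer that repeats every minute yields 5 × 480 = 2400 events, so `passes_rate_test(2400, 8 * 3600)` is false. One-shot timers yield 5 events, and `passes_rate_test(5, 8 * 3600)` is true.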