Kernel Introspection: Design Meeting 2007-04-17#


Who#


bstecher, cbugess, adanko, dbailey, mkisel

Summary#


We distilled the customer's requirements for their HA controllers into seven major requirements, separating the need for detection from the need for reporting.

Our general conclusions were:

Smoke Alarm + Flashlight#

We will provide fast but approximate detection mechanisms to alert a customer's HA controller that there might be a problem (problem == a condition the customer looks for). The notification is delivered with a sigevent (the Smoke Alarm). The customer's controller responds by calling slower APIs to methodically extract the thread/process details it needs to apply its precise criteria (the Flashlight). The Smoke-Alarm tests in the kernel may deliver a reasonable number of false positives but must never return a false negative.
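The two-stage idea can be illustrated with a small, portable C sketch. Everything here is hypothetical: `struct proc_info`, `smoke_alarm()` and `flashlight()` are illustrative names, not Neutrino APIs, and the sketch assumes the fast path keeps an estimate that is always an upper bound on the true value. That assumption is what makes false positives possible and false negatives impossible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process record; field names are illustrative only. */
struct proc_info {
    int      pid;
    unsigned fd_estimate;  /* cheap upper bound kept by the fast path */
    unsigned fd_exact;     /* precise count, expensive to obtain in practice */
};

/* Smoke alarm: fast and approximate. Because fd_estimate is assumed to be an
 * upper bound on the true count, this may fire for a process that is not
 * really over the threshold, but it never misses one that is. */
int smoke_alarm(const struct proc_info *procs, size_t n, unsigned threshold)
{
    for (size_t i = 0; i < n; i++) {
        assert(procs[i].fd_estimate >= procs[i].fd_exact);
        if (procs[i].fd_estimate > threshold)
            return 1;
    }
    return 0;
}

/* Flashlight: slow and precise. Applies the exact criterion to every process,
 * stores up to out_cap offending pids in out, and returns how many it stored. */
size_t flashlight(const struct proc_info *procs, size_t n,
                  unsigned threshold, int *out, size_t out_cap)
{
    size_t found = 0;
    for (size_t i = 0; i < n; i++) {
        if (procs[i].fd_exact > threshold && found < out_cap)
            out[found++] = procs[i].pid;
    }
    return found;
}
```

In a real controller the alarm would arrive asynchronously as a sigevent rather than being polled, and the flashlight step would only run after an alarm.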

Summary of Requirements#


These requirements are distilled from the customer's own design documents for their high-availability reporting and control systems. We've attempted, wherever possible, to separate the need for detection from the need for reporting, in the hope that they might have different performance requirements. We've only included those requirements that might require Neutrino support. Our objective is to make sure Neutrino provides a suitably efficient interface to the customer's controller so it can meet these requirements. So, for example, when the customer's controller needs to keep a history of a value, we have the choice of keeping the history in Neutrino, or giving the customer an efficient interface to poll the value and letting the controller keep the history.
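As a rough illustration of the second option (the controller polls and keeps its own history), a fixed-size ring buffer is one possible shape. The names `struct history`, `history_record()` and `history_get()` and the depth `HISTORY_LEN` are assumptions made for this sketch; the source does not specify how the polled value is obtained or how deep the history must be.

```c
#include <assert.h>
#include <stddef.h>

#define HISTORY_LEN 8  /* illustrative depth, not taken from the requirements */

/* Controller-side history of one polled value (e.g. a process's memory use). */
struct history {
    unsigned long samples[HISTORY_LEN];
    size_t next;   /* slot the next sample will overwrite */
    size_t count;  /* number of valid samples, at most HISTORY_LEN */
};

void history_init(struct history *h)
{
    h->next = 0;
    h->count = 0;
}

/* Store one polled sample, overwriting the oldest once the buffer is full. */
void history_record(struct history *h, unsigned long value)
{
    h->samples[h->next] = value;
    h->next = (h->next + 1) % HISTORY_LEN;
    if (h->count < HISTORY_LEN)
        h->count++;
}

/* Return the sample taken `age` polls ago (0 = most recent). */
unsigned long history_get(const struct history *h, size_t age)
{
    assert(age < h->count);
    size_t idx = (h->next + HISTORY_LEN - 1 - age) % HISTORY_LEN;
    return h->samples[idx];
}
```

With this layout the cost to Neutrino is only the poll itself. Storage and retention policy stay in the controller.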

C1. Detect Low System Memory#

C2. Detect Processes Hogging FDs#

C3. Detect Process Memory Hogs#

C4. Record Process Memory History#

C5. Detect Process CPU Hogs#

C6. Record Process CPU History#

C7. Detect Deadlocked Threads#


Brainstorming list of APIs#


A1. the existing devctls, called for each process and thread #

A2. bulk devctls #

A3. bulk devctls based on a user supplied callback which defines selection criterion#

A4. bulk devctls based on selection criteria specified by a user supplied data structure#
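To make A4 concrete, here is a minimal sketch of what criteria expressed as data might look like. Nothing here is an existing devctl: `struct thread_snapshot`, `struct select_criteria` and `bulk_select()` are hypothetical, and the two filter fields were chosen only because memory and CPU appear among the requirements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of one thread, as a bulk query might return it. */
struct thread_snapshot {
    int           pid;
    int           tid;
    unsigned long mem_kb;        /* memory attributed to the owning process */
    unsigned      cpu_permille;  /* recent CPU share, in 1/1000ths */
};

/* Hypothetical selection criteria supplied by the caller as plain data.
 * A field set to 0 places no restriction on that attribute. */
struct select_criteria {
    unsigned long min_mem_kb;
    unsigned      min_cpu_permille;
};

/* Copy every snapshot that satisfies all criteria into out (up to out_cap)
 * and return the number copied. */
size_t bulk_select(const struct thread_snapshot *all, size_t n,
                   const struct select_criteria *crit,
                   struct thread_snapshot *out, size_t out_cap)
{
    size_t copied = 0;
    assert(crit != NULL);
    for (size_t i = 0; i < n && copied < out_cap; i++) {
        if (all[i].mem_kb >= crit->min_mem_kb &&
            all[i].cpu_permille >= crit->min_cpu_permille)
            out[copied++] = all[i];
    }
    return copied;
}
```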

A5. a general notification interface#

A6. Read-only shared-memory window into kernel's thread/process data #

A7. use callouts on trace events #

A8. do nothing #

A9. API to read RLIMIT usage levels #

A10. Generalize RLIMIT #

A11. converge (A5, A9, A10) for Smoke-Alarm, and use A1 and A4 for Flashlight#

A12. faster way to read memory usage #


Mapping requirements to Brainstormed APIs#


C1. Low System Memory#

C2. FD Hogs#

Alternatives:
  1. A8
  2. A10
  3. A5 plus A9
  4. A11
This is a case of the general resource limit problem.
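A small sketch of the resource-limit view of FD hogging: compare current usage against a percentage of the limit. `getrlimit()` and `RLIMIT_NOFILE` are standard POSIX. The helper names `usage_alarm()` and `fd_soft_limit()` are hypothetical, and the current usage count is passed in as a parameter because reading it efficiently is exactly what A9 would have to provide.

```c
#include <assert.h>
#include <sys/resource.h>

/* Return 1 if `used` is at or above `percent` of `limit`, otherwise 0. */
int usage_alarm(rlim_t used, rlim_t limit, unsigned percent)
{
    assert(percent <= 100);
    if (limit == RLIM_INFINITY)
        return 0;  /* no limit set, so there is nothing to compare against */
    /* Compute floor(limit * percent / 100) without overflowing rlim_t. */
    rlim_t threshold = (limit / 100) * percent + (limit % 100) * percent / 100;
    return used >= threshold;
}

/* Read this process's soft limit on open file descriptors.
 * Returns 0 on success and -1 if getrlimit() fails. */
int fd_soft_limit(rlim_t *out)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *out = rl.rlim_cur;
    return 0;
}
```

A threshold set below 100 percent would let such an alarm fire before the process actually hits the limit, which is one way to read the combination of A5 and A9.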

C3. Memory Hogs#

Alternatives:
  1. A10
  2. advise customer to enable A10 only after low system memory detected

C4. Memory History#

C5. CPU Hogs#

Alternatives:
  1. advise customer to use APS to detect the existence of some CPU hog, followed by A4
  2. A11

C6. CPU History#

C7. Deadlocks#

Alternatives:
  1. Default timers:
  1. Quick deadlock check in kernel