Kernel Introspection:Design Meeting 2007-04-20#


Who#

bstecher, cbugess, adanko, dbailey, mkisel, shiv

Summary#


What we did#

What else we need to do: #


Deadlock Detection#


Problem#

A particular customer is likely to have several threads that are forever receive blocked. We would be sending their HA controller a sigevent every minute for each of them. Bletch.

Proposal A13#

This design must meet these tests to be useful:

  1. In practice, we will not be sending the customer's HA controller a sigevent every minute or more often. (Otherwise this is no better than polling.)
  2. We will send a sigevent whenever any deadlock occurs that the customer's deadlock-detection algorithm is capable of dealing with.

CPU Hog Detection#


Problem#

APS doesn't have a notification mechanism for either the system running out of CPU time or a single partition using more time than some limit. (The only notification mechanism in APS is for running out of critical time, which is specific to designated threads, so critical time notification doesn't help us solve this problem.)

Proposal A14#

This assumes that the first purpose of the customer's HA controller in detecting CPU hogs is to limit damage to the rest of the system, and that its second purpose is to kill and recreate them (in the hope that the error state will be reset). In that case, we conjecture that a customer's HA controller can afford to wait 5 minutes to kill a hog if that hog is being calmed by APS.

This assumes that 8 (or possibly 16) partitions is enough.


Reading bulk data from Proc#


We believe there are two inefficiencies in reading all pid/tid data with individual calls:
  1. traversing mapinfo structures (in the current implementation) just to get total memory used for one pid.
  2. traversing the filesystem for each pid/tid

Proposal A15#

First, change the allocators to keep a running total of memory used for each process, so that no mapinfos need be traversed during calls to Proc for retrieving process info. We would return the total memory used by a process as one of the RLIMIT values when returning data about rlimits or returning complete data about the process.
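A minimal sketch of the running-total idea. The struct `proc_mem_account` and the hooks `account_alloc`, `account_free` and `account_total` are hypothetical names invented for illustration; they are not existing Proc interfaces, and a real version would also need whatever locking the allocators already use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process accounting record (not a real Proc structure). */
struct proc_mem_account {
    size_t bytes_in_use; /* running total, updated on every alloc and free */
};

/* Hypothetical hook: called by an allocator after it has given `len`
 * bytes to the process. */
void account_alloc(struct proc_mem_account *acct, size_t len)
{
    acct->bytes_in_use += len;
}

/* Hypothetical hook: called by an allocator after it has taken `len`
 * bytes back from the process. */
void account_free(struct proc_mem_account *acct, size_t len)
{
    assert(len <= acct->bytes_in_use); /* freeing more than was charged is a bug */
    acct->bytes_in_use -= len;
}

/* Constant-time query: no mapinfo traversal is needed to answer it. */
size_t account_total(const struct proc_mem_account *acct)
{
    return acct->bytes_in_use;
}
```

The point of the design is that the cost moves from the query (a traversal per pid) to the allocators (one addition or subtraction per operation).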

Second, create an optimized call for reading data about many pids/tids at once:

  1. user needs only one open() to read data for all pids/tids.
  2. allow many reads per open and return data about many pids and tids per read()
  3. allow the user to filter the data returned according to the task at hand. (For example, deadlock detection means reading all syncs for all threads; CPU hog detection means reading process CPU times for all processes.)

Specifically:

<data for pid1><data for tid1 of pid1><data for tid2 of pid1>....<data for pid2><data for tid1 of pid2> ...
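One way a client might walk such a stream, sketched under assumptions: the notes do not define a record format, so the header `struct rec_hdr` (type, total length, id) and the names `REC_PID`, `REC_TID` and `walk_records` are hypothetical. Each record is assumed to start with this header, with thread records following the pid record they belong to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record types; the real set is not defined in these notes. */
enum rec_type { REC_PID = 1, REC_TID = 2 };

/* Hypothetical common header at the start of every packed record. */
struct rec_hdr {
    uint16_t type; /* REC_PID or REC_TID */
    uint16_t len;  /* total record size in bytes, including this header */
    uint32_t id;   /* pid or tid */
};

struct walk_counts {
    unsigned pids;
    unsigned tids;
};

/* Walk a buffer filled by one read() and count the records in it.
 * Returns 0 on success, -1 if the buffer is truncated or malformed. */
int walk_records(const unsigned char *buf, size_t n, struct walk_counts *out)
{
    size_t off = 0;

    assert(buf != NULL && out != NULL);
    out->pids = 0;
    out->tids = 0;

    while (off < n) {
        struct rec_hdr h;

        if (n - off < sizeof h)
            return -1;                   /* truncated header */
        memcpy(&h, buf + off, sizeof h); /* copy out: avoids unaligned access */
        if (h.len < sizeof h || h.len > n - off)
            return -1;                   /* bad or truncated record */

        if (h.type == REC_PID) {
            out->pids++;
        } else if (h.type == REC_TID) {
            if (out->pids == 0)
                return -1;               /* a thread record must follow its pid */
            out->tids++;
        }
        /* Unknown record types are skipped using their length field. */
        off += h.len;
    }
    return 0;
}
```

Carrying a length in every record lets an older client skip record types it does not understand, which matters if customers are allowed to select different sets of structs.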

We want to generalize this to provide focused filtering and to allow customers to, well..., customize:

The user specifies which set of structs to return with a special pathname. For example, /proc/pids+threads+maps/ means return the thread data and map data for all pids in the system. Reading /proc/pids/ would return the full pid data for each pid in the system. This allows the user to read all possible data, or to filter data appropriate for a specific task.
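A sketch of how the resource manager might turn such a pathname into a selection mask. The selector names `pids`, `threads` and `maps` come from the example path above; the bit values, the function name `parse_selector_path` and the choice to reject unknown selectors are assumptions made for illustration.

```c
#include <string.h>

/* Hypothetical selection bits, one per kind of struct a client can ask for. */
enum {
    SEL_PIDS    = 1u << 0,
    SEL_THREADS = 1u << 1,
    SEL_MAPS    = 1u << 2
};

/* Parse "/proc/<name>+<name>+.../" into a bitmask of SEL_* values.
 * Returns 0 if the path is not under /proc/ or names an unknown selector. */
unsigned parse_selector_path(const char *path)
{
    static const struct {
        const char *name;
        unsigned bit;
    } table[] = {
        { "pids", SEL_PIDS },
        { "threads", SEL_THREADS },
        { "maps", SEL_MAPS },
    };
    static const char prefix[] = "/proc/";
    const size_t plen = sizeof prefix - 1;
    unsigned mask = 0;
    const char *p;

    if (strncmp(path, prefix, plen) != 0)
        return 0;

    p = path + plen;
    while (*p != '\0' && *p != '/') {
        size_t tok = strcspn(p, "+/"); /* length of the next selector name */
        unsigned bit = 0;
        size_t i;

        for (i = 0; i < sizeof table / sizeof table[0]; i++) {
            if (strlen(table[i].name) == tok &&
                strncmp(p, table[i].name, tok) == 0) {
                bit = table[i].bit;
                break;
            }
        }
        if (bit == 0)
            return 0;                  /* unknown or empty selector */
        mask |= bit;
        p += tok;
        if (*p == '+')
            p++;                       /* step over the separator */
    }
    return mask;
}
```

With this sketch, `/proc/pids+threads+maps/` yields all three bits and `/proc/pids/` yields only `SEL_PIDS`; adding a new kind of data means adding one table entry.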

Observation: this is equivalent to returning xml. For efficiency, we prefer to return packed binary. Later, we could easily add a layer (function or resmgr) to translate into xml.

Looks pretty slick to us.


More on A5: the Generic Notifier#


We would like to base soft RLIMIT warnings on a generic notifier. We'd also use it for all memory and APS notifications, time-to-do-deadlock-detection notifications, and in general any notification for any threshold transition of interest to any client of any resource in the OS. This means we need to come up with a pathname space to name all these possibilities:

Design Notes#

Implementation notes#