Project Home
Project Home
Documents
Documents
Wiki
Wiki
Discussion Forums
Discussions
Project Information
Project Info
wiki3961: 642_mm_dsl

QMS3212#

MM DSL - Dynamic Skip List for virtual memory maps#

DSL is a mechanism to provide an overlay structure which facilitates fast searching of an ordered singly-linked list.

The problem being addressed was observed when a customer had a very large number of virtual memory segments, and was performing mprotect() operations on them. The mm_map code ended up spending several milliseconds scanning the list for each operation, and the performance degradation was considerable.

The DSL implementation had initially been done in an abstract way, on an experimental basis. For this first use in the kernel, it was targetted at indexing the mm_map structures associated with process virtual memory maps, and it was convenient to make certain assumptions about the data structures being indexed. Specifically, it assumes the first member of the structure is the list linkage and it knows that it can use the virtual memory start address for indexing. These assumptions are not deeply embedded however, so it will still be a relatively simple matter to abstract the mm_dsl implementation to handle other types of lists.

The base data structure is a chain of mm_map structures that are linked together in a singly-linked list:

    mm_1 ---> mm_2 ---> mm_3 ---> mm_4 ---> mm_5 ---> mm_6 ---> mm_7 ---> NULL

The dsl_entry structure contains a few pointers and an integer representing the number of subtending child nodes associated with this entry.

struct dsl_entry {
	struct dsl_entry	*next;
	struct dsl_entry	*child;
	struct dsl_link		*data;
	unsigned            kids;
};

The topmost structure of the dsl_index is initially quite simple. The data pointer is populated to point to the start of the list.

              
    dsl_head - kids=7
    | | |  
    | | |  next
    | | +-------> NULL 
    | |  
    | | child
    | +------> NULL 
    |  
    |data
    |  
    v 
    mm_1 ---> mm_2 ---> mm_3 ---> mm_4 ---> mm_5 ---> mm_6 ---> mm_7 ---> NULL

Once the number of kids exceeds the split threshold (N), a new level is introduced below the top level. The existing kids are split between the two owners.

    dsl_head - kids=2
    | | |  
    | | |  next
    | | +-------> NULL 
    | |  
    | | child
    | +----+
    |      |
    | data |
    |      |
    |      |
    |      v
    |  dsl_entry - kids=N/2    +-->  dsl_entry - kids=N/2
    |  | | |                   |     | | |
    |  | | |  next             |     | | |  next
    |  | | +-------------------+     | | +--------------> NULL
    |  | |                           | |
    |  | | child                     | | child
    |  | +------> NULL               | +------> NULL
    |  |                             |
    |  | data                        | data
    |  |                             |
    v  v                             v 
    mm_1 ---> mm_2 ---> ... --->   mm_N/2 --->  ... ---> mm_N --> NULL

The next level grows in a similar way. As dsl_entries grow to size N, they are split in two.

    dsl_head - kids=3
    | | |  
    | | |  next
    | | +-------> NULL 
    | |  
    | | child
    | +----+
    |      |
    | data |
    |      |
    |      |
    |      v
    |  dsl_entry - kids=?      +-->  dsl_entry - kids=N/2   +-->  dsl_entry - kids=N/2
    |  | | |                   |     | | |                  |     | | | 
    |  | | |  next             |     | | |  next            |     | | |  next               
    |  | | +----------- ... ---+     | | +------------------+     | | +------------------> NULL  
    |  | |                           | |                          | |    
    |  | | child                     | | child                    | | child  
    |  | +------> NULL               | +------> NULL              | +------> NULL    
    |  |                             |                            |     
    |  | data                        | data                       | data     
    |  |                             |                            |       
    v  v                             v                            v          
    mm_1 ---> mm_2 ---> ... --->  mm_P ---> mm_P+1  ...      ---> mm_Q --> mm_Q+1 ... ---> NULL

Once an upper-level node (a node with children) exceeds the split threshold, it is split in two and a new layer inserted.

    dsl_head - kids=2 (WAS N before the split)
    | | |  
    | | |  next
    | | +-------> NULL 
    | |  
    | | child
    | +----+
    |      |
    | data |
    |      |
    |      |
    |      v
    | dsl_entry - kids=N/2    +------------------------------>   dsl_entry - kids=N/2  
    | | | |                   |                                  | | |                 
    | | | |  next             |                                  | | |  next           
    | | | +-------------------+                                  | | +--------------> NULL
    | | |                                                        | |                  
    | | | child                                                  | | child           
    | | +-----+                                                  | +----+
    | |       |                                                  |      |             
    | | data  |                                                  | data |             
    | |       |                                                  |      |             
    | |       |                                                  |      | 
    | |       |                                                  |      |             
    | |       |                                                  |      |
    | |       v                                                  |      v
    | | dsl_entry               +-->  dsl_entry                  |  dsl_entry             
    | | | | |                   |     | | |                      |  | | | 
    | | | | |  next             |     | | |  next                |  | | |  next               
    | | | | +------------ ...  -+     | | +------------> NULL    |  | | +--------- ... --> NULL  
    | | | |                           | |                        |  | |    
    | | | | child                     | | child                  |  | | child  
    | | | +------> NULL               | +------> NULL            |  | +------> NULL    
    | | |                             |                          |  |     
    | | | data                        | data                     |  | data     
    | | |                             |                          |  |       
    v v v                             v                          v  v        
    mm_1 ---> mm_2 --->   ... --->   mm_P ---> mm_P+1...    ---> mm_Q --> mm_Q+1   ... ---> NULL

Note that the branches of the dsl tree below the inserted layers are disjoint. In this example, you can not get the dsl_entry that points at mm_Q from the dsl_entry that points at mm_P, even though mm_Q and mm_P are 'nearby'.

This downward growth process continues until the tree reaches a maximum depth, at which point the tree will have grown very large indeed. The maximum depth is currently set to 10, so the tree could grow as large as 50^10 items.

Shrinking the tree happens when the number of children under a dsl node shrinks to zero. (This is configurable in the code, but is currently set to zero) The dsl entry is deleted, and the effect is noted up the tree if applicable. Oother nodes and whole layers may be required to be deleted recursively.

Note that once a split is done, the lower level trees grow separately, and are never rebalanced. Depending on how insertions and deletions are done to the underlying linked list, the tree could grow in an unbalanced fashion. With deletions, it would also be possible a dsl overlay tree that is as large as the list it is representing, as long as one or more list elements remained under each node of the tree.

APIs#

The external APIs to the dsl code consist of dsl_state_init(), dsl_adding(), dsl_removing(), and dsl_find_prev(). dsl_find_prev() returns a pointer to the element before the indicated item in the list, as this is the element generally required for operating on a singly-linked list.

To use the information contained in the dsl tree, a call is done to dsl_state_init() with a search key parameter. This populates a table with pointers to the dsl_entry structures (one at each level of the tree) that were traversed to find the requested element.

Searching the Tree#

Note that at every level, each dsl_entry maintains a pointer to the underlying list. A search is done by traversing the list at every level and comparing the data key (address) to the search key. Because the dsl_entry nodes represent multiple underlying entries in the list, a dsl search quickly skips large segments of the list. Once a data key greater than the search key is found, the algorithm populates the search table with the appropriate dsl_entry pointer and decends into that branch to refine the search until the 'leaf' level (with no children) is found.

When this is complete, the search table contains a list of each node that was traversed along the way to find the highest data element lower than the search key.

At the lowest level, there is a dsl_entry for as many as 50 list items, so a linear list search at this point is necessary to find the correct list item.

As every search devolves into a linear search with as many as 50 items, there is no benefit in using the dsl overlay for a small list, and it instead incurs a small amount of additional overhead. Once the list exceeds the split point, the benefits of the overlay structure will start to appear, as even in a single-layer dsl tree, the search can skip over 25-50 items with each pointer operation.

Benchmarking#

Benchmarking was done with an application that mirrored the customer usage of the mprotect() API. A single large chunk of memory was allocated with one set of permissions, and then mprotect() calls were done on individual pages to flip the permission bits. In this test, the size of the process mm_map list will grow as the app runs and splits virtual memory into smaller and smaller pieces. Without the dsl overlays, the performance suffered linearly with the growth of the mm_map lists. With the dsl overlays, the performance in this test was virtually constant.

No benchmarking was done to measure performance impact in the normal (<50 mm_map segments) case, but it was not anticipated to be significant. (But looking for surprises is one of the reasons that benchmarking is important now, isn't it?)

Overall, the test application ran 10x faster for 50k memory segments, with the list searches no longer being the dominant factor. (Again, stated without the benefit of profiling)

Other#

There are no implications for the Safety Manual resulting from this change.