wiki1114: Advanced_Memory_Management_Variable_Page_Sizes (Version 8)

Variable Page Sizes#


Each virtual address space consists of a set of mappings from virtual addresses to physical addresses. These mappings are managed in a page table, with one page table per virtual address space. When a process accesses a virtual address, the CPU must have the appropriate page table entry in its TLB (translation look-aside buffer) in order to translate that virtual address to a physical address. The operating system ensures that the TLB contains the necessary page table entries at all times.

Each processor has a limited number of slots in its TLB, so as different processes run, or as one process refers to different portions of its address space, the operating system must evict older TLB entries and replace them with the page table entries required at the moment. Each time this happens (that is, each time a process refers to memory that is not mapped by the current TLB contents), the operating system incurs some overhead (a "TLB-miss exception") to update the TLB with the appropriate page table entry.

The Advanced Memory Management feature introduces the notion of variable page sizes to the Neutrino memory manager. Before this feature, all page table entries covered 4K blocks of memory. This feature allows the mapping to be performed with different page sizes as appropriate.

For example, prior to this feature a single 64K block of memory would be mapped using sixteen 4K page table entries. This feature allows the system to map that same block of memory with a single 64K page table entry.

The primary benefit of this feature is improved performance due to a decreased frequency of TLB misses. With larger memory pages, a single large data structure can be referenced after only a single TLB-miss exception, and the TLB can simultaneously cover a larger portion of every virtual address space.
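
To make the benefit concrete, here is a small illustrative calculation in C (the 64-entry TLB is an assumed figure for the example, not a property of any particular processor): the number of page table entries needed to map a 64M data structure, and how much address space the TLB can cover, at several page sizes.

    #include <stdio.h>

    int main(void) {
        const unsigned long tlb_entries = 64;              /* assumed TLB size   */
        const unsigned long buf = 64UL * 1024 * 1024;      /* 64M data structure */
        const unsigned long page_sizes[] = { 4096UL, 64UL * 1024, 4UL * 1024 * 1024 };

        for (int i = 0; i < 3; i++) {
            unsigned long ptes = buf / page_sizes[i];      /* entries to map buf */
            printf("%4luK pages: %5lu entries needed, TLB covers %luK\n",
                   page_sizes[i] / 1024, ptes, tlb_entries * page_sizes[i] / 1024);
        }
        return 0;
    }

With 4K pages the buffer needs 16384 entries and the whole TLB covers only 256K; with 4M pages the buffer needs 16 entries and the TLB covers 256M.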


Potential Performance Benefit of Variable Page Size for User Applications#


  • One customer experienced around a 30% performance improvement for their application
  • Linux references indicate a 10-30% performance improvement when this feature was introduced there
  • For another customer we implemented text-segment big-page support and saw a 0.5-1s improvement in boot-up time (without this feature, this customer observed 3 million TLB misses in 14 seconds during boot-up).
  • Navarro et al.: SPEC2000 results on FreeBSD with superpage support showed TLB-miss reductions usually above 95%; the SPEC CPU2000 integer benchmarks improved 11.2% on average (0 to 38%), and the floating-point benchmarks improved 11.0% on average (-1.5% to 83%)
  • Other benchmarks: FFT (200³ matrix), 55% improvement; 1000x1000 matrix transpose, 655% improvement; improvements of 30% or more in 8 of 35 benchmarks

Barriers to Variable Page Size#


  • Traditional OSes have difficulty reconciling super-pages with paging and Copy-On-Write (COW)
    • Size of pages vs. COW copy/page-out granularity
    • Fragmentation of physical memory
    • Integration of file system cache
    • Demotion/promotion during maps/unmaps and mprotects
    • Contiguity becomes a contended resource
  • On QNX, the main issue becomes fragmentation/contiguity … provided we are willing to “let go” of COW and paging

Available Page Sizes#


  • x86:
    • Primary Page Sizes of 4k, 4M (2M when PAE is enabled)
  • PowerPC Book E:
    • Primary Page Sizes of 4k – 1TB in 4x steps (not all page sizes are supported on all implementations)
  • MIPS:
    • Primary Page Sizes of 4k - 256MB in 4x steps
  • Hitachi SH4:
    • Primary Page Sizes of 1k, 4k, 64k, 1M
  • ARM v5:
    • Primary Page Sizes of 1k, 4k, 64k, 1M


Ideas from Seb's presentation#


Use variable page size#

Using larger page sizes where possible means fewer page table entries and fewer TLB misses.

  • Data (Heap)
    • Malloc heap growth -> mmap(MAP_ANON, amblksize)
    • We will need to adjust amblksize to best match the available page sizes, e.g., don't request 32k if 4k and 64k page sizes are available (see the sketch after this list)
    • A policy change in malloc may cause increased memory usage
    • Recommend we do not change the default round-up of mmap() requests
  • Code
  • Shared memory
  • I/O mappings
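
As a rough illustration of the heap item above, here is a minimal sketch of growing a malloc arena with an anonymous mapping sized to match an available page size. The 64K arena size is an assumption for the example; QNX spells the unused file-descriptor argument NOFD, shown here as -1 for portability.

    /* Minimal sketch: request heap growth in units that match an
     * available page size (64K assumed here), so the memory manager
     * has the chance to back the arena with a single 64K page. */
    #include <sys/mman.h>
    #include <stddef.h>

    #define ARENA_SIZE (64 * 1024)   /* pick from the supported page sizes */

    void *grow_heap_arena(void) {
        void *p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANON, -1, 0);  /* NOFD on QNX */
        return (p == MAP_FAILED) ? NULL : p;
    }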

Create global mappings where possible#

For shared libraries and the image filesystem, mark certain mappings as global (libc, IFS, etc.), reserve space for them in all process address spaces, and mark them global in the TLB.

This reduces the number of TLB misses and allows certain libraries to be linked at a fixed address (e.g. libc).

Potential issues:

  • Security
  • TLB flushing of mappings
  • Handling of breakpoints
  • SMP debugging
  • Adjust virtual addresses based on TLB associativity

Reduce aliasing, especially with a 2-way set-associative TLB. Example: thread stack addresses that collide in the same TLB set (a sketch follows).
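
A small sketch of the aliasing problem, assuming a hypothetical 2-way, 32-set TLB with 4K pages (the geometry is illustrative, not any specific CPU): identically aligned stack pages all hash to the same TLB set, and a per-thread stagger spreads them out.

    /* Sketch: with 4K pages and 32 sets, the set index is taken from
     * virtual address bits just above the page offset.  Stacks created
     * at identical alignments all map to the same set; staggering each
     * thread's stack top by one page per thread avoids the conflict. */
    #include <stdint.h>

    #define PAGE_SHIFT 12          /* 4K pages (assumed)        */
    #define TLB_SETS   32          /* 2-way x 32 sets (assumed) */

    static unsigned tlb_set(uintptr_t vaddr) {
        return (unsigned)((vaddr >> PAGE_SHIFT) % TLB_SETS);
    }

    static uintptr_t staggered_stack_top(uintptr_t base, unsigned tid) {
        return base - ((uintptr_t)(tid % TLB_SETS) << PAGE_SHIFT);
    }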

IFS#

  • without changing the IFS creation tools, try to use big pages to cover the IFS when possible.
  • Use a big page (1M/4M) to cover the text segments of shared libs in the IFS:
    • Group the text segments of the shared libs together and use a 1M/4M page to cover them. This requires modifying the IFS creation tools to separate the data segment of a shared lib from its code segment.


Design Meetings#


Physical memory fragmentation#

  • research algorithms to reduce fragmentation
    • According to one research paper, generally speaking, best-fit or first-fit (address-ordered or FIFO) algorithms are good choices for reducing fragmentation (a first-fit sketch appears after this list).
  • drivers and user applications should handle failure to allocate physically contiguous memory properly (the same as running out of physical memory). If a driver or application must use physically contiguous memory, reserving it at system boot is recommended. A library that reserves and manages a pool of contiguous physical memory could be developed to reduce the development cost for drivers.
  • Could allocation alignment be adjusted according to the available page sizes?
  • pre-allocation can reduce fragmentation, since it bundles related small allocations into one big allocation. For example, if the user can measure the average/maximum size of the kernel and proc memory pools in a typical usage environment, those sizes can be passed as procnto arguments so that procnto pre-allocates the pools during system boot. This should reduce fragmentation.
    • The same applies to the user heap. If the user can measure the average size of a process's heap, the malloc library can be instructed to pre-allocate that amount in a single mmap() call, which reduces both the number of mmap() calls and fragmentation. Becker is experimenting with something like that.
  • Increasing the mmap() allocation size for kernel/proc allocations from 4k to 8k should also reduce fragmentation.
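
As a point of reference for the first bullet, here is a toy address-ordered first-fit free list in C. It only illustrates the policy the paper recommends; it is not the Neutrino allocator.

    /* Toy address-ordered first-fit free list (illustration only). */
    #include <stddef.h>

    struct free_block {
        size_t size;
        struct free_block *next;   /* kept sorted by ascending address */
    };

    static struct free_block *free_list;

    /* Take the first (lowest-address) block large enough. */
    void *ff_alloc(size_t size) {
        struct free_block **pp = &free_list;
        for (; *pp != NULL; pp = &(*pp)->next) {
            if ((*pp)->size >= size) {
                struct free_block *b = *pp;
                *pp = b->next;     /* unlink; a real allocator would split */
                return b;
            }
        }
        return NULL;               /* no contiguous block big enough */
    }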

Policy for applying variable page sizes#

General Rule#

This general rule is the default for all of the allocation types listed below, unless it is overridden by the rules specified for a particular type. (A sketch of the halving strategy in steps 1-3 follows the list.)

  1. Allocate physical memory first in a way that tries to make the physical address start on a specified boundary. The boundary is the allocation size rounded down to the nearest power of 2.
  2. If step 1 is not possible, the physical allocator tries to allocate two size/2 blocks using the same method (making each physical address start on the power-of-2 boundary nearest to size/2).
  3. If step 2 is still not possible, the physical allocator keeps halving the allocation size and retrying the new size, until it either gets the memory or the size falls below 4k.
  4. The virtual address normally starts on a big boundary (such as 4M). If the mapping is not MAP_FIXED, we can make the virtual address' starting boundary match the physical address'.
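
A minimal sketch of steps 1-3, assuming a hypothetical phys_alloc_aligned() primitive that either returns a block starting on the requested boundary or fails; the remainder after a successful smaller allocation would be allocated the same way by the caller.

    /* Sketch of the general rule: try the full size on its natural
     * power-of-two boundary, then keep halving down to 4K. */
    #include <stdint.h>

    #define PAGE_MIN 4096

    /* Hypothetical primitive: returns the physical address of a block
     * of 'size' bytes starting on an 'align' boundary, or 0 on failure. */
    extern uint64_t phys_alloc_aligned(uint64_t size, uint64_t align);

    static uint64_t pow2_floor(uint64_t x) {
        while (x & (x - 1))
            x &= x - 1;            /* clear low bits until a power of 2 remains */
        return x;
    }

    uint64_t alloc_for_big_page(uint64_t size) {
        for (uint64_t sz = size; sz >= PAGE_MIN; sz /= 2) {
            uint64_t paddr = phys_alloc_aligned(sz, pow2_floor(sz));
            if (paddr != 0)
                return paddr;      /* caller allocates the remainder similarly */
        }
        return 0;
    }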

heap#

  • default 32k mmap() allocation. Some processors do not support a 32k page but do support 64k. Would it be beneficial to change the default granularity to 64k for those processors? Or should we just document it, so that customers who want to trade memory usage for speed can take advantage of it?
  • use pre-allocation, based on analysis of the user heap, to increase the page size and reduce the number of mmap() calls for the heap

stack#

  • pre-allocation to increase the chance of using a big page (a sketch follows)
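
A minimal sketch of stack pre-allocation using standard pthread attributes; the 1M stack size is illustrative, chosen to match an available big page.

    /* Sketch: pre-allocate a thread stack sized to an available big
     * page (1M assumed here) instead of letting it grow on demand. */
    #include <pthread.h>

    int spawn_with_big_stack(pthread_t *tid, void *(*fn)(void *), void *arg) {
        pthread_attr_t attr;
        pthread_attr_init(&attr);
        pthread_attr_setstacksize(&attr, 1024 * 1024);  /* match a big page */
        int rc = pthread_create(tid, &attr, fn, arg);
        pthread_attr_destroy(&attr);
        return rc;
    }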

mmap() call for anonymous memory#

MAP_FIXED: try to allocate a physical address that matches the virtual address so that a big page can be used

global/static data#

  • XIP
    • For big data segment, padding?
  • non-XIP
    • Normal

text segment#

  • XIP
    • Anything we can do? Change the IFS tools to do padding or move files around to make the physical addresses big-page friendly?
    • mkifs is able to do padding and place a text segment on a 64k boundary
  • non-XIP
    • Normal

Shared Library#

  • in the IFS, group the text segments together so that one big page can map all of them (requires changes to the IFS tools).
  • Split the text segment and data segment of libc so that the biggest page possible covers the libc text segment? Requires modifying the IFS tools.
  • if we don't change the IFS tools, we can still try to match the virtual address with the physical address

Shared Memory#

  • When increasing the size, if the old size is not zero, the physical address needs to be special so that a big page is possible: the physical address needs to match its offset within the object. (A growth sketch follows this list.)
  • Do we want to defragment small pieces into big ones when increasing the size?
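
A POSIX sketch of the growth case: rounding the new size of a shared memory object up to a big-page multiple before ftruncate(), so the physical layout has a chance to keep matching offsets within the object. The 64K figure is an assumption for the example.

    /* Sketch: grow a shared memory object in big-page multiples. */
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define BIG_PAGE (64 * 1024)   /* illustrative page size */

    int grow_shm(const char *name, off_t new_size) {
        int fd = shm_open(name, O_RDWR | O_CREAT, 0600);
        if (fd == -1)
            return -1;
        /* Round the new size up to a big-page multiple. */
        new_size = (new_size + BIG_PAGE - 1) & ~(off_t)(BIG_PAGE - 1);
        int rc = ftruncate(fd, new_size);
        close(fd);
        return rc;
    }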

I/O mapping#

if the mapping is not MAP_FIXED, make the virtual address match the physical address so that a big page can be used (see the sketch below)
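
A sketch of such an I/O mapping using QNX's mmap_device_memory() cover function; not passing MAP_FIXED (and letting the memory manager choose the virtual address) leaves it free to line the virtual address up with the physical one. Treat the exact flag choice here as an assumption.

    /* Sketch: map a device region and let the memory manager pick the
     * virtual address, so it can match the physical address for a big
     * page.  mmap_device_memory() is the QNX cover for mmap()+MAP_PHYS. */
    #include <sys/mman.h>
    #include <stdint.h>
    #include <stddef.h>

    void *map_device(uint64_t phys_base, size_t len) {
        void *p = mmap_device_memory(NULL, len,
                                     PROT_READ | PROT_WRITE | PROT_NOCACHE,
                                     0, phys_base);
        return (p == MAP_FAILED) ? NULL : p;
    }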


Technical issues#


demotion of a big page#

  • partial munmap()
  • mprotect() to change attributes for part of a big page (both cases are illustrated below)
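
Both demotion triggers, illustrated with standard calls (the 64K region is an assumption for the example):

    /* Example of the two operations that force a big page to be demoted
     * back into small pages: changing protection on, or unmapping, only
     * part of the region the big page covers. */
    #include <sys/mman.h>

    void demote_examples(char *region) {   /* assume one 64K page backs region */
        /* Make a single 4K sub-range read-only: the 64K mapping must be
         * split so the other 60K can keep its old attributes. */
        mprotect(region, 4096, PROT_READ);

        /* Unmap one 4K page from the middle: same demotion requirement. */
        munmap(region + 8192, 4096);
    }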


More Ideas#


  • IFS tools to reorganize IFS to maximize possibility for big pages
  • IFS with holes to make all segments aligned

If combined with a compressed image, there is no need to memset those 'holes', and the holes can be released back to the system by the startup program.


Detailed Design#


  • The available page sizes for the CPU are specified in a bitmap set. Not all supported page sizes will be available.
  • This page size bitmap is generated in the startup program and passed in through the system page. It is generated in the startup program because some PPC processors have special configurations that can only be figured out there. Also, we want to offload as much functionality as possible to the startup program, since its memory can be reused once control is transferred to the kernel. (A sketch of consuming such a bitmap follows this list.)
  • A mid-term migration strategy is to generate the bitmap in the kernel if the old startup does not have this functionality. This lets the new kernel work with released BSPs.
  • Mapping big pages will be implemented inside the function cpu_pte_manipulate().
  • Changing big-page mappings due to an mprotect() call or a partial unmapping will be implemented inside cpu_pte_manipulate() as well, since all the big-page information needed is already in the page table.
  • As the 64k alignment feature has already been implemented in 'mkifs', we will make use of it for mapping code segments from the IFS.
  • Using a global mapping for the libc text segment will require changes to 'mkifs', and will be implemented in the second step.
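
Finally, a sketch of consuming a page-size bitmap as described above. The encoding (bit n set means a 2^n-byte page size is available) and the name pagesize_bitmap are assumptions for illustration, not the actual system page layout.

    /* Sketch: choose the largest available page size that fits a request,
     * given a bitmap where bit n set means 2^n-byte pages are available.
     * Both the encoding and 'pagesize_bitmap' are illustrative. */
    #include <stdint.h>
    #include <stddef.h>

    extern uint64_t pagesize_bitmap;   /* assumed: filled in from the system page */

    size_t best_page_size(size_t len) {
        size_t best = 0;
        for (int n = 12; n <= 30; n++) {             /* 4K .. 1G */
            size_t psize = (size_t)1 << n;
            if ((pagesize_bitmap & ((uint64_t)1 << n)) && psize <= len)
                best = psize;                        /* keep the largest fit */
        }
        return best;                                 /* 0 if nothing fits */
    }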