Memory management within Linux: the Open Source solution

The Linux kernel sees memory at its actual physical addresses, either by executing in "real mode" on some processors, or by using a page table that map virtual memory addresses to the corresponding physical addresses. One way or another, the kernel sees main memory "as it is."

In contrast, normal processes see memory via page tables that give them their own virtual address spaces, and the virtual address of a logical page has little to do with the physical address it resides in---it's wherever the kernel decided toput it.

The kernel manages the allocation of physical memory using several memory allocators, the main one being kmalloc. Kmalloc keeps track of which regions of main memory are in use, and divides memory up into blocks which are powers of two multiples of the virtual memory page size. A physical page of memory may be used to cache a virtual memory page or file page, or as raw memory for DMA I/0, or to hold page tables used by the VM system, or to hold data structures belonging to the kernel itself, etc.

The main kernel memory allocator is kmalloc, which uses the binary buddy system, and only manages blocks of memory whose are a power-of-two multiple of the page size. Kmalloc manages physical memory by physical address---the blocks it returns are contiguous in physical memory, and kmalloc itself knows nothing about virtual memory.

This is imporant for supporting certain kinds of devices, which require large blocks of contiguous physically-addressed memory, such as DMA devices that bypass the CPU (and its address-translation hardware).

The binary buddy system is a simple and relatively fast allocator, but it is a poor allocator in many respects. The fact that it can only manage block sizes in powers of two means that using it straightforwardly requires rounding the requested block sizes up to a power of two, which can incur a large cost in internal fragmentation, i.e., wasted space within blocks. Linux therefore provides two other allocators for allocating memory within the kernel. The "slab" allocator gets largish hunks of memory from kmalloc and carves them into smaller pieces as needed, and vmalloc provides areas of contiguous virtual memory that aren't necessarily contiguous in physical memory.

All page frames (physical RAM pages) have basic metadata stored in an array named mem_map, which is parallel to the physical address ordering of page frames in physical memory.

A page frame may be used for other purposes than caching pages, e.g., it may be part of an area holding the kernel's own data structures. A page struct contains several fields, including a flags field that holds flags describing how the page frame is being used and what is stored in it. It also holds several link fields, for linking the page frame into a doubly-linked list and/or a hash table. These allow page frames to be linked together into pools of page frames that are used in particular ways.

The page struct has several flags used for synchronization purposes and for the replacement mechanism. It also has a "count" field indicating how many mappings of this page frame currently exist.

Linux uses several kinds of data structures and a hardware cache (the Page Table Entry cache, a.k.a. TLB for "Translation Lookaside Buffer") to present the appropriate logical pages as the right virtual pages, from a process's point of view.

The virtual memory mappings of a process are represented by two data structures, one high-level (vm_area_struct's), and the other low-level (page tables). The high-level data structure describes regions of the address space, but does not specify everything about the state of individual pages. The low-level data structure (page table) provides detailed information on individual pages, such as whether the page is in main memory, and if so, where.

The high-level structure is a list of vm_area structs summarizes a process's use of its address space, with an entry for each contiguous region of the address space (a range of pages) that is treated differently from the preceding or following pages. The list is ordered by address (virtual page number).

For example, when a process maps a file into a previously unoccupied part of its address space, the mmap system call creates a vm_area struct recording the virtual page range it is mapped to and how the kernel should handle operations like page faults, copy-on-write page copies, and pageouts in that region.

These operations are specified by a pointer from the vm_area_struct to a vm_operations_struct, which is simply a struct whose fields contain function pointers, pointing to the functions for those operations.

When the kernel needs to perform a given operation on a given page of a process's address space, it can search the list of vm_area_struct's for the one describing the region containing that page. It can then extract the pointer to the vm_operations struct, and from that extract the function pointer to performthat operation.

The main component of the VM replacement mechanism is a clock algorithm. Clock algorithms are commonly used because they provide a passible approximation of LRU replacement and are cheap to implement. A clock algorithm cycles slowly through the pages that are in RAM, checking to see whether they have been touched (and perhaps dirtied)lately.

For this, the hardware-supported reference and dirty bits of the page table entries are used. The reference bit is automatically set by the PTE cache hardware whenever the page is touched---a flag bit is set in the page table entry, if the PTE is evicted from the PTE cache, it will be written back to its home position in the page table. The clock algorithm can therefore examine the reference bits in page-table entries to "examine" the corresponding page. The basic idea of the clock algorithm is that a slow incremental sweep repeatedly cycles through the all of the cached (in-RAM) pages, noticing whether each page has been touched (and perhaps dirtied) since the last time it was examined. If a page's reference bit is set, the clock algorithm doesn't consider it for eviction at this cycle, and continues its sweep, looking for a better candidate for eviction. Before continuing its sweep, however, it resets the reference bit in the page table entry.

Resetting the reference bit ensures that the next time the page is reached in the cyclic sweep, it will indicate whether the page was touched since _this_ time. Visiting all of the pages cyclically ensures that a page is only considered for eviction if it hasn't been touched for at least a whole cycle.

The clock algorithm proceeds in increments, usually sweeping a small fraction of the in-memory pages at a time, and keeps a record of its current position between increments of sweeping. This allows it to resume its sweeping from that page at the next increment. Technically, this simple clock scheme is known as "second chance" algorithm, because it gives a page a second chance to stay in memory---one more sweep cycle.

More refined versions of the clock algorithm may keep multiple bits, recording whether a page has been touched in the last two cycles, or even three or four. Only one hardware-supported bit is needed for this, however. Rather than just testing the hardware supported bit, the clock hand records the current value of the bit before resetting it, for use next time around. Intuitively, it would seem that the more bits are used, the more precise an approximation of LRU we'd get, but that's usually not the case. Once two bits are used, clock algorithms don't generally get much better, due to fundamental weaknesses of clock algorithms.

Linux uses a simple second-chance (one-bit clock) algorithm, sort of, but with several elaborations and complications. The main clock algorithm is implemented by the kernel swap demon, a kernel thread that runs the procedure kswapd(). kswapd is an infinite loop, which incrmentally scans all the normal VM pages subject to paging, then starting over. Kswapd generally does its clock sweeping in increments, and sleeps in between increments so that normal processes may run.

Linux performs a clock sweep over the *virtual* pages, by cycling through each process's pages in address order. For this it uses the vm_area mappings and page tables of the processes, so that it can scan the pages of each process sequentially. Rather than sweeping through all of the pages of an an entire process before switching to another, the main clock tries to evict a batch of pages from a process, and then move on to another process. It visits all of the (pageable) processes and then repeats. The effect of this is that there is a large number of distinct clock sweeps, one per pageable proceses, and the overall clock sweep advances each of these smaller sweeps periodically.

Home