6. Memory Management

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Key Issues

The key questions for any resource management system are:

WHO requested the resource?
WHAT resource was requested?
WHERE should the resource be allocated?
HOW should the resource be allocated?
WHY should the resource be allocated?
WHEN should the resource be allocated?

WHO requested the resource?

This is not trivial! On most systems, resources are allocated to processes, but this is not universal!

If the process is the object to which all allocation is attributed, then the process becomes a very heavy object, and changing processes becomes expensive. As a result, on systems where this is done, the system's notion of process is frequently augmented with lightweight processes, to which no resources other than the CPU are allocated. These are called threads.

On the old Berkeley Timesharing System, in the late 1960's, resources were allocated to interactive users. Each interactive user had a single virtual address space containing up to 64 pages of memory, and the user could have any number of processes running in this address space, where at any time, each process could directly address 8 pages of this 64-page address space. Berkeley Timesharing System processes were very lightweight objects, with a total process description containined in only 8 words of memory! Using today's terminology, we would say that a Berkeley Timesharing System user was a process, while the processes in this system correspond to threads in modern terminology. In fact, the Berkeley system was the first to incorporate this modern idea of multithreaded processes.

Aside: The word process originally always meant a locus of control, or some entity in a system that could execute a sequential program. Another good definition (though not usually used) is that each process runs on a virtual CPU created by the scheduler.

This definition has been largely forgotten in the UNIX community, where it is common to use the word process to refer to the unit of activity to which allocated resources are bound, even if that unit contains many lightweight threads, each of which is technically a process in the older sense of the word. For now, we must live with the resulting confusion.

WHAT resource was requested?

Memory can be allocated by the byte, the word, the page, or the segment. In any particular context, the system typically only provides one unit of allocation, and it is up to the user or the applications library to manufacture the desired units of memory out of those allocated by the system.

Bytes and Words are typically the smallest addressable units of memory. Allocating by the byte or word is quite rare, except in systems where the word is sufficiently large to accomodate two pointers. In that case, randomly allocated words are sufficient to build arbitrary binary-linked data structures, which in turn, can be used to mimic any other data structure. This is the foundation of the LISP worldview.

For those unfamiliar with LISP, the world (containing both programs and data structures) is represented entirely with S-expressions. The simplest S-expressions are known as atoms; atoms include numbers (represented in 1 word) and textual atoms (represented using a special word that includes the property list for the atom; one property in this list is the printable name of the atom). Complex S-expressions may be formed by composing S-expressions into lists. The compose operator (CONS) takes 2 S-expressions as parameters and returns a new S-expression that joins these. The new S-expression is represented by a word holding 2 pointers, one to the left component (the CAR) and one to the right component (the CDR).

Pages are fixed sized blocks of memory, with the size typically determined by some characteristic of the addressing hardware on the machine. Because the page size is determined by the hardware and not by the application, it is generally necessary to construct large logical objects out of multiple pages, or to divide single pages into multiple logical objects. Typically, the memory management unit hardware allows multiple pages to occupy logically contiguous addresses independent of their physical locations in memory, so that large objects can easily span many pages.

Segments are variable sized blocks of memory, with the size typically determined by some characteristic of the application. Segments are typically supported by hardware, while otherwise identical blocks of memory that are purely software artifacts are usually referred to simply as Blocks. Each block or segment typically contains one logical object, but sometimes, the economics of allocation are such that the application sub-allocates a block or segment.

WHERE should the memory be allocated?

A naive user typically assumes that blocks of memory requested by an application are allocated directly in main memory, so that the application directly uses main memory addresses. This was true in first-generation systems, including both the mainframe systems of the 1950's and the first generation of microcomputer operating systems in the 1970's and 1980's, but by 1970, most mainframe vendors were moving away from this view, and by 1995, the operating systems on microcomputers were well on their way to abandoning this simple view.

Most interesting operating systems only use physical memory addresses for internal data structures of the operating system itseelf. All user data structures are allocated in memory segments allocated to the users by the system. This allows the system, by means of the memory management unit, to protect itself from errant user programs, and it allows user programs to be isolated from each other, with the only permitted communication being through authorized channels.

The question of where memory is allocated has another interpretation: Within any given address space from which memory may be be allocated, where should it be taken. The answer is trivial if all but the requested amount has already been allocated. Another trivial case is when each block of allocatable memory is identical and interchangable with all others, as is the case when the unit of allocation is the page or any other fixed size block, for example, the LISP S-expression node.

A final aspect of this question arises in machines with non-uniform memory access times (so-called NUMA architectures). On such machines, it is important to try to allocate memory "near" the CPU that is the primary user of that memory, where distance is measured in terms of expected access time.

HOW should the memory be allocated?

Allocation for all time may be different from temporary allocation, and some systems allow preemption of memory, either with the cooperation of the user to which the memory was allocated, or without notice.

Almost all preemptable resources are used for implementing virtual resources. Preemptive schedulers create one virtual CPU per process. Demand paged storage managers preempt page frames with every page fault in order to create virtual address spaces. Window managers preempt regions on the screen to support different overlapped virtual screens (windows).

Non preemptable resources exist at the bottom level in each such virtual resource hierarchy. The lowest level backing store in a virtual memory environment cannot easily be preempted. The (possibly virtual) memory used to implement a window manager is similar, as is the memory used to hold the states of the ready and waiting processes under a process manager.

Note that there are many different types of virtual resources in the above example, but all are implemented in terms of the same underlying non preemptable resource. This is a common pattern.

WHY should the resource be allocated?

Why not! It is up to the user to decide why they want a resource. Operating systems have no basis for asking this question, although it should be asked by program designers.

WHEN should the resource be allocated?

Can the user wait? Can the user give advance notice? Must the user give notice? If we attempt to allocate physical resources on demand, and if we block users who ask for resources that are currently unavailable, we run the risk that one process may be blocked waiting for a resource currently held by another process, while the other process is blocked waiting for a resource held by the first. This is one example of the more general problem of deadlock.

With preemptable memory resources, such as page frames, demand allocation is common, but performance can frequently be improved by either forcing some users to wait before their demands are met, or by asking users to provide advanced notice of their expected demand. Sometimes the system can anticipate a user's allocation request and allocate resources before they are needed.

Memory management algorithms

Consider the simple case of allocating blocks of physical memory to system or user code on a system with no protection mechanisms. With small changes, the same mechanism may be used to allocate blocks from the virtual memory segments of a user program. The interface to such a memory manager is frequently specified by the programming language.

The definition of Pascal, for example, includes the special procedure new(p) that sets the pointer p to point to a new block of memory sized to fit an object of the type referenced by p. Most implementations of Pascal include the nonstandard procedure dispose(p) that deallocates the block referenced by p and sets p to nil.

The C standard library, similarly, includes the function malloc(s) that returns a pointer to a new block of s bytes of memory, and the function free(p) that deallocates the block referenced by p, if that block was previously allocated by malloc. C is a weakly typed language, so the programmer is responsible for making sure that the size, in bytes, passed to malloc matches the size of the object required.

It is easy to see that the C and Pascal memory management interfaces are basically the same. Both allocate memory from a memory space called the heap. It is noteworthy that C++ was originally implemented by a preprocessor that converted the C++ program to an equivalent C program. When the C++ preprocessor encounters code requiring the instantiation of a new object, it generates C code calling malloc to create the structure representing the object.

There are a huge number of algorithms for implementing the basic allocate and deallocate operations. Two representative examples will be given here, the binary buddy system and a boundary tag method.

Note that memory management was once a central problem on operating systems, when user processes shared real memory with each other. In modern uniprocessors, it may seem far less critical because user code runs in virtual memory, and allocation and preemption of page frames is relatively trivial.

The use of virtual memory may seem to push the burden of memory management into the user processes, where it shows up either as an application programming problem or as part of the language run-time system, for example, in the implementation of new and dispose or malloc and free in Pascal and C, or as the garbage collector underlying an implementation of LISP, Simula, Smalltalk or Java.

Of course, systems must manage the backing store, but this is typically allocated in units of interchangable disk tracks or sectors. There are performance penalties for allocating disk space poorly, but the problem is one of performance, not absolute necessity. Simple allocation heuristics such as allocating that free sector closest to the other allocated sectors of the same file (or virtual address space) can give excellent results.

In fact, much of the message of the above paragraphs is incorrect. When an application chooses the wrong allocation algorithm, it can paralyze the system. On some UNIX systems from the late 1980's, for example, the system needed to be periodically rebooted because the X-windows window manager, an applications program, consumed all of the available virtual memory. The problem was severe external fragmentation within the memory allocated to the window manager by the system.

Problems faced in memory allocation:

Fragmentation is the term used to describe how the free space in a pool from which memory is allocated is slowly broken up into many small pieces. There may be hundreds of words of free space, and yet a request to allocate a ten word block of memory may fail because there are no ten consecutive free words!

Overhead is the term used to describe additional memory that must be allocated, above and beyond that requested by the users, in order to provide for management of the pool of memory. For example, it may require a few words of memory to describe the location of each allocated memory block.

The consequence of fragmentation and overhead is that, although the total user demand for memory may be less than the number of words available, the system may be unable to meet that demand because the largest available block of free memory is smaller than the smallest user demand that has yet to be satisfied.

Fragments of free space, and fragmentation in general, comes from two sources:

Internal fragments are fragments of free space that are allocated to users even though the user does not request them. For example, on many modern processors, each allocated block of memory must be aligned on modern processors, byte addressing is quitee expensive, so each allocated block of memory must be aligned on a word boundary. If a user wants an odd number of bytes, this will require rounding up the user request to the next word boundary, allocating space to that user that the user does not want.

External fragments are fragments of free space between allocated blocks. External fragmentation can be eliminated by a compacting garbage collector; one that can move blocks of memory around in the heap in order to consolidate the fragments, but these are uncommon outside the specialized area of run-time support for Java.

A trivial memory management algorithm

Allocate fixed size blocks with a size greater than the maximum possible user request.

This leads to awful internal fragmentation unless all user requests are for the same sized block. Allocation and deallocation are very fast, requiring a single linked list, free-list, to keep track of free blocks.

If there are a variety of block sizes being requested, this algorithm can be improved by using a separate free list for each popular block size. The problem with this is that it introduces a new fragmentation problem, there may be plenty of free space, but only in sizes that don't match the user's request. If blocks of one size can be combined to make larger blocks or broken up to make smaller blocks, this problem is solved. The result is a class of algorithms known as buddy systems.

The Binary Buddy System

In the simplest of the buddy systems, all block sizes are powers of two. Any large block may be split in half to make two smaller blocks, and any two blocks that were originally split from the same parent may be combined to reconstruct that parent.

The central data structure used in all buddy systems is an array of free-lists:

      Array of
      free-list
      heads               ______       ______
       ________     ---->|_____o----->|______|
 128K |________|   |     |  64k |     |  64k |
 64K  |_______o----      |      |     |      |
 32K  |_______o------    |______|     |______|
 16K  |________|     |    ______       ______
 8K   |_______o----   -->|_____o----->|_____o---
 4K   |________|   |     |  32k |     |  32k |  |
 2K   |________|   |     |______|     |______|  |
 1K   |________|   |      ______       ______   |
 512  |________|   |     |______|<-----o_____|<-
 256  |________|   |     |  32k |     |  32k |
      (indexed     |     |______|     |______|
       by block    |      ______       ______
       size)        ---->|_____o----->|______|
                         |___8k_|     |___8k_|

Note that the above picture shows the state of a buddy system storage allocator with an array of free lists that allows for blocks of size up to 128k bytes. The system currently has two blocks of size 64k bytes, 3 blocks of size 32k bytes, and 2 blocks of 8k bytes. When a block is in use, the block is entirely available to the user. When the block is free, the first few bytes of the block are used to hold a pointer to the next free block of the same size.

The minimum block size is determined by the size of a pointer. If a user asks for less than this, there will be internal fragmentation because we cannot form blocks into free lists if we cannot store one pointer in each block.

The number of table entries required is determined by the maximum and minimum block sizes. In the above example, the maximum was 128k (2¹⁷), and the minimum on a 32 bit machine would typically be 4 (2²); in this case, the table itself requires 17-2 or 15 entries. This is the storage overhead of the binary buddy system.

Binary Buddy System Algorithms

to allocate a block of size n:
  pick i such that 2^(i-1) < n <= 2ⁱ
  if freelist[i] not empty
    return the first element of freelist[i] to the user
    and set freelist[i] to point to the next element
  else
    allocate a block of size 2⁽ⁱ⁺¹⁾ (this is a recursive)
    and split it in half.  Return half to the user
    and put the other half on freelist[i].
  end

The allocation code outlined above is recursive! Furthermore, because we are constrained to allocate powers of two, if user requests are for uniformly distributed random block sizes, we expect that the typical request will only use 3/4 of each block, resulting in an average loss to internal fragmentation of 1/4 of the total memory capacity.

to deallocate a block of size n:
  pick i such that 2^(i-1) < n <= 2ⁱ
  compute b, the buddy of the deallocated block.
  if b is already in freelist[i]
    remove b from freelist[i].
    combine b and its buddy, and deallocate the
    resulting block of size 2⁽ⁱ⁺¹⁾ (this is recursive)
  else
    add the deallocated block to freelist[i].

Again, this code is expressed in recursive form. It should be noted that this version of deallocate uses what is referred to as prompt merger of free blocks. That is, free blocks are merged with their buddies as soon as possible after deallocation. The result of prompt merger is that there will always be the best possible supply of large free blocks, but if most of the traffic is in one particular block size, there will be unnecessary computaiton involved in the repeated merger and splitting of blocks.

Prompt merger is a good solution for real-time systems, where our concern is a bounded worst-case time per operation. If, instead, we are more worried about minimizing the total computation time, we should use what is called lazy merger. That is, we should avoid merging free blocks for as long as possible, letting the free lists for popular block sizes grow as long as possible and exhausting the free lists for unpopular block sizes.

In lazy merger, free blocks are only merged when an allocation request comes in for a block size where the free lists for that size and all larger sizes are empty. At that point, the allocate routine begins performing mergers of smaller free blocks until it builds a block large enough to satisfy the request. This means that there may be a huge number of mergers in response to one allocation request.

Exercise: For a total heap size of N words, what is the worst case number of splits and mergers for allocate and deallocate, and number of splits and mergers for allocate and deallocate; do this first for the prompt version, and then the lazy version.

Finding the buddy of a block i of size s is easy in the binary buddy system:

  buddy(i) = i xor s

Note that the above rule violates most people's ideas of strong type checking, since we compute the pointer to the buddy of a block by exclusive-oring the size of the block (an integer) with the pointer to the block!

The Fibonacci Buddy System

The Fibonacci buddy system is an alternative where block sizes are Fibonacci numbers. This reduces internal fragmentation by providing for a larger variety of free block sizes.

The Fibonacci numbers are 1 1 2 3 5 8 13 21 ...

Each Fibonacci number is the sum of its two predecessors. The problem with this system is that there is no trivial way to compute the address of the buddy of a block given the address of that block.

Exercise: Implement the Fibonacci buddy system.

Boundary Tag Storage Allocation

Boundary tags are data structures on the boundary between blocks in the heap from which storage is allocated. The use of such tags allows blocks of arbitrary size to be used -- the tag describes, for each block, how big it is, its status, and its neighbors:

        HEAP________
           |________| tag
           |  used  |
           |        |
           |        |
           |________|
           |________| tag
           |  used  |
           |________|
           |________| tag
           |  free  |
           |        |
           |________|
           |________| tag
           |  used  |
           |________|
           |________| tag
           |  free  |
           |________|
           |________| tag

The consequence of allocating arbitrary sized blocks on the heap is that the bookkeeping needed to split blocks, merge adjacent free blocks, find the right size block, and find the neighbors of a block all become more complex. The boundary tag data structure permits this, but we pay a price, the overhead of the tag itself.

The decision to use a boundary tag method poses many choices. Among these are the lazy versus prompt merger decision that was present with the buddy systems. Another important choice is that between best fit and first fit allocation: When searching a collection of free blocks, should we take the first block we find that is big enough to satisfy a user's demand, or should we do a more careful search, looking for the best fit?

Intution suggests that best-fit allocation should reduce fragmentation, but in fact, if an exact fit is unavailable and there is a wide variety of commonly requested block sizes, empirical evidence shows that first fit generally works better because it produces larger fragments that are more likely to prove useful at a later time. The problem with best-fit allocation is that is produces large numbers of very small fragments.

It is very important to note that in most heap management environments, an exact fit is very likely to be available! Most applications repeatedly allocate and deallocate objects of only a few sizes, and a good storage manager should take advantage of this. On the other hand, it is important to remember that there are applications that have genuinely random sequences of allocations. One example of such a random sequence of requests comes when opening a succession of windows to view images, where each image was hand cropped (trimmed to size) to eliminate distracting or irrelevant content. Some web pages are full of such randomly sized images, and viewing such pages can lead to significant heap fragmentation.

Much of the early research in the area of heap management was done by the Burroughs Corporation in the early 1960's, but research in this area continues. Some of the best recent papers in the area were published in a the Proceedings of the ACM SIGPLAN International Symposium on Memory Management, published as SIGPLAN Notices, Vol 34, No 3, March 1999. The most relevant paper in this proceedings is "The Memory Fragmentation Problem: Solved?" by Johnstone and Wilson, pages 26-48.

Boundary Tags

        |_______|
        |_______| <--
        |       |    |-- Block boundary tags
        |       |    |   give
        | block |    |    1) Block status
        |       |    |        (used or free)
        |       |    |    2) Block size
        |_______|    |    3) Addresses of neighbors
        |_______| <--
        |       |        (2 & 3 are correlated!)

With the buddy system, the heap manager needed no record of the blocks allocated to users, but the arbitrary sized blocks used in boundary tag methods require that the manager keep a record of each block in the heap, whether it is allocated or free. These records are easily maintained in the heap itself, in the form of a prefix or suffix on each block. From the user's point of view, these prefixes and suffixes are not part of the block -- they are part of the overhead of the storage manager.

When the heap is viewed as a sequence of blocks, the suffix on one block is adjacent to the prefix on the next block, and every boundary between blocks, whether they are used blocks or free blocks, contains a prefix-suffix pair. This pair is the boundary tag from which the entire class of storage allocation methods takes its name.

For each block, the boundary tag must contain the status of that block so that, at the appropriate time, it can be determined whether the block is free, available for allocation or merger with another free block, or in use, and therefore unavailable. Another piece of necessary information is the size of the block, so that the allocation algorithm can determine whether the block is big enough to satisfy some allocation request. Finally, when it is necessary to merge free blocks, it must be possible to find the neighbors of a free block in order to see whether they are free.

The size of a block and the addresses of the neighbors are redundant, since if you know the address of the following block and the size of the boundary tag fields, you can compute the size of a block, and if you know the size of a block, you can compute the address of either end if you are given the address of the other end.

Boundary Tag Alternatives

   First Alternative     Second Alternative

   |    |        | ^        |________|
   |    |        | |        |__size__|
    ->->|________| |        |        |
     |  |_status_| |        |  block |
     |  |__next_o--         |________|
     |  |__prev_o----       |__size__|
     |  |        |   |      |_status_|
     |  |        |   v      |        |

There are two popular ways of storing the information needed in the boundary tags. In the left method shown above, the boundary tags are arranged into a doubly linked list. Each tag contains a pointer to the next tag and to the previous tag, and each tag contains the status of the block it immediately precedes. As shown in the illustrations, all pointers point to the first word of a block in the heap, so positive offsets from the pointer are indices into the block and negative offsets are indices into the boundary tag. This convention is convenient but not essential.

The right method shown above records the size of each block as a prefix and suffix on that block, and includes the status in either the prefix or suffix or, on some systems, both (above, it is shown in the prefix).

Boundary Tag Free Lists


      ^ |        |      In the worst case,
      | |  free  |      free blocks must
      | |________|      be arranged into
       --o_______|      a doubly-linked
    -----o_______|      free list.  This
   |  ^ |////////| tag  allows fast
   |  | |        |      allocation and eager
   |  | |  used  |      merger.
   |  | |________|
   |  | |////////| tag  For some other
   |  | |        |      choices of allocation
   |  | |  free  |      method, a simpler
   |  | |________|      free list structure
   |   --o_______|      will suffice.
    -> --o_______|
      | |////////| tag
      v |        |

The free list is maintained in the free blocks themselves, not the bounary tags, so the minimum free block size depends on the number of pointers per list element, two if the list is doubly linked, typically one word each. The boundary tag itself requires two pointers or two block sizes (typically, the block size is the same size as a pointer), and one status bit (typically a whole word), so in order to allocate a two word block on the heap, it is typically necessary to find a total of five free words.

The above argument implies that boundary tag methods cannot eliminate internal fragmentation! When a user requests a block of size X and the only block available is size X + E (E is the extra part), then the available block must be split into a block of size exactly X and a block of size E - T (where T is the size of a boundary tag). If E - T is smaller than needed to store the pointers needed to link it into the free-list, then the split cannot be done. For the example data structure shown above, E must be at least 5 words, so there will be occasional internal fragmentation involving fragments of 4 or fewer words.

In the extreme case, no free list is needed. When a user wishes to do allocation, a linear search over the list of boundary tags will always manage to find a free block, but if the heap is largely full, this search will be unnecessarily slow as it examines all of the allocated blocks in the heap in the search for a free block of the right size.

If free block merger is lazy, then the merger can be done by a linear scan of the entire heap, rebuilding the free list as free blocks are found and merged. In this case, the free list can be singly linked.

If free block merger is done on an eager basis (at deallocate time) or on an eager basis (at allocate time) during the scan of the free list, where the free list is randomly linked with no attention to the addresses of blocks in the heap, then double linking is needed because when two adjacent free blocks are merged, one has to be removed from the free list and without a double link, it's successor in the list cannot be found.

The so-called fastest fit storage allocation algorithm uses a simple heap manager with a singly linked free list as its foundation and lazy block mergers; free blocks of popular sizes are stored in secondary free lists, one for each popular block size. Allocations for commonly used sizes check the appropriate secondary free list first, before using the underlying heap manager. Deallocations of common block sizes simply place the blocks in the secondary list (uncommon sizes can be returned to the main free list).

Only when an allocation request to the underlying heap manager cannot be satisfied are free blocks merged -- this is done by scanning the entire heap and rebuilding the main free list, leaving all the secondary free lists empty.

The fastest fit algorithm gives what may be the best average-case performance, but its worst case performance is not good. For the worst case, first-fit is better! This is one of many examples of situations where the best algorithm for real-time use is not the same as the best algorithm for minimizing the total run-time!

Exercise: Write code for a boundary tag storage manager, using prompt merger.