8. Page Replacement

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Page Faults

When a program references a page that the memory management unit is unable to handle, or when the reference requires access rights that the memory management unit does not allow, the memory management unit reports a page fault by forcing the CPU to enter a trap service routine. This routine is usually part of the operating system, but there are systems, such as Mach, where page faults may be handled by application-level code.

The instruction that was executing at the time of the trap is aborted! On entry to any trap service routine or interrupt service routine, a properly designed CPU will cooperate with the prologue (the introductory sequence of instructions) of the service routine to save the values of all of the registers of the interrupted program, including the PC. In the case of trap handlers, the saved values should be the values that were in effect immediately prior to the instruction that caused the trap.

Note: On some machines, the values for some saved registers include changes made by the instruction that caused the trap. For example, the program counter may have been incremented as a side effect of the instruction fetch prior to the memory cycle that caused a page fault, or the stack pointer may have been decremented prior to the memory cycle that caused the fault. If such a machine is to be used to support demand-paged virtual memory, there must be a way to determine what registers have been changed so that the prologue on the page fault trap service routine can undo these changes!

In theory, given the PC and registers prior to the instruction that caused the trap, the trap service routine can simulate the execution of the next instruction to determine all of the memory addresses it might use; for each address, it can check to see which of these might have caused the trap. This is simple enough that on some machines, the memory management units do not save the actual virtual address that was used to cause the trap; in this case, the address must be computed by the prologue to the trap service routine. The software is simplified, however, if the MMU saves the virtual address that caused the trap, thus saving the trap service routine from having to simulate an instruction cycle.

Here, in outline form, is a generic page fault service routine:

page fault service routine:
    -- prologue
    save the state prior to the instruction that caused the trap

    -- find necessary information about cause of fault
    va = virtual address that was referenced
    op = operation that was attempted (read or write)

    -- is this an access violation?
    if map[va.pagenumber].rights = read_only
    and op = write
        handle as user access violation (bus trap!)
    else

        -- find a free page frame
        if there are free page frames
            pa = free frame address
        else

            -- page replacement policy
            pa = address of page to replace
            va' = virtual address of page in that frame
            write( pa, diskaddress( va' ))
            update virtual address map, etc to reflect change
        endif

        -- bring in the required page
        read( pa, diskaddress( va ))
        update virtual address map, etc to reflect change

        -- epilogue
        restore saved state
        return from page fault service routine
    endif
This outline leaves us with two important choices: what page(s) should be copied into main memory from disk, and, assuming that main memory is full, what page(s) should be copied out from main memory to disk in order to make space.

There is also the question of what disk address should be mapped to each virtual address. Some systems simply set aside a disk device or a partition of a disk device (a virtual disk) to hold the entire range of legal virtual addresses. On these systems, if we assume that the page size equals the size of a disk sector, we can use the page number directly as a sector number on the disk or disk partition. Another approach is to set aside a disk file in the file system for each virtual address space. This makes sense in systems that support multiple virtual address spaces; when this is done, we can use the virtual address of a page as the offset into the file for reading and writing the saved copy of each page.
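The arithmetic behind both mappings is simple. Here is a sketch in Python of the two schemes just described; the 4096-byte page size, taken equal to the disk sector size, and the function names are illustrative assumptions, not from any particular system.

```python
PAGE_SIZE = 4096  # assumed page size, taken equal to the disk sector size

def sector_for_page(virtual_address):
    # Partition scheme: the page number doubles as the sector number
    # on the disk or disk partition set aside for paging.
    return virtual_address // PAGE_SIZE

def backing_file_offset(virtual_address):
    # Per-address-space file scheme: the page-aligned virtual address is
    # the byte offset of the saved copy of the page within the file.
    return (virtual_address // PAGE_SIZE) * PAGE_SIZE
```

For example, virtual address 0x3042 falls in page 3, so it maps to sector 3 of the paging partition, or to offset 0x3000 in the backing file.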

What page should be copied into main memory?

When a page fault occurs because a program tried to access a specific virtual address that was not currently mapped into main memory, we must bring that page into memory. If this is the only way pages come into main memory, we say that our virtual memory system uses pure demand paging. That is, pages are only brought into memory on demand.

If, on the other hand, our system attempts to guess the identity of pages that will be needed in the near future and brings these into main memory along with the pages that were demanded, we say it uses anticipatory paging. That is, the system tries to anticipate pages that will be needed soon.

A typical anticipation scheme is based on the observation that most instructions are executed sequentially, and many operand references are located in sequential memory locations. Therefore, we can estimate the likelihood that a memory reference to page X will be followed in the near term by a reference to page X+1. If the likelihood is under 1/2, it is obvious we should not bring in page X+1; if the likelihood is sufficiently greater than half, there is a good chance that we will gain by an anticipatory transfer.

If we study program execution patterns to find the expected size of a function, call this S, we can use this as an estimate of the likelihood that executing location X will predict the execution of locations up to X+S in the near future. More generally, we can study the sequences of operand addresses issued by running programs to get an estimate of the likelihood of references to X+S within the next T time units, for all S and all T, and we can use this as a basis for a sound anticipatory paging policy.

Of course, not all programs behave typically. Experience on systems that support anticipatory paging shows that it does not always pay off. Furthermore, it is sometimes possible to predict, as you write a program, that the program will behave poorly under an anticipatory paging system, for example, when traversing linked lists that were constructed at random. One way of dealing with this is to allow application programs to turn off anticipation when the programmer believes it will behave poorly, and to turn it back on afterward.

What page in main memory should be replaced?

What page in main memory should be replaced? We refer to this as the question of page replacement policy. In the above outline of a page-fault service routine, this issue was glossed over by simply asking for the address of the page to replace. There are a number of page replacement policies that have widely accepted names:

Random page replacement
Random replacement requires no intelligence and works moderately well. When a page frame is needed, the random algorithm simply selects any page frame at random, dumps the page from that frame onto disk, and then uses it. If the memory reference pattern of the application program is very irregular, for example, as occurs during some list processing applications, it may even be the best solution.

FIFO page replacement
First-in first-out replacement is an obvious policy; here, the page that has been in main memory the longest is the one replaced. For most applications, this turns out to be as bad as random replacement in practice. The fact that a page has been in memory for a long time does not generally imply that it will not be needed soon!

LIFO page replacement
Last-in first-out replacement is included in this list only for the sake of completeness. This policy is among the worst possible policies! The page most recently brought into main memory is frequently among those that will be used very soon, and sending it back to disk is almost always a very bad idea.

LRU page replacement
The least-recently-used replacement policy is usually the best choice, if appropriate hardware is available to support it. This policy selects for replacement the page in main memory that was least-recently used. That is, the page that has not been used for the longest time. If a page has not been recently used, it is reasonably likely it will not be needed anytime soon. Unfortunately, LRU replacement requires that the MMU perform some fairly complex record keeping, for example, by keeping a constant record of when each page was most recently accessed.

Clock page replacement
Clock replacement is almost as good as LRU replacement. This scheme divides the set of pages in main memory into those that have recently been used, and those that are less recently used, and selects for replacement one of the less recently used pages using an algorithm to be discussed later. Where the pure LRU replacement scheme needs several bits per page frame to select the page for replacement, for example, enough bits to store a timestamp, the clock algorithm needs only one bit to distinguish between those pages that have been recently used and those that have not. Therefore, the hardware support for clock replacement is relatively simple, and in fact, special hardware can be entirely dispensed with.

Belady's optimal page replacement policy
Belady, working at IBM in the late 1960's, developed an optimal page replacement policy. This policy is impractical because it requires knowledge of the future, but because the number of page faults it requires can be worked out retrospectively, it is possible to determine how close the practical policies come to optimality. The optimal policy is simple: Replace the page that will not be needed for the longest time in the future!
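Because the optimal policy can be evaluated retrospectively, it is easy to compare policies on a recorded reference string. The following simulation sketch, in Python, counts page faults under FIFO, LRU, and Belady's policy; the structure is a generic illustration, not any particular system's code.

```python
def count_faults(refs, frames, choose_victim):
    # Generic demand-paging simulator; memory is kept in arrival order,
    # and choose_victim names the page to evict when memory is full.
    memory, faults = [], 0
    for i, page in enumerate(refs):
        if page in memory:
            continue
        faults += 1
        if len(memory) == frames:
            memory.remove(choose_victim(memory, refs, i))
        memory.append(page)
    return faults

def fifo(memory, refs, i):
    return memory[0]  # the oldest arrival

def lru(memory, refs, i):
    # Evict the page whose most recent use lies furthest in the past.
    return min(memory, key=lambda p: max(j for j in range(i) if refs[j] == p))

def belady(memory, refs, i):
    # Evict the page whose next use lies furthest in the future (or never).
    def next_use(p):
        for j in range(i + 1, len(refs)):
            if refs[j] == p:
                return j
        return len(refs)  # never referenced again
    return max(memory, key=next_use)
```

On the classic reference string 1,2,3,4,1,2,5,1,2,3,4,5 with 3 page frames, this yields 9 faults for FIFO, 10 for LRU, and only 7 for the optimal policy.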

LRU Replacement

Required Hardware

The best of the practical page replacement policies is to replace the least recently used page, that is, the page which has not been referenced by the CPU for the longest time in the past. This requires special hardware to track every memory access. One way to do this would be to have the memory management unit manage a table of page frames, where each page frame contains a timestamp that is updated with each reference to the page in that frame.

This nearly doubles the complexity of the memory management unit, but if we move the timestamp to the page table entries, recognizing that this field will only have use for those pages that are currently in frames in main memory, we can simplify the hardware. The resulting page table structure might be as follows:

      A Page Table Entry
       _________________________________
      |__________|______|_______________|
      |   Time   |Rights      Frame     |
      |  Stamp   |                      |
      |          |                      |
      |   added  |   original fields    |

Why Does LRU Replacement Work Well?

Demand paged virtual memory works, that is, it offers access times comparable to main memory, at a price comparable to secondary memory because programs exhibit what is called a working set. The working set of a program is the set of pages it needs in order to make progress at some instant in time. If enough page frames are available in main memory to store the entire working set of a program, that program will execute almost as quickly as it would if it was executing out of main memory with no paging.
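The working set at any instant can be approximated as the set of distinct pages touched during a recent window of references. A minimal sketch, assuming the reference string and window size are given:

```python
def working_set(refs, t, window):
    # The distinct pages referenced in the window ending at time t.
    # refs is the recorded page reference string; t indexes into it.
    return set(refs[max(0, t - window + 1): t + 1])
```

For the reference string 1,2,1,3,1,2,4, the working set at time 4 with a window of 3 references is {1, 3}; a program with strong locality keeps this set small and slowly changing.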

In practice, most programs exhibit a well defined working set that changes only slowly with time. That is, they exhibit strong locality of reference. The locality property involves two components: In programs that exhibit strong temporal locality, the fact that a memory location has recently been referenced strongly suggests that it will soon be referenced again. In programs that exhibit strong spatial locality, the fact that some memory location has recently been referenced strongly suggests that nearby locations will soon be needed. Sequential execution of programs and sequential processing of arrays lead to spatial locality, while iterative control structures lead to temporal locality for both code and data.

The LRU policy directly addresses the issue of temporal locality, and the organization of memory into pages that contain more than one consecutive memory location addresses spatial locality. Programs with good locality have well defined working sets, while programs with poor locality have poorly defined working sets.

Belady's optimal page replacement algorithm is optimal because it examines the future to make page replacement decisions. The LRU policy, in contrast, uses the predictions it can make by assuming that the program has good locality, and usually, these are good predictions. LRU is usually considered to be optimal among practical algorithms, although there are contexts where some anticipatory paging can lead to marginal improvements.

MULTICS on the GE 600 used true LRU replacement, back in the late 1960's, but few systems since that time have used true LRU. The MULTICS implementation used a special high-speed core memory to implement its cache of all currently available page frames. The following sequential algorithm was used by the address translation unit, implemented in fast hardware:

    I = 0
    fetch entry(I) into temp
    while I < cache_size
    and temp.page /= virtual_address.page do
        I = I + 1
        exchange temp with entry(I)
    endloop
    if temp.page = virtual_address.page
        store temp into entry(0)
    else
        report a page fault for virtual_address.page
        report that temp.frame is the least recently used page frame
        copy the map entry from main memory into cache entry(0)
    endif

This keeps the entries in the cache ordered by recency of use. This ordering guarantees that the cache will be fast for references to recently used frames (those that are likely to be frequently used) while it is slow for frames that weren't recently used (those that are unlikely to be frequently used).
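The ordering discipline is easy to see in software. Here is a Python sketch of such a move-to-front cache, with made-up page and frame names; the hardware did this with a sequential sweep, but the effect on the ordering is the same.

```python
class MoveToFrontCache:
    # Entries are kept ordered by recency of use, most recent first,
    # so the last entry always names the least recently used frame.
    def __init__(self, entries):
        self.entries = list(entries)  # each entry is a (page, frame) pair

    def translate(self, page):
        # Returns (hit, frame); on a miss, frame is the LRU frame to reuse.
        for i, entry in enumerate(self.entries):
            if entry[0] == page:
                # Hit: pull the entry to the front, shifting others down.
                self.entries.insert(0, self.entries.pop(i))
                return True, entry[1]
        # Miss: evict the last (least recently used) entry and reuse
        # its frame for the page being faulted in.
        lru_page, lru_frame = self.entries.pop()
        self.entries.insert(0, (page, lru_frame))
        return False, lru_frame
```

After any sequence of translations, reading the entry list front to back gives the pages in exact LRU order, which is what made true LRU replacement possible on that hardware.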

The MULTICS scheme is an elegant historical curiosity, of only limited practical value today because it is predicated on core memory technology.

MULTICS was the first design for a large-scale timesharing system, developed with federal funds at Project MAC at MIT starting in the mid 1960's. The original development involved a consortium involving MIT, GE and Bell Labs, but Bell Labs dropped out of the project, and Honeywell bought GE's computer business. In the 1970's, MULTICS became one of two standard operating systems offered on Honeywell mainframes.

MULTICS was originally conceived of as a computer utility. The idea was that such a utility would be operated in much the same way as the telephone and power utilities, but instead of selling power or audio communication services, the computer utility would sell CPU time and disk storage to its customers. Today, this isn't a radical idea -- there are many computer utilities, ranging from national services like AOL to smaller local providers. The concept was a radical one in the mid 1960's, and this idea drove a number of technical developments.

Clock page replacement

The clock algorithm allows a good approximation of LRU replacement, although if there are too few available page frames, it degenerates to FIFO replacement. Clock replacement requires one bit per page frame, the mark bit, to record whether the page in that frame has been recently referenced. The mark bit is set by the memory management unit hardware whenever the page in a frame is referenced, and it is reset by the page-fault service routine in the process of deciding what page frame to replace.

To select a page frame to replace, the page-fault service routine sweeps through the page frames, resetting the mark bit on each page frame and stopping when it finds a page frame with this bit already reset. The page in this frame is the one that is replaced (copied out to disk to make an empty frame).

The clock algorithm is usually pictured with a clock face, where each minute on the clock face corresponds to a page frame. The clock hand remains fixed except during the execution of the page fault service routine. When a page fault occurs, the clock hand begins its sweep. The hand stops as soon as it sees an unmarked page frame. The clock hand removes the marks from each page frame it sweeps over.

Data Structures and Code for the Clock Algorithm

The Clock Algorithm requires a frame table, that is, a table indexed by physical memory page frame number and containing, for each page frame, the virtual address or page number of the page in that frame, along with at least one status bit, the mark bit. The mark bit can be moved to the page table, at the expense of a level of indirection in each test of the mark bit by software in the replacement algorithm. This is typically done in order to simplify the MMU by having it deal only with the page table.

                          Frame Table
                        ________________
                       |_X__|___________|
                       |____|___________|
  Clock Hand -         |____|___________|
               \       |____|___________|
                 ----->|____|___________|
                    |  |_X__|___________|
     Sweep Direction|  |_X__|___________|
                    |  |____|___________|
                    V  |_X__|___________|
                       |mark| what page |
                       |bit | is in this|
                       |    | frame?    |


	find page to replace:
           loop while frame_table[clock_hand].mark is set
              reset frame_table[clock_hand].mark
              clock_hand = (clock_hand + 1) mod table_size
           endloop
           -- now, frame_table[clock_hand].what_page
           -- is the page number to be written back to disk,
           -- and clock_hand is the frame number to use with
           -- the page being brought in from disk.
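The same sweep can be rendered as runnable Python; the frame-table field names here are illustrative.

```python
def clock_select(frame_table, hand):
    # Sweep the frame table, clearing mark bits, until a frame whose
    # mark bit is already clear is found; that frame is the victim.
    # frame_table is a list of dicts with 'mark' and 'page' fields.
    while frame_table[hand]['mark']:
        frame_table[hand]['mark'] = False
        hand = (hand + 1) % len(frame_table)
    return hand  # caller pages out frame_table[hand]['page']
```

Note that the sweep always terminates: if every frame is marked, the first full revolution clears all the marks, and the hand then stops at the frame where it started, which is the degenerate FIFO behavior mentioned above.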

Clock replacement has been widely used on many computer systems, including the IBM System 370 and its descendants in the IBM mainframe family, later versions of the Multics system, and many others.

Modified Clock Replacement

It turns out that the clock algorithm can be modified to work without any special hardware support. The first large-scale use of this modification was in the DEC VAX family of computers introduced in the late 1970's, but many modern RISC systems also take advantage of this to simplify their memory management units.

As the clock hand sweeps by a page frame, all page table entries referencing that frame are marked as invalid. In fact, this page is still in main memory, but by marking it as invalid, we force the memory management unit to generate a page-fault trap the next time the application references that page. We treat the combination page-in-main-memory but marked-as-invalid in the page table as equivalent to mark-bit-reset in the original clock algorithm, while page-in-main-memory and marked-as-valid in the page table is equivalent to mark-bit-set.

When the CPU attempts to reference a page marked as invalid in the page table, there is a page fault. If the page is actually in memory, this is called a soft page fault. The page table entry is simply changed from invalid to valid when this occurs, so it is a very simple fault to handle.

Of course, we must add at least one new bit to each page table entry to distinguish between hard and soft page faults. This bit is never tested or set by the hardware, but is only used by the page-fault service routine. Therefore, it is one of the soft bits in the page table entry.

One of the most interesting models for representing the soft bits in the page table uses a complete copy of the access-rights field in the soft bits. The soft rights field gives the real access rights the user has to the page, while the hardware access-rights field gives the rights that the memory management unit will enforce, requesting a page-fault interrupt if they are violated.

       _____________________________
      |______|______|_______________|
      | Soft |Rights      Frame     |
      |Rights|                      |
      |      |                      |
      | soft |   hardware fields    |
      | bits |                      |

The state of the page, from the point of view of the modified clock algorithm, is represented by the relationship between the hardware rights field and the soft rights field:

Soft Rights = Rights = accessible

Marked. This page is not a candidate for clock replacement; it has been recently used. Legal references to the page by the application program will take place at full speed. Illegal references will cause page faults that should be interpreted as access rights violations.

Soft Rights = accessible; Rights = inaccessible

Not Marked. A candidate for replacement. When there is a page fault involving this page, it is a soft page fault and the only action is to change the hardware rights field to match the soft rights field, changing the state to marked.

Soft Rights = Rights = inaccessible

Paged out! Faults involving this page are "hard" -- page replacement and disk I/O are needed to resolve this fault.

Soft Rights = inaccessible; Rights = accessible

Strange! This combination should never occur.
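These state transitions are easy to express in software. In the following hypothetical sketch, rights are simplified to just two values, and the page-fault service routine classifies each fault by comparing the two fields; the field and state names are made up for illustration.

```python
RW, NONE = 'read_write', 'none'  # simplified rights values (assumption)

def sweep_past(pte):
    # Clock hand passes this page: revoke the hardware rights ("unmark").
    pte['hw_rights'] = NONE

def on_fault(pte):
    # Classify a fault by comparing the soft and hardware rights fields.
    if pte['soft_rights'] == NONE and pte['hw_rights'] == NONE:
        return 'hard'       # paged out: replacement and disk I/O needed
    if pte['soft_rights'] != NONE and pte['hw_rights'] == NONE:
        pte['hw_rights'] = pte['soft_rights']  # soft fault: re-mark, resume
        return 'soft'
    if pte['soft_rights'] != NONE and pte['hw_rights'] != NONE:
        return 'violation'  # marked page; the attempted operation was illegal
    return 'strange'        # soft rights absent, hardware rights present
```

The 'strange' branch corresponds to the impossible state in the table above; a defensive implementation would log it as a kernel bug.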

Dirty Pages

A dirty page is a page that has been modified since it was copied from disk to main memory. Dirty pages must be copied back to disk when they are replaced so that the frames that hold them may be used to hold other pages. Clean pages, those that are not dirty, need not be copied back to disk if the original copy of the page on disk is still available. This can cut the time needed to service some page faults in half.

Tracking clean and dirty pages requires one bit per page frame; typically, we assume that the hardware sets this bit to mark the page as dirty whenever there is a write operation on that page, and we let the software reset this bit whenever it writes the contents of a page frame to disk. The latter operation is called cleaning the page frame.

This can be combined with the clock algorithm as follows: As the clock hand sweeps by dirty page frames they are scheduled for cleaning; the actual write operations to disk take place in parallel with computation and are typically done in whatever order is convenient for the disk interface hardware and software. The clock hand only stops its sweep when it encounters a page frame that is both unreferenced and clean.
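The combined sweep can be sketched as follows; this is a simplified illustration in which the caller supplies a schedule_cleaning callback standing in for the asynchronous disk write-back, and the field names are assumptions.

```python
def clock_select_clean(frames, hand, schedule_cleaning):
    # Sweep until a frame is found that is both unreferenced and clean.
    # Dirty frames swept over are scheduled for cleaning; the write-back
    # proceeds in parallel and clears 'dirty' when it completes.
    while frames[hand]['mark'] or frames[hand]['dirty']:
        frames[hand]['mark'] = False
        if frames[hand]['dirty']:
            schedule_cleaning(hand)
        hand = (hand + 1) % len(frames)
    return hand  # an unreferenced, clean frame: replace without write-back
```

Because cleaning is done ahead of the hand, by the time the hand returns to a frame on a later revolution, its write-back has usually finished and the frame can be claimed with no disk wait at all.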

This scheme (modified clock replacement) is typical of modern virtual memory operating systems. This scheme requires minimal hardware support and yet delivers performance comparable to LRU replacement.

If the computer does not include hardware to record the fact that a page is dirty, this may be done by software by marking clean pages as read-only. When a write operation is made to a read-only page, the resulting soft page fault marks the page as read-write, which is interpreted to mean that it is a dirty page.

Some implementations of such schemes are enhanced with anticipatory paging -- for example, when there is a page fault for page i, they might bring in pages i, i+1 and i+2. This assumes that a program that references page i is highly likely to use the following pages. Since this assumption is not always a good one, such systems usually provide a way to turn off the anticipation logic in the page replacement algorithm. Anticipatory paging works very well for programs that use numerous large arrays and it provides few benefits for programs dominated by complex linked lists jumbled into the heap.

Exercise: Outline a procedure to be called from the page-fault service routine that will search the set of page frames and select a page to replace. Do this for the simple clock algorithm, the clock algorithm with clean-dirty page management, and each of the above with memory management unit hardware that does not mark used and dirty pages. Before you do this, you may want to solve the following:

Subsidiary Exercise: What are the states of a page frame under the modified (software-only) clock algorithm with clean-dirty replacement?

Subsidiary Exercise: The clock algorithm cyclically sweeps through the table of all page frames. This is not the same as the address map of the virtual address space! Describe the relationship between these two data structures!

A digression into architecture

Memory management units are fairly simple to design only if the entire page table is held in the MMU and if the MMU does not support the mark and dirty bits needed by the clock algorithm. The latter fact explains why, starting with the DEC VAX architecture, many modern systems, mostly associated with RISC architectures, have omitted hardware support for the mark and dirty bits, forcing the use of software solutions to these problems.

If the entire page table cannot fit into the MMU, it is common for the MMU to use a cache memory, so that the most recently used page table entries are in the MMU while the entire page table is in main memory. This is quite effective, except that cache design is itself a fairly messy business. As a result, some MMU designs in use today take this one step further; the MMU cache, called the translation lookaside buffer, is simplified so that, if the desired page table entry is not in the TLB, the MMU issues a page fault! Thus, the page-fault service routine is responsible for both MMU TLB management and page replacement in main memory.

This introduces a new kind of soft page fault, where the desired page is in main memory but the desired page table entry is not in the TLB, so the page fault service routine merely loads the page table entry into the MMU and returns, with no actual paging activity required.
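This new kind of soft fault can be sketched as follows; the dictionary-based TLB and page table, the field names, and the arbitrary eviction choice are all illustrative assumptions, not any real MMU's interface.

```python
def handle_tlb_miss(tlb, page_table, page, tlb_capacity=64):
    # The MMU faulted because no TLB entry matched the referenced page.
    # If the page is in fact resident, this is a soft fault: load the
    # translation into the TLB and resume, with no paging activity.
    entry = page_table.get(page)
    if entry is not None and entry['resident']:
        if len(tlb) >= tlb_capacity:
            tlb.pop(next(iter(tlb)))  # make room; eviction policy unspecified
        tlb[page] = entry
        return 'soft'
    return 'hard'  # genuine page fault: do replacement and disk I/O
```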

You can find an example MMU specification illustrating this extreme in the fictional Hawk architecture, described in

http://homepage.cs.uiowa.edu/~dwjones/arch/hawk/overview.html#mmu