22C:116, Lecture Notes, Lecture 10, Spring 2002

Douglas W. Jones
University of Iowa Department of Computer Science

  1. When there's not enough memory

    What if the aggregate memory requirements of the system plus applications exceed the available main memory? If there is auxiliary memory available, for example on disk, there are at least four ways of solving the problem:

    Overlays are sections of an application's code or data that are stored on secondary memory when not in use. It is up to the application program to bring overlays into main memory when they are needed by that application, and it is up to the application to return overlays to disk or other secondary memory, if necessary, when they are unneeded. Code overlays are usually read-only, so they are not written back to disk. Data overlays generally must be written back, and because of this, they are more expensive.

    Overlaying is usually considered an archaic method, but it remains in use, particularly on small microcontrollers using technologies such as flash EEPROM instead of disk.

    Typically, each overlay contains a procedure or function, or a group of related routines. The main program must load the overlay before calling the routine, and the routine in the overlay may call any routines that were loaded with the main program. In addition, an overlay may contain secondary routines called only from within that overlay, and it may load other overlays.

    In the most primitive overlay systems, memory to hold the overlays is allocated statically. In more advanced overlay systems, memory to hold overlays is allocated dynamically. In such a system, a call to a routine contained in an overlay might be coded as follows:

        { /* load overlay X and then call function x within it */
                /* requires <fcntl.h>, <unistd.h> and <stdlib.h> */
                int X;       /* file descriptor */
                int Xsiz;    /* size of file */
                void * Xpt;  /* pointer to overlay */

                /* first, find out about the file */
                X = open( "X", O_RDONLY );
                Xsiz = lseek( X, 0, SEEK_END );

                /* second, make space for the contents */
                Xpt = malloc( Xsiz );

                /* third, load the file into memory */
                lseek( X, 0, SEEK_SET );
                read( X, Xpt, Xsiz );
                close( X );

                /* then, call the function at the start of the overlay;
                   the overlay pointer must be cast to a function pointer */
                ((void (*)()) Xpt)( params );

                /* finally, dispose of the memory it occupied */
                free( Xpt );
        }
    
    The above C code assumes that the overlay can be loaded into memory by a simple call to read(). This will be true if the file contains position independent code, but a more complicated loader is required if any relocation or linkage needs to be performed.

    It is ugly to include the above code in the calling application, and in response to this, overlay linkage through load-on-call overlay stubs was developed. Load-on-call stubs hide the details of overlay management from the user of the overlay, packaging these details in a special procedure, the load-on-call stub. A call to the load-on-call stub loads the overlay into main memory and then transfers control to the actual desired function. Thus, the act of calling the procedure causes it to be loaded.

    The word stub is widely used in discussions of program development, dynamic linkage and distributed software, so it is important to make it clear that this word is not just technical jargon, but also common English.

    A stub is the remainder of a branch or limb that has been cut or broken off a tree or an animal. In the case of software, the tree of interest is the abstract tree describing the history of procedure and function calls during the execution of a program. The root of this tree is the main program, and each routine called from the root corresponds to a branch. A routine that calls no other routines is a leaf of the tree, and if some branch of the tree is run on a different machine, or loaded by a different loader, or involves code that has not yet been written, the code at the base of that branch is referred to as a stub.

    Consider the procedure X. In a program calling X where the memory is not sufficient for all routines needed by that program, a call to X might be linked to a stub instead of being linked to X. The following stub is designed to mimic the external interface of X, and when it is called, it loads X into memory, calls it, returns the results of X, and then deallocates the memory used so that it may be used for other routines.
    	Xstub(params)
    		Xpt = malloc( sizeof(X) )
    		load( X, Xpt )
    		(*Xpt)(params)
    		free( Xpt )
    		return
    
    Here, sizeof(f) returns the size of the memory region needed by the object file f, and load(f,p) is a system routine that loads object file f into the memory region pointed to by p.
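
    Expressed in C, the same stub might look like the following sketch. Here, overlay_size() and overlay_load() are hypothetical names for the sizeof(f) and load(f,p) primitives just described, and the overlaid routine is assumed, purely for illustration, to take and return an int.

        /* a sketch of a load-on-call stub; overlay_size() and
           overlay_load() are hypothetical stand-ins for the
           sizeof(f) and load(f,p) primitives described above */

        #include <stdlib.h>

        extern size_t overlay_size( const char *f );
        extern void   overlay_load( const char *f, void *p );

        int Xstub( int params )
        {
                int result;
                void * Xpt = malloc( overlay_size( "X" ) );

                overlay_load( "X", Xpt );                 /* bring X into memory */
                result = ((int (*)(int)) Xpt)( params );  /* call the real X */
                free( Xpt );                              /* release X's memory */
                return result;
        }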

    More intelligent overlay management stubs can pull the overlay into main memory only if that routine is not already there. Consider the following stub for X that shares memory with routines Y and Z:

            Xstub(params)
                if Xpt is null then
                    Xpt = malloc( sizeof(X) )
                    if Xpt is null then -- the allocate failed
                        if Ypt is not null then
                            free(Ypt)
                        else if Zpt is not null then
                            free(Zpt)
                        endif
                        Xpt = malloc( sizeof(X) )
                    endif
                    load( X, Xpt )
                endif
                (*Xpt)(params)
                return
    
    Here, when X is first called, the stub will allocate space and load the code. On subsequent calls, the code will remain loaded, so no new allocation is needed. If, however, space is needed for routines Y or Z, X may be deallocated. If X is needed and space is unavailable, the stub for X will deallocate Y or Z to make space.

    The above scheme rests on the assumption that the stubs for X, Y and Z cooperate to share memory, and it rests on the assumption that the code for X can be deallocated safely when Y or Z need the space! This assumption can be met if no procedure called from X calls Y or Z, no procedure called from Y calls X or Z, and no procedure called from Z calls X or Y.

    By the late 1960's, most computer manufacturers offered elaborate linkage editors that would analyze the procedure call graph of a program to determine what procedures could operate as peers in overlay management schemes such as the above. Today, such overlay managers are almost obsolete because of the widespread use of virtual memory.

    Overlays allow a single program to run in a memory region smaller than that program. The next technique we will discuss allows multiple processes to run in a memory that cannot hold all of them at once. Note, however, that it is almost always possible to break a single program into multiple communicating processes, where the purpose of the breakup is not to achieve parallelism, but merely to reduce the size of the address space needed by any single component of the original program.

    Overlay technology has modern descendants that we use to solve a completely different class of problems. Just-in-time compilation and downloadable applets are implemented using technology descended from overlay managers.

    Swapping involves the process scheduler. When swapping is used, memory assigned to ready processes is considered preemptable. If memory is needed, a ready process is selected to be swapped out. Swapping a process out of main memory involves copying the data segment(s) of that process to a swapping device, usually a disk. When a process is selected to be run, if that process is swapped out, the system must first find or make space in main memory, then swap that process in, before running it.

    This leads to a process state diagram such as the following:

             preempt
               or         swap
           relinquish     out
             ---->--     ---->-
           /         \ /        \
       RUNNING      READY     SWAPPED
           \         / \        /
             --<----     -<----
            schedule      swap
                           in
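
    The states and transitions in this diagram can be summarized in code. The following C sketch is purely illustrative; the state and function names are invented here, not taken from any particular system, and relinquishing behaves the same as preemption.

        /* the three process states from the diagram above */
        enum pstate { RUNNING, READY, SWAPPED };

        /* each transition maps one state to its successor;
           any other state is returned unchanged */
        enum pstate preempt ( enum pstate s ) { return s == RUNNING ? READY   : s; }
        enum pstate schedule( enum pstate s ) { return s == READY   ? RUNNING : s; }
        enum pstate swap_out( enum pstate s ) { return s == READY   ? SWAPPED : s; }
        enum pstate swap_in ( enum pstate s ) { return s == SWAPPED ? READY   : s; }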
    
    Swapping systems are at their best if main memory can be divided into at least two partitions, each able to hold one process. In this case, the process in one partition can be running while processes in other partitions are being copied to or from disk. The classic IBM mainframe operating systems of the 1960's operated this way, as did early timesharing systems for the DEC PDP-8 and PDP-10 (a minicomputer and a mainframe, respectively). The first versions of UNIX, on the DEC PDP-7 and PDP-11, were also based on this idea.

    It is worth noting that, on these systems of the 1960's, swapping was frequently done to a drum memory, so early system descriptions often speak of the swapping drum.

    If memory on a uniprocessor was divided into N partitions, there could be at most N-1 ready processes, but typically there were only N-2, because at any given instant the process in one partition was in transit to or from disk.

    Virtual memory based on Paging and the related technique of Segmenting involves hardware address translation mechanisms that allow individual pages of the main memory being used by a program to be preemptively claimed by the system on behalf of other programs.

  2. Virtual Memory

    The subject of virtual memory lies on the interface between operating systems and computer architecture:

      CPU <---> MMU <---> Memory
                 |
              Address
            translation
             hardware
    
    Addresses issued by the CPU are considered to be virtual addresses. The memory management unit translates these to physical addresses or it complains (with a trap request) that the translation is impossible or that the proposed access is not allowed.

  3. Hardware Aspects of Address Translation

    We will concentrate on paged memory, but there are many segmented machines. Notably, the 8086 family of microprocessors supports a segmented memory model that is most clearly expressed in the 80286. This was not effectively exploited, and later 80x86 machines allowed users to largely ignore segmenting.

    Pages are typically used without regard to the logical arrangement of memory -- that is, data structures are not typically arranged in any particular way with respect to page boundaries. Programmers on paged systems do not usually pay any attention to the presence of pages.

    In a segmented system, each data structure is typically assigned to a particular segment. Thus, a segment might be dedicated to each procedure or to each activation record on the stack in a fine-grained segmented system. In a coarse-grained use of segmenting, one segment might be reserved for the stack, another for the code, and a third for the global data of a program.

    On some systems, large segments are divided into pages.

    On paged systems, the virtual address is viewed as a single integer that indexes a word (or byte) in a single linear virtual address space.

    On segmented systems, the virtual address space may be a single linear space, or it may be divided into separate spaces. The 8086 is an example of the latter, where the segment being addressed is determined by the context. PUSH and POP reference the stack segment. Most loads and stores reference the data segment, and instruction fetches reference the code segment. (This default behavior may be overridden by special instruction prefixes).
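
    For the original 8086, running in what is now called real mode, the translation from a segment:offset pair to a physical address is simple enough to state directly: the 16-bit segment register is shifted left 4 bits and added to the 16-bit offset, giving a 20-bit physical address. The following C fragment illustrates this; the protected modes of the 80286 and later processors instead use the segment value to index a table of segment descriptors.

        /* real-mode 8086 address computation */
        unsigned long physical_address( unsigned int segment,
                                        unsigned int offset )
        {
                return ((unsigned long) segment << 4) + offset;
        }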

  4. Paged Address Translation

    Here is a diagram of the data flow through the hardware used to translate a virtual address to a physical address in a simple paged virtual address translation memory management unit.

        Virtual Address from CPU
         ______________________
        |__________|___________|   Word in Page
             | Page      |___________
             | ___ _________         |
             |  | |         |        |
        array|__| |  Page   |        |
        index   | |   Table |        |
                V |         |        |
               -->| |       |        |
                  |_|_______|        |
                   |    |Frame       |
                 __|    |__          |
                |          |         |
                v     _____v_________v______
             Access  |__________|___________|
             Control
                     Physical Address to Memory
    
    
    The picture shown here shows the logical function of the memory management unit, but there are many possible physical organizations -- if the page table is large, it is not practical to store it in the MMU!
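
    The data path in this diagram can be modeled in software. The following C sketch assumes, purely for illustration, a 16-bit virtual address split into an 8-bit page number and an 8-bit word-in-page field; the page table layout is a simplification of the entry format described in the next section.

        /* a software model of the translation data path above,
           assuming 8 bits of page number, 8 bits of word-in-page */

        #define WORD_BITS 8                 /* word-in-page field */
        #define PAGES     256               /* 8-bit page number  */

        struct pte {                        /* simplified entry   */
                unsigned int frame;         /* page frame number  */
                int valid;                  /* simplified access
                                               control            */
        };

        extern struct pte page_table[PAGES];
        extern void page_fault_trap( unsigned int va );

        unsigned int translate( unsigned int va )
        {
                unsigned int page = va >> WORD_BITS;
                unsigned int word = va & ((1 << WORD_BITS) - 1);

                if (!page_table[page].valid)
                        page_fault_trap( va );  /* the MMU traps */

                return (page_table[page].frame << WORD_BITS) | word;
        }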

  5. The Contents of the Page Table

    The format of one entry in the page table will typically be something similar to the following:

        _ _ _  _______ ______________________
      | _ _ _ |_______|______________________|
          |       |              |
          |       |            Frame -- where is the
          |       |                     page stored.
          |     Access
          |     Control -- what actions are allowed,
          |                by hardware on this page.
        Soft                         || invalid
        fields -- used by software   || read-only
                  to record other    || read-write
      (optional)  attributes of the  || execute-only
                  page.              || dirty
    
    
    Some hardware leaves space for software use in each page-table entry. This allows the system software to maintain additional attributes of pages that are not known to the hardware. On other systems, the software may have to allocate space for any such fields in a parallel array. In any case, the hardware does not interpret the soft fields in any way, and on some systems, the software may not need such fields.
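
    In C terms, such an entry might be pictured as a structure with bit fields. The field widths below are hypothetical; on a real machine, the hardware fixes the exact layout.

        /* an illustrative page table entry; field widths are
           hypothetical, since real hardware fixes the layout */
        struct page_table_entry {
                unsigned int soft   : 4;  /* ignored by the hardware  */
                unsigned int access : 3;  /* invalid, read-only,
                                             read-write, execute-only,
                                             dirty ...                */
                unsigned int frame  : 9;  /* where the page is stored */
        };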

  6. Brute Force Page Table Storage

    On systems with a small virtual address space, the page table can be stored in hardware registers within the memory management unit.

    Consider a machine with 16 bits per virtual address: 8 bits to indicate which word in the page, and 8 bits to indicate which page.

    This requires 256 registers in the MMU to hold one page table.

    Thus, even with special hardware, context switching may be quite slow compared to a machine where only the registers in the CPU are saved and restored.

    The above example is based on the MODCOMP Classic series of computers, first sold around 1973 (www.modcomp.com still exists). These were extensively used by NASA (in the space shuttle ground support system) and by the oil industry (in real-time control of oil refineries and drilling operations). The machine had 16 registers of 16 bits each in the CPU, usable as 8 registers of 32 bits each or 4 registers of 64 bits each.

  7. Alternatives for Page Table Storage

    The table can be stored in main memory. In this case, the MMU takes 2 memory cycles per memory reference, unless the table is stored in main memory with a cache located in the MMU to speed access.
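
    To see the benefit of such a cache, suppose, purely for illustration, that a main memory cycle takes 100 nanoseconds and that the cache holds the needed translation for 95 percent of references. A reference that hits in the cache costs one memory cycle, while a miss costs two (one to fetch the page table entry, one for the reference itself), so the average cost is 0.95 * 100 + 0.05 * 200 = 105 nanoseconds, instead of the 200 nanoseconds required if every reference involved a page table lookup in main memory.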

    Cache-based paging systems are simplest if the cache size is the same as the number of page frames in main memory.

    The Ferranti Atlas computer, introduced in 1962, was the very first machine to use this scheme; the page table was stored in main memory, and there was one cache entry per page frame, so there was no significant run-time penalty for address translation.

    Cache Entry
       _______________________________
      |____________|__________________|
       | Page        | Table
       | Number      | Entry
       |             |
      Subject to     Returned by
      Associative    Successful
      Search         Search
    
    In its simplest form, the associative search involves a simple parallel comparison of the page number in each cache entry with the page number issued by the CPU. This can be done almost instantly with appropriate hardware. Cache hardware to support fully associative searches is expensive, but the alternatives are the subject of an architecture course and will not be discussed here.
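
    In software terms, the fully associative search behaves like the following C sketch, except that the hardware compares the page number with every entry at the same time instead of in a loop. The entry format and names here are hypothetical.

        #define CACHE_ENTRIES 64        /* assumed cache size */

        struct cache_entry {
                unsigned int page;      /* subject to associative search */
                unsigned int frame;     /* returned by successful search */
        };

        struct cache_entry cache[CACHE_ENTRIES];

        /* returns nonzero and sets *frame on a hit, zero on a miss */
        int cache_lookup( unsigned int page, unsigned int *frame )
        {
                int i;
                for (i = 0; i < CACHE_ENTRIES; i++) {
                        if (cache[i].page == page) {
                                *frame = cache[i].frame;
                                return 1;
                        }
                }
                return 0;
        }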

    If the cache size is the same as the number of page frames in main memory, a cache miss signals a page fault, and the software can be put in charge of replacing the cache entry as it replaces the page in the page frame. This approach results in the simplest possible cache logic, since the hardware need not manage cache misses and replacement; this is what was done on Atlas in the early 1960's.

  8. The Software Hardware Interface

    The page fault trap service routine gains control when the MMU detects access to a page that is not in main memory.

    A trap is very similar to an interrupt. The traditional distinction is that interrupts are caused by asynchronous events, while traps are caused by the actions of the CPU. Other traps include such things as arithmetic exception traps (caused by the values of operands to arithmetic instructions), and bus-fault traps (caused by attempts to access physical addresses to which no physical memory is attached).

    On some machines, there is an additional convention -- When an interrupt is requested, the current instruction is allowed to complete before the PC is saved and the interrupt service routine is entered. When a trap occurs, the current instruction is aborted and the trap service routine is entered directly.

    For it to be possible to serve a page fault, the instruction must be aborted before it has any side effects, or (as on the DEC PDP-11) all side effects must be recorded so that they can be undone by the trap service routine prior to any attempt to restart the instruction.

    The exception to the above rule is on machines such as the Intel 80x86 family, where some instructions are designed to be interruptible -- typically, this applies to block copy instructions. These instructions, if interrupted, save their state in CPU registers in such a way that they can be restarted from the point of interruption instead of from the beginning.

  9. Page Fault Traps

    On entry to the trap service routine, the saved PC is the address of the instruction that caused the trap.

    Before return, the trap service routine must adjust the page table so that re-executing the instruction that caused the trap will not cause the same page fault.

    On the PDP-11, the saved PC was not the correct one, but sufficient information about the partial execution of the instruction was saved that the correct PC could be reconstructed by the trap service routine. The PDP-11 MMU option included hardware to track which registers an instruction had changed (including the PC) and by how much, so that, after any trap, the changes could be undone by the trap service routine.

    Exercise Pick an instruction set with which you are familiar. Examine the instructions in this set to see which, if any, have side effects that cannot be undone when a trap occurs. The answers on the IBM System 360, DEC VAX, Motorola 680x0 and Intel 80x86 are all interesting, and the solution each of these families of machines takes to dealing with such side effects is different and instructive.

    A single instruction can cause many page faults. Consider a memory-to-memory move-double-word instruction. The instruction itself will typically be more than one word long because it contains two complete operand addresses, so the instruction may span a page boundary. Both the source and destination operands are double words, so they too may span page boundaries. Thus, an attempt to execute this single instruction might cause as many as 6 page faults.

    Clearly, if there are fewer than 6 page frames on the machine, this instruction cannot be executed. In general, the instruction set of a virtual machine must be examined to determine the maximum number of pages a single instruction can reference, and then that number of page frames must be guaranteed.

    Each entry to the page fault trap service routine must bring in one of the pages needed to complete an instruction. It need not bring in all of the required pages, since it is frequently faster to simply retry the instruction to cause the next fault than it is to use software to try to determine what other pages the instruction might need.

    Exercise Pick an instruction set with which you are familiar. Examine the instructions and operand formats to determine the minimum number of page frames that must be available to allow the most complex instruction in this instruction set to execute to completion.

  10. Page Faults

    When a program references a page that the memory management unit is unable to handle, or when the reference requires access rights that the memory management unit does not allow, the memory management unit reports a page fault by forcing the CPU to enter a trap service routine. This routine is usually part of the operating system, although there are systems, such as Mach, where page faults may be handled by application-level code.

    The instruction that was executing at the time of the trap is aborted! On entry to any trap service routine or interrupt service routine, a properly designed CPU will cooperate with the prologue (the introductory sequence of instructions) of the service routine to save the values of all of the registers of the interrupted program, including the PC. In the case of trap handlers, the saved values should be the values that were in effect immediately prior to the instruction that caused the trap.

    Note: On some machines, the values for some saved registers include changes made by the instruction that caused the trap. For example, the program counter may have been incremented as a side effect of the instruction fetch prior to the memory cycle that caused a page fault, or the stack pointer may have been decremented prior to the memory cycle that caused the fault. If such a machine is to be used to support demand-paged virtual memory, there must be a way to determine what registers have been changed so that the prologue of the page fault trap service routine can undo these changes!

    In theory, given the PC and registers prior to the instruction that caused the trap, the trap service routine can simulate the execution of the next instruction to determine all of the memory addresses it might use, and it can check each of these addresses to see which might have caused the trap. This is simple enough that on some machines, the memory management unit does not save the actual virtual address that caused the trap; in this case, the address must be computed by the prologue of the trap service routine. The software is simplified, however, if the MMU saves the virtual address that caused the trap, thus saving the trap service routine from having to simulate an instruction cycle.

    Here, in outline form, is a generic page fault service routine:

    page fault service routine:
        -- prologue
        save state prior to instruction that caused the trap
    
        -- find necessary information about cause of fault
        va = virtual address that was referenced
        op = operation that was attempted (read or write)
    
        -- is this an access violation?
        if map[va.pageno].rights = read_only
        and op = write
            handle as user access violation (bus trap!)
        else
    
            -- find a free page frame
            if there are free page frames
                pa = free frame address
            else
    
                -- page replacement policy
                pa = address of page to replace
                va' = virtual address of page in that frame
                write( pa, diskaddress( va' ))
                update virtual address map, etc to reflect change
            endif
    
            -- bring in the required page
            read( pa, diskaddress( va ))
            update virtual address map, etc to reflect change
    
            -- epilogue
            restore saved state
            return from page fault service routine
        endif
    
    This service routine faces two key choices: what page(s) should be copied into main memory from disk, and, assuming that main memory is full, what page(s) should be copied out from main memory to disk in order to make space.

    What page should be copied into main memory?

    What page in main memory should be replaced? In the above code, this issue was glossed over with one line saying ``address of page to replace''. The major alternatives are the subject of the next lecture.