3. Resource Management

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

One of the conventional views of operating systems is that the operating system is fundamentally a resource manager, primarily concerned with solving resource conflicts between competing users. This view focuses on the following resources:

Main Memory
The memory manager; this may include heap management, virtual address space management and page fault handlers.

CPU Time
CPU time is generally shared between multiple threads or processes.

Input/Output Time
There are frequently limits on the number of simultaneous input or output transfers the system can perform, and this means that it may be necessary to schedule these.

Secondary Memory
The file system; here we focus on random access disks, directory structures and the implementation of files on the disk.

Main Memory

The primary memory of the system may be seen by the application programmer in many ways, depending both on what objects are clients of the memory manager and on how the manager allocates memory.

Memory may be allocated in pages or segments. These terms are usually defined with reference to memory management unit hardware, but when only software is involved, the same principles apply, and the same terms may be used.
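
To make the distinction concrete: whether it is enforced by hardware or only by software convention, a page is a fixed-size unit, so a page number and an offset can be extracted from an address with a shift and a mask, while a segment is a variable-size unit that must be described by a base and a length. The following C sketch illustrates this; the 4096-byte page size and the names used are merely illustrative assumptions.

   #include <stdint.h>

   #define PAGE_SHIFT 12                        /* illustrative: 4096-byte pages */
   #define PAGE_SIZE  (1u << PAGE_SHIFT)

   /* pages are fixed-size, so page number and offset fall out of the address */
   static inline uint32_t page_number(uint32_t addr) { return addr >> PAGE_SHIFT; }
   static inline uint32_t page_offset(uint32_t addr) { return addr & (PAGE_SIZE - 1); }

   /* segments are variable-size, so each must be described explicitly */
   struct segment {
       uint32_t base;      /* where the segment begins */
       uint32_t length;    /* how many bytes it holds */
   };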

The clients of a memory manager may be user programs or other parts of the operating system. Allocation to user programs need not use the same mechanism as allocation to other parts of the system, although a proliferation of mechanisms clearly adds complexity to the system.

For example, under Unix, the kernel uses a segmented memory manager, the kernel's heap manager, to allocate space for small kernel data structures, while using paged management to allocate space to user programs. In turn, most Unix application programming environments use a user-level heap manager to allocate segments of memory from within their paged address spaces for user data structures.

The use of paged or segmented memory allocation does not necessarily rest on the use of memory management hardware. For example, old versions of MacOS for the MC68000 processors in the original Apple Macintosh computers used pages of 64K words each, with each open application being allocated, at minimum, one stack page and one code page. The heap manager for most versions of Pascal and C allocates segments from the heap. The heap itself is frequently implemented as a contiguous range of pages in the user's virtual address space, and this range, in turn, is frequently managed as a segment.
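
As a rough illustration of what such a user-level heap manager might look like, here is a minimal first-fit allocator, written in C, that carves variable-sized segments out of one contiguous region. The names heap_init and heap_alloc and the fixed region size are assumptions made for the sketch, and freeing and coalescing of segments are omitted entirely.

   #include <stddef.h>
   #include <stdint.h>

   #define HEAP_SIZE 65536            /* illustrative size of the contiguous region */

   /* every segment, free or allocated, begins with a small header */
   struct header {
       size_t size;                   /* size of the segment, header included */
       struct header *next;           /* next segment on the free list */
   };

   static union {
       uint8_t bytes[HEAP_SIZE];      /* the contiguous range of pages */
       struct header align;           /* forces suitable alignment */
   } arena;
   static struct header *freelist;    /* list of free segments */

   void heap_init(void) {
       freelist = (struct header *) arena.bytes;
       freelist->size = HEAP_SIZE;
       freelist->next = NULL;
   }

   /* first fit: walk the free list and split the first segment big enough */
   void *heap_alloc(size_t bytes) {
       size_t need = (bytes + sizeof(struct header) + 7) & ~(size_t) 7;
       struct header **p = &freelist;
       while (*p != NULL) {
           struct header *seg = *p;
           if (seg->size >= need) {
               if (seg->size - need >= 2 * sizeof(struct header)) {
                   /* split off the tail and leave it on the free list */
                   struct header *rest = (struct header *) ((uint8_t *) seg + need);
                   rest->size = seg->size - need;
                   rest->next = seg->next;
                   seg->size = need;
                   *p = rest;
               } else {
                   *p = seg->next;    /* hand out the whole segment */
               }
               return (void *) (seg + 1);
           }
           p = &seg->next;
       }
       return NULL;                   /* no free segment is large enough */
   }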

CPU Time

Some kind of CPU time sharing mechanism is essential on any system with more logical processes than physical processors. Because the technology for running multiple processes on one CPU was developed in the early 1960's for timesharing systems, it is common to talk about the contending processes as if each belongs to a different end user of the system.

For our purpose, the terms process and task are synonyms, and in the absence of multilevel scheduling and resource protections, there is no functional difference between a thread and a process. The terms timesharing and multitasking both date from the early 1960's and refer to mechanisms that support multiple concurrent processes on fewer processors; the former term emphasizes the idea of one process per interactive user, while the latter comes from the world of batch processing, where each non-interactive job consisted of a sequence of tasks that needed to be performed. The term multiprocessing is not a synonym for multitasking; instead, it usually refers to systems with multiple CPUs, an idea that also dates from the early 1960's.

Since 1966, with the Berkeley Timesharing System, many systems have allowed users to run more than one process in parallel, so not only do users compete with each other for CPU time, but the processes of one user also compete with each other.

Surveys of Unix users done in the 1970's show that a single user at a remote terminal typically had 4 processes, one of which was interactive, one of which was waiting for the former process to terminate, and two of which may have been waiting or running in the background.

On the window-oriented systems that emerged in the late 1970's, it became common to find one or more processes behind each window on the display screen, as well as a process running the window manager itself, plus any background processes the user may have launched.

This makes it clear that support for multiple logical processes has real value even on a single user system, although in the first two decades of personal computer operating systems, from 1976 to 1996, most systems offered little or no support for this.

Complete support for multitasking or timesharing relies on the system component called the CPU scheduler. The scheduler maintains a queue of processes awaiting a chance to use a CPU, the ready list, and whenever a CPU becomes available, the scheduler selects a process from the ready list to be run. Each time a process is scheduled, there is a context switch to that process from the previous running process.

Input/Output Time

If multiple input-output transfers are pending, a possibility on any multitasking operating system, we must frequently deal with contention between independent transfers. Where there are two pending requests for operations on a single device such as a network interface or a disk drive, we must do first one and then the other. When two devices share some intermediate system component, for example, when there are two disk drives on one SCSI bus, we must deal with contention between transfers addressed to different I/O devices.

With single user operating systems, such contention generally has only a small effect, and simple schemes such as first-in-first-out I/O scheduling are perfectly acceptable. With multiuser systems, whether they are the large-scale timesharing systems of the 1970's or the large network file servers of today, careful scheduling of input-output operations can have a very noticeable effect on performance.
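
As a sketch of why careful scheduling helps, the fragment below picks the next disk request by sweeping the head in one direction through increasing block numbers, wrapping around when nothing lies ahead; this is a simple circular-scan policy, offered only as an illustration, and the request array and names are hypothetical.

   #include <stddef.h>

   /* Pick the next disk request to serve, sweeping the head in one
      direction through increasing block numbers and wrapping around
      when no pending request lies ahead (a circular-scan policy).
      The pending[] array and block numbers are hypothetical. */
   int next_request(const long pending[], size_t n, long head)
   {
       int ahead = -1, lowest = -1;
       for (size_t i = 0; i < n; i++) {
           if (lowest < 0 || pending[i] < pending[lowest])
               lowest = (int) i;
           if (pending[i] >= head &&
               (ahead < 0 || pending[i] < pending[ahead]))
               ahead = (int) i;
       }
       return ahead >= 0 ? ahead : lowest;   /* index of the request to serve */
   }

Under first-in-first-out service the head may seek back and forth across the whole disk between consecutive requests; sweeping in one direction keeps most seeks short, which is where the performance difference comes from.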

Secondary Memory

From the user's perspective, secondary memory space, typically disk space, is the single most important resource managed by the operating system. In the extreme, many users view the operating system as if it were nothing more than a command language interpreter and a file system, and in fact, first-generation operating systems such as Microsoft's old MS-DOS were indeed little more than this.

From the user perspective, the central function of a file system, as opposed to the I/O subsystem, is the allocation of space on secondary memory for competing files. Usually, the secondary memory is a magnetic disk subsystem, but it could be anything from Flash EEPROM to an optical recording device such as a CD or DVD drive.

Allocation of disk space usually involves algorithms quite different from those used for main memory, both because the user model of disk files is not the same as the user model of main memory, and because secondary storage devices are typically block structured with a complex relationship between block addresses and I/O latency times. Things are even more complex when the disk technology limits the number of times a write-operation may be done, as with write-once recordable CD-ROM technology.
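
As one small piece of this picture, many file systems track free blocks with a bitmap holding one bit per block; the sketch below, with illustrative names and sizes, shows the idea, though a real allocator would also try to place related blocks near one another because of the relationship between block addresses and latency noted above.

   #include <stdint.h>

   #define NUM_BLOCKS 8192                    /* illustrative disk size, in blocks */
   static uint8_t freemap[NUM_BLOCKS / 8];    /* one bit per block; 0 means free */

   /* allocate any free block, returning its number, or -1 if the disk is full */
   long alloc_block(void)
   {
       for (long b = 0; b < NUM_BLOCKS; b++) {
           if (!(freemap[b / 8] & (1u << (b % 8)))) {
               freemap[b / 8] |= (uint8_t) (1u << (b % 8));
               return b;
           }
       }
       return -1;
   }

   void free_block(long b)
   {
       freemap[b / 8] &= (uint8_t) ~(1u << (b % 8));
   }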

Security and reliability

Questions of security and reliability do not fit cleanly in the resource centered model of operating system structure, and as a result, classical operating systems texts frequently covered these as afterthoughts. For that matter, classical operating system designers also tended to ignore these until so much of the system design was completed that many security flaws had been frozen into the core of the design.

Protection mechanisms are necessary for both security and reliability, but many small computer systems do without, even though the technology needed to build secure systems has been well known since the mid 1960's.

Security demands that system resources be protected from malicious attack.

Reliability demands that system resources be protected from accidental damage.

Aside: As a general rule, anything which a malicious user might intentionally do to damage a system might also be done by a careless user. Traditionally, security has been concerned with protection against malicious users, while carelessness was viewed as the concern of fault tolerance. The recognition that malice and carelessness frequently produce the same result has led to an appreciation of the relationship between fault tolerance and secure systems.

Resource Protection is generally demanded by users when they perceive a security threat.

Typically, users demand that segments of main memory, files on secondary memory, and terminals allocated to them all be protected from unauthorized use or inspection by other users. There are uniform access control and protection mechanisms that can control access to all of these on the same basis, but on most modern systems, each class of resource is protected using ad-hoc mechanisms that are generally different from those used to protect other resources.

Resource sharing becomes an issue when there are multiple users or multiple processes that must share access to some resources while other resources are private and protected.

Files, windows or memory segments are all appropriate objects that might be shared or might be protected. Sharing is only trivial when there is no protection, and protection is only trivial when there is no sharing.

Early timesharing systems viewed resource sharing in terms of allocating free resources to one or another user, exclusively. If this is done, the system is effectively creating one isolated personal computer for each user, so there is essentially no sharing and the result is of only limited utility.

With the advent of systems such as Multics and the Berkeley Timesharing System (around 1966), the focus shifted to the management of information that was actively shared between users. Thus, each user might have some memory regions that are entirely private, and others that are shared with some, but not all, other users. Management of this shared data, whether on disk or in main memory, poses a significant challenge.

This challenge was largely ignored for much of the decade between 1975 and 1985 as personal computers displaced timesharing systems from growing numbers of applications, but it has re-emerged with a vengeance with the advent of computer networks.

Reliability, or absence of failures, is a natural goal of the system designer. Protection mechanisms play a strong part in assuring the reliability of systems by preventing failures such as programming errors in one part of a system from having unlimited effects on other parts of the system.

On early systems, it was generally assumed that any failure in one part of a system would lead to the failure of the entire system, but as technology has advanced, we have learned to build systems where failures lead to incremental degradation of performance.

Early examples of this include memory managers which detect memory errors and remove the bad memory addresses from the pool of memory available for allocation to various system applications. This is now routinely done both with disk systems and with main memory.

More modern examples include multiprocessors where, when a processor fails, the remaining processors continue to operate.

With centralized systems, the assumption that system failure was an all-or-nothing thing was natural. In fact, this was never true, but partial failures of a centralized system were rare enough that they are the subject of folklore -- for example, the timesharing system where the floating point unit failed. The operating system didn't use floating point, so none of the system managers knew anything was wrong until users started complaining about strange results.

With large-scale distributed systems, partial system failures are the norm! Many networks today are large enough that the normal state of the system is that one or more machines are down at any instant.

Exercise: Consider the internet, with around one million machines. Assuming each machine has a mean-time between failures of one month, what is the rate of failures, per unit time?

Process Management

Very similar pictures of the states of a process, from the point of view of the process manager, are found in many elementary operating systems textbooks, generally something like the following:

             ---  Waiting  <---
    signal /                    \ wait
          /                      \
          V   <---preempt---      |
        Ready                  Running
          ^    ---schedule--->    |
          \                      /
    create \                    / terminate
             -----  Dead <-----
Some descriptions add one more operator, kill, that forces state transitions from the other states to dead. The process states are summarized here:
Running
There is exactly one process running on each operating CPU in the system. Running processes may wait for an event (they enter the waiting state) or they may terminate (becoming dead). In addition, a running process may be preempted or may voluntarily relinquish the CPU (becoming ready).

Ready
All processes that need a CPU but do not currently have one are classified as ready processes. A ready process may be scheduled (becoming running).

Waiting
All processes that cannot be run until some event occurs are classified as waiting processes. When some other process signals that the event has happened, the waiting process becomes ready.

Dead
Processes that could exist in the future or that have existed in the past are classified as dead. On some systems, dead processes are actual objects, while on others, this class is merely conceptual. Process creation is typically described as taking one of these real or conceptual dead processes and making it ready.
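
One way to make this concrete is to record the state in each process description; the following C sketch uses hypothetical names and keeps only the fields needed for the discussion that follows.

   /* a minimal process description recording the four states above;
      the names and fields are illustrative, not from any particular system */
   enum pstate { DEAD, READY, RUNNING, WAITING };

   struct process {
       enum pstate     state;      /* which of the four states this process is in */
       void           *pc;         /* saved program counter */
       unsigned long   regs[16];   /* saved general registers */
       struct process *next;       /* link for the ready list or a semaphore queue */
   };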

Even if no useful process in the system needs the CPU, it is usual to keep some process running on it! For this reason, operating systems usually include some kind of background process to soak up the otherwise unused CPU cycles. Sometimes, this is referred to as the background process (as opposed to other background processes) or the idle process.

On early Unix systems, one well-known use for the idle process was computing more digits of pi. This is silly; there are more useful housekeeping jobs the idle process can do. For example, it can run CPU and memory diagnostics, or it can manage garbage collection.

All power consumed by the CPU while running the idle process (if it has no system maintenance work to do) is wasted, so if the computer has a low-power idle mode, the idle process will frequently put the system in this mode while it waits for useful work. This is not a new idea! Minicomputers of the 1970's frequently included such operating modes (with names such as halted waiting for an interrupt) because the reduced power consumption cut air conditioning costs, allowed extended running times on battery power, and reduced fuel consumption for emergency generators.

The ready processes are typically stored in a queue structure, the ready-list. This can be a simple linked-list FIFO queue, with the two basic operations: enqueue a process and dequeue a process. A FIFO queue gives a system where all processes have equal priority. If we assign priorities to processes, so the queue is no longer FIFO, the implementation is more complex, but we can arrange things so that high priority is given to processes that need better service, for example, to meet tight real-time constraints.
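
A minimal sketch of such a linked-list FIFO ready list, in C and built on the hypothetical process structure sketched earlier, might look like this; a priority version would insert each process in priority order instead of always at the tail.

   #include <stddef.h>   /* for NULL; assumes struct process from the sketch above */

   /* a simple FIFO ready list built from the process structure above */
   static struct process *ready_head = NULL;
   static struct process *ready_tail = NULL;

   void enqueue_ready(struct process *p)
   {
       p->state = READY;
       p->next = NULL;
       if (ready_tail == NULL)
           ready_head = p;
       else
           ready_tail->next = p;
       ready_tail = p;
   }

   struct process *dequeue_ready(void)
   {
       struct process *p = ready_head;
       if (p != NULL) {
           ready_head = p->next;
           if (ready_head == NULL)
               ready_tail = NULL;
       }
       return p;                  /* NULL if no process is ready */
   }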

Most systems with real-time constraints operate by assigning fixed priorities to each process, but there are alternatives. With Unix, an interesting variable priority scheme was introduced, in which the priority of a process rises when it voluntarily relinquishes the CPU, for example, to wait for I/O completion, and the priority of a process falls when it is preempted at the end of a time-slice. This favors interactive users over compute bound processes.

From a formal point of view, there are two well-known analytically tractable approaches to assigning priority to real-time processes. If each process must meet periodic deadlines, the rate-monotonic priority assignment scheme assigns the highest priority to the process with the most frequent deadlines. In Liu, C. and J. Layland, "Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment," Journal of the ACM (January 1973), it is shown that rate-monotonic scheduling will meet all deadlines for any mix of processes so long as the processes consume no more than 69 percent of the available CPU time. Liu and Layland also proposed using the deadline itself as a priority, so that the process with the earliest deadline would be scheduled first, and they showed that this scheme will meet all deadlines even if there is 100 percent CPU utilization. In the years since, several other approaches have been developed.
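
The Liu and Layland utilization bound can be checked directly: n periodic processes are guaranteed to meet their deadlines under rate-monotonic priorities if their total CPU utilization is at most n(2^(1/n) - 1), a bound that falls toward roughly 69 percent as n grows. Here is a sketch of that test in C; the task structure and function name are assumptions, and the test is sufficient but not necessary, so a task set that fails it may still be schedulable.

   #include <math.h>
   #include <stddef.h>

   struct task {
       double compute;    /* worst-case CPU time needed per period */
       double period;     /* interval between successive deadlines */
   };

   /* returns 1 if the task set is guaranteed schedulable under
      rate-monotonic priorities by the Liu and Layland bound */
   int rm_schedulable(const struct task tasks[], size_t n)
   {
       double u = 0.0;
       for (size_t i = 0; i < n; i++)
           u += tasks[i].compute / tasks[i].period;
       double bound = (double) n * (pow(2.0, 1.0 / (double) n) - 1.0);
       return u <= bound;
   }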

A Preemption Mechanism

A running process is preempted when it is forced into the ready state in order to give the CPU to some other process. This might be done, for example, by the real-time-clock interrupt service routine at the end of a time-slice in a timesharing system. This is illustrated in the following pseudocode:

   Clock interrupt service routine:
      begin
         Disable Interrupts (if not already disabled)

         -- Save process state of interrupted process
         current.pc := interrupt-return.pc
         current.registers := interrupt-return.registers

         -- Change interrupted process state to ready
         Enqueue( current, ready queue )

         -- Make new process into the current process
         current := Dequeue( ready queue )

         -- Restore process state of new process
         interrupt-return.pc := current.pc
         interrupt-return.registers := current.registers

         Enable Interrupts (if necessary)
         return
      end

The parenthetic remarks in the above code are because some interrupt hardware automatically disables all interrupts when dispatching control to an interrupt service routine. Typically, if this is the case, returning from interrupts automatically re-enables interrupts. Hardware designs differ in this area, so it is necessarily the case that the software will also differ!

Note that, in addition to involuntary preemption, many systems include provisions for a running process to voluntarily move itself from running to ready. This is sometimes called voluntarily relinquishing the CPU, and the code for this is essentially the same as that given above for preemption, except for differences in the calling sequence between interrupt service routines and procedure calls.

In old systems, clock interrupts were periodic, for example, every 1/60 of a second. Modern hardware usually includes a programmable timer, so it is possible for the scheduler to vary the length of the time-slice (interval between interrupts) depending on the behavior of the processes being scheduled.

The Berkeley Timesharing System did this by the late 1960's, giving long time slices to jobs that appeared compute bound, in an effort to minimize the context switching overhead, while giving short timeslices to jobs that appeared I/O bound.

Interrupts must be disabled during the interrupt service routine so that some other interrupt doesn't cause scheduling activity when the ready queue is halfway through being updated or when the process state is only partially saved or restored.

Semaphores

The basic implementation of a process scheduler supporting only two states, ready and running, was well understood in the early 1960's, and the extension of this to multiprocessors with dynamic process creation was an obvious step. (M. Conway, A Multiprocessor System Design, Proc. 1963 FJCC, AFIPS, Vol. 24, pages 139-146.)

What was not so obvious was how to deal with waiting processes. A common model was to simply put a waiting process into a polling loop such as:

      loop while desired condition not yet met
         relinquish
      endloop
Repeatedly polling the condition on which the process is waiting is expensive, and the process tests it only once each time it is scheduled, so it is hard to get prompt response when the condition is finally met. Furthermore, extending this to priority scheduling is very difficult.

Many bad solutions to the problem of waiting processes were used, with mixed results, before a satisfactory solution was developed, in the form of the semaphore abstraction. (E. Dijkstra, Hierarchical Ordering of Sequential Processes, Acta Informatica Vol. 1, 1971, pages 115-138.)

Unfortunately, the key publication describing semaphores came out after the kernel design for Unix had largely been completed. Many classic Unix applications use polling loops to await conditions, and the actual conditions are frequently signaled by very strange things. Semaphores were added to later versions of Unix, notably System V Unix, but large numbers of Unix applications still use ad-hoc solutions to synchronization problems -- both for compatibility with older code and because design by afterthought frequently leads to components that are poorly integrated with the system.

A semaphore is, conceptually, an object used to coordinate processes. One semaphore is needed for each condition on which some process might wait. A process can wait for an event by performing the wait operation on the semaphore associated with that event, and once a process does so, it will block until some other process signals that semaphore that the desired event has occurred. If the second process sends the signal before the first process tries to wait, the semaphore remembers this and does not block the waiting process.
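
Most modern Unix-like systems expose essentially this abstraction as POSIX semaphores. The following C sketch, assuming a POSIX threads environment, shows one thread waiting for an event that the main program signals; note that it behaves correctly whether the signal arrives before or after the wait.

   #include <pthread.h>
   #include <semaphore.h>
   #include <stdio.h>

   static sem_t event;                 /* counts signals not yet consumed */

   static void *waiter(void *arg)
   {
       sem_wait(&event);               /* blocks until the event is signaled */
       printf("event observed\n");
       return NULL;
   }

   int main(void)
   {
       pthread_t t;
       sem_init(&event, 0, 0);         /* initial count of zero: event not yet seen */
       pthread_create(&t, NULL, waiter, NULL);
       sem_post(&event);               /* signal: wakes the waiter, or is remembered */
       pthread_join(&t, NULL);
       sem_destroy(&event);
       return 0;
   }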

The classic outside view of a semaphore describes it as an integer, with the basic operations of wait and signal implemented as:

     wait( sem ) or P( sem ) or sem.wait or sem.P:
         while sem <= 0 do nothing
         sem = sem - 1
         return

     signal( sem ) or V( sem ) or sem.signal or sem.V:
         sem = sem + 1

The names P and V were Dijkstra's original names for the wait and signal operations. In Dijkstra's native language, Dutch, P stands for a word that is the cognate of the English pause, not a bad synonym for wait, but V has no easy connection to English. Nonetheless, the names P and V remain in common use. Of course, it is easy to use either procedural notation, as Dijkstra did, or object-oriented notation, viewing wait and signal as method names. This is simply a matter of syntax and has no effect on the meaning of these operations.

The pseudocode shown above expresses the semantics of the wait and signal operations, but it is not correct, nor is it efficient, since it relies on a polling loop inside the wait operation. The efficient and correct implementation of these operations rests on a more complex representation for semaphores:

      semaphore representation
         queue: a queue of processes blocked on this semaphore
         count: an integer
The count in this realistic implementation will have the same value as the integer representing the semaphore in the naive version, but if the count is zero, the queue will hold one entry for each process blocked on the semaphore. If the count is nonzero, the queue will be empty! Given this, here is an outline of the code typically used to implement semaphores:
     wait( sem ) or P( sem ) or sem.wait or sem.P:
         Disable Interrupts
         if sem.count > 0
            sem.count := sem.count - 1
         else
            -- Save process state
            current.pc := proc-return.pc
            current.registers := proc-return.registers

            -- Change state to waiting
            enqueue( current, sem.queue )

            -- Make new process current process
            current := dequeue( ready queue )

            -- Restore process state
            proc-return.pc := current.pc
            proc-return.registers := current.registers
         endif
         Enable Interrupts
         return

     signal( sem ) or V( sem ) or sem.signal or sem.V:
         Disable Interrupts
         if empty( sem.queue )
            sem.count := sem.count + 1
         else
            temp := dequeue( sem.queue );
            enqueue( temp, ready queue )
         endif
         Enable Interrupts
         return
The above implementations are correct only if all processes have equal priority. If some have a higher priority, and if a low priority process issues a signal on a semaphore for which a high priority process is waiting, then the above code lets the low priority process continue running while the high priority process waits in the ready queue. This is called a priority inversion, and although it is wrong, many systems permit this.

Exercise: Fix signal so it does not cause priority inversions, assuming that processes have an integer attribute called priority and that all queues of processes honor this priority field.