Implementing Scheduling Services

Part of CS:3620, Operating Systems Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Alternative Scheduling Policies

Scheduling decisions may be made with the cooperation of the processes being scheduled, for example, when a process voluntarily relinquishes a processor, but on many systems, the scheduler can also preemptively change a process from running to ready in order to run some other process; such schedulers are called preemptive schedulers. With a preemptive scheduler, the interval allotted to a process between the time it is scheduled and the time it is preempted is called a time slice. The choice of whether or not to preempt running processes, and if so, the length of the time slice to use, is a policy decision which depends on the demands of the application. For example, if efficiency of processor utilization is a primary concern, the overhead of preemption makes it undesirable; after all, with each preemption, the scheduler must devote some processor time to scheduling decisions. On the other hand, if interactive response time is a concern, preemption is essential. A very short time slice allows fast response, but it also requires frequent, and possibly time-consuming, scheduling decisions.

When a processor becomes available, some ready process must be scheduled. How this ready process is selected is a second policy issue, and again, the particular policy depends on the application. The round robin policy is the simplest; it schedules the process which has been ready the longest. The round robin policy is unquestionably fair, but fairness in scheduling is not necessarily desirable. All of the alternatives to round robin scheduling are based on some assignment of priorities to the ready processes. For example, each process might have a fixed priority based on the response time expected for that process, or priorities might be varied depending on resource usage.

The idiomatic phrase round robin has been part of the English language for centuries; it originally appears to have referred to a petition in which all names were written in a circle so that the recipient couldn't tell who had signed first. Today, it usually refers to a letter circulated among a group of correspondents; on receiving the letter, each correspondent may read everything the others have written before adding to it (and typically also deleting their previous contribution). E-mail and cheap photocopies have conspired to make round-robin letters rare, but some still circulate.

Although some priority scheduling policies are based on the assumption that each process has a distinct priority, most must have some rule for breaking ties when processes have equal priority. Equal priorities are quite common; for example, a time-sharing system may put all interactive user processes at one priority (sometimes called the foreground priority), and all non-interactive processes at a lower priority (sometimes called the background priority). When equal priorities are possible, ties should be broken in such a way that processes of equal priority are treated fairly. Random selection is usually considered fair, but it is sometimes desirable to use round robin scheduling of processes within each priority class.

Variable process priorities are common in time sharing systems. For example, each process might be assigned a priority inversely proportional to the fraction of the available processor time that process used in the recent past. This kind of priority rule favors processes which are not computationally intensive while penalizing those which are consuming an inordinate share of the processor time. As a result, interactive processes which spend large amounts of time waiting for input from users will have a high priority and a short response time, independently of whether there are compute bound processes consuming most of the available processor time.
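
As a concrete illustration, priorities of this kind can be recomputed periodically from an exponentially decayed usage average. The following is a minimal sketch of such a rule; the structure and field names here are hypothetical, not part of the scheduler developed later in these notes:

/* a minimal sketch of a decayed-usage priority rule; the names
   here are illustrative, not part of any particular system */
struct proc {
        int ticks_used;    /* clock ticks consumed since the last pass */
        int recent_usage;  /* exponentially decayed usage average */
        int priority;      /* larger values are scheduled sooner */
};

#define MAX_PRIORITY 100

void recompute_priority( struct proc * p, int ticks_per_pass )
/* call this for every process on each pass, e.g. once per second */
{
        p->recent_usage = p->recent_usage / 2 + p->ticks_used;
        p->ticks_used = 0;

        /* recent_usage approaches 2 * ticks_per_pass for a process
           that monopolizes the processor, so heavy users end up with
           priorities near zero while processes that spend their time
           waiting for input stay near MAX_PRIORITY */
        p->priority = MAX_PRIORITY
                    - (MAX_PRIORITY * p->recent_usage)
                      / (2 * ticks_per_pass);
}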

In process control systems, for example, the computer and software controlling a reaction in a chemical plant or the flaps of an airplane, it is common to assign fixed priorities to processes on the basis of how frequently they must operate. Consider a system with two processes, one which must check a thermometer every second in order to adjust a heater, and one which must check the fluid level in a tank every minute in order to adjust a pump. Each process attempts to hold some quantity constant, but one of these quantities, the temperature, is more critical than the other, the fluid level. Clearly, the temperature control process should have the higher priority.

Process control systems where priorities are assigned proportionally to the frequency with which processes perform key services are said to be based on a rate monotonic priority assignment. Rate monotonic systems work, but when more than about 60 percent of the processor time is used, on the average, there is a risk of occasionally missing a critical deadline. This problem can be overcome by identifying specific deadlines for each process which specify the times at which key actions must be performed. The priority of each process in such a deadline system is determined by the deadlines themselves so that the process with the most imminent deadline always has the highest priority. Each time a process finishes a key action, it reports this to the scheduler by posting a new deadline indicating the time by which it must complete its next key action.
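
In a deadline-based system of this kind, the scheduling comparison itself is tiny: the earliest deadline always wins. The following sketch uses hypothetical names to show the comparison and the posting of a new deadline after each key action:

/* a minimal sketch of earliest-deadline-first priorities;
   the names here are hypothetical */
struct task {
        unsigned long deadline;  /* time by which the next key
                                    action must be completed */
};

int higher_priority( struct task * a, struct task * b )
/* the task with the nearer deadline always has priority */
{
        return a->deadline < b->deadline;
}

void post_deadline( struct task * t, unsigned long now,
                                     unsigned long period )
/* a task calls this each time it finishes a key action */
{
        t->deadline = now + period;
}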

Although most studies of scheduling in computer systems focus on the role of a centralized scheduler, it is possible to view scheduling in a broader context that includes decentralized scheduling decisions. For example, most coroutine-based systems may be viewed as multiple process systems in which the scheduling decisions are decentralized and non-preemptive. Each coroutine instance can be viewed as a process, and the resume statements which transfer control between coroutine instances can be viewed as combining a relinquish operation with an explicit statement of which process to schedule next.

An interrupt mechanism can be considered to be a decentralized preemptive scheduler. It is preemptive because the program running at the time an interrupt occurs has no control over the occurrence of that event; the central processor is, in effect, snatched away from the user program and given to the interrupt handler by the interrupt hardware. It is decentralized because decisions to request interrupts are made independently by each input/output device, without consulting other devices. Note that a centralized decision may be made in the rare event that multiple devices simultaneously request interrupts, typically on the basis of hardware priority.

Implementing a Non-Preemptive Round Robin Scheduler

Before discussing the implementation of schedulers in the general context, with multiple processors and preemption, we will examine a simple scheduler on a uniprocessor with no interrupt mechanism. This avoids all discussion of asynchronous events, while allowing the conceptual structure of the scheduler to be thoroughly examined. This may be viewed as an exercise in simulating multiple processes using coroutines, but the data structures used here form the basis of most schedulers, so little is lost by such simulation.

The data structures needed by a non-preemptive scheduler rest on a representation for each thread:

#include <setjmp.h>     /* defines jmp_buf, _setjmp() and _longjmp() */

#define thread_null ((struct thread *) 0)  /* the null thread pointer */

struct thread {
        struct thread * next;   /* used to link threads into queues */
        int size;               /* the size of the thread stack */
        jmp_buf state;          /* the state of the thread */

        /* The stack for the thread follows here */

};

This includes a next field used to link threads into a list, for example, a list of ready threads or a list of threads waiting on a particular semaphore; the constant thread_null serves as the null value for such links. The most important thing it contains, however, is the state of the thread, a copy of all of the registers in the thread. Here, we use an obscure feature of the C standard library, the type jmp_buf from <setjmp.h>. This is either a copy of all of the registers, or a copy of the program counter and stack pointer after the registers have been pushed on the stack -- different C implementations vary regarding how this is done, but the effect is the same. Finally, of course, each thread must have its own stack to store local variables. This can be allocated as an extension to the thread data structure.

We need several FIFO queues of threads. These are linked through the next field of each thread, and each queue needs a head and tail pointer. Given this, we can declare a set of basic queue routines for threads:

struct thread_queue {
        struct thread * head;
        struct thread * tail;
};

static void thread_enqueue( struct thread * t, struct thread_queue * q )
/* enqueue t on q */
{
        t->next = thread_null;
        if (q->head == thread_null) {
                q->head = t;
                q->tail = t;
        } else {
                q->tail->next = t;
                q->tail = t;
        }
}

static struct thread * thread_dequeue( struct thread_queue * q )
/* dequeue and return a thread from q */
{
        if (q->head == thread_null) {
                return thread_null;
        } else {
                struct thread * t;
                t = q->head;
                q->head = t->next;
                return t;
        }
}

We are now ready to declare the central data structure of our thread scheduler, the ready list. At each instant, there is also a current running thread, and we need to keep a pointer to it:

static struct thread_queue readylist;  /* the list of all ready threads */
static struct thread * current;        /* the current running thread */

In a non-preemptive or cooperative scheduler, the central scheduling service is relinquish, the operation a thread calls when it wishes to voluntarily turn over control to some ready thread. This operation is trivial: Save the state of the current thread, enqueue it on the ready list, dequeue a ready thread, and restore the state from that thread. In C, we can do this as follows:

void thread_relinquish()
/* call this within a thread to allow scheduling of a new thread */
{
        if (_setjmp( current->state ) == 0) {
                thread_enqueue( current, &readylist );
                current = thread_dequeue( &readylist );
                _longjmp( current->state, 1 );
        }
}

This uses the obscure _setjmp() C library routine to save the state in the jump buffer of the current thread. This routine (like fork()) returns two different values: zero when it returns from an honest call, and nonzero when it returns from a longjmp via the saved state in the jump buffer. Therefore, on a call to relinquish, the current thread, with its recently saved state, is placed on the ready list, a new thread is dequeued from the ready list, and then control passes to the new thread by a long jump. The intended use of these obscure library routines is for exception handling:

jmp_buf exception;
int why;
int cause = 1;  /* some nonzero exception code */
...
if ((why = setjmp(exception)) == 0) { /* try block */
        ...
        /* the following could be in a called routine */
        longjmp( exception, cause ); /* throw exception */
        ...
} else if (why == cause) { /* exception handler for cause */
        ...
} else { /* exception handler for all other causes */
        ...
}

Implementing Semaphores

Semaphores can be implemented using a queue of waiting processes plus an integer representing the semaphore's count.

struct thread_semaphore {
        int count;
        struct thread_queue queue;
};

The operations on semaphores simply increment or decrement the count in the case when the semaphore operation does not result in a process waiting or in some other process being scheduled. When there is scheduling activity, the count does not change:

void thread_wait( struct thread_semaphore * s )
/* call this within a thread to block on a semaphore */
{
        if (s->count > 0) { /* an earlier signal means this wait need not block */
                s->count--;
        } else {
                if (_setjmp( current->state ) == 0) {
                        thread_enqueue( current, &s->queue );
                        current = thread_dequeue( &readylist );
                        if (current == thread_null) { /* crisis -- no ready thread; see below */ }
                        _longjmp( current->state, 1 );
                }
        }
}

void thread_signal( struct thread_semaphore * s )
/* call this within a thread to signal a semaphore */
{
        struct thread * t = thread_dequeue( &s->queue );
        if (t == thread_null) { /* no process waiting for this signal */
                s->count++;
        } else {
                thread_enqueue( t, &readylist );
        }
}

Note that there is an unsolved problem inside the code given above: In the event that there is no ready thread at the time a thread is blocked by a wait operation on a semaphore, the wait operation faces a crisis. It must transfer control to some ready thread, but there is no ready thread.

Typically, at this point, we must terminate the application. If this ever happens in low level operating system code, the result is typically a system freeze, an automatic restart, or the "blue screen of death". In multithreaded application code, the result is usually an error message suggesting that the application has deadlocked.

A deadlock condition occurs when all of the surviving threads in any system are blocked. Since it takes a live thread to send the signal to unblock a thread, deadlocks, when they occur, are fatal.

Priority Scheduling

Assuming that the queues used to implement the scheduler shown in the previous section are first-in first-out queues, the scheduler will implement a round robin scheduling policy. Priority scheduling can easily be done by replacing the first-in first-out queues with priority queues. The simplest implementation of a priority queue stores the list of processes in each queue in order, sorted by priority. In this case, the enqueue operation would perform a linear search of the queue to insert new entries in the right place, and the dequeue operation would remove the head item each time.
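
The following sketch shows such a sorted-insertion enqueue for the thread queues used earlier. It assumes a priority field has been added to struct thread, with larger values served first; with this enqueue in place, thread_dequeue needs no change:

static void thread_enqueue_priority( struct thread * t,
                                     struct thread_queue * q )
/* enqueue t in priority order; this sketch assumes an added priority
   field in struct thread, with larger values served first */
{
        if (q->head == thread_null
        ||  t->priority > q->head->priority) { /* insert at the head */
                t->next = q->head;
                q->head = t;
                if (t->next == thread_null) q->tail = t;
        } else { /* search for the last entry outranking t */
                struct thread * prev = q->head;
                while (prev->next != thread_null
                   &&  prev->next->priority >= t->priority) {
                        prev = prev->next;
                }
                t->next = prev->next;
                prev->next = t;
                if (t->next == thread_null) q->tail = t;
        }
}

Note that new entries are inserted behind existing entries of equal priority, so threads within one priority class get the round robin treatment suggested earlier.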

It should be noted that a simple linear list implementation of priority queues is optimal for queues of fewer than about 10 elements, but for queues of more than about 30 elements, other implementations are likely to be much faster. The problem is that insertions in a sorted linear list take a time proportional to the number of items in the list, while data structures such as the heap used in heapsort allow priority queue operations to be done in times proportional to the log of the number of items in the queue.
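
For comparison, a binary heap keyed on the same hypothetical priority field can be sketched in a few dozen lines; both operations below take time proportional to the log of the queue length:

#define HEAP_MAX 64    /* illustrative fixed capacity */

static struct thread * heap[HEAP_MAX];
static int heap_size = 0;

static void heap_enqueue( struct thread * t )
/* insert t, keeping the largest priority at the root;
   assumes heap_size < HEAP_MAX */
{
        struct thread * tmp;
        int i = heap_size++;
        heap[i] = t;
        while (i > 0) { /* sift the new entry up toward the root */
                int parent = (i - 1) / 2;
                if (heap[parent]->priority >= heap[i]->priority) break;
                tmp = heap[parent]; heap[parent] = heap[i]; heap[i] = tmp;
                i = parent;
        }
}

static struct thread * heap_dequeue()
/* remove and return the thread with the largest priority */
{
        struct thread * t, * tmp;
        int i, left, right, big;
        if (heap_size == 0) return thread_null;
        t = heap[0];
        heap[0] = heap[--heap_size]; /* move the last entry to the root */
        i = 0;
        for (;;) { /* sift the relocated entry down */
                left = 2*i + 1; right = 2*i + 2; big = i;
                if (left < heap_size
                &&  heap[left]->priority > heap[big]->priority)
                        big = left;
                if (right < heap_size
                &&  heap[right]->priority > heap[big]->priority)
                        big = right;
                if (big == i) break;
                tmp = heap[big]; heap[big] = heap[i]; heap[i] = tmp;
                i = big;
        }
        return t;
}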

There are a number of very sophisticated priority queue implementations that are even faster than the heaps used in heapsort, but it should be noted that the average ready-queue length on most systems, even large timesharing systems with hundreds of users, is not particularly great. Most of the time, most processes on most systems are blocked waiting for input or output operations to complete. As a result, the linear list implementation is usually the most appropriate.

There is some disagreement about whether semaphore queues should be sorted by priority when priority scheduling is used. Some definitions of semaphores require that first-in first-out queues be used, but what is really required is a queue which guarantees that every waiting process eventually reaches the head of the queue. If priority queues are used for semaphores and processes have fixed priorities, this property cannot be guaranteed unless it can be shown that the queue will occasionally be empty. If processes have priorities that rise as they wait, or if priorities are assigned based on the deadlines of the waiting processes, priority queues pose no problems, and indeed, they should be used on semaphores as well as for the ready-queue.

Some older priority schedulers are constructed on a radically different plan from the one outlined in the previous section. These are based on a master process list which never changes, except when processes are created or deleted. The master process list is sorted in order of priority, and each process has a flag indicating its state. Whenever it is time to schedule a new process, the master process list is searched in order of descending priority for a ready process. The primary advantage of this approach is that it makes it quite easy to find all processes in the system. Unfortunately, there are a number of problems with this approach. If the master process list becomes long, the operation of scheduling a new process can become quite expensive. If random or round robin scheduling of tasks with equal priority is desired, this requires a complex system to either reorder the master process list or remember which process was most recently run at each priority level.
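
A sketch of the master-list approach, with hypothetical names, makes the cost clear: every scheduling decision is a linear scan of the entire list:

/* a minimal sketch of a master process list scheduler;
   the names here are hypothetical */
enum process_state { READY, RUNNING, BLOCKED };

#define NPROC 16
struct process {
        enum process_state state;
        /* saved registers, stack pointer, etc. would follow here */
};

struct process master[NPROC];  /* fixed list, sorted in order of
                                  descending priority */

struct process * pick_next()
/* scan the master list for the highest priority ready process */
{
        int i;
        for (i = 0; i < NPROC; i++) {
                if (master[i].state == READY) return &master[i];
        }
        return 0;  /* no ready process -- the crisis discussed above */
}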

Oversimplification

The C code given above for a thread scheduler is taken from real working code, but it omits huge and crucial pieces. The problem of launching a thread is not trivial. It requires knowledge of the internal structure of the jump buffer data structure in order to create a jump buffer that properly positions a thread to execute in a new stack. Unfortunately, the structure of a jump buffer changes significantly from one processor to the next and from one implementation of the standard library to the next. As a result, the initialization code for threads is very difficult to write in a portable form.

This situation is different in assembly language. Writing these services in assembly language involves straightforward code to save and restore registers in the thread data structure. A full thread scheduler, including thread initialization and termination, the relinquish operation, and semaphores, takes only a few hundred lines of well documented assembly code.

The only exception to this is on very small processors such as the Microchip PIC family of processors, where it is not always possible to save and restore the CPU state. On these machines, there is simply no good way to write multithreaded code.

Preemptive Scheduling

The data structures used by preemptive schedulers do not differ greatly from those illustrated previously for non-preemptive schedulers. The key difference is in the code itself. Preemptive schedulers rely on a real-time-clock to request an interrupt whenever it is time to preempt one process in order to schedule another. The simplest real-time-clock hardware requests an interrupt each time a set interval has elapsed. For example, many older machines had a power line clock which requested an interrupt 60 times a second, using the 60 cycle per second alternating current from the power line as a time base.

In Europe, power lines run at 50 cycles per second, so power line clocks on that continent run somewhat slower. Whether the power line is 50 or 60 cycles per second, it is worth noting that utility companies make a major effort to keep the frequency of the electric power they deliver constant; in the United States, the interconnected electric utility companies maintain an accuracy over the long run of better than 1 part in a million, so systems using power-line-clock hardware are generally quite accurate.

More complex systems have what are called programmable real-time clocks, that allow the interrupt frequency to be set by software, or programmable interval timers, that allow a register to be loaded with a time interval, and produce an interrupt when that amount of time elapses. Yet others have time of day clocks which report the current time of day and include an alarm register so that an interrupt can be requested at a specific time.

The typical IBM PC compatible has three programmable interval timers, provided by hardware compatible with the old Intel 8253 chip. This chip contains three 16-bit counters that are decremented every 0.838 microseconds (1.19318 MHz). These counters can be used as programmable interval timers or programmable real-time clocks, as well as for several other functions. Counter zero is wired to request an interrupt when it expires, while counters 1 and 2 were traditionally used for other purposes (dynamic RAM refresh and crude tone generation).
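
As a hedged sketch (not code from any particular operating system), counter zero can be programmed as a periodic interrupt source roughly as follows. Here outb() stands for a hypothetical port-output helper, taking the port number first; the real name and argument order of such a helper vary between compilers and operating systems:

#define PIT_HZ  1193182         /* counter input, cycles per second */
#define PIT_CMD 0x43            /* 8253 control port */
#define PIT_CH0 0x40            /* counter 0 data port */

void set_clock_rate( unsigned int interrupts_per_second )
/* program counter 0 as a rate generator (mode 2) */
{
        unsigned int divisor = PIT_HZ / interrupts_per_second;

        outb( PIT_CMD, 0x34 );           /* counter 0, load low byte
                                            then high byte, mode 2 */
        outb( PIT_CH0, divisor & 0xFF );        /* low byte */
        outb( PIT_CH0, (divisor >> 8) & 0xFF ); /* high byte */
}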

The classic Microsoft DOS operating system made only minimal use of the real-time clock, setting it to request an interrupt every 65536 clock cycles, which comes to 18.206 interrupts per second. This interrupt source was used to count time, in units of 1/18.206 of a second, counting from midnight, and software conversion routines converted this count to the time of day in hours, minutes and seconds.
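
The arithmetic involved is simple enough to sketch. Since each tick represents 65536 cycles of the 1.19318 MHz time base, a tick count since midnight converts to the time of day as follows (illustrative code, not DOS's own):

void ticks_to_hms( unsigned long ticks, int * h, int * m, int * s )
/* convert ticks since midnight to hours, minutes and seconds */
{
        /* each tick is 65536 cycles of the 1193182 Hz time base */
        unsigned long seconds =
                (unsigned long)( (ticks * 65536ULL) / 1193182 );

        *h = seconds / 3600;
        *m = (seconds % 3600) / 60;
        *s = seconds % 60;
}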

Linux and other modern Unix variants assume that a programmable interval timer is available and then program it to interrupt exactly 100 times a second. This clock is used to maintain a time-of-day value and it is used for process scheduling.

The typical PC compatible also contains a second real-time clock, running off a battery. This clock counts time in units of years, months, days, hours, minutes and seconds, and typically, one of the things the operating system does as it starts is to read this clock and use it to set the time and date clock that is maintained, from then on, by the programmable interval timer.

On rare occasions, if you keep a computer long enough, the battery backing the time-of-day clock will run down enough that the time-of-day clock loses its memory. When this happens, the system may set the time of day to a random value when it begins. I once had a system come up claiming that the date was some time in 1925.

However the real-time-clock hardware comes to produce an interrupt, on systems where this interrupt is used for scheduling, the real-time-clock interrupt service routine must perform essentially the same operations as are performed by the relinquish service of a non-preemptive scheduler. Thus, the clock interrupt service routine must save the program counter and other registers of the interrupted process, put that process on the ready queue, and then take a ready process from the ready queue and return to it as if it were the interrupted process. There is no way for any particular process to tell when it will be interrupted and preempted in order to allow another process to run. As a result, it is possible that any number of actions taken by other processes will occur between any two consecutive instructions executed by any particular process.
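
Under generous assumptions, such a handler can be sketched in terms of the non-preemptive code given earlier. This is only a sketch: as the Oversimplification section warned, portable C gives no reliable way to perform a context switch from inside a real interrupt handler, and acknowledge_clock_interrupt() is a hypothetical stand-in for whatever acknowledgement the clock hardware requires:

void clock_interrupt_service_routine()
/* a sketch of preemption built from the relinquish logic above */
{
        acknowledge_clock_interrupt();  /* hypothetical hardware ack */

        /* the same steps as thread_relinquish: park the interrupted
           thread on the ready list and jump into some ready thread */
        if (_setjmp( current->state ) == 0) {
                thread_enqueue( current, &readylist );
                current = thread_dequeue( &readylist );
                _longjmp( current->state, 1 );
        }
        /* control arrives here when this thread is next scheduled;
           the normal interrupt return sequence then resumes it */
}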

It should be noted that the scheduler interrupt service routine (the clock interrupt service routine) is unlike any other interrupt service routine on most systems. All other interrupt service routines can usually be viewed as procedures which are implicitly called by the interrupt hardware. When any other interrupt service routine has completed its job, it returns to the point in the code where the processor was executing when the interrupt occurred. The scheduler interrupt service routine is called the same way any other interrupt service routine is called, and it returns using the same return sequence, but it does not return to the point of the call because it has performed a context switch, allowing it to return to the point in some other process where that process was most recently interrupted.