22C:116, Lecture 25, Spring 2002

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Symmetrical Multiprocessing

    If an operating system is ported to a shared-memory multiprocessor following the suggestions of the previous section, the result has some interesting properties.

    Unfortunately, modern high-performance CPUs impose a penalty on migration of processes from one CPU to another. The reason for this lies in the pervasive use of cache memory in modern CPU designs.

    When each CPU has its own cache, the contents of that cache are, in a sense, a hidden part of the process state. So long as the memory addresses the process is likely to use are in the cache, the process runs at full speed, but when the cache holds none of the process's memory addresses, the process runs very slowly until the cache is reloaded.

    Modern cache designs allow memory words belonging to multiple processes to reside in the cache simultaneously, so on a uniprocessor, there is a reasonable chance that when a high-priority process is scheduled, many of the memory references it makes will result in cache hits. On an n-CPU multiprocessor, on the other hand, the high-priority process has only a 1/n chance of running on the CPU it was most recently using, and as a result, only about a 1/n chance of finding relevant memory locations in the cache of the CPU on which it is scheduled.
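
    The 1/n figure is easy to check with a little simulation. Here is a minimal C sketch (not from the lecture; the affinity-blind random scheduler is our assumption) that repeatedly reschedules one process onto a randomly chosen CPU and counts how often it lands on the CPU whose cache it last warmed:

        /* Minimal sketch: estimate the chance that a rescheduled process
           lands on the CPU whose cache it last warmed, assuming the
           scheduler picks CPUs uniformly at random. */
        #include <stdio.h>
        #include <stdlib.h>

        #define N_CPUS 4
        #define TRIALS 1000000

        int main(void) {
            int last_cpu = 0;  /* CPU whose cache holds this process's data */
            long warm = 0;     /* times the process lands on that same CPU */

            srand(42);
            for (long t = 0; t < TRIALS; t++) {
                int next_cpu = rand() % N_CPUS;   /* affinity-blind choice */
                if (next_cpu == last_cpu)
                    warm++;
                last_cpu = next_cpu;   /* the new CPU's cache now warms up */
            }
            printf("warm fraction: %.3f (expect about %.3f)\n",
                   (double)warm / TRIALS, 1.0 / N_CPUS);
            return 0;
        }

    With N_CPUS set to 4, the printed warm fraction converges to about 0.25, matching the 1/n argument above.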

    In effect, when each CPU makes heavy use of its cache, we have an architecture with some of the characteristics of a NUMA machine.

  2. Non-Uniform Memory Access

    Non-Uniform Memory Access in a shared-memory multiprocessor implies that all memory is shared, but each CPU has fast access to only part of memory, its own local memory, and slower access to the other parts, the local memory areas of the other CPUs.

    On such a system, we must deal with a new problem: the page placement problem. When memory access is uniform, all page frames are equal, and the virtual memory system need not worry about which page is put where. On a NUMA system, we need to worry.
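
    As an illustration, here is a hypothetical C sketch of a NUMA-aware frame allocator; the per-node free lists and all of the names are our assumptions, not taken from any particular system. It prefers a frame in the requesting CPU's local memory and falls back to a remote frame only when the local node has none free:

        /* Hypothetical NUMA-aware frame allocator: prefer a free frame on
           the requesting CPU's local node, fall back to any remote node. */
        #include <stddef.h>

        #define N_NODES 4

        struct frame { struct frame *next; };

        /* one free list per memory node, i.e., per CPU's local memory */
        static struct frame *free_list[N_NODES];

        struct frame *alloc_frame(int local_node) {
            /* first choice: a frame in the requesting CPU's local memory */
            if (free_list[local_node] != NULL) {
                struct frame *f = free_list[local_node];
                free_list[local_node] = f->next;
                return f;
            }
            /* otherwise accept a slower remote frame rather than fail */
            for (int n = 0; n < N_NODES; n++) {
                if (free_list[n] != NULL) {
                    struct frame *f = free_list[n];
                    free_list[n] = f->next;
                    return f;
                }
            }
            return NULL;  /* no free frames; the caller must evict a page */
        }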

    Ideally, the pages holding the stack of each process should be stored in the local memory of the CPU running that process. Moving a process to another CPU would mean either moving those pages or paying for remote access to them, so we cannot afford to move processes around freely; instead, we need a separate scheduling queue per CPU.
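
    One plausible shape for those per-CPU queues, sketched in C (the structures and names are hypothetical, and a real kernel would add locking):

        /* Per-CPU ready queues; each CPU dispatches only from its own
           queue, so a process keeps running where its pages are local.
           A real kernel would protect each queue with a spinlock. */
        #include <stddef.h>

        #define N_CPUS 4

        struct process {
            struct process *next;
            int home_cpu;   /* CPU whose local memory holds its pages */
        };

        struct ready_queue { struct process *head, *tail; };

        static struct ready_queue ready[N_CPUS];

        void enqueue(struct process *p) {           /* make p ready */
            struct ready_queue *q = &ready[p->home_cpu];
            p->next = NULL;
            if (q->tail != NULL) q->tail->next = p; else q->head = p;
            q->tail = p;
        }

        struct process *dispatch(int this_cpu) {    /* pick next to run */
            struct ready_queue *q = &ready[this_cpu];
            struct process *p = q->head;
            if (p != NULL) {
                q->head = p->next;
                if (q->head == NULL) q->tail = NULL;
            }
            return p;   /* NULL means this CPU idles or must steal work */
        }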

    With uniform memory access, shared code was very common. On a NUMA machine, however, programs will run faster if the code being executed is stored locally, and this means that sharing one copy of the code among processes on different CPUs is not a good idea. Ideally, then, the code and stack for each process will be in local memory, along with the ready list for that CPU.
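
    One way to keep code local without giving up a single master copy is to replicate read-only code pages, one copy per node that executes them. The following C sketch is purely illustrative; the allocator stand-in and all of the names are assumptions:

        /* Replicate a read-only code page into each node's local memory
           on first use there; later faults on that node reuse the copy.
           malloc stands in for a real per-node frame allocator. */
        #include <stdlib.h>
        #include <string.h>

        #define N_NODES   4
        #define PAGE_SIZE 4096

        static void *alloc_local_page(int node) {
            (void)node;               /* stand-in: a real kernel would   */
            return malloc(PAGE_SIZE); /* allocate from node-local memory */
        }

        /* per-node copies of one logical code page; NULL until needed */
        static void *replica[N_NODES];

        void *map_code_page(int node, const void *master) {
            if (replica[node] == NULL) {
                replica[node] = alloc_local_page(node);
                if (replica[node] != NULL)
                    memcpy(replica[node], master, PAGE_SIZE);
            }
            return replica[node];     /* local copy, made once per node */
        }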

    This creates a new problem, the load balancing problem. When a new process is created, on what CPU should it be placed? If we place it on the CPU of the process that created it, that CPU will carry an added load, but we will not have to pay the cost of moving code and data to the local memory of a different CPU.
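
    A simple placement heuristic, sketched in C, captures this trade-off: keep the child on its parent's CPU unless the load imbalance exceeds a threshold that justifies the cost of copying. The load measure and the threshold value are assumptions:

        /* Place a new process: stay on the parent's CPU unless another
           CPU is enough less loaded to justify copying code and data. */
        #define N_CPUS 4
        #define MIGRATE_THRESHOLD 2   /* tolerated local imbalance */

        static int load[N_CPUS];      /* e.g., ready-queue length per CPU */

        int place_new_process(int parent_cpu) {
            int best = parent_cpu;
            for (int c = 0; c < N_CPUS; c++)
                if (load[c] < load[best])
                    best = c;
            /* stay local unless the imbalance justifies the copying cost */
            if (load[parent_cpu] - load[best] <= MIGRATE_THRESHOLD)
                return parent_cpu;
            return best;
        }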

    When a CPU becomes overloaded, we have to ask which processes we should move to a different CPU in order to balance the load, and which pages we should duplicate or move in order to move each process.
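
    For the first question, one plausible heuristic follows from the observation that every page the migrating process holds in local memory must be duplicated or moved: pick the ready process with the fewest resident pages as the cheapest one to move. The process-table layout in this C sketch is illustrative:

        /* Choose a migration victim on an overloaded CPU: the ready
           process with the fewest locally resident pages is the one
           whose pages are cheapest to duplicate or move. */
        #include <stddef.h>

        struct process {
            struct process *next;
            int resident_pages;  /* pages now in this CPU's local memory */
        };

        struct process *pick_migration_victim(struct process *ready_head) {
            struct process *victim = NULL;
            for (struct process *p = ready_head; p != NULL; p = p->next)
                if (victim == NULL || p->resident_pages < victim->resident_pages)
                    victim = p;
            return victim;   /* NULL only if the ready list was empty */
        }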