22C:116, Lecture 33, Fall 1998

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Load Balancing

    When a system contains multiple processors, a new question arises: How does the system maintain roughly equal loads on all the processors? If the loads are unevenly distributed, the performance of the most heavily loaded processor may end up limiting the performance of the entire system.

    On systems with shared main memory, such as the Encore Multimax or the SGI Onyx, this problem is easy to solve. If there is a single ready list and each processor seeks work from that ready list whenever it becomes idle, the load will balance without any computational effort to optimize which processor runs what.
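
    As a minimal sketch of this idea (using Python threads to stand in for processors and callables to stand in for processes -- none of this is code from the systems named above), each idle processor simply pulls the next ready process from one shared queue:

        import queue
        import threading

        ready_list = queue.Queue()              # the single shared ready list

        def processor(cpu_id):
            # Each processor simply takes the next ready process when it goes idle.
            while True:
                process = ready_list.get()      # blocks while this processor is idle
                if process is None:             # sentinel: shut this processor down
                    return
                process()                       # "run" the process

        cpus = [threading.Thread(target=processor, args=(i,)) for i in range(4)]
        for c in cpus:
            c.start()

        for n in range(8):                      # offer some work; any idle CPU takes it
            ready_list.put(lambda n=n: print("job", n, "done"))
        for _ in cpus:
            ready_list.put(None)                # one shutdown sentinel per processor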

    This is only easy because shared memory makes the cost of moving a program from one processor to another negligible -- it is just the cost of a context switch. On a machine without shared memory, programs may be difficult to move, so moves must be deliberately planned. Thus, the problem of load balancing in a distributed system has several components:

    First, when is load balancing applied? Is it applied only when processes are created, or are processes also moved after they are created?

    Second, how is the decision made about which machine to use? Global information about system load is hard to get, so most load balancing schemes rely on heuristics to locate idle computing capacity and to move work away from overloaded machines.

    If load balancing algorithms are applied only at the time of process creation, long-lived processes that have bursty patterns of resource utilization pose problems. These problems can be solved by breaking such processes into sequences of smaller, shorter-lived processes, one per phase of the longer computation. As the long `logical process' changes from one phase to another, a new process is created, allowing the process creation mechanism to select a machine appropriate for that phase.

    Third, how is a process selected for migration? If the cost of migration exceeds the performance benefit, it clearly should not be undertaken! Usually, the cost of migration can be estimated in advance, but the benefit depends on the future behavior of the process being moved, and this requires some kind of prediction.

  2. Load Balancing Algorithms

    Given that total information about the loading of a distributed system is unlikely to be available because gathering such information is expensive, most load balancing algorithms rely on limited or local information about the state of the system. As a result, these algorithms are generally viewed as using load balancing heuristics.

    Typically, the software running on each machine can compute a numeric measure of the local load. This may be as simple as a three-valued variable indicating that the local state is overloaded, acceptably loaded, or underloaded, or it may be as complex as a vector indicating the local CPU load, the local memory load, and the local I/O load.
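
    As an illustration only (the field names and thresholds here are invented for this sketch, not taken from any particular system), the two styles of load metric might be represented as follows:

        from dataclasses import dataclass
        from enum import Enum

        class LoadState(Enum):              # the simple three-valued metric
            UNDERLOADED = 0
            ACCEPTABLE = 1
            OVERLOADED = 2

        @dataclass
        class LoadVector:                   # a richer, multi-dimensional metric
            cpu: float                      # fraction of CPU capacity in use
            memory: float                   # fraction of memory in use
            io: float                       # fraction of I/O bandwidth in use

            def summary(self) -> LoadState:
                # Collapse the vector to the three-valued form; the 0.3 and 0.8
                # thresholds are arbitrary choices made for this example.
                worst = max(self.cpu, self.memory, self.io)
                if worst < 0.3:
                    return LoadState.UNDERLOADED
                if worst < 0.8:
                    return LoadState.ACCEPTABLE
                return LoadState.OVERLOADED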

    The problem the load balancing algorithm must face is to distribute this information over the net so that, when the time comes to move a process or when the time comes to create a new process, the information necessary to keep the load reasonably balanced is available to the right machines.

    Typically, each machine on the net only communicates with a limited subset of the others; this is necessary to allow the algorithm to scale well as the net grows. For the purpose of this discussion, we will refer to this subset, for any particular machine, as the set of neighbors of that machine.

    There are two basic schemes for distributing the loading information: demand driven and supply driven. Here, we speak of the product being supplied or demanded as excess computational power -- the terms supply and demand are reversed if the product under discussion is surplus processes! In a demand driven scheme, when a machine is ready to create a new process or export a process, it checks with its neighbors to determine their state, then compares the replies and creates the new process on, or exports a process to, the neighbor that has the greatest available surplus of processing power.
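
    A sketch of the demand-driven choice, assuming a query_load function that stands in for a kernel-to-kernel request/reply exchange (the names here are invented for the sketch):

        def pick_demand_driven(neighbors, query_load, local_load):
            # Poll each neighbor for its current load; return the machine
            # (possibly this one) with the greatest surplus of processing
            # power, i.e. the smallest reported load.
            best_machine, best_load = "local", local_load
            for machine in neighbors:
                load = query_load(machine)      # one request/reply per neighbor
                if load < best_load:
                    best_machine, best_load = machine, load
            return best_machine

        # Example, with loads measured as ready-list lengths:
        neighbor_loads = {"B": 2, "C": 7, "D": 1}
        print(pick_demand_driven(neighbor_loads, neighbor_loads.get, local_load=5))  # -> D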

    In a supply driven scheme, machines that judge themselves underloaded advertise for work, sending messages to their neighbors and requesting work. Each machine maintains a list of advertisements it has recently received, and when it needs to create a new process or export a process, it uses this list to select a destination machine.
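
    The supply-driven case can be sketched as a small cache of recent advertisements (again, the class, its names, and the ten-second lifetime are assumptions made for illustration, not part of any particular system):

        import time

        class AdvertisementList:
            # Remembers which neighbors have recently advertised spare capacity.
            def __init__(self, lifetime=10.0):
                self.lifetime = lifetime    # seconds an advertisement stays valid
                self.ads = {}               # machine -> (advertised load, arrival time)

            def receive(self, machine, load):
                # Called when an "I am underloaded" advertisement arrives.
                self.ads[machine] = (load, time.time())

            def pick_destination(self):
                # Return the least-loaded recent advertiser, or None to stay local.
                now = time.time()
                fresh = {m: load for m, (load, when) in self.ads.items()
                         if now - when < self.lifetime}
                return min(fresh, key=fresh.get) if fresh else None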

    Assuming that process migration is possible, how does a system decide when to export a process? When a machine creates a new process, how does it decide whether to create it locally or on a remote machine? Typically, if the machine is underloaded, it will ignore the migration issue. If a machine is acceptably loaded, it will try to create new processes elsewhere, and if it is overloaded, it will try to export processes.

    If there is a loading metric, a machine can compare its load with the loads of each of its neighbors to determine the identity of the least loaded machine. If the loading metric has multiple dimensions, for example, memory, processor, and I/O, machines must compare the load characteristics (measured or anticipated) of each process with the metric for each machine in order to determine which process is the best candidate to move to lower the overall loading on the whole system.
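
    One plausible heuristic (purely illustrative, not taken from any published system) scores each candidate process by how much it relieves the dimensions that are most loaded locally, while rejecting any move that would overload the destination:

        def best_candidate(process_demands, local_load, remote_load):
            # process_demands: {name: {"cpu": ..., "memory": ..., "io": ...}}
            # local_load, remote_load: {"cpu": ..., "memory": ..., "io": ...}, each 0..1
            def score(demand):
                # Relief is larger when the process uses resources that are heavily
                # loaded here; a move that would overload the destination scores -inf.
                if any(remote_load[r] + demand[r] > 1.0 for r in demand):
                    return float("-inf")
                return sum(demand[r] * local_load[r] for r in demand)
            return max(process_demands, key=lambda p: score(process_demands[p]))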

    This clearly indicates that the more complex the local metric, the more difficult the migration problem becomes!

    In work published by Harchol-Balter and Downey (ACM Trans on Computer Systems, Aug 1997, V. 15, N. 3, page 253) a very simple metric is shown to be remarkably effective for machines with equal CPU speeds and memory resources -- the length of the ready list. A machine with a short ready list is lightly loaded, while a machine with a long ready list is heavily loaded.

  3. Selecting a Process to Move

    In work published by Harchol-Balter and Downey (ACM Trans on Computer Systems, Aug 1997, V. 15, N. 3, page 253) an important empirical observation is made about process lifetimes. Based on an examination of large numbers of process histories on UNIX systems, it is observed that a process that has survived to consume one second of CPU time has roughly a 50% chance of surviving to consume another second of CPU time. This holds for any time interval -- a process that has survived one minute has a 50% probability of surviving another minute -- and therefore, this result is independent of CPU speed!

    This result should have been predicted years ago, when virtual memory was the subject of intensive study! The reason is that it is a natural consequence of the locality principle! The locality principle states that the history of execution of a particular piece of code is a good predictor of the future need for that piece of code. A piece that has been heavily used is likely to be heavily used again, and a piece that has been lightly used is unlikely to be needed soon. In the studies of virtual memory back in the 1960's, the pieces of code in question were procedures and functions, but in our current discussion, the pieces are entire processes. Other than this change in scale, the basic issues are very similar!

    The consequence of this observation is simple. Given an estimate of the time taken to move a process, and given that the previous CPU time consumed by a process is a reasonable predictor of its future CPU demand, a process is a good candidate for migration if the expected reduction in its completion time from moving it to a machine with a shorter ready list is greater than the time taken to move the process.

    Assuming round-robin scheduling on uniprocessors that are otherwise equal, the execution speed of a process will be proportional to 1/(n+1) on a machine with n processes in the ready list. Given two machines, one with n and one with m processes in their ready lists, where n > m, and given a process on the more heavily loaded machine with age t (measured in CPU seconds) and therefore an expected remaining CPU demand of t seconds, the expected real (wall clock) time until completion will be:

    on the heavily loaded machine: (n+1)t

    if migrated to the lightly loaded machine: (m+2)t

    Therefore, if the process is moved, the time gained will be:

    (n+1)t - (m+2)t = (n-m-1)t
    
    And, if this gain is greater than the time taken for the move itself, the process will finish sooner! Based on this result, Harchol-Balter and Downey concluded, convincingly, that migration of running processes is a very desirable tool!
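
    A minimal sketch of the resulting test, using the process's past CPU time as the predictor of its remaining demand, as Harchol-Balter and Downey suggest (the function name and parameters are invented for illustration):

        def worth_migrating(age_cpu_seconds, n_here, m_there, migration_cost_seconds):
            # age_cpu_seconds: CPU time consumed so far, used as the predicted
            #                  remaining CPU demand t.
            # n_here:  length of the ready list on this (heavily loaded) machine.
            # m_there: length of the ready list on the candidate machine.
            t = age_cpu_seconds
            time_if_stay = (n_here + 1) * t     # round-robin stretches t by n+1
            time_if_move = (m_there + 2) * t    # the arriving process makes m+2 sharers
            return time_if_stay - time_if_move > migration_cost_seconds

        # Example: a 60 CPU-second-old process, 9 ready here, 2 ready there, and a
        # 30 second move: the gain is (10 - 4) * 60 = 360 seconds, so move it.
        print(worth_migrating(60, n_here=9, m_there=2, migration_cost_seconds=30))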

  4. Process Migration Methods

    Whether or not a system can be made to support process migration depends heavily on how processes interact with their environment! If processes address messages to their correspondents by location, for example, by establishing a connection to socket 20 on ant.cs.uiowa.edu, migration is not practical. On the other hand, if processes address correspondents through names that may be edited by the kernel, anything may be moved.

    One of the first systems to demonstrate this effectively was DEMOS/MP, a reimplementation of the DEMOS operating system on a network of microprocessors, done by Powell and Miller at Berkeley and published in 1983. Powell had been involved with the development of DEMOS on the Cray 1 computer at Los Alamos; when he moved to Berkeley, he brought DEMOS with him and re-implemented it as DEMOS/MP. The Cray 1 had a crude memory protection architecture (base and bound registers) that precluded shared-memory models of interprocess communication, and the microprocessors used at Berkeley were no more sophisticated. As a result, it was fairly easy to move the majority of the DEMOS code (written in Pascal) from the Cray environment to the microprocessor environment.

    The DEMOS process structure consisted of the following pieces:

             ____   _______  ||  _______   ________
            | PC | |       | || | LINK  | | STATUS |
            | SP | | CODE  | || | TABLE | |  ETC   |
            | AC | |_______| || |       | |________|
            |    | |       | || |       |
            |____| | HEAP  | || |_______|
                   |_______| ||  __________________
                   |///////| || |                  |
                   |       | || |  MESSAGE QUEUE   |
                   | STACK | || |__________________|
                   |_______| ||
                             ||
                USER         ||      PROTECTED
    
    A DEMOS process has a data segment containing the code of the executing process, the heap for that process, and the stack for that process. This could just as easily be three segments, one for each purpose. What is important is that these segments are not shared. The process registers can be viewed as part of the process's memory segment (and there can even be reserved locations in this segment for saving the registers).

    In addition, each DEMOS process has a number of associated data structures that are maintained by the system. These include the usual process status information, such as priority and state (ready, running, etc), and they include two structures related to interprocess communication.

    The first interprocess communication data structure is the message queue. This holds all messages which have been delivered to this process but not yet read by the running program.

    The second interprocess communication data structure is the link table. This holds the addresses and port numbers that the process may legitimately use as destinations when it transmits a message. Entries in the link table are essentially capabilities for message passing. Link table entries contain a process address formatted as follows:

             __________________
            |   |   |          |
            | M | M | LOCAL ID |
            |___|___|__________|
              |   |      |
              |   |  serial number on creating machine.
              |   |
              |  Net address of the creating machine.
              |
             Last known net address of the process.
    
    The address has two basic components, the machine ID of the machine where the process was last known to be, and a globally unique process ID. The latter need not have any internal structure, but in fact, it is composed of the machine ID of the machine that created the process and the local ID or serial number of the process on that machine.
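
    Modelled in a modern notation (the field names are invented for this sketch; DEMOS itself was written in Pascal), a link-table entry's process address might look like this:

        from dataclasses import dataclass

        @dataclass
        class ProcessAddress:
            last_known_machine: int   # hint, updated when the kernel learns of a move
            creating_machine: int     # with serial_number, forms the globally unique,
            serial_number: int        # never-changing process ID

            def process_id(self):
                # The immutable part of the address, valid wherever the process runs.
                return (self.creating_machine, self.serial_number)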

    Every process keeps the process ID it was assigned by the machine that created it, no matter where it is currently running. When a process sends a message over a link, the message is sent to the machine where the destination process was last known to be. The kernel on that machine receives the message and looks in its current table of processes to see if that process is running locally. If so, the message is delivered to the destination process's mailbox.

    The DEMOS system might decide to move a process when the kernel running on a machine that is sufficiently idle discovers that a neighbor is overloaded. In that case, the under-utilized machine requests that the overloaded machine send it a process. The actual move is done as follows:

    1. Stop the process that is to be moved.
    2. Send the destination machine the information needed to set up a framework for the move: the process ID and segment sizes, at minimum.
    3. On the destination machine, create an empty process image for the process being moved.
    4. Move the memory image and registers, as well as the link table of the process.
    5. Leave a forwarding address on the old machine so that any messages sent to the process there will be forwarded.
    6. Forward all of the messages already in the mailbox.
    7. Delete the process on the source machine and start the new copy running.
    The moved process can actually resume execution on the destination machine as soon as step 4 is completed. If it has useful computation to do, all of its memory and registers are there. If it needs to send any messages, it has the links it needs. If it tries to receive a message that was already in its mailbox, it will block until the message is forwarded by the work in step 6.
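
    In outline, the source kernel's side of the move might read as follows; every function and data structure named here is an invented stand-in for a DEMOS/MP kernel operation, so this is a sketch of the steps above rather than actual DEMOS/MP code:

        def migrate(process, dest, send, forwarding_table, local_processes):
            process.stop()                                      # step 1
            send(dest, ("create-frame", process.pid,            # step 2; the destination
                        process.segment_sizes()))               # performs step 3
            send(dest, ("state", process.memory_image(),        # step 4
                        process.registers(), process.link_table))
            forwarding_table[process.pid] = dest                # step 5
            for message in process.mailbox:                     # step 6
                send(dest, ("forward", process.pid, message))
            del local_processes[process.pid]                    # step 7
            send(dest, ("start", process.pid))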

    A process cannot tell that it was moved from one machine to another, although the outcome may differ because the order in which it receives messages may change as a result of the move. This does not change the correctness of the code, though, because our definition of correct message reception only states that a message must be received after it is transmitted, and if many messages are in transit, it is correct to deliver them in any order.

    If the system provided a service that allowed a process to test whether a message is pending without accepting delivery of that message, then the process could sometimes tell that it had been moved (it might find a pending message on one test, then find that the message was missing on a second test, perhaps because the process had been moved and the message had not yet been forwarded).

  5. Use of Forwarding Addresses in DEMOS

    When a forwarding address is encountered on the machine where a process was last known to be, the address is used to forward the message to the machine where the process was moved.

    In addition, when a forwarding address is used to forward a message, a message is sent to the original sender's kernel to inform it of the move, so that the sender's kernel can update all local link tables to reflect the move.

    With this approach, the first message sent to a process that has moved will be sent to the wrong machine and then forwarded, but subsequent messages from the same process will be sent directly to the new machine.
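
    A sketch of the receiving kernel's delivery logic, including the forwarding step and the update message sent back to the sender's kernel (all names here are invented for the sketch):

        def deliver(message, local_processes, forwarding_table, send):
            pid = message.destination_pid
            if pid in local_processes:
                local_processes[pid].mailbox.append(message)    # normal local delivery
            elif pid in forwarding_table:
                new_machine = forwarding_table[pid]
                send(new_machine, message)                      # forward the message
                send(message.sender_machine,                    # tell the sender's kernel so
                     ("process-moved", pid, new_machine))       # it can update its link tables
            else:
                send(message.sender_machine, ("unknown-process", pid))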

    It is hard to imagine how to improve on this, except for one problem: Forwarding addresses tend to accumulate over the life of the system. These must somehow be eliminated! This elimination is something that the system must do in the background, and it is analogous in some ways to garbage collection.

    The kernel of a machine holding a forwarding address must somehow learn when the last link referencing that forwarding address has been updated to reflect the new location of the process that has migrated. When this is done, the kernel holding that link can delete it.

    Alternatively, forwarding addresses may simply be deleted after they have been held for some interval; in this case, if a message is received for a process that is no longer at the expected destination and there is no forwarding address, a broadcast must be made to locate the missing process. Or, after a forwarding address has been held for some interval, a broadcast may be made to all kernels informing them that the address is about to be deleted and that they must update any remaining links that reference it.

    Finally, there is the option of simply discarding forwarding addresses after a while. This is possible if either the maximum lifetime of an unused link is clearly advertised, with a `use it or lose it' warning in the system manual, or every kernel periodically sends a probe message along each link held by processes stored on that machine to get updated locations.

  6. Problems with Migration

    Migration under DEMOS/MP was easy because the system forbade archival storage of links and required strict segregation of links from data. What if links and data are mixed? What if links are stored archivally?

    In this presentation, the DEMOS/MP framework will be used -- that is, messages will be discussed as if they are sent over links which designate not only a process but a port of that process. This is done only to simplify the discussion; the same problems arise if the addresses designate processes directly or some other kind of entity.

    Process migration is hard if the address fields of links cannot be found, for example when the links are encrypted and stored mixed with data, and it is hard if the address fields of links cannot be changed, for example, when links are stored off-line in archival storage, or when they are written on read-only media (such as WORM disks -- Write Once, Read Many, a form of optical disk technology).

    The result is that the address field of a link can do no more than name the destination process; it cannot give any hint about where the destination process is currently located. From this, it can be concluded that some form of name service is required to support the use of such address encodings.

            LINK
             ____________
            |  ...  | ID |
            |_______|____|
                         \_________
                     /   |  NAME   |    Useful for
             Universal   | SERVICE |    message delivery
             ID of       |_________|     /
             destination;          \_________
             useless for           | ADDRESS |
             message delivery      |_________|
    
    Name services translate the global identifier of a process or other resource to an address useful for message delivery. When a process sends a message in this context, it must contact the local agent of the name server to translate the destination process name to a useful address.

    The local agent typically maintains a cache of recently used name-address pairs; if a name is not in the cache, the local agent must contact some remote part of the name server.
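
    A local agent of the name service might be sketched as a cache in front of remote lookups (the remote_lookup function stands in for whatever protocol reaches the rest of the name service; the class and its names are invented for this sketch):

        class NameAgent:
            def __init__(self, remote_lookup):
                self.remote_lookup = remote_lookup   # contacts the distributed name service
                self.cache = {}                      # process ID -> last known address

            def resolve(self, process_id):
                # Fast path: the cache.  Slow path: ask the rest of the name service.
                if process_id not in self.cache:
                    self.cache[process_id] = self.remote_lookup(process_id)
                return self.cache[process_id]

            def update(self, process_id, new_address):
                # Called when a forwarded message (or the name service) reveals a move.
                self.cache[process_id] = new_address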

    This outline of how a name server is used is fairly typical, so long as the context is a message-passing system. In a virtual circuit system, the name server might be called once to set up the circuit, and in this context, the name server may not need a local agent or cache.

    When a process is moved, the caches used by local agents of the name server can be updated using the same algorithm used to update the link tables in DEMOS/MP!

    The updating of the information in the global name service or in the local agent caches around the network takes place in the background. The forwarding address assures that messages will be delivered correctly while this updating is being done.