32. Load Balancing

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

When a system contains multiple processors, a new question arises: How does the system maintain roughly equal loads on all the processors? If the loads are unevenly distributed, the performance of the most heavily loaded processor may end up limiting the performance of the entire system.

On systems with shared main memory, such as the old Encore Multimax, the SGI Onyx, or the more common dual-processor Pentium or Power PC systems, this problem is easy to solve. If there is a single ready list and each processor seeks work from that ready list whenever it becomes idle, the load will naturally balance without any computational effort to optimize which processor runs what.

This is only easy because symmetrical shared memory makes the cost of moving a process from one processor to another negligible -- it is just the cost of a context switch. On a machine with non-uniform shared memory or with no shared memory, processes may be difficult to move, so moves must be deliberately planned. Thus, the problem of load balancing in a distributed system has three components:

First, when is load balancing applied? Is it applied only when processes are created, or are processes also moved after they have been created?

Second, how is the decision made about which machine to use? Global information about system load is hard to get, so most load balancing schemes rely on heuristics to locate idle computing capacity and to move work away from overloaded machines.

If load balancing algorithms are applied only at the time of process creation, long-lived processes that have bursty patterns of resource utilization pose problems. These problems can be solved by breaking such processes up into sequences of smaller, shorter-lived processes, one per phase of the longer computation. As the long "logical process" changes from one phase to another, a new process is created, allowing the process creation mechanism to select a machine appropriate for that phase.

Third, how is a process selected for migration? If the cost of migration exceeds the performance benefit, it clearly should not be undertaken! Usually, the cost of migration can be estimated in advance, but the benefit depends on the future behavior of the process being moved, and estimating this requires some kind of prediction.

Load Balancing Algorithms

Given that total information about the loading of a distributed system is unlikely to be available because gathering such information is expensive, most load balancing algorithms rely on limited or local information about the state of the system. As a result, these algorithms are generally viewed as using load balancing heuristics.

Typically, the software running on each machine can compute a numeric measure of the local load. This may be as simple as a three-valued variable indicating that the machine is overloaded, acceptably loaded or underloaded, or as complex as a vector giving the local CPU load, the local memory load, and the local I/O load.
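
As a concrete illustration, such a metric might be represented in C as in the sketch below. The field names, the thresholds and the classify routine are hypothetical, not taken from any particular system.

    /* A hypothetical local load metric -- a sketch only.                */

    enum load_state { UNDERLOADED, ACCEPTABLE, OVERLOADED };

    struct load_metric {
        int cpu_ready;     /* processes in the local ready list          */
        int mem_free_kb;   /* free physical memory, in kilobytes         */
        int io_queue_len;  /* requests waiting in the local I/O queues   */
    };

    /* Collapse the vector into the three-valued summary described
       above.  The thresholds are arbitrary placeholders.                */
    enum load_state classify( struct load_metric m )
    {
        if (m.cpu_ready > 8 || m.mem_free_kb < 256 || m.io_queue_len > 16)
            return OVERLOADED;
        if (m.cpu_ready < 2 && m.io_queue_len < 4)
            return UNDERLOADED;
        return ACCEPTABLE;
    }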

The problem the load balancing algorithm must face is to distribute this information over the net so that, when the time comes to move a process or when the time comes to create a new process, the information necessary to keep the load reasonably balanced is available to the right machines.

Typically, each machine on the net only communicates with a limited subset of the others; this is necessary to allow the algorithm to scale well as the net grows. For the purpose of this discussion, we will refer to this subset, for any particular machine, as the set of neighbors of that machine.

There are two basic schemes for distributing the loading information: demand-driven and supply-driven. Here, we speak of the product being supplied or demanded as surplus capacity (however we measure capacity) -- the terms supply and demand are reversed if the product under discussion is surplus processes! In a demand-driven scheme, when a machine is ready to create a new process or export an existing one, it checks with its neighbors to determine their state, compares the replies, and then creates the new process on, or exports a process to, the neighbor with the greatest surplus capacity.

In a supply-driven scheme, machines that judge themselves underloaded advertise their surplus to their neighbors, looking for work. Each machine maintains a list of the advertisements it has recently received, and when it needs to create a new process or export one, it uses this list to select a destination machine.
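
The demand-driven case can be sketched in C as follows. The poll_neighbor, create_remote and create_local routines are hypothetical stand-ins for whatever kernel-to-kernel message exchange a real system would use; only the shape of the decision is intended.

    /* Demand-driven placement -- a sketch, not a real kernel interface.  */

    #define N_NEIGHBORS 4

    extern int  poll_neighbor( int neighbor );       /* neighbor's advertised surplus */
    extern void create_remote( int neighbor, void *proc );
    extern void create_local( void *proc );

    void place_new_process( void *proc )
    {
        int best = -1, best_surplus = 0;

        for (int i = 0; i < N_NEIGHBORS; i++) {      /* ask each neighbor its state   */
            int surplus = poll_neighbor( i );
            if (surplus > best_surplus) {
                best = i;
                best_surplus = surplus;
            }
        }
        if (best >= 0)
            create_remote( best, proc );   /* neighbor with the greatest surplus wins */
        else
            create_local( proc );          /* nobody reported spare capacity          */
    }

A supply-driven version would replace the polling loop with a lookup in the table of recently received advertisements, consulted at the same decision point.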

The Decision to Migrate

Assuming that process migration is possible, how does a system decide when to export a process? When a machine creates a new process, how does it decide whether to create it locally or on a remote machine? Typically, if the machine is underloaded, it will ignore the migration issue and keep new work local. If it is acceptably loaded, it will try to create new processes elsewhere, and if it is overloaded, it will also try to export existing processes.

Given a more sophisticated loading metric, a machine can compare its load with the loads of each of its neighbors to determine which machine is least loaded. If the loading metric has multiple dimensions, for example memory, processor and I/O, machines must compare the load characteristics (measured or anticipated) of each process with the metric for each system in order to determine which process is the best candidate to move to lower the overall loading on the whole system.

This clearly indicates that the more complex the local metric, the more difficult the migration problem becomes!

In work published by Harchol-Balter and Downey (ACM Trans on Computer Systems, Aug 1997, V. 15, N. 3, page 253) a very simple metric is shown to be remarkably effective for machines with equal CPU speeds and memory resources -- the length of the ready list. A machine with a short ready list is lightly loaded, while a machine with a long ready list is heavily loaded.

Selecting a Process to Move

In work published by Harchol-Balter and Downey (ACM Trans on Computer Systems, Aug 1997, V. 15, N. 3, page 253) an important empirical observation is made about process lifetimes. Based on an examination of large numbers of process histories on UNIX systems, they observed that, if a process has survived long enough to consume t seconds of CPU time, its median remaining lifetime is another t seconds: roughly half the time it will consume less than t additional seconds, and half the time it will consume more. This holds at every age -- a process that has consumed a minute of CPU time has about a 50% chance of consuming at least another minute -- and because the prediction is stated entirely in terms of the process's own past CPU consumption, it is independent of CPU speed.
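
Stated as a conditional probability, the distribution they report is heavy-tailed and roughly Pareto; the form below is an approximate fit, not an exact law:

    P( lifetime > x | lifetime > t )  is approximately  t/x      for x >= t >= 1 second

In particular, the probability of surviving to age 2t, given survival to age t, is about 1/2 -- which is just another way of saying that the median remaining CPU demand of a process of age t is another t seconds.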

This result could have been predicted years ago, when virtual memory was the subject of intensive study, because it is a natural consequence of the locality principle! The locality principle states that the history of execution of a particular piece of code is a good predictor of the future need for that piece of code: a piece that has been heavily used is likely to be used again, while a piece that has been lightly used is unlikely to be needed. In the virtual memory studies of the 1960's, the pieces of code in question were procedures and functions; in the current discussion, the pieces are entire processes. Other than this change in scale, the basic issues are very similar!

The consequence of this observation is simple. Given an estimate of the time taken to move a process, and given that the CPU time a process has already consumed is a reasonable predictor of its future CPU demand, a process is a good candidate for migration if the expected reduction in its completion time from moving it to a machine with a shorter ready list is greater than the time taken to move it.

Assuming round-robin scheduling on uniprocessors that are otherwise equal, the execution speed of a process will be proportional to 1/(n+1) on a machine with n processes in the ready list, so a remaining CPU demand of t seconds will take (n+1)t seconds of real time to complete. Given two machines, one with n and one with m processes in their ready lists, where n > m, and given a process on the more heavily loaded machine with age t (measured in CPU seconds) and therefore a predicted remaining CPU demand of t seconds, the expected real (wall clock) time until completion will be:

on the heavily loaded machine                (n+1)t

if migrated to the lightly loaded machine    (m+2)t

The second coefficient is m+2 rather than m+1 because the migrated process itself becomes one more competitor for the lightly loaded machine's CPU. Therefore, if the process is moved, the real time gained will be:

(n+1)t - (m+2)t  =  (n - m - 1)t

And, if this gain is greater than the time taken for the move itself, the process will finish sooner! Based on this result, Harchol-Balter and Downey concluded, convincingly, that migration of running processes is a very desirable tool!
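
The whole decision rule fits in a few lines of C. The sketch below assumes equal-speed uniprocessors with round-robin scheduling, as above; the function name and the move_cost parameter are illustrative, and the estimate of move_cost must come from elsewhere (for example, from the size of the process image and the speed of the network).

    /* Should a process of age t (CPU seconds) be moved from a machine
       with n other ready processes to a machine with m ready processes,
       if the move itself is estimated to take move_cost seconds?         */
    int worth_migrating( double t, int n, int m, double move_cost )
    {
        double time_here  = (n + 1) * t;   /* predicted wall-clock time if it stays  */
        double time_there = (m + 2) * t;   /* predicted wall-clock time after moving */

        return (time_here - time_there) > move_cost;
    }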

Process Migration Methods

Whether or not a system can be made to support process migration depends heavily on how processes interact with their environment! If processes address messages to their correspondents by location, for example by establishing a connection to socket 20 on ant.cs.uiowa.edu, migration is not practical. On the other hand, if processes address their correspondents through names that may be edited by the kernel, anything may be moved.

One of the first systems to demonstrate this effectively was Demos/MP, a reimplementation of the Demos operating system on a network of microprocessors, done by Powell and Miller at Berkeley and published in 1983. Powell had been involved in the development of Demos on the Cray 1 computer at Los Alamos; when he moved to Berkeley, he brought Demos with him and re-implemented it as Demos/MP. The Cray 1 had a crude memory protection architecture (base and bound registers) that precluded shared-memory models of interprocess communication, and the Z80 microprocessors used at Berkeley were no more sophisticated. As a result, it was fairly easy to move the majority of the Demos code (written in Pascal) from the Cray environment to the microprocessor environment.

The Demos process structure consisted of the following pieces:

         ____   _______  ||  _______   ________
        | PC | |       | || | LINK  | | STATUS |
        | SP | | CODE  | || | TABLE | |  ETC   |
        | AC | |_______| || |       | |________|
        | .. | |       | || |       |
        |____| | HEAP  | || |_______|
               |_______| ||  __________________
               |///////| || |                  |
               |       | || |  MESSAGE QUEUE   |
               | STACK | || |__________________|
               |_______| ||
                         ||
            USER         ||      PROTECTED

A Demos process has a data segment containing the code of the executing process, the heap for that process, and the stack for that process. This could just as easily be three segments, one for each purpose. What is important is that these segments are not shared. The process registers can be viewed as part of the process's memory segment (and there can even be reserved locations in this segment for saving the registers).

In addition, each Demos process has a number of associated data structures that are maintained by the system. These include the usual process status information, such as priority and state (ready, running, etc), and they include two structures related to interprocess communication.

The first interprocess communication data structure is the message queue. This holds all messages which have been delivered to this process but not yet read by the running program. Demos used synchronous message passing, so each entry in this queue held pointers to source and destination buffers, not the actual contents of the message.

The second interprocess communication data structure is the link table. This holds the addresses and port numbers that the process may legitimately use as destinations when it transmits a message. Entries in the link table are essentially capabilities for message passing. In Demos/MP, each link table entry has the following format:

        /-process address-\
            /--unique ID--\
         __________________ _______________________
        |   |   |          |              |        |
        | M | M | LOCAL ID | Queue Number | Rights |
        |___|___|__________|______________|________|
          |   |      |
          |   |  serial number on creating machine.
          |   |
          |  Net address of the creating machine.
          |
         Last known net address of the process.

The process address field of the link has two basic components, the machine ID of the machine where the process was last known to be, and a globally unique process ID. The latter need not have any internal structure, but in fact, it is composed of the machine ID of the machine that created the process and the local ID or serial number of the process on that machine.
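
A C rendering of such an entry might look like the following. This is only a sketch of the fields in the figure above; the types, widths and names are illustrative, not the actual Demos/MP declarations.

    /* One link table entry, following the figure above -- a sketch.      */
    struct link_entry {
        /* process address = last known location + globally unique ID     */
        int      last_known_machine;  /* last known net address of the process   */
        int      creating_machine;    /* net address of the creating machine     */
        int      local_id;            /* serial number on the creating machine   */

        int      queue_number;        /* which incoming queue of the destination */
        unsigned rights;              /* what the holder may do with this link   */
    };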

Every process keeps the process ID it was assigned by the machine that created it, no matter where it is currently running. When a process sends a message over a link, the message is sent to the machine where the destination process was last known to be -- this is a heuristic that is likely to deliver the message to the right machine, but if processes migrate, it occasionally fails.

When the kernel on a machine receives a message, it looks in its process table to see if the destination process is currently running on that machine. If so, a notice of the message is delivered to the destination process's incoming message queue. Because this system uses synchronous message passing, the actual data transfer is deferred until the recipient notices that the message is waiting.

If the kernel on a machine receives a message addressed to a process that is not listed in the local process table, there is a problem! For now, we will defer discussion of that problem until we have finished discussing the mechanism that allows a process to migrate from one machine to another under the Demos system.

Once the decision has been made to move a Demos process from one machine to another in a Demos system, a decision made jointly by the local kernels on the sending and receiving machines, the move proceeds as follows:

  1. Stop the process that is to be moved.

  2. Give the destination machine the information necessary to create a framework for migration: At minimum, this includes the process ID, the size of the data segment, and the size of the link and queue tables.

  3. The destination machine creates an empty process image for the process being moved. This has all of the attributes of a process, but no code, no links and no content in the incoming queues.

  4. Copy the memory image of the process, including all register values, to the new process, and copy the contents of the link table.

  5. Replace the process table entry on the old machine with a forwarding notice saying "this process has moved to the new machine". This forwarding notice allows any messages that arrive at the old machine to be passed onward to the new machine.

  6. Forward all of the messages from the incoming message queues of the old process to the incoming message queues of the new process.

  7. Delete the now-useless image of the process on the source machine and start the new copy running.

The moved process can actually resume execution on the destination machine as soon as step 4 is completed. If it has useful computation to do, all of its memory and registers are there. If it needs to send any messages, it has the links it needs. If it tries to receive a message that was already in its incoming message queue, it will block until that message is forwarded in step 6.
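
Gathered into one routine, the seven steps look roughly like the sketch below. Every function called in it is a hypothetical stand-in for a kernel-to-kernel message exchange, and the whole thing is a sketch of the protocol, not the actual Demos/MP code.

    struct process;                    /* opaque process image */

    extern void stop_process( struct process *p );
    extern void send_process_header( struct process *p, int dest );       /* step 2 */
    extern void copy_image_and_links( struct process *p, int dest );      /* step 4 */
    extern void install_forwarding_notice( struct process *p, int dest ); /* step 5 */
    extern void forward_queued_messages( struct process *p, int dest );   /* step 6 */
    extern void delete_local_image( struct process *p );                  /* step 7 */

    void migrate( struct process *p, int dest )
    {
        stop_process( p );                    /* 1: freeze the process               */
        send_process_header( p, dest );       /* 2: dest builds an empty image (3)   */
        copy_image_and_links( p, dest );      /* 4: memory, registers, link table    */
        install_forwarding_notice( p, dest ); /* 5: leave a forwarding address       */
        forward_queued_messages( p, dest );   /* 6: drain the old message queues     */
        delete_local_image( p );              /* 7: dest may already be running it   */
    }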

A process running under this migration scheme cannot tell that it was moved from one machine to another, although the outcome may differ because the order in which it receives messages may change as a result of the move. This does not change the correctness of the code, though, because our definition of correct message reception only states that a message must be received after it is transmitted; if many messages are in transit, they may be delivered in any order.

Use of Forwarding Addresses in Demos

When a forwarding address is encountered on the machine where a process was last known to be, the address is used to forward the message to the machine where the process was moved. Of course, there is a small likelihood that the process may have migrated from there to some new home, so forwarding may continue for several cycles.

Each time a forwarding address is used, two things happen: First, the message is forwarded, and second, a change-of-address notice is sent back to the machine that sent the message. When a machine receives a change of address notice for a message that was sent out using a particular link, it changes the last-known-address field of that link to refer to the new location. As a result, the first use of a link after the destination process migrates will cause that link to be updated to refer to the new destination.
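
The receiving kernel's side of this scheme can be sketched as follows. The lookups and the send primitives are hypothetical placeholders; in a real system each of them is a kernel data structure operation or a kernel-to-kernel message.

    /* Message arrival with forwarding -- a sketch.                        */

    extern int  is_local_process( int pid );     /* in the local process table?      */
    extern int  forwarding_address( int pid );   /* new machine, or -1 if no notice  */
    extern void deliver_locally( int pid, void *msg );
    extern void forward_to( int machine, int pid, void *msg );
    extern void send_change_of_address( int sender_machine, int pid, int machine );

    void on_message_arrival( int sender_machine, int pid, void *msg )
    {
        if (is_local_process( pid )) {
            deliver_locally( pid, msg );
        } else {
            int next = forwarding_address( pid );
            if (next >= 0) {
                forward_to( next, pid, msg );            /* pass the message onward  */
                send_change_of_address( sender_machine, pid, next ); /* fix the link */
            }
            /* else: no forwarding notice -- see the alternatives discussed below */
        }
    }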

This works very well, except for one problem: Forwarding addresses tend to accumulate over the life of the system. These must somehow be eliminated! This elimination is something that the system must do in the background while processes continue to be created and to migrate.

Cleaning up obsolete forwarding addresses can be viewed as a form of garbage collection. In this case, the kernel of a machine holding a forwarding address must somehow learn when the last link referencing that forwarding address has been updated; at that point, the forwarding address can be deleted. A crude way to do this would be to look at each forwarding address periodically and broadcast a message asking "does anyone still need this forwarding address?" If there is no reply, the forwarding address can be deleted.

Alternatively, forwarding addresses may simply be deleted after they have been held for some interval; in this case, if a message arrives for a process that is no longer at the expected destination and there is no forwarding address, a broadcast must be made to locate the missing process. Or, after a forwarding address has been held for some interval, a broadcast may be made to all kernels informing them that the address is about to be deleted and that they must update any remaining links that reference it.
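
The second alternative, timed deletion preceded by a warning broadcast, might be sketched like this; the notice structure, the lifetime constant and the broadcast primitive are all hypothetical placeholders.

    /* Timed expiry of a forwarding notice -- a sketch.                    */

    struct forward_notice { int pid; int new_machine; long created; };

    extern long now( void );                                    /* seconds since boot  */
    extern void broadcast_update( int pid, int new_machine );   /* "update your links" */
    extern void delete_notice( struct forward_notice *f );

    #define NOTICE_LIFETIME 3600   /* seconds -- an arbitrary placeholder */

    void expire_if_old( struct forward_notice *f )
    {
        if (now() - f->created > NOTICE_LIFETIME) {
            broadcast_update( f->pid, f->new_machine );  /* last chance to repair links */
            delete_notice( f );
        }
    }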

Alternatives to Forwarding Addresses

The Demos/MP forwarding address scheme has one major competitor. In this competing model, when a process is migrated away from a machine, that machine keeps no record of where the process went. When a message arrives at that machine addressed to a process that is no longer present, a reply is sent saying, in effect, "unknown process".

When the kernel sending a message receives such a rebuff, it broadcasts a message asking "has anyone heard of this process?" If the process still exists, the kernel on the machine where the process currently resides is obligated to reply, giving the process's current location. The sending kernel updates the link when it gets a reply, and then retries the message send using the new destination.

The broadcast activity suggested by this model does not scale well, but it can be eliminated by introducing special servers known as low-level name servers. These servers map universal process IDs to current locations. When a kernel loses track of the current location of a process, instead of broadcasting an inquiry, it contacts the nearest name server and asks where the process is. The name server maintains a cache of the processes whose locations it has recently known, and only if the process is not in that cache does it resort to a broadcast. Server hierarchies can be designed, and some variations require processes to register with a server in order to eliminate the need for broadcasts entirely.
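
The sender's recovery path in this competing model might look like the sketch below; the name-server query, the broadcast and the retry primitives are hypothetical placeholders.

    /* Sender-side recovery after an "unknown process" rebuff -- a sketch. */

    extern int  ask_name_server( int pid );      /* current machine, or -1 on a miss  */
    extern int  broadcast_locate( int pid );     /* last resort: ask every kernel     */
    extern void update_link( int link, int machine );
    extern void send_on_link( int link, void *msg );

    void on_unknown_process( int link, int pid, void *msg )
    {
        int machine = ask_name_server( pid );    /* nearest low-level name server     */
        if (machine < 0)
            machine = broadcast_locate( pid );   /* server cache miss: broadcast      */

        update_link( link, machine );            /* repair the last-known address     */
        send_on_link( link, msg );               /* and retry the original send       */
    }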

Problems with Migration

Migration under Demos/MP was easy because that system forbids applications from directly manipulating links or storing them in any data structure outside the process's link table. Had Demos/MP allowed this, the kernel would not have been able to find and update links, and as a result, the forwarding address scheme would not have led eventually to the updating of all links.

In this presentation, the Demos/MP framework will be used -- that is, messages will be discussed as if they are sent over links that designate not only a process but a particular port or incoming message queue of that process. This is done only to simplify the discussion; the same problems arise whether messages are addressed to processes, to ports, or to some other kind of entity.

Process migration is hard if the address fields of links cannot be found, for example when links are encrypted and stored mixed with ordinary data, and it is hard if the address fields of links cannot be changed, for example when links are stored off-line in archival storage or written on read-only media (such as WORM disks -- Write Once, Read Many -- a form of optical disk technology).

The result is that the address field of a link must merely name the destination process instead of giving any hint about where the destination process might be located. From this, it can be concluded that some form of name-service is required to support the use of such address encodings.

        LINK
         ____________
        |  ...  | ID |
        |_______|____|
                     \_________
                 /   |  NAME   |    Useful for
         Universal   | SERVICE |    message delivery
         ID of       |_________|     /
         destination;          \_________
         useless for           | ADDRESS |
         message delivery      |_________|

Name services translate the global identifier of a process or other resource to an address useful for message delivery. When a process sends a message in this context, it must contact the local agent of the name server to translate the destination process name to a useful address.

The local agent of the name server in such a system typically holds a cache of recently used name-address pairs; if a name is not in the cache, the local agent must contact some remote part of the name service, or perhaps broadcast a request to all kernels, in order to locate the process.

When a process is moved, the caches used by local agents of the name server can be updated using the same algorithm that was used to update link tables in Demos/MP! The updating of the information in the global name service, or in the local agent caches around the network, takes place in the background; the forwarding address ensures that messages are delivered correctly while this updating is being done.