22C:116, Lecture 32, Spring 2002

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Fault Tolerant Servers

    On a multicomputer, failure of a component -- typically a computer module -- should be considered a normal event. If the only processes running on the computer module that failed were user processes, it might be acceptable to pass the responsibility for fault tolerance to the users. On the other hand, if the processes that are lost due to the failure are system processes, particularly servers on which the multicomputer as a whole relies, we must find some way to keep the system running despite this loss.

    We need several tools to do this. We must detect the fault, we must assure that critical data is not lost because of the fault, we must replace the servers that were lost, and we must inform the users of the new servers.

    All systems that support any form of process migration solve one of these problems -- when a process moves, all such systems have a mechanism for forwarding messages to the new location of a process that has moved. Any mechanism that does this can be modified to also allow the replacement for a failed server to receive messages that had been addressed to the failed predecessor of that server.
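    Such a forwarding mechanism amounts to one level of indirection between a server's logical name and its current location. The sketch below illustrates the idea; the class and method names are hypothetical, and a real system would send messages over the network rather than return routing tuples.

```python
# Hypothetical sketch: a forwarding table mapping a logical server name
# to its current location, so that messages addressed to a failed server
# can be redirected to its replacement.  Names are illustrative only.

class ForwardingTable:
    def __init__(self):
        self.location = {}          # logical name -> (node, process id)

    def register(self, name, node, pid):
        self.location[name] = (node, pid)

    def replace(self, name, new_node, new_pid):
        # Called when a replacement server is launched; later messages
        # addressed to `name` are delivered to the new location.
        self.location[name] = (new_node, new_pid)

    def route(self, name, message):
        node, pid = self.location[name]
        return (node, pid, message)   # a real system would transmit here

table = ForwardingTable()
table.register("file-server", node=1, pid=100)
table.replace("file-server", new_node=2, new_pid=207)
assert table.route("file-server", "read sector 5")[0] == 2
```

    The same table serves process migration and failure recovery alike: in both cases, only the mapping changes, not the name clients use.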

  2. Failure Detection

    The most common scheme for failure detection involves assigning a monitor process to each server. The monitor process periodically polls the server to ask, "are you still working?" If the server fails to reply, the monitor process launches a replacement server.

    	monitor process for file server:
    	   repeat
    	      wait X seconds
    	      send inquiry to file server
    	      await reply or timeout
    	      if timeout
    	         launch replacement file server
    	      endif
    	   endloop
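    The loop above can be sketched in runnable form as follows. This is a minimal single-machine illustration, assuming the server answers a ping within some deadline; the names `ping` and `launch_replacement` are placeholders for real network operations.

```python
# Minimal sketch of the monitor loop, assuming `ping` asks the server
# "are you still working?" and `launch_replacement` starts a new server.
import time

def monitor(ping, launch_replacement, interval=1.0, deadline=0.5):
    """Poll the server; on a missed or late reply, launch a replacement."""
    while True:
        time.sleep(interval)                # wait X seconds
        start = time.time()
        alive = ping()                      # "are you still working?"
        late = (time.time() - start) > deadline
        if not alive or late:
            launch_replacement()
            return                          # the replacement gets its own monitor

# Simulate a failed server: ping() returns False, so the monitor reacts.
events = []
monitor(lambda: False, lambda: events.append("replaced"), interval=0.0)
assert events == ["replaced"]
```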
    
    Of course, the monitor process must itself be monitored -- it is an important part of the system, and a failure that kills the monitor kills the system's fault tolerance.

    Furthermore, we must assure that the monitor does not run on the same machine as the server it monitors! Ideally, what we need is a way to launch the monitor process with a set of instructions to the load balancer saying "no matter what you do, keep this process far away from the process it monitors so that no failure is likely to kill both".

    This introduces a new measure -- the fault tolerance distance between two processes. Two processes are distant if there are few single-point faults that will stop both processes, while they are close if there are many single-point faults that will stop both. If they are running on the same computer module, they are obviously close; if they are on different computer modules that share a single power supply, they are farther apart, while if they have different power supplies, they are even farther. If they can be isolated from the remainder of the system by cutting one network link, they are close, while if they have independent paths to the rest of the system, they are farther.
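    One simple way to model this measure is to describe each process by the set of single points of failure it depends on; the more such points two processes share, the closer they are. The sketch below uses this model; all of the resource names are hypothetical.

```python
# Illustrative model of fault tolerance distance: each process depends
# on a set of single points of failure (its computer module, power
# supply, network link, ...).  Shared points make two processes "close".

def shared_faults(deps_a, deps_b):
    """Single-point faults that would stop both processes."""
    return deps_a & deps_b

server  = {"module-3", "psu-1", "link-a"}
monitor = {"module-7", "psu-2", "link-b"}
sibling = {"module-3", "psu-1", "link-a"}   # same module as the server

assert shared_faults(server, monitor) == set()    # distant: good placement
assert shared_faults(server, sibling) == server   # close: bad placement
```

    A load balancer honoring distance advice could simply refuse any placement where this shared set is nonempty for a monitor and the server it watches.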

    Obviously, any load balancing mechanism used in a fault tolerant system should understand the concept of fault tolerance distance and should understand advice from processes about the distance criteria they must meet.

    One common way to construct monitors is to implement each server in the system as a set of mutually monitoring processes, where each process in the set monitors at least one of the others and attempts to launch a replacement when needed. At any time, one process in the set is designated as the working server.
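    One natural arrangement for such a set is a ring, in which each process monitors its successor; this is a sketch of that structure under the assumption of a ring topology, which the text does not mandate.

```python
# Hedged sketch: n mutually monitoring processes arranged in a ring,
# where member i watches member (i + 1) mod n and relaunches it on
# failure.  One member (say, member 0) acts as the working server.

def ring_watch_targets(n):
    """Return which member each of the n members monitors."""
    return {i: (i + 1) % n for i in range(n)}

# With three members, every process is watched by exactly one other,
# so any single failure is detected by a surviving member.
assert ring_watch_targets(3) == {0: 1, 1: 2, 2: 0}
```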

  3. Assuring that Critical Data is not Lost

    Critical data must be stored redundantly! Thus, where a classical file system maintains only one copy of a file, a fault tolerant file system maintains backup copies. This has already been discussed in the context of RAID systems, but redundant RAID systems are typically viewed by the attached computer as a single disk, where in a fault tolerant multicomputer, we want to arrange our duplication across multiple computer modules.

    One option is to require users to store data on multiple file servers. This puts the burden for fault tolerance on the user. Alternatively, we can build a file server that stores each file and directory redundantly on disk servers. Each disk server is not itself fault tolerant: the disk server and its disk drive are tied to a single computer module, and if the disk drive or computer module fails, the disk server will fail as well. We must accept this.

    On the other hand, we can build a fault tolerant file server on top of this, so long as we have multiple failure-prone disk servers that are distant from each other in the system. When the file server writes to a sector of a file or directory, it does a stable-update of that sector using multiple disk servers. When the file server reads a sector, it does a stable read. This rests on the stable-storage update and read algorithm discussed in the notes for Lecture 16.
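    The stable update and read can be sketched as follows, assuming the stable-storage discipline from Lecture 16: each copy carries a version number and a checksum, copies are written one at a time, and a read returns the newest copy whose checksum is intact. The in-memory dictionaries here stand in for real remote disk servers.

```python
# Sketch of stable update/read across replicated disk servers.  Each
# stored entry is (version, checksum, data); the checksum lets a read
# detect a copy corrupted by a failure mid-update.
import zlib

def stable_write(replicas, sector, data, version):
    # Write copies one at a time, so a crash mid-update leaves at least
    # one intact copy (old or new) on some disk server.
    for disk in replicas:
        disk[sector] = (version, zlib.crc32(data), data)

def stable_read(replicas, sector):
    # Return the newest uncorrupted copy, or None if none survives.
    best = None
    for disk in replicas:
        entry = disk.get(sector)
        if entry is None:
            continue
        version, checksum, data = entry
        if zlib.crc32(data) != checksum:
            continue                        # corrupt copy: skip it
        if best is None or version > best[0]:
            best = (version, data)
    return None if best is None else best[1]

disks = [dict(), dict(), dict()]            # three mutually distant disk servers
stable_write(disks, sector=0, data=b"hello", version=1)
disks[1][0] = (1, 0, b"garbled")            # simulate one corrupted copy
assert stable_read(disks, 0) == b"hello"    # the intact copies win
```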