27. Mutual Exclusion in Distributed Systems

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

When all processes sharing a resource are on the same machine, mutual exclusion is easily assured by the use of a semaphore, spin-lock or other similar shared abstraction. When the processes involved are on different machines, however, mutual exclusion becomes more difficult.

Consider the following example: A number of machines in a network are competing for access to a printer. The printer is so constructed that every line of text sent to the printer must be sent in a separate message, and thus, if a process wants to print an entire file, it must obtain exclusive use of the printer, somehow, send all the lines of the file it wishes to print, and then release the printer for use by others on the network.

A trivial solution to this problem is to install a print spooler process somewhere on the net. That print spooler would gather lines of files provided by various applications processes, maintain a queue of completed files that are ready to print, and print one such file at a time. This works, but it introduces some problems: First, the spooler must have buffer capacity to hold the aggregate of all the files that have not yet been printed. Second, the spooler may become a bottleneck, limiting the performance of the entire system. Third, if the processor supporting the spooler is unreliable, the spooler may limit the reliability of the system.

Mutual Exclusion Servers

In the printer example being used here, the problem of storage space in the spooler typically becomes acute with graphics printers. In such a context, it is desirable to block an applications process until the printer is ready to accept data from that applications process, and then let that process directly deliver data to the printer.

For example, an applications process may be coded as follows:

      Send request for permission to the spooler
      Await reply giving permission to print

      Loop
         send data directly to printer
      End Loop

      Send notice to spooler that printing is done

If all users of the spooler use this protocol, the spooler is no longer serving as a spooler; it is merely serving as a mutual exclusion mechanism! In fact, it implements exactly the same semantics as a binary semaphore, but it does so using a client-server model.

The process implementing a semaphore using message passing might have the following basic structure:

      Loop
         Await message from client
         Case message type of
           P:
              If count > 0
                 Send immediate reply
                 count = count - 1
              Else
                 Enqueue identity of client
              End if
           V:
              If queue is empty
                 count = count + 1
              Else
                 Dequeue one blocked client
                 Send a delayed reply
              End if
         End case
      End Loop

This requires a count and a queue of return addresses for each semaphore. Note that, by presenting this, we have proven that blocking message passing can be used to implement semaphores. Since we already know that semaphores plus shared memory are sufficient to implement blocking message passing, we have proven the equivalence, from a computation theory viewpoint, of these two models of interprocess communication!

The disadvantage of implementing semaphores using a server process is that the server becomes a potential source of reliability problems. If we can build a mutual exclusion algorithm that avoids the use of a dedicated server, for example by having the processes competing for entry to a critical section negotiate directly with each other, we can potentially eliminate this reliability problem.

Token-based Mutual Exclusion

One alternative to the mutual-exclusion server given above is to arrange the competing processes in a ring and let them exchange a token. If a process receives the token and does not need exclusive use of the resource, it must pass the token on to the next process in the ring. If a process needs exclusive use of the resource, it waits for the token and then holds it until it is done with the resource, at which point it puts the token back in circulation.

This is the exact software analog of a token ring network. In a token ring network, only one process at a time may transmit, and the circulating token is used to assure this, exactly as described. The token ring network protocol was developed for the hardware level or the link-level of the protocol hierarchy. Here, we are proposing building a virtual ring at or above the transport layer and using essentially the same token passing protocol.

This solution is not problem free. What if the token is lost? What if a process in the ring ceases transmission? Nonetheless, it is at the root of a number of interesting and useful distributed mutual exclusion algorithms. The advantage of such distributed algorithms is that they do not rest on a central authority, and thus, they are ideal candidates for use in fault tolerant applications.

An important detail in the token-based mutual exclusion algorithm is that, on receiving a token, a process must immediately forward the token if it is not waiting for entry into the critical section. This may be done in a number of ways:

  1. Each process could periodically check to see if the token has arrived. This requires some kind of non-blocking read service to allow the process to poll the incoming network connection on the token ring. The Unix FNDELAY flag (set by open() or by fcntl()) allows non-blocking reads, and the Unix select() kernel call allows testing an I/O descriptor to see if a read from that descriptor would block; either of these is sufficient to support this polling implementation of the token passing protocol. The fact that Unix offers two such mechanisms is good evidence that these are afterthoughts added to Unix after the original implementation was complete.

  2. The receipt of an incoming token could cause an interrupt. Under Unix, for example, the SIGIO signal (see the man page for sigvec() for a list of signals) can be attached to a socket or communications line (see the FASYNC flag set by fcntl). To await the token, the process could disable the SIGIO signal and do a blocking read on the incoming token socket. To exit the critical section, the process could first enable SIGIO and then send the token. The SIGIO handler would read the incoming token and forward it before returning.

  3. A thread or process could be dedicated to the job of token management. We'll refer to such a thread or process as the mutual exclusion agent of the application process. Typically, the application would communicate with its agent using shared memory, semaphores, and other uniprocessor tools, while the agent speaks to other agents over the net. When the user wants entry to the critical section, it sets a variable to "let me in" and then does a wait on the entry semaphore it shares with the agent. When the user is done, the user sets the shared variable to "done" and then signals the go-on semaphore it shares with the agent. The agent always checks the shared variable when it receives the token, and only forwards it when the variable is equal to "done".
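Option 3, the mutual exclusion agent, might be sketched as follows in Python, using threads in one address space to stand in for the application and its agent. The class name, the "let me in"/"done" values, and the semaphore names are illustrative choices, not a prescribed interface.

```python
import threading
import queue

class TokenAgent:
    """Mutual exclusion agent: forwards the token unless its application
    has asked for entry, in which case it holds the token until done."""
    def __init__(self, incoming, outgoing):
        self.incoming = incoming                  # link from ring predecessor
        self.outgoing = outgoing                  # link to ring successor
        self.want = 'done'                        # shared variable
        self.entry = threading.Semaphore(0)       # agent -> app: you may enter
        self.go_on = threading.Semaphore(0)       # app -> agent: I am done
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            self.incoming.get()                   # token arrives
            if self.want == 'let me in':
                self.entry.release()              # admit the application
                self.go_on.acquire()              # hold the token until done
            self.outgoing.put('token')            # forward the token

    def enter(self):                              # called by the application
        self.want = 'let me in'
        self.entry.acquire()

    def leave(self):                              # called by the application
        self.want = 'done'
        self.go_on.release()
```

A two-process ring can be built by cross-connecting two queues and injecting one token; each application thread then brackets its critical section with enter() and leave().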

Review: Lamport's Bakery Algorithm

One decentralized algorithm in common use, for example in bakeries, is to issue numbers to each customer. Then, when the customers want to access the scarce resource (the clerk behind the counter), they compare the numbers on their slips and the user with the lowest numbered slip wins.

The problem with this is that there must be some way to distribute numbers, but this has been solved. In bakeries, we use a very small server to distribute numbers, in the form of a roll of tickets where conflicts between two customers are solved by the fact that human hands naturally exclude each other from the critical volume of space that must be occupied to take a ticket. We cannot use this approach to solving the problem on a computer.

Before going on to more interesting implementations for distributing numbers, note that clients of such a protocol may make extensive use of their numbers! For example, if the bakery contains multiple clerks, the clients could use their number to select a clerk (number modulo number of clerks). Similarly, in a FIFO queue implemented with a bounded buffer, the number modulo the queue size could indicate the slot in the buffer to be used, allowing multiple processes to simultaneously place values in the queue.
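A minimal sketch of these extra uses of ticket numbers in Python, assuming a lock-protected ticket roll; the function names and the clerk and buffer sizes are invented for illustration.

```python
import threading

ticket_lock = threading.Lock()
next_ticket = 0

def take_a_number():
    """The roll of tickets: atomically draw the next number."""
    global next_ticket
    with ticket_lock:
        t = next_ticket
        next_ticket += 1
        return t

def clerk_for(ticket, nclerks=3):
    """Spread customers across clerks by ticket number."""
    return ticket % nclerks

def slot_for(ticket, size=8):
    """Pick the slot in a bounded buffer of the given size."""
    return ticket % size
```

Because every ticket is unique, two producers that take consecutive numbers are directed to distinct slots and can fill them concurrently.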

There is a large literature on synchronization algorithms based on the "take a number" scheme. Much of this originates from work by Gottlieb at New York University, where a machine called the Ultracomputer was built with a hardware solution to the problem of distributing unique numbers to multiple processes with minimal contention.

Lamport's Bakery Algorithm provides a decentralized implementation of the "take a number" idea. As originally formulated, this requires that each competing process share access to an array, but later distributed algorithms have eliminated this shared data structure. Here is the original formulation:

For each process i, there are two values, C[i] and N[i], giving the status of process i and the number it has picked. In more detail:

        _ _ _ _ _ _ _ _ _ _ _
     C |_|_|_|_|_|_|_|_|_|_|_|
     N |_|_|_|_|_|_|_|_|_|_|_|
    
     N[i] = 0 --> Process i is not in the bakery.
     N[i] > 0 --> Process i has picked a number and is in the bakery.

     C[i] = 0 --> Process i is not trying to pick a number.
     C[i] = 1 --> Process i is trying to pick a number.

     when
         N[i] = min( for all j, N[j] where N[j] > 0 )
     Process i is allowed into the critical section.

Here is the basic algorithm used to pick a number:

          C[i] := 1;
          N[i] := max( for all j, N[j] ) + 1;
          C[i] := 0;

In effect, the customer walks into the bakery, checks the numbers of all the waiting customers, and then picks a number one larger than the number of any waiting customer.

If two customers each walk in at the same time, they are each likely to pick the same number. Lamport's solution allows this but then makes sure that customers notice that this has happened and break the tie in a sensible way.

To help the customers detect ties, each customer who is currently in the process of picking a number holds his hand up (by setting C[i] to 1). He pulls his hand down when he is done selecting a number -- note that selecting a number may take time, since it involves inspecting the numbers of everyone else in the waiting room.

A process does the following to wait for the baker:

     Step 1:
       while (for some j, C[j] = 1) do nothing;

First, wait until any process which might have tied with you has finished selecting their numbers. Since we require customers to raise their hands while they pick numbers, each customer waits until all hands are down after picking a number in order to guarantee that all ties will be cleanly recognized in the next step.

     Step 2:
       repeat
          W := (the set of j such that N[j] > 0)
          -- W is the set of indices of waiting processes
          M := (the set of j in W
                   such that N[j] <= N[k]
                   for all k in W)
          -- M is the set of process indices with minimum numbers
          j := min(M)
          -- j is in M and the tie is broken.
       until i = j;

Second, wait until your ticket number is the minimum of all tickets in the room. There may be others with this minimum number, but in inspecting all the tickets in the room, you found them! If you find a tie, see if your customer ID number is less than the ID numbers of those with whom you've tied, and only then enter the critical section and meet with the baker.

This is inefficient, because you might wait a bit too long while some other process picks a number after you have picked yours, but for now, we'll accept this cost.

If you are not the person holding the smallest number, you start checking again. If you hold the smallest number, it is also possible that someone else holds the smallest number. Therefore, what you've got to do is agree with everyone else on how to break ties.

The solution shown above is simple. Instead of computing the value of the smallest number, compute the minimum process ID among the processes that hold the smallest value. In fact, we need not seek the minimum process ID; all we need is some deterministic rule for breaking the tie. As long as all participants apply the same deterministic algorithm to the same information, they will arrive at the same conclusion.

To return its ticket, and exit the critical section, processes execute the following trivial bit of code:

       N[i] := 0;

When you return your ticket, if any other processes are waiting, then on their next scan of the set of processes, one of them will find that it is holding the winning ticket.

Moving to a distributed context

In the context of distributed systems, Lamport's bakery algorithm has the useful property that process i only modifies its own N[i] and C[i], while it must read the entries for all others. In effect, therefore, we can implement this in a context where each process has read-only access to the data of all other processes, and read-write access only to its own data.

A distributed implementation of this algorithm can be produced directly by storing N[i] and C[i] locally with process i, and using message passing when any process wants to examine the values of N and C for any process other than itself. In this case, each process must be prepared to act as a server for messages from the others requesting the values of its variables; we have the same options for implementing this service as we had for the token passing approach to mutual exclusion. The service could be offered by an agent process, by an interrupt service routine, or by periodic polling of the appropriate incoming message queues.

Note that we can easily make this into a fault tolerant model by using a fault-tolerant client-server protocol for the requests. If there is no reply to a request for the values of process i after some interval and a few retries, we can simply assume that process i has failed.

This demonstrates that fault tolerant mutual exclusion can be done without any central authority! This direct port of Lamport's bakery algorithm is not particularly efficient, though. Each process must read the variables of all other processes a minimum of 3 times -- once to select a ticket number, once to see if anyone else is in the process of selecting a number, and once to see if it holds the minimum ticket.

For each process contending for entry to the critical section, there are about 6N messages exchanged, which is clearly not very good. Much better algorithms have been devised, but even this algorithm can be improved by taking advantage of knowledge of the network structure. On an ethernet or on a tree-structured network, a broadcast can be done in parallel, sending one message to N recipients in only a few time units. On a tree-structured network, the reply messages can be merged on the way to the root (the originator of the request) so that sorting and searching for the maximum N or the minimum nonzero N can be distributed efficiently.

Ricart and Agrawala's mutual exclusion algorithm

Another alternative is for anyone wishing to enter a critical section to broadcast their request; as each process agrees that it is OK to enter the section, they reply to the broadcaster saying that it is OK to continue; the broadcaster only continues when all replies are in.

If a process is in a critical section when it receives a request for entry, it defers its reply until it has exited the critical section, and only then does it reply. If a process is not in the critical section, it replies immediately.

This sounds like a remarkably naive algorithm, but with point-to-point communications between N processes, it takes only 2(N-1) messages for a process to enter the critical section, N-1 messages to broadcast the request and N-1 replies.

There are some subtle issues that make the result far from naive. For example, what happens if two processes each ask at the same time? What should be done with requests received while a process is waiting to enter the critical section?

Ricart and Agrawala's mutual exclusion algorithm solves these problems. In this solution, each process has 3 significant states, and its behavior in response to messages from others depends on its state:

Outside the critical section.
The process replies immediately to every entry request.

After requesting entry, awaiting permission to enter.
The process replies immediately to higher priority requests and defers all other replies until exit from the critical section.

Inside critical section.
The process defers all replies until exit from the critical section.
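The three-state reply rule can be captured in a single decision function. This is a sketch; the state names and the representation of a priority as a (sequence number, process ID) pair, with smaller comparing as higher priority, are assumptions for illustration.

```python
OUTSIDE, WAITING, INSIDE = 'outside', 'waiting', 'inside'

def should_reply_now(my_state, my_priority, their_priority):
    """Return True to reply immediately, False to defer the reply
    until exit from the critical section. A priority is a
    (sequence number, process ID) pair; smaller means higher priority."""
    if my_state == OUTSIDE:
        return True                    # not competing: always reply
    if my_state == INSIDE:
        return False                   # in the section: defer everything
    # WAITING: yield only to higher-priority requests
    return their_priority < my_priority
```

Including the unique process ID in the pair guarantees the comparison never declares an exact tie between two distinct processes.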

As with Lamport's bakery algorithm, this algorithm has no central authority. Nonetheless, the interactions between a process requesting entry to a critical section and each other process have a character similar to client-server interactions. That is, the interactions take the form of a request followed (possibly some time later) by a reply.

As such, this algorithm can be made fault tolerant by applying the same kinds of tricks as are applied in other client server applications. On receiving a request, a processor can be required to immediately send out either a reply or a negative acknowledgement. The latter says "I got your request and I can't reply yet!"

With such a requirement, the requesting process can wait for either a reply or a negative acknowledgement from every other process. If it gets neither, it can retry the request to that process. If it retries some limited number of times and still gets no answer, it can assume that the distant process has failed and give up on it.

If a process receives two consecutive requests from the same process because acknowledgements have been lost, it must resend the acknowledgement. If a process waits a long time and doesn't get an acknowledgement, it can send out a message saying "are you still there", to which the distant process would reply "I got your request but I can't reply yet." If it gets no reply, it can retry some number of times and then give up on the server as being gone.
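The retry discipline described above might be sketched as follows; send and recv_reply are assumed placeholder interfaces, and the timeout and retry count are arbitrary choices.

```python
def request_with_retries(send, recv_reply, retries=3, timeout=1.0):
    """Fault-tolerant request. `send()` transmits the request;
    `recv_reply(timeout)` returns 'reply', 'nak' (meaning "I got your
    request and I can't reply yet"), or None on silence. Silence
    triggers a resend; after `retries` unanswered attempts the
    distant process is presumed to have failed."""
    for _ in range(retries):
        send()
        while True:
            answer = recv_reply(timeout=timeout)
            if answer == 'reply':
                return 'granted'
            if answer == 'nak':
                continue               # alive but deferring: keep waiting
            break                      # silence: fall out and resend
    return 'peer failed'
```

A resend after a negative acknowledgement doubles as the "are you still there" probe, which is why the receiver must resend its acknowledgement on a duplicate request.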

If a process dies in its critical section, the above code solves the problem and lets one of the surviving processes in. If a process dies outside its critical section, this code also works.

Breaking ties in Ricart and Agrawala's algorithm

There are many ways to break ties between processes that make simultaneous requests; all of these are based on including the priority of each requesting process in the request message. It is worth noting that the same alternatives apply to Lamport's bakery algorithm!

A unique process ID can be used as the priority, as was done in Lamport's bakery algorithm. This is a static priority assignment and is almost always needed to break ties in any of the more complex cases. Typically, a process will append its statically assigned process ID to any more interesting information it uses for tiebreaking, thus guaranteeing that if two processes happen to generate the same interesting information, the tie will still be broken.

The number of times the process has previously entered the same critical section can be used; if processes that have entered the critical section more frequently are given lower priority, then the system will be fair, giving the highest priority to the least frequent user of the resource.

The time since last access to the critical section offers a similar opportunity to enforce fairness if the process that used the critical section least recently is given the highest priority.

If dynamic priority assignments are used, what matters is that the priority used on any entry to the critical section is frozen prior to broadcasting the request for entry, and that it remains the same until after the process is done with that round of mutual exclusion. It is also important that each process have a unique priority, but this can be assured by appending the process ID as the least significant bits of the dynamically chosen priority.
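In a language with lexicographic comparison of tuples, appending the process ID as the least significant part of the priority is a one-liner. This Python fragment, with an invented entries-so-far priority, illustrates the idea.

```python
def priority(times_entered, pid):
    """Dynamic priority: the less frequent user outranks the more
    frequent one, and the unique pid breaks exact ties. Tuples
    compare lexicographically, so the pid is least significant."""
    return (times_entered, pid)

# two processes with the same usage count: the pid breaks the tie
assert priority(4, 2) < priority(4, 7)
# a less frequent user wins regardless of pid
assert priority(3, 9) < priority(4, 1)
```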

Control Structures for Distributed Synchronization

In all of these distributed synchronization algorithms, each process needs to respond immediately to some interprocess messages, while it must block awaiting others. For the bakery algorithm and other more complex mutual exclusion algorithms, we are essentially forced into using an agent or local mutual exclusion server to handle the network aspects of the algorithm, although we could replace the agent process with an interrupt service routine or signal handler, or, if we are willing to write clumsy code, we could have the application poll the incoming request channel. However we do this, the application and the logical code of the agent communicate primarily through shared variables and synchronization mechanisms, as illustrated here:

       outgoing requests                ___________
    <----------------------------------|           |
       incoming replies                |           |
    ---------------------------------->|Application|
                  _________     ___    |  Process  |
       incoming  |         |   |   |   |           |
       requests  |  Agent  |   | V |   |           |
    ------------>|         |===| a |===|           |
    <------------|         |===| r |===|           |
       outgoing  |         |   | s |   |           |
       immediate |         |   |   |   |           |
       replies   |_________|   |___|   |           |
                                       |           |
       outgoing deferred replies       |           |
    <----------------------------------|___________|
The agent process (or interrupt or signal handler, in some designs) always handles any immediate reaction to an incoming request and any details of the fault tolerance protocol. It either replies immediately or it logs the request in the variables it shares with the user process.

The user process manipulates the variables, and it directly makes requests to the agents of remote processes and awaits the replies. The user process also sends deferred replies when the time comes to leave the critical section.

Data Structures for Ricart and Agrawala's algorithm

The following data structures are needed:

State
what is the state of the user process: normal, awaiting replies, or in the critical section. State must change from normal to awaiting replies, after Priority is set, before sending out the request to enter.

Priority
what is the priority of the user process.

Replies Needed
how many replies have yet to arrive after sending out the requests (this is not needed by the agent process, just the user).

Pending Replies
the set of replies to requests received by the agent but not immediately responded to. These replies will be sent by the application process on exit from the critical section.

It is instructive to look at the code for this mutual exclusion algorithm when it is packaged in the form of an agent process and user callable routines for entry and exit from a critical section -- try writing pseudocode for these.

Exercise: What mutual exclusion do the user and agent need with respect to the variables they share?

Exercise: How would you synchronize the agent with the application process if the agent code were executed as an interrupt service routine or signal handler? (Your answer must account for the nature of the synchronization required between the agent and application.)