17. Deadlock

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

Deadlock occurs when two or more processes are blocked, where each process is blocked awaiting some action by another blocked process.

A process might be blocked awaiting a message from another process, or it might be blocked awaiting a resource currently held by another process. If all the processes in a deadlocked group of processes are blocked awaiting messages, the deadlock is described as a communication deadlock; if all processes in a deadlocked group are awaiting resources, it is a resource deadlock. Mixed cases where some processes in the group are awaiting resources and others are awaiting messages are quite possible.

In either case, whether resources or messages (or both) are involved, the key element in any deadlock is that there is a cyclic wait. In some older texts, the possibility of message passing is completely ignored, and deadlocks are defined entirely in terms of resource contention.

Resource Deadlocks

Deadlocks resulting from resource allocation have been widely studied since the late 1960's. Three distinct issues have been raised in this context: deadlock detection, deadlock resolution, and deadlock avoidance.

Once a deadlock is detected, it must somehow be resolved. Deadlock resolution is frequently destructive -- only fault tolerant applications can survive it. Therefore, if it is possible, deadlock avoidance is better.

Deadlock Detection

Since a deadlock involves a cycle of processes waiting for something, the standard algorithms for deadlock detection convert the problem into graph theoretic terms and then use various cycle finding algorithms on the resulting graph.

The following notation is traditionally used in graph theoretic formulations of the deadlock detection problem:

   Allocated        Process that
   resource         has the resource
      _                _
     |_|------------->(_)

   Waiting          Resource the
   process          process waits for
      _                _
     (_)------------->|_|

   Waiting          Process that
   process          should send message
      _                _
     (_)------------->(_)

The graph has vertices for each process (round) and resource (square). The graph has edges representing the "waits for" relation. A process can wait for a resource, a resource can wait to be released by a process, and (in message passing systems) a process can wait for a message from some other process. The latter is a modern extension to this old notation.

Here is an example "wait for graph" describing 4 processes, A, B, C and D, and 4 resources, X, Y, Z and W. A is waiting for X, but B has both X and Z. C has W and is waiting for Y, and D has Y and is waiting for W. Does a deadlock exist?

            Processes     Resources
                _            _
              A(_)--------->|_|X
                         __/
                        /
                _      /     _
              B(_)<===    ->|_|Y
                       \/   /
                       /\  /  
                _ ____/  \/  _
              C(_)<-     /\_|_|Z
                     \  /
                      \/
                _     /\____ _
              D(_)<--    -->|_|W
                 \_____/

There is a deadlock involving processes C and D and resources W and Y.

Detecting a cycle in a graph involves traversing the cycle! Thus, the cost of deadlock detection is related to the number of vertices in the cycle, which could be as large as the number of vertices in the graph. As a result, deadlock detection algorithms scale poorly, growing in expense as the system grows in size.

As long as the only wait operations involve processes waiting on resources, the deadlock graph is formally a bipartite graph, that is, a graph where there are two sets of vertices (round and square, here), and where every edge connects a vertex in one set with a vertex in the other. If processes may wait for messages, the graph is not bipartite, but this has no effect on the relevant algorithms.

A Review of Elementary Graph Algorithms

Basic graph algorithms are a standard part of undergraduate courses on data structures!

A graph consists of a set of vertices (processes and resources, in our example), and a set of edges (the arrows connecting processes and resources in our example). In general, edges need not have direction associated with them, but in our example application, we are only interested in directed graphs, where edges have direction.

Many of the basic questions we will need to resolve about graphs rest on the same basic meta-algorithm for graph traversal:

    Initially, all vertices and edges are unmarked
    R is some vertex, to be used as the root
    Mark R
    S = { R }
    Repeat
        Remove a vertex V from S
        For each V' directly reachable from V,
            If V' is not marked
                mark V'
                put V' in S
                mark the edge <V,V'>
            endif
        endfor
    Until S is empty

At any point during the execution of this meta-algorithm, the set S contains the vertices "on the frontier" of the graph being explored. Marked vertices that are no longer in S have been fully explored, while unmarked vertices remain to be visited.

The nature of the markings put on vertices and edges leads to variations in the results obtained from this meta-algorithm. The order in which vertices are removed from S is the other primary variable. If S is managed on a LIFO basis, this algorithm performs a depth-first traversal of the graph, following the longest possible path from the root before returning to visit other parts of the graph. If S is managed on a FIFO basis, this algorithm performs a breadth-first traversal of the graph, visiting all vertices reachable from vertices already in S before visiting others beyond them. It is also possible to manage S as a priority queue, using, for example, the value of the marking on each vertex as a priority.
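The meta-algorithm above can be sketched directly in Python. This is a minimal illustration, not a production implementation: the graph is an adjacency dictionary, and the set S is a deque, so popping from one end gives a depth-first traversal and popping from the other gives a breadth-first one.

```python
from collections import deque

def traverse(graph, root, order='depth'):
    """Graph traversal meta-algorithm. Returns the set of marked
    (reached) vertices and the list of marked edges; the marked
    edges form a spanning tree of the part reachable from root."""
    marked = {root}
    tree_edges = []
    frontier = deque([root])          # the set S of the pseudocode
    while frontier:
        # LIFO (pop) gives depth-first; FIFO (popleft) gives breadth-first
        v = frontier.pop() if order == 'depth' else frontier.popleft()
        for succ in graph.get(v, []):
            if succ not in marked:
                marked.add(succ)          # mark V'
                frontier.append(succ)     # put V' in S
                tree_edges.append((v, succ))  # mark the edge <V,V'>
    return marked, tree_edges

# A small example graph (names are illustrative)
g = {'R': ['A', 'B'], 'A': ['C'], 'B': ['C'], 'C': []}
reached, tree = traverse(g, 'R')
# reached is {'R', 'A', 'B', 'C'}; tree has 3 edges, a spanning tree
```

Note that vertex C is reachable along two paths, but only one of the two edges into it is marked, which is exactly why the marked edges form a tree rather than a general subgraph.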

The set of marked edges produced by this algorithm is always a tree that spans every vertex in that part of the graph reachable from the root vertex R. That is, it is a spanning tree of the reachable part of the graph. Not all parts of all graphs are reachable from any vertex, so this is not always a spanning tree of the whole graph. Note that, depending on the traversal scheme used (that is, depending on the management of the set S) many different spanning trees may be produced by the same meta-algorithm (unless, of course, the graph is itself tree structured, in which case there is never more than one possible spanning tree, and that is the graph itself).

NOTE: Other variants on the above meta-algorithm will show up repeatedly in the remainder of this course! The subject of spanning trees will also show up again and again!

Given a directed graph rooted at vertex R, it is not hard to modify the depth-first recursive tree traversal algorithm to traverse the graph and search for cycles. The modified algorithm only marks those edges and vertices on the path from the root to the leaf currently being examined -- thus, as it follows a path through the tree, it marks edges and vertices, and as it backtracks in its traversal, it unmarks edges and vertices. If this modified algorithm ever encounters a marked vertex as it walks deeper into the graph (as opposed to encountering such a vertex on backtracking), it has found a cycle reachable from the root R.
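A minimal sketch of this backtracking cycle test, written here in Python: vertices on the current root-to-leaf path are marked in a set, and unmarked on backtracking. The graph encodes the earlier wait-for example, with edges running from each waiter to what it waits for and from each resource to its holder.

```python
def cycle_reachable(graph, root):
    """Depth-first walk that marks vertices on the current path and
    unmarks them on backtracking; meeting an already-marked vertex
    while walking deeper means a cycle is reachable from root."""
    on_path = set()
    def walk(v):
        if v in on_path:              # walked into a marked vertex: cycle
            return True
        on_path.add(v)                # mark on the way down
        found = any(walk(succ) for succ in graph.get(v, []))
        on_path.discard(v)            # unmark on backtracking
        return found
    return walk(root)

# The example wait-for graph: A waits for X; B holds X and Z;
# C holds W and waits for Y; D holds Y and waits for W.
g = {'A': ['X'], 'X': ['B'], 'Z': ['B'], 'B': [],
     'C': ['Y'], 'Y': ['D'], 'D': ['W'], 'W': ['C']}

print(cycle_reachable(g, 'A'))   # False: A's chain ends at running process B
print(cycle_reachable(g, 'C'))   # True: C, Y, D and W form a cycle
```

Because marks are removed on backtracking, a vertex may be visited once per path rather than once overall; that is acceptable in a sketch, but a practical detector would also keep a permanent "fully explored" mark to avoid re-walking shared subgraphs.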

Deadlock Graphs

It is worth noting that the details of the system being studied determine the nature of the graph algorithm needed for deadlock detection! In a resource model, where processes wait for resources, if no process may wait for more than one resource at a time, the number of outgoing edges from any vertex will never exceed one, so the deadlock graph has a trivial structure -- it consists of a number of chains or loops. This special case is called the single resource model.

   Running
   process
      _
     (_)                   Running
                           process
   Waiting                that owns 
   process    Resource     resource
      _           _           _
     (_)-------->|_|-------->(_)

                           Waiting                 Running
                           process                 process
   Waiting                that owns    Another     everyone
   process    Resource     resource    resource   waits for
      _           _           _           _           _
     (_)-------->|_|-------->(_)-------->|_|-------->(_)

In this simple case, deadlock detection doesn't involve a general graph algorithm. The constraint that no process waits for more than one resource at a time translates to the constraint that no process vertex in the graph has more than one outgoing edge. Similarly, the constraint that no resource is owned by more than one process translates to the constraint that no resource vertex has more than one outgoing edge. As a result, we can detect deadlocks by following the single chain of outgoing edges from a blocked process, marking vertices as we go; if the chain has an open end, the process is not deadlocked; if we encounter a marked vertex, there is a deadlock. In any case, we always finish by unmarking all the vertices we marked.
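The chain-following test takes only a few lines. In this sketch, wait_for maps each vertex to the single vertex it waits for (or is held by); a vertex with no outgoing edge is simply absent. The process and resource names are hypothetical.

```python
def deadlocked(wait_for, process):
    """Single resource model: every vertex has at most one outgoing
    edge, so detection is just following the chain from a blocked
    process, marking vertices as we go."""
    marked = set()
    v = process
    while v is not None:
        if v in marked:               # returned to a marked vertex
            return True               # the chain closes: deadlock
        marked.add(v)
        v = wait_for.get(v)           # follow the one outgoing edge
    return False                      # the chain has an open end

# P1 waits for R1, held by P2, which waits for R2, held by P1.
chain = {'P1': 'R1', 'R1': 'P2', 'P2': 'R2', 'R2': 'P1'}
print(deadlocked(chain, 'P1'))    # True

# P3 waits for R3, held by P4, which is running: no deadlock.
open_chain = {'P3': 'R3', 'R3': 'P4'}
print(deadlocked(open_chain, 'P3'))   # False
```

The marked set here plays the role of the vertex marks in the prose description; a real implementation would unmark (or simply discard) them when the test finishes.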

Note that the unconstrained resource model allows a process to wait for a number of resources, for example, using the operation block P until all resources in set S can be claimed. If the set S contains resources X and Y, we can replace this general wait operation by block P until X can be claimed followed by block P until Y can be claimed. This allows many more general deadlock problems to be translated into single-resource deadlock problems where deadlocks can be detected using this trivial chain-following deadlock detection algorithm.

Unfortunately, in some systems of processes, the straightforward translation from the more general primitive to the single-resource primitive will sometimes convert a system that is deadlock free into a system that is prone to deadlock. The reason for this is that the block P until all resources in set S can be claimed primitive claims none of the resources in S until it can claim all of them, while the sequential claim one at a time approach will have some states in which a process holds a claim on some subset of the resources in S. If two processes are competing for the same set of resources and they claim them in different orders, there will be a deadlock under the single resource model that would not have been present under the more general model.

The most common application of the single resource model in real systems is in database systems, where resources correspond to database records that must be claimed before some update can be performed. Some database systems use exactly this trivial chain following algorithm to detect deadlocks.

Deadlock Resolution

Once a deadlock is detected, an important question must be answered: what should be done about it? The classic solution for resource models is to break the cycle of the deadlock by aborting one of the processes and freeing its resources. This is only applicable to systems using a resource model, and of course, it is only applicable in systems where we can afford to abort processes!

What process should be aborted? Ideally, the process to be aborted should be the one where that act will cause the least damage, that is, the least critical process. Few systems provide information sufficient to determine how critical a process is, however. The common alternative of aborting the process with the lowest priority approximates this, since high priority processes are frequently more critical than low priority processes, but this isn't necessarily so. Priority depends on real-time constraints! The audio entertainment process must meet tight real time constraints but it isn't very critical compared to monitoring the level of fuel in the fuel tank, something that is only done occasionally but is very important.

An alternative is to delete the process that closed the cycle, the last process to block in the group of processes involved in the deadlock. Unfortunately, just because a process was the one that completed the cycle does not mean that it is the least critical or the most appropriate victim.

If a process can be coded to anticipate the possibility of allocation failures, for example, by having it deallocate resources it has already claimed and begin again, we can have the resource claiming operation return a failure code if that operation would have closed the cycle in the resource graph, leading to deadlock. This approach is taken in several database managers.
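One way to sketch such a claim operation under the single resource model: before recording that a process waits for a resource, follow the existing chain from that resource; if it leads back to the claiming process, the claim would close a cycle, so the operation returns failure instead of blocking. The function name try_claim and the process/resource names are illustrative, not the API of any particular database manager.

```python
def try_claim(wait_for, process, resource):
    """Refuse a claim that would close a cycle in the wait-for graph.
    On failure the caller is expected to release the resources it
    already holds and begin again."""
    v = resource
    while v is not None:
        if v == process:              # the chain leads back to us:
            return False              # granting would create a deadlock
        v = wait_for.get(v)           # follow the chain of waits/holds
    wait_for[process] = resource      # safe: record that we now wait
    return True

# R1 is held by P2, which waits for R2, which is held by P1.
wf = {'R1': 'P2', 'P2': 'R2', 'R2': 'P1'}
print(try_claim(wf, 'P1', 'R1'))   # False: would close a cycle
print(try_claim(wf, 'P3', 'R1'))   # True: P3 may safely wait
```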

The classic solution of aborting a process is clearly not oriented towards communication deadlocks! Aborting a process that is waiting for a message does nothing to eliminate the deadlock, since the only way to resolve the deadlock may be for that same process to send a message. Aborting a process is also very undesirable in critical applications, where all processes are expected to run forever or until the power fails.

Deadlock Avoidance

If deadlock recovery is difficult, we are far better off if we can construct our applications in such a way that we can prove that they are deadlock free. This may be done statically, or it may be done, in some more complex settings, by dynamic algorithms.

There is a trivial solution to the problem of deadlock avoidance in a pure resource allocation context: All you need to do is prohibit processes from incrementally allocating resources.

This trivial solution has been widely used, particularly back in the days of batch operating systems taking input from a stream of punched cards. These systems typically required each process to declare all resources it might want as a field of the job card -- the head card on the card deck. This is clumsy, and it requires that the process hold all resources it might need for the duration of its run even if it only needs some of them briefly.

Claiming Resources in a Strict Order Prevents Deadlock

A less trivial solution to the deadlock avoidance problem is to require that all resources be acquired in a strictly predetermined order. For example, any process that requires both the modem and the laser printer must acquire the modem first and the printer second. As long as this rule is enforced, it will be possible to guarantee that the system is deadlock free with regard to these two resources because it will be impossible for one process to claim the modem and then request the printer while the other claims the printer and then requests the modem.

It is always possible to impose a total ordering over all resources. Any ordering will suffice, so long as all processes adhere to it. For example, resources may be ordered in alphabetical order by resource class, and then in numerical order, within each resource class, by resource address. Many applications can be rewritten to comply with such an ordering constraint, but many others cannot.
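The ordering discipline is easy to centralize in code. The sketch below, using Python locks as stand-in resources, assigns each resource a rank in the global order and always acquires in increasing rank, no matter what order the caller listed the resources in; the rank table and function name are illustrative.

```python
import threading

modem = threading.Lock()
printer = threading.Lock()

# One global total ordering over all resources; every process
# must claim resources in increasing rank, never the reverse.
rank = {id(modem): 0, id(printer): 1}

def claim_in_order(*locks):
    """Acquire the given locks in the globally agreed order,
    regardless of the order the caller listed them in, so no
    cyclic wait can ever form."""
    for lock in sorted(locks, key=lambda l: rank[id(l)]):
        lock.acquire()

claim_in_order(printer, modem)   # actually acquires the modem first
# ... use both resources ...
modem.release()
printer.release()
```

Because every process sorts its requests the same way, one process can never hold the printer while requesting the modem: the cyclic wait needed for deadlock is impossible by construction.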