38. Distributed Shared Memory

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

One of the interesting ideas that was fully developed in experimental use of Mach is support for shared segments on a distributed system. The key to this development is that the Mach kernel does not handle page faults itself; instead, it hands each fault to the appropriate exception handler, outside the kernel. This allowed a number of new approaches to page fault handling to be explored, including the construction of fault handlers that allow segments to be shared across a network.

The basic idea is most easily understood if we assume that there are only two states for each page of the virtual address space: The page is either associated with a page frame, allowing read-write access, or the page is not associated with a page frame, so any attempt to access the page will produce an exception.

Given this minimal virtual addressing model, shared segments can be implemented on a network as follows:

- At any instant, each page of the shared segment is mapped into a page frame on exactly one machine; on every other machine, that page is unmapped.
- When a process touches a page that is not mapped locally, the resulting fault is delivered to the fault handler outside the kernel, and that handler sends a request for the page over the network.
- The machine currently holding the page replies with the page contents and then unmaps its own copy.
- When the reply arrives, the faulting machine's handler copies the contents into a local page frame, maps the page read-write, and allows the faulting process to resume.
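
To make the mechanism concrete, here is a rough sketch of such a fault handler in C. This is not the real Mach external-pager interface; every function named here (dsm_request_page, dsm_await_page, dsm_send_page, and the mapping calls) is a hypothetical stand-in for whatever kernel interface and network transport the handler would actually use.

    /* Sketch of the one-copy-per-page fault handler; all primitives
       here are hypothetical stand-ins, not real Mach calls. */

    extern char *alloc_frame(void);                 /* get a free page frame  */
    extern void map_page_rw(int page, char *frame); /* map page read-write    */
    extern void unmap_page(int page);               /* invalidate our mapping */

    extern void dsm_request_page(int page);             /* ask the holder     */
    extern void dsm_await_page(int page, char *buffer); /* wait for contents  */
    extern void dsm_send_page(int to, int page);        /* answer a request   */

    /* invoked, outside the kernel, when a local process faults */
    void handle_fault(int page)
    {
        char *frame = alloc_frame();

        dsm_request_page(page);       /* ask whoever holds the page */
        dsm_await_page(page, frame);  /* holder replies with the contents */

        map_page_rw(page, frame);     /* install it; the faulting process
                                         can now be resumed */
    }

    /* invoked when another machine requests a page we hold */
    void handle_request(int page, int requester)
    {
        dsm_send_page(requester, page);  /* ship the contents over the net */
        unmap_page(page);                /* give up our only copy; our own
                                            next access will fault in turn */
    }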

Note that, from the point of view of a user of the shared segment, the performance of this scheme will be similar to using demand paged virtual memory for that segment, with pages copied to and from disk. The difference is that the pages of the segment are not stored on backing storage; rather, they are stored in other copies of the segment on other machines. As a result, when a page is not in the local physical segment, it may be changed by some other process running on some other machine.

Improved Performance

The simple scheme outlined above allows only one copy of each page of a shared segment. In real shared memory applications, most accesses to shared variables are reads; only occasionally does a process change a variable's value. This simple implementation of shared segments supports this common access pattern very poorly!

The system would perform better if multiple copies of a page could be created when multiple processes are reading that page. This can be done! The solution involves use of read-only protection for duplicated pages, as follows:

- On each machine, a page of the shared segment is now in one of three states: invalid (unmapped), read-only, or read-write.
- At any instant, either exactly one machine holds a read-write copy of a page and the page is invalid everywhere else, or any number of machines hold read-only copies.
- On a read fault, the fault handler fetches a copy of the page from some machine holding it and maps that copy read-only; the machine supplying the copy downgrades its own mapping to read-only if it was read-write.
- On a write fault, that is, a write to a read-only or invalid page, the fault handler obtains a current copy if it has none, sends "invalidate" messages to all other machines holding copies, waits for their replies, and only then maps the page read-write.

This model allows any number of copies of a page to be shared, so long as no process attempts to write that page. As soon as a process writes a page, all other copies of that page will be invalidated, and future attempts to read that page will cause new copies of the updated page to be sent over the net to the machines where the reads are being done.
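
One way to see the improved protocol is as a small per-page state machine in each machine's fault handler. The sketch below follows the rules just described; the state names and helper functions are invented for illustration, not part of any real interface.

    /* Hypothetical per-page state machine for the read-duplication
       protocol; the helper functions are stand-ins for the transport. */

    #define NPAGES 1024

    enum page_state { INVALID, READ_ONLY, READ_WRITE };
    enum page_state state[NPAGES];

    extern void fetch_read_copy(int page);       /* map a fresh copy read-only  */
    extern void broadcast_invalidate(int page);  /* tell all other copy holders */
    extern void await_invalidate_acks(int page); /* collect "invalidated" acks  */
    extern void remap_rw(int page);              /* upgrade local copy to r/w   */

    void handle_read_fault(int page)
    {
        fetch_read_copy(page);       /* any number of read-only copies
                                        may coexist peacefully */
        state[page] = READ_ONLY;
    }

    void handle_write_fault(int page)
    {
        if (state[page] == INVALID)
            fetch_read_copy(page);   /* we need the current contents first */

        broadcast_invalidate(page);  /* all other copies must go ... */
        await_invalidate_acks(page); /* ... before we may write */

        remap_rw(page);
        state[page] = READ_WRITE;
    }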

Write Conflicts

There is one problem with the protocol outlined above: What if two machines attempt simultaneous writes to a shared page? As written above, the result will be that the fault handler on each machine sends "invalidate" messages to the other, and all copies of the page will be invalidated!

We can fix this by adding a conflict resolution rule: If fault handler A receives an "invalidate page x" message from B after it has sent its own "invalidate page x" messages and while it is still awaiting replies saying "invalidated", it compares the unique ID of A with that of B. If A has the winning ID, A does not invalidate page x, and replies saying "not invalidated". If B has the winning ID, A invalidates page x and replies normally. This guarantees that, in case of a conflict between two machines, exactly one machine will retain a read-write copy of the page.
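
In code, the rule reduces to an ID comparison in the handler for incoming invalidate messages. Here is a sketch, assuming (arbitrarily) that the lower machine ID wins; the names are again hypothetical.

    /* Resolving an invalidate that arrives while our own invalidate
       for the same page is still in flight; lower ID wins (an
       arbitrary but fixed convention). */

    extern int my_id;                        /* this machine's unique ID     */
    extern int invalidate_pending(int page); /* our own request outstanding? */
    extern void unmap_page(int page);
    extern void send_reply(int to, const char *text);

    void handle_invalidate(int page, int sender)
    {
        if (invalidate_pending(page) && my_id < sender) {
            /* conflict, and we win: keep our copy */
            send_reply(sender, "not invalidated");
        } else {
            /* no conflict, or we lose: comply normally */
            unmap_page(page);
            send_reply(sender, "invalidated");
        }
    }

Presumably the loser, on seeing "not invalidated", abandons its own attempt and simply re-runs its write fault, which will now fetch the page from the winner.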

Connections to Other Work

The idea outlined above was originally invented as a hardware algorithm for cache coherency in shared memory machines that use snooping cache technology. Such machines are in widespread use today! In these machines, a local cache is implemented, in hardware, for each CPU, and the only memory accesses that use the shared main memory are those that result from cache misses. The problem of maintaining a coherent view of shared memory through all of these caches is called the cache coherency problem, and the solution outlined above is one example of a cache coherency protocol.

Granularity and Performance

If processes use this scheme to work with very small shared objects, for example, LISP-style objects with just two pointers each, with no organization, so that objects in use by processes on different machines end up packed into the same pages, it is easy to imagine extremely poor performance as a result! The performance is at its worst if multiple processes running on different machines attempt concurrent changes to objects in the same page. This implementation of sharing gets the semantics right, but clearly, the performance will not equal physical shared memory under these usage patterns.

On the other hand, this model will work well for shared memory applications where most data objects are on the order of a page in size, and where objects remain in use by one process for a long time before being modified by another. In this case, the model should lead to performance not substantially different from that of a message passing version of the same underlying distributed algorithm.
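
The granularity problem is, in effect, false sharing at page granularity, and the usual defense applies: arrange the data so that objects written by different machines never share a page. The sketch below uses the standard C idiom of padding and aligning objects to the page size; nothing about it is specific to this system, and the page size shown is just an assumed value.

    #include <stdlib.h>

    #define PAGE_SIZE 4096   /* assumed page size */

    /* one heavily-written shared object, padded to fill a whole page
       so concurrent writers on different machines never collide */
    struct shared_counter {
        long value;
        char pad[PAGE_SIZE - sizeof(long)];
    };

    /* allocate n counters, page-aligned, exactly one per page */
    struct shared_counter *alloc_counters(int n)
    {
        void *p;
        if (posix_memalign(&p, PAGE_SIZE,
                           (size_t)n * sizeof(struct shared_counter)))
            return NULL;
        return (struct shared_counter *)p;
    }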

Curiously, this model works moderately well for shared spin locks! The process that is repeatedly polling the lock will hold a read-only copy of the memory page holding that lock. When some other process releases the lock, it will attempt to modify its own copy, causing a fault whose handler invalidates the poller's read-only copy. The next iteration of the spin lock's polling loop will then cause a page fault, and this fault will request an updated copy of the lock. The net cost per signal operation under this scheme, from the point of view of the network, is therefore one small invalidate message, one small request for the page contents, and one larger transmission of the page contents. This is not too different from the cost of synchronous message passing.
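
For concreteness, here is what such a spin lock might look like. The atomic_swap primitive is a stand-in for whatever test-and-set or swap instruction the hardware provides; the page faults, invalidations, and copies described above all happen invisibly beneath these ordinary-looking loads and stores.

    /* A spin lock stored in a distributed shared segment; sketch only.
       atomic_swap is a hypothetical wrapper for a hardware swap
       instruction. */

    extern int atomic_swap(volatile int *addr, int value);

    void acquire(volatile int *lock)
    {
        for (;;) {
            while (*lock != 0)
                ; /* polls hit the local read-only copy: no net traffic */

            /* the swap is a write, so it invalidates all other copies
               and leaves this machine with the sole read-write copy */
            if (atomic_swap(lock, 1) == 0)
                return;
        }
    }

    void release(volatile int *lock)
    {
        /* this write invalidates the pollers' read-only copies; each
           poller's next read faults and fetches the updated page */
        *lock = 0;
    }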