38. Distributed Shared Memory

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

One of the interesting ideas that was fully developed in experimental use of Mach is support for shared segments on a distributed system. The key to this development is that the Mach kernel does not handle page faults itself; instead, it hands each fault to the appropriate exception handler, outside the kernel. This allowed a number of new approaches to page fault handling to be explored, including the construction of fault handlers that allow segments to be shared across a network.

The basic idea is most easily understood if we assume that there are only two states for each page of the virtual address space: The page is either associated with a page frame, allowing read-write access, or the page is not associated with a page frame, so any attempt to access the page will produce an exception.

Given this minimal virtual addressing model, shared segments can be implemented on a network as follows:

- At any instant, each page of the shared segment is mapped into a page frame on exactly one machine; on every other machine, that page is unmapped.
- When a process touches a page that is not mapped locally, the resulting fault is delivered to the fault handler outside the kernel, and that handler sends a request for the page over the network.
- The machine currently holding the page replies with the page contents and then unmaps its own copy.
- When the reply arrives, the faulting machine's handler copies the contents into a local page frame, maps the page read-write, and allows the faulting process to resume.
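
To make the mechanism concrete, here is a rough sketch of such a fault handler in C. This is not the real Mach external-pager interface; every function named here (dsm_request_page, dsm_await_page, dsm_send_page, and the mapping calls) is a hypothetical stand-in for whatever kernel interface and network transport the handler would actually use.

    /* Sketch of the one-copy-per-page fault handler; all primitives
       here are hypothetical stand-ins, not real Mach calls. */

    extern char *alloc_frame(void);                 /* get a free page frame  */
    extern void map_page_rw(int page, char *frame); /* map page read-write    */
    extern void unmap_page(int page);               /* invalidate our mapping */

    extern void dsm_request_page(int page);             /* ask the holder     */
    extern void dsm_await_page(int page, char *buffer); /* wait for contents  */
    extern void dsm_send_page(int to, int page);        /* answer a request   */

    /* invoked, outside the kernel, when a local process faults */
    void handle_fault(int page)
    {
        char *frame = alloc_frame();

        dsm_request_page(page);       /* ask whoever holds the page */
        dsm_await_page(page, frame);  /* holder replies with the contents */

        map_page_rw(page, frame);     /* install it; the faulting process
                                         can now be resumed */
    }

    /* invoked when another machine requests a page we hold */
    void handle_request(int page, int requester)
    {
        dsm_send_page(requester, page);  /* ship the contents over the net */
        unmap_page(page);                /* give up our only copy; our own
                                            next access will fault in turn */
    }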

Note that, from the point of view of a user of the shared segment, the performance of this scheme will be similar to using demand paged virtual memory for that segment, with pages copied to and from disk. The difference is that the pages of the segment are not stored on backing storage; rather, they are stored in other copies of the segment on other machines. As a result, when a page is not in the local physical segment, it may be changed by some other process running on some other machine.

Improved Performance

The simple scheme outlined above allows only one copy of each page of a shared segment. In real shared memory applications, most accesses to shared variables are reads; only occasionally does a process change a variable's value. This simple implementation of shared segments supports this common access pattern very poorly!

The system would perform better if multiple copies of a page could be created when multiple processes are reading that page. This can be done! The solution involves use of read-only protection for duplicated pages, as follows:

- On each machine, a page of the shared segment is now in one of three states: invalid (unmapped), read-only, or read-write.
- At any instant, either exactly one machine holds a read-write copy of a page and the page is invalid everywhere else, or any number of machines hold read-only copies.
- On a read fault, the fault handler fetches a copy of the page from some machine holding it and maps that copy read-only; the machine supplying the copy downgrades its own mapping to read-only if it was read-write.
- On a write fault, that is, a write to a read-only or invalid page, the fault handler obtains a current copy if it has none, sends "invalidate" messages to all other machines holding copies, waits for their replies, and only then maps the page read-write.

This model allows any number of copies of a page to be shared, so long as no process attempts to write that page. As soon as a process writes a page, all other copies of that page will be invalidated, and future attempts to read that page will cause new copies of the updated page to be sent over the net to the machines where the reads are being done.
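
One way to see the improved protocol is as a small per-page state machine in each machine's fault handler. The sketch below follows the rules just described; the state names and helper functions are invented for illustration, not part of any real interface.

    /* Hypothetical per-page state machine for the read-duplication
       protocol; the helper functions are stand-ins for the transport. */

    #define NPAGES 1024

    enum page_state { INVALID, READ_ONLY, READ_WRITE };
    enum page_state state[NPAGES];

    extern void fetch_read_copy(int page);       /* map a fresh copy read-only  */
    extern void broadcast_invalidate(int page);  /* tell all other copy holders */
    extern void await_invalidate_acks(int page); /* collect "invalidated" acks  */
    extern void remap_rw(int page);              /* upgrade local copy to r/w   */

    void handle_read_fault(int page)
    {
        fetch_read_copy(page);       /* any number of read-only copies
                                        may coexist peacefully */
        state[page] = READ_ONLY;
    }

    void handle_write_fault(int page)
    {
        if (state[page] == INVALID)
            fetch_read_copy(page);   /* we need the current contents first */

        broadcast_invalidate(page);  /* all other copies must go ... */
        await_invalidate_acks(page); /* ... before we may write */

        remap_rw(page);
        state[page] = READ_WRITE;
    }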

Write Conflicts

There is one problem with the protocol outlined above: What if two machines attempt simultaneous writes to a shared page? As written above, the result will be that the fault handler on each machine sends "invalidate" messages to the other, and all copies of the page will be invalidated!

We can fix this by adding a conflict resolution rule: If fault handler A receives an "invalidate page x" message from B after it has sent its own "invalidate page x" messages and while it is still awaiting replies saying "invalidated", it compares the unique ID of A with that of B. If A has the winning ID, A does not invalidate page x, and replies saying "not invalidated". If B has the winning ID, A invalidates page x and replies normally. This guarantees that, in case of a conflict between two machines, exactly one machine will retain a read-write copy of the page.
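
In code, the rule reduces to an ID comparison in the handler for incoming invalidate messages. Here is a sketch, assuming (arbitrarily) that the lower machine ID wins; the names are again hypothetical.

    /* Resolving an invalidate that arrives while our own invalidate
       for the same page is still in flight; lower ID wins (an
       arbitrary but fixed convention). */

    extern int my_id;                        /* this machine's unique ID     */
    extern int invalidate_pending(int page); /* our own request outstanding? */
    extern void unmap_page(int page);
    extern void send_reply(int to, const char *text);

    void handle_invalidate(int page, int sender)
    {
        if (invalidate_pending(page) && my_id < sender) {
            /* conflict, and we win: keep our copy */
            send_reply(sender, "not invalidated");
        } else {
            /* no conflict, or we lose: comply normally */
            unmap_page(page);
            send_reply(sender, "invalidated");
        }
    }

Presumably the loser, on seeing "not invalidated", abandons its own attempt and simply re-runs its write fault, which will now fetch the page from the winner.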

Connections to Other Work

The idea outlined above was originally invented as a hardware algorithm for cache coherency in shared memory machines that use snooping cache technology. Such machines are in widespread use today! In these machines, a local cache is implemented, in hardware, for each CPU, and the only memory accesses that use the shared main memory are those that result from cache misses. The problem of maintaining a coherent view of shared memory through all of these caches is called the cache coherency problem, and the solution outlined above is one example of a cache coherency protocol.

Granularity and Performance

If processes use this scheme to work with very small shared objects, for example, LISP-style objects with just two pointers each, with no organization, so that objects in use by processes on different machines end up packed into the same pages, it is easy to imagine extremely poor performance as a result! The performance is at its worst if multiple processes running on different machines attempt concurrent changes to objects in the same page. This implementation of sharing gets the semantics right, but clearly, the performance will not equal physical shared memory under these usage patterns.

On the other hand, this model will work well for shared memory applications where most data objects are on the order of a page in size, and where objects remain in use by one process for a long time before being modified by another. In this case, the model should lead to performance not substantially different from that of a message passing version of the same underlying distributed algorithm.
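
The granularity problem is, in effect, false sharing at page granularity, and the usual defense applies: arrange the data so that objects written by different machines never share a page. The sketch below uses the standard C idiom of padding and aligning objects to the page size; nothing about it is specific to this system, and the page size shown is just an assumed value.

    #include <stdlib.h>

    #define PAGE_SIZE 4096   /* assumed page size */

    /* one heavily-written shared object, padded to fill a whole page
       so concurrent writers on different machines never collide */
    struct shared_counter {
        long value;
        char pad[PAGE_SIZE - sizeof(long)];
    };

    /* allocate n counters, page-aligned, exactly one per page */
    struct shared_counter *alloc_counters(int n)
    {
        void *p;
        if (posix_memalign(&p, PAGE_SIZE,
                           (size_t)n * sizeof(struct shared_counter)))
            return NULL;
        return (struct shared_counter *)p;
    }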

Curiously, this model works moderately well for shared spin locks! The process that is repeatedly polling the lock will hold a read-only copy of the memory page holding that lock. When some other process releases the lock, it will attempt to modify its own copy, causing a fault whose handler invalidates the poller's read-only copy. The next iteration of the spin lock's polling loop will then cause a page fault, and this fault will request an updated copy of the lock. The net cost per signal operation under this scheme, from the point of view of the network, is therefore one small invalidate message, one small request for the page contents, and one larger transmission of the page contents. This is not too different from the cost of synchronous message passing.
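
For concreteness, here is what such a spin lock might look like. The atomic_swap primitive is a stand-in for whatever test-and-set or swap instruction the hardware provides; the page faults, invalidations, and copies described above all happen invisibly beneath these ordinary-looking loads and stores.

    /* A spin lock stored in a distributed shared segment; sketch only.
       atomic_swap is a hypothetical wrapper for a hardware swap
       instruction. */

    extern int atomic_swap(volatile int *addr, int value);

    void acquire(volatile int *lock)
    {
        for (;;) {
            while (*lock != 0)
                ; /* polls hit the local read-only copy: no net traffic */

            /* the swap is a write, so it invalidates all other copies
               and leaves this machine with the sole read-write copy */
            if (atomic_swap(lock, 1) == 0)
                return;
        }
    }

    void release(volatile int *lock)
    {
        /* this write invalidates the pollers' read-only copies; each
           poller's next read faults and fetches the updated page */
        *lock = 0;
    }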