33. File Servers

Part of the 22C:116 Lecture Notes for Fall 2002
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Introduction

Consider a distributed system, with many machines scattered around a net. Some machines have disks, some have none. What are the alternative approaches for providing file service on this system?

There are two separate questions: First, how is the responsibility for disk I/O, file mapping and directory management divided between different clients and servers, and second, how is file service presented to users.

The answer to the first question may vary between a monolithic file server, where all details of file service, from directory management to disk I/O, are handled by the same server, to a fully distributed file server. The answer to the second question may range from servers that present a Unix-like model of disk I/O to those that support a stateless model of I/O.

Monolithic Servers

A monolithic file server can be implemented easily, for example, by taking a conventional file system, such as that of Unix or Windows, and simply offering its services to remote users. This requires a single server process that awaits requests for file service from remote users and then performs the indicated operations on the local file system.

The Unix NFS system, developed by Sun Microsystems, is a slightly modified variant on this theme. Under NFS, each client machine has its own file system, with an intermediate layer inserted between the local file system and the client application code. This intermediate layer interprets an added component of each open file descriptor that indicates whether the file is local or remote. For local files, the I/O requests are passed on to the local file system. For remote files, the NFS protocol is used to communicate with a remote server.
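The routing decision made by this intermediate layer can be sketched roughly as follows. This is only an illustrative sketch in Python: the names (OpenFile, nfs_read, the is_remote flag) are invented here and do not correspond to the real vnode or NFS client interfaces.

        # A rough sketch of the intermediate layer described above: each open
        # file descriptor carries an added flag saying whether the file is
        # local or remote, and each read is routed accordingly.
        import os

        def nfs_read(server, handle, offset, length):
            # stand-in for the network request an NFS client would generate;
            # a real client would marshal an NFS READ call here
            raise NotImplementedError("network transport not sketched")

        class OpenFile:
            def __init__(self, is_remote, local_fd=None, server=None, handle=None):
                self.is_remote = is_remote   # the added local-or-remote component
                self.local_fd = local_fd     # meaningful only for local files
                self.server = server         # server address (remote files only)
                self.handle = handle         # remote file handle (remote files only)

        def read(f, offset, length):
            if f.is_remote:
                return nfs_read(f.server, f.handle, offset, length)
            return os.pread(f.local_fd, length, offset)   # ordinary local read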

Furthermore, the NFS extensions to Unix allow a remote server to be mounted in the same way as any other file system may be mounted, by adding the tree of files supported by the remote server as a subtree of the local file hierarchy. In the extreme, diskless clients can be supported by mounting the remote server as the root of the local file system.

The disadvantages of this model of file service are many: First, files are pinned down to specific locations. This may be hidden from the users by naming conventions, but the constraints are still there. The NFS mount instruction says "mount file system from machine xyz on my file hierarchy as /xyz". In this context, opening "/xyz/x" is not much different from explicitly saying "open x on file server xyz", except that the identity of the server is deduced from the structure of the file tree.

This model prevents such things as automatic migration of files from one server to another, for example, in order to balance the load on the system. It also prevents automated redundant file storage, for example, for the sake of fault tolerance or to improve the availability of popular files.

Some NFS Details

An NFS client machine will typically have a device driver that looks like a normal local driver to its local clients, but instead of handling I/O by performing local disk operations, this driver handles I/O by generating network traffic to the NFS server.

The NFSmount service is used to attach an NFS server to some part of the file system tree of a local machine. Once the NFS server is mounted as a device on the local file system tree, the directory of that server becomes accessible as a subtree of the local directory. In the extreme case of diskless workstations, the NFSmount operation applies to the root of the whole file system.

Composite Servers

At the other extreme from a monolithic file server is the composite model where the job of providing file service is broken down into a number of subsidiary servers:

Disk I/O servers must be located on machines with disks; these provide the most rudimentary service, accepting network requests to read or write sectors on disk. These do not impose any file organization on the sectors! They may provide a degree of device independence, for example, by masking some issues of sector, track and cylinder structure, but if they do this, they must typically provide information to their clients to allow the clients to take into account the physical device structure, where that may impact performance.

Low-level file servers sit on top of disk I/O servers. A file server takes the raw storage of a disk I/O server and organizes it into files. The primary service it provides to its clients is mapping from virtual disk addresses, relative to particular files, to physical disk addresses, relative to particular disks.

File servers may use multiple disk servers, for example, to provide for fault tolerance by storing duplicate copies of files, or to provide for the storage of huge files by spreading them over multiple devices. In this model, a low-level file server does not handle directories (that is why we use the term low-level); instead, files are typically named using some low level construct equivalent to I-numbers on the Unix system.
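The central job of such a low-level file server, the virtual-to-physical mapping just described, can be sketched as follows. The sketch is in Python, and the flat per-file sector list is an invented simplification, not the structure any particular server used.

        # A sketch of the mapping a low-level file server provides: given a
        # file number and a file-relative (virtual) sector number, return the
        # disk server and physical sector that hold it.
        class LowLevelFileServer:
            def __init__(self):
                # file number -> list of (disk server id, physical sector number)
                self.file_map = {}

            def create(self, file_number):
                self.file_map[file_number] = []

            def append_sector(self, file_number, disk_id, physical_sector):
                self.file_map[file_number].append((disk_id, physical_sector))

            def translate(self, file_number, file_sector):
                # virtual, file-relative address -> physical, disk-relative address
                return self.file_map[file_number][file_sector]

        fs = LowLevelFileServer()
        fs.create(42)                                   # a new file, number 42
        fs.append_sector(42, disk_id=0, physical_sector=9071)
        print(fs.translate(42, 0))                      # prints (0, 9071)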

Directory servers arrange files, as delivered by various file servers, into directories. There is no inherent requirement that all files in a given directory reside on the same file server, and in fact, some directory servers take responsibility for load balancing by automatically moving files from overcrowded file servers to servers that have free space. Furthermore, a directory server may maintain multiple copies of a given file on multiple file servers, for example, in order to reduce contention for file and disk service for popular files.

Just as file servers must use some space on disk servers for storing the mapping information that describes where the sectors of each file are stored, the directory server must typically use some files on file servers to store directories.

The dialogue with a composite file system to read a file might proceed as follows:

  1. The user U asks directory server D to open the file named N
  2. D returns the tuple [F,X] to U; [F,X] is an open-file handle
  3. U asks file server F for the contents of sector S of file X
  4. F asks disk server DS for sector K, which it knows holds sector S of file X
  5. DS returns sector K to F; F may cache this sector
  6. F returns the contents of sector K, that is, sector S of file X, to U
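The same dialogue can be acted out in code. In the sketch below, the three servers are ordinary Python objects and the network is just method calls; the class names, file numbers and sector numbers are all invented for illustration.

        # An in-memory sketch of the dialogue above.
        class DiskServer:
            def __init__(self, nsectors):
                self.sectors = ["?"] * nsectors
            def read(self, k):
                return self.sectors[k]                   # step 5: return sector K

        class FileServer:
            def __init__(self, disk):
                self.disk = disk
                self.file_map = {7: [205, 318]}          # file 7 lives in sectors 205 and 318
            def read(self, x, s):
                k = self.file_map[x][s]                  # step 4: sector S of file X is sector K
                return self.disk.read(k)                 # step 6: pass the contents back

        class DirectoryServer:
            def __init__(self, file_server):
                self.file_server = file_server
                self.directory = {"N": 7}                # the name N maps to file number 7
            def open(self, name):
                return (self.file_server, self.directory[name])   # step 2: the handle [F,X]

        ds = DiskServer(1000)
        f = FileServer(ds)
        d = DirectoryServer(f)
        fserver, x = d.open("N")                         # steps 1 and 2: U asks D to open N
        data = fserver.read(x, 0)                        # step 3: U asks F for sector 0 of X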

Stateless versus "Stateful" Servers

Most programmers are used to dealing with file systems that maintain state information on behalf of users. Thus, for example, between the time a user opens a file and closes it, the file system is expected to maintain a record of the fact that that user has that file open. Furthermore, we expect this record to contain the current file position, so that the file behaves as a sequential file unless seek operations are done to change the sequence in which the data is read.

While programmers find it convenient for the file system to maintain state information, it poses some problems for the file system. If a user process terminates, the file system must learn of this termination in order to reclaim the storage occupied by state information, and there must be provisions for allocating this storage in the first place.

In a centralized system, these problems are easily solved, but in a distributed system, they become more severe. How can we guarantee that a file server learns of the termination or failure of a client for which it maintains state information? Where do we store the state information for large numbers of low-traffic clients that we would expect in very large distributed systems?

These problems can be avoided by using stateless servers. In a stateless server, the client is given the job of storing all state information. When a client opens a file, the server returns, to the client, the complete state needed to access the file. The client then presents this state information with every read and write request.

The necessary state information can be quite small. A directory server, for example, might respond to an open request by returning the network address of a file server and the I-node number needed by that file server. Each read and write request would then be directed to that file server, with the I-node number and file position provided as the necessary state information.

This raises new issues of network security, but other than that, it is quite straightforward to use. With a minimal layer of added local software, this scheme can even be made to look just like a conventional sequential file model, to the user. This layer allows the user to track the current position in the file, and relieves the server of any knowledge of such information.
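A minimal sketch of this arrangement follows, assuming the open-file handle is nothing more than an I-node number (the file server address is left out to keep the sketch in one process). The server keeps no per-client record; the thin client-side layer at the bottom is the added local software that tracks the current position.

        # A minimal sketch of stateless file access: every read carries the
        # I-node number and the file position, so the server remembers nothing
        # about its clients between requests.  All names are invented.
        class StatelessFileServer:
            def __init__(self, files):
                self.files = files                      # I-node number -> contents

            def read(self, inumber, position, length):
                # no open-file table and no current position on the server side
                return self.files[inumber][position:position + length]

        class SequentialFile:
            # the client-side layer: it, not the server, tracks the position
            def __init__(self, server, inumber):
                self.server = server
                self.inumber = inumber
                self.position = 0

            def read(self, length):
                data = self.server.read(self.inumber, self.position, length)
                self.position += len(data)
                return data

        server = StatelessFileServer({12: b"hello, world"})
        f = SequentialFile(server, 12)
        print(f.read(5))                                # b'hello'
        print(f.read(7))                                # b', world'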

Kernel Involvement

In NFS, all file system operations involve calls to the Unix kernel. If the file being accessed is local, the call is handled by the local I/O drivers. If the file being accessed is remote, the Unix kernel uses an I/O driver that initiates network communication with the server. Adding NFS support to a system, therefore, involves modifying the client-side kernel.

This need not be the case. With the Demos, Cambridge CAP or Xerox XDFS file servers, for example, file access was handled by user-level communication; the kernel was unaware that the user was doing file I/O, and the user-level communication primitives for using a local file were identical (except for the addressing) to the primitives used for access to remote files. The only kernel involvement on these systems involved interprocess communication, either on a single machine or over the net. On such systems, therefore, adding network file service does not involve adding new kernel code for file service, but only involves adding network interprocess communication.

Demos, for example

In the Demos file server, for example, to open a file, you send a message to the file server's public port saying "open f". The server creates a new private port and gives you a link to that port. This port is used only for operations on f, so you send "read x" to this port and get back the data held in sector x of the file, or you send "write x data" to the open file port to write sector x. Finally, you send "close" to the open file port to close the file.

This is not stateless. The server must create one open file port for each open file, and destroy that port when you close the file!
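The contrast with the stateless scheme can be made concrete with a loose sketch of this Demos-style protocol. Here a fresh dictionary key stands in for each private open-file port; the class name and message formats are invented and are only an analogy to the real Demos link mechanism.

        # A loose sketch of the stateful, Demos-style protocol: every open file
        # gets its own private "port", which the server must remember until the
        # client sends "close" to it.
        import itertools

        class DemosLikeFileServer:
            def __init__(self, files):
                self.files = files                      # file name -> list of sectors
                self.open_ports = {}                    # port -> the open file's sectors
                self.next_port = itertools.count(1)

            def public_port(self, request):
                op, name = request                      # the public port only understands "open f"
                assert op == "open"
                port = next(self.next_port)             # create a new private port
                self.open_ports[port] = self.files[name]   # state the server must keep
                return port

            def private_port(self, port, request):
                op, *args = request
                if op == "read":
                    return self.open_ports[port][args[0]]
                if op == "write":
                    self.open_ports[port][args[0]] = args[1]
                if op == "close":
                    del self.open_ports[port]           # the per-file state is destroyed

        server = DemosLikeFileServer({"f": ["sector 0", "sector 1"]})
        p = server.public_port(("open", "f"))           # open f, get the private port
        print(server.private_port(p, ("read", 0)))      # read sector 0 of f
        server.private_port(p, ("close",))              # close f, destroying the port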

Internal Server Structure

There are many ways to structure the code of a server.

Servers with One Incoming Queue

The simplest form of message addressing we will consider involves addressing messages directly to a process. This implies that all messages to a process end up in the same incoming message queue; if the process is a server, for example, a file server, this implies a structure like the following:

        repeat
           await request
           process request
           send reply
        forever

This simple model is sufficient for servers that provide very simple services; specifically, this may be the right model to use if the server never blocks during the processing phase. On the other hand, if the server must send messages to a subsidiary server, for example, when a file server communicates with a disk I/O server, problems arise:

        repeat
         * await request from client
           process step 1
           send request to sub server
         * await reply from sub server
           process step 2
           send reply to client
        forever

As examples, consider a directory server that stores each of its directories as a file, using a low-level file server to do the work, or consider a low-level file server that uses a disk server to store user files. Both examples suggest the above structure.

This structure fails if there is only one incoming FIFO queue of messages per process! In this case, the two starred lines (await) have no way of distinguishing between incoming requests from clients and replies from subsidiary servers.

Event Processing in Single Queue Servers

One solution to the problem outlined above is to rewrite the server so that it classifies incoming messages as either requests or replies. Each message must therefore have a prefix identifying the message type. The software structure of such a server can get quite complex:

        repeat
           await message
           case message-type of
             request:
               allocate transaction record (TR)
               process step 1 relative to TR
               send request + TR to sub server
             reply:
               get TR from reply
               process step 2 relative to TR
               send reply to client recorded in TR
               deallocate TR
           end case
        forever

This scheme, although complex, offers one important advantage! It allows the processing of many transactions to overlap in time. Consider the situation where two transactions arrive in quick succession: The server will process step 1 of both before it awaits answers from the subsidiary server on behalf of either. If there are multiple independent subsidiary servers, or if the subsidiary server can optimize the order in which it processes requests, it is possible for the resulting system to have higher throughput than a naive system that finishes one transaction before starting another.
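A concrete rendering of this event-processing loop is sketched below, with a Python queue.Queue standing in for the server's single incoming message queue and a dictionary holding the transaction records; the message formats and the stand-in for step 1 are invented.

        # A sketch of the event-processing loop above.  One incoming queue
        # carries both client requests and sub-server replies, tagged by type.
        import itertools, queue

        incoming = queue.Queue()                  # the server's one incoming queue
        to_sub_server = queue.Queue()             # requests on their way to the sub server
        transactions = {}                         # TR id -> the client to reply to
        next_tr = itertools.count(1)

        def server_step():
            kind, body = incoming.get()                     # await message
            if kind == "request":
                client, payload = body
                tr = next(next_tr)                          # allocate transaction record TR
                transactions[tr] = client
                step1 = payload.upper()                     # stand-in for "process step 1"
                to_sub_server.put((tr, step1))              # send request + TR to sub server
            elif kind == "reply":
                tr, result = body                           # get TR from reply
                client = transactions.pop(tr)               # ... and deallocate TR
                print("reply to", client, ":", result)      # send reply to recorded client

        # one transaction, driven by hand: a request arrives, then the reply
        incoming.put(("request", ("client-A", "data")))
        server_step()
        tr, step1 = to_sub_server.get()                     # the sub server's side, simulated
        incoming.put(("reply", (tr, step1 + "!")))
        server_step()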

Unfortunately, this "event processing loop" involves complex code, and most programmers would prefer to be able to write something simpler. This can be done, as follows:

        repeat
          ----
         | if deferred-queue empty
         |    await message
         |    (assert it is a request from a client)
         | else
         |    dequeue request from deferred-queue
         | endif
          ----
           process step 1
           send request to sub server
          ----
         | repeat
         |    await message
         |    if it is a request from a client
         |       enqueue it in the deferred-queue
         |    endif
         | until it is a reply from the sub server
          ----
           process step 2
           send reply to client
        forever

This bit of code is also ugly, but the ugliness is confined to the marked sections of code, each of which substitutes for a starred line in the naive code. The resulting code does not offer the possibility of overlapping the processing of multiple transactions, so the resulting throughput may not be as good as in the previous example. On the other hand, it may be significantly easier to debug, and it should serve well for simple low-performance servers.

In fact, the "ugliness" in the above code is a special case of a more general solution, in which the prefix on each message identifies the logical queue to which that message is directed. Thus, we might write a general wait routine as:

        typed_wait( msg-type )
          if empty( queue[ msg-type ] )
            repeat
              await an incoming message
              t = type of received message
              if not( t = msg-type )
                enqueue message on queue[ t ]
              endif
            until t = msg-type
          else
            dequeue message from queue[ msg-type ]
          endif

This is, in effect, how systems like Demos implement multiple mailboxes for each process. Furthermore, it is not difficult to generalize the above to allow a selective wait for any of a set of message types. There are alternatives to this model; for example, on some systems, there are primitives that allow newly received messages to be inspected and put back on the queue if they don't fit some criteria, or they allow a wait for messages from a particular sender.
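A near-direct transcription of typed_wait into Python is sketched below; the single physical incoming queue is a queue.Queue, one deque per message type plays the role of queue[t], and the names are otherwise invented.

        # typed_wait in Python: messages of the wrong type are set aside on a
        # per-type deque so that a later typed_wait for that type finds them.
        import collections, queue

        incoming = queue.Queue()                                 # the one real queue
        pending = collections.defaultdict(collections.deque)     # queue[ msg-type ]

        def typed_wait(msg_type):
            if pending[msg_type]:
                return pending[msg_type].popleft()               # already set aside earlier
            while True:
                t, body = incoming.get()                         # await an incoming message
                if t == msg_type:
                    return body
                pending[t].append(body)                          # wrong type: set it aside

        incoming.put(("request", "a request from a client"))
        incoming.put(("reply", "a reply from the sub server"))
        print(typed_wait("reply"))        # finds the reply, setting the request aside
        print(typed_wait("request"))      # finds the request on the deferred queue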

Multi Thread, Multi Queue Servers

Given only one queue per thread instead of one queue per process, we can solve the problems outlined above in another way, using multiple threads per server (or equivalently, multiple processes sharing an address space, with one queue per process). This leads to the following server design:

        repeat
           await request from client
           create a new thread or process to do the following
              process step 1
              send request to sub server
              await reply from sub server
              process step 2
              send reply to client
              exit
           end new process
        forever

This piece of code has some interesting properties. First, note that the process that receives requests from clients is not the same process as the one that eventually sends replies to clients! The process that receives requests creates one short lived process per request, and all further work resulting from that request is done by that new process.

The second useful property of this code is that it allows for parallelism in exactly the same way as the event processing code. The processing of different requests can be carried on in parallel, but most of the benefit of this is likely to come from overlapping interactions with subservers. In general, multi-thread or multi-process programming can be used to eliminate the awkwardness of the event-processing model!

Note that knowledge of the short-lived request processing threads or processes is confined to the main process of the server which creates them; the sub-server learns their identity as part of the reply address provided with the request messages, but to the sub-server, they are just clients. As a result, the short-lived processes never need to worry about receiving any messages other than the expected replies.

This code has a problem! If requests arrive faster than they can be processed, large numbers of short-lived threads or processes may be created. This is not a good idea! We can avoid creating too many processes in several ways; for example, we can use a general semaphore to count processes:

        semaphore pop limits the server population

        repeat
           P(pop)
           await request from client
           fork a new process to do the following
              process step 1
              send request to sub server
              await reply from sub server
              process step 2
              send reply to client
              V(pop)
              exit
           end new process
        forever
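A sketch of this semaphore-limited, thread-per-request structure using Python threads follows; the clients and the sub-server are simulated in-process, and the names and the population limit are invented.

        # Thread-per-request with a semaphore limiting the worker population.
        import queue, threading

        POP_LIMIT = 4
        pop = threading.Semaphore(POP_LIMIT)     # counts the free worker slots
        requests = queue.Queue()                  # stands in for the incoming request queue

        def sub_server(payload):
            return payload.upper()                # stand-in for the real sub server

        def worker(client, payload):
            step1 = payload + " (step 1)"         # process step 1
            result = sub_server(step1)            # send request to, await reply from sub server
            print("reply to", client, ":", result)    # process step 2, send reply to client
            pop.release()                         # V(pop): this worker slot is free again

        def main_loop(n_requests):
            for _ in range(n_requests):
                pop.acquire()                     # P(pop): wait until a slot is free
                client, payload = requests.get()  # await request from client
                threading.Thread(target=worker, args=(client, payload)).start()

        for i in range(6):                        # six requests, but at most 4 workers at once
            requests.put(("client-%d" % i, "data"))
        main_loop(6)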

On a Unix based server, the main server process can use process termination signals to await the termination of a short-lived process in the event that too many short-lived processes have been created.

Single Thread, Multi Queue Servers

On systems where process creation is expensive, a multithreaded server is impractical. An alternative is to introduce multiple queues. This was done on Demos, and the following simple server structure results:

        repeat
           await message in main mailbox
           (assert it is a service request from a client)
           process step 1
           create new mailbox for sub server replies
           send request+newbox to sub server
           await reply in new mailbox
           destroy new mailbox
           process step 2
           send reply to client
        forever

This code is almost identical to the naive (and incorrect) code given at the beginning of this section; the big change is the use of a new (short-lived) mailbox for each transaction with a subsidiary server.
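A close analogue of this structure can be sketched with a fresh Python queue.Queue playing the role of the short-lived reply mailbox; the sub-server runs as a thread so that its reply really does arrive asynchronously, and the names and message formats are invented.

        # One new, short-lived mailbox (a queue.Queue) per sub-server transaction.
        import queue, threading

        main_mailbox = queue.Queue()
        sub_server_requests = queue.Queue()

        def sub_server():
            while True:
                payload, reply_box = sub_server_requests.get()
                reply_box.put(payload.upper())          # the reply goes to the new mailbox

        threading.Thread(target=sub_server, daemon=True).start()

        def serve_one():
            client, payload = main_mailbox.get()        # await message in main mailbox
            step1 = payload + " (step 1)"               # process step 1
            newbox = queue.Queue()                      # create new mailbox for sub-server replies
            sub_server_requests.put((step1, newbox))    # send request+newbox to sub server
            result = newbox.get()                       # await reply in new mailbox
            del newbox                                  # destroy new mailbox
            print("reply to", client, ":", result)      # process step 2, send reply to client

        main_mailbox.put(("client-A", "data"))
        serve_one()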

The above code does not exploit the opportunities for parallelism that were exploited by the event processing code or by the multithreaded code, but this can be done if there is a "multiwait" operation, as on Demos. In this case, the control structure is that of the event processing code:

        boxset = { main mailbox }
        repeat
           await message in boxset
           box id is the mailbox the message came from
           case box id of
             main mailbox:
               create new mailbox B
               boxset := boxset + { B }
               client[B] := the client that sent the request
               process step 1 relative to B
               send request+B to sub server
             other:
               B := box id
               process step 2 relative to B
               send reply to client[B]
               destroy B
           end case
        forever

The advantage of this scheme over the pure event-processing scheme is that the simple code can be used for simple servers, and the code need be rewritten as above only if these simple servers begin to limit system performance.