Lecture 18, File Systems

From files as virtual devices to directory management

Part of the notes for CS:3620, Operating Systems
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Virtual Devices

Although there are applications where a single disk drive is appropriately considered to be a single file, most files are considerably smaller. The designers of the Demos file system, for example, observed that 50% of all user files are smaller than 3000 bytes. At the same time, however, they noted that the majority of the input/output operations were done on the largest files. Thus, the file system designer is faced with two almost contradictory demands: Fast access to large files and efficient storage of small files must both be possible. It should also be noted that, while most disk files are only accessed sequentially, some very important files, those implementing database systems, are frequently accessed in quite random patterns. Thus, a good file system must provide for both efficient random access and fast sequential access.

The Demos observation makes sense even today, for text files. Individual pieces of e-mail, individual source files in large programming projects, and similar items are all typically small. Demos was designed before multimedia support became common, though, so today's file systems must contend with image, sound and video files. These are typically bigger. Image files vary from a few kilobytes for thumbnail images to several megabytes for typical photographs. Audio files can occupy over a megabyte per minute of playing time. Video files are even bigger. Of course, typical modern disks are far larger than the disks that existed at the time Demos was developed, so the ratio of typical file size to device size may have actually decreased over the decades.

Although some systems meet these two sets of demands with different mechanisms, that is, one file system for large files and another for small files, most systems attempt to provide a single universal file system to meet all demands. Occasionally, the basic file system offered by a machine will be so biased toward large files that users are forced to develop library managers for storing many small files in one large file managed by the file system. The best known case where this occurred was under IBM's OS family of operating systems for the IBM System/360-370-390 family of machines, where the basic unit of disk allocation was the track, frequently tens of thousands of bytes on modern disks.

Ideally, a user program should be able to view each disk file as an independent logical input/output device. Thus, it is sometimes useful to describe each disk file as a virtual device.

In computer science, the term virtual is used to describe a resource which is implemented by software (usually system software) using some underlying similar resource. This use is borrowed from the field of optics, where a virtual image refers to the perceived location of an image seen by looking in a mirror or lens. Just as a lens cannot project a virtual image without a real object, software cannot create a virtual device without using a real device, but just as a lens can distort and displace the virtual image, placing it where no real object exists and giving it dimensions unrelated to those of the real object, the lens of software can create virtual devices with characteristics quite different from the real devices on which they are based.

It would be possible to design the virtual device interface of a disk file to resemble that of a magnetic tape, but this would preclude random access, and would thus be quite undesirable. Although some systems such as the IBM OS family of operating systems provide multiple file interfaces, called access methods, allowing each file to support a different set of operations, most modern systems have adopted a universal interface which allows each file to be used as either a sequential or random access device.

This chapter begins with a discussion of the different data structures which may be used to gather the different sectors of a disk file into one logical structure. Following this is a discussion of the software used for one such data structure, including the mapping of this data structure into the standard device independent file variable implementation discussed in Chapter 9. Following this, the problems of disk allocation and deallocation are introduced, and finally, directory structures are discussed.

Data Organizations for Disk Files

All but the simplest of disk files consist of more than one track or sector stored on disk. Efficient access to such a file requires that there be some data structure organizing these pieces. Before looking at interesting data structures, it is useful to look at a trivial one; in the case of files, the trivial way to allocate the multiple blocks of a file is sequentially, so that the physical sequence of blocks in a file is identical to the logical sequence seen by the user.

Whenever we talk about the organization of a disk file, we can speak of the physical arrangement of the blocks of that file on disk, or alternately, we can speak of the translation or mapping of the addresses of those blocks. It is easiest to speak of these mappings if blocks on disk can be addressed by sequential integers instead of complex two or three dimensional address tuples such as the <cylinder, surface, sector> triple. We can translate physical disk addresses to such a form as shown in Figure 1.

function linearize( cyl, srf, sec: integer ): integer;
begin
     linearize := sec
                + srf * sectors_per_track
                + cyl * tracks_per_cylinder * sectors_per_track;
end;

procedure delinearize( var cyl, srf, sec: integer; linear: integer);
begin
     cyl := (linear div sectors_per_track) div tracks_per_cylinder;
     srf := (linear div sectors_per_track) mod tracks_per_cylinder;
     sec :=  linear mod sectors_per_track;
end;

Figure 1. Linear disk address conversions.

The conversions shown in Figure 1 are similar in form to the conversions between a 3-digit decimal number and a binary number, where the sector number sec is the least significant digit, the surface number srf is the next to the least significant, and the cylinder number cyl is the most significant digit. In the decimal conversion, the constants by which we multiply, divide and take the remainder are all equal to 10, but aside from this, the structure of the arithmetic is identical.

By making the sector number the least significant part of the linearized disk address and the cylinder number the most significant, we arrange things so that sequential writes to consecutive linearized disk addresses will be done with minimal latency, first filling consecutive tracks within a cylinder before moving to the next cylinder. This assumes, of course, that the interleaving factor of the sector assignments on the disk is optimized.
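Because the remaining code in these notes is given in C, it may be worth restating the conversions of Figure 1 in that language. This is only a sketch; the geometry constants here are hypothetical placeholders for values that a real driver would take from the device description.

/* geometry constants -- hypothetical values for illustration */
#define SECTORS_PER_TRACK    32
#define TRACKS_PER_CYLINDER   8

int linearize( int cyl, int srf, int sec )
{
     return sec
          + srf * SECTORS_PER_TRACK
          + cyl * TRACKS_PER_CYLINDER * SECTORS_PER_TRACK;
}

void delinearize( int * cyl, int * srf, int * sec, int linear )
{
     *cyl = (linear / SECTORS_PER_TRACK) / TRACKS_PER_CYLINDER;
     *srf = (linear / SECTORS_PER_TRACK) % TRACKS_PER_CYLINDER;
     *sec =  linear % SECTORS_PER_TRACK;
}

As a quick check, delinearize( &c, &s, &k, linearize( c, s, k ) ) recovers the original address components whenever each component is within the bounds set by the geometry constants.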

Given this linearized disk address assignment, the storage used by a contiguously allocated disk file can easily be described by giving the linearized disk address of the first sector of the file and the number of sectors in the file.

Aside from the simplicity of the data structures involved, an important advantage of contiguous allocation is that it minimizes the head positioning latency involved when an entire file is read or written. The primary disadvantage of contiguous allocation is that it can make it impossible to extend the size of an existing file when both ends of that file are adjacent to other files. As a result, users of linear files are usually required to specify the sizes of files when they are created. Alternately, we can allow only one file at a time to be open for write access. This file is always moved to the largest free block on the disk, and then its actual size is recorded when it is closed.

If we want our file system to allow any file to be expanded at any time, we need a more complex way to allocate space for each file. An old way of solving this problem is to represent each disk file by a linked list of the sectors of the file; this allows efficient sequential access to the contents of a file, but it is quite inefficient for random access. As a result, this approach is rare in modern systems, and most systems which supported the linked list organization also supported other organizations.

Some systems allow different files to be stored in different ways on the same disk. These systems are more complex than those supporting a single organization, and they require that users declare the required organization of a file when that file is created. The latter can be inconvenient, since the eventual use of a file is not always anticipated at the time the file is created. As a result, most systems developed since 1970 have included a single universal low level file organization, frequently based on some kind of tree structure.

As an example, files under Unix are described by data structures called i-nodes (index-nodes). In the classic Unix implementation, each i-node is 64 bytes long, so that they may be packed 8 to a 512 byte disk sector to construct the i-table, a table of all the i-nodes in the file system. Directories under Unix relate textual file names to i-node numbers, where an i-node number is simply the index of a particular i-node in the i-table. The i-node for a disk file contains pointers, or rather, 32-bit linearized disk addresses, for as many as 13 disk sectors, the first 10 of these being the first 10 sectors of the file. If the file is longer than 10 sectors (5120 bytes), the 11th disk address refers to a sector holding the disk addresses of the next 128 sectors of the file; files where this is sufficient are said to use one-level indexing. If the file is larger than 138 sectors (10+128 sectors or 70,656 bytes), the 12th disk address refers to a sector holding the disk addresses of the sectors holding the disk addresses of the next 16,384 (128²) sectors; this is called two-level indexing. If the file is larger than 16,522 sectors (10+128+16,384 sectors or 8,459,264 bytes), the 13th disk address in the i-node is used for three-level indexing.
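To make these rules concrete, the following sketch (not actual Unix code) classifies a sector number within a file by the level of indexing needed to reach it, using the classic parameters just described:

#define NDIRECT  10   /* direct sector addresses in the i-node */
#define PERSECT 128   /* sector addresses per index sector */

/* returns 0 if sector s of a file is reached directly from the
   i-node, otherwise the number of index levels needed (1 to 3) */
int indexlevel( int s )
{
     if (s < NDIRECT) return 0;
     s = s - NDIRECT;
     if (s < PERSECT) return 1;
     s = s - PERSECT;
     if (s < PERSECT * PERSECT) return 2;
     return 3;
}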

Note: Unix stores the i-table on disk, so opening a file typically takes at least one disk access to read the i-node before the first access to any of the sectors of the file.

The 13 32-bit sector numbers in a classical Unix i-node consume only 52 bytes, leaving 12 bytes of each i-node for other purposes. These are used for file size (4 bytes), access control (4 bytes), and time of creation and last modification (4 bytes).
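This layout can be summarized with the following declaration; the field names here are hypothetical, chosen to match the description above rather than any real Unix source code.

struct inode {
     int sector[13];  /* linearized disk addresses, 52 bytes:
                         sector[0..9]  the first 10 sectors of the file
                         sector[10]    one-level index sector
                         sector[11]    two-level index sector
                         sector[12]    three-level index sector */
     int size;        /* file size, 4 bytes */
     int access;      /* access control bits, 4 bytes */
     int times;       /* creation and modification times, 4 bytes */
};                    /* total: 64 bytes, 8 i-nodes per 512-byte sector */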

The Unix file system organization is as complex as it is because the designers wanted to make access to the very first sectors of a file fast, no matter how large the file. In part, this is because the Unix exec() functions, the functions to load and run a program, all require fast access to the first few bytes of a file in order to determine how to load the program, and requiring extra disk accesses would noticeably slow the loader.

The benefits of the Unix i-node scheme over a simple tree structure are marginal, and giving detailed code for such a system would distract from the essential idea of the Unix file system, the idea of using a tree of index sectors to locate the data sectors of a file, where each data sector is, in effect, a leaf of the tree.

Consider a disk system with very small sectors, where each disk sector is large enough to hold just 4 linearized disk addresses. On such a system, we would have to build the tree shown in Figure 2 to describe a file with 7 sectors.

                                  _______________
first level                      |   |   | //| //|
index sector                     |_o_|_o_|//_|//_|
                            _______|   |___________
                           |                       |
                    _______V_______         _______V_______
second level       |   |   |   |   |       |   |   |   | //|
index sectors      |_o_|_o_|_o_|_o_|       |_o_|_o_|_o_|//_|
                   __|   |   |   |__       __|   |   |
                  |     |     |     |     |     |     |
                 _V_   _V_   _V_   _V_   _V_   _V_   _V_
data sectors    |   | |   | |   | |   | |   | |   | |   |
                |___| |___| |___| |___| |___| |___| |___|

Figure 2. A simple tree organizing the sectors of a disk file.

Virtual Disks

If the linked list representation of files is excluded, the file models discussed above can be described in terms of the translation of a virtual disk address to a physical disk address, where each file is viewed as a virtual disk. As a result, the basic operations on a virtual disk are the same as those for a physical disk, posting input/output requests, and awaiting the results. The diskread routine which posts requests for a disk file must first translate the virtual address of the desired sector within that file to a physical address, and then post the request using something like the diskread routine shown in Figure 11.7.

As with user-level files, real and virtual disks must be viewed as polymorphic abstract data types, so we view real disks as one implementation of the disk class and open disk files as another implementation of the class. We can define the interface to this class exactly as we did in Figure 9.3; this is illustrated in Figure 3.

struct diskvar {
     /* pointers to device specific routines */
     void (* read)( struct diskvar * d,
                    char * buf, int len, int addr, int * done );
     void (* write)( struct diskvar * d,
                     char * buf, int len, int addr, int * done );
     /* in the above, buf and len describe a buffer in main memory,
        while addr is the linearized disk address on disk or the
        sector number in a file */

     /* key file characteristics */
     int sectorsize, totalsectors;

     /* device characteristics */
     int cylinders, surfaces, sectors;
};

/* functions implementing access methods */
#define DISKREAD(d,buf,len,addr,done)  (*(d->read))(d,buf,len,addr,done)
	/* read from disk d, into buffer buf, of length len;
           from sector addr, and set flag done when finished */
#define DISKWRITE(d,buf,len,addr,done) (*(d->write))(d,buf,len,addr,done)
	/* write to disk d, from buffer buf, of length len;
           to sector addr, and set flag done when finished */

Figure 3. C code for the disk object interface.

This interface includes not only the basic read and write operations, but also the dimensions of the disk device. Users of the read and write routines need to know the sector size and the number of sectors in order to effectively use the file. For conventional disks, the number of sectors will be the product of the numbers of cylinders, surfaces and sectors per track, and knowledge of these details may be useful to some applications where the software using the disk tries to minimize latency by keeping related data together on the same track or cylinder. For some types of disk files, knowledge of the device geometry can be useful, for example, if the file system allocates space in units of tracks, or if the file system attempts to preferentially allocate sectors of a file consecutively. Because it might be useful and it is unlikely to be harmful, we include this data in the low level description of all open disk files or hardware disk devices.

For a physical disk device, the read and write routines would be essentially the same as those shown in Figure 11.7, except that we include information about what queue the disk request should be scheduled in as part of the physical disk description record, and we deal in linearized disk addresses. This is shown in Figure 4.

struct physicaldiskvar {
     /* pointers to device specific routines (as in Figure 3) */
     void (* read)( struct diskvar * d,
                    char * buf, int len, int addr, int * done );
     void (* write)( struct diskvar * d,
                     char * buf, int len, int addr, int * done );

     /* key file characteristics */
     int sectorsize, totalsectors;

     /* device characteristics */
     int cylinders, surfaces, sectors;

     /* base input-output register number for this disk device (new) */
     int device;

     /* interface to interrupt service routine (new) */
     struct disk_request_queue * dq;
};

/* function to open disk device n */
struct diskvar * opendisk( int n )
{
     /* base register numbers for the different disk devices */
     static int diskdevices[] = { ... };

     /* physical characteristics of the different disk devices */
     static int sectorsizes[] = { ... };
     static int diskcylinders[] = { ... };
     static int disksurfaces[] = { ... };
     static int disksectors[] = { ... };

     /* queues for each device */
     static struct disk_request_queue * diskqueues[] = { ... };

     /* the new open file object */
     struct physicaldiskvar * dv;
     dv = malloc( sizeof(struct physicaldiskvar) );

     /* initializer */
     dv->write = physicaldiskwrite;
     dv->read = physicaldiskread;
     dv->device = diskdevices[n];
     dv->sectorsize = sectorsizes[n];
     dv->totalsectors = diskcylinders[n]
                      * disksurfaces[n]
                      * disksectors[n];
     dv->cylinders = diskcylinders[n];
     dv->surfaces = disksurfaces[n];
     dv->sectors = disksectors[n];
     dv->dq = diskqueues[n];

     /* give the caller the initialized instance */
     return (struct diskvar *) dv;
}

void physicaldiskread( struct diskvar * d,
                       char * buf, int len, int addr, int * done )
{
     struct physicaldiskvar * dv = (struct physicaldiskvar *) d;
     struct diskrequest * req;
     int temp;

     /* get a free disk request record */
     while (free == NULL) /* nothing */;
     req = free;
     free = req->next;

     /* setup this request record */
     req->buffer = buf;
     req->size = len;
     req->cylinder = (addr / dv->sectors) / dv->surfaces;
     req->surface =  (addr / dv->sectors) % dv->surfaces;
     req->sector =    addr % dv->sectors;
     req->direction = in;
     req->done = done;

     /* reset done flag before enqueueing the request, so that the
        interrupt service routine cannot set it before we clear it */
     *done = 0;

     /* enqueue the request */
     enqueue( dv->dq, req );

     /* enable disk interrupt -- a critical section */
     disableints();
     temp = inp( dv->device + DISKCON );
     setbit( temp, IE );
     outp( dv->device + DISKCON, temp );
     enableints();
}

Figure 4. Part of the driver for a physical disk.

Given this interface to a physical disk, we can now use it as a foundation on which to implement virtual disks or disk files. For example, a contiguously allocated disk file could be accessed through the comparatively simple interface shown in Figure 5:

struct lineardiskvar {
     /* pointers to device specific routines (as in Figure 3) */
     void (* read)( struct diskvar * d,
                    char * buf, int len, int addr, int * done );
     void (* write)( struct diskvar * d,
                     char * buf, int len, int addr, int * done );

     /* key file characteristics */
     int sectorsize, totalsectors;

     /* device characteristics */
     int cylinders, surfaces, sectors;

     /* the disk used to store this file (new) */
     struct diskvar * disk;

     /* where does this file begin on that disk (new) */
     int base;
};

void lineardiskread( struct diskvar * d,
                     char * buf, int len, int virtaddr, int * done )
{
     struct lineardiskvar * dv = (struct lineardiskvar *) d;
     int realaddr;

     /* check and translate the disk address */
     if (virtaddr >= dv->totalsectors) raise( disk_address_exception );
     realaddr = virtaddr + dv->base;

     /* do the real disk I/O */
     DISKREAD(dv->disk, buf, len, realaddr, done);
}

Figure 5. Part of the driver for a contiguously allocated disk file.

The code shown in Figure 5 does not include an open routine; this will be discussed later, with the discussion of directory structures. The key thing to note in this code is how little it does! Aside from two lines of code to check and translate the disk address, all of the work involved in reading and writing a contiguously stored disk file is done by the physical disk access routines!

While use of contiguous disk files may seem to be a poor choice for general file systems, the operating systems on many modern machines routinely use simple contiguous mapping such as is implemented by the code in Figure 5 to partition their main disks into several smaller virtual disks. This is done for several reasons, ranging from the desire to use different file system structures on each partition of a large disk, for example, a Windows file system on one partition and a Linux file system on another, to the desire to simplify backups by keeping all user files in one partition while keeping system files (which change much less frequently) in another.

On some systems, user programs may directly call on the services of a disk I/O driver such as that illustrated in Figure 4, but most modern systems interpose an additional layer between the user and the disk, in addition to the file system that maps each disk file to its location on disk. This layer forces either the physical disk or the disk file to conform to the standard file variable model of the system, providing character, line and block sequential access to each disk or disk file, along with operations such as seek and rewind to allow for random access to different parts of the file.

Virtual disks can be made to serve many purposes. Our primary emphasis will be to think of disk files as virtual disks, with exactly the same properties as the physical disk on which they are implemented, but there are many other possibilities:

Giant disks
A giant virtual disk can be constructed from an array of smaller physical disks. Given physical disks of n sectors each, when a virtual disk request is made to sector s, the actual input/output is performed on sector s mod n of disk s div n (a sketch of this mapping follows this list). This is the simplest application of a RAID (a Random-access Array of Inexpensive Disks).

Fault-tolerant disks
A fault-tolerant virtual disk can be constructed from an array of physical disks. Consider, for example, recording each sector written two or more times, once on each disk in the array. On read, the sector can be read from any disk in the array. One way to achieve fault tolerance is to accept the first copy read that has a correct checksum. A second alternative is to read all of the copies and return a value for the sector that is the result of a majority vote between all of the copies inspected. Both of these approaches are used in some RAID systems. Officially, the acronym RAID stands for a redundant array of inexpensive disks.

Fault-tolerant Giant disks
Both of the above techniques can be combined. For example, over a decade ago, the RAID systems used at a Fortune 500 company might have had a capacity of 400 gigabytes, when seen as a virtual disk, with an implementation based on a total of 1200 gigabytes of actual storage capacity, made up of an array of 200 or so drives, each with a capacity of around 6 gigabytes.

Remote disks
A virtual disk implementation can translate disk read and write requests to network transactions to a remote server, where that server uses any physical or virtual disk to implement disk read and write semantics. Mounting a network disk drive is generally done this way on systems based on Unix.
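As an illustration of the giant disk mapping described above, the following sketch follows the pattern of Figures 3 and 5; the structure and routine names are hypothetical, and only the read side is shown.

struct giantdiskvar {
     /* pointers to device specific routines (as in Figure 3) */
     void (* read)( struct diskvar * d,
                    char * buf, int len, int addr, int * done );
     void (* write)( struct diskvar * d,
                     char * buf, int len, int addr, int * done );

     /* key file characteristics */
     int sectorsize, totalsectors;

     /* device characteristics */
     int cylinders, surfaces, sectors;

     /* the component disks, each with n sectors (new) */
     int n;
     struct diskvar ** disk;
};

void giantdiskread( struct diskvar * d,
                    char * buf, int len, int virtaddr, int * done )
{
     struct giantdiskvar * dv = (struct giantdiskvar *) d;

     /* check the virtual disk address */
     if (virtaddr >= dv->totalsectors) raise( disk_address_exception );

     /* sector s of the giant disk is sector s mod n of disk s div n */
     DISKREAD( dv->disk[virtaddr / dv->n], buf, len,
               virtaddr % dv->n, done );
}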

A Tree-Structured File Implementation

There are a number of ways of organizing a tree-structured file representation that lead to a result comparable to that shown in Figure 2. The one described here is based on the following idea: For each file larger than one sector, the list of sectors holding the data in that file is stored in an auxiliary smaller file called an index file. This process repeats (recursively) until the final index file is only one sector long. As a result, each level in the tree describing the locations of the sectors in a file is itself stored in a file, as illustrated in Figure 6.

 disk address of top level index _________
                                __________|__________
 index file                    |   _______V_______   |
 describing                    |  |   |   | //| //|  |
 locations of                  |  |_o_|_o_|//_|//_|  |
 index sectors                 |____|___|____________|
                             _______|   |___________
                  __________|_______________________|__________
 index file      |   _______V_______         _______V_______   |
 describing      |  |   |   |   |   |       |   |   |   | //|  |
 locations of    |  |_o_|_o_|_o_|_o_|       |_o_|_o_|_o_|//_|  |
 data sectors    |____|___|___|___|___________|___|___|________|
                    __|   |   |   |__       __|   |   |
                   |     |     |     |     |     |     |
                  _V_   _V_   _V_   _V_   _V_   _V_   _V_
 data sectors    |   | |   | |   | |   | |   | |   | |   |
 of user file    |___| |___| |___| |___| |___| |___| |___|

Figure 6. Use of index files to describe data files.

It would be quite inefficient to follow all of the pointers implied by the structure in Figure 6 in order to find the disk address needed for each disk transfer, but most patterns of disk operations exhibit a property called locality of reference: If some sector of a file has been referenced recently, it is highly likely that future references to that file will be near that sector. This is clearly true of sequential access to a file, but even the random-access patterns of database systems exhibit this to some extent.

One way to take advantage of the locality of reference to disk sectors is to keep a cache of recently used disk sectors in main memory, as suggested in Chapter 11. Another approach is to cache the most recently used index sector with each open file as part of the open file description. With either of these approaches, visits to the root of the indexing tree will be rare once the first disk access has been completed.

The description records used for files with the structure shown in Figure 6 must include an indication of the number of sectors in the file, a disk address (if there is only one sector in the file), and a pointer to the description of the index file, if there are multiple sectors in the file. Declarations for these data structures are shown in Figure 7.

struct indexdiskvar {
     /* pointers to device specific routines (as in Figure 3) */
     void (* read)( struct diskvar * d,
                    char * buf, int len, int addr, int * done );
     void (* write)( struct diskvar * d,
                     char * buf, int len, int addr, int * done );

     /* key file characteristics */
     int sectorsize, totalsectors;

     /* device characteristics */
     int cylinders, surfaces, sectors;

     /* the disk used to store this file (new) */
     struct diskvar * disk;

     /* the file storing the index for this file (new) */
     struct diskvar * index;

     /* the most recent index buffer used to address this file (new) */
     int * indexbuf;
     int indexsect;
};

Figure 7. Declarations supporting the use of index files.

If we didn't intend to exploit the locality property, the only field we would have needed to add to the file description record would be the pointer to the disk variable used to represent the index file. Because we are interested in exploiting locality, however, we also add a pointer to the index buffer and a record of which sector of the index file is currently stored in that buffer.

The code for disk operations using this data structure is not particularly complex, as shown in Figure 8.

void indexdiskread( struct diskvar * d,
                    char * buf, int len, int virtaddr, int * done )
{
     struct indexdiskvar * dv = (struct indexdiskvar *) d;
     int realaddr;
     int isect;
     int iword;

     /* check the disk address */
     if (virtaddr >= dv->totalsectors) raise( disk_address_exception );

     /* find out where the address is in the index file */
     isect = virtaddr / (dv->sectorsize / sizeof(int));
     iword = virtaddr % (dv->sectorsize / sizeof(int));

     /* make sure we have the right sector in the buffer */
     if ((dv->indexbuf == NULL) || (dv->indexsect != isect)) {
          int done;
          if (dv->indexbuf == NULL) {
               dv->indexbuf = malloc( dv->sectorsize );
          }
          DISKREAD(dv->index, (char *) dv->indexbuf,
                   dv->index->sectorsize, isect, &done);
          waitdiskdone( &done );
          dv->indexsect = isect;
     }

     /* translate the address */
     realaddr = dv->indexbuf[iword];

     /* do the user's requested operation */
     DISKREAD(dv->disk, buf, len, realaddr, done);
}

Figure 8. Implementing tree-structured file indexing.

The largest part of the new code in Figure 8 is concerned with reading the correct sector of the index file into the index buffer. The actual address translation is almost an anticlimax. It is important to understand that the two calls to DISKREAD in the above code are quite different. The final call that does the user's requested operation will typically operate on a physical disk device, or perhaps on a large partition of a physical disk device.

The first call to DISKREAD in Figure 8 is more interesting. This reads from the index file, and if multi-level indexing is used, it will likely be implemented as a recursive call to indexdiskread. This raises an important question: What terminates the recursion? The answer lies in the nature of the recursion itself. This call will only be recursive if the file object being used to represent the index to the file is itself another indexed file. If the index to the file is represented using a different disk file implementation, for example, a trivial one-sector file, it will not be recursive!

Allocation of Disk Space to Files

The code shown in the previous sections for performing disk operations assumed that the entire file was allocated prior to any use of the disk. Although some systems require this, most allow the size of a file to be increased by the operation of attempting to write to a sector beyond the last sector previously allocated to the file. This is called demand allocation, since no space need be allocated for a file on disk until a user tries to actually write data in it.

Demand allocation for non-contiguous files can be quite easy, but the easy approaches do not always lead to acceptable performance. For example, the original Unix file system maintained, for each disk, a linked list of free sectors, where each sector in this list contained the disk address of the next sector in the list and the addresses of up to 50 additional free sectors. When a sector was needed to extend a file, the first available sector listed in the head sector of the free space list would be used. As a result, the sectors of all but the first few files created in the life of a file system end up being randomly distributed on the disk, and an attempt to read or write such a file may involve large numbers of seeks, making sequential reads very expensive.

Early Microsoft file systems suffered from similar problems, and the solution adopted was a program, usually provided by a third party software vendor, called a disk defragmenter. Under early Unix systems, the same effect could be achieved by doing a backup of all the files in the file system, typically writing the files to tape, then re-initializing the file system and reading the tape. The free space list of the freshly initialized file system would deliver the sectors to the files in sequential order, so the restored files would be allocated contiguously, or at least, nearly so.

One classic reason to partition large disks into multiple small file systems is to help solve this problem! In fact, in Unix systems of the early 1980's, this was an important argument in favor of partitioning large disks into multiple partitions, with a separate file system on each partition. Under classical Unix (and for that matter, under most operating systems) files cannot span multiple partitions, and therefore, breaking the disk into multiple partitions, one per logical class of files, forced the sectors of files to be clustered for more efficient access.

Modern file systems largely eliminate the need for tools like disk defragmenters by careful selection of the sectors used when files are extended. When a sector is needed to extend a file, sequential access to the file will be fastest if the new sector is allocated in the same cylinder as the final sector of the file, or if that can't be done, in a nearby cylinder. The surface does not matter, but when the new sector is in the same cylinder, its sector number should be just after that of the final sector of the file. If this approach to allocation is followed every time a file is expanded, sequential access to the file will be possible with nearly minimal latency.

Another optimization file systems can make occurs when files are first created. Because it is common for one application to need to access several files in the same directory, it makes sense to try to allocate them in nearby areas of the disk, and near the directory that describes them. Given a general scheme such as Unix i-nodes or the tree-structured scheme described here, none of this is mandatory, but if it is done carefully, the impact on performance can be quite significant.

Typically, we might implement careful demand allocation by having the allocate disk space service take, as a parameter, the suggested disk address of the block to be allocated. If this is free, the service should use the suggested sector. If it is not free, the service should try to find something nearby, where distance is measured in terms of latency. This requires that the data structure used to describe the free sectors on the disk allow fast searching by disk address; a common scheme uses a three dimensional array of booleans, one bit per sector, where the array is addressed by cylinder, surface and sector numbers. This array is fairly compact and may be easily searched.
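The following sketch shows such an allocation service. For simplicity, it uses a one-dimensional bitmap indexed by linearized disk address and measures distance in linearized sectors rather than in estimated latency; all of the names are hypothetical.

extern char freemap[];    /* one bit per sector; a 1 bit means free */
extern int totalsectors;  /* number of sectors on this disk */

int  testbit( int i )  { return (freemap[i / 8] >> (i % 8)) & 1; }
void clearbit( int i ) { freemap[i / 8] &= ~(1 << (i % 8)); }

/* allocate the suggested sector if it is free, otherwise the nearest
   free sector by linearized distance; returns -1 if the disk is full */
int allocsector( int suggestion )
{
     int dist;
     for (dist = 0; dist < totalsectors; dist++) {
          int lo = suggestion - dist;
          int hi = suggestion + dist;
          if ((lo >= 0) && testbit( lo )) {
               clearbit( lo );
               return lo;
          }
          if ((hi < totalsectors) && testbit( hi )) {
               clearbit( hi );
               return hi;
          }
     }
     return -1;  /* no free sectors at all */
}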

Some systems go a step beyond demand allocation by allocating a new disk sector each time a sector of data is written to disk. As a result, the old sectors of a file are not necessarily overwritten until sometime after the changes to the file. This approach is frequently used with a minimum latency allocation strategy, so that the sector used for each write is chosen based on the current head position at the time that write operation is scheduled. If the old sectors of a file are retained until the file is closed, this approach can prevent system failures from leaving a file in a partially updated form. Thus, updates to a file can appear to be atomic transactions which are either completed normally or not done at all.

Device Independent File Input/Output

The description of files provided up to this point has made no effort to conform to the standard device interface described in Lecture 8. This interface can, however, be implemented in terms of the low level disk interface software which is described here, using an additional software layer. As with magnetic tape, the primary responsibility of this new software layer is blocking and deblocking, so the software discussed in Lecture 9 provides a useful model.

The need for blocking and deblocking requires that the file description record contain at least one buffer, as well as additional information such as the current position within the buffer, and the current pending input/output request for that buffer, if any. All of these must be stored in the device dependent part of the file variable describing an open disk file, following the polymorphic interface model discussed in Lecture 9. This line of reasoning leads to the disk file interface given in Figure 9.

/* the definition of the type filevariable */
struct diskfilevariable {
     /* pointers to device specific routines, from Figure 9.3 */
     void (* rewind)( struct filevariable * f );
     char (* read)( struct filevariable * f );
     void (* write)( struct filevariable * f, char c );
          .
          .   /* pointers for other operations from Figure 8.4 */
          .

     /* the disk or virtual disk holding this file (new) */
     struct diskvar * disk;

     /* the current position in this file (new) */
     int pos;

     /* a buffer holding the current sector (new) */
     char * buf;
};

void diskfilerewind( struct filevariable * f )
{
     struct diskfilevariable * df = (struct diskfilevariable *) f;
     df->pos = 0;
     if (df->buf != NULL) free( df->buf );
     df->buf = NULL;
}

char diskfileread( struct filevariable * f )
{
     struct diskfilevariable * df = (struct diskfilevariable *) f;
     int sector = df->pos / df->disk->sectorsize;
     int byte   = df->pos % df->disk->sectorsize;
     char ch;

     /* make sure we have a buffer holding the current byte of the file */
     if (df->buf == NULL) {
          int done;
          df->buf = malloc( df->disk->sectorsize );
          DISKREAD( df->disk, df->buf,
                    df->disk->sectorsize, sector, &done );
          waitdiskdone( &done );
     }

     /* get the byte */
     ch = df->buf[byte];
     df->pos = df->pos + 1;

     /* see if done with buffer */
     if (byte + 1 >= df->disk->sectorsize) {
          free( df->buf );
          df->buf = NULL; /* so the next read will refill the buffer */
     }

     return ch;
}

Figure 9. The core of the device independent disk file interface.

The diskfileread routine in Figure 9 makes no effort to overlap reading or writing with computation; as was the case in the discussion of magnetic tape, multiple buffers could have been included in each file variable, so that one could be read from the disk while the previous one was being disassembled into successive characters. A more serious problem with the code given is that the read and rewind routines make no provisions for the possibility that a write might have modified the buffer. We really need to add a flag to the representation of an open disk file, used to remember if the buffer contains the partial results of any write operations; if the flag indicates that this is so, we need to write out the buffer instead of merely deallocating it.

The code in Figure 9 has a second serious flaw: No provision is made for reporting end-of-file conditions! Attempting to read beyond the last sector of the file will raise an exception, but our usual stream model of files requires that end-of-file be reported when the last character is read, even if this is in the middle of a sector. We could simply embed a special end-of-file character in the data stream, or we could change the basic representation of disk files so that the file size is stored in bytes and not in sectors.
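A revised sketch of the read routine incorporating both fixes follows. It assumes two hypothetical fields added to the record of Figure 9, size (the file size in bytes) and dirty (set by the write routine whenever it modifies the buffer), along with a hypothetical end_of_file_exception in the style of the other exceptions used here.

char diskfileread( struct filevariable * f )
{
     struct diskfilevariable * df = (struct diskfilevariable *) f;
     int sector = df->pos / df->disk->sectorsize;
     int byte   = df->pos % df->disk->sectorsize;
     char ch;

     /* report end of file anywhere, even in mid-sector */
     if (df->pos >= df->size) raise( end_of_file_exception );

     /* make sure we have a buffer holding the current byte of the file */
     if (df->buf == NULL) {
          int done;
          df->buf = malloc( df->disk->sectorsize );
          DISKREAD( df->disk, df->buf,
                    df->disk->sectorsize, sector, &done );
          waitdiskdone( &done );
          df->dirty = 0;
     }

     /* get the byte */
     ch = df->buf[byte];
     df->pos = df->pos + 1;

     /* when done with the buffer, write it back only if it is dirty */
     if (byte + 1 >= df->disk->sectorsize) {
          if (df->dirty) {
               int done;
               DISKWRITE( df->disk, df->buf,
                          df->disk->sectorsize, sector, &done );
               waitdiskdone( &done );
          }
          free( df->buf );
          df->buf = NULL;
     }

     return ch;
}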

The standard file interface defined in Figure 9.3 and Figure 9.4 includes block and line oriented services that can clearly be implemented in terms of the character sequential interface given in Figure 9, and this is the usual way to provide line oriented access. For block oriented access, on the other hand, we are greatly interested in efficiency for the special case when the block size happens to match the sector size and the first byte happens to be aligned on a physical sector boundary. Because of this, our block oriented implementation is typically quite complex, as is shown in Figure 10.

void diskfilereadblock( struct filevariable * f, char buf[], int len )
{
     struct diskfilevariable * df = (struct diskfilevariable *) f;
     int sector = df->pos / df->disk->sectorsize;
     int byte   = df->pos % df->disk->sectorsize;
     int i = 0;

     /* deal with odd bytes at start of buffer */
     while (byte != 0) {
          buf[i] = READ( f );
          i = i + 1;
          if (i >= len) return;
          sector = df->pos / df->disk->sectorsize;
          byte   = df->pos % df->disk->sectorsize;
     }

     /* deal with full sectors */
     while ((len - i) > df->disk->sectorsize) {
          int done;
          DISKREAD( df->disk, &buf[i],
                    df->disk->sectorsize, sector, &done );
          waitdiskdone( &done );
          i = i + df->disk->sectorsize;
          df->pos = df->pos + df->disk->sectorsize; /* keep position current */
          sector = sector + 1;
     }

     /* deal with remaining odd bytes */
     for (;;) {
          if (i >= len) return;
          buf[i] = READ( f );
          i = i + 1;
     }  
}

Figure 10. Block oriented access to disk files.

The code in Figure 10 begins and ends reading one character at a time from the disk file, but in the middle, if it finds that the remaining space in the buffer is as big as a sector, it reads whole sectors. This means that the overhead of the software is minimized and the disk's DMA hardware will directly read the required data into the user's buffer.

The code in Figure 10 could be made to run faster in two ways. First, the overhead of calling the character sequential read routine for each character read could be eliminated by carefully writing the logic from the character sequential read routine directly into the block read routine, and second, we could allow more overlap of computation and input/output. The invitation to do the latter is contained in the middle loop, where repeated calls to DISKREAD are made for each full sector to be read. As given here, each of these calls is followed immediately by a call to waitdiskdone, but since each of the buffers being read is independent of the others, it would be possible to first post all of the read requests and then wait for all to be completed, giving the disk scheduler a chance to optimize the requests, and allowing the disk address translation for one request to be done while another is being processed. This would add noticeable complexity to the code but it would likely lead to significant performance improvements.
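The restructured middle loop might look like the following sketch, which posts every full-sector read before waiting for any of them; the array of done flags is hypothetical, and the code assumes the context of Figure 10.

     /* deal with full sectors, posting all requests before waiting */
     int nsect = (len - i) / df->disk->sectorsize;
     int * dones = malloc( nsect * sizeof(int) );
     int j;

     for (j = 0; j < nsect; j++) {
          DISKREAD( df->disk, &buf[i],
                    df->disk->sectorsize, sector, &dones[j] );
          i = i + df->disk->sectorsize;
          df->pos = df->pos + df->disk->sectorsize;
          sector = sector + 1;
     }
     for (j = 0; j < nsect; j++) waitdiskdone( &dones[j] );
     free( dones );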

As was the case with magnetic tape, a considerable improvement in performance can be made by allocating multiple buffers for each file, so that, for example, one buffer may be output to disk while the next is being filled, even when the buffers are being filled using the character sequential interface. On older systems where the user has direct access to the low level disk interface routines, this may be left to the applications program, but most modern systems which provide a device independent interface to disk files also attempt to provide at least a small buffer queue for each opened file variable that is used for sequential access. The simplest approach to this, called double buffering, involves only two buffers associated with each file, one of which is used, at any moment, for blocking or deblocking, while the other is used for disk input (during reads) or output (during writes).
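A minimal sketch of double buffering for sequential reads follows; the record and routine names are hypothetical, and the sketch ignores the end-of-file problem discussed above. The idea is that, on entering each new sector, the read of the following sector is posted before the current buffer is consumed.

struct dblbufvar {
     struct diskvar * disk;  /* the underlying disk or disk file */
     int pos;                /* current character position */
     char * buf[2];          /* two sector buffers, allocated at open */
     int done[2];            /* a done flag for each buffer */
};

/* post a read of sector s into buffer b */
void postfill( struct dblbufvar * df, int b, int s )
{
     DISKREAD( df->disk, df->buf[b],
               df->disk->sectorsize, s, &df->done[b] );
}

char dblbufread( struct dblbufvar * df )
{
     int sector = df->pos / df->disk->sectorsize;
     int byte   = df->pos % df->disk->sectorsize;
     int cur    = sector % 2;  /* buffer holding the current sector */
     char ch;

     /* on entering a new sector, post the read-ahead; the current
        sector was posted earlier, except at the start of the file */
     if (byte == 0) {
          if (df->pos == 0) postfill( df, cur, sector );
          postfill( df, 1 - cur, sector + 1 );
     }

     /* wait until the current buffer is full, then take the byte */
     waitdiskdone( &df->done[cur] );
     ch = df->buf[cur][byte];
     df->pos = df->pos + 1;
     return ch;
}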

As was mentioned in the context of tapes, if programs are to be allowed to mix reading and writing on one file, the problem of buffer management is considerably complicated. This problem is worse for disks than it is for tapes because it is quite reasonable to switch from writing to reading part way through a disk file, while hardware limitations make this unreasonable with most tape technologies. Highly adaptive buffering schemes, such as the adaptive scheme used by Demos that was described in Chapter 10 in the context of tapes, add considerably to the complexity that occurs when the direction of data transfer changes.

Users of the device independent interface to a disk file can sometimes make productive use of the ability to position that file to any character position within the file. Most systems provide some device dependent services for this, for example Unix provides an operation called lseek which allows random access to any byte in a file. Some systems do not allow arbitrary integers to be used as pointers to bytes, but instead, have a service which will return a magic cookie describing the position of a file as it is being read; this cookie can later be used to position the file to the same position, for example, to re-read or re-write that part of the file.

The term magic cookie actually appears to have originated in this context early in the development of the Unix system. Its application to the World Wide Web is far more recent. In general, a magic cookie is any data object returned by part of a system, where that object is not intended to be interpreted or understood by its recipient, but rather, the object is intended to be given back to whatever part of the system originally created it, and only then does it have any meaning.

A pair of services for random access in a user-level file abstraction is given in Figure 11; these are based on the services of the C programming language stream abstraction.

mark(f,i)
when f is a disk file, returns i, the current position in the disk file; on some file systems, i may be simply an integer character count from the start of the file; on others, it may be a sector number and character index in the sector or some even more inscrutable magic cookie.

seek(f,i)
when f is a disk file, sets the position to i, a value which should have previously been returned by mark.

Figure 11. Services for random access at the device independent level.

If the mark and seek services shown in Figure 11 work with integer character counts from the start of the file, they allow flexible random access with the same functions as were allowed by the low level file system interface. On the other hand, even when integer offsets are not used, as must generally be the case when files are represented as linked lists of sectors, these services allow the construction of indices into large files, so that a program can find various parts of the file without reading it from the start.
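For the representation of Figure 9, where positions are integer character counts, mark and seek are nearly trivial, as the following sketch shows; a complete version would first write back a dirty buffer, as discussed earlier.

int diskfilemark( struct filevariable * f )
{
     struct diskfilevariable * df = (struct diskfilevariable *) f;
     return df->pos;
}

void diskfileseek( struct filevariable * f, int i )
{
     struct diskfilevariable * df = (struct diskfilevariable *) f;
     df->pos = i;

     /* discard any buffered sector; the next read will refill it */
     if (df->buf != NULL) {
          free( df->buf );
          df->buf = NULL;
     }
}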

Although the device independent interface to disk files allows user programs to be written which are truly portable between devices which have different sector sizes or different file organizations, such portable programs will frequently perform poorly. The reason for this is that the disk interface software must perform a considerable amount of work if a user program does input/output in terms of a buffer size other than the natural buffer size of the disk. As a result, systems ought to include a system service allowing user programs to determine the natural block size of any open file so that buffers of that size may be allocated to allow fast input/output.

Directory Structures

The view of files presented up to this point assumes that file description data structures exist in main memory, but does not include any reference to how they got there. The set of services described in Figure 8.3 included one operation to open a file; this should return a pointer to a data structure describing that file. Some of these data structures should be allocated in main memory when the system is loaded, but most will need to be created dynamically, as various disk files are opened.

A directory is a data structure mapping file names to their descriptions. In a disk-based file system, the directory for all the files on a disk is maintained on that disk and managed by a piece of software called the directory manager. The open operation, for a disk file, is implemented as a service of the directory manager. On older systems, a fixed region of each disk (for instance, track zero of cylinder zero) is reserved to hold directory information. The FAT, or file allocation table, at the base of all Microsoft DOS disk devices, is such a primitive structure, although on modern systems, this serves primarily as a tool for managing partitions and not files. The directory structures of modern systems are usually far more complex, involving the use of a special category of files to hold directories.

In the abstract sense, all directory structures may be viewed as examples of the symbol table abstract data type. As such, they associate textual names with values, where the value is a complete file description. In this context, the open service is just a variant of the lookup operation discussed in Chapter 3. File systems vary considerably in how they handle file creation. Those which require that attributes such as file size and file organization be specified at the time of file creation sometimes have a create service which can be considered to be analogous to the define operation on the symbol table abstract data type. On systems where files grow in response to write operations and where there is only one universal file organization, the act of opening a previously nonexistent file for write access can be used to create that file.

Most file systems include a "close" operation which signals that a previously opened file is no longer needed. This allows the system to deallocate the data structures used to describe that file, but this deallocation is not always immediate. For example, a write operation may be in progress at the time the close operation is issued. Clearly, the buffers being used to write data to the file cannot be deallocated until the input/output operations on those buffers have completed!

Most file systems provide an additional service to delete existing files from the directory structure. (Many readers will be surprised by the word most in the previous sentence, but deletion may not be allowed on file systems based on WORM devices such as a recordable CD.) When a file is deleted while it is open for reading or writing, different file systems behave in markedly different ways. Under some systems, an attempt to delete an open file will fail with an error indication. Under others, the file will be deleted immediately, so that all future reads and writes to the already open file will fail. On yet others, the file is marked as ready to be deleted as soon as it is closed.

On primitive file systems, file names have no structure, and there is only one directory. This approach to file management causes problems when there are a large number of files on the system or when there are multiple users who are able to independently select file names. In both cases, name conflicts are highly likely; a name conflict occurs whenever a user tries to create a file and finds that there is already a file with the same name. The usual approach to resolving name conflicts is to either install a central naming authority, or to introduce structure into the name-space of the file system in order to resolve conflicts.

Most of the first generation of file systems were quite primitive, whether on mainframes, minicomputers or personal computers; the early systems for the Apple II and the IBM PC, both based on ideas from the late 1970's, were designed around floppy disks; on these, the designers naturally assumed that physical cabinets and file folders would be used for the top level organization of files, with no need for complex directory structures on each disk.

The usual approach to structuring the name-space of a file system is to introduce a hierarchy; for example, each file name may consist of two parts, the name of the user who created the file, followed by the name of the file selected by that user. Or, to solve a different problem, the basic name of the file may be followed by an extension indicating whether this file holds source code, object code, executable code, or some other derivative of the original source.

Generalizing on the idea of a hierarchical structure for file names leads to the idea of a directory tree. This idea was first publicized in 1965 by the developers of the MULTICS system, and in the years that followed, this approach slowly became the standard approach to structuring file systems, supported on everything from the smallest of microcomputer based file systems to the largest supercomputers.

A standard feature of most tree-structured file systems is the concept of relative file names using a current working directory. This is illustrated in Figure 12.

       ------ root
      |
  directory               ------ current
  |       |              |
  | jones ---------- directory
  | smith -----     |         |
  | brown --   |    | program --------- data
  |_______| |  |    | library ---       file       ---- data
            |  |    |_________|  |                |     file
            |  |                  --- directory   |
            |   ---- directory        |       |   |
            |            .            | index ----
            |            .            |   a ----------- data
            |            .            |   b ------      file
            |                         |_______|   |
             ------- directory                    |
                         .                        |
                         .                         ---- data
                         .                              file

Figure 12. A hierarchic file system.

In Figure 12, the global directory contains 3 entries, jones, smith, and brown, all of which refer to directories. Using the notation of the Unix file system, these directories would be named /jones, /smith, and /brown; the leading slash indicates that these are entries in the directory at the root of the directory tree. The directory /jones is the current working directory, with entries program and library. The file system allows files to be referenced in either of two ways: Absolute path names always state the entire path from the root directory to the desired file, for example, /jones/library/index. Relative path names state the path from the current working directory to the desired file, for example, library/index. On Unix, relative and absolute path names are distinguished by the presence or absence of a leading slash. On some systems, there is no clear distinction, and all directories between the current directory and the root (or all directories on some explicitly designated path) must be searched.

It is worth noting that the first widespread implementation of hierarchic file names was on the MULTICS system; under this system, the > character was used as a path separator character in file names, so the example above would have been written >jones>library>index. Microsoft's DOS operating system for the IBM PC and compatibles did not originally support hierarchical file names, and when this feature was added, the syntax and usage of Unix was copied very deliberately. Unfortunately, the forward slash character was already being used by the DOS command interpreter, so Microsoft opted to use the backslash, changing the example to \jones\library\index.

It should be noted that, although Figure 12 shows a directory tree, the concepts of hierarchic name structure, current working directory, and absolute versus relative path names may be used even when the physical directory is a single 'flat' data structure. In this case, the entire absolute path name of a file is stored as that file's name in the directory, and the current working directory name is simply a text string which serves as an implicit prefix on relative path names. The primary disadvantage of this scheme is that the single directory structure can become very large and, as a result, complex search algorithms may be needed to efficiently search it. In contrast, if each logical directory in a hierarchic system is implemented as a physically separate directory, the typical directory may be short enough that a simple linear search will be the best search algorithm.
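In the flat scheme just described, resolving a relative path name reduces to string concatenation. The following sketch, with hypothetical names, shows the idea:

#include <stdlib.h>
#include <string.h>

/* returns a newly allocated absolute path name; the caller frees it */
char * makeabsolute( const char * cwd, const char * name )
{
     char * abs;

     if (name[0] == '/') {        /* already absolute */
          abs = malloc( strlen( name ) + 1 );
          strcpy( abs, name );
     } else {                     /* prefix with the working directory */
          abs = malloc( strlen( cwd ) + strlen( name ) + 2 );
          strcpy( abs, cwd );
          strcat( abs, "/" );
          strcat( abs, name );
     }
     return abs;
}

Given the current working directory /jones from Figure 12, makeabsolute( "/jones", "library/index" ) yields /jones/library/index, the same file named by the absolute path in the earlier example.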

On systems which use the concept of a current working directory, it is common to set that directory to initially point to the user's personal directory when the command language interpreter is first started. Commands are usually provided to change the current working directory. The Unix operating system, for example, uses the cd command for this, and this usage has been copied by Microsoft's DOS. If the command cd library is issued in the context of Figure 12, the current working directory will be changed to /jones/library. Note that the argument of the cd command used here was a relative path name, but that an absolute name could have just as easily been used.

Although simple tree-structured directories have proven to be adequate for many applications, a number of systems have generalized on the simple tree concept by allowing arbitrary graph structured directories. The Unix operating system has gone part of the way towards this by allowing the same data file to be referenced from many different directories. This allows two users to share a file while using entirely different names for that file, and it allows a user to move a file from one directory to another in a very secure way, adding a link to the file from the new directory before deleting the link from the old directory. As a result, a file cannot be lost due to an ill-timed system failure during the move or rename operations.

The Unix system makes a second use of arbitrary graph structures by allowing a link in each directory to refer to the parent of that directory in the file system. This link is always named "..", and is most commonly used as a relative path name in the command cd .., which changes the current working directory to a parent of what had been the current working directory. This link in the directory structure serves an important secondary purpose; it adds redundancy to the directory tree so that, if a directory is lost due to a disk failure, the entries in the lost directory referring to other directories can be reconstructed by following the ".." links; this is the fundamental tool used by the fsck utility to rebuild Unix file systems after the disk has become corrupted.

There are two basic approaches to implementing directories in the context of multiple disk devices. One is to support a separate directory structure on each device; the other is to integrate all devices into a common directory structure. The former approach is taken by Microsoft's DOS system, where file names must generally be prefixed by device names, as in C:\boot.com, referring to the file named boot.com in the root directory of the disk device named C. Unix takes a completely different approach, integrating all devices into a common directory tree, so there is only one root for the file system, no matter how many disk devices are present. In Unix, the entire directory tree of any disk device may be mounted as a subtree of any currently mounted directory. When the system is started, the initial file system has only the directory structure of the boot device; once the system is running, one of the first things it does is mount the other disks in the machine on the root file system. This is done by the command language script used for starting the machine, so the details of what devices get mounted vary from system to system.

An Example Directory System

As an example of a directory system, consider the problem of storing descriptions of files implemented using the indexed implementation suggested in Figures 6 to 8. Each directory entry must, of course, contain a textual file name, but it must also include enough information to build the data structures shown in Figures 6 and 7. The key information needed is the disk address of the sector at the root of the index tree and the size of the file, in sectors; the latter allows the depth of the tree and the sizes of each of the index files to be inferred. For this example, we assume that the directory entry describing a file must always be on the same device as the file itself, so there is no need to record the device in the directory entry. Given this information, a procedure to create the description of an open file can be written, as shown in Figure 13.

/* assumes <stdlib.h> for malloc and NULL, plus the struct and routine
   declarations from the preceding figures */
struct diskvar * indexdiskopen( struct diskvar * dev, int root, int size )
{
     if (size > 1) {
          /* big (indexed) file */
          struct indexdiskvar * dv;
          int entries;   /* index entries (disk addresses) per sector */
          dv = malloc( sizeof( struct indexdiskvar ) );

          dv->read = indexdiskread;
          dv->write = indexdiskwrite;
          dv->sectorsize = dev->sectorsize;
          dv->totalsectors = size;
          dv->cylinders = size; /* here, we lie about the geometry! */
          dv->surfaces = 1;     /*   "   */
          dv->sectors = 1;      /*   "   */
          dv->disk = dev;
          /* the index holds one address per data sector, so its size
             must be rounded up to a whole number of index sectors */
          entries = dv->sectorsize / sizeof(int);
          dv->index = indexdiskopen( dev, root,
                               (size + entries - 1) / entries );
          dv->indexbuf = NULL;
          dv->indexsect = 0;
          return (struct diskvar *) dv;
     } else /* size == 1 */ { 
          /* small (trivially contiguous) file */
          struct lineardiskvar * dv;
          dv = malloc( sizeof( struct lineardiskvar ) );

          dv->read = lineardiskread;
          dv->write = lineardiskwrite;
          dv->sectorsize = dev->sectorsize;
          dv->totalsectors = size;
          dv->cylinders = 1;
          dv->surfaces = 1;
          dv->sectors = 1;
          dv->disk = dev;
          dv->base = root;
          return (struct diskvar *) dv;
     }
}

Figure 13. Low level code to open a disk file.

Note that the code in Figure 13 is recursive, with the number of recursive calls controlled by the number of layers of index files used to reach the data sectors of the user's file. This recursion exactly matches the recursion implicit in Figure 8; here, we explicitly construct the hierarchy of objects that will control that implicit recursion.

Directories themselves can be stored as files, so a directory can use exactly the same data organization as any user file. If the directory is small enough to fit in one sector (typically 512 to 4096 bytes on modern disks), no fancy file structure is needed, while if the directory is larger, we may use exactly the same kind of indexing that is used for large data files. Surprisingly, this was not realized until relatively late in the history of file systems; many early file systems used special data structures on disk with organizations completely different from those of typical data files. The Unix system was the first widely studied system to maintain directories as a kind of file; in Unix and its descendants, user programs may even open directories and read them like ordinary files, and in fact, this is how the ls utility under the command shell originally operated.

If each directory is represented as a text file, with one directory entry per line, the code to search a directory for some file name simply compares the desired name with the name on each line until a match is found, then extracts from that line the information needed to create the in-memory data structures, for example by calling the indexdiskopen routine shown in Figure 13 with the root sector number and file size stored in the directory.
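
A minimal sketch of such a search follows, assuming (our assumption, not the text's) that each line of the directory holds a name, a root sector number and a size separated by blanks, and reading the directory through the standard I/O library for simplicity:

#include <stdio.h>
#include <string.h>

/* from Figure 13 */
extern struct diskvar * indexdiskopen( struct diskvar * dev,
                                       int root, int size );

struct diskvar * dirsearch( FILE * dir, struct diskvar * dev,
                            const char * name )
{
     char line[128];       /* one directory entry per line */
     char entryname[64];
     int root, size;

     while (fgets( line, sizeof(line), dir ) != NULL) {
          /* parse the entry: name, root sector number, size */
          if (sscanf( line, "%63s %d %d", entryname, &root, &size ) != 3)
               continue;                       /* skip malformed lines */
          if (strcmp( entryname, name ) == 0)  /* found a match */
               return indexdiskopen( dev, root, size );
     }
     return NULL;           /* the name is not in this directory */
}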

In a non-hierarchical file system, this would finish the job; in a hierarchical system, this process must be repeated for each component in the path name given; as a result, opening a file using a long path name may involve opening a large number of other files representing the various directories along that path.
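
A sketch of that repetition follows. Here, dirasstream, which presents an open directory file as a standard I/O stream, is a hypothetical helper invented for this illustration; note also that strtok modifies the path string it is given:

#include <stdio.h>
#include <string.h>

extern FILE * dirasstream( struct diskvar * dv );   /* hypothetical */

struct diskvar * resolvepath( struct diskvar * dev,
                              struct diskvar * rootdir, char * path )
{
     struct diskvar * dv = rootdir;
     char * component = strtok( path, "/" );   /* first path component */

     while ((component != NULL) && (dv != NULL)) {
          FILE * stream = dirasstream( dv );   /* read dv as a directory */
          dv = dirsearch( stream, dev, component );
          fclose( stream );
          component = strtok( NULL, "/" );     /* next path component */
     }
     return dv;    /* NULL if any component along the path was missing */
}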

On a hierarchical file system, we would also typically add a flag to each directory entry indicating whether that entry describes a data file or a directory. Under Unix, this flag and the file size information are part of the i-node, not the directory entry. Unix directories relate textual file names to i-numbers, where an i-number is an index into an array of i-nodes stored on disk. All of a file's attributes except its name are part of the i-node, while the name is stored in the directory; this allows one file to be referenced by multiple names from several different directories.
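
In outline, the two directory entry layouts just described might be declared as follows; the field names and most of the sizes are illustrative guesses, though the 2-byte i-number and 14-byte name do match the classic Unix directory format:

/* one entry in a directory that holds all file attributes directly */
struct direntry {
     char name[28];      /* textual file name */
     int  isdirectory;   /* flag: does this entry name a directory? */
     int  rootsector;    /* disk address of the root of the index tree */
     int  size;          /* file size, in sectors */
};

/* one entry in a Unix directory: just a name and an i-number;
   all other attributes are found in the i-node it selects */
struct unixdirentry {
     short inumber;      /* index into the on-disk array of i-nodes */
     char  name[14];     /* textual file name */
};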

Recovery After a Failure

Unlike the contiguous allocation scheme, all list and tree-structured file organizations must devote some disk space to descriptions of files. In the contiguous allocation scheme, the loss of a disk sector destroys at most one sector of data, but with these linked representations, the loss of a sector can destroy access to part or all of a file. Whether files are allocated contiguously or using a linked structure, the entire file system is threatened when a disk sector in the directory structure is lost.

There are many ways that the contents of a disk sector can be lost. The most obvious is mechanical destruction of the recording surface, for example, by an appropriately sized particle of dust lodging briefly between the head and the disk and scratching away the data. Electronic damage can occur if the power fails while a sector is being written; it takes time to write a sector, and if the electronics simply stops in mid-write, part of the sector will contain newly written data while the rest contains old data. As a result, the checksum, which covers the entire sector, will be invalid, and the data must be considered lost.
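
The role of the checksum can be illustrated with a small sketch; the simple additive checksum used here is an assumption made for brevity, as real disks use stronger codes such as cyclic redundancy checks:

#include <stdint.h>

#define SECTORSIZE 512

/* return nonzero if the sector data matches its recorded checksum */
int sectorvalid( const uint8_t data[SECTORSIZE], uint16_t storedsum )
{
     uint16_t sum = 0;
     int i;
     for (i = 0; i < SECTORSIZE; i++)
          sum += data[i];   /* a half-written sector almost surely
                               changes this sum */
     return sum == storedsum;
}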

The Xerox Pilot operating system overcomes the vulnerabilities discussed above while maintaining the benefits of tree-structured access to individual disk sectors. The motivation for this was the relatively high frequency of power failures during disk operations that must be expected on personal computer systems. On large professionally operated computer systems, uninterruptible power supplies are common and it is possible to rely on the operators to follow fairly complex shutdown procedures. With personal computers, uninterruptible power supplies are rare, and even if users are taught to follow some clean shutdown procedure, accidental shutdowns are fairly common.

The Xerox Pilot system was an exploratory operating system for personal computers, built in the mid 1970's on hardware that was far ahead of what the marketplace could provide at the time. The deliberate intent of the project was to explore the kinds of systems users would want on the hardware that might be available a decade in the future. At about the same time, Xerox was experimenting with large-scale use of the window-mouse-icon-pointer user interface (originally developed at the Stanford Research Institute) and was developing the Ethernet local area network standard.

Like Unix, the Pilot directory structure relates textual file names to file numbers; in Unix, these numbers are called i-numbers because they refer to i-nodes. In Pilot, only permanent files were entered in directories; temporary files could be created as anonymous files, known only by their file numbers. The Pilot system took advantage of the fact that on many disks, the prefix on each disk sector contains not only physical information such as the start-of-sector bit pattern and the cylinder, surface and sector fields, but also fields that the operating system may use for its own purposes. In Pilot, the system used these fields to store, attached to each sector of the disk, the file number of the file to which that sector is allocated and the sector number within that file that the sector represents.
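
In outline, the logical part of each sector's label might carry the following information; the declaration is ours, not Pilot's actual on-disk format:

/* the logical part of a per-sector label, as described above */
struct sectorlabel {
     int filenumber;     /* the file to which this sector is allocated */
     int sectornumber;   /* this sector's position within that file */
};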

Thus, even if there is no data structure describing where the sectors of a file are stored, if each sector is tagged as in Pilot, a linear search of the disk can be used to find any sector of any file. Of course, this is not an efficient way to access the sectors, but it guarantees that no matter what happens to the disk, only those sectors actually damaged will be lost. In Pilot, a tree structure was built to allow fast access; instead of a separate tree for each file, as in Unix, Pilot used one big search tree, using the concatenated file number and sector number within the file as the search key. This tree was organized as a B-tree, using one disk sector to store each internal node; B-trees are a standard data structure for organizing balanced search trees on disk.
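
The recovery scan just described might be sketched as follows, using the sectorlabel declaration above and a hypothetical readlabel routine that returns the label recorded with a given physical sector:

extern struct sectorlabel readlabel( struct diskvar * dev, int addr );

/* find the physical address of sector s of file f by linear search */
int findsector( struct diskvar * dev, int f, int s )
{
     int addr;
     for (addr = 0; addr < dev->totalsectors; addr++) {
          struct sectorlabel lab = readlabel( dev, addr );
          if ((lab.filenumber == f) && (lab.sectornumber == s))
               return addr;   /* found it */
     }
     return -1;   /* only possible if the sector itself was damaged */
}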

The data nodes in a B-tree are all leaf nodes, and all are at the same level in the tree. Each internal node contains between n/2 and n pointers to (or disk addresses of) nodes at the next lower level in the tree, and between n/2-1 and n-1 key values. The root is a special case, containing from 1 to n pointers. A B-tree with 2 pointers and one key value per node is searched in the same way as a binary search tree; nodes with more than two pointers are searched using a binary search of the keys in the node to find which pointer to follow to the next lower level. Adding a new leaf node to a B-tree is easy if the parent of that new leaf has fewer than n pointers: a new key value and a new pointer are simply sorted into the immediate parent. If there are already n pointers in a node and a new pointer must be added, that node must be split into two nodes, replacing one pointer in its parent with two pointers, an operation which may in turn force the splitting of the parent. Splitting the root adds an extra level to the tree.
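
In terms of data structures, an internal B-tree node and the binary search within it might look like the following sketch; the capacity N and the field layout are illustrative assumptions:

#define N 32   /* maximum pointers per node, sized to fill one sector */

struct btreekey {
     int filenumber;     /* high-order half of the search key */
     int sectornumber;   /* low-order half of the search key */
};

/* compare two keys; negative, zero or positive, like strcmp */
int keycompare( struct btreekey a, struct btreekey b )
{
     if (a.filenumber != b.filenumber)
          return a.filenumber - b.filenumber;
     return a.sectornumber - b.sectornumber;
}

struct btreenode {
     int count;                   /* pointers in use, between N/2 and N */
     int child[N];                /* disk addresses one level down */
     struct btreekey key[N - 1];  /* key[i] separates child[i], child[i+1] */
};

/* binary search within one node for the child pointer to follow */
int btreefindchild( struct btreenode * node, struct btreekey k )
{
     int low = 0;
     int high = node->count - 1;
     while (low < high) {
          int mid = (low + high) / 2;
          if (keycompare( k, node->key[mid] ) < 0)
               high = mid;        /* k lies left of key[mid] */
          else
               low = mid + 1;     /* k lies at or right of key[mid] */
     }
     return low;                  /* index into node->child */
}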

Pilot used the same philosophy for maintaining directory structures as it used for the low-level mapping of file numbers to the physical storage for each file. For each file, sectors zero and up were used to hold the data in the file, while sector -1 was used to hold the file name and directory information for the file.

Under the Pilot system, if there was ever a suspicion that the directory or file structure had been damaged, a linear scan over the entire disk could be used to rebuild the B-tree used to access sectors of files by number, and once this was rebuilt, a second scan, examining sector -1 of each file, could be used to completely rebuild the directory structure. This guaranteed that, no matter what damage occurred, all of the data on the disk, with the exception of the sectors actually damaged, would remain accessible.

Most modern file systems offer at least some redundancy, so that the file system on a corrupted disk may be at least partially rebuilt. Under Unix, the program to do this is called fsck; under early versions of the Microsoft and Apple systems, no similar tool was provided with the system, but third-party suppliers such as Norton provided similar utilities. Today's commercial systems all provide at least some built-in recovery tools, but it appears that none of these tools offer the strict guarantee offered by Pilot, that the only sectors lost to a disk error are the sectors actually corrupted by that error.

Another interesting feature of the data structures used by Pilot is that they made it very easy for a user to allocate sectors of a disk file which are not logically contiguous; for example, a user could have a file containing only sectors 1, 15, 23 and 2145. This allows the user to treat sector numbers as logical information about a sector; thus, the data describing some employee might be stored in the sector of the employee file indexed by that employee's social security number. This is called an indexed sequential file organization. Indexed sequential files can be implemented on top of tree-structured organizations such as those used on Unix, but only with the aid of an additional layer of software to manage the index. Indexed sequential files are also directly supported by IBM's OS family of operating systems for the IBM System/360-370-390 family of computers.
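
As a sketch of this use of sparse files, the record for a given employee might simply be written to the sector numbered by that employee's identifier. Here we assume, purely for illustration, that the write operation installed in Figure 13 takes the device, a sector number and a buffer:

/* store one employee record in the sector selected by the employee's
   ID number; empfile is an open sparse file as described above */
void putemployee( struct diskvar * empfile, int id, char * record )
{
     /* the parameter list of the write function pointer is assumed */
     empfile->write( empfile, id, record );
}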

As pointed out in the previous chapter, RAID hardware provides redundancy below the level of the file system, and this makes a huge difference in the reliability of file storage for large computer systems. A more recent approach to redundancy, developed by Rosenblum and Ousterhout in 1992, is to use some part of the disk to store a log of disk transactions; this journal, as it has come to be known, has been combined with the B-tree structure used in XDFS as the foundation of today's journalling file systems. In a journalling file system, once a change has been recorded in the log, no corruption of the directory or index structures on the disk will cause the data to be lost.
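
The core discipline of a journalling file system can be sketched as write-ahead logging; every routine named below is hypothetical, standing in for whatever the file system actually provides:

struct logrecord;   /* describes one intended change (details elided) */

extern void appendtolog( struct logrecord * rec );   /* hypothetical */
extern void forcelogtodisk( void );                  /* hypothetical */
extern void applyupdate( struct logrecord * rec );   /* hypothetical */

void journalledupdate( struct logrecord * rec )
{
     appendtolog( rec );   /* step 1: record the intended change */
     forcelogtodisk();     /* step 2: make the record durable */
     applyupdate( rec );   /* step 3: update the real on-disk structures */
     /* after a crash, recovery replays logged but unapplied changes,
        so a torn update of the index or directories loses nothing */
}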

Terminology

The reader should be familiar with the following general terms after reading this section:

virtual devices                 access methods
contiguous file organizations   linked-list file organizations
tree-structured files           i-nodes
indexed-sequential files        virtual address
physical address                virtual address translation
locality of reference           demand allocation
allocation strategy             directory structures
hierarchical directories        path names
current working directories     absolute and relative path names
journalling file system
In addition, the reader should be familiar with the following system services of the example operating system:
mark
seek

Exercises

  1. The maximum size of a classic Unix file without indirect addressing is 10 sectors or 5120 bytes with the classic sector size of 512 bytes. With a single level of indirection, it is 138 sectors or 70656 bytes; what is the maximum size for files which use two and three levels of indirection?

  2. Assuming that a sector holds 256 bytes and that each sector can hold 64 disk addresses, how many bytes could be stored in a file such as is shown in Figure 2, assuming

    a) no index sectors?

    b) one index sector?

    c) two levels of index sectors?

    d) three levels of index sectors?

  3. The classic Unix file system allows only 32 bits (4 bytes) to specify the address of any sector on disk, and each sector holds 512 bytes. What is the maximum storage capacity of a single disk device?

  4. A multimedia file system is being designed, and one goal is to allow files as large as an uncompressed movie. Movies are typically under 2 hours long, and are typically made up of 24 frames per second. Each frame is typically composed of about 1 million pixels, where each pixel at high resolution is about 24 bits (8 bits each, red, green and blue).

    a) How many bytes does it take to store the file holding an uncompressed movie?

    b) How many bits do we need to allow as the argument to a Unix-style seek command allowing seeking to any byte of a movie file?

    c) If the Unix i-node format described here is used, with 512 bytes per sector and 32-bit linearized sector numbers, can a Unix file store a movie?

    d) A high-fidelity audio channel for a movie can be stored as 20 thousand 8-bit samples per second. Modern movies frequently use on the order of 6 channels for sound: front left, center and right, rear left and right, and a sub-bass channel for the sounds you feel in your chest. How much extra does the sound track add to the size of the movie file?

  5. Directories under both the Xerox Pilot system and Unix map textual file names to numerical names (in Unix, the numerical name is the i-number specifying the i-node that describes the file). Both systems use a tree structure for efficient access to a particular sector of a particular open file. How do these systems differ?

  6. Write code for the indexdiskwrite routine that goes with Figure 8. Assume writing past the end of file is an error.

  7. Writing beyond the end of file of a disk file ought to cause the file to be enlarged.

    a) Write an indexdiskexpand routine that enlarges an indexed disk file as defined in Figure 7 by one sector. Assume you can call newsector(disk,addr) to get the address of a free sector for the named disk, where addr is the address of the sector the new sector should follow with minimum latency.

    b) Write code for the indexdiskwrite routine that goes with Figure 8 that will add a new sector to the file if the user tries to write past the end of file.

  8. Write code for the diskfilewrite routine that goes with Figure 9. Ignore the problem of changing between reading and writing in mid-file.

  9. Write code for the diskfilewrite routine that goes with Figure 9. Pay attention to the problem of clean transitions between reading and writing during the processing of a file. You will have to modify the declaration of struct diskfilevariable to solve this problem.

  10. Write code for the "mark" and "seek" services defined in Figure 11, using the data structures shown in Figure 9, and taking into account the needs of the code shown in Figure 10.

  11. Using the indexdiskopen code from Figure 13, write a diskfileopen function which creates the data structures needed for device-independent access to a file (see Figure 9). This new function should take the same parameters as indexdiskopen; presumably, this new routine would be called by open, the directory search routine called by the user.

  12. Write a "close" routine which, when passed a description of an open disk file at the level of the description shown in Figure 9, will deallocate all of the components of that description which need deallocation. (Use the C library function free(p) for this; free takes p, a pointer to a block of memory previously allocated by malloc, and returns that block of memory to the system.)


References

For an old but well written summary of many of the issues faced by the designer of a file system, see section 9.5 of

Systems Programming by Donovan (McGraw-Hill, 1972)

For a more detailed discussion of these issues, see chapters 8 and 9 of

Software Systems Principles by Freeman (SRA, 1975)

Another useful discussion of these issues, including the issue of access methods, allocation methods and directory structures is in Chapter 3 of

Operating System Concepts by Peterson and Silberschatz (Addison-Wesley, 1983)

Good descriptions of the classic Unix file system are contained in the series of papers in

The Bell System Technical Journal, volume 57, number 6 (July-August, 1978) pages 1905-1969; see especially Section IV of the paper by D. Ritchie and K. Thompson on "The Unix Time-Sharing System", and Section IV of the paper by K. Thompson on "Unix Implementation".

The improved Unix file system of 4.2BSD is described in

"A Fast File System for Unix", by McKusick, Joy, Leffler and Fabry, in ACM Transactions on Computer Systems, volume 2, number 3 (August, 1984) pages 181-197.

The Pilot operating system was described in

"Pilot: An Operating System for a Personal Computer," by D. D. Redell, et al, in Communications of the ACM, volume 23, number 2 (Feb. 1980) pages 81-92.

The idea of a log or journal for a file system is introduced in

"The Design and Implementation of a Log-Structured File system," by M. Rosenblum and J. K. Ousterhout, in ACM Transactions on Computer Systems, volume 10, number 1 (Feb. 1992) pages 26-52.