Assignment 6, due Mar 14

Solutions

by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

  1. Background: Consider the classic Unix i-node as described in the notes for lecture 18. Each i-node holds 13 sector numbers, the first 10 of which refer to the first 10 sectors of the file, while the final 3 are used for successively deeper trees of index sectors (with data sectors as leaf nodes on the tree). In the classic system, sectors are 512 bytes and sector numbers are 32 bits.

    a) What is the largest disk this file system can be used on? (0.3 points)

    32-bit sector numbers allow 4,294,967,296 sectors of 512 bytes each, or 2,199,023,255,552 bytes. For practical purposes, 2 terabytes.
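
    This arithmetic can be checked with a minimal Python sketch (the constant names below are just for illustration):

        SECTOR_BYTES = 512
        MAX_SECTORS = 2 ** 32              # 32-bit sector numbers
        print(MAX_SECTORS * SECTOR_BYTES)  # 2199023255552, about 2 terabytes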

    b) What is the largest file this file system can support? (0.3 points)

    10 sectors directly addressable, plus
    128 sectors addressed by the 11th disk address, plus
    16,384 sectors addressed by the 12th disk address, plus
    2,097,152 sectors addressed by the 13th disk address.

    This gives a sum of 2,113,674 sectors or 1,082,201,088 bytes. For practical purposes, a gigabyte.
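
    Continuing the Python sketch from part a, the sum can be packaged as a small helper; the name max_file_sectors and its parameters are illustrative, and the sketch assumes 128 four-byte sector numbers fit in each 512-byte index sector:

        POINTERS_PER_SECTOR = SECTOR_BYTES // 4   # 128 sector numbers per index sector

        def max_file_sectors(direct, trees, fanout=POINTERS_PER_SECTOR):
            # direct data sectors, plus one index tree of each depth 1..trees
            return direct + sum(fanout ** depth for depth in range(1, trees + 1))

        sectors = max_file_sectors(direct=10, trees=3)
        print(sectors)                  # 2113674
        print(sectors * SECTOR_BYTES)   # 1082201088, about a gigabyte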

    c) Suppose the system was changed to use the first 8 disk addresses in each i-node to refer to the first 8 sectors of the file, with the final 5 addresses used to refer to successively deeper trees of index sectors. Ignoring questions of disk size, what is the maximum file size this would permit? (0.3 points)

    For practical purposes, this multiplies the maximum file size by 16,384, so the maximum will be on the order of 16 terabytes.
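
    Using the hypothetical max_file_sectors helper sketched in part b, the 8-direct, 5-tree layout works out to:

        print(max_file_sectors(direct=8, trees=5) * 512)   # about 16 terabytes (a bit over 2**44 bytes)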

    d) Evaluate the prudence of using the alternative suggested in part c. (0.3 points)

    A maximum file size larger than the maximum device size can never be fully utilized. However, if we used only the first 9 disk addresses as direct addresses (leaving 4 for index trees), the maximum file size would be about 128 gigabytes, significantly less than the 2-terabyte device size. Therefore, the proposal in part c is the simplest way to modify the classic Unix file system to allow full utilization of the largest possible disk drive this system supports.
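
    The 128-gigabyte figure for the 9-direct alternative can be checked with the same hypothetical helper from part b:

        print(max_file_sectors(direct=9, trees=4) * 512)   # roughly 128 gigabytes, well under the 2-terabyte device limit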

    e) If the sector size was changed from 512 bytes to 1024 bytes, how many of the 13 disk addresses in each i-node should point to data sectors, and how many should point to successively deeper trees of index sectors? In answering this, take into account both the maximum file size and the maximum disk size. (0.3 points)

    This increases the maximum disk size by a factor of two, to 4 terabytes, and it also increases the capacity of a disk sector from 128 to 256 sector numbers. This means that the classic Unix i-node organization (10 + 3) would allow about 256³×1024 bytes per file, or about 16 gigabytes. Switching one pointer from direct to indirect would change this to 256⁴×1024 bytes per file, about 4 terabytes. Thus, since this puts the maximum file size very close to the maximum disk size, we should have direct pointers to the first 9 disk sectors, and 4 successively deeper trees after that.
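
    With 1024-byte sectors the fanout doubles to 256, and a variant of the part-b sketch (same hypothetical helper, with the fanout and sector size changed) makes the comparison concrete:

        SECTOR_BYTES = 1024
        FANOUT = SECTOR_BYTES // 4   # 256 sector numbers per index sector

        print(max_file_sectors(10, 3, fanout=FANOUT) * SECTOR_BYTES)  # about 16 gigabytes, far short of the disk size
        print(max_file_sectors(9, 4, fanout=FANOUT) * SECTOR_BYTES)   # about 4 terabytes, close to the disk size
        print(2 ** 32 * SECTOR_BYTES)                                 # 4398046511104, the 4-terabyte maximum disk size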

  2. Background: Google-scale file systems require some huge files, much bigger than a single volume, as well as huge numbers of small files. Developers of Google-scale file systems have generally been unimpressed by the foundations provided by conventional file models such as those of Unix or Windows. The Wikipedia writeup on the Google file system is worth reading.

    A Question: Discuss the relative benefits of implementing file "chunks" as files in the underlying file system, versus abandoning the underlying file system and building a new file system on top of raw disk drives. (0.5 points)

    Using an existing file system as a foundation for a Google-scale file system means building a layer of middleware on top of it. This adds overhead, but it minimizes the amount of new code that must be written. The overhead includes such things as accesses to index files that hold the mappings between chunks of the Google-scale files and files on the underlying system, plus extra directory accesses to open and access each of those chunks. Building directly on raw disk drives avoids that overhead, at the cost of reimplementing the allocation and recovery machinery the underlying file system already provides.

  3. Background: The Xerox Distributed File System (XDFS) was like the Unix file system in one critical aspect: Directories mapped file names to file numbers, and then the low-level read and write operations operated in terms of reading or writing the selected sector of the selected file in terms of its file number. In XDFS, the need for an elaborate tree structure to locate the sectors of each file was eliminated. Instead, the 32-bit sector number was simply concatenated with the 32-bit file number to make a 64-bit sector number, and then the desired sector was located by looking up this 64-bit value in a global B-tree.

    You may have to look up B-trees -- the Wikipedia entry is well written. In the case of XDFS, each B-tree record was one sector, each search key was 64 bits, and linearized sector numbers were 32 bits. If we assume 512 bytes per sector, this means that each sector would hold, on average, about 64 entries.

    a) Compare the efficiency (in terms of total disk space needed for i-nodes and index sectors) of this file system with the efficiency of the classical Unix file system. Specifically, given that there are m files of n sectors each, which file system will have a higher storage overhead? (0.5 points)

    Here, to a first-order approximation, there is one index sector per 64 data sectors. With small Unix files, there is one i-node (1/16 of a sector) per 4 data sectors (assuming small files are more common than large ones), which translates to one sector of i-nodes per 64 data sectors. The overhead is about the same.

    For large Unix files, however, we use 128-ary trees, so it is one index sector per 128 data sectors, as a first-order approximation. In sum, the Unix representation is more space-efficient for large files and comparable for small files.
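
    These first-order ratios can be written out as a small Python sketch (the function names are illustrative, and the 64- and 128-entry figures are the approximations used above):

        def xdfs_index_overhead(data_sectors):
            # about one B-tree sector per 64 data-sector entries
            return data_sectors / 64

        def unix_small_file_overhead(data_sectors):
            # one i-node (1/16 of a sector here) per 4-sector file:
            # 16 i-nodes per sector cover 64 data sectors
            return data_sectors / 64

        def unix_large_file_overhead(data_sectors):
            # 128-way index trees: about one index sector per 128 data sectors
            return data_sectors / 128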

    b) What is the expected run-time cost of the first access to sector i of file j, assuming that there are m files of n sectors each? (The expected run-time cost of the second access is not a problem if the system uses appropriate caching.) (0.5 points)

    With m files of n sectors each, there are m×n total sectors indexed by the B-tree. With 64 entries per sector, on average, the expected depth of the B-tree is log₆₄(m×n). Access to a particular sector requires searching down into the B-tree, which means about log₆₄(m×n) sectors must be accessed.

    With small Unix files, once the file is open, the access time for any file sector is just an access to that sector. With large files, after the file is open, the access time will be about log₁₂₈(n) disk accesses. Therefore, Unix should be faster -- although caching commonly used disk sectors in RAM will speed up both schemes, the amount of cache capacity needed to achieve a given performance will be higher on the Xerox system.
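
    A minimal Python sketch of the expected first-access cost, counting the read of the data sector itself and using the same first-order fanouts (the function names and example figures are illustrative):

        import math

        def xdfs_first_access(m, n):
            # descend the 64-ary B-tree over all m*n sectors, then read the data sector
            return math.ceil(math.log(m * n, 64)) + 1

        def unix_first_access(n):
            # after open(), descend the per-file 128-ary index tree
            # (depth 0 for files of at most 10 sectors), then read the data sector
            depth = 0 if n <= 10 else math.ceil(math.log(n, 128))
            return depth + 1

        print(xdfs_first_access(m=1000000, n=100))   # 6 reads: 5 B-tree levels plus the data sector
        print(unix_first_access(n=100))              # 2 reads: one index sector plus the data sector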