5 -- Pulling Apart a Shell

22C:112 Notes, Spring 2008

Part of the 22C:112, Operating Systems Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

A Shell

Consider the example shell in http://homepage.cs.uiowa.edu/~dwjones/opsys/shell.txt (a quite minimal shell). The main loop is:

        while (one_command() != -1);

The subroutine one_command reads one command from standard input, parses it, executes it, and then returns.

The subroutine get_args is called by one_command to read one line of input and parse it into arguments. The argument data structure used by the program is dictated by the conventions for launching applications in Unix. The variable argc is set to the number of arguments found on the line, and the array argv is set to hold pointers to each successive argument.

Having called get_args, one_command tests to see if the line has any executable content as follows:

        if ((argc > 0) && (argv[0][0] != '#')) {

That is, the line is executed only if argc indicates that the line is nonblank and the first character of the first argument is not a pound sign. Here, we digress into the nature of a C array. argv[0] points to the first argument on the line -- the command name, if there is a command. Since, in C or C++, a pointer to the first element of an array can be indexed just as if it were the array itself, argv[0][0] is the first character of the first argument on the line.
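
The indexing just described can be tried in isolation. The following is only a sketch: the helper name has_executable_content is hypothetical (it is not taken from the example shell), and it simply repackages the test as a function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the example shell: returns 1 if the
   parsed line should be executed, 0 if it is blank or a comment. */
int has_executable_content(int argc, char *argv[]) {
    if (argc <= 0) return 0;      /* blank line: no arguments at all */
    assert(argv[0] != NULL);      /* a nonblank line has a first argument */
    char *first = argv[0];        /* pointer to the first argument's text */
    return first[0] != '#';       /* first[0] is the same as argv[0][0] */
}
```

Called with argv holding "ls" and "-l" it returns 1; called with a first argument that begins with a pound sign, or with argc equal to zero, it returns 0.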

Having found that the line is neither blank nor a comment, one_command asks whether it is a built-in command. If it is not, it calls launch_app to launch the application.

        if (!builtin( argc, argv )) launch_app( argc, argv );

builtin must both recognize built-in commands and execute them. It returns false if the command is not recognized, and returns true if the command is recognized and has been executed. In the case of the exit command, builtin terminates the program and doesn't bother returning.
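
A function with this contract might look something like the sketch below. This is an assumption about the general shape, not the code from the example shell: the name builtin_sketch is hypothetical, and exit is the only command it recognizes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of the contract described above: return true (1)
   if the command was recognized and executed, false (0) otherwise. */
int builtin_sketch(int argc, char *argv[]) {
    assert(argc > 0 && argv[0] != NULL);   /* caller has checked for blank lines */
    if (strcmp(argv[0], "exit") == 0) {
        exit(EXIT_SUCCESS);                /* terminates the shell; never returns */
    }
    /* other built-in commands would be tested for here */
    return 0;                              /* not a built-in command */
}
```

For any command other than exit, this returns 0 and the caller goes on to launch an application.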

Data Structures

The Unix system call execve is central to the shell. To launch a non-builtin command, the shell calls:

        execve( name, argv, environ );

Here, name is the name of the file to be executed, argv is the array of argument strings, and environ is a second array of strings that makes up what Unix calls the environment of the program. These two arrays are both parameters, but the intent of Unix is that argv contains parameters, while environ contains an array of name/value pairs that behave like global variables. This is merely a convention, in the sense that an application that wants to abuse the argument list or the environment is free to do so.

Our example shell merely passes onward to the applications it launches an environment identical to the environment it was passed. On the other side, it must construct its own argument list for each application it launches.

Consider this command line:

        echo        this "is a test"

After parsing, the data structure that results is:

        argc = 3
        argv[0] --> echo
        argv[1] --> this
        argv[2] --> is a test
        argv[3] = null

To launch the echo command, the shell must call execve("/bin/echo"...). Notice that the first argument the echo command gets is just echo while the file name is /bin/echo. It is up to the shell to figure this out!
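
The parsed structure shown above can be written directly as a C array of pointers. The sketch below (the function name count_args is hypothetical) relies only on the convention that the entry after the last argument is null, so argc can be recovered from argv alone.

```c
#include <assert.h>
#include <stddef.h>

/* Count the entries in a null-terminated argument vector. */
int count_args(char *const argv[]) {
    assert(argv != NULL);
    int n = 0;
    while (argv[n] != NULL) n++;   /* stop at the terminating null pointer */
    return n;
}
```

Applied to an array initialized as { "echo", "this", "is a test", NULL }, it returns 3, matching the value of argc in the example.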

The Search Path

The standard Unix shells use an environment variable, $PATH, to decide what file names to check for commands. Here is a typical structure for $PATH:

        echo $PATH
        /usr/local/bin:/bin:/usr/bin:/space/jones/bin:.

This means, first look for the command in /usr/local/bin and then look in /bin before looking in /usr/bin. Finally, check my own binary directory, /space/jones/bin, before trying the current directory, indicated by a period. Components of the path are separated by colons, and each component should be a directory name.

Our example shell does not use the search path from the environment. Rather, it uses a hard-coded search path:

                strncpy( name, "", LINELEN);
                strncat( name, argv[0], LINELEN);
                execve( name, argv, environ );

                strncpy( name, "/bin/", LINELEN);
                strncat( name, argv[0], LINELEN);
                execve( name, argv, environ );

                strncpy( name, "/usr/bin/", LINELEN);
                strncat( name, argv[0], LINELEN);
                execve( name, argv, environ );

This relies on the fact that execve only returns if it is unsuccessful in launching the application. If it is successful, it never returns (more about this later). The above code, with three calls to execve, tries three different versions of the file name, that is, three different directories on its hard-coded search path.

Each search-path entry is created by using string manipulation routines from the C standard library. The strncpy(a,b,c) routine copies string b into string a, copying at most c characters; if b is c or more characters long, the result is not null terminated. The strncat(a,b,c) routine concatenates the string b onto the end of the string a, appending at most c characters of b followed by a null. Note that, for strncat, c limits the number of characters appended, not the total size of the destination, so in the code above, passing LINELEN bounds only the appended text and not the total length of name. Because strings in C are arrays of characters, where the last character in the string is a null, and because of the equivalence of strings, arrays of characters, and pointers to characters in C, the variables a and b can be array names or pointers to characters. Getting the string length c wrong can result in the copy or concatenate operation overwriting memory that was not allocated to the destination string. This is very dangerous, and C does nothing to prevent this error.
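
One way to sidestep the length pitfalls is to check the total length before copying anything. The following is a minimal sketch under that assumption; the function name build_name and its parameters are hypothetical and do not appear in the example shell.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: build dir followed by cmd in name[0..size-1].
   Returns 0 on success, -1 if the result (plus its null) would not fit. */
int build_name(char *name, size_t size, const char *dir, const char *cmd) {
    assert(name != NULL && size > 0);
    if (strlen(dir) + strlen(cmd) + 1 > size) return -1;  /* +1 for the null */
    strcpy(name, dir);   /* safe: the total length was checked above */
    strcat(name, cmd);
    return 0;
}
```

With a 16-character buffer, build_name(buf, sizeof buf, "/bin/", "ls") leaves /bin/ls in buf and returns 0, while a 4-character buffer makes the same kind of call return -1 without writing anything.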

The net result is that, if the user types in the command ls, the code tries first ls, to see if the command is itself a well-formed file name, and then it tries /bin/ls before trying /usr/bin/ls.

Modifying this code to use the $PATH variable would require using the getenv C standard library routine to get the value of the $PATH variable, then iteratively extracting each colon-separated chunk of the path and concatenating it with the file name from the shell command.
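
The modification just described might be sketched as follows. Everything here is an illustration rather than code from the example shell: the names path_candidate and exec_via_path are hypothetical, the 1024-character buffer is an arbitrary choice, and an empty path component is treated as the current directory.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: store in name[0..size-1] the file name formed from
   component number index (counting from 0) of the colon-separated path
   and the command cmd. Returns 1 on success, 0 if the path has no such
   component or the result does not fit. */
int path_candidate(const char *path, int index, const char *cmd,
                   char *name, size_t size) {
    assert(index >= 0);
    const char *start = path;
    for (;;) {
        const char *end = strchr(start, ':');   /* end of this component */
        size_t len = end ? (size_t)(end - start) : strlen(start);
        if (index == 0) {
            int n;
            if (len == 0)    /* empty component: use the current directory */
                n = snprintf(name, size, "./%s", cmd);
            else
                n = snprintf(name, size, "%.*s/%s", (int)len, start, cmd);
            return n >= 0 && (size_t)n < size;
        }
        if (end == NULL) return 0;               /* ran out of components */
        start = end + 1;
        index--;
    }
}

/* Try each directory on $PATH in turn; returns only if every execve fails. */
void exec_via_path(char *argv[]) {
    const char *path = getenv("PATH");
    char name[1024];
    if (path == NULL) return;
    for (int i = 0; path_candidate(path, i, argv[0], name, sizeof name); i++)
        execve(name, argv, environ);   /* returns only on failure */
}
```

As in the hard-coded version, this depends on execve returning only when it fails, so the loop moves on to the next directory only after an unsuccessful attempt.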

Note: On most Unix systems, you can look up commands, system calls and standard library routines using the man shell command. Type man man for help on this.

The most frustrating things about the on-line manual are that it sometimes throws a huge "man page" at you and that you generally need to know the name of the command you are looking up. Another problem is that some names, such as write and exit, are both the names of shell commands and the names of system calls or library routines. In those cases, man will give you the shell command. To force it to tell you about a system call, type man 2 write or man 2 execve. Section 2 of the manual is the system call section. To get information on a standard library routine, type man 3 strncpy or man 3 strncat. Section 3 of the manual is the standard library section.

Launching an Application

The code to launch an application under operating systems descended from Unix is strange. A more rational design might have the exec() system call run some other application and then return when that application terminates. Instead, the Unix exec() system calls (execv() and execve()) are semantically equivalent to goto statements, permanently abandoning the caller's program as they start the new application. Nothing in the documentation for these system services hints at a way for the caller to retain control while waiting for a subsidiary application to return.

The reason Unix does things this way is a consequence of a second design decision, the way Unix handles parallel process creation. Where a more rational design might connect the launching of a parallel process with starting a new application -- so an application could be started in parallel with the current application or executed sequentially after the current one -- Unix opted to allow a process to fork, that is, to create a copy of itself running in parallel with the original.

The Unix fork() system service causes the calling process to be duplicated. Conceptually, the new process is a copy of the caller, with every bit of the caller's memory duplicated. In fact, no read-only data is duplicated -- it is read only, so it can be shared by both the original process and its copy. Furthermore, modern memory management mechanisms allow copying to be limited to those parts of the process's read-write memory that are actually changed, so if a read-write variable is not changed, no copy will be made. We will discuss this later in our discussion of virtual memory technology.

The Unix fork() system call creates only one difference between the original process, the caller or parent process, and the new process, the child process: for the caller, fork() returns the process ID of the child, which is always nonzero, while for the child, it returns zero. (If the fork fails, no child is created and fork() returns -1 to the caller.)

The parent can call the wait() system call to wait for the child to exit. The wait() system call waits for any child process of the parent to terminate, and when that child terminates, it returns the process ID of the child. It can, optionally, also capture the exit status of the child.

The exit() system call terminates a process. Applications normally terminate by calling exit(), and if the child does not launch an application, the child itself should exit. This leads to the following framework for launching an application from within another application running under Unix:

        int pid; /* the process ID of the child */
        int status; /* the exit status of the child */
        if ((pid = fork()) == 0) {
                /* child process */
                execve( file, argv, environ );

                /* control reaches this point only if the execve fails */
                exit(-1);
        } else {
                /* parent process */
                while (wait( &status ) != pid);
        }
        /* the child has terminated with status indicating how */

The argument to exit() is an integer. If a pointer to an integer is passed to the wait() system call (as in wait(&status) above), 8 bits of the status argument passed to exit() are packed into the referenced integer, along with other information. There is a collection of macros that can be used to extract this information. For example, WEXITSTATUS(status) will extract the 8-bit status itself, while WIFEXITED(status) will return true if the child terminated by calling exit() (it could have terminated by a run-time error or by being killed).
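
The status macros can be exercised without launching any application. In the sketch below, the helper name child_exit_status is hypothetical; it uses waitpid(), a variant of wait() that names the child to wait for, and the child calls _exit() rather than exit() so that it does not flush any output buffers inherited from the parent.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits at once with the given
   code, wait for it, and return the exit status the parent observes.
   Returns -1 if the fork or wait fails or the child did not exit normally. */
int child_exit_status(int code) {
    assert(code >= 0 && code <= 255);    /* only 8 bits reach the parent */
    int status;
    pid_t pid = fork();
    if (pid < 0) return -1;              /* fork failed, no child created */
    if (pid == 0) _exit(code);           /* child process */
    if (waitpid(pid, &status, 0) != pid) return -1;   /* parent process */
    if (!WIFEXITED(status)) return -1;   /* killed or crashed instead */
    return WEXITSTATUS(status);          /* the 8-bit status from the child */
}
```

Calling child_exit_status(3) returns 3 in the parent, since the child's argument to _exit() comes back through WEXITSTATUS.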

Input Parsing

The example shell uses getargs(argv,line) to get one line of input and break it up into a sequence of arguments. The buffer to hold the text, line, is passed as the second parameter, while argv, the array of pointers to describe that line, is passed as the first parameter. It returns the argument count, argc, as the function value.

The getargs(argv,line) function calls getnonblank() to skip over blanks and return the first nonblank character it finds. It calls this after reaching the end of each argument. Arguments are terminated by blanks, but a null is stored in place of the argument terminator in the argument vector argv.

The code of getargs() is dominated by special cases to handle things like overlength lines and end of file. Stripping these out, we are left with:

        len = 0;
        argc = 0;
        for (;;) {
                /* find a nonblank character */
                ch = getnonblank();

                /* ch is a nonblank character, it starts an argument */
                argv[argc] = &line[len];
                argc++;
                line[len] = ch;
                len++;

                /* get remainder of argument (find next blank) */
                for (;;) { 
                        ch = getchar();
                        if (ch <= ' ') break;

                        line[len] = ch;
                        len++;
                }

                /* put null terminator on argument */
                line[len] = '\0';
                len++;
        }
        argv[argc] = NULL;

Most of this is trivial code, but one piece is critical:

        argv[argc] = &line[len];

This assigns a new pointer to an entry in argv[]. The pointer is the address of line[len], where the ampersand operator means "take the address of."
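
To see these pointers at work without reading standard input, consider the following variant. It is a sketch only: the name split_args and the maxargs parameter are hypothetical, and it splits text already sitting in line rather than reading characters with getchar().

```c
#include <assert.h>
#include <stddef.h>

/* Blank means any character at or below ' ', as in the code above. */
static int is_blank(char c) { return (unsigned char)c <= ' '; }

/* Hypothetical variant of getargs: split the null-terminated text in line
   in place. Each blank that ends an argument is overwritten with a null,
   and each argv entry points into line. Returns argc. */
int split_args(char *argv[], int maxargs, char *line) {
    assert(maxargs >= 1);                    /* room for the final NULL */
    int argc = 0;
    size_t i = 0;
    for (;;) {
        while (line[i] != '\0' && is_blank(line[i])) i++;  /* skip blanks */
        if (line[i] == '\0') break;          /* end of the text */
        if (argc >= maxargs - 1) break;      /* argv is full */
        argv[argc] = &line[i];               /* address of first character */
        argc++;
        while (!is_blank(line[i])) i++;      /* find end of argument */
        if (line[i] != '\0') {
            line[i] = '\0';                  /* terminate the argument */
            i++;
        }
    }
    argv[argc] = NULL;
    return argc;
}
```

No characters are copied: after the call, argv[0] is simply the address of line[0], and each later entry is the address of the first character of its argument within the same buffer.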

Files

In the above discussion, we ignored the question of open files. By default, when a process forks, both parent and child processes share all open files. By default, when a program uses some flavor of exec() to launch another program, the files that were open in the caller remain open in the new program. As we will see later, this has interesting security consequences.

It is possible to mark files to be automatically closed when the program does an exec(), but this is an unnecessary feature of Unix and its descendants, since a program that has open files ought to know about them, and knowing about them, the program could explicitly close them itself before calling exec().
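
For reference, the marking mentioned above is done with the fcntl() system call and the FD_CLOEXEC flag. The wrapper below is a minimal sketch, and the name set_close_on_exec is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>

/* Hypothetical wrapper: mark the open file fd so that it is closed
   automatically on exec. Returns 0 on success, -1 on error. */
int set_close_on_exec(int fd) {
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFD);     /* fetch the descriptor's flags */
    if (flags == -1) return -1;
    if (fcntl(fd, F_SETFD, flags | FD_CLOEXEC) == -1) return -1;
    return 0;
}
```

The alternative argued for above is simply to call close() on each such file in the child, between the fork() and the exec().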

What's Missing

Our example shell is missing a huge number of features: built-in commands such as cd, if and while; the ability to manipulate the environment, for example, with built-in commands such as set to assign values to environment variables; parameter and environment variable substitution, for example, by recognizing dollar signs as lead-ins to environment variable names; the use of quotation marks to enclose arguments that include blanks; and input-output redirection. All of these would make interesting assignments.