5 -- Pulling Apart a Shell

CS:3620 Notes, Spring 2018

Part of the CS:3620, Operating Systems Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

A Shell Language

The standard Unix shells all split lines of input into blank-delimited chunks, so if the user types

/bin/ls -l myfile

The 3 parts are

  1. /bin/ls
  2. -l
  3. myfile

Part 1 is interpreted as a file name. The standard shells do far more than merely attempt to execute the program in this file. They also try appending the file name from the input to each of several prefixes provided in the search path (you can look at your current search path by typing the shell command echo $PATH). This is why, when you type ls as a shell command, it successfully executs /bin/ls, the utility for listing a directory. Initially, our example shell will completely ignore the concept of a search path and merely try to run the program using the given file name, so a user wishing to list a directory would need to type the full file name /bin/ls.

The shell passes parts 2 and 3 as arguments to the program. In this case, the program stored in /bin/ls interprets the -l argument as a command-line option, requesting a long-form listing, and it interprets the myfile argument as the name of the file (or directory) to be listed.

A Shell

Consider the example shell in http://homepage.cs.uiowa.edu/~dwjones/opsys/notes/mush.txt This shell is quite minimal. The main loop is

for (;;) {
        getcommand();
        splitargv();
        launch();
}

The subroutine getcommand() reads one command from standard input into the global string command (recall that in C, strings are just arrays of characters with a NUL character to mark the end).

The splitargv() routine finds the start of each blank-delimited component of command and sets one pointer in the global array argv to point to that component. It also plants a NUL at the end of each component; as a result, at the end of this process, argv is an array of pointers to strings, where each string is one component of the original command string.

The argv data structure used by the program is dictated by the conventions for launching applications in Unix. When a Unix program wants to transfer control to a different program, it passes an array of parameter strings. The first parameter to a program is always the name of the program. The number of parameters is indicated by a NULL pointer marking the end of the parameter array.

The Launch routine is responsible for actually launching the application. It uses the execve() kernel call to transfer control to the new application. The first argumet to execve() is the file name of the application to be run. The code given here is very stupid, it does not check anything about the command, it merely calls execve(argv[0], ... ). In all of the more interesting shells, there are a number of built-in commands, so before calling execve(), the shell would have to check if the command name is built into the shell. The most obvious built-in command is exit, the command that tells the shell to terminate, but if the shell supports control structures, commands such as if and while are also typically built in.

The Unix system call execve is central to our shell. The parameters to this are:

execve( name, argv, environ );

name
The name of the file to be executed.
argv
The array of argument strings
environ
A second array of strings that makes up what Unix calls the environment of the program.

The two arrays of strings are both parameters to the executed program, but the intent of the Unix is that argv contains parameters, while environ contains an array of name/value pairs (separated by equals signs) that are used as global variables. Our minimal shell passes a NULL environment. The uses made of argv and environ are merely conventions. If the program that launches an application and the applicaton agree, they could abuse these two arrays of NUL terminated strings to pass anything that can be passed using that data format.

The Search Path

Standard Unix shells use an environment variable, $PATH, to decide what file names to check for commands. Here is a typical structure for $PATH:

echo $PATH
/usr/local/bin:/bin:/usr/bin:/space/dwjones/bin:.

Components of the path are separated by colons, and each component should be a directory name. Therefore, this path means, first look for the command in /usr/local/bin, and if it is not found there, look in /bin, and then try /usr/bin. Those three binary directories are the standard directories where the official versions of system commands are stored. The /usr/local/bin/ directory is for locally installed variants of standard commands, so we look there first in order to allow local variants to take precedence over standard versions.

The final two entries in the search path given above are personal. My search path ends by checking my personal binary directory /space/jones/bin where I keep binaries for utilities I've written or installed for my own use. The final entry on my search path is . (dot), standing for the current directory. This way, if I type the name of an executable file in my current directory, it will run. If this element in the path was not present, I'd have to type ./mush to run the minimally usable shell instead of jut mush.

To use components of the path, the shell needs to get the path from the current environment. The main program in our example shell had no parameters, or rather, it ignored any parameters passed by whover launched it because none were declared. All Unix systems actually pass 3 parameters to every application you launch, so we should have declared the main program as:

int main(int argc, char **argv, char **envp)

The first parameter, argc is the count of arguments in the argv array that was passed to the main program by the call to execve made by whoever executed the main program. The second parameter is a pointer to argv, the array of arguments, where each array entry is a pointer to a character string. We could have declared this in any of several ways, all of which are equivalent because every pointer in C is potentially a reference to the first element of an array:

char **argv,    /* a pointer to a pointer to a character */
char *argv[],   /* an array of pointers to characters */
char argv[][],  /* a two-dimensional array of characters */

The first parameter is actually redundant since we could just count the entries in the array until we find a NULL pointer, but once a standard is established, we are stuck with it, whether or not it is well thought out. The third parameter is a pointer to the envrionment (which is, itself, an array of pointers to the strings defining that environment). Curiously, there is a global variable environ that is also a pointer to the environment, so there is really no need for this third parameter to the main program. This is further evidence of the unplanned evolution of C. To get access to the global variable, add this global declaration to your code:

extern char **environ;

One of the strings in the environment begins with PATH=; this means that it is a definition of the environment shell variable $PATH, so the actual search path begins after the equals sign. The C standard library has a utility function getenv() that can be used to find look up the values of variable in the environment. So, we could find the search path using:

char * path = getenv( "PATH" );

The C standard library contains a wide variety of routines for working with character strings. The shell command man 3 string will list them all with no additional information. The command man 3 xxx, where xxx is the name of one of these string functions will give the official manual page for that function. The 3 in man 3 xxx is there because we are only asking for information from section 3 of the manual (the standard library). Section 1 describes shell commands, while Section 2 describes the operating system kernel interface (you can look up execve() there.

One of the standard string functions, for example, can be used to pick off successive tokens from a string. Because the strings in the path are delimited by colons, we could get the first entry from the path as follows:

char * entry = strtok( path, ":" );

If the above code had been written in Java or Python, the string returned by strtok() would be a copy of the first token found in the string path. The C string package does not work this way! Instead, strtok() actually edits path by planting an end-of-string mark in place of the terminator on the first token and then returns a pointer to the start of that token. Our code for the mush shell includes a splitargv() command that operates in much the same way, splitting out all the substrings of the command line.

Adding a rudimentary search path to mush

The C string library provides a number of tools, among them:

Before using any of these, your program must include the header file <string.h> so that the compiler knows about these routines.

Here is an example showing how these can be used to concatenate /usr/bin with whatever is in argv[0] in order to execute a program from that directory:

char name[LEN+1];
name[LEN] = '\0';
strncpy( name, "/usr/bin/", LEN);
strncat( name, argv[0], LEN - strlen(name));
execve( name, argv, environ );

The above code first allocates a local variable, name, an array of characters. The size of this array is declared in terms of the defined symbol LEN. This must be big enough to hold one more character than the length of the longest file name expected. This allows for a guarantee that there will be room for the final terminating NUL character. The program then copies a the prefix (a component of the search path) into this buffer using strncpy(), and then uses strncat() to append the command name the user typed.

Finally, the above code fragment attempts to launch the application. If successful, it will never return because execve only returns if it is unsuccessful in launching the application. This means that we can hard-code a search path into our program as follows:

char name[LEN+1];
name[LEN] = '\0';
strncpy( name, "/usr/local/bin/", LEN);
strncat( name, argv[0], LEN - strlen(name));
execve( name, argv, environ );
strncpy( name, "/bin/", LEN);
strncat( name, argv[0], LEN - strlen(name));
execve( name, argv, environ );
strncpy( name, "/usr/bin/", LEN);
strncat( name, argv[0], LEN - strlen(name));
execve( name, argv, environ );

We could do even better by writing a loop trying path after path after path picked off of the $PATH environment variable.

Launching an application

The code to launch an application under operating systems descended from Unix is strange. A more rational design might have the exec() system call run some other application and then return when that application terminates. Instead, the Unix exec() system calls (execv() and execve()) are semantically equivalent to goto statements, permanently abandoning the caller's program as they start the new application. Nothing in the documentation for these system services hints at a way for the caller to retain control while waiting for a subsidiary application to return.

The reason Unix does things this way is a consequence of a second design decision, the way Unix handles parallel process creation. Parallel programming is an advanced topic that we will discuss in much more detail later, but here, we will use it in a rudimentary way in order to create a way for the launched program to return to its launcher. Where a more rational design might connect the launching of a parallel process with starting a new applicaton -- so an application could be started in parallel with the current application or executed sequentially after the current one -- Unix opted to allow a process to fork, that is, to create a copy of itself running in parallel with the original.

The Unix fork() kernel service causes the calling process to be duplicated. Conceptually, the new process is a copy of the caller, with every bit of the caller's memory duplicated. In fact, no read-only data is duplicated. Read only data including the code of the program can be shared by both the original process and its copy. Furthermore, since the late 1970's, we've known how to use memory management hardware to limit copying to those parts of the program's read-write data that are actually changed, so if a read-write variable is not changed, no copy will be made. We will discuss this later in our discussion of virtual memory technology.

The Unix fork() kernel call creates only one difference between the original process, the caller or parent process, and the new process, the child process. For the caller, fork() returns the process ID of the child, a nonzero integer. For the child, it returns zero. Since zero is false in C, this means that writing if(fork()) can be used to divert the parent and child into different blocks of code.

The wait() kernel call blocks the calling process until any of several events; one of these events is the termination of a child process, so the parent process can call wait() to wait for the child to finish. When that child terminates, it wait() returns the process ID of the child. Optionally, it can also capture the exit status of the child.

The exit() kernel call terminates a process. Applications normally terminate by calling exit(), and if the child does not launch an applicaton, the child itself should exit. This leads to the following framework for launching an applicaton from within another application running under Unix:

int pid; /* the process ID of the child */
int status; /* the exit status of the child */
if (pid = fork()) {
        /* parent process waits for child to terminate */
        while (wait( &status ) != pid);
} else {
        /* child process launches an application */
        execve( file, argv, environ );

        /* control reaches this point only if the execve fails */
        exit(-1);
}
/* control only reaches this point when the child has terminated */

The argument to exit() is an integer. If a pointer to an integer is passed to the wait() system call (as in wait(&status) above), 8 bits of the status argument passed to exit() are packed into the referenced integer, along with other information. There are a collection of macros that can be used to extract this information. For example, WEXITSTATUS(status) will extract the 8-bit status itself, while WIFEXITED(status) will return true if the child terminated by calling exit() (it could have terminated by a run-time error or by being killed).

Input Parsing

The example shell uses splitargv() separate one line of input into an array of arguments. The buffer to hold the text, command, is passed as a global variable, along with argv, the array where pointers to successive arguments will be put. We could have shortened getargs considerably by using the strtok() routine mentioned above. Here, we avoided doing so simply to show all of the computation in one place instead of requiring the reader to understand a complex subroutine library.

The splitargv() function loops through the arguments. For each argument, it first skips blanks, then remembers the address of the argument with &command[j], where the ampersand means "get me the address of the variable, not its value." Having found the start of an argument, it then scans for a blank or end of string, and then replaces the blank at the end of the command with a null. The result is the replacement of a single string with an array of strings without moving any of the text.

Files

In the above discussion, we ignored the question of open files. By default, when a process forks, both parent and child processes share all open files. By default, when a program uses some flavor of exec() to launch another program, the files that were open in the caller remain open in the new program. As we will see later, this has interesting security consequences.

It is possible to mark files to be automatically closed when the program does an exec(), but this is an unnecessary feature of Unix and its descendants, since a program that has open files ought to know about them, and knowing about them, the program could explicitly close them itself before calling exec().

What's Missing

Our example shell is missing a huge number of features. The code does nothing to deal with over-length lines, so formally speaking, it is in error. If you type an over-length input line, the program will fail very ungracefully. Built-in commands such as cd, if and while are completely absent. No tools are provided to support manipulation of the environment, for example, with built-in commands such as set to assign values to environment variables. There is no support for parameter and environment variable substitution, for example, by recognizing dollar signs as lead-ins to environment variable names. The use of quotation marks to enclose arguments that include blanks is missing, as is Input-output redirection. All of these would make interesting machine problems, and if they were all done, the resulting shell would be genuinely useful.