Lecture 13, Recursive Descent Translation

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Continuing the Example

Consider the grammar of our running example:

<expression> ::= <term> { ( '+' | '-' ) <term> }
<term> ::= <factor> { ( '*' | '/' ) <factor> }
<factor> ::= [ '-' ] ( <number> | '(' <expression> ')' )

Our first goal is to convert a parser for this grammar into an expression evaluator. With the appropriate main program, this evaluator might respond as follows to these expressions:

 ( 93 )
value = 93
 - ( 78 - - 92 ) / 65
value = -2

Recursive Descent Evaluation

To convert our recursive descent parser to an evaluator, we first identify the class of values associated with each nonterminal in the grammar. In general, these values can be quite complex, but in the case of our example grammar for integer expressions, the obvious interpretation is that the value of each nonterminal is an integer, the value of that expression or subexpression.

Having determined the type of each of our nonterminals, we then take each parsing routine and convert it to a function, returning a value of that type. Internally, the code of each function uses the values associated with each of the subsidiary nonterminals and with each terminal symbol it consumes in order to to compute its own value. This leads us to the following expression evaluator:

int expression() {
/**/    int accumulator;
        accumulator = term();
        while (lex_ispuncset( lex_this, makeset2( PT_PLUS, PT_MINUS ) )) {
/**/            punct_type operator = lex_this.value;
/**/            int operand;
                lex_advance(); /* consume operator */
                operand = term();
/**/            if (operator == PT_PLUS) {
/**/                    accumulator = accumulator + operand;
/**/            } else {
/**/                    accumulator = accumulator - operand;
/**/            }
        }
/**/    return accumulator;
}

int term() {
/**/    int accumulator;
        accumulator = factor();
        while (lex_ispuncset( lex_this, makeset2( PT_TIMES, PT_DIV ) )) {
/**/            punct_type operator = lex_this.value;
/**/            int operand;
                lex_advance(); /* consume operator */
                operand = factor();
/**/            if (operator == P_TIMES) {
/**/                    accumulator = accumulator * operand;
/**/            } else {
/**/                    accumulator = accumulator / operand;
/**/            }
        }
/**/    return accumulator;
}

int factor() {
/**/    int accumulator;
/**/    punct_type operator;
        if (lex_ispunc( lex_this, P_MINUS)) {
                lex_advance() /* consume minus sign */
/**/            operator = PT_MINUS;
        } else {
/**/            operator = PT_PLUS;
        }
        if (lex_this.type == NUMBER) {
/**/            accumulator = lex_this.value;
                lex_advance(); /* consume number */
        } else {
                lex_forcepunc( PT_LPAREN );
                accumulator = expression();
                lex_forcepunc( PT_RPAREN );
        }
/**/    if (operator == PT_MINUS) {
/**/            accumulator = -accumulator;
/**/    }
/**/    return accumulator;
}

In the above, most of the new code added to evaluate expressions is marked with an empty comment in the left margin. The exception involves declaration of return types on each of the functions and collection of the returned values from the called functions. Of course, while we want direct evaluation of expressions that have constant terms, a compiler also needs to deal with expressions that have varialbe terms. In that case, the compiler must generates machine code.

A Toy Machine

Consider generating code for a simple stack machine, where the basic operations are as follows:

PUSHI c: push the immediate constant c on the stack
NEG: negate the top item on the stack
ADD: the top item is popped from the stack and added to the item below it, leaving the sum on the stack top.
SUB: the top item is popped from the stack and subtracted from the item below it, leaving the difference on the stack top.
MUL: the top item is popped from the stack and multiplied by the item below it, leaving the product on the stack top.
DIV: the top item is popped from the stack and divided into the item below it, leaving the quotient on the stack top.

These machine instructions are a subset of the instructions you will find on a real stack machine, and they are also present on some virtual machines used in lightweight implementations of programming languages. The j-code underlying many Java implementations is an example.

We can also interpret the above machine instructions as macros for a conventional machine. In this case, the first draft of each macro will typically involve a sequence of real machine instructions. This would be appropriate for a preliminary version of a compiler. It is worth noting that some macro assemblers allow very sophisticated macros, so that a compiler that begins outputs a sequence of such macro calls can actually generate surprisingly good code.

Compiling Expressions for the Toy Machine

Given a machine definition, we can (with surprisingly little difficulty) convert the parser from an expression evaluator to a compiler that generates machine code to evaluate expressions:

void expression() {
        term();
        while (lex_ispunc( lex_this, makeset2( PT_PLUS, PT_MINUS ) )) {
/**/            punct_type operator = lex_this.value;
                lex_advance(); /* consume operator */
                term();
/**/            if (operator == PT_PLUS) {
/**/                    printf( " ADD\n" );
/**/            } else {
/**/                    printf( " SUB\n" );
/**/            }
        }
}

void term() {
        factor();
        while (ispunc( lex_this, makeset2( PT_TIMES, PT_DIV ) )) {
/**/            punct_type operator = lex_this.value;
                lex_advance(); /* consume operator */
                factor();
/**/            if (operator == PT_TIMES) {
/**/                    printf( " MUL\n" );
/**/            } else {
/**/                    printf( " DIV\n" );
/**/            }
        }
}

void factor() {
/**/    punct_type operator = lex_this.value;
        if (lex_ispunc( lex_this, PT_MINUS )) {
                lex_advance() /* consume minus sign */
/**/            operator = PT_MINUS;
        } else {
/**/            operator = PT_PLUS;
        }
        if (lex_this.type == NUMBER) {
/**/            printf( " PUSHI %u\n", lex_this.value );
                lex_advance(); /* consume number */
        } else {
                lex_forcepunc( PT_LPAREN );
                expression();
                lex_forcepunc( PT_RPAREN );
        }
/**/    if (operator == PT_MINUS) {
/**/            printf( " NEG\n" );
/**/    }
}

In this code, the parsing functions no-longer return values; instead, they append assembly language text to the output stream. As a result, processing the following expression would give the following results:

Input:

 - ( 78 - - 92 ) / 65

output:

 PUSHI 78
 PUSHI 92
 NEG
 SUB
 NEG
 PUSHI 65
 DIV

When this program is run, it would go through the following sequence of values on the stack top:

stack             after this instruction
|
|  78 |           PUSHI 78
|  78 |  92 |     PUSHI 92
|  78 | -92 |     NEG
| 170 |           SUB
|-170 |           NEG
|-170 |  65 |     PUSHI 65
|  -2 |           DIV

We could add considerable complexity to the compiler to make it generate machine code for a typical multi-register machine, or we could define the macros used here to do sophisticated code generation. If the assembler's macro language is sufficiently powerful, the latter alternative is actually able to generate reasonably good code. The following facilities are required:

Conditional macro expansion. That is, a macro must be able to include conditional assembly directives that are interpreted when the macro is expanded. (Notably, the C and C++ macro preprocessors do not meet this requirement because a #define directive may not include #if directives.)
Automatic symbol generation: It must be possible to generate new symbols on the fly on the basis of computations done during macro expansion. This allows the assembler to generate, for example, register names such as R12 generated by concatenating the prefix R with the value of a symbol NUMBER that has previously been set to 12.

In combination, these features allow macros to be constructed that, for example, maintain the top of the stack in general registers while spilling the stack into RAM if the available registers are filled. Ian Stocks used this method very successfully in his Pascal compiler for the PDP-11, written in 1973 -- this was the first Pascal compiler written in North America.

Parse Trees

It is frequently the case that a compiler cannot generate code immediately as it parses the text of a program. Optimal code generation for loops, for example, frequently requires knowing the size of the loop body before the decision can be made to use short or long forms of branch instructions. (This is because many machines &emdash; notably, the Intel x86 &emdash; have 16-bit branch instructions that have limited branch displacements, in addition to 32-bit instructions with longer displacements, plus general tools to branch to an arbitrary 32-bit address, usually requiring multiple instructions.)

Deferred code generation for expressions requires that the compiler maintain an internal data structure that can be traversed, at a later time, to generate the required code. A common way to do this is to store a parse-tree. In this case, the value associated with each nonterminal symbol in the grammar is an object. The cleanest way to do this is to use an object oriented language.

To start with, the classes of all nonterminals in the grammar inherit from a common abstract superclass:

class nonterminal {
    public:
        virtual void parse();   // cause nonterminal to parse itself
        virtual void emit();    // cause nonterminal to emit its code
};

This merely says that all nonterminals can be asked to parse an instance of themselves. In a real compiler, there could be more methods here, but this will do for a start. Each concrete subclass of this superclass must provide specific methods for each of these virtual methods. Depending on the semantics of the language, each concrete subclass may provide additional methods, and in some cases, the subclasses will inherit from intermediate classes. Consider this:

class expression : nonterminal { // an expression is a nonterminal
    private:
        nonterminal * left;  // always a term
        nonterminal * right; // could be null or an expression
        lex_value operator;  // how to combine left and right
    public:
        expression();
};

expression::expression() {  // initializer
        left = NULL;
        right = NULL;
        // by convention, null righthand side means undefined operator
}

expression::parse() {  // parse one expression
        left = new term();
        left->parse(); // parse lefthand operand
	if (ispunc( lex_this, makeset2( P_PLUS, P_MINUS ) )) {
                operator = lex_this.value;
                lex_advance(); // consume an operator
        	right = new expression();
                right->parse(); // parse righthand operator
        }
}

expression::emit() {  // emit code to evaluate one expression
	left->emit();
	if (right != NULL)
		right->emit();
		if (operator == P_PLUS) {
                        printf( " ADD\n" );
                } else {
                        printf( " SUB\n" );
		}
	}
}

Note that this code now does a right-recursive parse where the original code used iteration to parse the string of operands at one level in the expression. This is because the code was forced to build a strictly binary parse tree instead of the variable-width tree suggested by the EBNF grammar. Nonetheless, the code remains relatively clear and concise, gathering all of the details of dealing with one nonterminal in that classes code.

There are some interesting design choices here: Should the parse() method be combined with the class constructor? Should the emit() method be the class destructor? These alternative are somewhat attractive and must be considered separately.

With regard to merging the destructor with code generation, consider the following arguments: For some of the nonterminals in a programming language, for example, those involving expressions and statements, it may well be that, once code is emitted for them, there is no reason to retain the parse tree that was used to generate the code. If our language only included that class of constructs, combining the emit() method with the class destructor for the corresponding nonterminal would make sense.

In other cases, however, the story is more complex. Declarations of array types, for example, may need to retain information about their component and index types, and the easiest way to do this is to simply retain the parse tree for the declaration of the array type until the end of the scope for that type. The same holds for record and class declarations, and it holds in part for subroutines, where it may indeed be practical to discard the parse tree for the subroutine body once the code is generated, but retaining the parse tree for the subroutine header may be the easiest way to retain the information needed to construct a call to that routine.

With regard to merging the constructor with the parser, the arguments are largely a matter of style. Some people prefer to keep serious computation out of constructors. In that case, parsing belongs in a separate method.

An alternative view is that the parser could be a "factory method", that is, to use Java terminology, a static method of the class that creates a class instance and parses that instance.