Compiler Construction

<kestrel program> ::= <block> <end of file>

<block> ::= { <block element> [ ";" ] }

<block element> ::= <declaration>
                 |  <statement>

In short, a block is a sequence of statements mixed with declarations. A declaration associates a name with an entity such as a variable or a procedure or function; the grammar given here is abridged (it ignores private and restricted declarations, it ignores exceptions and functions):

<declaration> ::= <identifier> ":" <declarator>
<declarator> ::= <constant declarator>
            |  <type declarator>
            |  <variable declarator>
            |  <procedure declarator>

The term declarator here is used as a generic term for the syntactic forms that declare attributes that can be bound to identifiers by declarations. Types, variables and constants, after all, are radically different from each other, and we need a name for the syntactic category that includes all of them.

Looking further, we find that blocks occur in several contexts. The first context is the body of a procedure (or function, but we're abridging the grammar here):

<procedure declarator> ::= "procedure"
                           [ <formal parameter list> ]
                           <body>

<body> ::= <block> "end"

The second context involves the construction of a record type (the Kestrel term for what C and C++ call a struct, but also used for objects, because blocks may contain procedure and function declarations, which in this context become the methods of the object). Again, we have abridged the grammar:

<type declaration> ::= "type" <type>

<type> ::= <reference>
        |  <record>

<record> ::= "record" <block> "end"

<statement> ::= <if>
             |  <assignment>

<if> ::= "if" <expression> [ "then" ]
             <block>
       [ "else"
             <block> ]
         "end"

Parsing Programs

<kestrel program> ::= <block> <end of file>

We can write the code for a top-down recursive descent parse_program routine something like this (assuming we add lex_gotbutwant() to the lexical analysis support package):

void parse_program() {
        parse_block();
	if (lex_this.type != ENDFILE)) {
		lex_gotbutwant( lex_this, ENDFILE );
	}
}

This parser can be changed into a translator by adding code to emit the standard prologue and standard epilogue to the object file.

void compile_program() {
        fputs( PROLOGUE, objectfile );
        compile_block();
	if (lex_iskey( lex_this, KEY_END )) lex_advance();
	if (lex_this.type != ENDFILE)) {
		er_gotbutwant( lex_this, ENDFILE );
	}
        fputs( EPILOGUE, objectfile );
}

What are the standard prologue and epilogue for your machine? You can find it by compiling an empty C program and comparing the output with the result of compiling, for example, a hello-world program. Consider compiling these two:

/* empty.c -- the empty C program */
int main() {
}

/* hello.c -- a hello-world C program */
#include <stdio.h>
int main() {
	fputs( "Hello world!\n", stdout );
}

If you compile these on our target machine with cc -S empty.c and cc -S hello.c, the output files empty.s and hello.s, will hold the assembly code corresponding to the two input files, in (one of) the assembly language(s) of the machine you are using. By comparing the two files, you should be able to identify the code that is always prefixed to a main program and the code that always follows a main program, as well as the code specific to the body of the hello world program.

What's that about multiple assemblers? Unfortunately, there is no such thing as the unique and only assembly language for any machine. It is always possible to devise an alternative assembly language for any machine, and this has been done for most machines. The GNU family of compilers generally use the GNU assemblers (versions of gas) to process their outputs, while the hardware vendor for each machine generally has a somewhat incompatible assembly language. Usually these have similar names for the machine instructions, but even that is not always true.

For the x86 family, for example, you have a choice between masm, the Microsoft assembler, and gas, the GNU assembler. There are many others. There is less variety for the ARM, but gas on the ARM is not fully compatible with the Microsoft armasm, and there are many more assemblers, ranging from freeware to high priced commercial products.

Parsing Blocks

<block> ::= { <block element> [ ";" ] }

<block element> ::= <declaration>
                 |  <statement>

We can write the framework for a top-down recursive descent parse_block routine:

void parse_block() {
	while (??) {
		if (??) {
			parse_declaration();
		} else {
			parse_statement();
		}
		if (lex_ispunc( lex_this, PT_SEMI )) lex_advance();
	}
}

This leaves two important questions unanswered. What is the terminating condition on the while loop? This determines when the block ends. What is the condition on the if statement? This determines whether the block element is a declaration or a statement. To answer these questions, we need to look at the start sets of block elements, declarations and statements.

Unfortunately, the start set of <statement> includes the start set of <assignment> and assignment statements begin with identifiers. Similarly all declarations begin with the identifier being declared. Therefore, the parser cannot distinguish statements from declarations by inspecting just their first lexeme. It must look ahead at the next lexeme. All declarations in Kestrel begin with <identifier> ":", and no statement begins this way. Therefore, by looking at lex_next, we can distinguish between statements and declarations. This gives us the following, assuming that the constant STATEMENT_KEYWORDS is the set of all keywords that could begin statements:

void parse_block() {
	while ( (lex_this.type == IDENT)
        ||      (lex_iskeyset( lex_this, STATEMENT_KEYWORDS ))
        ) { /* there are more block elements */
		if ( (lex_this.type == IDENT)
		&&   (lex_ispunc( lex_next, PT_COLON ))) {
			parse_declaration();
		} else {
			parse_statement();
		}
		if (lex_ispunc( lex_this, PT_SEMCOL )) lex_advance();
	}
}

Here, we've assumed that lex_iskeyset() allows testing a lexeme for membership in a set of keywords. This is analogous to lex_ispuncset(), a routine in our lexical analysis support package.

The compile_block() routine does not generate any code, ever. That job belongs to parse_declaration and parse_statement. The compile_block() routine does need to do something that is not hinted at above: That is, it needs to collect all of the declarations in the block from the declarations and make them available to all future declarations in the same block, as well as to all statements in the block. To understand what is involved here, we must look at the concept of scope rules and how this is evident in Kestrel.

Hierarchic Scope Rules

So what are the semantic attributes of a block? The answer lies in what we discussed in the previous lecture. Kestrel blocks build environments. During the processing of a block, each declaration in that block adds itself to the environment, and each statement in the block uses the environment created up to that point. At the end of a block, the compiler for a block must returns the environment that block has created so that the public components of that block may be accessed from the outside world. In Kestrel, this is particularly relevant in the case of the blocks that are used in record types declarations, which also serve as classes in the object-oriented sense.

Because Kestrel's nested scope rules make it legal to reference identifiers declared outside a block from within that block, the code to compile a block must take the enclosing environment as a parameter. This leads to the following skeleton for compiling a block:

environment compile_block( environment e ) {
	while ( (lex_this.type == IDENT)
        ||      (lex_iskeyset( lex_this, STATEMENT_KEYWORDS ))
        ) { /* there are more block elements */
		if ( (lex_this.type == IDENT)
		&&   (lex_ispunc( lex_next, PT_COLON ))) {
			/* =BUG= is there an alternative to this? */
			e = compile_declaration( e );
		} else {
			compile_statement( e );
		}
		if (lex_ispunc( lex_this, PT_SEMCOL )) lex_advance();
	}
	/* =BUG= should we edit so return only includes public fields? */
	return e;
}

The above code is written under the assumption that we've formulated the semantics of compile_declaration() to take an environment as input, use that environment as needed in the process of making the declaration. Can we use this alternative formulation?

			e.add( compile_declaration( e ) );

In this new version, we have explicitly used e.add(), allowing the declaration to return just one binding of an identifier to a meaning which is added to the environment by compile_declaration(). It turns out, however, that this is not sufficient for Kestrel (or, for that matter, C). The reason is that these languages allow declarations like this:

color: type ( red, blue, green );

typedef enum { red, blue, green } color;

In either Kestrel or C, this one adds a single new type binding to the environment, color, along with three new constants of that type, red, blue and green. Our alternative formulation does not permit this, so, as it turns out, the original formulation is the one we must use. It was not a bug after all.

We will devote some time in coming lectures to a discussion of the details of the attributes bound to each name in the environment, and the rules for how environments are searched to create nesting relationships between blocks.

Before we abandon the discussion of the logic for compiling a block, note that there are actually two kinds of blocks defined in Kestrel:

Lecture 16, Blocks

Blocks

Parsing Programs

Parsing Blocks

Hierarchic Scope Rules