Chapter 5, Extending the Example Assembler

Lecture notes for 22C:50 Introduction to System Software, by Douglas W. Jones, University of Iowa Department of Computer Science

A Critique of the Example Assembler

The material presented in Chapters 2 through 4 is sufficient to write a functional but minimal assembler, but such an assembler would lack many of the features which make assembly languages convenient to use. Among the most useful of the missing features are expressions in the operand field of instructions, macros, and conditional assembly. Of course, the most glaring omission is the ability to process the machine language of some real machine.

Supporting the Language of a Real Machine

Before going into the details of how the example assembler can be extended to support the instruction set of some particular machine, it is worth noting that, if macros are available, machine-specific extensions to the assembler are not really needed. A new macro can simply be defined for each machine instruction. In fact, this approach is commonly used when an assembler for one machine is needed and a powerful macro assembler for some other machine is available. New assemblers continue to be written for new machines primarily because macro expansion can be computationally expensive.

Adding new op-codes to the example assembler is not hard. In the example statement processor in Figure 2.17, there was a two-way branch to handle the B and W alternatives. If the R6502 codes for loading and storing the accumulator were to be supported (ignoring, for the moment, all but normal direct addressing), this would expand to a four-way branch, as in Figure 5.1.

if lex.this = "B" then begin
     lex_scan {skip B};
     M[location] := operand;
     location := location + 1;
end else if lex.this = "W" then begin
     lex_scan {skip W};
     o := operand;
     M[location] := first_byte_of(o);
     M[location + 1] := second_byte_of(o);
     location := location + 2;
end else if lex.this = "LDA" then begin
     lex_scan {skip LDA};
     o := operand;
     M[location] := to_hex("AD") {opcode};
     M[location + 1] := first_byte_of(o);
     M[location + 2] := second_byte_of(o);
     location := location + 3;
end else if lex.this = "STA" then begin
     lex_scan {skip STA};
     o := operand;
     M[location] := to_hex("8D") {opcode};
     M[location + 1] := first_byte_of(o);
     M[location + 2] := second_byte_of(o);
     location := location + 3;
end;

Figure 5.1. The heart of an assembler, extended.

This code is wasteful, since both the LDA and STA instructions have the same format. If these were the only two instructions with this format, this would not be a serious problem, but most machines have large numbers of instructions with identical formats, differing only in the particular values stored in the op-code field. If each such instruction were granted its own block of code in a multiway branch such as the one shown in Figure 5.1, the result would be quite wasteful.

The solution to the above problem involves introducing a new symbol table which is used to associate symbolic op-code names such as LDA and STA with an indication of the op-code format and, for those formats requiring it, the numerical op-code value associated with that name. Typically, one format code is reserved for each directive or pseudo-instruction. Thus, for the code in Figure 5.1, formats 1 and 2 might be reserved for B and W. A quick examination of Table 2.1 shows that an assembler for the R6502 needs at least 4 other formats. Format 3 might be used for op-codes corresponding to 1 byte instructions, format 4 might be used for those with a 16 bit address field following the instruction byte, format 5 for those with an 8 bit constant following the instruction byte, and format 6 for those with an 8 bit relative address.
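As a concrete illustration, such a table might be declared in C along the following lines. This is only a sketch: the field widths, the format numbers and the particular entries are assumptions made for this example (the codes AD and 8D for LDA and STA come from Figure 5.1, and EA is the one byte R6502 NOP instruction).

struct opcode_entry {
     char name[4];   /* symbolic op-code name, null terminated */
     int  format;    /* 1 = B, 2 = W, 3 = one byte, 4 = 16-bit address ... */
     int  code;      /* machine code, for those formats that need one */
};

struct opcode_entry opcode_table[] = {
     { "B",   1, 0    },   /* directives need no machine code */
     { "W",   2, 0    },
     { "NOP", 3, 0xEA },   /* a one byte R6502 instruction */
     { "LDA", 4, 0xAD },   /* R6502 instructions with 16-bit addresses */
     { "STA", 4, 0x8D },
};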

A typical op-code table might be accessed by a routine called "opcodelookup" which, given a lexeme, looks it up and returns the associated format and machine instruction code. This leads to the logic shown in Figure 5.2, using Pascal:

opcodelookup( lex, format, code );
lex_scan {skip the op-code};
case format of
    1: begin { process the B directive }
            M[location] := operand;
            location := location + 1;
       end;
    2: begin { process the W directive }
            o := operand;
            M[location] := first_byte_of(o);
            M[location+1] := second_byte_of(o);
            location := location + 2;
       end;
    3: begin { process 1 byte instructions }
            M[location] := code;
            location := location + 1;
       end;
    4: begin { process 16 bit address instructions }
            o := operand;
            M[location] := code;
            M[location+1] := first_byte_of(o);
            M[location+2] := second_byte_of(o);
            location := location + 3;
       end;
end {case};

Figure 5.2. Use of an op-code table.

Actually, the R6502 processor (like many others) would require more complex code than is hinted at in Figure 5.2 because a given symbolic op-code may assemble to different machine instruction formats depending on the addressing mode used for the operand. Therefore, our operand evaluation routine would have to return not only the value but also an indicator of the format, and for each symbolic opcode name, we'd need a short array (indexed by operand format) giving the actual machine instruction format. The assembly language for the Pentium is even more complicated, but more modern assembly languages rarely add this kind of complexity, since most assembly language use these days is simply output from a compiler and isn't intended to be read by people.
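One plausible way to organize this in C is sketched below: each mnemonic carries a small array of machine codes indexed by the operand's addressing mode, with an impossible value marking illegal combinations. The mode numbering and the name NONE are assumptions made for this illustration; the R6502 codes shown (A9, A5 and AD for LDA with immediate, page zero and direct operands, and 85 and 8D for STA, which has no immediate form) are the actual ones.

#define NONE (-1)   /* marks an illegal mnemonic-mode combination */

enum mode { IMMEDIATE, PAGE_ZERO, DIRECT, NUM_MODES };

struct opcode_modes {
     char name[4];
     int  code[NUM_MODES];   /* indexed by the operand's addressing mode */
};

struct opcode_modes mode_table[] = {
     /*  name     immediate  page zero  direct */
     { "LDA",   { 0xA9,      0xA5,      0xAD } },
     { "STA",   { NONE,      0x85,      0x8D } },   /* no immediate store */
};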

It should be noted that the symbol table for op-codes would not generally be implemented using the same access functions as that for other symbols. This is because it never changes during the assembly process, and most entries are the same length. The latter fact suggests that a string pool would be wasted here, while the former allows a number of interesting alternatives. A sorted table with a binary search is quite reasonable here, but an even better alternative is to use a "perfect hash" function, that is, one which produces distinct integers for each allowed op-code, so there will never be collisions. Although this will not be covered here, it should be noted that a considerable amount of work has been done on perfect hash functions.
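For example, a binary search version of opcodelookup might look like the following C sketch, which assumes the table from the earlier sketch has been sorted by name and that n_opcodes gives its length; both of those names are assumptions.

#include <string.h>

extern struct opcode_entry opcode_table[];
extern int n_opcodes;   /* number of entries in opcode_table */

int opcodelookup(const char *lex, int *format, int *code)
/* returns nonzero if and only if lex names an op-code */
{
     int low = 0;
     int high = n_opcodes - 1;
     while (low <= high) {
          int mid = (low + high) / 2;
          int cmp = strcmp(lex, opcode_table[mid].name);
          if (cmp == 0) {              /* found it */
               *format = opcode_table[mid].format;
               *code   = opcode_table[mid].code;
               return 1;
          }
          if (cmp < 0) high = mid - 1; /* search the lower half */
          else         low  = mid + 1; /* search the upper half */
     }
     return 0;   /* not an op-code */
}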

The obvious way to initialize the op-code table is to make the entire table a constant, with the initial values hard coded in the source of the assembler. A useful alternative to this involves initializing the table by reading it in from a file. This can lead to a "universal assembler" which can generate machine code for a number of different machines, depending on the op-code file that is used. Such an assembler would not actually be universal, since it would only work for machines with similar instruction formats. A second alternative is to introduce a new assembly-time operation for defining an instruction and its format. This would add an instruction to the op-code table, allowing assembler users to dynamically extend the assembly language.
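A sketch of the file-reading alternative is given below in C. The line format assumed here, a name followed by a decimal format number and a hexadecimal code, as in "LDA 4 AD", is purely an invention for this example.

#include <stdio.h>

#define MAX_OPCODES 256

extern struct opcode_entry opcode_table[];   /* as sketched above */

int load_opcode_table(const char *filename)
/* returns the number of entries read, or -1 if the file cannot be opened */
{
     FILE *f = fopen(filename, "r");
     unsigned code;
     int n = 0;
     if (f == NULL) return -1;
     while (n < MAX_OPCODES
         && fscanf(f, "%3s %d %x", opcode_table[n].name,
                   &opcode_table[n].format, &code) == 3) {
          opcode_table[n].code = (int) code;
          n = n + 1;
     }
     fclose(f);
     return n;
}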

Changing and Using the Assembly Location Counter

A special symbol is introduced in most assembly languages to allow the current location to be used in the operand field. In this example, the symbol "." will be used to refer to the current location (a convention from assemblers written at DEC, now part of Compaq); the symbols "*" and "$" are commonly used in a number of other assemblers for the same purpose. Thus, for example, the instruction "W ." in our example assembly language causes the assembler to store the address of the current location in the current location. This has the same effect as "A: W A", except that no label is needed.

Merely being able to reference the location counter is of little use, but when code can include expressions such as ".+1" or "A-.", this can be very useful! If the assembly language also allows the location counter to be modified, we can do even more.
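Inside the assembler, supporting "." requires only a small extension to whatever routine fetches operand values. A minimal sketch in C, assuming a one character look-ahead in lex_this, a lex_scan routine that advances it, the assembler's location counter in location, and a routine ordinary_value that handles numbers and symbols (all four names are assumptions in the style of the other sketches in this chapter):

int value()
{
     int v;
     if (lex_this == '.') {
          v = location;          /* "." stands for the current location */
          lex_scan();
     } else {
          v = ordinary_value();  /* numbers and symbols, as before */
     }
     return v;
}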

On many machines, programs must start at some location other than zero. For example, on the R6502, the first 256 locations in memory are "page zero" and serve as temporary storage, while the second 256 are reserved for the stack; thus, R6502 programs will typically start at location 512 or 0200₁₆. On the historically important DEC PDP-8, similar constraints reserve the first 128 locations, so programs usually start at location 128, or 0200₈. Some operating systems set similar limitations where the hardware would otherwise not seem to reserve any locations. Thus, the old CP/M system for the Intel 8080 and Zilog Z80 microprocessors reserves the first 256 locations for itself, forcing user programs to start at location 256.

The starting location of an assembly language program is sometimes called the assembly origin, and most assemblers provide a facility for setting the origin. Many older assemblers provide an explicit assembly directive for this purpose, with a mnemonic name such as ORG. Other assembly languages use the symbol definition facility of the language (the equals sign in our example language, or the EQU or SET directives in many others) to set the location counter.

In the example language, the construct ". = value" will be used to set the location counter. The example in Figure 5.3 illustrates this approach to setting the origin.

1                    |. = #10
2  0010: 10          | A: B .
3  0011: 10          |    B A
4                    |. = #08
5  0008: 10          |    B A
6  0009: 09          |    B .

Figure 5.3. Setting the assembly origin

As illustrated in this example, an assembly language program may make repeated assignments to the location counter. Such a program might appropriately be called a program with multiple origins. For an example of a context where this is appropriate, consider a machine with very hard constraints on the alignment of data structures in pages of the address space. With such a machine, it may make sense to explicitly set the origin before each section of code that must be placed together in one page.

Multiple origins in a program can lead to nonsense. Consider what would happen if additional instructions were added to the end of the example in Figure 5.3. These would assemble into consecutive locations following location 09₁₆. All would be well up through the assembly of the instruction in location 0F₁₆, after which an instruction would be assembled in location 10₁₆. The problem is that information has already been assembled into this location, by line 2. It is difficult for an assembler to detect this error. Some assemblers make an effort to track which locations have been used, but most simply discard any data previously assembled for a given location when they assemble new data into it. If chaining is used to resolve forward references, the results can be very complex!
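An assembler that wants to catch such collisions can keep a map of which locations have already been filled, at the cost of one flag per location of the target address space. A minimal sketch in C, where the names M, MEMSIZE and error are assumptions made for this example:

#define MEMSIZE 65536

unsigned char M[MEMSIZE];   /* the assembled memory image */
char used[MEMSIZE];         /* nonzero once a location has been filled */

extern void error(const char *message);

void assemble_byte(int loc, int val)
{
     if (used[loc])
          error("two values assembled into the same location");
     M[loc] = (unsigned char) val;
     used[loc] = 1;
}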

Expressions

Another important facility which most assemblers provide is an expression evaluator which can be used for values in the operand field. This can be quite powerful, as is illustrated by the two examples in Figure 5.4.

     ; with expressions   ; without expressions
     . = 125              . = 125
     A: W .+2             A: W 127  ; pointer to the next word
        B 25                 B 25
     C: W C-A+2           C: W 5    ; size of A through C inclusive

Figure 5.4. Use of expressions in the operand field.

When expressions are used, the value produced is not obvious, but at least the factors which go into computing the value are. In Figure 5.4, any additional code inserted between the last two lines would be automatically taken into account where expressions are used, but the programmer would have to count the bytes inserted where constants are used, and then carefully adjust the values of the constants by the appropriate amounts.

The example in Figure 5.4 uses the simplest form of expression commonly implemented in assemblers: Operands and operators are evaluated in strict left to right order, with no operator precedence. Such expressions are described by the extended BNF grammar in Figure 5.5.

<expression> ::= <value> { ( + | - | * | / ) <value> }
          -- a value followed by zero or more operator-value pairs

Figure 5.5. A very simple extended BNF grammar for expressions.

Alternatively, this may be described using RTN notation as shown in Figure 5.6.

expression      -------
    --------->-| value |------>-----+--------->
        ^       -------             |
        |                           |
        |         +--(+)--+         |
        |         +--(-)--+         |
        +---------+--(*)--+---------+
                  +--(/)--+

Figure 5.6. A very simple RTN grammar for expressions.

A program to parse and evaluate simple expressions cannot simply perform each operation as the associated operator symbol is encountered, since the operators precede their right hand operand in the left to right order of parsing. Thus, the parser must save the operator until both its left and right operands have been evaluated.

Figure 5.7 shows how this might be done for the simple expression grammars given in Figures 5.5 and 5.6, assuming that the function "value" returns the value of the next operand and advances the lexical analysis process so that "lex_this" is the first lexeme after that operand.

int expression()
{
    int acc;    /* the accumulated value */
    char op;    /* the operator */
    int right;  /* the right operand of op */

    acc = value();       /* get the first operand */
    op = lex_this;       /* get the following operator */
    while (char_in_string( op, "+-*/" )) {
        lex_scan();
        right = value(); /* get the next operand */
        switch (op) {
        case '+': acc = acc + right; break;
        case '-': acc = acc - right; break;
        case '*': acc = acc * right; break;
        case '/': acc = acc / right; break;
        }
        op = lex_this;   /* get the next operator */
    }
    return acc;
} /* expression */

Figure 5.7. An evaluator for simple expressions.
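Figure 5.7 relies on a helper called char_in_string that is not shown there; a minimal implementation, using the standard C library, might be:

#include <string.h>

int char_in_string(char c, const char *s)
/* nonzero if and only if c occurs in the string s */
{
     return (c != '\0') && (strchr(s, c) != NULL);
}

The test against '\0' keeps the null terminator from counting as a member of every string.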

Note that, in real assemblers, other operators are frequently included, such as "&" for the bitwise and operation, "|" for the bitwise or, "~" for the bitwise not, and so on. Values are also typically extended to include not only the location counter, numeric constants, and symbols, but also internal representations of characters: thus, "A" might give the internal representation of the character A in the same way that the constant 10 gives the internal representation of the number ten.

Simple expressions with left to right evaluation are sufficient for all purposes, but they are inconvenient, since auxiliary variables must frequently be used to obtain the effect of sub-expressions or operator precedence. Any decent assembly or high-level language will allow parentheses to be used to modify the order of evaluation. A simple RTN grammar that allows parenthesized expressions is given in Figure 5.8.

expression          ------------
    ------+---(()--| expression |--())---+------+--------->
          |         ------------         |      |
          |           -------            |      |
          +--------->| value |----->-----+      |
          ^           -------                   |
          |                +--(+)--+            |
          |                +--(-)--+            |
          +----------------+--(*)--+------------+
                           +--(/)--+

Figure 5.8. An RTN grammar for expressions allowing parentheses.

Here, for the first time in this text, the recursive transition network notation has actually been used recursively. When this is translated into a program to parse expressions (with no attempt being made to evaluate the result), the recursive code given in Figure 5.9 results:

procedure expression;
var more: boolean;
begin
     repeat
          if lex = '(' then begin
               lex_scan;
               expression;
               if lex = ')' then lex_scan else error;
          end else begin
               value;
          end;
          more := lex in ['+','-','*','/'];
          if more then lex_scan;
     until not more;
end {expression};

Figure 5.9. A parser for parenthesized expressions.

The term recursive descent top-down parsing refers to the recursion found in the code used to parse expressions, statements and other syntactic entities that may contain, as components, instances of the same entity. Thus, an if statement contains, as components, statements in the then and else clauses, and an expression may contain parenthesized expressions. The parsers for such languages will include recursive routines for parsing statements and expressions. Converting the code in Figure 5.9 from a simple parser to an expression evaluator is left as an exercise for the reader.

To help understand how such a parser operates, consider the example in Figure 5.10.

Input text:  A + B - ( C + D ) * ( ( E - F ) * G )

Level 0:     A + B - (       ) * (               )
                      \     /     \             /
Level 1:               C + D       (       ) * G
                                    \     /
Level 2:                             E - F

Figure 5.10. A parsing example showing levels of recursion.

When the code for "expression" given in Figure 5.9 is applied to this example, it iteratively processes the operands A and B normally. When the parser encounters the "(", it calls itself recursively to evaluate the sub-expression involving the operands C and D. The recursive call terminates when the ")" is encountered. Similarly, in parsing the operands E, F and G, two recursive calls are performed because there are two levels of parenthesization. The top level invocation of "expression" parses everything shown on Level 0 of Figure 5.10; the first level of recursive call processes the text shown on Level 1, and the second level of recursion processes the text shown on Level 2.

Although the grammar for expressions presented in Figure 5.8 is not terribly inconvenient, it is not up to the standard of the grammars used for most high level languages because it has no operator precedence. That is, it evaluates operators in strict left-to-right order instead of giving multiplication and division precedence over addition and subtraction. The extended BNF grammar in Figure 5.11 introduces such a precedence relationship.

<expression> ::= <term> { (+|-) <term> }

<term> ::= <factor> { (*|/) <factor> }

<factor> ::= <value> | ( <expression> )

<value> ::= <identifier> | <number>

Figure 5.11. A grammar with precedence.

It is not difficult to use this grammar as the basis of a top-down recursive descent parser using three mutually recursive procedures, one to parse expressions, one to parse terms, and one to parse factors.
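A sketch of such a parser, extended to evaluate as it parses, is given below in the style of Figure 5.7. It assumes the same lex_this, lex_scan and value interface used there, plus an error routine; note that nothing in the code compares operator priorities, because the precedence of * and / over + and - falls out of the fact that expression calls term, which in turn calls factor.

extern void error();    /* assumed error reporting routine */

int expression();       /* forward declaration for the mutual recursion */

int factor()
/* <factor> ::= <value> | ( <expression> ) */
{
     int result;
     if (lex_this == '(') {
          lex_scan();   /* skip the ( */
          result = expression();
          if (lex_this == ')') lex_scan(); else error();
     } else {
          result = value();
     }
     return result;
}

int term()
/* <term> ::= <factor> { (*|/) <factor> } */
{
     int acc;
     char op;
     acc = factor();
     while (lex_this == '*' || lex_this == '/') {
          op = lex_this;
          lex_scan();
          if (op == '*') acc = acc * factor();
          else           acc = acc / factor();
     }
     return acc;
}

int expression()
/* <expression> ::= <term> { (+|-) <term> } */
{
     int acc;
     char op;
     acc = term();
     while (lex_this == '+' || lex_this == '-') {
          op = lex_this;
          lex_scan();
          if (op == '+') acc = acc + term();
          else           acc = acc - term();
     }
     return acc;
}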

There is an alternative to top-down recursive descent parsing. The alternative involves use of an explicit stack instead of the implicit stack used in recursive descent. In this case, the contents of the accumulator ("acc" in Figure 5.7) are pushed on the stack each time a "(" is encountered, and popped off when a ")" is encountered. The stack pointer in this case serves as an indication of the parenthesis nesting level.
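The sketch below applies this idea to the evaluator of Figure 5.7. Note that the pending operator must be saved along with the accumulator, since an operator may already have been seen when the "(" is reached; the depth limit and the helper apply are assumptions made for this example.

#define MAXDEPTH 32

int apply(int left, char op, int right)
/* combine two operands, as in the switch of Figure 5.7 */
{
     switch (op) {
     case '+': return left + right;
     case '-': return left - right;
     case '*': return left * right;
     case '/': return left / right;
     }
     return right;
}

int expression()
{
     int  accs[MAXDEPTH];   /* saved accumulators, one per open ( */
     char ops[MAXDEPTH];    /* saved pending operators */
     int  sp = 0;           /* the parenthesis nesting level */
     int  acc = 0;          /* the current accumulated value */
     char op = '+';         /* pending operator; 0 + x is just x */

     for (;;) {
          if (lex_this == '(') {               /* suspend this level */
               accs[sp] = acc;  ops[sp] = op;  sp = sp + 1;
               acc = 0;  op = '+';
               lex_scan();
               continue;
          }
          acc = apply(acc, op, value());       /* absorb the next operand */
          while (lex_this == ')' && sp > 0) {  /* resume the outer level */
               sp = sp - 1;
               acc = apply(accs[sp], ops[sp], acc);
               lex_scan();
          }
          if (!char_in_string(lex_this, "+-*/")) break;
          op = lex_this;                       /* save the operator */
          lex_scan();
     }
     return acc;
}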

An Alternative Approach to Expressions, Bottom-Up Parsing

Explicit use of a stack leads naturally to the other classic approach to parsing, bottom-up parsing. In a bottom-up parser, the production rules of the grammar are used as rewrite rules replacing strings of terminal and nonterminal symbols with nonterminal symbols, hoping to eventually reduce the entire input string to some target symbol. This is most easily illustrated in terms of BNF with all extensions removed; for this reason, the grammar in Figure 5.11 has been rewritten as shown in Figure 5.12, with all the production rules numbered.

1  <expression> ::= <term>
2                 | <expression> + <term>
3                 | <expression> - <term>

4  <term> ::= <factor>
5           | <term> * <factor>
6           | <term> / <factor>

7  <factor> ::= <value>
8             | ( <expression> )

9  <value> ::= <identifier>
10           | <number>

Figure 5.12. Figure 5.11 rewritten in pure BNF.

Consider the problem of parsing the string given in Figure 5.13 in terms of the grammar in Figure 5.12.

Input text:    A + B * ( C + D )

apply rule 9:  <value> + <value> * ( <value> + <value> )

apply rule 7:  <factor> + <factor> * ( <factor> + <factor> )

apply rule 4:  <term> + <term> * ( <term> + <term> )

apply rule 1:  <expression> + <term> * ( <expression> + <term> )

apply rule 2:  <expression> + <term> * ( <expression> )

apply rule 8:  <expression> + <term> * <factor>

apply rule 5:  <expression> + <term>

apply rule 2:  <expression>

Figure 5.13. A bottom-up parsing example.

This example illustrates a number of aspects of bottom-up parsing. First, note that the production rules of the BNF grammar have been applied in reverse, substituting the left hand side of the rule for the right hand side. Thus, in effect, the rules are being used as reduction rules, not production rules. Also note that not all rules which are applicable to a given partial result can be safely applied. For example, the result of applying rule 1 in Figure 5.13 contains two instances of "<expression> + <term>", yet only one of these can be safely reduced by rule 2 to "<expression>", since reducing the other would lead to the irreducible string "<expression> * <factor>" after the application of rule 8.

The problem of which substrings to reduce, and when, can be solved by determining, for each reduction rule, additional contextual information which should be examined before that rule is applied. Thus, rule 2 should only reduce "<expression> + <term>" to "<expression>" when the next symbol is neither "*" nor "/". The result is that each rule is associated with a set of legal successors. The number of lexemes which must be examined to determine whether a reduction is legal is closely related to the amount of look-ahead needed in a top-down parser. For the grammar in Figure 5.12, these successor strings involve at most one lexeme; the grammar is therefore classified as LR(1), and a parser for it is called an LR(1) parser because it scans from Left to right, constructing a Rightmost derivation in reverse, with 1 symbol of look-ahead.

Given a grammar annotated as suggested above, a parser can be constructed as outlined in Figure 5.14.

while lex <> eof do begin
     push( lex_this )   { shift phase };
     lex_scan;
     repeat        { reduction phase };
          reduce;
     until failure;
end;

Figure 5.14. A shift-reduce parser.

This is called a shift-reduce parser because it repeatedly shifts one lexeme from the input string onto the parsing stack, then tries to apply as many reduction rules as possible to the stack top, using the value of lex, when needed, to determine which reductions are legal. Repeating the example from Figure 5.13, the operation of this parser can be illustrated as shown in Figure 5.15.

Cycle Rule Stack                                       Input text
                                                       A + B * ( C + D )

1          A                                           + B * ( C + D )
      9    <value>
      7    <factor>
      4    <term>
      1    <expression>

2          <expression> +                              B * ( C + D )

3          <expression> + B                            * ( C + D )
      9    <expression> + <value>
      7    <expression> + <factor>
      4    <expression> + <term>

4          <expression> + <term> *                     ( C + D )

5          <expression> + <term> * (                   C + D )

6          <expression> + <term> * ( C                 + D )
      9    <expression> + <term> * ( <value>
      7    <expression> + <term> * ( <factor>
      4    <expression> + <term> * ( <term>
      1    <expression> + <term> * ( <expression>

7          <expression> + <term> * ( <expression> +    D )

8          <expression> + <term> * ( <expression> + D  )
      9    <expression> + <term> * ( <expression> + <value>
      7    <expression> + <term> * ( <expression> + <factor>
      4    <expression> + <term> * ( <expression> + <term>
      2    <expression> + <term> * ( <expression>

9          <expression> + <term> * ( <expression> )
      8    <expression> + <term> * <factor>
      5    <expression> + <term>
      2    <expression>

Figure 5.15. A shift-reduce parsing example.

Each line in this example shows either the result of one shift step or one reduction step. If a reduction was applied, the reduction rule number is shown to the left of the result of applying that reduction.

In a bottom-up parser which also evaluates what it parses, the stack used for parsing is usually widened to contain a value field. Whenever a reduction rule is applied, the value fields of the items on the stack are combined appropriately to produce the value field of the new nonterminal which represents the result of the reduction. Typically, the reduction rules themselves are encoded in a table, and there is a procedure associated with each reduction which is used to perform the appropriate value updates.
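A sketch of what this might look like in C is given below, using rule 2 of Figure 5.12 as the example reduction; the symbol codes, the stack layout and the way the legal successor test is written are all assumptions made for this illustration.

enum symbol { EXPRESSION, TERM, FACTOR, VALUE, PLUS, MINUS, TIMES, DIVIDE };

struct stack_entry {
     enum symbol sym;   /* a terminal or nonterminal symbol */
     int         val;   /* the associated value field */
};

struct stack_entry stack[100];
int sp = 0;   /* the index of the first free stack entry */

int reduce_rule_2(char lookahead)
/* <expression> ::= <expression> + <term>; returns nonzero if applied */
{
     if (sp >= 3
      && stack[sp-3].sym == EXPRESSION
      && stack[sp-2].sym == PLUS
      && stack[sp-1].sym == TERM
      && lookahead != '*' && lookahead != '/') {  /* legal successors only */
          stack[sp-3].val = stack[sp-3].val + stack[sp-1].val;  /* combine values */
          sp = sp - 2;   /* three entries become one <expression> */
          return 1;
     }
     return 0;
}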

Most bottom-up parsers in common use today are built semi-automatically. The parse tables (including the look-ahead sets) are produced by a program called a parser generator which takes, as input, a BNF grammar. Some of the most complete parser generators allow each BNF rule in the input grammar to be annotated with the code which is to be executed when that rule is used as a reduction rule. One of the most well known parser generators in this class is the YACC program which runs under the UNIX operating system. It should be noted that it is equally practical to automate the generation of top-down recursive descent parsers, but this is less common because they are reasonably easy to hand code.

Expressions in Assembly Languages

It is interesting to note that, when expressions are allowed in the operand field of assembly language statements and definitions, there are two interpretations for the request to write an assembly language program to add A and B and place the result in C. These are illustrated in Figure 5.16.

; first interpretation
LDA A       ;Load A into the accumulator
ADD B       ;Add B to it
STA C       ;Store the result in C

; second interpretation
C = A + B 

Figure 5.16. Two interpretations of an assignment.

These two interpretations accomplish entirely different results. Assembling the first interpretation produces, as output, an executable program which, when executed, will compute the sum of the contents of memory locations A and B and put the result in location C. In this case, the values of the symbols A, B, and C are the addresses of memory locations containing values to be manipulated at run-time.

The second interpretation is a program which directs the assembler itself to take the value of the symbols A and B from the symbol table, add them, and store the result in the symbol table as the value of C. In this case, the values of the symbols A, B, and C are directly manipulated at assembly-time and the output of the process is not an executable program.

This demonstrates that, from the point of view of assembly-time expression evaluation and definition processing, an assembler can be viewed as an interpreter, while from the point of view of run-time program execution, it must be viewed as a language translator. The definition facility in the example assembly language is an assembly-time assignment statement, while the STA instruction in the first interpretation in Figure 5.16 is used to implement run-time assignment. Both assignment operators bind values to names, but the nature of the binding differs. At run-time, the STA instruction binds a value to a location, while the definition facility binds a value (which may be the address of a location) to an identifier.

Expressions in High Level Languages

When a compiler for a high level language parses an expression, its actions are quite similar to those taken by an assembler, with one major exception. Assemblers parse expressions with the intent of determining their values; thus, they evaluate each term and combine them as indicated by the operator symbols. Compilers parse expressions with the intent of generating code to evaluate them at a later time; thus, they generate code to evaluate each term and code to combine them as indicated by the operator symbols.

The similarity between the processing steps performed by expression compilers and expression interpreters is at its clearest when the interpreter uses an explicit stack to evaluate expressions and the compiler generates code for a stack machine. Figure 5.17 illustrates this similarity.

The input expression:  A + (B * C)

Computations performed by an interpreter during parsing:

     1) push( lookup(A) );
     2) push( lookup(B) );
     3) push( lookup(C) );
     4) push( pop() * pop() );
     5) push( pop() + pop() );

Computations performed by a compiler during parsing:

     1) output(' PUSH ');  output( lookup(A) );
     2) output(' PUSH ');  output( lookup(B) );
     3) output(' PUSH ');  output( lookup(C) );
     4) output(' MUL');
     5) output(' ADD');

Figure 5.17. Interpretation versus compilation.

Here, the interpreter actually pushes values onto a stack and directly performs arithmetic on the stack top, while the compiler generates assembly code to do the same operations. The symbol table in the interpreter holds the values of symbols, while the symbol table in a compiler holds the addresses of variables.

Although it is harder to make the comparison between an expression interpreter and the code generated by an expression compiler for a machine without a stack, the relationship between the two remains the same. The difference is that, instead of pushing values on a stack, the compiler must explicitly save intermediate values in some register or memory location. Furthermore, the problem of generating high quality code may require that the compiler re-order the instructions from the 'natural' order of computations performed by the expression interpreter.

Terminology

The reader should be familiar with the following general terms after reading this section.

op-code table                   run-time computation
universal assembler             assembly-time computation
assembly origin                 run-time binding
recursive descent parsing       assembly-time binding
bottom-up parsing               shift-reduce parsers

Additionally, the reader should be able to write programs to parse and evaluate complex expressions.

Exercises

  1. What differences would the programmer see if a single symbol table were used for both op-code storage and label storage? Note that the PAL-8 assembler for the DEC PDP-8 used this approach.

  2. Pascal does not allow the declaration of structured constants; this makes the problem of initializing the op-code table somewhat more difficult than it is in C, Ada or assembly language. Assuming that the programmer does not want to read the op-code table in from a file, how can it be initialized in Pascal in a nice way, that is, without resorting to one assignment statement for each field of each entry of the table? Ideally, it should take only one line of code for each table entry, plus some supporting code.

  3. Hand assemble the following code using chaining to resolve forward references. Explain any anomalous results. Assume that each link in the chain is a 2 byte address (least significant byte first), and that a link which points to itself is used to mark the end of a chain.
    a)  .  = #200
           W B
        A: W B
        .  = A
           W B
        B  = 6
    
    b)  .  = #200
        A: W B
        C: W .
        .  = A
           W C
        B  = 6
    

  4. Explain why a two pass assembler can correctly assemble example I below but cannot assemble example II below:
    I)  .  = #200
           W A
        C:
        .  = A
           W B
        .  = C
        B: W A
        A:
    
    II) .  = #200
           W B
        C:
        .  = A
        B: W B
        .  = C
           W A
        A:
    

  5. Write out the extended BNF equivalent of the grammar in Figure 5.8.

  6. Write out the BNF equivalent of the grammar in Figure 5.8.

  7. Modify the parser given in Figure 5.9 so that it evaluates expressions which conform to the grammar given in Figure 5.8.

  8. Write a parser for the grammar given in Figure 5.11. Optional: Modify your parser so that it evaluates expressions.

  9. Why does the bottom-up parser for the grammar presented in Figure 5.12 appear to need more look-ahead than a top-down recursive descent parser for the same or equivalent grammar?

  10. How much look-ahead would a shift-reduce parser for the grammar shown in Figure 5.8 require?

  11. What should be on the stack top when the while loop in Figure 5.14 terminates, assuming that the input was legal? Assume that the reduction rules being used are those from Figure 5.12.

  12. Modify the parser shown in Figure 5.9 so it generates output in the assembly language for the Honeywell DDP 516 computer. This computer had two registers, A and X, and supported the following instructions (plus many more):
    	LDA	op	; A = M[op]     load accumulator
    	STA	op	; M[op] = A     store accumulator
    	ADD	op	; A = A + M[op] add to accumulator
    	SUB	op	; A = A - M[op] subtract from accumulator
    	LDA I	X	; A = M[X]      \
    	STA I	X	; M[X] = A       \ indirect through
    	ADD I   X       ; A = A + M[X]   / index register
    	SUB I   X       ; A = A - M[X]  /
    	LDC	op	; A = op        load constant
            ADC	op	; A = A + op    add constant
    	SBC	op	; A = A - op    subtract constant
    	INX		; X = X + 1     increment X
    	DCX		; X = X - 1     decrement X
    
    Ignore the multiply and divide operations (the machine instructions for these were messy). Assume X points to the top element on the stack, that the push operation increments X, and that the result should end up in the A register.

References

The references cited at the end of Chapter 4 provide a reasonable discussion of the organization and use of the op-code table. Figure 3.7 of

Systems Programming by Donovan (McGraw Hill, 1972)
describes one possible approach using separate tables for the op-codes and pseudo-ops, unlike the organization proposed here, where the format field distinguishes the two. The discussion of the MIX assembler on pages 294-297 and 313-317 of
Computer Organization and Assembly Language Programming by Peterson (Academic Press, 1978)
uses a table similar to the one proposed here, but all machine instructions have the same format.

The references cited at the end of Chapter 2 are a good primary source of information on parsing techniques. The reader interested in automatic construction of parsers should refer to the papers:

Language Development Tools on the Unix System, by S. C. Johnson, in IEEE Computer, August, 1980, pages 16-21.

UNIX Time-Sharing System: Language Development Tools, by S. C. Johnson and M. E. Lesk, in Bell System Technical Journal 57, 6 (July-Aug 1978) pages 2155-2175.

Chapter 5 of

Algorithms + Data Structures = Programs by Wirth (Prentice Hall, 1976)
provides good coverage of parsing and code generation issues, with an emphasis on top-down recursive descent parsing. A complete compiler for a small toy language which generates code for a small toy instruction set is included.