Lecture 27, Code Generation Details

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

A Stack Architecture

We can describe the semantics of a stack machine's instruction set in terms of a stack stored in memory (the array M), and a stack-pointer register sp. Because the ARM, our target architecture, is defined as having a full descending stack, we'll use the same convention in the following description. We'll use the following notation:

The stack pointer sp always holds the address of the top element of the stack (a "full" stack).
The stack grows down (a "descending" stack).
We use the notation --sp to mean decrement the stack pointer and then use its value, while sp++ means use the value of the stack pointer and then increment it. This is C notation.
The array M[] refers to memory, and we'll assume that the capacity of each memory location is one word.

PUSHL x

PUSH Local; x words of uninitialized memory are pushed onto the stack to reserve space for a local variable. Used for storage allocation.

sp = sp - x;

POPL x

POP Local; x words of memory are popped from the stack. Used for local storage deallocation.

sp = sp + x;

PUSHI c

PUSH Immediate; the one-word constant c is pushed onto the stack.

M[--sp] = c;

PUSHGA a

PUSH Global Address; the one-word address a is pushed onto the stack, where A is the symbolic name of a global variable. In a real machine, the loader would convert PUSHGA to PUSHI after substituting the actual memory address of the global variable for its name.

LOAD

LOAD onto the stack top the contents of the one-word memory location addressed by the stack top. The stack pointer is unchanged.

M[sp] = M[M[sp]];

LOADHS

LOADHU

LOADBS

LOADBU

Later, when thinking about operands in memory that occupy fractional words, we may want to introduce halfword and byte load operands. Each of these may require a signed and an unsigned version. For now, however, we assume that we are only using one-word variables. The difference between signed and unsigned load operations lies in whether they sign-extend their operand to place it into a full-word stack top, or alternatively, pad the operand with zero in its high bits.

POPS

POP the data on the stack top and Store it in the memory address referenced by the address popped from the stack top. A total of two words are popped from the stack; the data was on top of the address.

temp = M[sp++];
M[M[sp++]] = temp;

POPSH

POPSB

Later, when thinking about operands in memory that occupy fractional words, we may want to introduce pop operations that store bytes or halfwords, discarding the high bits of the operand on the stack. The store operations for signed and unsigned operands do not differ.

DUP

DUPlicate the item on the stack top. Don't include this in your intermediate high-level architecture unless you find you need it, but in some cases, it may be useful.

M[--sp] = M[sp];

ADD

ADD the top two items on the stack, popping one operand and repacing the other operand with the sum.

temp = M[sp++];
M[sp] = M[sp] + temp;

SUB

MUL

DIV

MOD

Other arithmetic operators are similar to ADD.

GT

Greater Than replaces the top two items on the stack with the result of the comparison, comparing the item under the stack top with the item on the stack top and placing the boolean result on the stack. The net effect is like all other binary operators: One item is popped from the stack while the result replaces the other operand.

temp = M[sp++];
M[sp] = M[sp] > temp; -- for example, could also be < or &eq;, etc

LT

GE

LE

EQ

Other comparison operators are similar to GT.

BTRUE dst

BFALSE dst

Branch if TRUE (and Branch if FALSE) pop the top item from the stack and if it is true (or if it is false) they branch to the dst. The top item on the stack is always popped, whether or not the branch is taken.

if M[sp--] -- or if not
  then pc = dst;

BR dst

Branch unconditionally to dst, with no change to the stack.

LABEL dst

Mark a point that will be the destination for branches. In machine code, branch destinations are just memory addresses, but in a compiler, we typically generate symbolic labels by incrementing a counter. In this case, argument would be an integer, say 25, and if the code generator generates symbolic assembly language, then LABEL 25 translates to something like the label L25 in the actual assembly language. Labels of this form are used only as branch destinations.

The above instruction set is sufficient to allow reference to global variables at static addresses, but for languages such as ALGOL 60 and its descendants, including C, C++ and Java, we need a way to load the addresses of local variables. The easiest way to do this is with a special register, the frame pointer fp. Within any procedure or function, this always points to the base of the stack frame (activation record) for the current context.

PUSHLA d

PUSH Local Address; the address of a local variable at displacement d in the activation record is pushed onto the stack.

M[--sp] = d + fp;

There are numerous alternatives to the above. You could have an instruction to push the frame pointer itself, then push the displacement, then add, if you wanted to eliminate the combination of arithmetic with a constnt.

An Observation

Because all data objects in Kestrel have a constant size that is statically determined, the stack pointer at any point in a block is at a constant displacement from the base of the activation record for that block. Consider this code.

sort: procedure( ref i: int32, ref j: int32 )
   if i < j then
      k: var int32;
      k = i;
      i = j;
      j = k;
   end;
end;

If you imagine a totally sub-optimal implementation of this code, you might consider pushing each term in an expression on the stack as you evaluate the expression. This leads to the following picture of what is on the stack at each point in the above code:

sort: procedure( ref i: int32, ref j: int32 )
       -- @i @j
   if i
       -- @i @j i
      < j
       -- @i @j i j
       -- @i @j b
   then
       -- @i @j
      k: var int32;
       -- @i @j k
      k
       -- @i @j k @k
      = i
       -- @i @j k @k i
      ;
       -- @i @j k
      i
       -- @i @j k @i
      = j
       -- @i @j k @i j
      ;
       -- @i @j k
      j
       -- @i @j k @j
      = k
       -- @i @j k @j k
      ;
       -- @i @j k
   end;
       -- @i @j
end;
--- Legend i,  j - the values of the actual parameters
---       @i, @j - the addresses of the actual parameters
---        k     - the local variable
---       @k     - the address of the local variable
---        b     - a boolean resulting from comparing two integers

Note that the storage allocated in the activation record for the above routine's parameters holds references to the actual parameters, since they were declared as reference parameters. The basic activation record for the sort procedure therefore holds 2 reference parameters along with the local variable k. The rest of the variables in the activation record are anonymous temporaries that are used during the evaluation of expressions and the execution of assignment statements.

For example, to evaluate if i < j then we push the value of i on the stack, then push the value of j on the stack, before comparing them and replacing the two items with just one item reflecting the result of the comparison. Finally, we pop the result from the stack as we do a conditional branch.

The observation with which we began is simple: We do not need to use a dynamically computed stack pointer for this stack. We can use static computation, where each push is essentially an allocation operation in the current activation record, and each pop is a deallocation. Instead of using a stack pointer, we can use static offsets from the base of the activaiton record, just as we do for local variables.

Hand Translation to Stack Code

Now, consider the example at the head of this section. Ignoring the details of the procedure call and return, the code generated for the example code would be as follows:

sort: procedure( ref i: int32, ref j: int32 )
   if i < j then
         PUSHLA I -- push the address of @i
         LOAD     -- get the value of @i
         LOAD     -- get the value of i
         PUSHLA J -- push the address of @j
         LOAD     -- get the value of @j
         LOAD     -- get the value of j
         LT
         BFALSE 1
      k: var int32;
         PUSHL  1 -- allocate space for k
      k = i;
         PUSHLA K -- get the address of k
         PUSHLA I -- get the address of @i
         LOAD     -- get the value of @i
         LOAD     -- get the value of i
         POPS
      i = j;
         PUSHLA I -- get the address of @i
         LOAD     -- get the value of @i
         PUSHLA J -- get the address of @j
         LOAD     -- get the value of @j
         LOAD     -- get the value of j
         POPS
      j = k;
         PUSHLA J -- get the address of @j
         LOAD     -- get the value of @j
         PUSHLA K -- get the address of k
         LOAD     -- get the value of k
         POPS
   end;
         POPL   1 -- deallocate k
         LABEL 1
end;

If you imagine a totally sub-optimal implementation of a stack machine, you imagine each stack machine instruction involves one or more memory cycles to access the top elements of the stack. Looking at the above, this suggests that a simple assignment such as i=1 could take 3 or more instruction cycles. Even the Burroughs 5000 computer avoided this by dedicating two registers inisde the CPU to holding the top two elements of the stack. State bits in the CPU were used to record whether the stack was entirely in memory, or whether the top word of the stack was in a register, or whether the top two words were in registers. The net result was efficient execution with few memory cycles actually required. In later machines such as the AAMP, 4 registers were used to hold up to the top 4 elements of the stack.

Relating to the ARM

Section 5.1.1 of the following document describes the roles of the 16 registers in the ARM CPU as they are to be used on entry and exit from external procedures and functions.

-- Procedure Call Standard for the ARM Architecture

During function calls, registers 0 to 3 are reserved for parameter passing and argument return. Registers 4 through 11 (but sometimes not register 9) are avaiable for general use. Local calls can use other conventions, but in the early stages of developing a compiler, it is probably better to use the same calling conventions for local and external calls.

How do you translate stack machine code for effective execution on such a computer? The obvious approach is to use the ARM SP register, register 13, as the stack pointer and keep the entire contents of the stack in memory. This would make very inefficient use of registers.

When the Burroughs Corporation developed the B5000 computer in the early 1960s, they had a better idea. Use some registers inside the CPU as a stack-top-cache. The B5000 used only 2 registers inside the CPU to hold zero, one or two of the top elements on the stack, but later Burroughs had 4 registers to serve this purpose, and machines like the Collins AAMP typically set aside 8 registers inside the CPU to cache the stack top. The code generator inside a compiler for a machine like the ARM can easily do the same thing.