22C:18, Lecture 16, Fall 1996

Douglas W. Jones
University of Iowa Department of Computer Science

Recursion

Consider the following C code to print an integer as a decimal number:

void dspdec( unsigned int n, int w )
/* n is the number to print */
/* w is the field width to use */
{
	unsigned int fw = w - 1;
	unsigned int rm = n % 10;
	if ((n / 10) != 0) {
		/* there are more significant digits,
		   so print them first */
		dspdec( n / 10, fw );
	} else {
		/* there are no more significant digits,
		   so print any leading blanks first */
		for (;;) {
			fw = fw - 1;
			if (fw < 0) exit; /* from for loop */
			dspch(' ');
		}
	}
	/* leading blanks and preceding digits are printed,
	   so print this digit */
	dspch('0' + rm);
}

Calling dspdec(n,w) has the same effect as the Pascal call write(n:w) or the C call fprintf("%*d", w, n); all of these print n, right justified in a field of width w, using more characters than w if n is too large. The names dspch and dspdec are used here because these are the names of the routines in the Hawk monitor.

The above routine is recursive, and it is instructive to trace its behavior through a call to print a small number, say 123 (decimal) in a field of width 5:

dspdec(123, 5) called;
fw = 4, rm = 3.
dspdec(12, 4) called;
fw = 3, rm = 2.
dspdec(1, 3) called;
fw = 2, rm = 1.
using fw to count spaces, dspch(' ') is called twice.
dspch('0' + rm) is called, output is '1'
return
dspch('0' + rm) is called, output is '2'
return
dspch('0' + rm) is called, output is '3'
return

The problem is, how can this be translated to assembly language? The central issue is where should the local variables be put. Because the routine is recursive, the answer is that storage must be allocated dynamically, at run-time, for these variables.

A Recursive Calling Sequence

To allow for recursion, local variables are typically allocated on a stack (there are alternatives to use of stacks). The routines in the Hawk monitor use R2 for a stack pointer and R1 for linkage, and user code should use the same conventions (but nothing about the Hawk architecture prevents users from using other conventions).

By the entirely arbitrary convention used here, at the time control enters a procedure, R2 points to the lowest unused address on the stack, so m[R2] is available for use by the procedure, as is m[R2+1], but we assume that lower addresses such as m[R2-1] are unavailable. Thus, we will use the convention that pushing on the stack involves incrementing R2, while popping from the stack involves decrementing R2. A surprising number of machines use the opposite convention, and the thing to remember is that these conventions really are arbitrary and that what matters is that they are used consistantly.

Consider a procedure that requires AR bytes of memory for storage of local variables, including all temporary locations used by the procedure as well as the local variables you would give names in a high level language program. We will use the following calling sequence withinn this procedure to call other procedures:

	ADDI	R2,AR	; push my block of local variables
	JSR	R1,PROC
	ADDI	R2,-AR  ; pop my block of local variables

If PROC is not in the same source file as the call, so that the linker is involved, a slightly more complex form must be used:

	ADDI	R2,AR	; push my block of local variables
	LOAD	R1,PPROC
	JSRS	R1,R1
	ADDI	R2,-AR  ; pop my block of local variables

This is needed because the SMAL linker cannot compute PC relative addresses of external procedures. Instead, it must fill in the pointer to the procedure in an absolute word, defined elsewhere in the same source file as follows:

	EXT	PROC
PPROC:	W	PROC

In both of these calling sequences, the value AR is used to indicate the size of the area on the stack used by the current procedure. This area is called the activation record of the procedure, and the identifier AR (or identifiers ending with AR in the case of multiple coexisting procedures) is used for the size of the procedure's activation record.

On entry to a procedure, our standard calling sequence puts the return address in R1. If the procedure is to call other procedures, the return address must be set aside somewhere. It could be set aside in some other register, but if a recursive call it involved, it must be set aside on a stack in memory. By convention, Hawk programmers should set aside the first local variable of every procedure for this purpose. Having done this, we can use the following receiving and return sequences:

PROC:	STORES	R1,R2	; receiving sequence
			; saves the return address

	LOADS	R1,R2	; return sequence recovers
	JUMPS	R1	; and uses the return address

It is useful to embed these in a larger framework of comments and definitions for each procedure. A standard format is suggested here:

;------------------------
	INT	PROC

;		AR format, indexed using R2:
;RA	= 0	; the return address
LOCAL	= 4	; an example local variable
AR	= 8	; size of the activation record

PROC:		; a comment about what it does
		; comments telling what registers it uses
		; for parameters, results, etc

	STORES	R1,R2		; save return address

	STORE	R0,R2,LOCAL	; an example reference
				; to the local variable

	LOADS	R1,R2		; recover return address
	JUMPS	R1		; return

By convention, we reserve the first word in each activation record for for the return address, and we always access this with a STORES in the receiving sequence and an LOADS in the return sequence. As such, we don't need to name the return address field of the calling sequence, but we include a line in the documentation just the same, with the assignment of a value to the name RA commented out since we never intend to use it.

The example local variable, LOCAL above, is one word and is cleared by the store instruction in the body of our example procedure. Nothing useful is done with this variable here, but it is included to illustrate how such variables can be declared and used. The example activation record takes up 8 bytes, 4 for the return address and 4 for the local variable, so we define the symbol AR as 8. Since this example procedure does not call other procedures, AR need not be defined.

Note that this is only one possible activation record structure! Nothing about the Hawk architecture demands this structure, and there are many alternatives. Some machines have procedure calling instructions that dictate a particular calling sequence, sometimes quite complex, but the Hawk machine does not.

Coding A Recursive Procedure

We can now begin translation of the DSPDEC procedure to assembly language. Our first version will use the above calling sequence exactly, so we can code the entry to the routine as follows:

;------------------------
        INT     DSPDEC
;               AR format, indexed using R2:
;         0     ; return address
DSPDFW	= 4	; field width
DSPDRM	= 8	; remainder after division by 10
DSPDAR	= 12	; total size

DSPDEC:			; output R3 in decimal
			; uses R4 as field width
			; wipes out ????
	STORES	R1,R2	; save ret addr in ar

Here, we have fleshed out the receiving sequence with details specific to the problem of coding the dspdec routine. The local variables fw and rm have been allocated space in the activation record, as fields DSPDFW and DSPDRM, at offsets 4 and 8 in the activation record. So long as R2 points to the start of the activation record, indexed addressing (with the LOAD, LOADCC, STORE and LEA instructions) may be used to address fields of the activation record. The C code given above both allocated and initialized fw and ar with the following two lines:

	unsigned int fw = w - 1;
	unsigned int rm = n % 10;

The corresponding assembly code, for initialization of these fields of the AR, is as follows:

	ADDSI	R4,-1	; decrement the field width
	STORE	R4,R2,DSPDFW
	LIS	R4,10
	ADDI	R2,DSPDAR; push my local variables
	JSR	R1,DIVIDE; divide number by 10 (wipe out R5-6)
	ADDI	R2,-DSPDAR;pop my local variables
	STORE	R4,R2,DSPDRM

Note the use of our standard calling sequence to call the DIVIDE routine. This routine divides the number in R3 by the number in R4, returning the quotient in R3 and the remainder in R4, and obliterating R5 and R6 in the process. As a result, we have computed both the remainder we needed for the initialization of the local variable rm, and the quotient needed by the next few lines of C code:

	if ((n / 10) != 0) {
		/* there are more significant digits,
		   so print them first */
		dspdec( n / 10, fw );

These convert to assembly language as follows:

	TESTR	R3	; check the quotient
	BZS	DSPDBL	;   if zero, go output blanks
	LOAD	R4,R2,DSPDFW
	ADDI	R2,DSPDAR
	JSR	R1,DSPDEC ; recursive call
	ADDI	R2,-DSPDAR
	BR	DSPDQT	; go off to the endif

The above code takes advantage of the fact that the DIVIDE routine left the quotient right where it was needed for passing to the recursive call to DSPDEC, so only one instruction was needed to put the parameter into place. The next segment of C code is:

	} else {
		/* there are no more significant digits,
		   so print any leading blanks first */
		for (;;) {
			fw = fw - 1;
			if (fw < 0) exit; /* from for loop */
			dspch(' ');
		}

This is translated to assembly code as follows:

DSPDBL:			; setup to output DSPDFW blanks
	LOAD	R6,R2,DSPDFW
	LIS	R3,' '
DSPDLP:	ADDSI	R6,-1	; decrement counter
	BLT	DSPDQT	;   quit if nothing left
	ADDI	R2,DSPDAR
	JSR	R1,DSPCH; output blank (wipe out R4-5)
	ADDI	R2,-DSPDAR
	BR	DSPDLP

The final segment of the procedure, in C, was:

	}
	/* leading blanks and preceding digits are printed,
	   so print this digit */
	dspch('0' + rm);
}

This translates to the following assembly code:

DSPDQT:			; setup to output the digit
	LOAD	R6,R2,DSPDRM
	ADDI	R3,'0'	; convert remainder to ASCII
	ADDI	R2,DSPDAR
	JSR	R1,DSPCH; output it (wipe out R4-5)
	ADDI	R2,-DSPDAR
	LOADS	R1,R2	; recover ret addr from ar
	JUMPS	R1	; return

Optimization

If producing working code in a hurry is your goal, the above code is appropriate, but when performance is an issue, some additional work is called for.

The calling sequence we have used requires 3 instructions, plus a 1 instruction entry sequence and a 2 instruction exit sequence. This seems excessive! On machines designed according to the CISC (complex instruction set computer) philosophy, the conventional solution to this excess is to create special calling instructions that combine these functions, for example, imagine having a call instruction such as:

	CISCCALL size,dst

This would increment R2 by size, save the return address in M[R2] and transfer control to dst. As a result, the entry sequence is reduced to no instructions, but we need a new return instruction to undo things:

	CISCRETURN

This would transfer control to the return address found in M[R2], but first, it would look in M[M[R2]-4] to find the size field of the CALL instruction in order to decrement R2 by the right amount. While programming with these "improved" procedure calling and return instructions would be pleasant, they are not pleasant instructions to implement, and their implementation would gravely complicate the problem of making the CPU run quickly.

As a result, processors designed according to the RISC (reduced instruction set complexity) philosophy usually omit such complicated but easy to use instructions and rely on the programmer, or more usually, on the compiler, to leave out those parts of the calling sequence that are not needed in some circumstance.

For example, in our receiving and return sequences:

PROC:
        STORES  R1,R2   ; save ret addr in ar
	; some code
        LOADS   R1,R2   ; recover ret addr from ar
        JUMPS   R1      ; return

The code to save and restore the return address is unnecessary if the procedure calls no other procedures. A smart programmer, or a smart compiler, after tentatively including the STORES and LOADS instructions that do this saving and restoring, can look at the body of the procedure and notice that it contains no calls, then remove these instructions.

Having removed them, the size of the activation record for that procedure or function can be reduced by one. In the case of many procedures and functions, particularly those that call no other procedures, the size of the activation record can be reduced to zero because all the local variables needed by the procedure fit in registers.

The above optimization produces a class of procedures and functions that have no activation records and therefore ignore R2 when they are called. When calling such a procedure or function, the calling sequence:

	ADDI	R2,AR
	JSR	R1,PROC
	ADDI	R2,-AR

Can be simplified to just:

	JSR	R1,PROC

Another optimization can be done in the context of consecutive calls. Consider this example:

	ADDI	R2,AR
	JSR	R1,PROCA
	ADDI	R2,-AR
	ADDI	R2,AR
	JSR	R1,PROCB
	ADDI	R2,-AR

Here, two procedures were called in quick sequence, and clearly, the intermediate adjustment to the activation record pointer, R2, are redundant and the code can be collapsed to:

	ADDI	R2,AR
	JSR	R1,PROCA
	JSR	R1,PROCB
	ADDI	R2,-AR

This optimization is clearly possible even when there are other instructions between the two procedure calls, but it would seem more difficult when the intervening instrucitons reference fields of the activation record, as in this example:

        ADDI    R2,AR
        JSR     R1,PROCA     ; returns something in R3 
        ADDI    R2,-AR
	STORES	R3,R2,LOCALA ; save result of PROCA
	LOADS	R3,R2,LOCALB ; get parameter to PROCB
        ADDI    R2,AR
        JSR     R1,PROCB     ; expects something in R3 
        ADDI    R2,-AR

It turns out that it is still possible to eliminate the adjustment to R2 between the calls to PROCA and PROCB by moving the arithmetic from run-time to assembly time, adjusting LOCALA and LOCALB instead of adjusting R2. This is illustrated below:

        ADDI    R2,AR
        JSR     R1,PROCA        
	STORES	R3,R2,LOCALA-AR
	LOADS	R3,R2,LOCALB-AR
        JSR     R1,PROCB
        ADDI    R2,-AR

Here, we have substituted assembly-time arithmetic on the offsets used for the intermediate load and store instructions for the run-time arithmetic of the ADDI instructions.

Finally, there is an optimization that is particularly useful in the context of recursive procedures, and that is to collapse activation records so that, at the time of the recursive call, only those fields of the activation record that actually contain useful information are saved. In the case of the example decimal print routine, note, for example, that the field width stored in the DSPDFW field of the activation record is not needed after any recursive call.

This optimization is done by organizing the fields of the activiation record so that fields that aren't needed during a recursive call are stored last. In the case of our decimal print routine, we would put the DSPDRM field first, since it is needed during the recursive call, and the DSPDFW field last. The result is a routine that needs 1/3 less stack space when it is called.

Here is the result of applying a number of the above optimizations to the decimal print routine:

;------------------------
        INT     DSPDEC
;               AR format, indexed using R2:
;         0     ; return address
DSPDRM	= 4	; remainder after division by 10
DSPDFW	= 8	; field width
DSPDAR	= 12	; total size

DSPDEC:			; output R3 in decimal
			; uses R4 as field width
			; wipes out 4-6
	STORES	R1,R2	; save ret addr in ar
	ADDSI	R4,-1	; decrement the field width
	STORE	R4,R2,DSPDFW
	LIS	R4,10
	JSR	R1,DIVIDE; divide number by 10 (wipe out R5-6)
	STORE	R4,R2,DSPDRM
	TESTR	R3	; check the quotient
	BZS	DSPDBL	;   if zero, go output blanks
	LOAD	R4,R2,DSPDFW
	ADDI	R2,DSPDFW
	JSR	R1,DSPDEC ; recursive call
	ADDI	R2,-DSPDFW
DSPDQT:			; setup to output the digit
	LOAD	R3,R2,DSPDRM
	ADDI	R3,'0'	; convert remainder to ASCII
	JSR	R1,DSPCH; output it (wipe out R4-5)
	LOADS	R1,R2	; recover ret addr from ar
	JUMPS	R1	; return

DSPDBL:			; setup to output DSPDFW blanks
	LOAD	R6,R2,DSPDFW
	LIS	R3,' '
DSPDLP:	ADDSI	R6,-1	; decrement counter
	BLT	DSPDQT	;   quit if nothing left
	JSR	R1,DSPCH; output blank (wipe out R4-5)
	BR	DSPDLP

Internal and External Routine Efficiency

In the worst case on the Hawk machine, if you know nothing of the routine you are calling other than the bare fact that it uses R1 for linkage and R2 as an activation record pointer, you must assume that it destroys the contents of all other registers. In this case, you will be forced to allocate space in your own activation record for all values you need saved during the call, and you will have to augment your calling sequence with code to save the contents of all the registers you care about before the call.

When a compiler generates code to call a procedure that is defined outside the current source file, perhaps in a file that has not even yet been compiled, it must make pesimistic assumptions about that procedure. As a result, calls to such procedures will, by necessity, be expensive, involving maximal use of the activation record, poor use of registers, and the full glory of our calling sequence.

In comparison, when compilers generate code to procedures defined within the same compilation unit, they can frequently apply many of the optimizations we have discussed, making effective use of registers and considerably speeding the procedure call.

As a result, modern optimizing compilers frequently generate far better code for calls to local procedures than to separately compiled code. The exception is in languages like Ada, where there are strict rules about order of compilation. In such a language, when a procedure is compiled, the compiler generates not only an object file but a file of instructions about how to compile calls to that procedure. When a call to an external procedure is encounterd by the compiler, it reads that file of instructions in order to determine how the call is to be made. Unfortunately, older programmingl languages like C, C++ and Fortran are rarely implemented to take advantage of such opportunities for effective optimization of external procedure calls.