Consider the following C code to print an integer as a decimal number:
Calling dspdec(n,w) has the same effect as the Pascal call write(n:w) or the C call fprintf("%*d", w, n); all of these print n, right justified in a field of width w, using more characters than w if n is too large. The names dspch and dspdec are used here because these are the names of the routines in the Hawk monitor.void dspdec( unsigned int n, int w ) /* n is the number to print */ /* w is the field width to use */ { unsigned int fw = w - 1; unsigned int rm = n % 10; if ((n / 10) != 0) { /* there are more significant digits, so print them first */ dspdec( n / 10, fw ); } else { /* there are no more significant digits, so print any leading blanks first */ for (;;) { fw = fw - 1; if (fw < 0) exit; /* from for loop */ dspch(' '); } } /* leading blanks and preceding digits are printed, so print this digit */ dspch('0' + rm); }
The above routine is recursive, and it is instructive to trace its behavior through a call to print a small number, say 123 (decimal) in a field of width 5:
The problem is, how can this be translated to assembly language? The central issue is where should the local variables be put. Because the routine is recursive, the answer is that storage must be allocated dynamically, at run-time, for these variables.
- dspdec(123, 5) called;
- fw = 4, rm = 3.
- dspdec(12, 4) called;
- fw = 3, rm = 2.
- dspdec(1, 3) called;
- fw = 2, rm = 1.
- using fw to count spaces, dspch(' ') is called twice.
- dspch('0' + rm) is called, output is '1'
- return
- dspch('0' + rm) is called, output is '2'
- return
- dspch('0' + rm) is called, output is '3'
- return
To allow for recursion, local variables are typically allocated on a stack (there are alternatives to use of stacks). The routines in the Hawk monitor use R2 for a stack pointer and R1 for linkage, and user code should use the same conventions (but nothing about the Hawk architecture prevents users from using other conventions).
By the entirely arbitrary convention used here, at the time control enters a procedure, R2 points to the lowest unused address on the stack, so m[R2] is available for use by the procedure, as is m[R2+1], but we assume that lower addresses such as m[R2-1] are unavailable. Thus, we will use the convention that pushing on the stack involves incrementing R2, while popping from the stack involves decrementing R2. A surprising number of machines use the opposite convention, and the thing to remember is that these conventions really are arbitrary and that what matters is that they are used consistantly.
Consider a procedure that requires AR bytes of memory for storage of local variables, including all temporary locations used by the procedure as well as the local variables you would give names in a high level language program. We will use the following calling sequence withinn this procedure to call other procedures:
If PROC is not in the same source file as the call, so that the linker is involved, a slightly more complex form must be used:ADDI R2,AR ; push my block of local variables JSR R1,PROC ADDI R2,-AR ; pop my block of local variables
This is needed because the SMAL linker cannot compute PC relative addresses of external procedures. Instead, it must fill in the pointer to the procedure in an absolute word, defined elsewhere in the same source file as follows:ADDI R2,AR ; push my block of local variables LOAD R1,PPROC JSRS R1,R1 ADDI R2,-AR ; pop my block of local variables
In both of these calling sequences, the value AR is used to indicate the size of the area on the stack used by the current procedure. This area is called the activation record of the procedure, and the identifier AR (or identifiers ending with AR in the case of multiple coexisting procedures) is used for the size of the procedure's activation record.EXT PROC PPROC: W PROC
On entry to a procedure, our standard calling sequence puts the return address in R1. If the procedure is to call other procedures, the return address must be set aside somewhere. It could be set aside in some other register, but if a recursive call it involved, it must be set aside on a stack in memory. By convention, Hawk programmers should set aside the first local variable of every procedure for this purpose. Having done this, we can use the following receiving and return sequences:
It is useful to embed these in a larger framework of comments and definitions for each procedure. A standard format is suggested here:PROC: STORES R1,R2 ; receiving sequence ; saves the return address LOADS R1,R2 ; return sequence recovers JUMPS R1 ; and uses the return address
By convention, we reserve the first word in each activation record for for the return address, and we always access this with a STORES in the receiving sequence and an LOADS in the return sequence. As such, we don't need to name the return address field of the calling sequence, but we include a line in the documentation just the same, with the assignment of a value to the name RA commented out since we never intend to use it.;------------------------ INT PROC ; AR format, indexed using R2: ;RA = 0 ; the return address LOCAL = 4 ; an example local variable AR = 8 ; size of the activation record PROC: ; a comment about what it does ; comments telling what registers it uses ; for parameters, results, etc STORES R1,R2 ; save return address STORE R0,R2,LOCAL ; an example reference ; to the local variable LOADS R1,R2 ; recover return address JUMPS R1 ; return
The example local variable, LOCAL above, is one word and is cleared by the store instruction in the body of our example procedure. Nothing useful is done with this variable here, but it is included to illustrate how such variables can be declared and used. The example activation record takes up 8 bytes, 4 for the return address and 4 for the local variable, so we define the symbol AR as 8. Since this example procedure does not call other procedures, AR need not be defined.
Note that this is only one possible activation record structure! Nothing about the Hawk architecture demands this structure, and there are many alternatives. Some machines have procedure calling instructions that dictate a particular calling sequence, sometimes quite complex, but the Hawk machine does not.
We can now begin translation of the DSPDEC procedure to assembly language. Our first version will use the above calling sequence exactly, so we can code the entry to the routine as follows:
Here, we have fleshed out the receiving sequence with details specific to the problem of coding the dspdec routine. The local variables fw and rm have been allocated space in the activation record, as fields DSPDFW and DSPDRM, at offsets 4 and 8 in the activation record. So long as R2 points to the start of the activation record, indexed addressing (with the LOAD, LOADCC, STORE and LEA instructions) may be used to address fields of the activation record. The C code given above both allocated and initialized fw and ar with the following two lines:;------------------------ INT DSPDEC ; AR format, indexed using R2: ; 0 ; return address DSPDFW = 4 ; field width DSPDRM = 8 ; remainder after division by 10 DSPDAR = 12 ; total size DSPDEC: ; output R3 in decimal ; uses R4 as field width ; wipes out ???? STORES R1,R2 ; save ret addr in ar
The corresponding assembly code, for initialization of these fields of the AR, is as follows:unsigned int fw = w - 1; unsigned int rm = n % 10;
Note the use of our standard calling sequence to call the DIVIDE routine. This routine divides the number in R3 by the number in R4, returning the quotient in R3 and the remainder in R4, and obliterating R5 and R6 in the process. As a result, we have computed both the remainder we needed for the initialization of the local variable rm, and the quotient needed by the next few lines of C code:ADDSI R4,-1 ; decrement the field width STORE R4,R2,DSPDFW LIS R4,10 ADDI R2,DSPDAR; push my local variables JSR R1,DIVIDE; divide number by 10 (wipe out R5-6) ADDI R2,-DSPDAR;pop my local variables STORE R4,R2,DSPDRM
These convert to assembly language as follows:if ((n / 10) != 0) { /* there are more significant digits, so print them first */ dspdec( n / 10, fw );
The above code takes advantage of the fact that the DIVIDE routine left the quotient right where it was needed for passing to the recursive call to DSPDEC, so only one instruction was needed to put the parameter into place. The next segment of C code is:TESTR R3 ; check the quotient BZS DSPDBL ; if zero, go output blanks LOAD R4,R2,DSPDFW ADDI R2,DSPDAR JSR R1,DSPDEC ; recursive call ADDI R2,-DSPDAR BR DSPDQT ; go off to the endif
This is translated to assembly code as follows:} else { /* there are no more significant digits, so print any leading blanks first */ for (;;) { fw = fw - 1; if (fw < 0) exit; /* from for loop */ dspch(' '); }
The final segment of the procedure, in C, was:DSPDBL: ; setup to output DSPDFW blanks LOAD R6,R2,DSPDFW LIS R3,' ' DSPDLP: ADDSI R6,-1 ; decrement counter BLT DSPDQT ; quit if nothing left ADDI R2,DSPDAR JSR R1,DSPCH; output blank (wipe out R4-5) ADDI R2,-DSPDAR BR DSPDLP
This translates to the following assembly code:} /* leading blanks and preceding digits are printed, so print this digit */ dspch('0' + rm); }
DSPDQT: ; setup to output the digit LOAD R6,R2,DSPDRM ADDI R3,'0' ; convert remainder to ASCII ADDI R2,DSPDAR JSR R1,DSPCH; output it (wipe out R4-5) ADDI R2,-DSPDAR LOADS R1,R2 ; recover ret addr from ar JUMPS R1 ; return
If producing working code in a hurry is your goal, the above code is appropriate, but when performance is an issue, some additional work is called for.
The calling sequence we have used requires 3 instructions, plus a 1 instruction entry sequence and a 2 instruction exit sequence. This seems excessive! On machines designed according to the CISC (complex instruction set computer) philosophy, the conventional solution to this excess is to create special calling instructions that combine these functions, for example, imagine having a call instruction such as:
This would increment R2 by size, save the return address in M[R2] and transfer control to dst. As a result, the entry sequence is reduced to no instructions, but we need a new return instruction to undo things:CISCCALL size,dst
This would transfer control to the return address found in M[R2], but first, it would look in M[M[R2]-4] to find the size field of the CALL instruction in order to decrement R2 by the right amount. While programming with these "improved" procedure calling and return instructions would be pleasant, they are not pleasant instructions to implement, and their implementation would gravely complicate the problem of making the CPU run quickly.CISCRETURN
As a result, processors designed according to the RISC (reduced instruction set complexity) philosophy usually omit such complicated but easy to use instructions and rely on the programmer, or more usually, on the compiler, to leave out those parts of the calling sequence that are not needed in some circumstance.
For example, in our receiving and return sequences:
The code to save and restore the return address is unnecessary if the procedure calls no other procedures. A smart programmer, or a smart compiler, after tentatively including the STORES and LOADS instructions that do this saving and restoring, can look at the body of the procedure and notice that it contains no calls, then remove these instructions.PROC: STORES R1,R2 ; save ret addr in ar ; some code LOADS R1,R2 ; recover ret addr from ar JUMPS R1 ; return
Having removed them, the size of the activation record for that procedure or function can be reduced by one. In the case of many procedures and functions, particularly those that call no other procedures, the size of the activation record can be reduced to zero because all the local variables needed by the procedure fit in registers.
The above optimization produces a class of procedures and functions that have no activation records and therefore ignore R2 when they are called. When calling such a procedure or function, the calling sequence:
Can be simplified to just:ADDI R2,AR JSR R1,PROC ADDI R2,-AR
Another optimization can be done in the context of consecutive calls. Consider this example:JSR R1,PROC
Here, two procedures were called in quick sequence, and clearly, the intermediate adjustment to the activation record pointer, R2, are redundant and the code can be collapsed to:ADDI R2,AR JSR R1,PROCA ADDI R2,-AR ADDI R2,AR JSR R1,PROCB ADDI R2,-AR
This optimization is clearly possible even when there are other instructions between the two procedure calls, but it would seem more difficult when the intervening instrucitons reference fields of the activation record, as in this example:ADDI R2,AR JSR R1,PROCA JSR R1,PROCB ADDI R2,-AR
It turns out that it is still possible to eliminate the adjustment to R2 between the calls to PROCA and PROCB by moving the arithmetic from run-time to assembly time, adjusting LOCALA and LOCALB instead of adjusting R2. This is illustrated below:ADDI R2,AR JSR R1,PROCA ; returns something in R3 ADDI R2,-AR STORES R3,R2,LOCALA ; save result of PROCA LOADS R3,R2,LOCALB ; get parameter to PROCB ADDI R2,AR JSR R1,PROCB ; expects something in R3 ADDI R2,-AR
Here, we have substituted assembly-time arithmetic on the offsets used for the intermediate load and store instructions for the run-time arithmetic of the ADDI instructions.ADDI R2,AR JSR R1,PROCA STORES R3,R2,LOCALA-AR LOADS R3,R2,LOCALB-AR JSR R1,PROCB ADDI R2,-AR
Finally, there is an optimization that is particularly useful in the context of recursive procedures, and that is to collapse activation records so that, at the time of the recursive call, only those fields of the activation record that actually contain useful information are saved. In the case of the example decimal print routine, note, for example, that the field width stored in the DSPDFW field of the activation record is not needed after any recursive call.
This optimization is done by organizing the fields of the activiation record so that fields that aren't needed during a recursive call are stored last. In the case of our decimal print routine, we would put the DSPDRM field first, since it is needed during the recursive call, and the DSPDFW field last. The result is a routine that needs 1/3 less stack space when it is called.
Here is the result of applying a number of the above optimizations to the decimal print routine:
;------------------------ INT DSPDEC ; AR format, indexed using R2: ; 0 ; return address DSPDRM = 4 ; remainder after division by 10 DSPDFW = 8 ; field width DSPDAR = 12 ; total size DSPDEC: ; output R3 in decimal ; uses R4 as field width ; wipes out 4-6 STORES R1,R2 ; save ret addr in ar ADDSI R4,-1 ; decrement the field width STORE R4,R2,DSPDFW LIS R4,10 JSR R1,DIVIDE; divide number by 10 (wipe out R5-6) STORE R4,R2,DSPDRM TESTR R3 ; check the quotient BZS DSPDBL ; if zero, go output blanks LOAD R4,R2,DSPDFW ADDI R2,DSPDFW JSR R1,DSPDEC ; recursive call ADDI R2,-DSPDFW DSPDQT: ; setup to output the digit LOAD R3,R2,DSPDRM ADDI R3,'0' ; convert remainder to ASCII JSR R1,DSPCH; output it (wipe out R4-5) LOADS R1,R2 ; recover ret addr from ar JUMPS R1 ; return DSPDBL: ; setup to output DSPDFW blanks LOAD R6,R2,DSPDFW LIS R3,' ' DSPDLP: ADDSI R6,-1 ; decrement counter BLT DSPDQT ; quit if nothing left JSR R1,DSPCH; output blank (wipe out R4-5) BR DSPDLP
In the worst case on the Hawk machine, if you know nothing of the routine you are calling other than the bare fact that it uses R1 for linkage and R2 as an activation record pointer, you must assume that it destroys the contents of all other registers. In this case, you will be forced to allocate space in your own activation record for all values you need saved during the call, and you will have to augment your calling sequence with code to save the contents of all the registers you care about before the call.
When a compiler generates code to call a procedure that is defined outside the current source file, perhaps in a file that has not even yet been compiled, it must make pesimistic assumptions about that procedure. As a result, calls to such procedures will, by necessity, be expensive, involving maximal use of the activation record, poor use of registers, and the full glory of our calling sequence.
In comparison, when compilers generate code to procedures defined within the same compilation unit, they can frequently apply many of the optimizations we have discussed, making effective use of registers and considerably speeding the procedure call.
As a result, modern optimizing compilers frequently generate far better code for calls to local procedures than to separately compiled code. The exception is in languages like Ada, where there are strict rules about order of compilation. In such a language, when a procedure is compiled, the compiler generates not only an object file but a file of instructions about how to compile calls to that procedure. When a call to an external procedure is encounterd by the compiler, it reads that file of instructions in order to determine how the call is to be made. Unfortunately, older programmingl languages like C, C++ and Fortran are rarely implemented to take advantage of such opportunities for effective optimization of external procedure calls.