Our goal by using 2 registers to hold the top 2 items on the stack is to cut down on the number of memory cycles when there is a sequence of operations that alternate push and pop. For example, to evaluate A + B + C, we'd do something like PUSHA, LOAD, PUSHB, LOAD, ADD, PUSHC, LOAD, ADD, and in theory, given the registers TOPA and TOPB, this can be done with no memory references other than those used to actually fetch the operands from the variables A, B and C.
If we naively translate push(X) operation into "M[SP++]=TOPB; TOPB=TOPA; TOPA=X", and we translate pop() into "return(TOPA); TOPA=TOPB; TOPB=M[--SP]", we gain nothing! To make this scheme lead to an improvement, we need to keep a record of the state of TOPA and TOPB, and allow a pop to leave TOPA or both TOPA and TOPB vacant so that a push can be done without taking any memory cycles.
As an aside, the Rockwell AAMP-2 had 4 registers in the CPU to hold the top items on the stack; thus giving the AAMP-2 even greater performance.
Part A: One way to do this is to add a special 2-bit register to the CPU called TOPSTATE. If TOPSTATE is zero, TOPA and TOPB are both vacant. If TOPSTATE is one, TOPA is vacant and the top of stack is in TOPB. If TOPSTATE is two, TOPA is the top of stack, and TOPB holds the word below it. Rewrite the code describing the Minimal CISC CPU given about 1/3 of the way into the Minimal CISC notes so it uses this idea.
Part B: Evaluate the impact of this improvement on the execution of the benchmark code shown about 3/5 of the way into the paper by counting instruction cycles required to execute one pass through the loop body with and without this improvement.
A Problem: Given what you know about efficient instruction encoding, propose how you would use the 4-bit instruction syllables of our modified machine. Note that for operations with a likelyhood of less than 1/16, you should use multiple syllables to encode the operation, and you may use fractions of syllables as well as entire following syllables for operand fields.