Lecture 28, Stack Top Caches

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Implementing Stacks in Hardware

The Burroughs corporation, when they built the B5000 machine back in the 1960s, recognized that a pure stack architecture had an obvious performance problem: Consider the run-time cost of executing i=i+1 on a machine where the stack is entirely memory resident, described here using a full descending stack and assuming that the variable i is a global variable and that the expression &i gives the address of i:

PUSHGA  i    -- sp = sp - 1; M[sp] = &i;
PUSHGA  i    -- sp = sp - 1; M[sp] = &i;
LOAD         -- M[sp] = M[M[sp]]
PUSHI   1    -- sp = sp - 1; M[sp] = 1;
ADD          -- M[sp+1] = M[sp+1] + M[sp]; sp = sp + 1
POPS         -- M[M[sp+1]] = M[sp]; sp = sp + 1

In sum, this requires 12 memory cycles (excluding instruction fetches). An obvious way to improve this is to use one register inside the CPU to hold the top item on the stack top. We'll call it the a register here, and we'll adopt the convention that the stack pointer sp always points to the topmost item on the part of the stack that is in memory. Exactly the same code given above can now be implemented with the following operations:

PUSHGA  i    -- sp = sp - 1; M[sp] = a; a = &i;
PUSHGA  i    -- sp = sp - 1; M[sp] = a; a = &i;
LOAD         -- a = M[a]
PUSHI   1    -- sp = sp - 1; M[sp] = a; a = 1;
ADD          -- a = M[sp] + a; sp = sp + 1;
POPS         -- M[M[sp]] = a; a = M[sp+1]; sp = sp + 2;

This offers a modest improvement: 12 cycles are reduced to 8 cycles. But, if you think about building hardware, you realize that for any add operation, both operands must be held in registers while the ALU computes the result, and for any store operation, both the address and value being stored must be held in registers. The reads from the stack in the above would mostly be reading into this second register.

The most obvious way to do this is to keep the top two elements on the stack in registers at all times. Let's call the register used for the stack a and the one used for the element below the stack top b. The naive way to use this hardware to execute the above program would be:

PUSHGA  i    -- sp = sp - 1; M[sp] = b; b = a; a = &i;
PUSHGA  i    -- sp = sp - 1; M[sp] = b; b = a; a = &i;
LOAD         -- a = M[a]
PUSHI   1    -- sp = sp - 1; M[sp] = b; b = a; a = 1;
ADD          -- a = a + b; b = M[sp]; sp = sp + 1;
POPS         -- M[b] = a; a = M[sp]; b = M[sp+1]; sp = sp + 2;

Unfortunately, this made no difference in the run time, since there are still 8 memory cycles. Note that we do not count the register to register data transfers as costing any time because, when this is done inside the CPU, these transfers can all be done in parallel.

A Brilliant Idea

I suspect that Burroughs went through something like the above before someone on the design team came up with a brilliant idea: Instead of keeping the two registers full of data at all times, add two bits of state information, a bit to indicate whether a currently holds data, and a bit to indicate whether b currently holds data. As a result, the system has 3 potential states:

Now, consider how the above program would run if the stack was initially in state 0:

             -- state = 0;
PUSHGA  i    -- b = &i; state = 1;
PUSHGA  i    -- a = &i; state = 2;
LOAD         -- a = M[a]; state = 2;
PUSHI   1    -- sp = sp - 1; M[sp] = b; b = a; a = 1; state = 2;
ADD          -- b = a + b; state = 1;
POPS         -- a = b; b = M[sp]; sp = sp + 1; M[b] = a; state = 0;

This reduces the run time to 4 memory cycles, a significant improvement. The Burroughs implementation simply included the current value of the stack state in its instruction decoding, so that each instruction had a different implementation in each potential stack state:

PUSHGA i in state 0
b = &i; state = 1;
PUSHGA i in state 1
a = &i; state = 2;
PUSHGA i in state 2
M[sp] = b; sp = sp - 1; b = a; a = &i; state = 2;

PUSHI i in state 0
b = i; state = 1;
PUSHI i in state 1
a = i; state = 2;
PUSHI i in state 2
M[sp] = b; sp = sp - 1; b = a; a = i; state = 2;

LOAD in state 0
b = M[sp]; sp = sp + 1; b = M[b]; state = 1;
LOAD in state 1
b = M[b]; state = 1;
LOAD in state 2
a = M[a]; state = 2;

ADD in state 0
a = M[sp]; b = M[sp+1]; sp = sp + 2; b = a + b; state = 1;
ADD in state 1
a = b; b = M[sp]; sp = sp + 1; b = a + b; state = 1;
ADD in state 2
b = a + b; state = 1;

POP in state 0
a = M[sp]; b = M[sp+1]; sp = sp + 2; M[b] = a; state = 0;
POP in state 1
a = b; b = M[sp]; sp = sp + 1; M[b] = a; state = 0;
POP in state 2
M[b] = a; state = 0;

Note that the word cache in English (as opposed to computer science) was originally a verb borrowed from the French meaning to hide. From this, the word was converted to as noun meaning a collection of hidden things. So, you might read a news report about the police discovering a cache of stolen money, or about army troops blowing up a terrorist arms cache. In computer science, the term cache has come to have a specific technical meaning. A cache is a place where copies of data are stored that in order to speed up a computation. The cache is not generally part of the external specification of the computation, but by keeping certain data in the cache, the computation can be done faster. Cache memory is frequently included inside CPUs in order to reduce the frequency of access to main memory, and operating systems may include a cache of frequently used disk sectors in main memory in order to reduce the frequency of accesses to slower disk or flash memory.

The a and b registers used inside the Burroughs CPU served as a small cache for the top elements on the stack. Later Burroughs stack machines added additional registers, typically a total of 4, enlarging this cache and further reducing the likelihood that stack operations would result in actual memory cycles.

Stack Top Caches in Compilers

This idea can be directly applied to register allocation in a compiler, with one major change: In a compiler, we want to avoid using instructions for register-to-register transfers. As a result, we do not want to dedicate a particular register to holding the top item on the stack. The stack state is therefore more complicated. If we have a block of n registers that is to be used as a stack-top cache, each of these registers has one of n+1 possible uses: It may be unused, it may hold the stack top, an item below the stack top, all the way up to the item n–1 items below the stack top.

A first venture at this would attach, to each register, a counter indicating whether it was currently not part of the stack (0), or whether it represented the top element (1), the element below the top (2), and so on. This suggests that we might need to search the registers to find the stack top, and it suggests the need to update all of the registers with nonzero tags each time there is a push or pop. This is not appealing.

Fortunately, there is an alternative. If we arrange the set of size cache registers holding the stack top as a circular buffer in the range min to max, where size=(max-min)+1, we can use just one index to track which register holds the stack top, call it top, and a count to track how many stack elements are currently in registers, call it count. On the ARM, for example, min might be 3 and max might be 12. Here is the logic for generating code for the pushi c and add instructions in this context:

void pushi( int c ){
        int r = top + 1;
        if (r > max) r = min;
        if (count == size) { /* we need to clear a register */
                -- emit code to push R[r] on the RAM stack --
                count = count - 1;
        }
        -- emit code to load the constant c into R[r] --
        top = r;
        count = count + 1;
}
void add(){
        if (count == 0) { /* we to get the stack top into a register */
                -- emit code to pop the RAM stack into R[top] --
                count = count + 1;
        }
        int r = top - 1;
        if (r < min) r = max;
        if (count == 1) { /* we to get the stack top into a register */
                -- emit code to pop the RAM stack into R[r] --
                count = count + 1;
        }
        -- emit code to add R[r] plus R[top] and put the result in R[r]
        top = r;
        count = count - 1;
}

This scheme works well within a basic block, but whenever control converges on a label from two or more directions, it does not guarantee that the stack will be in the same state on each path into that label. As a result, at the end of each branch point in the code, the stack-top cache must be put into a standard state such as entirely in RAM. This was not a problem with the hardware stack-top cache, since the cache state there was a dynamic attribute. In general, stack-top caches are not a good stating point for developing more optimal code generators.

In a compiler that uses global optimization, the interblock register allocation problem can be quite complex. In a one-pass compiler without global optimiztion, there are three basic schemes we can use for interblock register allocation: