A Minimal CISC

Part of the Computer Architecture On-Line Collection
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Abstract

The minimal CISC architecture presented here is an extremely simple zero-address architecture suitable for microprogrammed implementation. It is sufficiently simple to be introduced in one lecture, with time left over for discussion of implementation or enhancement options.

This writeup is a revision of a paper by the same name published in ACM Computer Architecture News, 16, 3 (June 1988), pages 56-63.

Introduction

A clear distinction has come to be recognized between two schools of instruction set design, frequently characterized as RISC, standing for reduced instruction set computer architecture and CISC, standing for complex instruction set computer architecture. In fact, the distinction between these schools emerged long before the names were coined. For example, this distinction is quite apparent in the comparison of the Data General Nova (RISC) and the DEC PDP-11 (CISC) architectures developed in the late 1960s.

The Nova has an instruction set in which most instructions can execute in a single fixed-length cycle involving an instruction fetch, and one of either a fetch, a store, or an operation on registers. In contrast, the PDP-11 has numerous addressing modes; depending on the mode, an instruction may execute in from 1 to 7 memory cycles. Both machines were designed in the late 1960s, and except for the RISC-CISC distinction, they are quite comparable; they competed for similar applications, offering similar performance at similar prices.

In teaching introductory computer architecture courses, it is frequently difficult to find simple architectures which illustrate the distinction between these two schools of design; this is particularly difficult on the CISC side, where, as the name suggests, complexity is the rule. The architecture presented here was developed to meet this need.

The minimal CISC instruction set is presented and formally specified in the next section. An example program for the minimal CISC is presented in the section after that, and the example is then used to compare the potential performance of the minimal CISC with the DEC PDP-11. The final section discusses the effects of modifications to the architecture and its implementations.

Instruction Set

The minimal CISC instruction set presented here is stack-based and composed entirely of zero-address instructions. There are only 8 instructions, so each can be coded as a 3-bit syllable. Assuming a 16 bit word, 5 instruction syllables can be packed in each word (the least significant syllable is the first). The instructions are:

(000) NOP:: No operation.
(001) DUP:: Duplicate the stack top. This is the only way to allocate stack space.
(010) ONE:: Shift the stack top left one bit, shifting one into the least significant bit.
(011) ZERO:: Shift the stack top left one bit, shifting zero into the least significant bit.
(100) LOAD:: Use the value on the stack top as a memory address; replace it with the contents of the referenced location.
(101) POP:: Store the value from the top of the stack in the memory location referenced by the second word on the stack; pop both.
(110) SUB:: Subtract the top value on the stack from the value below it, pop both and push the result.
(111) JPOS:: If the word below the stack top is positive, jump to the word pointed to by the stack top. In any case, pop both.

This instruction set is not particularly convenient, but that is not the point of this exercise. The important thing is that it is very simple yet it is sufficient to write any program, given enough memory, and given memory mapped input-output devices.

Using this instruction set, any constant can be pushed on the stack by a DUP followed by 16 ONE or ZERO instructions. Zero may be pushed on the stack by the sequence DUP DUP SUB; negation may be done by subtracting from zero; addition may be done by subtracting a negated value, and pushing zero prior to pushing an unconditional branch address allows an unconditional branch. The code in the example given later illustrates these tricks.

The NOP instruciton is required because the JPOS instruciton can only address the first instruciton in a word. If the logical destination does not fall on a word boundary, it must be moved to a word boundary by padding with NOP instructions.

The following high level description of this architecture is presented using a Pascal-like pseudocode. This description will be used to support the evaluation of this architecture, and it will be used to systematically derive the register-transfer logic and the corresponding microcoded control unit.

      var
          ir, pc, acc, sp, t: word { registers };
          m: array [word] of word { memory };
      
      while true do begin
      99: { instruction fetch }
          ir := m[pc];
          pc := pc + 1;
          repeat
              { syllable decode }
              case ir mod 8 of
                  0: { nop };
                  1: begin { dup }
                        m[sp] := acc;
                        sp := sp + 1;
                     end;
                  2: { one }  acc := acc << 1 | 1;
                  3: { zero } acc := acc << 1 | 0;
                  4: { load } acc := m[acc];
                  5: begin { pop }
                         sp := sp - 1;
                         t := m[sp];
                         m[t] := acc;
                         sp := sp - 1;
                         acc := m[sp];
                     end;
                  6: begin { sub }
                         sp := sp - 1;
                         acc := m[sp] - acc;
                     end;
                  7: begin { jpos }
                         sp := sp - 1;
                         t := m[sp];
                         sp := sp - 1;
                         if t >= 0 then begin
                             pc := acc;
                             acc := m[sp];
                             goto 99;
                         end;
                         acc := m[sp];
                     end;
              end { case };
              ir := ir div 8;
          until ir = 0;
      end { while };

Some effort was made to limite the variety of operations used in the above pseudocode. Among the operations avoided were subscript expressions such as m[sp+1] or m[m[sp]], and double decrements such as sp:=sp-2. These would have allowed a more compact textual representation, at the expense of making the derivation of the hardware less obvious.

Programming

As an illustration of programming the minimal CISC, consider the problem of summing the contents of an array in memory. A Pascal-like outline of a solution to this problem follows;

      variables ctr, ptr, sum;

      sum := 0;
      ptr := address( array );
      ctr := size( array ) - 1;
      repeat
          sum := sum + ptr^;
          ptr := ptr + 1;
          ctr := ctr - 1;
      until ctr < 0;
      ...

The SMAL assembly language will be used to present the machine code that solves this problem. In this language, the W assembly directive causes a word to be assembled into memory, and most of the other details are borrowed from the MACRO-11 assembler for the PDP-11 (VAX assembly language is very similar). To simplify the coding, a macro will be defined that assembles instruction syllables into words:

      MACRO CODE i  ; accumulate the instruction i
        cword = cword + (i << ccount)
        ccount = ccount + 3
        IF ccount > 12
          W cword   ; emit the accumulated word
          cword = 0
          ccount = 0
        ENDIF
      ENDMAC

      MACRO FLUSH
        IF ccount > 0
          W cword   ; emit the accumulated word
          cword = 0
          ccount = 0
        ENDIF
      ENDMAC

      cword = 0     ; reset accumulator
      ccount = 0    ; reset shift count

In the above, note that a<<b shifts a left by b bits. The CODE macro accumulates one instruction syllable, while FLUSH should be used before each label to align the next instruction on a word boundary. To make the code readable, symbolic constants will be defined for each machine isntruction:

      NOP  = 0          LOAD = 4
      DUP  = 1          POP  = 5
      ONE  = 2          SUB  = 6
      ZERO = 3          JPOS = 7

Although these tools are sufficient for programming, macros for pushing constants on the stack are needed if examples are to be presented compactly. The parameters to these macros are passed by value, as indicated by the leading equals sign on the formal parameter declarations.

      MACRO PUSH =c ; push c on the stack
        CODE DUP
        PUTBITS c, 16
      ENDMAC

      MACRO PUTBITS =c, =b
        IF b > 1    ; first put the more significant bits
          PUTBITS c>>1, b-1
        ENDIF
        IF c&1 = 1  ; second, put the least significant bit
          CODE ONE
        ELSE
          CODE ZERO
        ENDIF
      ENDMAC

      MACRO PUSH0   ; push constant zero
        CODE DUP
        CODE DUP
        CODE SUB
      ENDMAC

PUSH c pushes the constant c onto the stack; this takes 17 machine instructions, independent of the value pushed. PUTBITS c,b shifts the b least significant bits of c onto the stack top. SMAL code for the example fragment on the minimal CISC can now be written:

      PUSH sum   ; sum
      PUSH0      ; sum, 0
      CODE POP   ; -- sum := 0
      PUSH ptr   ; ptr
      PUSH array ; ptr, array
      CODE POP   ; -- ptr := address(array)
      PUSH ctr   ; ctr
      PUSH size-1; ctr, size-1
      FLUSH
loop:            ; ctr, <value to put in ctr>
      CODE POP   ; -- ctr := size(array)-1
                 ; -- ctr := ctr-1
      PUSH sum   ; sum
      CODE DUP   ; sum, sum
      CODE LOAD  ; sum, [sum]
      PUSH0      ; sum, [sum], 0
      PUSH ptr   ; sum, [sum], 0, ptr
      CODE LOAD  ; sum, [sum], 0, [ptr]
      CODE LOAD  ; sum, [sum], 0, [[ptr]]
      CODE SUB   ; sum, [sum], -[[ptr]]
      CODE SUB   ; sum, [sum]+[[ptr]]
      CODE POP   ; -- sum := sum + ptr^
      PUSH ptr   ; ptr
      CODE DUP   ; ptr, ptr
      CODE LOAD  ; ptr, [ptr]
      PUSH0      ; ptr, [ptr], 0
      CODE DUP   ; ptr, [ptr], 0, 0
      CODE ONE   ; ptr, [ptr], 0, 1
      CODE SUB   ; ptr, [ptr], -1
      CODE SUB   ; ptr, [ptr]+1
      CODE POP   ; -- ptr := ptr + 1
      PUSH ctr   ; ctr
      CODE DUP   ; ctr, ctr
      CODE LOAD  ; ctr, [ctr]
      PUSH0      ; ctr, [ctr], 0
      CODE ONE   ; ctr, [ctr], 1
      CODE SUB   ; ctr, [ctr]-1
      CODE DUP   ; ctr, [ctr]-1, [ctr]-1
      PUSH loop  ; ctr, [ctr]-1, [ctr]-1, loop
      CODE JPOS  ; ctr, [ctr]-1
      CODE POP   ; -- ctr := ctr - 1;

The macros in this program expand to a total of 206 instructions. Thus, the program fills 41 words, plus 3 bits of a 42nd word. One pass through the loop body involves, 115 instructions and can be executed with 23 memory cycles for instruction fetch, 8 memory cycles for operand loading and storing, and 32 memory cycles for stack manipulation. Thus, the run-time for this program will be dominated by the need to perform 63 memory cycles per iteration of the loop, where most cycles access the top elements of the stack.

In contrast, the equivalent PDP-11 program using general registers and using the SOB (subtract one and branch) instruction for loop control can be written in 5 instructions occupying 7 words. In this case, the loop body is 2 instructions long and requires 3 memory references per iteration.

The above comparison clearly demonstrates that the minimal CISC architecture is not competitive, but it does not demonstrate that it is suboptimal! The class of optimal architectures can be thought of as a surface in a multidimensional computer design space. Taking typical axes of the space to be processor complexity (measured, for example, as the number of nand gates required to implement the processor, or the number of square millimeters of silicon needed under a fixed set of design rules), the program size for some benchmark, and the memory traffic required to execute that benchmark, it is clear that the PDP-11 is better than the minimal CISC on two axes, but it also requires a more complex processor. The minimal CISC can only be proven to be suboptimal if a processor can be found that is better when measured along at least one axis of the design space while being no worse along any other axes. Of course, serious evaluation of an architecture must rest on a benchmark that is more comprehensive than that used here!

Implementation

A systematic approach to implementing the minimal CISC architecture described by the pseudocode given above 2 involves, first, identifying the register-transfers used in the code, then building a register transfer machine that can perform these transfers, and finally designing a control unit to evoke the transfers in the right order.

Each assignment statement in the pseudocode describes a register transfer. One register can hold the value of each simple variable, and the list of assignments to each variable determines the functional units that must process the input to the corresponding register and the registers from which those functional units take their inputs. This approach was followed in deriving the register-transfer logic outlined in Figure 1. Here, connections to the control unit are shown to the left, and connections to the memory are to the right. A tri-state data bus is assumed, where the control signal MRE (memory read) causes the contents of the addressed word to be gated onto the data bus, and a positive transition of MWR (memory write) causes the contents of the bus to be stored in the addressed word.

Figure 1

It is worth noting that all of the functional boxes in Figure 1 correspond closely to standard MSI chips. The sp register and its functional unit can be implemented by a 74LS169A up-down counter, the pc register can be implemented by a 74LS161 counter, and the ir register can be implemented by a 74LS298 register. Finally, a general purpose ALU such as the 74LS381 can perform the operations in the functional unit feeding acc.

The reader is invited to rewriote the pseudocode for the minimal CISC with control signals to evoke particular register transfers substituted for the assignment statements. Consider using the notation (ACW=1, IRF=0, IRC, SPF=1, SPC), for example, to mean `Hold ACW and SPF to one, hold IRF to zero, and apply clock pulses to IRC and SPC; this would evoke the register transfers ir:=acc and sp:=sp-1 concurrently.

In carrying out this exercise, note that the operation of shifting the instruction register can be moved from the end of the inner while loop to the end of each alternative in the case statement. This allows the shift operation to be done in parallel with the final register transfer of each alternative.

Before a microprogram can be written, one hardware detail must be determined: How does the microcode interpreter perform conditional branches? The solution used here is outlined in Figure 2. The next microinstruction address is formed by oring the condition selected by the condition select field of each microinstruction with the least significant bits of the next address field. Note the use of a two phase clock; all registers in the data part change on the negative edge of the clock pulse, while the microprogram state advances on the positive clock edge. If a single-phase clock were used, where all registers changed simultaneously, the overall speed of the system could probably be higher, but the microprogram would be larger because conditional branches in the microcode could not depend on the results of the current register transfer.

Figure 2

Microcode for the minimal CISC is given in Table 1. The comments indicate what phase of which machine instruction each microinstruction performs. Each instruction execution cycle begins with a fetch or a decode microinstruction, depending on whether the previous instruction was the last in a word. NOP, DUP, ONE, ZERO and LOAD can be executed with one additional microinstruction (assuming that memory latencies equal register latencies, in the latter case). The other instructions require more cycles. The line commented JPOS 3-, is used when JPOS detects a negative operand. The same line is used as the final microinstruciton of POP, since in both cases, the accumulator must be loaded from the stack top.

uaddr || next uadr | clocks |bus |  function   |
------++-----------+--------+----+-------------+---------
      ||           | IAPTSM | AM | I A  P S M  |             
      || COND NEXT | RCCCPW | CR | R C  C P A  |             
      ||           | CCC CR | WE | F F  F F D  |
------++-----------+--------+----+-------------+---------
 0000 ||  11  1000 | 101000 | 01 | 0 xx 0 x 00 | fetch
 0001 ||  11  1000 | 100000 | 00 | 1 xx x x xx | decode
 0010 ||  00  0011 | 000100 | 01 | x xx x x 10 | POP 2
 0011 ||  00  0111 | 000011 | 10 | x xx x 1 11 | POP 3
 0100 ||  10  0000 | 010000 | 01 | x 11 x x 10 | SUB 2
 0101 ||  01  0110 | 000110 | 01 | x xx x 1 10 | JPOS 2
 0110 ||  00  0000 | 011000 | 01 | x 10 1 x 10 | JPOS 3+
 0111 ||  10  0000 | 010000 | 01 | x 10 x x 10 | JPOS 3-
 1000 ||  10  0000 | 000000 | 00 | x xx x x xx | NOP
 1001 ||  10  0000 | 000011 | 10 | x xx x 0 10 | DUP
 1010 ||  10  0000 | 010000 | 00 | x 01 x x xx | ONE
 1011 ||  10  0000 | 010000 | 10 | x 00 x x xx | ZERO
 1100 ||  10  0000 | 010000 | 01 | x 10 x x 01 | LOAD
 1101 ||  00  0010 | 000010 | 00 | x xx x 1 xx | POP 1
 1110 ||  00  0100 | 000010 | 00 | x xx x 1 xx | SUB 1
 1111 ||  00  0101 | 000010 | 00 | x xx x 1 xx | JPOS 1

As noted, the microcode presented in Table 1 assumes that the memory access time can be handled in a single microcycle. If the clock rate is fixed, this means that the microcycle time must match the memory cycle time, even if many register transfers can be finished faster. This assumption has not been justified by most viable computer technology. Typically, the microcycle time for non-memory reference operations can be quite fast, while memory reference operations are slower. With such a technology, it is common to overlap microexecution with memory latency by designing the microengine so it can execute multiple microinstructions while a memory cycle is being completed.