A Minimal CISC
Part of
the Computer Architecture On-Line Collection
|
The minimal CISC architecture presented here is an extremely simple zero-address architecture suitable for microprogrammed implementation. It is sufficiently simple to be introduced in one lecture, with time left over for discussion of implementation or enhancement options.
This writeup is a revision of a paper by the same name published in ACM Computer Architecture News, 16, 3 (June 1988), pages 56-63.
A clear distinction has come to be recognized between two schools of instruction set design, frequently characterized as RISC, standing for reduced instruction set computer architecture and CISC, standing for complex instruction set computer architecture. In fact, the distinction between these schools emerged long before the names were coined. For example, this distinction is quite apparent in the comparison of the Data General Nova (RISC) and the DEC PDP-11 (CISC) architectures developed in the late 1960s.
The Nova has an instruction set in which most instructions can execute in a single fixed-length cycle involving an instruction fetch, and one of either a fetch, a store, or an operation on registers. In contrast, the PDP-11 has numerous addressing modes; depending on the mode, an instruction may execute in from 1 to 7 memory cycles. Both machines were designed in the late 1960s, and except for the RISC-CISC distinction, they are quite comparable; they competed for similar applications, offering similar performance at similar prices.
In teaching introductory computer architecture courses, it is frequently difficult to find simple architectures which illustrate the distinction between these two schools of design; this is particularly difficult on the CISC side, where, as the name suggests, complexity is the rule. The architecture presented here was developed to meet this need.
The minimal CISC instruction set is presented and formally specified in the next section. An example program for the minimal CISC is presented in the section after that, and the example is then used to compare the potential performance of the minimal CISC with the DEC PDP-11. The final section discusses the effects of modifications to the architecture and its implementations.
The minimal CISC instruction set presented here is stack-based and composed entirely of zero-address instructions. There are only 8 instructions, so each can be coded as a 3-bit syllable. Assuming a 16 bit word, 5 instruction syllables can be packed in each word (the least significant syllable is the first). The instructions are:
Using this instruction set, any constant can be pushed on the stack by a DUP followed by 16 ONE or ZERO instructions. Zero may be pushed on the stack by the sequence DUP DUP SUB; negation may be done by subtracting from zero; addition may be done by subtracting a negated value, and pushing zero prior to pushing an unconditional branch address allows an unconditional branch. The code in the example given later illustrates these tricks.
The NOP instruciton is required because the JPOS instruciton can only address the first instruciton in a word. If the logical destination does not fall on a word boundary, it must be moved to a word boundary by padding with NOP instructions.
The following high level description of this architecture is presented using a Pascal-like pseudocode. This description will be used to support the evaluation of this architecture, and it will be used to systematically derive the register-transfer logic and the corresponding microcoded control unit.
var ir, pc, acc, sp, t: word { registers }; m: array [word] of word { memory }; while true do begin 99: { instruction fetch } ir := m[pc]; pc := pc + 1; repeat { syllable decode } case ir mod 8 of 0: { nop }; 1: begin { dup } m[sp] := acc; sp := sp + 1; end; 2: { one } acc := acc << 1 | 1; 3: { zero } acc := acc << 1 | 0; 4: { load } acc := m[acc]; 5: begin { pop } sp := sp - 1; t := m[sp]; m[t] := acc; sp := sp - 1; acc := m[sp]; end; 6: begin { sub } sp := sp - 1; acc := m[sp] - acc; end; 7: begin { jpos } sp := sp - 1; t := m[sp]; sp := sp - 1; if t >= 0 then begin pc := acc; acc := m[sp]; goto 99; end; acc := m[sp]; end; end { case }; ir := ir div 8; until ir = 0; end { while };Some effort was made to limite the variety of operations used in the above pseudocode. Among the operations avoided were subscript expressions such as m[sp+1] or m[m[sp]], and double decrements such as sp:=sp-2. These would have allowed a more compact textual representation, at the expense of making the derivation of the hardware less obvious.
As an illustration of programming the minimal CISC, consider the problem of summing the contents of an array in memory. A Pascal-like outline of a solution to this problem follows;
variables ctr, ptr, sum; sum := 0; ptr := address( array ); ctr := size( array ) - 1; repeat sum := sum + ptr^; ptr := ptr + 1; ctr := ctr - 1; until ctr < 0; ...The SMAL assembly language will be used to present the machine code that solves this problem. In this language, the W assembly directive causes a word to be assembled into memory, and most of the other details are borrowed from the MACRO-11 assembler for the PDP-11 (VAX assembly language is very similar). To simplify the coding, a macro will be defined that assembles instruction syllables into words:
MACRO CODE i ; accumulate the instruction i cword = cword + (i << ccount) ccount = ccount + 3 IF ccount > 12 W cword ; emit the accumulated word cword = 0 ccount = 0 ENDIF ENDMAC MACRO FLUSH IF ccount > 0 W cword ; emit the accumulated word cword = 0 ccount = 0 ENDIF ENDMAC cword = 0 ; reset accumulator ccount = 0 ; reset shift countIn the above, note that a<<b shifts a left by b bits. The CODE macro accumulates one instruction syllable, while FLUSH should be used before each label to align the next instruction on a word boundary. To make the code readable, symbolic constants will be defined for each machine isntruction:
NOP = 0 LOAD = 4 DUP = 1 POP = 5 ONE = 2 SUB = 6 ZERO = 3 JPOS = 7Although these tools are sufficient for programming, macros for pushing constants on the stack are needed if examples are to be presented compactly. The parameters to these macros are passed by value, as indicated by the leading equals sign on the formal parameter declarations.
MACRO PUSH =c ; push c on the stack CODE DUP PUTBITS c, 16 ENDMAC MACRO PUTBITS =c, =b IF b > 1 ; first put the more significant bits PUTBITS c>>1, b-1 ENDIF IF c&1 = 1 ; second, put the least significant bit CODE ONE ELSE CODE ZERO ENDIF ENDMAC MACRO PUSH0 ; push constant zero CODE DUP CODE DUP CODE SUB ENDMACPUSH c pushes the constant c onto the stack; this takes 17 machine instructions, independent of the value pushed. PUTBITS c,b shifts the b least significant bits of c onto the stack top. SMAL code for the example fragment on the minimal CISC can now be written:
PUSH sum ; sum PUSH0 ; sum, 0 CODE POP ; -- sum := 0 PUSH ptr ; ptr PUSH array ; ptr, array CODE POP ; -- ptr := address(array) PUSH ctr ; ctr PUSH size-1; ctr, size-1 FLUSH loop: ; ctr, <value to put in ctr> CODE POP ; -- ctr := size(array)-1 ; -- ctr := ctr-1 PUSH sum ; sum CODE DUP ; sum, sum CODE LOAD ; sum, [sum] PUSH0 ; sum, [sum], 0 PUSH ptr ; sum, [sum], 0, ptr CODE LOAD ; sum, [sum], 0, [ptr] CODE LOAD ; sum, [sum], 0, [[ptr]] CODE SUB ; sum, [sum], -[[ptr]] CODE SUB ; sum, [sum]+[[ptr]] CODE POP ; -- sum := sum + ptr^ PUSH ptr ; ptr CODE DUP ; ptr, ptr CODE LOAD ; ptr, [ptr] PUSH0 ; ptr, [ptr], 0 CODE DUP ; ptr, [ptr], 0, 0 CODE ONE ; ptr, [ptr], 0, 1 CODE SUB ; ptr, [ptr], -1 CODE SUB ; ptr, [ptr]+1 CODE POP ; -- ptr := ptr + 1 PUSH ctr ; ctr CODE DUP ; ctr, ctr CODE LOAD ; ctr, [ctr] PUSH0 ; ctr, [ctr], 0 CODE ONE ; ctr, [ctr], 1 CODE SUB ; ctr, [ctr]-1 CODE DUP ; ctr, [ctr]-1, [ctr]-1 PUSH loop ; ctr, [ctr]-1, [ctr]-1, loop CODE JPOS ; ctr, [ctr]-1 CODE POP ; -- ctr := ctr - 1;The macros in this program expand to a total of 206 instructions. Thus, the program fills 41 words, plus 3 bits of a 42nd word. One pass through the loop body involves, 115 instructions and can be executed with 23 memory cycles for instruction fetch, 8 memory cycles for operand loading and storing, and 32 memory cycles for stack manipulation. Thus, the run-time for this program will be dominated by the need to perform 63 memory cycles per iteration of the loop, where most cycles access the top elements of the stack.
In contrast, the equivalent PDP-11 program using general registers and using the SOB (subtract one and branch) instruction for loop control can be written in 5 instructions occupying 7 words. In this case, the loop body is 2 instructions long and requires 3 memory references per iteration.
The above comparison clearly demonstrates that the minimal CISC architecture is not competitive, but it does not demonstrate that it is suboptimal! The class of optimal architectures can be thought of as a surface in a multidimensional computer design space. Taking typical axes of the space to be processor complexity (measured, for example, as the number of nand gates required to implement the processor, or the number of square millimeters of silicon needed under a fixed set of design rules), the program size for some benchmark, and the memory traffic required to execute that benchmark, it is clear that the PDP-11 is better than the minimal CISC on two axes, but it also requires a more complex processor. The minimal CISC can only be proven to be suboptimal if a processor can be found that is better when measured along at least one axis of the design space while being no worse along any other axes. Of course, serious evaluation of an architecture must rest on a benchmark that is more comprehensive than that used here!
A systematic approach to implementing the minimal CISC architecture described by the pseudocode given above 2 involves, first, identifying the register-transfers used in the code, then building a register transfer machine that can perform these transfers, and finally designing a control unit to evoke the transfers in the right order.
Each assignment statement in the pseudocode describes a register transfer. One register can hold the value of each simple variable, and the list of assignments to each variable determines the functional units that must process the input to the corresponding register and the registers from which those functional units take their inputs. This approach was followed in deriving the register-transfer logic outlined in Figure 1. Here, connections to the control unit are shown to the left, and connections to the memory are to the right. A tri-state data bus is assumed, where the control signal MRE (memory read) causes the contents of the addressed word to be gated onto the data bus, and a positive transition of MWR (memory write) causes the contents of the bus to be stored in the addressed word.
Figure 1It is worth noting that all of the functional boxes in Figure 1 correspond closely to standard MSI chips. The sp register and its functional unit can be implemented by a 74LS169A up-down counter, the pc register can be implemented by a 74LS161 counter, and the ir register can be implemented by a 74LS298 register. Finally, a general purpose ALU such as the 74LS381 can perform the operations in the functional unit feeding acc.
The reader is invited to rewriote the pseudocode for the minimal CISC with control signals to evoke particular register transfers substituted for the assignment statements. Consider using the notation (ACW=1, IRF=0, IRC, SPF=1, SPC), for example, to mean `Hold ACW and SPF to one, hold IRF to zero, and apply clock pulses to IRC and SPC; this would evoke the register transfers ir:=acc and sp:=sp-1 concurrently.
In carrying out this exercise, note that the operation of shifting the instruction register can be moved from the end of the inner while loop to the end of each alternative in the case statement. This allows the shift operation to be done in parallel with the final register transfer of each alternative.
Before a microprogram can be written, one hardware detail must be determined: How does the microcode interpreter perform conditional branches? The solution used here is outlined in Figure 2. The next microinstruction address is formed by oring the condition selected by the condition select field of each microinstruction with the least significant bits of the next address field. Note the use of a two phase clock; all registers in the data part change on the negative edge of the clock pulse, while the microprogram state advances on the positive clock edge. If a single-phase clock were used, where all registers changed simultaneously, the overall speed of the system could probably be higher, but the microprogram would be larger because conditional branches in the microcode could not depend on the results of the current register transfer.
Figure 2Microcode for the minimal CISC is given in Table 1. The comments indicate what phase of which machine instruction each microinstruction performs. Each instruction execution cycle begins with a fetch or a decode microinstruction, depending on whether the previous instruction was the last in a word. NOP, DUP, ONE, ZERO and LOAD can be executed with one additional microinstruction (assuming that memory latencies equal register latencies, in the latter case). The other instructions require more cycles. The line commented JPOS 3-, is used when JPOS detects a negative operand. The same line is used as the final microinstruciton of POP, since in both cases, the accumulator must be loaded from the stack top.
uaddr || next uadr | clocks |bus | function | ------++-----------+--------+----+-------------+--------- || | IAPTSM | AM | I A P S M | || COND NEXT | RCCCPW | CR | R C C P A | || | CCC CR | WE | F F F F D | ------++-----------+--------+----+-------------+--------- 0000 || 11 1000 | 101000 | 01 | 0 xx 0 x 00 | fetch 0001 || 11 1000 | 100000 | 00 | 1 xx x x xx | decode 0010 || 00 0011 | 000100 | 01 | x xx x x 10 | POP 2 0011 || 00 0111 | 000011 | 10 | x xx x 1 11 | POP 3 0100 || 10 0000 | 010000 | 01 | x 11 x x 10 | SUB 2 0101 || 01 0110 | 000110 | 01 | x xx x 1 10 | JPOS 2 0110 || 00 0000 | 011000 | 01 | x 10 1 x 10 | JPOS 3+ 0111 || 10 0000 | 010000 | 01 | x 10 x x 10 | JPOS 3- 1000 || 10 0000 | 000000 | 00 | x xx x x xx | NOP 1001 || 10 0000 | 000011 | 10 | x xx x 0 10 | DUP 1010 || 10 0000 | 010000 | 00 | x 01 x x xx | ONE 1011 || 10 0000 | 010000 | 10 | x 00 x x xx | ZERO 1100 || 10 0000 | 010000 | 01 | x 10 x x 01 | LOAD 1101 || 00 0010 | 000010 | 00 | x xx x 1 xx | POP 1 1110 || 00 0100 | 000010 | 00 | x xx x 1 xx | SUB 1 1111 || 00 0101 | 000010 | 00 | x xx x 1 xx | JPOS 1As noted, the microcode presented in Table 1 assumes that the memory access time can be handled in a single microcycle. If the clock rate is fixed, this means that the microcycle time must match the memory cycle time, even if many register transfers can be finished faster. This assumption has not been justified by most viable computer technology. Typically, the microcycle time for non-memory reference operations can be quite fast, while memory reference operations are slower. With such a technology, it is common to overlap microexecution with memory latency by designing the microengine so it can execute multiple microinstructions while a memory cycle is being completed.