22C:122/55:032 Notes, Lecture 25, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

Implementing a Pipelined Processor
In this lecture, we will go into the details of the implementation of a pipelined processor, doing a bottom up design that will actually yield most of the instruciton set as a side-effect of the design process!
We begin by assuming a canonical 5-stage pipe; this is the 5-stage pipeline that follows naturally from an analysis of an instruction set for a 1 and 1/2 address machine that contains both memory to register and register to memory arithmetic operations; the complexity of this pipeline is more than sufficient to handle register to register operations.
```
				  ______       _____
	   Instruction Fetch     |______|-----|     |
				  |____|   _  |     |
	   Address Computation   |______|-| | |     |
				  |____|  | | |     |
	   Operand Fetch         |______|=| |-|  M  |
				  |____|  |R| |     |
	   Operate               |______| | | |     |
				  |____|  | | |     |
	   Operand Store         |______|=|_|-|_____|
	
```
Aside: Instruction sets may be classified in terms of how many memory addresses there are per instruction. A 3-address machine allows two source operands from anywhere in memory to be combined and the result stored anywhere in memory, using only one instruction. The DEC/Compaq VAX architecture is the best known of the post 1970 3-address machines. A 2-address machine allows an operand anywhere in memory to be combined with any variable in memory. The DEC PDP-11 and M680x0 families allowed this. A 1-address machine allows an operand anywhere in memory to be addressed; typically, such machines have an accumulator with which operands may be combined. Zero-address machines use a stack for all operand manipulation and addressing, with no address fields required in any instructions.
Most stack machines contain several 1-address instructions for loading, storing and branching. A single accumulator is very limiting, so most one address machines contain multiple accumulators. Machine instructions on such a one-address machine therefore allow a rudimentary second address with a very limited address space, mainly, the identity of the accumulator to use. Such a machine is called a one-and-a-half address machine.
We'll assume that a word is 32 bits, that there are 16 general purpose registers R, word addressable memory, and that all instruction words are identical with the following format typical of 1 and 1/2 address machines:
```
         _______________ _______ _______ _______________________________
	|_______________|_______|_______|_______________________________|
        |opcode (8 bits)|   r   |   x   |       disp (16 bits)          |
	
```
We'll assume that we want to support instructions such as:
- load R[r] from M[R[x] + disp]
- store R[r] to M[R[x] + disp]
- load R[r] with R[x] + disp (this is load immediate)
- add to R[r] from M[R[x] + disp]
- add R[r] to M[R[x] + disp]
- add to R[r] with R[x] + disp (this is add immediate)
Furthermore, we'll have the rule that when x is zero, instead of designating R[0], it designates the constant 0, turning off indexing and giving us genuine immediate operands.
Finally, we'll assume that when x is 1111, this is a reference to R[15], the program counter, giving us PC relative addressing.
For the moment, we will completely ignore branches!
Deriving the Contents of the Interstage Registers
We can actually derive the contents of all of the interstage registers from the above description! We start with the instruction fetch to address computation interface. Clearly, this must contain the instruction that was fetched! In addition, because we are interested in relative addressing, we should pass onward the value of the PC that was used to fetch that instruction.
```
	   Instruction Fetch
	|____________________________________|
	  |       pc      || op|r|x|  disp | 
	 _|_______________||___|_|_|_______|_
	|                                    |
	   Address Computation
	
```
The address computation stage is going to take information from register x and combine it with various sources to compute the effective address used by this instruction; therefore, the output of the address computation phase must include this effective address. The x and disp fields of the instruction register are not needed by later pipeline stages, and the pc value has been combined into the effective address, if it is needed, so these may be discarded. This gives us:
```
	   Address Computation
	|__________________________|
	  | op|r||       ea      |
	 _|___|_||_______________|_
	|                          |
	   Operand Fetch
	
```
The operand fetch stage is going to use either or both of r and ea to get operands from memory and registers, so obviously, it must include these operands in its output. The operand store stage will need both r and ea itself, so we pass these on as well, giving:
```
	   Operand Fetch
	|____________________________________________________________|
	  | op|r||       ea      ||      opa      ||      opb      |
	 _|___|_||_______________||_______________||_______________|_
	|                                                            |
	   Operate
	
```
The operate stage is going to contain the ALU that combines the operands as instructed by the opcode. Thus, its output must contain the result instead of the operands:
```
	   Operate
	|___________________________________________|
	  | op|r||       ea      ||     result    |
	 _|___|_||_______________||_______________|_
	|                                           |
	   Operand Store
	
```
This completes all the interstage registers!
Assumptions about the Memory and Register Interfaces
In outlining the structure of the pipeline stages, we must make some assumptions about the registers and memory. Here, we will assume that these are true multiport random access memories. For the CPU registers, this is a realistic assumption. For the main memory, this is unrealistic, but we won't let this deter us.
In addition, we will assume that there is some cost to a real multiport register or memory, so we will provide an input with each memory or register port, a signal saying "port needed". If this is not asserted during a pipeline cycle, that port is inactive.

Details of the Pipeline, Stage by Stage

We can now begin to build the pipeline stages, starting with the instruction fetch stage. In the abstract, this perorms the following logic:

	repeat
	   output_ir = M[pc]
	   output_pc = pc
	   pc = pc + 1
	forever

In the following, we'll assume that all registers are clocked simultaneously unless clock logic is shown! With this assumption, the fetch stage can be constructed as follows:

	   Instruction Fetch Stage
	                               True -----> need memory
	                                     
	 ---------o------------------------------> address to memory
	|       __|__               -------------< data from memory
	|      | +1  |             |
	|      |__ __|             |
	|         |                |
	|      ...|...             |
	|         |                |
	|  _______|_______  _______|_______
	| |       pc      || op|r|x|  disp | 
	| |_______________||___|_|_|_______|
        |_________|

The logic given above cannot perform branch instructions, but we will solve this later by adding a multiplexor at the location marked with a dotted line.

The address computation stage is only marginally more complex. This performs the following computation:

	repeat
	   output_op = input_ir.op
	   output_r  = input_ir.r
	   if input_ir.x = 0000
	      need_regs = 0
	      output_ea = input_ir.disp + 0
	   else if input_ir.x = 1111
	      need_regs = 0
	      output_ea = input_ir.disp + input_pc
	   else
	      need_regs = 1
	      output_ea = input_ir.disp + R[input_x]
	   endif
	forever

This reduces to the following logic:

	   Address Computation Stage
	   _______________  _______________
	  |       pc      || op|r|x|  disp | 
	  |_______________||___|_|_|_______|
                  |          |  | |    |
             -----|----------   | |    |
            |   --|-------------  o----|----------> register number
            |  |   -    ----------|----|----------< data from registers
            |  |    |  |  0       |    |
            |  |  __|__|__|__   __|__  |  ___ 
            |  |  \ 0  0  1 /--|=0000|-|-|nor|----> need register value
            |  |   \1__0__0/---|=1111|-|-|___|
            |  |       |               |
            |  |       |    ----------- 
            |  |       |___|            
            |  |      |  +  |           
            |  |      |_____|           
	   _|__|  _______|_______
	  | op|r||       ea      |
	  |___|_||_______________|

The operand fetch stage is quite simple:

	repeat
	   output_op = input_op
	   output_r  = input_r
	   output_ea = input_ea
	   if read_register( input_op )
	      output_opa = M[ input_ea ]
	      needs_register = true
	   else
	      output_opa = undefined
	      needs_register = false
	   endif
	   if read_memory( input_op )
	      output_opb = M[ input_ea ]
	      needs_memory = true
	   else
              output_opb =    input_ea
	      needs_memory = false
	   endif
	forever

This translates to the following logic:

	   Operand Fetch Stage
	   _____  _______________
	  | op|r||       ea      |
	  |___|_||_______________|
	    |  |         |
	    |  o---------|------------------------> register number       
	    |  |         |   _________     -------< data from registers
	    |  |         |  |  read   |   |                        
	    o--|---------|--|registers|---|-------> need register value
	    |  |         |  |_________|   |                        
	    |  |         |                |                        
	    |  |         o--------------------o---> address to memory
	    |  |         |                |   |  -< data from memory
	    |  |         |                |  _|_|_
	    |  |         |   ______     --|--\0_1/                 
	    |  |         |  | read |   |  |    |                   
	    o--|---------|--|memory|---o--|----|--> needs memory
	    |  |         |  |______|      |    |
	    |  |         |                |     -----------        
	   _|__|  _______|_______  _______|_______  _______|_______
	  | op|r||       ea      ||      opa      ||      opb      |
	  |___|_||_______________||_______________||_______________|

The operate stage is equally straightforward:

	repeat
	   output_op = input_op
	   output_r  = input_r
	   output_ea = input_ea
	   output_result = ALU( aluop( input_op ), input_opa, input_opb )
	forever

	   Operate Stage
	   _____  _______________  _______________  _______________
	  | op|r||       ea      ||      opa      ||      opb      |
	  |___|_||_______________||_______________||_______________|
            |  |         |                |                |
            |  |         |               -   -------------- 
            |  |         |    _____    _|___|_
            o--|---------|---|aluop|--|  alu  |
            |  |         |   |_____|  |_______|
	   _|__|  _______|_______  _______|_______
	  | op|r||       ea      ||     result    |
	  |___|_||_______________||_______________|

Finally, we have the result store stage. This is complicated by the fact that it may store either to registers or to memory.

	repeat
	   if write_register( input_op )
	      R[ input_r ] = input_result
	   endif
	   if write_memory( input_op )
	      M[ input_ea ] = input_result
	   endif
	forever

We can implement this as follows:

	   Operand Store Stage
	   _____  _______________  _______________
	  | op|r||       ea      ||     result    |
	  |___|_||_______________||_______________|
            |  |         |                |
	    |   ---------|----------------|-------> register number       
	    |            |   ________     o-------> data to registers
	    |            |  |  write |    |                        
	    o------------|--|register|----|-------> write register
	    |            |  |________|    |                        
	    |            |                |                        
	    |             ----------------|-------> address to memory
	    |                ______        -------> data to memory
	    |               | write|
	     ---------------|memory|--------------> write memory
                            |______|

The above design omitted all details of flow-of-control through a program! This is a serious oversight which we will correct later!

Deriving an Instruction Set

The design outlined above directly suggests an instruction set for this machine! In looking at the information derived from the opcode field of the instruction to control the various pipeline stages, we have the following:

rr, read register -- operand fetch stage
rm, read memory -- operand fetch stage
aluop -- operate stage
wr, write_register -- operand store stage
wm, write_memory -- operand store stage

We can easily code these in an 8-bit instruction word as follows:

         _______________ _______ _______ _______________________________
	|_______________|_______|_______|_______________________________|
        |opcode (8 bits)|   r   |   x   |       disp (16 bits)          |
         _______ _ _ _ _
        |_______|_|_|_|_|
        |       | | | | |
          aluop rr  wr
                  rm  wm

We could consider the 4 read and write control bits to comprise something of an address mode field for the instruction. This leads to the following interpretation:

         _______ _ _ _ _
        |_______|_|_|_|_|
        |       | | | | |
          aluop rr  wr
                  rm  wm
                 0 0 0 0  -- *
                 0 0 0 1  -- *
                 0 0 1 0  -- immediate to register
                 0 0 1 1  -- *
                 0 1 0 0  -- *
                 0 1 0 1  -- *
                 0 1 1 0  -- memory to register
                 0 1 1 1  -- *
                 1 0 0 0  -- *
                 1 0 0 1  -- register to memory
                 1 0 1 0  -- register + immediate to register
                 1 0 1 1  -- *
                 1 1 0 0  -- *
                 1 1 0 1  -- register + memory to memory
                 1 1 1 0  -- register + memory to register
                 1 1 1 1  -- register + memory to both register and memory

The problem with this instruction encoding is that it leaves us with a number of useless addressing modes, for example, modes in which results are discarded, modes in which an immediate operand is also used as an address, and so on. These modes are starred above. In addition, depending on what operations the ALU provides, additional modes provided above may be redundant.

Nonetheless, the above tentative instruction set is actually a good start in the design of a practical computer! To complete this kind of design, what we will do is take some of the opcodes that serve no purpose with the register transfer design given and decode them for other purposes such as conditional branch instructions and other potentially interesting operations.

Adding Branches to the instruction set could be as straightforward as using the convention that r=1111 refers to the PC. As a result, load immediate into pc becomes a branch instruction. Adding function calls is harder.

Finding an opcode for conditionals is also fairly easy. The group of addresing modes neither store results in memory or registers are an obvious candidate to borrow for a new meaning. Finding an appropriate meaning is harder.

As an example of a small change we could make that would be both useful and simple, consider modifying the operand fetch stage as follows:

	   Operand Fetch Stage
	   _____  _______________
	  | op|r||       ea      |
	  |___|_||_______________|
	    |  |         |
	    |  o---------|---------------o--------> register number       
	    |  |         |   _________   |  ------< data from registers
	    |  |         |  |  read   |  | |                        
	    o--|---------|--|registers|--|-|---o--> need register value
	    |  |         |  |_________| _|_|_  |                   
	    |  |         |              \0_1/--                    
	    |  |         |                |                        
	    |  |         o--------------------o---> address to memory
	    |  |         |                |   |  -< data from memory
	    |  |         |                |  _|_|_
	    |  |         |   ______     --|--\0_1/                 
	    |  |         |  | read |   |  |    |                   
	    o--|---------|--|memory|---o--|----|--> needs memory
	    |  |         |  |______|      |    |
	    |  |         |                |     -----------        
	   _|__|  _______|_______  _______|_______  _______|_______
	  | op|r||       ea      ||      opa      ||      opb      |
	  |___|_||_______________||_______________||_______________|

Here, all we have done is add one multiplexor, so that, if a register is not read, the register number field is available as the opa input to the ALU instead of an undefined value from registers. This creates a new class of immediate addressing modes with short 4-bit operands:

         _______ _ _ _ _
        |_______|_|_|_|_|
        |       | | | | |
          aluop rr  wr
                  rm  wm
                 0 0 0 0  -- *
                 0 0 0 1  -- NEW short immediate to memory
                 0 0 1 0  -- immediate to register
                 0 0 1 1  -- *
                 0 1 0 0  -- *
                 0 1 0 1  -- NEW short immediate + memory to memory
                 0 1 1 0  -- memory to register
                 0 1 1 1  -- *
                 1 0 0 0  -- *
                 1 0 0 1  -- register to memory
                 1 0 1 0  -- register + immediate to register
                 1 0 1 1  -- *
                 1 1 0 0  -- *
                 1 1 0 1  -- register + memory to memory
                 1 1 1 0  -- register + memory to register
                 1 1 1 1  -- register + memory to both register and memory

Some of these addressing modes will be useful with only a subset of the ALU operations:

result = f(opb) -- for immediate to register
result = f(opa) -- for short immediate or register to memory

The modes shown with a + sign above will be useful with the full range of two-operand arithmetic and logical operations.

Even so, it is unlikely that all operations will be used with the same frequency, and it is unlikely that all operations will be equally likely with all addressing modes. As a result, we will most likely want to introduce a more complex encoding of the opcodes, eliminating the orthogonal encoding of addressing modes and introducing a ROM in the address computation stage to take a compact opcode encoding and expand it into an orthogonal endocing.