In this lecture, we will go into the details of the implementation of a pipelined processor, doing a bottom up design that will actually yield most of the instruciton set as a side-effect of the design process!
We begin by assuming a canonical 5-stage pipe; this is the 5-stage pipeline that follows naturally from an analysis of an instruction set for a 1 and 1/2 address machine that contains both memory to register and register to memory arithmetic operations; the complexity of this pipeline is more than sufficient to handle register to register operations.
______ _____ Instruction Fetch |______|-----| | |____| _ | | Address Computation |______|-| | | | |____| | | | | Operand Fetch |______|=| |-| M | |____| |R| | | Operate |______| | | | | |____| | | | | Operand Store |______|=|_|-|_____|
Aside: Instruction sets may be classified in terms of how many memory addresses there are per instruction. A 3-address machine allows two source operands from anywhere in memory to be combined and the result stored anywhere in memory, using only one instruction. The DEC/Compaq VAX architecture is the best known of the post 1970 3-address machines. A 2-address machine allows an operand anywhere in memory to be combined with any variable in memory. The DEC PDP-11 and M680x0 families allowed this. A 1-address machine allows an operand anywhere in memory to be addressed; typically, such machines have an accumulator with which operands may be combined. Zero-address machines use a stack for all operand manipulation and addressing, with no address fields required in any instructions.We'll assume that a word is 32 bits, that there are 16 general purpose registers R, word addressable memory, and that all instruction words are identical with the following format typical of 1 and 1/2 address machines:Most stack machines contain several 1-address instructions for loading, storing and branching. A single accumulator is very limiting, so most one address machines contain multiple accumulators. Machine instructions on such a one-address machine therefore allow a rudimentary second address with a very limited address space, mainly, the identity of the accumulator to use. Such a machine is called a one-and-a-half address machine.
_______________ _______ _______ _______________________________ |_______________|_______|_______|_______________________________| |opcode (8 bits)| r | x | disp (16 bits) |We'll assume that we want to support instructions such as:
Finally, we'll assume that when x is 1111, this is a reference to R[15], the program counter, giving us PC relative addressing.
For the moment, we will completely ignore branches!
We can actually derive the contents of all of the interstage registers from the above description! We start with the instruction fetch to address computation interface. Clearly, this must contain the instruction that was fetched! In addition, because we are interested in relative addressing, we should pass onward the value of the PC that was used to fetch that instruction.
Instruction Fetch |____________________________________| | pc || op|r|x| disp | _|_______________||___|_|_|_______|_ | | Address ComputationThe address computation stage is going to take information from register x and combine it with various sources to compute the effective address used by this instruction; therefore, the output of the address computation phase must include this effective address. The x and disp fields of the instruction register are not needed by later pipeline stages, and the pc value has been combined into the effective address, if it is needed, so these may be discarded. This gives us:
Address Computation |__________________________| | op|r|| ea | _|___|_||_______________|_ | | Operand FetchThe operand fetch stage is going to use either or both of r and ea to get operands from memory and registers, so obviously, it must include these operands in its output. The operand store stage will need both r and ea itself, so we pass these on as well, giving:
Operand Fetch |____________________________________________________________| | op|r|| ea || opa || opb | _|___|_||_______________||_______________||_______________|_ | | OperateThe operate stage is going to contain the ALU that combines the operands as instructed by the opcode. Thus, its output must contain the result instead of the operands:
Operate |___________________________________________| | op|r|| ea || result | _|___|_||_______________||_______________|_ | | Operand StoreThis completes all the interstage registers!
In outlining the structure of the pipeline stages, we must make some assumptions about the registers and memory. Here, we will assume that these are true multiport random access memories. For the CPU registers, this is a realistic assumption. For the main memory, this is unrealistic, but we won't let this deter us.
In addition, we will assume that there is some cost to a real multiport register or memory, so we will provide an input with each memory or register port, a signal saying "port needed". If this is not asserted during a pipeline cycle, that port is inactive.
We can now begin to build the pipeline stages, starting with the instruction fetch stage. In the abstract, this perorms the following logic:
repeat output_ir = M[pc] output_pc = pc pc = pc + 1 foreverIn the following, we'll assume that all registers are clocked simultaneously unless clock logic is shown! With this assumption, the fetch stage can be constructed as follows:
Instruction Fetch Stage True -----> need memory ---------o------------------------------> address to memory | __|__ -------------< data from memory | | +1 | | | |__ __| | | | | | ...|... | | | | | _______|_______ _______|_______ | | pc || op|r|x| disp | | |_______________||___|_|_|_______| |_________|The logic given above cannot perform branch instructions, but we will solve this later by adding a multiplexor at the location marked with a dotted line.
The address computation stage is only marginally more complex. This performs the following computation:
repeat output_op = input_ir.op output_r = input_ir.r if input_ir.x = 0000 need_regs = 0 output_ea = input_ir.disp + 0 else if input_ir.x = 1111 need_regs = 0 output_ea = input_ir.disp + input_pc else need_regs = 1 output_ea = input_ir.disp + R[input_x] endif foreverThis reduces to the following logic:
Address Computation Stage _______________ _______________ | pc || op|r|x| disp | |_______________||___|_|_|_______| | | | | | -----|---------- | | | | --|------------- o----|----------> register number | | - ----------|----|----------< data from registers | | | | 0 | | | | __|__|__|__ __|__ | ___ | | \ 0 0 1 /--|=0000|-|-|nor|----> need register value | | \1__0__0/---|=1111|-|-|___| | | | | | | | ----------- | | |___| | | | + | | | |_____| _|__| _______|_______ | op|r|| ea | |___|_||_______________|The operand fetch stage is quite simple:
repeat output_op = input_op output_r = input_r output_ea = input_ea if read_register( input_op ) output_opa = M[ input_ea ] needs_register = true else output_opa = undefined needs_register = false endif if read_memory( input_op ) output_opb = M[ input_ea ] needs_memory = true else output_opb = input_ea needs_memory = false endif foreverThis translates to the following logic:
Operand Fetch Stage _____ _______________ | op|r|| ea | |___|_||_______________| | | | | o---------|------------------------> register number | | | _________ -------< data from registers | | | | read | | o--|---------|--|registers|---|-------> need register value | | | |_________| | | | | | | | o--------------------o---> address to memory | | | | | -< data from memory | | | | _|_|_ | | | ______ --|--\0_1/ | | | | read | | | | o--|---------|--|memory|---o--|----|--> needs memory | | | |______| | | | | | | ----------- _|__| _______|_______ _______|_______ _______|_______ | op|r|| ea || opa || opb | |___|_||_______________||_______________||_______________|The operate stage is equally straightforward:
repeat output_op = input_op output_r = input_r output_ea = input_ea output_result = ALU( aluop( input_op ), input_opa, input_opb ) forever
Operate Stage _____ _______________ _______________ _______________ | op|r|| ea || opa || opb | |___|_||_______________||_______________||_______________| | | | | | | | | - -------------- | | | _____ _|___|_ o--|---------|---|aluop|--| alu | | | | |_____| |_______| _|__| _______|_______ _______|_______ | op|r|| ea || result | |___|_||_______________||_______________|Finally, we have the result store stage. This is complicated by the fact that it may store either to registers or to memory.
repeat if write_register( input_op ) R[ input_r ] = input_result endif if write_memory( input_op ) M[ input_ea ] = input_result endif foreverWe can implement this as follows:
Operand Store Stage _____ _______________ _______________ | op|r|| ea || result | |___|_||_______________||_______________| | | | | | ---------|----------------|-------> register number | | ________ o-------> data to registers | | | write | | o------------|--|register|----|-------> write register | | |________| | | | | | ----------------|-------> address to memory | ______ -------> data to memory | | write| ---------------|memory|--------------> write memory |______|The above design omitted all details of flow-of-control through a program! This is a serious oversight which we will correct later!
The design outlined above directly suggests an instruction set for this machine! In looking at the information derived from the opcode field of the instruction to control the various pipeline stages, we have the following:
_______________ _______ _______ _______________________________ |_______________|_______|_______|_______________________________| |opcode (8 bits)| r | x | disp (16 bits) | _______ _ _ _ _ |_______|_|_|_|_| | | | | | | aluop rr wr rm wmWe could consider the 4 read and write control bits to comprise something of an address mode field for the instruction. This leads to the following interpretation:
_______ _ _ _ _ |_______|_|_|_|_| | | | | | | aluop rr wr rm wm 0 0 0 0 -- * 0 0 0 1 -- * 0 0 1 0 -- immediate to register 0 0 1 1 -- * 0 1 0 0 -- * 0 1 0 1 -- * 0 1 1 0 -- memory to register 0 1 1 1 -- * 1 0 0 0 -- * 1 0 0 1 -- register to memory 1 0 1 0 -- register + immediate to register 1 0 1 1 -- * 1 1 0 0 -- * 1 1 0 1 -- register + memory to memory 1 1 1 0 -- register + memory to register 1 1 1 1 -- register + memory to both register and memoryThe problem with this instruction encoding is that it leaves us with a number of useless addressing modes, for example, modes in which results are discarded, modes in which an immediate operand is also used as an address, and so on. These modes are starred above. In addition, depending on what operations the ALU provides, additional modes provided above may be redundant.
Nonetheless, the above tentative instruction set is actually a good start in the design of a practical computer! To complete this kind of design, what we will do is take some of the opcodes that serve no purpose with the register transfer design given and decode them for other purposes such as conditional branch instructions and other potentially interesting operations.
Adding Branches to the instruction set could be as straightforward as using the convention that r=1111 refers to the PC. As a result, load immediate into pc becomes a branch instruction. Adding function calls is harder.
Finding an opcode for conditionals is also fairly easy. The group of addresing modes neither store results in memory or registers are an obvious candidate to borrow for a new meaning. Finding an appropriate meaning is harder.
As an example of a small change we could make that would be both useful and simple, consider modifying the operand fetch stage as follows:
Operand Fetch Stage _____ _______________ | op|r|| ea | |___|_||_______________| | | | | o---------|---------------o--------> register number | | | _________ | ------< data from registers | | | | read | | | o--|---------|--|registers|--|-|---o--> need register value | | | |_________| _|_|_ | | | | \0_1/-- | | | | | | o--------------------o---> address to memory | | | | | -< data from memory | | | | _|_|_ | | | ______ --|--\0_1/ | | | | read | | | | o--|---------|--|memory|---o--|----|--> needs memory | | | |______| | | | | | | ----------- _|__| _______|_______ _______|_______ _______|_______ | op|r|| ea || opa || opb | |___|_||_______________||_______________||_______________|Here, all we have done is add one multiplexor, so that, if a register is not read, the register number field is available as the opa input to the ALU instead of an undefined value from registers. This creates a new class of immediate addressing modes with short 4-bit operands:
_______ _ _ _ _ |_______|_|_|_|_| | | | | | | aluop rr wr rm wm 0 0 0 0 -- * 0 0 0 1 -- NEW short immediate to memory 0 0 1 0 -- immediate to register 0 0 1 1 -- * 0 1 0 0 -- * 0 1 0 1 -- NEW short immediate + memory to memory 0 1 1 0 -- memory to register 0 1 1 1 -- * 1 0 0 0 -- * 1 0 0 1 -- register to memory 1 0 1 0 -- register + immediate to register 1 0 1 1 -- * 1 1 0 0 -- * 1 1 0 1 -- register + memory to memory 1 1 1 0 -- register + memory to register 1 1 1 1 -- register + memory to both register and memorySome of these addressing modes will be useful with only a subset of the ALU operations:
Even so, it is unlikely that all operations will be used with the same frequency, and it is unlikely that all operations will be equally likely with all addressing modes. As a result, we will most likely want to introduce a more complex encoding of the opcodes, eliminating the orthogonal encoding of addressing modes and introducing a ROM in the address computation stage to take a compact opcode encoding and expand it into an orthogonal endocing.