26. Pipelined CPUs
Part of
the 22C:122/55:132 Lecture Notes for Spring 2004
|
In the late 1960's, IBM introduced the 360 model 91; this was the first commercial offering of a pipelined computer, if one ignores the fact that the CDC 6600 computer could be viewed, in retrospect, as pipelined, in the sense that instruction fetch was done in anticipation of later execution by one or another functional unit.
The IBM System 360 family of computers had a common architecture, that is, a common instruction set, as seen by the users, with a wide variety of implementations. Vertical microcode (with large numbers of short microinstructions per instruction) and horizontal microcode (with a small number of long microinstgructions per instruction) were both used for some members of this family.
It is worth noting that the System 360 family is alive and well today. The number was coined by marketing to mean a 3rd generation machine for the 1960's, so, of course, when the 1960's ended, marketing considerations demanded a change of numbering, so the 370 family was born. In the 1980's, a new and somewhat scrambled numbering system obscured what was going on, but in the 1990's, the 390 family carried on the tradition. Today's representatives of this family remain object-code compatable with the original for user programs, despite immense changes in the implementation technology, the I/O architecture, and the operating system environment.
The IBM System 360 family is characterized by a 32-bit word, and 16 general purpose registers. The basic instruction format of the IBM System 360 was and is:
Full-Word Instructions: 8 4 4 4 12 |_______________|_______|_______|_______|_______________________| |_______________|_______|_______|_______|_______________________| | OP R X B DISP | Typical examples: L (load) GPR[R] = M[ GPR[X] + GPR[B] + DISP ] LA (load address) GPR[R] = GPR[X] + GPR[B] + DISP ST (store) M[ GPR[X] + GPR[B] + DISP ] = GPR[R] A (add) GPR[R] = M[ GPR[X] + GPR[B] + DISP ] + GPR[R] BAL(branch and link) PC = GPR[X] + GPR[B] + DISP; GPR[R] = PC Half-Word Instructions: 8 4 4 |_______________|_______|_______| |_______________|_______|_______| | OP R X | Typical examples: BALR (branch and link) PC = GPR[X]; GPR[R] = PC
There are hunddreds of texts on programming this family of machines (The assembly language was BAL, the Basic Assembly Language, so many texts are catalogued under this and not under IBM 360 in librar catilogs). The purpose of the above list of instructions is not to provide anything like an exhaustive list! Rather, it is to illustrate the kinds of operations the the underlying hardware must execute for each instruction execution cycle.
During any cycle, the machine may fetch a 16-bit halfword instruction or a 32-bit full-word instruction. In executing this instruction, it may perform any of the following operations:
Not all instructions will involve all of these stages, but note that This list includes several time-consuming operations. Gathering operands from registers will be fairly fast, since it involves accessing a very small and very fast RAM used to implement the register file. If we ignore floating point operations and multiplication and division, the addition required to form an effective address is very likely to be just as as time consuming as the ALU operations required for instructions such as add and subtract.
The memory reference for operands is certain to take as long as an instruction fetch, and it is quite possible to imagine a system where the time to perform arithmetic operations is comparable to the time taken to access memory. This leads to the suggestion that this architecture could be pipelined with the following pipeline stages:
(It is important to note that this set of pipeline stages was derived from the instruction set of one computer, under one set of assumptions about the relative time taken for various operations. Other instruction sets and other assumptions about relative times lead naturally to other sets of operations!)
Given the above breakdown of an instruction set into pipeline stages, the first question that must be resolved is: What goes in the interstage registers?
_______ _______ |_______| |_______| IR NEXT PC
The first stage fetches an instruction, so the interstage register that it feeds must contain this instruction word, here, stored in the IR interstage register. For the IBM 360 architecture and many others, there are instructions that require knowledge of the address of the instruction following the instruction that was just fetched. In the case of the IBM 360, the BAL and BALR instructions need this so they can save a return address. Other computers need the same information for PC relative branches. Here, we store this in the NEXT PC interstage register.
_______ _______ _______ |___////| |_______| |_______| IR R EA
The gather operands stage collects the operands of the instruction from general registers and from the next PC interstage register, and feeds the addressing adder which produces an effective address. The details of what operand is loaded in the R and EA interstage registers are determined by the opcode.
Certain fields of the instruction register are no-longer needed at this point. The displacement and base register B have been used, and for all of the example instructions given here, there is no further need for the index register X, although there are many instructions that do need this.
_______ _______ _______ |___////| |_______| |_______| IR R OP
The memory reference may save a register to memory, or it may load a register from memory. In the former case, most of the work of the instruciton has been completed and the ALU stage is only needed for cleanup. In the latter case, the ALU will be used to combine the two register operand R with the memory operand OP to compute some result. The opcode field is still needed, as is the register number, so the result can be saved.
If we want to give more detail on a pipelined processor in one or two lectures, we must simplify the design! To do this, we will eliminate the double indexing of the IBM 360 architecture, and then we will begin a bottom-up design that will actually yield most of the instruciton set as a side-effect of the design process!
We begin by assuming a canonical 5-stage pipe; this is the 5-stage pipeline that follows naturally from an analysis of an instruction set for a 1 and 1/2 address machine that contains both memory to register and register to memory arithmetic operations; the complexity of this pipeline is more than sufficient to handle register to register operations.
______ _____ Instruction Fetch |______|----| | _ |____| | | Address Computation | |--|______| | | | | |____| | | Operand Fetch | |--|______|----| M | |R| |____| | | Operate | | |______| | | | | |____| | | Operand Store |_|--|______|----|_____|
Aside: Instruction sets may be classified in terms of how many memory addresses there are per instruction. A 3-address machine allows two source operands from anywhere in memory to be combined and the result stored anywhere in memory, using only one instruction. The DEC/Compaq VAX architecture is the best known of the post 1970 3-address machines. A 2-address machine allows an operand anywhere in memory to be combined with any variable in memory. The DEC PDP-11 and M680x0 families allowed this. A 1-address machine allows an operand anywhere in memory to be addressed; typically, such machines have an accumulator with which operands may be combined. Zero-address machines use a stack for all operand manipulation and addressing, with no address fields required in any instructions.Most stack machines contain several 1-address instructions for loading, storing and branching. A single accumulator is very limiting, so most one address machines contain multiple accumulators. Machine instructions on such a one-address machine therefore allow a rudimentary second address with a very limited address space, mainly, the identity of the accumulator to use. Such a machine is called a one-and-a-half address machine.
We'll assume that a word is 32 bits, that there are 16 general purpose registers R, word addressable memory, and that all instruction words are identical with the following format typical of 1 and 1/2 address machines:
8 4 4 16 _______________ _______ _______ _______________________________ |_______________|_______|_______|_______________________________| | opcode | r | x | disp |
We'll assume that we want to support instructions such as:
Finally, we'll assume that when x is 1111, this is a reference to R[15], the program counter, giving us PC relative addressing.
For the moment, we will completely ignore branches!
We can actually derive the contents of all of the interstage registers from the above description! We start with the instruction fetch to address computation interface. Clearly, this must contain the instruction that was fetched! In addition, because we are interested in relative addressing, we should pass onward the value of the PC that was used to fetch that instruction.
Instruction Fetch |____________________________________| | pc || op|r|x| disp | _|_______________||___|_|_|_______|_ | | Address Computation
The address computation stage is going to take information from register x and combine it with various sources to compute the effective address used by this instruction; therefore, the output of the address computation phase must include this effective address. The x and disp fields of the instruction register are not needed by later pipeline stages, and the pc value has been combined into the effective address, if it is needed, so these may be discarded. This gives us:
Address Computation |__________________________| | op|r|| ea | _|___|_||_______________|_ | | Operand Fetch
The operand fetch stage is going to use either or both of r and ea to get operands from memory and registers, so obviously, it must include these operands in its output. The operand store stage will need both r and ea itself, so we pass these on as well, giving:
Operand Fetch |____________________________________________________________| | op|r|| ea || opa || opb | _|___|_||_______________||_______________||_______________|_ | | Operate
The operate stage is going to contain the ALU that combines the operands as instructed by the opcode. Thus, its output must contain the result instead of the operands:
Operate |___________________________________________| | op|r|| ea || result | _|___|_||_______________||_______________|_ | | Operand Store
This completes all the interstage registers!
In outlining the structure of the pipeline stages, we must make some assumptions about the registers and memory. Here, we will assume that these are true multiport random access memories. For the CPU registers, this is a realistic assumption. For the main memory, this is unrealistic, but we won't let this deter us.
In addition, we will assume that there is some cost to a real multiport register or memory, so we will provide an input with each memory or register port, a signal saying "port needed". If this is not asserted during a pipeline cycle, that port is inactive.
We can now begin to build the pipeline stages, starting with the instruction fetch stage. In the abstract, this perorms the following logic:
repeat output_ir = M[pc] output_pc = pc pc = pc + 1 foreverIn the following, we'll assume that all registers are clocked simultaneously unless clock logic is shown! With this assumption, the fetch stage can be constructed as follows:
Instruction Fetch Stage True -----> need memory ---------o------------------------------> address to memory | __|__ -------------< data from memory | | +1 | | | |__ __| | | | | | ...|... | | | | | _______|_______ _______|_______ | | pc || op|r|x| disp | | |_______________||___|_|_|_______| |_________|The logic given above cannot perform branch instructions, but we will solve this later by adding a multiplexor at the location marked with a dotted line.
The address computation stage is only marginally more complex. This performs the following computation:
repeat output_op = input_ir.op output_r = input_ir.r if input_ir.x = 0000 need_regs = 0 output_ea = input_ir.disp + 0 else if input_ir.x = 1111 need_regs = 0 output_ea = input_ir.disp + input_pc else need_regs = 1 output_ea = input_ir.disp + R[input_x] endif forever
This reduces to the following logic:
Address Computation Stage _______________ _______________ | pc || op|r|x| disp | |_______________||___|_|_|_______| | | | | | -----|---------- | | | | --|------------- o----|----------> register number | | - ----------|----|----------< data from registers | | | | 0 | | | | __|__|__|__ __|__ | ___ | | \ 0 0 1 /--|=0000|-|-|nor|----> need register value | | \1__0__0/---|=1111|-|-|___| | | | | | | | ----------- | | |___| | | | + | | | |_____| _|__| _______|_______ | op|r|| ea | |___|_||_______________|
The operand fetch stage is quite simple:
repeat output_op = input_op output_r = input_r output_ea = input_ea if read_register( input_op ) output_opa = R[ input_r ] needs_register = true else output_opa = undefined needs_register = false endif if read_memory( input_op ) output_opb = M[ input_ea ] needs_memory = true else output_opb = input_ea needs_memory = false endif forever
This translates to the following logic:
Operand Fetch Stage _____ _______________ | op|r|| ea | |___|_||_______________| | | | | o---------|------------------------> register number | | | _________ -------< data from registers | | | | read | | o--|---------|--|registers|---|-------> need register value | | | |_________| | | | | | | | o--------------------o---> address to memory | | | | | -< data from memory | | | | _|_|_ | | | ______ --|--\0_1/ | | | | read | | | | o--|---------|--|memory|---o--|----|--> needs memory | | | |______| | | | | | | ----------- _|__| _______|_______ _______|_______ _______|_______ | op|r|| ea || opa || opb | |___|_||_______________||_______________||_______________|
The operate stage is equally straightforward:
repeat output_op = input_op output_r = input_r output_ea = input_ea output_result = ALU( aluop( input_op ), input_opa, input_opb ) forever
Giving us this hardware:
Operate Stage _____ _______________ _______________ _______________ | op|r|| ea || opa || opb | |___|_||_______________||_______________||_______________| | | | | | | | | - -------------- | | | _____ _|___|_ o--|---------|---|aluop|--| alu | | | | |_____| |_______| _|__| _______|_______ _______|_______ | op|r|| ea || result | |___|_||_______________||_______________|
Finally, we have the result store stage. This is complicated by the fact that it may store either to registers or to memory.
repeat if write_register( input_op ) R[ input_r ] = input_result endif if write_memory( input_op ) M[ input_ea ] = input_result endif forever
We can implement this as follows:
Operand Store Stage _____ _______________ _______________ | op|r|| ea || result | |___|_||_______________||_______________| | | | | | ---------|----------------|-------> register number | | ________ o-------> data to registers | | | write | | o------------|--|register|----|-------> write register | | |________| | | | | | ----------------|-------> address to memory | ______ -------> data to memory | | write| ---------------|memory|--------------> write memory |______|
The above design omitted all details of flow-of-control through a program! We have no way to perform a branch, and this is a serious oversight which we must eventually correct!
The design outlined above directly suggests an instruction set for this machine! In looking at the information derived from the opcode field of the instruction to control the various pipeline stages, we have the following:
_______________ _______ _______ _______________________________ |_______________|_______|_______|_______________________________| |opcode (8 bits)| r | x | disp (16 bits) | _______ _ _ _ _ |_______|_|_|_|_| | | | | | | aluop rr wr rm wmWe could consider the 4 read and write control bits to comprise something of an address mode field for the instruction. This leads to the following interpretation:
_______ _ _ _ _ |_______|_|_|_|_| | | | | | | aluop rr wr rm wm 0 0 0 0 -- * 0 0 0 1 -- * 0 0 1 0 -- immediate to register 0 0 1 1 -- * 0 1 0 0 -- * 0 1 0 1 -- * 0 1 1 0 -- memory to register 0 1 1 1 -- * 1 0 0 0 -- * 1 0 0 1 -- register to memory 1 0 1 0 -- register + immediate to register 1 0 1 1 -- * 1 1 0 0 -- * 1 1 0 1 -- register + memory to memory 1 1 1 0 -- register + memory to register 1 1 1 1 -- register + memory to both register and memory
The problem with this instruction encoding is that it leaves us with a number of useless addressing modes, for example, modes in which results are discarded, modes in which an immediate operand is also used as an address, and so on. These modes are starred above. In addition, depending on what operations the ALU provides, additional modes provided above may be redundant.
Nonetheless, the above tentative instruction set is actually a good start in the design of a practical computer! To complete this kind of design, what we will do is take some of the opcodes that serve no purpose with the register transfer design given and decode them for other purposes such as conditional branch instructions and other potentially interesting operations.