Grade distribution: mean = 19.9. (Text histogram omitted; scores ran from 0 to 30, with the lower half of the scale marked as the B range and the upper half as the A range.)
Here is a 1-instruction, 2-word solution:

    Word 1: OP = 10 load       PC = M[PC + 1]
    Word 2: <value to load into PC>

Here is a 3-instruction solution that assumes ZERO designates a register reserved to hold the value zero:

    Word 1: OP = 01 immediate  TEMP = ZERO + <high 16 bits>
    Word 2: OP = 00 operate    TEMP = TEMP << 16
    Word 3: OP = 01 immediate  PC = TEMP + <low 16 bits>

The former is probably faster, but there may be CPUs for which the second would be faster.

Sadly, many did second-rate work here, even though this was an essential component of the second study question.
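The arithmetic behind the 3-instruction sequence can be checked with a quick sketch (Python here; the 32-bit word and 16-bit immediate fields are from the problem, while the function and variable names are invented for illustration):

```python
# Sketch of the 3-instruction constant-load sequence, assuming a 32-bit
# word and 16-bit immediate fields as in the example architecture.
def load_constant(value):
    high = (value >> 16) & 0xFFFF      # <high 16 bits> immediate
    low = value & 0xFFFF               # <low 16 bits> immediate
    ZERO = 0                           # register reserved to hold zero
    temp = ZERO + high                 # Word 1: TEMP = ZERO + <high 16 bits>
    temp = (temp << 16) & 0xFFFFFFFF   # Word 2: TEMP = TEMP << 16
    return temp + low                  # Word 3: PC = TEMP + <low 16 bits>
```

Any 32-bit value round-trips through the two 16-bit halves, which is why the shrunken constant field in a later problem is such a loss.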
Here, we'll use the first solution given above, prefixing it with an instruction to save the return address:

    Word 1: OP = 01 immediate  RA = PC + 3
    Word 2: OP = 10 load       PC = M[PC + 1]
    Word 3: <value to load into PC>

Most students forgot to add 3 to the PC, thus leaving the called routine with an address that wasn't actually the return address. Given that this was based on the second study question, this is unfortunate.
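The addressing in the 3-word call sequence can be sketched as follows (Python, with memory modeled as a dict of word addresses; the layout assumes the sequence starts at the address in PC, so Word 3 sits at PC + 2 and the instruction after the sequence at PC + 3):

```python
# Sketch of the 3-word call sequence. The sequence occupies addresses
# pc, pc+1, and pc+2; pc+3 is the first word after it.
def call(memory, pc):
    ra = pc + 3              # Word 1: RA = PC + 3, the true return address
    target = memory[pc + 2]  # Word 2: PC = M[PC + 1]; with PC now at the
                             # load instruction (pc+1), that fetches Word 3
    return target, ra
```

Forgetting the "+ 3" would hand the called routine an address inside the call sequence itself, which is the error most students made.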
Stage 1: instruction fetch.
Stage 2: operand fetch (R2, R3, const)
Stage 3: ALU for operate, compute memory addr for others.
Stage 4: memory access for load/store
Stage 5: operand save for operate/immediate/load

This was, of course, a repeat of the first study question. Several students had similar answers, but many tried to make a 4-stage pipe, collapsing some pair of the above stages in a way that may not have been feasible.
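The five stages above imply a simple schedule for an ideal pipe: instruction i enters stage s on cycle i + s. A small sketch (Python; purely illustrative) makes the overlap concrete:

```python
# Sketch: which stage each instruction occupies on each clock cycle of
# an ideal 5-stage pipe (stage names from the solution above).
STAGES = ["instruction fetch", "operand fetch", "ALU / address",
          "memory access", "operand save"]

def schedule(n_instructions):
    """Return {(cycle, stage_index): instruction_number}, no stalls."""
    table = {}
    for i in range(n_instructions):
        for s in range(len(STAGES)):
            table[(i + s, s)] = i   # instruction i is in stage s at cycle i+s
    return table
```

Collapsing a pair of stages into a 4-stage pipe only works if the merged stage can still finish its combined work in one clock, which is why many of the 4-stage proposals may not have been feasible.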
The example in class had separate memory access paths for load and store; as a result, having a separate cache on each memory access port raises major cache coherence problems. These are eliminated by moving the cache used for operands downstream from the memory arbitration logic.
Part B: Why might you prefer the alternative on the left for the example architecture presented on this exam?
We have only one pipeline stage that accesses memory for operands; therefore, there is no cache coherency problem if we place a cache upstream from the arbitration logic.

Note that some students were worried about cache coherency between the instruction fetch cache and the operand cache(s). Most architectures largely ignore this issue, since most machines these days assume a read-only instruction stream most of the time.
The third study question should have prepared everyone for this question.
Pipelined architectures typically have mechanisms for invalidating the contents of a pipeline stage. Branch instructions and pipeline stalls caused by operand or resource conflicts all require this. We can use this mechanism to implement skips by having the operate instruction that wishes to skip invalidate its successor.

In contrast, adding conditional branches requires adding condition codes, a new resource that adds dependencies between instructions.
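The skip-by-invalidation idea can be sketched in a toy interpreter (Python; the instruction mnemonics and register names are hypothetical, chosen only to show the squash mechanism):

```python
# Sketch: a "skip" implemented by invalidating the successor instruction,
# the same mechanism a pipe already needs for branches and stalls.
def execute(program, regs):
    pc, squash = 0, False
    while pc < len(program):
        op = program[pc]
        pc += 1
        if squash:                 # predecessor requested a skip:
            squash = False         # invalidate this instruction, do nothing
            continue
        if op[0] == "skip_if_zero":
            squash = (regs[op[1]] == 0)   # mark the successor invalid
        elif op[0] == "inc":
            regs[op[1]] += 1
    return regs
```

No condition-code register is involved: the only state is the one-cycle squash signal the pipe already has, which is the heart of the argument for skips over conditional branches.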
Part B: Evaluated solely from the perspective of efficient instruction coding, which is likely to make more efficient use of program stream bandwidth? Your answer should be in the form of a persuasive qualitative argument.
The biggest problem with the conditional branch proposal is that it reduces the constant field to fewer than 16 bits. As a result, loading an arbitrary 32-bit constant using a sequence of 16-bit loads is no longer an option.

The op and sh fields, taken together, are unlikely to be fully exploited. Add and shift, for example, is very important for multiply-by-constant operations, but exclusive-or and shift will likely be very rare. Careful reallocation of these bits is, therefore, likely to be practical, if difficult.
The final study question, "What's missing from this architecture", should have prepared people for this!
There is the program counter used in the instruction fetch stage, but not all fetched instructions will be executed -- they may be skipped or cancelled by a branch.

If there is branch prediction logic, branches may be predicted incorrectly, so there is next-instruction-address computation in the operand save stage that is responsible for computing the flow of control that is to be finalized. Everything pushed into the pipe by the fetch stage must be viewed as speculative!
Most did pretty well here.
The alignment network cannot solve the problem of fetching a word that starts at an odd address.
Part B: Give a truth table for the combinational function f. Note that, for some combinations of inputs, some outputs may not be fully defined.
    mode byte | trouble topmux bottommux
    ----------+-------------------------
      0    0  |    0       1       1
      0    1  |    0       1       0
      1    0  |    0       0       0
      1    1  |    1       ?       ?
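The table can be captured directly as a small function (Python sketch; the don't-care outputs in the trouble row are modeled as None, an arbitrary choice for illustration):

```python
# Sketch of the combinational function f from the truth table above.
# Outputs that the table leaves undefined ("?") are returned as None.
def f(mode, byte):
    trouble = mode & byte            # only the mode=1, byte=1 row is trouble
    if trouble:
        return 1, None, None         # topmux/bottommux are don't-cares here
    topmux = 1 - mode                # topmux is 1 exactly when mode is 0
    bottommux = 1 if (mode == 0 and byte == 0) else 0
    return trouble, topmux, bottommux
```

Because two outputs are don't-cares in the trouble row, the hardware realization is free to emit whatever values make the logic cheapest.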
MA1: Read the first word containing the operand; align it so that the second byte is in the first position of the operand output.

MA2: Read the second word containing the operand; align it so that the first byte is in the second position of the operand output. The first byte of the operand output comes from MA1.
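The two-stage read can be sketched as follows (Python; memory is modeled as word-addressed two-byte tuples, and the non-aligned operand starts at an odd byte address -- all names are illustrative):

```python
# Sketch of the 2-stage non-aligned word read. Memory is a dict mapping
# word addresses to (byte0, byte1) tuples; addr is an odd byte address.
def nonaligned_read(memory, addr):
    word_a = memory[addr // 2]       # MA1: first word containing the operand
    word_b = memory[addr // 2 + 1]   # MA2: second word containing the operand
    # MA1's alignment net routes word_a's second byte to the first operand
    # position; MA2's routes word_b's first byte to the second position.
    return (word_a[1], word_b[0])
```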
Part B: For a single byte write to memory, how many pipeline stages are needed, what do they do, and how must the alignment net for each stage be set?
MA1: Read the word from memory containing the desired byte. The output of this stage will contain the word, with the desired byte changed to its new value.

MA2: Write the word from MA1 back to memory.
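This read-modify-write pair can be sketched as follows (Python; memory is again modeled as word-addressed two-byte tuples, with hypothetical names):

```python
# Sketch of the 2-stage single-byte write: a read-modify-write cycle
# against memory modeled as a dict of 2-byte words.
def byte_write(memory, addr, value):
    word = list(memory[addr // 2])   # MA1: read the word containing the byte
    word[addr % 2] = value           #      splice in the new byte value
    memory[addr // 2] = tuple(word)  # MA2: write the merged word back
```

The key cost observation: a byte write takes two memory cycles where an aligned word write takes one.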
Part C: For a non-aligned word write to memory, how many pipeline stages are needed, what do they do, and how must the alignment net for each stage be set?
MA1 and MA2 do a 1-byte write to memory of the first byte of the word into the first word of memory (see above for details).

MA3 and MA4 do a 1-byte write to memory of the second byte of the word into the second word of memory.
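Composing the two byte writes gives the 4-stage picture (Python sketch, same two-byte-word memory model as above; names are illustrative):

```python
# Sketch of the 4-stage non-aligned word write: two read-modify-write
# byte writes against memory modeled as 2-byte word tuples.
def byte_write(memory, addr, value):
    word = list(memory[addr // 2])   # read the containing word
    word[addr % 2] = value           # splice in the new byte
    memory[addr // 2] = tuple(word)  # write it back

def nonaligned_write(memory, addr, operand):
    byte_write(memory, addr, operand[0])       # MA1 + MA2: first byte
    byte_write(memory, addr + 1, operand[1])   # MA3 + MA4: second byte
```

Two bytes, two cycles each: 2 x 2 = 4 memory cycles, which is the multiplication the question was after.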
Sadly, even though most correctly deduced the need for two memory cycles to read the two bytes of a non-aligned word, and most correctly understood the need for two cycles to write a byte, few correctly multiplied 2 x 2 to get 4.
Part D: Given that the naive pipeline, with no consideration for non-aligned operands, had a single memory access stage that could read or write aligned or non-aligned bytes or words, how many stages do we really need?
If we confine ourselves to the approach outlined here, 4, based on the worst case from part C.
Stealing the 2-bit field from the const field of load and store operations has a negligible cost (most index constants are small), but it requires either that our memory access stage be able to stall for 3 extra memory cycles or that we add 3 extra memory access stages, as outlined in the previous problem. Either of these will significantly raise the cost of the pipelined processor, and each will lead to significant loss of pipeline performance. Programmers will appreciate the ease of use of the resulting instruction set, but may be misled into thinking that byte and halfword writes and non-aligned word and halfword operations are inexpensive when, in fact, they have a significant cost.

Supporting byte and halfword operations through register-to-register operations such as "byte extract" and "byte stuff" leads to a very low-cost pipelined CPU, but it makes extra work for assembly language programmers and compiler writers. It exposes the real costs of byte extraction and stuffing directly to the programmer and encourages programmers to seek ways of manipulating things like character strings in a word-parallel manner.
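The register-to-register alternative can be sketched as follows (Python; the operation names "byte extract" and "byte stuff" are from the discussion above, but their exact semantics here -- byte numbering from the least significant end of a 32-bit register -- are an assumption for illustration):

```python
# Sketch of register-to-register byte operations on a 32-bit register.
# Byte 0 is taken to be the least significant byte (an assumption).
def byte_extract(reg, n):
    """Pull byte n out of a 32-bit register value."""
    return (reg >> (8 * n)) & 0xFF

def byte_stuff(reg, n, value):
    """Return reg with byte n replaced by value."""
    mask = 0xFF << (8 * n)
    return (reg & ~mask & 0xFFFFFFFF) | ((value & 0xFF) << (8 * n))
```

Both operations are single-cycle ALU work, so the memory pipeline sees only full-word, aligned accesses; the cost of byte manipulation is paid visibly, in instructions, rather than invisibly, in extra memory stages.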
Afterword: It is essentially this exercise that led the designers of the DEC Alpha chip to omit support of operands smaller than 32 bits from their load and store instructions.