Grade distribution: mean = 19.9. (Text histogram omitted; scores ran from 0 to 30, with the lower half of the scale marked as the B range and the upper half as the A range.)
Here is a 1-instruction, 2-word solution:

    Word 1: OP = 10 load       PC = M[PC + 1]
    Word 2: <value to load into PC>

Here is a 3-instruction solution that assumes ZERO designates a register reserved to hold the value zero:

    Word 1: OP = 01 immediate  TEMP = ZERO + <high 16 bits>
    Word 2: OP = 00 operate    TEMP = TEMP << 16
    Word 3: OP = 01 immediate  PC = TEMP + <low 16 bits>

The former is probably faster, but there may be CPUs for which the second would be faster.

Sadly, many did second-rate work here, even though this was an essential component of the second study question.
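The arithmetic behind the 3-instruction sequence can be checked with a quick sketch (Python here; the 32-bit word and 16-bit immediate fields are from the problem, while the function and variable names are invented for illustration):

```python
# Sketch of the 3-instruction constant-load sequence, assuming a 32-bit
# word and 16-bit immediate fields as in the example architecture.
def load_constant(value):
    high = (value >> 16) & 0xFFFF      # <high 16 bits> immediate
    low = value & 0xFFFF               # <low 16 bits> immediate
    ZERO = 0                           # register reserved to hold zero
    temp = ZERO + high                 # Word 1: TEMP = ZERO + <high 16 bits>
    temp = (temp << 16) & 0xFFFFFFFF   # Word 2: TEMP = TEMP << 16
    return temp + low                  # Word 3: PC = TEMP + <low 16 bits>
```

Any 32-bit value round-trips through the two 16-bit halves, which is why the shrunken constant field in a later problem is such a loss.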
Here, we'll use the first solution given above, prefixing it with an instruction to save the return address:

    Word 1: OP = 01 immediate  RA = PC + 3
    Word 2: OP = 10 load       PC = M[PC + 1]
    Word 3: <value to load into PC>

Most students forgot to add 3 to the PC, thus leaving the called routine with an address that wasn't actually the return address. Given that this was based on the second study question, this is unfortunate.
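The addressing in the 3-word call sequence can be sketched as follows (Python, with memory modeled as a dict of word addresses; the layout assumes the sequence starts at the address in PC, so Word 3 sits at PC + 2 and the instruction after the sequence at PC + 3):

```python
# Sketch of the 3-word call sequence. The sequence occupies addresses
# pc, pc+1, and pc+2; pc+3 is the first word after it.
def call(memory, pc):
    ra = pc + 3              # Word 1: RA = PC + 3, the true return address
    target = memory[pc + 2]  # Word 2: PC = M[PC + 1]; with PC now at the
                             # load instruction (pc+1), that fetches Word 3
    return target, ra
```

Forgetting the "+ 3" would hand the called routine an address inside the call sequence itself, which is the error most students made.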
Stage 1: instruction fetch.
Stage 2: operand fetch (R2, R3, const)
Stage 3: ALU for operate, compute memory addr for others.
Stage 4: memory access for load/store
Stage 5: operand save for operate/immediate/load

This was, of course, a repeat of the first study question. Several students had similar answers, but many tried to make a 4-stage pipe, collapsing some pair of the above stages in a way that may not have been feasible.
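The five stages above imply a simple schedule for an ideal pipe: instruction i enters stage s on cycle i + s. A small sketch (Python; purely illustrative) makes the overlap concrete:

```python
# Sketch: which stage each instruction occupies on each clock cycle of
# an ideal 5-stage pipe (stage names from the solution above).
STAGES = ["instruction fetch", "operand fetch", "ALU / address",
          "memory access", "operand save"]

def schedule(n_instructions):
    """Return {(cycle, stage_index): instruction_number}, no stalls."""
    table = {}
    for i in range(n_instructions):
        for s in range(len(STAGES)):
            table[(i + s, s)] = i   # instruction i is in stage s at cycle i+s
    return table
```

Collapsing a pair of stages into a 4-stage pipe only works if the merged stage can still finish its combined work in one clock, which is why many of the 4-stage proposals may not have been feasible.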
The example in class had separate memory access paths for load and store; as a result, having a separate cache on each memory access port raises major cache coherence problems. These are eliminated by moving the cache used for operands downstream from the memory arbitration logic.
Part B: Why might you prefer the alternative on the left for the example architecture presented on this exam?
We have only one pipeline stage that accesses memory for operands; therefore, there is no cache coherency problem if we place a cache upstream from the arbitration logic.

Note that some students were worried about cache coherency between the instruction fetch cache and the operand cache(s). Most architectures largely ignore this issue, since most machines these days assume a read-only instruction stream most of the time.
The third study question should have prepared everyone for this question.
Pipelined architectures typically have mechanisms for invalidating the contents of a pipeline stage. Branch instructions and pipeline stalls caused by operand or resource conflicts all require this. We can use this mechanism to implement skips by having the operate instruction that wishes to skip invalidate its successor.

In contrast, adding conditional branches requires adding condition codes, a new resource that adds dependencies between instructions.
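The skip-by-invalidation idea can be sketched in a toy interpreter (Python; the instruction mnemonics and register names are hypothetical, chosen only to show the squash mechanism):

```python
# Sketch: a "skip" implemented by invalidating the successor instruction,
# the same mechanism a pipe already needs for branches and stalls.
def execute(program, regs):
    pc, squash = 0, False
    while pc < len(program):
        op = program[pc]
        pc += 1
        if squash:                 # predecessor requested a skip:
            squash = False         # invalidate this instruction, do nothing
            continue
        if op[0] == "skip_if_zero":
            squash = (regs[op[1]] == 0)   # mark the successor invalid
        elif op[0] == "inc":
            regs[op[1]] += 1
    return regs
```

No condition-code register is involved: the only state is the one-cycle squash signal the pipe already has, which is the heart of the argument for skips over conditional branches.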
Part B: Evaluated solely from the perspective of efficient instruction coding, which is likely to make more efficient use of program stream bandwidth? Your answer should be in the form of a persuasive qualitative argument.
The biggest problem with the conditional branch proposal is that it reduces the constant field to fewer than 16 bits. As a result, loading an arbitrary 32-bit constant using a sequence of 16-bit loads is no longer an option.

The op and sh fields, taken together, are unlikely to be fully exploited. Add and shift, for example, is very important for multiply-by-constant operations, but exclusive-or and shift will likely be very rare. Careful reallocation of these bits is, therefore, likely to be practical, if difficult.
The final study question, "What's missing from this architecture", should have prepared people for this!
There is the program counter used in the instruction fetch stage, but not all fetched instructions will be executed -- they may be skipped or cancelled by a branch.

If there is branch prediction logic, branches may be predicted incorrectly, so there is next-instruction-address computation in the operand save stage that is responsible for computing the flow of control that is to be finalized. Everything pushed into the pipe by the fetch stage must be viewed as speculative!
Most did pretty well here.
The alignment network cannot solve the problem of fetching a word that starts at an odd address.
Part B: Give a truth table for the combinational function f. Note that, for some combinations of inputs, some outputs may not be fully defined.
    mode byte | trouble topmux bottommux
    ----------+-------------------------
      0    0  |    0       1       1
      0    1  |    0       1       0
      1    0  |    0       0       0
      1    1  |    1       ?       ?
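The table can be captured directly as a small function (Python sketch; the don't-care outputs in the trouble row are modeled as None, an arbitrary choice for illustration):

```python
# Sketch of the combinational function f from the truth table above.
# Outputs that the table leaves undefined ("?") are returned as None.
def f(mode, byte):
    trouble = mode & byte            # only the mode=1, byte=1 row is trouble
    if trouble:
        return 1, None, None         # topmux/bottommux are don't-cares here
    topmux = 1 - mode                # topmux is 1 exactly when mode is 0
    bottommux = 1 if (mode == 0 and byte == 0) else 0
    return trouble, topmux, bottommux
```

Because two outputs are don't-cares in the trouble row, the hardware realization is free to emit whatever values make the logic cheapest.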
MA1: Read the first word containing the operand; align it so that the second byte is in the first position of the operand output.

MA2: Read the second word containing the operand; align it so that the first byte is in the second position of the operand output. The first byte of the operand output comes from MA1.
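The two-stage read can be sketched as follows (Python; memory is modeled as word-addressed two-byte tuples, and the non-aligned operand starts at an odd byte address -- all names are illustrative):

```python
# Sketch of the 2-stage non-aligned word read. Memory is a dict mapping
# word addresses to (byte0, byte1) tuples; addr is an odd byte address.
def nonaligned_read(memory, addr):
    word_a = memory[addr // 2]       # MA1: first word containing the operand
    word_b = memory[addr // 2 + 1]   # MA2: second word containing the operand
    # MA1's alignment net routes word_a's second byte to the first operand
    # position; MA2's routes word_b's first byte to the second position.
    return (word_a[1], word_b[0])
```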
Part B: For a single byte write to memory, how many pipeline stages are needed, what do they do, and how must the alignment net for each stage be set?
MA1: Read the word from memory containing the desired byte. The output of this stage will contain the word, with the desired byte changed to its new value.

MA2: Write the word from MA1 back to memory.
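This read-modify-write pair can be sketched as follows (Python; memory is again modeled as word-addressed two-byte tuples, with hypothetical names):

```python
# Sketch of the 2-stage single-byte write: a read-modify-write cycle
# against memory modeled as a dict of 2-byte words.
def byte_write(memory, addr, value):
    word = list(memory[addr // 2])   # MA1: read the word containing the byte
    word[addr % 2] = value           #      splice in the new byte value
    memory[addr // 2] = tuple(word)  # MA2: write the merged word back
```

The key cost observation: a byte write takes two memory cycles where an aligned word write takes one.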
Part C: For a non-aligned word write to memory, how many pipeline stages are needed, what do they do, and how must the alignment net for each stage be set?
MA1 and MA2 do a 1-byte write to memory of the first byte of the word into the first word of memory (see above for details).

MA3 and MA4 do a 1-byte write to memory of the second byte of the word into the second word of memory.
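Composing the two byte writes gives the 4-stage picture (Python sketch, same two-byte-word memory model as above; names are illustrative):

```python
# Sketch of the 4-stage non-aligned word write: two read-modify-write
# byte writes against memory modeled as 2-byte word tuples.
def byte_write(memory, addr, value):
    word = list(memory[addr // 2])   # read the containing word
    word[addr % 2] = value           # splice in the new byte
    memory[addr // 2] = tuple(word)  # write it back

def nonaligned_write(memory, addr, operand):
    byte_write(memory, addr, operand[0])       # MA1 + MA2: first byte
    byte_write(memory, addr + 1, operand[1])   # MA3 + MA4: second byte
```

Two bytes, two cycles each: 2 x 2 = 4 memory cycles, which is the multiplication the question was after.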
Sadly, even though most correctly deduced the need for two memory cycles to read the two bytes of a non-aligned word, and most correctly understood the need for two cycles to write a byte, few correctly multiplied 2 x 2 to get 4.
Part D: Given that the naive pipeline, with no consideration for non-aligned operands, had a single memory access stage that could read or write aligned or non-aligned bytes or words, how many stages do we really need?
If we confine ourselves to the approach outlined here, 4, based on the worst case from part C.
Stealing the 2-bit field from the const field of load and store operations has a negligible cost (most index constants are small), but it requires either that our memory access stage be able to stall for 3 extra memory cycles or that we add 3 extra memory access stages, as outlined in the previous problem. Either of these will significantly raise the cost of the pipelined processor, and each will lead to significant loss of pipeline performance. Programmers will appreciate the ease of use of the resulting instruction set, but may be misled into thinking that byte and halfword writes and non-aligned word and halfword operations are inexpensive when, in fact, they have a significant cost.

Supporting byte and halfword operations through register-to-register operations such as "byte extract" and "byte stuff" leads to a very low-cost pipelined CPU, but it makes extra work for assembly language programmers and compiler writers. It exposes the real costs of byte extraction and stuffing directly to the programmer and encourages programmers to seek ways of manipulating things like character strings in a word-parallel manner.
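The register-to-register alternative can be sketched as follows (Python; the operation names "byte extract" and "byte stuff" are from the discussion above, but their exact semantics here -- byte numbering from the least significant end of a 32-bit register -- are an assumption for illustration):

```python
# Sketch of register-to-register byte operations on a 32-bit register.
# Byte 0 is taken to be the least significant byte (an assumption).
def byte_extract(reg, n):
    """Pull byte n out of a 32-bit register value."""
    return (reg >> (8 * n)) & 0xFF

def byte_stuff(reg, n, value):
    """Return reg with byte n replaced by value."""
    mask = 0xFF << (8 * n)
    return (reg & ~mask & 0xFFFFFFFF) | ((value & 0xFF) << (8 * n))
```

Both operations are single-cycle ALU work, so the memory pipeline sees only full-word, aligned accesses; the cost of byte manipulation is paid visibly, in instructions, rather than invisibly, in extra memory stages.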
Afterword: It is essentially this exercise that led the designers of the DEC Alpha chip to omit support of operands smaller than 32 bits from their load and store instructions.