Mean = 21.9 [histogram of exam scores, ranging from 14 to 29]
Part B: If a separate cache were put on each memory port from the memory stage of the pipeline, this would introduce cache coherency problems. Solving these would add significantly to the complexity of the chip.
Half the class had problems here, going off on tangents, being distracted by locality issues, and other things.

Part C: The designs suggested here are not appropriate for a bus-based multiprocessor using snooping cache technology because the memory management unit is placed between the on-chip caches and the external memory bus. As a result, the caches all hold virtual addresses and not physical addresses, and this makes snooping difficult at best, and impossible without complex support from the memory management unit.
One third had problems here, but most had the right idea and then added confusion or distraction.
Part D: Context switching is complicated on this machine because the caches hold virtual addresses. If the contents of the memory management unit's mapping registers are changed, the information in the cache becomes obsolete and must be ignored! A straightforward solution to this problem is to add a mechanism to invalidate the contents of the caches whenever a change in the memory management state of the machine occurs.
Only one did well here; many suggested reversing the order of the memory management unit and caches, but this requires either a complex multiport memory management unit or it makes the memory management unit into a severe bottleneck. In either case, the cost is likely to be much higher than a cache invalidation mechanism.
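The invalidate-on-remap rule can be sketched in a few lines of Python; this is a hypothetical model (the class names, the single base-register MMU, and the dictionary-backed memory are all assumptions for illustration, not part of the exam architecture):

```python
# Toy model: a virtually-addressed cache that must be invalidated
# whenever the memory management unit's mapping registers change.

class VirtualCache:
    def __init__(self):
        self.lines = {}                 # virtual address -> cached word

    def read(self, vaddr, memory, mmu):
        if vaddr not in self.lines:     # miss: translate and fill
            self.lines[vaddr] = memory[mmu.translate(vaddr)]
        return self.lines[vaddr]

    def invalidate(self):
        self.lines.clear()

class MMU:
    def __init__(self, base):
        self.base = base                # one simple base-register mapping
        self.caches = []                # caches to invalidate on a remap

    def translate(self, vaddr):
        return self.base + vaddr

    def set_mapping(self, base):        # context switch: remap and flush
        self.base = base
        for cache in self.caches:
            cache.invalidate()
```

Without the invalidate() call in set_mapping, a read after a context switch would return the stale word cached under the old mapping.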
Part E: The most obvious place to put the alignment networks in the system presented is between the pipeline stage and the bus arbitrators. Thus, all bus arbitration and all caches deal with a single fixed word size. There is no need for an alignment network on the instruction port because all instruction fetches can fetch an aligned 64 bit doubleword. If a branch is made to the odd half of such a doubleword, the other half can simply be invalidated.
One third did well here, one third proposed putting the alignment network after the caches and the remainder had difficult to classify answers.
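The aligned-doubleword fetch with invalidation of the odd half can be sketched as follows (a toy model; the word-addressed dictionary memory and the function name are assumptions):

```python
def fetch_doubleword(memory, pc):
    """Fetch the aligned 64-bit doubleword containing pc; when pc lands
    in the odd (upper) 32-bit half, the even half is marked invalid,
    since it was never meant to execute."""
    base = pc & ~7                 # align down to an 8-byte boundary
    low, high = memory[base], memory[base + 4]
    if pc & 4:                     # branch into the odd half
        return (None, high)        # even half invalidated
    return (low, high)
```

No alignment network is needed here: the fetch itself is always to an aligned address, and misalignment is handled purely by invalidation.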
Part B: Assuming that all processes are equally likely to be scheduled and that there is no priority, added register sets can be assigned on an LRU or round robin basis. If the number of processes only slightly exceeds the number of register sets, and if scheduling is random, there can be a significant performance benefit. If scheduling is round-robin or if the number of processes is much larger than the number of register sets, the benefit of multiple register sets will be minimal.
Just over half did well here; of the remainder, many suggested policies but failed to discuss the performance in any significant way.
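The claimed difference between random and round-robin scheduling can be illustrated with a small simulation (a sketch under assumed parameters: 5 processes, 4 register sets, 1000 context switches, LRU assignment):

```python
import random
from collections import OrderedDict

def misses(schedule, n_sets):
    """Count context switches that need a register save/restore,
    assigning the n_sets register sets to processes on an LRU basis."""
    lru = OrderedDict()              # process -> register set, in LRU order
    count = 0
    for proc in schedule:
        if proc in lru:
            lru.move_to_end(proc)    # already holds a register set
        else:
            count += 1               # must spill a register set
            if len(lru) == n_sets:
                lru.popitem(last=False)
            lru[proc] = True
    return count

random.seed(1)
procs = list(range(5))               # 5 processes, 4 register sets
rand_sched = [random.choice(procs) for _ in range(1000)]
rr_sched = [procs[i % 5] for i in range(1000)]
print(misses(rand_sched, 4), misses(rr_sched, 4))
```

With round-robin scheduling and one more process than register sets, LRU always evicts exactly the set that will be needed next, so every one of the 1000 switches requires a save/restore; with random scheduling most switches hit a resident register set.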
Two thirds did well here, suggesting alternatives such as those given above. The remainder had hard to classify errors.
Part B The optimum size of the register set will decrease as the lifetime of the average process or thread decreases. In the extreme, if each thread only performs one two-operand arithmetic instruction before termination, one accumulator is sufficient.
Half did well here, and of the rest, most received no credit.
 1  A[0] = R[X]        0000xxxx
 2  Op[0] = M[A[0]]    011000ss   -- ss, the size, is not given
 3  A[1] = R[Y]        0001yyyy
 4  A[1] = A[1] + C    001101ss   -- ss = 16 bit, encoding not given
                       cccccccc
                       cccccccc
 5  Op[1] = M[A[1]]    011101ss   -- ss, the size, is not given
 6  Op[0] = ADD        1000oooo   -- oooo, the ADD encoding, is not given
 7  R[z] = Op[0]       1010zzzz

About half had significant programming errors.
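The net effect of steps 1-7 can be checked with a small interpreter for the register-transfer notation (a sketch; the Python dictionaries standing in for the register file and memory, and the concrete test values, are assumptions):

```python
def add_macro(R, M, x, y, c, z):
    """Steps 1-7 above: R[z] = M[R[x]] + M[R[y] + c]."""
    A = [0, 0]
    Op = [0, 0]
    A[0] = R[x]             # 1
    Op[0] = M[A[0]]         # 2
    A[1] = R[y]             # 3
    A[1] = A[1] + c         # 4
    Op[1] = M[A[1]]         # 5
    Op[0] = Op[0] + Op[1]   # 6  Op[0] = ADD
    R[z] = Op[0]            # 7
    return R

R = {0: 100, 1: 104, 2: 0}      # made-up register contents
M = {100: 7, 112: 35}           # made-up memory contents
add_macro(R, M, x=0, y=1, c=8, z=2)
print(R[2])   # 7 + 35 = 42
```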
Part B: Here is one way to code CALL X:
 8  A[0] = R[14]       00001110
 9  A[0] = A[0] - 4    001001ss   -- ss = 8 bit, encoding not given
                       11111100   -- = -4
10  R[14] = A[0]       11001110
11  Op[0] = R[15]      01001111
12  Op[1] = 10         001111ss   -- ss = 8 bit, encoding not given
                       00001010
13  Op[0] = ADD        1000oooo   -- oooo, the ADD encoding, is not given
14  M[A[0]] = Op[0]    011010ss   -- ss = 32 bit, encoding not given
15  Op[1] = X          001111ss   -- ss = 32 bit, encoding not given
                       xxxxxxxx
                       xxxxxxxx
                       xxxxxxxx
                       xxxxxxxx
16  R[15] = Op[1]      10111111

One in five did well here; about half forgot that R[15] is the current PC and, until the very last instruction of the macro, it doesn't point to the return address. Therefore, it must be adjusted (the ADD instruction above) to skip over the rest of the call. Another common error was to forget to pre-decrement the stack pointer.
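The CALL macro's net effect — pre-decrement the stack pointer, store PC + 10 as the return address, jump to X — can likewise be sketched as an interpreter (the concrete register values below are made-up test data):

```python
def call_macro(R, M, X):
    """Steps 8-16 above. R[15] is the PC, R[14] the stack pointer;
    10 is the length in bytes of the rest of the macro, so PC + 10
    is the return address."""
    A = [0, 0]
    Op = [0, 0]
    A[0] = R[14]            # 8
    A[0] = A[0] - 4         # 9   pre-decrement the stack pointer
    R[14] = A[0]            # 10
    Op[0] = R[15]           # 11  PC still points into the macro...
    Op[1] = 10              # 12
    Op[0] = Op[0] + Op[1]   # 13  ...so adjust it past the macro
    M[A[0]] = Op[0]         # 14  push the return address
    Op[1] = X               # 15
    R[15] = Op[1]           # 16  jump
```

This makes the two common errors visible: omitting step 13 stores a return address inside the macro, and omitting step 9 overwrites the word above the stack.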
Part C: Given that the two instruction sequences above are to be executed in sequence, they may be interleaved as (3 1 4 2 5 8 6 9 7 11 10 12 15 13 14 16). In this interleaving, no instruction depends on the result of the immediately preceding instruction, so if a pipeline can be built with only 1 visible operand delay slot, this code can be pipelined with no interlock problems! The instruction set itself admits a 3 or 5 stage pipeline -- ALU and memory operations can be done in the same stage, since no instruction uses both, and with only 2 input registers to select between, an operand gather stage may be unneeded.
Half the class failed to make any use of the example to support their contentions about pipelining this instruction set. One in three did well.
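The interleaving argument can be checked mechanically: record each instruction's read and write sets, then flag any adjacent pair where the second reads a register written by the first. The sketch below does this for instructions 1-9 only (def/use sets transcribed from the listings above; extending the table to all 16 instructions is straightforward):

```python
# Def/use sets for instructions 1-9 of the ADD and CALL macros.
# Each entry: (registers read, registers written).
INSTRS = {
    1: ({"R[X]"}, {"A[0]"}),             # A[0] = R[X]
    2: ({"A[0]"}, {"Op[0]"}),            # Op[0] = M[A[0]]
    3: ({"R[Y]"}, {"A[1]"}),             # A[1] = R[Y]
    4: ({"A[1]"}, {"A[1]"}),             # A[1] = A[1] + C
    5: ({"A[1]"}, {"Op[1]"}),            # Op[1] = M[A[1]]
    6: ({"Op[0]", "Op[1]"}, {"Op[0]"}),  # Op[0] = ADD
    7: ({"Op[0]"}, {"R[z]"}),            # R[z] = Op[0]
    8: ({"R[14]"}, {"A[0]"}),            # A[0] = R[14]
    9: ({"A[0]"}, {"A[0]"}),             # A[0] = A[0] - 4
}

def adjacent_hazards(order):
    """Return pairs (i, j) where j reads a register written by the
    immediately preceding instruction i -- exactly the dependences
    that a single visible operand delay slot cannot hide."""
    hazards = []
    for i, j in zip(order, order[1:]):
        if INSTRS[i][1] & INSTRS[j][0]:
            hazards.append((i, j))
    return hazards

print(adjacent_hazards([1, 2, 3, 4, 5, 6, 7]))      # straight-line: many
print(adjacent_hazards([3, 1, 4, 2, 5, 8, 6, 9, 7]))  # interleaved: none
```

The straight-line ADD macro has a hazard on every dependent pair, while the first nine instructions of the interleaving have none.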
Part D: The additional address register removes a severe bottleneck, greatly improving the ability of this architecture to allow interleaving of different instruction sequences in order to remove conflicts from the instruction stream!
Two in three did well here; a few dodged the question or made other errors.
One in five did well here. One in three completely missed this, while others claimed that there was no way to fetch instructions because the address on the CPU bus was only 8 bits wide. The alternative path directly to memory (through a MUX) from the PC eliminates this problem!
Part B: Given that there are 16 ALU operations (evoked by write) and 16 possible condition code checks (evoked by read), an 8-bit address leaves 4 bits available to specify an accumulator on an ALU reference instruction. But! There must also be some way to specify the MAR, MDR and PC, so only 15 accumulators would work (the address that could have referred to the 16th would be sufficient to allow addressing of the other registers).
Half the class gave the minimum number of accumulators that would allow comfortable programming, but the question clearly asked for the maximum (how many could be added, not how many should be added). One in five did well, but typically forgot to allow addressing of the other registers. Almost half gave incomprehensible or very confused explanations of their answers.
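The address-space arithmetic behind the answer of 15 accumulators, written out (a trivial sketch of the counting argument above; the variable names are mine):

```python
# An 8-bit device address with 4 bits consumed by the ALU-operation
# select (on write) or condition-code select (on read) leaves 4 bits
# to name a register: 16 addresses. MAR, MDR and PC must be reachable
# too, so one of those addresses is given up to escape to them,
# leaving 15 usable accumulators.
addr_bits = 8
op_select_bits = 4                      # 16 ALU ops / 16 cc checks
reg_addresses = 2 ** (addr_bits - op_select_bits)
accumulators = reg_addresses - 1        # one address escapes to MAR/MDR/PC
print(accumulators)   # 15
```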
Part C: This architecture could be easily pipelined, with an instruction fetch, operand fetch and operand store stage in the pipe, just as on the minimal ultimate RISC. Memory arbitration (or a 2-port memory) would be required only when an instruction referenced the memory data register. In other cases, instruction fetch could bypass the CPU bus. In any case, the CPU bus and all registers directly attached to it should allow parallel read and write.
One in five did well here, an equal number gave answers with no connection to the problem, and an equal number did fairly well, but gave answers indicating a misunderstanding of the memory addressing restrictions of the architecture.