First, the number of microinstructions per instruction varies considerably from one instruction to the next. The shortest instructions take 2 microcycles (fetch, execute), while the longest take 5 (POP requires the cycles labeled fetch, POP1, POP2, POP3 and JPOS3 in the paper). A closely related issue is the number of memory cycles per instruction. Each instruction begins with an optional fetch, used only when a new instruction must be fetched, followed by, depending on the instruction, from 0 to 2 additional memory references. Thus, we have from 0 to 3 memory references per instruction.
In both cases, this means that most pipeline stages would be idle during the execution of most instructions, and therefore, that the hardware resources of a pipelined implementation would largely go to waste.
A completely different set of problems is posed by the packing of multiple instruction syllables into a word. Pipelined machines can unpack such syllables, but doing so adds complexity to the system.
Yet another problem is posed by the fact that successive operations on a stack processor usually depend on each other, since most operations modify the stack pointer and/or the stack top. Therefore, instruction dependencies are very severe and will reduce the effectiveness of a pipelined processor.
Part B: There is at least one feature of this machine that would allow at least a shallow pipelined implementation, and this could lead to a modest performance improvement. Identify the most likely candidate for this feature and explain.
We could install an instruction prefetch mechanism that would wait for microcycles in which no memory access is taking place and use them to fetch the next instruction into an instruction buffer. This requires that the instruction buffer be at least one syllable wider than the word size.
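A minimal Python sketch of such a prefetch unit, under assumed conventions (two syllables packed per word, a 3-syllable buffer so a refill of a whole word can complete while the decoder is partway through the previous word; all names are hypothetical):

```python
class PrefetchUnit:
    """Sketch: refills an instruction-syllable buffer on microcycles
    when the memory port would otherwise be idle."""
    def __init__(self, mem, pc):
        self.mem, self.pc = mem, pc
        self.buf = []                    # unpacked syllables, oldest first

    def cycle(self, mem_port_idle):
        # Refill only when the port is free and a whole word (2 syllables)
        # fits; capacity of 3 syllables is one syllable more than a word.
        if mem_port_idle and len(self.buf) <= 1:
            word = self.mem[self.pc]
            self.buf += [word >> 8, word & 0xFF]   # assumed 2 syllables/word
            self.pc += 1

    def next_syllable(self):
        # The decoder pulls one syllable at a time; None signals a stall.
        return self.buf.pop(0) if self.buf else None

# Usage: syllables come out in order even when some cycles have a busy port.
u = PrefetchUnit([0x0102, 0x0304], 0)
u.cycle(True)    # idle cycle: word 0 fetched and unpacked
a = u.next_syllable()
u.cycle(False)   # memory busy: no refill, buffer still holds a syllable
b = u.next_syllable()
u.cycle(True)    # buffer has room again: word 1 fetched
c = u.next_syllable()
```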
registers
    pc,          -- local to instruction fetch
    src, dst,    -- from instruction fetch to operand fetch
    dst', temp;  -- from operand fetch to operand store

repeat do
    src = M[pc];        \
    dst = M[pc+1];       >  instruction fetch
    pc = pc + 2;        /
  in parallel with
    temp = M[src];      \
    dst' = dst;          >  operand fetch
  in parallel with
    M[dst'] = temp;      >  operand store
enddo forever
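This overlap can be checked with a small single-cycle simulation, sketched here in Python under assumed conventions: every stage reads the pipeline registers latched on the previous cycle, memory reads see values from before the current cycle's write, and the program is padded with harmless scratch-to-scratch moves behind the last real instruction.

```python
def run(mem, pc, cycles):
    """Single-cycle model of the 3-stage pipeline above: all three
    stages fire each cycle on previously latched register values."""
    src = dst = temp = dst2 = 0      # dst2 plays the role of dst'
    valid = 0                        # how many stages hold real data yet
    for _ in range(cycles):
        f_src, f_dst = mem[pc], mem[pc + 1]   # instruction fetch
        o_temp, o_dst = mem[src], dst         # operand fetch
        if valid >= 2:
            mem[dst2] = temp                  # operand store
        src, dst = f_src, f_dst               # latch pipeline registers
        temp, dst2 = o_temp, o_dst
        pc += 2
        valid = min(valid + 1, 2)
    return mem

# Program: M[11] = M[10]; M[12] = M[10]; then scratch-to-scratch
# padding (9 -> 9) so the trailing stages do no harm.
mem = [10, 11, 10, 12, 9, 9, 9, 9, 0, 0, 42, 0, 0]
run(mem, 0, 4)    # afterwards mem[11] and mem[12] both hold 42
```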
Part B: How many parallel busses providing access to memory did you assume in Part A? Which of these are read-only? Which are read-write? Which of them are used for access to RAM but never for access to I/O or functional units?
4 busses were assumed:

- src = M[pc] -- read-only instruction fetch, part 1.
- dst = M[pc+1] -- read-only instruction fetch, part 2.
- temp = M[src] -- read-only operand fetch.
- M[dst'] = temp -- write-only operand store.

Of these, the first two are used only to fetch instructions, so these busses need no connections to any non-memory resources. The final two, however, must connect to functional units, memory and I/O device registers.
Part C: How many operand delay slots does your pipelined IEU introduce?
In the sequence of instructions A, B, C, B will fetch its source operand in the same memory cycle in which A stores its destination operand, so (unless we add forwarding hardware) B cannot see the result stored by A, but C can. Therefore, we must deal with one operand delay slot.
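The delay slot can be demonstrated with a single-cycle Python model of the three stages (assumed names; memory reads within a cycle are assumed to see values from before that cycle's write). B reads the location A writes and sees the stale value, while C sees A's result:

```python
def run(mem, pc, cycles):
    """Single-cycle model: fetch, operand fetch, and store all fire
    each cycle on the pipeline registers latched the cycle before."""
    src = dst = temp = dst2 = 0
    valid = 0
    for _ in range(cycles):
        f_src, f_dst = mem[pc], mem[pc + 1]   # instruction fetch
        o_temp, o_dst = mem[src], dst         # operand fetch
        if valid >= 2:
            mem[dst2] = temp                  # operand store
        src, dst, temp, dst2 = f_src, f_dst, o_temp, o_dst
        pc += 2
        valid = min(valid + 1, 2)

# A: M[11] = M[10];  B: M[12] = M[11];  C: M[13] = M[11];
# padded with scratch-to-scratch moves (9 -> 9).
mem = [10, 11, 11, 12, 11, 13, 9, 9, 9, 9, 42, 7, 0, 0]
run(mem, 0, 5)
# B fetched M[11] in the cycle A stored it, so B got the stale 7;
# C, one slot later, got A's 42.
```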
Part D: How many branch delay slots does your pipelined IEU introduce?
In the sequence A, B, C, D, C is fetched during the same clock cycle that stores the result of A. Therefore, if A is a branch, both B and C will be fetched before the new program counter is used. We have 2 branch delay slots.