Homework 9 Solved

22C:122/55:132, Spring 2001

Douglas W. Jones
  1. Part A: Describe at least two features of the Minimal CISC that make it extremely difficult to pipeline, explaining, for each, exactly what it is about this feature that stands in the way of pipelined execution.

    First, the number of microinstructions per instruction varies considerably from one instruction to the next. The shortest instructions take 2 microcycles (fetch, execute) while the longest take 5 (POP requires the cycles labeled fetch, POP1, POP2, POP3 and JPOS3 in the paper).

    A closely related issue is the number of memory cycles per instruction. Each instruction begins with an optional instruction fetch, needed only when a new instruction word must be brought in, and makes, depending on the instruction, from 0 to 2 additional memory references. Thus, we have from 0 to 3 memory references per instruction.

    In both cases, this means that most pipeline stages would be idle during the execution of most instructions, and therefore, that the hardware resources of a pipelined implementation would largely go to waste.

    A completely different problem is posed by the packing of multiple instruction syllables into a word. Pipelined machines can unpack such syllables, but it adds complexity to the system.

    Yet another problem is posed by the fact that successive operations on a stack processor usually depend on each other, since most operations modify the stack pointer and/or the top of the stack; an instruction that pushes a result and the next instruction that consumes it, for example, cannot overlap without forwarding. Instruction dependencies are therefore very severe and will reduce the effectiveness of a pipelined processor.

    Part B: There is at least one feature of this machine that would allow at least a shallow pipelined implementation, and this could lead to a modest performance improvement. Identify the most likely candidate for this feature and explain.

    We could install an instruction prefetch mechanism that would wait for microcycles in which no memory access was taking place and use them to fetch the next instruction word into the instruction buffer. This requires that the instruction buffer be at least one syllable wider than the word size.
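    As a rough sketch, written in the register-transfer style used in problem 2 below, such a prefetch unit might behave as follows; the names ipc and ibuf are invented for this illustration and are not taken from the paper:

        registers ipc,   -- prefetch program counter
                  ibuf;  -- instruction buffer, at least one syllable
                         -- wider than the word size

        repeat
            if memory is otherwise idle this microcycle
               and ibuf has room for a full word
            then
                ibuf = ibuf with M[ipc] appended;  -- steal the idle cycle
                ipc = ipc + 1;
            endif
        forever

    The control unit would then take its instruction syllables from ibuf rather than directly from memory, and any branch would have to discard the contents of ibuf and reload ipc.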

  2. Part A: Give pseudocode for a pipelined version of the Ultimate RISC CPU.
        registers pc,         -- local to instruction fetch
                  src, dst,   -- from instruction fetch to operand fetch
                  dst', temp; -- from operand fetch to operand store

        repeat
            do
                src = M[pc];    \
            in parallel with     \
                dst = M[pc+1];    > instruction fetch
            in parallel with     /
                pc = pc + 2;    /
            in parallel with
                temp = M[src];  \
            in parallel with     >  operand fetch
                dst' = dst;     /
            in parallel with
                M[dst'] = temp; >   operand store
            enddo
        forever

    Part B: How many parallel busses providing access to memory did you assume in part A? Which of these are read-only? Which are read-write? Which of them are used for access to RAM but never for access to I/O or functional units?

    4 busses were assumed:
    1. src = M[pc] -- read-only instruction fetch, part 1.
    2. dst = M[pc+1] -- read-only instruction fetch, part 2.
    3. temp = M[src] -- read-only operand fetch.
    4. M[dst'] = temp -- write-only operand store.
    Of these, the first two will only be used to fetch instructions, so these busses need not have connections to any non-memory resources. The final two, however, must connect to functional units, memory, and I/O device registers.

    Part C: How many operand delay slots does your pipelined IEU introduce?

    In the sequence of instructions A, B, C, B fetches its source operand in the same memory cycle in which A stores its destination operand, so (unless we add forwarding hardware) B cannot see the result stored by A, while C can. Therefore, we must deal with one operand delay slot.
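    The timing implied by the pseudocode of part A, assuming one memory cycle per pipeline stage, makes this concrete:

        cycle       1         2         3         4         5
        A         ifetch    opfetch   opstore
        B                   ifetch    opfetch   opstore
        C                             ifetch    opfetch   opstore

    A's operand store and B's operand fetch both fall in cycle 3, so B sees the old contents of any location A writes; C's operand fetch falls in cycle 4, after the store.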

    Part D: How many branch delay slots does your pipelined IEU introduce?

    In the sequence A, B, C, D, C is fetched during the same clock cycle in which the result of A is stored. Therefore, if A is a branch, both B and C will have been fetched before the new program counter value takes effect. We have 2 branch delay slots.
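    For example, assuming the usual Ultimate RISC convention that the program counter is itself addressable as a memory location, so that a branch is just a move whose destination is the program counter's address, and writing each instruction as MOVE src,dst (meaning M[dst] = M[src]), a taken branch looks like this; the labels here are invented for illustration:

        MOVE  TARGET, PC   -- A: branch; the program counter gets M[TARGET]
        MOVE  X, Y         -- B: first branch delay slot, executes anyway
        MOVE  W, Z         -- C: second branch delay slot, executes anyway
                           -- only now does fetching continue at the branch target

    If no useful work can be scheduled into these two slots, they must be filled with harmless moves, for example a move from a scratch location to itself.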