Homework 6 -- Midterm solutions

22C:122, Spring 1996

Question 1 Give an efficient macro to load a 32 bit constant.

The following shows some cute optimizations for short constants. The final case suffices!

	MACRO LI rd, const
	  IF (const <= 127) & (const >= -128)
	    LIS rd, const
          ELSEIF (const <= #7FFFFF) & (const >= -#800000)
            LIL rd, const
          ELSE
	    LIL rd, const >> 8
	    ORIS rd, const & #FF
          ENDIF
	ENDMAC

Question 2 Give a macro to branch to an arbitrary absolute address.

Here, we assume that rt is a designated temporary register. Again, a cute optimization is included, but the final case suffices!

	MACRO BR dst
	  IF (dst <= #7FFF) & (dst >= #8000)
	    JSR R0, R0, R0, dst
	  ELSE
	    LI rt,const
	    JSRR R0,rt
	  ENDIF
	ENDMAC

Question 3 Present macros for load and store operations on 16 bit word and 8 bit byte operands.

The most important use of byte and halfword addressing is in arrays of bytes and halfwords, and to handle these, we need to follow byte pointers in registers. This is easy for load:

	MACRO LOADRB dr, rx
	  LOADR dr, rx
	  EXTB  dr, dr, rx
	ENDMAC

	MACRO LOADRW dr, rx
	  LOADR dr, rx
	  EXTW  dr, dr, rx
	ENDMAC

This is harder for store, requiring 2 twmporary registers, rt and rtt, and requiring an awkward computation to clear the destination field of the destination word. A NOT instruction would help, and 3-operand stuff instructions would be even better:

	MACRO STORERB sr, rx
	  LOADR  rt, rx
	  LI     rtt, #FF
	  STUFFB rtt, rx	; make mask
	  SUB    rtt, R0, rtt   ; 2's comp (we want 1's comp!)
          SUBI   rtt, 1		; finish 1's comp of mask
	  AND    rt, rtt        ; turn off target byte
          STUFFB rtt, sr	; align byte to be stored
	  OR     rt, rtt        ; push it into cleared place
	  STORER rt, rx
	ENDMAC

	MACRO STORERW sr, rx
	  LOADR  rt, rx
	  LI     rtt, #FFFF
	  STUFFW rtt, rx	; make mask
	  SUB    rtt, R0, rtt   ; 2's comp (we want 1's comp!)
          SUBI   rtt, 1		; finish 1's comp of mask
	  AND    rt, rtt        ; turn off target byte
          STUFFW rtt, sr	; align byte to be stored
	  OR     rt, rtt        ; push it into cleared place
	  STORER rt, rx
	ENDMAC

Question 4 Find all the pipeline interlocks required in a natural pipelined implementation of this architecture.

Asume, for example, the following pipeline, with Rn being the interstage register after stage n:

Fetch:: R1.ir <- M[PC], PC <- PC + 2
Decode:: R2.ir <- R1.ir; R2.s1 <- R[R1.ir]; R2.s2 <- R[R1.ir]
Execute:: R3.ir <- R2.ir; R3.res <- ALU(R1,R2)
Memory:: R4.ir <- R3.ir; R4.res <- R3.res; optional memory reference cycle
Update:: optional R[R4.ir] <- R4.res

Most obvious problem: MAKE x; USE x -- The Decode stage of the user of register x needs the result of the immediately prior instruction, which will only be available after the Update stage. If the prior instruction is not a load instruction, result forwarding logic can get the needed data from the Execute stage. If the prior instruction is a load, the Fetch and Decode clocks must freeze until the Memory stage processes the Fetch and forwards the result. If forwarding logic is not used, Fetch and Decode clocks must freeze for at least 2 and possibly 3 clock cycles to let the Update stage store the result. NOOP instructions must be fed into the pipe below the decode stage during these freezes!
Note: If self-modifying code is not used, there are no problems with a store immediately followed by a load.
Less obvious problem: Some instructions set condition codes, perhaps in the execute phase. Others use them. If use of the condition codes is also handled in the execute phase, there are no interlocks involving them!
Another traditional problem: Branches may modify the PC no sooner than the Execute stage -- that's where we have access to the ALU to do the PC relative computation. The contents of the Fetch and Decode stages at this point are invalid and must be canceled, for example, by converting to NOOP instructions.
Another problem: Long format instructions in the Execute phase steal the contents of R1.ir to get their extended constant. This must be replaced by a NOOP as it flows down the pipe!

Question 5 What problems would this architecture pose for a superscalar pipelined implementation?

First, it is obvious that it should be implemented using a superscalar pipeline, since instructions will likely be fetched 32 bits at a time but many instructions are only 16 bits.

Both pipes are identical in structure so there is no pipe selection problem. Interlocks become more complex, though, and the logic to get the extended constants for 32 bit format instructions get more complex!

Branches to an odd address must fetch the full 32 bit word but replace the even half with a NOOP.

Condition code dependancies raise problems that were solved for the simple pipeline by stating that all checking and setting of condition codes is done in the ALU stage. These problems are somewhat reduced by the small fraction of the instruciton set that uses condition codes!

Question 6 This instruction set is missing instructions for housekeeping and interrupt response. There is some room in the instruction set for expansion, but it may not be obvious. List the opcodes that can be used for such applications.

Any instruction with rd=0000 that does not set condition codes or have other side effects is a no-op in this architecture. A single one of these should be reserved as the official NOOP, but the remainder are available for other functions. These are, at the very least: STORE LEA LIL LIS ORIS.

In addition, when rs2 is 0000 on ADD SUB OR XOR the instruction becomes equivalent to MOVER, and when rs2 is 0000 on AND, it becomes equivalent to many other ways of zeroing a register. Therefore, all of these combinations can be stolen for extensions to the instruction set!

Note that rs1=0000 is very useful for subtract (it becomes negate) but that it may also be a useless code for the other arithmetic operations, including the EXT instructions. EXT with rs2=0 is not useless! It is the way to truncate a word to half or byte size!

A shift count of zero is another example of a no-op or MOVER equivalent; This occurs when the rs field is zero on any of the shift instructions! These can all be used for other purposes.

ADDI and SUBI with rs=0000 are also either NOP or MOVER equivalents, depending on whether or not they set the condition codes. In either case, they are available for other uses.

The set of conditional branches offered by Bcc probably uses the rd field to select the condition, allowing 16 conditions. Given NZVC, the traditional condition codes, 8 are needed to test individual codes, and 6 more for common relational tests. One combination would typically mean branch always, and a final one would commonly mean branch never. This latter offers an 8 bit constant field for some other purpose!

Question 7 Given your solutions to all of the above, plus everything you have learned in this course, present a summary evaluation of this architecture.

This summary is too short: Not bad, but seriously flawed!

In more detail, this is a commonplace RISC style instruction set, with a load-store approach to arithmetic and word addressing. It bends the common rules of RISC design by having a variable instruction length, but we know how to handle this. It also may offend some because it still rests on condition codes, but in a simple pipelined implementation, these pose no problem, and in a superscalar implementation, it uses them sparingly enough that many interlocks can be avoided. There is room for expansion, and it is likely that expansion and small modifications to this instruction set could lead to a viable design.

Here are some details of what's wrong with it, as given:

Condition code setting is sloppy. Why does STORER set them? The operations that load or compute operand values in a register seem to be the only ones that need to set condition codes. Why do MOVER, shifts and immediate adds not set them? Furthermore, there is no obvious way to find out whether an instruction sets the condition codes, so decoding this is going to need a dedicated output of the instruciton decoding PLA.
Why is there no one's complement instruciton? It is needed! It seems too common to fake with a two's complement followed by a subtract.
Three operand STUFF instructions would greatly speed byte processing: STUFF a,b,c should stuff the low end of b into a as designated by the low bits of c. This would reduce the awful byte and half-word store macros given above to 3 instructions.
Support for multiplication by small constants is weak. Shifts suffice for multiplication by small powers of 2, but others are hard:
- by 3: ADD a,b,b; ADD a,a,b
- by 5: MOVER a,b; ASLI a,2; ADD a,b
- by 6: ADD a,b,b; ADD a,a,b; ASLI a,1
- by 7: MOVER a,b; ASLI a,3; SUB a,b
- by 9: MOVER a,b; ASLI a,3; ADD a,b
- by 10: MOVER a,b; ASLI a,2; ADD a,b; ASLI a,1
These aren't nearly as elegant at the HP PA-RISC sequences! A 3-operand shift instruction would make up most of the difference! SH a,b,c would move Rb to Ra shifted a constant c places.
Support for multiplication of variables is weak. Fast algorithms for this rest on doing a case statement indexed by the least significant bits of the multiplier. An efficient instruction sequence is needed to extract the 4 least significant bits, multiply by 4 bytes per word and do an indirect indexed jump. For the instruction set given, this is:
LIS rm,#F; AND rt,rm,rs; LSL rt,2; LOAD rt,rt,x,table; JSRR r0,rt.
This may be shortened a bit by loading register rm once and using it 8 times during the software multiply routine, but a bit field extract instruction that both masks and shifts would help, as would an indirect indexed jump. Of course, bitfield extract should be complemented by a bitfield stuff.
LSL and ASL are redundant. The only necessary difference between the two kinds of left shift can be reported in the condition codes -- C should report overflow on unsigned shifts, while O can report overflow for signed shifts.
MOVER seems unneeded! ADD a,b,R0 or ADD a,R0,b does the same.
ADDI and SUBI are probably overkill -- sign extending the immediate constant would allow increments in the range -8 to +7, and if 0000 is interpreted as +8, this would handle most of the burden.
It is probable that XOR and OR are given too much prominence in the instruction set. Two-operand forms of these might suffice, freeing the opcode space needed for the full-function STUFF operations.
Load boolean (0 or 1) from condition codes is another operation that is needed to support C or Pascal efficiently.