21. Coprocessors

Part of the 22C:122/55:132 Lecture Notes for Spring 2004
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Motivation

Suppose we have some slow computation, for example, multiplication or division done by the shift and add or shift and subtract algorithm.
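
To make the cost of such an operation concrete, here is a minimal Python sketch of textbook shift-and-add multiplication for unsigned n-bit words. The function name and the default word size are assumptions made for illustration, and the sketch adds before it shifts, keeping the carry out of the adder. The point is simply that the loop body runs once per bit of the word.

```python
def shift_add_multiply(multiplier: int, multiplicand: int, n: int = 16) -> int:
    """Multiply two unsigned n-bit values, consuming one multiplier bit per step."""
    mask = (1 << n) - 1
    ac = 0                     # high half of the running product
    mq = multiplier & mask     # multiplier; ends up as the low half of the product
    m = multiplicand & mask
    for _ in range(n):         # n steps for an n-bit word
        if mq & 1:             # odd MQ: add the multiplicand into the high half
            ac += m            # ac may briefly need n+1 bits (the carry)
        # shift ac and mq right by one place as a single long register
        mq = (mq >> 1) | ((ac & 1) << (n - 1))
        ac >>= 1
    return (ac << n) | mq      # the 2n-bit product


print(shift_add_multiply(13, 11, n=8))
```

With n = 16, every multiply costs 16 passes through this loop, which is the delay the rest of this section tries to hide.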

One way to incorporate this into a computer is to build it directly into the CPU, so that the execute portion of the fetch-execute cycle is expanded into many clock cycles when the current instruction requires such a time-consuming operation. This is how things have typically been done on classic machines, from the days of the Burks, Goldstine and von Neumann paper to low-performance microprocessors of the present day.

This low-performance alternative relies on a single sequentially operated control unit, typically microcoded, with the control structure of the microcode looking something like the following:

	repeat
	   -- fetch
	   IR = memory[PC]; PC = PC + 1; -- register transfers in parallel

	   -- decode
	   case IR.opcode of
	      load:
		 AC = memory[IR.address];
	      add:
		 AC = AC + memory[IR.address];
	      branch:
		 PC = IR.address;
	      times:
		 repeat n times -- multiplication loop for an n-bit word
		    -- notation:  a|b means concatenate a and b, n-bits each
		    if odd MQ
		       AC|MQ = AC|MQ >> 1;
		       AC = AC + memory[IR.address];
		       -- the above 2 lines can be done in one clock cycle
		    else
		       AC|MQ = AC|MQ >> 1;
		    endif
		 endloop
	      ...
	   end case
	endloop;

There are two problems with this model: first, it requires a complex control unit, and second, it seems unnecessarily slow. It is reasonable to ask whether there is some way to continue fetching and executing instructions while the slow iterative multiply operation is being done. This leads to the idea of creating a special secondary processor, a coprocessor, to perform such time-consuming operations.

The easiest way to add a coprocessor is as a peripheral device. Exactly this approach was commonplace with minicomputers and microcomputers of the 1970s, and it is the approach we will discuss here. This is not the order in which these ideas were actually discovered! If we were to follow a historical track, we would first discuss the CDC 6600 from 1965, with its use of functional units within the CPU, and then we would look at the DEC PDP-11/45 floating point coprocessor before, finally, we looked at simple coprocessors such as we are discussing here.

Coprocessor Parallelism

A coprocessor is a system component that runs in parallel with the CPU and has its own control unit, so that it may perform computations while the CPU is operating. Coprocessors may be as complex as entire secondary CPUs (for example, in systems with DSP or digital signal processor coprocessors), or they may be as simple as, for example, direct memory access (DMA) processors used for input or output.

Here, we are interested, specifically, in arithmetic coprocessors such as floating-point units.

The simplest such units are attached to the system bus as if they were peripheral devices. Consider, for example, a multiplier coprocessor, designed to support the standard but slow shift-and-add approach to multiplication. We would include the following registers:

ier -- the multiplier
acc -- the accumulator (not the CPU's accumulator!)
icand -- the multiplicand

Our coprocessor's control unit must be able to clock each of these registers, and it must be able to respond to a read request for the contents of these registers from the CPU. These functions require the following control signals from the coprocessor's control unit:

cier -- clock the multiplier
cacc -- clock the accumulator
cica -- clock the multiplicand
rier -- read from multiplier to I/O bus
racc -- read from accumulator to I/O bus
rica -- read from multiplicand to I/O bus
run -- a multiplexor control, see below

We can cheat a bit on the controls for our data part by noting that, for each cycle of the multiplication algorithm, the least significant bit of the multiplier controls whether we just shift or shift and add. Therefore, we can wire this connection directly in the data part and avoid involving the control part. This cheat is marked in the diagram of the data part that follows with an asterisk:

	Data    ======o=====o======o======o======o======o==
	Address ======|=====|======|======|======|======|==
	Read    ------|-----|------|------|------|------|--
	Write   ------|-----|------|------|------|------|--
                      |     |      |      |      |      |     
            rier ...  |  --/_\     |   --/_\    n/   --/_\    
            racc     n/     |     n/  ____|______|______|_____
            rica      |     |      | |    |      |  ____|___  |
                      |     /n    _|_|_   /n    _|_|_   |   | |
            run  ...  |     |   --\0_1/   |   --\0_1/   /n  | |
                   ___|___  |    ___|___  |    ___|___  |   | |
	    cier -|>_icand| |  -|>_acc__| |  -|>_ier__| |   | |
            cacc      |     |       |     |       |     |   | |
            cica       -----o       o-----        o-----    | |
                            |      _|_____________|_        | |
                            |     |______>>1________|-      | |
                            |       |             |   |     | |
                            |    ---o             |   |     | |
                           _|___|_  |             |   |     | |
                          |___+___| |             |   |     | |
                             _|_____|_            |   |    n/ /n
       * (mux control)       \1_____0/--<---------|---*     | |
         is the LSB out          |                |_________| |
         of the shifter          |____________________________|

The run control signal from the control unit is set to zero when the control unit is loading a register from the bus, and it is set to one while the multiply operation is in progress.

The control unit for this multiplier can be quite simple, operating according to the following algorithm:

	repeat
	   wait for start pulse
	   run = 1
	   repeat n times
	      generate step pulse
	   endloop
	   run = 0
	forever
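
The control algorithm above can be sketched as a small clocked state machine in Python. This is only an illustration: the class name, the clock method and the choice to drop run on the same clock as the last step pulse are assumptions, not part of the design given here.

```python
class ControlUnit:
    """Sequencer: idle until a start pulse, then n step pulses with run = 1."""

    def __init__(self, n: int):
        self.n = n           # word size, so n step pulses per multiply
        self.remaining = 0   # step pulses still to be issued
        self.run = 0         # 1 while a multiply is in progress

    def clock(self, start: bool) -> bool:
        """Advance one clock; return True if a step pulse is issued."""
        if self.run == 0:
            if start:                    # wait for start pulse
                self.run = 1
                self.remaining = self.n
            return False
        self.remaining -= 1              # generate step pulse
        if self.remaining == 0:
            self.run = 0                 # all n steps issued, back to idle
        return True


cu = ControlUnit(n=4)
pulses = [cu.clock(start=(t == 0)) for t in range(8)]
print(pulses)
```

A start pulse on the first clock is followed by exactly n step pulses, after which the unit sits idle until the next start.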

We need a bus interface for the control unit in order to allow the CPU to use the bus to set and inspect the registers and to allow the CPU to request a multiply operation by issuing a start pulse:

	Data    ==========================
	Address ==o=======================
	Read    --|--------------o--------               to register
	Write   --|--------------|---o----               transfer logic
                __|__            |   |                   of coprocessor
               |     |          _|_  |
               |= x  |-o-------| 4 |-|----- rica    \
        address|= x+1|-|-o-----| x |-|----- racc     enable the addressed
        decoder|= x+2|-|-|-o---|AND|-|----- rier     tri-state bus drivers
               |= x+3|-|-|-|-o-|___|-|----- xx      /
               |_____| | | | |      _|_
	                -|-|-|-----| 4 |--- cica   ___          \
           to and from    -|-|-----| x |----------|2 x|-- cacc   clock the
           coprocessor      -|-----|AND|----------|OR_|-- cier   register
          control unit        -----|___|--          |           /
                                          |         |
              start-----------<-----------          |
              step ------------------------>--------
              run  ------------------------>------------ run

The above bus interface enables read from the registers when there is a read from address x (icand), x+1 (acc) or x+2 (ier) and it enables a write to one of these registers when there is a write to address x, x+1 or x+2. A write to address x+3 (start) issues a start pulse to the coprocessor's control unit, while a read from address x+3 issues a pulse on the control line marked xx, with no current use; we will find a use for it shortly!
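
The decoding just described can be written out as a truth function. The following Python sketch is an illustration under assumptions: the function name, the argument names and the base address used in the example are hypothetical, and the step pulse is ored into the clocks of acc and ier as drawn in the diagram.

```python
def decode(address: int, read: bool, write: bool, step: bool, x: int) -> dict:
    """Return the control signals for one bus cycle, given base address x."""
    sel = [address == x + i for i in range(4)]   # address decoder outputs
    return {
        "rica": sel[0] and read,            # read icand onto the bus
        "racc": sel[1] and read,            # read acc onto the bus
        "rier": sel[2] and read,            # read ier onto the bus
        "xx": sel[3] and read,              # read of x+3, not yet used
        "cica": sel[0] and write,           # clock icand from the bus
        "cacc": (sel[1] and write) or step, # clock acc: bus write or step pulse
        "cier": (sel[2] and write) or step, # clock ier: bus write or step pulse
        "start": sel[3] and write,          # write to x+3 starts the multiply
    }


BASE = 0x100  # hypothetical value for x
print(decode(BASE + 3, read=False, write=True, step=False, x=BASE)["start"])
```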

The coprocessor control unit issues step pulses, and these are ored into the clock lines to both the acc and ier registers.

Note that if we wanted to generalize this coprocessor into one able to perform both multiply and divide operations, we might use writes to x+3 to record the desired operation. For example, writing 0 to x+3 might configure the data part to perform multiplication, while writing 1 to x+3 could configure the data part to perform division. The control unit would issue the same n step pulses in either case, but the effect of these would depend on the operation that was selected by the write operation that initiated coprocessor operation.

Application programming using this coprocessor

Assuming we've got an Ultimate RISC CPU, we can write code to use this multiply unit as follows:

		; code for A = B x C
		MOVE B ier      ; move to x+2
		MOVE C icand    ; move to x+0
		MOVE junk start ; move to x+3 to start the multiply

		; must wait the appropriate time for the result

		MOVE acc A      ; move from x+1

A good compiler or a good assembly language programmer will insert code to perform other computations in the space where the above code indicates a wait.

The problem with this code is that it includes no provisions for waiting the appropriate time. If the manual for the coprocessor relates its speed to the speed of the CPU, we might be able to know exactly how many instructions to skip before trying to get the result, but if we upgrade our CPU or the coprocessor, the answer will change. This leads us to look for other solutions.

Polling the coprocessor status

One solution is to have the coprocessor include an explicit status bit that programs running on the CPU can test in order to see if the coprocessor is busy. This solution is common on direct-memory-access coprocessors, and it is common on interfaces to DSP chips. In both contexts, we typically also allow the condition status=done to be interpreted by the CPU as an interrupt request.

We can do this with our example coprocessor by giving a meaning to the unused control line xx in the bus interface given previously. This line is asserted when the CPU tries to read from address x+3. We will use this line to enable a read of the state of the run output from the coprocessor control unit, so that the user can test to see if the coprocessor is running or idle. As a result, we change our code for multiplication to something like:

		; code for A = B x C
		MOVE B ier      ; move to x+2
		MOVE C icand    ; move to x+0
		MOVE junk start ; move to x+3 to start the multiply

		; insert other operations here if there is something
		; else useful to be done before checking for done

		; polling loop to test multiply done
	LP:	MOVE run sfal   ; read x+3; skip next instruction if run = 0
		MOVE CLP pc     ; move constant equal to LP to pc

		MOVE acc A      ; move from x+1

This is a bit ponderous, but it allows the same code to run on machines with a wide range of relative CPU and coprocessor speeds.
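
The effect of relative speed on the polling loop can be seen in a small Python simulation. Everything here is a sketch under stated assumptions: the Multiplier class, the steps_per_access parameter (how many coprocessor steps fit in one bus access) and the add-then-shift step are hypothetical, and the driver clears acc before starting, which the assembly code above does not show.

```python
X_ICAND, X_ACC, X_IER, X_START = 0, 1, 2, 3   # offsets from base address x


class Multiplier:
    def __init__(self, n=16, steps_per_access=1):
        self.n, self.rate = n, steps_per_access
        self.icand = self.acc = self.ier = 0
        self.remaining = 0                     # run = 1 while this is nonzero

    def _tick(self):
        """Let the coprocessor run for the duration of one bus access."""
        for _ in range(min(self.rate, self.remaining)):
            if self.ier & 1:
                self.acc += self.icand
            self.ier = (self.ier >> 1) | ((self.acc & 1) << (self.n - 1))
            self.acc >>= 1
            self.remaining -= 1

    def write(self, offset, value):
        mask = (1 << self.n) - 1
        if offset == X_ICAND:
            self.icand = value & mask
        elif offset == X_ACC:
            self.acc = value & mask
        elif offset == X_IER:
            self.ier = value & mask
        elif offset == X_START:
            self.remaining = self.n            # start pulse; value is ignored
        self._tick()

    def read(self, offset):
        value = {X_ICAND: self.icand, X_ACC: self.acc, X_IER: self.ier,
                 X_START: int(self.remaining > 0)}[offset]
        self._tick()
        return value


def multiply_polling(dev, b, c):
    dev.write(X_ACC, 0)            # assumption: acc must start at zero
    dev.write(X_IER, b)
    dev.write(X_ICAND, c)
    dev.write(X_START, 0)          # any value; the write itself is the start
    polls = 0
    while dev.read(X_START):       # polling loop: read run until it is 0
        polls += 1
    return (dev.read(X_ACC) << dev.n) | dev.read(X_IER), polls


print(multiply_polling(Multiplier(n=8, steps_per_access=1), 13, 11))
print(multiply_polling(Multiplier(n=8, steps_per_access=8), 13, 11))
```

The same driver gives the same product whether the coprocessor is slow or fast relative to the bus; only the number of trips around the polling loop changes.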

Asynchronous Busses

The multiplication coprocessor described above is likely to be able to complete its job in only a few instruction times, particularly if we are really using an Ultimate RISC as our main processor. This is because the control unit for the coprocessor can clock the registers at speeds limited by the short data circulation path within the coprocessor, while the CPU instruction execution cycle is limited by the bus speed and main memory speed, and our Ultimate RISC CPU took 4 bus cycles per instruction cycle.

If a coprocessor is fast enough that it is likely to complete its job in only a few instruction times, the cost of using a few instructions to poll the coprocessor status while waiting for it to finish the operation is relatively high! Therefore, we need another solution!

One common idea is to use an asynchronous bus, that is, a bus where the time it takes for a data transfer on the bus is under the control of the devices making the transfer. The minimum requirement for such a bus is an extra bus line, call it wait, that is asserted by the device to say "hold on CPU, give me time to finish this job!" in response to a read or write request.

We use this idea commonly in modern computer systems, so that the CPU can do very fast cycles to addresses in fast RAM and significantly slower cycles to, for example, peripheral interface registers. Similarly, when the CPU starts a memory cycle, it has no idea whether the addressed location will be held in cache, and therefore available quite quickly, or available only in main memory, and therefore available after a much longer wait.

We'll use this idea as follows:

Whenever the CPU tries to read the contents of the multiplier register ier or of the accumulator acc, we'll have our coprocessor assert wait until the coprocessor is idle. This is illustrated below for just the read line for the accumulator:

	Data    ============o======
	Address ============|======
	Read    ------------|------
	Write   ------------|------
	Wait    -------o----|------
                       |    |
            racc -----/_\--/_\
                       |    |
            run  ------     |
                            |
                      data from acc

The CPU or (in the case of the Ultimate RISC) the Instruction Execution Unit must be modified to inspect the wait line. There are two approaches to this. In one, if wait is high during a bus cycle, the CPU must repeatedly retry that bus cycle until a cycle can be completed during which wait stays low.

In the other approach, raising wait during a bus cycle stops the CPU clock until wait is lowered. These two approaches are distinctly different! CPU designs that allow an external signal to stop or stretch clock pulses are notably more complex than those that simply keep retrying operations until they can be completed. Genuinely asynchronous busses require the ability to stretch a single bus cycle.
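
The first of these approaches, retrying the bus cycle, is easy to sketch in Python. The SlowDevice class, its latency parameter and the function name are hypothetical and exist only to show the retry loop; the clock-stretching approach would instead hold the CPU inside a single, longer bus cycle.

```python
class SlowDevice:
    """Hypothetical device that asserts wait for `latency` bus cycles."""

    def __init__(self, data, latency):
        self.data, self.busy = data, latency

    def bus_read(self):
        """One bus cycle: return (wait, data); data is valid only if wait is 0."""
        if self.busy > 0:
            self.busy -= 1
            return 1, None          # wait asserted, data lines not valid
        return 0, self.data


def read_with_retry(device):
    """Repeat the read cycle until one completes with wait low."""
    cycles = 0
    while True:
        cycles += 1
        wait, data = device.bus_read()
        if not wait:
            return data, cycles


data, cycles = read_with_retry(SlowDevice(data=42, latency=3))
print(data, cycles)
```

From the program's point of view the read simply takes longer; no polling code appears in the instruction stream.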

We can program a machine where the coprocessor can block a read until the operation is done as follows:

		; code for A = B x C
		MOVE B ier      ; move to x+2
		MOVE C icand    ; move to x+0
		MOVE junk start ; move to x+3 to start the multiply

		; insert other operations here if there is something
		; else useful to be done before checking for done

		MOVE acc A      ; move from x+1 (will block as needed)

The result is that naive code will run correctly, with the CPU being blocked as needed when results are not immediately available. Good compilers and careful assembly language programmers will interleave code so that other operations are performed at times when the coprocessor is expected to be busy, thus minimizing the number of times that the CPU is blocked by coprocessor activity.

Of course, we cheated a bit. Our coprocessor only forces the CPU to wait when it references the coprocessor's accumulator. If we use the design given above and the program looks first at the ier register, it will see intermediate results without being blocked. We would have to add some more gates to the design to prevent this from happening.