22C:122, Lecture Notes, Lecture 18, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

Motivation for Coprocessor or Functional Units Parallelism
Suppose we have some slow computation, for example, multiplication or division done by the shift and add or shift and subtract algorithm.
One way to incorporate this into a computer is to build it into the CPU, so that the execute portion of the fetch-execute cycle expands into many clock cycles when the current instruction requires such a time-consuming operation. This is how things were typically done on classic machines, from the days of the Burks, Goldstine and von Neumann paper to low-performance microprocessors of the present day.
This low-performance alternative relies on a single sequentially operated control unit, typically microcoded, with the control structure of the microcode looking something like the following:
```
		repeat
		   -- fetch
		   ADD = PC;
		   IR = memory[ADD]; PC = PC + 1;

		   -- decode
		   ADD = IR:address;
		   case IR:opcode of
		      load:
		         AC = memory[ADD];
		      add:
		         AC = AC + memory[ADD];
		      branch:
		         PC = ADD;
		      times:
		         repeat n times
		            if odd MQ
		               AC|MQ = (AC + memory[ADD])|MQ >> 1;
		            else
		               AC|MQ = AC|MQ >> 1;
		            endif
		         endloop
		      ...
		   end case
		endloop;
	
```
The problem with this model is that it requires a complex control unit, and it interferes with the desire to run the computation at maximum speed: nothing else can proceed while the slow operation occupies the execute phase.
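To make the cost of the times case concrete, here is a minimal Python sketch of the same shift-and-add loop. The function name, the parameter names and the default word size n = 8 are assumptions made for illustration; AC is assumed to start at zero, and MQ holds the multiplier, as in the microcode above.

```python
def shift_add_multiply(multiplicand, multiplier, n=8):
    """Multiply two n-bit unsigned values with n shift-and-add steps.

    Mirrors the microcode sketch: AC is the accumulator, MQ the
    multiplier-quotient register, and AC|MQ is shifted as one 2n-bit word.
    """
    mask = (1 << n) - 1
    ac = 0                      # assumption: AC is cleared before starting
    mq = multiplier & mask
    for _ in range(n):          # one iteration per clock cycle of "execute"
        if mq & 1:              # "if odd MQ"
            ac += multiplicand & mask   # may briefly need n+1 bits
        combined = ((ac << n) | mq) >> 1    # AC|MQ >> 1
        ac, mq = combined >> n, combined & mask
    return (ac << n) | mq       # the 2n-bit product sits in AC|MQ


print(shift_add_multiply(13, 11))   # 143
```

Each pass through the loop corresponds to one extra clock cycle in the execute phase, which is why an n-bit multiply stretches a single instruction to roughly n cycles in this model.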
Coprocessor Parallelism
A coprocessor is a system component that runs in parallel with the CPU and has its own control unit, so that it may perform computations while the CPU is operating. Coprocessors may be as complex as entire secondary CPUs (for example, the digital signal processors, or DSPs, used as coprocessors in some systems), or they may be as simple as, for example, direct memory access processors.
Here, we are interested, specifically, in arithmetic coprocessors such as floating-point units.
The simplest such units are attached to the system bus as if they were peripheral devices. Consider, for example, a multiplier coprocessor. Here is the data part we might need for a simple design based on the standard and slow shift and add operation:
```
	Data    ======o=====o======o======o======o======o==
	Address ======|=====|======|======|======|======|==
	Read    ------|-----|------|------|------|------|--
	Write   ------|-----|------|------|------|------|--
                      |     |      |      |      |      |     
            rier ...  |  --/_\     |   --/_\     |   --/_\    
            racc      |     |      |  ----|------|------|-----
            rica      |     |      | |    |      |  ----|---  |
                      |     |     _|_|_   |     _|_|_   |   | |
            run  ...  |     |   --\0_1/   |   --\0_1/   |   | |
                   ___|___  |    ___|___  |    ___|___  |   | |
	    cier -|>_ier__| |  -|>_acc__| |  -|>icand_| |   | |
            cacc      |     |       |     |       |     |   | |
            cica       -----o    ---o     |       o-----    | |
                           _|___|_  |     |       |         | |
                          |___+___| o-----        |         | |
                             _|_____|_            |         | |
                             \1_____0/------------|---      | |
                                _|________________|_  |     | |
                               |________>>1_________|-      | |
                                 |                |         | |
                                 |                 ---------  |
                                  ----------------------------
	
```
The control signals from the control system of this coprocessor are:
rier, racc, rica
enables transfer of ier, acc and icand registers to the data bus
run
if zero, sets the multiplexors on the inputs to acc and icand to take input from the data bus; if one, they take input from the output of the shifter.
cier, cacc, cica
clocks the ier, acc and icand registers
The control unit for this multiplier is quite simple, operating according to the following:
```
		repeat
		   wait for start pulse
		   run = 1
		   repeat n times
		      generate step pulse
		   endloop
		   run = 0
		forever
	
```
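The sequencing above can be sketched as a small state machine in Python. This is only an illustration: the class name, the method names, and the choice to ignore a start pulse that arrives while the unit is already running are assumptions, not part of the design described here.

```python
class MultiplierControl:
    """Sequencer sketch: idle until started, then n step pulses with run = 1."""

    def __init__(self, n):
        self.n = n              # number of step pulses per multiply
        self.run = 0            # the run output of the control unit
        self.remaining = 0      # step pulses still to be issued

    def start(self):
        # Assumption: a start pulse during a multiply is ignored.
        if not self.run:
            self.run = 1
            self.remaining = self.n

    def tick(self):
        """Advance one coprocessor clock; return True if a step pulse is issued."""
        if not self.run:
            return False        # still waiting for a start pulse
        self.remaining -= 1
        if self.remaining == 0:
            self.run = 0        # run drops once the final step has been issued
        return True


ctl = MultiplierControl(4)
ctl.start()
print([ctl.tick() for _ in range(6)])   # [True, True, True, True, False, False]
```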
Now, we need a bus interface for this system to allow the CPU to use the bus to set and inspect the registers and to allow the CPU to request a multiply operation by issuing a start pulse:
```
	Data    ==========================
	Address ==o=======================
	Read    --|--------------o--------
	Write   --|--------------|---o----
                __|__            |   |
               |     |          _|_  |
               |= x  |-o-------| 4 |-|----- rier  
               |= x+1|-|-o-----| x |-|----- racc 
               |= x+2|-|-|-o---|AND|-|----- rica
               |= x+3|-|-|-|-o-|___|-|----- xx   
               |_____| | | | |      _|_
	                -|-|-|-----| 4 |--- cier  ___
              from        -|-|-----| x |---------|2 x|-- cacc
           coprocessor      -|-----|AND|---------|OR_|-- cica
          control unit        -----|___|-- start   |
                                                   |
              step --------------------------------
              run  ------------------------------------- run
	
```
The above bus interface enables a read from a register when there is a read from address x (ier), x+1 (acc) or x+2 (icand), and it enables a write to a register when there is a write to address x, x+1 or x+2. A write to address x+3 (start) issues a start pulse to the coprocessor's control unit, while a read from address x+3 issues a pulse on the control line marked xx, with no current use (we will find a use for it!).
The coprocessor control unit issues step pulses, and these are turned into clock pulses to both the acc and icand registers.
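The decoding logic just described can be written out as a pure function, which may help when checking the figure. This is a sketch under assumptions: the function name decode, its parameter names and the base address used in the example are all hypothetical, and the signals are modeled as booleans for a single bus cycle.

```python
def decode(address, read, write, step, base):
    """Return the control signals for one bus cycle as a dict of booleans.

    base is the (hypothetical) bus address x of the ier register.
    """
    read, write, step = bool(read), bool(write), bool(step)
    hit = [address == base + i for i in range(4)]   # comparators = x .. = x+3
    return {
        # Four AND gates on the Read line
        "rier": read and hit[0],
        "racc": read and hit[1],
        "rica": read and hit[2],
        "xx":   read and hit[3],
        # Four AND gates on the Write line; step is ORed into cacc and cica
        "cier": write and hit[0],
        "cacc": (write and hit[1]) or step,
        "cica": (write and hit[2]) or step,
        "start": write and hit[3],
    }


signals = decode(0x103, read=False, write=True, step=False, base=0x100)
print(signals["start"])   # True
```

Modeling the interface this way makes it easy to confirm that a step pulse clocks acc and icand but never ier, regardless of what is on the bus.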
Note that if we wanted to generalize this coprocessor into one able to perform both multiply and divide operations, we might use writes to x+3 to record the desired operation. For example, writing 0 to x+3 might configure the data part to perform multiplication, while writing 1 to x+3 could configure the data part to perform division. The control unit would issue the same n step pulses in either case, but the effect of these would depend on the operation that was selected by the write operation that initiated coprocessor operation.
Programming this coprocessor
Assuming we've got an Ultimate RISC CPU, we can write code to use this multiply unit as follows:
	; code for A = B x C
	MOVE B ier       ; move to x+0
	MOVE C icand     ; move to x+2
	MOVE junk start  ; move to x+3
	; must wait the appropriate time for the result
	MOVE icand A     ; move from x+2
A good compiler or a good assembly language programmer will insert code to perform other computations in the space where the above code indicates a wait. This is particularly valuable if there are other coprocessors in the system.
The problem with this code is that it includes no provisions for waiting the appropriate time. If the manual for the coprocessor relates its speed to the speed of the CPU, we might be able to know exactly how many instructions to skip before trying to get the result, but if we upgrade our CPU, the answer will change. This leads us to look for other solutions.
Polling the coprocessor status
One solution is to have the coprocessor include an explicit status bit that programs running on the CPU can test in order to see if the coprocessor is busy. This solution is common on direct-memory-access coprocessors, and it is common on interfaces to DSP chips. In both contexts, we typically also allow the condition status=done to be interpreted by the CPU as an interrupt request.
We can do this with our example coprocessor by giving a meaning to the unused control line xx in the bus interface given previously. This line is asserted when the CPU tries to read from address x+3. We will use this line to enable a read of the state of the run output from the coprocessor control unit, so that the user can test to see if the coprocessor is running or idle. As a result, we change our code for multiplication to something like:
	; code for A = B x C
	MOVE B ier       ; move to x+0
	MOVE C icand     ; move to x+2
	MOVE junk start  ; move to x+3
	; insert other operations here if there is something
	; else useful to be done before checking for done
	LP: MOVE run to sfal; skip next instruction if run = 0
	MOVE CLP pc      ; move constant equal to LP to pc
	MOVE icand A     ; move from x+2
This is a bit ponderous, but it allows the same code to run on machines with a wide range of relative CPU and coprocessor speeds.
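A rough way to see both the portability and the cost of polling is to count the instructions the loop executes for different relative speeds. The sketch below is a toy timing model, not a measurement: it assumes the multiply needs n_steps coprocessor clocks after the start pulse, that every CPU instruction takes clocks_per_instruction coprocessor clocks, and that the status is sampled at the end of the instruction that reads it. All names are hypothetical.

```python
def polling_cost(n_steps, clocks_per_instruction):
    """Count the instructions spent in the polling loop before it exits."""
    elapsed = 0          # coprocessor clocks since the start pulse
    instructions = 0
    while True:
        instructions += 1                   # status read (MOVE run to sfal)
        elapsed += clocks_per_instruction
        if elapsed >= n_steps:              # run = 0, so the branch is skipped
            return instructions
        instructions += 1                   # branch back (MOVE CLP pc)
        elapsed += clocks_per_instruction


for speed in (4, 16, 40):
    print(speed, polling_cost(32, speed))
# 4 9
# 16 3
# 40 1
```

The loop terminates correctly at every speed, but when the coprocessor is fast relative to the CPU, even the single unavoidable status read is pure overhead.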
Asynchronous Busses
The multiplication coprocessor described above is likely to be able to complete its job in only a few instruction times. This is because the control unit for the coprocessor can clock the registers at speeds limited by the short data circulation path within the coprocessor, while the CPU instruction execution cycle is limited by the bus speed.
If a coprocessor is fast enough that it is likely to complete its job in only a few instruction times, the cost of using a few instructions to poll the coprocessor status while waiting for it to finish the operation is relatively high! Therefore, we need another solution!
One common idea is to use an asynchronous bus, that is, a bus where the time it takes for a data transfer on the bus is under the control of the devices making the transfer. The minimum requirement for such a bus is an extra bus line, call it wait, that is raised by the device to say "hold on CPU, give me time to finish this job!" in response to a read or write request.
We use this idea commonly in modern computer systems, so that the CPU can do very fast cycles to addresses in fast RAM and significantly slower cycles to, for example, peripheral interface registers.
Here, we'll use the idea as follows:
Whenever the CPU tries to read the contents of the multiplier register ier or of the accumulator, we'll have our coprocessor assert WAIT until the coprocessor is idle. This is illustrated below for just the read line for the accumulator:
	Data    ============o======
	Address ============|======
	Read    ------------|------
	Write   ------------|------
	Wait    -------o----|------
	               |    |
	racc    ------/_\--/_\
	               |    |
	run     -------     |
	                    |
	              data from acc
The CPU or (in the case of the Ultimate RISC) the Instruction Execution Unit must be modified to inspect the Wait line. There are two approaches to this. In one, if Wait is high during a bus cycle, the CPU must repeatedly retry that bus cycle until a cycle can be completed during which Wait stays low.
In the other approach, raising Wait during a bus cycle stops the CPU clock until Wait is lowered. These two approaches are distinctly different! CPU designs that allow an external signal to stop or stretch clock pulses are notably more complex than those that simply keep retrying operations until they can be completed. Genuinely asynchronous busses require the ability to stretch a single bus cycle.
We can program a machine where the coprocessor can block a fetch until the operation is done as follows:
	; code for A = B x C
	MOVE B ier       ; move to x+0
	MOVE C icand     ; move to x+2
	MOVE junk start  ; move to x+3
	; insert other operations here if there is something
	; else useful to be done before checking for done
	MOVE icand A     ; move from x+2 (will block as needed)
The result is that naive code will run correctly, with the CPU being blocked as needed when results are not immediately available. Good compilers and careful assembly language programmers will interleave code so that other operations are performed at times when the coprocessor is expected to be busy, thus minimizing the number of times that the CPU is blocked by coprocessor activity.
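The retry form of this scheme can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: the class and function names, the 8-bit word size, the reading of the data-path figure in which ier is added to acc when the low bit of icand is one, and the simplification that acc is zero when the multiply starts and that only the final read consumes time.

```python
class WaitingMultiplier:
    """Toy bus model: offsets 0, 1, 2 are ier, acc, icand; a write to 3 starts."""

    def __init__(self, n=8):
        self.n = n
        self.mask = (1 << n) - 1
        self.reg = [0, 0, 0]          # ier, acc, icand
        self.steps_left = 0           # nonzero while run = 1

    def write(self, offset, value):
        if offset == 3:
            self.steps_left = self.n  # start pulse; the data value is ignored
        else:
            self.reg[offset] = value & self.mask

    def clock(self):
        """One coprocessor clock: one shift-and-add step if running."""
        if self.steps_left == 0:
            return
        ier, acc, icand = self.reg
        if icand & 1:
            acc += ier
        combined = ((acc << self.n) | icand) >> 1
        self.reg[1], self.reg[2] = combined >> self.n, combined & self.mask
        self.steps_left -= 1

    def read(self, offset):
        """Return (wait, data) for offsets 0..2; Wait is asserted while running."""
        if self.steps_left > 0:
            return True, None
        return False, self.reg[offset]


def cpu_read(device, offset, clocks_per_cycle):
    """Retry the bus cycle until Wait stays low; return (data, retries)."""
    retries = 0
    while True:
        for _ in range(clocks_per_cycle):   # coprocessor runs during the cycle
            device.clock()
        wait, data = device.read(offset)
        if not wait:
            return data, retries
        retries += 1


m = WaitingMultiplier()
m.write(0, 13)      # MOVE B ier
m.write(2, 11)      # MOVE C icand
m.write(3, 0)       # MOVE junk start
print(cpu_read(m, 2, clocks_per_cycle=2))   # (143, 3)
```

The program never tests a status bit, yet it gets the right product at any relative speed; a slower bus cycle simply means fewer retries.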