22C:122/55:132 Notes, Lecture 34, Spring 2001

Douglas W. Jones
University of Iowa Department of Computer Science

Bus Arbitration
It is common to build computer systems with multiple active devices attached to the same memory bus. The most obvious example may be a machine with multiple central processors, but it is equally reasonable to speak in terms of secondary processors such as direct-memory-access peripheral devices. Consider, for example:
```
                    bus
     =========================================
       __|__   __|__   __|__   __|__   __|__
      |     | |     | |     | |     | |     |
      | CPU | | CPU | | DMA | | DMA | | RAM |
      |_____| |_____| |_____| |_____| |_____|
                       __|__   __|__
                      |     | |video|
                      | disk| | out |
                      |_____| |_____|
                        
      
```
Here, we have two CPU's and two DMA devices. Each of these may independently access RAM. The CPU's initiate DMA transfers to or from disk, but once initiated, the disk controller transfers the data between the disk and RAM on its own, without involving the CPU in each transfer. The DMA video controller operates continuously, reading a new pixel from RAM every microsecond and converting it to voltages on the red, green and blue signals sent to a color CRT.
The problem this poses is how these different devices can share the same bus connecting them to memory. Up to this point, we have spoken of memory busses with the assumption that a single device, the bus master, provided the memory address and memory control signals. Now, we assume bus where any of several devices may control the bus, and they must share it in an orderly way!
We solve this problem with bus arbitration logic. This logic may be centralized or distributed, and it may operate synchronously or asynchronously. Distributed asynchronous bus arbitration is very hard, while centralized synchronous arbitration is fairly easy. Consider the following interface between the bus arbitrator and any particular client:
```
            Clock -------->

Arbitrator  <------ Request  Client
 
            Grant -------->
```
With a synchronous bus, the arbitrator and all clients share a common clock, so it is usual to distribute the clock signal on the bus and to treat the clock itself as a component of the arbitrator. During each clock cycle, exactly one client may use the bus.
To use the bus, a client asserts its bus request line. If the arbitrator replies by asserting the bus grant line, the client may use the bus for a read or a write cycle. If the arbitrator does not grant the bus, the client must stall. If we have 2 clients, we can make a truth table describing the bus arbitrator:
```
       Requests        Grants
      A       B   |   A       B
 -----------------+-----------------
      0       0   |   0       0
      0       1   |   0       1
      1       0   |   1       0
      1       1   |   1       0
```
The first 3 lines of this truth table are simple. If no client requests the bus, no client receives a bus grant. If one only one client requests the bus, that client receives the bus grant. The third line is more interesting! If both clients request the bus, only one may receive it. As shown, whenever this happens, the bus is granted to client A; thus, we have established a strict priority, giving the highest priority to client A. Here is a similar truth table for 3 clients:
```
       Requests        Grants
      A   B   C   |   A   B   C
 -----------------+-----------------
      0   0   0   |   0   0   0
      0   0   1   |   0   0   1
      0   1   0   |   0   1   0
      0   1   1   |   0   1   0
      1   0   0   |   1   0   0
      1   0   1   |   1   0   0
      1   1   0   |   1   0   0
      1   1   1   |   1   0   0
```
A fixed priority bus arbitration system is very appropriate when the clients are naturally different in their patterns of bus usage. For example, consider a system with a CPU able to make one memory reference every 50 nanoseconds, a network interface that makes one meory reference every microsecond, and a disk interface that makes a memory reference every 5 microseconds. It turns out that the best priority allocation for this system gives the highest priority to the client that makes the least frequent use of the memory, so the disk gets the high priority, the network the next lower priority, and the CPU gets the lowest priority.
Fixed priority is undesirable when the clients are equals! For example, if two CPUs share a bus, giving one a higher priority will always lead to one CPU running fast at the expense of the other. The solution to this is to roatate the priority system. This is easiest in the context of a distributed arbitrator:

Distributed Synchronous Arbitration

A distributed bus arbitrator can be constructed on the following model:

                       grant
                         |
                    _____v_____ 
                   |           |
   request A ----->| arbitrate |--------> grant A
                   |___________|
                         |
                    _____v_____ 
                   |           |
   request B ----->| arbitrate |--------> grant B
                   |___________|
                         |
                         v

Here, the highest priority is at the top of the list. If the arbitrator at the top sees a request, it always grants it. If the arbitrator below sees a request, it only grants it if there are no requests above. The vertical connection between the arbitration units is called the priority chain. Each arbitrate unit has the following logic:

                 priority|
                   chain |
                     in  |   ___ 
                         o--|   |
                         |  |AND|------- grant
       request --------o----|___|
                       | |
                       O_| 
                      |   |
                      |AND|
                      |___|
                        |  
                        |  priority
                        |   chain
                        |    out

To create a round-robin priority system, where granting the request of some bus client causes its priority to drop, all we need to do is create a circular priority chain that is broken immediately below that client each time the bus is granted.

                _________      clock
               |         |       |
               |    _____v_____  |
               |   |           | |
   request A --|-->| arbitrate |-|---o--> grant A
               |   |___________| |  _|_
               |         |       o-|>__|
               |         |  _____|___|
               |         |_|     |
               |        |   |    |
               |        | OR|    |
               |        |___|    |
               |         |       |
               |    _____v_____  |
               |   |           | |
   request B --|-->| arbitrate |-|---o--> grant B
               |   |___________| |  _|_
               |         |       o-|>__|
               |         |  _____|___|
               |         |_|     |
               |        |   |    |
               |        | OR|    |
               |        |___|    |
               |_________|

Here, on each clock cycle, we remember which grant line was high in a flipflop tied to that grant line. The outputs of these flipflops controls where the priority chain is broken, and therefore, which arbitrate unit is at the head of the chain.

The above arbitration circuit has a serious fault. If a bus cycle occurs during which no client makes a memory request, none of the flipflops will be set, and therefore, no arbitrator will be at the head of the chain. This is clearly a problem, but we can solve it by enabling the clock to the flipflops only when there is at least one request, so that the flipflops always remember the most recent request. We must also make sure that exactly one of the flipflops is set on powerup.

Crossbar Switches and Interleaving
The oldest way to build a multiport memory saw one of its first major uses in the Burroughs D825 multiprocessor system. Conway's 1963 paper, on the D825 operating system in the 1963 Spring Joint Computer Conference is a good source on this. This machine was designed with up to 4 CPUs, 16 memory modules of 4K each, and an I/O subsystem, all able to access memory. The illustration in this paper was roughly as follows:
```
                  ___  ___  ___  ___
                 |   ||   ||   ||   |
                 |CPU||CPU||CPU||CPU|
                 |___||___||___||___|
      ___     |    |    |    |    |
     |   |    |    |    |    |    |
     | M |----x----x----x----x----x-
     |___|    |    |    |    |    |
      ___     |    |    |    |    |
     |   |    |    |    |    |    |
     | M |----x----x----x----x----x-
     |___|    |    |    |    |    |
       ...    |    |    |    |    |
      ___     |    |    |    |    |
     |   |    |    |    |    |    |
     | M |----x----x----x----x----x-
     |___|    |    |    |    |    |
              |
              |
         I/O subystem
```
The matrix structure of this interconnect system is called a crossbar switch, and the component at each x has, at its heart, a piece of bus grant logic. The bus coming out of each CPU is conventional, excep that it has a stall line that causes the CPU to stall if the memory request cannot be granted. The bus going to memory is also conventional. During any clock cycle, each CPU may access a different memory module, with stalls requested only when two CPUs attempt to access the same module.
The term crossbar switch comes from the field of telephony. Early dial telephones used 10ary and later 100ary trees of stepper switches in the telephone exchange to connect the calling phone to the called phone. These were expensive, and each switch could only connect one telephone call at a time. Crossbar telephone switches consist of an array of horizontal bars crossing an array of vertical bars, with contact fingers on each. Each bar has an electromagnet that can rotate the bar. If a horizontal and vertical bar are simultaneously rotated, the contact fingers latch. Once a connection is made, it remains connected until a release electromagnet on one of the bars is energized, and while one connection is in use, any of the other connections may be made or broken.
Typically, if we have 2ⁿ memory modules, we use the least significant n bits of the memory address to select the module. This is called n-way interleaved memory, and it effectively randomizes the locations of variables among the different memory modules. The alternative is to treat each module as a separate bank of contiguous memory addresses, a model that forces the programmer to be constantly aware of the assignment of banks to CPUs.
Interleaved memory has several consequences, some of which are important even when there is only one bus client. For example, on classical core memory, a read cycle caused the memory location that was read to be erased as a side effect of the read. In low performance systems, the restoration of the contents of this memory location was frequently left to microcode in the CPU.
In high performance core memory systems, though, it was useful to build more complex memory modules that take care of this detail. As a result, on a high performance system, the memory module from which a read was made was unavailable for one clock cycle after each read while it restored the location from which the read took place. This is of little value if there is only one module, but when there are multiple modules and they are interleaved, a sequence of reads from successive memory locations can be done without waiting for the restore cycles.
While core is an archaic technology today, read cycles still tend to take longer than write cycles because the write requires only that the address and data travel together from CPU to memory, while read cycles require that the address travel from CPU to memory and then that the data travel in the opposite direction, for a total of twice the transit time. On the highest performance CPU-memory interconnect systems, these two data transfers take place in separate bus cycles, and it is useful to interleave the memory so that a bus transation with a new memory module can be started while a read is being processed by another module.

A Crossbar Switch Cell

At each bus intersection in a crossbar switch, we need several components, but the most basic is a way to switch address and data information from one bus to another:

                 CPU
             addr data write
               |   |   |
               /   /   |
               |   |   |
   addr  --/---|---|---|--o---------------------------
MEM            |   |   |  |
   data  --/---|---|---|--|--------o------------------
               |   |   |  |        |
   write ------|---|---|--|-o------|------------------
               |   |   | _|_|_     |
               |   |   |   |       |
               |   |   |  / \      |
               |   |   | /___\-----|--------------o--- enable
               |   |   |   |    ---o---     ___   |
               o---|---|-| |  _|_      |   |   |--o
               |   |   | |-  \   /-----|---|AND|  |
               |   |   o-|    \ /      |   |___|O-|--
               |   |   |       |       |    ___   |  |
               |   |   |       |      / \  |   |--   |
               |   |   |       |     /___\-|AND|     |
               |   |   |       |       |   |___|-----o
               |   |   |        ---o---              |
               |   |   |           |                 |
               |   o---|-----------                  |
               |   |   |                             |
               |   |   o----------------------------- 
               |   |   |
               |   |   |

Here, when the enable line for an intersection between the busses is high, bus coupler logic transmits the address and the read-write control line from the CPU bus to the memory bus, and bidirectional bus coupler logic transmits the data in the appropriate direction between the busses, depending on whether this is a read or write operation.

A particular connection will be enabled when two conditions hold:

A) The least significant bits on the address from the CPU to memory select this memory module and
B) The bus grant logic grants this memory to this CPU.

So, in sum, we must have the following picture:

                   CPU
                  basic
                   bus req grant
                    |   |   |   ________ 
                    |   |   |  | Decode |
                    o---|---|--|addr lsb|
                    |   |   |  |________|
                    |   |   |   | |  | |
                    |   |   |      \
                    |   |   |       | each intersection 
                    |   |   |       | is connected to a
                   _|_  |   |       | different output.
                  |   | |   |       |
   basic bus -----|   |-|---|-------|------------------
MEM               |___| |   |       |
   req       ----/--|---|---|-------|------o-----------
                /   |   |   |       |      |
   grant     --/----|---|---|-------|------|-------o---
               |    |   |   |      _|_     |       |
               |    |   |   |     |   |   / \     _|_ 
               |    |   o---|-----|AND|  /___\-o-\   /
               |    |   |   |     |___|    1   |  \ /
               |    |   |   |       |req       |   |
               |    |   |   |   ____|____      |   |
   priority    |    |   |   |  |bus grant|     |   |
   chain     --|----|---|---|--| logic   |-----|---|---
               |    |   |   |  |_________|     |   |
               |    |   |   |       |          |   |
               |    |   |   |       |grant     |   |
                ----|---|---|-------o----------    |
                    |   |   |                      |
                    |   |   |                      |
                    |   |   o----------------------
                    |   |   |
                    |   |   |

The cost of an N by N crossbar switch for a bus with an A-bit address and a D-bit data path will by O(NxNx(A+D)). This is not a pleasant cost to pay when N and M are large. Furthermore, if you imagine building a single crossbar intersection as an integrated circuit, it will require 2(A+D)+k input output pins, where k is a small integer that accounts for miscellaneous connections such as the priority chain, read/write lines and so on. On a modern machine with a 32 bit address and a 64 bit data path, this will requires chips with about 200 pins, ignoring power and ground! This pin count is comparable to that of a good microprocessor circa the year 2000!