With a simple crossbar switch, each memory reference is routed through one switch on the way to memory. With alternative arrangement, each memory reference is routed through 2 switches.
Problem B: How do these two approaches scale? Address both switching delay and number of components in your answer, as a function of N, the number of CPUs (equal to the number of Memory modules).
If we use 2x2 switches as our fundamental unit, we require (N/2)2 of these to build a simple crossbar switch to connect N processors, and the switching delay will be a constant independent of N (until the wire length comes into play). If we use the multilevel scheme, we will need (N/2)log2(N) such switching units, and the switching delay will be log2(N) times the delay of a simple crossbar switch.
Problem C: Both systems require that there be some specialization in each crossbar swtich to customize it for its setting in the interconnection system. Clearly identify the aspects of this specialization that differ in the two interconnection schemes!
Each crossbar intersection (independent of the dimensions of the switching module) must make a connection when the address from the CPU references its outgoing bus. In a 2nx2n crossbar, this depends on the least significant n bits of the address. In a hierarchical switching network of 2x2 crossbars, each intersection checks only one bit of the address.Therefore, in general, we must be able to program each crossbar switch intersection with a mask that indicates which address bits it is to check, and with a pattern it checks for equality to, under this mask. All the switches in a 2x2 crossbar switch module will use the same mask, and all the switches in each column of the 2x2 module will compare with the same pattern, so we must be able to customize 3 "words" in each module.
We need a valid bit attached to each entry in the cache. If we use static ram, where each bit is stored in a flipflop, logically formed of two nand (or two nor) gates. To invalidate all cache entries, we add an extra input to this flipflop connected to the invalidate line. Here is a reasonable design for one line of this RAM chip:
_ _ DD R Word || | | Select --||o---------|------------------|-- (from ||| | | address ||| ___ | ___ | decoder) ||o--| | o---| | | ||| |AND|0-|---|AND|O- | |o|--|___| | --|___| | | ||| ___ || ___ | | || --| | || | | | |\ | || |AND|0-||--|AND|O-|-o-| >-o o|---|___| || -|___| | | |/ | || ||| | | | || || -------- | | || | ----------- | || | | QThe above one-bit cell of a static RAM has only one feature added, the Reset input to the upper right nand gate.
? |_| |_| CPU --c--| |-| |- CPU --a--|___|-|___|- c |_| |_| CPU --h--| |-| |- CPU --e--|___|-|___|- ? | | | | ???cache??? | | | | M M M MIn either case, we add 4 caches; in one case, one per CPU, and in the other case, one per memory. Contrast these! What problems does each pose for the cache designer, and what are the potential benefits of each approach.
If the cache is put between CPU and the crossbar switch, it reduces contention in the switch, but it makes it extremely difficult to solve the cache coherency problem. If the cache is put between the crossbar switch and memory, it does nothing to help with contention, but it poses no cache coherency problem.