39. MMUs and Alignment at the Memory Interface
Part of
the 22C:122/55:132 Lecture Notes for Spring 2004
|
Up to this point, we have assumed that the only memory operations are to load a word from memory and to store a word in memory. In fact, most memory systems operate this way, but many of our CPU designs assume a more interesting interface, where any particular instruction may read or write bytes or words (and sometimes also halfwords and doublewords). This leads us to imagine a memory interface such as the following:
CPU side Memory side _____________ address >----| |----> address data out >----| |----> data out data in <----| interface |----< data in direction >----| |----> direction size >----|_____________|
At the heart of this interface is a system of multiplexors and demultiplexors called an allignment network
Consider the simple case of a 16-bit memory interface on a machine that allows references to both bytes and 16-bit words. For a read from memory, we might use the following alignment network:
CPU side Memory side byte n |------------------/----------- word address --/---| n-1 address |-------- | | /| | |0|-- 0 ---|-| | | | |1|-- MSB 8 | | \| | 8 MSB 16 |--/- | | o---/--| data <----/---| | | | |---/--< data |--/- | | | -/--| 16 LSB 8 | /| | | | 8 LSB | |1|-|--- | --| | | | |0|-|----- \| | | size ---------------------
When the size input to this circuit is 0, the circuit reads one byte from memory, using the least significant bit of the address to select the byte. When the size input is 1 and the least significant bit of the address is zero, it reads a 16-bit word.
There is a problem when this logic is used with a size of 1 (word) and an odd address. In this case, this alignment network reads the most significant byte from memory into both bytes of the value given to the CPU. This is clearly nonsense! The easiest solution is to simply disallow this case -- we simply require that all one-word operands be aligned on word boundaries in memory.
Exercise: Add logic to detetect an attempt to access a non-aligned operand and report this to the CPU.
Exercise: Work out the read alignment network for a 32-bit machine supporting aligned operands of size 8, 16 and 32 bits, encoded using size codes of 0, 1 and 2 from the CPU. Assume that 16-bit operands must have even addresses.
Exercise: Note that the alignment network given here always reads zero into the most significant bits of 8-bit operands. This is appropriate when reading data of the type declared as unsigned char in C or 0..255 in Pascal, but it is for the type signed char or -128..127. These require that the 8-bit operand be sign-extended to 16 bits. Modivy the alignment network given above to include an optional sign-extend control that enables this alternative.
The most common way to build an alignment network that supports writes of partial words is based on a bus design that divides the write signal into multiple write signals, one for each byte of the word. We can do this with 8-bit bytes and a 16-bit word as follows:
Memory Subsystem word ----/------------o-------- address n-1 | | |--/---|--------|--- data out --/---| 8 | | | 16 |--/---|--- | | 8 | _|_ | _|_ -| | -| | |RAM| |RAM| write low --------------|>__| -|>__| | | | |--/-------|----|--- data in --/---| 8 | | 16 |--/------- | | write high --------------------
This is the approach taken on a large fraction of the 16-bit microprocessors, including the Motorola 68000 and the Intel 8086, and modern 32-bit processors simply widen this model to 4 bytes, with 4 lines to control which byte is written during any particular write cycle, and an 8-line version would be required for a 64-bit word.
Given that we are using this type of ram, we can build the alignment network for the write data path as follows:
CPU side Memory side byte n |------------------/---------- word address --/---| ___ n-1 address |-----o------| | ___ | |OR |--| | | ---|___| |AND|-- write high | | -|___| | | | write --------------|--|--------o | | | ___ | | ___ -| | --|--O| | |AND|-- write low | |OR |--|___| o---|___| | | |\ ---|-|1| | | | |-- MSB 8 | -|-|0| | 8 MSB 16 |--/- | | |/ --/--| data >----/---| | | | |---/--> data |--/---o-|--|------/--| 16 LSB 8 | | 8 LSB size ------------------o--
For an aligned word write, this alignment network puts the entire word on the data bus, while for byte write, it puts the duplicate copies of the byte on both bytes of the bus and then relies on the LSB of the address to determin whether the high or low byte gets written.
With this network, as with the read network given previously, an attempt to access a non-aligned word, that is, a word that starts at an odd address in memory, will be handled incorrectly; therefore, an attempt to do so should be detected and reported to the CPU.
Exercise:Work out the write alignment network for a 32-bit machine supporting aligned operands of size 8, 16 and 32 bits, encoded using size codes of 0, 1 and 2 from the CPU. Assume that 16-bit operands must have even addresses.
There is an alternative approach to writing partial words to memory that rests on a conventional memory where the entire word must be written at once. This is to read the entire word, update the byte or bytes that are to be changed, and then write it back to memory. Such read-modify-write cycles could be executed at no runtime cost in the era of core memory because the read operation from core was destructive, requiring that every read be followed by a write. Adding the data paths to allow the value written to be a modified version of the original value introduced no significant delay.
With modern DRAM semiconductor memory, the read operation is also destructive because it discharges the capacitors that store each bit of the word, but the refresh logic on the chip automatically handles the problem of refreshing the stored charge, and as a result, external logic does not see two memory cycles per read. Because of this, read-modify-write cycles require a pipeline stall on a typical modern machine, so (with an important exception), this approach is rarely used.
If we attempt to read or write a non-aligned word, we must recognize that we are dealing with a word that is stored physically in two consecutive words of memory. Thus, on a 16-bit machine, we encounter one case"
________ ________ aligned |__HIGH__|__LOW___| ________ ________ non |\\\\\\\\|__HIGH__| \ aligned |__LOW___|\\\\\\\\| /
On a 32-bit machine, we have many more non-aligned cases:
________ ________ ________ ________ aligned |_HIGHER_|__HIGH__|__LOW___|_LOWER__| 32-bit word |\\\\\\\\|\\\\\\\\|__HIGH__|__LOW___| \ |\\\\\\\\|__HIGH__|__LOW___|\\\\\\\\| > 16-bit halfwords |__HIGH__|__LOW___|\\\\\\\\|\\\\\\\\| / ________ ________ ________ ________ non |\\\\\\\\|\\\\\\\\|\\\\\\\\|_HIGHER_| \ 32-bit aligned |__HIGH__|__LOW___|_LOWER__|\\\\\\\\| / ________ ________ ________ ________ |\\\\\\\\|\\\\\\\\|_HIGHER_|__HIGH__| \ 32-bit |__LOW___|_LOWER__|\\\\\\\\|\\\\\\\\| / ________ ________ ________ ________ |\\\\\\\\|_HIGHER_|__HIGH__|__LOW___| \ 32-bit |_LOWER__|\\\\\\\\|\\\\\\\\|\\\\\\\\| / ________ ________ ________ ________ |\\\\\\\\|\\\\\\\\|\\\\\\\\|__HIGH__| \ 16-bit halfword |__LOW___|\\\\\\\\|\\\\\\\\|\\\\\\\\| /
Given that the operands we address in these cases span multiple words in memory, there is no choice but to use multiple memory cycles to access them!
Typically, in a pipelined CPU, this means that operand load or store operations that involve non-aligned operands must be executed using multiple memory cycles from the same pipeline stage, and this, in turn, means that this pipeline stage must stall for an extra cycle.
We could do this using a simple microprogrammed control unit for the pipeline stage that performs the operand fetch or store, but if this is the only cause we have to break out from pipelined execution, we can use simpler schemes.
For example, when we detect a non-aligned operand reference, we can stall the stages above while we peform the first half of the non-aligned reference, clocking a new operation into the interstage register on the next cycle that will (a) release the stall and (b) perform the second half of the non-aligned operation.
If the hardware supports non-aligned memory references, as is the case on the 80x86 family and most other CISC machines, they are bad for the efficiency of pipelined processing, and therefore, they should be avoided.
Compilers, for example, can automatically pad data structures so that each operand begins in a new word, and the heap manager for dynamically allocated data types can guarantee that all objects on the heap begin on a word boundary. Better compilers, particularly for languages that do not specify a one-to-one mapping from textual form of declaration to a specific storage layout in memory, may re-order the fields of records (and activation records) so that memory is used efficiently while staying within alignment constraints. Consider:
struct widget { byte A; halfword B; byte C; word D; } Inefficiently allocated: ________ ________ ________ ________ |////////|////////|////////|___A____| |////////|////////|________B________| |////////|////////|////////|___C____| |_________________D_________________| Efficiently allocated: ________ ________ _________________ |___A____|___C____|________B________| |_________________D_________________|
Given that good compilers avoid non-aligned operands, it is fair to ask, why bother supporting them in hardware.
The designers of the DEC Alpha RISC architecture took an extreme position. This machine has 64-bit registers, and it supports direct load and store of 64-bit and 32-bit operands, with no support of non-aligned 64-bit operands, and no load or store operations for operands smaller than 32 bits.
Memory addresses on the Alpha are still byte addresses, so each byte in memory can be given an address, but the least significant 2 bits of the address are not used outside the CPU. Given a byte pointer in a register on the Alpha, loading a byte from memory takes a 2-instruction sequence:
The code to store a byte in an arbitrary memory address is even more complex:
These instruction sequences seem quite unwieldy, but the designers of the Alpha emphasized the importance of carefully hand coding very high performance string operations that manipulated strings using 64-bit parallel data transfers whenever possible. As a result, the performance of the Alpha has proven to be extremely good; for many years, the Alpha was the fastest microprocessor available (current models still outperform the Itanium, according to benchmarks done by Silicon Graphics).
Since around 1970, most operating systems of any significance have assumed the availability of a memory management unit, sitting between the CPU and main memory and able to translate virtual addresses output from the CPU to physical addresses used in memory.
_____ data _____ | | |--------------------------| | | | CPU |--| _____ |--| RAM | |_____| |---------| |----------| |_____| virtual | MMU | physical address |_____| address
The naive programmer generally does not want to know about physical addresses and would prefer to ignore the presence of the memory management unit in the system. The naive system programmer knows it is there and typically prefers a very simple model of its function based on a set of data structures that are all in RAM:
page word in page ___________ ___________ |___________|___________| Virtual address | | | -------------------- | | | page table | | | | rights frame | | ________________ | | |____|___________| | | |____|___________| | -------->|____|___________| | |____|___________| | |____|___________| | |____|___________| | |____|___________| | | | | Rights <----------- | | _____|_____ _____|_____ Physical address |___________|___________| frame word in frame
If the rights field of the table entry indicates that that table entry is invalid, or if the rights field says that a table entry is read-only but the program attempted a write operation, the memory management unit will abort the memory access and force the CPU to trap.
If we actually built memory management units like this, the performance penalty would be immense! For each memory reference by an applications program, the MMU would have to add the page number field of the virtual address to the base address of the page table to construct the address of a page-table entry, read the page-table entry from RAM, and then use it to construct a physical address. Thus, each main memory reference from the CPU would take two memory cycles, one for address translation and one for the actual data transfer.
_____ data | | |--------------------------| | | CPU |--| _____ |--x-- |_____| |---------| |----------| | virtual | MMU | physical | address |_____| address | | | -------o--------x-- __|__ | _____ | | | | | | TLB | --| RAM | |_____| |_____|
We can improve on this naive design by adding a cache to the MMU, as shown above. This cache is usually called the translation lookaside buffer, but in the conceptually easiest designs, it is just a fast cache.
In fact, it is foolish to use the physical address of a page-table entry as a key during translation lookaside buffer lookup, because this requires adding the page-table base address to the page number prior to checking the TLB. It is better to use the page number itself as the search key in the TLB, so that the TLB associative memory directly associates page numbers with frame numbers, and the physical address of a page-table entry is only needed if there is a TLB miss.
In modern RISC systems such as the SGI MIPS chip, it is common to simplify the memory management unit even more by eliminating the hardware for handling TLB misses. In these systems, a TLB miss is treated by the hardware in the same way that it treats an access violation or an invalid page-table entry. The memory management unit simply requests a trap when any of these events occur. On such a machine, the TLB entries are treated as special registers that may be written by system software in the trap handler in response to a TLB-miss trap.
Further discussion of memory management unit design requires a large body prerequisite material from operating systems!
The full data path from a pipelined CPU to the bus interface connector of a modern microprocessor chip such as the Pentium or Power PC that supports symmetrical multiprocessor configurations must look something like the following:
Pipelined | | CPU -----o-----x-- | __|__ | | ________ | | I | | On the same |________|-- |cache| | chip as the | |______| |_____| | CPU |________| _____ | | |______| | | | |________|-----|Align|--x-- _____ | |______| |_____| | | | | |________| ---| MMU |----o-----x-- | |_____| __|__ | | O |--o | | |cache| | _ | Bus on |_____| --|_|--o motherboard | (L1) | |
The memory management unit sits between the processor and the operand cache because the operand cache is a snooping cache, and must therefore deal with physical addresses. The instruction cache, in contrast, uses virtual addresses, and because of this, we must invalidate the entire contents of the I-cache whenever we change the address mapping data used by the memory management unit.
A second reason that the memory manager is typically put between the processor and the cache is so that the entries in the page table (or the TLB) can contain one bit to indicate whether each page of the address space is to be cached or not. Pages that are mapped to conventional RAM are typically marked as cached, while pages that are mapped to device registers on the input-output bus should not be cached. Special areas of the memory address space such as video RAM may also require the cache to be inhibited, particularly if it is a write-back cache and not a write-through cache.
Cache memory plus bus arbitration logic is good for constructing small-scale multiprocessors, but the parallelism that is possible is extremely limited! We can quantify this limitation!
If the equation holds, the bus will be fully saturated. If the left side is smaller, adding more CPUs may be possible. If the right side is smaller, adding CPUs will not improve the performance, but increasing the hit ratio, for example, by using a larger cache, will improve the performance.
Consider a system with a 70 nanosecond main memory cycle time, and a deeply pipelined CPU that runs at 100 MHz (a 10 nanosecond cycle time). If our L1 cache has a hit rate of 90%, the CPU will make a main memory reference every 100 nanoseconds, so this system will perform well with one CPU. With 2 CPU's, however, they demand a memory cycle every 50 nanoseconds, so they cannot operate at full speed with 70 nanosecond main memory. We might achieve reasonable performance, however, by using an L2 cache with a hit rate of 50% to speed the apparant main memory cycle time to something like 35 nanoseconds.
But! There is no way to extend this scheme arbitrarily! There comes a point where the cache is large enough to hold all of the instructions and variables that the processors use, so that the only traffic on the main bus is traffic involved in interprocess communications. Once we enlarge the system to the point that this traffic saturates the bus, we reach a dead end!
Up to the point of failure, bus based systems can be expanded to accomodate N CPUs at a cost of O(N) per CPU. Crossbar switches can be expanded at a cost of O(N2) with no limit, but the quadratic price becomes prohibitive at about the same point that the bus-based systems fail. Both have commonly reached N=16, and above this, bus-based systems fail to perform well and crossbar switches grow too expensive.
This problem was extremely clear in the early 1970's, as the 16 by 16 crossbar-based C.mmp processor was being completed. The next system built at Carnegie Mellon University, Cm*, was quite different. This system was a NUMA machine, with a non-uniform memory access time. The top-level structure of the machien was arranged as follows:
_____________ Cluster Bus | cluster | | | ------o----------------o--| controller | | | ___ | ___ | |_____________|-o | |CPU| | up to |CPU| | | | |___| | 10 |___| | | | _|_ | computer _|_ | | | |MMU|- modules |MMU|- inter | | |___| |___| cluster | | _|_ _|_ busses | | |RAM| ... |RAM| | | |___| |___| | | | | _____________ | | | cluster |-o | --------------------------| controller | | | |_____________|-|-o _____________ | | | cluster |-|-o --------------------------| controller | | | |_____________| | |
In this system, each computer module had a CPU and memory, connected by a memory management unit. Each segment of the address space could be mapped to local memory, in which case, access was fast, or to remote memory, in which case access was a minimum of 3 times slower. Nonlocal memory references could refer to other memory within the same cluster, or it could refer to memory in some computer module of some other cluster. In the latter case, the overhead of intercluster communications slowed the memory access to about 1/9 of the local memory speed.
Each cluster controller could access multiple intercluster busses; as built, the system had 5 clusters of 10 computer modules, with 3 intercluster busses. In theory, the system could be expanded without limit, but the operating system problems that arise on such a machine pose severe difficulties, and this design never went farther.
Bolt Beranek and Newman Corporation developed an alternative model for a multiprocessor based on commodity microprocessor chips, known as the BBN Butterfly. BBN (now BBN Technologies, a division of Verizon Corporation) had pioneered the development of priority interrupt structures, timesharing systems, modems and computer networks in the 1960's. As with Cm*, this was a NUMA based on computer modules and an interconnect system, but instead of a hierarchy of clusters with an almost random interconnect structure between the clusters, it had a very systematic approach to interconnection, the butterfly switching network. A Butterfly with 128 processors was demonstrated in 1984, and unlike Cm*, the butterfly was sold commercially.
The basic butterfly network topology had been developed simultaneously by several different groups, and starting from several different sets of initial assumptions about the nature of the switches that would be used to interconnect systems. One starting point was the cube connected network:
___ |___|--o--------o----o- ___ | | | |___|--|-o------|-o--o- ___ | | | | |___|--|-|-o----o-|--o- ___ | | | | | |___|--|-|-|-o----o--o- ___ | | | | |___|--o-|-|-|--o----o- ___ | | | | | |___|----o-|-|--|-o--o- ___ | | | | |___|------o-|--o-|--o- ___ | | | |___|--------o----o--o-
The diagram shows a 3-cube, but this model can be extended to any number of dimensions. If each box is a computer system and each vertical line is a network interconnection, we have a cube connected multicomputer, an idea that dates back to the 1960's. The CalTech Cosmic Cube, from 1983, consisting of 64 microcomputers interconnected by a 6-dimensional hypercube; this spawned the IPSC family of hypercube systems (the Intel Personal Supercomputer family).
If each interconnection in the above figure represents a crossbar switch on the path from CPU to memory, we get the switching network that has been described as a banyan or butterfly network.
Banyan trees are tropical trees similar mangroves; like the mangrove, they have multiple roots and trunks supporting a spreading crown of branches that can extend over a very large area. The network is called a butterfly because one of the ways of drawing it has a vague resemblance to a butterfly.
Such an interconnection scheme allows 2n processors to address any of 2n memory modules with a switching delay of n. The path from one CPU to one memory module in an 8-processor butterfly network is shown here:
-- -- ___ | | | | | | | | | | |CPU|----x--x-- | | | --x--x-- |___| | | --x--x-- | | --x--x-- | | | --x--x-- | | | | --x--x-- | | | ___ | | ---- | | | | | | | - | | | ------ -------|RAM| --------- | |___| -----
If all memory accesses had been delayed through such a switching network, the BBN Butterfly would have had an unacceptable performance, but the machine was built with two paths to memory, a direct path from each processor to its own part of the global memory, and a longer path, through the butterfly switch, allowing each processor access to all of memory. Each computer module of the BBN Butterfly therefore had the following structure:
| ___ --x----< from butterfly switch | | | |CPU|----x----> to butterfly switch |___| | _|_ | | |RAM| |___|
In writing programs for a NUMA computer such as the Butterfly or the Cm* machine, the code, private variables and stack of each process should be located in the local memory of computer module that runs that process, so that the only traffic through the switching network is traffic to shared variables.
Static code placement in a large scale multiprocessor is difficult! Many algorithms dynamically produce new processes, so there has been considerable development in two directions, one centered on operating systems that can dynamically move data to the local memory of the processor where it is needed, and the other centered on replacement or supplementation of local memory with cache.
If each bus has an A bit address and D bits of data, plus a request and an acknowledge control line, a read/write line, and a strobe, we require A+D+4 lines per connection to a crossbar switch module, as a bare minimum! An NxN crossbar would therefore require 2N(A+D+4) input/output connections.
Consider the common case where A=D=32 and N=2. In this case, A VLSI crossbar switch chip would require a minimum of 4x68 = 272 pins, ignoring the number of pins required for power and ground! Chips, and even circuit boards with this number of pins are extremely awkward! Integrated circuit realizations of a single crossbar intersection require N=1 plus at least two control lines for the arbitration logic. For the example, this would be 2x68+2 = 134 pins, a number that is far easier to deal with.
A group at the Courant Institute at New York University observed that the most serious performance problems with the Butterfly architecture were caused by what they referred to as hot spots in the memory address space. These were shared variables that were heavily used by large numbers of processors. For example, consider a multiprocessor where each of the CPU's occasionally checks the value of some shared variable to see if it is time to terminate the computation. From the point of view of any one CPU, this variable is only accessed occasionally, but from the point of view of the memory module holding the variable, we see an entirely different perspective:
___ | | | | | | x--x--x | M | x--x--x | | | |___| | | | | | | x------x------x | | | | | | | | x--x--x x--x--x | | | |The memory sees a binary tree of switching paths leading away toward the myriad processors, each of which makes occasional requests. As those requests reach toward the memory, the occasional becomes a torrent, and collisions in the switching nodes become very common.
This led to the notion of the "combining switch" element and to the NYU Ultracomputer Project. A combining switch behaves like a a small crossbar switch when it sees requests for different memory locations, but it combines requests when they are for the same location. If the only memory operations allowed are read and write, we have 4 cases:
The NYU team did not stop with this! They noted that many memory cycles are read-write-modify cycles, for example, I=I+1, and they realized that if the memory supports active semantics, that is, if the memory address bus includes not a one-bit opcode indicating that the operation is read or write, but rather, a small opcode requesting an ALU operation to be performed by the memory unit, it would be possible to combine operations in a far more interesting way!
To support this, the NYU ultracomputer supported the following model of memory:
Bus from memory client ------ __|__ | | | | Address -->--------| RAM | | |_____| | | | Data to memory -->--------- | | _|_|_ | | | | Operation -->-------| ALU | | |_____| | | | Data to client --<----------o-------
With this, a CPU can add to memory, subtract from memory, read or write in one cycle, depending on the ALU operation selected, and appropriate combining switches that include a local ALU in each switch can combine these operations.
A large part of the work of the Ultracomputer group at NYU was devoted to developing parallel algorithms that took maximal advantage of the active memory semantics, but, with support from IBM, they developed VLSI combinign switches and developed a close working relationship with the group at IBM that developed a series of multiprocessors based on Butterfly-like topologies.
I am uncertain of the details, but it appears that the group at IBM that inherited the Sequent Symmetry product line also inherited large parts of the work that had been done in conjunction with NYU. I am uncertain of the extent to which the ideas from Product Ultra are currently incorporated into any current experimental or production parallel machines from IBM.
Encore's snooping cache model can also be exploited to build large scale parallel machines by building a multilevel cache scheme. Encore announced a large scale multiprocessor based on this idea when the company was founded, but I am unaware of the outcome of that venture. It appears that the final Cray multiprocessor, based on DEC Alpha chips, may have been intended to be such a machine. The idea is as follows:
| | o-----o-----x- | __|__ | |cache|--o |_____| | even address | | memory cluster o-----o-----x- _____ | __|__ | | CPU |----o-----x- |cache|--o _____ |_____| __|__ | |_____| | | | |cache|--o o--| RAM | |_____|--|--o--------o |_____| _____ | | | | | | CPU |----o-----x- | o--| RAM | |_____| __|__ | | | |_____| |cache|--o | |_____|--|-- | o-----o-----x- Bus based cluster| __|__ | Odd addresses |cache|--o |_____| |
As shown, the system has 2 cache levels, and at each level, we have 2 CPUs and 2-way interleaved connections to memory. Thus, the example shown will allow 4 memory modules in two groups of two. One group of 2 modules (shown in full above) handles only even memory addresses, while the other (abbreviated) handles only odd addresses. If each bus supports two cached clients, the system will allow 4 CPUs.
If we assume a 50% hit ratio in each cache, the number of bus cycles per second with two clients should equal the the number of cycles in each of the two clients, so this system will balance and in theory it will scale to any depth! The difficulty is that the top-level caches must be able to simultaneously snoop on several busses, but this is quite feasible, although it increases the cost modestly.
If we assume that we can put clients on each cluster bus, and we assume a 90% hit rate in the client caches with 4-way interleaved memory connections, then building this system with depth 2 would allow us to have 64 clients.
If we build this system with a fan-in and fan-out of two at each level, so there are two cached clients connected to two memory connections on each bus, we end up with a topology identical to the butterfly or banyan interconnect structure!