22C:122, Lecture Notes, Lecture 11, Fall 1999

Douglas W. Jones
University of Iowa Department of Computer Science

  1. Instruction Coding

    One problem the designer of any instruction set must answer is: how many bits should be assigned to each field in each variation of the instruction word? This is a question of information encoding, and the theory relevant to it is information theory.

  2. A Brief Review of Information Theory

    Information theory deals with the information content of messages. Both information content and message size are measured in bits, and this shared unit leads to problems, so we will distinguish between bits of information on the one hand and either bits of representation or binary digits on the other hand.

    Given

        M    -- a message
        P(M) -- the probability that message M will be sent

    We define the information content, or entropy, of the message as:

        H(M) = log2( 1/P(M) )

    So, for example, if the probability of the message "I am sick" is 1/8, the entropy H("I am sick") is log2( 1/(1/8) ) or log2(8), which is 3 bits.
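
    This arithmetic is easy to check mechanically; as a minimal sketch, a few lines of C suffice to evaluate the entropy formula and reproduce the 3-bit figure for a probability of 1/8 (the function name entropy is just for illustration):

        #include <math.h>
        #include <stdio.h>

        /* entropy, in bits, of a message with probability p */
        double entropy( double p )
        {
            return log2( 1.0 / p );
        }

        int main()
        {
            printf( "%f\n", entropy( 1.0 / 8.0 ) );  /* prints 3.000000 */
            return 0;
        }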

    If you just received M1="my body temperature is above normal", you can guess that I have a fever and therefore that I am sick. Here, you are relying not on the unconditional probability P(M2) of the message M2="I am sick", but rather on the conditional probability P(M2|M1), the probability of M2 in the context of M1. In fact, I could have an above-normal temperature because I just came out of a sauna, so P(M2|M1) is not 1, but it is certainly higher than P(M2).

    Note that the entropy, H, of a message need not be an integer. It is perfectly reasonable to say that H("I have a temperature") is 2.75 bits and that the conditional entropy of the second message, "I am sick", given the first is 0.25 bits.

    If we have messages composed from an alphabet, for example, messages composed of letters of the Roman alphabet, or messages composed of instructions from an instruction set, we can measure the entropy of the message in many ways. The simplest, H0 or the zeroth order entropy, rests on the unconditional probabilities of the letters:

        H0(S) = SUM over i of H(Si) = SUM over i of log2( 1/P(Si) )

    That is, the entropy of a string is the sum of the entropies of the letters of the string. Higher order entropies take context into account, so H1, the first order entropy, substitutes P(Si|Si-1) for P(Si). That is, we use the probability of each letter in the context of its predecessor instead of the raw probability of each letter.

    If we use context, we generally get a smaller measure of the entropy. For example, the first order entropy is:

        H1(S) = SUM over i of log2( 1/P(Si|Si-1) )

    and the second order entropy is:

        H2(S) = SUM over i of log2( 1/P(Si|Si-1,Si-2) )
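
    In practice, these entropies are estimated from observed letter frequencies. The following C sketch illustrates one simple estimate of H0, using each letter's relative frequency within the string as its probability; an estimate of H1 would do the same computation over letter pairs. The function name h0 and the use of "DECADE" as test data are just for illustration.

        #include <stdio.h>
        #include <string.h>
        #include <math.h>

        /* estimate H0(s), in bits, using each letter's relative
           frequency in s as its probability */
        double h0( const char *s )
        {
            double count[256] = { 0.0 };
            double n = (double) strlen( s );
            double h = 0.0;
            int i;

            for (i = 0; s[i] != '\0'; i++) count[ (unsigned char) s[i] ] += 1.0;
            for (i = 0; i < 256; i++) {
                /* each of count[i] occurrences contributes log2(1/P) bits */
                if (count[i] > 0.0) h += count[i] * log2( n / count[i] );
            }
            return h;
        }

        int main()
        {
            printf( "%f\n", h0( "DECADE" ) );  /* about 11.51 bits */
            return 0;
        }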

    In general, data compression algorithms seek to compress the representation of S to H(S) bits. No data compression algorithm can compress to better than H(S) for an arbitrary S. If one does better than H(S) for some particular string S, it must do enough worse than H(S) on enough other strings that the expected length of its output is greater than or equal to H(S).

  3. Variable-Length Codes

    Unambiguous variable-length codes are usually organized as prefix codes. Consider the following example for an alphabet of 7 letters:

    	A = 00
    	B = 010
    	C = 011
    	D = 10
    	E = 110
    	F = 1110
    	G = 1111
    	
    The string DECADE in this alphabet would be encoded as 101100110010110.
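
    Encoding with such a code is just table lookup and concatenation. A minimal C sketch of an encoder for the table above might look like the following, with strings of ASCII '0' and '1' characters standing in for actual bits:

        #include <stdio.h>

        /* the prefix code tabulated above, indexed by letter, A first */
        static const char *code[7] = {
            "00", "010", "011", "10", "110", "1110", "1111"
        };

        /* print the encoding of a string over the alphabet A through G */
        void encode( const char *s )
        {
            for ( ; *s != '\0'; s++ ) fputs( code[ *s - 'A' ], stdout );
            putchar( '\n' );
        }

        int main()
        {
            encode( "DECADE" );  /* prints 101100110010110 */
            return 0;
        }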

    Prefix codes get their name from the fact that no codeword in such a code is a prefix of any other codeword.

    Any prefix code can be described by a trie, or decoding tree. The trie for the above code is:

                                 . 
                               0/ \1
                               /   \
                              .     . 
                            0/ \1 0/ \1
                            A   . D   . 
                              0/ \1 0/ \1
                              B   C E   . 
                                      0/ \1
                                      F   G
    	
    If our representation is in base n, the trie for decoding a prefix code will be an n-ary tree. To decode a string, start at the root, following the labeled branches that match successive digits of the code, until you reach a leaf. Record the character in that leaf and then start back at the root.
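
    A decoder for the example code can follow this recipe directly. In the C sketch below (the names and table layout are just one possible choice), the trie drawn above is flattened into an array of nodes; interior nodes hold the indices of their 0 and 1 subtrees, and leaves hold letters:

        #include <stdio.h>

        /* one node of the decoding trie; leaves hold a letter,
           interior nodes hold the indices of their 0 and 1 subtrees */
        struct node {
            char letter;          /* 0 for interior nodes */
            int  child[2];
        };

        /* the trie drawn above, with node 0 as the root */
        static const struct node trie[] = {
            /*  0 */ {  0,  {  1,  2 } },
            /*  1 */ {  0,  {  3,  4 } },   /* left subtree:  A B C   */
            /*  2 */ {  0,  {  5,  6 } },   /* right subtree: D E F G */
            /*  3 */ { 'A', {  0,  0 } },
            /*  4 */ {  0,  {  7,  8 } },
            /*  5 */ { 'D', {  0,  0 } },
            /*  6 */ {  0,  {  9, 10 } },
            /*  7 */ { 'B', {  0,  0 } },
            /*  8 */ { 'C', {  0,  0 } },
            /*  9 */ { 'E', {  0,  0 } },
            /* 10 */ {  0,  { 11, 12 } },
            /* 11 */ { 'F', {  0,  0 } },
            /* 12 */ { 'G', {  0,  0 } },
        };

        /* decode a string of '0' and '1' characters by walking the trie */
        void decode( const char *bits )
        {
            int n = 0;                        /* start at the root */
            for ( ; *bits != '\0'; bits++) {
                n = trie[n].child[ *bits - '0' ];
                if (trie[n].letter) {         /* reached a leaf */
                    putchar( trie[n].letter );
                    n = 0;                    /* start back at the root */
                }
            }
            putchar( '\n' );
        }

        int main()
        {
            decode( "101100110010110" );      /* prints DECADE */
            return 0;
        }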

    Huffman discovered a coding algorithm that optimizes the binary prefix code representation of a finite alphabet where each letter has a probability that is known in advance. Huffman's algorithm assigns each letter its probability as a weight and then builds the trie bottom up, repeatedly merging the two subtrees of least weight until a single trie remains; the weight of the entire trie is 1 (the sum of all the probabilities). The result is a Huffman code, and it is fairly easy to prove that a Huffman code will compress text in the alphabet being used to within 1 bit per letter of H0, and that no other prefix code can do better than a Huffman code.
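
    Huffman's construction itself is short enough to sketch in C. The version below uses made-up letter probabilities, chosen here so that the optimal codeword lengths come out equal to the lengths in the code table above; it repeatedly merges the two lightest subtrees and then reads off each letter's codeword length as its depth in the finished trie:

        #include <stdio.h>

        #define N 7   /* number of letters in the alphabet */

        /* hypothetical letter probabilities, for A through G */
        static double weight[2*N] = {
            0.25, 0.125, 0.125, 0.25, 0.125, 0.0625, 0.0625
        };
        static int parent[2*N];
        static int active[2*N];

        int main()
        {
            int next = N;      /* index of the next internal node to create */
            int i, a, b;

            for (i = 0; i < N; i++) active[i] = 1;

            /* repeatedly merge the two lightest active subtrees;
               a is the lightest, b the second lightest */
            while (next < 2*N - 1) {
                a = -1; b = -1;
                for (i = 0; i < next; i++) {
                    if (!active[i]) continue;
                    if (a < 0 || weight[i] < weight[a])      { b = a; a = i; }
                    else if (b < 0 || weight[i] < weight[b]) { b = i; }
                }
                weight[next] = weight[a] + weight[b];
                parent[a] = parent[b] = next;
                active[a] = active[b] = 0;
                active[next] = 1;
                next++;
            }

            /* the codeword length of each letter is its depth in the trie;
               node 2*N - 2 is the root */
            for (i = 0; i < N; i++) {
                int len = 0, j = i;
                while (j != 2*N - 2) { j = parent[j]; len++; }
                printf( "%c  %d bits\n", 'A' + i, len );
            }
            return 0;   /* prints A 2, B 3, C 3, D 2, E 3, F 4, G 4 */
        }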

    The prefix codes that interest us in the world of instruction coding are those with radix higher than 2. So, if our machine has instructions that are a multiple of 8 bits in length, we might view the instruction set as a prefix code in radix 256. This would be an appropriate way to view the instruction set of machines such as the DEC VAX or the Intel 80x86 family.

  4. Block Codes

    If we have a code that is based on fixed-size words, such as is the case on most modern RISC machines, we cannot use Huffman's algorithm. Nonetheless, we note that when H(S) = Length(S), S cannot be compressed, and when this is true, P(0) = P(1) = .5 in the binary representation of S. Therefore, an instruction code in which the representations of real programs do not look random is not optimally compressed.
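
    One crude way to see whether a program's representation "looks random" is to measure its bit balance; an even mix of zeros and ones is necessary, though far from sufficient, for a representation to be incompressible. A small C sketch along these lines:

        #include <stdio.h>

        /* report the fraction of one bits in the file named on the
           command line; for a near-optimal code it should be near 0.5 */
        int main( int argc, char *argv[] )
        {
            FILE *f;
            long ones = 0, bits = 0;
            int ch, i;

            if (argc != 2 || (f = fopen( argv[1], "rb" )) == NULL) return 1;
            while ((ch = getc( f )) != EOF) {
                for (i = 0; i < 8; i++) ones += (ch >> i) & 1;
                bits += 8;
            }
            fclose( f );
            if (bits > 0) printf( "%f\n", (double) ones / (double) bits );
            return 0;
        }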

    If an instruction set results in programs that are not optimally compressed, we expect programs for it to take more memory cycles to execute, to suffer more cache misses and more page faults, and to occupy more code space than programs for a comparable machine with a more nearly optimal program representation. This is independent of all other details of the CPU implementation!