22C:18, Lecture 2, Fall 1996

Douglas W. Jones
University of Iowa Department of Computer Science

Representation

When we write computer programs, we always deal with representations. For example, we don't deal with text the way a calligrapher or author deals with text, we deal with a specific character set that may be used to represent text. Instead of dealing with integers or real numbers, we deal with finite precision representations of integers and we deal with floating point numbers.

Well designed programming languages allow the programmer to treat these representations as if they were identical to the abstractions they represent, but this identity is never complete. Thus, for example, our machine representations of integers behave exactly like abstract integers until a result greater than the machine precision is produced; at that point, the behavior of integer variables becomes very strange!

Binary Representations

All modern computers use binary representations for their data, and use fixed size containers for data. There is no reason that computers could not be built that use other basic representations; "trinary" (base 3) logic devices aren't hard to build, and in fact, decimal computers were built in large numbers in the 1960's. There is no compelling benefit to using non-binary representations, though, and even the decimal machines of the 1960's were built using binary logic and used binary-coded-decimal (BCD) internally.

The bit is the fundamental unit of information. A bit has two values, conventionally called zero or one. We could as easily name the values on and off, high and low, or true and false; in fact, all of those names are used in different contexts, and the truth is that all such names are arbitrary!

Over the past 50 years, bits have been grouped inside computers in many ways, and these groupings have been given many names:

Historical Groupings of Bits

Word -- Most computers built since 1946 have had a well defined notion of word size. Here are the word sizes of some important computers:

40 bits
Used by machines based on the famous Berks, Goldstein and Von Neumann design developed in 1946. These include the IAS machine, ILLIAC I, ORDVAC and many other first generation custom-built machines.

36 bits
Used by IBM's 701 computer in 1951, the IBM's followups, the 704, 7040, 709 and 7090, sold into the mid 1960's. Also used by the DEC PDP-6, PDP-10 and DECsystem-20, sold between 1965 through the 1970's, and by the General Electric 600, which became the basis for the Honeywell 6000 family in the 1960's and 1970's.

18 bits
Used by the TX0, the very first transistorized computer, and by DEC's PDP-1, PDP-4, PDP-7, PDP-9 and PDP-15, sold through the 1960's and into the 1970's.

48 bits
Used by Burroughs mainframes from 1960 to the present (Burroughs is now part of Unisys). Also used by early CDC mainframes.

12 bits
Used first by the CDC 160, in 1961, and used by the DEC PDP-5 and PDP-8 family, sold between 1963 and 1990.

60 bits
Used by the CDC 6600 and the followup Cyber series of supercomputers, introduced in 1965 and sold through the mid 1980's.

16 bits
Was this first used by Computer Control Corporation (later Honeywell) in 1964?; in any case, by 1970, this was the standard word length of the entire minicomputer industry, led by DEC, with Data General and Hewlett Packard agressively participating. It was quite natural that when microcomputers emerged in the mid 1970's, they were patterned after successful minicomputer architectures.

32 bits
First used by the IBM System 360 in 1965, and still used today by IBM's mainframes. By the early 1970's, some minicomputer makers had 32 bit mainframes, notably Interdata and Modcomp, but DEC's VAX, an architecture based on the PDP-11, became probably the most widely imitated 32 bit architecture. By the 1980's, most of the great microprocessors of the 1970's had been upgraded to support 32 bit words.

64 bits
First used as the primary word size by the Cray I; the DEC Alpha, 1992, was the first microprocessor to use this word size.
Byte -- Many computers built since 1960 have had a well defined notion of byte size; on some machines, Bytes were called characters. The number of bits per byte was generally determined by the machine designer's idea of how many bits it would take to represent one character.

6 bits
First used on the IBM 701 in the early 1950's, and used on essentially all machines designed prior to 1965. The predominance of word sizes that are a multiple of 6 on machines designed in this era is completely explained by the dominance of 6 bit bytes! In early usage, the term byte was never used! Most early machines that supported 6-bit things either called them characters or digits.

8 bits, called an Octet in ISO standards.
First used on the IBM 360 in 1965, and used on essentially all machines designed since that time.

9 bits
Occasionally used on machines with 36 bit words in those cases where 6 bits are not enough, as was the case with the when users of such machines wished to deal with the full 7-bit ASCII character set.

variable sized
The DEC PDP-6, PDP-10 and DECsystem 20 had no fixed byte size. Assembly language programmers could freely set the byte addressing hardware to use any size from one bit per byte up to a full 36 bit word per byte.
Nybble -- Some computers support operations on 4 bit quantities. Historically, these have been referred to as digits, because 4 bits allow the encoding of one binary coded decimal digit, but since 1970, it has been common to call such quantities nybbles, a term that began as a joke.
Chompe -- The joke began with the observation that a byte was 8 bits, and that a small byte should be called a nybble. A large byte, therefore, should be called a chompe. Humorous usage called a chompe 16 bits. Extrapolating to 32 bits ...
Halfword -- The IAS machine, ILLIAC, and the other early 40 bit machines had 20 bit instruction words. The IBM 701, with a 36 bit word, had 18 bit instructions. The IBM System 360, with a 32 bit word, had instructions that were multiples of 16 bits. On all of these machines, the term halfword was used for these units.

Coding

How many values can be coded in one bit? Two! We'll call them 0 and 1 for now. How many values can be coded by 2 bits? By 3 bits? By 4 bits? Here is a systematic listing of all possible combinations of 4 bits:

 0 0 0 0     1 0 0 0
 0 0 0 1     1 0 0 1
 0 0 1 0     1 0 1 0
 0 0 1 1     1 0 1 1            4
 0 1 0 0     1 1 0 0   4 bits, 2  = 16 possibilities.
 0 1 0 1     1 1 0 1
 0 1 1 0     1 1 1 0
 0 1 1 1     1 1 1 1

In general, to get n combinations, you need log2(n) bits (log2 is the log to the base 2).

How we chose to associate values with these bits is our business! We could decide that 0000 means Tuesday and 0001 means five, if we wanted, and, in fact, character sets frequently look that arbitrary! Consider the character set used by ILLIAC I; the table below was transcribed from a typewritten original in The ILLIAC Miniature Manual by John Halton (U. of Illinois Digital Computer Lab file no 260, Oct 22 1958, page 3):

ILLIAC Character Codes

              | Characters |n for 92
   Tape Holes | F/S    L/S | Orders
  -----------------------------------
        o        0      P       2F
        o  O     1      Q      66F
        o O      2      W     130F
        o OO     3      E     194F
        oO       4      R     258F
        oO O     5      T     322F
        oOO      6      Y     386F
        oOOO     7      U     450F
       Oo        8      I     514F
       Oo  O     9      O     578F
       Oo O      +      K     642F
       Oo OO     -      S     706F
       OoO       N      N     770F
       OoO O     J      J     834F
       OoOO      F      F     898F
       OoOOO     L      L     962F
      O o      Delay  Delay     3F
      O o  O   $(Tab)   D      67F
      O o O    CR/LF  CR/LF   131F
      O o OO     (      B     195F
      O oO   L/S=Letter-Shift 259F
      O oO O     ,      V     323F
      O oOO      )      A     387F
      O oOOO     /      X     451F
      OOo      Delay  Delay   515F
      OOo  O     =      G     579F
      OOo O      .      M     643F
      OOo OO F/S=Figure-Shift 707F
      OOoO       '      H     771F
      OOoO O     :      C     835F
      OOoOO      x      Z     899F
      OOoOOO   Space  Space   963F

This character set is coded in 5 bits, with L/S and F/S as shift codes for switching between two different and wildly disorganized sets of 32 characters! How the designers of this character set decided on the order of the letters is a mystery! Note that the capital O in the "tape holes" column stands for one in the binary code, and that blank stands for zero. The lower case o in the "tape holes" column shows where the sprocket hole in the tape goes; this is not part of the binary code, and is more like a decimal point, from the human reader's perspective.

The modern ISO/7 or ASCII character set, originally proposed in the early 1960's, is a significant improvement:

ASCII Character Codes

                          left 3 bits

                000 001 010 011 100 101 110 111
               ---------------------------------
         0000 | NUL DLE SP   0   @   P   `   p
         0001 | SOH DC1  !   1   A   Q   a   q
         0010 | STX DC2  "   2   B   R   b   r
         0011 | ETX DC3  #   3   C   S   c   s
         0100 | EOT DC4  $   r   D   T   d   t
         0101 | ENQ NAK  %   5   E   U   e   u
         0110 | ACK SYN  &   6   F   V   f   v
 Right   0111 | BEL ETB  '   7   G   W   g   w
4  bits  1000 | BS  CAN  (   8   H   X   h   x
         1001 | HT  EM   )   9   I   Y   i   y
         1010 | LF  SUB  *   :   J   Z   j   z
         1011 | VT  ESC  +   ;   K   [   k   {
         1100 | FF  FS   ,   <   L   \   l   |
         1101 | CR  GS   -   =   M   ]   m   }
         1110 | SO  RS   .   >   N   ^   n   ~
         1111 | SI  US   /   ?   O   _   o  DEL

This character set, now officially known as the ISO/7 character set, is the basis of all of the widely used character sets in common use today. One other character set from the "bad old days" is still with us, EBCDIC.

Binary Numbers

As with character, it is possible to associate arbitrary bit patterns with numeric values, but if arithmetic is to be perfomed on those values, things are far easier if a uniform character set is used. The basic binary number system dates back to ancient China, where it was used to order of the 64 combinations of I-Ching sticks (each stick could be in either of two states; the pattern that resulted from a throw of the sticks was used for fortune telling):

6-bit Binary Numbers

   0 = 000000  16 = 010000  32 = 100000  48 = 110000
   1 = 000001  17 = 010001  33 = 100001  49 = 110001
   2 = 000010  18 = 010010  34 = 100010  50 = 110010
   3 = 000011  19 = 010011  35 = 100011  51 = 110011
   4 = 000100  20 = 010100  36 = 100100  52 = 110100
   5 = 000101  21 = 010101  37 = 100101  53 = 110101
   6 = 000110  22 = 010110  38 = 100110  54 = 110110
   7 = 000111  23 = 010111  39 = 100111  55 = 110111
   8 = 001000  24 = 011000  40 = 101000  56 = 111000
   9 = 001001  25 = 011001  41 = 101001  57 = 111001
  10 = 001010  26 = 011010  42 = 101010  58 = 111010
  11 = 001011  27 = 011011  43 = 101011  59 = 111011
  12 = 001100  28 = 011100  44 = 101100  50 = 111100
  13 = 001101  29 = 011101  45 = 101101  61 = 111101
  14 = 001110  30 = 011110  46 = 101110  62 = 111110
  15 = 001111  31 = 011111  47 = 101111  63 = 111111

In looking formally at how the bit pattern for a number is related to its abstract value, we must distinguish clearly between representation and abstraction. Here, we will use a programming language notation, where unquoted numerals stand for abstract integer values, subject to arithmetic operations, while quoted strings stand for concrete representations of those values.

Thus, the string "011100" is just a string of symbols, with no inherent meaning. It could be equivalent to "eleven thousand one hundred" or it could be the binary representation of the number 28.

Consider, first, the formula that relates representations of decimal numbers to their values:

                i=digits(S)-1            i
     value(S) =      SUM     v(S[i]) × 10       (decimal)
                     i=0

     Where
           S is a string of digits
           value(S) is the numeric value assigned to that string
           digits(S) is the number of digits in S
           the rightmost digit in S is S[0]
           the leftmost digit in S is S[size(S)-1]
           v(d) is the numeric value associated with digit d

Users of the decimal system rarely use mathematical notation, as above; instead, we speak of the one's place, the ten's place, the hundred's place and so on.

The above formula applies equally to binary numbers, with the simple restriction that each binary digit, or bit may take on only two values, 0 and 1, and that the base of the number system is 2 instead of 10:

                i=digits(S)-1           i
     value(S) =      SUM     v(S[i]) × 2        (binary)
                     i=0

Similarly, in binary numbers, we speak of the one's place, the two's place, the four's place and the eight's place. The list of powers of 2 rapidly becomes familiar to users of binary numbers:

             i
    i       2

    0       1
    1       2
    2       4
    3       8
    4      16
    5      32
    6      64
    7     128
    8     256
    9     512
   10    1024  about 1000 or 1K (kilo)
   11    2048
   12    4096
   13    8192
   14   16384
   15   32768
   16   65536

   20 1048576  about 1,000,000 or 1M (mega)
   30          about 1,000,000,000 or 1G (giga)

How do we relate the characters used to represent digits to their abstract numeric values? The most general solution to this problem is to use some kind of lookup table, for example:

     to compute v(d)
         find the value of i such that key[i]=d
           where key = "0123456789"
         return that value of i

In practice, this is unnecessary because we arrange our character sets so that the numeric positions of the numerals in the character set are consecutive. This was true even with the quirky character set used on ILLIAC! Thus, the following works:

     to compute v(d)
         return rep('0')-rep(d)

Here, rep(c) returns the integer used to represent the character c.

Other Number Bases

Any integer may be used as the radix of a number system. The formula used above to evaluate numbers in bases 2 and 10 also works for r, an arbitrary radix:

                i=digits(S)-1           i
     value(S) =      SUM     v(S[i]) × r        (radix r)
                     i=0

Note that, for any radix r, the string 10 in that radix has, as its value, the radix. Thus, in binary, 10 means two, in decimal, the string 10 means ten, and in octal, the string 10 means eight. To avoid confusion, it is conventional to indicate the number base by a numerical subscript, as

      10110   =  26  =  22   =  211
           2       8      10       3

As long as the radix r is no greater than 10, the conventional digits, 0 through 9, can be used. For r greater than 10, additional symbols must be chosen for use as digits. Historically, a variety of symbols have been used! For example, in base 12 also called duodecimal, the digits T and E have been used (T for ten and E for eleven). Later, on ILLIAC I, the following digits were used for base 16, called sexadecimal in the documentation:

	0123456789KSNJFL

Today, the standard convention for bases greater than ten is to use A for ten, B for eleven, C for twelve, and so on, up to Z. This allows for bases as high as 36, although the highest commonly used number base is base 16, usually called hexadecimal, using the digits.

	0123456789ABCDEF

In the context of computer systems based on binary numbers, it is common to use bases 8 and 16. The reason for this is that it is generally easier for people to treat very long numbers in terms of groups of digits. For example, in dealing with large decimal numbers, we group the digits in threes. In effect, this grouping creates a number in base 1000 with three character names for each digit in that number and commas separating the digits. In the SI system of units, we even have standard adjectival forms for the place values in this system:

      1 kilo = 1,000
      1 mega = 1,000,000
      1 giga = 1,000,000,000
      1 tera = 1,000,000,000,000

Many computer science students think one k is 1024, but in fact, this usage is only approximate. In the world of disks, it is common to use strange mixed approximations, where mega means 1000×1024. The convention of using a subscript two on the SI prefixes to indicate that the nearest powers of two is intended has recently been suggested and seems to be a good idea! So,

      1 kilo  = 1024      1 mega  = 1,048,576
            2                   2

Having found such utility in grouping decimal digits into groups of three, it is natural to group binary digits similarly:

      100   = 1,100,100  = 144
         10            2      8

But, as is shown above, if each group of 3 bits is interpreted as a value between zero and seven, the result is an octal (base 8) number. Hexadecimal (Base 16) results from a similar grouping of binary digits in groups of 4, with digit value ranging from zero to fifteen.

      100   = 110,0100  = 64
         10           2     16

Finite Precision

On computers, we generally are forced to work in terms of a fixed word size, so our numbers are composed of a fixed number of bits. Leading zeros are used to pad a number out to the desired width. Thus, on a machine with a 16 bit word,

      100   = 0000,0000,0110,0100  = 0064
         10                      2       16

On this machine, the maximum representable number is

    65535   = 1111,1111,1111,1111  = FFFF
         10                      2       16

When deciding between octal and hexadecimal for compact representation of numbers, it is common to select one that divides words and bytes evenly, so for machines based on an 8 bit byte, hexadecimal seems a good choice, while for machines with a 6 bit byte, octal makes sense. There are exceptions to this! The Motorola 68000 is essentially a 16 bit machine with 8 bit bytes, but the instruction format is made of numerous 3 bit fields, so displaying 68000 instructions in octal makes good sense!