When we write computer programs, we always deal with representations. For example, we don't deal with text the way a calligrapher or author deals with text, we deal with a specific character set that may be used to represent text. Instead of dealing with integers or real numbers, we deal with finite precision representations of integers and we deal with floating point numbers.
Well designed programming languages allow the programmer to treat these representations as if they were identical to the abstractions they represent, but this identity is never complete. Thus, for example, our machine representations of integers behave exactly like abstract integers until a result greater than the machine precision is produced; at that point, the behavior of integer variables becomes very strange!
All modern computers use binary representations for their data, and use fixed size containers for data. There is no reason that computers could not be built that use other basic representations; "trinary" (base 3) logic devices aren't hard to build, and in fact, decimal computers were built in large numbers in the 1960's. There is no compelling benefit to using non-binary representations, though, and even the decimal machines of the 1960's were built using binary logic and used binary-coded-decimal (BCD) internally.
The bit is the fundamental unit of information. A bit has two values, conventionally called zero or one. We could as easily name the values on and off, high and low, or true and false; in fact, all of those names are used in different contexts, and the truth is that all such names are arbitrary!
Over the past 50 years, bits have been grouped inside computers in many ways, and these groupings have been given many names:
How many values can be coded in one bit? Two! We'll call them 0 and 1 for now. How many values can be coded by 2 bits? By 3 bits? By 4 bits? Here is a systematic listing of all possible combinations of 4 bits:
0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 1 4 0 1 0 0 1 1 0 0 4 bits, 2 = 16 possibilities. 0 1 0 1 1 1 0 1 0 1 1 0 1 1 1 0 0 1 1 1 1 1 1 1In general, to get n combinations, you need log2(n) bits (log2 is the log to the base 2).
How we chose to associate values with these bits is our business! We could decide that 0000 means Tuesday and 0001 means five, if we wanted, and, in fact, character sets frequently look that arbitrary! Consider the character set used by ILLIAC I; the table below was transcribed from a typewritten original in The ILLIAC Miniature Manual by John Halton (U. of Illinois Digital Computer Lab file no 260, Oct 22 1958, page 3):
| Characters |n for 92
Tape Holes | F/S L/S | Orders
-----------------------------------
o 0 P 2F
o O 1 Q 66F
o O 2 W 130F
o OO 3 E 194F
oO 4 R 258F
oO O 5 T 322F
oOO 6 Y 386F
oOOO 7 U 450F
Oo 8 I 514F
Oo O 9 O 578F
Oo O + K 642F
Oo OO - S 706F
OoO N N 770F
OoO O J J 834F
OoOO F F 898F
OoOOO L L 962F
O o Delay Delay 3F
O o O $(Tab) D 67F
O o O CR/LF CR/LF 131F
O o OO ( B 195F
O oO L/S=Letter-Shift 259F
O oO O , V 323F
O oOO ) A 387F
O oOOO / X 451F
OOo Delay Delay 515F
OOo O = G 579F
OOo O . M 643F
OOo OO F/S=Figure-Shift 707F
OOoO ' H 771F
OOoO O : C 835F
OOoOO x Z 899F
OOoOOO Space Space 963F
This character set is coded in
5 bits, with L/S and F/S as shift codes for switching between two different
and wildly disorganized sets of 32 characters! How the designers of this
character set decided on the order of the letters is a mystery!
Note that the capital O in the "tape holes" column stands for one in the
binary code, and that blank stands for zero. The lower case o in the "tape
holes" column shows where the sprocket hole in the tape goes; this is not
part of the binary code, and is more like a decimal point, from the human
reader's perspective.
The modern ISO/7 or ASCII character set, originally proposed in the early 1960's, is a significant improvement:
left 3 bits
000 001 010 011 100 101 110 111
---------------------------------
0000 | NUL DLE SP 0 @ P ` p
0001 | SOH DC1 ! 1 A Q a q
0010 | STX DC2 " 2 B R b r
0011 | ETX DC3 # 3 C S c s
0100 | EOT DC4 $ r D T d t
0101 | ENQ NAK % 5 E U e u
0110 | ACK SYN & 6 F V f v
Right 0111 | BEL ETB ' 7 G W g w
4 bits 1000 | BS CAN ( 8 H X h x
1001 | HT EM ) 9 I Y i y
1010 | LF SUB * : J Z j z
1011 | VT ESC + ; K [ k {
1100 | FF FS , < L \ l |
1101 | CR GS - = M ] m }
1110 | SO RS . > N ^ n ~
1111 | SI US / ? O _ o DEL
This character set, now officially known as the ISO/7 character set, is the
basis of all of the widely used character sets in common use today. One
other character set from the "bad old days" is still with us,
EBCDIC.
As with character, it is possible to associate arbitrary bit patterns with numeric values, but if arithmetic is to be perfomed on those values, things are far easier if a uniform character set is used. The basic binary number system dates back to ancient China, where it was used to order of the 64 combinations of I-Ching sticks (each stick could be in either of two states; the pattern that resulted from a throw of the sticks was used for fortune telling):
0 = 000000 16 = 010000 32 = 100000 48 = 110000 1 = 000001 17 = 010001 33 = 100001 49 = 110001 2 = 000010 18 = 010010 34 = 100010 50 = 110010 3 = 000011 19 = 010011 35 = 100011 51 = 110011 4 = 000100 20 = 010100 36 = 100100 52 = 110100 5 = 000101 21 = 010101 37 = 100101 53 = 110101 6 = 000110 22 = 010110 38 = 100110 54 = 110110 7 = 000111 23 = 010111 39 = 100111 55 = 110111 8 = 001000 24 = 011000 40 = 101000 56 = 111000 9 = 001001 25 = 011001 41 = 101001 57 = 111001 10 = 001010 26 = 011010 42 = 101010 58 = 111010 11 = 001011 27 = 011011 43 = 101011 59 = 111011 12 = 001100 28 = 011100 44 = 101100 50 = 111100 13 = 001101 29 = 011101 45 = 101101 61 = 111101 14 = 001110 30 = 011110 46 = 101110 62 = 111110 15 = 001111 31 = 011111 47 = 101111 63 = 111111In looking formally at how the bit pattern for a number is related to its abstract value, we must distinguish clearly between representation and abstraction. Here, we will use a programming language notation, where unquoted numerals stand for abstract integer values, subject to arithmetic operations, while quoted strings stand for concrete representations of those values.
Thus, the string "011100" is just a string of symbols, with no inherent meaning. It could be equivalent to "eleven thousand one hundred" or it could be the binary representation of the number 28.
Consider, first, the formula that relates representations of decimal numbers to their values:
i=digits(S)-1 i
value(S) = SUM v(S[i]) × 10 (decimal)
i=0
Where
S is a string of digits
value(S) is the numeric value assigned to that string
digits(S) is the number of digits in S
the rightmost digit in S is S[0]
the leftmost digit in S is S[size(S)-1]
v(d) is the numeric value associated with digit d
Users of the decimal system rarely use mathematical notation, as above; instead, we speak of the one's place, the ten's place, the hundred's place and so on.
The above formula applies equally to binary numbers, with the simple restriction that each binary digit, or bit may take on only two values, 0 and 1, and that the base of the number system is 2 instead of 10:
i=digits(S)-1 i
value(S) = SUM v(S[i]) × 2 (binary)
i=0
Similarly, in binary numbers, we speak of the one's place, the two's place,
the four's place and the eight's place. The list of powers of 2 rapidly
becomes familiar to users of binary numbers:
i
i 2
0 1
1 2
2 4
3 8
4 16
5 32
6 64
7 128
8 256
9 512
10 1024 about 1000 or 1K (kilo)
11 2048
12 4096
13 8192
14 16384
15 32768
16 65536
20 1048576 about 1,000,000 or 1M (mega)
30 about 1,000,000,000 or 1G (giga)
How do we relate the characters used to represent digits to their abstract
numeric values? The most general solution to this problem is to use some
kind of lookup table, for example:
to compute v(d)
find the value of i such that key[i]=d
where key = "0123456789"
return that value of i
In practice, this is unnecessary because we arrange our character sets so that
the numeric positions of the numerals in the character set are consecutive.
This was true even with the quirky character set used on ILLIAC!
Thus, the following works:
to compute v(d)
return rep('0')-rep(d)
Here, rep(c) returns the integer used to represent the character c.
Any integer may be used as the radix of a number system. The formula used above to evaluate numbers in bases 2 and 10 also works for r, an arbitrary radix:
i=digits(S)-1 i
value(S) = SUM v(S[i]) × r (radix r)
i=0
Note that, for any radix r, the string 10 in that radix has, as its value, the radix. Thus, in binary, 10 means two, in decimal, the string 10 means ten, and in octal, the string 10 means eight. To avoid confusion, it is conventional to indicate the number base by a numerical subscript, as
10110 = 26 = 22 = 211
2 8 10 3
As long as the radix r is no greater than 10, the conventional digits,
0 through 9, can be used. For r greater than 10, additional symbols
must be chosen for use as digits. Historically, a variety of symbols
have been used! For example, in base 12 also called duodecimal, the
digits T and E have been used (T for ten and E for eleven). Later,
on ILLIAC I, the following digits were used for base 16, called sexadecimal
in the documentation:
0123456789KSNJFLToday, the standard convention for bases greater than ten is to use A for ten, B for eleven, C for twelve, and so on, up to Z. This allows for bases as high as 36, although the highest commonly used number base is base 16, usually called hexadecimal, using the digits.
0123456789ABCDEFIn the context of computer systems based on binary numbers, it is common to use bases 8 and 16. The reason for this is that it is generally easier for people to treat very long numbers in terms of groups of digits. For example, in dealing with large decimal numbers, we group the digits in threes. In effect, this grouping creates a number in base 1000 with three character names for each digit in that number and commas separating the digits. In the SI system of units, we even have standard adjectival forms for the place values in this system:
1 kilo = 1,000
1 mega = 1,000,000
1 giga = 1,000,000,000
1 tera = 1,000,000,000,000
Many computer science students think one k is 1024, but in fact, this
usage is only approximate. In the world of disks, it is common to
use strange mixed approximations, where mega means 1000×1024.
The convention of using a subscript two on the SI prefixes to indicate
that the nearest powers of two is intended has recently been suggested
and seems to be a good idea! So,
1 kilo = 1024 1 mega = 1,048,576
2 2
Having found such utility in grouping decimal digits into groups of three,
it is natural to group binary digits similarly:
100 = 1,100,100 = 144
10 2 8
But, as is shown above, if each group of 3 bits is interpreted as a value
between zero and seven, the result is an octal (base 8) number. Hexadecimal
(Base 16) results from a similar grouping of binary digits in groups of 4,
with digit value ranging from zero to fifteen.
100 = 110,0100 = 64
10 2 16
On computers, we generally are forced to work in terms of a fixed word size, so our numbers are composed of a fixed number of bits. Leading zeros are used to pad a number out to the desired width. Thus, on a machine with a 16 bit word,
100 = 0000,0000,0110,0100 = 0064
10 2 16
On this machine, the maximum representable number is
65535 = 1111,1111,1111,1111 = FFFF
10 2 16
When deciding between octal and hexadecimal for compact representation
of numbers, it is common to select one that divides words and bytes evenly,
so for machines based on an 8 bit byte, hexadecimal seems a good choice, while
for machines with a 6 bit byte, octal makes sense. There are exceptions to
this! The Motorola 68000 is essentially a 16 bit machine with 8 bit bytes,
but the instruction format is made of numerous 3 bit fields, so displaying
68000 instructions in octal makes good sense!