10. Floating Point

Part of 22C:60, Computer Organization Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The Hawk Floating Point Coprocessor

The Hawk architecture includes two opcodes reserved for use by a floating-point coprocessor, one to give data to the coprocessor and one to get data from it. The operations given here are only a subset of the operations a typical floating point coprocessor might permit. Real hardware might support 64-bit operands and advanced math functions as well.

The Hawk coprocessor set instruction
15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0 0 0 1 dst 0 0 1 0 x

We can use the shorthand notation coprocessorset(R[dst],x) for this instruction, where it is up to the coprocessor what to do with the constant x from the instruction. A floating point coprocessor with two 64-bit accumulators A[0] and A[1] might use x as follows.

x       operation
0 0 0 a set A[a] to a 32-bit integer value
0 0 1 a set A[a] to a 32-bit floating value
1 0 0 a add 32-bit floating value to A[a]
1 0 1 a subtract 32-bit floating value from A[a]
1 1 0 a multiply A[a] by 32-bit floating value
1 1 1 a divide A[a] by 32-bit floating value

The coprocessor set instructions all transfer values to the coprocessor. If the coprocessor is currently busy, the instruction may have to wait before transferring data to the coprocessor. If the operation initiates arithmetic, the coprocessor will become busy as a result.

The Hawk coprocessor get instruction
15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0 0 0 1 dst 0 0 1 1 x

We can use the shorthand notation R[dst]=coprocessorget(x) for this instruction, where it is up to the coprocessor what to do with the constant x from the instruction. A floating point coprocessor with two accumulators might use X as follows:

x       operation
0 0 0 a get integer part of floating point value A[a]
0 0 1 a get 32-bit floating value of A[a]

The coprocessor get instructions wait until the coprocessor is idle before returning a result. They all set the condition codes to indicate whether the floating point value is negative or zero with the N and Z condition codes. In addition, if the floating point value represents an infinite value or if it is too big to represent as an integer, the V condition code is set.

Exercises

a) The operations FPUT R,OP,A puts the constant R into floating point register A using operation OP, where the operations are PUTI, PUTF, ADDF, SUBF, MULF, and DIVF. Define a macro to assemble the FPUT operation, defining any auxiliary symbols required.

b) Suppose we wanted to support 64-bit floating point operands. Given what you know about the Hawk architecture, suggest how you would expect to use coprocessor put and get operations to load such operands.

IEEE Floating Point Format

It is not sufficient to say that we have a floating point coprocessor that supports 32-bit floating point values. We must also define the data format used by this processor. Most modern computers support the same floating point format, a format defined by the Institute for Electrical and Electronic Engineers, the IEEE. This format follows a general outline that is very similar to most floating point formats supported by floating point hardware since the early 1960's, but it does have some eccentric and occasionally difficult to explain features.

The IEEE standard includes both 32 and 64-bit floating point numbers, but for this discussion, we will ignore the latter and focus the simple 32-bit format.

The IEEE standard single-precision floating-point number representation
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
S exponent mantissa

In the IEEE floating point formats, like most others hardware floating point formats, the most significant bit holds the sign of the mantissa, and the mantissa is stored in signed magnitude form. The magnitude of the mantissa of a 32-bit floating-point number is given to 24 bits of precision, while the exponent is stored in the 8 remaining bits. IEEE double-precision numbers differ from the above in two ways. First, they have a 64-bit representation, and second, they have a 16-bit exponent instead of an 8-bit exponent.

A Normalized Mantissa

The IEEE format has some rules that may be a bit puzzling. The first hint at these rules comes from adding up the numbers of bits in each field. In the above paragraph, it was stated that the number had one sign bit, 8 bits of exponent, and 24 bits of mantissa. Add the numbers of bits in each field and you get 33, one more bit than the word size. Something is fishy here!

The trick lies in the fact that, except in some very special cases, the high bit of the mantissa of a number in IEEE format is always one. If a bit always has the same value, there is never any reason to store it! Instead, we only store the bits that change from one value to the next. Thus, we only store 23 bits of the 24 bit mantissa.

To understand what is going on, it is helpful to refer to scientific notation, as used in the sciences. In decimal, all of the following representations are all exactly equivalent:

60221419.9 ×  1016
60221.4199×1019
602.214199×1021
60.2214199×1022
6.02214199×1023
0.602214199×1024

Of these, only 6.02... × 1023 is considered to be properly in scientific notation. In scientific notation, the mantissa is always represented as a fixed-point decimal number between 1.000 and 9.999... The only exception is zero. When we find a number that has a mantissa that does not satisfy this rule, we normalize it by moving the point (and adjusting the exponent accordingly) until it satisfies this rule.

IEEE format has a similar normalization rule: The fixed point binary mantissa is always between 1.000 and 1.111... Thus, the most significant bit is always one. This is the hidden bit that we do not bother storing in the IEEE representation.

The hidden bit in the IEEE mantissa
31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
S nonzero exponent 00000000 00000000 0000000
interpreted as 1. 00000000 00000000 0000000

Floating point representations without a hidden bit tend to waste one bit of memory. This is not generally a problem with high precision representations such as those using 48 or 64 bits, but for 32 bit numbers, every bit of the mantissa is precious and the extra bit of precision provided by the hidden bit can be significant.

The Biased Exponent

The second odd feature of the IEEE format is that the exponent is given as a biased signed integer with the eccentric bias of 127, and the normal range of exponents excludes both the smallest and largest values, 000000002 and 111111112. An exponent of zero is reserved for a mantissa of zero or for unnormalized (extraordinarily small) values. The exponent of zero implies that the hidden bit is zero. An exponent of all ones is reserved for infinity (with a mantissa of zero) and for values that the IEEE calls NaNs, where NaN stands for not a number. The mantissa field of a NaN may be put to a variety of uses, for example, being used to remember how the non-number came to be, but this is rarely done.

Because of the odd bias of 127 for exponents, the exponent one is represented as 100000002, zero is 011111112, and negative one is 011111102. There is a competing but equivalent presentation of the IEEE format that presents the bias as 128 and places the point in the mantissa differently relative to the hidden bit. The different presentations of the IEEE system make no difference in the number representations, but they can be confusing when comparing different presentations of the number system from different textbooks. The following table shows IEEE floating-point numbers, given in binary, along with their interpretations.

Example IEEE single-precision floating-point numbers
Infinity and NaN
0 11111111 00000000000000000000000=Infinity
1 11111111 00000000000000000000000=-Infinity
0 11111111 00000000000010000000000=NaN
1 11111111 00000010001111101000110=NaN
Normalized numbers
0 10000000 00000000000000000000000= +1.0 × 21 * 1.002 = 2
0 01111110 00000000000000000000000= +1.0 × 2-1 * 1.002 = 0.5
0 01111111 10000000000000000000000= +1.0 × 2-1 * 1.102 = 1.5
0 01111111 11000000000000000000000= +1.0 × 2-1 * 1.112 = 1.75
1 01111111 11000000000000000000000= -1.0 × 2-1 * 1.112 = -1.75
Unnormalized numbers
0 00000001 00000000000000000000000= +1.0 × 2-126 * 1.002 = 2-126
0 00000000 10000000000000000000000= +1.0 × 2-126 * 0.102 = 2-127
0 00000000 01000000000000000000000= +1.0 × 2-126 * 0.012 = 2-128
0 00000000 00000000000000000000000= +1.0 × 2-126 * 0.002 = 0
1 00000000 00000000000000000000000= -1.0 × 2-126 * 0.002 = 0

Exercises

c) Give the 32-bit IEEE representation for 10010.

d) Give the 32-bit IEEE representation for 100010.

e) Give the 32-bit IEEE representation for 0.110.

f) Give the 32-bit IEEE representation for 0.0110.

Software Floating Point

Our Hawk emulator does not have a floating point coprocessor. Therefore, we must implement floating point operations entirely with software. This forces us to examine the algorithms that underly floating point arithmetic, and it provides us with an extended example of a software module that implements a complex class.

The interface specification should list all of the operations on objects of that class, and for each operation, it should specify the parameters expected, constraints on those parameters, and the nature of the result. It is good practice to add documentation to the interface specification, so that it serves as a manual for users of the class as well as a formal interface.

For our floating-point class, the set of operations is fairly obvious. We want operators that add, subtract, multiply and divide floating-point numbers, and we also want operators that return the integer part of a number, and that convert integers to floating-point form. We probably want other operations, but we will forgo those for now.

In most object-oriented programming languages, a strong effort is made to avoid copying objects from place to place. Instead, objects sit in memory and object handles are used to refer to them. The handle for an object is actually just a pointer to that object, that is, a word holding the address of the object. Therefore, our floating point operators will take, as parameters, the addresses of their operands, not the values.

Finally, the interface specificaiton for a class must indicate how to allocate storage for an element of that class. The only thing a user of the object needs to know is the size of the object, not the internal details of its representation. The following interface specification for our Hawk floating point package assumes that each floating point number is stored in two words of memory, an exponent and a mantissa of one word each.

Interface specification for a Hawk floating-point package
        TITLE float.h, interface specification for float.a

FLOATSIZE = 8   ; size of a floating point number, in bytes.
                ;   the user gets no other information about the
                ;   number representation

                ; for all calling sequences here:
                ;   R1 = return address
                ;   R2 = pointer to activation record
                ;   R3-7 = parameters and temporaries
                ;   R8-15 = guaranteed to be saved and restored

                ; functions that return floating values use:
                ;   R3 = pointer to place to put return value
                ;       the caller must pass this pointer!
        ALIGN   4
        EXT     FLOAT   ; convert integer to floating
PFLOAT: W       FLOAT   ; on entry, R3 = pointer to floating result
                        ;           R4 = integer to convert

        EXT     FLTINT  ; convert floating to integer
PFLTINT:W       FLTINT  ; on entry, R3 = pointer to floating value
                        ; on exit,  R3 = integer return value

        EXT     FLTCPY  ; copy a floating point number
PFLTCPY:W       FLTCPY  ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to floating operand

        EXT     FLTTST  ; test sign and zeroness of floating number
PFLTTST:W       FLTTST  ; on entry, R3 = pointer to floating value
                        ; on exit,  R3 = integer -1, 0 or 1

        EXT     FLTADD  ; add floating-point numbers
PFLTADD:W       FLTADD  ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to addend
                        ;           R5 = pointer to augend

 
Interface specification, continued
        EXT     FLTSUB  ; subtract floating-point numbers
PFLTSUB:W       FLTSUB  ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to subtrahend
                        ;           R5 = pointer to minuend

        EXT     FLTNEG  ; negate a floating-point number
PFLTNEG:W       FLTNEG  ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to operand

        EXT     FLTMUL  ; multiply floating-point numbers
PFLTMUL:W       FLTMUL  ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to multiplicand
                        ;           R5 = pointer to multiplier

        EXT     FLTDIV  ; divide floating-point numbers
PFLTDIV:W       FLTDIV  ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to multiplicand
                        ;           R5 = pointer to multiplier

Exercises

g) Write a main program that uses a separate common block of size FLOATSIZE to hold each floating point variable it needs in the computation of the floating point representation of 0.1, computed by converting the integer constants 1 and 10 to floating point and then dividing 1.0 by 10.0. This should call FLOAT several times, and then make just one call to FLTDIV. Your main program will, of course, use the file float.h.

h) Write a separately compilable subroutine called SQUARE that takes 2 pointers to floating point numbers as parameters and returns the square of the second number in the first. Don't forget to write an appropriate interface specification, and comment everyting appropriately, including an indication of the file names that should be used.

A floating point representation

The simplest way to represent a floating point number for a software implementation of floating point operations is as a pair of words, one holding the exponent and the other holding the mantissa. We need more detail. Which word is which? What is the range of exponent values? How do we represent the sign of the exponent? How is the mantissa normalized? How do we represent non-normalized values such as zero?

On a computer that supports two's complement integers, it makes sense to represent the exponent and mantissa as two's complement values. We can represent zero using a mantissa of zero; technically, when the mantissa is zero, the exponent does not matter, but we will always set the exponent to the smallest (most negative) possible value.

The more difficult question is, where is the point in our two's complement mantissa? We could put the point anywhere and make it work, but the two obvious choices are to use an integer mantissa or to put the point immediately to the right of the sign bit. Here, we opt for the latter case, and we will normalize the mantissa so that the bit immediately to the right of the point is always a one. This is equivalent to changing the normalization rules for decimal scientific notation so that 0.602×1024 is considered to be normalized. The following examples illustrate this number format.

A floating-point number representation
exponent  00000000000000000000000000000000 +0.5 × 20 = 0.5
mantissa  01000000000000000000000000000000
exponent  00000000000000000000000000000001 +0.5 × 21 = 1.0
mantissa  01000000000000000000000000000000
exponent  00000000000000000000000000000001 -0.5 × 21 = -1.0
mantissa  11000000000000000000000000000000
exponent  00000000000000000000000000000001 +0.75 × 21 = 1.5
mantissa  01100000000000000000000000000000
exponent  00000000000000000000000000000001 -0.75 × 21 = -1.5
mantissa  10100000000000000000000000000000
exponent  11111111111111111111111111111111 +0.5 × 2-1 = 0.25
mantissa  01000000000000000000000000000000
exponent  11111111111111111111111111111111 +0.5 × 2-1 = 0.25
mantissa  01000000000000000000000000000000
exponent  11111111111111111111111111111101 +0.5 × 2-3 = 0.0625
mantissa  01000000000000000000000000000000
exponent  11111111111111111111111111111101 ~8/10 × 2-3 = 0.1...
mantissa  01100110011001100110011001100110
exponent  11111111111111111111111111111101 ~-8/10 × 2-3 = -0.1...
mantissa  10011001100110011001100110011010

Exercises

i) In this number system, what is the largest possible positive value (in binary!).

j) In this number system, what is the smallest possible positive nonzero normalized value?

k) In this number system, how is 10.010 represented.

l) In this number system, how is 100.010 represented.

Normalizing a floating point number

Many operations on floating point numbers produce results that are unnormalized, and these must be normalized before performing additional operations on them. If this is not done, there will be a loss of precision in the results; This is one reason that classical scientific notation is always presented in normalized form. To normalize a floating point number, we must distinguish some special cases: First, is the number zero? Zero is a special case. Second, is the number negative? Because we opted to represent our mantissa in two's complement form, negative numbers require special treatment. This is why many hardware floating-point systems use signed magnitude.

The normalize subroutine is not part of the public interface to our floating point package, but rather, it is a private component, used as the final step of most floating point operations. Therefore, we need not follow the calling conventions of the interface specification. Here, we will pass operands in registers instead of passing pointers to memory. We will use registers 3 and 4 to hold the exponent and mantissa of the number to be normalized on both entrance and exit.

Normalizing a floating-point number
        SUBTITLE normalize

NORMALIZE:      ; normalize floating point number
                ; link through R1
                ; R3 = exponent on entry and exit
                ; R4 = mantissa on entry and exit
                ; no other registers used
        TESTR   R4
        BZR     NRMNZ           ; if (mantissa == 0) {

        LIL     R3,#800000
        SL      R3,8            ;   exponent = 0x80000000;
        JUMPS   R1              ;   return;

NRMNZ:  BNS     NRMNEG          ; } /* else */ if (mantissa > 0) {
NRMPLP:                         ;   while
        BITTST  R4,30
        BCS     NRMPRT          ;     ((mantissa & 0x40000000) == 0) {
        SL      R4,1            ;     mantissa = mantissa << 1;
        ADDSI   R3,-1           ;     exponent = exponent - 1;
        BR      NRMPLP          ;   }
NRMPRT:
        JUMPS   R1              ;   return;

NRMNEG:                         ; } /* else if */ { /* mantissa < 0 */
        ADDSI   R4,-1           ;   mantissa = mantissa - 1;
                                ;   /* mantissa now in one's complement form */
NRMNLP:                         ;   while
        BITTST  R4,30
        BCR     NRMNRT          ;     ((mantissa & 0x40000000) != 0) {
        SL      R4,1            ;     mantissa = mantissa << 1;
        ADDSI   R3,-1           ;     exponent = exponent - 1;
        BR      NRMPLP          ;   }
NRMNRT:
        ADDSI   R4,1            ;   mantissa = mantissa + 1;
                                ;   /* mantissa now in two's complement form */
        JUMPS   R1              ;   return;
                                ; }

There are two tricks in this code worth mention. First, this code uses the BITTST instruction to test bit 30 of the mantissa. This instruction moves the indicated bit to the C condition code; in fact, the assembler converts this instruction to either a left or a right shift to move the indicated bit into the carry bit while discarding the shifted result using R0. In C, C++ or Java, in contrast, inspection of one bit of a word is most easily expressed by anding that word with a constant with just that bit set. The second trick involves normalizing negative numbers. In the example values presented above, note that the representation of -0.5 has bit 30 set to 1, while -0.75 has it set to zero. By subtracting one from the least significant bit of each negative value, we can convert to one's complement, allowing us to take advantage of the fact that bit 30 of the one's complement representation of normalized mantissas is always zero.

Exercises

m) The above code does not detect underflow! If it decrements the exponent below the smallest legal value, it produces the highest legal value. Rewrite the code to make it produce a value of zero whenever decrementing the exponent would underflow.

Floating to Integer and Integer to Floating Conversion

Conversion from integer to floating point is surprisingly simple! All we need to do is set the exponent to 31, set the mantissa to the desired integer, and then normalize. This is because the fixed point fractions we use to represent the mantissa can be viewed as integer counts in units of 2-31.

Integer to Floating Conversion on the Hawk
; format of a floating point number stored in memory
EXPONENT  = 0
MANTISSA  = 4
FLOATSIZE = 8

        SUBTITLE integer to floating conversion

FLOAT:                  ; on entry, R3 = pointer to floating result
                        ;           R4 = integer to convert
        MOVE    R5,R1           ; R5 = return address

        MOVE    R6,R3           ; R6 = pointer to floating result
        LIS     R3,31           ; exponent = 31; /* R3-4 is now floating */
        JSR     R1,NORMALIZE    ; normalize( R3-4 );
        STORES  R3,R6           ; result->exponent = exponent;
        STORE   R4,R6   MANTISSA; result->mantissa = mantissa;

        JSRS    R5              ; return; /* uses saved return address! */

Conversion of floating-point numbers to integer is a bit more complex, but only because we have no pre-written denormalize routine that will set the exponent field to 31. Instead, we need to write this ourselves! Where the normalize routine shifted the mantissa left and decremented the exponent until the number was normalized, the floating to integer conversion routine will have to shift the mantissa right and increment the exponent until the exponent has the value 31.

What if the exponent was greater than 31? In this case, the integer part of the number is too large to represent in 32 bits, so we should raise an exception, or, lacking a model of how to write exception handlers, we could set the overflow condition code. This is left as an exercise for the reader.

Floating to Integer Conversion on the Hawk
        SUBTITLE floating to integer conversion

FLTINT:                 ; on entry, R3 = pointer to floating value
                        ; on exit   R3 = integer result
        LOADS   R4,R3           ; R4 = argument->exponent
        LOAD    R3,R3,MANTISSA  ; R3 = argument->mantissa
FINTLP:                         ; while
        CMPI    R4,31
        BGE     FINTLX          ;   (exponent < 31) {
        SR      R3,1            ;   mantissa = mantissa >> 1
        ADDSI   R4,1            ;   exponent = exponent + 1;
        BR      FINTLP          ; }
FINTLX:
        ; unchecked error condition: exponent > 31 implies overflow
        JUMPS   R1              ; return denormalized mantissa

Exercises

n) The above code for floating to integer conversion truncates the result in an odd way for negative numbers. If the floating point input value is -1.5, what integer does it return? Why?

o) The above code for floating to integer conversion truncates the result in an odd way for negative numbers. Fix the code so that it truncates the way a naive programmer would expect.

p) The above code for floating to integer conversion truncates, but sometimes, it is desirable to have a version that rounds a number to the nearest integer. Binary numbers can be rounded by adding one in the most significant digit that will be discarded, that is, in the 0.5's place. Write code for FLTROUND that does this.

q) The above code for floating to integer conversion could do thousands of right shifts for numbers with very negative exponents! This is an utter waste. Modify the code so that it automatically recognizes these extreme cases and returns a value of zero whenever more than 32 shifts would be required.

Floating Point Addition

We are now ready to explore the implementation of some of the floating point operations. These follow quite naturally from the standard rules for working with numbers in scientific notation. Consider the problem of adding 9.92×103 to 9.25×101. We begin by denormalizing the numbers so that they have the same exponents; this allows us to add the mantissas, after which we renormalize the result and round it to the appropriate number of decimal places:

Adding in scientific notation
given 9.92 × 103 + 9.25 × 101
denormalized 9.92 × 103 + 0.0925 × 103
rearranged (9.92 + 0.0925) × 103
added 10.0125 × 103
normalized 1.00125 × 104
rounded 1.00 × 104

The final rounding step is one many students forget, particularly in this era of scientific calculators. For numbers given in scientific notation, we have the convention that the number of digits given is an indication of the precision of the measurements from which the numbers were taken. As a result, if two numbers are given in scientific notation and then added or subtracted, the result should not be expressed to greater precision than the least precise of the operands! When throwing away the less significant digits of the result, we always round in order to minimise the loss of information and introduction of systematic error that would result from truncation.

An important question arises here: Which number do we denormalize prior to adding? The the answer is, we never want to lose the most significant digits of the sum, so we always increase the smaller of the two exponents while shifting the corresponding mantissa to the right. In addition, we are seriously concerned with preventing a carry out of the high digit of the result; this caused no problem with pencil and paper, but if we do this in software, we must be prepared to recover from overflow in the sum! This problem is solved in the following floating point add subroutine for the Hawk:

Adding two floating point numbers on the Hawk
        SUBTITLE floating add

; activation record format
RETAD   =       0       ; return address
R8SAVE  =       4       ; place to save R8

FLTADD:         ; on entry, R3 = pointer to floating sum
                ;           R4 = pointer to addend
                ;           R5 = pointer to augend
        STORES  R1,R2           ; save return address
        STORE   R8,R2,R8SAVE    ; save R8
        MOVE    R7,R3           ; R7 = saved pointer to sum
        LOADS   R3,R4           ; R3 = addend.exponent
        LOAD    R4,R4,MANTISSA  ; R4 = addend.mantissa
        LOAD    R6,R5,MANTISSA  ; R6 = augend.mantissa
        LOADS   R5,R5           ; R5 = augend.exponent
        CMP     R3,R5
        BLE     FADDEI          ; if (addend.exponent > augend.exponent) {
        MOVE    R8,R3
        MOVE    R3,R5
        MOVE    R5,R8           ;   exchange exponents
        MOVE    R8,R4
        MOVE    R4,R6
        MOVE    R6,R8           ;   exchange mantissas
FADDEI:                         ; }
                                ; assert (addend.exponent <= augend.exponent)
FADDDL:                         ; while
        CMP     R3,R5
        BGE     FADDDX          ;   (addend.exponent < augend.exponent) {
        ADDSI   R3,1            ;   increment addend.exponent
        SR      R4,1            ;   shift addend.mantissa
        BR      FADDDL
FADDDX:                         ; }
                                ; assert (addend.exponent = augend.exponent)
        ADD     R4,R6           ; add mantissas
        BOR     FADDNO          ; if (overflow) { /* we need one more bit */
        ADDSI   R3,1            ;   increment result.exponent
        SR      R4,1            ;   shift result.mantissa
        SUB     R0,R0,R0        ;   set carry bit in order to ...
        ADJUST  R4,CMSB         ;   flip sign bit of result (overflow repaired!)
FADDNO:                         ; }
        JSR     R1,NORMALIZE    ; normalize( result )
        STORES  R3,R7           ; save result.exponent
        STORE   R4,R7,MANTISSA  ; save result.mantissa
        LOAD    R8,R2,R8SAVE    ; restore R8
        LOADS   R1,R2           ; restore return address
        JUMPS   R1              ; return!

Most of this code follows simply from the logic of adding that we demonstrated with the addition of two numbers using scientific notation. There are some points, however, that are worthy of note.

First, about 1/3 of the way down, this code exchanges the two numbers; There are many ways to exchange two registers. This code uses the simplest to explain: First, set the value in one register aside, move the other register, and then move the set-aside value into place. This takes three move instructions and a spare register. There are other ways to do this that are just as fast without using an extra register. The most famous of these uses the exclusive or operator: a=a⊕b;b=a⊕b;a=a⊕b.

This code uses registers 1 to 7 and it both calls NORMALIZE and needs an extra register for the exchange discussed above. Therefore, it needs to use its activation record. The activation record has two fields, one for saving register 1 to allow the call to NORMALIZE, and one for saving register 8, freeing it for local use. While FLTADD uses its activation record, NORMALIZE does not. Therefore, this code does not adjust the stack pointer before or after calling NORMALIZE.

Finally, note that addition can lead to overflow. After addition, if the sign is wrong, it has the correct value as the most significant magnitude bit, as if there was an invisible sign bit to the left of it. In this case, we do a signed right shift to make space for the new sign bit (incrementing the exponent to compensate) and then we complement the sign by adding one to it using the ADJUST instruction.

Exercises

r) The floating point add code given here is rather stupid about shifting. It should never right-shift the lesser of the two addends more than 32 times! Fix this!

s) Fix this code so that the denormalize step rounds the lesser of the two addends by adding one to the least significant bit just prior to the final right shift operation.

Floating Point Multiplication

Floating point multiplication is actually simpler than floating point addition. As in scientific notation, we just add the exponents, multiply the mantissas and normalize the result, as illustrated below:

Multiplication in scientific notation
given 1.02 × 103  ×  9.85 × 101
rearranged (1.02 × 9.85) × 10(3 + 1)
multiplied 10.047 × 104
normalized 1.0047 × 105
rounded 1.00 × 105

There is a matter of precision. Multiplying two 31-bit fixed point fractions gives a 62-bit fixed point result, so we need a signed integer multiply with at least that precision. We will assume this:

A signed multiply interface specification
MULTIPLYS:              ; link through R1
                        ; on entry, R3 = multiplier
                        ;           R4 = multiplicand
                        ; on exit,  R3 = product, low bits
                        ;           R4 = product, high bits
                        ; destroys R5, R6, uses nothing else

If the multiplier and multiplicand are normalized to have a minimum absolute value of 0.5, the product will have a minimum absolute value of 0.25. Therefore, normalizing the mantissa will involve shifting at least one bit left, and sometimes two bits left. We need to use 64-bit shifts for this in order to avoid loss of precision, so we cannot use the standard normalize routine given earlier.

Multiplying two floating point numbers on the Hawk
        SUBTITLE floating multiply

; activation record format
RETAD   =       0       ; return address
PRODUCT =       0       ; pointer to floating product

FLTMUL:         ; on entry, R3 = pointer to floating product
                ;           R4 = pointer to multiplier
                ;           R5 = pointer to multiplicand
        STORES  R1,R2           ; save return address
        STORE   R3,R2,PRODUCT   ; save pointer to product
        LOADS   R6,R4           ; R6 = multiplier.exponent
        LOADS   R7,R5           ; R7 = multiplicand.exponent
        ADD     R7,R6,R7        ; R7 = product.exponent
        LOAD    R3,R4,MANTISSA  ; R3 = multiplier.mantissa
        LOAD    R4,R5,MANTISSA  ; R4 = multiplicand.mantissa
        LOAD    R1,PMULTIPLYS
        JSRS    R1,R1           ; R3-4 = product.mantissa
                                ; assert (R3-4 has 2 bits left of the point)
        SL      R3,1
        ADDC    R4,R4           ; shift product.mantissa 1 place
                                ; assert (R3-4 has 1 bit left of the point)
        BNS     FMULN           ; if (product.mantissa > 0) {
        BITTST  R4,30
        BCS     FMULOK          ;   if (product.mantissa not normalized) {
        SL      R3,1
        ADDC    R4,R4           ;     shift product.mantissa 1 place
        ADDSI   R7,-1           ;     decrement product.exponent
        BR      FMULOK          ;   }
FMULN:                          ; } else { negative mantissa
        ADDSI   R3,-1
        BCS     FMULNC
        ADDSI   R4,-1           ;   decrement product.mantissa
FMULNC:                         ;   mantissa is now in one's complement form
        BITTST  R4,30
        BCR     FMULNOK         ;   if (product.mantissa not normalized) {
        SL      R3,1
        ADDC    R4,R4           ;     shift product.mantissa 1 place
        ADDSI   R7,-1           ;     decrement product.exponent
FMULNOK:                        ;   }
        ADDSI   R3,1
        ADDC    R4,R0           ;   increment product.mantissa
FMULOK:                         ; } mantissa now normalized
        LOAD    R5,R2,PRODUCT
        STORES  R7,R5           ; store product.exponent
        STORE   R4,R5           ; store product.mantissa
        LOADS   R1,R2           ; restore return address
        JUMPS   R1              ; return

There are some oversights in this code! What if the product is zero? Our normalization rule states that a product of zero ought to have a particular exponent, the most negative possible value. Furthermore, there is no test at all for overflow or underflow, that is, no test for the possibility that adding the exponents might produce a value outside the legal range of exponents.

Exercises

t) Fix this floating point multiply code so that it detects underflow and overflow in adding exponents and correctly returns zero on underflow and when the exponent is too large, locks the exponent at its maximum value.

u) Fix this multiply code so it correctly normalizes products when the value is zero.

v) Write code for a floating point divide routine.

Other Operations

We need other operations in addition to multiply and divide. Because we have committed ourselves to a strongly abstracted model, The user of our floating point numbers should not peer into their representations. Therefore, we must provide tools for comparing numbers, testing the sign of numbers, testing for zero, and other operations that might otherwise appear trivial.

Another issue we face is import and export of floating point numbers. We need tools to convert numbers to and from textual form and to and from the binary forms used with other computers. The natural choice for the latter is the IEEE standard floating point representation. Our code for this begins by dealing with the range of values. A 32-bit exponent field gives an extraordinarily large range. The body of the code then converts the exponent and mantissa to the appropriate form and then packs these together.

Packing an IEEE Floating point value in C
unsigned int ieeepack( int exponent, int mantissa )
{
        int sign = 0;

        /* first split off the sign */
        if (mantissa < 0) {
                mantissa = -mantissa;
                sign = 0x80000000;
        }
        /* put the mantissa in IEEE normalized form */
        mantissa = mantissa >> 7;

        /* convert */
        if (exponent > 128) { /* convert overflow to infinity */
                mantissa = 0;
                exponent = 0x7F800000;
        } else if (exponent < -125) { /* convert underflow to zero */
                mantissa = 0;
                exponent = 0;
        } else { /* conversion is possible */
                mantissa = mantissa & 0x007FFFFF;
                exponent = (exponent + 126) <<
        }
        return sign | exponent | mantissa;
}

Note in the above code that the advertised bias of the IEEE format is 127, yet we used a bias of 126! This is because we also subtracted one from the original exponent to account for the fact that our numbers were normalized in the range 0.5 to 1.0, while IEEE numbers are normalized in the range 1.0 to 2.0. This is also why we compared with 128 and -125 instead of 127 and -126 when checking for the maximum and minimum legal exponents in the IEEE format. We have omitted one significant detail in the above! All underflows were simply forced to zero when some of them ought to have resulted in denormalized numbers.
Hawk code to unpack an IEEE-format floating-point number
        SUBTITLE unpack an IEEE-format floating point number
FLTIEEE:        ; on entry, R3 points to the return floating value
                ;           R4 is the number in IEEE format.
                ; R5 is used as a temporary
        MOVE    R5,R4           ; R5 = exponent
        SL      R5,1            ; throw away the bit left of the exponent
        SR      R5,12
        SR      R5,12           ; pull the exponent field all the way right
        ADDI    R5,R5,-126      ; unbias the exponent
        STORES  R5,R3           ; save converted exponent
        MOVE    R5,R4           ; R5 = mantissa
        SL      R5,9            ; push mantissa all the way left
        SR      R5,1            ; and then pull it back for missing one bit
        SUB     R0,R0,R0        ; set carry
        ADJUST  R5,CMSB         ; and use it to put missing one into mantissa
        TESTR   R4
        BNR     FIEEEPOS        ; if (number < 0) {
        NET     R5,R5           ;   negate mantissa
FIEEEPOS:                       ; }
        STORE   R5,R3,MANTISSA  ; save converted mantissa
        JUMPS   R1              ; return

Conversion from IEEE format to our eccentric Hawk format is fairly easy because our exponent and mantissa fields are larger than those of the single-precision IEEE format. Thus, we can convert with no loss of precision. This code presented above ignores the possibility that the value might be a NaN or infinity.

This code makes extensive use of shifting to clear fields within the number. Thus, instead of writing n&0xFFFFFF00, we write (n>>8)<<8. This trick is useful on many machines where loading a large constant is significantly slower than a shift instruction. By doing this, we avoid both loading a long constant into a register and using an extra register to hold it. We used a related trick to set the implicit one bit, using a subtract instruction to set the carry bit and then adding this bit into the number using an adjust instruction.

Conversion to Decimal

A well designed floating point package will include a complete set of tools for conversion to and from decimal textual representations, but our purpose here is to use the conversion problem to illustrate the use of our floating point package, so we will write our conversion code as user-level code, making no use of any details of the floating point abstraction that are not described in the header file for the package.

First, consider the problem of printing a floating point number using only the operations we have defined, ignoring the complexity of assembly language and focusing on the algorithm. We can begin by taking the integer part of the number and printing that, followed by a point, but the question is, how do we continue from there, printing the digits after the point?

C code to print a floating point number
void fltprint( float num, int places )
{
        int inum; /* the integer part */

        if (num < 0) {  /* make it positive and print the sign */
                num = -num;
                dspch( '-' );
        }

        /* first put out integer part */
        inum = fltint( num );
        dspnum( inum );
        dspch( '.' );

        /* second put out digits of the fractional part */
        for (; places > 0; places--) {
                num = (num - float(inum)) * 10.0;
                inum = fltint( num );
                dspch( inum + '0' );
        }
}

To print the fractional part of a number, this code subtracts the integer part of the number from the number, leaving just the fraction. Multiplying the fraction by ten brings one digit above the point. This is repeated for each digit. Converting back and forth is not very efficient, but it works. Rewriting this with our floating point support routines gives:

Lower level C code to print a floating point number
void fltprint( float *pnum, int places )
{
        float num;  /* a copy of the number */
        float tmp;  /* a temporary floating point number */
        float ten;  /* a constant floating value */
        int inum;   /* the integer part */
        int i;      /* loop counter */
        float( &ten, 10 );
        if (flttst( &num ) < 0) {  /* make it positive, print the sign */
                fltneg( &num, pnum );
                dspch( '-' );
        } else {
                fltcpy( &num, pnum );
        }
        /* first put out integer part */
        inum = fltint( &num );
        dspnum( inum );
        dspch( '.' );

Lower level C code to print a floating point number, continued
        /* second put out digits of the fractional part */
        while (places > 0) {
                float( &tmp, inum );
                fltsub( &num, &num, &tmp );
                fltmul( &num, &num, &ten );
                inum = fltint( &num );
                dspch( inum + '0' );
                places = places - 1;
        }
}

We face a few problems in the above, and it is best to tackle these incrementally. First, all of our support routines take pointers to floating point numbers as parameters in order to isolate the code from any knowledge of the representation. All arithmetic in the above is done with explicit calls to the floating point package, and even the constant ten is converted to floating by a call to FLOAT. This eliminates the need for an assembly language programmer to know how many registers each floating point value might need.

We declared several local variables as floating point numbers above, but to do this, we need to know the size of a floating point number. Note that float.h, the interface definition for the floating point package, exports the size as FLOATSIZE. We can write our code in terms of that symbol without ever taking advantage of its value if we change the way we allocate local variables in the activation record. Up until now, we have done this manually. We can let the assembler the clerical work if we begin with an activation record size of zero. For each field, we then define it in terms of the current activation record size before adding in the field size. We will do this with a that ought to be included in hawk.macs.

Macro to add an activation record field
        MACRO   LOCAL   X, =SIZE
          X = ARSIZE
          ARSIZE = ARSIZE + SIZE
        ENDMAC

INTSIZE =       4       ; size of an integer

 
Building an activation record for FLTPRINT
        TITLE   fltprint.a -- floating print routine
        USE     "hawk.macs"
        USE     "float.h"
        INT     FLTPRINT

; activation record format
ARSIZE  =       0       ; initial size of activation record
        LOCAL   RETAD, INTSIZE  ; return address
        LOCAL   NUM, FLOATSIZE  ; copy of the floating point number
        LOCAL   TMP, FLOATSIZE  ; a temporary floating point number
        LOCAL   TEN, FLOATSIZE  ; the constant ten
        LOCAL   R8SAVE, INTSIZE ; save area for register 8
        LOCAL   R9SAVE, INTSIZE ; save area for register 9

In the above, had we allowed ourselves to use knowledge about the size of a floating point number, we could have defined NUM=4, TMP=12 and TEN=20, but then, any change in the floating point package would have required us to rewrite this code. The macro LOCAL allows us to write local variable declarations compactly; without this macro, each of our local variables would have required two lines of code. For example, the declaration of the local variable NUM would begin with NUM=ARSIZE, and then it would add to the activation record size with ARSIZE=ARSIZE+FLOATSIZE.

The local variables for saving registers 8 and 9 were allocated so that the integer variables in our code can use these registers over and over again instead of being loaded and stored in order to survive each call to a routine in the floating point package. Of course, if those routines need registers 8 and 9, they will be saved and restored anyway, but we leave that to them.

The following code contains one significant optimization. With all of the subroutine calls, we could have incremented and decremented the stack pointer many times. Instead, we increment it just once at the start of the print routine and decrement it just once at the end; in between, we always subtract ARSIZE from every displacement into the activation record in order to correct for this.

The body of the floating print routine, part 1
FLTPRINT:       ; on entry: R3 = pointer to floating point number to print
                ;           R4 = number of places to print after the point
        STORES  R1,R2
        STORE   R8,R2,R8SAVE
        STORE   R9,R2,R9SAVE    ; saved return address, R8, R9
        MOVE    R8,R3           ; R8 = pointer to number
        MOVE    R9,R4           ; R9 = places

        ADDI    R2,R2,ARSIZE    ; from here on, R2 points to end of AR

        LEA     R3,R2,TEN-ARSIZE
        LIS     R4,10
        LOAD    R1,PFLOAT
        JSRS    R1,R1           ; float( &ten, 10 );

        MOVE    R3,R8
        LOAD    R1,FLTTST
        JSRS    R1,R1
        TESTR   R3
        BNR     FPRNNEG         ; if (flttst( pnum ) < 0) {

        LEA     R3,R2,NUM-ARSIZE
        MOVE    R4,R8
        LOAD    R1,PFLTNEG
        JSRS    R1,R1           ;   fltneg( &num, pnum );

        LIS     R3,'-'
        LOAD    R1,PDSPCH
        JSRS    R1,R1           ;   dspch( '-' );

        BR      FPRABS

FPRNNEG:                        ; } else {

        LEA     R3,R2,NUM-ARSIZE
        MOVE    R4,R8
        LOAD    R1,PFLTCPY
        JSRS    R1,R1           ;   fltcpy( &num, pnum );

 

The body of floating print, part 2
FPRABS:                         ; }

                                ; /* first put out the integer part */
        LEA     R3,R2,NUM-ARSIZE
        LOAD    R1,PFLTINT
        JSRS    R1,R1
        MOVE    R8,R3           ; R8 = inum = fltint( num );

        LOAD    R1,PDSPNUM
        JSRS    R1,R1           ; dspnum( inum );

        LIS     R3,'.'
        LOAD    R1,PDSPCH
        JSRS    R1,R1           ; dspch( '.' );

FPRLP:
        TESTR   R9
        BLE     FPRLX           ; while (places > 0) {

        LEA     R3,R2,TMP-ARSIZE
        MOVE    R4,R8
        LOAD    R1,PFLOAT
        JSRS    R1,R1           ;   float( &tmp, inum );

        LEA     R3,R2,NUM-ARSIZE
        MOVE    R4,R3
        LEA     R5,R2,TMP-ARSIZE
        LOAD    R1,PFLTSUB
        JSRS    R1,R1           ;   fltsub( &num, &num, &tmp );

        LEA     R3,R2,NUM-ARSIZE
        MOVE    R4,R3
        LEA     R5,R2,TEN-ARSIZE
        LOAD    R1,PFLTMUL
        JSRS    R1,R1           ;   fltmul( &num, &num, &ten );

        LEA     R3,R2,NUM-ARSIZE
        LOAD    R1,PFLTINT
        JSRS    R1,R1
        MOVE    R8,R3           ;   R8 = inum = fltint( &num );

        ADDI    R3,R3,'0'
        LOAD    R1,PDSPCH
        JSRS    R1,R1           ;   dspch( inum + '0' );
        ADDSI   R9,-1           ;   places = places - 1;
        BR      FPRLP
FPRLX:                          ; }
        ADDI    R2,R2,-ARSIZE
        LOAD    R8,R2,R8SAVE
        LOAD    R9,R2,R9SAVE
        LOADS   R1,R2           ; restore return address, R8, R9
        JUMPS   R1              ; return

 

Exercises

w) Write a floating print routine that produces its output in scientific notation, for example, using the format 6.02E23 where the E stands for times ten to the. To do this, you will have to first do a decimal normalize, counting the number of times you have to multiply or divide by ten in order to bring the mantissa into the range from 1 to just under 10, and then you will have to print the mantissa (using the floating print routine we just discussed), and finally print the exponent.