11. Floating Point

Part of 22C:60, Computer Organization Notes
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

The Hawk Floating Point Coprocessor

The Hawk architecture includes two opcodes reserved for communication with coprocessors. The term coprocessor refers to a special purpose processor that operates in conjunction with the central processor. Some coprocessors are physically separate from the central processor, for example, on a separate chip, but the logical separation is essential. Coprocessors are frequently used to perform floating point operations, but they have also been used for graphics, cryptography, and other specialized computations.

The Hawk coprocessor instructions, COSET and COGET, allow data to be transferred between the Hawk general purpose registers and the specialized registers inside one or more coprocessors. In this chapter, we will only discuss coprocessor number one, the floating point coprocessor.

The Hawk coprocessor set instruction

07 06 05 04 | 03 02 01 00 | 15 14 13 12 | 11 10 09 08
 0  0  0  1 |     dst     |  0  0  1  0 |      x

The Hawk coprocessor get instruction

07 06 05 04 | 03 02 01 00 | 15 14 13 12 | 11 10 09 08
 0  0  0  1 |     dst     |  0  0  1  1 |      x

In these instructions, the dst field always refers to a CPU register, while the x field refers to one of the registers in the currently active coprocessor. Coprocessor register zero, the coprocessor status register, COSTAT, is used to select the active coprocessor. This register has several fields, the details of which can be found in the Hawk manual. What matters, for our purposes, is that the following instruction sequence enables the floating point coprocessor to handle short (32-bit) floating point operands:

Enabling the floating point coprocessor

        LIL     R1, FPENAB + FPSEL
        COSET   R1, COSTAT

Once the floating point coprocessor is enabled, addressing coprocessor registers 1 to 15 refers specifically to registers inside the floating point unit. When operating in short format, there are only two useful registers in the floating-point coprocessor, floating-point accumulators zero and one, FPA0 and FPA1, which correspond to register numbers 2 and 3. Register 1 will be ignored, for now. It is used for access to the least significant halfword of long (64-bit) floating point operands.

Coprocessor registers 4 to 15 are not, in reality, registers. Rather, operations that would appear to set these registers actually initiate operations on the floating point accumulators. The available operations include the obvious ones, floating point add, subtract, multiply and divide, as well as square root and conversion from integer to floating. Setting even registers in the range from 4 to 15 causes operations on FPA0 and setting odd registers in this range operates on FPA1. For example, setting coprocessor register number 5 converts an integer operand from a general purpose register into a floating point value in FPA1. The complete set of short (32-bit) floating point operations on floating point accumulator zero is illustrated below; the same operations are available on floating point accumulator one, FPA1.

Floating point coprocessor operations

        COSET   R1, FPA0         ;  2  FPA0 = R1
        COSET   R1, FPINT+FPA0   ;  4  FPA0 = (float) R1
        COSET   R1, FPSQRT+FPA0  ;  6  FPA0 = sqrt( R1 )
        COSET   R1, FPADD+FPA0   ;  8  FPA0 = FPA0 + R1
        COSET   R1, FPSUB+FPA0   ; 10  FPA0 = FPA0 - R1
        COSET   R1, FPMUL+FPA0   ; 12  FPA0 = FPA0 * R1
        COSET   R1, FPDIV+FPA0   ; 14  FPA0 = FPA0 / R1

Unlike integer operations, floating point operations do not directly set the condition codes. When the coprocessor get instruction COGET is used to get the contents of either floating point accumulator, it sets the N and Z condition codes to indicate whether the floating point value is negative or zero. In addition, the C condition code is used to report floating point values that are infinite or non-numeric. This is possible because the floating point representation includes representations for infinite values.

Exercises

a) Give appropriate defines for the symbols FPA0, FPA1, FPSQRT, FPADD, FPSUB, FPMUL and FPDIV that are used as operands on the COSET instruction.
b) Given 32-bit floating point values x in R4 and y in R5, give Hawk code to enable the floating point coprocessor, compute sqrt(x² + y²), place the result in R3 and then disable all coprocessors.

IEEE Floating Point Format

It is not sufficient to say that we have a floating point coprocessor that supports 32-bit floating point values. We must also define the data format used by this processor. Most modern computers support the same floating point format, a format defined by the Institute of Electrical and Electronics Engineers, the IEEE. The Hawk is no exception. This format follows a general outline that is very similar to most floating point formats supported by floating point hardware since the early 1960's, but it does have some eccentric and occasionally difficult to explain features.

The format of the binary floating point numbers used in computers is closely related to the format of decimal numbers expressed in scientific notation. Consider the number 6.02×10²³, known as Avogadro's number. This number is composed of a mantissa, 6.02, and an exponent, 23. The number base used in the mantissa is the same as the value to which the exponent is applied. Scientific notation is more complex than this, however, because we have normalization rules. Consider the following expressions of the same number:

Equivalent values
60221419.9  × 10¹⁶   not normalized
60221.4199  × 10¹⁹   not normalized
60.2214199  × 10²²   not normalized
6.02214199  × 10²³   normalized
0.602214199 × 10²⁴   not normalized

Of these, only 6.02... × 10²³ is considered to be properly in scientific notation. In scientific notation, the mantissa is always represented as a fixed-point decimal number between 1.000 and 9.999... The only exception is zero. When we find a number that has a mantissa that does not satisfy this rule, we normalize it by moving the point (and adjusting the exponent accordingly) until it satisfies this rule.
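The normalization rule for decimal scientific notation can be sketched in a few lines of Python. This is only an illustration; the function name normalize_decimal is hypothetical, and the Decimal type is used so that moving the point does not introduce binary rounding errors.

```python
from decimal import Decimal

def normalize_decimal(mantissa, exponent):
    """Move the point until 1 <= |mantissa| < 10, adjusting the exponent.

    Zero is the one exception: it cannot be normalized, so it is
    returned with an exponent of zero (an arbitrary choice here).
    """
    if mantissa == 0:
        return Decimal(0), 0
    while abs(mantissa) >= 10:      # point is too far right
        mantissa = mantissa / 10
        exponent = exponent + 1
    while abs(mantissa) < 1:        # point is too far left
        mantissa = mantissa * 10
        exponent = exponent - 1
    return mantissa, exponent

# every spelling of the same number normalizes to the same pair
m, e = normalize_decimal(Decimal("60221419.9"), 16)
```

Applying the function to any of the five equivalent values in the table gives the same mantissa and exponent.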

The IEEE standard includes both 32 and 64-bit floating point numbers. In this discussion, we will ignore the latter and focus on the one-word 32-bit format. As with numbers in scientific notation, these have an exponent and a mantissa field, but these are binary representations, so the mantissa field is in base two and the exponent field is a power of two, not a power of ten.

In the IEEE floating point formats, like most other hardware floating point formats, the most significant bit holds the sign of the mantissa, with zero meaning positive. The mantissa is stored in signed magnitude form. The magnitude of the mantissa of a 32-bit floating-point number is given to 24 bits of precision, while the exponent is stored in the 8 remaining bits. Notice that this adds up to 33 bits of sign, exponent and mantissa. This is because of some exceptionally tricky details of the IEEE floating point representation. IEEE double-precision numbers differ from the above in that each number is 64 bits. This allows 11 bits for the exponent instead of 8, and 53 bits for the mantissa, including one extra bit obtained from the same trickery that got an extra bit for the 32-bit format.

A Normalized Mantissa

The way the IEEE format gets an extra bit for the mantissa stems from a consequence of the normalization rule used for the mantissa. The mantissa in an IEEE format number is represented as a binary fixed point number with one place to the left of the point. With this representation, the normalization rule is very similar to that used for scientific notation. The smallest normalized mantissa value is 1.0, while the largest normalized value is 1.1111...₂. This means that, for normalized mantissas, the most significant bit of the mantissa is always one. In general, if a bit is always the same, there is no point in storing it; we can take the constant value for granted. We call this bit that we do not store the hidden bit. Consider the following IEEE floating point value represented as 12345678₁₆:

The IEEE single-precision floating-point representation

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00

0 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 1 1 0 0 0

sign = 0 (positive)
exponent = 001 0010 0
mantissa = 1.011 0100 0101 0110 0111 1000
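The field boundaries in this example can be checked with a little shifting and masking. The following Python sketch is illustrative only; the function name split_ieee_single is an assumption, not part of any Hawk tool.

```python
def split_ieee_single(word):
    """Split a 32-bit word into IEEE single-precision fields."""
    sign = (word >> 31) & 0x1        # bit 31
    exponent = (word >> 23) & 0xFF   # bits 30 to 23, still biased
    fraction = word & 0x7FFFFF       # bits 22 to 0, hidden bit not included
    return sign, exponent, fraction

sign, exponent, fraction = split_ieee_single(0x12345678)

# for a normalized number, the full 24-bit mantissa has the hidden bit on top
mantissa = 0x800000 | fraction
```

For 12345678₁₆, the exponent field comes out as 00100100₂ and the 24-bit mantissa, hidden bit included, as B45678₁₆, matching the fields shown above.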

The IEEE format supports non-normalized representations only in the case of the smallest possible exponents. The representation of zero falls into this category, a value with the smallest possible exponent and a mantissa of zero.

The Biased Exponent

The second odd feature of the IEEE format is that the exponent is given as a biased signed integer with the eccentric bias of 127. The normal range of exponents runs from 00000001₂, meaning -126, to 11111110₂, meaning +127. The exponent represented as 00000000₂ is reserved for unnormalized (extraordinarily small) values and for zero. In this case, the exponent is still interpreted as having the value -126. The hidden bit is zero for unnormalized values. The exponent 11111111₂ is reserved for infinity (with a mantissa of zero) and for values that the IEEE calls NaNs, where NaN stands for not a number. The mantissa field of a NaN may be put to a variety of uses by software, but this is rarely done.

Because of the odd bias of 127 for exponents, the exponent one is represented as 10000000₂, zero is 01111111₂, and negative one is 01111110₂. There is a competing but equivalent presentation of the IEEE format that presents the bias as 128 and places the point in the mantissa differently relative to the hidden bit. The different presentations of the IEEE system make no difference in the number representations, but they can be confusing when comparing presentations of the number system from different sources. The following table shows IEEE floating-point numbers, given in binary, along with their interpretations.

Example IEEE single-precision floating-point numbers

Infinity and NaN
Sign Exp.     Mant.                     Value
0    11111111 00000000000000000000000 = Infinity
1    11111111 00000000000000000000000 = -Infinity
0    11111111 00000000000010000000000 = NaN
1    11111111 00000010001111101000110 = NaN

Normalized numbers
Sign Exp.     Mant.                     Value
0    10000000 00000000000000000000000 = +1.0 × 2¹ × 1.00₂ = 2
0    01111110 00000000000000000000000 = +1.0 × 2^-1 × 1.00₂ = 0.5
0    01111111 10000000000000000000000 = +1.0 × 2⁰ × 1.10₂ = 1.5
0    01111111 11000000000000000000000 = +1.0 × 2⁰ × 1.11₂ = 1.75
1    01111111 11000000000000000000000 = -1.0 × 2⁰ × 1.11₂ = -1.75
0    00000001 00000000000000000000000 = +1.0 × 2^-126 × 1.00₂ = 2^-126

Unnormalized numbers
Sign Exp.     Mant.                     Value
0    00000000 10000000000000000000000 = +1.0 × 2^-126 × 0.10₂ = 2^-127
0    00000000 01000000000000000000000 = +1.0 × 2^-126 × 0.01₂ = 2^-128
0    00000000 00000000000000000000000 = +1.0 × 2^-126 × 0.00₂ = 0
1    00000000 00000000000000000000000 = -1.0 × 2^-126 × 0.00₂ = 0
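The interpretation rules behind this table can be written out directly and compared against a machine's own IEEE arithmetic. The Python sketch below is a minimal illustration; the names fields_to_word, decode_single and decode_with_struct are hypothetical, and the second decoder relies on the standard struct module to reinterpret the same 32 bits as a float.

```python
import math
import struct

def fields_to_word(sign, exponent_bits, fraction_prefix=""):
    """Build a 32-bit word; the fraction is padded on the right with zeros."""
    fraction_bits = fraction_prefix.ljust(23, "0")
    return int(f"{sign}{exponent_bits}{fraction_bits}", 2)

def decode_single(word):
    """Apply the rules in the text: bias 127, hidden bit, special exponents."""
    sign = -1.0 if (word >> 31) & 1 else 1.0
    exponent = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if exponent == 0xFF:                       # infinity or NaN
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:                          # unnormalized, hidden bit is 0
        return sign * (fraction / 2**23) * 2.0**-126
    return sign * (1 + fraction / 2**23) * 2.0**(exponent - 127)

def decode_with_struct(word):
    """Let the host hardware format interpret the same bits."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

one_and_a_half = decode_single(fields_to_word(0, "01111111", "1"))
tiny = decode_single(fields_to_word(0, "00000000", "1"))
```

Both decoders should agree on every row of the table, which is a quick way to confirm that a hand decoding of the exponent and hidden bit is right.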

Exercises

c) Give the 32-bit IEEE representation for 100₁₀.
d) Give the 32-bit IEEE representation for 1000₁₀.
e) Give the 32-bit IEEE representation for 0.1₁₀.
f) Give the 32-bit IEEE representation for 0.01₁₀.
g) Consider the following infinite loop on the Hawk:
        LIL     R3, FPENAB + FPSEL
        COSET   R3, COSTAT
        LIS     R3, 1
L:
        COSET   R3, FPINT+FPA0
        COGET   R3, FPA0
        BR      L
Describe the sequence of values in R3, with an emphasis on the value after a large number of iterations. An approximate upper bound on the value of R3 is a strong answer to this question.

Software Floating Point

Some versions of the Hawk emulator do not have a floating point coprocessor. On such machines, we must implement floating point operations entirely with software. This forces us to examine the algorithms that underlie floating point arithmetic, and it gives us an extended example of a software module that implements a complex class.

In the following presentation, we will not bother to maintain any compatibility with the Hawk floating point coprocessor or with IEEE floating point format. Instead, we will focus on clear code and straightforward data representation. Transformation between this package and IEEE format will be left for last.

The interface specification for a class should list all of the operations applicable to objects of that class, the methods, and for each method, it should specify the parameters expected, constraints on those parameters, and the nature of the result. The implementation of the class must then give the details of the object representation and the specific algorithms used to implement each method. It is good practice to add documentation to the interface specification, so that it serves as a manual for users of the class as well as a formal interface.

For our floating-point class, the set of operations is fairly obvious. We want operators that add, subtract, multiply and divide floating-point numbers, and we also want operators that return the integer part of a number, and that convert integers to floating-point form. We probably want other operations, but we will forgo those for now.

In most object-oriented programming languages, a strong effort is made to avoid copying objects from place to place. Instead, objects sit in memory and object handles are used to refer to them. The handle for an object is actually just a pointer to that object, that is, a word holding the address of the object. Therefore, our floating point operators will take, as parameters, the addresses of their operands, not the values.

Finally, the interface specification for a class must indicate how to allocate storage for an element of that class. The only thing a user of the object needs to know is the size of the object, not the internal details of its representation. The following interface specification for our Hawk floating point package assumes that each floating point number is stored in two words of memory, enough for an exponent and a mantissa of one word each, although the user need not know how the words are used.

Software Floating Point Interface Specification

        TITLE   float.h, interface specification for float.a

FLOATSIZE = 8   ; size of a floating point number, in bytes.
                ; the user gets no other information about the
                ; number representation

                ; for all calling sequences here:
                ;   R1 = return address
                ;   R2 = pointer to activation record
                ;   R3-7 = parameters and temporaries
                ;   R8-15 = guaranteed to be saved and restored

                ; functions that return floating values use:
                ;   R3 = pointer to place to put return value
                ;   the caller must pass this pointer!

        EXT     FLOAT   ; convert integer to floating
                        ; on entry, R3 = pointer to floating result
                        ;           R4 = integer to convert

        EXT     FLTINT  ; convert floating to integer
                        ; on entry, R3 = pointer to floating value
                        ; on exit,  R3 = integer return value

        EXT     FLTCPY  ; copy a floating point number
                        ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to floating operand

        EXT     FLTTST  ; test sign and zeroness of floating number
                        ; on entry, R3 = pointer to floating value
                        ; on exit,  R3 = integer -1, 0 or 1

        EXT     FLTADD  ; add floating-point numbers
                        ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to addend
                        ;           R5 = pointer to augend

        EXT     FLTSUB  ; subtract floating-point numbers
                        ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to subtrahend
                        ;           R5 = pointer to minuend

        EXT     FLTNEG  ; negate a floating-point number
                        ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to operand

        EXT     FLTMUL  ; multiply floating-point numbers
                        ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to multiplicand
                        ;           R5 = pointer to multiplier

        EXT     FLTDIV  ; divide floating-point numbers
                        ; on entry, R3 = pointer to floating result
                        ;           R4 = pointer to multiplicand
                        ;           R5 = pointer to multiplier

Exercises

h) Write a main program that uses a separate common block of size FLOATSIZE to hold each floating point variable it needs in the computation of the floating point representation of 0.1, computed by converting the integer constants 1 and 10 to floating point and then dividing 1.0 by 10.0. This should call FLOAT several times, and then make just one call to FLTDIV. Your main program will, of course, use the file float.h.
i) Write a separately compilable subroutine called SQUARE that takes 2 pointers to floating point numbers as parameters and returns the square of the second number in the first. Don't forget to write an appropriate interface specification, and comment everything appropriately, including an indication of the file names that should be used.

A floating point representation

The simplest way to represent a floating point number for a software implementation of floating point operations is as a pair of words, one holding the exponent and another holding the mantissa, but this is not enough detail. Which word is which? We need to specify the interpretation of the bits of each of these words. What is the range of exponent values? How do we represent the sign of the exponent? How is the mantissa normalized? How do we represent non-normalized values such as zero?

On a computer that supports two's complement integers, it makes sense to represent the exponent and mantissa as two's complement values. We can represent zero using a mantissa of zero; technically, when the mantissa is zero, the exponent does not matter, but we will always set the exponent to the smallest (most negative) possible value.

The more difficult question is, where is the point in our two's complement mantissa? We could put the point anywhere and make it work, but the two obvious choices are to use an integer mantissa or to put the point immediately to the right of the sign bit. Here, we opt for the latter, and we will normalize the mantissa so that the bit immediately to the right of the point is always a one. This is equivalent to changing the normalization rules for decimal scientific notation so that 0.602×10²⁴ is considered to be normalized. The following examples illustrate this number format.

A floating-point number representation
exponent 00000000000000000000000000000000 +0.5 × 2⁰ = 0.5
mantissa 01000000000000000000000000000000

exponent 00000000000000000000000000000001 +0.5 × 2¹ = 1.0
mantissa 01000000000000000000000000000000

exponent 00000000000000000000000000000001 +0.75 × 2¹ = 1.5
mantissa 01100000000000000000000000000000

exponent 00000000000000000000000000000001 -0.75 × 2¹ = -1.5
mantissa 10100000000000000000000000000000

exponent 11111111111111111111111111111111 +0.5 × 2^-1 = 0.25
mantissa 01000000000000000000000000000000

exponent 11111111111111111111111111111101 +0.5 × 2^-3 = 0.0625
mantissa 01000000000000000000000000000000

exponent 11111111111111111111111111111101 ~8/10 × 2^-3 = 0.1...
mantissa 01100110011001100110011001100110
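To check examples like these by machine, the two words can be reinterpreted as two's complement integers and combined. The Python sketch below is only an illustration of the format just described; the names to_signed32 and soft_value are hypothetical.

```python
def to_signed32(word):
    """Reinterpret an unsigned 32-bit word as a two's complement integer."""
    return word - (1 << 32) if word & 0x80000000 else word

def soft_value(exponent_word, mantissa_word):
    """Value of a software float: mantissa fraction times 2**exponent.

    The point sits just right of the sign bit, so the mantissa word,
    taken as a signed integer, counts units of 2**-31.
    """
    exponent = to_signed32(exponent_word)
    mantissa = to_signed32(mantissa_word) / 2**31
    return mantissa * 2.0**exponent

half = soft_value(0x00000000, 0x40000000)
minus_one_and_a_half = soft_value(0x00000001, 0xA0000000)
about_a_tenth = soft_value(0xFFFFFFFD, 0x66666666)
```

The last value is only approximately 0.1, because the repeating pattern 0110 0110... is cut off after 31 fraction bits.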

Exercises

j) In this number system, what is the largest possible positive value (in binary)?
k) In this number system, what is the smallest possible positive nonzero normalized value?
l) In this number system, how is 10.0₁₀ represented?
m) In this number system, how is 100.0₁₀ represented?

Normalizing a floating point number

Many operations on floating point numbers produce results that are unnormalized, and these must be normalized before performing additional operations on them. If this is not done, there will be a loss of precision in the results. Classical scientific notation is always presented in normalized form for the same reason. To normalize a floating point number, we must distinguish some special cases: First, is the number zero? Zero cannot be normalized! Second, is the number negative? Because we have opted to represent our mantissa in two's complement form, negative numbers are slightly more difficult to normalize; this is why many hardware floating-point systems use signed magnitude for their floating point numbers.

The normalize subroutine is not part of the public interface to our floating point package, but rather, it is a private component, used as the final step of just about every floating point operation. Therefore, we can write it with the assumption that operands are passed in registers instead of using pointers to memory locations. We will code this here using registers 3 and 4 to hold the exponent and mantissa of the number to be normalized, and we will use this convention both on entrance and exit.

Normalizing a floating-point number (part 1)

        SUBTITLE normalize

NORMALIZE:      ; normalize floating point number
                ; link through R1
                ; R3 = exponent on entry and exit
                ; R4 = mantissa on entry and exit
                ; no other registers used
        TESTR   R4
        BZR     NRMNZ           ; if (mantissa == 0) {
        LIL     R3,#800000
        SL      R3,8            ;   exponent = 0x80000000;
        JUMPS   R1              ;   return;
NRMNZ:
        BNS     NRMNEG          ; } /* else */ if (mantissa > 0) {
NRMPLP:                         ;   while
        BITTST  R4,30
        BCS     NRMPRT          ;   ((mantissa & 0x40000000) == 0) {
        SL      R4,1            ;     mantissa = mantissa << 1;
        ADDSI   R3,-1           ;     exponent = exponent - 1;
        BR      NRMPLP          ;   }
NRMPRT:
        JUMPS   R1              ;   return;

Normalizing a floating-point number (part 2)

NRMNEG:                         ; } /* else if */ { /* mantissa < 0 */
        ADDSI   R4,-1           ;   mantissa = mantissa - 1;
                                ;   /* mantissa now in one's complement form */
NRMNLP:                         ;   while
        BITTST  R4,30
        BCR     NRMNRT          ;   ((mantissa & 0x40000000) != 0) {
        SL      R4,1            ;     mantissa = mantissa << 1;
        ADDSI   R3,-1           ;     exponent = exponent - 1;
        BR      NRMNLP          ;   }
NRMNRT:
        ADDSI   R4,1            ;   mantissa = mantissa + 1;
                                ;   /* mantissa now in two's complement form */
        JUMPS   R1              ;   return;
                                ; }

There are two tricks in this code worth mentioning. First, this code uses the BITTST instruction to test bit 30 of the mantissa. This instruction moves the indicated bit to the C condition code; in fact, the assembler converts this instruction to either a left or a right shift to move the indicated bit into the carry bit while discarding the shifted result using R0. In C, C++ or Java, in contrast, inspection of one bit of a word is most easily expressed by anding that word with a constant with just that bit set. The second trick involves normalizing negative numbers. In two's complement, the representation of -0.5 has bit 30 set to 1, while -0.75, as in the example values presented above, has it set to zero. By subtracting one from the least significant bit of each negative value, we can convert to one's complement, allowing us to take advantage of the fact that bit 30 of the one's complement representation of a normalized negative mantissa is always zero.
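The same logic can be tried out in a high-level language. The following Python sketch is an illustration under stated assumptions, not a transcription of the Hawk routine: words are held as unsigned values masked to 32 bits, the one's complement value (mantissa - 1) is used only for the bit 30 test while the two's complement mantissa itself is shifted, so that the zeros shifted in at the right do not disturb the result, and the mantissa 0x80000000 (-1.0) is returned unchanged because the text does not say how it should be handled.

```python
MASK32 = 0xFFFFFFFF
MIN_EXPONENT = 0x80000000   # most negative exponent, as a 32-bit word

def normalize(exponent, mantissa):
    """Normalize a software float held as two unsigned 32-bit words."""
    if mantissa == 0:
        return MIN_EXPONENT, 0                 # zero cannot be normalized
    if mantissa == 0x80000000:
        return exponent, mantissa              # assumption: leave -1.0 alone
    if (mantissa & 0x80000000) == 0:           # positive mantissa
        while (mantissa & 0x40000000) == 0:    # bit 30 must become 1
            mantissa = (mantissa << 1) & MASK32
            exponent = (exponent - 1) & MASK32
        return exponent, mantissa
    # negative mantissa: bit 30 of the one's complement form must become 0
    while (((mantissa - 1) & MASK32) & 0x40000000) != 0:
        mantissa = (mantissa << 1) & MASK32
        exponent = (exponent - 1) & MASK32
    return exponent, mantissa

# -0.25 x 2**0 becomes -0.5 x 2**-1
exponent, mantissa = normalize(0, 0xE0000000)
```

Like the assembly code, this sketch lets the exponent wrap around silently if it is decremented past the most negative value.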

Exercises

n) The above code does not detect underflow. If it decrements the exponent below the smallest legal value, it produces the highest legal value. Rewrite the code to make it produce a value of zero whenever decrementing the exponent would underflow.

Integer to Floating Conversion

Conversion from integer to floating point is remarkably simple. All that needs to be done is to adjust the exponent field to 31 and set the mantissa field to the desired integer, and then normalize the result. This is because the fixed point fractions we are using to represent the mantissa can be viewed as integer counts in units of 2^-31. As a result, our code simply moves the data into place for a call to normalize and then stores the results in the indicated memory location.

Integer to Floating Conversion on the Hawk

; format of a floating point number stored in memory
EXPONENT = 0
MANTISSA = 4
FLOATSIZE = 8

        SUBTITLE integer to floating conversion

FLOAT:          ; on entry, R3 = pointer to floating result
                ;           R4 = integer to convert
        MOVE    R5,R1           ; R5 = return address
        MOVE    R6,R3           ; R6 = pointer to floating result
        LIS     R3,31           ; exponent = 31; /* R3-4 is now floating */
        JSR     R1,NORMALIZE    ; normalize( R3-4 );
        STORES  R3,R6           ; result->exponent = exponent;
        STORE   R4,R6,MANTISSA  ; result->mantissa = mantissa;
        JUMPS   R5              ; return; /* uses saved return address! */
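The idea that an integer is already a floating point number with an exponent of 31 can be checked with a short Python sketch. Here, as an assumption made for brevity, the mantissa is held as a signed Python integer counting units of 2^-31 rather than as a raw 32-bit word, and the name float_from_int is hypothetical.

```python
MIN_EXPONENT = -(1 << 31)   # exponent used for zero

def float_from_int(n):
    """Convert a signed 32-bit integer to an (exponent, mantissa) pair.

    The mantissa is a signed integer in units of 2**-31, so the value
    represented is mantissa * 2**(exponent - 31).
    """
    if n == 0:
        return MIN_EXPONENT, 0
    exponent, mantissa = 31, n          # already a valid, unnormalized float
    # normalize: stop once 0.5 <= |mantissa| / 2**31
    # (a mantissa of exactly -2**31 is left as it is)
    while -(1 << 30) < mantissa < (1 << 30):
        mantissa = mantissa * 2
        exponent = exponent - 1
    return exponent, mantissa

exponent, mantissa = float_from_int(10)
value = mantissa * 2.0 ** (exponent - 31)   # should give back 10.0
```

Before the loop, the pair (31, n) already represents n exactly; normalization changes the representation without changing the value.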

Floating to Integer Conversion

Conversion of floating-point numbers to integer is a bit more complex, but only because we have no pre-written denormalize routine that will set the exponent field to 31. Instead, we need to write this ourselves. Where the normalize routine shifted the mantissa left and decremented the exponent until the number was normalized, the floating to integer conversion routine will have to shift the mantissa right and increment the exponent until the exponent has the value 31.

This leaves open the question of what happens if the initial value of the exponent was greater than 31. The answer is, in that case, the integer part of the number is too large to represent in 32 bits. In this case, we should raise an exception, or, lacking a model of how to write exception handlers, we could set the overflow condition code. Here, this is left as an exercise for the reader.

Floating to Integer Conversion on the Hawk

        SUBTITLE floating to integer conversion

FLTINT:         ; on entry, R3 = pointer to floating value
                ; on exit   R3 = integer result
        LOADS   R4,R3           ; R4 = argument->exponent
        LOAD    R3,R3,MANTISSA  ; R3 = argument->mantissa
FINTLP:                         ; while
        CMPI    R4,31
        BGE     FINTLX          ; (exponent < 31) {
        SR      R3,1            ;   mantissa = mantissa >> 1;
        ADDSI   R4,1            ;   exponent = exponent + 1;
        BR      FINTLP          ; }
FINTLX:
                ; unchecked error condition: exponent > 31 implies overflow
        JUMPS   R1              ; return denormalized mantissa
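The denormalizing loop is easy to experiment with in Python. In this sketch, the mantissa is assumed to be held as a signed Python integer counting units of 2^-31, Python's >> operator on such integers is assumed to stand in for the signed right shift, and the name float_to_int is hypothetical. As in the assembly code, the case of an exponent greater than 31 is left unchecked.

```python
def float_to_int(exponent, mantissa):
    """Shift the mantissa right until the exponent reaches 31.

    mantissa is a signed integer in units of 2**-31; once the exponent
    is 31, the mantissa is the integer result.
    """
    while exponent < 31:
        mantissa = mantissa >> 1    # discards one bit of the fraction
        exponent = exponent + 1
    # unchecked: exponent > 31 here means the value does not fit
    return mantissa

ten = float_to_int(4, 0x50000000)    # 0.625 x 2**4
two = float_to_int(2, 0x50000000)    # 0.625 x 2**2 = 2.5, fraction dropped
```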

Exercises

o) The above code for floating to integer conversion truncates the result in an odd way for negative numbers. If the floating point input value is -1.5, what integer does it return? Why?
p) The above code for floating to integer conversion truncates the result in an odd way for negative numbers. Fix the code so that it truncates the way a naive programmer would expect.
q) The above code for floating to integer conversion truncates, but sometimes, it is desirable to have a version that rounds a number to the nearest integer. Binary numbers can be rounded by adding one in the most significant digit that will be discarded, that is, in the 0.5's place. Write code for FLTROUND that does this.
r) The above code for floating to integer conversion could do thousands of right shifts for numbers with very negative exponents! This is an utter waste. Modify the code so that it automatically recognizes these extreme cases and returns a value of zero whenever more than 32 shifts would be required.

Floating Point Addition

We are now ready to explore the implementation of some of the floating point operations. These follow quite naturally from the standard rules for working with numbers in scientific notation. Consider the problem of adding 9.92×10³ to 9.25×10¹. We begin by denormalizing the numbers so that they have the same exponents; this allows us to add the mantissas, after which we renormalize the result and round it to the appropriate number of decimal places:

**Adding in scientific notation**
given		9.92 × 10³ + 9.25 × 10¹
denormalized		9.92 × 10³ + 0.0925 × 10³
rearranged		(9.92 + 0.0925) × 10³
added		10.0125 × 10³
normalized		1.00125 × 10⁴
rounded		1.00 × 10⁴

The final rounding step is one many students forget, particularly in this era of scientific calculators. For numbers given in scientific notation, we have the convention that the number of digits given is an indication of the precision of the measurements from which the numbers were taken. As a result, if two numbers are given in scientific notation and then added or subtracted, the result should not be expressed to greater precision than the least precise of the operands! When throwing away the less significant digits of the result, we always round in order to minimise the loss of information and introduction of systematic error that would result from truncation.

An important question arises here: Which number do we denormalize prior to adding? The answer is that we never want to lose the most significant digits of the sum, so we always increase the smaller of the two exponents while shifting the corresponding mantissa to the right. In addition, we are seriously concerned with preventing a carry out of the high digit of the result; this caused no problem with pencil and paper, but if we do this in software, we must be prepared to recover from overflow in the sum! This problem is solved in the following floating point add subroutine for the Hawk:

**Adding two floating point numbers on the Hawk**
	        SUBTITLE floating add
	; activation record format
	RETAD   =       0       ; return address
	R8SAVE  =       4       ; place to save R8
	FLTADD: ; on entry, R3 = pointer to floating sum
	        ;           R4 = pointer to addend
	        ;           R5 = pointer to augend
	        STORES  R1,R2           ; save return address
	        STORE   R8,R2,R8SAVE    ; save R8
	        MOVE    R7,R3           ; R7 = saved pointer to sum
	        LOADS   R3,R4           ; R3 = addend.exponent
	        LOAD    R4,R4,MANTISSA  ; R4 = addend.mantissa
	        LOAD    R6,R5,MANTISSA  ; R6 = augend.mantissa
	        LOADS   R5,R5           ; R5 = augend.exponent
	        CMP     R3,R5
	        BLE     FADDEI          ; if (addend.exponent > augend.exponent) {
	        MOVE    R8,R3
	        MOVE    R3,R5
	        MOVE    R5,R8           ;   exchange exponents
	        MOVE    R8,R4
	        MOVE    R4,R6
	        MOVE    R6,R8           ;   exchange mantissas
	FADDEI:                         ; }
	        ; assert (addend.exponent <= augend.exponent)
	FADDDL:                         ; while
	        CMP     R3,R5
	        BGE     FADDDX          ;   (addend.exponent < augend.exponent) {
	        ADDSI   R3,1            ;   increment addend.exponent
	        SR      R4,1            ;   shift addend.mantissa
	        BR      FADDDL
	FADDDX:                         ; }
	        ; assert (addend.exponent = augend.exponent)
	        ADD     R4,R4,R6        ; add mantissas
	        BVR     FADDNO          ; if (overflow) { /* we need one more bit */
	        ADDSI   R3,1            ;   increment result.exponent
	        SR      R4,1            ;   shift result.mantissa
	        SUB     R0,R0,R0        ;   set carry bit in order to ...
	        ADJUST  R4,CMSB         ;   flip sign bit of result (overflow repaired!)
	FADDNO:                         ; }
	        JSR     R1,NORMALIZE    ; normalize( result )
	        STORES  R3,R7           ; save result.exponent
	        STORE   R4,R7,MANTISSA  ; save result.mantissa
	        LOAD    R8,R2,R8SAVE    ; restore R8
	        LOADS   R1,R2           ; restore return address
	        JUMPS   R1              ; return!

Most of this code follows simply from the logic of adding that we demonstrated with the addition of two numbers using scientific notation. There are some points, however, that are worthy of note.

First, about 1/3 of the way down, this code exchanges the two numbers; this involves exchanging two pairs of registers. There are many ways to do this; the approach used here is the simplest to understand, setting the value in one of the registers aside, moving the other register, and then moving the set-aside value into its final resting place. This takes three move instructions and a spare register. There are other ways to do this that are just as fast but do not require a spare register, but these are harder to understand. The most famous and cryptic of these uses the exclusive or operator: a=a⊕b;b=a⊕b;a=a⊕b.
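
The exclusive-or exchange mentioned above can be written out in C as a small illustration. The function name `xor_swap` is an assumption made for this sketch; the comments track what each variable holds in terms of the original values a0 and b0.

```c
#include <stdint.h>

/* Exchange *a and *b without a temporary.
   Caution: if a and b point to the same object, the first
   step zeroes it and the original value is lost. */
void xor_swap(uint32_t *a, uint32_t *b)
{
    *a = *a ^ *b;   /* a holds a0 ^ b0 */
    *b = *a ^ *b;   /* (a0 ^ b0) ^ b0 = a0 */
    *a = *a ^ *b;   /* (a0 ^ b0) ^ a0 = b0 */
}
```

The aliasing hazard noted in the comment is one reason the three-move exchange with a spare register is usually the safer choice.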

Because this routine uses registers 1 to 7 and makes calls to another routine, it needs to use its activation record; here, we have constructed an activation record with two fields, one for saving register 1 to allow the call to NORMALIZE, and one for saving register 8, freeing it for local use. While FLTADD uses its activation record, NORMALIZE does not. Therefore, this code does not need to adjust the stack pointer, register 2, before or after the call to normalize.

Finally, there is the issue of dealing with overflow during addition. When the addition overflows, the bit in the sign position is wrong if interpreted as a sign, but it does hold the correct value for the most significant bit of the magnitude, as if there were an invisible sign bit to the left of it. Therefore, after a signed right shift to make space for the new sign bit (incrementing the exponent to compensate for this), we can complement the sign by adding one to it, for example, using the ADJUST instruction.
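
The alignment and overflow repair steps can be modeled in C as a rough sketch. The `soft_float` struct and the name `fltadd_sketch` are assumptions made for illustration, the final normalization step is deliberately left out, and the code assumes `>>` on a negative value is a sign-preserving shift and that converting an out-of-range unsigned value to `int32_t` wraps in two's complement fashion.

```c
#include <stdint.h>

/* Hypothetical C view of the two-word format */
typedef struct {
    int32_t exponent;
    int32_t mantissa;
} soft_float;

/* Align, add, and repair overflow; normalization is omitted. */
soft_float fltadd_sketch(soft_float a, soft_float b)
{
    soft_float r;
    uint32_t ua, ub, sum;

    if (a.exponent > b.exponent) {      /* make a the operand with the smaller exponent */
        soft_float t = a;
        a = b;
        b = t;
    }
    while (a.exponent < b.exponent) {   /* denormalize a */
        a.exponent += 1;
        a.mantissa >>= 1;               /* assumed arithmetic shift */
    }
    ua = (uint32_t)a.mantissa;
    ub = (uint32_t)b.mantissa;
    sum = ua + ub;                      /* wraps modulo 2^32 */
    r.exponent = a.exponent;
    if (((ua ^ sum) & (ub ^ sum)) >> 31) {
        /* signed overflow: bit 31 now holds the top magnitude bit */
        r.exponent += 1;
        sum = (sum >> 1) | (sum & 0x80000000u); /* signed right shift */
        sum ^= 0x80000000u;                     /* flip the sign bit */
    }
    r.mantissa = (int32_t)sum;          /* two's complement conversion assumed */
    return r;
}
```

Adding 0.75 to 0.75, for instance, overflows the 32-bit sum; after the shift and sign flip the result is mantissa 0x60000000 with exponent 1, that is, 0.75 × 2¹ = 1.5.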

Exercises

s) The floating point add code given here is rather stupid about shifting. It should never right-shift the lesser of the two addends more than 32 times! Fix this!
t) Fix this code so that the denormalize step rounds the lesser of the two addends by adding one to the least significant bit just prior to the final right shift operation.

Floating Point Multiplication

Starting with a working integer multiply routine, floating point multiplication is simpler than floating point addition. This simplicity is apparent in the algorithm for multiplying in scientific notation: Add the exponents, multiply the mantissas and normalize the result, as illustrated below:

**Multiplication in scientific notation**
given		1.02 × 10³ × 9.85 × 10¹
rearranged		(1.02 × 9.85) × 10⁽³⁺¹⁾
multiplied		10.047 × 10⁴
normalized		1.0047 × 10⁵
rounded		1.00 × 10⁵

Unlike addition, we need not denormalize anything before the operation. The one new issue we face is the matter of precision. Multiplying two 32-bit mantissas gives a 64-bit result. We will assume a signed multiply routine that delivers this result, with the following calling sequence:

**A signed multiply interface specification**
	MULTIPLYS:      ; link through R1
	                ; on entry, R3 = multiplier
	                ;           R4 = multiplicand
	                ; on exit,  R3 = product, low bits
	                ;           R4 = product, high bits
	                ; destroys R5, R6
	                ; uses no other registers

If the multiplier and multiplicand have 31 places after the point in each, then the 64-bit product has 62 places after the point. Therefore, to normalize the result, we will always shift it one place. If the multiplier and multiplicand are normalized to have minimum absolute values of 0.5, the product will have a minimum absolute value of 0.25. Normalizing such a small product will require an additional shift, but never more than one. We must use 64-bit shifts for these normalize steps in order to avoid loss of precision, so we cannot use the normalize code we used with addition, subtraction and conversion from binary to floating point.

**Multiplying two floating point numbers on the Hawk**
	        SUBTITLE floating multiply
	; activation record format
	RETAD   =       0       ; return address
	PRODUCT =       4       ; pointer to floating product
	FLTMUL: ; on entry, R3 = pointer to floating product
	        ;           R4 = pointer to multiplier
	        ;           R5 = pointer to multiplicand
	        STORES  R1,R2           ; save return address
	        STORE   R3,R2,PRODUCT   ; save pointer to product
	        LOADS   R6,R4           ; R6 = multiplier.exponent
	        LOADS   R7,R5           ; R7 = multiplicand.exponent
	        ADD     R7,R6,R7        ; R7 = product.exponent
	        LOAD    R3,R4,MANTISSA  ; R3 = multiplier.mantissa
	        LOAD    R4,R5,MANTISSA  ; R4 = multiplicand.mantissa
	        LIL     R1,MULTIPLYS
	        JSRS    R1,R1           ; R3-4 = product.mantissa
	        ; assert (R3-4 has 2 bits left of the point)
	        SL      R3,1
	        ADDC    R4,R4           ; shift product.mantissa 1 place
	        ; assert (R3-4 has 1 bit left of the point)
	        BNS     FMULN           ; if (product.mantissa > 0) {
	        BITTST  R4,30
	        BCS     FMULOK          ;   if (product.mantissa not normalized) {
	        SL      R3,1
	        ADDC    R4,R4           ;     shift product.mantissa 1 place
	        ADDSI   R7,-1           ;     decrement product.exponent
	        BR      FMULOK          ;   }
	FMULN:                          ; } else { negative mantissa
	        ADDSI   R3,-1
	        BCS     FMULNC
	        ADDSI   R4,-1           ;   decrement product.mantissa
	FMULNC:                         ;   mantissa is now in one's complement form
	        BITTST  R4,30
	        BCR     FMULNOK         ;   if (product.mantissa not normalized) {
	        SL      R3,1
	        ADDC    R4,R4           ;     shift product.mantissa 1 place
	        ADDSI   R7,-1           ;     decrement product.exponent
	FMULNOK:                        ;   }
	        ADDSI   R3,1
	        ADDC    R4,R0           ;   increment product.mantissa
	FMULOK:                         ; } mantissa now normalized
	        LOAD    R5,R2,PRODUCT
	        STORES  R7,R5           ; store product.exponent
	        STORE   R4,R5,MANTISSA  ; store product.mantissa
	        LOADS   R1,R2           ; restore return address
	        JUMPS   R1              ; return
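
The shift counting behind this normalization can be checked with a small C model. This is a sketch, not a transcription of the Hawk code: the `soft_float` struct and the name `fltmul_sketch` are assumptions, the operands are assumed to be normalized with magnitudes in the range 0.5 to just under 1.0, and the sketch works on the sign and magnitude separately instead of taking the one's complement detour used in the assembly code for negative products.

```c
#include <stdint.h>

/* Hypothetical C view of the two-word format */
typedef struct {
    int32_t exponent;
    int32_t mantissa;
} soft_float;

soft_float fltmul_sketch(soft_float a, soft_float b)
{
    soft_float r;
    /* 31 + 31 = 62 places after the point */
    int64_t product = (int64_t)a.mantissa * (int64_t)b.mantissa;
    int negative = product < 0;
    uint64_t mag = negative ? (uint64_t)(-product) : (uint64_t)product;

    r.exponent = a.exponent + b.exponent;
    mag <<= 1;                          /* the shift that is always needed: 63 places */
    if (mag != 0 && (mag >> 62) == 0) { /* magnitude below 0.5 */
        mag <<= 1;                      /* the one possible extra shift */
        r.exponent -= 1;
    }
    r.mantissa = (int32_t)(mag >> 32);  /* high word: 31 places after the point */
    if (negative)
        r.mantissa = -r.mantissa;
    return r;
}
```

Squaring 0.5 × 2¹ exercises the extra shift: the raw product is 0.25 × 2², which becomes 0.5 × 2¹ once the mantissa is shifted and the exponent decremented.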

Most of the above code is involved with normalizing the result. This code is oversimplified! What if the product is zero? Our normalization rule is that a product of zero should have the most negative possible exponent. In addition, this code does not test for overflow or underflow; that is, it never checks whether the exponent is out of bounds.

Exercises

u) Fix this floating point multiply code so that it detects underflow and overflow when adding exponents, returning zero on underflow and locking the exponent at its maximum value when the exponent is too large.
v) Fix this floating point multiply code so it correctly handles the value zero.
w) Write code for a floating point divide routine.

Other Operations

Multiply and divide routines do not finish the story. Our commitment to strong abstraction means that users of our floating point numbers may not examine their representations. The designers of floating point hardware do not face this constraint. They advertise the exact format they use and users are free to use this information. If we do not disclose such detail, we must provide tools for comparing numbers, for testing the sign of numbers, for testing for zero, and other operations that might otherwise be trivial.

Another issue we face is the import and export of floating point numbers. We need tools to convert numbers to and from textual and IEEE standard formats. The routine to convert from our eccentric format to IEEE format must first deal with the range of exponent values, since our 32-bit exponent field has an extraordinary range. Second, it converts the exponent and mantissa to the appropriate form, and finally, it packs the pieces together. The following somewhat oversimplified code does this:

**Packing an IEEE Floating point value in C**
	unsigned int ieeepack( int exponent, int mantissa )
	{
	        unsigned int sign = 0;
	        /* first split off the sign */
	        if (mantissa < 0) {
	                mantissa = -mantissa;
	                sign = 0x80000000;
	        }
	        /* put the mantissa in IEEE normalized form */
	        mantissa = mantissa >> 7;
	        /* convert */
	        if (exponent > 128) { /* convert overflow to infinity */
	                mantissa = 0;
	                exponent = 0x7F800000;
	        } else if (exponent < -125) { /* convert underflow to zero */
	                mantissa = 0;
	                exponent = 0;
	        } else { /* conversion is possible */
	                mantissa = mantissa & 0x007FFFFF;
	                exponent = (exponent + 126) << 23;
	        }
	        return sign | exponent | mantissa;
	}

Note in the above code that the advertised bias of the IEEE format is 127, yet we used a bias of 126! This is because we also subtracted one from the original exponent to account for the fact that our numbers were normalized in the range 0.5 to 1.0, while IEEE numbers are normalized in the range 1.0 to 2.0. This is also why we compared with 128 and -125 instead of 127 and -126 when checking for the maximum and minimum legal exponents in the IEEE format. We have omitted one significant detail in the above! All underflows were simply forced to zero when some of them ought to have resulted in denormalized numbers.
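
One way to convince yourself that the bias of 126 and the shift counts are right is to compare packed values against the bit patterns of native floats. The sketch below assumes that the C `float` type is IEEE single precision, that the exponent field sits 23 bits from the right, and that our mantissa has its leading one in bit 30; the names `ieeepack_sketch` and `float_bits` are invented for this illustration, and the packing routine is a fixed-width restatement of the steps described above, not the package's own code.

```c
#include <stdint.h>
#include <string.h>

/* Fixed-width restatement of the packing steps described in the text. */
uint32_t ieeepack_sketch(int32_t exponent, int32_t mantissa)
{
    uint32_t sign = 0;
    uint32_t mag = (uint32_t)mantissa;

    if (mantissa < 0) {             /* split off the sign */
        mag = 0u - mag;
        sign = 0x80000000u;
    }
    mag >>= 7;                      /* leading one moves from bit 30 to bit 23 */
    if (exponent > 128)
        return sign | 0x7F800000u;  /* overflow becomes infinity */
    if (exponent < -125)
        return sign;                /* underflow becomes zero */
    return sign | ((uint32_t)(exponent + 126) << 23) | (mag & 0x007FFFFFu);
}

/* Bit pattern of a native float, assumed to be IEEE single precision. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

In our format, 1.0 is 0.5 × 2¹, so `ieeepack_sketch(1, 0x40000000)` should agree with `float_bits(1.0f)`; the biased exponent field is 1 + 126 = 127, exactly what the IEEE format uses for values between 1.0 and 2.0.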

**Hawk code to unpack an IEEE-format floating-point number**
	        SUBTITLE unpack an IEEE-format floating point number
	FLTIEEE:        ; on entry, R3 points to the return floating value
	                ;           R4 is the number in IEEE format.
	                ;           R5 is used as a temporary
	        MOVE    R5,R4           ; R5 = exponent
	        SL      R5,1            ; throw away the bit left of the exponent
	        SR      R5,12
	        SR      R5,12           ; pull the exponent field all the way right
	        ADDI    R5,R5,-126      ; unbias the exponent
	        STORES  R5,R3           ; save converted exponent
	        MOVE    R5,R4           ; R5 = mantissa
	        SL      R5,9            ; push mantissa all the way left
	        SR      R5,1            ; and then pull it back for missing one bit
	        SUB     R0,R0,R0        ; set carry
	        ADJUST  R5,CMSB         ; and use it to put missing one into mantissa
	        TESTR   R4
	        BNR     FIEEEPOS        ; if (number < 0) {
	        NEG     R5,R5           ;   negate mantissa
	FIEEEPOS:                       ; }
	        STORE   R5,R3,MANTISSA  ; save converted mantissa
	        JUMPS   R1              ; return

Conversion from IEEE format to our eccentric software format is fairly easy because our exponent and mantissa fields are larger than those of the single-precision IEEE format. Thus, we can convert with no loss of precision. The code presented above ignores the possibility that the value might be a NaN or infinity.

This code makes extensive use of shifting to clear fields within the number. Thus, instead of writing n&0xFFFFFF00, we write (n>>8)<<8. This trick is useful on many machines where loading a large constant is significantly slower than a shift instruction. By doing this, we avoid both loading a long constant into a register and using an extra register to hold it. We used a related trick to set the implicit one bit, using a subtract instruction to set the carry bit and then adding this bit into the number using an adjust instruction.
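
The equivalence between masking and paired shifts is easy to demonstrate in C on unsigned 32-bit values. The helper names below are assumptions made for this sketch, and the two field extractors assume the IEEE single-precision layout of one sign bit, eight exponent bits and 23 mantissa bits.

```c
#include <stdint.h>

/* Same result as n & 0xFFFFFF00, with no long constant. */
uint32_t clear_low8(uint32_t n)
{
    return (n >> 8) << 8;
}

/* Discard the sign bit on the left, then pull the
   8-bit exponent field all the way to the right. */
uint32_t ieee_exponent_field(uint32_t n)
{
    return (n << 1) >> 24;
}

/* Push the 23-bit mantissa field all the way left to
   discard sign and exponent, then pull it back. */
uint32_t ieee_mantissa_field(uint32_t n)
{
    return (n << 9) >> 9;
}
```

On unsigned operands, C guarantees that the vacated bits are filled with zeros, which is what makes each shift pair behave like an and with a constant mask.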

Conversion to Decimal

A well designed floating point package will include a complete set of tools for conversion to and from decimal textual representations, but our purpose here is to use the conversion problem to illustrate the use of our floating point package, so we will write our conversion code as user-level code, making no use of any details of the floating point abstraction that are not described in the header file for the package.

First, consider the problem of printing a floating point number using only the operations we have defined, ignoring the complexity of assembly language and focusing on the algorithm. We can begin by taking the integer part of the number and printing that, followed by a point, but the question is, how do we continue from there, printing the digits after the point?

**C code to print a floating point number**
	void fltprint( float num, int places )
	{
	        int inum; /* the integer part */
	        if (num < 0) { /* make it positive and print the sign */
	                num = -num;
	                dspch( '-' );
	        }
	        /* first put out integer part */
	        inum = fltint( num );
	        dspnum( inum );
	        dspch( '.' );
	        /* second put out digits of the fractional part */
	        for (; places > 0; places--) {
	                num = (num - (float)inum) * 10.0;
	                inum = fltint( num );
	                dspch( inum + '0' );
	        }
	}

To print the fractional part of a number, the above C code takes the integer part of the number and subtracts it from the number, leaving just the fractional part. Multiplying the fractional part by ten brings one decimal digit of the fraction above the point. The code prints that digit and then repeats this process for each following digit. This is not particularly efficient, since it keeps converting back and forth between floating and integer representations, but it works.
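
The digit extraction loop can be tried out on an ordinary C compiler with the built-in `double` type standing in for our floating point package. This is only a sketch: the name `fltformat_sketch` is an assumption, the output goes to a caller-supplied buffer instead of through `dspch`, digits are truncated without rounding, and the integer part is assumed to fit in an `int`.

```c
#include <stdio.h>

/* Format num with `places` digits after the point into buf. */
void fltformat_sketch(double num, int places, char *buf, size_t size)
{
    size_t n = 0;
    int inum, written;

    if (size == 0)
        return;
    buf[0] = '\0';
    if (num < 0) {                      /* make it positive, emit the sign */
        num = -num;
        if (size > 1)
            buf[n++] = '-';
    }
    inum = (int)num;                    /* integer part, then the point */
    written = snprintf(buf + n, size - n, "%d.", inum);
    if (written < 0 || (size_t)written >= size - n) {
        buf[n] = '\0';                  /* no room for the integer part */
        return;
    }
    n += (size_t)written;
    for (; places > 0 && n + 1 < size; places--) {
        num = (num - (double)inum) * 10.0;  /* bring next digit above the point */
        inum = (int)num;
        buf[n++] = (char)('0' + inum);
    }
    buf[n] = '\0';
}
```

With the input 3.625 and three places, the loop computes 6.25, 2.5 and 5.0 in turn, giving the digits 6, 2 and 5. Values such as 0.1 that have no exact binary representation may produce a final digit one lower than expected, because the digits are truncated.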

We face a few problems here, and it is best to tackle these incrementally. First, in order to allow code to be written with no knowledge of the structure of floating point numbers, we must pass pointers to numbers, not the numbers themselves, because passing the numbers themselves would require the assembly language programmer to know how many registers it takes to hold each number. Second, we have used arithmetic operators above that involve calls to routines in the floating point package. We will tackle these problems at the high level before trying to deal with them in assembly language.

**Lower level C code to print a floating point number**
	void fltprint( float *pnum, int places )
	{
	        float num; /* a copy of the number */
	        float tmp; /* a temporary floating point number */
	        float ten; /* a constant floating value */
	        int inum;  /* the integer part */
	        int i;     /* loop counter */
	        float( &ten, 10 );
	        if (flttst( pnum ) < 0) { /* make it positive, print the sign */
	                fltneg( &num, pnum );
	                dspch( '-' );
	        } else {
	                fltcpy( &num, pnum );
	        }
	        /* first put out integer part */
	        inum = fltint( &num );
	        dspnum( inum );
	        dspch( '.' );
	        /* second put out digits of the fractional part */
	        while (places > 0) {
	                float( &tmp, inum );
	                fltsub( &num, &num, &tmp );
	                fltmul( &num, &num, &ten );
	                inum = fltint( &num );
	                dspch( inum + '0' );
	                places = places - 1;
	        }
	}

The above code shows some of the problems we forced on ourselves by insisting on having no knowledge of the representation of floating point numbers when we write our print routine. Where a C or Java programmer would write 10.0, relying on the compiler to translate this into floating point representation, and put it in memory, we have been forced to use the integer constant 10 and then call the float() routine to convert it to its internal representation. This is a common consequence of strict object oriented encapsulation, although loose encapsulation schemes, for example, those that export compile or assembly time macros to process constants into their internal representation can get around this.

The next problem we face is that, at the time we write this code, we are denying ourselves knowledge of the size of the representation of floating point numbers. As a result, we cannot allocate space in our activation records taking advantage of a known size. Our solution to this problem rests on two elements.

First, we will rely on the fact that the interface definition for the floating point package float.h provides us with the size of a floating point number in the constant FLOATSIZE; in fact, we have adopted the general convention that, for each object, record or structure, we always have a symbol defined to hold its size.

Second, we can use the assembler itself to sum up the sizes of the fields of the activation record instead of adding them up by hand, as we have in our previous examples. To do this, we begin with an activation record size of zero, define each field in terms of the previous activation record size, and then add the field size to compute the new activation record size. We could, of course, have defined all of the easy to define fields first using the old method, but to be consistent, we have defined all of the fields this way in the following:

**Building an activation record for FLTPRINT**
	        TITLE   fltprint.a -- floating print routine
	        USE     "float.h"
	        INT     FLTPRINT
	        MACRO   LOCAL X, =SIZE
	          X = ARSIZE
	          ARSIZE = ARSIZE + SIZE
	        ENDMAC
	INTSIZE =       4       ; size of an integer
	; activation record format
	ARSIZE  =       0       ; initial size of activation record
	        LOCAL   RETAD,  INTSIZE    ; return address
	        LOCAL   NUM,    FLOATSIZE  ; copy of the floating point number
	        LOCAL   TMP,    FLOATSIZE  ; a temporary floating point number
	        LOCAL   TEN,    FLOATSIZE  ; the constant ten
	        LOCAL   R8SAVE, INTSIZE    ; save area for register 8
	        LOCAL   R9SAVE, INTSIZE    ; save area for register 9

In the above, had we allowed ourselves to use knowledge about the size of a floating point number, we could have defined NUM=4, TMP=12 and TEN=20, but then, any change in the floating point package would have required us to rewrite this code. The macro LOCAL allows us to write local variable declarations compactly; without this macro, each of our local variables would have required two lines of code. For example, the declaration of the local variable NUM would begin with NUM=ARSIZE, and then it would add to the activation record size with ARSIZE=ARSIZE+FLOATSIZE.
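
The bookkeeping done by the LOCAL macro can be mimicked in C to check the resulting offsets. This is a sketch under assumptions: `ar_layout` and `ar_local` are names invented here, and FLOATSIZE is taken to be 8 bytes, the value implied by the offsets NUM=4, TMP=12 and TEN=20 quoted above.

```c
#include <stddef.h>

enum {
    INTSIZE = 4,    /* size of an integer */
    FLOATSIZE = 8   /* assumed size of a floating point number */
};

/* Running size of the activation record, like ARSIZE. */
typedef struct {
    size_t size;
} ar_layout;

/* Like the LOCAL macro: the new field starts at the current
   size, and the size then grows by the size of the field. */
size_t ar_local(ar_layout *ar, size_t field_size)
{
    size_t offset = ar->size;
    ar->size += field_size;
    return offset;
}
```

Calling `ar_local` six times in the order of the declarations above yields offsets 0, 4, 12, 20, 28 and 32 and a final size of 36 bytes; changing FLOATSIZE moves every later field without any other edits, which is the point of the macro.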

The local variables for saving registers 8 and 9 were allocated so that the integer variables in our code can use these registers over and over again instead of being loaded and stored in order to survive each call to a routine in the floating point package. Of course, if those routines need registers 8 and 9, they will be saved and restored anyway, but we leave that to them.

The following code contains one significant optimization. With all of the subroutine calls, we could have incremented and decremented the stack pointer many times. Instead, we increment it just once at the start of the print routine and decrement it just once at the end; in between, we always subtract ARSIZE from every displacement into the activation record in order to correct for this.

**The body of the floating print routine, part 1**
	FLTPRINT:       ; on entry: R3 = pointer to floating point number to print
	                ;           R4 = number of places to print after the point
	        STORES  R1,R2
	        STORE   R8,R2,R8SAVE
	        STORE   R9,R2,R9SAVE    ; saved return address, R8, R9
	        MOVE    R8,R3           ; R8 = pointer to number
	        MOVE    R9,R4           ; R9 = places
	        ADDI    R2,R2,ARSIZE    ; from here on, R2 points to end of AR
	        LEA     R3,R2,TEN-ARSIZE
	        LIS     R4,10
	        LIL     R1,FLOAT
	        JSRS    R1,R1           ; float( &ten, 10 );
	        MOVE    R3,R8
	        LIL     R1,FLTTST
	        JSRS    R1,R1
	        TESTR   R3
	        BNR     FPRNNEG         ; if (flttst( pnum ) < 0) {
	        LEA     R3,R2,NUM-ARSIZE
	        MOVE    R4,R8
	        LIL     R1,FLTNEG
	        JSRS    R1,R1           ;   fltneg( &num, pnum );
	        LIS     R3,'-'
	        LIL     R1,DSPCH
	        JSRS    R1,R1           ;   dspch( '-' );
	        BR      FPRABS
	FPRNNEG:                        ; } else {
	        LEA     R3,R2,NUM-ARSIZE
	        MOVE    R4,R8
	        LIL     R1,FLTCPY
	        JSRS    R1,R1           ;   fltcpy( &num, pnum );
	FPRABS:                         ; }
	        ; /* first put out the integer part */
	        LEA     R3,R2,NUM-ARSIZE
	        LIL     R1,FLTINT
	        JSRS    R1,R1
	        MOVE    R8,R3           ; R8 = inum = fltint( num );
	        LIL     R1,DSPNUM
	        JSRS    R1,R1           ; dspnum( inum );
	        LIS     R3,'.'
	        LIL     R1,DSPCH
	        JSRS    R1,R1           ; dspch( '.' );

**The body of floating print, part 2**
	FPRLP:
	        TESTR   R9
	        BLE     FPRLX           ; while (places > 0) {
	        LEA     R3,R2,TMP-ARSIZE
	        MOVE    R4,R8
	        LIL     R1,FLOAT
	        JSRS    R1,R1           ;   float( &tmp, inum );
	        LEA     R3,R2,NUM-ARSIZE
	        MOVE    R4,R3
	        LEA     R5,R2,TMP-ARSIZE
	        LIL     R1,FLTSUB
	        JSRS    R1,R1           ;   fltsub( &num, &num, &tmp );
	        LEA     R3,R2,NUM-ARSIZE
	        MOVE    R4,R3
	        LEA     R5,R2,TEN-ARSIZE
	        LIL     R1,FLTMUL
	        JSRS    R1,R1           ;   fltmul( &num, &num, &ten );
	        LEA     R3,R2,NUM-ARSIZE
	        LIL     R1,FLTINT
	        JSRS    R1,R1
	        MOVE    R8,R3           ;   R8 = inum = fltint( &num );
	        ADDI    R3,R3,'0'
	        LIL     R1,DSPCH
	        JSRS    R1,R1           ;   dspch( inum + '0' );
	        ADDSI   R9,-1           ;   places = places - 1;
	        BR      FPRLP
	FPRLX:                          ; }
	        ADDI    R2,R2,-ARSIZE
	        LOAD    R8,R2,R8SAVE
	        LOAD    R9,R2,R9SAVE
	        LOADS   R1,R2           ; restore return address, R8, R9
	        JUMPS   R1              ; return

Exercises

x) Write a floating print routine that produces its output in scientific notation, for example, using the format 6.02E23, where the E stands for "times ten to the." To do this, you will first have to do a decimal normalize, counting the number of times you have to multiply or divide by ten in order to bring the mantissa into the range from 1 to just under 10. Then you will have to print the mantissa (using the floating print routine we just discussed), and finally print the exponent.

`31`	`30-23`	`22-00`
S	exponent	mantissa