14. Support for Multiply and Divide
Part of the Hawk Manual

14.1. Multiplication by Constants - Methodology
14.2. Small Constant Multipliers - Optimal Code
14.3. Multiplication by Variables
14.4. Higher Radix Multiplication - Fast Code
14.5. Division by Powers of 2
14.6. Division by Constants - Methodology
14.7. Division by Variables
14.1. Multiplication by Constants - Methodology

The Hawk has no integer multiply instruction. Multiplication hardware fits poorly with modern pipelined architectures, and on most computers, multiplication by small constants, the most common case, can usually be done faster using sequences of add and shift instructions. On the Intel Pentium, Quantasm Corporation reported that while a 32-bit integer add or subtract took 1 clock cycle, a 32-bit multiply took 10 cycles and a divide took 41 cycles. Clearly, multiply and divide instructions should be avoided where faster alternatives exist.
On the Hawk, MOVESL, ADDSL and SL (see Section 6.2) all multiply by constants:
MOVESL r,r′,s — r = r′ × 2^{s}
SL r,s — r = r × 2^{s} for 0 < s < 17
ADDSL r,r,s — r = r × (1+2^{s}) for 0 < s < 17
For example:
        MOVESL  R3,R4,1         ; R3 = R4 x 2
        SL      R3,1            ; R3 = R3 x 2
        ADDSL   R3,R3,1         ; R3 = R3 x 3
        SL      R3,2            ; R3 = R3 x 4
        ADDSL   R3,R3,2         ; R3 = R3 x 5
        SL      R3,3            ; R3 = R3 x 8
        ADDSL   R3,R3,3         ; R3 = R3 x 9
When the desired constant multiplier does not fit this pattern, there are two useful strategies: the multiplier can be factored, or it can be split into a sum of terms. Factoring works well when each factor is itself an easy multiplier:
        SL      R3,1            ; \ R3 = R3 x 6 (2 x 3)
        ADDSL   R3,R3,1         ; /

        SL      R3,1            ; \ R3 = R3 x 10 (2 x 5)
        ADDSL   R3,R3,2         ; /

        ADDSL   R3,R3,1         ; \ R3 = R3 x 15 (3 x 5)
        ADDSL   R3,R3,2         ; /
Generally, our goal in splitting a multiplier (a + b) into components a and b will be to search out values of a and b that are easy to multiply by. Summing partial products requires the use of a temporary register; in the following code, we use r[1]. In loading this temporary with a copy of the multiplicand, it is useful to recall that we can use MOVESL to scale the multiplicand and we can use NEG to negate it (see Section 8.2).
        NEG     R1,R3           ; \ R3 = R3 x 7 (8 - 1)
        ADDSL   R3,R1,3         ; /

        MOVESL  R1,R3,1         ; \ R3 = R3 x 10 (8 + 2)
        ADDSL   R3,R1,3         ; /

        MOVESL  R1,R3,1         ; \ R3 = R3 x 11 (9 + 2)
        ADDSL   R3,R3,3         ; |
        ADD     R3,R3,R1        ; /

        MOVE    R1,R3           ; \ R3 = R3 x 13 ((3 x 4) + 1)
        ADDSL   R3,R3,1         ; |
        ADDSL   R3,R1,2         ; /

        MOVESL  R1,R3,1         ; \ R3 = R3 x 14 ((3 x 4) + 2)
        ADDSL   R3,R3,1         ; |
        ADDSL   R3,R1,2         ; /
Again, a temporary register, r[1] in the following examples, holds a copy of the multiplicand. If the binary multiplier has n one bits, most of the product can be computed using n-1 ADDSL instructions. The shift count on each ADDSL is one more than the number of zeros between consecutive one bits. If the binary multiplier has trailing zeros, a final SL is needed, with a shift count equal to the number of trailing zeros.
        MOVE    R1,R3           ; \ R3 = R3 x 10 (binary 1010)
        ADDSL   R3,R1,2         ; |
        SL      R3,1            ; /

        MOVE    R1,R3           ; \ R3 = R3 x 27 (binary 11011)
        ADDSL   R3,R1,1         ; |
        ADDSL   R3,R1,2         ; |
        ADDSL   R3,R1,1         ; /

        MOVE    R1,R3           ; \ R3 = R3 x 100 (binary 1100100)
        ADDSL   R3,R1,1         ; |
        ADDSL   R3,R1,3         ; |
        SL      R3,2            ; /

        MOVE    R1,R3           ; \ R3 = R3 x 117 (binary 1110101)
        ADDSL   R3,R1,1         ; |
        ADDSL   R3,R1,1         ; |
        ADDSL   R3,R1,2         ; |
        ADDSL   R3,R1,2         ; /
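The decomposition rule just described is easy to check mechanically. The following Python sketch (the function names are illustrative, not part of any Hawk toolchain) derives the ADDSL shift counts and the final SL count from a multiplier's binary representation, and simulates the resulting instruction sequence:

```python
def addsl_plan(m):
    """Shift counts per the rule above: strip trailing zeros (they become
    a final SL), then one ADDSL per one-bit after the leading one, each
    shift count one more than the zeros between consecutive one bits."""
    final = 0
    while m % 2 == 0:
        m //= 2
        final += 1
    shifts = []
    gap = 0
    for bit in bin(m)[3:]:           # skip '0b' and the leading one bit
        gap += 1
        if bit == "1":
            shifts.append(gap)
            gap = 0
    return shifts, final

def run_plan(x, plan):
    """Simulate MOVE R1,R3; then ADDSL R3,R1,s per count; then SL R3,final."""
    shifts, final = plan
    r1 = r3 = x
    for s in shifts:
        r3 = (r3 << s) + r1          # ADDSL R3,R1,s
    return r3 << final               # SL R3,final
```

For 27 (binary 11011), this yields the shift counts 1, 2, 1 used in the sequence shown above.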
14.2. Small Constant Multipliers - Optimal Code

It is difficult to give a general rule for which approach will give the shortest instruction sequence for multiplication. For each multiplier, a different approach may yield a better result, and it is common to find multiple instruction sequences that all perform comparably. Trial and error gives us the following instruction sequences for multipliers from 2 to 10:
        SL      R3,1            ; R3 = R3 x 2
        ADDSL   R3,R3,1         ; R3 = R3 x 3
        SL      R3,2            ; R3 = R3 x 4
        ADDSL   R3,R3,2         ; R3 = R3 x 5

        SL      R3,1            ; \ R3 = R3 x 6 (2 x 3)
        ADDSL   R3,R3,1         ; /

        NEG     R1,R3           ; \ R3 = R3 x 7 (8 - 1)
        ADDSL   R3,R1,3         ; /

        SL      R3,3            ; R3 = R3 x 8
        ADDSL   R3,R3,3         ; R3 = R3 x 9

        SL      R3,1            ; \ R3 = R3 x 10 (2 x 5)
        ADDSL   R3,R3,2         ; /
The above all use 2 or fewer instructions. Multipliers from 11 to 38 sometimes require 3:
        MOVE    R1,R3           ; \ R3 = R3 x 11 (binary 1011)
        ADDSL   R3,R1,2         ; |
        ADDSL   R3,R1,1         ; /

        SL      R3,2            ; \ R3 = R3 x 12 (4 x 3)
        ADDSL   R3,R3,1         ; /

        MOVE    R1,R3           ; \ R3 = R3 x 13 (binary 1101)
        ADDSL   R3,R1,1         ; |
        ADDSL   R3,R1,2         ; /

        MOVESL  R1,R3,1         ; \ R3 = R3 x 14 ((3 x 4) + 2)
        ADDSL   R3,R3,1         ; |
        ADDSL   R3,R1,2         ; /

        ADDSL   R3,R3,1         ; \ R3 = R3 x 15 (3 x 5)
        ADDSL   R3,R3,2         ; /

        SL      R3,4            ; R3 = R3 x 16
        ADDSL   R3,R3,4         ; R3 = R3 x 17

        ADDSL   R3,R3,3         ; \ R3 = R3 x 18 (9 x 2)
        SL      R3,1            ; /

        MOVE    R1,R3           ; \ R3 = R3 x 19 (binary 10011)
        ADDSL   R3,R1,3         ; |
        ADDSL   R3,R1,1         ; /

        SL      R3,2            ; \ R3 = R3 x 20 (4 x 5)
        ADDSL   R3,R3,2         ; /

        MOVE    R1,R3           ; \ R3 = R3 x 21 (binary 10101)
        ADDSL   R3,R1,2         ; |
        ADDSL   R3,R1,2         ; /

        MOVESL  R1,R3,1         ; \ R3 = R3 x 22 (binary 10110 with a trick!)
        ADDSL   R3,R1,3         ; |
        ADDSL   R3,R1,1         ; /

        NEG     R1,R3           ; \ R3 = R3 x 23 ((3 x 8) - 1)
        ADDSL   R3,R3,1         ; |
        ADDSL   R3,R1,3         ; /

        SL      R3,3            ; \ R3 = R3 x 24 (8 x 3)
        ADDSL   R3,R3,1         ; /

        ADDSL   R3,R3,2         ; \ R3 = R3 x 25 (5 x 5)
        ADDSL   R3,R3,2         ; /

        MOVESL  R1,R3,1         ; \ R3 = R3 x 26 ((3 x 8) + 2)
        ADDSL   R3,R3,1         ; |
        ADDSL   R3,R1,3         ; /

        ADDSL   R3,R3,3         ; \ R3 = R3 x 27 (9 x 3)
        ADDSL   R3,R3,1         ; /

        MOVESL  R1,R3,2         ; \ R3 = R3 x 28 ((3 x 8) + 4)
        ADDSL   R3,R3,1         ; |
        ADDSL   R3,R1,3         ; /

        NEG     R1,R3           ; \ R3 = R3 x 29 (32 - 3)
        ADDSL   R1,R1,1         ; |
        ADDSL   R3,R1,5         ; /

        SL      R3,1            ; \ R3 = R3 x 30 (2 x 3 x 5)
        ADDSL   R3,R3,1         ; |
        ADDSL   R3,R3,2         ; /

        NEG     R1,R3           ; \ R3 = R3 x 31 (32 - 1)
        ADDSL   R3,R1,5         ; /

        SL      R3,5            ; R3 = R3 x 32
        ADDSL   R3,R3,5         ; R3 = R3 x 33

        ADDSL   R3,R3,4         ; \ R3 = R3 x 34 (17 x 2)
        SL      R3,1            ; /

        MOVESL  R1,R3,1         ; \ R3 = R3 x 35 (3 + 32, with a trick)
        ADD     R1,R1,R3        ; |
        ADDSL   R3,R1,5         ; /

        MOVESL  R1,R3,2         ; \ R3 = R3 x 36 (32 + 4)
        ADDSL   R3,R1,5         ; /

        MOVE    R1,R3           ; \ R3 = R3 x 37 (32 + 5)
        ADDSL   R1,R1,2         ; |
        ADDSL   R3,R1,5         ; /

        MOVESL  R1,R3,1         ; \ R3 = R3 x 38 ((2 x 3) + 32)
        ADDSL   R1,R1,1         ; |
        ADDSL   R3,R1,5         ; /
Multiplying by 39 takes 4 instructions, but multiplying by 100 takes only 3. Extending this table is tedious and the payoff is limited. Occasionally, however, some applications require multiplication by larger constants. For example, some pseudorandom number generators require fast multiplication by strange values such as 16,807, 39,373, 69,621 and 48,271.
14.3. Multiplication by Variables

The classic shift-and-add algorithm for multiplying is known by many names; if, in each step, the multiplier is right shifted and the multiplicand is left shifted, so that the partial products are added in order of increasing significance, it is called the Russian Peasant Algorithm or the Egyptian Algorithm, giving credit to one or the other of its many folk origins.
Whatever the shift direction or the order of additions, all basic binary multiplication algorithms work as follows: in each multiply step, one bit of the multiplier is tested and both the multiplier and the product are shifted one place; if the bit is one, the multiplicand is added, as a partial product, to the product being accumulated. It takes 32 such steps to compute the 64-bit product of two 32-bit unsigned numbers.

In many cases, only the low 32 bits of the 64-bit product are needed, so the following simplified code is useful. It shifts both the product and the multiplier left, adding the most significant partial product first:
TIMESUL:                ; unsigned multiply, low 32 bits of the product
                        ; expects: R1 -- return address
                        ;          R3 = and  -- multiplicand
                        ;          R4 = ier  -- multiplier
                        ; returns: R3 = prod -- low 32 bits of product
                        ; uses:    R5 = and  -- copy of multiplicand
                        ;          R6 = i    -- loop counter
        MOVE    R5,R3           ; -- copy multiplicand
        CLR     R3              ; prod = 0
        LIS     R6,32           ; i = 32
TMULLP:                         ; do { -- begin multiply step
        SL      R3,1            ;   prod = prod << 1
        SL      R4,1            ;   ier = ier << 1
        BCR     TMULCN          ;   if (high bit of ier was 1) {
        ADD     R3,R3,R5        ;     prod = prod + and
TMULCN:                         ;   } -- end multiply step
        ADDSI   R6,-1           ;   i = i - 1
        BGT     TMULLP          ; } while (i > 0)
        JUMPS   R1              ; return prod
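As a cross-check, here is a Python model of the same loop (an illustration of the algorithm, not Hawk code); the carry out of each shift of the multiplier is modeled by testing the high bit before shifting:

```python
MASK32 = (1 << 32) - 1

def times_ul(multiplicand, multiplier):
    """Model of TIMESUL: 32 shift-and-add steps, most significant
    partial product first, keeping only the low 32 bits."""
    prod = 0
    for _ in range(32):
        high_bit = (multiplier >> 31) & 1        # the bit SL shifts into carry
        prod = (prod << 1) & MASK32              # SL R3,1
        multiplier = (multiplier << 1) & MASK32  # SL R4,1
        if high_bit:                             # BCR skips the add on carry clear
            prod = (prod + multiplicand) & MASK32
    return prod
```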
Signed multiplication in the 2's complement number system poses only small added challenges. With 2's complement binary numbers, the effective weight of the sign bit is negative, so the partial product computed for the sign bit must be subtracted from the product, not added. Therefore, we handle the sign bit with a special multiply step outside the loop. Other than this, the following code uses the same algorithm as the unsigned code given above:
TIMESL:                 ; signed multiply, low 32 bits of the product
                        ; expects: R1 -- return address
                        ;          R3 = and  -- multiplicand
                        ;          R4 = ier  -- multiplier
                        ; returns: R3 = prod -- low 32 bits of product
                        ; uses:    R5 = and  -- copy of multiplicand
                        ;          R6 = i    -- loop counter
        MOVE    R5,R3           ; -- copy multiplicand
        CLR     R3              ; prod = 0
                                ; -- begin special multiply step
        SL      R4,1            ; ier = ier << 1
        BCR     TMSSKP          ; if (sign bit of ier was 1) {
        SUB     R3,R0,R5        ;   prod = prod - and
TMSSKP:                         ; } -- end special multiply step
        LIS     R6,31           ; i = 31
TMSLLP:                         ; do { -- begin multiply step
        SL      R3,1            ;   prod = prod << 1
        SL      R4,1            ;   ier = ier << 1
        BCR     TMSLCN          ;   if (high bit of ier was 1) {
        ADD     R3,R3,R5        ;     prod = prod + and
TMSLCN:                         ;   } -- end of multiply step
        ADDSI   R6,-1           ;   i = i - 1
        BGT     TMSLLP          ; } while (i > 0)
        JUMPS   R1              ; return prod
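The special handling of the sign bit can be modeled the same way; this Python sketch (illustrative, not Hawk code) works on 32-bit two's complement bit patterns:

```python
MASK32 = (1 << 32) - 1

def times_l(multiplicand, multiplier):
    """Model of TIMESL: the sign bit's partial product is subtracted in a
    special first step; the remaining 31 steps match the unsigned loop."""
    prod = 0
    sign = (multiplier >> 31) & 1
    multiplier = (multiplier << 1) & MASK32      # SL R4,1
    if sign:
        prod = (-multiplicand) & MASK32          # SUB R3,R0,R5
    for _ in range(31):
        bit = (multiplier >> 31) & 1
        prod = (prod << 1) & MASK32              # SL R3,1
        multiplier = (multiplier << 1) & MASK32  # SL R4,1
        if bit:
            prod = (prod + multiplicand) & MASK32
    return prod
```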
Each multiply step in either version of the above code has 4 instructions, and it takes 2 more instructions to wrap a loop around the multiply step. This means that 1/3 of the instructions executed during the body of the multiply code are loop control instructions, while 2/3 do the useful computation. On pipelined processors, branch instructions may incur a branch penalty, making them even more expensive.
Since these loops are definite loops with a fixed number of iterations, this code can be accelerated by unrolling the loop. That is, instead of doing 32 iterations with 1 multiply step per iteration, we can do 16 iterations with 2 multiply steps per iteration, called 2-way unrolling. If we fully unroll the loop, we end up with 32 multiply steps in a row with no loop at all. If each instruction takes the same amount of time, the fully unrolled code runs in 2/3 the time taken by the original. On a pipelined machine, the elimination of the branch penalty may lead to even greater improvements. The following example uses 2-way unrolling to speed up the unsigned multiply given above:
TIMESUL:                ; unsigned multiply, low 32 bits of the product
                        ; expects: R1 -- return address
                        ;          R3 = and  -- multiplicand
                        ;          R4 = ier  -- multiplier
                        ; returns: R3 = prod -- low 32 bits of product
                        ; uses:    R5 = and  -- copy of multiplicand
                        ;          R6 = i    -- loop counter
        MOVE    R5,R3           ; -- copy multiplicand
        CLR     R3              ; prod = 0
        LIS     R6,16           ; i = 16
TMULLP:                         ; do { -- begin first multiply step
        SL      R3,1            ;   prod = prod << 1
        SL      R4,1            ;   ier = ier << 1
        BCR     TMULC1          ;   if (high bit of ier was 1)
        ADD     R3,R3,R5        ;     prod = prod + and
TMULC1:                         ;   -- end first, start second
        SL      R3,1            ;   prod = prod << 1
        SL      R4,1            ;   ier = ier << 1
        BCR     TMULC2          ;   if (high bit of ier was 1)
        ADD     R3,R3,R5        ;     prod = prod + and
TMULC2:                         ;   -- end of second multiply step
        ADDSI   R6,-1           ;   i = i - 1
        BGT     TMULLP          ; } while (i > 0)
        JUMPS   R1              ; return prod
The following table summarizes the potential performance gains by unrolling the unsigned multiply routine, from no unrolling to completely unrolled. Execution time can be estimated from the number of instructions, but on pipelined machines, there will be branch penalties that may only apply to unpredictable branches.
  Unrolling (steps per iteration)     1     2      n < 32      32
  -------------------------------   ---   ---   ----------   ----
  Code size, bytes                   20    28    12 + 8n       262
  Worst case instructions           196   164   132 + 64/n    131
  Average instructions              170   138   106 + 64/n    104
  Worst case branch count            63    47    32 + 31/n     32
  Average branch count               47    31    16 + 31/n     16
  Predictable branches               31    15    31/n           0
14.4. Higher Radix Multiplication - Fast Code

Another way to speed up multiplication is to move from binary to a higher radix. For binary numbers, the obvious radices are higher powers of two. A radix-4 multiply, working with 2 bits at a time, takes only 16 iterations to multiply 32-bit numbers, and a radix-16 multiply, using 4 bits at a time, takes only 8 iterations.
With a higher radix, the multiply step for each digit of the multiplier involves more than just a shift and add. We already know optimal instruction sequences for multiplying by each possible digit value (see Section 14.2). The key to fast higher-radix code is a case-select construct that selects the appropriate optimal sequence to compute the partial product for each digit of the multiplier.
BTRUNC (see Section 9.3) is the key to building fast case-select constructs: it allows a 2^{n}-way branch depending on the n least significant bits of a register.

The code given for binary multiplication in the previous section always added the most significant partial product first. Using BTRUNC, the code works better if the least significant partial product is added first and the multiplier is shifted right after each step. This leads to the following unsigned radix-4 multiply code:
TIMESUL:                ; unsigned radix-4 multiply, low 32 bits of the product
                        ; expects:  R1 -- return address
                        ;           R3 = and  -- multiplicand
                        ;           R4 = ier  -- multiplier
                        ; returns:  R3 = prod -- low 32 bits of product
                        ; destroys: R5 = and  -- copy of multiplicand
                        ;           R6 = i    -- loop counter
                        ;           R7 = pp   -- partial product
        MOVE    R5,R3           ; -- copy multiplicand
        CLR     R3              ; prod = 0
        LIS     R6,16           ; i = 16
TMULLP:                         ; do { -- begin radix-4 multiply step
        BTRUNC  R4,2            ;   select (ier & 3) {
        BR      TMUL0           ;   -- branch table to the cases
        BR      TMUL1
        BR      TMUL2
        BR      TMUL3
TMUL0:                          ;   case 0:
        CLR     R7              ;     pp = 0
        BR      TMULBR          ;     break;
TMUL1:                          ;   case 1:
        MOVE    R7,R5           ;     pp = and
        BR      TMULBR          ;     break;
TMUL2:                          ;   case 2:
        MOVESL  R7,R5,1         ;     pp = and x 2
        BR      TMULBR          ;     break;
TMUL3:                          ;   case 3:
        MOVE    R7,R5
        ADDSL   R7,R7,1         ;     pp = and x 3
TMULBR:                         ;   } -- end select
        ADD     R3,R3,R7        ;   prod = prod + pp
        SR      R4,2            ;   ier = ier >> 2
        SL      R5,2            ;   and = and << 2
                                ;   -- end radix-4 multiply step
        ADDSI   R6,-1           ;   i = i - 1
        BGT     TMULLP          ; } while (i > 0)
        JUMPS   R1              ; return prod
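The case-select logic can be sketched in Python (illustrative names; a model of the algorithm, not Hawk code):

```python
MASK32 = (1 << 32) - 1

def times_ul_radix4(multiplicand, multiplier):
    """Model of the radix-4 routine: 16 steps, each consuming two
    multiplier bits; the case select computes digit * multiplicand."""
    prod = 0
    for _ in range(16):
        digit = multiplier & 3                       # BTRUNC R4,2 picks a case
        if digit == 0:
            pp = 0                                   # CLR R7
        elif digit == 1:
            pp = multiplicand                        # MOVE R7,R5
        elif digit == 2:
            pp = multiplicand << 1                   # MOVESL R7,R5,1
        else:
            pp = (multiplicand << 1) + multiplicand  # MOVE; ADDSL R7,R7,1
        prod = (prod + pp) & MASK32
        multiplier >>= 2                             # SR R4,2
        multiplicand = (multiplicand << 2) & MASK32  # SL R5,2
    return prod
```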
The above radix-4 code occupies 44 bytes, exactly the same as a 4-way unrolled binary multiply, and its average and worst-case performance are identical, with exactly 148 instructions executed per multiply. This is significantly better than the version of binary multiplication without unrolling but, on average, not as good as 2-way unrolled binary multiplication.
There is one small optimization that can be made to the above: case 3 can be moved to the start of the list of cases. This allows BR TMUL3 to be eliminated from the branch table and replaced by the first instruction of the case. On average, this optimization reduces the number of instructions per multiply by 8, with no change in the worst case.

Another optimization we can make is to duplicate the entire loop epilog at the end of each case, eliminating BR TMULBR from each case. This allows further optimization of case 0, where there is no need to "compute" or add the partial product zero.

As is usual with optimized code, these changes make the code less readable, and they make it difficult to write comments that resemble well-structured code in a high-level language. At best, the comments end up resembling poorly written code in a mid-level language like C.
While the above radix-4 code is not a clear winner, if we move to radix 16, we get code that, while significantly larger, is extraordinarily fast. The following radix-16 code incorporates both of the optimizations suggested above. This code occupies 294 bytes, as opposed to 262 bytes for the fully unrolled binary code, but its worst-case execution path is only 91 instructions, and on average (assuming random multipliers) it should take about 73 instructions.
TIMESUL:                ; unsigned radix-16 multiply, low 32 bits of the product
                        ; expects:  R1 -- return address
                        ;           R3 = and  -- multiplicand
                        ;           R4 = ier  -- multiplier
                        ; returns:  R3 = prod -- low 32 bits of product
                        ; destroys: R5 = and  -- copy of multiplicand
                        ;           R6 = i    -- loop counter
                        ;           R7 = pp   -- partial product
        MOVE    R5,R3           ; -- copy multiplicand
        CLR     R3              ; prod = 0
        LIS     R6,8            ; i = 8
TMULLP:                         ; do { -- begin radix-16 multiply step
        BTRUNC  R4,4            ;   select (ier & 15) {
        BR      TMUL0           ;   -- branch table to the cases
        BR      TMUL1
        BR      TMUL2
        BR      TMUL3
        BR      TMUL4
        BR      TMUL5
        BR      TMUL6
        BR      TMUL7
        BR      TMUL8
        BR      TMUL9
        BR      TMULA
        BR      TMULB
        BR      TMULC
        BR      TMULD
        BR      TMULE
                                ;   case 15:
        MOVE    R7,R5
        ADDSL   R7,R7,1         ;     -- and x 3
        ADDSL   R7,R7,2         ;     pp = and x 15 (3 x 5)
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL0:                          ;   case 0:
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL1:                          ;   case 1:
        ADD     R3,R3,R5        ;     prod = prod + and
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL2:                          ;   case 2:
        MOVESL  R7,R5,1         ;     pp = and x 2
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL3:                          ;   case 3:
        MOVE    R7,R5
        ADDSL   R7,R7,1         ;     pp = and x 3
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL4:                          ;   case 4:
        MOVESL  R7,R5,2         ;     pp = and x 4
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL5:                          ;   case 5:
        MOVE    R7,R5
        ADDSL   R7,R7,2         ;     pp = and x 5
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL6:                          ;   case 6:
        MOVESL  R7,R5,1
        ADDSL   R7,R7,1         ;     pp = and x 6
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL7:                          ;   case 7:
        MOVE    R7,R5
        ADDSL   R7,R7,1
        ADDSL   R7,R5,1         ;     pp = and x 7
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL8:                          ;   case 8:
        MOVESL  R7,R5,3         ;     pp = and x 8
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
TMULPP:
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMUL9:                          ;   case 9:
        MOVE    R7,R5
        ADDSL   R7,R7,3         ;     pp = and x 9
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMULA:                          ;   case 10:
        MOVESL  R7,R5,1
        ADDSL   R7,R7,2         ;     pp = and x 10
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMULB:                          ;   case 11:
        MOVE    R7,R5
        ADDSL   R7,R7,2
        ADDSL   R7,R5,1         ;     pp = and x 11
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMULC:                          ;   case 12:
        MOVESL  R7,R5,2
        ADDSL   R7,R7,1         ;     pp = and x 12
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULLP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMULD:                          ;   case 13:
        MOVE    R7,R5
        ADDSL   R7,R7,1
        ADDSL   R7,R5,2         ;     pp = and x 13
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULPP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
TMULE:                          ;   case 14:
        MOVE    R7,R5
        ADDSL   R7,R7,1
        ADDSL   R7,R5,1
        SL      R7,1            ;     pp = and x 14
        ADD     R3,R3,R7        ;     prod = prod + pp
        SL      R5,4            ;     and = and << 4
        SR      R4,4            ;     ier = ier >> 4  -- end multiply step
        ADDSI   R6,-1           ;     count = count - 1
        BGT     TMULPP          ;     if (count > 0) continue
        JUMPS   R1              ;     return prod
                                ;   } -- end case
                                ; } -- end loop
The above code is better than twice the speed of the unoptimized unsigned binary multiply code for random integers. Furthermore, multipliers with long runs of zero bits are common, and while these are the worst case for the original unsigned binary code, they are the best case for this code.
Duplication of the loop epilog in the above has expanded the code so much that the BGT instructions in the final two cases are unable to reach the head of the loop at TMULLP. To solve this problem, the above code introduces a new label, TMULPP, in case 8 that allows control to reach TMULLP in two steps. There are several alternative solutions to the problem of out-of-range branches, but in this context, none are faster or more compact.
On pipelined machines, the above code is definitely not optimal, because many instructions depend directly on the results of the immediately preceding instruction. To eliminate these dependencies, some of the instructions can be reordered; the goal is to place unrelated instructions between each instruction that produces a result and the instruction that depends on it. Unfortunately, such reorderings almost always produce code that is harder to read than the original. Consider, for example, the following rewrite of the final case above:
TMULE:                          ; case 14
        MOVE    R7,R5
        ADDSL   R7,R5,1
        SR      R4,4            ; ier = ier >> 4
        ADDSL   R7,R5,1         ; pp = and x 7
        SL      R5,4            ; and = and << 4
        SL      R7,1            ; pp = pp x 2 (and x 14)
        ADDSI   R6,-1           ; count = count - 1
        SUM     R3,R7           ; prod = prod + pp -- cc's unchanged
        BGT     TMULPP          ; if (count > 0) continue
        JUMPS   R1              ; return prod
Caches add even more difficulty because they make simple prediction of run time extremely difficult. When there is a cache, the second time a particular instruction is executed, it may be much faster than it was the first time, because the cache eliminates most of the cost of the instruction fetch. As a result, a large multiply routine with many specialized cases for each multiplier digit may be slower than a small version that applies the same general case to every digit.
14.5. Division by Powers of 2

SRU (see Section 6.3) can be used for fast unsigned division with truncation, that is, rounding any fractional quotient down to the next lower integer:
        SRU     R3,1            ; unsigned divide by 2
        SRU     R3,2            ; unsigned divide by 4
SR (see Section 6.3) works for signed division, but there are alternative rules for truncating the quotient. SR truncates to the next lower integer, so the remainder is always positive:

        SR      R3,1            ; signed divide by 2, truncating downward
        SR      R3,2            ; signed divide by 4, truncating downward
Generally, people expect –5 divided by 4 to give a quotient of –1 with a remainder of –1. When SR is used for this division problem, it gives a quotient of –2 with a remainder of 3. In fact, our naive expectation is contradicted by another equally naive expectation. We expect the remainder after division by 4 to be between 0 and 3. These two naive expectations are incompatible for negative dividends. SR keeps the remainder positive while violating naive expectations about the quotient.
Many programming languages require integer division to truncate the quotient toward zero, so that the sign of the remainder matches the sign of the dividend. ADJUST r,SSQ (see Section 11.2) can be used to deliver this result, as follows:

        SR      R3,2            ; divide by 4, truncating downward
        ADJUST  R3,SSQ          ; adjust quotient so truncation is toward zero
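The difference between the two truncation rules, and the effect of the adjustment, can be illustrated in Python (Python's arithmetic shift floors, just as SR does; the function names are illustrative):

```python
def sr_divide(a, k):
    """Floor division by 2**k, as SR gives it: remainder always >= 0."""
    q = a >> k                      # arithmetic shift floors, like SR
    return q, a - (q << k)

def adjusted_divide(a, k):
    """Truncation toward zero, the result ADJUST ...,SSQ delivers:
    the remainder takes the sign of the dividend."""
    q, r = sr_divide(a, k)
    if a < 0 and r != 0:            # the adjustment step
        q += 1
        r -= 1 << k
    return q, r
```

For the example in the text, sr_divide(-5, 2) gives (-2, 3) while adjusted_divide(-5, 2) gives (-1, -1).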
For both signed and unsigned shifts, the quotient can be rounded to the nearest integer using ADDC (see Section 10.3):

        SR      R3,2            ; signed divide by 4 with positive remainder
        ADDC    R3,0            ; round so remainder is between -2 and 1
        SRU     R3,3            ; unsigned divide by 8 with positive remainder
        ADDC    R3,0            ; round so remainder is between -4 and 3
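The rounding effect can be modeled the same way, assuming (as the text describes) that ADDC adds back the carry left behind by the shift, which is the last bit shifted out:

```python
def round_divide(a, k):
    """Shift right, then add the last bit shifted out (the carry that
    ADDC sees): quotient rounded to nearest, ties rounded up."""
    carry = (a >> (k - 1)) & 1      # last bit shifted out by the k-bit shift
    return (a >> k) + carry
```

For example, round_divide(6, 2) gives 2 (1.5 rounds up) while round_divide(5, 2) gives 1.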
14.6. Division by Constants - Methodology

In general, it is faster to multiply by the reciprocal of a number than to divide by that number! This is particularly important for division by constants that are not powers of two, since we can divide by a power of two with a simple shift. Furthermore, this applies across most computers, including those with hardware divide instructions. The following table shows the reciprocals of the integers from 1 to 10.
   n    1/n (base 10)         1/n (base 2)
  ---   ------------------    ------------------
   1    1.0                   1.0
   2    0.5                   0.1
   3    0.33...               0.0101...
   4    0.25                  0.01
   5    0.2                   0.00110011...
   6    0.166...              0.00101...
   7    0.142857142857...     0.001001...
   8    0.125                 0.001
   9    0.11...               0.000111000111...
  10    0.1                   0.000110011...
In binary, as in decimal, repeating fractions are traditionally marked with an overbar above the string of digits that repeats, a notation this plain-text rendering cannot reproduce. Many numbers that have exact reciprocals in decimal have reciprocals that repeat in binary. For example, 0.1_{10} has no finite binary representation.
In general, if we multiply a 32-bit integer by the reciprocal, truncating the integer part of the product to 32 bits, we get a result that differs from the expected quotient by no more than one. In some cases this is acceptable, but in most cases an exact quotient is needed.

One way to obtain an exact quotient is to multiply by a carefully selected approximation of the reciprocal. In general, for an n-bit unsigned dividend, we can get the exact result by multiplying by an n+1 bit quantity and then shifting to take the appropriate bits of the product. The n+1 bit approximate reciprocal cannot have more than n+1 one-bits, so there are at most n+1 partial products to add in computing the quotient. ADDSRU (see Section 6.3) is particularly effective for adding partial products and discarding fractional bits of the result. Generally, we can divide a 32-bit unsigned number by an arbitrary constant with a sequence of no more than 33 ADDSRU instructions. Here are some examples:
; R3 = R3 / 3 = R3 * 0.010101010101010101010101010101011
        MOVE    R1,R3           ; R3 = R1 * 1.0   (make copy of original R3)
        SRU     R3,1            ; R3 = R1 * 0.1
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.011      exact R3/3 for all R1 < 8
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.01011            all R1 < 32
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.0101011          all R1 < 128
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.010101011        all R1 < 512
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.01010101011      all R1 < 2,048
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.0101010101011    all R1 < 8,192
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.010101010101011  all R1 < 32,768
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.01010101010101011  all R1 < 131,072
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.0101010101010101011
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.010101010101010101011
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.01010101010101010101011
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.0101010101010101010101011
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.010101010101010101010101011
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.01010101010101010101010101011
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.0101010101010101010101010101011
        ADDSRU  R3,R1,2         ; R3 = R1 * 0.010101010101010101010101010101011

; R3 = R3 / 5 = R3 * 0.0011001100110011001100110011001101
        MOVE    R1,R3           ; R3 = R1 * 1.0   (make copy of original R3)
        SRU     R3,2            ; R3 = R1 * 0.01
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.101
        ADDSRU  R3,R1,3         ; R3 = R1 * 0.001101   exact R3/5 for all R1 < 64
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.1001101
        ADDSRU  R3,R1,3         ; R3 = R1 * 0.0011001101       all R1 < 1,024
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.10011001101
        ADDSRU  R3,R1,3         ; R3 = R1 * 0.00110011001101   all R1 < 16,384
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.100110011001101
        ADDSRU  R3,R1,3         ; R3 = R1 * 0.001100110011001101  all R1 < 262,144
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.1001100110011001101
        ADDSRU  R3,R1,3         ; R3 = R1 * 0.0011001100110011001101
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.10011001100110011001101
        ADDSRU  R3,R1,3         ; R3 = R1 * 0.00110011001100110011001101
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.100110011001100110011001101
        ADDSRU  R3,R1,3         ; R3 = R1 * 0.001100110011001100110011001101
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.1001100110011001100110011001101
        ADDSRU  R3,R1,3         ; R3 = R1 * 0.0011001100110011001100110011001101
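The ADDSRU sequences above are easy to model. The following Python sketch (an illustration of the technique, not Hawk code) checks the divide-by-3 sequence against exact division; it is exact for every 32-bit dividend because the 33-place constant happens to equal (2^33 + 1)/3 exactly:

```python
MASK32 = (1 << 32) - 1

def addsru(r, s, count):
    """ADDSRU r,s,count: add, then shift the (wider) sum right unsigned;
    Python integers hold the wide intermediate sum for free."""
    return ((r + s) >> count) & MASK32

def div3(x):
    """The divide-by-3 sequence above: one SRU, then 16 ADDSRU steps."""
    r1 = x
    r3 = x >> 1                     # SRU R3,1 : R1 * 0.1 (binary)
    for _ in range(16):
        r3 = addsru(r3, r1, 2)      # ADDSRU R3,R1,2
    return r3
```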
As the comments above indicate, the complete series of 16 ADDSRU instructions is only needed if the maximum value of the dividend is unknown. If the maximum is known, an abbreviated shift sequence will suffice. For example, for a 16-bit dividend, 6 or 7 ADDSRU instructions do the job.
Given code to divide by 3, division by 6, 12 and other binary multiples of 3 is straightforward; simply increase the final shift count appropriately. Similarly, we can derive code for divide by 10, 20 and other binary multiples of 5 by simple changes to the final shift count.
The above examples illustrate a general technique for multiplying an unsigned binary integer by a fractional constant binary multiplier. If there are k one-bits in the multiplier, there will be one SRU instruction followed by k-1 ADDSRU instructions. Reading from the first SRU downward, the sequence of shift counts gives the intervals between consecutive one-bits of the multiplier, starting with the least significant bit. If the two least significant bits are one, the first shift count is one; if there are z zeros between the two least significant ones, the first shift count is z+1. The final shift count follows the same formula, based on the number of zeros between the binary point and the most significant one.
To divide a 32bit integer by an arbitrary integer constant, take the reciprocal of the divisor as a fixed point binary fraction. This fraction must be computed to 33 places of precision, counting the most significant one as the first place. Do not round the fraction, but instead add one to the least significant bit, the one 33 places to the right of the most significant one.
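This recipe can be expressed directly in Python. The sketch below is a slightly conservative variant of the rule in the text (it keeps one extra guard bit rather than exactly 33 places, which never hurts and simplifies the bookkeeping); the function names are illustrative:

```python
def approx_reciprocal(d):
    """Truncate 1/d to a fixed-point binary fraction, then add one in the
    last place (do not round), per the rule above.  Returns (m, shift)
    such that x // d == (x * m) >> shift for all 32-bit unsigned x."""
    shift = 32 + d.bit_length()     # at least 33 significant places of 1/d
    m = (1 << shift) // d + 1       # truncate, then add one in the last place
    return m, shift

def divide_by_constant(x, d):
    """Divide by multiplying with the approximate reciprocal."""
    m, shift = approx_reciprocal(d)
    return (x * m) >> shift
```

For d = 3 this produces the same bit pattern as the constant used in the divide-by-3 sequence above, doubled to account for the extra guard bit.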
For repeating binary fractions, faster code is sometimes possible at the expense of a result that is only approximate, because 1 is not added to the 33rd bit. The trick is to build up the repeating pattern in blocks bigger than one partial product at a time. The following example illustrates this for division by 5:
; R3 = ~(R3 / 5) = R3 * 0.00110011001100110011001100110011
        MOVE    R1,R3           ; R3 = R1 * 1.0   (make copy of original R3)
        SRU     R3,1            ; R3 = R1 * 0.1
        ADDSRU  R3,R1,3         ; R3 = R1 * 0.0011
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.10011
        ADDSRU  R3,R1,1         ; R3 = R1 * 0.110011
        MOVE    R4,R3           ; R4 = R1 * 0.110011   (make copy of working value)
        SRU     R3,8            ; R3 = R1 * 0.00000000110011
        ADDSRU  R3,R4,8         ; R3 = R1 * 0.0000000011001100110011             *
        ADDSRU  R3,R4,8         ; R3 = R1 * 0.000000001100110011001100110011     *
        ADDSRU  R3,R4,2         ; R3 = R1 * 0.00110011001100110011001100110011
; Either R3 = (R1 / 5) or (R3 + 1) = (R1 / 5)
The above code uses only 10 instructions instead of the 18 in the brute-force code, but the result is not exact: the quotient may be off by one. If the two lines marked with asterisks are deleted, the code still works, again to within one, for all 16-bit dividends; if just one of them is deleted, it is good for all 24-bit dividends.
Because the approximation will never be off by more than one, we can fix it by computing the remainder and then checking to see that it is within bounds. The following code fixes the result of the above divide by 5 code:
; Given R1 = dividend, R3 = approximate quotient
        MOVE    R4,R3
        ADDSL   R4,R3,2         ; R4 = R3 * 5 = (R1 / 5) * 5
        SUB     R4,R1,R4        ; R4 = R1 % 5   (remainder after division)
        CMPI    R4,5
        BLTU    NOFIX           ; if (remainder >= 5) {
        ADDSI   R3,1            ;   fix quotient
        ADDSI   R4,-5           ;   fix remainder
NOFIX:                          ; }
; Now, R1 = dividend, R3 = quotient, R4 = remainder
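In Python, the same fix-up looks like this. For brevity, the approximate quotient here comes from a single multiply by the truncated 32-bit reciprocal 0x33333333 rather than from the ADDSRU sequence (an assumption made for illustration); it is sometimes one too small, exactly as described:

```python
def div5_with_remainder(x):
    """Approximate quotient, then the remainder-based correction above."""
    q = (x * 0x33333333) >> 32      # truncated 1/5: may be one too small
    r = x - 5 * q                   # candidate remainder, in the range 0..9
    if r >= 5:                      # the fix-up step
        q += 1
        r -= 5
    return q, r
```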
The above code produces both the remainder and the quotient after division. In fact, aside from the "perfect" algorithms given initially, where an exact quotient is produced by reciprocal multiplication, all useful division algorithms produce the remainder and quotient as the result of a single computation. This is why, on computers with a general-purpose divide instruction, the divide instruction almost always produces both the remainder and the quotient.
These methods extend trivially to signed division by positive constants, with one caveat: simply replace SRU and ADDSRU with SR and ADDSR, and note that the remainder will always be positive.

To make the sign of the remainder the same as the sign of the dividend, a correction step similar to that suggested above must be added: if the remainder is nonzero and the dividend is negative, increment the quotient by one and decrement the remainder by the divisor. Computing the remainder is done exactly the same way in both the signed and unsigned versions.
14.7. Division by Variables

Division by a variable is somewhat more complex than multiplication by a variable. The following code implements the classic restoring division algorithm on 32-bit unsigned operands, giving a 32-bit remainder and a 32-bit quotient.
DIVIDEU:                ; unsigned divide, 32-bit operands
                        ; expects:  R1 -- return address
                        ;           R3 = end  -- dividend
                        ;           R4 = isor -- divisor
                        ; returns:  R3 = quot -- quotient  \ or remquot
                        ;           R4 = rem  -- remainder / (a 64-bit value)
                        ; uses:     R5 = isor -- copy of divisor
                        ; destroys: R6 = i    -- loop counter
        MOVE    R5,R4           ; -- copy divisor
        CLR     R4              ; rem = 0
        LIS     R6,32           ; i = 32
DIVULP:                         ; do { -- begin divide step
        SL      R3,1            ;   -- low half of the shift
        ROL     R4              ;   remquot = remquot << 1
        CMP     R4,R5           ;   -- comparison
        BLTU    DIVUC           ;   if (rem >= isor) {
        SUB     R4,R4,R5        ;     rem = rem - isor
        ADDSI   R3,1            ;     quot = quot + 1
DIVUC:                          ;   } -- end divide step
        ADDSI   R6,-1           ;   i = i - 1
        BGT     DIVULP          ; } while (i > 0)
        JUMPS   R1              ; return remquot
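A Python model of the restoring loop (illustrative names; the 64-bit rem:quot pair is held in two halves, as in the register pair above, and Python's wide integers stand in for the carry that links the two shifts):

```python
MASK32 = (1 << 32) - 1

def divide_u(dividend, divisor):
    """Model of DIVIDEU: 32 restoring-division steps; each step shifts
    the 64-bit rem:quot pair left and develops one quotient bit."""
    rem, quot = 0, dividend
    for _ in range(32):
        rem = (rem << 1) | (quot >> 31)   # top bit of quot shifts into rem
        quot = (quot << 1) & MASK32       # SL R3,1
        if rem >= divisor:                # CMP / BLTU: does the subtract fit?
            rem -= divisor                # SUB R4,R4,R5
            quot |= 1                     # ADDSI R3,1 records a one bit
    return quot, rem
```

For example, divide_u(100, 7) returns (14, 2).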
As with the algorithms given for multiplication by a variable, this code does not check for overflow. If size is the primary consideration, this code is optimal. If speed is paramount, unrolling the loop one step will eliminate 2 of the 8 instructions per iteration, giving a speedup of at least 1/4. Division by variables is rare enough in practice that this is generally acceptable, but 4-way unrolling is entirely feasible.