14. Support for Multiply and Divide

Part of the Hawk Manual
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Contents

14.1. Multiplication by Constants -- Methodology
14.2. Small Constant Multipliers -- Optimal Code
14.3. Multiplication by Variables
14.4. Higher Radix Multiplication -- Fast Code

14.5. Division by Powers of 2
14.6. Division by Constants -- Methodology
14.7. Division by Variables


14.1. Multiplication by Constants -- Methodology

The Hawk has no integer multiply instruction. Multiplication hardware fits poorly with modern pipelined architectures. On most computers multiplication by small constants, the most common case, can always be done faster using sequences of add and shift instructions. On the Intel Pentium, Quantasm Corporation reported that while a 32-bit integer add or subtract took 1 clock cycle, a 32-bit multiply took 10 cycles and a divide took 41 cycles. Clearly, multiply and divide instructions should be avoided.

On the Hawk, MOVESL, ADDSL and SL (see Section 6.2) all multiply by constants:

MOVESL a,b,sa = b × 2s   for 0 < s < 17

SL     r,s  r = r × 2s    for 0 < s < 17

ADDSL  r,r,sr = r × (1+2s) for 0 < s < 17

For example:

        MOVESL R3,R4,1    ; R3 = R4 x 2
        SL     R3,1       ; R3 = R3 x 2
        ADDSL  R3,R3,1    ; R3 = R3 x 3
        SL     R3,2       ; R3 = R3 x 4
        ADDSL  R3,R3,2    ; R3 = R3 x 5
        SL     R3,3       ; R3 = R3 x 8
        ADDSL  R3,R3,3    ; R3 = R3 x 9

When the desired constant multiplier does not fit this pattern, there are two useful strategies:

Factoring: (a × b) × c = a × (b × c)
When the multiplier can be factored, multiply by each factor in turn. Multiplying by each factor in sequence is, of course, equivalent to multiplying by their product. Here are some examples that use this trick for some common small multipliers that aren't in the above list:
        SL     R3,1       ; \ R3 = R3 x 6  (2 x 3)
        ADDSL  R3,R3,1    ; /

        SL     R3,1       ; \ R3 = R3 x 10 (2 x 5)
        ADDSL  R3,R3,2    ; /

        ADDSL  R3,R3,1    ; \ R3 = R3 x 15 (3 x 5)
        ADDSL  R3,R3,2    ; /
Summation: (a + b) × c = (a × c) + (b × c)
If the desired multiplier cannot be factored, the next thing to consider is computing the product as a sum of partial products.

Generally, our goal in splitting a multiplier (a + b) into components a and b will be to search out values of a and b that are easy to multiply by. Summing partial products requires the use of a temporary register; in the following code, we use r[1]. In loading this temporary with a copy of the multiplicand, it is useful to recall that we can use MOVESL to scale the multiplicand and we can use NEG to negate it (see Section 8.2).

        NEG    R1,R3      ; \ R3 = R3 x 7 (8 + -1)
        ADDSL  R3,R1,3    ; /

        MOVESL R1,R3,1    ; \ R3 = R3 x 10 (8 + 2)
        ADDSL  R3,R1,3    ; /

        MOVESL R1,R3,1    ; \ R3 = R3 x 11 (9 + 2)
        ADDSL  R3,R3,3    ; |
        ADD    R3,R3,R1   ; /

        MOVE   R1,R3      ; \ R3 = R3 x 13 ((3 x 4) + 1)
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,2    ; /

        MOVESL R1,R3,1    ; \ R3 = R3 x 14 ((3 x 4) + 2)
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,2    ; /
Binary representation:
When all else fails, any number can be expressed as a sum of powers of two. The binary representation of the number has a one-bit for each power of two that is present in this sum. This approach is really a variation on summation, using only multiples that are powers of two.

Again, a temporary register, r[1] in the following examples, holds a copy of the multiplicand. If the binary multiplier has n one bits, most of the product can be computed using n–1 ADDSL instructions. The shift count on each ADDSL is one more than the number of zeros between consecutive one bits. If the binary multiplier has trailing zeroes, a final SL is needed with a shift count equal to the number of trailing zeros.

        MOVE   R1,R3      ; \ R3 = R3 x 10 (binary 1010)
        ADDSL  R3,R1,2    ; |
        SL     R3,1       ; /

        MOVE   R1,R3      ; \ R3 = R3 x 27 (binary 11011)
        ADDSL  R3,R1,1    ; |
        ADDSL  R3,R1,2    ; |
        ADDSL  R3,R1,1    ; /

        MOVE   R1,R3      ; \ R3 = R3 x 100 (binary 1100100)
        ADDSL  R3,R1,1    ; |
        ADDSL  R3,R1,3    ; |
        SL     R3,2       ; /

        MOVE   R1,R3      ; \ R3 = R3 x 117 (binary 1110101)
        ADDSL  R3,R1,1    ; |
        ADDSL  R3,R1,1    ; |  
        ADDSL  R3,R1,2    ; |
        ADDSL  R3,R1,2    ; /

14.2. Small Constant Multipliers -- Optimal Code

It is difficult to give a general rule for which approach will give the shortest instruction sequence for multiplication. For each multiplier, a different approach may yield a better result. It is common to find multiple instruction sequences that all perform comparably. Trial and error gives us the following instruction sequences for multipliers from 2 to 10:

        SL     R3,1       ;   R3 = R3 x 2

        ADDSL  R3,R3,1    ;   R3 = R3 x 3

        SL     R3,2       ;   R3 = R3 x 4

        ADDSL  R3,R3,2    ;   R3 = R3 x 5

        SL     R3,1       ; \ R3 = R3 x 6 (2 x 3)
        ADDSL  R3,R3,1    ; /

        NEG    R1,R3      ; \ R3 = R3 x 7 (8 + -1)
        ADDSL  R3,R1,3    ; /

        SL     R3,3       ;   R3 = R3 x 8

        ADDSL  R3,R3,3    ;   R3 = R3 x 9

        SL     R3,1       ; \ R3 = R3 x 10 (2 x 5)
        ADDSL  R3,R3,2    ; /

The above all use 2 or fewer instructionsr. Multipliers from 11 to 38, sometimes require 3:

        MOVE   R1,R3      ; \ R3 = R3 x 11 (binary 1011)
        ADDSL  R3,R1,2    ; |
        ADDSL  R3,R1,1    ; /

        SL     R3,2       ; \ R3 = R3 x 12 (4 x 3)
        ADDSL  R3,R3,1    ; /

        MOVE   R1,R3      ; \ R3 = R3 x 13 (binary 1101)
        ADDSL  R3,R1,1    ; |
        ADDSL  R3,R1,2    ; /

        MOVESL R1,R3,1    ; \ R3 = R3 x 14 ((3 x 4) + 2)
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,2    ; /

        ADDSL  R3,R3,1    ; \ R3 = R3 x 15 (3 x 5)
        ADDSL  R3,R3,2    ; /

        SL     R3,4       ;   R3 = R3 x 16

        ADDSL  R3,R3,4    ;   R3 = R3 x 17

        ADDSL  R3,R3,3    ; \ R3 = R3 x 18 (9 x 2)
        SL     R3,1       ; /

        MOVE   R1,R3      ; \ R3 = R3 x 19 (binary 10011)
        ADDSL  R3,R1,3    ; |
        ADDSL  R3,R1,1    ; /

        SL     R3,2       ; \ R3 = R3 x 20 (4 x 5)
        ADDSL  R3,R3,2    ; /

        MOVE   R1,R3      ; \ R3 = R3 x 21 (binary 10101)
        ADDSL  R3,R1,2    ; |
        ADDSL  R3,R1,2    ; /

        MOVESL R1,R3,1    ; \ R3 = R3 x 22 (binary 10110 with a trick!)
        ADDSL  R3,R1,3    ; |
        ADDSL  R3,R1,1    ; /

        NEG    R1,R3      ; \ R3 = R3 x 23 ((3 x 8) + -1)
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,3    ; /

        SL     R3,3       ; \ R3 = R3 x 24 (8 x 3)
        ADDSL  R3,R3,1    ; /

        ADDSL  R3,R3,2    ; \ R3 = R3 x 25 (5 x 5)
        ADDSL  R3,R3,2    ; /

        MOVESL R1,R3,1    ; \ R3 = R3 x 26 ((3 x 8) + 2)
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,3    ; /

        ADDSL  R3,R3,3    ; \ R3 = R3 x 27 (9 x 3)
        ADDSL  R3,R3,1    ; /

        MOVESL R1,R3,2    ; \ R3 = R3 x 28 ((3 x 8) + 4)
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R1,3    ; /

        NEG    R1,R3      ; \ R3 = R3 x 29 (32 + -3)
        ADDSL  R1,R1,1    ; |
        ADDSL  R3,R1,5    ; /

        SL     R3,1       ; \ R3 = R3 x 30 (2 x 3 x 5)
        ADDSL  R3,R3,1    ; |
        ADDSL  R3,R3,2    ; /

        NEG    R1,R3      ; \ R3 = R3 x 31 (32 + -1)
        ADDSL  R3,R3,5    ; /

        SL     R3,5       ;   R3 = R3 x 32

        ADDSL  R3,R3,5    ;   R3 = R3 x 33


        ADDSL  R3,R3,4    ; \ R3 = R3 x 34 (17 x 2)
        SL     R3,1       ; /

        MOVESL R1,R3,1    ; \ R3 = R3 x 35 = (3 + 32 with a trick)
        ADD    R1,R1,R3   ; |
        ADDSL  R3,R1,5    ; /

        MOVESL R1,R3,2    ; \ R3 = R3 x 36 (32 + 4)
        ADDSL  R3,R1,5    ; /

        MOVE   R1,R3      ; \ R3 = R3 x 37 (32 + 5)
        ADDSL  R1,R1,2    ; |
        ADDSL  R3,R1,5    ; /

        MOVESL R1,R3,1    ; \ R3 = R3 x 38 ((2 x 3) + 32)
        ADDSL  R1,R1,1    ; |
        ADDSL  R3,R1,5    ; /

Multiplying by 39 takes 4 instructions, but multiplying by 100 takes only 3. Extending this table is tedious and the payoff is limited. Occasionally, however, some applicatons require multiplication by larger constants. For example, some pseudo-random number generators reqsuire fast multiplication by strange values such as 16,807, 39,373, 69,621 and 48,271.

14.3. Multiplication by Variables

The classic shift-and-add algorithm for multiplying is known by many names; if, in each step, the multiplier is right shifted and the multiplicand is left shifted, so that the partial products are added in order of increasing significance, it is called the Russian Peasant Algorithm or the Egyptian Algorithm, giving credit to one or the other of its many folk origins.

Whatever the shift direction or the order of additions, all basic binary multiplication algorithms work as follows: In each multiply step, one bit of the multiplier is tested and both the multiplier and product are shifted one place; if the bit is one, the multiplicand is added, as a partial product, to the product being accumulated. It takes 32 such steps to compute the 64-bit product of two 32-bit unsigned numbers.

In many cases, only the low 32 bits of the 64-bit product are needed, so the following simplified code is useful. This shifts both the product and multiplier left, adding the most significant partial product first:

TIMESUL:; unsigned multiply, low 32-bits of the product
        ; expects:  R1        -- return address
        ;           R3 = and  -- multiplicand
        ;           R4 = ier  -- multiplier
        ; returns:  R3 = prod -- low 32 bits of product
        ; uses:     R5 = and  -- copy of multiplicand
        ;           R6 = i    -- loop counter

        MOVE    R5,R3   ; -- copy multiplicand
        CLR     R3      ; prod = 0
        LIS     R6,32   ; i = 32
TMULLP:                 ; do { -- begin multiply step
        SL      R3,1    ;   prod = prod << 1
        SL      R4,1    ;   ier = ier << 1
        BCR     TMULCN  ;   if (high bit of ier was 1) {
        ADD     R3,R3,R5;     prod = prod + and
TMULCN:                 ;   } -- end multiply step
        ADDSI   R6,-1   ;   i = i - 1;
        BGT     TMULLP  ; } while (i > 0)
        JUMPS   R1      ; return prod

Signed multiplication in the 2's complement number system poses only small added challenges. With 2's complement binary numbers, the effective weight of the sign bit is negative, so the partial product computed for the sign bit must be subtracted from the product, not added. Therefore, we handle the sign bit with a special multiply step outside the loop. Other than this, the following code uses the same algorithm as the unsigned code given above:

TIMESL: ; signed multiply, low 32-bits of the product
        ; expects:  R1        -- return address
        ;           R3 = and  -- multiplicand
        ;           R4 = ier  -- multiplier
        ; returns:  R3 = prod -- low 32 bits of product
        ; uses:     R5 = and  -- copy of multiplicand
        ;           R6 = i    -- loop counter

        MOVE    R5,R3   ; -- copy multiplicand
        CLR     R3      ; prod = 0
                        ; -- begin special multiply step
        SL      R4,1    ; ier = ier << 1
        BCR     TMSSKP  ; if (sign bit of ier was 1) {
        SUB     R3,R0,R5;   prod = prod - and
TMSSKP:                 ; } -- end special multiply step
        LIS     R6,31   ; i = 31
TMSLLP:                 ;   do { -- begin multiply step
        SL      R3,1    ;   prod = prod << 1
        SL      R4,1    ;   ier = ier << 1
        BCR     TMSLCN  ;   if (high bit of ier was 1) {
        ADD     R3,R3,R5;     prod = prod + and
TMSLCN:                 ;   } -- end of multiply step
        ADDSI   R6,-1   ;   i = i - 1
        BGT     TMSLLP  ; } while (i > 0)
        JUMPS   R1      ; return prod

Each multiply step in either version of the above code has 4 instructions, and it takes 2 more instructions to wrap a loop around the multiply step. This means that 1/3 of the instructions executed during the body of the multiply code are loop control instructions, while 2/3 do the useful computation. On pipelined processors, branch instructions may incur a branch penalty, making them even more expensive.

Since these loops are definite loops with a fixed number of iterations, this code can be accelerated by unrolling the loop. That is, instead of doing 32 iterations with 1 multiply step per iteration, we can do 16 iterations with 2-way unrolling, that is 2 multiply steps per iteration If we fully unroll the loop, we end up with 32 multiply steps in a row with no loop at all. If each instruction takes the same amount of time, the fully unrolled code would run in 2/3 the time taken by the original. On a pipelined machine, the elimination of the branch penalty may lead to even greater improvements. The following example uses 2-way unrolling to speed up the unsigned multiply given above:

TIMESUL:; unsigned multiply, low 32-bits of the product
        ; expects:  R1        -- return address
        ;           R3 = and  -- multiplicand
        ;           R4 = ier  -- multiplier
        ; returns:  R3 = prod -- low 32 bits of product
        ; uses:     R5 = and  -- copy of multiplicand
        ;           R6 = i    -- loop counter

        MOVE    R5,R3   ; -- copy multiplicand
        CLR     R3      ; prod = 0
        LIS     R6,16   ; i = 16
TMULLP:                 ; do { -- begin first multiply step
        SL      R3,1    ;   prod = prod << 1
        SL      R4,1    ;   ier = ier << 1
        BCR     TMULC1  ;   if (high bit of ier was 1)
        ADD     R3,R3,R5;     prod = prod + and
TMULC1:                 ;   } -- end first, start second
        SL      R3,1    ;   prod = prod << 1
        SL      R4,1    ;   ier = ier << 1
        BCR     TMULC2  ;   if (high bit of ier was 1)
        ADD     R3,R3,R5;     prod = prod + and
TMULC2:                 ;   --- end of second multiply step
        ADDSI   R6,-1   ;   i = i - 1
        BGT     TMULLP  ; } while (i > 0)
        JUMPS   R1      ; return prod

The following table summarizes the potential performance gains by unrolling the unsigned multiply routine, from no unrolling to completely unrolled. Execution time can be estimated from the number of instructions, but on pipelined machines, there will be branch penalties that may only apply to unpredictable branches.

Cost per 32-bit unsigned multiply
Unrolling (steps per iteration) 1 2 n < 32 32
Code size, bytes 20 28 12 + 8n 262
Worst case instructions 196 164 132 + 64/n 131
Average instructions 170 138 106 + 64/n 104
Worst case branch count 63 47 32 + 31/n 32
Average branch count 47 31 16 + 31/n 16
Predictable branches 31 15 31/n 0

14.4. Higher Radix Multiplication -- Fast Code

Another way to speed up multiplication is to move from binary to a higher radix. For binary numbers, obvious radices are higher powers of two. A radix 4 multiply, working with 2 bits at a time, takes only 16 iterations to multiply 32-bit numbers, and a radix 16 multiply, using 4 bits at a time takes only 8 iterations.

With a higher radix, the multiply step for each digit of the multiplier involves more than just a shift and add. We already know optimal instruction sequences for multiplying by each possible digit value (see Section 14.2). The key fast higher-radix code is a case-select construct to select the appropriate optimal sequence to compute the partial product for each digit of the multiplier.

BTRUNC (see Section 9.3) is the key to building fast case-select constructs. This allows a 2n-way branch depending on the n least significant bits of a register.

The code given for binary multiplication in the previous section always added the most significant partial product first. Using BTRUNC, the code works better if the least significant partial product is added first and the multiplier is shifted right after each step. This leads to the following unsigned radix-4 multiply code:

TIMESUL:; unsigned radix-4 multiply, low 32-bits of the product
        ; expects:  R1        -- return address
        ;           R3 = and  -- multiplicand
        ;           R4 = ier  -- multiplier
        ; returns:  R3 = prod -- low 32 bits of product
        ; destroys: R5 = and  -- copy of multiplicand
        ;           R6 = i    -- loop counter
        ;           R7 = pp   -- partial product

        MOVE    R5,R3   ; -- copy multiplicand
        CLR     R3      ; prod = 0
        LIS     R6,16   ; i = 16
TMULLP:                 ; do { -- begin radix 4 multiply step
        BTRUNC  R4,2    ;   select (prod & 3) {
        BR      TMUL0   ;     -- branch table to the cases
        BR      TMUL1
        BR      TMUL2
        BR      TMUL3
TMUL0:                  ;     case 0
        CLR     R7      ;       pp = 0
        BR      TMULBR  ;       break;
TMUL1:                  ;     case 1
        MOVE    R7,R5   ;       pp = and
        BR      TMULBR  ;       break;
TMUL2:                  ;     case 2
        MOVESL  R7,R5,1 ;       pp = and x 2
        BR      TMULBR  ;       break;
TMUL3:                  ;     case 3
        MOVE    R7,R5
        ADDSL   R7,R7,1 ;       pp = and x 3
TMULBR:                 ;   } -- end select
        ADD     R3,R3,R7;   prod = prod + pp
        SR      R4,2    ;   ier = ier >> 2
        SL      R5,2    ;   and = and << 2
                        ;   -- end radix 4 multiply step
        ADDSI   R6,-1   ;   i = i - 1
        BGT     TMULLP  ; } while (i > 0)
        JUMPS   R1      ; return prod

The above radix-4 code occupies 44 bytes, exactly the same as a 4-way unrolled binary multiply, and its average and worst-case performance are identical, with exactly 148 instructions executed per multiply. This is significantly better than the version of binary multiplication without unrolling, but on the average, not as good as not as good as 2-way unrolled binary multiplication.

There is one small optimization that can be made to the above: Case 3 can be moved to the start of the list of cases. This allows BR TMUL3 to be eliminated from branch table and replaced by the first instruction of the case. On average, this optimization reduces the number of instructions per multiply by 8, with no change in the worst case.

Another optimization we can make is to duplicate the entire loop epilog at the end of each case, eliminating BR TMULBR from each case. This allows further optimization of case 0 where there is no need to "compute" or add the partial product zero.

As is usual with optimized code, these changes make the code less readable and they make it difficult to write comments that resemble well-structured code in a high-level language. At best, the comments ends up resembling poorly written code in a mid-level language like C.

While the above radix-4 code is not a clear winner, if we move to radix 16, we get code that, while significantly larger, is extraordinarily fast. The following radix-16 code incorporates both of the optimizations suggested above. This code occupies 294 bytes, as opposed to 262 bytes for the fully unrolled binary code, but its worst-case execution path is only 91 instructions and on average (assuming random multipliers) it should take about 73 instructions.






TIMESUL:; unsigned radix-16 multiply, low 32-bits of the product
        ; expects:  R1        -- return address
        ;           R3 = and  -- multiplicand
        ;           R4 = ier  -- multiplier
        ; returns:  R3 = prod -- low 32 bits of product
        ; destroys: R5 = and  -- copy of multiplicand
        ;           R6 = i    -- loop counter
        ;           R7 = pp   -- partial product

        MOVE    R5,R3   ; -- copy multiplicand
        CLR     R3	; prod = 0
        LIS     R6,8    ; i = 8
TMULLP:                 ; do { -- begin radix-16 multiply step
        BTRUNC  R4,4    ;   select (prod & 15) {
        BR      TMUL0   ;     -- branch table to the cases
        BR      TMUL1
        BR      TMUL2
        BR      TMUL3
        BR      TMUL4
        BR      TMUL5
        BR      TMUL6
        BR      TMUL7
        BR      TMUL8
        BR      TMUL9
        BR      TMULA
        BR      TMULB
        BR      TMULC
        BR      TMULD
        BR      TMULE
                        ;     case 15
        MOVE    R7,R5
        ADDSL   R7,R7,1 ;       -- and x 5
        ADDSL   R7,R7,2 ;       pp = and x 15
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMUL0:                  ;     case 0
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMUL1:                  ;     case 1
        ADD     R3,R3,R5;       prod = prod + and
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMUL2:                  ;     case 2
        MOVESL  R7,R5,1 ;       pp = and x 2
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMUL3:                  ;     case 3
        MOVE    R7,R5
        ADDSL   R7,R7,1 ;       pp = and x 3
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod



TMUL4:                  ;     case 4
        MOVESL  R7,R5,2 ;       pp = and x 4
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMUL5:                  ;     case 5
        MOVE    R7,R5
        ADDSL   R7,R7,2 ;       pp = and x 5
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMUL6:                  ;     case 6
        MOVESL  R7,R5,1
        ADDSL   R7,R7,1 ;       pp = and x 6
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMUL7:                  ;     case 7
        MOVE    R7,R5
        ADDSL   R7,R7,1
        ADDSL   R7,R5,1 ;       pp = and x 7
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMUL8:                  ;     case 8
        MOVESL  R7,R5,3 ;       pp = and x 8
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
TMULPP: BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMUL9:                  ;     case 9
        MOVE    R7,R5
        ADDSL   R7,R7,3 ;       pp = and x 9
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMULA:                  ;     case 10
        MOVESL  R7,R5,1
        ADDSL   R7,R7,2 ;       pp = and x 10
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMULB:                  ;     case 11
        MOVE    R7,R5
        ADDSL   R7,R7,2
        ADDSL   R7,R5,1 ;       pp = and x 11
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMULC:                  ;     case 12
        MOVESL  R7,R5,2
        ADDSL   R7,R7,1 ;       pp = and x 12
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULLP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMULD:                  ;     case 13
        MOVE    R7,R5
        ADDSL   R7,R7,1
        ADDSL   R7,R5,2 ;       pp = and x 13
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULPP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
TMULE:                  ;     case 14
        MOVE    R7,R5
        ADDSL   R7,R7,1
        ADDSL   R7,R5,1
        SL      R7,1    ;       pp = and x 14
        ADD     R3,R3,R7;       prod = prod + pp
        SL      R5,4    ;       and = and >> 4
        SR      R4,4    ;       ier = ier << 4 -- end multiply step
        ADDSI   R6,-1   ;       count = count - 1
        BGT     TMULPP  ;       if (count > 0) continue
        JUMPS   R1      ;       return prod
                        ;   } -- end case
                        ; } -- end loop

The above code is be better than twice the speed of the unoptimized unsigned binary multiply code for random integers. Furthermore, multipliers with long runs of zero bits are common. While these are the worst case for the original unsigned binary code, they are the best case for this code.

Duplication of the loop epilog in the above has expanded the code so much that the BGT instructions in the final two cases are unable to reach the head of the loop at TMULLP. To solve this problem, the above code introduces a new label, TMULPP in case 8, that allows control to reach TMULLP in two steps. There are several alternative solutions to the problem of out-of-range branches, but in this context, none are faster or more compact.

On pipelined machines, the above code is definitely not optimal because many instructions depend directly on the results of the previous instruction. To eliminate these dependencies, some of the instructions can be reordered. The goal is to place unrelated instructions between any instruction that produces a result that the next instruction depends on. Unfortunately, such reorderings almost always produce code that is harder to read than the original. Consider, for example, the followig rewrite of the final case above:

TMULE:                  ;     case 14
        MOVE    R7,R5
        ADDSL   R7,R5,1
        SR      R4,4    ;       ier = ier << 4
        ADDSL   R7,R5,1 ;       pp = and x 7
        SL      R5,4    ;       and = and >> 4
        SL      R7,1    ;       pp = pp x 2 (and x 14)
        ADDSI   R6,-1   ;       count = count - 1
        SUM     R3,R7   ;       prod = prod + pp -- cc's unchanged
        BGT     TMULPP  ;       if (count > 0) continue

Caches add even more difficulty because they make simple prediction of run-time extremely difficult. When there is a cache, the second time a particular instruction is executed it may be much faster than it was the first time because the cache eliminates most of the cost of the instruction fetch. As a result, a large multiply routine with many specialized cases for each multiplier digit may be slower than small version that applies the same general case to every digit.

14.5. Division by Powers of 2

SRU (see Section 6.3) can be used for fast unsigned division with truncation, that is, rounding any fractional quotient down to the next lower integer:

        SRU    R3,1       ; unsigned divide by 2
        SRU    R3,2       ; unsigned divide by 4

SR (see Section 6.3) works for signed division, but there are alternative rules for truncating the quotient. SR truncates to the next lower integer so the remainder is always positive:

        SR     R3,1       ; signed divide by 2, truncating downward
        SR     R3,2       ; signed divide by 4, truncating downward

Generally, people expect –5 divided by 4 to give a quotient of –1 with a remainder of –1. When SR is used for this division problem, it gives a quotient of –2 with a remainder of 3. In fact, our naive expectation is contradicted by another equally naive expectation. We expect the remainder after division by 4 to be between 0 and 3. These two naive expectations are incompatible for negative dividends. SR keeps the remainder positive while violating naive expectations about the quotient.

Many programming languages require integer division to truncate the quotient toward zero, so that the sign of the remainder matches the sign of the dividend. ADJUST r,SSQ (See Section 11.2) can be used to deliver this result, as follows:

        SR     R3,2       ; divide by 4, truncating downward
        ADJUST R3,SSQ     ;   adjust quotient so truncation is toward zero

For both signed and unsigned shifts, the quotient can be rounded to the nearest integer using ADDC (See Section 10.3):

        SR     R3,2       ; signed divide by 4 with positive remainder
        ADDC   R3,0       ;   round so remainder is between -2 and 1

        SRU    R3,3       ; unsigned divide by 8 with positive remainder
        ADDC   R3,0       ;   round so remainder is between -4 and 3

14.6. Division by Constants -- Methodology

In general, it is faster to multiply by the reciprocal of a number than to divide by that number! This is particularly important for division by constants that are not powers of two, since we can divide by a power of two by a simple shift. Furthermore, this applies across most computers, including those with hardware divide instructions. The following table shows the reciprocals of the integers from 1 to 10.

 n 1/n (base 10) 1/n (base 2)
1 1.0 1.0
2 0.5 0.1
3 0.33 0.0101
4 0.25 0.01
5 0.2 0.00110011
6 0.166 0.00101
7 0.142857142857 0.001001
8 0.125 0.001
9 0.11 0.000111000111
10 0.1 0.000110011

In binary or decimal repeating fractions are traditionally marked with an overbar above the string of digits that repeat; in the above table, this is rendered like this. Many numbers that have exact reciprocals in decimal have reciprocals that repeat in binary. For example, 0.110 has no finite binary representation.

In general, if we multiply a 32 bit integer by the reciprocal, truncating the integer part of the product to 32 bits, we get a result that differs from the expected quotient by no more than one. In some cases, this is acceptable, but in most cases, an exact quotient is needed.

One way to obtain an exact quotient is to multiply by a carefully selected approximation of the reciprocal. In general, for an n-bit unsigned divisor, we can get the exact result by multiplying by an n+1 bit quantity and then shifting to take the appropriate bits of the product. The n+1 bit approximate reciprocal cannot have more than n+1 one-bits, so there are at most n+1 partial products to add in computing the quotient. ADDSRU (see Section 6.3) is is particularly effective for adding partial products and discarding fractional bits of the result. Generally, we can divide a 32-bit unsigned number by an arbitrary constant with a sequence of no more than 33 ADDSRU instructions. Here are some examples:

        ;        R3 = R3 / 3 = R3 * 0.010101010101010101010101010101011
        MOVE    R1,R3   ; R3 = R1 * 1.0      (make copy of original R3)
        SRU     R3,1    ; R3 = R1 * 0.1
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.011    exact R3/3 for all R1 < 8
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01011                 all R1 < 32
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101011               all R1 < 128
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101011             all R1 < 512
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101011           all R1 < 2,048
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101011         all R1 < 8,192
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101011       all R1 < 32,768
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101010101011     all R1 < 131,072
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.01010101010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.0101010101010101010101010101011
        ADDSRU  R3,R1,2 ; R3 = R1 * 0.010101010101010101010101010101011

        ;        R3 = R3 / 5 = R3 * 0.0011001100110011001100110011001101
        MOVE    R1,R3   ; R3 = R1 * 1.0      (make copy of original R3)
        SRU     R3,2    ; R3 = R1 * 0.01
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.001101 exact R3/5 for all R1 < 64
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011001101            all R1 < 1,024
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.10011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.00110011001101        all R1 < 16,384
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.001100110011001101    all R1 < 262,144
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011001100110011001101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.10011001100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.00110011001100110011001101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.100110011001100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.001100110011001100110011001101
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.1001100110011001100110011001101
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011001100110011001100110011001101

As the comments above indicate, the complete series of 16 ADDSRU instructions is only needed if the maximum value of the dividend is unknown. If the maximum is known, an abbreviated shift sequence will suffice. For example, for a 16-bit dividend, 6 or 7 ADDSRU instructions do the job.

Given code to divide by 3, division by 6, 12 and other binary multiples of 3 is straightforward; simply increase the final shift count appropriately. Similarly, we can derive code for divide by 10, 20 and other binary multiples of 5 by simple changes to the final shift count.

The above examples illustrate a general technique for multiplying an unsigned binary integer by a fractional constant binary multiplier. If there are k one-bits in the multiplier, there will be one SRU instruction followed by k–1 ADDSRU instructions. Reading from the first SRU downward, the sequence of shift counts gives the intervals between consecutive one-bits of the multiplier, starting with the least-significant bit. If the two least significant bits are one, the first shift count is one; if there are z zeros between the two least significant ones, the first shift count is z+1. The final shift count follows the same formula, based on the number of zeros between the point and the most significant one.

To divide a 32-bit integer by an arbitrary integer constant, take the reciprocal of the divisor as a fixed point binary fraction. This fraction must be computed to 33 places of precision, counting the most significant one as the first place. Do not round the fraction, but instead add one to the least significant bit, the one 33 places to the right of the most significant one.

For repeating binary fractions, faster code is sometimes possible at the expense of a result that is only approximate because 1 is not added to the 33rd bit. The trick is to build up the repeating pattern in blocks bigger than one partial product at a time. The following example illustrates this for division by 5:

        ;     R3 = ~(R3 / 5) = R3 * 0.00110011001100110011001100110011
        MOVE    R1,R3   ; R3 = R1 * 1.0      (make copy of original R3)
        SRU     R3,1    ; R3 = R1 * 0.1
        ADDSRU  R3,R1,3 ; R3 = R1 * 0.0011
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.10011
        ADDSRU  R3,R1,1 ; R3 = R1 * 0.110011
        MOVE    R4,R3   ; R3 = R1 * 0.110011 (make copy of working value)
        SRU     R3,8    ; R3 = R1 * 0.00000000110011
        ADDSRU  R3,R4,8 ; R3 = R1 * 0.0000000011001100110011          *
        ADDSRU  R3,R4,8 ; R3 = R1 * 0.000000001100110011001100110011  *
        ADDSRU  R3,R4,2 ; R3 = R1 * 0.00110011001100110011001100110011
        ;     Either R3 = (R1 / 5) or (R3 + 1) = (R1 / 5);

The above code only uses 10 instructions instead of the 18 in the brute-force code, but the result is not exact. With this code, the quotient may be off by one. If the two lines marked with asterisks are deleted, the code works, but may be off by one, for all 16-bit dividends, while if just one of these is deleted, it is good for all 24-bit dividends.

Because the approximation will never be off by more than one, we can fix it by computing the remainder and then checking to see that it is within bounds. The following code fixes the result of the above divide by 5 code:

        ; Given R1 = dividend, R3 = approximate quotient
        MOVE    R4,R3
        ADDSL   R4,R3,2 ; R4 = R3 * 5 = (R1 / 5) * 5
        SUB     R4,R1,R4; R4 = R1 % 5 (remainder after division)
        CMPI    R4,5
        BLTU    NOFIX   ; if remainder >= 5
        ADDSI   R3,1    ;   fix quotient
        ADDSI   R4,-5   ;   fix remainder
NOFIX:
        ; Now, R1 = dividend, R3 = quotient, R4 = remainder

The above code produces both the remainder and the quotient after division. In fact, aside from the "perfect" algorithms given initially, where an exact quotient is produced by reciprocal multiplication, all useful division algorithms produce the remainder and quotient as the result of a single computation. This is why, on computers with a general purpose divide instruction, the divide instruction almost always produces both the remainder and the quotient.

These methods extend trivially to signed division by positive constants, with one caveat. Simply replace SRU and ADDSRU with SR and ADDSR. The caveat is that the remainder will always be positive.

To make the sign of the remainder the same as the sign of the dividend, A correction step similar to that suggested above must be added. If the remainder is nonzero and the dividend is negative, the correction is to increment the quotient by one and decrement the remainder by the divisor. Computing the remainder is done exactly the same way in both the signed and unsigned versions.

14.7. Division by Variables

Division by a variable is somewhat more complex than multiplication by a variable. The following code implements the classic restoring division algorithm on 32-bit unsigned operands, giving a 32-bit remainder and a 32-bit quotient.

DIVIDEU:; unsigned divide, 32-bit operands
        ; expects:  R1        -- return address
        ;           R3 = end  -- dividend
        ;           R4 = isor -- divisor
        ; returns:  R3 = quot -- quotient  \ or remquot 
        ;           R4 = rem  -- remainder / (a 64-bit value)
        ; uses      R5 = isor -- copy of divisor
        ; destroys: R6 = i    -- loop counter

        MOVE    R5,R4   ; -- copy divisor
        CLR     R4      ; rem = 0
        LIS     R6,32   ; i = 32

DIVULP:                 ; do { -- begin divide step
        SL      R3,1    ;   -- low half of shift
        ROL     R4      ;   remquot = remquot << 1
        CMP     R4,R5   ;   -- comparison
        BLTU    DIVUC   ;   if (rem >= isor) {
        SUB     R4,R4,R5;     rem = rem - isor
        ADDSI   R3,1    ;     quot = quot + 1
DIVUC:                  ;   } -- end divide step

        ADDSI   R6,-1   ;   i = i - 1
        BGT     DIVULP  ; } while (i > 0)
        JUMPS   R1      ; return remquot

As with the algorithms given for multiplication by a variable, this code does not check for overflow. If size is the primary consideration, this code is optimal. If speed is paramount, unrolling the loop one step will eliminate 2 of the 8 instructions per iteration, giving a speedup of at least 1/4. Division by variables is rare enough in practice that this is generally acceptable, but 4-way unrolling is entirely feasible.