Binary to Decimal Conversion in Limited Precision

Part of the Arithmetic Tutorial Collection
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Copyright © 1999, Douglas. W. Jones, with major revisions made in 2002 and 2010. This work may be transmitted or stored in electronic form on any computer attached to the Internet or World Wide Web so long as this notice is included in the copy. Individuals may make single copies for their own use. All other rights are reserved.

Index

Introduction
The Basic Idea
Reduction to Code
Dealing with Signed Numbers
Living Within 8 Bits
Doing Without Hardware Multiply and Divide
Alternative Radices
64-bit Conversion on a 32-bit Machine

Rated as a top 50 site by Programmer's Heaven as of Feb 2002.

Introduction

Programmers on binary computers have always faced computational problems when dealing with textual input-output because of our convention that printed numeric data should be represented in base 10. These problems have never been insurmountable, but they are particularly annoying when writing programs for machines with limited capacity in their arithmetic units. Suppose, for example, you have an unsigned binary number that you wish to print out in decimal form. The following C code will do this:

    void putdec( unsigned int n ) {
        unsigned int rem  = n % 10;
        unsigned int quot = n / 10;
        if (quot > 0) putdec( quot );
        putchar( rem + '0' );
    }

The problem with this C code is that it assumes that the computer's arithmetic unit is sufficiently large to handle the entire number n, and it assumes that the arithmetic unit can conveniently do integer division yielding both a quotient and a remainder. These assumptions are not always reasonable, both because many commercially successful computers do not include division hardware and because, even when such hardware is available, it is of little direct use when dividing numbers larger than the ALU can handle.

The focus of this tutorial is on writing fast code to convert unsigned 16 bit binary numbers to decimal on a machine with an 8-bit ALU and no divide instruction. This situation is common on small microcontrollers, and the techniques generalize in useful ways, for example, to the problem of doing 64 bit binary to decimal conversion using a 32 bit ALU. A final section of this tutorial examines how these methods can be extended to larger numbers, for example, converting a 64-bit binary number to decimal on a machine with a 32-bit ALU.

On machines with no divide instruction where speed is no issue, it may be better to simply use repeated subtraction instead of the fast but algebraically complex code given in the body of this tutorial. Consider the following 16-bit conversion code as an example.

    void putdec( int n ) {
        char a, b, c, d;
        if (n < 0) {
            putchar( '-' );
            n = -n;
        }
        for ( a = '0' - 1; n >= 0; n -= 10000 ) ++a;
        for ( b = '9' + 1; n <  0; n += 1000  ) --b;
        for ( c = '0' - 1; n >= 0; n -= 100   ) ++c;
        for ( d = '9' + 1; n <  0; n += 10    ) --d;
        putchar( a );
        putchar( b );
        putchar( c );
        putchar( d );
        putchar( n + '0' );
    }

Variants on the clever algorithm given above have been very popular on 8-bit microprocessors and microcontrollers, despite the fact that conversions can require over 30 iterations of the for loops in this code. As given here, this code does not work for the extreme negative value -32768; fixing this bug without introducing a special case is an interesting exercise.

The Basic Idea

To do the conversion faster, what we'll do is look at our binary number as a base 16 number and then do the conversion in terms of the base 16 digits. Consider a 16 bit number n. This can be broken into 4 fields of 4 bits each, the base 16 digits; call them n₃, n₂, n₁ and n₀.

                    n
    |_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _|
    |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
    |   n   |   n   |   n   |   n   |
         3       2       1       0

The value of the number can be expressed in terms of these 4 fields as follows:

n = 4096 n₃ + 256 n₂ + 16 n₁ + 1 n₀

In the same way, if the value of n, expressed in decimal, is d₄d₃d₂d₁d₀, where each of d₄, d₃, d₂, d₁ and d₀ are decimal digits, the value of n can be expressed as:

n = 10000 d₄ + 1000 d₃ + 100 d₂ + 10 d₁ + 1 d₀

Our problem then, is to compute the d_i from the n_i without dealing directly with the larger number n.

To do this, we first note that the factors used in combining the values of n_i can themselves be expressed as sums of multiples of powers of 10. Therefore, we can rewrite the original expression as:

n =
n₃( 4×1000 + 0×100 + 9×10 + 6×1 ) +
n₂( 2×100 + 5×10 + 6×1 ) +
n₁( 1×10 + 6×1 ) +
n₀( 1×1 )

If distribute the n_i over the expressions for each factor, then factor out the multiples of 100, we get the following:

n =
1000 (4n₃) +
100 (0n₃ + 2n₂) +
10 (9n₃ + 5n₂ + 1n₁) +
1 (6n₃ + 6n₂ + 6n₁ + 1n₀)

We can use this to arrive at first approximations a_i for d₃ through d₀:

a₃ = 4 n₃
a₂ = 0 n₃ + 2 n₂
a₁ = 9 n₃ + 5 n₂ + 1 n₁
a₀ = 6 n₃ + 6 n₂ + 6 n₁ + 1 n₀

The values of a_i are not proper decimal digits because they are not properly bounded in the range 0 ≤ a_i ≤ 9. Instead, given that each of n_i is bounded in the range 0 ≤ n_i ≤ 15, the a_i are bounded as follows:

0 ≤ a₃ ≤ 60
0 ≤ a₂ ≤ 30
0 ≤ a₁ ≤ 225
0 ≤ a₀ ≤ 285

This is actually quite promising because a₃ through a₁ are less than 256 and may therefore be computed using an 8 bit ALU, and even a₀ may be computed in 8 bits if we can use the carry-out bit of the ALU to hold the high bit. Furthermore, if our interest is 2's complement arithmetic where the high bit is the sign bit, the first step in the output process is to print a minus sign and then negate, so the constraint on n₃ becomes 0 ≤ n₃ ≤ 8 and we can conclude naively that

0 ≤ a₀ ≤ 243

Note that we said n₃ ≤ 8 and not n₃ < 8; this is because of the one negative number in the 2's complement system that cannot be properly negated, -32768. If we exclude this number from the range of legal values, the upper bound is reduced to 237; in fact, this is the overall upper bound, because in the one case where n₃ is 8, all of the other n_i are zero, so we can assert globally that

0 ≤ a₀ ≤ 237

Even with our naive bound, however, it is easy to see that we can safely use 8 bit arithmetic to accumulate all of the a_i.

Given that we have computed the a_i, we can push them into the correct ranges by a series of 8-bit divisions and one 9-bit division, as follows:

c₁ = a₀ / 10
d₀ = a₀ mod 10

c₂ = (a₁ + c₁) / 10
d₁ = (a₁ + c₁) mod 10

c₃ = (a₂ + c₂) / 10
d₂ = (a₂ + c₂) mod 10

c₄ = (a₃ + c₃) / 10
d₃ = (a₃ + c₃) mod 10

d₄ = c₄

In the above, the c_i terms represent carries propagated from one digit position to the next. The upper bounds on each of the c_i can be computed from the bounds given above for the a_i:

c₁ ≤ 28 [ = 285 / 10 ]
c₂ ≤ 25 [ = (28 + 225) / 10 = 253 / 10 ]
c₃ ≤ 5 [ = (25 + 30) / 10 = 55 / 10 ]
c₄ ≤ 6 [ = (5 + 60) / 10 = 65 / 10 ]

In computing the bound on c₂, we came within 2 of to 255, the maximum that can be represented in 8 bits, so an 8 bit ALU is just barely sufficient for this computation. Again, if we negate any negative numbers prior to output, so that the maximum value of n₃ is 8, the bound on c₁ becomes 23 instead of 28.

Reduction to Code

A first hack at reducing the above idea to code might include named temporaries for all of the terms used above, but we can avoid this if we reorganize things slightly and note that, after computing each digit of the number, we no-longer need one of the n_i. This leads to the following C code:

    void putdec( uint16_t n ) {
        uint8_t d4, d3, d2, d1, q;
        uint16_t d0;

        d0 = n       & 0xF;
        d1 = (n>>4)  & 0xF;
        d2 = (n>>8)  & 0xF;
        d3 = (n>>12) & 0xF;

        d0 = 6*(d3 + d2 + d1) + d0;
        q = d0 / 10;
        d0 = d0 % 10;

        d1 = q + 9*d3 + 5*d2 + d1;
        q = d1 / 10;
        d1 = d1 % 10;

        d2 = q + 2*d2;
        q = d2 / 10;
        d2 = d2 % 10;

        d3 = q + 4*d3;
        q = d3 / 10;
        d3 = d3 % 10;

        d4 = q;

        putchar( d4 + '0' );
        putchar( d3 + '0' );
        putchar( d2 + '0' );
        putchar( d1 + '0' );
        putchar( d0 + '0' );
    }

In the above, we use the standard C types uint8_t and uint16_t to represent unsigned 8 and 16-bit values. By convention, these should be defined in <stdint.h>. Of course, this code prints all 5 digits of n instead of only printing the significant digits, but this can be fixed in many ways. The following fix is perhaps the cleanest, making effective use of partial results as soon as those results can be used to predict that digits will be zero. Even so, adding these features to the code obscures the computation being performed.

    void putdec( uint16_t n ) {
        uint8_t d4, d3, d2, d1, q;
        uint16_t d0;

        d0 = n       & 0xF;
        d1 = (n>>4)  & 0xF;
        d2 = (n>>8)  & 0xF;
        d3 = (n>>12);

        d0 = 6*(d3 + d2 + d1) + d0;
        q = d0 / 10;
        d0 = d0 % 10;

        d1 = q + 9*d3 + 5*d2 + d1;
        if (d1 != 0) {
            q = d1 / 10;
            d1 = d1 % 10;

            d2 = q + 2*d2;
            if ((d2 != 0) || (d3 != 0)) {
                q = d2 / 10;
                d2 = d2 % 10;

                d3 = q + 4*d3;
                if (d3 != 0) {
                    q = d3 / 10;
                    d3 = d3 % 10;
            
                    d4 = q;
            
                    if (d4 != 0) {
                        putchar( d4 + '0' );
                    }
                    putchar( d3 + '0' );
                }
                putchar( d2 + '0' );
            }
            putchar( d1 + '0' );
        }
        putchar( d0 + '0' );
    }

Dealing with Signed Numbers

To print signed values, we can replace the first few lines of the above code with the following:

    void putdec( int16_t n ) {
        uint8_t d4, d3, d2, d1, d0, q;

        if (n < 0) {
            putchar( '-' );
            n = -n;
        }

        d0 = n       & 0xF;
        d1 = (n>>4)  & 0xF;
        d2 = (n>>8)  & 0xF;
        d3 = (n>>12) & 0xF;

        /* the remainder of the code is unchanged */

In the above, we use the standard C type int16_t to represent signed 16-bit values and uint8_t to represent unsigned 8-bit values. By convention, these should be defined in <stdint.h>. As mentioned earlier, forcing n positive prior to breaking it down into nibbles allows us to use an 8-bit type for d0 instead of the 16-bit type used previously (only 9 bits were actually needed).

Note that the computation of d3 has also changed as a result of the change from unsigned to signed arithmetic. This is because the right-shift operator shifts zeros into the high bit when applied to unsigned numbers, but it replicates the sign bit when applied to signed numbers. There is one case in 16 bit 2's complement arithmetic where both n and -n are negative, that is, n = -32768. The change we have made gives us the correct output in this case.

Living Within 8 Bits

In the code given above for unsigned printing, the fact that a₀ could be as large as 285 forced us to declare the variable d0 as a 16-bit integer instead of just 8 bits. We can avoid this by observing that the final addition in the sum used to compute a₀ is the only one that will push this sum over 255, and therefore, we need only check the carry bit once to detect overflow. Having detected overflow, we can account for it in the computation that follows:

    void putdec( uint16_t n ) {
        uint8_t d4, d3, d2, d1, d0, q;

        d0 = n       & 0xF;
        d1 = (n>>4)  & 0xF;
        d2 = (n>>8)  & 0xF;
        d3 = (n>>12) & 0xF;

        if ( (6*(d3 + d2 + d1) + d0) > 255 ) { /* overflow */
            d0 = (6*(d3 + d2 + d1) + d0) & 0xFF;
            q = d0/10 + 25;
            d0 = d0 % 10 + 6;
            if (d0 >= 10) {
                d0 = d0 - 10;
                q = q + 1;
            }
        } else { /* no overflow */
            d0 = 6*(d3 + d2 + d1) + d0;
            q = d0 / 10;
            d0 = d0 % 10;
        }

In the above code, after having noted an overflow in the computation of a₀, we retain the least significant bits of the sum in d0 and then approximately correct the quotient by adding 256/10, or 25; correcting this and computing the correct remainder is a bit more interesting.

Given some integer i ≥ 256, if we want to compute i mod 10 from (i - 256) mod 10, we need to note that the modulus operator is distributive in an interesting sense:

(i + j) mod m = ((i mod m) + (j mod m)) mod m

Using this, we can conclude that:

i mod 10 = (((i - 256) mod 10) + (256 mod 10)) mod 10
= (((i - 256) mod 10) + 6) mod 10

In the case where ((i - 256) mod 10) + 6 is greater than 10, the truncation of the quotient was also incorrect, so we must add 1 to the approximate quotient as well. This justifies the code given above, and this code has also been tested exhaustively for all integers from 0 to 65535.

Doing Without Hardware Multiply and Divide

If flat-out speed is not of the essence, the easiest way to do the required divisions is to use the trivial repeated subtraction algorithm. The largest dividend we must consider is only 285, so it will take no more than 28 iterations for this division.

    /* quotient and remainder returned by div10 */
    unsigned char q, r;

    void div10( uint8_t i ) {
        q = 0;
        r = i;
        while (r > 9) {
                q = q + 1;
                r = r - 10;
        }
    }

This can be optimized in several ways. Most notably, termination conditions that compare with zero are almost always significantly faster than those that compare with other constants, so we might be tempted to write something like this:

    /* quotient and remainder returned by div10 */
    uint8_t q, r;

    void div10( uint8_t i ) {
        q = -1;
        r = i;
        do {
                q = q + 1;
                r = r - 10;
        } while (r >= 0); /* see note below */
        r = r + 10;
    }

The above C code is wrong because r is declared to be unsigned leading to the fact that the comparison r>=0 will always return true. Note that in assembly language, this code can be fixed by terminating the loop when the condition codes indicate a borrow after the r-10 subtraction.

Another section of this tutorial covers the problem of doing multiplicaton and division using only shift and add instructions. We use the results presented there, taking specific advantage of the limited ranges of values occupied by our dividends, to completely eliminate iteration from our code.

There are two ways to get an exact quotient and remainder from i/10 using reciprocal multiplication in 8 bits. Either we multiply by 0.00011001 to get an approximate quotient and then compute the remainder and correct the quotient, or we multiply by 0.00011001101 to get an exact quotient from which we can compute the remainder.

The former method, yielding an approximation, can be reduced to the following C code:

    /* quotient and remainder returned by div10 */
    uint8_t q, r;

    void div10( uint8_t i ) {
        /* approximate the quotient as q = i*0.000110011 (binary) */
        q = ((i>>1) + i) >> 1; /* times 0.11 */
        q = ((q>>4) + q) >> 3; /* times 0.000110011 */

        /* compute the reimainder as r = i - 10*q */
        r = ((q<<2) + q) << 1; /* times 1010. */
        r = i - r;

        /* fixup if approximate remainder out of bounds */
        if (r >= 10) {
            r = r - 10;
            q = q + 1;
        }
    }

In first two right-shift-and-add statements used above, we have assumed that the carry out of the high bit of the add operation is shifted into the high bit of the second of the two right-shift operations used in each statement. This can be done efficiently in many machine instruction sets, but even good optimizing compilers may insist on using a 16 bit accumulator for all intermediate results of arithmetic operations.

The second approach, using a more elaborate multiplier to give an exact quotient, may be coded as follows:

    /* quotient and remainder returned by div10 */
    uint8_t q, r;

    void div10( uint8_t i ) {
        /* approximate the quotient as q = i*0.00011001101 (binary) */
        q = ((i>>2) + i) >> 1; /* times 0.101 */
        q = ((q   ) + i) >> 1; /* times 0.1101 */
        q = ((q>>2) + i) >> 1; /* times 0.1001101 */
        q = ((q   ) + i) >> 4; /* times 0.00011001101 */

        /* compute the reimainder as r = i - 10*q */
        r = ((q<<2) + q) << 1; /* times 1010. */
        r = i - r;
    }

The particular choice of one or the other of these schemes can only be made after compiling them or hand translation to machine code. Depending on the specific features offered by the target architecture, either of the above may be the fastest or most compact, and careful use of small optimizations may tilt the balance one way or the other.

Depending on the space-time tradeoffs of the application, the above code may be expanded in-line as an open macro, or it may be called as a closed function. If in-line expansion is appropriate, it should be noted that the computations of c₃ and c₄ involve dividends no greater than 65. For dividends up to 68, the multiplier 0.0001101 gives exact results, suggesting the following version of the print routine:

    void putdec( int16_t n ) {
        uint8_t d4, d3, d2, d1, d0, q;

        if (n < 0) {
            putchar( '-' );
            n = -n;
        }

        d1 = (n>>4)  & 0xF;
        d2 = (n>>8)  & 0xF;
        d3 = (n>>12) & 0xF;

        d0 = 6*(d3 + d2 + d1) + (n & 0xF);
        q = (d0 * 0xCD) >> 11;
        d0 = d0 - 10*q;

        d1 = q + 9*d3 + 5*d2 + d1;
        q = (d1 * 0xCD) >> 11;
        d1 = d1 - 10*q;

        d2 = q + 2*d2;
        q = (d2 * 0x1A) >> 8;
        d2 = d2 - 10*q;

        d3 = q + 4*d3;
        d4 = (d3 * 0x1A) >> 8;
        d3 = d3 - 10*d4;

        putchar( d4 + '0' );
        putchar( d3 + '0' );
        putchar( d2 + '0' );
        putchar( d1 + '0' );
        putchar( d0 + '0' );
    }

All multiplications in the above code can be reduced to short sequences of shift and add instructions, and it is straightforward to add the code discussed earlier to suppress leading zeros from this code.

All of these details are incorporated into the attached decimal print routine for the 14-bit family of PIC microcontrollers.

Alternative Radices

Note: The following section was added between August 6 and 8, 2012.

All of the above work was based on dividing the 16-bit number to be printed into four 4-bit chunks. That is, we worked in radix 16. In fact, there is no reason to work in a fixed radix. Consider, for example, breaking a 16-bit number into 4 chunks as follows:

                    n
    |_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _|
    |_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|_|
    |  n  |   n   |   n   |    n    |
        3      2       1        0

n = 8192 n₃ + 512 n₂ + 32 n₁ + 1 n₀

This leads to the following initial approximations for the decimal digits of the number:

a₃ = 8 n₃
a₂ = 1 n₃ + 5 n₂
a₁ = 9 n₃ + 1 n₂ + 3 n₁
a₀ = 2 n₃ + 2 n₂ + 2 n₁ + 1 n₀

Shifting to a mixed radix system instead of pure base 16 has the net effect of changing the range of values in each of initial approximations for the decimal digits:

0 ≤ a₃ ≤ 65
0 ≤ a₂ ≤ 95
0 ≤ a₁ ≤ 133
0 ≤ a₀ ≤ 105

Enlarging n₀ to 5 bits had no major effect because the multiplier on that term is 1. Shifting the other terms over one place changed 3 multiplers from 6 to 12, resulting in smaller terms. As a result, 8-bit arithmetic now suffices for all of the computations. Reducing this to C code, we get:

    void putdec( uint16_t n ) {
        uint8_t d4, d3, d2, d1, d0, q;

        d0 = n       & 0x1F;
        d1 = (n>>5)  &  0xF;
        d2 = (n>>9)  &  0xF;
        d3 = (n>>13) &  0x7;

        d0 = 2*(d3 + d2 + d1) + d0;
        q = (d0 * (uint16_t)0x67) >> 10;
        d0 = d0 - 10*q;

        d1 = 9*d3 + d2 + 3*d1 + q;
        q = (d1 * (uint16_t)0x67) >> 10;
        d1 = d1 - 10*q;

        d2 = d3 + 5*d2 + q;
        q = (d2 * (uint16_t)0x67) >> 10;
        d2 = d2 - 10*q;

        d3 = 8*d3 + q;
        q = (d3 * (uint16_t)0x1A) >> 8;
        d3 = d3 - 10*q;

        putchar(q  + '0');
        putchar(d3 + '0');
        putchar(d2 + '0');
        putchar(d1 + '0');
        putchar(d0 + '0');
    }

Enlarging n₀ to 5 bits had a second consequence. Because the upper bounds on the larger terms have been reduced, shortened approximations for one tenth can be used. On small microcontrollers without hardware support for multiplication, this can result in savings of a few instructions per multiply. The multiplier 67₁₆ works because 0.0001100111 approximates one tenth to sufficient precision to divide any number less than 179 by ten. The final digit, d3, can still be computed using the even shorter approximation we used previously.

64-bit Conversion on a 32-bit Machine

Note: The following section was added on June 8, 2010. As of June 10, the constants given in this section and the code at the end have been checked and appear to be correct. As of August 13, this code has undergone extensive testing.

The techniques explored above are applicable wherever a large integer must be converted to decimal on a machine with a smaller ALU. For example, consider the problem of converting 64-bit integers to decimal on a machine with a 32-bit ALU. This practical problem comes up, for example, in the BIOS code for Intel-architecture machines, where the code must not assume a 64-bit CPU but must still be able to report 64-bit values when it reports on such things as the sizes of the attached disk drives. When 64-bit arithmetic is supported on a 32-bit machine, it is slow enough that significant speedups can be achieved by avoiding it.

In general, 64-bit values may have up to 20 decimal digits:

2⁶⁴ = 18,446,744,073,709,551,616

Normally, we group decimal digits in groups of 3 using commas, as illustrated above. This is equivalent to writing the number in base 1000, where each base 1000 digit is represented with 3 consecutive decimal digits. Because there are 20 digits, grouping them as 5 blocks of 4 makes sense, suggesting that we should work in base 10,000. In addition, 10,000 is the largest power of 10 that can be represented in 16 bits (half the precision of our ALU). Therefore, we will write:

2⁶⁴ = 1844,6744,0737,0955,1616

We begin by breaking our 64-bit number into 16-bit chunks, just as we began the 16-bit conversion by breaking the number into 4-bit chunks. Again, we will the number n and we will call the 16-bit chunks n₃, n₂, n₁ and n₀.

                    n
    |___ ___ ___ ___ ___ ___ ___ ___|  64-bits
    |___|___|___|___|___|___|___|___|  8 bytes
    |   n   |   n   |   n   |   n   |  4 16-bit chunks
         3       2       1       0

The value of the number can be expressed in terms of these 4 fields as follows:

n = 2⁴⁸n₃ + 2³²n₂ + 2¹⁶n₁ + 2⁰n₀
n =
281,474,976,710,656 n₃ +
4,294,967,296 n₂ +
65,536 n₁ +
1 n₀

Because we are interested in working in base 10,000, we will regroup the digits above and continue with the following version:

n =
281,4749,7671,0656 n₃ +
42,9496,7296 n₂ +
6,5536 n₁ +
1 n₀

Our goal is to express n in base 10,000 as d₄d₃d₂d₁d₀. If each of d₄, d₃, d₂, d₁ and d₀ is a single base 10,000 digit, the value of n can be expressed as:

n =
1,0000,0000,0000,0000 d₄ +
1,0000,0000,0000 d₃ +
1,0000,0000 d₂ +
1,0000 d₁ +
1 d₀

Our initial approximations of the decimal digits comes from rearranging the base 10,000 digits of the multipliers for the 16-bit chunks to conform with the above decimal presentation:

n =
1,0000,0000,0000 ( 281 n₃ ) +
1,0000,0000 ( 4749 n₃ + 42 n₂ ) +
1,0000 ( 7671 n₃ + 9496 n₂ + 6 n₁ ) +
1 ( 0656 n₃ + 7296 n₂ + 5536 n₁ + 1 n₀ )

The approximations for the four least significant base 10,000 digits fall out of the above:

a₃ = 281 n₃
a₂ = 4749 n₃ + 42 n₂
a₁ = 7671 n₃ + 9496 n₂ + 6 n₁
a₀ = 0656 n₃ + 7296 n₂ + 5536 n₁ + 1 n₀

At this point, just as we did with the 4-bit example, we should check the range of values we might encounter in each approximation. Here, our concern is that some of the approximations might exceed the capacity of a 32-bit ALU. The maximum value of each term will occur when all of the n_i are at their maximum value, which is 65,535.

0 ≤ a₃ ≤ 18,415,335
0 ≤ a₂ ≤ 313,978,185
0 ≤ a₁ ≤ 1,125,432,555
0 ≤ a₀ ≤ 884,001,615

All of our approximations are comfortably less than 2 billion, which means that we can safely use 32-bit signed division, if that is all that is available to us.

We are now ready to compute the final values of the digits by a series of 32-bit divisions, as follows:

c₁ = a₀ / 10,000
d₀ = a₀ mod 10,000

c₂ = (a₁ + c₁) / 10,000
d₁ = (a₁ + c₁) mod 10,000

c₃ = (a₂ + c₂) / 10,000
d₂ = (a₂ + c₂) mod 10,000

c₄ = (a₃ + c₃) / 10,000
d₃ = (a₃ + c₃) mod 10,000

d₄ = c₄

In the above, the c_i terms represent carries propagated from one digit position to the next. The upper bounds on each of the c_i can be computed from the bounds given above for the a_i:

c₁ ≤ 88,400 [ = a₀ / 10,000 = 884,001,615 / 10,000 ]
c₂ ≤ 112,552 [ = (a₁ + c₁) / 10,000 = 1,125,520,955 / 10,000 ]
c₃ ≤ 31,409 [ = (a₂ + c₂) / 10,000 = 314,090,737 / 10,000 ]
c₄ ≤ 1,844 [ = (a₃ + c₃) / 10,000 = 18,446,744 / 10,000 ]

Note something encouraging: The largest intermediate value encountered above is 1,125,520,955, which is safely less than 2 billion and therefore safe to represent as either a signed or unsigned integer on a 32-bit machine.

Note also that the bound on c₄ is a small but useful check on our arithmetic. The value, 1,844, is identical to the first 4 decimal digits of 2**64, which is exactly what we expect. This does not eliminate the possibility that the above might contain many small clerical errors.

The code to print the 64-bit number ends up essentially identical to the code given previouly to print a 16-bit number, except for the sizes of the variables, the values of the various constants, and the code to print each digit of the result. If we have not made any errors in computing the various constants, the following C code should work:

    void putdec( uint64_t n ) {
        uint32_t d4, d3, d2, d1, d0, q;

        d0 = n       & 0xFFFF;
        d1 = (n>>16) & 0xFFFF;
        d2 = (n>>32) & 0xFFFF;
        d3 = (n>>48) & 0xFFFF;

        d0 = 656 * d3 + 7296 * d2 + 5536 * d1 + d0;
        q = d0 / 10000;
        d0 = d0 % 10000;

        d1 = q + 7671 * d3 + 9496 * d2 + 6 * d1;
        q = d1 / 10000;
        d1 = d1 % 10000;

        d2 = q + 4749 * d3 + 42 * d2;
        q = d2 / 10000;
        d2 = d2 % 10000;

        d3 = q + 281 * d3;
        q = d3 / 10000;
        d3 = d3 % 10000;

        d4 = q;

        printf( "%4.4u", d4 );
        printf( "%4.4u", d3 );
        printf( "%4.4u", d2 );
        printf( "%4.4u", d1 );
        printf( "%4.4u", d0 );
    }

In the above code, we used the standard types uint32_t and uint64_t to declare 32 and 64-bit unsigned integer variables. By convention, these should be defined in <stdint.h>. Note that if you are working on a 32-bit machine with no support for 64-bit arithemtic, there is no guarantee that you will be able to declare the 64-bit type and you may have to fake it, for example, by representing each 64-bit value as a pair of 32-bit components.

This code uses printf() to print each 4-digit base 10,000 digit of the result. Of course, this code prints all 20 digits of n. Suppressing leading zeros takes a bit more work than it did with the 4-bit nibble code because of the complexity of suppressing leading zeros within the printed representation of the first nonzero base 10,000 digit.

I don't know where I first encountered the basic idea for the 8-bit code presented here. This tutorial began as an attempt to reinvent that code after I found myself working on a machine with an 8-bit ALU and needing to do 16-bit conversions.
Thanks to Mark Ramsey, who, on January 22, 2007 pointed out two small typos with big consequences in the section on Living Within 8 Bits.
Thanks to Dennis Vlasenko, who, on June 15, 2007 pointed out more typos in the section on Doing Without Multiply and Divide, and for his work integrate the 16-bit code into the Linux distribution of vsprintf(). With this, code derived from the examples here was formally released into the public domain. The copyright here applies only to the presentation and does not cover the algorithms or limit the derivation of working code from the examples here.
Thanks to Joe Zbiciak, who, on August 23, 2007 pointed out a small error the introduction.
Thanks to Derick Moore, who, on June 10, 2010, reported that he had checked the constants and tested the code in the section on 64-bit conversion on a 32-bit machine and verified that it worked in the BIOS he was maintaining.
Thanks to Michal Nazarewicz for additional testing reported on August 13, 2010 and for his work to integrate the 64-bit code into the Linux distribution of vsprintf().
Thanks to George Spelvin, who on August 3, 2012, suggested that there is no reason that all of the chunks should be the same size, leading to the addition of the section on other radices.
Thanks to Ralph Corderoy, who on September 10, 2014, pointed out the weakness of the repeated subtraction code when presented with the value -32768.
Thanks to Ben Boldt, who on April 9, 2018, pointed out an error in the first attempt to optimize div10 at the start of the section on doing without hardware multiply and divide.