Strength Reduction of Multiplications by Integer Constants

Youfeng Wu, [email protected]
Sequent Computer Systems, Inc., D2-798, 15450 S. W. Koll Parkway, Beaverton, OR 97006-6063

Abstract: Replacing a multiplication of an integer constant by a sequence of simple instructions can be beneficial because it reduces the total number of cycles and also provides more opportunity for instruction-level parallelism. In this paper we develop an algorithm that selectively replaces multiplications of integer constants with sequences of SUB, ADD, SHIFT, and LEA instructions. The algorithm runs fast and the generated instruction sequences are of high quality. The number of simple instructions generated for a multiplication of an integer constant is about 84.5% of the number of 1-bits in the binary representation of the constant. Also, for several popular processors, the simple instruction sequences significantly reduce the execution time of the multiplications.

Key words: Integer multiplication, code generation, code optimization, instruction-level parallelism.

1. Introduction

For pipelined or superscalar processors, a multiplication is usually many times slower than simple instructions like ADD, SUB, SHIFT (X SHIFT Y = X * 2^Y), or LEA (LEA (B, I, S) = B + I * 2^S, where S = 1, 2, or 3). Replacing a multiplication with a sequence of simple instructions reduces the number of cycles taken by the operation and also provides more opportunity for instruction-level parallelism. For example, for the Intel Pentium processor [1], a multiplication of a 4-byte integer takes at least 10 cycles, but each ADD, SUB, SHIFT, or LEA takes only one cycle. If we replace the multiplication reg0*136 by the following two instructions (note that 136 = 128 + 8),

reg1 = reg0 SHIFT 7        --- reg1 = reg0 * 128
reg2 = LEA (reg1, reg0, 3) --- reg2 = reg1 + reg0 * 8

the multiplication will take only 2 cycles (plus a possible address-generation interlock cycle¹). Since the speed discrepancy between the multiplication and the sequence of simple instructions is so high, we can usually use a sequence of more than two instructions and still gain a performance improvement. For example, we can use the following five instructions to perform reg0 * 1950 and thereby use only 5 cycles (note that 1950 = 2048 - ((1 + 2) * 32 + 2)):

reg1 = LEA (reg0, reg0, 1) -- reg1 = reg0 + reg0 * 2
reg2 = reg1 SHIFT 5        -- reg2 = reg1 * 32
reg3 = LEA (reg2, reg0, 1) -- reg3 = reg2 + reg0 * 2
reg4 = reg0 SHIFT 11       -- reg4 = reg0 * 2048
reg5 = reg4 SUB reg3       -- reg5 = reg4 - reg3

Besides saving CPU cycles, a sequence of simple instructions is the RISC style of coding and is more suitable for instruction-level parallelism. For example, the fourth instruction in the above sequence can be run in parallel with any of the first three instructions on a processor that can issue two or more instructions. This paper discusses how to convert multiplications of integer constants into sequences of SUB/ADD/SHIFT/LEA instructions. More powerful instructions will not be considered. Also, we assume that the integer constant is positive; negative constants can be handled as suggested by Bernstein [4]. Section 2 formulates the problem. Sections 3 and 4 describe two previous solutions. Section 5 develops our techniques. Section 6 presents the performance experiments and the results. Section 7 briefly discusses the overflow issue. Section 8 concludes the paper.

¹ We will ignore the address-generation interlock cycles [1] in the rest of the paper for ease of discussion.

2. Problem Formulation

If only ADD is allowed in the simple instruction sequence, an elegant mathematical formulation of the problem can be found in Knuth [2], namely the concept of "addition-chain". For a multiplication X * N, where X is an unknown and N is a positive number, an "addition-chain" for N is defined as a sequence of integers

1 = a_0, a_1, a_2, ..., a_r = N

that are generated by the rule

a_k = a_j + a_i, for i, j s.t. k > j >= i, for all k in [1..r].

Once an addition-chain is found, the calculation of X * N can be performed by the following sequence of additions:

a_0 = X
a_k = a_j + a_i, k > j >= i
X * N = a_r.

For example, an addition-chain for the number 9 can be "1, 2, 4, 5, 9" because 9 = 5+4, 5 = 4+1, 4 = 2+2, 2 = 1+1. From this addition-chain, the following sequence of additions can be generated to calculate reg0*9:

reg1 = reg0 + reg0
reg2 = reg1 + reg1
reg3 = reg2 + reg0
reg4 = reg3 + reg2

It is not difficult to find an addition-chain for any integer constant N. For example, if N is an even number, N = N/2 + N/2, and if N is an odd number, N = (N/2 + 1) + N/2, and we can repeat this decomposition on N/2 until it reaches 1. However, the problem of finding the optimal chain (the one with the smallest number of additions) is believed to be NP-complete.

In the case where ADD, SUB, SHIFT, and LEA instructions are all allowed, Daniel Magenheimer et al. [3] extended the concept of the addition-chain to a "SUB/ADD/SHIFT/LEA-chain" (we will call it a SASL-chain). That is, the rule
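The two lowerings above can be checked directly. Below is a small C sketch (the function names are ours) that maps each SHIFT/LEA/SUB step of the reg0*136 and reg0*1950 sequences onto shifts and adds:

```c
#include <stdint.h>

/* reg0 * 136, with 136 = 128 + 8: one SHIFT and one LEA */
int32_t mul136(int32_t reg0) {
    int32_t reg1 = reg0 << 7;            /* reg1 = reg0 * 128         */
    int32_t reg2 = reg1 + (reg0 << 3);   /* reg2 = LEA(reg1, reg0, 3) */
    return reg2;
}

/* reg0 * 1950, with 1950 = 2048 - ((1 + 2) * 32 + 2) */
int32_t mul1950(int32_t reg0) {
    int32_t reg1 = reg0 + (reg0 << 1);   /* LEA(reg0, reg0, 1): reg0 * 3   */
    int32_t reg2 = reg1 << 5;            /* SHIFT: reg1 * 32               */
    int32_t reg3 = reg2 + (reg0 << 1);   /* LEA(reg2, reg0, 1): + reg0 * 2 */
    int32_t reg4 = reg0 << 11;           /* SHIFT: reg0 * 2048             */
    return reg4 - reg3;                  /* SUB                            */
}
```

Each statement corresponds to one single-cycle instruction in the paper's sequences.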
ACM SIGPLAN Notices, Volume 30, No. 2, February 1995
a_k = a_j + a_i, for i, j s.t. k > j >= i, for all k in [1..r]

is extended to

a_k = a_j + a_i, or
a_k = a_j - a_i, or
a_k = a_j SHIFT a_i, or
a_k = LEA (a_i, a_j, 1), or
a_k = LEA (a_i, a_j, 2), or
a_k = LEA (a_i, a_j, 3),
for i, j s.t. k > j >= i, k in [1..r].

So the problem of replacing a multiplication of an integer constant N by a sequence of SUB/ADD/SHIFT/LEA instructions is equivalent to the problem of finding a SASL-chain for N. For example, a SASL-chain for 9 is "1, 9" because 9 = LEA (1, 1, 3), and the multiplication reg0*9 can be performed with LEA (reg0, reg0, 3). Although including SUB/SHIFT/LEA instructions can reduce the length of the simple instruction sequence, it makes the search space for the minimal sequence significantly larger than that for an optimal addition-chain, and it is believed that the problem of finding the shortest SASL-chain for an integer constant N remains NP-complete. Any practical solution for finding a good SASL-chain for a constant will probably involve heuristics.
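To make the chain rules concrete, here is a small C sketch (helper names are ours) of the SASL operations, treating every chain element as a multiple of the unknown X, with "1" standing for X itself:

```c
#include <stdint.h>

/* The ways a new chain element a_k may be formed from earlier elements.
 * All values are understood as multiples of the unknown X. */
static int64_t op_add  (int64_t aj, int64_t ai)        { return aj + ai; }
static int64_t op_sub  (int64_t aj, int64_t ai)        { return aj - ai; }
static int64_t op_shift(int64_t aj, int64_t ai)        { return aj << ai; }
static int64_t op_lea  (int64_t b,  int64_t i, int s)  { return b + (i << s); } /* s = 1, 2, or 3 */
```

With these helpers, the two-element SASL-chain "1, 9" is op_lea(1, 1, 3), and the three-instruction sequence for 29 given later in the paper (29 = 32 - 3) is op_sub(op_shift(1, 5), op_lea(1, 1, 1)).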
3. A Branch and Bound Heuristic

Robert Bernstein [4] provided a nice "branch and bound" algorithm for searching for a sequence of ADD/SUB/SHIFT instructions to replace a multiplication of an integer constant. For a number N, Bernstein's algorithm first recursively searches for the simple instruction sequences for N + 1, N - 1, and N / 2^i. Then it chooses the one with the smallest cost and applies an ADD, SUB, or SHIFT to obtain the sequence for N. A source program for this algorithm is available from Preston Briggs [5]. We extended it to search for the sequence for N / (2^i + 1) as well, so as to use LEA instructions as much as possible. This heuristic can discover many good sequences. But the extended Bernstein algorithm still doesn't take full advantage of the LEA instruction. For example, the SASL-chain it discovers for 29 is (note that 29 = (8 - 1) * 4 + 1):

8 = 1 SHIFT 3
7 = 8 SUB 1
28 = 7 SHIFT 2
29 = 28 ADD 1.

But notice that 29 = 32 - 3, so the following three instructions are sufficient:

3 = LEA (1, 1, 1)
32 = 1 SHIFT 5
29 = 32 SUB 3.

Also, the extended Bernstein algorithm is computationally expensive. Later we will compare the performance of this algorithm with that of our algorithm.

4. A Rule-based Heuristic

The rule-based system used by Daniel Magenheimer et al. [3] is very powerful. Many tricks that we humans do can be specified as rules, and they may be picked up by the inference engine. Magenheimer et al. [3] mention that the rule-based approach generated optimal sequences for all numbers under 10,000 except 12, and it should be simple to add more rules to cover that case. However, we speculate that the rule-based approach may not run as fast as a procedural approach. Also, it may not be cost-effective to include a rule-based system in an existing compiler solely for the purpose of reducing the strength of multiplications by integer constants. This situation may change, as people are proposing rule-based approaches for several other compiler optimizations.

5. Our New Heuristics

For converting a multiplication of an integer constant to a sequence of ADD/SUB/SHIFT/LEA instructions, our algorithm consists of a basic part, which generates an LEA and possibly a SHIFT for each 1-bit in the binary representation of the constant, and two additional parts that try to reduce the number of 1-bits in the constant quickly. In the following discussions, we assume LEA (B, I, S) = B + I * 2^S, where S = 1, 2, ..., or MAX_S.

5.1. The Basic Algorithm

Each integer constant N can be represented as:

N = sum_{i=0}^{k} a_i * 2^i, where a_i = 0 or 1.

Assume a_u and a_v are the lowest two 1-bits and u < v. Namely, a_u = a_v = 1, and a_j = 0 for any j such that j < u or u < j < v. The basic algorithm considers the following three cases.

Case 1. u = 0. This means a_0 = 1. Assume t = min (v, MAX_S). We have N = 1 + 2^t * N'. We can first calculate N' and then obtain N using the following LEA instruction: N = LEA (1, N', t).

Case 2. 0 < u <= MAX_S. That is, N = 2^u + N'. In this case, we can first calculate N' and then obtain N using an LEA instruction, as follows: N = LEA (N', 1, u).

Case 3. u > MAX_S. That is, N = 2^u * N'. In this case, we can first calculate N' and then obtain N using a SHIFT instruction, as follows: N = N' SHIFT u.

To implement the basic algorithm efficiently, we introduce the following "condensed representation" of an integer number.
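The three cases can be sketched at the value level in C (this is our own illustrative simulation, not the paper's implementation; MAX_S = 3 is assumed, and n must be positive). Each recursive step performs exactly the LEA or SHIFT the corresponding case would emit:

```c
#include <stdint.h>

#define MAX_S 3

/* Value-level sketch of the three cases: computes x * n (n > 0) using
 * only the add/shift combinations that the emitted LEA/SHIFT/ADD
 * instructions would perform. */
int64_t basic_mul(int64_t x, uint64_t n) {
    if ((n & (n - 1)) == 0) {            /* single 1-bit: just one SHIFT */
        unsigned s = 0;
        while ((n >> s) != 1) s++;
        return x << s;
    }
    unsigned u = 0;
    while (((n >> u) & 1) == 0) u++;     /* u = position of lowest 1-bit */
    if (u == 0) {                        /* Case 1: N = 1 + 2^t * N'     */
        unsigned v = 1;
        while (((n >> v) & 1) == 0) v++; /* v = position of next 1-bit   */
        unsigned t = v < MAX_S ? v : MAX_S;
        return x + (basic_mul(x, n >> t) << t);          /* LEA(1, N', t)  */
    } else if (u <= MAX_S) {             /* Case 2: N = N' + 2^u         */
        return basic_mul(x, n - (1ULL << u)) + (x << u); /* LEA(N', 1, u)  */
    } else {                             /* Case 3: N = N' * 2^u         */
        return basic_mul(x, n >> u) << u;                /* SHIFT          */
    }
}
```

Note that the power-of-two guard handles the degenerate r = 1 constants that the three cases (which assume two lowest 1-bits a_u and a_v) do not cover.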
For an integer number N = sum_{i=0}^{k} a_i * 2^i, assume the number of 1-bits in its binary representation is r, and the 1-bits are at the following locations: L_0, L_1, ..., L_{r-1}, where L_{i+1} > L_i >= 0. That is,

N = sum_{j=0}^{r-1} a_{L_j} * 2^{L_j}.

The condensed representation of the number N is R_N[0, ..., r-1], where R_N[0] = L_0, and R_N[i] = L_i - L_{i-1}, for i = 1, ..., r-1. For example, 1951 = 2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^7 + 2^8 + 2^9 + 2^10 can be represented as [0,1,1,1,1,3,1,1,1].

The basic algorithm is described in Algorithm 1, where the function decompose_mul() will be described in Section 5.4. At this point, we can assume it is a call to basic_decompose_mul itself and the recursion will terminate.

Algorithm Name: basic_decompose_mul (R, r, N)
Input:  N -- the constant to be multiplied
        r -- the number of 1-bits in N
        R[0...r-1] -- the condensed representation of N
Output: emit LEA/SHIFT instructions for multiplying the constant N and
        return the root of the instruction tree emitted.
Process:
  if (R[0] == 0) {                  /* Case 1 */
    t = min (MAX_S, R[1]);
    R[1] -= t;
    N' = decompose_mul (R[1...r-1], r - 1, (N - 1) / 2^t);
    return emit_LEA (1, N', t);
  } else if (R[0] <= MAX_S) {       /* Case 2 */
    r0 = R[0];
    R[1] += r0;
    N' = decompose_mul (R[1...r-1], r - 1, N - 2^r0);
    return emit_LEA (N', 1, r0);
  } else {                          /* Case 3 */
    r0 = R[0];
    R[0] = 0;
    N' = decompose_mul (R, r, N / 2^r0);
    return emit_SHIFT (N', r0);
  }

Algorithm 1. The basic algorithm.

Take the number 657 as an example. 657 = 2^0 + 2^4 + 2^7 + 2^9, so R657[0..3] = [0,4,3,2]. Since R657[0] = 0, we can apply Case 1, and we have 657 = LEA (1, N', 3), where N' = 82 = 2^1 + 2^4 + 2^6. R82[0..2] = [R657[1]-3, R657[2], R657[3]] = [1,3,2]. Since R82[0] = 1, we can apply Case 2, and we have 82 = LEA (N', 1, 1), where N' = 80 = 2^4 + 2^6. R80[0..1] = [R82[1]+R82[0], R82[2]] = [4,2]. R80[0] = 4 requires that Case 3 be applied, and we have 80 = N' SHIFT 4, where N' = 5 = 2^0 + 2^2. R5[0..1] = [0, R80[1]] = [0,2]. After applying Case 1 again, we have 5 = LEA (1, 1, 2). Altogether this algorithm produces the following instruction sequence (note that 657 = ((1 + 1*4)*16 + 2)*8 + 1):

5 = LEA (1, 1, 2)
80 = 5 SHIFT 4
82 = LEA (80, 1, 1)
657 = LEA (1, 82, 3).

Note that 657 has four 1-bits, and the simple instruction sequence generated for it by the basic algorithm uses three LEA instructions and one SHIFT instruction. In general, we have the following proposition about the simple instruction sequences generated by the basic algorithm.

Proposition: Assume a number N has r 1-bits. The basic algorithm uses r-1 LEA instructions and no more than

1 + floor((log2(N) - MAX_S) / (2*MAX_S + 2))

SHIFT instructions.

Proof: For both Case 1 and Case 2, when an LEA is used, N' has one less 1-bit than N. For Case 3, no LEA is used, and N' becomes Case 1 and has the same number of 1-bits as N. When N = 1, no LEA is needed. So the number of LEA instructions used to calculate N will be r-1. Furthermore, SHIFT instructions are only used in Case 3. The first SHIFT will be used if u >= 1 + MAX_S. When a SHIFT is used in Case 3, N' becomes Case 1. For a number in Case 1 coming to Case 3, there are two paths: 1) coming to Case 3 directly, or 2) first coming to Case 2 and then coming to Case 3. If the number comes to Case 3 directly from Case 1, there must be v - u >= 1 + 2*MAX_S. That is, at least 2*MAX_S + 2 bits are consumed when coming from Case 1 to Case 3. If the number comes to Case 2 and then to Case 3, going from Case 1 to Case 2 consumes at least MAX_S + 1 bits, and going from Case 2 to Case 3 consumes at least MAX_S + 1 bits. Again, at least 2*MAX_S + 2 bits are consumed when coming from Case 1 to Case 3. Any number N has at most 1 + log2(N) significant bits. Applying the basic algorithm to it, the first SHIFT consumes at least MAX_S + 1 bits, and each subsequent SHIFT consumes at least 2*MAX_S + 2 bits. So the number of SHIFTs, S, satisfies:

MAX_S + 1 + (S - 1) * (2*MAX_S + 2) <= 1 + log2(N).

That is,

S <= 1 + floor((log2(N) - MAX_S) / (2*MAX_S + 2)).

QED

Corollary: For a 32-bit integer and MAX_S = 3, the basic algorithm will use no more than 4 SHIFT instructions. That is because S <= 1 + floor((31 - 3) / 8) = 4.

Since the number of instructions generated by this algorithm is in proportion to the number of 1-bits in N, we developed two techniques to reduce the number of 1-bits as much as possible: the generalized complementary technique and the factoring technique. These techniques are discussed in the following two subsections.
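The condensed representation is straightforward to build and invert. A C sketch of the two conversions (the paper names these utilities value_to_condensed and condensed_to_value in Section 5.4; the implementation details here are ours):

```c
#include <stdint.h>

/* Build the condensed representation R[0..r-1] of n: R[0] is the position
 * of the lowest 1-bit, and R[i] (i > 0) is the gap L_i - L_{i-1} between
 * consecutive 1-bit positions. Returns r, the number of 1-bits. */
int value_to_condensed(uint64_t n, int R[64]) {
    int r = 0, prev = 0;
    for (int pos = 0; pos < 64; pos++) {
        if ((n >> pos) & 1) {
            R[r++] = pos - prev;   /* for the first 1-bit this is L_0 itself */
            prev = pos;
        }
    }
    return r;
}

/* Inverse: accumulate the gaps back into absolute bit positions. */
uint64_t condensed_to_value(const int R[], int r) {
    uint64_t n = 0;
    int pos = 0;
    for (int i = 0; i < r; i++) {
        pos += R[i];
        n |= 1ULL << pos;
    }
    return n;
}
```

For 1951 this reproduces the example above: r = 9 and R = [0,1,1,1,1,3,1,1,1].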
5.2. Generalized Complementary Technique

The above basic procedure can be improved by converting k consecutive 1's to 2^k - 1. This complementary approach can reduce the number of 1-bits very quickly. Of course, k should be at least 3 for this complementary scheme to be beneficial, since 2^k - 1 takes two instructions to accomplish. This approach can be generalized to a sequence of 0's and 1's (rather than a consecutive sequence of 1's), as long as the final sequence needs fewer instructions. Assume

N = sum_{i=0}^{k} a_i * 2^i, and a_k = 1.

Namely,

2^{k+1} = sum_{i=0}^{k} a_i * 2^i + 1 + sum_{i=0}^{k} (~a_i) * 2^i
        = N + 1 + sum_{i=0}^{k} (~a_i) * 2^i.

That is, N = 2^{k+1} - N', where

N' = 1 + sum_{i=0}^{k} (~a_i) * 2^i.

Note that sum_{i=0}^{k} (~a_i) * 2^i is the number obtained from N with the first k bits of N complemented and with 0's in all other bits. The number of 1-bits in the binary representation of N' will be no more than the number of 0's in the first k bits of N (plus one if ~a_0 is zero). Also, to obtain N from N', a SHIFT (for generating 2^{k+1}) and a SUB are needed. Assume r is the number of 1-bits in N and z is the number of 0-bits among the 1-bits in N. To calculate N using 2^{k+1} - N' with fewer instructions than the basic algorithm, N should satisfy the following "complementary condition":

r - z - 2 > 0.

When z = 0, we get back to the case where N consists of consecutive 1's. We will call the value on the left side of the above condition the "gain" from the generalized complementary scheme.

This generalized complementary scheme can be incorporated into the basic procedure as follows. At the beginning of the basic procedure, N is checked to identify a prefix a_0, ..., a_j that satisfies the complementary condition and maximizes the "gain". If such a sequence exists, we have

N = sum_{i=0}^{k} a_i * 2^i
  = sum_{i=0}^{j} a_i * 2^i + sum_{i=j+1}^{k} a_i * 2^i
  = 2^{j+1} - (1 + sum_{i=0}^{j} (~a_i) * 2^i) + sum_{i=j+1}^{k} a_i * 2^i
  = N2 - N1,

where

N1 = 1 + sum_{i=0}^{j} (~a_i) * 2^i, and
N2 = 2^{j+1} + sum_{i=j+1}^{k} a_i * 2^i,

and we can calculate N by first obtaining N1 and N2.

Note that the "complementary condition" r - z - 2 > 0 can be checked quickly when N is in the condensed representation. Given a condensed representation R[0, ..., r-1], it is not difficult to see that

z = R[0] + sum_{i=1}^{r-1} (R[i] - 1) = (sum_{i=0}^{r-1} R[i]) - r + 1.

The condition r - z - 2 > 0 is the same as

r - ((sum_{i=0}^{r-1} R[i]) - r + 1) - 2 > 0, or
2*r - 3 - sum_{i=0}^{r-1} R[i] > 0.

For example, 5070 = 2^1 + 2^2 + 2^3 + 2^6 + 2^7 + 2^8 + 2^9 + 2^12, so R[0..7] = [1,1,1,3,1,1,1,3]. Table 1 shows the gains for prefix sub-sequences with r = 2 to r = 8. From the table we can see that the sub-sequence that can be complemented most beneficially is the one with r = 7, for which the maximum gain of 2 is obtained. Applying the generalized complementary scheme, we have 5070 = 2^10 + 2^12 - (2^1 + 2^4 + 2^5).

Table 1. Gains of complementing sub-sequences

r   R                    gain
2   [1,1]                4 - 3 - 2 = -1
3   [1,1,1]              6 - 3 - 3 = 0
4   [1,1,1,3]            8 - 3 - 6 = -1
5   [1,1,1,3,1]          10 - 3 - 7 = 0
6   [1,1,1,3,1,1]        12 - 3 - 8 = 1
7   [1,1,1,3,1,1,1]      14 - 3 - 9 = 2
8   [1,1,1,3,1,1,1,3]    16 - 3 - 12 = 1
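The prefix scan behind Table 1 can be sketched in a few lines of C (function names are ours), computing the gain 2*r - 3 - sum(R[0..r-1]) for each prefix of the condensed representation and picking the best:

```c
/* Gain of complementing the prefix covering the first r 1-bits,
 * i.e. R[0..r-1]: gain = 2*r - 3 - sum(R[0..r-1]). */
int prefix_gain(const int R[], int r) {
    int sum = 0;
    for (int i = 0; i < r; i++) sum += R[i];
    return 2 * r - 3 - sum;
}

/* Return the prefix length (in 1-bits) with the maximum positive gain,
 * or 0 if no prefix satisfies the complementary condition (gain > 0). */
int best_prefix(const int R[], int r, int *best_gain) {
    int best = 0;
    *best_gain = 0;
    for (int i = 2; i <= r; i++) {   /* a useful prefix has at least two 1-bits */
        int g = prefix_gain(R, i);
        if (g > *best_gain) { *best_gain = g; best = i; }
    }
    return best;
}
```

For 5070, with R = [1,1,1,3,1,1,1,3], this selects the prefix of seven 1-bits with gain 2, matching Table 1.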
5.3. Factoring Technique
If N is a multiple of (1 + 2^i), for i = 1, 2, ..., or MAX_S, then N = (1 + 2^i) * N', where
N' = N / (1 + 2^i), and N can be calculated as follows:

N = LEA (N', N', i).

In most cases, N' will have a much smaller number of 1-bits than N, and factoring N as the product of N' and (1 + 2^i) will reduce the number of 1-bits rapidly. For example, 108 = 2^2 + 2^3 + 2^5 + 2^6 = 12 * (2^3 + 1), where 12 = 2^2 + 2^3 has only half the number of 1-bits of 108. But this may not always be true. An example is 290 = 58 * 5, where 290 = 2^1 + 2^5 + 2^8 has three 1-bits, while 58 = 2^1 + 2^3 + 2^4 + 2^5 has four 1-bits. We apply factoring only when N' has fewer 1-bits than N.

5.4. Combined Algorithm

The combined algorithm uses the basic algorithm and the two techniques discussed in 5.2 and 5.3. Given a number N, it first checks to see if N has a prefix of 0's and 1's that can be complemented to reduce the number of 1-bits in N. If so, the complementary scheme is applied. Otherwise, it checks to see if N can be factored to reduce the number of 1-bits. If so, the factoring scheme is applied. If N cannot be complemented or factored, the basic algorithm is used. We need the following utility functions to implement the combined algorithm:

value_to_condensed (N): returns the condensed representation (R, r) of N.
condensed_to_value (R, r): returns the value of the condensed representation (R, r).
complement_value (N): returns 1 + sum_{i=0}^{k} (~a_i) * 2^i for N = sum_{i=0}^{k} a_i * 2^i.
complement_condensed (R, r): returns value_to_condensed (complement_value (condensed_to_value (R, r))).

The combined algorithm is described in Algorithm 2. The procedure decompose_simple_cases() emits instructions for the simple cases when r <= 2, and its implementation is omitted. The procedure basic_decompose_mul() is described in Section 5.1.

Algorithm Name: decompose_mul (R, r, N)
Input:  N -- the constant to be multiplied
        r -- the number of 1-bits in N
        R[0...r-1] -- the condensed representation of N
Output: emit ADD/SUB/LEA/SHIFT instructions for multiplying N and return
        the root of the instruction tree emitted.
Process:
  if (r <= 2)
    return decompose_simple_cases (R, r, N);
  if (there is an i such that R[0...i-1] satisfies and maximizes
      the generalized complementary condition) {
    (R1, r1) = complement_condensed (R[0...i-1], i);
    k = R[0] + ... + R[i-1] + 1;
    (R2, r2) = ([k, R[i]-1, R[i+1], ..., R[r-1]], r - i + 1);
    N1 = decompose_mul (R1, r1, condensed_to_value (R1, r1));
    N2 = decompose_mul (R2, r2, condensed_to_value (R2, r2));
    return emit_SUB (N2, N1);
  }
  if (N is a multiple of (1 + 2^i) for some i in [1...MAX_S]) {
    (R', r') = value_to_condensed (N / (1 + 2^i));
    if (r' < r) {
      N' = decompose_mul (R', r', N / (1 + 2^i));
      return emit_LEA (N', N', i);
    }
  }
  return basic_decompose_mul (R, r, N);

Algorithm 2. The combined algorithm.

6. Performance Experiment

We implemented the combined algorithm in our C compiler for the Intel x86 and Pentium processors. Here we present some of the experimental results. The execution time and code quality results are presented in Sections 6.1 and 6.2, respectively. The effects of the complementary and factoring techniques on the overall algorithm are examined in Section 6.3. In Section 6.4 we investigate the opportunities for applying the simple instruction sequences to reduce the execution time of multiplications of integer constants.

6.1. Execution Time

Table 2 compares the execution time (on an i486 system) of our algorithm (ALG1) with the extended Bernstein algorithm (ALG2). Both user time and system time are listed, with system time in parentheses. The execution time of ALG1 is fairly uniform, roughly 40 microseconds per multiplication, and the amount of system time is very small. On the other hand, the execution time of ALG2 seemingly grows more than linearly as the constant gets bigger. Also, ALG2 incurs a significant amount of system-time overhead.

Table 2: Execution time comparison in 0.1 second: User time (sys time)

NUMBER RANGES    ALG1          ALG2
[1, 100000]      32.3 (0.1)    36.5 (37.4)
[1, 200000]      70.0 (0.2)    143.4 (220.3)
[1, 300000]      110.1 (0.1)   333.8 (688.7)
[1, 400000]      151.1 (0.5)   518.1 (653.6)
[1, 500000]      193.6 (0.2)   737.7 (421.5)
[1, 600000]      236.5 (0.2)   1025.7 (431.1)
[1, 700000]      279.8 (0.1)   1374.5 (432.5)
[1, 800000]      323.6 (0.1)   1768.3 (440.8)
[1, 900000]      368.8 (0.1)   2213.0 (449.4)
[1, 1000000]     414.4 (0.1)   2709.8 (486.6)
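The factoring test in Algorithm 2 — divisibility by (1 + 2^i) plus the 1-bit comparison — is easy to express directly. A C sketch (function names are ours; MAX_S = 3 assumed; the popcount is written out as a loop):

```c
#include <stdint.h>

#define MAX_S 3

static int popcount64(uint64_t n) {
    int c = 0;
    while (n) { c += (int)(n & 1); n >>= 1; }
    return c;
}

/* If some factor (1 + 2^i), i in [1..MAX_S], divides n AND the quotient
 * has strictly fewer 1-bits, return the first such i; otherwise return 0. */
int factoring_candidate(uint64_t n) {
    for (int i = 1; i <= MAX_S; i++) {
        uint64_t f = 1ULL + (1ULL << i);
        if (n % f == 0 && popcount64(n / f) < popcount64(n))
            return i;
    }
    return 0;
}
```

For 108 a beneficial factor exists (the paper uses 2^3 + 1 = 9; this scan happens to find 2^1 + 1 = 3 first, and 108/3 = 36 also has only two 1-bits), while for 290 the only candidate factor 5 yields 58, which has more 1-bits, so factoring is rejected — matching the discussion in Section 5.3.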
6.2. Quality of the Generated Instruction Sequences

For simplicity, we assume that each of the simple instructions takes one cycle. Table 3 compares the total number of instructions used by ALG1 and ALG2 for all the numbers in the given ranges. ALG2 consistently uses more (around 7%) instructions than ALG1. Furthermore, the difference is more significant for small numbers than for big numbers.

Table 3: Comparison of the number of instructions

NUMBER RANGES    ALG1      ALG2      % diff
[1, 100000]      679904    740311    8.885%
[1, 200000]      1446258   1565844   8.269%
[1, 300000]      2247738   2422837   7.790%
[1, 400000]      3065921   3301085   7.670%
[1, 500000]      3921855   4197590   7.031%
[1, 600000]      4755289   5098608   7.220%
[1, 700000]      5604671   6012433   7.275%
[1, 800000]      6478143   6938681   7.109%
[1, 900000]      7369242   7871300   6.813%
[1, 1000000]     8275983   8813951   6.500%

6.3. Effects of the Complementary and Factoring Techniques

Table 4 summarizes the relative effect of each of the three techniques in our algorithm on the number of instructions used. The basic algorithm is referred to as STEP1, STEP1 with the generalized complementary scheme is referred to as STEP2, and STEP2 with factoring is referred to as STEP3. For each range of numbers, Table 4 shows the percentage of additional instructions used by STEP1 over STEP2 (in column 2) and by STEP2 over STEP3 (in column 3). From Table 4 we can see that STEP2 reduces the number of instructions over STEP1 more rapidly than STEP3 does over STEP2, especially for smaller numbers. STEP3 reduces the number of instructions more significantly for larger numbers than for smaller numbers. For all the numbers in the range [1, 100,000,000], STEP1 uses about 16.7% more instructions than STEP2, and STEP2 uses about 15.1% more than STEP3. Clearly, the performance of the algorithm gets significantly better as it evolves from STEP1 to STEP2 and then to STEP3. Table 4 also shows (in column 4) the ratio of the number of instructions generated by STEP3 to the total number of 1-bits in the binary representations of the input constants. On average, the number of simple instructions generated is about 84.5% of the number of 1-bits in the constant.

Table 4: The effects of the components of the combined algorithm

NUMBER RANGES            STEP1/STEP2  STEP2/STEP3  STEP3
[1, 100000]              15.6%        5.0%         0.834
[100001, 200000]         16.2%        5.1%         0.838
[200001, 300000]         18.0%        4.3%         0.845
[300001, 400000]         14.9%        6.0%         0.833
[400001, 500000]         15.5%        6.6%         0.829
[500001, 600000]         21.0%        2.0%         0.864
[600001, 700000]         14.0%        4.7%         0.854
[700001, 800000]         16.3%        7.2%         0.817
[800001, 900000]         12.8%        7.0%         0.851
[900001, 1000000]        18.5%        6.2%         0.811
[1000001, 10000000]      16.4%        15.0%        0.843
[10000001, 20000000]     16.7%        14.9%        0.845
[20000001, 30000000]     15.9%        15.5%        0.841
[30000001, 40000000]     13.8%        18.2%        0.852
[40000001, 50000000]     16.0%        16.1%        0.837
[50000001, 60000000]     15.7%        15.3%        0.847
[60000001, 70000000]     11.5%        22.8%        0.818
[70000001, 80000000]     15.9%        14.0%        0.891
[80000001, 90000000]     16.4%        16.6%        0.851
[90000001, 100000000]    15.4%        16.0%        0.827
[1, 100000000]           16.7%        15.1%        0.845

6.4. Opportunities for Using the Simple Instruction Sequences

The hardware multiplication unit normally cannot analyze its operands as thoroughly as our combined algorithm. Due to the hardware cost, multipliers usually adopt simple algorithms that run fast on average [6, 7]. Our software solution complements the hardware solution by picking up those multiplications of integer constants that can be done faster through compile-time analysis. To determine whether to use a simple instruction sequence or the multiplication instruction, we can calculate the exact cost of the simple instruction sequence and compare it with the actual cost of the multiplication instruction. If the number of clock cycles is the only concern, the decision is simple: we use the simple instruction sequence when it takes fewer clock cycles. For example, the Intel i486 processor [8] employs an early-out algorithm for its multiplication instruction, and the number of clock cycles taken by the multiplication instruction is determined by the position of the most-significant bit in the multiplier. Assume the value of the multiplier is m > 0. The actual number of clock cycles taken by a multiplication instruction is:

max(13, max(ceil(log2(m)), 3) + 6).

Table 5. Distribution of [1...100,000] by the differences between the number of cycles taken by the i486 multiplication instruction and by the simple instruction sequence

diff  count    diff  count    diff  count
1     0        8     17570    15    570
2     22       9     18415    16    237
3     190      10    15605    17    82
4     1013     11    10602    18    27
5     3529     12    6152     19    9
6     7831     13    3089     20    2
7     13607    14    1449     21    1

Table 5 shows how the numbers between 1 and 100,000 are distributed by the differences between the number of cycles taken by the multiplication instruction and that by
the simple instruction sequences (assuming each ADD and SUB takes one cycle, and each LEA and SHIFT takes two cycles on the Intel i486 processor). For all of the numbers, the simple instruction sequences take fewer cycles, and for over 89% of the numbers the simple instruction sequences take no more than 70% of the cycles used by the multiplication instruction. Table 6 repeats the same experiment for the Intel Pentium processor (for which each ADD, SUB, LEA, and SHIFT takes one cycle, and the multiplication instruction takes 10 cycles). For 99.7% of the numbers, the simple instruction sequences take fewer cycles, and for about 71% of the numbers, the simple instruction sequences take no more than 70% of the cycles used by the multiplication instruction.
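The i486 early-out cycle count used in these comparisons can be sketched in C (the function name is ours; ceil(log2 m) is computed with a doubling loop):

```c
/* Cycles taken by the i486 multiply for a multiplier m > 0, following the
 * early-out formula max(13, max(ceil(log2 m), 3) + 6) quoted in Section 6.4. */
int i486_mul_cycles(unsigned long m) {
    int cl = 0;                      /* cl = ceil(log2(m)) */
    unsigned long p = 1;
    while (p < m) { p <<= 1; cl++; }
    int c = (cl > 3 ? cl : 3) + 6;
    return c > 13 ? c : 13;
}
```

Small multipliers hit the 13-cycle floor, while a multiplier near 100,000 costs ceil(log2 100000) + 6 = 23 cycles, which is where the gap against the one- and two-cycle simple instructions comes from.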
Table 6. Distribution of [1...100,000] by the differences between the number of cycles taken by the Pentium multiplication instruction and by the simple instruction sequence

diff  count    diff  count
0     332      5     10125
1     5968     6     2913
2     22930    7     659
3     33292    8     113
4     23648    9     20

In addition to comparing the number of cycles, other factors can also affect the decision of whether or not to use the simple instruction sequence, such as parallelism, register pressure, and instruction space. For example, in a pipelined or superscalar processor, if there are plenty of idle cycles surrounding the multiplication instruction, the simple instruction sequence may be completely inserted into the idle slots so that its execution can be regarded as free, even when the simple instruction sequence is long. On the other hand, if a processor has a very small instruction cache, expanding a multiplication instruction into a long sequence of instructions may increase the cache pressure and thus degrade performance. In our implementation, a simple instruction sequence is only used if its total number of cycles is no more than 70% of the number of cycles taken by the multiplication instruction.

7. Handling Overflows

Converting a multiplication of an integer constant to a sequence of SUB/ADD/SHIFT/LEA instructions may introduce runtime overflows. For example, if we convert X * (2^32 - 1) to (X SHIFT 32) - X, the operation (X SHIFT 32) will overflow on a 32-bit machine when X = 1, while the original multiplication will not. This can be prevented by disallowing the SUB instruction in the simple instruction sequence. Since the SUB instruction is only used in the generalized complementary scheme, we added a compiler option that disables that optimization when specified. The dual problem is that converting a multiplication of an integer constant to a sequence of SUB/ADD/SHIFT/LEA instructions may mask an overflow that would have been reported at runtime. This is because the LEA instruction doesn't generate an overflow condition while a multiplication instruction does. We can partially work around this problem by not using an LEA instruction as the last instruction in the sequence.

8. Conclusions

We designed an algorithm that replaces multiplications of integer constants by sequences of simple instructions. The algorithm runs fast and the generated instruction sequences are of high quality. The number of simple instructions generated for a multiplication of an integer constant averages 84.5% of the number of 1-bits in the binary representation of the constant. Also, for the processors considered, the simple instruction sequences found by our algorithm significantly reduce the number of cycles needed to perform the multiplications.

Acknowledgments

I would like to thank Rocky Bernstein, Preston Briggs, Cliff Click, Hans Olsson, Thomas David Rivers, and many other people for references and discussions on the topic of multiplications by integer constants. Special thanks to Preston Briggs for his implementation of the Bernstein algorithm, which is used in our performance comparison. I would also like to thank James Bash, Dorsey Drane, and Al Lau for their helpful comments on the paper.

References

[1] Intel, Pentium Processor User's Manual, Volume 3: Architecture and Programming Manual, 1993.
[2] D. E. Knuth, The Art of Computer Programming, Volume 2: Seminumerical Algorithms, pp. 402-422.
[3] Daniel Magenheimer, et al., "Integer Multiplication and Division on the HP Precision Architecture," Proceedings of ASPLOS II, pp. 90-99.
[4] Robert Bernstein, "Multiplication by Integer Constants," Software -- Practice and Experience, Vol. 16, No. 7, pp. 641-652, July 1986.
[5] Preston Briggs, source code implementing Robert Bernstein's algorithm, posted in the newsgroup comp.compilers, Oct. 20, 1992.
[6] A. Habibi and P. A. Wintz, "Fast Multipliers," IEEE Trans. on Computers, Vol. C-19, No. 2, pp. 153-157, Feb. 1970.
[7] Stamatis Vassiliadis, Eric M. Schwarz, and Baik M. Sung, "Hard-wired Multipliers with Encoded Partial Products," IEEE Trans. Comput., Vol. 40, No. 11, pp. 1181-1197, Nov. 1991.
[8] Intel, i486 Microprocessor Programmer's Reference Manual, 1990.