Eidgenossische ¨ Technische Hochschule Zurich ¨
Institut f¨ur Integrierte Systeme
Ecole polytechnique federale ´ ´ de Zurich Politecnico federale di Zurigo Swiss Federal Institute of Technology Zurich
Integrated Systems Laboratory
Lecture notes on
Computer Arithmetic: Principles, Architectures, and VLSI Design March 16, 1999
Reto Zimmermann Integrated Systems Laboratory Swiss Federal Institute of Technology (ETH) CH-8092 Z¨urich, Switzerland
[email protected]
Copyright c 1999 by Integrated Systems Laboratory, ETH Z¨urich http://www.iis.ee.ethz.ch/ zimmi/publications/comp arith notes.ps.gz
Contents
Contents
Contents 1 Introduction and Conventions
4
1.1 Outline
4
1.2 Motivation
4
1.3 Conventions
5
1.4 Recursive Function Evaluation
6
4.3 Carry-Propagate Adders (CPA)
26
4.4 Carry-Save Adder (CSA)
45
4.5 Multi-Operand Adders
46
4.6 Sequential Adders
52
5 Simple / Addition-Based Operations
53
5.1 Complement and Subtraction
53
5.2 Increment / Decrement
54
8
5.3 Counting
58
2.1 Overview
8
5.4 Comparison, Coding, Detection
60
2.2 Implementation Techniques
9
2 Arithmetic Operations
5.5 Shift, Extension, Saturation
64
10
5.6 Addition Flags
66
3.1 Binary Number Systems (BNS)
10
5.7 Arithmetic Logic Unit (ALU)
68
3.2 Gray Numbers
13
3.3 Redundant Number Systems
14
6.1 Multiplication Basics
69
3.4 Residue Number Systems (RNS)
16
6.2 Unsigned Array Multiplier
71
3.5 Floating-Point Numbers
18
6.3 Signed Array Multipliers
72
3.6 Logarithmic Number System
19
6.4 Booth Recoding
73
3.7 Antitetrational Number System
19
6.5 Wallace Tree Addition
75
3.8 Composite Arithmetic
20
6.6 Multiplier Implementations
75
3.9 Round-Off Schemes
21
6.7 Composition from Smaller Multipliers
76
6.8 Squaring
76
3 Number Representations
4 Addition
22
4.1 Overview
22
4.2 1-Bit Adders, (m, k)-Counters
23
Computer Arithmetic: Principles, Architectures, and VLSI Design
1
Contents
7.2 Restoring Division
78
7.3 Non-Restoring Division
78
7.4 Signed Division
79
7.5 SRT Division
80
7.6 High-Radix Division
81
7.7 Division by Multiplication
81
7.8 Remainder / Modulus
82
7.9 Divider Implementations
83
7.10 Square Root Extraction
84
8 Elementary Functions
85
8.1 Algorithms
85
8.2 Integer Exponentiation
86
8.3 Integer Logarithm
87
9 VLSI Design Aspects
88
9.1 Design Levels
88
9.2 Synthesis
90
9.3 VHDL
91
9.4 Performance
93
9.5 Testability
95
Bibliography
96
Computer Arithmetic: Principles, Architectures, and VLSI Design
3
6 Multiplication
7 Division / Square Root Extraction 7.1 Division Basics Computer Arithmetic: Principles, Architectures, and VLSI Design
69
77 77 2
1 Introduction and Conventions
1.2 Motivation
1 Introduction and Conventions
1 Introduction and Conventions
1.3 Conventions
1.1 Outline
Naming conventions
Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7] Circuit architectures and implementations of main arithmetic operations Aspects regarding VLSI design of arithmetic units 1.2 Motivation Arithmetic units are, among others, core of every data path and addressing unit
gate-equivalents (GE) model) : 0 0 (i.e. ignored) Inverter, buffer : Simple monotonic 2-input gates (AND, NAND, OR, 1 NOR) : 1
Unit-gate model (
microprocessors (CPU) signal processors (DSP) data-processing application specific ICs (ASIC) and programmable ICs (e.g. FPGA)
Simple non-monotonic 2-input gates (XOR, XNOR) : 2 2
Standard arithmetic units available from libraries
Complex gates : composed from simple gates
Simple -input gates : 1 log
Design of arithmetic units necessary for :
non-standard operations high-performance components library development
Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization) Only estimations given for complex circuits
Computer Arithmetic: Principles, Architectures, and VLSI Design
4
1.4 Recursive Function Evaluation
1.4 Recursive Function Evaluation Given : inputs
(1-D), (2-D), (subbus, 1-D) : Signals : , (1-D), (2-D), : (group signal) Circuit complexity measures : (area), (cycle time, delay), (area-time product), (latency, # cycles) ,
, , , log ( log2 ) Arithmetic operators : Logic operators : (or), (and), (xor), (xnor), (not) Signal buses :
Circuit complexity measures
Data path is core of :
1 Introduction and Conventions
1.3 Conventions
Computer Arithmetic: Principles, Architectures, and VLSI Design
1 Introduction and Conventions
2.
, outputs , function (graph sym. :
)
is a function of input (or : const.) ; 0 1 parallel structure : a a a a funn.epsi 119 "17 mm ! !1 1
is associative (r.s.a.) serial or single-tree structure : ! !log
a3 a2 a1 a0
"
1 funrsa.epsi 219 20 mm z
b) with multiple outputs
Output
2
1.4 Recursive Function Evaluation
(r.m.) (prefix problem) : &1 ; 0 1 &1 0 1
Non-recursive functions (n.)
3
5
1.
0
z3 z2 z1 z0
is non-associative (r.m.n.) serial structure : ! !
a3 a2 a1 a0 1 funrmn.epsi 219 25 mm 3
"
z3 z2 z1 z0
Recursive functions (r.)
a3 a2 a1 a0
is a function of all inputs #$ with single output %&1 (r.s.) : ' '&1 ; 0 1 '&1 0 1 '%&1
Output a)
1.
is non-associative (r.s.n.) serial structure : ! !
Computer Arithmetic: Principles, Architectures, and VLSI Design
2.
is associative (r.m.a.) serial or multi-tree structure : ! 2 !log
1 2 z3 funrma1.epsi 19 43 mm
"
z2 z1
or shared-tree structure : ! log !log
a3 a2 a1 a0 1 funrsn.epsi 219 24 mm 3
"
z0 a3 a2 a1 a0
"
1funrma2.epsi 219 21 mm z3 z2 z1 z0
z 6
Computer Arithmetic: Principles, Architectures, and VLSI Design
7
2 Arithmetic Operations
2.1 Overview
2 Arithmetic Operations
2.2 Implementation Techniques
2 Arithmetic Operations
2.2 Implementation Techniques
2.1 Overview
Direct implementation of dedicated units : fixed-point
based on operation
always : 1 – 5
floating-point
related operation
in most cases : 6
<< , >>
sometimes : 7, 8 =,<
+1 , −1
+,−
+/−
× arithops.epsi 98 83 mm
sometimes : 6
×
"
in most cases : 7, 8, 9 (same as on the left for floating-point numbers)
sqrt (x)
⁄
Sequential implementation using simpler units and several clock cycles ( decomposition) :
+,−
Table look-up techniques using ROMs : universal : simple application to all operations
exp (x)
trig (x)
complexity
log (x)
efficient only for single-operand operations of high complexity (8 – 12) and small word length (note: ROM ) size 2
hyp (x)
%
Approximation techniques using simpler units : 7–12 1 2 3 4 5 6
shift/extension comparison increment/decrement complement addition/subtraction multiplication
7 8 9 10 11 12
division square root extraction exponential function logarithm function trigonometric functions hyperbolic functions
Computer Arithmetic: Principles, Architectures, and VLSI Design
3 Number Representations
taylor series expansion polynomial and rational approximations convergence of recursive equation systems CORDIC (COordinate Rotation DIgital Computer) 8
3.1 Binary Number Systems (BNS)
3 Number Representations 3.1 Binary Number Systems (BNS) Radix-2, binary number system (BNS) : irredundant, weighted, positional, monotonic [1, 2]
%&%&
-bit number is ordered sequence of bits (binary digits) : 0 1 1 2 0 2 Simple and efficient implementation in digital circuits MSB/LSB (most-/least-significant bit) :
%&1 / 0
Represents an integer or fixed-point number, exact Fixed-point numbers :
& &&% 1 0 1
% &
-bit integer
-bit fraction
Unsigned : positive or natural numbers
%&2%&1 2 1 1 % Range : 0 2 1
Value :
1
2
0
3 Number Representations
9
3.1 Binary Number Systems (BNS)
% Complement : 2 1 , where %&1 %&2 0 Sign : %&1
Properties : asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)
% &2 % & 1 % &1 2 1 2 Value :
0 % % & & 1 1
1 2 1 Range : 2 % Complement : 2 1 Sign : %&1 Properties : double of zero, symmetric %representation range, modulo 2 1 number system
One’s (1’s) complement : similar to 2’s complement
Sign-magnitude : alternative representation of signed numbers
0
&2 1 % 0 2 %& %& Range : 2 1 1 2 1 1 Complement : %&1 %&2 0 Sign : %&1
Two’s (2’s) complement : standard representation of signed or integer numbers
Value :
% &2 % & 1 % Value :
&12 2 0 % % & & 1 1
1 Range :
2 2
Computer Arithmetic: Principles, Architectures, and VLSI Design
Computer Arithmetic: Principles, Architectures, and VLSI Design
10
1
Computer Arithmetic: Principles, Architectures, and VLSI Design
11
3 Number Representations
3.1 Binary Number Systems (BNS)
Properties : double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 ( low power)
Gray numbers (code) : binary, irredundant, non-weighted, non-monotonic + Property : unit-distance coding (i.e. exactly one bit toggles between adjacent numbers) Applications : counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
111...1
011...1 100...0
000...0
Graphical representation
binary number representation
−2
0
2
3.2 Gray Numbers
3.2 Gray Numbers
n−1
3 Number Representations
n−1
2
"
numrep.epsi 95 73 mm
– Non-monotonic numbers : difficult arithmetic operations, e.g. addition, comparison :
0 0 0 0 0 1 and 0 1 1 1 1 0 but 1 0 binary Gray : % 0 ;
n
1 0
unsigned 2’s complement 1’s complement
1 0
0 1 1
sign-magnitude
(n.)
binary : % 0 ;
Gray Conventions
1 0 1
2’s complement used for signed numbers in these notes Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned Computer Arithmetic: Principles, Architectures, and VLSI Design
3 Number Representations
12
3.3 Redundant Number Systems
3.3 Redundant Number Systems Non-binary, redundant, weighted number systems [1, 2] Digit set larger than radix (typically radix 2) multiple representations of same number redundancy + No carry-propagation in adders more efficient impl. of adder-based units (e.g. multipliers and dividers) – Redundancy no direct implementation of relational operators conversion to irredundant numbers – Several bits used to represent one digit higher storage requirements – Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)
%&1 2
0 1 digit holds sum of 2 bits (no carry-out digit) example : 00 10 00 10 01 01 10 00 irredundant representation of
1 [8], since 1 0 & 1 1
0
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1
0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1
0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0
0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0
0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
Computer Arithmetic: Principles, Architectures, and VLSI Design
3 Number Representations
13
3.3 Redundant Number Systems
1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved) standard redundant number system for fast addition Signed-digit (SD) or redundant digit (RD) number representation :
%&
'
'
1 0 1
no carry-propagation in
2
:
2 1 is redundant (e.g. 0 1
1
1 0
1 0 1 ,
, 1 1 01 1 0 1
1 0 1 11)
1 digit holds sum of 2 digits (no carry-out digit)
minimal SD representation : minimal number of non-zero digits, 011 1 10 100 0 10
Delayed-carry of half-adder number representation : 0 1 2 , 0 1 , , 0 1 2 1
1
applications : sequential multiplication (less cycles), filters with constant coefficients (less hardware)
example : minimal 7 0111 1111 1011 1001 11111
canonical SD repres.: minimal SD + not two non-zero digits in sequence, 01 1 10 10 0 10
SD binary : carry-propagation necessary ( adder)
%&
Carry-save number representation : 0 1 2 3 , 0 1 , 1 2 1
1 2 0 Computer Arithmetic: Principles, Architectures, and VLSI Design
(r.m.a.)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
3binary 2 1 0 3 Gray 2 1 0
other applications : high-speed multipliers [9] similar to carry-save, simple use for signed numbers 14
Computer Arithmetic: Principles, Architectures, and VLSI Design
15
3 Number Representations
3.4 Residue Number Systems (RNS)
Non-binary, irredundant, non-weighted number system [1]
– Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required
Possible applications (but hardly used) :
digital filters : fast additions and multiplications error detection and correction for arithmetic operations in conventional and residue number systems
%&%& %&1 %&2 0 , 0 1 1 % Range: &1 , anywhere in ZZ mod 0 , %&1 0 , 0 1 0
Base is -tuple of integers 1 2 0 , residues (or moduli) pairwise relatively prime
0
possible range
5 5 5 2 1 1 0 4 5 6 1 0 2 1 3 2 6
1 2 3 0 1 2 0 1 3 6 4 5 1 0 2 1 6
1 2 0 1 2 0 2 3
Computer Arithmetic: Principles, Architectures, and VLSI Design
3 Number Representations
Codes for error detection and correction [1]
2
&1 &2 (Fermat’s theorem) are 2and 2 1: Best moduli high storage efficiency with #bits simple modular addition : 2: #-bit adder without , 2 1 : #-bit adder with end-around carry ( % ) 3 2, 6 Example : 1 0 4 3 2 1 0 1 2 3 4 5 6 7 8 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 0 1 0 1 0 1 01 0 1 0 1 0
+ Carry-free and fast additions and multiplications
3.4 Residue Number Systems (RNS)
Arithmetic operations : (each digit computed separately)
3.4 Residue Number Systems (RNS)
1
3 Number Representations
16
3.5 Floating-Point Numbers
3.5 Floating-Point Numbers
2
6
Computer Arithmetic: Principles, Architectures, and VLSI Design
3 Number Representations
17
3.7 Antitetrational Number System
3.6 Logarithmic Number System
Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]
Alternative representation to floating-point (i.e. mantissa + integer exponent only fixed-point exponent) [1]
Double-number form
Single-number form continuous precision accuracy, more reliable
discontinuous precision
1 1 1 2 & Basic arithmetic operations : 1 1 1 base on fixed-point add, multiply, and shift operations postnormalization required (1 $ 1) S biased exponent E unsigned norm. mantissa M
1 1 2 &
S biased fixed-point exponent E
precision single double
32 64
23 52
bias
(signed-logarithmic)
(additionally consider sign) : by approximation or addition in conventional number system and double conversion 1 1 1
+ Simpler multiplication/exponent., more complex addition – Expensive conversion : (anti)logarithms (table look-up) Applications : real-time digital filters 3.7 Antitetrational Number System
22) and antitetration (a.t. ) [10] " Larger range, smaller precision than logarithmic repres., otherwise analogous (i.e. 2 t. log a.t. )
IEEE floating-point format :
Basic arithmetic operations :
Applications : processors : “real” floating-point formats (e.g. IEEE standard), large range due to universal use ASICs : usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers
higher
2
range
Tetration (t.
precision 38
8 127 3 8 10 11 1023 9 10307
Computer Arithmetic: Principles, Architectures, and VLSI Design
&7 &15
!
10 10
!
18
Computer Arithmetic: Principles, Architectures, and VLSI Design
19
3 Number Representations
3.8 Composite Arithmetic
3.8 Composite Arithmetic
3 Number Representations
3.9 Round-Off Schemes
3.9 Round-Off Schemes
Proposal for a new standard of number representations [10] Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers
lower bits %& additional 1 0 & 1 & Rounding : keeping error final word small during length reduction : %& Intermediate results with ( higher accuracy) :
1
Secondary forms used for numbers not representable by primary ones ( no over-/underflow handling necessary)
Trade-off : numerical accuracy vs. implementation cost Truncation :
1 1
Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy
2
Number representations :
0
2
1
%& 1
0
(= average error )
Round-to-nearest (i.e. normal rounding) :
integer :
tag 00
value 2’s complement integer
rational :
01
slash
logarithmic :
10
log integer
log fraction
antitetrational :
11
a.t. integer
a.t. fraction
%& 1 0 1 2 1 0 2
1 (nearly symmetric)
denominator numerator
if &1 & 0 0 & %&1 1 0
0
Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
Hardware proposal : long accumulator (4096 bits) holds any floating-point number in fixed-point format higher accurary large hardware/software overhead Computer Arithmetic: Principles, Architectures, and VLSI Design
otherwise
(symmetric) mandatory in IEEE floating-point standard
Implementation : mixed hardware/software solutions
20
4 Addition
1
Round-to-nearest-even/-odd :
Rational numbers : slash position (i.e. size of numerator/ denominator) is variable and stored (floating slash)
0 12” can often be included in previous operation 2
“
4.1 Overview
4 Addition
3 guard bits for rounding after floating-point operations : guard bit (postnormalization), round bit (round-to-nearest), sticky bit (round-to-nearest-even) Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
21
4.2 1-Bit Adders, (m, k)-Counters
4.2 1-Bit Adders, (m, k)-Counters
bits of same magnitude (i.e. 1-bit numbers) Output sum as #-bit number ( # log 1) or : count 1’s at inputs (m, k)-counter [3]
4.1 Overview
Add up HA
1-bit adders
FA
(m,k)
(m,2)
(combinational counters) RCA
CSKA
CSLA
CIA
CLA
PPA
COSA
Half-adder (HA), (2, 2)-counter
carry-propagate adders
2
3 2 1
CPA
(sum) (carry-out)
CSA
3-operand
"
adders.epsi 103 121 mm
carry-save adders
multi-operand
adder array
a b
adder tree
a b a b
multi-operand adders
array adder
"
tree adder
hasym.epsi 18 23HA mm c out
s Legend: HA: FA: (m,k): (m,2):
half-adder full-adder (m,k)-counter (m,2)-compressor
based on component
CPA: carry-propagate adder RCA: ripple-carry adder CSKA:carry-skip adder CSLA: carry-select adder CIA: carry-increment adder
"
chaschema1.epsi out 19 28 mm
CLA: carry-lookahead adder PPA: parallel-prefix adder COSA:conditional-sum adder
"
haschema2.epsi 21 43 mm c out
s
(reference) s
CSA: carry-save adder
related component
Computer Arithmetic: Principles, Architectures, and VLSI Design
22
Computer Arithmetic: Principles, Architectures, and VLSI Design
23
4 Addition
4.2 1-Bit Adders, (m, k)-Counters
Full-adder (FA), (3, 2)-counter
2
% 7 4 2
4 Addition
(m, k)-counters
& 1 &1 0 &1
0 2 0
(generate)
(propagate) 1
% % % % %
% % %
% 0 % 1 0
out
"
"
in
"
s
FA
FA
count73par.epsi FA 36 48 mm
FA
0
p faschematic4.epsi c out c in 29 1 41 mm
"
c in
"
c out
"
faschematic5.epsi 0 c0 35 47 mm 1 c1
FA
s2
s
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
24
4.3 Carry-Propagate Adders (CPA)
4.3 Carry-Propagate Adders (CPA)
-bit operands and and an optional carry-in
Add two by performing carry-propagation [1, 2, 11]
is irredundant 1-bit number 2% %
s0
s0
linear structure
25
4.3 Carry-Propagate Adders (CPA)
Carry-propagation speed-up techniques
%
a) Concatenation of partial CPAs with fast a n-1:j b n-1:j
a i-1:k b i-1:k
a k-1:0 b k-1:0
...
CPA
c out
"
speedup1.epsi CPA c i84 26 mm
cj
CPA
ck
c in
...
B
A
s i-1:k
s n-1:j
"
cpasymbol.epsi CPA c out 29 26 mm c in
s1
s1
tree structure
4 Addition
1 ; 0 1 1 0 % % (r.m.a.)
s2
Computer Arithmetic: Principles, Architectures, and VLSI Design
2
FA
FA
s
s
"
count73ser.epsi 42 59 mm
c in
Sum
a3a4 a5a6
FA
a b
faschematic1.epsi g p 29 43 mm
%
a0a1 a2
a b
a b
(reference)
28 10
a0a1 a2a3a4a5a6
s
"
s k-1 s 0
28 14
faschematic2.epsi c in 32 35 mm
c out
s
c out
"...
cntsymbol.epsi 23 mm 18 (m,k)
Example : (7, 3)-counter
g HA faschematic3.epsi p c out 29 32 mm c in HA
a m-1
...
7 log2&7 log 1 4 2 log
4 log3 2 log
a b
fasymbol.epsi FA c18 21 mm c
a0
Usually built from full-adders Associativity of addition allows convertion from linear to tree structure faster at same number of FAs
a b
a b
4.2 1-Bit Adders, (m, k)-Counters
s k-1:0
a) Fast carry look-ahead logic for entire range of bits
S
a n-1
b n-1
a1
b1
a0
b0
Ripple-carry adder (RCA) Serial arrangement of
Simplest, smallest, and slowest CPA structure
7 2 14 a n-1
b n-1
a1
b1
a0
FA
...
s n-1
"
rca.epsi FA c n-1 57c 2 23 mm
s1
c1
carry propagation
c in
2
postprocessing
...
b0 FA
"
speedup2.epsi 104 50 mm
c out
s n-1
...
c out
preprocessing
...
full-adders
s1
s0
c in
s0
Computer Arithmetic: Principles, Architectures, and VLSI Design
26
Computer Arithmetic: Principles, Architectures, and VLSI Design
27
4 Addition
4.3 Carry-Propagate Adders (CPA)
Carry-skip adder (CSKA)
4.3 Carry-Propagate Adders (CPA)
Carry-select adder (CSLA)
and &1:
&1: 0&1: 1&1: 0 1 Two CPAs compute two possible results ( % 0 1), group carry-in selects correct one afterwards Type a) : partial CPA with fast
Type a) : partial CPA with fast
&1: &1: (bit group &1 )
&1: &1 &2 (group propagate) ) 1) &1: 0 : and selected ( 2) &1: 1 : but skipped ( ) path never sensitized fast false path inherent logic redundancy problems in circuit optimization, timing analysis, and testing Variable group sizes (faster) : larger groups in the middle 1) (minimize delays 0 1 and
&
%& Partial CPA typ. is RCA or CSKA ( multilevel CSKA)
4 Addition
Variable group sizes (faster) : larger groups at end (MSB) 0) (balance delays 0 and
Part. CPA typ. is RCA, CSLA ( multil. CSLA), or CLA
High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
14 2 8
Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
8 4
1 2
a n-1:j b n-1:j
32
a k-1:0 b k-1:0
0
c out 0
c out
CPA
cj
ci
CPA
"
cska.epsi 99 36 mm 1
ci
CPA
ck
39
3 2
a k-1:0 b k-1:0
...
...
c’i
1 2
a i-1:k b i-1:k
3 2
a i-1:k b i-1:k
1
c i0
c i1
...
c in
0
CPA
"
csla.epsi mm 102 50CPA 0 s i-1:k 0
1
CPA
ck
c in
1 s i-1:k
1
ck
...
P i-1:k s i-1:k
s n-1:j
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
28
4.3 Carry-Propagate Adders (CPA)
Carry-increment adder (CIA)
and &1:
&1: &1: &1: &1: &1 &2 (group propagate) Result is incremented after addition, if 1 [12, 11] Type a) : partial CPA with fast
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
29
4.3 Carry-Propagate Adders (CPA)
Example : gate-level schematic of carry-incr. adder (CIA) only 2 different logic cells (bit-slices) : IHA and IFA
max
4 6 10 12 14 16 18 20 22 24 26 28 ... 38 2 3 4 5 6 7 8 9 10 11 ... 16 1 2 4 7 11 16 22 29 37 46 56 67 ... 137
group
a i-1 IFA
Variable group sizes (faster) : larger groups at end (MSB) ) (balance delays 0 and
b i-1
a i-2
b i-2
IFA
a k+1
b k+1
IFA
Part. CPA typ. is RCA, CIA ( multilevel CIA) or CLA
s k-1:0
s i-1:k
s k-1:0
ak
bk
IHA
...
...
High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group) ...
Logic of CPA and incrementer can be merged [11]
10 2 8
1 2
28
a i-1:k b i-1:k ...
c out
c’i ci
...
CPA 86
P i-1:k
"
cia.epsi s’i-1:k 43 mm
3 2
ci
"
ciagate.epsi 100 s i-2 112 mm
s i-1 (i-k-1)IFA + IHA
a k-1:0 b k-1:0
2IFA + IHA
s k+1 IFA + IHA
sk IHA
ck
IHA
0 ck
CPA
c in
...
bits i-1...k
...
bits 6...4
bits 3,2
bit 1
bit 0
+1
s i-1:k
s k-1:0 c out
Computer Arithmetic: Principles, Architectures, and VLSI Design
30
Computer Arithmetic: Principles, Architectures, and VLSI Design
c in
31
4 Addition
4.3 Carry-Propagate Adders (CPA)
Conditional-sum adder (COSA)
Type b) : carries looked ahead before sum bits computed
(i.e. double CPAs are merged at higher levels)
Typically 4-bit blocks used (e.g. standard IC SN74181)
Correct sum bits ( 0&1: or 1&1:) are (conditionally)
0 0 1 0 0 0 2 1 1 0 1 0 0 3 2 2 1 2 1 0 2 3 3 3 2 3 2 1 3
levels of multiplexers
Bit groups of size 2 at level
Higher parallelism, more balanced signal paths
Highest speed-up at highest hardware overhead (2 RCA + more than log MUX/bit)
3
2 log 6
log
4.3 Carry-Propagate Adders (CPA)
Carry-lookahead adder (CLA), traditional
Type a) : optimized multilevel CSLA with log levels selected through log
4 Addition
(g3,p3)
"
2 1 0
. . . c0 (g′,p′) 3 3 c3
(g0,p0)
clbsymbol.epsi 27 CLB 26 mm c′ 0
1 0 0
3 2 1 0
3
...
Hierarchical using 12 log levels : passedarrangement up, 0 passed down between levels 3 3
log2
High speed-up at medium hardware overhead
FA
level 1
...
level 2
FA
...
0
0
1
0
1
0
FA
1
FA
0
a0
FA
0
1
14 4 log 56
b0
0
1
FA
c in
(g15,p15) ... (g12,p12) (g11,p11) ... (g8,p8) c′12
CLB
1
1
c 15 ... c 12
...
CLB
c out
s3
s2
s1
32
4.3 Carry-Propagate Adders (CPA)
Type b) : universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA) Preprocessing, carry-lookahead, and postprocessing step Carries calculated using parallel-prefix algorithms + High regularity : suitable for synthesis and layout
a1 b1 a0 b0
a n-1 b n-1 a n-2 b n-2
...
c 3 ... c 0
33
c1
...
p0
1:0
2
2:0
3
1
1
3:2
3
1
1:0
2
3:0
at level Carry-propagation is prefix problem : : : : 0 0 : : & 1 1 & 1 & &1 : 1 : 1 : : : : ; #$ $ 1 :& & 1 &1 & 1 &1
1 : 1 : : 1 : :0 ; 0 1 1 1 3:0
! ! ! ! !
Parallel-prefix algorithms [11] : multi-tree structures ( log ) 2 sharing subtrees ( log ) different algorithms trading area vs. delay (influences ) also from wiring and maximum fan-out
c0
postprocessing: s0
1
Group variables : : covers bits
Computer Arithmetic: Principles, Architectures, and VLSI Design
%& %& , associative 1 0 %& %& or 1 0 1 0 1 0 0
& ;
1
1 (r.m.a.) 0 0 1 tree structures for evaluation : Associativity of 3 2 1 0 3 2 1 0 , but 2 ? Inputs 1 0 , outputs binary operator [11, 13]
3
carry-lookahead: prefix algorithm
s1
c′0
4.3 Carry-Propagate Adders (CPA)
2
c in
"
s n-2
4 Addition
preprocessing:
add.epsi///figures 73 64 mm
s n-1
CLB
Computer Arithmetic: Principles, Architectures, and VLSI Design
(g0 , p0 )
...
"
+ preprocessing : + postprocessing :
1
5 3 4 2
c out
c′4
CLB
c in
+ High performance : smallest and fastest adders
c n p n-1
c′8
(g3,p3) ... (g0,p0)
c 11 ... c 8 cla.epsi c 7 ... c 4 97 48 mm
+ High flexibility : special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)
(gn-1 , p n-1 )
(g7,p7) ... (g4,p4)
Prefix problem
Parallel-prefix adders (PPA)
...
CLB
s0
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
log
0
FA
"
1
b1
1
cosa.epsi 100 57 mm
1
0
a1
(g′,p′) 3 3
0
b2
(g′,p′) 7 7
a2
(g′11,p′11)
...
b3
(g′15 ,p′15 )
level 0
a3
34
Computer Arithmetic: Principles, Architectures, and VLSI Design
35
4 Addition
4.3 Carry-Propagate Adders (CPA)
Prefix algorithms
4 Addition
4.3 Carry-Propagate Adders (CPA)
Sklansky parallel-prefix algorithm (
Algorithms visualized by directed acyclic graphs (DAG) with array structure ( bits levels)
Tree-like collection, parallel redistribution of carries 1 log log ! 1 2 2
Graph vertex symbols :
& &1 &1 &1 &1 1 :& 1 1 : 1 : : : : : : : : : : : : (contains logic for ) (contains no logic)
: graph depth (number of black nodes on critical path) Serial-prefix algorithm ( RCA) 1 1 ! 2 Performance measures : : graph size (number of black nodes)
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
"
...
ser.epsi///figures 69 38 mm
14 15
Computer Arithmetic: Principles, Architectures, and VLSI Design
36
4.3 Carry-Propagate Adders (CPA)
Kogge-Stone parallel-prefix algorithm ( PPA-KS) very high wiring requirements log 1 log ! 2
"
bk.epsi///figures 67 38 mm
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
37
4.3 Carry-Propagate Adders (CPA)
Mixed serial/parallel-prefix algorithm (
RCA + PPA)
linear size-depth trade-off using parameter #: 0 $#$ 2 log 2
# 0 : serial-prefix graph # 2 log 1 : Brent-Kung parallel-prefix graph fills gap between RCA and PPA-BK (i.e. CLA) in steps
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2
"
ks.epsi///figures 67 52 mm
3
PPA-BK)
Traditional CLA is PPA-BK with 4-bit groups Tree-like redistribution of carries (fan-out tree) 2 log 2 2 log 2 ! log 0 1 2 3 4 5 6
4 Addition
"
sk.epsi///figures 67 30 mm
Brent-Kung parallel-prefix algorithm (
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3
PPA-SK)
of single -operations
1 # 1 # ! var.
4 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
CIA) 2 1 4 1 2 1 4 1 2 ! 1 4 1 2
0 1 2 3 4 5 6 7 8 9 10
Carry-increment parallel-prefix algorithm (
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5
"
cia.epsi///figures 67 34 mm
Computer Arithmetic: Principles, Architectures, and VLSI Design
38
"
var.epsi///figures 68 54 mm
Computer Arithmetic: Principles, Architectures, and VLSI Design
39
4 Addition
4.3 Carry-Propagate Adders (CPA)
Example : 4-bit parallel-prefix adder (PPA-SK) efficient AND-OR-prefix circuit for the generate and AND-prefix circuit for the propagate signals optimization: alternatingly AOI-/OAI- resp. NAND-/ NOR-gates (inverting gates are smaller and faster) can also be realized using two MUX-prefix circuits
Local prefix graph transformation :
3 2 1 0
a2
b3
b2
a1
b1
a0
4.3 Carry-Propagate Adders (CPA)
Prefix adder synthesis
a3
4 Addition
3 3
b0
c in
0 unfact.epsi 1 20 26 mm 2 3
"
3 2 1 0
depth-decr. transform
0 fact.epsi 1 20 26 mm 2 3
"
size-decr. transform
4 2
Repeated (local) prefix transformations result in overall minimization of graph depth or size which sequence ? Goal: minimal size (area) at given depth (delay) Simple algorithm for sequence of applied transforms : Step 1 : prefix graph compression (depth minimization) : depth-decr. transforms in right-to-left bottom-up order Step 2 : prefix graph expansion (size minimization) : size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded
"
askgate.epsi///figures 100 103 mm
Prefix adder synthesis : 1) generate serial-prefix graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prefix logic + Generates all previous prefix graphs (except PPA-KS) + Universal adder synthesis algorithm : generates area-optimal adders for any given timing constraints [11] (including non-uniform signal arrival times)
c out P n-1:0
s3
s2
s1
s0
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
40
4.3 Carry-Propagate Adders (CPA)
Multilevel adders
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
41
4.3 Carry-Propagate Adders (CPA)
Self-timed adders
Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L)
! 1 1for levels + Delay is
– Area increase small for CSKA and CIA, high for CSLA ( COSA)
Average carry-propagation length : log
!
+ RCA is fast in average case ( ˜ log ), slow in worst case suitable for self-timed asynchronous designs [15]
– Completion detection is not trivial
Difficult computation of optimal group sizes
Adder performance comparisons
process
Standard-cell implementations, 0 8
Hybrid adders
Arbitrary combinations of speed-up techniques possible hybrid/mixed adder architectures Often used combinations : CLA and CSLA [14]
area [lambda^2] RCA 128-bit
1e+07
CSKA-2L CIA-1L
– Pure architectures usually perform best (at gate-level)
CIA-2L
64-bit
5
PPA-SK
Transistor-level adders
PPA-BK 32-bit
Influence of logic styles (e.g. dynamic logic, pass-transistor logic faster)
2
"
addperf.ps 84 84 mm
CLA COSA const. AT
16-bit
+ Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [14]
1e+06 8-bit 5
+ Combinations of speed-up techniques make sense – Much higher design effort
2
Many efficient implementations exist and published Computer Arithmetic: Principles, Architectures, and VLSI Design
delay [ns] 5
42
10
20
Computer Arithmetic: Principles, Architectures, and VLSI Design
43
4.3 Carry-Propagate Adders (CPA)
Complexity comparison under the unit-gate model
2 log 4 log 2 log 4 log 2 log
39 28 36 44 3 40 6 56 6
aat 3 — — att att —
4 3 2 3 2 4 3 5 4
log2 log log2 log log2
0
1 0 1 2 ; 0 1 1 (n.)
( )
– Result is in redundant carry-save format ( digits), represented by two -bit numbers (sum bits) and (carry bits)
a 0,n-1 a 1,n-1
a 2,n-1 . . . 67
FA cn
"
csa.epsi 27FA mm
c2
s n-1
FA c1
s1
s0
Multi-operand carry-save adders ( 3) adder array (linear arrangement), adder tree (tree arr.)
44
4.5 Multi-Operand Adders
4.5 Multi-Operand Adders
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
45
4.5 Multi-Operand Adders
a) 4-operand CPA (RCA) array :
a 0,n-1 a 1,n-1
Add three or more ( 2) -bit operands, yield log -bit result in irredundant number rep. [1, 2]
FA a 2,n-1
Realization by array adders : (see figures on next page) a) linear arrangement of CPAs b) linear arr. of CSAs (adder array) and final CPA a) and b) differ in bit arrival times at final CPA : if CPA = RCA : a) and b) have same overall delay if fast final CPA : uniform bit arrival times required CSA array (b) Fast implementation : CSA array + fast final CPA (note: array of fast CPAs not efficient/necessary)
FA
FA
sn
s n-1
HA
a 2,2
a 2,1
a 2,0
"
...
a 3,2
CPA
CPA
HA
a 3,1
a 3,0
FA
FA
HA
s2
s1
s0
CPA
...
FA
a 0,2 a 1,2
A m-1
a 2,n-1
b) 4-operand CSA array with final CPA (RCA) :
...
CSA
FA
cparray.epsi 93 57 mm FA FA
a 3,n-1
a 0,n-1 a 1,n-1
A3
...
FA
A0 A1 A2
FA
...
Array adders
...
a 3,n-1
"
mopadd.epsi CSA 30 58 mm
FA
FA a 3,2
...
"
csarray.epsi 99FA 57 mm
FA a 3,1 FA
FA
CSA
a 3,0 HA
CSA
...
2 2 ! CPA = RCA : ! ! log Fast CPA : ! log
full-adders, constant delay
7 4
Computer Arithmetic: Principles, Architectures, and VLSI Design
S
+ Parallel arrangement of
optimality regarding area and delay aaa : smallest area, longest delay aat : small area, medium delay att : medium area, short delay ttt : large area, shortest delay — : not optimal 2 obtained from prefix adder synthesis 3 automatic logic optimization not possible (redundancy) 4 exact factors not calculated 5 corresponds to 4-bit PPA-BK
C
%
1
4 Addition
"
csasymbol.epsi 21 CSA 26 mm
b) Adds one -bit operand to an -digit carry-save operand
ttt att — — —
A0 A1 A2
2
2
1
a 2,0
log 10 3 log 14 3 log
4 3
2
a 2,0
3 2
2 2 3 4 4
32 3
a 0,0 a 1,0
PPA-SK PPA-BK PPA-KS CLA 5 COSA
1 2 1 3 4 8 1 2 8 1 2 6 1 3 1 4
4
a 0,0 a 1,0
8 8 14 10 10 10
aaa
a 0,0 a 1,0
CSKA-1L CSKA-2L CSLA-1L CIA-1L CIA-2L CIA-3L
2
14
a 2,1
2
a 2,1
7
a) Adds three -bit operands 0 , 1 , 2 performing no carry-propagation (i.e. carries are saved) [1]
a 0,1 a 1,1
RCA
opt.1 syn.2
AT
a 0,1 a 1,1
T
4.4 Carry-Save Adder (CSA)
a 0,1 a 1,1
A
4.4 Carry-Save Adder (CSA)
a 2,2
adder
4 Addition
a 0,2 a 1,2
4 Addition
CPA
FA
FA
sn
s n-1
FA
HA
s2
s1
CPA
...
s0
S Computer Arithmetic: Principles, Architectures, and VLSI Design
46
Computer Arithmetic: Principles, Architectures, and VLSI Design
47
4 Addition
4.5 Multi-Operand Adders
4 Addition
(m, 2)-compressors
a0
7 2 4 2 6 log 1
a m-1
... m-4 c out
"
cprsymbol.epsi 37 (m,2) 26 mm
c
...
0 c out
...
&4
2
10 & &
4 % 0 0
4.5 Multi-Operand Adders
c in0 c inm-4
Optimized (4, 2)-compressor :
2 full-adders merged and optimized (i.e. XORs
s
1-bit adders (similar to (m, k)-counters) [16]
Compresses bits down to 2 by forwarding 3
arranged in tree structure)
14 8
intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array (see prev. page)
% #)
+ No horizontal carry-propagation (i.e.
14 6
Built from full-adders (= (3, 2)-compressor) or (4, 2)-compressors arranged in linear or tree structures
FA
"
cpr42fa.epsi 32 38 mm
c out
0
c in
FA
"
cpr42opt.epsi 1 41 53 mm
c in
c out
"
a 0,0 a 1,0 a 2,0 a 3,0
(4,2) cpradd.epsi 99 44 mm
a 0,1 a 1,1 a 2,1 a 3,1
a 0,2 a 1,2 a 2,2 a 3,2
a 0,n-1 a 1,n-1 a 2,n-1 a 3,n-1
0
(4,2)
(4,2)
(4,2)
a2 a3
a0 a1 a2 a3
Example : 4-operand adder using (4, 2)-compressors
a0 a1
c
1
s
with full-adders
c
s
optimized
CSA
+ same area, 25% shorter delay
FA
FA
FA
HA
sn
s n-1
s2
s1
CPA
SD-FA (signed-digit full-adder) is similar to (4, 2)-compressor regarding structure and complexity
s n+1
s0
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
48
4.5 Multi-Operand Adders
Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors :
Computer Arithmetic: Principles, Architectures, and VLSI Design
4 Addition
49
4.5 Multi-Operand Adders
Tree adders (Wallace tree)
higher compression rate (4:2 instead of 3:2) less deep and more regular trees
Adder tree : -bit -operand carry-save adder composed of tree-structured (m, 2)-compressors [1, 17]
tree depth
Tree adders : fastest multi-operand adders using an adder tree and a fast final CPA
# operands
012 3 4 5 6 FA (4,2)
42 16
0 c out
FA
a0a1a2a3 0 c out
c in0 c in1
1 c out
FA
2 c out
"
FA
cpr82fa.epsi 47 65 mm
3 c out
FA
4 c out
FA c
42 12
a4a5 a6a7
FA
! log 2 2 !log log
2 3 4 6 9 13 19 28 42 63 94 2 4 8 16 32 64 128
Example : (8, 2)-compressor
a0a1 a2a3
7 8 9 10
c in2
a4a5a6a7
(4,2)
(4,2)
3 c out
c in0
"
cpr82cpr42.epsi 47 50 mm
c in3 4 c out
c in4
(4,2)
c
Some FA can often be replaced by HA or eliminated (i.e. redundant due to constant inputs)
c in2
Number of (irredundant) FA does not depend on adder structure, but number of HA does
c in3
An
c in1
1 c out 2 c out
Adder arrays and adder trees revisited
c in4
s
-operand adder accomodates 1 carry inputs ! Adder! trees ( log ) are faster than adder ! arrays ( ) at same amount of gates ( )
Adder trees are less regular and have more complex routing than adder arrays larger area, difficult layout (i.e. limited use in layout generators)
(4, 2)-compressor tree
s
full-adder tree Computer Arithmetic: Principles, Architectures, and VLSI Design
50
Computer Arithmetic: Principles, Architectures, and VLSI Design
51
4 Addition
4.6 Sequential Adders
5 Simple / Addition-Based Operations
5.1 Complement and Subtraction
4.6 Sequential Adders
5 Simple / Addition-Based Operations
Bit-serial adder : Sequential -bit adder
5.1 Complement and Subtraction
ai bi
2’s complementer (negation)
1
"
bitseradd.epsi FA 25 27 mm
Accumulators : Sequential With CPA
2’s complement subtractor
1
"
S
2’s complement adder/subtractor
1
CSA
"
accucsa.epsi 33 52 mm
52
5.2 Increment / Decrement
5.2 Increment / Decrement
sub
mod 2% 1
B
A
c out
"
addmod.epsi 29 CPA 28 mm
c in
(end-around carry) S Computer Arithmetic: Principles, Architectures, and VLSI Design
5 Simple / Addition-Based Operations
53
5.2 Increment / Decrement
AND-prefix struct. : : 1 : 2 log 2 1 log2
Prefix problem :
1
Incrementer
Adds a single bit %to an -bit operand 2% % A incsymbol.epsi +1 29 " 26 mm c 1 ; 0 1 0 % % (r.m.a.) Z Corresponds to addition with 0 ( FA HA)
2
out
log
2
%
Decrementer a n-1
a2
a1
"
z n-1
2
z n-1
z2
z1
z0
% 1 %
Incrementer-decrementer
HA
z1
c in
...
a0
"
a0
dec.epsi 93 41 mm
c out
...
incfa.epsi 59c 23HA mm c 2 1
a1
...
Example : Ripple-carry incrementer using half-adders
3 1 3
c in
...
CPA
c out
1’s complement adder
S
5 Simple / Addition-Based Operations
c n-1
"
addsub.epsi 36 35 mm
S CPA
Computer Arithmetic: Principles, Architectures, and VLSI Design
HA
1
A B
Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size
c out
c out
A
4
a n-1
"CPA
sub.epsi 29 32 mm
S
cycles for evaluation
B
A
accucpa.epsi CPA 27 28 mm
Allows higher clock rates Final CPA too slow : pipelining or multiple
+1
Z
A
With CSA and final CPA
"
neg.epsi 21 32 mm1
si
-operand adders
A
c in
a2
a n-1
z0
a1
a0
or using incrementer slices (= half-adder) a n-1
a2
a1
dec
a0
...
...
c out
"
inc.epsi 83 33 mm ...
"
incdec.epsi 94 46 mm
c out
c in
c in
...
HA z n-1
z2
z1
Computer Arithmetic: Principles, Architectures, and VLSI Design
z0
z n-1 54
z2
z1
Computer Arithmetic: Principles, Architectures, and VLSI Design
z0 55
5 Simple / Addition-Based Operations
5.2 Increment / Decrement
Fast incrementers
5 Simple / Addition-Based Operations
5.2 Increment / Decrement
Gray incrementer
4-bit incrementer using multi-input gates : a2
a3
a1
Increments in Gray number system
0 %&1 %&2 0 (parity) 1 ; 0 3 (r.m.a.) 0 0 0 &1 &1 ; 1 2 %&1 %&1 %&2
a0
c in
"
inccg.epsi 62 39 mm
c out z3
z2
z1
z0
Prefix problem
AND-prefix structure
8-bit parallel-prefix incrementer (Sklansky AND-prefix structure) : a7
a6
a5
a4
a3
a2
a1
a0 c in
"
incpp.epsi 98 63 mm
c out
z7
z6
z5
z4
z3
z2
z1
z0
Computer Arithmetic: Principles, Architectures, and VLSI Design
56
5 Simple / Addition-Based Operations
5.3 Counting
5.3 Counting
Count clock cycles counter, divide clock frequency frequency divider ( )
Computer Arithmetic: Principles, Architectures, and VLSI Design
5 Simple / Addition-Based Operations
57
5.3 Counting
! Fast divider ( 1 ) using delayed-carry numbers (irredundant carry-save represention of
1 allows using fast carry-save incrementer) [8]
Gray counter
Binary counter Sequential in-/decrementer Incrementer speed-up techniques applicable Down- and up-down-counters using decrementers / incrementer-decrementers
Counter using Gray incrementer +1 cntblock.epsi 32 33 mm
c out
"
c in
Ring counters Shift register connected to ring :
clk Q
Example : Ripple-carry up-counter using counter slices (= HA + FF), is count enable
%
"
cntring.epsi 51 16 mm
q n-1
q2
q1
q0
FF for counting states Must be initialized correctly (e.g. 00 01) State is not encoded
c out
c in
"
Applications: fast dividers (no logic between FF) state counter for one-hot coded FSMs
cntripple.epsi 87 36 mm
...
q2
q n-1
q1
q0
Johnson / twisted-ring counter (inverted feed-back) :
Asynchronous counter using toggle-flip-flops (lower toggle rate lower power) T
T
...
T
clk
"
cntasync.epsi 64 18 mm
q n-1
q2
q1
q0
Computer Arithmetic: Principles, Architectures, and VLSI Design
"
cntjohnson.epsi 59 16 mm
T
58
q n-1
q2
q1
q0
FF for counting 2 states
Computer Arithmetic: Principles, Architectures, and VLSI Design
59
5 Simple / Addition-Based Operations
5.4 Comparison, Coding, Detection
5.4 Comparison, Coding, Detection
$
Subtractor ( :
%&1:0
(equal) (not equal) (greater or equal) (less than) (greater than) (less or equal)
(for free in PPA)
" CPA
1
EQ = P n-1:0
7 2 or & 3 log & 2 log 2
a n-1 b n-1
EQ
a0 b0
"
cmpeq.epsi 40 36 mm
removing redundancies in subtractor (unused ) single-tree structure speed-up at no cost : 6 2
2 log example : ripple comparator using comparator slices a1 b1
a0 b0
a1 b1
a2 b2
a n-1 b n-1 ...
GE = c out
Optimized comparator :
1
;
1 0 1
0% (r.s.a.)
B
cmpsub.epsi 37 31 mm
Equality comparison
A
a2 b2
5.4 Comparison, Coding, Detection
Comparators
Comparison operations
5 Simple / Addition-Based Operations
Magnitude comparison
% 1
0
1
;
0 (r.s.a.)
"
cmpripple.epsi 100 47 mm
1
equality
EQ
60
5.4 Comparison, Coding, Detection
%& to vector & ( 2%) 1:0 1 if 1:0 0 else ; 0 1
2
Decoder Decodes binary number
a2
A
a1
"
Z
a0
"
decoder.epsi 58 28 mm
decodersym.epsi 21decoder 26 mm
z6
z5
z4
z3
z2
z1
z0
"
a7a5a3a1 a6a4a2a0
"
encoder.epsi 30 34 mm
Detection operations
%&1 %&2 0 All-ones detection : %&1 %&2 0 (r.s.a.) log All-zeroes detection :
0 1 01
0 1 0
a n-1
a n-2
000100)
...
a1
" 2 z z z prefix problem (r.m.a.) AND-prefix structure
(e.g. 000101
a0
lzdnenc.epsi 50 28 mm ...
n-1
n-2
1
z0
b) encoded output : + encoder
z1
signed numbers : + leading-ones detector (LOZ)
Z
2%&1 1 1
5.4 Comparison, Coding, Detection
a) non-encoded output :
& %& % # #
encodersym.epsi 21encoder 26 mm
5 Simple / Addition-Based Operations
61
for scaling, normalization, priority encoding
Encoder Encodes vector 1:0 to binary number 2 ) 1:0 ( (condition: if then 1 else 0) if 1; 0 1 log2 A
Computer Arithmetic: Principles, Architectures, and VLSI Design
Leading-zeroes detection (LZD) :
12% log z7
magnitude
GE
Computer Arithmetic: Principles, Architectures, and VLSI Design
5 Simple / Addition-Based Operations
equality & magnitude
...
z0
z2
(note: connections according to PPA-SK)
Computer Arithmetic: Principles, Architectures, and VLSI Design
62
Computer Arithmetic: Principles, Architectures, and VLSI Design
63
5 Simple / Addition-Based Operations
5.5 Shift, Extension, Saturation
5.5 Shift, Extension, Saturation
#
#
Rotation by bit positions, constant (logic operation) Extension of word lengths by bits ( ) (i.e. sign-extension for signed numbers) Saturation to highest/lowest value after over-/underflow
#
shift a)
unl. signed r. signed l. r.
shift b)
unsigned signed
rotate
l. r.
extend
unl. signed r. signed l. r.
saturate unsigned signed
#
%&2 0 0 0 %&1 1 %&1 %&3 0 0 %&1 %&1 %&2 1 %&1 2%&1 %&2 %&2 0 %&1 0 %&1 1 0 %&1 0 %&1 0 0 %&1 %&1 %&2 0 %&1 %&2 0 0 %&1 %&1 % &1 %&1 %&1
Computer Arithmetic: Principles, Architectures, and VLSI Design
5 Simple / Addition-Based Operations
sll srl sla sra
formula
% % %&1 %% % %% % : 0 %&1
! 2 !log a3
a2
a1
a3
a0
"
s1 z3
z2
s1 s0
5.6 Addition Flags
a1
a0
"
barshift.epsi 44 49 mm
s1 s0 s1 s0
z1
z0
z3
multiplexers 64
a2
s1 s0
muxshift.epsi 41 28 mm
s0
z2
z1
z0
tristate buffers
Computer Arithmetic: Principles, Architectures, and VLSI Design
5 Simple / Addition-Based Operations
65
5.6 Addition Flags
Basic and derived condition flags description carry flag signed overflow flag
condition operation: 0 00
zero flag negative flag, sign
: for free : fast , 1 computed by e.g. PPA very cheap : a) 1 (subtract.) : 1:0 (of PPA)
% %& %
%& b) % 0 1 :
%&1 %&2 0 (r.s.a.) 1) log
2)
rol ror
Implementation of adder with flags
,
Implementation of shift/extension/rotation by constant values : hard-wired variable values : multiplexers possible values : –by– barrel-shifter/rotator Example : 4–by–4 barrel-rotator
5.6 Addition Flags flag
5.5 Shift, Extension, Saturation
Applications : adaption of magnitude (shift a)) or word length (extension) of operands (e.g. for addition) multiplication/division by multiples of 2 (shift) logic bit/byte operations (shift, rotation) scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of floating-point numbers, shift a)) using LZD reducing error after over-/underflow (saturation)
Shift : a) shift -bit vector by bit positions b) select out of more bits at position also: logical (= unsigned), arithmetic (= signed)
#
5 Simple / Addition-Based Operations
faster without final sum (i.e. carry prop.) [18] example : 01001 1 00 0 10110 1 00 00000 0 00
0 0 0 % &1 &1
%&1 %&2 0 ; 0 1 (r.s.a.) 3 4 log
flag
( ) zero negative positive overflow underflow
operation:
$
unsigned or
formula signed
( )
—
—( )
( )
Unsigned and signed addition/subtraction only differ with respect to the condition flags
Computer Arithmetic: Principles, Architectures, and VLSI Design
66
Computer Arithmetic: Principles, Architectures, and VLSI Design
67
5 Simple / Addition-Based Operations
5.7 Arithmetic Logic Unit (ALU)
6.1 Multiplication Basics
6 Multiplication
5.7 Arithmetic Logic Unit (ALU) A
6 Multiplication
B
6.1 Multiplication Basics
and [1, 2] is 2 -bit unsigned number or 2 1 -bit
c out alusymbol.epsi c in 29 mm 30 ALU flags op
Multiplies two -bit operands
"
Product signed number
Z
% &1 % &1 % &1 % &1
2 2 2 or 0 0 0 0 % & 1
2 ; 0 1 (r.s.a.) 0
Example : unsigned multiplication ALU operations arithmetic
add inc pass
logic
and or xor pass
shift/ rotate
sll sla rol
% 1 11 1
% 1
11 1
sub dec neg nand nor xnor not
Algorithm 1) Generation of
2) Adding up partial products :
srl sra ror
a) sequentially (sequential shift-and-add), b) serially (combinational shift-and-add), or c) in parallel
s/ro : shift/rotate ; l/r : left/right ; l/a : logic (unsigned) / arithmetic (signed)
Speed-up techniques
Logic of adder/subtractor can partly be shared with logic operations Computer Arithmetic: Principles, Architectures, and VLSI Design
6 Multiplication
68
6.1 Multiplication Basics
Sequential multipliers : partial products generated and added sequentially (using accumulator)
! !log
partial products
Reduce number of partial products Accelerate addition of partial products Computer Arithmetic: Principles, Architectures, and VLSI Design
6 Multiplication
69
6.2 Unsigned Array Multiplier
6.2 Unsigned Array Multiplier Braun multiplier : array multiplier for unsigned numbers
×
% &1 % &1
2 0 0
"
mulseq.epsi 34 28 mm
Array multipliers : partial products generated and added simultaneously in linear array (using array adder)
! 2 !
CPA
1 3 2 3 2 2 3 3 3 2 3 1
× CSA
× CSA
"
mularr.epsi × 34 47 mm CSA
7
×
6
5
b3
CSA
0 3 1 2 2 1 3 0
4
3
b2
8 2 11 6 9 0 2 0 1 0 0 1 1 1 0 2 0 2
1
b1
b0
a0
CPA p0
a1
Parallel multipliers : partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)
! 2 !log
×
×
×
HA
1
HA
HA p1
a2
× mulpar.epsi 34 43 mm
" CSA
FA
tree
"
mulbraun.epsi FA 99 83 mm
FA p2
a3
CPA
2
Signed multipliers : a) complement operands before and result after multiplication unsigned multiplication b) direct implementation (dedicated multiplier structure)
Computer Arithmetic: Principles, Architectures, and VLSI Design
0
70
FA
FA
FA
CSA
p3
CPA
3 p7
FA
FA
HA
p6
p5
p4
Computer Arithmetic: Principles, Architectures, and VLSI Design
71
6 Multiplication
6.3 Signed Array Multipliers
6.3 Signed Array Multipliers
6 Multiplication
6.4 Booth Recoding Speed-up technique : reduction of partial products
Modified Braun multiplier
special FAs [1] % 2
1 neg. bit :
2 neg. bits : % 2
Replace FAs in regions
% % % 1 , 2 , and 3 by :
(input at mark ) Subtract bits with negative weight
Otherwise exactly same structure and complexity as Braun multiplier efficient and flexible
Sequential multiplication
+ One cycle per non-zero partial product (i.e. 0) Minimal (or canonical) signed-digit (SD) represent. of
– Negative partial products – Data-dependent reduction of partial products and latency Combinational multiplication Only fixed reduction of partial product possible
Radix-4 modified Booth recoding : 2 bits recoded to one multiplier digit 2 partial products
%2
2 221) 22 ; &1 0 0 (2 &1& &
Arithmetic transformations yield the following partial products (two additional ones) :
0 3 0 2 0 1 0 0 1 3 1 2 1 1 1 0 2 3 2 2 2 1 2 0 3 3 3 2 3 1 3 0 3 3
1
3 6
7
5
3 3
4
2
1
1 2 2&1
2
0 0 0 0 1 1 1 1
0
– Less efficient and regular than modified Braun multiplier Computer Arithmetic: Principles, Architectures, and VLSI Design
6 Multiplication
72
6.4 Booth Recoding
Applicable to sequential, array, and parallel multipliers
: 8 : 7
– additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation) + adder array/tree cut in half considerably smaller (array and tree)
3
3
3
2
1
0
ext. sign
03 13 23 33 6
03 13 23 32 5
03 13 22 31 4
03 12 21 30 3
02 11 20
01 10
2
1
00
0 0 0 3 1
1 1 1 3
0
33 6
2
1
0
2
1
0
03 12 21 30 3
02 11 20
01 10
23 32 5
2
1
0
×
×
×
× mulbooth.epsi 41 43 mm
2 2
"
CSA array/tree CPA
0
Computer Arithmetic: Principles, Architectures, and VLSI Design
6 Multiplication
73
6.6 Multiplier Implementations
6.5 Wallace Tree Addition
Applicable to parallel multipliers : parallel partial product generation (normal or Booth recoded) – Irregular adder tree (Wallace tree) due to different number of bits per column irregular wiring and/or layout non-uniform bit arrival times at final adder
00
Braun or Baugh-Wooley multiplier (array multiplier) : medium performance, high area, high regularity layout generators data paths and macro-cells simple pipelining, faster CPA higher speed
0
Booth-Wallace multiplier (parallel multiplier) [9] : high performance, high area, low regularity custom multipliers, netlist generators often pipelined (e.g. register between CSA-tree and CPA)
for unsigned multiplication : % 0
Radix-8 (3-bit recoding) and higher radices : precomputing 3 , larger overhead
Computer Arithmetic: Principles, Architectures, and VLSI Design
2
Sequential multipliers : low performance, small area, resource sharing (adder)
Suited for signed multiplication (incl. Booth recod.) Extend
1
6.6 Multiplier Implementations
1
13 22 31 4
10
! 2 !log
Negative partial products (avoid sign-extension) : 3
0 1 0 1 0 1 0 1
2
Speed-up technique : fast partial product addition
: 2 : 2 : 0
much faster for adder arrays slightly or not faster for adder trees
0 0 1 1 0 0 1 1
Booth recoding
Baugh-Wooley multiplier
6.4 Booth Recoding
Signed-unsigned multiplier : signed multiplier with 0) operands extended by 1 bit ( 1 0, 1
% %& % %&
74
Computer Arithmetic: Principles, Architectures, and VLSI Design
75
6 Multiplication
6.8 Squaring
6.7 Composition from Smaller Multipliers
multiplier can be composed from 4 2 2-bit-bitmultipliers (can be repeated recursively) 2% 2%
22% 2% 4 -bit multipliers + 2 -bit CSA + 3 -bit CPA less efficient (area and speed)
2 : multiplier optimizations possible 0 3 0 1 0 1 1 0 2 3 1 2 3 12 21 3 3 2 3 1 3 2 3 1 3 0 3 0 0 1 0 0 3 3 1 2 1 1 2 2
+
2 1partial products (if no Booth recoding used) optimized squarer more efficient than multiplier 6
7.1 Division Basics
7 Division / Square Root Extraction 7.1 Division Basics
;
rem (remainder) 0 0 22% 1 0 2% 1 % %, otherwise overflow 2 2 normalize before division (2%&1 2% 1)
Algorithms (radix-2) Subtract-and-shift : partial remainders
6.8 Squaring
7
7 Division / Square Root Extraction
5
4
3
2
1
0
Computer Arithmetic: Principles, Architectures, and VLSI Design
76
if 2 0
7.3 Non-Restoring Division
7.2 Restoring Division
1
2 0
1 11 ifif 11 00 1 0 : 1 1 2 2 &1 1 1 2 0 : &1 1 1 2 &1 1 2&1
Basic algorithm : compare and conditionally subtract expensive comparison and CPA
Restoring division : subtract and conditionally restore (adder or multiplexer) expensive CPA and restoring
Non-restoring division : detect sign, subtract/add, and correct by next steps expensive CPA
Computer Arithmetic: Principles, Architectures, and VLSI Design
7 Division / Square Root Extraction
7.4 Signed Division
77
1 same sign 1 if 1 if 1 opposite sign
7.4 Signed Division
1
2 0 : 0 (restored) 1
1 1 1 2&1 0 : &1 1 &1 1 2&1
7.3 Non-Restoring Division
0
1
0 if
non-associative
1 2 1 2 % ; 1 0 (r.m.n.)
SRT division : estimate range, subtract/add (CSA), and correct by next steps inexpensive CSA
Table look-up (ROM) less efficient for every
7 Division / Square Root Extraction
Sequential algorithm : recursive,
[1, 2]
Example : signed non-restoring array divider 0, final correction of omitted) (simplifications:
9 2 2 2 4
a6 ⊕ b3
b3
a6
a5
b2
a4
b1
b0
a3
One subtraction/addition (CPA) per step Final correction step for (additional CPA) Simple quotient digit conversion : (note: irredundant)
1 1 0 1 : 1 1 2
%&1 %&2 %&3 0 1
1 2 or ! 2 log
! 1 !
2 or ! log
q3
FA
FA
FA
FA
a2 q2
FA
FA
"
FA
FA
divarray.epsi 81 101 mm
a1 q1
A B
Q
≥ +/− CPA ≥ +/− CPA divnr.epsi +/− CPA 46 ≥38 mm ≥ +/− CPA ≥ +/− CPA
"
R Computer Arithmetic: Principles, Architectures, and VLSI Design
78
FA
FA
FA
FA
a0 q0
FA
FA
FA
FA
r3
r2
r1
r0
Computer Arithmetic: Principles, Architectures, and VLSI Design
79
7 Division / Square Root Extraction
7.5 SRT Division
7.6 High-Radix Division
1 1 if 2 $
0 if 2 $1 2 is SD number 1 if 1 2 %& % If 2 1 $ 2 , i.e. is normalized : 2 $ 2%&1 $1 2%&1 $2 %&1 $1 1 if 2 0 if 2%&1 $1 2%&1 1 if 2%&1 1 are estimated CSA + Only 3 MSB are compared
Radix
2
! 2 !
CPA
Division by convergence
& 1
0 1 &1 1 1 resp. 2% 0 1 1 2%1
1 2%1
2 1 2%
1 2&% 2 2&% 1 (signed) Algorithm : 1 1 ; 0 1 1 (r.s.n.)
≥ +/− CSA ≥ +/− CSA divsrt.epsi ≥ mm +/− CSA 50 38 ≥ +/− CSA ≥ +/− CPA
"
0
Quadratic convergence :
R
7 Division / Square Root Extraction
7.7 Division by Multiplication
A B
Computer Arithmetic: Principles, Architectures, and VLSI Design
Q
– Complex comparisons (more bits) and decisions table look-up ( Pentium bug!)
instead of CPA can be used (precise enough) [19] Correction in following steps (+ final correction step) – Redundant representation of (SD representation) final conversion necessary (CPA) + Highly regular and fast ( ) SRT array dividers only slightly slower/larger than array multipliers
7.7 Division by Multiplication
2, 1 1 0 1 1 quotient bits per step fewer, but more complex steps + Suitable for SRT algorithm faster
7.5 SRT Division (Sweeney, Robertson, Tocher)
!
7 Division / Square Root Extraction
80
7.8 Remainder / Modulus
Division by reciprocation
0
log
Computer Arithmetic: Principles, Architectures, and VLSI Design
7 Division / Square Root Extraction
81
7.9 Divider Implementations
7.9 Divider Implementations
1
Iterative dividers (through multiplication) :
resource sharing of existing components (multiplier) medium performance, medium area high efficiency if components are shared
0 by recursion 1 find 1 1 1
0
Newton-Raphson iteration method :
Sequential dividers (restoring, non-restoring, SRT) :
resource sharing of existing components (e.g. adder) low performance, low area
2
2 ; 0 1 (r.s.n.)
Algorithm : 1
0
Array dividers (restoring, non-restoring, SRT) :
!log
Quadratic convergence :
Speed-up : first approximation
from table 0
7.8 Remainder / Modulus
rem sign sign
Remainder (rem) : signed remainder of a division
ifelse 0
dedicated hardware component high performance, high area high regularity layout generators, pipelining square root extraction possible by minor changes combination with multiplication or/and square root
No parallel dividers exist, as compared to parallel multipliers (sequential nature of division)
Modulus (mod) : positive remainder of a division
mod
0
Computer Arithmetic: Principles, Architectures, and VLSI Design
82
Computer Arithmetic: Principles, Architectures, and VLSI Design
83
7 Division / Square Root Extraction
7.10 Square Root Extraction
7.10 Square Root Extraction
and quotients %& 0[1] 2
1 2 2 21 2 2 1 2 1 2 2 1 2 1 2 2 2 2 ; 1 0 % 1 % 0 1 (r.m.n.) 0 0
Subtract-and-shift : partial remainders 2 0 1 1
Table look-up : inefficient for large word lengths [5] Taylor series expansion : complex implementation Polynomial and rational approximations [1, 5] Shift-and-add algorithms [5]
Implementation
Convergence algorithms [1, 2] :
similar to division-by-convergence two (or more) recursive formulas : one formula
+ Similar to division same algorithms applicable (restoring, non-restoring, SRT, high-radix) + Combination with division in same component possible
converges to a constant, the other to the result
A
Only triangular array required 0) (step :
2
Coordinate rotation (CORDIC) [2, 5, 20] :
3 equations for x-, y-coordinate, and angle computes all elementary functions by proper input
+/− CPA sqrtnr.epsi +/− CPA 42 36+/− mmCPA +/− CPA +/− CPA
"
Q
settings and choice of modes and outputs
simple, universal hardware, small look-up table
R Computer Arithmetic: Principles, Architectures, and VLSI Design
8 Elementary Functions
84
8.2 Integer Exponentiation
8.2 Integer Exponentiation
Approximated exponentiation : ln 2 log 1 0 Base-2 integer exponentiation : 2 0
!
!
Integer exponentiation (exact) :
% 0 2 1
1
1
1
1
1
0
2
2
8.3 Integer Logarithm
12
1 2 1 0 1 2
2
2
1
2
0
; 1 0 2 1 % 1 0 (r.s.n.) 2 1
log2
2 2
2 2 4 2
b)
85
8 Elementary Functions
8.3 Integer Logarithm
Algorithms : square-and-multiply
Computer Arithmetic: Principles, Architectures, and VLSI Design
(!)
Applications : modular exponentiation mod in cryptographic algorithms (e.g. IDEA, RSA)
a)
(exp )
8.1 Algorithms
!
Logarithm function : ln , log Trigonometric functions : sin , cos , tan Inverse trig. functions : arcsin , arccos , arctan Hyperbolic functions : sinh , cosh , tanh Exponential function :
Algorithm
8.1 Algorithms
8 Elementary Functions
2 0 22% 1 0 2% 1
8 Elementary Functions
2
1
0
For detection/comparison of order of magnitude Corresponds to leading-zeroes detection (LZD) with encoded output
&1 1 2 ; 0 1 &1 1 0 %&1 (r.s.n.) 2 or 2
Computer Arithmetic: Principles, Architectures, and VLSI Design
86
Computer Arithmetic: Principles, Architectures, and VLSI Design
87
9 VLSI Design Aspects
9.1 Design Levels
9 VLSI Design Aspects
9.1 Design Levels
Gate-level design
9 VLSI Design Aspects
Cell-based design techniques : standard-cells, gate-array/ sea-of-gates, field-programmable gate-array (FPGA)
9.1 Design Levels Transistor-level design
Circuit implemented by hand or by synthesis (library)
Circuit and layout designed by hand (full custom)
Layout implemented by automated place-and-route
Low design efficiency
Medium to high design efficiency
High circuit performance : high speed, low area
Medium to low circuit performance
High flexibility : choice of architecture and logic style
Medium to low flexibility : full choice of architecture
Transistor-level circuit optimizations :
logic style : static vs. dynamic logic,
Block-level design
complementary CMOS vs. pass-transistor logic
special arithmetic circuits : better than with gates gi
ci
carry chain :
c out
ki
a
b
a
"p
c i-1 carrychain.epsi 54 17 mm
a
b
High design efficiency
g i-1
k i-1 p i-1
i
Layout blocks and netlists from parameterized automatic generators or compilers (library)
c in
Medium to high circuit performance
c in
Low flexibility : limited choice of architectures Implementations :
a b
fulladder :
c in
b
c in
b
"
facmos.epsi 76 40 mm
c in
s
c in
c out
b a
b
a
a
b
c in
a
Computer Arithmetic: Principles, Architectures, and VLSI Design
9 VLSI Design Aspects
data-path : bit-sliced, bus-oriented layout (array of cells: bits operations), implementation of entire data paths, medium performance, medium diversity macro-cells : tiled layout, fixed/single-operation components, high performance, small diversity portable netlists : gate-level design
88
9.2 Synthesis
Computer Arithmetic: Principles, Architectures, and VLSI Design
9 VLSI Design Aspects
89
9.3 VHDL
9.2 Synthesis
9.3 VHDL
High-level synthesis
Arithmetic types : unsigned, signed (2’s complement)
Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL Involves architectural synthesis and arithmetic transformations
Arithmetic packages numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys) contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types
High-level synthesis is still in the beginnings
Arithmetic operators (VHDL’87/93) [21]
Low-level synthesis
relational shift, rotate (’93 only) adding sign (unary) multiplying exponent, absolute
Layout and netlist generators Included in libraries and synthesis tools Low-level synthesis is state-of-the-art Basis for efficient ASIC design Limited diversity and flexibility of library components
: : : : : :
=, /=, <, <=, >, >= rol, ror, sla, sll, sra, srl +, +, *, /, mod, rem **, abs
Synthesis
Circuit optimization Efficient optimization of random logic is state-of-the-art
Typical limitations of synthesis tools :
/, mod, rem : both operands must be constant or divisor
Optimization of entire arithmetic circuits is not feasible only local optimizations possible Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators
Computer Arithmetic: Principles, Architectures, and VLSI Design
must be a power of two ** : for power-of-two bases only Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)
90
Computer Arithmetic: Principles, Architectures, and VLSI Design
91
9 VLSI Design Aspects
9.3 VHDL
Resource sharing
Pipelining
Pipelining is basically possible with every combinational circuit higher throughput
2 adders + 1 multiplexer = ’1’ else B; 1 multiplexer + 1 adder
a)
S <= A + C when SELA = ’1’ else B + C;
b)
T <= A when SELA S <= T + C;
Addition : single adder with carry-in/carry-out : resize(A, width+1) & Cin; resize(B, width+1) & ’1’; Aext + Bext; Sext(width downto 1); Sext(width+1);
Pipelining of arithmetic circuits can be very costly :
(except for smaller latency) Fine-grain pipelining arithmetic circuits)
Synthesis : check synthesis result for allocated arithmetic units code sanity check, control of circuit size
systolic arrays (often applied to
High speed Fast circuit architectures, pipelining, replication (parallelization), and combinations of those
VHDL library of arithmetic units
Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
Structural, synthesizable VHDL code for most circuits described in this text is found in [22] Computer Arithmetic: Principles, Architectures, and VLSI Design
9 VLSI Design Aspects
Arithmetic circuits are well suited for pipelining due to high regularity
large amount of internal signals in arithmetic circuits array structures : many small pipeline registers tree structures : few large pipeline registers no advantage of tree structures anymore
Coding & synthesis hints
<= <= <= <= <=
9.4 Performance
9.4 Performance
Sharing one resource for multiple operations Done automatically by some synthesis tools Otherwise, appropriate coding is necessary :
Aext Bext Sext S Cout
9 VLSI Design Aspects
92
9.4 Performance
Computer Arithmetic: Principles, Architectures, and VLSI Design
9 VLSI Design Aspects
93
9.5 Testability
Low power
9.5 Testability
Power-related properties of arithmetic circuits :
Testability goal : high fault coverage with few test vectors that are easy to generate/apply
High glitching activity due to high bit dependencies and large logic depth
Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits
Power reduction in arithmetic circuits [23] : Reduce the switched capacitance by choosing an area efficient circuit architecture Allow for lower supply voltage by speeding up the circuitry
Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits Hard-detectable faults found in :
Reduce the transition activity :
apply stable inputs while circuit is not in use (
circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags
disabling subcircuits)
reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize) reduce glitching transitions by reducing logic depth (pipelining) take advantage of correlated data streams choose appropriate number representations (e.g. Gray codes for counters)
circuits using redundant number representations ( redundant hardware) : dividers (Pentium bug!)
Computer Arithmetic: Principles, Architectures, and VLSI Design
94
Computer Arithmetic: Principles, Architectures, and VLSI Design
95
Bibliography
Bibliography
Bibliography
[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.
[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993. [2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979. [3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.
[13] T. Han and D. A. Carlson, “Fast area-efficient VLSI adders”, in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49–56.
[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
[14] D. W. Dobberpuhl et al., “A 200-MHz 64-b dual-issue CMOS microprocessor”, IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555–1564, Nov. 1992.
[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhauser Boston, 1997. [6] Proceedings of the Xth Symposium on Computer Arithmetic. [7] IEEE Transactions on Computers. [8] D. R. Lutz and D. N. Jayasimha, “Programmable modulo-k counters”, IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939–941, Nov. 1996. [9] H. Makino et al., “An 8.8-ns 54 54-bit multiplier with high speed redundant binary architecture”, IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773–783, June 1996. [10] W. N. Holmes, “Composite arithmetic: Proposal for a new standard”, IEEE Computer, vol. 30, no. 3, pp. 65–73, Mar. 1997. Computer Arithmetic: Principles, Architectures, and VLSI Design
96
Bibliography
[18] J. Cortadella and J. M. Llaberia, “Evaluation of A + B = K conditions without carry propagation”, IEEE Trans. Comput., vol. 41, no. 11, pp. 1484–1488, Nov. 1992. [19] S. E. McQuillan and J. V. McCanny, “Fast VLSI algorithms for division and square root”, J. VLSI Signal Processing, vol. 8, pp. 151–168, Oct. 1994. [20] Y. H. Hu, “CORDIC-based VLSI architectures for digital signal processing”, IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16–35, July 1992. [21] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997. [22] R. Zimmermann, “VHDL Library of Arithmetic Units”, http://www.iis.ee.ethz.ch/˜zimmi/arith lib.html. [23] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.
Computer Arithmetic: Principles, Architectures, and VLSI Design
[12] A. Tyagi, “A reduced-area scheme for carry-select adders”, IEEE Trans. Comput., vol. 42, no. 10, pp. 1162–1170, Oct. 1993.
98
[15] A. De Gloria and M. Olivieri, “Statistical carry lookahead adders”, IEEE Trans. Comput., vol. 45, no. 3, pp. 340–347, Mar. 1996. [16] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach”, IEEE Trans. Comput., vol. 45, no. 3, pp. 294–305, Mar. 1996. [17] Z. Wang, G. A. Jullien, and W. C. Miller, “A new design technique for column compression multipliers”, IEEE Trans. Comput., vol. 44, no. 8, pp. 962–970, Aug. 1995.
Computer Arithmetic: Principles, Architectures, and VLSI Design
97