Computer arithmetic.Principles,architectures,and VLSI design

Eidgenossische ¨ Technische Hochschule Zurich ¨ Institut f¨ur Integrierte Systeme Ecole polytechnique federale ´ ´ de ...

Author: Zimmermann R.

113 downloads 1128 Views 465KB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

Eidgenossische ¨ Technische Hochschule Zurich ¨

Institut f¨ur Integrierte Systeme

Ecole polytechnique federale ´ ´ de Zurich Politecnico federale di Zurigo Swiss Federal Institute of Technology Zurich

Integrated Systems Laboratory

Lecture notes on

Computer Arithmetic: Principles, Architectures, and VLSI Design March 16, 1999

Reto Zimmermann Integrated Systems Laboratory Swiss Federal Institute of Technology (ETH) CH-8092 Z¨urich, Switzerland [email protected]

Copyright c 1999 by Integrated Systems Laboratory, ETH Z¨urich http://www.iis.ee.ethz.ch/ zimmi/publications/comp arith notes.ps.gz

Contents

Contents

Contents 1 Introduction and Conventions

4

1.1 Outline

4

1.2 Motivation

4

1.3 Conventions

5

1.4 Recursive Function Evaluation

6

4.3 Carry-Propagate Adders (CPA)

26

4.4 Carry-Save Adder (CSA)

45

4.5 Multi-Operand Adders

46

4.6 Sequential Adders

52

5 Simple / Addition-Based Operations

53

5.1 Complement and Subtraction

53

5.2 Increment / Decrement

54

8

5.3 Counting

58

2.1 Overview

8

5.4 Comparison, Coding, Detection

60

2.2 Implementation Techniques

9

2 Arithmetic Operations

5.5 Shift, Extension, Saturation

64

10

5.6 Addition Flags

66

3.1 Binary Number Systems (BNS)

10

5.7 Arithmetic Logic Unit (ALU)

68

3.2 Gray Numbers

13

3.3 Redundant Number Systems

14

6.1 Multiplication Basics

69

3.4 Residue Number Systems (RNS)

16

6.2 Unsigned Array Multiplier

71

3.5 Floating-Point Numbers

18

6.3 Signed Array Multipliers

72

3.6 Logarithmic Number System

19

6.4 Booth Recoding

73

3.7 Antitetrational Number System

19

6.5 Wallace Tree Addition

75

3.8 Composite Arithmetic

20

6.6 Multiplier Implementations

75

3.9 Round-Off Schemes

21

6.7 Composition from Smaller Multipliers

76

6.8 Squaring

76

3 Number Representations

4 Addition

22

4.1 Overview

22

4.2 1-Bit Adders, (m, k)-Counters

23

Computer Arithmetic: Principles, Architectures, and VLSI Design

1

Contents

7.2 Restoring Division

78

7.3 Non-Restoring Division

78

7.4 Signed Division

79

7.5 SRT Division

80

7.6 High-Radix Division

81

7.7 Division by Multiplication

81

7.8 Remainder / Modulus

82

7.9 Divider Implementations

83

7.10 Square Root Extraction

84

8 Elementary Functions

85

8.1 Algorithms

85

8.2 Integer Exponentiation

86

8.3 Integer Logarithm

87

9 VLSI Design Aspects

88

9.1 Design Levels

88

9.2 Synthesis

90

9.3 VHDL

91

9.4 Performance

93

9.5 Testability

95

Bibliography

96

Computer Arithmetic: Principles, Architectures, and VLSI Design

3

6 Multiplication

7 Division / Square Root Extraction 7.1 Division Basics Computer Arithmetic: Principles, Architectures, and VLSI Design

69

77 77 2

1 Introduction and Conventions

1.2 Motivation

1 Introduction and Conventions

1 Introduction and Conventions

1.3 Conventions

1.1 Outline

Naming conventions

Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7] Circuit architectures and implementations of main arithmetic operations Aspects regarding VLSI design of arithmetic units 1.2 Motivation Arithmetic units are, among others, core of every data path and addressing unit

gate-equivalents (GE) model) : 0 0 (i.e. ignored) Inverter, buffer : Simple monotonic 2-input gates (AND, NAND, OR, 1 NOR) : 1

Unit-gate model (

microprocessors (CPU) signal processors (DSP) data-processing application specific ICs (ASIC) and programmable ICs (e.g. FPGA)

Simple non-monotonic 2-input gates (XOR, XNOR) : 2 2

Standard arithmetic units available from libraries

Complex gates : composed from simple gates

Simple -input gates : 1 log

Design of arithmetic units necessary for :

non-standard operations high-performance components library development

Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization) Only estimations given for complex circuits

Computer Arithmetic: Principles, Architectures, and VLSI Design

4

1.4 Recursive Function Evaluation

1.4 Recursive Function Evaluation Given : inputs

(1-D), (2-D), (subbus, 1-D) : Signals : , (1-D), (2-D), : (group signal) Circuit complexity measures : (area), (cycle time, delay), (area-time product), (latency, # cycles) ,

, , , log ( log2 ) Arithmetic operators : Logic operators : (or), (and), (xor), (xnor), (not) Signal buses :

Circuit complexity measures

Data path is core of :

1 Introduction and Conventions

1.3 Conventions

Computer Arithmetic: Principles, Architectures, and VLSI Design

1 Introduction and Conventions

2.

, outputs , function (graph sym. :

)

is a function of input (or : const.) ; 0 1 parallel structure : a a a a funn.epsi 119 "17 mm ! !1 1

is associative (r.s.a.) serial or single-tree structure : ! !log

a3 a2 a1 a0

"

1 funrsa.epsi 219 20 mm z

b) with multiple outputs

Output

2

1.4 Recursive Function Evaluation

(r.m.) (prefix problem) : &1 ; 0 1 &1 0 1

Non-recursive functions (n.)

3

5

1.

0

z3 z2 z1 z0

is non-associative (r.m.n.) serial structure : ! !

a3 a2 a1 a0 1 funrmn.epsi 219 25 mm 3

"

z3 z2 z1 z0

Recursive functions (r.)

a3 a2 a1 a0

is a function of all inputs #$ with single output %&1 (r.s.) : ' '&1 ; 0 1 '&1 0 1 '%&1

Output a)

1.

is non-associative (r.s.n.) serial structure : ! !

Computer Arithmetic: Principles, Architectures, and VLSI Design

2.

is associative (r.m.a.) serial or multi-tree structure : ! 2 !log

1 2 z3 funrma1.epsi 19 43 mm

"

z2 z1

or shared-tree structure : ! log !log

a3 a2 a1 a0 1 funrsn.epsi 219 24 mm 3

"

z0 a3 a2 a1 a0

"

1funrma2.epsi 219 21 mm z3 z2 z1 z0

z 6

Computer Arithmetic: Principles, Architectures, and VLSI Design

7

2 Arithmetic Operations

2.1 Overview

2 Arithmetic Operations

2.2 Implementation Techniques

2 Arithmetic Operations

2.2 Implementation Techniques

2.1 Overview

Direct implementation of dedicated units : fixed-point

based on operation

always : 1 – 5

floating-point

related operation

in most cases : 6

<< , >>

sometimes : 7, 8 =,<

+1 , −1

+,−

+/−

× arithops.epsi 98 83 mm

sometimes : 6

×

"

in most cases : 7, 8, 9 (same as on the left for floating-point numbers)

sqrt (x)

⁄

Sequential implementation using simpler units and several clock cycles ( decomposition) :

+,−

Table look-up techniques using ROMs : universal : simple application to all operations

exp (x)

trig (x)

complexity

log (x)

efficient only for single-operand operations of high complexity (8 – 12) and small word length (note: ROM ) size 2

hyp (x)

%

Approximation techniques using simpler units : 7–12 1 2 3 4 5 6

shift/extension comparison increment/decrement complement addition/subtraction multiplication

7 8 9 10 11 12

division square root extraction exponential function logarithm function trigonometric functions hyperbolic functions

Computer Arithmetic: Principles, Architectures, and VLSI Design

3 Number Representations

taylor series expansion polynomial and rational approximations convergence of recursive equation systems CORDIC (COordinate Rotation DIgital Computer) 8

3.1 Binary Number Systems (BNS)

3 Number Representations 3.1 Binary Number Systems (BNS) Radix-2, binary number system (BNS) : irredundant, weighted, positional, monotonic [1, 2]

%&%&

-bit number is ordered sequence of bits (binary digits) : 0 1 1 2 0 2 Simple and efficient implementation in digital circuits MSB/LSB (most-/least-significant bit) :

%&1 / 0

Represents an integer or fixed-point number, exact Fixed-point numbers :

& &&% 1 0 1

% &

-bit integer

-bit fraction

Unsigned : positive or natural numbers

%&2%&1 2 1 1 % Range : 0 2 1

Value :

1

2

0

3 Number Representations

9

3.1 Binary Number Systems (BNS)

% Complement : 2 1 , where %&1 %&2 0 Sign : %&1

Properties : asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

% &2 % & 1 % &1 2 1 2 Value :

0 % % & & 1 1

1 2 1 Range : 2 % Complement : 2 1 Sign : %&1 Properties : double of zero, symmetric %representation range, modulo 2 1 number system

One’s (1’s) complement : similar to 2’s complement

Sign-magnitude : alternative representation of signed numbers

0

&2 1 % 0 2 %& %& Range : 2 1 1 2 1 1 Complement : %&1 %&2 0 Sign : %&1

Two’s (2’s) complement : standard representation of signed or integer numbers

Value :

% &2 % & 1 % Value :

&12 2 0 % % & & 1 1

1 Range :

2 2

Computer Arithmetic: Principles, Architectures, and VLSI Design

Computer Arithmetic: Principles, Architectures, and VLSI Design

10

1

Computer Arithmetic: Principles, Architectures, and VLSI Design

11

3 Number Representations

3.1 Binary Number Systems (BNS)

Properties : double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 ( low power)

Gray numbers (code) : binary, irredundant, non-weighted, non-monotonic + Property : unit-distance coding (i.e. exactly one bit toggles between adjacent numbers) Applications : counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)

111...1

011...1 100...0

000...0

Graphical representation

binary number representation

−2

0

2

3.2 Gray Numbers

3.2 Gray Numbers

n−1

3 Number Representations

n−1

2

"

numrep.epsi 95 73 mm

– Non-monotonic numbers : difficult arithmetic operations, e.g. addition, comparison :

0 0 0 0 0 1 and 0 1 1 1 1 0 but 1 0 binary Gray : % 0 ;

n

1 0

unsigned 2’s complement 1’s complement

1 0

0 1 1

sign-magnitude

(n.)

binary : % 0 ;

Gray Conventions

1 0 1

2’s complement used for signed numbers in these notes Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned Computer Arithmetic: Principles, Architectures, and VLSI Design

3 Number Representations

12

3.3 Redundant Number Systems

3.3 Redundant Number Systems Non-binary, redundant, weighted number systems [1, 2] Digit set larger than radix (typically radix 2) multiple representations of same number redundancy + No carry-propagation in adders more efficient impl. of adder-based units (e.g. multipliers and dividers) – Redundancy no direct implementation of relational operators conversion to irredundant numbers – Several bits used to represent one digit higher storage requirements – Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

%&1 2

0 1 digit holds sum of 2 bits (no carry-out digit) example : 00 10 00 10 01 01 10 00 irredundant representation of

1 [8], since 1 0 & 1 1

0

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

0 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1

0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1

0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1

0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1

0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0

0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0

0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0

Computer Arithmetic: Principles, Architectures, and VLSI Design

3 Number Representations

13

3.3 Redundant Number Systems

1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved) standard redundant number system for fast addition Signed-digit (SD) or redundant digit (RD) number representation :

%&

'

'

1 0 1

no carry-propagation in

2

:

2 1 is redundant (e.g. 0 1

1

1 0

1 0 1 ,

, 1 1 01 1 0 1

1 0 1 11)

1 digit holds sum of 2 digits (no carry-out digit)

minimal SD representation : minimal number of non-zero digits, 011 1 10 100 0 10

Delayed-carry of half-adder number representation : 0 1 2 , 0 1 , , 0 1 2 1

1

applications : sequential multiplication (less cycles), filters with constant coefficients (less hardware)

example : minimal 7 0111 1111 1011 1001 11111

canonical SD repres.: minimal SD + not two non-zero digits in sequence, 01 1 10 10 0 10

SD binary : carry-propagation necessary ( adder)

%&

Carry-save number representation : 0 1 2 3 , 0 1 , 1 2 1

1 2 0 Computer Arithmetic: Principles, Architectures, and VLSI Design

(r.m.a.)

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

3binary 2 1 0 3 Gray 2 1 0

other applications : high-speed multipliers [9] similar to carry-save, simple use for signed numbers 14

Computer Arithmetic: Principles, Architectures, and VLSI Design

15

3 Number Representations

3.4 Residue Number Systems (RNS)

Non-binary, irredundant, non-weighted number system [1]

– Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required

Possible applications (but hardly used) :

digital filters : fast additions and multiplications error detection and correction for arithmetic operations in conventional and residue number systems

%&%& %&1 %&2 0 , 0 1 1 % Range: &1 , anywhere in ZZ mod 0 , %&1 0 , 0 1 0

Base is -tuple of integers 1 2 0 , residues (or moduli) pairwise relatively prime

0

possible range

5 5 5 2 1 1 0 4 5 6 1 0 2 1 3 2 6

1 2 3 0 1 2 0 1 3 6 4 5 1 0 2 1 6

1 2 0 1 2 0 2 3

Computer Arithmetic: Principles, Architectures, and VLSI Design

3 Number Representations

Codes for error detection and correction [1]

2

&1 &2 (Fermat’s theorem) are 2and 2 1: Best moduli high storage efficiency with #bits simple modular addition : 2: #-bit adder without , 2 1 : #-bit adder with end-around carry ( % ) 3 2, 6 Example : 1 0 4 3 2 1 0 1 2 3 4 5 6 7 8 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 0 1 0 1 0 1 01 0 1 0 1 0

+ Carry-free and fast additions and multiplications

3.4 Residue Number Systems (RNS)

Arithmetic operations : (each digit computed separately)

3.4 Residue Number Systems (RNS)

1

3 Number Representations

16

3.5 Floating-Point Numbers

3.5 Floating-Point Numbers

2

6

Computer Arithmetic: Principles, Architectures, and VLSI Design

3 Number Representations

17

3.7 Antitetrational Number System

3.6 Logarithmic Number System

Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]

Alternative representation to floating-point (i.e. mantissa + integer exponent only fixed-point exponent) [1]

Double-number form

Single-number form continuous precision accuracy, more reliable

discontinuous precision

1 1 1 2 & Basic arithmetic operations : 1 1 1 base on fixed-point add, multiply, and shift operations postnormalization required (1 $ 1) S biased exponent E unsigned norm. mantissa M

1 1 2 &

S biased fixed-point exponent E

precision single double

32 64

23 52

bias

(signed-logarithmic)

(additionally consider sign) : by approximation or addition in conventional number system and double conversion 1 1 1

+ Simpler multiplication/exponent., more complex addition – Expensive conversion : (anti)logarithms (table look-up) Applications : real-time digital filters 3.7 Antitetrational Number System

22) and antitetration (a.t. ) [10] " Larger range, smaller precision than logarithmic repres., otherwise analogous (i.e. 2 t. log a.t. )

IEEE floating-point format :

Basic arithmetic operations :

Applications : processors : “real” floating-point formats (e.g. IEEE standard), large range due to universal use ASICs : usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers

higher

2

range

Tetration (t.

precision 38

8 127 3 8 10 11 1023 9 10307

Computer Arithmetic: Principles, Architectures, and VLSI Design

&7 &15

!

10 10

!

18

Computer Arithmetic: Principles, Architectures, and VLSI Design

19

3 Number Representations

3.8 Composite Arithmetic

3.8 Composite Arithmetic

3 Number Representations

3.9 Round-Off Schemes

3.9 Round-Off Schemes

Proposal for a new standard of number representations [10] Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers

lower bits %& additional 1 0 & 1 & Rounding : keeping error final word small during length reduction : %& Intermediate results with ( higher accuracy) :

1

Secondary forms used for numbers not representable by primary ones ( no over-/underflow handling necessary)

Trade-off : numerical accuracy vs. implementation cost Truncation :

1 1

Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy

2

Number representations :

0

2

1

%& 1

0

(= average error )

Round-to-nearest (i.e. normal rounding) :

integer :

tag 00

value 2’s complement integer

rational :

01

slash

logarithmic :

10

log integer

log fraction

antitetrational :

11

a.t. integer

a.t. fraction

%& 1 0 1 2 1 0 2

1 (nearly symmetric)

denominator numerator

if &1 & 0 0 & %&1 1 0

0

Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)

Hardware proposal : long accumulator (4096 bits) holds any floating-point number in fixed-point format higher accurary large hardware/software overhead Computer Arithmetic: Principles, Architectures, and VLSI Design

otherwise

(symmetric) mandatory in IEEE floating-point standard

Implementation : mixed hardware/software solutions

20

4 Addition

1

Round-to-nearest-even/-odd :

Rational numbers : slash position (i.e. size of numerator/ denominator) is variable and stored (floating slash)

0 12” can often be included in previous operation 2

“

4.1 Overview

4 Addition

3 guard bits for rounding after floating-point operations : guard bit (postnormalization), round bit (round-to-nearest), sticky bit (round-to-nearest-even) Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

21

4.2 1-Bit Adders, (m, k)-Counters

4.2 1-Bit Adders, (m, k)-Counters

bits of same magnitude (i.e. 1-bit numbers) Output sum as #-bit number ( # log 1) or : count 1’s at inputs (m, k)-counter [3]

4.1 Overview

Add up HA

1-bit adders

FA

(m,k)

(m,2)

(combinational counters) RCA

CSKA

CSLA

CIA

CLA

PPA

COSA

Half-adder (HA), (2, 2)-counter

carry-propagate adders

2

3 2 1

CPA

(sum) (carry-out)

CSA

3-operand

"

adders.epsi 103 121 mm

carry-save adders

multi-operand

adder array

a b

adder tree

a b a b

multi-operand adders

array adder

"

tree adder

hasym.epsi 18 23HA mm c out

s Legend: HA: FA: (m,k): (m,2):

half-adder full-adder (m,k)-counter (m,2)-compressor

based on component

CPA: carry-propagate adder RCA: ripple-carry adder CSKA:carry-skip adder CSLA: carry-select adder CIA: carry-increment adder

"

chaschema1.epsi out 19 28 mm

CLA: carry-lookahead adder PPA: parallel-prefix adder COSA:conditional-sum adder

"

haschema2.epsi 21 43 mm c out

s

(reference) s

CSA: carry-save adder

related component

Computer Arithmetic: Principles, Architectures, and VLSI Design

22

Computer Arithmetic: Principles, Architectures, and VLSI Design

23

4 Addition

4.2 1-Bit Adders, (m, k)-Counters

Full-adder (FA), (3, 2)-counter

2

% 7 4 2

4 Addition

(m, k)-counters

& 1 &1 0 &1

0 2 0

(generate)

(propagate) 1

% % % % %

% % %

% 0 % 1 0

out

"

"

in

"

s

FA

FA

count73par.epsi FA 36 48 mm

FA

0

p faschematic4.epsi c out c in 29 1 41 mm

"

c in

"

c out

"

faschematic5.epsi 0 c0 35 47 mm 1 c1

FA

s2

s

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

24

4.3 Carry-Propagate Adders (CPA)

4.3 Carry-Propagate Adders (CPA)

-bit operands and and an optional carry-in

Add two by performing carry-propagation [1, 2, 11]

is irredundant 1-bit number 2% %

s0

s0

linear structure

25

4.3 Carry-Propagate Adders (CPA)

Carry-propagation speed-up techniques

%

a) Concatenation of partial CPAs with fast a n-1:j b n-1:j

a i-1:k b i-1:k

a k-1:0 b k-1:0

...

CPA

c out

"

speedup1.epsi CPA c i84 26 mm

cj

CPA

ck

c in

...

B

A

s i-1:k

s n-1:j

"

cpasymbol.epsi CPA c out 29 26 mm c in

s1

s1

tree structure

4 Addition

1 ; 0 1 1 0 % % (r.m.a.)

s2

Computer Arithmetic: Principles, Architectures, and VLSI Design

2

FA

FA

s

s

"

count73ser.epsi 42 59 mm

c in

Sum

a3a4 a5a6

FA

a b

faschematic1.epsi g p 29 43 mm

%

a0a1 a2

a b

a b

(reference)

28 10

a0a1 a2a3a4a5a6

s

"

s k-1 s 0

28 14

faschematic2.epsi c in 32 35 mm

c out

s

c out

"...

cntsymbol.epsi 23 mm 18 (m,k)

Example : (7, 3)-counter

g HA faschematic3.epsi p c out 29 32 mm c in HA

a m-1

...

7 log2&7 log 1 4 2 log

4 log3 2 log

a b

fasymbol.epsi FA c18 21 mm c

a0

Usually built from full-adders Associativity of addition allows convertion from linear to tree structure faster at same number of FAs

a b

a b

4.2 1-Bit Adders, (m, k)-Counters

s k-1:0

a) Fast carry look-ahead logic for entire range of bits

S

a n-1

b n-1

a1

b1

a0

b0

Ripple-carry adder (RCA) Serial arrangement of

Simplest, smallest, and slowest CPA structure

7 2 14 a n-1

b n-1

a1

b1

a0

FA

...

s n-1

"

rca.epsi FA c n-1 57c 2 23 mm

s1

c1

carry propagation

c in

2

postprocessing

...

b0 FA

"

speedup2.epsi 104 50 mm

c out

s n-1

...

c out

preprocessing

...

full-adders

s1

s0

c in

s0

Computer Arithmetic: Principles, Architectures, and VLSI Design

26

Computer Arithmetic: Principles, Architectures, and VLSI Design

27

4 Addition

4.3 Carry-Propagate Adders (CPA)

Carry-skip adder (CSKA)

4.3 Carry-Propagate Adders (CPA)

Carry-select adder (CSLA)

and &1:

&1: 0&1: 1&1: 0 1 Two CPAs compute two possible results ( % 0 1), group carry-in selects correct one afterwards Type a) : partial CPA with fast

Type a) : partial CPA with fast

&1: &1: (bit group &1 )

&1: &1 &2 (group propagate) ) 1) &1: 0 : and selected ( 2) &1: 1 : but skipped ( ) path never sensitized fast false path inherent logic redundancy problems in circuit optimization, timing analysis, and testing Variable group sizes (faster) : larger groups in the middle 1) (minimize delays 0 1 and

&

%& Partial CPA typ. is RCA or CSKA ( multilevel CSKA)

4 Addition

Variable group sizes (faster) : larger groups at end (MSB) 0) (balance delays 0 and

Part. CPA typ. is RCA, CSLA ( multil. CSLA), or CLA

High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)

14 2 8

Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)

8 4

1 2

a n-1:j b n-1:j

32

a k-1:0 b k-1:0

0

c out 0

c out

CPA

cj

ci

CPA

"

cska.epsi 99 36 mm 1

ci

CPA

ck

39

3 2

a k-1:0 b k-1:0

...

...

c’i

1 2

a i-1:k b i-1:k

3 2

a i-1:k b i-1:k

1

c i0

c i1

...

c in

0

CPA

"

csla.epsi mm 102 50CPA 0 s i-1:k 0

1

CPA

ck

c in

1 s i-1:k

1

ck

...

P i-1:k s i-1:k

s n-1:j

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

28

4.3 Carry-Propagate Adders (CPA)

Carry-increment adder (CIA)

and &1:

&1: &1: &1: &1: &1 &2 (group propagate) Result is incremented after addition, if 1 [12, 11] Type a) : partial CPA with fast

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

29

4.3 Carry-Propagate Adders (CPA)

Example : gate-level schematic of carry-incr. adder (CIA) only 2 different logic cells (bit-slices) : IHA and IFA

max

4 6 10 12 14 16 18 20 22 24 26 28 ... 38 2 3 4 5 6 7 8 9 10 11 ... 16 1 2 4 7 11 16 22 29 37 46 56 67 ... 137

group

a i-1 IFA

Variable group sizes (faster) : larger groups at end (MSB) ) (balance delays 0 and

b i-1

a i-2

b i-2

IFA

a k+1

b k+1

IFA

Part. CPA typ. is RCA, CIA ( multilevel CIA) or CLA

s k-1:0

s i-1:k

s k-1:0

ak

bk

IHA

...

...

High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group) ...

Logic of CPA and incrementer can be merged [11]

10 2 8

1 2

28

a i-1:k b i-1:k ...

c out

c’i ci

...

CPA 86

P i-1:k

"

cia.epsi s’i-1:k 43 mm

3 2

ci

"

ciagate.epsi 100 s i-2 112 mm

s i-1 (i-k-1)IFA + IHA

a k-1:0 b k-1:0

2IFA + IHA

s k+1 IFA + IHA

sk IHA

ck

IHA

0 ck

CPA

c in

...

bits i-1...k

...

bits 6...4

bits 3,2

bit 1

bit 0

+1

s i-1:k

s k-1:0 c out

Computer Arithmetic: Principles, Architectures, and VLSI Design

30

Computer Arithmetic: Principles, Architectures, and VLSI Design

c in

31

4 Addition

4.3 Carry-Propagate Adders (CPA)

Conditional-sum adder (COSA)

Type b) : carries looked ahead before sum bits computed

(i.e. double CPAs are merged at higher levels)

Typically 4-bit blocks used (e.g. standard IC SN74181)

Correct sum bits ( 0&1: or 1&1:) are (conditionally)

0 0 1 0 0 0 2 1 1 0 1 0 0 3 2 2 1 2 1 0 2 3 3 3 2 3 2 1 3

levels of multiplexers

Bit groups of size 2 at level

Higher parallelism, more balanced signal paths

Highest speed-up at highest hardware overhead (2 RCA + more than log MUX/bit)

3

2 log 6

log

4.3 Carry-Propagate Adders (CPA)

Carry-lookahead adder (CLA), traditional

Type a) : optimized multilevel CSLA with log levels selected through log

4 Addition

(g3,p3)

"

2 1 0

. . . c0 (g′,p′) 3 3 c3

(g0,p0)

clbsymbol.epsi 27 CLB 26 mm c′ 0

1 0 0

3 2 1 0

3

...

Hierarchical using 12 log levels : passedarrangement up, 0 passed down between levels 3 3

log2

High speed-up at medium hardware overhead

FA

level 1

...

level 2

FA

...

0

0

1

0

1

0

FA

1

FA

0

a0

FA

0

1

14 4 log 56

b0

0

1

FA

c in

(g15,p15) ... (g12,p12) (g11,p11) ... (g8,p8) c′12

CLB

1

1

c 15 ... c 12

...

CLB

c out

s3

s2

s1

32

4.3 Carry-Propagate Adders (CPA)

Type b) : universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA) Preprocessing, carry-lookahead, and postprocessing step Carries calculated using parallel-prefix algorithms + High regularity : suitable for synthesis and layout

a1 b1 a0 b0

a n-1 b n-1 a n-2 b n-2

...

c 3 ... c 0

33

c1

...

p0

1:0

2

2:0

3

1

1

3:2

3

1

1:0

2

3:0

at level Carry-propagation is prefix problem : : : : 0 0 : : & 1 1 & 1 & &1 : 1 : 1 : : : : ; #$ $ 1 :& & 1 &1 & 1 &1

1 : 1 : : 1 : :0 ; 0 1 1 1 3:0

! ! ! ! !

Parallel-prefix algorithms [11] : multi-tree structures ( log ) 2 sharing subtrees ( log ) different algorithms trading area vs. delay (influences ) also from wiring and maximum fan-out

c0

postprocessing: s0

1

Group variables : : covers bits

Computer Arithmetic: Principles, Architectures, and VLSI Design

%& %& , associative 1 0 %& %& or 1 0 1 0 1 0 0

& ;

1

1 (r.m.a.) 0 0 1 tree structures for evaluation : Associativity of 3 2 1 0 3 2 1 0 , but 2 ? Inputs 1 0 , outputs binary operator [11, 13]

3

carry-lookahead: prefix algorithm

s1

c′0

4.3 Carry-Propagate Adders (CPA)

2

c in

"

s n-2

4 Addition

preprocessing:

add.epsi///figures 73 64 mm

s n-1

CLB

Computer Arithmetic: Principles, Architectures, and VLSI Design

(g0 , p0 )

...

"

+ preprocessing : + postprocessing :

1

5 3 4 2

c out

c′4

CLB

c in

+ High performance : smallest and fastest adders

c n p n-1

c′8

(g3,p3) ... (g0,p0)

c 11 ... c 8 cla.epsi c 7 ... c 4 97 48 mm

+ High flexibility : special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)

(gn-1 , p n-1 )

(g7,p7) ... (g4,p4)

Prefix problem

Parallel-prefix adders (PPA)

...

CLB

s0

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

log

0

FA

"

1

b1

1

cosa.epsi 100 57 mm

1

0

a1

(g′,p′) 3 3

0

b2

(g′,p′) 7 7

a2

(g′11,p′11)

...

b3

(g′15 ,p′15 )

level 0

a3

34

Computer Arithmetic: Principles, Architectures, and VLSI Design

35

4 Addition

4.3 Carry-Propagate Adders (CPA)

Prefix algorithms

4 Addition

4.3 Carry-Propagate Adders (CPA)

Sklansky parallel-prefix algorithm (

Algorithms visualized by directed acyclic graphs (DAG) with array structure ( bits levels)

Tree-like collection, parallel redistribution of carries 1 log log ! 1 2 2

Graph vertex symbols :

& &1 &1 &1 &1 1 :& 1 1 : 1 : : : : : : : : : : : : (contains logic for ) (contains no logic)

: graph depth (number of black nodes on critical path) Serial-prefix algorithm ( RCA) 1 1 ! 2 Performance measures : : graph size (number of black nodes)

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

"

...

ser.epsi///figures 69 38 mm

14 15

Computer Arithmetic: Principles, Architectures, and VLSI Design

36

4.3 Carry-Propagate Adders (CPA)

Kogge-Stone parallel-prefix algorithm ( PPA-KS) very high wiring requirements log 1 log ! 2

"

bk.epsi///figures 67 38 mm

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

37

4.3 Carry-Propagate Adders (CPA)

Mixed serial/parallel-prefix algorithm (

RCA + PPA)

linear size-depth trade-off using parameter #: 0 $#$ 2 log 2

# 0 : serial-prefix graph # 2 log 1 : Brent-Kung parallel-prefix graph fills gap between RCA and PPA-BK (i.e. CLA) in steps

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2

"

ks.epsi///figures 67 52 mm

3

PPA-BK)

Traditional CLA is PPA-BK with 4-bit groups Tree-like redistribution of carries (fan-out tree) 2 log 2 2 log 2 ! log 0 1 2 3 4 5 6

4 Addition

"

sk.epsi///figures 67 30 mm

Brent-Kung parallel-prefix algorithm (

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3

PPA-SK)

of single -operations

1 # 1 # ! var.

4 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

CIA) 2 1 4 1 2 1 4 1 2 ! 1 4 1 2

0 1 2 3 4 5 6 7 8 9 10

Carry-increment parallel-prefix algorithm (

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5

"

cia.epsi///figures 67 34 mm

Computer Arithmetic: Principles, Architectures, and VLSI Design

38

"

var.epsi///figures 68 54 mm

Computer Arithmetic: Principles, Architectures, and VLSI Design

39

4 Addition

4.3 Carry-Propagate Adders (CPA)

Example : 4-bit parallel-prefix adder (PPA-SK) efficient AND-OR-prefix circuit for the generate and AND-prefix circuit for the propagate signals optimization: alternatingly AOI-/OAI- resp. NAND-/ NOR-gates (inverting gates are smaller and faster) can also be realized using two MUX-prefix circuits

Local prefix graph transformation :

3 2 1 0

a2

b3

b2

a1

b1

a0

4.3 Carry-Propagate Adders (CPA)

Prefix adder synthesis

a3

4 Addition

3 3

b0

c in

0 unfact.epsi 1 20 26 mm 2 3

"

3 2 1 0

depth-decr. transform

0 fact.epsi 1 20 26 mm 2 3

"

size-decr. transform

4 2

Repeated (local) prefix transformations result in overall minimization of graph depth or size which sequence ? Goal: minimal size (area) at given depth (delay) Simple algorithm for sequence of applied transforms : Step 1 : prefix graph compression (depth minimization) : depth-decr. transforms in right-to-left bottom-up order Step 2 : prefix graph expansion (size minimization) : size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded

"

askgate.epsi///figures 100 103 mm

Prefix adder synthesis : 1) generate serial-prefix graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prefix logic + Generates all previous prefix graphs (except PPA-KS) + Universal adder synthesis algorithm : generates area-optimal adders for any given timing constraints [11] (including non-uniform signal arrival times)

c out P n-1:0

s3

s2

s1

s0

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

40

4.3 Carry-Propagate Adders (CPA)

Multilevel adders

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

41

4.3 Carry-Propagate Adders (CPA)

Self-timed adders

Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L)

! 1 1for levels + Delay is

– Area increase small for CSKA and CIA, high for CSLA ( COSA)

Average carry-propagation length : log

!

+ RCA is fast in average case ( ˜ log ), slow in worst case suitable for self-timed asynchronous designs [15]

– Completion detection is not trivial

Difficult computation of optimal group sizes

Adder performance comparisons

process

Standard-cell implementations, 0 8

Hybrid adders

Arbitrary combinations of speed-up techniques possible hybrid/mixed adder architectures Often used combinations : CLA and CSLA [14]

area [lambda^2] RCA 128-bit

1e+07

CSKA-2L CIA-1L

– Pure architectures usually perform best (at gate-level)

CIA-2L

64-bit

5

PPA-SK

Transistor-level adders

PPA-BK 32-bit

Influence of logic styles (e.g. dynamic logic, pass-transistor logic faster)

2

"

addperf.ps 84 84 mm

CLA COSA const. AT

16-bit

+ Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [14]

1e+06 8-bit 5

+ Combinations of speed-up techniques make sense – Much higher design effort

2

Many efficient implementations exist and published Computer Arithmetic: Principles, Architectures, and VLSI Design

delay [ns] 5

42

10

20

Computer Arithmetic: Principles, Architectures, and VLSI Design

43

4.3 Carry-Propagate Adders (CPA)

Complexity comparison under the unit-gate model

2 log 4 log 2 log 4 log 2 log

39 28 36 44 3 40 6 56 6

aat 3 — — att att —

4 3 2 3 2 4 3 5 4

log2 log log2 log log2

0

1 0 1 2 ; 0 1 1 (n.)

( )

– Result is in redundant carry-save format ( digits), represented by two -bit numbers (sum bits) and (carry bits)

a 0,n-1 a 1,n-1

a 2,n-1 . . . 67

FA cn

"

csa.epsi 27FA mm

c2

s n-1

FA c1

s1

s0

Multi-operand carry-save adders ( 3) adder array (linear arrangement), adder tree (tree arr.)

44

4.5 Multi-Operand Adders

4.5 Multi-Operand Adders

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

45

4.5 Multi-Operand Adders

a) 4-operand CPA (RCA) array :

a 0,n-1 a 1,n-1

Add three or more ( 2) -bit operands, yield log -bit result in irredundant number rep. [1, 2]

FA a 2,n-1

Realization by array adders : (see figures on next page) a) linear arrangement of CPAs b) linear arr. of CSAs (adder array) and final CPA a) and b) differ in bit arrival times at final CPA : if CPA = RCA : a) and b) have same overall delay if fast final CPA : uniform bit arrival times required CSA array (b) Fast implementation : CSA array + fast final CPA (note: array of fast CPAs not efficient/necessary)

FA

FA

sn

s n-1

HA

a 2,2

a 2,1

a 2,0

"

...

a 3,2

CPA

CPA

HA

a 3,1

a 3,0

FA

FA

HA

s2

s1

s0

CPA

...

FA

a 0,2 a 1,2

A m-1

a 2,n-1

b) 4-operand CSA array with final CPA (RCA) :

...

CSA

FA

cparray.epsi 93 57 mm FA FA

a 3,n-1

a 0,n-1 a 1,n-1

A3

...

FA

A0 A1 A2

FA

...

Array adders

...

a 3,n-1

"

mopadd.epsi CSA 30 58 mm

FA

FA a 3,2

...

"

csarray.epsi 99FA 57 mm

FA a 3,1 FA

FA

CSA

a 3,0 HA

CSA

...

2 2 ! CPA = RCA : ! ! log Fast CPA : ! log

full-adders, constant delay

7 4

Computer Arithmetic: Principles, Architectures, and VLSI Design

S

+ Parallel arrangement of

optimality regarding area and delay aaa : smallest area, longest delay aat : small area, medium delay att : medium area, short delay ttt : large area, shortest delay — : not optimal 2 obtained from prefix adder synthesis 3 automatic logic optimization not possible (redundancy) 4 exact factors not calculated 5 corresponds to 4-bit PPA-BK

C

%

1

4 Addition

"

csasymbol.epsi 21 CSA 26 mm

b) Adds one -bit operand to an -digit carry-save operand

ttt att — — —

A0 A1 A2

2

2

1

a 2,0

log 10 3 log 14 3 log

4 3

2

a 2,0

3 2

2 2 3 4 4

32 3

a 0,0 a 1,0

PPA-SK PPA-BK PPA-KS CLA 5 COSA

1 2 1 3 4 8 1 2 8 1 2 6 1 3 1 4

4

a 0,0 a 1,0

8 8 14 10 10 10

aaa

a 0,0 a 1,0

CSKA-1L CSKA-2L CSLA-1L CIA-1L CIA-2L CIA-3L

2

14

a 2,1

2

a 2,1

7

a) Adds three -bit operands 0 , 1 , 2 performing no carry-propagation (i.e. carries are saved) [1]

a 0,1 a 1,1

RCA

opt.1 syn.2

AT

a 0,1 a 1,1

T

4.4 Carry-Save Adder (CSA)

a 0,1 a 1,1

A

4.4 Carry-Save Adder (CSA)

a 2,2

adder

4 Addition

a 0,2 a 1,2

4 Addition

CPA

FA

FA

sn

s n-1

FA

HA

s2

s1

CPA

...

s0

S Computer Arithmetic: Principles, Architectures, and VLSI Design

46

Computer Arithmetic: Principles, Architectures, and VLSI Design

47

4 Addition

4.5 Multi-Operand Adders

4 Addition

(m, 2)-compressors

a0

7 2 4 2 6 log 1

a m-1

... m-4 c out

"

cprsymbol.epsi 37 (m,2) 26 mm

c

...

0 c out

...

&4

2

10 & &

4 % 0 0

4.5 Multi-Operand Adders

c in0 c inm-4

Optimized (4, 2)-compressor :

2 full-adders merged and optimized (i.e. XORs

s

1-bit adders (similar to (m, k)-counters) [16]

Compresses bits down to 2 by forwarding 3

arranged in tree structure)

14 8

intermediate carries to next higher bit position

Is bit-slice of multi-operand CSA array (see prev. page)

% #)

+ No horizontal carry-propagation (i.e.

14 6

Built from full-adders (= (3, 2)-compressor) or (4, 2)-compressors arranged in linear or tree structures

FA

"

cpr42fa.epsi 32 38 mm

c out

0

c in

FA

"

cpr42opt.epsi 1 41 53 mm

c in

c out

"

a 0,0 a 1,0 a 2,0 a 3,0

(4,2) cpradd.epsi 99 44 mm

a 0,1 a 1,1 a 2,1 a 3,1

a 0,2 a 1,2 a 2,2 a 3,2

a 0,n-1 a 1,n-1 a 2,n-1 a 3,n-1

0

(4,2)

(4,2)

(4,2)

a2 a3

a0 a1 a2 a3

Example : 4-operand adder using (4, 2)-compressors

a0 a1

c

1

s

with full-adders

c

s

optimized

CSA

+ same area, 25% shorter delay

FA

FA

FA

HA

sn

s n-1

s2

s1

CPA

SD-FA (signed-digit full-adder) is similar to (4, 2)-compressor regarding structure and complexity

s n+1

s0

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

48

4.5 Multi-Operand Adders

Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors :

Computer Arithmetic: Principles, Architectures, and VLSI Design

4 Addition

49

4.5 Multi-Operand Adders

Tree adders (Wallace tree)

higher compression rate (4:2 instead of 3:2) less deep and more regular trees

Adder tree : -bit -operand carry-save adder composed of tree-structured (m, 2)-compressors [1, 17]

tree depth

Tree adders : fastest multi-operand adders using an adder tree and a fast final CPA

# operands

012 3 4 5 6 FA (4,2)

42 16

0 c out

FA

a0a1a2a3 0 c out

c in0 c in1

1 c out

FA

2 c out

"

FA

cpr82fa.epsi 47 65 mm

3 c out

FA

4 c out

FA c

42 12

a4a5 a6a7

FA

! log 2 2 !log log

2 3 4 6 9 13 19 28 42 63 94 2 4 8 16 32 64 128

Example : (8, 2)-compressor

a0a1 a2a3

7 8 9 10

c in2

a4a5a6a7

(4,2)

(4,2)

3 c out

c in0

"

cpr82cpr42.epsi 47 50 mm

c in3 4 c out

c in4

(4,2)

c

Some FA can often be replaced by HA or eliminated (i.e. redundant due to constant inputs)

c in2

Number of (irredundant) FA does not depend on adder structure, but number of HA does

c in3

An

c in1

1 c out 2 c out

Adder arrays and adder trees revisited

c in4

s

-operand adder accomodates 1 carry inputs ! Adder! trees ( log ) are faster than adder ! arrays ( ) at same amount of gates ( )

Adder trees are less regular and have more complex routing than adder arrays larger area, difficult layout (i.e. limited use in layout generators)

(4, 2)-compressor tree

s

full-adder tree Computer Arithmetic: Principles, Architectures, and VLSI Design

50

Computer Arithmetic: Principles, Architectures, and VLSI Design

51

4 Addition

4.6 Sequential Adders

5 Simple / Addition-Based Operations

5.1 Complement and Subtraction

4.6 Sequential Adders

5 Simple / Addition-Based Operations

Bit-serial adder : Sequential -bit adder

5.1 Complement and Subtraction

ai bi

2’s complementer (negation)

1

"

bitseradd.epsi FA 25 27 mm

Accumulators : Sequential With CPA

2’s complement subtractor

1

"

S

2’s complement adder/subtractor

1

CSA

"

accucsa.epsi 33 52 mm

52

5.2 Increment / Decrement

5.2 Increment / Decrement

sub

mod 2% 1

B

A

c out

"

addmod.epsi 29 CPA 28 mm

c in

(end-around carry) S Computer Arithmetic: Principles, Architectures, and VLSI Design

5 Simple / Addition-Based Operations

53

5.2 Increment / Decrement

AND-prefix struct. : : 1 : 2 log 2 1 log2

Prefix problem :

1

Incrementer

Adds a single bit %to an -bit operand 2% % A incsymbol.epsi +1 29 " 26 mm c 1 ; 0 1 0 % % (r.m.a.) Z Corresponds to addition with 0 ( FA HA)

2

out

log

2

%

Decrementer a n-1

a2

a1

"

z n-1

2

z n-1

z2

z1

z0

% 1 %

Incrementer-decrementer

HA

z1

c in

...

a0

"

a0

dec.epsi 93 41 mm

c out

...

incfa.epsi 59c 23HA mm c 2 1

a1

...

Example : Ripple-carry incrementer using half-adders

3 1 3

c in

...

CPA

c out

1’s complement adder

S

5 Simple / Addition-Based Operations

c n-1

"

addsub.epsi 36 35 mm

S CPA

Computer Arithmetic: Principles, Architectures, and VLSI Design

HA

1

A B

Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size

c out

c out

A

4

a n-1

"CPA

sub.epsi 29 32 mm

S

cycles for evaluation

B

A

accucpa.epsi CPA 27 28 mm

Allows higher clock rates Final CPA too slow : pipelining or multiple

+1

Z

A

With CSA and final CPA

"

neg.epsi 21 32 mm1

si

-operand adders

A

c in

a2

a n-1

z0

a1

a0

or using incrementer slices (= half-adder) a n-1

a2

a1

dec

a0

...

...

c out

"

inc.epsi 83 33 mm ...

"

incdec.epsi 94 46 mm

c out

c in

c in

...

HA z n-1

z2

z1

Computer Arithmetic: Principles, Architectures, and VLSI Design

z0

z n-1 54

z2

z1

Computer Arithmetic: Principles, Architectures, and VLSI Design

z0 55

5 Simple / Addition-Based Operations

5.2 Increment / Decrement

Fast incrementers

5 Simple / Addition-Based Operations

5.2 Increment / Decrement

Gray incrementer

4-bit incrementer using multi-input gates : a2

a3

a1

Increments in Gray number system

0 %&1 %&2 0 (parity) 1 ; 0 3 (r.m.a.) 0 0 0 &1 &1 ; 1 2 %&1 %&1 %&2

a0

c in

"

inccg.epsi 62 39 mm

c out z3

z2

z1

z0

Prefix problem

AND-prefix structure

8-bit parallel-prefix incrementer (Sklansky AND-prefix structure) : a7

a6

a5

a4

a3

a2

a1

a0 c in

"

incpp.epsi 98 63 mm

c out

z7

z6

z5

z4

z3

z2

z1

z0

Computer Arithmetic: Principles, Architectures, and VLSI Design

56

5 Simple / Addition-Based Operations

5.3 Counting

5.3 Counting

Count clock cycles counter, divide clock frequency frequency divider ( )

Computer Arithmetic: Principles, Architectures, and VLSI Design

5 Simple / Addition-Based Operations

57

5.3 Counting

! Fast divider ( 1 ) using delayed-carry numbers (irredundant carry-save represention of

1 allows using fast carry-save incrementer) [8]

Gray counter

Binary counter Sequential in-/decrementer Incrementer speed-up techniques applicable Down- and up-down-counters using decrementers / incrementer-decrementers

Counter using Gray incrementer +1 cntblock.epsi 32 33 mm

c out

"

c in

Ring counters Shift register connected to ring :

clk Q

Example : Ripple-carry up-counter using counter slices (= HA + FF), is count enable

%

"

cntring.epsi 51 16 mm

q n-1

q2

q1

q0

FF for counting states Must be initialized correctly (e.g. 00 01) State is not encoded

c out

c in

"

Applications: fast dividers (no logic between FF) state counter for one-hot coded FSMs

cntripple.epsi 87 36 mm

...

q2

q n-1

q1

q0

Johnson / twisted-ring counter (inverted feed-back) :

Asynchronous counter using toggle-flip-flops (lower toggle rate lower power) T

T

...

T

clk

"

cntasync.epsi 64 18 mm

q n-1

q2

q1

q0

Computer Arithmetic: Principles, Architectures, and VLSI Design

"

cntjohnson.epsi 59 16 mm

T

58

q n-1

q2

q1

q0

FF for counting 2 states

Computer Arithmetic: Principles, Architectures, and VLSI Design

59

5 Simple / Addition-Based Operations

5.4 Comparison, Coding, Detection

5.4 Comparison, Coding, Detection

$

Subtractor ( :

%&1:0

(equal) (not equal) (greater or equal) (less than) (greater than) (less or equal)

(for free in PPA)

" CPA

1

EQ = P n-1:0

7 2 or & 3 log & 2 log 2

a n-1 b n-1

EQ

a0 b0

"

cmpeq.epsi 40 36 mm

removing redundancies in subtractor (unused ) single-tree structure speed-up at no cost : 6 2

2 log example : ripple comparator using comparator slices a1 b1

a0 b0

a1 b1

a2 b2

a n-1 b n-1 ...

GE = c out

Optimized comparator :

1

;

1 0 1

0% (r.s.a.)

B

cmpsub.epsi 37 31 mm

Equality comparison

A

a2 b2

5.4 Comparison, Coding, Detection

Comparators

Comparison operations

5 Simple / Addition-Based Operations

Magnitude comparison

% 1

0

1

;

0 (r.s.a.)

"

cmpripple.epsi 100 47 mm

1

equality

EQ

60

5.4 Comparison, Coding, Detection

%& to vector & ( 2%) 1:0 1 if 1:0 0 else ; 0 1

2

Decoder Decodes binary number

a2

A

a1

"

Z

a0

"

decoder.epsi 58 28 mm

decodersym.epsi 21decoder 26 mm

z6

z5

z4

z3

z2

z1

z0

"

a7a5a3a1 a6a4a2a0

"

encoder.epsi 30 34 mm

Detection operations

%&1 %&2 0 All-ones detection : %&1 %&2 0 (r.s.a.) log All-zeroes detection :

0 1 01

0 1 0

a n-1

a n-2

000100)

...

a1

" 2 z z z prefix problem (r.m.a.) AND-prefix structure

(e.g. 000101

a0

lzdnenc.epsi 50 28 mm ...

n-1

n-2

1

z0

b) encoded output : + encoder

z1

signed numbers : + leading-ones detector (LOZ)

Z

2%&1 1 1

5.4 Comparison, Coding, Detection

a) non-encoded output :

& %& % # #

encodersym.epsi 21encoder 26 mm

5 Simple / Addition-Based Operations

61

for scaling, normalization, priority encoding

Encoder Encodes vector 1:0 to binary number 2 ) 1:0 ( (condition: if then 1 else 0) if 1; 0 1 log2 A

Computer Arithmetic: Principles, Architectures, and VLSI Design

Leading-zeroes detection (LZD) :

12% log z7

magnitude

GE

Computer Arithmetic: Principles, Architectures, and VLSI Design

5 Simple / Addition-Based Operations

equality & magnitude

...

z0

z2

(note: connections according to PPA-SK)

Computer Arithmetic: Principles, Architectures, and VLSI Design

62

Computer Arithmetic: Principles, Architectures, and VLSI Design

63

5 Simple / Addition-Based Operations

5.5 Shift, Extension, Saturation

5.5 Shift, Extension, Saturation

#

#

Rotation by bit positions, constant (logic operation) Extension of word lengths by bits ( ) (i.e. sign-extension for signed numbers) Saturation to highest/lowest value after over-/underflow

#

shift a)

unl. signed r. signed l. r.

shift b)

unsigned signed

rotate

l. r.

extend

unl. signed r. signed l. r.

saturate unsigned signed

#

%&2 0 0 0 %&1 1 %&1 %&3 0 0 %&1 %&1 %&2 1 %&1 2%&1 %&2 %&2 0 %&1 0 %&1 1 0 %&1 0 %&1 0 0 %&1 %&1 %&2 0 %&1 %&2 0 0 %&1 %&1 % &1 %&1 %&1

Computer Arithmetic: Principles, Architectures, and VLSI Design

5 Simple / Addition-Based Operations

sll srl sla sra

formula

% % %&1 %% % %% % : 0 %&1

! 2 !log a3

a2

a1

a3

a0

"

s1 z3

z2

s1 s0

5.6 Addition Flags

a1

a0

"

barshift.epsi 44 49 mm

s1 s0 s1 s0

z1

z0

z3

multiplexers 64

a2

s1 s0

muxshift.epsi 41 28 mm

s0

z2

z1

z0

tristate buffers

Computer Arithmetic: Principles, Architectures, and VLSI Design

5 Simple / Addition-Based Operations

65

5.6 Addition Flags

Basic and derived condition flags description carry flag signed overflow flag

condition operation: 0 00

zero flag negative flag, sign

: for free : fast , 1 computed by e.g. PPA very cheap : a) 1 (subtract.) : 1:0 (of PPA)

% %& %

%& b) % 0 1 :

%&1 %&2 0 (r.s.a.) 1) log

2)

rol ror

Implementation of adder with flags

,

Implementation of shift/extension/rotation by constant values : hard-wired variable values : multiplexers possible values : –by– barrel-shifter/rotator Example : 4–by–4 barrel-rotator

5.6 Addition Flags flag

5.5 Shift, Extension, Saturation

Applications : adaption of magnitude (shift a)) or word length (extension) of operands (e.g. for addition) multiplication/division by multiples of 2 (shift) logic bit/byte operations (shift, rotation) scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of floating-point numbers, shift a)) using LZD reducing error after over-/underflow (saturation)

Shift : a) shift -bit vector by bit positions b) select out of more bits at position also: logical (= unsigned), arithmetic (= signed)

#

5 Simple / Addition-Based Operations

faster without final sum (i.e. carry prop.) [18] example : 01001 1 00 0 10110 1 00 00000 0 00

0 0 0 % &1 &1

%&1 %&2 0 ; 0 1 (r.s.a.) 3 4 log

flag

( ) zero negative positive overflow underflow

operation:

$

unsigned or

formula signed

( )

—

—( )

( )

Unsigned and signed addition/subtraction only differ with respect to the condition flags

Computer Arithmetic: Principles, Architectures, and VLSI Design

66

Computer Arithmetic: Principles, Architectures, and VLSI Design

67

5 Simple / Addition-Based Operations

5.7 Arithmetic Logic Unit (ALU)

6.1 Multiplication Basics

6 Multiplication

5.7 Arithmetic Logic Unit (ALU) A

6 Multiplication

B

6.1 Multiplication Basics

and [1, 2] is 2 -bit unsigned number or 2 1 -bit

c out alusymbol.epsi c in 29 mm 30 ALU flags op

Multiplies two -bit operands

"

Product signed number

Z

% &1 % &1 % &1 % &1

2 2 2 or 0 0 0 0 % & 1

2 ; 0 1 (r.s.a.) 0

Example : unsigned multiplication ALU operations arithmetic

add inc pass

logic

and or xor pass

shift/ rotate

sll sla rol

% 1 11 1

% 1

11 1

sub dec neg nand nor xnor not

Algorithm 1) Generation of

2) Adding up partial products :

srl sra ror

a) sequentially (sequential shift-and-add), b) serially (combinational shift-and-add), or c) in parallel

s/ro : shift/rotate ; l/r : left/right ; l/a : logic (unsigned) / arithmetic (signed)

Speed-up techniques

Logic of adder/subtractor can partly be shared with logic operations Computer Arithmetic: Principles, Architectures, and VLSI Design

6 Multiplication

68

6.1 Multiplication Basics

Sequential multipliers : partial products generated and added sequentially (using accumulator)

! !log

partial products

Reduce number of partial products Accelerate addition of partial products Computer Arithmetic: Principles, Architectures, and VLSI Design

6 Multiplication

69

6.2 Unsigned Array Multiplier

6.2 Unsigned Array Multiplier Braun multiplier : array multiplier for unsigned numbers

×

% &1 % &1

2 0 0

"

mulseq.epsi 34 28 mm

Array multipliers : partial products generated and added simultaneously in linear array (using array adder)

! 2 !

CPA

1 3 2 3 2 2 3 3 3 2 3 1

× CSA

× CSA

"

mularr.epsi × 34 47 mm CSA

7

×

6

5

b3

CSA

0 3 1 2 2 1 3 0

4

3

b2

8 2 11 6 9 0 2 0 1 0 0 1 1 1 0 2 0 2

1

b1

b0

a0

CPA p0

a1

Parallel multipliers : partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)

! 2 !log

×

×

×

HA

1

HA

HA p1

a2

× mulpar.epsi 34 43 mm

" CSA

FA

tree

"

mulbraun.epsi FA 99 83 mm

FA p2

a3

CPA

2

Signed multipliers : a) complement operands before and result after multiplication unsigned multiplication b) direct implementation (dedicated multiplier structure)

Computer Arithmetic: Principles, Architectures, and VLSI Design

0

70

FA

FA

FA

CSA

p3

CPA

3 p7

FA

FA

HA

p6

p5

p4

Computer Arithmetic: Principles, Architectures, and VLSI Design

71

6 Multiplication

6.3 Signed Array Multipliers

6.3 Signed Array Multipliers

6 Multiplication

6.4 Booth Recoding Speed-up technique : reduction of partial products

Modified Braun multiplier

special FAs [1] % 2

1 neg. bit :

2 neg. bits : % 2

Replace FAs in regions

% % % 1 , 2 , and 3 by :

(input at mark ) Subtract bits with negative weight

Otherwise exactly same structure and complexity as Braun multiplier efficient and flexible

Sequential multiplication

+ One cycle per non-zero partial product (i.e. 0) Minimal (or canonical) signed-digit (SD) represent. of

– Negative partial products – Data-dependent reduction of partial products and latency Combinational multiplication Only fixed reduction of partial product possible

Radix-4 modified Booth recoding : 2 bits recoded to one multiplier digit 2 partial products

%2

2 221) 22 ; &1 0 0 (2 &1& &

Arithmetic transformations yield the following partial products (two additional ones) :

0 3 0 2 0 1 0 0 1 3 1 2 1 1 1 0 2 3 2 2 2 1 2 0 3 3 3 2 3 1 3 0 3 3

1

3 6

7

5

3 3

4

2

1

1 2 2&1

2

0 0 0 0 1 1 1 1

0

– Less efficient and regular than modified Braun multiplier Computer Arithmetic: Principles, Architectures, and VLSI Design

6 Multiplication

72

6.4 Booth Recoding

Applicable to sequential, array, and parallel multipliers

: 8 : 7

– additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation) + adder array/tree cut in half considerably smaller (array and tree)

3

3

3

2

1

0

ext. sign

03 13 23 33 6

03 13 23 32 5

03 13 22 31 4

03 12 21 30 3

02 11 20

01 10

2

1

00

0 0 0 3 1

1 1 1 3

0

33 6

2

1

0

2

1

0

03 12 21 30 3

02 11 20

01 10

23 32 5

2

1

0

×

×

×

× mulbooth.epsi 41 43 mm

2 2

"

CSA array/tree CPA

0

Computer Arithmetic: Principles, Architectures, and VLSI Design

6 Multiplication

73

6.6 Multiplier Implementations

6.5 Wallace Tree Addition

Applicable to parallel multipliers : parallel partial product generation (normal or Booth recoded) – Irregular adder tree (Wallace tree) due to different number of bits per column irregular wiring and/or layout non-uniform bit arrival times at final adder

00

Braun or Baugh-Wooley multiplier (array multiplier) : medium performance, high area, high regularity layout generators data paths and macro-cells simple pipelining, faster CPA higher speed

0

Booth-Wallace multiplier (parallel multiplier) [9] : high performance, high area, low regularity custom multipliers, netlist generators often pipelined (e.g. register between CSA-tree and CPA)

for unsigned multiplication : % 0

Radix-8 (3-bit recoding) and higher radices : precomputing 3 , larger overhead

Computer Arithmetic: Principles, Architectures, and VLSI Design

2

Sequential multipliers : low performance, small area, resource sharing (adder)

Suited for signed multiplication (incl. Booth recod.) Extend

1

6.6 Multiplier Implementations

1

13 22 31 4

10

! 2 !log

Negative partial products (avoid sign-extension) : 3

0 1 0 1 0 1 0 1

2

Speed-up technique : fast partial product addition

: 2 : 2 : 0

much faster for adder arrays slightly or not faster for adder trees

0 0 1 1 0 0 1 1

Booth recoding

Baugh-Wooley multiplier

6.4 Booth Recoding

Signed-unsigned multiplier : signed multiplier with 0) operands extended by 1 bit ( 1 0, 1

% %& % %&

74

Computer Arithmetic: Principles, Architectures, and VLSI Design

75

6 Multiplication

6.8 Squaring

6.7 Composition from Smaller Multipliers

multiplier can be composed from 4 2 2-bit-bitmultipliers (can be repeated recursively) 2% 2%

22% 2% 4 -bit multipliers + 2 -bit CSA + 3 -bit CPA less efficient (area and speed)

2 : multiplier optimizations possible 0 3 0 1 0 1 1 0 2 3 1 2 3 12 21 3 3 2 3 1 3 2 3 1 3 0 3 0 0 1 0 0 3 3 1 2 1 1 2 2

+

2 1partial products (if no Booth recoding used) optimized squarer more efficient than multiplier 6

7.1 Division Basics

7 Division / Square Root Extraction 7.1 Division Basics

;

rem (remainder) 0 0 22% 1 0 2% 1 % %, otherwise overflow 2 2 normalize before division (2%&1 2% 1)

Algorithms (radix-2) Subtract-and-shift : partial remainders

6.8 Squaring

7

7 Division / Square Root Extraction

5

4

3

2

1

0

Computer Arithmetic: Principles, Architectures, and VLSI Design

76

if 2 0

7.3 Non-Restoring Division

7.2 Restoring Division

1

2 0

1 11 ifif 11 00 1 0 : 1 1 2 2 &1 1 1 2 0 : &1 1 1 2 &1 1 2&1

Basic algorithm : compare and conditionally subtract expensive comparison and CPA

Restoring division : subtract and conditionally restore (adder or multiplexer) expensive CPA and restoring

Non-restoring division : detect sign, subtract/add, and correct by next steps expensive CPA

Computer Arithmetic: Principles, Architectures, and VLSI Design

7 Division / Square Root Extraction

7.4 Signed Division

77

1 same sign 1 if 1 if 1 opposite sign

7.4 Signed Division

1

2 0 : 0 (restored) 1

1 1 1 2&1 0 : &1 1 &1 1 2&1

7.3 Non-Restoring Division

0

1

0 if

non-associative

1 2 1 2 % ; 1 0 (r.m.n.)

SRT division : estimate range, subtract/add (CSA), and correct by next steps inexpensive CSA

Table look-up (ROM) less efficient for every

7 Division / Square Root Extraction

Sequential algorithm : recursive,

[1, 2]

Example : signed non-restoring array divider 0, final correction of omitted) (simplifications:

9 2 2 2 4

a6 ⊕ b3

b3

a6

a5

b2

a4

b1

b0

a3

One subtraction/addition (CPA) per step Final correction step for (additional CPA) Simple quotient digit conversion : (note: irredundant)

1 1 0 1 : 1 1 2

%&1 %&2 %&3 0 1

1 2 or ! 2 log

! 1 !

2 or ! log

q3

FA

FA

FA

FA

a2 q2

FA

FA

"

FA

FA

divarray.epsi 81 101 mm

a1 q1

A B

Q

≥ +/− CPA ≥ +/− CPA divnr.epsi +/− CPA 46 ≥38 mm ≥ +/− CPA ≥ +/− CPA

"

R Computer Arithmetic: Principles, Architectures, and VLSI Design

78

FA

FA

FA

FA

a0 q0

FA

FA

FA

FA

r3

r2

r1

r0

Computer Arithmetic: Principles, Architectures, and VLSI Design

79

7 Division / Square Root Extraction

7.5 SRT Division

7.6 High-Radix Division

1 1 if 2 $

0 if 2 $1 2 is SD number 1 if 1 2 %& % If 2 1 $ 2 , i.e. is normalized : 2 $ 2%&1 $1 2%&1 $2 %&1 $1 1 if 2 0 if 2%&1 $1 2%&1 1 if 2%&1 1 are estimated CSA + Only 3 MSB are compared

Radix

2

! 2 !

CPA

Division by convergence

& 1

0 1 &1 1 1 resp. 2% 0 1 1 2%1

1 2%1

2 1 2%

1 2&% 2 2&% 1 (signed) Algorithm : 1 1 ; 0 1 1 (r.s.n.)

≥ +/− CSA ≥ +/− CSA divsrt.epsi ≥ mm +/− CSA 50 38 ≥ +/− CSA ≥ +/− CPA

"

0

Quadratic convergence :

R

7 Division / Square Root Extraction

7.7 Division by Multiplication

A B

Computer Arithmetic: Principles, Architectures, and VLSI Design

Q

– Complex comparisons (more bits) and decisions table look-up ( Pentium bug!)

instead of CPA can be used (precise enough) [19] Correction in following steps (+ final correction step) – Redundant representation of (SD representation) final conversion necessary (CPA) + Highly regular and fast ( ) SRT array dividers only slightly slower/larger than array multipliers

7.7 Division by Multiplication

2, 1 1 0 1 1 quotient bits per step fewer, but more complex steps + Suitable for SRT algorithm faster

7.5 SRT Division (Sweeney, Robertson, Tocher)

!

7 Division / Square Root Extraction

80

7.8 Remainder / Modulus

Division by reciprocation

0

log

Computer Arithmetic: Principles, Architectures, and VLSI Design

7 Division / Square Root Extraction

81

7.9 Divider Implementations

7.9 Divider Implementations

1

Iterative dividers (through multiplication) :

resource sharing of existing components (multiplier) medium performance, medium area high efficiency if components are shared

0 by recursion 1 find 1 1 1

0

Newton-Raphson iteration method :

Sequential dividers (restoring, non-restoring, SRT) :

resource sharing of existing components (e.g. adder) low performance, low area

2

2 ; 0 1 (r.s.n.)

Algorithm : 1

0

Array dividers (restoring, non-restoring, SRT) :

!log

Quadratic convergence :

Speed-up : first approximation

from table 0

7.8 Remainder / Modulus

rem sign sign

Remainder (rem) : signed remainder of a division

ifelse 0

dedicated hardware component high performance, high area high regularity layout generators, pipelining square root extraction possible by minor changes combination with multiplication or/and square root

No parallel dividers exist, as compared to parallel multipliers (sequential nature of division)

Modulus (mod) : positive remainder of a division

mod

0

Computer Arithmetic: Principles, Architectures, and VLSI Design

82

Computer Arithmetic: Principles, Architectures, and VLSI Design

83

7 Division / Square Root Extraction

7.10 Square Root Extraction

7.10 Square Root Extraction

and quotients %& 0[1] 2

1 2 2 21 2 2 1 2 1 2 2 1 2 1 2 2 2 2 ; 1 0 % 1 % 0 1 (r.m.n.) 0 0

Subtract-and-shift : partial remainders 2 0 1 1

Table look-up : inefficient for large word lengths [5] Taylor series expansion : complex implementation Polynomial and rational approximations [1, 5] Shift-and-add algorithms [5]

Implementation

Convergence algorithms [1, 2] :

similar to division-by-convergence two (or more) recursive formulas : one formula

+ Similar to division same algorithms applicable (restoring, non-restoring, SRT, high-radix) + Combination with division in same component possible

converges to a constant, the other to the result

A

Only triangular array required 0) (step :

2

Coordinate rotation (CORDIC) [2, 5, 20] :

3 equations for x-, y-coordinate, and angle computes all elementary functions by proper input

+/− CPA sqrtnr.epsi +/− CPA 42 36+/− mmCPA +/− CPA +/− CPA

"

Q

settings and choice of modes and outputs

simple, universal hardware, small look-up table

R Computer Arithmetic: Principles, Architectures, and VLSI Design

8 Elementary Functions

84

8.2 Integer Exponentiation

8.2 Integer Exponentiation

Approximated exponentiation : ln 2 log 1 0 Base-2 integer exponentiation : 2 0

!

!

Integer exponentiation (exact) :

% 0 2 1

1

1

1

1

1

0

2

2

8.3 Integer Logarithm

12

1 2 1 0 1 2

2

2

1

2

0

; 1 0 2 1 % 1 0 (r.s.n.) 2 1

log2

2 2

2 2 4 2

b)

85

8 Elementary Functions

8.3 Integer Logarithm

Algorithms : square-and-multiply

Computer Arithmetic: Principles, Architectures, and VLSI Design

(!)

Applications : modular exponentiation mod in cryptographic algorithms (e.g. IDEA, RSA)

a)

(exp )

8.1 Algorithms

!

Logarithm function : ln , log Trigonometric functions : sin , cos , tan Inverse trig. functions : arcsin , arccos , arctan Hyperbolic functions : sinh , cosh , tanh Exponential function :

Algorithm

8.1 Algorithms

8 Elementary Functions

2 0 22% 1 0 2% 1

8 Elementary Functions

2

1

0

For detection/comparison of order of magnitude Corresponds to leading-zeroes detection (LZD) with encoded output

&1 1 2 ; 0 1 &1 1 0 %&1 (r.s.n.) 2 or 2

Computer Arithmetic: Principles, Architectures, and VLSI Design

86

Computer Arithmetic: Principles, Architectures, and VLSI Design

87

9 VLSI Design Aspects

9.1 Design Levels

9 VLSI Design Aspects

9.1 Design Levels

Gate-level design

9 VLSI Design Aspects

Cell-based design techniques : standard-cells, gate-array/ sea-of-gates, field-programmable gate-array (FPGA)

9.1 Design Levels Transistor-level design

Circuit implemented by hand or by synthesis (library)

Circuit and layout designed by hand (full custom)

Layout implemented by automated place-and-route

Low design efficiency

Medium to high design efficiency

High circuit performance : high speed, low area

Medium to low circuit performance

High flexibility : choice of architecture and logic style

Medium to low flexibility : full choice of architecture

Transistor-level circuit optimizations :

logic style : static vs. dynamic logic,

Block-level design

complementary CMOS vs. pass-transistor logic

special arithmetic circuits : better than with gates gi

ci

carry chain :

c out

ki

a

b

a

"p

c i-1 carrychain.epsi 54 17 mm

a

b

High design efficiency

g i-1

k i-1 p i-1

i

Layout blocks and netlists from parameterized automatic generators or compilers (library)

c in

Medium to high circuit performance

c in

Low flexibility : limited choice of architectures Implementations :

a b

fulladder :

c in

b

c in

b

"

facmos.epsi 76 40 mm

c in

s

c in

c out

b a

b

a

a

b

c in

a

Computer Arithmetic: Principles, Architectures, and VLSI Design

9 VLSI Design Aspects

data-path : bit-sliced, bus-oriented layout (array of cells: bits operations), implementation of entire data paths, medium performance, medium diversity macro-cells : tiled layout, fixed/single-operation components, high performance, small diversity portable netlists : gate-level design

88

9.2 Synthesis

Computer Arithmetic: Principles, Architectures, and VLSI Design

9 VLSI Design Aspects

89

9.3 VHDL

9.2 Synthesis

9.3 VHDL

High-level synthesis

Arithmetic types : unsigned, signed (2’s complement)

Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL Involves architectural synthesis and arithmetic transformations

Arithmetic packages numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys) contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types

High-level synthesis is still in the beginnings

Arithmetic operators (VHDL’87/93) [21]

Low-level synthesis

relational shift, rotate (’93 only) adding sign (unary) multiplying exponent, absolute

Layout and netlist generators Included in libraries and synthesis tools Low-level synthesis is state-of-the-art Basis for efficient ASIC design Limited diversity and flexibility of library components

: : : : : :

=, /=, <, <=, >, >= rol, ror, sla, sll, sra, srl +, +, *, /, mod, rem **, abs

Synthesis

Circuit optimization Efficient optimization of random logic is state-of-the-art

Typical limitations of synthesis tools :

/, mod, rem : both operands must be constant or divisor

Optimization of entire arithmetic circuits is not feasible only local optimizations possible Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators

Computer Arithmetic: Principles, Architectures, and VLSI Design

must be a power of two ** : for power-of-two bases only Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)

90

Computer Arithmetic: Principles, Architectures, and VLSI Design

91

9 VLSI Design Aspects

9.3 VHDL

Resource sharing

Pipelining

Pipelining is basically possible with every combinational circuit higher throughput

2 adders + 1 multiplexer = ’1’ else B; 1 multiplexer + 1 adder

a)

S <= A + C when SELA = ’1’ else B + C;

b)

T <= A when SELA S <= T + C;

Addition : single adder with carry-in/carry-out : resize(A, width+1) & Cin; resize(B, width+1) & ’1’; Aext + Bext; Sext(width downto 1); Sext(width+1);

Pipelining of arithmetic circuits can be very costly :

(except for smaller latency) Fine-grain pipelining arithmetic circuits)

Synthesis : check synthesis result for allocated arithmetic units code sanity check, control of circuit size

systolic arrays (often applied to

High speed Fast circuit architectures, pipelining, replication (parallelization), and combinations of those

VHDL library of arithmetic units

Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment

Structural, synthesizable VHDL code for most circuits described in this text is found in [22] Computer Arithmetic: Principles, Architectures, and VLSI Design

9 VLSI Design Aspects

Arithmetic circuits are well suited for pipelining due to high regularity

large amount of internal signals in arithmetic circuits array structures : many small pipeline registers tree structures : few large pipeline registers no advantage of tree structures anymore

Coding & synthesis hints

<= <= <= <= <=

9.4 Performance

9.4 Performance

Sharing one resource for multiple operations Done automatically by some synthesis tools Otherwise, appropriate coding is necessary :

Aext Bext Sext S Cout

9 VLSI Design Aspects

92

9.4 Performance

Computer Arithmetic: Principles, Architectures, and VLSI Design

9 VLSI Design Aspects

93

9.5 Testability

Low power

9.5 Testability

Power-related properties of arithmetic circuits :

Testability goal : high fault coverage with few test vectors that are easy to generate/apply

High glitching activity due to high bit dependencies and large logic depth

Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits

Power reduction in arithmetic circuits [23] : Reduce the switched capacitance by choosing an area efficient circuit architecture Allow for lower supply voltage by speeding up the circuitry

Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits Hard-detectable faults found in :

Reduce the transition activity :

apply stable inputs while circuit is not in use (

circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags

disabling subcircuits)

reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize) reduce glitching transitions by reducing logic depth (pipelining) take advantage of correlated data streams choose appropriate number representations (e.g. Gray codes for counters)

circuits using redundant number representations ( redundant hardware) : dividers (Pentium bug!)

Computer Arithmetic: Principles, Architectures, and VLSI Design

94

Computer Arithmetic: Principles, Architectures, and VLSI Design

95

Bibliography

Bibliography

Bibliography

[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.

[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993. [2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979. [3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.

[13] T. Han and D. A. Carlson, “Fast area-efficient VLSI adders”, in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49–56.

[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.

[14] D. W. Dobberpuhl et al., “A 200-MHz 64-b dual-issue CMOS microprocessor”, IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555–1564, Nov. 1992.

[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhauser Boston, 1997. [6] Proceedings of the Xth Symposium on Computer Arithmetic. [7] IEEE Transactions on Computers. [8] D. R. Lutz and D. N. Jayasimha, “Programmable modulo-k counters”, IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939–941, Nov. 1996. [9] H. Makino et al., “An 8.8-ns 54 54-bit multiplier with high speed redundant binary architecture”, IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773–783, June 1996. [10] W. N. Holmes, “Composite arithmetic: Proposal for a new standard”, IEEE Computer, vol. 30, no. 3, pp. 65–73, Mar. 1997. Computer Arithmetic: Principles, Architectures, and VLSI Design

96

Bibliography

[18] J. Cortadella and J. M. Llaberia, “Evaluation of A + B = K conditions without carry propagation”, IEEE Trans. Comput., vol. 41, no. 11, pp. 1484–1488, Nov. 1992. [19] S. E. McQuillan and J. V. McCanny, “Fast VLSI algorithms for division and square root”, J. VLSI Signal Processing, vol. 8, pp. 151–168, Oct. 1994. [20] Y. H. Hu, “CORDIC-based VLSI architectures for digital signal processing”, IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16–35, July 1992. [21] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997. [22] R. Zimmermann, “VHDL Library of Arithmetic Units”, http://www.iis.ee.ethz.ch/˜zimmi/arith lib.html. [23] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.

Computer Arithmetic: Principles, Architectures, and VLSI Design

[12] A. Tyagi, “A reduced-area scheme for carry-select adders”, IEEE Trans. Comput., vol. 42, no. 10, pp. 1162–1170, Oct. 1993.

98

[15] A. De Gloria and M. Olivieri, “Statistical carry lookahead adders”, IEEE Trans. Comput., vol. 45, no. 3, pp. 340–347, Mar. 1996. [16] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach”, IEEE Trans. Comput., vol. 45, no. 3, pp. 294–305, Mar. 1996. [17] Z. Wang, G. A. Jullien, and W. C. Miller, “A new design technique for column compression multipliers”, IEEE Trans. Comput., vol. 44, no. 8, pp. 962–970, Aug. 1995.

Computer Arithmetic: Principles, Architectures, and VLSI Design

97