Computer Arithmetic - Principles, Architectures & VLSI Design

Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich

Institut für Integrierte Systeme
Integrated Systems Laboratory
Lecture notes on
Computer Arithmetic:
Principles, Architectures,
and VLSI Design
June 25, 1998
Reto Zimmermann
Integrated Systems Laboratory
Swiss Federal Institute of Technology (ETH)
CH-8092 Zurich, Switzerland
zimmermann@iis.ee.ethz.ch
Copyright © 1998 by Integrated Systems Laboratory, ETH Zurich

http://www.iis.ee.ethz.ch/~zimmi/publications/comp_arith_notes.ps.gz
Contents

1 Introduction and Conventions
  1.1 Outline
  1.2 Motivation
  1.3 Conventions
  1.4 Recursive Function Evaluation
2 Arithmetic Operations
  2.1 Overview
  2.2 Implementation Techniques
3 Number Representations
  3.1 Binary Number Systems (BNS)
  3.2 Gray Numbers
  3.3 Redundant Number Systems
  3.4 Residue Number Systems (RNS)
  3.5 Floating-Point Numbers
  3.6 Logarithmic Number System
  3.7 Antitetrational Number System
  3.8 Composite Arithmetic
  3.9 Round-Off Schemes
4 Addition
  4.1 Overview
  4.2 1-Bit Adders, (m, k)-Counters
  4.3 Carry-Propagate Adders (CPA)
  4.4 Carry-Save Adder (CSA)
  4.5 Multi-Operand Adders
  4.6 Sequential Adders
5 Simple / Addition-Based Operations
  5.1 Complement and Subtraction
  5.2 Increment / Decrement
  5.3 Counting
  5.4 Comparison, Coding, Detection
  5.5 Shift, Extension, Saturation
  5.6 Addition Flags
  5.7 Arithmetic Logic Unit (ALU)
6 Multiplication
  6.1 Multiplication Basics
  6.2 Unsigned Array Multiplier
  6.3 Signed Array Multipliers
  6.4 Booth Recoding
  6.5 Wallace Tree Addition
  6.6 Multiplier Implementations
  6.7 Composition from Smaller Multipliers
  6.8 Squaring
7 Division / Square Root Extraction
  7.1 Division Basics
  7.2 Restoring Division
  7.3 Non-Restoring Division
  7.4 Signed Division
  7.5 SRT Division
  7.6 High-Radix Division
  7.7 Division by Multiplication
  7.8 Remainder / Modulus
  7.9 Divider Implementations
  7.10 Square Root Extraction
8 Elementary Functions
  8.1 Algorithms
  8.2 Integer Exponentiation
  8.3 Integer Logarithm
9 VLSI Design Aspects
  9.1 Design Levels
  9.2 Synthesis
  9.3 VHDL
  9.4 Performance
  9.5 Testability
Bibliography

1 Introduction and Conventions

1.1 Outline
Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7]
Circuit architectures and implementations of main arithmetic operations
Aspects regarding VLSI design of arithmetic units

1.2 Motivation

Arithmetic units are, among others, core of every data path and addressing unit
Data path is core of :
  microprocessors (CPU)
  signal processors (DSP)
  data-processing application specific ICs (ASIC) and programmable ICs (e.g. FPGA)
Standard arithmetic units available from libraries
Design of arithmetic units necessary for :
  non-standard operations
  high-performance components
  library development

1.3 Conventions

Naming conventions
  Signal buses : $A$ (1-D), $A_i$ (2-D), $a_{i:k}$ (subbus, 1-D)
  Signals : $a$, $a_i$ (1-D), $a_{i,k}$ (2-D), $A_{i:k}$ (group signal)
  Circuit complexity measures : $A$ (area), $T$ (cycle time, delay), $AT$ (area-time product), $L$ (latency, # cycles)
  Arithmetic operators : $+$, $-$, $\cdot$, $/$, $\log$ ($= \log_2$)
  Logic operators : $+$ (or), $\cdot$ (and), $\oplus$ (xor), $\odot$ (xnor), $\overline{a}$ (not)

Circuit complexity measures
  Unit-gate model ($\approx$ gate-equivalents (GE) model) :
    Inverter, buffer : $A = 0$, $T = 0$ (i.e. ignored)
    Simple monotonic 2-input gates (AND, NAND, OR, NOR) : $A = 1$, $T = 1$
    Simple non-monotonic 2-input gates (XOR, XNOR) : $A = 2$, $T = 2$
    Complex gates : composed from simple gates
      $\Rightarrow$ Simple $m$-input gates : $A = m - 1$, $T = \lceil \log m \rceil$
  Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization)
  Only estimations given for complex circuits
1.4 Recursive Function Evaluation

Given : inputs $a_i$, outputs $z_i$, function $f$ (graph sym. : $\bullet$)

Non-recursive functions (n.)
  Output $z_i$ is a function of input $a_i$ (or $a_{j+m:j}$, $m$ const.)
  $z_i = f(a_i, x)$ ; $i = 0, \ldots, n-1$
  $\Rightarrow$ parallel structure : $A = O(n)$, $T = O(1)$
  [Figure funn.epsi: parallel structure]

Recursive functions (r.)
  Output $z_i$ is a function of all inputs $a_k$, $k \le i$

  a) with single output $z = z_{n-1}$ (r.s.) :
     $t_i = f(a_i, t_{i-1})$ ; $i = 0, \ldots, n-1$ ; $t_{-1} = 0/1$ ; $z = t_{n-1}$
     1. $f$ is non-associative (r.s.n.)
        $\Rightarrow$ serial structure : $A = O(n)$, $T = O(n)$
        [Figure funrsn.epsi: serial structure]
     2. $f$ is associative (r.s.a.)
        $\Rightarrow$ serial or single-tree structure : $A = O(n)$, $T = O(\log n)$
        [Figure funrsa.epsi: single-tree structure]

  b) with multiple outputs $z_i$ (r.m.) ($\Rightarrow$ prefix problem) :
     $z_i = f(a_i, z_{i-1})$ ; $i = 0, \ldots, n-1$ ; $z_{-1} = 0/1$
     1. $f$ is non-associative (r.m.n.)
        $\Rightarrow$ serial structure : $A = O(n)$, $T = O(n)$
        [Figure funrmn.epsi: serial structure]
     2. $f$ is associative (r.m.a.)
        $\Rightarrow$ serial or multi-tree structure : $A = O(n^2)$, $T = O(\log n)$
        [Figure funrma1.epsi: multi-tree structure]
        $\Rightarrow$ or shared-tree structure : $A = O(n \log n)$, $T = O(\log n)$
        [Figure funrma2.epsi: shared-tree structure]
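The classes of Section 1.4 can be tried out in software. The following Python sketch is an illustration only, not part of the original notes: the function names and the choice of XOR (parity) as the associative function $f$ are assumptions. It evaluates the same recursive function with a serial structure, with a single tree, and as a prefix problem with multiple outputs, and reports the depth of the first two.

```python
from operator import xor

def serial_eval(f, a, t_init=0):
    """Serial structure (r.s.): t_i = f(a_i, t_{i-1}), z = t_{n-1}.
    Returns (z, depth); depth counts f-operations on the critical path."""
    t = t_init
    for a_i in a:
        t = f(a_i, t)
    return t, len(a)

def tree_eval(f, a):
    """Single-tree structure, valid only if f is associative (r.s.a.).
    Neighbouring operands are combined pairwise, level by level."""
    level, depth = list(a), 0
    while len(level) > 1:
        pairs = [f(level[i + 1], level[i]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:              # odd operand passes through unchanged
            pairs.append(level[-1])
        level, depth = pairs, depth + 1
    return level[0], depth

def prefix_serial(f, a, z_init=0):
    """Multiple outputs (r.m.): z_i = f(a_i, z_{i-1}), serial evaluation."""
    z, prev = [], z_init
    for a_i in a:
        prev = f(a_i, prev)
        z.append(prev)
    return z

bits = [1, 0, 1, 1, 0, 1, 1, 1]         # a_0 ... a_7
z_serial, d_serial = serial_eval(xor, bits)
z_tree, d_tree = tree_eval(xor, bits)
prefix = prefix_serial(xor, bits)
print(z_serial, d_serial)               # parity, serial depth n
print(z_tree, d_tree)                   # same parity, depth ceil(log2 n)
print(prefix)                           # running parities z_0 ... z_7
```

For eight inputs the serial structure needs eight operations in sequence while the tree needs three levels, which is the $O(n)$ versus $O(\log n)$ delay stated above.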
2 Arithmetic Operations

2.1 Overview

[Figure arithops.epsi: overview of the arithmetic operations for fixed-point and floating-point numbers, ordered by complexity ($\ll$, $\gg$ ; $=$, $<$ ; $+1$, $-1$ ; $+/-$ ; $+$, $-$ ; $\times$ ; $/$ ; sqrt(x) ; exp(x) ; log(x) ; trig(x) ; hyp(x)). Arrows mark "based on operation" and "related operation". The floating-point side uses the same structure as the fixed-point side.]

  1  shift/extension
  2  comparison
  3  increment/decrement
  4  complement
  5  addition/subtraction
  6  multiplication
  7  division
  8  square root extraction
  9  exponential function
  10 logarithm function
  11 trigonometric functions
  12 hyperbolic functions

2.2 Implementation Techniques

Direct implementation of dedicated units :
  always : 1-5
  in most cases : 6
  sometimes : 7, 8
Sequential implementation using simpler units and several clock cycles ($\Rightarrow$ decomposition) :
  sometimes : 6
  in most cases : 7, 8, 9
Table look-up techniques using ROMs :
  universal : simple application to all operations
  efficient only for single-operand operations of high complexity (8-12) and small word length (note: ROM size $= 2^n \times n$)
Approximation techniques using simpler units : 7-12
  Taylor series expansion
  polynomial and rational approximations
  convergence of recursive equation systems
  CORDIC (COordinate Rotation DIgital Computer)

3 Number Representations
3.1 Binary Number Systems (BNS)

Radix-2, binary number system (BNS) : irredundant, weighted, positional, monotonic [1, 2]
n-bit number is ordered sequence of bits (binary digits) :
  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_2$ , $a_i \in \{0, 1\}$
Simple and efficient implementation in digital circuits
MSB/LSB (most-/least-significant bit) : $a_{n-1}$ / $a_0$
Represents an integer or fixed-point number, exact
Fixed-point numbers :
  $(\underbrace{a_{m-1} \ldots a_0}_{m\text{-bit integer}} \, . \, \underbrace{a_{-1} \ldots a_{m-n}}_{(n-m)\text{-bit fraction}})$

Unsigned : positive or natural numbers
  Value : $A = a_{n-1} 2^{n-1} + \cdots + a_1 2 + a_0 = \sum_{i=0}^{n-1} a_i 2^i$
  Range : $[0, 2^n - 1]$

Two's (2's) complement : standard representation of signed or integer numbers
  Value : $A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$
  Range : $[-2^{n-1}, 2^{n-1} - 1]$
  Complement : $-A = 2^n - A = \overline{A} + 1$ , where $\overline{A} = (\overline{a}_{n-1}, \overline{a}_{n-2}, \ldots, \overline{a}_0)$
  Sign : $a_{n-1}$
  Properties : asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

One's (1's) complement : similar to 2's complement
  Value : $A = -a_{n-1} (2^{n-1} - 1) + \sum_{i=0}^{n-2} a_i 2^i$
  Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
  Complement : $-A = 2^n - A - 1 = \overline{A}$
  Sign : $a_{n-1}$
  Properties : double representation of zero, symmetric range, modulo $(2^n - 1)$ number system

Sign-magnitude : alternative representation of signed numbers
  Value : $A = (-1)^{a_{n-1}} \sum_{i=0}^{n-2} a_i 2^i$
  Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
  Complement : $-A = (\overline{a}_{n-1}, a_{n-2}, \ldots, a_0)$
  Sign : $a_{n-1}$
  Properties : double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 ($\Rightarrow$ low power)

Graphical representation
  [Figure numrep.epsi: values of the bit patterns 000...0, 011...1, 100...0, 111...1 (binary number representation) under the unsigned, 2's complement, 1's complement, and sign-magnitude interpretations.]

Conventions
  2's complement used for signed numbers in these notes
  Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned

3.2 Gray Numbers

Gray numbers (code) : binary, irredundant, non-weighted, non-monotonic
+ Property : unit-distance coding (i.e. exactly one bit toggles between adjacent numbers)
Applications : counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
- Non-monotonic numbers : difficult arithmetic operations, e.g. addition, comparison :
    $(g_1, g_0) = (0, 0) < (g'_1, g'_0) = (0, 1)$ and $g_0 = 0 < g'_0 = 1$
    $(g_1, g_0) = (1, 1) < (g'_1, g'_0) = (1, 0)$ but $g_0 = 1 > g'_0 = 0$
binary $\rightarrow$ Gray :
  $g_i = b_{i+1} \oplus b_i$ ; $b_n = 0$ ; $i = 0, \ldots, n-1$ (n.)
Gray $\rightarrow$ binary :
  $b_i = b_{i+1} \oplus g_i$ ; $b_n = 0$ ; $i = n-1, \ldots, 0$ (r.m.a.)
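The two Gray conversion rules of Section 3.2 translate directly into code. The sketch below is an illustration, not taken from the notes; the function names are hypothetical. The binary-to-Gray direction is non-recursive (each $g_i$ depends on two input bits only), while the Gray-to-binary direction walks from the MSB down because each $b_i$ depends on $b_{i+1}$.

```python
def binary_to_gray(b: int) -> int:
    """g_i = b_{i+1} XOR b_i with b_n = 0, computed for all bits at once."""
    return b ^ (b >> 1)

def gray_to_binary(g: int, n: int) -> int:
    """b_i = b_{i+1} XOR g_i with b_n = 0, evaluated for i = n-1 ... 0."""
    b = 0
    for i in range(n - 1, -1, -1):
        b_next = (b >> (i + 1)) & 1          # b_{i+1}; zero for i = n-1
        b |= (b_next ^ ((g >> i) & 1)) << i
    return b

n = 4
codes = [binary_to_gray(v) for v in range(2 ** n)]
for value, code in enumerate(codes):
    print(f"{value:2d}  {value:04b}  {code:04b}")
```

Adjacent entries of the printed list differ in exactly one bit, which is the unit-distance property.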
Binary and Gray numbers (4 bits, Section 3.2) :

  value   binary         Gray
          b3 b2 b1 b0    g3 g2 g1 g0
    0     0  0  0  0     0  0  0  0
    1     0  0  0  1     0  0  0  1
    2     0  0  1  0     0  0  1  1
    3     0  0  1  1     0  0  1  0
    4     0  1  0  0     0  1  1  0
    5     0  1  0  1     0  1  1  1
    6     0  1  1  0     0  1  0  1
    7     0  1  1  1     0  1  0  0
    8     1  0  0  0     1  1  0  0
    9     1  0  0  1     1  1  0  1
   10     1  0  1  0     1  1  1  1
   11     1  0  1  1     1  1  1  0
   12     1  1  0  0     1  0  1  0
   13     1  1  0  1     1  0  1  1
   14     1  1  1  0     1  0  0  1
   15     1  1  1  1     1  0  0  0

3.3 Redundant Number Systems

Non-binary, redundant, weighted number systems [1, 2]
Digit set larger than radix (typically radix 2) $\Rightarrow$ multiple representations of same number $\Rightarrow$ redundancy
+ No carry-propagation in adders $\Rightarrow$ more efficient impl. of adder-based units (e.g. multipliers and dividers)
- Redundancy $\Rightarrow$ no direct implementation of relational operators $\Rightarrow$ conversion to irredundant numbers
- Several bits used to represent one digit $\Rightarrow$ higher storage requirements
- Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

Delayed-carry or half-adder number representation :
  $r_i \in \{0, 1, 2\}$ , $c_i, s_i, a_i, b_i \in \{0, 1\}$ ,
  $r_i = (c_{i+1}, s_i) = 2 c_{i+1} + s_i = a_i + b_i$ , $c_{i+1} s_i = 0$
  $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + B$
  1 digit holds sum of 2 bits (no carry-out digit)
  example : $(00, 10) = 00 + 10 = 01 + 01 = (10, 00)$
  irredundant representation of $-1$ [8], since $c_{i+1} s_i = 0$ and $C + S = -1 \rightarrow S = -1$, $C = 0$

Carry-save number representation :
  $r_i \in \{0, 1, 2, 3\}$ , $c_i, s_i, a_i, b_i, d_i \in \{0, 1\}$ ,
  $r_i = (c_{i+1}, s_i) = 2 c_{i+1} + s_i = a_i + b_i + d_i = a_i + r'_i$
  $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + R'$
  1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)
  standard redundant number system for fast addition

Signed-digit (SD) or redundant digit (RD) number representation :
  $r_i, s_i, t_i \in \{-1, 0, 1\} \equiv \{\bar{1}, 0, 1\}$ , $R = \sum_{i=0}^{n-1} r_i 2^i$
  no carry-propagation in $S = R + T$ :
    $r_i + t_i = (c_{i+1}, u_i) = 2 c_{i+1} + u_i$ , $c_{i+1}, u_i \in \{\bar{1}, 0, 1\}$
    $(c_{i+1}, u_i)$ is redundant (e.g. $0 + 1 = 01 = 1\bar{1}$)
    $\forall i \; \exists (c_i, u_i) \mid c_i + u_i = s_i \in \{\bar{1}, 0, 1\}$
  1 digit holds sum of 2 digits (no carry-out digit)
  similar to carry-save, simple use for signed numbers
  minimal SD representation : minimal number of non-zero digits, $011\{1\}10 \rightarrow 100\{0\}\bar{1}0$
    applications : sequential multiplication (fewer cycles), filters with constant coefficients (less hardware)
    example : $7 = (0111 \mid 1\bar{1}11 \mid 10\bar{1}1 \mid 100\bar{1} \mid 1\bar{1}\bar{1}11 \mid \ldots)$ , where $100\bar{1}$ is minimal
  canonical SD repres.: minimal SD + not two non-zero digits in sequence, $01\{1\}10 \rightarrow 10\{0\}\bar{1}0$
  SD $\rightarrow$ binary : carry-propagation necessary ($\Rightarrow$ adder)
  other applications : high-speed multipliers [9]
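The canonical signed-digit form described in Section 3.3 can be produced with a short recoding loop. The sketch below uses the standard non-adjacent-form recoding, which is an assumption on my part: the notes define the representation but do not give this algorithm. The function names are hypothetical, and digits are returned LSB first.

```python
def to_canonical_sd(value: int) -> list:
    """Recode an integer into signed digits from {-1, 0, 1}, LSB first,
    such that no two adjacent digits are non-zero."""
    digits, x = [], value
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)      # +1 if x mod 4 == 1, -1 if x mod 4 == 3
            x -= d               # remaining value is now divisible by 4 or even
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits

def sd_value(digits: list) -> int:
    """Value of a signed-digit vector: sum of r_i * 2^i."""
    return sum(d << i for i, d in enumerate(digits))

csd_7 = to_canonical_sd(7)
print(csd_7[::-1])               # MSB first: the digits of 100(-1)
```

For 7 the loop yields the digit string $100\bar{1}$, the minimal form listed in the example above, with two non-zero digits instead of three.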
3.4 Residue Number Systems (RNS)

Non-binary, irredundant, non-weighted number system [1]
+ Carry-free and fast additions and multiplications
- Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required
Codes for error detection and correction [1]
Possible applications (but hardly used) :
  digital filters : fast additions and multiplications
  error detection and correction for arithmetic operations in conventional and residue number systems

Base is n-tuple of integers $(m_{n-1}, m_{n-2}, \ldots, m_0)$, residues (or moduli) $m_i$ pairwise relatively prime
  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_{m_{n-1}, m_{n-2}, \ldots, m_0}$ , $a_i \in \{0, 1, \ldots, m_i - 1\}$
  Range : $M = \prod_{i=0}^{n-1} m_i$ , anywhere in $\mathbb{Z}$
  $a_i = A \bmod m_i = |A|_{m_i}$ , $A = m_i q_i + a_i$
  $|A|_M = \left| \sum_{i=0}^{n-1} C_i a_i \right|_M$ , $C_i = (\ldots, 0, 1, 0, \ldots)$ (the 1 at position $i$)

Arithmetic operations : (each digit computed separately)
  $z_i = |Z|_{m_i} = |f(A)|_{m_i} = |f(|A|_{m_i})|_{m_i} = |f(a_i)|_{m_i}$
  $|A + B|_{m_i} = \left| |A|_{m_i} + |B|_{m_i} \right|_{m_i} = |a_i + b_i|_{m_i}$
  $|A \cdot B|_{m_i} = \left| |A|_{m_i} \cdot |B|_{m_i} \right|_{m_i} = |a_i \cdot b_i|_{m_i}$
  $|-a_i|_{m_i} = |m_i - a_i|_{m_i}$
  $|a_i^{-1}|_{m_i} = |a_i^{m_i - 2}|_{m_i}$ (Fermat's theorem)

Best moduli $m_i$ are $2^k$ and $(2^k - 1)$ :
  high storage efficiency with $k$ bits
  simple modular addition : $2^k$ : $k$-bit adder without $c_{out}$ , $2^k - 1$ : $k$-bit adder with end-around carry ($c_{in} = c_{out}$)

Example : $(m_1, m_0) = (3, 2)$ , $M = 6$

  A    -4 -3 -2 -1  0  1  2  3  4  5  6  7  8
  a1    2  0  1  2  0  1  2  0  1  2  0  1  2
  a0    0  1  0  1  0  1  0  1  0  1  0  1  0
  (any window of $M = 6$ consecutive values is a possible range)

  $|5|_6 = A = (a_1, a_0) = (|5|_3, |5|_2) = (2, 1)$
  $|4 + 5|_6 = (1, 0) + (2, 1) = (|1 + 2|_3, |0 + 1|_2) = (0, 1) = |3|_6$
  $|4 \cdot 5|_6 = (1, 0) \cdot (2, 1) = (|1 \cdot 2|_3, |0 \cdot 1|_2) = (2, 0) = |2|_6$
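The worked example with moduli $(3, 2)$ can be reproduced in a few lines. This Python sketch is illustrative only: the function names are hypothetical, and the back-conversion uses the Chinese remainder theorem with weights computed through Python's built-in modular inverse `pow(x, -1, m)` (Python 3.8 or later), which is one possible way to obtain the constants $C_i$ and is not prescribed by the notes.

```python
from math import prod

def to_rns(a, moduli):
    """Residue digits a_i = A mod m_i, in the order of the moduli tuple."""
    return tuple(a % m for m in moduli)

def rns_add(x, y, moduli):
    """Digit-wise addition: no carries between digits."""
    return tuple((xi + yi) % m for xi, yi, m in zip(x, y, moduli))

def rns_mul(x, y, moduli):
    """Digit-wise multiplication: no carries between digits."""
    return tuple((xi * yi) % m for xi, yi, m in zip(x, y, moduli))

def from_rns(x, moduli):
    """|A|_M = |sum C_i a_i|_M, with C_i = 1 mod m_i and 0 mod all other m_j."""
    big_m = prod(moduli)
    total = 0
    for a_i, m in zip(x, moduli):
        m_rest = big_m // m
        c_i = m_rest * pow(m_rest, -1, m)   # CRT weight for this digit
        total += c_i * a_i
    return total % big_m

moduli = (3, 2)                              # (m_1, m_0), M = 6
five, four = to_rns(5, moduli), to_rns(4, moduli)
print(five, four)
print(rns_add(four, five, moduli), from_rns(rns_add(four, five, moduli), moduli))
print(rns_mul(four, five, moduli), from_rns(rns_mul(four, five, moduli), moduli))
```

Each digit is processed independently of the others, which is why addition and multiplication are carry-free, while the back-conversion needs a weighted sum over all digits.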
3.5 Floating-Point Numbers

Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]
Double-number form $\Rightarrow$ discontinuous precision

  | S | biased exponent E | unsigned norm. mantissa M |

  $F = (-1)^S \, M \, \beta^E = (-1)^S \; 1.M \cdot 2^{E - bias}$

Basic arithmetic operations :
  $A \cdot B = (-1)^{S_A \oplus S_B} \, M_A M_B \, \beta^{E_A + E_B}$
  $A + B = \left( (-1)^{S_A} M_A + (-1)^{S_B} (M_B \gg (E_A - E_B)) \right) \beta^{E_A}$
  based on fixed-point add, multiply, and shift operations
  postnormalization required ($1/\beta \le M < 1$)
Applications :
  processors : real floating-point formats (e.g. IEEE standard), large range due to universal use
  ASICs : usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers
IEEE floating-point format :

  precision   n    n_M   n_E   bias   range                 precision
  single      32   23    8     127    3.8 $\cdot 10^{38}$   $10^{-7}$
  double      64   52    11    1023   9 $\cdot 10^{307}$    $10^{-15}$

3.6 Logarithmic Number System

Alternative representation to floating-point (i.e. mantissa + integer exponent $\rightarrow$ only fixed-point exponent) [1]
Single-number form $\Rightarrow$ continuous precision $\Rightarrow$ higher accuracy, more reliable

  | S | biased fixed-point exponent E |

  $L = (-1)^S \beta^E = (-1)^S \, 2^{E - bias}$ (signed-logarithmic)

Basic arithmetic operations :
  $(A < B) = (E_A < E_B)$ (additionally consider sign)
  $A + B$ : by approximation or addition in conventional number system and double conversion
  $A \cdot B = (-1)^{S_A \oplus S_B} \beta^{E_A + E_B}$
  $A^y = (-1)^{S_A} \beta^{y E_A}$ , $\sqrt[y]{A} = (-1)^{S_A} \beta^{E_A / y}$
+ Simpler multiplication/exponent.
- More complex addition
- Expensive conversion : (anti)logarithms (table look-up)
Applications : real-time digital filters

3.7 Antitetrational Number System

Tetration (t. $x = 2^{2^{\cdot^{\cdot^{2}}}}$, a tower of $x$ twos) and antitetration (a.t. $x$) [10]
Larger range, smaller precision than logarithmic repres., otherwise analogous (i.e. $2^x \rightarrow$ t. $x$ , $\log x \rightarrow$ a.t. $x$)
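As a concrete check of the floating-point value formula in Section 3.5, the following sketch splits an IEEE single-precision number into its three fields and rebuilds the value from $(-1)^S \cdot 1.M \cdot 2^{E - bias}$. It uses Python's standard `struct` module. The function names are hypothetical, and only normalized numbers are handled (zero, denormals, infinities, and NaNs are deliberately left out).

```python
import struct

def decode_single(x: float):
    """Return (S, E, M) of x stored as an IEEE single: 1 + 8 + 23 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31
    e = (bits >> 23) & 0xFF          # biased exponent, n_E = 8
    m = bits & 0x7FFFFF              # mantissa fraction, n_M = 23
    return s, e, m

def value_from_fields(s: int, e: int, m: int, n_m: int = 23, bias: int = 127) -> float:
    """F = (-1)^S * 1.M * 2^(E - bias), for normalized numbers only."""
    return (-1) ** s * (1 + m / 2 ** n_m) * 2.0 ** (e - bias)

s, e, m = decode_single(-6.5)        # -6.5 = -1.625 * 2^2
print(s, e, m)
print(value_from_fields(s, e, m))
```

Here $-6.5$ decodes to sign 1, biased exponent $2 + 127 = 129$, and fraction $0.625 \cdot 2^{23}$.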
3.8 Composite Arithmetic

Proposal for a new standard of number representations [10]
Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers
Secondary forms used for numbers not representable by primary ones ($\Rightarrow$ no over-/underflow handling necessary)
Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy
Number representations :

  form              tag   value
  integer           00    2's complement integer
  rational          01    slash | denominator \ numerator
  logarithmic       10    log integer | log fraction
  antitetrational   11    a.t. integer | a.t. fraction

Rational numbers : slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)
Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
Hardware proposal : long accumulator (4096 bits) holds any floating-point number in fixed-point format $\Rightarrow$ higher accuracy $\Rightarrow$ large hardware/software overhead
Implementation : mixed hardware/software solutions

3.9 Round-Off Schemes

Intermediate results with $d$ additional lower bits ($\Rightarrow$ higher accuracy) :
  $A = (a_{n-1}, \ldots, a_0 \, . \, a_{-1}, \ldots, a_{-d})$
Rounding : keeping error $\epsilon$ small during final word length reduction :
  $R = (r_{n-1}, \ldots, r_0) = A - \epsilon$
Trade-off : numerical accuracy vs. implementation cost

Truncation :
  $R_{TRUNC} = (a_{n-1}, \ldots, a_0)$
  bias $= -\frac{1}{2} + \frac{1}{2^{d+1}}$ (= average error $\epsilon$)
Round-to-nearest (i.e. normal rounding) :
  $R_{ROUND} = (a'_{n-1}, \ldots, a'_0)$ , $A' = A + 0.1$
  bias $= \frac{1}{2^{d+1}}$ (nearly symmetric)
  + "$+ 0.1$" can often be included in previous operation
Round-to-nearest-even/-odd :
  $R_{ROUND\text{-}EVEN} = R_{ROUND}$ if $(a'_{-1}, \ldots, a'_{-d}) \ne 0$ ; $(a'_{n-1}, \ldots, a'_1, 0)$ otherwise
  bias $= 0$ (symmetric)
  mandatory in IEEE floating-point standard
3 guard bits for rounding after floating-point operations : guard bit G (postnormalization), round bit R (round-to-nearest), sticky bit S (round-to-nearest-even)
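The bias figures quoted for the three round-off schemes of Section 3.9 can be verified exhaustively for a small number of extra bits $d$. In this sketch (an illustration, with hypothetical function names) an intermediate result $A$ is held as an integer scaled by $2^d$, so the constant $0.1$ in binary becomes $2^{d-1}$. Exact fractions are used to avoid floating-point effects, and $d \ge 1$ is assumed.

```python
from fractions import Fraction

def r_trunc(a: int, d: int) -> int:
    """Truncation: drop the d lower bits."""
    return a >> d

def r_round(a: int, d: int) -> int:
    """Round-to-nearest: A' = A + 0.1 (binary), then truncate."""
    return (a + (1 << (d - 1))) >> d

def r_round_even(a: int, d: int) -> int:
    """Round-to-nearest-even: as r_round, but on a tie force the LSB to 0."""
    a_prime = a + (1 << (d - 1))
    r = a_prime >> d
    if a_prime & ((1 << d) - 1) == 0:    # fraction bits of A' are all zero: tie
        r &= ~1
    return r

def bias(scheme, d: int) -> Fraction:
    """Average of R - A over all fractions of one even and one odd integer."""
    span = 1 << (d + 1)
    errors = [Fraction(scheme(a, d)) - Fraction(a, 1 << d) for a in range(span)]
    return sum(errors) / span

d = 3
print(bias(r_trunc, d), bias(r_round, d), bias(r_round_even, d))
```

For $d = 3$ the averages come out as $-\frac{1}{2} + \frac{1}{16}$, $\frac{1}{16}$, and $0$, matching the three bias expressions above.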
4 Addition

4.1 Overview

[Figure adders.epsi: overview of adder components and their relations. 1-bit adders: HA, FA, (m,k), (m,2). Carry-propagate adders (CPA): RCA, CSKA, CSLA, CIA, CLA, PPA, COSA. Carry-save adders (CSA): 3-operand, multi-operand (adder array, adder tree). Multi-operand adders: adder array, adder tree. Arrows mark "based on component" and "related component".]

Legend :
  HA : half-adder
  FA : full-adder
  (m,k) : (m,k)-counter
  (m,2) : (m,2)-compressor
  CPA : carry-propagate adder
  RCA : ripple-carry adder
  CSKA : carry-skip adder
  CSLA : carry-select adder
  CIA : carry-increment adder
  CLA : carry-lookahead adder
  PPA : parallel-prefix adder
  COSA : conditional-sum adder
  CSA : carry-save adder

4.2 1-Bit Adders, (m, k)-Counters

Add up $m$ bits of same magnitude (i.e. 1-bit numbers)
Output sum as $k$-bit number ($k = \lfloor \log m \rfloor + 1$)
or : count 1's at inputs $\Rightarrow$ (m, k)-counter [3] (combinational counters)

Half-adder (HA), (2, 2)-counter
  $(c_{out}, s) = 2 c_{out} + s = a + b$
  $A = 3$ , $T = 2 \; (1)$
  $s = a \oplus b$ (sum)
  $c_{out} = ab$ (carry-out)
  [Figures: half-adder symbol and schematics]

Full-adder (FA), (3, 2)-counter
  $(c_{out}, s) = 2 c_{out} + s = a + b + c_{in}$
  $A = 7$ , $T = 4 \; (2)$
  $g = ab$ (generate)
  $p = a \oplus b$ (propagate)
  $c^0 = ab$ , $c^1 = a + b$
  $s = a \oplus b \oplus c_{in} = p \oplus c_{in}$
  $c_{out} = ab + a c_{in} + b c_{in} = ab + (a \oplus b) c_{in}$
  $\phantom{c_{out}} = g + p c_{in} = \overline{p} g + p c_{in} = \overline{p} a + p c_{in}$
  $\phantom{c_{out}} = \overline{c}_{in} c^0 + c_{in} c^1$
  [Figures: full-adder symbol and schematics]

(m, k)-counters
  $(s_{k-1}, \ldots, s_0) = \sum_{j=0}^{k-1} s_j 2^j = \sum_{i=0}^{m-1} a_i$
  Usually built from full-adders
  Associativity of addition allows conversion from linear to tree structure $\Rightarrow$ faster at same number of FAs
  $A = 7 \sum_{k=1}^{\log m} \lfloor m 2^{-k} \rfloor \approx 7 (m - \log m)$
  $T_{LIN} = 4m + 2 \lfloor \log m \rfloor$ , $T_{TREE} = 4 \lceil \log_3 m \rceil + 2 \lfloor \log m \rfloor$
  Example : (7, 3)-counter
    linear structure : $A = 28$ , $T = 14$
    tree structure : $A = 28$ , $T = 10$
  [Figures count73ser.epsi, count73par.epsi: linear and tree structure of the (7, 3)-counter]

4.3 Carry-Propagate Adders (CPA)

Add two $n$-bit operands $A$ and $B$ and an optional carry-in $c_{in}$ by performing carry-propagation [1, 2, 11]
Sum $(c_{out}, S)$ is irredundant $(n + 1)$-bit number
  $(c_{out}, S) = c_{out} 2^n + S = A + B + c_{in}$
  $2 c_{i+1} + s_i = a_i + b_i + c_i$ ; $i = 0, 1, \ldots, n-1$
  $c_0 = c_{in}$ , $c_{out} = c_n$ (r.m.a.)
  [Figure cpasymbol.epsi: CPA symbol]

Carry-propagation speed-up techniques
  a) Concatenation of partial CPAs with fast $c_{in} \rightarrow c_{out}$
     [Figure speedup1.epsi: chain of partial CPAs over the bit groups $n-1:j$, $i-1:k$, $k-1:0$]
  b) Fast carry look-ahead logic for entire range of bits
     [Figure speedup2.epsi: preprocessing, carry propagation, postprocessing]

Ripple-carry adder (RCA)
  Serial arrangement of $n$ full-adders
  Simplest, smallest, and slowest CPA structure
  $A = 7n$ , $T = 2n$ , $AT = 14 n^2$
  [Figure rca.epsi: chain of full-adders]
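The full-adder equations and the two structures built from them (ripple-carry adder and (7, 3)-counter) can be modelled bit by bit. The sketch below is an illustration with hypothetical function names. It uses the generate/propagate form $c_{out} = g + p\,c_{in}$ from Section 4.2 and the tree arrangement of four full-adders for the counter.

```python
def full_adder(a: int, b: int, c_in: int):
    """(c_out, s) = 2 c_out + s = a + b + c_in, via generate/propagate."""
    g = a & b                    # generate
    p = a ^ b                    # propagate
    s = p ^ c_in
    c_out = g | (p & c_in)
    return c_out, s

def ripple_carry_add(a: int, b: int, n: int, c_in: int = 0):
    """Serial arrangement of n full-adders: c_0 = c_in, c_out = c_n."""
    c, s = c_in, 0
    for i in range(n):
        c, s_i = full_adder((a >> i) & 1, (b >> i) & 1, c)
        s |= s_i << i
    return c, s

def counter_7_3(bits):
    """(7, 3)-counter in tree structure, built from four full-adders."""
    c_a, s_a = full_adder(bits[0], bits[1], bits[2])
    c_b, s_b = full_adder(bits[3], bits[4], bits[5])
    c_c, s0 = full_adder(s_a, s_b, bits[6])       # weight-1 column
    s2, s1 = full_adder(c_a, c_b, c_c)            # weight-2 column
    return s2, s1, s0

print(ripple_carry_add(0b1011, 0b0110, 4))        # 11 + 6 = 17
print(counter_7_3([1, 0, 1, 1, 0, 1, 1]))         # five ones
```

Four full-adders at $A = 7$ each give the $A = 28$ quoted for the (7, 3)-counter.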
Carry-skip adder (CSKA)
  Type a) : partial CPA with fast $c_k \rightarrow c_i$
    $c_i = \overline{P}_{i-1:k} \, c'_i + P_{i-1:k} \, c_k$ (bit group $(a_{i-1}, \ldots, a_k)$)
    $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
  1) $P_{i-1:k} = 0$ : $c_k \not\rightarrow c'_i$ and $c'_i$ selected ($c'_i \rightarrow c_i$)
  2) $P_{i-1:k} = 1$ : $c_k \rightarrow c'_i$ but $c'_i$ skipped ($c'_i \not\rightarrow c_i$)
  $\Rightarrow$ path $c_k \rightarrow c'_i \rightarrow c_i$ never sensitized $\Rightarrow$ fast $c_k \rightarrow c_i$
  $\Rightarrow$ false path $\Rightarrow$ inherent logic redundancy $\Rightarrow$ problems in circuit optimization, timing analysis, and testing
  Variable group sizes (faster) : larger groups in the middle (minimize delays $a_0 \rightarrow c_k \rightarrow s_{i-1}$ and $a_k \rightarrow c_i \rightarrow s_{n-1}$)
  Partial CPA typ. is RCA or CSKA ($\Rightarrow$ multilevel CSKA)
  Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
  $A \approx 8n$ , $T \approx 4 n^{1/2}$ , $AT \approx 32 n^{3/2}$
  [Figure cska.epsi: carry-skip adder with partial CPAs, group propagate $P_{i-1:k}$, and skip multiplexers]

Carry-select adder (CSLA)
  Type a) : partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
    $s_{i-1:k} = \overline{c}_k \, s^0_{i-1:k} + c_k \, s^1_{i-1:k}$
    $c_i = \overline{c}_k \, c^0_i + c_k \, c^1_i$
  Two CPAs compute two possible results ($c_{in} = 0/1$), group carry-in $c_k$ selects correct one afterwards
  Variable group sizes (faster) : larger groups at end (MSB) (balance delays $a_0 \rightarrow c_k$ and $a_k \rightarrow c^0_i$)
  Part. CPA typ. is RCA, CSLA ($\Rightarrow$ multil. CSLA), or CLA
  High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
  $A \approx 14n$ , $T \approx 2.8 n^{1/2}$ , $AT \approx 39 n^{3/2}$
  [Figure csla.epsi: carry-select adder with two partial CPAs per group and select multiplexers]
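The carry-select equations above can be exercised with a small behavioural model. In this sketch (an illustration, not a gate-level description) each partial CPA is modelled by ordinary integer addition. The function name, the `group_sizes` parameter, and the chosen group sizes are assumptions made for the example.

```python
def carry_select_add(a: int, b: int, group_sizes, c_in: int = 0):
    """Behavioural carry-select adder.

    group_sizes lists the bit-group widths from LSB to MSB. For every group
    both results (carry-in 0 and 1) are computed, and the group carry-in c_k
    selects the correct sum bits and carry-out afterwards.
    """
    s, c_k, k = 0, c_in, 0
    for width in group_sizes:
        mask = (1 << width) - 1
        a_g, b_g = (a >> k) & mask, (b >> k) & mask
        t0 = a_g + b_g               # partial CPA with carry-in 0
        t1 = a_g + b_g + 1           # partial CPA with carry-in 1
        s0, c0 = t0 & mask, t0 >> width
        s1, c1 = t1 & mask, t1 >> width
        s |= (s1 if c_k else s0) << k     # s = not(c_k) s^0 + c_k s^1
        c_k = c1 if c_k else c0           # c_i = not(c_k) c^0 + c_k c^1
        k += width
    return c_k, s

groups = [2, 3, 3]                   # larger groups towards the MSB
print(carry_select_add(0b10110111, 0b01101001, groups))
```

In hardware the two results of all groups are available in parallel, so only the chain of select multiplexers remains on the carry path.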
Carry-increment adder (CIA)
  Type a) : partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
    $s_{i-1:k} = s'_{i-1:k} + c_k$ , $c_i = c'_i + P_{i-1:k} \, c_k$
    $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
  Result is incremented after addition, if $c_k = 1$ [12, 11]
  Variable group sizes (faster) : larger groups at end (MSB) (balance delays $a_0 \rightarrow c_k$ and $a_k \rightarrow c'_i$)
  Part. CPA typ. is RCA, CIA ($\Rightarrow$ multilevel CIA) or CLA
  High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)
  Logic of CPA and incrementer can be merged [11]
  $A \approx 10n$ , $T \approx 2.8 n^{1/2}$ , $AT \approx 28 n^{3/2}$
  [Figure cia.epsi: carry-increment adder with partial CPAs (carry-in 0), group propagate $P_{i-1:k}$, and incrementers (+1)]

  T             4   6   10  12  14  16  18  20  22  24  26  28  ...  38
  max n_group           2   3   4   5   6   7   8   9   10  11  ...  16
  max n         1   2   4   7   11  16  22  29  37  46  56  67  ...  137

Example : gate-level schematic of carry-incr. adder (CIA)
  only 2 different logic cells (bit-slices) : IHA and IFA
  [Figure ciagate.epsi: bit groups (bit 0 ; bit 1 ; bits 3,2 ; bits 6...4 ; ... ; bits i-1...k) built from IHA and IFA cells, i.e. IHA ; IHA ; IFA + IHA ; 2 IFA + IHA ; ... ; (i-k-1) IFA + IHA]
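The carry-increment rule ($s = s' + c_k$, $c_i = c'_i + P\,c_k$) differs from carry-select in that each group is added only once and then conditionally incremented. The following behavioural sketch illustrates this. The names are hypothetical, and the group sizes 1, 1, 2, 3 are chosen here to match the differences between successive "max n" entries in the table above.

```python
def carry_increment_add(a: int, b: int, group_sizes, c_in: int = 0):
    """Behavioural carry-increment adder.

    Each group is added with carry-in 0 (s', c'). The group carry-in c_k
    then increments the sum, and the group carry-out is c' + P c_k, where
    P is the group propagate signal (all bit propagates p_j = 1).
    """
    s, c_k, k = 0, c_in, 0
    for width in group_sizes:
        mask = (1 << width) - 1
        a_g, b_g = (a >> k) & mask, (b >> k) & mask
        t = a_g + b_g                        # partial CPA, carry-in 0
        s_prime, c_prime = t & mask, t >> width
        p_group = int((a_g ^ b_g) == mask)   # P = p_{i-1} ... p_k
        s |= ((s_prime + c_k) & mask) << k   # increment if c_k = 1
        c_k = c_prime | (p_group & c_k)
        k += width
    return c_k, s

groups = [1, 1, 2, 3]                        # 7 bits in total
print(carry_increment_add(0b1011011, 0b0100101, groups, 1))
```

The carry-out of the incrementer never needs to be computed separately: it can only be 1 when every bit of the group propagates, which is exactly the $P\,c_k$ term.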
Conditional-sum adder (COSA)
  Type a) : optimized multilevel CSLA with $(\log n)$ levels (i.e. double CPAs are merged at higher levels)
  Correct sum bits ($s^0_{i-1:k}$ or $s^1_{i-1:k}$) are (conditionally) selected through $(\log n)$ levels of multiplexers
  Bit groups of size $2^l$ at level $l$
  Higher parallelism, more balanced signal paths
  Highest speed-up at highest hardware overhead (2 RCA + more than $(\log n)$ MUX/bit)
  $A \approx 3n \log n$ , $T \approx 2 \log n$ , $AT \approx 6n \log^2 n$
  [Figure cosa.epsi: conditional-sum adder with full-adders at level 0 and multiplexer levels 1, 2, ...]

Carry-lookahead adder (CLA), traditional
  Type b) : carries looked ahead before sum bits computed
  Typically 4-bit blocks used (e.g. standard IC SN74181)
    $c_0 = c'_0$
    $c_1 = g_0 + p_0 c'_0$
    $c_2 = g_1 + p_1 g_0 + p_1 p_0 c'_0$
    $c_3 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c'_0$
    $g'_3 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$
    $p'_3 = p_3 p_2 p_1 p_0$
  [Figure clbsymbol.epsi: carry-lookahead block (CLB) with inputs $(g_3, p_3) \ldots (g_0, p_0)$ and $c'_0$, outputs $c_3 \ldots c_0$ and $(g'_3, p'_3)$]
  Hierarchical arrangement using $(\frac{1}{2} \log n)$ levels : $(g'_3, p'_3)$ passed up, $c'_0$ passed down between levels
  High speed-up at medium hardware overhead
  + preprocessing : $g_i = a_i b_i$ , $p_i = a_i \oplus b_i$
  + postprocessing : $s_i = p_i \oplus c_i$
  $A \approx 14n$ , $T \approx 4 \log n$ , $AT \approx 56 n \log n$
  [Figure cla.epsi: 16-bit CLA with four first-level CLBs for $(g_{15}, p_{15}) \ldots (g_0, p_0)$, producing $c_{15} \ldots c_0$, and one second-level CLB]

Parallel-prefix adders (PPA)
  Type b) : universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA)
  Preprocessing, carry-lookahead, and postprocessing step
  Carries calculated using parallel-prefix algorithms
  + High regularity : suitable for synthesis and layout
  + High flexibility : special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)
  + High performance : smallest and fastest adders
  $A = 5n + 3 A_\bullet$ , $T = 4 + 2 T_\bullet$
  [Figure add.epsi: preprocessing ($g_i = a_i b_i$, $p_i = a_i \oplus b_i$), carry-lookahead (prefix algorithm), postprocessing ($s_i = p_i \oplus c_i$)]

Prefix problem
  Inputs $(x_{n-1}, \ldots, x_0)$, outputs $(y_{n-1}, \ldots, y_0)$, associative binary operator $\bullet$ [11, 13]
    $(y_{n-1}, \ldots, y_0) = (x_{n-1} \bullet \cdots \bullet x_0, \; \ldots, \; x_1 \bullet x_0, \; x_0)$
  or
    $y_0 = x_0$ , $y_i = x_i \bullet y_{i-1}$ ; $i = 1, \ldots, n-1$ (r.m.a.)
  Associativity of $\bullet$ $\Rightarrow$ tree structures for evaluation :
    $x_3 \bullet (x_2 \bullet (x_1 \bullet x_0)) = (x_3 \bullet x_2) \bullet (x_1 \bullet x_0)$ , but $y_2$ ?
    left side : $y_1 = Y^1_{1:0}$ , $y_2 = Y^2_{2:0}$ , $y_3 = Y^3_{3:0}$
    right side : $Y^1_{3:2}$ , $y_1 = Y^1_{1:0}$ , $y_3 = Y^2_{3:0}$
  Group variables $Y^l_{i:k}$ : covers bits $(x_k, \ldots, x_i)$ at level $l$
  Carry-propagation is prefix problem : $Y^l_{i:k} = (G^l_{i:k}, P^l_{i:k})$
    $(G^0_{i:i}, P^0_{i:i}) = (g_i, p_i)$
    $(G^l_{i:k}, P^l_{i:k}) = (G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1}) \bullet (G^{l-1}_{j:k}, P^{l-1}_{j:k})$ ; $k \le j \le i$
    $\phantom{(G^l_{i:k}, P^l_{i:k})} = (G^{l-1}_{i:j+1} + P^{l-1}_{i:j+1} G^{l-1}_{j:k}, \; P^{l-1}_{i:j+1} P^{l-1}_{j:k})$
    $c_{i+1} = G^m_{i:0}$ ; $i = 0, \ldots, n-1$ , $l = 1, \ldots, m$
  Parallel-prefix algorithms [14] :
    multi-tree structures ($T = O(n) \rightarrow O(\log n)$)
    sharing subtrees ($A = O(n^2) \rightarrow O(n \log n)$)
    different algorithms trading area vs. delay (influences also from wiring and maximum fan-out $FO_{max}$)
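The prefix operator on (generate, propagate) pairs and the three adder steps (preprocessing, carry computation, postprocessing) can be checked with a short model. The sketch below is illustrative and uses hypothetical names. It evaluates the prefix problem serially ($y_i = x_i \bullet y_{i-1}$) with carry-in 0 and also verifies the associativity of $\bullet$ over all bit combinations, which is the property that permits tree structures.

```python
from itertools import product

def dot(hi, lo):
    """(G, P) = (G_hi + P_hi G_lo, P_hi P_lo); hi covers the upper bits."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def prefix_carry_add(a: int, b: int, n: int):
    """n-bit addition with c_in = 0: preprocessing, serial prefix, postprocessing."""
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # g_i = a_i b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # p_i = a_i xor b_i
    y = [(g[0], p[0])]
    for i in range(1, n):
        y.append(dot((g[i], p[i]), y[i - 1]))            # y_i = x_i . y_{i-1}
    c = [0] + [group_g for group_g, _ in y]              # c_{i+1} = G_{i:0}
    s = sum((p[i] ^ c[i]) << i for i in range(n))        # s_i = p_i xor c_i
    return c[n], s

pairs = list(product((0, 1), repeat=2))
associative = all(
    dot(x, dot(y, z)) == dot(dot(x, y), z)
    for x in pairs for y in pairs for z in pairs
)
print(associative)
print(prefix_carry_add(0b10110, 0b01111, 5))
```

Replacing the serial loop by any other evaluation order that respects the operand order gives the same carries, which is what the parallel-prefix algorithms exploit.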
Prefix algorithms
  Algorithms visualized by directed acyclic graphs (DAG) with array structure ($n$ bits $\times$ $m$ levels)
  Graph vertex symbols :
    black node $\bullet$ : $(G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1})$ , $(G^{l-1}_{j:k}, P^{l-1}_{j:k})$ $\rightarrow$ $(G^l_{i:k}, P^l_{i:k})$ (contains logic for $\bullet$)
    white node $\circ$ : $(G^{l-1}_{i:k}, P^{l-1}_{i:k})$ $\rightarrow$ $(G^l_{i:k}, P^l_{i:k})$ (contains no logic)
  Performance measures :
    $A_\bullet$ : graph size (number of black nodes)
    $T_\bullet$ : graph depth (number of black nodes on critical path)

Serial-prefix algorithm ($\Rightarrow$ RCA)
  $A_\bullet = n - 1$ , $T_\bullet = n - 1$ , $FO_{max} = 2$
  [Figure ser.epsi: 16-bit serial-prefix graph, levels 0 to 15]

Sklansky parallel-prefix algorithm ($\Rightarrow$ PPA-SK)
  Tree-like collection, parallel redistribution of carries
  $A_\bullet = \frac{1}{2} n \log n$ , $T_\bullet = \lceil \log n \rceil$ , $FO_{max} = \frac{1}{2} n$
  [Figure sk.epsi: 16-bit Sklansky prefix graph, levels 0 to 4]

Brent-Kung parallel-prefix algorithm ($\Rightarrow$ PPA-BK)
  Traditional CLA is PPA-BK with 4-bit groups
  Tree-like redistribution of carries (fan-out tree)
  $A_\bullet = 2n - \lceil \log n \rceil - 2$ , $T_\bullet = 2 \lceil \log n \rceil - 2$ , $FO_{max} \approx \log n$
  [Figure bk.epsi: 16-bit Brent-Kung prefix graph, levels 0 to 6]

Kogge-Stone parallel-prefix algorithm ($\Rightarrow$ PPA-KS)
  very high wiring requirements
  $A_\bullet = n \log n - n + 1$ , $T_\bullet = \lceil \log n \rceil$ , $FO_{max} = 2$
  [Figure ks.epsi: 16-bit Kogge-Stone prefix graph, levels 0 to 4]

Carry-increment parallel-prefix algorithm ($\Rightarrow$ CIA)
  $A_\bullet \approx 2n - 1.4 n^{1/2}$ , $T_\bullet \approx 1.4 n^{1/2}$ , $FO_{max} \approx 1.4 n^{1/2}$
  [Figure cia.epsi: 16-bit carry-increment prefix graph, levels 0 to 5]

Mixed serial/parallel-prefix algorithm ($\Rightarrow$ RCA + PPA)
  linear size-depth trade-off using parameter $k$ : $0 \le k \le n - 2 \lceil \log n \rceil + 2$
    $k = 0$ : serial-prefix graph
    $k = n - 2 \lceil \log n \rceil + 1$ : Brent-Kung parallel-prefix graph
  fills gap between RCA and PPA-BK (i.e. CLA) in steps of single $\bullet$-operations
  $A_\bullet = n - 1 + k$ , $T_\bullet = n - 1 - k$ , $FO_{max} =$ var.
  [Figure var.epsi: 16-bit mixed serial/parallel-prefix graph, levels 0 to 10]
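The size and depth figures of the serial, Sklansky, and Kogge-Stone prefix graphs can be reproduced by generating the graphs explicitly. In the sketch below (an illustration with hypothetical names) a graph is a list of levels, and each level lists its black nodes as pairs `(i, j)`: the node in column `i` combines its own previous value with that of column `j`. The index formulas are the usual textbook constructions of these graphs, not taken from the notes. To check the prefix property without any adder logic, each value is a bit range `(high, low)` and the operator merges two adjacent ranges.

```python
def serial_prefix(n):
    return [[(i, i - 1)] for i in range(1, n)]

def sklansky(n):
    levels, l = [], 0
    while (1 << l) < n:
        # upper half of every 2^(l+1)-bit group reads the top of the lower half
        levels.append([(i, ((i >> l) << l) - 1) for i in range(n) if (i >> l) & 1])
        l += 1
    return levels

def kogge_stone(n):
    levels, l = [], 0
    while (1 << l) < n:
        levels.append([(i, i - (1 << l)) for i in range(1 << l, n)])
        l += 1
    return levels

def merge(hi, lo):
    """Prefix operator on bit ranges: (i:j+1) . (j:k) = (i:k)."""
    assert hi[1] == lo[0] + 1, "ranges must be adjacent"
    return (hi[0], lo[1])

def evaluate(levels, x, op):
    y = list(x)
    for level in levels:
        prev = y[:]                      # all nodes of a level read the previous level
        for i, j in level:
            y[i] = op(prev[i], prev[j])
    return y

def size(levels):
    return sum(len(level) for level in levels)   # number of black nodes

n = 16
for name, graph in (("serial", serial_prefix(n)), ("Sklansky", sklansky(n)),
                    ("Kogge-Stone", kogge_stone(n))):
    outputs = evaluate(graph, [(i, i) for i in range(n)], merge)
    print(name, size(graph), len(graph), outputs == [(i, 0) for i in range(n)])
```

For these three graphs every level contains a black node on the critical path, so the number of levels equals the depth $T_\bullet$. For $n = 16$ the sizes are $n - 1 = 15$, $\frac{1}{2} n \log n = 32$, and $n \log n - n + 1 = 49$.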
4 Addition
Example : 4-bit parallel-prefix adder (PPA-SK)

efficient AND-OR-prefix circuit for the generate and
AND-prefix circuit for the propagate signals
optimization: alternatingly AOI-/OAI- resp. NAND-/
NOR-gates (inverting gates are smaller and faster)
can also be realized using two MUX-prefix circuits
a3
b3
a2
b2
a1
b1
a0
4 Addition
Prefix adder synthesis

Local prefix graph transformation :
3 2 1 0
A =3
T =3
b0
c in
0
unfact.epsi
1
20 26 mm
2
3
3 2 1 0
depth-decr.
transform
0
fact.epsi
1
20 26 mm
2
3
;!
size-decr.
transform
A =4
T =2
Repeated (local) prefix transformations result in overall

minimization of graph depth or size ) which sequence ?
Goal: minimal size (area) at given depth (delay)
Simple algorithm for sequence of applied transforms :
Step 1 : prefix graph compression (depth minimization) :
depth-decr. transforms in right-to-left bottom-up order
Step 2 : prefix graph expansion (size minimization) :
size-decreasing transforms in left-to-right top-down
order, if allowed depth not exceeded
askgate.epsi///figures
100 103 mm
Prefix adder synthesis : 1) generate serial-prefix graph, 2)

graph compression, 3) depth-controlled graph expansion,
4) generate pre-/postprocessing and prefix logic
+ Generates all previous prefix graphs (except PPA-KS)
+ Universal adder synthesis algorithm : generates
area-optimal adders for any given timing constraints [14]
(including non-uniform signal arrival times)
Multilevel adders

Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L)
+ Delay is $O(n^{1/(m+1)})$ for m levels
Area increase small for CSKA and CIA, high for CSLA (⇒ COSA)
Difficult computation of optimal group sizes

Self-timed adders

Average carry-propagation length : log n
+ RCA is fast in average case (T = O(log n)), slow in worst case ⇒ suitable for self-timed asynchronous designs [16]
Completion detection is not trivial
Adder performance comparisons

Standard-cell implementations, 0.8 µm process
[Figure addperf.ps: area [lambda^2] (1e+06 ... 1e+07) versus delay [ns] (5 ... 20) for 8-, 16-, 32-, 64-, and 128-bit RCA, CSKA-2L, CIA-1L, CIA-2L, PPA-SK, PPA-BK, CLA, and COSA; line of constant AT]

Hybrid adders

Arbitrary combinations of speed-up techniques possible ⇒ hybrid/mixed adder architectures
Often used combinations : CLA and CSLA [15]
Pure architectures usually perform best (at gate-level)

Transistor-level adders

Influence of logic styles (e.g. dynamic logic, pass-transistor logic ⇒ faster)
+ Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [15]
+ Combinations of speed-up techniques make sense
Much higher design effort
Many efficient implementations exist and have been published
Complexity comparison under the unit-gate model
$2c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ ; $i = 0, 1, \ldots, n-1$ (n.)
Result is in redundant carry-save format (n digits), represented by two n-bit numbers S (sum bits) and C (carry bits)
+ Parallel arrangement of n full-adders, constant delay
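As a quick check of the carry-save relation, the n parallel full-adders can be modelled with word-wide Boolean operations. This is a minimal sketch; `carry_save_add` is a hypothetical name, and the carry vector is returned already shifted to its weight.

```python
def carry_save_add(a0, a1, a2):
    """Add three operands without carry propagation; returns (C, S) with C + S = a0 + a1 + a2."""
    s = a0 ^ a1 ^ a2                              # sum bit of every full-adder
    c = ((a0 & a1) | (a0 & a2) | (a1 & a2)) << 1  # carry bit, weighted by 2
    return c, s
```

For example, `carry_save_add(5, 6, 7)` yields a pair whose sum is 18; only a final carry-propagate addition of C and S is needed to obtain the irredundant result.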
[Figure csa.epsi: carry-save adder built from n full-adders (inputs $a_{0,i}$, $a_{1,i}$, $a_{2,i}$; outputs $c_{i+1}$, $s_i$)]
Multi-operand carry-save adders (m > 3) ⇒ adder array (linear arrangement), adder tree (tree arrangement)
4.5 Multi-Operand Adders

4 Addition
45
a) 4-operand CPA (RCA) array :

a 0,n-1
a 1,n-1
Add three or more (m > 2) n-bit operands, yield $(n + \lceil \log m \rceil)$-bit result in irredundant number representation [1, 2]
FA
a 2,n-1
Realization by array adders : (see figures on next page)

a) linear arrangement of CPAs
b) linear arr. of CSAs (adder array) and final CPA
CSA
FA
HA
a 3,n-1
FA
a 2,2
a 2,1
...
FA
a 3,2
CPA
a 2,0
cparray.epsi
93 57 mm FA
FA
CPA
HA
a 3,1
a 3,0
FA
FA
HA
s2
s1
s0
CPA
...
sn
s n-1
FA
...
...
a 3,n-1
mopadd.epsi
CSA
30 58 mm
a 3,2
...
csarray.epsi
99FA 57 mm
FA
a 3,1
FA
FA
CSA
a 3,0
HA
CSA
...
FA
FA
a 2,0
a 0,2
a 1,2
A m-1
a 0,0
a 1,0
b) 4-operand CSA array with final CPA (RCA) :

a 2,n-1
A3
...
FA
a 0,n-1
a 1,n-1
A0 A1 A2
FA
...
Array adders
a) and b) differ in bit arrival times at final CPA :
⇒ if CPA = RCA : a) and b) have same overall delay
⇒ if fast final CPA : uniform bit arrival times required ⇒ CSA array (b)
Fast implementation : CSA array + fast final CPA (note: array of fast CPAs not efficient/necessary)
a 0,0
a 1,0
A = O(mn + n)
T = O(m + n)
a 2,0
A = 7n T = 4
CPA = RCA :
(C S )out = A + (C S )in
optimality regarding area and delay

aaa : smallest area, longest delay
aat : small area, medium delay
att : medium area, short delay
ttt : large area, shortest delay
: not optimal
2 obtained from prefix adder synthesis
3 automatic logic optimization not possible (redundancy)
4 exact factors not calculated
5 corresponds to 4-bit PPA-BK
$A = (m - 2) A_{CSA} + A_{CPA}$
$T = (m - 2) T_{CSA} + T_{CPA}$
b) Adds one n-bit operand to an n-digit carry-save operand
p
p
4 Addition
csasymbol.epsi
26 mm
21 CSA
2ci+1 + si
p
p
p
adder      A              T                AT
RCA        7n             2n               14n^2
CSKA-1L    8n             4n^{1/2}         32n^{3/2}
CSKA-2L    8n             x n^{1/3} (4)    x n^{4/3} (4)
CSLA-1L    14n            2.8n^{1/2}       39n^{3/2}
CIA-1L     10n            2.8n^{1/2}       28n^{3/2}
CIA-2L     10n            3.6n^{1/3}       36n^{4/3}
CIA-3L     10n            4.4n^{1/4}       44n^{5/4}
PPA-SK     3/2 n log n    2 log n          3n log^2 n
PPA-BK     10n            4 log n          40n log n
PPA-KS     3n log n       2 log n          6n log^2 n
CLA (5)    14n            4 log n          56n log n
COSA       3n log n       2 log n          6n log^2 n
opt. (1) / syn. (2) entries as extracted (row assignment not recoverable) : ttt, att, aat (3), att, att, aaa

a) Adds three n-bit operands A0, A1, A2 performing no carry-propagation (i.e. carries are saved) [1]
$(C, S) = C + S = A_0 + A_1 + A_2$
4.4 Carry-Save Adder (CSA)
Fast CPA :
A = O(mn + n log n)
T = O(m + log n)
CPA
FA
FA
sn
s n-1
FA
HA
s2
s1
CPA
...
s0
S
46
47
4 Addition
4 Addition
(m, 2)-compressors
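$2\left(c + \sum_{l=0}^{m-4} c^{l}_{out}\right) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c^{l}_{in}$
$A = 7(m - 2)$
$T_{LIN} = 4(m - 2)$, $T_{TREE} = 6(\lceil \log m \rceil - 1)$
[Figure cprsymbol.epsi: (m,2)-compressor symbol with inputs $a_0 \ldots a_{m-1}$, carries $c^{0}_{in} \ldots c^{m-4}_{in}$ and $c^{0}_{out} \ldots c^{m-4}_{out}$, outputs c and s]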
Optimized (4, 2)-compressor :
2 full-adders merged and optimized (i.e. XORs

arranged in tree structure)
1-bit adders (similar to (m, k)-counters) [17]
Compresses m bits down to 2 by forwarding (m − 3) intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array (see prev. page)
+ No horizontal carry-propagation (i.e. $c^{l}_{in} \to c^{k}_{out}$ only for $k > l$)
A = 14 T = 6
A = 14 T = 8
a0 a1
a0 a1 a2 a3
Built from full-adders (= (3, 2)-compressor) or

(4, 2)-compressors arranged in linear or tree structures
FA
cpr42fa.epsi
32 38 mm
c out
Example : 4-operand adder using (4, 2)-compressors
c in
FA
cpr42opt.epsi
1
41 53 mm
c out
c in
(4,2)
cpradd.epsi
99 44 mm
(4,2)
(4,2)
FA
FA
HA
sn
s n-1
s2
s1
4 Addition
CPA
48
higher compression rate (4:2 instead of 3:2)

less deep and more regular trees
012 3 4 5
SD-FA (signed-digit full-adder) is similar to

(4, 2)-compressor regarding structure and complexity
s0
Advantages of (4, 2)-compressors over FAs for realizing

(m, 2)-compressors :
FA
(4,2)
optimized
CSA
# operands
with full-adders
tree depth
+ same area, 25% shorter delay
FA
s n+1
a 0,0
a 1,0
a 2,0
a 3,0
a 0,1
a 1,1
a 2,1
a 3,1
a 0,2
a 1,2
a 2,2
a 3,2
a 0,n-1
a 1,n-1
a 2,n-1
a 3,n-1
(4,2)
a2 a3

4 Addition
49
Tree adders (Wallace tree)

Adder tree : n-bit m-operand carry-save adder
composed of n tree-structured (m, 2)-compressors [1, 18]
Tree adders : fastest multi-operand adders using an
adder tree and a fast final CPA
7 8 9 10
2 3 4 6 9 13 19 28 42 63 94
2 4 8 16 32 64 128
$A = A_{(m,2)}\, n + A_{CPA} = O(mn + n \log n)$
$T = T_{(m,2)} + T_{CPA} = O(\log m + \log n)$
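The logarithmic depth of an adder tree can be illustrated with a word-level model that reduces the operand list with 3:2 carry-save adders until two words remain. This is a sketch under simplifying assumptions: operands are plain Python integers, the grouping into triples is one possible schedule, and `csa` and `tree_add` are hypothetical names.

```python
def csa(x, y, z):
    """3:2 carry-save adder on whole words: returns (carry word, sum word)."""
    return ((x & y) | (x & z) | (y & z)) << 1, x ^ y ^ z

def tree_add(operands):
    """Reduce m operands to two with CSA levels, then apply a final CPA (Python '+')."""
    ops = list(operands)
    depth = 0
    while len(ops) > 2:
        nxt = []
        # Feed as many complete triples as possible into CSAs of this level.
        for k in range(0, len(ops) - 2, 3):
            c, s = csa(ops[k], ops[k + 1], ops[k + 2])
            nxt += [c, s]
        # One or two leftover operands pass through to the next level.
        nxt += ops[len(ops) - len(ops) % 3:]
        ops = nxt
        depth += 1
    return sum(ops), depth
```

With nine operands the model needs four CSA levels, and with four operands two levels, before the final carry-propagate addition.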
Example : (8, 2)-compressor
A = 42 T = 16
a0a1 a2a3
0
c out
FA
A = 42 T = 12
a0a1a2a3
a4a5 a6a7
FA
1
c out
0
c out
c in0
c in1
FA
2
c out
FA
cpr82fa.epsi
47 65 mm
3
c out
FA
4
c out
FA
c
c in2
a4a5a6a7
(4,2)
(4,2)
1
c out
2
c out
Adder arrays and adder trees revisited
c in0
c in1
cpr82cpr42.epsi
47 50 mm
c in2
3
c out
c in3
4
c out
c in4
(4,2)
Some FA can often be replaced by HA or eliminated

(i.e. redundant due to constant inputs)
Number of (irredundant) FA does not depend on adder
structure, but number of HA does
c in3
An m-operand adder accommodates (m − 1) carry inputs
c in4
Adder trees (T = O(log n)) are faster than adder arrays

(T = O(n)) at same amount of gates (A = O(mn))
Adder trees are less regular and have more complex routing than adder arrays ⇒ larger area, difficult layout (i.e. limited use in layout generators)
(4, 2)-compressor tree
full-adder tree
50
51
4 Addition
4.6 Sequential Adders
4.6 Sequential Adders
5 Simple / Addition-Based Operations
Bit-serial adder : Sequential n-bit adder
2s complementer (negation)
Z
A
$A - B = A + (-B) = A + \overline{B} + 1$
sub.epsi
29 32 mm
CPA
c out
S
A B
2s complement adder/subtractor
Allows higher clock rates

Final CPA too slow :
) pipelining or multiple
cycles for evaluation
$A \pm B = A + (-1)^{sub} B = A + (B \oplus sub) + sub$
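The adder/subtractor identity can be checked with a small n-bit model: B is XOR-ed bit-wise with the control signal sub, and sub is also fed in as carry-in. The function name `add_sub` and the returned carry-out are illustrative assumptions.

```python
def add_sub(a, b, sub, n):
    """n-bit 2's complement adder/subtractor: A + B for sub = 0, A - B for sub = 1."""
    mask = (1 << n) - 1
    b_x = (b & mask) ^ (mask if sub else 0)   # B xor sub, applied to every bit
    total = (a & mask) + b_x + sub            # sub doubles as carry-in
    return total & mask, total >> n           # (n-bit result, carry-out)
```

For example, `add_sub(5, 3, 1, 4)` returns `(2, 1)`: in subtraction a carry-out of 1 indicates that no borrow occurred.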
CSA
accucsa.epsi
33 52 mm
A = ACSA + ACPA + 4AREG

T = TCSA + TREG
L=m
CPA
c out
$A + B \pmod{2^n - 1} = A + B + c_{out}$
52
5.2 Increment / Decrement
sub
1
2
1
2
n log2 n
(cout Z ) = A ; cin
a2
a1
a0
...
Z
dec.epsi
93 41 mm
c out
A = 3n T = n + 1 AT
c in
...
Example : Ripple-carry incrementer using half-adders

3n2
z n-1
a0
z2
z1
z0
Incrementer-decrementer
...
incfa.epsi
59c 23HA
mm c
2
1
) AND-prefix struct.
n log n + 2n T = dlog ne + 2 AT
a n-1
incsymbol.epsi
+1
c out 29 26 mm c in
53
Ci:k = Ci:j+1Cj:k
Decrementer
= 0 () FA ! HA)
a1
c in
Prefix problem :
addmod.epsi
28 mm
29 CPA
Incrementer
Adds a single bit cin to an n-bit operand A
$(c_{out}, Z) = c_{out} 2^n + Z = A + c_{in}$
c out
(end-around carry)
$z_i = a_i \oplus c_i$
$c_{i+1} = a_i c_i$ ; $i = 0, \ldots, n-1$
$c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
1s complement adder
CPA
Corresponds to addition with B
addsub.epsi
36 35 mm
Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer

carries saved), trade-off between speed and register size
c n-1
2s complement subtractor
accucpa.epsi
CPA
27 28 mm
With CSA and final CPA
HA
+1
si
Accumulators : Sequential m-operand adders

A
With CPA
a n-1
neg.epsi
21 32 mm1
$-A = \overline{A} + 1$
bitseradd.epsi
FA
25 27 mm
A = ACPA + AREG
T = TCPA + TREG
L=m
5.1 Complement and Subtraction

ai bi
A = AFA + AFF
T = TFA + TFF
L=n
c out
5.1 Complement and Subtraction
HA
$(c_{out}, Z) = A \pm c_{in} = A + (-1)^{dec} c_{in}$
c in
...
z n-1
z1
a n-1
z0
a2
a1
a0
or using incrementer slices (= half-adder)

a n-1
a2
a1
dec
a0
...
...
c out
incdec.epsi
94 46 mm
inc.epsi
83 33 mm
c out
c in
c in
...
...
HA
z n-1
z2
z1
z0
z n-1
54
z2
z1
z0
55
Fast incrementers
Gray incrementer
4-bit incrementer using multi-input gates :

a3
a2
a1
Increments in Gray number system
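$c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0$ (parity)
$c_{i+1} = \overline{a_i}\, c_i$ ; $i = 0, \ldots, n-3$ (r.m.a.)
$z_0 = \overline{a_0 \oplus c_0}$
$z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \ldots, n-2$
$z_{n-1} = a_{n-1} \oplus c_{n-2}$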
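A bit-level model helps to check one consistent reading of these Gray-increment equations. The sketch assumes that $c_0$ is the parity of A (1 for an odd number of ones), that $a_i$ is inverted in the carry recurrence, and that $z_0$ is the inverted XOR of $a_0$ and $c_0$; the placement of these inversions is an assumption, since overbars are not visible in the extracted text. The name `gray_increment` is hypothetical and n ≥ 2 is required.

```python
def gray_increment(a, n):
    """Increment an n-bit Gray-coded value a (n >= 2), following the carry equations."""
    bit = lambda i: (a >> i) & 1
    c = [0] * (n - 1)
    for i in range(n):                 # c0 = parity of all bits
        c[0] ^= bit(i)
    for i in range(n - 2):             # c_{i+1} = not(a_i) and c_i, i = 0 .. n-3
        c[i + 1] = (1 - bit(i)) & c[i]
    z = bit(0) ^ c[0] ^ 1              # z0 = not(a0 xor c0)
    for i in range(1, n - 1):          # z_i = a_i xor (a_{i-1} and c_{i-1})
        z |= (bit(i) ^ (bit(i - 1) & c[i - 1])) << i
    z |= (bit(n - 1) ^ c[n - 2]) << (n - 1)
    return z
```

Under this reading, even parity flips the least significant bit, and odd parity flips the bit to the left of the lowest 1, which is the usual Gray counting rule.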
a0
c in
inccg.epsi
62 39 mm
c out
z3
z2
z1
z0
Prefix problem ) AND-prefix structure
8-bit parallel-prefix incrementer (Sklansky AND-prefix

structure) :
a7
a6
a5
a4
a3
a2
a1
a0
c in
incpp.epsi
98 63 mm
c out
z7
z6
z5
z4
z3
z2
z1
z0
56
5.3 Counting
Count clock cycles ) counter,

divide clock frequency ) frequency divider (cout )
Counter using Gray incrementer

+1
cntblock.epsi
32 33 mm
c out
c in
Ring counters
Shift register connected to ring :
clk
Q
Example : Ripple-carry up-counter using counter slices

(= HA + FF), cin is count enable
c out
c in
cntring.epsi
51 16 mm
q n-1
q1
q0
q2
q1
01)
Johnson / twisted-ring counter (inverted feed-back) :

cntjohnson.epsi
59 16 mm
clk
cntasync.epsi
64 18 mm
q n-1
q0
Applications:
fast dividers (no logic between FF)
state counter for one-hot coded FSMs
q2
q1
Must be initialized correctly (e.g. 00
cntripple.epsi
87 36 mm
...
q2
State is not encoded ) n FF for counting n states
Asynchronous counter using toggle-flip-flops

(lower toggle rate ) lower power)
T
5.3 Counting
Gray counter
Binary counter
Sequential in-/decrementer
Incrementer speed-up
techniques applicable
Down- and up-down-counters
using decrementers /
incrementer-decrementers
q n-1
57
Fast divider (T = O(1)) using delayed-carry numbers

(irredundant carry-save represention of ;1 allows using
fast carry-save incrementer) [8]
5.3 Counting
...
q n-1
q1
q0
n FF for counting 2n states
q0
q2
58
59
5.4 Comparison, Coding, Detection
Comparators
Subtractor (A ; B ) :
Comparison operations
$EQ = (A = B)$ (equal)
$NE = (A \ne B) = \overline{EQ}$ (not equal)
$GE = (A \ge B)$ (greater or equal)
$LT = (A < B) = \overline{GE}$ (less than)
$GT = (A > B) = GE \cdot \overline{EQ}$ (greater than)
$LE = (A \le B) = \overline{GT} = \overline{GE} + EQ$ (less or equal)
cmpsub.epsi
37 31 mm
GE = cout
EQ = Pn;1:0
CPA
GE = c out
EQ = P n-1:0
(for free in PPA)
ARCA = 7n TRCA = 2n or
APPA;KS 32 n log n TPPA;KS
2 log n
Optimized comparator :
$eq_{i+1} = (a_i = b_i)\, eq_i = \overline{(a_i \oplus b_i)}\, eq_i$ ; $i = 0, \ldots, n-1$
$eq_0 = 1$, $EQ = eq_n$ (r.s.a.)
removing redundancies in subtractor (unused si )

single-tree structure ) speed-up at no cost :
a0
b0
a2
b2
a n-1
b n-1
EQ = (A = B )
a1
b1
Equality comparison
...
A = 6n TLIN = 2n TTREE
cmpeq.epsi
40 36 mm
2 log n
a0
b0
a1
b1
EQ
a2
b2
a n-1
b n-1
example : ripple comparator using comparator slices
Magnitude comparison
cmpripple.epsi
100 47 mm
$ge_{i+1} = (a_i > b_i) + (a_i = b_i)\, ge_i = a_i \overline{b_i} + \overline{(a_i \oplus b_i)}\, ge_i$ ; $i = 0, \ldots, n-1$
$ge_0 = 1$, $GE = ge_n$ (r.s.a.)
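The two recurrences can be evaluated together in a ripple fashion, one comparator slice per bit from LSB to MSB. A minimal sketch for unsigned operands follows; `ripple_compare` is a hypothetical name.

```python
def ripple_compare(a, b, n):
    """Ripple comparator for unsigned n-bit operands: returns (GE, EQ)."""
    ge, eq = 1, 1                             # ge_0 = 1, eq_0 = 1
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        same = 1 - (ai ^ bi)                  # (a_i = b_i)
        ge = (ai & (1 - bi)) | (same & ge)    # a_i > b_i, or equal and lower bits GE
        eq = same & eq
    return ge, eq
```

The remaining comparison results follow from GE and EQ by the Boolean relations between the comparison operations.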
equality
EQ
60
Decoder
Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$)
$z_i = 1$ if $A = i$, $0$ else ; $i = 0, \ldots, m-1$
$Z = 2^A$

61
Detection operations
All-zeroes detection : $z = \overline{a_{n-1} + a_{n-2} + \cdots + a_0}$
All-ones detection : $z = a_{n-1} a_{n-2} \cdots a_0$ (r.s.a.)
$A = n$, $T = \log n$
a0
decoder.epsi
58 28 mm
decodersym.epsi
26 mm
21decoder
magnitude
GE

equality &
magnitude
...
GE = (A B )
Leading-zeroes detection (LZD) :

for scaling, normalization, priority encoding
z7
z6
z5
z4
z3
z2
z1
z0
a) non-encoded output :
A = (n ; 1)2n T = dlog ne
Encoder
Encodes vector Am;1:0 to binary number Zn;1:0 (m = 2n )
(condition: 9i 8k j if k = i then ak = 1 else ak = 0)
Z = i if ai = 1 ; i = 0 : : : m ; 1 Z = log2 A
A
encodersym.epsi
26 mm
21encoder
z0
a n-2
(e.g. 000101 ! 000100)
A = 2n T = n
a1
...
a0
lzdnenc.epsi
50 28 mm
...
z n-1
z n-2
z1
z0
b) encoded output : + encoder
z1
signed numbers : + leading-ones detector (LOZ)
A = n(2n;1 ; 1)
T =n;1
a n-1
prefix problem (r.m.a.) ) AND-prefix structure
a7a5a3a1
a6a4a2a0
encoder.epsi
30 34 mm
f0g1f0j1g ! f0g1f0g
z2
(note: connections
according to PPA-SK)
62
63
5.5 Shift, Extension, Saturation
Rotation by k bit positions, n constant (logic operation)

Extension of word lengths by k bits (n ! n + k )
(i.e. sign-extension for signed numbers)
Saturation to highest/lowest value after over-/underflow
unl.
signed r.
signed l.
r.
shift b)
unsigned
signed
rotate
extend
l.
r.
unl.
signed r.
signed l.
r.
saturate unsigned
signed
an;2
0 an;1
an;1
an;3
an;1 an;1 an;2
an+k;1
a2n;1 an+k;2
an;2
a0 an;1
0 an;1
an;1
an;1 an;1 an;2
an;1
an;2
an;1
an;1
an;1
:::
:::
:::
:::
:::
:::
:::
:::
:::
:::
:::
:::
:::
:::
a0 0
a1
a0 0
a1
ak
ak
a0 an;1
a1
a0
a0 0
a0
a0 0
an;1
an;1

sll
srl
sla
sra
C
V
formula
cn
cn cn;1
an bnsn + anbn sn
Z
8i : si = 0
N
sn;1
A = O(n2)
T = O(log n)
rol
ror
a3
a2
a1
a3
a0
a0
s1 s0
barshift.epsi
44 49 mm
s1 s0
z3
z2
z1
z0
z3
multiplexers
5.6 Addition Flags
a1
s1 s0
s1
64
a2
s1 s0
muxshift.epsi
41 28 mm
s0
z2
z1
z0
tristate buffers

65
5.6 Addition Flags
Basic and derived condition flags

description
carry flag
signed overflow flag
condition
flag
unsigned
formula
signed
S = A + B (+) or S = A ; B (;)
zero
Z
Z
negative
N
positive
N
S > max overflow C (+)
VC
S < min underflow C (;)
VC
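operation: A − B
condition   flag   unsigned               signed
A = B       EQ     $Z$                    $Z$
A ≠ B       NE     $\overline{Z}$         $\overline{Z}$
A ≥ B       GE     $C$                    $\overline{N}\,\overline{V} + NV$
A > B       GT     $C\overline{Z}$        $(\overline{N}\,\overline{V} + NV)\overline{Z}$
A < B       LT     $\overline{C}$         $N\overline{V} + \overline{N}V$
A ≤ B       LE     $\overline{C} + Z$     $N\overline{V} + \overline{N}V + Z$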
operation:
S=0
S<0
S 0
zero flag
negative flag, sign
Implementation of adder with flags
C, N : for free
V : fast $c_n$, $c_{n-1}$ computed by e.g. PPA ⇒ very cheap
Z : a) $c_{in} = 1$ (subtract.) : $Z = (A = B) = P_{n-1:0}$ (of PPA)
    b) $c_{in} = 0/1$ :
       1) $Z = \overline{s_{n-1} + s_{n-2} + \cdots + s_0}$ (r.s.a.)
          $A = A_{CPA} + n$, $T_Z = T_{CPA} + \lceil \log n \rceil$
       2)
Implementation of shift/extension/rotation by
constant values : hard-wired
variable values : multiplexers
n possible values : nbyn barrel-shifter/rotator
Example : 4by4 barrel-rotator
5.6 Addition Flags

flag
Applications :
adaption of magnitude (shift a)) or word length
(extension) of operands (e.g. for addition)
multiplication/division by multiples of 2 (shift)
logic bit/byte operations (shift, rotation)
scaling of numbers for word-length reduction (i.e.
ignore leading zeroes, shift b)) or normalization (e.g.
of floating-point numbers, shift a)) using LZD
reducing error after over-/underflow (saturation)
Shift : a) shift n-bit vector by k bit positions

b) select n out of more bits at position k
also: logical (= unsigned), arithmetic (= signed)
shift a)
faster without final sum (i.e. carry prop.) [19]

example :
01001 1 00 0
+ 10110 1 00
= 00000 0 00
Unsigned and signed addition/subtraction only differ

with respect to the condition flags
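$z_0 = \overline{(a_0 \oplus b_0) \oplus c_{in}}$
$z_i = \overline{(a_i \oplus b_i) \oplus (a_{i-1} + b_{i-1})}$ ; $i = 1, \ldots, n-1$
$Z = z_{n-1} z_{n-2} \cdots z_0$ (r.s.a.)
$A = A_{CPA} + 3n$, $T_Z = 4 + \lceil \log n \rceil$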
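The zero flag can thus be derived from the operand bits alone, without waiting for the carry propagation of the sum. The sketch below assumes the reading in which each $z_i$ is the inverted XOR of the propagate bit with the OR of the next lower operand bits (with $c_{in}$ taking that role at bit 0); `sum_is_zero` is a hypothetical name.

```python
def sum_is_zero(a, b, cin, n):
    """Return 1 if the n-bit sum A + B + cin is all zeroes, using only local bit tests."""
    z = 1
    for i in range(n):
        p = ((a >> i) ^ (b >> i)) & 1
        # Carry that must enter bit i if all lower sum bits are zero.
        prev = cin if i == 0 else ((a >> (i - 1)) | (b >> (i - 1))) & 1
        z &= 1 - (p ^ prev)
    return z
```

The reasoning behind it: a zero sum bit requires $c_i = a_i \oplus b_i$, and under that condition the next carry simplifies to $a_i + b_i$ (OR), so every bit position can be tested independently and the results combined in an AND tree.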
66
67
5.7 Arithmetic Logic Unit (ALU)
6.1 Multiplication Basics
6 Multiplication
5.7 Arithmetic Logic Unit (ALU)

A
6 Multiplication

Multiplies two n-bit operands A and B [1, 2]
c out alusymbol.epsi c in
29 mm
30 ALU
flags
op
Product P is (2n)-bit unsigned number or (2n − 1)-bit signed number
Example : unsigned multiplication
$P = A \cdot B = \sum_{i=0}^{n-1} a_i 2^i \cdot \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$ or
$P_i = a_i B$, $P = \sum_{i=0}^{n-1} P_i 2^i$ ; $i = 0, \ldots, n-1$ (r.s.a.)
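The second form of the product is the shift-and-add scheme: one partial product $P_i = a_i B$ per multiplier bit, weighted by $2^i$ and accumulated. A minimal unsigned sketch, with the hypothetical name `shift_and_add_multiply`:

```python
def shift_and_add_multiply(a, b, n):
    """Unsigned n-bit multiplication by accumulating the partial products a_i * B * 2^i."""
    p = 0
    for i in range(n):
        a_i = (a >> i) & 1
        p += (a_i * b) << i      # partial product P_i shifted to its weight
    return p
```

In hardware the loop corresponds either to n cycles of an accumulator (sequential multiplier) or to n rows of an adder array (combinational multiplier).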
ALU operations
arithmetic
add
inc
pass
logic
and
or
xor
pass
shift/
rotate
sll
sla
rol
A + B + cin
A+1
A
aibi
ai + bi
ai bi
ai
A 1
A a1
A r1
A;B
A;1
;A
ai bi
ai + bi
ai bi
ai
A 1
A a1
A r1
sub
dec
neg
nand
nor
xnor
not
srl
sra
ror
Algorithm
1) Generation of n partial products Pi
2) Adding up partial products :
a) sequentially (sequential shift-and-add),
b) serially (combinational shift-and-add), or
c) in parallel
s/ro : shift/rotate ; l/r : left/right ;

l/a : logic (unsigned) / arithmetic (signed)
Speed-up techniques
Logic of adder/subtractor can partly be re-used for logic

operations
6 Multiplication
68
Sequential multipliers :
partial products generated
and added sequentially (using
accumulator)
Reduce number of partial products

Accelerate addition of partial products
6 Multiplication
6.2 Unsigned Array Multiplier
6.2 Unsigned Array Multiplier
Braun multiplier : array multiplier for unsigned numbers
mulseq.epsi
34 28 mm
P=
CPA
A = O(n) T = O(log n) L = n
Array multipliers :
partial products generated and
added simultaneously in linear
array (using array adder)
69
CSA
mularr.epsi
mm
34 47
CSA
$A = O(n^2)$, $T = O(n)$
i=0 j =0
b3
CSA
$A = 8n^2 - 11n$
$T = 6n - 9$
aibj 2i+j
a1 b3
a2 b3 a2 b2
+ a3b3 a3b2 a3b1
p7 p6 p5 p4
CSA
;1
nX
;1 nX
a0 b3
a1 b2
a2 b1
a3 b0
p3
b2
a0 b2 a0 b1 a0 b0
a1 b1 a1 b0
a2 b0
p2 p1 p0
b1
b0
a0
CPA
p0
a1
Parallel multipliers :
partial
products
generated in parallel and added
subsequently in multi-operand
adder (using tree adder)
mulpar.epsi
34 43 mm
a2
CPA
a3
$A = O(n^2)$, $T = O(\log n)$
HA
HA
p1
FA
CSA
tree
mulbraun.epsi
FA
99 83 mm
FA
p2
Signed multipliers :
a) complement operands before and result after
multiplication ) unsigned multiplication
b) direct implementation (dedicated multiplier structure)
HA
70
FA
FA
FA
CSA
p3
CPA
3
p7
FA
FA
HA
p6
p5
p4
71
6 Multiplication
6.3 Signed Array Multipliers
6.3 Signed Array Multipliers
Speed-up technique : reduction of partial products
Subtract bits with negative weight ) special FAs [1]
Sequential multiplication
Minimal (or canonical) signed-digit (SD) represent. of A
1 neg. bit : $-a + b + c_{in} = 2c_{out} - s$
2 neg. bits : $a - b - c_{in} = -2c_{out} + s$
+ One cycle per non-zero partial product (i.e. $\forall a_i \mid a_i \ne 0$)

Negative partial products
$s = a \oplus b \oplus c_{in}$
$c_{out} = \overline{a} b + \overline{a} c_{in} + b c_{in}$
Replace FAs in regions

1 , 2 , and 3 by :
(input a at mark )
Data-dependent reduction of partial products and latency

Combinational multiplication
Otherwise exactly same structure and complexity as

Braun multiplier ) efficient and flexible
Only fixed reduction of partial product possible

Radix-4 modified Booth recoding : 2 bits recoded to one
multiplier digit ) n=2 partial products
Arithmetic transformations yield the following partial

products (two additional ones) :
a0 b3
a1b3 a1 b2
a2 b3 a2 b2 a2 b1
a3 b3 a3 b2 a3 b1 a3 b0
a3
a3
+ 1 b3
b3
p7 p6 p5 p4 p3
a0 b2 a0 b1 a0 b0
a1 b1 a1 b0
a2 b0
p2
p1
Less efficient and regular than modified Braun

multiplier
72
6 Multiplication
6.4 Booth Recoding
Applicable to sequential, array, and parallel multipliers
A : +8n
T : +7
additional recoding logic and more

complex partial product generation
(MUX for shift, XOR for negation)
+ adder array/tree cut in half
) considerably smaller (array and tree)
) slightly or not faster for adder trees
0 0 0 ;p3
=
1
+ 1 1 1 p3
f;2 ;1 0 +1 +2g
=0
mulbooth.epsi
41 43 mm
CSA
array/tree
CPA

6 Multiplication
73
6.6 Multiplier Implementations
6.5 Wallace Tree Addition
A = O(n2) T = O(log n)
Negative partial products (avoid sign-extension) :
p
3 p3 p3 p3 p2 p1 p0 =
| {z }
$A = \sum_{i=0}^{n/2-1} (a_{2i-1} + a_{2i} - 2a_{2i+1})\, 2^{2i}$ ; $a_{-1} = 0$
Speed-up technique : fast partial product addition
A : =2
T : =2
T : ;0
) much faster for adder arrays
2;1
n=X
$a_{2i+1}$  $a_{2i}$  $a_{2i-1}$    $P_i$
   0        0        0        + 0
   0        0        1        + B
   0        1        0        + B
   0        1        1        + 2B
   1        0        0        − 2B
   1        0        1        − B
   1        1        0        − B
   1        1        1        − 0
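The recoding table can be reproduced with a few lines that compute each radix-4 digit as $a_{2i-1} + a_{2i} - 2a_{2i+1}$ with $a_{-1} = 0$. The sketch assumes an even word length n and a 2's complement multiplier A given as a Python integer; both function names are hypothetical.

```python
def booth_radix4_digits(a, n):
    """Radix-4 modified Booth digits (each in {-2,...,+2}) of an n-bit 2's complement a, n even."""
    bit = lambda i: 0 if i < 0 else (a >> i) & 1   # a_{-1} = 0
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1) for i in range(n // 2)]

def booth_multiply(a, b, n):
    """Signed multiplication using n/2 Booth-recoded partial products."""
    return sum(d * b * 4 ** i for i, d in enumerate(booth_radix4_digits(a, n)))
```

For example, `booth_radix4_digits(6, 4)` gives `[-2, 2]`, i.e. 6 = −2 + 2·4, so only two partial products (−2B and +2B, shifted) are needed for a 4-bit multiplier.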
p0
A=
Booth
recoding
Baugh-Wooley multiplier
p2 p1 p0
Applicable to parallel multipliers : parallel partial

product generation (normal or Booth recoded)
Irregular adder tree (Wallace tree) due to different
number of bits per column
) irregular wiring and/or layout
) non-uniform bit arrival times at final adder
6.6 Multiplier Implementations
p2 p1 p0
Sequential multipliers :
low performance, small area, component re-use (adder)
Braun or Baugh-Wooley multiplier (array multiplier) :

medium performance, high area, high regularity
layout generators ) data paths and macro-cells
simple pipelining, faster CPA ) higher speed
p03 p02 p01 p00

p03 p03 p03 p03 p02 p01 p00
p13 p12 p11 p10
p13 p13 p13 p12 p11 p10
!
p23 p22 p21 p20
p23 p23 p22 p21 p20
+
p33 p32 p31 p30
+ p33 p32 p31 p30
p6 p5 p4 p3 p2 p1 p0
6.4 Booth Recoding
6.4 Booth Recoding
Modified Braun multiplier
ext. sign
6 Multiplication
p6 p5 p4 p3 p2 p1 p0
Booth-Wallace multiplier (parallel multiplier) [9] :

high performance, high area, low regularity
custom multipliers, netlist generators
often pipelined (e.g. register between CSA-tree and CPA)
Suited for signed multiplication (incl. Booth recod.)

Extend A for unsigned multiplication : an
=0
Radix-8 (3-bit recoding) and higher radices :

precomputing 3B , : : : ) inefficient
Signed-unsigned multiplier : signed multiplier with

operands extended by 1 bit (an = an;1 =0, bn = bn;1 =0)
74
75
6 Multiplication
6.8 Squaring
6.7 Composition from Smaller Multipliers
7.1 Division Basics
AH BL
AH BH AL BL
AL BH
less efficient (area and speed)

6.8 Squaring
P = A2 = AA
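$A = Q \cdot B + R$ ; $R < B$
$R = A \;\mathrm{rem}\; B$ (remainder)
$A \in [0, 2^{2n} - 1]$ ; $B, Q, R \in [0, 2^n - 1]$ ; $B \ne 0$
$Q < 2^n \rightarrow A < 2^n B$, otherwise overflow
⇒ normalize B before division ($B \in [2^{n-1}, 2^n - 1]$)
$\frac{A}{B} = Q + \frac{R}{B}$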
$A \cdot B = (A_H 2^n + A_L)(B_H 2^n + B_L) = A_H B_H 2^{2n} + (A_H B_L + A_L B_H) 2^n + A_L B_L$
4 (n × n)-bit multipliers + (2n)-bit CSA + (3n)-bit CPA
7.1 Division Basics
7 Division / Square Root Extraction
(2n 2n)-bit multiplier can be composed from 4

(n n)-bit multipliers (can be repeated recursively)
A
Algorithms (radix-2)
Subtract-and-shift : partial remainders Ri [1, 2]
Sequential algorithm : recursive, f non-associative
: multiplier optimizations possible
$q_i = (R_{i+1} \ge 2^i B)$, $R_i = R_{i+1} - q_i 2^i B$
$R_n = A$, $R = R_0$ ; $i = n-1, \ldots, 0$ (r.m.n.)
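The subtract-and-shift recurrence translates directly into a loop over the quotient bits from MSB to LSB. The sketch below follows the restoring variant (the trial subtraction is simply discarded when negative); `restoring_divide` is a hypothetical name and the overflow condition $A < 2^n B$ is checked with an assertion.

```python
def restoring_divide(a, b, n):
    """Unsigned division A / B with an n-bit quotient; returns (Q, R)."""
    assert b != 0 and a < (b << n), "quotient must fit into n bits"
    r, q = a, 0                       # R_n = A
    for i in range(n - 1, -1, -1):
        trial = r - (b << i)          # R_{i+1} - B * 2^i
        if trial >= 0:                # q_i = 1: keep the difference
            q |= 1 << i
            r = trial
        # else q_i = 0: partial remainder stays unchanged ("restored")
    return q, r
```

Each loop iteration needs one full carry-propagate subtraction plus the sign decision, which is what makes the basic schemes slow compared with SRT division.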
a0 a3 a0 a2 a0 a1 a0 a0
a1 a3 a1 a2 a1 a1 a1 a0
a2 a3 a2 a2 a2 a1 a2 a0
+ a3a3 a3a2 a3a1 a3a0
a2 a3 a1 a3 a0 a3 a0 a2 a0 a1
a0 a0
a3 a3
a1 a2
a1 a1
+
a2 a2
p7 p6 p5 p4 p3
p2 p1 p0
Basic algorithm : compare and conditionally subtract

) expensive comparison and CPA
Restoring division : subtract and conditionally restore
(adder or multiplexer) ) expensive CPA and restoring
+ bn=2c + 1 partial products (if no Booth recoding used)

) optimized squarer more efficient than multiplier
;
Non-restoring division : detect sign, subtract/add, and

correct by next steps ) expensive CPA
Table look-up (ROM) less efficient for every n
SRT division : estimate range, subtract/add (CSA), and

correct by next steps ) inexpensive CSA

7.2 Restoring Division
76
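$q_i = 1$ if $R_{i+1} - B 2^i \ge 0$ ; $q_i = 0$ if $R_{i+1} - B 2^i < 0$
step i : $R_{i+1} - B 2^i < 0$ : $q_i = 0$, $R_i = R_{i+1}$ (restored)
step i − 1 : $R_{i+1} - B 2^{i-1} \ge 0$ : $q_{i-1} = 1$, $R_{i-1} = R_{i+1} - B 2^{i-1}$

Non-restoring division :
$q'_i = 1$ if $R_{i+1} \ge 0$ ; $q'_i = -1 = \overline{1}$ if $R_{i+1} < 0$
step i : $R_{i+1} \ge 0$ : $q'_i = 1$, $R_i = R_{i+1} - B 2^i$
step i − 1 : $R_{i+1} - B 2^i < 0$ : $q'_{i-1} = \overline{1}$, $R_{i-1} = R_{i+1} - B 2^i + B 2^{i-1} = R_{i+1} - B 2^{i-1}$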
7.4 Signed Division
7.4 Signed Division
q0 = 1 if
i
1 if
Ri+1 B same sign

Ri+1 B opposite sign
Example : signed non-restoring array divider

(simplifications: B > 0, final correction of R omitted)
A = 9n2 T = 2n2 + 4n
a6 b3
b3
a6
b2
a5
b1
a4
b0
a3
One subtraction/addition (CPA) per step

Final correction step for R (additional CPA)
Simple quotient digit conversion : (note: qi0 irredundant)
q3
FA
q2
FA
q1
FA
FA
FA
FA
q0
FA
FA
FA
FA
r3
r2
r1
r0
+/ CPA
+/ CPA
divnr.epsi
mm CPA
46 38 +/
+/ CPA
+/ CPA
R
FA
FA
FA
divarray.epsi
81 101 mm
FA
a1
A B
FA
a2
qi0 2 f1 1g ! qi 2 f0 1g : qi = 12 (qi0 + 1)
Q = (qn;1 qn;2 qn;3 : : : q0 1)
A = (n + 1)ACPA
= O(n2) or O(n2 log n)
T = (n + 1)TCPA
= O(n2) or O(n log n)
FA
78
a0
79
7.5 SRT Division
7.5 SRT Division

8
>
>
<1
if
>
:1
if
B 2i Ri+1
qi0 is SD number
Ri+1 < B 2i
i
Ri+1 < ;B 2
Radix
CPA
:::
; 1g
Division by convergence
A = A R0R1
Q= B
B R0 R1
Rm;1 ! A
Rm;1 B
Algorithm :
+/ CSA
+/ CSA
divsrt.epsi
mm
+/ CSA
50 38
+/ CSA
+/ CPA
= Q1 resp. 2Qn
Bi+1 = Bi Ri Ai+1 = Ai Ri
Ri = B i + 1 ; i = 0 : : : m ; 1
A0 = A B0 = B Q = Am (r.s.n.)
Quadratic convergence :
B
1
B
Bi+1 = Bi Ri = 2| n(1{z; y)} (|1 +

y) = 2| n(1{z; y2 )}
{z }
Bi
Ri
> Bi ! 2n
;
n
;
n
y = 1 ; Bi2 Ri = 2 ; Bi2 = B i + 1 (signed)
R
80
7.8 Remainder / Modulus
Division by reciprocation
L = dlog ne

81
7.9 Divider Implementations
7.9 Divider Implementations
A =A 1
Q= B
B
Iterative dividers (through multiplication) :

re-use of existing components (multiplier)
Newton-Raphson iteration method :
f (X ) = 0
1 0 1
7.7 Division by Multiplication
A B
:::
Complex comparisons (more bits) and decisions

) table look-up () Pentium bug!)
+ Only 3 MSB are compared ) qi0 are estimated ) CSA

instead of CPA can be used (precise enough) [20]
Correction in following steps (+ final correction step)
Redundant representation of qi0 (SD representation) )
final conversion necessary (CPA)
+ Highly regular and fast (O(n)) SRT array dividers
) only slightly slower/larger than array multipliers
;1
+ Suitable for SRT algorithm ) faster
B < 2n , i.e. B is normalized :

;2n+i;1 Ri+1 < 2n+i;1 B 2i
8
>
2n+i;1 Ri+1
>
<1 if
0
qi = >0 if ;2n+i;1 Ri+1 < 2n+i;1
>
:1 if
Ri+1 < ;2n+i;1
A = nACSA + 2ACPA
= O(n2)
T = nTCSA + TCPA
= O(n)
= 2m , qi0 2 f
m quotient bits per step ) fewer, but more complex steps
if 2n;1
) ;B 2i
find
7.7 Division by Multiplication
7.6 High-Radix Division
qi0 = >0 if ;B 2i
by recursion
$X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}$
medium performance, medium area

high efficiency if components are re-used
$f(X) = \frac{1}{X} - B$, $f'(X) = -\frac{1}{X^2}$, $f\!\left(\frac{1}{B}\right) = 0$
Sequential dividers (restoring, non-restoring, SRT) :

re-use of existing components (e.g. adder)
Algorithm :
low performance, low area
$X_{i+1} = X_i (2 - B X_i)$ ; $i = 0, \ldots, m-1$
X0 = B Q = Xm (r.s.n.)
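The reciprocal iteration is easy to try out in floating point. The following sketch assumes B is normalized into [0.5, 1) and uses a constant first approximation of 1.5 instead of a table look-up; both choices, as well as the function names, are illustrative assumptions rather than the scheme of the notes.

```python
def reciprocal(b, steps=6):
    """Newton-Raphson approximation of 1/b for b in [0.5, 1)."""
    x = 1.5                          # crude first approximation (assumption)
    for _ in range(steps):
        x = x * (2 - b * x)          # X_{i+1} = X_i (2 - B X_i)
    return x

def divide_by_reciprocation(a, b):
    """Q = A * (1/B), using only multiplications and subtractions."""
    return a * reciprocal(b)
```

The error term $1 - B X_i$ is squared in every step, which is the quadratic convergence noted on the slide; a better first approximation from a table reduces the number of iterations.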
Array dividers (restoring, non-restoring, SRT) :

dedicated hardware component
L = O(log n)
Speed-up : first approximation X0 from table
Quadratic convergence :
high performance, high area

high regularity ) layout generators, pipelining
square root extraction possible by minor changes
7.8 Remainder / Modulus
combination with multiplication or/and square root
Remainder (rem) : signed remainder of a division
R = A rem B
sign(R) = sign(A)
No parallel dividers exist (sequential nature of division)
Modulus (mod) : positive remainder of a division
M = A mod B M
M= R
R+B
if A
else
82
83
7.10 Square Root Extraction
7.10 Square Root Extraction

p
A;R =Q
A2
0 22n ; 1]
A = Q2 + R
Exponential function : ex (exp x)

Logarithm function : ln x, log x
Trigonometric functions : sin x, cos x, tan x
Algorithm
Inverse trig. functions : arcsin x, arccos x, arctan x
Subtract-and-shift : partial remainders $R_i$ and quotients $Q_i = Q_{i+1} + q_i 2^i = (q_{n-1}, \ldots, q_i, 0, \ldots, 0)$
$Q_i^2 = (Q_{i+1} + q_i 2^i)^2 = Q_{i+1}^2 + q_i 2^i (2Q_{i+1} + q_i 2^i)$
Hyperbolic functions : sinh x, cosh x, tanh x

8.1 Algorithms
$q_i = \left(R_{i+1} \ge 2^i (2Q_{i+1} + 2^i)\right)$
$Q_i = Q_{i+1} + q_i 2^i$
$R_i = R_{i+1} - q_i 2^i (2Q_{i+1} + q_i 2^i)$ ; $i = n-1, \ldots, 0$
$R_n = A$, $Q_n = 0$, $R = R_0$, $Q = Q_0$ (r.m.n.)
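The square-root recurrence has the same structure as restoring division, with the subtrahend depending on the partial root. A minimal integer sketch for a 2n-bit radicand follows; `restoring_sqrt` is a hypothetical name.

```python
def restoring_sqrt(a, n):
    """Integer square root of a 2n-bit value a; returns (Q, R) with a = Q*Q + R."""
    r, q = a, 0                               # R_n = A, Q_n = 0
    for i in range(n - 1, -1, -1):
        t = (2 * q + (1 << i)) << i           # 2^i (2 Q_{i+1} + 2^i)
        if r >= t:                            # q_i = 1
            r -= t
            q += 1 << i
    return q, r
```

For example, `restoring_sqrt(200, 4)` returns `(14, 4)`, since 200 = 14² + 4.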
Table look-up : inefficient for large word lengths [5]

Taylor series expansion : complex implementation
Polynomial and rational approximations [1, 5]
Shift-and-add algorithms [5]
Implementation
Convergence algorithms [1, 2] :
+ Similar to division ) same algorithms applicable

(restoring, non-restoring, SRT, high-radix)
+ Combination with division in same component possible
similar to division-by-convergence
two (or more) recursive formulas : one formula
converges to a constant, the other to the result
Only triangular array required

(step i : qk i = 0)
Coordinate rotation (CORDIC) [2, 5, 21] :
A ADIV =2
T TDIV
+/ CPA
sqrtnr.epsi
+/ CPA
mmCPA
42 36+/
+/ CPA
+/ CPA
3 equations for x-, y-coordinate, and angle

computes all elementary functions by proper input
settings and choice of modes and outputs
simple, universal hardware, small look-up table
R
8 Elementary Functions
84
8.2 Integer Exponentiation
8.2 Integer Exponentiation
= (: : :
0 1 0
|{z}
A
: : :)
Integer exponentiation (exact) :
L=0
85
8.3 Integer Logarithm
$E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0} = (\cdots((A^{b_{n-1}})^2 A^{b_{n-2}})^2 \cdots A^{b_1})^2 A^{b_0}$
$E_i = E_{i+1}^2 A^{b_i}$ ; $i = n-1, \ldots, 0$
$E_n = 1$, $E = E_0$ (r.s.n.)
$A = A_{MUL}$, $T = T_{MUL}$, $L = 2(n - 1)$
2n ; 1 (!)
8.3 Integer Logarithm
Applications : modular exponentiation AB (mod

in cryptographic algorithms (e.g. IDEA, RSA)
C)
Z = blog2 Ac
For detection/comparison of order of magnitude
Algorithms : square-and-multiply
a)
xy = ey ln x = 2y log x
Base-2 integer exponentiation : 2A
= A| A{z A}
b)
Approximated exponentiation :
AB
8.1 Algorithms
0 2n ; 1]
Q2
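$E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0} = A^{2^{n-1} b_{n-1}} A^{2^{n-2} b_{n-2}} \cdots A^{4 b_2} A^{2 b_1} A^{b_0}$
Corresponds to leading-zeroes detection (LZD) with encoded output
$E_i = P_i^{b_i} E_{i-1}$, $P_{i+1} = P_i^2$ ; $i = 0, \ldots, n-1$
$E_{-1} = 1$, $P_0 = A$, $E = E_{n-1}$ (r.s.n.)
$A = 2A_{MUL}$, $T = T_{MUL}$, $L = n$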
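Both square-and-multiply recurrences can be compared side by side. In this sketch the exponent B is scanned MSB-first in the first function (one multiplier, square then conditionally multiply) and LSB-first in the second (a running power $P_i$ plus an accumulator, i.e. two multipliers working in parallel in hardware); the function names are hypothetical.

```python
def power_left_to_right(a, b, n):
    """E_i = E_{i+1}^2 * A^{b_i}, scanning the n exponent bits from MSB to LSB."""
    e = 1
    for i in range(n - 1, -1, -1):
        e = e * e
        if (b >> i) & 1:
            e *= a
    return e

def power_right_to_left(a, b, n):
    """E_i = P_i^{b_i} * E_{i-1}, P_{i+1} = P_i^2, scanning from LSB to MSB."""
    e, p = 1, a
    for i in range(n):
        if (b >> i) & 1:
            e *= p
        p *= p
    return e
```

For modular exponentiation, as used in cryptographic algorithms, each product would additionally be reduced modulo C.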
A = AMUL T = TMUL L = 2n
or
86
87
9 VLSI Design Aspects
9.1 Design Levels

Transistor-level design
Circuit and layout designed by hand (full custom)
Low design efficiency
High circuit performance : high speed, low area
High flexibility : choice of architecture and logic style
Transistor-level circuit optimizations :
  logic style : static vs. dynamic logic, complementary CMOS vs. pass-transistor logic
  special arithmetic circuits : better than with gates
  [Figure carrychain.epsi: carry chain]
  [Figure facmos.epsi: full-adder]

Gate-level design
Cell-based design techniques : standard-cells, gate-array/sea-of-gates, field-programmable gate-array (FPGA)
Circuit implemented by hand or by synthesis (library)
Layout implemented by automated place-and-route
Medium to high design efficiency
Medium to low circuit performance
Medium to low flexibility : full choice of architecture

Block-level design
Layout blocks and netlists from parameterized automatic generators or compilers (library)
High design efficiency
Medium to high circuit performance
Low flexibility : limited choice of architectures
Implementations :
  data-path : bit-sliced, bus-oriented layout (array of cells: n bits × m operations), implementation of entire data paths, medium performance, medium diversity
  macro-cells : tiled layout, fixed/single-operation components, high performance, small diversity
  portable netlists : ⇒ gate-level design
9.2 Synthesis

High-level synthesis
  Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL
  Involves architectural synthesis and arithmetic transformations
  High-level synthesis is still in its early stages

Low-level synthesis
  Layout and netlist generators
  Included in libraries and synthesis tools
  Low-level synthesis is state-of-the-art
  Basis for efficient ASIC design
  Limited diversity and flexibility of library components

Circuit optimization
  Efficient optimization of random logic (low factorization degree) is state-of-the-art
  Optimization of entire arithmetic circuits (high factorization degree) is not feasible => only local optimizations possible
  Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators

9.3 VHDL

Arithmetic types : unsigned, signed (2's complement)

Arithmetic packages
  numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys)
  contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types

Arithmetic operators (VHDL87/93) [22]
  relational              : =, /=, <, <=, >, >=
  shift, rotate (93 only) : rol, ror, sla, sll, sra, srl
  adding                  : +, -
  sign (unary)            : +, -
  multiplying             : *, /, mod, rem
  exponent, absolute      : **, abs

Synthesis
  Typical limitations of synthesis tools :
    /, mod, rem : both operands must be constant or divisor must be a power of two
    ** : for power-of-two bases only
  Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)
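The power-of-two restriction on /, mod, rem and ** is easier to see with a small numerical model. The following Python sketch is an illustration only, not taken from the notes: it assumes unsigned operands, and the helper names `divmod_pow2` and `pow2` are hypothetical. It shows that dividing by 2**k needs only a shift and a bit mask, and that 2**k itself is a single shifted bit, so no divider or multiplier hardware is required.

```python
def divmod_pow2(a: int, k: int) -> tuple:
    """Quotient and remainder of unsigned a divided by 2**k (shift and mask only)."""
    if a < 0 or k < 0:
        raise ValueError("sketch covers unsigned operands and k >= 0 only")
    quotient = a >> k                # drop the k least significant bits
    remainder = a & ((1 << k) - 1)   # keep the k least significant bits
    return quotient, remainder


def pow2(k: int) -> int:
    """2**k as a single shifted bit (the power-of-two base case of '**')."""
    if k < 0:
        raise ValueError("sketch covers k >= 0 only")
    return 1 << k


if __name__ == "__main__":
    # 45 = 5 * 8 + 5
    print(divmod_pow2(45, 3))
    print(pow2(5))
```

For signed operands the VHDL operators mod and rem differ in the sign of the result, which this unsigned sketch deliberately leaves out.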
Resource sharing
  Sharing one resource for multiple operations
  Done automatically by some synthesis tools
  Otherwise, appropriate coding is necessary :
    a) S <= A + C when SELA = '1' else B + C;
       => 2 adders + 1 multiplexer
    b) T <= A when SELA = '1' else B;
       S <= T + C;
       => 1 multiplexer + 1 adder

Coding & synthesis hints
  Addition : single adder with carry-in/carry-out :
    Aext <= resize(A, width+1) & Cin;
    Bext <= resize(B, width+1) & '1';
    Sext <= Aext + Bext;
    S    <= Sext(width downto 1);
    Cout <= Sext(width+1);
  Synthesis : check synthesis result for allocated arithmetic units => code sanity check, control of circuit size

VHDL library of arithmetic units
  Structural, synthesizable VHDL code for most circuits described in this text is found in [23]

9.4 Performance

Pipelining
  Pipelining is basically possible with every combinational circuit => higher throughput
  Arithmetic circuits are well suited for pipelining due to high regularity
  Pipelining of arithmetic circuits can be very costly :
    large number of internal signals in arithmetic circuits
    array structures : many small pipeline registers
    tree structures : few large pipeline registers
    => no advantage of tree structures anymore (except for smaller latency)
  Fine-grain pipelining => systolic arrays (often applied to arithmetic circuits)

High speed
  Fast circuit architectures, pipelining, replication (parallelization), and combinations of those
  Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
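The carry-in/carry-out coding hint of Section 9.3 can be checked numerically. The Python sketch below is a behavioral model under stated assumptions (unsigned operands; the function name `add_with_carry` is hypothetical): appending Cin to A and a constant 1 to B makes the least significant bit position produce a carry exactly when Cin = 1, so a single adder of width+2 bits delivers both the sum and the carry-out.

```python
def add_with_carry(a: int, b: int, cin: int, width: int) -> tuple:
    """Model of the single-adder carry-in/carry-out coding style (unsigned)."""
    mask = (1 << width) - 1
    if not (0 <= a <= mask and 0 <= b <= mask and cin in (0, 1)):
        raise ValueError("operands must fit in 'width' bits and cin must be 0 or 1")
    aext = (a << 1) | cin              # resize(A, width+1) & Cin
    bext = (b << 1) | 1                # resize(B, width+1) & '1'
    sext = aext + bext                 # the only addition performed
    s = (sext >> 1) & mask             # Sext(width downto 1)
    cout = (sext >> (width + 1)) & 1   # Sext(width+1)
    return s, cout


if __name__ == "__main__":
    # 9 + 7 + 1 = 17 = 16 + 1 on 4 bits
    print(add_with_carry(9, 7, 1, 4))
```

Bit 0 of the extended sum is discarded; it exists only to inject the carry-in into the adder.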
Low power
  Power-related properties of arithmetic circuits :
    High glitching activity due to high bit dependencies and large logic depth
  Power reduction in arithmetic circuits [24] :
    Reduce the switched capacitance by choosing an area efficient circuit architecture
    Allow for lower supply voltage by speeding up the circuitry
    Reduce the transition activity :
      apply stable inputs while circuit is not in use (=> disabling subcircuits)
      reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
      reduce glitching transitions by reducing logic depth (pipelining)
      take advantage of correlated data streams
      choose appropriate number representations (e.g. Gray codes for counters)

9.5 Testability

Testability goal : high fault coverage with few test vectors that are easy to generate/apply
Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits
Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits
Hard-detectable faults found in :
  circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags
  circuits using redundant number representations (≠ redundant hardware) : dividers (Pentium bug!)
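The remark about incrementers and counters (MSBs) can be illustrated with a small fault model. The Python sketch below is an assumption-laden illustration, not the author's method: it models a width-bit incrementer behaviorally and injects one hypothetical fault, the carry into the MSB stuck at 0. The names `increment` and `detecting_vectors` are made up for this example. Enumerating all inputs shows how few vectors expose such a fault.

```python
def increment(a: int, width: int, msb_carry_stuck_at_0: bool = False) -> int:
    """Behavioral model of a width-bit incrementer, (a + 1) mod 2**width.

    The carry into the MSB equals the AND of all lower input bits.
    The optional flag forces that carry to 0 (hypothetical stuck-at-0 fault).
    """
    if width < 1 or not 0 <= a < (1 << width):
        raise ValueError("a must fit in 'width' bits and width must be >= 1")
    low_mask = (1 << (width - 1)) - 1
    low_bits = a & low_mask
    low_sum = (low_bits + 1) & low_mask          # lower width-1 result bits
    carry = 1 if low_bits == low_mask else 0     # AND of all lower bits
    if msb_carry_stuck_at_0:
        carry = 0
    msb = ((a >> (width - 1)) & 1) ^ carry
    return (msb << (width - 1)) | low_sum


def detecting_vectors(width: int) -> list:
    """All inputs for which the faulty and fault-free outputs differ."""
    return [a for a in range(1 << width)
            if increment(a, width) != increment(a, width, True)]


if __name__ == "__main__":
    print(detecting_vectors(8))
```

In this model only the two inputs whose lower bits are all ones expose the fault, regardless of width, so the chance that a single random vector detects it halves with every added bit. This is why such faults call for special test vectors.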
Bibliography

[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993.
[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
[3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.
[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhauser Boston, 1997.
[6] Proceedings of the Xth Symposium on Computer Arithmetic.
[7] IEEE Transactions on Computers.
[8] D. R. Lutz and D. N. Jayasimha, Programmable modulo-k counters, IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939-941, Nov. 1996.
[9] H. Makino et al., An 8.8-ns 54 × 54-bit multiplier with high speed redundant binary architecture, IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773-783, June 1996.
[10] W. N. Holmes, Composite arithmetic: Proposal for a new standard, IEEE Computer, vol. 30, no. 3, pp. 65-73, Mar. 1997.
[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.
[12] A. Tyagi, A reduced-area scheme for carry-select adders, IEEE Trans. Comput., vol. 42, no. 10, pp. 1162-1170, Oct. 1993.
[13] T. Han and D. A. Carlson, Fast area-efficient VLSI adders, in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49-56.
[14] R. Zimmermann, Non-heuristic optimization and synthesis of parallel-prefix adders, in Proc. Int. Workshop on Logic and Architecture Synthesis, Grenoble, France, Dec. 1996, pp. 123-132.
[15] D. W. Dobberpuhl et al., A 200-MHz 64-b dual-issue CMOS microprocessor, IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1564, Nov. 1992.
[16] A. De Gloria and M. Olivieri, Statistical carry lookahead adders, IEEE Trans. Comput., vol. 45, no. 3, pp. 340-347, Mar. 1996.
[17] V. G. Oklobdzija, D. Villeger, and S. S. Liu, A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach, IEEE Trans. Comput., vol. 45, no. 3, pp. 294-305, Mar. 1996.
[18] Z. Wang, G. A. Jullien, and W. C. Miller, A new design technique for column compression multipliers, IEEE Trans. Comput., vol. 44, no. 8, pp. 962-970, Aug. 1995.
[19] J. Cortadella and J. M. Llaberia, Evaluation of A + B = K conditions without carry propagation, IEEE Trans. Comput., vol. 41, no. 11, pp. 1484-1488, Nov. 1992.
[20] S. E. McQuillan and J. V. McCanny, Fast VLSI algorithms for division and square root, J. VLSI Signal Processing, vol. 8, pp. 151-168, Oct. 1994.
[21] Y. H. Hu, CORDIC-based VLSI architectures for digital signal processing, IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16-35, July 1992.
[22] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997.
[23] R. Zimmermann, VHDL Library of Arithmetic Units, http://www.iis.ee.ethz.ch/zimmi/arith_lib.html.
[24] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.
