Computer Arithmetic

Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich

Institut für Integrierte Systeme
Integrated Systems Laboratory

Lecture notes on
Computer Arithmetic: Principles, Architectures, and VLSI Design

March 16, 1999

Reto Zimmermann
Integrated Systems Laboratory
Swiss Federal Institute of Technology (ETH)
CH-8092 Zurich, Switzerland
zimmermann@iis.ee.ethz.ch

Copyright © 1999 by Integrated Systems Laboratory, ETH Zurich
http://www.iis.ee.ethz.ch/ zimmi/publications/comp arith notes.ps.gz
Contents

1 Introduction and Conventions
  1.1 Outline
  1.2 Motivation
  1.3 Conventions
  1.4 Recursive Function Evaluation
2 Arithmetic Operations
  2.1 Overview
  2.2 Implementation Techniques
3 Number Representations
  3.1 Binary Number Systems (BNS)
  3.2 Gray Numbers
  3.3 Redundant Number Systems
  3.4 Residue Number Systems (RNS)
  3.5 Floating-Point Numbers
  3.6 Logarithmic Number System
  3.7 Antitetrational Number System
  3.8 Composite Arithmetic
  3.9 Round-Off Schemes
4 Addition
  4.1 Overview
  4.2 1-Bit Adders, (m, k)-Counters
  4.3 Carry-Propagate Adders (CPA)
  4.4 Carry-Save Adder (CSA)
  4.5 Multi-Operand Adders
  4.6 Sequential Adders
5 Simple / Addition-Based Operations
  5.1 Complement and Subtraction
  5.2 Increment / Decrement
  5.3 Counting
  5.4 Comparison, Coding, Detection
  5.5 Shift, Extension, Saturation
  5.6 Addition Flags
  5.7 Arithmetic Logic Unit (ALU)
6 Multiplication
  6.1 Multiplication Basics
  6.2 Unsigned Array Multiplier
  6.3 Signed Array Multipliers
  6.4 Booth Recoding
  6.5 Wallace Tree Addition
  6.6 Multiplier Implementations
  6.7 Composition from Smaller Multipliers
  6.8 Squaring
7 Division / Square Root Extraction
  7.1 Division Basics
  7.2 Restoring Division
  7.3 Non-Restoring Division
  7.4 Signed Division
  7.5 SRT Division
  7.6 High-Radix Division
  7.7 Division by Multiplication
  7.8 Remainder / Modulus
  7.9 Divider Implementations
  7.10 Square Root Extraction
8 Elementary Functions
  8.1 Algorithms
  8.2 Integer Exponentiation
  8.3 Integer Logarithm
9 VLSI Design Aspects
  9.1 Design Levels
  9.2 Synthesis
  9.3 VHDL
  9.4 Performance
  9.5 Testability
Bibliography

1 Introduction and Conventions

1.1 Outline

Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7]
Circuit architectures and implementations of main arithmetic operations
Aspects regarding VLSI design of arithmetic units

1.2 Motivation

Arithmetic units are, among others, the core of every data path and addressing unit
Data path is the core of :
  microprocessors (CPU)
  signal processors (DSP)
  data-processing application-specific ICs (ASIC) and programmable ICs (e.g. FPGA)
Standard arithmetic units available from libraries
Design of arithmetic units necessary for :
  non-standard operations
  high-performance components
  library development

1.3 Conventions

Naming conventions
  Signal buses : $A$ (1-D), $A_i$ (2-D), $a_{i:k}$ (subbus, 1-D)
  Signals : $a$, $a_i$ (1-D), $a_{i,k}$ (2-D), $A_{i:k}$ (group signal)
  Circuit complexity measures : $A$ (area), $T$ (cycle time, delay), $AT$ (area-time product), $L$ (latency, # cycles)
  Arithmetic operators : $+$, $-$, $\cdot$, $/$, $\log$ ($= \log_2$)
  Logic operators : $+$ (or), $\cdot$ (and), $\oplus$ (xor), $\odot$ (xnor), $\overline{\phantom{a}}$ (not)

Circuit complexity measures
  Unit-gate model ($\approx$ gate-equivalents (GE) model) :
    Inverter, buffer : $A = 0$ ; $T = 0$ (i.e. ignored)
    Simple monotonic 2-input gates (AND, NAND, OR, NOR) : $A = 1$ ; $T = 1$
    Simple non-monotonic 2-input gates (XOR, XNOR) : $A = 2$ ; $T = 2$
    Complex gates : composed from simple gates
    ⇒ Simple $m$-input gates : $A = m - 1$ ; $T = \lceil \log m \rceil$
  Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization)
  Only estimations given for complex circuits
1.4 Recursive Function Evaluation

Given : inputs $a_i$, outputs $z_i$, function $f$ (graph symbol : $\bullet$)

[Figures : example structures for $n = 4$ with inputs $a_3, \ldots, a_0$ and outputs $z$ or $z_3, \ldots, z_0$]

Non-recursive functions (n.)
  Output $z_i$ is a function of input $a_i$ (or $a_{j+m:j}$ ; $m$ const.)
    $z_i = f(a_i, x)$ ; $i = 0, \ldots, n-1$
  ⇒ parallel structure : $A = O(n)$ ; $T = O(1)$

Recursive functions (r.)
  Output $z_i$ is a function of all inputs $a_k$ ; $k \le i$

  a) with single output $z = z_{n-1}$ (r.s.) :
    $t_i = f(a_i, t_{i-1})$ ; $i = 0, \ldots, n-1$ ; $t_{-1} = 0/1$ ; $z = t_{n-1}$
    1. $f$ is non-associative (r.s.n.)
       ⇒ serial structure : $A = O(n)$ ; $T = O(n)$
    2. $f$ is associative (r.s.a.)
       ⇒ serial or single-tree structure : $A = O(n)$ ; $T = O(\log n)$

  b) with multiple outputs $z_i$ (r.m.) (⇒ prefix problem) :
    $z_i = f(a_i, z_{i-1})$ ; $i = 0, \ldots, n-1$ ; $z_{-1} = 0/1$
    1. $f$ is non-associative (r.m.n.)
       ⇒ serial structure : $A = O(n)$ ; $T = O(n)$
    2. $f$ is associative (r.m.a.)
       ⇒ serial or multi-tree structure : $A = O(n^2)$ ; $T = O(\log n)$
       ⇒ or shared-tree structure : $A = O(n \log n)$ ; $T = O(\log n)$
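The serial and tree structures for recursive functions can be illustrated in software. The following is a minimal sketch, assuming integer addition as the associative function $f$; the function names `serial_reduce`, `tree_reduce`, and `serial_prefix` are hypothetical and chosen for this example only. The returned depth models the number of $f$-levels on the critical path, not a real circuit delay.

```python
from operator import add

def serial_reduce(f, a, t_init):
    """Single output, serial structure: t_i = f(a_i, t_{i-1}); n levels of f."""
    t = t_init
    for a_i in a:
        t = f(a_i, t)
    return t

def tree_reduce(f, a):
    """Single output, tree structure (needs associative f).

    Returns (result, depth), where depth is the number of tree levels.
    """
    level, depth = list(a), 0
    while len(level) > 1:
        # Combine neighbours pairwise; the higher index is the first argument.
        nxt = [f(level[i + 1], level[i]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

def serial_prefix(f, a):
    """Multiple outputs, serial structure: z_0 = a_0, z_i = f(a_i, z_{i-1})."""
    z = [a[0]]
    for a_i in a[1:]:
        z.append(f(a_i, z[-1]))
    return z

inputs = [3, 1, 4, 1, 5, 9, 2, 6]
total_serial = serial_reduce(add, inputs, 0)
total_tree, tree_depth = tree_reduce(add, inputs)
prefix_sums = serial_prefix(add, inputs)
```

For eight inputs the tree needs three levels, whereas the serial structure needs eight, which mirrors the $T = O(\log n)$ versus $T = O(n)$ distinction above.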
2 Arithmetic Operations

2.1 Overview

[Figure : overview of arithmetic operations for fixed-point and floating-point numbers, showing which operation is based on or related to which other operation]

Operations (numbered as in the figure) :
   1  shift/extension
   2  comparison
   3  increment/decrement
   4  complement
   5  addition/subtraction
   6  multiplication
   7  division
   8  square root extraction
   9  exponential function
  10  logarithm function
  11  trigonometric functions
  12  hyperbolic functions

2.2 Implementation Techniques

Direct implementation of dedicated units :
  always : 1 to 5
  in most cases : 6
  sometimes : 7, 8

Sequential implementation using simpler units and several clock cycles (⇒ decomposition) :
  sometimes : 6
  in most cases : 7, 8, 9

Table look-up techniques using ROMs :
  universal : simple application to all operations
  efficient only for single-operand operations of high complexity (8 to 12) and small word length (note : ROM size $= 2^n \times n$)

Approximation techniques using simpler units : 7 to 12
  Taylor series expansion
  polynomial and rational approximations
  convergence of recursive equation systems
  CORDIC (COordinate Rotation DIgital Computer)
3 Number Representations

3.1 Binary Number Systems (BNS)

Radix-2, binary number system (BNS) : irredundant, weighted, positional, monotonic [1, 2]
$n$-bit number is ordered sequence of bits (binary digits) :
  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_2$ ; $a_i \in \{0, 1\}$
Simple and efficient implementation in digital circuits
MSB/LSB (most-/least-significant bit) : $a_{n-1}$ / $a_0$
Represents an integer or fixed-point number, exact
Fixed-point numbers : $(a_{m-1}, \ldots, a_0 \,.\, a_{-1}, \ldots, a_{m-n})$ with $m$-bit integer part and $(n-m)$-bit fraction

Unsigned : positive or natural numbers
  Value : $A = a_{n-1} 2^{n-1} + \cdots + a_1 2 + a_0 = \sum_{i=0}^{n-1} a_i 2^i$
  Range : $[0, 2^n - 1]$

Two's (2's) complement : standard representation of signed or integer numbers
  Value : $A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$
  Range : $[-2^{n-1}, 2^{n-1} - 1]$
  Complement : $-A = 2^n - A = \overline{A} + 1$, where $\overline{A} = (\overline{a}_{n-1}, \overline{a}_{n-2}, \ldots, \overline{a}_0)$
  Sign : $a_{n-1}$
  Properties : asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

Ones' (1's) complement : similar to 2's complement
  Value : $A = -a_{n-1} (2^{n-1} - 1) + \sum_{i=0}^{n-2} a_i 2^i$
  Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
  Complement : $-A = 2^n - A - 1 = \overline{A}$
  Sign : $a_{n-1}$
  Properties : double representation of zero, symmetric range, modulo $(2^n - 1)$ number system

Sign-magnitude : alternative representation of signed numbers
  Value : $A = (-1)^{a_{n-1}} \sum_{i=0}^{n-2} a_i 2^i$
  Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
  Complement : $-A = (\overline{a}_{n-1}, a_{n-2}, \ldots, a_0)$
  Sign : $a_{n-1}$
  Properties : double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 (⇒ low power)

Graphical representation
  [Figure : represented value versus binary number representation (000...0, 011...1, 100...0, 111...1) for unsigned, 2's complement, 1's complement, and sign-magnitude numbers]

Conventions
  2's complement used for signed numbers in these notes
  Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned

3.2 Gray Numbers

Gray numbers (code) : binary, irredundant, non-weighted, non-monotonic
+ Property : unit-distance coding (i.e. exactly one bit toggles between adjacent numbers)
Applications : counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
- Non-monotonic numbers : difficult arithmetic operations, e.g. addition, comparison :

    $g_1 g_0$     $g'_1 g'_0$          $g_0$   $g'_0$
     0  0    <    0  1      and     0   <   1
     1  1    <    1  0      but     1   >   0

binary → Gray : $g_i = b_{i+1} \oplus b_i$ ; $b_n = 0$ ; $i = 0, \ldots, n-1$ (n.)
Gray → binary : $b_i = b_{i+1} \oplus g_i$ ; $b_n = 0$ ; $i = n-1, \ldots, 0$ (r.m.a.)

  unsigned    binary          Gray
              b3 b2 b1 b0     g3 g2 g1 g0
      0       0  0  0  0      0  0  0  0
      1       0  0  0  1      0  0  0  1
      2       0  0  1  0      0  0  1  1
      3       0  0  1  1      0  0  1  0
      4       0  1  0  0      0  1  1  0
      5       0  1  0  1      0  1  1  1
      6       0  1  1  0      0  1  0  1
      7       0  1  1  1      0  1  0  0
      8       1  0  0  0      1  1  0  0
      9       1  0  0  1      1  1  0  1
     10       1  0  1  0      1  1  1  1
     11       1  0  1  1      1  1  1  0
     12       1  1  0  0      1  0  1  0
     13       1  1  0  1      1  0  1  1
     14       1  1  1  0      1  0  0  1
     15       1  1  1  1      1  0  0  0
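The two Gray conversion rules differ in structure: binary → Gray is non-recursive (each output bit depends on two input bits), while Gray → binary is a prefix XOR from the MSB down. A minimal software sketch follows; the function names `binary_to_gray` and `gray_to_binary` are hypothetical, and unbounded Python integers stand in for $n$-bit words.

```python
def binary_to_gray(b: int) -> int:
    # g_i = b_{i+1} XOR b_i with b_n = 0: XOR the word with itself shifted right by one.
    return b ^ (b >> 1)

def gray_to_binary(g: int) -> int:
    # b_i = b_{i+1} XOR g_i, i.e. b_i is the XOR of all Gray bits g_{n-1} .. g_i.
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# 4-bit code table as (value, gray) pairs
table = [(i, binary_to_gray(i)) for i in range(16)]
```

Adjacent entries of `table` differ in exactly one bit, which is the unit-distance property stated above.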
3.3 Redundant Number Systems

Non-binary, redundant, weighted number systems [1, 2]
Digit set larger than radix (typically radix 2) ⇒ multiple representations of same number ⇒ redundancy
+ No carry-propagation in adders ⇒ more efficient implementation of adder-based units (e.g. multipliers and dividers)
- Redundancy ⇒ no direct implementation of relational operators ⇒ conversion to irredundant numbers
- Several bits used to represent one digit ⇒ higher storage requirements
- Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

Delayed-carry or half-adder number representation :
  $r_i \in \{0, 1, 2\}$ , $c_i, s_i, a_i, b_i \in \{0, 1\}$
  $r_i = (c_{i+1}, s_i) = 2 c_{i+1} + s_i = a_i + b_i$ , $c_{i+1} s_i = 0$
  $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + B$
  1 digit holds sum of 2 bits (no carry-out digit)
  example : $(00, 10) = 00 + 10 = 01 + 01 = (10, 00)$
  irredundant representation of $-1$ [8], since $c_{i+1} s_i = 0$ and $C + S = -1$ ⇒ $S = -1$, $C = 0$

Carry-save number representation :
  $r_i \in \{0, 1, 2, 3\}$ , $c_i, s_i, a_i, b_i, d_i \in \{0, 1\}$
  $r_i = (c_{i+1}, s_i) = 2 c_{i+1} + s_i = a_i + b_i + d_i = a_i + r'_i$
  $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + R'$
  1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)
  standard redundant number system for fast addition

Signed-digit (SD) or redundant digit (RD) number representation :
  $r_i, s_i, t_i \in \{-1, 0, 1\} \equiv \{\bar{1}, 0, 1\}$ , $R = \sum_{i=0}^{n-1} r_i 2^i$
  no carry-propagation in $S = R + T$ :
    $r_i + t_i = (c_{i+1}, u_i) = 2 c_{i+1} + u_i$ , $c_{i+1}, u_i \in \{\bar{1}, 0, 1\}$
    $(c_{i+1}, u_i)$ is redundant (e.g. $0 + 1 = 01 = 1\bar{1}$)
    $\forall i \; \exists (c_i, u_i) \mid c_i + u_i = s_i \in \{\bar{1}, 0, 1\}$
  1 digit holds sum of 2 digits (no carry-out digit)
  minimal SD representation : minimal number of non-zero digits, $011\{1\}10 \rightarrow 100\{0\}\bar{1}0$
  applications : sequential multiplication (fewer cycles), filters with constant coefficients (less hardware)
  example : $7 = (0111 \mid 1\bar{1}11 \mid 10\bar{1}1 \mid 100\bar{1} \mid 1\bar{1}\bar{1}11 \mid \ldots)$, where $100\bar{1}$ is minimal
  canonical SD representation : minimal SD + no two non-zero digits in sequence, $01\{1\}10 \rightarrow 10\{0\}\bar{1}0$
  SD → binary : carry-propagation necessary (⇒ adder)
  other applications : high-speed multipliers [9]
  similar to carry-save, simple use for signed numbers
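A canonical SD representation can be computed digit by digit from the LSB. The sketch below uses the well-known non-adjacent-form recoding, which is one possible way to obtain it and is not taken from these notes; the names `to_canonical_sd` and `sd_value` are hypothetical.

```python
def to_canonical_sd(value: int) -> list:
    """Recode an integer into canonical signed digits, LSB first, digits in {-1, 0, 1}."""
    digits = []
    while value != 0:
        if value & 1:
            # value mod 4 == 1 -> digit +1 ; value mod 4 == 3 -> digit -1.
            # Either way the remaining value becomes divisible by 4,
            # so the next digit is 0 (no two adjacent non-zero digits).
            d = 2 - (value & 3)
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits

def sd_value(digits) -> int:
    """Evaluate R = sum of r_i * 2^i for LSB-first signed digits."""
    return sum(d * (1 << i) for i, d in enumerate(digits))

seven = to_canonical_sd(7)       # LSB first: [-1, 0, 0, 1], i.e. 8 - 1
fourteen = to_canonical_sd(14)   # LSB first: [0, -1, 0, 0, 1], i.e. 16 - 2
```

Read MSB first, `seven` is the minimal form $100\bar{1}$ from the example above.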
3.4 Residue Number Systems (RNS)

Non-binary, irredundant, non-weighted number system [1]
+ Carry-free and fast additions and multiplications
- Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required
Codes for error detection and correction [1]
Possible applications (but hardly used) :
  digital filters : fast additions and multiplications
  error detection and correction for arithmetic operations in conventional and residue number systems

Base is $n$-tuple of integers $(m_{n-1}, m_{n-2}, \ldots, m_0)$, residues (or moduli) $m_i$ pairwise relatively prime
  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_{m_{n-1}, m_{n-2}, \ldots, m_0}$ , $a_i \in \{0, 1, \ldots, m_i - 1\}$
  Range : $M = \prod_{i=0}^{n-1} m_i$, anywhere in $\mathbb{Z}$
  $a_i = A \bmod m_i = |A|_{m_i}$ , $A = m_i q_i + a_i$
  $|A|_M = \left| \sum_{i=0}^{n-1} C_i a_i \right|_M$ , $C_i = (\ldots, 0, 1, 0, \ldots)$ with the 1 at position $i$

Arithmetic operations : (each digit computed separately)
  $z_i = |Z|_{m_i} = |f(A)|_{m_i} = |f(|A|_{m_i})|_{m_i} = |f(a_i)|_{m_i}$
  $|A + B|_{m_i} = \big| |A|_{m_i} + |B|_{m_i} \big|_{m_i} = |a_i + b_i|_{m_i}$
  $|A \cdot B|_{m_i} = \big| |A|_{m_i} \cdot |B|_{m_i} \big|_{m_i} = |a_i \cdot b_i|_{m_i}$
  $|-a_i|_{m_i} = |m_i - a_i|_{m_i}$
  $|a_i^{-1}|_{m_i} = |a_i^{m_i - 2}|_{m_i}$ (Fermat's theorem)

Best moduli $m_i$ are $2^k$ and $(2^k - 1)$ :
  high storage efficiency with $k$ bits
  simple modular addition :
    $2^k$ : $k$-bit adder without $c_{out}$
    $2^k - 1$ : $k$-bit adder with end-around carry ($c_{in} = c_{out}$)

Example : $(m_1, m_0) = (3, 2)$ , $M = 6$

  A     -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
  a1     2   0   1   2   0   1   2   0   1   2   0   1   2
  a0     0   1   0   1   0   1   0   1   0   1   0   1   0

  (any 6 consecutive values of A form a possible range)

  $|5|_6 = A = (a_1, a_0) = (|5|_3, |5|_2) = (2, 1)$
  $|4 + 5|_6 = (1, 0) + (2, 1) = (|1 + 2|_3, |0 + 1|_2) = (0, 1) = |3|_6$
  $|4 \cdot 5|_6 = (1, 0) \cdot (2, 1) = (|1 \cdot 2|_3, |0 \cdot 1|_2) = (2, 0) = |2|_6$
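The example with moduli $(3, 2)$ can be reproduced with a few lines of code. This is a minimal sketch; the names `to_rns`, `rns_add`, `rns_mul`, and `from_rns` are hypothetical, and the back-conversion uses a brute-force search over $[0, M)$ rather than the weighted sum $\sum C_i a_i$, which keeps the example short.

```python
from math import prod

MODULI = (3, 2)   # (m1, m0) from the example above, M = 6

def to_rns(a, moduli=MODULI):
    """Residue digits a_i = A mod m_i (Python's % is non-negative for positive m)."""
    return tuple(a % m for m in moduli)

def rns_add(x, y, moduli=MODULI):
    """Digit-wise modular addition, no carries between digits."""
    return tuple((xi + yi) % m for xi, yi, m in zip(x, y, moduli))

def rns_mul(x, y, moduli=MODULI):
    """Digit-wise modular multiplication."""
    return tuple((xi * yi) % m for xi, yi, m in zip(x, y, moduli))

def from_rns(x, moduli=MODULI):
    """Brute-force conversion back to the range [0, M); fine for small M only."""
    M = prod(moduli)
    return next(a for a in range(M) if to_rns(a, moduli) == tuple(x))

sum_digits = rns_add(to_rns(4), to_rns(5))    # (0, 1), i.e. |9|_6 = 3
prod_digits = rns_mul(to_rns(4), to_rns(5))   # (2, 0), i.e. |20|_6 = 2
```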
3.5 Floating-Point Numbers

Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]
Double-number form ⇒ discontinuous precision
Format : sign $S$ | biased exponent $E$ | unsigned normalized mantissa $M$
  $F = (-1)^S \cdot M \cdot \beta^E = (-1)^S \cdot 1.M \cdot 2^{E - \mathit{bias}}$
Basic arithmetic operations :
  $A \cdot B = (-1)^{S_A \oplus S_B} \cdot M_A \cdot M_B \cdot \beta^{E_A + E_B}$
  $A + B = \left( (-1)^{S_A} M_A + (-1)^{S_B} \left( M_B \gg (E_A - E_B) \right) \right) \cdot \beta^{E_A}$
  based on fixed-point add, multiply, and shift operations
  postnormalization required ($1/\beta \le M < 1$)
Applications :
  processors : real floating-point formats (e.g. IEEE standard), large range due to universal use
  ASICs : usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers

IEEE floating-point format :

  precision   n    n_M   n_E   bias   range                    precision
  single      32   23    8     127    $3.8 \cdot 10^{38}$      $10^{-7}$
  double      64   52    11    1023   $9 \cdot 10^{307}$       $10^{-15}$

3.6 Logarithmic Number System

Alternative representation to floating-point (i.e. mantissa + integer exponent → only fixed-point exponent) [1]
Single-number form ⇒ continuous precision ⇒ higher accuracy, more reliable
Format : sign $S$ | biased fixed-point exponent $E$
  $L = (-1)^S \cdot \beta^E = (-1)^S \cdot 2^{E - \mathit{bias}}$ (signed-logarithmic)
Basic arithmetic operations :
  $(A < B) = (E_A < E_B)$ (additionally consider sign)
  $A + B$ : by approximation or addition in conventional number system and double conversion
  $A \cdot B = (-1)^{S_A \oplus S_B} \cdot \beta^{E_A + E_B}$
  $A^y = (-1)^{S_A} \cdot \beta^{y E_A}$ ; $\sqrt[y]{A} = (-1)^{S_A} \cdot \beta^{E_A / y}$
+ Simpler multiplication/exponentiation
- More complex addition
- Expensive conversion : (anti)logarithms (table look-up)
Applications : real-time digital filters

3.7 Antitetrational Number System

Tetration (t. $x = 2^{2^{\cdot^{\cdot^{2}}}}$ with $x$ levels) and antitetration (a.t. $x$) [10]
Larger range, smaller precision than logarithmic representation, otherwise analogous (i.e. $2^x$ → t. $x$ ; $\log x$ → a.t. $x$)
3.8 Composite Arithmetic

Proposal for a new standard of number representations [10]
Scheme for storage and display of exact (primary : integer, secondary : rational) and inexact (primary : logarithmic, secondary : antitetrational) numbers
Secondary forms used for numbers not representable by primary ones (⇒ no over-/underflow handling necessary)
Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy
Number representations :

  form              tag   value
  integer           00    2's complement integer
  rational          01    slash | denominator \ numerator
  logarithmic       10    log integer | log fraction
  antitetrational   11    a.t. integer | a.t. fraction

Rational numbers : slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)
Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
Implementation : mixed hardware/software solutions
Hardware proposal : long accumulator (4096 bits) holds any floating-point number in fixed-point format ⇒ higher accuracy ⇒ large hardware/software overhead

3.9 Round-Off Schemes

Intermediate results with $d$ additional lower bits (⇒ higher accuracy) : $A = (a_{n-1}, \ldots, a_0, a_{-1}, \ldots, a_{-d})$
Rounding : keeping error $\epsilon$ small during final word length reduction : $R = (r_{n-1}, \ldots, r_0) = A - \epsilon$
Trade-off : numerical accuracy vs. implementation cost

Truncation :
  $R_{TRUNC} = (a_{n-1}, \ldots, a_0)$
  bias $= -\frac{1}{2} + \frac{1}{2^{d+1}}$ (= average error $\epsilon$)

Round-to-nearest (i.e. normal rounding) :
  $R_{ROUND} = (a'_{n-1}, \ldots, a'_0)$ ; $A' = A + \frac{1}{2} = A + 0.1_2$
  bias $= \frac{1}{2^{d+1}}$ (nearly symmetric)
  "$+\,0.1_2$" can often be included in previous operation

Round-to-nearest-even/-odd :
  $R_{ROUND\text{-}EVEN} = R_{ROUND}$ if $(a'_{-1}, \ldots, a'_{-d}) \ne 0 \ldots 0$
  $R_{ROUND\text{-}EVEN} = (a'_{n-1}, \ldots, a'_1, 0)$ otherwise
  bias $= 0$ (symmetric)
  mandatory in IEEE floating-point standard

3 guard bits for rounding after floating-point operations : guard bit G (postnormalization), round bit R (round-to-nearest), sticky bit S (round-to-nearest-even)
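The three round-off schemes and their bias values can be checked numerically. The sketch below is an illustration under stated assumptions: the intermediate result $A$ is held as an integer scaled by $2^d$, the error is measured as $R - A$ so that the signs agree with the bias values given above, and the function names are hypothetical.

```python
from fractions import Fraction

def truncate(a: int, d: int) -> int:
    """Drop the d fraction bits."""
    return a >> d

def round_nearest(a: int, d: int) -> int:
    """Add 0.1_2 (one half) and truncate."""
    return (a + (1 << (d - 1))) >> d

def round_nearest_even(a: int, d: int) -> int:
    """Like round_nearest, but ties go to the even result."""
    a1 = a + (1 << (d - 1))
    r = a1 >> d
    if a1 & ((1 << d) - 1) == 0:   # fraction bits of A' all zero: A was a tie
        r &= ~1                    # force the LSB of the result to 0
    return r

def bias(scheme, d: int, count: int) -> Fraction:
    """Exact average of (R - A) over the scaled inputs 0 .. count-1."""
    errors = [Fraction(scheme(a, d)) - Fraction(a, 1 << d) for a in range(count)]
    return sum(errors) / len(errors)

d = 3
bias_trunc = bias(truncate, d, 64)            # -1/2 + 1/16
bias_round = bias(round_nearest, d, 64)       # +1/16
bias_even = bias(round_nearest_even, d, 64)   # 0
```

The 64 inputs cover eight integer parts (four even, four odd) with all eight fraction patterns, so the tie cases of round-to-nearest-even cancel exactly.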
4 Addition

4.1 Overview

[Figure : adder overview. 1-bit adders (HA, FA, (m,k), (m,2)) are the basis of carry-propagate adders (RCA, CSKA, CSLA, CIA, CLA, PPA, COSA), of carry-save adders (3-operand CSA), and of multi-operand adders (adder array, adder tree). Arrows mark "based on component" and "related component".]

Legend :
  HA : half-adder
  FA : full-adder
  (m,k) : (m,k)-counter
  (m,2) : (m,2)-compressor
  CPA : carry-propagate adder
  RCA : ripple-carry adder
  CSKA : carry-skip adder
  CSLA : carry-select adder
  CIA : carry-increment adder
  CLA : carry-lookahead adder
  PPA : parallel-prefix adder
  COSA : conditional-sum adder
  CSA : carry-save adder

4.2 1-Bit Adders, (m, k)-Counters

Add up $m$ bits of same magnitude (i.e. 1-bit numbers)
Output sum as $k$-bit number ($k = \lfloor \log m \rfloor + 1$)
or : count 1's at inputs ⇒ (m, k)-counter [3] (combinational counters)

Half-adder (HA), (2, 2)-counter
  $(c_{out}, s) = 2 c_{out} + s = a + b$
  $A = 3$ ; $T = 2\ (1)$
  $s = a \oplus b$ (sum)
  $c_{out} = ab$ (carry-out)
  [Figures : half-adder symbol and two schematics]

Full-adder (FA), (3, 2)-counter
  $(c_{out}, s) = 2 c_{out} + s = a + b + c_{in}$
  $A = 7$ ; $T = 4\ (2)$
  $g = ab$ (generate) ; $p = a \oplus b$ (propagate)
  $c^0 = ab$ ; $c^1 = a + b$
  $s = a \oplus b \oplus c_{in} = p \oplus c_{in}$
  $c_{out} = ab + a c_{in} + b c_{in} = ab + (a \oplus b) c_{in}$
  $\phantom{c_{out}} = g + p c_{in} = \overline{p} g + p c_{in} = \overline{p} a + p c_{in}$
  $\phantom{c_{out}} = \overline{c}_{in} c^0 + c_{in} c^1$
  [Figures : full-adder symbol and five schematics, one marked as reference]

(m, k)-counters
  $(s_{k-1}, \ldots, s_0)$ with $\sum_{j=0}^{k-1} s_j 2^j = \sum_{i=0}^{m-1} a_i$
  Usually built from full-adders
  Associativity of addition allows conversion from linear to tree structure ⇒ faster at same number of FAs
  $A = 7 \sum_{k=1}^{\lfloor \log m \rfloor} \lfloor m 2^{-k} \rfloor \approx 7 (m - \log m)$
  $T_{LIN} = 4m + 2 \lfloor \log m \rfloor$ ; $T_{TREE} = 4 \lceil \log_3 m \rceil + 2 \lfloor \log m \rfloor$
  [Figure : (m,k)-counter symbol with inputs $a_0 \ldots a_{m-1}$ and outputs $s_{k-1} \ldots s_0$]

Example : (7, 3)-counter
  linear structure : $A = 28$ ; $T = 14$
  tree structure : $A = 28$ ; $T = 10$
  [Figures : (7,3)-counter built from four full-adders in linear and in tree arrangement, inputs $a_0 \ldots a_6$, outputs $s_2, s_1, s_0$]
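The full-adder equations and the tree-structured (7, 3)-counter can be checked exhaustively in software. The sketch below is one possible wiring of four full-adders, assumed for illustration and not necessarily identical to the schematic in the notes; the names `full_adder` and `counter_7_3` are hypothetical.

```python
from itertools import product

def full_adder(a, b, cin):
    """(3,2)-counter: returns (cout, s) with 2*cout + s = a + b + cin."""
    p = a ^ b                  # propagate
    g = a & b                  # generate
    return g | (p & cin), p ^ cin

def counter_7_3(bits):
    """Tree (7,3)-counter from four full-adders; returns (s2, s1, s0)."""
    a0, a1, a2, a3, a4, a5, a6 = bits
    c1, t1 = full_adder(a0, a1, a2)      # first level, weight-1 inputs
    c2, t2 = full_adder(a3, a4, a5)      # first level, in parallel
    c3, s0 = full_adder(t1, t2, a6)      # combines weight-1 partial sums
    s2, s1 = full_adder(c1, c2, c3)      # adds the three weight-2 carries
    return s2, s1, s0

# Exhaustive check: the 3-bit output must equal the number of 1's at the inputs.
mismatches = [bits for bits in product((0, 1), repeat=7)
              if 4 * counter_7_3(bits)[0] + 2 * counter_7_3(bits)[1]
              + counter_7_3(bits)[2] != sum(bits)]
```

Four full-adders correspond to $A = 4 \cdot 7 = 28$, matching the area figure of the example.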
4.3 Carry-Propagate Adders (CPA)

Add two $n$-bit operands $A$ and $B$ and an optional carry-in $c_{in}$ by performing carry-propagation [1, 2, 11]
Sum $(c_{out}, S)$ is irredundant $(n+1)$-bit number
  $(c_{out}, S) = c_{out} 2^n + S = A + B + c_{in}$
  $2 c_{i+1} + s_i = a_i + b_i + c_i$ ; $i = 0, 1, \ldots, n-1$
  $c_0 = c_{in}$ ; $c_{out} = c_n$ (r.m.a.)
[Figure : CPA symbol with inputs $A$, $B$, $c_{in}$ and outputs $c_{out}$, $S$]

Carry-propagation speed-up techniques

a) Concatenation of partial CPAs with fast $c_{in} \rightarrow c_{out}$
   [Figure : chain of partial CPAs for bit groups $n-1{:}j$, ..., $i-1{:}k$, $k-1{:}0$ with group carries $c_j$, $c_i$, $c_k$]

b) Fast carry look-ahead logic for entire range of bits
   [Figure : preprocessing, carry propagation, and postprocessing stages across bits $n-1, \ldots, 0$]

Ripple-carry adder (RCA)

Serial arrangement of $n$ full-adders
Simplest, smallest, and slowest CPA structure
  $A = 7n$ ; $T = 2n$ ; $AT = 14 n^2$
[Figure : chain of $n$ full-adders with carries $c_{n-1}, \ldots, c_2, c_1$ between $c_{in}$ and $c_{out}$]
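The bit-level recurrence $2 c_{i+1} + s_i = a_i + b_i + c_i$ of the ripple-carry adder translates directly into a loop. A minimal behavioural sketch follows; `ripple_carry_add` is a hypothetical name and Python integers hold the $n$-bit operands.

```python
def ripple_carry_add(a: int, b: int, n: int, cin: int = 0):
    """n-bit ripple-carry addition; returns (cout, S) with cout*2^n + S = A + B + cin."""
    c, s = cin, 0
    for i in range(n):                    # one full-adder per bit, LSB first
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s |= (ai ^ bi ^ c) << i           # s_i = a_i xor b_i xor c_i
        c = (ai & bi) | ((ai ^ bi) & c)   # c_{i+1} = g_i + p_i c_i
    return c, s

cout, total = ripple_carry_add(0b1011, 0b0110, 4)   # 11 + 6 = 17 = (1, 0b0001)
```

Each loop iteration depends on the carry of the previous one, which is the software analogue of the $T = 2n$ critical path.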
Carry-skip adder (CSKA)

Type a) : partial CPA with fast $c_k \rightarrow c_i$
  $c_i = \overline{P}_{i-1:k}\, c'_i + P_{i-1:k}\, c_k$ (bit group $(a_{i-1}, \ldots, a_k)$)
  $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
  1) $P_{i-1:k} = 0$ : $c_k \not\rightarrow c'_i$ and $c'_i$ selected ($c'_i \rightarrow c_i$)
  2) $P_{i-1:k} = 1$ : $c_k \rightarrow c'_i$ but $c'_i$ skipped ($c'_i \not\rightarrow c_i$)
  ⇒ path $c_k \rightarrow c'_i \rightarrow c_i$ never sensitized ⇒ fast $c_k \rightarrow c_i$
  ⇒ false path ⇒ inherent logic redundancy ⇒ problems in circuit optimization, timing analysis, and testing
Variable group sizes (faster) : larger groups in the middle (minimize delays $a_0 \rightarrow c_k \rightarrow s_{i-1}$ and $a_k \rightarrow c_i \rightarrow s_{n-1}$)
Partial CPA typically is RCA or CSKA (⇒ multilevel CSKA)
Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
  $A \approx 8n$ ; $T \approx 4 n^{1/2}$ ; $AT \approx 32 n^{3/2}$
[Figure : carry-skip adder with partial CPAs, group propagate $P_{i-1:k}$, and skip multiplexers]

Carry-select adder (CSLA)

Type a) : partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
  $s_{i-1:k} = \overline{c}_k\, s^0_{i-1:k} + c_k\, s^1_{i-1:k}$
  $c_i = \overline{c}_k\, c^0_i + c_k\, c^1_i$
Two CPAs compute two possible results ($c_{in} = 0/1$), group carry-in $c_k$ selects correct one afterwards
Variable group sizes (faster) : larger groups at end (MSB) (balance delays $a_0 \rightarrow c_k$ and $a_k \rightarrow c^0_i$)
Partial CPA typically is RCA, CSLA (⇒ multilevel CSLA), or CLA
High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
  $A \approx 14n$ ; $T \approx 2.8 n^{1/2}$ ; $AT \approx 39 n^{3/2}$
[Figure : carry-select adder with two partial CPAs per group ($c_{in} = 0$ and $c_{in} = 1$) and output multiplexers controlled by $c_k$]
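The carry-select principle (compute both candidate results per group, then let the group carry-in choose) can be sketched behaviourally. The code below assumes uniform group sizes for brevity, although the text notes that variable group sizes are faster; `carry_select_add` and its `group` parameter are hypothetical names, and Python's `+` stands in for each partial CPA.

```python
def carry_select_add(a: int, b: int, n: int, group: int = 4, cin: int = 0):
    """n-bit carry-select addition with uniform groups; returns (cout, S)."""
    s, ck = 0, cin
    for k in range(0, n, group):
        w = min(group, n - k)                 # width of this bit group
        mask = (1 << w) - 1
        ag, bg = (a >> k) & mask, (b >> k) & mask
        t0 = ag + bg                          # partial CPA with carry-in 0
        t1 = ag + bg + 1                      # partial CPA with carry-in 1
        t = t1 if ck else t0                  # multiplexers controlled by c_k
        s |= (t & mask) << k                  # selected group sum s_{i-1:k}
        ck = t >> w                           # selected group carry-out c_i
    return ck, s

cout, total = carry_select_add(200, 100, 8)   # 300 = 1 * 256 + 44
```

In hardware `t0` and `t1` of all groups are available in parallel, so only the chain of selections through `ck` remains on the critical path.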
Carry-increment adder (CIA)

Type a) : partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
  $s_{i-1:k} = s'_{i-1:k} + c_k$ ; $c_i = c'_i + P_{i-1:k}\, c_k$
  $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
Result is incremented after addition, if $c_k = 1$ [12, 11]
Variable group sizes (faster) : larger groups at end (MSB) (balance delays $a_0 \rightarrow c_k$ and $a_k \rightarrow c'_i$)
Partial CPA typically is RCA, CIA (⇒ multilevel CIA) or CLA
High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)
Logic of CPA and incrementer can be merged [11]
  $A \approx 10n$ ; $T \approx 2.8 n^{1/2}$ ; $AT \approx 28 n^{3/2}$
[Figure : carry-increment adder with partial CPAs, group propagate $P_{i-1:k}$, and "+1" incrementers]

Example : gate-level schematic of carry-increment adder (CIA)

only 2 different logic cells (bit-slices) : IHA and IFA

  T             4   6   10  12  14  16  18  20  22  24  26  28  ...  38
  max n_group           2   3   4   5   6   7   8   9   10  11  ...  16
  n             1   2   4   7   11  16  22  29  37  46  56  67  ...  137

[Figure : gate-level schematic of one bit group (bits $i-1, \ldots, k$) built from IFA cells and one IHA cell]

Group composition :
  bits $i-1 \ldots k$ : $(i-k-1)$ IFA + IHA
  bits 6...4 : 2 IFA + IHA
  bits 3, 2 : IFA + IHA
  bit 1 : IHA
  bit 0 : IHA
Conditional-sum adder (COSA)

Type a) : optimized multilevel CSLA with $(\log n)$ levels (i.e. double CPAs are merged at higher levels)
Correct sum bits ($s^0_{i-1:k}$ or $s^1_{i-1:k}$) are (conditionally) selected through $(\log n)$ levels of multiplexers
Bit groups of size $2^l$ at level $l$
Higher parallelism, more balanced signal paths
Highest speed-up at highest hardware overhead (2 RCA + more than $(\log n)$ MUX/bit)
  $A \approx 3 n \log n$ ; $T \approx 2 \log n$ ; $AT \approx 6 n \log^2 n$
[Figure : 4-bit conditional-sum adder with full-adder pairs (carry-in 0 and 1) at level 0 and multiplexer levels 1 and 2]

Carry-lookahead adder (CLA), traditional

Type b) : carries looked ahead before sum bits computed
Typically 4-bit blocks used (e.g. standard IC SN74181)
  $c_0 = c'_0$
  $c_1 = g_0 + p_0 c'_0$
  $c_2 = g_1 + p_1 g_0 + p_1 p_0 c'_0$
  $c_3 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c'_0$
  $g'_3 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$
  $p'_3 = p_3 p_2 p_1 p_0$
  [Figure : carry-lookahead block (CLB) symbol with inputs $(g_3, p_3), \ldots, (g_0, p_0)$, $c'_0$ and outputs $(g'_3, p'_3)$, $c_3, \ldots, c_0$]
Hierarchical arrangement using $(\frac{1}{2} \log n)$ levels : $(g'_3, p'_3)$ passed up, $c'_0$ passed down between levels
High speed-up at medium hardware overhead
  $A \approx 14n$ ; $T \approx 4 \log n$ ; $AT \approx 56 n \log n$
[Figure : 16-bit two-level CLA built from four first-level CLBs (bits 15...12, 11...8, 7...4, 3...0) and one second-level CLB producing $c_{12}$, $c_8$, $c_4$, $c_0$ and $c_{out}$]
+ preprocessing : $g_i = a_i b_i$ ; $p_i = a_i \oplus b_i$
+ postprocessing : $s_i = p_i \oplus c_i$
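The 4-bit carry-lookahead block equations can be evaluated directly. The sketch below wraps them with the pre- and postprocessing steps to form a 4-bit adder; `cla_block` and `cla_add4` are hypothetical names, and the group signals are named `G` and `P` in place of $g'_3$ and $p'_3$.

```python
def cla_block(g, p, c0):
    """4-bit carry-lookahead block; g and p are lists indexed by bit position 0..3."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    G = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    P = p[3] & p[2] & p[1] & p[0]
    return [c0, c1, c2, c3], G, P      # carries, group generate, group propagate

def cla_add4(a: int, b: int, cin: int = 0):
    """4-bit adder: preprocessing, one lookahead block, postprocessing."""
    abits = [(a >> i) & 1 for i in range(4)]
    bbits = [(b >> i) & 1 for i in range(4)]
    g = [x & y for x, y in zip(abits, bbits)]      # g_i = a_i b_i
    p = [x ^ y for x, y in zip(abits, bbits)]      # p_i = a_i xor b_i
    c, G, P = cla_block(g, p, cin)
    s = sum((p[i] ^ c[i]) << i for i in range(4))  # s_i = p_i xor c_i
    cout = G | (P & cin)                           # carry out of the block
    return cout, s

cout, total = cla_add4(0b1001, 0b0111)   # 9 + 7 = 16 = (1, 0b0000)
```

In a hierarchical arrangement `G` and `P` would feed a second-level block instead of producing `cout` locally.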
Parallel-prefix adders (PPA)

Type b) : universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA)
Preprocessing, carry-lookahead, and postprocessing step
Carries calculated using parallel-prefix algorithms
+ High regularity : suitable for synthesis and layout
+ High flexibility : special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)
+ High performance : smallest and fastest adders
  $A \approx 5n + 3 A_\bullet$ ; $T = 4 + 2 T_\bullet$
[Figure : parallel-prefix adder structure.
  preprocessing : $g_i = a_i b_i$ ; $p_i = a_i \oplus b_i$
  carry-lookahead : prefix algorithm on $(g_{n-1}, p_{n-1}), \ldots, (g_0, p_0)$ with $c_{in}$
  postprocessing : $s_i = p_i \oplus c_i$]

Prefix problem

Inputs $(x_{n-1}, \ldots, x_0)$, outputs $(y_{n-1}, \ldots, y_0)$, associative binary operator $\bullet$ [11, 13]
  $(y_{n-1}, \ldots, y_0) = (x_{n-1} \bullet \cdots \bullet x_0, \ldots, x_1 \bullet x_0, x_0)$ or
  $y_0 = x_0$ ; $y_i = x_i \bullet y_{i-1}$ ; $i = 1, \ldots, n-1$ (r.m.a.)
Associativity of $\bullet$ ⇒ tree structures for evaluation :
  $x_3 \bullet (x_2 \bullet (x_1 \bullet x_0)) = (x_3 \bullet x_2) \bullet (x_1 \bullet x_0)$ , but $y_2$ ?
  left side (serial) : $y_1 = Y^1_{1:0}$ , $y_2 = Y^2_{2:0}$ , $y_3 = Y^3_{3:0}$
  right side (tree) : $Y^1_{3:2}$ , $y_1 = Y^1_{1:0}$ , $y_3 = Y^2_{3:0}$
Group variables $Y^l_{i:k}$ : covers bits $(x_k, \ldots, x_i)$ at level $l$
Carry-propagation is prefix problem : $Y^l_{i:k} = (G^l_{i:k}, P^l_{i:k})$
  $(G^0_{i:i}, P^0_{i:i}) = (g_i, p_i)$
  $(G^l_{i:k}, P^l_{i:k}) = (G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1}) \bullet (G^{l-1}_{j:k}, P^{l-1}_{j:k})$ ; $k \le j \le i$
  $\phantom{(G^l_{i:k}, P^l_{i:k})} = (G^{l-1}_{i:j+1} + P^{l-1}_{i:j+1} G^{l-1}_{j:k},\; P^{l-1}_{i:j+1} P^{l-1}_{j:k})$
  $c_{i+1} = G^m_{i:0}$ ; $i = 0, \ldots, n-1$ ; $l = 1, \ldots, m$
Parallel-prefix algorithms [11] :
  multi-tree structures ($T = O(n) \rightarrow O(\log n)$)
  sharing subtrees ($A = O(n^2) \rightarrow O(n \log n)$)
  different algorithms trading area vs. delay (influences also from wiring and maximum fan-out $FO_{max}$)
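The prefix operator on (generate, propagate) pairs and its use for carry computation can be written out as follows. This is a minimal sketch assuming $c_{in} = 0$ and the serial-prefix recurrence $y_i = x_i \bullet y_{i-1}$; the names `prefix_op` and `prefix_add` are hypothetical.

```python
from itertools import product

def prefix_op(hi, lo):
    """(G, P) of an upper group combined with (G, P) of the adjacent lower group."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return g_hi | (p_hi & g_lo), p_hi & p_lo

def prefix_add(a: int, b: int, n: int):
    """n-bit addition via preprocessing, serial prefix, postprocessing (c_in = 0)."""
    # Preprocessing: (g_i, p_i) = (a_i b_i, a_i xor b_i)
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    # Carry-lookahead: y_0 = x_0, y_i = x_i . y_{i-1}
    y = [gp[0]]
    for i in range(1, n):
        y.append(prefix_op(gp[i], y[i - 1]))
    carries = [0] + [G for G, _ in y]          # c_0 = 0, c_{i+1} = G_{i:0}
    # Postprocessing: s_i = p_i xor c_i
    s = sum((gp[i][1] ^ carries[i]) << i for i in range(n))
    return carries[n], s

# Associativity check over all (g, p) bit combinations
pairs = list(product((0, 1), repeat=2))
associative = all(prefix_op(x, prefix_op(y, z)) == prefix_op(prefix_op(x, y), z)
                  for x in pairs for y in pairs for z in pairs)
```

Because `prefix_op` is associative, the serial loop can be replaced by any of the tree-structured prefix algorithms without changing the computed carries.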
Prefix algorithms

Algorithms visualized by directed acyclic graphs (DAG) with array structure ($n$ bits $\times$ $m$ levels)
Graph vertex symbols :
  black node (contains logic for $\bullet$) : inputs $(G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1})$ and $(G^{l-1}_{j:k}, P^{l-1}_{j:k})$, output $(G^l_{i:k}, P^l_{i:k})$
  white node (contains no logic) : input $(G^{l-1}_{i:k}, P^{l-1}_{i:k})$, output $(G^l_{i:k}, P^l_{i:k})$
Performance measures :
  $A_\bullet$ : graph size (number of black nodes)
  $T_\bullet$ : graph depth (number of black nodes on critical path)

Serial-prefix algorithm (⇒ RCA)
  $A_\bullet = n - 1$ ; $T_\bullet = n - 1$ ; $FO_{max} = 2$
  [Figure : serial-prefix graph for 16 bits, levels 0 to 15]

Sklansky parallel-prefix algorithm (⇒ PPA-SK)
  Tree-like collection, parallel redistribution of carries
  $A_\bullet \approx \frac{1}{2} n \log n$ ; $T_\bullet = \lceil \log n \rceil$ ; $FO_{max} \approx \frac{1}{2} n$
  [Figure : Sklansky prefix graph for 16 bits, levels 0 to 4]

Brent-Kung parallel-prefix algorithm (⇒ PPA-BK)
  Traditional CLA is PPA-BK with 4-bit groups
  Tree-like redistribution of carries (fan-out tree)
  $A_\bullet = 2n - \lceil \log n \rceil - 2$ ; $T_\bullet = 2 \lceil \log n \rceil - 2$ ; $FO_{max} \approx \log n$
  [Figure : Brent-Kung prefix graph for 16 bits, levels 0 to 6]

Kogge-Stone parallel-prefix algorithm (⇒ PPA-KS)
  very high wiring requirements
  $A_\bullet \approx n \log n - n + 1$ ; $T_\bullet = \lceil \log n \rceil$ ; $FO_{max} = 2$
  [Figure : Kogge-Stone prefix graph for 16 bits, levels 0 to 4]

Carry-increment parallel-prefix algorithm (⇒ CIA)
  $A_\bullet \approx 2n - 1.4 n^{1/2}$ ; $T_\bullet \approx 1.4 n^{1/2}$ ; $FO_{max} \approx 1.4 n^{1/2}$
  [Figure : carry-increment prefix graph for 16 bits, levels 0 to 5]

Mixed serial/parallel-prefix algorithm (⇒ RCA + PPA)
  linear size-depth trade-off using parameter $k$ : $0 \le k \le n - 2 \lceil \log n \rceil + 2$
  $k = 0$ : serial-prefix graph
  $k = n - 2 \lceil \log n \rceil + 1$ : Brent-Kung parallel-prefix graph
  fills gap between RCA and PPA-BK (i.e. CLA) in steps of single $\bullet$-operations
  $A_\bullet = n - 1 + k$ ; $T_\bullet = n - 1 - k$ ; $FO_{max} = $ var.
  [Figure : mixed serial/parallel-prefix graph for 16 bits, levels 0 to 10]
Example : 4-bit parallel-prefix adder (PPA-SK)
efficient AND-OR-prefix circuit for the generate and AND-prefix circuit for the propagate signals
optimization: alternatingly AOI-/OAI- resp. NAND-/NOR-gates (inverting gates are smaller and faster)
can also be realized using two MUX-prefix circuits
[Figure askgate.epsi : gate-level schematic with inputs a3 b3 ... a0 b0 and c_in, outputs c_out, P_{n-1:0}, and s3 ... s0]

Prefix adder synthesis
Local prefix graph transformation :
[Figure unfact.epsi / fact.epsi : 4-bit prefix graphs; the depth-decreasing transform turns the graph with A = 3, T = 3 into the graph with A = 4, T = 2, the size-decreasing transform is the reverse]
Repeated (local) prefix transformations result in overall minimization of graph depth or size ⇒ which sequence ?
Goal: minimal size (area) at given depth (delay)
Simple algorithm for sequence of applied transforms :
Step 1 : prefix graph compression (depth minimization) : depth-decreasing transforms in right-to-left bottom-up order
Step 2 : prefix graph expansion (size minimization) : size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded
Prefix adder synthesis : 1) generate serial-prefix graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prefix logic
+ Generates all previous prefix graphs (except PPA-KS)
+ Universal adder synthesis algorithm : generates area-optimal adders for any given timing constraints [11] (including non-uniform signal arrival times)
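As an illustration of the prefix formulation, the following is a minimal behavioral sketch of a Sklansky prefix adder in Python (not a gate-level netlist). The function name `sklansky_add` and the way the carry-in is merged into bit 0 are assumptions made for this example; the node counter only serves to relate the result to $A \approx \frac{1}{2} n \log n$.

```python
def sklansky_add(a, b, n, cin=0):
    """Add two n-bit numbers with a Sklansky prefix carry network.

    Returns (sum mod 2^n, carry-out, number of black nodes).
    Hypothetical helper for illustration only.
    """
    # Preprocessing: bit generate g_i = a_i b_i, bit propagate p_i = a_i XOR b_i.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    G[0] |= p[0] & cin          # assumption: carry-in merged into bit 0
    nodes = 0
    level = 0
    while (1 << level) < n:
        for i in range(n):
            if (i >> level) & 1:                 # upper half of a block
                j = ((i >> level) << level) - 1  # top bit of the lower half
                G[i] |= P[i] & G[j]              # black node: (G, P) o (G', P')
                P[i] &= P[j]
                nodes += 1
        level += 1
    # Postprocessing: s_i = p_i XOR c_i with c_0 = cin and c_{i+1} = G_{i:0}.
    carries = [cin] + G[:-1]
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, G[n - 1], nodes
```

For n = 16 the sketch uses 32 black nodes in 4 levels, which matches the size and depth figures quoted for PPA-SK.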
Multilevel adders
Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L)
+ Delay is $O(n^{1/(m+1)})$ for $m$ levels
Area increase small for CSKA and CIA, high for CSLA (⇒ COSA)
Difficult computation of optimal group sizes

Hybrid adders
Arbitrary combinations of speed-up techniques possible ⇒ hybrid/mixed adder architectures
Often used combinations : CLA and CSLA [14]
Pure architectures usually perform best (at gate-level)

Transistor-level adders
Influence of logic styles (e.g. dynamic logic, pass-transistor logic ⇒ faster)
+ Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [14]
+ Combinations of speed-up techniques make sense
Much higher design effort
Many efficient implementations exist and have been published

Self-timed adders
Average carry-propagation length : $\log n$
+ RCA is fast in average case ($T = O(\log n)$), slow in worst case ⇒ suitable for self-timed asynchronous designs [15]
Completion detection is not trivial

Adder performance comparisons
Standard-cell implementations, 0.8 µm process
[Figure addperf.ps : area [lambda^2] (log scale, 1e+06 ... 1e+07) versus delay [ns] (5 ... 20) for 8-, 16-, 32-, 64-, and 128-bit RCA, CSKA-2L, CIA-1L, CIA-2L, PPA-SK, PPA-BK, CLA, and COSA, with lines of constant AT]
Complexity comparison under the unit-gate model 4.4 Carry-Save Adder (CSA)
a) Adds three n-bit operands A0 , A1 , A2 performing no
adder A T AT opt.1 syn.2
RCA 7n 2n 14n2 aaa
p carry-propagation (i.e. carries are saved) [1]
CSKA-1L 8n 4n1=2 32n3=2 aat 3 ( C; S ) = C + S = A0 + A1 + A2 A0 A1 A2
CSKA-2L 8n xn1=3 4 xn4=3 4 csasymbol.epsi

14n 2:8n1=2 39n3=2 2ci+1 + si a0;i + a1;i + a2;i ;
=
21 CSA
26 mm
CSLA-1L
10n 2:8n1=2 28n3=2

p i = 0; 1; : : : ; n , 1 (n.)
CIA-1L
10n 3:6n1=3 36n4=3
att
p C S
CIA-2L
CIA-3L 10n 4:4n1=4 44n5=4
att

p b) Adds one n-bit operand to an n-digit carry-save operand
n log n
3
2 log n 3n log2 n
p ( C; S )out = A + (C; S )in
PPA-SK 2
10n 4 log n 40n log n
ttt
p
PPA-BK att Result is in redundant carry-save format (n digits),
3n log n 2 log n 6n log2 n represented by two n-bit numbers S (sum bits) and C
PPA-KS
CLA 5 14n 4 log n 56n log n

p
( ) (carry bits)
COSA 3n log n 2 log n 6n log2 n + Parallel arrangement of n full-adders, constant delay
1 optimality regarding area and delay A = 7n ; T = 4
aaa : smallest area, longest delay
aat : small area, medium delay
a 0,n-1
a 1,n-1
a 2,n-1
a 0,1
a 1,1
a 2,1
a 0,0
a 1,0
a 2,0
att : medium area, short delay
ttt : large area, shortest delay csa.epsi
: not optimal FA . . . 67 27FA
mm FA
2 obtained from prefix adder synthesis
3 automatic logic optimization not possible (redundancy) cn s n-1 c2 s1 c1 s0
4 exact factors not calculated

Multi-operand carry-save adders (m > 3)
5 corresponds to 4-bit PPA-BK
) adder array (linear arrangement), adder tree (tree arr.)
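The carry-save relation $2C + S = A_0 + A_1 + A_2$ can be checked with a small behavioral sketch. The Python below is an illustration only; the names `csa` and `multi_operand_add` are hypothetical, operands are assumed to be non-negative integers, and the carry vector is returned already shifted to its weight so that plain integer addition models the final CPA.

```python
def csa(x, y, z):
    """One carry-save adder row: a full-adder per bit, no carry propagation."""
    s = x ^ y ^ z                              # sum bit of each full-adder
    c = ((x & y) | (x & z) | (y & z)) << 1     # carry bit, moved to weight 2^(i+1)
    return c, s


def multi_operand_add(operands):
    """Linear CSA array followed by one carry-propagate addition."""
    ops = list(operands)
    while len(ops) > 2:
        c, s = csa(ops[0], ops[1], ops[2])     # 3 operands in, 2 out
        ops = [c, s] + ops[3:]
    return sum(ops)                            # final CPA
```

Each pass of the loop corresponds to one CSA row, so m operands need (m - 2) rows before the single final CPA, in line with the linear adder array mentioned above.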
4.5 Multi-Operand Adders
Add three or more ($m > 2$) $n$-bit operands, yield $(n + \lceil \log m \rceil)$-bit result in irredundant number rep. [1, 2]

Array adders
Realization by array adders : (see figures on next page)
a) linear arrangement of CPAs
b) linear arr. of CSAs (adder array) and final CPA
a) and b) differ in bit arrival times at final CPA :
⇒ if CPA = RCA : a) and b) have same overall delay
⇒ if fast final CPA : uniform bit arrival times required ⇒ CSA array (b)
Fast implementation : CSA array + fast final CPA (note: array of fast CPAs not efficient/necessary)
$A = (m - 2) A_{CSA} + A_{CPA}$
$T = (m - 2) T_{CSA} + T_{CPA}$
CPA = RCA : $A = O(mn + n)$ ; $T = O(m + n)$
Fast CPA : $A = O(mn + n \log n)$ ; $T = O(m + \log n)$
[Figure mopadd.epsi : operands A_0 ... A_{m-1} fed into a chain of CSAs followed by a CPA producing S]
[Figure cparray.epsi : a) 4-operand CPA (RCA) array built from FA and HA rows, inputs a_{0,i} ... a_{3,i}, outputs s_n ... s_0]
[Figure csarray.epsi : b) 4-operand CSA array with final CPA (RCA), inputs a_{0,i} ... a_{3,i}, outputs s_n ... s_0]
(m, 2)-compressors
[Figure cprsymbol.epsi : (m,2)-compressor symbol with inputs a_0 ... a_{m-1}, c_in^0 ... c_in^{m-4} and outputs c_out^0 ... c_out^{m-4}, c, s]
$2 \left( c + \sum_{l=0}^{m-4} c_{out}^{l} \right) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c_{in}^{l}$
1-bit adders (similar to (m, k)-counters) [16]
Compresses $m$ bits down to 2 by forwarding $(m - 3)$ intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array (see prev. page)
+ No horizontal carry-propagation (i.e. $c_{in}^{l} \rightarrow c_{out}^{k}$ ; $k > l$)
Built from full-adders (= (3, 2)-compressor) or (4, 2)-compressors arranged in linear or tree structures
$A = 7(m - 2)$
$T_{LIN} = 4(m - 2)$ ; $T_{TREE} = 6(\lceil \log m \rceil - 1)$

Example : 4-operand adder using (4, 2)-compressors
[Figure cpradd.epsi : row of (4,2)-compressors (CSA) with inputs a_{0,i} ... a_{3,i}, followed by a CPA of FAs and one HA, outputs s_{n+1} ... s_0]

Optimized (4, 2)-compressor :
2 full-adders merged and optimized (i.e. XORs arranged in tree structure)
with full-adders : $A = 14$ ; $T = 8$ [Figure cpr42fa.epsi]
optimized : $A = 14$ ; $T = 6$ [Figure cpr42opt.epsi]
+ same area, 25% shorter delay
SD-FA (signed-digit full-adder) is similar to (4, 2)-compressor regarding structure and complexity
Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors :
higher compression rate (4:2 instead of 3:2)
less deep and more regular trees

tree depth           0  1  2  3  4  5   6   7   8   9  10
# operands  FA       2  3  4  6  9  13  19  28  42  63  94
# operands  (4,2)    2  4  8  16 32 64  128

Example : (8, 2)-compressor
full-adder tree : $A = 42$ ; $T = 16$ [Figure cpr82fa.epsi]
(4, 2)-compressor tree : $A = 42$ ; $T = 12$ [Figure cpr82cpr42.epsi]

Tree adders (Wallace tree)
Adder tree : $n$-bit $m$-operand carry-save adder composed of $n$ tree-structured (m, 2)-compressors [1, 17]
Tree adders : fastest multi-operand adders using an adder tree and a fast final CPA
$A = A_{(m,2)} n + A_{CPA} = O(mn + n \log n)$
$T = T_{(m,2)} + T_{CPA} = O(\log m + \log n)$

Adder arrays and adder trees revisited
Some FA can often be replaced by HA or eliminated (i.e. redundant due to constant inputs)
Number of (irredundant) FA does not depend on adder structure, but number of HA does
An $m$-operand adder accommodates $(m - 1)$ carry inputs
Adder trees ($T = O(\log n)$) are faster than adder arrays ($T = O(n)$) at same amount of gates ($A = O(mn)$)
Adder trees are less regular and have more complex routing than adder arrays ⇒ larger area, difficult layout (i.e. limited use in layout generators)
4.6 Sequential Adders
Bit-serial adder : Sequential $n$-bit adder
$A = A_{FA} + A_{FF}$ ; $T = T_{FA} + T_{FF}$ ; $L = n$
[Figure bitseradd.epsi : FA with inputs a_i, b_i, output s_i, and carry fed back through a flip-flop]

Accumulators : Sequential $m$-operand adders
With CPA
$A = A_{CPA} + A_{REG}$ ; $T = T_{CPA} + T_{REG}$ ; $L = m$
[Figure accucpa.epsi : CPA with register feedback, input A, output S]
With CSA and final CPA
Allows higher clock rates
Final CPA too slow : ⇒ pipelining or multiple cycles for evaluation
$A = A_{CSA} + A_{CPA} + 4 A_{REG}$ ; $T = T_{CSA} + T_{REG}$ ; $L = m$
[Figure accucsa.epsi : CSA with register feedback followed by final CPA, input A, output S]
Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size

5 Simple / Addition-Based Operations
5.1 Complement and Subtraction
2's complementer (negation)
$-A = \overline{A} + 1$
[Figure neg.epsi]
2's complement subtractor
$A - B = A + (-B) = A + \overline{B} + 1$
[Figure sub.epsi : CPA with inverted B input and carry-in 1, outputs c_out and S]
2's complement adder/subtractor
$A \pm B = A + (-1)^{sub} B = A + (B \oplus sub) + sub$
[Figure addsub.epsi : CPA with B XOR sub and carry-in sub, outputs c_out and S]
1's complement adder
$A + B \pmod{2^n - 1} = A + B + c_{out}$ (end-around carry)
[Figure addmod.epsi : CPA with c_out fed back to c_in, output S]
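The three identities of this section can be exercised with a short behavioral sketch. The Python below is illustrative only; the function names are hypothetical and operands are assumed to be n-bit patterns given as non-negative integers.

```python
def sub_twos_complement(a, b, n):
    """A - B computed as A + not(B) + 1; returns (difference mod 2^n, c_out)."""
    mask = (1 << n) - 1
    total = a + (~b & mask) + 1
    return total & mask, total >> n


def add_sub(a, b, sub, n):
    """Adder/subtractor: A + (B XOR sub) + sub, with sub in {0, 1}."""
    mask = (1 << n) - 1
    b_x = b ^ (mask if sub else 0)     # every bit of B is XORed with sub
    total = a + b_x + sub
    return total & mask, total >> n


def add_ones_complement(a, b, n):
    """Addition modulo 2^n - 1 using the end-around carry."""
    mask = (1 << n) - 1
    total = a + b
    total = (total & mask) + (total >> n)   # feed c_out back as c_in
    return total & mask
```

In the subtractor, a carry-out of 1 indicates A >= B for unsigned operands. The 1's complement result may be the all-ones pattern, which is the second representation of zero modulo $2^n - 1$.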
5.2 Increment / Decrement
Incrementer
Adds a single bit $c_{in}$ to an $n$-bit operand $A$
$(c_{out}, Z) = c_{out} 2^n + Z = A + c_{in}$
$z_i = a_i \oplus c_i$
$c_{i+1} = a_i c_i$ ; $i = 0, \ldots, n-1$
$c_0 = c_{in}$ ; $c_{out} = c_n$ (r.m.a.)
[Figure incsymbol.epsi : incrementer symbol (+1) with input A, c_in, outputs c_out, Z]
Corresponds to addition with $B = 0$ (⇒ FA → HA)
Example : Ripple-carry incrementer using half-adders
$A = 3n$ ; $T = n + 1$ ; $AT \approx 3n^2$
[Figure incfa.epsi : chain of HAs, inputs a_{n-1} ... a_0 and c_in, outputs c_out and z_{n-1} ... z_0]
or using incrementer slices (= half-adder)
[Figure inc.epsi : chain of incrementer slices, inputs a_{n-1} ... a_0 and c_in, outputs c_out and z_{n-1} ... z_0]
Prefix problem : $C_{i:k} = C_{i:j+1} C_{j:k}$ ⇒ AND-prefix struct.
$A \approx \frac{1}{2} n \log n + 2n$ ; $T = \lceil \log n \rceil + 2$ ; $AT \approx \frac{1}{2} n \log^2 n$

Decrementer
$(c_{out}, Z) = A - c_{in}$
[Figure dec.epsi : ripple decrementer, inputs a_{n-1} ... a_0 and c_in, outputs c_out and z_{n-1} ... z_0]

Incrementer-decrementer
$(c_{out}, Z) = A \pm c_{in} = A + (-1)^{dec} c_{in}$
[Figure incdec.epsi : ripple incrementer-decrementer with control input dec]

Fast incrementers
4-bit incrementer using multi-input gates :
[Figure inccg.epsi : inputs a3 ... a0 and c_in, outputs c_out and z3 ... z0]
Prefix problem ⇒ AND-prefix structure
8-bit parallel-prefix incrementer (Sklansky AND-prefix structure) :
[Figure incpp.epsi : inputs a7 ... a0 and c_in, outputs c_out and z7 ... z0]

Gray incrementer
Increments in Gray number system
$c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0$ (parity)
$c_{i+1} = \overline{a}_i c_i$ ; $i = 0, \ldots, n-3$ (r.m.a.)
$z_0 = \overline{a_0 \oplus c_0}$
$z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \ldots, n-2$
$z_{n-1} = a_{n-1} \oplus c_{n-2}$
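The Gray incrementer can be checked bit by bit against the binary-to-Gray mapping. The sketch below is an illustration in Python under stated assumptions: complement bars are not legible in the extracted equations, so it assumes $c_0$ is the XOR parity of all bits, $c_{i+1} = \overline{a}_i c_i$, and $z_0 = \overline{a_0 \oplus c_0}$; the name `gray_increment` is hypothetical and $n \ge 2$ is required.

```python
def gray_increment(a, n):
    """Return the Gray code word that follows the n-bit Gray code word a."""
    bits = [(a >> i) & 1 for i in range(n)]
    parity = 0
    for bit in bits:
        parity ^= bit
    # c[i] = 1 means: parity is odd and all bits below position i are zero.
    c = [0] * (n - 1)
    c[0] = parity
    for i in range(n - 2):
        c[i + 1] = (1 - bits[i]) & c[i]
    z = [0] * n
    z[0] = bits[0] ^ (1 - c[0])                 # even parity: toggle bit 0
    for i in range(1, n - 1):
        z[i] = bits[i] ^ (bits[i - 1] & c[i - 1])   # toggle left of lowest 1
    z[n - 1] = bits[n - 1] ^ c[n - 2]
    return sum(bit << i for i, bit in enumerate(z))
```

Exactly one output bit differs from the input in every step, including the wrap-around from 10...0 to 00...0.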
5.3 Counting
Count clock cycles ⇒ counter,
divide clock frequency ⇒ frequency divider ($c_{out}$)

Binary counter
Sequential in-/decrementer
Incrementer speed-up techniques applicable
Down- and up-down-counters using decrementers / incrementer-decrementers
[Figure cntblock.epsi : incrementer (+1) with register, inputs c_in and clk, outputs c_out and Q]
Example : Ripple-carry up-counter using counter slices (= HA + FF), $c_{in}$ is count enable
[Figure cntripple.epsi : outputs q_{n-1} ... q_0, c_in, c_out]
Asynchronous counter using toggle-flip-flops (lower toggle rate ⇒ lower power)
[Figure cntasync.epsi : chain of T flip-flops clocked by clk, outputs q_{n-1} ... q_0]
Fast divider ($T = O(1)$) using delayed-carry numbers (irredundant carry-save representation of $-1$ allows using fast carry-save incrementer) [8]

Gray counter
Counter using Gray incrementer

Ring counters
Shift register connected to ring :
[Figure cntring.epsi : outputs q_{n-1} ... q_0]
State is not encoded ⇒ $n$ FF for counting $n$ states
Must be initialized correctly (e.g. 00...01)
Applications:
fast dividers (no logic between FF)
state counter for one-hot coded FSMs
Johnson / twisted-ring counter (inverted feed-back) :
[Figure cntjohnson.epsi : outputs q_{n-1} ... q_0]
$n$ FF for counting $2n$ states
5.4 Comparison, Coding, Detection
Comparison operations
$EQ = (A = B)$ (equal)
$NE = (A \ne B) = \overline{EQ}$ (not equal)
$GE = (A \ge B)$ (greater or equal)
$LT = (A < B) = \overline{GE}$ (less than)
$GT = (A > B) = GE \cdot \overline{EQ}$ (greater than)
$LE = (A \le B) = \overline{GT} = \overline{GE} + EQ$ (less or equal)

Equality comparison
$EQ = (A = B)$
$eq_{i+1} = (a_i = b_i) \, eq_i = \overline{(a_i \oplus b_i)} \, eq_i$ ; $i = 0, \ldots, n-1$
$eq_0 = 1$ ; $EQ = eq_n$ (r.s.a.)
[Figure cmpeq.epsi : inputs a_{n-1} b_{n-1} ... a_0 b_0, output EQ]

Magnitude comparison
$GE = (A \ge B)$
$ge_{i+1} = (a_i > b_i) + (a_i = b_i) \, ge_i = a_i \overline{b}_i + \overline{(a_i \oplus b_i)} \, ge_i$ ; $i = 0, \ldots, n-1$
$ge_0 = 1$ ; $GE = ge_n$ (r.s.a.)

Comparators
Subtractor ($A - B$) :
$GE = c_{out}$
$EQ = P_{n-1:0}$ (for free in PPA)
[Figure cmpsub.epsi : CPA with inputs A, inverted B and carry-in 1, outputs GE = c_out and EQ = P_{n-1:0}]
$A_{RCA} = 7n$ ; $T_{RCA} = 2n$ or
$A_{PPA-KS} \approx \frac{3}{2} n \log n$ ; $T_{PPA-KS} \approx 2 \log n$
Optimized comparator :
removing redundancies in subtractor (unused $s_i$)
single-tree structure ⇒ speed-up at no cost :
$A = 6n$ ; $T_{LIN} = 2n$ ; $T_{TREE} \approx 2 \log n$
example : ripple comparator using comparator slices
[Figure cmpripple.epsi : slices for equality and magnitude, inputs a_{n-1} b_{n-1} ... a_0 b_0, outputs GE and EQ]
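The two ripple recurrences can be run side by side in a few lines. This Python sketch is illustrative only; the name `ripple_compare` and the returned dictionary layout are assumptions, and operands are unsigned n-bit integers.

```python
def ripple_compare(a, b, n):
    """Evaluate eq_{i+1} and ge_{i+1} from the LSB upwards (eq_0 = ge_0 = 1)."""
    eq, ge = 1, 1
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        same = 1 - (ai ^ bi)                   # (a_i = b_i)
        ge = (ai & (1 - bi)) | (same & ge)     # a_i > b_i, or equal and ge_i
        eq = same & eq
    return {
        "EQ": eq,
        "NE": 1 - eq,
        "GE": ge,
        "LT": 1 - ge,
        "GT": ge & (1 - eq),
        "LE": (1 - ge) | eq,
    }
```

The derived flags NE, LT, GT, and LE follow the definitions in the list of comparison operations, so only EQ and GE need their own logic.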
Decoder
Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$)
$z_i = 1$ if $A = i$, $0$ else ; $i = 0, \ldots, m-1$ ; $Z = 2^A$
[Figure decodersym.epsi : decoder symbol with input A, output Z]
[Figure decoder.epsi : 3-to-8 decoder, inputs a2 a1 a0, outputs z7 ... z0]
$A = (n - 1) 2^n$ ; $T = \lceil \log n \rceil$

Encoder
Encodes vector $A_{m-1:0}$ to binary number $Z_{n-1:0}$ ($m = 2^n$)
(condition: $\exists i \, \forall k$ : if $k = i$ then $a_k = 1$ else $a_k = 0$)
$Z = i$ if $a_i = 1$ ; $i = 0, \ldots, m-1$ ; $Z = \log_2 A$
[Figure encodersym.epsi : encoder symbol with input A, output Z]
[Figure encoder.epsi : 8-to-3 encoder, inputs a7 ... a0, outputs z2 z1 z0]
$A = n (2^{n-1} - 1)$ ; $T = n - 1$

Detection operations
All-zeroes detection : $z = \overline{a_{n-1} + a_{n-2} + \cdots + a_0}$
All-ones detection : $z = a_{n-1} a_{n-2} \cdots a_0$ (r.s.a.)
$A = n$ ; $T = \log n$
Leading-zeroes detection (LZD) :
for scaling, normalization, priority encoding
a) non-encoded output :
{0}1{0|1} → {0}1{0} (e.g. 000101 → 000100)
prefix problem (r.m.a.) ⇒ AND-prefix structure
$A = 2n$ ; $T = n$
[Figure lzdnenc.epsi : inputs a_{n-1} ... a_0, outputs z_{n-1} ... z_0]
b) encoded output : + encoder
signed numbers : + leading-ones detector (LOZ)
(note: connections according to PPA-SK)
5.5 Shift, Extension, Saturation Applications :
Shift : a) shift n-bit vector by k bit positions adaption of magnitude (shift a)) or word length
b) select n out of more bits at position k (extension) of operands (e.g. for addition)
also: logical (= unsigned), arithmetic (= signed) multiplication/division by multiples of 2 (shift)
Rotation by k bit positions, n constant (logic operation)
logic bit/byte operations (shift, rotation)
scaling of numbers for word-length reduction (i.e.
Extension of word lengths by k bits (n ! n + k ) ignore leading zeroes, shift b)) or normalization (e.g.
(i.e. sign-extension for signed numbers) of floating-point numbers, shift a)) using LZD
Saturation to highest/lowest value after over-/underflow reducing error after over-/underflow (saturation)
shift a) un- l. an,2 ; : : : ; a0 ; 0 sll Implementation of shift/extension/rotation by
signed r. 0; an,1 ; : : : ; a1 srl constant values : hard-wired
signed l. a n, 1 ; an,3 ; : : : ; a0 ; 0 sla variable values : multiplexers
r. an,1 ; an,1; an,2 ; : : : ; a1 sra n possible values : nbyn barrel-shifter/rotator
shift b) unsigned an k,1 ; : : : ; ak
+ Example : 4by4 barrel-rotator
signed a2n,1 ; an k,2 ; : : : ; ak
+ a3 a2 a1 a0
rotate l. an,2 ; : : : ; a0 ; an,1 rol A = O(n2)
r. a0 ; an,1 ; : : : ; a1 ror
T = O(log n) s1 s0
extend un- l. 0; an,1 ; : : : ; a0 a3 a2 a1 a0 s1 s0 barshift.epsi

signed r. an,1 ; : : : ; a0 ; 0
44 49 mm
signed l. an,1 ; an,1; an,2 ; : : : ; a0 s0

muxshift.epsi

41 28 mm
s1 s0
r. an,1 ; an,2 ; : : : ; a0 ; 0 s1 s1 s0
saturate unsigned an,1 ; : : : ; an,1 z3 z2 z1 z0 z3 z2 z1 z0

signed an,1 ; an,1 ; : : : ; an,1 multiplexers tristate buffers
5.6 Addition Flags

flag   formula                                                                                                   description
C      $c_n$                                                                                                     carry flag
V      $c_n \oplus c_{n-1} = \overline{a}_{n-1} \overline{b}_{n-1} s_{n-1} + a_{n-1} b_{n-1} \overline{s}_{n-1}$   signed overflow flag
Z      $\forall i : s_i = 0$                                                                                     zero flag
N      $s_{n-1}$                                                                                                 negative flag, sign

Implementation of adder with flags
C, N : for free
V : fast $c_n$, $c_{n-1}$ computed by e.g. PPA ⇒ very cheap
Z : a) $c_{in} = 1$ (subtract.) : $Z = (A = B) = P_{n-1:0}$ (of PPA)
    b) $c_{in} = 0/1$ :
       1) $Z = \overline{s_{n-1} + s_{n-2} + \cdots + s_0}$ (r.s.a.)
          $A = A_{CPA} + n$ ; $T_Z = T_{CPA} + \lceil \log n \rceil$
       2) faster without final sum (i.e. carry prop.) [18]
          example : 01001100 + 10110100 = 00000000 ($c_{in} = 0$)
          $z_0 = \overline{(a_0 \oplus b_0) \oplus c_{in}}$
          $z_i = \overline{(a_i \oplus b_i) \oplus (a_{i-1} + b_{i-1})}$ ; $i = 1, \ldots, n-1$
          $Z = z_{n-1} z_{n-2} \cdots z_0$ (r.s.a.)
          $A = A_{CPA} + 3n$ ; $T_Z = 4 + \lceil \log n \rceil$

Basic and derived condition flags

condition                flag        formula (unsigned)           formula (signed)
operation: $S = A + B$ (+) or $S = A - B$ (-)
$S = 0$                  zero        $Z$                          $Z$
$S < 0$                  negative                                 $N$
$S \ge 0$                positive                                 $\overline{N}$
$S > max$                overflow    $C$ (+)                      $V \overline{C}$
$S < min$                underflow   $\overline{C}$ (-)           $V C$
operation: $A - B$
$A = B$                  EQ          $Z$                          $Z$
$A \ne B$                NE          $\overline{Z}$               $\overline{Z}$
$A \ge B$                GE          $C$                          $\overline{N} \, \overline{V} + N V$
$A > B$                  GT          $C \overline{Z}$             $(\overline{N} \, \overline{V} + N V) \overline{Z}$
$A < B$                  LT          $\overline{C}$               $N \overline{V} + \overline{N} V$
$A \le B$                LE          $\overline{C} + Z$           $N \overline{V} + \overline{N} V + Z$

Unsigned and signed addition/subtraction only differ with respect to the condition flags
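The carry-free zero detection of variant 2) can be verified exhaustively for a small word length. The following Python is a sketch for illustration; the name `zero_flag_fast` is hypothetical, and the complement around each $z_i$ is assumed (a bit position is "consistent with a zero sum" when $a_i \oplus b_i$ equals the carry implied by the position below).

```python
def zero_flag_fast(a, b, cin, n):
    """Z flag of A + B + cin (mod 2^n) without propagating any carry."""
    z = 1
    below = cin                      # for bit 0 the carry-in itself is used
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        z &= 1 - ((ai ^ bi) ^ below)  # z_i = not((a_i xor b_i) xor below)
        below = ai | bi               # a_{i} + b_{i} feeds position i + 1
    return z
```

Every $z_i$ depends on two neighboring bit positions only, so the final AND tree is the only logarithmic-depth part, in line with $T_Z = 4 + \lceil \log n \rceil$.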
5.7 Arithmetic Logic Unit (ALU)
[Figure alusymbol.epsi : ALU symbol with inputs A, B, c_in, op and outputs Z, c_out, flags]

ALU operations
arithmetic     add   $A + B + c_{in}$      sub    $A - B - c_{in}$
               inc   $A + 1$               dec    $A - 1$
               pass  $A$                   neg    $-A$
logic          and   $a_i b_i$             nand   $\overline{a_i b_i}$
               or    $a_i + b_i$           nor    $\overline{a_i + b_i}$
               xor   $a_i \oplus b_i$      xnor   $\overline{a_i \oplus b_i}$
               pass  $a_i$                 not    $\overline{a}_i$
shift/rotate   sll   $A \ll 1$             srl    $A \gg 1$
               sla   $A \ll_a 1$           sra    $A \gg_a 1$
               rol   $A \ll_r 1$           ror    $A \gg_r 1$
s/ro : shift/rotate ; l/r : left/right ;
l/a : logic (unsigned) / arithmetic (signed)
Logic of adder/subtractor can partly be shared with logic operations

6 Multiplication
6.1 Multiplication Basics
Multiplies two $n$-bit operands $A$ and $B$ [1, 2]
Product $P$ is $(2n)$-bit unsigned number or $(2n - 1)$-bit signed number
Example : unsigned multiplication
$P = A \cdot B = \sum_{i=0}^{n-1} a_i 2^i \cdot \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$ or
$P_i = a_i B$ ; $P = \sum_{i=0}^{n-1} P_i 2^i$ ; $i = 0, \ldots, n-1$ (r.s.a.)
Algorithm
1) Generation of $n$ partial products $P_i$
2) Adding up partial products :
a) sequentially (sequential shift-and-add),
b) serially (combinational shift-and-add), or
c) in parallel
Speed-up techniques
Reduce number of partial products
Accelerate addition of partial products
Sequential multipliers :
partial products generated and added sequentially (using accumulator)
$A = O(n)$ ; $T = O(\log n)$ ; $L = n$
[Figure mulseq.epsi : CPA with accumulator]

Array multipliers :
partial products generated and added simultaneously in linear array (using array adder)
$A = O(n^2)$ ; $T = O(n)$
[Figure mularr.epsi : stack of CSAs followed by a CPA]

Parallel multipliers :
partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)
$A = O(n^2)$ ; $T = O(\log n)$
[Figure mulpar.epsi : CSA tree followed by a CPA]

Signed multipliers :
a) complement operands before and result after multiplication ⇒ unsigned multiplication
b) direct implementation (dedicated multiplier structure)

6.2 Unsigned Array Multiplier
Braun multiplier : array multiplier for unsigned numbers
$P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$
$A = 8n^2 - 11n$
$T = 6n - 9$

                     a0b3 a0b2 a0b1 a0b0
                a1b3 a1b2 a1b1 a1b0
           a2b3 a2b2 a2b1 a2b0
    + a3b3 a3b2 a3b1 a3b0
      p7   p6   p5   p4   p3   p2   p1   p0

[Figure mulbraun.epsi : 4 x 4 Braun multiplier array of HAs and FAs, inputs a3 ... a0 and b3 ... b0, outputs p7 ... p0, regions marked 1, 2, 3]
6.3 Signed Array Multipliers 6.4 Booth Recoding
Modified Braun multiplier Speed-up technique : reduction of partial products
Subtract bits with negative weight ) special FAs [1] Sequential multiplication
1 neg. bit : ,a + b + cin = 2cout , s Minimal (or canonical) signed-digit (SD) represent. of A
2 neg. bits : a , b , cin = ,2cout + s + One cycle per non-zero partial product (i.e. 8ai j ai =
6 0)
Replace FAs in regions Negative partial products

1 , 2 , and 3 by :
s = a b cin
cout = ab + acin + bcin Data-dependent reduction of partial products and latency
(input a at mark )
Combinational multiplication
Otherwise exactly same structure and complexity as
Braun multiplier ) efficient and flexible Only fixed reduction of partial product possible
Radix-4 modified Booth recoding : 2 bits recoded to one
Baugh-Wooley multiplier multiplier digit ) n=2 partial products
Arithmetic transformations yield the following partial n=2
X
products (two additional ones) : A= (a2i,1 + a2i , 2a2i+1 ) 22i ; a,1 = 0
i=0 | f,2;,1{z;0;+1;+2g
}
a0 b3 a0 b2 a0 b1 a0 b0
a1b3 a1 b2 a1 b1 a1 b0 a2i a2i a2i,1 Pi
a2 b3 a2 b2 a2 b1 a2 b0 0
+1
0 0 + 0
recoding
a3 b3 a3 b2 a3 b1 a3 b0 B
Booth
0 0 1 +
a3 a3 0 1 0 + B

+ 1 b3 b3 0 1 1 +2 B mulbooth.epsi

p7 p6 p5 p4 p3 p2 p1 p0 1 0 0 , 2B 41 43 mm
CSA
1 0 1 , B array/tree
Less efficient and regular than modified Braun 1 1 0 , B
multiplier 1 1 1 , 0 CPA
Applicable to sequential, array, and parallel multipliers 6.5 Wallace Tree Addition
additional recoding logic and more A : +8n Speed-up technique : fast partial product addition
complex partial product generation
(MUX for shift, XOR for negation)
T : +7 A = O(n2) ; T = O(log n)
+ adder array/tree cut in half Applicable to parallel multipliers : parallel partial
) considerably smaller (array and tree) A : =2 product generation (normal or Booth recoded)
Irregular adder tree (Wallace tree) due to different
) much faster for adder arrays T : =2 number of bits per column
) slightly or not faster for adder trees T : ,0 ) irregular wiring and/or layout
Negative partial products (avoid sign-extension) : ) non-uniform bit arrival times at final adder
p 3 p3 p3 p3 p2 p1 p0 = 0 0 0 ,p3 p2 p1 p0 6.6 Multiplier Implementations
| {z }
ext. sign = 1 Sequential multipliers :
+ 1 1 1 p3 p2 p1 p0 low performance, small area, resource sharing (adder)
p03 p03 p03 p03 p02 p01 p00
1
p03 p02 p01 p00
Braun or Baugh-Wooley multiplier (array multiplier) :
p13 p13 p13 p12 p11 p10 ! p12 p11 p10
p13 medium performance, high area, high regularity
p23 p23 p22 p21 p20 p23p22 p21 p20 layout generators ) data paths and macro-cells
p33 p32 p31 p30 + p33 p32 p31 p30 + simple pipelining, faster CPA ) higher speed
p6 p5 p4 p3 p2 p1 p0 p6 p5 p4 p3 p2 p1 p0
Booth-Wallace multiplier (parallel multiplier) [9] :
Suited for signed multiplication (incl. Booth recod.) high performance, high area, low regularity
Extend A for unsigned multiplication : an = 0 custom multipliers, netlist generators
often pipelined (e.g. register between CSA-tree and CPA)
Radix-8 (3-bit recoding) and higher radices : Signed-unsigned multiplier : signed multiplier with
precomputing 3B , : : : ) larger overhead
operands extended by 1 bit (an = an,1 =0, bn = bn,1=0)
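The radix-4 modified Booth recoding described in Section 6.4 can be illustrated with a short Python sketch. It assumes an even word length n, a multiplier A given in 2's complement, and n/2 recoded digits $d_i = a_{2i-1} + a_{2i} - 2 a_{2i+1}$ with $a_{-1} = 0$; the function names are hypothetical.

```python
def booth_radix4_digits(a, n):
    """Recode the n-bit 2's complement pattern a into n/2 digits in {-2..+2}."""
    def bit(k):
        return 0 if k < 0 else (a >> k) & 1    # a_{-1} = 0
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1)
            for i in range(n // 2)]


def booth_multiply(a, b, n):
    """Signed product a * b using n/2 partial products d_i * B * 4^i."""
    digits = booth_radix4_digits(a & ((1 << n) - 1), n)
    return sum(d * b * 4 ** i for i, d in enumerate(digits))
```

For example, the 4-bit multiplier 0111 (+7) is recoded to the digits (-1, +2), i.e. 7 = -1 + 2 * 4, so only two partial products (-B and +2B shifted by two positions) are added.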
6.7 Composition from Smaller Multipliers
$(2n \times 2n)$-bit multiplier can be composed from 4 $(n \times n)$-bit multipliers (can be repeated recursively)
$A \cdot B = (A_H 2^n + A_L) \cdot (B_H 2^n + B_L) = A_H B_H 2^{2n} + (A_H B_L + A_L B_H) 2^n + A_L B_L$
4 $(n \times n)$-bit multipliers + $(2n)$-bit CSA + $(3n)$-bit CPA
less efficient (area and speed)
[Figure : alignment of the four partial products A_H B_H, A_H B_L, A_L B_H, and A_L B_L]

6.8 Squaring
$P = A^2 = AA$ : multiplier optimizations possible
[Partial-product matrix of the 4-bit squarer (products a_i a_j, with a_i a_i = a_i) and its folded, optimized form, result bits p7 ... p0]
+ $\lfloor n/2 \rfloor + 1$ partial products (if no Booth recoding used)
⇒ optimized squarer more efficient than multiplier
Table look-up (ROM) less efficient for every $n$

7 Division / Square Root Extraction
7.1 Division Basics
$\frac{A}{B} = Q + \frac{R}{B}$ ; $A = Q \cdot B + R$ ; $R < B$
$R = A$ rem $B$ (remainder)
$A \in [0, 2^{2n} - 1]$ ; $B, Q, R \in [0, 2^n - 1]$ ; $B \ne 0$
$Q < 2^n \rightarrow A < 2^n B$, otherwise overflow
⇒ normalize $B$ before division ($B \in [2^{n-1}, 2^n - 1]$)

Algorithms (radix-2)
Subtract-and-shift : partial remainders $R_i$ [1, 2]
Sequential algorithm : recursive, f non-associative
$q_i = (R_{i+1} \ge 2^i B)$ ; $R_i = R_{i+1} - q_i 2^i B$
$R_n = A$ ; $R = R_0$ ; $i = n-1, \ldots, 0$ (r.m.n.)
Basic algorithm : compare and conditionally subtract ⇒ expensive comparison and CPA
Restoring division : subtract and conditionally restore (adder or multiplexer) ⇒ expensive CPA and restoring
Non-restoring division : detect sign, subtract/add, and correct by next steps ⇒ expensive CPA
SRT division : estimate range, subtract/add (CSA), and correct by next steps ⇒ inexpensive CSA
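The restoring and non-restoring radix-2 variants listed above can be compared with a behavioral sketch. The Python below is illustrative only: function names are hypothetical, operands are unsigned with $A < 2^n B$, and the non-restoring version accumulates the digits $q'_i \in \{-1, +1\}$ directly as a signed integer instead of converting them bitwise.

```python
def restoring_divide(a, b, n):
    """Subtract 2^i * B, keep the result only if it is non-negative."""
    if b <= 0 or not 0 <= a < (b << n):
        raise ValueError("requires B > 0 and 0 <= A < 2^n * B")
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        trial = r - (b << i)
        if trial >= 0:               # q_i = 1
            r = trial
            q |= 1 << i
        # else q_i = 0 and the old remainder is restored (kept)
    return q, r


def nonrestoring_divide(a, b, n):
    """Subtract or add 2^i * B depending on the sign of the remainder."""
    if b <= 0 or not 0 <= a < (b << n):
        raise ValueError("requires B > 0 and 0 <= A < 2^n * B")
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        if r >= 0:                   # q'_i = +1
            r -= b << i
            q += 1 << i
        else:                        # q'_i = -1
            r += b << i
            q -= 1 << i
    if r < 0:                        # final correction step for R
        r += b
        q -= 1
    return q, r
```

Both functions return the same (Q, R) pair; the non-restoring variant performs exactly one addition or subtraction per step plus the final correction.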
7.2 Restoring Division
$q_i = 1$ if $R_{i+1} - B 2^i \ge 0$ ; $q_i = 0$ if $R_{i+1} - B 2^i < 0$
step $i$ : $R_{i+1} - B 2^i < 0$ : $q_i = 0$ ; $R_i = R_{i+1}$ (restored)
step $i - 1$ : $R_{i+1} - B 2^{i-1} \ge 0$ : $q_{i-1} = 1$ ; $R_{i-1} = R_{i+1} - B 2^{i-1}$

7.3 Non-Restoring Division
$q'_i = 1$ if $R_{i+1} \ge 0$ ; $q'_i = \overline{1}$ if $R_{i+1} < 0$
step $i$ : $R_{i+1} \ge 0$ : $q'_i = 1$ ; $R_i = R_{i+1} - B 2^i$
step $i - 1$ : $R_{i+1} - B 2^i < 0$ : $q'_{i-1} = \overline{1}$ ; $R_{i-1} = R_{i+1} - B 2^i + B 2^{i-1} = R_{i+1} - B 2^{i-1}$
One subtraction/addition (CPA) per step
Final correction step for $R$ (additional CPA)
Simple quotient digit conversion : (note: $q'_i$ irredundant)
$q'_i \in \{\overline{1}, 1\} \rightarrow q_i \in \{0, 1\}$ : $q_i = \frac{1}{2}(q'_i + 1)$
$Q = (\overline{q}_{n-1}, q_{n-2}, q_{n-3}, \ldots, q_0, 1)$
$A = (n + 1) A_{CPA} = O(n^2)$ or $O(n^2 \log n)$
$T = (n + 1) T_{CPA} = O(n^2)$ or $O(n \log n)$
[Figure divnr.epsi : column of +/- CPAs with inputs A, B and outputs Q, R]

7.4 Signed Division
$q'_i = 1$ if $R_{i+1}$, $B$ same sign ; $q'_i = \overline{1}$ if $R_{i+1}$, $B$ opposite sign
Example : signed non-restoring array divider
(simplifications: $B > 0$, final correction of $R$ omitted)
$A = 9n^2$ ; $T = 2n^2 + 4n$
[Figure divarray.epsi : 4-row FA array with inputs a6 ... a0, b3 ... b0 and outputs q3 ... q0, r3 ... r0]
7.5 SRT Division (Sweeney, Robertson, Tocher) 7.6 High-Radix Division
8
>
>1
if
< B 2i Ri 1 + Radix = 2m , qi0 2 f , 1; : : : ; 1; 0; 1; : : : ; , 1g
0
qi = >0 if ,B 2i Ri 1 < B 2i ; qi0 is SD number
>
:1 if
+
Ri 1 < ,B 2i
+
m quotient bits per step ) fewer, but more complex steps
+ Suitable for SRT algorithm ) faster
If 2n,1 B < 2n , i.e. B is normalized :
) ,B 2i ,2n i,1 Ri 1 < 2n i,1 B 2i
+ + Complex comparisons (more bits) and decisions
) table look-up () Pentium bug!)
+
8
>
<1 if
> 2n i,1 Ri 1 +
+
) qi = >>0 if ,2n i,1 Ri 1 < 2n i,1

0 +
+
+
:1 if Ri 1 < ,2n i,1 +

+ 7.7 Division by Multiplication
+ Only 3 MSB are compared ) qi0 are estimated ) CSA

Division by convergence
instead of CPA can be used (precise enough) [19] A = A R0R1 Rm,1 ! A B1
Q= B Q resp. Q
Correction in following steps (+ final correction step) B R0 R1 Rm,1 B B1 =
1 2n
Redundant representation of qi0 (SD representation) )
final conversion necessary (CPA) Bi +1 Bi Ri = 2| n(1{z, y)} (|1 +
= y) = |2n(1{z, y2 )} ;
{z }
+ Highly regular and fast (O(n)) SRT array dividers Bi Ri > Bi ; ! 2n
) only slightly slower/larger than array multipliers y = 1 , Bi2,n ; Ri = 2 , Bi2,n = B i + 1 (signed)
A B
A = nACSA + 2ACPA +/ CSA

Algorithm : Bi +1 Bi Ri ; Ai 1 = Ai Ri
= +
= O (n2 ) +/ CSA Ri = B i + 1 ; i = 0; : : : ; m , 1
CPA
Q divsrt.epsi
T = nTCSA + TCPA 50 38 mm+/ CSA
+/ CSA A0 = A ; B0 = B ; Q = Am (r.s.n.)
= O (n) +/ CPA
R
Quadratic convergence : L = dlog ne
Division by reciprocation
$Q = \frac{A}{B} = A \cdot \frac{1}{B}$
Newton-Raphson iteration method :
find $f(X) = 0$ by recursion $X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}$
$f(X) = \frac{1}{X} - B$ ; $f'(X) = -\frac{1}{X^2}$ ; $f\left(\frac{1}{B}\right) = 0$
Algorithm :
$X_{i+1} = X_i \cdot (2 - B \cdot X_i)$ ; $i = 0, \ldots, m-1$
$X_0 = B$ ; $Q = X_m$ (r.s.n.)
Quadratic convergence : $L = O(\log n)$
Speed-up : first approximation $X_0$ from table

7.8 Remainder / Modulus
Remainder (rem) : signed remainder of a division
$R = A$ rem $B = A - \lfloor A/B \rfloor \cdot B$ ; sign($R$) = sign($A$)
Modulus (mod) : positive remainder of a division
$M = A$ mod $B$ ; $M \ge 0$ ; $M = R$ if $A \ge 0$, $M = R + B$ else

7.9 Divider Implementations
Iterative dividers (through multiplication) :
resource sharing of existing components (multiplier)
medium performance, medium area
high efficiency if components are shared
Sequential dividers (restoring, non-restoring, SRT) :
resource sharing of existing components (e.g. adder)
low performance, low area
Array dividers (restoring, non-restoring, SRT) :
dedicated hardware component
high performance, high area
high regularity ⇒ layout generators, pipelining
square root extraction possible by minor changes
combination with multiplication or/and square root
No parallel dividers exist, as compared to parallel multipliers (sequential nature of division)
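The quadratic convergence of the Newton-Raphson reciprocal iteration $X_{i+1} = X_i (2 - B X_i)$ can be observed with exact rational arithmetic. The Python sketch below is an illustration under assumptions: B is taken as a normalized fraction in [1/2, 1), the start value defaults to $X_0 = B$ as on the slide, and the name `reciprocal_newton` is hypothetical.

```python
from fractions import Fraction


def reciprocal_newton(b, steps, x0=None):
    """Iterate X <- X * (2 - B * X); returns (X_m, list of errors 1 - B * X_i)."""
    x = b if x0 is None else x0
    errors = []
    for _ in range(steps):
        errors.append(1 - b * x)     # relative error before this step
        x = x * (2 - b * x)
    return x, errors


# Example: B = 3/4, so 1/B = 4/3.
x, errors = reciprocal_newton(Fraction(3, 4), 5)
```

Since $1 - B X_{i+1} = (1 - B X_i)^2$, the error is squared in every step, i.e. the number of correct bits doubles, which is the reason for $L = O(\log n)$ iterations. A quotient is then obtained by one further multiplication with A.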
7.10 Square Root Extraction
$\sqrt{A - R} = Q$ ; $A = Q^2 + R$
$A \in [0, 2^{2n} - 1]$ ; $Q \in [0, 2^n - 1]$
Algorithm
Subtract-and-shift : partial remainders $R_i$ and quotients $Q_i = Q_{i+1} + q_i 2^i = (q_{n-1}, \ldots, q_i, 0, \ldots, 0)$ [1]
$Q_i^2 = (Q_{i+1} + q_i 2^i)^2 = Q_{i+1}^2 + q_i 2^i (2 Q_{i+1} + q_i 2^i)$
$q_i = (R_{i+1} \ge 2^i (2 Q_{i+1} + 2^i))$ ; $Q_i = Q_{i+1} + q_i 2^i$
$R_i = R_{i+1} - q_i 2^i (2 Q_{i+1} + q_i 2^i)$ ; $i = n-1, \ldots, 0$
$R_n = A$ ; $Q_n = 0$ ; $R = R_0$ ; $Q = Q_0$ (r.m.n.)
Implementation
+ Similar to division ⇒ same algorithms applicable (restoring, non-restoring, SRT, high-radix)
+ Combination with division in same component possible
Only triangular array required (step $i$ : $q_k = 0$ for $k < i$)
$A \approx A_{DIV} / 2$
$T \approx T_{DIV}$
[Figure sqrtnr.epsi : triangular column of +/- CPAs with input A and outputs Q, R]

8 Elementary Functions
Exponential function : $e^x$ (exp $x$)
Logarithm function : ln $x$, log $x$
Trigonometric functions : sin $x$, cos $x$, tan $x$
Inverse trig. functions : arcsin $x$, arccos $x$, arctan $x$
Hyperbolic functions : sinh $x$, cosh $x$, tanh $x$

8.1 Algorithms
Table look-up : inefficient for large word lengths [5]
Taylor series expansion : complex implementation
Polynomial and rational approximations [1, 5]
Shift-and-add algorithms [5]
Convergence algorithms [1, 2] :
similar to division-by-convergence
two (or more) recursive formulas : one formula converges to a constant, the other to the result
Coordinate rotation (CORDIC) [2, 5, 20] :
3 equations for x-, y-coordinate, and angle
computes all elementary functions by proper input settings and choice of modes and outputs
simple, universal hardware, small look-up table
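The subtract-and-shift square root recursion of Section 7.10 translates almost directly into code. This Python sketch is illustrative; the name `isqrt_subtract_shift` is hypothetical, and the operand is assumed to lie in $[0, 2^{2n} - 1]$.

```python
def isqrt_subtract_shift(a, n):
    """Return (Q, R) with A = Q^2 + R, using one conditional subtraction per bit."""
    r, q = a, 0                                # R_n = A, Q_n = 0
    for i in range(n - 1, -1, -1):
        t = (1 << i) * (2 * q + (1 << i))      # 2^i * (2 Q_{i+1} + 2^i)
        if r >= t:                             # q_i = 1
            r -= t
            q += 1 << i
    return q, r
```

The loop maintains the invariant $R_i = A - Q_i^2$, which is the reason the same compare-and-subtract structure as in restoring division can be reused.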
8.2 Integer Exponentiation
Approximated exponentiation : $x^y = e^{y \ln x} = 2^{y \log x}$
Base-2 integer exponentiation : $2^A = (\ldots, 0, 1, 0, \ldots)$ with the single 1 at bit position $A$
Integer exponentiation (exact) :
$A^B = A \cdot A \cdots A$ ($B$ factors) ; $L = 0 \ldots 2^n - 1$ (!)
Applications : modular exponentiation $A^B \pmod{C}$ in cryptographic algorithms (e.g. IDEA, RSA)
Algorithms : square-and-multiply
a) $E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0} = A^{2^{n-1} b_{n-1}} A^{2^{n-2} b_{n-2}} \cdots A^{4 b_2} A^{2 b_1} A^{b_0}$
$E_i = P_i^{b_i} E_{i-1}$ ; $P_{i+1} = P_i^2$ ; $i = 0, \ldots, n-1$
$E_{-1} = 1$ ; $P_0 = A$ ; $E = E_{n-1}$ (r.s.n.)
$A = 2 A_{MUL}$ ; $T = T_{MUL}$ ; $L = n$ or
$A = A_{MUL}$ ; $T = T_{MUL}$ ; $L = 2n$
b) $E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0} = (\cdots((A^{b_{n-1}})^2 A^{b_{n-2}})^2 \cdots A^{b_1})^2 A^{b_0}$
$E_i = E_{i+1}^2 A^{b_i}$ ; $i = n-1, \ldots, 0$
$E_n = 1$ ; $E = E_0$ (r.s.n.)
$A = A_{MUL}$ ; $T = T_{MUL}$ ; $L = 2(n - 1)$

8.3 Integer Logarithm
$Z = \lfloor \log_2 A \rfloor$
For detection/comparison of order of magnitude
Corresponds to leading-zeroes detection (LZD) with encoded output
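Both square-and-multiply recursions of Section 8.2 can be sketched for the modular case mentioned under applications. The Python below is illustrative only; the function names are hypothetical, the exponent B is an n-bit unsigned number, and every product is reduced modulo C.

```python
def modexp_rtl(a, b, n, c):
    """Variant a): scan the exponent from LSB to MSB, E_i = P_i^{b_i} * E_{i-1}."""
    e, p = 1 % c, a % c              # E_{-1} = 1, P_0 = A
    for i in range(n):
        if (b >> i) & 1:
            e = (e * p) % c          # multiply
        p = (p * p) % c              # square: P_{i+1} = P_i^2
    return e


def modexp_ltr(a, b, n, c):
    """Variant b): scan from MSB to LSB, E_i = E_{i+1}^2 * A^{b_i}."""
    e = 1 % c                        # E_n = 1
    for i in range(n - 1, -1, -1):
        e = (e * e) % c              # square
        if (b >> i) & 1:
            e = (e * a) % c          # multiply
    return e
```

In variant a) the squaring and the multiplication of one step are independent, which is why two multipliers give L = n; in variant b) they depend on each other and share one multiplier.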
9 VLSI Design Aspects
9.1 Design Levels

Transistor-level design
Circuit and layout designed by hand (full custom)
Low design efficiency
High circuit performance : high speed, low area
High flexibility : choice of architecture and logic style
Transistor-level circuit optimizations :
logic style : static vs. dynamic logic, complementary CMOS vs. pass-transistor logic
special arithmetic circuits : better than with gates
carry chain : [Figure carrychain.epsi : transistor-level carry chain with signals g_i, k_i, p_i, c_i between c_in and c_out]
full-adder : [Figure facmos.epsi : CMOS full-adder with inputs a, b, c_in and outputs s, c_out]

Gate-level design
Cell-based design techniques : standard-cells, gate-array/sea-of-gates, field-programmable gate-array (FPGA)
Circuit implemented by hand or by synthesis (library)
Layout implemented by automated place-and-route
Medium to high design efficiency
Medium to low circuit performance
Medium to low flexibility : full choice of architecture

Block-level design
Layout blocks and netlists from parameterized automatic generators or compilers (library)
High design efficiency
Medium to high circuit performance
Low flexibility : limited choice of architectures
Implementations :
data-path : bit-sliced, bus-oriented layout (array of cells: $n$ bits $\times$ $m$ operations), implementation of entire data paths, medium performance, medium diversity
macro-cells : tiled layout, fixed/single-operation components, high performance, small diversity
portable netlists : ⇒ gate-level design
9.2 Synthesis

High-level synthesis
  Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL
  Involves architectural synthesis and arithmetic transformations
  High-level synthesis is still in its early stages

Low-level synthesis
  Layout and netlist generators
  Included in libraries and synthesis tools
  Low-level synthesis is state-of-the-art
  Basis for efficient ASIC design
  Limited diversity and flexibility of library components

Circuit optimization
  Efficient optimization of random logic is state-of-the-art
  Optimization of entire arithmetic circuits is not feasible ⇒ only local optimizations possible
  Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators

9.3 VHDL

Arithmetic types : unsigned, signed (2's complement)

Arithmetic packages
  numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys)
  contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types

Arithmetic operators (VHDL87/93) [21]
  relational : =, /=, <, <=, >, >=
  shift, rotate (93 only) : rol, ror, sla, sll, sra, srl
  adding : +, -
  sign (unary) : +, -
  multiplying : *, /, mod, rem
  exponent, absolute : **, abs

Synthesis
  Typical limitations of synthesis tools :
    /, mod, rem : both operands must be constant or divisor must be a power of two
    ** : for power-of-two bases only
  Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)
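The power-of-two cases that synthesis tools typically accept for /, mod and ** share one property: they reduce to bit selection or shifting and need no arithmetic unit. The following Python sketch illustrates this for unsigned operands only; the function names are hypothetical, and signed operands (where rounding conventions differ) are deliberately left out.

```python
def div_pow2(a: int, k: int) -> int:
    """Unsigned a / 2**k: drop the k low-order bits (a right shift)."""
    return a >> k


def mod_pow2(a: int, k: int) -> int:
    """Unsigned a mod 2**k: keep only the k low-order bits (a mask)."""
    return a & ((1 << k) - 1)


def pow2(e: int) -> int:
    """2**e: a single 1 shifted left by e positions."""
    return 1 << e
```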
Resource sharing
  Sharing one resource for multiple operations
  Done automatically by some synthesis tools
  Otherwise, appropriate coding is necessary :
    a) S <= A + C when SELA = '1' else B + C;
       ⇒ 2 adders + 1 multiplexer
    b) T <= A when SELA = '1' else B;
       S <= T + C;
       ⇒ 1 multiplexer + 1 adder

Coding & synthesis hints
  Addition : single adder with carry-in/carry-out :
    Aext <= resize(A, width+1) & Cin;
    Bext <= resize(B, width+1) & '1';
    Sext <= Aext + Bext;
    S    <= Sext(width downto 1);
    Cout <= Sext(width+1);
  Synthesis : check synthesis result for allocated arithmetic units ⇒ code sanity check, control of circuit size

VHDL library of arithmetic units
  Structural, synthesizable VHDL code for most circuits described in this text is found in [22]

9.4 Performance

Pipelining
  Pipelining is basically possible with every combinational circuit ⇒ higher throughput
  Arithmetic circuits are well suited for pipelining due to high regularity
  Pipelining of arithmetic circuits can be very costly :
    large amount of internal signals in arithmetic circuits
    array structures : many small pipeline registers
    tree structures : few large pipeline registers
    ⇒ no advantage of tree structures anymore (except for smaller latency)
  Fine-grain pipelining ⇒ systolic arrays (often applied to arithmetic circuits)

High speed
  Fast circuit architectures, pipelining, replication (parallelization), and combinations of those
  Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
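The single-adder coding hint of Section 9.3 (append Cin below A and a constant 1 below B, add once, then discard the lowest sum bit) can be checked with a small integer model. The sketch below is an assumption-laden illustration: it treats A and B as unsigned `width`-bit values and mirrors the VHDL signal names in lower case.

```python
from typing import Tuple


def add_with_carry(a: int, b: int, cin: int, width: int) -> Tuple[int, int]:
    """Return (S, Cout) of a + b + cin using one (width+2)-bit addition."""
    mask = (1 << width) - 1
    if not (0 <= a <= mask and 0 <= b <= mask and cin in (0, 1)):
        raise ValueError("operands out of range")
    a_ext = (a << 1) | cin               # resize(A, width+1) & Cin
    b_ext = (b << 1) | 1                 # resize(B, width+1) & '1'
    s_ext = a_ext + b_ext                # the only adder
    s = (s_ext >> 1) & mask              # Sext(width downto 1)
    cout = (s_ext >> (width + 1)) & 1    # Sext(width+1)
    return s, cout
```

The extra LSB position adds Cin + 1, which produces a carry into bit 1 exactly when Cin = 1, so the carry-in enters through the ordinary carry chain instead of a second adder.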
Low power
  Power-related properties of arithmetic circuits :
    High glitching activity due to high bit dependencies and large logic depth
  Power reduction in arithmetic circuits [23] :
    Reduce the switched capacitance by choosing an area efficient circuit architecture
    Allow for lower supply voltage by speeding up the circuitry
    Reduce the transition activity :
      apply stable inputs while circuit is not in use (⇒ disabling subcircuits)
      reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
      reduce glitching transitions by reducing logic depth (pipelining)
      take advantage of correlated data streams
      choose appropriate number representations (e.g. Gray codes for counters)

9.5 Testability

Testability goal : high fault coverage with few test vectors that are easy to generate/apply
Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits
Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits
Hard-detectable faults found in :
  circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags
  circuits using redundant number representations (≠ redundant hardware) : dividers (Pentium bug!)
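The low-power hint about Gray codes for counters can be quantified with a small experiment. The Python sketch below counts state-bit transitions over one full cycle of an n-bit counter in binary and in reflected Gray code. It models only the register outputs, not glitches inside the increment logic, and the function names are chosen here for illustration.

```python
def to_gray(x: int) -> int:
    """Reflected binary Gray code of x."""
    return x ^ (x >> 1)


def bit_flips(codes: list) -> int:
    """Total bit transitions along a cyclic sequence of code words."""
    total = 0
    for cur, nxt in zip(codes, codes[1:] + codes[:1]):
        total += bin(cur ^ nxt).count("1")
    return total


def counter_activity(n: int) -> tuple:
    """Return (binary_flips, gray_flips) for one full cycle of an n-bit counter."""
    binary = list(range(1 << n))
    gray = [to_gray(x) for x in binary]
    return bit_flips(binary), bit_flips(gray)
```

For n = 4 the binary counter toggles 30 state bits per cycle of 16 counts, while the Gray counter toggles exactly one bit per count, 16 in total.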
Bibliography

[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993.
[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
[3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.
[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhäuser Boston, 1997.
[6] Proceedings of the Xth Symposium on Computer Arithmetic.
[7] IEEE Transactions on Computers.
[8] D. R. Lutz and D. N. Jayasimha, "Programmable modulo-k counters", IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939-941, Nov. 1996.
[9] H. Makino et al., "An 8.8-ns 54 × 54-bit multiplier with high speed redundant binary architecture", IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773-783, June 1996.
[10] W. N. Holmes, "Composite arithmetic: Proposal for a new standard", IEEE Computer, vol. 30, no. 3, pp. 65-73, Mar. 1997.
[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.
[12] A. Tyagi, "A reduced-area scheme for carry-select adders", IEEE Trans. Comput., vol. 42, no. 10, pp. 1162-1170, Oct. 1993.
[13] T. Han and D. A. Carlson, "Fast area-efficient VLSI adders", in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49-56.
[14] D. W. Dobberpuhl et al., "A 200-MHz 64-b dual-issue CMOS microprocessor", IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1564, Nov. 1992.
[15] A. De Gloria and M. Olivieri, "Statistical carry lookahead adders", IEEE Trans. Comput., vol. 45, no. 3, pp. 340-347, Mar. 1996.
[16] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach", IEEE Trans. Comput., vol. 45, no. 3, pp. 294-305, Mar. 1996.
[17] Z. Wang, G. A. Jullien, and W. C. Miller, "A new design technique for column compression multipliers", IEEE Trans. Comput., vol. 44, no. 8, pp. 962-970, Aug. 1995.
[18] J. Cortadella and J. M. Llaberia, "Evaluation of A + B = K conditions without carry propagation", IEEE Trans. Comput., vol. 41, no. 11, pp. 1484-1488, Nov. 1992.
[19] S. E. McQuillan and J. V. McCanny, "Fast VLSI algorithms for division and square root", J. VLSI Signal Processing, vol. 8, pp. 151-168, Oct. 1994.
[20] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing", IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16-35, July 1992.
[21] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997.
[22] R. Zimmermann, VHDL Library of Arithmetic Units, http://www.iis.ee.ethz.ch/~zimmi/arith_lib.html.
[23] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.