
Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich

Institut für Integrierte Systeme
Integrated Systems Laboratory

Lecture notes on

Computer Arithmetic:
Principles, Architectures,
and VLSI Design

March 16, 1999

Reto Zimmermann

Integrated Systems Laboratory


Swiss Federal Institute of Technology (ETH)
CH-8092 Zurich, Switzerland
zimmermann@iis.ee.ethz.ch

Copyright © 1999 by Integrated Systems Laboratory, ETH Zurich
http://www.iis.ee.ethz.ch/ zimmi/publications/comp arith notes.ps.gz
Contents

1 Introduction and Conventions
    1.1 Outline
    1.2 Motivation
    1.3 Conventions
    1.4 Recursive Function Evaluation
2 Arithmetic Operations
    2.1 Overview
    2.2 Implementation Techniques
3 Number Representations
    3.1 Binary Number Systems (BNS)
    3.2 Gray Numbers
    3.3 Redundant Number Systems
    3.4 Residue Number Systems (RNS)
    3.5 Floating-Point Numbers
    3.6 Logarithmic Number System
    3.7 Antitetrational Number System
    3.8 Composite Arithmetic
    3.9 Round-Off Schemes
4 Addition
    4.1 Overview
    4.2 1-Bit Adders, (m, k)-Counters
    4.3 Carry-Propagate Adders (CPA)
    4.4 Carry-Save Adder (CSA)
    4.5 Multi-Operand Adders
    4.6 Sequential Adders
5 Simple / Addition-Based Operations
    5.1 Complement and Subtraction
    5.2 Increment / Decrement
    5.3 Counting
    5.4 Comparison, Coding, Detection
    5.5 Shift, Extension, Saturation
    5.6 Addition Flags
    5.7 Arithmetic Logic Unit (ALU)
6 Multiplication
    6.1 Multiplication Basics
    6.2 Unsigned Array Multiplier
    6.3 Signed Array Multipliers
    6.4 Booth Recoding
    6.5 Wallace Tree Addition
    6.6 Multiplier Implementations
    6.7 Composition from Smaller Multipliers
    6.8 Squaring
7 Division / Square Root Extraction
    7.1 Division Basics
    7.2 Restoring Division
    7.3 Non-Restoring Division
    7.4 Signed Division
    7.5 SRT Division
    7.6 High-Radix Division
    7.7 Division by Multiplication
    7.8 Remainder / Modulus
    7.9 Divider Implementations
    7.10 Square Root Extraction
8 Elementary Functions
    8.1 Algorithms
    8.2 Integer Exponentiation
    8.3 Integer Logarithm
9 VLSI Design Aspects
    9.1 Design Levels
    9.2 Synthesis
    9.3 VHDL
    9.4 Performance
    9.5 Testability
Bibliography


1 Introduction and Conventions

1.1 Outline

• Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7]
• Circuit architectures and implementations of main arithmetic operations
• Aspects regarding VLSI design of arithmetic units

1.2 Motivation

• Arithmetic units are, among others, core of every data path and addressing unit
• Data path is core of :
    ◦ microprocessors (CPU)
    ◦ signal processors (DSP)
    ◦ data-processing application specific ICs (ASIC) and programmable ICs (e.g. FPGA)
• Standard arithmetic units available from libraries
• Design of arithmetic units necessary for :
    ◦ non-standard operations
    ◦ high-performance components
    ◦ library development

1.3 Conventions

Naming conventions

• Signal buses : $A$ (1-D), $A_i$ (2-D), $a_{i:k}$ (subbus, 1-D)
• Signals : $a$, $a_i$ (1-D), $a_{i,k}$ (2-D), $A_{i:k}$ (group signal)
• Circuit complexity measures : $A$ (area), $T$ (cycle time, delay), $AT$ (area-time product), $L$ (latency, # cycles)
• Arithmetic operators : $+$, $-$, $\cdot$, $/$, $\log$ ($= \log_2$)
• Logic operators : $+$ (or), $\cdot$ (and), $\oplus$ (xor), $\odot$ (xnor), $\overline{\phantom{a}}$ (not)

Circuit complexity measures

• Unit-gate model ($\approx$ gate-equivalents (GE) model) :
    ◦ Inverter, buffer : $A = 0$ ; $T = 0$ (i.e. ignored)
    ◦ Simple monotonic 2-input gates (AND, NAND, OR, NOR) : $A = 1$ ; $T = 1$
    ◦ Simple non-monotonic 2-input gates (XOR, XNOR) : $A = 2$ ; $T = 2$
    ◦ Complex gates : composed from simple gates
      $\Rightarrow$ Simple $m$-input gates : $A = m - 1$ ; $T = \lceil \log m \rceil$
    ◦ Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization)
• Only estimations given for complex circuits
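The unit-gate model of Section 1.3 can be tried out with a short Python sketch. The table name `UNIT_GATE`, the helper names, and the particular full-adder decomposition (2 XOR, 2 AND, 1 OR) are illustrative assumptions, not part of the notes; only the per-gate costs are taken from the list above.

```python
# Unit-gate costs (area A, delay T) of 2-input gates, as listed in Section 1.3.
UNIT_GATE = {
    "INV": (0, 0), "BUF": (0, 0),
    "AND": (1, 1), "NAND": (1, 1), "OR": (1, 1), "NOR": (1, 1),
    "XOR": (2, 2), "XNOR": (2, 2),
}

def m_input_gate(m):
    """Cost of a simple monotonic m-input gate: A = m - 1, T = ceil(log2 m)."""
    if m < 2:
        raise ValueError("m must be at least 2")
    area = m - 1
    delay = (m - 1).bit_length()  # equals ceil(log2(m)) for integer m >= 1
    return area, delay

def full_adder_cost():
    """Assumed decomposition: s = (a xor b) xor c_in, c_out = ab + (a xor b) c_in."""
    gates = ["XOR", "XOR", "AND", "AND", "OR"]
    area = sum(UNIT_GATE[g][0] for g in gates)
    # Critical path to the sum output: two XOR gates in series.
    delay_sum = UNIT_GATE["XOR"][1] * 2
    return area, delay_sum

print(m_input_gate(8))    # area and delay of an 8-input AND/OR
print(full_adder_cost())  # area and sum-path delay of the assumed full-adder
```

Under these assumptions the full-adder comes out at 7 area units and a sum delay of 4, in line with the full-adder figures quoted later in the notes.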

1.4 Recursive Function Evaluation

• Given : inputs $a_i$, outputs $z_i$, function $f$ (graph symbol : $\bullet$)

Non-recursive functions (n.)

• Output $z_i$ is a function of input $a_i$ (or $a_{j+m:j}$ ; $m$ const.)
    $z_i = f(a_i, x)$ ; $i = 0, \ldots, n-1$
  $\Rightarrow$ parallel structure : $A = O(n)$ ; $T = O(1)$

Recursive functions (r.)

• Output $z_i$ is a function of all inputs $a_k$ ; $k \le i$

a) with single output $z = z_{n-1}$ (r.s.) :
    $t_i = f(a_i, t_{i-1})$ ; $i = 0, \ldots, n-1$
    $t_{-1} = 0/1$ ; $z = t_{n-1}$
  1. $f$ is non-associative (r.s.n.)
     $\Rightarrow$ serial structure : $A = O(n)$ ; $T = O(n)$
  2. $f$ is associative (r.s.a.)
     $\Rightarrow$ serial or single-tree structure : $A = O(n)$ ; $T = O(\log n)$

b) with multiple outputs $z_i$ (r.m.) ($\Rightarrow$ prefix problem) :
    $z_i = f(a_i, z_{i-1})$ ; $i = 0, \ldots, n-1$ ; $z_{-1} = 0/1$
  1. $f$ is non-associative (r.m.n.)
     $\Rightarrow$ serial structure : $A = O(n)$ ; $T = O(n)$
  2. $f$ is associative (r.m.a.)
     $\Rightarrow$ serial or multi-tree structure : $A = O(n^2)$ ; $T = O(\log n)$
     $\Rightarrow$ or shared-tree structure : $A = O(n \log n)$ ; $T = O(\log n)$

[Figures funn.epsi, funrsn.epsi, funrsa.epsi, funrmn.epsi, funrma1.epsi, funrma2.epsi : structure graphs for $n = 4$ with inputs $a_3 \ldots a_0$ and outputs $z$ or $z_3 \ldots z_0$]
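The difference between the serial and the tree structures of Section 1.4 can be sketched in Python. The function names are hypothetical, and as a simplifying assumption the recursion is started with $z_0 = a_0$ instead of an explicit initial value $z_{-1}$.

```python
def serial_prefix(f, a):
    """r.m. serial structure: z_i = f(a_i, z_{i-1}), using n - 1 operations in sequence."""
    z = [a[0]]  # assumption: z_0 = a_0 (no separate initial value z_{-1})
    for x in a[1:]:
        z.append(f(x, z[-1]))
    return z

def tree_reduce(f, a):
    """r.s.a. single-tree structure: returns (z, depth); only valid for associative f."""
    level, depth = list(a), 0
    while len(level) > 1:
        # Combine neighbours pairwise; the higher index stays the first argument.
        nxt = [f(level[i + 1], level[i]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd element is passed on unchanged
        level, depth = nxt, depth + 1
    return level[0], depth

add = lambda hi, lo: hi + lo
print(serial_prefix(add, [1, 2, 3, 4, 5, 6, 7, 8]))  # all prefix sums, n - 1 steps
print(tree_reduce(add, [1, 2, 3, 4, 5, 6, 7, 8]))    # final sum and tree depth
```

Both functions apply $f$ exactly $n - 1$ times for a single output; the tree version only rearranges the evaluation order, which is why it needs associativity.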
2 Arithmetic Operations

2.1 Overview

[Figure arithops.epsi : dependency graph of the arithmetic operations for fixed-point and floating-point numbers (the floating-point graph is the same as the fixed-point one) ; arrows denote "based on operation" and "related operation"]

Operations in the figure :

 1 shift/extension            7 division
 2 comparison                 8 square root extraction
 3 increment/decrement        9 exponential function
 4 complement                10 logarithm function
 5 addition/subtraction      11 trigonometric functions
 6 multiplication            12 hyperbolic functions

2.2 Implementation Techniques

Direct implementation of dedicated units :
• always : 1 – 5
• in most cases : 6
• sometimes : 7, 8

Sequential implementation using simpler units and several clock cycles ($\Rightarrow$ decomposition) :
• sometimes : 6
• in most cases : 7, 8, 9

Table look-up techniques using ROMs :
• universal : simple application to all operations
• efficient only for single-operand operations of high complexity (8 – 12) and small word length (note: ROM size $= 2^n \times n$)

Approximation techniques using simpler units : 7 – 12
• Taylor series expansion
• polynomial and rational approximations
• convergence of recursive equation systems
• CORDIC (COordinate Rotation DIgital Computer)

3 Number Representations

3.1 Binary Number Systems (BNS)

• Radix-2, binary number system (BNS) : irredundant, weighted, positional, monotonic [1, 2]
• $n$-bit number is ordered sequence of bits (binary digits) :
    $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_2$ ; $a_i \in \{0, 1\}$
• Simple and efficient implementation in digital circuits
• MSB/LSB (most-/least-significant bit) : $a_{n-1}$ / $a_0$
• Represents an integer or fixed-point number, exact
• Fixed-point numbers :
    $(\underbrace{a_{m-1}, \ldots, a_0}_{m\text{-bit integer}} \; . \; \underbrace{a_{-1}, \ldots, a_{m-n}}_{(n-m)\text{-bit fraction}})$

Unsigned : positive or natural numbers
    Value : $A = a_{n-1} 2^{n-1} + \cdots + a_1 2 + a_0 = \sum_{i=0}^{n-1} a_i 2^i$
    Range : $[0, 2^n - 1]$

Two's (2's) complement : standard representation of signed or integer numbers
    Value : $A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$
    Range : $[-2^{n-1}, 2^{n-1} - 1]$
    Complement : $-A = 2^n - A = \bar{A} + 1$, where $\bar{A} = (\bar{a}_{n-1}, \bar{a}_{n-2}, \ldots, \bar{a}_0)$
    Sign : $a_{n-1}$
    Properties : asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

One's (1's) complement : similar to 2's complement
    Value : $A = -a_{n-1} (2^{n-1} - 1) + \sum_{i=0}^{n-2} a_i 2^i$
    Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
    Complement : $-A = 2^n - A - 1 = \bar{A}$
    Sign : $a_{n-1}$
    Properties : double representation of zero, symmetric range, modulo $(2^n - 1)$ number system

Sign-magnitude : alternative representation of signed numbers
    Value : $A = (-1)^{a_{n-1}} \cdot \sum_{i=0}^{n-2} a_i 2^i$
    Range : $[-(2^{n-1} - 1), 2^{n-1} - 1]$
    Complement : $-A = (\bar{a}_{n-1}, a_{n-2}, \ldots, a_0)$
    Sign : $a_{n-1}$
    Properties : double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 ($\Rightarrow$ low power)

Graphical representation

[Figure numrep.epsi : the binary number representation (000...0, 011...1, 100...0, 111...1) mapped onto the values of unsigned, 2's complement, 1's complement, and sign-magnitude numbers ; value axis marked $-2^{n-1}$, $0$, $2^{n-1}$, $2^n$]

Conventions

• 2's complement used for signed numbers in these notes
• Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned

3.2 Gray Numbers

• Gray numbers (code) : binary, irredundant, non-weighted, non-monotonic
+ Property : unit-distance coding (i.e. exactly one bit toggles between adjacent numbers)
• Applications : counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
− Non-monotonic numbers : difficult arithmetic operations, e.g. addition, comparison :

    $g_1 g_0$     $g'_1 g'_0$         $g_0$     $g'_0$
    0 0   <   0 1     and   0   <   1
    1 1   <   1 0     but   1   >   0

• binary $\rightarrow$ Gray : $g_i = b_{i+1} \oplus b_i$ ; $b_n = 0$ ; $i = 0, \ldots, n-1$ (n.)
• Gray $\rightarrow$ binary : $b_i = b_{i+1} \oplus g_i$ ; $b_n = 0$ ; $i = n-1, \ldots, 0$ (r.m.a.)

    unsigned   binary            Gray
               b3 b2 b1 b0       g3 g2 g1 g0
        0      0  0  0  0        0  0  0  0
        1      0  0  0  1        0  0  0  1
        2      0  0  1  0        0  0  1  1
        3      0  0  1  1        0  0  1  0
        4      0  1  0  0        0  1  1  0
        5      0  1  0  1        0  1  1  1
        6      0  1  1  0        0  1  0  1
        7      0  1  1  1        0  1  0  0
        8      1  0  0  0        1  1  0  0
        9      1  0  0  1        1  1  0  1
       10      1  0  1  0        1  1  1  1
       11      1  0  1  1        1  1  1  0
       12      1  1  0  0        1  0  1  0
       13      1  1  0  1        1  0  1  1
       14      1  1  1  0        1  0  0  1
       15      1  1  1  1        1  0  0  0
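The two Gray conversion rules of Section 3.2 translate directly into Python. This is a minimal sketch on unsigned integers; the function names and the explicit word length `n` are assumptions made for the example.

```python
def binary_to_gray(b, n):
    """g_i = b_{i+1} xor b_i with b_n = 0; every bit is computed independently (n.)."""
    return (b ^ (b >> 1)) & ((1 << n) - 1)

def gray_to_binary(g, n):
    """b_i = b_{i+1} xor g_i with b_n = 0; evaluated from i = n-1 down to 0 (r.m.a.)."""
    b = 0
    bit = 0  # holds b_{i+1}, starting with b_n = 0
    for i in range(n - 1, -1, -1):
        bit ^= (g >> i) & 1
        b |= bit << i
    return b

for value in (5, 6, 7, 8):
    g = binary_to_gray(value, 4)
    print(value, format(value, "04b"), format(g, "04b"), gray_to_binary(g, 4))
```

The binary-to-Gray direction is a non-recursive function and needs one XOR level, whereas the reverse direction is a prefix XOR over the bits above, which is why the notes classify it as r.m.a.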

3.3 Redundant Number Systems

• Non-binary, redundant, weighted number systems [1, 2]
• Digit set larger than radix (typically radix 2) $\Rightarrow$ multiple representations of same number $\Rightarrow$ redundancy
+ No carry-propagation in adders $\Rightarrow$ more efficient impl. of adder-based units (e.g. multipliers and dividers)
− Redundancy $\Rightarrow$ no direct implementation of relational operators $\Rightarrow$ conversion to irredundant numbers
− Several bits used to represent one digit $\Rightarrow$ higher storage requirements
− Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

Delayed-carry or half-adder number representation :
• $r_i \in \{0, 1, 2\}$ , $c_i, s_i, a_i, b_i \in \{0, 1\}$ ,
  $r_i = (c_{i+1}, s_i) = 2c_{i+1} + s_i = a_i + b_i$ , $c_{i+1} s_i = 0$
• $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + B$
• 1 digit holds sum of 2 bits (no carry-out digit)
• example : $(00, 10) = 00 + 10 = 01 + 01 = (10, 00)$
• irredundant representation of $-1$ [8], since $c_{i+1} s_i = 0$ & $C + S = -1 \rightarrow S = -1, C = 0$

Carry-save number representation :
• $r_i \in \{0, 1, 2, 3\}$ , $c_i, s_i, a_i, b_i, d_i \in \{0, 1\}$ ,
  $r_i = (c_{i+1}, s_i) = 2c_{i+1} + s_i = a_i + b_i + d_i = a_i + r'_i$
• $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + R'$
• 1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)
• standard redundant number system for fast addition

Signed-digit (SD) or redundant digit (RD) number representation :
• $r_i, s_i, t_i \in \{-1, 0, 1\} \equiv \{\bar{1}, 0, 1\}$ , $R = \sum_{i=0}^{n-1} r_i 2^i$
• no carry-propagation in $S = R + T$ :
    ◦ $r_i + t_i = (c_{i+1}, u_i) = 2c_{i+1} + u_i$ , $c_{i+1}, u_i \in \{\bar{1}, 0, 1\}$
    ◦ $(c_{i+1}, u_i)$ is redundant (e.g. $0 + 1 = 01 = 1\bar{1}$)
    ◦ $\forall i \; \exists (c_i, u_i) \mid c_i + u_i = s_i \in \{\bar{1}, 0, 1\}$
• 1 digit holds sum of 2 digits (no carry-out digit)
• minimal SD representation : minimal number of non-zero digits, $\cdots 011\{1\}10 \cdots \rightarrow \cdots 100\{0\}\bar{1}0 \cdots$
• applications : sequential multiplication (fewer cycles), filters with constant coefficients (less hardware)
• example : $7 = (0111 \mid 1\bar{1}11 \mid 10\bar{1}1 \mid 100\bar{1} \mid 1\bar{1}\bar{1}11 \mid \cdots)$ , of which $100\bar{1}$ is minimal
• canonical SD repres.: minimal SD + not two non-zero digits in sequence, $\cdots 01\{1\}10 \cdots \rightarrow \cdots 10\{0\}\bar{1}0 \cdots$
• SD $\rightarrow$ binary : carry-propagation necessary ($\Rightarrow$ adder)
• other applications : high-speed multipliers [9]
• similar to carry-save, simple use for signed numbers
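The canonical signed-digit form of Section 3.3 can be produced with the well-known non-adjacent-form recoding. The sketch below is one common way to do this in software and is not taken from the notes; the function names and the LSB-first digit list are assumptions.

```python
def to_canonical_sd(x):
    """Recode an integer x >= 0 into signed digits {-1, 0, 1}, LSB first,
    such that no two adjacent digits are non-zero (canonical SD form)."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)  # +1 if x mod 4 == 1, -1 if x mod 4 == 3
            x -= d           # makes x divisible by 4, so the next digit is 0
        else:
            d = 0
        digits.append(d)
        x >>= 1
    return digits

def sd_value(digits):
    """Value of an LSB-first signed-digit list: sum of r_i * 2^i."""
    return sum(d * 2**i for i, d in enumerate(digits))

print(to_canonical_sd(7))  # LSB first; read right to left this is 1 0 0 -1
```

For 7 this gives the two non-zero digits of the minimal form in the example above, instead of the three ones of the binary form 0111.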
3.4 Residue Number Systems (RNS)

• Non-binary, irredundant, non-weighted number system [1]
+ Carry-free and fast additions and multiplications
− Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required
• Codes for error detection and correction [1]
• Possible applications (but hardly used) :
    ◦ digital filters : fast additions and multiplications
    ◦ error detection and correction for arithmetic operations in conventional and residue number systems
• Base is $n$-tuple of integers $(m_{n-1}, m_{n-2}, \ldots, m_0)$, residues (or moduli) $m_i$ pairwise relatively prime
• $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_{m_{n-1}, m_{n-2}, \ldots, m_0}$ , $a_i \in \{0, 1, \ldots, m_i - 1\}$
• Range : $M = \prod_{i=0}^{n-1} m_i$ , anywhere in $\mathbb{Z}$
• $a_i = A \bmod m_i = |A|_{m_i}$ , $A = m_i \cdot q_i + a_i$
• $|A|_M = \left| \sum_{i=0}^{n-1} C_i a_i \right|_M$ , $C_i = (\ldots, 0, \underbrace{1}_{i}, 0, \ldots)$
• Arithmetic operations : (each digit computed separately)
    ◦ $z_i = |Z|_{m_i} = |f(A)|_{m_i} = \left| f(|A|_{m_i}) \right|_{m_i} = |f(a_i)|_{m_i}$
    ◦ $|A + B|_{m_i} = \left| |A|_{m_i} + |B|_{m_i} \right|_{m_i} = |a_i + b_i|_{m_i}$
    ◦ $|A \cdot B|_{m_i} = \left| |A|_{m_i} \cdot |B|_{m_i} \right|_{m_i} = |a_i \cdot b_i|_{m_i}$
    ◦ $|-a_i|_{m_i} = |m_i - a_i|_{m_i}$
    ◦ $\left| a_i^{-1} \right|_{m_i} = \left| a_i^{m_i - 2} \right|_{m_i}$ (Fermat's theorem)
• Best moduli $m_i$ are $2^k$ and $(2^k - 1)$ :
    ◦ high storage efficiency with $k$ bits
    ◦ simple modular addition : $2^k$ : $k$-bit adder without $c_{out}$, $2^k - 1$ : $k$-bit adder with end-around carry ($c_{in} = c_{out}$)
• Example : $(m_1, m_0) = (3, 2)$ , $M = 6$

    $A$     ...  -4  -3  -2  -1   0   1   2   3   4   5   6   7   8  ...
    $a_1$   ...   2   0   1   2   0   1   2   0   1   2   0   1   2  ...
    $a_0$   ...   0   1   0   1   0   1   0   1   0   1   0   1   0  ...
    (a possible range covers $M = 6$ consecutive values of $A$)

    $|5|_6 = A = (a_1, a_0) = (|5|_3, |5|_2) = (2, 1)$
    $|4 + 5|_6 = (1, 0) + (2, 1) = (|1 + 2|_3, |0 + 1|_2) = (0, 1) = |3|_6$
    $|4 \cdot 5|_6 = (1, 0) \cdot (2, 1) = (|1 \cdot 2|_3, |0 \cdot 1|_2) = (2, 0) = |2|_6$
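The RNS example with moduli (3, 2) can be replayed in Python. This is a sketch under stated assumptions: the helper names are hypothetical, and the weights $C_i$ are computed with the standard Chinese-remainder construction, which the notes only describe by their residues $(\ldots, 0, 1, 0, \ldots)$. `math.prod` and the three-argument `pow` with exponent -1 need Python 3.8 or later.

```python
from math import prod

MODULI = (3, 2)  # (m1, m0) as in the example, M = 6

def to_rns(a, moduli=MODULI):
    """a_i = |A|_{m_i}; Python's % already yields a non-negative residue."""
    return tuple(a % m for m in moduli)

def rns_add(x, y, moduli=MODULI):
    """Digit-wise addition, no carries between digits."""
    return tuple((xi + yi) % m for xi, yi, m in zip(x, y, moduli))

def rns_mul(x, y, moduli=MODULI):
    """Digit-wise multiplication, no carries between digits."""
    return tuple((xi * yi) % m for xi, yi, m in zip(x, y, moduli))

def from_rns(residues, moduli=MODULI):
    """|A|_M = |sum C_i a_i|_M, where C_i has residue 1 mod m_i and 0 otherwise."""
    M = prod(moduli)
    total = 0
    for a_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        C_i = M_i * pow(M_i, -1, m_i)  # modular inverse of M_i modulo m_i
        total += C_i * a_i
    return total % M

print(to_rns(5))                                   # residues of 5
print(rns_add(to_rns(4), to_rns(5)))               # residues of |4 + 5|_6
print(from_rns(rns_mul(to_rns(4), to_rns(5))))     # |4 * 5|_6 as an integer
```

Addition and multiplication touch each residue independently, while the conversion back to a weighted number is the only step that mixes the digits.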

3.5 Floating-Point Numbers

• Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]
• Double-number form $\Rightarrow$ discontinuous precision
• Format : | $S$ | biased exponent $E$ | unsigned norm. mantissa $M$ |
• $F = (-1)^S \cdot M \cdot \beta^E = (-1)^S \cdot 1.M \cdot 2^{E - bias}$ (the symbol of the base, written $\beta$ here, is lost in the source)
• Basic arithmetic operations :
    ◦ $A \cdot B = (-1)^{S_A \oplus S_B} \cdot M_A \cdot M_B \cdot \beta^{E_A + E_B}$
    ◦ $A + B = \left( (-1)^{S_A} \cdot M_A + (-1)^{S_B} \cdot M_B \cdot \beta^{-(E_A - E_B)} \right) \cdot \beta^{E_A}$
    ◦ based on fixed-point add, multiply, and shift operations
    ◦ postnormalization required ($1/\beta \le M < 1$)
• Applications :
    ◦ processors : real floating-point formats (e.g. IEEE standard), large range due to universal use
    ◦ ASICs : usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers
• IEEE floating-point format :

    precision   $n$    $n_M$   $n_E$   bias   range                    precision
    single      32     23      8       127    $3.8 \cdot 10^{38}$      $10^{-7}$
    double      64     52      11      1023   $9 \cdot 10^{307}$       $10^{-15}$

3.6 Logarithmic Number System

• Alternative representation to floating-point (i.e. mantissa + integer exponent $\rightarrow$ only fixed-point exponent) [1]
• Single-number form $\Rightarrow$ continuous precision $\Rightarrow$ higher accuracy, more reliable
• Format : | $S$ | biased fixed-point exponent $E$ |
• $L = (-1)^S \cdot \beta^E = (-1)^S \cdot 2^{E - bias}$ (signed-logarithmic)
• Basic arithmetic operations :
    ◦ $(A < B) = (E_A < E_B)$ (additionally consider sign)
    ◦ $A + B$ : by approximation or addition in conventional number system and double conversion
    ◦ $A \cdot B = (-1)^{S_A \oplus S_B} \cdot \beta^{E_A + E_B}$
    ◦ $A^y = (-1)^{S_A} \cdot \beta^{y E_A}$ ; $\sqrt[y]{A} = (-1)^{S_A} \cdot \beta^{E_A / y}$
+ Simpler multiplication/exponent., more complex addition
− Expensive conversion : (anti)logarithms (table look-up)
• Applications : real-time digital filters

3.7 Antitetrational Number System

• Tetration ($\text{t.}\, x = \underbrace{2^{2^{\cdot^{\cdot^{2}}}}}_{x}$) and antitetration ($\text{a.t.}\, x$) [10]
• Larger range, smaller precision than logarithmic repres., otherwise analogous (i.e. $2^x \rightarrow \text{t.}\, x$ ; $\log x \rightarrow \text{a.t.}\, x$)
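For the IEEE single format tabulated in Section 3.5 (1 sign bit, 8 exponent bits, 23 mantissa bits, bias 127), the three fields can be extracted with Python's standard `struct` module. The function names are hypothetical, and the reconstruction below deliberately handles normalized numbers only.

```python
import struct

def decode_single(x):
    """Split a number, stored as IEEE single, into (S, E, M) bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31            # sign bit S
    e = (bits >> 23) & 0xFF   # biased exponent E (n_E = 8)
    m = bits & 0x7FFFFF       # mantissa fraction M (n_M = 23)
    return s, e, m

def value_of(s, e, m, bias=127):
    """F = (-1)^S * 1.M * 2^(E - bias); valid for normalized numbers (0 < E < 255)."""
    return (-1) ** s * (1 + m / 2**23) * 2.0 ** (e - bias)

fields = decode_single(-6.5)
print(fields, value_of(*fields))
```

Zero, denormalized numbers, infinities, and NaNs use the reserved exponent values and are outside the scope of this sketch.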
3.8 Composite Arithmetic

• Proposal for a new standard of number representations [10]
• Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers
• Secondary forms used for numbers not representable by primary ones ($\Rightarrow$ no over-/underflow handling necessary)
• Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy
• Number representations :

    form               tag   value
    integer :          00    2's complement integer
    rational :         01    slash, denominator \ numerator
    logarithmic :      10    log integer, log fraction
    antitetrational :  11    a.t. integer, a.t. fraction

• Rational numbers : slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)
• Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
• Implementation : mixed hardware/software solutions
• Hardware proposal : long accumulator (4096 bits) holds any floating-point number in fixed-point format $\Rightarrow$ higher accuracy $\Rightarrow$ large hardware/software overhead

3.9 Round-Off Schemes

• Intermediate results with $d$ additional lower bits ($\Rightarrow$ higher accuracy) : $A = (a_{n-1}, \ldots, a_0 \, . \, a_{-1}, \ldots, a_{-d})$
• Rounding : keeping error $\epsilon$ small during final word length reduction : $R = (r_{n-1}, \ldots, r_0) = A - \epsilon$
• Trade-off : numerical accuracy vs. implementation cost

Truncation : $R_{TRUNC} = (a_{n-1}, \ldots, a_0)$
    ◦ bias $= -\frac{1}{2} + \frac{1}{2^{d+1}}$ (= average error $\epsilon$)

Round-to-nearest (i.e. normal rounding) : $R_{ROUND} = (a'_{n-1}, \ldots, a'_0)$ ; $A' = A + \frac{1}{2} = A + 0.1_2$
    ◦ bias $= \frac{1}{2^{d+1}}$ (nearly symmetric)
    ◦ $+\, 0.1_2$ can often be included in previous operation

Round-to-nearest-even/-odd :
    $R_{ROUND\text{-}EVEN} = \begin{cases} R_{ROUND} & \text{if } (a'_{-1}, \ldots, a'_{-d}) \ne 0 \cdots 0 \\ (a'_{n-1}, \ldots, a'_1, 0) & \text{otherwise} \end{cases}$
    ◦ bias $= 0$ (symmetric)
    ◦ mandatory in IEEE floating-point standard

• 3 guard bits for rounding after floating-point operations : guard bit G (postnormalization), round bit R (round-to-nearest), sticky bit S (round-to-nearest-even)
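The bias figures of Section 3.9 can be checked exhaustively for a small number of extra bits. In this sketch a fixed-point value $A$ with $d$ fraction bits is held as the integer $A \cdot 2^d$; the function names and the averaging over two integer bits are assumptions of the example, and the bias is measured as the mean of $R - A$, which is the sign convention that reproduces the values quoted above.

```python
from fractions import Fraction

def truncate(a, d):
    """R_TRUNC: drop the d fraction bits."""
    return a >> d

def round_nearest(a, d):
    """R_ROUND: add 0.1_2 (one half), then truncate."""
    return (a + (1 << (d - 1))) >> d

def round_nearest_even(a, d):
    """R_ROUND-EVEN: as R_ROUND, but clear the LSB when A' has a zero fraction (tie)."""
    a_prime = a + (1 << (d - 1))
    r = a_prime >> d
    if a_prime & ((1 << d) - 1) == 0:
        r &= ~1
    return r

def bias(scheme, d, int_bits=2):
    """Mean of R - A over all values with int_bits integer bits and d fraction bits."""
    count = 1 << (d + int_bits)
    errors = [Fraction(scheme(a, d)) - Fraction(a, 1 << d) for a in range(count)]
    return sum(errors) / count

d = 3
print(bias(truncate, d), bias(round_nearest, d), bias(round_nearest_even, d))
```

With $d = 3$ the three results correspond to $-\frac{1}{2} + \frac{1}{16}$, $\frac{1}{16}$ and $0$, matching the formulas for truncation, round-to-nearest and round-to-nearest-even.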

4 Addition

4.1 Overview

[Figure adders.epsi : hierarchy of adder components. 1-bit adders (HA, FA, (m,k), (m,2)) ; carry-propagate adders (CPA : RCA, CSKA, CSLA, CIA, CLA, PPA, COSA) ; carry-save adders (3-operand CSA) ; multi-operand adders (adder array, adder tree, array adder, tree adder). Arrows denote "based on component" and "related component".]

Legend :
    HA : half-adder                  CPA : carry-propagate adder
    FA : full-adder                  RCA : ripple-carry adder
    (m,k) : (m,k)-counter            CSKA : carry-skip adder
    (m,2) : (m,2)-compressor         CSLA : carry-select adder
    CIA : carry-increment adder      CLA : carry-lookahead adder
    PPA : parallel-prefix adder      COSA : conditional-sum adder
    CSA : carry-save adder

4.2 1-Bit Adders, (m, k)-Counters

• Add up $m$ bits of same magnitude (i.e. 1-bit numbers)
• Output sum as $k$-bit number ($k = \lfloor \log m \rfloor + 1$)
• or : count 1's at inputs $\Rightarrow$ (m, k)-counter [3] (combinational counters)

Half-adder (HA), (2, 2)-counter

    $(c_{out}, s) = 2c_{out} + s = a + b$        $A = 3$ ; $T = 2\ (1)$
    $s = a \oplus b$ (sum)
    $c_{out} = ab$ (carry-out)

[Figures hasym.epsi, haschema1.epsi, haschema2.epsi : half-adder symbol and schematics with inputs $a$, $b$ and outputs $c_{out}$, $s$]
Full-adder (FA), (3, 2)-counter

    $(c_{out}, s) = 2c_{out} + s = a + b + c_{in}$        $A = 7$ ; $T = 4\ (2)$
    $g = ab$ (generate)          $c^0 = ab$
    $p = a \oplus b$ (propagate)   $c^1 = a + b$
    $s = a \oplus b \oplus c_{in} = p \oplus c_{in}$
    $c_{out} = ab + a c_{in} + b c_{in} = ab + (a \oplus b) c_{in}$
             $= g + p c_{in} = \bar{p} g + p c_{in} = \bar{p} a + p c_{in}$
             $= \bar{c}_{in} c^0 + c_{in} c^1$

[Figures fasymbol.epsi, faschematic1.epsi to faschematic5.epsi : full-adder symbol and schematics with inputs $a$, $b$, $c_{in}$ and outputs $c_{out}$, $s$]

(m, k)-counters

    $(s_{k-1}, \ldots, s_0) = \sum_{j=0}^{k-1} s_j 2^j = \sum_{i=0}^{m-1} a_i$

• Usually built from full-adders
• Associativity of addition allows conversion from linear to tree structure $\Rightarrow$ faster at same number of FAs

    $A = 7 \sum_{k=1}^{\log m} \lfloor m 2^{-k} \rfloor \approx 7(m - \log m)$ ;
    $T_{LIN} = 4m + 2\lfloor \log m \rfloor$ ; $T_{TREE} = 4\lceil \log_3 m \rceil + 2\lfloor \log m \rfloor$

• Example : (7, 3)-counter
    ◦ linear structure : $A = 28$ ; $T = 14$
    ◦ tree structure : $A = 28$ ; $T = 10$

[Figures cntsymbol.epsi, count73ser.epsi, count73par.epsi : (m,k)-counter symbol and the (7, 3)-counter built from full-adders with inputs $a_0 \ldots a_6$ and outputs $s_2, s_1, s_0$]
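The full-adder equations and a (7, 3)-counter built from four full-adders can be checked by exhaustive simulation. The wiring below is one possible tree arrangement chosen for this sketch (an assumption, the figures are not recoverable), and the function names are hypothetical.

```python
from itertools import product

def half_adder(a, b):
    """(c_out, s) with s = a xor b and c_out = ab."""
    return a & b, a ^ b

def full_adder(a, b, c_in):
    """(c_out, s) with g = ab, p = a xor b, s = p xor c_in, c_out = g + p c_in."""
    g, p = a & b, a ^ b
    return g | (p & c_in), p ^ c_in

def counter_7_3(bits):
    """(7, 3)-counter from 4 full-adders (assumed tree wiring); returns (s2, s1, s0)."""
    a0, a1, a2, a3, a4, a5, a6 = bits
    c1, t1 = full_adder(a0, a1, a2)   # first level, weight-1 inputs
    c2, t2 = full_adder(a3, a4, a5)
    c3, s0 = full_adder(t1, t2, a6)   # second level produces s0
    s2, s1 = full_adder(c1, c2, c3)   # carries have weight 2
    return s2, s1, s0

for bits in product((0, 1), repeat=7):
    s2, s1, s0 = counter_7_3(bits)
    assert 4 * s2 + 2 * s1 + s0 == sum(bits)
print("all 128 input patterns counted correctly")
```

Four full-adders at 7 area units each give the $A = 28$ quoted for the (7, 3)-counter example.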

4.3 Carry-Propagate Adders (CPA)

• Add two $n$-bit operands $A$ and $B$ and an optional carry-in $c_{in}$ by performing carry-propagation [1, 2, 11]
• Sum $(c_{out}, S)$ is irredundant $(n + 1)$-bit number

    $(c_{out}, S) = c_{out} 2^n + S = A + B + c_{in}$
    $2c_{i+1} + s_i = a_i + b_i + c_i$ ; $i = 0, 1, \ldots, n-1$
    $c_0 = c_{in}$ ; $c_{out} = c_n$ (r.m.a.)

[Figure cpasymbol.epsi : CPA symbol with inputs $A$, $B$, $c_{in}$ and outputs $c_{out}$, $S$]

Ripple-carry adder (RCA)

• Serial arrangement of $n$ full-adders
• Simplest, smallest, and slowest CPA structure

    $A = 7n$ ; $T = 2n$ ; $AT = 14n^2$

[Figure rca.epsi : chain of full-adders with inputs $a_{n-1} b_{n-1} \ldots a_0 b_0$, carries $c_{n-1} \ldots c_1$ between stages, $c_{in}$, $c_{out}$, and outputs $s_{n-1} \ldots s_0$]

Carry-propagation speed-up techniques

a) Concatenation of partial CPAs with fast $c_{in} \rightarrow c_{out}$

[Figure speedup1.epsi : partial CPAs for bit groups $n-1{:}j$, $i-1{:}k$, $k-1{:}0$ connected through the carries $c_j$, $c_i$, $c_k$]

b) Fast carry look-ahead logic for entire range of bits

[Figure speedup2.epsi : preprocessing, carry propagation, and postprocessing stages over all bits $a_{n-1} b_{n-1} \ldots a_0 b_0$]
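The ripple-carry adder described above is a serial evaluation of $2c_{i+1} + s_i = a_i + b_i + c_i$. A bit-level Python sketch follows; the function name and the default word length are assumptions of the example.

```python
def ripple_carry_add(a, b, c_in=0, n=8):
    """n-bit ripple-carry addition; returns (c_out, S) with c_out * 2^n + S = A + B + c_in."""
    s = 0
    c = c_in                      # c_0 = c_in
    for i in range(n):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        p, g = a_i ^ b_i, a_i & b_i
        s |= (p ^ c) << i         # s_i = p_i xor c_i
        c = g | (p & c)           # c_{i+1} = g_i + p_i c_i
    return c, s                   # c_out = c_n

print(ripple_carry_add(200, 100, 1))  # carry-out and 8-bit sum of 200 + 100 + 1
```

Each loop iteration depends on the carry of the previous one, which is the software counterpart of the $T = 2n$ delay of the hardware chain.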
Carry-skip adder (CSKA)

• Type a) : partial CPA with fast $c_k \rightarrow c_i$
    $c_i = \bar{P}_{i-1:k}\, c'_i + P_{i-1:k}\, c_k$ (bit group $(a_{i-1}, \ldots, a_k)$)
    $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
• 1) $P_{i-1:k} = 0$ : $c_k \nrightarrow c'_i$ and $c'_i$ selected ($c'_i \rightarrow c_i$)
  2) $P_{i-1:k} = 1$ : $c_k \rightarrow c'_i$ but $c'_i$ skipped ($c'_i \nrightarrow c_i$)
  $\Rightarrow$ path $c_k \rightarrow c'_i \rightarrow c_i$ never sensitized $\Rightarrow$ fast $c_k \rightarrow c_i$
  $\Rightarrow$ false path $\Rightarrow$ inherent logic redundancy $\Rightarrow$ problems in circuit optimization, timing analysis, and testing
• Variable group sizes (faster) : larger groups in the middle (minimize delays $a_0 \rightarrow c_k \rightarrow s_{i-1}$ and $a_k \rightarrow c_i \rightarrow s_{n-1}$)
• Partial CPA typ. is RCA or CSKA ($\Rightarrow$ multilevel CSKA)
• Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)

    $A \approx 8n$ ; $T \approx 4n^{1/2}$ ; $AT \approx 32n^{3/2}$

[Figure cska.epsi : partial CPAs per bit group, each followed by a multiplexer controlled by the group propagate $P_{i-1:k}$]

Carry-select adder (CSLA)

• Type a) : partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
    $s_{i-1:k} = \bar{c}_k\, s^0_{i-1:k} + c_k\, s^1_{i-1:k}$
    $c_i = \bar{c}_k\, c^0_i + c_k\, c^1_i$
• Two CPAs compute two possible results ($c_{in} = 0/1$), group carry-in $c_k$ selects correct one afterwards
• Variable group sizes (faster) : larger groups at end (MSB) (balance delays $a_0 \rightarrow c_k$ and $a_k \rightarrow c^0_i$)
• Part. CPA typ. is RCA, CSLA ($\Rightarrow$ multil. CSLA), or CLA
• High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)

    $A \approx 14n$ ; $T \approx 2.8n^{1/2}$ ; $AT \approx 39n^{3/2}$

[Figure csla.epsi : two partial CPAs per bit group with carry-in 0 and 1, multiplexers for $s_{i-1:k}$ and $c_i$ controlled by $c_k$]
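A functional model of the carry-select principle: each bit group is added twice, once per possible carry-in, and the incoming group carry $c_k$ selects the result. The function name and the group sizes are assumptions of this sketch, and Python's integer `+` stands in for the partial CPAs.

```python
def carry_select_add(a, b, c_in=0, groups=(2, 2, 4)):
    """Carry-select addition; groups lists the bit-group sizes from LSB to MSB.
    Returns (c_out, S) for a word length of sum(groups) bits."""
    s, c, k = 0, c_in, 0
    for size in groups:
        mask = (1 << size) - 1
        a_g, b_g = (a >> k) & mask, (b >> k) & mask
        r0 = a_g + b_g        # partial CPA with carry-in 0: (c^0_i, s^0)
        r1 = a_g + b_g + 1    # partial CPA with carry-in 1: (c^1_i, s^1)
        r = r1 if c else r0   # multiplexers controlled by c_k
        s |= (r & mask) << k
        c = r >> size         # selected group carry-out c_i
        k += size
    return c, s

print(carry_select_add(200, 100, 1))  # carry-out and 8-bit sum of 200 + 100 + 1
```

In hardware both partial sums are available before $c_k$ arrives, so only the multiplexer delay per group lies on the carry path.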

Carry-increment adder (CIA)

• Type a) : partial CPA with fast $c_k \rightarrow c_i$ and $c_k \rightarrow s_{i-1:k}$
    $s_{i-1:k} = s^0_{i-1:k} + c_k$ ; $c_i = c^0_i + P_{i-1:k}\, c_k$
    $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
• Result is incremented after addition, if $c_k = 1$ [12, 11]
• Variable group sizes (faster) : larger groups at end (MSB) (balance delays $a_0 \rightarrow c_k$ and $a_k \rightarrow c^0_i$)
• Part. CPA typ. is RCA, CIA ($\Rightarrow$ multilevel CIA) or CLA
• High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)
• Logic of CPA and incrementer can be merged [11]

    $A \approx 10n$ ; $T \approx 2.8n^{1/2}$ ; $AT \approx 28n^{3/2}$

[Figure cia.epsi : partial CPAs with carry-in 0 per bit group, followed by an incrementer (+1) controlled by $c_k$ and the group propagate $P_{i-1:k}$]

• Example : gate-level schematic of carry-incr. adder (CIA)
    ◦ only 2 different logic cells (bit-slices) : IHA and IFA

    $T$                 4   6   10  12  14  16  18  20  22  24  26  28  ...  38
    max $n_{group}$             2   3   4   5   6   7   8   9   10  11  ...  16
    $n$                 1   2   4   7   11  16  22  29  37  46  56  67  ...  137

[Figure ciagate.epsi : bit-slice arrangement. Bit 0 : IHA ; bit 1 : IHA ; bits 3, 2 : IFA + IHA ; bits 6...4 : 2 IFA + IHA ; bits $i-1 \ldots k$ : $(i-k-1)$ IFA + IHA]
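The carry-increment equations $s_{i-1:k} = s^0_{i-1:k} + c_k$ and $c_i = c^0_i + P_{i-1:k} c_k$ can be modelled functionally as follows. The function name is hypothetical; the group sizes (1, 1, 2, 3, 4) follow the bit grouping of the gate-level example extended by one group, and Python's integer `+` stands in for the partial CPA.

```python
def carry_increment_add(a, b, c_in=0, groups=(1, 1, 2, 3, 4)):
    """Carry-increment addition; groups lists the bit-group sizes from LSB to MSB.
    Returns (c_out, S) for a word length of sum(groups) bits."""
    s, c, k = 0, c_in, 0
    for size in groups:
        mask = (1 << size) - 1
        a_g, b_g = (a >> k) & mask, (b >> k) & mask
        r0 = a_g + b_g                     # partial CPA with carry-in 0
        s0, c0 = r0 & mask, r0 >> size     # s^0_{i-1:k} and c^0_i
        P = int((a_g ^ b_g) == mask)       # group propagate: all p bits are 1
        s |= ((s0 + c) & mask) << k        # incrementer: s = s^0 + c_k
        c = c0 | (P & c)                   # c_i = c^0_i + P c_k
        k += size
    return c, s

print(carry_increment_add(1500, 700, 1))  # 11-bit result of 1500 + 700 + 1
```

Compared with carry-select, only one partial CPA per group is needed; the second one is replaced by the cheaper incrementer.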
Conditional-sum adder (COSA)

• Type a) : optimized multilevel CSLA with $(\log n)$ levels (i.e. double CPAs are merged at higher levels)
• Correct sum bits ($s^0_{i-1:k}$ or $s^1_{i-1:k}$) are (conditionally) selected through $(\log n)$ levels of multiplexers
• Bit groups of size $2^l$ at level $l$
• Higher parallelism, more balanced signal paths
• Highest speed-up at highest hardware overhead (2 RCA + more than $(\log n)$ MUX/bit)

    $A \approx 3n \log n$ ; $T \approx 2 \log n$ ; $AT \approx 6n \log^2 n$

[Figure cosa.epsi : 4-bit conditional-sum adder with inputs $a_3 b_3 \ldots a_0 b_0$ ; level 0 : pairs of full-adders with carry-in 0 and 1 ; levels 1 and 2 : multiplexers ; outputs $c_{out}$, $s_3 \ldots s_0$]

Carry-lookahead adder (CLA), traditional

• Type b) : carries looked ahead before sum bits computed
• Typically 4-bit blocks used (e.g. standard IC SN74181)

    $c_0 = c'_0$
    $c_1 = g_0 + p_0 c'_0$
    $c_2 = g_1 + p_1 g_0 + p_1 p_0 c'_0$
    $c_3 = g_2 + p_2 g_1 + p_2 p_1 g_0 + p_2 p_1 p_0 c'_0$
    $g'_3 = g_3 + p_3 g_2 + p_3 p_2 g_1 + p_3 p_2 p_1 g_0$
    $p'_3 = p_3 p_2 p_1 p_0$

[Figure clbsymbol.epsi : carry-lookahead block (CLB) with inputs $(g_3, p_3) \ldots (g_0, p_0)$, $c'_0$ and outputs $(g'_3, p'_3)$, $c_3 \ldots c_0$]

• Hierarchical arrangement using $(\frac{1}{2} \log n)$ levels : $(g'_3, p'_3)$ passed up, $c'_0$ passed down between levels
• High speed-up at medium hardware overhead

    $A \approx 14n$ ; $T \approx 4 \log n$ ; $AT \approx 56n \log n$

[Figure cla.epsi : 16-bit CLA with four first-level CLBs for $(g_{15}, p_{15}) \ldots (g_0, p_0)$ producing $c_{15} \ldots c_0$, and one second-level CLB receiving $(g'_{15}, p'_{15})$, $(g'_{11}, p'_{11})$, $(g'_7, p'_7)$, $(g'_3, p'_3)$ and returning $c_{12}$, $c_8$, $c_4$, $c_0$]

+ preprocessing : $g_i = a_i b_i$ ; $p_i = a_i \oplus b_i$
+ postprocessing : $s_i = p_i \oplus c_i$
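The 4-bit carry-lookahead block equations can be checked against ordinary addition with a short Python sketch. The function names `clb4` and `cla4_add` are hypothetical, and deriving the carry-out as $g'_3 + p'_3 c'_0$ is the usual way to close a single block, stated here as an assumption since the notes pass $(g'_3, p'_3)$ to a higher level instead.

```python
def clb4(g, p, c0):
    """4-bit carry-lookahead block. g, p: lists of 4 bits (index 0 = LSB), c0: carry-in.
    Returns the carries [c_0..c_3] and the group signals (g'_3, p'_3)."""
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    g_grp = g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
    p_grp = p[3] & p[2] & p[1] & p[0]
    return [c0, c1, c2, c3], g_grp, p_grp

def cla4_add(a, b, c_in=0):
    """4-bit addition: preprocessing, one CLB, postprocessing. Returns (c_out, S)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]     # g_i = a_i b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]   # p_i = a_i xor b_i
    c, g_grp, p_grp = clb4(g, p, c_in)
    s = sum((p[i] ^ c[i]) << i for i in range(4))       # s_i = p_i xor c_i
    c_out = g_grp | (p_grp & c_in)                      # assumed block carry-out
    return c_out, s

print(cla4_add(9, 7, 1))  # carry-out and 4-bit sum of 9 + 7 + 1
```

All four carries are two-level functions of the $g_i$, $p_i$ and the carry-in, so none of them waits for a neighbouring bit position.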

4 Addition 4.3 Carry-Propagate Adders (CPA) 4 Addition 4.3 Carry-Propagate Adders (CPA)

Parallel-prefix adders (PPA)

• Type b) : universal adder architecture comprising RCA, CIA, CLA, and more (i.e. entire range of area-delay trade-offs from slowest RCA to fastest CLA)
• Preprocessing, carry-lookahead, and postprocessing step
• Carries calculated using parallel-prefix algorithms
+ High regularity : suitable for synthesis and layout
+ High flexibility : special adders, other arithmetic operations, exchangeable prefix algorithms (i.e. speeds)
+ High performance : smallest and fastest adders

$A \approx 5n + 3A_\bullet$ ; $T = 4 + 2T_\bullet$

[Figure add.epsi: prefix adder structure. Preprocessing: $g_i = a_i b_i$, $p_i = a_i \oplus b_i$ from inputs a_{n-1} b_{n-1} ... a_0 b_0 and c_in. Carry-lookahead: prefix algorithm on (g_{n-1}, p_{n-1}) ... (g_0, p_0), producing c_n ... c_0 and c_out. Postprocessing: $s_i = p_i \oplus c_i$, outputs s_{n-1} ... s_0]

Prefix problem

• Inputs $(x_{n-1}, \dots, x_0)$, outputs $(y_{n-1}, \dots, y_0)$, associative binary operator $\bullet$ [11, 13]
  $(y_{n-1}, \dots, y_0) = (x_{n-1} \bullet \dots \bullet x_0, \dots, x_1 \bullet x_0, x_0)$ or
  $y_0 = x_0$ ; $y_i = x_i \bullet y_{i-1}$ ; $i = 1, \dots, n-1$ (r.m.a.)
• Associativity of $\bullet$ ⇒ tree structures for evaluation :
  $x_3 \bullet (x_2 \bullet (x_1 \bullet x_0)) = (x_3 \bullet x_2) \bullet (x_1 \bullet x_0)$ , but $y_2$ ?
  serial evaluation : $y_1 = Y^1_{1:0}$ , $y_2 = Y^2_{2:0}$ , $y_3 = Y^3_{3:0}$
  tree evaluation : $Y^1_{3:2}$ , $y_1 = Y^1_{1:0}$ , $y_3 = Y^2_{3:0}$
• Group variables $Y^l_{i:k}$ : covers bits $(x_k, \dots, x_i)$ at level $l$
• Carry-propagation is prefix problem : $Y^l_{i:k} = (G^l_{i:k}, P^l_{i:k})$
  $(G^0_{i:i}, P^0_{i:i}) = (g_i, p_i)$
  $(G^l_{i:k}, P^l_{i:k}) = (G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1}) \bullet (G^{l-1}_{j:k}, P^{l-1}_{j:k})$ ; $k \le j \le i$
  $\phantom{(G^l_{i:k}, P^l_{i:k})} = (G^{l-1}_{i:j+1} + P^{l-1}_{i:j+1} G^{l-1}_{j:k},\; P^{l-1}_{i:j+1} P^{l-1}_{j:k})$
  $c_{i+1} = G^m_{i:0}$ ; $i = 0, \dots, n-1$ ; $l = 1, \dots, m$
• Parallel-prefix algorithms [11] :
  • multi-tree structures ($T = O(n) \to O(\log n)$)
  • sharing subtrees ($A = O(n^2) \to O(n \log n)$)
  • different algorithms trading area vs. delay (influences also from wiring and maximum fan-out $FO_{max}$)
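The three steps above (preprocessing, prefix carry computation, postprocessing) can be sketched in a few lines of Python. This is an illustrative bit-level model, not a circuit description: the function names `prefix_op` and `sklansky_add` are hypothetical, the prefix levels follow a Sklansky-style arrangement, and the carry-in is folded in as $c_{i+1} = G_{i:0} + P_{i:0} c_{in}$, which is an assumption the notes do not spell out.

```python
def prefix_op(hi, lo):
    """Prefix operator: combine the (G, P) pair of an upper group with a lower group."""
    (g_hi, p_hi), (g_lo, p_lo) = hi, lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def sklansky_add(a, b, n, c_in=0):
    """Add two n-bit numbers with a Sklansky-style prefix structure; returns (sum, c_out)."""
    # Preprocessing: bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    y = list(zip(g, p))
    # Prefix levels: bit i in the upper half of a block is combined with the
    # top bit j of the lower half (j is never updated in the same level).
    level = 0
    while (1 << level) < n:
        for i in range(n):
            if (i >> level) & 1:
                j = (i & ~((1 << level) - 1)) - 1
                y[i] = prefix_op(y[i], y[j])
        level += 1
    # y[i] now holds (G_{i:0}, P_{i:0}); fold in the carry-in (assumption).
    carries = [c_in] + [G | (P & c_in) for G, P in y]
    # Postprocessing: sum bits.
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, carries[n]
```

Replacing the level loop by the serial recursion $y_i = x_i \bullet y_{i-1}$ gives the same carries with a ripple-carry structure.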

Prefix algorithms

• Algorithms visualized by directed acyclic graphs (DAG) with array structure (n bits × m levels)
• Graph vertex symbols :
  black node (contains logic for $\bullet$) : $(G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1})$ , $(G^{l-1}_{j:k}, P^{l-1}_{j:k})$ → $(G^l_{i:k}, P^l_{i:k})$
  white node (contains no logic) : $(G^{l-1}_{i:k}, P^{l-1}_{i:k})$ → $(G^l_{i:k}, P^l_{i:k})$
• Performance measures :
  $A_\bullet$ : graph size (number of black nodes)
  $T_\bullet$ : graph depth (number of black nodes on critical path)

• Serial-prefix algorithm (⇒ RCA)
  $A_\bullet = n - 1$ ; $T_\bullet = n - 1$ ; $FO_{max} = 2$
  [Figure ser.epsi: 16-bit serial-prefix graph, bits 15 ... 0, levels 0 ... 15]

• Sklansky parallel-prefix algorithm (⇒ PPA-SK)
  • Tree-like collection, parallel redistribution of carries
  $A_\bullet \approx \frac{1}{2} n \log n$ ; $T_\bullet = \lceil \log n \rceil$ ; $FO_{max} \approx \frac{1}{2} n$
  [Figure sk.epsi: 16-bit Sklansky prefix graph, bits 15 ... 0, levels 0 ... 4]

• Brent-Kung parallel-prefix algorithm (⇒ PPA-BK)
  • Traditional CLA is PPA-BK with 4-bit groups
  • Tree-like redistribution of carries (fan-out tree)
  $A_\bullet = 2n - \lceil \log n \rceil - 2$ ; $T_\bullet = 2 \lceil \log n \rceil - 2$ ; $FO_{max} \approx \log n$
  [Figure bk.epsi: 16-bit Brent-Kung prefix graph, bits 15 ... 0, levels 0 ... 6]

• Kogge-Stone parallel-prefix algorithm (⇒ PPA-KS)
  • very high wiring requirements
  $A_\bullet = n \log n - n + 1$ ; $T_\bullet = \lceil \log n \rceil$ ; $FO_{max} = 2$
  [Figure ks.epsi: 16-bit Kogge-Stone prefix graph, bits 15 ... 0, levels 0 ... 4]

• Carry-increment parallel-prefix algorithm (⇒ CIA)
  $A_\bullet \approx 2n - 1.4 n^{1/2}$ ; $T_\bullet \approx 1.4 n^{1/2}$ ; $FO_{max} \approx 1.4 n^{1/2}$
  [Figure cia.epsi: 16-bit carry-increment prefix graph, bits 15 ... 0, levels 0 ... 5]

• Mixed serial/parallel-prefix algorithm (⇒ RCA + PPA)
  • linear size-depth trade-off using parameter $k$ : $0 \le k \le n - 2 \lceil \log n \rceil + 2$
    $k = 0$ : serial-prefix graph
    $k = n - 2 \lceil \log n \rceil + 1$ : Brent-Kung parallel-prefix graph
  • fills gap between RCA and PPA-BK (i.e. CLA) in steps of single $\bullet$-operations
  $A_\bullet = n - 1 + k$ ; $T_\bullet = n - 1 - k$ ; $FO_{max}$ = var.
  [Figure var.epsi: 16-bit mixed serial/parallel-prefix graph, bits 15 ... 0, levels 0 ... 10]
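The size and depth figures quoted for the prefix graphs can be checked on small examples. The sketch below uses a hypothetical representation (a graph is a list of levels, each level a list of black nodes `(i, j)` meaning $y_i \leftarrow y_i \bullet y_j$) and assumes n is a power of two for the Brent-Kung construction; it also verifies that every graph really yields all prefixes.

```python
def ceil_log2(n):
    return (n - 1).bit_length()


def serial(n):
    return [[(i, i - 1)] for i in range(1, n)]


def sklansky(n):
    return [[(i, (i & ~((1 << l) - 1)) - 1) for i in range(n) if (i >> l) & 1]
            for l in range(ceil_log2(n))]


def kogge_stone(n):
    return [[(i, i - (1 << l)) for i in range(1 << l, n)]
            for l in range(ceil_log2(n))]


def brent_kung(n):
    m = ceil_log2(n)  # n assumed to be a power of two
    up = [[(i, i - (1 << l)) for i in range(n) if (i + 1) % (2 << l) == 0]
          for l in range(m)]
    down = [[(i, i - (1 << l)) for i in range(n)
             if (i + 1) % (2 << l) == (1 << l) and i >= (2 << l)]
            for l in range(m - 2, -1, -1)]
    if not down:
        return up
    # The last collection level and the first redistribution level share one level.
    return up[:-1] + [up[-1] + down[0]] + down[1:]


def size_and_depth(graph, n):
    """Check that the graph computes all prefixes; return (black nodes, levels)."""
    low = list(range(n))  # y_i currently covers bits i ... low[i]
    for level in graph:
        prev = low[:]
        for i, j in level:
            assert prev[i] == j + 1  # the two groups must be adjacent
            low[i] = prev[j]
    assert all(x == 0 for x in low)
    return sum(len(level) for level in graph), len(graph)
```

For n = 16 this reproduces $(A_\bullet, T_\bullet)$ = (15, 15) for the serial graph, (32, 4) for Sklansky, (26, 6) for Brent-Kung, and (49, 4) for Kogge-Stone.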

• Example : 4-bit parallel-prefix adder (PPA-SK)
  • efficient AND-OR-prefix circuit for the generate and AND-prefix circuit for the propagate signals
  • optimization : alternatingly AOI-/OAI- resp. NAND-/NOR-gates (inverting gates are smaller and faster)
  • can also be realized using two MUX-prefix circuits
  [Figure askgate.epsi: gate-level 4-bit PPA-SK; inputs a3 b3 ... a0 b0 and c_in, outputs c_out, $P_{n-1:0}$, and s3 ... s0]

Prefix adder synthesis

• Local prefix graph transformation :
  [Figures unfact.epsi, fact.epsi: 4-bit prefix graphs. The depth-decreasing transform turns the graph with $A_\bullet = 3$, $T_\bullet = 3$ into the graph with $A_\bullet = 4$, $T_\bullet = 2$; the size-decreasing transform is its inverse]
• Repeated (local) prefix transformations result in overall minimization of graph depth or size ⇒ which sequence ?
• Goal : minimal size (area) at given depth (delay)
• Simple algorithm for sequence of applied transforms :
  Step 1 : prefix graph compression (depth minimization) : depth-decreasing transforms in right-to-left bottom-up order
  Step 2 : prefix graph expansion (size minimization) : size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded
• Prefix adder synthesis : 1) generate serial-prefix graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prefix logic
+ Generates all previous prefix graphs (except PPA-KS)
+ Universal adder synthesis algorithm : generates area-optimal adders for any given timing constraints [11] (including non-uniform signal arrival times)
Multilevel adders

• Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation : 2-level CIA = CIA-2L)
+ Delay is $O(n^{1/(m+1)})$ for m levels
- Area increase small for CSKA and CIA, high for CSLA (⇒ COSA)
- Difficult computation of optimal group sizes

Hybrid adders

• Arbitrary combinations of speed-up techniques possible ⇒ hybrid/mixed adder architectures
• Often used combinations : CLA and CSLA [14]
- Pure architectures usually perform best (at gate-level)

Transistor-level adders

• Influence of logic styles (e.g. dynamic logic, pass-transistor logic ⇒ faster)
+ Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [14]
+ Combinations of speed-up techniques make sense
- Much higher design effort
• Many efficient implementations exist and have been published

Self-timed adders

• Average carry-propagation length : $\log n$
+ RCA is fast in average case ($T = O(\log n)$), slow in worst case ⇒ suitable for self-timed asynchronous designs [15]
- Completion detection is not trivial

Adder performance comparisons

• Standard-cell implementations, 0.8 µm process

[Figure addperf.ps: area [lambda^2] (1e+06 to 1e+07) versus delay [ns] (5 to 20) for 8-, 16-, 32-, 64-, and 128-bit RCA, CSKA-2L, CIA-1L, CIA-2L, PPA-SK, PPA-BK, CLA, and COSA, with lines of constant AT]

• Complexity comparison under the unit-gate model

  adder      A                          T                  AT                      opt.(1)  syn.(2)
  RCA        $7n$                       $2n$               $14n^2$                 aaa      √
  CSKA-1L    $8n$                       $4n^{1/2}$         $32n^{3/2}$             aat      (3)
  CSKA-2L    $8n$                       $x n^{1/3}$ (4)    $x n^{4/3}$ (4)         -
  CSLA-1L    $14n$                      $2.8n^{1/2}$       $39n^{3/2}$             -
  CIA-1L     $10n$                      $2.8n^{1/2}$       $28n^{3/2}$             -        √
  CIA-2L     $10n$                      $3.6n^{1/3}$       $36n^{4/3}$             att      √
  CIA-3L     $10n$                      $4.4n^{1/4}$       $44n^{5/4}$             att      √
  PPA-SK     $\frac{3}{2} n \log n$     $2 \log n$         $3n \log^2 n$           ttt      √
  PPA-BK     $10n$                      $4 \log n$         $40n \log n$            att      √
  PPA-KS     $3n \log n$                $2 \log n$         $6n \log^2 n$           -
  CLA (5)    $14n$                      $4 \log n$         $56n \log n$            -        (√)
  COSA       $3n \log n$                $2 \log n$         $6n \log^2 n$           -

  (1) optimality regarding area and delay
      aaa : smallest area, longest delay
      aat : small area, medium delay
      att : medium area, short delay
      ttt : large area, shortest delay
      -   : not optimal
  (2) obtained from prefix adder synthesis
  (3) automatic logic optimization not possible (redundancy)
  (4) exact factors not calculated
  (5) corresponds to 4-bit PPA-BK

4.4 Carry-Save Adder (CSA)

a) Adds three n-bit operands $A_0$, $A_1$, $A_2$ performing no carry-propagation (i.e. carries are saved) [1]
   $(C, S) = C + S = A_0 + A_1 + A_2$
   $2c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ ; $i = 0, 1, \dots, n-1$ (n.)
b) Adds one n-bit operand to an n-digit carry-save operand
   $(C, S)_{out} = A + (C, S)_{in}$

• Result is in redundant carry-save format (n digits), represented by two n-bit numbers S (sum bits) and C (carry bits)
+ Parallel arrangement of n full-adders, constant delay
  $A = 7n$ ; $T = 4$

[Figure csasymbol.epsi: CSA symbol with inputs A0, A1, A2 and outputs C, S]
[Figure csa.epsi: n full-adders in parallel; inputs a_{0,i}, a_{1,i}, a_{2,i}, outputs c_{i+1}, s_i]

• Multi-operand carry-save adders (m > 3)
  ⇒ adder array (linear arrangement), adder tree (tree arr.)
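The carry-save relation $2c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ can be modelled with word-wide bit operations. The following is a minimal sketch on unbounded Python integers; the names `carry_save_add` and `multi_operand_sum` are hypothetical, and the final `c + s` stands in for the final carry-propagate adder.

```python
def carry_save_add(a0, a1, a2):
    """One CSA row: returns (C, S) with C + S == a0 + a1 + a2 and no carry propagation."""
    s = a0 ^ a1 ^ a2                              # sum bit of every full-adder
    c = ((a0 & a1) | (a0 & a2) | (a1 & a2)) << 1  # carry bit, weight of next position
    return c, s


def multi_operand_sum(operands):
    """Linear CSA array: fold one operand per row into (C, S), then one final CPA."""
    c, s = 0, 0
    for a in operands:
        c, s = carry_save_add(a, c, s)
    return c + s                                  # final carry-propagate addition
```

Each row costs only a full-adder delay regardless of word length, which is why the carry-propagate addition is deferred to the very end.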
4.5 Multi-Operand Adders

• Add three or more (m > 2) n-bit operands, yield $(n + \lceil \log m \rceil)$-bit result in irredundant number rep. [1, 2]

Array adders

• Realization by array adders : (see figures below)
  a) linear arrangement of CPAs
  b) linear arr. of CSAs (adder array) and final CPA
• a) and b) differ in bit arrival times at final CPA :
  ⇒ if CPA = RCA : a) and b) have same overall delay
  ⇒ if fast final CPA : uniform bit arrival times required ⇒ CSA array (b)
• Fast implementation : CSA array + fast final CPA (note : array of fast CPAs not efficient/necessary)

  $A = (m - 2) A_{CSA} + A_{CPA}$
  $T = (m - 2) T_{CSA} + T_{CPA}$

  CPA = RCA : $A = O(mn + n)$ ; $T = O(m + n)$
  Fast CPA : $A = O(mn + n \log n)$ ; $T = O(m + \log n)$

[Figure mopadd.epsi: operands A0, A1, A2, A3, ..., A_{m-1} feeding a chain of CSAs followed by a CPA with output S]

a) 4-operand CPA (RCA) array :
[Figure cparray.epsi: three rows of ripple-carry adders (FA ... FA HA) adding a_{0,i}, a_{1,i}, then a_{2,i}, then a_{3,i}; outputs s_n, s_{n-1}, ..., s_0]

b) 4-operand CSA array with final CPA (RCA) :
[Figure csarray.epsi: two CSA rows (FA ... FA) for a_{0,i}, a_{1,i}, a_{2,i} and a_{3,i}, followed by a ripple-carry CPA (FA ... FA HA); outputs s_n, s_{n-1}, ..., s_0]

(m, 2)-compressors

  $2 \left( c + \sum_{l=0}^{m-4} c^l_{out} \right) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c^l_{in}$

[Figure cprsymbol.epsi: (m,2)-compressor symbol with inputs a_0 ... a_{m-1}, carry inputs c^0_in ... c^{m-4}_in, carry outputs c^0_out ... c^{m-4}_out, and outputs c, s]

• 1-bit adders (similar to (m, k)-counters) [16]
• Compresses m bits down to 2 by forwarding (m - 3) intermediate carries to next higher bit position
• Is bit-slice of multi-operand CSA array (see prev. page)
+ No horizontal carry-propagation (i.e. $c^l_{in} \to c^k_{out}$ ; $k > l$)
• Built from full-adders (= (3, 2)-compressor) or (4, 2)-compressors arranged in linear or tree structures

  $A = 7(m - 2)$
  $T_{LIN} = 4(m - 2)$ ; $T_{TREE} = 6(\lceil \log m \rceil - 1)$

• Example : 4-operand adder using (4, 2)-compressors
  [Figure cpradd.epsi: row of (4,2)-compressors (CSA) for a_{0,i} ... a_{3,i}, followed by a ripple-carry CPA (FA ... FA HA); outputs s_{n+1}, s_n, ..., s_0]

• Optimized (4, 2)-compressor :
  • 2 full-adders merged and optimized (i.e. XORs arranged in tree structure)
  with full-adders : $A = 14$ ; $T = 8$ [Figure cpr42fa.epsi]
  optimized : $A = 14$ ; $T = 6$ [Figure cpr42opt.epsi]
  + same area, 25% shorter delay
• SD-FA (signed-digit full-adder) is similar to (4, 2)-compressor regarding structure and complexity
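For m = 4 the compressor equation reads $2(c + c_{out}) + s = a_0 + a_1 + a_2 + a_3 + c_{in}$. A minimal sketch of the unoptimized version built from two full-adders follows; the function names are hypothetical and the wiring order of the inputs is an assumption.

```python
def full_adder(a, b, c):
    """(3,2)-compressor: returns (carry, sum) with 2*carry + sum == a + b + c."""
    return (a & b) | (a & c) | (b & c), a ^ b ^ c


def compressor_4_2(a0, a1, a2, a3, c_in):
    """(4,2)-compressor from two full-adders; returns (c_out, c, s)."""
    c_out, t = full_adder(a0, a1, a2)  # c_out goes to the next higher bit position
    c, s = full_adder(t, a3, c_in)     # c_in comes from the next lower bit position
    return c_out, c, s
```

Because `c_out` is computed from a0, a1, a2 only, a carry entering at `c_in` never ripples through to `c_out`, which is the "no horizontal carry-propagation" property.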

• Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors :
  • higher compression rate (4:2 instead of 3:2)
  • less deep and more regular trees

  tree depth           0   1   2   3   4   5   6    7   8   9   10
  # operands  FA       2   3   4   6   9   13  19   28  42  63  94
              (4,2)    2   4   8   16  32  64  128  ...

• Example : (8, 2)-compressor
  full-adder tree : $A = 42$ ; $T = 16$ [Figure cpr82fa.epsi: inputs a0 ... a7, carry chains c^0 ... c^4, outputs c, s]
  (4, 2)-compressor tree : $A = 42$ ; $T = 12$ [Figure cpr82cpr42.epsi: inputs a0 ... a7, carry chains c^0 ... c^4, outputs c, s]

Tree adders (Wallace tree)

• Adder tree : n-bit m-operand carry-save adder composed of n tree-structured (m, 2)-compressors [1, 17]
• Tree adders : fastest multi-operand adders using an adder tree and a fast final CPA

  $A = A_{(m,2)} \cdot n + A_{CPA} = O(mn + n \log n)$
  $T = T_{(m,2)} + T_{CPA} = O(\log m + \log n)$

Adder arrays and adder trees revisited

• Some FA can often be replaced by HA or eliminated (i.e. redundant due to constant inputs)
• Number of (irredundant) FA does not depend on adder structure, but number of HA does
• An m-operand adder accommodates (m - 1) carry inputs
• Adder trees ($T = O(\log n)$) are faster than adder arrays ($T = O(n)$) at same amount of gates ($A = O(mn)$)
- Adder trees are less regular and have more complex routing than adder arrays ⇒ larger area, difficult layout (i.e. limited use in layout generators)
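The operand counts in the tree-depth table follow a simple recurrence. As a sketch (the function names are hypothetical): one full-adder level reduces groups of three bits to two, so a tree one level deeper can take $\lfloor 3m/2 \rfloor$ operands, while one (4, 2)-compressor level halves the operand count.

```python
def max_operands_fa(depth):
    """Largest operand count a full-adder tree of the given depth reduces to 2."""
    m = 2
    for _ in range(depth):
        m = (3 * m) // 2   # every 2 outputs may come from 3 inputs
    return m


def max_operands_42(depth):
    """Largest operand count a (4,2)-compressor tree of the given depth reduces to 2."""
    return 2 << depth      # every level halves the number of operands
```

`[max_operands_fa(d) for d in range(11)]` reproduces the FA row of the table and `[max_operands_42(d) for d in range(7)]` the (4,2) row.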
4.6 Sequential Adders

Bit-serial adder : Sequential n-bit adder

  $A = A_{FA} + A_{FF}$
  $T = T_{FA} + T_{FF}$
  $L = n$
  [Figure bitseradd.epsi: full-adder with inputs a_i, b_i, output s_i, and carry fed back through a flip-flop]

Accumulators : Sequential m-operand adders

• With CPA
  $A = A_{CPA} + A_{REG}$
  $T = T_{CPA} + T_{REG}$
  $L = m$
  [Figure accucpa.epsi: CPA with input A and output register S fed back]

• With CSA and final CPA
  • Allows higher clock rates
  • Final CPA too slow : ⇒ pipelining or multiple cycles for evaluation
  $A = A_{CSA} + A_{CPA} + 4A_{REG}$
  $T = T_{CSA} + T_{REG}$
  $L = m$
  [Figure accucsa.epsi: CSA with input A and carry-save registers fed back, followed by a CPA with output S]

• Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size

5 Simple / Addition-Based Operations

5.1 Complement and Subtraction

2's complementer (negation)
  $-A = \bar{A} + 1$
  [Figure neg.epsi: inverted input A into an incrementer (+1), output Z]

2's complement subtractor
  $A - B = A + (-B) = A + \bar{B} + 1$
  [Figure sub.epsi: CPA with inputs A and inverted B, carry-in 1, outputs c_out and S]

2's complement adder/subtractor
  $A \pm B = A + (-1)^{sub} B = A + (B \oplus sub) + sub$
  [Figure addsub.epsi: CPA with inputs A and B XOR sub, carry-in sub, outputs c_out and S]

1's complement adder
  $A + B \pmod{2^n - 1} = A + B + c_{out}$ (end-around carry)
  [Figure addmod.epsi: CPA with inputs A and B, c_out fed back to c_in, output S]
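The adder/subtractor identity $A + (B \oplus sub) + sub$ and the end-around carry can be tried out on integers. This is a behavioural sketch only; the function names are hypothetical, and the end-around carry is modelled in two steps rather than as a feedback wire.

```python
def add_sub(a, b, sub, n):
    """n-bit 2's complement adder/subtractor; returns (S, c_out)."""
    mask = (1 << n) - 1
    b_x = b ^ (mask if sub else 0)   # B xor sub, applied to every bit
    total = a + b_x + sub            # sub doubles as the carry-in
    return total & mask, total >> n


def ones_complement_add(a, b, n):
    """n-bit addition modulo 2^n - 1 using the end-around carry."""
    mask = (1 << n) - 1
    total = a + b
    return (total & mask) + (total >> n)   # feed c_out back in as carry-in
```

Note that `ones_complement_add` can return the all-ones pattern $2^n - 1$, the second representation of zero in 1's complement.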
5.2 Increment / Decrement

Incrementer

• Adds a single bit $c_{in}$ to an n-bit operand A
  $(c_{out}, Z) = c_{out} 2^n + Z = A + c_{in}$
  $z_i = a_i \oplus c_i$
  $c_{i+1} = a_i c_i$ ; $i = 0, \dots, n-1$
  $c_0 = c_{in}$ ; $c_{out} = c_n$ (r.m.a.)
  [Figure incsymbol.epsi: incrementer symbol (+1) with input A, c_in, c_out, output Z]
• Corresponds to addition with B = 0 (⇒ FA → HA)
• Example : Ripple-carry incrementer using half-adders
  $A = 3n$ ; $T = n + 1$ ; $AT \approx 3n^2$
  [Figure incfa.epsi: chain of half-adders with inputs a_{n-1} ... a_0, carries c_in, c_1, c_2, ..., c_{n-1}, c_out, outputs z_{n-1} ... z_0]
  or using incrementer slices (= half-adder)
  [Figure inc.epsi: chain of incrementer slices with inputs a_{n-1} ... a_0, c_in, c_out, outputs z_{n-1} ... z_0]
• Prefix problem : $C_{i:k} = C_{i:j+1} C_{j:k}$ ⇒ AND-prefix struct.
  $A \approx \frac{1}{2} n \log n + 2n$ ; $T = \lceil \log n \rceil + 2$ ; $AT \approx \frac{1}{2} n \log^2 n$

Decrementer

  $(c_{out}, Z) = A - c_{in}$
  [Figure dec.epsi: chain of decrementer slices with inputs a_{n-1} ... a_0, c_in, c_out, outputs z_{n-1} ... z_0]

Incrementer-decrementer

  $(c_{out}, Z) = A \pm c_{in} = A + (-1)^{dec} c_{in}$
  [Figure incdec.epsi: chain of slices with inputs a_{n-1} ... a_0, control dec, c_in, c_out, outputs z_{n-1} ... z_0]
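The ripple recursion $z_i = a_i \oplus c_i$, $c_{i+1} = a_i c_i$ translates directly into a loop. In the sketch below the function name is hypothetical, and the decrementer variant (carry propagates through zeros instead of ones, i.e. $c_{i+1} = \bar{a}_i c_i$) is an assumption consistent with $(c_{out}, Z) = A - c_{in}$, with $c_{out}$ acting as a borrow.

```python
def inc_dec(a, c_in, n, dec=0):
    """Ripple-carry incrementer (dec=0) or decrementer (dec=1); returns (c_out, Z)."""
    c, z = c_in, 0
    for i in range(n):
        a_i = (a >> i) & 1
        z |= (a_i ^ c) << i    # z_i = a_i xor c_i (one half-adder per bit)
        c &= a_i ^ dec         # carry passes through ones, borrow through zeros
    return c, z
```

For example, `inc_dec(0b0111, 1, 4)` gives `(0, 0b1000)` and `inc_dec(0b1000, 1, 4, dec=1)` gives `(0, 0b0111)`.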

Fast incrementers

• 4-bit incrementer using multi-input gates :
  [Figure inccg.epsi: inputs a3 ... a0 and c_in, outputs c_out and z3 ... z0]
• 8-bit parallel-prefix incrementer (Sklansky AND-prefix structure) :
  [Figure incpp.epsi: inputs a7 ... a0 and c_in, outputs c_out and z7 ... z0]

Gray incrementer

• Increments in Gray number system
  $c_0 = a_{n-1} \oplus a_{n-2} \oplus \dots \oplus a_0$ (parity)
  $c_{i+1} = \bar{a}_i c_i$ ; $i = 0, \dots, n-3$ (r.m.a.)
  $z_0 = a_0 \oplus \bar{c}_0$
  $z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \dots, n-2$
  $z_{n-1} = a_{n-1} \oplus c_{n-2}$
• Prefix problem ⇒ AND-prefix structure
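The Gray incrementer equations can be checked against the reflected binary Gray code $g(k) = k \oplus \lfloor k/2 \rfloor$. The sketch below assumes that code, $n \ge 2$, and the complement placement $c_{i+1} = \bar{a}_i c_i$, $z_0 = a_0 \oplus \bar{c}_0$ (flip bit 0 on even parity, otherwise flip the bit left of the lowest 1); the function names are hypothetical.

```python
def gray(k):
    """Reflected binary Gray code of k."""
    return k ^ (k >> 1)


def gray_increment(a, n):
    """Successor of the n-bit Gray code word a (n >= 2), wrapping around."""
    bit = [(a >> i) & 1 for i in range(n)]
    c = [0] * (n - 1)
    c[0] = sum(bit) & 1                       # parity of the whole word
    for i in range(n - 2):
        c[i + 1] = (1 ^ bit[i]) & c[i]        # odd parity and bits 0..i all zero
    z = [0] * n
    z[0] = bit[0] ^ (1 ^ c[0])                # even parity: flip bit 0
    for i in range(1, n - 1):
        z[i] = bit[i] ^ (bit[i - 1] & c[i - 1])
    z[n - 1] = bit[n - 1] ^ c[n - 2]          # also covers the wrap-around
    return sum(z_i << i for i, z_i in enumerate(z))
```

Exactly one output bit differs from the input in every step, which is the defining property of a Gray counter.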

5.3 Counting

• Count clock cycles ⇒ counter, divide clock frequency ⇒ frequency divider ($c_{out}$)

Binary counter

• Sequential in-/decrementer
• Incrementer speed-up techniques applicable
• Down- and up-down-counters using decrementers / incrementer-decrementers
  [Figure cntblock.epsi: incrementer (+1) with c_in, c_out, register clocked by clk, output Q]
• Example : Ripple-carry up-counter using counter slices (= HA + FF), $c_{in}$ is count enable
  [Figure cntripple.epsi: chain of counter slices with c_in, c_out, outputs q_{n-1} ... q_0]
• Asynchronous counter using toggle-flip-flops (lower toggle rate ⇒ lower power)
  [Figure cntasync.epsi: chain of T flip-flops clocked by clk, outputs q_{n-1} ... q_0]
• Fast divider ($T = O(1)$) using delayed-carry numbers (irredundant carry-save representation of -1 allows using fast carry-save incrementer) [8]

Gray counter

• Counter using Gray incrementer

Ring counters

• Shift register connected to ring :
  [Figure cntring.epsi: shift register with outputs q_{n-1} ... q_0 and feed-back from last to first stage]
• State is not encoded ⇒ n FF for counting n states
• Must be initialized correctly (e.g. 00···01)
• Applications :
  • fast dividers (no logic between FF)
  • state counter for one-hot coded FSMs
• Johnson / twisted-ring counter (inverted feed-back) :
  [Figure cntjohnson.epsi: shift register with outputs q_{n-1} ... q_0 and inverted feed-back]
  • n FF for counting 2n states
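The state counts of the two ring counters are easy to confirm by simulation. This sketch models the shift register as an n-bit integer shifted left with the (optionally inverted) top bit fed back; the function name and the shift direction are assumptions.

```python
def ring_cycle(n, start, twisted=False):
    """States visited by an n-bit ring counter (or Johnson counter if twisted)."""
    mask = (1 << n) - 1
    states, state = [], start
    while state not in states:
        states.append(state)
        feedback = (state >> (n - 1)) & 1      # bit leaving the last stage
        if twisted:
            feedback ^= 1                      # inverted feed-back
        state = ((state << 1) | feedback) & mask
    return states
```

`ring_cycle(4, 0b0001)` visits 4 one-hot states, `ring_cycle(4, 0, twisted=True)` visits 8 states, and a ring counter started at 0 never leaves it, which is why correct initialization matters.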
5.4 Comparison, Coding, Detection

Comparison operations

  EQ = (A = B)                                              (equal)
  NE = (A ≠ B) = $\overline{EQ}$                            (not equal)
  GE = (A ≥ B)                                              (greater or equal)
  LT = (A < B) = $\overline{GE}$                            (less than)
  GT = (A > B) = $GE \cdot \overline{EQ}$                   (greater than)
  LE = (A ≤ B) = $\overline{GT}$ = $\overline{GE} + EQ$     (less or equal)

Equality comparison

  EQ = (A = B)
  $eq_{i+1} = (a_i = b_i)\, eq_i = \overline{(a_i \oplus b_i)}\, eq_i$ ; $i = 0, \dots, n-1$
  $eq_0 = 1$ ; $EQ = eq_n$ (r.s.a.)
  [Figure cmpeq.epsi: inputs a_{n-1} b_{n-1} ... a_0 b_0, output EQ]

Magnitude comparison

  GE = (A ≥ B)
  $ge_{i+1} = (a_i > b_i) + (a_i = b_i)\, ge_i = a_i \bar{b}_i + \overline{(a_i \oplus b_i)}\, ge_i$ ; $i = 0, \dots, n-1$
  $ge_0 = 1$ ; $GE = ge_n$ (r.s.a.)

Comparators

• Subtractor (A - B) :
  $GE = c_{out}$
  $EQ = P_{n-1:0}$ (for free in PPA)
  [Figure cmpsub.epsi: CPA with inputs A and inverted B, carry-in 1, outputs GE = c_out and EQ = P_{n-1:0}]
  $A_{RCA} = 7n$ ; $T_{RCA} = 2n$ or
  $A_{PPA-SK} \approx \frac{3}{2} n \log n$ ; $T_{PPA-SK} \approx 2 \log n$
• Optimized comparator :
  • removing redundancies in subtractor (unused $s_i$)
  • single-tree structure ⇒ speed-up at no cost :
    $A = 6n$ ; $T_{LIN} = 2n$ ; $T_{TREE} \approx 2 \log n$
  • example : ripple comparator using comparator slices
    [Figure cmpripple.epsi: comparator slices (equality and magnitude) with inputs a_{n-1} b_{n-1} ... a_0 b_0, outputs GE and EQ]
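The two comparison recursions run from the least to the most significant bit, so the most significant differing bit decides last. A minimal sketch (the function name `compare` is hypothetical) that also derives the remaining four relations from EQ and GE:

```python
def compare(a, b, n):
    """Ripple comparator for unsigned n-bit numbers; returns a dict of all six relations."""
    eq, ge = 1, 1                          # eq_0 = ge_0 = 1
    for i in range(n):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        same = 1 ^ a_i ^ b_i               # a_i = b_i
        ge = (a_i & (1 ^ b_i)) | (same & ge)
        eq = same & eq
    return {"EQ": eq, "NE": 1 ^ eq, "GE": ge, "LT": 1 ^ ge,
            "GT": ge & (1 ^ eq), "LE": (1 ^ ge) | eq}
```

For example, `compare(5, 3, 4)` yields GE = GT = NE = 1 and EQ = LT = LE = 0.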

Decoder

• Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$)
  $z_i = 1$ if $A = i$, $0$ else ; $i = 0, \dots, m-1$
  $Z = 2^A$
  [Figure decodersym.epsi: decoder symbol with input A and output Z]
  [Figure decoder.epsi: 3-to-8 decoder with inputs a2 a1 a0 and outputs z7 ... z0]
  $A = (n - 1) 2^n$ ; $T = \lceil \log n \rceil$

Encoder

• Encodes vector $A_{m-1:0}$ to binary number $Z_{n-1:0}$ ($m = 2^n$)
  (condition : $\exists i\, \forall k$ | if $k = i$ then $a_k = 1$ else $a_k = 0$)
  $Z = i$ if $a_i = 1$ ; $i = 0, \dots, m-1$
  $Z = \log_2 A$
  [Figure encodersym.epsi: encoder symbol with input A and output Z]
  [Figure encoder.epsi: 8-to-3 encoder with inputs a7 ... a0 and outputs z2 z1 z0]
  $A = n(2^{n-1} - 1)$ ; $T = n - 1$

Detection operations

• All-zeroes detection : $z = \overline{a_{n-1} + a_{n-2} + \dots + a_0}$
  All-ones detection : $z = a_{n-1} a_{n-2} \cdots a_0$ (r.s.a.)
  $A = n$ ; $T = \log n$
• Leading-zeroes detection (LZD) :
  • for scaling, normalization, priority encoding
  a) non-encoded output :
     {0}1{0|1} → {0}1{0} (e.g. 000101 → 000100)
     [Figure lzdnenc.epsi: inputs a_{n-1} a_{n-2} ... a_1 a_0, outputs z_{n-1} z_{n-2} ... z_1 z_0 (note : connections according to PPA-SK)]
     $A = 2n$ ; $T = n$
     • prefix problem (r.m.a.) ⇒ AND-prefix structure
  b) encoded output : + encoder
  • signed numbers : + leading-ones detector (LOD)
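Leading-zeroes detection with non-encoded output keeps only the most significant 1, and an encoder then turns that one-hot vector into a count. The sketch below scans from the MSB down with a running "no 1 seen yet" signal (a serial version of the AND-prefix); the function names are hypothetical, and returning n for an all-zero input is an assumption.

```python
def leading_one(a, n):
    """Non-encoded LZD output: the one-hot vector marking the leading 1 of a."""
    z, none_above = 0, 1
    for i in reversed(range(n)):
        a_i = (a >> i) & 1
        z |= (a_i & none_above) << i   # keep a_i only if all higher bits are 0
        none_above &= 1 ^ a_i
    return z


def encode(one_hot):
    """Encoder: index of the single set bit (Z = log2 A)."""
    return one_hot.bit_length() - 1


def leading_zeroes(a, n):
    """Encoded LZD output: number of zeros above the leading 1."""
    return n if a == 0 else n - 1 - encode(leading_one(a, n))
```

`leading_one(0b000101, 6)` returns `0b000100`, matching the example in the text, and `leading_zeroes(0b000101, 6)` returns 3.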
5.5 Shift, Extension, Saturation

• Shift : a) shift n-bit vector by k bit positions
          b) select n out of more bits at position k
  also : logical (= unsigned), arithmetic (= signed)
• Rotation by k bit positions, n constant (logic operation)
• Extension of word lengths by k bits ($n \to n + k$) (i.e. sign-extension for signed numbers)
• Saturation to highest/lowest value after over-/underflow

  shift a)   unsigned  l.   $a_{n-2}, \dots, a_0, 0$                       sll
                       r.   $0, a_{n-1}, \dots, a_1$                       srl
             signed    l.   $a_{n-1}, a_{n-3}, \dots, a_0, 0$              sla
                       r.   $a_{n-1}, a_{n-1}, a_{n-2}, \dots, a_1$        sra
  shift b)   unsigned       $a_{n+k-1}, \dots, a_k$
             signed         $a_{2n-1}, a_{n+k-2}, \dots, a_k$
  rotate               l.   $a_{n-2}, \dots, a_0, a_{n-1}$                 rol
                       r.   $a_0, a_{n-1}, \dots, a_1$                     ror
  extend     unsigned  l.   $0, a_{n-1}, \dots, a_0$
                       r.   $a_{n-1}, \dots, a_0, 0$
             signed    l.   $a_{n-1}, a_{n-1}, a_{n-2}, \dots, a_0$
                       r.   $a_{n-1}, a_{n-2}, \dots, a_0, 0$
  saturate   unsigned       $a_{n-1}, \dots, a_{n-1}$
             signed         $a_{n-1}, \bar{a}_{n-1}, \dots, \bar{a}_{n-1}$

• Applications :
  • adaptation of magnitude (shift a)) or word length (extension) of operands (e.g. for addition)
  • multiplication/division by powers of 2 (shift)
  • logic bit/byte operations (shift, rotation)
  • scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of floating-point numbers, shift a)) using LZD
  • reducing error after over-/underflow (saturation)
• Implementation of shift/extension/rotation by
  • constant values : hard-wired
  • variable values : multiplexers
  • n possible values : n-by-n barrel-shifter/rotator
• Example : 4-by-4 barrel-rotator
  $A = O(n^2)$ ; $T = O(\log n)$
  [Figure muxshift.epsi: rotator with inputs a3 ... a0, select signals s1 s0, outputs z3 ... z0, built from multiplexers]
  [Figure barshift.epsi: rotator with inputs a3 ... a0, select signals s1 s0, outputs z3 ... z0, built from tristate buffers]
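A rotation by a variable amount can be built from multiplexer stages, one per bit of the rotation amount. The following sketch assumes a logarithmic arrangement (stage j rotates by $2^j$ when select bit $s_j$ is set) and n a power of two; the function name is hypothetical.

```python
def rotate_left(a, k, n):
    """Rotate the n-bit word a left by k positions using log2(n) multiplexer stages."""
    mask = (1 << n) - 1
    stage = 0
    while (1 << stage) < n:
        if (k >> stage) & 1:                          # select bit s_stage
            d = 1 << stage                            # hard-wired rotation by 2^stage
            a = ((a << d) | (a >> (n - d))) & mask
        stage += 1
    return a
```

With n = 4 the two stages correspond to the select signals s0 and s1 of the example, e.g. `rotate_left(0b0011, 3, 4)` gives `0b1001`.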
5.6 Addition Flags

  flag   formula                                                                                        description
  C      $c_n$                                                                                          carry flag
  V      $c_n \oplus c_{n-1}$ $= a_{n-1} b_{n-1} \bar{s}_{n-1} + \bar{a}_{n-1} \bar{b}_{n-1} s_{n-1}$   signed overflow flag
  Z      $\forall i : s_i = 0$                                                                          zero flag
  N      $s_{n-1}$                                                                                      negative flag, sign

Implementation of adder with flags

• C, N : for free
• V : fast $c_n$, $c_{n-1}$ computed by e.g. PPA ⇒ very cheap
• Z : a) $c_{in} = 1$ (subtract.) : $Z = (A = B) = P_{n-1:0}$ (of PPA)
      b) $c_{in} = 0/1$ :
         1) $Z = \overline{s_{n-1} + s_{n-2} + \dots + s_0}$ (r.s.a.)
            $A = A_{CPA} + n$ ; $T_Z = T_{CPA} + \lceil \log n \rceil$
         2) • faster without final sum (i.e. carry prop.) [18]
            • example : 01001100 + 10110100 + 0 = 00000000
            $z_0 = \overline{(a_0 \oplus b_0) \oplus c_{in}}$
            $z_i = \overline{(a_i \oplus b_i) \oplus (a_{i-1} + b_{i-1})}$ ; $i = 1, \dots, n-1$
            $Z = z_{n-1} z_{n-2} \cdots z_0$ (r.s.a.)
            $A = A_{CPA} + 3n$ ; $T_Z = 4 + \lceil \log n \rceil$

Basic and derived condition flags

                                    formula
  condition             flag        unsigned              signed
  operation : S = A + B (+) or S = A - B (-)
  S = 0      zero       Z           $Z$                   $Z$
  S < 0      negative   N                                 $N$
  S ≥ 0      positive   N                                 $\bar{N}$
  S > max    overflow               $C$ (+)               $V \bar{C}$
  S < min    underflow              $\bar{C}$ (-)         $V C$
  operation : A - B
  A = B                 EQ          $Z$                   $Z$
  A ≠ B                 NE          $\bar{Z}$             $\bar{Z}$
  A ≥ B                 GE          $C$                   $\bar{N}\bar{V} + NV$
  A > B                 GT          $C \bar{Z}$           $(\bar{N}\bar{V} + NV) \bar{Z}$
  A < B                 LT          $\bar{C}$             $N\bar{V} + \bar{N}V$
  A ≤ B                 LE          $\bar{C} + Z$         $N\bar{V} + \bar{N}V + Z$

• Unsigned and signed addition/subtraction only differ with respect to the condition flags
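The flag formulas, including the zero detection that avoids the final sum, can be checked exhaustively for small word lengths. In this sketch the function names are hypothetical; the fast zero test uses the observation that if $s_{i-1} = 0$ then the carry into bit $i$ equals $a_{i-1} + b_{i-1}$, with the per-bit results complemented before being ANDed.

```python
def to_signed(x, n):
    """Interpret the n-bit pattern x as a 2's complement number."""
    return x - (1 << n) if (x >> (n - 1)) & 1 else x


def add_with_flags(a, b, n, c_in=0):
    """n-bit addition returning (S, flags) with flags C, V, Z, N."""
    mask = (1 << n) - 1
    total = a + b + c_in
    s = total & mask
    c_n = total >> n
    low = mask >> 1                                   # bits below the MSB
    c_n1 = ((a & low) + (b & low) + c_in) >> (n - 1)  # carry into the MSB
    return s, {"C": c_n, "V": c_n ^ c_n1, "Z": int(s == 0), "N": s >> (n - 1)}


def fast_zero(a, b, n, c_in=0):
    """Zero flag computed from the operands, without carry propagation."""
    z, carry_guess = 1, c_in
    for i in range(n):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        z &= 1 ^ ((a_i ^ b_i) ^ carry_guess)   # z_i = not(p_i xor assumed carry)
        carry_guess = a_i | b_i                # carry into bit i+1 if s_i = 0
    return z
```

For the example in the text, `fast_zero(0b01001100, 0b10110100, 8)` returns 1 and the carry flag of the same addition is 1.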
5.7 Arithmetic Logic Unit (ALU)

[Figure alusymbol.epsi: ALU symbol with inputs A, B, c_in, op, and outputs Z, c_out, flags]

ALU operations

  arithmetic      add    $A + B + c_{in}$        sub    $A - B - c_{in}$
                  inc    $A + 1$                 dec    $A - 1$
                  pass   $A$                     neg    $-A$
  logic           and    $a_i b_i$               nand   $\overline{a_i b_i}$
                  or     $a_i + b_i$             nor    $\overline{a_i + b_i}$
                  xor    $a_i \oplus b_i$        xnor   $\overline{a_i \oplus b_i}$
                  pass   $a_i$                   not    $\bar{a}_i$
  shift/rotate    sll    $A \ll 1$               srl    $A \gg 1$
                  sla    $A \ll_a 1$             sra    $A \gg_a 1$
                  rol    $A \ll_r 1$             ror    $A \gg_r 1$

• s/ro : shift/rotate ; l/r : left/right ; l/a : logic (unsigned) / arithmetic (signed)
• Logic of adder/subtractor can partly be shared with logic operations

6 Multiplication

6.1 Multiplication Basics

• Multiplies two n-bit operands A and B [1, 2]
• Product P is (2n)-bit unsigned number or (2n - 1)-bit signed number
• Example : unsigned multiplication
  $P = A \cdot B = \sum_{i=0}^{n-1} a_i 2^i \cdot \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$ or
  $P_i = a_i \cdot B$ ; $P = \sum_{i=0}^{n-1} P_i 2^i$ ; $i = 0, \dots, n-1$ (r.s.a.)

Algorithm

1) Generation of n partial products $P_i$
2) Adding up partial products :
   a) sequentially (sequential shift-and-add),
   b) serially (combinational shift-and-add), or
   c) in parallel

Speed-up techniques

• Reduce number of partial products
• Accelerate addition of partial products
Sequential multipliers :

  partial products generated and added sequentially (using accumulator)
  $A = O(n)$ ; $T = O(\log n)$ ; $L = n$
  [Figure mulseq.epsi: CPA with accumulator]

Array multipliers :

  partial products generated and added simultaneously in linear array (using array adder)
  $A = O(n^2)$ ; $T = O(n)$
  [Figure mularr.epsi: chain of CSAs followed by a CPA]

Parallel multipliers :

  partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)
  $A = O(n^2)$ ; $T = O(\log n)$
  [Figure mulpar.epsi: CSA tree followed by a CPA]

Signed multipliers :

  a) complement operands before and result after multiplication ⇒ unsigned multiplication
  b) direct implementation (dedicated multiplier structure)

6.2 Unsigned Array Multiplier

• Braun multiplier : array multiplier for unsigned numbers
  $P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$
  $A = 8n^2 - 11n$
  $T = 6n - 9$

                      a0b3 a0b2 a0b1 a0b0
                 a1b3 a1b2 a1b1 a1b0
            a2b3 a2b2 a2b1 a2b0
    +  a3b3 a3b2 a3b1 a3b0
    ---------------------------------------
    p7   p6   p5   p4   p3   p2   p1   p0

  [Figure mulbraun.epsi: 4-bit Braun multiplier; AND gates for a_i b_j, rows of HA and FA cells, final ripple-carry row (FA FA HA); regions marked 1, 2, 3; outputs p7 ... p0]
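The partial-product formulation $P_i = a_i \cdot B$, $P = \sum P_i 2^i$ is the shift-and-add algorithm. A minimal behavioural sketch (the function name is hypothetical; the running sum plays the role of the accumulator in a sequential multiplier or of the adder array in an array multiplier):

```python
def shift_and_add(a, b, n):
    """Unsigned n-bit multiplication by accumulating shifted partial products."""
    p = 0
    for i in range(n):
        a_i = (a >> i) & 1
        partial = b if a_i else 0   # P_i = a_i * B (a row of AND gates)
        p += partial << i           # add with weight 2^i
    return p
```

The result always fits into 2n bits, e.g. `shift_and_add(15, 15, 4)` gives 225.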

6.3 Signed Array Multipliers

Modified Braun multiplier

• Subtract bits with negative weight ⇒ special FAs [1]
  1 neg. bit : $-a + b + c_{in} = 2c_{out} - s$
  2 neg. bits : $a - b - c_{in} = -2c_{out} + s$
• Replace FAs in regions 1, 2, and 3 by :
  $s = a \oplus b \oplus c_{in}$
  $c_{out} = \bar{a} b + \bar{a} c_{in} + b c_{in}$
  (input a at mark •)
• Otherwise exactly same structure and complexity as Braun multiplier ⇒ efficient and flexible

Baugh-Wooley multiplier

• Arithmetic transformations yield the following partial products (two additional ones) :

  $\begin{array}{cccccccc}
    &           &                &                & \bar{a}_0 b_3  & a_0 b_2 & a_0 b_1 & a_0 b_0 \\
    &           &                & \bar{a}_1 b_3  & a_1 b_2        & a_1 b_1 & a_1 b_0 &         \\
    &           & \bar{a}_2 b_3  & a_2 b_2        & a_2 b_1        & a_2 b_0 &         &         \\
    & a_3 b_3   & a_3 \bar{b}_2  & a_3 \bar{b}_1  & a_3 \bar{b}_0  &         &         &         \\
    & \bar{a}_3 &                &                & a_3            &         &         &         \\
  + \; 1 & \bar{b}_3 &           &                & b_3            &         &         &         \\
  \hline
  p_7 & p_6 & p_5 & p_4 & p_3 & p_2 & p_1 & p_0
  \end{array}$

- Less efficient and regular than modified Braun multiplier

6.4 Booth Recoding

• Speed-up technique : reduction of partial products

Sequential multiplication

• Minimal (or canonical) signed-digit (SD) represent. of A
+ One cycle per non-zero partial product (i.e. $\forall a_i \mid a_i \ne 0$)
- Negative partial products
- Data-dependent reduction of partial products and latency

Combinational multiplication

• Only fixed reduction of partial products possible
• Radix-4 modified Booth recoding : 2 bits recoded to one multiplier digit ⇒ n/2 partial products
  $A = \sum_{i=0}^{n/2} (a_{2i-1} + a_{2i} - 2a_{2i+1})\, 2^{2i}$ ; $a_{-1} = 0$
  with digits $(a_{2i-1} + a_{2i} - 2a_{2i+1}) \in \{-2, -1, 0, +1, +2\}$

  $a_{2i+1}$  $a_{2i}$  $a_{2i-1}$    $P_i$
  0           0         0             + 0
  0           0         1             + B
  0           1         0             + B
  0           1         1             + 2B
  1           0         0             - 2B
  1           0         1             - B
  1           1         0             - B
  1           1         1             - 0

  [Figure mulbooth.epsi: Booth recoding of A selects the partial products of B, which feed a CSA array/tree followed by a CPA]
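The radix-4 recoding formula can be exercised directly. The sketch below assumes an even word length n and a 2's complement multiplier A given as a bit pattern, in which case n/2 digits suffice; for an unsigned A the pattern is extended by two zero bits (one more digit). The function names are hypothetical.

```python
def booth_radix4_digits(a, n):
    """Radix-4 Booth digits d_i = a_{2i-1} + a_{2i} - 2 a_{2i+1} of an n-bit pattern (n even)."""
    def bit(k):
        return 0 if k < 0 else (a >> k) & 1    # a_{-1} = 0
    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1) for i in range(n // 2)]


def booth_multiply(a, b, n):
    """Multiply the signed n-bit pattern a by the integer b using n/2 partial products."""
    # Each partial product is 0, +-B, or +-2B (shift plus optional negation).
    return sum((d * b) << (2 * i) for i, d in enumerate(booth_radix4_digits(a, n)))
```

For instance, the 4-bit pattern `0b0111` (+7) recodes to digits `[-1, 2]`, i.e. $-1 + 2 \cdot 4 = 7$, and `0b1001` (-7) recodes to `[1, -2]`.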

• Applicable to sequential, array, and parallel multipliers
- additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation)
  A : +8n ; T : +7
+ adder array/tree cut in half
  ⇒ considerably smaller (array and tree)   A : /2
  ⇒ much faster for adder arrays            T : /2
  ⇒ slightly or not faster for adder trees  T : -0
• Negative partial products (avoid sign-extension) :
  $p_3\, p_3\, p_3\, p_3\, p_2\, p_1\, p_0 = 0\;0\;0\;(-p_3)\; p_2\, p_1\, p_0 = 1\;1\;1\;\bar{p}_3\; p_2\, p_1\, p_0 + 1\;0\;0\;0$
  (the three leading copies of $p_3$ are the extended sign)
  [Array: four sign-extended partial products $p_{0,3} \dots p_{0,0}$ to $p_{3,3} \dots p_{3,0}$ summed to p6 ... p0 are rewritten with complemented sign bits and constant 1s, so that no sign extension is needed]
• Suited for signed multiplication (incl. Booth recod.)
• Extend A for unsigned multiplication : $a_n = 0$
• Radix-8 (3-bit recoding) and higher radices : precomputing 3B, ... ⇒ larger overhead

6.5 Wallace Tree Addition

• Speed-up technique : fast partial product addition
  $A = O(n^2)$ ; $T = O(\log n)$
• Applicable to parallel multipliers : parallel partial product generation (normal or Booth recoded)
- Irregular adder tree (Wallace tree) due to different number of bits per column
  ⇒ irregular wiring and/or layout
  ⇒ non-uniform bit arrival times at final adder

6.6 Multiplier Implementations

• Sequential multipliers :
  • low performance, small area, resource sharing (adder)
• Braun or Baugh-Wooley multiplier (array multiplier) :
  • medium performance, high area, high regularity
  • layout generators ⇒ data paths and macro-cells
  • simple pipelining, faster CPA ⇒ higher speed
• Booth-Wallace multiplier (parallel multiplier) [9] :
  • high performance, high area, low regularity
  • custom multipliers, netlist generators
  • often pipelined (e.g. register between CSA-tree and CPA)
• Signed-unsigned multiplier : signed multiplier with operands extended by 1 bit ($a_n = a_{n-1}/0$, $b_n = b_{n-1}/0$)
6.7 Composition from Smaller Multipliers

• (2n × 2n)-bit multiplier can be composed from 4 (n × n)-bit multipliers (can be repeated recursively)
  $A \cdot B = (A_H 2^n + A_L) \cdot (B_H 2^n + B_L)$
  $\phantom{A \cdot B} = A_H B_H 2^{2n} + (A_H B_L + A_L B_H) 2^n + A_L B_L$
• 4 (n × n)-bit multipliers + (2n)-bit CSA + (3n)-bit CPA
  [Array: $A_H \cdot B_H$ and $A_L \cdot B_L$ side by side, $A_H \cdot B_L$ and $A_L \cdot B_H$ shifted by n bits]
- less efficient (area and speed)

6.8 Squaring

• $P = A^2 = A A$ : multiplier optimizations possible

             a0a3 a0a2 a0a1 a0
        a1a3 a1a2 a1   a1a0
   a2a3 a2   a2a1 a2a0
 + a3   a3a2 a3a1 a3a0
 ------------------------------
   p7 p6 p5 p4 p3 p2 p1 p0

  Since $a_i a_i = a_i$ and $a_i a_j + a_j a_i = 2 a_i a_j$, each pair of equal off-diagonal terms is merged into one term of the next higher weight, which leaves fewer rows to add.
+ $\lfloor n/2 \rfloor + 1$ partial products (if no Booth recoding used)
  ⇒ optimized squarer more efficient than multiplier
• Table look-up (ROM) less efficient for every n

7 Division / Square Root Extraction

7.1 Division Basics

  $\frac{A}{B} = Q + \frac{R}{B}$ ; $A = Q \cdot B + R$ ; $R < B$
  $R = A \;\mathrm{rem}\; B$ (remainder)

• $A \in [0, 2^{2n} - 1]$ ; $B, Q, R \in [0, 2^n - 1]$ ; $B \ne 0$
• $Q < 2^n \to A < 2^n B$ , otherwise overflow
  ⇒ normalize B before division ($B \in [2^{n-1}, 2^n - 1]$)

Algorithms (radix-2)

• Subtract-and-shift : partial remainders $R_i$ [1, 2]
• Sequential algorithm : recursive, f non-associative
  $q_i = (R_{i+1} \ge 2^i B)$ ; $R_i = R_{i+1} - q_i 2^i B$
  $R_n = A$ ; $R = R_0$ ; $i = n-1, \dots, 0$ (r.m.n.)
• Basic algorithm : compare and conditionally subtract ⇒ expensive comparison and CPA
• Restoring division : subtract and conditionally restore (adder or multiplexer) ⇒ expensive CPA and restoring
• Non-restoring division : detect sign, subtract/add, and correct by next steps ⇒ expensive CPA
• SRT division : estimate range, subtract/add (CSA), and correct by next steps ⇒ inexpensive CSA
7.2 Restoring Division

  $q_i = 1$ if $R_{i+1} - B 2^i \ge 0$
  $q_i = 0$ if $R_{i+1} - B 2^i < 0$

  step $i$ :     $R_{i+1} - B 2^i < 0$ : $q_i = 0$ ; $R_i = R_{i+1}$ (restored)
  step $i-1$ :   $R_{i+1} - B 2^{i-1} \ge 0$ : $q_{i-1} = 1$ ; $R_{i-1} = R_{i+1} - B 2^{i-1}$

7.3 Non-Restoring Division

  $q'_i = 1$ if $R_{i+1} \ge 0$
  $q'_i = -1$ if $R_{i+1} < 0$

  step $i$ :     $R_{i+1} \ge 0$ : $q'_i = 1$ ; $R_i = R_{i+1} - B 2^i$
  step $i-1$ :   $R_{i+1} - B 2^i < 0$ : $q'_{i-1} = -1$ ; $R_{i-1} = R_{i+1} - B 2^i + B 2^{i-1} = R_{i+1} - B 2^{i-1}$

• One subtraction/addition (CPA) per step
• Final correction step for R (additional CPA)
• Simple quotient digit conversion : (note : $q'_i$ irredundant)
  $q'_i \in \{-1, 1\} \to q_i \in \{0, 1\}$ : $q_i = \frac{1}{2}(q'_i + 1)$
  $Q = (\bar{q}_{n-1}, q_{n-2}, q_{n-3}, \dots, q_0, 1)$

  $A = (n + 1) A_{CPA} = O(n^2)$ or $O(n^2 \log n)$
  $T = (n + 1) T_{CPA} = O(n^2)$ or $O(n \log n)$
  [Figure divnr.epsi: inputs A and B, chain of +/- CPAs producing the quotient Q, final CPA producing the remainder R]

7.4 Signed Division

  $q'_i = 1$ if $R_{i+1}$, $B$ same sign
  $q'_i = -1$ if $R_{i+1}$, $B$ opposite sign

• Example : signed non-restoring array divider (simplifications : B > 0, final correction of R omitted)
  $A = 9n^2$ ; $T = 2n^2 + 4n$
  [Figure divarray.epsi: 4-bit array divider; inputs a6 ... a0 and b3 ... b0, rows of FA cells, outputs q3 ... q0 and r3 ... r0]
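Both recursions can be followed step by step on integers. The sketch below assumes unsigned operands with $A < 2^n B$; the function names are hypothetical, the non-restoring quotient is accumulated as a signed-digit sum rather than converted bit-wise, and the final correction adds B back when the last remainder is negative.

```python
def restoring_divide(a, b, n):
    """Restoring division: returns (Q, R) with a == Q*b + R and 0 <= R < b."""
    r, q = a, 0
    for i in reversed(range(n)):
        t = r - (b << i)          # trial subtraction of B * 2^i
        if t >= 0:
            q |= 1 << i           # q_i = 1, keep the difference
            r = t
        # else q_i = 0 and the old remainder is restored (r unchanged)
    return q, r


def nonrestoring_divide(a, b, n):
    """Non-restoring division with digits q'_i in {-1, +1} and final correction."""
    r, q = a, 0
    for i in reversed(range(n)):
        d = 1 if r >= 0 else -1   # q'_i follows the sign of the partial remainder
        r -= d * (b << i)         # subtract or add B * 2^i
        q += d * (1 << i)
    if r < 0:                     # final correction step for R (and Q)
        r += b
        q -= 1
    return q, r
```

Both functions agree with Python's `divmod(a, b)` whenever $A < 2^n B$.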
7.5 SRT Division (Sweeney, Robertson, Tocher)

  $q'_i = 1$ if $B 2^i \le R_{i+1}$
  $q'_i = 0$ if $-B 2^i \le R_{i+1} < B 2^i$
  $q'_i = -1$ if $R_{i+1} < -B 2^i$
  ($q'_i$ is SD number)

• If $2^{n-1} \le B < 2^n$ , i.e. B is normalized :
  ⇒ $-B 2^i \le -2^{n+i-1} \le R_{i+1} < 2^{n+i-1} \le B 2^i$
  ⇒ $q'_i = 1$ if $2^{n+i-1} \le R_{i+1}$
     $q'_i = 0$ if $-2^{n+i-1} \le R_{i+1} < 2^{n+i-1}$
     $q'_i = -1$ if $R_{i+1} < -2^{n+i-1}$
+ Only 3 MSB are compared ⇒ $q'_i$ are estimated ⇒ CSA instead of CPA can be used (precise enough) [19]
• Correction in following steps (+ final correction step)
- Redundant representation of $q'_i$ (SD representation) ⇒ final conversion necessary (CPA)
+ Highly regular and fast ($O(n)$) SRT array dividers ⇒ only slightly slower/larger than array multipliers

  $A = n A_{CSA} + 2 A_{CPA} = O(n^2)$
  $T = n T_{CSA} + T_{CPA} = O(n)$
  [Figure divsrt.epsi: inputs A and B, chain of +/- CSAs, CPA producing the quotient Q, final +/- CPA producing the remainder R]

7.6 High-Radix Division

• Radix $\beta = 2^m$ , $q'_i \in \{-(\beta - 1), \dots, -1, 0, 1, \dots, \beta - 1\}$
• m quotient bits per step ⇒ fewer, but more complex steps
+ Suitable for SRT algorithm ⇒ faster
- Complex comparisons (more bits) and decisions ⇒ table look-up (⇒ Pentium bug!)

7.7 Division by Multiplication

Division by convergence

  $Q = \frac{A}{B} = \frac{A \cdot R_0 R_1 \cdots R_{m-1}}{B \cdot R_0 R_1 \cdots R_{m-1}} \to \frac{A \cdot B^{-1}}{B \cdot B^{-1}} = \frac{Q}{1}$ resp. $\frac{Q}{2^n}$

• $B_{i+1} = B_i \cdot R_i = 2^n (1 - y) \cdot (1 + y) = 2^n (1 - y^2)$ ; $B_{i+1} > B_i$ ; $B_i \to 2^n$
  with $B_i = 2^n (1 - y)$ , $R_i = 1 + y$ , $B_{i+1} = 2^n (1 - y^2)$
  $y = 1 - B_i 2^{-n}$ ; $R_i = 2 - B_i 2^{-n} = \bar{B}_i + 1$ (signed)
• Algorithm :
  $B_{i+1} = B_i \cdot R_i$ ; $A_{i+1} = A_i \cdot R_i$
  $R_i = \bar{B}_i + 1$ ; $i = 0, \dots, m-1$
  $A_0 = A$ ; $B_0 = B$ ; $Q = A_m$ (r.s.n.)
• Quadratic convergence : $L = \lceil \log n \rceil$
Computer Arithmetic: Principles, Architectures, and VLSI Design 80 Computer Arithmetic: Principles, Architectures, and VLSI Design 81
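The division-by-convergence recurrence can be checked numerically with exact rational arithmetic. The sketch below is not taken from the notes: the function name `divide_by_convergence` is hypothetical, `fractions.Fraction` replaces the fixed-point multipliers, and the final multiplication by $2^{-n}$ is an assumed reading of the scaling that the notes abbreviate as "r.s.n.". The divisor must be normalized, $2^{n-1} \le B < 2^n$.

```python
from fractions import Fraction


def divide_by_convergence(a: int, b: int, n: int, steps: int) -> Fraction:
    """Approximate a / b by driving the denominator B_i towards 2**n."""
    if not (1 << (n - 1)) <= b < (1 << n):
        raise ValueError("b must be normalized: 2**(n-1) <= b < 2**n")
    scale = Fraction(1, 1 << n)      # the factor 2^-n
    a_i, b_i = Fraction(a), Fraction(b)
    for _ in range(steps):
        r_i = 2 - b_i * scale        # R_i = 2 - B_i * 2^-n
        a_i *= r_i                   # A_{i+1} = A_i * R_i
        b_i *= r_i                   # B_{i+1} = B_i * R_i  (-> 2^n)
    return a_i * scale               # assumed scaling: A_m * 2^-n ~ A / B
```

With $y_0 = 1 - B\,2^{-n}$, the result after m steps equals $(A/B)(1 - y_0^{2^m})$, so the relative error is squared in every step, which is the quadratic convergence stated above.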


Division by reciprocation

$Q = \dfrac{A}{B} = A \cdot \dfrac{1}{B}$

• Newton-Raphson iteration method:
  find $f(X) = 0$ by recursion $X_{i+1} = X_i - \dfrac{f(X_i)}{f'(X_i)}$
• $f(X) = \dfrac{1}{X} - B$ ; $f'(X) = -\dfrac{1}{X^2}$ ; $f\!\left(\dfrac{1}{B}\right) = 0$
• Algorithm:
  $X_{i+1} = X_i \cdot (2 - B \cdot X_i)$ ; $i = 0, \ldots, m-1$
  $X_0 = B$ ; $X_m \rightarrow 1/B$ ; $Q = A \cdot X_m$ (r.s.n.)
• Quadratic convergence: $L = O(\log n)$
• Speed-up: first approximation $X_0$ from table

7.8 Remainder / Modulus

Remainder (rem): signed remainder of a division

$R = A \text{ rem } B = A - \lfloor A/B \rfloor \cdot B$ ; $\text{sign}(R) = \text{sign}(A)$

Modulus (mod): positive remainder of a division

$M = A \bmod B$ ; $M \ge 0$ ; $M = \begin{cases} R & \text{if } A \ge 0 \\ R + B & \text{else} \end{cases}$

7.9 Divider Implementations

• Iterative dividers (through multiplication):
  - resource sharing of existing components (multiplier)
  - medium performance, medium area
  - high efficiency if components are shared
• Sequential dividers (restoring, non-restoring, SRT):
  - resource sharing of existing components (e.g. adder)
  - low performance, low area
• Array dividers (restoring, non-restoring, SRT):
  - dedicated hardware component
  - high performance, high area
  - high regularity ⇒ layout generators, pipelining
  - square root extraction possible by minor changes
  - combination with multiplication or/and square root
• No parallel dividers exist, as compared to parallel multipliers (sequential nature of division)
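As a numerical illustration of the Newton-Raphson reciprocal iteration in Section 7.7, the sketch below iterates $X_{i+1} = X_i (2 - B X_i)$ with exact rationals. Several things here are assumptions rather than statements from the notes: the function name `reciprocal_newton`, the normalization $1/2 \le B < 1$ (chosen so that the start value $X_0 = B$ converges), and the optional `x0` argument that stands in for a table look-up.

```python
from fractions import Fraction
from typing import Optional


def reciprocal_newton(b: Fraction, steps: int,
                      x0: Optional[Fraction] = None) -> Fraction:
    """Approximate 1/b with X_{i+1} = X_i * (2 - b * X_i)."""
    if not Fraction(1, 2) <= b < 1:
        raise ValueError("assumed normalization: 1/2 <= b < 1")
    x = b if x0 is None else x0      # X_0 = B unless a better seed is given
    for _ in range(steps):
        x = x * (2 - b * x)          # error 1 - b*x is squared each step
    return x


def divide_by_reciprocal(a: Fraction, b: Fraction, steps: int) -> Fraction:
    """Q = A * (1/B), using the iterated reciprocal."""
    return a * reciprocal_newton(b, steps)
```

Because the error $e_i = 1 - B X_i$ satisfies $e_{i+1} = e_i^2$, a seed closer to $1/B$ (for example from a small table) reduces the number of steps needed for a given precision.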

7.10 Square Root Extraction

$\sqrt{A - R} = Q$ ; $A = Q^2 + R$

• $A \in [0, 2^{2n} - 1]$ ; $Q \in [0, 2^n - 1]$

Algorithm

• Subtract-and-shift: partial remainders $R_i$ and quotients $Q_i = Q_{i+1} + q_i 2^i = (q_{n-1}, \ldots, q_i, 0, \ldots, 0)$ [1]
• $Q_i^2 = \left(Q_{i+1} + q_i 2^i\right)^2 = Q_{i+1}^2 + q_i 2^i \left(2 Q_{i+1} + q_i 2^i\right)$

  $q_i = \begin{cases} 1 & \text{if } R_{i+1} \ge 2^i \left(2 Q_{i+1} + 2^i\right) \\ 0 & \text{else} \end{cases}$ ; $Q_i = Q_{i+1} + q_i 2^i$
  $R_i = R_{i+1} - q_i 2^i \left(2 Q_{i+1} + q_i 2^i\right)$ ; $i = n-1, \ldots, 0$
  $R_n = A$ ; $Q_n = 0$ ; $R = R_0$ ; $Q = Q_0$ (r.m.n.)

Implementation

+ Similar to division ⇒ same algorithms applicable (restoring, non-restoring, SRT, high-radix)
+ Combination with division in same component possible
+ Only triangular array required (step i : $q_k = 0$ for $k < i$)

$A \approx A_{DIV} / 2$
$T \approx T_{DIV}$

[Figure sqrtnr.epsi: non-restoring square-root array; input A feeds a triangular arrangement of +/- CPA rows that produce Q and R]

8 Elementary Functions

• Exponential function: $e^x$ (exp x)
• Logarithm function: ln x, log x
• Trigonometric functions: sin x, cos x, tan x
• Inverse trig. functions: arcsin x, arccos x, arctan x
• Hyperbolic functions: sinh x, cosh x, tanh x

8.1 Algorithms

• Table look-up: inefficient for large word lengths [5]
• Taylor series expansion: complex implementation
• Polynomial and rational approximations [1, 5]
• Shift-and-add algorithms [5]
• Convergence algorithms [1, 2]:
  - similar to division-by-convergence
  - two (or more) recursive formulas: one formula converges to a constant, the other to the result
• Coordinate rotation (CORDIC) [2, 5, 20]:
  - 3 equations for x-, y-coordinate, and angle
  - computes all elementary functions by proper input settings and choice of modes and outputs
  - simple, universal hardware, small look-up table
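The notes name the three CORDIC equations without writing them out. The sketch below uses the commonly cited rotation-mode recurrences as an assumption, not as the formulation of the cited references: $x_{i+1} = x_i - d_i y_i 2^{-i}$, $y_{i+1} = y_i + d_i x_i 2^{-i}$, $z_{i+1} = z_i - d_i \arctan 2^{-i}$ with $d_i = \pm 1$ chosen from the sign of $z_i$. Floating-point numbers stand in for the fixed-point shift-and-add hardware, and the function name `cordic_sin_cos` is hypothetical.

```python
import math
from typing import Tuple


def cordic_sin_cos(theta: float, iterations: int = 32) -> Tuple[float, float]:
    """Rotation-mode CORDIC sketch: returns (sin(theta), cos(theta)).

    Intended for |theta| <= pi/2, which lies inside the convergence range.
    """
    # Small look-up table of elementary rotation angles arctan(2^-i).
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Accumulated gain of all micro-rotations; compensated via the start value.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0          # rotate towards z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x
```

The multiplications by `2.0 ** -i` correspond to wired shifts in hardware, so each iteration needs only three additions or subtractions and one table entry. Other elementary functions follow from different start values and from the vectoring mode, which this sketch does not cover.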


8.2 Integer Exponentiation

• Approximated exponentiation: $x^y = e^{y \ln x} = 2^{y \log x}$
• Base-2 integer exponentiation: $2^A = (\ldots, 0, 1, 0, \ldots)$ (single 1 at bit position $A$)
• Integer exponentiation (exact):
  $A^B = \underbrace{A \cdot A \cdots A}_{B}$ ; $L = 0 \ldots 2^n - 1$ (!)

Applications: modular exponentiation $A^B \pmod{C}$ in cryptographic algorithms (e.g. IDEA, RSA)

Algorithms: square-and-multiply

a) $E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0} = A^{2^{n-1} b_{n-1}} \cdot A^{2^{n-2} b_{n-2}} \cdots A^{4 b_2} \cdot A^{2 b_1} \cdot A^{b_0}$

   $E_i = P_i^{b_i} \cdot E_{i-1}$ ; $P_{i+1} = P_i^2$ ; $i = 0, \ldots, n-1$
   $E_{-1} = 1$ ; $P_0 = A$ ; $E = E_{n-1}$ (r.s.n.)

   $A = 2 A_{MUL}$ ; $T = T_{MUL}$ ; $L = n$ or
   $A = A_{MUL}$ ; $T = T_{MUL}$ ; $L = 2n$

b) $E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0} = \left(\cdots\left(\left(A^{b_{n-1}}\right)^2 \cdot A^{b_{n-2}}\right)^2 \cdots A^{b_1}\right)^2 \cdot A^{b_0}$

   $E_i = E_{i+1}^2 \cdot A^{b_i}$ ; $i = n-1, \ldots, 0$
   $E_n = 1$ ; $E = E_0$ (r.s.n.)

   $A = A_{MUL}$ ; $T = T_{MUL}$ ; $L = 2(n - 1)$

8.3 Integer Logarithm

$Z = \lfloor \log_2 A \rfloor$

• For detection/comparison of order of magnitude
• Corresponds to leading-zeroes detection (LZD) with encoded output
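The two square-and-multiply schemes of Section 8.2 and the integer logarithm of Section 8.3 translate almost line by line into software. This is a minimal sketch: the function names and the optional `modulus` parameter (for the modular exponentiation mentioned under applications) are assumptions, and Python's arbitrary-precision integers replace the n-bit multiplier.

```python
from typing import Optional


def _mul(x: int, y: int, modulus: Optional[int]) -> int:
    """One multiplier operation, optionally reduced modulo `modulus`."""
    return x * y if modulus is None else (x * y) % modulus


def power_lsb_first(a: int, b: int, n: int, modulus: Optional[int] = None) -> int:
    """Scheme a): E_i = P_i^{b_i} * E_{i-1}, P_{i+1} = P_i^2, i = 0 .. n-1."""
    e, p = 1, a                          # E_{-1} = 1, P_0 = A
    for i in range(n):
        if (b >> i) & 1:
            e = _mul(e, p, modulus)
        p = _mul(p, p, modulus)          # the two products are independent
    return e


def power_msb_first(a: int, b: int, n: int, modulus: Optional[int] = None) -> int:
    """Scheme b): E_i = E_{i+1}^2 * A^{b_i}, i = n-1 .. 0."""
    e = 1                                # E_n = 1
    for i in range(n - 1, -1, -1):
        e = _mul(e, e, modulus)          # square, then conditionally multiply
        if (b >> i) & 1:
            e = _mul(e, a, modulus)
    return e


def ilog2(a: int) -> int:
    """Z = floor(log2 A): position of the leading one (LZD with encoded output)."""
    if a <= 0:
        raise ValueError("a must be positive")
    return a.bit_length() - 1
```

In scheme a) the squaring and the conditional multiplication of one step do not depend on each other, which matches the choice between two multipliers with L = n and one multiplier with L = 2n; in scheme b) the multiplication needs the result of the squaring, so a single multiplier is used sequentially.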

9 VLSI Design Aspects

9.1 Design Levels

Transistor-level design

• Circuit and layout designed by hand (full custom)
• Low design efficiency
• High circuit performance: high speed, low area
• High flexibility: choice of architecture and logic style
• Transistor-level circuit optimizations:
  - logic style: static vs. dynamic logic, complementary CMOS vs. pass-transistor logic
  - special arithmetic circuits: better than with gates

[Figure carrychain.epsi: transistor-level carry chain from c_in to c_out, controlled by generate (g), propagate (p), and kill (k) signals of bits i and i-1]

[Figure facmos.epsi: transistor-level CMOS full-adder with inputs a, b, c_in and outputs s, c_out]

Gate-level design

• Cell-based design techniques: standard-cells, gate-array/sea-of-gates, field-programmable gate-array (FPGA)
• Circuit implemented by hand or by synthesis (library)
• Layout implemented by automated place-and-route
• Medium to high design efficiency
• Medium to low circuit performance
• Medium to low flexibility: full choice of architecture

Block-level design

• Layout blocks and netlists from parameterized automatic generators or compilers (library)
• High design efficiency
• Medium to high circuit performance
• Low flexibility: limited choice of architectures
• Implementations:
  - data-path: bit-sliced, bus-oriented layout (array of cells: n bits × m operations), implementation of entire data paths, medium performance, medium diversity
  - macro-cells: tiled layout, fixed/single-operation components, high performance, small diversity
  - portable netlists: ⇒ gate-level design


9.2 Synthesis

High-level synthesis

• Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL
• Involves architectural synthesis and arithmetic transformations
• High-level synthesis is still in the beginnings

Low-level synthesis

• Layout and netlist generators
• Included in libraries and synthesis tools
• Low-level synthesis is state-of-the-art
• Basis for efficient ASIC design
• Limited diversity and flexibility of library components

Circuit optimization

• Efficient optimization of random logic is state-of-the-art
• Optimization of entire arithmetic circuits is not feasible ⇒ only local optimizations possible
• Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators

9.3 VHDL

Arithmetic types: unsigned, signed (2's complement)

Arithmetic packages

• numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys)
• contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types

Arithmetic operators (VHDL87/93) [21]

  relational              : =, /=, <, <=, >, >=
  shift, rotate (93 only) : rol, ror, sla, sll, sra, srl
  adding                  : +, -
  sign (unary)            : +, -
  multiplying             : *, /, mod, rem
  exponent, absolute      : **, abs

Synthesis

• Typical limitations of synthesis tools:
  - /, mod, rem : both operands must be constant or divisor must be a power of two
  - ** : for power-of-two bases only
• Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)

Resource sharing

• Sharing one resource for multiple operations
• Done automatically by some synthesis tools
• Otherwise, appropriate coding is necessary:

  a)  S <= A + C when SELA = '1' else B + C;
      ⇒ 2 adders + 1 multiplexer

  b)  T <= A when SELA = '1' else B;
      S <= T + C;
      ⇒ 1 multiplexer + 1 adder

Coding & synthesis hints

• Addition: single adder with carry-in/carry-out:

      Aext <= resize(A, width+1) & Cin;
      Bext <= resize(B, width+1) & '1';
      Sext <= Aext + Bext;
      S    <= Sext(width downto 1);
      Cout <= Sext(width+1);

• Synthesis: check synthesis result for allocated arithmetic units ⇒ code sanity check, control of circuit size

VHDL library of arithmetic units

• Structural, synthesizable VHDL code for most circuits described in this text is found in [22]

9.4 Performance

Pipelining

• Pipelining is basically possible with every combinational circuit ⇒ higher throughput
• Arithmetic circuits are well suited for pipelining due to high regularity
• Pipelining of arithmetic circuits can be very costly:
  - large amount of internal signals in arithmetic circuits
  - array structures: many small pipeline registers
  - tree structures: few large pipeline registers ⇒ no advantage of tree structures anymore (except for smaller latency)
• Fine-grain pipelining ⇒ systolic arrays (often applied to arithmetic circuits)

High speed

• Fast circuit architectures, pipelining, replication (parallelization), and combinations of those
• Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
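The carry-in coding hint of Section 9.3 relies on a small arithmetic identity: appending `Cin` to A and a constant 1 to B as extra least significant bits makes a single addition produce a carry into bit 1 exactly when `Cin` is 1. The sketch below models this with Python integers; the function name `add_with_carry` is hypothetical, and unsigned operands of `width` bits are assumed.

```python
from typing import Tuple


def add_with_carry(a: int, b: int, cin: int, width: int) -> Tuple[int, int]:
    """Model of the single-adder carry-in/carry-out trick; returns (S, Cout)."""
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit and cin in (0, 1)):
        raise ValueError("operands out of range")
    aext = (a << 1) | cin            # resize(A, width+1) & Cin
    bext = (b << 1) | 1              # resize(B, width+1) & '1'
    sext = aext + bext               # one adder of width+2 bits
    s = (sext >> 1) & (limit - 1)    # Sext(width downto 1)
    cout = (sext >> (width + 1)) & 1 # Sext(width+1)
    return s, cout
```

Since `sext` equals `2*(a + b) + cin + 1`, dropping bit 0 leaves `a + b + cin`, so a synthesis tool only has to allocate one adder instead of an adder followed by an incrementer.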


Low power

Power-related properties of arithmetic circuits:

• High glitching activity due to high bit dependencies and large logic depth

Power reduction in arithmetic circuits [23]:

• Reduce the switched capacitance by choosing an area efficient circuit architecture
• Allow for lower supply voltage by speeding up the circuitry
• Reduce the transition activity:
  - apply stable inputs while circuit is not in use (⇒ disabling subcircuits)
  - reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
  - reduce glitching transitions by reducing logic depth (pipelining)
  - take advantage of correlated data streams
  - choose appropriate number representations (e.g. Gray codes for counters)

9.5 Testability

Testability goal: high fault coverage with few test vectors that are easy to generate/apply

Random test vectors: easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits

Special test vectors: sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits

Hard-detectable faults found in:

• circuits of arithmetic operations with inherent special cases (arithmetic exceptions): detectors, comparators, incrementers and counters (MSBs), adder flags
• circuits using redundant number representations (≠ redundant hardware): dividers (Pentium bug!)
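The remark on Gray codes for counters in the low-power list can be made concrete by counting bit transitions. The sketch below is only an illustration with hypothetical helper names; it counts output bit flips of a counter sequence and ignores glitches and internal nodes, so it is not a power model.

```python
from typing import List


def binary_to_gray(value: int) -> int:
    """Reflected binary Gray code of a non-negative integer."""
    return value ^ (value >> 1)


def total_bit_flips(codes: List[int]) -> int:
    """Number of bit positions that change between consecutive code words."""
    return sum(bin(prev ^ nxt).count("1") for prev, nxt in zip(codes, codes[1:]))


def compare_counters(width: int) -> tuple:
    """Bit flips of a binary and a Gray counter over one pass 0 .. 2**width - 1."""
    binary = list(range(1 << width))
    gray = [binary_to_gray(v) for v in binary]
    return total_bit_flips(binary), total_bit_flips(gray)
```

For a 4-bit counter running once from 0 to 15, `compare_counters(4)` gives 26 output transitions for the binary sequence and 15 for the Gray sequence, because consecutive Gray code words differ in exactly one bit.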

Bibliography

[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993.

[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.

[3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.

[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.

[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhäuser Boston, 1997.

[6] Proceedings of the Xth Symposium on Computer Arithmetic.

[7] IEEE Transactions on Computers.

[8] D. R. Lutz and D. N. Jayasimha, "Programmable modulo-k counters", IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939-941, Nov. 1996.

[9] H. Makino et al., "An 8.8-ns 54 × 54-bit multiplier with high speed redundant binary architecture", IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773-783, June 1996.

[10] W. N. Holmes, "Composite arithmetic: Proposal for a new standard", IEEE Computer, vol. 30, no. 3, pp. 65-73, Mar. 1997.

[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.

[12] A. Tyagi, "A reduced-area scheme for carry-select adders", IEEE Trans. Comput., vol. 42, no. 10, pp. 1162-1170, Oct. 1993.

[13] T. Han and D. A. Carlson, "Fast area-efficient VLSI adders", in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49-56.

[14] D. W. Dobberpuhl et al., "A 200-MHz 64-b dual-issue CMOS microprocessor", IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1564, Nov. 1992.

[15] A. De Gloria and M. Olivieri, "Statistical carry lookahead adders", IEEE Trans. Comput., vol. 45, no. 3, pp. 340-347, Mar. 1996.

[16] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach", IEEE Trans. Comput., vol. 45, no. 3, pp. 294-305, Mar. 1996.

[17] Z. Wang, G. A. Jullien, and W. C. Miller, "A new design technique for column compression multipliers", IEEE Trans. Comput., vol. 44, no. 8, pp. 962-970, Aug. 1995.

[18] J. Cortadella and J. M. Llaberia, "Evaluation of A + B = K conditions without carry propagation", IEEE Trans. Comput., vol. 41, no. 11, pp. 1484-1488, Nov. 1992.

[19] S. E. McQuillan and J. V. McCanny, "Fast VLSI algorithms for division and square root", J. VLSI Signal Processing, vol. 8, pp. 151-168, Oct. 1994.

[20] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing", IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16-35, July 1992.

[21] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997.

[22] R. Zimmermann, VHDL Library of Arithmetic Units, http://www.iis.ee.ethz.ch/zimmi/arith lib.html.

[23] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.
