Lecture notes on
Computer Arithmetic:
Principles, Architectures,
and VLSI Design
Reto Zimmermann
Copyright © 1999 by Integrated Systems Laboratory, ETH Zurich
http://www.iis.ee.ethz.ch/~zimmi/publications/comp_arith_notes.ps.gz
Contents
1 Introduction and Conventions
  1.1 Outline
  1.2 Motivation
  1.3 Conventions
  1.4 Recursive Function Evaluation
2 Arithmetic Operations
  2.1 Overview
  2.2 Implementation Techniques
3 Number Representations
  3.1 Binary Number Systems (BNS)
  3.2 Gray Numbers
  3.3 Redundant Number Systems
  3.4 Residue Number Systems (RNS)
  3.5 Floating-Point Numbers
  3.6 Logarithmic Number System
  3.7 Antitetrational Number System
  3.8 Composite Arithmetic
  3.9 Round-Off Schemes
4 Addition
  4.1 Overview
  4.2 1-Bit Adders, (m, k)-Counters
  4.3 Carry-Propagate Adders (CPA)
  4.4 Carry-Save Adder (CSA)
  4.5 Multi-Operand Adders
  4.6 Sequential Adders
5 Simple / Addition-Based Operations
  5.1 Complement and Subtraction
  5.2 Increment / Decrement
  5.3 Counting
  5.4 Comparison, Coding, Detection
  5.5 Shift, Extension, Saturation
  5.6 Addition Flags
  5.7 Arithmetic Logic Unit (ALU)
6 Multiplication
  6.1 Multiplication Basics
  6.2 Unsigned Array Multiplier
  6.3 Signed Array Multipliers
  6.4 Booth Recoding
  6.5 Wallace Tree Addition
  6.6 Multiplier Implementations
  6.7 Composition from Smaller Multipliers
  6.8 Squaring
7 Division / Square Root Extraction
  7.1 Division Basics
1 Introduction and Conventions
1.1 Outline
Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7]
Circuit architectures and implementations of main arithmetic operations
Aspects regarding VLSI design of arithmetic units

1.2 Motivation
Arithmetic units are, among others, core of every data path and addressing unit
Data path is core of :
  microprocessors (CPU)
  signal processors (DSP)
  data-processing application specific ICs (ASIC) and programmable ICs (e.g. FPGA)
Standard arithmetic units available from libraries
Design of arithmetic units necessary for :
  non-standard operations
  high-performance components
  library development

1.3 Conventions
Signal buses : $A$ (1-D), $A_i$ (2-D), $a_{i:k}$ (subbus, 1-D)
Signals : $a$, $a_i$ (1-D), $a_{i,k}$ (2-D), $A_{i:k}$ (group signal)
Circuit complexity measures : $A$ (area), $T$ (cycle time, delay), $AT$ (area-time product), $L$ (latency, # cycles)
Arithmetic operators : $+$, $-$, $\cdot$, $/$, $\log$ ($= \log_2$)
Logic operators : $+$ (or), $\cdot$ (and), $\oplus$ (xor), $\odot$ (xnor), overbar (not)

Circuit complexity measures
Unit-gate model ($\approx$ gate-equivalents (GE) model) :
  Inverter, buffer : $A = 0$ ; $T = 0$ (i.e. ignored)
  Simple monotonic 2-input gates (AND, NAND, OR, NOR) : $A = 1$ ; $T = 1$
  Simple non-monotonic 2-input gates (XOR, XNOR) : $A = 2$ ; $T = 2$
  Complex gates : composed from simple gates
  ⇒ Simple m-input gates : $A = m - 1$ ; $T = \lceil \log m \rceil$
  Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization)
  Only estimations given for complex circuits
1.4 Recursive Function Evaluation
[Figures: evaluation structures for inputs a3 ... a0 and outputs z3 ... z0 or a single output z (funn.epsi, funrsn.epsi, funrmn.epsi, funrma2.epsi)]
funn.epsi : $A = O(n)$ ; $T = O(1)$
Recursive functions (r.)
1. f is non-associative (r.s.n.)
  ⇒ serial structure (funrsn.epsi) : $A = O(n)$ ; $T = O(n)$
⇒ serial structure (funrmn.epsi) : $A = O(n)$ ; $T = O(n)$
⇒ or shared-tree structure (funrma2.epsi) : $A = O(n \log n)$ ; $T = O(\log n)$
2 Arithmetic Operations 2.2 Implementation Techniques
3 Number Representations
3.1 Binary Number Systems (BNS)
[Figure numrep.epsi: value ranges of the codes 011...1, 100...0, 111...1 for the binary number representations unsigned, 2's complement, 1's complement, and sign-magnitude, on an axis marked $-2^{n-1}$, $0$, $2^{n-1}$, $2^n$]

Conventions
2's complement used for signed numbers in these notes
Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned

3.2 Gray Numbers
... (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
Non-monotonic numbers : difficult arithmetic operations, e.g. addition, comparison. For two Gray numbers $g_1 g_0$ and $g'_1 g'_0$ :
  $0\,0 < 0\,1$ and $0 < 1$
  $1\,1 < 1\,0$ but $1 > 0$
binary → Gray : $g_i = b_{i+1} \oplus b_i$ ; $b_n = 0$ ; $i = 0, \ldots, n-1$ (n.)
Gray → binary : $b_i = b_{i+1} \oplus g_i$ ; $b_n = 0$ ; $i = n-1, \ldots, 0$ (r.m.a.)

        binary         Gray
        b3 b2 b1 b0    g3 g2 g1 g0
    0    0  0  0  0     0  0  0  0
    1    0  0  0  1     0  0  0  1
    2    0  0  1  0     0  0  1  1
    3    0  0  1  1     0  0  1  0
    4    0  1  0  0     0  1  1  0
    5    0  1  0  1     0  1  1  1
    6    0  1  1  0     0  1  0  1
    7    0  1  1  1     0  1  0  0
    8    1  0  0  0     1  1  0  0
    9    1  0  0  1     1  1  0  1
   10    1  0  1  0     1  1  1  1
   11    1  0  1  1     1  1  1  0
   12    1  1  0  0     1  0  1  0
   13    1  1  0  1     1  0  1  1
   14    1  1  1  0     1  0  0  1
   15    1  1  1  1     1  0  0  0
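The two Gray conversion rules can be checked with a short Python sketch. The function names and the use of Python integers as bit vectors are illustrative assumptions, not part of the notes.

```python
def binary_to_gray(b: int) -> int:
    """g_i = b_{i+1} XOR b_i with b_n = 0, evaluated for all bits at once."""
    return b ^ (b >> 1)


def gray_to_binary(g: int, n: int) -> int:
    """b_i = b_{i+1} XOR g_i with b_n = 0, evaluated serially from the MSB down."""
    b = 0
    prev = 0                      # holds b_{i+1}, starts as b_n = 0
    for i in range(n - 1, -1, -1):
        prev ^= (g >> i) & 1      # now holds b_i
        b |= prev << i
    return b


# Reproduce the 4-bit table: value, binary word, Gray word.
for value in range(16):
    print(value, format(value, "04b"), format(binary_to_gray(value), "04b"))
```

The binary-to-Gray direction needs no recursion (all bits in parallel), while the reverse direction is a prefix XOR, which is why the notes classify it as recursive with multiple outputs.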
3.3 Redundant Number Systems
Non-binary, redundant, weighted number systems [1, 2]
Digit set larger than radix (typically radix 2) ⇒ multiple representations of same number ⇒ redundancy
+ No carry-propagation in adders ⇒ more efficient impl. of adder-based units (e.g. multipliers and dividers)
- Redundancy ⇒ no direct implementation of relational operators ⇒ conversion to irredundant numbers
- Several bits used to represent one digit ⇒ higher storage

1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)
standard redundant number system for fast addition
Signed-digit (SD) or redundant digit (RD) number representation :
  $r_i, s_i, t_i \in \{-1, 0, 1\} \equiv \{\bar{1}, 0, 1\}$ , $R = \sum_{i=0}^{n-1} r_i 2^i$
  no carry-propagation in $S = R + T$ :
  $r_i + t_i = (c_{i+1}, u_i) = 2c_{i+1} + u_i$ , $c_{i+1}, u_i \in \{\bar{1}, 0, 1\}$
3.8 Composite Arithmetic
  logarithmic : 10 | log integer | log fraction
  antitetrational : 11 | a.t. integer | a.t. fraction
Rational numbers : slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)
Storage form sizes : 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
Implementation : mixed hardware/software solutions
Hardware proposal : long accumulator (4096 bits) holds any floating-point number in fixed-point format ⇒ higher accuracy ⇒ large hardware/software overhead

3.9 Round-Off Schemes
$+\,0.1_2$ can often be included in previous operation
Round-to-nearest-even/-odd :
  $R_{ROUND-EVEN} = R_{ROUND}$ if $(a'_{-1}, \ldots, a'_{-d}) \neq 0 \cdots 0$ , and $(a'_{n-1}, \ldots, a'_1, 0)$ otherwise
  bias = 0 (symmetric)
  mandatory in IEEE floating-point standard
3 guard bits for rounding after floating-point operations : guard bit G (postnormalization), round bit R (round-to-nearest), sticky bit S (round-to-nearest-even)
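The round-to-nearest-even rule can be illustrated with a small fixed-point sketch in Python. It assumes that the primed bits $a'$ are the bits of $A' = A + 0.1_2$ and that the operand is passed as an integer scaled by $2^d$ (d fraction bits); both the interpretation and the function name are assumptions made for this example.

```python
def round_nearest_even(a: int, d: int) -> int:
    """Round the fixed-point value a / 2**d (d >= 1 fraction bits) to an integer."""
    a_prime = a + (1 << (d - 1))            # A' = A + 0.1 (binary)
    r_round = a_prime >> d                  # round-to-nearest, ties rounded up
    if a_prime & ((1 << d) - 1) != 0:       # fraction bits of A' are not all zero
        return r_round
    return r_round & ~1                     # tie case: force the LSB to 0 (even)


print(round_nearest_even(5, 1))    # 2.5 is a tie and goes to the even neighbour 2
print(round_nearest_even(7, 1))    # 3.5 is a tie and goes to the even neighbour 4
print(round_nearest_even(11, 2))   # 2.75 is not a tie and rounds to 3
```

Because ties are resolved alternately downwards and upwards, the rounding error has no systematic bias, which matches the "bias = 0 (symmetric)" remark.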
4 Addition
4.1 Overview
[Figure: overview of adder architectures; the arrows distinguish "based on component" and "related component"]
Legend:
  HA : half-adder
  FA : full-adder
  (m,k) : (m,k)-counter
  (m,2) : (m,2)-compressor
  CPA : carry-propagate adder
  RCA : ripple-carry adder
  CSKA : carry-skip adder
  CSLA : carry-select adder
  CIA : carry-increment adder
  CLA : carry-lookahead adder
  PPA : parallel-prefix adder
  COSA : conditional-sum adder
  CSA : carry-save adder
4.2 1-Bit Adders, (m, k)-Counters
Full-adder (FA), (3, 2)-counter
$(c_{out}, s) = 2c_{out} + s = a + b + c_{in}$
$A = 7$ ; $T = 4\ (2)$
$g = ab$ (generate) ; $p = a \oplus b$ (propagate)
$c^0 = ab$ ; $c^1 = a + b$
$s = a \oplus b \oplus c_{in} = p \oplus c_{in}$
$c_{out} = ab + a c_{in} + b c_{in} = ab + (a \oplus b) c_{in}$
$\quad = g + p c_{in} = \bar{p} g + p c_{in} = \bar{p} a + p c_{in}$
$\quad = \bar{c}_{in} c^0 + c_{in} c^1$
[Figures: full-adder symbol (fasymbol.epsi) and schematics faschematic1.epsi ... faschematic5.epsi, one of them marked as reference and one built from two HAs, with inputs a, b, c_in, internal signals g, p, c^0, c^1, and outputs c_out, s]

(m, k)-counters
$(s_{k-1}, \ldots, s_0)$ : $\sum_{j=0}^{k-1} s_j 2^j = \sum_{i=0}^{m-1} a_i$
[Figure cntsymbol.epsi: (m,k)-counter symbol with inputs a_0 ... a_{m-1} and outputs s_{k-1} ... s_0]
Usually built from full-adders
Associativity of addition allows conversion from linear to tree structure ⇒ faster at same number of FAs
$A = 7 \sum_{k=1}^{\log m} \lfloor m 2^{-k} \rfloor \approx 7(m - \log m)$
$T_{LIN} = 4m + 2\lfloor \log m \rfloor$ ; $T_{TREE} = 4\lceil \log_3 m \rceil + 2\lfloor \log m \rfloor$
Example : (7, 3)-counter
  linear structure (count73ser.epsi) : $A = 28$ ; $T = 14$
  tree structure (count73par.epsi) : $A = 28$ ; $T = 10$
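The full-adder equations and the construction of a (7, 3)-counter from four full-adders can be checked exhaustively with the following Python sketch. The wiring of the four full-adders is one possible tree arrangement chosen for illustration and is not necessarily the one drawn in the figure.

```python
from itertools import product


def full_adder(a, b, c_in):
    """(c_out, s) with 2*c_out + s = a + b + c_in, using generate/propagate."""
    g = a & b                 # generate
    p = a ^ b                 # propagate
    return g | (p & c_in), p ^ c_in


def carry_out_forms(a, b, c_in):
    """The equivalent carry-out expressions listed for the full-adder."""
    g, p = a & b, a ^ b
    c0, c1 = a & b, a | b     # carry-out for c_in = 0 and for c_in = 1
    not_p, not_c = 1 - p, 1 - c_in
    return [
        (a & b) | (a & c_in) | (b & c_in),
        g | (p & c_in),
        (not_p & g) | (p & c_in),
        (not_p & a) | (p & c_in),
        (not_c & c0) | (c_in & c1),
    ]


def counter_7_3(bits):
    """(7,3)-counter from four full-adders; returns (s2, s1, s0)."""
    a0, a1, a2, a3, a4, a5, a6 = bits
    c_a, s_a = full_adder(a0, a1, a2)      # three weight-1 inputs
    c_b, s_b = full_adder(a3, a4, a5)      # three more weight-1 inputs
    c_c, s0 = full_adder(s_a, s_b, a6)     # remaining weight-1 bits
    s2, s1 = full_adder(c_a, c_b, c_c)     # the three weight-2 carries
    return s2, s1, s0


for a, b, c_in in product((0, 1), repeat=3):
    assert len(set(carry_out_forms(a, b, c_in))) == 1
print(counter_7_3((1, 1, 1, 0, 1, 1, 0)))   # five ones in, binary 101 out
```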
4.3 Carry-Propagate Adders (CPA)
Add two n-bit operands A and B and an optional carry-in $c_{in}$ by performing carry-propagation [1, 2, 11]
Sum $(c_{out}, S)$ is irredundant (n + 1)-bit number
$2c_{i+1} + s_i = a_i + b_i + c_i$ ; $i = 0, 1, \ldots, n-1$
$c_0 = c_{in}$ ; $c_{out} = c_n$ (r.m.a.)
[Figure cpasymbol.epsi: CPA symbol with inputs A, B, c_in and outputs c_out, S]
[Figure: carry propagation from c_in through bit positions 0, 1, ..., n-1 to c_out]
a) Concatenation of partial CPAs with fast $c_{in} \to c_{out}$
[Figure: partial CPAs over the bit groups n-1:j, ..., i-1:k, k-1:0]
b) Fast carry look-ahead logic for entire range of bits
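As a baseline for the speed-up techniques, the carry recurrence $2c_{i+1} + s_i = a_i + b_i + c_i$ can be evaluated serially, which corresponds to a ripple-carry adder. The helper names and the bit-list representation (index 0 = LSB) in this Python sketch are illustrative assumptions.

```python
def to_bits(x, n):
    """Integer -> list of n bits, index 0 = LSB."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """List of bits (index 0 = LSB) -> integer."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Evaluate 2*c[i+1] + s[i] = a[i] + b[i] + c[i] serially from the LSB."""
    carry = c_in                       # c_0 = c_in
    s_bits = []
    for a_i, b_i in zip(a_bits, b_bits):
        total = a_i + b_i + carry
        s_bits.append(total & 1)       # s_i
        carry = total >> 1             # c_{i+1}
    return carry, s_bits               # c_out = c_n


c_out, s = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(c_out, from_bits(s))             # 11 + 6 = 17 = 1 * 16 + 1
```

Every loop iteration depends on the carry of the previous one, so the delay grows linearly with n; the techniques a) and b) shorten exactly this dependency chain.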
Carry-skip adder (CSKA)
Type a) : partial CPA with fast $c_k \to c_i$
$c_i = \overline{P}_{i-1:k}\, c'_i + P_{i-1:k}\, c_k$ (bit group $(a_{i-1}, \ldots, a_k)$)
$P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
1) $P_{i-1:k} = 0$ : $c_k \not\to c'_i$ and $c'_i$ selected ($c'_i \to c_i$)
2) $P_{i-1:k} = 1$ : $c_k \to c'_i$ but $c'_i$ skipped ($c'_i \not\to c_i$)
⇒ path $c_k \to c'_i \to c_i$ never sensitized ⇒ fast $c_k \to c_i$
⇒ false path ⇒ inherent logic redundancy ⇒ problems in circuit optimization, timing analysis, and testing
Variable group sizes (faster) : larger groups in the middle (minimize delays $a_0 \to c_k \to s_{i-1}$ and $a_k \to c_i \to s_{n-1}$)
Partial CPA typ. is RCA or CSKA (⇒ multilevel CSKA)
Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
[Figure cska.epsi: partial CPAs for the bit groups n-1:j, ..., i-1:k, k-1:0 with skip multiplexers controlled by the group propagate $P_{i-1:k}$]

Carry-select adder (CSLA)
Type a) : partial CPA with fast $c_k \to c_i$ and $c_k \to s_{i-1:k}$
$s_{i-1:k} = \bar{c}_k\, s^0_{i-1:k} + c_k\, s^1_{i-1:k}$
$c_i = \bar{c}_k\, c^0_i + c_k\, c^1_i$
Two CPAs compute two possible results ($c_{in} = 0/1$), group carry-in $c_k$ selects correct one afterwards
Variable group sizes (faster) : larger groups at end (MSB) (balance delays $a_0 \to c_k$ and $a_k \to c^0_i$)
Part. CPA typ. is RCA, CSLA (⇒ multil. CSLA), or CLA
High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
$A \approx 14n$ ; $T \approx 2.8 n^{1/2}$ ; $AT \approx 39 n^{3/2}$
[Figure csla.epsi: two CPAs per bit group (carry-in 0 and 1) with multiplexers selecting $s_{i-1:k}$ and $c_i$ according to $c_k$]
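The carry-select principle can be sketched in Python as follows. Each group computes both candidate results, and the incoming group carry only drives the selection. The fixed group size and the use of integer addition as a stand-in for the partial CPAs are simplifying assumptions of this sketch.

```python
def add_group(a, b, c_in, width):
    """Stand-in for a partial CPA: returns (carry-out, sum) of a width-bit group."""
    total = a + b + c_in
    return total >> width, total & ((1 << width) - 1)


def carry_select_add(a, b, n, group_size, c_in=0):
    """n-bit addition from groups of group_size bits, LSB group first."""
    s, c_k = 0, c_in
    for k in range(0, n, group_size):
        width = min(group_size, n - k)
        mask = (1 << width) - 1
        a_grp, b_grp = (a >> k) & mask, (b >> k) & mask
        c0, s0 = add_group(a_grp, b_grp, 0, width)    # result for carry-in 0
        c1, s1 = add_group(a_grp, b_grp, 1, width)    # result for carry-in 1
        c_i, s_grp = (c1, s1) if c_k else (c0, s0)    # multiplexers driven by c_k
        s |= s_grp << k
        c_k = c_i
    return c_k, s


print(carry_select_add(0b101101, 0b011011, 6, 2))     # 45 + 27 = 72 = 64 + 8
```

In hardware both candidate sums of all groups are computed concurrently, so only the chain of multiplexer selections remains on the critical path.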
Carry-increment adder (CIA)
Result is incremented after addition, if $c_k = 1$ [12, 11]
$A \approx 10n$ ; $T \approx 2.8 n^{1/2}$ ; $AT \approx 28 n^{3/2}$
[Figure cia.epsi: each bit group uses one CPA with carry-in 0 followed by an incrementer (+1) controlled by $c_k$ and the group propagate $P_{i-1:k}$]
[Figure ciagate.epsi: gate-level implementation with inputs $a_{i-1} b_{i-1} \ldots a_k b_k$ and outputs $s_{i-1} \ldots s_k$; group sizes: bit 0 (IHA), bit 1 (IHA), bits 3,2 (IFA + IHA), bits 6...4 (2 IFA + IHA), ..., bits i-1...k ((i-k-1) IFA + IHA)]
Conditional-sum adder (COSA)
Type a) : optimized multilevel CSLA with (log n) levels (i.e. double CPAs are merged at higher levels)
Correct sum bits ($s^0_{i-1:k}$ or $s^1_{i-1:k}$) are (conditionally) selected through (log n) levels of multiplexers
[Figure: conditional-sum adder with pairs of FAs (carry-in 0 and 1) per bit and multiplexer levels 1, 2, ...; inputs c_in, outputs c_out, s3 ... s0]

Carry-lookahead adder (CLA), traditional
Type b) : carries looked ahead before sum bits computed
Typically 4-bit blocks used (e.g. standard IC SN74181)
+ preprocessing : $g_i = a_i b_i$ ; $p_i = a_i \oplus b_i$
  $c_0 = c_{in}$
  $c_1 = g_0 + p_0 c_0$
  ...
+ postprocessing : $s_i = p_i \oplus c_i$
[Figure cla.epsi: 16-bit CLA built from 4-bit carry-lookahead blocks (CLB) with inputs $(g_3, p_3) \ldots (g_0, p_0)$ up to $(g_{15}, p_{15}) \ldots (g_{12}, p_{12})$, block carries $c_4$, $c_8$, $c_{12}$, and outputs $c_{15} \ldots c_0$, c_out]
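A 4-bit carry-lookahead block can be sketched in Python by expanding the recurrence $c_{i+1} = g_i + p_i c_i$ so that every carry depends only on the $g$, $p$ signals and the block carry-in. The expanded equations for $c_2$ to $c_4$ are the standard expansion and are written out here as an assumption, since the notes list only the first carries.

```python
def cla4(a, b, c_in=0):
    """4-bit carry-lookahead addition; returns (c_out, sum)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]      # preprocessing: g_i = a_i b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]    # preprocessing: p_i = a_i xor b_i
    c0 = c_in
    # Lookahead: each carry is a two-level function of g, p and c0 only.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    s = sum((p[i] ^ carries[i]) << i for i in range(4))  # postprocessing: s_i = p_i xor c_i
    return c4, s


print(cla4(0b1011, 0b0110))   # 11 + 6 = 17 = 1 * 16 + 1
```

The number of product terms grows with the bit position, which is one reason why blocks are typically limited to 4 bits and combined hierarchically.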
Prefix algorithms
Algorithms visualized by directed acyclic graphs (DAG) with array structure (n bits × m levels)
Graph vertex symbols :
  [Figure: a •-node combines $(G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1})$ and $(G^{l-1}_{j:k}, P^{l-1}_{j:k})$ into $(G^l_{i:k}, P^l_{i:k})$ ; a pass-through node forwards $(G^{l-1}_{i:k}, P^{l-1}_{i:k})$ as $(G^l_{i:k}, P^l_{i:k})$]
[Figures ser.epsi, bk.epsi: prefix graphs with bit columns 0 ... 15 and levels 0, 1, 2, 3, ...]

Sklansky parallel-prefix algorithm (⇒ PPA-SK)
Tree-like collection, parallel redistribution of carries
$A \approx \frac{1}{2} n \log n$ ; $T = \lceil \log n \rceil$ ; $FO_{max} \approx \frac{1}{2} n$
[Figure sk.epsi: 16-bit Sklansky prefix graph, bit columns 15 ... 0, levels 0 ... 4]
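The Sklansky structure can be sketched in Python as a level-by-level evaluation of the prefix operator $(G, P) \bullet (G', P') = (G + P G', P P')$. The indexing scheme used to pick the partner node and the returned node count are illustrative assumptions, and n is assumed to be a power of two.

```python
def sklansky_add(a, b, n, c_in=0):
    """n-bit parallel-prefix addition with a Sklansky graph (n a power of two).

    Returns (c_out, sum, levels, nodes); nodes counts the prefix operators used.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    G, P = g[:], p[:]
    level, nodes = 0, 0
    while (1 << level) < n:
        new_G, new_P = G[:], P[:]
        for i in range(n):
            if i & (1 << level):                    # upper half of a block
                j = ((i >> level) << level) - 1     # top bit of the lower half
                new_G[i] = G[i] | (P[i] & G[j])
                new_P[i] = P[i] & P[j]
                nodes += 1
        G, P = new_G, new_P
        level += 1
    # After the last level G[i], P[i] cover the bit range i:0.
    carries = [c_in] + [G[i] | (P[i] & c_in) for i in range(n)]
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return carries[n], s, level, nodes


print(sklansky_add(12345, 54321, 16))
```

For n = 16 the sketch uses 4 levels and 32 prefix operators, in line with $T = \lceil \log n \rceil$ levels and about $\frac{1}{2} n \log n$ nodes. The node at the top of each lower half feeds all nodes of the upper half, which is the source of the large maximum fan-out.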
[Figure ks.epsi: prefix graph, levels 0 ... 4]
$k = n - 2\lceil \log n \rceil + 1$ : Brent-Kung parallel-prefix graph
fills gap between RCA and PPA-BK (i.e. CLA) in steps of single •-operations
$A = n - 1 + k$ ; $T = n - 1 - k$ ; $FO_{max}$ = var.

Carry-increment parallel-prefix algorithm (⇒ CIA)
[Figure: prefix graph over bit columns 15 ... 0, levels 0, 1, 2]
Example : 4-bit parallel-prefix adder (PPA-SK)
efficient AND-OR-prefix circuit for the generate and AND-prefix circuit for the propagate signals
optimization: alternatingly AOI-/OAI- resp. NAND-/...
[Figure: 4-bit adder with inputs a3 b3 a2 b2 a1 b1 a0 b0]

Prefix adder synthesis
Local prefix graph transformation :
[Figure: depth-decreasing transform applied to a prefix graph over bit columns 3 2 1 0]
Often used combinations : CLA and CSLA [14]
Pure architectures usually perform best (at gate-level)

Transistor-level adders
Influence of logic styles (e.g. dynamic logic, pass-transistor logic ⇒ faster)
+ Efficient transistor-level implementation of ripple-carry chains (Manchester chain) [14]
+ Combinations of speed-up techniques make sense
- Much higher design effort
Many efficient implementations exist and have been published

[Figure addperf.ps: adder comparison plotted over delay [ns] on logarithmic axes for 8-, 16-, 32-, 64-, and 128-bit adders of type RCA, CSKA-2L, CIA-1L, CIA-2L, PPA-SK, PPA-BK, CLA, and COSA, with a line of constant AT]
Complexity comparison under the unit-gate model

  adder     A               T          AT              opt.(1)   syn.(2)
  RCA       7n              2n         14n^2           aaa       √
  PPA-SK    3/2 n log n     2 log n    3n log^2 n      ttt       √
  PPA-BK    10n             4 log n    40n log n       att       √
  PPA-KS    3n log n        2 log n    6n log^2 n                √
  CLA       14n             4 log n    56n log n                 (√)
  COSA      3n log n        2 log n    6n log^2 n

(1) optimality regarding area and delay
    aaa : smallest area, longest delay
    aat : small area, medium delay
    att : medium area, short delay
    ttt : large area, shortest delay
    - : not optimal
(2) obtained from prefix adder synthesis
(3) automatic logic optimization not possible (redundancy)

4.4 Carry-Save Adder (CSA)
Adds three n-bit operands $A_0$, $A_1$, $A_2$ performing no carry-propagation (i.e. carries are saved) [1]
$(C, S)_{out} = A + (C, S)_{in}$
Result is in redundant carry-save format (n digits), represented by two n-bit numbers S (sum bits) and C (carry bits)
+ Parallel arrangement of n full-adders, constant delay
$A = 7n$ ; $T = 4$
[Figure csa.epsi: n full-adders in parallel with inputs $a_{0,i}$, $a_{1,i}$, $a_{2,i}$ ($i = n-1, \ldots, 0$) and outputs $c_n\, s_{n-1} \ldots c_2\, s_1\, c_1\, s_0$]
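A carry-save addition of three operands can be modelled bitwise in Python, since every bit position is an independent full-adder. The function name and the convention that the returned carry word has weight 2 are assumptions of this sketch.

```python
def carry_save_add(a0, a1, a2, n):
    """Return (C, S) with 2*C + S == a0 + a1 + a2 for n-bit operands."""
    mask = (1 << n) - 1
    s = (a0 ^ a1 ^ a2) & mask                          # sum bit of each full-adder
    c = ((a0 & a1) | (a0 & a2) | (a1 & a2)) & mask     # carry bit of each full-adder
    return c, s


c, s = carry_save_add(0b1011, 0b0110, 0b1101, 4)
print(format(c, "04b"), format(s, "04b"), 2 * c + s)   # 11 + 6 + 13 = 30
```

No bit position waits for its neighbour, which is why the delay is constant (one full-adder) regardless of n. A carry-propagate adder is needed only once, to convert the final (C, S) pair into an irredundant number.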
4.5 Multi-Operand Adders
[Figure mopadd.epsi: operands $A_0$, $A_1$, $A_2$, $A_3$, ..., $A_{m-1}$ added by a chain of CSAs followed by a final CPA]
b) linear arr. of CSAs (adder array) and final CPA
  $A = (m - 2) A_{CSA} + A_{CPA}$
  $T = (m - 2) T_{CSA} + T_{CPA}$
  CPA = RCA : $A = O(mn + n)$ ; $T = O(m + n)$
  Fast CPA : $A = O(mn + n \log n)$
a) and b) differ in bit arrival times at final CPA :
  ⇒ if CPA = RCA : a) and b) have same overall delay
Fast implementation : CSA array + fast final CPA (note: array of fast CPAs not efficient/necessary)
b) 4-operand CSA array with final CPA (RCA) :
[Figure csarray.epsi: rows of FA and HA cells with inputs $a_{0,i}$, $a_{1,i}$, $a_{2,i}$, $a_{3,i}$ ($i = n-1, \ldots, 0$), followed by the final CPA]
(m, 2)-compressors
$2\left(c + \sum_{l=0}^{m-4} c^l_{out}\right) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c^l_{in}$
[Figure cprsymbol.epsi: (m,2)-compressor symbol with inputs $a_0 \ldots a_{m-1}$ and $c^0_{in} \ldots c^{m-4}_{in}$, outputs $c^0_{out} \ldots c^{m-4}_{out}$, c, s]
1-bit adders (similar to (m, k)-counters) [16]
Compresses m bits down to 2 by forwarding (m - 3) intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array (see prev. page)
$A = 7(m - 2)$
$T_{LIN} = 4(m - 2)$ ; $T_{TREE} = 6(\lceil \log m \rceil - 1)$

(4, 2)-compressor with full-adders :
  $A = 14$ ; $T = 8$
Optimized (4, 2)-compressor :
  2 full-adders merged and optimized (i.e. XORs arranged in tree structure)
  $A = 14$ ; $T = 6$
[Figures: (4,2)-compressor with inputs a0 a1 a2 a3 and outputs c, s, built with full-adders and in optimized form]
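For m = 4 the compressor equation can be checked with a Python sketch that cascades two full-adders. The function names are illustrative; the point of the check is that the intermediate carry c_out never depends on c_in, so no carry ripples through a row of compressors.

```python
from itertools import product


def full_adder(a, b, c_in):
    """(carry, sum) of three bits."""
    total = a + b + c_in
    return total >> 1, total & 1


def compressor_4_2(a0, a1, a2, a3, c_in):
    """(4,2)-compressor from two cascaded full-adders; returns (c_out, c, s)."""
    c_out, t = full_adder(a0, a1, a2)     # intermediate carry, independent of c_in
    c, s = full_adder(t, a3, c_in)
    return c_out, c, s


# Defining equation: 2*(c + c_out) + s = a0 + a1 + a2 + a3 + c_in
for a0, a1, a2, a3, c_in in product((0, 1), repeat=5):
    c_out, c, s = compressor_4_2(a0, a1, a2, a3, c_in)
    assert 2 * (c + c_out) + s == a0 + a1 + a2 + a3 + c_in
```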
Advantages of (4, 2)-compressors over FAs for realizing (m, 2)-compressors :
  higher compression rate (4:2 instead of 3:2)
  less deep and more regular trees

  tree depth             0   1   2   3   4   5   6   7   8   9   10
  # operands   FA        2   3   4   6   9   13  19  28  42  63  94
  # operands   (4,2)     2   4   8   16  32  64  128

Tree adders (Wallace tree)
Adder tree : n-bit m-operand carry-save adder composed of n tree-structured (m, 2)-compressors [1, 17]
Tree adders : fastest multi-operand adders using an adder tree and a fast final CPA
$A = A_{(m,2)}\, n + A_{CPA} = O(mn + n \log n)$
4.6 Sequential Adders
[Figures: sequential adders accumulating the sum S, each with latency $L = m$]
Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer carries saved), trade-off between speed and register size

5 Simple / Addition-Based Operations
5.1 Complement and Subtraction
$S = A + B \pmod{2^n - 1} = A + B + c_{out}$ (end-around carry)
[Figure addmod.epsi: CPA whose c_out is fed back to c_in]
5.2 Increment / Decrement
Incrementer
Adds a single bit $c_{in}$ to an n-bit operand A
$(c_{out}, Z) = c_{out} 2^n + Z = A + c_{in}$
$z_i = a_i \oplus c_i$
$c_{i+1} = a_i c_i$ ; $i = 0, \ldots, n-1$
[Figure incsymbol.epsi: incrementer symbol (+1) with input A, c_in and outputs c_out, Z]
Corresponds to addition with B = 0 (⇒ FA → HA)
Example : Ripple-carry incrementer using half-adders
  $A = 3n$ ; $T = n + 1$ ; $AT \approx 3n^2$
  [Figure: chain of half-adders with inputs $a_{n-1} \ldots a_1\, a_0$, c_in and outputs c_out, $z_{n-1} \ldots z_2\, z_1\, z_0$]
Prefix problem : $C_{i:k} = C_{i:j+1} C_{j:k}$ ⇒ AND-prefix struct.
  $A \approx \frac{1}{2} n \log n + 2n$ ; $T = \lceil \log n \rceil + 2$ ; $AT \approx \frac{1}{2} n \log^2 n$

Decrementer
$(c_{out}, Z) = A - c_{in}$
[Figure dec.epsi: decrementer with inputs $a_{n-1} \ldots a_2\, a_1\, a_0$, c_in and outputs c_out, $z_{n-1} \ldots z_2\, z_1\, z_0$]

Incrementer-decrementer
[Figure: outputs $z_{n-1} \ldots z_2\, z_1\, z_0$]
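The ripple-carry incrementer built from half-adders follows directly from $z_i = a_i \oplus c_i$ and $c_{i+1} = a_i c_i$. The following Python sketch evaluates the chain; the function name and the integer interface are assumptions.

```python
def ripple_increment(a, n, c_in=1):
    """Return (c_out, z) with c_out * 2**n + z == a + c_in for an n-bit a."""
    c = c_in                        # c_0 = c_in
    z = 0
    for i in range(n):
        a_i = (a >> i) & 1
        z |= (a_i ^ c) << i         # half-adder sum:   z_i = a_i xor c_i
        c = a_i & c                 # half-adder carry: c_{i+1} = a_i c_i
    return c, z


print(ripple_increment(0b0111, 4))   # 7 + 1 = 8
print(ripple_increment(0b1111, 4))   # 15 + 1 = 16: carry-out set, result 0
```

Since $c_{i+1}$ is simply the AND of $c_{in}$ and all lower input bits, the carries form an AND-prefix problem and can be computed by any of the prefix structures instead of the serial chain.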
Fast incrementers
[Figure inccg.epsi: 4-bit fast incrementer with input c_in and outputs c_out, z3 z2 z1 z0]
8-bit parallel-prefix incrementer (Sklansky AND-prefix structure) :
[Figure incpp.epsi: inputs a7 ... a0 and c_in, outputs c_out and z7 ... z0]

Gray incrementer
$c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0$ (parity)
$c_{i+1} = \bar{a}_i c_i$ ; $i = 0, \ldots, n-3$ (r.m.a.)
$z_0 = a_0 \oplus \bar{c}_0$
$z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \ldots, n-2$
$z_{n-1} = a_{n-1} \oplus c_{n-2}$
Prefix problem ⇒ AND-prefix structure
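A Gray incrementer can be checked against ordinary binary counting with the following Python sketch. It assumes that $c_0$ is the parity of the Gray word and that the recursion is $c_{i+1} = \bar{a}_i c_i$ with $z_0 = a_0 \oplus \bar{c}_0$; these polarities are a reconstruction for this example and should be read as assumptions.

```python
def gray_increment(a, n):
    """Increment an n-bit Gray-coded word a (n >= 2); wraps around at the top."""
    def bit(i):
        return (a >> i) & 1

    c = [0] * (n - 1)
    c[0] = bin(a).count("1") & 1                 # assumed c_0: parity of the word
    for i in range(n - 2):                       # i = 0, ..., n-3
        c[i + 1] = (1 - bit(i)) & c[i]           # c_{i+1} = not(a_i) and c_i
    z = bit(0) ^ (1 - c[0])                      # even parity: flip bit 0
    for i in range(1, n - 1):
        z |= (bit(i) ^ (bit(i - 1) & c[i - 1])) << i
    z |= (bit(n - 1) ^ c[n - 2]) << (n - 1)
    return z


def binary_to_gray(b):
    return b ^ (b >> 1)


# Walk through the 4-bit Gray sequence starting at 0000.
word = 0
for _ in range(16):
    print(format(word, "04b"))
    word = gray_increment(word, 4)
```

With odd parity the bit to the left of the lowest set bit is flipped, and the chain $c_i$ locates that lowest set bit, which is where the AND-prefix structure comes from.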
5.4 Comparison, Coding, Detection
Equality comparison
$EQ = (A = B)$
$eq_{i+1} = (a_i = b_i)\, eq_i = \overline{(a_i \oplus b_i)}\, eq_i$ ; $i = 0, \ldots, n-1$
[Figure cmpeq.epsi: equality comparator with inputs a2 b2 a1 b1 a0 b0 and output EQ]

Magnitude comparison
$GE = (A \ge B)$
$ge_{i+1} = a_i \bar{b}_i + \overline{(a_i \oplus b_i)}\, ge_i$ ; $i = 0, \ldots, n-1$
[Figure cmpripple.epsi: ripple comparator for equality & magnitude with inputs $a_{n-1} b_{n-1} \ldots a_2 b_2\, a_1 b_1\, a_0 b_0$ and outputs EQ, GE]
single-tree structure ⇒ speed-up at no cost :
  $A = 6n$ ; $T_{LIN} = 2n$ ; $T_{TREE} \approx 2 \log n$
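The ripple form of the equality and magnitude comparison can be sketched in Python as follows. The start values $eq_0 = ge_0 = 1$ and the final outputs $EQ = eq_n$, $GE = ge_n$ are assumptions that the surviving text does not state explicitly.

```python
def compare(a, b, n):
    """Ripple comparison of two unsigned n-bit numbers; returns (EQ, GE)."""
    eq, ge = 1, 1                                # assumed start values eq_0, ge_0
    for i in range(n):                           # from LSB to MSB
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        same = 1 - (a_i ^ b_i)                   # a_i = b_i
        ge = (a_i & (1 - b_i)) | (same & ge)     # ge_{i+1}
        eq = same & eq                           # eq_{i+1}
    return eq, ge


print(compare(5, 5, 4), compare(9, 6, 4), compare(3, 12, 4))
```

A higher bit position overrides the decision of all lower positions unless its two bits are equal, so the MSB is evaluated last in this serial order.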
[Figures decodersym.epsi, decoder.epsi: decoder with input A = (a2, a1, a0) and output Z = (z7, ..., z0)]
decoder : $A = (n - 1) 2^n$ ; $T = \lceil \log n \rceil$
[Figures encodersym.epsi, encoder.epsi: encoder with input A = (a7, ..., a0) and output Z = (z2, z1, z0)]
encoder : $A = n(2^{n-1} - 1)$ ; $T = n - 1$

$A = n$ ; $T = \log n$
Leading-zeroes detection (LZD) :
  for scaling, normalization, priority encoding
  a) non-encoded output : $\{0\}1\{0|1\} \to \{0\}1\{0\}$
     prefix problem (r.m.a.) ⇒ AND-prefix structure
     [Figure: inputs $a_{n-1}\, a_{n-2} \ldots a_1\, a_0$ (note: connections according to PPA-SK)]
  b) encoded output : + encoder
  signed numbers : + leading-ones detector (LOD)
5.5 Shift, Extension, Saturation
Shift : a) shift n-bit vector by k bit positions
        b) select n out of more bits at position k
        also: logical (= unsigned), arithmetic (= signed)
Rotation by k bit positions, n constant (logic operation)
Extension of word lengths by k bits (n → n + k) (i.e. sign-extension for signed numbers)
Saturation to highest/lowest value after over-/underflow

  operation              result (MSB first)                            mnemonic
  shift a)  unsigned l.  $a_{n-2}, \ldots, a_0, 0$                      sll
            unsigned r.  $0, a_{n-1}, \ldots, a_1$                      srl
            signed l.    $a_{n-1}, a_{n-3}, \ldots, a_0, 0$             sla
            signed r.    $a_{n-1}, a_{n-1}, a_{n-2}, \ldots, a_1$       sra
  shift b)  unsigned     $a_{n+k-1}, \ldots, a_k$
            signed       $a_{2n-1}, a_{n+k-2}, \ldots, a_k$
  rotate    l.           $a_{n-2}, \ldots, a_0, a_{n-1}$                rol
            r.           $a_0, a_{n-1}, \ldots, a_1$                    ror
  extend    r.           $a_{n-1}, a_{n-2}, \ldots, a_0, 0$

Applications :
  adaption of magnitude (shift a)) or word length (extension) of operands (e.g. for addition)
  multiplication/division by powers of 2 (shift)
  logic bit/byte operations (shift, rotation)
  scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of floating-point numbers, shift a)) using LZD
  reducing error after over-/underflow (saturation)

Implementation of shift/extension/rotation by
  constant values : hard-wired
  variable values : multiplexers
  n possible values : n-by-n barrel-shifter/rotator
Example : 4-by-4 barrel-rotator
  $A = O(n^2)$ ; $T = O(\log n)$
  [Figure: barrel-rotator with inputs a3 a2 a1 a0 and select signals s1 s0]
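An n-by-n barrel-rotator can be modelled in Python as n multiplexers with n inputs each, all sharing one decoded rotate amount. The one-hot decoding and the bit-list representation (index 0 = LSB) are illustrative assumptions about the structure, not a transcription of the figure.

```python
def barrel_rotate_left(a_bits, k):
    """Rotate the bit list a_bits (index 0 = LSB) left by k positions."""
    n = len(a_bits)
    select = [int(j == k % n) for j in range(n)]   # decoded rotate amount (one-hot)
    z_bits = []
    for i in range(n):
        # n-input multiplexer: AND every candidate with its select line, OR the results
        z_i = 0
        for j in range(n):
            z_i |= select[j] & a_bits[(i - j) % n]
        z_bits.append(z_i)
    return z_bits


print(barrel_rotate_left([1, 0, 0, 0], 1))   # a0 moves to position 1
print(barrel_rotate_left([1, 0, 0, 1], 3))   # a0 and a3 move to positions 3 and 2
```

The n × n select/data crossings account for the quadratic area, while each output is a single n-input multiplexer whose depth grows only logarithmically.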
5.7 Arithmetic Logic Unit (ALU)
ALU operations

  arithmetic     add    $A + B + c_{in}$        sub    $A - B - c_{in}$
                 inc    $A + 1$                 dec    $A - 1$
                 pass   $A$                     neg    $-A$
  logic          and    $a_i b_i$               nand   $\overline{a_i b_i}$
                 or     $a_i + b_i$             nor    $\overline{a_i + b_i}$
                 xor    $a_i \oplus b_i$        xnor   $\overline{a_i \oplus b_i}$
                 pass   $a_i$                   not    $\bar{a}_i$
  shift/rotate   sll    $A \ll 1$               srl    $A \gg 1$
                 sla    $A \ll_a 1$             sra    $A \gg_a 1$
                 rol    $A \ll_r 1$             ror    $A \gg_r 1$

s/ro : shift/rotate ; l/r : left/right ; l/a : logic (unsigned) / arithmetic (signed)
Logic of adder/subtractor can partly be shared with logic operations

6 Multiplication
6.1 Multiplication Basics
$P = A \cdot B = \sum_{i=0}^{n-1} a_i 2^i \cdot \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$ or
$P_i = a_i B$ ; $P = \sum_{i=0}^{n-1} P_i 2^i$ ; $i = 0, \ldots, n-1$ (r.s.a.)

Algorithm
1) Generation of n partial products $P_i$
2) Adding up partial products :
   a) sequentially (sequential shift-and-add),
   b) serially (combinational shift-and-add), or
   c) in parallel

Speed-up techniques
Reduce number of partial products
Accelerate addition of partial products
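The two steps of the basic algorithm, generating the partial products $P_i = a_i B$ and adding them with weight $2^i$, can be written as a short Python sketch for unsigned operands. The loop corresponds to variant a), sequential shift-and-add; the function name is an assumption.

```python
def shift_and_add(a, b, n):
    """Unsigned multiplication P = sum of P_i * 2**i with P_i = a_i * B."""
    p = 0
    for i in range(n):
        a_i = (a >> i) & 1
        p_i = b if a_i else 0      # step 1: partial product P_i = a_i B
        p += p_i << i              # step 2: accumulate with weight 2^i
    return p


print(shift_and_add(13, 11, 4))    # 13 * 11 = 143
```

Both speed-up techniques act on this loop: recoding reduces the number of iterations, and carry-save arrays or trees replace the carry-propagating accumulation.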
Array multipliers :
  partial products generated and added simultaneously in linear array (using array adder)
  $A = O(n^2)$ ; $T = O(n)$
  [Figure mularr.epsi: rows of CSAs followed by a CPA]
Parallel multipliers :
  partial products generated in parallel and added subsequently in multi-operand adder (using tree adder)
  $A = O(n^2)$ ; $T = O(\log n)$
  [Figure mulpar.epsi: CSA tree followed by a CPA]
Signed multipliers :
  a) complement operands before and result after multiplication ⇒ unsigned multiplication

6.2 Unsigned Array Multiplier
$$\begin{array}{cccccccc}
 & & & & a_0 b_3 & a_0 b_2 & a_0 b_1 & a_0 b_0 \\
 & & & a_1 b_3 & a_1 b_2 & a_1 b_1 & a_1 b_0 & \\
 & & a_2 b_3 & a_2 b_2 & a_2 b_1 & a_2 b_0 & & \\
+ & a_3 b_3 & a_3 b_2 & a_3 b_1 & a_3 b_0 & & & \\ \hline
p_7 & p_6 & p_5 & p_4 & p_3 & p_2 & p_1 & p_0
\end{array}$$
[Figure mulbraun.epsi: 4 × 4 array multiplier with inputs a3 ... a0 and b3 ... b0, rows of HA and FA cells, regions marked 1, 2, 3, and outputs p7 ... p0]
6.3 Signed Array Multipliers
Modified Braun multiplier
Subtract bits with negative weight ⇒ special FAs [1]
  1 neg. bit : $-a + b + c_{in} = 2c_{out} - s$
  2 neg. bits : $a - b - c_{in} = -2c_{out} + s$
Replace FAs in regions 1, 2, and 3 by :
  $s = a \oplus b \oplus c_{in}$
  $c_{out} = \bar{a} b + \bar{a} c_{in} + b c_{in}$
  (input a at mark •)
Otherwise exactly same structure and complexity as Braun multiplier ⇒ efficient and flexible

Baugh-Wooley multiplier
Arithmetic transformations yield the following partial products (two additional ones) :
$$\begin{array}{cccccccc}
 & & & & \bar{a}_0 b_3 & a_0 b_2 & a_0 b_1 & a_0 b_0 \\
 & & & \bar{a}_1 b_3 & a_1 b_2 & a_1 b_1 & a_1 b_0 & \\
 & & \bar{a}_2 b_3 & a_2 b_2 & a_2 b_1 & a_2 b_0 & & \\
 & a_3 b_3 & a_3 \bar{b}_2 & a_3 \bar{b}_1 & a_3 \bar{b}_0 & & & \\
 & \bar{a}_3 & & & a_3 & & & \\
+\ 1 & \bar{b}_3 & & & b_3 & & & \\ \hline
p_7 & p_6 & p_5 & p_4 & p_3 & p_2 & p_1 & p_0
\end{array}$$
Less efficient and regular than modified Braun multiplier

6.4 Booth Recoding
Speed-up technique : reduction of partial products

Sequential multiplication
Minimal (or canonical) signed-digit (SD) represent. of A
+ One cycle per non-zero partial product (i.e. $\forall a_i \mid a_i \neq 0$)
- Negative partial products
- Data-dependent reduction of partial products and latency

Combinational multiplication
Only fixed reduction of partial product possible
Radix-4 modified Booth recoding : 2 bits recoded to one multiplier digit ⇒ n/2 partial products
$A = \sum_{i=0}^{n/2} \underbrace{(a_{2i-1} + a_{2i} - 2a_{2i+1})}_{\{-2, -1, 0, +1, +2\}}\, 2^{2i}$ ; $a_{-1} = 0$

  $a_{2i+1}$  $a_{2i}$  $a_{2i-1}$    $P_i$
      0         0          0          + 0
      0         0          1          + B
      0         1          0          + B
      0         1          1          + 2B
      1         0          0          - 2B
      1         0          1          - B
      1         1          0          - B
      1         1          1          - 0

[Figure mulbooth.epsi: Booth recoding of A selects the multiples of B that feed a CSA array/tree followed by a CPA]
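The radix-4 recoding rule can be tried out with a Python sketch that derives the digits $a_{2i-1} + a_{2i} - 2a_{2i+1}$ for a two's complement multiplier and sums the selected multiples of B. It assumes an even word length n and a signed operand a in the range $[-2^{n-1}, 2^{n-1} - 1]$, in which case n/2 digits suffice; the function names are illustrative.

```python
def booth_radix4_digits(a, n):
    """Recode the n-bit two's complement number a (n even) into n/2 digits in {-2..2}."""
    def bit(i):
        return 0 if i < 0 else (a >> i) & 1     # a_{-1} = 0

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1) for i in range(n // 2)]


def booth_multiply(a, b, n):
    """Sum of the n/2 recoded partial products d_i * B * 4**i."""
    return sum(d * b * 4 ** i for i, d in enumerate(booth_radix4_digits(a, n)))


print(booth_radix4_digits(6, 4))    # 6 = -2 + 2 * 4
print(booth_multiply(-7, 5, 4))     # -35
```

Each digit only requires the multiples 0, B, and 2B plus an optional negation, which is why a multiplexer for the shift and an XOR for the negation are enough in the partial product generation.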
Applicable to sequential, array, and parallel multipliers
- additional recoding logic and more complex partial product generation (MUX for shift, XOR for negation)
    A : +8n ; T : +7
+ adder array/tree cut in half
    ⇒ considerably smaller (array and tree) : A : /2
    ⇒ much faster for adder arrays : T : /2
    ⇒ slightly or not faster for adder trees : T : -0
Negative partial products (avoid sign-extension) :
  $\underbrace{p_3\, p_3\, p_3}_{\text{ext. sign}}\, p_3\, p_2\, p_1\, p_0 = 0\, 0\, 0\, (-p_3)\, p_2\, p_1\, p_0 = 1\, 1\, 1\, \bar{p}_3\, p_2\, p_1\, p_0 + 1\, 0\, 0\, 0$
$$\begin{array}{ccccccc}
p^0_3 & p^0_3 & p^0_3 & p^0_3 & p^0_2 & p^0_1 & p^0_0 \\
p^1_3 & p^1_3 & p^1_3 & p^1_2 & p^1_1 & p^1_0 & \\
p^2_3 & p^2_3 & p^2_2 & p^2_1 & p^2_0 & & \\
p^3_3 & p^3_2 & p^3_1 & p^3_0 & & & \\ \hline
p_6 & p_5 & p_4 & p_3 & p_2 & p_1 & p_0
\end{array}
\quad \to \quad
\begin{array}{ccccccc}
 & & & 1 & & & \\
 & & & \bar{p}^0_3 & p^0_2 & p^0_1 & p^0_0 \\
 & & \bar{p}^1_3 & p^1_2 & p^1_1 & p^1_0 & \\
 & \bar{p}^2_3 & p^2_2 & p^2_1 & p^2_0 & & \\
\bar{p}^3_3 & p^3_2 & p^3_1 & p^3_0 & & & \\ \hline
p_6 & p_5 & p_4 & p_3 & p_2 & p_1 & p_0
\end{array}$$
Suited for signed multiplication (incl. Booth recod.)
Extend A for unsigned multiplication : $a_n = 0$
Radix-8 (3-bit recoding) and higher radices : precomputing 3B, ... ⇒ larger overhead

6.5 Wallace Tree Addition
Speed-up technique : fast partial product addition
$A = O(n^2)$ ; $T = O(\log n)$
Applicable to parallel multipliers : parallel partial product generation (normal or Booth recoded)
Irregular adder tree (Wallace tree) due to different number of bits per column
  ⇒ irregular wiring and/or layout
  ⇒ non-uniform bit arrival times at final adder

6.6 Multiplier Implementations
Sequential multipliers :
  low performance, small area, resource sharing (adder)
Braun or Baugh-Wooley multiplier (array multiplier) :
  medium performance, high area, high regularity
  layout generators ⇒ data paths and macro-cells
  simple pipelining, faster CPA ⇒ higher speed
Booth-Wallace multiplier (parallel multiplier) [9] :
  high performance, high area, low regularity
  custom multipliers, netlist generators
  often pipelined (e.g. register between CSA-tree and CPA)
Signed-unsigned multiplier : signed multiplier with operands extended by 1 bit ($a_n = a_{n-1}/0$, $b_n = b_{n-1}/0$)
6.7 Composition from Smaller Multipliers
(2n × 2n)-bit multiplier can be composed from 4 (n × n)-bit multipliers (can be repeated recursively)
$A \cdot B = (A_H 2^n + A_L) \cdot (B_H 2^n + B_L) = A_H B_H 2^{2n} + (A_H B_L + A_L B_H) 2^n + A_L B_L$
4 (n × n)-bit multipliers + (2n)-bit CSA + (3n)-bit CPA
less efficient (area and speed)
[Figure: alignment of the partial products $A_H B_H$, $A_H B_L$, $A_L B_H$, $A_L B_L$]

7 Division / Square Root Extraction
7.1 Division Basics
$\frac{A}{B} = Q + \frac{R}{B} \;\Leftrightarrow\; A = Q \cdot B + R$ ; $R < B$
$R = A \text{ rem } B$ (remainder)
$A \in [0, 2^{2n} - 1]$ ; $B, Q, R \in [0, 2^n - 1]$ ; $B \neq 0$
$Q < 2^n \to A < 2^n B$ , otherwise overflow ⇒ normalize B before division ($B \in [2^{n-1}, 2^n - 1]$)
Algorithms (radix-2)
7.2 Restoring Division
$q_i = \begin{cases} 1 & \text{if } R_{i+1} - B\,2^i \ge 0 \\ 0 & \text{if } R_{i+1} - B\,2^i < 0 \end{cases}$

7.3 Non-Restoring Division
$q'_i = \begin{cases} 1 & \text{if } R_{i+1} \ge 0 \\ \bar{1} & \text{if } R_{i+1} < 0 \end{cases}$ ; $R_i = R_{i+1} - q'_i\,B\,2^i$ (with $\bar{1} = -1$)
$q'_i \in \{\bar{1}, 1\} \rightarrow q_i \in \{0, 1\}$ : $q_i = \frac{1}{2}(q'_i + 1)$
$A = (n + 1) A_{CPA} = O(n^2)$ or $O(n^2 \log n)$
$T = (n + 1) T_{CPA} = O(n^2)$ or $O(n \log n)$
[Figure divnr.epsi: non-restoring array divider built from rows of +/- CPAs with a final row of FAs; inputs a6 ... a0 and b3 ... b0, outputs Q (down to q0) and R (r3 ... r0)]

7.4 Signed Division
$q'_i = \begin{cases} 1 & \text{if } R_{i+1}, B \text{ same sign} \\ \bar{1} & \text{if } R_{i+1}, B \text{ opposite sign} \end{cases}$
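The restoring and non-restoring recurrences of Sections 7.2 and 7.3 can be sketched for unsigned operands as below. This is a software model under stated assumptions, not the array circuit: the function names are hypothetical, $A < 2^n B$ is assumed (no overflow), and the final correction step of the non-restoring variant (adding B back when the last remainder is negative) is the usual textbook step rather than something spelled out in these notes.

```python
def restoring_divide(a, b, n):
    """Radix-2 restoring division: subtract B*2^i only if the result stays >= 0."""
    r, q = a, 0
    for i in range(n - 1, -1, -1):
        trial = r - (b << i)
        if trial >= 0:              # q_i = 1: keep the subtraction
            r = trial
            q |= 1 << i
        # otherwise q_i = 0 and the previous partial remainder is kept
    return q, r


def nonrestoring_divide(a, b, n):
    """Radix-2 non-restoring division with digits q'_i in {-1, +1}."""
    r, bits = a, 0
    for i in range(n - 1, -1, -1):
        d = 1 if r >= 0 else -1     # q'_i follows the sign of R_{i+1}
        r -= d * (b << i)           # R_i = R_{i+1} - q'_i * B * 2^i
        bits |= ((d + 1) // 2) << i # q_i = (q'_i + 1) / 2
    q = 2 * bits - ((1 << n) - 1)   # value of the signed-digit quotient
    if r < 0:                       # assumed final correction step
        r += b
        q -= 1
    return q, r
```

Both functions return the same (Q, R) pair as ordinary integer division; the non-restoring version never undoes a subtraction and instead compensates in the next step.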
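7.5 SRT Division (Sweeney, Robertson, Tocher)
$q'_i = \begin{cases} 1 & \text{if } B\,2^i \le R_{i+1} \\ 0 & \text{if } -B\,2^i \le R_{i+1} < B\,2^i \\ \bar{1} & \text{if } R_{i+1} < -B\,2^i \end{cases}$ ; $q'_i$ is SD number
If $2^{n-1} \le B < 2^n$, i.e. B is normalized :
⇒ $-B\,2^i \le -2^{n+i-1} \le R_{i+1} < 2^{n+i-1} \le B\,2^i$
⇒ $q'_i = \begin{cases} 1 & \text{if } 2^{n+i-1} \le R_{i+1} \\ 0 & \text{if } -2^{n+i-1} \le R_{i+1} < 2^{n+i-1} \\ \bar{1} & \text{if } R_{i+1} < -2^{n+i-1} \end{cases}$
$A = O(n^2)$
$T = n\,T_{CSA} + T_{CPA} = O(n)$
[Figure divsrt.epsi: SRT array divider built from rows of +/- CSAs followed by +/- CPAs; outputs Q and R]

7.6 High-Radix Division
Radix $\beta = 2^m$, $q'_i \in \{\overline{\beta - 1}, \ldots, \bar{1}, 0, 1, \ldots, \beta - 1\}$
m quotient bits per step ⇒ fewer, but more complex steps
+ Suitable for SRT algorithm ⇒ faster
- Complex comparisons (more bits) and decisions ⇒ table look-up (⇒ Pentium bug!)

Division by convergence :
$R_i = \overline{B_i} + 1$ ; $i = 0, \ldots, m - 1$
$A_0 = A$ ; $B_0 = B$ ; $Q = A_m$ (r.s.n.)
Quadratic convergence : $L = \lceil \log n \rceil$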
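A floating-point sketch of division by convergence is given below. It rests on assumptions the excerpt does not spell out: $R_i$ is taken to be $2 - B_i$ (the two's complement of a fixed-point $B_i$ with one integer bit), numerator and denominator are both multiplied by $R_i$ in every step, and B is normalized to $[0.5, 1)$. The function name and the default word length are illustrative.

```python
import math


def divide_by_convergence(a, b, n=32):
    """Drive B towards 1 by repeated multiplication; A then converges to A/B."""
    if not 0.5 <= b < 1.0:
        raise ValueError("B must be normalized to [0.5, 1)")
    steps = math.ceil(math.log2(n))   # quadratic convergence: about log2(n) steps
    for _ in range(steps):
        r = 2.0 - b                   # assumed form of R_i
        a *= r                        # A_{i+1} = A_i * R_i
        b *= r                        # B_{i+1} = B_i * R_i  ->  1
    return a
```

Writing $B_0 = 1 - e$, each step squares the error ($B_1 = 1 - e^2$, $B_2 = 1 - e^4$, ...), which is why the number of steps grows only with the logarithm of the word length.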
$f(X) = \frac{1}{X} - B$ ; $f'(X) = -\frac{1}{X^2}$ ; $f\left(\frac{1}{B}\right) = 0$
Algorithm :
$X_{i+1} = X_i (2 - B X_i)$ ; $i = 0, \ldots, m - 1$
$X_0 = B$ ; $Q = X_m$ (r.s.n.)
Quadratic convergence : $L = O(\log n)$
Speed-up : first approximation $X_0$ from table

7.8 Remainder / Modulus
Remainder (rem) : signed remainder of a division
$R = A \;\mathrm{rem}\; B = A - \lfloor A/B \rfloor B$ ; $\mathrm{sign}(R) = \mathrm{sign}(A)$
Modulus (mod) : positive remainder of a division
$M = A \bmod B$ ; $M \ge 0$ ; $M = \begin{cases} R & \text{if } A \ge 0 \\ R + B & \text{else} \end{cases}$

7.9 Divider Implementations
Sequential dividers (restoring, non-restoring, SRT) :
  resource sharing of existing components (e.g. adder)
  low performance, low area
Array dividers (restoring, non-restoring, SRT) :
  dedicated hardware component
  high performance, high area
  high regularity ⇒ layout generators, pipelining
  square root extraction possible by minor changes
  combination with multiplication or/and square root
No parallel dividers exist, as compared to parallel multipliers (sequential nature of division)
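The difference between rem and mod from Section 7.8 can be made concrete with a small sketch. Two assumptions are made here: the quotient is truncated toward zero, which is what $\mathrm{sign}(R) = \mathrm{sign}(A)$ requires, and `mod` assumes $B > 0$ and tests the sign of R rather than of A so that exact multiples yield 0. Both function names are illustrative.

```python
def rem(a, b):
    """Signed remainder: quotient truncated toward zero, so R takes the sign of A."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return a - q * b


def mod(a, b):
    """Non-negative remainder for a positive divisor B."""
    if b <= 0:
        raise ValueError("this sketch assumes B > 0")
    r = rem(a, b)
    return r if r >= 0 else r + b
```

For example, -7 rem 3 gives -1 while -7 mod 3 gives 2; for non-negative A the two results coincide.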
7.10 Square Root Extraction
$\sqrt{A - R} = Q$, i.e. $A = Q^2 + R$
$A \in [0, 2^{2n} - 1]$ ; $Q \in [0, 2^n - 1]$
Algorithm
Subtract-and-shift : partial remainders $R_i$ and quotients $Q_i = Q_{i+1} + q_i 2^i = (q_{n-1}, \ldots, q_i, 0, \ldots, 0)$ [1]
$Q_i^2 = (Q_{i+1} + q_i 2^i)^2 = Q_{i+1}^2 + q_i 2^i (2 Q_{i+1} + q_i 2^i)$
$q_i = 1$ if $R_{i+1} \ge 2^i (2 Q_{i+1} + 2^i)$, else $q_i = 0$ ; $Q_i = Q_{i+1} + q_i 2^i$
$R_i = R_{i+1} - q_i 2^i (2 Q_{i+1} + q_i 2^i)$ ; $i = n - 1, \ldots, 0$
$R_n = A$ ; $Q_n = 0$ ; $R = R_0$ ; $Q = Q_0$ (r.m.n.)
Implementation
+ Similar to division ⇒ same algorithms applicable (restoring, non-restoring, SRT, high-radix)
+ Combination with division in same component possible
Only triangular array required (step i : $q_{k<i} = 0$)
$A \approx A_{DIV}/2$
$T \approx T_{DIV}$
[Figure sqrtnr.epsi: triangular array of +/- CPA rows; input A, outputs Q and R]

8 Elementary Functions
Exponential function : $e^x$ (exp x)
Logarithm function : ln x, log x
Trigonometric functions : sin x, cos x, tan x
Inverse trig. functions : arcsin x, arccos x, arctan x
Hyperbolic functions : sinh x, cosh x, tanh x

8.1 Algorithms
Table look-up : inefficient for large word lengths [5]
Taylor series expansion : complex implementation
Polynomial and rational approximations [1, 5]
Shift-and-add algorithms [5]
Convergence algorithms [1, 2] :
  similar to division-by-convergence
  two (or more) recursive formulas : one formula converges to a constant, the other to the result
Coordinate rotation (CORDIC) [2, 5, 20] :
  3 equations for x-, y-coordinate, and angle
  computes all elementary functions by proper input settings and choice of modes and outputs
  simple, universal hardware, small look-up table
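As an illustration of the CORDIC entry, the sketch below evaluates sine and cosine with the three coupled recurrences for x, y, and the angle. It assumes the standard circular rotation-mode equations, which these notes do not write out, and uses floating point with a precomputed arctangent table instead of fixed-point shifts; the function name and iteration count are illustrative.

```python
import math


def cordic_sin_cos(theta, iterations=32):
    """Rotate (1/K, 0) by theta in micro-rotations of atan(2^-i); returns (sin, cos)."""
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]  # small look-up table
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))                # CORDIC gain K
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0                             # rotation direction
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i     # x and y equations
        z -= d * angles[i]                                      # angle equation
    return y, x
```

This form converges for angles of magnitude up to roughly 1.74 rad; in hardware the multiplications by $2^{-i}$ become shifts, which is where the "simple, universal hardware" property comes from.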
8.2 Integer Exponentiation
$A^B = (\cdots((A^{b_{n-1}})^2 \, A^{b_{n-2}})^2 \cdots A^{b_1})^2 \, A^{b_0}$
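The nested squaring in the exponentiation formula corresponds to the square-and-multiply scheme, sketched below. It assumes the exponent is an n-bit unsigned integer B with bits $b_{n-1}, \ldots, b_0$ scanned from the MSB down; the function name is illustrative.

```python
def power_square_multiply(a, b, n):
    """Compute a**b by squaring once per exponent bit and multiplying where b_i = 1."""
    result = 1
    for i in range(n - 1, -1, -1):
        result *= result            # squaring doubles the exponent collected so far
        if (b >> i) & 1:
            result *= a             # factor A^(b_i)
    return result
```

An n-bit exponent thus costs n squarings and at most n multiplications instead of up to $2^n - 1$ repeated multiplications.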
9 VLSI Design Aspects
Gate-level design

9.2 Synthesis
Layout and netlist generators

9.3 VHDL
Relational operators : =, /=, <, <=, >, >=
Resource sharing
Structural, synthesizable VHDL code for most circuits described in this text is found in [22]

9.4 Performance
Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
Power-related properties of arithmetic circuits :
  High glitching activity due to high bit dependencies and large logic depth
Power reduction in arithmetic circuits [23] :
  Reduce the switched capacitance by choosing an area efficient circuit architecture
  Allow for lower supply voltage by speeding up the circuitry
  Reduce the transition activity :
    apply stable inputs while circuit is not in use (⇒ disabling subcircuits)
    reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
    reduce glitching transitions by reducing logic depth (pipelining)
    take advantage of correlated data streams
    choose appropriate number representations (e.g. Gray codes for counters)

9.5 Testability
Testability goal : high fault coverage with few test vectors that are easy to generate/apply
Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits
Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits
Hard-detectable faults found in :
  circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags
  circuits using redundant number representations (≠ redundant hardware) : dividers (Pentium bug!)
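To illustrate why the MSBs of incrementers and counters appear among the hard-detectable cases, the sketch below counts the input vectors that excite the carry into the MSB of an n-bit incrementer. The fault model (a stuck-at-0 fault on that carry signal) and the function names are illustrative assumptions, not taken from the notes.

```python
def vectors_exciting_msb_carry(n):
    """Count n-bit inputs x for which x + 1 produces a carry into the MSB position."""
    low_mask = (1 << (n - 1)) - 1
    # The carry reaches the MSB only when all n-1 lower bits are 1.
    return sum(1 for x in range(1 << n) if (x & low_mask) == low_mask)


def detection_probability(n):
    """Chance that one uniformly random vector exposes a stuck-at-0 on that carry."""
    return vectors_exciting_msb_carry(n) / (1 << n)
```

Only two of the $2^n$ possible inputs exercise this carry, so the probability per random vector is $2^{-(n-1)}$ and a dedicated test vector is the practical way to cover the fault for wide words.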
Bibliography
[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993.
[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
[3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.
[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhäuser Boston, 1997.
[6] Proceedings of the Xth Symposium on Computer Arithmetic.
[7] IEEE Transactions on Computers.
[8] D. R. Lutz and D. N. Jayasimha, "Programmable modulo-k counters", IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939-941, Nov. 1996.
[9] H. Makino et al., "An 8.8-ns 54 × 54-bit multiplier with high speed redundant binary architecture", IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773-783, June 1996.
[10] W. N. Holmes, "Composite arithmetic: Proposal for a new standard", IEEE Computer, vol. 30, no. 3, pp. 65-73, Mar. 1997.
[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.
[12] A. Tyagi, "A reduced-area scheme for carry-select adders", IEEE Trans. Comput., vol. 42, no. 10, pp. 1162-1170, Oct. 1993.
[13] T. Han and D. A. Carlson, "Fast area-efficient VLSI adders", in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49-56.
[14] D. W. Dobberpuhl et al., "A 200-MHz 64-b dual-issue CMOS microprocessor", IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1564, Nov. 1992.
[15] A. De Gloria and M. Olivieri, "Statistical carry lookahead adders", IEEE Trans. Comput., vol. 45, no. 3, pp. 340-347, Mar. 1996.
[16] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach", IEEE Trans. Comput., vol. 45, no. 3, pp. 294-305, Mar. 1996.
[17] Z. Wang, G. A. Jullien, and W. C. Miller, "A new design technique for column compression multipliers", IEEE Trans. Comput., vol. 44, no. 8, pp. 962-970, Aug. 1995.