
Eidgenössische Technische Hochschule Zürich
Ecole polytechnique fédérale de Zurich
Politecnico federale di Zurigo
Swiss Federal Institute of Technology Zurich

Institut für Integrierte Systeme
Integrated Systems Laboratory

Lecture notes on

Computer Arithmetic:
Principles, Architectures,
and VLSI Design
June 25, 1998

Reto Zimmermann
Integrated Systems Laboratory
Swiss Federal Institute of Technology (ETH)
CH-8092 Zurich, Switzerland
zimmermann@iis.ee.ethz.ch

Copyright © 1998 by Integrated Systems Laboratory, ETH Zurich


http://www.iis.ee.ethz.ch/~zimmi/publications/comp_arith_notes.ps.gz

Contents

1 Introduction and Conventions
  1.1 Outline
  1.2 Motivation
  1.3 Conventions
  1.4 Recursive Function Evaluation
2 Arithmetic Operations
  2.1 Overview
  2.2 Implementation Techniques
3 Number Representations
  3.1 Binary Number Systems (BNS)
  3.2 Gray Numbers
  3.3 Redundant Number Systems
  3.4 Residue Number Systems (RNS)
  3.5 Floating-Point Numbers
  3.6 Logarithmic Number System
  3.7 Antitetrational Number System
  3.8 Composite Arithmetic
  3.9 Round-Off Schemes
4 Addition
  4.1 Overview
  4.2 1-Bit Adders, (m, k)-Counters
  4.3 Carry-Propagate Adders (CPA)
  4.4 Carry-Save Adder (CSA)
  4.5 Multi-Operand Adders
  4.6 Sequential Adders
5 Simple / Addition-Based Operations
  5.1 Complement and Subtraction
  5.2 Increment / Decrement
  5.3 Counting
  5.4 Comparison, Coding, Detection
  5.5 Shift, Extension, Saturation
  5.6 Addition Flags
  5.7 Arithmetic Logic Unit (ALU)
6 Multiplication
  6.1 Multiplication Basics
  6.2 Unsigned Array Multiplier
  6.3 Signed Array Multipliers
  6.4 Booth Recoding
  6.5 Wallace Tree Addition
  6.6 Multiplier Implementations
  6.7 Composition from Smaller Multipliers
  6.8 Squaring
7 Division / Square Root Extraction
  7.1 Division Basics
  7.2 Restoring Division
  7.3 Non-Restoring Division
  7.4 Signed Division
  7.5 SRT Division
  7.6 High-Radix Division
  7.7 Division by Multiplication
  7.8 Remainder / Modulus
  7.9 Divider Implementations
  7.10 Square Root Extraction
8 Elementary Functions
  8.1 Algorithms
  8.2 Integer Exponentiation
  8.3 Integer Logarithm
9 VLSI Design Aspects
  9.1 Design Levels
  9.2 Synthesis
  9.3 VHDL
  9.4 Performance
  9.5 Testability
Bibliography

1 Introduction and Conventions

1.1 Outline

- Basic principles of computer arithmetic [1, 2, 3, 4, 5, 6, 7]
- Circuit architectures and implementations of main arithmetic operations
- Aspects regarding VLSI design of arithmetic units

1.2 Motivation

- Arithmetic units are, among others, core of every data path and addressing unit
- Data path is core of:
  - microprocessors (CPU)
  - signal processors (DSP)
  - data-processing application specific ICs (ASIC) and programmable ICs (e.g. FPGA)
- Standard arithmetic units available from libraries
- Design of arithmetic units necessary for:
  - non-standard operations
  - high-performance components
  - library development

1.3 Conventions

Naming conventions
- Signal buses: $A$ (1-D), $A_i$ (2-D), $a_{i:k}$ (subbus, 1-D)
- Signals: $a$, $a_i$ (1-D), $a_{i,k}$ (2-D), $A_{i:k}$ (group signal)
- Circuit complexity measures: $A$ (area), $T$ (cycle time, delay), $AT$ (area-time product), $L$ (latency, # cycles)
- Arithmetic operators: $+$, $-$, $\cdot$, $/$, $\log$ ($= \log_2$)
- Logic operators: $+$ (or), $\cdot$ (and), $\oplus$ (xor), $\odot$ (xnor), $\overline{\phantom{a}}$ (not)

Circuit complexity measures
- Unit-gate model ($\approx$ gate-equivalents (GE) model):
  - Inverter, buffer: $A = 0$, $T = 0$ (i.e. ignored)
  - Simple monotonic 2-input gates (AND, NAND, OR, NOR): $A = 1$, $T = 1$
  - Simple non-monotonic 2-input gates (XOR, XNOR): $A = 2$, $T = 2$
  - Complex gates: composed from simple gates
    $\Rightarrow$ simple $m$-input gates: $A = m - 1$, $T = \lceil \log m \rceil$
- Wiring not considered (acceptable for comparison purposes, local wiring, multilevel metallization)
- Only estimations given for complex circuits

1.4 Recursive Function Evaluation

Given: inputs $a_i$, outputs $z_i$, function $f$ (graph symbol: •)

Non-recursive functions (n.)
- Output $z_i$ is a function of input $a_i$ (or $a_{j+m:j}$, $m$ const.):
  $z_i = f(a_i, x)$; $i = 0, \ldots, n-1$
- $\Rightarrow$ parallel structure: $A = O(n)$, $T = O(1)$

Recursive functions (r.)
- Output $z_i$ is a function of all inputs $a_k$, $k \le i$

a) with single output $z = z_n$ (r.s.):
   $t_i = f(a_i, t_{i-1})$; $i = 0, \ldots, n-1$; $t_{-1} = 0/1$, $z = t_{n-1}$
   1. $f$ is non-associative (r.s.n.)
      $\Rightarrow$ serial structure: $A = O(n)$, $T = O(n)$
   2. $f$ is associative (r.s.a.)
      $\Rightarrow$ serial or single-tree structure: $A = O(n)$, $T = O(\log n)$

b) with multiple outputs $z_i$ (r.m.) ($\Rightarrow$ prefix problem):
   $z_i = f(a_i, z_{i-1})$; $i = 0, \ldots, n-1$; $z_{-1} = 0/1$
   1. $f$ is non-associative (r.m.n.)
      $\Rightarrow$ serial structure: $A = O(n)$, $T = O(n)$
   2. $f$ is associative (r.m.a.)
      $\Rightarrow$ serial or multi-tree structure: $A = O(n^2)$, $T = O(\log n)$
      $\Rightarrow$ or shared-tree structure: $A = O(n \log n)$, $T = O(\log n)$

[Figures: 4-input examples ($a_3 \ldots a_0$) of the parallel, serial, single-tree, multi-tree, and shared-tree structures]
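The difference between the serial and the single-tree structure for an associative $f$ can be sketched in a few lines of Python. This is only an illustration: the helper names `serial_eval` and `tree_eval` are assumptions made here, and the returned depth counts operator levels, not unit-gate delays.

```python
from functools import reduce
from operator import xor

def serial_eval(f, inputs):
    """Serial chain (r.s.n. or r.s.a.): n - 1 operators, depth n - 1."""
    return reduce(f, inputs), len(inputs) - 1

def tree_eval(f, inputs):
    """Balanced single tree (r.s.a.): same n - 1 operators, depth ceil(log2 n).
    Gives the serial result only if f is associative."""
    level, depth = list(inputs), 0
    while len(level) > 1:
        nxt = [f(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through to next level
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth

bits = [1, 0, 1, 1, 0, 1, 0, 0]
print(serial_eval(xor, bits))  # (0, 7)
print(tree_eval(xor, bits))    # (0, 3)
```

Both structures use the same number of operators, which matches $A = O(n)$ for either arrangement, while the depth drops from $O(n)$ to $O(\log n)$.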

2 Arithmetic Operations

2.1 Overview

[Figure arithops.epsi: fixed-point arithmetic operations ordered by complexity, with arrows for "based on operation" and "related operation"; the same arrangement applies to floating-point numbers]

Operations:
   1  shift/extension
   2  comparison
   3  increment/decrement
   4  complement
   5  addition/subtraction
   6  multiplication
   7  division
   8  square root extraction
   9  exponential function
  10  logarithm function
  11  trigonometric functions
  12  hyperbolic functions

2.2 Implementation Techniques

- Direct implementation of dedicated units:
  - always: 1–5
  - in most cases: 6
  - sometimes: 7, 8
- Sequential implementation using simpler units and several clock cycles ($\Rightarrow$ decomposition):
  - sometimes: 6
  - in most cases: 7, 8, 9
- Table look-up techniques using ROMs:
  - universal: simple application to all operations
  - efficient only for single-operand operations of high complexity (8–12) and small word length (note: ROM size $= 2^n \times n$)
- Approximation techniques using simpler units: 7–12
  - Taylor series expansion
  - polynomial and rational approximations
  - convergence of recursive equation systems
  - CORDIC (COordinate Rotation DIgital Computer)

3 Number Representations

3.1 Binary Number Systems (BNS)

- Radix-2, binary number system (BNS): irredundant, weighted, positional, monotonic [1, 2]
- $n$-bit number is ordered sequence of bits (binary digits):
  $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_2$, $a_i \in \{0, 1\}$
- Simple and efficient implementation in digital circuits
- MSB/LSB (most-/least-significant bit): $a_{n-1}$ / $a_0$
- Represents an integer or fixed-point number, exact
- Fixed-point numbers:
  $(\underbrace{a_{m-1} \ldots a_0}_{m\text{-bit integer}} \,.\, \underbrace{a_{-1} \ldots a_{m-n}}_{(n-m)\text{-bit fraction}})$

Unsigned: positive or natural numbers
- Value: $A = a_{n-1} 2^{n-1} + \cdots + a_1 2 + a_0 = \sum_{i=0}^{n-1} a_i 2^i$
- Range: $[0, 2^n - 1]$

Two's (2's) complement: standard representation of signed or integer numbers
- Value: $A = -a_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} a_i 2^i$
- Range: $[-2^{n-1}, 2^{n-1} - 1]$
- Complement: $-A = 2^n - A = \overline{A} + 1$, where $\overline{A} = (\overline{a}_{n-1}, \overline{a}_{n-2}, \ldots, \overline{a}_0)$
- Sign: $a_{n-1}$
- Properties: asymmetric range, compatible with unsigned numbers in many arithmetic operations (i.e. same treatment of positive and negative numbers)

One's (1's) complement: similar to 2's complement
- Value: $A = -a_{n-1} (2^{n-1} - 1) + \sum_{i=0}^{n-2} a_i 2^i$
- Range: $[-(2^{n-1} - 1), 2^{n-1} - 1]$
- Complement: $-A = 2^n - A - 1 = \overline{A}$
- Sign: $a_{n-1}$
- Properties: double representation of zero, symmetric range, modulo $(2^n - 1)$ number system

Sign-magnitude: alternative representation of signed numbers
- Value: $A = (-1)^{a_{n-1}} \sum_{i=0}^{n-2} a_i 2^i$
- Range: $[-(2^{n-1} - 1), 2^{n-1} - 1]$
- Complement: $-A = (\overline{a}_{n-1}, a_{n-2}, \ldots, a_0)$
- Sign: $a_{n-1}$
- Properties: double representation of zero, symmetric range, different treatment of positive and negative numbers in arithmetic operations, no MSB toggles at sign changes around 0 ($\Rightarrow$ low power)
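As a quick check of the value formulas for the four binary representations, the following Python sketch evaluates one bit string under each interpretation. The function names and the MSB-first string format are assumptions chosen for this illustration.

```python
def unsigned_value(bits):
    """bits is a string such as '1011', MSB (a_{n-1}) first."""
    return int(bits, 2)

def twos_complement_value(bits):
    n = len(bits)
    return -int(bits[0]) * 2 ** (n - 1) + int(bits[1:] or "0", 2)

def ones_complement_value(bits):
    n = len(bits)
    return -int(bits[0]) * (2 ** (n - 1) - 1) + int(bits[1:] or "0", 2)

def sign_magnitude_value(bits):
    return (-1) ** int(bits[0]) * int(bits[1:] or "0", 2)

for word in ("0101", "1011", "1111"):
    print(word,
          unsigned_value(word),
          twos_complement_value(word),
          ones_complement_value(word),
          sign_magnitude_value(word))
# 0101 5 5 5 5
# 1011 11 -5 -4 -3
# 1111 15 -1 0 -7
```

The last row shows the second representation of zero in 1's complement (the all-ones word).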

Graphical representation

[Figure numrep.epsi: circle of the binary code words 000...0, 011...1, 100...0, 111...1 ($n$-bit binary number representation) with the values assigned by the unsigned, 2's complement, 1's complement, and sign-magnitude representations]

Conventions
- 2's complement used for signed numbers in these notes
- Unsigned and signed numbers can be treated equally in most cases, exceptions are mentioned

3.2 Gray Numbers

- Gray numbers (code): binary, irredundant, non-weighted, non-monotonic
- (+) Property: unit-distance coding (i.e. exactly one bit toggles between adjacent numbers)
- Applications: counters with low output toggle rate (low-power signal buses), representation of continuous signals for low-error sampling (no false numbers due to switching of different bits at different times)
- (−) Non-monotonic numbers: difficult arithmetic operations, e.g. addition, comparison:

    $g_1 g_0$     $g'_1 g'_0$        $g_0$   $g'_0$
    0 0   <   0 1    and   0  <  1
    1 1   <   1 0    but   1  >  0

- binary $\to$ Gray: $g_i = b_{i+1} \oplus b_i$; $b_n = 0$; $i = 0, \ldots, n-1$ (n.)
- Gray $\to$ binary: $b_i = b_{i+1} \oplus g_i$; $b_n = 0$; $i = n-1, \ldots, 0$ (r.m.a.)
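The two Gray conversion rules translate directly into integer operations. The sketch below is an illustration with assumed function names; the binary-to-Gray direction is non-recursive (one XOR per bit), while the Gray-to-binary direction accumulates a running XOR from the MSB downwards, which is the recursive (r.m.a.) case.

```python
def binary_to_gray(b):
    """g_i = b_{i+1} xor b_i for all bits at once."""
    return b ^ (b >> 1)

def gray_to_binary(g):
    """b_i = b_{i+1} xor g_i, i.e. b_i is the XOR of all g_j with j >= i."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

codes = [binary_to_gray(i) for i in range(8)]
print([format(c, "03b") for c in codes])
# ['000', '001', '011', '010', '110', '111', '101', '100']
```

Adjacent code words in the printed list differ in exactly one bit, which is the unit-distance property.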
Binary and Gray code words for the values 0 to 15:

  value   binary          Gray
          b3 b2 b1 b0     g3 g2 g1 g0
    0     0  0  0  0      0  0  0  0
    1     0  0  0  1      0  0  0  1
    2     0  0  1  0      0  0  1  1
    3     0  0  1  1      0  0  1  0
    4     0  1  0  0      0  1  1  0
    5     0  1  0  1      0  1  1  1
    6     0  1  1  0      0  1  0  1
    7     0  1  1  1      0  1  0  0
    8     1  0  0  0      1  1  0  0
    9     1  0  0  1      1  1  0  1
   10     1  0  1  0      1  1  1  1
   11     1  0  1  1      1  1  1  0
   12     1  1  0  0      1  0  1  0
   13     1  1  0  1      1  0  1  1
   14     1  1  1  0      1  0  0  1
   15     1  1  1  1      1  0  0  0

3.3 Redundant Number Systems

- Non-binary, redundant, weighted number systems [1, 2]
- Digit set larger than radix (typically radix 2) $\Rightarrow$ multiple representations of same number $\Rightarrow$ redundancy
- (+) No carry-propagation in adders $\Rightarrow$ more efficient implementation of adder-based units (e.g. multipliers and dividers)
- (−) Redundancy $\Rightarrow$ no direct implementation of relational operators $\Rightarrow$ conversion to irredundant numbers
- (−) Several bits used to represent one digit $\Rightarrow$ higher storage requirements
- (−) Expensive conversion into irredundant numbers (not necessary if redundant input operands are allowed)

Carry-save number representation:
- $r_i \in \{0, 1, 2, 3\}$, $c_i, s_i, a_i, b_i, d_i \in \{0, 1\}$
- $r_i = (c_{i+1}, s_i) = 2c_{i+1} + s_i = a_i + b_i + d_i = a_i + r'_i$
- $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + R'$
- 1 digit holds sum of 3 bits or 1 digit + 1 bit (no carry-out digit, i.e. carry is saved)
- standard redundant number system for fast addition

Delayed-carry or half-adder number representation:
- $r_i \in \{0, 1, 2\}$, $c_i, s_i, a_i, b_i \in \{0, 1\}$
- $r_i = (c_{i+1}, s_i) = 2c_{i+1} + s_i = a_i + b_i$, $c_{i+1} s_i = 0$
- $R = \sum_{i=0}^{n-1} r_i 2^i = (C, S) = C + S = A + B$
- 1 digit holds sum of 2 bits (no carry-out digit)
- example: $(00, 10) = 00 + 10 = 01 + 01 = (10, 00)$
- irredundant representation of $-1$ [8], since $c_{i+1} s_i = 0 \wedge C + S = -1 \Rightarrow S = -1,\ C = 0$

Signed-digit (SD) or redundant digit (RD) number representation:
- $r_i, s_i, t_i \in \{-1, 0, 1\} \equiv \{\bar{1}, 0, 1\}$, $R = \sum_{i=0}^{n-1} r_i 2^i$
- no carry-propagation in $S = R + T$:
  - $r_i + t_i = (c_{i+1}, u_i) = 2c_{i+1} + u_i$, $c_{i+1}, u_i \in \{\bar{1}, 0, 1\}$
  - $(c_{i+1}, u_i)$ is redundant (e.g. $0 + 1 = 01 = 1\bar{1}$)
  - $\forall i\ \exists (c_i, u_i) \mid c_i + u_i = s_i \in \{\bar{1}, 0, 1\}$
- 1 digit holds sum of 2 digits (no carry-out digit)
- minimal SD representation: minimal number of non-zero digits, $011\{1\}10 \to 100\{0\}\bar{1}0$
  - applications: sequential multiplication (fewer cycles), filters with constant coefficients (less hardware)
  - example: $7 = (0111 \mid 1\bar{1}11 \mid 10\bar{1}1 \mid 100\bar{1} \mid 1\bar{1}\bar{1}11 \mid \ldots)$, where $100\bar{1}$ is minimal
- canonical SD representation: minimal SD + not two non-zero digits in sequence, $01\{1\}10 \to 10\{0\}\bar{1}0$
- SD $\to$ binary: carry-propagation necessary ($\Rightarrow$ adder)
- other applications: high-speed multipliers [9]
- similar to carry-save, simple use for signed numbers
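The carry-save relation $C + S = A + B + D$ can be tried out on plain integers. The following Python sketch is an illustration only: it models all bit positions at once with bitwise operators, and the names `carry_save_add` and `sum_operands` are assumptions made here.

```python
def carry_save_add(a, b, d):
    """Adds three non-negative integers without carry propagation.
    Returns (C, S) with C + S == a + b + d; each bit position acts
    as an independent full-adder."""
    s = a ^ b ^ d                                # sum bits s_i
    c = ((a & b) | (a & d) | (b & d)) << 1       # carry bits c_{i+1}
    return c, s

def sum_operands(operands):
    """Accumulates in carry-save form; only the final step propagates carries."""
    c, s = 0, 0
    for x in operands:
        c, s = carry_save_add(c, s, x)
    return c + s                                 # conversion to irredundant form

print(carry_save_add(5, 6, 7))     # (14, 4)
print(sum_operands([5, 6, 7, 8]))  # 26
```

The pair `(14, 4)` is one of several carry-save representations of 18, which illustrates the redundancy of the number system.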

3.4 Residue Number Systems (RNS)

- Non-binary, irredundant, non-weighted number system [1]
- (+) Carry-free and fast additions and multiplications
- (−) Complex and slow other arithmetic operations (e.g. comparison, sign and overflow detection) because digits are not weighted, conversion to weighted mixed-radix or binary system required
- Codes for error detection and correction [1]
- Possible applications (but hardly used):
  - digital filters: fast additions and multiplications
  - error detection and correction for arithmetic operations in conventional and residue number systems
- Base is $n$-tuple of integers $(m_{n-1}, m_{n-2}, \ldots, m_0)$, residues (or moduli) $m_i$ pairwise relatively prime
- $A = (a_{n-1}, a_{n-2}, \ldots, a_0)_{m_{n-1}, m_{n-2}, \ldots, m_0}$, $a_i \in \{0, 1, \ldots, m_i - 1\}$
- Range: $M = \prod_{i=0}^{n-1} m_i$, anywhere in $\mathbb{Z}$
- $a_i = A \bmod m_i = |A|_{m_i}$, $A = m_i q_i + a_i$
- $|A|_M = \left| \sum_{i=0}^{n-1} C_i a_i \right|_M$, $C_i = (\ldots, 0, \underbrace{1}_{i}, 0, \ldots)$

Arithmetic operations (each digit computed separately):
- $z_i = |Z|_{m_i} = |f(A)|_{m_i} = \left| f(|A|_{m_i}) \right|_{m_i} = |f(a_i)|_{m_i}$
- $|A + B|_{m_i} = \left| |A|_{m_i} + |B|_{m_i} \right|_{m_i} = |a_i + b_i|_{m_i}$
- $|A \cdot B|_{m_i} = \left| |A|_{m_i} \cdot |B|_{m_i} \right|_{m_i} = |a_i \cdot b_i|_{m_i}$
- $|-a_i|_{m_i} = |m_i - a_i|_{m_i}$
- $|a_i^{-1}|_{m_i} = |a_i^{m_i - 2}|_{m_i}$ (Fermat's theorem)

Best moduli $m_i$ are $2^k$ and $(2^k - 1)$:
- high storage efficiency with $k$ bits
- simple modular addition: $2^k$: $k$-bit adder without $c_{out}$; $2^k - 1$: $k$-bit adder with end-around carry ($c_{in} = c_{out}$)

Example: $(m_1, m_0) = (3, 2)$, $M = 6$

  $A$      -4  -3  -2  -1   0   1   2   3   4   5   6   7   8
  $a_1$     2   0   1   2   0   1   2   0   1   2   0   1   2
  $a_0$     0   1   0   1   0   1   0   1   0   1   0   1   0
  (possible range: any $M = 6$ consecutive values)

- $|5|_6 = A = (a_1, a_0) = (|5|_3, |5|_2) = (2, 1)$
- $|4 + 5|_6 = (1, 0) + (2, 1) = (|1 + 2|_3, |0 + 1|_2) = (0, 1) = |3|_6$
- $|4 \cdot 5|_6 = (1, 0) \cdot (2, 1) = (|1 \cdot 2|_3, |0 \cdot 1|_2) = (2, 0) = |2|_6$
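The example with moduli $(3, 2)$ can be reproduced with a small Python sketch. The function names are assumptions for this illustration, and `from_rns` uses the Chinese remainder theorem with weights built from modular inverses (`pow(x, -1, m)`, available from Python 3.8), which is one common way to obtain the weights $C_i$.

```python
from math import prod

def to_rns(a, moduli):
    """Residue digits a_i = |A|_{m_i}, listed in the order of the moduli."""
    return tuple(a % m for m in moduli)

def rns_add(x, y, moduli):
    return tuple((xi + yi) % m for xi, yi, m in zip(x, y, moduli))

def rns_mul(x, y, moduli):
    return tuple((xi * yi) % m for xi, yi, m in zip(x, y, moduli))

def from_rns(residues, moduli):
    """Conversion back to a weighted number in [0, M - 1]."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        weight = m_i * pow(m_i, -1, m)   # 1 mod m, 0 mod every other modulus
        total += r * weight
    return total % big_m

moduli = (3, 2)                          # (m_1, m_0), M = 6
a, b = to_rns(4, moduli), to_rns(5, moduli)
print(a, b)                              # (1, 0) (2, 1)
print(rns_add(a, b, moduli), from_rns(rns_add(a, b, moduli), moduli))  # (0, 1) 3
print(rns_mul(a, b, moduli), from_rns(rns_mul(a, b, moduli), moduli))  # (2, 0) 2
```

Addition and multiplication touch each digit independently, while the conversion back is the only step that combines digits.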

3.5 Floating-Point Numbers

- Larger range, smaller precision than fixed-point representation, inexact, real numbers [1, 2]
- Double-number form $\Rightarrow$ discontinuous precision
- Format: | $S$ | biased exponent $E$ | unsigned norm. mantissa $M$ |
- $F = (-1)^S \cdot M \cdot \beta^E = (-1)^S \cdot 1.M \cdot 2^{E - \mathrm{bias}}$
- Basic arithmetic operations:
  - $A \cdot B = (-1)^{S_A \oplus S_B} \cdot M_A M_B \cdot \beta^{E_A + E_B}$
  - $A + B = \left( (-1)^{S_A} M_A + (-1)^{S_B} \left( M_B \gg (E_A - E_B) \right) \right) \cdot \beta^{E_A}$
  - based on fixed-point add, multiply, and shift operations
  - postnormalization required ($1/\beta \le M < 1$)
- Applications:
  - processors: real floating-point formats (e.g. IEEE standard), large range due to universal use
  - ASICs: usually simplified floating-point formats with small exponents, smaller range, used for range extension of normal fixed-point numbers
- IEEE floating-point format:

  precision   $n$   $n_M$   $n_E$   bias   range                  precision
  single      32    23      8       127    $3.8 \cdot 10^{38}$    $10^{-7}$
  double      64    52      11      1023   $9 \cdot 10^{307}$     $10^{-15}$

3.6 Logarithmic Number System

- Alternative representation to floating-point (i.e. mantissa + integer exponent $\to$ only fixed-point exponent) [1]
- Single-number form $\Rightarrow$ continuous precision $\Rightarrow$ higher accuracy, more reliable
- Format: | $S$ | biased fixed-point exponent $E$ |
- $L = (-1)^S \cdot \beta^E = (-1)^S \cdot 2^{E - \mathrm{bias}}$ (signed-logarithmic)
- Basic arithmetic operations:
  - $(A < B) = (E_A < E_B)$ (additionally consider sign)
  - $A + B$: by approximation or addition in conventional number system and double conversion
  - $A \cdot B = (-1)^{S_A \oplus S_B} \cdot \beta^{E_A + E_B}$
  - $A^y = (-1)^{S_A \cdot y} \cdot \beta^{E_A \cdot y}$, $\sqrt[y]{A} = (-1)^{S_A} \cdot \beta^{E_A / y}$
- (+) Simpler multiplication/exponentiation, (−) more complex addition
- (−) Expensive conversion: (anti)logarithms (table look-up)
- Applications: real-time digital filters

3.7 Antitetrational Number System

- Tetration ($\text{t.}\,x = \underbrace{2^{2^{\cdot^{\cdot^{2}}}}}_{x}$) and antitetration ($\text{a.t.}\,x$) [10]
- Larger range, smaller precision than logarithmic representation, otherwise analogous (i.e. $2^x \to \text{t.}\,x$, $\log x \to \text{a.t.}\,x$)
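For the IEEE single format of Section 3.5, the fields $S$, $E$, $M$ can be extracted from a number with Python's standard `struct` module. This is a sketch under the assumption that only normalized numbers are decoded; the function names are chosen here for illustration.

```python
import struct

def decode_single(x):
    """Splits an IEEE single-precision number into (S, E, M) bit fields."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    s = word >> 31               # 1 sign bit
    e = (word >> 23) & 0xFF      # n_E = 8 bits, biased exponent
    m = word & 0x7FFFFF          # n_M = 23 bits, fraction of 1.M
    return s, e, m

def value_from_fields(s, e, m, n_m=23, bias=127):
    """F = (-1)^S * 1.M * 2^(E - bias), normalized numbers only."""
    return (-1) ** s * (1 + m / 2 ** n_m) * 2.0 ** (e - bias)

print(decode_single(1.0))                       # (0, 127, 0)
print(decode_single(-6.5))                      # (1, 129, 5242880)
print(value_from_fields(*decode_single(-6.5)))  # -6.5
```

Here $-6.5 = -1.625 \cdot 2^2$, so the stored exponent is $2 + 127 = 129$ and the fraction field holds $0.625 \cdot 2^{23}$.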

3.8 Composite Arithmetic

- Proposal for a new standard of number representations [10]
- Scheme for storage and display of exact (primary: integer, secondary: rational) and inexact (primary: logarithmic, secondary: antitetrational) numbers
- Secondary forms used for numbers not representable by primary ones ($\Rightarrow$ no over-/underflow handling necessary)
- Choice of number representation hidden from user, i.e. software/compiler selects format for highest accuracy
- Number representations:

  form              tag   value
  integer           00    2's complement integer
  rational          01    slash | denominator | numerator
  logarithmic       10    log integer | log fraction
  antitetrational   11    a.t. integer | a.t. fraction

- Rational numbers: slash position (i.e. size of numerator/denominator) is variable and stored (floating slash)
- Storage form sizes: 32-bit (short), 64-bit (normal), 128-bit (long), 256-bit (extended)
- Hardware proposal: long accumulator (4096 bits) holds any floating-point number in fixed-point format $\Rightarrow$ higher accuracy $\Rightarrow$ large hardware/software overhead
- Implementation: mixed hardware/software solutions

3.9 Round-Off Schemes

- Intermediate results with $d$ additional lower bits ($\Rightarrow$ higher accuracy): $A = (a_{n-1}, \ldots, a_0 \,.\, a_{-1}, \ldots, a_{-d})$
- Rounding: keeping error small during final word length reduction: $R = (r_{n-1}, \ldots, r_0) = A - \epsilon$
- Trade-off: numerical accuracy vs. implementation cost
- Truncation:
  - $R_{\mathrm{TRUNC}} = (a_{n-1}, \ldots, a_0)$
  - bias $= -\frac{1}{2} + \frac{1}{2^{d+1}}$ (= average error $\epsilon$)
- Round-to-nearest (i.e. normal rounding):
  - $R_{\mathrm{ROUND}} = (a'_{n-1}, \ldots, a'_0)$, $A' = A + 0.1_2$
  - bias $= \frac{1}{2^{d+1}}$ (nearly symmetric)
  - "$+\,0.1_2$" can often be included in previous operation
- Round-to-nearest-even/-odd:
  - $R_{\mathrm{ROUND\text{-}EVEN}} = R_{\mathrm{ROUND}}$ if $(a'_{-1}, \ldots, a'_{-d}) \ne 0$, and $(a'_{n-1}, \ldots, a'_1, 0)$ otherwise
  - bias $= 0$ (symmetric)
  - mandatory in IEEE floating-point standard
- 3 guard bits for rounding after floating-point operations: guard bit G (postnormalization), round bit R (round-to-nearest), sticky bit S (round-to-nearest-even)
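The three round-off schemes can be compared on fixed-point values stored as integers scaled by $2^d$. The following Python sketch assumes $d \ge 1$ and uses function names chosen for this illustration; Python's arithmetic right shift makes the same code work for negative (2's complement) values.

```python
def truncate(a, d):
    """R_TRUNC: drop the d fraction bits."""
    return a >> d

def round_nearest(a, d):
    """R_ROUND: add 0.1 (binary), i.e. half an LSB, then truncate."""
    return (a + (1 << (d - 1))) >> d

def round_nearest_even(a, d):
    """R_ROUND-EVEN: as R_ROUND, but a tie (fraction of A' is zero)
    forces the LSB of the result to 0."""
    a_prime = a + (1 << (d - 1))
    r = a_prime >> d
    if a_prime & ((1 << d) - 1) == 0:
        r &= ~1
    return r

d = 2                                    # two fraction bits, weights 1/2 and 1/4
for a in (0b1010, 0b1011, 0b1110):       # 2.5, 2.75, 3.5
    print(a / 2 ** d, truncate(a, d), round_nearest(a, d), round_nearest_even(a, d))
# 2.5 2 3 2
# 2.75 2 3 3
# 3.5 3 4 4
```

Averaging the error over all four fraction patterns for $d = 2$ gives $-3/8$ for truncation, $+1/8$ for round-to-nearest, and $0$ for round-to-nearest-even (taken over an even and an odd integer part), in line with the bias formulas.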
4 Addition

4.1 Overview

[Figure adders.epsi: overview of adder components and architectures, with arrows for "based on component" and "related component"]

- 1-bit adders:
  - HA: half-adder
  - FA: full-adder
  - (m,k): (m,k)-counter
  - (m,2): (m,2)-compressor
- carry-propagate adders (CPA):
  - RCA: ripple-carry adder
  - CSKA: carry-skip adder
  - CSLA: carry-select adder
  - CIA: carry-increment adder
  - CLA: carry-lookahead adder
  - PPA: parallel-prefix adder
  - COSA: conditional-sum adder
- carry-save adders (CSA): 3-operand adders
- multi-operand adders: adder array (array adder), adder tree (tree adder)

4.2 1-Bit Adders, (m, k)-Counters

- Add up $m$ bits of same magnitude (i.e. 1-bit numbers)
- Output sum as $k$-bit number ($k = \lfloor \log m \rfloor + 1$)
- or: count 1's at inputs $\Rightarrow$ (m, k)-counter [3] (combinational counters)

Half-adder (HA), (2, 2)-counter
- $(c_{out}, s) = 2c_{out} + s = a + b$
- $s = a \oplus b$ (sum)
- $c_{out} = ab$ (carry-out)
- $A = 3$, $T = 2\ (1)$
- [Figures hasym.epsi, haschema1.epsi, haschema2.epsi: HA symbol and schematics]

Full-adder (FA), (3, 2)-counter
- $(c_{out}, s) = 2c_{out} + s = a + b + c_{in}$
- $g = ab$ (generate), $p = a \oplus b$ (propagate)
- $c^0 = ab$, $c^1 = a + b$
- $s = a \oplus b \oplus c_{in} = p \oplus c_{in}$
- $c_{out} = ab + ac_{in} + bc_{in} = ab + (a \oplus b)c_{in} = g + pc_{in} = \overline{p}g + pc_{in} = \overline{p}a + pc_{in} = \overline{c}_{in}c^0 + c_{in}c^1$
- $A = 7$, $T = 4\ (2)$
- [Figures fasymbol.epsi, faschematic1.epsi to faschematic5.epsi: FA symbol and schematics]

(m, k)-counters
- $(s_{k-1}, \ldots, s_0) = \sum_{j=0}^{k-1} s_j 2^j = \sum_{i=0}^{m-1} a_i$
- Usually built from full-adders
- Associativity of addition allows conversion from linear to tree structure $\Rightarrow$ faster at same number of FAs
- $A = 7 \sum_{k=1}^{\log m} \lfloor m 2^{-k} \rfloor \approx 7(m - \log m)$
- $T_{LIN} = 4m + 2\lfloor \log m \rfloor$, $T_{TREE} = 4\lceil \log_3 m \rceil + 2\lfloor \log m \rfloor$
- Example: (7, 3)-counter
  - linear structure: $A = 28$, $T = 14$ [Figure count73ser.epsi]
  - tree structure: $A = 28$, $T = 10$ [Figure count73par.epsi]

4.3 Carry-Propagate Adders (CPA)

- Add two $n$-bit operands $A$ and $B$ and an optional carry-in $c_{in}$ by performing carry-propagation [1, 2, 11]
- Sum $(c_{out}, S)$ is irredundant $(n + 1)$-bit number
- $(c_{out}, S) = c_{out} 2^n + S = A + B + c_{in}$
- $2c_{i+1} + s_i = a_i + b_i + c_i$; $i = 0, 1, \ldots, n-1$; $c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)
- [Figure cpasymbol.epsi: CPA symbol]

Ripple-carry adder (RCA)
- Serial arrangement of $n$ full-adders
- Simplest, smallest, and slowest CPA structure
- $A = 7n$, $T = 2n$, $AT = 14n^2$
- [Figure rca.epsi: chain of full-adders from $c_{in}$ to $c_{out}$]

Carry-propagation speed-up techniques
a) Concatenation of partial CPAs with fast $c_{in} \to c_{out}$
   [Figure speedup1.epsi: partial CPAs for bit groups $n-1{:}j$, $i-1{:}k$, $k-1{:}0$]
b) Fast carry look-ahead logic for entire range of bits
   [Figure speedup2.epsi: preprocessing, carry propagation, postprocessing]
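The full-adder equations and the ripple-carry arrangement can be modelled at bit level in Python. This is a behavioural sketch, not a gate netlist: the LSB-first bit lists and the helper names are assumptions made here, and `counter_7_3` shows one possible arrangement of four full-adders for the (7, 3)-counter.

```python
def full_adder(a, b, c_in):
    """(3, 2)-counter: returns (c_out, s) with 2*c_out + s == a + b + c_in."""
    g = a & b                  # generate
    p = a ^ b                  # propagate
    s = p ^ c_in
    c_out = g | (p & c_in)
    return c_out, s

def ripple_carry_add(a_bits, b_bits, c_in=0):
    """Serial chain of n full-adders; bit lists are LSB first."""
    s_bits, c = [], c_in
    for a, b in zip(a_bits, b_bits):
        c, s = full_adder(a, b, c)
        s_bits.append(s)
    return c, s_bits

def counter_7_3(a):
    """(7, 3)-counter from four full-adders; returns (s2, s1, s0)."""
    c1, t1 = full_adder(a[0], a[1], a[2])
    c2, t2 = full_adder(a[3], a[4], a[5])
    c3, s0 = full_adder(t1, t2, a[6])     # bits of weight 1
    s2, s1 = full_adder(c1, c2, c3)       # bits of weight 2
    return s2, s1, s0

def to_bits(x, n):
    return [(x >> i) & 1 for i in range(n)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

c_out, s = ripple_carry_add(to_bits(11, 4), to_bits(6, 4))
print(c_out, from_bits(s))                # 1 1, i.e. 11 + 6 = 16 + 1
print(counter_7_3([1, 0, 1, 1, 0, 1, 1])) # (1, 0, 1), five ones at the inputs
```

The loop in `ripple_carry_add` mirrors the recursion $2c_{i+1} + s_i = a_i + b_i + c_i$ with $c_0 = c_{in}$ and $c_{out} = c_n$.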

Carry-skip adder (CSKA)

- Type a): partial CPA with fast $c_k \to c_i$
- $c_i = \overline{P}_{i-1:k}\, c'_i + P_{i-1:k}\, c_k$ (bit group $(a_{i-1}, \ldots, a_k)$)
- $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
  1) $P_{i-1:k} = 0$: $c_k \not\to c'_i$ and $c'_i$ selected ($c'_i \to c_i$)
  2) $P_{i-1:k} = 1$: $c_k \to c'_i$ but $c'_i$ skipped ($c'_i \not\to c_i$)
  $\Rightarrow$ path $c_k \to c'_i \to c_i$ never sensitized $\Rightarrow$ fast $c_k \to c_i$
  $\Rightarrow$ false path $\Rightarrow$ inherent logic redundancy $\Rightarrow$ problems in circuit optimization, timing analysis, and testing
- Variable group sizes (faster): larger groups in the middle (minimize delays $a_0 \to c_k \to s_{i-1}$ and $a_k \to c_i \to s_{n-1}$)
- Partial CPA typ. is RCA or CSKA ($\Rightarrow$ multilevel CSKA)
- Medium speed-up at small hardware overhead (+ AND/bit + MUX/group)
- $A \approx 8n$, $T \approx 4n^{1/2}$, $AT \approx 32n^{3/2}$
- [Figure cska.epsi: partial CPAs with skip multiplexers controlled by $P_{i-1:k}$]

Carry-select adder (CSLA)

- Type a): partial CPA with fast $c_k \to c_i$ and $c_k \to s_{i-1:k}$
- $s_{i-1:k} = \overline{c}_k\, s^0_{i-1:k} + c_k\, s^1_{i-1:k}$
- $c_i = \overline{c}_k\, c^0_i + c_k\, c^1_i$
- Two CPAs compute two possible results ($c_{in} = 0/1$), group carry-in $c_k$ selects correct one afterwards
- Variable group sizes (faster): larger groups at end (MSB) (balance delays $a_0 \to c_k$ and $a_k \to c^0_i$)
- Part. CPA typ. is RCA, CSLA ($\Rightarrow$ multilevel CSLA), or CLA
- High speed-up at high hardware overhead (+ MUX/bit + (CPA + MUX)/group)
- $A \approx 14n$, $T \approx 2.8n^{1/2}$, $AT \approx 39n^{3/2}$
- [Figure csla.epsi: two partial CPAs per group ($c_{in} = 0$ and $c_{in} = 1$) with selection multiplexers]
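The carry-select principle can be sketched behaviourally in Python. The model below is an assumption-laden illustration: it uses fixed group sizes instead of the variable sizes recommended above, represents each partial CPA by integer addition, and the names `add_group` and `carry_select_add` are chosen here.

```python
def add_group(a, b, c_in, width):
    """Model of a partial CPA: returns (carry-out, width-bit sum)."""
    total = a + b + c_in
    return total >> width, total & ((1 << width) - 1)

def carry_select_add(a, b, n, group=4, c_in=0):
    """n-bit carry-select addition with fixed group size; returns (c_out, S)."""
    s, c = 0, c_in
    for k in range(0, n, group):
        w = min(group, n - k)
        mask = (1 << w) - 1
        a_g, b_g = (a >> k) & mask, (b >> k) & mask
        c0, s0 = add_group(a_g, b_g, 0, w)       # result assuming c_k = 0
        c1, s1 = add_group(a_g, b_g, 1, w)       # result assuming c_k = 1
        c, s_g = (c1, s1) if c else (c0, s0)     # multiplexers controlled by c_k
        s |= s_g << k
    return c, s

print(carry_select_add(200, 100, 8))  # (1, 44), i.e. 300 = 256 + 44
```

In hardware both candidate results of all groups are available at the same time, so only the multiplexer chain lies on the carry path; the sequential loop here does not capture that timing.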

Carry-increment adder (CIA)

- Type a): partial CPA with fast $c_k \to c_i$ and $c_k \to s_{i-1:k}$
- $s_{i-1:k} = s^0_{i-1:k} + c_k$, $c_i = c^0_i + P_{i-1:k}\, c_k$
- $P_{i-1:k} = p_{i-1} p_{i-2} \cdots p_k$ (group propagate)
- Result is incremented after addition, if $c_k = 1$ [12, 11]
- Variable group sizes (faster): larger groups at end (MSB) (balance delays $a_0 \to c_k$ and $a_k \to c^0_i$)
- Part. CPA typ. is RCA, CIA ($\Rightarrow$ multilevel CIA) or CLA
- High speed-up at medium hardware overhead (+ AND/bit + (incrementer + AND-OR)/group)
- Logic of CPA and incrementer can be merged [11]
- $A \approx 10n$, $T \approx 2.8n^{1/2}$, $AT \approx 28n^{3/2}$
- [Figure cia.epsi: partial CPAs ($c_{in} = 0$) followed by incrementers ($+1$) controlled by $c_k$]

Example: gate-level schematic of carry-increment adder (CIA)
- only 2 different logic cells (bit-slices): IHA and IFA
- [Figure ciagate.epsi: groups for bit 0, bit 1, bits 3,2, bits 6...4, ..., bits $i-1 \ldots k$, each built from one IHA and $(i-k-1)$ IFA cells]

  $T$               4   6   10  12  14  16  18  20  22  24  26  28  ...  38
  max $n_{group}$           2   3   4   5   6   7   8   9   10  11  ...  16
  max $n$           1   2   4   7   11  16  22  29  37  46  56  67  ...  137

4 Addition

4.3 Carry-Propagate Adders (CPA)

Conditional-sum adder (COSA)

Correct sum bits (si;1:k or si;1:k ) are (conditionally)


selected through (log n) levels of multiplexers
1

Higher parallelism, more balanced signal paths


Highest speed-up at highest hardware overhead
(2 RCA + more than (log n) MUX/bit)
3n log n

2 log n

AT

6n log

Type b) : carries looked ahead before sum bits computed


Typically 4-bit blocks used (e.g. standard IC SN74181)

c0 = c00
c1 = g0 + p0c00
c2 = g1 + p1g0 + p1p0c00
c3 = g2 + p2g1 + p2p1g0 + p2p1 p0c00
g30 = g3 + p3g2 + p3p2g1 + p3p2 p1g0
p30 = p3p2p1 p0

Bit groups of size 2l at level l

4.3 Carry-Propagate Adders (CPA)

Carry-lookahead adder (CLA), traditional

Type a) : optimized multilevel CSLA with (log n) levels


(i.e. double CPAs are merged at higher levels)
0

4 Addition

...

(g3,p3)

(g0,p0)

clbsymbol.epsi
26 mm c
27 CLB
0
. . . c0
(g,p)
3 3 c3

Hierarchical arrangement using ( 12 log n) levels :


(g30 p03) passed up, c00 passed down between levels

High speed-up at medium hardware overhead

level 0

a3

...

b3

a2

FA

level 1

...

level 2

FA

...

b2

a1

FA

FA

a0

FA

b0

FA

c in

CLB

s2

s1

32

4.3 Carry-Propagate Adders (CPA)

Type b) : universal adder architecture comprising RCA,


CIA, CLA, and more (i.e. entire range of area-delay
trade-offs from slowest RCA to fastest CLA)
Preprocessing, carry-lookahead, and postprocessing step
Carries calculated using parallel-prefix algorithms
+ High flexibility : special adders, other arithmetic
operations, exchangeable prefix algorithms (i.e. speeds)
+ High performance : smallest and fastest adders

...

(gn-1 , p n-1 )

carry-lookahead:
prefix algorithm

add.epsi///figures
73 64 mm

c1

p0

c3 . . . c0

+ preprocessing : gi = ai bi
+ postprocessing : si = pi

pi = ai bi
ci

4 Addition

33

4.3 Carry-Propagate Adders (CPA)

postprocessing:

Computer Arithmetic: Principles, Architectures, and VLSI Design

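Inputs $(x_{n-1}, \ldots, x_0)$, outputs $(y_{n-1}, \ldots, y_0)$, associative binary operator $\bullet$ [11, 13]

$(y_{n-1}, \ldots, y_0) = (x_{n-1} \bullet \cdots \bullet x_0, \; \ldots, \; x_1 \bullet x_0, \; x_0)$
or
$y_0 = x_0$ ; $y_i = x_i \bullet y_{i-1}$ ; $i = 1, \ldots, n-1$ (r.m.a.)

⇒ tree structures for evaluation :
$x_3 \bullet (x_2 \bullet (x_1 \bullet x_0)) = (x_3 \bullet x_2) \bullet (x_1 \bullet x_0)$, but $y_2$ ?
left-hand side : $y_1 = Y^1_{1:0}$, $y_2 = Y^2_{2:0}$, $y_3 = Y^3_{3:0}$
right-hand side : $Y^1_{3:2}$, $y_1 = Y^1_{1:0}$, $y_3 = Y^2_{3:0}$

Group variables $Y^l_{i:k}$ : covers bits $(x_k, \ldots, x_i)$ at level $l$
Carry-propagation is prefix problem : $Y^l_{i:k} = (G^l_{i:k}, P^l_{i:k})$
$(G^0_{i:i}, P^0_{i:i}) = (g_i, p_i)$
$(G^l_{i:k}, P^l_{i:k}) = (G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1}) \bullet (G^{l-1}_{j:k}, P^{l-1}_{j:k})$ ; $k \le j \le i$
$\quad = (G^{l-1}_{i:j+1} + P^{l-1}_{i:j+1} G^{l-1}_{j:k}, \; P^{l-1}_{i:j+1} P^{l-1}_{j:k})$
$c_{i+1} = G^m_{i:0}$ ; $i = 0, \ldots, n-1$ ; $l = 1, \ldots, m$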
Parallel-prefix algorithms [14] :
multi-tree structures ($T = O(n) \rightarrow O(\log n)$)
sharing subtrees ($A = O(n^2) \rightarrow O(n \log n)$)
different algorithms trading area vs. delay (also influenced by wiring and maximum fan-out $FO_{max}$)
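The three steps of a parallel-prefix adder (preprocessing, prefix carry computation, postprocessing) can be sketched in software. The following is a minimal illustration, not a hardware description: the function names `prefix_op` and `sklansky_add` are hypothetical, the graph follows a Sklansky-style scheme, and folding the carry-in into bit 0 is an assumption of this sketch.

```python
from math import ceil, log2


def prefix_op(hi, lo):
    """Associative prefix operator on (G, P) pairs; `hi` covers the more significant bits."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def sklansky_add(a, b, n, c_in=0):
    """Add two n-bit integers; returns (sum mod 2**n, carry-out)."""
    # Preprocessing: g_i = a_i b_i, p_i = a_i xor b_i.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]
    # Level-0 group variables; the carry-in is folded into bit 0
    # (assumption of this sketch) so that G_{i:0} equals c_{i+1}.
    y = [(g[i], p[i]) for i in range(n)]
    y[0] = (g[0] | (p[0] & c_in), 0)
    levels = ceil(log2(n)) if n > 1 else 0
    for l in range(levels):
        nxt = list(y)
        for i in range(n):
            if (i >> l) & 1:                 # bit i lies in the upper half of its group
                j = ((i >> l) << l) - 1      # most significant bit of the lower half
                nxt[i] = prefix_op(y[i], y[j])
        y = nxt
    carries = [c_in] + [y[i][0] for i in range(n)]   # c_0 ... c_n
    # Postprocessing: s_i = p_i xor c_i.
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, carries[n]
```

Exchanging the inner loop for another prefix graph (serial, Brent-Kung, Kogge-Stone) changes only the area/delay trade-off, not the result, because the operator is associative.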

c0

s0

s n-2

...
s1

...
s n-1

$g_i = a_i b_i$
$p_i = a_i \oplus b_i$

(g0 , p0 )

c out

CLB

Group variables Yil:k : covers bits (xk

preprocessing:
c in

c n p n-1

CLB

Computer Arithmetic: Principles, Architectures, and VLSI Design

$T = 4 + 2T_\bullet$

a1
b1
a0
b0

a n-1
b n-1
a n-2
b n-2

...

c in

Associativity of

+ High regularity : suitable for synthesis and layout

$A = 5n + 3A_\bullet$

56n log n

Prefix problem

Parallel-prefix adders (PPA)

AT

c 11 . . . c 8 cla.epsi c 7 . . . c 4
97 48 mm

s0

Computer Arithmetic: Principles, Architectures, and VLSI Design


4 Addition

4 log n

CLB

c 15 . . . c 12

...
s3

(g15,p15) . . . (g12,p12)(g11,p11) . . . (g8,p8) (g7,p7) . . . (g4,p4) (g3,p3) . . . (g0,p0)

CLB

c out

14n

FA

cosa.epsi
100 57 mm

b1

$s_i = p_i \oplus c_i$


Sklansky parallel-prefix algorithm (⇒ PPA-SK)

Prefix algorithms
Algorithms visualized by directed acyclic graphs (DAG) with array structure (n bits × m levels)
Graph vertex symbols :
black node ($\bullet$) : $(G^l_{i:k}, P^l_{i:k}) = (G^{l-1}_{i:j+1}, P^{l-1}_{i:j+1}) \bullet (G^{l-1}_{j:k}, P^{l-1}_{j:k})$

Tree-like collection, parallel redistribution of carries

1
2

?
i
;
(Gli:k Pil:k ) ?
(Gli:k Pil:k )

[Figure sk.epsi: Sklansky prefix graph, levels 0 to 4]

Brent-Kung parallel-prefix algorithm (⇒ PPA-BK)

$T_\bullet$ : graph depth (number of black nodes on critical path)

Serial-prefix algorithm (⇒ RCA)

Traditional CLA is PPA-BK with 4-bit groups


Tree-like redistribution of carries (fan-out tree)

$A_\bullet = 2n - \lceil\log n\rceil - 2$, $T_\bullet = 2\lceil\log n\rceil - 2$, $FO_{max} \approx \log n$

$A_\bullet = n - 1$, $T_\bullet = n - 1$, $FO_{max} = 2$
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0
1
2
3
4
5
6

...

ser.epsi///figures
69 38 mm

14
15


Kogge-Stone parallel-prefix algorithm (⇒ PPA-KS)

bk.epsi///figures
67 38 mm


Mixed serial/parallel-prefix algorithm (⇒ RCA + PPA)


linear size-depth trade-off using parameter k :

very high wiring requirements

1
2

(contains no logic)

Performance measures :
$A_\bullet$ : graph size (number of black nodes)

0
1
2
3

$A_\bullet = \tfrac{1}{2} n \log n$, $T_\bullet = \lceil\log n\rceil$, $FO_{max} = \tfrac{1}{2} n$

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0

(Gil;:k1 Pil:;k 1 )

(contains logic for )

4 Addition

$A_\bullet = n \log n - n + 1$, $T_\bullet = \lceil\log n\rceil$, $FO_{max} = 2$

$k \le n - 2\lceil\log n\rceil + 2$

$k = 0$ : serial-prefix graph
$k = n - 2\lceil\log n\rceil + 1$ : Brent-Kung parallel-prefix graph

fills gap between RCA and PPA-BK (i.e. CLA) in steps of single $\bullet$-operations

ks.epsi///figures
67 52 mm

$A_\bullet = n - 1 + k$, $T_\bullet = n - 1 - k$, $FO_{max}$ = var.
4
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0
1
2
3
4
5
6
7
8
9
10

Carry-increment parallel-prefix algorithm (⇒ CIA)

$A_\bullet \approx 2n - 1.4 n^{1/2}$, $T_\bullet \approx 1.4 n^{1/2}$, $FO_{max} \approx 1.4 n^{1/2}$

[Figure cia.epsi: carry-increment prefix graph, bits 15 to 0, levels 0 to 5]

Computer Arithmetic: Principles, Architectures, and VLSI Design

38

var.epsi///figures
68 54 mm

Computer Arithmetic: Principles, Architectures, and VLSI Design

39

4 Addition

4.3 Carry-Propagate Adders (CPA)

Example : 4-bit parallel-prefix adder (PPA-SK)


efficient AND-OR-prefix circuit for the generate signals and AND-prefix circuit for the propagate signals
optimization : alternating AOI-/OAI- and NAND-/NOR-gates (inverting gates are smaller and faster)
can also be realized using two MUX-prefix circuits
a3

b3

a2

b2

a1

b1

a0

4 Addition

4.3 Carry-Propagate Adders (CPA)

Prefix adder synthesis


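Local prefix graph transformation :

[Figures unfact.epsi and fact.epsi: two equivalent 4-bit prefix graphs, bits 3 to 0, levels 0 to 3]

graph with $A_\bullet = 3$, $T_\bullet = 3$ → (depth-decr. transform) → graph with $A_\bullet = 4$, $T_\bullet = 2$
size-decr. transform : opposite direction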

Repeated (local) prefix transformations result in overall minimization of graph depth or size ⇒ which sequence ?
Goal : minimal size (area) at given depth (delay)
Simple algorithm for sequence of applied transforms :
Step 1 : prefix graph compression (depth minimization) : depth-decreasing transforms in right-to-left bottom-up order
Step 2 : prefix graph expansion (size minimization) : size-decreasing transforms in left-to-right top-down order, if allowed depth not exceeded

askgate.epsi///figures
100 103 mm

Prefix adder synthesis : 1) generate serial-prefix graph, 2) graph compression, 3) depth-controlled graph expansion, 4) generate pre-/postprocessing and prefix logic
+ Generates all previous prefix graphs (except PPA-KS)
+ Universal adder synthesis algorithm : generates area-optimal adders for any given timing constraints [14] (including non-uniform signal arrival times)

c out
P n-1:0

s3

s2

s1

s0


Multilevel adders


Self-timed adders

Multilevel versions of adders of type a) possible (CSKA, CSLA, and CIA; notation: 2-level CIA = CIA-2L)
+ Delay is $O(n^{1/(m+1)})$ for m levels
Area increase small for CSKA and CIA, high for CSLA (⇒ COSA)

Average carry-propagation length : log n
+ RCA is fast in average case (T = O(log n)), slow in worst case ⇒ suitable for self-timed asynchronous designs [16]
Completion detection is not trivial

Difficult computation of optimal group sizes

Adder performance comparisons


Standard-cell implementations, 0.8 µm process

Hybrid adders
Arbitrary combinations of speed-up techniques possible ⇒ hybrid/mixed adder architectures
Often used combinations : CLA and CSLA [15]

area [lambda^2]
RCA
128-bit

1e+07

CSKA-2L
CIA-1L

Pure architectures usually perform best (at gate-level)

CIA-2L

64-bit

PPA-SK

Transistor-level adders

PPA-BK
32-bit

Influence of logic styles (e.g. dynamic logic, pass-transistor logic ⇒ faster)

addperf.ps
84 84 mm

CLA
COSA
const. AT

16-bit

+ Efficient transistor-level implementation of ripple-carry


chains (Manchester chain) [15]

1e+06
8-bit
5

+ Combinations of speed-up techniques make sense


Much higher design effort

Computer Arithmetic: Principles, Architectures, and VLSI Design

delay [ns]
5

Many efficient implementations exist and have been published


42

10

20

Computer Arithmetic: Principles, Architectures, and VLSI Design

43

4.3 Carry-Propagate Adders (CPA)

Complexity comparison under the unit-gate model

$2c_{i+1} + s_i = a_{0,i} + a_{1,i} + a_{2,i}$ ; $i = 0, 1, \ldots, n-1$ (n.)

Result is in redundant carry-save format (n digits), represented by two n-bit numbers S (sum bits) and C (carry bits)

p
( )

+ Parallel arrangement of n full-adders, constant delay
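The carry-save principle (three operands in, a sum word and a carry word out, no carry propagation) can be illustrated with word-level bit operations. This is a minimal sketch; the function name `carry_save_add` is hypothetical, and the carry word is returned already shifted by one position so that C + S equals the operand sum.

```python
def carry_save_add(a0, a1, a2, n):
    """Reduce three n-bit operands to (C, S) with C + S == a0 + a1 + a2."""
    mask = (1 << n) - 1
    a0, a1, a2 = a0 & mask, a1 & mask, a2 & mask
    s = a0 ^ a1 ^ a2                                  # s_i: sum bit of each full-adder
    c = ((a0 & a1) | (a0 & a2) | (a1 & a2)) << 1      # c_{i+1}: carry bit, weight 2^(i+1)
    return c, s
```

Every bit position is handled by one independent full-adder, which is why the delay does not depend on n.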

a 0,n-1
a 1,n-1

a 2,n-1

csa.epsi
27FA
mm

. . . 67

FA
cn

c2

s n-1

FA
c1

s1

s0

Multi-operand carry-save adders (m > 3)


) adder array (linear arrangement), adder tree (tree arr.)
44

4.5 Multi-Operand Adders

Computer Arithmetic: Principles, Architectures, and VLSI Design


4 Addition

45

4.5 Multi-Operand Adders

a) 4-operand CPA (RCA) array :


a 0,n-1
a 1,n-1

Add three or more (m > 2) n-bit operands, yield $(n + \lceil\log m\rceil)$-bit result in irredundant number rep. [1, 2]

FA
a 2,n-1

Realization by array adders : (see figures on next page)


a) linear arrangement of CPAs
b) linear arr. of CSAs (adder array) and final CPA

CSA

FA

HA

a 3,n-1
FA

a 2,2

a 2,1

...

FA

a 3,2

CPA

a 2,0

cparray.epsi
93 57 mm FA
FA

CPA

HA

a 3,1

a 3,0

FA

FA

HA

s2

s1

s0

CPA

...

sn

s n-1

FA

...

...

a 3,n-1
mopadd.epsi
CSA
30 58 mm

a 3,2
...

csarray.epsi
99FA 57 mm

FA
a 3,1
FA

FA

CSA

a 3,0
HA

CSA

...

FA

FA

a 2,0

a 0,2
a 1,2

A m-1

a 0,0
a 1,0

b) 4-operand CSA array with final CPA (RCA) :


a 2,n-1

A3

...

FA

a 0,n-1
a 1,n-1

A0 A1 A2

FA

...

Array adders

a) and b) differ in bit arrival times at final CPA :
⇒ if CPA = RCA : a) and b) have same overall delay
⇒ if fast final CPA : uniform bit arrival times required ⇒ CSA array (b)
Fast implementation : CSA array + fast final CPA (note: array of fast CPAs not efficient/necessary)

a 0,0
a 1,0

4.5 Multi-Operand Adders

A = O(mn + n)
T = O(m + n)

a 2,0

A = 7n T = 4

Computer Arithmetic: Principles, Architectures, and VLSI Design

CPA = RCA :

$(C, S)_{out} = A + (C, S)_{in}$

optimality regarding area and delay


aaa : smallest area, longest delay
aat : small area, medium delay
att : medium area, short delay
ttt : large area, shortest delay
: not optimal
2 obtained from prefix adder synthesis
3 automatic logic optimization not possible (redundancy)
4 exact factors not calculated
5 corresponds to 4-bit PPA-BK

$A = (m - 2) A_{CSA} + A_{CPA}$
$T = (m - 2) T_{CSA} + T_{CPA}$

b) Adds one n-bit operand to an n-digit carry-save operand

p
p

4 Addition

csasymbol.epsi
26 mm
21 CSA

2ci+1 + si

p
p
p

ttt
att

A0 A1 A2

a 0,0
a 1,0

3n log2 n
40n log n
6n log2 n
56n log n
6n log2 n

$(C, S) = C + S = A_0 + A_1 + A_2$

a 2,1

2 log n
4 log n
2 log n
4 log n
2 log n

aat 3

att
att

a 2,1

adder      A              T               AT
RCA        7n             2n              14n^2
CSKA-1L    8n             4n^(1/2)        32n^(3/2)
CSKA-2L    8n             xn^(1/3) (4)    xn^(4/3) (4)
CSLA-1L    14n            2.8n^(1/2)      39n^(3/2)
CIA-1L     10n            2.8n^(1/2)      28n^(3/2)
CIA-2L     10n            3.6n^(1/3)      36n^(4/3)
CIA-3L     10n            4.4n^(1/4)      44n^(5/4)
PPA-SK     3/2 n log n    2 log n         3n log^2 n
PPA-BK     10n            4 log n         40n log n
PPA-KS     3n log n       2 log n         6n log^2 n
CLA (5)    14n            4 log n         56n log n
COSA       3n log n       2 log n         6n log^2 n

a) Adds three n-bit operands $A_0$, $A_1$, $A_2$ performing no carry-propagation (i.e. carries are saved) [1]

opt.1 syn.2
p
aaa

AT

a 0,1
a 1,1

7n

RCA

4.4 Carry-Save Adder (CSA)

a 0,1
a 1,1

4.4 Carry-Save Adder (CSA)

a 2,2

adder

4 Addition

a 0,2
a 1,2

4 Addition

Fast CPA :

A = O(mn + n log n)
T = O(m + log n)

CPA

FA

FA

sn

s n-1

FA

HA

s2

s1

CPA

...

s0

S
Computer Arithmetic: Principles, Architectures, and VLSI Design

46

Computer Arithmetic: Principles, Architectures, and VLSI Design

47

4 Addition

4.5 Multi-Operand Adders

4 Addition

4.5 Multi-Operand Adders

(m, 2)-compressors
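$2\,(c + \sum_{l=0}^{m-4} c^l_{out}) + s = \sum_{i=0}^{m-1} a_i + \sum_{l=0}^{m-4} c^l_{in}$

$A = 7(m - 2)$
$T_{LIN} = 4(m - 2)$, $T_{TREE} = 6(\lceil\log m\rceil - 1)$

[Figure cprsymbol.epsi: (m,2)-compressor symbol with inputs $a_0 \ldots a_{m-1}$, carry inputs $c^0_{in} \ldots c^{m-4}_{in}$, and carry outputs $c^0_{out} \ldots c^{m-4}_{out}$]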

Optimized (4, 2)-compressor :

2 full-adders merged and optimized (i.e. XORs arranged in tree structure)

1-bit adders (similar to (m, k)-counters) [17]
Compresses m bits down to 2 by forwarding (m − 3) intermediate carries to next higher bit position
Is bit-slice of multi-operand CSA array (see prev. page)
+ No horizontal carry-propagation (i.e. $c^l_{in} \rightarrow c^k_{out}$, $k > l$)
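The (m, 2)-compressor relation can be checked for m = 4 with a small model built from two chained full-adders. This is a sketch under the assumption of the plain (non-optimized) full-adder arrangement; the function names are hypothetical.

```python
def full_adder(a, b, c):
    """(3,2)-compressor: returns (carry, sum) with 2*carry + sum == a + b + c."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return carry, s


def compressor_4_2(a0, a1, a2, a3, c_in):
    """(4,2)-compressor from two full-adders; returns (c, c_out, s)."""
    c_out, t = full_adder(a0, a1, a2)    # intermediate carry, independent of c_in
    c, s = full_adder(t, a3, c_in)
    return c, c_out, s
```

Because `c_out` is computed from the operand bits only, a row of such cells has no carry path that ripples through more than one cell.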

A = 14 T = 6

A = 14 T = 8

a0 a1

a0 a1 a2 a3

Built from full-adders (= (3, 2)-compressor) or (4, 2)-compressors arranged in linear or tree structures

FA
cpr42fa.epsi
32 38 mm

c out

Example : 4-operand adder using (4, 2)-compressors

c in

FA

cpr42opt.epsi
1
41 53 mm

c out

c in

(4,2)
cpradd.epsi
99 44 mm

(4,2)

(4,2)

FA

FA

HA

sn

s n-1

s2

s1

4 Addition

CPA

48

higher compression rate (4:2 instead of 3:2)


less deep and more regular trees
012 3 4 5

SD-FA (signed-digit full-adder) is similar to


(4, 2)-compressor regarding structure and complexity

s0

Advantages of (4, 2)-compressors over FAs for realizing


(m, 2)-compressors :

FA
(4,2)

optimized

CSA

4.5 Multi-Operand Adders

# operands

with full-adders

Computer Arithmetic: Principles, Architectures, and VLSI Design

tree depth

+ same area, 25% shorter delay

FA
s n+1

a 0,0
a 1,0
a 2,0
a 3,0

a 0,1
a 1,1
a 2,1
a 3,1

a 0,2
a 1,2
a 2,2
a 3,2

a 0,n-1
a 1,n-1
a 2,n-1
a 3,n-1

(4,2)

a2 a3

Computer Arithmetic: Principles, Architectures, and VLSI Design


4 Addition

49

4.5 Multi-Operand Adders

Tree adders (Wallace tree)


Adder tree : n-bit m-operand carry-save adder
composed of n tree-structured (m, 2)-compressors [1, 18]
Tree adders : fastest multi-operand adders using an
adder tree and a fast final CPA

7 8 9 10

2 3 4 6 9 13 19 28 42 63 94
2 4 8 16 32 64 128

$A = A_{(m,2)}\, n + A_{CPA} = O(mn + n \log n)$
$T = T_{(m,2)} + T_{CPA} = O(\log m + \log n)$

Example : (8, 2)-compressor

A = 42 T = 16
a0a1 a2a3

0
c out

FA

A = 42 T = 12
a0a1a2a3

a4a5 a6a7
FA

1
c out

0
c out

c in0
c in1

FA

2
c out

FA
cpr82fa.epsi
47 65 mm

3
c out

FA

4
c out

FA
c

c in2

a4a5a6a7

(4,2)

(4,2)

1
c out
2
c out

Adder arrays and adder trees revisited

c in0
c in1

cpr82cpr42.epsi
47 50 mm

c in2

3
c out

c in3
4
c out

c in4

(4,2)

Some FA can often be replaced by HA or eliminated


(i.e. redundant due to constant inputs)
Number of (irredundant) FA does not depend on adder
structure, but number of HA does

c in3

An m-operand adder accommodates (m − 1) carry inputs

c in4

Adder trees (T = O(log n)) are faster than adder arrays (T = O(n)) at same amount of gates (A = O(mn))

Adder trees are less regular and have more complex routing than adder arrays ⇒ larger area, difficult layout (i.e. limited use in layout generators)

(4, 2)-compressor tree

full-adder tree
Computer Arithmetic: Principles, Architectures, and VLSI Design

50

Computer Arithmetic: Principles, Architectures, and VLSI Design

51

4 Addition

4.6 Sequential Adders

4.6 Sequential Adders

5 Simple / Addition-Based Operations

5 Simple / Addition-Based Operations

Bit-serial adder : Sequential n-bit adder

2s complementer (negation)

Z
A

$A - B = A + (-B) = A + \bar{B} + 1$

sub.epsi
29 32 mm

CPA

c out

S
A B

2s complement adder/subtractor

Allows higher clock rates


Final CPA too slow :
) pipelining or multiple
cycles for evaluation

$A \pm B = A + (-1)^{sub} B = A + (B \oplus sub) + sub$
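The adder/subtractor identity can be tried out on n-bit words. A minimal sketch, with the hypothetical name `add_sub`; `sub` is 0 for addition and 1 for subtraction, and the control bit is XORed onto every bit of B and also used as carry-in.

```python
def add_sub(a, b, sub, n):
    """2's complement adder/subtractor; returns (S, c_out) for n-bit operands."""
    mask = (1 << n) - 1
    b_x = (b & mask) ^ (mask if sub else 0)   # B xor sub, applied bitwise
    total = (a & mask) + b_x + sub            # sub doubles as carry-in (+1)
    return total & mask, total >> n
```

For unsigned operands the carry-out of a subtraction is 1 exactly when A >= B, which is the property used by the comparators later in the notes.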

CSA
accucsa.epsi
33 52 mm

A = ACSA + ACPA + 4AREG


T = TCSA + TREG
L=m

CPA

c out

$A + B \pmod{2^n - 1} = A + B + c_{out}$

5 Simple / Addition-Based Operations

52

5.2 Increment / Decrement

sub

5 Simple / Addition-Based Operations

1
2

1
2

n log2 n

$(c_{out}, Z) = A - c_{in}$
a2

a1

a0

...

Z
dec.epsi
93 41 mm

c out

A = 3n T = n + 1 AT

c in

...

Example : Ripple-carry incrementer using half-adders


3n2

z n-1

a0

z2

z1

z0

Incrementer-decrementer

...

incfa.epsi
59c 23HA
mm c
2
1

) AND-prefix struct.

n log n + 2n T = dlog ne + 2 AT

a n-1

incsymbol.epsi
+1
c out 29 26 mm c in

53

5.2 Increment / Decrement

$C_{i:k} = C_{i:j+1} C_{j:k}$

Decrementer

= 0 () FA ! HA)

a1

c in

Prefix problem :

addmod.epsi
28 mm
29 CPA

Computer Arithmetic: Principles, Architectures, and VLSI Design

Incrementer
Adds a single bit $c_{in}$ to an n-bit operand A
$(c_{out}, Z) = c_{out} 2^n + Z = A + c_{in}$

c out

(end-around carry)

5.2 Increment / Decrement

$z_i = a_i \oplus c_i$
$c_{i+1} = a_i c_i$ ; $i = 0, \ldots, n-1$
$c_0 = c_{in}$, $c_{out} = c_n$ (r.m.a.)

1s complement adder

CPA

Computer Arithmetic: Principles, Architectures, and VLSI Design

Corresponds to addition with B

addsub.epsi
36 35 mm

Mixed CSA/CPA : CSA with partial CPAs (i.e. fewer


carries saved), trade-off between speed and register size

c n-1

2s complement subtractor

accucpa.epsi
CPA
27 28 mm

With CSA and final CPA

HA

+1

si

Accumulators : Sequential m-operand adders


A
With CPA

a n-1

neg.epsi
21 32 mm1

$-A = \bar{A} + 1$

bitseradd.epsi
FA
25 27 mm

A = ACPA + AREG
T = TCPA + TREG
L=m

5.1 Complement and Subtraction


ai bi

A = AFA + AFF
T = TFA + TFF
L=n

c out

5.1 Complement and Subtraction

HA

$(c_{out}, Z) = A \pm c_{in} = A + (-1)^{dec} c_{in}$

c in

...

z n-1

z1

a n-1

z0

a2

a1

a0

or using incrementer slices (= half-adder)


a n-1

a2

a1

dec

a0

...

...

c out

incdec.epsi
94 46 mm
inc.epsi
83 33 mm

c out

c in

c in

...

...

HA
z n-1

z2

z1

Computer Arithmetic: Principles, Architectures, and VLSI Design

z0

z n-1
54

z2

z1

Computer Arithmetic: Principles, Architectures, and VLSI Design

z0
55

5 Simple / Addition-Based Operations

5.2 Increment / Decrement

Fast incrementers

5 Simple / Addition-Based Operations

5.2 Increment / Decrement

Gray incrementer

4-bit incrementer using multi-input gates :


a3

a2

a1

Increments in Gray number system

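$c_0 = a_{n-1} \oplus a_{n-2} \oplus \cdots \oplus a_0$ (parity)
$c_{i+1} = \bar{a}_i c_i$ ; $i = 0, \ldots, n-3$ (r.m.a.)
$z_0 = \overline{a_0 \oplus c_0}$
$z_i = a_i \oplus a_{i-1} c_{i-1}$ ; $i = 1, \ldots, n-2$
$z_{n-1} = a_{n-1} \oplus c_{n-2}$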
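The Gray incrementer equations can be checked against the binary-reflected Gray code. The sketch below assumes that code (the helper `to_gray` is not part of the notes), takes the complemented signals as inverted in the way that makes the sequence come out right, and uses hypothetical function names.

```python
def to_gray(x):
    """Binary-reflected Gray code of x (assumed coding for this sketch)."""
    return x ^ (x >> 1)


def gray_increment(a, n):
    """Increment an n-bit Gray-coded word (n >= 2) without converting to binary."""
    def bit(i):
        return (a >> i) & 1

    c = [0] * (n - 1)
    c[0] = bin(a & ((1 << n) - 1)).count("1") & 1      # parity of the word
    for i in range(n - 2):
        c[i + 1] = (1 - bit(i)) & c[i]                 # c_{i+1} = not(a_i) and c_i
    z = [0] * n
    z[0] = 1 - (bit(0) ^ c[0])                         # flip bit 0 on even parity
    for i in range(1, n - 1):
        z[i] = bit(i) ^ (bit(i - 1) & c[i - 1])
    z[n - 1] = bit(n - 1) ^ c[n - 2]
    return sum(z[i] << i for i in range(n))
```

Only one output bit differs from the input for every word, including the wrap-around from 10...0 back to 00...0.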

a0

c in

inccg.epsi
62 39 mm

c out
z3

z2

z1

z0

Prefix problem ⇒ AND-prefix structure

8-bit parallel-prefix incrementer (Sklansky AND-prefix


structure) :
a7

a6

a5

a4

a3

a2

a1

a0
c in

incpp.epsi
98 63 mm

c out

z7

z6

z5

z4

z3

z2

z1

z0

Computer Arithmetic: Principles, Architectures, and VLSI Design

56

5 Simple / Addition-Based Operations

5.3 Counting

Count clock cycles ) counter,


divide clock frequency ) frequency divider (cout )

Counter using Gray incrementer


+1
cntblock.epsi
32 33 mm

c out

c in

Ring counters
Shift register connected to ring :

clk
Q

Example : Ripple-carry up-counter using counter slices


(= HA + FF), cin is count enable
c out

c in

cntring.epsi
51 16 mm

q n-1

q1

q0

q2

q1

01)

Johnson / twisted-ring counter (inverted feed-back) :


cntjohnson.epsi
59 16 mm

clk

cntasync.epsi
64 18 mm

q n-1

q0

Applications:
fast dividers (no logic between FF)
state counter for one-hot coded FSMs

q2

q1

Must be initialized correctly (e.g. 00

cntripple.epsi
87 36 mm

...

q2

State is not encoded ) n FF for counting n states

Asynchronous counter using toggle-flip-flops


(lower toggle rate ) lower power)
T

5.3 Counting

Gray counter

Binary counter
Sequential in-/decrementer
Incrementer speed-up
techniques applicable
Down- and up-down-counters
using decrementers /
incrementer-decrementers

q n-1

5 Simple / Addition-Based Operations

57

Fast divider (T = O(1)) using delayed-carry numbers (irredundant carry-save representation of −1 allows using fast carry-save incrementer) [8]

5.3 Counting

...

Computer Arithmetic: Principles, Architectures, and VLSI Design

q n-1

q1

q0

n FF for counting 2n states

q0

Computer Arithmetic: Principles, Architectures, and VLSI Design

q2

58

Computer Arithmetic: Principles, Architectures, and VLSI Design

59

5 Simple / Addition-Based Operations

5.4 Comparison, Coding, Detection

5.4 Comparison, Coding, Detection

5 Simple / Addition-Based Operations

5.4 Comparison, Coding, Detection

Comparators

Subtractor (A ; B ) :

Comparison operations

$EQ = (A = B)$ (equal)
$NE = (A \ne B) = \overline{EQ}$ (not equal)
$GE = (A \ge B)$ (greater or equal)
$LT = (A < B) = \overline{GE}$ (less than)
$GT = (A > B) = GE \cdot \overline{EQ}$ (greater than)
$LE = (A \le B) = \overline{GT} = \overline{GE} + EQ$ (less or equal)

cmpsub.epsi
37 31 mm

GE = cout
EQ = Pn;1:0

CPA

GE = c out

EQ = P n-1:0

(for free in PPA)

ARCA = 7n TRCA = 2n or
APPA;KS 32 n log n TPPA;KS

2 log n

Optimized comparator :

$eq_{i+1} = (a_i = b_i)\, eq_i = \overline{(a_i \oplus b_i)}\, eq_i$ ; $i = 0, \ldots, n-1$
$eq_0 = 1$, $EQ = eq_n$ (r.s.a.)

removing redundancies in subtractor (unused $s_i$)
single-tree structure ⇒ speed-up at no cost :

a0
b0

a2
b2

a n-1
b n-1

EQ = (A = B )

a1
b1

Equality comparison

...

A = 6n TLIN = 2n TTREE

cmpeq.epsi
40 36 mm

2 log n

a0
b0

a1
b1

EQ

a2
b2

a n-1
b n-1

example : ripple comparator using comparator slices

Magnitude comparison
cmpripple.epsi
100 47 mm

$ge_{i+1} = (a_i > b_i) + (a_i = b_i)\, ge_i = a_i \bar{b}_i + \overline{(a_i \oplus b_i)}\, ge_i$ ; $i = 0, \ldots, n-1$
$ge_0 = 1$, $GE = ge_n$ (r.s.a.)
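The equality and magnitude recurrences translate directly into a ripple comparator that runs from the least to the most significant bit. A minimal sketch for unsigned operands; the function name `ripple_compare` is hypothetical.

```python
def ripple_compare(a, b, n):
    """Return (EQ, GE) for unsigned n-bit operands using comparator slices."""
    eq, ge = 1, 1                                # eq_0 = 1, ge_0 = 1
    for i in range(n):                           # LSB first, MSB decides last
        ai, bi = (a >> i) & 1, (b >> i) & 1
        same = 1 - (ai ^ bi)                     # (a_i = b_i)
        ge = (ai & (1 - bi)) | (same & ge)       # ge_{i+1}
        eq = same & eq                           # eq_{i+1}
    return eq, ge
```

The remaining comparison results follow from these two, e.g. GT as GE and not EQ.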

equality

EQ

60

5.4 Comparison, Coding, Detection

Decoder
Decodes binary number $A_{n-1:0}$ to vector $Z_{m-1:0}$ ($m = 2^n$)
$z_i = 1$ if $A = i$, $0$ else ; $i = 0, \ldots, m-1$
$Z = 2^A$

Computer Arithmetic: Principles, Architectures, and VLSI Design


5 Simple / Addition-Based Operations

61

5.4 Comparison, Coding, Detection

Detection operations
All-zeroes detection : $z = \overline{a_{n-1} + a_{n-2} + \cdots + a_0}$
All-ones detection : $z = a_{n-1} a_{n-2} \cdots a_0$ (r.s.a.)

$A = n$, $T = \log n$

a0
decoder.epsi
58 28 mm

decodersym.epsi
26 mm
21decoder

magnitude

GE

Computer Arithmetic: Principles, Architectures, and VLSI Design


5 Simple / Addition-Based Operations

equality &
magnitude

...

GE = (A B )

Leading-zeroes detection (LZD) :


for scaling, normalization, priority encoding

z7

z6

z5

z4

z3

z2

z1

z0

a) non-encoded output :

A = (n ; 1)2n T = dlog ne
Encoder
Encodes vector $A_{m-1:0}$ to binary number $Z_{n-1:0}$ ($m = 2^n$)
(condition: $\exists i\, \forall k$ : if $k = i$ then $a_k = 1$ else $a_k = 0$)
$Z = i$ if $a_i = 1$ ; $i = 0, \ldots, m-1$ ; $Z = \log_2 A$
A
encodersym.epsi
26 mm
21encoder

z0

a n-2

(e.g. 000101 → 000100)

A = 2n T = n

a1

...

a0

lzdnenc.epsi
50 28 mm
...

z n-1

z n-2

z1

z0

b) encoded output : + encoder

z1

signed numbers : + leading-ones detector (LOZ)

A = n(2n;1 ; 1)
T =n;1

a n-1

prefix problem (r.m.a.) ⇒ AND-prefix structure
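Leading-zeroes detection with non-encoded output keeps only the most significant 1 of the input. The following sketch computes it with a serial AND-prefix over the inverted higher-order bits; a parallel prefix structure would yield the same values with logarithmic depth. The function names are hypothetical.

```python
def leading_one_mask(a, n):
    """Non-encoded LZD output: only the leading 1 of the n-bit word a is kept."""
    z = 0
    all_zero_above = 1                      # AND-prefix of inverted higher bits
    for i in range(n - 1, -1, -1):          # MSB first
        ai = (a >> i) & 1
        z |= (ai & all_zero_above) << i
        all_zero_above &= 1 - ai
    return z


def leading_zero_count(a, n):
    """Encoded output: number of leading zeroes (n for an all-zero word)."""
    z = leading_one_mask(a, n)
    return n if z == 0 else n - z.bit_length()
```

For the example in the notes, the input 000101 gives the mask 000100 and a count of three leading zeroes.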

a7a5a3a1
a6a4a2a0
encoder.epsi
30 34 mm

{0}1{0|1} → {0}1{0}

z2

(note: connections
according to PPA-SK)

Computer Arithmetic: Principles, Architectures, and VLSI Design

62

Computer Arithmetic: Principles, Architectures, and VLSI Design

63

5 Simple / Addition-Based Operations

5.5 Shift, Extension, Saturation

5.5 Shift, Extension, Saturation

Rotation by k bit positions, n constant (logic operation)


Extension of word lengths by k bits (n ! n + k )
(i.e. sign-extension for signed numbers)
Saturation to highest/lowest value after over-/underflow
unl.
signed r.
signed l.
r.

shift b)

unsigned
signed

rotate
extend

l.
r.
unl.
signed r.
signed l.
r.

saturate unsigned
signed

an;2
0 an;1
an;1
an;3
an;1 an;1 an;2
an+k;1
a2n;1 an+k;2
an;2
a0 an;1
0 an;1
an;1
an;1 an;1 an;2
an;1
an;2
an;1
an;1
an;1

:::
:::
:::
:::
:::
:::
:::
:::
:::
:::
:::
:::
:::
:::

a0 0
a1
a0 0
a1
ak
ak
a0 an;1
a1
a0
a0 0
a0
a0 0
an;1
an;1

Computer Arithmetic: Principles, Architectures, and VLSI Design


5 Simple / Addition-Based Operations

sll
srl
sla
sra

C
V

formula

cn
cn cn;1
an bnsn + anbn sn
Z
8i : si = 0
N
sn;1

A = O(n2)
T = O(log n)

rol
ror

a3

a2

a1

a3

a0

a0

s1 s0

barshift.epsi
44 49 mm

s1 s0
z3

z2

z1

z0

z3

multiplexers

5.6 Addition Flags

a1

s1 s0

s1

64

a2

s1 s0

muxshift.epsi
41 28 mm

s0

z2

z1

z0

tristate buffers

Computer Arithmetic: Principles, Architectures, and VLSI Design


5 Simple / Addition-Based Operations

65
5.6 Addition Flags

Basic and derived condition flags


description
carry flag
signed overflow flag

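operation : S = A + B (+) or S = A − B (−)

condition   flag        formula unsigned    formula signed
S = 0       zero        $Z$                 $Z$
S < 0       negative                        $N$
S ≥ 0       positive                        $\bar{N}$
S > max     overflow    $C$ (+)             $V \bar{C}$
S < min     underflow   $\bar{C}$ (−)       $V C$

operation : A − B

condition   flag   formula unsigned   formula signed
A = B       EQ     $Z$                $Z$
A ≠ B       NE     $\bar{Z}$          $\bar{Z}$
A ≥ B       GE     $C$                $N V + \bar{N} \bar{V}$
A > B       GT     $C \bar{Z}$        $(N V + \bar{N} \bar{V}) \bar{Z}$
A < B       LT     $\bar{C}$          $N \bar{V} + \bar{N} V$
A ≤ B       LE     $\bar{C} + Z$      $N \bar{V} + \bar{N} V + Z$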
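The flag formulas can be exercised with a small model of an n-bit adder/subtractor that returns C, V, Z and N. This is an illustrative sketch; `add_with_flags` is a hypothetical name, and V is computed as the XOR of the carries into and out of the sign position.

```python
def add_with_flags(a, b, sub, n):
    """n-bit add (sub=0) or subtract (sub=1); returns (S, flags)."""
    mask = (1 << n) - 1
    low_mask = mask >> 1
    b_x = (b & mask) ^ (mask if sub else 0)
    c_n1 = ((a & low_mask) + (b_x & low_mask) + sub) >> (n - 1)   # carry into sign bit
    total = (a & mask) + b_x + sub
    s = total & mask
    flags = {
        "C": total >> n,            # carry flag c_n
        "V": (total >> n) ^ c_n1,   # signed overflow flag c_n xor c_{n-1}
        "Z": int(s == 0),           # zero flag
        "N": s >> (n - 1),          # negative flag (sign bit)
    }
    return s, flags
```

After a subtraction, C alone answers the unsigned comparison A >= B, while the signed comparison needs N and V together.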

zero flag
negative flag, sign

Implementation of adder with flags

C, N : for free
V : fast $c_n$, $c_{n-1}$ computed by e.g. PPA ⇒ very cheap
Z : a) $c_{in} = 1$ (subtract.) : $Z = (A = B) = P_{n-1:0}$ (of PPA)
b) $c_{in} = 0/1$ :
1) $Z = \overline{s_{n-1} + s_{n-2} + \cdots + s_0}$ (r.s.a.)
$A = A_{CPA} + n$, $T_Z = T_{CPA} + \lceil\log n\rceil$
2)

Implementation of shift/extension/rotation by
constant values : hard-wired
variable values : multiplexers
n possible values : n-by-n barrel-shifter/rotator
Example : 4-by-4 barrel-rotator

5.6 Addition Flags


flag

5.5 Shift, Extension, Saturation

Applications :
adaptation of magnitude (shift a)) or word length (extension) of operands (e.g. for addition)
multiplication/division by powers of 2 (shift)
logic bit/byte operations (shift, rotation)
scaling of numbers for word-length reduction (i.e. ignore leading zeroes, shift b)) or normalization (e.g. of floating-point numbers, shift a)) using LZD
reducing error after over-/underflow (saturation)

Shift : a) shift n-bit vector by k bit positions
b) select n out of more bits at position k
also: logical (= unsigned), arithmetic (= signed)

shift a)

5 Simple / Addition-Based Operations

faster without final sum (i.e. carry prop.) [19]
example :
  01001100 0
+ 10110100
= 00000000

Unsigned and signed addition/subtraction only differ


with respect to the condition flags

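$z_0 = \overline{((a_0 \oplus b_0) \oplus c_{in})}$
$z_i = \overline{((a_i \oplus b_i) \oplus (a_{i-1} + b_{i-1}))}$
$Z = z_{n-1} z_{n-2} \cdots z_0$ ; $i = 0, \ldots, n-1$ (r.s.a.)
$A = A_{CPA} + 3n$, $T_Z = 4 + \lceil\log n\rceil$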
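The zero flag can be derived from the operand bits alone, without waiting for the sum. The sketch below evaluates the bitwise conditions in a loop (hardware would AND all of them in a tree); the function name `sum_is_zero` is hypothetical, and the inversions are taken so that a condition is true when the corresponding sum bit can be zero.

```python
def sum_is_zero(a, b, c_in, n):
    """Return 1 if (a + b + c_in) mod 2**n == 0, using no carry propagation."""
    lower_or = c_in                          # stands in for (a_{i-1} + b_{i-1}) at i = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        if (ai ^ bi) ^ lower_or:             # z_i would be 0
            return 0
        lower_or = ai | bi
    return 1
```

The check works because, as long as all lower sum bits are zero, the carry into position i equals the OR of the two operand bits below it.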

Computer Arithmetic: Principles, Architectures, and VLSI Design

66

Computer Arithmetic: Principles, Architectures, and VLSI Design

67

5 Simple / Addition-Based Operations

5.7 Arithmetic Logic Unit (ALU)

6.1 Multiplication Basics

6 Multiplication

5.7 Arithmetic Logic Unit (ALU)


A

6 Multiplication

6.1 Multiplication Basics


Multiplies two n-bit operands A and B [1, 2]

c out alusymbol.epsi c in
29 mm
30 ALU
flags
op

Product P is (2n)-bit unsigned number or (2n − 1)-bit signed number

Example : unsigned multiplication

$P = A \cdot B = \sum_{i=0}^{n-1} a_i 2^i \cdot \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j}$ or
$P_i = a_i B$ , $P = \sum_{i=0}^{n-1} P_i 2^i$ ; $i = 0, \ldots, n-1$ (r.s.a.)

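ALU operations

arithmetic      add    $A + B + c_{in}$       sub    $A - B$
                inc    $A + 1$                dec    $A - 1$
                pass   $A$                    neg    $-A$
logic           and    $a_i b_i$              nand   $\overline{a_i b_i}$
                or     $a_i + b_i$            nor    $\overline{a_i + b_i}$
                xor    $a_i \oplus b_i$       xnor   $\overline{a_i \oplus b_i}$
                pass   $a_i$                  not    $\bar{a}_i$
shift/rotate    sll    $A \ll 1$              srl    $A \gg 1$
                sla    $A \ll_a 1$            sra    $A \gg_a 1$
                rol    $A \ll_r 1$            ror    $A \gg_r 1$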

Algorithm
1) Generation of n partial products Pi
2) Adding up partial products :
a) sequentially (sequential shift-and-add),
b) serially (combinational shift-and-add), or
c) in parallel
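The two algorithm steps (generate the partial products, then add them up) can be sketched for unsigned operands as a sequential shift-and-add loop. The function name `shift_and_add_multiply` is hypothetical.

```python
def shift_and_add_multiply(a, b, n):
    """Unsigned n-bit multiplication by accumulating partial products."""
    p = 0
    for i in range(n):
        a_i = (a >> i) & 1
        partial = b if a_i else 0      # P_i = a_i * B
        p += partial << i              # P = sum of P_i * 2^i
    return p
```

Array and parallel multipliers compute the same sum; they only differ in how the n additions are arranged in hardware.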

s/ro : shift/rotate ; l/r : left/right ;


l/a : logic (unsigned) / arithmetic (signed)

Speed-up techniques

Logic of adder/subtractor can partly be re-used for logic


operations
Computer Arithmetic: Principles, Architectures, and VLSI Design
6 Multiplication

68

6.1 Multiplication Basics

Sequential multipliers :
partial products generated
and added sequentially (using
accumulator)

Reduce number of partial products


Accelerate addition of partial products
Computer Arithmetic: Principles, Architectures, and VLSI Design
6 Multiplication

6.2 Unsigned Array Multiplier

6.2 Unsigned Array Multiplier

Braun multiplier : array multiplier for unsigned numbers

mulseq.epsi
34 28 mm

P=

CPA

A = O(n) T = O(log n) L = n
Array multipliers :
partial products generated and
added simultaneously in linear
array (using array adder)

69

CSA
mularr.epsi

mm
34 47
CSA

$A = O(n^2)$, $T = O(n)$

i=0 j =0

b3

CSA

$A = 8n^2 - 11n$
$T = 6n - 9$

aibj 2i+j

a1 b3
a2 b3 a2 b2
+ a3b3 a3b2 a3b1
p7 p6 p5 p4

CSA

;1
nX
;1 nX

a0 b3
a1 b2
a2 b1
a3 b0
p3

b2

a0 b2 a0 b1 a0 b0
a1 b1 a1 b0
a2 b0
p2 p1 p0
b1

b0

a0

CPA
p0

a1

Parallel multipliers :
partial
products
generated in parallel and added
subsequently in multi-operand
adder (using tree adder)

mulpar.epsi
34 43 mm

a2

CPA

a3

$A = O(n^2)$, $T = O(\log n)$

HA

HA
p1

FA

CSA
tree

mulbraun.epsi
FA
99 83 mm

FA
p2

Signed multipliers :
a) complement operands before and result after
multiplication ) unsigned multiplication
b) direct implementation (dedicated multiplier structure)
Computer Arithmetic: Principles, Architectures, and VLSI Design

HA

70

FA

FA

FA

CSA

p3

CPA

3
p7

FA

FA

HA

p6

p5

p4

Computer Arithmetic: Principles, Architectures, and VLSI Design

71

6 Multiplication

6.3 Signed Array Multipliers

6.3 Signed Array Multipliers

Speed-up technique : reduction of partial products

Subtract bits with negative weight ) special FAs [1]

Sequential multiplication
Minimal (or canonical) signed-digit (SD) represent. of A

1 neg. bit : $-a + b + c_{in} = 2c_{out} - s$
2 neg. bits : $a - b - c_{in} = -2c_{out} + s$

+ One cycle per non-zero partial product (i.e. 8ai j ai 6= 0)


Negative partial products

$s = a \oplus b \oplus c_{in}$
$c_{out} = \bar{a} b + \bar{a} c_{in} + b c_{in}$

Replace FAs in regions


1 , 2 , and 3 by :
(input a at mark )

Data-dependent reduction of partial products and latency


Combinational multiplication

Otherwise exactly same structure and complexity as


Braun multiplier ) efficient and flexible

Only fixed reduction of partial product possible


Radix-4 modified Booth recoding : 2 bits recoded to one multiplier digit ⇒ n/2 partial products

Arithmetic transformations yield the following partial


products (two additional ones) :

a0 b3
a1b3 a1 b2
a2 b3 a2 b2 a2 b1
a3 b3 a3 b2 a3 b1 a3 b0
a3
a3
+ 1 b3
b3
p7 p6 p5 p4 p3

a0 b2 a0 b1 a0 b0
a1 b1 a1 b0
a2 b0
p2

p1

Less efficient and regular than modified Braun


multiplier
72

6 Multiplication

6.4 Booth Recoding

Applicable to sequential, array, and parallel multipliers

A : +8n
T : +7

additional recoding logic and more


complex partial product generation
(MUX for shift, XOR for negation)
+ adder array/tree cut in half
) considerably smaller (array and tree)
) slightly or not faster for adder trees

0 0 0 ;p3
=
1
+ 1 1 1 p3

f;2 ;1 0 +1 +2g

=0

mulbooth.epsi
41 43 mm

CSA
array/tree
CPA

Computer Arithmetic: Principles, Architectures, and VLSI Design


6 Multiplication

73

6.6 Multiplier Implementations

6.5 Wallace Tree Addition

A = O(n2) T = O(log n)

Negative partial products (avoid sign-extension) :

p
3 p3 p3 p3 p2 p1 p0 =
| {z }

i=0

$A = \sum_{i=0}^{n/2-1} (a_{2i-1} + a_{2i} - 2a_{2i+1})\, 2^{2i}$ ; $a_{-1} = 0$
(recoded digit $\in \{-2, -1, 0, +1, +2\}$)

Speed-up technique : fast partial product addition

A : =2
T : =2
T : ;0

) much faster for adder arrays

2;1
n=X

$a_{2i+1}$  $a_{2i}$  $a_{2i-1}$    $P_i$
0  0  0    + 0
0  0  1    + B
0  1  0    + B
0  1  1    + 2B
1  0  0    − 2B
1  0  1    − B
1  1  0    − B
1  1  1    − 0
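The recoding table corresponds to evaluating one digit per overlapping bit triple of the 2's complement multiplier. A minimal sketch, assuming an even word length n; both function names are hypothetical.

```python
def booth_radix4_digits(a, n):
    """Recode an n-bit 2's complement word (n even) into n/2 digits in {-2..+2}."""
    def bit(k):
        return 0 if k < 0 else (a >> k) & 1          # a_{-1} = 0

    return [bit(2 * i - 1) + bit(2 * i) - 2 * bit(2 * i + 1) for i in range(n // 2)]


def booth_multiply(a, b, n):
    """Signed product: a is an n-bit 2's complement pattern, b any integer."""
    return sum(d * b * 4 ** i for i, d in enumerate(booth_radix4_digits(a, n)))
```

Each digit selects one of the partial products 0, ±B, ±2B, so only n/2 partial products have to be added.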

p0

Computer Arithmetic: Principles, Architectures, and VLSI Design

A=

Booth
recoding

Baugh-Wooley multiplier

p2 p1 p0

Applicable to parallel multipliers : parallel partial


product generation (normal or Booth recoded)
Irregular adder tree (Wallace tree) due to different
number of bits per column
) irregular wiring and/or layout
) non-uniform bit arrival times at final adder
6.6 Multiplier Implementations

p2 p1 p0

Sequential multipliers :
low performance, small area, component re-use (adder)

Braun or Baugh-Wooley multiplier (array multiplier) :


medium performance, high area, high regularity
layout generators ) data paths and macro-cells
simple pipelining, faster CPA ) higher speed

p03 p02 p01 p00


p03 p03 p03 p03 p02 p01 p00
p13 p12 p11 p10
p13 p13 p13 p12 p11 p10
!
p23 p22 p21 p20
p23 p23 p22 p21 p20
+
p33 p32 p31 p30
+ p33 p32 p31 p30

p6 p5 p4 p3 p2 p1 p0

6.4 Booth Recoding

6.4 Booth Recoding

Modified Braun multiplier

ext. sign

6 Multiplication

p6 p5 p4 p3 p2 p1 p0

Booth-Wallace multiplier (parallel multiplier) [9] :


high performance, high area, low regularity
custom multipliers, netlist generators
often pipelined (e.g. register between CSA-tree and CPA)

Suited for signed multiplication (incl. Booth recod.)


Extend A for unsigned multiplication : an

=0

Radix-8 (3-bit recoding) and higher radices :


precomputing 3B , : : : ) inefficient
Computer Arithmetic: Principles, Architectures, and VLSI Design

Signed-unsigned multiplier : signed multiplier with


operands extended by 1 bit (an = an;1 =0, bn = bn;1 =0)
74

Computer Arithmetic: Principles, Architectures, and VLSI Design

75

6 Multiplication

6.8 Squaring

6.7 Composition from Smaller Multipliers

7.1 Division Basics

AH BL
AH BH AL BL
AL BH

less efficient (area and speed)


6.8 Squaring

$P = A^2 = A \cdot A$

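$A = Q \cdot B + R$ ; $R < B$
$R = A \,\mathrm{rem}\, B$ (remainder)
$A \in [0, 2^{2n} - 1]$ ; $B, Q, R \in [0, 2^n - 1]$ ; $B \ne 0$
$Q < 2^n \rightarrow A < 2^n B$, otherwise overflow
⇒ normalize B before division ($B \in [2^{n-1}, 2^n - 1]$)
$\frac{A}{B} = Q + \frac{R}{B}$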

$A \cdot B = (A_H 2^n + A_L) \cdot (B_H 2^n + B_L)$
$\quad = A_H B_H 2^{2n} + (A_H B_L + A_L B_H) 2^n + A_L B_L$

4 (n × n)-bit multipliers + (2n)-bit CSA + (3n)-bit CPA

7.1 Division Basics

7 Division / Square Root Extraction

(2n × 2n)-bit multiplier can be composed from 4 (n × n)-bit multipliers (can be repeated recursively)
A

7 Division / Square Root Extraction

Algorithms (radix-2)
Subtract-and-shift : partial remainders Ri [1, 2]

Sequential algorithm : recursive, f non-associative

: multiplier optimizations possible

$q_i = (R_{i+1} \ge 2^i B)$ , $R_i = R_{i+1} - q_i 2^i B$
$R_n = A$ , $R = R_0$ ; $i = n-1, \ldots, 0$ (r.m.n.)

a0 a3 a0 a2 a0 a1 a0 a0
a1 a3 a1 a2 a1 a1 a1 a0
a2 a3 a2 a2 a2 a1 a2 a0
+ a3a3 a3a2 a3a1 a3a0
a2 a3 a1 a3 a0 a3 a0 a2 a0 a1
a0 a0
a3 a3
a1 a2
a1 a1
+
a2 a2
p7 p6 p5 p4 p3
p2 p1 p0

Basic algorithm : compare and conditionally subtract


) expensive comparison and CPA
Restoring division : subtract and conditionally restore
(adder or multiplexer) ) expensive CPA and restoring

+ bn=2c + 1 partial products (if no Booth recoding used)


) optimized squarer more efficient than multiplier
;

Non-restoring division : detect sign, subtract/add, and


correct by next steps ) expensive CPA

Table look-up (ROM) less efficient for every n

SRT division : estimate range, subtract/add (CSA), and


correct by next steps ) inexpensive CSA
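
The radix-2 subtract-and-shift recurrence of Section 7.1 can be sketched in Python as follows. This is only an illustrative integer model of the basic compare-and-conditionally-subtract algorithm, not code from the lecture notes; the function name and the explicit word-length argument `n` are assumptions made for the example.

```python
def divide_subtract_and_shift(a, b, n):
    """Divide a 2n-bit dividend a by an n-bit divisor b (both unsigned).

    Implements q_i = (R_{i+1} >= 2**i * B), R_i = R_{i+1} - q_i * 2**i * B
    for i = n-1 ... 0 with R_n = A, and returns (Q, R).
    """
    if not 0 < b < (1 << n):
        raise ValueError("divisor must satisfy 0 < b < 2**n")
    if not 0 <= a < (b << n):
        raise OverflowError("A >= 2**n * B: quotient does not fit into n bits")
    r = a                                   # R_n = A
    q = 0
    for i in range(n - 1, -1, -1):
        q_i = 1 if r >= (b << i) else 0     # compare against the shifted divisor
        r -= q_i * (b << i)                 # conditionally subtract
        q |= q_i << i
    return q, r                             # Q and R = R_0
```

For example, `divide_subtract_and_shift(100, 7, 4)` walks through the shifted divisors 56, 28, 14, 7 and yields the quotient 14 with remainder 2.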

7.2 Restoring Division

$q_i = \begin{cases} 1 & \text{if } R_{i+1} - B 2^i \ge 0 \\ 0 & \text{if } R_{i+1} - B 2^i < 0 \end{cases}$

step $i$ : $R_{i+1} - B 2^i < 0$ : $q_i = 0$ , $R_i = R_{i+1}$ (restored)
step $i-1$ : $R_{i+1} - B 2^{i-1} \ge 0$ : $q_{i-1} = 1$ , $R_{i-1} = R_{i+1} - B 2^{i-1}$

7.3 Non-Restoring Division

$q'_i = \begin{cases} 1 & \text{if } R_{i+1} \ge 0 \\ \bar{1} = -1 & \text{if } R_{i+1} < 0 \end{cases}$

step $i$ : $R_{i+1} \ge 0$ : $q'_i = 1$ , $R_i = R_{i+1} - B 2^i$
step $i-1$ : $R_{i+1} - B 2^i < 0$ : $q'_{i-1} = \bar{1}$ , $R_{i-1} = R_{i+1} - B 2^i + B 2^{i-1} = R_{i+1} - B 2^{i-1}$

One subtraction/addition (CPA) per step
Final correction step for R (additional CPA)
Simple quotient digit conversion (note: $q'_i$ irredundant) :
  $q'_i \in \{\bar{1}, 1\} \rightarrow q_i \in \{0, 1\}$ : $q_i = \frac{1}{2}(q'_i + 1)$
  $Q = (\bar{q}_{n-1}, q_{n-2}, q_{n-3}, \ldots, q_0, 1)$

$A = (n+1) A_{CPA} = O(n^2)$ or $O(n^2 \log n)$
$T = (n+1) T_{CPA} = O(n^2)$ or $O(n \log n)$

[Figure: non-restoring divider; inputs A, B; chain of +/− CPA stages; output R]

7.4 Signed Division

$q'_i = \begin{cases} 1 & \text{if } R_{i+1}, B \text{ same sign} \\ \bar{1} & \text{if } R_{i+1}, B \text{ opposite sign} \end{cases}$

Example : signed non-restoring array divider (simplifications: $B > 0$, final correction of R omitted)

$A = 9n^2$ , $T = 2n^2 + 4n$

[Figure: array divider built from full adders (FA); inputs $a_6 \ldots a_0$ and $b_3 \ldots b_0$; outputs $q_3 \ldots q_0$ and $r_3 \ldots r_0$]
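
The non-restoring recurrence of Section 7.3, including the quotient digit conversion and the final correction step, can be modelled as below. This is a hedged sketch for unsigned operands only; the function name and the argument `n` are illustrative assumptions, and the model works on plain Python integers rather than on a CPA array.

```python
def divide_non_restoring(a, b, n):
    """Unsigned non-restoring division of a 2n-bit a by an n-bit b, returns (Q, R)."""
    if not 0 < b < (1 << n):
        raise ValueError("divisor must satisfy 0 < b < 2**n")
    if not 0 <= a < (b << n):
        raise OverflowError("quotient does not fit into n bits")
    r = a                                   # R_n = A
    bits = []                               # q_{n-1} ... q_0 with q_i = (q'_i + 1) / 2
    for i in range(n - 1, -1, -1):
        digit = 1 if r >= 0 else -1         # q'_i is chosen from the sign of R_{i+1}
        r -= digit * (b << i)               # one subtraction or addition per step
        bits.append((digit + 1) // 2)
    # Digit conversion: invert the MSB and append a 1 as new LSB.
    # For a >= 0 the first digit is always +1, so the inverted MSB is 0 and
    # the (n+1)-bit word can be read as an unsigned number here.
    bits[0] ^= 1
    bits.append(1)
    q = int("".join(str(bit) for bit in bits), 2)
    if r < 0:                               # final correction step for R (and Q)
        r += b
        q -= 1
    return q, r
```

With `a = 100`, `b = 7`, `n = 4` all four digits are +1, the converted word is 01111 (15), and the final remainder −5 is corrected to 2 while the quotient becomes 14.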

7.5 SRT Division

$q'_i = \begin{cases} 1 & \text{if } B 2^i \le R_{i+1} \\ 0 & \text{if } -B 2^i \le R_{i+1} < B 2^i \\ \bar{1} & \text{if } R_{i+1} < -B 2^i \end{cases}$ , $q'_i$ is SD number

if $2^{n-1} \le B < 2^n$ , i.e. B is normalized :
⇒ $-B 2^i \le -2^{n+i-1} \le R_{i+1} < 2^{n+i-1} \le B 2^i$

$q'_i = \begin{cases} 1 & \text{if } 2^{n+i-1} \le R_{i+1} \\ 0 & \text{if } -2^{n+i-1} \le R_{i+1} < 2^{n+i-1} \\ \bar{1} & \text{if } R_{i+1} < -2^{n+i-1} \end{cases}$

+ Only 3 MSB are compared ⇒ $q'_i$ are estimated ⇒ CSA instead of CPA can be used (precise enough) [20]
Correction in following steps (+ final correction step)
Redundant representation of $q'_i$ (SD representation) ⇒ final conversion necessary (CPA)
+ Highly regular and fast (O(n)) SRT array dividers ⇒ only slightly slower/larger than array multipliers

$A = n A_{CSA} + 2 A_{CPA} = O(n^2)$
$T = n T_{CSA} + T_{CPA} = O(n)$

[Figure: SRT divider; inputs A, B; chain of +/− CSA stages followed by a +/− CPA and a CPA; output R]

7.6 High-Radix Division

Radix $2^m$ , $q'_i \in \{-(2^m - 1), \ldots, \bar{1}, 0, 1, \ldots, 2^m - 1\}$
m quotient bits per step ⇒ fewer, but more complex steps
+ Suitable for SRT algorithm ⇒ faster
Complex comparisons (more bits) and decisions ⇒ table look-up (⇒ Pentium bug!)

7.7 Division by Multiplication

Division by convergence

$Q = \frac{A}{B} = \frac{A R_0 R_1 \cdots R_{m-1}}{B R_0 R_1 \cdots R_{m-1}} \rightarrow \frac{A_m}{B_m} = \frac{Q}{1}$ resp. $\frac{2^n Q}{2^n}$

Algorithm :
  $B_{i+1} = B_i R_i$ , $A_{i+1} = A_i R_i$ , $R_i = \overline{B_i} + 1$ ; $i = 0, \ldots, m-1$
  $A_0 = A$ , $B_0 = B$ , $Q = A_m$ (r.s.n.)

Quadratic convergence : $L = \lceil \log n \rceil$
  $B_{i+1} = B_i R_i = 2^n (1 - y)(1 + y) = 2^n (1 - y^2) > B_i \rightarrow 2^n$ , with $B_i = 2^n (1 - y)$ and $R_i = 1 + y$
  $y = 1 - B_i 2^{-n}$ , $R_i = 2 - B_i 2^{-n} = \overline{B_i} + 1$ (signed)

Division by reciprocation

$Q = \frac{A}{B} = A \cdot \frac{1}{B}$

Newton-Raphson iteration method : find $f(X) = 0$ by recursion $X_{i+1} = X_i - \frac{f(X_i)}{f'(X_i)}$

$f(X) = \frac{1}{X} - B$ , $f'(X) = -\frac{1}{X^2}$ , $f\left(\frac{1}{B}\right) = 0$

Algorithm :
  $X_{i+1} = X_i (2 - B X_i)$ ; $i = 0, \ldots, m-1$
  $X_0 = B$ , $X_m \rightarrow \frac{1}{B}$ , $Q = A \cdot X_m$ (r.s.n.)

Quadratic convergence : $L = O(\log n)$
Speed-up : first approximation $X_0$ from table

7.8 Remainder / Modulus

Remainder (rem) : signed remainder of a division
  $R = A \,\mathrm{rem}\, B$ , $\mathrm{sign}(R) = \mathrm{sign}(A)$

Modulus (mod) : positive remainder of a division
  $M = A \bmod B$ , $M \ge 0$
  $M = \begin{cases} R & \text{if } A \ge 0 \\ R + B & \text{else} \end{cases}$

7.9 Divider Implementations

Iterative dividers (through multiplication) :
  re-use of existing components (multiplier)
  medium performance, medium area
  high efficiency if components are re-used

Sequential dividers (restoring, non-restoring, SRT) :
  re-use of existing components (e.g. adder)
  low performance, low area

Array dividers (restoring, non-restoring, SRT) :
  dedicated hardware component
  high performance, high area
  high regularity ⇒ layout generators, pipelining
  square root extraction possible by minor changes
  combination with multiplication or/and square root

No parallel dividers exist (sequential nature of division)
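
The radix-2 SRT digit selection of Section 7.5 for a normalized divisor can be sketched as follows. The model is deliberately simplified and should be read as an assumption-laden illustration: it keeps the partial remainder as an exact integer, so the carry-save representation and the 3-MSB estimate are not modelled, and the signed-digit quotient is accumulated directly instead of being converted by a final CPA. The function name is hypothetical.

```python
def divide_srt_radix2(a, b, n):
    """Radix-2 SRT division with digits {-1, 0, 1}; b must be normalized."""
    if not (1 << (n - 1)) <= b < (1 << n):
        raise ValueError("divisor must be normalized: 2**(n-1) <= b < 2**n")
    if not 0 <= a < (b << n):
        raise OverflowError("quotient does not fit into n bits")
    r = a                                   # R_n = A
    q = 0
    for i in range(n - 1, -1, -1):
        t = 1 << (n + i - 1)                # comparison constant 2**(n+i-1), independent of B
        if r >= t:
            digit = 1
        elif r < -t:
            digit = -1
        else:
            digit = 0                       # no addition or subtraction in this step
        r -= digit * (b << i)
        q += digit * (1 << i)               # accumulate the signed-digit quotient
    if r < 0:                               # final correction step
        r += b
        q -= 1
    return q, r
```

Because the thresholds depend only on the step index, the comparison never needs the exact divisor, which is what makes an estimate from a few most significant bits sufficient in hardware.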

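
The two multiplicative schemes of Section 7.7 can be illustrated with a small floating-point model. This is a sketch under stated assumptions, not the notes' hardware formulation: operands are treated as fractions with the divisor normalized to [0.5, 1) (the scaling by $2^{-n}$ is left out), the function names and default step counts are hypothetical, and the starting value of the reciprocal iteration is taken as $X_0 = B$ rather than from a look-up table.

```python
def divide_by_convergence(a, b, steps=6):
    """Q = A/B via B_{i+1} = B_i * R_i, A_{i+1} = A_i * R_i with R_i = 2 - B_i."""
    if not 0.5 <= b < 1.0:
        raise ValueError("divisor must be normalized to [0.5, 1)")
    for _ in range(steps):
        r = 2.0 - b          # R_i = 1 + y with y = 1 - B_i
        a *= r               # numerator follows the same factors
        b *= r               # B_i = 1 - y**2 converges quadratically to 1
    return a                 # Q = A_m


def divide_by_reciprocation(a, b, steps=7):
    """Q = A * (1/B) with the Newton-Raphson recursion X_{i+1} = X_i * (2 - B * X_i)."""
    if not 0.5 <= b < 1.0:
        raise ValueError("divisor must be normalized to [0.5, 1)")
    x = b                    # X_0 = B; a table value would need fewer steps
    for _ in range(steps):
        x = x * (2.0 - b * x)
    return a * x             # X_m approximates 1/B
```

In both loops the error is squared in every step, so the number of correct bits roughly doubles per iteration, which is the reason for the logarithmic step count quoted in the text.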

7.10 Square Root Extraction

$\sqrt{A - R} = Q$ , $A = Q^2 + R$
$A \in [0, 2^{2n} - 1]$ , $Q \in [0, 2^n - 1]$

Algorithm

Subtract-and-shift : partial remainders $R_i$ and quotients $Q_i = Q_{i+1} + q_i 2^i = (q_{n-1}, \ldots, q_i, 0, \ldots, 0)$

  $Q_i^2 = (Q_{i+1} + q_i 2^i)^2 = Q_{i+1}^2 + q_i 2^i (2 Q_{i+1} + q_i 2^i)$

  $q_i = 1$ if $R_{i+1} \ge 2^i (2 Q_{i+1} + 2^i)$, else $q_i = 0$
  $Q_i = Q_{i+1} + q_i 2^i$
  $R_i = R_{i+1} - q_i 2^i (2 Q_{i+1} + q_i 2^i)$ ; $i = n-1, \ldots, 0$
  $R_n = A$ , $Q_n = 0$ , $R = R_0$ , $Q = Q_0$ (r.m.n.)

Implementation

+ Similar to division ⇒ same algorithms applicable (restoring, non-restoring, SRT, high-radix)
+ Combination with division in same component possible
Only triangular array required (step $i$ : $q_k = 0$ for $k \le i$)

$A \approx A_{DIV} / 2$ , $T \approx T_{DIV}$

[Figure: non-restoring square root extractor; chain of +/− CPA stages; output R]

8 Elementary Functions

Exponential function : $e^x$ (exp x)
Logarithm function : ln x, log x
Trigonometric functions : sin x, cos x, tan x
Inverse trig. functions : arcsin x, arccos x, arctan x
Hyperbolic functions : sinh x, cosh x, tanh x

8.1 Algorithms

Table look-up : inefficient for large word lengths [5]
Taylor series expansion : complex implementation
Polynomial and rational approximations [1, 5]
Shift-and-add algorithms [5]

Convergence algorithms [1, 2] :
  similar to division-by-convergence
  two (or more) recursive formulas : one formula converges to a constant, the other to the result

Coordinate rotation (CORDIC) [2, 5, 21] :
  3 equations for x-, y-coordinate, and angle
  computes all elementary functions by proper input settings and choice of modes and outputs
  simple, universal hardware, small look-up table

8.2 Integer Exponentiation

Approximated exponentiation : $x^y = e^{y \ln x} = 2^{y \log_2 x}$

Base-2 integer exponentiation : $2^A = (\ldots 0\,1\,0 \ldots)$ , a single 1 at bit position A

Integer exponentiation (exact) : $A^B = A \cdot A \cdots A$ (B factors) , $L = 0 \ldots 2^n - 1$ (!)

Applications : modular exponentiation $A^B \pmod{C}$ in cryptographic algorithms (e.g. IDEA, RSA)

Algorithms : square-and-multiply

a) $E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0}$
   $= A^{2^{n-1} b_{n-1}} \cdot A^{2^{n-2} b_{n-2}} \cdots A^{4 b_2} \cdot A^{2 b_1} \cdot A^{b_0}$

   $E_i = P_i^{b_i} \cdot E_{i-1}$ , $P_{i+1} = P_i^2$ ; $i = 0, \ldots, n-1$
   $E_{-1} = 1$ , $P_0 = A$ , $E = E_{n-1}$ (r.s.n.)

   $A = 2 A_{MUL}$ , $T = T_{MUL}$ , $L = n$
   or
   $A = A_{MUL}$ , $T = T_{MUL}$ , $L = 2n$

b) $E = A^B = A^{b_{n-1} 2^{n-1} + \cdots + b_1 2 + b_0}$
   $= (\cdots((A^{b_{n-1}})^2 \cdot A^{b_{n-2}})^2 \cdots A^{b_1})^2 \cdot A^{b_0}$

   $E_i = E_{i+1}^2 \cdot A^{b_i}$ ; $i = n-1, \ldots, 0$
   $E_n = 1$ , $E = E_0$ (r.s.n.)

   $A = A_{MUL}$ , $T = T_{MUL}$ , $L = 2(n-1)$

8.3 Integer Logarithm

$Z = \lfloor \log_2 A \rfloor$
For detection/comparison of order of magnitude
Corresponds to leading-zeroes detection (LZD) with encoded output
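
The correspondence between the integer logarithm of Section 8.3 and leading-zeroes detection can be made concrete with a few lines of Python. Both function names and the word-length argument `n` are assumptions for this sketch; a hardware LZD would of course be a combinational tree rather than a loop.

```python
def leading_zeros(a, n):
    """Number of leading zero bits of a within an n-bit word."""
    count = 0
    for i in range(n - 1, -1, -1):
        if (a >> i) & 1:
            break                # first 1 found, stop counting
        count += 1
    return count


def integer_log2(a, n):
    """Z = floor(log2(A)) for 0 < a < 2**n, derived from the leading-zero count."""
    if not 0 < a < (1 << n):
        raise ValueError("a must satisfy 0 < a < 2**n")
    return n - 1 - leading_zeros(a, n)
```

The encoded LZD output and $Z$ differ only by the constant $n - 1$, so one circuit can serve both purposes.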

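
The two square-and-multiply recurrences of Section 8.2 can be sketched as follows. The function names are hypothetical, the exponent is assumed to be an unsigned n-bit number, and the assignment of the labels a) and b) follows the order used in the formulas of that section. Plain integer multiplication stands in for the multiplier; a modular reduction after each product would give the cryptographic variant mentioned in the text.

```python
def power_lsb_first(a, b, n):
    """Variant a): E_i = P_i**b_i * E_{i-1}, P_{i+1} = P_i**2, exponent bits LSB first."""
    e, p = 1, a                      # E_{-1} = 1, P_0 = A
    for i in range(n):
        if (b >> i) & 1:
            e *= p                   # multiply in the current power A**(2**i)
        p *= p                       # square for the next bit position
    return e


def power_msb_first(a, b, n):
    """Variant b): E_i = E_{i+1}**2 * A**b_i, exponent bits MSB first."""
    e = 1                            # E_n = 1
    for i in range(n - 1, -1, -1):
        e *= e                       # square the intermediate result
        if (b >> i) & 1:
            e *= a                   # conditionally multiply by A
    return e
```

In variant a) the squaring and the conditional multiplication are independent and can run on two multipliers in parallel, whereas variant b) needs only one multiplier but performs its operations strictly one after the other.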

9 VLSI Design Aspects

9.1 Design Levels

Transistor-level design

  Circuit and layout designed by hand (full custom)
  Low design efficiency
  High circuit performance : high speed, low area
  High flexibility : choice of architecture and logic style
  Transistor-level circuit optimizations :
    logic style : static vs. dynamic logic, complementary CMOS vs. pass-transistor logic
    special arithmetic circuits : better than with gates
    carry chain : [Figure: transistor-level carry chain with signals $g_i$, $k_i$, $p_i$, $g_{i-1}$, $k_{i-1}$, $p_{i-1}$, $c_{in}$, $c_{i-1}$, $c_i$, $c_{out}$]
    full-adder : [Figure: CMOS full-adder with inputs a, b, $c_{in}$ and outputs s, $c_{out}$]

Gate-level design

  Cell-based design techniques : standard-cells, gate-array/sea-of-gates, field-programmable gate-array (FPGA)
  Circuit implemented by hand or by synthesis (library)
  Layout implemented by automated place-and-route
  Medium to high design efficiency
  Medium to low circuit performance
  Medium to low flexibility : full choice of architecture

Block-level design

  Layout blocks and netlists from parameterized automatic generators or compilers (library)
  High design efficiency
  Medium to high circuit performance
  Low flexibility : limited choice of architectures
  Implementations :
    data-path : bit-sliced, bus-oriented layout (array of cells: n bits × m operations), implementation of entire data paths, medium performance, medium diversity
    macro-cells : tiled layout, fixed/single-operation components, high performance, small diversity
    portable netlists : ⇒ gate-level design

9.2 Synthesis

High-level synthesis

  Synthesis from abstract, behavioral hardware description (e.g. data dependency graphs) using e.g. VHDL
  Involves architectural synthesis and arithmetic transformations
  High-level synthesis is still in its infancy

Low-level synthesis

  Layout and netlist generators
  Included in libraries and synthesis tools
  Low-level synthesis is state-of-the-art
  Basis for efficient ASIC design
  Limited diversity and flexibility of library components

Circuit optimization

  Efficient optimization of random logic (low factorization degree) is state-of-the-art
  Optimization of entire arithmetic circuits (high factorization degree) is not feasible ⇒ only local optimizations possible
  Logic optimization cannot replace the synthesis of efficient arithmetic circuit structures using generators

9.3 VHDL

Arithmetic types : unsigned, signed (2's complement)

Arithmetic packages

  numeric_bit, numeric_std (IEEE standard 1076.3), std_logic_arith (Synopsys)
  contain overloaded arithmetic operators and resizing / type conversion routines for unsigned, signed types

Arithmetic operators (VHDL87/93) [22]

  relational               : =, /=, <, <=, >, >=
  shift, rotate (93 only)  : rol, ror, sla, sll, sra, srl
  adding                   : +, -
  sign (unary)             : +, -
  multiplying              : *, /, mod, rem
  exponent, absolute       : **, abs

Synthesis

  Typical limitations of synthesis tools :
    /, mod, rem : both operands must be constant or divisor must be a power of two
    ** : for power-of-two bases only
  Variety of arithmetic components provided in separate libraries (e.g. DesignWare by Synopsys)

Resource sharing

  Sharing one resource for multiple operations
  Done automatically by some synthesis tools
  Otherwise, appropriate coding is necessary :

  a)  S <= A + C when SELA = '1' else B + C;
      ⇒ 2 adders + 1 multiplexer

  b)  T <= A when SELA = '1' else B;
      S <= T + C;
      ⇒ 1 multiplexer + 1 adder

Coding & synthesis hints

  Addition : single adder with carry-in/carry-out :

      Aext <= resize(A, width+1) & Cin;
      Bext <= resize(B, width+1) & '1';
      Sext <= Aext + Bext;
      S    <= Sext(width downto 1);
      Cout <= Sext(width+1);

  Synthesis : check synthesis result for allocated arithmetic units ⇒ code sanity check, control of circuit size

VHDL library of arithmetic units

  Structural, synthesizable VHDL code for most circuits described in this text is found in [23]

9.4 Performance

Pipelining

  Pipelining is basically possible with every combinational circuit ⇒ higher throughput
  Arithmetic circuits are well suited for pipelining due to high regularity
  Pipelining of arithmetic circuits can be very costly :
    large amount of internal signals in arithmetic circuits
    array structures : many small pipeline registers
    tree structures : few large pipeline registers
    ⇒ no advantage of tree structures anymore (except for smaller latency)
  Fine-grain pipelining ⇒ systolic arrays (often applied to arithmetic circuits)

High speed

  Fast circuit architectures, pipelining, replication (parallelization), and combinations of those
  Optimal solution depends on arithmetic operation, circuit architecture, user specifications, and circuit environment
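
Why the carry-in/carry-out coding hint of Section 9.3 needs only a single adder can be checked with a small bit-level model. The Python function below is an illustrative assumption (name, argument order and the integer encoding of the vectors are not from the notes); it mirrors the five VHDL assignments one by one.

```python
def add_with_carry(a, b, cin, width):
    """Model of the single-adder carry-in/carry-out trick for unsigned operands."""
    if not (0 <= a < (1 << width) and 0 <= b < (1 << width) and cin in (0, 1)):
        raise ValueError("operands out of range")
    a_ext = (a << 1) | cin                   # Aext <= resize(A, width+1) & Cin
    b_ext = (b << 1) | 1                     # Bext <= resize(B, width+1) & '1'
    s_ext = a_ext + b_ext                    # Sext <= Aext + Bext (width+2 bits)
    s = (s_ext >> 1) & ((1 << width) - 1)    # S    <= Sext(width downto 1)
    cout = (s_ext >> (width + 1)) & 1        # Cout <= Sext(width+1)
    return s, cout
```

The appended LSB pair (Cin, '1') generates a carry into bit 1 exactly when Cin is 1, so the upper bits of the extended sum equal A + B + Cin and no separate incrementer is inferred.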


Low power

  Power-related properties of arithmetic circuits :
    High glitching activity due to high bit dependencies and large logic depth

  Power reduction in arithmetic circuits [24] :
    Reduce the switched capacitance by choosing an area efficient circuit architecture
    Allow for lower supply voltage by speeding up the circuitry
    Reduce the transition activity :
      apply stable inputs while circuit is not in use (⇒ disabling subcircuits)
      reduce glitching transitions by balancing signal paths (partly done by speed-up techniques, otherwise difficult to realize)
      reduce glitching transitions by reducing logic depth (pipelining)
      take advantage of correlated data streams
      choose appropriate number representations (e.g. Gray codes for counters)

9.5 Testability

Testability goal : high fault coverage with few test vectors that are easy to generate/apply

Random test vectors : easy to generate and apply/propagate, few vectors give high (but not perfect) fault coverage for most arithmetic circuits

Special test vectors : sometimes hard to generate and apply, required for coverage of hard-detectable faults which are inherent in most arithmetic circuits

Hard-detectable faults found in :
  circuits of arithmetic operations with inherent special cases (arithmetic exceptions) : detectors, comparators, incrementers and counters (MSBs), adder flags
  circuits using redundant number representations (≠ redundant hardware) : dividers (Pentium bug!)

Bibliography

[1] I. Koren, Computer Arithmetic Algorithms, Prentice Hall, 1993.
[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design, John Wiley & Sons, 1979.
[3] O. Spaniol, Computer Arithmetic, John Wiley & Sons, 1981.
[4] J. J. F. Cavanagh, Digital Computer Arithmetic: Design and Implementation, McGraw-Hill, 1984.
[5] J.-M. Muller, Elementary Functions: Algorithms and Implementation, Birkhäuser Boston, 1997.
[6] Proceedings of the Xth Symposium on Computer Arithmetic.
[7] IEEE Transactions on Computers.
[8] D. R. Lutz and D. N. Jayasimha, "Programmable modulo-k counters", IEEE Trans. Circuits and Syst., vol. 43, no. 11, pp. 939-941, Nov. 1996.
[9] H. Makino et al., "An 8.8-ns 54 × 54-bit multiplier with high speed redundant binary architecture", IEEE J. Solid-State Circuits, vol. 31, no. 6, pp. 773-783, June 1996.
[10] W. N. Holmes, "Composite arithmetic: Proposal for a new standard", IEEE Computer, vol. 30, no. 3, pp. 65-73, Mar. 1997.
[11] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD thesis, Swiss Federal Institute of Technology (ETH) Zurich, Hartung-Gorre Verlag, 1998.
[12] A. Tyagi, "A reduced-area scheme for carry-select adders", IEEE Trans. Comput., vol. 42, no. 10, pp. 1162-1170, Oct. 1993.
[13] T. Han and D. A. Carlson, "Fast area-efficient VLSI adders", in Proc. 8th Computer Arithmetic Symp., Como, May 1987, pp. 49-56.
[14] R. Zimmermann, "Non-heuristic optimization and synthesis of parallel-prefix adders", in Proc. Int. Workshop on Logic and Architecture Synthesis, Grenoble, France, Dec. 1996, pp. 123-132.
[15] D. W. Dobberpuhl et al., "A 200-MHz 64-b dual-issue CMOS microprocessor", IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1555-1564, Nov. 1992.
[16] A. De Gloria and M. Olivieri, "Statistical carry lookahead adders", IEEE Trans. Comput., vol. 45, no. 3, pp. 340-347, Mar. 1996.
[17] V. G. Oklobdzija, D. Villeger, and S. S. Liu, "A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach", IEEE Trans. Comput., vol. 45, no. 3, pp. 294-305, Mar. 1996.
[18] Z. Wang, G. A. Jullien, and W. C. Miller, "A new design technique for column compression multipliers", IEEE Trans. Comput., vol. 44, no. 8, pp. 962-970, Aug. 1995.
[19] J. Cortadella and J. M. Llaberia, "Evaluation of A + B = K conditions without carry propagation", IEEE Trans. Comput., vol. 41, no. 11, pp. 1484-1488, Nov. 1992.
[20] S. E. McQuillan and J. V. McCanny, "Fast VLSI algorithms for division and square root", J. VLSI Signal Processing, vol. 8, pp. 151-168, Oct. 1994.
[21] Y. H. Hu, "CORDIC-based VLSI architectures for digital signal processing", IEEE Signal Processing Magazine, vol. 9, no. 3, pp. 16-35, July 1992.
[22] K. C. Chang, Digital Design and Modeling with VHDL and Synthesis, IEEE Computer Society Press, Los Alamitos, California, 1997.
[23] R. Zimmermann, VHDL Library of Arithmetic Units, http://www.iis.ee.ethz.ch/zimmi/arith_lib.html.
[24] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design, Kluwer, Norwell, MA, 1995.