Buffer Insertion

Interconnect Optimizations
A scaling primer
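Ideal process scaling:
[Figure: transistor (G, S, D) and wire cross-section with width w, height h, spacing S, and length l, scaled to ws, hs, Ss, and ls]

Device geometries shrink by s (= 0.7x)
Device delay shrinks by s
Wire geometries shrink by s
R/µm: $\rho/(ws \cdot hs) = r/s^2$, where $r = \rho/(wh)$
Cc/µm: $(hs)\,\varepsilon/(Ss) = C_c$
C/µm: similar
R/µm doubles, C/µm and Cc/µm unchanged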
Interconnect role
Short (local) interconnect
Used to connect nearby cells
Minimize wire C, i.e., use short min-width wires
Medium to long-distance (global) interconnect
Size wires to trade off area vs. delay
Increasing width: capacitance increases, resistance decreases
Need to find an acceptable tradeoff (the wire sizing problem)
Fat wires
Thicker cross-sections in higher metal layers
Useful for reducing delays for global wires
Inductance issues, sharing of limited resource
Cross-Section of A Chip
Block scaling
Block area often stays same

# cells and # nets double
Wiring histogram shape invariant
Global interconnect lengths don't shrink

Local interconnect lengths shrink by s
Interconnect delay scaling
Delay of a wire of length l:
$t_{int} = (rl)(cl) = rcl^2$ (first order)
Local interconnects:
$t_{int}: (r/s^2)(c)(ls)^2 = rcl^2$
Local interconnect delay unchanged (compare to faster devices)
Global interconnects:
$t_{int}: (r/s^2)(c)(l)^2 = (rcl^2)/s^2$
Global interconnect delay doubles: unsustainable!
Interconnect delay increasingly dominant
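A minimal Python sketch of the scaling argument above, assuming normalized values r = c = l = 1 (chosen only for illustration) and the scaling factor s = 0.7 from the primer. The helper name `wire_delay` is hypothetical.

```python
def wire_delay(r, c, length):
    """First-order wire delay t_int = r * c * length**2 (r, c per unit length)."""
    return r * c * length ** 2

s = 0.7                    # scaling factor per generation
r, c, l = 1.0, 1.0, 1.0    # normalized values, assumed for illustration

base = wire_delay(r, c, l)
local_scaled = wire_delay(r / s**2, c, l * s)   # local wire: length shrinks by s
global_scaled = wire_delay(r / s**2, c, l)      # global wire: length unchanged

local_ratio = local_scaled / base
global_ratio = global_scaled / base
print(f"local delay ratio:  {local_ratio:.2f}")
print(f"global delay ratio: {global_ratio:.2f}")
```

The local ratio stays at 1 while the global ratio is 1/s^2, which is roughly 2 for s = 0.7.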

Buffer Insertion For Delay Reduction
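Analysis of Simple RC Circuit
[Figure: source $v_T(t)$ driving capacitor C through resistor R, with current i(t) and capacitor voltage v(t)]
$R\,i(t) + v(t) = v_T(t)$
$i(t) = \frac{d(Cv(t))}{dt} = C\,\frac{dv(t)}{dt}$
$RC\,\frac{dv(t)}{dt} + v(t) = v_T(t)$
v(t): state variable
$v_T(t)$: input waveform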
Analysis of Simple RC Circuit
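Step-input response: $RC\,\frac{dv(t)}{dt} + v(t) = v_0 u(t)$
$v(t) = K e^{-t/RC} + v_0 u(t)$
Match initial state: $v(0) = 0 \Rightarrow K + v_0 u(t) = 0 \Rightarrow K = -v_0$
Output response for step input:
$v(t) = v_0 (1 - e^{-t/RC})\,u(t)$
[Figure: input waveform $v_0 u(t)$ and output waveform $v_0(1 - e^{-t/RC})u(t)$]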
Delays of Simple RC Circuit
$v(t) = v_0(1 - e^{-t/RC})$: waveform under step input $v_0 u(t)$
$v(t) = 0.5 v_0 \Rightarrow t = 0.69RC$
i.e., delay = 0.69RC (50% delay)
$v(t) = 0.1 v_0 \Rightarrow t = 0.1RC$
$v(t) = 0.9 v_0 \Rightarrow t = 2.3RC$
i.e., rise time = 2.2RC (if defined as time from 10% to 90% of Vdd)
Commonly used metric:
$T_D = RC$ (= Elmore delay)
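The constants above follow from inverting the step response: v(t) reaches a fraction f of v0 at t = -RC ln(1 - f). A short Python check, with RC normalized to 1; the function name `time_to_fraction` is a hypothetical helper.

```python
import math

def time_to_fraction(fraction, rc=1.0):
    """Time at which v(t) = v0 * (1 - exp(-t / RC)) reaches fraction * v0."""
    return -rc * math.log(1.0 - fraction)

t50 = time_to_fraction(0.5)   # 50% delay
t10 = time_to_fraction(0.1)
t90 = time_to_fraction(0.9)
rise = t90 - t10              # 10% to 90% rise time

print(f"50% delay = {t50:.2f} RC")
print(f"10% point = {t10:.2f} RC, 90% point = {t90:.2f} RC")
print(f"rise time = {rise:.2f} RC")
```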
Elmore Delay
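Driver is modeled as R
Driver intrinsic gate delay t(B)
Delay = $\sum_{\text{all } R_i} \; \sum_{\text{all } C_j \text{ downstream from } R_i} R_i C_j$, summing over the $R_i$ on the path from the driver to the node
Elmore delay at n1 = $R(B)(C_1 + C_2)$
Elmore delay at n2 = $R(B)(C_1 + C_2) + R(w)\,C_2$
[Figure: buffer B with output resistance R(B) driving node n1 (capacitance C1), followed by wire resistance R(w) to node n2 (capacitance C2)]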
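The two-node case above can be sketched in Python for a simple RC chain (no branching). The numeric values and the function name `elmore_delays` are assumptions for illustration only.

```python
def elmore_delays(resistances, capacitances):
    """Elmore delay at each node of an unbranched RC chain.

    resistances[i] feeds node i, and capacitances[i] sits at node i.
    Each resistance is multiplied by all capacitance downstream of it,
    and the products accumulate along the path from the driver.
    """
    delays = []
    delay = 0.0
    for i, r in enumerate(resistances):
        downstream = sum(capacitances[i:])
        delay += r * downstream
        delays.append(delay)
    return delays

R_B, R_w = 2.0, 3.0   # driver and wire resistance (assumed values)
C1, C2 = 1.0, 4.0     # capacitance at n1 and n2 (assumed values)

d_n1, d_n2 = elmore_delays([R_B, R_w], [C1, C2])
print(d_n1, d_n2)
```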
Elmore Delay
For a uniform wire of length x:
unit wire capacitance c
unit wire resistance r
load capacitance C
No matter how the wire is lumped into segments, the Elmore delay is the same
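A small Python check of the lumping claim, under the assumption that each segment is a pi-model (half of the segment capacitance at each end). With that model, any number of equal segments gives the closed form rx(cx/2 + C). The values and the name `lumped_wire_delay` are illustrative.

```python
def lumped_wire_delay(r, c, x, load, n):
    """Elmore delay of a wire of length x split into n equal pi-segments."""
    seg_r = r * x / n
    seg_c = c * x / n
    delay = 0.0
    for k in range(n):
        # Downstream of segment k's resistor: the far-end half of its own
        # capacitance, the full capacitance of later segments, and the load.
        downstream = seg_c / 2 + (n - 1 - k) * seg_c + load
        delay += seg_r * downstream
    return delay

r, c, x, load = 0.5, 0.2, 10.0, 3.0          # assumed values
closed_form = r * x * (c * x / 2 + load)
results = [lumped_wire_delay(r, c, x, load, n) for n in (1, 2, 5, 100)]
print(closed_form, results)
```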
Delay for Buffer
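[Figure: a buffer between nodes u and v driving load C, and its equivalent model with input capacitance C(b) at u]
A buffer is modeled by:
Input capacitance
Driver resistance
Intrinsic buffer delay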
Buffers Reduce Wire Delay
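[Figure: driver with resistance R driving a wire of length x into load C; a buffer with the same R and C is inserted at x/2, and each half is a pi-model with resistance rx/2 and capacitance cx/4 at each end]
$t_{unbuf} = R(cx + C) + rx(cx/2 + C)$
$t_{buf} = 2R(cx/2 + C) + rx(cx/4 + C) + t_b$
$t_{buf} - t_{unbuf} = RC + t_b - rcx^2/4$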
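The difference RC + tb - rcx^2/4 changes sign at x = 2 sqrt((RC + tb)/(rc)), so a buffer only pays off on wires longer than that. A Python sketch with all parameters set to 1 (assumed values); the function names are hypothetical.

```python
import math

def t_unbuf(R, C, r, c, x):
    """Elmore delay of the unbuffered wire."""
    return R * (c * x + C) + r * x * (c * x / 2 + C)

def t_buf(R, C, r, c, x, tb):
    """Elmore delay with one buffer (same R and C as the driver and load) at x/2."""
    return 2 * R * (c * x / 2 + C) + r * x * (c * x / 4 + C) + tb

def break_even_length(R, C, r, c, tb):
    """Wire length where t_buf equals t_unbuf: RC + tb = r*c*x^2/4."""
    return 2 * math.sqrt((R * C + tb) / (r * c))

R, C, r, c, tb = 1.0, 1.0, 1.0, 1.0, 1.0   # assumed values
x_star = break_even_length(R, C, r, c, tb)
print(f"break-even length: {x_star:.2f}")
for x in (1.0, 4.0):
    print(x, t_unbuf(R, C, r, c, x), t_buf(R, C, r, c, x, tb))
```

For these values the buffer hurts at x = 1 and helps at x = 4.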
Combinational Logic Delay
[Figure: primary input, register, combinational logic, register, primary output; both registers driven by clock]
Combinational logic delay <= clock period

Buffered global interconnects:
Intuition
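[Figure: unbuffered wire of length l]
Interconnect delay = $r c l^2$
[Figure: the same wire split by buffers into segments $l_1, l_2, l_3, \ldots, l_n$]
Now, interconnect delay = $\sum_i r c l_i^2 < r c l^2$ (where $l = \sum_j l_j$)
since $\sum_j l_j^2 < (\sum_j l_j)^2$
(Of course, account for buffer delay also)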
Optimal inter-buffer length
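First order (lumped parasitic, Elmore delay) analysis
[Figure: wire of total length L divided by buffers into segments of length l]
$R_d$: On resistance of inverter
$C_g$: Gate input capacitance
r, c: Resistance, cap. per micron
Assume N identical buffers with equal inter-buffer length l (L = Nl)
$T = N\left[R_d(C_g + cl) + rl(C_g + cl/2)\right]$
$\;\; = L\left[\frac{rcl}{2} + (rC_g + R_d c) + R_d C_g \cdot \frac{1}{l}\right]$
For minimum delay,
$\frac{dT}{dl} = 0 \Rightarrow L\left[\frac{rc}{2} - \frac{R_d C_g}{l_{opt}^2}\right] = 0 \Rightarrow l_{opt} = \sqrt{\frac{2 R_d C_g}{rc}}$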
Optimal interconnect delay
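Substituting $l_{opt}$ back into the interconnect delay expression:
$T_{opt} = L\left[\frac{rc\,l_{opt}}{2} + (rC_g + R_d c) + R_d C_g \cdot \frac{1}{l_{opt}}\right]$
$\;\; = L\left[\frac{rc}{2}\sqrt{\frac{2R_dC_g}{rc}} + R_dC_g\sqrt{\frac{rc}{2R_dC_g}} + (rC_g + R_dc)\right]$
$T_{opt} = L\left[\sqrt{2R_dC_g\,rc} + (rC_g + R_dc)\right]$
Delay grows linearly with L (instead of quadratically)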
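A numerical sanity check of the closed forms for l_opt and T_opt, sketched in Python. The parameter values are made up for illustration and do not correspond to any particular process; the function names are hypothetical.

```python
import math

def buffered_delay(l, L, Rd, Cg, r, c):
    """T = L * [rcl/2 + (r*Cg + Rd*c) + Rd*Cg/l] for inter-buffer length l."""
    return L * (r * c * l / 2 + (r * Cg + Rd * c) + Rd * Cg / l)

def optimal_length(Rd, Cg, r, c):
    """l_opt = sqrt(2*Rd*Cg / (r*c))."""
    return math.sqrt(2 * Rd * Cg / (r * c))

def optimal_delay(L, Rd, Cg, r, c):
    """T_opt = L * [sqrt(2*Rd*Cg*r*c) + r*Cg + Rd*c]."""
    return L * (math.sqrt(2 * Rd * Cg * r * c) + r * Cg + Rd * c)

Rd, Cg, r, c, L = 100.0, 2.0, 0.5, 0.1, 1000.0   # assumed values
l_opt = optimal_length(Rd, Cg, r, c)
t_opt = optimal_delay(L, Rd, Cg, r, c)
print(f"l_opt = {l_opt:.2f}, T_opt = {t_opt:.1f}")
print(f"T at 0.9*l_opt = {buffered_delay(0.9 * l_opt, L, Rd, Cg, r, c):.1f}")
print(f"T at 1.1*l_opt = {buffered_delay(1.1 * l_opt, L, Rd, Cg, r, c):.1f}")
```

Doubling L doubles `optimal_delay`, which is the linear growth stated above.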
Total buffer count
[Figure: bar chart of % cells used to buffer nets (series clk-buf, buf, tot-buf; y-axis 0 to 80) at 90nm, 65nm, 45nm, and 32nm]
Ever-increasing fractions of total cell count will be buffers
70% in 32nm
ITRS projections
[Figure: relative delay (log scale, 0.1 to 100) vs. feature size (250, 180, 130, 90, 65, 45, 32 nm) for gate delay (fanout 4), local interconnect (M1,2), global interconnect with repeaters, and global interconnect without repeaters]
Source: ITRS, 2003
Buffers Improve Slack
RAT = Required Arrival Time
Slack = RAT - Delay
Without buffer:
Sink 1: RAT = 300, Delay = 350, Slack = -50
Sink 2: RAT = 700, Delay = 600, Slack = 100
slackmin = -50
With buffer (decouple capacitive load from critical path):
Sink 1: RAT = 300, Delay = 250, Slack = 50
Sink 2: RAT = 700, Delay = 400, Slack = 300
slackmin = 50
Timing Driven Buffering
Problem Formulation
Given
A Steiner tree
RAT at each sink
A buffer type
RC parameters
Candidate buffer locations
Find buffer insertion solution such that the
slack at the driver is maximized
Candidate Buffering Solutions
Candidate Solution Characteristics
Each candidate solution is associated with
vi: a node
ci: downstream capacitance
qi: RAT
If vi is a sink, ci is the sink capacitance
[Figure: routing tree with an internal node v]
Van Ginneken's Algorithm
Candidate solutions are propagated toward the source
Dynamic Programming
Solution Propagation: Add Wire
[Figure: wire of length x from (v1, c1, q1) up to (v2, c2, q2)]
$c_2 = c_1 + cx$
$q_2 = q_1 - rcx^2/2 - rxc_1$
r: wire resistance per unit length
c: wire capacitance per unit length
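The wire update can be sketched as a small Python function on (capacitance, RAT) pairs. The function name `add_wire` and the tuple representation are assumptions; the numbers are the ones used in the deck's propagation example (r = c = 1, wire length 2, sink candidate with c = 1 and q = 20).

```python
def add_wire(cand, x, r, c):
    """Propagate a candidate (cap, rat) upstream across a wire of length x."""
    cap, rat = cand
    new_cap = cap + c * x
    new_rat = rat - r * c * x**2 / 2 - r * x * cap
    return (new_cap, new_rat)

sink = (1, 20)
at_v2 = add_wire(sink, x=2, r=1, c=1)
print(at_v2)
```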
Solution Propagation: Insert Buffer
[Figure: buffer inserted at v1, turning (v1, c1, q1) into (v1, c1b, q1b)]
$c_{1b} = C_b$
$q_{1b} = q_1 - R_b c_1 - t_b$
$C_b$: buffer input capacitance
$R_b$: buffer output resistance
$t_b$: buffer intrinsic delay
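A matching Python sketch for the buffer update, again on (capacitance, RAT) pairs. The name `insert_buffer` is hypothetical; the numbers come from the deck's propagation example (Rb = Cb = tb = 1, candidate with c = 3 and q = 16).

```python
def insert_buffer(cand, Rb, Cb, tb):
    """Return the candidate seen upstream of a buffer placed at this node."""
    cap, rat = cand
    return (Cb, rat - Rb * cap - tb)

unbuffered = (3, 16)
buffered = insert_buffer(unbuffered, Rb=1, Cb=1, tb=1)
# Both options stay in the candidate list; neither dominates the other here.
candidates = [unbuffered, buffered]
print(candidates)
```

The buffered candidate presents less load but a smaller RAT, which is why both must be kept until later steps decide between them.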
Solution Propagation: Merge
(v, cl , ql) (v, cr , qr)
cmerge = cl + cr
qmerge = min(ql , qr)
Solution Propagation: Add Driver
[Figure: driver attached at v0, turning (v0, c0, q0) into (v0, c0d, q0d)]
$q_{0d} = q_0 - R_d c_0$ = slackmin
$R_d$: driver resistance
Pick solution with max slackmin
Example of Solution Propagation
r = 1, c = 1; two wire segments of length 2
Rb = 1, Cb = 1, tb = 1
Rd = 1
Sink: (v1, 1, 20)
Add wire: (v2, 3, 16)
Insert buffer at v2: (v2, 1, 12)
Add wire to each candidate: (v3, 5, 8) without buffer, (v3, 3, 8) with buffer
Add driver: slack = 3 without buffer, slack = 5 with buffer
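The last step of this example, adding the driver and picking the best candidate, can be sketched in Python. The function name `add_driver` and the dictionary labels are assumptions for illustration.

```python
def add_driver(cand, Rd):
    """Slack at the driver for a candidate (cap, rat): q0 - Rd * c0."""
    cap, rat = cand
    return rat - Rd * cap

# Candidates that reach the driver node v3 in the example above.
at_v3 = {"no buffer": (5, 8), "buffer at v2": (3, 8)}

slacks = {name: add_driver(cand, Rd=1) for name, cand in at_v3.items()}
best = max(slacks, key=slacks.get)
print(slacks, "->", best)
```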

Example of Merging
[Figure: left candidates and right candidates combined into merged candidates]
Solution Pruning
Two candidate solutions
(v, c1, q1)
(v, c2, q2)
Solution 1 is inferior if
c1 >= c2 : no smaller load
and q1 <= q2 : no better timing
Pruning When Insert Buffer
Candidates with a buffer inserted at the same node all have the same load cap Cb, so only the one with max q is kept
Generating Candidates
[Figure: candidate generation along a wire, steps (1) to (3)]
From Dr. Charles Alpert

Pruning Candidates
[Figure: step (3), candidates (a) and (b)]
Both (a) and (b) look the same to the source.
Throw out the one with the worst slack
[Figure: step (4), remaining candidates]
Candidate Example Continued
[Figure: steps (4) and (5)]
Candidate Example Continued
[Figure: step (5), candidates after pruning]
At driver, compute which candidate maximizes slack. Result is optimal.
Merging Branches
[Figure: left candidates and right candidates at a branch point]
Pruning Merged Branches
[Figure: merged candidates with the critical branch marked, shown with pruning]
Van Ginneken Example
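Candidates are written (c, q); each step is labeled with the wire capacitance or buffer input capacitance C and its delay d
Sink: (20, 400)
Wire, C=10, d=150: (20, 400) -> (30, 250)
Buffer, C=5, d=30: (30, 250) -> (5, 220)
Wire, C=15, d=200: (30, 250) -> (45, 50)
Wire, C=15, d=120: (5, 220) -> (20, 100)
Buffer, C=5, d=50: (45, 50) -> (5, 0)
Buffer, C=5, d=30: (20, 100) -> (5, 70)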
Van Ginneken Example Contd
Candidates at the current node: (45, 50), (5, 0), (20, 100), (5, 70)
(5,0) is inferior to (5,70). (45,50) is inferior to (20,100)
Wire, C=10: (20, 100) -> (30, 10)
Wire, C=10: (5, 70) -> (15, -10)
Pick solution with largest slack, follow arrows to get solution
Basic Data Structure
(c1, q1) (c2, q2) (c3, q3)
Moving right: worse load cap, better timing
Sorted list such that
c1 < c2 < c3
If there are no inferior candidates,
q1 < q2 < q3
Prune Solution List
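Candidates in order of increasing c:
(c1, q1) (c2, q2) (c3, q3) (c4, q4)
Compare the last kept candidate i with the next candidate j: qi < qj ?
Y: keep j and continue the scan from j
N: prune j and compare i with the candidate after j
[Figure: decision tree of the comparisons q1 < q2, q1 < q3, q1 < q4, q2 < q3, q2 < q4, q3 < q4, with Prune 2, Prune 3, and Prune 4 on the N branches]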
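A Python sketch of the single-pass pruning scan on (capacitance, RAT) pairs. The slides assume the list is already sorted by c; this version sorts first so it can be called on any list, which adds a sorting cost that the linear-time scan itself does not have. The function name `prune` is hypothetical, and the test data are the four candidates from the Van Ginneken example in this deck.

```python
def prune(candidates):
    """Drop inferior candidates from a list of (cap, rat) pairs.

    After sorting by increasing cap (ties: larger rat first), a candidate
    survives only if its rat is strictly larger than the last kept one.
    """
    ordered = sorted(candidates, key=lambda cq: (cq[0], -cq[1]))
    kept = []
    for cap, rat in ordered:
        if not kept or rat > kept[-1][1]:
            kept.append((cap, rat))
    return kept

pruned = prune([(45, 50), (5, 0), (20, 100), (5, 70)])
print(pruned)
```

The surviving list is strictly increasing in both c and q, which is the invariant of the sorted data structure.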
Pruning In Merging
Left candidates: (cl1, ql1) (cl2, ql2) (cl3, ql3)
Right candidates: (cr1, qr1) (cr2, qr2)
Assume ql1 < ql2 < qr1 < ql3 < qr2
Merged candidates:
(cl1+cr1, ql1)
(cl2+cr1, ql2)
(cl3+cr1, qr1)
(cl3+cr2, ql3)
[Figure: the two lists at each step of the merge, with the current pair highlighted]
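The merge above can be sketched as a two-pointer walk over the two sorted lists, in Python. It assumes both inputs are already pruned, so each is increasing in both c and q; the name `merge` and the numeric values are illustrative, chosen to satisfy ql1 < ql2 < qr1 < ql3 < qr2.

```python
def merge(left, right):
    """Merge two pruned candidate lists of (cap, rat) pairs at a branch point."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        (cl, ql), (cr, qr) = left[i], right[j]
        merged.append((cl + cr, min(ql, qr)))
        # Advance the side whose RAT is limiting; pairing it with a larger
        # load from the other side could only produce an inferior candidate.
        if ql <= qr:
            i += 1
        if qr <= ql:
            j += 1
    return merged

left = [(1, 10), (2, 20), (3, 40)]
right = [(1, 30), (2, 50)]
merged = merge(left, right)
print(merged)
```

The result has at most len(left) + len(right) - 1 entries, which is the additive growth noted for merging branches.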
Van Ginneken Complexity
Generate candidates from sinks to source
Quadratic runtime
Adding a wire does not change #candidates
Adding a buffer adds only one new candidate
Merging branches additive, not multiplicative
Linear time solution list pruning
Optimal for Elmore delay model
Multiple Buffer Types
r = 1, c = 1; two wire segments of length 2
Rb1 = 1, Cb1 = 1, tb1 = 1
Rb2 = 0.5, Cb2 = 2, tb2 = 0.5
Rd = 1
Sink: (v1, 1, 20)
Add wire: (v2, 3, 16)
Insert buffer type 1: (v2, 1, 12)
Insert buffer type 2: (v2, 2, 14)
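With several buffer types, each type contributes its own new candidate at a node. A Python sketch using the two buffer types from this example; the dictionary layout and the name `insert_buffers` are assumptions for illustration.

```python
# Buffer library: name -> (Rb, Cb, tb), values from the example above.
BUFFER_TYPES = {
    "b1": (1.0, 1.0, 1.0),
    "b2": (0.5, 2.0, 0.5),
}

def insert_buffers(cand, buffer_types):
    """Return one new (cap, rat) candidate per buffer type."""
    cap, rat = cand
    return {
        name: (Cb, rat - Rb * cap - tb)
        for name, (Rb, Cb, tb) in buffer_types.items()
    }

at_v2 = (3, 16)
new_candidates = insert_buffers(at_v2, BUFFER_TYPES)
print(new_candidates)
```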
