www.rejinpaul.com

EC6601/VLSI DESIGN

UNIT-1

MOS TRANSISTOR PRINCIPLE


INTRODUCTION

An MOS transistor is a majority-carrier device in which the current in a conducting channel between the source and the drain is modulated by a voltage applied to the gate.

Symbols

NMOS

PMOS

NMOS (n-type MOS transistor)

(1) Majority carrier = electrons
(2) A positive voltage applied on the gate with respect to the substrate enhances the number of electrons in the channel and hence increases the conductivity of the channel.
(3) If the gate voltage is less than a threshold voltage Vt, the channel is cut off (very low current between source and drain).

PMOS (p-type MOS transistor)

(1) Majority carrier = holes
(2) The applied gate voltage is negative with respect to the substrate.

Threshold voltage (Vt):

• The voltage at which an MOS device begins to conduct ("turn on").
• Seen in the relationship between Vgs (gate-to-source voltage) and the source-to-drain current (Ids), given a fixed drain-to-source voltage (Vds).
• Devices that are normally cut off with zero gate bias are classified as "enhancement-mode" devices.
• Devices that conduct with zero gate bias are called "depletion-mode" devices.
• Enhancement-mode devices are more popular in practical use.

NMOS Enhancement Transistor

It consists of:
• A moderately doped p-type silicon substrate
• Two heavily doped n+ regions, the source and drain, diffused into the substrate
• A channel covered by a thin insulating layer of silicon dioxide (SiO2) called the "gate oxide"
• A polycrystalline silicon (polysilicon) electrode over the oxide, referred to as the "gate"

Features
• Since the oxide layer is an insulator, the DC current from the gate to the channel is essentially zero.
• There is no physical distinction between the drain and source regions.
• Since SiO2 has low loss and high dielectric strength, the application of high gate fields is feasible.

In operation
• Set Vds > 0.
• With Vgs = 0, no current flows between source and drain; they are insulated by two reverse-biased PN junctions.
• When Vg > 0, the resulting electric field attracts electrons toward the gate and repels holes.
• If Vg is sufficiently large, the region under the gate changes from p-type to n-type (due to the accumulation of attracted electrons) and provides a conducting path between source and drain.
• The thin layer of p-type silicon is then said to be "inverted".
• Three modes (see Fig 2.4):
o Accumulation mode (Vgs << Vt)
o Depletion mode (Vgs = Vt)
o Inversion mode (Vgs > Vt)

Electrically

(1) An MOS device can be considered a voltage-controlled switch that conducts when Vgs > Vt (given Vds > 0).

(2) An MOS device can also be considered a voltage-controlled resistor.

Effective gate voltage (Vgs − Vt)

At the source end, the full gate voltage is effective in inverting the channel.

At the drain end, only the difference between the gate and drain voltages is effective.

Pinch-off

• Vds > Vgs − Vt => Vgd < Vt => Vd > Vg − Vt (Vg is not large enough to invert the channel at the drain end).
• The channel no longer reaches the drain (Fig 2.5 c).
• Electrons leaving the channel are injected into the drain depletion region and are subsequently accelerated toward the drain.
• The voltage across the pinched-off channel remains at (Vgs − Vt) => "saturated" state, in which the channel current is controlled by Vg and is independent of Vd.

For fixed Vds and Vg, Ids is a function of:
(1) Distance between drain and source
(2) Channel width
(3) Vt
(4) Thickness of the gate oxide
(5) The dielectric constant of the gate oxide
(6) Carrier (hole or electron) mobility, μ

Conducting modes
• "Cut-off" region: Ids ≈ 0, Vgs < Vt
• "Nonsaturated" region: weak inversion region, where Ids depends on Vg and Vd
• "Saturated" region: the channel is strongly inverted and Ids is ideally independent of Vds (pinch-off region)
• "Avalanche breakdown" (punch-through): very high Vd => the gate has no control over Ids
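
The first three regions can be sketched with the standard first-order (Shockley) long-channel model. This is only an illustration: the function name `ids_nmos` and the default values of `vt` and `beta` are assumptions, not values from these notes, and breakdown is not modelled.

```python
def ids_nmos(vgs, vds, vt=0.7, beta=1e-4):
    """Ideal long-channel nMOS drain current in amperes, for vds >= 0.

    vt (V) and beta (A/V^2) are illustrative defaults, not values from the notes.
    """
    vov = vgs - vt                      # effective gate voltage Vgs - Vt
    if vov <= 0:
        return 0.0                      # cut-off: Ids ~ 0
    if vds < vov:
        # nonsaturated: Ids depends on both Vgs and Vds
        return beta * (vov * vds - vds ** 2 / 2)
    # saturated: channel pinched off, Ids independent of Vds
    return beta * vov ** 2 / 2


for vds in (0.5, 1.0, 2.0, 3.0):
    print(vds, ids_nmos(1.7, vds))
```

With Vgs − Vt = 1 V, the printed current stops changing once Vds reaches 1 V, which is the pinch-off condition Vds > Vgs − Vt described above.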

PMOS Enhancement Transistor

• Vg < 0
• Holes are the majority carriers
• Vd < 0, which sweeps holes from the source through the channel to the drain.

Threshold voltage
• A function of:
– Gate conductor material
– Gate insulator material
– Gate insulator thickness
– Impurity at the silicon-insulator interface
– Voltage between the source and the substrate, Vsb
– Temperature:
a. −4 mV/°C for high substrate doping
b. −2 mV/°C for low substrate doping

SCALING

In scaling there are really two issues:

• Devices: can we build smaller devices, and what will their performance be?
• Wires: try to avoid the "wet noodle" effect.

There is concern about our ability to scale both of these components.

Limitations
Limitations to device scaling have been around since working in 3 µm nMOS, 22 years ago (actually bipolar).
• Worries were:
– Short-channel effect
– Punchthrough (drain control of current rather than gate)
– Hot electrons
– Parasitic resistances
• Now the worries are a little different:
– Oxide tunnel currents
– Punchthrough
– Parameter control
– Parasitic resistances

Transistor scaling

People are building very short channel devices.

• Shown are I-V curves for a 15 nm L pMOS
• And a short-channel nMOS
• The structure is strange (FinFET)
• But you can make them work

Wire scaling

There is more uncertainty than in transistor scaling.

• Many options with complex trade-offs
• For each metal layer, need to set H, TT, TB, e1, e2, and the conductivity of the metal

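CMOS INVERTER

p-device turns on when ($V_{tp}$ is negative):

$$V_{gsp} = V_{in} - V_{DD} < V_{tp} = -|V_{tp}| \;\Rightarrow\; V_{in} < V_{DD} - |V_{tp}|$$

n-device: $V_{gsn} = V_{in}$, $V_{dsn} = V_{out}$

Both transistors are "on":

$$P = \alpha f C V^2 \quad (\alpha:\ \text{switching activity})$$

Solve for $I_{dsn} = -I_{dsp}$ with $V_{inn} = V_{inp}$.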

(1) Region A. $0 \le V_{in} \le V_{tn}$

n-device is 'off', $I_{dsn} = 0$ ($= -I_{dsp}$)

p-device is in 'linear' mode

$$V_{out} - V_{DD} = V_{dsp} = 0 \;\Rightarrow\; V_{out} = V_{DD}$$

(2) Region B. $V_{tn} \le V_{in} \le V_{DD}/2$

p-device: linear mode
n-device: saturation mode

n:
$$I_{dsn} = \beta_n \frac{[V_{in} - V_{tn}]^2}{2}, \qquad \beta_n = \frac{\mu_n \varepsilon}{t_{ox}} \left(\frac{W_n}{L_n}\right)$$

p:
$$V_{gs} = V_{in} - V_{DD}, \qquad V_{ds} = V_{out} - V_{DD}$$

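(3) Region C. PMOS, NMOS: saturation

$$I_{dsp} = -\frac{\beta_p}{2}(V_{in} - V_{DD} - V_{tp})^2$$

$$I_{dsn} = \frac{\beta_n}{2}(V_{in} - V_{tn})^2$$

With $I_{dsp} = -I_{dsn}$:

$$V_{in} = \frac{V_{DD} + V_{tp} + V_{tn}\sqrt{\beta_n/\beta_p}}{1 + \sqrt{\beta_n/\beta_p}}$$

By setting $\beta_n = \beta_p$ and $V_{tn} = -V_{tp}$, $V_{in}$ takes one value only:

$$V_{in} = \frac{V_{DD}}{2}$$

Possible $V_{out}$:

N-MOS (saturation): $V_{gs} - V_{tn} < V_{ds} \;\Rightarrow\; V_{in} - V_{tn} < V_{out}$

P-MOS (saturation): $V_{gs} - V_{tp} > V_{ds} \;\Rightarrow\; (V_{in} - V_{DD}) - V_{tp} > (V_{out} - V_{DD}) \;\Rightarrow\; V_{out} < V_{in} - V_{tp}$

$$\Rightarrow\; V_{in} - V_{tn} < V_{out} < V_{in} - V_{tp} \qquad (V_{tp} \text{ is a negative value})$$

$\Rightarrow$ $V_{in}$ is fixed at $V_{DD}/2$ while $V_{out}$ varies, which makes the output transition very steep.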

(4) Region D. $V_{DD}/2 < V_{in} < V_{DD} + V_{tp}$

P-MOS: saturation mode
N-MOS: linear mode

$$I_{dsp} = -\tfrac{1}{2}\beta_p (V_{in} - V_{DD} - V_{tp})^2$$

$$I_{dsn} = \beta_n\left[(V_{in} - V_{tn})V_{out} - \frac{V_{out}^2}{2}\right]$$

Solve $I_{dsn} = -I_{dsp}$:

$$V_{out} = (V_{in} - V_{tn}) - \sqrt{(V_{in} - V_{tn})^2 - \frac{\beta_p}{\beta_n}(V_{in} - V_{DD} - V_{tp})^2}$$

(5) Region E. $V_{in} \ge V_{DD} + V_{tp}$

p-device is 'off', n-device is in 'linear' mode

$$I_{dsp} = 0 \;\Rightarrow\; I_{dsn} = 0 \;\Rightarrow\; V_{out} = 0$$

See Table 2.3 for a summary.
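
As a numerical check of the Region C and Region D results, the two closed-form expressions can be evaluated directly. This is a minimal sketch: the function names and the supply and threshold values are assumptions chosen for illustration, with Vtp taken as negative as in the derivation.

```python
import math


def switching_threshold(vdd, vtn, vtp, beta_n, beta_p):
    """Region C input voltage, where both devices are saturated (vtp < 0)."""
    r = math.sqrt(beta_n / beta_p)
    return (vdd + vtp + vtn * r) / (1 + r)


def vout_region_d(vin, vdd, vtn, vtp, beta_n, beta_p):
    """Region D output voltage: pMOS saturated, nMOS linear."""
    a = vin - vtn            # nMOS effective gate voltage
    b = vin - vdd - vtp      # pMOS effective gate voltage (negative while on)
    return a - math.sqrt(a * a - (beta_p / beta_n) * b * b)


VDD, VTN, VTP = 5.0, 1.0, -1.0   # illustrative values, not from the notes
print(switching_threshold(VDD, VTN, VTP, 1.0, 1.0))
print(vout_region_d(3.0, VDD, VTN, VTP, 1.0, 1.0))
print(vout_region_d(VDD + VTP, VDD, VTN, VTP, 1.0, 1.0))
```

For the matched case (βn = βp, Vtn = −Vtp) the first value is VDD/2, and the last call shows Vout reaching 0 at the Region D/E boundary Vin = VDD + Vtp.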

LAYOUT DESIGN RULES

Why we need design rules

• Masks are tooling for manufacturing.
• Manufacturing processes have inherent limitations in accuracy.
• Design rules specify the geometry of masks that will provide reasonable yields.
• Design rules are determined by experience.

Layers
Metal (blue)
Polysilicon (red)
N-diffusion (green)
P-diffusion (brown)
Contact / via

Connection rules

          poly   n-diff   p-diff   metal
poly      S      N        P        NC
n-diff           S        X        NC
p-diff                    S        NC
metal                              S

[Figure: CMOS inverter, NAND gate, and NOR gate (inputs In, A, B; output Out; rails Vdd and Gnd)]

UNIT-II
COMBINATIONAL LOGIC CIRCUITS
ELMORE'S CONSTANT
area above step response:

$$T^{elm}_k = \int_0^\infty (1 - v_k(t))\,dt$$

first moment of impulse response:

$$T^{elm}_k = \int_0^\infty t\,v_k'(t)\,dt$$

• $T^{elm}_k \ge 0.5\,D_k$, where $D_k$ is the 50% delay
• good approximation of $D_k$ only when $v_k$ is monotonically increasing
• interpret $v_k'$ as a probability density: $T^{elm}_k$ is the mean, $D_k$ is the median

Elmore delay for RC tree

RC tree
• one input voltage source
• resistors form a tree with root at the voltage source
• all capacitors are grounded

Elmore delay to node k:

$$T^{elm}_k = \sum_i C_i \Big(\sum \text{resistances upstream from both node } k \text{ and node } i\Big)$$

Example:
$$T^{elm}_3 = C_3(R_1 + R_2 + R_3) + C_2(R_1 + R_2) + C_1 R_1 + C_4 R_1 + C_5 R_1 + C_6 R_1$$
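
The example can be reproduced with a short sketch that weights each capacitor by the resistance it shares with the path to node k. The tree shape used here is an assumption (the figure is not reproduced in these notes): nodes 4, 5, and 6 are taken to hang off node 1, which gives exactly the terms of the example. The function name and the component values are illustrative.

```python
def elmore_delay(parent, R, C, k):
    """Elmore delay to node k of an RC tree.

    parent[n] is the node upstream of n (0 is the voltage source),
    R[n] is the resistor between n and parent[n], and C[n] is the
    grounded capacitor at node n.
    """
    def upstream(n):
        # Resistors on the path from the source to node n, keyed by child node.
        path = set()
        while n != 0:
            path.add(n)
            n = parent[n]
        return path

    to_k = upstream(k)
    # Each capacitor is weighted by the resistance it shares with the path to k.
    return sum(C[i] * sum(R[n] for n in to_k & upstream(i)) for i in C)


# Assumed topology: 0-1-2-3 is the main path; 4 branches from 1; 5 and 6 from 4.
parent = {1: 0, 2: 1, 3: 2, 4: 1, 5: 4, 6: 4}
R = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}   # illustrative values
C = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6}   # illustrative values
print(elmore_delay(parent, R, C, 3))
```

With these values the result equals C3(R1+R2+R3) + C2(R1+R2) + (C1+C4+C5+C6)R1 evaluated by hand.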

Elmore delay optimization via GP

In transistor and wire sizing, $R_i = \alpha_i / x_i$ and $C_j = a_j^T x + b_j$ ($\alpha_i \ge 0$; $a_j, b_j \ge 0$).

Elmore delay is then a sum of products $R_i C_j$ with nonnegative coefficients, i.e. a posynomial in $x$; hence one can minimize area or power, subject to a bound on Elmore delay, using geometric programming.

Commercial software (1980s): e.g., TILOS

Limitations of Elmore delay optimization

• not a good approximation of 50% delay when the step response is not monotonic (capacitive coupling between nodes, or non-diagonal C)
• no useful convexity properties when
– there are loops of resistors
– the circuit has multiple sources
– resistances depend on more than one variable

Pass-Transistor Logic

[Figure: switch network with inputs A and B driving Out]

• We have assumed the source is grounded. What if source > 0?
– e.g. a pass transistor passing VDD
• Vg = VDD
– If Vs > VDD − Vt, then Vgs < Vt
– Hence the transistor would turn itself off
• nMOS pass transistors pull no higher than VDD − Vtn
– Called a degraded "1"
– Approach the degraded value slowly (low Ids)
• pMOS pass transistors pull no lower than |Vtp|
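
The degraded levels can be illustrated with a small steady-state sketch. It ignores body effect and the slow approach to the final value, and the function names and voltage values are assumptions for illustration only.

```python
def nmos_pass(vin, vg, vtn):
    """Steady-state output of an nMOS pass transistor (body effect ignored)."""
    # The device turns itself off once the output rises to vg - vtn.
    return min(vin, vg - vtn)


def pmos_pass(vin, vg, vtp_mag):
    """Steady-state output of a pMOS pass transistor; vtp_mag is |Vtp|."""
    # The device turns itself off once the output falls to vg + |Vtp|.
    return max(vin, vg + vtp_mag)


VDD, VT = 1.8, 0.4   # illustrative values, not from the notes
print(nmos_pass(VDD, VDD, VT))   # degraded "1"
print(nmos_pass(0.0, VDD, VT))   # a "0" passes cleanly through nMOS
print(pmos_pass(0.0, 0.0, VT))   # degraded "0"
print(pmos_pass(VDD, 0.0, VT))   # a "1" passes cleanly through pMOS
```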

Pass transistor circuits:


Effective resistance:
• Shockley models have limited value
– Not accurate enough for modern transistors
– Too complicated for much hand analysis
• Simplification: treat the transistor as a resistor
– Replace Ids(Vds, Vgs) with an effective resistance R
• Ids = Vds/R
– R is averaged across the switching of a digital gate
• Too inaccurate to predict current at any given time
– But good enough to predict RC delay

Transmission gates

Static CMOS
A basic static CMOS combinational circuit consists of a complementary pair of networks: a pull-up network of pMOS transistors and a pull-down network of nMOS transistors.

For example, an AND gate:

Dynamic logic
There is another class of logic gates which relies on the use of a clock signal. This class of circuits is known as dynamic circuits. The clock signal is used to divide the gate operation into two halves. In the first half, the output node is precharged to a high or low logic state. In the second half of the clock cycle, the circuit evaluates the correct output state.
When Ø is low, Z is charged high. When Ø is high, the n logic block evaluates its inputs and conditionally discharges Z. This circuit adds series resistance to the pull-down n-channel transistors, therefore the fall time is increased slightly.
This circuit is dynamic because, during evaluation, the output high level at Z is maintained only by the stray capacitance at the output node. If Ø stays high (i.e. the evaluation period lasts) for a long time, Z may eventually discharge to a low logic level.

Problem with cascading such circuits:

• Inputs can only be changed when Ø is low and must be stable when Ø is high.
• When Ø is low, both P1 and P2 are precharged to a high voltage. However, when Ø is high, the delay through the first stage means that its output P1, still high, may erroneously discharge P2.

Power dissipation
In the past, the major concerns of the VLSI designer were area, performance, cost, and reliability; power considerations were mostly of only secondary importance. In recent years, however, this has begun to change, and increasingly power is being given comparable weight to area and speed.
Several factors have contributed to this trend. Perhaps the primary driving factor has been the remarkable success and growth of the class of personal computing devices (portable desktops, audio- and video-based multimedia products) and wireless communications systems (personal digital assistants and personal communicators), which demand high-speed computation and complex functionality with low power consumption. There also exists a strong pressure for producers of high-end products to reduce their power consumption.

Software Level Power dissipation

The first task in the estimation of power consumption of a digital system is to identify the typical application programs that will be executed on the system.

A nontrivial application program consumes millions of machine cycles, making it nearly impossible to perform power estimation using the complete program at, say, the RT level. Most of the reported results are based on power macro-modeling, an estimation approach which is extensively used for behavioral and RT-level estimation.

In one approach, the power cost of a CPU module is characterized by estimating the average capacitance that would switch when the given CPU module is activated.

In another, the switching activities on the address, instruction, and data buses are used to estimate the power consumption of the microprocessor.

A further model is based on actual current measurements. Energy_p is the total energy dissipation of the program, which is divided into three parts. The first part is the summation of the base energy cost of each instruction, where BCi is the base energy cost of instruction i and Ni is the number of times instruction i is executed.

The second part accounts for the circuit state: SCij is the energy cost when instruction i is followed by instruction j during program execution. Finally, the third part accounts for the energy contribution OCk of other instruction effects, such as stalls and cache misses, during program execution.
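
The three-part energy model described above can be sketched as a simple sum. The equation itself is not reproduced in these notes, so the form used here (base costs times execution counts, plus pair costs times pair counts, plus the other effects) is an assumption based on the description; the function name, the pair-count argument, and all numbers are illustrative.

```python
def program_energy(base_cost, counts, state_cost, pair_counts, other_costs):
    """Total program energy as the sum of the three parts described in the text.

    base_cost[i]      -> BC_i, base energy cost of instruction i
    counts[i]         -> N_i, number of times instruction i is executed
    state_cost[(i,j)] -> SC_ij, cost when instruction i is followed by j
    pair_counts[(i,j)]-> assumed number of times i is followed by j
    other_costs       -> OC_k, costs of stalls, cache misses, etc.
    """
    base = sum(base_cost[i] * n for i, n in counts.items())
    state = sum(state_cost[p] * n for p, n in pair_counts.items())
    return base + state + sum(other_costs)


# Illustrative numbers in arbitrary energy units.
base_cost = {"add": 1.0, "load": 2.0}
counts = {"add": 3, "load": 2}
state_cost = {("add", "load"): 0.5, ("load", "add"): 0.25}
pair_counts = {("add", "load"): 2, ("load", "add"): 2}
print(program_energy(base_cost, counts, state_cost, pair_counts, [1.5]))
```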

Instead of using a macro-modeling equation to model the energy dissipation of a microprocessor, another approach uses a synthesized program to exercise the microprocessor in such a way that the resulting instruction trace behaves, in terms of performance and power dissipation, much the same as the original trace.

The new instruction trace is, however, much shorter than the original one and can hence be simulated on an RT-level description of the target microprocessor to provide the power dissipation results quickly. Specifically, this approach consists of the following steps:

• Perform architectural simulation of the target microprocessor under the instruction trace of typical application programs.
• Extract a characteristic profile, including parameters such as the instruction mix, instruction/data cache miss rates, branch prediction miss rate, pipeline stalls, etc., for the microprocessor.
• Use mixed-integer linear programming and heuristic rules to gradually transform a generic program template into a fully functional program.
• Perform RT-level simulation of the target microprocessor under the instruction trace of the new synthesized program.

Behavioral Level Power dissipation

Unlike some of the RT-level methods, estimation techniques at the behavioral level cannot rely on information about the gate-level structure of the design components, and hence must resort to abstract notions of physical capacitance and switching activity to predict power dissipation in the design.

UNIT-III
SEQUENTIAL LOGIC CIRCUITS

Static and dynamic latches and registers

Sequential logic

Static Latches and Registers


The Bistability Principle

Static memories use positive feedback to create a bistable circuit, a circuit having two stable states that represent 0 and 1. The basic idea is shown in Figure 7.4a, which shows two inverters connected in cascade along with a voltage-transfer characteristic typical of such a circuit. Also plotted are the VTCs of the first inverter, that is, Vo1 versus Vi1, and the second inverter (Vo2 versus Vo1). The latter plot is rotated to accentuate that Vi2 = Vo1. Assume now that the output of the second inverter Vo2 is connected to the input of the first Vi1, as shown by the dotted lines in the figure.

[Figure 7.4: (a) two cascaded inverters and their VTCs; (b) combined VTC, with axes Vi1 = Vo2 and Vi2 = Vo1, showing operation points A, C, and B]

The resulting circuit has only three possible operation points (A, B, and C), as demonstrated on the combined VTC. The following important conjecture is easily proven to be valid:
Under the condition that the gain of the inverter in the transient region is larger than 1, only A and B are stable operation points, and C is a metastable operation point.
Suppose that the cross-coupled inverter pair is biased at point C. A small deviation from this bias point, possibly caused by noise, is amplified and regenerated around the circuit loop. This is a consequence of the gain around the loop being larger than 1. The effect is demonstrated in Figure 7.5a. A small deviation δ is applied to Vi1 (biased in C). This deviation is amplified by the gain of the inverter. The enlarged divergence is applied to the second inverter and amplified once more. The bias point moves away from C until one of the operation points A or B is reached. In conclusion, C is an unstable operation point. Every deviation (even the smallest one) causes the operation point to run away from its original bias. The chance is indeed very small that the cross-coupled inverter pair is biased at C and stays there. Operation points with this property are termed metastable.

[Figure 7.5: (a) a deviation δ from C grows toward A or B; (b) a deviation δ from A or B dies out. Axes: Vi1 = Vo2 and Vi2 = Vo1]

On the other hand, A and B are stable operation points, as demonstrated in Figure 7.5b. In these points, the loop gain is much smaller than unity. Even a rather large deviation from the operation point is reduced in size and disappears.
Hence the cross-coupling of two inverters results in a bistable circuit, that is, a circuit with two stable states, each corresponding to a logic state. The circuit serves as a memory, storing either a 1 or a 0 (corresponding to positions A and B).
In order to change the stored value, we must be able to bring the circuit from state A to B and vice versa. Since the precondition for stability is that the loop gain G is smaller than unity, we can achieve this by making A (or B) temporarily unstable by increasing G to a value larger than 1. This is generally done by applying a trigger pulse at Vi1 or Vi2. For instance, assume that the system is in position A (Vi1 = 0, Vi2 = 1). Forcing Vi1 to 1 causes both inverters to be on simultaneously for a short time and the loop gain G to be larger than 1. The positive feedback regenerates the effect of the trigger pulse, and the circuit moves to the other state (B in this case). The width of the trigger pulse need be only a little larger than the total propagation delay around the circuit loop, which is twice the average propagation delay of the inverters.
In summary, a bistable circuit has two stable states. In the absence of any triggering, the circuit remains in a single state (assuming that the power supply remains applied to the circuit), and hence remembers a value. A trigger pulse must be applied to change the state of the circuit. Another common name for a bistable circuit is flip-flop (unfortunately, an edge-triggered register is also referred to as a flip-flop).
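
The runaway from the metastable point can be illustrated numerically by sending a voltage around the two-inverter loop many times. This is only a sketch: the tanh-shaped inverter characteristic, the gain of 4, and the normalized 1 V supply are assumptions chosen so that the mid-point gain exceeds 1; they do not come from the text.

```python
import math


def inverter(v, vdd=1.0, gain=4.0):
    """Idealized inverter VTC: a smooth curve with slope -gain at vdd/2 (assumed shape)."""
    return vdd / 2 * (1 - math.tanh(2 * gain * (v - vdd / 2) / vdd))


def settle(vi1, loops=50):
    """Send vi1 around the two-inverter loop repeatedly and return the final value."""
    for _ in range(loops):
        vi1 = inverter(inverter(vi1))
    return vi1


print(settle(0.50))   # biased exactly at C: stays there
print(settle(0.51))   # small positive deviation: runs away to the high state
print(settle(0.49))   # small negative deviation: runs away to the low state
```

A deviation of only 10 mV from the mid-point is enough to drive the loop to one of the two stable states, while the stable states themselves absorb perturbations because the loop gain there is far below 1.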


Multiplexer Based Latches

There are many approaches for constructing latches. One very common technique involves the use of transmission-gate multiplexers. Multiplexer-based latches can provide similar functionality to the SR latch, but have the important added advantage that the sizing of devices only affects performance and is not critical to the functionality.
The figure shows an implementation of static positive and negative latches based on multiplexers. For a negative latch, when the clock signal is low, input 0 of the multiplexer is selected and the D input is passed to the output. When the clock signal is high, input 1 of the multiplexer, which connects to the output of the latch, is selected. The feedback holds the output stable while the clock signal is high. Similarly, in the positive latch, the D input is selected when the clock is high, and the output is held (using feedback) when the clock is low.
[Figure: Negative and positive latches based on multiplexers.]
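
The behaviour of the two latches can be mimicked with a small behavioural model in which a 2:1 multiplexer either passes D or feeds Q back. The class name `MuxLatch` and its interface are hypothetical, introduced only for this sketch; no circuit-level effects are modelled.

```python
class MuxLatch:
    """Behavioural level-sensitive latch built around a 2:1 multiplexer."""

    def __init__(self, positive=True, q=0):
        self.positive = positive   # True: transparent while clk == 1
        self.q = q

    def step(self, clk, d):
        transparent = (clk == 1) if self.positive else (clk == 0)
        # The multiplexer selects D when transparent, otherwise feeds Q back.
        self.q = d if transparent else self.q
        return self.q


pos = MuxLatch(positive=True)
print(pos.step(1, 1))   # clock high: D is passed to Q
print(pos.step(0, 0))   # clock low: Q is held by the feedback path
neg = MuxLatch(positive=False)
print(neg.step(0, 1))   # clock low: D is passed to Q
print(neg.step(1, 0))   # clock high: Q is held
```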
A transistor-level implementation of a positive latch based on multiplexers is shown below. When CLK is high, the bottom transmission gate is on and the latch is transparent, that is, the D input is copied to the Q output. During this phase, the feedback loop is open since the top transmission gate is off. Unlike the SR FF, the feedback does not have to be overridden to write the memory, and hence the sizing of transistors is not critical for realizing correct functionality. The number of transistors that the clock touches is important since it has an activity factor of 1. This particular latch implementation is not particularly efficient by this metric, as it presents a load of 4 transistors to the CLK signal.

[Figure: Transistor-level implementation of a positive latch built using transmission gates.]
It is possible to reduce the clock load to two transistors by implementing the multiplexers using NMOS-only pass transistors, as shown in Figure 7.13. The advantage of this approach is the reduced clock load of only two NMOS devices. When CLK is high, the latch samples the D input, while a low clock signal enables the feedback loop and puts the latch in the hold mode. While attractive for its simplicity, the use of NMOS-only pass transistors results in the passing of a degraded high voltage of VDD − VTn to the input of the first inverter. This impacts both noise margin and switching performance, especially in the case of low values of VDD and high values of VTn. It also causes static power dissipation in the first inverter, as already pointed out in Chapter 6. Since the maximum input voltage to the inverter equals VDD − VTn, the PMOS device of the inverter is never turned off, resulting in a static current flow.

[Figure: Transistor-level implementation of a master-slave positive edge-triggered register using multiplexers (inverters I1 to I6, transmission gates T1 to T4, internal node QM).]

Timing Properties of the Multiplexer-Based Master-Slave Register

As discussed earlier, there are three important timing metrics in registers: the set-up time, the hold time, and the propagation delay. It is important to understand the factors that affect these timing parameters and to develop the intuition to estimate them manually. Assume that the propagation delay of each inverter is tpd_inv and the propagation delay of the transmission gate is tpd_tx. Also assume that the contamination delay is 0 and that the inverter used to derive the inverted clock from CLK has a delay equal to 0.
The set-up time is the time before the rising edge of the clock that the input data D must become valid. Another way to ask the question is: how long before the rising edge does the D input have to be stable such that QM samples the value reliably? For the transmission-gate multiplexer-based register, the input D has to propagate through I1, T1, I3, and I2 before the rising edge of the clock. This is to ensure that the node voltages on both terminals of the transmission gate T2 are at the same value. Otherwise, it is possible for the cross-coupled pair I2 and I3 to settle to an incorrect value. The set-up time is therefore equal to 3·tpd_inv + tpd_tx.
The propagation delay is the time for the value of QM to propagate to the output Q. Note that since we included the delay of I2 in the set-up time, the output of I4 is valid before the rising edge of the clock. Therefore the delay tc-q is simply the delay through T3 and I6 (tc-q = tpd_tx + tpd_inv).

The hold time represents the time that the input must be held stable after the rising edge of the clock. In this case, the transmission gate T1 turns off when the clock goes high, and therefore any changes in the D input after the clock goes high are not seen by the input. Therefore, the hold time is 0.
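
The three estimates derived above reduce to simple arithmetic on the two gate delays. In this sketch the function name and the delay values (20 and 15, in picoseconds) are assumptions for illustration; the expressions are the ones stated in the text under its zero-contamination-delay and ideal-clock-inverter assumptions.

```python
def register_timing(tpd_inv, tpd_tx):
    """Hand estimates for the transmission-gate master-slave register."""
    return {
        "t_setup": 3 * tpd_inv + tpd_tx,   # D through I1, T1, I3, I2
        "t_cq": tpd_tx + tpd_inv,          # QM through T3 and I6
        "t_hold": 0,                       # T1 turns off at the clock edge
    }


# Illustrative delays in picoseconds.
print(register_timing(tpd_inv=20, tpd_tx=15))
```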

As mentioned earlier, the drawback of the transmission-gate register is the high capacitive load presented to the clock signal. The clock load per register is important since it directly impacts the power dissipation of the clock network. Ignoring the overhead required to invert the clock signal (since the buffer inverter overhead can be amortized over multiple register bits), each register has a clock load of 8 transistors. One approach to reduce the clock load at the cost of robustness is to make the circuit ratioed. Figure 7.18 shows that the feedback transmission gate can be eliminated by directly cross-coupling the inverters.

[Figure 7.18: Reduced-clock-load static master-slave register (transmission gates T1 and T2, inverters I1 to I4).]

The penalty for the reduced clock load is increased design complexity. The transmission gate (T1) and its source driver must overpower the feedback inverter (I2) to switch the state of the cross-coupled inverter. The sizing requirements for the transmission gates can be derived using a similar analysis as performed for the SR flip-flop. The input to the inverter I1 must be brought below its switching threshold in order to make a transition. If minimum-sized devices are to be used in the transmission gates, it is essential that the transistors of inverter I2 be made even weaker. This can be accomplished by making their channel lengths larger than minimum. Using minimum or close-to-minimum-size devices in the transmission gates is desirable to reduce the power dissipation in the latches and the clock distribution network.
Another problem with this scheme is reverse conduction, that is, the second stage can affect the state of the first latch. When the slave stage is on (Figure 7.19), it is possible for the combination of T2 and I4 to influence the data stored in the I1-I2 latch. As long as I4 is a weak device, this is fortunately not a major problem.
[Figure 7.19: Reverse conduction possible in the transmission gate.]

Non-ideal clock signals

So far, we have assumed that the inverted clock is a perfect inversion of CLK, or in other words, that the delay of the generating inverter is zero. Even if this were possible, it would still not be a good assumption. Variations can exist in the wires used to route the two clock signals, or the load capacitances can vary based on the data stored in the connecting latches. This effect, known as clock skew, is a major problem and causes the two clock signals to overlap, as shown in Figure 7.20b. Clock overlap can cause two types of failures, as illustrated for the NMOS-only negative master-slave register of Figure 7.20a.

[Figure 7.20: Master-slave register based on NMOS-only pass transistors. (a) Schematic diagram, with nodes A, B, and X; (b) overlapping clock pairs.]

When the clock goes high, the slave stage should stop sampling the master stage output and go into hold mode. However, since CLK and its complement are both high for a short period of time (the overlap period), both sampling pass transistors conduct and there is a direct path from the D input to the Q output. As a result, data at the output can change on the rising edge of the clock, which is undesired for a negative edge-triggered register. This is known as a race condition, in which the value of the output Q is a function of whether the input D arrives at node X before or after the falling edge of the complementary clock. If node X is sampled in the metastable state, the output will switch to a value determined by noise in the system.
The primary advantage of the multiplexer-based register is that the feedback loop is open during the sampling period, and therefore the sizing of devices is not critical to functionality. However, if there is clock overlap between CLK and its complement, node A can be driven by both D and B, resulting in an undefined state.
These problems can be avoided by using two non-overlapping clocks PHI1 and PHI2 instead, and by keeping the non-overlap time tnon_overlap between the clocks large enough that no overlap occurs even in the presence of clock-routing delays.
During the non-overlap time, the FF is in the high-impedance state: the feedback loop is open, the loop gain is zero, and the input is disconnected. Leakage will destroy the state if this condition holds for too long. Hence the name pseudostatic: the register employs a combination of static and dynamic storage approaches depending upon the state of the clock.

Low-Voltage Static Latches

The scaling of supply voltages is critical for low-power operation. Unfortunately, certain latch structures don't function at reduced supply voltages. For example, without the scaling of device thresholds, NMOS-only pass transistors (e.g., Figure 7.21) don't scale well with supply voltage due to their inherent threshold drop. At very low power supply voltages, the input to the inverter cannot be raised above the switching threshold, resulting in incorrect evaluation. Even with the use of transmission gates, performance degrades significantly at reduced supply voltages.

Scaling to low supply voltages hence requires the use of reduced-threshold devices. However, this has the negative effect of exponentially increasing the sub-threshold leakage power, as discussed in Chapter 6. When the registers are constantly accessed, the leakage energy is typically insignificant compared to the switching power. However, with the use of conditional clocks, it is possible that registers are idle for extended periods, and the leakage energy expended by registers can be quite significant.
Many solutions are being explored to address the problem of high leakage during idle periods. One approach involves the use of multiple-threshold devices, as shown in Figure 7.23 [Mutoh95]. Only the negative latch is shown here. The shaded inverters and transmission gates are implemented in low-threshold devices. The low-threshold inverters are gated using high-threshold devices to eliminate leakage.
During the normal mode of operation, the sleep devices are turned on. When the clock is low, the D input is sampled and propagates to the output. When the clock is high, the latch is in the hold mode. The feedback transmission gate conducts and the cross-coupled feedback is enabled. Note there is an extra inverter, needed for storage of state when the latch is in the sleep state. During idle mode, the high-threshold devices in series with the low-threshold inverters are turned off (the SLEEP signal is high), eliminating leakage. It is assumed that the clock is in the high state when the latch is in the sleep state. The feedback low-threshold transmission gate is turned on and the cross-coupled high-threshold devices maintain the state of the latch.

Dynamic Latches and Registers

Storage in a static sequential circuit relies on the concept that a cross-coupled inverter
pair produces a bistable element and can thus be used to memorize binary values. This approach
has the useful property that a stored value remains valid as long as the supply voltage is applied
to the circuit, hence the name static. The major disadvantage of the static gate, however, is its
complexity. When registers are used in computational structures that are constantly clocked
such as pipelined datapath, the requirement that the memory should hold state for extended
periods of time can be significantly relaxed.
This results in a class of circuits based on temporary storage of charge on parasitic
capacitors. The principle is exactly identical to the one used in dynamic logic — charge stored
on a capacitor can be used to represent a logic signal. The absence of charge denotes a 0, while
its presence stands for a stored 1. No capacitor is ideal, unfortunately, and some charge leakage
is always present. A stored value can hence only be kept for a limited amount of time, typically
in the range of milliseconds. If one wants to preserve signal integrity, a periodic refresh of its
value is necessary. Hence the name dynamic storage. Reading the value of the stored signal
from a capacitor without disrupting the charge requires the availability of a device with a high
input impedance.
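
As a rough illustration of why a periodic refresh is needed, the time a dynamic node can hold its value may be estimated as t = C * dV / I_leak, treating the leakage as a constant current. The function name and the numbers below (10 fF node, 0.5 V tolerable droop, 1 pA leakage) are illustrative assumptions, not values from the text.

```python
def retention_time(c_farads, delta_v, i_leak_amps):
    """Approximate time for a constant leakage current to change the
    voltage on a storage capacitor by delta_v volts: t = C * dV / I."""
    if i_leak_amps <= 0:
        raise ValueError("leakage current must be positive")
    return c_farads * delta_v / i_leak_amps

# Assumed example values: 10 fF storage node, 0.5 V droop allowed, 1 pA leakage.
t_hold_s = retention_time(10e-15, 0.5, 1e-12)
print(f"approximate retention time: {t_hold_s * 1e3:.1f} ms")
```

With these assumed values the estimate lands in the millisecond range mentioned above; real leakage depends strongly on voltage and temperature, so this is only a first-order sketch.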

Dynamic Transmission-Gate Based Edge-Triggered Registers

A fully dynamic positive edge-triggered register based on the master-slave concept is
shown in Figure 7.24. When CLK = 0, the input data is sampled on storage node 1, which has an
equivalent capacitance of C1 consisting of the gate capacitance of I1, the junction capacitance of
T1, and the overlap gate capacitance of T1. During this period, the slave stage is in a hold mode,
with node 2 in a high-impedance (floating) state. On the rising edge of clock, the transmission
gate T2 turns on, and the value sampled on node 1 right before the rising edge propagates to the
output Q (note that node 1 is stable during the high phase of the clock since the first
transmission gate is turned off). Node 2 now stores the inverted version of node 1. This
implementation of an edge-triggered register is very efficient as it requires only 8 transistors.
The sampling switches can be implemented using NMOS-only pass transistors, resulting in an
even simpler 6-transistor implementation. The reduced transistor count is attractive for
high-performance and low-power systems.

The set-up time of this circuit is simply the delay of the transmission gate, and corresponds
to the time it takes node 1 to sample the D input. The hold time is approximately zero,
since the transmission gate is turned off on the clock edge and further input changes are
ignored. The propagation delay (tc-q) is equal to two inverter delays plus the delay of the
transmission gate T2.


One important consideration for such a dynamic register is that the storage nodes (i.e.,
the state) have to be refreshed at periodic intervals to prevent a loss due to charge leakage, caused
by diode leakage as well as sub-threshold currents. In datapath circuits, the refresh rate is not an
issue since the registers are periodically clocked, and the storage nodes are constantly updated.
Clock overlap is an important concern for this register. Consider the clock waveforms
shown in Figure 7.25. During the 0-0 overlap period, the NMOS of T1 and the PMOS of T2 are
simultaneously on, creating a direct path for data to flow from the D input of the register to the
Q output. This is known as a race condition. The output Q can change on the falling edge if the
overlap period is large, obviously an undesirable effect for a positive edge-triggered register.
The same is true for the 1-1 overlap region, where an input-output path exists through the
PMOS of T1 and the NMOS of T2. The latter case is taken care of by enforcing a hold time
constraint. That is, the data must be stable during the high-high overlap period. The former
situation (0-0 overlap) can be addressed by making sure that there is enough delay between the
D input and node 2, ensuring that new data sampled by the master stage does not propagate
through to the slave stage. Generally the built-in single inverter delay should be sufficient and
the overlap period constraint is given as:

$$t_{overlap0-0} < t_{T1} + t_{I1} + t_{T2}$$

Similarly, the constraint for the 1-1 overlap is given as:

$$t_{hold} > t_{overlap1-1}$$
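
The two overlap conditions (the 0-0 overlap must be shorter than the delay from D to node 2, and the hold time must exceed the 1-1 overlap) can be checked numerically. The following is a minimal sketch; the function name and the picosecond values are hypothetical and chosen only for illustration.

```python
def overlap_constraints_met(t_overlap00, t_overlap11, t_t1, t_i1, t_t2, t_hold):
    """Return (ok_00, ok_11) for a transmission-gate master-slave register.

    ok_00: the 0-0 overlap is shorter than the T1 + I1 + T2 delay,
           so new data cannot race through to node 2.
    ok_11: the hold time covers the whole 1-1 overlap period.
    All arguments must use the same time unit.
    """
    ok_00 = t_overlap00 < t_t1 + t_i1 + t_t2
    ok_11 = t_hold > t_overlap11
    return ok_00, ok_11

# Assumed example delays in picoseconds.
result = overlap_constraints_met(
    t_overlap00=100, t_overlap11=50, t_t1=40, t_i1=60, t_t2=40, t_hold=30
)
print(result)  # 0-0 constraint holds, 1-1 (hold) constraint is violated
```

In this assumed case the hold time would have to be raised above the 1-1 overlap, or the clock skew reduced, before the register could be considered safe.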

Figure: Impact of non-overlapping clocks (CLK and CLK̅ waveforms with the (0,0) and (1,1) overlap regions marked).
C2MOS Dynamic Register: A Clock Skew Insensitive Approach
The C2MOS Register
The following ingenious positive edge-triggered register, based on the master-slave concept,
is insensitive to clock overlap. This circuit is called the C2MOS (Clocked CMOS) register
[Suzuki73]. The register operates in two phases.
• CLK = 0 (CLK̅ = 1): The first tri-state driver is turned on, and the master stage acts as an
inverter sampling the inverted version of D on the internal node X. The master stage is
in the evaluation mode. Meanwhile, the slave section is in a high-impedance mode, or in
a hold mode. Both transistors M7 and M8 are off, decoupling the output from the input.
The output Q retains its previous value stored on the output capacitor CL2.

• The roles are reversed when CLK = 1: The master stage section is in hold mode (M3-M4
off), while the second section evaluates (M7-M8 on). The value stored on CL1 propagates
to the output node through the slave stage, which acts as an inverter.
The overall circuit operates as a positive edge-triggered master-slave register — very similar to
the transmission-gate based register presented earlier. However, there is an important difference:


Figure: C2MOS D FF during overlap periods. No feasible signal path can exist between In and D,
as illustrated by the arrows.

True Single-Phase Clocked Register (TSPCR)

In the two-phase clocking schemes described above, care must be taken in
routing the two clock signals to ensure that overlap is minimized. While the
C2MOS provides a skew-tolerant solution, it is possible to design registers that
only use a single-phase clock. The True Single-Phase Clocked Register (TSPCR)
proposed by Yuan and Svensson uses a single clock (without an inverse clock)
[Yuan89]. The basic single-phase positive and negative latches are shown in
Figure 7.30. For the positive latch, when CLK is high, the latch is in
the transparent mode.
Figure 7.30: Single-phase positive and negative latches (inputs In and CLK, output Out).


Pipelining: An approach to optimize sequential circuits

Pipelining is a popular design technique often used to accelerate the operation of the
datapaths in digital processors. The idea is easily explained with the example of Figure 7.40a.
The goal of the presented circuit is to compute log(|a + b|), where both a and b represent streams
of numbers, that is, the computation must be performed on a large set of input values. The
minimal clock period Tmin necessary to ensure correct evaluation is given as:

$$T_{min} = t_{c-q} + t_{pd,logic} + t_{su} \qquad (7.6)$$

where tc-q and tsu are the propagation delay and the set-up time of the register, respectively.
We assume that the registers are edge-triggered D registers. The term tpd,logic stands for
the worst-case delay path through the combinational network, which consists of the adder,
absolute value, and logarithm functions. In conventional systems (that don't push the edge of
technology), the latter delay is generally much larger than the delays associated with the
registers and dominates the circuit performance. Assume that each logic module has an equal
propagation delay. We note that each logic module is then active for only 1/3 of the clock
period (if the delay of the register is ignored). For example, the adder unit is active during the
first third of the period and remains idle (that is, it does no useful computation) during the
other 2/3 of the period. Pipelining is a technique to improve the resource utilization and
increase the functional throughput. Assume that we introduce registers between the logic blocks,
as shown in Figure 7.40b. This causes the computation for one set of input data to spread over a
number of clock periods, as shown in Table 7.1. The result for the data set (a1, b1) only appears
at the output after three clock periods. At that time, the circuit has already performed parts of
the computations for the next data sets, (a2, b2) and (a3, b3). The computation is performed in
an assembly-line fashion, hence the name pipeline.

Figure 7.40: Datapath for the computation of log(|a + b|): (a) nonpipelined version, (b) pipelined version.

Table 7.1: Example of pipelined computations.

Clock Period    Adder      Absolute Value    Logarithm
1               a1 + b1
2               a2 + b2    |a1 + b1|
3               a3 + b3    |a2 + b2|         log(|a1 + b1|)
4               a4 + b4    |a3 + b3|         log(|a2 + b2|)
5               a5 + b5    |a4 + b4|         log(|a3 + b3|)
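
The schedule of the pipelined computations can be reproduced with a small cycle-by-cycle model. This is a behavioral sketch only: the function name and register variables are hypothetical, and it assumes every a + b is nonzero so that the logarithm is defined.

```python
import math

def simulate_pipeline(a_stream, b_stream):
    """Cycle-by-cycle model of a three-stage pipeline: adder -> abs -> log.

    Returns a list whose entry c is the value computed by the logarithm
    stage during clock period c + 1 (None while the pipeline is filling).
    """
    add_reg = None  # pipeline register after the adder
    abs_reg = None  # pipeline register after the absolute-value block
    log_results = []
    n = len(a_stream)
    for cycle in range(n + 2):  # two extra cycles to drain the pipeline
        # Each stage works on the value registered at the previous clock edge.
        log_out = math.log(abs_reg) if abs_reg is not None else None
        new_abs = abs(add_reg) if add_reg is not None else None
        new_add = a_stream[cycle] + b_stream[cycle] if cycle < n else None
        # Clock edge: all pipeline registers update simultaneously.
        add_reg, abs_reg = new_add, new_abs
        log_results.append(log_out)
    return log_results

results = simulate_pipeline([1, 2, 3], [1, 2, 3])
for period, value in enumerate(results, start=1):
    print(period, value)
```

The first logarithm result appears in the third clock period, and one new result follows in every period after that, which is the assembly-line behaviour the table describes.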

The advantage of pipelined operation becomes apparent when examining the minimum clock
period of the modified circuit. The combinational circuit block has been partitioned into three
sections, each of which has a smaller propagation delay than the original function. This
effectively reduces the value of the minimum allowable clock period:

$$T_{min,pipe} = t_{c-q} + \max(t_{pd,add}, t_{pd,abs}, t_{pd,log}) + t_{su} \qquad (7.7)$$

Suppose that all logic blocks have approximately the same propagation delay, and that the
register overhead is small with respect to the logic delays. The pipelined network outperforms
the original circuit by a factor of three under these assumptions, or Tmin,pipe = Tmin/3. The
increased performance comes at the relatively small cost of two additional registers and an
increased latency.¹ This explains why pipelining is popular in the implementation of very
high-performance datapaths.
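
The factor-of-three claim can be checked with a short calculation. The sketch below assumes that the register set-up time is counted in both the nonpipelined and the pipelined clock period; the function names and the delay values (in nanoseconds) are illustrative assumptions.

```python
def t_min(t_cq, stage_delays, t_su):
    """Minimum clock period when all logic sits between one pair of registers."""
    return t_cq + sum(stage_delays) + t_su

def t_min_pipe(t_cq, stage_delays, t_su):
    """Minimum clock period when a register follows every logic stage."""
    return t_cq + max(stage_delays) + t_su

# Assumed delays in ns: adder, absolute value, logarithm.
stages = [3.0, 3.0, 3.0]

# With ideal (zero-overhead) registers the speed-up is exactly 3.
print(t_min(0.0, stages, 0.0) / t_min_pipe(0.0, stages, 0.0))

# With a small register overhead the speed-up falls slightly below 3.
print(t_min(0.05, stages, 0.05) / t_min_pipe(0.05, stages, 0.05))
```

The second ratio shows why the register overhead must be small relative to the logic delay: as the stages get shorter, tc-q and tsu take up a growing share of each cycle.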

Latch- vs. Register-Based Pipelines


Pipelined circuits can be constructed using level-sensitive latches instead of edge-triggered
registers. Consider the pipelined circuit of Figure 7.41. The pipeline system is
implemented based on pass-transistor-based positive and negative latches instead of
edge-triggered registers. That is, logic is introduced between the master and slave latches of a
master-slave system. In the following discussion, we use without loss of generality the CLK-CLK̅
notation to denote a two-phase clock system. Latch-based systems give significantly more
flexibility in implementing a pipelined system, and often offer higher performance. When the
clocks CLK and CLK̅ are nonoverlapping, correct pipeline operation is obtained. Input data is
sampled on C1 at the negative edge of CLK and the computation of logic block F starts; the
result of the logic block F is stored on C2 on the falling edge of CLK̅, and the computation of
logic block G starts. The nonoverlapping of the clocks ensures correct operation. The value
stored on C2 at the end of the CLK low phase is the result of passing the previous input (stored
on the falling edge of CLK on C1) through the logic function F. When overlap exists between CLK
and CLK̅, the next input is already being applied to F, and its effect might propagate to C2
before CLK̅ goes low (assuming that the contamination delay of F is small).

¹ Latency is defined here as the number of clock cycles it takes for the data to
propagate from the input to the output. For the example at hand, pipelining
increases the latency from 1 to 3. An increased latency is in general acceptable,
but can cause a global performance degradation if not treated with care.

LOW POWER LOGIC DESIGN:

The goal is to reduce both the dynamic power and the static power of a circuit.

Reduce dynamic power
• α (activity factor): clock gating, sleep mode
• C: small transistors (esp. on clock), short wires
• VDD: lowest suitable voltage
• f: lowest suitable frequency

Reduce static power
• Selectively use ratioed circuits
• Selectively use low Vt devices
• Leakage reduction: stacked devices, body bias, low temperature
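
The four dynamic-power knobs listed above are the terms of the standard switching-power expression P = α · C · VDD² · f. The sketch below evaluates it for assumed values (the function name and the numbers are illustrative, not taken from the text) to show why the supply voltage is the most effective lever.

```python
def dynamic_power(alpha, c_farads, vdd_volts, f_hertz):
    """Switching power P = alpha * C * VDD^2 * f, in watts."""
    return alpha * c_farads * vdd_volts ** 2 * f_hertz

# Assumed example: activity factor 0.1, 1 nF switched capacitance, 1.0 V, 1 GHz.
p_nominal = dynamic_power(0.1, 1e-9, 1.0, 1e9)

# Halving VDD cuts the power by four; halving f only cuts it by two.
p_half_vdd = dynamic_power(0.1, 1e-9, 0.5, 1e9)
p_half_f = dynamic_power(0.1, 1e-9, 1.0, 0.5e9)

print(p_nominal, p_half_vdd, p_half_f)
```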

Another class of logic circuits are sequential circuits. These circuits are two-
valued networks in which the outputs at any instant are dependent not only upon the
inputs present at that instant but also upon the past history (sequence) of inputs.

Sequential circuits are classified into:

• Synchronous sequential circuits – Their behaviour is determined by the values
of the signals at only discrete instants of time.

• Asynchronous sequential circuits – Their behaviour is immediately affected
by the input signal changes.

SYNCHRONIZERS

A synchronizer is a circuit that accepts an input that can change at arbitrary times and
produces an output aligned to the synchronizer clock.

(i) A synchronizer accepts an input D and a clock; it produces an output Q that ought to be
valid some bounded delay after the clock.
(ii) A synchronizer is built from a pair of flip-flops.
• Metastability
• DC transfer characteristics of the two inverters
• Arbiters

UNIT-IV
DESIGNING ARITHMETIC BUILDING BLOCKS

Arithmetic Building Blocks


• Datapath elements
• Adder design
– Static adder
– Dynamic adder
• Multiplier design
– Array multipliers
• Shifters, Parity circuits
Building Blocks for Digital Architectures
• Arithmetic unit
- Bit-sliced datapath (adder, multiplier, shifter, comparator, etc.)
• Memory
- RAM, ROM, Buffers, Shift registers
• Control
- Finite state machine (PLA, random logic.)
- Counters
• Interconnect
- Switches
- Arbiters
- Bus
BIT SLICE ARCHITECTURE

RIPPLE CARRY ADDER

CARRY LOOK AHEAD ADDERS

High speed adder

$$c_{i+1} = g_i + p_i c_i$$
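
The carry recurrence can be unrolled so that every carry depends only on the generate and propagate signals and the input carry, which is what lets a lookahead adder avoid the ripple. The following is a minimal functional sketch, assuming g_i = a_i AND b_i and p_i = a_i XOR b_i; the function names are hypothetical and the loop models the logic, not its gate-level timing.

```python
def lookahead_carry(g, p, c0, i):
    """Carry into bit i + 1 in unrolled form:
    c[i+1] = g[i] + p[i]g[i-1] + ... + p[i]...p[1]g[0] + p[i]...p[0]c0."""
    carry = g[i]
    prod = p[i]
    for j in range(i - 1, -1, -1):
        carry |= prod & g[j]
        prod &= p[j]
    carry |= prod & c0
    return carry

def cla_add(a, b, width=4, c0=0):
    """Add two unsigned width-bit numbers; return (sum, carry_out)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
    # Carry into each bit position, computed directly from g, p and c0.
    c = [c0] + [lookahead_carry(g, p, c0, i) for i in range(width)]
    total = sum((p[i] ^ c[i]) << i for i in range(width))
    return total, c[width]

print(cla_add(0b1011, 0b0110))  # 11 + 6 = 17 -> sum bits 0001, carry out 1
```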

 One-bit adder could be implemented more efficiently because MUX is faster

Array multiplier

Carry save multiplier

Wallace tree multiplier

The Wallace tree has three steps:

1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other,
yielding n² results for n-bit operands. Depending on the position of the multiplied bits, the
wires carry different weights; for example, the wire carrying the product of bits a2 and b3
has weight 2^(2+3) = 32.
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires in two numbers, and add them with a conventional adder.
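
The three steps above can be sketched as a bit-level model. This is a simplified column-compression version under stated assumptions (unsigned operands, full adders applied greedily to each column, a half adder on any leftover pair); it is not a gate-accurate Wallace layout, and the function name is hypothetical.

```python
def wallace_multiply(a, b, n=4):
    """Multiply two unsigned n-bit integers using Wallace-style reduction."""
    # Step 1: AND every bit pair; columns[w] holds the wires of weight 2**w.
    columns = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Step 2: layers of full and half adders until each column has <= 2 wires.
    while any(len(col) > 2 for col in columns):
        new_cols = [[] for _ in range(len(columns) + 1)]
        for w, col in enumerate(columns):
            k = 0
            while len(col) - k >= 3:  # full adder: 3 wires -> sum + carry
                x, y, z = col[k:k + 3]
                new_cols[w].append(x ^ y ^ z)
                new_cols[w + 1].append((x & y) | (y & z) | (x & z))
                k += 3
            if len(col) - k == 2:     # half adder: 2 wires -> sum + carry
                x, y = col[k:k + 2]
                new_cols[w].append(x ^ y)
                new_cols[w + 1].append(x & y)
            elif len(col) - k == 1:   # a single wire passes through
                new_cols[w].append(col[k])
        columns = new_cols

    # Step 3: group the remaining wires into two numbers and add them.
    first = sum(col[0] << w for w, col in enumerate(columns) if len(col) >= 1)
    second = sum(col[1] << w for w, col in enumerate(columns) if len(col) == 2)
    return first + second

print(wallace_multiply(13, 11))  # 143
```

Each adder preserves the weighted sum of the wires it consumes, so the final two-operand addition yields the product regardless of how the wires are grouped.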

• Identify Critical Paths
• Other possible techniques:
  – Data encoding (Booth)
  – Logarithmic vs. linear (Wallace tree multiplier)
  – Pipelining

Barrel shifter

4x4 barrel shifter

UNIT-V

IMPLEMENTATION STRATEGIES

Design flow

Design methodology

STANDARD CELL DESIGN



Standard cell libraries are required by almost all CAD tools for chip design. Standard cell
libraries contain primitive cells required for digital design. However, more complex cells that
have been specially optimized can also be included.

The main purpose of the CAD tools is to implement the so-called RTL-to-GDS flow.
The input to the design process, in most cases, is the circuit description at the register-transfer
level (RTL). The final output from the design process is the full chip layout, mostly in the GDSII
(gds2) format. Producing a functionally correct design that meets all the specifications and
constraints requires a combination of different tools in the design flow.

These tools require specific information in different formats for each of the cells in the
standard cell library provided to them for the design.

Standard cell library format


The formats explained here are for Cadence tools; however, similar information is required
for other tool suites.
• Physical Layout (GDSII, Virtuoso Layout Editor): Should follow specific design standards,
e.g. constant height, offsets, etc.
• Logical View (Verilog description, TLF or LIB): Verilog is required for dynamic simulation.
Place and route tools usually can use TLF. The Verilog description should preferably support
back annotation of timing information.
• Abstract View (Cadence Abstract Generator, LEF): LEF contains information about
each cell as well as technology information.
• Timing, power and parasitics (TLF or LIB): Transistor and interconnect parasitics are
extracted using Cadence or other extraction tools. A Spice or Spectre netlist is generated and
detailed timing simulations are performed. Power information can also be generated during
these simulations. Data is formatted into a TLF or LIB file including process, temperature and
supply voltage variations. Logical information for each cell is also contained in this file.

Standard cell library formats



FPGA building block architecture

• Based on the principle of functional completeness
• FPGA: functionally complete elements (Logic Blocks) placed in an interconnect framework
• The interconnection framework comprises wire segments and switches, which provide a means
to interconnect logic blocks
• Circuits are partitioned to logic block size, mapped and routed

Interconnection framework
• Granularity and interconnection structure have caused a split in the industry
• FPGA
  – Fine grained
  – Variable length interconnect segments
  – Timing in general is not predictable; timing is extracted after placement and routing
• Field programmability is achieved through switches (transistors controlled by memory
elements or fuses)
• Switches control the following aspects
  – Interconnection among wire segments
  – Configuration of logic blocks
• Distributed memory elements controlling the switches and configuration of logic blocks
are together called “Configuration Memory”
FPGA Structural Classification

The basic structure of an FPGA includes logic elements, programmable interconnects and
memory. The arrangement of these blocks is specific to a particular manufacturer. On the basis
of the internal arrangement of blocks, FPGAs can be divided into three classes:

Symmetrical arrays: This architecture consists of logic elements (called CLBs) arranged
in rows and columns of a matrix, with interconnect laid out between them, as shown in Fig 20.2.
This symmetrical matrix is surrounded by I/O blocks which connect it to the outside world. Each
CLB consists of an n-input lookup table and a pair of programmable flip-flops. I/O blocks also
control functions such as tristate control and output transition speed. Interconnects provide the
routing path. Direct interconnects between adjacent logic elements have a smaller delay than
general-purpose interconnect.

Row-based architecture: The row-based architecture shown in Fig 20.5 consists of
alternating rows of logic modules and programmable interconnect tracks. Input/output blocks are
located in the periphery of the rows. One row may be connected to adjacent rows via vertical
interconnect. Logic modules can be implemented in various combinations. Combinatorial
modules contain only combinational elements, while sequential modules contain both
combinational elements and flip-flops. A sequential module can implement complex
combinatorial-sequential functions. Routing tracks are divided into smaller segments connected
by anti-fuse elements between them.

Hierarchical PLDs: This architecture is designed in a hierarchical manner, with the top level
containing only logic blocks and interconnects. Each logic block contains a number of logic
modules, and each logic module has combinatorial as well as sequential functional elements.
Each of these functional elements is controlled by the programmed memory. Communication
between logic blocks is achieved by programmable interconnect arrays. Input/output blocks
surround this scheme of logic blocks and interconnects.

Logic array block

Routing

Global routing

The global router performs a coarse route to determine, for each connection, the
minimum distance path through routing channels that it has to go through. If the net to be routed
has more than two terminals, the global router will break the net into a set of two-terminal
connections and route each set independently. The global router considers for each connection
multiple ways of routing it and chooses the one that passes through the least congested routing
channels. By keeping track of the usage of each routing channel, congestion is avoided; and the
principal objective of the global router, balancing the usage of the routing channels, is achieved.
Once all connections have been coarse routed, the solution is optimized by ripping up and
rerouting each connection a small number of times. After that, the final solution is passed to the
detailed router.

Detail routing

The detail router determines for each two point connection the specific wiring segments
to use in the routing channel assigned by the global router. To do this, detail routing algorithms
construct a directed graph from the routing resources to represent the available connection
between wires, C blocks, S blocks and logic blocks within the FPGA. The search performed on
this directed graph is usually based on Dijkstra’s algorithm to find the shortest path between two
nodes. The paths are labeled according to a cost function that takes into account the usage of
each wire segment and the distance of the interconnecting points. The distance is estimated by
calculating the wire length in the bounding box of the interconnecting points using a Manhattan
metric. Most of the routers relax the bounding box constraints and allow searching for possible
solutions in the surrounding routing channels of the bounding box. This is done to avoid
subsequent iterations of ripping out and re-routing if the solution lies on the near outside of the
bounding box.
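
The search described above can be sketched with Dijkstra's algorithm on a small routing-resource graph. Everything here is a simplified assumption for illustration: nodes are (x, y) grid coordinates, each hop costs one unit of wire length plus a congestion penalty equal to the current usage of the node being entered, and the function names are hypothetical rather than taken from any FPGA tool.

```python
import heapq

def manhattan(p, q):
    """Manhattan distance, the wire-length estimate between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def route(graph, usage, source, sink):
    """Cheapest path from source to sink in a directed routing graph.

    graph: dict mapping node -> list of successor nodes
    usage: dict mapping node -> how many nets already use that resource
    Returns the list of nodes on the path, or None if the sink is unreachable.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == sink:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in graph.get(node, []):
            nd = d + 1 + usage.get(nxt, 0)  # unit length + congestion penalty
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if sink not in dist:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

# Two equally long routes from (0, 0) to (1, 1); the one through (1, 0) is congested.
graph = {(0, 0): [(1, 0), (0, 1)], (1, 0): [(1, 1)], (0, 1): [(1, 1)], (1, 1): []}
print(route(graph, {(1, 0): 5}, (0, 0), (1, 1)))
print(manhattan((0, 0), (1, 1)))
```

In this toy case the router avoids the heavily used segment even though both candidate paths have the same Manhattan length, which is the effect the cost function is meant to achieve.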

Routing resources

The programmable routing in an FPGA consists of two categories: (1) routing within
each Logic Block/Logic Cluster, and (2) routing between the Logic Blocks/Logic Clusters.
Figure 2.5 shows a detailed view of the routing for a single tile. Normally, an FPGA is created
by replication of such a tile (a tile consists of one Logic Block and its associated routing).

The programmable routing within each Logic Block consists of the Interconnect Matrix. The
programmable routing between the Logic Blocks consists of fixed metal tracks, Switch Blocks,
Connection Blocks, and the programmable switches. The fixed metal tracks run horizontally and
vertically, and are organized in channels; each channel contains the same number of tracks for
the architecture that we investigated. A Switch Block occurs at each intersection between
horizontal and vertical routing channels, and defines all possible connections between these
channels. Three different topologies for Switch Blocks have been proposed in previous work:
the Disjoint Switch Block, the Universal Switch Block, and the Wilton Switch Block. The
Connection Block defines all the possible connections from a horizontal or vertical channel to a
neighboring logic block.

The connections in the switch blocks and connection blocks are made by programmable
switches. Part of the programmable routing also lies within each logic block, determining how
different components are connected within the logic block. This Island-Style architecture is a
very general version of most commercial architectures from Altera and Xilinx. Most recent
commercial FPGAs also incorporate other features on chip, such as digital phase-locked loops and
memories.

The architectural features that this thesis intends to explore are sufficiently represented
in such a simplified island-style architecture, and hence this architecture is assumed in all of our
following work. A programmable switch consists of a pass transistor controlled by a static
random access memory cell (in which case, the device is called an SRAM-based FPGA), or an
anti-fuse (such devices are referred to as anti-fuse FPGAs), or a non-volatile memory cell (such
devices are referred to as floating gate devices). Since SRAM-based FPGAs employ static
random access memory (SRAM) cells to control the programmable switches, they can be
reprogrammed by the end user as many times as required and are volatile. Anti-fuse based
FPGAs, on the other hand, can only be programmed once and are non-volatile. The devices
employing floating gate technology are also non-volatile and can be reprogrammed. Of the three
categories, SRAM-based FPGAs are most widely used and hence we will limit our discussion
and investigations to SRAM-based devices.

Routing architecture terminology

The ‘W’ represents the number of parallel tracks contained in each channel. A track is a
piece of metal traversing one or more logic blocks (for example ‘x’ in Figure 2.5). For clustered
FPGAs the logic block consists of more than one BLE; the figure shows a logic block with ‘N’
such BLEs. The Interconnect Matrix shown in the logic block determines all possible
connections from the logic block inputs ‘I’ to each of N*k BLE inputs.

This Interconnect Matrix is normally implemented using binary tree multiplexers. The
number of logic block inputs ‘I’, and the number of feedback paths determine the size of these
multiplexers. The feedback paths allow the local connections to be made from within the logic
block. Betz [1] showed that I = 2N + 2 (for N less than or equal to 16) is sufficient for good logic
utilization. The ‘Logic Utilization’ is defined as the average number of BLEs per logic block
that a circuit is able to use divided by the total number of BLEs per logic block, N. The number
of tracks in each channel to which each logic block input and output pin can connect is called
the connection block flexibility, Fc.
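
The two quantities defined above lend themselves to a quick numerical check. The sketch below is illustrative only: the function names and the per-block usage numbers are hypothetical, and the input-count rule is applied exactly as quoted (I = 2N + 2 for N up to 16).

```python
def logic_block_inputs(n):
    """Number of logic block inputs I = 2N + 2, quoted as sufficient for N <= 16."""
    if not 1 <= n <= 16:
        raise ValueError("the I = 2N + 2 rule is quoted only for 1 <= N <= 16")
    return 2 * n + 2

def logic_utilization(bles_used_per_block, n):
    """Average number of BLEs used per logic block divided by N."""
    average_used = sum(bles_used_per_block) / len(bles_used_per_block)
    return average_used / n

# Assumed example: clusters of N = 4 BLEs, and a circuit packed into four logic blocks.
print(logic_block_inputs(4))               # 10 inputs per logic block
print(logic_utilization([4, 3, 4, 1], 4))  # 0.75
```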

Fc determines the number of programmable connections in a connection block.

Another useful parameter is Fs, the Switch Block flexibility. Fs defines the number of
connections that can be made by a switch block from a given incoming track to the outgoing
tracks. For example, track ‘x’ in the horizontal channel is capable of connecting to a total of
three other tracks through the switch block, and hence for the switch block architecture of
Figure 2.5, Fs = 3. Rose and Brown have shown that the most area-efficient FPGAs should have
Fs = 3 or 4 and Fc = 0.7W to 0.9W at Fc = 0.25W, and for the FPGAs employing Disjoint switch
block topology the best area efficiency occurs at Fc = 0.5W. Table 2.1 shows a summary of the
terminology which has been described above, and the range of values which were used in this
thesis.

In the original Xilinx XC2000 and XC3000 architectures, a very simple architecture
was employed in which most wire segments spanned only the length or the width of a logic
block. In order to improve the speed of an FPGA El-Gamal [6] introduced the idea of
‘Segmented FPGAs’. The main idea is to provide segments that span multiple logic blocks for
connections that travel long distances. Figure 2.5 shows such a track ‘y’ (in the vertical channel)
which does not terminate at the switch block. Example segments that span 1, 2 and 3
logic blocks are referred to as segments of length 1, 2 and 3. A ‘long’ segment is a
segment which spans all the logic blocks in a given architecture. A similar approach is used in
most of Altera’s devices where long segments are used to carry signals over larger distances
across the chip.

FPGAs employing very long or very short segment wires were inferior in performance
to those employing medium sized segments (those traversing 4 to 8 logic blocks). Paul Chow [7]
introduced another important routing architectural feature of segmented routing architectures
besides the segment distribution called ‘Segment Population’. A segment is called internally
populated if it is possible to make connections from the middle of a segment to a logic block or
to other routing segments. The advantage of unpopulated segments is that they have less
parasitic switch capacitance connected to the segment, which makes them faster. The disadvantage
is that the reduction in routing flexibility (without population there cannot be internal fanout)
may result in the need for more tracks and thus a loss of logic density.
