www.rejinpaul.com
EC6601/VLSI DESIGN
UNIT-1
Symbols
NMOS
PMOS
Devices that are normally cut off with zero gate bias are classified as
"enhancement-mode" devices.
Devices that conduct with zero gate bias are called "depletion-mode"
devices.
Enhancement-mode devices are more popular in practical use.
It consists of:
A moderately doped p-type silicon substrate
Two heavily doped n+ regions, the source and drain, diffused into the substrate
Features
Since the oxide layer is an insulator, the DC current from the gate to
channel is essentially zero.
No physical distinction between the drain and source regions.
Since SiO2 has low loss and high dielectric strength, the application of
high gate fields is feasible.
In operation
Set Vds > 0.
With Vgs = 0, no current flows between source and drain; they are isolated by two
reverse-biased PN junctions.
When Vg > 0, the resulting electric field attracts electrons toward the gate and
repels holes.
If Vg is sufficiently large, the region under the gate changes from p-type to n-type
(due to the accumulation of attracted electrons) and provides a conducting path
between source and drain.
The thin layer of p-type silicon is said to be "inverted".
Three modes (see Fig 2.4)
o Accumulation mode (Vgs << Vt)
o Depletion mode (Vgs = Vt)
o Inversion mode (Vgs > Vt)
Electrically
At the source end, the full gate voltage is effective in inverting the
channel.
At the drain end, only the difference between the gate and drain voltages is
effective.
Pinch-off
Vds > Vgs − Vt => Vgd < Vt => Vd > Vg − Vt (Vg is not big
enough)
The channel no longer reaches the drain. (Fig 2.5 c)
Electrons leaving the channel are injected into the drain depletion region and are
subsequently accelerated toward the drain.
The voltage across the pinched-off channel remains at (Vgs − Vt)
=> "saturated" state, in which the channel current is controlled by Vg
and is independent of Vd
For fixed Vds and Vg, Ids is a function of:
Distance between drain & source
Channel width
Vt
Thickness of gate oxide
The dielectric constant of gate oxide
Carrier (hole or electron) mobility, µ
Conducting modes
"Cut-off" region: Ids ≈ 0, Vgs < Vt
"Nonsaturated" region: weak inversion region, where Ids
depends on Vg and Vd
"Saturated" region: the channel is strongly inverted and Ids is ideally
independent of Vds (pinch-off region)
"Avalanche breakdown" (punch-through): very high Vd => the gate has no
control over Ids
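The cut-off, nonsaturated, and saturated regions listed above can be illustrated with the first-order (Shockley) drain-current equations. The sketch below is only a hand-analysis model, not a calibrated device model: the function names, the default `vt` and `beta` values, and the units are assumptions made for illustration. The gain factor follows the form β = (µε/tox)(W/L) built from the parameters listed above.

```python
def gain_factor(mu, eps, tox, width, length):
    """beta = (mu * eps / tox) * (W / L); all arguments in consistent units."""
    return mu * eps / tox * (width / length)


def drain_current(vgs, vds, vt=1.0, beta=1.0):
    """First-order NMOS drain current (vt and beta are illustrative values)."""
    if vgs <= vt:
        return 0.0                      # cut-off: Ids is approximately 0
    vov = vgs - vt                      # gate overdrive, Vgs - Vt
    if vds < vov:
        # nonsaturated region: Ids depends on both Vgs and Vds
        return beta * (vov * vds - vds ** 2 / 2)
    # saturated region: Ids is ideally independent of Vds
    return beta * vov ** 2 / 2
```

With these illustrative defaults, `drain_current(3.0, 1.0)` falls in the nonsaturated region, while `drain_current(3.0, 5.0)` is saturated and no longer changes as Vds grows.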
Threshold voltage
A function of
Gate conductor material
Gate insulator material
Gate insulator thickness
Impurity at the silicon-insulator interface
Voltage between the source and the substrate Vsb
Temperature
a. −4 mV/°C for high substrate doping
b. −2 mV/°C for low substrate doping
SCALING
What will their performance be? Try to avoid the wet noodle effect. Wires
Limitations
Limitations to device scaling have been a concern since the days of 3 µm nMOS, 22
years ago (actually bipolar)
• Worries were
Short channel effect
Punchthrough
• drain control of current rather than gate
Hot electrons
Parasitic resistances
• Now worries are a little different
Oxide tunnel currents
Punchthrough
Parameter control
Parasitic resistances
Transistor scaling
Wire scaling
CMOS INVERTER
Turn-on conditions:
$V_{gs} = V_{in} \Rightarrow V_{g} < V_{DD} - |V_{tp}|$
$V_{ds} = V_{out} \Rightarrow V_{in} < V_{DD} - |V_{tp}|$
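$V_{inn} = V_{inp} \Rightarrow V_{out} = V_{DD}$
(2) Region B. $V_{tn} \le V_{in} \le V_{DD}/2$
p-device: linear mode
n-device: saturation mode
n: $I_{dsn} = \beta_n \dfrac{[V_{in} - V_{tn}]^2}{2}$, $\beta_n = \dfrac{\mu_n \varepsilon}{t_{ox}} \left( \dfrac{W_n}{L_n} \right)$
p: $V_{gs} = V_{in} - V_{DD}$, $V_{ds} = V_{out} - V_{DD}$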
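with $I_{dsp} = -I_{dsn}$
$$V_{in} = \frac{V_{DD} + V_{tp} + V_{tn}\sqrt{\beta_n/\beta_p}}{1 + \sqrt{\beta_n/\beta_p}}$$
$\Rightarrow V_{in} = \dfrac{V_{DD}}{2}$ (for $\beta_n = \beta_p$ and $V_{tn} = -V_{tp}$)
possible $V_{out}$: $V_{in} - V_{tn} < V_{out} < V_{in} - V_{tp}$
(N-MOS: $V_{gs} - V_{tn} < V_{ds}$; $V_{tp}$ is a negative value)
$\Rightarrow V_{in}$ is fixed at $V_{DD}/2$, $V_{out}$ varies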
(4) Region D. $V_{DD}/2 < V_{in} < V_{DD} + V_{tp}$
solve $I_{dsn} = -I_{dsp}$
$$V_{out} = (V_{in} - V_{tn}) - \sqrt{(V_{in} - V_{tn})^2 - \frac{\beta_p}{\beta_n}(V_{in} - V_{DD} - V_{tp})^2}$$
(5) Region E. $V_{in} \ge V_{DD} + V_{tp}$
→ p-device 'off' (n-device is in 'linear' mode)
→ $I_{dsp} = 0 \Rightarrow I_{dsn} = 0 \Rightarrow V_{out} = 0$
see Table 2.3 for summary
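The region-by-region results above can be stitched into a single transfer curve. The sketch below is a minimal illustration under stated assumptions: the supply and threshold values are invented for the example, Vtp is taken as negative, and the Region B expression is not quoted from the notes but obtained by equating the linear-mode p-device current with the saturated n-device current. In the both-saturated case the first-order model does not fix Vout, so the function returns `None` there.

```python
import math


def inverter_vout(vin, vdd=5.0, vtn=1.0, vtp=-1.0, beta_n=1.0, beta_p=1.0):
    """First-order CMOS inverter DC output voltage (illustrative parameters)."""
    ratio = beta_n / beta_p
    # Input voltage at which both devices are saturated.
    v_switch = (vdd + vtp + vtn * math.sqrt(ratio)) / (1 + math.sqrt(ratio))
    if vin <= vtn:
        return vdd                      # n-device off, output pulled to VDD
    if vin >= vdd + vtp:
        return 0.0                      # Region E: p-device off
    if vin < v_switch:
        # Region B: p linear, n saturated (derived here, see lead-in).
        disc = (vin - vdd - vtp) ** 2 - ratio * (vin - vtn) ** 2
        return (vin - vtp) + math.sqrt(disc)
    if vin > v_switch:
        # Region D: p saturated, n linear.
        disc = (vin - vtn) ** 2 - (vin - vdd - vtp) ** 2 / ratio
        return (vin - vtn) - math.sqrt(disc)
    return None                         # both saturated: Vout is not unique
```

With matched devices the curve is symmetric about VDD/2, so `inverter_vout(2.0) + inverter_vout(3.0)` comes out equal to VDD.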
[Figure: CMOS inverter (In, Out), NAND gate, and NOR gate (inputs A, B; output Out), each connected between Vdd and Gnd.]
UNIT-II
COMBINATIONAL LOGIC CIRCUITS
ELMORE’S CONSTANT
area above step response:
$$T^{elm}_k = \int_0^\infty \left(1 - v_k(t)\right) dt$$
$$T^{elm}_k = \int_0^\infty t\, v_k'(t)\, dt$$
$$T^{elm}_k \ge 0.5\, D_k$$
good approximation of $D_k$ only when $v_k$ is monotonically increasing
RC tree
Example:
$T^{elm}_3 = C_3(R_1 + R_2 + R_3) + C_2(R_1 + R_2) + C_1 R_1 + C_4 R_1 + C_5 R_1 + C_6 R_1$
Elmore delay:
hence can minimize area or power, subject to a bound on Elmore delay, using
geometric programming
commercial software (1980s): e.g., TILOS
not a good approximation of 50% delay when step response is not monotonic
(capacitive coupling between nodes, or non-diagonal C)
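The RC-tree example above follows a simple rule: each capacitor contributes its capacitance times the resistance shared between the source-to-target path and the source-to-capacitor path. The sketch below applies that rule. The tree topology (nodes 4, 5, and 6 hanging off node 1) is an assumption chosen to be consistent with the example expression, since the original figure is not available, and the function and argument names are hypothetical.

```python
def elmore_delay(parent, res, cap, target):
    """Elmore delay from the source to `target` in an RC tree.

    parent[n] is the upstream node of n (None if n connects to the source),
    res[n] is the resistor feeding node n, cap[n] is the capacitor at node n.
    """
    def upstream(node):
        # Nodes whose feeding resistor lies on the path source -> node.
        path = set()
        while node is not None:
            path.add(node)
            node = parent[node]
        return path

    target_path = upstream(target)
    delay = 0.0
    for node, c in cap.items():
        shared = target_path & upstream(node)   # resistors common to both paths
        delay += c * sum(res[n] for n in shared)
    return delay


# Assumed topology: source-R1-1-R2-2-R3-3, with 4 off node 1 and 5, 6 off node 4.
parent = {1: None, 2: 1, 3: 2, 4: 1, 5: 4, 6: 4}
res = {n: float(n) for n in parent}     # illustrative values R_n = n
cap = {n: float(n) for n in parent}     # illustrative values C_n = n
t3 = elmore_delay(parent, res, cap, 3)
```

For these illustrative values the result matches the hand expression C3(R1+R2+R3) + C2(R1+R2) + (C1+C4+C5+C6)R1.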
Pass-Transistor Logic
[Figure: pass-transistor switch network with inputs A and B and output Out.]
Effective resistance:
Shockley models have limited value
– Not accurate enough for modern transistors
– Too complicated for much hand analysis
Simplification: treat transistor as resistor
– Replace Ids(Vds, Vgs) with effective resistance R
Ids = Vds/R
– R averaged across switching of digital gate
Too inaccurate to predict current at any given time
– But good enough to predict RC delay.
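As a rough illustration of the effective-resistance simplification, the sketch below treats the switching transistor as a resistor R driving a load capacitance C and reports the time for a single-pole RC step response to reach 50%, which is ln(2)·RC. The component values and function name are assumptions for illustration only.

```python
import math


def rc_delay(r_eff, c_load):
    """50% delay of a first-order RC step response: ln(2) * R * C."""
    return math.log(2) * r_eff * c_load


# Illustrative values: 10 kOhm effective resistance, 1 fF load.
t_pd = rc_delay(10e3, 1e-15)            # seconds
```

This predicts delay only; as noted above, it is too inaccurate to give the current at any particular instant.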
Transmission gates
Static CMOS
Basic CMOS combinational circuits consist of,
Dynamic logic
There is another class of logic gates which relies on the use of a clock signal. This class
of circuit is known as dynamic circuits. The clock signal is used to divide the gate operation into
two halves. In the first half, the output node is precharged to a high or low logic state. In the
second half of the clock cycle, the circuit evaluates the correct output state.
When Ø is low, Z is charged high. When Ø is high, the n logic block evaluates its inputs and
conditionally discharges Z. This circuit adds series resistance to the pull-down n-channel
transistor; therefore the fall time is increased slightly. This circuit is dynamic because, during
evaluation, the output high level at Z is maintained by the stray capacitance at the output node.
If Ø stays high (i.e. the evaluation period) for a long time, Z may eventually discharge to a low
logic level.
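The precharge/evaluate behaviour described above can be captured in a small behavioural sketch. It is purely logical: charge leakage during a long evaluation phase is not modelled, and the function name and the two-input NAND pull-down network used in the example are assumptions.

```python
def dynamic_gate(phi, pull_down_conducts, z_prev):
    """One step of a precharge-high dynamic gate (behavioural only)."""
    if phi == 0:
        return 1                        # precharge: Z is charged high
    # Evaluate: Z is discharged only if the n logic block conducts;
    # otherwise the previous value is held on the output capacitance.
    return 0 if pull_down_conducts else z_prev


def dynamic_nand(phi, a, b, z_prev):
    """Example n logic block: series NMOS devices conduct when a AND b."""
    return dynamic_gate(phi, bool(a and b), z_prev)
```

A typical cycle calls the gate once with `phi = 0` and then with `phi = 1`, feeding the precharged value back in as `z_prev`.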
Power dissipation
In the past, the major concerns of the VLSI designer were area, performance, cost, and
reliability; power considerations were mostly of only secondary importance. In recent years,
however, this has begun to change, and power is increasingly being given comparable weight to
area and speed.
Several factors have contributed to this trend. Perhaps the primary driving factor has
been the remarkable success and growth of the class of personal computing devices (portable
desktops, audio- and video-based multimedia products) and wireless communications systems
(personal digital assistants and personal communicators), which demand high-speed computation
and complex functionality with low power consumption. There also exists strong pressure for
producers of high-end products to reduce their power consumption.
A nontrivial application program consumes millions of machine cycles, making it nearly
impossible to perform power estimation using the complete program at, say, the RT level. Most
of the reported results are based on power macro-modeling, an estimation approach which is
extensively used for behavioral and RT-level estimation.
In one approach, the power cost of a CPU module is characterized by estimating the average
capacitance that would switch when the given CPU module is activated.
In another, the switching activities on the address, instruction, and data buses are used to
estimate the power consumption of the microprocessor. A third, instruction-level model is based
on actual current measurements of some processors.
In this model, Energy_p is the total energy dissipation of the program, which is divided into three parts.
The first part is the summation of the base energy cost of each instruction: BCi is the base
energy cost and Ni is the number of times instruction i is executed.
The second part accounts for the circuit state: SCij is the energy cost when instruction i
is followed by j during the program execution. Finally, the third part accounts for the energy
contribution OCk of other instruction effects, such as stalls and cache misses, during the program
execution.
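The three-part model described above can be read as Energy_p = Σ BCi·Ni + Σ SCij·Nij + Σ OCk, where Nij (the number of times instruction i is immediately followed by j) is a symbol assumed here because the original equation is missing from the notes. The sketch below evaluates that sum for an instruction trace; the instruction names and cost values are invented for illustration.

```python
from collections import Counter


def program_energy(trace, base_cost, state_cost, other_costs=()):
    """Instruction-level energy estimate for an executed instruction trace."""
    # Part 1: base energy cost BC_i times execution count N_i.
    counts = Counter(trace)
    base = sum(base_cost[i] * n for i, n in counts.items())
    # Part 2: circuit-state cost SC_ij for each adjacent pair (i followed by j).
    pairs = Counter(zip(trace, trace[1:]))
    state = sum(state_cost.get(p, 0.0) * n for p, n in pairs.items())
    # Part 3: other effects OC_k (stalls, cache misses, ...).
    return base + state + sum(other_costs)


energy = program_energy(
    trace=["add", "mul", "add"],
    base_cost={"add": 1.0, "mul": 3.0},                    # hypothetical BC_i
    state_cost={("add", "mul"): 0.5, ("mul", "add"): 0.25},  # hypothetical SC_ij
    other_costs=[2.0],                                     # hypothetical OC_k
)
```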
The new instruction trace is, however, much shorter than the original one and can hence
be simulated on an RT-level description of the target microprocessor to provide the power
dissipation results quickly. Specifically, this approach consists of the following steps:
1. Perform architectural simulation of the target microprocessor under the instruction trace
of typical application programs.
2. Extract a characteristic profile, including parameters such as the instruction mix,
instruction/data cache miss rates, branch prediction miss rate, pipeline stalls, etc., for the
microprocessor.
3. Use mixed-integer linear programming and heuristic rules to gradually transform a
generic program template into a fully functional program.
4. Perform RT-level simulation of the target microprocessor under the instruction trace of
the new synthesized program.
In contrast to some of the RT-level methods, estimation techniques at the
behavioral level cannot rely on information about the gate-level structure of the design
components, and hence must resort to abstract notions of physical capacitance and switching
activity to predict power dissipation in the design.
UNIT-III
SEQUENTIAL LOGIC CIRCUITS
Sequential logic
Static memories use positive feedback to create a bistable circuit — a circuit having two stable
states that represent 0 and 1. The basic idea is shown in Figure 7.4a, which shows two inverters
connected in cascade along with a voltage-transfer characteristic typical of such a circuit. Also
plotted are the VTCs of the first inverter, that is, Vo1 versus Vi1, and the second inverter (Vo2
versus Vo1). The latter plot is rotated to accentuate that Vi2 = Vo1. Assume now that the output of
the second inverter Vo2 is connected to the input of the first Vi1, as shown by the dotted lines in
the figure.
[Figure 7.4: (a) two cascaded inverters and their VTCs, with Vo1 = Vi2 and Vo2 = Vi1; (b) combined VTC with Vi1 = Vo2, showing operation points A, C, and B.]
The resulting circuit has only three possible operation points (A, B, and C), as demonstrated on
the combined VTC. The following important conjecture is easily proven to be valid:
Under the condition that the gain of the inverter in the transient region is larger than 1,
only A and B are stable operation points, and C is a metastable operation point.
Suppose that the cross-coupled inverter pair is biased at point C. A small deviation from
this bias point, possibly caused by noise, is amplified and regenerated around the circuit loop.
This is a consequence of the gain around the loop being larger than 1. The effect is demonstrated
in Figure 7.5a. A small deviation δ is applied to Vi1 (biased in C). This deviation is amplified by
the gain of the inverter. The enlarged divergence is applied to the second inverter and amplified
once more. The bias point moves away from C until one of the operation points A or B is reached.
In conclusion, C is an unstable operation point. Every deviation (even the smallest one) causes
the operation point to run away from its original bias. The chance is indeed very small that the
cross-coupled inverter pair is biased at C and stays there. Operation points with this property are
termed metastable.
[Figure 7.5: combined VTCs with a deviation δ applied around (a) the metastable point C and (b) the stable points A and B.]
On the other hand, A and B are stable operation points, as demonstrated in Figure 7.5b. In
these points, the loop gain is much smaller than unity. Even a rather large deviation from the
operation point is reduced in size and disappears.
Hence the cross-coupling of two inverters results in a bistable circuit, that is, a circuit
with two stable states, each corresponding to a logic state. The circuit serves as a memory, storing
either a 1 or a 0 (corresponding to positions A and B).
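The regeneration argument above can be reproduced numerically. The sketch below models each inverter with a smooth tanh-shaped VTC whose gain at the midpoint exceeds 1, then follows a starting voltage around the loop repeatedly. The VTC shape, the gain of 5, and the normalized 1 V supply are assumptions for illustration, not values from the notes.

```python
import math


def inverter(v, vdd=1.0, gain=5.0):
    """Idealized inverter VTC with slope -gain at the midpoint (assumed shape)."""
    half = vdd / 2
    return half * (1 - math.tanh(gain * (v - half) / half))


def settle(vi1, loops=50):
    """Follow Vi1 around the two-inverter loop `loops` times."""
    for _ in range(loops):
        vi1 = inverter(inverter(vi1))
    return vi1


at_c = settle(0.5)        # exactly at the metastable point C: stays put
above = settle(0.501)     # tiny positive deviation: regenerates to one rail
below = settle(0.499)     # tiny negative deviation: regenerates to the other
```

Starting exactly at the midpoint leaves the loop there, while a deviation of one part in a thousand is amplified until the loop rests near one of the two rails, which is the behaviour attributed to points C, A, and B above.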
In order to change the stored value, we must be able to bring the circuit from state A to B
and vice-versa. Since the precondition for stability is that the loop gain G is smaller than unity,
we can achieve this by making A (or B) temporarily unstable by increasing G to a value larger
than 1. This is generally done by applying a trigger pulse at Vi1 or Vi2. For instance, assume that
the system is in position A (Vi1 = 0, Vi2 = 1). Forcing Vi1 to 1 causes both inverters to be on
simultaneously for a short time and the loop gain G to be larger than 1. The positive feedback
regenerates the effect of the trigger pulse, and the circuit moves to the other state (B in this case).
The width of the trigger pulse need be only a little larger than the total propagation delay around
the circuit loop, which is twice the average propagation delay of the inverters.
In summary, a bistable circuit has two stable states. In absence of any triggering, the
circuit remains in a single state (assuming that the power supply remains applied to the circuit),
and hence remembers a value. A trigger pulse must be applied to change the state of the circuit.
Another common name for a bistable circuit is flip-flop (unfortunately, an edge-triggered register
is also referred to as a flip-flop).
There are many approaches for constructing latches. One very common technique
involves the use of transmission gate multiplexers. Multiplexer-based latches can provide
similar functionality to the SR latch, but have the important added advantage that the sizing of
devices only affects performance and is not critical to the functionality.
Figure shows an implementation of static positive and negative latches based on multiplexers.
For a negative latch, when the clock signal is low, the input 0 of the multiplexer is selected, and
the D input is passed to the output. When the clock signal is high, the input 1 of the
multiplexer, which connects to the output of the latch, is selected. The feedback holds the
output stable while the clock signal is high. Similarly in the positive latch, the D input is
selected when clock is high, and the output is held (using feedback) when clock is low.
[Figure: Negative and positive latches based on multiplexers.]
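The multiplexer view of the two latches can be written as a one-line behavioural model: the clock selects either the D input (transparent) or the fed-back output (hold). The function name and argument order below are assumptions made for this sketch.

```python
def mux_latch(d, clk, q_prev, positive=True):
    """Behavioural multiplexer-based latch.

    Positive latch: D is selected when clk is 1, the output is held when clk is 0.
    Negative latch: D is selected when clk is 0, the output is held when clk is 1.
    """
    select_d = (clk == 1) if positive else (clk == 0)
    return d if select_d else q_prev
```

For example, `mux_latch(1, 0, 0, positive=False)` passes D through because the negative latch is transparent while the clock is low.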
A transistor level implementation of a positive latch based on multiplexers is shown
below. When CLK is high, the bottom transmission gate is on and the latch is transparent - that
is, the D input is copied to the Q output. During this phase, the feedback loop is open since the
top transmission gate is off. Unlike the SR FF, the feedback does not have to be overridden to
write the memory and hence sizing of transistors is not critical for realizing correct
functionality. The number of transistors that the clock touches is important since it has an
activity factor of 1. This particular latch implementation is not particularly efficient from this
metric as it presents a load of 4 transistors to the CLK signal.
[Figure: Transistor-level implementation of a positive latch based on multiplexers.]
It is possible to reduce the clock load to two transistors by implementing the multiplexers
using NMOS-only pass transistors, as shown in Figure 7.13. The advantage of this
approach is the reduced clock load of only two NMOS devices. When CLK is high, the latch
samples the D input, while a low clock signal enables the feedback loop and puts the latch in
the hold mode. While attractive for its simplicity, the use of NMOS-only pass transistors results
in the passing of a degraded high voltage of VDD-VTn to the input of the first inverter. This
impacts both noise margin and the switching performance, especially in the case of low values
of VDD and high values of VTn. It also causes static power dissipation in the first inverter, as already
pointed out in Chapter 6. Since the maximum input voltage to the inverter equals VDD-VTn, the
PMOS device of the inverter is never turned off, resulting in a static current flow.
[Figure: Transistor-level implementation of a master-slave positive edge-triggered register using multiplexers (inverters I1 to I6, transmission gates T1 to T4, internal node QM, input D, output Q, clock CLK).]
The hold time represents the time that the input must be held stable after the rising edge
of the clock. In this case, the transmission gate T1 turns off when clock goes high and therefore
any changes in the D-input after clock going high are not seen by the input. Therefore, the hold
time is 0.
As mentioned earlier, the drawback of the transmission gate register is the high
capacitive load presented to the clock signal. The clock load per register is important since it
directly impacts the power dissipation of the clock network. Ignoring the overhead required to
invert the clock signal (since the buffer inverter overhead can be amortized over multiple
register bits), each register has a clock load of 8 transistors. One approach to reduce the clock
load at the cost of robustness is to make the circuit ratioed. Figure 7.18 shows that the feedback
transmission gate can be eliminated by directly cross coupling the inverters.
[Figure 7.18: Reduced-clock-load master-slave register (transmission gates T1 and T2, inverters I1 to I4, input D, output Q).]
The penalty for the reduced clock load is increased design complexity. The transmission
gate (T1) and its source driver must overpower the feedback inverter (I2) to switch the
state of the cross-coupled inverter. The sizing requirements for the transmission gates can be
derived using a similar analysis as performed for the SR flip-flop. The input to the inverter I1
must be brought below its switching threshold in order to make a transition. If minimum-sized
devices are to be used in the transmission gates, it is essential that the transistors of inverter I2
be made even weaker. This can be accomplished by making their channel lengths larger
than minimum. Using minimum or close-to-minimum-size devices in the transmission gates is
desirable to reduce the power dissipation in the latches and the clock distribution network.
Another problem with this scheme is reverse conduction; that is, the second stage
can affect the state of the first latch. When the slave stage is on (Figure 7.19), it is possible for
the combination of T2 and I4 to influence the data stored in the I1-I2 latch. As long as I4 is a weak
device, this is fortunately not a major problem.
[Figure 7.19: Reverse conduction when the slave stage is on.]
So far, we have assumed that $\overline{CLK}$ is a perfect inversion of CLK, or in other words, that
the delay of the generating inverter is zero. Even if this were possible, it would still not be a
good assumption. Variations can exist in the wires used to route the two clock signals, or the
load capacitances can vary based on the data stored in the connecting latches. This effect, known as
clock skew, is a major problem, and causes the two clock signals to overlap, as is shown in the
figure. Clock overlap can cause two types of failures, as illustrated for the NMOS-only
negative master-slave register.
[Figure: (a) Schematic diagram of the NMOS-only negative master-slave register, with internal nodes X, A, and B.]
When the clock goes high, the slave stage should stop sampling the master stage output
and go into a hold mode. However, since CLK and $\overline{CLK}$ are both high for a short period of time
(the overlap period), both sampling pass transistors conduct and there is a direct path from the D
input to the Q output. As a result, data at the output can change on the rising edge of the clock,
which is undesired for a negative edge-triggered register. This is known as a race condition, in
which the value of the output Q is a function of whether the input D arrives at node X before or
after the falling edge of CLK. If node X is sampled in the metastable state, the output will switch
to a value determined by noise in the system.
The primary advantage of the multiplexer-based register is that the feedback loop is
open during the sampling period, and therefore sizing of devices is not critical to functionality.
However, if there is clock overlap between CLK and $\overline{CLK}$, node A can be driven by both D and
B, resulting in an undefined state.
Those problems can be avoided by using two non-overlapping clocks PHI1 and PHI2
instead (see figure), and by keeping the nonoverlap time tnon_overlap between the clocks large enough such
that no overlap occurs even in the presence of clock-routing delays.
During the nonoverlap time, the FF is in the high-impedance state—the feedback loop
is open, the loop gain is zero, and the input is disconnected. Leakage will destroy the state if this
condition holds for too long a time. Hence the name pseudostatic: the register employs a
combination of static and dynamic storage approaches depending upon the state of the clock.
The scaling of supply voltages is critical for low power operation. Unfortunately, certain
latch structures don't function at reduced supply voltages. For example, without the scaling of
device thresholds, NMOS-only pass transistors (e.g., Figure 7.21) don't scale well with supply
voltage due to their inherent threshold drop. At very low power supply voltages, the input to the
inverter cannot be raised above the switching threshold, resulting in incorrect evaluation. Even
with the use of transmission gates, performance degrades significantly at reduced supply
voltages.
Scaling to low supply voltages hence requires the use of reduced threshold devices.
However, this has the negative effect of exponentially increasing the sub-threshold leakage
power as discussed in Chapter 6. When the registers are constantly accessed, the leakage energy
is typically insignificant compared to the switching power. However, with the use of conditional
clocks, it is possible that registers are idle for extended periods and the leakage energy expended
by registers can be quite significant.
Many solutions are being explored to address the problem of high leakage during idle
periods. One approach for this involves the use of Multiple Threshold devices as shown in
Figure 7.23 [Mutoh95]. Only the negative latch is shown here. The shaded inverters and
transmission gates are implemented in low-threshold devices. The low-threshold inverters are
gated using high threshold devices to eliminate leakage.
During normal mode of operation, the sleep devices are turned on. When clock is low,
the D input is sampled and propagates to the output. When clock is high, the latch is in the hold
mode. The feedback transmission gate conducts and the cross-coupled feedback is enabled.
Note there is an extra inverter, needed for storage of state when the latch is in the sleep state.
During idle mode, the high threshold devices in series with the low threshold inverter are turned
off (the SLEEP signal is high), eliminating leakage. It is assumed that clock is in the high state
when the latch is in the sleep state. The feedback low-threshold transmission gate is turned on
and the cross-coupled high-threshold devices maintain the state of the latch.
Storage in a static sequential circuit relies on the concept that a cross-coupled inverter
pair produces a bistable element and can thus be used to memorize binary values. This approach
has the useful property that a stored value remains valid as long as the supply voltage is applied
to the circuit, hence the name static. The major disadvantage of the static gate, however, is its
complexity. When registers are used in computational structures that are constantly clocked
such as pipelined datapath, the requirement that the memory should hold state for extended
periods of time can be significantly relaxed.
This results in a class of circuits based on temporary storage of charge on parasitic
capacitors. The principle is exactly identical to the one used in dynamic logic — charge stored
on a capacitor can be used to represent a logic signal. The absence of charge denotes a 0, while
its presence stands for a stored 1. No capacitor is ideal, unfortunately, and some charge leakage
is always present. A stored value can hence only be kept for a limited amount of time, typically
in the range of milliseconds. If one wants to preserve signal integrity, a periodic refresh of its
value is necessary. Hence the name dynamic storage. Reading the value of the stored signal
from a capacitor without disrupting the charge requires the availability of a device with a high
input impedance.
The set-up time of this circuit is simply the delay of the transmission gate, and corresponds
to the time it takes node 1 to sample the D input. The hold time is approximately zero,
since the transmission gate is turned off on the clock edge and further input changes are
ignored. The propagation delay (tc-q) is equal to two inverter delays plus the delay of the
transmission gate T2.
One important consideration for such a dynamic register is that the storage nodes (i.e.,
the state) have to be refreshed at periodic intervals to prevent a loss due to charge leakage, caused by
diode leakage as well as sub-threshold currents. In datapath circuits, the refresh rate is not an
issue since the registers are periodically clocked, and the storage nodes are constantly updated.
Clock overlap is an important concern for this register. Consider the clock waveforms
shown in Figure 7.25. During the 0-0 overlap period, the NMOS of T1 and the PMOS of T2 are
simultaneously on, creating a direct path for data to flow from the D input of the register to the
Q output. This is known as a race condition. The output Q can change on the falling edge if the
overlap period is large, which is obviously an undesirable effect for a positive edge-triggered register.
The same is true for the 1-1 overlap region, where an input-output path exists through the
PMOS of T1 and the NMOS of T2. The latter case is taken care of by enforcing a hold time
constraint. That is, the data must be stable during the high-high overlap period. The former
situation (0-0 overlap) can be addressed by making sure that there is enough delay between the
D input and node 2, ensuring that new data sampled by the master stage does not propagate
through to the slave stage. Generally the built-in single inverter delay should be sufficient, and
the overlap period constraints are given as:
$$t_{overlap0-0} < t_{T1} + t_{I1} + t_{T2}$$
$$t_{hold} > t_{overlap1-1}$$
[Figure 7.25: Impact of non-overlapping clocks, showing the (0,0) and (1,1) overlap periods of CLK and $\overline{CLK}$.]
C2MOS Dynamic Register: A Clock Skew Insensitive Approach
The C2MOS Register
The C2MOS (Clocked CMOS) register [Suzuki73] is an ingenious positive edge-triggered
register, based on the master-slave concept, that is insensitive to clock overlap.
The register operates in two phases.
CLK = 0 ($\overline{CLK}$ = 1): The first tri-state driver is turned on, and the master stage acts as an
inverter sampling the inverted version of D on the internal node X. The master stage is
in the evaluation mode. Meanwhile, the slave section is in a high-impedance mode, or in
a hold mode. Both transistors M7 and M8 are off, decoupling the output from the input.
The output Q retains its previous value stored on the output capacitor CL2.
The roles are reversed when CLK = 1: The master stage section is in hold mode (M3-M4
off), while the second section evaluates (M7-M8 on). The value stored on CL1 propagates
to the output node through the slave stage, which acts as an inverter.
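The two phases described above can be mimicked with a small behavioural model: while CLK is 0 the master stores the inverted D on node X and the slave holds; while CLK is 1 the master holds and the slave drives Q with the inverse of X. The class and attribute names are hypothetical, and the model ignores charge leakage and timing.

```python
class C2MOSRegister:
    """Behavioural two-phase model of a C2MOS master-slave register."""

    def __init__(self):
        self.x = 0   # internal node X (value held on CL1)
        self.q = 0   # output Q (value held on CL2)

    def step(self, clk, d):
        if clk == 0:
            # Master evaluates: X samples the inverted D; slave holds Q.
            self.x = 1 - d
        else:
            # Master holds X; slave evaluates and inverts X onto Q.
            self.q = 1 - self.x
        return self.q


reg = C2MOSRegister()
outputs = [reg.step(clk, d) for clk, d in [(0, 1), (1, 0), (1, 0), (0, 0), (1, 1)]]
```

In this trace Q takes the value D had just before each rising clock edge, and changes of D while CLK is high are ignored.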
The overall circuit operates as a positive edge-triggered master-slave register — very similar to
the transmission-gate based register presented earlier. However, there is an important difference:
Pipelining is a popular design technique often used to accelerate the operation of the
datapaths in digital processors. The idea is easily explained with the example of Figure 7.40a.
The goal of the presented circuit is to compute log(|a + b|), where both a and b represent streams
of numbers, that is, the computation must be performed on a large set of input values. The
minimal clock period Tmin necessary to ensure correct evaluation is given as:
$$T_{min} = t_{c-q} + t_{pd,logic} + t_{su} \quad (7.6)$$
where tc-q and tsu are the propagation delay and the set-up time of the register,
respectively.
We assume that the registers are edge-triggered D registers. The term tpd,logic stands for
the worst-case delay path through the combinatorial network, which consists of the adder,
absolute value, and logarithm functions. In conventional systems (that don’t push the edge of
technology), the latter delay is generally much larger than the delays associated with the
registers and dominates the circuit performance. Assume that each logic module has an equal
propagation delay. We note that each logic module is then active for only 1/3 of the clock
period (if the delay of the register is ignored). For example, the adder unit is active during the
first third of the period and remains idle (that is, it does no useful computation) during the
other 2/3 of the period. Pipelining is a technique to improve the resource utilization, and
increase the functional throughput. Assume that we introduce registers between the logic blocks,
as shown in Figure 7.40b. This causes the computation for one set of input data to spread over a
number of clock periods, as shown in Table 7.1. The result for the data set (a1, b1) only appears
at the output after three clock periods. At that time,
the circuit has already performed parts of the computations for the next data sets, (a2, b2) and
(a3, b3). The computation is performed in an assembly-line fashion, hence the name pipeline.
[Figure 7.40: Datapath for the computation of log(|a + b|); (b) pipelined version.]
Table 7.1
Clock period   Adder      Absolute value   Logarithm
1              a1 + b1
2              a2 + b2    |a1 + b1|
3              a3 + b3    |a2 + b2|        log(|a1 + b1|)
4              a4 + b4    |a3 + b3|        log(|a2 + b2|)
The advantage of pipelined operation becomes apparent when examining the minimum clock
period of the modified circuit. The combinational circuit block has been partitioned into three
sections, each of which has a smaller propagation delay than the original function. This
effectively reduces the value of the minimum allowable clock period:
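A small numerical sketch makes the reduction concrete. It assumes, as is usual for edge-triggered pipelines, that the pipelined clock period is set by the slowest single stage plus the register overhead, while the unpipelined period must cover all stages in series. The delay values (in picoseconds) and function names are invented for the example.

```python
def t_min_unpipelined(t_cq, t_su, stage_delays):
    """Clock period when all logic stages sit between one pair of registers."""
    return t_cq + sum(stage_delays) + t_su


def t_min_pipelined(t_cq, t_su, stage_delays):
    """Clock period when a register follows every logic stage."""
    return t_cq + max(stage_delays) + t_su


stages = [300, 300, 300]            # adder, absolute value, logarithm (ps, assumed)
before = t_min_unpipelined(50, 30, stages)
after = t_min_pipelined(50, 30, stages)
```

With equal stage delays the logic contribution shrinks to one third, but the register overhead (t_c-q plus t_su) is paid in every stage, so the overall speed-up is somewhat less than a factor of three.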
ensures correct operation. The value stored on C2 at the end of the CLK low phase is the result of
passing the previous input (stored on the falling edge of CLK on C1) through the logic function F. When overlap
exists between CLK and $\overline{CLK}$, the next input is already being applied to F, and its effect might propagate to C2
before CLK goes low (assuming that the contamination delay of F is small).
Another class of logic circuits are sequential circuits. These circuits are two-
valued networks in which the outputs at any instant are dependent not only upon the
inputs present at that instant but also upon the past history (sequence) of inputs.
SYNCHRONIZERS
A synchronizer is a circuit that accepts an input that can change at arbitrary times and
produces an output aligned to the synchronizer clock.
(i) A synchronizer accepts an input D and a clock; it produces an output Q that ought to be valid
some bounded delay after the clock.
(ii) A synchronizer can be built from a pair of flip-flops.
Metastability
DC transfer characteristics of the two inverters.
Arbiters.
UNIT-IV
DESIGNING ARITHMETIC BUILDING BLOCKS
$c_{i+1} = g_i + p_i c_i$
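The carry recurrence above can be exercised with a short bit-level sketch. It assumes the usual definitions of the generate and propagate signals, g_i = a_i AND b_i and p_i = a_i XOR b_i, which are not spelled out in the notes, and it evaluates the recurrence bit by bit rather than expanding it into parallel lookahead logic. The function name is hypothetical.

```python
def add_with_generate_propagate(a, b, width=4, c0=0):
    """Add two unsigned integers using c[i+1] = g[i] | (p[i] & c[i])."""
    carry = c0
    total = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        g = ai & bi                     # generate (assumed definition)
        p = ai ^ bi                     # propagate (assumed definition)
        total |= (p ^ carry) << i       # sum bit s_i = p_i XOR c_i
        carry = g | (p & carry)         # carry recurrence
    return total, carry                 # (sum modulo 2**width, carry out)
```

For example, `add_with_generate_propagate(15, 1)` wraps around to a zero sum with a carry out of 1.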
Array multiplier
1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other,
yielding n^2 results for n-bit arguments. Depending on the position of the multiplied bits, the
wires carry different weights: the product of bit i and bit j has weight 2^(i+j), so, for
example, the wire carrying the result of a2 b3 has weight 32.
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires into two numbers, and add them with a conventional adder.
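The three steps above can be sketched at the bit level as follows. This is an illustrative model rather than a description of any particular hardware design: the function name tree_multiply and the way wires are grouped within a column are assumptions, and the final "conventional adder" is simply Python integer addition.

```python
from collections import defaultdict

def tree_multiply(a, b, n=4):
    """Multiply two n-bit numbers by partial-product reduction."""
    # Step 1: AND every bit pair; column k collects the wires of weight 2**k.
    columns = defaultdict(list)
    for i in range(n):
        for j in range(n):
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Step 2: apply layers of full and half adders until every column
    # holds at most two wires.
    while any(len(wires) > 2 for wires in columns.values()):
        reduced = defaultdict(list)
        for k, wires in columns.items():
            for start in range(0, len(wires), 3):
                group = wires[start:start + 3]
                if len(group) == 3:      # full adder: 3 wires -> sum + carry
                    x, y, z = group
                    reduced[k].append(x ^ y ^ z)
                    reduced[k + 1].append((x & y) | (x & z) | (y & z))
                elif len(group) == 2:    # half adder: 2 wires -> sum + carry
                    x, y = group
                    reduced[k].append(x ^ y)
                    reduced[k + 1].append(x & y)
                else:                    # a single wire passes through
                    reduced[k].append(group[0])
        columns = reduced

    # Step 3: group the remaining wires into two numbers and add them.
    first = sum(wires[0] << k for k, wires in columns.items())
    second = sum(wires[1] << k for k, wires in columns.items() if len(wires) == 2)
    return first + second

if __name__ == "__main__":
    print(tree_multiply(13, 11))
```

Every full or half adder preserves the weighted sum of its input wires, which is why the final addition yields the product.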
• Pipelining
Barrel shifter
4x4 barrel shifters
UNIT-V
IMPLEMENTATION STRATEGIES
Design flow
Design methodology
Standard cell libraries are required by almost all CAD tools for chip design. Standard cell
libraries contain the primitive cells required for digital design. However, more complex cells that
have been specially optimized can also be included.
The main purpose of the CAD tools is to implement the so-called RTL-to-GDS flow.
The input to the design process is, in most cases, the circuit description at the register-transfer
level (RTL). The final output of the design process is the full chip layout, mostly in the GDSII
(gds2) format. Producing a functionally correct design that meets all the specifications and
constraints requires a combination of different tools in the design flow.
These tools require specific information in different formats for each of the cells in the
standard cell library provided to them for the design.
• Abstract View (Cadence Abstract Generator, LEF)
LEF: Contains information about each cell as well as technology information.
• Timing, power and parasitics (TLF or LIB)
Transistor and interconnect parasitics are extracted using Cadence or other extraction tools.
Spice or Spectre netlist is generated and detailed timing simulations are performed. Power
information can also be generated during these simulations. Data is formatted into a TLF or LIB
file including process, temperature and supply voltage variations. Logical information for each
cell is also contained in this file.
Interconnection framework
Granularity and interconnection structure have caused a split in the industry
FPGA
Fine grained
Variable length interconnect segments
Timing in general is not predictable; timing is extracted after placement and routing
Routing
Global routing
The global router performs a coarse route to determine, for each connection, the
minimum distance path through routing channels that it has to go through. If the net to be routed
has more than two terminals, the global router will break the net into a set of two-terminal
connections and route each set independently. The global router considers for each connection
multiple ways of routing it and chooses the one that passes through the least congested routing
channels. By keeping track of the usage of each routing channel, congestion is avoided; and the
principal objective of the global router, balancing the usage of the routing channels, is achieved.
Once all connections have been coarse routed, the solution is optimized by ripping up and
rerouting each connection a small number of times. After that, the final solution is passed to the
detailed router.
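The coarse routing step can be illustrated with a deliberately simplified sketch. Everything in it is an assumption made for illustration: routing channels are modelled as cells of a grid, a multi-terminal net is broken into two-terminal connections by chaining consecutive terminals, only the two L-shaped routes are considered for each connection, and congestion is measured as the summed usage along a route. The rip-up and reroute pass is not modelled.

```python
from collections import Counter

def span(a, b):
    """Integers from a to b inclusive, in either direction."""
    step = 1 if b >= a else -1
    return list(range(a, b + step, step))

def l_routes(src, dst):
    """Return the two L-shaped candidate routes as lists of grid cells."""
    (x0, y0), (x1, y1) = src, dst
    h_first = [(x, y0) for x in span(x0, x1)] + [(x1, y) for y in span(y0, y1)[1:]]
    v_first = [(x0, y) for y in span(y0, y1)] + [(x, y1) for x in span(x0, x1)[1:]]
    return [h_first, v_first]

def global_route(nets):
    """nets maps a net name to its list of terminals (x, y)."""
    usage = Counter()            # how many connections use each channel cell
    routes = {}
    for name, terminals in nets.items():
        routes[name] = []
        # Break the net into two-terminal connections.
        for src, dst in zip(terminals, terminals[1:]):
            # Choose the candidate passing through the least congested cells.
            best = min(l_routes(src, dst),
                       key=lambda route: sum(usage[cell] for cell in route))
            usage.update(best)   # keep track of channel usage
            routes[name].append(best)
    return routes, usage

if __name__ == "__main__":
    routes, usage = global_route({"n1": [(0, 0), (2, 0)],
                                  "n2": [(0, 0), (2, 2)]})
    print(routes["n2"][0])
```

In the usage example the second net avoids the channel cells already occupied by the first net and takes the vertical-first route instead.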
Detail routing
The detail router determines for each two point connection the specific wiring segments
to use in the routing channel assigned by the global router. To do this, detail routing algorithms
construct a directed graph from the routing resources to represent the available connection
between wires, C blocks, S blocks and logic blocks within the FPGA. The search performed on
this directed graph is usually based on Dijkstra’s algorithm to find the shortest path between two
nodes. The paths are labeled according to a cost function that takes into account the usage of
each wire segment and the distance of the interconnecting points. The distance is estimated by
calculating the wire length in the bounding box of the interconnecting points using a Manhattan
metric. Most of the routers relax the bounding box constraints and allow searching for possible
solutions in the surrounding routing channels of the bounding box. This is done to avoid
subsequent iterations of ripping out and re-routing if the solution lies on the near outside of the
bounding box.
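The graph search described above can be sketched with a standard implementation of Dijkstra's algorithm on a directed graph. The node names and edge costs in the usage example are hypothetical; in a real router each edge cost would come from the cost function (segment usage and estimated distance), which is not modelled here.

```python
import heapq

def shortest_route(graph, source, target):
    """Dijkstra's algorithm on a directed graph.

    graph maps each node to a list of (neighbour, cost) pairs with
    non-negative costs. Returns (path, total_cost), or (None, inf)
    if the target cannot be reached.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, float("inf")):
            continue                      # stale heap entry, skip it
        for nxt, cost in graph.get(node, []):
            nd = d + cost
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None, float("inf")
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], dist[target]

if __name__ == "__main__":
    # Hypothetical routing-resource graph: logic block pins, wires, a switch.
    graph = {
        "LB_A.out": [("wire1", 1), ("wire2", 1)],
        "wire1": [("LB_B.in", 5)],        # congested segment, high cost
        "wire2": [("switch", 1)],
        "switch": [("LB_B.in", 1)],
    }
    print(shortest_route(graph, "LB_A.out", "LB_B.in"))
```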
Routing resources
The programmable routing in an FPGA consists of two categories: (1) routing within
each Logic Block/Logic Cluster, and (2) routing between the Logic Blocks/Logic Clusters.
Figure 2.5 shows a detailed view of the routing for a single tile. Normally, an FPGA is created
by replication of such a tile (a tile consists of one Logic Block and its associated routing).
The programmable routing within each Logic Block consists of the Interconnect Matrix. The
programmable routing between the Logic Blocks consists of fixed metal tracks, Switch Blocks,
Connection Blocks, and the programmable switches. The fixed metal tracks run horizontally and
vertically, and are organized in channels; each channel contains the same number of tracks for
the architecture that we investigated. A Switch Block occurs at each intersection between
horizontal and vertical routing channels, and defines all possible connections between these
channels. Three different topologies for Switch Blocks have been proposed in previous work:
the Disjoint Switch Block, the Universal Switch Block and the Wilton Switch Block. The
Connection Block defines all the possible connections from a horizontal or vertical channel to a
neighboring logic block.
The connections in the switch blocks and connection blocks are made by programmable
switches. Part of the programmable routing also lies within each logic block, determining how
different components are connected within the logic block. This Island-Style architecture is a
very general version of most commercial architectures from Altera and Xilinx. Most recent
commercial FPGAs also incorporate other features on chip, such as digital phase-locked loops and
memories.
The architectural features that this thesis intends to explore are sufficiently represented
in such a simplified island-style architecture, and hence this architecture is assumed in all of our
following work. A programmable switch consists of a pass transistor controlled by a static
random access memory cell (in which case, the device is called an SRAM-based FPGA), or an
anti-fuse (such devices are referred to as anti-fuse FPGAs), or a non-volatile memory cell (such
devices are referred to as floating gate devices). Since SRAM-based FPGAs employ static
random access memory (SRAM) cells to control the programmable switches, they can be
reprogrammed by the end user as many times as required and are volatile. Anti-fuse based
FPGAs, on the other hand, can only be programmed once and are non-volatile. The devices
employing floating gate technology are also non-volatile and can be reprogrammed. Of the three
categories, SRAM-based FPGAs are most widely used and hence we will limit our discussion
and investigations to SRAM-based devices.
The ‘W’ represents the number of parallel tracks contained in each channel. A track is a
piece of metal traversing one or more logic blocks (for example ‘x’ in Figure 2.5). For clustered
FPGAs the logic block consists of more than one BLE; the figure shows a logic block with ‘N’
such BLEs. The Interconnect Matrix shown in the logic block determines all possible
connections from the logic block inputs ‘I’ to each of N*k BLE inputs.
This Interconnect Matrix is normally implemented using binary tree multiplexers. The
number of logic block inputs ‘I’, and the number of feedback paths determine the size of these
multiplexers. The feedback paths allow the local connections to be made from within the logic
block. Betz [1] showed that I = 2N+2 (for N less than or equal to 16) is sufficient for good logic
utilization. The ‘Logic Utilization’ is defined as the average number of BLEs per logic block
that a circuit is able to use divided by the total number of BLEs per logic block, N. The number
of tracks in each channel to which each logic block input and output pin can connect is called
the connection block flexibility, Fc.
For FPGAs employing the Disjoint switch block topology, the best area efficiency occurs at
Fc = 0.5W. Table 2.1 shows a summary of the terminology which has been described above, and
the range of values which were used in this thesis.
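The relations between these architecture parameters can be collected in a short helper. This is a sketch only: the function names are illustrative, and the multiplexer size assumes that every BLE input can select from all I logic block inputs plus one feedback path per BLE, which the text above does not state explicitly.

```python
def cluster_parameters(n_bles, k, channel_width):
    """Derive logic block parameters from N, k and W.

    n_bles        -- N, the number of BLEs per logic block (at most 16)
    k             -- number of inputs of each BLE
    channel_width -- W, the number of tracks per routing channel
    """
    if not 1 <= n_bles <= 16:
        raise ValueError("I = 2N + 2 is quoted for N <= 16 only")
    inputs = 2 * n_bles + 2                  # I = 2N + 2
    return {
        "I": inputs,
        "ble_inputs": n_bles * k,            # N*k inputs fed by the matrix
        "mux_size": inputs + n_bles,         # assumption: I inputs + N feedbacks
        "Fc": 0.5 * channel_width,           # Fc = 0.5W (Disjoint switch block)
    }

def logic_utilization(avg_bles_used, n_bles):
    """Average number of BLEs used per logic block divided by N."""
    return avg_bles_used / n_bles

if __name__ == "__main__":
    print(cluster_parameters(n_bles=4, k=4, channel_width=20))
    print(logic_utilization(3.0, 4))
```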
In the original Xilinx XC2000 and XC3000 architectures, a very simple architecture
was employed in which most wire segments spanned only the length or the width of a logic
block. In order to improve the speed of an FPGA, El-Gamal [6] introduced the idea of
‘Segmented FPGAs’. The main idea is to provide segments that span multiple logic blocks for
connections that travel long distances. Figure 2.5 shows such a track ‘y’ (in the vertical channel)
which does not terminate at the switch block. Example segments that span 1, 2 and 3
logic blocks are referred to as segments of length 1, 2 and 3. A ‘long’ segment is that
segment which spans all the logic blocks in a given architecture. A similar approach is used in
most of Altera’s devices where long segments are used to carry signals over larger distances
across the chip.
FPGAs employing very long or very short segment wires were inferior in performance
to those employing medium sized segments (those traversing 4 to 8 logic blocks). Paul Chow [7]
introduced another important routing architectural feature of segmented routing architectures
besides the segment distribution called ‘Segment Population’. A segment is called internally
populated if it is possible to make connections from the middle of a segment to a logic block or
to other routing segments. The advantage of unpopulated segments is that they have less
parasitic switch capacitance connected to them, which makes them faster. The disadvantage
is that the reduction in routing flexibility (without population there cannot be internal fanout)
may result in the need for more tracks and thus a loss of logic density.