1. INTRODUCTION
1.1 INTRODUCTION:
In VLSI design, timing, power and area are the three major constraints, and optimization therefore targets these three factors: area, power and timing (speed). Area optimization means reducing the amount of silicon the logic occupies on the die. This is done in both the front end and the back end of the design flow. In the front end, writing simplified Boolean expressions and removing unused states minimizes gate/transistor utilization. Partitioning, floor planning, placement and routing are performed in the back end of the design by the CAD tool, which has a specific algorithm for each step to produce an area-efficient design. Power optimization aims to reduce the power dissipation of the design, which is governed by operating voltage, operating frequency and switching activity. The first two factors are largely fixed by the design constraints, but switching activity is a parameter that varies dynamically, depending on how the logic is designed and on the input vectors. Timing optimization refers to meeting the user constraints efficiently, without any violation, or otherwise improving the performance of the design. High-performance designs are achieved by proper placement, routing and sizing of elements. Optimization can also be approached in other ways, for example by merging memory elements instead of sizing them.
Multiplication in hardware can be implemented in two ways: by using more hardware to achieve fast execution, or by using less hardware and accepting slower execution. The area and speed of a multiplier are therefore an important issue: an increase in speed usually results in a larger area, and vice versa. Multipliers play a vital role in most high-performance systems, and the performance of a system depends to a great extent on the performance of its multiplier, so multipliers should be fast and consume little area and hardware. This motivated us to study and review multipliers with respect to speed, power consumption and area occupied. Three well-known multipliers and their drawbacks are discussed below.
Wallace Tree Multiplier:
A Wallace tree multiplies two numbers in three steps:
1. Multiply (that is, AND) each bit of one of the arguments by each bit of the other, yielding n*n partial-product bits. Depending on the position of the multiplied bits, the wires carry different weights; for example, the wire carrying the result of a2b3 has weight 32 (see the explanation of weights below).
2. Reduce the number of partial products to two by layers of full and half adders.
3. Group the wires into two numbers, and add them with a conventional adder.
The second phase works as follows. As long as there are three or more wires with the same weight, add the following layer:
Take any three wires with the same weight and input them into a full adder. The result is an output wire of the same weight and an output wire with a higher weight, for each three input wires.
If there are two wires of the same weight left, input them into a half adder.
If there is just one wire left, connect it to the next layer.
These layers require only O(log n) reduction steps, and each layer has O(1) propagation delay, so the reduction is not much slower than a single addition (however, it is much more expensive in gate count). Naively adding the partial products with regular adders would require O(log^2 n) time. These computations only consider gate delays and don't deal with wire delays, which can also be very substantial.
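The logarithmic layer count can be sketched as follows (a rough bound of my own, not from the report, where n denotes the number of partial-product rows): each full-adder layer turns three wires of a weight into two, so one layer reduces the operand count by a factor of roughly 2/3, and

\[
n\left(\tfrac{2}{3}\right)^{k} \le 2 \;\Longrightarrow\; k \ge \log_{3/2}\frac{n}{2} = O(\log n)
\]

layers suffice to reach the two final operands.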
The Wallace tree can also be represented by a tree of 3:2 or 4:2 adders.
It is sometimes combined with Booth encoding.
The weight of a wire is the power of two of the digit that the wire carries: in general, the wire carrying the partial-product bit aibj has weight 2^(i+j).
Example: multiplying a3a2a1a0 by b3b2b1b0 (n = 4):
1. First we multiply every bit by every bit:
weight 1 - a0b0
weight 2 - a0b1, a1b0
weight 4 - a0b2, a1b1, a2b0
weight 8 - a0b3, a1b2, a2b1, a3b0
weight 16 - a1b3, a2b2, a3b1
weight 32 - a2b3, a3b2
weight 64 - a3b3
2. Reduction layer 1:
Pass the single weight-1 wire through.
Add a half adder for weight 2, outputs: 1 weight-2 wire, 1 weight-4 wire
Add a full adder for weight 4, outputs: 1 weight-4 wire, 1 weight-8 wire
Add a full adder for weight 8, and pass the remaining wire through, outputs: 2 weight-8 wires, 1 weight-16 wire
Add a full adder for weight 16, outputs: 1 weight-16 wire, 1 weight-32 wire
Add a half adder for weight 32, outputs: 1 weight-32 wire, 1 weight-64 wire
Pass the single weight-64 wire through.
3. Wires at the output of reduction layer 1:
weight 1 - 1
weight 2 - 1
weight 4 - 2
weight 8 - 3
weight 16 - 2
weight 32 - 2
weight 64 - 2
4. Reduction layer 2:
Add a full adder for weight 8, and half adders for weights 4, 16, 32, 64
5. Outputs of reduction layer 2:
weight 1 - 1
weight 2 - 1
weight 4 - 1
weight 8 - 2
weight 16 - 2
weight 32 - 2
weight 64 - 2
weight 128 - 1
6. Group the wires into a pair of integers and use a conventional adder to add them.
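As a small illustration of one reduction step in hardware, the following Verilog fragment (module and signal names are my own, not from the report) shows a full adder used as a 3:2 compressor and a half adder, the two cells from which the reduction layers above are built:

module full_adder (input a, b, cin, output sum, carry);
  // 3:2 compressor: three wires of weight w in, one wire of weight w (sum)
  // and one wire of weight 2w (carry) out.
  assign sum   = a ^ b ^ cin;
  assign carry = (a & b) | (b & cin) | (a & cin);
endmodule

module half_adder (input a, b, output sum, carry);
  // 2:2 cell, used when only two wires of a given weight remain.
  assign sum   = a ^ b;
  assign carry = a & b;
endmodule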
Drawback:
Complexity is High
Array Multiplier:
Digital multiplication entails a sequence of additions carried out on partial products.
Fig: 1.3
Fig: 1.4
There are m*n summands, which are produced in parallel by a set of m*n AND gates.
If m = n, the multiplier requires n(n-2) full adders, n half adders and n*n AND gates, and the worst-case delay is (2n+1)td, where td is the delay of a single gate.
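For instance, evaluating these expressions for the 4x4 case considered below (n = 4):

\[
n(n-2) = 8 \ \text{full adders},\qquad n = 4 \ \text{half adders},\qquad n^2 = 16 \ \text{AND gates},\qquad (2n+1)\,t_d = 9\,t_d .
\]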
Fig: 1.5
Fig: 1.6
Consider computing the product of two 4-bit integer numbers given by A3A2A1A0
(multiplicand) and B3B2B1B0 (multiplier). The product of these two numbers can be
formed as shown below.
Fig: 1.7
Each of the ANDed terms is referred to as a partial product. The final product (the
result) is formed by accumulating (summing) down each column of partial products.
Any carries must be propagated from the right to the left across the columns.
Since we are dealing with binary numbers, the partial products reduce to simple AND
operations between the corresponding bits in the multiplier and multiplicand. The
sums down each column can be implemented using one or more 1-bit binary adders.
Any adder that may need to accept a carry from the right must be a full adder. If there
is no possibility of a carry propagating in from the right, then a half adder can be used
instead, if desired (a full adder can always be used to implement a half adder if the
carry-in is tied low). The diagram below illustrates a combinational circuit for performing the 4x4 binary multiplication.
The initial layer of AND gates forms the sixteen partial products that result from
ANDing all combinations of the four multiplier bits with the four multiplicand bits.
The column sums are formed using a combination of half and full adders. Look again
at the first two illustrations of the binary multiplication process above, and make a
careful comparison with the figure below.
Fig: 1.8
The adder blocks (indicated by FA and HA) in the figure above are drawn in such a
way that the two bits to be added enter from the top, any carry in from the right enters
from the right, and any carry out exits from the left of each block. The output from the
bottom of the block is the sum.
The least significant output bit, S0 (the first column), involves only two input bits and
is computed as the simple output of an AND gate.
The next output bit, S1, involves the sum of two partial products. A half adder is used
to form the sum since there can be no carry in from the first column.
The third output bit, S2, is formed from the sum of three (1-bit) partial products plus a
possible carry in from the previous bit. This operation requires two cascaded adders
(one half adder and one full adder) to sum the four possible input bits (three partial
products and one possible carry in from the right).
The remaining output bits are formed similarly. Because in some columns we must
add more than two binary numbers, there may be more than one carry out generated to
the left.
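A compact Verilog sketch of this 4x4 array multiplier is shown below (module and signal names are illustrative and not taken from the report); each row addition stands for the row of half and full adders described above:

module array_mult_4x4 (input [3:0] a, b, output [7:0] p);
  // Partial products: each row is the multiplicand ANDed with one multiplier bit.
  wire [3:0] pp0 = a & {4{b[0]}};
  wire [3:0] pp1 = a & {4{b[1]}};
  wire [3:0] pp2 = a & {4{b[2]}};
  wire [3:0] pp3 = a & {4{b[3]}};
  // Each row adder accumulates the next shifted partial product;
  // in hardware this '+' is the row of half adders and full adders.
  wire [4:0] s1 = pp0[3:1] + pp1;
  wire [4:0] s2 = s1[4:1]  + pp2;
  wire [4:0] s3 = s2[4:1]  + pp3;
  assign p = {s3, s2[0], s1[0], pp0[0]};
endmodule

The carries ripple through each row, which is why the worst-case delay grows linearly with the operand width, as noted above.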
Drawbacks:
Low speed
More Area
More power
Booth's Multiplier:
Booth's algorithm is an efficient technique for multiplying signed numbers. It starts from the observation that, with the ability to both add and subtract, there are multiple ways to compute a product. Booth's algorithm is a multiplication algorithm that uses the two's complement notation of signed binary numbers; earlier, multiplication was generally implemented as a sequence of additions, subtractions and shifts.
Fig:1.9
The most significant bit of both the multiplicand and the multiplier is always a sign bit and cannot be used as part of the value. First choose which operand will be the multiplier and which will be the multiplicand. If one or both operands are negative, they are represented in two's complement form. Start with a product that consists of the multiplier together with X additional leading zero bits, where X is the number of bits in the multiplicand. Now check the LSB and the previous LSB of the product to determine the arithmetic action; on the first pass, a 0 is used as the previous LSB. The possible arithmetic actions are: 00 - no arithmetic operation is performed, only shifting; 01 - the multiplicand is added to the left half of the product, then shifting is done; 10 - the multiplicand is subtracted from the left half of the product, then shifting is performed; 11 - no arithmetic operation is performed, only shifting.
Example:
Multiply 10 by -7 using 5-bit numbers (10-bit result). 10 in binary is 01010, and -10 in binary is 10110 (so we can add 10110 whenever we need to subtract the multiplicand); -7 in binary is 11001. The expected result is -70, i.e. (11101 11010) in binary.
Steps of the algorithm are:
Step1:
(00000 11001 0) now as last two bits are 10 so here 00000+10110=10110. Now we
get (10110 11001 0) now by ARS (arithmetic right shift) we get (11011 01100 1).
Step2:
as the last two bits are 01, 11011+01010=00101 (the carry is ignored because adding numbers of opposite sign cannot overflow). Now we get (00101 01100 1), and by ARS we get (00010 10110 0).
Step 3:
as last two bits are 00 there is no change only ARS will be done, now we will get
(00001 01011 0).
Step 4:
as last two bits are 10 so, 00001+10110=10111, now we will get (10111 01011 0)
now by ARS we will get (11011 10101 1)
Step 5:
as last two bits are 11, there is no change only ARS will take place, now we will get
(11101 11010 1).
Step 6:
now ignoring the last bit we will get our product that is (11101 11010) = -70
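The following behavioral Verilog sketch follows the same A : Q : Q-1 register scheme for 5-bit operands (module and signal names are mine, not from the report); the accumulator carries one guard bit so that the intermediate add/subtract cannot overflow:

module booth_mult_5 (
  input  signed [4:0] multiplicand,
  input  signed [4:0] multiplier,
  output reg signed [9:0] product
);
  reg signed [5:0] A;   // accumulator (left half of the product) plus a guard bit
  reg        [4:0] Q;   // multiplier / low half of the product
  reg              Q_1; // previously examined LSB
  integer i;

  always @* begin
    A   = 6'sd0;
    Q   = multiplier;
    Q_1 = 1'b0;
    for (i = 0; i < 5; i = i + 1) begin
      case ({Q[0], Q_1})
        2'b01:   A = A + multiplicand;  // add multiplicand to the left half
        2'b10:   A = A - multiplicand;  // subtract multiplicand from the left half
        default: ;                      // 00 or 11: arithmetic shift only
      endcase
      {A, Q, Q_1} = {A[5], A, Q};       // arithmetic right shift of A:Q:Q-1
    end
    product = {A[4:0], Q};              // 10-bit signed result
  end
endmodule

For the worked example above, applying this module to 01010 and 11001 yields 11101 11010, i.e. -70.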
Drawbacks:
Most Complex Algorithm
Trade-off between speed and area
Parameter           Array multiplier   Wallace multiplier   Booth multiplier
Operation speed     Less               High                 Highest
Time delay          More               Medium               Less
Area                More area          Medium               Minimum
Complexity          Less               More                 Most
Power consumption   Most               More                 Less
Complex number multiplication is a key operation in digital signal processing algorithms such as the Discrete Fourier Transform (DFT), the Discrete Cosine Transform (DCT) and the Discrete Sine Transform (DST). Several techniques have been proposed for implementing it efficiently, including the quadratic residue number system (QRNS) and, more recently, the redundant complex number system (RCNS). Blahut et al. proposed a technique for complex number multiplication in which an algebraic transformation is used; this algebraic transformation saves one real multiplication at the expense of three additions as compared to the direct implementation. A left-to-right array for fast multiplication was reported in 2005, but the method was not further extended to complex multiplication. All of the above techniques require either a large overhead for pre/post-processing or a long latency. Furthermore, many design issues such as speed, accuracy, design overhead and power consumption are not adequately addressed by these fast multiplication schemes. At the algorithmic and structural levels, many multiplication techniques have been developed to enhance the efficiency of the multiplier, either by reducing the number of partial products or by improving the method of adding them, but the principle behind multiplication remains the same in all cases. Vedic Mathematics is the ancient system of Indian mathematics, which has a unique technique of calculation based on 16 Sutras (formulae). "Urdhva-Tiryakbhyam" is a Sanskrit term meaning vertically and crosswise; this formula is used for fast multiplication. All these formulae are adopted from ancient Indian Vedic Mathematics. In this work we use this mathematics to design the complex multiplier architecture at the transistor level with two clear goals in mind: i) simplicity and modularity of the multiplications for VLSI implementation, and ii) the elimination of carry propagation for rapid additions and subtractions. Mehta et al. proposed a multiplier design using the "Urdhva-Tiryakbhyam" sutra, adopted from the Vedas; the formulation using this sutra is similar to modern array multiplication and shows the same carry-propagation issues. Multiplier implementations at the gate level (FPGA) using Vedic Mathematics have already been reported, but to the best of our knowledge there is no report to date on a transistor-level (ASIC) implementation of such a complex multiplier. By employing Vedic mathematics, an N-bit complex number multiplication is transformed into four multiplications for the real and imaginary terms of the final product. In this work we report a novel high-speed complex multiplier design using ancient Indian Vedic mathematics.
The sutras of Vedic mathematics, with their meanings, are listed in the table below:
S.No  Sutra Name                       Meaning
1     (Anurupye) Shunyamanyathu        If one value is in ratio, the other is zero
2     Chalana-Kalanabhyam              Differences and similarities
3     Ekadhikina Purvena               By one more than the previous one
4     Ekanyunena Purvena               By one less than the previous one
5     Gunakkasamuchhyah                The factors of the sum are equal to the sum of the factors
6     Guniitasamuchhyah                The product of the sum is equal to the sum of the products
7     Nikilam Navatashcaramam          All from 9 and the last from 10
8     Paravartya Yojayethu             Transpose and adjust
9     Puranapuranabhyam                By the completion or non-completion
10    Sankhalana-Vyavakhalanabhyam     By addition and by subtraction
11    Sesanyaankena Caramena           The remainders by the last digit
12    Sunyam Samyasamucaye             When the sum is the same, that sum is zero
13    Soopantyadvayamantyam            The ultimate and twice the penultimate
14    Urdhva-Tiryakbhyam               Vertically and crosswise
15    Vyashtisamanstih                 Part and whole
16    Yaavadunam                       Whatever the extent of its deficiency
Some of these sutras are used for large-number multiplication and subtraction; all of them are adopted from ancient Indian Vedic Mathematics. The Urdhva-Tiryakbhyam (vertically and crosswise) sutra is the one used for multiplication in this work.
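As a small illustration (the bit-level notation here is mine, not taken from the report), applying the vertical-and-crosswise rule to two 2-bit numbers a1a0 and b1b0 gives

\[
\begin{aligned}
s_0 &= a_0 b_0,\\
2c_1 + s_1 &= a_1 b_0 + a_0 b_1 \quad (\text{crosswise terms}),\\
2c_2 + s_2 &= a_1 b_1 + c_1 \quad (\text{vertical term plus carry}),\\
s_3 &= c_2,
\end{aligned}
\]

so the product is the 4-bit string s3 s2 s1 s0; this is exactly the function computed by the 2x2 block referred to in the following section.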
Design of 4x4 block:
The design of the 4x4 block is a simple arrangement of 2x2 blocks in an optimized manner. The first step in the design of the 4x4 block is grouping the bits of each 4-bit input into 2-bit pairs. These pairs form the vertical and crosswise product terms. Each input bit-pair is handled by a separate 2x2 Vedic multiplier; the figure shows the schematic of a 4x4 block designed using 2x2 blocks. The partial products represent the Urdhva vertical and crosswise product terms. The first two bits of the right-most 2x2 Vedic multiplier output are sent directly to the first two output bits. The remaining partial products are handled by 4-bit and 6-bit adders, as shown in the figure and in the sketch below.
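A Verilog sketch of this arrangement is given below (module and signal names are illustrative, not taken from the report); the 2x2 block implements the vertical-and-crosswise equations given earlier, and the 4x4 block combines four of them with the 4-bit and 6-bit additions mentioned above:

module vedic_2x2 (input [1:0] a, b, output [3:0] c);
  wire p0 = a[0] & b[0];   // vertical term, weight 1
  wire p1 = a[1] & b[0];   // crosswise terms, weight 2
  wire p2 = a[0] & b[1];
  wire p3 = a[1] & b[1];   // vertical term, weight 4
  wire carry;
  assign c[0] = p0;
  assign {carry, c[1]} = p1 + p2;    // crosswise sum
  assign {c[3], c[2]} = p3 + carry;  // top term plus carry
endmodule

module vedic_4x4 (input [3:0] a, b, output [7:0] c);
  wire [3:0] q0, q1, q2, q3;
  vedic_2x2 m0 (.a(a[1:0]), .b(b[1:0]), .c(q0));  // right-most block
  vedic_2x2 m1 (.a(a[3:2]), .b(b[1:0]), .c(q1));
  vedic_2x2 m2 (.a(a[1:0]), .b(b[3:2]), .c(q2));
  vedic_2x2 m3 (.a(a[3:2]), .b(b[3:2]), .c(q3));
  wire [4:0] s1 = q1 + q2;                // 4-bit adder for the crosswise products
  wire [5:0] s2 = {q3, q0[3:2]} + s1;     // 6-bit adder for the remaining terms
  assign c[1:0] = q0[1:0];                // two LSBs come straight from m0
  assign c[7:2] = s2;
endmodule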
2. SOFTWARE REQUIREMENTS
2.1 XILINX:
Xilinx, Inc. is an American technology company, primarily a supplier of
programmable logic devices. It is known for inventing the field programmable gate
array (FPGA) and as the first semiconductor company with a fabless manufacturing model.
Xilinx designs, develops and markets programmable logic products, including
integrated circuits (ICs), software design tools, predefined system functions delivered
as intellectual property (IP) cores, design services, customer training, field
engineering and technical support. Xilinx sells both FPGAs and CPLDs for electronic
equipment manufacturers in end markets such as communications, industrial,
consumer, automotive and data processing.
2.2 XILINX-ISE:
Xilinx ISE (Integrated Software Environment) is a software tool produced by
Xilinx for synthesis and analysis of HDL designs, enabling the developer to
synthesize ("compile") their designs, perform timing analysis, examine RTL
diagrams, simulate a design's reaction to different stimuli, and configure the target
device with the programmer.
Key capabilities of the ISE tool suite include:
Multi-threaded compilation
Post-processing capabilities
Debug capabilities
3. HARDWARE REQUIREMENTS
3.1 SPARTAN 3:
The Spartan-3 family of Field-Programmable Gate Arrays is specifically
designed to meet the needs of high volume, cost-sensitive consumer electronic
applications. The eight-member family offers densities ranging from 50,000 to
5,000,000 system gates. The Spartan-3 family is a superior alternative to mask
programmed ASICs. FPGAs avoid the high initial cost, the lengthy development
cycles, and the inherent inflexibility of conventional ASICs. Also, FPGA
programmability permits design upgrades in the field with no hardware replacement
necessary, an impossibility with ASICs.
3.1.2 Configuration:
Spartan-3 FPGAs are programmed by loading configuration data into robust
reprogrammable static CMOS configuration latches (CCLs) that collectively control
all functional elements and routing resources. Before powering on the FPGA,
configuration data is stored externally in a PROM or some other nonvolatile medium
either on or off the board. After applying power, the configuration data is written to
the FPGA using any of five different modes: Master Parallel, Slave Parallel, Master
Serial, Slave Serial, and Boundary Scan (JTAG). The Master and Slave Parallel
modes use an 8-bit-wide SelectMAP port.
The recommended memory for storing the configuration data is the low-cost Xilinx
Platform Flash PROM family, which includes the XCF00S PROMs for serial
configuration and the higher density XCF00P PROMs for parallel or serial
configuration.
Fig 3.2: Spartan-3 FPGA QFP Package Marking Example for Part Number XC3S400-
4PQ208C
4. APPLICATIONS
4.1The Implementation of Vedic Algorithms in Digital Signal
Processing
Digital signal processing (DSP) is the technology that is omnipresent in almost every
Engineering discipline. It is also the fastest growing technology this century and,
therefore, it poses tremendous challenges to the engineering community. Faster
additions and multiplications are of extreme importance in DSP for convolution,
discrete Fourier transforms digital filters, etc. The core computing process is always a
multiplication routine; therefore, DSP engineers are constantly looking for new
algorithms and hardware to implement them. Vedic mathematics is the name given to
the ancient system of mathematics, which was rediscovered from the Vedas between
1911 and 1918 by Sri Bharati Krishna Tirthaji. The whole of Vedic mathematics is
based on 16 sutras (word formulae) and manifests a unified structure of mathematics.
As such, the methods are complementary, direct and easy. The authors highlight the
use of multiplication process based on Vedic algorithms and its implementations on
8085 and 8086 microprocessors, resulting in appreciable savings in processing time.
The exploration of Vedic algorithms in the DSP domain may prove to be extremely
advantageous. Engineering institutions now seek to incorporate research-based studies
in Vedic mathematics for its applications in various engineering processes. Further
research prospects may include the design and development of a Vedic DSP chip
using VLSI technology.
5. RESULTS
5.1 16x16 Vedic Multiplier:
Device Utilization Summary:
From the figure we can infer that the input is given when enable is high and exponent
value is determined.
Fig 5.25 shows a snapshot of the Spartan-3 board; A and B are the inputs and C is the output.
Clock Information:
------------------
No clock signals found in this design

Asynchronous Control Signals Information:
-----------------------------------------
No asynchronous control signals found in this design

Timing Summary:
---------------
Speed Grade: -5
Minimum period: No path found
Minimum input arrival time before clock: No path found
Maximum output required time after clock: No path found
Maximum combinational path delay: 29.126ns

Timing Detail:
--------------
All values displayed in nanoseconds (ns)

Timing constraint: Default path analysis
Total number of paths / destination ports: 30173 / 16
Delay: 29.126ns (maximum combinational path)
Source: a<1> (PAD)
Destination: c<15> (PAD)
The path runs from the a<1> input buffer (IBUF) through a chain of LUT2/LUT3/LUT4 and MUXF5 levels of the adder logic (including z1/z5/fa1/Mxor_sum_Result26) to the c_15_OBUF output buffer driving c<15>.
APPENDIX
A. INTRODUCTION TO VERILOG
A.1 OVERVIEW:
Hardware description languages such as Verilog differ from software
programming languages because they include ways of describing the propagation time
and signal strengths (sensitivity). There are two types of assignment operators; a
blocking assignment (=), and a non-blocking (<=) assignment. The non-blocking
assignment allows designers to describe a state-machine update without needing to
declare and use temporary storage variables. Since these concepts are part of Verilog's
language semantics, designers could quickly write descriptions of large circuits in a
relatively compact and concise form. At the time of Verilog's introduction (1984),
Verilog represented a tremendous productivity improvement for circuit designers who
were already using graphical schematic capture software and specially written
software programs to document and simulate electronic circuits.
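As a small sketch of the two assignment operators (signal names are illustrative; the two always blocks below are alternatives, not meant to coexist in one module):

// Blocking (=): each statement completes before the next one runs.
always @(posedge clk) begin
  x = x + 1;   // x is updated immediately
  y = x;       // y sees the already incremented x
end

// Non-blocking (<=): right-hand sides are sampled first, all updates happen
// together at the end of the time step - the usual style for registers.
always @(posedge clk) begin
  x <= x + 1;  // both right-hand sides use the old value of x
  y <= x;      // y gets the pre-increment x, i.e. a one-cycle delay
end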
The designers of Verilog wanted a language with syntax similar to the C
programming language, which was already widely used in engineering software
development. Like C, Verilog is case-sensitive and has a basic preprocessor (though
less sophisticated than that of ANSI C/C++). Its control flow keywords (if/else, for,
while, case, etc.) are equivalent, and its operator precedence is compatible with C.
Syntactic differences include: required bit-widths for variable declarations,
demarcation of procedural blocks (Verilog uses begin/end instead of curly braces {}),
and many other minor differences. Verilog requires that variables be given a definite
size. In C these sizes are assumed from the 'type' of the variable (for instance an
integer type may be 8 bits).
A Verilog design consists of a hierarchy of modules. Modules encapsulate
design hierarchy, and communicate with other modules through a set of declared
input, output, and bidirectional ports. Internally, a module can contain any
combination of the following: net/variable declarations (wire, reg, integer, etc.),
concurrent and sequential statement blocks, and instances of other modules (subhierarchies). Sequential statements are placed inside a begin/end block and executed
47
in sequential order within the block. However, the blocks themselves are executed
concurrently, making Verilog a dataflow language.
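A minimal sketch of this module and port structure (module and signal names are illustrative, not taken from the text):

module and_or (input a, b, output y_and, y_or);
  assign y_and = a & b;
  assign y_or  = a | b;
endmodule

module top (input x1, x2, output f);
  wire w_and, w_or;
  // Instance of a sub-module, connected through its declared ports.
  and_or u0 (.a(x1), .b(x2), .y_and(w_and), .y_or(w_or));
  assign f = w_and ^ w_or;  // concurrent (dataflow) statement
endmodule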
Verilog's concept of 'wire' consists of both signal values (4-state: "1, 0, floating,
undefined") and signal strengths (strong, weak, etc.). This system allows abstract
modeling of shared signal lines, where multiple sources drive a common net. When a
wire has multiple drivers, the wire's (readable) value is resolved by a function of the
source drivers and their strengths.
A subset of statements in the Verilog language are synthesizable. Verilog modules that
conform to a synthesizable coding style, known as RTL (register-transfer level), can
be physically realized by synthesis software. Synthesis software algorithmically
transforms the (abstract) Verilog source into a netlist, a logically equivalent
description consisting only of elementary logic primitives (AND, OR, NOT, flip-flops, etc.) that are available in a specific FPGA or VLSI technology. Further
manipulations to the netlist ultimately lead to a circuit fabrication blueprint (such as a
photo mask set for an ASIC or a bitstream file for an FPGA).
A.1.1 Beginning:
Verilog was one of the first modern hardware description languages to be invented. It was created by Prabhu Goel and Phil Moorby during the winter of 1983/1984. The wording for this process was "Automated Integrated Design Systems" (later renamed to Gateway Design Automation in 1985) as a hardware modeling language. Gateway Design Automation
was purchased by Cadence Design Systems in 1990. Cadence now has full proprietary
rights to Gateway's Verilog and the Verilog-XL, the HDL-simulator that would
become the de facto standard (of Verilog logic simulators) for the next decade.
Originally, Verilog was intended to describe and allow simulation; only afterwards
was support for synthesis added.
A.1.2 Verilog-95:
With the increasing success of VHDL at the time, Cadence decided to make the
language available for open standardization. Cadence transferred Verilog into the
public domain under the Open Verilog International (OVI) (now known as Accellera)
organization. Verilog was later submitted to IEEE and became IEEE Standard 1364-1995, commonly referred to as Verilog-95.
A.1.3 Verilog 2001:
Extensions to Verilog-95 were submitted back to IEEE to cover the deficiencies that
users had found in the original Verilog standard. These extensions became IEEE
Standard 1364-2001 known as Verilog-2001.
Verilog-2001 is a significant upgrade from Verilog-95. First, it adds explicit support
for (2's complement) signed nets and variables. Previously, code authors had to
perform signed operations using awkward bit-level manipulations (for example, the
carry-out bit of a simple 8-bit addition required an explicit description of the Boolean
algebra to determine its correct value). The same function under Verilog-2001 can be
more succinctly described by one of the built-in operators: +, -, /, *, >>>. A
generate/endgenerate construct (similar to VHDL's generate/endgenerate) allows
Verilog-2001 to control instance and statement instantiation through normal decision
operators (case/if/else). Using generate/endgenerate, Verilog-2001 can instantiate an
array of instances, with control over the connectivity of the individual instances. File
I/O has been improved by several new system tasks. And finally, a few syntax
additions were introduced to improve code readability (e.g. always @*, named
parameter override, C-style function/task/module header declaration).
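A brief sketch of two of these Verilog-2001 features, signed arithmetic and generate (module and signal names are illustrative, not from the text):

module v2001_features #(parameter N = 4) (
  input  signed [N-1:0] a, b,
  output signed [N:0]   sum,
  output        [N-1:0] same
);
  assign sum = a + b;  // built-in 2's-complement signed addition
  genvar i;
  generate
    for (i = 0; i < N; i = i + 1) begin : cmp
      assign same[i] = (a[i] == b[i]);  // one statement generated per bit
    end
  endgenerate
endmodule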
Verilog-2001 is the dominant flavor of Verilog supported by the majority of
commercial EDA software packages.
A.1.4 Verilog 2005:
Not to be confused with SystemVerilog, Verilog 2005 (IEEE Standard 1364-2005)
consists of minor corrections, spec clarifications, and a few new language features
(such as the uwire keyword).
A separate part of the Verilog standard, Verilog-AMS, attempts to integrate analog and
mixed signal modeling with traditional Verilog.
Signals that are driven from within a process (an initial or always block) must be of
type reg. Signals that are driven from outside a process must be of type wire. The
keyword reg does not necessarily imply a hardware register.
// A mux can also be written with a case statement in a procedural structure.
reg out;
always @(a or b or sel)
begin
case(sel)
1'b0: out = b;
1'b1: out = a;
endcase
end
// Finally - you can use if/else in a procedural structure.
reg out;
always @(a or b or sel)
if (sel)
out = a;
else
out = b;
The next interesting structure is a transparent latch; it will pass the input to the output
when the gate signal is set for "pass-through", and captures the input and stores it
upon transition of the gate signal to "hold". The output will remain stable regardless
of the input signal while the gate is set to "hold". In the example below the "pass-through" level of the gate would be when the value of the if clause is true, i.e. gate =
1. This is read "if gate is true, the din is fed to latch_out continuously." Once the if
clause is false, the last value at latch_out will remain and is independent of the value
of din.
// Transparent latch example
reg latch_out;
always @(gate or din)
if(gate)
latch_out = din; // Pass through state
always @(posedge clk or posedge reset or posedge set)
if(reset)
q <= 0;
else
if(set)
q <= 1;
else
q <= d;
Note: If this model is used to model a Set/Reset flip flop then simulation errors can
result. Consider the following test sequence of events. 1) reset goes high 2) clk goes
high 3) set goes high 4) clk goes high again 5) reset goes low followed by 6) set going
low. Assume no setup and hold violations.
In this example the always @ statement would first execute when the rising edge of
reset occurs which would place q to a value of 0. The next time the always block
executes would be the rising edge of clk which again would keep q at a value of 0.
The always block then executes when set goes high which because reset is high forces
q to remain at 0. This condition may or may not be correct depending on the actual
flip flop. However, this is not the main problem with this model. Notice that when
reset goes low, that set is still high. In a real flip flop this will cause the output to go to
a 1. However, in this model it will not occur because the always block is triggered by
rising edges of set and reset - not levels. A different approach may be necessary for
set/reset flip flops.
The final basic variant is one that implements a D-flop with a mux feeding its input.
The mux has a d-input and feedback from the flop itself. This allows a gated load
function.
// Basic structure with an EXPLICIT feedback path
always @(posedge clk)
if(gate)
q <= d;
else
q <= q; // explicit feedback path
// The more common structure ASSUMES the feedback is present
// This is a safe assumption since this is how the
// hardware compiler will interpret it. This structure
// looks much like a latch. The differences are the
// '@(posedge clk)' and the non-blocking '<='
always @(posedge clk)
if(gate)
q <= d; // the "else" mux is "implied"
Note that there are no "initial" blocks mentioned in this description. There is a split
between FPGA and ASIC synthesis tools on this structure. FPGA tools allow initial
blocks where reg values are established instead of using a "reset" signal. ASIC
synthesis tools don't support such a statement. The reason is that an FPGA's initial
state is something that is downloaded into the memory tables of the FPGA. An ASIC
is an actual hardware implementation.
initial // runs once, starting at simulation time 0
begin
a = 1; // Assign a value to reg a at time 0
#1; // Wait 1 time unit
b = a; // Assign the value of reg a to reg b
end
always @(a or b) // Any time a or b CHANGE, run the process
begin
if (a)
c = b;
else
d = ~b;
end // Done with this block, now return to the top (i.e. the @ event-control)
always @(posedge a)// Run whenever reg a has a low to high change
a <= b;
These are the classic uses for these two keywords, but there are two significant
additional uses. The most common of these is an always keyword without the @(...)
sensitivity list. It is possible to use always as shown below:
always
begin // Always begins executing at time 0 and NEVER stops
clk = 0; // Set clk to 0
#1; // Wait for 1 time unit
clk = 1; // Set clk to 1
#1; // Wait 1 time unit
B. INTRODUCTION TO VLSI
Very-large-scale integration (VLSI) is the process of creating integrated
circuits by combining thousands of transistor-based circuits into a single chip. VLSI
began in the 1970s when complex semiconductor and communication technologies
were being developed. The microprocessor is a VLSI device. The term is no longer as
common as it once was, as chips have increased in complexity into the hundreds of
millions of transistors.
B.1 OVERVIEW:
The first semiconductor chips held one transistor each. Subsequent advances
added more and more transistors, and, as a consequence, more individual functions or
systems were integrated over time. The first integrated circuits held only a few
devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it
possible to fabricate one or more logic gates on a single device. Now known
retrospectively as "small-scale integration" (SSI), improvements in technique led to
devices with hundreds of logic gates, known as medium-scale integration (MSI), and then to large-scale integration (LSI), i.e. systems with at least a thousand logic gates. Current technology has moved far past
this mark and today's microprocessors have many millions of gates and hundreds of
millions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale integration above VLSI. Terms like Ultra-Large-Scale Integration (ULSI) were
used. But the huge number of gates and transistors available on common devices has
rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of
integration are no longer in widespread use. Even VLSI is now somewhat quaint,
given the common assumption that all microprocessors are VLSI or better.
As of early 2008, billion-transistor processors are commercially available, an
example of which is Intel's Montecito Itanium chip. This is expected to become more
commonplace as semiconductor fabrication moves from the current generation of 65
nm processes to the next 45 nm generations (while experiencing new challenges such
as increased variation across process corners). Another notable example is NVIDIA's 280-series GPU.
This microprocessor is unique in the fact that its 1.4 Billion transistor count,
capable of a teraflop of performance, is almost entirely dedicated to logic (Itanium's
transistor count is largely due to the 24MB L3 cache). Current designs, as opposed to
the earliest devices, use extensive design automation and automated logic synthesis to
lay out the transistors, enabling higher levels of complexity in the resulting logic
functionality. Certain high-performance logic blocks like the SRAM cell, however,
are still designed by hand to ensure the highest efficiency (sometimes by bending or
breaking established design rules to obtain the last bit of performance by trading
stability).
B.2 VLSI:
VLSI stands for "Very Large Scale Integration". This is the field which involves packing more and more logic devices into smaller and smaller areas.
1. Simply we say Integrated circuit is many transistors on one chip.
2. Design/manufacturing of extremely small, complex circuitry using modified
semiconductor material.
3. Integrated circuit (IC) may contain millions of transistors, each a few µm in
size.
4. Applications wide ranging: most electronic logic devices.
The example circuit shown in the figure can be realized in terms of two ICs, an A-O-I gate and a flip-flop. It can be directly wired up, tested, and used.
Circuit requirements → other components / ICs → PCB layout → final circuit
Once the above steps are gone through, a paper design is ready. Starting with
the paper design, one has to do a circuit layout. The physical location of all the
components is tentatively decided; they are interconnected and the circuit-on-paper is
made ready. Once a paper design is done, a layout is carried out and a net-list
prepared. Based on this, the PCB is fabricated and populated and all the populated
cards tested and debugged.
At the debugging stage one may encounter three types of problems:
65
Functional mismatch: The realized and expected functions are different. One
may have to go through the relevant functional block carefully and locate any
error logically. Finally the necessary correction has to be carried out in
hardware.
Timing mismatch: The problem can manifest in different forms. One
possibility is due to the signal going through different propagation delays in
two paths and arriving at a point with a timing mismatch. This can cause
faulty operation. Another possibility is a race condition in a circuit involving
asynchronous feedback. This kind of problem may call for elaborate
debugging. The preferred practice is to do debugging at smaller module stages
and to ensure that feedback through larger loops is avoided: it becomes
essential to check for the existence of long asynchronous loops.
Overload: Some signals may be overloaded to such an extent that the signal
transition may be unduly delayed or even suppressed. The problem manifests
as reflections and erratic behavior in some cases (The signal has to be suitably
buffered here.). In fact, overload on a signal can lead to timing mismatches.
The above have to be carried out after completion of the prototype PCB
manufacturing; it involves cost, time, and also a redesigning process to develop a bug
free design.
History of Scale Integration:
Late 40s Transistor invented at Bell Labs
Late 50s First IC (JK-FF by Jack Kilby at TI)
Early 60s Small Scale Integration (SSI)
10s of transistors on a chip
Late 60s Medium Scale Integration (MSI)
100s of transistors on a chip
more to bring out the role of a Hardware Description Language (HDL) in the design
process. An abstraction based model is the basis of the automated design.
B.3.3 Abstraction Model:
The model divides the whole design cycle into various domains. With such an
abstraction through a division process the design is carried out in different layers. The
designer at one layer can function without bothering about the layers above or below.
The thick horizontal lines separating the layers in the figure signify the
compartmentalization. As an example, let us consider design at the gate level. The
circuit to be designed would be described in terms of truth tables and state tables.
With these as available inputs, the designer has to express them as Boolean logic equations and
realize them in terms of gates and flip-flops. In turn, these form the inputs to the layer
immediately below. Compartmentalization of the approach to design in the manner
described here is the essence of abstraction; it is the basis for development and use of
CAD tools in VLSI design at various levels.
The design methods at different levels use the respective aids such as Boolean
equations, truth tables, state transition table, etc. But the aids play only a small role in
the process. To complete a design, one may have to switch from one tool to another,
raising the issues of tool compatibility and learning new environments.
Idea → design description → synthesis → simulation → physical design
dedicated tools. With every simulation run, the simulation results are studied to
identify errors in the design description. The errors are corrected and another
simulation run carried out. Simulation and changes to design description together
form a cyclic iterative process, repeated until an error-free design is evolved.
Design description is an activity independent of the target technology or
manufacturer. It results in a description of the digital circuit. To translate it into a
tangible circuit, one goes through the physical design process. The same constitutes a
set of activities closely linked to the manufacturer and the target technology.
B.4.1 Design Description:
The design is carried out in stages. The process of transforming the idea into a
detailed circuit description in terms of the elementary circuit components constitutes
design description. The final circuit of such an IC can have up to a billion such
components; it is arrived at in a step-by-step manner. The first step in evolving the
design description is to describe the circuit in terms of its behavior. The description
looks like a program in a high level language like C. Once the behavioral level design
description is ready, it is tested extensively with the help of a simulation tool; it
checks and confirms that all the expected functions are carried out satisfactorily. If
necessary, this behavioral level routine is edited, modified, and rerun, all done
manually. Finally, one has a design for the expected system described at the
behavioral level. The behavioral design forms the input to the synthesis tools, for
circuit synthesis. The behavioral constructs not supported by the synthesis tools are
replaced by data flow and gate level constructs. To summarize, the designer has to
develop synthesizable codes for his design. The design at the behavioral level is to be
elaborated in terms of known and acknowledged functional blocks. It forms the next
detailed level of design description
Once again the design is to be tested through simulation and iteratively
corrected for errors. The elaboration can be continued one or two steps further. It
leads to a detailed design description in terms of logic gates and transistor switches.
B.4.2 Optimization:
The circuit at the gate level in terms of the gates and flip-flops can be
redundant in nature. The same can be minimized with the help of minimization tools.
The step is not shown separately in the figure. The minimized logical design is
converted to a circuit in terms of the switch level cells from standard libraries
provided by the foundries. The cell based design generated by the tool is the last step
in the logical design process; it forms the input to the first level of physical design.
B.4.3 Simulation:
The design descriptions are tested for their functionality at every level: behavioral, data flow, and gate. One has to check whether all the functions are carried out as expected and rectify any that are not. All such activities are carried out by the simulation tool, which also has an editor to carry out any corrections to the source code. Simulation involves testing the design for all its functions, functional sequences, timing constraints, and specifications. Normally, testing and simulation at all the levels, from behavioral down to switch level, are carried out by a single tool; this is identified as the scope of the simulation tool.
B.4.4 Synthesis:
With the availability of design at the gate (switch) level, the logical design is
complete. The corresponding circuit hardware realization is carried out by a synthesis
tool.
Two common approaches are as follows:
The circuit is realized through an FPGA. The gate level design description is the
starting point for the synthesis here. The FPGA vendors provide an interface to the
synthesis tool. Through the interface the gate level design is realized as a final
circuit. With many synthesis tools, one can directly use the design description at
the data flow level itself to realize the final circuit through an FPGA. The FPGA
route is attractive for limited volume production or a fast development cycle.
The circuit is realized as an ASIC. A typical ASIC vendor will have his own
library of basic components like elementary gates and flip-flops. Eventually the
circuit is to be realized by selecting such components and interconnecting them
conforming to the required design. This constitutes the physical design. Being an
elaborate and costly process, a physical design may call for an intermediate
functional verification through the FPGA route. The circuit realized through the
FPGA is tested as a prototype. It provides another opportunity for testing the
design closer to the final circuit.
B.4.5 Physical Design:
A fully tested and error-free design at the switch level can be the starting point
for a physical design. It is to be realized as the final circuit using (typically) a million
components in the foundry's library. The step-by-step activities in the process are
described briefly as follows:
System partitioning: The design is partitioned into convenient compartments or
functional blocks. Often it would have been done at an earlier stage itself and the
software design prepared in terms of such blocks. Interconnection of the blocks is
part of the partition process.
Floor planning: The positions of the partitioned blocks are planned and the blocks
are arranged accordingly. The procedure is analogous to the planning and
arrangement of domestic furniture in a residence. Blocks with I/O pins are kept
close to the periphery; those which interact frequently or through a large number
of interconnections are kept close together, and so on. Partitioning and floor
planning may have to be carried out and refined iteratively to yield best results.
Placement: The selected components from the ASIC library are placed in position
on the Silicon floor. It is done with each of the blocks above.
Routing: The components placed as described above are to be interconnected to
the rest of the block: It is done with each of the blocks by suitably routing the
interconnects. Once the routing is complete, the physical design can be taken as
complete. The final mask for the design can be made at this stage and the ASIC
manufactured in the foundry.
B.4.6 Post Layout Simulation:
Once the placement and routing are completed, the performance specifications
like silicon area, power consumed, path delays, etc., can be computed. Equivalent
circuit can be extracted at the component level and performance analysis carried out.
This constitutes the final stage called verification. One may have to go through the
placement and routing activity once again to improve performance.
B.4.7 Critical Subsystems:
The design may have critical subsystems. Their performance may be crucial to
the overall performance; in other words, to improve the system performance
substantially, one may have to design such subsystems afresh. The design here may
imply redefinition of the basic feature size of the component, component design,
placement of components, or routing done separately and specifically for the
subsystem. A set of masks used in the foundry may have to be done afresh for the
purpose.
To be more clear, if we want to implement a complex design (CPU for instance), then
the design is divided into small sub functions and each sub function is implemented
using one logic block. Now, to get our desired design (CPU), all the sub functions
implemented in logic blocks must be connected and this is done by programming the
interconnects.
are slow compared to custom ICs, as they cannot handle very complex designs, and they also draw more power. The Xilinx logic block consists of one Look-Up Table (LUT) and one flip-flop. An LUT is used to implement a number of different functions: the input lines to the logic block go into the LUT and enable it, the output of the LUT gives the result of the logic function that it implements, and the output of the logic block is either the registered or the unregistered output from the LUT. SRAM is used to implement an LUT; a k-input logic function is implemented using a 2^k x 1 SRAM, and the number of different possible functions for a k-input LUT is 2^(2^k). The advantage of such an architecture is that it supports the implementation of a great many logic functions; the disadvantage is the unusually large number of memory cells required to implement such a logic block when the number of inputs is large. Figure C.2 below shows a 4-input LUT based implementation of a logic block.
LUT-based design provides better logic block utilization. A k-input LUT based logic block can be implemented in a number of different ways, with a trade-off between performance and logic density. An n-LUT can be seen as a direct implementation of a function truth table: each latch holds the value of the function corresponding to one input combination. For example, a 2-LUT can implement any of the 16 two-input functions, such as AND, OR, A+(not B), etc.
A  B   AND   OR
0  0    0     0
0  1    0     1
1  0    0     1
1  1    1     1
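A behavioral sketch of a 2-input LUT along these lines (parameter and signal names are illustrative): the stored 4-bit pattern is the truth table, so 4'b1000 gives AND and 4'b1110 gives OR, matching the table above.

module lut2 #(parameter [3:0] INIT = 4'b1000) (input a, b, output y);
  wire [3:0] mem = INIT;    // the four stored cells (the truth table)
  assign y = mem[{b, a}];   // the inputs address one of the stored cells
endmodule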
C.2 INTERCONNECTS:
A wire segment can be described as two end points of an interconnect with no
programmable switch between them. A sequence of one or more wire segments in an
FPGA can be termed a track. Typically, an FPGA has logic blocks, interconnects and
switch blocks (Input/output blocks). Switch blocks lie in the periphery of logic blocks
and interconnect. Wire segments are connected to logic blocks through switch blocks.
Depending on the required design, one logic block is connected to another and so on.
In this part of the tutorial we give a short introduction to the FPGA design flow. A simplified version of the design flow is given in the following figure C.3.
For small and simple designs, schematic entry is the better choice. When the design is complex, or the designer thinks of the design in an algorithmic way, HDL entry is the better choice. Language-based entry is faster but lags in performance and density. HDLs represent a level of abstraction that can isolate the designers from the details of the hardware implementation, whereas schematic-based entry gives designers much more visibility into the hardware; it is the better choice for those who are hardware oriented. Another, rarely used, method is state-machine entry, which is the better choice for designers who think of the design as a series of states; however, the tools for state-machine entry are limited. In this documentation we deal with HDL-based design entry.
C.4 SYNTHESIS:
Synthesis is the process which translates VHDL or Verilog code into a device netlist format, i.e. a complete circuit of logical elements (gates, flip-flops, etc.) for the design. If the design contains more than one sub-design (for example, to implement a processor we need a CPU as one design element, RAM as another, and so on), then the synthesis process generates a netlist for each design element.
The synthesis process also checks the code syntax and analyzes the hierarchy of the design, which ensures that the design is optimized for the design architecture the designer has selected. The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx Synthesis Technology (XST)).
C.5 IMPLEMENTATION:
This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route
Translate process combines all the input netlists and constraints to a logic
design file. This information is saved as a NGD (Native Generic Database) file. This
can be done using NGD Build program. Here, defining constraints is nothing but,
assigning the ports in the design to the physical elements (ex. pins, switches, buttons
etc) of the targeted device and specifying time requirements of the design. This
information is stored in a file named UCF (User Constraints File).
Tools used to create or modify the UCF are PACE, Constraint Editor Etc.
Map process divides the whole circuit with logical elements into sub blocks
such that they can be fit into the FPGA logic blocks. That means map process fits the
logic defined by the NGD file into the targeted FPGA elements (Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs)) and generates an NCD (Native Circuit
Description) file which physically represents the design mapped to the components of
FPGA. MAP program is used for this purpose.
Place and Route: The PAR program is used for this process. The place and route process places the sub-blocks from the map process into logic blocks according to the constraints and connects the logic blocks. For example, if a sub-block is placed in a logic block which is very near to an IO pin, it may save time but it may affect some other constraint; the trade-off between all the constraints is taken into account by the place and route process. The PAR tool takes the mapped NCD file as input and produces a completely routed NCD file as output. The output NCD file contains the routing information.
Behavioral simulation (RTL simulation) can be performed on either VHDL or Verilog designs. In this process, signals and variables are observed, procedures and functions are traced, and breakpoints are set. This is a very fast simulation and so allows the designer to change the HDL code if the required functionality is not met within a short time period. Since the design is not yet synthesized to gate level, timing and resource usage properties are still unknown.
Functional simulation (post-translate simulation) gives information about the logic operation of the circuit. The designer can verify the functionality of the design using this process after the Translate process. If the functionality is not as expected, then the designer has to make changes in the code and again follow the design flow steps.
Static timing analysis can be done after the MAP or PAR processes. The post-MAP timing report lists signal path delays of the design derived from the design logic. The post-place-and-route timing report incorporates routing delay information to provide a comprehensive timing report.