Submitted To: Dr. Anu Gupta
Submitted By: Dwip Bhavsar (2012H123118P), Himanshu Verma (2012H123120P)
Abstract
This work presents the design of an asynchronous 8x8 multiplier-accumulator (MAC) unit in UMC
180 nm technology. An asynchronous methodology is adopted because, in a synchronous design, the
overall delay is set by the delay of the slowest stage. The design is pipelined to increase
throughput, and adiabatic modules have been built to reduce power consumption. A two-phase
handshake protocol is followed, in which data is latched on both edges of the control signal.
The final unit consists of 8017 MOS devices and has a maximum pipeline operating frequency of
62.5 MHz. The power consumption is assessed to be 10.61 mW.
Contents
Abstract
1. Introduction
  1.1 Design And Operation Of A Pipelined 8x8 Adiabatic Multiplier Accumulator
  1.2 Design And Operation Of Multiplier
  1.3 Design And Operation Of Adder
2. Pipelined 8 X 8 Adiabatic Multiplier Accumulator
  2.1 Adiabatic Circuits
    2.1.1 Adiabatic Full Adder
    2.1.2 Partial Product Generator
  2.2 Mousetrap Pipeline
    2.2.1 Xnor Gate
    2.2.2 Latch
3. Multiplier
4. Adder
5. MAC
6. Simulation Results
  6.1 Statistics
  6.2 Multiplier Simulation Output
  6.3 MAC Simulation Output
7. Conclusion
8. References
1. Introduction
1.1 Design And Operation of A Pipelined 8x8 Adiabatic Multiplier Accumulator
In computing, the multiply-accumulate operation is a common step that computes the product of
two numbers and adds that product to an accumulator. The hardware unit that performs the
operation is known as a multiplier-accumulator (MAC) unit. The operation itself is also often
called a MAC or a MAC operation. The operation can be expressed as follows:
A ← A + (B × C)
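As a minimal behavioural sketch of the operation above, the following Python snippet applies the update to a stream of operand pairs. The function name `mac_step` and the 16-bit wrap-around accumulator width are assumptions made for illustration; they are not taken from the schematic.

```python
def mac_step(acc: int, b: int, c: int, acc_bits: int = 16) -> int:
    """Return acc + b * c, wrapped to acc_bits bits (assumed accumulator width)."""
    mask = (1 << acc_bits) - 1
    return (acc + b * c) & mask


# Accumulate a short stream of (B, C) operand pairs, starting from A = 0.
acc = 0
for b, c in [(3, 4), (5, 6), (7, 8)]:
    acc = mac_step(acc, b, c)

print(acc)  # 3*4 + 5*6 + 7*8 = 98
```

Each call corresponds to one MAC operation; the loop plays the role of successive operand pairs entering the unit.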
With the widespread use of mobile and wireless devices, and with clock and logic speeds rising
to meet new performance requirements, energy efficiency has become a key design aspect of
integrated circuits (ICs). For digital circuits, which mostly use complementary
metal-oxide-semiconductor (CMOS) technology, supply voltage scaling is one of the main
strategies, because dynamic power consumption is proportional to the square of the supply
voltage. To maintain high transistor drive current and thus preserve performance, transistor
thresholds must be scaled along with the supply voltage. However, scaling the threshold voltage
(Vt) results in a substantial increase in sub-threshold leakage current.
Power dissipation in conventional CMOS circuits occurs primarily during device switching. In
contrast to conventional charging, the switching transitions in adiabatic circuits are slowed
down by the use of a time-varying voltage source instead of a fixed voltage supply. Because the
charge transfer is spread more evenly over the entire time available, the peak current is
greatly reduced. In two-phase clocked adiabatic static CMOS logic (2PASCL), two MOSFET diodes
are used to recycle charge from the output node and to improve the discharging speed of
internal signal nodes. The two split-level sinusoidal waveforms, each with a peak-to-peak
voltage of 0.9 V, minimize the voltage difference between the current-carrying electrodes, and
consequently the power consumption is minimized. It has also been observed that a 2PASCL-based
NAND circuit can save up to 62% of power at transition frequencies of 10 to 100 MHz.
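The benefit of slow charging can be illustrated with the standard first-order RC model: charging a load from a fixed supply dissipates C·V²/2 regardless of the resistance, whereas charging through a linear ramp of duration T much longer than R·C dissipates roughly (R·C/T)·C·V². The sketch below compares the two. All component values are hypothetical, and the linear ramp is a simplification of the sinusoidal supplies described above.

```python
def conventional_loss(c_load: float, v_dd: float) -> float:
    """Energy dissipated when a load is charged from a fixed supply: C*V^2/2."""
    return 0.5 * c_load * v_dd ** 2


def ramp_loss(r_on: float, c_load: float, v_dd: float, t_ramp: float) -> float:
    """First-order loss for a linear supply ramp with t_ramp >> R*C: (R*C/T)*C*V^2."""
    return (r_on * c_load / t_ramp) * c_load * v_dd ** 2


# Hypothetical values chosen only for illustration (not from the design).
R_ON, C_LOAD, V_DD, T_RAMP = 1e3, 10e-15, 1.8, 10e-9

e_conv = conventional_loss(C_LOAD, V_DD)
e_ramp = ramp_loss(R_ON, C_LOAD, V_DD, T_RAMP)
print(f"conventional: {e_conv:.3e} J, ramped: {e_ramp:.3e} J, ratio: {e_conv / e_ramp:.0f}")
```

With these assumed values R·C is 10 ps against a 10 ns ramp, so the ramped loss is a small fraction of the conventional loss; the ratio shrinks as the ramp time approaches R·C.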
Figure 3 (a) 2PASCL inverter circuit. (b) Waveforms from the simulation
In 2PASCL operation, fewer dynamic switching events occur, because circuit nodes do not
necessarily charge and discharge in every clock cycle. This reduces the node switching activity
significantly, and the lower the switching activity, the lower the energy dissipation. A
further benefit is that the logic behaves like static logic.
2.1.1 Adiabatic Full Adder
The full adder (FA) is designed using two-phase clocked adiabatic static CMOS logic (2PASCL)
circuit techniques. Adiabatic logic is discussed in the introduction. The circuit diagram for
the adiabatic full adder is shown in fig. The full adder consists of two sub-circuits, sum and
carry. So, in each full adder, two source transistors are required to produce the two output
signals, sum and carry.
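The logic function realized by the two sub-circuits can be sketched at the Boolean level as follows. This models only the standard full-adder truth table, not the adiabatic transistor-level implementation; the function name is illustrative.

```python
def full_adder(a: int, b: int, cin: int) -> tuple:
    """Return (sum, carry) for one-bit inputs a, b and carry-in cin."""
    propagate = a ^ b
    s = propagate ^ cin                    # sum sub-circuit
    cout = (a & b) | (propagate & cin)     # carry sub-circuit
    return s, cout


# Print the full truth table.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            print(a, b, cin, full_adder(a, b, cin))
```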
2.1.2 Partial Product Generator
This block generates the 8-bit partial products (PP). Adiabatic AND gates are used to save
power. The circuit diagram of the partial product generator is shown in the accompanying
figure. Initially, all the partial products are generated from the inputs. However, different
groups of partial products are required in different pipeline stages, so the partial products
required in a later stage must be carried through the pipeline up to that stage to support
parallelism.
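A bit-level sketch of partial product generation is given below: each partial product bit is the AND of one multiplicand bit and one multiplier bit, and row i is weighted by 2^i. The function name and the choice to return already-shifted rows are assumptions for illustration; the pipelining of the rows is not modelled.

```python
def partial_products(a: int, b: int, n: int = 8) -> list:
    """Return n rows; row i is (a AND bit i of b), shifted left by i positions."""
    rows = []
    for i in range(n):
        b_i = (b >> i) & 1
        row = 0
        for j in range(n):
            a_j = (a >> j) & 1
            row |= (a_j & b_i) << j        # one AND gate per partial product bit
        rows.append(row << i)
    return rows


rows = partial_products(0b10111001, 0b01001010)
print([bin(r) for r in rows])
print(sum(rows))  # summing all rows gives the product
```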
2.2 Mousetrap Pipeline
Three pipeline stages are shown. Each stage consists of a data latch and a latch controller.
Adjacent stages communicate with each other using requests (reqs) and acknowledgments (acks).
The data latch is a standard level-sensitive D-type transparent latch. The latch is normally
transparent (i.e., enabled), allowing new data to pass through quickly.
A commonly used asynchronous scheme, called bundled data, is used to encode the datapath: a
control signal, reqN, indicates the arrival of new data at stage N's inputs. In particular, a
simple one-sided timing requirement must be satisfied for correct operation: reqN must arrive
after the data inputs to stage N have stabilized. (When logic processing is added to the
pipeline, the request signal in each stage is typically delayed by an amount that matches the
latency of the associated function block, i.e., by a matched delay.) Once new data has passed
through stage N's latch, doneN is produced, which is sent to its latch controller as well as to
stages N-1 and N+1.
The latch controller enables and disables the data latch. It consists of only a single XNOR
gate with two inputs: the done from the current stage, stage N, and the ack from stage N+1.
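The handshake described above can be sketched as a small step-based behavioural model. This is an illustration under simplifying assumptions, not the implemented circuit: all stages are evaluated together in discrete steps, gate delays and the bundling timing constraint are not modelled, the ack for stage N is taken to be doneN+1, and the receiving environment is assumed to acknowledge immediately. The class and function names are hypothetical.

```python
def xnor(a: int, b: int) -> int:
    return 1 if a == b else 0


class MousetrapPipeline:
    """Step-based model: each stage holds a done bit and a data word."""

    def __init__(self, n_stages: int):
        self.done = [0] * n_stages
        self.data = [None] * n_stages

    def step(self, req_in: int, data_in, ack_out: int) -> list:
        n = len(self.done)
        reqs = [req_in] + self.done[:-1]      # reqN comes from done of stage N-1
        datas = [data_in] + self.data[:-1]
        acks = self.done[1:] + [ack_out]      # ack for stage N is done of stage N+1
        enables = [xnor(self.done[i], acks[i]) for i in range(n)]
        new_done, new_data = list(self.done), list(self.data)
        for i in range(n):
            # A transparent latch passes a new request (a toggle) and its data.
            if enables[i] and reqs[i] != self.done[i]:
                new_done[i], new_data[i] = reqs[i], datas[i]
        self.done, self.data = new_done, new_data
        return enables


def run(items, n_stages: int = 3, max_steps: int = 100) -> list:
    pipe = MousetrapPipeline(n_stages)
    req_in, data_in, pending, out = 0, None, list(items), []
    for _ in range(max_steps):
        # Sender issues a new item only after the previous one was acknowledged.
        if pending and req_in == pipe.done[0]:
            data_in = pending.pop(0)
            req_in ^= 1                       # two-phase: every toggle is an event
        last_done = pipe.done[-1]
        pipe.step(req_in, data_in, ack_out=pipe.done[-1])
        if pipe.done[-1] != last_done:
            out.append(pipe.data[-1])
        if not pending and len(out) == len(items):
            break
    return out


result = run(["a", "b", "c"])
print(result)
```

In this model a stage closes its latch (enable = 0) as soon as its done differs from that of its successor, and reopens once the successor has captured the item, which is the behaviour the single XNOR gate provides.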
An alternative view of the basic pipeline is shown in Fig. 9. The latch inside a stage is shown
separated into two parts: 1) a single-bit latch that receives the incoming request reqN and
produces doneN and the outgoing request reqN+1, and 2) the remainder of the latch, which
captures the data bits. In this representation, the bit latch and the XNOR together form the
entire control circuit, which generates and receives the handshake signals to and from the
neighboring pipeline stages on the left and the right, and also produces the latch enable
signal EN, which is internal to the stage, for controlling the latching action on the datapath.
2.2.1 Xnor Gate
A    B    Xnor    Xor
0    0    1       0
0    1    0       1
1    0    0       1
1    1    1       0
Figure 11 Xnor-Xor
2.2.2 Latch
We have implemented a D-latch using transmission gates. When the enable signal, which comes
from the XNOR gate, is high, the latch is transparent and the data is passed through. When the
enable signal is low, the latch is opaque and the data is held.
Figure 12 D latch
3. Multiplier
A naive organization for adding the partial products together in an array multiplier is to
accumulate the partial products in each row and pass the accumulated partial products (using
carry-save adders) to the next row. This makes the total delay through the carry-save adders
(CSAs) proportional to N, the number of bits in the data words (there are N rows, or N/2 rows
if Booth recoding is used). This delay is too long when a fast adder, which has a delay of
about log N, is used for the last row.
The easiest way to reduce the CSA delay is to use a tree structure instead of the linearly
connected naive design. In general, we want an efficient way to add N (or N/2) partial products.
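The tree idea can be sketched with word-level 3:2 carry-save adders: each layer reduces groups of three operands to two, and a single carry-propagate addition finishes the job. The grouping below is a simple Wallace-style reduction chosen for illustration; the source does not state which tree arrangement the implemented multiplier uses, and the function names are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple:
    """3:2 compressor on whole words: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (x & z)) << 1
    return s, c


def csa_tree_sum(operands) -> tuple:
    """Reduce operands with layers of CSAs; return (total, number of CSA layers)."""
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        nxt = []
        for i in range(0, len(ops) - 2, 3):
            nxt.extend(carry_save_add(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[len(ops) - len(ops) % 3:])   # operands left ungrouped
        ops = nxt
        levels += 1
    return sum(ops), levels                          # final carry-propagate add


a, b = 0b10111001, 0b01001010
rows = [(a << i) if (b >> i) & 1 else 0 for i in range(8)]  # 8 partial products
product, levels = csa_tree_sum(rows)
print(product, levels)
```

For eight partial products this reduction needs four CSA layers (8 → 6 → 4 → 3 → 2), compared with six layers when the rows are chained linearly.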
4. Adder
The adder is the fundamental block in any arithmetic unit, and is often the speed-limiting
circuit in a digital system. Hence, many parallel adder architectures have been proposed to
increase speed with reasonable area and power dissipation. One of the fastest and most
efficient architectures in terms of area and power dissipation is the Carry Bypass Adder (CBA).
An N-bit CBA is made up of N full adders, which are grouped together into blocks whose size
(i.e., the number of full adders per block) has to be properly chosen to minimize the time
needed for a computation. The CBA architecture can be derived from that of a simple Ripple
Carry Adder by observing that, when some contiguous full adders are all in propagate mode
(i.e., each of them has a carry output equal to its carry input), they can be bypassed to
evaluate the carry output of the last one, since it is equal to the carry input of the first.
Hence, in a CBA the full adders are divided into groups, each of which is bypassed by a
multiplexer if its full adders are all in propagate mode.
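The bypass rule can be sketched functionally as follows. The 16-bit width and the block size of 4 are assumptions for illustration, since the block size of the implemented adder is not stated here. The model shows which blocks would take the bypass path; it does not model the resulting delay saving.

```python
def carry_bypass_add(a: int, b: int, cin: int = 0, n: int = 16, block: int = 4) -> tuple:
    """Return (sum, carry_out, bypassed_blocks) for n-bit operands a and b."""
    carry, total, bypassed = cin, 0, 0
    for lo in range(0, n, block):
        block_cin = carry
        all_propagate = True
        for i in range(lo, min(lo + block, n)):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            p = ai ^ bi                          # propagate signal of this full adder
            total |= (p ^ carry) << i
            carry = (ai & bi) | (p & carry)      # ripple carry inside the block
            all_propagate = all_propagate and p == 1
        if all_propagate:
            # The multiplexer forwards the block carry-in directly; the value is
            # the same as the rippled carry, but it is available earlier in hardware.
            carry = block_cin
            bypassed += 1
    return total, carry, bypassed


print(carry_bypass_add(0x0F0F, 0xF0F0, cin=1))  # every block is in propagate mode
```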
5. MAC
The MAC schematic has been built using the asynchronous mousetrap-pipelined multiplier and the
carry-bypass adder. A latch is also used to hold the previous result, so that the next
multiplier output can be added to it.
Figure 20 MAC
6. Simulation Results
Initial MAC output: 00000000 00000000 (all values in binary)

Dataset   Multiplier input 1   Multiplier input 2   Multiplier output     MAC output
1         10111001             01001010             00110101 01111010     00110101 01111010
2         10111001             11001010             10010001 11111010     11000111 01110100
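The tabulated outputs can be cross-checked with a short script. It assumes unsigned operands and a 16-bit accumulator that starts at zero, which is consistent with the values listed above; the helper name `fmt16` is illustrative.

```python
datasets = [
    (0b10111001, 0b01001010),   # dataset 1
    (0b10111001, 0b11001010),   # dataset 2
]


def fmt16(x: int) -> str:
    """Format a value as two space-separated bytes of binary digits."""
    bits = f"{x & 0xFFFF:016b}"
    return bits[:8] + " " + bits[8:]


acc = 0                          # initial MAC output
rows = []
for a, b in datasets:
    product = a * b
    acc = (acc + product) & 0xFFFF
    rows.append((fmt16(product), fmt16(acc)))
    print(rows[-1])
```

The printed pairs are the multiplier output and the MAC output for each dataset, in the same format as the simulation results.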
6.1 Statistics
No. of MOS devices (bsim3v3): 8017
No. of equations: 22290
Simulation time: 14 min 31.2 sec
Power: 10.61 mW
Latency: 47.8 ns
Throughput: one result every 16 ns
Max frequency of pipeline operation: 62.5 MHz
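The relationship between the reported figures can be sketched as simple arithmetic: the maximum pipeline frequency is the reciprocal of the time between successive results, and a stream of K operations takes roughly the latency plus (K - 1) cycle times. The value of K below is hypothetical.

```python
CYCLE_TIME_NS = 16.0    # reported time between successive results
LATENCY_NS = 47.8       # reported input-to-output latency

max_freq_mhz = 1000.0 / CYCLE_TIME_NS     # 1 / 16 ns expressed in MHz
print(max_freq_mhz)  # 62.5

K = 100                                   # hypothetical number of back-to-back operations
total_ns = LATENCY_NS + (K - 1) * CYCLE_TIME_NS
print(total_ns)
```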
7. Conclusion
An adiabatic, two-phase, bundled-data 8x8 multiplier-accumulator unit has been realized using
UMC 180 nm technology. The multiplication is performed by an asynchronous mousetrap pipeline
and has been split into 7 pipeline stages. The pipeline is an event-driven design style, with
XNOR elements being used to manage the events on the bundled-data interface. Event-driven
interface protocols permit old components to be replaced by new ones with improved throughput,
latency, or cost characteristics. Because the handshake used here automatically takes care of
delays in delivering or making use of data, such replacements can be made with assurance that
the system will still operate properly. The pipelined design with an asynchronous control
transfer scheme has made increased throughput possible, with good utilization of the underlying
processing hardware. The adiabatic modules have resulted in low power consumption, with the
final unit being assessed at 10.61 mW. Carry-save multipliers have been used in order to reduce
multiplier latency, and a carry-bypass accumulator has been used as the final stage. The
maximum frequency of pipeline operation is 62.5 MHz, and the design has resulted in 8017 MOS
devices.