
A Report

Design of 2-Phase Bundled Data Mousetrap 8x8 Adiabatic Multiplier-Accumulator Unit

Submitted in partial fulfilment of the requirements of the course


MEL G623: ADVANCED VLSI DESIGN

Submitted To
Dr. Anu Gupta

Submitted By
Dwip Bhavsar (2012H123118P)
Himanshu Verma (2012H123120P)

Birla Institute of Technology & Science, Pilani


November 2013


Abstract
This work presents the design of an asynchronous 8x8 Multiplier-Accumulator Unit in UMC
180 nm technology. The design follows the assigned asynchronous methodology because, in a
synchronous design, the overall delay is set by the delay of the slowest stage. The design is
pipelined to increase throughput, and adiabatic modules have been built to reduce power
consumption. A two-phase protocol is followed, in which data is latched on both the rising and
the falling transitions of the handshake signals. The final unit consists of 8017 MOS devices
and has a maximum frequency of pipeline operation of 62.5 MHz. The power consumption is
assessed to be 10.61 mW.


Contents
Abstract
1. Introduction
  1.1 Design And Operation Of A Pipelined 8x8 Adiabatic Multiplier Accumulator
  1.2 Design And Operation Of Multiplier
  1.3 Design And Operation Of Adder
2. Pipelined 8 X 8 Adiabatic Multiplier Accumulator
  2.1 Adiabatic Circuits
    2.1.1 Adiabatic Full Adder
    2.1.2 Partial Product Generator
  2.2 Mousetrap Pipeline
    2.2.1 Xnor Gate
    2.2.2 Latch
3. Multiplier
4. Adder
5. MAC
6. Simulation Results
  6.1 Statistics
  6.2 Multiplier Simulation Output
  6.3 MAC Simulation Output
7. Conclusion
8. References


LIST OF FIGURES

FIGURE 1 CARRY SAVE MULTIPLIER
FIGURE 2 CARRY BYPASS ADDER
FIGURE 3 (A) 2PASCL INVERTER CIRCUIT. (B) WAVEFORMS FROM THE SIMULATION
FIGURE 4 ADIABATIC FULL ADDER
FIGURE 5 8BIT RIPPLE ADDER
FIGURE 6 ADIABATIC NAND-AND
FIGURE 7 PARTIAL-PRODUCT GENERATION
FIGURE 8 3-STAGE MOUSETRAP PIPELINE
FIGURE 9 SINGLE STAGE OF MOUSETRAP
FIGURE 10 MOUSETRAP PIPELINE WITH LOGIC BLOCK
FIGURE 11 XNOR-XOR
FIGURE 12 D LATCH
FIGURE 13 16-BIT LATCH
FIGURE 14 CARRY SAVE MULTIPLIER BLOCK
FIGURE 15 SINGLE STAGE OF CARRY SAVE MULTIPLIER
FIGURE 16 8X8 MULTIPLIER
FIGURE 17 CARRY BYPASS
FIGURE 18 CARRY BYPASS SINGLE STAGE
FIGURE 19 CARRY BYPASS SCHEMATIC
FIGURE 20 MAC
FIGURE 21 MAC: MULTIPLIER
FIGURE 22 MAC: ACCUMULATOR
FIGURE 23 SIMULATION RESULTS: 1
FIGURE 24 SIMULATION RESULTS: 2
FIGURE 25 SIMULATION RESULTS: 3
FIGURE 26 SIMULATION RESULTS: 4
FIGURE 27 SIMULATION RESULTS: 5
FIGURE 28 SIMULATION RESULTS: 6
FIGURE 29 SIMULATION RESULTS: 7


1. Introduction
1.1 Design And Operation of A Pipelined 8x8 Adiabatic Multiplier Accumulator
In computing, the multiply-accumulate operation is a common step that computes the product of
two numbers and adds that product to an accumulator. The hardware unit that performs the
operation is known as a multiplier-accumulator (MAC) unit. The operation itself is also often
called a MAC or a MAC operation. The operation can be written as follows:
A ← A + (B × C)
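As an illustration of the accumulate step, the following Python sketch models a MAC on unsigned operands. The 8-bit operand width follows the 8x8 design described in this report; the 16-bit accumulator that wraps around on overflow and the names `mac` and `dot_product` are assumptions made for the example, not details taken from the design.

```python
ACC_BITS = 16  # accumulator width; wrap-around on overflow is an assumption
OP_BITS = 8    # operand width of the 8x8 multiplier


def mac(acc, b, c):
    """Return (acc + b * c) truncated to ACC_BITS for unsigned 8-bit b and c."""
    limit = 1 << OP_BITS
    if not (0 <= b < limit and 0 <= c < limit):
        raise ValueError("operands must fit in 8 bits")
    return (acc + b * c) & ((1 << ACC_BITS) - 1)


def dot_product(xs, ys):
    """Accumulate the products of paired operands, one MAC step per pair."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc


if __name__ == "__main__":
    print(dot_product([1, 2, 3], [4, 5, 6]))  # 4 + 10 + 18 = 32
```

A dot product is used here because it is the typical workload in which one accumulator register is reused across many multiplications.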
With the widespread use of mobile and wireless devices and the increase of clock and logic
speeds to meet new performance requirements, energy efficiency has become a key design
aspect in the field of integrated circuits (ICs). For digital circuits, which mostly use
Complementary Metal-Oxide-Semiconductor (CMOS) technology, voltage scaling is one of the main
strategies, as the dynamic power consumption is proportional to the square of the supply voltage.
To maintain high transistor drive current and thus achieve performance improvements, transistor
thresholds must be scaled along with the supply voltage. However, scaling the threshold voltage
Vt results in a substantial increase in sub-threshold leakage current.
Power dissipation in conventional CMOS circuits primarily occurs during device switching. In
adiabatic circuits, by contrast, the switching transition is slowed down by using a time-varying
voltage source instead of a fixed supply voltage. By spreading the charge transfer more evenly
over the entire time available, the peak current is greatly reduced. In the two-phase clocked
adiabatic static CMOS logic (2PASCL) used in this work, two MOSFET diodes recycle charge from
the output node and improve the discharging speed of internal signal nodes. By using two
split-level sinusoidal waveforms with a peak-to-peak voltage of 0.9 V, the voltage difference
between the current-carrying electrodes is minimized, and consequently so is the power
consumption. A 2PASCL-based NAND circuit has been reported to save up to 62% in power
dissipation at transition frequencies of 10 to 100 MHz.

1.2 Design And Operation Of Multiplier


The design of the Multiplier-Accumulator (MAC) is highly modular and is built from a small set
of basic blocks. The 8x8 multiplication is distributed over 8 pipeline stages, and a carry-save
multiplier is used so that carries do not ripple within a row, which reduces multiplier latency.
This structure has the advantage that its worst-case critical path is shorter and uniquely
defined. The basic layout of the carry-save multiplier is shown in Figure 1. The multiplier is
built from the blocks described below.


Figure 1 Carry Save Multiplier

1.3 Design And Operation Of Adder


The adder is the fundamental block in any arithmetic unit, and is often the speed-limiting circuit
in a digital system. Hence, many parallel adder architectures have been proposed to increase
speed with reasonable area and power dissipation. One of the fastest and most efficient
architectures in terms of area and power dissipation is the Carry Bypass Adder (CBA).
An N-bit CBA is made up of N full adders, which are grouped together into blocks whose
size (i.e., the number of full adders per block) has to be chosen properly to minimize the time
needed for a computation. The CBA architecture can be derived from that of a simple Ripple
Carry Adder by observing that, when some contiguous full adders are all in propagate mode (i.e.,
each of them has a carry output equal to its carry input), they can be bypassed when evaluating
the carry output of the last one, since it is equal to the carry input of the first. Hence, in a
CBA the full adders are divided into groups, and each group is bypassed by a multiplexer if its
full adders are all in propagate mode.
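The bypass rule can be sketched as a bit-level Python model. This is a behavioural illustration only: the 16-bit width, the block size of 4 and the function names are assumptions chosen for the example, and the model says nothing about the timing benefit, which comes from the multiplexer path being shorter than the ripple path.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def carry_bypass_add(a, b, cin=0, width=16, block=4):
    """Add two unsigned integers block by block with a carry-bypass mux.

    Returns (sum, carry_out). width and block are illustrative assumptions.
    """
    result = 0
    carry = cin
    for base in range(0, width, block):
        block_cin = carry
        all_propagate = 1
        c = block_cin
        for i in range(base, min(base + block, width)):
            ai = (a >> i) & 1
            bi = (b >> i) & 1
            all_propagate &= ai ^ bi        # propagate signal p_i = a_i XOR b_i
            s, c = full_adder(ai, bi, c)    # ripple inside the block
            result |= s << i
        # Bypass multiplexer: if every bit propagates, forward the block's
        # carry-in directly instead of waiting for the rippled carry.
        carry = block_cin if all_propagate else c
    return result, carry


if __name__ == "__main__":
    total, cout = carry_bypass_add(0x357A, 0x91FA)
    print(hex(total), cout)
```

When all bits of a block propagate, the rippled carry equals the block's carry-in, so the multiplexer changes only the delay and never the result.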

Figure 2 Carry Bypass Adder


2. Pipelined 8 X 8 Adiabatic Multiplier Accumulator


2.1 Adiabatic Circuits
Figure 3 shows a circuit diagram and waveforms illustrating the operation of the 2PASCL
inverter [10]. The two MOSFET diodes are used to recycle charge from the output node and to
improve the discharging speed of internal signal nodes. Such a circuit design is particularly
advantageous if the signal nodes are preceded by a long chain of switches. By using two
split-level sinusoidal waveforms, which have peak-to-peak voltages of 0.9 V, the voltage
difference between the current-carrying electrodes can be minimized, and consequently power
consumption can be suppressed. The substrates of the pMOS and nMOS transistors are
connected to and GND respectively. Since the criterion for maintaining thermal equilibrium,
namely that the voltage between the current-carrying electrodes is zero when the transistors are
in the ON state [4], is satisfied, the energy accumulated in the load capacitance CL is not
dissipated. Moreover, sinusoidal waveforms can be generated with a higher energy efficiency than
trapezoidal waveforms.

Figure 3 (a) 2PASCL inverter circuit. (b) Waveforms from the simulation

In 2PASCL operation, fewer dynamic switching events occur, because circuit nodes do not
necessarily charge and discharge in every clock cycle; this reduces the node switching
activity significantly. The lower the switching activity, the lower the energy dissipation. A
further benefit is that the logic behaves like static logic.
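The energy argument behind adiabatic charging can be made concrete with the standard first-order model of a load capacitance charged through a resistive switch. This model is a textbook approximation rather than a result of this report; the on-resistance R, the ramp time T and the voltage swing V are symbols introduced here for illustration.

```latex
E_{\text{conv}} = \tfrac{1}{2}\, C_L V^{2}
\qquad\qquad
E_{\text{adiab}} \approx \frac{R\, C_L}{T}\; C_L V^{2}
\quad (T \gg R\, C_L)
```

With a fixed supply the dissipated energy is independent of how the node is charged, whereas with a slowly ramped supply it falls in proportion to 1/T, which is why slowing the transition saves energy.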


2.1.1 Adiabatic Full Adder

The full adder (FA) is designed using two-phase clocked adiabatic static CMOS logic (2PASCL)
circuit techniques, which are discussed in the introduction. The circuit diagram of the adiabatic
full adder is shown in Figure 4. The full adder consists of two sub-circuits, sum and carry, so
each full adder requires two source transistors to output the two signals.

Figure 4 Adiabatic Full Adder


Figure 5 8bit Ripple Adder

2.1.2 Partial Product Generator

This block generates the 8-bit partial products (PP). Adiabatic AND gates are used to save
power; the circuit diagram of the adiabatic NAND-AND gate is shown in Figure 6. Initially, all
the partial products are generated from the inputs. However, different groups of partial products
are required in different pipeline stages, so the partial products needed in a later stage must
be carried through the pipeline up to that stage to support parallelism.
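Partial-product generation amounts to one AND gate per bit pair. The Python sketch below is a purely functional model under that reading; the function names are hypothetical, and the pipelining of rows to later stages described above is not modelled.

```python
def partial_products(a, b, width=8):
    """Return width rows; rows[j][i] = a_i AND b_j (one AND gate per bit pair)."""
    rows = []
    for j in range(width):
        bj = (b >> j) & 1
        rows.append([((a >> i) & 1) & bj for i in range(width)])
    return rows


def sum_partial_products(rows):
    """Add every partial-product bit with its weight 2**(i + j)."""
    total = 0
    for j, row in enumerate(rows):
        for i, bit in enumerate(row):
            total += bit << (i + j)
    return total


if __name__ == "__main__":
    rows = partial_products(0b10111001, 0b01001010)
    print(sum_partial_products(rows))  # 185 * 74 = 13690
```

Row j is the multiplicand gated by bit j of the multiplier, so in a pipelined array only row j has to be available at stage j; the remaining rows travel through the latches until they are consumed.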


Figure 6 Adiabatic Nand-And

Figure 7 Partial-product generation


2.2 Mousetrap Pipeline


MOUSETRAP is an asynchronous pipeline style intended for high-speed applications [2].
The pipeline uses standard transparent latches and static logic in its datapath, and small latch
controllers consisting of only a single gate per pipeline stage. This simple structure is combined
with an efficient and highly concurrent event-driven protocol between adjacent stages.

Figure 8 3-stage Mousetrap Pipeline

Three pipeline stages are shown. Each stage consists of a data latch and a latch controller.
Adjacent stages communicate with each other using requests (reqs) and acknowledgments (acks).
The data latch is a standard level-sensitive D-type transparent latch. The latch is normally
transparent (i.e., enabled), allowing new data to pass through quickly.
A commonly used asynchronous scheme, called bundled data, is used to encode the datapath: a
control signal, reqN, indicates the arrival of new data at stage N's inputs. In particular, a
simple one-sided timing requirement must be satisfied for correct operation: reqN must arrive
after the data inputs to stage N have stabilized. (When logic processing is added to the pipeline,
the request signal in each stage is typically delayed by an amount that matches the latency of the
associated function block, i.e., by a matched delay.) Once new data has passed through stage N's
latch, doneN is produced, which is sent to its latch controller, as well as to stages N-1 and N+1.
The latch controller enables and disables the data latch. It consists of only a single XNOR gate
with two inputs: the done from the current stage, stage N, and the ack from stage N+1.
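The handshake described above can be explored with a small behavioural model. The Python sketch below is a simplification under stated assumptions: every latch is updated once per discrete step, gate and matched delays are ignored, and the sender and receiver respond immediately. The names `Stage`, `step` and `run` are hypothetical and do not correspond to cells in the schematic.

```python
def xnor(a, b):
    """Latch controller: 1 (transparent) when done and ack are equal."""
    return 1 - (a ^ b)


class Stage:
    """One pipeline stage: a request bit (done) bundled with a data word."""

    def __init__(self):
        self.done = 0
        self.data = None


def step(stages, req_in, data_in, ack_out):
    """Evaluate all latch enables from the current state, then update latches."""
    n = len(stages)
    enables = [
        xnor(st.done, stages[i + 1].done if i + 1 < n else ack_out)
        for i, st in enumerate(stages)
    ]
    nxt = []
    for i, st in enumerate(stages):
        if enables[i]:  # transparent: copy request and data from the left
            if i == 0:
                nxt.append((req_in, data_in))
            else:
                nxt.append((stages[i - 1].done, stages[i - 1].data))
        else:           # opaque: hold the captured token
            nxt.append((st.done, st.data))
    for st, (done, data) in zip(stages, nxt):
        st.done, st.data = done, data


def run(tokens, n_stages=3, max_steps=100):
    """Push tokens through the pipeline with two-phase (transition) signalling."""
    stages = [Stage() for _ in range(n_stages)]
    req_in, data_in, ack_out = 0, None, 0
    pending, received = list(tokens), []
    for _ in range(max_steps):
        # Sender: previous token acknowledged, so toggle req with new data.
        if pending and stages[0].done == req_in:
            data_in = pending.pop(0)
            req_in ^= 1
        step(stages, req_in, data_in, ack_out)
        # Receiver: a transition on the last done marks a new token.
        if stages[-1].done != ack_out:
            received.append(stages[-1].data)
            ack_out = stages[-1].done
    return received


if __name__ == "__main__":
    print(run(["A", "B", "C"]))  # ['A', 'B', 'C']
```

Each token is marked by a transition of the request bit rather than by a level, which is what makes the protocol two-phase: a stage closes its latch as soon as its done differs from the ack of the next stage and reopens when the two agree again.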


Figure 9 Single stage of Mousetrap

An alternate view of the basic pipeline is shown in Figure 9. The latch inside a stage is shown
separated into two parts: 1) a single-bit latch that receives the incoming request reqN and
produces doneN and the outgoing request reqN+1, and 2) the remainder of the latch, which
captures the data bits. In this representation, the bit latch and the XNOR together form the entire
control circuit that generates and receives the handshake signals from the neighboring pipeline
stages on the left and the right, and also produces the latch enable signal EN, which is internal
to the stage, for controlling the latching action on the datapath.

Figure 10 Mousetrap Pipeline with logic block


2.2.1 Xnor Gate

A    B    Xnor    Xor
0    0    1       0
0    1    0       1
1    0    0       1
1    1    1       0

Figure 11 Xnor-Xor

2.2.2 Latch

We have implemented a D-latch using transmission gates. When the enable signal, which comes
from the XNOR gate, is high, the latch is transparent and the data is passed through. When the
enable signal is low, the latch becomes opaque and the data is held.


Figure 12 D latch

Figure 13 16-bit latch


3. Multiplier
A naive organization for adding the partial products together in an array multiplier is to
accumulate the partial products in each row and pass the accumulated partial products (using
carry-save adders) to the next row. This makes the total delay through the carry-save adders
proportional to N, the number of bits in the data words (there are N rows, or N/2 rows if Booth
recoding is used). This delay is too long if a fast adder with a delay of about log N is used
for the last row.
The easiest way to reduce the carry-save adder (CSA) delay is to use a tree structure instead of
the linearly connected naive design. In general, we want an efficient way to add N (or N/2)
partial products.
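The row-by-row organization can be illustrated with a word-level Python sketch of carry-save accumulation. It models the linear (array) arrangement described first, not a tree, and it operates on whole integers rather than on individual full-adder cells; the function names are assumptions for the example.

```python
def carry_save_add(x, y, z):
    """3:2 compression: returns (sum_word, carry_word) with s + c == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def carry_save_multiply(a, b, width=8):
    """Accumulate one partial product per row, keeping sum and carry separate."""
    s, c = 0, 0
    for j in range(width):
        pp = (a << j) if (b >> j) & 1 else 0   # row j of partial products
        s, c = carry_save_add(s, c, pp)        # no carry ripples inside a row
    # A single carry-propagate addition is needed only at the very end.
    return s + c


if __name__ == "__main__":
    print(bin(carry_save_multiply(0b10111001, 0b01001010)))
```

Because each row only compresses three words into two, its delay is one full-adder delay regardless of word width; the price is the final carry-propagate addition.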

Figure 14 Carry Save Multiplier Block


Figure 15 Single Stage of Carry Save Multiplier

Figure 16 8x8 Multiplier


4. Adder
The accumulator uses the Carry Bypass Adder (CBA) described in Section 1.3. Figure 17 shows
the carry-bypass arrangement, Figure 18 a single carry-bypass stage, and Figure 19 the
carry-bypass schematic.

Figure 17 Carry bypass


Figure 18 Carry bypass Single stage

Figure 19 Carry bypass Schematic


5. MAC
The MAC schematic combines the asynchronous MOUSETRAP pipelined multiplier with the carry
bypass adder. A latch holds the previous result so that the next multiplier output can be
added to it.

Figure 20 MAC


Figure 21 MAC: Multiplier

Figure 22 MAC: Accumulator


6. Simulation Results
Initial MAC Output: 2b00000000 00000000

Dataset   Multiplier input 1   Multiplier input 2   Multiplier output       MAC output
1         2b10111001           2b01001010           2b00110101 01111010     2b00110101 01111010
2         2b10111001           2b11001010           2b10010001 11111010     2b11000111 01110100
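The two datasets above can be cross-checked arithmetically. The Python sketch below replays them, reading every value as an unsigned binary number with the `2b` prefix dropped; that interpretation, the 16-bit accumulator mask and the helper names are assumptions made for the check.

```python
def parse(bits):
    """Convert a string such as '00110101 01111010' to an integer."""
    return int(bits.replace(" ", ""), 2)


DATASETS = [
    # (input 1, input 2, reported multiplier output, reported MAC output)
    ("10111001", "01001010", "00110101 01111010", "00110101 01111010"),
    ("10111001", "11001010", "10010001 11111010", "11000111 01110100"),
]


def replay(datasets, acc=0):
    """Return, per dataset, whether product and accumulator match the report."""
    results = []
    for in1, in2, mult_out, mac_out in datasets:
        product = parse(in1) * parse(in2)
        acc = (acc + product) & 0xFFFF  # assumed 16-bit accumulator
        results.append((product == parse(mult_out), acc == parse(mac_out)))
    return results


if __name__ == "__main__":
    for index, (mult_ok, mac_ok) in enumerate(replay(DATASETS), start=1):
        print(f"dataset {index}: multiplier ok={mult_ok}, MAC ok={mac_ok}")
```

In decimal, dataset 1 is 185 × 74 = 13690 and dataset 2 is 185 × 202 = 37370, and the accumulated value after both is 51060.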

6.1 Statistics
No. of MOS devices (bsim3v3)         : 8017
No. of equations                     : 22290
Simulation time                      : 14 min 31.2 sec
Power                                : 10.61 mW
Latency                              : 47.8 ns
Throughput                           : 16 ns
Max frequency of pipeline operation  : 62.5 MHz
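The maximum frequency follows directly from the throughput figure, assuming that the 16 ns entry is the time between successive results (the pipeline cycle time); the symbol T_cycle is introduced here only for this calculation.

```latex
f_{\max} = \frac{1}{T_{\text{cycle}}} = \frac{1}{16\ \text{ns}} = 62.5\ \text{MHz}
```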


6.2 Multiplier Simulation Output

Figure 23 Simulation Results: 1


Figure 24 Simulation Results: 2

Figure 25 Simulation Results: 3


6.3 MAC Simulation Output

Figure 26 Simulation Results: 4

Figure 27 Simulation Results: 5


Figure 28 Simulation Results: 6

Figure 29 Simulation Results: 7


7. Conclusion
An adiabatic two-phase bundled-data 8x8 Multiplier-Accumulator Unit has been realized in
UMC 180 nm technology. The multiplication is performed by an asynchronous MOUSETRAP
pipeline and has been split into 7 pipeline stages. The pipeline is an event-driven design style,
with XNOR elements being used to manage the events on the bundled-data interface.
Event-driven interface protocols permit old components to be replaced by new ones with improved
throughput, latency, or cost characteristics. Because the handshake used here automatically takes
care of delays in delivering or making use of data, such replacements can be made with assurance
that the system will still operate properly. The pipelined design with an asynchronous control
transfer scheme has made increased throughput possible with the best utilization of the
underlying processing hardware. The adiabatic modules have resulted in low power consumption,
with the final unit being assessed at 10.61 mW. Carry-save multipliers have been used in order to
reduce multiplier latency, and a carry-bypass accumulator has been used as the final stage. The
maximum frequency of pipeline operation is 62.5 MHz, and the design comprises 8017 MOS devices.


8. References
[1] Dusan Suvakovic and C. Andre T. Salama, "Energy efficient adiabatic multiplier-accumulator
design," Journal of VLSI Signal Processing, Vol. 33, pp. 83-103, 2003.
[2] Montek Singh and Steven M. Nowick, "MOUSETRAP: High-Speed Transition-Signaling
Asynchronous Pipelines," IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
Vol. 15, No. 6, June 2007.
[3] N. Anuar, Y. Takahashi and T. Sekine, "4-bit ripple carry adder using two-phase clocked
adiabatic static CMOS logic," Proc. IEEE TENCON 2009.
[4] M. Alioto and G. Palumbo, "Performance evaluation of adiabatic gates," IEEE Transactions
on Circuits and Systems, Vol. 47, Issue 9, Sep. 2000, pp. 1297-1308.
[5] N. Anuar, Y. Takahashi and T. Sekine, "Adiabatic logic versus CMOS for low power
applications," Proc. ITC-CSCC 2009, pp. 302-305, Jul. 2009.
[6] V. I. Staroselskii, "Adiabatic logic circuits: A review," Russian Microelectronics, Vol. 31,
Issue 1, 2002, pp. 37-58.
[7] N. Anuar, Y. Takahashi and T. Sekine, "Fundamental logics based on two phase clocked
adiabatic static CMOS logic," Proc. IEEE ICECS 2009, pp. 503-506.
[8] J. M. Rabaey et al., "Digital Integrated Circuits: A Design Perspective," 2nd edition, PHI.
[9] J. Sparsø and S. Furber, "Principles of Asynchronous Circuit Design: A Systems Perspective,"
Kluwer Academic Publishers.

