FPGA Implementation of Low Power Parallel Multiplier

Sanjiv Kumar Mangal
Dept. of Electronics and Computer Science
VNIT, Nagpur, Maharashtra, India
sanjivmangal@yahoo.co.in

Rahul M. Badghare
Dept. of Electronics and Computer Science
VNIT, Nagpur, Maharashtra, India
badghare_rahul@yahoo.co.in

Raghavendra B. Deshmukh
Dept. of Electronics and Computer Science
VNIT, Nagpur, Maharashtra, India
rbdeshmukh@vnitnagpur.co.in

R. M. Patrikar
Dept. of Electronics and Computer Science
VNIT, Nagpur, Maharashtra, India
rajendra@computer.org

Abstract

In the fast-growing communication field, the demand for low power designs is increasing in order to reduce power losses and, in the same ratio, thermal losses. The multiplier is an arithmetic circuit that is used extensively in common DSP and communication applications. This paper presents a low power multiplier design methodology that inserts more zeros in the multiplicand, thereby reducing the switching activity as well as the power consumption. The use of a look up table is an added feature of this design. Modifying the structure of the adders further reduces switching activity.

Index Terms: Low Power, Multiplier, Reduced Switching, Column Bypassing.

1. Introduction

As we get closer to the limits of scaling in complementary metal-oxide-semiconductor (CMOS) circuits, power and heat dissipation issues are becoming more and more important. In recent years, the impact of pervasive computing and the internet has accelerated this trend. The applications for these domains typically run on battery-powered embedded systems. The resultant constraints on the energy budget require design for power as well as design for performance at all layers of system design. Thus reducing power consumption is a key design goal for portable computing and communication devices that employ increasingly sophisticated and power-hungry signal processing techniques. Flexibility is another critical requirement that mandates the use of programmable components like FPGAs in such devices. However, there is a fundamental trade-off between efficiency and flexibility, and as a result, programmable designs incur significant performance and power penalties compared to application specific solutions. Consequently, various digital signal-processing chips are now designed with low power dissipation.

Signal processing applications typically exhibit high degrees of parallelism and are dominated by a few regular kernels of computation, such as multiplication, that are responsible for a large fraction of execution time and energy. In such systems, the multiplier is a fundamental arithmetic unit [1]. Shrinking feature sizes are responsible for increasing thermal-related problems as well. The on-chip temperature in current processors can vary by as much as several tens of degrees from one portion of the chip to the other, with the maximum temperature reaching as high as 100 degrees C. The temperature gradient formed by such units can be a major source of inaccuracy in delay and clock skew computations [2].

1.1. Power Reduction Techniques

The power dissipation in a CMOS circuit has several components that are usually estimated from the device parameters of the technology used. The total power in the circuit is given by the following equation:

P_{total} = P_{switching} + P_{shortcircuit} + P_{static} + P_{leakage}    (1)

where P_{switching} is the switching component of the power and is the dominant component in these calculations, P_{shortcircuit} is the power dissipated because the PMOS and NMOS transistors of a CMOS gate conduct simultaneously during a transition at the input, P_{static} is the contribution due to the biasing current required for the

20th International Conference on VLSI Design (VLSID'07)


0-7695-2762-0/07 $20.00 2007
Authorized licensed use limited to: Jeppiaar Engineering College. Downloaded on July 30,2010 at 14:01:26 UTC from IEEE Xplore. Restrictions apply.
device, and P_{leakage} is the power consumption due to the reverse biased P-N junctions in the circuit.

In FPGA designs, power can be reduced only by lowering the power due to switching activity, which is also called dynamic power. In general, dynamic power consumption is defined as the power consumed while the clock is running and the external inputs are switching. Dynamic power has several components, mainly capacitive-load charging and discharging (internal and I/Os) and short-circuit current. Most of the dynamic power is consumed by charging and discharging capacitance, internal and external to the device. If the device is driving many I/O loads, the dynamic current due to the I/Os becomes a substantial part of the entire power consumption. Since the voltage VDD is fixed, internal dynamic power reduction is achieved by:

- Decreasing the average logic-switching frequency
- Reducing the amount of logic switching at each clock edge
- Reducing the propagation of the switching activity and lowering the capacitance of the routing network, especially for high-frequency signals [3].

For low-power designs, precautions need to be taken at each abstraction level, from the system level to the technology process level. The higher the abstraction level at which a power-reduction decision is taken, the greater its impact. In general, switching activity can be reduced at various levels of the design flow. Architectural decisions in the early design phases have the greatest impact. For high switching signals, delay balancing and reduction of the number of logic levels are among the most efficient techniques to tackle the power penalty. An obvious method to reduce the switching activity is to shut down the idle part of the circuit, which is not in operating condition. Furthermore, a low power adder structure reduces the switching activity [4]. This paper presents a multiplier design in which switching activity is reduced through architecture optimization. The paper is organized as follows. Section 2 covers implementation issues and preliminary work on different multiplier architectures. Section 3 deals with the proposed architecture. Section 4 gives the implementation results, and the conclusion is given in section 5.

2. Implementation Issues

2.1. Conventional Parallel Multiplier

A general M x N parallel multiplier operates by computing the partial products in parallel and by shifting and accumulating the partial products. Switching activity is poorly correlated with the input coefficient. In particular, reducing the switching activity of the components used in the design can minimize the power dissipation, i.e. if the kth bit of the coefficient is zero, the kth row of adders need not be activated. However, this type of multiplier does not help in reducing switching, since the adders switch unnecessarily even if the kth bit is zero.

2.2. Related Research

Figure 1 shows the 4 x 4 row bypassing architecture with reduced switching [5].

Figure 1: - 4x4 Multiplier with reduced switching activity

The demerit of this technique is that it needs extra correction circuitry, shown in the ellipse. The structure of the full adder is complex as well.

Figure 2: - 4x4 Braun multiplier

The structure shown in figure 2 eliminates the extra correction circuitry. On the other hand, the
Baugh-Wooley multiplier [6] uses the same array structure to handle 2's complement multiplication, with some of the partial products replaced by their complements. The Braun multiplier removes the extra correction circuitry needed. Also, the number of adders is smaller. However, the limitation of this technique is that it cannot stop the switching activity even if the bit coefficient is zero, which ultimately results in unnecessary power dissipation. Other low power designs disable the operation in some rows [7][8].

Ming-Chen Wen et al. [9] proposed a technique that reduces the switching to a fairly good extent. Figure 3 shows the 4x4 column bypassing multiplier structure.

Figure 3: - 4 x 4 Column bypassing multiplier structure

Consider the multiplication of 1010 x 1000. Since the multiplicand contains two zeros, the corresponding columns, i.e. the first and third, will get disabled. Now consider another multiplication, 1111 x 1000. Since the multiplicand contains no zero, all columns will get switched.

The limitation of this technique is that the number of columns switched depends on the number of ones in the multiplicand. For example, if the multiplicand is 16 bits long, such as 1111111111111111, then all the full adders in all the columns will get switched and consume more power. Less switching activity of the components can be achieved if the multiplicand contains more zeros than ones. In our proposed technique, however, the above multiplicand is represented as 1000000000000000b, where b is -1, which will switch only two columns and ultimately reduces the switching and thus the power dissipation.

3. Proposed Design and Implementation

Higher power reduction can be achieved if the multiplicand contains more 0s than 1s [5]. In this approach we propose a Binary / Booth Recoding Unit which will force the multiplicand to have more zeros. The advantage here is that if the multiplicand contains a run of successive ones, then the booth-recoding unit converts these ones to zeros.

3.1. Approach

The switching activity of the components used in the design depends on the input bit coefficient. This means that if the input bit coefficient is zero, the corresponding row or column of adders need not be activated. If the multiplicand contains more zeros, higher power reduction can be achieved. We propose a Binary / Booth Recoding Unit which will force the multiplicand to have more zeros. Consider the multiplication of 1111 x 1000, in which the multiplicand can be booth recoded as 1000b, where b is -1. The booth recoded multiplicand contains only two non-zero digits, which will switch two columns. Therefore, instead of taking 1111 as the multiplicand, 1000b can be taken. Now consider the multiplication of 1010 x 1000, in which the multiplicand can be booth recoded as 1b1b0. The booth recoded multiplicand contains only a single zero whereas the binary multiplicand contains two zeros. In this case the binary multiplicand is chosen for multiplication.

3.2. Multiplier Design

The low power multiplier can be constructed as shown in figure 4. It is organized in two units: the Binary / Booth Recoding Unit and the Multiplication Unit.

(Figure 4 block labels: Multiplicand; Multiplier; Binary / Booth Recoding Unit; Sign; Multiplicand register; Multiplication Using Column Bypassing; Multiplier output)

Figure 4: - Proposed multiplier architecture

3.2.1. Binary/Booth Recoding Unit. This unit chooses the multiplicand with the greater number of zeros. It generates the booth-recoded multiplicand and then chooses either the binary or the booth-recoded multiplicand, whichever has more zeros. When the booth-recoded multiplicand is chosen, the multiplicand is represented with the digits (b, 0, 1). To represent this b in the binary number
system we have taken a sign bit register S that holds the value 1 only when the corresponding digit is b, and 0 otherwise. For a binary multiplicand, S is always zero. If the multiplicand is 16 bits long, e.g. 0000111110111100, it can be booth recoded as 00010000b1000b00. These ternary values can be represented in two registers as follows:

Magnitude Register

0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0

Sign Register (S), where each position marked b holds the value 1

0 0 0 0 0 0 0 0 b 0 0 0 0 b 0 0

We have used look up tables for counting the number of zeros and for converting the multiplicand to the booth recoded multiplicand, as shown in table 1. For multiplication with b, the unit takes the 2's complement of the multiplier. This unit guarantees that the selected multiplicand always has at least as many zeros as the binary multiplicand.

Table 1: - Booth recoding table

Multiplicand            Version of multiplier
Bit i    Bit i-1        selected by bit i
0        0              0 x M
0        1              +1 x M
1        0              b x M
1        1              0 x M

3.2.2. Multiplication Unit. Figure 5 shows the 4x4 low power multiplier structure. This technique will be very useful as we go for higher widths of the multiplicand, especially when there are runs of successive ones.

Figure 5: - 4 x 4 Multiplier architecture

Multiplying with -1 takes the 2's complement of the multiplier. However, we need extra sign bit circuitry to add sign extension bits. Since booth recoding never produces two consecutive -1 digits, in the worst case there will be two -1s. Even in the case of the worst multiplicand, i.e. 0101, the output of the Binary / Booth Recoding Unit is the binary multiplicand. So there is no need for extra correction circuitry, since the multiplier will perform the normal binary operation.

The Modified Full Adder is constructed as shown in figure 6. If a_j is zero, the FA is disabled. Here s_j is the sign bit of the multiplicand.

Figure 6: - Modified full adder structure

The structure of the Modified Half Adder is shown in figure 7.

Figure 7: - Modified half adder structure

4. Implementation and Results

In order to evaluate the performance of the low power parallel multiplier, we implemented all these designs on a Xilinx xc2vp2-6fg256 FPGA. We compare the performance of this design with the column bypassing multiplier, the row bypassing multiplier and the multiplier without bypassing. Table 2 highlights the comparison between the binary multiplicand and the booth-recoded multiplicand. The latter is generated when the binary multiplicand is passed through the Binary / Booth
Recoding Unit. It clearly indicates that the booth-recoded multiplicand saves a significant amount of switching activity compared to the binary multiplicand. In the 16 x 16 multiplier, the output of the Binary / Booth Recoding Unit is the binary multiplicand in 60.66% of cases and the booth recoded multiplicand in 39.34% of cases. Implementing the zero counting and the generation of the booth-recoded multiplicand with a lookup table is another advantage of this design. It takes fewer CLBs than the usual loop statement or FSM.

For design entry, we used ModelSim 6.0d and designed in VHDL. The design was synthesized with Xilinx ISE 7.1i and SynplifyPro. The synthesis results from SynplifyPro and Xilinx XST are shown in tables 3 and 4 respectively.

Table 2: - The number of zeros that get increased when binary multiplicand is passed through Binary / Booth Recoding Unit

Binary Multiplicand     Booth Recoded Multiplicand
1111111111111111        1000000000000000-1
1111111111111110        100000000000000-10
0000111110111100        000010000-11000-100
1111110000111111        100000-1000100000-1

Table 3: - Synthesis results on SynplifyPro

Multiplier (16x16)      Number of LUTs
Without Bypassing       590
Row Bypassing           921
Column Bypassing        605
Proposed                941

Maximum combinational path delays along with the number of slices are given in table 4. We have implemented all the above designs on a Xilinx xc2vp FPGA. The proposed method uses more slices than the earlier methods. However, since the number of logic elements available in most of today's FPGAs is large, this is not considered a negative point, as power reduction is the prime goal.

Table 4: - Synthesis results on Xilinx XST

Multiplier (16x16)      Number of slices    Maximum combinational path delay (ns)
Without Bypassing       325                 36.628
Row Bypassing           557                 51.111
Column Bypassing        497                 42.827
Proposed                593                 47.652

The floorplanner view of the 16x16 multiplier is shown below.

Figure 8: - Floor planner view for 16x16 multiplier in Xilinx xc2vp

The pseudo code for the proposed 16x16 multiplier architecture is as follows.

1. B = Booth Recoding (A)

   If no. zero B > no. zero A then
      Y = B,
      S = sign
   Else
      Y = A,
      S = 0
   End if

2. Z = Y (0 to 15) * M (0 to 15)

   Z = P1 + P2 + P3 + P4
   P1 = Y (0 to 15) * M (0 to 3)
   P2 = Y (0 to 15) * M (4 to 7)
   P3 = Y (0 to 15) * M (8 to 11)
   P4 = Y (0 to 15) * M (12 to 15)

   L1: for (i=0, i<=15, i++)
          Partial_prdt = and (M (0), Y (i))
       End loop

   P (0) = Partial_prdt

   L2: for (i=0, i<=14, i++)
          Sum = advanced_fa_adder (b (1), s (i), a (i), cout1 (i), sum1 (i))
       End loop
   L3: for (i=0, i<=3, i++)
          And_gate_out = and (Y (15), b (i+1))
       End loop

   L4: for (i=0, i<=1, i++)
       L4: for (i=0, i<=13, i++)
          Sum = advanced_fa_adder (b (2), s (i), a (i), cout (2+i)(i), sum2 (i))
       End loop

       and_gate_out = and (Y (i), cout3 (i))
       Pass_ha_adder (Y (0), sum3 (1), sel, p (4), cout (0))

   L5: for (i=0, i<=12, i++)
          Adder = sum3 (i+2), Y (i+1), cout4 (i), sum3 (i+2), P (i+5), cout4 (i+1)
       End loop

3. If S = 0 then
      Z = P
   Else
      For (i=0, i<n-1, i++)
         For (i=0, i<n, i++)
            S = pass adder (a (i), b (i), cin (i), sum (i), cout (i))
         End loop
      End loop
      P1 = S + P
      Z = P1
   End if

5. Conclusion

In this paper we have presented a new methodology for designing a low power parallel multiplier with reduced switching. A method for increasing the number of zeros in the multiplicand with the help of the Binary / Booth Recoding Unit is discussed. We use a look up table to implement the logic for counting the number of ones and generating the booth recoded multiplicand. Compared with column bypassing and other techniques, our methodology guarantees an equal or greater number of zeros in the multiplicand. An effective implementation of a 16x16 multiplier in an FPGA is also presented.

6. References

[1] Oscal T.-C. Chen, Sandy Wang, and Yi-Wen Wu, "Minimization of Switching Activities of Partial Products for Designing Low-Power Multipliers", IEEE Transactions on VLSI Systems, June 2003, vol. 11, no. 3.

[2] Rajendra M. Patrikar, K. Murali, Li Er Ping, "Thermal distribution calculations for block level placement in embedded systems", Microelectronics Reliability 44 (2004) 129-134.

[3] Hichem Belhadj, Behrooz Zahiri, Albert Tai, "Power-sensitive design techniques on FPGA devices", Proceedings of International conference on IC Taipei (2003).

[4] A. Wu, "High performance adder cell for low power pipelined multiplier", in Proc. IEEE Int. Symp. on Circuits and Systems, May 1996, vol. 4, pp. 57-60.

[5] S. Hong, S. Kim, M.C. Papaefthymiou, and W.E. Stark, "Low power parallel multiplier design for DSP applications through coefficient optimization", in Proc. of Twelfth Annual IEEE Int. ASIC/SOC Conf., Sep. 1999, pp. 286-290.

[6] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm", IEEE Trans. Comput., Dec. 1973, vol. C-22, pp. 1045-1047.

[7] I. S. Abu-Khater, A. Bellaouar, and M. Elmasry, "Circuit techniques for CMOS low-power high-performance multipliers", IEEE J. Solid-State Circuits, Oct. 1996, vol. 31, pp. 1535-1546.

[8] J. Ohban, V.G. Moshnyaga, and K. Inoue, "Multiplier energy reduction through bypassing of partial products", Asia-Pacific Conf. on Circuits and Systems, 2002, vol. 2, pp. 13-17.

[9] Ming-Chen Wen, Sying-Jyan Wang, and Yen-Nan Lin, "Low Power Parallel Multiplier with Column Bypassing", Electronics Letters, 12 May 2005, Volume 41, Issue 10, Page(s): 581-583.
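
The selection rule of the Binary / Booth Recoding Unit (sections 3.1 and 3.2.1, Table 1) can be sketched in software. The Python below is only an illustrative model, not the authors' VHDL. The names `booth_recode`, `select_multiplicand` and `value` are hypothetical, the recoded form is assumed to carry one extra leading digit, and the two forms are compared by their number of non-zero digits (which is equivalent to comparing zero counts at equal width), with ties going to the binary form as in step 1 of the pseudo code.

```python
from typing import List, Tuple


def booth_recode(bits: str) -> List[int]:
    """Radix-2 Booth recoding of a binary string given MSB first.

    Each output digit is bit(i-1) - bit(i), as in Table 1:
    (0,1) -> +1, (1,0) -> -1 (written b in the text), otherwise 0.
    The result has one more digit than the input (an assumption made
    here so that a leading run of ones stays representable).
    """
    padded = [0] + [int(c) for c in bits] + [0]
    return [padded[k + 1] - padded[k] for k in range(len(padded) - 1)]


def select_multiplicand(bits: str) -> Tuple[List[int], List[int]]:
    """Return (magnitude, sign) registers of the form with fewer non-zero digits."""
    binary = [0] + [int(c) for c in bits]       # padded to the recoded width
    recoded = booth_recode(bits)
    nonzero_binary = sum(1 for d in binary if d != 0)
    nonzero_recoded = sum(1 for d in recoded if d != 0)
    # Ties keep the binary form, mirroring "If no. zero B > no. zero A".
    chosen = recoded if nonzero_recoded < nonzero_binary else binary
    magnitude = [abs(d) for d in chosen]
    sign = [1 if d < 0 else 0 for d in chosen]  # S = 1 only where the digit is b
    return magnitude, sign


def value(digits: List[int]) -> int:
    """Numeric value of an MSB-first signed-digit list."""
    total = 0
    for d in digits:
        total = 2 * total + d
    return total


if __name__ == "__main__":
    for word in ("1111", "1010", "0000111110111100"):
        print(word, booth_recode(word), select_multiplicand(word))
```

For the examples in the text, 1111 recodes to 1000b and the recoded form is selected, while 1010 recodes to 1b1b0 and the binary form is kept. The recoding preserves the numeric value of the multiplicand, which `value` can be used to confirm.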

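
To illustrate why fewer non-zero multiplicand digits mean less switching (sections 2.2 and 3.1), the following Python sketch models column-bypassing multiplication arithmetically. It is a behavioural approximation under stated assumptions, not a description of the adder array: the function name `multiply_with_bypass` is hypothetical, each non-zero digit is counted as one "active column", and a -1 digit simply subtracts the shifted multiplier, which stands in for adding its 2's complement in hardware.

```python
from typing import List, Tuple


def multiply_with_bypass(digits: List[int], multiplier: int) -> Tuple[int, int]:
    """Multiply an MSB-first signed-digit multiplicand by an integer multiplier.

    Returns (product, active_columns). Zero digits are skipped, which stands
    in for a bypassed adder column.
    """
    width = len(digits)
    product = 0
    active_columns = 0
    for position, digit in enumerate(digits):
        if digit == 0:
            continue                      # column bypassed: nothing switches
        weight = width - 1 - position     # power of two carried by this digit
        product += digit * (multiplier << weight)
        active_columns += 1
    return product, active_columns


if __name__ == "__main__":
    multiplier = 0b1000
    print(multiply_with_bypass([1, 1, 1, 1], multiplier))      # binary 1111
    print(multiply_with_bypass([1, 0, 0, 0, -1], multiplier))  # recoded 1000b
```

For the 1111 x 1000 example, both calls give the product 120, but the binary multiplicand activates four columns while the recoded form 1000b activates two.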

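
Section 3.2.1 states that look up tables, rather than a loop statement or FSM, are used for counting zeros. As a rough software analogue, the sketch below counts the zeros of a 16-bit word with four lookups into a 16-entry table. The 4-bit table granularity and the names `ZEROS_IN_NIBBLE` and `count_zeros_16` are assumptions for illustration only; the paper does not give the layout of its tables.

```python
# Number of zero bits in each 4-bit value, precomputed once (16 entries).
ZEROS_IN_NIBBLE = [4 - bin(value).count("1") for value in range(16)]


def count_zeros_16(word: int) -> int:
    """Count zero bits in a 16-bit word using four table lookups."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("word must fit in 16 bits")
    return sum(ZEROS_IN_NIBBLE[(word >> shift) & 0xF] for shift in (0, 4, 8, 12))


if __name__ == "__main__":
    print(count_zeros_16(0b0000111110111100))
```

The table replaces a bit-by-bit loop with a fixed number of lookups and one small addition, which is the kind of structure that maps naturally onto FPGA LUTs.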