FPGA Implementation of Low Power Parallel Multiplier
Sanjiv Kumar Mangal, Rahul M. Badghare, Raghavendra B. Deshmukh, R. M. Patrikar
Dept. of Electronics and Computer Science, VNIT, Nagpur, Maharashtra, India
sanjivmangal@yahoo.co.in, badghare_rahul@yahoo.co.in, rbdeshmukh@vnitnagpur.co.in, rajendra@computer.org
Abstract

In the fast-growing communication field, the demand for low power designs is increasing, both to reduce power losses and to reduce thermal losses in the same ratio. The multiplier is an arithmetic circuit that is extensively used in common DSP and communication applications. This paper presents a low power multiplier design methodology that inserts more zeros into the multiplicand, thereby reducing the switching activity as well as the power consumption. The use of look-up tables is an added feature of this design. Modifying the structure of the adders further reduces switching activity.

Index Terms: Low Power, Multiplier, Reduced Switching, Column Bypassing.

1. Introduction

As we get closer to the limits of scaling in complementary metal-oxide-semiconductor (CMOS) circuits, power and heat dissipation issues are becoming more and more important. In recent years, the impact of pervasive computing and the internet has accelerated this trend. The applications for these domains typically run on battery-powered embedded systems. The resultant constraints on the energy budget require design for power as well as design for performance at all layers of system design. Thus reducing power consumption is a key design goal for portable computing and communication devices that employ increasingly sophisticated and power-hungry signal processing techniques. Flexibility is another critical requirement, one that mandates the use of programmable components like FPGAs in such devices. However, there is a fundamental trade-off between efficiency and flexibility, and as a result, programmable designs incur significant performance and power penalties compared to application-specific solutions. Consequently, various digital signal-processing chips are now designed with low power dissipation.

Signal processing applications typically exhibit high degrees of parallelism and are dominated by a few regular kernels of computation, such as multiplication, that are responsible for a large fraction of execution time and energy. In such systems, the multiplier is a fundamental arithmetic unit [1]. Shrinking feature sizes are responsible for increasing thermal-related problems as well. The on-chip temperature in current processors can vary by as much as several tens of degrees from one portion of the chip to another, with the maximum temperature reaching as high as 100 °C. The temperature gradient formed by such units can be a major source of inaccuracy in delay and clock skew computations [2].

1.1. Power Reduction Techniques

The power dissipation in a CMOS circuit has several components that are usually estimated from the device parameters of the technology used. The total power in the circuit is given by the following equation:

Ptotal = Pswitching + Pshortcircuit + Pstatic + Pleakage    (1)

where Pswitching is the switching component of the power, and it is the dominating component in these calculations. Pshortcircuit is the power dissipated because, during circuit operation, the PMOS and NMOS transistors of a CMOS gate conduct simultaneously during a transition at the input level. Pstatic is the contribution due to the biasing current required for the
20th International Conference on VLSI Design (VLSID'07)
0-7695-2762-0/07 $20.00 2007 Authorized licensed use limited to: Jeppiaar Engineering College. Downloaded on July 30,2010 at 14:01:26 UTC from IEEE Xplore. Restrictions apply.
device, and Pleakage is the power consumption due to the reverse-biased P-N junctions in the circuit.

In FPGA designs, power reduction is possible only by reducing the switching activity, that is, the dynamic power. In general, dynamic power consumption is defined as the power consumed while the clock is running and the external inputs are switching. Dynamic power has several components, mainly capacitive-load charging and discharging (internal and I/Os) and short-circuit current. Most of the dynamic power is consumed by charging and discharging capacitance, internal and external to the device. If the device is driving many I/O loads, the dynamic current due to the I/Os becomes a substantial part of the entire power consumption. Since the voltage VDD is fixed, internal dynamic power reduction is achieved by:

- Decreasing the average logic-switching frequency
- Reducing the amount of logic switching at each clock edge
- Reducing the propagation of the switching activity and lowering the capacitance of the routing network, especially for high-frequency signals [3].

For low-power designs, precautions need to be taken at each abstraction level, from system level to technology process level. The higher the abstraction level at which a power-reducing decision is taken, the greater its impact. In general, switching activity can be reduced at various levels of the design flow. Architectural decisions in the early design phases have the greatest impact. For high switching signals, delay balancing and reduction of the number of logic levels are among the most efficient techniques to tackle the power penalty. An obvious method to reduce the switching activity is to shut down the idle part of the circuit, which is not in operating condition. Further, a low power adder structure reduces the switching activity [4]. This paper presents a multiplier design in which switching activities are reduced through architecture optimization.

This paper is organized as follows. In section 2 we give implementation issues and preliminary work on different multiplier architectures. Section 3 deals with the proposed architecture. Section 4 gives the implementation results, and finally the conclusion is given in section 5.

2. Implementation Issues

2.1. Conventional Parallel Multiplier

A general M x N parallel multiplier operates by computing the partial products in parallel and by shifting and accumulating the partial products. Switching activity is poorly correlated with the input coefficient. In particular, reducing the switching activity of the components used in the design can minimize the power dissipation, i.e. if the kth bit of the coefficient is zero, the kth row of adders need not be activated. However, this type of multiplier does not help reduce switching, since the adders switch unnecessarily even if the kth bit is zero.

2.2. Related Research

Figure 1 shows the 4 x 4 row bypassing architecture with reduced switching [5].

Figure 1: - 4x4 Multiplier with reduced switching activity

The demerit of this technique is that it needs extra correction circuitry, shown in the ellipse. The structure of the full adder is complex as well.

Figure 2: - 4x4 Braun multiplier

The structure shown in figure 2 eliminates the extra correction circuitry. On the other hand, the
Baugh-Wooley multiplier [6] uses the same array structure to handle 2's complement multiplication, with some of the partial products replaced by their complements. The Braun multiplier removes the need for the extra correction circuitry, and it uses fewer adders. Its limitation is that it cannot stop the switching activity even when a coefficient bit is zero, which ultimately results in unnecessary power dissipation. Other low power designs disable the operation in some rows [7][8].

Ming-Chen Wen et al. [9] proposed a technique that reduces the switching to a fairly good extent. Figure 3 shows the 4x4 column bypassing multiplier structure.

Figure 3: - 4 x 4 Column bypassing multiplier structure

Consider the multiplication of 1010 x 1000. Since the multiplicand contains two zeros, the corresponding columns, i.e. the first and third, are disabled. Now consider another multiplication, 1111 x 1000. Since the multiplicand contains no zero, all columns are switched.

The limitation of this technique is that the number of columns switched depends on the number of ones in the multiplicand. For example, if the multiplicand is the 16-bit value 1111111111111111, all the full adders in all the columns switch and consume more power. Less switching activity can be achieved if the multiplicand contains more zeros than ones. In our proposed technique, the above multiplicand is represented as 1000000000000000b, where b is -1, which switches only two columns and thus reduces the switching and the power dissipation.

3. Proposed Design and Implementation

Higher power reduction can be achieved if the multiplicand contains more 0s than 1s [5]. In this approach we propose a Binary / Booth Recoding Unit which forces the multiplicand to have more zeros. The advantage is that if the multiplicand contains a run of successive ones, the Booth recoding unit converts these ones into zeros.

3.1. Approach

The switching activity of the components used in the design depends on the input bit coefficient. This means that if an input bit coefficient is zero, the corresponding row or column of adders need not be activated. If the multiplicand contains more zeros, higher power reduction can be achieved. We propose a Binary / Booth Recoding Unit which forces the multiplicand to have more zeros. Consider the multiplication of 1111 x 1000, in which the multiplicand can be Booth recoded as 1000b, where b is -1. The Booth recoded multiplicand contains only two nonzero digits, which switch two columns. Therefore, instead of taking 1111 as the multiplicand, 1000b can be taken. Now consider the multiplication of 1010 x 1000, in which the multiplicand can be Booth recoded as 1b1b0. The Booth recoded multiplicand contains only a single zero, whereas the binary multiplicand contains two zeros. In this case the binary multiplicand is chosen for the multiplication.

3.2. Multiplier Design

The low power multiplier can be constructed as shown in figure 4. It is organized in two units: the Binary / Booth Recoding Unit and the Multiplication Unit.

Figure 4: - Proposed multiplier architecture (blocks, top to bottom: Multiplicand and Multiplier inputs; Binary / Booth Recoding Unit; Sign register and Multiplicand; Multiplication Using Column Bypassing; Multiplier output)

3.2.1. Binary/Booth Recoding Unit. This unit chooses the multiplicand with more zeros. It generates the Booth-recoded multiplicand and chooses either the binary or the Booth-recoded multiplicand, whichever has the greater number of zeros. When the Booth-recoded multiplicand is chosen, the multiplicand is represented with the digits (b, 0, 1). To represent this b in binary number
system we have taken a sign bit register S that holds the value 1 only when the corresponding digit is b, and 0 otherwise. For a binary multiplicand, S is always zero. If the multiplicand is the 16-bit value 0000111110111100, it can be Booth recoded as 00010000b1000b00. These ternary values can be represented in two registers:

Magnitude Register
0 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0

Sign Register (s)
0 0 0 0 0 0 0 0 b 0 0 0 0 b 0 0

We have used look up tables for counting the number of zeros and for converting the multiplicand to the Booth recoded multiplicand, as shown in table 1. For multiplication with b, the 2's complement of the multiplier is taken. This unit guarantees that the chosen multiplicand always has at least as many zeros as the binary multiplicand.

Table 1: - Booth recoding table

Multiplicand bit i    Multiplicand bit i-1    Version of multiplier selected by bit i
0                     0                       0 x M
0                     1                       +1 x M
1                     0                       b x M
1                     1                       0 x M

3.2.2. Multiplication Unit. Figure 5 shows the 4x4 low power multiplier structure. This technique is very useful for wider multiplicands, especially when they contain runs of successive ones.

Multiplying with -1 takes the 2's complement of the multiplier. However, extra sign bit circuitry is needed to add the sign extension bits. In Booth recoding no two consecutive digits are -1, and in the worst case there are two -1 digits. For the worst-case multiplicand, i.e. 0101, the output of the Binary / Booth Recoding Unit is the binary multiplicand, so no extra correction circuitry is needed because the multiplier performs a normal binary operation.

The Modified Full Adder is constructed as shown in figure 6. If aj is zero, the FA is disabled. Here sj is the sign bit of the multiplicand.

Figure 6: - Modified full adder structure

The structure of the Modified Half Adder is shown in figure 7.
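The recoding and selection rule of Section 3.2.1 can be sketched in software. The following is a minimal behavioural sketch, not the authors' VHDL or LUT implementation. The function names (`to_bits`, `booth_recode`, `recoding_unit`, `show`) are hypothetical. Two assumptions are made: an extra leading digit is appended so that an unsigned input such as 1111 recodes to the five digits 1000b, and the Booth form is chosen only when it has strictly fewer nonzero digits than the binary form, which is how "more zeros" is interpreted here.

```python
from typing import List, Tuple


def to_bits(value: int, width: int) -> List[int]:
    """Return the bits of an unsigned value, least significant bit first."""
    return [(value >> i) & 1 for i in range(width)]


def booth_recode(value: int, width: int) -> List[int]:
    """Radix-2 Booth digits (LSB first), following Table 1.

    Digit i equals bit(i-1) - bit(i), with bit(-1) = 0. The extra leading
    digit (equal to the top bit) is an assumption for unsigned inputs.
    """
    digits = []
    prev = 0
    for bit in to_bits(value, width):
        digits.append(prev - bit)
        prev = bit
    digits.append(prev)  # assumed leading digit for unsigned operands
    return digits


def recoding_unit(value: int, width: int) -> Tuple[List[int], List[int]]:
    """Return (magnitude register, sign register), both LSB first.

    Assumption: Booth digits are selected only if they contain strictly
    fewer nonzero digits than the binary multiplicand; otherwise the binary
    form is kept and the sign register is all zeros.
    """
    binary = to_bits(value, width)
    booth = booth_recode(value, width)
    if sum(d != 0 for d in booth) < sum(binary):
        chosen = booth
    else:
        chosen = binary + [0]  # pad to the same length as the Booth form
    magnitude = [abs(d) for d in chosen]
    sign = [1 if d < 0 else 0 for d in chosen]
    return magnitude, sign


def show(digits: List[int]) -> str:
    """Format digits MSB first, writing -1 as 'b' as in the paper."""
    return "".join("b" if d < 0 else str(d) for d in reversed(digits))
```

With these definitions, `show(booth_recode(0b1111, 4))` gives `1000b` and `show(booth_recode(0b1010, 4))` gives `1b1b0`, matching the examples of Section 3.1; for 1010 the sketch keeps the binary form, as the paper describes.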
Figure 7: - Modified half adder structure
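The effect of the sign register on the multiplication can be illustrated with a purely functional model. This is a hedged sketch and not a gate-level description of the modified adders of figures 6 and 7: the function name `bypass_multiply` is hypothetical, the registers are passed LSB first, and the count of active columns is only a rough proxy for switching activity.

```python
from typing import List, Tuple


def bypass_multiply(magnitude: List[int], sign: List[int],
                    multiplier: int) -> Tuple[int, int]:
    """Multiply using only the columns whose magnitude digit is 1.

    magnitude and sign are the two registers of the recoding unit, LSB
    first. A column with magnitude 0 is skipped, which models a bypassed
    column. A column with sign 1 subtracts the shifted multiplier, which
    corresponds to adding its 2's complement.
    Returns (product, number of active columns).
    """
    product = 0
    active_columns = 0
    for j, (m, s) in enumerate(zip(magnitude, sign)):
        if m == 0:
            continue  # bypassed column: no adder activity in this model
        active_columns += 1
        term = multiplier << j
        product += -term if s else term
    return product, active_columns


# Multiplicand 1111 in binary form: four active columns.
binary_result = bypass_multiply([1, 1, 1, 1], [0, 0, 0, 0], 0b1000)
# The same multiplicand recoded as 1000b: two active columns.
booth_result = bypass_multiply([1, 0, 0, 0, 1], [1, 0, 0, 0, 0], 0b1000)
```

Both calls return the product 120 (15 x 8), but the recoded form activates two columns instead of four, which is the saving the proposed architecture targets.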
4. Implementation and Results
Figure 5: - 4 x 4 Multiplier architecture

In order to evaluate the performance of the low power parallel multiplier, we implemented all these designs on a Xilinx xc2vp2-6fg256 FPGA. We compare the performance of this design with the column bypassing multiplier, the row bypassing multiplier and the multiplier without bypassing. Table 2 highlights the comparison between the binary multiplicand and the Booth-recoded multiplicand. The latter is generated when the binary multiplicand is passed through the Binary / Booth
Recoding Unit. It clearly indicates that the Booth-recoded multiplicand saves a significant amount of switching activity compared to the binary multiplicand. In the 16 x 16 multiplier, the output of the Binary / Booth Recoding Unit is 60.66% binary multiplicand and 39.34% Booth-recoded multiplicand.

Implementing the counting of zeros and the generation of the Booth-recoded multiplicand with a lookup table is another advantage of this design. It takes fewer CLBs than the usual loop statement or FSM. For design entry, we used ModelSim 6.0d and designed in VHDL. The design was synthesized with Xilinx ISE 7.1i and SynplifyPro. Synthesis results from SynplifyPro and Xilinx XST are shown in tables 3 and 4 respectively.

Table 2: - The number of zeros that get increased when binary multiplicand is passed through Binary / Booth Recoding Unit

Binary Multiplicand    Booth Recoded Multiplicand
1111111111111111       1000000000000000-1
1111111111111110       100000000000000-10
0000111110111100       000010000-11000-100
1111110000111111       100000-1000100000-1

Table 3: - Synthesis results on SynplifyPro

Multiplier (16x16)    Number of LUTs
Without Bypassing     590
Row Bypassing         921
Column Bypassing      605
Proposed              941

Maximum combinational path delays along with the number of slices are given in table 4. We have implemented all the above designs on a Xilinx xc2vp FPGA. The proposed method uses more slices than the earlier methods. However, since the number of logic elements available in most of today's FPGAs is large, this is not considered a negative point, because power reduction is the prime goal.

Table 4: - Synthesis results on Xilinx XST

Multiplier (16x16)    Number of slices    Maximum Combinational Path delay (ns)
Without Bypassing     325                 36.628
Row Bypassing         557                 51.111
Column Bypassing      497                 42.827
Proposed              593                 47.652

The floorplanner view of the 16x16 multiplier is shown below.

Figure 8: - Floor planner view for 16x16 multiplier in Xilinx xc2vp

The pseudo code for the proposed 16x16 multiplier architecture is as follows.

1. B = Booth Recoding (A)
   If no. zero B > no. zero A then
      Y = B, S = sign
   Else
      Y = A, S = 0
   End if

2. Z = Y (0 to 15) * M (0 to 15)
   Z = P1 + P2 + P3 + P4
   P1 = Y (0 to 15) * M (0 to 3)
   P2 = Y (0 to 15) * M (4 to 7)
   P3 = Y (0 to 15) * M (8 to 11)
   P4 = Y (0 to 15) * M (12 to 15)

   L1: for (i=0, i<=15, i++)
      Partial_prdt = and (M (0), Y (i))
   End loop
   P (0) = Partial_prdt

   L2: for (i=0, i<=14, i++)
      Sum = advanced_fa_adder (b (1), s (i), a (i), cout1 (i), sum1 (i))
   End loop
   L3: for (i=0, i<=3, i++)
      And_gate_out = and (Y (15), b (i+1))
   End loop

   L4: for (i=0, i<=1, i++)
      L4: for (i=0, i<=13, i++)
         Sum = advanced_fa_adder (b (2), s (i), a (i), cout (2+i)(i), sum2 (i))
      End loop
      and_gate_out = and (Y (i), cout3 (i))
      Pass_ha_adder (Y (0), sum3 (1), sel, p (4), cout (0))
      L5: for (i=0, i<=12, i++)
         Adder = sum3 (i+2), Y (i+1), cout4 (i), sum3 (i+2), P (i+5), cout4 (i+1)
      End loop

3. If S = 0 then
      Z = P
   Else
      For (i=0, i<n-1, i++)
         For (i=0, i<n, i++)
            S = pass adder (a (i), b (i), cin (i), sum (i), cout (i))
         End loop
      End loop
      P1 = S + P
      Z = P1
   End if

5. Conclusion

In this paper we have presented a new methodology for designing a low power parallel multiplier with reduced switching. A method for increasing the number of zeros in the multiplicand with the help of a Binary / Booth Recoding Unit is discussed. We use look up tables to implement the logic for counting the number of ones and for generating the Booth recoded multiplicand. Compared with column bypassing and other techniques, our methodology guarantees an equal or greater number of zeros in the multiplicand. An effective implementation of a 16x16 multiplier on an FPGA is also presented.

6. References

[1] Oscal T.-C. Chen, Sandy Wang, and Yi-Wen Wu, "Minimization of Switching Activities of Partial Products for Designing Low-Power Multipliers," IEEE Transactions on VLSI Systems, June 2003, vol. 11, no. 3.

[2] Rajendra M. Patrikar, K. Murali, Li Er Ping, "Thermal distribution calculations for block level placement in embedded systems," Microelectronics Reliability 44 (2004) 129-134.

[3] Hichem Belhadj, Behrooz Zahiri, Albert Tai, "Power-sensitive design techniques on FPGA devices," Proceedings of International conference on IC Taipai (2003).

[4] A. Wu, "High performance adder cell for low power pipelined multiplier," in Proc. IEEE Int. Symp. on Circuits and Systems, May 1996, vol. 4, pp. 57-60.

[5] S. Hong, S. Kim, M.C. Papaefthymiou, and W.E. Stark, "Low power parallel multiplier design for DSP applications through coefficient optimization," in Proc. of Twelfth Annual IEEE Int. ASIC/SOC Conf., Sep. 1999, pp. 286-290.

[6] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. Comput., Dec. 1973, vol. C-22, pp. 1045-1047.

[7] I. S. Abu-Khater, A. Bellaouar, and M. Elmasry, "Circuit techniques for CMOS low-power high-performance multipliers," IEEE J. Solid-State Circuits, Oct. 1996, vol. 31, pp. 1535-1546.

[8] J. Ohban, V.G. Moshnyaga, and K. Inoue, "Multiplier energy reduction through bypassing of partial products," Asia-Pacific Conf. on Circuits and Systems, 2002, vol. 2, pp. 13-17.

[9] Ming-Chen Wen, Sying-Jyan Wang, and Yen-Nan Lin, "Low Power Parallel Multiplier with Column Bypassing," Electronics Letters, Volume 41, Issue 10, 12 May 2005, Page(s): 581-583.