SRT4 Divider and Multiplier

Processor Design Project: ET4171
Due on Monday, June 10, 2013
Instructor: S. D. Cotofana
W. van Teijlingen, R. de Wit and S. van Breukelen
IN4343
Contents
Contact Information
Introduction
Other areas of optimization
Modified Multiplier Design
Modified Divider Design
Appendix A: 32 bit twos complementation logic
Contact Information
Name: Stefan van Breukelen
E-mail: s van breukelen@hotmail.com
Student number: 4192591

Name: Wouter van Teijlingen
E-mail: wouter@van-teijlingen.nl
Student number: 4170377

Name: Remco de Wit
E-mail: r.dewit-1@student.tudelft.nl
Student number: 4179889
Introduction
The goal of the Processor Design Project is to improve the performance of the LEON3 core. A core can be optimized in terms of performance (number of clock cycles, data path delay, etc.), area and/or power dissipation. Depending on the requirements of the application, a designer can choose to optimize the design for any of these characteristics. Our proposal is to optimize the design in terms of performance. We believe that performance is still a key requirement in many applications, and that optimizing for area or power yields a lower gain on an FPGA than on an ASIC. A designer who wants to optimize for area or power is therefore better off choosing an ASIC design. For these reasons, area and power consumption are not taken into account when improving the performance, as long as they stay within reasonable¹ limits. Because arithmetic operations such as multiplication and division are used frequently in modern computers, we propose to modify the multiplier and the divider. Besides the multiplier and divider, other computer architecture and computer arithmetic aspects were considered in order to improve the performance of the LEON3 core. The remainder of this report first discusses these other areas of optimization, then presents the modified multiplier design and finally the modified divider design.
¹ Not doubling the resources.
Other areas of optimization

In this section other possible areas of optimization are researched and discussed. Besides the computer arithmetic aspects like the adder, divider and multiplier one can also look at certain aspects of the computer architecture of the LEON3 core.
Adder
One of the possible areas of optimization was the adder module of the processor. This is an obvious contender for optimization because addition is one of the most used operations. To improve the performance of the adder one can either reduce the number of cycles or reduce the data path delay. To determine whether the adder module was suitable for optimization, a testbench was created. This testbench helped with understanding the adder and determining the number of cycles required. The testbench showed that additions were done in one cycle, so no clock cycles could be gained. The data path delay of the adder still remains to be checked. It became clear that little performance could be gained by modifying the adder module.
Superscalar architecture
In a superscalar architecture several instructions can be initiated simultaneously and executed independently, which allows for a higher throughput. Superscalar execution is similar to pipelining, but in a pipeline the instructions in flight must each be in a different pipeline stage at any given time, whereas a superscalar processor can have several instructions in the same stage.
Out-Of-Order execution
Another way of improving the performance of the LEON3 core is to reorder the instructions. By reordering the instructions the processor can avoid being idle and thereby improve the performance. This reordering of instructions is called Out-Of-Order (OOO) execution. The processor executes the instructions in order of availability of their data or operands and avoids being idle while data is retrieved for the next instruction.
Conclusion
Due to the complexity of these optimizations and the limited time available, the focus lies on the multiplier and divider. These optimizations are left for future work.
Modified Multiplier Design

In this section the multiplier in the LEON3 processor is discussed. First, the reason for optimizing the multiplier is explained. Second, the original multiplier and its design are discussed. Third, the design and implementation of the modified multiplier are presented. Finally, performance comparisons are analyzed and we finish with a short summary and conclusion.
Improving the multiplier

When looking for possibilities to optimize the processor, the ALU is an obvious choice for investigation. Multipliers, dividers and adders are logic units that play a vital role in the performance of a CPU. Investigating the adder led to the conclusion that it was already fast and therefore not a suitable candidate for further optimization. However, the divider and multiplier definitely left room for improvement. The divider is discussed in detail in the Modified Divider Design section. The multiplier has a delay of 5 cycles at a frequency of 80 MHz in the reference design. The LEON3 configuration allows users to select between multipliers in the design. It is possible to pick one of the following designs: 16x16, 32x8, 32x16 and 32x32. The first has a delay of 5 cycles, the following two have a delay of 4 cycles and the last has a delay of 2 cycles. Depending on the configuration, one of the multipliers is instantiated. In every state one part of the multiplication is carried out and intermediate results are stored in an accumulator. The reason for optimizing the multiplier is that multiplications are often used in benchmarks, so reducing the latency gives an overall improvement in performance.
Design of modified multiplier

In high-speed designs the Modified Booth Encoding (MBE) [1] is often used. Various implementations exist and new designs are proposed on a regular basis [3, 4]. Therefore, the design of the modified multiplier is based on MBE in combination with Carry Save Adders (CSAs). The final addition is computed by a normal adder. In most designs, MBE is applied to signed numbers only. However, in this design unsigned numbers should be supported as well. The aim of the modified design was a multiplier with a latency of 2 cycles, an issue rate of 2 clocks and a data latency of 1 clock. This is exactly the same as with the unmodified 32x16 multiplier. A global overview of the design is shown in Figure 1. Booth recoding requires a recoding of the multiplier term. In this design Radix-4 Booth Recoding (often referred to as Booth 2 recoding) was used. In the following section the difference between signed and unsigned multiplication is explained.
Figure 1: Multiplier overview.
Signed and unsigned multiplication in MBE

Signed and unsigned multiplication require different methods. If the multiplier is unsigned, it is padded on the Most Significant Bit (MSB) side with two zeros if its bit length is even, and with one zero otherwise. For signed multiplication the padding is not required as long as the bit length of the multiplier is even; otherwise the multiplier is sign extended by 1 bit. In this case a 64-bit product is expected. However, bit 32 determines the sign of the multiplicand and multiplier. In this design normal sign extension is used instead of special sign extension methods. In the first design a special sign extension method was applied, but the performance gain was negligible and therefore it was not included in the final design. Depending on the sign bit of the multiplier, the last partial product row is either generated with the help of a Partial Product Generator (PPG) or simply initialized to all zeros.
Modified Booth Encoder

The MBE is used to encode the multiplier. As previously mentioned, the MSB side is padded with two bits [5, 6]. The LSB side is also padded with one extra zero bit. The bits are inspected in overlapping blocks of three and Table 1 is used to select the right encoding. The outputs that are used for the PPG are neg, one and two, where neg indicates whether or not the multiplicand should be complemented, and one or two indicates whether a multiple of 1 or 2 times the multiplicand is required. In this particular design Radix-4 MBE encoding is used. It is possible to use higher radices. However, increasing the radix means that multiples of the multiplicand that are harder to generate, such as 3X, need to be calculated.
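As an illustration of the encoding in Table 1, the following Python sketch recodes a multiplier into neg, one and two signals and checks that the recoded digits reproduce the product. It is a behavioural model under our own assumptions, not the VHDL of the design: the names BOOTH_TABLE, booth_encode and booth_multiply are hypothetical, and neg is set to 0 for block 111 because that block selects zero either way.

```python
# Table 1 as a lookup: 3-bit block -> (neg, one, two).
# Assumption: neg = 0 for block 111, since the selected multiple is 0 anyway.
BOOTH_TABLE = {
    0b000: (0, 0, 0),  #  0 x multiplicand
    0b001: (0, 1, 0),  # +1 x multiplicand
    0b010: (0, 1, 0),  # +1 x multiplicand
    0b011: (0, 0, 1),  # +2 x multiplicand
    0b100: (1, 0, 1),  # -2 x multiplicand
    0b101: (1, 1, 0),  # -1 x multiplicand
    0b110: (1, 1, 0),  # -1 x multiplicand
    0b111: (0, 0, 0),  #  0 x multiplicand
}


def booth_encode(multiplier, width=32, signed=True):
    """Return (neg, one, two) per radix-4 block, least significant block first."""
    bits = multiplier & ((1 << width) - 1)
    if signed:
        pad = width % 2  # sign extend by 1 bit only for odd widths
        if pad and (bits >> (width - 1)) & 1:
            bits |= 1 << width
    else:
        pad = 2 if width % 2 == 0 else 1  # zero padding on the MSB side
    padded = bits << 1  # extra zero bit below the LSB
    return [BOOTH_TABLE[(padded >> i) & 0b111] for i in range(0, width + pad, 2)]


def booth_multiply(multiplicand, multiplier, width=32, signed=True):
    """Sum the selected multiples, each weighted by 4**k for block k."""
    product = 0
    for k, (neg, one, two) in enumerate(booth_encode(multiplier, width, signed)):
        magnitude = multiplicand * (one + 2 * two)
        product += (-magnitude if neg else magnitude) << (2 * k)
    return product
```

For a 32-bit signed multiplier this yields 16 blocks; the two padding zeros of the unsigned case add a seventeenth block, which is the extra partial product row mentioned in the previous section.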
Twos complementation logic

The MBE determines whether the multiplicand should be added or subtracted. Subtraction simply means complementing the multiplicand and adding one. In an interesting paper titled A Logarithmic Time Method for Twos Complementation, Kang and Gaudiot [2] propose an O(log n) algorithm for twos complementation, where, in this case, n is the length of the multiplicand. Their proposed solution can be generalized to any number of bits in the system. It is implemented for 32 bit values in this design, as it was considered an interesting approach to the complementation of values. Normally one inverts the value and adds one. With that conventional approach the carry can propagate through all bit positions. In this design, which is based on simple logic gates, there are 5 stages before complementation is finished. In Figure 2 a simple 8 bit complementer is shown.

Block   Partial Product
000      0 x Multiplicand
001      1 x Multiplicand
010      1 x Multiplicand
011      2 x Multiplicand
100     -2 x Multiplicand
101     -1 x Multiplicand
110     -1 x Multiplicand
111      0 x Multiplicand
Table 1: Radix 4 Booth Encoding.

Figure 2: 8 bit twos complementer. Adapted from Kang and Gaudiot [2].
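To make the logarithmic behaviour concrete, the sketch below computes a twos complement without a ripple carry. It relies on the observation that negation leaves every bit up to and including the lowest 1 unchanged and inverts all bits above it, so only a prefix OR is needed, and a prefix OR can be built in log2(n) doubling stages. This is a generic parallel-prefix formulation written for illustration; it is an assumption that it matches the spirit of [2], and it is not claimed to be the same gate network. The function name is hypothetical.

```python
def twos_complement_log(x, width=32):
    """Return (twos complement of x, number of prefix stages used)."""
    mask = (1 << width) - 1
    x &= mask
    # After the loop, bit i of `seen` is 1 when any bit of x at position <= i is 1.
    seen = x
    shift = 1
    stages = 0
    while shift < width:
        seen |= (seen << shift) & mask  # one stage of OR gates
        shift <<= 1
        stages += 1
    # Bit i must be inverted when a 1 exists strictly below position i.
    flip = (seen << 1) & mask
    return x ^ flip, stages
```

For width 32 the loop runs with shifts 1, 2, 4, 8 and 16, which gives the 5 stages mentioned above; for the 8 bit case of Figure 2 it runs 3 times.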
Partial Product Generator

The PPG needs both the multiplier and the multiplicand. It needs the multiplicand directly in its full bit length. The multiplier is encoded using the MBE, and this information is required to select the correct partial product. As with the MBE, the PPG is implemented as a multiplexer and is port mapped in the design. It is a simple design and it is illustrated in Figure 3. After the partial products (PPs) are calculated, they have to be summed to produce the final result. This is discussed in the following section.

Figure 3: Partial Product Generator.

Figure 4: Carry Save Adder Tree.
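The selection performed by the PPG can be sketched as a small multiplexer model. The function below is a hypothetical illustration: it takes the neg, one and two signals described for the MBE and returns the partial product as a bit pattern of out_width bits, using plain sign extension as in this design. The 64 bit default and the parameter names are assumptions made for the example.

```python
def partial_product(multiplicand, neg, one, two, out_width=64):
    """Select 0, X or 2X, negate if requested, return a twos complement bit pattern."""
    mask = (1 << out_width) - 1
    if two:
        selected = multiplicand << 1  # 2X is a left shift by one
    elif one:
        selected = multiplicand       # 1X
    else:
        selected = 0                  # 0X
    if neg:
        selected = -selected          # twos complement of the selected multiple
    return selected & mask            # sign extended to out_width bits
```

As a small check, the multiplier 6 recodes to the radix-4 digits -2 (weight 1) and +2 (weight 4), so with multiplicand 5 the two rows are -10 and 10 shifted left by two bits, which sum to 30.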
Carry Save Adders

Carry Save Adders, in short CSAs, are often used as a major speed enhancement technique. They allow fast addition of numbers with minimal carry propagation. In this design so-called 3:2 CSAs are used. In Figure 4 the CSA tree is shown. With the help of this CSA tree it is possible to reduce the addition of 16 partial products to a simple addition of 2 bit streams. The CSA tree makes the design considerably faster: without CSAs a maximum delay of 30 ns was measured, and with the CSA tree this is reduced to 15 ns. Applying other methods, such as higher order compressors and sign extension tricks, could lead to even better performance.
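The reduction of many rows to two bit streams can be illustrated with a short behavioural sketch of a 3:2 CSA tree. The grouping used here (take rows three at a time on every level and pass leftovers through) is an assumption chosen for simplicity; the tree in Figure 4 may be wired differently. The names csa and csa_tree are hypothetical.

```python
def csa(a, b, c, mask):
    """3:2 carry save adder: three rows in, a sum row and a carry row out."""
    sum_row = a ^ b ^ c                              # bitwise sum, no carry propagation
    carry_row = ((a & b) | (a & c) | (b & c)) << 1   # carries move one position left
    return sum_row & mask, carry_row & mask


def csa_tree(rows, width=64):
    """Reduce rows to two rows; return (remaining rows, number of CSA levels)."""
    mask = (1 << width) - 1
    rows = [r & mask for r in rows]
    levels = 0
    while len(rows) > 2:
        full = len(rows) // 3 * 3
        reduced = []
        for i in range(0, full, 3):
            reduced.extend(csa(rows[i], rows[i + 1], rows[i + 2], mask))
        reduced.extend(rows[full:])  # rows that did not fit into a group of three
        rows = reduced
        levels += 1
    return rows, levels
```

With this grouping, 16 rows shrink to 11, 8, 6, 4, 3 and finally 2 rows, so six CSA levels precede the single carry-propagating addition that produces the final result.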
Evaluation of performance
In this section we evaluate the performance.
Summary and conclusion

Modifying the multiplier proved to be a challenging assignment. Designing the multiplier with the discussed methods and techniques was an interesting and useful experience. However, meeting the speed requirements proved to be difficult. In the end we had a lot of problems with the configuration setup. No matter what we tried, we failed to get it to work on the FPGA, even at low frequencies. The design met the timing requirements, and simulation showed that the multiplication worked according to specification.
Figure 5: 8 bit barrel shifter.
Modified Divider Design

!!!introduction!!!
Barrel shifter
Because the SRT divider requires the dividend and divisor to be normalized, a barrel shifter is implemented. A barrel shifter is a circuit which can shift data by a specific number of bits in a single cycle. Due to this property it is well suited for use in a fast SRT divider. Because of the normalization and the 32 bit size of the divisor, a left shifter with a maximum of 31 bit shifts is required. To build this barrel shifter a sequence of two-to-one multiplexers is used. An example of an 8-bit barrel shifter is given in Figure 5. Using this approach input data can be shifted an arbitrary number of bits in a single cycle. The number of two-to-one multiplexers required for an n-bit input with m selectable shift amounts (0 up to m - 1) is n log2 m. The number of levels in a barrel shifter design is log2 m. For the 8-bit barrel shifter in Figure 5, 24 (8 log2 8) two-to-one multiplexers are required, and for a maximum shift of 7 bits 3 (log2 8) levels are required.

For the SRT divider two barrel shifters are required: a 32 bit barrel shifter and a 64 bit barrel shifter, for the divisor and dividend respectively. Both of these barrel shifters require a maximum of 31 bit shifts in order to normalize the data. A higher number of bit shifts would be redundant because the dividend would be too large and an overflow would be inevitable. The 32 bit barrel shifter requires 160 (32 log2 32) multiplexers and the 64 bit barrel shifter requires 320 (64 log2 32) multiplexers. Both barrel shifters have 5 (log2 32) levels and thus require 5 bits to select the number of bits shifted. The bits which are shifted out are used to determine whether an overflow condition has occurred. When the dividend is positive, an overflow occurs when a 1 is shifted out, and when the dividend is negative, an overflow occurs when a 0 is shifted out.

Figure 6: 4 bit RB to binary converter.
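The level structure and the overflow rule can be sketched behaviourally as follows. Each loop iteration stands for one level of two-to-one multiplexers that either passes the data through or shifts it by a power of two. The function names, the returned shifted-out bits and the way the overflow check is phrased are assumptions made for this illustration, not a description of the implemented VHDL.

```python
def barrel_shift_left(value, amount, width=32, levels=5):
    """Shift left by `amount` using one mux level per select bit.

    Returns (shifted value, bits shifted out, with the first bit out as MSB).
    """
    mask = (1 << width) - 1
    value &= mask
    shifted_out = 0
    for level in range(levels):
        step = 1 << level
        if (amount >> level) & 1:  # select bit of this mux level
            shifted_out = (shifted_out << step) | (value >> (width - step))
            value = (value << step) & mask
    return value, shifted_out


def shift_overflow(shifted_out, amount, negative):
    """Positive operand: any 1 shifted out overflows. Negative: any 0 does."""
    expected = (1 << amount) - 1 if negative else 0
    return shifted_out != expected


def mux_count(width, levels):
    """n log2 m two-to-one multiplexers, with levels = log2 m."""
    return width * levels
```

For example, shifting 0xF0000001 left by 4 in the 32 bit shifter gives 0x00000010 with the bits 1111 shifted out, which is flagged as an overflow for a positive operand and accepted for a negative one.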
On-the-fly output decoder (radix-2)

Because a radix-2 SRT divider is used, the final result of the division is in a Redundant Binary (RB) representation, where the digits are taken from the set S = {-1, 0, 1}. This final result in RB representation needs to be converted to a twos complement representation. Several designers have proposed architectures for arithmetic operations in RB representation and for conversion from RB representation to twos complement representation. Although arithmetic operations in RB representation are useful for a further speed-up of the SRT-2 divider, the focus here lies on the conversion. One paper [!!!] presented a conversion logic in which the two bits used to encode an RB digit (x+ and x-) are considered to form two independent unsigned numbers. By subtracting x- from x+ over all RB digits, the result is converted. This subtraction can be done by Minus Minus Plus (MMP) adders. The logic in Figure 6 depicts a 4 bit RB to binary converter. By cascading more MMP adders the converter becomes suitable for larger RB numbers. In the case of our proposed divider, which has a 32 bit output, 32 MMP adders are required.
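The conversion principle, treating the x+ and x- bits as two unsigned numbers and subtracting them, can be sketched as follows. This is a behavioural model only: the subtraction is done with ordinary integer arithmetic rather than with a chain of MMP adder cells, and the function name and the most-significant-digit-first input order are assumptions made for the example.

```python
def rb_to_twos_complement(digits, width=32):
    """Convert RB digits (most significant first, each -1, 0 or 1) to twos complement."""
    x_plus = 0   # unsigned number formed by the x+ bits
    x_minus = 0  # unsigned number formed by the x- bits
    for d in digits:
        if d not in (-1, 0, 1):
            raise ValueError("RB digits must be -1, 0 or 1")
        x_plus = (x_plus << 1) | (1 if d == 1 else 0)
        x_minus = (x_minus << 1) | (1 if d == -1 else 0)
    # Stand-in for the MMP adder chain: subtract and wrap to `width` bits.
    return (x_plus - x_minus) & ((1 << width) - 1)
```

The digit string 1, 0, -1, 1 has x+ = 1001 and x- = 0010, so the result is 9 - 2 = 7. The redundancy of the representation shows in the fact that 0, 1, 1, 1 converts to the same value.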
Output decoder (radix-4)
References
[1] J.-Y. Kang and J.-L. Gaudiot, A Simple High-Speed Multiplier Design, IEEE Trans. Computers, vol. 55, no. 10, pp. 1253-1258, Oct. 2006.
[2] J.-Y. Kang and J.-L. Gaudiot, A Logarithmic Time Method for Twos Complementation, Proc. Intl Conf. Computational Science, pp. 212-219, 2005.
[3] J.-Y. Kang and J.-L. Gaudiot, A Fast and Well-Structured Multiplier, Proc. Euromicro Symp. Digital System Design, pp. 508-515, Sept. 2004.
[4] M. Prajapati and S. K. Lenka, Efficient Implementation of a Well-Structured Modified Booth Multiplier Design, Int. Journ. of VLSI and Embedded Systems, ISSN: 2249-6556, vol. 04, issue 03, May 2013.
[5] R. P. Rajput and M. N. Shanmukha Swamy, High speed Modified Booth Encoder multiplier for signed and unsigned numbers, 2012 14th International Conference on Modelling and Simulation, 978-0-7695-4682-7/12, IEEE, 2012.
[6] S.-R. Kuang, J.-P. Wang and C.-Y. Guo, Modified Booth multipliers with a regular partial product array, IEEE Trans. on Circuits and Systems, vol. 56, issue 5, pp. 404-408, May 2009.
Appendix A: 32 bit twos complementation logic

Please see the next page for the schematic.
Figure 7: 32 bits twos complementation logic.