circuit. In high-speed applications (~5 GHz), each evaluate phase has only 0.1 ns to complete (with the other 0.1 ns for precharge), so a level-restoring device may incur so much performance overhead per operation that it is not viable. To prevent leakage power consumption at critical nodes through the precharge device, clock gating is introduced: it enables precharge of a device only while the device is in use, not while it is inactive. Section III explains in detail how clock gating improves overall performance and prevents leakage power consumption.
operation. These modes define the power consumption of each datapath depending on the word length of the input.

Clock gating
When the circuit senses that a datapath will not be used to evaluate the current input, that datapath remains in precharge mode. This reduces dynamic power dissipation, since the clock otherwise charges and discharges the input capacitances of the precharge PMOS transistors every cycle. Clock power is $P = \alpha C V^2 f$, where the activity factor $\alpha$ is 1 for a continuously active clock; for a gated clock, $\alpha$ depends on the inputs received and is likely to be much less than 1.

Sleep mode
If a datapath remains unused for a prolonged period of time, that datapath is shut down and all precharged nodes are allowed to leak away. Sleep mode is achieved by turning off both the precharge PMOS and the evaluate NMOS; this introduces two large resistances and thus minimizes the power consumption of unused circuitry.

Operation mode
If a datapath is determined to be necessary for processing data, it receives a clock signal that precharges the path if it was in sleep mode, or evaluates directly if the path was in clock gating mode. The length of the clock phase may vary with the bit length of the input, which reduces circuit activity time compared with fixed-clock operation.
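As a rough illustration of the three modes above, the following Python sketch assumes the standard switched-capacitance relation $P = \alpha C V^2 f$ and a simple idle-cycle threshold for entering sleep. The names `Mode`, `clock_power`, `next_mode`, and `sleep_threshold`, as well as all numeric values, are hypothetical and are not taken from the proposed design.

```python
from enum import Enum


class Mode(Enum):
    OPERATION = "operation"
    CLOCK_GATED = "clock gating"
    SLEEP = "sleep"


def clock_power(alpha, capacitance, vdd, freq):
    """Dynamic clock power P = alpha * C * V^2 * f (alpha is the activity factor)."""
    return alpha * capacitance * vdd ** 2 * freq


def next_mode(needed, idle_cycles, sleep_threshold=1000):
    """Pick a datapath mode.

    needed          -- True if the datapath must evaluate the current input
    idle_cycles     -- consecutive cycles the datapath has gone unused
    sleep_threshold -- hypothetical idle count after which the path sleeps
    """
    if needed:
        return Mode.OPERATION
    if idle_cycles >= sleep_threshold:
        return Mode.SLEEP
    return Mode.CLOCK_GATED


# Illustrative numbers only: 1 pF clock load, 1.0 V supply, 5 GHz clock.
always_on = clock_power(1.0, 1e-12, 1.0, 5e9)  # clock toggles every cycle
gated = clock_power(0.25, 1e-12, 1.0, 5e9)     # clock reaches the path in 25% of cycles
print(f"ungated: {always_on * 1e3:.2f} mW, gated: {gated * 1e3:.2f} mW")
```

The threshold captures the trade-off only qualitatively: a path that sleeps too eagerly pays the cost of recharging every node when it wakes.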
Figure 1: Dynamic logic with level restorer

B. WALLACE TREE MULTIPLIER
In a variety of applications, a basic high-speed Wallace Tree multiplier implemented in dynamic logic does not have optimal performance or power consumption. In multimedia applications, where the multiplier is always on and the input bit lengths are highly correlated, the dynamic Wallace Tree consumes unnecessary power because the multiplier is precharged every cycle. Since the input bits are correlated, if the inputs do not require the entire Wallace Tree to compute, parts of the tree will be inactive for long periods of time. These parts nevertheless still leak charge and are still recharged by the precharge devices. If those parts of the multiplier can be turned off, power consumption can be reduced significantly. In microprocessors, where the multiplier can be idle for long periods of time, a constant clock that precharges the critical nodes also results in unnecessary power consumption. In this case, a sleep mode that disconnects the multiplier from the supplies makes sense. When computing legacy code (such as 16-bit and 32-bit operations) on a 64-bit multiplier, the whole word length is never used, so precharging only the active parts yields optimal power consumption.

III. PROPOSED ADDITIONS TO THE MULTIPLIER
In the discussion of power saving, the following modes of operation need to be defined: clock gating, sleep, and
[Figure: pull-down network (PDN) schematics, with panels labeled Clock Gating and Operation Mode]
In order to obtain the aforementioned modes of operation and to maximize power savings, several circuits need to be implemented: most significant bit (MSB) detection, a variable duty cycle clock, a datapath state selector, and a data multiplexer.

Most significant bit detection
This circuit determines the bit length of the incoming data. MSB detection must be fast and efficient, since it controls the length of the clock (to reflect operation time), the arrangement of the data for best calculation efficiency, and the state of every datapath.
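The role of MSB detection can be sketched behaviorally in Python. This is a software model under stated assumptions, not the detection circuit itself: the function names are hypothetical, operands are treated as unsigned, and placing the longer operand first is one possible data arrangement.

```python
def msb_position(value):
    """Return the bit length of an unsigned operand (0 for the value 0)."""
    if value < 0:
        raise ValueError("expected an unsigned operand")
    return value.bit_length()


def arrange_operands(a, b):
    """Order the operands so the first is at least as long, in bits, as the second."""
    if msb_position(a) >= msb_position(b):
        return a, b
    return b, a


def active_bits(value, word_length=64):
    """Flag which bit positions of a word are needed to represent the operand."""
    length = msb_position(value)
    return [i < length for i in range(word_length)]


first, second = arrange_operands(0x3, 0x100)
print(msb_position(first), msb_position(second))
print(sum(active_bits(0xFFFF)), "of 64 bit positions needed")
```

The flags returned by `active_bits` stand in for the per-datapath state information that the detected length would drive.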
Figure 3: MSB detection circuit

Variable duty cycle clock
This clock generator is shown in Figure 4. A variable clock lets us reduce the operation time of the circuitry. This benefits us in two ways: first, the circuit can run at a higher clock speed, depending on the complexity of the operation; second, the less time a circuit spends in operation mode, the less current leaks away.
any standard ALU design. The philosophy of design automation requires scripting that reflects the structural regularity of circuits. The goal is to generate an ALU of any length containing the circuitry described in the previous section. Some parts of ALU structures are highly repetitive, while others are placed irregularly, so there is a technique for each situation: scripting for regular circuits and scripting for irregular circuits.

Scripting for regular circuits
In our Wallace Tree example, the following diagram shows that a 5-bit Wallace Tree is just a 4-bit tree with an extra row of adders and a lengthened vector add unit. We can exploit this structural regularity to generate Wallace Trees of any bit length. There are two fewer rows of parallel adders than the number of bits in the adder; each row has two more full adders than the previous row; and the rows are followed by a vector adder twice as long as the number of bits (see Figure 6). The most structured part of a Wallace Tree is the block of AND gates; it is simply a square with side width equal to the number of bits. The setup circuitry, such as bit detection and the data multiplexer, scales linearly with the number of bits. The result of such scripting features similarly named circuit elements, with slight variations in numbering to differentiate one adder from another.
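The counting rules above translate directly into a generator script. The Python sketch below is illustrative only: the naming scheme (`and_`, `fa_`, `va_`) and the size of the first adder row (`first_row_adders`) are assumptions, since the text does not specify them.

```python
def wallace_tree_elements(n_bits, first_row_adders=2):
    """Generate element names for an n-bit Wallace Tree from its regular structure.

    first_row_adders is a hypothetical parameter: the text states only that
    each row has two more full adders than the previous one.
    """
    if n_bits < 2:
        raise ValueError("n_bits must be at least 2")
    elements = {
        # Square block of AND gates, n_bits on each side.
        "and_gates": [f"and_{r}_{c}" for r in range(n_bits) for c in range(n_bits)],
        "adder_rows": [],
        # Vector adder twice as long as the number of bits.
        "vector_adder": [f"va_{i}" for i in range(2 * n_bits)],
    }
    # Two fewer rows of parallel adders than the number of bits.
    for row in range(n_bits - 2):
        count = first_row_adders + 2 * row
        elements["adder_rows"].append([f"fa_{row}_{i}" for i in range(count)])
    return elements


tree4 = wallace_tree_elements(4)
tree5 = wallace_tree_elements(5)
print([len(row) for row in tree4["adder_rows"]], len(tree4["vector_adder"]))
print([len(row) for row in tree5["adder_rows"]], len(tree5["vector_adder"]))
```

Under these assumptions the 5-bit tree reuses every row of the 4-bit tree and adds one more row plus a longer vector adder, which is the regularity the script exploits.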
Figure 4: Variable clock generator

Datapath state selector
Each datapath has a different utilization rate. In our benchmark multiplier: in highly uncorrelated operation, every bit can be considered a noise signal with a 50% utilization rate when active; in correlated operation, the most significant bits see very few transitions while the least significant bits still have a noise distribution; in sparse computation mode, idle prevails, so the utilization rate of every bit is minimal; in legacy mode, a select set of bits is guaranteed never to be used. Our datapath state selector must have the following attributes: minimal operation time, so that a datapath does not take long to switch modes when the data is highly uncorrelated; and a careful choice between clock gating and sleep modes when the data is correlated or mostly idle, because bringing elements from sleep to active takes a while, as every node needs to be recharged.

Data multiplexer
The relative bit lengths of the two inputs may vary, and a fixed circuit is more easily optimized for the condition that A is as long as or longer than B. A data multiplexer is thus needed to route the data into the operational circuitry so that this condition is always satisfied. This enables regularly structured operational circuitry to compute data more efficiently.

IV. DESIGN METHODOLOGY
In showcasing our power reduction and performance boosting circuits for general application, we have developed a full suite of implementation techniques to quickly convert
Figure 5: Regularity of a 4-bit by 4-bit Wallace Tree [1]

Scripting for irregular circuits
The input and output networks between the rows of parallel adders in a Wallace Tree are highly irregular: some adders take an original input, some take a carry and a save, and some take other combinations of original, carry, and save. On top of the irregular wiring, the wiring resistance and capacitance from one node to another must be simulated, and the resulting model must reflect the varying lengths of the paths. A data structure is necessary to automate the generation of such wiring networks. Whereas in a regularly structured script the circuit elements can be generated on the fly with small variations in numbering, an irregularly structured network requires the names to be
entered into a database. Each entry can be referred to by its relative position in the circuit, and can be updated with new names as more circuit elements are connected hierarchically. The wiring network in a Wallace Tree multiplier can be represented in 3D: the top level represents a row of adders, each outputting a sum wire and a carry wire, as seen in Figure 6. The relative positions of these two wires are known, so they can be entered into the database in the correct locations. The level below these adders is an interconnect network whose wires either extend an original wire or carry the sum or carry signals to the next adder. These wires vary in length depending on their originating points. With a database, these attributes are remembered, so the correct values for resistance and capacitance can be extracted.
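One way to picture such a database is a dictionary keyed by position, sketched below in Python. The key layout `(level, column, slot)`, the class and field names, and the per-unit-length parasitic values are all hypothetical; real values would come from the target technology.

```python
from dataclasses import dataclass

# Hypothetical per-unit-length wire parasitics (illustrative values only).
R_PER_UNIT = 0.1      # ohms per unit length
C_PER_UNIT = 0.2e-15  # farads per unit length


@dataclass
class Wire:
    name: str      # net name used in the generated netlist
    kind: str      # "original", "sum", or "carry"
    length: float  # wire length in arbitrary units


class WireDatabase:
    """Stores wires by their relative position in the tree."""

    def __init__(self):
        self._wires = {}

    def add(self, level, column, slot, wire):
        self._wires[(level, column, slot)] = wire

    def rename(self, level, column, slot, new_name):
        # Update the name once the wire is connected at a higher hierarchy level.
        self._wires[(level, column, slot)].name = new_name

    def lookup(self, level, column, slot):
        return self._wires[(level, column, slot)]

    def parasitics(self, level, column, slot):
        """Return (resistance, capacitance) scaled by the stored wire length."""
        wire = self.lookup(level, column, slot)
        return wire.length * R_PER_UNIT, wire.length * C_PER_UNIT


db = WireDatabase()
db.add(0, 3, 0, Wire("fa_0_3_sum", "sum", 2.0))
db.add(0, 3, 1, Wire("fa_0_3_carry", "carry", 3.5))
db.rename(0, 3, 0, "row1_in_3")
resistance, capacitance = db.parasitics(0, 3, 0)
print(db.lookup(0, 3, 0).name, resistance, capacitance)
```

Because each wire keeps its own length, wires that start at different points automatically receive different resistance and capacitance values.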
Input Sequences
We will choose input sequences that incur the maximal switching activity in the test multipliers, to test the input extremes for dynamic switching power and propagation delay. Long periods of inactivity will be injected to test the advantages of the sleep mode and inactivity detection in our proposed design. Figure 7 below shows the various input bit-width sequences we will test.
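A stimulus of this kind could be scripted as follows. The sketch assumes that alternating between all zeros and all ones within a given bit width is an acceptable stand-in for a high-switching input, which is not established by the text, and it uses `None` to mark idle cycles. The function names and segment format are hypothetical.

```python
def toggling_sequence(bit_width, cycles):
    """Alternate all-zeros and all-ones so every input bit toggles each cycle."""
    all_ones = (1 << bit_width) - 1
    return [all_ones if i % 2 else 0 for i in range(cycles)]


def build_test_sequence(segments):
    """Concatenate active and idle segments.

    segments -- list of (bit_width, cycles); a bit_width of None marks idle cycles.
    """
    sequence = []
    for bit_width, cycles in segments:
        if bit_width is None:
            sequence.extend([None] * cycles)  # None means no operation is issued
        else:
            sequence.extend(toggling_sequence(bit_width, cycles))
    return sequence


# Two 64-bit operations, a long idle period, then four 16-bit operations.
stimulus = build_test_sequence([(64, 2), (None, 100), (16, 4)])
print(len(stimulus), stimulus[:2], stimulus[-4:])
```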
[Plot axes: No. of Bits versus Time]
Figure 6: Hierarchical structure of the Wallace Tree [1]

V. TESTING METHODOLOGY
We will benchmark our proposed additions using 90 nm STMicroelectronics standard cell technology. The proposed 64-bit by 64-bit multiplier will be compared against a static CMOS design and a basic dynamic design of the same input size, as well as against smaller word length multipliers (i.e., 16-bit by 16-bit and 32-bit by 32-bit Wallace Trees), for power consumption and propagation delay. We will test the full range of operation using a specific set of test values that vary the input word lengths, interspersed with periods of inactivity.

Input Length Choices
For the basic Wallace Tree multiplier, a 1-bit by 64-bit multiplication activates a different part of the tree than a 64-bit by 1-bit multiplication. The two operations have different power consumption as well as different propagation delays. Our proposed design, however, should be unaffected by the input order. Also, a 32-bit by 32-bit multiply on our proposed 64-bit multiplier will be tested against a pure 32-bit Wallace Tree as well as the two 64-bit basic Wallace Trees. We want to see whether our design still has power and performance advantages over a dedicated 32-bit multiplier.
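The operand-length cases named above could be generated with a short script such as the following Python sketch. The helper names, the fixed seed, and the choice of random operand values are assumptions for illustration; only the three length pairs come from the text.

```python
import random

# Operand bit-length pairs discussed in the text: (length of A, length of B).
LENGTH_PAIRS = [(1, 64), (64, 1), (32, 32)]


def random_operand(bit_length, rng):
    """Return a random integer whose bit length is exactly bit_length."""
    if bit_length < 1:
        raise ValueError("bit_length must be at least 1")
    if bit_length == 1:
        return 1
    top_bit = 1 << (bit_length - 1)  # force the most significant bit to 1
    return top_bit | rng.getrandbits(bit_length - 1)


def operand_pairs(rng):
    """Build one (A, B) test pair for each length pair."""
    return [(random_operand(a, rng), random_operand(b, rng)) for a, b in LENGTH_PAIRS]


rng = random.Random(0)  # fixed seed so the test vectors are repeatable
pairs = operand_pairs(rng)
for a, b in pairs:
    print(a.bit_length(), "x", b.bit_length())
```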
Figure 7: Various testing input sequences

VI. REFERENCES
1. J. Rabaey, A. Chandrakasan, B. Nikolic, Digital Integrated Circuits, 2nd ed.