High-Performance Processors
Ching-Long Su, Chi-Ying Tsui, and Alvin M. Despain
489
1063-6390/94$3.000 1994 IEEE
~.. -
sub-expressions which result in a maximum reduction in Unlike traditional instruction scheduling which
switching activities are extracted. Roy et al.[Roy 921 pro- mainly focuses on scheduling instructions for less pipe-
pose a low power state assignment method which uses line hazards, cold scheduling focuses on scheduling
simulated annealing to find the state encoding of a finite instructions for less switching activities while keeping
state machine which the total probability weighted Ham- performance as high as possible. Cold scheduling can be
ming distance of the states are minimized. Tsui et. a l . r s u i easily implemented by modifying a traditional list sched-
931 minimize the weighted switching activity and hence uler with a cost function to minimize the switching activi-
the power consumption during the technology decomposi- ties invoked by executing consecutive pairs of
tion and mapping phase of logic synthesis. All the above instructions.
methods assume the switching activities at the circuit
inputs are given and minimize the internal switching activ- 1.3 Organization of the Paper
ities based on this assumption. The rest of the paper is organized as follows. Sec-
tion 2 discusses the significance of the switching activi-
1.2 Our approach ties at the inputs of a circuit on the overall switching
In this work, we tackle the problem of minimizing activities of the circuit. Section 3 evaluates the use of
circuit switching activities at a higher level, the architec- Gray code as an instruction addressing scheme. We also
tural level, and we study this in the domain of high perfor- compare the switching activities using Gray code
mance microprocessors. Instead of minimizing the intemal addressing to traditional binary code addressing. Section
switching activities of each module of the microprocessor, 4 presents the cold scheduling techniques and some
we minimize the switching at the inputs of the modules. experimental results. Finally, conclusionary remarks are
Specifically,we minimize the switching activities of the provided in Section 5 .
address bus and the instruction bus of the microprocessor
using some novel hardware and software techniques. The 2. Switching Activity in Pipelined Processors
reasons for focusing on the instruction and address buses
are as follows. Switching activity depends on the sequence Figure 1shows a typical pipelined circuit of which
of the signal values applied at the inputs of the circuit. For each stage consists of a combinational circuit between
the datapath of a microprocessor, the signal values at the two latches. At the beginning of a clock cycle, the input
inputs depend on the data and hence can only be deter- signals of the combinational circuit are first latched in the
mined at run-time. For the instruction and address lines, input latch A. They are then evaluated in the combina-
the values are related to the static code which can be deter- tional circuit and propagated to the output which are
mined at complie time. Also for a typical pipelined RISC latched in the output latch B at the next cycle. These sig-
processor, the instruction line has to drive many modules nals become the input signals for the combinational cir-
such as the instruction cache, instruction register, the con- cuit at the next pipeline stage. The switching activities of
trol path decoder. The address lines also drive many mod- the combinational circuits depends on the logic and struc-
ules such as Memory Address Register, PC chains and the ture of the circuit and the switching activities at the out-
I/O pads. Moreover the instruction and address buses are put of the input latch. Although there is no general theory
usually long and have higher routing capacitance. There- on the relationship between the switching activities of the
fore they are nets with large capacitanceloading and mini- inputs and that of the internal nodes of a combination cir-
mizing the switching activity of these nets has a significant cuit, we believe that if the switching activities are high at
impact on the power consumption of the microprocessor. the inputs, then the intemal switching activities at the
combinational circuit also tend to be high and vice versa.
In this paper, we present two techniques which
reduce switching activity during program execution. The To support this claim, we carried out experiments
first technique, Gray code addressing, is a hardware
method which uses Gray code for instruction addresses.
The second technique, Cold Scheduling, is a software
method which schedules instructions during compilation
to reduce switching activities.
The advantage of Gray code over straight binary
code is that Gray code changes by only one bit as it
sequences from one number to the next. In other words, if
the memory access pattern is a sequence of consecutive
addresses, then each memory access changes only one bit Figure 1 Combinational circuit and latches
at its address bits. Due to instruction locality during pro-
gram execution, most of the time memory accesses are on a set of combinational circuit benchmarks obtained
sequential in nature. Therefore a significant number of bit from the ISCAS-89 and MCNC-91 benchmark set, and
switches can be eliminated through using Gray code studied the effect on the circuit switching activities if the
addressing. switching activities at the input lines are reduced. We
used the estimation method in [Ghosh 921 to estimate the
490
switching activities. Switching activities are measured as cycle, the impact can be significant since an instruction
the expected numbers of switching per cycle. First the scheduler has more instructions to schedule.
switching activity of each input is set to 0.5 (Model I). To better understand the impact of instruction
Then it is reduced to 0.42 (Model 11) and 0.32 (Model III) sequence on the switching activities in general purpose
respectively. Table 1 summarizes the circuit switching processors, we select a RISC-like processor, the VLSI-
BAM polmer 901,as an experimental architecture. This
microprocessor is pipelined with data stationally control.
There are five pipeline stages: Instruction Fetch (IF),
Instruction Decode (ID), Instruction Execution (E),
Memory access (M), and Write Back (WB). The instruc-
tion set of the VLSI-BAM is similar to the MIPS-2000
F l I p S 861 with some extensions for symbolic computa-
tion. Figure 2 shows the pipeline stages and the control
path of the VLSI-BAM processor. For each pipeline
stage, there is an instruction register, a PLA, and a latch
for control signals. Instructions are passed through
instruction registers and decoded by the PLAs in the pipe-
line stages. Control signals which are generated from
PLAs, are latched before they are sent to the datapath.
U
Bf4
.........
.................. ..................
.................. ............
.............
..........
.....
.....
..................
.................. .........
........
.........
..........
...... .........
.........
..............
.............
......... ..........
......... .:.:.:.:.......
.............
x4 14u147 122.35 12.90 103.70 26.18 .:.:.:.:.:.:.:.:.
:si:::::::<
..................
m$$;: :.........
.........
............
.....
.....
.
.............
.........
..............
........
avg.% 10.89 2 2 84
reduction
Table lCircuit switching activities vs input switching activities
t t
From the results, we see that in general reducing the Figure 2 Pipeline structure and control path of VLSI-BAM
switching activities at the inputs will reduce the total
switching activities of the combinational circuit. The aver-
age reductions are about 11%and 23% when the input A cycle-by-cycle instruction-levelsimulator is
switching activities are reduced by 16% and 36% respec- built for collecting the switching activities at the latches
tively. in the control path during execution of benchmark pro-
Switching activities at the staging latches of a pipe- grams. Benchmark programs used in this paper are shown
lined microprocessor are affected by various factors. For in Table 2. The benchmarks are ranging from less than
latches in the datapath, switching activities are mainly 1,000 cycles to larger than lO,OOO,OOO cycles. These
determined by the run-time data sequences. For latches in benchmark programs are selected from the Aquarius
the control path, switching activities are strongly depen- benchmark suite [Haygood 891. Applications of these
dent on the instruction execution sequences. Since run- benchmark programs include list manipulation, data base
time data is not well known at compile-time, the impact of query, theorem prover, and computer language parser.
cold scheduling on switching activities at the latches in the Benchmark programs are first compiled through the
data path is not clear. However, instruction execution Aquarius Prolog compiler [Van Roy 921 into an interme-
sequences are controlled by the instruction scheduler at diate code (BAM code), which is target machine indepen-
compile-time. The impact of cold scheduling on switching dent. The BAM code is then further compiled into
activity at the latches in the control path is more direct. In machine code of the target machine, the VLSI-BAM.
this paper, we will focus on the impact of cold scheduling
for reducing switching activities in the control path. 3. Gray Code Addressing
Different h".ion sequences can have a signifi- In traditional von Neumann machines, data is
cantly different effect on the switching activities and the fetched from memory before executed. For binary code
impact depends on the type of architecture. In a CISC pro- addressing scheme, data and instructionsthat are
cessor, the impact may not be so obvious since one instruc- accessed sequentially are located in the memory with
tion may need several processor cycles to execute. In consecutive binary address. For sequential memory
contrast, for a general pipelined RISC-like processor, in access, next address is obtained by doing a binary incre-
which most instructions can be executed in one processor ment on the current address.
491
Table 2 Benchmark programs Table 4 Gray code representation and decimal equivalent
)Benchmark 1 Cvcles I Description 1
nreverse 4,287 Naive reverse of a 3O-element list
qsort 4.560 Qlucksortof a 50element list
query 97259 Query a static database
circuit
1 1 1
I 0010 I 2 I 1010 I 10 I mula summarize the conversion.
I cm11 I 3 I 1011 I 11 I For example, let B be a binary number <1,1,0,1>
the decimal equivalent of which is 13. The values of b,
b2, bl, and bo are 1, 1, 0, 1 respectively. The Gray code
representation is then equal to <b3, b3@b2, b@ bl, bl@
0110 1110
bo> which is equivalent to <1,0,1,1>.
0111 1111
Similarly the conversion of Gray code to binary
code also uses the exclusive or operation. However, it is
3.2 Gray Code Representation more complex. bi is equal to the exclusive or of gi and all
A Gray code sequence is a set of numbers, repre- of the bits of G that preceding gi, i.e. gi+l, gi+2,..., gn-l.
sented as a combination of digits 1s and Os, in which con- The most significant bit of the binary representation and
tiguous numbers have only one bit different. A formal Gray code representation are the same. The following for-
definition of a Gray code sequence is described as follows mula summarize the conversion.
[Hayes 881, Gray code --> Binary code
1. GI = 0, 1.
2. Let Gk = go, gl,..., g2k-2,g2k-1. Gk+l is formed by first bn = Sn
preceding all members of the sequence Gk by 0, then &);=bite gj (i=n-l, 0)
repeating Gk with the order reversed and all members pre-
ceded by 1. In other words, For example, let G be a Gray code number
c1,1,0,1> the decimal equivalent of which is 9. The values
Gk+l = ago, og1*--.9Og2k-290g2k-19 lg2k-1. lg2k-2...., of g3, g2, gl, and go are 1, L O , 1 respectively. The binary
Igl, ]go code representation is then <g3, g3@g2, g3@g2@gl.
For example, G2 = 00,01,11,10 and G3 = OOO, g3@g2@gl@gO> which is cl,O,O,l>.
001,011,010,110,111,101,100. Clearly the foregoing
construction ensures that consecutivemembers of a Gray
code sequence differ in exactly 1bit. Table 4 shows 4-bit 3.4 Number of Bit Switches in Binary Code vs.
Gray Code representations and their corresponding deci- Gray Code
mal equivalent. The number of bit switches of a sequence of num-
492
I
bers can be significantly different depending on the code modified such that the calculated target address is the cor-
representations. Figure 3 shows an example. For the rect target address in Gray code addressing system.
sequence of numbers from 0 to 16,shown in Figure 3(a), Figure 4 shows an example of branch instructions
there are 31 bit switches when the number are encoded in in binary CO& and Gray code addressing systems. We
binary representation while only 16 bits switch when they represent a binary code as <b,, bn-1....,bo>2 and a Gray
are encoded in Gray code representation. However, for the code as eg,, g,-l ,.... g g p , where the bit length is n+l.
following sequence e1 ,3,7,15,14,12,13,9,8,10,11,2,6, The correct execution sequence is (AB,br,C,and D)
4,5,0,16>,which is shown in Figure 3(b), there are only 17 where A, B, C, and D are instructions and br is a taken
bit switches for binary representation compared to 29 bit branch. The target address of the branch is the addition of
switches for Gray code representation. current program counter and the offset which is specified
in the branch instruction. In binary code addressing sys-
~~~ ~
For random access patterns, Gray code and binary 21 <10101>2 21 < l l 1 l b g
code have similar number of bit switches. Note that the 22 <10110>2 22<11101>,
number sequence in Figure 3(b) is careful selected in favor
of the binary representation. In general, this special ~~
sequence is rather unlikely to happen. For consecutive 'igure 4 Branch instructions in binary code and Gray code addrcssii
access patterns, which occur often in a general processor
for executing consecutive instructions in basic blocks, 3.6 Results
Gray code addressing has fewer bit switches. To validate the advantage of Gray code addressing,
we implement a Gray code addressing scheme on the
3.5 How To Use Gray Code Addressing VLSI-BAM [Holmer 901. Table 5 summarizes the switch-
In a general uni-processor, an instruction is fetched ing activity at the address bits of the processor. For
from an address pointed to by the program counter. After instruction accesses, compared to the traditional binary
this instruction is fetched, the program counter is increased code addressing scheme, Gray code addressing signifi-
by one for the next fetch. The program counter is also
cantly reduces the address bits switching activities. For
modified by branch instructions based on branch condi-
data accesses, switching activity resulting from both
tions. In the binary code addressing system, a binary schemes are quite close. This is because instruction
counter is needed for incrementing the program counter. accesses are more sequential than data addresses for nor-
For branch instruction, the calculated target addresses are mal program execution. Moreover, the more instruction
written directly into the program counter if the branch is locality a program has, the more reduction will be.
taken. In Gray code addressing system, a Gray code We measure the performance of the address coding
counter is needed for incrementing the program counter. scheme by the number of bit switches per executed
For branch instructions, the calculated target address is instruction, denoted as BPI. Figure 5 shows the BPI of
directly written into the program counter if the branch is instruction addresses in binary code and Gray code for
taken. The only difference between binary code and Gray different benchmark programs. The benchmark program
code addressing systems is the instruction field for target with the most significant reduction of bit switches is
addresses. In a Gray code addressing system, the instruc- fastqueens. The BPI of instruction addresses in binary
tion field for the target address in a branch instruction is code and Gray code are 2.46 and 1.03 respectively. The
493
-
Table 5 Switchlng Actlvltles at the address blts
then 7.91%.
4. Cold Scheduling
Traditional instruction scheduling algorithms
mainly focus on reordering instructions to reduce pipelinc
stalls, avoid pipeline hazards, or improve resource usage.
More recent instruction scheduling algorithms such as
trace scheduling [Fisher 8 11, percolation scheduling
[Nicolau 841, and global scheduling [Bemstein 911 sched-
ule instructions across basic blocks in order to increase
instruction-level parallelism. The main goal of these
scheduling algorithms is to improve performance. To
fastqueens reduce power consumption, these instruction scheduling
qsort
algorithms need to be modified to adjust to the new objec-
tive.
reduczr
In this section, we present the details of our cold
circuit scheduling algorithm. Basically cold scheduling uses m-
semigroup ditional performance-driven scheduling techniques with
special heuristics for reducing switching activities.
nand
Before we go into the details of cold scheduling, we first
boyer review the traditional list scheduling algorithm.
browse
chat 4.1 Scheduling for Performance
Traditional instruction scheduling approaches con-
1.0 1.5 2.0 2.5 3.0 sist of three steps: 1) partition a program into regions or
Figure 5 Bit switches of Instructlon addresses basic blocks. 2) build a control dependency graph (CDG)
Figure 6 shows the BPI of data addresses in binary and/or data dependency graph (DDG) for each code
code and Gray code among benchmark programs. The BPI region or basic block. 3) schedule instructions in CDG
of data addresses in binary code and Gray code are very and/or DDG within resource constraints.
close. Among the set of benchmark programs, we find that The main goal of the traditional instruction sched-
two out of nine benchmark programs @stqueen and uler is to schedule the instruction sequence such that it
browse) have lower BPI in binary code than in Gray code. can be executed in a target machine as fast as possible
The other seven benchmark programs have slightly larger with minimal pipeline stalls. Therefore the quality of an
BPI in binary code than in Gray code. The average BPI in instruction scheduler is measured by the amount of pipe-
Gray code addressing is 1.28 while that in binary code line stalls introduced by. the output instruction sequence
addressing is 1.39. The average reduction of bit switches is when it is executed on the machine.
Let B = 11, 12. ...be an output instruction sequence
of a basic block. The number of pipeline stall cycles
494
between execution of instruction Ij and Ij+l is denoted as
D(Ij,Ij+l). For example, if the= is no pipeline stall cycles
between execution of instruction 11 and 12,then D(11,12)=
0. Otherwise, if there are m pipeline stall cycles between
execution of instruction I, and 12, then D(11,12) = m. The
total pipeline stall cycles in the execution of a basic block
is denoted as PS = Z D(Ij,Ij+l),j = O.. n-1. The main
objective of an instruction scheduler is therefore to mini-
mize PS.In the case of scheduling a region, the objective
is thus to minimize the pipeline stalls of this region,
instead of in an individual basic block. For example, if a
region consists of basic blocks B1&. ...Bk, and the num-
ber of pipeline stalls in these basic blocks are PS 1,
P s 2 , ...P s k , then the instruction scheduler tries to minimize
l/k(wl*PS1+ w2*PS2+... +wk*Psk), where wj is a weight
of esumated dynamic execubon frequency of a basic block (a) dah dependencies gra1
Bj.
Figure 7 shows a data dependency graph and vari- labelO(movd4;l)).
ous output instruction sequences from a typical instruction 1 mov(et0). %D(1,2)= 0
2 umax@,e,e). %D(2.3)= 0
scheduler. The target machine is assumed to have a one
3 addi(e.4.e). %D(3,4) = 0
cycle delay slot far load/store instructions after store %D(4,6) = 0
4 pshd(tO/cp,e,2).
instructions due to bus conflicts. These instruction 6 ldi(-1,tl). kD(6.7) = 0
sequences running on the VLSI-BAM introduce different 7 std(r(2)/r(l),e+ -4). %D(7,8)=0
pipeline stalls. Instruction sequence I1 suffers three cycles 8 add28(r(O),tl,r(O)). %D(8,10)= 0
of pipeline stalls. Instruction sequences I and 111 have the 10 mw(r(3).r@)). sOD(l0.9)= 0
best scheduling with only one pipeline stall cycle. 9 st(r(O).e+ -6). SD(9.5) = 1
495
cycle), instruction sequence 111has the worst switching basic blocks since target addresses of branch/jump
activity cost (45% more than that of instruction sequence instructions are decided before instruction scheduling.
I). On the other hand, the switching activities of instruc- For instruction scheduling like trace scheduling, percola-
tion sequence 11is quite close to that of instruction tion scheduling, and global scheduling, this scheme is
sequence I (only 5% more), though instruction sequence 11 hard to apply.
has the worst performance (three pipeline stall cycles).
The above example indicates that there is no clear
correlation between performance and switching activities.
The design for low power hence does not conflict with the
design for performance. In other words, we may be able to
achieve both goals at the same time.
496
DAG construction Cold Scheduling
INPUT: An instruction.”S INPUT: DAG repment+m and bit switching table
OUTPUT DAG representation. OUTPUT A scheduled mstrucuon s m m
0. Set ready list RL to be ( )
0. Reverse the instruction sequence from an instruction stream. Set the last scheduled instrudion Is1 = NOP
1. For any resource R .
set R.input to be {) 1. Remove ready instructions from DAG and
set R.output to be {) add these ready instruction into RL.
2. Visit an instruction I , identify its input and output resources, 2. For each instruction I in RL,
INSand OUTS. find S(LsI.I).
3. For each element Y in OUTS, 3. Remove an instructi? I with the smallest S(LS1.I) from RL.
IF there is any instructionJ associated with Xinput, The removed mstrudlon becomes the current LSI.
THEN create an arc (IJ). write out LSI.
IF there is any inmction K associated with Y.output,
THEN create an arc (IK). 4. IF there is any instruction yet to be scheduled,
set Y.output to be {I). THENgotostepl,
set Kinput to be {). ELSE retum.
4. For each element X in INS, Firmre 10 Cold Scheduling Algorithm
-~
IF there is any instruction L associated with X.output, ~
498