
Pipelining

Pipelining is used by virtually all modern microprocessors to enhance performance by overlapping the execution of instructions.

Microcoding became less attractive as the gap between RAM and ROM speeds narrowed. Complex instruction sets were difficult to pipeline, so performance was difficult to increase as gate counts grew.

The Iron Law explains the processor design space as a trade-off among three factors: instructions/program, cycles/instruction, and time/cycle. Load-Store RISC ISAs were designed for efficient pipelined implementations.

Iron Law of Processor Performance

Time per program = Instructions per program × Cycles per instruction × Time per cycle

Instructions per program depends on the source code, compiler technology, and the ISA.
Cycles per instruction (CPI) depends upon the ISA and the microarchitecture.
Time per cycle depends upon the microarchitecture and the base technology.
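To make the multiplication concrete, here is a minimal Python sketch of the Iron Law; the instruction count, CPI, and cycle time below are illustrative numbers, not taken from any real machine:

    # Iron Law: time/program = instructions/program * cycles/instruction * time/cycle
    def execution_time_ns(instructions, cpi, cycle_time_ns):
        return instructions * cpi * cycle_time_ns

    # Illustrative: 1M instructions, CPI 1.25, 2 ns cycle (500 MHz)
    print(execution_time_ns(1_000_000, 1.25, 2.0))  # 2500000.0 ns = 2.5 ms

Improving any one factor - fewer instructions, lower CPI, or a shorter cycle - reduces execution time proportionally.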

Characteristics of Pipelining

If the stages of a pipeline are not balanced and one stage is slower than the others, the throughput of the entire pipeline suffers.

In terms of a pipeline within a CPU, each instruction is broken up into different stages. Ideally, if each stage is balanced (all stages are ready to start at the same time and take an equal amount of time to execute), the time taken per instruction (pipelined) is:

Time per instruction (pipelined) = Time per instruction (unpipelined) / Number of stages (ideal case)

In terms of a CPU, pipelining has the effect of reducing the average instruction time, and therefore the average CPI. Example: if each instruction in a microprocessor takes 5 clock cycles (unpipelined) and we have a 4-stage pipeline, the ideal average CPI with the pipeline will be 1.25.

An Ideal Pipeline

An ideal pipeline has these properties: all objects go through the same stages; no resources are shared between any two stages; the propagation delay through all pipeline stages is equal; and the scheduling of an object entering the pipeline is not affected by the objects in other stages.

To pipeline MIPS: first build MIPS without pipelining with CPI = 1; next, add pipeline registers to reduce the cycle time while maintaining CPI = 1.

With pipelining, parallelism is achieved and performance is improved by starting to execute one instruction before the previous one is finished. The simplest kind of pipeline overlaps the execution of one instruction with the fetch of the next instruction. Because two instructions can be processed simultaneously, we say that the pipeline has two stages.


However, things aren't quite that neat. For example: on a cache miss, instruction fetch may have to wait for many cycles, and some instructions take more than one cycle to execute (e.g. load and store instructions).

A branch instruction changes where the next instruction is fetched. So we don't really get one instruction every cycle with a 2-stage pipeline. A pipeline may also have more than two stages; Stone describes an instruction pipeline with 7 stages.

This sequence shows that the complete instruction execution requires 70 ns. If each instruction takes this long and we do not start the next instruction before completing the current one, we have a processing rate of 1 instruction per 0.07 microseconds = 14.3 MIPS.

Synchronization and Pipelines


We can improve on this processing rate by beginning subsequent instructions prior to completion of current ones. But to do this, we need to synchronize the passing of results from one stage to the next. We use a fixed clock rate and synchronize all processing with this fixed clock. Each stage performs its function within the clock period and shifts its (intermediate) results to the next stage on the clock edge.

Clock Rate

The clock rate can be chosen such that the longest stage finishes in one clock period. An alternative would be to choose a faster clock rate and insert wait states (an integral number of delay periods) to allow the longer stages to finish their processing.

With the 7-stage example, we could choose a 20 ns clock period (the longest stage is 20 ns), which is a 50 MHz clock. If we can sustain an instruction stream through this pipeline at this rate (a new instruction into stage 1 and a completed instruction out of stage 7 every 20 ns), we get a processing rate of 1 completed instruction every 0.02 microseconds = 50 MIPS.

An alternative clock period of 5 ns (implied by Stone in figure 3.5) would yield the same
processing rate, because we could start a new instruction only every 4 clock periods. Following is the pipeline with a 200 MHz clock:

Synchronous Pipeline

With this example, the 20 ns stages use four clock periods and the 5 ns stages complete their work in one clock period, but must wait 3 additional periods before shifting results to the next stage.

Latency
Even though we can complete one instruction every 20 ns in the example, each instruction takes 80 ns to get through the pipeline from initiation to completion. This is called the latency of the pipeline. Shorter clock periods reduce latency. For example, if we set the clock period to 20 ns instead of 5 ns for the example pipeline, the pipeline latency would be 140 ns instead of 80 ns.
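A small script makes the clock-quantization effect concrete. Stone's exact stage breakdown is not reproduced above, so the split below is a hypothetical one, chosen only to be consistent with the numbers in the text (70 ns total, 20 ns longest stage, 80 ns latency at a 5 ns clock):

    import math

    def clocked_latency(stage_times_ns, clock_ns):
        # Each stage occupies a whole number of clock periods.
        return sum(math.ceil(t / clock_ns) * clock_ns for t in stage_times_ns)

    stages = [20, 20, 16, 5, 5, 2, 2]   # hypothetical 7-stage split, sums to 70 ns

    print(clocked_latency(stages, 5))   # 80 ns  (16 periods of 5 ns)
    print(clocked_latency(stages, 20))  # 140 ns (7 periods of 20 ns)
    print(1000 / max(stages))           # 50.0 -> one result per 20 ns = 50 MIPS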

Speedup
The speedup of a pipeline measures how much more quickly a workload is completed by the pipeline processor than by a non-pipeline processor.

In the example, Stone's best serial time is the asynchronous time of 70 ns from the first diagram. The parallel execution time (per instruction) is 20 ns, so the speedup for this example is 70/20 = 3.5. In reality, we would never use the asynchronous time, so I would prefer to use the clocked serial time to calculate speedup; the example pipeline would then give a speedup of 80/20 = 4.0. Once again, this assumes that we can keep the pipeline full and complete an instruction every 20 ns (which is almost never the case), which leads to the concept of efficiency.

Three things prevent 100% utilization of processing units:

We must spend some serial time to set up the parallel units. The parallel units are idle during this setup time.

For pipelined processors, it takes some time to fill the pipe at the start of processing and to empty the pipe at the end of processing. During this startup and finish-up time, not all stages are working.

The pipeline has hazards that prevent a steady flow of data through the pipe. For example, the instruction pipeline may have branches and data dependencies that prevent an instruction from proceeding until an earlier instruction completes processing and exits the pipe.

Basic Performance Issues in Pipelining


Pipelining increases the CPU instruction throughput - the number of instructions completed per unit of time. But it does not reduce the execution time of an individual instruction. In fact, it usually slightly increases the execution time of each instruction, due to overhead in the pipeline control. The increase in instruction throughput means that a program runs faster and has lower total execution time. Limitations on the practical depth of a pipeline arise from:

Pipeline latency. The fact that the execution time of each instruction does not decrease puts limitations on pipeline depth.

Imbalance among pipeline stages. Imbalance among the pipe stages reduces performance, since the clock can run no faster than the time needed for the slowest pipeline stage.

Pipeline overhead. Pipeline overhead arises from the combination of pipeline register delay (setup time plus propagation delay) and clock skew. Once the clock cycle is as small as the sum of the clock skew and latch overhead, no further pipelining is useful, since there is no time left in the cycle for useful work.

Consider a non-pipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60 ns, 60 ns, 50 ns, and 50 ns. Find the instruction latency on this machine. How much time does it take to execute 100 instructions?

Instruction latency = 50 + 50 + 60 + 60 + 50 + 50 = 320 ns. Time to execute 100 instructions = 100 × 320 = 32000 ns.

Suppose we introduce pipelining on this machine. Assume that when introducing pipelining, the clock skew adds 5 ns of overhead to each execution stage. What is the instruction latency on the pipelined machine? How much time does it take to execute 100 instructions?

Solution: Remember that in the pipelined implementation, the lengths of the pipe stages must all be the same, i.e., the speed of the slowest stage plus overhead. With 5 ns overhead it comes to:

Length of a pipelined stage = MAX(lengths of unpipelined stages) + overhead = 60 + 5 = 65 ns
Instruction latency = 6 × 65 = 390 ns
Time to execute 100 instructions = 65 × 6 × 1 + 65 × 1 × 99 = 390 + 6435 = 6825 ns
What is the speedup obtained from pipelining?

Solution: Speedup is the ratio of the average instruction time without pipelining to the average instruction time with pipelining. (Here we do not consider any stalls introduced by the different types of hazards, which we examine in the next section.)

Average instruction time unpipelined = 320 ns
Average instruction time pipelined = 65 ns
Speedup = 320 / 65 = 4.92
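The arithmetic in this exercise is easy to script; a minimal Python check that reproduces all three answers:

    stage_ns = max([50, 50, 60, 60, 50, 50]) + 5     # slowest stage + 5 ns skew = 65 ns

    def pipelined_time_ns(n_instr, n_stages, stage_ns):
        # First instruction fills the pipe; each later one finishes a cycle apart.
        return stage_ns * n_stages + stage_ns * (n_instr - 1)

    print(stage_ns * 6)                              # 390 ns pipelined latency
    print(pipelined_time_ns(100, 6, stage_ns))       # 6825 ns for 100 instructions
    print(320 / stage_ns)                            # speedup ~4.92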

Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:

Structural Hazards: They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.

Data Hazards: They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.

Control Hazards: They arise from the pipelining of branches and other instructions that change the PC.

Hazards in pipelines can make it necessary to stall the pipeline. The processor can stall on different events:

A cache miss: A cache miss stalls all the instructions in the pipeline, both before and after the instruction causing the miss.

A hazard in the pipeline: Eliminating a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. When an instruction is stalled, all the instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear. A hazard causes pipeline bubbles to be inserted.

The following table shows how the stalls are actually implemented in the case of a structural hazard. As a result, no new instruction is fetched during clock cycle 4, and no instruction finishes during clock cycle 8.

             Clock cycle number
Instr        1       2       3       4       5       6       7       8       9       10
Instr i      IF      ID      EX      MEM     WB
Instr i+1            IF      ID      EX      MEM     WB
Instr i+2                    IF      ID      EX      MEM     WB
Stall                                bubble  bubble  bubble  bubble  bubble
Instr i+3                                    IF      ID      EX      MEM     WB
Instr i+4                                            IF      ID      EX      MEM     WB

To simplify the picture it is also commonly shown like this:

             Clock cycle number
Instr        1       2       3       4       5       6       7       8       9       10
Instr i      IF      ID      EX      MEM     WB
Instr i+1            IF      ID      EX      MEM     WB
Instr i+2                    IF      ID      EX      MEM     WB
Instr i+3                            stall   IF      ID      EX      MEM     WB
Instr i+4                                            IF      ID      EX      MEM     WB

In case of data hazards:

             Clock cycle number
Instr        1       2       3       4       5       6       7       8       9       10
Instr i      IF      ID      EX      MEM     WB
Instr i+1            IF      ID      bubble  EX      MEM     WB
Instr i+2                    IF      bubble  ID      EX      MEM     WB
Instr i+3                            bubble  IF      ID      EX      MEM     WB
Instr i+4                                            IF      ID      EX      MEM     WB

This appears the same with stalls:

             Clock cycle number
Instr        1       2       3       4       5       6       7       8       9       10
Instr i      IF      ID      EX      MEM     WB
Instr i+1            IF      ID      stall   EX      MEM     WB
Instr i+2                    IF      stall   ID      EX      MEM     WB
Instr i+3                            stall   IF      ID      EX      MEM     WB
Instr i+4                                            IF      ID      EX      MEM     WB

Performance of Pipelines with Stalls


A stall causes the pipeline performance to degrade from the ideal performance. The speedup from pipelining is:

Speedup = Average instruction time unpipelined / Average instruction time pipelined
        = (CPI unpipelined × Clock cycle time unpipelined) / (CPI pipelined × Clock cycle time pipelined)

The ideal CPI on a pipelined machine is almost always 1. Hence, the pipelined CPI is:

CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
              = 1 + Pipeline stall clock cycles per instruction

If we ignore the cycle-time overhead of pipelining and assume the stages are perfectly balanced, then the cycle times of the two machines are equal, and

Speedup = CPI unpipelined / (1 + Pipeline stall cycles per instruction)

If all instructions take the same number of cycles, which must also equal the number of pipeline stages (the depth of the pipeline), then the unpipelined CPI is equal to the depth of the pipeline, leading to

Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction)

If there are no pipeline stalls, this leads to the intuitive result that pipelining can improve performance by the depth of the pipeline.

Structural Hazards

When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline. If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard. Common instances of structural hazards arise when:

Some functional unit is not fully pipelined. Then a sequence of instructions using that unpipelined unit cannot proceed at the rate of one per clock cycle.

Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute. Example 1: a machine may have only one register-file write port, but in some cases the pipeline might want to perform two writes in a clock cycle. Example 2: a machine shares a single memory pipeline for data and instructions. As a result, when an instruction contains a data-memory reference (a load), it will conflict with the instruction fetch of a later instruction (instr 3):

             Clock cycle number
Instr        1       2       3       4       5       6       7       8
Load         IF      ID      EX      MEM     WB
Instr 1              IF      ID      EX      MEM     WB
Instr 2                      IF      ID      EX      MEM     WB
Instr 3                              IF      ID      EX      MEM     WB

To resolve this, we stall the pipeline for one clock cycle when the data-memory access occurs. The effect of the stall is to occupy the resources for that instruction slot. The following table shows how the stall is actually implemented.

             Clock cycle number
Instr        1       2       3       4       5       6       7       8       9
Load         IF      ID      EX      MEM     WB
Instr 1              IF      ID      EX      MEM     WB
Instr 2                      IF      ID      EX      MEM     WB
Stall                                bubble  bubble  bubble  bubble  bubble
Instr 3                                      IF      ID      EX      MEM     WB

Instruction 1 is assumed not to be a data-memory reference (load or store); otherwise Instruction 3 could not start execution, for the same reason as above. To simplify the picture, it is also commonly shown like this:

             Clock cycle number
Instr        1       2       3       4       5       6       7       8       9
Load         IF      ID      EX      MEM     WB
Instr 1              IF      ID      EX      MEM     WB
Instr 2                      IF      ID      EX      MEM     WB
Instr 3                              stall   IF      ID      EX      MEM     WB

Introducing stalls degrades performance as we saw before. Why, then, would the designer allow structural hazards? There are two reasons:

To reduce cost. For example, machines that support both an instruction access and a data access every cycle (to prevent the structural hazard of the above example) require at least twice as much total memory bandwidth.

To reduce the latency of the unit. The shorter latency comes from the lack of pipeline registers, which introduce overhead.

Data Hazards
A major effect of pipelining is to change the relative timing of instructions by overlapping their execution. This introduces data and control hazards. Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine. Consider the pipelined execution of these instructions:

                  1       2       3       4       5       6       7       8       9
ADD R1, R2, R3    IF      ID      EX      MEM     WB
SUB R4, R5, R1            IF      IDsub   EX      MEM     WB
AND R6, R1, R7                    IF      IDand   EX      MEM     WB
OR  R8, R1, R9                            IF      IDor    EX      MEM     WB
XOR R10, R1, R11                                  IF      IDxor   EX      MEM     WB
All the instructions after the ADD use the result of the ADD instruction (in R1). The ADD instruction writes the value of R1 in the WB stage, and the SUB instruction reads the value during its ID stage (IDsub). This problem is called a data hazard. Unless precautions are taken to prevent it, the SUB instruction will read the wrong value and try to use it. The AND instruction is also affected by this data hazard: the write of R1 does not complete until the end of cycle 5, so the AND instruction, which reads the registers during cycle 4 (IDand), will receive the wrong result. The OR instruction can be made to operate without incurring a hazard by a simple implementation technique: perform register file reads in the second half of the cycle, and writes in the first half. Because both WB for ADD and IDor for OR occur in the same cycle (cycle 5), the write to the register file by ADD is performed in the first half of the cycle, and the read of the registers by OR is performed in the second half. The XOR instruction operates properly, because its register read occurs in cycle 6, after the register write by ADD. The next section discusses forwarding, a technique to eliminate the stalls for the hazard involving the SUB and AND instructions. We will also classify the data hazards and consider the cases when stalls cannot be eliminated. We will see what the compiler can do to schedule the pipeline to avoid stalls.

Forwarding
The data hazards introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding.

                  1       2       3       4       5       6       7
ADD R1, R2, R3    IF      ID      EX      MEM     WB
SUB R4, R5, R1            IF      IDsub   EX      MEM     WB
AND R6, R1, R7                    IF      IDand   EX      MEM     WB

The key insight in forwarding is that the result is not really needed by SUB until after the ADD actually produces it. The only problem is to make it available for SUB when it needs it. If the result can be moved from where the ADD produces it (EX/MEM register), to where the SUB needs it (ALU input latch), then the need for a stall can be avoided. Using this observation, forwarding works as follows:

The ALU result from the EX/MEM register is always fed back to the ALU input latches.

If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs. The paths correspond to forwarding of: (a) the ALU output at the end of EX, (b) the ALU output at the end of MEM, and (c) the memory output at the end of MEM. Without forwarding, our example can execute correctly only with stalls:

                  1       2       3       4       5       6       7       8       9
ADD R1, R2, R3    IF      ID      EX      MEM     WB
SUB R4, R5, R1            IF      stall   stall   IDsub   EX      MEM     WB
AND R6, R1, R7                    IF      stall   stall   IDand   EX      MEM     WB

As our example shows, we need to forward results not only from the immediately previous instruction, but possibly from an instruction that started three cycles earlier. Forwarding can be arranged from MEM/WB latch to ALU input also. Using those forwarding paths the code sequence can be executed without stalls:

                  1       2       3       4       5       6       7
ADD R1, R2, R3    IF      ID      EXadd   MEMadd  WB
SUB R4, R5, R1            IF      ID      EXsub   MEM     WB
AND R6, R1, R7                    IF      ID      EXand   MEM     WB

The first forwarding is for the value of R1, from EXadd to EXsub. The second forwarding is also for the value of R1, from MEMadd to EXand. With these paths the code can be executed without stalls. Forwarding can be generalized to include passing the result directly to the functional unit that requires it: a result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit.

One more Example


To prevent a stall in this example, we would need to forward the values of R1 and R4 from the pipeline registers to the ALU and data memory inputs.

                  1       2       3       4       5       6       7
ADD R1, R2, R3    IF      ID      EXadd   MEMadd  WB
LW  R4, d(R1)             IF      ID      EXlw    MEMlw   WB
SW  R4, 12(R1)                    IF      ID      EXsw    MEMsw   WB

Stores require an operand during MEM, and forwarding of that operand is shown here. The first forwarding is for the value of R1, from EXadd to EXlw. The second forwarding is also for the value of R1, from MEMadd to EXsw. The third forwarding is for the value of R4, from MEMlw to MEMsw.

Observe that the SW instruction is storing the value of R4 into a memory location computed by adding the displacement 12 to the value contained in register R1. This effective address computation is done in the ALU during the EX stage of the SW instruction. The value to be stored (R4 in this case) is needed only in the MEM stage as an input to Data Memory. Thus the value of R1 is forwarded to the EX stage for effective address computation and is needed earlier in time than the value of R4 which is forwarded to the input of Data Memory in the MEM stage.

So forwarding takes place from "left to right" in time, but operands are not always forwarded to the EX stage - it depends on the instruction and the point in the datapath where the operand is needed. Of course, hardware support is necessary to support data forwarding.

Data Hazard Classification


A hazard is created whenever there is a dependence between instructions, and they are close enough that the overlap caused by pipelining would change the order of access to an operand. Our example hazards have all been with register operands, but it is also possible to create a dependence by writing and reading the same memory location. In the DLX pipeline, however, memory references are always kept in order, preventing this type of hazard from arising. All the data hazards discussed here involve registers within the CPU. By convention, the hazards are named by the ordering in the program that must be preserved by the pipeline.

RAW (read after write)
WAW (write after write)
WAR (write after read)

Consider two instructions i and j, with i occurring before j. The possible data hazards are:

RAW (read after write) - j tries to read a source before i writes it, so j incorrectly gets the old value. This is the most common type of hazard and the kind that we use forwarding to overcome.

WAW (write after write) - j tries to write an operand before it is written by i. The writes end up being performed in the wrong order, leaving the value written by i rather than the value written by j in the destination. This hazard is present only in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled. The DLX integer pipeline writes a register only in WB and avoids this class of hazards. WAW hazards would be possible if we made the following two changes to the DLX pipeline:

Move the write back for an ALU operation into the MEM stage, since the data value is available by then.

Suppose that the data memory access took two pipe stages.

Here is a sequence of two instructions showing the execution in this revised pipeline, highlighting the pipe stage that writes the result:

                  1       2       3       4       5       6
LW  R1, 0(R2)     IF      ID      EX      MEM1    MEM2    WB
ADD R1, R2, R3            IF      ID      EX      WB

Unless this hazard is avoided, execution of this sequence on this revised pipeline will leave the result of the first write (the LW) in R1, rather than the result of the ADD.

Allowing writes in different pipe stages introduces other problems, since two instructions can try to write during the same clock cycle. The DLX FP pipeline , which has both writes in different stages and different pipeline lengths, will deal with both write conflicts and WAW hazards in detail.

WAR (write after read) - j tries to write a destination before it is read by i, so i incorrectly gets the new value.
This cannot happen in our example pipeline, because all reads are early (in ID) and all writes are late (in WB). This hazard occurs when some instructions write results early in the instruction pipeline and other instructions read a source late in the pipeline. Because of the natural structure of a pipeline, which typically reads values before it writes results, such hazards are rare. Pipelines for complex instruction sets that support autoincrement addressing and require operands to be read late in the pipeline could create a WAR hazard. If we modified the DLX pipeline as in the above example and also read some operands late, such as the source value for a store instruction, a WAR hazard could occur. Here is the pipeline timing for such a potential hazard, highlighting the stage where the conflict occurs:

                  1       2       3       4       5       6
SW  R1, 0(R2)     IF      ID      EX      MEM1    MEM2    WB
ADD R2, R3, R4            IF      ID      EX      WB

If the SW reads R2 during the second half of its MEM2 stage and the ADD writes R2 during the first half of its WB stage, the SW will incorrectly read and store the value produced by the ADD.

RAR (read after read) - this case is not a hazard :).
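A toy classifier makes the definitions concrete. This is a sketch over register sets; it flags potential hazards only, since whether a hazard actually bites depends on the pipeline timing discussed above:

    def classify_hazards(i, j):
        # i occurs before j; each is a dict of register-number sets.
        hazards = []
        if i["writes"] & j["reads"]:
            hazards.append("RAW")
        if i["writes"] & j["writes"]:
            hazards.append("WAW")
        if i["reads"] & j["writes"]:
            hazards.append("WAR")
        return hazards

    add = {"reads": {2, 3}, "writes": {1}}   # ADD R1, R2, R3
    sub = {"reads": {5, 1}, "writes": {4}}   # SUB R4, R5, R1
    print(classify_hazards(add, sub))        # ['RAW']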

When Stalls are required


Unfortunately, not all potential hazards can be handled by forwarding. Consider the following sequence of instructions:

                  1       2       3       4       5       6       7       8
LW  R1, 0(R1)     IF      ID      EX      MEM     WB
SUB R4, R1, R5            IF      ID      EXsub   MEM     WB
AND R6, R1, R7                    IF      ID      EXand   MEM     WB
OR  R8, R1, R9                            IF      ID      EX      MEM     WB
The LW instruction does not have the data until the end of clock cycle 4 (MEM), while the SUB instruction needs it by the beginning of that clock cycle (EXsub). For the AND instruction, we can forward the result immediately to the ALU (EXand) from the MEM/WB register (MEM). The OR instruction has no problem, since it receives the value through the register file (ID): in clock cycle 5, the WB of the LW instruction occurs "early", in the first half of the cycle, and the register read of the OR instruction occurs "late", in the second half of the cycle. For the SUB instruction, the forwarded result would arrive too late - at the end of a clock cycle, when it is needed at the beginning. The load instruction has a delay or latency that cannot be eliminated by forwarding alone. Instead, we need to add hardware, called a pipeline interlock, to preserve the correct execution pattern. In general, a pipeline interlock detects a hazard and stalls the pipeline until the hazard is cleared. The pipeline with a stall and the legal forwarding is:

                  1       2       3       4       5       6       7       8       9
LW  R1, 0(R1)     IF      ID      EX      MEM     WB
SUB R4, R1, R5            IF      ID      stall   EXsub   MEM     WB
AND R6, R1, R7                    IF      stall   ID      EX      MEM     WB
OR  R8, R1, R9                            stall   IF      ID      EX      MEM     WB

The only necessary forwarding is done for R1, from MEM to EXsub. Notice that there is no need to forward R1 to the AND instruction, because it now gets the value through the register file in ID (as the OR did above). There are techniques to reduce the number of stalls even in this case, which we consider next.
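The detection side of such an interlock reduces to a simple check in the ID stage; a minimal sketch with simplified instruction records:

    def needs_load_interlock(prev, curr):
        # Stall when the previous instruction is a load and the current one
        # reads its destination: the loaded value exists only after MEM.
        return prev["op"] == "LW" and prev["dest"] in curr["srcs"]

    lw  = {"op": "LW",  "dest": 1, "srcs": {1}}     # LW  R1, 0(R1)
    sub = {"op": "SUB", "dest": 4, "srcs": {1, 5}}  # SUB R4, R1, R5
    print(needs_load_interlock(lw, sub))            # True -> insert one bubble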

Pipeline Scheduling
Generate DLX code that avoids pipeline stalls for the following sequence of statements: a = b + c; d = a - f; e = g - h; Assume that all variables are 32-bit integers. Wherever necessary, explicitly explain the actions that are needed to avoid pipeline stalls in your scheduled code.

Solution:
The DLX assembly code for the given sequence of statements is:

LW  Rb, b
LW  Rc, c
ADD Ra, Rb, Rc
SW  Ra, a
LW  Rf, f
SUB Rd, Ra, Rf
SW  Rd, d
LW  Rg, g
LW  Rh, h
SUB Re, Rg, Rh
SW  Re, e

Its pipelined execution, with stalls, is:

                 1    2    3    4    5      6    7    8      9      10   11   12   13   14     15   16   17   18
LW  Rb, b        IF   ID   EX   M    WB
LW  Rc, c             IF   ID   EX   M      WB
ADD Ra, Rb, Rc             IF   ID   stall  EX   M    WB
SW  Ra, a                       IF   stall  ID   EX   M      WB
LW  Rf, f                                   IF   ID   EX     M      WB
SUB Rd, Ra, Rf                                   IF   ID     stall  EX   M    WB
SW  Rd, d                                             IF     stall  ID   EX   M    WB
LW  Rg, g                                                           IF   ID   EX   M      WB
LW  Rh, h                                                                IF   ID   EX     M      WB
SUB Re, Rg, Rh                                                                IF   ID     stall  EX   M    WB
SW  Re, e                                                                          IF     stall  ID   EX   M    WB

Running this code segment will need some forwarding. But the LW and ALU (Add or Sub) instructions, when put in sequence, generate hazards that cannot be resolved by forwarding, so the pipeline stalls. Observe that in time steps 4, 5, and 6 there are two forwards from the data memory unit to the ALU in the EX stage of the Add instruction; the same happens in time steps 13, 14, and 15. The hardware to implement this forwarding will need two Load Memory Data registers to store the output of data memory. Note that for the SW instructions, the register value is needed at the input of data memory. A better solution, with compiler assistance, is given below. Rather than just allowing the pipeline to stall, the compiler can try to schedule the pipeline to avoid these stalls by rearranging the code sequence to eliminate the hazards. A suggested version is (the problem actually has more than one solution):

Instruction      1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
LW  Rb, b        IF   ID   EX   M    WB
LW  Rc, c             IF   ID   EX   M    WB
LW  Rf, f                  IF   ID   EX   M    WB
ADD Ra, Rb, Rc                  IF   ID   EX   M    WB
SW  Ra, a                            IF   ID   EX   M    WB
SUB Rd, Ra, Rf                            IF   ID   EX   M    WB
LW  Rg, g                                      IF   ID   EX   M    WB
LW  Rh, h                                           IF   ID   EX   M    WB
SW  Rd, d                                                IF   ID   EX   M    WB
SUB Re, Rg, Rh                                                IF   ID   EX   M    WB
SW  Re, e                                                          IF   ID   EX   M    WB

The forwarding and register-file timing that make this schedule work:

ADD Ra, Rb, Rc - Rb is read in the second half of ID; Rc is forwarded.
SW  Ra, a      - Ra is forwarded.
SUB Rd, Ra, Rf - Rf is read in the second half of ID; Ra is forwarded.
SW  Rd, d      - Rd is read in the second half of ID.
SUB Re, Rg, Rh - Rg is read in the second half of ID; Rh is forwarded.
SW  Re, e      - Re is forwarded.

The "read in the second half of ID" cases rely on the technique of performing register file reads in the second half of a cycle and writes in the first half. Note: the use of different registers for the first, second, and third statements was critical for this schedule to be legal! In general, pipeline scheduling can increase the register count required.

Control Hazards
Control hazards can cause a greater performance loss for the DLX pipeline than data hazards. When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4. If a branch changes the PC to its target address, it is a taken branch; if it falls through, it is not taken. If instruction i is a taken branch, then the PC is normally not changed until the end of the MEM stage, after the completion of the address calculation and comparison.

The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected until we reach the MEM stage, which determines the new PC. The pipeline behavior looks like:

                     1       2           3       4       5       6       7       8       9       10
Branch               IF      ID          EX      MEM     WB
Branch successor             IF(stall)   stall   stall   IF      ID      EX      MEM     WB
Branch successor+1                                       IF      ID      EX      MEM     WB

The stall does not occur until after the ID stage (where we know that the instruction is a branch). This control-hazard stall must be implemented differently from a data-hazard stall, since the IF cycle of the instruction following the branch must be repeated as soon as we know the branch outcome. Thus, the first IF cycle is essentially a stall (because it never performs useful work), which comes to a total of 3 stalls. Three clock cycles wasted for every branch is a significant loss: with a 30% branch frequency and an ideal CPI of 1, the machine with branch stalls achieves only about half the ideal speedup from pipelining! The number of stall cycles can be reduced by two steps:

Find out whether the branch is taken or not taken earlier in the pipeline;
Compute the taken PC (i.e., the address of the branch target) earlier.
Both steps should be taken as early in the pipeline as possible. By moving the zero test into the ID stage, it is possible to know if the branch is taken at the end of the ID cycle. Computing the branch target address during ID requires an additional adder, because the main ALU, which has been used for this function so far, is not usable until EX. This gives a revised datapath (figure not reproduced here).

With this datapath we will need only a one-clock-cycle stall on branches.

                     1       2           3       4       5       6       7
Branch               IF      ID          EX      MEM     WB
Branch successor             IF(stall)   IF      ID      EX      MEM     WB
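Plugging numbers into CPI = ideal CPI + branch frequency × branch penalty shows what the extra adder buys; the 30% branch frequency is the figure quoted earlier:

    def cpi_with_branches(ideal_cpi, branch_freq, penalty):
        return ideal_cpi + branch_freq * penalty

    print(cpi_with_branches(1.0, 0.30, 3))  # 1.9 - MEM-stage resolution, ~half speed
    print(cpi_with_branches(1.0, 0.30, 1))  # 1.3 - ID-stage resolution with the adder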

In some machines, branch hazards are even more expensive in clock cycles. For example, a machine with separate decode and register fetch stages will probably have a branch delay - the length of the control hazard - that is at least one clock cycle longer. The branch delay, unless it is dealt with, turns into a branch penalty. Many older machines that implement more complex instruction sets have branch delays of four clock cycles or more. In general, the deeper the pipeline, the worse the branch penalty in clock cycles. There are many methods for dealing with the pipeline stalls caused by branch delay. Below we discuss four simple schemes, and then look at more powerful compile-time schemes, such as loop unrolling, that reduce the frequency of loop branches.

Branch Prediction Schemes


There are many methods to deal with the pipeline stalls caused by branch delay. We discuss four simple compile-time schemes in which predictions are static - they are fixed for each branch during the entire execution, and the predictions are compile-time guesses.

Stall pipeline
Predict taken
Predict not taken
Delayed branch

Stall pipeline

The simplest scheme to handle branches is to freeze or flush the pipeline, holding or deleting any instructions after the branch until the branch destination is known. Advantage: simple for both software and hardware (this is the solution described earlier).

Predict Not Taken

A higher-performance, and only slightly more complex, scheme is to predict the branch as not taken, simply allowing the hardware to continue as if the branch were not executed. Care must be taken not to change the machine state until the branch outcome is definitely known. The complexity arises from two requirements: we have to know when the state might be changed by an instruction, and we have to know how to "back out" of a change. The pipeline with this scheme implemented behaves as shown below:

Untaken branch:

                     1       2       3       4       5       6       7
Untaken branch instr IF      ID      EX      MEM     WB
Instr i+1                    IF      ID      EX      MEM     WB
Instr i+2                            IF      ID      EX      MEM     WB

Taken branch:

                     1       2       3       4       5       6       7       8
Taken branch instr   IF      ID      EX      MEM     WB
Instr i+1                    IF      idle    idle    idle    idle
Branch target                        IF      ID      EX      MEM     WB
Branch target+1                              IF      ID      EX      MEM     WB

When the branch is not taken (determined during ID), we have already fetched the fall-through instruction and simply continue. If the branch is taken (also determined during ID), we restart the fetch at the branch target. This causes all instructions following the branch to stall one clock cycle.

Predict Taken

An alternative scheme is to predict the branch as taken. As soon as the branch is decoded and the target address is computed, we assume the branch to be taken and begin fetching and executing at the target address. Because in the DLX pipeline the target address is not known any earlier than the branch outcome, there is no advantage in this approach. In some machines where the target address is known before the branch outcome, a predict-taken scheme might make sense.

Delayed Branch

In a delayed branch, the execution cycle with a branch delay of length n is:

branch instruction
sequential successor 1
sequential successor 2
...
sequential successor n
branch target if taken

The sequential successors are in the branch-delay slots. These instructions are executed whether or not the branch is taken. The pipeline behavior of the DLX pipeline, which has one branch-delay slot, is shown below:

                          1       2       3       4       5       6       7       8       9
Untaken branch instr      IF      ID      EX      MEM     WB
Branch delay instr (i+1)          IF      ID      EX      MEM     WB
Instr i+2                                 IF      ID      EX      MEM     WB
Instr i+3                                         IF      ID      EX      MEM     WB
Instr i+4                                                 IF      ID      EX      MEM     WB

                          1       2       3       4       5       6       7       8       9
Taken branch instr        IF      ID      EX      MEM     WB
Branch delay instr (i+1)          IF      ID      EX      MEM     WB
Branch target                             IF      ID      EX      MEM     WB
Branch target+1                                   IF      ID      EX      MEM     WB
Branch target+2                                           IF      ID      EX      MEM     WB

The job of the compiler is to make the successor instructions valid and useful. We will show three branch-scheduling schemes:

From before branch
From target
From fall-through

(Figure: scheduling the branch-delay slot. The left box in each pair shows the code before scheduling; the right box shows the scheduled code.) In (a) the delay slot is scheduled with an independent instruction from before the branch. This is the best choice. Strategies (b) and (c) are used when (a) is not possible. In the code sequences for (b) and (c), the use of R1 in the branch condition prevents the ADD instruction (whose destination is R1) from being moved after the branch. In (b) the branch-delay slot is scheduled from the target of the branch; usually the target instruction will need to be copied, because it can be reached by another path. In (c) the branch-delay slot is scheduled from the not-taken fall-through. To make this optimization legal for (b) and (c), it must be OK to execute the SUB instruction when the branch goes in the unexpected direction. "OK" means that the work might be wasted but the program will still execute correctly.

Scheduling strategy: From before branch
Requirements: Branch must not depend on the rescheduled instructions
When it improves performance: Always

Scheduling strategy: From target
Requirements: Must be OK to execute rescheduled instructions if the branch is not taken
When it improves performance: When the branch is taken. May enlarge the program if instructions are duplicated

Scheduling strategy: From fall-through
Requirements: Must be OK to execute instructions if the branch is taken
When it improves performance: When the branch is not taken

The limitations on delayed-branch scheduling arise from the restrictions on the instructions that are scheduled into the delay slots and from our ability to predict at compile time whether a branch is likely to be taken or not.

Canceling Branch

To improve the ability of the compiler to fill branch-delay slots, most machines with conditional branches have introduced a canceling branch. In a canceling branch, the instruction includes the direction that the branch was predicted:
- if the branch behaves as predicted, the instruction in the branch-delay slot is fully executed;
- if the branch is incorrectly predicted, the instruction in the delay slot is turned into a no-op (idle).

The behavior of a predicted-taken canceling branch depends on whether the branch is taken or not:

                          1       2       3       4       5       6       7       8       9
Untaken branch instr      IF      ID      EX      MEM     WB
Branch delay instr (i+1)          IF      idle    idle    idle    idle
Instr i+2                                 IF      ID      EX      MEM     WB
Instr i+3                                         IF      ID      EX      MEM     WB
Instr i+4                                                 IF      ID      EX      MEM     WB

                          1       2       3       4       5       6       7       8       9
Taken branch instr        IF      ID      EX      MEM     WB
Branch delay instr (i+1)          IF      ID      EX      MEM     WB
Branch target                             IF      ID      EX      MEM     WB
Branch target+1                                   IF      ID      EX      MEM     WB
Branch target+2                                           IF      ID      EX      MEM     WB

The advantage of canceling branches is that they eliminate the requirements on the instruction placed in the delay slot. Delayed branches are an architecturally visible feature of the pipeline. This is the source both of their advantage - allowing the use of simple compiler scheduling to reduce branch penalties - and of their disadvantage - exposing an aspect of the implementation that is likely to change.

Problem on Pipeline Hazards


Consider the following pipeline with 8 stages for a version of DLX:

IF1          Instruction fetch starts
IF2          Instruction fetch completes
ID           Instruction decode and register fetch; begin computing branch target
EX1          Execution starts; branch condition tested; finish computing branch target
EX2          Execution completes - effective address or ALU result available
MEM1/ALUWB   First part of memory cycle, plus write back of an ALU operation
MEM2         Memory access completes
LWB          Write back for a load instruction

As in the standard DLX pipeline, assume register writes are in the first half of a cycle and register reads are in the second half.

a) How many register read/write ports are required?

b) For each possible type of instruction source and each possible type of instruction destination, show a code example that depicts all possible forwarding requirements (not stalls).

c) Show the same information as part (b), but for stalls rather than forwards.

d) Assuming a predict-not-taken strategy, find the branch penalty for a taken and an untaken branch. Assume that a predicted instruction can be executed up to, but not including, a pipe stage that does a write back.

Solution: a) We need 2 read ports, to read 2 registers in one clock cycle in the ID stage, because this is the maximum number of operands in an instruction. We need 2 write ports, due to the potential overlap in time between the MEM1/ALUWB and LWB stages.

b) ALU - ALU / ALU - Branch

1 ALU instr R1, _ , _
2 any instr
3 ALU instr _ , R1, _  /  BNEZ R1, _

     1      2      3      4      5      6      7      8      9      10
1    IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
2           IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
3                  IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB

Forwarding is done for R1 from instruction 1's EX2 (cycle 5) to instruction 3's EX1 (cycle 6).

Memory - ALU / Memory - Branch / Memory - Memory

1 LW instr R1, _ , _
2 any instr
3 any instr
4 any instr
5 ALU instr _ , R1, _  /  BNEZ R1, _  /  SW _ , R1

     1      2      3      4      5      6      7      8      9      10     11     12
1    IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
2           IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
3                  IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
4                         IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
5                                IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB

Forwarding is done for R1 from instruction 1's MEM2 (cycle 7) to instruction 5's EX1 (cycle 8).

ALU - Memory

1 ALU instr R1, _ , _
2 SW _ , R1

     1      2      3      4      5      6      7      8      9
1    IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
2           IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB

Forwarding is done for R1 from instruction 1's MEM1/ALUWB (cycle 6) to instruction 2's MEM1 (cycle 7), with no intervening instruction required.

c) Without forwarding, the same dependences force stalls. ALU - ALU / ALU - Branch:

1 ALU instr R1, _ , _
2 ALU instr _ , R1, _  /  BNEZ R1, _

     1      2      3      4      5      6      7      8      9      10     11
1    IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
2           IF1    IF2    stall  stall  ID     EX1    EX2    MEM1   MEM2   LWB

The ALU result is written in MEM1/ALUWB (cycle 6, first half) and read by instruction 2 in ID (cycle 6, second half), so two stall cycles are needed.

Memory - ALU / Memory - Branch / Memory - Memory

1 LW instr R1, _ , _
2 ALU instr _ , R1, _  /  BNEZ R1, _  /  SW _ , R1

     1      2      3      4      5      6      7      8      9      10     11     12     13
1    IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
2           IF1    IF2    stall  stall  stall  stall  ID     EX1    EX2    MEM1   MEM2   LWB

The load writes its register in LWB (cycle 8, first half) and instruction 2 reads it in ID (cycle 8, second half), so four stall cycles are needed.

ALU - Memory

1 ALU instr R1, _ , _
2 SW _ , R1

     1      2      3      4      5      6      7      8      9      10     11
1    IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
2           IF1    IF2    stall  stall  ID     EX1    EX2    MEM1   MEM2   LWB

As in the ALU - ALU case, the SW reads R1 in ID, two cycles after its normal slot.

d) Branch taken

1 BNEZ R1, N
2 any instr
3 any instr
4 any instr
...
N any instr (branch target)

     1      2      3      4      5      6      7      8      9
1    IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
2           IF1    IF2    ID     flush
3                  IF1    IF2    flush
4                         IF1    flush
N                                IF1    IF2    ID     EX1    EX2 ...

The target address is computed at the end of EX1 of the branch instruction. If at that time we find out that the branch is taken, we have to flush all instructions in the pipeline after the branch and fetch the instruction we jumped to. So the taken-branch penalty is 3 stall cycles.

Branch not taken

1 BNEZ R1, N
2 any instr

     1      2      3      4      5      6      7      8      9
1    IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB
2           IF1    IF2    ID     EX1    EX2    MEM1   MEM2   LWB

If the branch is not taken, the pipeline functions properly with no stalls at all, because it is designed as a predict-not-taken pipeline.
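To turn these penalties into an average cost, weight them by how often branches are taken. The branch frequency and taken rate below are assumed numbers for illustration; they are not given in the problem:

    def avg_branch_penalty(taken_frac, taken_penalty=3, untaken_penalty=0):
        return taken_frac * taken_penalty + (1 - taken_frac) * untaken_penalty

    # Hypothetically, 20% of instructions are branches and 60% of those are taken:
    print(1.0 + 0.20 * avg_branch_penalty(0.60))  # CPI 1.36 on this 8-stage pipeline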

Dealing with Exceptions

What makes pipelining hard to implement? Exceptions! Now we are ready to consider the challenges of exceptional situations where the instruction execution order is changed in unexpected ways. Exceptional situations are harder to handle in a pipelined machine because the overlapping of instructions makes it more difficult to know whether an instruction can safely change the state of the machine. In a pipelined machine an instruction is executed step by step and is not completed for several clock cycles. Unfortunately, other instructions in the pipeline can raise exceptions that may force the machine to abort the instructions in the pipeline before they complete. First we look at the types of situations that can arise and what architectural requirements exist for supporting them.

Types of Exceptions
The terminology used to describe exceptional situations, where the normal execution order of instructions is changed, varies among machines. The terms interrupt, fault, and exception are all used. We use the term exception to cover all these mechanisms, including the following:

I/O device request
Invoking an operating system service from a user program (system call)
Tracing instruction execution
Breakpoint (programmer-requested interrupt)
Integer arithmetic overflow or underflow; FP arithmetic anomaly
Page fault
Misaligned memory accesses (if alignment is required)
Memory protection violation
Using an undefined instruction
Hardware malfunction
Power failure

The requirements on exceptions can be characterized along five axes:

Synchronous versus asynchronous. If the event occurs at the same place every time the program is executed with the same data and memory allocation, the event is synchronous. With the exception of hardware malfunctions, asynchronous events are caused by devices external to the processor and memory. Asynchronous events usually can be handled after the completion of the current instruction, which makes them easier to handle.

User requested versus coerced. If the user task directly asks for it, it is a user-requested event. In some sense, user-requested exceptions are not really exceptions, since they are predictable. They are treated as exceptions because the same mechanisms that are used to save and restore the state are used for these user-requested events. Because the only function of an instruction that triggers this exception is to cause the exception, user-requested exceptions can always be handled after the instruction has completed. Coerced exceptions are caused by some hardware event that is not under the control of the user program. Coerced exceptions are harder to implement because they are not predictable.

User maskable versus user nonmaskable. If an event can be masked or disabled by a user task, it is user maskable. The mask simply controls whether the hardware responds to the exception or not.

Within versus between instructions. This classification depends on whether the event prevents instruction completion by occurring in the middle of execution (within), or whether it is recognized between instructions. Exceptions that occur within instructions are always synchronous, since the instruction triggers the exception. It is harder to implement exceptions that occur within instructions than between instructions, since the instruction must be stopped and restarted.

Resume versus terminate. If the program's execution always stops after the interrupt, it is a terminating event. If the program's execution continues after the interrupt, it is a resuming event. It is easier to implement exceptions that terminate execution, since the machine need not be able to restart execution of the same program after handling the exception.

The following table classifies different types of exceptions along these axes:

Exception type                        Sync/Async     Requested/Coerced   Maskable?       Within/Between   Resume/Terminate
I/O device request                    Asynchronous   Coerced             Nonmaskable     Between          Resume
Invoke operating system               Synchronous    User request        Nonmaskable     Between          Resume
Tracing instruction execution         Synchronous    User request        User maskable   Between          Resume
Breakpoint                            Synchronous    User request        User maskable   Between          Resume
Integer arithmetic overflow           Synchronous    Coerced             User maskable   Within           Resume
FP arithmetic overflow or underflow   Synchronous    Coerced             User maskable   Within           Resume
Page fault                            Synchronous    Coerced             Nonmaskable     Within           Resume
Misaligned memory accesses            Synchronous    Coerced             User maskable   Within           Resume
Memory protection violation           Synchronous    Coerced             Nonmaskable     Within           Resume
Using undefined instruction           Synchronous    Coerced             Nonmaskable     Within           Terminate
Hardware malfunction                  Asynchronous   Coerced             Nonmaskable     Within           Terminate
Power failure                         Asynchronous   Coerced             Nonmaskable     Within           Terminate

Synchronous, coerced exceptions occurring within instructions that must be resumed are the most difficult to implement. The difficult task is implementing interrupts occurring within instructions where the instruction must be resumed, because it requires another program to be invoked to:

- save the state of the executing program;
- correct the cause of the exception;
- restore the state of the program before the instruction that caused the exception;
- start the program from the instruction that caused the exception.

If a pipeline provides the ability for the machine to handle the exception, save the state, and restart without affecting the execution of the program, the pipeline or machine is said to be restartable. Almost all machines today are restartable, at least for integer pipelines, because restartability is needed to implement virtual memory.

Exceptions in DLX
As in unpipelined implementations, the most difficult exceptions have two properties:

- They occur within instructions;
- The instruction within which the exception occurred must be restartable.

If the pipeline can be stopped so that the instructions just before the faulting instruction are completed, and those after it can be restarted from scratch, the pipeline is said to have precise exceptions. Supporting precise exceptions is a requirement in many systems. In practice, the need to accommodate virtual memory has led designers to always provide precise exceptions for the integer pipeline.

Exceptions that may occur in the DLX pipeline:

Pipeline stage   Problem exceptions occurring
IF               Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID               Undefined or illegal opcode
EX               Arithmetic exception
MEM              Page fault on data fetch; misaligned memory access; memory-protection violation
WB               None

1) With pipelining, multiple exceptions may occur in the same clock cycle, because there are multiple instructions in execution. Example:

        1       2       3       4       5       6
LW      IF      ID      EX      MEM     WB
ADD             IF      ID      EX      MEM     WB

This pair of instructions can cause a data page fault and an arithmetic exception at the same time, since the LW is in the MEM stage while the ADD is in the EX stage. This can be handled by dealing with only the data page fault and then restarting the execution. The second exception will reoccur and will be handled independently.

2) Exceptions may even occur out of order; that is, an instruction may cause an exception before an earlier instruction causes one. Example:

        1       2       3       4       5       6
LW      IF      ID      EX      MEM     WB
ADD             IF      ID      EX      MEM     WB

This time, consider the case when the LW gets a data page fault, seen when the instruction is in MEM, and the ADD gets an instruction page fault, seen when the ADD instruction is in IF. The instruction page fault will actually occur first, even though it is caused by the later instruction. Since we are implementing precise exceptions, the pipeline is required to handle the exception caused by the LW instruction first! So the pipeline cannot simply handle an exception when it occurs in time, since that would lead to exceptions occurring out of the unpipelined order. Instead, it is done through the following steps:

- the hardware posts all exceptions caused by a given instruction in a status vector associated with that instruction;
- the status vector is carried along as the instruction goes down the pipeline;
- once an exception indicator is set in the exception status vector, any control signals that may cause a data value to be written are turned off (this includes both register and memory writes);
- when an instruction enters WB, the exception status vector is checked;
- if any exceptions are posted, they are handled in the order in which they would occur in time on an unpipelined machine.
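A toy Python model of this status-vector discipline (a sketch of the mechanism, not of the DLX hardware):

    class InFlight:
        # One instruction moving down the pipe, carrying its status vector.
        def __init__(self, name):
            self.name = name
            self.exceptions = []            # (stage, kind) pairs, posted in order

        def post(self, stage, kind):
            self.exceptions.append((stage, kind))

        def may_write(self):
            # Once any exception is posted, register/memory writes are disabled.
            return not self.exceptions

    def retire(instr):
        # Checked when the instruction enters WB, in program order.
        if instr.exceptions:
            stage, kind = instr.exceptions[0]
            print(f"{instr.name}: handle {kind} (posted in {stage})")
        else:
            print(f"{instr.name}: completes normally")

    lw, add = InFlight("LW"), InFlight("ADD")
    add.post("IF", "instruction page fault")  # occurs earlier in time...
    lw.post("MEM", "data page fault")
    print(add.may_write())                    # False - ADD may no longer write state
    retire(lw)    # ...but LW retires first, so its fault is handled first
    retire(add)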
