Syllabus
Basic concepts
Data hazards
Instruction hazards
Influence on instruction sets
Data path and control considerations
Performance considerations
Exception handling
Basic concepts
Speed of execution of programs can be improved by arranging the hardware so that more than one operation can be performed at the same time. Thereby, the number of operations performed per second is increased, even though the elapsed time needed to perform any one operation is not changed.
Sequential Execution
Consider a processor that has two separate hardware units, one for fetching instructions and another for executing them, as shown in Figure 2.
Hardware Organization
Basic Concepts
The instruction fetched by the fetch unit is deposited in an intermediate buffer. This buffer is needed to enable the execution unit to execute the instruction while the fetch unit is fetching the next instruction. The computer is controlled by a clock whose period is such that the fetch and execute steps of any instruction can each be completed in one clock cycle (Figure 3).
Pipelined Execution
Basic Concepts
The processing of an instruction need not be divided into only two steps. For example, a pipelined processor may process each instruction in four steps, as follows:
F (Fetch): read the instruction from the memory.
D (Decode): decode the instruction and fetch the source operand(s).
E (Execute): perform the operation specified by the instruction.
W (Write): store the result in the destination location.
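To make the overlap concrete, the following is a minimal sketch (an illustration, not part of the original slides) that prints which step of which instruction occupies each stage in every clock cycle, assuming an ideal four-stage pipeline with no hazards.

    #include <stdio.h>

    /* Ideal 4-stage pipeline (F, D, E, W) with no hazards.
       In clock cycle c, instruction Ik occupies stage number c - k
       (0 = F, 1 = D, 2 = E, 3 = W), provided 0 <= c - k <= 3. */
    int main(void) {
        const char *stages = "FDEW";
        const int num_instr = 4, num_stages = 4;

        for (int cycle = 1; cycle <= num_instr + num_stages - 1; cycle++) {
            printf("Cycle %d:", cycle);
            for (int i = 0; i < num_instr; i++) {
                int s = cycle - 1 - i;          /* stage index for instruction I(i+1) */
                if (s >= 0 && s < num_stages)
                    printf("  I%d:%c", i + 1, stages[s]);
            }
            printf("\n");
        }
        return 0;
    }

Once the pipeline is full, one instruction completes in every clock cycle, which is the source of the speedup.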
Instruction Execution
Figure 4
Hardware Organization
Figure 5
For the pipeline to function smoothly, the four units should take about the same amount of time; the clock period must allow the longest task to be completed. A unit that completes its task early is idle for the remainder of the clock period. Instruction fetch requires an access to the main memory, and the access time of the main memory is relatively long. To avoid this problem, a cache memory is used.
Pipeline performance
Figure 6 shows an example in which the operation specified in instruction I2 requires three cycles to complete, while the other instructions require only one cycle each.
The pipeline operation is said to have been stalled for two clock cycles. Any condition that causes the pipeline to stall is called a hazard. The different types of hazard that can occur are:
Data hazards
Instruction (control) hazards
Structural hazards
DATA HAZARD
A data hazard is a situation in which the pipeline is stalled because the data to be operated on are delayed for some reason.
The data used in the second instruction depend on the result of the first instruction.
Operand forwarding
Mul R2,R3,R4
Add R5,R4,R6
The result of the multiply operation is stored in register R4, and R4 is one of the source operands of the Add instruction. When the decode unit decodes the Add instruction in cycle 3, it finds that R4 is a source operand. Hence, decoding of instruction I2 cannot be completed until the W step of instruction I1 has completed. With operand forwarding, the result of I1 is instead forwarded directly to step E2 as soon as it is available.
SRC1, SRC2, and RSLT are registers at the inputs and output of the ALU; they constitute the interstage buffers needed for pipelined operation. After decoding instruction I2 and detecting the data dependency, a decision is made to use data forwarding. The operand not involved in the dependency is read and loaded in register SRC1 in clock cycle 3; in the next cycle, the result of I1 is available in register RSLT and, because of the forwarding connection, can be used in step E2.
Data hazards can also be handled in software. In this case, the compiler can introduce the two-cycle delay needed between instructions I1 and I2 by inserting NOP (no-operation) instructions. If the task of detecting data dependencies is left entirely to the software, the compiler must insert the NOP instructions itself. This illustrates the close link between the compiler and the hardware: the compiler can attempt to reorder instructions to perform useful tasks in the NOP slots and thereby achieve better performance.
Disadvantages
On the other hand, the NOP instructions lead to larger code size. Also, it is often the case that a given processor architecture has several hardware implementations, which may require different numbers of NOP instructions.
Side effects
The data dependencies in the previous examples are easy to detect because the register involved is named explicitly as the destination of one instruction and a source of the next. However, the same precautions needed to handle data dependencies involving a destination location must also be applied to the registers affected by an autoincrement or autodecrement operation. When a location other than the one explicitly named in an instruction as a destination operand is affected, the instruction is said to have a side effect. Stack instructions, such as push and pop, produce similar side effects because they implicitly use the autoincrement and autodecrement addressing modes. Another possible side effect involves the condition code flags, which are used by instructions such as conditional branches and add-with-carry.
Suppose registers R1 and R2 hold a double-precision integer number that we wish to add to another double-precision number held in registers R3 and R4. This may be accomplished as follows:
Add R1,R3
AddWithCarry R2,R4
An implicit dependency exists between these two instructions through the carry flag. This flag is set by the first instruction and used in the second instruction, which performs the operation R4 ← [R2] + [R4] + carry.
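The role of the carry flag can be illustrated in C. The sketch below (an illustration with assumed operand values, not code from the slides) adds two 64-bit numbers held as pairs of 32-bit words, propagating the carry from the low-order addition into the high-order addition, just as the Add / AddWithCarry pair does.

    #include <stdint.h>
    #include <stdio.h>

    /* Add two double-precision (64-bit) integers, each held as two 32-bit words,
       using an explicit carry flag like the Add / AddWithCarry instruction pair. */
    int main(void) {
        uint32_t r1 = 0x00000001, r2 = 0x00000002;   /* low, high words of operand A */
        uint32_t r3 = 0xFFFFFFFF, r4 = 0x00000005;   /* low, high words of operand B */

        r3 = r1 + r3;                        /* Add R1,R3: low-order words */
        uint32_t carry = (r3 < r1) ? 1 : 0;  /* carry flag set by the first addition */
        r4 = r2 + r4 + carry;                /* AddWithCarry R2,R4: high-order words */

        printf("result = 0x%08X%08X\n", (unsigned)r4, (unsigned)r3);  /* high word, then low word */
        return 0;
    }

If the two additions were allowed to execute out of order, the carry would be wrong; this is exactly the implicit dependency that the pipeline must respect.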
Instructions with side effects give rise to multiple data dependencies, which lead to a substantial increase in the complexity of the hardware or software needed to resolve them. For this reason, instructions designed for execution on pipelined hardware should have few side effects.
INSTRUCTION HAZARDS
The pipeline may also be stalled because of a delay in the availability of an instruction, for example as a result of a cache miss. Such hazards are called instruction hazards or control hazards.
Unconditional branches
Instructions I1 to I3 are stored at successive memory addresses, and I2 is a branch instruction whose target is Ik. In clock cycle 3, the fetch operation for instruction I3 is in progress; at the same time, the branch instruction is being decoded and the target address is computed. In clock cycle 4, the processor must discard I3, which has been incorrectly fetched, and fetch instruction Ik instead. In the meantime, the hardware unit responsible for the Execute step must be told to do nothing during that clock period. Thus, the pipeline is stalled for one clock cycle. The time lost as a result of a branch instruction is often referred to as the branch penalty.
Reducing the branch penalty requires the branch target address to be computed earlier in the pipeline. Typically, the instruction fetch unit has dedicated hardware to identify a branch instruction and compute the branch target address as quickly as possible after an instruction is fetched. With this additional hardware, both of these tasks can be performed in step D2; in this case, the branch penalty is only one clock cycle.
To further reduce the branch penalty, many processors employ sophisticated fetch units that can fetch instructions before they are needed and put them in a queue. A separate unit, called the dispatch unit, takes instructions from the front of the queue and sends them to the execution unit. To be effective, the fetch unit must have sufficient decoding and processing capability to recognize and execute branch instructions. It attempts to keep the instruction queue filled at all times to reduce the impact of delays. When the pipeline stalls because of a data hazard, the fetch unit continues to fetch instructions and add them to the queue; conversely, when there is a delay in fetching instructions, the dispatch unit continues to issue instructions from the instruction queue.
Assume that initially the queue contains one instruction. Every fetch operation adds one instruction to the queue and every dispatch operation reduces the queue length by one; hence, the queue length remains the same for the first four clock cycles. Instruction I1 introduces a 2-cycle stall; since space is available in the queue, the fetch unit continues to fetch instructions, and the queue length rises to 3 in clock cycle 6. Instruction I5 is a branch instruction; its target Ik is fetched in cycle 7, and the incorrectly fetched I6 is discarded. The branch would normally cause a stall in clock cycle 7 as a result of discarding I6, but instead instruction I4 is dispatched from the queue to the decoding stage, so the queue length drops to 1 in cycle 8.
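As an illustration, the following sketch tracks the queue length cycle by cycle; the fetch and dispatch patterns are assumed values chosen to mirror the behaviour described above, not data taken from the figure.

    #include <stdio.h>

    /* Track instruction-queue length per cycle.
       fetched[c]    = 1 if the fetch unit adds an instruction in cycle c+1,
       dispatched[c] = 1 if the dispatch unit removes one in cycle c+1.
       The patterns below are illustrative only. */
    int main(void) {
        int fetched[]    = {1, 1, 1, 1, 1, 1, 0, 0, 1, 1};
        int dispatched[] = {1, 1, 1, 1, 0, 0, 1, 1, 1, 1};
        int len = 1;                  /* queue initially holds one instruction */

        for (int c = 0; c < 10; c++) {
            len += fetched[c] - dispatched[c];
            printf("cycle %2d: queue length = %d\n", c + 1, len);
        }
        return 0;
    }

With these patterns the length stays at 1 for the first four cycles, rises to 3 during the stall, and falls back to 1 by cycle 8, matching the description above.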
The branch instruction does not increase the overall execution time. This is because the instruction fetch unit has executed the branch instruction (by computing the branch address) concurrently with the execution of other instructions; this technique is referred to as branch folding. Branch folding occurs only if, at the time a branch instruction is encountered, at least one instruction other than the branch instruction is available in the queue.
To keep the queue filled, the fetch unit and the instruction cache should allow reading more than one instruction in each clock cycle. Having an instruction queue is also beneficial in dealing with cache misses.
Delayed branch
The location that follows a branch instruction is called the branch delay slot. A technique called delayed branching can minimize the penalty incurred as a result of branch instructions: the instructions in the delay slots are always fetched and are executed regardless of the branch outcome, so the objective is to place useful instructions in these slots. If no useful instructions can be placed in the delay slots, these slots must be filled with NOP instructions.
In the example, a shift instruction is placed in the delay slot; it is fetched while the branch instruction is being executed. After evaluating the branch condition, the processor fetches the next instruction either at LOOP or at the instruction following the delay slot, depending on the outcome, so the branch causes no additional stall cycles.
Branch Prediction
The simplest form of branch prediction is to assume that the branch will not take place and to continue fetching instructions in sequential address order. If the branch decision turns out otherwise, the instructions fetched along the predicted path and all their associated data in the execution unit must be purged, and the correct instructions fetched and executed.
An incorrectly predicted branch is shown in the figure above, which depicts a Compare instruction followed by a Branch>0 instruction. Branch prediction takes place in cycle 3, while instruction I3 is being fetched. The fetch unit predicts that the branch will not be taken and continues fetching sequentially. The results of the Compare instruction are forwarded immediately to the instruction fetch unit, and the branch condition is evaluated in cycle 4. At this time, the instruction fetch unit realizes that the prediction was incorrect, and the incorrectly fetched instructions are purged. A new instruction, Ik, is fetched from the branch target address in clock cycle 5.
Dynamic branch prediction uses execution history to reduce the probability of making a wrong decision and to avoid fetching instructions that eventually have to be discarded. In its simplest form, the execution history used in predicting the outcome of a given branch instruction is the result of the most recent execution of that instruction: the processor assumes that the next time the instruction is executed, the outcome is likely to be the same. The two states of this algorithm are:
LT: Branch is likely to be taken
LNT: Branch is likely not to be taken
Assume that the algorithm is started in state LNT. When the branch instruction is executed, if the branch is taken, the machine moves to state LT; otherwise, it remains in state LNT. The next time the same instruction is encountered, the branch is predicted as taken if the corresponding state machine is in state LT; otherwise, it is predicted as not taken.
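A minimal sketch of this two-state algorithm (illustrative code, not from the slides; the loop outcomes are assumed example data):

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { LNT, LT } State;      /* likely not taken, likely taken */

    /* Predict taken only when the state is LT. */
    static bool predict(State s) { return s == LT; }

    /* The new state simply records the most recent outcome. */
    static State update(bool taken) { return taken ? LT : LNT; }

    int main(void) {
        /* Branch that closes a 4-iteration loop: taken three times, then not taken. */
        bool outcomes[] = {true, true, true, false};
        State s = LNT;
        for (int i = 0; i < 4; i++) {
            bool p = predict(s);
            printf("pass %d: predicted %-9s actual %-9s %s\n", i + 1,
                   p ? "taken" : "not taken",
                   outcomes[i] ? "taken" : "not taken",
                   p == outcomes[i] ? "correct" : "mispredicted");
            s = update(outcomes[i]);
        }
        return 0;
    }

Running this shows a misprediction on the first pass (the state is still LNT) and on the last pass, which is exactly the behaviour discussed next.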
Once a loop is entered, the branch instruction that controls looping will always yield the same result until the last pass through the loop is reached. In the last pass, the branch prediction will turn out to be incorrect, and the branch history state machine will change to the opposite state. This means that the next time this same loop is entered (assuming that there will be more than one pass through the loop), the machine will again make a wrong prediction.
Better performance can be achieved by keeping more information about the execution history.
An algorithm that keeps more execution history, using four states and thus two bits of information for each branch instruction, is shown in the figure. The four states are:
ST: Strongly likely to be taken
LT: Likely to be taken
LNT: Likely not to be taken
SNT: Strongly likely not to be taken
Assume that the state of the algorithm is initially set to LNT. After the branch instruction has been executed, if the branch is actually taken, the state is changed to ST; otherwise, it is changed to SNT. As program execution progresses and the same branch instruction is encountered again, the instruction fetch unit predicts that the branch will be taken if the state is either LT or ST, and it begins to fetch instructions at the branch target address; otherwise, it continues to fetch instructions in sequential order. When in state SNT, the instruction fetch unit predicts that the branch will not be taken. If the branch is actually taken, that is, if the prediction is incorrect, the state changes to LNT. This means that the next time the same branch instruction is encountered, the instruction fetch unit will still predict that the branch will not be taken. Only if the prediction is incorrect twice in a row will the state change to ST; after that, the branch will be predicted as taken.
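A minimal sketch of the four-state algorithm follows (illustrative code, not from the slides). The transitions out of LNT and SNT are the ones described above; the transitions out of ST and LT are assumptions consistent with that description.

    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { SNT, LNT, LT, ST } State;   /* 2-bit branch prediction states */

    /* Predict taken when the state is LT or ST. */
    static bool predict(State s) { return s == LT || s == ST; }

    static State update(State s, bool taken) {
        switch (s) {
        case LNT: return taken ? ST  : SNT;
        case SNT: return taken ? LNT : SNT;
        case ST:  return taken ? ST  : LT;    /* assumed: one miss weakens ST to LT */
        case LT:  return taken ? ST  : SNT;   /* assumed */
        }
        return s;
    }

    int main(void) {
        /* A loop-closing branch observed over two visits of a 4-pass loop. */
        bool outcomes[] = {true, true, true, false, true, true, true, false};
        State s = LNT;
        int mispredicts = 0;
        for (int i = 0; i < 8; i++) {
            if (predict(s) != outcomes[i]) mispredicts++;
            s = update(s, outcomes[i]);
        }
        printf("mispredictions = %d out of 8\n", mispredicts);  /* prints 3 */
        return 0;
    }

Unlike the two-state scheme, the first pass of the second loop visit is predicted correctly, because a single not-taken outcome only moves the state from ST to LT.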
Assume that the branch instruction is at the end of a loop and that the processor sets the initial state of the algorithm to LNT. During the first pass, the prediction will be wrong (not taken), and hence the state will be changed to ST.
In all subsequent passes, the prediction will be correct, except for the last pass; at that time, the state changes to LT. When the loop is entered a second time, the prediction will be correct (branch taken). One final modification can remove the misprediction that occurs on the first pass through the loop: the information needed to set the initial state correctly can be provided by the compiler, for example through a prediction bit in the branch instruction, and based on it the processor sets the initial state of the algorithm to LNT or LT.
In this case, at the end of the loop the compiler would indicate that the branch is likely to be taken. With this modification, branch prediction will be correct all the time, except for the final pass through the loop. The state information used in dynamic branch prediction algorithms may be kept by the processor in a variety of ways, for example in a look-up table; when the table fills up, older entries may be overwritten. This may lead to a branch being mispredicted, but it does not cause an error in execution.
Addressing modes
Addressing modes should provide the means for accessing a variety of data structures simply and efficiently. Useful addressing modes include index, indirect, autoincrement, and autodecrement, and many processors provide various combinations of these modes to increase the flexibility of their instruction sets. In choosing the addressing modes to be implemented in a pipelined processor, we must consider the effect of each addressing mode on instruction flow in the pipeline: in particular, the extent to which complex addressing modes cause the pipeline to stall, and whether a given mode is likely to be used by compilers.
Consider a Load instruction that uses a complex addressing mode, one in which the operand address is itself fetched from location X+[R1]. Executing it requires accessing memory twice: first to read location X+[R1] in clock cycle 4, and then to read location [X+[R1]] in cycle 5. A subsequent instruction that needs the loaded value would be stalled for 3 clock cycles, which can be reduced to two cycles with operand forwarding.
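A rough C analogue of this addressing mode makes the two memory accesses explicit (illustrative only; memory[], x, and r1 are assumed names in a toy word-addressed model):

    #include <stdint.h>
    #include <stdio.h>

    #define MEM_SIZE 1024
    static uint32_t memory[MEM_SIZE];    /* toy model of main memory, word addressed */

    /* Complex indexed-indirect load: the effective address is read from
       location X + [R1], then the operand is read from that address.
       Two memory accesses are required. */
    static uint32_t load_indexed_indirect(uint32_t x, uint32_t r1) {
        uint32_t pointer = memory[x + r1];   /* first access: read [X + [R1]] */
        uint32_t operand = memory[pointer];  /* second access: read the operand */
        return operand;                      /* value that would be placed in R2 */
    }

    int main(void) {
        memory[10]  = 200;   /* location X+[R1] holds the address of the operand */
        memory[200] = 42;    /* the operand itself */
        printf("R2 = %u\n", (unsigned)load_indexed_indirect(4, 6));  /* X = 4, [R1] = 6 */
        return 0;
    }

Each of the two array reads corresponds to one of the extra memory-access cycles that stall the pipeline.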
Accessing the same operand using only simple addressing modes requires several instructions. For example, on a computer that allows three operand addresses, the single complex Load can be replaced by an Add instruction followed by two Load instructions, as described below.
The Add instruction performs the operation R2 ← X+[R1], and the two Load instructions fetch the address and then the operand from memory. This sequence of instructions takes exactly the same number of clock cycles as the original, single Load instruction, as shown in the figure.
Advantage of complex addressing modes: they reduce the number of instructions needed to perform a given task and thereby reduce the program space needed in the main memory.
Disadvantages: their long execution times cause the pipeline to stall, reducing its effectiveness; they require more complex hardware to decode and execute; and they are not convenient for compilers to work with.
The addressing modes used in modern pipelined processors often have the following features:
Access to an operand does not require more than one access to the memory.
Only load and store instructions access memory operands.
The addressing modes used do not have side effects.
Three basic addressing modes that satisfy these requirements are register, register indirect, and index. The first two require no address computation. In the index mode, the address can be computed in one cycle, whether the index value is given in the instruction or in a register. None of these modes has any side effects, with one possible exception.
Condition codes
The condition code flags are stored in the processor status register. They are either set or cleared by many instructions, so that they can be tested by a subsequent conditional branch instruction to change the flow of program execution.
An optimizing compiler for a pipelined processor attempts to reorder instructions so as to avoid stalling the pipeline. In doing so, the compiler must ensure that reordering does not cause a change in the outcome of the computation.
Consider the sequence of instructions in Figure 8.17a: the branch decision takes place in step E2 rather than D2, because it must await the result of the Compare instruction. If the instructions are reordered so that another instruction is placed between the Compare and the Branch, the branch instruction is delayed by one cycle relative to the Compare instruction. As a result, by the time the branch instruction is being decoded, the result of the Compare instruction is available, and a correct branch decision can be made.
Interchanging the Add and Compare instructions can be done only if the Add instruction does not affect the condition codes. This leads to two important conclusions:
To provide flexibility in reordering instructions, the condition-code flags should be affected by as few instructions as possible.
The compiler should be able to specify in which instructions of a program the condition codes are affected and in which they are not.
Fig. 7.8
There are separate instruction and data caches that use separate address
and data connections to the processor. This requires two versions of the MAR register, IMAR for accessing the instruction cache and DMAR for accessing the data cache.
The PC is connected directly to the IMAR, so that the contents of the PC
can be transferred to IMAR at the same time that an independent ALU operation is taking place.
The data address in DMAR can be obtained directly from the register file
or from the ALU to support the register indirect and indexed addressing modes.
Separate MDR registers are provided for read and write operations. Data
can be transferred directly between these registers and the register file during load and store operations without the need to pass through the ALU.
Buffer registers have been introduced at the inputs and output of the ALU; these are registers SRC1, SRC2, and RSLT in Figure 8.7. Forwarding connections are not included in Figure 8.18; they may be added if desired.
The instruction register has been replaced with an instruction queue, which is loaded from the instruction cache. The output of the instruction decoder feeds the control signal pipeline; control signals must be buffered and passed from one stage to the next along with the instruction, because at any given time each pipeline stage is working on a different instruction.
The following operations can be performed independently in the processor of Figure 8.18:
Reading an instruction from the instruction cache
Incrementing the PC
Decoding an instruction
Reading from or writing into the data cache
Reading the contents of up to two registers from the register file
Writing into one register in the register file
Performing an ALU operation
These operations do not use any shared resources, so they can be performed simultaneously in any combination. For example, the following actions all happen during clock cycle 4:
Write the result of instruction I1 into the register file
Read the operands of instruction I2 from the register file
Decode instruction I3
Fetch instruction I4 and increment the PC
Performance considerations
The execution time T is given by
T = (N*S)/R
where N is the dynamic instruction count, S is the average number of clock cycles it takes to fetch and execute one instruction, and R is the clock rate. The instruction throughput, i.e. the number of instructions executed per second, is Ps = R/S.
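As a quick numerical illustration (the values of N, S, and R below are assumed, not taken from the slides), the formulas can be evaluated directly:

    #include <stdio.h>

    int main(void) {
        double N = 1e9;     /* dynamic instruction count (assumed) */
        double S = 1.2;     /* average clock cycles per instruction (assumed) */
        double R = 500e6;   /* clock rate in Hz, i.e. 500 MHz (assumed) */

        double T  = (N * S) / R;   /* total execution time in seconds */
        double Ps = R / S;         /* instruction throughput, instructions per second */

        printf("T  = %.2f s\n", T);            /* 2.40 s     */
        printf("Ps = %.1f MIPS\n", Ps / 1e6);  /* 416.7 MIPS */
        return 0;
    }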
A four-stage pipeline may increase the instruction throughput by a factor of four. In general, an n-stage pipeline has the potential to increase throughput n times. Thus, it would appear that the higher the value of n, the larger the performance gain. This leads to two questions:
How much of this potential increase in instruction throughput can actually be realized in practice?
What is a good value for n?
Any time the pipeline is stalled, the instruction throughput is reduced. Hence, the performance of a pipeline is highly influenced by factors such as branch and cache-miss penalties.
Let the delay through the ALU be the critical parameter; this is the time needed to add two integers. Thus, if the ALU delay is 2 ns, a clock of 500 MHz can be used. The on-chip instruction and data caches for this processor should also be designed so that they can be accessed within one clock period.
Let TI be the time between two successive instruction completions. For sequential execution, TI = S. However, in the absence of hazards, a pipelined processor completes the execution of one instruction in each clock cycle; thus, TI = 1 clock cycle.
A cache miss stalls the pipeline by an amount equal to the cache miss penalty.
Consider a computer that has a shared cache for instructions and data, and let d be the fraction of instructions that access data in memory. The average increase in TI caused by cache misses is

δmiss = ((1-hi) + d(1-hd)) * Mp

where hi and hd are the hit ratios for instructions and data, respectively, and Mp is the miss penalty in clock cycles. For example, assume that 30 percent of the instructions access data in memory, the instruction hit rate is 95 percent, the data hit rate is 90 percent, and Mp = 17 cycles. The average increase in TI is then

δmiss = (0.05 + 0.3*0.1) * 17 = 1.36 cycles
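The same calculation written out as code, using the values from the example above:

    #include <stdio.h>

    int main(void) {
        double hi = 0.95;   /* instruction hit ratio */
        double hd = 0.90;   /* data hit ratio */
        double d  = 0.30;   /* fraction of instructions that access data in memory */
        double Mp = 17.0;   /* cache miss penalty in clock cycles */

        /* Average number of stall cycles added to TI by cache misses. */
        double delta_miss = ((1.0 - hi) + d * (1.0 - hd)) * Mp;

        printf("delta_miss = %.2f cycles\n", delta_miss);   /* prints 1.36 */
        return 0;
    }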
With this increase, TI becomes 1 + 1.36 = 2.36 cycles, giving a throughput of about 0.42R, while sequential execution would give only about 0.19R; pipelining therefore still provides higher throughput. But the performance gain of 0.42/0.19 = 2.2 is only slightly better than one-half the ideal gain of 4, so the cache miss penalty must be reduced. This can be achieved by introducing a secondary cache between the primary, on-chip cache and the memory.
Assume that the time needed to transfer an 8-word block from the secondary cache to the primary cache is Ms clock cycles. Hence, a miss in the primary cache for which the required block is found in the secondary cache incurs a penalty of only Ms cycles; when the block is not in the secondary cache either, the full main-memory penalty Mp is still incurred. Hence, assuming a hit rate hs of 94 percent in the secondary cache, the average increase in TI is

δmiss = ((1-hi) + d(1-hd)) * (hs*Ms + (1-hs)*Mp) = 0.46 cycle

The instruction throughput in this case is 0.68R, or 340 MIPS, an improvement by a factor of about 1.6 over the previous case of 0.42R (210 MIPS).
The compiler can help by increasing the separation between two instructions that create a dependency, placing other, independent instructions between them whenever possible. Also, in a processor that uses an instruction queue, the cache miss penalty during instruction fetches has a reduced effect, because the processor is able to continue dispatching instructions from the queue.
Increasing the number of pipeline stages increases the probability of the pipeline being stalled, because more instructions are being executed concurrently and the dependencies between instructions may still cause the pipeline to stall. In the discussion so far, the clock period was chosen such that one ALU operation can be completed in one clock cycle, and other operations are divided into steps that take about the same amount of time.
Higher clock rates can be obtained if we divide instruction execution into smaller steps and use more pipeline stages and a faster clock. For example:
UltraSPARC II uses a 9-stage pipeline
Pentium Pro uses a 12-stage pipeline
Pentium 4 has a 20-stage pipeline and uses clock speeds in the gigahertz range
Exception Handling
Exceptional situations are harder to handle in a pipelined CPU
because the overlapping of instructions makes it more difficult to know whether an instruction can safely change the state of the CPU.
In a pipelined CPU, an instruction is executed piece by piece and is not completed for several clock cycles. Meanwhile, other instructions in the pipeline can raise exceptions that may force the CPU to abort the instructions in the pipeline before they complete.
Types of Exceptions
I/O device request
Invoking an operating system service from a user program
Tracing instruction execution
Breakpoint
Integer arithmetic overflow
FP arithmetic anomaly
Page fault
Misaligned memory accesses
Memory-protection violation
Undefined or unimplemented instruction
Hardware malfunctions
Privilege violation
Hardware and power failure
The requirements can be characterized along five semi-independent axes:
Synchronous vs. asynchronous: If the event occurs at the same place every time the program is executed with the same data and memory allocation, it is synchronous. With the exception of hardware malfunctions, asynchronous events are caused by devices external to the CPU and memory. Asynchronous events can usually be handled after the completion of the current instruction, which makes them easier to handle.
User-requested vs. coerced: If the task directly asks for the event, it is user-requested. User-requested exceptions are predictable and can usually be handled after the completion of the current instruction, which makes them easier to handle.
User-maskable vs. unmaskable: If an event can be masked or disabled by a user task, it is user-maskable. The mask simply controls whether the hardware responds to the exception or not.
Within vs. between instructions: Exceptions occurring within instructions are synchronous, and they are harder to deal with than exceptions that occur between instructions.
Resume vs. terminate: If the program's execution always stops after the interrupt, it is a terminating event; if execution continues after the interrupt, it is a resuming event. It is easier to implement exceptions that terminate program execution.
I/O device request: Errors related to an I/O request are usually indicated in the status data provided with the I/O interrupt.
Tracing instruction execution: Tracing is used to follow the execution of a program and is primarily used for debugging; typical trace controls are trace all, trace methods, trace off, and trace results.
Page fault: A page fault is an interrupt (or exception) to the software, raised by the hardware when a program accesses a page that is mapped in its address space but not loaded in physical memory.
Hardware malfunction
This behavior can occur if a hardware component malfunctions or if there are incompatible or faulty devices in the system. Typical troubleshooting steps:
Check the memory: Remove any extra memory modules, leaving only the least amount required for the computer to start and run Windows, then restart the computer to see whether the error messages persist.
Check the adapters: Remove any adapters that are not required to start the computer and run Windows. In many cases, you can start the computer with only the drive subsystem controller and the video adapter.
Check the computer BIOS/configuration: Verify that you have installed the latest revisions of your computer's BIOS or firmware configuration software. Go into the BIOS and load the fail-safe or BIOS defaults, disable any antivirus protection in the BIOS, and set Plug and Play OS to No.
Check for updated drivers.
Arithmetic overflow
If you use fixed-precision datatypes (smallint, integer, bigint, decimal, and numeric), it is possible that the result of a calculation does not fit the datatype. Try casting the values in complex expressions to double precision and see whether the error goes away. If it works and you do not care about exact precision, you can leave it at that; otherwise, you need to check every operation and calculate the size of the result. Here is an example: if you multiply 9.12 by 8.11 (both numeric(18,2)), you get 73.9632. If Firebird stored that in a numeric(18,2) datatype, we would lose 0.0032. That does not look like much, but with complex calculations you can easily lose thousands (of dollars or euros); therefore, the result is stored as numeric(18,4).
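The analogous problem in a language like C is integer overflow. The sketch below is only an illustration (it relies on the GCC/Clang builtin __builtin_mul_overflow, not anything from the slides) of detecting when a fixed-width multiplication would overflow and falling back to a wider type:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        int32_t a = 100000, b = 100000;
        int32_t product;

        /* __builtin_mul_overflow returns nonzero if a*b does not fit in product. */
        if (__builtin_mul_overflow(a, b, &product)) {
            /* Fall back to a wider type, analogous to storing the result as numeric(18,4). */
            int64_t wide = (int64_t)a * (int64_t)b;
            printf("32-bit overflow, 64-bit result = %lld\n", (long long)wide);
        } else {
            printf("result = %d\n", product);
        }
        return 0;
    }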
Breakpoint: This exception may be raised by code in some DLL that has found something really wrong, such as a destroyed heap; it is fairly common to program a debugger break instruction into the code so that a debugger gets a chance to stop, and fixing such a problem is usually difficult. Breakpoints are also set deliberately by a debugger under the direction of a programmer, to help solve or validate an issue in a program. The debugger defines a breakpoint event; when such an event occurs, the processor issues this exception, and the debugger regains control.
In paging, the memory address space is divided into equal, small pieces, called pages. Using a virtual memory mechanism, each page can be made to reside in any location of the physical memory, or be flagged as being protected. Virtual memory makes it possible to have a linear virtual memory address space and to use it to access blocks fragmented over physical memory address space.
A page table is used for mapping virtual memory to physical memory. The page table is usually invisible to the process. Page tables make it easier to allocate new memory, as each new page can be allocated from anywhere in physical memory.
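A toy illustration of this mapping (the structure and sizes below are assumed, not from the slides): each virtual page number indexes a page table entry holding a physical frame number and a valid bit, and an access to an invalid entry corresponds to a page fault.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_SIZE 4096u
    #define NUM_PAGES 16u

    typedef struct {
        uint32_t frame;   /* physical frame number */
        bool     valid;   /* page currently present in physical memory? */
    } PageTableEntry;

    static PageTableEntry page_table[NUM_PAGES];

    /* Translate a virtual address; returns true on success, false on a page fault. */
    static bool translate(uint32_t vaddr, uint32_t *paddr) {
        uint32_t vpn    = vaddr / PAGE_SIZE;   /* virtual page number */
        uint32_t offset = vaddr % PAGE_SIZE;

        if (vpn >= NUM_PAGES || !page_table[vpn].valid)
            return false;                      /* page fault: the OS must load the page */

        *paddr = page_table[vpn].frame * PAGE_SIZE + offset;
        return true;
    }

    int main(void) {
        page_table[2] = (PageTableEntry){ .frame = 7, .valid = true };

        uint32_t paddr;
        uint32_t vaddr = 2 * PAGE_SIZE + 100;  /* an address inside virtual page 2 */
        if (translate(vaddr, &paddr))
            printf("virtual 0x%X maps to physical 0x%X\n", (unsigned)vaddr, (unsigned)paddr);
        else
            printf("page fault at virtual 0x%X\n", (unsigned)vaddr);
        return 0;
    }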
When an exception occurs, the pipeline control can save the pipeline state safely by doing the following:
Force a trap instruction into the pipeline on the next IF.
Until the trap is taken, turn off all writes for the faulting instruction and for all the instructions that follow.
After the exception-handling routine in the operating system receives control, it saves the PC of the faulting instruction.
When delayed branching is used, it is impossible to recreate the state using a single PC, since the instructions in the pipeline may not be sequentially related.
Exceptions in MIPS
LD      IF   ID   EX   MEM  WB
DADD         IF   ID   EX   MEM  WB

This pair of instructions can cause a data page fault and an arithmetic exception at the same time, since the LD is in the MEM stage while the DADD is in the EX stage. This case can be handled by dealing with only the data page fault and then restarting the execution; when the second exception occurs again, it can be handled independently.
Exceptions can also occur out of program order. Consider again an LD followed by a DADD: the LD can get a data page fault, seen when the instruction is in MEM, and the DADD can get an instruction page fault, seen when the DADD instruction is in IF, which happens earlier in time. Since we are implementing precise exceptions, the pipeline is required to handle the exception caused by the LD instruction first, because the LD is instruction i and the DADD is instruction i+1.
The MIPS approach:
Hardware posts all exceptions caused by a given instruction in a status vector associated with that instruction.
The exception status vector is carried along as the instruction goes down the pipeline.
Once an exception indication is set in the exception status vector, any control signal that may cause a data value to be written is turned off.
Upon entering the WB stage, the exception status vector is checked, and the exceptions, if any, are handled according to the order in which they occurred. Allowing an instruction to continue execution until the WB stage is not a problem, since all write operations for that instruction will have been disallowed.
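A minimal sketch of this bookkeeping (illustrative; the exception codes and stage names are assumed): each instruction carries a status vector of exception flags down the pipeline, writes are suppressed once any flag is set, and the vector is examined at WB.

    #include <stdbool.h>
    #include <stdio.h>

    /* Assumed exception flags, one bit per possible cause. */
    enum {
        EXC_INSTR_PAGE_FAULT = 1 << 0,   /* detected in IF  */
        EXC_ARITH_OVERFLOW   = 1 << 1,   /* detected in EX  */
        EXC_DATA_PAGE_FAULT  = 1 << 2    /* detected in MEM */
    };

    typedef struct {
        const char *name;
        unsigned    status;   /* exception status vector carried with the instruction */
    } Instr;

    /* Record an exception; once any bit is set, later writes must be suppressed. */
    static void post_exception(Instr *in, unsigned cause) { in->status |= cause; }

    static bool writes_enabled(const Instr *in) { return in->status == 0; }

    /* At WB, check the vector and handle the exceptions for this instruction. */
    static void writeback(const Instr *in) {
        if (in->status)
            printf("%s: exception vector 0x%X handled at WB, result discarded\n",
                   in->name, in->status);
        else
            printf("%s: completed, result written\n", in->name);
    }

    int main(void) {
        Instr ld   = { "LD",   0 };
        Instr dadd = { "DADD", 0 };

        post_exception(&ld, EXC_DATA_PAGE_FAULT);   /* LD faults in its MEM stage */

        printf("LD writes enabled: %s\n", writes_enabled(&ld) ? "yes" : "no");
        writeback(&ld);
        writeback(&dadd);
        return 0;
    }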
Notes: The MIPS design does not allow an exception to occur in the WB stage, and all write operations in the MIPS pipeline are in late stages. Machines that allow writing in early pipeline stages are difficult to handle, since an exception can occur after the machine state has already been changed.