
Introduction

This tutorial was written by Edsko De Vries and makes use of the Vivio animation of a DLX/MIPS
processor.

The processor we will be considering in this tutorial is the MIPS processor. The MIPS processor,
designed in 1984 by researchers at Stanford University, is a RISC (Reduced Instruction Set
Computer) processor. Compared with their CISC (Complex Instruction Set Computer) counterparts
(such as the Intel Pentium processors), RISC processors typically support fewer and much simpler
instructions.

The premise is, however, that a RISC processor can be made much faster than a CISC processor
because of its simpler design. These days, it is generally accepted that RISC processors are more
efficient than CISC processors; and even the only popular CISC processor that is still around (Intel
Pentium) internally translates the CISC instructions into RISC instructions before they are executed
[1].

RISC processors typically have a load store architecture. This means there are two instructions for
accessing memory: a load (l) instruction to load data from memory and a store (s) instruction to
write data to memory. It also means that none of the other instructions can access memory directly.
So, an instruction like "add this byte from memory to register 1" from a CISC instruction set would
need two instructions in a load store architecture: "load this byte from memory into register 2" and
"add register 2 to register 1".

CISC processors also offer many different addressing modes. Consider the following instruction
from the Intel 80x86 instruction set (with simplified register names):

add r1, [r2+r3*4+60] // i86 (not MIPS) example

This instruction loads a value from memory and adds it to a register. The memory location is given
in between the square brackets. As you can see, the Intel instruction set, as is typical for CISC
architectures, allows for very complicated expressions for address calculations or "addressing
modes". The MIPS processor, on the other hand, only allows for one, rather simple, addressing
mode: to specify an address, you can specify a constant and a register. So, the above Intel instruction
could be translated as:

slli r3, r3, 2 // r3 := r3 << 2 (i.e. r3 := r3 * 4)


add r2, r2, r3 // r2 := r2 + r3
l r4, 60(r2) // r4 := memory[60+r2]
add r1, r1, r4 // r1 := r1 + r4

We need four instructions instead of one, and an extra register (r4), to do what can be done with one
instruction in a CISC architecture. The internal circuitry of the RISC processor is much simpler,
however, and can thus be made very fast. How this is done is the topic of this tutorial.
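As a sanity check, the two sequences can be compared in a tiny simulation. The following Python sketch is illustrative only: the dictionary-based registers and memory, and the initial values, are assumptions made for the example, not part of any MIPS definition.

```python
# Toy machine state, used to check that the four-instruction RISC sequence
# computes the same result as the single CISC instruction
# add r1, [r2 + r3*4 + 60].
def cisc_add(regs, mem):
    regs = dict(regs)                     # work on a copy of the registers
    regs["r1"] += mem[regs["r2"] + regs["r3"] * 4 + 60]
    return regs

def risc_add(regs, mem):
    regs = dict(regs)
    regs["r3"] = regs["r3"] << 2          # slli r3, r3, 2
    regs["r2"] = regs["r2"] + regs["r3"]  # add  r2, r2, r3
    regs["r4"] = mem[60 + regs["r2"]]     # l    r4, 60(r2)
    regs["r1"] = regs["r1"] + regs["r4"]  # add  r1, r1, r4
    return regs

regs0 = {"r1": 10, "r2": 100, "r3": 5, "r4": 0}
mem = {100 + 5 * 4 + 60: 7}               # the one location both versions read
assert cisc_add(regs0, mem)["r1"] == risc_add(regs0, mem)["r1"] == 17
```

Note that the RISC version also clobbers r2 and r3, which is exactly the kind of side effect the extra instructions introduce.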

Basic Processor Architecture

The execution of an instruction in a processor can be split up into a number of stages. How many
stages there are, and the purpose of each stage, differs from one processor design to another. Examples
include 2 stages (Instruction Fetch / Instruction Execute) and 3 stages (Instruction Fetch, Instruction
Decode, Instruction Execute). The MIPS processor has 5 stages:

IF The Instruction Fetch stage fetches the next instruction from memory using
the address in the PC (Program Counter) register and stores this instruction in
the IR (Instruction Register)
ID The Instruction Decode stage decodes the instruction in the IR, calculates the
next PC, and reads any operands required from the register file.
EX The Execute stage "executes" the instruction. In fact, all ALU operations are
done in this stage. (The ALU is the Arithmetic and Logic Unit and performs
operations such as addition, subtraction, shifts left and right, etc.)
MA The Memory Access stage performs any memory access required by the
current instruction. So, for loads, it would load an operand from memory. For
stores, it would store an operand into memory. For all other instructions, it
would do nothing.
WB For instructions that have a result (a destination register), the Write Back
stage writes this result back to the register file. Note that this includes nearly all
instructions, except nops (a nop, no-op or no-operation instruction simply
does nothing) and s (stores).

Consider the execution of the following program and try to consider the operation of each of the 5
processor stages (IF, ID, EX, MA and WB) for each individual instruction. (Note that register r0 in
the MIPS processor is always 0).

l r1, 0(r0) // r1 := memory[0]


l r2, 1(r0) // r2 := memory[1]
add r3, r1, r2 // r3 := r1 + r2
s 2(r0), r3 // memory[2] := r3

Try it out. Follow the execution of this program in the MIPS animation and check if you were right.
If you don't understand each and every bit of the diagram yet, don't despair - each stage will be
explained separately and in detail in the next section.
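One way to internalize the five stages before tracing the animation is to model the example program in a few lines of code. This Python sketch is a rough, non-pipelined model under simplified assumptions (tuple-encoded instructions, a dictionary for memory); it is not the real MIPS instruction encoding.

```python
# Each instruction passes through IF, ID, EX, MA, WB in one long cycle.
# Only the behaviour relevant to this example program is modelled.
def run(program, mem):
    regs = {f"r{i}": 0 for i in range(8)}
    for op, *args in program:               # IF/ID: fetch and decode (implicit)
        if op == "l":                       # l rd, off(rs)
            rd, off, rs = args
            addr = off + regs[rs]           # EX: compute effective address
            regs[rd] = mem[addr]            # MA: read memory; WB: write register
        elif op == "s":                     # s off(rs), rt
            off, rs, rt = args
            mem[off + regs[rs]] = regs[rt]  # EX: address; MA: write memory
        elif op == "add":                   # add rd, rs, rt
            rd, rs, rt = args
            regs[rd] = regs[rs] + regs[rt]  # EX: ALU; WB: write register
        regs["r0"] = 0                      # r0 is hard-wired to zero
    return regs, mem

mem = {0: 3, 1: 4, 2: 0}
regs, mem = run([("l", "r1", 0, "r0"),
                 ("l", "r2", 1, "r0"),
                 ("add", "r3", "r1", "r2"),
                 ("s", 2, "r0", "r3")], mem)
assert mem[2] == 7                          # memory[2] := memory[0] + memory[1]
```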

Details of the Processor Stages

As I said before, the Instruction Fetch phase fetches the next instruction. First, it sends the contents
of the PC register, which contains the address for the next instruction, to the instruction memory (1).
The instruction memory will then respond by sending the correct instruction. This instruction is sent
on to the next (instruction decode) phase (2).

The instruction decode phase will calculate the next PC and will send it back to the IF phase (4) so
that the IF phase knows which instruction to fetch next. To be able to execute a jr instruction (which
changes the PC to the contents of a register), we also need a connection from the register file to the
PC (5).

One of these (4 or 5) is then selected (MUX1) to update the PC on the next rising clock edge. The
control line for MUX1 (not shown on the diagram) would also come from the ID phase, and is based
on instruction type.

The Instruction Decode phase has two main tasks: calculate the next PC and fetch the operands
for the current instruction. There are three possibilities for the next PC.
For the most common case, all "normal" instructions (that is, instructions that are not branches or
jumps), we simply calculate PC+4 to get the address of the next instruction (3, ADD4).

For jumps and branches (if the branch is taken), we might also have to add some immediate value
(the branch offset) to the PC (ADDi). This branch offset is encoded in the instruction itself (6).

For the jr and jalr instructions, we need to use the value of a register instead.
MUX3 selects between the first two possibilities (+4 or +immediate), and MUX1 (in the IF phase)
selects between MUX3 and the third possibility (the value from the register file). The control line for
MUX3 comes from a zero-detector (not shown), which in turn gets its input from the register file
(thin line marked "a"). The control line for MUX1 is not shown and is based on the instruction type.
The second main task is to fetch the operands for the current instruction. These operands come
from the register file for all but two instructions (7, 8). The two exceptions are
jal and jalr. These two jump instructions save the address of the next instruction in a destination
register, so instead of sending an operand from the register file, we need to send
PC+4 (10).

MUX5 selects between (7) and (10). Again, the control line for this MUX (not shown) would also
come from the ID phase and is based on the instruction type.
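The next-PC selection performed by MUX3 and MUX1 can be summarized in a few lines. This Python sketch follows the multiplexer structure described above; the function signature and the encoding of instruction kinds are assumptions made for illustration.

```python
def next_pc(pc, kind, offset=0, reg_value=0, branch_taken=False):
    # MUX1 (in IF): for jr/jalr, take the value from the register file (5).
    if kind in ("jr", "jalr"):
        return reg_value
    # MUX3 (in ID): PC + immediate for jumps and taken branches (ADDi)...
    if kind in ("j", "jal") or (kind in ("beqz", "bnez") and branch_taken):
        return pc + offset
    # ...and PC + 4 for everything else (ADD4).
    return pc + 4

assert next_pc(100, "add") == 104
assert next_pc(100, "beqz", offset=40, branch_taken=True) == 140
assert next_pc(100, "beqz", offset=40, branch_taken=False) == 104
assert next_pc(100, "jr", reg_value=64) == 64
```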

The Execution phase "executes" the instruction. Any calculations necessary are done in this phase.
These calculations are all done by the ALU (Arithmetic and Logic Unit). The internals of this unit
are beyond the scope of this tutorial, but you should find an explanation of this unit in any elementary
book on digital logic; see for example [2].

Conceptually this stage is very simple. The ALU needs two operands. These either come from the ID
phase (8, 9) and thus in turn from the register file (or PC+4 for jump and link instructions), or one
operand comes from the ID phase (8) and the other comes from the instruction register (6) to supply
an immediate value.

MUX7 selects between (9) and (6). The control line for this MUX is not shown, but would come
from the ID phase and is based on the instruction type (it will select (9) for instructions that operate
on two registers, and (6) for instructions that work on one register and one immediate value).

(11) is used for the s (store) instruction. Consider the following instruction:

s 10(r0), r2 // memory[10+r0] := r2

The calculation 10+r0 will be done by the ALU (so which source does MUX7 select?), and the
result of this calculation will be passed on to the MA stage (12). In the MA phase, however, we will
also need to know the contents of r2. This is why we need (11).

The purpose of the Memory Access phase is to store operands into memory or load operands from
memory. So, for all instructions except loads and stores, the MA phase simply passes the value from
the ALU on to the WB stage (12, 16).

For loads and stores, the MA needs to send the effective address calculated by the ALU to memory
(13). For loads, the memory will then respond by sending the requested data back (15). For stores,
we need to send the data to be stored along with the effective address (11, 14), and the memory will
respond by updating itself.

Finally, MUX9 selects between (12) and (15) as the input for (16). As usual, this decision
would be based on the instruction type and the control line would come from the ID phase (not
shown).
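The three cases the MA stage distinguishes (pass through, load, store) fit in a few lines. A hedged Python sketch, using the same simplified dictionary memory as before:

```python
def memory_access(op, alu_result, store_value, mem):
    # Loads: the ALU result is the effective address; read memory (13, 15).
    if op == "l":
        return mem[alu_result]
    # Stores: write the operand carried alongside the address (11, 14).
    if op == "s":
        mem[alu_result] = store_value
        return alu_result
    # Everything else: pass the ALU result straight on to WB (12, 16).
    return alu_result

mem = {8: 42}
assert memory_access("l", 8, None, mem) == 42     # MUX9 selects (15)
assert memory_access("add", 5, None, mem) == 5    # MUX9 selects (12)
memory_access("s", 9, 7, mem)
assert mem[9] == 7                                # memory updated
```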

The Write Back phase, finally, is a very simple one. It simply takes the output of the MA phase (..),
and writes it back to the register file to store the result into the destination register.

For stores (and no-ops), the write-back does nothing.

Try it out. Watch the execution of example 1, example 2 and example 3 and see if you can
understand the operation of each processor stage.

Pipelining the Processor

If you have watched the execution of your programs carefully, you will have noticed that only one
stage is active at any given moment. First, the IF stage fetches the instruction. Then, the ID phase
decodes the instruction and fetches the operands, while the IF phase is idle. Then the EX phase
executes the instruction while both IF and ID are idle. And so forth. Also, the clock period is really
long: it must be long enough to allow an instruction to propagate through all 5 stages.

An obvious method to speed up the processor is to make the individual pipeline stages work in
parallel. So, when the ID phase is decoding an instruction i, the IF phase would be fetching the next
instruction i + 1 in parallel. On the next clock cycle, the EX phase would execute instruction i, the ID
phase would decode instruction i + 1 and the IF phase would fetch instruction i + 2. This technique is
called pipelining.

For this to work, each phase needs its own set of registers to hold the instruction and instruction
operands. This is necessary because each stage will be working on a different instruction. But
inserting registers in between the different phases has a huge advantage: the clock period can now be
much shorter. Indeed, we only need to allow for an instruction to make it through one stage of the
processor, instead of all 5. So, at least in theory, we can clock the processor 5 times faster. Shown
visually:

No pipelining

Cycle  IF   ID   EX   MA   WB
1      i    i    i    i    i
2      i+1  i+1  i+1  i+1  i+1

With pipelining

Cycle  IF   ID   EX   MA   WB
1      i
2      i+1  i
3      i+2  i+1  i
4      i+3  i+2  i+1  i
5      i+4  i+3  i+2  i+1  i

There are two observations to be made here. First of all, instruction i takes 1 (long) clock cycle in the
non-pipelined processor, and it takes 5 (short) clock cycles in the pipelined processor. Note that this
"short" cycle is 1/5 of the "long" cycle, so it actually takes just as long to execute an instruction in
the non-pipelined version as it does in the pipelined version.

The point is, however, that even though it takes just as long to execute one instruction, both
processors finish the execution of an instruction after every clock cycle. Instruction i+1 finishes one
clock cycle after instruction i for both processors. BUT: the clock cycle for the pipelined processor is
a fifth of the clock cycle for the non-pipelined processor. Hence, the throughput, the number of
instructions executed per unit time, is 5 times higher for the pipelined processor than it is for the non-
pipelined processor. In other words, the pipelined processor is 5 times faster than the non-pipelined
processor.

Try it out. To illustrate this point, watch the execution of a single instruction in the non-pipelined
and in the pipelined processor. Notice that in the pipelined processor, the clock period is much
shorter, and every stage now has its own set of registers. Also note that despite this, the instruction
takes the same time to execute in either processor.

Try it out. Then consider the following simple program and watch the execution in both the non-
pipelined and the pipelined processor and compare execution speed. You should notice that it takes 6
long cycles to execute the program in the non-pipelined processor, and 10 short cycles to execute the
program in the pipelined processor.

addi r1, r0, 4 // r1 := 4


addi r2, r0, 5 // r2 := 5
addi r3, r0, 6 // r3 := 6
slli r1, r1, 2 // r1 := r1 << 2 (r1 := r1 * 4)
subi r2, r2, 3 // r2 := r2 - 3
add r3, r3, r3 // r3 := r3 + r3
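The cycle counts quoted above (6 long cycles versus 10 short cycles) follow from two small formulas; the quick check below uses the idealized factor-of-5 clock ratio assumed in the text.

```python
def cycles_unpipelined(n):
    # One long clock cycle per instruction.
    return n

def cycles_pipelined(n, stages=5):
    # Fill the pipeline once (stages - 1 cycles), then finish one
    # instruction per short cycle.
    return (stages - 1) + n

n = 6                                  # the six-instruction example program
assert cycles_unpipelined(n) == 6      # 6 long cycles
assert cycles_pipelined(n) == 10       # 10 short cycles

# With a short cycle one fifth of a long cycle, total time drops from
# 6 long cycles to the equivalent of 2, a speedup of 3x (see also Q2).
long_t, short_t = 5.0, 1.0
assert cycles_unpipelined(n) * long_t / (cycles_pipelined(n) * short_t) == 3.0
```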

Data Hazards

Of course, it is not as simple as it seems (is anything ever?). The above program was carefully
chosen for the pipelining to work. Consider the following code fragment:

add r1, r2, r3 // r1 := r2 + r3
sub r3, r1, r2
xor r3, r1, r2
srl r3, r1, r2
ori r3, r1, 10

If we look at the pipeline during the execution of these instructions, we see:

Cycle  IF   ID   EX   MA   WB

1      add
2      sub  add
3      xor  sub  add
4      srl  xor  sub  add
5      ori  srl  xor  sub  add
6           ori  srl  xor  sub

In cycle 3, the sub instruction reads the value of register r1. The add instruction, however, whose
destination register is r1, has not reached the write-back phase yet. Hence, the sub will read the
wrong value for r1!

This situation is called a data hazard. There is another data hazard in cycle 4 (which instructions?).
The srl in cycle 5 is OK, because the add will write the result back to the register file in the first
half of the clock cycle, and the srl will read its operands in the second half of the clock cycle (two-
phase register file access). It should be clear that the ori in cycle 6 is OK too - the add has finished
execution completely in this cycle.

The easiest solution is to ignore the problem. While that might sound like a bad solution, it is
actually quite usable. Consider the above program again, and assume that r1 is initially 1, r2 is
initially 2, and r3 is initially 3. Then:

add r1, r2, r3 // r1 := 5
sub r3, r1, r2 // r3 := -1 (uses old value of r1: 1 - 2)
xor r3, r1, r2 // r3 := 3 (uses old value of r1: 1 xor 2)
srl r3, r1, r2 // r3 := 1 (uses new value of r1: 5 >> 2)
ori r3, r1, 10 // r3 := 15 (uses new value of r1: 5 | 10)

It becomes a bit harder to read the source code, because the processor seems to execute instructions
out of order. But as long as this behaviour is well-defined, the programmer (or the compiler) can take
it into account and make good use of it. (Note that the processor does not really execute instructions
out of order.)
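The well-defined behaviour described above boils down to a single rule about instruction distance. The following sketch states that rule; the cut-off of 3 assumes the two-phase register file access described earlier (write in the first half of the cycle, read in the second half).

```python
# With no forwarding and no interlock, an instruction reads its operands in
# ID while an earlier result is only written back in WB.  A reader therefore
# sees the new value only if it is at least 3 instructions after the writer.
def value_seen(distance, old, new):
    return new if distance >= 3 else old

assert value_seen(1, "old", "new") == "old"   # sub: old value
assert value_seen(2, "old", "new") == "old"   # xor: old value
assert value_seen(3, "old", "new") == "new"   # srl: new value
assert value_seen(4, "old", "new") == "new"   # ori: new value
```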

Try it out. Follow the execution of this program in the MIPS pipeline.

Pipeline Interlocks

A second solution is to stall the pipeline. To explain what a stall is, we need a bit of extra
terminology. We can divide the pipeline into two parts, instruction fetch and instruction execute:

Instruction Fetch:   IF, ID
Instruction Execute: EX, MA, WB

When an instruction makes it from the first half of this pipeline to the second half, i.e. from the ID
phase to the EX phase, it is said to be issued. On a stall, we do not issue the instruction in the ID
phase. Instead, we issue a nop ("do nothing") instruction. In the meantime, the IF and ID phases
will simply re-execute (we shall later see that it is not quite as simple as that). The above program
would be executed as follows:

Cycle  IF   ID   EX   MA   WB

1      add
2      sub  add
3      xor  sub  add             We detect a stall here, and issue a nop
                                 instead of the sub
4      xor  sub  nop  add        The add has not yet reached WB, so we
                                 need another stalled cycle
5      xor  sub  nop  nop  add   The add will write its result in the first
                                 half of the cycle, so the sub will now read
                                 the correct value in the second half
6      srl  xor  sub  nop  nop
The downside is that with this solution, we need 7 cycles instead of 5 to execute our program. That's
a 40% increase in execution time!
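The interlock itself is just a comparison between the registers read in ID and the destination registers still in flight. A hedged sketch of that detection logic (the function shape is an assumption; real hardware compares register numbers held in the pipeline registers):

```python
def must_stall(id_sources, ex_dest, ma_dest):
    # Stall while the producing instruction has not yet passed WB.  Thanks
    # to two-phase register file access, a destination currently in WB no
    # longer causes a stall.
    in_flight = {d for d in (ex_dest, ma_dest) if d is not None}
    return bool(set(id_sources) & in_flight)

# add r1,... in EX while sub reads r1 in ID: stall.
assert must_stall(["r1", "r2"], ex_dest="r1", ma_dest=None)
# One cycle later the add is in MA: still stall.
assert must_stall(["r1", "r2"], ex_dest=None, ma_dest="r1")
# add has reached WB: no stall needed.
assert not must_stall(["r1", "r2"], ex_dest=None, ma_dest=None)
```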

Try it out. Run this program in the animation. For clarity, the MIPS animation will issue a "stall"
instruction instead of a nop instruction upon a stall. This is only to distinguish nops issued because of
stalls from nops that were actually in the instruction memory. In reality, the stall instruction does
not exist.

ALU Forwarding

We want to be able to use the result of the ALU directly, without having to wait for the result to be
written back to the register file. For this, we need to forward the result of the ALU directly back into
the ALU.

Before the result gets written to the destination register, it is written into the intermediate registers
OUT0 and OUT1 first (see diagram). We need to forward the contents of OUT0 and OUT1 back into
the input of the ALU (connections marked 1a and 1b connect OUT0 to the ALU, and connections
marked 2a and 2b connect OUT1 to the ALU). Of course, we then also need a multiplexer to select
the right operand (MUX6 and MUX7).

With this solution, there are no more stalls, and the code is executed the way you would expect from
reading the source code. If you don't understand it, try the animation - it should become clear.
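The forwarding paths amount to a priority choice at each ALU input. A sketch of that choice (the tuple representation of OUT0/OUT1 as (destination, value) pairs is an assumption made for illustration):

```python
def alu_operand(reg_num, reg_file, out0, out1):
    # OUT0 holds the most recent ALU result, OUT1 the one before it.
    # Prefer the youngest matching result; fall back to the register file.
    if out0 is not None and out0[0] == reg_num:
        return out0[1]            # forwarded via connections 1a/1b
    if out1 is not None and out1[0] == reg_num:
        return out1[1]            # forwarded via connections 2a/2b
    return reg_file[reg_num]      # no hazard: normal register file read

regs = {"r1": 0, "r2": 2}
assert alu_operand("r1", regs, out0=("r1", 3), out1=None) == 3   # forwarded
assert alu_operand("r2", regs, out0=("r1", 3), out1=None) == 2   # from file
assert alu_operand("r1", regs, out0=None, out1=("r1", 3)) == 3   # older result
```

MUX6 and MUX7 implement exactly this choice, one per ALU input.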

Load Hazards

Consider the following code fragment.

l r1, 1(r2) // Load the value at address 1+r2 into r1


add r3, r1, r2

How is this executed in the pipeline?

Cycle  IF   ID   EX   MA   WB

1      l
2      add  l
3           add  l
4                add  l
5                     add  l

Again, the load will only write the value from memory into the register file in the WB phase, whereas
the add retrieves its operands in the ID phase. Thus, the add will actually use an old value of r1. As
before, we could ignore this problem and call it a feature, and that would be fine. We could also stall
the pipeline for two cycles:

Cycle  IF   ID   EX   MA   WB

1      l
2      add  l
3           add  l               We detect a stall, issue a nop instead
                                 of the add
4           add  nop  l          And another one
5           add  nop  nop  l

The load will write its result into OUT1 before it is written to the register file, however. So if we use
pipeline forwarding, and forward the contents of OUT1 directly back to the ALU, we can reduce
this to a one-cycle stall.

Cycle  IF   ID   EX   MA   WB

1      l
2      add  l
3           add  l               We detect a stall, issue a nop instead
                                 of the add
4           add  nop  l          The load's result is written to OUT1 at
                                 the end of this cycle, so the add can be
                                 issued
5                add  nop  l

We cannot do better than this, simply because the value is not loaded from memory until the MA
phase. Note that the load calculates the effective address of the memory operand in the EX phase;
in the above example, it would calculate 1+r2.
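With ALU forwarding in place, the load-use case is the only data hazard that still stalls, and only for one cycle. A sketch of that special case (the function shape is an illustrative assumption):

```python
def load_use_stalls(ex_is_load, ex_dest, id_sources):
    # A load's value only exists after MA, so an instruction in ID that
    # reads the load's destination while the load is in EX must wait one
    # cycle; forwarding from OUT1 then covers the rest.
    return 1 if (ex_is_load and ex_dest in id_sources) else 0

assert load_use_stalls(True, "r1", ["r1", "r2"]) == 1   # l r1,...; add ..,r1,..
assert load_use_stalls(True, "r1", ["r2", "r3"]) == 0   # independent instruction
assert load_use_stalls(False, "r1", ["r1", "r2"]) == 0  # ALU result: forwardable
```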

Try it out. Consider each solution (ignoring the problem, interlock and interlock with forwarding) in
the MIPS pipeline.

Stores

The store instruction exhibits the same data hazards as, for example, arithmetic instructions such as
add or sub. Consider the following program:

add r1, r2, r3
s 10(r0), r1 // memory[10] := r1

To solve the hazard, we have the usual three options:

1. Copy the result from B (effectively, from the register file) to SMR and ignore any
dependencies. In effect, leave the hardware as it was.

2. Stall the pipeline. This results in a two-cycle penalty.

3. Use forwarding. This requires an additional multiplexer (MUX8) and connections from
OUT0 and OUT1 (connections marked 3 and 4).

Note that, conceptually, this is not a new problem (it is the exact same data hazard as we saw before).
It is treated separately here, however, because we do need a bit of extra hardware to resolve the
hazard for stores.

Control Hazards

Consider the following code sequence:

add r1, r2, r3


j l1 // Jump to (label) l1
sub r2, r3, r0

l1: xor r3, r0, r1
ori r1, r2, 10

How is this executed in the MIPS pipeline?

Cycle  IF   ID   EX   MA   WB

1      add
2      j    add
3      sub  j    add             The branch is resolved here, but we have
                                 already fetched the sub
4      xor  sub  j    add
5      ori  xor  sub  j    add

Notice that we executed the sub instruction which, logically, should never have executed because of
the branch (j). This happens because the branch target is only calculated in the ID phase. This
problem is called a control hazard.

As with load hazards, we can decide to ignore the control hazard. Once we define that the instruction
directly after the branch is executed anyway, we can take that into account and make use of it. This
technique is called delayed branches, and the instruction following the branch is said to be in the
delay slot.
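Delayed branches simply redefine the semantics: the instruction after the branch always executes. A small interpreter sketch with a one-instruction delay slot (the tuple encoding and absolute jump targets are assumptions made for the example):

```python
def run_delayed(program):
    # program: list of tuples such as ("add",) or ("j", target_index).
    trace, pc = [], 0
    while pc < len(program):
        op = program[pc]
        trace.append(op[0])
        if op[0] == "j":                      # branch resolved in ID...
            trace.append(program[pc + 1][0])  # ...so the delay slot runs anyway
            pc = op[1]
        else:
            pc += 1
    return trace

prog = [("add",), ("j", 3), ("sub",), ("xor",), ("halt",)]
# The sub sits in the delay slot of the jump, so it executes before the xor.
assert run_delayed(prog) == ["add", "j", "sub", "xor", "halt"]
```

A compiler targeting such a machine tries to move a useful instruction into the delay slot, falling back to a nop when it cannot.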

Try it out. Run this program using delayed branches.

Alternatively, we can stall the pipeline. Remember, when we stall the pipeline, the instruction in the
ID phase (which would be the branch) is not issued (a nop is issued instead). We also said that the IF
and ID phase simply re-execute, i.e., their registers are not updated.

As it turns out, this is not completely true. If the PC were not updated, the IF phase would fetch
the same instruction again (the sub), and we would be in the exact same situation. Thus, for a control
hazard, the PC is updated. The ID register and PC1 are not updated, however, because if they were,
the sub instruction would make it to the ID phase, which is exactly what we want to avoid.
Here is how the program would be executed using the pipeline interlock:

Cycle  IF   ID   EX   MA   WB

1      add
2      j    add
3      sub  j    add             We detect a stall
4      xor  j    nop  add        The PC is updated, so when the IF
                                 re-executes, a different instruction
                                 is loaded
5      ori  xor  j    nop  add
Try it out. Run this program using branch interlock.

Branch Prediction

Finally, there is a third option: branch prediction. Branch prediction uses a Branch Target Buffer
to predict what the target of the PC will be. This branch target buffer or BTB is an associative
cache. Each entry in this cache lists the address of a branch instruction (the tag), together with the
predicted target. This predicted target is based on the last time the branch was executed.

When the IF phase executes, it sends the contents of the PC to the BTB (1). The BTB then compares
this with the addresses (tags) in its cache. If the address is in the cache, we must have executed the
branch before. We then predict that the branch will do the same as the last time, so we send the
associated value in the BTB to the PC (2).

If the address is not there, either because we never saw the branch before (a compulsory miss), or
because we did see the branch before, but too long ago, and it has been replaced by another
branch in the BTB (a capacity or conflict miss), we cannot predict the branch and we simply
calculate PC+4. (This is also called predict-not-taken, because we "predict" that we do not take
the branch and simply want to execute the next instruction in memory.)

Of course, the prediction could be wrong. We cannot determine this until the ID phase, however.
Remember, in the IF phase, we do not know what kind of instruction we are executing - this is only
determined in the ID phase. Therefore, in the ID phase, we calculate the branch target anyway. We
then compare the calculated branch target with the predicted PC (by comparing it to the PC register).
If the prediction is correct, all is well, but if it is wrong (or if we did not predict the target address
because of a BTB miss), we need to stall the pipeline and update the BTB (connections 3, 4 and 5).
So there are 3 different situations:

1. The branch is not in the BTB.
We update the PC with PC+4. When the branch gets to the ID phase, we stall the pipeline to
update the BTB.

2. The branch is in the BTB, but the prediction is wrong.
We update the PC with the predicted PC. When the branch gets to the ID phase, we stall the
pipeline to update the BTB.

3. The branch is in the BTB and the prediction is correct.
We update the PC with the predicted PC. When the branch gets to the ID phase, no action is
taken.
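The three situations can be modelled with a small tag-to-target map. This sketch deliberately ignores capacity and conflict behaviour (a real BTB is a finite associative cache); the class and method names are illustrative assumptions.

```python
class BTB:
    def __init__(self):
        self.entries = {}            # branch address (tag) -> predicted target

    def predict(self, pc):
        # Hit: predict the same target as last time (2).
        # Miss: fall back to predict-not-taken, i.e. PC+4.
        return self.entries.get(pc, pc + 4)

    def resolve(self, pc, actual_target):
        # In ID the real target is known; on a mispredict or a miss we
        # must stall and update the BTB (connections 3, 4 and 5).
        stalled = self.predict(pc) != actual_target
        self.entries[pc] = actual_target
        return stalled

btb = BTB()
assert btb.predict(100) == 104       # 1. not in BTB: PC+4
assert btb.resolve(100, 140)         # wrong, so stall and update
assert btb.predict(100) == 140       # 3. in BTB, prediction correct
assert not btb.resolve(100, 140)     # no action needed
assert btb.resolve(100, 160)         # 2. in BTB but wrong: stall and update
```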

Try it out. See a demonstration of branch prediction. Notice that all three situations listed above
occur when the program is executed.

Conditional Branches

The MIPS processor supports two conditional branches: beqz (branch if equal to zero) and bnez
(branch if not equal to zero). Both these instructions take an arbitrary register as a source operand.
This yields another data hazard:

add r1, r2, r3
bnez r1, l1
sub r2, r3, r1
l1: xor r3, r2, r1
ori r3, r1, 10

Consider the execution of this program in the pipeline:

Cycle  IF    ID    EX   MA   WB

1      add
2      bnez  add              We could use a BTB here, or
                              predict-not-taken
3      sub   bnez  add        Now we need to check if our prediction
                              was correct

The problems start in cycle 3. We need to check register r1 to see if it is not equal to zero. But the
add instruction is still in the EX phase: the value for r1 is still being calculated! As with the other
hazards that we have seen, we can ignore this problem and call it a feature.

We can also stall the pipeline, but we would have to stall it for two cycles (we would have to wait
until the add gets to the WB phase). Even then, we would only have the result towards the end of the
cycle (the register file is updated halfway through the cycle, and we still have to do the comparison
against zero), but that is acceptable. That is quite a long stall, however, and this situation is very
common (regard the add as a compare instruction and you'll understand why).

Finally, we could try to forward the result, as we did for data hazards. That means we would have to
forward the contents of OUT0 (connection c) and OUT1 (d) to the comparator. But that is not
enough! We also need to forward the result of the ALU as it is being calculated to the comparator
(b). Remember, this result will only be written to OUT0 on the next clock cycle - which is too late.
This adds an asynchronous (pronounce: dodgy) element to the design, but it is necessary to avoid a
stall in this - very common - situation.

Try it out. See all three possible solutions (no zero interlock, zero interlock, and zero forwarding)
in the animation.

Summary of Pipeline Stalls

The following table lists all possible stalls exhaustively, with the conditions for the stall. For each
individual stall, you can run an animation that will show a program in which the stall occurs. Note
that only for the control hazards, the PC register is updated during the stall. For all other stalls, the
PC register is not updated (see section on control hazards for more information).

The sample programs for each of these hazards allow you to change the circuit configuration and the
contents of the instruction memory, if you wish. This allows you to try to remove the hazard by
changing the program or the circuit configuration.

Hazard        IF  ID            EX           MA           WB  Condition

RAW           i1  i2 ..,Rx,..   i3 Rx,..,..  i4           i5  No ALU forwarding
              i1  i2 ..,Rx,..   i3           i4 Rx,..,..  i5  No ALU forwarding
RAW (store)   i1  sb ..,Rx      i3 Rx,..,..  i4           i5  No forwarding to SMR
              i1  sb ..,Rx      i3           i4 Rx,..,..  i5  No forwarding to SMR
RAW (branch)  i1  beqz Rx       i3 Rx,..,..  i4           i5  No zero-forwarding
              i1  beqz Rx       i3           i4 Rx,..,..  i5  No zero-forwarding
              i1  jr Rx         i3 Rx,..,..  i4           i5  (Always)
              i1  jr Rx         i3           i4 Rx,..,..  i5  (Always)
Load          i1  i2 ..,Rx,..   lb Rx,..     i4           i5  Load interlock
Control       i1  j L1          i3           i4           i5  Calculated PC ≠ PC register
              i1  jr Rx         i3           i4           i5  Calculated PC ≠ PC register
              i1  beqz Rx, L1   i3           i4           i5  Calculated PC ≠ PC register

Questions

Q1. Why is the MA phase idle for all instructions except loads and stores?

Q2. In the section introducing pipelining, the same program is executed in the non-pipelined and
in the pipelined processor. Confirm that the speedup is 300%, and explain why this is less than the
expected 500%.

Q3. Draw a pipeline diagram to show the hazard mentioned in the section on forwarding to the
SMR. Then write a program to show this hazard, and observe the effects of the different circuit
configurations (Store operand forwarding, Store interlock, and No store interlock).

Q4. The table with all pipeline stalls lists two situations that always cause a stall. What extra
hardware would you need to avoid these stalls?

Q5. The diagram shows four control lines for multiplexer MUX3 (see the section on conditional
branches). Are all of these four lines strictly necessary?

Q6. There are quite a number of stalls during the execution of the shift-and-add multiplier program.
Rewrite the program to avoid nearly all of these stalls.

Bibliography

[1] Computer Architecture: A Quantitative Approach, 3rd edition, John L. Hennessy & David A.
Patterson, Morgan Kaufmann Publishers [now Elsevier]

[2] Logic and Computer Design Fundamentals, 2nd edition, M. Morris Mano and Charles R.
Kime, Prentice Hall

