
ANAND INSTITUTE OF HIGHER TECHNOLOGY Chennai-603 103

DEPARTMENT OF ELECTRONICS AND INSTRUMENTATION ENGINEERING


CS2071 COMPUTER ARCHITECTURE
Faculty Name: C.MAGESHKUMAR

Class: IV EIE A&B Semester: VII

UNIT III DATA PATH AND CONTROL


CONTENT

I. Instruction Execution Steps
1. A Small Set of Instructions
2. The Instruction Execution Unit
3. A Single-Cycle Data Path
4. Branching and Jumping
5. Deriving the Control Signals
6. Performance of the Single-Cycle Design

II. Control Unit Synthesis
7. A Multicycle Implementation
8. Choosing the Clock Cycle
9. The Control State Machine
10. Performance of the Multicycle Design

III. Microprogramming

IV. Pipelining
11. Pipelining Concepts
12. Pipeline Stalls or Bubbles
13. Pipeline Timing and Performance
14. Pipelined Data Path Design
15. Pipelined Control
16. Optimal Pipelining

V. Pipeline Performance
17. Data Dependencies and Hazards
18. Data Forwarding
19. Pipeline Branch Hazards
20. Delayed Branch and Branch Prediction
21. Advanced Pipelining
CMageshKumar_AP_AIHT

CS2071_Computer Architecture

I. INSTRUCTION EXECUTION STEPS


1. A SMALL SET OF INSTRUCTIONS
MiniMIPS instruction set: 40 instructions
MicroMIPS instruction set: 22 instructions
The 22 MicroMIPS instructions can be divided into 5 categories:
1. Seven (7) R-format ALU instructions (add, sub, slt, and, or, xor, nor)
2. Six (6) I-format ALU instructions (lui, addi, slti, andi, ori, xori)
3. Two (2) I-format memory access instructions (lw, sw)
4. Three (3) I-format conditional branch instructions (bltz, beq, bne)
5. Four (4) unconditional jump instructions (j, jr, jal, syscall)
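R format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | sh (5 bits) | fn (6 bits)
I format: op (6 bits) | rs (5 bits) | rt (5 bits) | imm (16 bits)
J format: op (6 bits) | jta (26 bits)

Field   Bits     Width     Meaning
op      31-26    6 bits    Opcode
rs      25-21    5 bits    Source 1 or base
rt      20-16    5 bits    Source 2 or destination
rd      15-11    5 bits    Destination
sh      10-6     5 bits    Unused
fn      5-0      6 bits    Opcode extension
imm     15-0     16 bits   Operand / offset
jta     25-0     26 bits   Jump target address
inst    31-0     32 bits   Instruction

Fig.1. MICROMIPS INSTRUCTION FORMATS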
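
The field layout of Fig.1 can be illustrated with a short Python sketch that splits a 32-bit instruction word into its fields. The function name `decode` and the sample word are illustrative choices, not part of the notes; every field is extracted regardless of format, and the control unit decides which ones are meaningful.

```python
def decode(inst):
    """Split a 32-bit instruction word into the fields of Fig.1."""
    inst &= 0xFFFFFFFF
    return {
        "op": (inst >> 26) & 0x3F,   # bits 31-26, opcode
        "rs": (inst >> 21) & 0x1F,   # bits 25-21, source 1 or base
        "rt": (inst >> 16) & 0x1F,   # bits 20-16, source 2 or destination
        "rd": (inst >> 11) & 0x1F,   # bits 15-11, destination
        "sh": (inst >> 6) & 0x1F,    # bits 10-6, unused in MicroMIPS
        "fn": inst & 0x3F,           # bits 5-0, opcode extension
        "imm": inst & 0xFFFF,        # bits 15-0, operand / offset (I format)
        "jta": inst & 0x3FFFFFF,     # bits 25-0, jump target address (J format)
    }


# Sample word: add $8, $17, $18 (R format, op = 0, fn = 32)
fields = decode(0x02324020)
print(fields["op"], fields["rs"], fields["rt"], fields["rd"], fields["fn"])
```

For an R-format word only op, rs, rt, rd, and fn matter; for an I-format word the low 16 bits are read as imm instead.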

Execution sequence of MicroMIPS instructions:


The seven R-format ALU instructions (add, sub, slt, and, or, xor, nor) share the following execution sequence:
1. Read out the contents of source registers rs and rt and forward them to the ALU as inputs.
2. Direct the ALU to perform the desired operation by means of the appropriate control signals.
3. Write the output of the ALU into destination register rd.
Five of the six I-format ALU instructions (addi, slti, andi, ori, xori) share the following execution sequence:
1. Read out the content of source register rs and forward it, together with the immediate value, to the ALU as inputs.
2. Direct the ALU to perform the desired operation by means of the appropriate control signals.
3. Write the output of the ALU into destination register rt.
The remaining I-format ALU instruction (lui) has the following execution sequence:
1. Read out the immediate value from the instruction and forward it to the ALU as input.
2. Direct the ALU to perform the desired operation by means of the appropriate control signals.
3. Write the output of the ALU into destination register rt.
The two I-format memory access instructions (lw, sw) share the following execution sequence:
1. Read out the content of rs.
2. Add the number read out from rs to the immediate value in the instruction to form a memory address.
3. Read from / write into memory at the specified address.
4. In the case of lw, place the word read out from memory into rt.
The three I-format conditional branch instructions (bltz, beq, bne) and the four unconditional jump instructions (j, jr, jal, syscall) have the following execution sequence:
1. Read out the content of source register rs and the immediate value and forward them to the ALU as inputs.
2. Direct the ALU to perform the desired operation by means of the appropriate control signals.
3. The branch target address is specified by an offset relative to the incremented program counter value, (PC)+4.
4. To branch back to the previous instruction, the offset supplied in the immediate field of the instruction is -2, which results in the branch target address (PC)+4-(2*4) = (PC)-4.
5. For beq and bne, the contents of rs and rt are compared to determine whether the branch condition is satisfied.
6. For bltz, the branch decision is based on the sign bit of the content of rs.
7. For the 4 jump instructions (j, jr, jal, syscall):
The PC is unconditionally modified so that the next instruction is fetched from the jump target address.
The jump target address comes from the instruction itself (j, jal), is read out from register rs (jr), or is a known constant associated with the location of an operating system routine (syscall).
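
The branch rules above can be checked with a minimal Python sketch. The helper names (`sign_extend16`, `branch_target`, `branch_taken`) are hypothetical, and register contents are assumed to be held as unsigned 32-bit integers.

```python
def sign_extend16(imm):
    """Interpret a 16-bit immediate field as a signed integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def branch_target(pc, imm):
    """Target = (PC) + 4 + 4 * offset, where offset is the signed imm field."""
    return (pc + 4 + 4 * sign_extend16(imm)) & 0xFFFFFFFF


def branch_taken(mnemonic, rs_val, rt_val=0):
    """Evaluate the branch condition for beq, bne, and bltz."""
    if mnemonic == "beq":
        return rs_val == rt_val
    if mnemonic == "bne":
        return rs_val != rt_val
    if mnemonic == "bltz":
        return bool(rs_val & 0x80000000)  # sign bit of (rs)
    raise ValueError("not a conditional branch: " + mnemonic)


# An offset of -2 (0xFFFE as a 16-bit field) branches back to (PC) - 4.
print(hex(branch_target(0x00400010, 0xFFFE)))
```

The offset counts words, which is why it is multiplied by 4 before being added to the incremented PC.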
2. THE INSTRUCTION EXECUTION UNIT
The step-by-step execution of all 22 MicroMIPS instructions can be followed on the block diagram below:
1. Beginning at the left end, the content of the program counter (PC) is supplied to the instruction cache and an instruction word is read out from the specified location.
2. With every clock tick, a new address is loaded into the program counter, causing a new instruction to appear at the output of the instruction cache after a short access delay.
3. The contents of the various fields of the instruction are sent to the relevant blocks, including the control unit (which decides the operation to be performed).
4. Once an instruction has been read out from the instruction cache, its fields are separated and dispatched to the appropriate places.
Example: the op and fn fields go to the control unit; rs, rt, and rd go to the register file.
5. The upper input of the ALU always comes from register rs; the lower input comes from rt or from the immediate value of the instruction.
6. As the data from the register file pass through the ALU, the specified operation is performed and the result appears at the ALU output.
7. For arithmetic and logic instructions, the ALU output bypasses the data cache and is routed through the feedback line to be stored in the destination register of the register file.
8. For memory access instructions, the ALU output is treated as the data address for writing into / reading from the data cache.
9. Data cache: for many instructions, the output of the ALU is stored in a register, so the data cache is bypassed. For lw and sw, the data cache is accessed: the content of rt is written into the data cache for sw, and the data cache output is sent to the register file for lw.
10. In one clock cycle, the contents of any 2 of the 32 registers (rs and rt) are read out through the read ports; at the same time, a result is stored into a register via the write port.
11. The flip-flops that make up the registers are edge-triggered, so reading and writing the same register in a single clock cycle does not cause any problem.
12. For beq and bne, the contents of rs and rt are compared to determine whether the branch condition is satisfied. The comparison is performed in the next-address block.
13. For bltz, the branch decision is based on the sign bit of the content of rs rather than on a comparison of two register contents. This is also performed by the next-address block.
14. The next-address block also chooses the jump target address under the guidance of the control unit.
15. The jump target address comes from the instruction itself (j, jal) or is read out from register rs (jr).
16. The middle part, comprising the program counter, instruction cache, register file, ALU, and data cache, is known as the data path.
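[Figure: block diagram with PC, instruction cache, register file, ALU, data cache, next-address block, and control unit. The instruction fields rs, rt, rd go to the register file, imm to the ALU input, op and fn to the control unit, and jta to the next-address block. Separate instruction and data caches are labelled "Harvard architecture".]
Fig.2. Abstract view of the instruction execution unit for MicroMIPS.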

3. A SINGLE-CYCLE DATA PATH


1. The middle part, comprising the program counter, instruction cache, register file, ALU, and data cache, is known as the data path.
2. The data path shown above executes one instruction per clock cycle, hence the name single-cycle data path.
3. Single-cycle design: clock rate 125 MHz, CPI 1.
4. Three multiplexers are used in the data path:
1. At the input side of the register file
2. At the lower input of the ALU
3. At the outputs of the ALU and data cache
5. Multiplexer 1 (at the input side of the register file):
i. This multiplexer allows rt, rd, or $31 to be used as the index of the destination register into which the result will be written.
ii. The logic signal RegDst, supplied by the control unit, directs the selection of rt, rd, or $31.
iii. RegDst control signal values and the corresponding selections:
S.no   RegDst   Selection
1      00       rt
2      01       rd
3      10       $31
iv. RegWrite is asserted by the control unit to write into the register file.
6. Registers rs and rt are read out for every instruction, even when they are not needed, so there is no read control signal.
7. The instruction cache likewise receives no read control signal, since an instruction is read out in every cycle.
8. Multiplexer 2 (at the lower input of the ALU):
i. By asserting or deasserting the ALUSrc control signal, the control unit chooses either the content of rt or the 32-bit sign-extended version of the 16-bit immediate operand as the second ALU input.
1. If ALUSrc = 0 (deasserted), the content of rt is used as the lower ALU input.
2. If ALUSrc = 1 (asserted), the 32-bit sign-extended version of the 16-bit immediate operand is used as the lower ALU input.
ii. Sign extension of the immediate operand is performed by the SE block.
9. Multiplexer 3 (at the outputs of the ALU and data cache): the control signal used here is RegInSrc.
S.no   RegInSrc   Selection
1      00         Data cache output
2      01         ALU output
3      10         Incremented PC value coming from the next-address block
10. With every clock tick, a new address is loaded into the program counter, causing a new instruction to appear at the output of the instruction cache after a short access delay.
11. The contents of the various fields of the instruction are sent to the relevant blocks, including the control unit (which decides the operation to be performed).
12. As the data from the register file pass through the ALU, the operation specified by the ALUFunc signal is performed and the result appears at the ALU output.
13. For arithmetic and logic instructions, the ALU output bypasses the data cache and is routed through the feedback line to be stored in the destination register of the register file.
14. For memory access instructions, the ALU output is treated as the data address for writing into (DataWrite signal) or reading from (DataRead signal) the data cache.
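
The three multiplexers described above can be modelled as small selection functions. This is only a behavioural sketch in Python; the function names are hypothetical, and values are treated as unsigned 32-bit integers.

```python
def sign_extend_16_to_32(imm16):
    """SE block: replicate bit 15 of the immediate into bits 31-16."""
    imm16 &= 0xFFFF
    return (imm16 | 0xFFFF0000) if (imm16 & 0x8000) else imm16


def reg_dst_mux(reg_dst, rt, rd):
    """Multiplexer 1: RegDst = 00 selects rt, 01 selects rd, 10 selects $31."""
    return (rt, rd, 31)[reg_dst]


def alu_src_mux(alu_src, rt_value, imm16):
    """Multiplexer 2: ALUSrc = 0 selects (rt), 1 selects the extended immediate."""
    return sign_extend_16_to_32(imm16) if alu_src else rt_value


def reg_in_src_mux(reg_in_src, data_cache_out, alu_out, incr_pc):
    """Multiplexer 3: RegInSrc = 00 data cache, 01 ALU output, 10 incremented PC."""
    return (data_cache_out, alu_out, incr_pc)[reg_in_src]


# An I-format ALU instruction such as addi writes the ALU output into rt.
print(reg_dst_mux(0b00, rt=9, rd=0), hex(alu_src_mux(1, 0, 0xFFFC)))
```

An out-of-range selector (for example RegDst = 11) raises an IndexError here, which mirrors the fact that the tables above define only three inputs per multiplexer.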

[Figure: single-cycle data path for MicroMIPS. Blocks: PC, instruction cache, register file, ALU, SE (sign extension), data cache, next-address block. Control signals: Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RegInSrc. Status output: ALUOvfl.]

4. BRANCHING AND JUMPING:

(Refer page no. 249,250 in text book B.Parhami)


5. DERIVING THE CONTROL SIGNALS:

(Refer page no. 250-253 in text book B.Parhami)


Control signals for the single-cycle MicroMIPS implementation.

6. PERFORMANCE OF THE SINGLE-CYCLE DESIGN

(Refer page no. 253-255 in text book B.Parhami)


II. CONTROL UNIT SYNTHESIS

7. MULTICYCLE IMPLEMENTATION:

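[Figure: timing comparison of Instr 1 to Instr 4. Single-cycle: every instruction is allotted the same long clock cycle, whatever time it actually needs. Multicycle: with a shorter clock, Instr 1 takes 3 cycles, Instr 2 takes 5 cycles, Instr 3 takes 3 cycles, and Instr 4 takes 4 cycles; the difference appears as time saved.]
Fig.3. Single-cycle versus multicycle instruction execution.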

With multicycle design, a subset of actions required for an instruction is performed in one clock cycle.
Hence the clock cycle can be made much shorter, with several cycles needed to execute a single instruction.
Advantages of multicycle implementation are greater speed and economy

MULTICYCLE DATA PATH:

[Figure: multicycle execution unit with PC, a single cache, Inst Reg and Data Reg, register file, x Reg and y Reg at the ALU inputs, ALU, z Reg at the ALU output, and a control unit driven by op and fn.]
Fig.4. Abstract view of a multicycle instruction execution unit for MicroMIPS.


1. The data path in the above block diagram executes one instruction every 3-5 clock cycles, hence the name multicycle data path.
2. Multicycle design: clock rate 500 MHz, CPI approx. 4.
3. Cache block = instruction cache + data cache.
4. Every instruction is executed in 3 to 5 cycles (refer to the control state machine).
5. When a word is read from the cache block, it must be held in a register for use in subsequent cycles.
6. There are two registers, the instruction register and the data register, between the cache and the register file because, once an instruction has been read out, it must be kept for all the remaining cycles of its execution in order to generate the control signals appropriately.
7. A second register is therefore needed for the data read out by lw.
8. Three other registers, namely x, y, and z, serve the same purpose of holding information between cycles.
9. Note that, except for the program counter and the instruction register, all registers are loaded in every clock cycle.
10. Instruction fetch cycle: execution of every instruction starts the same way in the first cycle. The content of the PC is used to access the cache and the retrieved word is placed in the instruction register.
11. In the second clock cycle, the instruction is decoded and registers rs and rt are accessed.
12. If the instruction is one of the four jump instructions (j, jr, jal, syscall), its execution terminates in the 3rd cycle by simply writing the appropriate address into the PC.
13. If it is a branch instruction (beq, bne, bltz), the branch condition is checked and the appropriate value is written into the PC in the 3rd cycle.
14. All other instructions proceed to, and are completed in, the 4th cycle, except lw.
15. The lw instruction requires a 5th cycle to write the data retrieved from the cache into a register.
FOR DETAILED CONTROL SIGNAL AND MUX EXPLANATION REFER PAGE NO. 260, 261 IN B.PARHAMI BOOK
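
The cycle counts listed above determine the average CPI once an instruction mix is known. The following Python sketch uses a purely hypothetical mix (the percentages are an assumption chosen for illustration, not figures taken from these notes) to show how a value close to the quoted CPI of about 4 can arise.

```python
# Cycles per instruction class, as listed above.
CYCLES = {"jump": 3, "branch": 3, "alu": 4, "sw": 4, "lw": 5}


def average_cpi(mix):
    """Weighted average of cycle counts; mix maps class name to its share."""
    total = sum(mix.values())
    return sum(CYCLES[name] * share for name, share in mix.items()) / total


# Hypothetical instruction mix in percent (assumption for illustration only).
mix = {"alu": 44, "lw": 24, "sw": 12, "branch": 18, "jump": 2}
print(average_cpi(mix))
```

A program with more loads pushes the average towards 5, while one dominated by branches and jumps pulls it towards 3.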

8. CHOOSING THE CLOCK CYCLE

(Refer page no. 262 in text book B.Parhami)

9. THE CONTROL STATE MACHINE

CONTROL STATE MACHINE for MULTICYCLE MicroMIPS

The control unit must distinguish between the 5 cycles of the multicycle design and must also be able to perform different operations depending on the instruction.
The control state machine diagram depicts the control states and state transitions.
The control state machine carries the required information along by moving from state to state. It is set to state 0 when program execution begins.
It then moves from state to state until one instruction has been completed, at which point it returns to state 0 to begin the execution of another instruction.
The control state sequences for the various MicroMIPS instruction classes are as follows:
ALU type: 0, 1, 7, 8
Load word: 0, 1, 2, 3, 4
Store word: 0, 1, 2, 6
Jump / branch: 0, 1, 5
In each state except states 5 and 7, the control signals are uniquely determined.
Information regarding the current control state and the instruction being executed is supplied by decoders.
The control signals can be derived from the control state machine diagram and the decoder diagram.
In the expressions below, ∨ denotes OR, ∧ denotes AND, and ′ denotes complement.
Control signals that depend only on the control state:
ALUSrcX = ControlSt2 ∨ ControlSt5 ∨ ControlSt7
RegWrite = ControlSt4 ∨ ControlSt8
Auxiliary signals identifying instruction classes:
addsubInst = addInst ∨ subInst ∨ addiInst
logicInst = andInst ∨ orInst ∨ xorInst ∨ norInst ∨ andiInst ∨ oriInst ∨ xoriInst
Logic expressions for ALU control signals:
AddSub = ControlSt5 ∨ (ControlSt7 ∧ subInst)
FnClass1 = ControlSt7′ ∨ addsubInst ∨ logicInst
FnClass0 = ControlSt7 ∧ (logicInst ∨ sltInst ∨ sltiInst)
LogicFn1 = ControlSt7 ∧ (xorInst ∨ xoriInst ∨ norInst)
LogicFn0 = ControlSt7 ∧ (orInst ∨ oriInst ∨ norInst)
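
The logic expressions above can be exercised with a small Python sketch. Two assumptions are made here and should be checked against the textbook: the operators missing between terms are read as OR (between terms) and AND (before a parenthesis), and ControlSt7 is taken as complemented in FnClass1, so that the ALU adds in every state other than 7. The function name `alu_control` is hypothetical.

```python
ADDSUB = {"add", "sub", "addi"}
LOGIC = {"and", "or", "xor", "nor", "andi", "ori", "xori"}


def alu_control(state, inst):
    """Evaluate the control-signal expressions for a state and a mnemonic."""
    st2, st4, st5, st7, st8 = (state == s for s in (2, 4, 5, 7, 8))
    addsub_inst = inst in ADDSUB
    logic_inst = inst in LOGIC
    return {
        "ALUSrcX": st2 or st5 or st7,
        "RegWrite": st4 or st8,
        "AddSub": st5 or (st7 and inst == "sub"),
        # Assumption: ControlSt7 is complemented in this expression.
        "FnClass1": (not st7) or addsub_inst or logic_inst,
        "FnClass0": st7 and (logic_inst or inst in {"slt", "slti"}),
        "LogicFn1": st7 and inst in {"xor", "xori", "nor"},
        "LogicFn0": st7 and inst in {"or", "ori", "nor"},
    }


signals = alu_control(7, "sub")
print(signals["AddSub"], signals["FnClass1"], signals["FnClass0"])
```

Under this reading, the pair (FnClass1, FnClass0) is (1, 0) for add and subtract, (1, 1) for logic operations, and (0, 1) for set-less-than in state 7.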

Decoders: the control state st (4 bits), the op field (6 bits), and the fn field (6 bits) are each decoded into one-hot signals.

st decoder: outputs 0 to 8 give ControlSt0 to ControlSt8.

op decoder:
op    Signal
0     RtypeInst
1     bltzInst
2     jInst
3     jalInst
4     beqInst
5     bneInst
8     addiInst
10    sltiInst
12    andiInst
13    oriInst
14    xoriInst
15    luiInst
35    lwInst
43    swInst

fn decoder (used for R-type instructions):
fn    Signal
8     jrInst
12    syscallInst
32    addInst
34    subInst
36    andInst
37    orInst
38    xorInst
39    norInst
42    sltInst

10. PERFORMANCE OF THE MULTICYCLE DESIGN

(Refer page no. 266 in text book B.Parhami)

III. MICROPROGRAMMING
The control state machine resembles a program that has instructions (states), branching, and loops. Such a hardware program is called a microprogram and its basic steps are microinstructions.
A microinstruction is a single instruction in microcode. It is the most elementary instruction in the computer, such as moving the contents of a register to the arithmetic logic unit (ALU).
It takes several microinstructions to carry out one complex machine instruction (CISC).
Also called a "micro-op" or "µop", microinstructions differ within the same computer family and even within the same vendor.
Microprogrammed control is a control mechanism that generates control signals by using a memory called control storage (CS), which contains the control signals.
Although microprogrammed control seems advantageous for CISC machines, since CISC requires systematic development of sophisticated control signals, there is no intrinsic difference between the two control mechanisms (hardwired and microprogrammed).
Microprogramming is a method of control unit design in which the control signal selection and sequencing information is stored in a ROM or RAM called the control store or control memory.
The microprogrammed control unit is a general approach to implementing a control unit. Here the control signals are generated by a program similar to a machine language program.
Instead of implementing the control state machine in custom hardware, we can store microinstructions in the locations of a control ROM, fetching and executing a sequence of microinstructions for each machine language instruction.
Each microinstruction defines a step in the execution of a machine language instruction.
Advantages of ROM-based implementation of control:
o Simpler hardware
o More regular structure
o Less dependence on instruction-set architecture details
o The same hardware can be used for different purposes by modifying the ROM contents
Microprogramming: designing a suitable sequence of microinstructions to realize a particular instruction set architecture is called microprogramming.
Microprogrammable machine: if the microprogram is easily modifiable, even by the user, the machine is called a microprogrammable machine.
Microinstruction format:
o 23-bit microinstruction format. Each bit, except the sequence control bits, corresponds one-to-one to a control signal of the multicycle data path.
o The 2-bit sequence control field controls microinstruction sequencing in the same way that PC control affects the sequencing of machine language instructions.
Microprogrammed control unit: the diagram of the microprogrammed control unit for MicroMIPS shows 4 options (multiplexer inputs) for choosing the next microinstruction.
o Option 0: advance to the next microinstruction in sequence by incrementing the microprogram counter.
o Options 1 and 2: allow branching depending on the opcode field of the machine instruction being executed.
o Option 3: go to microinstruction 0, corresponding to state 0 (refer to the control state machine). This initiates the fetch phase for the next machine instruction.
Dispatch table 1: corresponds to the multiway branch in going from cycle 2 to cycle 3.
Dispatch table 2: implements the branch between cycles 3 and 4 (refer to the control state machine).
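
The four sequencing options can be sketched in Python as follows. Everything concrete in this example is an assumption made for illustration: microinstruction addresses are taken to equal the control state numbers, option 1 is mapped to dispatch table 1 and option 2 to dispatch table 2, and the table contents cover only four opcodes (0 for R-type, 2 for j, 35 for lw, 43 for sw).

```python
# Hypothetical tables: opcode -> microinstruction address (= control state).
DISPATCH1 = {0: 7, 2: 5, 35: 2, 43: 2}   # branch from cycle 2 to cycle 3
DISPATCH2 = {35: 3, 43: 6}               # branch from cycle 3 to cycle 4

# Hypothetical sequence-control field stored with each microinstruction.
SEQ_CTRL = {0: 0, 1: 1, 2: 2, 3: 0, 4: 3, 5: 3, 6: 3, 7: 0, 8: 3}


def next_micro_pc(seq_ctrl, micro_pc, op):
    """Select the next microinstruction address from the 4 options."""
    if seq_ctrl == 0:
        return micro_pc + 1       # option 0: next in sequence
    if seq_ctrl == 1:
        return DISPATCH1[op]      # option 1: dispatch table 1
    if seq_ctrl == 2:
        return DISPATCH2[op]      # option 2: dispatch table 2
    if seq_ctrl == 3:
        return 0                  # option 3: back to the fetch microinstruction
    raise ValueError("sequence control is a 2-bit field")


def trace(op):
    """List the microinstruction addresses visited for one machine instruction."""
    visited, micro_pc = [0], 0
    while True:
        micro_pc = next_micro_pc(SEQ_CTRL[micro_pc], micro_pc, op)
        if micro_pc == 0:
            return visited
        visited.append(micro_pc)


print(trace(35))  # lw
```

With these assumed tables the traces reproduce the control state sequences listed for the control state machine, which is the sense in which the microprogram replaces the hardwired state machine.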

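Microinstruction fields:
PC control: JumpAddr, PCSrc, PCWrite
Cache control: InstData, MemRead, MemWrite, IRWrite
Register control: RegWrite, RegDst, RegInSrc
ALU inputs: ALUSrcX, ALUSrcY
ALU function: AddSub, LogicFn, FnType
Sequence control

23-BIT MICROINSTRUCTION FORMAT FOR MICROMIPS.

[Figure: the MicroPC supplies the address to the microprogram memory or PLA; the word read out is held in the microinstruction register, which provides the control signals to the data path and the sequence control field. A 4-input multiplexer (inputs 0 to 3) selects the next MicroPC value from the incremented MicroPC, dispatch table 1, dispatch table 2 (both indexed by op from the instruction register), and microinstruction address 0.]

Microprogrammed control unit for MicroMIPS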

(For detailed explanation with microprogram example please Refer page no. 269 - 271 in text
book B.Parhami)

IV. PIPELINING
11. PIPELINING CONCEPTS
Two strategies for achieving greater performance:
Strategy 1: Multiple-instruction-issue or superscalar organization: use multiple independent data paths that can accept several instructions that are read out at once.
Strategy 2: Pipelined or super-pipelined organization: overlap the execution of several instructions in the single-cycle design, starting the next instruction before the previous instruction has finished.
Pipelining:
Pipelining is an implementation technique in which multiple instructions are overlapped in execution. The computer pipeline is divided into stages.
Each stage completes a part of an instruction, and the stages work in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.
Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput.
The throughput of the instruction pipeline is determined by how often an instruction exits the pipeline.

The 5 instruction execution steps (stages) in the MicroMIPS pipeline:

Each step takes 1-2 ns.
1. Instruction Fetch
2. Instruction Decode and register access
3. ALU operation
4. Data memory access
5. Register writeback

[Figure: Instr 1 to Instr 5 (task dimension) plotted against Cycle 1 to Cycle 9 (time dimension). Each instruction passes through instruction cache, register file, ALU, data cache, and register file, and each successive instruction starts one cycle after the previous one.]

Pipelined Instruction Execution (Pipelining in the MicroMIPS instruction execution process.)

In a task-time diagram, the stages of each task are horizontally aligned and their positions along the horizontal axis represent the timing of their execution.
In a space-time diagram, the vertical axis represents the stages in the pipeline (the space dimension) and the boxes representing the various stages of a task are diagonally aligned.
Ideally, a q-stage pipeline can increase instruction execution throughput by a factor of q. In practice the gain is smaller because of the following:
o Effects of pipeline start-up and drainage
o Wastage due to unequal stage delays
o Time overhead of saving stage results in registers
o Safety margin in clock period necessitated by clock skew
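
The first of these effects, start-up and drainage, is easy to quantify. The Python sketch below assumes an idealized pipeline (equal stage delays, no register overhead, no stalls), so it captures only that one effect; the function names are illustrative.

```python
def pipeline_cycles(n_tasks, q_stages):
    """Cycles to run n tasks through a q-stage pipeline with no stalls."""
    return q_stages + n_tasks - 1


def speedup(n_tasks, q_stages):
    """Speedup over executing every task in q cycles without overlap."""
    return (n_tasks * q_stages) / pipeline_cycles(n_tasks, q_stages)


# 7 instructions in a 5-stage pipeline occupy cycles 1 to 11.
print(pipeline_cycles(7, 5), round(speedup(7, 5), 2))
```

For short instruction sequences the speedup stays well below q; it approaches q only as the number of tasks grows large.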

[Figure: (a) Task-time diagram: instructions 1 to 7 plotted against cycles 1 to 11. (b) Space-time diagram: pipeline stages plotted against cycles 1 to 11, with the start-up and drainage regions marked. Legend: f = Fetch, r = Reg read, a = ALU op, d = Data access, w = Writeback.]

Fig. Two abstract graphical representations of a 5-stage pipeline executing 7 tasks (instructions).

12. PIPELINE STALLS OR BUBBLES

Data dependency in a pipeline: the execution of one instruction depends on the completion of a previous instruction.
Data dependencies can cause pipeline stalls, which diminish performance.
Types of data dependency:
o Read-after-compute: register access after updating it with a computed value.
o Read-after-load: register access after updating it with data from memory.
An example of read-after-compute is shown in the diagram below, where the 3rd instruction uses the value that the 2nd instruction writes into register $8, and the 4th instruction needs the result of the 3rd instruction in register $9. Note that the write into register $8 is completed in cycle 6. Hence, reading the new value from register $8 is possible beginning with cycle 7, whereas the 3rd instruction reads out registers $8 and $2 in cycle 4. The data dependency problem can be solved by bubble insertion or by data forwarding.

BUBBLE INSERTION:
First detect the type of data dependency.
Bubble insertion: inserting a redundant and harmless instruction (adding 0 to a register, or shifting a register by 0 bits) before the dependent instruction. Such an instruction is called a no-op (no-operation) instruction. Since no-ops perform no useful task but still occupy memory and pipeline slots, they resemble bubbles in a water pipe; hence the name bubble insertion.
Insertion of bubbles in a pipeline implies
o reduced throughput
o degraded performance, particularly when more than 2 bubbles are inserted.
So bubble insertion should be minimized. It can be minimized by relocating a useful instruction of the program between the data-dependent instructions instead of inserting bubbles.
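
The number of bubbles needed without data forwarding follows from the cycle numbers in the example above. The Python sketch below assumes the 5-stage pipeline of these notes, with register read in stage 2 and writeback in stage 5; the function name and the `same_cycle_write_read` flag are illustrative.

```python
def bubbles_needed(distance, same_cycle_write_read=False):
    """Bubbles required between a producer and a consumer `distance` instructions later.

    Relative to the producer's fetch cycle c, the write completes in cycle c + 4.
    The consumer would naturally read its registers in cycle c + distance + 1.
    """
    earliest_read = 4 if same_cycle_write_read else 5
    natural_read = distance + 1
    return max(0, earliest_read - natural_read)


# Producer fetched in cycle 2 writes $8 in cycle 6; the next instruction
# would read $8 in cycle 4 but may only do so from cycle 7 onward.
print(bubbles_needed(1), bubbles_needed(1, same_cycle_write_read=True))
```

Placing independent useful instructions between the producer and the consumer increases `distance` and so reduces the number of bubbles, which is the reordering idea described above.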
DATA FORWARDING:
Data forwarding is the technique of bypassing the ALU output of the 1st instruction directly to the ALU input where the 2nd instruction needs it, without first waiting for the value to be written into the destination register. Please see the diagrams below for a clearer understanding.
Control dependency:
When a conditional branch is executed, the location of the next instruction depends on whether the branch condition is satisfied. Since branch decisions are based on testing register contents, the branch condition is resolved at the end of the 2nd pipeline stage. Therefore a bubble is required after every conditional branch instruction.

[Figure: pipeline diagram over Cycle 1 to Cycle 8 for the sequence
$5 = $6 + $7
$8 = $8 + $6
$9 = $8 + $2
sw $9, 0($3)
with data forwarding paths marked between dependent instructions.]

Read-after-write data dependency and its possible resolution through data forwarding.

[Figure: pipeline diagram over Cycle 1 to Cycle 9 for Instr 1 to Instr 5, where Instr 2 writes into $8 and Instr 3 reads from $8, with bubbles inserted. Without data forwarding, three bubbles are needed to resolve a read-after-write data dependency; two bubbles, if we assume that a register can be updated and read from in one cycle.]

[Figure: pipeline diagram over Cycle 1 to Cycle 8 for the sequence
sw $6, . . .
lw $8, . . .
$9 = $8 + $2
annotated "Reorder?" and "Insert bubble?". Without data forwarding, three (two) bubbles are needed to resolve a read-after-load data dependency.]

Read-after-load data dependency and its possible resolution through bubble insertion and data forwarding.

[Figure: pipeline diagram over Cycle 1 to Cycle 8 for the sequence
$6 = $3 + $5
beq $1, $2, . . .
$9 = $8 + $2
annotated "Insert bubble?", "Reorder? (delayed branch)", "Assume branch resolved here", and "Here would need 1-2 more bubbles".]

Control dependency due to conditional branch.

13. PIPELINE TIMING AND PERFORMANCE (Refer page no. 284 in text book B.Parhami)

14. PIPELINED DATA PATH DESIGN (Refer page no. 285-286 for detailed description of each stage in
text book B.Parhami)

The pipelined datapath for MicroMIPS is obtained by inserting latches or registers in single-cycle data path.
The 5 pipeline stages are
1. Instruction Fetch
2. Instruction Decode and register access
3. ALU operation
4. Data memory access
5. Register writeback
[Figure: pipelined data path for MicroMIPS, divided into Stage 1 to Stage 5. Blocks: PC, incrementer, instruction cache, register file, SE (sign extension), ALU, data cache, next-address block. Control signals: SeqInst, Br&Jump, RegDst, RegWrite, ALUSrc, ALUFunc, DataRead, DataWrite, RetAddr, RegInSrc. Status output: ALUOvfl.]

15. PIPELINED CONTROL (Refer page no. 289 in text book B.Parhami)

16. OPTIMAL PIPELINING (Refer page no. 291 in text book B.Parhami)

V. PIPELINE PERFORMANCE
17. DATA DEPENDENCIES AND HAZARDS
Data dependency in a pipeline: the execution of one instruction depends on the completion of a previous instruction; that is, one instruction requires data generated by a previous instruction.
The generated data may reside in a register or memory location where the subsequent instruction expects to find the value.
In the diagram below, each of the 2nd through 5th instructions reads a register ($2) written into by the 1st instruction.
o The 5th instruction needs the content of register $2 after the register writeback by the 1st instruction has completed, so it poses no problem.
o The 4th instruction needs the new content of register $2 in the same cycle in which the 1st instruction writes it, which results in a minor problem.
o The 2nd and 3rd instructions need the new content of register $2 before the 1st instruction has written it. This results in a major data dependency problem.
Data dependencies can cause pipeline stalls, which diminish performance.
Types of data dependency:
o Read-after-compute: register access after updating it with a computed value. This dependency exists when one instruction updates a register with a computed value and a subsequent instruction uses the content of that register as an operand.
o Read-after-load: register access after updating it with data from memory. This dependency arises when one instruction loads a new value from memory into a register and a subsequent instruction uses the content of that register as an operand.
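
Detecting and classifying these dependencies in an instruction sequence can be sketched in Python. The `Instr` record and the `find_dependencies` function are hypothetical; the window of 3 reflects the observation above that only the three instructions following a producer can read the register too early or in the writeback cycle.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[int]        # register written, or None
    srcs: Tuple[int, ...]      # registers read
    is_load: bool = False


def find_dependencies(program, window=3):
    """Return (producer index, consumer index, kind) for each dependency."""
    found = []
    for i, producer in enumerate(program):
        if producer.dest is None:
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            consumer = program[j]
            if producer.dest in consumer.srcs:
                kind = "read-after-load" if producer.is_load else "read-after-compute"
                found.append((i, j, kind))
            if consumer.dest == producer.dest:
                break  # register overwritten; later readers depend on the new writer
    return found


program = [
    Instr("$5 = $6 + $7", 5, (6, 7)),
    Instr("$8 = $8 + $6", 8, (8, 6)),
    Instr("$9 = $8 + $2", 9, (8, 2)),
    Instr("sw $9, 0($3)", None, (9, 3)),
]
print(find_dependencies(program))
```

A hazard detector in hardware performs the same comparison of destination and source register indices, but on the instructions currently held in the pipeline stages.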

[Figure: pipeline diagram over Cycle 1 to Cycle 9 showing $2 = $1 - $3 followed by four instructions that read register $2.]

(SINCE THE TOPICS BELOW ARE CLEAR AND READABLE IN THE BOOK, PLEASE REFER TO PAGE NO. 298-308 IN TEXT BOOK B.PARHAMI)

18. DATA FORWARDING:

Resolving Data Dependencies via Forwarding: When a previous instruction writes back a value
computed by the ALU into a register, the data dependency can always be resolved through forwarding
[Figure: pipeline diagram over Cycle 1 to Cycle 9 showing $2 = $1 - $3 followed by instructions that read register $2, with forwarding paths from the producing instruction to the later ones.]

Certain Data Dependencies Lead to Bubbles: When the immediately preceding instruction writes a value
read out from the data memory into a register, the data dependency cannot be resolved through forwarding
(i.e., we cannot go back in time) and a bubble must be inserted in the pipeline.

[Figure: pipeline diagram over Cycle 1 to Cycle 9 showing lw $2,4($12) followed by instructions that read register $2.]

19. PIPELINE BRANCH HAZARDS


Software-based solutions
o Compiler inserts a no-op after every branch (simple, but wasteful)
o Branch is redefined to take effect after the instruction that follows it
o Branch delay slot(s) are filled with useful instructions via reordering
Hardware-based solutions
o Mechanism similar to data hazard detector to flush the pipeline
o Constitutes a rudimentary form of branch prediction:
   o Always predict that the branch is not taken, flush if mistaken
   o More elaborate branch prediction strategies possible
20. DELAYED BRANCH AND BRANCH PREDICTION
Predicting whether a branch will be taken
o Always predict that the branch will not be taken
o Use program context to decide (backward branch is likely taken, forward branch is likely not taken)
o Allow programmer or compiler to supply clues
o Decide based on past history (maintain a small history table); to be discussed later
o Apply a combination of factors: modern processors use elaborate techniques due to deep pipelines

Problem with this approach:
Each branch in a loop entails two mispredictions:
1. Once in the first iteration (loop is repeated, but the history indicates exit from loop)
2. Once in the last iteration (when loop is terminated, but history indicates repetition)
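
The two mispredictions per loop can be reproduced with a short simulation. The Python sketch below assumes that "this approach" keeps a single history bit per branch (predict whatever the branch did last time). For comparison it also includes a 2-bit saturating counter, a standard scheme that is not necessarily one of the algorithms drawn in these notes.

```python
def one_bit_mispredictions(outcomes, last_taken=False):
    """Predict that the branch repeats its previous outcome."""
    misses = 0
    for taken in outcomes:
        if taken != last_taken:
            misses += 1
        last_taken = taken
    return misses


def two_bit_mispredictions(outcomes, counter=3):
    """2-bit saturating counter: predict taken when counter is 2 or 3."""
    misses = 0
    for taken in outcomes:
        if (counter >= 2) != taken:
            misses += 1
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
    return misses


# Loop-closing branch of a 4-iteration loop: taken 3 times, then not taken.
# The loop itself is executed 3 times.
outcomes = [True, True, True, False] * 3
print(one_bit_mispredictions(outcomes), two_bit_mispredictions(outcomes))
```

With one history bit the branch is mispredicted twice per loop execution, as stated above; the 2-bit counter needs two consecutive wrong guesses before it changes its prediction, so only the loop exit is mispredicted.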

Other branch prediction algorithms:

[Figure: state diagrams of branch prediction schemes using the four states "Predict taken", "Predict taken again", "Predict not taken", and "Predict not taken again", with transitions labelled "Taken" and "Not taken".]

Hardware Implementation of Branch Prediction


The mapping scheme used to go from PC contents to a table entry is the same as that used in direct-mapped caches
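
A minimal Python sketch of such a table is given below. The class name, the table size, and the choice to drop the two low-order PC bits before indexing (instructions are word aligned) are assumptions made for illustration; each entry holds a branch address, its target address, and one history bit.

```python
class BranchTable:
    """Direct-mapped table of recent branches, indexed by low-order PC bits."""

    def __init__(self, index_bits=4):
        self.index_bits = index_bits
        self.entries = [None] * (1 << index_bits)

    def _index(self, pc):
        # Skip the 2 byte-offset bits, then keep the low-order index bits.
        return (pc >> 2) & ((1 << self.index_bits) - 1)

    def update(self, pc, target, taken):
        """Record the outcome of the branch at address pc."""
        self.entries[self._index(pc)] = (pc, target, taken)

    def next_pc(self, pc):
        """Use the stored target only if the entry matches pc and predicts taken."""
        entry = self.entries[self._index(pc)]
        if entry is not None and entry[0] == pc and entry[2]:
            return entry[1]
        return pc + 4


table = BranchTable()
table.update(0x00400020, 0x00400000, taken=True)
print(hex(table.next_pc(0x00400020)), hex(table.next_pc(0x00400060)))
```

The second lookup maps to the same table entry as the first but fails the address comparison, so the incremented PC is used, just as a tag mismatch causes a miss in a direct-mapped cache.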

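[Figure: branch prediction table. Low-order bits from the PC are used as the index into a table holding addresses of recent branch instructions, target addresses, and history bit(s). The read-out table entry is compared with the PC, and logic driven by the comparison and the history bit(s) selects the next PC from the incremented PC (input 0) or the stored target address (input 1).]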

The Three Hardware Designs for MicroMIPS
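
The clock rates and CPI values quoted in these notes for the three designs (single-cycle: 125 MHz with CPI 1; multicycle: 500 MHz with CPI of about 4; pipelined: 500 MHz with CPI of about 1.1) can be compared with a short Python calculation. The approximate CPI values are treated as exact here, which is a simplification.

```python
# (clock rate in Hz, cycles per instruction) as quoted in the notes.
DESIGNS = {
    "single-cycle": (125e6, 1.0),
    "multicycle": (500e6, 4.0),
    "pipelined": (500e6, 1.1),
}


def time_per_instruction_ns(clock_hz, cpi):
    """Average execution time per instruction in nanoseconds."""
    return cpi / clock_hz * 1e9


for name, (clock_hz, cpi) in DESIGNS.items():
    print(name, round(time_per_instruction_ns(clock_hz, cpi), 2), "ns")
```

On these figures the single-cycle and multicycle designs need about the same average time per instruction, while the pipelined design is several times faster.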


[Figure: the single-cycle, multicycle, and pipelined data paths for MicroMIPS shown together, with their clock rates and CPI:
Single-cycle: 125 MHz, CPI = 1
Multicycle: 500 MHz, CPI approx. 4
Pipelined: 500 MHz, CPI approx. 1.1]

21. ADVANCED PIPELINING (Refer page no. 306-308 in text book B.Parhami)