Myoungsoo Jung, Assistant Professor, Department of Electrical Engineering, University of Texas at Dallas
software
instruction set
hardware
Lasts through many generations (portability)
Used in many different ways (generality)
Provides convenient functionality to higher levels
Permits an efficient implementation at lower levels
Programmable storage
2^32 x bytes
31 x 32-bit GPRs (R0=0)
32 x 32-bit FP regs (paired DP)
HI, LO, PC
Arithmetic logical
Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI, SLL, SRL, SRA, SLLV, SRLV, SRAV
Memory Access
LB, LBU, LH, LHU, LW, LWL, LWR, SB, SH, SW, SWL, SWR
Control
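Output a 1 on a sequence of 0,1,1, otherwise output 0
[Figure: state machine built from a latch and combinational logic; state diagram with states Beta, Gamma, Delta, each labeled state/output]
Mealy Machine
[Figure: state diagram with states Alpha, Beta, Delta; edges labeled input/output: 0/0, 1/0, 1/0, 0/0, 0/0, 1/1]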
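The Mealy machine for the 0,1,1 detector can be sketched in a few lines of Python. The state names Alpha, Beta and Delta come from the slide; the exact transition table is a reconstruction from the edge labels (three 0/0 edges, two 1/0 edges, one 1/1 edge) and should be read as an assumption, as should the helper name `run_mealy`.

```python
# Mealy machine sketch: the output depends on the current state AND the input.
# State names follow the slide; the transition table is a reconstruction.
TRANSITIONS = {
    # (state, input_bit): (next_state, output_bit)
    ("Alpha", 0): ("Beta", 0),   # saw a 0: possible start of 0,1,1
    ("Alpha", 1): ("Alpha", 0),  # still waiting for a 0
    ("Beta", 0): ("Beta", 0),    # another 0: the newest 0 is the candidate start
    ("Beta", 1): ("Delta", 0),   # saw 0,1
    ("Delta", 0): ("Beta", 0),   # saw 0,1,0: the last 0 restarts the pattern
    ("Delta", 1): ("Alpha", 1),  # saw 0,1,1: output 1
}


def run_mealy(bits, state="Alpha"):
    """Feed a sequence of input bits through the machine; return the outputs."""
    outputs = []
    for bit in bits:
        state, out = TRANSITIONS[(state, bit)]
        outputs.append(out)
    return outputs


print(run_mealy([0, 1, 1, 0, 1, 1, 1]))  # [0, 0, 1, 0, 0, 1, 0]
```

Because the output is attached to the transition rather than to the state, three states are enough here, one fewer than the state/output diagram needs.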
Microprogrammed Controllers
State machine in which part of state is a micro-pc.
Includes a ROM with microinstructions.
Explicit circuitry for incrementing or changing PC
Controlled logic implements at least branches and jumps
[Figure: PC (state w/ address) drives the ROM address input (Addr); a MUX under control logic selects between PC + 1 and the Branch field]
ROM (Instructions)
Addr  Instruction      Branch
0     forw 35          xxx
1     b_no_obstacles   000
2     back 10          xxx
3     rotate 90        xxx
4     goto             001
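A minimal sketch of how the micro-PC could step through this ROM. The ROM contents are taken from the slide; the tuple encoding, the function names `step` and `trace`, and the idea that an obstacle sensor is sampled once per microinstruction are all assumptions made for illustration.

```python
# Each ROM word: (operation, operand, branch_target).
# None models the "xxx" don't-care branch field and unused operands.
ROM = [
    ("forw", 35, None),
    ("b_no_obstacles", None, 0b000),
    ("back", 10, None),
    ("rotate", 90, None),
    ("goto", None, 0b001),
]


def step(pc, obstacle):
    """Return the next micro-PC: the MUX picks the branch target or pc + 1."""
    op, _operand, target = ROM[pc]
    if op == "goto":
        return target                      # unconditional jump
    if op == "b_no_obstacles" and not obstacle:
        return target                      # conditional branch taken
    return pc + 1                          # default: increment


def trace(obstacle_readings, pc=0):
    """Execute one microinstruction per sensor reading; return visited addresses."""
    visited = []
    for obstacle in obstacle_readings:
        visited.append(pc)
        pc = step(pc, obstacle)
    return visited


# Path is clear for three steps, then an obstacle appears at address 1.
print(trace([False, False, False, True, False, False, False]))
# [0, 1, 0, 1, 2, 3, 4]
```

While no obstacle is seen the controller loops between addresses 0 and 1; once one is seen it falls through to back up, rotate, and jump back to the test at address 1.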
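[Figure: bus-based datapath. Registers PC, A, B, MAR, IR with load signals LdA, LdB, LdMAR, LdIR; ALU (func); register file Regs (regno, WrREG, Din, Dout); Memory (Addr, WrMEM, Din, Dout); bus drivers DrPC, DrALU, DrREG, DrMEM]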
[Figure: datapath with PC, adders (+1), instruction memory, register file (RF), sign extend (SE), multiplexers, ALU, data memory]
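Pipelined
[Figure: pipelined datapath with PC, adders (+1), instruction memory, BEQ logic, register file (RF), sign extend (SE), multiplexers, ALU, data memory, divided into stages IF, ID, EX, MEM, WB]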
Is it faster?
Latch at end of each stage adds latency
Longest stage determines clock cycle time
Example:
Stage         IF       ID       EX       MEM      WB
Delay         1.0 ns   0.6 ns   0.9 ns   1.2 ns   0.4 ns

Design        1-stage        5-stage
Cycle time    4.1 ns (sum)   1.2 ns (max)
# of cycles   1              5
Inst Latency  4.1 ns         6.0 ns
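The numbers in the example can be reproduced with a short calculation. This is a sketch that, like the table, ignores the extra delay of the pipeline latches themselves; the function names are hypothetical.

```python
# Stage delays in nanoseconds, taken from the example above.
STAGE_NS = {"IF": 1.0, "ID": 0.6, "EX": 0.9, "MEM": 1.2, "WB": 0.4}


def one_stage_design(stages):
    """All work happens in one long cycle: cycle time is the sum of the delays."""
    cycle_time = sum(stages.values())
    cycles = 1
    return cycle_time, cycles, cycle_time * cycles


def pipelined_design(stages):
    """One stage per cycle: the slowest stage sets the clock (latch delay ignored)."""
    cycle_time = max(stages.values())
    cycles = len(stages)
    return cycle_time, cycles, cycle_time * cycles


for name, design in (("1-stage", one_stage_design), ("5-stage", pipelined_design)):
    cycle_time, cycles, latency = design(STAGE_NS)
    print(f"{name}: cycle {cycle_time:.1f} ns, {cycles} cycle(s), latency {latency:.1f} ns")
```

The latency of a single instruction gets worse (6.0 ns instead of 4.1 ns) because every stage is stretched to the length of the slowest one.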
Pipeline Cycle 1
[Figure: pipelined datapath with stages IF, ID, EX, MEM, WB; instructions labeled in the figure: lw R5,X(R0) and lw R6,X(R0)]
Pipeline Cycle 2
[Figure: pipelined datapath with stages IF, ID, EX, MEM, WB; instructions labeled in the figure: lw R5,X(R0), lw R6,X(R0) and sw R1,X(R0)]
Pipeline Cycle 3
[Figure: pipelined datapath with stages IF, ID, EX, MEM, WB; instructions labeled in the figure: lw R5,X(R0), lw R6,X(R0), sw R1,X(R0) and lw R1,Y(R0)]
[Figure: pipeline timing diagram. Instruction order runs down the vertical axis; each of four instructions passes through Ifetch, Reg, ALU, DMem, Reg]
Latch at end of each stage adds latency
Longest stage determines clock cycle time
Example:
Stage         IF       ID       EX       MEM      WB
Delay         1.0 ns   0.6 ns   0.9 ns   1.2 ns   0.4 ns

Design        1-cycle        Pipeline
Cycle time    4.1 ns (sum)   1.2 ns (max)
Inst/Cycle    1              1
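Both designs finish one instruction per cycle, so the gain comes entirely from the shorter cycle. A rough sketch of total run time for a stream of instructions, assuming no hazards or stalls and no latch overhead (neither assumption comes from the slide; the function names are hypothetical):

```python
# Stage delays in nanoseconds, in the order IF, ID, EX, MEM, WB.
STAGE_NS = [1.0, 0.6, 0.9, 1.2, 0.4]


def total_time_one_cycle(n_instr, stage_ns):
    """Each instruction occupies one long cycle equal to the sum of stage delays."""
    return n_instr * sum(stage_ns)


def total_time_pipelined(n_instr, stage_ns):
    """The first instruction takes len(stage_ns) cycles to drain through the
    pipeline; after that, one instruction completes every cycle."""
    if n_instr == 0:
        return 0.0
    return (n_instr + len(stage_ns) - 1) * max(stage_ns)


n = 1000
t_one = total_time_one_cycle(n, STAGE_NS)
t_pipe = total_time_pipelined(n, STAGE_NS)
print(f"1-cycle: {t_one:.1f} ns, pipeline: {t_pipe:.1f} ns, speedup {t_one / t_pipe:.2f}x")
```

For long instruction streams the speedup approaches 4.1 / 1.2, roughly 3.4x, rather than 5x, because the stages are not perfectly balanced.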
Architecture is an iterative process:
Searching the space of possible designs
At all levels of computer systems
Analysis
Creativity
Cost / Performance Analysis
Bad Ideas
Mediocre Ideas
Good Ideas
Limits to pipelining
Maintain the von Neumann illusion of one instruction at a time execution
Hazards prevent next instruction from executing during its designated clock cycle
Structural hazards: attempt to use the same hardware to do two different things at once
Data hazards: instruction depends on result of prior instruction still in the pipeline
Control hazards: caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
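As an illustration of the data hazard case, the sketch below scans a short program for read-after-write dependences between instructions that are close enough to overlap in the pipeline. The instruction sequence, the `Instr` type, and the `window` parameter (how many following instructions can still see a stale register value) are hypothetical; the right window depends on the pipeline and on whether it forwards results.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None (e.g. a store)
    srcs: Tuple[str, ...]     # registers read


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each
    read-after-write dependence within `window` instructions."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None or producer.dest == "R0":
            continue  # nothing written (R0 is hardwired to 0)
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].srcs:
                hazards.append((i, j, producer.dest))
            if program[j].dest == producer.dest:
                break  # register overwritten: later readers depend on j instead
    return hazards


program = [
    Instr("lw  R1, 0(R2)", "R1", ("R2",)),
    Instr("add R3, R1, R4", "R3", ("R1", "R4")),
    Instr("sub R5, R3, R1", "R5", ("R3", "R1")),
    Instr("sw  R5, 4(R2)", None, ("R5", "R2")),
]
for i, j, reg in raw_hazards(program):
    print(f"{program[j].text!r} needs {reg} from {program[i].text!r}")
```

A real pipeline would resolve each reported pair by stalling or by forwarding the result; this sketch only finds the dependences.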
Must disable parts of the system that are not being used
Clock Gating, Asynchronous Design, Low Voltage Swings
Progression of ILP
Full 32-bit processor fit on a chip => issue almost 1 IPC
Need to access memory 1+x times per cycle
Floating-Point unit on another chip
Cache controller a third, off-chip cache
1 board per processor multiprocessor systems
Processor and floating point unit on chip (and some cache)
Issuing only one instruction per cycle uses at most half
Fetch multiple instructions, issue couple
Grows from 2 to 4 to 8
How to manage dependencies among all these instructions?
Where does the parallelism come from?
Expose some of the ILP to compiler, allow it to schedule instructions to reduce dependences
VLIW
Modern ILP
Dynamically scheduled, out-of-order execution
Current microprocessors: 6-8 instructions per cycle
Pipelines are 10s of cycles deep, so many instructions are in execution at once
Unfortunately, hazards cause discarding of much work
What happens:
Grab a bunch of instructions, determine all their dependences, eliminate deps wherever possible, throw them all into the execution unit, let each one move forward as its dependences are resolved
Appears as if executed sequentially
On a trap or interrupt, capture the state of the machine between instructions perfectly
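A toy model of this idea: every instruction starts as soon as the instructions producing its source registers have finished, while results are committed in program order so the machine still appears sequential. The sketch assumes unlimited execution units, all instructions available at cycle 0, and that only true (read-after-write) dependences remain after renaming; the program and latencies are made up for illustration.

```python
def dataflow_schedule(program):
    """program: list of (dest, srcs, latency) tuples in program order.
    Returns one (start, finish, commit) cycle triple per instruction."""
    last_writer = {}   # register -> index of the latest instruction writing it
    schedule = []
    prev_commit = 0
    for dest, srcs, latency in program:
        # Wait only for true (RAW) dependences on earlier instructions.
        ready = max(
            (schedule[last_writer[r]][1] for r in srcs if r in last_writer),
            default=0,
        )
        start = ready
        finish = start + latency
        # In-order commit: never commit before the previous instruction.
        commit = max(finish, prev_commit)
        schedule.append((start, finish, commit))
        prev_commit = commit
        if dest is not None:
            last_writer[dest] = len(schedule) - 1
    return schedule


program = [
    ("R1", ("R2",), 3),        # long-latency load
    ("R3", ("R1",), 1),        # depends on the load
    ("R4", ("R5",), 1),        # independent: runs ahead of the two above
    ("R6", ("R4", "R3"), 1),   # needs both earlier results
]
print(dataflow_schedule(program))
# [(0, 3, 3), (3, 4, 4), (0, 1, 4), (4, 5, 5)]
```

The third instruction finishes at cycle 1, well before the instructions ahead of it, but its commit is held until cycle 4 so that a trap between instructions would see a consistent state.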
Huge complexity
Complexity of many components scales as n^2 (issue width)
Power consumption big problem
IBM Power 4
Combines: Superscalar and OOO
Properties:
8 execution units in out-of-order engine, each may issue an instruction each cycle.
In-order Instruction Fetch, Decode (compute dependencies)
Reordering for in-order commit
How do you determine which instructions to fetch when the ones before them haven't executed?
Branch prediction
Lots of clever machine structures to predict future based on history
Machinery to back out of mis-predictions
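One simple history-based structure is a table of 2-bit saturating counters indexed by branch address. The slides do not name a specific predictor, so this choice, the table size, and the class and function names are assumptions made for illustration.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries   # start at "weakly not taken"

    def predict(self, pc):
        return self.counters[pc % self.entries] >= 2

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(predictor, pc, outcomes):
    """Predict each outcome, then train on the real result; return the miss count."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch: taken nine times, then falls through once.
loop = [True] * 9 + [False]
predictor = TwoBitPredictor()
print(count_mispredictions(predictor, pc=0x40, outcomes=loop))  # 2 (cold start + loop exit)
print(count_mispredictions(predictor, pc=0x40, outcomes=loop))  # 1 (loop exit only)
```

The second counter bit is what keeps the single not-taken outcome at loop exit from flipping the prediction for the next run of the loop. The sketch covers only prediction; the machinery for backing out of a mis-prediction is not modeled.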
speculative threads
What can hardware do to make programming (with performance) easier?
Vector processing
Each instruction processes many distinct data
Ex: MMX
Cluster
Many independent machines connected with general network
Communication through messages
Simultaneous Multithreading
Thread 1 Thread 2
Thread 3 Thread 4