
EE (CE) 6304 Computer Architecture Lecture #2 (8/28/13)

Myoungsoo Jung Assistant Professor Department of Electrical Engineering University of Texas at Dallas

The Instruction Set: a Critical Interface

software

instruction set

hardware

Properties of a good abstraction


-- Lasts through many generations (portability)
-- Used in many different ways (generality)
-- Provides convenient functionality to higher levels
-- Permits an efficient implementation at lower levels

Instruction Set Architecture


"... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." -- Amdahl, Blaauw, and Brooks, 1964

-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions

Example: MIPS R3000


[Register diagram: r0 through r31, PC, HI, LO]

Programmable storage: 2^32 bytes; 31 x 32-bit GPRs (R0 = 0); 32 x 32-bit FP regs (paired for DP); HI, LO, PC

Data types? Formats? Addressing modes?

Arithmetic logical
Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
SLL, SRL, SRA, SLLV, SRLV, SRAV

Memory Access
LB, LBU, LH, LHU, LW, LWL, LWR
SB, SH, SW, SWL, SWR

Control
J, JAL, JR, JALR
BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

32-bit instructions on word boundary

Building Hardware that Computes

Finite State Machines: implemented as combinational logic + latch

[Moore machine diagram: states Alpha, Beta, Gamma, Delta, each labeled with its output; transitions on inputs 0 and 1; realized as combinational logic feeding a latch]

Output a 1 on a sequence of 0,1,1; otherwise output 0

Mealy Machine

[Mealy machine diagram: states Alpha, Beta, Delta; transitions labeled input/output (0/0, 1/0, 1/1), with 1/1 on the transition that completes the 0,1,1 sequence]
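The 0,1,1 detector above can be sketched in software. The state names follow the Moore diagram; the transition table is a reconstruction for illustration, not taken from the slide:

```python
# Moore machine: output depends only on the current state.
# The transition table below is an assumed encoding of the 0,1,1 detector.
OUTPUT = {"Alpha": 0, "Beta": 0, "Gamma": 0, "Delta": 1}

# (state, input bit) -> next state: this is the "combinational logic"
NEXT = {
    ("Alpha", 0): "Beta",  ("Alpha", 1): "Alpha",  # saw "0"
    ("Beta",  0): "Beta",  ("Beta",  1): "Gamma",  # saw "0,1"
    ("Gamma", 0): "Beta",  ("Gamma", 1): "Delta",  # saw "0,1,1" -> output 1
    ("Delta", 0): "Beta",  ("Delta", 1): "Alpha",
}

def run(bits):
    state = "Alpha"            # the variable holding state plays the latch
    outputs = []
    for b in bits:
        state = NEXT[(state, b)]       # latch update each clock
        outputs.append(OUTPUT[state])  # Moore output of the new state
    return outputs

print(run([0, 1, 1, 0, 1, 1, 1]))  # prints: [0, 0, 1, 0, 0, 1, 0]
```

Each 1 in the output marks a cycle where the last three inputs were 0,1,1.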

State machine in which part of the state is a micro-PC; includes a ROM with microinstructions.
Explicit circuitry for incrementing or changing the micro-PC; the control logic implements at least branches and jumps.

Microprogrammed Controllers

[Diagram: micro-PC feeds a ROM of microinstructions; a MUX selects the next address from PC + 1 or a branch address; outputs drive the combinational logic / controlled machine]

Example microprogram (state w/ address, ROM contents):
0: forw 35 xxx
1: b_no_obstacles 000
2: back 10 xxx
3: rotate 90 xxx
4: goto 001
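The fetch/branch loop of such a controller can be sketched as follows. The ROM contents mirror the example program above, but the operation semantics (obstacle sensing, the robot moves) are illustrative assumptions:

```python
# Microprogram ROM: (opcode, argument, branch target)
ROM = [
    ("forw", 35, None),            # 0: move forward 35
    ("b_no_obstacles", None, 0),   # 1: if no obstacle, branch to address 0
    ("back", 10, None),            # 2: move back 10
    ("rotate", 90, None),          # 3: rotate 90 degrees
    ("goto", None, 1),             # 4: unconditional branch to address 1
]

def step(upc, obstacle):
    """One controller cycle: read ROM[upc], pick the next micro-PC via the MUX."""
    op, arg, target = ROM[upc]
    if op == "goto":
        return target                           # MUX selects branch address
    if op == "b_no_obstacles":
        return target if not obstacle else upc + 1
    return upc + 1                              # MUX selects micro-PC + 1

upc = 0
for obstacle in [False, True, False, False]:    # assumed sensor readings
    upc = step(upc, obstacle)
print("final micro-PC:", upc)  # prints: final micro-PC: 4
```

The MUX in the diagram is the `if`/`else` choosing between `upc + 1` and a branch target.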

Need for Speed

Multiple cycles per instruction


Simple datapath, write microcode for it
[Multicycle datapath diagram: PC, MAR, IR, A and B registers, ALU, register file (Regs), and Memory on a shared bus; load controls LdPC, LdA, LdB, LdMAR, LdIR; write controls WrREG, WrMEM; drive controls DrPC, DrALU, DrREG, DrMEM; ALU func and regno select inputs]

Takes several cycles to execute an instruction

Above datapath needs 3 cycles even for NOP

One instruction per cycle


We can get a CPI of 1 (for most instructions)
[Single-cycle datapath diagram: PC, PC + 1 and branch adders, instruction memory, register file, sign extend, ALU, data memory, with MUXes selecting inputs]

But clock cycle time is long

Must be long enough to complete even the most time-consuming instruction

Oil Transport Analogy


Plan A: Move 1,000,000 gallons of oil at a time
-- Buy a tanker ship, then repeat: fill with oil, sail for 30 days, empty, go back

Plan B: Move 100 gallons of oil at a time
-- Buy a speedboat, then repeat: take a barrel of oil, sail for 2 days, unload, go back

Plan C:

Pipelined
[Pipelined datapath diagram: five stages IF, ID, EX, MEM, WB separated by pipeline latches; PC, PC + 1 adder, branch adder, BEQ logic, instruction memory, register file, sign extend, ALU, data memory, MUXes]

Without pipeline: done in 1 cycle

[Single-cycle datapath diagram]
With pipeline, Cycle 1 (IF)

[Pipelined datapath diagram with the IF stage active]

Obtain instruction from program storage

With pipeline, Cycle 2 (ID)

[Pipelined datapath diagram with the ID stage active]

Determine required actions and instruction size; locate and obtain operand data

With pipeline, Cycle 3 (EX)

[Pipelined datapath diagram with the EX stage active]

Compute result value or status

With pipeline, Cycle 4 (MEM)

[Pipelined datapath diagram with the MEM stage active]

Deposit results in storage for later use

With pipeline, Cycle 5 (WB)

[Pipelined datapath diagram with the WB stage active]

Deposit results in storage for later use

Instruction takes 5 cycles now! Instruction latency is longer now.

Is it faster?

Latch at the end of each stage adds latency.
Longest stage determines the clock cycle time.

Example stage latencies:
IF      ID      EX      MEM     WB
1.0 ns  0.6 ns  0.9 ns  1.2 ns  0.4 ns

Design    Cycle time      # of cycles   Inst latency
1-stage   4.1 ns (sum)    1             4.1 ns
5-stage   1.2 ns (max)    5             6.0 ns

But we are after instruction throughput, not latency!!!
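The example numbers can be checked directly: the single-cycle design's cycle time is the sum of the stage latencies, while the pipelined design's is the maximum. A quick sketch using the figures above:

```python
# Stage latencies from the example, in nanoseconds
stages = {"IF": 1.0, "ID": 0.6, "EX": 0.9, "MEM": 1.2, "WB": 0.4}

single_cycle_time = sum(stages.values())    # one long cycle does everything
pipeline_cycle_time = max(stages.values())  # clock limited by the slowest stage

print(round(single_cycle_time, 1))          # prints: 4.1
print(round(5 * pipeline_cycle_time, 1))    # prints: 6.0  (5-cycle latency)
```

The pipelined instruction is individually slower (6.0 ns vs. 4.1 ns), which is exactly why throughput, not latency, is the right metric.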

Pipeline Cycle 1

Program:
lw  R6, X(R0)
sw  R1, X(R0)
lw  R1, Y(R0)
add R5, R6, R1
sw  R5, X(R0)

[Pipelined datapath diagram: lw R6,X(R0) in IF]

Pipeline Cycle 2

[Pipelined datapath diagram: lw R6,X(R0) in ID; sw R1,X(R0) in IF]

Pipeline Cycle 3

[Pipelined datapath diagram: lw R6,X(R0) in EX; sw R1,X(R0) in ID; lw R1,Y(R0) in IF]

Pipelined Instruction Execution

[Pipeline timing diagram: Cycle 1 through Cycle 7 on the horizontal axis, instructions in program order on the vertical axis; each instruction flows through Ifetch, Reg, ALU, DMem, Reg in successive cycles, with a new instruction starting every cycle]
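The overlap in the timing diagram can be reproduced with a small sketch: instruction i occupies stage s during cycle i + s, so once the pipeline fills, one instruction completes per cycle (stage names IF/ID/EX/MEM/WB assumed):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions):
    """One text row per instruction; one 4-character column per cycle."""
    rows = []
    n_cycles = n_instructions + len(STAGES) - 1  # cycles until the last WB
    for i in range(n_instructions):
        row = ["  . "] * n_cycles                # "." = stage idle for this instr
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:>4}"            # instr i is in stage s at cycle i+s
        rows.append("".join(row))
    return rows

for line in schedule(4):
    print(line)
```

Each printed row is one instruction; reading down any column shows all five stages busy with different instructions once the pipeline is full.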

Performance with Pipelining


We finish one instruction per cycle

After the initial warm-up period


Instruction throughput

Latch at the end of each stage adds latency.
Longest stage determines the clock cycle time.

Example stage latencies:
IF      ID      EX      MEM     WB
1.0 ns  0.6 ns  0.9 ns  1.2 ns  0.4 ns

Design     Cycle time      Inst/Cycle
1-cycle    4.1 ns (sum)    1
Pipeline   1.2 ns (max)    1

Speedup due to pipelining: 4.1 ns / 1.2 ns = 3.42. Note: ideally it would be 5.
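The 3.42 figure follows directly from the cycle times above; a one-line check:

```python
single_cycle = 4.1  # ns per instruction, 1-cycle design (sum of stages)
pipe_cycle = 1.2    # ns per instruction in steady state (longest stage)

speedup = single_cycle / pipe_cycle
print(round(speedup, 2))  # prints: 3.42

# Ideal speedup with 5 perfectly balanced stages and no latch overhead:
print(round(single_cycle / (single_cycle / 5), 2))  # prints: 5.0
```

The gap between 3.42 and 5 comes from unbalanced stages: the 1.2 ns MEM stage forces every stage to take 1.2 ns.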

Computer Architecture is Design and Analysis


Design

Architecture is an iterative process: Searching the space of possible designs At all levels of computer systems

Analysis

[Design-funnel diagram: creativity generates candidate designs; cost/performance analysis filters out bad ideas, then mediocre ideas, leaving good ideas]

Limits to pipelining
Maintain the von Neumann illusion of one-instruction-at-a-time execution.
Hazards prevent the next instruction from executing during its designated clock cycle:
-- Structural hazards: attempt to use the same hardware to do two different things at once
-- Data hazards: instruction depends on the result of a prior instruction still in the pipeline
-- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
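A data hazard can be flagged mechanically. The sketch below (with illustrative assumptions: instructions as register triples, a 2-instruction window, no forwarding) scans a short MIPS-like sequence for reads of a register written by a nearby earlier instruction:

```python
# (dest, src1, src2) for a short MIPS-like sequence (hypothetical example)
program = [
    ("R6", "R0", None),   # lw  R6, X(R0)
    ("R1", "R0", None),   # lw  R1, Y(R0)
    ("R5", "R6", "R1"),   # add R5, R6, R1  <- reads both loads' results
]

def raw_hazards(prog, window=2):
    """Flag read-after-write hazards: a source register written by one of the
    previous `window` instructions has not yet reached writeback."""
    hazards = []
    for i, (_, s1, s2) in enumerate(prog):
        for j in range(max(0, i - window), i):
            dest = prog[j][0]
            if dest in (s1, s2):
                hazards.append((j, i, dest))  # (producer, consumer, register)
    return hazards

print(raw_hazards(program))  # prints: [(0, 2, 'R6'), (1, 2, 'R1')]
```

Real pipelines resolve most of these with forwarding or a stall rather than rejecting the code; the sketch only shows where the dependences are.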

Power: too many things happening at once will melt your chip!

Must disable parts of the system that are not being used:
-- Clock gating, asynchronous design, low voltage swings, ...

Progression of ILP

1st generation RISC: pipelined
-- Full 32-bit processor fit on a chip => issue almost 1 IPC
-- Need to access memory 1+x times per cycle
-- Floating-point unit on another chip; cache controller a third; off-chip cache
-- 1 board per processor; multiprocessor systems

2nd generation: superscalar
-- Processor and floating-point unit on chip (and some cache)
-- Issuing only one instruction per cycle uses at most half
-- Fetch multiple instructions, issue a couple; grows from 2 to 4 to 8
-- How to manage dependencies among all these instructions? Where does the parallelism come from?

VLIW
-- Expose some of the ILP to the compiler, allow it to schedule instructions to reduce dependences

Modern ILP

Dynamically scheduled, out-of-order execution
-- Current microprocessors issue 6-8 instructions per cycle
-- Pipelines are 10s of cycles deep, so many instructions are in execution at once
-- Unfortunately, hazards cause much of that work to be discarded

What happens:
-- Grab a bunch of instructions, determine all their dependences, eliminate dependences wherever possible, throw them all into the execution unit, and let each one move forward as its dependences are resolved
-- Appears as if executed sequentially
-- On a trap or interrupt, capture the state of the machine between instructions perfectly

Huge complexity
-- Complexity of many components scales as n^2 (n = issue width)
-- Power consumption is a big problem

IBM Power 4

Combines superscalar and OOO. Properties:
-- 8 execution units in the out-of-order engine; each may issue an instruction each cycle
-- In-order instruction fetch and decode (compute dependencies)
-- Reordering for in-order commit

When all else fails - guess


Programs make decisions as they go
-- Conditionals, loops, calls translate into branches and jumps (1 of 5 instructions)

How do you determine which instructions to fetch when the ones before them haven't executed?
-- Branch prediction: lots of clever machine structures to predict the future based on history
-- Machinery to back out of mispredictions

Execute all the possible branches?
-- Likely to hit additional branches, perform stores
-- Speculative threads

What can hardware do to make programming (with performance) easier?
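One of the classic "clever machine structures" is the 2-bit saturating-counter predictor (a minimal sketch, not any specific machine's design): a table of counters indexed by branch PC, predicting taken when the counter is in the upper half, so a branch must mispredict twice in a row before the prediction flips.

```python
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [1] * entries  # start weakly not-taken (states 0..3)

    def predict(self, pc):
        return self.table[pc % self.entries] >= 2  # upper half -> taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0

# A loop branch taken 9 times, then not taken at loop exit:
bp = TwoBitPredictor()
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    hits += (bp.predict(0x400) == taken)  # 0x400: hypothetical branch PC
    bp.update(0x400, taken)
print(hits, "of", len(outcomes))  # prints: 8 of 10
```

The two misses are the warm-up and the loop exit; a 1-bit scheme would also mispredict the first iteration of every re-entry, which is exactly what the second bit fixes.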

Have we reached the end of ILP?


Multiple processors easily fit on a chip; every major microprocessor vendor has gone to multithreaded cores
-- Thread: a locus of control, an execution context
-- Fetch instructions from multiple threads at once, throw them all into the execution unit
-- Intel: hyperthreading
-- The concept has existed in high-performance computing for 20 years (or is it 40? CDC 6600)

Vector processing
-- Each instruction processes many distinct data items. Ex: MMX

Raise the level of architecture: many processors per chip
-- Tensilica Configurable Proc

Limiting Forces: Clock Speed and ILP


Chip density is continuing to increase ~2x every 2 years
-- Clock speed is not; the number of processors per chip (cores) may double instead
-- Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

There is little or no more instruction-level parallelism (ILP) to be found

Can no longer allow the programmer to think in terms of a serial programming model

Conclusion: parallelism must be exposed to software!

Examples of MIMD Machines


Symmetric Multiprocessor
-- Multiple processors in a box with shared-memory communication
-- Current multicore chips are like this
-- Every processor runs a copy of the OS

[Diagram: processors P on a shared bus to Memory; a mesh of P/M nodes with a Host]

Non-uniform shared memory with separate I/O through a host
-- Multiple processors, each with local memory, connected by a general scalable network
-- Extremely light OS on each node provides simple services: scheduling/synchronization
-- Network-accessible host for I/O

Cluster
-- Many independent machines connected with a general network
-- Communication through messages
Categories of Thread Execution

[Diagram: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; shading distinguishes Thread 1 through Thread 5 and idle slots]
