
EE (CE) 6304 Computer Architecture Lecture #2 (8/28/13)

Myoungsoo Jung Assistant Professor Department of Electrical Engineering University of Texas at Dallas

The Instruction Set: a Critical Interface

software

instruction set

hardware

Properties of a good abstraction


-- Lasts through many generations (portability)
-- Used in many different ways (generality)
-- Provides convenient functionality to higher levels
-- Permits an efficient implementation at lower levels

Instruction Set Architecture


"... the attributes of a [computing] system as seen by the programmer, i.e. the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation." -- Amdahl, Blaauw, and Brooks, 1964

-- Organization of Programmable Storage
-- Data Types & Data Structures: Encodings & Representations
-- Instruction Formats
-- Instruction (or Operation Code) Set
-- Modes of Addressing and Accessing Data Items and Instructions
-- Exceptional Conditions

Example: MIPS R3000


[Register diagram: r0 through r31, PC, HI, LO]

Programmable storage: 2^32 bytes; 31 x 32-bit GPRs (R0 = 0); 32 x 32-bit FP regs (paired for DP); HI, LO, PC

Data types? Formats? Addressing modes?

Arithmetic logical
Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU, AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
SLL, SRL, SRA, SLLV, SRLV, SRAV

Memory Access
LB, LBU, LH, LHU, LW, LWL, LWR
SB, SH, SW, SWL, SWR

Control
J, JAL, JR, JALR
BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

32-bit instructions on word boundary

Building Hardware that Computes

Finite State Machines: implemented as combinational logic + latch

[Moore machine diagram: states Alpha, Beta, Gamma, Delta, each labeled with its output; transitions on inputs 0 and 1; realized as combinational logic feeding a latch]

Output a 1 on a sequence of 0,1,1; otherwise output 0

Mealy Machine

[Mealy machine diagram: states Alpha, Beta, Delta; transitions labeled input/output (0/0, 1/0, 1/1), with 1/1 on the transition that completes the 0,1,1 sequence]
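The 0,1,1 detector above can be sketched in software. The state names follow the Moore diagram; the transition table is a reconstruction for illustration, not taken from the slide:

```python
# Moore machine: output depends only on the current state.
# The transition table below is an assumed encoding of the 0,1,1 detector.
OUTPUT = {"Alpha": 0, "Beta": 0, "Gamma": 0, "Delta": 1}

# (state, input bit) -> next state: this is the "combinational logic"
NEXT = {
    ("Alpha", 0): "Beta",  ("Alpha", 1): "Alpha",  # saw "0"
    ("Beta",  0): "Beta",  ("Beta",  1): "Gamma",  # saw "0,1"
    ("Gamma", 0): "Beta",  ("Gamma", 1): "Delta",  # saw "0,1,1" -> output 1
    ("Delta", 0): "Beta",  ("Delta", 1): "Alpha",
}

def run(bits):
    state = "Alpha"            # the variable holding state plays the latch
    outputs = []
    for b in bits:
        state = NEXT[(state, b)]       # latch update each clock
        outputs.append(OUTPUT[state])  # Moore output of the new state
    return outputs

print(run([0, 1, 1, 0, 1, 1, 1]))  # prints: [0, 0, 1, 0, 0, 1, 0]
```

Each 1 in the output marks a cycle where the last three inputs were 0,1,1.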

State machine in which part of the state is a micro-PC; includes a ROM with microinstructions.
Explicit circuitry for incrementing or changing the micro-PC; the control logic implements at least branches and jumps.

Microprogrammed Controllers

[Diagram: micro-PC feeds a ROM of microinstructions; a MUX selects the next address from PC + 1 or a branch address; outputs drive the combinational logic / controlled machine]

Example microprogram (state w/ address, ROM contents):
0: forw 35 xxx
1: b_no_obstacles 000
2: back 10 xxx
3: rotate 90 xxx
4: goto 001
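The fetch/branch loop of such a controller can be sketched as follows. The ROM contents mirror the example program above, but the operation semantics (obstacle sensing, the robot moves) are illustrative assumptions:

```python
# Microprogram ROM: (opcode, argument, branch target)
ROM = [
    ("forw", 35, None),            # 0: move forward 35
    ("b_no_obstacles", None, 0),   # 1: if no obstacle, branch to address 0
    ("back", 10, None),            # 2: move back 10
    ("rotate", 90, None),          # 3: rotate 90 degrees
    ("goto", None, 1),             # 4: unconditional branch to address 1
]

def step(upc, obstacle):
    """One controller cycle: read ROM[upc], pick the next micro-PC via the MUX."""
    op, arg, target = ROM[upc]
    if op == "goto":
        return target                           # MUX selects branch address
    if op == "b_no_obstacles":
        return target if not obstacle else upc + 1
    return upc + 1                              # MUX selects micro-PC + 1

upc = 0
for obstacle in [False, True, False, False]:    # assumed sensor readings
    upc = step(upc, obstacle)
print("final micro-PC:", upc)  # prints: final micro-PC: 4
```

The MUX in the diagram is the `if`/`else` choosing between `upc + 1` and a branch target.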

Need for Speed

Multiple cycles per instruction


Simple datapath, write microcode for it
[Multicycle datapath diagram: PC, MAR, IR, A and B registers, ALU, register file (Regs), and Memory on a shared bus; load controls LdPC, LdA, LdB, LdMAR, LdIR; write controls WrREG, WrMEM; drive controls DrPC, DrALU, DrREG, DrMEM; ALU func and regno select inputs]

Takes several cycles to execute an instruction

Above datapath needs 3 cycles even for NOP

One instruction per cycle


We can get a CPI of 1 (for most instructions)
[Single-cycle datapath diagram: PC, PC + 1 and branch adders, instruction memory, register file, sign extend, ALU, data memory, with MUXes selecting inputs]

But clock cycle time is long

Must be long enough to complete even the most time-consuming instruction

Oil Transport Analogy


Plan A: Move 1,000,000 gallons of oil at a time
-- Buy a tanker ship, then repeat: fill with oil, sail for 30 days, empty, go back

Plan B: Move 100 gallons of oil at a time
-- Buy a speedboat, then repeat: take a barrel of oil, sail for 2 days, unload, go back

Plan C:

Pipelined
[Pipelined datapath diagram: five stages IF, ID, EX, MEM, WB separated by pipeline latches; PC, PC + 1 adder, branch adder, BEQ logic, instruction memory, register file, sign extend, ALU, data memory, MUXes]

Without pipeline: done in 1 cycle

[Single-cycle datapath diagram]
With pipeline, Cycle 1 (IF)

[Pipelined datapath diagram with the IF stage active]

Obtain instruction from program storage

With pipeline, Cycle 2 (ID)

[Pipelined datapath diagram with the ID stage active]

Determine required actions and instruction size; locate and obtain operand data

With pipeline, Cycle 3 (EX)

[Pipelined datapath diagram with the EX stage active]

Compute result value or status

With pipeline, Cycle 4 (MEM)

[Pipelined datapath diagram with the MEM stage active]

Deposit results in storage for later use

With pipeline, Cycle 5 (WB)

[Pipelined datapath diagram with the WB stage active]

Deposit results in storage for later use

Instruction takes 5 cycles now! Instruction latency is longer now.

Is it faster?

Latch at the end of each stage adds latency.
Longest stage determines the clock cycle time.

Example stage latencies:
IF      ID      EX      MEM     WB
1.0 ns  0.6 ns  0.9 ns  1.2 ns  0.4 ns

Design    Cycle time      # of cycles   Inst latency
1-stage   4.1 ns (sum)    1             4.1 ns
5-stage   1.2 ns (max)    5             6.0 ns

But we are after instruction throughput, not latency!!!
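The example numbers can be checked directly: the single-cycle design's cycle time is the sum of the stage latencies, while the pipelined design's is the maximum. A quick sketch using the figures above:

```python
# Stage latencies from the example, in nanoseconds
stages = {"IF": 1.0, "ID": 0.6, "EX": 0.9, "MEM": 1.2, "WB": 0.4}

single_cycle_time = sum(stages.values())    # one long cycle does everything
pipeline_cycle_time = max(stages.values())  # clock limited by the slowest stage

print(round(single_cycle_time, 1))          # prints: 4.1
print(round(5 * pipeline_cycle_time, 1))    # prints: 6.0  (5-cycle latency)
```

The pipelined instruction is individually slower (6.0 ns vs. 4.1 ns), which is exactly why throughput, not latency, is the right metric.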

Pipeline Cycle 1

Program:
lw  R6, X(R0)
sw  R1, X(R0)
lw  R1, Y(R0)
add R5, R6, R1
sw  R5, X(R0)

[Pipelined datapath diagram: lw R6,X(R0) in IF]

Pipeline Cycle 2

[Pipelined datapath diagram: lw R6,X(R0) in ID; sw R1,X(R0) in IF]

Pipeline Cycle 3

[Pipelined datapath diagram: lw R6,X(R0) in EX; sw R1,X(R0) in ID; lw R1,Y(R0) in IF]

Pipelined Instruction Execution

[Pipeline timing diagram: Cycle 1 through Cycle 7 on the horizontal axis, instructions in program order on the vertical axis; each instruction flows through Ifetch, Reg, ALU, DMem, Reg in successive cycles, with a new instruction starting every cycle]
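The overlap in the timing diagram can be reproduced with a small sketch: instruction i occupies stage s during cycle i + s, so once the pipeline fills, one instruction completes per cycle (stage names IF/ID/EX/MEM/WB assumed):

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions):
    """One text row per instruction; one 4-character column per cycle."""
    rows = []
    n_cycles = n_instructions + len(STAGES) - 1  # cycles until the last WB
    for i in range(n_instructions):
        row = ["  . "] * n_cycles                # "." = stage idle for this instr
        for s, name in enumerate(STAGES):
            row[i + s] = f"{name:>4}"            # instr i is in stage s at cycle i+s
        rows.append("".join(row))
    return rows

for line in schedule(4):
    print(line)
```

Each printed row is one instruction; reading down any column shows all five stages busy with different instructions once the pipeline is full.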

Performance with Pipelining


We finish one instruction per cycle

After the initial warm-up period


Instruction throughput

Latch at the end of each stage adds latency.
Longest stage determines the clock cycle time.

Example stage latencies:
IF      ID      EX      MEM     WB
1.0 ns  0.6 ns  0.9 ns  1.2 ns  0.4 ns

Design     Cycle time      Inst/Cycle
1-cycle    4.1 ns (sum)    1
Pipeline   1.2 ns (max)    1

Speedup due to pipelining: 4.1 ns / 1.2 ns = 3.42. Note: ideally it would be 5.
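The 3.42 figure follows directly from the cycle times above; a one-line check:

```python
single_cycle = 4.1  # ns per instruction, 1-cycle design (sum of stages)
pipe_cycle = 1.2    # ns per instruction in steady state (longest stage)

speedup = single_cycle / pipe_cycle
print(round(speedup, 2))  # prints: 3.42

# Ideal speedup with 5 perfectly balanced stages and no latch overhead:
print(round(single_cycle / (single_cycle / 5), 2))  # prints: 5.0
```

The gap between 3.42 and 5 comes from unbalanced stages: the 1.2 ns MEM stage forces every stage to take 1.2 ns.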

Computer Architecture is Design and Analysis


Design

Architecture is an iterative process: Searching the space of possible designs At all levels of computer systems

Analysis

[Design-funnel diagram: creativity generates candidate designs; cost/performance analysis filters out bad ideas, then mediocre ideas, leaving good ideas]

Limits to pipelining
Maintain the von Neumann illusion of one-instruction-at-a-time execution.
Hazards prevent the next instruction from executing during its designated clock cycle:
-- Structural hazards: attempt to use the same hardware to do two different things at once
-- Data hazards: instruction depends on the result of a prior instruction still in the pipeline
-- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
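A data hazard can be flagged mechanically. The sketch below (with illustrative assumptions: instructions as register triples, a 2-instruction window, no forwarding) scans a short MIPS-like sequence for reads of a register written by a nearby earlier instruction:

```python
# (dest, src1, src2) for a short MIPS-like sequence (hypothetical example)
program = [
    ("R6", "R0", None),   # lw  R6, X(R0)
    ("R1", "R0", None),   # lw  R1, Y(R0)
    ("R5", "R6", "R1"),   # add R5, R6, R1  <- reads both loads' results
]

def raw_hazards(prog, window=2):
    """Flag read-after-write hazards: a source register written by one of the
    previous `window` instructions has not yet reached writeback."""
    hazards = []
    for i, (_, s1, s2) in enumerate(prog):
        for j in range(max(0, i - window), i):
            dest = prog[j][0]
            if dest in (s1, s2):
                hazards.append((j, i, dest))  # (producer, consumer, register)
    return hazards

print(raw_hazards(program))  # prints: [(0, 2, 'R6'), (1, 2, 'R1')]
```

Real pipelines resolve most of these with forwarding or a stall rather than rejecting the code; the sketch only shows where the dependences are.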

Power: too many things happening at once will melt your chip!

Must disable parts of the system that are not being used:
-- Clock gating, asynchronous design, low voltage swings, ...

Progression of ILP

1st generation RISC: pipelined
-- Full 32-bit processor fit on a chip => issue almost 1 IPC
-- Need to access memory 1+x times per cycle
-- Floating-point unit on another chip; cache controller a third; off-chip cache
-- 1 board per processor; multiprocessor systems

2nd generation: superscalar
-- Processor and floating-point unit on chip (and some cache)
-- Issuing only one instruction per cycle uses at most half
-- Fetch multiple instructions, issue a couple; grows from 2 to 4 to 8
-- How to manage dependencies among all these instructions? Where does the parallelism come from?

VLIW
-- Expose some of the ILP to the compiler, allow it to schedule instructions to reduce dependences

Modern ILP

Dynamically scheduled, out-of-order execution
-- Current microprocessors issue 6-8 instructions per cycle
-- Pipelines are 10s of cycles deep, so many instructions are in execution at once
-- Unfortunately, hazards cause much of that work to be discarded

What happens:
-- Grab a bunch of instructions, determine all their dependences, eliminate dependences wherever possible, throw them all into the execution unit, and let each one move forward as its dependences are resolved
-- Appears as if executed sequentially
-- On a trap or interrupt, capture the state of the machine between instructions perfectly

Huge complexity
-- Complexity of many components scales as n^2 (n = issue width)
-- Power consumption is a big problem

IBM Power 4

Combines superscalar and OOO. Properties:
-- 8 execution units in the out-of-order engine; each may issue an instruction each cycle
-- In-order instruction fetch and decode (compute dependencies)
-- Reordering for in-order commit

When all else fails - guess


Programs make decisions as they go
-- Conditionals, loops, calls translate into branches and jumps (1 of 5 instructions)

How do you determine which instructions to fetch when the ones before them haven't executed?
-- Branch prediction: lots of clever machine structures to predict the future based on history
-- Machinery to back out of mispredictions

Execute all the possible branches?
-- Likely to hit additional branches, perform stores
-- Speculative threads

What can hardware do to make programming (with performance) easier?
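One of the classic "clever machine structures" is the 2-bit saturating-counter predictor (a minimal sketch, not any specific machine's design): a table of counters indexed by branch PC, predicting taken when the counter is in the upper half, so a branch must mispredict twice in a row before the prediction flips.

```python
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [1] * entries  # start weakly not-taken (states 0..3)

    def predict(self, pc):
        return self.table[pc % self.entries] >= 2  # upper half -> taken

    def update(self, pc, taken):
        i = pc % self.entries
        if taken:
            self.table[i] = min(3, self.table[i] + 1)  # saturate at 3
        else:
            self.table[i] = max(0, self.table[i] - 1)  # saturate at 0

# A loop branch taken 9 times, then not taken at loop exit:
bp = TwoBitPredictor()
outcomes = [True] * 9 + [False]
hits = 0
for taken in outcomes:
    hits += (bp.predict(0x400) == taken)  # 0x400: hypothetical branch PC
    bp.update(0x400, taken)
print(hits, "of", len(outcomes))  # prints: 8 of 10
```

The two misses are the warm-up and the loop exit; a 1-bit scheme would also mispredict the first iteration of every re-entry, which is exactly what the second bit fixes.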

Have we reached the end of ILP?


Multiple processors easily fit on a chip; every major microprocessor vendor has gone to multithreaded cores
-- Thread: a locus of control, an execution context
-- Fetch instructions from multiple threads at once, throw them all into the execution unit
-- Intel: hyperthreading
-- The concept has existed in high-performance computing for 20 years (or is it 40? CDC 6600)

Vector processing
-- Each instruction processes many distinct data items. Ex: MMX

Raise the level of architecture: many processors per chip
-- Tensilica Configurable Proc

Limiting Forces: Clock Speed and ILP


Chip density is continuing to increase ~2x every 2 years
-- Clock speed is not; the number of processors per chip (cores) may double instead
-- Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)

There is little or no more instruction-level parallelism (ILP) to be found

Can no longer allow the programmer to think in terms of a serial programming model

Conclusion: parallelism must be exposed to software!

Examples of MIMD Machines


Symmetric Multiprocessor
-- Multiple processors in a box with shared-memory communication
-- Current multicore chips are like this
-- Every processor runs a copy of the OS

[Diagram: processors P on a shared bus to Memory; a mesh of P/M nodes with a Host]

Non-uniform shared memory with separate I/O through a host
-- Multiple processors, each with local memory, connected by a general scalable network
-- Extremely light OS on each node provides simple services: scheduling/synchronization
-- Network-accessible host for I/O

Cluster
-- Many independent machines connected with a general network
-- Communication through messages
Categories of Thread Execution

[Diagram: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; shading distinguishes Thread 1 through Thread 5 and idle slots]
