
CS 162 Computer Architecture

Lecture 2: Introduction & Pipelining

Instructor: L.N. Bhuyan


www.cs.ucr.edu/~bhuyan/cs162

1 1999 ©UCB
Review of Last Class

°MIPS Datapath
°Introduction to Pipelining
°Introduction to Instruction Level
Parallelism (ILP)
°Introduction to VLIW

What is Multiprocessing?

°Parallelism at the instruction level is
limited because of data dependences
=> Speedup is limited!!
°Abundant availability of program-level
parallelism, like loop-level parallelism
(e.g., DO I = 1, 1000). How about
employing multiple processors to
execute the loops => Parallel
processing or Multiprocessing
°With a billion transistors on a chip, we
can put a few CPUs in one chip =>
Chip multiprocessor
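The loop-level parallelism idea can be sketched as follows (my illustration, not the lecture's code; the chunking scheme and worker count are assumptions, and a thread pool merely stands in for the CPUs of a chip multiprocessor):

```python
# Sketch: split a DO I = 1, 1000 style loop across workers, the way a
# multiprocessor would assign iteration chunks to different CPUs.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(bounds):
    # One worker's share of the loop: sum i*i over [lo, hi).
    lo, hi = bounds
    return sum(i * i for i in range(lo, hi))

def parallel_loop(n=1000, workers=4):
    # Divide the iteration space [0, n) into one chunk per "CPU".
    step = n // workers
    chunks = [(k * step, n if k == workers - 1 else (k + 1) * step)
              for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunks))
```

The iterations carry no data dependences between chunks, which is exactly why loop-level parallelism scales where instruction-level parallelism does not.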
Memory Latency Problem

Even if we increase CPU power, memory is
the real bottleneck. Techniques to
alleviate the memory latency problem:
1. Memory hierarchy – Program locality,
cache memory, multiple levels, pages and
context switching
2. Prefetching – Get the instruction/data
before the CPU needs it. Good for
instructions because of sequential
locality, so all modern processors use
prefetch buffers for instructions. What
to do with data?
3. Multithreading – Can the CPU jump to
another program when accessing
memory? It’s like multiprogramming!!
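The payoff of technique 1 can be seen with the standard average-memory-access-time formula (the latency numbers below are illustrative assumptions, not from the slide):

```python
# Toy calculation: a cache hierarchy reduces average memory access time.
def amat(hit_time, miss_rate, miss_penalty):
    # AMAT = hit time + miss rate * miss penalty
    return hit_time + miss_rate * miss_penalty

# Assumed numbers: without a cache every access pays full main-memory
# latency (100 ns); with a 1 ns cache hitting 95% of the time, the
# average access is far cheaper.
no_cache = 100.0
with_cache = amat(1.0, 0.05, 100.0)   # 1 + 0.05*100 = 6 ns
```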
Hardware Multithreading
° We need to develop a hardware multithreading
technique because switching between threads in
software is very time-consuming (Why?), and so
not suitable for hiding main memory (as opposed
to I/O) access latency. Ex: Multitasking
° Provide multiple PCs and register sets on the CPU
so that thread switching can occur without having
to store the register contents in main memory
(on the stack, as is done for context switching).
° Several threads reside in the CPU simultaneously,
and execution switches between the threads on a
main memory access.
° How about both multiprocessing and
multithreading on a chip? => Network Processor
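A minimal model of the idea, assuming a simple round-robin switch policy (my sketch, not the design of any particular chip): each hardware thread keeps its own PC and register file on the CPU, so a switch is just selecting another resident context.

```python
# Each hardware thread owns a private PC and register set on the CPU,
# so no state is spilled to main memory on a switch.
class HWThread:
    def __init__(self, tid):
        self.tid = tid
        self.pc = 0
        self.regs = [0] * 32   # private register file, lives on chip

class MultithreadedCPU:
    def __init__(self, n_threads=4):
        self.threads = [HWThread(t) for t in range(n_threads)]
        self.current = 0

    def switch_on_memory_access(self):
        # Round-robin to the next resident thread: roughly one cycle,
        # versus a software context switch that stores registers to
        # memory. Returns the id of the thread now running.
        self.current = (self.current + 1) % len(self.threads)
        return self.threads[self.current].tid
```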

Architectural Comparisons (cont.)
[Figure: issue-slot timing diagrams comparing Superscalar, Fine-Grained,
Coarse-Grained, Multiprocessing, and Simultaneous Multithreading;
vertical axis = time (processor cycles); shading marks Threads 1–5 and
idle slots.]
Intel IXP1200 Network Processor
 Initial component of the Intel Internet
Exchange Architecture – IXA
 Each microengine is a 5-stage pipeline – no ILP,
4-way multithreaded
 7-core multiprocessing – 6 microengines and a
StrongARM core
 166 MHz fundamental clock rate
 Intel claims 2.5 Mpps IP routing for 64-byte packets
 Already the most widely used NPU
 Or, more accurately, the most widely admitted use
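A quick sanity check on the routing claim (simple arithmetic, not a benchmark): 2.5 million 64-byte packets per second is about 1.28 Gbit/s of traffic.

```python
# Back-of-the-envelope: convert the claimed packet rate to line rate.
pps = 2.5e6              # packets per second (Intel's claim)
packet_bits = 64 * 8     # 64-byte packets
throughput_gbps = pps * packet_bits / 1e9   # ~1.28 Gbit/s
```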

IXP1200 Chip Layout

 StrongARM processing core
 Microengines introduce a new ISA
 I/O
 PCI
 SDRAM
 SRAM
 IX: PCI-like packet bus
 On-chip FIFOs
 16 entries, 64 B each
IXP1200 Microengine
 4 hardware contexts
 Single issue processor
 Explicit optional context switch on
SRAM access
 Registers
 All are single ported
 Separate GPR
 1536 registers total
 32-bit ALU
 Can access GPR or XFER registers
 Standard 5 stage pipe
 4KB SRAM instruction store – not a
cache!

Intel IXP2400 Microengine (New)

 XScale core
replaces
StrongARM
 1.4 GHz target in
0.13-micron
 Nearest neighbor
routes added
between
microengines
 Hardware to
accelerate CRC
operations and
random-number
generation
 16 entry CAM

MIPS Pipeline
Chapter 6 CS 161 Text

Review: Single-cycle Datapath for MIPS
[Figure: single-cycle datapath: PC → Instruction Memory (Imem) →
Registers → ALU → Data Memory (Dmem), labeled Stage 1 through Stage 5.]

°Use datapath figure to represent pipeline:

IFtch – Dcd – Exec – Mem – WB
(IM – Reg – ALU – DM – Reg)
Stages of Execution in Pipelined MIPS
5-stage instruction pipeline
1) I-fetch: Fetch Instruction, Increment PC
2) Decode: Decode Instruction, Read Registers
3) Execute:
Mem-reference: Calculate Address
R-format: Perform ALU Operation
4) Memory:
Load: Read Data from Data Memory
Store: Write Data to Data Memory
5) Write Back: Write Data to Register
Pipelined Execution Representation

Time →
IFtch Dcd Exec Mem WB
      IFtch Dcd Exec Mem WB
            IFtch Dcd Exec Mem WB
                  IFtch Dcd Exec Mem WB
                        IFtch Dcd Exec Mem WB
Program Flow
°To simplify the pipeline, every instruction
takes the same number of steps, called
stages
°One clock cycle per stage
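The staircase pattern above follows a simple rule, sketched here (my illustration): instruction i occupies stage s in clock cycle i + s.

```python
# Which stage is each instruction in at a given clock cycle?
STAGES = ["IFtch", "Dcd", "Exec", "Mem", "WB"]

def stage_of(instr, cycle):
    # Instruction `instr` (0-based) enters stage s at cycle instr + s,
    # one clock cycle per stage; None means not in the pipe this cycle.
    s = cycle - instr
    return STAGES[s] if 0 <= s < len(STAGES) else None

# In cycle 4 the pipeline is full: 5 instructions in 5 different stages.
occupancy = [stage_of(i, 4) for i in range(5)]
```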
Datapath Timing: Single-cycle vs. Pipelined
°Assume the following delays for major
functional units:
• 2 ns for a memory access or ALU operation
• 1 ns for register file read or write
°Total datapath delay for single-cycle:
Insn    Insn   Reg   ALU   Data    Reg    Total
Type    Fetch  Read  Oper  Access  Write  Time
beq     2ns    1ns   2ns   –       –      5ns
R-form  2ns    1ns   2ns   –       1ns    6ns
sw      2ns    1ns   2ns   2ns     –      7ns
lw      2ns    1ns   2ns   2ns     1ns    8ns

°In a pipelined machine, each stage = length
of longest delay = 2 ns; 5 stages = 10 ns
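The table can be recomputed from the per-unit delays, which also shows where the pipelined 10 ns latency comes from:

```python
# Per-unit delays from the slide: memory access / ALU = 2 ns,
# register file read or write = 1 ns.
MEM, ALU, REG = 2, 2, 1

# Single-cycle total = sum of the units each instruction actually uses.
single_cycle = {
    "beq":    MEM + REG + ALU,               # 5 ns (no Dmem, no writeback)
    "R-form": MEM + REG + ALU + REG,         # 6 ns
    "sw":     MEM + REG + ALU + MEM,         # 7 ns
    "lw":     MEM + REG + ALU + MEM + REG,   # 8 ns
}

# Pipelined: every stage is stretched to the slowest unit's delay,
# so one instruction takes 5 stages * 2 ns = 10 ns of latency.
stage_time = max(MEM, ALU, REG)
pipelined_latency = 5 * stage_time
```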
Pipelining Lessons
° Pipelining doesn’t help latency (execution
time) of single task, it helps throughput of
entire workload
° Multiple tasks operating simultaneously
using different resources
° Potential speedup = Number of pipe stages
° Time to “fill” pipeline and time to “drain” it
reduce speedup

° Pipeline rate limited by slowest pipeline stage

° Unbalanced lengths of pipe stages also
reduce speedup
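The fill/drain lesson can be quantified (my sketch, assuming perfectly balanced stages): n instructions on an s-stage pipe take (s + n − 1) cycles, so speedup only approaches s for long instruction streams.

```python
# Pipeline speedup over a single-cycle machine with balanced stages.
def pipelined_time(n_instr, stages, cycle):
    # First instruction needs `stages` cycles to fill the pipe; after
    # that one instruction retires per cycle.
    return (stages + n_instr - 1) * cycle

def speedup(n_instr, stages, cycle):
    single = n_instr * stages * cycle   # unpipelined: stages*cycle each
    return single / pipelined_time(n_instr, stages, cycle)

# Fill/drain overhead dominates short runs; long runs approach 5x.
short = speedup(5, 5, 2)        # 50/18 ~ 2.78
long_ = speedup(10_000, 5, 2)   # ~5.0
```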
Single Cycle Datapath (From Ch 5)
[Figure: single-cycle datapath with control. PC feeds Imem; instruction
fields [25:21] and [20:16] select the read registers, and a RegDst mux
picks [20:16] or [15:11] as the write register; [15:0] is sign-extended,
and an ALUSrc mux feeds it or Read data 2 to the ALU; a MemToReg mux
selects the ALU result or Dmem read data for write-back; PCSrc selects
PC+4 or the branch target (PC+4 plus the offset shifted left 2).
Control signals: RegWrite, RegDst, ALUSrc, ALUOp, MemRead, MemWrite,
MemToReg, PCSrc.]


Required Changes to Datapath
°Introduce registers to separate the 5
stages by putting IF/ID, ID/EX, EX/MEM,
and MEM/WB registers in the datapath.
°The next PC value is computed in the 3rd
stage, but we need to bring in the next
instruction in the next cycle – Move the
PCSrc Mux to the 1st stage. The PC is
incremented unless there is a new
branch address.
°The branch address is computed in the 3rd
stage. With a pipeline, the PC value has
changed by then! Must carry the PC value
along with the instruction. Width of IF/ID
register = (IR)+(PC) = 64 bits.
Changes to Datapath Contd.

°For the lw instruction, we need the write
register address at stage 5. But the IR is
now occupied by another instruction! So,
we must carry the IR destination field
along as we move through the stages. See
connection in fig.
°Length of ID/EX register =
(Reg1:32)+(Reg2:32)+(offset:32)+
(PC:32)+(destination register:5)
= 133 bits
°Assignment: What are the lengths of the
EX/MEM and MEM/WB registers?
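The two widths given so far can be checked by summing the fields each register carries (a simple tally, using the field sizes from the slide):

```python
# Field widths in bits: IR, PC, register values, and sign-extended
# offset are 32 bits; the destination-register field is 5 bits.
IR, PC, REG, OFFSET, DEST = 32, 32, 32, 32, 5

if_id = IR + PC                          # fetched instruction + PC
id_ex = REG + REG + OFFSET + PC + DEST   # two reg values, offset, PC, dest
```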
Pipelined Datapath (with Pipeline Regs)(6.2)
Fetch | Decode | Execute | Memory | Write Back

[Figure: pipelined datapath. Pipeline registers IF/ID (64 bits), ID/EX
(133 bits), EX/MEM (102 bits), and MEM/WB (69 bits) separate the five
stages. Fetch: PC, Imem, and the PC+4 adder. Decode: register file
(Regs) and sign extension (16 → 32 bits). Execute: ALU plus the branch
adder (offset shifted left 2). Memory: Dmem. Write Back: mux selecting
the ALU result or Dmem read data for the register write.]