Computer Organization &
Assembly Languages
Computer Organization (I)
Fundamentals
Pu-Jen Cheng
Materials
Some materials used in this course are adapted from
¾ The slides prepared by Kip Irvine for the book, Assembly Language for Intel-Based Computers, 5th Ed.
¾ The slides prepared by S. Dandamudi for the book, Fundamentals of Computer Organization and Design.
¾ The slides prepared by S. Dandamudi for the book, Introduction to Assembly Language Programming, 2nd Ed.
¾ Introduction to Computer Systems, CMU (http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15213-f05/www/)
¾ Assembly Language & Computer Organization, NTU (http://www.csie.ntu.edu.tw/~cyy/courses/assembly/05fall/news/) (http://www.csie.ntu.edu.tw/~acpang/course/asm_2004)
Outline
General Concepts of Computer Organization
¾ Overview of Microcomputer
CPU, Memory, I/O
Instruction Execution Cycle
¾ Central Processing Unit (CPU)
CISC vs. RISC
6 Instruction Set Design Issues
¾ How Hardware Executes Processor’s Instructions
Digital Logic Design (Combinational & Sequential Circuits)
Microprogrammed Control
¾ Pipelining
3 Hazards
3 Technologies for Performance Improvement
¾ Memory
Data Alignment
2 Design Issues (Cache, Virtual Memory)
¾ I/O Devices
Overview of Microcomputer
Von Neumann Machine, 1945
Memory, Input/Output, Arithmetic/Logic Unit, Control Unit
Stored-program Model
¾ Both data and programs are stored in the same main memory
Sequential Execution
http://www.virtualtravelog.net/entries/2003-08-TheFirstDraft.pdf
What is Microcomputer
Microcomputer
¾ A computer with a microprocessor (µP) as its central processing
unit (CPU)
Microprocessor (µP)
¾ A digital electronic component with transistors on a single
semiconductor integrated circuit (IC)
¾ One or more microprocessors typically serve as a central
processing unit (CPU) in a computer system or handheld device.
Components of Microcomputer
[Figure: Basic Microcomputer Design. The Central Processor Unit (CPU: registers, ALU, CU, clock), the Memory Storage Unit, and I/O Devices #1 and #2 are connected by the data bus, control bus, and address bus.]
CPU
Arithmetic and logic unit (ALU) performs arithmetic (add, subtract) and logical (AND, OR, NOT) operations
Registers store data and instructions used by the processor
Control unit (CU) coordinates sequence of execution steps
¾ Fetch instructions from memory, decode them to find their types
Clock
Datapath consists of registers and ALU(s)
[Figure: datapath of a RISC processor, showing the ALU input operands and the ALU output]
Program Counter (PC) (or Instruction Pointer (IP))
Instruction Register (IR)
Memory Address Register (MAR)
Memory Data Register (MDR)
Clock
Provide timing signal and the basic unit of time
Synchronize all CPU and BUS operations
Machine (clock) cycle measures time of a single operation
Clock is used to trigger events
Clock period = 1 / clock frequency (e.g., 1 GHz → clock cycle = 1 ns)
An instruction could take multiple cycles to complete, e.g., multiply in 8088 takes 50 cycles
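The relation between clock frequency, clock period, and instruction time can be checked with a short calculation. This is only an illustrative sketch: the function names are made up for this example, and the 50-cycle figure is simply the multiply count quoted on the slide.

```python
def clock_period_ns(frequency_hz):
    """Clock period in nanoseconds: period = 1 / frequency."""
    return 1e9 / frequency_hz


def instruction_time_ns(cycles, frequency_hz):
    """Time taken by an instruction that needs `cycles` clock cycles."""
    return cycles * clock_period_ns(frequency_hz)


print(clock_period_ns(1e9))          # period of a 1 GHz clock, in ns
print(instruction_time_ns(50, 1e9))  # a 50-cycle instruction at 1 GHz, in ns
```

Doubling the clock frequency halves the period, so the same 50-cycle instruction would finish in half the time.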
Memory, I/O, System Bus
Main/primary memory (random access memory, RAM) stores both program instructions and data
I/O devices
¾ Interface: I/O controller
¾ User interface: keyboard, display screen, printer, modem, …
¾ Secondary storage: disk
¾ Communication network
System Bus
¾ A bunch of parallel wires
¾ Transfer data among the components
¾ Address bus (determine the amount of physical memory addressable)
¾ Data bus (indicate the size of the data transferred)
¾ Control bus (consists of control signals: memory/IO read/write, interrupt, bus request/grant)
Execution Cycle
¾ Fetch (IF): CU fetches next instruction, advance PC/IP
¾ Decode (ID): CU determines what the instruction will do
¾ Execute
Fetch operands (OF): (memory operand needed) read value from memory
Execute the instruction (IE)
Store output operand (WB): (memory operand needed) write result to memory
Instruction Execution Cycle (cont.)
[Figure: instructions I-1 to I-4 in program memory are fetched via the PC into the instruction register and decoded; operands (op1, op2) are read from memory or registers; the ALU executes the instruction; the output and flags are written back to registers or memory.]
Introduction to Digital Logic Design
¾ See asm_ch2_dl.ppt
CPU
CISC vs. RISC
¾ Number of Addresses
¾ Flow of Control
¾ Operand Types
¾ Addressing Modes
¾ Instruction Types
¾ Instruction Formats
Processor
RISC and CISC designs
¾ Reduced Instruction Set Computer (RISC)
Simple instructions, small instruction set
Operands are assumed to be in processor registers
Not in memory
Simplify design (e.g., fixed instruction size)
Examples: ARM (Advanced RISC Machines), DEC Alpha (now Compaq)
¾ Complex Instruction Set Computer (CISC)
Complex instructions, large instruction set
Operands can be in registers or memory
Instruction size varies
Typically uses a microprogram
Example: Intel 80x86 family
Processor (cont.)
Variations of the ISA level can be implemented by changing the microprogram
Instruction Set Design Issues
Number of Addresses
Flow of Control
Operand Types
Addressing Modes
Instruction Types
Instruction Formats
Number of Addresses
Four categories
¾ 3-address machines
2 for the source operands and one for the result
¾ 2-address machines
One address doubles as source and result
¾ 1-address machines
Accumulator machines
Accumulator is used for one source and result
¾ 0-address machines
Stack machines
Operands are taken from the stack
Result goes onto the stack
Number of Addresses (cont.)
Three-address machines
¾ Two for the source operands, one for the result
¾ RISC processors use three addresses
¾ Sample instructions
add dest,src1,src2 ; M(dest)=[src1]+[src2]
sub dest,src1,src2 ; M(dest)=[src1]-[src2]
mult dest,src1,src2 ; M(dest)=[src1]*[src2]
Example
¾ C statement
A=B+C*D-E+F+A
¾ Equivalent code:
mult T,C,D ;T = C*D
add T,T,B ;T = B+C*D
sub T,T,E ;T = B+C*D-E
add T,T,F ;T = B+C*D-E+F
add A,T,A ;A = B+C*D-E+F+A
Two-address machines
¾ One address doubles (for source operand & result)
¾ Last example makes a case for it
Address T is used twice
load dest,src ; M(dest)=[src]
add dest,src ; M(dest)=[dest]+[src]
sub dest,src ; M(dest)=[dest]-[src]
mult dest,src ; M(dest)=[dest]*[src]
Example
¾ C statement
A=B+C*D-E+F+A
¾ Equivalent code:
load T,C ;T = C
mult T,D ;T = C*D
add T,B ;T = B+C*D
sub T,E ;T = B+C*D-E
add T,F ;T = B+C*D-E+F
add A,T ;A = B+C*D-E+F+A
One-address machines
¾ Use special set of registers called accumulators
Specify one source operand & receive the result
¾ Called accumulator machines
load addr ; accum = [addr]
store addr ; M[addr] = accum
add addr ; accum = accum + [addr]
sub addr ; accum = accum - [addr]
mult addr ; accum = accum * [addr]
Example
¾ C statement
A=B+C*D-E+F+A
¾ Equivalent code:
load C ;load C into accum
mult D ;accum = C*D
add B ;accum = C*D+B
sub E ;accum = B+C*D-E
add F ;accum = B+C*D-E+F
add A ;accum = B+C*D-E+F+A
store A ;store accum contents in A
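To see that the one-address sequence really computes the C statement, it can be run on a tiny simulated accumulator machine. This is a sketch under stated assumptions: `run_accumulator`, the dictionary used as memory, and the sample values are all hypothetical and chosen only for illustration.

```python
def run_accumulator(program, memory):
    """Execute (opcode, address) pairs on a simulated one-address machine."""
    accum = 0
    for opcode, addr in program:
        if opcode == "load":
            accum = memory[addr]
        elif opcode == "store":
            memory[addr] = accum
        elif opcode == "add":
            accum += memory[addr]
        elif opcode == "sub":
            accum -= memory[addr]
        elif opcode == "mult":
            accum *= memory[addr]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return memory


# Arbitrary sample values for the six variables.
memory = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
expected = (memory["B"] + memory["C"] * memory["D"]
            - memory["E"] + memory["F"] + memory["A"])

program = [("load", "C"), ("mult", "D"), ("add", "B"), ("sub", "E"),
           ("add", "F"), ("add", "A"), ("store", "A")]
run_accumulator(program, memory)
print(memory["A"], expected)
```

Every instruction names at most one memory address; the accumulator is the implicit second source and the implicit destination.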
Zero-address machines
¾ Stack supplies operands and receives the result
Special instructions to load and store use an address
¾ Called stack machines (Ex: HP3000, Burroughs B5500)
push addr ; push([addr])
pop addr ; pop([addr])
add ; push(pop + pop)
sub ; push(pop - pop)
mult ; push(pop * pop)
Example
¾ C statement
A=B+C*D-E+F+A
¾ Equivalent code:
push E
push C
push D
mult
push B
add
sub
push F
add
push A
add
pop A
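The zero-address sequence can be traced with a small stack-machine simulation. This is a sketch only: `run_stack` and the sample values are hypothetical, and `sub` is taken to mean "first popped value minus second popped value", which is the reading of push(pop - pop) that makes the sequence above produce B+C*D-E.

```python
def run_stack(program, memory):
    """Execute a zero-address program; only push and pop name an address."""
    stack = []
    for instr in program:
        op = instr[0]
        if op == "push":
            stack.append(memory[instr[1]])
        elif op == "pop":
            memory[instr[1]] = stack.pop()
        else:
            first = stack.pop()    # value on top of the stack
            second = stack.pop()   # value just below it
            if op == "add":
                stack.append(first + second)
            elif op == "sub":
                stack.append(first - second)
            elif op == "mult":
                stack.append(first * second)
            else:
                raise ValueError(f"unknown opcode: {op}")
    return memory


# Arbitrary sample values for the six variables.
memory = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6}
expected = (memory["B"] + memory["C"] * memory["D"]
            - memory["E"] + memory["F"] + memory["A"])

program = [("push", "E"), ("push", "C"), ("push", "D"), ("mult",),
           ("push", "B"), ("add",), ("sub",), ("push", "F"), ("add",),
           ("push", "A"), ("add",), ("pop", "A")]
run_stack(program, memory)
print(memory["A"], expected)
```

Pushing E first is what lets the later `sub` find B+C*D on top of the stack and E beneath it.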
Load/Store Architecture
Instructions expect operands in internal processor registers
¾ Special LOAD and STORE instructions move data between
registers and memory
¾ RISC uses this architecture
¾ Reduces instruction length
Load/Store Architecture (cont.)
Sample instructions
load Rd,addr ;Rd = [addr]
store addr,Rs ;(addr) = Rs
add Rd,Rs1,Rs2 ;Rd = Rs1 + Rs2
sub Rd,Rs1,Rs2 ;Rd = Rs1 - Rs2
mult Rd,Rs1,Rs2 ;Rd = Rs1 * Rs2
Example
¾ C statement
A = B + C * D - E + F + A
¾ Equivalent code:
load R1,B
load R2,C
load R3,D
load R4,E
load R5,F
load R6,A
mult R2,R2,R3
add R2,R2,R1
sub R2,R2,R4
add R2,R2,R5
add R2,R2,R6
store A,R2
Flow of Control
Default is sequential flow
Several instructions alter this default execution
¾ Branches
Unconditional
Conditional
Delayed branches
¾ Procedure calls
Delayed procedure calls
Flow of Control (cont.)
Branches
¾ Unconditional
Absolute address
PC-relative
Target address is specified relative to PC contents

Relocatable code
¾ Example: MIPS
Absolute address
j target
PC-relative
b target
[Figure: two branch-execution diagrams, labeled "e.g., Pentium" and "e.g., SPARC"]
Branches
¾ Conditional
Jump is taken only if the condition is met
¾ Two types
Set-Then-Jump
Condition testing is separated from branching
Condition code registers are used to convey the condition test result
Condition code registers keep a record of the status of the last ALU operation such as overflow condition
Example: Pentium code
cmp AX,BX ; compare AX and BX
je target ; jump if equal
Test-and-Jump
Single instruction performs condition testing and branching
Example: MIPS instruction
beq Rsrc1,Rsrc2,target
Jumps to target if Rsrc1 = Rsrc2
Delayed branching
¾ Control is transferred after executing the instruction that
follows the branch instruction
This instruction slot is called the delay slot
¾ Improves efficiency
¾ Highly pipelined RISC processors support delayed branching
Procedure calls
¾ Facilitate modular programming
¾ Require two pieces of information to return
End of procedure
Pentium uses ret instruction
MIPS uses jr instruction
Return address
In a (special) register
MIPS allows any general-purpose register
On the stack
Pentium
Delay slot
Parameter Passing
Two basic techniques
¾ Register-based (e.g., PowerPC, MIPS)
Internal registers are used
Faster
Limits the number of parameters
Recursive procedure
¾ Stack-based (e.g., Pentium)
Stack is used
More general
Operand Types
Instructions support basic data types
¾ Characters
¾ Integers
¾ Floating-point
Instruction overload
¾ Same instruction for different data types
¾ Example: Pentium
mov AL,address ;loads an 8-bit value
mov AX,address ;loads a 16-bit value
mov EAX,address ;loads a 32-bit value
Operand Types
Separate instructions
¾ Instructions specify the operand size
¾ Example: MIPS
lb Rdest,address ;loads a byte
lh Rdest,address ;loads a halfword (16 bits)
lw Rdest,address ;loads a word (32 bits)
ld Rdest,address ;loads a doubleword (64 bits)
Similar instruction: store
Addressing Modes
How the operands are specified
¾ Operands can be in three places
Registers
Register addressing mode
Part of instruction
Constant
Immediate addressing mode
All processors support these two addressing modes
Memory
Difference between RISC and CISC
CISC supports a large variety of addressing modes
RISC follows load/store architecture
Instruction Types
Several types of instructions
¾ Data movement
Pentium: mov dest,src
Some do not provide direct data movement instructions
Indirect data movement
add Rdest,Rsrc,0 ;Rdest = Rsrc+0
¾ Arithmetic and Logical
Arithmetic
Integer and floating-point, signed and unsigned
add, subtract, multiply, divide
Logical
and, or, not, xor
Instruction Types (cont.)
Condition code bits
¾ S: Sign bit (0 = +, 1 = -)
¾ Z: Zero bit (0 = nonzero, 1 = zero)
¾ O: Overflow bit (0 = no overflow, 1 = overflow)
¾ C: Carry bit (0 = no carry, 1 = carry)
Example: Pentium
cmp count,25 ;compare count to 25
;subtract 25 from count
je target ;jump if equal
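The link between cmp, the condition code bits, and je can be modeled in a few lines. This is a sketch, assuming 8-bit operands for readability; `compare_flags` is a hypothetical helper, not a processor API. It mimics a compare by subtracting the second operand from the first and keeping only the flags.

```python
def compare_flags(a, b, bits=8):
    """Flags produced by computing a - b, as a compare does (result discarded)."""
    mask = (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    a &= mask
    b &= mask
    result = (a - b) & mask
    return {
        "S": int(bool(result & sign_bit)),  # sign of the result
        "Z": int(result == 0),              # result is zero
        "C": int(a < b),                    # unsigned borrow out of the subtraction
        # signed overflow: operands differ in sign and the result's
        # sign differs from the first operand's sign
        "O": int(bool((a ^ b) & (a ^ result) & sign_bit)),
    }


count = 25
flags = compare_flags(count, 25)
jump_taken = flags["Z"] == 1   # je transfers control only when Z = 1
print(flags, jump_taken)
```

With count equal to 25 the subtraction yields zero, Z is set, and the jump is taken; any other value of count leaves Z clear and execution falls through.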
Instruction Types (cont.)
¾ Flow control and I/O instructions
Branch
Procedure call
Interrupts
¾ I/O instructions
Memory-mapped I/O
Most processors support memory-mapped I/O

No separate instructions for I/O
Isolated I/O
Pentium supports isolated I/O
Separate I/O instructions
in AX,io_port ;read from an I/O port
out io_port,AX ;write to an I/O port
Instruction Formats
Two types
¾ Fixed-length
Used by RISC processors
32-bit RISC processors use 32-bit-wide instructions
Examples: SPARC, MIPS, PowerPC
¾ Variable-length
Used by CISC processors
Memory operands need more bits to specify
Opcode
¾ Major and exact operation
Examples of Instruction Formats
How Hardware Executes Processor’s Instructions
Digital Logic Design
¾ Combinational and Sequential Circuits
Virtual Machines
Abstractions for computers
Level 5: High-Level Language (machine-independent)
Level 4: Assembly Language (machine-specific)
Level 3: Operating System
Level 2: Instruction Set Architecture
Level 1: Microarchitecture
Level 0: Digital Logic

[Figure: Basic Microcomputer Design. The Central Processor Unit (CPU: registers, ALU, CU, clock), the Memory Storage Unit, and I/O Devices #1 and #2 are connected by the data bus, control bus, and address bus.]
Consider 1-bus Datapath
Assume all entities are 32 bits wide
1-bit ALU
ALU Circuit in 1-bus Datapath
Memory Interface Implementation
Thirty-two 32-bit general-purpose registers
¾ Interface only with the A-bus
¾ Each register has two control signals
Gxin and Gxout
Control signals used by the other registers
¾ PC register:
PCin, PCout, and PCbout
¾ IR register:
IRout and IRbin
¾ MAR register:
MARin, MARout, and MARbout
¾ MDR register:
MDRin, MDRout, MDRbin and MDRbout
Microprogrammed Control (cont.)
add %G9,%G5,%G7
Implemented as
Transfer G5 contents to A register
Assert G5out and Ain

Place G7 contents on the A bus
Assert G7out
Instruct ALU to perform addition
Appropriate ALU function control signals
Latch the result in the C register
Assert Cin
Transfer contents of the C register to G9
Assert Cout and G9in
Instruction Fetch
Implemented as
PCbout: read: PCout: ALU=add4: Cin;
read: Cout: PCin;
read: IRbin;
Decodes the instruction and jumps to the appropriate execution routine

Example instruction groups
¾ Load/store
Moves data between registers and memory
¾ Register
Arithmetic and logic instructions
¾ Branch
Jump direct/indirect
¾ Call
Procedure invocation mechanisms
¾ More…
High-level FSM for instruction execution
FSM: finite state machine

Software implementation
¾ Typically used in CISC
Hardware implementation (PLA) is complex and
expensive
Example
add %G9,%G5,%G7
¾ Three steps
S1 G5out: Ain;
S2 G7out: ALU=add: Cin;
S3 Cout: G9in: end;
Simple microcode organization
Uses a microprogram to generate the control
signals
¾ Encode the signals of each step as a codeword
Called microinstruction
¾ An instruction is expressed by a sequence of codewords
Called microroutine
Microprogram essentially implements the FSM discussed before
A simple microcontroller can execute a
microprogram to generate the control signals
¾ Control store
Store microprogram
¾ Use μPC
Similar to PC
¾ Address generator
Generates appropriate address depending on the
Opcode, and
Condition code inputs
Microcontroller
Microcodes reside in control store, which might be read-only memory (ROM)

Microinstruction format
¾ Two basic ways
Horizontal organization
Vertical organization
¾ Horizontal organization
One bit for each signal
Very flexible
Long microinstructions
Example: 1-bus datapath
Needs 90 bits for each microinstruction
Horizontal microinstruction format
¾ Vertical organization
Encodes to reduce microinstruction length
Reduced flexibility
¾ Example:
Horizontal organization
64 control signals for the 32 general-purpose registers
Vertical organization
5 bits to identify the register and 1 for in/out
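The register example can be worked through in code: one bit per signal needs 64 bits, while an encoded field needs 5 + 1 = 6. This is a sketch only; the function names and the bit layout (register number in the upper bits, in/out in the lowest bit) are assumptions made for illustration, not the layout of any real microinstruction.

```python
def horizontal_bits(num_registers):
    """One bit per control signal: an 'in' and an 'out' signal per register."""
    return 2 * num_registers


def vertical_bits(num_registers):
    """Encoded field: a register number plus one bit selecting in/out."""
    return (num_registers - 1).bit_length() + 1


def encode_vertical(register, is_out):
    """Pack the register number and the in/out choice into one small field."""
    return (register << 1) | int(is_out)


def decode_to_horizontal(codeword, num_registers=32):
    """Expand the encoded field back into one-hot control signals."""
    register, is_out = codeword >> 1, codeword & 1
    signals = [0] * (2 * num_registers)
    signals[2 * register + is_out] = 1
    return signals


print(horizontal_bits(32), vertical_bits(32))
g9_in = encode_vertical(9, is_out=False)       # select the "in" signal of G9
print(g9_in, sum(decode_to_horizontal(g9_in)))
```

The decoder always asserts exactly one signal, which is the reduced flexibility mentioned above: a single encoded field cannot assert two register signals in the same microinstruction, whereas the horizontal form can.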
2-bus Datapath
Adding more buses reduces time needed to execute instructions
¾ No need to multiplex the bus
Example
add %G9,%G5,%G7
¾ Needed three steps in 1-bus datapath
¾ Need only two steps with a 2-bus datapath
S1 G5out: Ain;
S2 G7out: ALU=add: G9in;
Pipelining
Introduction
3 Hazards
¾ Resource, Data, and Control Hazards
3 Technologies for Performance Improvement
¾ Superscalar, Superpipelined, and Very Long Instruction Word
Serial and Pipelining
Serial execution: 20 cycles

Pipelined execution: 8 cycles
For k stages and n instructions, the number of required cycles is:
k + (n - 1)
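The two cycle counts quoted above follow directly from the formula. The sketch below assumes k = 5 stages and n = 4 instructions, which is one combination consistent with the 20-cycle and 8-cycle figures; the function names are hypothetical.

```python
def serial_cycles(k, n):
    """Each instruction passes through all k stages before the next one starts."""
    return k * n


def pipelined_cycles(k, n):
    """The first instruction takes k cycles; each later one completes one cycle after."""
    return k + (n - 1)


k, n = 5, 4   # assumed values matching the slide's cycle counts
print(serial_cycles(k, n), pipelined_cycles(k, n))
print(serial_cycles(k, n) / pipelined_cycles(k, n))   # speedup
```

As n grows, the speedup k*n / (k + n - 1) approaches k, the number of stages.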
Pipelining
Pipelining
¾ Overlapped execution
¾ Increases throughput
Pipelining (cont.)
Pipelining requires buffers
¾ Each buffer holds a single value
¾ Uses just-in-time principle
Any delay in one stage affects the entire pipeline flow
¾ Ideal scenario: equal work for each stage
Sometimes it is not possible
Slowest stage determines the flow rate in the entire pipeline
Pipelining (cont.)
Some reasons for unequal work stages
¾ A complex step cannot be subdivided conveniently
¾ An operation takes variable amount of time to execute
EX: Operand fetch time depends on where the operands are located
Registers
Cache
Memory
¾ Complexity of operation depends on the type of operation
Add: may take one cycle
Multiply: may take several cycles
Pipeline Stall
Operand fetch of I2 takes three cycles
¾ Pipeline stalls for two cycles
Caused by hazards
¾ Pipeline stalls reduce overall throughput

Hazards
Three types of hazards
¾ Resource hazards
Occurs when two or more instructions use the same
resource
Also called structural hazards
¾ Data hazards
Caused by data dependencies between instructions
Example: Result produced by I1 is read by I2
¾ Control hazards
Default: sequential execution suits pipelining
Altering control flow (e.g., branching) causes problems
Introduce control dependencies

Resource Hazards
Example
¾ Conflict for memory in clock cycle 3
I1 fetches operand
I3 delays its instruction fetch from the same memory
Data Hazards
Example
¾ I1: add R2,R3,R4 /* R2 = R3 + R4 */
¾ I2: sub R5,R6,R2 /* R5 = R6 – R2 */
Introduces data dependency between I1 and I2
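A dependency of this kind (a later instruction reading a register that an earlier one writes) can be found mechanically. This is a minimal sketch; the tuple format (label, destination, sources) and the name `raw_dependencies` are assumptions for illustration, and only read-after-write dependencies are reported.

```python
def raw_dependencies(instructions):
    """Return (producer, consumer, register) for each read-after-write dependency."""
    last_writer = {}   # register -> label of the latest instruction that wrote it
    dependencies = []
    for label, dest, sources in instructions:
        for reg in dict.fromkeys(sources):   # each source register once, in order
            if reg in last_writer:
                dependencies.append((last_writer[reg], label, reg))
        last_writer[dest] = label
    return dependencies


program = [
    ("I1", "R2", ("R3", "R4")),   # add R2,R3,R4
    ("I2", "R5", ("R6", "R2")),   # sub R5,R6,R2
]
print(raw_dependencies(program))
```

I2 cannot read R2 until I1 has produced it, so a pipeline must either stall I2 or forward the result.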
Control Hazards
»Determine branch decision early

Performance Improvement
Several techniques to improve performance of a
pipelined system
¾ Superscalar
Replicates the pipeline hardware
¾ Superpipelined
Increases the pipeline depth
¾ Very long instruction word (VLIW)

Encodes multiple operations into a long instruction word
Hardware schedules these instructions on multiple functional units (no run-time analysis)
add R1, R2, R3 ; R1 = R2 + R3
sub R5, R6, R7 ; R5 = R6 – R7
and R4, R1, R5 ; R4 = R1 AND R5
xor R9, R9, R9 ; R9 = R9 XOR R9
cycle 1: add, sub, xor

cycle 2: and
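The cycle assignment above can be reproduced by placing each instruction one cycle after the latest producer of its source registers. This is a sketch under simplifying assumptions that the slide does not state: every operation takes one cycle, there are enough functional units, and only read-after-write dependencies matter. The name `schedule` and the tuple format are hypothetical.

```python
def schedule(instructions):
    """Assign each instruction the earliest cycle after its sources are produced."""
    ready_cycle = {}   # register -> cycle in which its latest value is produced
    cycles = {}
    for name, dest, sources in instructions:
        cycle = 1 + max((ready_cycle.get(reg, 0) for reg in sources), default=0)
        cycles[name] = cycle
        ready_cycle[dest] = cycle
    return cycles


program = [
    ("add", "R1", ("R2", "R3")),
    ("sub", "R5", ("R6", "R7")),
    ("and", "R4", ("R1", "R5")),
    ("xor", "R9", ("R9", "R9")),
]
print(schedule(program))
```

Only `and` has to wait, because it reads R1 and R5; `xor` depends on nothing earlier and can share the first cycle with `add` and `sub`.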
Superscalar Processor
Ex: Pentium
Wasted Cycles (pipelined)
When one of the stages requires two or more clock cycles, clock cycles are again wasted.
For k stages and n instructions, the number of required cycles is:
k + (2n - 1)
(S4 is the execute stage, exe; a dash marks a stage with no instruction shown.)
Cycle  S1   S2   S3   S4   S5   S6
1      I-1  -    -    -    -    -
2      I-2  I-1  -    -    -    -
3      I-3  I-2  I-1  -    -    -
4      -    I-3  I-2  I-1  -    -
5      -    -    I-3  I-1  -    -
6      -    -    -    I-2  I-1  -
7      -    -    -    I-2  -    I-1
8      -    -    -    I-3  I-2  -
9      -    -    -    I-3  -    I-2
10     -    -    -    -    I-3  -
11     -    -    -    -    -    I-3
Superscalar
A superscalar processor has multiple execution pipelines. In the following, note that Stage S4 has left and right pipelines (u and v).
For k stages and n instructions, the number of required cycles is:
k + n
(A dash marks a stage with no instruction shown.)
Cycle  S1   S2   S3   S4-u S4-v S5   S6
1      I-1  -    -    -    -    -    -
2      I-2  I-1  -    -    -    -    -
3      I-3  I-2  I-1  -    -    -    -
4      I-4  I-3  I-2  I-1  -    -    -
5      -    I-4  I-3  I-1  I-2  -    -
6      -    -    I-4  I-3  I-2  I-1  -
7      -    -    -    I-3  I-4  I-2  I-1
8      -    -    -    -    I-4  I-3  I-2
9      -    -    -    -    -    I-4  I-3
10     -    -    -    -    -    -    I-4
Superpipelined Processor
Ex: MIPS R4000

Memory
Memory
Introduction
Building Memory Blocks
Alignment of Data
2 Memory Design Issues
¾ Cache
¾ Virtual Memory
Memory (cont.)
Ordered sequence of bytes
¾ The sequence number is called the memory address
¾ Byte addressable memory
Each byte has a unique address
Almost all processors support this
Memory address space
¾ Determined by the address bus width
¾ Pentium has a 32-bit address bus
address space = 4GB (2^32 bytes)
¾ Itanium with 64-bit address bus supports 2^64 bytes of address space
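The address-space figures are just powers of two of the address bus width, assuming byte-addressable memory as described above. A short check (the function name is hypothetical):

```python
def address_space_bytes(address_bus_bits):
    """Number of distinct byte addresses an address bus of this width can name."""
    return 2 ** address_bus_bits


GB = 2 ** 30   # the slide's "GB" is taken here as 2^30 bytes
print(address_space_bytes(32) // GB)        # 32-bit address bus, in GB
print(address_space_bytes(64) // GB)        # 64-bit address bus, in GB
```

Each extra address line doubles the addressable memory, so going from 32 to 64 bits multiplies the address space by 2^32.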
Memory (cont.)
Read cycle
1. Place address on the address bus
2. Assert memory read control signal
3. Wait for the memory to retrieve the data
Introduce wait states if using a slow memory
4. Read the data from the data bus
5. Drop the memory read signal
In Pentium, a simple read takes three clock cycles
Clock 1: steps 1 and 2
Clock 2: step 3
Clock 3: steps 4 and 5
Memory (cont.)
Write cycle
1. Place address on the address bus
2. Place data on the data bus
3. Assert memory write signal
4. Wait for the memory to store the data
Introduce wait states if necessary
5. Drop the memory write signal
In Pentium, a simple write also takes three clock cycles
Clock 1: steps 1 and 3
Clock 2: step 2
Clock 3: steps 4 and 5
How Hardware Implements
Memory Systems
Building a Memory Block
A 4 X 3 memory design using D flip-flops
Building a Memory Block (cont’d)
Block diagram representation of a 4 X 3 memory
Address
Data
Control signals
¾ Read
¾ Write
Building Larger Memories
2 X 16 memory module using 74373 chips
Designing Larger Memories
64M X 32 memory using 16M X 16 chips
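The number of chips in such a design follows from dividing the target dimensions by the chip dimensions. A sketch of that arithmetic, with a hypothetical helper name:

```python
def chips_needed(target_words, target_width, chip_words, chip_width):
    """Return (rows of chips, chips per row, total chips) for a memory array."""
    if target_words % chip_words or target_width % chip_width:
        raise ValueError("target size must be a multiple of the chip size")
    rows = target_words // chip_words       # chips stacked to reach the depth
    per_row = target_width // chip_width    # chips side by side to reach the width
    return rows, per_row, rows * per_row


M = 2 ** 20
print(chips_needed(64 * M, 32, 16 * M, 16))
```

For 64M X 32 built from 16M X 16 chips this gives four rows of two chips. Since 64M = 2^26 and 16M = 2^24, 24 address bits go to every chip and the remaining two select one of the four rows.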
Alignment of Data
Get 32-bit data in one or more read cycles?

Alignment of Data (cont.)
Alignment
¾ 2-byte data: Even address
Rightmost address bit should be zero
¾ 4-byte data: Address that is multiple of 4
Rightmost 2 bits should be zero
¾ 8-byte data: Address that is multiple of 8
Rightmost 3 bits should be zero
¾ Soft alignment
Can handle aligned as well as unaligned data
¾ Hard alignment
Handles only aligned data (enforces alignment)
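The alignment rules amount to checking the low-order address bits, and they determine how many read cycles a data item costs. This is a sketch, assuming a 32-bit data bus that always transfers an aligned 4-byte word per read cycle; both function names are hypothetical.

```python
def is_aligned(address, size):
    """True if the low log2(size) address bits are zero (size is a power of two)."""
    return (address & (size - 1)) == 0


def read_cycles(address, size, bus_bytes=4):
    """Number of aligned bus-width words that the data item touches."""
    first_word = address // bus_bytes
    last_word = (address + size - 1) // bus_bytes
    return last_word - first_word + 1


print(is_aligned(0x1004, 4), read_cycles(0x1004, 4))   # aligned 32-bit item
print(is_aligned(0x1002, 4), read_cycles(0x1002, 4))   # unaligned 32-bit item
```

An aligned 32-bit item fits in one bus word and needs one read cycle; an unaligned one straddles two words and needs two, which is the cost that soft alignment accepts and hard alignment forbids.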
Memory Design Issues
Slower memories
Problem: Speed gap between processor and memory
Solution: Cache memory
Use small amount of fast memory
Make the slow memory appear faster
Works due to “reference locality”
Size limitations
¾ Limited amount of physical memory
Overlay technique
Programmer managed
¾ Virtual memory
Automates overlay management
Some additional benefits

Memory Hierarchy
Cache Memory
High-speed expensive static RAM both inside and outside the CPU.
¾ Level-1 cache: inside the CPU
¾ Level-2 cache: outside the CPU
Prefetch data into cache before the processor needs it
¾ Need to predict processor future access requirements
¾ Locality of reference
Cache hit: when data to be read is already in cache memory
Cache miss: when data to be read is not in cache memory.
When? compulsory, capacity and conflict.
Cache design: cache size, n-way, block size, replacement policy
Why Cache Memory Works
Example
for (i=0; i<M; i++)
  for (j=0; j<N; j++)
    X[i][j] = X[i][j] + K;
¾ Each element of X is double (eight bytes)
¾ Loop body is executed (M*N) times
Placing the code in cache avoids access to main memory
Repetitive use
Temporal locality
Prefetching data
Spatial locality
Cache Design Basics
On every read miss

¾ A fixed number of bytes are transferred
More than what the processor needs
Effective due to spatial locality

Cache is divided into blocks of B bytes
b bits are needed as offset into the block
b = log2(B)
Blocks are called cache lines
Main memory is also divided into blocks of same size
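Splitting an address into a block number and a b-bit offset can be sketched as follows. The block size must be a power of two; `split_address` and the sample address are hypothetical.

```python
def split_address(address, block_size):
    """Return (block number, offset within block, number of offset bits)."""
    offset_bits = block_size.bit_length() - 1   # b = log2(B) for power-of-two B
    block_number = address >> offset_bits
    offset = address & (block_size - 1)
    return block_number, offset, offset_bits


print(split_address(0x1234, 32))   # B = 32 bytes, so b = 5
```

All addresses that share a block number are transferred together on a miss, which is why neighboring data tends to be in the cache already.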
Mapping Function
Determines how memory blocks are mapped to cache lines
Three types
¾ Direct mapping
Specifies a single cache line for each memory block
¾ Set-associative mapping
Specifies a set of cache lines for each memory block
¾ Associative mapping
No restrictions
Any cache line can be used for any memory block
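The three mapping functions differ only in how many cache lines are candidates for a given memory block. The sketch below assumes the common modulo placement (block number mod number of sets) and sets stored as consecutive lines; neither detail is stated in these slides, and `candidate_lines` is a hypothetical name.

```python
def candidate_lines(block_number, num_lines, ways):
    """Cache lines that may hold the block in a `ways`-way set-associative cache.

    ways == 1 gives direct mapping; ways == num_lines gives associative mapping.
    """
    num_sets = num_lines // ways
    set_index = block_number % num_sets
    return list(range(set_index * ways, (set_index + 1) * ways))


block = 145
print(candidate_lines(block, num_lines=8, ways=1))   # direct mapping
print(candidate_lines(block, num_lines=8, ways=2))   # 2-way set-associative
print(candidate_lines(block, num_lines=8, ways=8))   # associative
```

More candidate lines mean fewer conflict misses, at the cost of comparing more tags on every access.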

Direct Mapping
Set-Associative Mapping
Virtual Memory
I/O Devices
Input/Output
I/O devices are interfaced via an I/O controller
¾ Takes care of low-level operation details
Several ways of mapping I/O
¾ Memory-mapped I/O
Reading and writing similar to memory read/write
Uses same memory read and write signals
Most processors use this I/O mapping
¾ Isolated I/O
Separate I/O address space
Separate I/O read and write signals are needed
Pentium supports isolated I/O
Also supports memory-mapped I/O

Input/Output (cont.)
Several ways of transferring data
¾ Programmed I/O
Program uses a busy-wait loop
Anticipated transfer
¾ Direct memory access (DMA)
Special controller (DMA controller) handles data
transfers
Typically used for bulk data transfer
¾ Interrupt-driven I/O
Interrupts are used to initiate and/or terminate data
transfers
Powerful technique
Handles unanticipated transfers
Interconnection
System components are interconnected by buses
¾ Bus: a bunch of parallel wires
Uses several buses at various levels
¾ On-chip buses
Buses to interconnect ALU and registers
A, B, and C buses in our example
Data and address buses to connect on-chip caches
¾ Internal buses
PCI, AGP, PCMCIA
¾ External buses
Serial, parallel, USB, IEEE 1394 (FireWire)
PC System Buses
ISA (Industry Standard Architecture)
PCI (Peripheral Component Interconnect)
AGP (Accelerated Graphics Port)
Interconnection (cont.)
Bus is a shared resource
¾ Bus transactions
Sequence of actions to complete a well-defined activity
Involves a master and a slave
Memory read, memory write, I/O read, I/O write
¾ Bus operations
A bus transaction may perform one or more bus operations
Pentium burst read
Transfers four memory words
Bus transaction consists of four memory read
operations
¾ Bus arbitration
Summary
¾ Overview of Microcomputer
CPU, Memory, I/O
¾ Central Processing Unit (CPU)
CISC vs. RISC
¾ How Hardware Executes Processor’s Instructions
Digital Logic Design (Combinational & Sequential Circuits)
¾ Pipelining
3 Hazards
3 Technologies for Performance Improvement
¾ Memory
Data Alignment
2 Design Issues (Cache, Virtual Memory)
¾ I/O Devices
