
Computer Architecture

Sandeep Srivastava
Introduction

This is a slightly advanced lecture on
Computer Architecture; it covers high-level
topics such as speed, performance, and cost.

Brief familiarity with Computer Organization is
assumed.
Introduction

In making a design trade-off, favor the
frequent case over the infrequent case. This
principle also applies when determining how
to spend resources, since the impact of
making some occurrence faster is greater if the
occurrence is frequent.

Improving the frequent occurrence:
– Helps performance
– Is simpler and can be done faster
Locality of References

This important fundamental observation
comes from properties of programs. The most
important program property that we regularly
exploit is locality of references : Programs tend
to reuse data and instructions they have used
recently. The 90/10 rule comes from an empirical
observation:
"A program spends 90% of its time in 10% of
its code"

An implication of locality is that we can predict with
reasonable accuracy what instructions and data a
program will use in the near future, based on its
accesses in the recent past.
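
As a hedged illustration (the function and names below are invented for this example, not taken from the lecture), the C loop that follows shows both kinds of locality: the scalar sum, the loop counter, and the loop instructions are reused on every iteration (temporal locality), while the array elements are touched sequentially, so a cache block fetched for one element also serves its neighbours (spatial locality).

#include <stddef.h>

/* Illustrative sketch of locality of reference.                       */
/* Temporal locality: sum, i, and the loop body are reused every pass. */
/* Spatial locality: a[0], a[1], ... are adjacent in memory.           */
double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}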
Smaller is Faster Rule

Smaller pieces of hardware will generally be
faster than larger pieces. This simple principle
is particularly applicable to memories built
from the same technology for two reasons:

In high-speed machines, signal propagation is
a major cause of delay

In most technologies we can obtain smaller
memories that are faster than larger
memories. This is primarily because the
designer can use more power per memory cell
in a smaller design.
Memory Hierarchy Design

Memory hierarchy design is based on three
important principles:

  Make the Common Case Fast

  Principle of Locality

  Smaller is Faster
Memory Hierarchy Design
The objective of Memory Hierarchy is to obtain the highest possible access
speed while minimizing the total cost of the memory system
[Figure: the memory hierarchy. The CPU works from its registers and cache
memory, backed by main memory and, through the I/O processor, by auxiliary
memory on magnetic disks and magnetic tapes. Speed decreases and capacity
increases at each level: Register, Cache, Main Memory, Magnetic Disk.]
Memory Hierarchy Design

The above principles suggest that we should
try to keep recently accessed items in the
fastest memory. Because the smaller
memories are more expensive and faster, we
want to use smaller memories to try to hold
the most recently accessed items close to the
CPU and successively larger (and slower, and
less expensive) memories as we move away
from the CPU. This type of organization is
called a memory hierarchy. Two important levels of
the memory hierarchy are the cache and virtual memory.
Memory Hierarchy Design

Using the principle of locality to improve
performance while keeping the memory
system affordable, we can pose four questions
about any level of the memory hierarchy. We will
answer those questions considering one level
of the memory hierarchy.

  Block Placement
            Where should a block be placed in the cache?

  Block Identification
            How is a block found if it is in the cache?

  Block Replacement
            Which block should be replaced on a miss?

  Write Strategy
            What happens on a write?
Block Placement

There are three block placement methods:
– Direct mapped : if each block has only one place it
can appear in the cache, the cache is said to be
direct mapped. The mapping is usually (Block
address) MOD (Number of blocks in cache)
– Fully Associative : if a block can be placed
anywhere in the cache, the cache is said to be fully
associative.
– Set associative : if a block can be placed in a
restricted set of places in the cache, the cache is
said to be set associative. A block is first mapped
onto a set, usually by (Block address) MOD (Number
of sets in cache), and can then be placed anywhere
within that set.
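
A minimal sketch of the three mapping rules, assuming an invented geometry of 256 block frames organised as 64 sets of 4 ways (none of these figures come from the lecture):

#include <stdint.h>

#define NUM_BLOCKS 256u   /* total block frames (assumed)  */
#define NUM_SETS    64u   /* sets of 4 ways each (assumed) */

/* Direct mapped: exactly one legal frame for a given block address. */
uint32_t direct_mapped_frame(uint32_t block_addr)
{
    return block_addr % NUM_BLOCKS;
}

/* Set associative: the block may go in any way of this one set. */
uint32_t set_associative_set(uint32_t block_addr)
{
    return block_addr % NUM_SETS;
}

/* Fully associative: any of the NUM_BLOCKS frames may be used,
   so no mapping function is needed. */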
Block Identification

Cache memory consists of two portions:

Directory
      -  Address Tags ( checked to match the
block address from CPU )
      -  Control Bits ( indicate that the content of
a block is valid )
RAM
      -  Block Frames ( contain data )
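
One possible way to model these two portions in C; the structure names and the 64-byte block size are assumptions made for illustration, not part of the lecture:

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 64                 /* bytes per block (assumed) */

/* Directory entry: address tag plus control bits */
struct dir_entry {
    uint32_t tag;      /* compared with the block address from the CPU */
    bool     valid;    /* set when the block frame holds valid data    */
};

/* RAM portion: the block frame holding the data itself */
struct block_frame {
    uint8_t data[BLOCK_SIZE];
};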
Block Identification

As a rule, all possible tags are searched in
parallel because speed is critical.

The block offset field selects the desired data
(minimal addressable unit) from the block, the
index field selects the set, and the tag field is
compared against cache tag for a hit.

While the comparison could be made on more
of the address than the tag, there is no need
because:
     Checking the index would be redundant, since it
was used to select the set to be checked, and the
offset is unnecessary because the entire block is
either present or not, so all block offsets would match.
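
A hedged sketch of how the three fields can be extracted from a byte address, assuming 64-byte blocks (6 offset bits) and 64 sets (6 index bits); the field widths are invented for the example:

#include <stdint.h>

#define OFFSET_BITS 6u    /* log2(block size in bytes), assumed */
#define INDEX_BITS  6u    /* log2(number of sets), assumed      */

static inline uint32_t block_offset(uint32_t addr)
{
    return addr & ((1u << OFFSET_BITS) - 1);             /* low bits  */
}

static inline uint32_t set_index(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

static inline uint32_t addr_tag(uint32_t addr)
{
    return addr >> (OFFSET_BITS + INDEX_BITS);           /* high bits */
}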
Basic Identification Algorithm

//Search the cache directory for the tag

if "hit" then
       Use offset to fetch data from cache RAM
else
       //access main memory

       if "hit" then
              Store data (and block) in cache and
              pass data to CPU
       else
              Do context switch (while processing the page fault)
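
A hedged C rendering of the same algorithm for a 4-way set-associative cache; it reuses the hypothetical dir_entry and block_frame structures, the NUM_SETS constant, and the field helpers sketched on the earlier slides, and reduces main-memory access and the page-fault path to a comment:

#define ASSOCIATIVITY 4

struct dir_entry   directory[NUM_SETS][ASSOCIATIVITY];
struct block_frame frames[NUM_SETS][ASSOCIATIVITY];

/* Returns 1 on a cache hit and writes the requested byte to *out. */
int cache_read(uint32_t addr, uint8_t *out)
{
    uint32_t set = set_index(addr);
    for (int way = 0; way < ASSOCIATIVITY; way++) {
        struct dir_entry *e = &directory[set][way];
        if (e->valid && e->tag == addr_tag(addr)) {        /* "hit" */
            *out = frames[set][way].data[block_offset(addr)];
            return 1;
        }
    }
    /* Miss: fetch the block from main memory, install it in the cache,
       then pass the data to the CPU; if main memory misses too, a page
       fault is taken and the CPU context-switches while it is serviced. */
    return 0;
}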
Block Replacement

When a miss occurs, the cache controller must
select a block to be replaced with the desired
data. A replacement policy determines which
block should be replaced. With direct-mapped
placement the decision is simple because
there is no choice: only one block frame is
checked for a hit and only that block can be
replaced.

With fully-associative or set-associative
placement, there is more than one block to choose
from on a miss. The two primary replacement
strategies are Random and Least-Recently Used (LRU).
Block Replacement

Other strategies:
First In First Out (FIFO)
Most Recently Used (MRU)
Least-Frequently Used (LFU)
Most-Frequently Used (MFU)
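
A minimal counter-based LRU sketch for a single 4-way set (one common representation, not the only one): every block's age counter is incremented on each access, the block that is touched has its counter reset, and the block with the largest counter is the least recently used victim.

#include <stdint.h>

#define WAYS 4

static uint32_t age[WAYS];            /* one age counter per way */

/* Call on every access that hits `way`: it becomes most recently used. */
void touch_lru(int way)
{
    for (int w = 0; w < WAYS; w++)
        age[w]++;
    age[way] = 0;
}

/* On a miss, evict the way with the largest age (least recently used). */
int choose_victim_lru(void)
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (age[w] > age[victim])
            victim = w;
    return victim;
}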
Interaction Policies with Main
Memory

Reads dominate processor cache accesses. All
instruction accesses are reads, and most
instructions do not write to memory. The
block can be read at the same time that the
tag is read and compared, so the block read
begins as soon as the block address is
available. If the read is a miss, there is no
benefit - but also no harm; just ignore the
value read.

The read policies are:
Read Through - the word is read from main memory and
passed directly to the CPU.
No Read Through - the block is first read from main
memory into the cache, and the word is then passed
from the cache to the CPU.
Interaction Policies with Main
Memory

The write policies on a write hit often
distinguish cache designs:
Write Through - the information is written to
both the block in the cache and to the block in
the lower-level memory.
Advantages:
 - read miss never results in writes to main memory
 - easy to implement
 - main memory always has the most current copy of the data
Write Back - the information is written only to the
block in the cache; the modified block is written to
main memory only when it is replaced.
Ordering of Bytes in Memory

• There are two different conventions for


ordering the bytes within a word.
– Little Endian (followed by DEC and Intel)
– Big Endian (followed by IBM, Motorola and others)
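
A small self-contained check of the convention a machine follows: storing the 32-bit word 0x01020304 and inspecting the byte at the lowest address prints 0x04 on a little-endian machine and 0x01 on a big-endian one.

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t word = 0x01020304;
    uint8_t *bytes = (uint8_t *)&word;   /* view the word byte by byte */

    printf("lowest-addressed byte: 0x%02x\n", bytes[0]);
    return 0;
}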
Interaction Policies with Main
Memory

There are two common options on a write
miss:
Write Allocate - the block is loaded on a write
miss, followed by the write-hit action.
No Write Allocate - the block is modified in
the main memory and not loaded into the
cache.

Although either write-miss policy could be
used with write through or write back, write-back
caches generally use write allocate (hoping that
subsequent writes to the block will be captured by the
cache), while write-through caches often use no write
allocate (since subsequent writes to the block must
still go to lower-level memory).
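
A rough sketch of the two common pairings; the helper functions below are hypothetical placeholders for the cache controller's datapath, invented only to show where each write goes:

#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helpers (not defined here) standing in for the controller. */
bool cache_hit(uint32_t addr);
void cache_store(uint32_t addr, uint8_t value);
void cache_fill_from_memory(uint32_t addr);
void mark_dirty(uint32_t addr);
void memory_store(uint32_t addr, uint8_t value);

/* Write through + no write allocate: memory is always updated;
   on a miss the block is not brought into the cache. */
void write_through_no_allocate(uint32_t addr, uint8_t value)
{
    if (cache_hit(addr))
        cache_store(addr, value);
    memory_store(addr, value);
}

/* Write back + write allocate: on a miss the block is loaded first,
   then only the cache copy is updated; memory sees the new data when
   the dirty block is eventually replaced. */
void write_back_allocate(uint32_t addr, uint8_t value)
{
    if (!cache_hit(addr))
        cache_fill_from_memory(addr);
    cache_store(addr, value);
    mark_dirty(addr);
}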
Pipelining

Pipelining is an implementation technique
where multiple instructions are overlapped in
execution.

The computer pipeline is divided in stages.
Each stage completes a part of an instruction
in parallel.

The stages are connected one to the next to
form a pipe - instructions enter at one end,
progress through the stages, and exit at the
other end.
Pipelining

Pipelining does not decrease the time for
individual instruction execution. Instead, it
increases instruction throughput.

The throughput of the instruction pipeline is
determined by how often an instruction exits
the pipeline
Pipelining

Because the pipe stages are hooked together,
all the stages must be ready to proceed at the
same time.

We call the time required to move an
instruction one step further in the pipeline a
machine cycle .

The length of the machine cycle is determined
by the time required for the slowest pipe
stage
Pipelining

The pipeline designer's goal is to balance the
length of each pipeline stage. If the stages are
perfectly balanced, then the time per
instruction on the pipelined machine is

   Time per instruction on nonpipelined machine
   ---------------------------------------------
               Number of pipe stages

Under these conditions, the speedup from pipelining
equals the number of pipe stages. Usually, however,
the stages are not perfectly balanced, and pipelining
itself involves some overhead, so the time per
instruction on the pipelined machine does not reach
this minimum and the speedup is somewhat less than
the number of stages.
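
For example (figures chosen only for illustration), if a nonpipelined machine needs 10 ns per instruction and the pipeline has 5 perfectly balanced stages, the pipelined machine ideally completes an instruction every 10 ns / 5 = 2 ns, a speedup of 5.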
Addressing Modes

While most early machines used stack or
accumulator-style architectures, all machines
designed in the past ten years use a general-purpose
register architecture. The reasons are that registers:
– are faster than memory
– are easier for a compiler to use
– can be used more effectively
Addressing Modes

The following are some of the main
Addressing Modes:
– Register : Add R4,R3 : R4 <- R4 + R3
– Immediate : Add R4, #3 :R4 <- R4 + 3
– Displacement : Add R4, 100(R1) : R4 <- R4 +
M[100+R1]
– Register deferred : Add R4,(R1) : R4 <- R4 + M[R1]
– Indexed : Add R3, (R1 + R2) : R3 <- R3 + M[R1+R2]
– Direct : Add R1, (1001) : R1 <- R1 + M[1001]
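
As a rough, compiler-dependent illustration of where these modes typically appear, the C fragment below notes the mode a compiler for a register machine might pick for each access (the mapping is typical, not guaranteed):

/* Illustrative only: typical addressing-mode choices for C constructs. */
int example(int *a, int i, int x)
{
    int t = x + 3;  /* immediate: the constant 3 is encoded in the instruction  */
    t += a[i];      /* indexed or displacement: base register plus index/offset */
    t += *a;        /* register deferred: the address is held in a register     */
    return t;       /* the scalars themselves usually live in registers         */
}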
Encoding of Addressing Modes

How the addressing modes of operands are
encoded depends on:
– the range of addressing modes
– the degree of independence between opcodes and modes

For a small number of addressing modes or
opcode/addressing-mode combinations, the addressing
mode can be encoded in the opcode. For a larger number
of combinations, typically a separate address specifier
is needed for each operand; the VAX, for example,
attaches an address specifier to every operand, while
most load/store machines encode the mode in the opcode.
Classification of Instruction Sets

The instruction sets can be differentiated by
Operand storage in the CPU
Number of explicit operands per instruction
Operand location
Operations
Type and size of operands
Classification of Instruction Sets

The type of internal storage in the CPU is the
most basic differentiation.

The major choices are

a stack (the operands are implicitly on top of
the stack)

an accumulator (one operand is implicitly the
accumulator)

a set of registers (all operands are explicit, either
registers or memory locations)
