
EECS 366: Computer Architecture

Instructor: Shantanu Dutt Department of EECS University of Illinois at Chicago

Lecture Notes # 16

Memory Organization
© Shantanu Dutt, UIC

Memory Hierarchy Design

Many programs need large amounts of memory as the size of the problems they solve increases. To solve a problem quickly, fast access is needed to all this data

One solution is, of course, to build very large fast memory units capable of storing 1000s of MBytes. As we saw, fast memory (static memory, for example) consumes too much VLSI area and power, so a large memory of this kind is impractical to realize

Furthermore, even if it were feasible to build large amounts of fast memory, it is well known that access to a memory gets slower as the memory gets larger

Fortunately, there is a way out! Because of the locality properties of most programs, it is not necessary to have large amounts of fast memory for quick access to large amounts of data: (1) Temporal Locality: An item just referenced will be referenced again soon. (2) Spatial Locality: When an item is referenced, nearby items in memory will also be referenced soon.
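As a toy illustration of these two properties (a hypothetical address trace, not from these notes), consider the address stream of a loop that sums an array: the loop counter's address is touched on every iteration (temporal locality), while the array elements are touched at consecutive addresses (spatial locality).

```python
# Toy address trace of a loop summing an array.
# The counter variable i lives at one fixed address (temporal locality);
# the array elements occupy consecutive addresses (spatial locality).

ARRAY_BASE = 1000   # assumed address of a[0]
I_ADDR = 0          # assumed address of the loop counter

trace = []
for i in range(8):
    trace.append(I_ADDR)          # read/update loop counter
    trace.append(ARRAY_BASE + i)  # read a[i]

# Temporal locality: the counter's address repeats throughout the trace.
temporal_hits = trace.count(I_ADDR)
# Spatial locality: consecutive array references differ by exactly 1 word.
array_refs = [a for a in trace if a >= ARRAY_BASE]
strides = [b - a for a, b in zip(array_refs, array_refs[1:])]
print(temporal_hits, strides)
```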


Memory Hierarchy Design (contd.)

What these locality properties mean is that programs use a physically contiguous block of data for some period of time before moving on to another block of data.

Thus we can build a very fast memory that is just large enough to store the small block of data that the program is currently working on; this is the 1st level of the memory hierarchy, and is the register file in the CPU.

The next block of data that the program will move to has to be retrieved from the next level of the memory hierarchy, which has the 2nd fastest and 2nd smallest memory unit; this is the cache

Note that just like there is locality for individual data items (words), there is also locality between small blocks and between groups of these small blocks (larger blocks), and so on.

Thus more levels are required that hold larger and larger blocks until the last level holds the entire data: The 3rd level is main memory and the 4th level is secondary/disk storage.

Block size gets larger as one goes down the hierarchy mainly because the access time to the lower level increases, and thus we need to spread this access time over more words.
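A quick sketch of this amortization (the numbers below are assumed for illustration): only the 1st word of a block pays the full access time of the lower level, while the rest stream out at the transfer rate, so the effective cost per word drops as the block grows.

```python
# Per-word cost of fetching a B-word block from a lower level with
# access time t_access (to the 1st word) and transfer rate TR words
# per cycle for the remaining words. Numbers are illustrative.

def per_word_cost(t_access, B, TR):
    return (t_access + (B - 1) / TR) / B

costs = [per_word_cost(t_access=100, B=b, TR=1.0) for b in (1, 4, 16, 64)]
print(costs)
```

The cost per word falls monotonically here, which is why levels with slower access (MM, disk) use larger blocks (pages).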

Memory Hierarchy Design (contd.)

In principle, there can be many levels in the memory hierarchy, as shown below.

[Figure: The Memory Hierarchy. Levels toward the top are faster and more expensive; levels toward the bottom are slower and less expensive.]


Memory Hierarchy Design (contd.)

An upper level generally contains a subset of the data contained in the next lower level, and the data of all these levels belong to the same memory address space

An exception is the register level, all of whose data may not be contained in the cache at all times. Also, the register file is not part of the memory address space: registers are addressed by a separate address that pertains to the register file only, and data transfers between the register file and the lower levels are handled explicitly by the program using LOADs and STOREs

The rest of the levels share a common memory address space, and data transfers between them are automatic and transparent to the program; they are handled either by hardware (cache/main-mem. hierarchy) or by the operating system (main-mem./secondary-storage hierarchy)


Memory Hierarchy Design (contd.)

General Definitions and Principles of Memory Hierarchy

Consider any 2 adjacent levels in the memory hierarchy:

Block: Minimum amount of data (in # of words) that can be transferred between the 2 levels

[Figure: blocks of level 1 are grouped into the larger blocks of level 2, which are in turn grouped into the still larger blocks of level 3.]

Hit rate: Fraction of memory accesses to the upper level (of the 2-level sub-hierarchy) that are found in that level; denoted by h

Miss rate: Fraction of accesses that are not found in the upper level; denoted by m = 1 - h

Hit time: Time taken to access a block in the upper level; denoted by t_hit


General Definitions and Principles of Memory Hierarchy (contd.) Consider any 2 adjacent levels in the memory hierarchy:

Miss penalty: Time to replace a block in the upper level by a needed block that is not in that level. Since there can be hits or misses at the lower levels too when obtaining the required block, the miss penalty obeys the recursion

    miss_penalty(i) = t_repl(i+1 -> i) + m_(i+1) * miss_penalty(i+1)

so the miss penalty for the upper-most level (level 1) is given by:

    miss_penalty(1) = t_repl(2 -> 1) + m_2 * (t_repl(3 -> 2) + m_3 * (t_repl(4 -> 3) + ...))

where m_i is the miss rate in level i, and t_repl(i+1 -> i) is the block replacement time from level i+1 to level i. The average memory access time t_av for the CPU is given by

    t_av = t_hit(1) + m_1 * miss_penalty(1)

The block replacement time t_repl = access time (time to access the 1st word of the block in the lower level) + transfer time (time to send the remaining B - 1 words), i.e., t_repl = t_access + (B - 1)/TR, where B is the block size in the upper level and TR is the transfer rate (in words per unit time) from level i+1 to level i.



For example, there is an initial time required to search for the block/page location in main memory (MM), and further, due to refreshing, the average time to access a word of MM is larger than the raw access time, as we saw earlier. The initial access time to MM is thus this search time plus the refresh-adjusted word access time. However, the entire row is stored in the row register after spending the initial access time on the 1st word, and the required block is part of this row. Thus the rest of the words in the block can be sent in approximately the row-register read time per word; this per-word time determines the transfer rate TR above.


Example: There are 3 levels in the memory hierarchy: cache, MM, secondary storage, with the hit times t_hit(i) (in clock cycles, ccs) and miss rates m_i of each level given, along with cache block size = 4 words and MM page size = 2K words. Since the last level holds all the data, its miss rate is 0, and the average time taken by the CPU to access a word is obtained by expanding the recursion:

    t_av = t_hit(1) + m_1 * (t_repl(2 -> 1) + m_2 * t_repl(3 -> 2))
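The same expansion can be checked numerically. The parameter values below are assumed for illustration (they are not the figures of the original example):

```python
# Average memory access time for a 3-level hierarchy (cache, MM, disk).
# All numeric values below are illustrative assumptions, not the
# original example's parameters.

def t_repl(t_access, block_size, transfer_rate):
    """Block replacement time: access the 1st word, then stream the rest."""
    return t_access + (block_size - 1) / transfer_rate

# Hypothetical parameters (times in clock cycles, ccs)
t_hit_cache = 1          # cache hit time
m_cache     = 0.05       # cache miss rate
t_repl_mm   = t_repl(t_access=20, block_size=4, transfer_rate=1.0)  # MM -> cache
m_mm        = 0.001      # MM miss (page-fault) rate
t_repl_disk = t_repl(t_access=1_000_000, block_size=2048,
                     transfer_rate=0.5)                             # disk -> MM

# Expand the recursion: t_av = t_hit(1) + m_1*(t_repl(2->1) + m_2*t_repl(3->2))
t_av = t_hit_cache + m_cache * (t_repl_mm + m_mm * t_repl_disk)
print(round(t_av, 3))
```

Note how even a tiny page-fault rate contributes heavily to t_av, because the disk replacement time is so large.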

General Definitions and Principles of Memory Hierarchy (contd.) Consider any 2 adjacent levels in the memory hierarchy: Addressing:

Generic address: | Block frame address (Block # or Page #) | Block offset addr. (Word #) |

Cache/main-mem. hierarchy: a 32-bit address splits into a Block # (28 bits, bits 31-4) and a Word # (bits 3-0); block size is 16 words.

Main-mem./sec. storage hierarchy (virtual addr.): a 32-bit virtual address splits into a Page # (22 bits, bits 31-10) and a Word # (bits 9-0); page size is 1K words. Translation maps the 22-bit virtual page # to a 14-bit physical page # (bits 23-10 of a 24-bit physical address). The corresponding physical address seen by the cache then splits into a Block # (20 bits, bits 23-4) and a Word # (bits 3-0).


General Denitions and Principles of Memory Hierarchy (contd.) Effect of Block Size:

The larger the block size, the better the anticipation of nearby items to be referenced soon (spatial locality)

However, beyond a certain block size, the concept of spatial locality is stretched. Note that while a program may access almost all items in a small or medium-size block, it later accesses a random next block, not necessarily the one following the current one: spatial locality is punctuated by random accesses (e.g., due to branches)

Thus a large block will contain many useless data items that the program will not access in the near future. Since the space in the upper level is limited, the larger the block size, the smaller the # of blocks. Hence the miss rate increases when the next random block is accessed by the program



Effect of Block Size (contd.)

[Figure: (a) Program structure: the program repeatedly works on a 16-word region A, then on another 16-word region C, and then returns to A (the loop-back probabilities shown in the figure are 0.9/0.1 and 0.95/0.05). (b) Miss pattern with block size = 32 words in a 32-word cache: the initial access to A misses and loads A&B; the next access to C misses and loads C&D, evicting A&B; the return to A misses and reloads A&B, and so on: 2 misses per iteration. (c) Miss pattern with block size = 16 words: the initial accesses to A and C each miss once, but then A and C reside in the cache together, so the next accesses to A and to C both hit: 0 misses per iteration.]
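This miss pattern can be reproduced with a small LRU cache simulator (a sketch; the 32-word capacity and the two 16-word working sets A and C follow the figure, while the simulator itself is illustrative):

```python
from collections import OrderedDict

def simulate(block_size, capacity_words=32, iterations=100):
    """Fully-associative LRU cache of capacity_words; the program
    alternates between two 16-word regions A (words 0-15) and
    C (words 32-47), touching every word of a region before switching."""
    cache = OrderedDict()               # block # -> True, LRU order (front = LRU)
    n_blocks = capacity_words // block_size
    misses = 0
    for _ in range(iterations):
        for base in (0, 32):            # work on A, then on C
            for w in range(16):
                blk = (base + w) // block_size
                if blk in cache:
                    cache.move_to_end(blk)          # hit
                else:
                    misses += 1                     # miss: load block
                    cache[blk] = True
                    if len(cache) > n_blocks:
                        cache.popitem(last=False)   # evict LRU block
    return misses

print(simulate(32), simulate(16))
```

With 32-word blocks, A&B and C&D keep evicting each other (2 misses per iteration); with 16-word blocks, only the 2 initial compulsory misses occur.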



Effect of Block Size (contd.)

t_av = hit_time + (miss_rate) * (miss_penalty), where t_av is the average memory access time.

[Figure: three plots against block size. (1) Miss rate: decreases with block size at first (better spatial locality), then increases past the "pollution point". (2) Miss penalty: access time plus transfer time, which grows with block size. (3) Average access time t_av: has a minimum, and its increase happens earlier than in the "miss rate" plot, since the growing miss penalty multiplies the miss rate.]
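A sketch of this trade-off, with an assumed miss-rate model (the model and all numbers are illustrative, not from the notes): the miss rate falls like 1/B from spatial locality but rises linearly past the pollution point, while the miss penalty grows linearly with B, so t_av is minimized at an intermediate block size.

```python
# t_av = hit_time + miss_rate(B) * miss_penalty(B), minimized over B.
# miss_rate is an assumed U-shaped model; miss_penalty follows
# t_access + (B-1)/TR. All constants are illustrative.

def miss_rate(B):
    return 0.04 / B + 0.0005 * B        # locality gain vs. pollution

def miss_penalty(B, t_access=100, TR=1.0):
    return t_access + (B - 1) / TR      # access + transfer time

def t_av(B, hit_time=1):
    return hit_time + miss_rate(B) * miss_penalty(B)

best = min((1, 2, 4, 8, 16, 32, 64, 128), key=t_av)
print(best, round(t_av(best), 3))
```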



General Denitions and Principles of Memory Hierarchy (contd.)

What the CPU does on a miss in the upper level: (1) If the miss penalty is a few 10s of clock cycles (ccs), then the CPU waits (e.g., cache miss). (2) If the miss penalty is 100s to 1000s of ccs (as in a main-memory miss or page fault), the CPU is interrupted on a miss, and another process starts executing. When the requested block is brought in, this is noted in the previous process's status, so that it can start re-executing at a later stage (when the current process is done or also has a miss)

Block transfer mechanism: (1) Done in hardware for a few 10s of ccs penalty (cache). (2) Done in software (the O.S. could do this) for a main-mem. miss: the O.S. sets up the appropriate disk interface for a DMA and releases the CPU; the CPU executes another process while the transfer from disk to main-mem. takes place simultaneously



Some Basic Issues in Memory Hierarchies

Again we consider 2 adjacent levels of the hierarchy: 1. Block Placement: Where can a block be placed in the upper level? 2. Block Identification: How is a block found in the upper level? 3. Block Replacement: Which block to replace during a miss? 4. Write Strategy: What happens on a write to the upper level, and how is it percolated to the lower level?



Some Basic Issues in Memory Hierarchies (contd.)

(1) Block Placement: Fully Associative (FA): Can place anywhere; have to look everywhere

Set Associative (SA): The upper level is divided into s sets, each containing k blocks (k-way set associative). A block with block # b is placed only in set (b mod s); it can be placed anywhere in this set

Direct Mapped (DM): The upper level is divided into N blocks, and a block with block # b is placed only in block frame (b mod N); N is generally a power of 2, say N = 2^r. We will need to look at only 1 block position for the required block.
[Figure: an upper level of 8 block frames (Bl. # 0-7) and a lower level of 32 blocks (Bl. # 0-31). Fully associative (FA): block 14 can go anywhere. Direct mapped (DM): block 14 can only go into frame 14 mod 8 = 6. 2-way set associative (SA), 4 sets: block 14 can go anywhere in set 14 mod 4 = 2.]

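The three placement policies can be sketched as one mapping from block # to candidate frames (a sketch; FA is the special case s = 1 and DM the case s = N, with contiguous frame numbering within a set assumed for illustration):

```python
# Where block # b may be placed in an upper level of N block frames
# organized as s sets of k = N // s frames each (k-way set associative).

def candidate_frames(b, N, s):
    """Return the list of frame #s where block b may reside."""
    k = N // s                 # set size (associativity)
    set_no = b % s             # the set the block maps to
    return list(range(set_no * k, set_no * k + k))

N = 8
print(candidate_frames(14, N, s=1))  # FA: anywhere
print(candidate_frames(14, N, s=4))  # 2-way SA: set 14 mod 4 = 2
print(candidate_frames(14, N, s=8))  # DM: frame 14 mod 8 = 6
```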



Some Basic Issues in Memory Hierarchies (1) Block Placement (contd.):

FA and DM are special cases of set-associative. In FA, there is only one set containing all the blocks. In DM, there are as many sets as blocks, each set containing exactly 1 block

FA has the most flexibility in placing a block, while DM has the least



Some Basic Issues in Memory Hierarchies (contd.) (2) Block Identification:

Associative or content-addressable memory (CAM): stores the block #s, or tags, of the resident blocks for each set. The index, which is the rightmost bits of the block #, determines which set of the CAM to search for the rest of the block # (the tag). This is generally used in the cache/main-mem. hierarchy.
[Figure (a): Block identification in different cache types, searching for block # 14 in an 8-block upper level. FA: search everywhere (all 8 tags compared in parallel). DM: search only in tag position 14 mod 8 = 6. 2-way SA: search everywhere within set 14 mod 4 = 2 (both tags of the set compared in parallel). Search is performed in parallel in FA and SA caches for speed.]

(b) Different portions of an address: | Tag | Index | Block offset / Word # |, where the tag and index together form the block #. The index (block # mod s) is used to select the set (in DM and SA), the tag is used to check all blocks in the "indexed" set, and the word # is used to select the word within the block
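A sketch of this address split (the field widths here are illustrative): with 2^l sets and a w-bit block offset, the index is the middle l bits and the tag is the remaining upper bits.

```python
# Split an address into tag, index (set #), and word # (block offset).
# l = index bits (2**l sets), w = offset bits (2**w words per block).

def split_address(addr, l, w):
    word  = addr & ((1 << w) - 1)          # block offset / word #
    index = (addr >> w) & ((1 << l) - 1)   # set # = block # mod 2**l
    tag   = addr >> (w + l)                # rest of the block #
    return tag, index, word

# Example: word 5 of block # 14, block size 16 words (w = 4), 4 sets (l = 2)
addr = (14 << 4) | 5
tag, index, word = split_address(addr, l=2, w=4)
print(tag, index, word)   # index 2 matches set 14 mod 4 = 2 in the figure
```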



Some Basic Issues in Memory Hierarchies (contd.) (2) Block Identification: CAMs

[Figure: Structure of a CAM (fully-associative cache). The tag store holds one tag plus a valid bit per block; the search logic replaces a regular decoder. One equality comparator per entry compares the stored tag with the incoming tag, all in parallel (output 1: equal, 0: not equal); the valid bit is ANDed with the output of the corresponding equality comparator. The matching entry selects its block (16 words/block here) in the data store, and the word # then selects the desired word. The equality comparator for two inputs x and a compares each bit pair (x7 a7, ..., x0 a0) and ANDs the results.]


Some Basic Issues in Memory Hierarchies (contd.) CAMs: Hardware Complexity: The parallel search logic of a FA cache has size Theta(m * 2^r), where 2^r is the size of the cache in blocks and m is the # of bits in the block #. This can be prohibitive for large r and m.

For a SA cache, we have one such CAM of 2^(r-l) tags of (m - l) bits for each of the 2^l sets, so the total CAM size is 2^r * (m - l) bits. However, there is only one parallel search logic, of size 2^(r-l) comparators, which is used to search only the indexed set.

[Figure: a set-associative cache with m = 20, l = 5, r = 10: cache size = 2^r = 1024 blocks, # of sets = 2^l = 32, set size = 2^(r-l) = 32 blocks. The 24-bit address splits into tag (m - l = 15 bits, bits 23-9), index (5 bits, bits 8-4), and word # (bits 3-0); block # = 20 bits. An l-to-2^l = 5-to-32 decoder selects the indexed set in the tag store; the 32 tags (15 bits each) of that set are compared in parallel by the search logic; and a 2^(r-l)-to-1 = 32-to-1 mux selects the matching data block (16 words = 512 bits) from the data store.]

There is only one equality comparator in a DM cache; thus its complexity is Theta(m - l).

Time complexity of search: since all the comparators operate in parallel, the search time is essentially one comparator delay, which is logarithmic in the # of tag bits compared: roughly Theta(log m) for FA, and Theta(log(m - l)) (plus the set-decoder delay) for SA and DM.



Some Basic Issues in Memory Hierarchies (2) Block Identification (contd.):

Lookup table: Stores the tags, also by sets, as in the CAM. However, this is a regular kind of memory, of the same technology as the upper level. Thus 2 memory accesses to the upper level are required to get a word from there. This is generally used in the main-mem./sec. storage hierarchy.

Table size is proportional to the total size in blocks of the lower level. This is different from the CAM case, in which the lookup table's size is the size in blocks of the upper level.

[Figure: Lookup Table. The block # of the address directly indexes the table (entries 0, 1, 2, ..., 14, 15, 16, ...); each entry holds a present bit, a dirty bit, and the location of the block in the current level.]
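A minimal sketch of such a lookup (page) table for the MM/sec.-storage hierarchy: a plain array indexed by virtual page #, each entry holding a present bit, a dirty bit, and the page's location in MM. Field and function names are illustrative.

```python
# Sketch of a lookup table (page table): a regular memory array indexed
# directly by the block/page # of the address. Names are illustrative.

class Entry:
    def __init__(self):
        self.present = False   # is the page resident in MM?
        self.dirty = False     # has it been written since being loaded?
        self.frame = None      # its location (frame #) in MM, if present

NUM_PAGES = 32
page_table = [Entry() for _ in range(NUM_PAGES)]

def translate(vpage):
    """Return the MM frame #, or None on a page fault (present bit clear).
    Note this lookup is itself a memory access, hence 2 accesses per word."""
    e = page_table[vpage]
    return e.frame if e.present else None

page_table[14].present = True
page_table[14].frame = 3
print(translate(14), translate(15))
```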



Some Basic Issues in Memory Hierarchies (contd.)

(3) Block Replacement Policy: Which block in the set should be replaced? There is no choice in a DM cache, so the question applies to FA and SA caches. The following policies can be used for each set; all of them exploit temporal locality to predict which block will be accessed furthest in the future and is thus the best candidate for replacement.

Least Frequently Used (LFU): Note the # of times each block has been used over some window of time and replace the one used the least # of times. Most expensive to implement

Least Recently Used (LRU): Keep the blocks in each set ordered by the time of their most recent use. Whenever a block is accessed in the set, move it to the top of the list. Replace the block at the bottom. 2nd most expensive, but best performance
[Figure: Implementation of the LRU scheme: the tags are kept in a queue with the LRU block at one end and the MRU block at the other; on an access, the accessed tag moves to the MRU end and the intervening tags shift by one position toward the LRU end. LRU is performed over the entire cache for FA, or within the accessed set for SA.]
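A minimal sketch of per-set LRU using an ordered list of tags, as in the figure (the class and method names are illustrative):

```python
from collections import OrderedDict

# LRU for one set of a SA cache: tags kept in use order, with the LRU
# tag at the front and the MRU tag at the back. An accessed tag moves
# to the MRU end; the LRU end is evicted when the set overflows.

class LRUSet:
    def __init__(self, k):
        self.k = k                       # set size (associativity)
        self.tags = OrderedDict()        # front = LRU, back = MRU

    def access(self, tag):
        """Record an access; return the evicted tag, or None."""
        if tag in self.tags:
            self.tags.move_to_end(tag)   # hit: move to MRU end
            return None
        self.tags[tag] = True            # miss: insert at MRU end
        if len(self.tags) > self.k:
            victim, _ = self.tags.popitem(last=False)   # evict LRU tag
            return victim
        return None

s = LRUSet(k=2)
evictions = [s.access(t) for t in (2, 6, 2, 4)]
print(evictions)
```

Accessing 2 again moves it to the MRU end, so the later miss on 4 evicts 6, not 2.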


Some Basic Issues in Memory Hierarchies (contd.) (3) Block Replacement Policy (contd.):

Not Recently Used (NRU): Just point to the block used most recently. Replace any of the other blocks. 3rd most expensive in hardware and time, and worst performance

Random: Randomly choose any block to replace. Least expensive (especially in time; have to do this only when there is a miss) to implement, and 3rd best performance (after LRU)



Some Basic Issues in Memory Hierarchies (contd.) (4) Write Strategy: What happens on a write?

On a write hit: 1. Write Back: Write to lower level when block is replaced and if its dirty bit is set. This bit is set whenever we write to a block in the upper level. This is generally used when access time to lower level is high. 2. Write Through: Write to both levels simultaneously thus keeping them always consistent.

On a write miss: 1. Write Allocate: Load the block written to into the upper level. Again, this is generally done when the access time to the lower level is high. 2. No Write Allocate: The block is not loaded into the upper level. The rationale is that reads and writes do not have the same sphere of spatial locality, and, as explained later, the CPU generally does not have to wait for writes (i.e., STOREs)

The combinations generally used on write hit/miss are 1/1 and 2/2. The latter is used mainly for the cache/main-mem. hierarchy, and the 1/1 combination for the main-mem./sec. storage hierarchy (because of the larger access time)
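A minimal sketch of the 1/1 combination (write back + write allocate) for a single block: writes only set the dirty bit, and the lower level is updated when a dirty block is evicted. The class and function names are illustrative.

```python
# Write back + write allocate, sketched for one block.
# The lower level is touched only on allocation and on dirty eviction.

class Block:
    def __init__(self, data):
        self.data = data
        self.dirty = False

lower_level = {0: 10}      # block # -> value held in the lower level
cache = {}                 # block # -> Block in the upper level

def write(block_no, value):
    if block_no not in cache:                           # write miss:
        cache[block_no] = Block(lower_level[block_no])  # write allocate
    cache[block_no].data = value
    cache[block_no].dirty = True       # defer the lower-level update

def evict(block_no):
    blk = cache.pop(block_no)
    if blk.dirty:                      # write back only if dirty
        lower_level[block_no] = blk.data

write(0, 99)
print(lower_level[0])      # lower level not yet updated
evict(0)
print(lower_level[0])      # updated by the write-back
```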


More About Caches Made from SRAMs

Sources of cache misses: (1) Compulsory: the 1st access to a block will result in a miss (cold-start miss). (2) Capacity: the cache cannot contain all the blocks needed during a program's execution. (3) Conflict or Collision: occurs when (a) too many referenced blocks map to the same set, and/or (b) the set size is very small (e.g., in DM caches)



Sources of cache misses (contd.)

[Figure: a 2-way SA cache of size 4 blocks (2 sets); the even block #s 2, 4, 6 map to set 0 and the odd block #s 3, 5, 7 to set 1.
Block # accessed:  2, 2, 3, 3, 5, ....., 6, 2, 6, 4, 4, 2, 7, ....., 3
Access class:      Cm, h, Cm, h, Cm, ..hs.., Cm, h, h, Cm, h, Cn, Cm, .hs., Cp
Block # replaced (using LRU in sets): -, -, -, -, -, ....., -, ....., 2, -, 6, 3, ....., 5
Global LRU block?  N, N, Y
Here h = hit, Cm = compulsory miss, Cn = conflict miss, and Cp = capacity miss.]





More About Caches (contd.) Effect of block size

[Figure: plot of the average memory access time t_av = t_hit + (miss rate) * (miss penalty) against block size.]



More About Caches (contd.)

Separate data and instruction caches. Can have different block sizes, capacities and associativities to optimize performance
