Topics
Types and hierarchy
Model-level organization
Cache memory
Performance considerations
Mapping
Virtual memory
Swapping
Paging
Segmentation
Replacement policies
RAM is packaged as a chip. Basic storage unit is a cell (one bit per cell). Multiple RAM chips form a memory.
SRAM: Retains its value indefinitely, as long as it is kept powered. Relatively insensitive to disturbances such as electrical noise. Faster and more expensive than DRAM.
DRAM: Each cell stores a bit with a capacitor and a transistor. The value must be refreshed every 10-100 ms. Sensitive to disturbances. Slower and cheaper than SRAM.
Relative cost: SRAM roughly 100x, DRAM 1x.
Traditional Architecture
Figure: the processor connects to memory through the MAR (memory address register) and MDR (memory data register); the memory has up to 2^k addressable locations.
Figure: reading a DRAM chip. The memory controller sends the row and column of the requested supercell over the addr lines; the selected supercell, e.g. supercell (2,1), is transferred to the CPU over an 8-bit data bus.
Memory Modules
Figure: a memory module built from eight DRAM chips (DRAM 0 ... DRAM 7). An address (row = i, col = j) selects supercell (i, j) in every chip; each chip contributes one byte (bits 0-7 up through bits 56-63), and the memory controller assembles them into a 64-bit doubleword.
Enhanced DRAMs
All enhanced DRAMs are built around the conventional DRAM core.
Nonvolatile Memories
DRAM and SRAM are volatile memories: they lose their information if the supply voltage is turned off.
Nonvolatile memories retain their value even with the power off. The generic name is read-only memory (ROM), which is misleading because some ROMs can be both read and modified.
Types of ROMs
Programmable ROM (PROM)
Erasable programmable ROM (EPROM)
Electrically erasable PROM (EEPROM)
Flash memory
Firmware: programs stored in ROM devices (e.g. a PC's BIOS).
Disk Geometry
Disks consist of platters, each with two surfaces. Each surface consists of concentric rings called tracks. Each track consists of sectors separated by gaps.
Figure: disk geometry. Each surface holds concentric tracks (e.g. track k); each track is divided into sectors separated by gaps; the platters rotate about a common spindle.
Disk Capacity
Capacity: maximum number of bits that can be stored.
Recording density (bits/in): number of bits that can be squeezed into a 1-inch segment of a track.
Track density (tracks/in): number of tracks that can be squeezed into a 1-inch radial segment.
Areal density (bits/in²): product of recording density and track density.
Modern disks partition tracks into disjoint subsets called recording zones. Each track in a zone has the same number of sectors, determined by the circumference of the innermost track; each zone has a different number of sectors/track.
Capacity = (# bytes/sector) × (avg # sectors/track) × (# tracks/surface) × (# surfaces/platter) × (# platters/disk)
Example:
512 bytes/sector 300 sectors/track (on average) 20,000 tracks/surface 2 surfaces/platter 5 platters/disk
Figure: disk operation. The read/write head at the end of the arm flies over the surface as the platters rotate about the spindle.
Seek time: time to position the heads over the cylinder containing the target sector. Typical T_avg_seek = 9 ms.
Rotational latency: time waiting for the first bit of the target sector to pass under the read/write head. T_avg_rotation = 1/2 × (1/RPM) × 60 s/1 min.
The set of available sectors is modeled as a sequence of b-sized logical blocks (0, 1, 2, ...). The mapping is maintained by a hardware/firmware device called the disk controller, which converts requests for logical blocks into (surface, track, sector) triples.
Figure: the speed gap. Disk seek time >> DRAM access time >> SRAM access time >> CPU cycle time.
Locality
Principle of Locality:
Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
Temporal locality: recently referenced items are likely to be referenced in the near future.
Spatial locality: items with nearby addresses tend to be referenced close together in time.
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
Locality Example:
Data: referencing array elements in succession (stride-1 reference pattern) gives spatial locality; referencing sum on each iteration gives temporal locality.
Instructions: referencing instructions in sequence gives spatial locality; cycling through the loop repeatedly gives temporal locality.
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
Fast storage technologies cost more per byte and have less capacity. The gap between CPU and main memory speed is widening. Well-written programs tend to exhibit good locality.
They suggest an approach for organizing memory and storage systems known as a memory hierarchy.
L1 cache holds cache lines retrieved from the L2 cache. L2 cache holds cache lines retrieved from main memory.
L3: main memory holds disk blocks retrieved from local disks.
L4: local disks hold files retrieved from remote storage.
Memory Hierarchy
Figure: memory hierarchy. The CPU connects to the cache, the cache to main memory (served by an I/O processor), and below that sit magnetic disks and magnetic tapes.
Cache Memory
Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device. Fundamental idea of a memory hierarchy:
For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
Programs tend to access the data at level k more often than they access the data at level k+1.
Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
Net effect: a large pool of memory that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
Figure 5.14. Use of a cache memory: processor, cache, and main memory. Key design issues: replacement algorithm, hit/miss handling, write-through vs. write-back, load-through.
Figure: the CPU, the cache (SRAM), and main memory joined by bus connections.
3. If the requested instruction is in the cache, it is fetched from the cache - a very fast operation.
4. If not, the CPU has to fetch the next instruction from main memory - a much slower process.
Cache Memory
Figure: cache memory with a 95% hit ratio. On a hit, the CPU is served from the fast cache; on a miss, the request goes on to main memory.
Figure: caching between adjacent levels. The larger, slower, cheaper storage device at level k+1 is partitioned into fixed-size blocks (0-15); the smaller level k cache holds copies of a subset of them. A request for block 14 hits at level k. A request for block 12 misses: block 12 must be fetched from level k+1 and placed at level k (shown replacing the stale block 4*).
The requested block b is not at level k, so the level k cache must fetch it from level k+1 (e.g., block 12). If the level k cache is full, some current block must be replaced (evicted). Which one is the victim?
Placement policy: where can the new block go? E.g., b mod 4.
Replacement policy: which block should be evicted? E.g., LRU.
Conflict miss
Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k. E.g., block i at level k+1 must be placed in block (i mod 4) at level k. Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block. E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss
Occurs when the set of active cache blocks (working set) is larger than the cache.
Examples: network buffer cache (parts of files), browser cache (web pages), web cache (web pages).
Performance Considerations
Overview
Two key factors: performance and cost
Price/performance ratio
Performance depends on how fast machine instructions can be brought into the processor for execution and how fast they can be executed.
For the memory hierarchy, it is beneficial if transfers to and from the slower units can be done at a rate close to that of the faster unit. This is not possible if both the slow and the fast units are accessed in the same manner, but it can be achieved when parallelism is used in the organization of the slower unit.
Interleaving
If the main memory is structured as a collection of physically separate modules, each with its own ABR (address buffer register) and DBR (data buffer register), memory access operations may proceed in more than one module at the same time.
Figure: two ways to interpret a main-memory (MM) address for 2^k modules, each with its own ABR and DBR (Module 0 ... Module 2^k − 1). Either the high-order k bits select the module and the remaining m bits give the address within the module, or the low-order k bits select the module, so that consecutive addresses lie in consecutive modules.
Ideally, the entire memory hierarchy would appear to the processor as a single memory unit with the access time of the on-chip cache and the size of a magnetic disk. How close it comes to this ideal depends on the hit rate (>> 0.9).
A miss causes extra time needed to bring the desired information into the cache.
Example:
Assume that 30 percent of the instructions in a typical program perform a read or write, so there are 130 memory accesses for every 100 instructions executed. Let h = 0.95 for instructions and h = 0.9 for data, with C = 10 clock cycles for a plain main-memory access and M = 17 clock cycles for a miss serviced by interleaved memory.
Time without cache: 130 × 10 = 1300 cycles.
Time with cache: 100 × (0.95 × 1 + 0.05 × 17) + 30 × (0.9 × 1 + 0.1 × 17) = 258 cycles.
Speedup: 1300 / 258 ≈ 5.04, so the computer with the cache performs about five times better.
Other Enhancements
Write buffer: the processor doesn't need to wait for a memory write to be completed.
Prefetching: data are fetched into the cache before they are needed.
Lockup-free cache: the processor can keep accessing the cache while a miss is being serviced.
Mapping
Figure: main memory locations 00000000 through 3FFFFFFF must be mapped into the much smaller cache.
Direct Mapping
Block j of main memory maps onto block j modulo 128 of the cache
Cache
tag tag Block 0 Block 1
tag
Block 127
Direct Mapping
Figure: direct-mapped lookup. For address 000 00500, the index 00500 selects the cache entry holding tag 000 and data 01A6; the address tag 000 matches the stored tag, so the lookup hits and returns 01A6. What happens when address = 100 00500? The same entry is selected, but tag 100 does not match 000, so the lookup misses. Other entries shown: index 00900 holds tag 080, data 47CC; index 01400 holds tag 150, data 0005.
Figure: the same lookup with a block size of 16 words, so consecutive word addresses (00500, 00501, ...) fall in the same block and share one tag.
Direct Mapping
Main memory address fields: Tag (5 bits) | Block (7 bits) | Word (4 bits).
Example: 11101,1111111,1100
Tag: 11101
Associative Mapping
Figure: associative mapping. Any main memory block (Block 0 ... Block 4095) can be loaded into any cache block (Block 0 ... Block 127); each cache block stores a tag identifying which memory block it holds.
Tag: 12 bits, identifying which of the 4096 main-memory blocks (4096 = 2^12) is resident in a given cache block.
Associative Memory
Figure: associative memory. The main-memory address acts as a key that is compared against every stored entry simultaneously; matching entries map addresses (e.g. 00012000, 08000000, 15000000) to cache locations.
Associative Mapping
Figure: associative lookup for address 00012000. The 30-bit key is compared against all stored keys in parallel; the entry for 00012000 matches and its 16-bit data, 01A6, is returned. Other entries shown: 15000000 holds 0005; 08000000 holds 47CC.
Associative Mapping
Main memory address fields: Tag (12 bits) | Word (4 bits).
Example: 111011111111,1100 → Tag = 111011111111, Word = 1100 = 12.
Set-Associative Mapping
Figure: 2-way set-associative cache. The 128 cache blocks are grouped into 64 sets of two (Set 0: Blocks 0-1, Set 1: Blocks 2-3, ..., Set 63: Blocks 126-127); main-memory block j (of 4096) may go in either block of set (j mod 64).
Address fields: Word, 4 bits: one of 16 words per block (16 = 2^4). Set, 6 bits: one of the 64 sets (128/2 = 64 = 2^6). Tag, 6 bits.
Set-Associative Mapping
Figure: 2-way set-associative lookup for address 000 00500 (20-bit address; each way stores a 10-bit tag and 16-bit data). The set index selects one set, whose two entries, e.g. (tag 000, data 01A6) and (tag 010, data 0721), are compared with the address tag in parallel. Tag 000 matches, so data 01A6 is returned; if neither tag matched, it would be a miss.
Set-Associative Mapping
Main memory address fields: Tag (6 bits) | Set (6 bits) | Word (4 bits).
Example: 111011,111111,1100
Tag: 111011
Set: 111111 = 63, the 63rd set of the cache.
Word: 1100 = 12, the 12th word of that set's block.
Replacement Algorithms
It is difficult to determine which block to kick out. A common choice is the least recently used (LRU) block: the cache controller tracks references to all blocks as computation proceeds, incrementing or clearing tracking counters as hits and misses occur.
Replacement Algorithms
For Associative & Set-Associative Cache
Which location should be emptied when the cache is full and a miss occurs?
First In First Out (FIFO)
Least Recently Used (LRU)
Valid Bit
Replacement Algorithms
Reference string A, B, C, A, D, E, A, D, C, F on a 4-block cache with FIFO replacement:

CPU reference:  A     B     C     A     D     E     A     D     C     F
Result:         Miss  Miss  Miss  Hit   Miss  Miss  Miss  Hit   Hit   Miss
Cache (FIFO):   A     AB    ABC   ABC   ABCD  EBCD  EACD  EACD  EACD  EAFD
Replacement Algorithms
The same reference string with LRU replacement (cache contents listed most- to least-recently used):

CPU reference:  A     B     C     A     D     E     A     D     C     F
Result:         Miss  Miss  Miss  Hit   Miss  Miss  Hit   Hit   Hit   Miss
Cache (LRU):    A     BA    CBA   ACB   DACB  EDAC  AEDC  DAEC  CDAE  FCDA