Cache Memory
Introduction:
In data processing systems, a cache memory or memory cache (sometimes also called a CPU cache) is a fast and relatively small memory, not visible to the software and handled entirely by the hardware, that stores the most recently used (MRU) main memory (MM, or working memory) data.
The function of the cache memory is to speed up MM data access (a performance increase) and, most importantly in multiprocessor systems with shared memory, to reduce the system bus and MM traffic, which is one of the major bottlenecks of these systems.
Cache memory makes use of fast SRAM (static random-access memory) technology, as opposed to the slower DRAM (dynamic random-access memory) used for the MM, and is connected directly to the processor(s).
The term "cache" is derived from the French and means "hidden".
This term has many meanings depending on the context. Examples are: disk
cache, TLB (translation look aside buffer) (Page Table cache), branch
prediction cache, branch history table, Branch Address Cache, trace cache, that are
physical memories. Others are handled by the software, to store temporary data in
reserved MM space (again disk cache (page cache), system cache, application
cache, database cache, web cache, DNS cache, browser cache, router cache, etc.).
Some of these last ones are actually only "buffers" , that is a non-associative
memory with sequential access (strings of data) against the random accesses through
an associative "memory-to-cache" address of a classic cache.
Temporal locality
- Data that have been used recently have a high likelihood of being used again.
A cache stores only a subset of the MM data: the most recently used (MRU) data. Data read from MM are temporarily stored in the cache. If the processor requires the same data again, they are supplied by the cache. The cache is effective because short instruction loops and routines are a common program structure, and generally several operations are performed on the same data values and variables.
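As a minimal illustration (the loop and variable names are invented), the snippet below keeps touching the same few locations on every iteration, which is exactly the reuse pattern the cache exploits:

```c
#include <stdio.h>

int main(void) {
    /* Temporal locality: sum and i live at fixed memory locations that
       are touched on every iteration, so after the first access they
       are served from the cache rather than from main memory. */
    long sum = 0;
    for (int i = 0; i < 1000; i++)
        sum += i;
    printf("sum = %ld\n", sum);
    return 0;
}
```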
Spatial locality
- If a data item is referenced, it is very likely that nearby data will be
accessed soon.
Instructions and data are transferred from MM to the cache in fixed-size
blocks (cache blocks), known as cache lines. Cache line size is in the
range of 4 to 512 bytes, so more than one data item (4/8 bytes) is
stored in each cache entry. After a first MM access, all the data of the
cache line are available in the cache.
Most programs are highly sequential. The next instruction usually comes
from the next memory location. Data is usually structured, and data in
these structures are normally stored in contiguous memory locations
(data strings, arrays, etc.).
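A sketch of that sequential pattern (the array size and the 64-byte line size are assumed values, not from the text):

```c
#include <stdio.h>

#define N 1024

int main(void) {
    static int a[N];   /* contiguous in memory */
    long sum = 0;
    /* Spatial locality: with a 64-byte cache line, one MM access
       brings in 16 ints, so the following 15 iterations hit in the
       cache before the next line must be fetched. */
    for (int i = 0; i < N; i++)
        sum += a[i];
    printf("sum = %ld\n", sum);
    return 0;
}
```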
Mapping Function
Blocks of words have to be brought in and out of the cache memory continuously, and the memory system has to quickly determine whether a given address is in the cache. There are three types of mapping between MM blocks and cache lines:
- Direct mapping
- Associative mapping
- Set-associative mapping
Direct Mapping
Direct mapping is simple and inexpensive to implement, but if a program repeatedly accesses two blocks that map to the same line, the cache begins to thrash, reloading the line back and forth over and over again, so misses are very high.
Block j of the main memory is mapped onto block j modulo 128 of the cache (consider a cache of 128 blocks of 16 words each). Each location in RAM has one specific place in the cache where the data will be held.
Consider the cache to be like an array: part of the address is used as an index into the cache to identify where the data will be held. Since a data block from RAM can only be in one specific line in the cache, it must always replace the block that was already there.
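A minimal sketch of such a lookup for the 128-block, 16-word cache described above (the struct layout and function name are invented for illustration):

```c
#include <stdint.h>

#define NBLOCKS 128   /* cache of 128 blocks of 16 words each */

struct line { int valid; unsigned tag; uint16_t data[16]; };
static struct line cache[NBLOCKS];

/* Block j of MM may only live in cache block j modulo 128, so the
   lookup indexes one line and compares a single tag. Returns 1 on a
   hit; on a miss the resident line must simply be replaced. */
int lookup_direct(unsigned mm_block) {
    struct line *l = &cache[mm_block % NBLOCKS];
    return l->valid && l->tag == mm_block / NBLOCKS;
}
```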
Associative mapping
In this type of mapping, an associative memory is used to store both the contents and the address of the memory word. This enables the placement of any word at any place in the cache memory. It is considered to be the fastest and the most flexible mapping form.
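For contrast with the direct-mapped sketch, a hedged sketch of a fully associative lookup, where any line may hold any block (again, all names are invented; real hardware compares every tag in parallel, not in a loop):

```c
#include <stdint.h>

#define NLINES 128

struct aline { int valid; unsigned tag; uint16_t data[16]; };
static struct aline cache[NLINES];

/* Fully associative: a block may sit in any line, so the tag must be
   compared against every line. The loop below is only a software
   stand-in for the hardware's parallel comparison. */
int lookup_assoc(unsigned mm_block) {
    for (int i = 0; i < NLINES; i++)
        if (cache[i].valid && cache[i].tag == mm_block)
            return i;    /* hit: line i holds the block */
    return -1;           /* miss: any line may be chosen as the victim */
}
```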
Set-associative mapping
This form of mapping is a modified form of direct mapping in which the disadvantage of direct mapping is removed. Set-associative mapping allows two or more words of the main memory to be present in the cache under the same index address.
The tag is used to see whether a desired word is in the cache; if there is no match, the block containing the required word must first be read from memory. For example, in MOVE $A815, D0, the 16-bit address $A815 splits as 10101 0000001 0101: a 5-bit tag, a 7-bit block number, and a 4-bit word offset (for the 128-block, 16-word cache above), as checked in the sketch below.
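The same 5/7/4 split can be verified with a few shifts and masks (a small sketch; the constants follow directly from the field widths above):

```c
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint16_t addr = 0xA815;   /* $A815 = 1010100000010101 in binary */
    printf("tag   = %u\n", (unsigned)(addr >> 11));         /* 10101   -> 21 */
    printf("block = %u\n", (unsigned)((addr >> 4) & 0x7F)); /* 0000001 -> 1  */
    printf("word  = %u\n", (unsigned)(addr & 0xF));         /* 0101    -> 5  */
    return 0;
}
```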
Advantage of direct mapping: it is simple and inexpensive to implement.
Disadvantage: it is not flexible; there is a contention problem even when the cache is not full. For example, block 0 and block 128 of main memory both map to block 0 of the cache, since
0 modulo 128 = 0 and 128 modulo 128 = 0.
Write Policy
Memory writes require special attention, because we have two copies of the data:
- a memory copy
- a cached copy
The write policy determines how a memory write operation is handled. There are two policies:
- Write-through: update both copies.
- Write-back: update only the cached copy; the memory copy must be taken care of later.
In other words, the cache's write policy determines how it handles writes to memory locations that are currently being held in the cache.
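A minimal sketch of the two policies (structure and function names are invented; only the update pattern matters):

```c
#include <stdint.h>

struct wline { int valid, dirty; unsigned tag; uint8_t data[16]; };

/* Write-through: update both copies on every write. */
void write_through(struct wline *l, unsigned off, uint8_t v, uint8_t *mm) {
    l->data[off] = v;   /* cached copy */
    *mm = v;            /* memory copy, updated immediately */
}

/* Write-back: update only the cached copy and mark it dirty; the
   memory copy is taken care of later, when the line is evicted. */
void write_back(struct wline *l, unsigned off, uint8_t v) {
    l->data[off] = v;
    l->dirty = 1;
}
```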
Replacement policies
When an MM block needs to be brought in while all the cache blocks are occupied, one of them has to be replaced. LRU replaces the block in the cache that has not been used for the longest time, i.e., the least recently used (LRU) block.
The optimal replacement is obviously the best but is not realistic, simply because when a block will be needed in the future is usually not known ahead of time. LRU is suboptimal but rests on the temporal locality of reference: memory items that were recently referenced are more likely to be referenced soon than those which have not been referenced for a longer time. FIFO is not necessarily consistent with LRU and is therefore usually not as good. Random selection, surprisingly, is not necessarily bad.
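One common way to realize LRU is a per-line timestamp; the sketch below (the names and the 4-way set size are assumptions) picks as victim the line with the oldest timestamp:

```c
#define WAYS 4

struct lru_line { int valid; unsigned tag; unsigned long last_used; };

/* The victim is the line whose last_used timestamp is the smallest,
   i.e., the least recently used line in the set. */
int lru_victim(const struct lru_line set[WAYS]) {
    int victim = 0;
    for (int i = 1; i < WAYS; i++)
        if (set[i].last_used < set[victim].last_used)
            victim = i;
    return victim;
}
```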
CACHE MEMORY: The cache memory system stores data used by a program and also the instructions of the program. The cache is organised as a 4-way set-associative cache, with each location containing 16 bytes or 4 double words of data.
Control register CR0 is used to control the cache with two new control bits not present in the 80386 microprocessor. The CD (cache disable) and NW (non-cache write-through) bits are new to the 80486 and are used to control the 8K-byte cache.
If the CD bit is a logic 1, all cache operations are inhibited. This setting is only used for debugging software and normally remains cleared. The NW bit is used to inhibit cache write-through operation. As with CD, cache write-through is inhibited only for testing. For normal operations, CD = 0 and NW = 0.
The cache is new to the 80486 microprocessor, and it is filled using burst cycles, which are not present on the 386.
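In CR0, CD is bit 30 and NW is bit 29. A hedged sketch of the corresponding masks follows (the helper names are invented, and the privileged move-to/from-CR0 instructions themselves are omitted):

```c
#include <stdint.h>

#define CR0_CD (1ul << 30)   /* cache disable */
#define CR0_NW (1ul << 29)   /* non-cache write-through */

/* Normal operation: CD = 0 and NW = 0 (cache and write-through on). */
uint32_t cr0_normal_operation(uint32_t cr0) {
    return cr0 & ~(CR0_CD | CR0_NW);
}

/* Debug/test setting: CD = 1 inhibits all cache operations. */
uint32_t cr0_disable_cache(uint32_t cr0) {
    return cr0 | CR0_CD;
}
```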
2000-2004
AMD Athlon:
In order to achieve a high hit rate, the Athlon uses a large L1 and L2 cache. Both are on chip and accessible in one clock cycle. The Athlon has an exclusive cache, so there is no duplication of data between the L1 and L2 caches. The L1 cache is divided into an instruction and a data cache, while the L2 cache contains blocks that were removed from the L1 cache due to a miss. The idea is that if a block of instructions or data was used once, it is likely to be used again; therefore, to minimize the penalty if it is needed again, it is placed in the L2 cache. This is based on the locality-of-reference property of computer programs.
Translation Look-aside Buffer
The translation look-aside buffer (TLB) is used to translate the virtual memory address specified by the program to the physical memory address before the cache can be accessed. This must be done because the memory data in the cache is addressed with physical addresses. Both the L1 instruction and data caches have a two-level TLB structure, but the data cache has a larger level-one TLB. The hit rate for the TLB is very high, around 99 percent, because a miss results in about a three-clock-cycle penalty per address (Source: 1).
Level 1 Instruction Cache
The level 1 instruction cache is 64 Kbytes, uses two-way set-associative mapping, and uses the least recently used (LRU) replacement algorithm. As instructions are loaded into the cache, some predecoding is done in order to find the boundaries of variable-length instructions and deal with unconditional branch instructions. This also helps increase the performance of the regular decoding phase. The TLB for the instruction cache is divided into two levels (Source: 4):
- Level 1: 24 entries, of which 16 map to 4-Kbyte pages and 8 map to 2- or 4-Mbyte pages, using associative mapping.
- Level 2: 256 entries that map to 4-Kbyte pages using 4-way set-associative mapping.
Level 1 Data Cache
The level 1 data cache is set up similarly to the instruction cache, except the level-one TLB has 32 entries using associative mapping, of which 24 map to 4-Kbyte pages. The data cache has two 64-bit access ports for loading and storing data, and it has multiple banks to permit several concurrent memory operations. In general, a load takes precedence over a store, and if a load is scheduled before a store to the same memory location, the data is forwarded with no cache access.
Level 2 Cache
The level 2 cache is 256 Kbytes, exclusive from level 1, and uses 16-way set-associative mapping. Level 2 only contains copy-back blocks that were removed from level 1 due to a miss. Since it is an exclusive cache, the Athlon processor has a total on-chip cache of 384 Kbytes accessible in one clock cycle. The Athlon still uses a traditional memory hierarchy, so level 2 will only be checked if the data is not present in the level 1 cache.
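A minimal sketch of the exclusive L1/L2 behavior described above (structures and names are invented; a real implementation involves victim buffers and full tag handling):

```c
struct blk { int valid; unsigned tag; };

/* On an L1 miss, the block evicted from L1 is moved into L2 instead
   of being duplicated, so a given block lives in L1 or L2 but never
   in both; that is what makes the hierarchy exclusive. */
void l1_evict_to_l2(struct blk *l1_victim, struct blk *l2_slot) {
    if (l1_victim->valid)
        *l2_slot = *l1_victim;   /* victim moves down to L2 */
    l1_victim->valid = 0;        /* L1 slot freed for the new block */
}
```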
IBM POWER4:
L1 caches
The L1 instruction cache is single-ported, capable of either one 32-byte read or one 32-byte write each cycle. The store-through L1 data cache is triple-ported, capable of two 8-byte reads and one 8-byte write per cycle with no blocking. L1 data-cache reloads are 32 bytes per cycle. The L1 caches are parity-protected. A parity error detected in the L1 instruction cache forces the line to be invalidated and reloaded from the L2. Errors encountered in the L1 data cache are reported as a synchronous machine-check interrupt. To support error recovery, the machine-check interrupt handler is implemented in system-specific firmware code. When the interrupt occurs, the firmware saves the processor-architected states and examines the processor registers to determine the recovery and error status. If the interrupt is recoverable, the system firmware removes the error by invalidating the L1 data-cache line and incrementing an error counter. If the L1 data-cache error counter is greater than a predefined threshold, which is an indication of a solid error, the system firmware disables the failing portion of the L1 data cache. The system firmware then restores the processor-architected states and calls back the operating system machine-check handler with the fully recovered status. The operating system checks the return status from the firmware and resumes execution. With the L1 data-cache line invalidated, data is now reloaded from the L2. All data stored in the L1 data cache is available in the L2 cache, guaranteeing no data loss.
Data in the L1 cache can be in one of two states: I (the invalid state, in which
the data is invalid) or V (the valid state, in which the data is valid).
L2 cache
The unified second-level cache is shared across the two processors on the POWER4 chip. Figure 5 shows a logical view of the L2 cache. The L2 is implemented as three identical slices, each with its own controller. Cache lines are hashed across the three controllers, as sketched below.
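The text does not specify the hash function, so the sketch below uses a plain modulo over the line address purely as an illustrative stand-in for spreading lines over the three slices:

```c
/* 128-byte coherence granularity (per the L3 description below);
   the real POWER4 hash is not given, so modulo 3 stands in for it. */
#define LINE_SHIFT 7

unsigned l2_slice(unsigned long paddr) {
    return (unsigned)((paddr >> LINE_SHIFT) % 3);   /* slice 0, 1, or 2 */
}
```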
L3 cache
The L3 consists of two components, the L3 controller and the L3 data array. The L3 controller is located on the POWER4 chip and contains the tag directory as well as the queues and arbitration logic to support the L3 and the memory behind it. The data array is stored in two 16-MB eDRAM chips mounted on a separate module. A separate memory controller can be attached to the back side of the L3 module. To facilitate physical design and minimize bank conflicts, the embedded DRAM on the L3 chip is organized as eight banks of 2 MB per bank, with banks grouped in pairs to divide the chip into four 4-MB quadrants. The L3 controller is also organized in quadrants. Each quadrant contains two coherency processors to service requests from the fabric, perform any L3 cache and/or memory accesses, and update the L3 tag directory. Additionally, each quadrant contains two processors to perform the memory cast-outs, invalidate functions, and DMA writes for I/O operations. Each pair of quadrants shares one of the two L3 tag directory SRAMs. The L3 cache is eight-way set-associative, organized in 512-byte blocks, with coherence maintained on 128-byte sectors for compatibility with the L2 cache. Five coherency states are supported for each of the 128-byte sectors, as follows:
- I (invalid state): The data is invalid.
- S (shared state): The data is valid. In this state, the L3 can source data only to L2s for which it is caching data.
- T (tagged state): The data is valid. The data is modified relative to the copy stored in memory. The data may be shared in other L2 or L3 caches.
- Trem (remote tagged state): This is the same as the T state, but the data was sourced from memory attached to another chip.
- O (prefetch data state): The data in the L3 is identical to the data in memory. The data was sourced from memory attached to this L3. The status of the data in other L2 or L3 caches is unknown.
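For reference, the five states as a C enum (the enumerator names are invented shorthand for the labels above):

```c
/* The five per-sector L3 coherency states listed above. */
enum l3_state {
    L3_I,      /* invalid: the data is invalid */
    L3_S,      /* shared: valid; sourced only to L2s this L3 caches for */
    L3_T,      /* tagged: modified relative to memory; may be shared */
    L3_TREM,   /* remote tagged: as T, but sourced from another chip */
    L3_O       /* prefetch: identical to this chip's memory copy */
};
```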