
CSC- 347

Hardware Organization & Maintenance


Sec-c
Group Name-Matrix
Topic: Cache Memory
Name            Id        Email
Nahid Babu      15103336  Nahidjahid8698@gmail.com
Shemu Khatun    15103354  Shemu0015@gmail.com
Reshma Akter    15103358  reshmaiubat@gmail.com
Afroza Akter    15103184  Afrozaakteririn1@gmail.com
Sharmin Zahan   15203105  Sharminshapla8@gmail.com
Asraf Akter     15103200  Ashrafahmed30csc@gmail.com
Group Leader: Nahid Babu
Cell no: 01786981171
Attendance field:
Name            02-02-17  08-03-17  15-03-17
Nahid Babu      P         P         P
Shemu Khatun    P         P         P
Reshma Akter    P         P         P
Afroza Akter    P         P         P
Asraf Akter     P         P         P
Sharmin Zahan   P         P         P

Table of Contents:
Content                                                          Date
Introduction (memory, cache memory)                              02-03-17
How cache memory works                                           08-03-17
Analysis of cache memory based on different processors           15-03-17
Cache operation summary; design issues: cache capacity,         22-03-17
cache line size, degree of associativity

Introduction:
In data processing systems, a cache memory or memory cache (sometimes also
called a CPU cache) is a fast and relatively small memory, not visible to the software
and handled entirely by the hardware, that stores the most recently used
(MRU) main memory (MM, or working memory) data.
The function of the cache memory is to speed up MM data access (increasing
performance) and, most importantly in multiprocessor systems with shared memory, to
reduce the system bus and MM traffic, which is one of the major bottlenecks of these
systems.
Cache memory is built from fast SRAM (static random-access memory) cells, as
opposed to the slower DRAM (dynamic random-access memory) of the MM, and is
connected directly to the processor(s).
The term "cache" is derived from the French and means "hidden".
The term has many meanings depending on the context. Examples are: disk
cache, TLB (translation lookaside buffer, a page-table cache), branch
prediction cache, branch history table, branch address cache, and trace cache, which
are physical memories. Others are handled by software to store temporary data in
reserved MM space (again disk cache (page cache), system cache, application
cache, database cache, web cache, DNS cache, browser cache, router cache, etc.).
Some of these last ones are actually only "buffers", that is, non-associative
memories accessed sequentially (strings of data), in contrast to the random accesses
of a classic cache through an associative memory-to-cache address mapping.

How cache memory works:


When an application starts, or data is to be read or written, or any operation is to be
performed, the data and commands associated with that operation are moved from a
slow storage device (a magnetic device such as a hard disk, an optical device such as a
CD drive, etc.) to a faster device. This faster device is RAM (Random Access Memory),
of the DRAM (Dynamic Random Access Memory) type. RAM is placed here because it
is a faster device: whenever data, commands, or instructions are needed by the
processor, it can provide them at a faster rate than the slow storage devices, so it
serves as a cache for those devices. Although RAM is much faster than slow storage
devices, the processor works at a much faster pace still, and RAM is not able to supply
the needed data and instructions at that rate. So there is a need for a device that is
faster than RAM and can keep up with the speed the processor needs. Therefore the
required data is transferred to the next, faster level of memory, which is known as
CACHE memory. Cache is also a type of RAM, but it is static RAM (SRAM). SRAM is
faster and costlier than DRAM because it uses flip-flops (6 transistors) to store data,
unlike DRAM, which uses 1 transistor and a capacitor to store data in the form of
charge. Moreover, SRAM need not be refreshed periodically (because of its bistable
latching circuitry), unlike DRAM, which makes it faster.
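The benefit of this hierarchy can be quantified with the average memory access time
(AMAT = hit time + miss rate * miss penalty). The short C sketch below computes it
for assumed, purely illustrative timings; the figures are not measurements of any
particular machine.

    #include <stdio.h>

    /* Average memory access time: AMAT = hit_time + miss_rate * miss_penalty.
     * The numbers below are assumptions chosen only to illustrate the idea. */
    int main(void) {
        double cache_hit_time_ns = 1.0;   /* assumed SRAM cache access time */
        double dram_penalty_ns   = 60.0;  /* assumed extra cost of a DRAM access on a miss */
        double miss_rate         = 0.05;  /* assumed 5% of accesses miss the cache */

        double amat = cache_hit_time_ns + miss_rate * dram_penalty_ns;
        printf("AMAT = %.1f ns\n", amat); /* 1.0 + 0.05*60 = 4.0 ns */
        return 0;
    }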
Cache memory principles:

Cache memory operation is based on two major "principles of locality":

- Temporal locality
- Spatial locality

Temporal locality
- Data that have been used recently have a high likelihood of being used again.
A cache stores only a subset of MM data: the most recently used (MRU) data. Data
read from MM are temporarily stored in the cache. If the processor requires the same
data again, they are supplied by the cache. The cache is effective because short
instruction loops and routines are a common program structure, and generally several
operations are performed on the same data values and variables.

Spatial locality
- If a data item is referenced, it is very likely that nearby data will be accessed soon.
Instructions and data are transferred from MM to the cache in fixed-size blocks (cache
blocks), known as cache lines. Cache line size is in the range of 4 to 512 bytes, so
more than one processing datum (4/8 bytes) is stored in each cache entry. After a first
MM access, all the data in the cache line are available in the cache.
Most programs are highly sequential: the next instruction usually comes from the next
memory location. Data are usually structured, and data in these structures are
normally stored in contiguous memory locations (data strings, arrays, etc.). A short
code sketch illustrating both forms of locality follows below.
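The following minimal C sketch (a hypothetical loop, not taken from the report) shows
both principles at work: the accumulator sum is reused on every iteration (temporal
locality), and the array a[] is walked through consecutive addresses (spatial locality),
so after the first miss the rest of each cache line is already present.

    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N];      /* contiguous array: consecutive elements share cache lines */
        long sum = 0;         /* reused every iteration: temporal locality */

        for (int i = 0; i < N; i++) {
            a[i] = i;
        }
        for (int i = 0; i < N; i++) {
            sum += a[i];      /* sequential accesses: spatial locality */
        }
        printf("sum = %ld\n", sum);
        return 0;
    }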
Mapping Function
Blocks of words have to be brought in and out of the cache memory continuously, and
the performance of the cache memory mapping function is key to the speed. The
mapping function determines how memory blocks are mapped to cache lines, and the
memory system has to quickly determine whether a given address is in the cache.
There are three popular mapping techniques:

Direct mapping
- Each memory block has a single, specific cache line where it can be placed.

Associative mapping
- A memory block can be placed anywhere; the entire cache is searched for an
address.

Set-associative mapping
- Each memory block maps to a small set of cache lines and can be placed in any line
of that set.

Direct Mapping
Direct mapping is simple and inexpensive to implement, but if a program repeatedly
accesses two blocks that map to the same line, the cache thrashes back and forth,
reloading the line over and over, so the miss rate becomes very high.

It is the simplest way of mapping. Main memory is divided into blocks, and block j of
main memory is mapped onto block j modulo 128 of the cache (considering a cache of
128 blocks of 16 words each). Each location in RAM has one specific place in the cache
where its data will be held.

Consider the cache to be like an array: part of the address is used as an index into the
cache to identify where the data will be held. Since a data block from RAM can only go
into one specific line of the cache, it must always replace the block that was already
there, so there is no need for a replacement algorithm. A minimal lookup sketch is
given below.
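A minimal C sketch of a direct-mapped lookup for the 128-block, 16-word cache
described above. The field layout (7-bit block index, 4-bit word offset, remaining bits
as tag) follows the example in the text; the names, types, and test address are
illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Direct-mapped cache: 128 blocks of 16 words each.
     * Word-address layout: | tag | block index (7 bits) | word (4 bits) | */
    #define WORD_BITS   4
    #define INDEX_BITS  7
    #define NUM_BLOCKS  (1u << INDEX_BITS)

    struct line { bool valid; uint32_t tag; };
    static struct line cache[NUM_BLOCKS];

    static bool lookup(uint32_t word_addr) {
        uint32_t index = (word_addr >> WORD_BITS) & (NUM_BLOCKS - 1);
        uint32_t tag   =  word_addr >> (WORD_BITS + INDEX_BITS);

        if (cache[index].valid && cache[index].tag == tag)
            return true;                 /* hit */
        cache[index].valid = true;       /* miss: the old block is simply replaced */
        cache[index].tag   = tag;
        return false;
    }

    int main(void) {
        printf("%d\n", lookup(0xA815)); /* first access: miss (0) */
        printf("%d\n", lookup(0xA815)); /* same block again: hit (1) */
        return 0;
    }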

Associative mapping
In this type of mapping the associative memory is used to store content and
addresses both of the memory word. This enables the placement of the any word at
any place in the cache memory. It is considered to be the fastest and the most
flexible mapping form.
Set-associative mapping
This form of mapping is a modified form of the direct mapping where the disadvantage
of direct mapping is removed. Set-associative mapping allows that each word that is
present in the cache can have two or more words in the main memory for the same
index address.
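A minimal C sketch of a 2-way set-associative lookup; the sizes, names, and
single-bit LRU bookkeeping are illustrative assumptions, not a description of any
specific processor.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* 2-way set-associative cache: each set holds two lines, so two memory
     * blocks with the same index can be cached at the same time. */
    #define WORD_BITS  4
    #define INDEX_BITS 6
    #define NUM_SETS   (1u << INDEX_BITS)
    #define WAYS       2

    struct line { bool valid; uint32_t tag; };
    struct set  { struct line way[WAYS]; unsigned lru; /* least recently used way */ };
    static struct set cache[NUM_SETS];

    static bool lookup(uint32_t word_addr) {
        uint32_t index = (word_addr >> WORD_BITS) & (NUM_SETS - 1);
        uint32_t tag   =  word_addr >> (WORD_BITS + INDEX_BITS);
        struct set *s  = &cache[index];

        for (unsigned w = 0; w < WAYS; w++) {
            if (s->way[w].valid && s->way[w].tag == tag) {
                s->lru = 1 - w;          /* the other way is now least recently used */
                return true;             /* hit */
            }
        }
        unsigned victim = s->lru;        /* miss: replace the LRU way in this set */
        s->way[victim].valid = true;
        s->way[victim].tag   = tag;
        s->lru = 1 - victim;
        return false;
    }

    int main(void) {
        printf("%d\n", lookup(0x1010));  /* miss (0)                              */
        printf("%d\n", lookup(0x2010));  /* miss (0): same set, other way         */
        printf("%d\n", lookup(0x1010));  /* hit (1): both blocks fit in the set   */
        return 0;
    }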

The tag is used to see whether a desired word is in the cache. If there is no match, the
block containing the required word must first be read from memory. For example, for
MOVE $A815, D0 the address $A815 is interpreted as
tag = 10101, block = 0000001, word = 0101.

Advantage
- simplest replacement algorithm

Disadvantage
- not flexible; there is a contention problem even when the cache is not full

For example, block 0 and block 128 both map to block 0 of the cache:

0 modulo 128 = 0
128 modulo 128 = 0

If both blocks 0 and 128 of the main memory are used a lot, the cache will be very
slow. The address decomposition above is worked out in the short sketch below.
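A tiny C sketch that reproduces the field split of $A815 shown above (5-bit tag, 7-bit
block index, 4-bit word offset); the field widths come from the 128-block, 16-word
example, and the program itself is only illustrative.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint16_t addr  = 0xA815;                 /* 1010100000010101 in binary         */
        unsigned word  =  addr        & 0xF;     /* low 4 bits:  0101   (word offset)  */
        unsigned block = (addr >> 4)  & 0x7F;    /* next 7 bits: 0000001 (cache block) */
        unsigned tag   =  addr >> 11;            /* top 5 bits:  10101   (tag)         */

        printf("tag=%u block=%u word=%u\n", tag, block, word); /* tag=21 block=1 word=5 */
        return 0;
    }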

Write Policy
Memory writes require special attention because we have two copies of the data:

- a memory copy
- a cached copy

The write policy determines how a memory write operation is handled. There are two
policies:

Write-through
- update both copies

Write-back
- update only the cached copy
- the memory copy has to be taken care of later

The cache's write policy determines how it handles writes to memory locations that
are currently being held in the cache. The two policy types are:

Write-Back Cache: When the system writes to a memory location that is currently held
in the cache, it only writes the new information to the appropriate cache line. When the
cache line is eventually needed for some other memory address, the changed data is
"written back" to system memory. This type of cache provides better performance than
a write-through cache, because it saves on (time-consuming) write cycles to memory.

Write-Through Cache: When the system writes to a memory location that is currently
held in the cache, it writes the new information both to the appropriate cache line and
to the memory location itself at the same time. This type of caching provides worse
performance than write-back, but is simpler to implement and has the advantage of
internal consistency, because the cache is never out of sync with memory the way it is
with a write-back cache.

Both write-back and write-through caches are used extensively, with write-back
designs more prevalent in more modern machines. A small sketch of the two policies
follows.
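A minimal C sketch contrasting the two policies for a single cached word; the dirty bit
and the function and structure names are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>

    /* One cached word together with its backing memory location. */
    struct cached_word {
        uint32_t cache_copy;
        uint32_t *memory_copy;
        bool     dirty;          /* used only by the write-back policy */
    };

    /* Write-through: update both copies on every write. */
    static void write_through(struct cached_word *w, uint32_t value) {
        w->cache_copy   = value;
        *w->memory_copy = value;     /* memory is always in sync */
    }

    /* Write-back: update only the cache; memory is updated later, on eviction. */
    static void write_back(struct cached_word *w, uint32_t value) {
        w->cache_copy = value;
        w->dirty      = true;        /* remember that memory is now stale */
    }

    static void evict(struct cached_word *w) {
        if (w->dirty) {
            *w->memory_copy = w->cache_copy;  /* the delayed "write back" */
            w->dirty = false;
        }
    }

    int main(void) {
        uint32_t mem = 0;
        struct cached_word w = { 0, &mem, false };

        write_through(&w, 1);   /* mem == 1 immediately          */
        write_back(&w, 2);      /* mem still 1, cache holds 2    */
        evict(&w);              /* mem == 2 only now             */
        return 0;
    }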

[Figure: direct-mapped cache of 128 blocks (Block 0 ... Block 127, each with a tag),
with a main memory of 64K words divided into 4096 blocks. Blocks 0, 128, 256, ...,
3968 all map to cache block 0, and blocks 126, 254, 382, ..., 4094 all map to cache
block 126.]

Replacement policies
When a MM block needs to be brought in while all the CM (cache memory) blocks are
occupied, one of them has to be replaced. The selection of the block to be replaced
can be determined in one of the following ways.

Optimal replacement: replace the block which is no longer needed in the future. If all
blocks currently in CM will be used again, replace the one which will not be used for
the longest time in the future.
Random selection: replace a randomly selected block among all blocks currently in CM.
FIFO (first-in first-out): replace the block that has been in CM for the longest time.
LRU (least recently used): replace the block in CM that has not been used for the
longest time.

Optimal replacement is obviously the best but is not realistic, simply because when a
block will be needed in the future is usually not known ahead of time. LRU is
suboptimal, but it works well because of the temporal locality of reference: memory
items that have been referenced recently are more likely to be referenced soon than
those which have not been referenced for a longer time. FIFO is not necessarily
consistent with LRU and is therefore usually not as good. Random selection,
surprisingly, is not necessarily bad.

LRU replacement can be implemented by attaching a number to each CM block to
indicate how recently the block has been used. Every time a CPU reference is made,
all of these numbers are updated in such a way that the smaller a number, the more
recently the block was used; i.e., the LRU block is always the one with the largest
number. A small sketch of this age-counter scheme follows.
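A minimal C sketch of the age-counter scheme just described, for a small fully
associative cache; the sizes and names are illustrative assumptions.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_BLOCKS 4   /* illustrative small, fully associative cache */

    struct line { bool valid; uint32_t tag; unsigned age; /* 0 = most recently used */ };
    static struct line cache[NUM_BLOCKS];

    /* On every reference, age all blocks and reset the referenced block to 0,
     * so the LRU block is always the one with the largest age. */
    static void touch(unsigned hit_index) {
        for (unsigned i = 0; i < NUM_BLOCKS; i++)
            if (cache[i].valid)
                cache[i].age++;
        cache[hit_index].age = 0;
    }

    static unsigned pick_victim(void) {
        unsigned victim = 0;
        for (unsigned i = 0; i < NUM_BLOCKS; i++) {
            if (!cache[i].valid)
                return i;                          /* prefer an empty block */
            if (cache[i].age > cache[victim].age)
                victim = i;                        /* otherwise the oldest (LRU) block */
        }
        return victim;
    }

    int main(void) {
        for (unsigned i = 0; i < NUM_BLOCKS; i++)
            cache[i] = (struct line){ true, i, 0 };   /* all blocks valid            */
        touch(2);                                     /* block 2 becomes most recent */
        printf("victim = %u\n", pick_victim());       /* prints an untouched block (0) */
        return 0;
    }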

Analysis of cache memory based on different processors:
1982-1990
1. Motorola MC68030:
Due to locality of reference, instructions and data that are used in a program
have a high probability of being reused within a short time. Additionally,
instructions and data operands that reside in proximity to the instructions
and data currently in use also have a high probability of being utilized within
a short period. To exploit these locality characteristics, the MC68030 contains
two on-chip logical caches: a data cache and an instruction cache. Each of
the caches stores 256 bytes of information, organized as 16 entries, each
containing a block of four long words (16 bytes). The processor fills the cache
entries either one long word at a time or, during burst mode accesses, four
long words consecutively. The burst mode of operation not only fills the
cache efficiently but also captures adjacent instruction or data items that are
likely to be required in the near future due to locality characteristics of the
executing task. The caches improve the overall performance of the system
by reducing the number of bus cycles required by the processor to fetch
information from memory and by increasing the bus bandwidth available for
other bus masters in the system. Addition of the data cache in the MC68030
extends the benefits of cache techniques to all memory accesses. During a
write cycle, the data cache circuitry writes data to a cached data item as well
as to the item in memory, maintaining consistency between data in the
cache and that in memory. However, writing data that is not in the cache
may or may not cause the data item to be stored in the cache.
5.7.1 Cache Inhibit Input (CIIN): This input signal prevents data from being loaded into
the MC68030 instruction and data caches. It is a synchronous input signal and is
interpreted on a bus-cycle-by-bus-cycle basis. CIIN is ignored during all write cycles.
Refer to 6.1 On-Chip Cache Organization and Operation for information on the
relationship of CIIN to the on-chip caches.

5.7.2 Cache Inhibit Output (CIOUT): This three-state output signal reflects the state of
the CI bit in the address translation cache entry for the referenced logical address,
indicating that an external cache should ignore the bus transfer. When the referenced
logical address is within an area specified for transparent translation, the CI bit of the
appropriate transparent translation register controls the state of CIOUT. Refer to
Section 9 Memory Management Unit for more information about the address
translation cache and transparent translation. Also, refer to Section 6 On-Chip Cache
Memories for the effect of CIOUT on the internal caches.

5.7.3 Cache Burst Request (CBREQ): This three-state output signal requests a burst
mode operation to fill a line in the instruction or data cache. Refer to 6.1.3 Cache
Filling for filling information and 7.3.7 Burst Operation Cycles for bus cycle information
pertaining to burst mode operations.

5.7.4 Cache Burst Acknowledge (CBACK): This input signal indicates that the accessed
device can operate in the burst mode and can supply at least one more long word for
the instruction or data cache. Refer to 7.3.7 Burst Operation Cycles for information
about burst mode operation.
Instruction Cache: The instruction cache is organized with a line size of four long
words, as shown in Figure 6-2. Each of these long words is considered a separate
cache entry, as each has a separate valid bit. All four entries in a line have the same
tag address. Burst filling all four long words can be advantageous when the time spent
filling the line is not long relative to the equivalent bus-cycle time for four non-burst
long-word accesses, because of the probability that the contents of memory adjacent
or close to a referenced operand or instruction are also required by subsequent
accesses. Dynamic RAMs supporting fast access modes (page, nibble, or static
column) are easily employed to support the MC68030 burst mode.
Data Cache: The data cache stores data references to any address space except CPU
space (FC = $7), including references made with PC-relative addressing modes and
accesses made with the MOVES instruction. Operation of the data cache is similar to
that of the instruction cache, except for the address comparison and cache filling
operations. The tag of each line in the data cache contains function code bits FC0,
FC1, and FC2 in addition to address bits A31-A8. The cache control circuitry selects
the tag using bits A7-A4 and compares it to the corresponding bits of the access
address to determine whether a tag match has occurred. Address bits A3-A2 select the
valid bit for the appropriate long word in the cache to determine whether an entry hit
has occurred. Misaligned data transfers may span two data cache entries. In this case,
the processor checks for a hit one entry at a time. Therefore, it is possible that a
portion of the access results in a hit and a portion results in a miss. The hit and the
miss are treated independently. Figure 6-3 illustrates the organization of the data
cache. A small sketch of this tag/index field split is given below.
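A minimal C sketch of the field split described above for the MC68030 data cache (tag
from A31-A8 plus the FC bits, line index from A7-A4, long-word select from A3-A2);
the code, its names, and the demo values are only an illustrative reading of that
description, not Motorola's implementation.

    #include <stdint.h>
    #include <stdio.h>

    /* Field split of an MC68030 data-cache access, as described in the text:
     * tag = A31-A8 (compared together with FC2-FC0), index = A7-A4 (16 lines),
     * long word = A3-A2 (4 long words per line). */
    int main(void) {
        uint32_t addr = 0x00123456u;   /* hypothetical logical address  */
        unsigned fc   = 0x5;           /* hypothetical function code    */

        unsigned longword = (addr >> 2) & 0x3;   /* A3-A2: which of the 4 long words   */
        unsigned index    = (addr >> 4) & 0xF;   /* A7-A4: which of the 16 cache lines */
        uint32_t tag      =  addr >> 8;          /* A31-A8, checked along with FC bits */

        printf("fc=%u tag=0x%06X index=%u longword=%u\n", fc, tag, index, longword);
        return 0;
    }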
2. Intel 80486:

Cache memory: The cache memory system stores data used by a program and also
the instructions of the program. The cache is organized as a 4-way set-associative
cache, with each location containing 16 bytes or 4 double words of data.
Control register CR0 is used to control the cache with two control bits that were not
present in the 80386 microprocessor: the CD (cache disable) and NW (non-cache
write-through) bits are new to the 80486 and are used to control the 8K-byte cache.
If the CD bit is a logic 1, all cache operations are inhibited. This setting is only used for
debugging software and normally remains cleared. The NW bit is used to inhibit cache
write-through operation. As with CD, cache write-through is inhibited only for testing.
For normal operation, CD = 0 and NW = 0.
The cache is new to the 80486 microprocessor, and it is filled using burst cycles that
were not present on the 80386.
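A small C sketch of the normal-operation setting of the CD and NW bits described
above. On the 80486, CD is bit 30 and NW is bit 29 of CR0; the function below only
manipulates a plain variable, since access to the real CR0 requires a privileged
move-to-CR0 instruction.

    #include <stdint.h>
    #include <stdio.h>

    /* CR0 cache-control bits on the 80486 (CD = bit 30, NW = bit 29). */
    #define CR0_CD (1u << 30)   /* cache disable           */
    #define CR0_NW (1u << 29)   /* non-cache write-through */

    /* For normal operation both bits are cleared (CD = 0, NW = 0), enabling the
     * cache and write-through. This only models the bit manipulation. */
    static uint32_t enable_cache(uint32_t cr0) {
        return cr0 & ~(CR0_CD | CR0_NW);
    }

    int main(void) {
        uint32_t cr0 = CR0_CD | CR0_NW;   /* caching disabled (test/debug setting) */
        printf("0x%08X -> 0x%08X\n", cr0, enable_cache(cr0));
        return 0;
    }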
2000-2004
AMD Athlon:
In order to achieve a high hit rate, the Athlon uses a large L1 and L2 cache. Both are
on chip and accessible in one clock cycle. The Athlon has an exclusive cache, so there
is no duplication of data between the L1 and L2 caches. The L1 cache is divided into an
instruction and a data cache, while the L2 cache contains blocks that were removed
from the L1 cache due to a miss. The idea is that if a block of instructions or data was
used once, it is likely to be used again, and therefore, to minimize the penalty if it is
needed again, it is placed in the L2 cache. This is based on the locality-of-reference
property of computer programs.

Translation Look-aside Buffer: The translation look-aside buffer (TLB) is used to
translate the virtual memory address specified by the program to the physical memory
address before the cache can be accessed. This must be done because the memory
data in the cache is addressed with physical addresses. Both the L1 instruction and
data caches have a two-level TLB structure, but the data cache has a larger level-one
TLB. The hit rate for the TLB is very high, around 99 percent, because a miss results in
about a three-clock-cycle penalty per address (Source: 1).

Level 1 Instruction Cache: The level 1 instruction cache is 64 Kbytes, uses two-way
set-associative mapping, and uses the least recently used (LRU) replacement
algorithm. As instructions are loaded into the cache, some predecoding is done in
order to find the boundaries of variable-length instructions and deal with unconditional
branch instructions. This also helps increase the performance of the regular decoding
phase. The TLB for the instruction cache is divided into two levels (Source: 4):
Level 1: 24 entries, of which 16 map to 4-Kbyte pages and 8 map to 2- or 4-Mbyte
pages, using associative mapping.
Level 2: 256 entries that map to 4-Kbyte pages using 4-way set-associative mapping.

Level 1 Data Cache: The level 1 data cache is set up similarly to the instruction cache,
except that the level-one TLB has 32 entries using associative mapping, of which 24
map to 4-Kbyte pages. The data cache has two 64-bit access ports for loading and
storing data, and it has multiple banks to permit several concurrent memory
operations. In general, a load takes precedence over a store, and if a load is scheduled
before a store to the same memory location, the data is forwarded with no cache
access.

Level 2 Cache: The level 2 cache is 256 Kbytes, exclusive from level 1, and uses
16-way set-associative mapping. Level 2 only contains copy-back blocks that were
removed from level 1 due to a miss. Since it is an exclusive cache, the Athlon
processor has a total on-chip cache of 384 Kbytes accessible in one clock cycle. The
Athlon still uses a traditional memory hierarchy, so level 2 will only be checked if the
data is not present in the level 1 cache. A sketch of this exclusive fill/eviction path
follows.
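A toy C sketch of the exclusive L1/L2 relationship described above: on an L1 miss that
hits in L2, the block leaves L2 and enters L1, and the L1 victim is placed into L2, so a
block lives in at most one of the two levels. Each level here is a single entry; the
structures and names are illustrative assumptions, not the Athlon's actual mechanism.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy model of an exclusive two-level cache: one entry per level. */
    struct entry { bool valid; uint32_t addr; };

    static struct entry l1, l2;

    static void access(uint32_t addr) {
        if (l1.valid && l1.addr == addr) {           /* L1 hit */
            printf("%08x: L1 hit\n", addr);
            return;
        }
        bool from_l2 = l2.valid && l2.addr == addr;
        if (from_l2)
            l2.valid = false;                        /* L2 hit: block leaves L2 */
        printf("%08x: %s\n", addr,
               from_l2 ? "L2 hit, moved to L1" : "miss, fetched from memory");

        if (l1.valid)
            l2 = l1;                                 /* L1 victim goes into L2  */
        l1.valid = true;                             /* the block enters L1     */
        l1.addr  = addr;
    }

    int main(void) {
        access(0x1000);   /* miss                      */
        access(0x2000);   /* miss; 0x1000 moves to L2  */
        access(0x1000);   /* L2 hit; the blocks swap   */
        return 0;
    }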

IBM POWER4:
L1 caches: The L1 instruction cache is single-ported, capable of either one 32-byte
read or one 32-byte write each cycle. The store-through L1 data cache is triple-ported,
capable of two 8-byte reads and one 8-byte write per cycle with no blocking. L1
data-cache reloads are 32 bytes per cycle. The L1 caches are parity-protected. A parity
error detected in the L1 instruction cache forces the line to be invalidated and reloaded
from the L2. Errors encountered in the L1 data cache are reported as a synchronous
machine-check interrupt. To support error recovery, the machine-check interrupt
handler is implemented in system-specific firmware code. When the interrupt occurs,
the firmware saves the processor-architected state and examines the processor
registers to determine the recovery and error status. If the interrupt is recoverable,
the system firmware removes the error by invalidating the L1 data-cache line and
incrementing an error counter. If the L1 data-cache error counter is greater than a
predefined threshold, which is an indication of a solid error, the system firmware
disables the failing portion of the L1 data cache. The system firmware then restores
the processor-architected state and calls back the operating system machine-check
handler with the fully recovered status. The operating system checks the return status
from firmware and resumes execution. With the L1 data-cache line invalidated, data is
now reloaded from the L2. All data stored in the L1 data cache is available in the L2
cache, guaranteeing no data loss.

Data in the L1 cache can be in one of two states: I (the invalid state, in which
the data is invalid) or V (the valid state, in which the data is valid).

L2 cache: The unified second-level cache is shared across the two processors on the
POWER4 chip. Figure 5 shows a logical view of the L2 cache. The L2 is implemented
as three identical slices, each with its own controller. Cache lines are hashed across
the three controllers.

L3 cache

The L3 consists of two components, the L3 controller and the L3 data array. The L3
controller is located on the POWER4 chip and contains the tag directory as well as
the queues and arbitration logic to support the L3 and the memory behind it. The
data array is stored in two 16MB eDRAM chips mounted on a separate module. A
separate memory controller can be attached to the back side of the L3 module. To
facilitate physical design and minimize bank conflicts, the embedded DRAM on the
L3 chip is organized as eight banks at 2 MB per bank, with banks

grouped in pairs to divide the chip into four 4MB quadrants. The L3 controller is also
organized in quadrants. Each quadrant contains two coherency processors to
service requests from the fabric, perform any L3 cache and/or memory accesses,
and update the L3 tag directory. Additionally, each quadrant contains two
processors to perform the memory cast-outs, invalidate functions, and DMA writes
for I/O operations. Each pair of quadrants shares one of the two L3 tag directory
SRAMs. The L3 cache is eight-way set-associative, organized in 512-byte blocks,
with coherence maintained on 128-byte sectors for compatibility with the L2 cache.
Five coherency states are supported for each of the 128-byte sectors, as follows:
I (invalid state): The data is invalid.
S (shared state): The data is valid. In this state, the L3 can source data only to L2s for
which it is caching data.
T (tagged state): The data is valid. The data is modified relative to the copy stored in
memory. The data may be shared in other L2 or L3 caches.
Trem (remote tagged state): This is the same as the T state, but the data was sourced
from memory attached to another chip.
O (prefetch data state): The data in the L3 is identical to the data in memory. The data
was sourced from memory attached to this L3. The status of the data in other L2 or L3
caches is unknown.
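A small C enumeration of the five L3 sector states listed above, as one might represent
them in a simulator; the enum itself is an illustrative assumption, not IBM's
implementation.

    /* The five POWER4 L3 coherency states described above, per 128-byte sector. */
    enum l3_sector_state {
        L3_INVALID,        /* I: data is invalid                                         */
        L3_SHARED,         /* S: valid; sourced only to L2s this L3 is caching data for  */
        L3_TAGGED,         /* T: valid, modified relative to memory; may be shared       */
        L3_TAGGED_REMOTE,  /* Trem: as T, but sourced from memory on another chip        */
        L3_PREFETCH        /* O: identical to memory attached to this L3; others unknown */
    };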
