1.3 The Pentium Processor


1.3.1 On-Chip Caches
The Pentium processor implements two internal caches for a total integrated
cache size of 16 Kbytes: an 8 Kbyte data cache and a separate 8 Kbyte code cache.
These caches are transparent to application software to maintain compatibility with
previous Intel Architecture generations.
The data cache fully supports the MESI (modified/exclusive/shared/invalid)
writeback cache consistency protocol. The code cache is inherently write protected to
prevent code from being inadvertently corrupted, and as a consequence supports a
subset of the MESI protocol: the S (shared) and I (invalid) states. The caches have
been designed for maximum flexibility and performance. The data cache is
configurable as writeback or writethrough on a line-by-line basis. Memory areas can
be defined as non-cacheable by software and external hardware. Cache writebacks
and invalidations can be initiated by hardware or software. Protocols for cache
consistency and line replacement are implemented in hardware, easing system
design.
1.3.2 Cache Organization
On the Pentium processor, each of the caches is 8 Kbytes in size and is
organized as a 2-way set associative cache. There are 128 sets in each cache, each
set containing 2 lines (each line has its own tag address). Each cache line is 32 bytes
wide. In the Pentium processor, replacement in both the data and instruction caches
is handled by the LRU mechanism which requires one bit per set in each of the
caches.
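The geometry above (128 sets, 2 lines per set, 32-byte lines) can be sketched as follows. This is a minimal illustration, not the processor's actual logic: the names `split_address` and `TwoWaySet` are hypothetical, and the single LRU bit is assumed here to hold the index of the least recently used way.

```python
LINE_BYTES = 32   # bytes per cache line (from the text)
NUM_SETS = 128    # sets per cache (from the text)
WAYS = 2          # lines per set (2-way set associative)

OFFSET_BITS = LINE_BYTES.bit_length() - 1   # 5 bits select a byte in the line
INDEX_BITS = NUM_SETS.bit_length() - 1      # 7 bits select the set


def cache_bytes():
    """Total data capacity implied by the geometry."""
    return NUM_SETS * WAYS * LINE_BYTES


def split_address(phys_addr):
    """Split a physical address into (tag, set_index, byte_offset)."""
    offset = phys_addr & (LINE_BYTES - 1)
    set_index = (phys_addr >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = phys_addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, set_index, offset


class TwoWaySet:
    """One set with two lines and a single LRU bit (assumed encoding)."""

    def __init__(self):
        self.tags = [None, None]
        self.lru = 0  # index of the least recently used way

    def access(self, tag):
        """Return True on a hit; on a miss, replace the LRU way."""
        if tag in self.tags:
            way = self.tags.index(tag)
            hit = True
        else:
            way = self.lru
            self.tags[way] = tag
            hit = False
        self.lru = 1 - way  # the other way is now least recently used
        return hit
```

With a 32-bit physical address, this split leaves 20 tag bits, and the product of sets, ways, and line size gives the 8 Kbytes stated in the text.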
The data cache consists of eight banks interleaved on 4-byte boundaries. The
data cache can be accessed simultaneously from both pipes, as long as the
references are to different cache banks. A conceptual diagram of the organization of
the data and code caches is shown in Figure 1-3. Note that the data cache supports
the MESI writeback cache consistency protocol which requires 2 state bits, while the
code cache supports the S and I state only and therefore requires only one state bit.

Figure 1-3 Conceptual Organization of Code and Data Caches
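The bank interleaving described above can be sketched as a simple check. It assumes the bank is selected by the address bits just above the 4-byte boundary, and it ignores accesses that span more than one bank as well as any other pairing rules of the two pipes. The function names are hypothetical.

```python
BANK_BYTES = 4   # banks are interleaved on 4-byte boundaries (from the text)
NUM_BANKS = 8    # eight banks (from the text)


def bank_of(addr):
    """Bank index for an address, assuming simple low-order interleaving."""
    return (addr // BANK_BYTES) % NUM_BANKS


def can_dual_access(addr_u, addr_v):
    """True if two data references fall in different banks."""
    return bank_of(addr_u) != bank_of(addr_v)
```

Under this assumption, addresses 0x1000 and 0x1004 map to different banks and can be served together, while 0x1000 and 0x1020 are 32 bytes apart, map to the same bank, and conflict.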

1.3.3 Cache Structure


The instruction and data caches can be accessed simultaneously. The instruction
cache can provide up to 32 bytes of raw opcodes and the data cache can provide
data for two data references all in the same clock. This capability is implemented
partially through the tag structure. The tags in the data cache are triple ported. One of
the ports is dedicated to snooping while the other two are used to look up two
independent addresses corresponding to data references from each of the pipelines.
The instruction cache tags of the Pentium processor are also triple ported. Again, one
port is dedicated to support snooping and the other two ports facilitate split line
accesses (simultaneously accessing upper half of one line and lower half of the next
line).
The storage array in the data cache is single ported but interleaved on 4-byte
boundaries to be able to provide data for two simultaneous accesses to the same
cache line. Each of the caches is parity protected. In the instruction cache, there
are parity bits on a quarter-line basis and there is one parity bit for each tag. The data
cache contains one parity bit for each tag and a parity bit per byte of data. Each of
the caches is accessed with physical addresses and each cache has its own TLB
(translation lookaside buffer) to translate linear addresses to physical addresses. The
TLBs associated with the instruction cache are single ported whereas the data cache
TLBs are fully dual ported to be able to translate two independent linear addresses
for two data references simultaneously. The tag and data arrays of the TLBs are
parity protected with a parity bit associated with each of the tag and data entries in
the TLBs.
The data cache of the Pentium processor has a 4-way set associative, 64-entry
TLB for 4-Kbyte pages and a separate 4-way set associative, 8-entry TLB to support
4-Mbyte pages. The code cache has one 4-way set associative, 32-entry TLB for 4-Kbyte
pages and 4-Mbyte pages, which are cached in 4-Kbyte increments.
Replacement in the TLBs is handled by a pseudo-LRU mechanism (similar to the
Intel486 CPU) that requires 3 bits per set.
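A 3-bit pseudo-LRU for one 4-way set can be sketched as a small binary tree of bits. The text only states that the mechanism is pseudo-LRU with 3 bits per set. The tree encoding below is a common way to realize that, assumed here for illustration, and the class and bit names are hypothetical.

```python
class PseudoLRUSet:
    """Replacement state for one 4-way set, using 3 bits (tree scheme)."""

    def __init__(self):
        self.b0 = 0  # 0: next victim is in pair (0, 1); 1: in pair (2, 3)
        self.b1 = 0  # victim within pair (0, 1): 0 -> way 0, 1 -> way 1
        self.b2 = 0  # victim within pair (2, 3): 0 -> way 2, 1 -> way 3

    def touch(self, way):
        """Record a use of `way` by pointing the bits away from it."""
        if way < 2:
            self.b0 = 1              # steer the next victim to pair (2, 3)
            self.b1 = 1 - way        # other way of pair (0, 1)
        else:
            self.b0 = 0              # steer the next victim to pair (0, 1)
            self.b2 = 1 - (way - 2)  # other way of pair (2, 3)

    def victim(self):
        """Way that would be replaced on the next miss."""
        if self.b0 == 0:
            return self.b1
        return 2 + self.b2
```

After touching ways 0, 1, 2, 3 in order, the victim is way 0, which matches true LRU. Touching way 0 again then selects way 2 rather than way 1, which shows why the scheme is only an approximation: three bits cannot record the full ordering of four ways.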

1.4 The Pentium Pro/Pentium II/Pentium III


1.4.1 The Pentium Pro
The Pentium Pro processor's on-chip level one (L1) caches consist of one 8-Kbyte
four-way set associative instruction cache unit with a cache line length of 32 bytes
and one 8-Kbyte two-way set associative data cache unit. Not all misses in the L1
cache expose the full memory latency. The level two (L2) cache masks the full
latency caused by an L1 cache miss. The minimum delay for a miss in both the L1 and L2
caches is between 11 and 14 cycles, depending on DRAM page hit or miss. The data cache can
be accessed simultaneously by a load instruction and a store instruction, as long as
the references are to different cache banks.

Figure 1-4 The Pentium Pro, II, III Processor Micro-Architecture with Advanced Transfer Cache Enhancement: the First- and Second-Level Caches

1.4.2 The Pentium II/Pentium III


The on-chip cache subsystem of Pentium II and Pentium III processors
consists of two 16-Kbyte four-way set associative caches with a cache line length of
32 bytes. The caches employ a write-back mechanism and a pseudo-LRU (least
recently used) replacement algorithm. The data cache consists of eight banks
interleaved on four-byte boundaries. Level two (L2) caches have been off chip but in
the same package. They are 128 Kbytes or more in size. L2 latencies are in the range of 4
to 10 cycles. An L2 miss initiates a transaction across the bus to memory chips. Such
an access requires on the order of at least 11 additional bus cycles, assuming a
DRAM page hit. A DRAM page miss incurs another three bus cycles. Each bus cycle
equals several processor cycles; for example, one bus cycle for a 100 MHz bus is
equal to four processor cycles on a 400 MHz processor. The speed of the bus and
sizes of L2 caches are implementation dependent, however. Check the specifications
of a given system to understand the precise characteristics of the L2 cache.
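The bus-cycle figures above can be converted to processor cycles with a short calculation. This is a rough lower-bound sketch using only the numbers in the text (at least 11 bus cycles on a DRAM page hit, three more on a page miss). The function names are hypothetical.

```python
def bus_to_cpu_cycles(bus_cycles, cpu_mhz, bus_mhz):
    """Convert bus cycles to processor cycles using the clock ratio."""
    return bus_cycles * cpu_mhz / bus_mhz


def l2_miss_penalty(cpu_mhz, bus_mhz, page_hit=True):
    """Minimum extra processor cycles for an L2 miss, per the text."""
    bus_cycles = 11 if page_hit else 11 + 3
    return bus_to_cpu_cycles(bus_cycles, cpu_mhz, bus_mhz)
```

For a 400 MHz processor on a 100 MHz bus, this gives at least 44 processor cycles on a DRAM page hit and at least 56 on a page miss.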

Figure 1-5 The Intel NetBurst Micro-Architecture: the First-Level and Second-Level Caches and the Trace Cache

1.5 The Pentium 4 Processor


The Intel Pentium 4 processor is the latest IA-32 processor, and the first based
on the Intel NetBurst micro-architecture (Figure 1-5). The Intel NetBurst micro-architecture
can support up to three levels of on-chip cache. Only two levels of on-chip
caches are implemented in the Pentium 4 processor, but it introduces a new
concept: the trace cache. The level nearest to the execution core of the processor, the
first level, contains separate caches for instructions and data: a first-level data cache
and the trace cache, which is an advanced first-level instruction cache. All other
levels of caches are shared. The levels in the cache hierarchy are not inclusive, that
is, the fact that a line is in level i does not imply that it is also in level i+1. All caches
use a pseudo-LRU (least recently used) replacement algorithm.
1.5.1 Execution Trace Cache
The execution trace cache (TC) is the primary instruction cache in the Intel
NetBurst micro-architecture. The TC stores decoded IA-32 instructions, or µops. This
removes decoding costs on frequently-executed code, such as template restrictions
and the extra latency to decode instructions upon a branch misprediction.

In the Pentium 4 processor implementation, the TC can hold up to 12K µops and
can deliver up to three µops per cycle. The TC does not hold all of the µops that need
to be executed in the execution core. In some situations, the execution core may
need to execute a microcode flow, instead of the µop traces that are stored in the
trace cache. The Pentium 4 processor is optimized so that most frequently-executed
IA-32 instructions come from the trace cache, efficiently and continuously, while only
a few instructions involve the microcode ROM.
1.5.2 The Second-level Cache
A second-level cache miss initiates a transaction across the system bus interface
to the memory sub-system. The system bus interface supports using a scalable bus
clock and achieves an effective speed that quadruples the speed of the scalable bus
clock. It takes on the order of 12 processor cycles to get to the bus and back within
the processor, and 6-12 bus cycles to access memory if there is no bus congestion.
Each bus cycle equals several processor cycles. The ratio of processor clock speed
to the scalable bus clock speed is referred to as bus ratio. For example, one bus
cycle for a 100 MHz bus is equal to 15 processor cycles on a 1.50 GHz processor.
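The figures in this section can be combined into a rough estimate of the cost of a second-level cache miss. The sketch assumes the two components simply add (about 12 processor cycles to reach the bus and return, plus 6 to 12 bus cycles for memory) and that there is no bus congestion. The function names are hypothetical.

```python
def bus_ratio(cpu_mhz, bus_mhz):
    """Processor clock speed divided by the scalable bus clock speed."""
    return cpu_mhz / bus_mhz


def l2_miss_cycles(cpu_mhz, bus_mhz, memory_bus_cycles):
    """Estimated processor cycles for one second-level cache miss."""
    on_chip_cycles = 12  # to the bus and back within the processor (from the text)
    return on_chip_cycles + memory_bus_cycles * bus_ratio(cpu_mhz, bus_mhz)
```

For a 1.50 GHz processor on a 100 MHz bus the bus ratio is 15, so under these assumptions a miss costs roughly 102 to 192 processor cycles.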
