James R. Goodman
Department of Computer Sciences
University of Wisconsin-Madison
Madison, WI 53706
ABSTRACT-The importance of reducing processor-memory bandwidth is recognized in two distinct situations: single board computer systems and microprocessors of the future. Cache memory is investigated as a way to reduce the memory-processor traffic. We show that traditional caches which depend heavily on spatial locality (look-ahead) for their performance are inappropriate in these environments because they generate large bursts of bus traffic. A cache exploiting primarily temporal locality (look-behind) is then proposed and demonstrated to be effective in an environment where process switches are infrequent. We argue that such an environment is possible if the traffic to backing store is small enough that many processors can share a common memory and if the cache data consistency problem is solved. We demonstrate that such a cache can indeed reduce traffic to memory greatly, and introduce an elegant solution to the cache coherency problem.
The impact of VLSI has been very different in microprocessor applications. Here memory is still regarded as an expensive component in the system, and those familiar primarily with a minicomputer or mainframe environment are often scornful of the trouble to which microprocessor users go to conserve memory. The reason, of course, is that even the small memory in a microprocessor is a much larger portion of the total system cost than the much larger memory on a typical mainframe system. This results from the fact that memory and processors are implemented in the same technology.
1.1. A Super CPU
With the advances in VLSI occurring now and continuing over the next few years, it will become possible to fabricate circuits that are one to two orders of magnitude more complex than currently available microprocessors. It will soon be possible to fabricate an extremely high-performance CPU on a single chip. Devoting the entire chip to the CPU, however, is not a good idea. Extrapolating historical trends to predict future component densities, we might expect that within a few years we should be able to purchase a single-chip processor containing at least ten times as many transistors as occur in, say, the MC68000. For the empirical rule known as Grosch's law [Grosch53], $P = kC^{g}$, where $P$ is some measure of performance, $C$ is the cost, and $k$ and $g$ are constants, Knight [Knight66] concluded that $g$ is at least 2, and Solomon [Solomon66] has suggested that $g \approx 1.47$. For the IBM System/370 family, Siewiorek determined that $g \approx 1.8$ [Siewiorek82]. While Grosch's law breaks down in the comparison of processors using different technology or architectures, it is realistic for predicting improvements within a single technology. Siewiorek in fact suggests that it holds "by definition."
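As a rough illustration of the arithmetic behind Grosch's law, the sketch below evaluates the performance ratio implied by a tenfold increase in cost for the three exponents quoted above. The function name `grosch_speedup` is hypothetical, and the sketch assumes that the constant k is the same for both machines being compared, so that it cancels in the ratio.

```python
def grosch_speedup(cost_ratio: float, g: float) -> float:
    """Performance ratio P2/P1 implied by P = k * C**g when cost grows by cost_ratio.

    Assumption: k is identical for both machines, so it cancels in the ratio.
    """
    return cost_ratio ** g


if __name__ == "__main__":
    # Exponents quoted in the text: Solomon (1.47), Siewiorek (1.8), Knight (2).
    for g in (1.47, 1.8, 2.0):
        ratio = grosch_speedup(10.0, g)
        print(f"g = {g}: 10x cost -> {ratio:.1f}x performance")
```

The spread between the exponents matters: under these assumptions the same tenfold cost increase yields anywhere from roughly thirty to one hundred times the performance.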
1. Introduction
© 1983 ACM 0149-7111/83/0600/0124 $01.00
operation is required.
We therefore have explored the effectiveness of a cache which exploits primarily or exclusively temporal locality, i.e., the blocks fetched from backing store are only the size needed by the CPU (or possibly slightly larger). In considering ways to evaluate this strategy, we identified a commercial environment that contained many of the same constraints and seemed amenable to the same kinds of solutions. This environment is the marketplace of the single-board computer running on a standard bus such as Multibus or Versabus. We have chosen to study this environment in an attempt to gain insight into the original, general scheme.
1.2. On-chip Memory
One alternative for increased performance without proportionately increasing processor-memory bandwidth is to introduce memory on the same chip with the CPU. With the ability to fabricate chips containing one to two million transistors, it should be possible - using only a portion of the chip - to build a processor significantly more powerful than any currently available single-chip CPU. While devoting the entire chip to the CPU could result in a still more powerful processor, introducing on-chip memory offers a reduction in memory access time due to the inherently smaller delays as compared to inter-chip data transfers. If most accesses were on-chip, it might actually perform as fast as the more powerful processor.
Ideally, the chip should contain as much memory as the processor "needs" for main storage. Conventional wisdom today says that a processor of the speed of current microprocessors needs at least 1/ megabytes of memory [Lindsay81]. This is certainly more than is feasible on-chip, though a high performance processor could probably use substantially more than that. Clearly all the primary memory for the processor cannot be placed on the same chip with a powerful CPU. What is needed is the top element of a memory hierarchy.
1.3. Cache Memory
While the market has rapidly developed for products using this bus, its applications are limited by the severe constraint imposed by the bandwidth of Multibus. Clearly the bus bandwidth could be increased by increasing the number of pins, and by modifying the protocol. Its broad popularity and the availability of components to implement its protocol mean, however, that it is likely to survive many years in its present form. Thus a large market exists for a computer-on-a-card which, much as if it were all on a single chip, has severe limitations on its communications with the rest of the system.
We decided to determine if a cache memory system could be implemented effectively in the Multibus environment. To that end we have designed a cache to be used with a current-generation microprocessor. In addition, we have done extensive simulation of cache performance, driven by memory trace data. We have identified a new component which is particularly suited for VLSI implementation and have demonstrated its feasibility by designing it [Ravishankar83]. This component, which implements the tag memory for a dynamic RAM cache intended for a microprocessor, is similar in many respects to the recently announced TMS
The use of cache memory, however, has often aggravated the bandwidth problem rather than reducing it. Smith [Smith82] says that optimizing the design has four general aspects:
(1) maximizing the hit ratio,
(2) minimizing the access time to data in the cache,
(3) minimizing the delay due to a miss, and
(4) minimizing the overheads of updating main memory, maintaining multicache consistency, etc.
The result is often a larger burst bandwidth requirement from main storage to the cache than would be necessary without a cache. For example, the cache on the IBM System/370 model 168 is capable of receiving data from main memory at a rate of 100 megabytes per second [IBM76]. It supplies data to the CPU at less than 1/3 that rate. The reason is that to exploit the spatial locality in memory references, the data transferred from backing store into the cache is fetched in large blocks, resulting in requirements of very high bandwidth bursts of data. We have measured the average bandwidth on an IBM System/370 model 155, and concluded that the average backing-store-to-cache traffic is less than the cache-to-CPU traffic.
2150 [Tree].
Multibus systems have generally dealt with the problem of limited bus bandwidth by removing most of the
The design of cache memory for mini-computers demanded greater concern for bus bandwidth. The designers of the PDP-11 models 60 and 70 clearly recognized that small block sizes were necessary to keep main memory traffic to a minimum [Bell78].
small blocks of data are brought from backing store to the cache, or
(2) long delays occur while a block is being brought in, independent of (and in addition to) the access time of the backing store.
While it is possible to bring in the word requested initially (read through), thus reducing the wait on a given reference, the low bandwidth memory interface will remain busy long after the initial transfer is completed, resulting in long delays if a second backing storage
In many environments, a simple, dynamic hardware allocation scheme can efficiently determine what memory locations are being accessed frequently and should therefore be kept in local memory better than the programmer who often has little insight into the dynamic characteristics of his program. There are environments where the programmer is intimately familiar with the behavior of his program and can generate code to take advantage of it. In this environment the time spent running a program is often much more substantial than the time developing the program. This explains, for example, why an invisible cache is not appropriate on the CRAY-1. We believe that freeing the programmer from concern about memory allocation is essential where programmer productivity is critical.
3. Cache Coherency
It is well-known that multiple caches present serious problems because of the redundancy of storage of a single logical memory location [Tang76, Censier78, Rao78]. The most common method among commercial products for dealing with this, the stale data problem, is to create a special, high-speed bus on which addresses are sent whenever a write operation is performed by any processor. This solution has weaknesses [Censier78] which have generally limited commercial implementations to two processors. In the single-chip processor or single-board computer environments, it has the added weakness that it requires a number of extra I/O pins.
An alternative approach, implemented in C.mmp [Hoogendoorn77] and proposed by Norton [Norton82], is to require the operating system to recognize when inconsistencies might occur and take steps to prevent them under those circumstances. This solution is unappealing because the cache is normally regarded as an architecture-independent feature, invisible to the software.
Earlier analyses [Kaplan73, Bell74, Rao78, Patel82] have used the cache hit ratio or something closely related to measure performance. The important criterion here is to maximize use of the bus, not the hit ratio, or even necessarily to optimize processor performance. We optimize system performance by optimizing bus utilization, achieving higher performance by minimizing individual processors' bus requirements, and thereby supporting more processors reasonably well. We allow individual processors to sit idle periodically rather than tie up the bus fetching data which they might not use. This implies that the cache stale data problem must be solved effectively. We present a new solution in section 3.
2.2. Switching Contexts
Where bus bandwidth is limited, a task switch is a major disturbance, since the cache must effectively be reloaded at this time. The processor is momentarily reduced to accesses at the rate at which the bus can supply them. While this problem seems unavoidable, it need not be serious if task switching is minimized. We are providing an environment which allows many processors to work out of a single monolithic memory in parallel. If more parallel tasks are required, more processors can be used. We point out that the current Multibus alternative is to move the program into local memory, an operation which also swamps the bus. The task switch merely makes this operation implicit, and avoids bringing across the bus data which are never actually used. Writing the old data out is also no worse than the alternative, since we only write that which has been changed and which has not been already purged.
But write-back has more severe coherency problems than write-through, since even main memory does not always contain the current version of a particular memory location.
Invalid     There is no data in the block.
Valid       There is data in the block which has been read from backing store and has not been modified.
Reserved    The data in the block has been locally modified exactly once since it was brought into the cache and the change has been transmitted to backing store.
Dirty       The data in the block has been locally modified more than once since it was brought into the cache and the latest change has not been transmitted to backing store.
4. Simulation
We designed a cache memory system to work on Multibus. To validate our design before building it we did extensive simulation using memory trace data. To date we have performed extensive simulations for six traces, all running under UNIX:
EDC         The UNIX editor ed running a script.
ROFFAS
TRACE       The program, written in assembly language, which generated the above traces for the PDP-11.
NROFF       The program nroff interpreting the Berkeley macro package -me.
CACHE       The trace-driven cache simulator program.
COMPACT     A program using an on-line algorithm which compresses files using an adaptive Huffman code.
This scheme achieves coherency in the following way. Initially write-through is employed. However, an additional goal is achieved upon writing. All other caches are purged of the block being written, so the cache writing through the bus now is guaranteed the only copy except for backing store. It is so identified by being marked reserved. If it is purged at this point, no write is necessary to backing store, so this is essentially write-through. If another write occurs, the block is marked dirty. Now write-back is employed and, on purging, the data must be rewritten to backing store.
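The write-once transitions described for a block that is already present in a cache can be sketched as a small state machine. This is a minimal illustration rather than the paper's hardware design: the names `State`, `on_local_write`, `on_remote_write`, and `writeback_needed_on_purge` are hypothetical, and the handling of misses (a write to an invalid block) is deliberately left out because the passage does not describe it.

```python
from enum import Enum
from typing import Tuple


class State(Enum):
    INVALID = "invalid"
    VALID = "valid"
    RESERVED = "reserved"
    DIRTY = "dirty"


def on_local_write(state: State) -> Tuple[State, bool]:
    """Return (next_state, bus_write_needed) for a write hit by the owning processor."""
    if state is State.INVALID:
        # Assumption: misses are outside the scope of this sketch.
        raise ValueError("write miss is not modelled here")
    if state is State.VALID:
        # First write: written through the bus, which purges other caches' copies.
        return State.RESERVED, True
    # Second and later writes stay local; the block is now handled as write-back.
    return State.DIRTY, False


def on_remote_write(state: State) -> State:
    """Another cache wrote this block through the bus: purge the local copy."""
    return State.INVALID


def writeback_needed_on_purge(state: State) -> bool:
    """Only a dirty block must be rewritten to backing store when purged."""
    return state is State.DIRTY
```

A block that is written exactly once therefore costs one bus write and nothing at purge time, while a block written many times costs one bus write for the reservation plus one write-back when it is purged.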
4.4. Block Size
extremely small
4.2. Cold Start vs. Warm Start
An important consideration in determining cache hit ratio and bus traffic is the cold start period known as the lifetime function [Easton78], during which time many misses occur because the cache is empty. This is defined as the period until as many cache misses have occurred as the total number of blocks in the cache. This initial burst of misses is amortized over all accesses, so the longer the trace analyzed, the lower the miss ratio obtained. In addition to the initiation of a program and occasional switches of environments, a cold start generally occurs whenever there is a task switch. Thus an important assumption in traditional cache evaluation is the frequency of task switching. We have argued that task switching must be very infrequent in our system. Thus we can more nearly approach in practice the warm start hit ratios, and thus it is appropriate to use very long traces of a single program, and assume a warm start. We did that initially, using the full length of the PDP-11 traces available to us (1,256,570 memory accesses). We noted, however, that for even much shorter traces than we were running, there was little difference between warm start and cold start statistics. Since cold start statistics are easier to generate, we normally used them. Unless stated otherwise, our results are from cold start, but at least 10 times the lifetime function in total length.
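The definition of the cold start period given above (the number of accesses until as many misses have occurred as there are blocks in the cache) can be measured directly from a trace. The sketch below is illustrative only: the function name `cold_start_length` is hypothetical, the trace is assumed to be a sequence of block addresses, and a fully associative LRU cache is assumed purely to keep the example short.

```python
from collections import OrderedDict
from typing import Iterable, Optional


def cold_start_length(block_addrs: Iterable[int], num_blocks: int) -> Optional[int]:
    """Number of accesses until the miss count reaches num_blocks.

    Assumes a fully associative LRU cache holding num_blocks blocks.
    Returns None if the trace ends before that many misses occur.
    """
    cache: "OrderedDict[int, None]" = OrderedDict()
    misses = 0
    for count, block in enumerate(block_addrs, start=1):
        if block in cache:
            cache.move_to_end(block)  # hit: mark as most recently used
            continue
        misses += 1
        if len(cache) >= num_blocks:
            cache.popitem(last=False)  # evict the least recently used block
        cache[block] = None
        if misses == num_blocks:
            return count
    return None
```

With such a helper, the rule of using traces at least 10 times the lifetime function in length amounts to checking that the trace length is at least ten times the value returned.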
We have made the assumption that access time is related linearly to block size. In many cases this is not true. It is essentially true for the Multibus, since only two bytes can be fetched at a time, and arbitration is overlapped with bus operations. For a single-chip implementation, it would almost certainly be worthwhile to provide the capability for efficient multiple transfers over a set of wires into the processor. This has not been incorporated in our analysis, but will undoubtedly suggest a somewhat larger transfer block size.
4.3. Cache Size
In general, we were surprised at the effectiveness of a small cache. For the PDP-11 traces with a cache of 2K bytes or larger, we discovered that essentially no misses occurred after the cold start period. These are not trivial programs, but were run on a machine which has only 64K bytes for both instructions and data. The programs are very frugal in their use of memory, and the entire working set apparently can fit in the cache.
4.4.1. Lowering the Overhead of Small Blocks
Small blocks are costly in that they greatly increase the overhead of the cache: an address tag and the two state bits are normally stored in the cache for each block transferred. We reduced this overhead by splitting the notion of block into two parts:
(1) The transfer block is the amount of data transferred from backing store into the cache on a read miss.
(2) The address block is the quantum of storage for which a tag is maintained in the cache. It is always a power of two larger than a transfer block. An effective cache can be implemented by keeping the transfer block small but making the address block larger.
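The saving from separating the two notions of block can be estimated with simple bookkeeping. The sketch below is a rough model under stated assumptions, not the paper's design: the function name `tag_overhead_bits` is hypothetical, one tag is kept per address block, the two state bits are assumed to be kept per transfer block, and the tag width is passed in as a fixed parameter rather than derived from the cache organization.

```python
def tag_overhead_bits(cache_bytes: int, transfer_bytes: int, address_bytes: int,
                      tag_bits: int, state_bits: int = 2) -> int:
    """Total tag and state storage, in bits, for a cache of cache_bytes.

    Assumptions: one tag per address block, state_bits per transfer block,
    and a tag width (tag_bits) that does not change with the block sizes.
    """
    if address_bytes % transfer_bytes != 0:
        raise ValueError("address block must be a multiple of the transfer block")
    ratio = address_bytes // transfer_bytes
    if ratio & (ratio - 1) != 0:
        raise ValueError("address block must be a power of two times the transfer block")
    address_blocks = cache_bytes // address_bytes
    transfer_blocks = cache_bytes // transfer_bytes
    return address_blocks * tag_bits + transfer_blocks * state_bits


if __name__ == "__main__":
    # Hypothetical 4K-byte cache, 4-byte transfer blocks, 16-bit tags.
    for address_bytes in (4, 16, 64):
        bits = tag_overhead_bits(4096, 4, address_bytes, tag_bits=16)
        print(f"address block {address_bytes:2d} bytes: {bits} overhead bits")
```

Under these assumptions the tag storage shrinks in proportion to the address block size, while the per-transfer-block state bits remain as a fixed floor.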
The VAX traces do not exhibit the same locality observed with the PDP-11, and a 64K-byte cache was not large enough to contain the entire working set of the program. This may be a result of the larger address space available, the more complex instruction set, or more complex programs. In all cases the programs were spread out over a much larger memory space than for the PDP-11 traces. For this comparison we used a small block size of 4 bytes. This may have had a greater impact on the VAX than on PDP-11 traces.
We found that reducing the size of the cache (below 4K bytes for the PDP-11) increased the miss ratio and the bus traffic - in general the two correlate well with respect to this parameter. Fig. 1 shows the average miss ratio and bus traffic as a function of total cache size for the PDP-11 traces. For this and all results given, the miss ratio includes writes. The bus traffic is given as a percent of the number of accesses that would be required if no cache were present. Fig. 2 shows the same data for the VAX traces.
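The two metrics reported in these figures can be written down explicitly. The sketch below is only a restatement of the definitions in the text, with hypothetical function names, and it assumes that without a cache every CPU reference would require exactly one bus access.

```python
def miss_ratio(misses: int, references: int) -> float:
    """Fraction of all references (reads and writes) that miss in the cache."""
    if references <= 0:
        raise ValueError("references must be positive")
    return misses / references


def bus_traffic_percent(bus_operations: int, references: int) -> float:
    """Bus operations as a percent of the accesses needed with no cache.

    Assumption: with no cache, each CPU reference costs one bus access.
    """
    if references <= 0:
        raise ValueError("references must be positive")
    return 100.0 * bus_operations / references
```

The two numbers need not move together: a write hit on a valid block adds a bus operation under write-once without adding a miss.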
4.4.2. The Effect of Large Address Blocks
The use of address blocks larger than transfer blocks means that only data from one address block in backing store can occupy any of the blocks making up an address block in the cache. There are cases where the appropriate transfer block is empty, but other transfer blocks in the same address block must be purged so that the new address block can be allocated. We examined this for various sizes of address blocks and found that the miss ratio increased very slowly up to a point. For the situation shown in fig. 7, the miss ratio had only risen by about 30% when the address block contained 64 bytes for the PDP-11 traces. That point was reached for the VAX traces when it contained 32-byte address blocks.
but 2-way set associative is not much worse, and 4-way set associative is somewhere in between. (But see [Smith83]). We had hoped to find that a high degree of associativity would improve performance, because such an organization is much more feasible in the VLSI domain, but results were negative. For results reported here we have assumed a 4-way, set associative cache.
4.5.3. Replacement Algorithm
Replacement strategy has been the subject of another study using the same simulator and traces [Smith83]. In order to limit its significance, which seems to be orthogonal to the issues raised here, we have assumed true LRU replacement among the elements of each set in all cases.
We predicted that the bus traffic would correlate well with miss ratio with respect to this parameter. To our surprise, the bus traffic actually declined initially as we increased the address block size. The decline was small, but consistent for the PDP-11 trace tapes, eventually climbing over the base line when the address block was 16 or 32 transfer blocks. The phenomenon was smaller, but discernible for the VAX traces as well, though in all cases the bus traffic started increasing sooner. This situation is shown in fig. 8.
We initially suspected that our simulation might be faulty. That was not the case, and eventually we were able to explain it and verify it. The write-once algorithm requires a bus operation whenever a block is modified initially (set to reserved). However, reservations could be made on the basis of either transfer blocks or address blocks. We had put the choice into the simulator, but had not experimented with it, reserving at the address block level. This in fact reduces the number of bus writes necessary for reservation because of spatial locality of writes: an address block already reserved need only be marked dirty when any transfer block within it is modified. This would seem to increase greatly the traffic whenever the block is purged from the cache, but in fact the effect is small: only those transfer blocks which have actually been modified need be written back.
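The effect of reservation granularity on bus writes can be illustrated with a toy count. This is a sketch under simplifying assumptions, not the simulator used in the study: the function name `reservation_bus_writes` is hypothetical, every written block is assumed to be already present and valid in the cache, and no purges or writes by other processors occur during the sequence.

```python
from typing import Iterable


def reservation_bus_writes(write_addrs: Iterable[int], reserve_bytes: int) -> int:
    """Count the first-write (reservation) bus operations for a write sequence.

    reserve_bytes is the granularity at which blocks are marked reserved:
    the transfer block size or the (larger) address block size.
    """
    reserved_units = set()
    bus_writes = 0
    for addr in write_addrs:
        unit = addr // reserve_bytes
        if unit not in reserved_units:
            # First write into this unit goes through the bus to reserve it.
            reserved_units.add(unit)
            bus_writes += 1
        # Later writes into an already reserved unit only mark it dirty locally.
    return bus_writes


if __name__ == "__main__":
    writes = [0, 4, 8, 12, 64]  # hypothetical byte addresses with spatial locality
    print(reservation_bus_writes(writes, 4))   # reserve per 4-byte transfer block
    print(reservation_bus_writes(writes, 16))  # reserve per 16-byte address block
```

When writes cluster within an address block, reserving at the coarser granularity replaces several reservation writes with one, which is the effect described above.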
We assumed that the cache supplied one word - 16 bits for the PDP-11 and 32 bits for the VAX - to the CPU on each request. However, the intelligence of the processor determines how often the same word must be fetched. The trace tapes contain all memory references. We filtered these with the assumption that on instruction fetches the same word would not be fetched without an intervening instruction fetch. No filtering was done on data fetches.
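One reading of this filtering rule is that an instruction fetch is dropped when it addresses the same word as the immediately preceding instruction fetch, while data references always pass through. The sketch below follows that reading; the trace record format (a kind code and a byte address) and the name `filter_trace` are assumptions made for illustration, not the format of the original trace tapes.

```python
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical trace record: (kind, byte_address), where kind "I" marks an
# instruction fetch and any other kind is treated as a data reference.
Ref = Tuple[str, int]


def filter_trace(refs: Iterable[Ref], word_bytes: int) -> Iterator[Ref]:
    """Drop instruction fetches that repeat the word of the previous instruction fetch."""
    last_ifetch_word: Optional[int] = None
    for kind, addr in refs:
        if kind == "I":
            word = addr // word_bytes
            if word == last_ifetch_word:
                continue  # same word as the previous instruction fetch: skip it
            last_ifetch_word = word
        yield kind, addr  # data references are never filtered
```

For the PDP-11 traces `word_bytes` would be 2 and for the VAX traces 4, matching the word sizes stated above.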
5. Summary
Our simulations indicate that a single board computer with a 4K-byte cache can perform reasonably well with less than 10% of the accesses required to its primary memory without a cache. The PDP-11 traces suggest a number as low as 3%. While the VAX numbers are higher, additional declines will be experienced by increasing the size of the cache beyond 4K bytes.
We demonstrated that this was indeed responsible for the behavior noted by changing the reservations to the transfer block level. The simulation results then exhibited the originally predicted behavior, correlating closely with the miss ratio.
We conclude that minimum bus traffic is generated with minimum transfer block sizes. The miss ratio may be substantially improved by using slightly larger transfer blocks, in which case bus traffic does not increase greatly. Using larger address blocks reduces the cost of the tag memory considerably. It initially has only a minor effect on miss ratio, which is more than offset by the savings in writes due to the more efficient reservation of modified blocks.
An important result is the use of the write-once algorithm to guarantee consistent data among multiple processors. We have shown that this algorithm can be implemented in a way that degrades performance only trivially (ignoring actual collisions, which are rare), and performs better than either pure write-back or write-through in many instances.
The use of small transfer block sizes can be coupled
with large address blocks to build an inexpensive cache
which performs effectively in the absence of frequent
process switches. The low bus utilization and the
solution to the stale data problem make possible an
environment for which this condition is met. Even though the
miss ratio increases, bus traffic initially declines as the
address block is enlarged, holding the transfer block
constant. Therefore larger address blocks should be
used for reserving memory for modification even if small
blocks are used for transfer of data.
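The saving in tag memory from larger address blocks can be estimated with a little arithmetic. The sketch below assumes a direct-mapped cache, 16-bit byte addresses, power-of-two sizes, and one valid bit per transfer block within each address block. These parameters and the function name `tag_memory_bits` are illustrative assumptions, not figures taken from the simulations.

```python
def log2_exact(n):
    """Base-2 logarithm of a power of two, as an integer."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a positive power of two")
    return n.bit_length() - 1


def tag_memory_bits(cache_bytes, address_block, transfer_block, address_bits=16):
    """Total bits of tag storage: one tag per address block, plus one
    valid bit for each transfer block inside that address block."""
    entries = cache_bytes // address_block            # number of tags stored
    index_bits = log2_exact(entries)                  # selects the entry
    offset_bits = log2_exact(address_block)           # byte within the block
    tag_bits = address_bits - index_bits - offset_bits
    valid_bits = address_block // transfer_block      # one per transfer block
    return entries * (tag_bits + valid_bits)


# 2048-byte cache, 4-byte transfer blocks in both cases.
small_blocks = tag_memory_bits(2048, address_block=4, transfer_block=4)
large_blocks = tag_memory_bits(2048, address_block=64, transfer_block=4)
```

Under these assumptions the tag storage shrinks from 3072 bits to 672 bits when the address block grows from 4 to 64 bytes, while the transfer block (and therefore the data moved on each miss) stays at 4 bytes.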
The approach advocated here is appropriate only for
a system containing a single logical memory. This is
significant because it depends on the serialization of
memory accesses to assure consistency. It has
applications beyond those studied here, however. For example,
the access path to memory could be via a ring network,
or any other technique in which every request passes
every processor.
This extension seems particularly
applicable for maintaining consistency for a file
system or a common virtual memory being supplied to
4.5.~. Associativity
We ran a number of simulations varying the associativity all the way from direct mapped to fully associative.
While this is clearly an important parameter, we have little new to report, i.e., fully associative cache is the best,
[IEEE 80] "Proposed microcomputer system bus standard (P796 bus)," IEEE Computer Society Subcommittee Microcomputer System Bus Group, October 1980.
30-36.
[Knight 66] J. R. Knight, "Changes in computer performance," Datamation, Vol. 12, No. 9, September 1966, pp. 40-54.
[Lindsay 81] "Cache Memory for Microprocessors," Computer Architecture News, ACM-SIGARCH, Vol. 9, No. 5, August 1981, pp. 6-13.
[Liptay 68] J. S. Liptay, "Structural aspects of the System/360 Model 85, Part II: the cache," IBM Syst. J., Vol. 7, No. 1, 1968, pp. 15-21.
[Norton 82] R. L. Norton and J. L. Abraham, "Using write back cache to improve performance of multiuser multiprocessors," 1982 Int. Conf. on Par. Proc., IEEE cat. no. 82CH1794-7, 1982, pp. 326-331.
6. Acknowledgements
This material is based upon work supported by the
National Science Foundation under Grant MCS-8202952.
We thank Dr. A. J. Smith for providing the PDP-11
trace tapes upon which much of our early work
depended. We also wish to thank T.-H. Yang for
developing the VAX trace facility. P. Vitale and T. Doyle
contributed much through discussions and by commenting on
an early draft of the manuscript.
7. References
[Amdahl 82] C. Amdahl, private communication, March 82.
[Bell 74] J. Bell, D. Casasent, and C. G. Bell, "An investigation of alternative cache organizations," IEEE
Trans. on Computers, Vol. C-23, No. 4, April 1974,
pp. 348-351.
530.
[Censier 78] L. M. Censier and P. Feautrier, "A new solution to coherence problems in multicache systems," IEEE Trans. on Computers, Vol. C-27, No. 12, December 1978, pp. 1112-1118.
1953).
[Widdoes 79] L. C. Widdoes, "S-1 Multiprocessor architecture (MULT-2)," 1979 Annual Report - the S-1 Project, Volume 1: Architecture, Lawrence Livermore Laboratories, Tech. Report UCID 18819, 1979.
[IBM 74] "System/370 model 155 theory of operation/diagrams manual (volume 5): buffer control unit," IBM System Products Division, Poughkeepsie, N.Y., 1974.
[IBM 76] "System/370 model 168 theory of operation/diagrams manual (volume 1)," Document No. SY22-6931-3, IBM System Products Division, Poughkeepsie, N.Y., 1978.
Fig. 1. Bus Transfer and Miss Ratios vs. Cache Size; blocks are
4 bytes; PDP-11 traces. The bus transfer ratio is the number
of transfers between cache and main store relative to those
necessary if there were no cache.
Fig. 2. Bus Transfer and Miss Ratios vs. Cache Size; 4-byte
blocks; VAX-11 traces.
Fig. 3. Miss Ratio vs. Block Size for warm and cold starts;
PDP-11 traces.
Fig. 4. Miss Ratio vs. Block Size for warm and cold starts;
Fig. 5. Bus Transfer Ratio vs. Block Size for warm and cold
starts; PDP-11 traces.
Fig. 6. Bus Transfer Ratio vs. Block Size for warm and cold
starts; VAX-11 traces.
Fig. 7. Miss Ratio vs. Address Block Size for warm and cold
starts,
Fig. 8. Bus Transfer Ratio vs. Address Block Size for warm and
cold starts; WOA: address blocks are reserved. WOT: transfer
blocks are reserved.
VAX-11 traces.