This paper proposes an algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a loop transformation theory that unifies the various transforms as unimodular matrix transformations.

The algorithm has been implemented in the SUIF (Stanford University Intermediate Format) compiler, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation (SOR), LU decomposition without pivoting, and Givens QR factorization. Performance evaluation indicates that locality optimization is especially crucial for scaling up the performance of parallel code.

In this code, although the same row of C and the same row of B are reused in the next iteration of the middle and outer loop, respectively, the large volume of data used in the intervening iterations may replace the data in the register file or the cache before it can be reused. Tiling reorders the execution sequence such that iterations from loops of the outer dimensions are executed before completing all the iterations of the inner loop. The tiled matrix multiplication is

    for II2 := 1 to n by s
      for II3 := 1 to n by s
        for I1 := 1 to n
          for I2 := II2 to min(II2 + s − 1, n)
            for I3 := II3 to min(II3 + s − 1, n)
              C[I1,I3] += A[I1,I2] * B[I2,I3];
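The reordering above can be checked mechanically. Below is a minimal executable sketch in Python (0-based indices; `matmul_tiled` is our name for illustration, not from the paper) of the tiled loop order. Only the execution order changes, so the result is identical to the untiled triple loop.

```python
def matmul_tiled(A, B, n, s):
    """Tiled matrix multiplication: II2 and II3 step over s-sized tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii2 in range(0, n, s):              # II2: tile the middle loop
        for ii3 in range(0, n, s):          # II3: tile the inner loop
            for i1 in range(n):
                for i2 in range(ii2, min(ii2 + s, n)):
                    for i3 in range(ii3, min(ii3 + s, n)):
                        C[i1][i3] += A[i1][i2] * B[i2][i3]
    return C
```

With cache-sized tiles, each s × s block of B is reused across all n iterations of the I1 loop while it is still resident, which is the locality the tiling is designed to expose.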
[Figure 1 plot: MFLOPS versus number of processors for cache tiling, register tiling, and no tiling.]

Figure 1: Performance of 500 × 500 double precision matrix multiplication on the SGI 4D/380. Cache tiles are 64 × 64 iterations and register tiles are 4 × 2.

The memory bus limits the speedup to about 4.5 times. Cache tiling permits speedups of over seven for eight processors, achieving an impressive speed of 64 MFLOPS when combined with register tiling.

1.1 The Problem

The problem addressed in this paper is the use of loop transformations such as interchange, skewing and reversal to improve the locality of a loop nest. Matrix multiplication is a particularly simple example because it is both legal and advantageous to tile the entire nest. In general, it is not always possible to tile the entire loop nest: some loop nests may not be tilable, and sometimes it is necessary to first apply transformations such as interchange, skewing and reversal to produce a set of loops that are both tilable and advantageous to tile.

For example, consider the abstraction of hyperbolic PDE code in Figure 2(a). Suppose the array in this example is larger than the memory hierarchy level of interest; the entire array must be fetched anew for each iteration of the outermost loop. Due to dependences, the loops must first be skewed before they can be tiled. This is also equivalent to finding non-rectangular tiles. Figure 2 contains the entire derivation of the tiled code, which we will use to illustrate our locality algorithm in the rest of the paper.

Loops whose dependences can be summarized by distance vectors are special in that it is advantageous, possible and easy to tile all loops [10, 15]. For general loop nests, two-dimensional tiling can be achieved via a pair of transformations known as "strip-mine and interchange" [14] or "unroll and jam" [5]. Wolfe also shows that skewing can make a pair of loops tilable. Banerjee discusses general unimodular transforms for two-deep loop nests [4]. A technique used in practice to handle general n-dimensional loop nests is to determine a priori the sequence of loop transforms to attempt. This technique is inadequate because certain transformations, such as loop skewing, may not improve code by themselves, but may enable other optimizations that do. Which of these to perform depends on which other optimizations will be enabled: the desirability of a transformation cannot be evaluated locally. Furthermore, the correct ordering of optimizations is highly program dependent.

On the desirability of tiling, previous work concentrated on how to determine the cache performance and tune the loop parameters for a given loop nest. Porterfield gives an algorithm for estimating the hit rate of a fully-associative LRU (least recently used replacement policy) cache of a given size [14]. Gannon et al. use reference windows to determine the minimum memory locations necessary to maximize reuse in a loop nest [8]. These evaluation functions are useful for comparing the locality performance after applying transformations, but do not suggest the transformations to apply when a series of transformations may first need to be applied before tiling becomes feasible and useful.

If we were to use these evaluation functions to find the suitable transformations, we would need to search the transformation space exhaustively. The previously proposed method of enumerating all possible combinations of legal transformations is expensive, and not even possible if there are infinitely many combinations, as is the case when we include skewing.
Figure 2: (a): Extract dependence information

    for I1 := 0 to 5 do
      for I2 := 0 to 6 do
        A[I2 + 1] := 1/3 * (A[I2 ] + A[I2 + 1] + A[I2 + 2]);

    D = {(0, 1), (1, 0), (1, −1)}

[Figure 2(b): reuse vector spaces — self-temporal span{(1, 0)}, self-spatial span{(1, 0), …}, group span{(1, 0), …}]
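The distance vectors in D can be recovered by pairing the write A[I2 + 1] with each read A[I2 + d] of the stencil. A small sketch in Python (`flow_distances` is a hypothetical helper for illustration, not the paper's dependence extractor): since I1 does not appear in the subscripts, a dependence exists at every outer-loop distance, and we record the smallest I1 component that keeps the vector lexicographically positive.

```python
def flow_distances(write_off, read_offs):
    """Minimal lexicographically positive distances from the write A[I2+write_off]
    to each read A[I2+d]: the I2 distance solves i2' + d = i2 + write_off,
    and the I1 distance is the smallest value keeping (d1, d2) positive."""
    dists = []
    for d in read_offs:
        d2 = write_off - d
        d1 = 0 if d2 > 0 else 1
        dists.append((d1, d2))
    return dists
```

For the stencil above this yields exactly D = {(0, 1), (1, 0), (1, −1)}.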
solving for the transformation matrix that maximizes an objective function, subject to a set of constraints. This matrix model has previously been applied only to distance vectors. We have extended the framework to handle direction vectors as well.

The second result is a formulation of an objective function for data locality. We introduce the concept of a reuse vector space to capture the potential of data locality optimization for a given loop nest. This formulation collapses

the nest, counting from the outermost to innermost loop. The iterations are therefore executed in lexicographic order of their index vectors. That is, if p2 is lexicographically greater than p1, written p2 ≻ p1, iteration p2 executes after iteration p1.

Our dependence representation is a generalization of distance and direction vectors. A dependence vector in an n-nested loop is denoted by a vector d = (d1, d2, ..., dn). Each component di is a possibly infinite range of integers, represented by [di_min, di_max], where

    di_min ∈ Z ∪ {−∞},  di_max ∈ Z ∪ {∞}  and  di_min ≤ di_max.

A single dependence vector therefore represents a set of distance vectors, called its distance vector set:

    E(d) = { (e1, ..., en) | ei ∈ Z and di_min ≤ ei ≤ di_max }.

Each of the distance vectors defines a set of edges on pairs of nodes in the iteration space. Iteration p2 depends on iteration p1, and thus must execute after p1, if for some distance vector e, p2 = p1 + e. By definition, since p2 must therefore be lexicographically greater than p1, e ≻ 0; that is, e is lexicographically positive.

The dependence vector d is also a distance vector if each of its components is a degenerate range consisting of a singleton value, that is, di_min = di_max. For short, we simply denote such a range with the value itself. There are three common ranges found in practice: [1, ∞] denoted by '+', [−∞, −1] denoted by '−', and [−∞, ∞] denoted by '±'. They correspond to the previously defined directions '<', '>' and '*', respectively [17].

We have extended the definition of vector operations to allow for ranges in each of the components. In this way, we can manipulate a combination of distances and directions simply as vectors. This is needed to support the matrix transform model, discussed below.

Our model differs from the previously used model in that all our dependence vectors are represented as lexicographically positive vectors. In particular, consider a strictly sequential pair of loops such as the one below:

    for I1 := 0 to n do
      for I2 := 0 to n do
        b := g (b);

The dependences of this program would previously be represented as ('*', '*'). In our model, we represent them as a pair of lexicographically positive vectors, (0, '+') and ('+', '±'). The requirement that all dependences are lexicographically positive greatly simplifies the legality tests for loop transformations. The dependence vectors define a partial order on the nodes in the iteration space, and any topological ordering on the graph is a legal execution order, as all dependences in the loop are satisfied.

2.2 Unimodular Loop Transformations

With dependences represented as vectors in the iteration space, loop transformations such as interchange, skewing and reversal can be represented as matrix transformations. Let us illustrate the concept with the simple example of an interchange on a loop with distances. A loop interchange transformation maps iteration (p1, p2) to iteration (p2, p1). In matrix notation, we can write this as

    [0 1; 1 0] (p1, p2)^T = (p2, p1)^T.

The elementary permutation matrix thus performs the loop interchange transformation on the iteration space.

Since a matrix transformation T is a linear transformation on the iteration space, T p2 − T p1 = T (p2 − p1). Therefore, if d is a distance vector in the original iteration space, then T d is a distance vector in the transformed iteration space. Thus in loop interchange, the dependence vector (d1, d2) is mapped into

    [0 1; 1 0] (d1, d2)^T = (d2, d1)^T

in the transformed space. Therefore if the transformed dependence vector remains lexicographically positive, the interchange is legal.

The loop reversal and skewing transforms can similarly be represented as matrices [3, 4, 16]. (An example of skewing is shown in Figure 2(d).) These matrices are unimodular matrices, that is, square matrices with integral components and a determinant of one or negative one. Because of these properties, the product of two unimodular matrices is unimodular, and the inverse of a unimodular matrix is unimodular, so that combinations of unimodular loop transformations and their inverses are also unimodular loop transformations. Under this formulation, there is a simple legality test for all transforms.

Theorem 2.1. Let D be the set of distance vectors of a loop nest. A unimodular transformation T is legal if and only if ∀d ∈ D : T d ≻ 0.

The elegance of this theory helps reduce the complexity of the implementation. Once the dependences are extracted, the derivation of the compound transform simply consists of matrix and vector operations. After the transformation is determined, a straightforward algorithm applies the transformation to the loop bounds and derives the final code.

3 The Localized Vector Space

It is important to distinguish between reuse and locality. We say that a data item is reused if the same data is used in multiple iterations in a loop nest. Thus reuse is a measure that is inherent in the computation and not dependent on the particular way the loops are written. This reuse may not lead to a saved memory access if intervening iterations flush the data out of the cache between uses of the data.

For example, reference A[I2 ] in Figure 3 touches different data within the innermost loop, but reuses the same elements across the outer loop. More precisely, the same data A[I2 ] is used in iterations (I1 , I2 ), 1 ≤ I1 ≤ n. There is reuse, but the reuse is separated by accesses to n − 1

    for I1 := 1 to n do
      for I2 := 1 to n do
        f (A[I1 ],A[I2 ]);

"Unroll and jam" is yet another equivalent form of tiling. Since all loops within a tile are fully permutable, any loop in an n-dimensional tile can be chosen to be the coalesced loop.

Figure 4: Iterat
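Theorem 2.1 turns legality into a mechanical check. A sketch in Python (helper names are ours, for illustration): apply the integer transformation matrix to every dependence vector and test lexicographic positivity. For the dependence set D = {(0, 1), (1, 0), (1, −1)} of Figure 2, interchange is illegal, while skewing the inner loop with respect to the outer is legal.

```python
def is_lex_positive(v):
    """A vector is lexicographically positive if its first nonzero entry is > 0."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False  # the zero vector is not lexicographically positive


def transform_legal(T, D):
    """Theorem 2.1: T is legal iff T*d is lexicographically positive for all d in D."""
    def mat_vec(M, d):
        return tuple(sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(M)))
    return all(is_lex_positive(mat_vec(T, d)) for d in D)
```

Interchange [0 1; 1 0] maps (1, −1) to (−1, 1), which is lexicographically negative; the skew [1 0; 1 1] maps it to (1, 0), so the skewed nest remains legal.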
These references belong to a single uniformly generated set with an H of [0 1].

4.2 Quantifying Reuse and Locality

So far, we have discussed intuitively where reuse takes place. We now show how to identify and quantify reuse within an iteration space. We use vector spaces to represent the directions in which reuse is found; these are the directions we wish to include in the localized space. We evaluate the locality available within a loop nest using the metric of memory accesses per iteration of the innermost loop. A reference with no locality will result in one access per iteration.

4.2.1 Self-Temporal

Consider the self-temporal reuse for a reference A[Hi + c]. Iterations i1 and i2 reference the same data element whenever Hi1 + c = Hi2 + c, that is, when H(i1 − i2) = 0. We say that there is reuse in direction r when Hr = 0. That is, the reuse is exploited if r is included in the localized vector space. The solution set of this equation is ker H, a vector space in R^n. We call this the self-temporal reuse vector space, R_ST. Thus,

    R_ST = ker H.

For example, the reference C[I1 ,I3 ] in the matrix multiplication in Section 1 produces:

    H = [1 0 0; 0 0 1]

    R_ST = ker H = span{(0, 1, 0)}.

Informally, we say there is self-temporal reuse in loop I2. Since loop I2 has n iterations, there are n reuses of C in this loop nest. Similar analysis shows that A[I1 ,I2 ] has self-temporal reuse in the I3 direction and B[I2 ,I3 ] has reuse in the I1 direction.

In this example nest, every reuse vector space is one-dimensional. In general, the reuse vector space can have zero or more dimensions. If the dimensionality is zero, then R_ST = ∅ and there is no self-temporal reuse. An example of a two-dimensional reuse vector space is the reference A[I1 ] within a three-deep nest with loops (I1 , I2 , I3 ).

In general, if the number of iterations executed along each dimension of reuse is s, then each element is reused s^dim(R_ST) times.

As discussed in Section 3, a reuse is exploited only if it occurs within the localized vector space. Thus, a reference has self-temporal locality if its self-temporal reuse space R_ST and the localized space L in the code have a non-null intersection. The dimensionality of R_ST ∩ L indicates the quantity of self-temporal reuse utilized: the number of memory accesses is, simply,

    1 / s^dim(R_ST ∩ L),

where s is the number of iterations in each dimension.

Consider again the matrix multiplication code. The only localized direction in the untiled code is the innermost loop, which is span{(0, 0, 1)}. It coincides exactly with R_ST(A[I1 , I2 ]), resulting in locality for that reference. Similarly, the empty intersection with the reuse vector space of references B[I2 ,I3 ] and C[I1 ,I3 ] indicates no temporal locality for these references. In contrast, the localized vector space of the tiled matrix multiplication spans all three loop directions. Trivially, there is a non-empty intersection with each of the reuse spaces, so self-temporal reuses are exploited for all references.

All references within a single uniformly generated set have the same H, and thus the same self-temporal reuse vector space. Therefore the derivation of the reuse vector spaces and their intersections with the localized space need only be performed once per uniformly generated set, not once per reference. The total number of memory accesses is the sum over that of each reference. For typical values of tile sizes chosen for caches, the number of memory accesses will be dominated by the one with the smallest value of dim(R_ST ∩ L). Thus the goal of loop transforms is to maximize the dimensionality of reuse; the exact value of the tile size s does not affect the choice of the transformation.

4.2.2 Self-Spatial

Without loss of generality, we assume that data is stored in row-major order. Spatial reuse can occur only if accesses are made to the same row. Furthermore, the difference in the row index expression has to be within the cache line size. For a reference such as A[I1 ,I2 ], the memory accesses in the I2 loop will be reduced by a factor of l, where l is the line size. However, there is no reuse for A[I1 ,10*I2 ] if the cache line is less than 10 words. For any stride k, 1 ≤ k ≤ l, the potential reuse factor is l/k.

All but the row index must be identical for a reference A[Hi + c] to possess self-spatial reuse. We let HS be H with all elements of the last row replaced by 0. The self-spatial reuse vector space is simply

    R_SS = ker HS.

As discussed above, temporal locality is a special case of spatial locality. This is reflected in our mathematical formulation, since

    ker H ⊆ ker HS.

If R_SS ∩ L = R_ST ∩ L, the reuse occurs on the same word, the spatial reuse is also temporal, and there is no additional gain due to the prefetching of cache lines. If R_SS ∩ L ≠

Reuse does not exist between A[I1 ,I2 − 1] and A[I1 +
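Both kernels above can be computed mechanically with ordinary rational elimination. A small sketch (plain Python with exact rationals; `kernel_basis` and `spatial_matrix` are illustrative helpers, not the paper's implementation) that recovers R_ST = span{(0, 1, 0)} for C[I1,I3] and the enlarged self-spatial kernel:

```python
from fractions import Fraction


def kernel_basis(H):
    """Nullspace basis of H via Gauss-Jordan elimination over the rationals."""
    m, n = len(H), len(H[0])
    A = [[Fraction(x) for x in row] for row in H]
    pivots, r = [], 0
    for c in range(n):
        piv = next((i for i in range(r, m) if A[i][c] != 0), None)
        if piv is None:
            continue  # free column
        A[r], A[piv] = A[piv], A[r]
        A[r] = [x / A[r][c] for x in A[r]]
        for i in range(m):
            if i != r and A[i][c] != 0:
                A[i] = [a - A[i][c] * b for a, b in zip(A[i], A[r])]
        pivots.append(c)
        r += 1
    basis = []
    for c in range(n):
        if c in pivots:
            continue
        v = [Fraction(0)] * n
        v[c] = Fraction(1)
        for i, pc in enumerate(pivots):
            v[pc] = -A[i][c]
        basis.append(tuple(v))
    return basis


def spatial_matrix(H):
    """H_S is H with its last row zeroed: drop the fastest-varying (row-major) index."""
    HS = [list(row) for row in H]
    HS[-1] = [0] * len(HS[-1])
    return HS
```

Since ker H ⊆ ker HS by construction, the self-spatial basis always contains the self-temporal one, plus at most one extra direction.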
4.2.3 Group-Temporal

To illustrate the analysis of group-temporal reuse, we use the code for a single SOR relaxation step:

    for I1 := 1 to n do
      for I2 := 1 to n do
        A[I1 ,I2 ] := 0.2*(A[I1 ,I2 ] + A[I1 + 1,I2 ] + A[I1 − 1,I2 ] + A[I1 ,I2 + 1] + A[I1 ,I2 − 1]);

There are five distinct references in this innermost loop, all to the same array. The reuse between these references can potentially reduce the number of accesses by a factor of 5. If all temporal and spatial reuses are exploited, the total number of memory accesses per iteration is reduced from 5 to 1/l, where l is the cache line size. With the way the nest is currently written, the localized space consists of only the I2 direction. The reference A[I1 ,I2 − 1] uses the same data as A[I1 ,I2 ] from the previous iteration, and as A[I1 ,I2 + 1] from the second previous iteration. However, A[I1 − 1,I2 ] and A[I1 + 1,I2 ] must necessarily each access a different set of data. The number of accesses per iteration is thus 3/l. We now show how to mathematically calculate and factor in this group-temporal reuse.

As discussed above, group-temporal reuse need only be calculated for elements within the same uniformly generated set. Two distinct references A[Hi + c1] and

reuse exists and only exists between all pairs of references within each class. Thus, the effective number of memory references is simply the number of equivalence classes. In this case, there are three references instead of five, as expected.

In general, the vector space in which there is any group-temporal reuse, R_GT, for a uniformly generated set with references {A[Hi + c1], ..., A[Hi + cg]} is defined to be

    R_GT = span{r2, ..., rg} + ker H

where for k = 2, ..., g, rk is a particular solution of

    H r = c1 − ck.

For a particular localized space L, we get an additional benefit due to group reuse if and only if R_GT ∩ L ≠ R_ST ∩ L. To determine the benefit, we must partition the references into equivalence classes as shown above. Denoting the number of equivalence classes by gT, the number of memory references for the uniformly generated set per iteration is gT, instead of g.

4.2.4 Group-Spatial

Finally, there may also be group-spatial reuse. For example, in the following loop nest

    for I1 := 1 to n do
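The equivalence-class count for the SOR references can be sketched directly. With H the identity, as it is for the five SOR references, the equation H r = c1 − ck has the unique solution r = c1 − ck, so two references share a class exactly when their offset difference lies in L. A small sketch (Python; helper names are ours, for illustration), with L = span{(0, 1)}:

```python
def group_temporal_classes(offsets, in_L):
    """Partition reference offsets into classes whose pairwise differences lie in L.
    Greedy against one representative per class suffices because membership of the
    difference in a subspace L is an equivalence relation."""
    classes = []
    for c in offsets:
        for cl in classes:
            diff = tuple(a - b for a, b in zip(c, cl[0]))
            if in_L(diff):
                cl.append(c)
                break
        else:
            classes.append([c])
    return classes
```

For the SOR offsets {(0,0), (1,0), (−1,0), (0,1), (0,−1)} this yields three classes, matching the "three references instead of five" count in the text.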
Table 1: Self-temporal and self-spatial locality in tiled and untiled matrix multiplication. We use the vectors e1, e2 and e3 to represent (1, 0, 0), (0, 1, 0) and (0, 0, 1), respectively.
Two references with index functions Hi + ci and Hi + cj belong to the same group-spatial equivalence class if and only if

    ∃r ∈ L : HS r = cS,i − cS,j.

We denote the number of equivalence sets by gS. Thus, gS ≤ gT ≤ g, where g is the number of elements in the uniformly generated set. Using a probabilistic argument, the effective number of accesses per iteration for stride-one accesses, with a line size l, is

    gS + (gT − gS)/l.

The reuse vector spaces are

    R_ST = ker H = span{(1, 0)}
    R_SS = ker HS = span{(0, 1), (1, 0)}
    R_GT = span{(0, 1)} + ker H = span{(0, 1), (1, 0)}
    R_GS = span{(0, 1)} + ker HS = span{(0, 1), (1, 0)}.

When the localized space L is span{(0, 1)},

    R_ST ∩ L = ∅
    R_SS ∩ L = span{(0, 1)} ≠ R_ST
    R_GT ∩ L = span{(0, 1)}

Since gT = 1, the total number of memory accesses per iteration is 1/l. A similar derivation shows the overall number of accesses per iteration for the localized space span{(1, 0), (0, 1)} to be 1/(ls).

The union of the group-spatial reuse vector spaces of each uniformly generated set captures the entire space in which there is reuse within a loop nest. If we can find a transformation that generates a localized space encompassing these spaces, then all the reuses will be exploited. Unfortunately, due to dependence constraints, this may not be possible. In that case, the locality optimization problem is to find the transform that delivers the fewest memory accesses per iteration.

From the discussion above, the memory accesses per iteration for a particular transformed nest can be calculated as follows. The total number of memory accesses is the sum of the accesses for each uniformly generated set. The general formula for the number of accesses per iteration for a uniformly generated set, given a localized space L and line size l, is

    (gS + (gT − gS)/l) / (l^e · s^dim(R_ST ∩ L))

where

    e = 0 if R_ST ∩ L = R_SS ∩ L, and 1 otherwise.

We now study the full example in Figure 2(a) to illustrate the reuse and locality analysis. The reuse vector spaces of the code are summarized in Figure 2(b). For the uniformly generated set in the example,

    H = [0 1],  HS = [0 0].

The data locality optimization problem can now be formulated as follows:

Definition 4.2. For a given iteration space with

1. a set of dependence vectors, and

2. uniformly generated reference sets,

the data locality optimization problem is to find the unimodular and/or tiling transform, subject to data dependences, that minimizes the number of memory accesses per iteration.

5 An Algorithm

Our analysis of locality shows that differently transformed loops can differ in locality only if they have different localized vector spaces. That is, all transformations that generate code with the same localized vector space can be put in an equivalence class and need not be examined individually. The only feature of interest is the innermost tile; the outer loops can be executed in any legal order or orientation. Similarly, reordering or skewing the tiled loops themselves does not affect the vector space and thus need not be considered.

From the reuse analysis, we can identify a subspace that is desirable to make into the innermost tile. The question of whether there exists a unimodular transformation that is legal and creates such an innermost subspace is a difficult one. An existing algorithm that attempts to find such a transform is exponential in the number of loops [15]. The

space. These loops include those that carry no reuse and can be placed outermost legally, and those that carry reuse but must be placed outermost due to legality reasons. For example, if a three-deep loop has dependences

    {(1, '±', '±'), (0, 1, '±'), (0, 0, '+')}
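The access formula can be exercised on the worked example. A sketch with assumed illustrative values l = 4 and s = 32 (any values would do; as noted earlier, the exact tile size does not change which transform is best):

```python
def accesses_per_iter(gS, gT, l, s, dim_st, spatial_differs):
    """Evaluate (gS + (gT - gS)/l) / (l^e * s^dim(R_ST ∩ L)), where e = 0
    when R_SS ∩ L equals R_ST ∩ L (spatial reuse adds nothing) and 1 otherwise."""
    e = 1 if spatial_differs else 0
    return (gS + (gT - gS) / l) / (l ** e * s ** dim_st)
```

For the untiled localized space span{(0, 1)} (gS = gT = 1, dim(R_ST ∩ L) = 0, spatial reuse present) this gives 1/l; for the fully localized span{(1, 0), (0, 1)} (dim(R_ST ∩ L) = 1) it gives 1/(ls), matching the two derivations above.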
Corollary 5.2. If loop Ik has dependences such that ∃d ∈ D : dk_min = −∞ and ∃d ∈ D : dk_max = ∞, then the outermost fully permutable nest consists only of a combination of loops not including Ik.

loops fully permutable, and therefore tilable, if some such transformation exists. If there is little reuse, or if data dependences constrain the legal ordering possibilities, the algorithm is fast since I is small. The algorithm is only slow when there are many loops carrying reuse and few dependences. This algorithm can be further improved by ordering the search through the power set of I using a branch and bound approach.

The SRP algorithm takes as input the loops N that have not been placed outside this nest, and the set of dependences D that have not been satisfied by loops outside this loop nest. It first removes from N those serializing loops as defined by the Ik's of Corollary 5.2. It then uses an iterative step to build up the fully permutable loop nest F. In each step, it tries to find a loop from the remaining loops in N that can be made fully permutable with F via possibly multiple applications of Theorem 5.1. If it succeeds in finding such a loop, it permutes the loop to be next outermost in the fully permutable nest, adding the loop to F and removing it from N. Then it repeats, searching through the remaining loops in N for another loop to place in F. This algorithm is known as SRP because the unimodular transformation it performs can be expressed as the product of a skew transformation (S), a reversal transformation (R) and a permutation transformation (P).

We use SRP in both steps of finding a transformation that makes a loop nest I the innermost tile. We first apply SRP iteratively to those loops not in I. Each step through the SRP algorithm attempts to find the next outermost fully permutable loop nest, returning all remaining loops that cannot be made fully permutable and returning the unsatisfied dependences. We repeatedly call SRP on the remaining loops until (1) SRP fails to find a single loop to place outermost, in which case the algorithm fails to find a legal ordering for this target innermost tile I, or (2) there are no remaining loops, in which case the algorithm succeeds. If this step succeeds, we then call SRP with loops I, and all remaining dependences to be satisfied. In this case, we succeed only if SRP makes all the loops fully permutable.

Let us illustrate the SRP algorithm using the example in Figure 2. Suppose we are trying to tile loops I1 and I2. First an outer loop must be chosen. I1 can be the outer loop, because its dependence components are all non-negative. Now loop I2 has a dependence component that is negative, but it can be made non-negative by skewing

6 Tiling Experiments

We have implemented the algorithm described in this paper in our SUIF compiler. The compiler currently generates working tiled code for one and multiple processors of the SGI 4D/380. However, the scalar optimizer in our compiler has not yet been completed, and the numbers obtained with our generated code would not reflect the true effect of tiling when the code is optimized. Fortunately, our SUIF compiler also includes a C backend; we can use the compiler to generate restructured C code, which can then be compiled by a commercial scalar optimizing compiler.

The numbers reported below are generated as follows. We used the compiler to generate tiled C code for a single processor. We then performed, by hand, optimizations such as register allocation of array elements [5], moving loop-invariant address calculation code out of the innermost loop, and unrolling the innermost loop. We then compiled the code using the SGI's optimizing compiler. To run the code on multiple processors, we adopt the model of executing the tiles in a DO-ACROSS manner [16]. This code has the same structure as the sequential code. We needed to add only a few lines to the sequential code to create multiple threads, and to initialize, check and increment a small number of counters within the outer loop.

6.1 LU Decomposition

The original code for LU-decomposition is:
    for I1 := 1 to n do
      for I2 := I1 +1 to n do
        A[I2 ,I1 ] /= A[I1 ,I1 ];
        for I3 := I1 +1 to n do
          A[I2 ,I3 ] -= A[I2 ,I1 ]*A[I1 ,I3 ];

For comparison, we measured the performance of the original

    ... min(n, II3 + s − 1) do
      A[I2 ,I3 ] -= A[I2 ,I1 ]*A[I1 ,I3 ];

[Figure 5 plot: MFLOPS versus number of processors for 64×64 tiling, 32×32 tiling, and no tiling.]

Figure 5: Performance of 500 × 500 double precision LU factorization without pivoting on the SGI 4D/380. No register tiling was performed.

As with matrix multiplication, while tiling for cache reuse is important for one processor, it is crucial for multiple processors. Tiling improves the uniprocessor execution by about 20%; more significantly, the speedup is much closer to linear in the number of processors when tiling is used. We ran experiments with tile sizes 32 × 32 and 64 × 64. The results coincide with our analysis that the marginal performance gain due to increases in the tile size is low. In this example, a 32 × 32 tile, which holds 1/4 the data of a 64 × 64 tile, performed virtually as well as the larger on a single processor. It actually outperformed

6.2 SOR

[Figure 6 plot: MFLOPS versus number of processors for 3D tile, 2D tile, and DOALL middle.]

The code for the second version, labeled "2D tile", is shown in Figure 7(c). This version tiles only the innermost two loops.
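The multiprocessor runs above execute tiles in a DO-ACROSS manner [16]: a tile starts once the neighboring tiles it depends on have completed. A toy sketch of such a schedule (Python; `doacross_schedule` is an illustrative model, not the paper's counter-based runtime) for a 2-D tile grid in which tile (i, j) waits on tiles (i−1, j) and (i, j−1):

```python
def doacross_schedule(nt):
    """Earliest start step for each tile in an nt-by-nt grid when tile (i, j)
    must follow (i-1, j) and (i, j-1): the classic anti-diagonal wavefront."""
    step = {}
    for i in range(nt):
        for j in range(nt):
            preds = [step[p] for p in ((i - 1, j), (i, j - 1)) if p in step]
            step[(i, j)] = 1 + max(preds, default=-1)
    return step
```

All tiles on the same anti-diagonal i + j can run concurrently, which is why the code needs only a few counters per outer-loop iteration to enforce the ordering.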
(a): SOR nest

    for I1 := 1 to t do
      for I2 := 1 to n − 1 do
        for I3 := 1 to n − 1 do
          A[I2 ,I3 ] := 0.2*(A[I2 ,I3 ] + A[I2 + 1,I3 ] + A[I2 − 1,I3 ] + A[I2 ,I3 + 1] + A[I2 ,I3 − 1]);

    for I1′ := 5 to 2n − 2 + 3t do
      doall I2′ := max(1, (I1′ − t − 2n + 2)/2) to min(t, (I1′ − 3)/2) do
        for I3′ := max(1 + I2′, I1′ − 2I2′ − n + 1) to min(n − 1 + I2′, I1′ − 1 − 2I2′) do
          A[I3′ − I2′, I1′ − 2I2′ − I3′] := 0.2*(A[I3′ − I2′, I1′ − 2I2′ − I3′] + A[I3′ − I2′ + 1, I1′ − 2I2′ − I3′]
              + A[I3′ − I2′ − 1, I1′ − 2I2′ − I3′] + A[I3′ − I2′, I1′ − 2I2′ − I3′ + 1] + A[I3′ − I2′, I1′ − 2I2′ − I3′ − 1]);

(c): "2-D Tile" SOR nest

    for I1′ := 1 to t do
      for II3′ := 1 to n − 1 by s do
        for I2′ := 1 to n − 1 do
          for I3′ := II3′ to min(n − 1, II3′ + s − 1) do
            A[I2′,I3′] := 0.2*(A[I2′,I3′] + A[I2′ + 1,I3′] + A[I2′ − 1,I3′] + A[I2′,I3′ + 1] + A[I2′,I3′ − 1]);

(d): "3-D Tile" SOR nest

    for II2′ := 2 to n − 1 + t by s do
      for II3′ := 2 to n − 1 + t by s do
        for I1′ := 1 to t do
          for I2′ := II2′ to min(n − 1 + t, II2′ + s − 1) do
            for I3′ := II3′ to min(n − 1 + t, II3′ + s − 1) do
              A[I2′ − I1′,I3′ − I1′] := 0.2*(A[I2′ − I1′,I3′ − I1′] + A[I2′ − I1′ + 1,I3′ − I1′]
                  + A[I2′ − I1′ − 1,I3′ − I1′] + A[I2′ − I1′,I3′ − I1′ + 1] + A[I2′ − I1′,I3′ − I1′ − 1]);
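Because the strip-mined loop order in (c) preserves every read-before-write relation of (a), the two versions compute identical results. A sketch (Python, 0-based indices; function names are ours, for illustration) that mirrors the two loop orders and compares them:

```python
def sor_untiled(A, t):
    """Version (a): t sweeps in plain (I2, I3) order, updated in place."""
    n = len(A)
    for _ in range(t):
        for i2 in range(1, n - 1):
            for i3 in range(1, n - 1):
                A[i2][i3] = 0.2 * (A[i2][i3] + A[i2 + 1][i3] + A[i2 - 1][i3]
                                   + A[i2][i3 + 1] + A[i2][i3 - 1])
    return A


def sor_2d_tiled(A, t, s):
    """Version (c): the I3 loop is strip-mined by s, with the strip loop
    hoisted outside I2; each read still sees the same old/new neighbor values."""
    n = len(A)
    for _ in range(t):
        for ii3 in range(1, n - 1, s):
            for i2 in range(1, n - 1):
                for i3 in range(ii3, min(n - 1, ii3 + s)):
                    A[i2][i3] = 0.2 * (A[i2][i3] + A[i2 + 1][i3] + A[i2 - 1][i3]
                                       + A[i2][i3 + 1] + A[i2][i3 - 1])
    return A
```

Within each sweep, every neighbor read observes exactly the same already-updated/not-yet-updated state in both orders, so the outputs agree element for element.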
The uniprocessor performance is significantly better than even the eight-processor wavefronted performance. Thus although wavefronting may increase parallelism, it may reduce locality so much that it is better to use only one processor. However, the speedup of the 2D tile method is still limited because self-temporal reuse is not being exploited.

To exploit all the available reuse, all three loops must be included in the innermost tile. The SRP algorithm first skews I2 and I3 with respect to I1 to make tiling legal, and then tiles to produce the code in Figure 7(d). This "3D tile" version of the loop nest is best for a single processor, and also has the best speedup in the multiprocessor version.

7 Conclusions

In this paper, we propose a complete approach to the problem of improving the cache performance of loop nests. The approach is to first transform the code via interchange, reversal, skewing and tiling, then determine the tile size, taking into account data conflicts due to the set associativity of the cache [12]. The loop transformation algorithm is based on two concepts: a mathematical formulation of reuse and locality, and a matrix-based loop transformation theory.

While previous work on evaluating locality estimates the number of memory accesses directly for a given transformed code, we break the evaluation into three parts. We use a reuse vector space to capture the inherent reuse within a loop nest; we use a localized vector space to capture a compound transform's potential to exploit locality; finally, we evaluate the locality of a transformed code by intersecting the reuse vector space with the localized vector space.

There are four reuse vector spaces: self-temporal, self-spatial, group-temporal, and group-spatial. These reuse vector spaces need to be calculated only once for a given loop nest. We show that while unimodular transformations can alter the orientation of the localized vector space, tiling can increase the dimensionality of this space.

The reuse and localized vector spaces can be used to prune the search for the best compound transformation, and not just to evaluate the locality of a given code. First, all transforms with identical localized vector spaces are equivalent with respect to locality. In addition, transforms with different localized vector spaces may also be equivalent if the intersection between the localized and the reuse vector spaces is identical. A loop transformation algorithm need only compare transformations that give different localized and reuse vector space intersections.

Unlike the stepwise transformation approach used in existing compilers, our loop transformer solves for the compound transformation directly. This is made possible by our theory that unifies loop interchanges, skews and reversals as unimodular matrix transformations on dependence vectors with either direction or distance components. The algorithm extracts the dependence vectors, determines the best compound transform using locality objectives to prune the search, then transforms the loops and their loop bounds once and for all. This theory makes the implementation of the algorithm simple and straightforward.

References

[1] W. Abu-Sufah. Improving the Performance of Virtual Memory Computers. PhD thesis, University of Illinois at Urbana-Champaign, Nov 1978.

[2] U. Banerjee. Data dependence in ordinary programs. Technical Report 76-837, University of Illinois at Urbana-Champaign, Nov 1976.

[3] U. Banerjee. Dependence Analysis for Supercomputing. Kluwer Academic, 1988.

[4] U. Banerjee. Unimodular transformations of double loops. In 3rd Workshop on Languages and Compilers for Parallel Computing, Aug 1990.

[5] D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In Proceedings of the ACM SIGPLAN '90 Conference on Programming Language Design and Implementation, June 1990.

[10] F. Irigoin and R. Triolet. Computing dependence direction vectors and dependence cones. Technical Report E94, Centre D'Automatique et Informatique, 1988.

[11] F. Irigoin and R. Triolet. Supernode partitioning. In Proc. 15th Annual ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, January 1988.

[12] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The cache performance and optimizations of blocked algorithms. In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, April 1991.

[13] A. C. McKeller and E. G. Coffman. The organization of matrices and matrix operations in a paged multiprogramming environment. CACM, 12(3):153–165, 1969.

[14] A. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Applications. PhD thesis, Rice University, May 1989.

[15] R. Schreiber and J. Dongarra. Automatic blocking of nested loops. 1990.

[16] M. E. Wolf and M. S. Lam. A loop transformation theory and an algorithm to maximize parallelism. IEEE Transactions on Parallel and Distributed Systems, July 1991.

[17] M. J. Wolfe. Techniques for improving the inherent parallelism in programs. Technical Report UIUCDCS-R-78-929, University of Illinois, 1978.

[18] M. J. Wolfe. More iteration space tiling. In Supercomputing '89, Nov 1989.