
Compiler Optimisations

© 2009 stratusdesign@gmail.com
stratusdesign.blogspot.com
Overview
• Introduction
• Legacy optimisation
• Vector SIMD optimisation
• DSP optimisation
• RISC/Superscalar optimisation
• SSA optimisation
• Multicore optimisation
Introduction
• This is intended as an overview of the optimisation process only, as optimisations can be
done in different ways, often with subtle machine-specific variations. Broadly speaking,
then, there are four main classes of optimisation available to the implementor, and these
are
• Classic legacy optimisations - these are well understood and the majority are technically
straightforward to implement. They offer a gain of around 10-25% in performance
• Classic Vector optimisations - once the preserve of leviathan mainframe CPUs with brand
new shiny Vector Units attached, but now very commonly found in DSP-related
technologies. Technically these optimisations are more difficult than the former but still
not complicated. For the right class of narrow numerical applications, fully and properly
optimised, they can yield performance gains of 500%-2400%
• RISC based optimisations. Despite their potential speed, scheduling fast code close to the
theoretical maximum on a RISC has been, and continues to be, problematic. For example, the
Alpha's then-new GEM compilers, when profiled on the machine, only achieved speeds
approaching what the Alpha was capable of about 30% of the time. That meant the raw
compute power of the Alpha was wasted 70% of the time; in other words, all those extra MHz
were just used to heat up your datacentre/office. Performance enhancements are of the
order of at least 150%
• Parallel or Hybrid optimisations. Optimisation in these cases is dominated by the
underlying memory architecture, eg. UMA, NUMA, MIMD or MIMD/SIMD hybrid. So, as with
RISC, memory bandwidth is an issue. The other factors are interprocessor utilisation,
interprocessor communication, interprocessor security, interprocessor management and
identifying coroutines to schedule on the parallel system. Another issue is that most
commercial computer languages to date have typically not been very good at allowing the
programmer to express parallelism. This means that the compiler has to infer parallelism
from what is essentially a missing attribute, which is very difficult to accomplish with
any degree of success. Currently most languages rely on rather unsophisticated library or
system routines (see the sketch below).
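
A minimal sketch of the library/annotation route mentioned above, assuming an
OpenMP-capable C compiler (the function name and arrays are hypothetical, not taken
from these slides): the programmer states the parallelism explicitly with a pragma
rather than the compiler inferring it from plain sequential C.

    void scale(float *a, const float *b, int n)
    {
        /* parallelism is stated explicitly by the programmer; without the
           pragma the compiler would have to infer it from plain C */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = b[i] * 2.0f;
    }

Compiled with -fopenmp (GCC) the iterations are distributed across cores; without the
flag the pragma is ignored and the loop remains sequential.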
Classic legacy optimisation
• Copy propagation

Before                  After
x = y;                  x = y;
z = 1 + x;              z = 1 + y;

Before optimisation a data dependency is created because z has to wait for the
value of x to be written.

• Constant propagation

Before                  After
x = 42;                 x = 42;
z = 1 + x;              z = 1 + 42;
Classic legacy optimisation
• Constant folding
Before After
x = 512 * 4; x = 2048;

Can be applied to Constant arguments, Statics and Locals.

• Dead code removal (a small sketch follows below)

o Temporary code created by the compiler, eg. when
  doing constant propagation
o Dead variable removal
o Elimination of unreachable code, eg. in C switch
  statements
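
A minimal C sketch of the cases above (a hypothetical fragment, not from the slides):

    /* before */
    const int trace = 0;
    int t = x * 2;              /* dead variable: t is never used again    */
    if (trace)                  /* always false after constant propagation */
        printf("x=%d\n", x);    /* unreachable code                        */

    /* after dead code removal: the test, the call and t all disappear */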
Classic legacy optimisation
• Algebraic
Before After
x = 10 * ( x + 5 ) / 10; x += 5;

• Strength Reduction
Before After
(1) x = y ** 2; x = y * y;
(2) x = y * 2; x = y + y;
Classic legacy optimisation
• Variable renaming
Before After
x = y * z; x = y * z;
x = u * v; x0 = u * v;

• Common subexpression
elimination
Before After
x = u * (y + z ); x0 = y + z;
w = ( y + z ) / 2; x = u * x0;
w = x0 / 2;
Classic legacy optimisation
• Loop invariant code motion

Before                          After
for (i=0; i<10; i++)            x0 = a + b;
    x[i] += v[i] + a + b;       for (i=0; i<10; i++)
                                    x[i] += v[i] + x0;

• Loop induction variable
  simplification

Before                          After
for (i=0; i<10; i++)            x = v;
    x = i * 2 + v;              for (i=0; i<10; i++)
                                    x += 2;
Classic legacy optimisation
• Loop unrolling

Before                          After (unroll by factor of 2)
for (i=0; i<n; i++)             for (i=0; i<n-2; i+=2)
    x[i] += x[i-1] * x[i+1];    {
                                    x[i]   += x[i-1] * x[i+1];
                                    x[i+1] += x[i]   * x[i+2];
                                }

• Tail recursion elimination

recurs( x, y )
{
    if( !x ) return;
    recurs( x - y, y );
}

All computation is done by the time the recursive call is made. By simply jumping
to the top of the function, excessive stack frame creation is avoided. This may not
be possible in some languages; for example, C++ usually arranges to call destructors
at function exit. A sketch of the transformed routine is shown below.
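
A minimal sketch of the transformed routine (assuming the call is recurs(x - y, y)
and that x and y are ints; this is illustrative, not compiler output):

    void recurs(int x, int y)
    {
    top:
        if (!x)
            return;
        x = x - y;      /* the tail-call argument becomes an assignment  */
        goto top;       /* the recursive call becomes a jump to the top  */
    }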
GVN
• Global value numbering
o Similar to CSE but can target cases that aren’t considered by CSE (see below)
• Idea is an extension of Local Value Numbering (within a Basic Block)

Local value numbering

a = b + c          b = V1, c = V2, so a = V1 + V2
d = b              d = V1
e = d + c          e = V1 + V2
                   Therefore a & e are equivalent

Global value numbering ~ has to consider the effects of control flow across BBs

x1 = a1        x2 = b1          a1 = V1, b1 = V2
     \          /               x1 = V3 (= V1), x2 = V4 (= V2)
  x3 = phi( x1, x2 )            x3 = V5 = phi( V1, V2 ) = V6

NB. Later RHS evaluations ripple through the value numbers of previous nodes
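
A hypothetical C fragment (not from the slides) showing a cross-block equivalence
that value numbering catches but lexical CSE does not:

    if (p) { x = 3; y = 3; }    /* x and y get the same value number here   */
    else   { x = 5; y = 5; }    /* and the same (different) number here too */
    z = x * 4;
    w = y * 4;                  /* the phis for x and y share a value number,
                                   so this multiply is redundant with z      */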
PRE
• Partial redundancy elimination includes analysis for
o Loop invariant code motion – see previous
o Full redundancy elimination – see previous for CSE
o Partial redundancy elimination – see below – evaluation of x+y is predicated on some condition
creating a Partial Redundancy
• Some PRE variants apply to SSA values, not just the lexical expressions, effectively
combining PRE and GVN

CFG (before):
    cond-eval
    path 1:  …                    path 2:  a = x+y
    join:    b = x+y              (partially redundant)

Elimination of Partial Redundancy:
    cond-eval
    path 1:  T = x+y              path 2:  T = x+y ; a = T
    join:    b = T

Elimination of Full Redundancy (ref CSE):
    T = x+y
    cond-eval
    path 1:  …                    path 2:  a = T
    join:    b = T


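A hypothetical C fragment (not from the slides) matching the diagram above:

    /* before: x + y is partially redundant - the path on which the
       condition is false has not yet computed it */
    if (cond)
        a = x + y;
    b = x + y;

    /* after PRE: insert the evaluation on the other path, then reuse it */
    if (cond) { t = x + y; a = t; }
    else      { t = x + y; }
    b = t;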
Classic legacy optimisation
• Leaf procedure optimisation
A routine which does not call any other routines or require any local
storage can be invoked with a simple JSR/RET.

• Procedure inlining
This technique avoids the overhead of a call/ret by duplicating the
code wherever it is needed. It is best used for small frequently called
routines.
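
A minimal sketch of procedure inlining (hypothetical routine names, not from the slides):

    /* before: two call/ret pairs per invocation of hyp2() */
    static int sq(int v) { return v * v; }
    int hyp2(int a, int b) { return sq(a) + sq(b); }

    /* after inlining: the body of sq() is duplicated at each call site */
    int hyp2_inlined(int a, int b) { return (a * a) + (b * b); }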
Vector SIMD
• These optimisations increase performance by using
deep vector unit pipelines, data locality and data
isolation found when manipulating arrays to
parallelise the computation. They also reduce
conditional branching over potentially large datasets.
• Nowadays SIMD instructions appear most frequently
in DSPs for computing FIR/IIR filters or doing FFTs.
• Most modern microprocessors also have vector
support in their SIMD extensions, eg. SSE and AltiVec,
which have traditionally offered cut-down
functionality in their vector units, but future trends
are towards fuller implementations.
• Some studies have shown that when code can be
vectorised it can improve performance in some cases
by around 500+%.
Vector SIMD
Before (C source)
    for( i=0; i<64; i++ )
        a[i] = b[i] + 50;

Before (CISC case)
    movl   #1, r0
    moval  a, r1
    moval  b, r2
L$1:
    addl   #50, (b)+
    movl   (b), (a)+
    aobleq #64, r0, L$1

After (classic VP, long vector)
    mtvlr  #64           ; set vector length to 64
    vldl   b, v0         ; load 64 elements of b
    vvaddl v0, #50       ; add 50 to every element
    vstl   v0, a         ; store 64 results to a

After (AltiVec et al., limited to 4x32b parallelism)
    vspltisw v0, #50     ; splat 50 into each word of v0
    lw   r1, 0(a)
    lw   r2, 0(b)
    lvx  v2, 0, r2
    vaddsws v1, v2, v0
    stvx v1, 0, r1
    ; have 4 words added in parallel
    lw   r1, 128(a)
    lw   r2, 128(b)
    lvx  v2, 0, r2
    vaddsws v1, v2, v0
    stvx v1, 0, r1
    ; have 8 words added in parallel
    ; keep going...

Nb. The loop branch has also been optimised away
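
A hedged sketch of the same a[i] = b[i] + 50 loop written with SSE2 intrinsics for
x86 (an assumption; the slides use AltiVec/VAX-style code). Four 32-bit elements are
processed per iteration and the arrays are assumed to be int[64]:

    #include <emmintrin.h>

    void add50(int *a, const int *b)                /* hypothetical helper */
    {
        __m128i fifty = _mm_set1_epi32(50);         /* splat 50 into all 4 lanes */
        for (int i = 0; i < 64; i += 4) {
            __m128i vb = _mm_loadu_si128((const __m128i *)&b[i]);
            __m128i va = _mm_add_epi32(vb, fifty);  /* 4 adds in parallel */
            _mm_storeu_si128((__m128i *)&a[i], va);
        }
    }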
Scalar/Superscalar RISC
• Load delay slot
o The result of a load cannot be used in the immediately following
  instruction without stalling the pipeline until the load completes.
  Instead of having the machine stall in this way, some useful and
  independent code is found that can be placed between the ld of r2 and
  the add that uses r2. If no useful code can be found, a nop is
  inserted instead.

u = v + w;
z = x + y;

before                      after

ld  r1, v                   ld  r1, v
ld  r2, w                   ld  r2, w
add r3, r1, r2              ld  r4, x        ; independent load fills the delay slot
sw  u, r3                   add r3, r1, r2
ld  r4, x                   sw  u, r3
etc..                       etc..
DSP optimisation
• DSPs have some unique hardware design features
which require additional compiler support
o tbd
Scalar/Superscalar RISC
• Branch delay slot
o The result of a branch cannot be resolved without stalling the
  pipeline. Instead of having the machine stall in this way, some useful
  code is found that can be placed immediately after the branch.
  Several strategies can be used: find a useful candidate instruction
  from before the branch, take one from the branch target and advance
  the branch target address by one instruction, or take a candidate
  from after the branch. If a candidate cannot be found, a nop can be
  inserted instead.

z = x + y;
if( x == 0 )
    goto L1;

before                      after

ld  r1, x                   ld  r1, x
ld  r2, y                   ld  r2, y
add r3, r1, r2              cmp r1, 0
cmp r1, 0                   beq L1
beq L1                      add r3, r1, r2   ; fills the branch delay slot
…                           ...
L1:                         L1:
sll r3, 4                   sll r3, 4
Scalar/Superscalar RISC
• Branch reduction
o Loop unrolling is one way to reduce branching; other methods exist

Ex. Bitfield setting and rotation

if( x == 0 )
    y++;
...

before                          after (branch eliminated)

L1:                             lw     r2, x
    ...                         cmpdi  r2, 10
    lw   r2, x                  cntlzw r2, r2
    cmpi r1, r2, 10             addic  r2, r2, -32
    bne  r1, L2                 rlwinm r3, r2, 1, 31, 31
    addi r3, r0, 1              ...
L2:
    ...

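A minimal C-level note (hypothetical fragment): the same branch can often be removed
at source level, and compilers emit compare-free sequences like the one above for it:

    /* before */                /* after (branch eliminated) */
    if (x == 0)                 y += (x == 0);
        y++;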
Scalar/Superscalar RISC
• Conditional Move
o Another branch reduction technique

Ex. Conditional assignment

if( x == 0 )
    y = 1;
else
    z = 20;

before                          after

ldq  r1, x                      ldq  r1, x
cmp  r1, 0                      ldq  r2, 1
bne  r1, L1                     ldq  r3, 20
mov  r3, 1                      cmp  r1, 0
...                             cmovez r3, r2, r1   ; r3 = r2 if r1 == 0
L1:
mov  r3, 20
...
Superscalar Scheduling
• This is usually achieved by creating another IR, or
extending an existing IR, to associate machine instructions
with RISC functional units; in this way a determination
can be made as to current FU utilisation and how best to
reorder code for superscalar multi-issue.
• These IRs are highly guarded, highly proprietary
technologies.
• This is one reason why, for example, the IBM POWER
compilers outperform current GCC implementations
• A simple but innovative example at the time was tracking
register pressure in the WHIRL IR originally used by MIPS
and SGI
GCC
• GCC is a fairly standard compiler technology. Historically it had one tree form (the Parse
Tree) generated from the front end and a retargetable machine format (RTL) across
which the standard optimisations were done.
• Since 2005 this has been expanded, and the tree forms now include the Parse Tree, the GENERIC
(language independent) tree and the GIMPLE (supporting SSA form) tree (C and C++ omit a
GENERIC tree). The standard optimisations now occur after an SSA form has been
generated (scalar ops only). SSA starts out in GCC by versioning all variables and finishes
by merging them back down with PHI functions.
o This solved the problem that the various front-end parse trees did not use a common IR which could
be used as the basis for thorough optimisation, and that the RTL IR was also unsuitable because
it was at too low a level.
• Compiler passes over the IR are handled via an extendable Pass manager which, as of
4.1.1, covers both preparation for optimisation and optimisation proper. The passes are
separated across interprocedural, intra-procedural and machine forms (consisting of c.100
SSA passes, c.100 GIMPLE passes and c.60 RTL passes [Novillo06]). The majority of these
passes centre on the intra-procedural and machine forms.

• One criticism I would make of GCC is that in some cases it flagrantly ignores
manufacturer-architected conventions. This leads to a lack of interoperability with the rest
of the manufacturer's system software, for example the manufacturer's cross-functional
software support or the manufacturer's system threading package and libraries. Another
problem for GCC is to stem the flow of machine-dependent RTL optimisations by
handling these in a smarter way.
• Corporate involvement is accelerating functional releases (2008-2009: 4 releases in the
last year - current 4.4.1)
GCC Gimple
• Gimple
o Influenced by McCAT Simple IR (GNU Simple)
o Need for a generic language independent IR
o Need for an IR that renders complex deep parse
trees to an IR that is easier to analyse for
optimisation
o A small grammar covers bitwise, logical,
assignment, statement etc.
o Unlike the parse tree, a GIMPLE statement never references
  more than 3 variables, meaning at most 2 variable reads
o High Gimple and Low Gimple
   Lowering removes binding scope information and
  converts conditional clauses to gotos
o Gimple nodes are iterated at tree level (tsi) and over a
  doubly linked list at bb level (bsi)
GCC Gimple
• 3 Address format ex.

Generic form                    Gimple form
if ( a > b + c )                T1 = b + c;
    c = b / a + ( b * a );      if ( a > T1 )
                                {
                                    T2 = b / a;
                                    T3 = b * a;
                                    c  = T2 + T3;
                                }
GCC SSA
• SSA is another IR form, originally developed to help with dataflow analysis for interpreted
systems
o SSA evolved from Def-Use chains (Reif & Lewis), which, when annotated with identity
assignments, became the basis for SSA
o GCC does Scalar SSA using Kildall Analysis (not Wegman et al.)
o SSA for ~ Simplification of existing optimisations; for example constant propagation was originally
complex to implement but with SSA it is greatly simplified
o SSA for ~ Classic dataflow analysis - Reaching Definition Analysis, or more intuitively Reaching
Assignment Analysis, since it attempts to pair the current variable reference with the most recent
update or write to that variable
o SSA for ~ significantly faster optimisation during compilation, O(n) versus O(n²) when optimising
using traditional data-flow equations

Generic form
    c = 5;
    if ( a > b + c )
        c = b / c + ( b * a );

Gimple form
    c = 5;
    T1 = b + c;
    if ( a > T1 )
    {
        T2 = b / c;
        T3 = b * a;
        c  = T2 + T3;
    }

SSA form
    c1 = 5;
    T1_1 = b1 + c1;
    if ( a1 > T1_1 )
    {
        T2_1 = b1 / c1;
        T3_1 = b1 * a1;
        c2   = T2_1 + T3_1;
    }
    c3 = phi ( c1, c2 );

SSA makes Reaching Definition Analysis easy to perform; here it is being used to
simplify constant propagation.
Dominators & Φ Fn in SSA

Fig.1 CFG - Basic Blocks contain scalar expressions

        A
       / \
      B   C          (split)
      |  / \
      | D   E        D: c1 = x;   E: c2 = a / b;
      |  \ /
      |   F          (merge)  F: c3 = phi( c1, c2 );
       \ /
        G

Dominators ::=>
1. d dominates n if every path from the entry node to n must go through d
   • Every node dominates itself
   • Nodes also evidently have the property of an Immediate Dominator

Fig.1 Clearly the path to G is either from B or F; however the paths to B and F
stem from A, so every path to G goes through A, therefore G is dominated by A.
Likewise the path to F is either from D or E; however the paths to these stem
from C, so every straight-line path to F goes through C, therefore F is
dominated by C.

Using this we can build a Dominator Tree (Fig.2) and derive Dominator Sets and a
Dominance Frontier (Fig.3). The Dominance Frontier of the BB defining a given
variable is used by the compiler to introduce Phi functions. This produces a
maximal Phi insertion, which can be reduced by various methods, eg variable
liveness.

Fig.2 Dom Tree
        A
      / | \
     B  C  G
       /|\
      D E F

Fig.3 Dom Set & Dom Frontier

Block   Dom Set     Immed Dom   Dom Frontier
A       A           -----       -----
B       A, B        A           G
C       A, C        A           G
D       A, C, D     C           F
E       A, C, E     C           F
F       A, C, F     C           G
G       A, G        A           ----

Dominance Frontier of a BB d ::=>
DF(d) = { n | ∃ p ∈ pred(n), d dom p and d !sdom n }
• The set of all CFG nodes n for which d dominates a predecessor p of n but does not
  strictly dominate n itself. (Intuitively, the earliest point at which the definition of a
  variable is no longer guaranteed to be unique.)
• This gives maximal insertion of phi nodes and can be optimised in several ways, for
  example by doing liveness analysis.
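
A minimal C sketch of how the Dominator Sets of Fig.3 can be computed (an
illustrative iterative data-flow solver, not code from any particular compiler);
the CFG edges are those of Fig.1 and each set is kept as a bitmask:

    #include <stdio.h>

    enum { A, B, C, D, E, F, G, NBLK };

    int main(void)
    {
        /* pred[n] is a bitmask of the predecessors of block n (edges of Fig.1) */
        unsigned pred[NBLK] = {0};
        pred[B] = 1u << A;  pred[C] = 1u << A;
        pred[D] = 1u << C;  pred[E] = 1u << C;
        pred[F] = (1u << D) | (1u << E);
        pred[G] = (1u << B) | (1u << F);

        /* Dom(entry) = {entry}; every other set starts "full" */
        unsigned dom[NBLK];
        dom[A] = 1u << A;
        for (int n = B; n < NBLK; n++)
            dom[n] = (1u << NBLK) - 1;

        /* Dom(n) = {n} U intersection of Dom(p) over all predecessors p,
           repeated until nothing changes */
        int changed = 1;
        while (changed) {
            changed = 0;
            for (int n = B; n < NBLK; n++) {
                unsigned meet = (1u << NBLK) - 1;
                for (int p = A; p < NBLK; p++)
                    if (pred[n] & (1u << p))
                        meet &= dom[p];
                unsigned next = meet | (1u << n);
                if (next != dom[n]) { dom[n] = next; changed = 1; }
            }
        }

        for (int n = A; n < NBLK; n++) {        /* prints the Dom Set column of Fig.3 */
            printf("Dom(%c) = {", 'A' + n);
            for (int d = A; d < NBLK; d++)
                if (dom[n] & (1u << d)) printf(" %c", 'A' + d);
            printf(" }\n");
        }
        return 0;
    }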
Multicore optimisation
• Polyhedral representation
o First proposed by Feautrier in 1991 and appearing in research
  compilers of the time.
o Complex Loop Nest Optimisation and Array analysis is difficult to do
  on the corresponding AST representation ~ especially with respect to
  strict observance of loop bounds across the nest, which often defeats
  standard LNO
o The Loop Nest is reformulated as a set of equations, Linear Inequalities
  (properly, affine constraints), and due to this higher level of
  abstraction a deeper level of optimisation (transformation) can be
  accomplished by solving the LP system
o Each loop iteration is an integer point in a space whose loop bounds
  form a polyhedron (a worked domain example follows below). Ex. the
  first nest is a point with 2 rays, the second modifies this into a
  4-sided 2D polyhedron, the third forms a 3D polyhedron. Problem - how
  to efficiently solve for a large number of points.
o The literature reports 20-90% improvement using polyhedral LNO
o Such an improvement makes it practical and desirable to distribute
  LN and associated array computation across a set of multicores. AMD
  are doing this with a lightweight intercore IPC they call streams
o Polyhedral LNO is available in GCC 4.5 as Graphite and in IBM's Cell
  Compiler
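
A worked sketch of the previous point (using the loop nest from the next slide): the
iterations of the inner statement S2 are the integer points of the square polyhedron

    D_S2 = { (i, j) | 1 <= i <= n, 1 <= j <= n }

which, written as affine constraints of the form A.x + b >= 0, is

    [  1   0 ]           [ -1 ]
    [ -1   0 ]  [ i ]    [  n ]
    [  0   1 ]  [ j ] +  [ -1 ]   >=   0
    [  0  -1 ]           [  n ]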
Multicore optimisation
• The polyhedral model
Ref [Bastoul06]

❶ Typical Loop Nest
    for(i=2; i<=n; i++)
        z[i] = 0;                       // S1
    for(i=1; i<=n; i++)
        for(j=1; j<=n; j++)
            z[i+j] += x[i] * y[j];      // S2

❷ Reformulated as affine constraints. Steps:
    1. Define the domain D_S of each statement (ref bounds of the
       enclosing loops). Ex. outer loop: D_S1 = { i | 2 <= i <= n }
    2. List the access functions. Ex. S1 = z[i] = 0
    3. Transform (optimise) with some affine schedule, eg. θ_S1(i) = (i)
    4. Generate code using projection and separation of polyhedra

❸ Transformation scheduling (optimisation)

❹ Regenerate an AST for code generation. The separation will be
   DS1 - DS2 ∧ DS2 - DS1 ∧ DS1 ∩ DS2, giving a worst case of 3np
   domains (n = stmts; p = nest depth). Ex.

    t=2;
    i=2;
    z[i]=0;                                      // DS1 - DS2
    for(t=3; t<=2*n; t++)
    {
        for(i=max(1,t-n-1); i<=min(t-2,n); i++)
        {
            j=t-i-1;
            z[i+j] += x[i] * y[j];               // DS1 ∩ DS2
        }
        i=t;
        z[i] = 0;
    }
    t=2*n+1;
    i=n;
    j=n;
    z[i+j] += x[i] * y[j];                       // DS2 - DS1