Compiler Design
Unit –IV
A compiler front end is organized as in figure above, where parsing, static checking, and intermediate-code
generation are done sequentially; sometimes they can be combined and folded into parsing. All schemes can
be implemented by creating a syntax tree and then walking the tree.
Syntax trees
1.1 Intermediate code can be represented in the following four ways.
p1 = mkleaf(id, c)
p2 = mkleaf(id, d)
p3 = mknode('–', p1, p2)
p4 = mknode('U', p3, NULL)
p5 = mkleaf(id, b)
Production Semantic Rule
E→ E E.nptr := E1.nptr
T1 = c – d
T2 = –T1
T3 = b * T2
T4 = c – d
T5 = –T4
T6 = b * T5
T7 = T3 + T6
a = T7
The code can also be written for DAG as follows:
T1 = c – d
T2 = –T1
T3 = b * T2
T4 = T3 + T3
a = T4
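The two listings above can be reproduced with a short sketch. It assumes the statement being translated is a = b * –(c – d) + b * –(c – d), which is inferred from the listings rather than stated in the text; the function name gen_tac and the tuple encoding of the tree are illustrative choices, not part of the original notes.

```python
def gen_tac(target, tree, share_common):
    """Return three-address code for `target = tree`.

    Leaves are names (str); interior nodes are tuples: (op, operand) for a
    unary operator or (op, left, right) for a binary one.
    """
    code, cache = [], {}

    def walk(node):
        if isinstance(node, str):
            return node                      # a name is its own place
        if share_common and node in cache:
            return cache[node]               # DAG view: reuse the shared node
        places = [walk(child) for child in node[1:]]
        temp = f"T{len(code) + 1}"
        if len(places) == 1:
            code.append(f"{temp} = {node[0]}{places[0]}")
        else:
            code.append(f"{temp} = {places[0]} {node[0]} {places[1]}")
        cache[node] = temp
        return temp

    result = walk(tree)
    code.append(f"{target} = {result}")
    return code


term = ("*", "b", ("-", ("-", "c", "d")))    # b * -(c - d)
tree = ("+", term, term)
syntax_tree_code = gen_tac("a", tree, share_common=False)   # 8 statements
dag_code = gen_tac("a", tree, share_common=True)            # 5 statements
```

With sharing switched off every subtree gets its own temporary, as in the syntax-tree listing; with sharing switched on the repeated subtree is computed once, as in the DAG listing.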
2. DECLARATIONS
As the sequence of declarations in a procedure or block is examined, we can lay out storage for names local to
the procedure. For each local name, we create a symbol-table entry with information like the type and the
relative address of the storage for the name. The relative address consists of an offset from the base of the
static data area or the field for local data in an activation record.
Declarations in a Procedure:
The syntax of languages such as C, Pascal and Fortran, allows all the declarations in a single procedure to be
processed as a group. In this case, a global variable, say offset, can keep track of the next available relative
address.
Before the first declaration is considered, offset is set to 0. As each new name is seen, that name is
entered in the symbol table with offset equal to the current value of offset, and offset is incremented
by the width of the data object denoted by that name.
The procedure enter(name, type, offset) creates a symbol-table entry for name, giving it the type type and
the relative address offset.
Attribute type represents a type expression constructed from the basic types integer and real by
applying the type constructors pointer and array. If type expressions are represented by graphs, then
attribute type might be a pointer to the node representing a type expression.
The width of an array is obtained by multiplying the width of each element by the number of elements
in the array. The width of each pointer is assumed to be 4.
Production: T → array [ num ] of T1
    T.type = array(num.val, T1.type)
    T.width = num.val x T1.width
Production: T → ↑T1
    T.type = pointer(T1.type)
    T.width = 4
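The offset bookkeeping and the width rules can be sketched as follows. The widths chosen for integer (4) and real (8) are assumptions for illustration; the text fixes only the pointer width. The names Type, lay_out, INTEGER and REAL are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Type:
    name: str                         # "integer", "real", "pointer" or "array"
    width: int                        # storage needed, in bytes
    elem: Optional["Type"] = None     # element or pointed-to type, if any
    count: int = 0                    # number of elements, for arrays


INTEGER = Type("integer", 4)          # assumed width
REAL = Type("real", 8)                # assumed width


def pointer(t):
    return Type("pointer", 4, t)      # T.width = 4


def array(n, t):
    return Type("array", n * t.width, t, n)   # T.width = num.val x T1.width


def lay_out(decls):
    """Enter each (name, type) pair at the current offset, then advance it."""
    symtab, offset = {}, 0
    for name, typ in decls:
        symtab[name] = (typ, offset)  # plays the role of enter(name, type, offset)
        offset += typ.width
    return symtab, offset


symtab, total = lay_out([
    ("i", INTEGER),
    ("x", REAL),
    ("a", array(10, INTEGER)),
    ("p", pointer(REAL)),
])
```

Under these assumed widths the names land at offsets 0, 4, 12 and 52, and the block needs 56 bytes in total.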
Keeping Track of Scope Information:
When a nested procedure is seen, processing of declarations in the enclosing procedure is temporarily
suspended. This approach will be illustrated by adding semantic rules to the following language:
P → D
D → D ; D | id : T | proc id ; D ; S
One possible implementation of a symbol table is a linked list of entries for names.
A new symbol table is created when a procedure declaration D → proc id ; D1 ; S is seen, and entries for the
declarations in D1 are created in the new table. The new table points back to the symbol table of the enclosing
procedure; the name represented by id itself is local to the enclosing procedure. The only change from the
treatment of variable declarations is that the procedure enter is told which symbol table to make an entry in.
For example, consider the symbol tables for procedures readarray, exchange, and quicksort pointing back to
that for the containing procedure sort, consisting of the entire program. Since partition is declared within
quicksort, its table points to that of quicksort.
Whenever a procedure declaration D → proc id ; D1 ; S is processed, a new symbol table with a pointer to the
symbol table of the enclosing procedure in its header is created and the entries for declarations in D1 are
created in the new symbol table. The name represented by id is local to the enclosing procedure and is hence
entered into the symbol table of the enclosing procedure.
Example
Program sort:
Function partition(x,y:integer):integer;
Var I,j:integer;
…………
……….
Begin{main}
…….
end
For the above procedures, entries for x, a, b and quicksort are created in the symbol table of sort. A pointer
pointing to the symbol table of quicksort is also entered. Similarly, entries for k, v and partition are created in
the symbol table of quicksort. The headers of the symbol tables of quicksort and partition have pointers
pointing to sort and quicksort respectively.
The following operations are designed:
mktable(previous): creates a new symbol table and returns a pointer to this table. previous is a pointer
to the symbol table of the parent procedure.
enter(table, name, type, offset): creates a new entry for name in the symbol table pointed to by table.
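A minimal sketch of these operations, assuming each table is a dictionary with a link to the enclosing table. The class name SymbolTable and the lookup helper that follows the links outward are illustrative additions, and the offsets and type strings in the usage lines are made up.

```python
class SymbolTable:
    def __init__(self, previous=None):
        self.previous = previous      # header link to the enclosing procedure's table
        self.entries = {}             # name -> (type, offset)


def mktable(previous):
    """Create a new table whose header points back to `previous`."""
    return SymbolTable(previous)


def enter(table, name, type_, offset):
    """Create an entry for `name` in the given table."""
    table.entries[name] = (type_, offset)


def lookup(table, name):
    """Search this table, then each enclosing table in turn (hypothetical helper)."""
    while table is not None:
        if name in table.entries:
            return table.entries[name]
        table = table.previous
    return None


sort = mktable(None)
enter(sort, "a", "array", 0)
quicksort = mktable(sort)             # quicksort is nested in sort
enter(quicksort, "k", "integer", 0)
partition = mktable(quicksort)        # partition is nested in quicksort
enter(partition, "i", "integer", 0)
```

A name used inside partition but declared in sort is found by walking the chain partition, quicksort, sort.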
3. ASSIGNMENT STATEMENTS
Suppose that the context in which an assignment appears is given by the following grammar.
P → M D
M → ε
D → D ; D | id : T | proc id ; N D ; S
N → ε
Non-terminal P becomes the new start symbol when these productions are added to those in the translation
scheme shown below.
Translation scheme to produce three-address code for assignments
S → id := E   { p := lookup(id.name);
                if p ≠ nil then
                    emit(p ':=' E.place)
                else error }
E → E1 + E2   { E.place := newtemp;
                emit(E.place ':=' E1.place '+' E2.place) }
E → E1 * E2   { E.place := newtemp;
                emit(E.place ':=' E1.place '*' E2.place) }
E → - E1      { E.place := newtemp;
                emit(E.place ':=' 'uminus' E1.place) }
E → ( E1 )    { E.place := E1.place }
E → id        { p := lookup(id.name);
                if p ≠ nil then
                    E.place := p
                else error }
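The scheme can be mimicked by a recursive walk over an already-parsed expression. This is a sketch under stated assumptions: the expression arrives as nested tuples rather than from a parser, the symbol table is a plain set of declared names, and the function name translate_assignment is hypothetical.

```python
import itertools


def translate_assignment(target, expr, symtab):
    """Emit three-address code for `target := expr`.

    expr is a name (str), ("uminus", e), ("+", e1, e2) or ("*", e1, e2).
    """
    code = []
    counter = itertools.count(1)

    def newtemp():
        return f"t{next(counter)}"

    def lookup(name):
        if name not in symtab:
            raise NameError(f"undeclared identifier: {name}")
        return name

    def place(e):
        if isinstance(e, str):                 # E -> id
            return lookup(e)
        if e[0] == "uminus":                   # E -> - E1
            operand = place(e[1])
            temp = newtemp()
            code.append(f"{temp} := uminus {operand}")
            return temp
        op, left, right = e                    # E -> E1 + E2 | E1 * E2
        left_place, right_place = place(left), place(right)
        temp = newtemp()
        code.append(f"{temp} := {left_place} {op} {right_place}")
        return temp

    result = place(expr)
    code.append(f"{lookup(target)} := {result}")   # S -> id := E
    return code


code = translate_assignment(
    "a", ("+", ("*", "b", ("uminus", "c")), "d"), {"a", "b", "c", "d"}
)
```

For a := b * -c + d this yields t1 := uminus c, t2 := b * t1, t3 := t2 + d, a := t3.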
Reusing Temporary Names
The temporaries used to hold intermediate values in expression calculations tend to clutter up the symbol
table, and space has to be allocated to hold their values.
Temporaries can be reused by changing newtemp. The code generated by the rules for E → E1 + E2 has the
general form:
evaluate E1 into t1
evaluate E2 into t2
t : = t1 + t2
The lifetimes of these temporaries are nested like matching pairs of balanced parentheses.
Keep a count c , initialized to zero. Whenever a temporary name is used as an operand,
decrement c by 1. Whenever a new temporary name is generated, use $c and increase c
by 1.
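The counting rule can be tried out on a small expression. The sketch below assumes binary operators only, tuple-encoded expressions, and source names that never begin with "$"; the function name translate_with_reuse is hypothetical.

```python
def translate_with_reuse(target, expr):
    """Three-address code for `target := expr`, reusing temporaries $0, $1, ..."""
    code = []
    c = 0                                      # count of temporaries currently live

    def place(e):
        nonlocal c
        if isinstance(e, str):
            return e
        op, left, right = e
        left_place, right_place = place(left), place(right)
        for operand in (left_place, right_place):
            if operand.startswith("$"):        # a temporary used as an operand
                c -= 1
        temp = f"${c}"                         # a new temporary takes the name $c
        c += 1
        code.append(f"{temp} := {left_place} {op} {right_place}")
        return temp

    code.append(f"{target} := {place(expr)}")
    return code


expr = ("-", ("+", ("*", "a", "b"), ("*", "c", "d")), ("*", "e", "f"))
code = translate_with_reuse("x", expr)
```

For x := a * b + c * d - e * f only two temporary names, $0 and $1, are ever needed, because the lifetimes nest.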
For a one-dimensional array, if low is the lower bound of the index and base is the relative address of the
storage allocated to the array (that is, the relative address of A[low]), then the i-th element begins at location
base + (i - low) * w, where w is the width of each element. This expression can be reorganized as
i*w + (base - low*w). The sub-expression base - low*w is calculated and stored in the symbol table at compile
time when the array declaration is processed, so that the relative address of A[i] can be obtained by just
adding i*w to it.
3.2 2-dimensional array
Storage can be either row major or column major.
For a 2-D array stored in row-major form, the address of A[i1, i2] can be calculated as
base + ((i1 - low1) x n2 + i2 - low2) x w
where n2 = high2 - low2 + 1.
Rewriting the expression gives
((i1 x n2) + i2) x w + (base - ((low1 x n2) + low2) x w)
= ((i1 x n2) + i2) x w + constant
This can be generalized for A[i1, i2, ..., ik].
Similarly, for a row-major two-dimensional array the address of A[i][j] can be calculated by the formula
base + ((i - lowi) * n2 + j - lowj) * w, where lowi and lowj are the lower bounds of i and j, and n2 is the number
of values j can take, i.e. n2 = highj - lowj + 1.
This can again be written as
((i * n2) + j) * w + (base - ((lowi * n2) + lowj) * w), and the second term can be calculated at compile time.
In the same manner, the expression for the location of an element in column major two-dimensional array can
be obtained. This addressing can be generalized to multidimensional arrays.
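The split into a compile-time constant and a run-time part can be sketched for row-major arrays of any dimension. The function name make_row_major_addresser and the sample bounds are illustrative assumptions.

```python
def make_row_major_addresser(base, w, bounds):
    """bounds is [(low1, high1), (low2, high2), ...]; returns an address function."""
    sizes = [high - low + 1 for low, high in bounds]      # n1, n2, ..., nk

    # Compile-time part: fold the lower bounds exactly as the indices are folded.
    folded_lows = 0
    for (low, _), n in zip(bounds, sizes):
        folded_lows = folded_lows * n + low
    constant = base - folded_lows * w                     # e.g. base - low*w in 1-D

    def address(*indices):
        # Run-time part: ((i1 * n2 + i2) * n3 + ...) * w + constant
        folded = 0
        for i, n in zip(indices, sizes):
            folded = folded * n + i
        return folded * w + constant

    return address


vector = make_row_major_addresser(base=100, w=4, bounds=[(1, 10)])
matrix = make_row_major_addresser(base=0, w=4, bounds=[(1, 3), (1, 4)])
```

Here vector(3) agrees with base + (3 - 1) * 4, and matrix(2, 3) agrees with base + ((2 - 1) * 4 + (3 - 1)) * 4.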
4. BOOLEAN EXPRESSIONS
Boolean expressions have two primary purposes. They are used to compute logical values, but more often
they are used as conditional expressions in statements that alter the flow of control, such as if-then-else, or
while-do statements.
Boolean expressions are composed of the boolean operators ( and, or, and not ) applied to elements that are
boolean variables or relational expressions. Relational expressions are of the form E1 relop E2, where E1 and
E2 are arithmetic expressions.
Here we consider boolean expressions generated by the following grammar:
E → E or E | E and E | not E | ( E ) | id relop id | true | false
Methods of Translating Boolean Expressions:
There are two principal methods of representing the value of a boolean expression. They are:
To encode true and false numerically and to evaluate a boolean expression analogously to an arithmetic
expression. Often, 1 is used to denote true and 0 to denote false.
To implement boolean expressions by flow of control, that is, representing the value of a boolean expression
by a position reached in a program. This method is particularly convenient in implementing the boolean
expressions in flow-of-control statements, such as the if-then and while-do statements.
Consider the implementation of Boolean expressions using 1 to denote true and 0 to denote false. Expressions
are evaluated in a manner similar to arithmetic expressions.
For example, the three address code for a or b and not c is:
t1 = not c
t2 = b and t1
t3 = a or t2
107: t2 = 1
108:
5. CASE STATEMENT
switch expression
begin
case value: statement
case value: statement
..
case value: statement
default: statement
end
There is a selector expression, which is to be evaluated, followed by n constant values that the expression can
take. This may also include a default value which always matches the expression if no other value does. The
intended translation of a switch is code to:
1. Evaluate the expression.
2. Find which value in the list of cases is the same as the value of the expression. The default value
matches the expression if none of the values explicitly mentioned in the cases does.
3. Execute the statement associated with the value found.
Most machines provide hardware instructions with which a case statement can be implemented easily,
so case is treated separately and not as a combination of if-then statements.
Translation
goto next ..
goto Ln
next:
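One common layout places the code for every case first and a block of tests at the end, which is consistent with the surviving fragments above (goto next, goto Ln, next:). The sketch below generates that layout; the label names, the temporary t, and the function name translate_switch are assumptions for illustration.

```python
def translate_switch(selector, cases, default=None):
    """cases is a list of (value, body_lines); default is a list of lines or None."""
    code = [f"t := {selector}", "goto test"]
    tests = []
    for n, (value, body) in enumerate(cases, start=1):
        code.append(f"L{n}:")
        code.extend(body)
        code.append("goto next")
        tests.append(f"if t = {value} goto L{n}")
    if default is not None:
        label = f"L{len(cases) + 1}"
        code.append(f"{label}:")
        code.extend(default)
        code.append("goto next")
        tests.append(f"goto {label}")          # default matches when nothing else does
    code.append("test:")
    code.extend(tests)
    code.append("next:")
    return code


code = translate_switch("x", [(1, ["a := 1"]), (2, ["a := 2"])], ["a := 0"])
```

Keeping the tests together at the end makes it easy for a later phase to replace them with a jump table when the case values are dense.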
6. BACKPATCHING
The easiest way to implement the syntax-directed definitions for boolean expressions is to use two passes.
First, construct a syntax tree for the input, and then walk the tree in depth-first order, computing the
translations. The main problem with generating code for boolean expressions and flow-of-control statements
in a single pass is that during one single pass we may not know the labels that control must go to at the time
the jump statements are generated. Hence, a series of branching statements with the targets of the jumps left
unspecified is generated. Each statement will be put on a list of goto statements whose labels will be filled in
when the proper label can be determined. We call this subsequent filling in of labels backpatching.
makelist(i) creates a new list containing only i, an index into the array of quadruples; makelist returns a
pointer to the list it has made.
merge(p1,p2) concatenates the lists pointed to by p1 and p2, and returns a pointer to the
concatenated list.
backpatch(p,i) inserts i as the target label for each of the statements on the list pointed to by p.
Boolean Expressions
E E1 or M E2
| E1 and M E 2
| not E1
| (E 1 )
| id 1 relop id 2
| true
| false M ? ε
Synthesized attributes truelist and falselist of non-terminal E are used to generate jumping code for boolean
expressions. Incomplete jumps with unfilled labels are placed on lists pointed to by
E.truelist and E.falselist.
Consider production E → E1 and M E2. If E1 is false, then E is also false, so the statements on E1.falselist
become part of E.falselist. If E1 is true, then we must next test E2, so the target for the statements E1.truelist
must be the beginning of the code generated for E2. This target is obtained using marker non-terminal M.
Attribute M.quad records the number of the first statement of E2.code. With the production M → ε we
associate the semantic action
{ M.quad := nextquad }
The variable nextquad holds the index of the next quadruple to follow. This value will be backpatched onto the
E1.truelist when we have seen the remainder of the production E → E1 and M E2.
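The list operations and the rule for and can be exercised with a small sketch. Quadruples are numbered from 0 here, the class Quads and the helper names relop, and_ and or_ are hypothetical, and the rule for or is written by symmetry with the rule for and described above, which is an assumption rather than something the text spells out.

```python
class Quads:
    def __init__(self):
        self.quads = []                        # each entry: [text, target or None]

    @property
    def nextquad(self):
        return len(self.quads)

    def emit(self, text):
        self.quads.append([text, None])        # jump target left unfilled
        return len(self.quads) - 1

    def backpatch(self, lst, target):
        for i in lst:
            self.quads[i][1] = target


def makelist(i):
    return [i]


def merge(p1, p2):
    return p1 + p2


def relop(q, test):
    """E -> id1 relop id2: one conditional and one unconditional jump."""
    truelist = makelist(q.emit(f"if {test} goto"))
    falselist = makelist(q.emit("goto"))
    return truelist, falselist


def and_(q, e1, m_quad, e2):
    q.backpatch(e1[0], m_quad)                 # E1 true: go on to test E2
    return e2[0], merge(e1[1], e2[1])


def or_(q, e1, m_quad, e2):
    q.backpatch(e1[1], m_quad)                 # E1 false: go on to test E2
    return merge(e1[0], e2[0]), e2[1]


# a < b or c < d and e < f, with M.quad captured where each marker would reduce
q = Quads()
e1 = relop(q, "a < b"); m1 = q.nextquad
e2 = relop(q, "c < d"); m2 = q.nextquad
e3 = relop(q, "e < f")
truelist, falselist = or_(q, e1, m1, and_(q, e2, m2, e3))
```

After the whole expression is seen, quadruples 0 and 4 are still waiting for the true exit and quadruples 3 and 5 for the false exit; they are filled in once the enclosing statement is known.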
7. PROCEDURE CALLS
The procedure is such an important and frequently used programming construct that it is imperative for a
compiler to generate good code for procedure calls and returns. The run-time routines that handle procedure
argument passing, calls and returns are part of the run-time support package.
Let us consider a grammar for a simple procedure call statement:
(1) S → call id ( Elist )
(2) Elist → Elist , E
(3) Elist → E
Calling Sequences:
The translation for a call includes a calling sequence, a sequence of actions taken on entry to and exit from
each procedure. The following are the actions that take place in a calling sequence:
When a procedure call occurs, space must be allocated for the activation record of the called procedure.
The arguments of the called procedure must be evaluated and made available to the called procedure in a
known place.
Environment pointers must be established to enable the called procedure to access data in enclosing blocks.
The state of the calling procedure must be saved so it can resume execution after the call.
Also saved in a known place is the return address, the location to which the called routine must transfer after
it is finished.
Finally a jump to the beginning of the code for the called procedure must be generated. For example, consider
the following syntax-directed translation.
(1) S → call id ( Elist )
{ for each item p on queue do
      emit('param' p);
  emit('call' id.place) }
(2) Elist → Elist , E
{ append E.place to the end of queue }
(3) Elist → E
{ initialize queue to contain only E.place }
Here, the code for S is the code for Elist, which evaluates the arguments, followed by a param p statement for
each argument, followed by a call statement. Queue is emptied and then gets a single pointer to the symbol
table location for the name that denotes the value of E.
8. CODE GENERATION
The final phase in the compiler model is the code generator. It takes as input an intermediate representation of the
source program and produces as output an equivalent target program. The code generation techniques
presented below can be used whether or not an optimizing phase occurs before code generation.
Prior to code generation, the source program must have been scanned, parsed and translated into an intermediate
representation by the front end, along with the necessary type checking. Therefore, input to code generation is
assumed to be error-free.
Target program:
The output of the code generator is the target program. The output may be:
a. Absolute machine language
It can be placed in a fixed memory location and can be executed immediately.
b. Relocatable machine language
It allows subprograms to be compiled separately.
c. Assembly language
Code generation is made easier.
Memory management:
Names in the source program are mapped to addresses of data objects in run-time memory by the
front end and code generator.
It makes use of the symbol table, that is, a name in a three-address statement refers to a symbol-table
entry for the name.
The quality of the generated code is determined by its speed and size.
Instruction speeds and machine idioms are important factors when efficiency of the target program is
considered.
Register allocation
Instructions involving register operands are shorter and faster than those involving operands in memory.
The use of registers is subdivided into two sub-problems:
Register allocation – the set of variables that will reside in registers at a point in the program is selected.
Register assignment – the specific register that a variable will reside in is picked.
Certain machines require even-odd register pairs for some operands and results.
For example, consider the division instruction of the form: D x, y
where x is the dividend, held in the even register of an even/odd register pair, and y is the divisor. After the
division, the even register holds the remainder and the odd register holds the quotient.
Evaluation order
The order in which the computations are performed can affect the efficiency of the target code. Some
computation orders require fewer registers to hold intermediate results than others.
Method:
1. We first determine the set of leaders, the first statements of basic blocks. The rules we use are the
following:
a. The first statement is a leader.
b. Any statement that is the target of a conditional or unconditional goto is a leader.
c. Any statement that immediately follows a goto or conditional goto statement is a leader.
2. For each leader, its basic block consists of the leader and all statements up to but not including the next
leader or the end of the program.
Consider the following source code for dot product of two vectors a and b of length 20
begin
prod :=0;
i:=1;
do begin
prod :=prod+ a[i] * b[i];
i :=i+1;
end
while i <= 20
end
The three-address code for the above source program is given as:
(1) prod := 0
(2) i := 1
(3) t1 := 4* i
(4) t2 := a[t1] /*compute a[i] */
(5) t3 := 4* i
(6) t4 := b[t3] /*compute b[i] */
(7) t5 := t2*t4
(8) t6 := prod+t5
(9) prod := t6
(10) t7 := i+1
(11) i := t7
(12) if i<=20 goto (3)
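The leader rules can be applied mechanically to the twelve statements above. This sketch assumes jump targets are always written in the form goto (n), as in the listing; the function name basic_blocks is hypothetical.

```python
import re


def basic_blocks(stmts):
    """Partition statements (numbered from 1) into basic blocks of statement numbers."""
    leaders = {1}                                  # rule a: the first statement
    for i, stmt in enumerate(stmts, start=1):
        match = re.search(r"goto \((\d+)\)", stmt)
        if match:
            leaders.add(int(match.group(1)))       # rule b: the target of a goto
            if i < len(stmts):
                leaders.add(i + 1)                 # rule c: the statement after a goto
    starts = sorted(leaders)
    ends = starts[1:] + [len(stmts) + 1]
    return [list(range(start, end)) for start, end in zip(starts, ends)]


dot_product = [
    "prod := 0",
    "i := 1",
    "t1 := 4 * i",
    "t2 := a[t1]",
    "t3 := 4 * i",
    "t4 := b[t3]",
    "t5 := t2 * t4",
    "t6 := prod + t5",
    "prod := t6",
    "t7 := i + 1",
    "i := t7",
    "if i <= 20 goto (3)",
]
blocks = basic_blocks(dot_product)
```

Statements (1) and (3) are the only leaders, so the program splits into one block holding statements (1) and (2) and a second block holding statements (3) through (12).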
A number of transformations can be applied to a basic block without changing the set of expressions computed
by the block. Two important classes of transformation are:
Structure-preserving transformations
Algebraic transformations
1. Structure-preserving transformations:
a) Common sub-expression elimination:
Before            After
a := b + c        a := b + c
b := a - d        b := a - d
c := b + c        c := b + c
d := a - d        d := b
Since the second and fourth statements compute the same expression, the basic block can be transformed as
above.
b) Dead-code elimination:
Suppose x is dead, that is, never subsequently used, at the point where the statement x : =y + z appears in a
basic block. Then this statement may be safely removed without changing the value of the basic block.
c) Renaming temporary variables:
A statement t : = b + c ( t is a temporary ) can be changed to u : = b + c (u is a new temporary) and all uses of
this instance of t can be changed to u without changing the value of the basic block.
A basic block in which each statement that defines a temporary defines a new temporary is called a normal-form block.
d) Interchange of statements:
Suppose a block has the following two adjacent statements:
t1: = b + c
t2: = x + y
We can interchange the two statements without affecting the value of the block if and only if neither x nor y is
t1 and neither b nor c is t2.
2. Algebraic transformations:
Algebraic transformations can be used to change the set of expressions computed by a basic block into an
algebraically equivalent set.
Examples:
i) x := x + 0 or x := x * 1 can be eliminated from a basic block without changing the set of expressions it
computes.
ii) The exponential statement x : = y * * 2 can be replaced by x : = y * y.
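Both examples can be written as a small rewriting pass over a block. The sketch assumes statements are strings of the form "x := expr" with single spaces around operators; the function name simplify is hypothetical.

```python
import re


def simplify(block):
    """Apply the two algebraic transformations above to a list of statements."""
    result = []
    for stmt in block:
        target, expr = [part.strip() for part in stmt.split(":=")]
        if expr in (f"{target} + 0", f"{target} * 1"):
            continue                               # x := x + 0 and x := x * 1 do nothing
        square = re.fullmatch(r"(\w+) \*\* 2", expr)
        if square:
            name = square.group(1)
            expr = f"{name} * {name}"              # y ** 2 becomes y * y
        result.append(f"{target} := {expr}")
    return result


optimized = simplify(["x := x + 0", "y := y * 1", "z := w ** 2", "a := b + c"])
```

Only exact textual matches are rewritten here; a fuller pass would also recognise forms such as 0 + x.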
The nodes of the flow graph are basic blocks. It has a distinguished initial node.
Loops
A loop is a collection of nodes in a flow graph such that
1. All nodes in the collection are strongly connected.
2. The collection of nodes has a unique entry.
A DAG for a basic block is a directed acyclic graph with the following labels on nodes:
1. Leaves are labeled by unique identifiers, either variable names or constants.
2. Interior nodes are labeled by an operator symbol.
3. Nodes are also optionally given a sequence of identifiers for labels to store the computed values.
DAGs are useful data structures for implementing transformations on basic blocks.
It gives a picture of how the value computed by a statement is used in subsequent statements.
It provides a good way of determining common sub - expressions.
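Method:
Each statement has one of three forms: case (i) x := y OP z, case (ii) x := OP y, case (iii) x := y.
Step 1: If node(y) is undefined, create a leaf labeled y and let node(y) be this node.
For case (i), if node(z) is undefined, create a leaf labeled z and let node(z) be this node.
Step 2: For case (i), determine whether there is a node labeled OP whose left child is node(y) and whose right
child is node(z) (this is the check for a common sub-expression). If not, create such a node. Let n be this node.
For case (ii), determine whether there is a node labeled OP with one child node(y). If not, create such a node.
Let n be this node.
For case (iii), node n will be node(y).
Step 3: Delete x from the list of identifiers for node(x). Append x to the list of attached identifiers for the node
n found in step 2 and set node(x) to n.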
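The three steps can be sketched for statements of the binary form x := y OP z only; the unary and copy forms are left out to keep the example short. The tuple layout of a node and the function name build_dag are assumptions for illustration.

```python
def build_dag(block):
    """block is a list of (x, op, y, z) meaning x := y op z.

    Returns (nodes, current): nodes[i] is (label, left, right, attached_ids)
    and current[name] is the index of the node holding that name's value.
    """
    nodes, current = [], {}

    def node_for(name):
        if name not in current:                    # step 1: create a leaf on demand
            nodes.append((name, None, None, []))
            current[name] = len(nodes) - 1
        return current[name]

    for x, op, y, z in block:
        left, right = node_for(y), node_for(z)
        for n, (label, l, r, _) in enumerate(nodes):
            if (label, l, r) == (op, left, right):
                break                              # step 2: common sub-expression found
        else:
            nodes.append((op, left, right, []))
            n = len(nodes) - 1
        if x in current:                           # step 3: detach x from its old node
            old_ids = nodes[current[x]][3]
            if x in old_ids:
                old_ids.remove(x)
        nodes[n][3].append(x)
        current[x] = n
    return nodes, current


nodes, current = build_dag([
    ("a", "+", "b", "c"),
    ("b", "-", "a", "d"),
    ("c", "+", "b", "c"),
    ("d", "-", "a", "d"),
])
```

In this block b and d end up attached to the same node, so a - d is detected as a common sub-expression, while the two occurrences of b + c get different nodes because b is redefined in between.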
Application of DAGs:
1. We can automatically detect common sub expressions.
2. We can determine which identifiers have their values used in the block.
3. We can determine which statements compute values that could be used outside the block.
Unreachable code
#define debug 0
if (debug) {
    print debugging info
}
This may be translated as
    if debug = 1 goto L1
    goto L2
L1: print debugging information
L2:
Eliminating the jump over a jump gives
    if debug <> 1 goto L2
    print debugging information
L2:
Algorithm:
1) while unlisted interior nodes remain do begin
2) select an unlisted node n, all of whose parents have been listed;
3) list n;
4) while the leftmost child m of n has no unlisted parents and is not a leaf do
begin
5) list m;
6) n : = m
end
end
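The listing algorithm can be run on a small DAG. The DAG below is a made-up illustration, not one taken from these notes: interior nodes are integers mapped to their (left, right) children, and leaves are strings. The function name list_nodes is hypothetical.

```python
def list_nodes(children):
    """children maps each interior node to (left, right); leaves are not keys."""
    parents = {}
    for node, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(node)

    listed = []

    def ready(node):
        # Unlisted, and every parent has already been listed.
        return node not in listed and all(p in listed for p in parents.get(node, ()))

    while len(listed) < len(children):             # line 1: unlisted interior nodes remain
        n = next(node for node in children if ready(node))   # line 2
        listed.append(n)                           # line 3
        while True:                                # lines 4 to 6
            m = children[n][0]                     # leftmost child of n
            if m in children and ready(m):
                listed.append(m)
                n = m
            else:
                break
    return listed


dag = {
    1: (2, 3),
    2: (6, 4),
    3: (4, "e"),
    4: (5, 8),
    5: (6, "c"),
    6: ("a", "b"),
    8: ("d", "e"),
}
order = list_nodes(dag)
```

The inner loop follows left children for as long as the child has no unlisted parent, which is why node 6 is not listed right after node 2: its other parent, node 5, has not been listed at that point.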
The code generation phase generates code based on the number assigned to each node T. All the available
registers are arranged as a stack, with the lowest-numbered register at the top. The algorithm assumes that the
number of registers required does not exceed the number of registers available; when it does, the
intermediate result of a node may need to be spilled to memory. The algorithm also does not take
advantage of the commutative and associative properties of operators to rearrange the expression tree.
It first checks whether the node T is a leaf node; if so, it generates a load instruction for it,
load top(), T. If T is an internal node, it compares the numbers assigned to the left subtree l and the right
subtree r. There are three possibilities: the number on the right is 0, or it is greater than, or less than or equal
to, the number on the left. If it is 0, call generate() on the left subtree l and then generate the instruction
op top(), r. If the number on the left is greater than or equal to the number on the right, call generate() on the
left subtree, get a new register R by popping the top of the stack, call generate() on the right subtree, generate
the instruction op R, top(), and push the used register back onto the stack.
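The description can be turned into a short sketch. Several things here are assumptions: the numbering is taken to be the usual register-need labelling (a left leaf is 1, a right leaf is 0, an interior node takes the larger child number, or one more when the two are equal), instructions are written destination first as in load top(), T, the remaining case (right number greater than left) is handled by swapping the top two registers, and spilling is not handled at all. The names need and generate are illustrative.

```python
def need(node, is_left=True):
    """Number assigned to a node: how many registers its subtree needs."""
    if isinstance(node, str):
        return 1 if is_left else 0
    _, left, right = node
    l, r = need(left, True), need(right, False)
    return max(l, r) if l != r else l + 1


def generate(node, regs, code):
    """Leave the value of `node` in regs[-1], the top of the register stack."""
    if isinstance(node, str):                      # leaf: load it into the top register
        code.append(f"load {regs[-1]}, {node}")
        return
    op, left, right = node
    if isinstance(right, str):                     # right number is 0: use it from memory
        generate(left, regs, code)
        code.append(f"{op} {regs[-1]}, {right}")
    elif need(left) >= need(right, False):         # left first, keep its register aside
        generate(left, regs, code)
        r = regs.pop()
        generate(right, regs, code)
        code.append(f"{op} {r}, {regs[-1]}")
        regs.append(r)
    else:                                          # right needs more: evaluate it first
        regs[-1], regs[-2] = regs[-2], regs[-1]
        generate(right, regs, code)
        r = regs.pop()
        generate(left, regs, code)
        code.append(f"{op} {regs[-1]}, {r}")
        regs.append(r)
        regs[-1], regs[-2] = regs[-2], regs[-1]


tree = ("-", ("+", "a", "b"), ("-", "e", ("+", "c", "d")))   # (a + b) - (e - (c + d))
regs, code = ["R1", "R0"], []                                # R0 is on top of the stack
generate(tree, regs, code)
```

The tree needs two registers, the result ends up in R0, and the register stack is back in its initial order when generate returns. With fewer registers than need(tree) this sketch would fail rather than spill.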