
CHAPTER-01

(PART-01)
INTRODUCTION
1
About Me !!

Dr.G.K.Viju
Professor
Department of Computer Science
Karary University

vijugk2005@gmail.com / +249 919471821
2
Course Objectives
To introduce the major concept areas of language
translation and compiler design.
To enrich the knowledge in various phases of
compiler and its use.
To enrich the knowledge of parser.
To provide practical programming skills necessary
for constructing a compiler
3
Teaching Strategy
Lectures
Seminars
Interactive Exercises
Discussions
4
Attendance !!
Essential:
The seminar approach we will follow requires your presence and
active participation during each class period
Classroom Discussion:
Sharing your ideas and work history with your peers is central to
your learning experience in this course.
5
Grading !!!
Your grade in this course will be determined by
both individual and group activities.

Your grade will be determined by your performance
in the following areas:

External Examination (Final Exam) 60%
Internal Examinations (Class works) 40%
6
Keep in Touch
Ask questions in the class
Call me (+249 919471821)
Mail me (vijugk2005@gmail.com)
7
Syllabus
15 Weeks

Introduction to Compilers
Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
Final Code Generation (+)
Related Articles.
8
References
Compilers- Principles, Techniques & Tools
By Alfred V Aho, Monica S Lam, Ravi Sethi,
Jeffrey D Ullman.
9
Why Study Compilers?
Build large, ambitious software systems.
See theory come to life.
Learn how to build programming languages.
Learn how programming languages work.
10
Program:
A set of instructions used to solve a particular set of problems.
Software:
A collection of programs intended to solve larger problems.

11
Computer languages are an essential link between people and machines.

Programming languages are notations for describing
computations to people and to machines.

The world as we know it depends on programming languages,
because all the software running on all the computers was
written in some programming language.

But, before a program can be run, it first must be translated
into a form in which it can be executed by a computer.
12
Translator
A translator is a program that takes as input a program
written in one programming language (the Source Language) and
produces as output a program in another language (the Target/
Object Language).
Low Level Language (Machine Dependent)
A program written for one computer cannot be used on another.
High Level Language (Machine Independent)
A program written for one computer can be run on another
computer with or without modifications.
13
T-Diagram
A useful notation that is used to denote a computer
program or translator.
[Figure: T-diagram. A Source Program (in the Source Language) enters the Translator, which produces an Object Program.]
Translator Classes
Assembler:
A program that converts assembly language into machine
language.
Compiler:
A program that converts a high-level language program into a
machine language program (the whole program, or a block of statements, at a time).
Interpreter:
A program that translates and executes a high-level language
program one statement at a time.
15
Compiler
A program which translates a program written in
one language into an equivalent program in another
language.


[Figure: Source Program → Compiler → Target Program, with Error Messages as a side output of the compiler.]
History of Compilers
The first complete compiler (1957) was written for FORTRAN by
John Backus's team.
Early compilers were written in assembly language.
Self-hosting compilers (written in the language they compile):
LISP (1962)
Pascal, C, etc.
If a portion of the input to a C/C++ compiler looked like this
A = B + C * D
The output corresponding to this input might look something
like this:

LOD R1,C // Load the value of C into register 1
MUL R1,D // Multiply the value of D by register1
STO R1,TEMP1 // Store the result in TEMP1
LOD R1,B // Load the value of B into register 1
ADD R1,TEMP1 // Add value of Temp1 to register 1
STO R1,TEMP2 // Store the result in TEMP2
MOV A,TEMP2 // Move TEMP2 to A, the final result
18
Properties of a good Compiler
Reliability
Faithfulness
Compilation Speed
Diagnostics
Error Handling
Implementation & Maintenance
Good Human Interface
19
Parts of the Compilation
Analysis
Synthesis

In the analysis phase, an Intermediate Representation
(IR) is created from the given source program.
Components: Lexical Analyzer, Syntax Analyzer, Semantic Analyzer
In the synthesis phase, the equivalent target program
is created from the IR.
Components: Intermediate Code Generator, Code Optimizer, Code Generator
20
CHAPTER-01
(PART-2)
STRUCTURE OF COMPILER
21
The Structure of a Modern Compiler
Source Code → Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → IR Optimization → Code Generation → Code Optimization → Machine Code
[Figure: the same pipeline is shown twice more, once with the Front End phases highlighted and once with the Back End phases highlighted.]
24
Phases of a Compiler
Compiler operates in phases
Each phase transforms source program from one representation
to another
They communicate with error handler.
They communicate with the symbol table.
25
Phases of the Compiler
Source Program
1. Lexical Analyzer
2. Syntax Analyzer
3. Semantic Analyzer
4. Intermediate Code Generator
5. Code Optimizer
6. Code Generator
Target Program
(All six phases interact with the Symbol-table Manager and the Error Handler.)
26
Analysis of the Source Program
Lexical Analysis (Linear Analysis/Scanning)
The process of reading the characters from left to right and grouping them
into tokens, sequences of characters that have a collective meaning.
Syntax Analysis (Hierarchical Analysis/Parsing)
Groups the tokens of the source program into grammatical
phrases that are used by the compiler to synthesize output.
Semantic Analysis
Checks the source program for semantic errors and gathers type information
for the subsequent code generation phase.
27
while (y < z) {
int x = a + b;
y += x;
}
[Figure: the loop above entering the compiler pipeline (Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → IR Optimization → Code Generation → Code Optimization).]
Lexical Analysis


Reads the characters in the source program and
groups them into tokens.

A token represents an identifier, a keyword,
a punctuation character or an operator.

The character sequence forming a token is
called the lexeme for the token.

The lexical analyzer not only generates
tokens but also enters the lexeme into
the symbol table.



29
Outcome of Lexical Analyzer
Syntax Analysis
Groups the tokens into syntactic structures.
Semantic Analysis
Ensures that the components of a program fit together meaningfully.
Gathers type information and checks for type compatibility.

Intermediate Code Generation
The simple instructions produced from the output of the
syntax analyzer form the IR.

The IR has two properties: it is easy to use and easy
to translate.

Example: Three Address Code (TAC)
33


Code Optimization
Improves the intermediate code so that
the final object program runs faster
and/or takes less space.

It involves:
- Detection and removal of dead code.
- Calculation of constant expressions and terms.
- Moving code outside of loops.
- Removal of unnecessary temporary variables.


34
Code Generation
Machine code is generated. This involves:

- Allocation of registers and memory.
- Generation of correct references.
- Generation of correct types.
- Generation of machine code.

Symbol Table
It is a data structure containing a record for each identifier,
with fields for the attributes of the identifier.

The data structure allows us to find the record for each
identifier quickly and to store or retrieve data from that record
quickly.

When an identifier in the source program is detected by the
lexical Analyzer, the identifier is entered into the symbol
table.
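The description above can be sketched as a small class. This is only an illustration: the class name `SymbolTable`, the methods `enter` and `lookup`, and the record fields are assumptions, and a dictionary stands in for whatever fast lookup structure a real compiler would use.

```python
class SymbolTable:
    """Maps each identifier to one record holding its attributes."""

    def __init__(self):
        self._records = {}

    def enter(self, name):
        # Create the record the first time the identifier is seen;
        # later sightings return the same record.
        return self._records.setdefault(name, {"name": name, "type": None})

    def lookup(self, name):
        # Returns None for an identifier that was never entered.
        return self._records.get(name)


table = SymbolTable()
for lexeme in ["position", "initial", "rate", "rate"]:
    table.enter(lexeme)              # done by the lexical analyzer
table.lookup("rate")["type"] = "real"  # filled in by a later phase
```

A hash-based dictionary gives the quick store and retrieve operations the slide asks for; later phases add attributes such as the type to the same record.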



37
Error Detection and Reporting
Each phase can encounter some errors.
After detecting an error, a phase must deal with that error, so that
compilation can proceed allowing further errors in the source program to
be detected.
Examples:
i) Lexical errors (the characters do not form any valid token):
misspellings such as in, floa, switc, etc.
ii) Syntax errors (the token stream violates the structure rules of the
language):
missing parentheses, braces, etc.

iii) Semantic errors (the operation involved has no meaning):
a is not used;
38
Find the Answer?
Consider the following statement
position=initial + rate*60
Show the output of each phase.

39
Example
(Every phase reports to the error handler and consults the symbol table.)
Symbol Table: position ..., initial ..., rate ...

position := initial + rate * 60
  ↓ lexical analyzer
id1 := id2 + id3 * 60
  ↓ syntax analyzer
syntax tree: := ( id1, + ( id2, * ( id3, 60 ) ) )
  ↓ semantic analyzer
syntax tree: := ( id1, + ( id2, * ( id3, inttoreal(60) ) ) )
  ↓ intermediate code generator
3 address code:
temp1 := inttoreal(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
  ↓ code optimizer
temp1 := id3 * 60.0
id1 := id2 + temp1
  ↓ final code generator
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
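The intermediate code generation step of this example can be sketched in a few lines. The tuple encoding of the syntax tree and the names `gen_tac` and `translate` are assumptions made for illustration; the slides do not prescribe a data structure.

```python
import itertools


def gen_tac(node, code, temps):
    """Return the name holding the node's value, appending instructions to code."""
    if isinstance(node, str):      # leaf: identifier or constant
        return node
    op, *args = node
    names = [gen_tac(arg, code, temps) for arg in args]
    temp = f"temp{next(temps)}"
    if op == "inttoreal":
        code.append(f"{temp} := inttoreal({names[0]})")
    else:
        code.append(f"{temp} := {names[0]} {op} {names[1]}")
    return temp


def translate(target, expr):
    """Three-address code for the assignment target := expr."""
    code = []
    result = gen_tac(expr, code, itertools.count(1))
    code.append(f"{target} := {result}")
    return code


# Tree produced by the semantic analyzer for position := initial + rate * 60
tree = ("+", "id2", ("*", "id3", ("inttoreal", "60")))
for line in translate("id1", tree):
    print(line)
```

Each interior node gets one fresh temporary, so every emitted instruction has at most one operator, which is the defining property of three-address code.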
41
Why study Compilation ?
Compilers are important:
They are responsible for many aspects of system performance.
Compilers are interesting:
Compilers include many applications of theory to practice.
Compilers are everywhere:
Many practical applications use the concepts of compilers.
CHAPTER-02
LEXICAL ANALYSIS
(PART-01)
1
Where We Are ?
Source Code → Lexical Analysis → Syntax Analysis → Semantic Analysis → IR Generation → IR Optimization → Code Generation → Code Optimization → Machine Code
2
Lexical Analyzer


Lexical Analyzer reads the source program character by
character and returns the tokens of the source program.

The function of the Lexical Analyzer is to read the source
program, one character at a time, and to translate it into a
sequence of primitive units called Tokens



3
Role of the Lexical Analyzer
Reads input and produces a sequence of tokens.
Can act as a sub-routine for the parser:
Get_next_token
Secondary tasks:
Strips blanks and new line characters
Correlates error messages with the source program
May be split into a scanning phase and a lexical analysis phase
After lexical analysis, individual characters are no longer
examined.
4
Role of Lexical Analyzer
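[Figure: the source program feeds the lexical analyzer; the parser sends "get next token" requests and the lexical analyzer returns a token; both consult the symbol table.]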
5
Terminology
Tokens include keywords, identifiers, operators,
constants, variables, strings, special characters and
special symbols.
A Lexeme is a sequence of characters in the source
program representing a token.
A Pattern is a rule describing a set of lexemes that
can represent a particular token.
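To make the three terms concrete, here is a minimal tokenizer sketch in which each pattern is a regular expression, each match is a lexeme, and the pattern's name is the token. The token names and the small keyword set are assumptions chosen to cover the while-loop example shown earlier in these slides.

```python
import re

# (token name, pattern) pairs; earlier entries win when several could match.
TOKEN_SPEC = [
    ("keyword", r"\b(?:while|int|if|const)\b"),
    ("num",     r"\d+(?:\.\d+)?"),
    ("id",      r"[A-Za-z_]\w*"),
    ("op",      r"\+=|<=|>=|[-+*/<>=]"),
    ("punct",   r"[(){};]"),
    ("skip",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token, lexeme) pairs, dropping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            # No pattern matches a prefix of the remaining input.
            raise SyntaxError(f"no token pattern matches at position {pos}")
        if m.lastgroup != "skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("while (y < z) { int x = a + b; y += x; }"))
```

Listing `keyword` before `id` is what makes `while` a keyword rather than an identifier; the word-boundary check keeps names such as `interest` from being split.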
6
Examples
Token      Sample Lexemes           Informal Description of Pattern
const      const                    const
if         if                       if
relation   <, <=, =, <>, >, >=      < or <= or = or <> or >= or >
id         pi, count, D2            letter followed by letters and digits
num        3.1416, 0, 6.02E23       any numeric constant
literal    "core dumped"            any characters between " and " except "

The token classifies the pattern.
The actual values (lexemes) are critical. This information is:
1. Stored in the symbol table
2. Returned to the parser
7
Attributes for Tokens
Attributes provide additional information about
tokens
Tokens influence parsing decision; the attributes
influence the translation of tokens
Lexical analyzers usually provide a single attribute
per token (pointer into symbol table)

8
9
Attributes of Tokens
Input: y := 31 + 28*x
The lexical analyzer passes the parser a stream of <token, tokenval> pairs, where tokenval is the token attribute:
<id, y> <assign, :=> <num, 31> <operator, +> <num, 28> <operator, *> <id, x>
Attributes for Tokens
Example: E = M * C ** 2

<id, pointer to symbol-table entry for E>
<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>

What Tokens are Useful Here?

for (int k = 0; k < myArray[5]; ++k)
{
cout << k << endl;
}

Candidate tokens:
for  int  {  }  <<  ;  =  <  (  )  [  ]  ++
Identifier
IntegerConstant
cout  >>  endl
23
Specification of Tokens
24
Alphabet: Finite, nonempty set of symbols.
E.g. $\Sigma = \{0,1\}$, the binary alphabet.
E.g. $\Sigma = \{a, b, c, \ldots, z\}$, the set of all lowercase letters.
Strings: Finite sequence of symbols from an alphabet.
E.g. 0011001
Empty string: The string with zero occurrences of
symbols from the alphabet. The empty string is denoted by $\varepsilon$.
25
Length of a string: Number of positions for symbols in
the string. $|w|$ denotes the length of string $w$.
E.g. $|0110| = 4$; $|\varepsilon| = 0$.
Powers of an alphabet: $\Sigma^k$ = the set of strings of
length $k$ with symbols from $\Sigma$.
E.g. for $\Sigma = \{0,1\}$:
$\Sigma^1 = \{0,1\}$
$\Sigma^2 = \{00,01,10,11\}$
$\Sigma^0 = \{\varepsilon\}$
26
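The set of all strings over $\Sigma$ is denoted by $\Sigma^*$:
$\Sigma^* = \Sigma^0 \cup \Sigma^1 \cup \Sigma^2 \cup \cdots$
$\Sigma^+ = \Sigma^1 \cup \Sigma^2 \cup \Sigma^3 \cup \cdots$
$\Sigma^* = \Sigma^+ \cup \{\varepsilon\}$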
27
Language: a specific set of strings over some fixed
alphabet $\Sigma$.
Example
The set of legal English words.
The set of strings consisting of n 0s followed by n 1s:
$\{\varepsilon, 01, 0011, 000111, \ldots\}$
$L_P$ = the set of binary numbers whose value is prime:
$\{10, 11, 101, 111, 1011, \ldots\}$
Concatenation & Exponentiation
28
The concatenation of two strings x and y is denoted by
xy.

The exponentiation of a string s is defined by
$s^0 = \varepsilon$
$s^i = s^{i-1}s$ for $i > 0$
Note that $s\varepsilon = \varepsilon s = s$.
29
Operations on Languages
Concatenation:
$L_1L_2 = \{\, s_1s_2 \mid s_1 \in L_1 \text{ and } s_2 \in L_2 \,\}$

Union:
$L_1 \cup L_2 = \{\, s \mid s \in L_1 \text{ or } s \in L_2 \,\}$

Exponentiation:
$L^0 = \{\varepsilon\}$, $L^1 = L$, $L^2 = LL$

Kleene Closure:
$L^* = \bigcup_{i=0}^{\infty} L^i$

Positive Closure:
$L^+ = \bigcup_{i=1}^{\infty} L^i$
30
Example
$L_1 = \{a,b,c,d\}$, $L_2 = \{1,2\}$

$L_1L_2 = \{a1,a2,b1,b2,c1,c2,d1,d2\}$

$L_1 \cup L_2 = \{a,b,c,d,1,2\}$

$L_1^3$ = all strings of length three (using a, b, c, d)

$L_1^*$ = all strings using the letters a, b, c, d, including the empty string

$L_1^+$ = the same set without the empty string
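The operations in this example can be checked with Python sets. The function names are assumptions, and because the closures are infinite sets, `closure_up_to` only builds the finite part up to a chosen power.

```python
def concat(l1, l2):
    """L1 L2: every string of l1 followed by every string of l2."""
    return {s1 + s2 for s1 in l1 for s2 in l2}


def lang_power(lang, i):
    """L^i, starting from L^0 = {empty string}."""
    result = {""}
    for _ in range(i):
        result = concat(result, lang)
    return result


def closure_up_to(lang, n, positive=False):
    """Union of L^i for i = 0..n (or 1..n for the positive closure)."""
    out = set()
    for i in range(1 if positive else 0, n + 1):
        out |= lang_power(lang, i)
    return out


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
print(sorted(concat(L1, L2)))     # the eight strings a1 ... d2
print(sorted(L1 | L2))            # union
print(len(lang_power(L1, 3)))     # number of strings of length three
```

Using the empty Python string for the empty string makes `L^0` fall out naturally, since concatenating with `""` changes nothing.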
CHAPTER-02
LEXICAL ANALYSIS
(PART-02)
31
Difficulty in Lexical Analysis
Alignment of Statements
DO 5 I = 1.25 is read as DO5I = 1.25 (FORTRAN ignores blanks)
vector<vector<int>>myvector
vector < vector < int >> myvector
(vector < (vector < (int >> myvector)))
Treatment of blanks
Reserved keywords
IF THEN THEN THEN = ELSE; ELSE ELSE = THEN (legal in PL/I, where keywords are not reserved)

32
LEXICAL ERRORS
Many errors cannot be detected at the lexical analysis stage
alone.
Example: in fi (a == b), fi is a valid identifier, so the lexical analyzer cannot tell that if was misspelled.
Lexical errors are found when the lexical analyzer is unable to proceed
because none of the patterns for tokens matches a prefix of the
remaining input.
33
Handling Lexical Errors
Panic Mode Recovery
Delete successive characters from the remaining input until the
analyzer can find a well formed token.
May confuse the parser.
Possible error recovery actions
Deleting or inserting input characters
Replacing or Transposing characters
Minimum Distance Error Correction
The minimum distance is defined to be the minimum number of
transformations required to translate one string into the other.
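A minimal sketch of minimum distance error correction, assuming the four transformations listed above (insert, delete, replace, transpose adjacent characters) each cost one step. The function names `edit_distance` and `closest` are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions, replacements and
    adjacent transpositions needed to turn string a into string b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[rows - 1][cols - 1]


def closest(word, candidates):
    """The candidate with the smallest distance from word."""
    return min(candidates, key=lambda c: edit_distance(word, c))


print(edit_distance("fi", "if"))
print(closest("floa", ["int", "float", "switch"]))
```

A lexical analyzer could use `closest` to suggest a keyword for a misspelled lexeme, although, as noted above, it usually cannot know that a misspelling occurred.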
34
Regular Expressions
Notation to specify a language
Declarative
Sort of like a programming language.
Fundamental in some languages like perl and applications like
grep or lex
Capable of describing the same thing as a NFA
The two are actually equivalent, so RE = NFA = DFA
We can define an algebra for regular expressions
35
Regular Expressions
36
Basic Symbols
ε is a regular expression denoting the language {ε}.
a ∈ Σ is a regular expression denoting {a}.
If r and s are regular expressions denoting languages L(r)
and M(s) respectively, then
r | s is a regular expression denoting L(r) ∪ M(s).
rs is a regular expression denoting L(r) M(s).
r* is a regular expression denoting L(r)*.
(r) is a regular expression denoting L(r).
A language defined by a regular expression is called a
regular set or a regular language.
RE Examples (in these examples + denotes union, written | elsewhere)
L(001) = {001}
L(0+10*) = {0, 1, 10, 100, 1000, 10000, ...}
L(0*10*) = {1, 01, 10, 010, 0010, ...} i.e. {w | w has exactly a single 1}
L(ΣΣ)* = {w | w is a string of even length}
Rε = εR = R
R+∅ = ∅+R = R

Note that R+ε may or may not equal R (we are adding ε to the language).
Note that R∅ will only equal R if R itself is the empty set.

37
Example Regular Expressions
Regular Expression Corresponding Language
ε                          {ε}
a                          {a}
abc                        {abc}
a|b|c                      {a,b,c}
(a|b|c)*                   {ε,a,b,c,aa,ab,ac,aaa,...}
a|b|c|...|z|A|B|...|Z      Any letter
0|1|2|...|9                Any digit
38
Precedence in Regular Expressions
* has the highest precedence, left associative

Concatenation has the second highest precedence, left associative

| has the lowest precedence, left associative
39
More Regular Expression Examples
Regular Expression Corresponding Language
ε|a|b|ab*      {ε, a, b, ab, abb, abbb, ...}
ab*c           {ac, abc, abbc, ...}
ab*|a*         {ε, a, ab, aa, aaa, abb, ...}
a(b*|a*)       {a, ab, aa, abb, aaa, ...}
a(b|a)*        {a, ab, aa, aaa, aab, aba, ...}
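These examples can be tested with Python's `re` module, whose `*`, `|` and parentheses follow the same precedence rules. The helper `matches` and the chosen test strings are assumptions; `re.fullmatch` is used so that the whole string, not just a prefix, must belong to the language.

```python
import re


def matches(pattern, s):
    """True if the whole string s belongs to the language of pattern."""
    return re.fullmatch(pattern, s) is not None


# pattern -> (strings in the language, strings outside it)
EXAMPLES = {
    "ab*c":     (["ac", "abc", "abbc"], ["ab", "c"]),
    "ab*|a*":   (["", "a", "ab", "aa", "abb"], ["ba", "aab"]),
    "a(b*|a*)": (["a", "ab", "aa", "abb", "aaa"], ["", "aba"]),
    "a(b|a)*":  (["a", "ab", "aab", "aba"], ["", "b"]),
}

for pattern, (inside, outside) in EXAMPLES.items():
    ok = (all(matches(pattern, s) for s in inside)
          and not any(matches(pattern, s) for s in outside))
    print(pattern, "behaves as listed" if ok else "MISMATCH")
```

Comparing `a(b*|a*)` with `a(b|a)*` on the string `aba` shows the difference between choosing one alternative for the whole repetition and choosing afresh at every step.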
40
You Try
What is the language described by each Regular Expression?

a*
(a|b)*
a|a*b
(a|b)(a|b)
aa|ab|ba|bb
Coding Regular Definitions in
Transition Diagrams
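relop → < | <= | <> | > | >= | =
id → letter ( letter | digit )*

[Figure: transition diagram for relop, start state 0]
state 0 on <      → state 1
state 1 on =      → state 2: return(relop, LE)
state 1 on >      → state 3: return(relop, NE)
state 1 on other  → state 4 (*): return(relop, LT)
state 0 on =      → state 5: return(relop, EQ)
state 0 on >      → state 6
state 6 on =      → state 7: return(relop, GE)
state 6 on other  → state 8 (*): return(relop, GT)

[Figure: transition diagram for id, start state 9]
state 9 on letter            → state 10
state 10 on letter or digit  → state 10
state 10 on other            → state 11 (*): return(gettoken(), install_id())

(*) marks states where the last character read is pushed back onto the input.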
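The relop diagram translates almost line for line into code. In this sketch the function name `relop` and its return convention (token pair plus the next input position) are assumptions; retracting on "other" is modeled by simply not advancing past the extra character.

```python
def relop(text, pos=0):
    """Follow the relop transition diagram starting at text[pos].
    Returns ((token, attribute), next_pos), or None if no relop starts here."""

    def peek(i):
        return text[i] if i < len(text) else ""

    c = peek(pos)
    if c == "<":                                  # state 0 -> 1
        nxt = peek(pos + 1)
        if nxt == "=":
            return ("relop", "LE"), pos + 2       # state 2
        if nxt == ">":
            return ("relop", "NE"), pos + 2       # state 3
        return ("relop", "LT"), pos + 1           # state 4: retract
    if c == "=":
        return ("relop", "EQ"), pos + 1           # state 5
    if c == ">":                                  # state 0 -> 6
        if peek(pos + 1) == "=":
            return ("relop", "GE"), pos + 2       # state 7
        return ("relop", "GT"), pos + 1           # state 8: retract
    return None


print(relop("<= 10"))
print(relop("a < b", 2))
```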
Queries ??
43
CHAPTER-02
LEXICAL ANALYSIS
(PART-03)
1
Finite Automata
A recognizer for a language is a program that takes a string
x and answers Yes if x is a sentence of that language,
and No otherwise. Such a recognizer is called a Finite
Automaton.
Finite automata are used as a model for:
Software for designing digital circuits.
Lexical analyzers for compilers.
Searching for keywords in a file or on the web.
Software for verifying finite state systems, such as communication
protocols.
2

First, we define regular expressions for tokens, then we
convert them into a DFA to get the lexical analyzer for our
tokens.
Algorithms
Regular Expression → NFA → DFA (2 steps)
Regular Expression → DFA (directly)



3
[Figure: regular expressions are converted to an NFA and, optionally, on to a DFA; tokens are recognized by simulating either the NFA or the DFA.]
Informal Example
Customer shopping at a store with an electronic transaction
with the bank
The customer may pay the e-money or cancel the e-money at
any time.
The store may ship goods and redeem the electronic money
with the bank.
The bank may transfer any redeemed money to a different
party, say the store.
Can model this problem with three automata
Bank Automata
Actions in bold are initiated by the entity.
Otherwise, the actions are initiated by someone else and received by the specified automata
Complete Bank Automaton
Ignores other actions that may be received
Non-Deterministic Finite
Automaton (NFA)
A Non-Deterministic Finite Automaton (NFA)
is a mathematical model that consists of

S - a set of states
Σ - a set of input symbols (the alphabet)
δ - a mapping from S × Σ to sets of states
s0 ∈ S - the start state
F ⊆ S - the set of accepting (or final) states

8
ε-transitions are allowed in NFAs.
In other words, we can move from one state to another
without consuming any symbol.
An NFA accepts a string x if and only if there is a path
from the start state to one of the accepting states such that
the edge labels along this path spell out x.

9
Transition Graph
An NFA can be diagrammatically represented by a
labeled directed graph called a transition graph
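[Figure: transition graph with states 0 to 3. State 0 (start) loops on a and on b; 0 →a→ 1 →b→ 2 →b→ 3; state 3 is accepting.]
S = {0,1,2,3}
Σ = {a,b}
s0 = 0
F = {3}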
10
Transition Table
The mapping δ of an NFA can be represented
in a transition table

State   Input a   Input b
0       {0, 1}    {0}
1                 {2}
2                 {3}

δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}
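The transition table can be simulated directly by tracking the set of states the NFA could be in after each input symbol. The dictionary encoding and the function name `accepts` are assumptions; the table entries, start state 0 and accepting set {3} are taken from the slides. This NFA has no ε-transitions, so no ε-closure step is needed.

```python
# Transition table from the slide; missing entries mean "no move".
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPTING = 0, {3}


def accepts(s):
    """True if some path labeled s leads from START to an accepting state."""
    current = {START}
    for ch in s:
        # Follow every possible move from every state we might be in.
        current = set().union(*(DELTA.get((q, ch), set()) for q in current))
    return bool(current & ACCEPTING)


for word in ["abb", "aabb", "babb", "ab", "abba"]:
    print(word, accepts(word))
```

The strings accepted are exactly those ending in abb, which matches the language (a|b)*abb quoted for this NFA later in the slides.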
11
NFA Example
Recognizes: aa* | b | ab
[Figure: transition graph with start state 0 and states 1 to 5.]

State   ε         a       b
0       {1,2,3}   -       -
1       -         4       Error
2       -         Error   5
3       -         2       Error
4       -         4       Error
5       -         Error   Error

An FA can be represented with either a
graph or a transition table.
12
NFA (Example)
[Figure: transition graph of the NFA. State 0 (start) loops on a and on b; 0 →a→ 1 →b→ 2.]
0 is the start state s0
{2} is the set of final states F
Σ = {a,b}
S = {0,1,2}
Transition function:
State   a       b
0       {0,1}   {0}
1       -       {2}
2       -       -

The language recognized by this NFA is (a|b)*ab
13
The Language Defined by an NFA
An NFA accepts an input string x if and only if there is some
path with edges labeled with symbols from x in sequence
from the start state to some accepting state in the transition
graph
A state transition from one state to another on the path is
called a move
The language defined by an NFA is the set of input strings it
accepts, such as (a|b)*abb for the example NFA
14
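From Regular Expression to ε-NFA (Thompson's Construction)
[Figure: the NFA fragment built for each kind of regular expression, each with a start state i and a final state f: one for ε, one for a symbol a, one for r1|r2 (N(r1) and N(r2) in parallel, joined by ε-transitions), one for r1r2 (N(r1) followed by N(r2)), and one for r* (N(r) with ε-transitions for skipping and repeating).]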
15
Combining the NFAs of a Set of
Regular Expressions
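a { action1 }
abb { action2 }
a*b+ { action3 }
[Figure: one NFA per pattern: 1 →a→ 2 for a; 3 →a→ 4 →b→ 5 →b→ 6 for abb; 7 (looping on a) →b→ 8 (looping on b) for a*b+. They are combined by adding a new start state 0 with ε-transitions to states 1, 3 and 7.]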
Deterministic Finite Automaton
(DFA)
A Deterministic Finite Automaton (DFA) is a special form
of NFA.

No state has an ε-transition.
For each symbol a and state s, there is at most one
edge labeled a leaving s.
17
DFA Example
Recognizes: aa* | b | ab
[Figure: a first attempt at the DFA, with states 0 to 3, marked "Mistake !!!".]
[Figure: the corrected DFA, with states 0 to 3, marked "Correct".]
Exercises
[Represent in NFA & DFA]
19
a(bab)* U a(ba)*
(ab)* U (aba)*
(a|b)* abb
a(bab)* U a(ba)*
[Figure: solution automaton with states 0 to 5.]
(ab)* U (aba)*
[Figure: solution automaton with states 0 to 5.]
(a|b)* abb
[Figure: solution automaton.]
Exercises
a(a|b)*a
((ε|a)b*)*
(a|b)* a(a|b) (a|b)
a*ba*ba*ba*
aa*|bb*
23
CHAPTER-03
SYNTAX ANALYSIS
24
Syntax Analysis
The syntax analyzer creates the syntactic structure of the given
source program.
Syntax analysis is also called parsing or hierarchical
analysis.
This syntactic structure is usually a parse tree.
The grammar that a parser implements is called a CFG,
or Context Free Grammar.
25

The syntax analyzer (parser) checks whether a given source
program satisfies the rules implied by a context free grammar
or not.
If it satisfies, the parser creates the parse tree of that program
Otherwise, the parser gives the error message.
A Context Free Grammar
Gives a precise syntactic specification of a programming language.
The design of the grammar is an initial phase of the design of a
compiler.
A grammar can be directly converted into parser by some tools.
26
Parsers (cont.)
We categorize the parsers into two groups:

1) Top-Down Parser
the parse tree is created top to bottom, starting from the root.
2) Bottom-Up Parser
the parse tree is created bottom to top, starting from the leaves.

Both top-down and bottom-up parsers scan the input from left
to right (one symbol at a time).
Efficient top-down and bottom-up parsers can be implemented
only for sub-classes of context-free grammars.
LL for top-down parsing
LR for bottom-up parsing
In LL, the first L stands for scanning left-to-right and the second L for producing a left-most derivation. In LR, the L again stands for scanning left-to-right and the R for producing a right-most derivation in reverse.
28
Context-Free Grammars
Inherently recursive structures of a programming language
are defined by a context-free grammar.
In a context-free grammar, we have:
A finite set of terminals (in our case, this will be the set of tokens)
A finite set of non-terminals (syntactic-variables)
A finite set of production rules of the following form
A → α where A is a non-terminal and
α is a string of terminals and non-terminals (including
the empty string)
A start symbol (one of the non-terminal symbols)
Example:
E → E + E | E - E | E * E | E / E | - E
E → ( E )
E → id
29
Derivations
E ⇒ E+E

E+E derives from E
we can replace E by E+E
to be able to do this, we have to have a production rule E → E+E in our
grammar.

E ⇒ E+E ⇒ id+E ⇒ id+id

A sequence of replacements of non-terminal symbols is called a
derivation; the sequence above is a derivation of id+id from E.
30
Derivation Example
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
OR
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)

At each derivation step, we can choose any of the non-terminals in the
sentential form of G for the replacement.

If we always choose the left-most non-terminal in each derivation step,
the derivation is called a left-most derivation.

If we always choose the right-most non-terminal in each derivation
step, the derivation is called a right-most derivation.
31
Left-Most and Right-Most
Derivations
Left-Most Derivation (every step replaces the left-most non-terminal, marked lm)
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)

Right-Most Derivation (every step replaces the right-most non-terminal, marked rm)
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)

We will see that top-down parsers try to find the left-most
derivation of the given source program.

We will see that bottom-up parsers try to find the right-most
derivation of the given source program in reverse order.
32
Parse Tree
Inner nodes of a parse tree are non-terminal symbols.
The leaves of a parse tree are terminal symbols.
A parse tree can be seen as a graphical representation of a derivation.
[Figure: the parse tree growing step by step along the derivation E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id).]
33
Ambiguity
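A grammar that produces more than one parse tree for some sentence is called an
ambiguous grammar.
E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Figure: the two parse trees for id+id*id. In the first, + is at the root and id*id is its right subtree; in the second, * is at the root and id+id is its left subtree.]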
34
Ambiguity (cont.)
For most parsers, the grammar must be unambiguous.

unambiguous grammar
⇒ unique selection of the parse tree for a sentence

We should eliminate the ambiguity in the grammar during
the design phase of the compiler.
An unambiguous grammar should be written to eliminate
the ambiguity.
To disambiguate a grammar, we choose the preferred parse tree
for each sentence that has several, and restrict the grammar
to that choice.
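A := B + C * A
[Figure: parse tree rooted at <Assign> with <id> A, := and <Expr>. The top <Expr> is <Expr> + <Expr>, with B on the left and the subtree C * A on the right: A := B + (C*A).]
[Figure: second parse tree for the same statement. The top <Expr> is <Expr> * <Expr>, with the subtree B + C on the left and A on the right: A := (B+C) * A.]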
Top-Down Parsing
The parse tree is created top to bottom
Top-down parser
Recursive-Descent Parsing
Backtracking is needed (if a choice of a production rule does
not work, we backtrack to try other alternatives)
It is a general parsing technique, but not widely used.
Not efficient.
Recursive-Predictive Parsing
No backtracking
Efficient
Needs a special form of grammars (LL(1) Grammar)
Recursive Predictive Parsing is a special form of Recursive
Descent Parsing without backtracking.
37
Recursive-Descent Parsing (uses
Backtracking)
Backtracking is needed.
It tries to find the left-most derivation.
S → aBc
B → bc | b
input: abc
[Figure: two parse trees for S → aBc. The first expands B → bc, which fails on input abc, so the parser backtracks; the second expands B → b and succeeds.]
Recursive Predictive Parsing
Each non-terminal corresponds to a procedure.

Ex: A → aBb (this is the only production rule for A)

proc A {
- match the current token with a, and move to the next token;
- call B;
- match the current token with b, and move to the next token;
}

Recursive Predictive Parsing
(cont.)
A → aBb | bAB

proc A {
case of the current token {
a: - match the current token with a, and move to the next token;
- call B;
- match the current token with b, and move to the next token;
b: - match the current token with b, and move to the next token;
- call A;
- call B;
}
}



Recursive Predictive Parsing
(cont.)
When to apply ε-productions.

A → aA | bB | ε

If all other productions fail, we should apply an ε-
production. For example, if the current token is not a or b,
we may apply the ε-production.
LL(1) Parser Example
Outputs: S → aBa, B → bB, B → bB, B → ε
Derivation (left-most): S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba
[Figure: parse tree for abba. S has children a, B, a; each B expands to b B until the last B expands to ε.]
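The example above can be reproduced with a recursive predictive parser in which each non-terminal is a procedure. The grammar S → aBa, B → bB | ε is read off the outputs listed in the example; the names `parse` and `ParseError` and the use of `$` as an end marker are assumptions.

```python
class ParseError(Exception):
    pass


def parse(text):
    """Parse text with S -> aBa, B -> bB | eps; return the productions used."""
    pos = 0
    output = []

    def peek():
        return text[pos] if pos < len(text) else "$"

    def match(expected):
        nonlocal pos
        if peek() != expected:
            raise ParseError(
                f"expected {expected!r} at position {pos}, found {peek()!r}")
        pos += 1

    def S():
        output.append("S -> aBa")
        match("a")
        B()
        match("a")

    def B():
        if peek() == "b":            # the current token selects B -> bB
            output.append("B -> bB")
            match("b")
            B()
        else:                        # otherwise apply the eps-production
            output.append("B -> eps")

    S()
    if peek() != "$":
        raise ParseError(f"unexpected input at position {pos}")
    return output


print(parse("abba"))
```

No backtracking is needed because one token of lookahead is enough to choose between B → bB and B → ε.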
Bottom-Up Parsing
A bottom-up parser creates the parse tree of the given input
starting from leaves towards the root.
A bottom-up parser tries to find the right-most derivation of
the given input in the reverse order.
Bottom-up parsing is also known as shift-reduce parsing
because its two main actions are shift and reduce.


Shift-Reduce Parsing -- Example
S → aABb
A → aA | a
B → bB | b

input string:  aaabb
               aaAbb
               aAbb
               aABb    (reduction)
               S

S ⇒ aABb ⇒ aAbb ⇒ aaAbb ⇒ aaabb    (each step is a right-most derivation step, rm)

Right Sentential Forms

How do we know which substring to replace at each reduction step?
Exercise
Consider the grammar
S → aAcBe
A → Ab | b
B → d
and the string abbcde.

Handle
Informally, a handle of a string is a substring
that matches the right side of a production
rule.
But not every substring that matches the right side of a
production rule is a handle.
47
A Shift-Reduce Parser
E → E+T | T
T → T*F | F
F → (E) | id

Right-Most Derivation of id+id*id
E ⇒ E+T ⇒ E+T*F ⇒ E+T*id ⇒ E+F*id
⇒ E+id*id ⇒ T+id*id ⇒ F+id*id ⇒ id+id*id

Right-Most Sentential Form    Reducing Production
id+id*id                      F → id
F+id*id                       T → F
T+id*id                       E → T
E+id*id                       F → id
E+F*id                        T → F
E+T*id                        F → id
E+T*F                         T → T*F
E+T                           E → E+T
E
In each row, the handle is the substring that the reducing production replaces.
Syntax Directed Translation
A syntax directed definition specifies the values of attributes
by associating semantic rules with the grammar productions.

PRODUCTION        SEMANTIC RULES
E → E1 + T        E.code = E1.code || T.code || '+'

A Syntax Directed Definition (SDD) is a context free
grammar together with attributes and rules. Attributes are
associated with grammar symbol and rules are associated
with productions.
48
49
Annotated Parse Tree -- Example
Input: 5+3*4
L
  E.val=17    return
    E.val=5
      T.val=5
        F.val=5
          digit.lexval=5
    +
    T.val=12
      T.val=3
        F.val=3
          digit.lexval=3
      *
      F.val=4
        digit.lexval=4
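The val attributes in this tree can be computed by attaching one semantic rule to each production. The sketch below assumes the usual grammar behind this example (E → E + T | T, T → T * F | F, F → ( E ) | digit), replaces the left recursion with loops so that recursive descent works, and accepts single digits without spaces. The function name `evaluate` is hypothetical.

```python
def evaluate(text):
    """Compute E.val for an expression over single digits, +, * and ( )."""
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def expr():                      # E.val = E1.val + T.val
        nonlocal pos
        val = term()
        while peek() == "+":
            pos += 1
            val += term()
        return val

    def term():                      # T.val = T1.val * F.val
        nonlocal pos
        val = factor()
        while peek() == "*":
            pos += 1
            val *= factor()
        return val

    def factor():                    # F.val = E.val or digit.lexval
        nonlocal pos
        ch = peek()
        if ch == "(":
            pos += 1
            val = expr()
            if peek() != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1
            return val
        if ch.isdigit():
            pos += 1
            return int(ch)
        raise ValueError(f"unexpected input at position {pos}")

    val = expr()
    if pos != len(text):
        raise ValueError(f"unexpected input at position {pos}")
    return val


print(evaluate("5+3*4"))
```

Because `term` is called from inside `expr`, multiplication binds tighter than addition, which is why T.val=12 is computed before E.val=17 in the annotated tree.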
50
CHAPTER-04
SEMANTIC ANALYSIS

1
2
Semantic Analysis
Ensures that the program satisfies a set of rules regarding the usage
of programming constructs (variables, objects, expressions,
statements, etc.).
Lexically and syntactically correct programs may still contain errors.
Lexical and syntax analysis are not powerful enough to ensure the
correct usage of variables, objects, expressions, functions, etc.

3
The semantic Analysis Phase of a Compiler
Connects variable definitions to their uses.
Checks that each expression has a correct type.
Translates the abstract syntax into a simpler representation
suitable for generating machine code.
Adds semantic information to the parse tree and builds the
symbol table.
Example of Semantic Rules
Variables must be defined before being used.
A variable must not be defined multiple times.
In an assignment statement, the variable and the expression
must have the same type.
The test expression of an if-statement must have Boolean
type.
Semantic Analysis can be done in
Syntax analysis phase
Intermediate code generation phase
The final code generation phase.
The purpose of the semantic analysis is to verify static
correctness, that is to detect invalid programs.
Uses of identifiers must be consistent with their declarations.
The right number and types of parameters must be passed to
procedures and operators.
Other language-specific rules. For example, if a Java compiler
cannot prove that a variable is initialized before being used,
then the compiler must reject the program as invalid.
Some things are typically not checked in the compiler's
static analysis:
Division by zero
Out-of-bounds indexes in arrays
NULL pointer being dereferenced
It is usually impossible to prove the absence of these errors
at compile time, because they depend on runtime data.
So most compilers either don't check them at all (C) or
generate code to check them at runtime (Java, ML).
Tools for Semantic Analysis
7
Type Systems:
Describe the types available in the language (int, Boolean, real, String,
arrays, functions etc).
Describe the rules for how types may be combined in various language
constructs (adding two integers results in an integer, comparing two
numbers (int or real) results in a Boolean etc..).
To perform type checking in a procedure language, we need:
Types: int, Boolean, real, string etc
Environments (Symbol table): mapping from identifiers to types.
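A minimal type checker along these lines, with the environment as a dictionary from identifiers to types. The tuple encoding of expressions, the function name `check`, and the rule that mixing int and real yields real are assumptions; the int + int and comparison rules follow the slide.

```python
def check(expr, env):
    """Return the type of expr, or raise TypeError.
    expr is a literal, an identifier name, or a tuple (op, left, right)."""
    if isinstance(expr, bool):       # test bool first: True is also an int
        return "Boolean"
    if isinstance(expr, int):
        return "int"
    if isinstance(expr, float):
        return "real"
    if isinstance(expr, str):        # identifier: look it up
        if expr not in env:
            raise TypeError(f"undeclared identifier {expr!r}")
        return env[expr]
    op, left, right = expr
    lt, rt = check(left, env), check(right, env)
    numeric = {"int", "real"}
    if op in ("+", "*") and lt in numeric and rt in numeric:
        # int op int gives int; any real operand gives real (assumption)
        return "int" if lt == rt == "int" else "real"
    if op in ("<", ">") and lt in numeric and rt in numeric:
        return "Boolean"             # comparing two numbers
    raise TypeError(f"incompatible types for {op!r}: {lt} and {rt}")


env = {"a": "int", "b": "real", "flag": "Boolean"}
print(check(("+", "a", 1), env))
print(check(("<", "a", "b"), env))
```

The two `TypeError` cases correspond to the "undeclared identifier" and "incompatible types for operation" errors listed among the compile-time semantic errors.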
Errors
8
Semantic errors can be detected both at compile time and run
time. The most common errors that can be detected at compile
time are errors of declaration and scope
Examples:
Undeclared identifier
Multiply declared identifier
Index out of bounds
Wrong number or types of arguments in a call
Incompatible types for an operation
Break statement outside a switch or loop
Goto with no label, etc.

9
Attribute Grammar
An attribute grammar is a CFG in which attributes are
attached to the terminal and non-terminal symbols.
An attribute grammar is the formal expression of the syntax-
derived semantic checks associated with a grammar.
It represents the rules of the language that are not explicitly
expressed by the syntax.
In a practical way, it defines the information that will need to
be in the Abstract Syntax Tree (AST) in order to successfully
perform semantic analysis.
This information is stored as attributes of the nodes of the
abstract syntax tree.
The values of those attributes are calculated by semantic rules.
Abstract Syntax Tree (AST)
Abstract syntax trees are a condensed form of parse trees.
In a parse tree, operators and keywords appear as leaves; in an
AST they are associated with the interior nodes that would
be the parents of those leaves in the parse tree.
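Example:
x := a + b;
y := a * b;
while (y > a) {
    a := a + 1;
    x := a + b;
}
[Figure: AST for the program above. The root Program node has one subtree per statement: := with children x and + (a, b); ...; While with children > (y, a) and Block, where Block contains := with children a and + (a, 1), ...]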
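Example: 2 * (4 + 5), parse tree vs. AST
[Figure: the parse tree derives the expression through E, T and F nodes, with the leaves int (2), *, (, int (4), +, int (5) and ). The AST is * with children 2 and + (4, 5).]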
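Abstract Syntax Trees
[Figure: parse tree and abstract syntax tree for 15 * (3 + 4). The parse tree is built from E nodes (E * E, ( E ), E + E); the abstract syntax tree is * with children 15 and + (3, 4).]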
Advantages:
Abstract syntax trees are abstract:
they do not contain all the information in the program,
e.g. spacing, comments, brackets, parentheses, etc.
Any ambiguity has been resolved:
a+b+c produces the same AST as (a+b)+c.
Disadvantages:
An AST has many similar forms (for, while, repeat-until, if, switch).
Expressions in an AST may be complex and deeply nested, e.g.
(42*y) + (z>5?12*z:z+20)
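The claim that parentheses and grouping disappear from an AST can be checked with Python's standard ast module. This is only an illustration using Python's own grammar, not the language of the slides; the variable names are hypothetical.

```python
import ast

# ast.dump omits line and column attributes by default, so two dumps are
# equal exactly when the trees have the same shape and the same nodes.
left_assoc = ast.dump(ast.parse("a + b + c", mode="eval"))
explicit = ast.dump(ast.parse("(a + b) + c", mode="eval"))
right_grouped = ast.dump(ast.parse("a + (b + c)", mode="eval"))

same_tree = left_assoc == explicit            # parentheses left no trace
different_tree = left_assoc != right_grouped  # a different grouping is a different tree
```

Here same_tree and different_tree are both True: the redundant parentheses vanish, while a genuinely different grouping is still visible in the tree structure.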
Control Flow Graph (CFG)
A directed graph where
Each node represents a statement.
Edges represent control flow
Graph representation of computation and control flow
in the program.


CFG Vs AST
CFGs are much simpler than ASTs:
fewer forms, less redundancy, only simple expressions.
But the AST is a more faithful representation of the source:
CFGs introduce temporaries
and lose the block structure of the program.
So with an AST it is
easier to report errors and other messages,
easier to explain things to the programmer,
easier to un-parse to produce readable code.
INTERMEDIATE CODE GENERATION
The intermediate code generator uses the structure produced
by the syntax analyzer to create a stream of small instructions.
Many styles of intermediate code are available.
One common style uses instructions with one operator and a
small number of operands.
An Intermediate Representation (IR) allows the compiler to
perform the translation in smaller steps:
First the AST is translated to the IR.
Then the IR is translated to machine-specific code.
Three Address Code
One popular type of intermediate language is called three
address code. A typical three address code statement is
A := B Op C
where A, B and C are operands and Op is a binary
operator.
[Figure: expression tree built from Expression nodes, with the operators * and / and the operands A, B and C.]
In many compilers the source code is translated into a
language which is intermediate in complexity between a
high level programming language and machine code.
Such a language is therefore called Intermediate Code or
Intermediate Text.
Four kinds of intermediate code often used are
Post Fix Notation
Syntax Trees
Quadruples
Triples
Post Fix Notation:
The ordinary (infix) way of writing the sum of a and b is
with the operator in the middle: a+b.
The post fix notation for the same expression places the
operator at the right end: ab+
E.g. (a+b)*c in post fix notation is ab+c*
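A small sketch of the conversion from infix to postfix, assuming single-character operands, the four binary operators + - * / and parentheses. The function name to_postfix is hypothetical, and the operator-stack method used here is one common way to do the conversion, not necessarily the one intended by the slides.

```python
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def to_postfix(infix):
    """Convert an infix expression with single-character operands to postfix."""
    output, stack = [], []
    for ch in infix.replace(" ", ""):
        if ch.isalnum():                 # operand: goes straight to the output
            output.append(ch)
        elif ch == "(":
            stack.append(ch)
        elif ch == ")":                  # flush operators back to the matching "("
            while stack[-1] != "(":
                output.append(stack.pop())
            stack.pop()
        else:                            # operator: pop those that bind at least as tightly
            while stack and stack[-1] != "(" and PRECEDENCE[stack[-1]] >= PRECEDENCE[ch]:
                output.append(stack.pop())
            stack.append(ch)
    while stack:
        output.append(stack.pop())
    return "".join(output)
```

With this sketch, to_postfix("a+b") gives "ab+" and to_postfix("(a+b)*c") gives "ab+c*", matching the two examples above.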
Parse Tree/Syntax Tree:
A parse tree is a graphical representation of a derivation.
It has the important purpose of making explicit the
hierarchical syntactic structure of sentences that is implied by
the grammar.
Quadruples:
We may use a record structure with four fields, which we
shall call OP, ARG1, ARG2, and RESULT. OP contains an
internal code for the operator.
E.g.
An assignment statement like A := -B*(C+D) would be
translated to three address statements, using the
straightforward algorithm, as follows:
T1 := -B
T2 := C+D
T3 := T1*T2
A := T3
These statements are represented as quadruples as follows:

      OP      ARG1  ARG2  RESULT
(0)   uminus  B     -     T1
(1)   +       C     D     T2
(2)   *       T1    T2    T3
(3)   :=      T3    -     A
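A sketch of how such quadruples could be produced from an expression tree, assuming the tree is given as nested Python tuples like ("*", left, right) or ("uminus", operand). The function gen_quadruples and the tuple layout (OP, ARG1, ARG2, RESULT) are illustrative choices, with None standing for an empty field.

```python
def gen_quadruples(target, expr):
    """Translate `target := expr` into a list of (OP, ARG1, ARG2, RESULT) tuples."""
    quads = []
    counter = 0

    def walk(node):
        nonlocal counter
        if isinstance(node, str):        # a plain name needs no instruction
            return node
        op, *args = node
        operands = [walk(arg) for arg in args]   # operands are translated first
        counter += 1
        temp = f"T{counter}"             # fresh temporary for this operator's value
        arg2 = operands[1] if len(operands) > 1 else None
        quads.append((op, operands[0], arg2, temp))
        return temp

    value = walk(expr)
    quads.append((":=", value, None, target))
    return quads


# A := -B * (C + D)
EXPR = ("*", ("uminus", "B"), ("+", "C", "D"))
QUADS = gen_quadruples("A", EXPR)
```

Each interior node of the tree yields exactly one quadruple, and the final quadruple copies the last temporary into the assignment target.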
Triples:
To avoid entering temporary names into the symbol
table, one can allow the statement computing a
temporary value to represent that value. If we do so
three address statements are representable by a structure
with only three fields OP, ARG1, and ARG2.
These statements are represented as triples as follows:

      OP      ARG1  ARG2
(0)   uminus  B     -
(1)   +       C     D
(2)   *       (0)   (1)
(3)   :=      A     (2)
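The relationship between the two forms can be sketched as a conversion from quadruples to triples. The sketch assumes that every result of a non-assignment quadruple is a compiler temporary, represents a reference to statement (i) by the integer i, and uses None for an empty field; the name quads_to_triples is hypothetical.

```python
def quads_to_triples(quads):
    """Replace temporary names by the index of the triple that computes them."""
    defined_at = {}                      # temporary name -> index of its triple
    triples = []

    def ref(arg):
        # A temporary becomes a statement number; other operands stay as names.
        return defined_at.get(arg, arg)

    for op, arg1, arg2, result in quads:
        if op == ":=":
            triples.append((op, result, ref(arg1)))
        else:
            defined_at[result] = len(triples)
            triples.append((op, ref(arg1), ref(arg2)))
    return triples


QUADS_FOR_A = [
    ("uminus", "B", None, "T1"),
    ("+", "C", "D", "T2"),
    ("*", "T1", "T2", "T3"),
    (":=", "T3", None, "A"),
]
TRIPLES_FOR_A = quads_to_triples(QUADS_FOR_A)
```

The temporaries T1, T2 and T3 no longer appear: the triple for * refers to statements 0 and 1, and the assignment refers to statement 2, so no temporary names need to be entered into the symbol table.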
Directed Acyclic Graphs for Expressions
A DAG has leaves corresponding to atomic operands and
interior nodes corresponding to operators.
The DAG for the expression a + a * (b - c) + (b - c) * d
shares a single node for the common subexpression b - c.
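A sketch of DAG construction, assuming expressions are nested tuples (op, left, right) with strings as leaves. The class DagBuilder is hypothetical; it reuses an existing node whenever the same operator is applied to the same children, which is what turns the tree into a DAG.

```python
class DagBuilder:
    def __init__(self):
        self.nodes = []    # node id -> ("leaf", name) or (op, left_id, right_id)
        self.index = {}    # node description -> node id, used to detect sharing

    def node(self, key):
        """Return the id of the node described by key, creating it only once."""
        if key not in self.index:
            self.index[key] = len(self.nodes)
            self.nodes.append(key)
        return self.index[key]

    def build(self, expr):
        if isinstance(expr, str):
            return self.node(("leaf", expr))
        op, left, right = expr
        return self.node((op, self.build(left), self.build(right)))


# a + a * (b - c) + (b - c) * d
B_MINUS_C = ("-", "b", "c")
EXPR = ("+", ("+", "a", ("*", "a", B_MINUS_C)), ("*", B_MINUS_C, "d"))

dag = DagBuilder()
root = dag.build(EXPR)
```

The corresponding tree has 13 nodes (7 leaves and 6 operators); the DAG built here has 9, because the leaf a and the subexpression b - c are each represented once.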
Abstract Syntax Trees
[Figure: syntax tree for a * (b + c), built through the E.nptr attribute of each E node; the result is * with children a and + (b, c).]
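Abstract Syntax Trees versus DAGs
Representation-1
Graphical Representations.
Consider the assignment a := b * -c + b * -c:
[Figure (AST): assign with children a and +; the + node has two separate * subtrees, each with children b and uminus (c).]
[Figure (DAG): assign with children a and +; both operands of + are the same shared * node, with children b and uminus (c).]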
Abstract Syntax Trees versus DAGs
Representation-2
[Figure: tree and DAG for a := b * -c + b * -c, drawn with := at the root. In the tree the subtree * (b, uminus (c)) appears twice; in the DAG it appears once and is shared.]
CHAPTER-05
SYMBOL TABLE

The various phases of a translator inevitably make use of a
complex data structure called a symbol table.
A compiler uses a symbol table to keep track of scope and
binding information about names.
The job of the symbol table is to store all names of the
program and information about each name.
Either the parser or the lexical analyzer can do the job of
inserting names into the symbol table.

In block structured languages, the symbol table collects
information from declarations and uses that information
whenever a name is used later in the program.

This information could be part of the syntax tree.
If there are different occurrences of the same name, the symbol
table assists in name resolution.
A compiler needs to collect and use information about the
names appearing in the source program.
This information is entered into a data structure (the symbol table).
Each entry in the symbol table is a pair of the form (name, information).
Each time a name is encountered, the symbol table is searched
to see whether the name has been seen previously.
If the name is new, it is entered into the table.
Information about the name is entered into the table during
lexical and syntactic analysis.
Symbol Table Entries:
Simple Variables, Basic Information
Variables (Identifiers)
Character Strings (lexeme)
Data Type
Storage Class
Other access information
Base Address
Arrays
Dimensions
Upper and lower bounds of each dimension
Records and Structures
List of fields
Information about each field
Functions and Procedures
Number and types of parameters
Type of return value
Function pointers ?
The primary issues in symbol table design are
the format of the entries,
the method of access, and
the place where they are stored (primary/secondary storage).
Capabilities of a Symbol Table
Determine whether a given name is in the table
Add a new name to the table
Access the information associated with a given name
Add new information for a given name
Delete a name or a group of names from the table.
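The five capabilities listed above can be sketched as a small class. This is a minimal illustration that assumes a single flat table backed by a Python dictionary; the class SymbolTable, its method names and the attribute names used in the example (type, scope, offset) are all hypothetical.

```python
class SymbolTable:
    def __init__(self):
        self.entries = {}                # name -> dictionary of information

    def contains(self, name):
        """Determine whether a given name is in the table."""
        return name in self.entries

    def add(self, name, **info):
        """Add a new name; a second declaration of the same name is an error."""
        if name in self.entries:
            raise KeyError(f"'{name}' is already declared")
        self.entries[name] = dict(info)

    def lookup(self, name):
        """Access the information associated with a given name."""
        return self.entries[name]

    def update(self, name, **info):
        """Add new information for a name that is already in the table."""
        self.entries[name].update(info)

    def delete(self, *names):
        """Delete a name or a group of names, e.g. when a scope is closed."""
        for name in names:
            self.entries.pop(name, None)


table = SymbolTable()
table.add("total", type="double", scope="block local")
table.add("k", type="int", scope="for-loop statement")
table.update("total", offset=8)          # information found later in compilation
table.delete("k")                        # k is no longer visible after the loop
```

A hash table gives fast access on average; handling nested scopes would need either one table per scope or scope information stored in each entry.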
Example
// Declare an external function
extern double bar(double x);

// Define a public function
double foo(int count)
{
    double sum = 0.0;

    // Sum all the values bar(1) to bar(count)
    for (int i = 1; i <= count; i++)
        sum += bar((double) i);
    return sum;
}

Symbol Name   Type               Scope
bar           function, double   extern
x             double             function parameter
foo           function, double   global
count         int                function parameter
sum           double             block local
i             int                for-loop statement
In a compiler, the names in the symbol table denote objects
of various sorts.
There may be separate tables for variable names, labels,
procedure names, constants, field names and other types of
names, depending on the language.
It is often useful to have more than one table.
Re-using the Symbol Table Space
The identifier used by the programmer to denote a particular
name must be preserved in the symbol table until no further
reference to that identifier can possibly denote the same
name.
However, a compiler can be designed to run in less space if the
space used to store identifiers can be reused in subsequent
passes.
ERROR HANDLING
The error handler is invoked when a flaw in the source
program is detected.
It must warn the programmer by issuing a diagnostic, and
adjust the information being passed from phase to phase so
that each phase can proceed.
It is desirable that compilation be completed on flawed
programs, at least through the syntax analysis phase, so that
as many errors as possible can be detected in one compilation.
Both the table management and the error handling phases interact
with all phases of the compiler.
One of the most important functions of a compiler is the
detection and reporting of errors in the source program.
The error message should allow the programmer to determine
exactly where the errors have occurred.
Errors can be encountered by virtually all of the phases of
the compiler.
Whenever a phase of the compiler discovers an error, it
must report to the error handler, which issues an
appropriate diagnostic message.
Once the error has been noted, the compiler must modify
the input to the phase detecting the error, so that the latter
can continue processing its input, looking for subsequent
errors.
Reporting Errors
A good diagnostic can significantly reduce debugging and
maintenance effort.
Good error diagnostics should possess a number of properties:
The message should pinpoint the errors in terms of the original
source program, rather than in terms of some internal
representation that is totally mysterious to the user.
The error message should be understandable to the user.
The message should be specific and should localize the problem.
The message should not be redundant.
If a variable ZAP is undeclared, that should be said once, not every time ZAP
appears in the program.
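A sketch of a reporter with these properties, assuming the compiler passes the source line and column of each offending use. The class ErrorReporter and its message format are hypothetical; the point is that positions refer to the source program and that an undeclared variable such as ZAP is reported only once.

```python
class ErrorReporter:
    def __init__(self):
        self.messages = []
        self._undeclared_seen = set()    # names already reported as undeclared

    def undeclared(self, name, line, column):
        """Report an undeclared variable at its first use only."""
        if name in self._undeclared_seen:
            return                       # avoid redundant messages
        self._undeclared_seen.add(name)
        self.messages.append(
            f"line {line}, column {column}: undeclared variable '{name}'"
        )


reporter = ErrorReporter()
reporter.undeclared("ZAP", line=3, column=9)
reporter.undeclared("ZAP", line=7, column=1)    # suppressed: already reported
reporter.undeclared("COUNT", line=8, column=5)
```

After these calls reporter.messages holds two entries, one for ZAP at its first use and one for COUNT.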

Sources of Errors
It is difficult to give a precise classification scheme for
programming errors.
One way to classify errors is according to how they are
introduced.
If we look at the entire process of designing and implementing
a program, we see that errors can arise at every stage of the
process.
The design specification for the program may be inconsistent
or faulty.
The algorithm used to meet the design may be inadequate or
incorrect.
The programmer may introduce errors in implementing the
algorithm, either by introducing logical errors or by using the
programming language constructs incorrectly.
Key punching or transcription errors can occur when the
program is typed onto cards or into files.
Finally, although it should not happen, the compiler can
insert errors as it translates the source program into the object
program.
Syntax Errors
Errors made in the syntax, i.e. the general form, of the code or
its sentences.
E.g.
Missing right parenthesis
MIN (A,2*(3+B)
Colon in place of semicolon
i=1:j=2
Misspelled keyword
F:PORCEDURE OPTIONS (MAIN)
Extra blank
/* COMMENT * /
Semantic Errors
Semantic errors can be detected at compile time and at run time.
The most common semantic errors that can be detected at
compile time are errors of declaration and scope.
E.g.
Undeclared identifier
Multiply declared identifier
Type incompatibilities between operators and operands, and
between formal and actual parameters, are another common
source of semantic errors.
Dynamic Errors
Errors detected at run time.
CODE OPTIMIZATION
Optimization of Code
Code optimization is an optional phase designed to
improve the intermediate code so that the ultimate object
program runs faster and/or takes less space.
Its output is another intermediate code program that does
the same job as the original, but saves time and/or space.

Three criteria that we apply to the selection of
optimizing transformations:
Does the optimization capture most of the potential improvement
without an unreasonable amount of effort, either by the compiler
implementer or by the compiler itself?
Does the optimization preserve the meaning of the source program?
Does the optimization, at least on the average, reduce the time or
space taken by the object program?

The Principal Sources of Optimization
Code optimization techniques are applied after syntax
analysis, usually both before and during code generation.
The technique consists of detecting patterns in the program
and replacing these patterns by equivalent but more efficient
constructs.
E.g.
A single assignment statement like A[i+1] = B[i+1] is easier
to understand than a pair of statements like

j= i+1
A[j] = B[j]
Code optimization can be divided into three interrelated areas:
Local optimization (algorithmic optimization)
Loop optimization (inner/outer loop optimization)
Data flow analysis
Flow graphs are usually used to do the optimization.
The nodes of a flow graph are the basic blocks.
One node is distinguished as the initial node.
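A sketch of how a list of three address statements can be partitioned into basic blocks, using the usual notion of leaders: the first statement, every jump target, and every statement that follows a jump. The input format, a list of (text, jump_target) pairs with jump_target set to None for statements that do not jump, and the function name basic_blocks are assumptions made for this illustration. Edges of the flow graph are not computed here.

```python
def basic_blocks(code):
    """Return the basic blocks of code as half-open (start, end) index ranges."""
    if not code:
        return []
    leaders = {0}                        # the first statement is a leader
    for i, (_, target) in enumerate(code):
        if target is not None:
            leaders.add(target)          # the target of a jump is a leader
            if i + 1 < len(code):
                leaders.add(i + 1)       # so is the statement after a jump
    starts = sorted(leaders)
    return list(zip(starts, starts[1:] + [len(code)]))


CODE = [
    ("i := 1", None),
    ("t := i * 4", None),
    ("i := i + 1", None),
    ("if i <= 10 goto (1)", 1),
    ("x := t", None),
]
BLOCKS = basic_blocks(CODE)
```

For CODE this yields three blocks: statement 0 (the initial node), statements 1 to 3 (the loop body) and statement 4.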
[Figure: Intermediate code for a segment of Quick Sort]
[Figure: Basic block partitioning and flow graph construction]
Common Subexpression Elimination

Before           After
a := b + c       a := b + c
b := a - d       b := a - d
c := b + c       c := b + c
d := a - d       d := b
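A sketch of local common subexpression elimination within one basic block. It assumes statements of the form dest := left op right, written as tuples (dest, left, op, right), and emits a copy as (dest, source, None, None). The function name and tuple layout are hypothetical, and commutativity (b + c versus c + b) is deliberately ignored.

```python
def eliminate_common_subexpressions(block):
    """Replace recomputations of an available expression by a copy."""
    available = {}                       # (left, op, right) -> variable holding it
    result = []
    for dest, left, op, right in block:
        key = (left, op, right)
        if key in available:
            result.append((dest, available[key], None, None))
        else:
            result.append((dest, left, op, right))
        # Assigning dest kills expressions that read dest or were held in dest.
        available = {
            k: v for k, v in available.items()
            if dest not in (k[0], k[2]) and v != dest
        }
        # dest now holds the value, unless the assignment changed an operand.
        if key not in available and dest not in (left, right):
            available[key] = dest
    return result


BEFORE = [
    ("a", "b", "+", "c"),
    ("b", "a", "-", "d"),
    ("c", "b", "+", "c"),
    ("d", "a", "-", "d"),
]
AFTER = eliminate_common_subexpressions(BEFORE)
```

Only the last statement changes, to d := b. The two occurrences of b + c are not common subexpressions, because b is reassigned between them, whereas a and d are unchanged between the two occurrences of a - d.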
[Figure: Local optimization, common subexpression elimination]
[Figure: After global common subexpression elimination]
[Figure: Strength reduction]
[Figure: Copy propagation and dead code elimination in blocks 5 and 6]
CODE GENERATION
The final phase, code generation, produces the object code
by deciding on the memory locations for data, selecting code
to access each datum, and selecting the registers in which each
computation is to be done.

Designing a code generator that produces truly efficient object
programs is one of the most difficult parts of the compiler
design, both theoretically and practically.

A careful code generation algorithm can easily produce code
that runs perhaps twice as fast as code produced by an ill-
considered algorithm.
The output of the code generator is an object program.
This may take on a variety of forms.
An absolute machine language program
A relocatable machine language program
An assembly language program
Or perhaps a program in some other programming language.
The following addressing modes will be assumed for code
generation.

No.  Form    Description
1    r       Register mode
2    *r      Indirect register mode
3    X(r)    Indexed mode
4    *X(r)   Indirect indexed mode
5    #X      Immediate mode
6    X       Absolute mode
We shall use the following op-codes.

No.  Op-Code  Description
1    MOV      Move source to destination
2    ADD      Add source to destination
3    SUB      Subtract source from destination

Statements    Code Generated   Register Descriptor   Address Descriptor
                               Registers empty
t := a - b    MOV a,R0         R0 contains t         t in R0
              SUB b,R0
u := a - c    MOV a,R1         R0 contains t         t in R0
              SUB c,R1         R1 contains u         u in R1
v := t + u    ADD R1,R0        R0 contains v         u in R1
                               R1 contains u         v in R0
d := v + u    ADD R1,R0        R0 contains d         d in R0
              MOV R0,d                               d in R0 and memory
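The generated code can be checked with a tiny interpreter for the three op-codes. This sketch assumes only register and absolute operands, treats any name of the form R followed by digits as a register, and uses arbitrary sample values for a, b and c; the function names run and is_register are hypothetical.

```python
def is_register(name):
    return name[0] == "R" and name[1:].isdigit()


def run(program, memory):
    """Execute lines of the form 'OP source,destination'; return the final memory."""
    memory = dict(memory)                # do not modify the caller's dictionary
    registers = {}

    def load(operand):
        return registers[operand] if is_register(operand) else memory[operand]

    def store(operand, value):
        (registers if is_register(operand) else memory)[operand] = value

    for line in program:
        opcode, operands = line.split()
        source, destination = operands.split(",")
        if opcode == "MOV":
            store(destination, load(source))
        elif opcode == "ADD":
            store(destination, load(destination) + load(source))
        elif opcode == "SUB":
            store(destination, load(destination) - load(source))
        else:
            raise ValueError(f"unknown op-code {opcode}")
    return memory


PROGRAM = [
    "MOV a,R0", "SUB b,R0",              # t := a - b
    "MOV a,R1", "SUB c,R1",              # u := a - c
    "ADD R1,R0",                         # v := t + u
    "ADD R1,R0", "MOV R0,d",             # d := v + u
]
FINAL = run(PROGRAM, {"a": 10, "b": 3, "c": 4})
```

With a = 10, b = 3 and c = 4 the source statements give t = 7, u = 6, v = 13 and d = 19, and the interpreted code stores the same value in d.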