
Introduction to Compilers

This chapter will enable the reader to
• understand the need for a compiler
• understand grammar and language theory
• distinguish between the generations of computer languages
• describe the evolution of computer languages
• detail the stages of a compiler

2.1 INTRODUCTION
A compiler (or more generally, translator) is a program that translates a program written
in one language into another. The different stages/phases of a compiler are categorized
as follows:
1. Syntax analysis (scanning and parsing)
2. Semantic analysis (determining what a program should do)
3. Optimization (improving the performance of a program as indicated by some metric, typically execution speed and/or space requirements)
4. Code generation (generation and output of an equivalent program in some target
language, often the instruction set of a CPU)
Syntax analysis or parsing is a process of matching the structure of sentences of the
language in accordance with a given grammar. Here, all the elements of the language constructs are in linear representation, which means that each element in the sentence restricts
the next element. This representation is applicable to both the sentence and the computer
program. Each grammar of a language can generate an infinite number of sentences (in
linear representation). Although a finite-size grammar is simple, it specifies a structure
to generate an infinite number of sentences. Once a grammar is specified, the sentences
generated from this grammar are said to be in the language supported by this grammar.
The input for the syntax analyser is the stream of tokens generated by the scanner. The scanner program reads the source program character by character, forms the tokens, and sends them to the parser. The tokens are specified by a simple structure. For example, a letter followed by zero or more letters or digits defines a Pascal identifier.
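As a small illustration (our sketch, not the book's), this token rule can be checked with a few lines of C:

#include <ctype.h>

/* Returns 1 if s is a valid identifier under the rule quoted above:
   a letter followed by zero or more letters or digits; 0 otherwise. */
int is_identifier(const char *s)
{
    if (!isalpha((unsigned char)*s))       /* must start with a letter */
        return 0;
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))   /* then only letters or digits */
            return 0;
    return 1;
}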


A recognizer or a parser, discussed in Chapter 4, could be developed to recognize the sentences generated by this grammar. This is normally called syntax checking, which checks whether the language used has been written according to the grammar. Following syntax checking, the source language under consideration is represented in an intermediate form to be used by the code generator to generate the final machine code representing the source code. This chapter outlines the process of compilation. Let us first discuss the theory of languages, and then computer languages.
Following the process of parsing, the semantics (meaning) associated with the language is checked to some extent. For example, checking whether integer data can be added to real data is done in the semantic analyser. If the syntax and semantic analysis
phases are successful, an intermediate code (IC) representing the source code will be
generated. Then, the IC will be subjected to an optional optimization process, and it will
be converted into machine code. All the phases involved in the compiler are discussed
briefly in this chapter.
Before designing the compiler, it is important to understand the structure of the language. Language is always characterized by grammar (a set of rules). In order to design
a compiler, the recognizer of the language should be fed with the grammar (set of rules).
A variety of computer languages have evolved; some have survived, but some vanished
in a short period of time. The strength of a computer language depends on its expressive
power. However, there is always a need for a trade-off between the power of the language
and its complexity of implementation.

2.2 THEORY OF COMPUTER LANGUAGES


Language, in general, is used for communication. A language is a set of words coined
according to some rules for the purpose of communication. If the formation and use of
the words follow strict rules, it is called formal language. Otherwise, it is simply a natural
language, that is, the way we normally speak. Words are coined by juxtaposing characters
in the alphabet of a language. For example, a–z constitute the alphabet of the English language. Combinations of these characters, such as Raman, Sita, eat, live, and world, are words of the English language. Languages evolved over time by categorizing words as
nouns, verbs, and so on, and by applying certain guidelines/rules for their formation. The
formation of words and construction of languages (string of words) were based on human
needs, and they evolved over time. Ease of use and expressiveness are some features
expected by the user of a language. Ultimately, the purpose of communication through a
given language is to convey the message in a clear and precise manner.

2.2.1 Natural Languages vs Formal Languages


The term natural language defines a means of communication (for example, English,
French, or Tamil) shared by a group of individuals. Natural languages do not have much
usage restriction. As long as the user is able to communicate without misunderstanding, it
is all right. Natural languages are understood by us because we possess intelligence. Any
kind of incomplete/incorrect (in terms of syntax or semantics) information communicated


between two persons may also be interpreted properly because of a certain context that is
already established between them.
Consider the following sentence.
The cht caught the mouse.
This sentence is clearly understood as
The cat caught the mouse.
Here, a context is already established between the words cat and mouse, and hence,
even when the word cat is misspelled as cht, the human eye is able to interpret the
appropriate meaning of the sentence. This is due to human intelligence. On the other
hand, there is no guarantee that the misspelled sentence will be understood by a different
group of people unless there is a context established among all of them.
Let us see another example of a formal language where a semantic (or meaning) is
associated with the formation of the sentence. Consider the two sentences:
The mouse caught the cat. and
The cat caught the mouse.
Both the sentences are grammatically correct. However, the first sentence is not semantically acceptable. Hence, there is a need for a formal language, which has a different
level of interpretation. The formal language is always associated with the set of rules that
we refer to by the term grammar. Therefore, for every formal language, there is a grammar. In other words, each grammar will generate its own language. The complexity of the
grammar determines the power of each language.
Everyone feels comfortable with his/her own mother tongue owing to the long association with that language. The human system gets trained well with the repeated usage of
the vocabulary and style of the language. People conversing in their mother tongue are
able to communicate well. When communicating with people who speak in other languages, they, in general, do not feel the same level of comfort. This is because the translation process occurs both while speaking and while processing the received information.
We can conclude that people speaking formal languages need to have knowledge of the
language, that is, they must know the set of rules associated with the language and must
be able to apply it during the process of recognition or understanding of the language. At
the same time, recognizing a natural language is also a complex process. This is because,
apart from having knowledge about the language, inference has to be made when the rules
are not followed strictly, which needs an additional processing mechanism. Hence, formal
language needs rules; informal language, in addition to rules, supports deviation from the
usage of the rules, making the recognition process difficult.
Humans need to communicate with computers to use them efficiently. The language
used by the computer is called machine language or machine code. The code that reaches
the processor, which is the main part of the computer, consists of a series of 0s and 1s
known as binary code. The characteristics of human language vary widely. There is a
need for a language that can balance the simplicity of the machine language and the
complexity of the human language. Hence, a programming language is meant for the

Introduction to Compilers

33

user to communicate with the computer. Basically, the computer or the machine works
using a machine code, which is difficult for humans to understand. Hence, languages with
restrictions (enforced by rules or grammar) were developed, which can be understood by
humans with some extra effort. The code written in this type of language is transformed
into machine code so that the processor can process it. Now, let us turn our attention to
the relationship between grammar and language.

2.2.2 Language and Grammar


Language is a collection of sentences. Each sentence consists of words to communicate
the intended message to others. Hence, languages have components at the following three
levels:
1. Symbols or character sets
2. Words
3. Sentences

As seen in Section 2.2.1, there is always a close relationship between language and
grammar. In other words, every language is governed by grammar, and every grammar
produces a language. In primary schools where grammar is taught, we learn the set of rules
by which the sentences of a language are coined. Exercises are given to check whether
the given sentences conform to the grammar. Over time, we get familiar with the ways of
constructing the language obeying the grammar rules. For computer languages, there is a
need for a strict formal grammar by which the sentence formation specification is given.
In a similar manner, computer languages have sentences, and these sentences possess
structure; the sentences consist of words, which we call tokens. Each token, in addition
to carrying a piece of information, contributes to the meaning of the whole sentence. The
tokens cannot be broken down any further. Hence, a grammar is a finite set of rules, which
may generate languages. What is the size of the language one can speak? Of course, we can say there is a finite set of rules in the grammar book of a language. However, this grammar can produce a language of infinite size. How is this possible? The grammar that we
specify consists of rewriting rules or recursive rules.
The set of rewriting rules serves as a basis for formal grammar. Such rewriting systems have a long history among mathematicians, most notably the extensive study made by Chomsky in 1959. He laid the foundation for almost all the formal languages, parsers, and a considerable part of compiler construction. Since formal languages are a
branch of mathematics, it is necessary to introduce the notations for representing the elements involved in forming the languages.
Before introducing the formal notation for the parts of the language and grammar, let
us consider an example:
Murugan eats mango.
In this sentence, the words Murugan, eats, and mango are the subject, verb, and object,
respectively. One English grammar rule for constructing a simple sentence is as follows:
sentence → subject verb object
Some examples of subjects are Raman, Murugan, and Sita. The words eat and buy are
examples of verbs. The words mango and apple are examples of objects. With the given


rule (part of grammar) and various examples/representatives of the subject, verb, and
object, we can construct many sentences as listed here:
Raman eats apple.
Sita buys mango.
The boy throws a ball.
For the third sentence, the grammar and its derivation are given in Fig. 2.1. Here, the
sentence (S) consists of a subject phrase (SP) and a verb phrase (VP). The subject phrase
is in turn split into an article and a noun. The verb phrase consists of a verb and an object.
The other elements in the grammar are self-explanatory.
[Fig. 2.1 Parts of a sentence and grammar elements: the parse tree of 'The boy throws a ball', redrawn in bracket form as [S [SP [Article The] [Noun boy]] [VP [Verb throws] [Object [NP [Article a] [Noun ball]]]]]]

There are various examples for each part of the grammar. Grammar rules can be written
in different forms. With different combinations of this finite set of rewriting rules, theoretically, a language of infinite size can be produced. If we use proper notations, it will be convenient to represent the grammar for a language.

2.2.3 Notations and Conventions


To express the grammar in an unambiguous and consistent manner, notations will be
helpful. Table 2.1 shows the various elements in the language construction process, with
their notations and examples.
Table 2.1 Grammar notations and conventions

1. Symbol, alphabet, or character set: e.g., {a, b, …, z} in English; {0, 1, …, 9} in the numeric language
2. Grammar: G = (VT, VN, R, S)
3. Set of terminals: VT, e.g., VT = {a, b, c, …}; lower-case letters at the beginning of the alphabet
4. Set of non-terminals: VN, e.g., VN = {A, B, C, …}; upper-case letters at the beginning of the alphabet
5. Set of rewriting or production rules: R = {R1, R2, …, Rn}
6. Start symbol (any one of the non-terminals): S, e.g., expr
7. Grammar symbols: W, X, Y, Z; upper-case representations of w, x, y, and z
8. String of terminals: w, x, y, e.g., w = abba
9. String of grammar symbols: α, β, γ, e.g., α = XYZ
10. Empty set: ∅
11. Variable or identifier: id, e.g., in int area, base, height; here area, base, and height are the variables or identifiers
12. Operators (arithmetic/relational/Boolean): op, e.g., +, *, /, |, &, <, =
13. Character constant: cconst, e.g., 'a', 'p', …
14. Numeric constant: nconst, e.g., 10, 20.345, 147, …
15. String constant: sconst, e.g., "hello", "home", …
16. Constant: const, e.g., 'a', 10, "hello"

Using the notations, a generative grammar (G) can be formally defined as G = (VN, VT, R, S) such that
1. VN and VT are finite sets of symbols
2. VN ∩ VT = ∅
3. R is a set of pairs (P, Q) such that
   (a) P ∈ (VN ∪ VT)+
   (b) Q ∈ (VN ∪ VT)*
4. S ∈ VN
R is the rewriting rule, represented in the form P ::= Q or P → Q. We say P produces Q.
Backus normal form  In Backus normal form (BNF), the rewriting rule R is represented in the form
<P> ::= <Q>
We say P produces Q. In the BNF notation, all the non-terminals are enclosed in angular brackets < >. However, throughout our discussion, we will use the grammar representation P → Q.
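As a quick illustration (our example), one and the same rewriting rule reads as follows in the two notations:

<expr> ::= <expr> + <term>     (BNF)
expr → expr + term             (arrow form used in this book)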

2.2.4 Hierarchy of Formal Languages


There is more than one way to write a grammar for generating a given language. To manage the complexity of writing the grammar and to keep the power of the language intact, Chomsky described four levels (0 through 3) of hierarchical grammars and, in turn, of languages. The objective of restricting the unmanageability of phrase-structure grammars while keeping as much of their generative power as possible led to the hierarchy of grammars, as shown here:
1. Type 0: unrestricted or phrase-structured grammar
2. Type 1: context-sensitive grammar (CSG)
3. Type 2: context-free grammar (CFG)
4. Type 3: regular grammar (RG)
Type 0 grammar is unrestricted grammar, which has the form P → Q, where P and Q have no restrictions in their form except that P has to be non-empty. The part before the → is called the left-hand side (LHS); the part after it is called the right-hand side (RHS). The
other types are derived by restricting the form of the rules in this grammar. Each of these
restrictions allows the resulting grammars to be more easily understood, implemented,
and manipulated, but gradually less powerful. However, they are still very useful; in fact,
more useful than even type 0 in terms of compilers. In other words, from type 0 through type 3, in that order, the grammars become easier to understand and to represent in computer terms, but more restrictive in the way the language can be constructed. For example, type 3 grammar is a restricted grammar. However, it can be used to represent the structure of the words, strings, or
tokens in a language.
Type 2 grammar is more powerful than type 3. It is used to represent the structure of
the sentence of the language, which is not possible with type 3 grammar. Type 2 grammar is CFG, where the syntax alone can be specified, not the semantics or the meaning
associated with the language. To quote a few real-world examples, consider the following
two sentences:
Rama eats mango.
Mango eats Rama.
Both sentences will be accepted by the recognizer written based on type 2 grammar.
However, these sentences will be distinguished by the recognizer written based on type
1 or CSG, which has a flexible way of representing the grammar for expressing the language powerfully. Table 2.2 compares these four formal grammars for their power of
expressing the language and their ease of implementation.
These four types of grammar play a role in describing the formal language.


Table 2.2 Comparison of grammar types and the generated languages

Type 0 or unrestricted grammar: the LHS of a production rule can be any string of grammar symbols, and so can the RHS. Implementation is very difficult; the grammar is highly expressive.
Type 1 or CSG: the LHS is a limited combination of terminals and non-terminals; the RHS can be any string of grammar symbols (for example, aA → ab). Implementation is difficult; the grammar has medium expressive power.
Type 2 or CFG: the LHS is only one non-terminal; the RHS can be any string of grammar symbols. Implementation is easy; the power of the grammar is restricted.
Type 3 or regular (restricted) grammar: the LHS is only one non-terminal; the RHS is a very limited combination of terminals and non-terminals (for example, A → Aa | b or A → aA | b). Implementation is very easy; the power of the grammar is very restricted.

2.3 DESIGN OF A LANGUAGE


Natural languages were not designed overnight. They have taken their own time to evolve.
In addition, they have inherited many entities from other languages to make them more
expressive and useful. In terms of computers and compilers, the design of a language
depends on many factors and has evolved over the years, taking some applications into
consideration. Recent language designs stem from the requirement that languages interoperate with many other languages and be recognized on more than one platform. Recent scenarios also dictate the incorporation of software engineering principles in the design and implementation of computer languages. A language is expected to have many important features in terms of both the users and the compilers used.

2.3.1 Features of a Good Language


A programming language is a sequence of strings used to convey the user's message to the computer to execute some sequence of operations/instructions for obtaining a solution to the given problem. It is a collection of programming language constructs such as
keywords, variables, constants, and operators coined according to the grammar of the
language.
From the user's perspective, the following is the list of expectations:
1. Easy to understand
2. Expressive power
3. Interoperability
4. Good turnaround time
5. Portability
6. Automatic error recovery


7. Good error reporting


8. Efficient memory usage
9. Provision of good run-time environment
(a) Support for virtual machine
(b) Support for concurrent operation
(c) Support for unblocked operations
10. Garbage collection
11. Ability to interface with foreign functions
12. Ability to model real-world problems
13. Ability to expose the functions for usage in other languages
This list is not exhaustive. The expectation of the user increases over time with different combinations of these expectations. The question is whether it is possible to meet all
of them. Hence, the language and compiler developer has to keep track of these expectations in the development process.

2.3.2 Representation of Languages


A language is a sequence or string of words or tokens. The structure or the format of the
sequence is dictated by the grammar of the language. A sentence of the programming
language may take different structures. The words in the language may take different
formats. The structure of the language and format of the tokens have to be specified.
The structure of the language has a slightly more complex specification than the format of the tokens. For example, an identifier in a programming language normally begins
with a letter or underscore followed by zero or more occurrences of letters, digits, or
underscores. No special symbols are allowed in an identifier. Hence, there are several
restrictions applied in the formation of the tokens. Are these restrictions dictated by the
language designer or the compiler developer? This question is answered by two factors:
1. The programming language designer is satisfied with this restricted format for certain types of language constructs. That means no elaborate format is required to
represent the tokens such as identifiers, constants, and operators.
2. The compiler developers desire a restricted grammar for easy implementation as far
as possible.
Hence, type 3 grammar is sufficient to define the format of the tokens in a programming language.
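For instance, a plausible type 3 (regular) grammar for the identifier format just described (our illustration, not the book's) is

Id → letter Rest
Id → _ Rest
Rest → letter Rest | digit Rest | _ Rest | ε

where letter and digit stand for any single letter or digit. The same format is captured by the regular expression [A-Za-z_][A-Za-z0-9_]*.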
On the other hand, the restricted grammar is not able to support a structure such as
parenthesis-balanced expressions and nested control statements in forming the sentence
of the language. Hence, a higher level grammar may be considered. It may be type 2, type
1, or type 0. The factors to be considered in the selection of the grammar are influenced by
the ease with which it can be implemented and its capability to support the requirements
of the language designer. Many researchers, compiler developers, and language designers have agreed on type 2 as the grammar for specifying the structure of the language, as a compromise between the power of the grammar and the ease of its implementation.


Following the decision of selecting the grammar, the modules associated with the recognition of these programming language constructs are to be identified. They are scanners and parsers. A scanner scans the source program character by character and identifies the token conforming to type 3 or RG. A parser takes the sequence of tokens as input
and parses them. We will list out the phases of the compiler shortly in this chapter, where
the first phase is the scanner and the second phase is the parser, which is written to recognize the language whose structure is specified by type 2 grammar or CFG.
Type 2 grammar is syntactic grammar that specifies only the syntax or the structure
of the language, not the meaning associated with its representations. Higher level grammars are possible for language specification, but with increased implementation complexity. Compiler developers do not take the risk of implementing the higher level grammars because of the many uncertainties arising in the specification and recognition of the language. Hence, in terms of compilers, only type 3 and type 2 are of interest to the
programming language developer and compiler writer.

2.3.3 Grammar of a Language


As discussed in Section 2.2, the formal definition of a grammar is that it is a collection of
a finite set of terminals, finite set of non-terminals, finite set of rewriting or production
rules, and a start symbol.
A source program consists of zero or more statements. If we use the notation S for a statement and Ss for statements, the program can be conveniently represented as follows:
Ss → Ss S | ε, where ε stands for the empty symbol. S may be any valid single statement.
S → Se | Sd, where Se and Sd are the executable statement and the declarative statement, respectively.
Example 2.1 An arithmetic expression is formed by the operators applied on operands. Operands have the format of factors, terms, and expressions. The following set of
production rules represents the grammar for an assignment statement.
1. Se → id = expr
2. expr → expr + term
3. expr → term
4. term → term * fact
5. term → fact
6. fact → (expr)
7. fact → id
8. fact → const

The set of rules given in Example 2.1 constitutes a grammar for an assignment statement
of the form lvalue = rvalue. Here, the lvalue represents the LHS of an assignment, which
refers to a memory location. The rvalue refers to the RHS of an assignment statement,
which refers to a value: the value of an identifier, a constant, or the evaluation of an expression. In this grammar, Se is the start symbol, which is also a non-terminal. The words expr, term, and fact are the other non-terminals of the grammar. The words id and const are the terminals. The symbols =, +, and * are the assignment, additive, and multiplicative operators, respectively. The parentheses ( and ) give priority to the embedded expression to be parsed first. The | is the alternative grammar specification operator, which is a form of selection operator.
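As a concrete illustration, here is a minimal recursive-descent recognizer in C for the expression part of this grammar (rules 2-8 of Example 2.1). It is our sketch, not the book's implementation: the left-recursive rules expr → expr + term and term → term * fact are realized as loops (a standard transformation discussed in Chapter 4), a single letter stands for id, and a single digit for const.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static const char *p;                        /* current position in the input */

static void expr(void);

static void error(void)
{
    printf("syntax error near '%c'\n", *p);
    exit(1);
}

static void fact(void)                       /* fact -> (expr) | id | const */
{
    if (*p == '(') {
        p++;
        expr();
        if (*p != ')') error();
        p++;
    } else if (isalnum((unsigned char)*p)) { /* a letter is id, a digit is const */
        p++;
    } else {
        error();
    }
}

static void term(void)                       /* term -> fact { * fact } */
{
    fact();
    while (*p == '*') { p++; fact(); }
}

static void expr(void)                       /* expr -> term { + term } */
{
    term();
    while (*p == '+') { p++; term(); }
}

int main(void)
{
    p = "a+b*(c+d)";
    expr();
    puts(*p == '\0' ? "parsed successfully" : "error: trailing input");
    return 0;
}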
In a similar manner, the grammar for any language construct can be specified. A detailed description of all these grammar symbols is given in Chapter 4. In Section 2.4, we discuss how closely grammars and languages are tied to computer systems.

2.4 EVOLUTION OF COMPILERS


The development of the compiler did not happen with a sharp boundary in time. Owing
to the growth of digital electronics, its affordable prices, and its potential use in different
applications, people started using computers aggressively. However, with the limitation of
hardware and its interface with the external world, researchers could not think of developing a compiler till the late 1950s. After assembly language came into use with processor architectures, programmers gradually felt the need for an easier language to deal with digital hardware, which opened the door for compiler development.

2.4.1 History of Compilers


Although programming languages were thought of before the development of compilers, they could not be developed for want of resources. Alonzo Church and Stephen Cole
Kleene developed the lambda calculus in the 1930s, which can be regarded as the world's first programming language. However, it was intended to model computation rather than to be a means for programmers to program a computer system. In the early 1950s, Grace Murray Hopper coined the term compiler; the activity was then called automatic programming.
FORmula TRANslation (FORTRAN), developed by a team of IBM researchers led by John Backus from 1954 to 1957, was the first widely used high-level programming language and was very popular for scientific applications. Following the success of FORTRAN, a committee
was formed to develop a universal computer language. The committee developed an
algorithmic language called ALGOL58. A team led by John McCarthy of Massachusetts
Institute of Technology (MIT) developed the LISt Processing (LISP) programming language. LISP was based on the lambda calculus. Programmers were successful in using
LISP in list processing with a focus on intelligent processing. After 1960, the development of compilers gained momentum.
The important events in the history of programming language theory are as follows:
1. The basic levels of formal languages (and their associated grammars) were proposed by Noam Chomsky in the 1950s. This was later called the Chomsky hierarchy of languages in finite automata theory and in the field of compiler design.
2. Ole-Johan Dahl and Kristen Nygaard developed a language called Simula in the
1960s. It was the first object-oriented programming (OOP) language. In addition, it
introduced the concept of co-routines.
3. The following were the developments in the 1970s:


(a) The object-oriented language Smalltalk was developed by a team of scientists


at Xerox PARC led by Alan Kay.
(b) Sussman and Steele developed the Scheme programming language, a LISP dialect incorporating lexical scoping, a unified namespace, and elements from the Actor model.
(c) Logic programming and Prolog were developed, allowing computer programs
to be expressed as mathematical logic.
(d) In 1977, Backus brought out the limitations of the then-current state of industrial languages in his ACM Turing Award lecture, presenting his new proposal by highlighting the features of function-level programming languages.
(e) In 1980, Robin Milner introduced the Calculus of Communicating Systems (CCS). C.A.R. Hoare brought out the communicating sequential processes (CSP) model. These formed a strong foundation for representing finite state machines for all kinds of process development applications, including concurrent operation.
4. The following were the developments in the 1980s:
(a) Bertrand Meyer created the methodology design by contract and incorporated
it into the Eiffel programming language.
5. The following were the developments in the 1990s:
(a) Gregor Kiczales, Jim Des Rivieres, and Daniel G. Bobrow published the
book titled The Art of the Metaobject Protocol, which deals with LISP and its
extension.
(b) Philip Wadler introduced the use of programming templates for structured programs written in functional programming languages.
In addition to these, many other languages were evolving over time, and some of them
were general-purpose languages. Some languages are meant for specific purposes. For
example, Ada language has some features as listed here:
1. It is structured and statically typed.
2. It is imperative.
3. It is an OOP language.
4. It is extended from Pascal and other languages.
5. During 1977–1983, Jean Ichbiah and his team designed Ada for use in the US Department of Defense.
Critical applications, such as avionics, were well supported by the Ada language, since Ada was believed to be a reliable language. Ada has undergone many revisions; its later versions include Ada 2012.
In a similar manner, the COmmon Business-Oriented Language (COBOL) is purely meant for business applications, focusing more on data, as it was designed for data processing. The standard edition of COBOL was released in 1960. It is still used in many maintenance applications. The recent version, COBOL 2002, supports OOP.


In 1964, John George Kemeny and Thomas Eugene Kurtz at Dartmouth College in New Hampshire, USA, designed a very simple language called Beginner's All-purpose Symbolic Instruction Code (BASIC), which was very popular till the late 1980s.
Popular and general-purpose languages such as C, C++, Java, and C# were introduced
in the following timeline:
1. C was invented in 1972 by Dennis Ritchie at the Bell Telephone Laboratories.
2. In 1979, Bjarne Stroustrup at Bell Labs developed a language that was an enhancement of the C programming language and named it C with Classes; it was later renamed C++ in 1983.
3. In 1995, James Gosling at Sun Microsystems developed a popular language called Java. It is designed for supporting internetworking applications. In 2010, Oracle took over Sun Microsystems.
4. Microsoft developed a language called C# (pronounced 'see sharp') to work under the .NET platform. Similar to Java, C# is designed to compile to a common intermediate language and can be ported to multiple platforms. It has many features; it is object-oriented, component-oriented, and generic.
5. Hypertext markup language (HTML) is a markup language used to present messages on the Internet.
6. Extensible markup language (XML) is a plain ASCII text language used to represent messages to be transported across computer systems working on any platform.

2.4.2 Development of Compilers


In the compiler development process, we first have to identify the requirement specifications. Before doing so, we have to understand the structure of the language and the purpose for which it is being developed. Having studied the anatomy of the language, we need to analyse the various issues associated with the development of the compiler. Typical issues from the perspective of a language developer are as follows:
1. Location of the source code (from keyboard, file, or socket)
2. Types of data supported
(a) Basic data types (Boolean, character, integer, real, etc.)
(b) Qualified data types (short, long, signed, unsigned, etc.)
(c) Derived data types (record, 1D array, 2D array, file, pointer, etc.)
3. Types of constants (Boolean, char, string, etc.)
4. Representation of variables
5. Size of each data supported
6. Scope of variables (static, dynamic)
7. Lifetime of variables (local, global, external)
8. Interface with other compiled codes
9. Error reports that can be produced
10. Decision about the operating environment (Microsoft, Unix variants)


11. Ability to support parallel processing and, if it is supported, the method of separating the units that are to be run in parallel.
The typical issues associated with compilers are as follows:
1. How to read the source code
2. How to represent the source code
3. How to separate the tokens
4. What data structures can be used for storing variable information
5. How to store them in memory (code, stack, or heap areas)
6. How to manage the storage during run-time
7. How to report errors that are linked with multiple lines
8. To what extent semantic checking can be done
9. What IC can be preferred
10. Where to introduce the optimization process
11. Mapping of IC to machine code
12. Interface with host operating system for any parallel processing support
Having gone through the various issues on both sides, that is, those of the language developer and the compiler developer, the various modules have to be identified. Then the function of each module has to be outlined. The interfaces between the modules should be decided so that the development is scalable; the development of a compiler is not a one-time process, and it has to be upgradable. Each module should be assigned to the team whose expertise matches the requirements. For example, members involved in code generation must have a thorough knowledge of the machine architecture. The modules should be divided into submodules so that they are manageable. Each team can then work concurrently, with proper interaction at the appropriate times.
The compiler has evolved very slowly, working with languages ranging from very simple to the more recent sophisticated languages. Initially, the different phases (discussed in Section 2.5) are tested by writing their output to secondary storage and reading it back before proceeding to the next phase.
In addition, the development of each phase is monotonous work, and many tools have been developed and are available in the market (some of them open source) for implementing the phases with ease. Hence, the possibility of exploiting these tools must be studied. Use of such tools enables rapid compiler development.

2.5 STAGES OF COMPILATION


We have seen in Chapter 1 that a compiler is system software that translates the source code into a target code. Figure 2.2 shows the surface-level or user-level view of the compiler.

[Fig. 2.2 User-level view of a compiler: source code → compiler → target code, with error messages reported to the user]

The input is a source code. It can be any traditional programming language such as FORTRAN or COBOL. The target code is an object code for a particular machine architecture. The design of the compiler is broadly divided into two parts: the front end and the back end. The front end of the compiler focuses on analysing the source code. It scans the source code character by character and separates the source code into a list of tokens associated with their attributes. Following the scanning process, it checks the source code for its
structure as described by the grammar of the source code. If it is not successful, it reports
the error to the user and terminates the compilation process. Otherwise, it produces an IC.
It is called an IC because the back end of the compiler makes use of this code for further
processing.
The back end of the compiler takes the IC as an input and produces the machine code or
translates it into any other target code as specified by the designer. Generally, the front end
of the compiler is called the analysis part of the compiler, and the back end of the compiler
is called the synthesis part of the compiler. The structure of the compiler is shown in Fig. 2.3.
These two parts, the front and back ends of the compiler, are further divided into different phases as shown in Fig. 2.4.
Each stage has its own significance and techniques associated with it. The functions of
each phase are outlined in the following sections.
[Fig. 2.3 Structure of a compiler: source code → compiler front end → IC → compiler back end (loop optimization, register allocation, code generation, code scheduling) → machine code]

[Fig. 2.4 Stages of compiler design: source program → lexical analyser (scanner) → tokens → syntax analyser (parser) → parse tree → semantic analyser → abstract syntax tree → intermediate code (IC) generator → non-optimized IC → IC optimizer → optimized IC → target code generator → target machine code]

2.5.1 Lexical Analysis


A lexical analyser or scanner is the first phase of the compiler. As the name implies,
the scanner scans the source code character by character delimited by some white-space
characters, operators, and punctuators, and separates the tokens.
For example, consider the source code segment D = A + B * C. As programmers, we
know that A, B, C, and D are variables or identifiers and =, +, and * are the arithmetic
operators. The functions of the scanner are to scan through this programming statement
and separate the tokens. Here, the delimiters are the operators. The output of the scanner
program is shown in Table 2.3.
Table 2.3 Output of the scanner program

Token no. | Token type | Token or lexeme value
1 | id | D
2 | op | =
3 | id | A
4 | op | +
5 | id | B
6 | op | *
7 | id | C

Note: id stands for the identifiers or variables; op stands for the operators in the source code.

Example 2.2 Consider the statement Area = 1/2 * base * height. The scanner program in this example uses the white-space character and operator as delimiters and separates the tokens as follows:
id:     Area, base, height
op:     =, /, *
nconst: 1, 2
Note that during the scanning process, the scanning program separates the tokens, assigns
each token its type, and stores its lexeme value in a place holder. Hence, a token in a
source code is represented as a pair <type, lexeme value>.
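In C, such a pair might be modelled as follows (an illustrative sketch; the type and field names are ours):

/* a token as a <type, lexeme value> pair */
typedef enum { TOK_ID, TOK_OP, TOK_NCONST } TokenType;

typedef struct {
    TokenType type;        /* token class, e.g., TOK_ID         */
    char      lexeme[32];  /* lexeme value, e.g., "Area" or "/" */
} Token;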
Example 2.3 Consider the source code statement given in Example 2.2 and present the output as a sequence of tokens in a suitable format.
As done in Example 2.2, the scanner program uses the white space and operator as delimiting characters and produces the token sequence as follows:
<id, Area> <op, => <nconst, 1> <op, /> <nconst, 2> <op, *> <id, base> <op, *> <id, height>
Will there be so many token types in the source code that they become unmanageable? Let us analyse a source program and its types of tokens. Any programming language will have a limited number of keywords. These keywords can be given type numbers in the order (1, ..., n), where n is the number of keywords. Beyond these, there are only a few other token types, such as identifiers, constants, operators, and punctuators. They will be represented by the token type and the token's lexeme value.


If a scanner program is implemented in the C language, one may use the following strategy for assigning the token numbers (the names are upper-cased here so that they do not clash with the C keywords char, int, and float):

#define CHAR   1
#define INT    2
#define FLOAT  3
...
...
#define ID     28
#define SCONST 29   /* the original lists "op" twice; another token class,
                       such as the string constant, seems intended here */
#define NCONST 30
#define OP     31
...
...

Example 2.4 Consider the following program segment:

int A, B;
float C;
C = A * B;

The token output sequence, annotated with the token representations, is as follows:

<2, > <28, "A"> <28, "B">
<3, > <28, "C">
<28, "C"> <31, => <28, "A"> <31, *> <28, "B">

In an actual implementation, each operator will also be assigned numerical values. The lexeme value will be replaced with the location of the identifier. During the parsing process, each token with its attributes is passed to the parser, and the necessary routines will be called to store the identifier in the symbol table. The symbol table is implemented with a suitable data structure to hold information about the identifiers in the source code.
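A minimal sketch of such a data structure is given below (our illustration, not the book's design; production compilers usually prefer a hash table for fast lookup):

#include <string.h>

#define MAXSYMS 256

struct Symbol {
    char name[32];   /* identifier lexeme          */
    int  type;       /* type code, e.g., 2 for int */
    int  location;   /* assigned memory location   */
};

static struct Symbol symtab[MAXSYMS];
static int nsyms = 0;

/* return the index of name in the table, inserting it if absent;
   bounds checking is omitted for brevity */
int sym_lookup(const char *name, int type)
{
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return i;
    strcpy(symtab[nsyms].name, name);
    symtab[nsyms].type = type;
    symtab[nsyms].location = nsyms;  /* simplistic location assignment */
    return nsyms++;
}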
What is the format for the tokens? How are they specified? They are specified by regular expressions, RG, or type 3 grammar, which will be dealt with in detail in Chapter 3.
The separated tokens along with their attributes will be passed to the next phase of the
compiler called parser or the syntax analyser.

2.5.2 Syntactic Analysis


The syntax analyser, also called parser, is responsible for reading the sentence of the
language and checking for its structure as specified by the grammar. There are various
structures supported in a language. They are broadly classified into executable and nonexecutable statements. Executable statements are statements that will be executed by the
processor during run-time.
For example, the statement C = A + B is an executable one, where the values of the
variables A and B are added and assigned to the variable C during run-time. On the other
hand, consider a statement int A, B, C in a C-like language; it is not executed during
run-time. Instead, the variables are processed during compile time, and the information
about these variables is stored in a symbol table to be referred to during run-time. These
statements are called declarative statements.


There are different types of statements in these two categories:


1. Declarative statement
2. Assignment statement (Sa)
(a) lvalue = expr
3. Control statement
(a) Selective statement
(i) if statement (Sif)
(ii) if-then-else statement (Sie)
(iii) switch case (Ssc)
(b) Iterative statement
(i) for statement (Sfor)
(ii) while statement (Swhile)
(iii) repeat while or do-while statement (Sdw)
(c) goto statement (Sgo)
4. IO statement (Sio)
St represents the statement of type t, such as if, if-else, while, or do-while.
Each statement has its own syntax and function. For example, the syntax for the if-else
statement is
if expr statement1 else statement2
Again, the statement can be any one of these two categories (declarative or executable). How do we write the grammar? As we have seen in the beginning of this chapter,
a grammar is represented by four tuples with a finite set of terminals, finite set of nonterminals, finite set of production rules, and a start symbol. Each statement is specified by
one or more production rules.
Example 2.5 Using the notation Se for the executable statement in general, we can write the syntax for the various types of executable statements:
Se → Sa | Sif | Sie | Ssc | Sfor | Swhile | Sdw | Sgo
The syntax for each statement is
Sa → id = expr
Sif → if expr Se
Sie → if expr Se else Se
Ssc → switch expr Scase
Sfor → for exprinit exprcheck exprincrdecr Se
Swhile → while expr Se
Sdw → do Se while expr
Sgo → goto L
Scase → Scase case expr Se | ε


where init refers to the initial condition of the expression, check refers to the condition
checking for the continuation/termination of the loop, and incrdecr refers to the increment/decrement of the expression used to check the bounding condition of the loop.
All statements and their production rules or the rewriting rules are self-explanatory. The
typical production rules for the arithmetic expression are given in Example 2.1.
Among the four formal grammars (type 0 through type 3), type 3 grammar is suitable
for specifying the structure of the token since it is simple and powerful enough to represent the token. However, it is not sufficient to specify the structure of the sentence. Hence,
it is preferred to use type 2 or CFG. Should we use type 1 or type 0 grammar? Of course,
we can use type 1 or type 0 grammar without loss of precision. However, it is relatively
more difficult and complex to write the recognizer for types 1 or 0 than type 2 grammars.
In addition, type 0 and type 1 grammars are not required to specify the language structure
at the syntactic level. Hence, in terms of compilers, we are restricted to use only type 3
and type 2 grammars.
Recently, attempts are being made to develop recognizers for languages described by
type 1 grammar. So far, we have focused on specifying the grammar for the language. Let
us turn our attention to developing the recognizer or parser for this language.
The objective of the parser is to scan through the source code by calling the scanner
routine and checking whether the sequence of tokens is in the order specified by the grammar. If it is successful, it produces the IC; else, it reports an error.
By notation, a CFG rule is denoted by
A → α
where α stands for a string of grammar symbols, each of which is either a terminal or a non-terminal.
For example, the if-then-else statement is denoted by
Sie → if expr Se else Se
Mapping the rule in terms of terminals (T) and non-terminals (N), we get
N → T N N T N
Representing each terminal or non-terminal (grammar symbol) as Xi, we get
N → X1 X2 X3 X4 X5, where Xi stands for the grammar symbol representing either a terminal or a non-terminal.
Using the notation A → α, where A stands for a non-terminal (N) and α stands for the string of grammar symbols X1 X2 X3 X4 X5, any CFG rule can be presented in this form.
The parser is broadly classified into two categories:
1. Top-down parser

2. Bottom-up parser


If a given language conforms to the grammar of the language or if the grammar is able
to produce the given language, we call it successful parsing.
The top-down parser expands the start symbol of the grammar and consecutively all
the non-terminals in the RHS of the rule. At every point of expansion or derivation of the
production, if the final string obtained matches with the given source code, the parsing is
said to be complete.
Example 2.6 Consider the statement if (a < 10) c = a + b else c = a - b for parsing.
In this example, there is one Boolean expression and two assignment statements. Both assignment statements are executable statements denoted by Se. In addition, the Boolean expression is derivable from the rule for expressions. In Example 2.1, we had the rules for arithmetic expressions, which can be extended to Boolean expressions as well. The derivation steps are
Sie → if expr Se else Se ⇒ … ⇒ if (a < 10) c = a + b else c = a - b
This follows the pattern of the derivation A → X1 X2 X3 X4 X5.
In Example 2.6, we have started from the non-terminal Sie, which is also the start symbol for the if-else statement. At every production step, a non-terminal is replaced by one of the RHS values of the production rules. Finally, the string that we get is the sentence of the language. In this derivation process, the last string is if (a < 10) c = a + b else c = a - b. All the other intermediate steps are called sentential forms of the language. In other words, the sentential form with only terminals is called the sentence of the language.
So far, we have seen how a given sentence can be parsed using the top-down parser.
Alternatively, we can use the bottom-up parser. As the name implies, we can consider the
given sentence of the language, scan left to right, and identify the appropriate RHS of any
one of the given grammar rules (we call it a handle) and replace it by the LHS of that rule.
Again we find the handle and replace it with the non-terminal in the LHS of the grammar
rule. This process is repeated until we get the start symbol of the grammar.
However, identifying the exact handle is the central issue of the bottom-up parser. The process of identifying the exact handle will be discussed in detail in Chapter 4 on parsing techniques. Consider the grammar for the arithmetic expression:
1. expr → expr + term
2. expr → term
3. term → term * fact
4. term → fact
5. fact → (expr)
6. fact → id
7. fact → const

In this grammar, the appropriate handles are all the RHS items. In general, a handle is nothing but the RHS of a rule of the form A → α. Here, α is the string of grammar symbols that will be found in the sentential form in the parsing process and will be replaced by A.
Example 2.7 Consider the expression A + B. Here, both A and B are variables or
the identifiers for some memory locations. We denote them by id. Hence, it is internally
represented by id + id.


If we scan this sentence from left to right, we find that id is the RHS of rule 6. Hence, it is replaced by fact. The reduction steps are
id + id ⇒ fact + id ⇒ term + id ⇒ expr + id
⇒ expr + fact ⇒ expr + term
⇒ expr
This is the start symbol for the expression.
We have taken the simplest example for parsing. In a real programming environment, there are varieties of expressions, and different combinations of operands are possible.
Example 2.8 Consider an arithmetic expression A + B * C. As in the previous example, it is internally represented as id + id * id. The parsing process is
id + id * id ⇒ fact + id * id ⇒ term + id * id ⇒ expr + id * id
⇒ expr + fact * id ⇒ expr + term * id ⇒ expr + term * fact
⇒ expr + term ⇒ expr
This is the start symbol for the expression.
In Examples 2.7 and 2.8, for the given language, bottom-up parsing is successful.
Let us write down the rules for the arithmetic expression in a slightly different but still valid form as follows:
1. expr → expr + expr
2. expr → expr * expr
3. expr → (expr)
4. expr → id
Let us rework the example given in Example 2.7.
A + B is mapped to id + id, and the bottom-up parsing is
id + id ⇒ expr + id ⇒ expr + expr ⇒ expr, which is the start symbol.
For the example A + B * C, it is mapped to id + id * id, and the parsing is
id + id * id ⇒ expr + id * id ⇒ expr + expr * id ⇒ expr * id
⇒ expr * expr ⇒ expr, which is the start symbol.
Here, the process of parsing is successful. However, the meaning associated with the semantic actions of this parsing is not preserved. In evaluating an arithmetic expression, we give precedence to an expression of the form E * E over one of the form E + E. In this sample, it has happened the other way: expr + expr has been parsed first, before expr * expr. One solution is to write down the grammar rules in a different order. However, this is not practical in a design environment involving a larger number of rules. Therefore, this kind of grammar (ambiguous grammar) must be modified. Elaborate discussions on using ambiguous grammars are presented in Chapter 4.


2.5.3 Semantic Analysis


The CFG discussed in Section 2.5.2 is able to check the structure of the language. It does not take care of which operands are worked on by which operators, or how. How does the system operate with two operands of different types? One of the issues is the size of the operands. For example, consider the following code segment:

int i, j;
short int si, sj;

i = 10;
j = BIGINTEGER;
si = 10;   // Case 1: No loss of data
sj = j;    // Case 2: Loss of data

As operands of larger-sized data (e.g., 4 bytes) are assigned to smaller-sized data (e.g., 2 bytes), there are chances of loss of data, depending on the value of the data. However, the burden of checking the values of each data item cannot be put on the programmer. Hence, a good compiler is expected to analyse this kind of statement and report errors.
Consider a situation where the arithmetic expression 5/2 is to be evaluated. It gives the answer 2, since both operands are integers; integer division truncates the remainder. However, in many situations, the programmer is interested in working with mixed data types. For example, consider 5.0/2. Is it 2.5 or 2? It depends on how the compiler works. Most machine architectures support operations only on operands of the same kind of data. If this is not the case, the compiler has to check for the type compatibility of the operands.
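The difference can be observed directly in C:

#include <stdio.h>

int main(void)
{
    printf("%d\n", 5 / 2);    /* prints 2: both operands are int,
                                 so the remainder is truncated */
    printf("%f\n", 5.0 / 2);  /* prints 2.500000: the int operand
                                 is implicitly promoted */
    return 0;
}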
Consider the example of operating on numeric elements such that one operand is of the integer type and the other is of the real type. Following the type compatibility checking, size compatibility checking has to be done. In such cases, it is preferred by the programmer and the language designer that both data types be unified to the higher-level type. If one datum is an integer and the other real, then the integer data will be converted into real data (which has the larger data size). This process is called data promotion. It eliminates the loss of data.
Example 2.9 For the following code segment, data conversion takes place to promote the lower-sized data to the equivalent compatible higher-sized data for the operation.

int a, b;
float fa;

a = 10;
fa = 12.5;
b = a + fa;

There are three executable and two declarative statements in this example. Consider the third assignment statement. Here, integer data is added to real data. In the bottom-up parsing process, the compiler has to reduce the sentence of the language (the assignment statement) to the start symbol of the assignment statement. Since the expression on the RHS of the rule has mixed operands, type conversion has to be carried out. Let us use the suffix notations i and f to denote integer and real (float), respectively.


b = a + fa is mapped to idi = idi + idf; using the rules given in Example 2.1,
idi = idi + idf ⇒ idi = facti + idf ⇒ idi = facti + factf
⇒ idi = termi + factf ⇒ idi = expri + factf
⇒ idi = expri + termf ⇒ idi = (convertInt2Float) expri + termf   // there is no error till this step
⇒ idi = exprf   // this step shows the error message
The expression with the real value attribute on the RHS is to be assigned to the identifier with the integer value attribute. Here, either an error or a warning message has to be displayed. However, if a type cast operator is provided in the assignment statement of the previous example, the warning or error message need not be shown, and it will be successfully parsed.
Example 2.10 If the third assignment statement in Example 2.9 is modified as b = (int) (a + fa); then the process will be as follows:
idi = idi + idf ⇒ idi = facti + idf ⇒ idi = facti + factf
⇒ idi = termi + factf ⇒ idi = expri + factf
⇒ idi = expri + termf ⇒ idi = (convertInt2Float) expri + termf
⇒ idi = (convertFloat2Int) exprf
In Example 2.10, the function convertInt2Float promotes the data from integer to float type. The function convertFloat2Int reduces the data from float to integer type. From a compiler perspective, the first kind of conversion is called implicit type conversion and the second kind is called explicit type conversion.
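The promotion rule described here can be sketched as a small semantic-analysis helper (an illustration under assumed names, not the book's API):

/* compute the result type of a binary operation, promoting
   int to float whenever the operand types are mixed */
typedef enum { T_INT, T_FLOAT } Type;

Type binary_result_type(Type lhs, Type rhs)
{
    if (lhs == T_FLOAT || rhs == T_FLOAT)
        return T_FLOAT;   /* implicit conversion: convertInt2Float */
    return T_INT;         /* both int: no conversion needed */
}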
Since we are using CFG, a separate phase performing semantic analysis is required. If we used CSG, these issues would be handled as part of the parsing process. However, implementing a language supported by CSG is cumbersome, and many programming languages can be restricted so as to be supported by CFG. The semantic analyser thus becomes a supplementary module of the compiler. In recent languages, even the explicit type conversion routines show warnings to alert the user. Therefore, writing a compiler is not only a technique but also an art of studying the various conveniences of programmers and assisting them.
Similarly, there are many semantic issues in the design of a compiler. They are dealt
with in detail in Chapter 6.

2.5.4 Intermediate Code Generation


As soon as the syntax and semantic analysis is carried out successfully, the source code is
represented in a form applicable to the next phases. We have already discussed that lexical
analysis, syntax analysis, and semantic analysis are parts of the front end of the compiler.
The front end of the compiler transforms the source code into an IC. Why is the compiler
designed to generate IC and not directly machine code?
There are several reasons for generating IC before creating machine code.


Modular Design
Writing the compiler in monolithic form leaves no flexibility for making changes. With an IC, in contrast, the front and back ends of the compiler can be isolated. Here, the analysis part generates the IC, and the synthesis part works only on the IC; it has no reference to the source code. At the same time, the front end does not depend on the back end, that is, on the target machine.
Refer to Figs 2.5 and 2.6. In the first case, that is, without using the IC, each source code is mapped to each target code. If the number of source languages is m and the number of target languages is n, then the complexity of the compiler development is m * n. For a large number of source languages and machine languages, this complexity is very hard to manage. On the other hand, if IC is used, the number of mappings from m source languages to n target languages is only m + n. For example, with m = 5 source languages and n = 4 targets, 20 separate translators are needed without an IC, but only 5 front ends and 4 back ends with one.
Keeping these factors in mind, Sun Microsystems introduced the concept of byte code, which has a standard format, when releasing the Java language, which is now popular worldwide. Byte code is similar to an IC, with the standard supported on almost all platforms. To convert byte code to machine code, a special piece of software called the Java virtual machine (JVM) is available for each target machine. For these reasons, byte code developed using the Java language on one platform can be ported to any other platform without any difficulty.

[Figure: source code 1 ... source code m, each translated directly into machine code 1 ... machine code n]

Fig. 2.5 Compilations without IC


[Figure: source code 1 ... source code m are translated into a common intermediate code (IC), which is then translated into machine code 1 ... machine code n]

Fig. 2.6 Compilations with IC


Following Sun's Java, Microsoft introduced a language called C# (C Sharp)
to work in Internet-like environments on the .NET platform. .NET provides the common
language run-time (CLR) environment, which is shared by different programming languages.
Advantages IC with modified features has played a major role in the Internet era. It
supports the following features:
1. Consistent and continuous programming model
2. Develop once and run anywhere
3. Simplified deployment
4. Wide platform reach
5. Programming language integration
6. Simplified code reuse
7. Interoperability
What could be the format by which IC can be generated? Many forms of IC have been
used over the years. A few of them are as follows:
1. Syntax trees
2. Three-address code
(a) Quadruple
(b) Triple
(c) Indirect triple
3. Any valid and usable IC
The syntax tree represents the source code in a tree-like structure.
For example, consider an arithmetic expression A + B * C that is mapped to id + id *
id and is represented in the form of a tree.
The syntax tree is the tree representation of the source code after syntax checking
is completed. Figure 2.7 shows the syntax tree of the expression. This syntax tree is also
called an expression tree, labelled tree, or operator tree. Here, the interior nodes are operators
and the leaf nodes are operands; in addition, the tree is a binary tree. If we traverse
the tree in different orders, we get different expressions as follows:
[Figure: binary tree with + at the root; its left child is id and its right child is *, whose two children are id and id]

Fig. 2.7 Syntax tree


Preorder traversal gives the prefix expression: + A * B C
Inorder traversal gives the infix expression: A + B * C
Postorder traversal gives the postfix expression: A B C * +
If we start processing the postfix expression we will get the three-address code as follows:
t1 = B * C
t2 = A + t1
Why is it called three-address code? Two addresses store the two operands on the RHS,
and the third address stores the result. The details of the various three-address codes are
covered in Chapter 6.
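To make the translation from postfix form to three-address code concrete, here is a minimal sketch in C. The stack size, the temporary-naming scheme, and the restriction to single-letter operands are illustrative assumptions, not details from the text.

#include <stdio.h>
#include <ctype.h>

static char stack[32][8];   /* stack of operand names */
static int  top = -1;

int main(void)
{
    const char *postfix = "ABC*+";   /* postfix form of A + B * C */
    int tempCount = 0;

    for (const char *p = postfix; *p != '\0'; p++) {
        if (isalpha((unsigned char)*p)) {
            /* Operand: push its name. */
            sprintf(stack[++top], "%c", *p);
        } else {
            /* Operator: pop two operands, emit one three-address
             * instruction, and push the temporary holding the result. */
            char rhs[8], lhs[8], temp[8];
            sprintf(rhs, "%s", stack[top--]);
            sprintf(lhs, "%s", stack[top--]);
            sprintf(temp, "t%d", ++tempCount);
            printf("%s = %s %c %s\n", temp, lhs, *p, rhs);
            sprintf(stack[++top], "%s", temp);
        }
    }
    return 0;
}

Running this on ABC*+ prints exactly the two instructions shown above.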

2.5.5 Code Optimization


Optimization is the process of improving code with respect to some objective function;
from the compiler's point of view, the objective is to reduce the time and/or space
requirements of the program. (Generating truly optimal code is intractable in general,
so compilers apply systematic improvement techniques instead.) Recent compilers
provide optimization options. During development of the application code, optimization is
normally disabled; this is called debug mode. For product delivery, the code is built in
release mode after being subjected to many optimization processes. The internal
architecture varies from one processor to another, and the compiler developer working on
the front end need not be concerned with it. Some processors handle certain types of code
much better than others. In addition, some processors have special performance-enhancing
features, but in order to use them the code must be arranged in a compatible way. For
example, some processors are designed to work well with strings, some with numbers.
For the same operations, two processors may support different types of instructions, with
different timing and memory requirements. Programs that take these issues into account
can be structured to get more performance than those that do not.
In addition, users work on different platforms with complex business requirements,
and compilers must be able to handle all of this to remain usable. Optimization can be
done at various levels; even programmers can take care of some of these issues while
developing the source code.
Example 2.11 Consider the following statements in the for loop.

for (i = 0; i < N * M; i++)

// Some statements

If the values of N and M are large, such as 1000 and 2000, respectively, the multiplication
in the for-condition check will be performed 1000 * 2000 times. Let the
code be rewritten as

runLength = N * M;

for (i = 0; i < runLength; i++)


// Some statements

Now the gain in computational time is (N * M - 1) multiplications.


From the software development point of view, the user cannot concentrate on
all these issues, since he or she focuses mainly on the business logic. Hence, a good compiler
is expected to identify the scope for optimization and achieve it. The main areas of
optimization are the following:
1. Source code

2. Intermediate code

3. Machine code

Optimization of the source code does not come under the purview of the compiler;
the compiler can optimize either the IC or the machine code. Working with
machine code is cumbersome, since it is in binary form. In addition, any
non-optimized code left in the early stages of compilation accumulates in the later stages
of the compiler. Hence, it is preferable to optimize the code at the IC level,
for the following reasons:
1. Understanding IC is easier than understanding machine code.
2. Existing optimization techniques can be deployed more directly on IC than on machine
code.
Optimization can be done at the code level and the loop level. Code-level optimization
takes care of the following (a small illustrative sketch follows this list):
1. Reduction in cost by using appropriate equivalent operations
2. Elimination of repeated calculations
3. Elimination of common sub-expression evaluations
4. Identification and removal of unreachable code
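The following fragment (a hypothetical example, not from the text) illustrates items 2 and 3: before optimization the sub-expression (b + c) is evaluated twice; after common sub-expression elimination it is computed once and reused.

#include <stdio.h>

int main(void)
{
    int b = 4, c = 6, d = 3;

    /* Before optimization: (b + c) is evaluated twice. */
    int x = (b + c) * 2;
    int y = (b + c) - d;

    /* After common sub-expression elimination: computed once. */
    int t  = b + c;
    int x2 = t * 2;
    int y2 = t - d;

    printf("%d %d %d %d\n", x, y, x2, y2);   /* prints 20 7 20 7 */
    return 0;
}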
Loop optimization focuses on the control of execution of the statements inside
a loop. Statistics from programming practice suggest that real-world programs spend
about 80 per cent of their execution time inside loops, so focusing on loop optimization
can improve execution efficiency considerably. Optimization is dealt with in detail in
Chapter 7. After the optimization process, code generation can begin.

2.5.6 Code Generation


The final phase of the compiler is code generation. It can be viewed in two ways. On one
hand, generating machine code from source code looks very complex. On the other hand,
if a standard format for the IC is defined and templates are made available for the
translation, code generation is simply a mapping of the IC to machine code. The requirement
here is familiarity with the complete instruction set of the processor, in order to decide
which code is efficient for a given IC. As discussed in Chapter 6, every construct of a
programming language can be brought into a few IC categories, and these can be converted
into machine language using the instructions meant for the purpose.


Example 2.12 Consider an arithmetic expression D = A + B * C. It is mapped internally as id = id + id * id. The ICs generated are the following:
1. T1 = B * C //Parsed IC was generated
2. T2 = A + T1 //Making use of the previous value T1 and A
3. D = T2 //Assignment of the computed value to the result variable D
If we translate each IC into machine code directly, we get the following machine code:

T1 = B * C
    MOV R1, B
    MOV R2, C
    MUL R1, R2
    MOV T1, R1

T2 = A + T1
    MOV R1, A
    MOV R2, T1
    ADD R1, R2
    MOV T2, R1

D = T2
    MOV R1, T2
    MOV D, R1
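A template-driven back end of this kind can be sketched in a few lines of C. The emit routine and the two-register convention below are illustrative assumptions, not the book's code; the sketch only shows how each three-address instruction of the form result = left op right maps to a fixed machine-code template.

#include <stdio.h>

/* Emit the machine-code template for one three-address instruction. */
void emit(const char *result, const char *left, char op, const char *right)
{
    printf("MOV R1, %s\n", left);
    printf("MOV R2, %s\n", right);
    printf("%s R1, R2\n", op == '*' ? "MUL" : "ADD");
    printf("MOV %s, R1\n", result);
}

int main(void)
{
    /* The three-address code of Example 2.12. */
    emit("T1", "B", '*', "C");          /* T1 = B * C  */
    emit("T2", "A", '+', "T1");         /* T2 = A + T1 */
    printf("MOV R1, T2\nMOV D, R1\n");  /* D = T2      */
    return 0;
}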

Code generation is influenced by many factors as listed here.


1. Register allocation
2. Register scheduling
3. Code selection
4. Addressing modes

5. Instruction format
6. Power of instructions
7. Optimization at the machine code level
8. Back patching

The developer concentrating on code generation should be an expert in handling all
these issues. Chapter 8 discusses code generation in detail.

2.5.7 Symbol Table Management


The purpose of a programming language is to get some work done by the computer.
Initially, compilers focused only on the evaluation of arithmetic expressions involving
operators and operands. The operands are of different types and sizes, and they need
different amounts of storage. The symbols, or variables, are the core of an expression,
and these identifiers are required at run-time. The questions to be
answered are as follows:
1. Where are they stored?
2. When are they stored?
3. How are they accessed?

4. What is their format?


5. What data structure can be used?

When variables are declared, two phases of the compiler process the declarations. For example, consider two variables a and b, declared as integers.
int a, b;

During scanning, only the tokens are separated. The first token is int, uniquely
identified as a keyword, followed by the list of variables a and b. Only during parsing is
the relationship between the keyword int and the variable list a, b established. At this point,
a procedure has to be invoked to store each variable, or symbol, of the program. The
location for storing these variables is decided by their scope. If a variable is static, it is
stored in the heap memory of the system.


If they are dynamic variables, they are stored on the stack. Each language supports different
scopes and lifetimes for its variables. Thus, the symbol table is created during
parsing and used throughout the later parts of the compilation process.
Each symbol is associated with a minimum set of attributes as listed here.
1. Name
2. Type

3. Size
4. Location

Since every symbol has a set of attributes, it can be realized as a record of information,
and since a source code contains multiple variables, the table can be realized as a set of
records. Any suitable data structure for a set can be used; examples of such data structures
are as follows:
1. Array of records
2. Linked list of records

3. Tree of records (binary search tree, B-tree, etc.)


4. Hash data structure, and so on

Each data structure has its own advantages and disadvantages. It can be selected
depending on the type of language.
Example 2.13

Consider a program segment having the declarative statements:

int principal;
float rate, interest;

A typical (and the simplest) data structure for holding the information of these variables is
#define NAMESIZE 30
#define MAXSYMBOLS 50
typedef struct symbol
{
    char name[NAMESIZE];   /* identifier name */
    int type;              /* type code, e.g., int or float */
    int size;              /* storage size in bytes */
    void *location;        /* pointer to the allocated storage */
} symbol;
symbol symbols[MAXSYMBOLS];

The simplest data structure is an array of records of symbol information (Table 2.4).

Table 2.4  Data structure

Name        Type    Size            Value/location
principal   int     sizeof(int)     Pointer to memory
rate        float   sizeof(float)   Pointer to memory
interest    float   sizeof(float)   Pointer to memory

This symbol table is created during the syntax and semantic analysis phase and is referred
to by other phases in the compilation process.
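To show how such a table might be manipulated, here is a minimal sketch of insertion and linear lookup over the array-of-records structure of Example 2.13. The addSymbol and lookup helpers and the numeric type codes are hypothetical, introduced only for illustration.

#include <stdio.h>
#include <string.h>

#define NAMESIZE 30
#define MAXSYMBOLS 50

typedef struct symbol {
    char name[NAMESIZE];
    int type;         /* e.g., 0 = int, 1 = float */
    int size;         /* storage size in bytes */
    void *location;   /* pointer to the allocated storage */
} symbol;

static symbol symbols[MAXSYMBOLS];
static int symbolCount = 0;

/* Record a declared variable in the table; return its index or -1. */
int addSymbol(const char *name, int type, int size)
{
    if (symbolCount >= MAXSYMBOLS)
        return -1;                        /* table full */
    strncpy(symbols[symbolCount].name, name, NAMESIZE - 1);
    symbols[symbolCount].name[NAMESIZE - 1] = '\0';
    symbols[symbolCount].type = type;
    symbols[symbolCount].size = size;
    symbols[symbolCount].location = NULL; /* set when storage is allocated */
    return symbolCount++;
}

/* Linear search by name; return the index or -1 if absent. */
int lookup(const char *name)
{
    for (int i = 0; i < symbolCount; i++)
        if (strcmp(symbols[i].name, name) == 0)
            return i;
    return -1;
}

int main(void)
{
    addSymbol("principal", 0, (int)sizeof(int));
    addSymbol("rate", 1, (int)sizeof(float));
    addSymbol("interest", 1, (int)sizeof(float));
    printf("rate is entry %d\n", lookup("rate"));
    return 0;
}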


A detailed description of symbol table management is given in Chapter 5.

2.5.8 Error Management


For any beginner in a programming environment, writing an error-free program is a
nightmare. Earlier compilers were not friendly enough to report errors properly.
Detecting, locating, and reporting errors, and recovering from them, are the most important
tasks of the error-management module. Some desirable characteristics of an
error-management module are as follows:
1. The compiler should not crash on encountering an error.
2. The compiler should report an error and recover from it, so that it can proceed with
the remaining lines of code.
3. The reported error must be meaningful.
4. The error-correction module should not change the meaning of the intended
operations.
Some editors have built-in knowledge of the language profile and guide the developer
so as to help him or her type without mistakes. In most integrated development
environments (IDEs), the editor helps by matching parentheses, flagging undeclared
variables, and colouring the different types of programming constructs.
Errors are broadly classified into static (compile-time) errors and dynamic (run-time)
errors. It is desirable to report as many errors as possible in the earlier stages, as this
helps in saving resources. In Java-like languages, compile-time and run-time problems
are clearly distinguished as errors and exceptions, respectively.
During the compilation process, errors may occur in more than one phase. Some of the
errors are as follows:
1. Lexical errors
(a) Misspelling
(b) Juxtaposing of characters
2. Syntax errors (CFG errors)
(a) Unbalanced parentheses
(b) Undeclared variables
(c) Missing punctuation operators
3. Semantic errors (CSG errors)
(a) Truncation of results
(b) Unreachable code

Errors that arise in the later stages of the compilation process surface as run-time errors,
since at that point they can no longer be related to the source code.
In summary, the compiler translates a programming language into machine language. An
open research issue is to devise ways of translating any language into another, which would
eliminate the need to learn a foreign language.

SUMMARY
Languages are used for communication. A natural
language (for example, English, French, or Tamil) is
a means of communication shared by a group of
individuals without much restriction on its usage.
On the other hand, a formal language has defined
rules for its usage.


The grammar of a language specifies the rules by
which the constructs of the language can be formed.
The components of languages are symbols, words,
and sentences.
For every formal language, there exists a finite set
of rules or grammar. Every finite grammar has a
language associated with it.
Notations are a convenient means of representing
the components of a language and grammar.
Chomsky has categorized the languages into four:
(a) type 0 or unrestricted grammar, (b) type 1 or
CSG, (c) type 2 or CFG, and (d) type 3 or RG.
Any programming language must be designed so
that it is easy and powerful enough for the user,
and practically possible to implement.
Type 3 or RG is used to specify the tokens of the
computer language, and type 2 or CFG is used to
specify the structure of the sentence of the languages.
When a language whose structure closely resembles the computer system is developed, it is fast to
execute but inflexible. When it is closer to human
language, it is slow to execute, since a lot of
overhead is associated with the process of translating
human language to machine language.
Owing to the speed of the hardware and the requirement of flexible languages in the business environment, it is preferable to write programs
in high-level (3GL or 4GL) languages, which
necessitates a translator/compiler for converting
them into machine language.


Scripting languages are popular, as they are easy to
understand and use.
Special and unique languages are being developed
to work with the Internet across the globe (HTML
and XML).
It will be interesting to communicate with the
computer by speaking to it. Research is underway
for using a computer to analyse and recognize
speech.
Programming languages have been evolving since
the year 1950, with several capabilities being added
over time. From FORTRAN to Java and C#, several
useful features have been added.
Compiler development must relieve the programmer
of the burden of remembering and checking
syntax, semantic issues, and so on.
Development of the compiler is broadly divided
into analysis phase (front end) and synthesis phase
(back end).
The various phases of the compilers are lexical
analysis (scanner), syntax analysis (parser), semantic analysis, intermediate code generation, code
optimization, and code generation.
Tools are available for developing these phases to
ease compiler development.
This chapter has broadly outlined programming
language theory, the construction of a language's
recognizer (that is, the compiler), and the stages of
the compiler. Chapter 3 deals in detail with the first
phase of the compiler (the scanner or lexical analyser).

OBJECTIVE TYPE QUESTIONS


U 1. Pick the odd one out:
(a) Scanner (b) Parser (c) Intermediate code (d) Code generator
U 2. Can we specify the format of the token using CFG?
(a) Yes (b) No
U 3. Which grammar is preferable to specify the syntax of a language?
(a) Type 0 (b) Type 1 (c) Type 2 (d) Type 3
U 4. What is the best grammar for specifying a natural language?
(a) Type 0 (b) Type 1 (c) Type 2 (d) Type 3
U 5. Rules are related to
(a) data (b) information (c) knowledge (d) intelligence
U 6. Inference is related to
(a) data (b) information (c) knowledge (d) intelligence
U 7. What is the size of the language that a recursive grammar can generate?
(a) Infinite (b) Finite (c) All of these (d) None of these


U 8. Pick the odd one out:
(a) Finite set of terminals
(b) Finite set of non-terminals
(c) Finite set of production rules
(d) Sequence of words
L 9. Which grammars have the same level of capability for a given language, and which would you choose from the implementation point of view?
(a) Type 3 (b) Type 2 (c) All of these (d) None of these
U 10. Which of the following grammars is the most expressive?
(a) Type 0 (b) Type 1 (c) Type 2 (d) Type 3
U 11. lvalue in an assignment statement always refers to a location in memory.
(a) True (b) False
U 12. Which of these languages is preferred for fast response?
(a) 1GL (b) 2GL (c) 3GL (d) 4GL
U 13. Compiled languages are inferior to interpreted languages with respect to response time.
(a) Yes (b) No
U 14. Is the memory requirement of a compiled language huge compared to an interpreted language?
(a) Yes (b) No

U 15. JVM stands for


(a) Java virtual memory
(b) Java virtual model
(c) Java virtual machine
(d) Java virtual method
U 16. COBOL stands for
(a) COmmon Based Object Language
(b) COmmon Basic Object Language
(c) COmmon Basic-Oriented Language
(d) COmmon Business-Oriented Language
U 17. The output of syntax analyser/semantic analyser is
(a) parse tree
(b) intermediate code
(c) sequence of tokens
(d) none of these
R 18. The representation of token type is by
(a) integer (b) float (c) double (d) char
U 19. When operands are of two types, one having
2 bytes of representation and another 4 bytes,
what is the size of the resulting answer in
general?
(a) 2 bytes
(b) 4 bytes
U 20. Without creating IC the compiler cannot generate machine code.
(a) True
(b) False

Review Questions
U 1. Identify the following sentences as formal or informal language.
(a) God is Love.
(b) Find a watch in the market whose rate is less than 500.
(c) Select pay details for the employees for the month of January 2010.
L 2. List out the characters in a language of your choice. Find out how many two-character words can be coined by using these characters.
A 3. Specify the rules for question 2.
U 4. Write down the rule for the format of an identifier in the C language.
U 5. Write down the grammar for type 0 grammar.
U 6. Write down the CFG for Boolean expressions.
U 7. Compare the features of CFG with CSG.
A 8. Refer to any book on programming in C# and identify the keywords and represent them in token pairs.
L 9. How do you select a high-level language for a given problem?
U 10. Give the choice of interpreted and compiled language.
R 11. Enumerate the generations of languages.
U 12. When do you use the FORTRAN language?
U 13. For a business environment, which high-level language do you prefer?
U 14. You have to write a device driver (say, for a mouse). Do you prefer assembly language or C language? Justify.
15. Java is a platform-independent language: True or False? Justify your answer.

U 16. Under what circumstances will you use HTML and XML?
R 17. What are the features of the PHP language?
U 18. When do you use GUI-based applications?
A 19. What are the required steps in recognizing human speech? Search for some speech recognition articles before answering.
U 20. What are the advantages of compiled languages over interpreted languages?
U 21. What are the features of a good language?
U 22. What are the features of a good compiler?


R 23. List out some source languages that you can feed to a compiler.
R 24. List out the target languages that a compiler can generate.
U 25. How are the sequences of tokens represented?
U 26. Give an example for the while statement.
L 27. Give the grammar for a nested while statement. Refer to Example 2.5 for a single while statement.
U 28. Give the grammar for working with nested if-else statements. Do you find any difficulties in matching them?

EXERCISES
A 1. Using English grammar, give the derivation for

the following sentences:


(a) The cat drinks milk.
(b) The monkey climbs a tree.
(c) The player kicked the ball.
A 2. Consider a grammar with the following rules over the set of terminal symbols {0, 1}:
A -> 0A    A -> A1    A -> 1    S -> 0S
Describe the set of binary strings generated by this grammar.
A 3. List out any five strings derivable from the following grammars:
(a) S -> 0A1A, A -> 1B, B -> 1
(b) S -> 010, A -> 0
(c) A -> 0A1, B -> 0B
A 4. Parse the string bbba using the following grammar:
(a) S -> bS    S -> BC
(b) S -> ba    B -> ba
(c) S -> A     B -> aCC
(d) A -> a     C -> CCa
(e) A -> Bb    C -> b
A 5. Draw the derivation/production steps for the following arithmetic expressions using the grammar for the arithmetic expression in the text:
(a) p + q
(b) p * q + r
(c) (p + q)
(d) (p + q) * r
A 6. Identify the start symbol, sentential forms, handles, and sentence in the derivation
steps obtained in the previous question.
U 7. A binary language consisting of four 0s followed by eight 1s followed by four 0s is to be

generated for a frame delimiter in a communication link. Write down the grammar for this.
A 8. Consider a code segment of C language:
(a) int a, b;
(b) a = 10;
(c) b = 20;
(d) if (a = 20)

(i)a = a + 1;
(e) else

(i)a = a - 1;
Is this syntactically and semantically right?
Write down the value of the variable a after the
execution of this code segment.
U 9. For the problem given in question 8, if the code
segment in the if statement is slightly modified
as:
(a) if(a == 20)
(i)a = a + 1;
(b) else
(i) a = a - 1;
what would be the value of a after the code gets
executed? Compare the result with the previous
question and comment.
U 10. There are four programming languages, namely, BASIC, C, Pascal, and FORTRAN for which
the compiler has to be developed to work with
three machine architectures, Intel, Motorola,
and Zilog processor.
(a) List out how many front ends and back ends
of the compiler you have to work with.
(b) Repeat the question for a compiler without
using the IC where the source code will be


directly converted into machine code (of


course it is possible).
U 11. Write down a grammar for representing the following numbers.
(a) 230 (b) 145.23 (c) -424 (d) 667e40
A 12. Draw the syntax tree for the arithmetic expression 1 + 5 * (4 + 6).
L 13. Consider the following sentences and identify
scope for optimization:
Pay1 = FixedPay + NoOfWorkingHours * Rate
Pay2 = Pay1 + NoOfWorkingHours * Rate *
bonusRateForHours
These computations are to be carried out for all
the employees of an organization. Comment on

the gain in execution time if the codes are properly optimized.


L 14. In a processor, a register is to be initialized to zero. We have the following two codes to realize it. Which one would you prefer and why?

(a) MOV A, 00   // store the value 0 into the register
(b) XRA A       // A = A XOR A

A 15. Generate the machine code for the following

source code segment:


TotalAmount = Principal + Principal * InterestRate

Answers to Objective Type Questions


1. (c)    2. (a)    3. (c)    4. (a)    5. (c)    6. (d)    7. (a)    8. (d)    9. (a)    10. (a)
11. (a)   12. (a)   13. (b)   14. (a)   15. (c)   16. (d)   17. (b)   18. (a)   19. (b)   20. (b)
