Вы находитесь на странице: 1из 17

Compiler Design

Submitted By
Kiran Subramaniam S
III Year BE/CSE
AC07UCS033
Compiler Design

Summary in Compiler Design

MODULE I. INTRODUCTION TO COMPILER

Compiler writing spans programming languages, machine architecture, language theory, algorithms, and
software engineering. Fortunately, a few basic compiler-writing techniques can be used to construct translators for a
wide variety of languages and machines.

A Compiler is a program that reads a program written in one language – the source language – and translates
it into an equivalent program in another language – the target language as illustrated in Figure 1.1 in which the
important part of the translation process is that the compiler reports to its user the presence of errors in the source
program.
source target
compiler
program program

error
messages

Figure 1.1 A Compiler

At first glance, the variety of compilers may appear overwhelming. There are thousands of source languages,
ranging from traditional programming languages such as Fortran and Pascal to specialized languages that have arisen
in virtually every area of computer application. Target languages are equally as varied; a target language may be
another programming language, or the machine language of any computer between a microprocessor and a
supercomputer.

A compiler translates a source program into machine language. An interpreter program reads individual words
of a source program and immediately executes corresponding machine-language segments. Interpretation occurs each
time the program is used. Thus, once it has been compiled, a program written into a compiled language will run more
quickly than a program in an interpreted language.

An interpreter is a computer program that translates commands written in a high-level computer language into
machine-language commands that the computer can understand and execute. An interpreter's function is thus similar
to that of a compiler, but the two differ in their mode of operation. A compiler translates a complete set of high-level
commands, producing a corresponding set of machine-language commands that are then executed, whereas an
interpreter translates and executes each command as the user enters it. Interpretive languages, such as the widely
used BASIC, are relatively easy to learn, and programs written in them are easy to edit and correct. Compiled
programs, on the other hand, operate more rapidly and efficiently.

THE ANALYSIS-SYNTHESIS MODEL OF COMPILATION

There are two parts of compilation: analysis and synthesis. The analysis part breaks up the source program
into constituent pieces and creates an intermediate representation of the source program. The synthesis part
constructs the desired target program from the intermediate representation.

During analysis, the operations implied by the source program are determined and recorded in a hierarchical
structure called a tree. It is a special kind of tree called a syntax tree, in which each node represents an operation and
the children of a node represent the arguments of the operation. An example is shown in

2
Compiler Design

Figure 1.2.

:=
position +
initial *
rate 60

Figure 1.2 Syntax tree for position := initial + rate * 60.

Software tools that manipulate source programs:

1. Structure editors – takes an input a sequence of commands to build a source program. The structure editor
not only performs the text-creation and modification functions of an ordinary text editor, but it also analyzes
the program text, putting an appropriate hierarchical structure of the source program.
2. Pretty printers – analyzes a program and prints it in such a way that the structure of the program becomes
clearly visible. Example of this, comments may appear in a special font, statements may appear with an
amount of indentation proportional to the depth of their nesting in the hierarchical organization of the
statements.
3. Static checkers – reads a program, analyzes it, and attempts to discover potential bugs without running the
program. For example, a static checker may detect that parts of the source program can never be executed,
or that a certain variable might be used before being defined.
4. Interpreters – performs the operations implied by the source program. For an assignment statement, for
example, an interpreter might build a tree like in Figure 1.2 and then carry out the operations at the nodes as
it “walks” the tree. Interpreters are frequently used to execute command languages, since each operator
executed in a command language is usually in invocation of a complex routine such as an editor or compilers.

ANALYSIS OF THE SOURCE PROGRAM

It consists of three phases:

1. Linear analysis, in which the stream of characters making up the source program is read from left-to-right and
grouped into tokens that are sequences of characters having a collective meaning.
2. Hierarchical analysis, in which characters or tokens are grouped hierarchically into nested collections with
collective meaning.
3. Semantic analysis, in which certain checks are performed to ensure that the components of a program fit
together meaningfully.

THE PHASES OF A COMPILER


Conceptually, a compiler operates in phases, each of which transforms the source program from one
representation to another. A typical decomposition of a compiler is shown in Figure 1.3.

3
Compiler Design

LEXICAL ANALYSIS

In a compiler, linear analysis is called lexical analysis or scanning. For example, in lexical analysis the
characters in the assignment statement

position := initial + rate * 60

would be grouped into the following tokens:


1. The identifier position.
2. The assignment symbol :=.
3. The identifier initial.
4. The plus sign
5. The identifier rate.
6. The multiplication sign.
7. The number 60.

The blanks separating the characters of these tokens would normally be eliminated during lexical analysis.

SYNTAX ANALYSIS

Hierarchical analysis is called parsing or syntax analysis. It involves grouping the tokens of the source
program into grammatical phrases that are used by the compiler to synthesize output. Usually, a parse tree such as
the one shown in Figure 1.4 represents the grammatical phrases of the source program.

SEMANTIC ANALYSIS

The semantic analysis phase checks the source program for semantic errors and gathers type information for
the subsequent code-generation phase. It uses the hierarchical structure determined by the syntax-analysis phase to
identify the operators and operands of expressions and statements.

4
Compiler Design
An important component of semantic analysis is type checking. Here the compiler checks that each operator
has operands that are permitted by the source language specification. For example, many programming language
definitions require a compiler to report an error every time a real number is used to index an array. However, the
language specification may permit some operand coercions, for example, when a binary arithmetic operator is applied
to an integer and real. In this case, the compiler may need to convert the integer to a real.

SYMBOL-TABLE MANAGEMENT

An essential function of a compiler is to record the identifiers used in the source program and collect
information about various attributes of each identifier. These attributes may provide information about the storage
allocated for an identifier, its type, its scope, and in the case of procedure names, such things as the number and
types of its arguments, the method of passing each argument, and the type returned, if any.

A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the
identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from
that record quickly.

When the lexical analyzer detects an identifier in the source program, the identifier is entered into the symbol
table. However, the attributes of an identifier cannot normally be determined during lexical analysis. For example, in
a Pascal declaration like

var position, initial, rate : real ;

The type real is not known when position, initial, and rate are seen by the lexical analyzer.

INTERMEDIATE CODE GENERATION

After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the
source program. We can think of this intermediate representation as a program for an abstract machine. This
intermediate representation should have two important properties: it should be easy to produce, and easy to translate
into the target program.

CODE OPTIMIZATION

The code optimization phase attempts to improve the intermediate code, so that faster-running machine code
will result.

CODE GENERATION

The final phase of the compiler is the generation of target code, consisting normally of relocatable machine
code or assembly code. Memory locations are selected for each of the variables used by the program. Then,
intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A
crucial aspect is the assignment of variables to registers.

ERROR-HANDLING

Each phase can encounter errors. However, after detecting an error, a phase must somehow deal with that
error, so that compilation can proceed, allowing further errors in the source program to be detected. A compiler that
stops when it finds the first error is not as helpful as it could be.

The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the
compiler. The lexical phase can detect errors where the characters remaining in the input do not form any token of
the language. Errors where the token stream violates the structure rules (syntax) of the language are determined by
the syntax analysis phase. During semantic analysis the compiler tries to detect constructs that have the right
syntactic structure but no meaning to the operation involved.

COMPILER-WRITING TOOLS

5
Compiler Design

The compiler writer, like any programmer, can profitably use software tool such as debuggers, version
managers, profilers, and so on. Shortly after the first compilers were written, systems to help with the compiler-
writing process appeared. These systems have often been referred to as compiler-compilers, compiler-generators, or
translator-writing systems. Largely, they are oriented around a particular model of languages, and they are most
suitable for generating compilers of languages similar to the model.

Some general tools have been created for the automatic design of specific compiler components. These tools
use specialized languages for specifying and implementing the component, and many use algorithms that are quite
sophisticated. The most successful tools are those that hide the details of the generation algorithm and produce
components that can be easily integrated into the remainder of a compiler. The following is a list of some useful
compiler-construction tools:

1. Parser generators. These produce syntax analyzers, normally from input that is based on a context-free grammar.
In early compilers, syntax analysis consumed not only a large fraction of the running time of a compiler, but also a
large fraction of the intellectual effort of writing a compiler. This phase is now considered one of the easiest to
implement. Many parser generators utilize powerful parsing algorithms that are too complex to be carried out by
hand.
2. Scanner generators. These automatically generate lexical analyzers, normally from a specification based on regular
expressions. The basic organization of the resulting lexical analyzer is in effect a finite automation.
3. Syntax-directed translation engines. These produce collections of routines that walk the parse tree, generating
intermediate code. The basic idea is that one or more “translations” are associated with each node of the parse
tree, and each translation is defined in terms of translations at its neighbor nodes in the tree.
4. Automatic code generators. Such a tool takes a collection of rules that define the translation of each operation of
the intermediate language into the machine language for the target machine. The rules must include sufficient
detail that we can handle the different possible access methods for data; e.g., variables may be in registers, in a
fixed (static) location in memory, or may be allocated a position on a stack. The basic technique is “template
matching.” The intermediate code statements are replaced by “templates” that represent sequences of machine
instructions, in such a way that the assumptions about storage of variables match from template to template.
5. Data-flow engines. Much of the information needed to perform good code optimization involves “data-flow
analysis,” the gathering of information about how values are transmitted from one part of a program to each other
part. Different tasks of this nature can be performed by essentially the same routine, with the user supplying
details of the relationship between intermediate code statements and the information being gathered.

FORMAL SPECIFICATIONS AND GENERATION OF COMPILER MODULES

Compiler Subtask Specification mechanism Automaton type


Lexical analysis Regular expressions Deterministic finite automata

Syntax analysis Context-free grammars Deterministic pushdown automata

Semantic analysis Attribute grammars

Efficiency-increasing transformations Tree → tree Finite


transformation Tree transducers
Code selection in code generation
Regular tree grammars Finite tree automata

MODULE II. LEXICAL ANALYSIS

THE ROLE OF LEXICAL ANALYZER

The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and produce
as output a sequence of tokens that the parser uses for syntax analysis. Upon receiving a “get next token” command
from the parser, the lexical analyzer reads input characters until it can identify the next token.

6
Compiler Design
Since the lexical analyzer is the part of the compiler that reads the source text, it may also perform certain
secondary tasks at the user interface. One such task is stripping out from the source program comments and white
spaces in the form of blank, tab, and new line characters. Another is correlating error messages from the compiler
with the source program.
Sometimes lexical analyzers are divided into a cascade of two phases first called “scanning” and the second
“lexical analysis”. The scanner is responsible for doing simple tasks, while the lexical analyzer proper does the more
complex operations.

ISSUES IN LEXICAL ANALYSIS

Reasons for separating the analysis phase of compiling into lexical analysis and parsing:
1. Simpler design is perhaps the most important consideration. The separation of lexical analysis from syntax
analysis often allows us to simplify one or the other of these phases.
2. Compiler efficiency is improved. A separate lexical analyzer allows us to construct a specialized and potentially
more efficient processor for the task.
3. Compiler portability is enhanced. Input alphabet peculiarities and other device-specific anomalies can be
restricted to the lexical analyzer.

TOKENS, PATTERNS, LEXEMES

A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
Examples of tokens
Token Sample Lexemes Informal Description of Pattern
Const const const
If if if
Relation <, <=, =, <>, >, >= < or <= or = or <> or > or >=
Id pi, count, D2 letter followed by letters and digits any numeric constant
Num 3.1416, 0, 6.02E23 any numeric constant
Literal “core dumped” any characters between “ and “ except “

Tokens are treated as terminal symbols in the grammar for the source language, using boldface names to
represent tokens. The lexemes matched by the pattern for the token represent strings of characters in the source
program that can be treated together as a lexical unit.
In other languages, the following constructs are treated as tokens: keywords, operators, identifiers, constants,
literal strings, and punctuation symbols such as parentheses, commas, and semicolons.
A pattern is a rule describing the set of lexemes that can represent a particular token in a source program.
The pattern for the token const just the single const that spells out the keyword.

ATTRIBUTES FOR TOKENS


When more than one pattern matches a lexeme, the lexical analyzer must provide additional information
about the particular lexeme that matched to the subsequent phases of the compiler.
The lexical analyzer collects information about tokens into their associated attributes. The tokens influence
parsing decisions; the attributes influence the translation of tokens.

LEXICAL ERRORS
Few errors are discernible at the lexical level alone, because a lexical analyzer has a very localized view of the
source of program.
But suppose a situation does arise in which the lexical analyzer is unable to proceed because none of the
patterns for the tokens matches a prefix of the remaining input. Perhaps the simplest recovery strategy is “panic
mode” recovery.

Other possible error-recovery actions are:


1. deleting an extraneous character
2. inserting a missing character
3. replacing an incorrect character by a correct character

7
Compiler Design
4. transposing two adjacent characters

Error transformations like these may be tried in attempting to repair the input. The simplest such strategy is
to see whether a prefix of the remaining input can be transformed into a valid lexeme by just a single error
transformation.
One way of finding the error in a program is to compute the minimum number of error transformations
required to transform the erroneous program into one that is syntactically well formed.

INPUT BUFFERING
Three general approaches to the implementation of a lexical analyzer:
1. Use a lexical analyzer generator, such as Lex compiler to produce the lexical analyzer from a regular
expression based specification.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that
language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.

The three are listed in order of increasing difficulty for the implementor. Since the lexical analyzer is the only
phase of the compiler that reads the source program character by character, it is possible to spend a considerable
amount of time in the lexical analysis phase, even though the later phases are conceptually more complex.

Buffer Pairs

There are many times when the lexical analyzer needs to look ahead several characters beyond the lexeme for
a pattern before a match can be announced. Because a large amount of time can be consumed moving characters,
specialized buffering techniques have been developed to reduce the amount of overhead required to process an input
character.
: : :E: :=: :M: * C: *: *: 2: eof : : :

lexeme beginning forward

An input buffer in two halves

Sentinels

Except at the ends of the buffer halves, it requires tests for each advance of the forward pointer. The two
tests can be reduced to one if each buffer half will be extended to hold a sentinel character at the end.

:E: :=: :M: * eof C: *: *: 2: eof : : : eof


Sentinels at the end of each buffer half
forward
SPECIFICATION OF TOKENS lexeme beginning
Regular expressions are an important notation for specifying patterns. Each pattern matches a set of strings,
so regular expressions will serve as names for sets of strings.

Strings and languages

Alphabet or character class denotes any finite set of symbols.


Example
The set { 0 , 1} is the binary alphabet

String – is finite sequence of symbols drawn from the alphabet.


In language theory, the terms sentence and word are often used as synonyms for the term “string”. The
length of a string s, usually written |s| is the number of occurrences of symbol in s.

Empty string – denoted ∈, is a special string of length zero.

8
Compiler Design
The term language denotes any set of strings over some fixed alphabet.

Term Definition
Prefix of s A string obtained by removing zero or more trailing symbols of string s, e.g. ban is a
prefix of banana.

Suffix of s A string formed by deleting zero or more of the leading symbols of s; nana is a suffix of
banana.

Substring of s A string obtained by deleting a prefix and a suffix from s; nan is a substring of banana.
Every prefix and every suffix of s is a substring of s, but not every substring of s is a
prefix or suffix of s.

Proper prefix, suffix, of Any nonempty string x that is, respectively, a prefix, suffix, or substring of such that s
substring of s is not equal to x.

Subsequence of s Any string formed by deleting zero or more not necessarily contiguous symbols from s;
e.g., baaa is a subsequence of banana.

Operations on Languages

There are several important operations that can be applied to languages. For lexical analysis, the primary
operations are union, concatenation, and closure.

Operation Definition
Union L and M written L L ∪ M = { s |s is in L or s is in M }
∪M

Concatenation of L and LM = { st | s is in L and t is in M}


M written LM

Kleene closure of L L * denotes “zero or more concatenations of” L


written L *
+
Positive closure of L L denotes “one or more concatenations of” L
written L +

REGULAR EXPRESSIONS
A regular expression is built up out of simpler regular expressions using a set of defining rules. Each regular
expression r denotes a language L(r). The defining rules specifies how L(r) is formed by combining in various ways
the languages denoted by the subexpressions of r.

Rules that define the regular expressions using a set of defining rules:
1. ∈ is a regular expression that denotes { ∈ }, that is, the set of containing the empty string.
2. If a is a symbol in ∑, then a is a regular expression that denotes {a} is a set containing the string a.
3. Suppose r and s are regular expressions denoting the language L(r) and L(s). Then:
a) (r) | (s) is a regular expression denoting L(r) ∪ L(s)
b) (r) (s) is a regular expression denoting L(r) L(s)
c) (r)* is a regular expression denoting (L(r))*
d) (r) is a regular expression denoting L(r)2
A language denoted by a regular expression is said to be a regular set.

PRACTICE SET
1. What is the input alphabet of each of the following languages?
9
Compiler Design
a. Pascal
b. C
c. C++
d. Visual Basic
2. What are the conventions regarding the use of the blanks in each language in number 1?
3. In a string length n, how many of the following are there?
a. Prefixes
b. Suffixes
c. Substrings
d. Proper prefixes
e. Subsequences
4. Describe the language denoted by the following regular expressions:
a. 0(0|1)*)
b. (0|1)*0(0|1)(0|1)
c. 0*10*10*
d. (00|11)*((01|10)(00|11))
5. Write regular definitions for the following languages:
a. all strings of letters that contain the five vowels in order
b. all strings of letters in which the letters are in ascending lexicographic order
c. comments consisting of a string surrounded by /* and */ without and intervening */ unless it appears
inside the quotes “and”
d. all strings of digits with at most one repeated digit

REGULAR EXPRESSIONS & REGULAR LANGUAGES

• “Familiar” algebraic properties/identities


- Alphabet is always a finite set. It can be ANY alphabet but in theory of languages it is usually chosen to consist of
two or three characters, typically: a and b or 0 and 1.

= {a, b} = (a + b) = (a | b) are all IDENTICAL regular


= {a} U {b} = aUb expression notations for .

Definition
Consider a string w made up from the symbols in the alphabet = {a, b} such as “a” or “b” or “ab” or “ababbbbb” or
“aaa” or “abbbaaabb” etc., then we define string “e”, called the empty string such that for any string w:
ew = we = w

The symbol “e” is then “neutral element” for strings when performing the concatenation operation. (This is similar to
defining 0 as the neutral element for real numbers when performing the operation of addition or defining 1 as the
neutral element for the operation of multiplication). It is important to notice that e in never part of the alphabet.

a* = {a}* = {e, a, aa, aaa, aaaa, ....}, this is the Kleene star or simply operation on one symbol of the alphabet.

* = {a, b}* = (a, b)* = (a | b)* = an infinite set given by:


{e, a, b, aa, ab, ba, bb, aaa, aab, aba, baa, abb, bab, bba,
bbb, aaaa, ....}, that is, all possible strings of a’s and b’s.
This is the star operation on the entire alphabet. It can be proven that another regular expression for this
same set is (a*b*)*.
Ø={}≠e notice that Ø is a set while e is a symbol.
Ø* = {e} ≠ Ø the set Ø* has one symbol in it, mainly e, the empty string, and thus it is NOT the empty set.

a | a*b denotes the set containing the string a or all strings of zero or more a’s followed by a b.

Note: The notation “a | a*b” can also be written as “(a + a*b)” as we shall see when reviewing the regular
expressions. The symbol “|” or the symbol “+” in this case means the so-called “or” operation used in set theory.
Therefore, the expression a*b means the following:
a*b = {e, a, aa, aaa, ...}b = {b, ab, aab, aaab, ....},

10
Compiler Design
that is, the set of all strings of zero or more a’s followed by a b.

• Define a set of strings (i.e., a language L) over a given alphabet and use a set of construction rules:
- The empty string e is a regular expression.
- Every symbol a in the alphabet is a regular expression
- Given the regular expressions r and s denoting the
languages L(r) and L(s), then:
• (r) + (s) is the regular expression denoting L(r)U L(s)
• (r)°(s) is the regular expression denoting L(r)°L(s)
• (r)* is a regular expression denoting (L(r))*
• (r) is a regular expression denoting L(r)
• Precedence rules to avoid parentheses:
- the unary operator * has the highest precedence
- concatenation ° has the second highest precedence
- + (also “|”, U-union-, “or”) has the lowest precedence.

• Algebraic properties of regular expressions (r & s):


- r+s = s+r => + is commutative
- r+(s+t) = (r+s)+t => + is associative
- (r°s)°t = r°(s°t) => ° is associative
- r°(s+t) = r°s + r°t & (s+t)°r = s°r+t°r => ° distributes over +
- e°r = r & r°e = r => e is the identity element for °
- r* = (r+e)* => relation between * and e.
- r** = r* => * is idempotent.
NOTE: If two regular expressions denote the same regular language we say that they are equivalent.

Examples of regular expressions:


Strings over the alphabet Σ = {0, 1}
• L = 0*1(0 | 1)*0 = 0*1 *0; set of non-zero binary numbers that are multiples of 2. (They end in 0).
• L = (0 + 1)* = *; set of all strings of 0's and 1s.
• L = (0 + 1)*00(0 + 1)* = *00 *; set of all strings of 0's and 1's with at least two consecutive 0's.
• L = (0 + 1)*011 = *011; set of all strings of 0's and 1's ending in 011.
• L = letter (letter | digit)*; set of all strings of letters and digits beginning with a letter.
• (e + 00)* it is equivalent to (00)*.
• 1 + 01 it is equivalent to (e + 0)1.
• 0* + 0*(1 + 11)(00*(1 + 11))*0* ; set of all strings that do not have the substring 111.

MODULE III. THE SYNTACTIC SPECIFICATION OF PROGRAMMING LANGUAGES

FINITE AUTOMATA

• Machine Model that Recognizes Regular Languages


• The finite automata, (FA), machine M defined by the 5-tuple M = {∑, S, s0, , F}; where the alphabet is: ∑ =
{0,1}; the set of states is: S = {s0, s1, s2}; the starting state is s0; the final state is: F = {s2}; and the transitions
are defined by the table below.
0 1.
s0 s0 s1
s1 {s1, s2} s1
s2 s2 Ø

M can also be represented by the transition graph:


0 1,0 0
11
1 0
s0 s1 s2
Compiler Design

this figure (which corresponds to the transition table above) is a non-deterministic finite automaton, NFA. The big
circles represent states and the double circles represent accepting or final states. The state with an unlabeled arrow
coming from its left is the starting state. (A brief list of the differences between a Deterministic Finite Automaton,
DFA, and a Non-deterministic Finite Automaton, NFA, given below.)

NFA vs. DFA

Deterministic Finite Automaton (DFA)


• For every state there is exactly one outgoing edge per alphabet symbol (the transition function IS a function)
• For each symbol in the alphabet there is a corresponding output and there is only one.

Non-Deterministic Finite Automaton (NFA)


• At least one of the states has more than one outgoing edge for the same alphabet symbol (the transition
function IS NOT a function, it is a relation.)
• There may be e transitions (transitions that occur without the presence of an input symbol from the alphabet.)

Are NFA more powerful than DFA?


• No at all. Both, NFAs and DFAs recognize Regular Languages and nothing more.
• Therefore, NFAs can be converted to DFA, which in turn can be "programmed". (Only DFA can be computer
programmed!)

Example

The FAs below are equivalent and they recognize the same regular language. A regular expression (there may be
many regular expressions) corresponding to the regular language can be easily captured from the NFA
construction:
0*1 *00*
that is, the regular language includes all strings of 0’s and 1’s with at least an identified 1 which in the end has to
be followed by an identified 0 regardless how many characters there are between them or regardless how many
characters follow or precede them.
0 1,0 0

1 0
s0 s1 s2

NFA

0 1 0

1 0
A B C
1
DFA
12
Compiler Design

ALGORITHM TO SIMULATE A DFA

Algorithm 1
Input: A string x ended by an EOF character and a DFA defined as a 5-tuple with s0 as its initial state and F as its
set of accepting states.
Output: A “YES” if the DFA accepts x or “NO” otherwise.

Method:
Apply the “pseudo code” algorithm below to the input string x. The function move(x, c) gives the state to which
there is a transition from state s on input character c. The function nextchar returns the next character of the input
string x.
s = s0;
c = nextchar;
while c =! EOF
s = move(s, c);
c = nextchar;
end
if s is in F then
return “YES”
else return “NO”;

Finite Automata Parsing


• Accept input string iff there exists at least one path from the initial state (s0) to a final state (sf) that spells the
input string.
• If no path exists, reject the string
• Example: Use the input string x = ‘11000’

Example

Consider the NFA N = {{q0,.., q4}, {0, 1}, q0, d, {q2, q4}} shown below.
0, 1

q0
1 0

q1 q3

1 0
0, 1 1, 0
q2 q4
13
Compiler Design

d | 0 1 .
q0 | {q0, q3} {q0, q1}
q1 | Ø {q2}
q2 | {q2} {q2}
q3 | {q4} Ø
q4 | {q4} {q4}

We show next the proliferation of states of the NFA under the input string 01001
0 1 q0 0 q0 0 q0 1 q0
q0 q0
0 1 0 0 1

q3 q1 q3 q3 q1

0
1 q4
q4
DEFINITIONS RELATED TO THE e-closure ALGORITHM
Definition
Given a state s of some NFA N, the e-closure(s) is the set of states in N reachable from s on e-transitions
only.
Definition
Given a set of states T of some NFA N, the e-closure(T) is the set of states in N reachable from some state
s in T on e-transitions alone.
Example
X = e-closure({0}) = {0, 1, 2, 4, 7}
Y = e-closure({2}) = {2}
Z = e-closure({6}) = {6, 1, 2, 4, 7} = {1, 2, 4, 6, 7}
T = e-clos.({1, 2, 5}) = e-clos.({1}) U e-clos.({2}) U e-clos.({5})
= {1, 2, 5, 4, 6, 7} = {1, 2, 4, 5, 6, 7}

e
a
e 2 3
e
e e
0 1 6 7
e b e
4 5

NFA
(with e-transitions)

14
Compiler Design

THE e-closure ALGORITHM


The computation of e-closure(T) is a typical process of searching a graph for nodes reachable from a given
set T of nodes by following the e-labeled edges of the NFA.
The (pseudo code) algorithm below used to compute the e-closure (T) uses a stack to hold the states
whose edges have not been checked for e-labeled transitions

Algorithm (e-closure)
push all states in T onto stack;
initialize e-closure(T) to T;
while stack is not empty
pop t, the top element, off stack;
for each state u with an edge from t to u labeled e
if u is not in the e-closure(T)
add u to e-closure(T);
push u onto stack
end
end

ALGORITHM TO BUILD THE DFA FROM AN NFA


Potentially, the total number of states of the DFA is the power set of the number of states in the NFA (2N).
Input: An NFA N with starting state s0, accepting some regular language.
Output: A DFA D accepting the same language as N.

Method:
The algorithm constructs a transition table DTran for D. We call Dstates the set of states in D. We will use the
definitions given for the e-closure algorithm plus:
move(T, a) = set of states to which there is a transition on input symbol a in from some NFA state s in
T.

Algorithm 2 (“subset construction”).


initially, e-closure(s0) is the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates
mark T;
for each input symbol a in Σ
U = e-closure(move(T, a));
if U is not in Dstates then
add U as an unmarked state to Dstates;
DTran(T, a) = U
end
end
Algorithm for Subset Construction.
(taken from Parson’s textbook)

Consider an NFA N and a DFA D accepting the same regular language. [Equivalent to Algorithm 2].

15
Compiler Design
1) Initially the list of states in D is empty. Create the starting state as e-closure(s0), named after initial state {s0} in N.
That is, new start state: state(1) = e-closure (s0).
2) While (there is an uncompleted row in the transition table for D) do:
a) Let x = {s1, s2, ...,sk) be the state for this row.
b) For each input symbol a do:
i) Find the e-closure({s1,s2,...sk},a) = N({s1},a) U N({s2},a) U..... U N({sk},a) = some set we'll call T .
ii) Create the D-state y = {T}, corresponding to T.
iii) If y is not already in the list of D-states, add it to the list. (This results in a new row in the table)
iv) Add the rule D(x , a) = y to the list of transition rules for D.
3) Identify the accepting states in D. (They must include the accepting states in N). That is, make state(j) final
state iff at least one state in state(j) is final state in the NFA.

Yet another algorithm for the construction of a DFA equivalent to an NFA.


state[0] = { };
state[1] = e-closure (s0);
p = 1;
j = 0;
while (j <= p) {
foreach c in ∑ do {
e = DFAedge(state[j],c);
if (e == state[i] for some i <= p) {
trans[j,c] = i;
}
else {
states[p] = e;
trans[j,c] = p;
}
}
j = j + 1;
}
Example
Find all possible theoretical states for the NFA given below.

0 1,0 0

1 0
0 1 2

There are in total 23 = 8 states for the corresponding DFA. This is the Power Set of set S = {1, 2, 3} represented
by 2|S|. That is, the 8 states of the DFA are:
2|S| ={Ø, {0}, {1}, {2}, {0, 1}, {0, 2}, {1, 2}, {0, 1, 2}}

16
Compiler Design
and graphically, the transition diagram is:
0 1 0 1,0

1 1
{0} {1} {2} Ø

1 0 1
1 1
0 {0, 1, 2}
{0, 2} {1, 2} {0, 1}

0 0 0

Trivially, the states {2} and Ø plus the states {0, 2} and {0, 1} that have no input can be eliminated. After the 1st
elimination cycle is complete, {0, 1, 2} has no input to it and can be eliminated. Only A = {0}, B= {1} and C = {1,
2} remain.

Example
The NFA below represents the regular expression letter (letter | digit)*. Find its corresponding DFA. (The
theoretical number of states in this case is 210 = 1,024).
e

letter
5 6 e
letter e e e e
1 2 3 4 9 10
digit
e e
7 8

e
Solution:
A = e-closure ({1}) = {1}
move(A, letter) = {2} (function move defined in Alg. 2)
move(A, digit) = Ø
B = e-closure({2}) = {2, 3, 4, 5, 7, 10}
move(B, letter) = {6}
move(B, digit) = {8}
C = e-closure({6}) = {6, 9, 10, 4, 5, 7} = {4, 5, 6, 7, 9, 10}
move(C, letter) = {6}
move(C, digit) = {8}
D = e-closure({8}) = {8, 9, 10, 4, 5, 7} = {4, 5, 7, 8, 9, 10}
move(D, letter) = {6}
move(D, digit) = {8}
State A is the start state of the DFA and all states that include state 10 of the NFA are accepting states of the DFA.
The transition diagram of the DFA is given below.

17