
COMPILER WRITING

Abstract

This paper introduces compiler-writing techniques. It was written with the following observations in mind: (1) While compiler writing may be done without much knowledge of formal language or automata theory, most books on compiler writing devote most of their chapters to these topics. This inevitably gives most readers the false impression that compiler writing is a difficult task. I have deliberately avoided this formal approach; each aspect of compiler writing is presented in such an informal way that any naive reader can comprehend it. (2) There are several different aspects of compiler writing. Most publications give detailed presentations of every aspect and still fail to give the reader a feeling for how to write a compiler. Most important of all, the internal data structures of a simple compiler will be clearly presented here. Examples will be given, and a reader will be able to understand how a compiler works by studying these examples. In other words, I make sure that a reader does not lose the global view of compiler writing; he sees not only trees, but also forests.

1.1 What Is A Compiler?

A compiler is software (a program) that reads a program written in a source language and translates it into an equivalent program in another language, the target language. An important aspect of the compilation process is to produce diagnostics (error messages) for the source program. These error messages are mainly due to grammatical mistakes made by the programmer.
There are thousands of source languages, ranging from C and Pascal to specialised languages that have arisen in virtually every area of computer application. Target languages also number in the thousands; a target language may be another programming language, a machine language or an assembly language. Compilers are classified as single-pass, multi-pass, debugging or optimizing, depending on how they have been constructed or on what functions they are supposed to perform. In the early days (the 1950s), compilers were considered difficult programs to write. The first FORTRAN compiler, for example, took 18 staff-years to implement. Since then, several new techniques and tools have been developed for handling many of the important tasks that occur during the compilation process. Good implementation languages, programming environments (editors, debuggers, etc.) and software tools have also been developed. With these developments, the compiler-writing exercise has become much easier.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - Code generation, Compilers, Parsing, Run-time environments

General Terms
Design, Management


Keywords
Compiler, Cross-Compiler, Bootstrapping, Compiler Design, Object-Oriented Programming, High-Level Language, Design Patterns, Tools.

1.0 Introduction
The study of compiler design forms a central theme in the field of computer science. An understanding of the techniques used by high-level language compilers can give the programmer a set of skills applicable in many aspects of software design; one does not have to be a compiler writer to make use of them.
Compiler writing is not confined to one discipline only but rather spans several disciplines: programming languages, computer architecture, theory of programming languages, algorithms, etc. This paper is intended as an introduction to the essential features of compiler writing.

1.2 Approaches to Compiler Development
There are several approaches to compiler development. Here we will look at some of them.

1.2.1 Assembly Language Coding

Early compilers were mostly coded in assembly language. The main consideration was to increase efficiency. This approach worked very well for small High-Level Languages (HLLs). As languages and their compilers became larger, many bugs started surfacing that were difficult to remove. The major difficulty with assembly language implementation was poor software maintenance.
Around this time, it was realised that coding compilers in a high-level language would overcome this disadvantage of poor maintenance. Many compilers were therefore coded in FORTRAN, the only widely available HLL at that time. For example, the FORTRAN H compiler for the IBM/360 was coded in FORTRAN. Later, many system programming languages were developed to ensure the efficiency of compilers written in an HLL. Assembly language is still used, but the trend is towards compiler implementation in an HLL.

1.2.2 Cross-Compiler
A cross-compiler is a compiler which runs on one machine and generates code for another machine. The only difference between a cross-compiler and a normal compiler is the code it generates. For example, consider the problem of implementing a Pascal compiler on a new piece of hardware (a computer called X) on which assembly language is the only programming language already available. Under these circumstances, the obvious approach is to write the Pascal compiler in assembler. Hence, the compiler in this case is a program that takes Pascal source as input, produces machine code for the target machine as output, and is written in the assembly language of the target machine.
This compiler implementation involves a great deal of work, since a large assembly language program has to be written for X. Notice that in this case the compiler is very machine-specific; not only does it run on X, but it also produces machine code suitable for running on X. Furthermore, only one computer is involved in the entire implementation process.

The use of a high-level language for coding the compiler can offer great savings in implementation effort. If the language in which the compiler is being written is already available on the computer in use, then the process is simple. For example, Pascal might already be available on machine X, thus permitting the coding of, say, a Modula-2 compiler in Pascal.

If the language in which the compiler is being written is not available on the machine, then all is not lost, since it may be possible to make use of an implementation of that language on another machine. For example, a Modula-2 compiler could be implemented in Pascal on machine Y, producing object code for machine X. The object code for X generated on machine Y would of course have to be transferred to X for execution. This process of generating code on one machine for execution on another is called cross-compilation.

1.2.3 Bootstrapping
Bootstrapping is the concept of developing a compiler for a language by using subsets (small parts) of the same language. Suppose that a Modula-2 compiler is required for machine X, but that the compiler itself is to be coded in Modula-2. Coding the compiler in the language it is to compile is nothing special and, as will be seen, has a great deal in its favour. Suppose further that Modula-2 is already available on machine Y. In this case, the compiler can be run on machine Y, producing object code for machine X. This is the same situation as before, except that the compiler is coded in Modula-2 rather than Pascal. The special feature of this approach appears in the next step.
The compiler, running on Y, is nothing more than a large program written in Modula-2. Its function is to transform an input file of Modula-2 statements into a functionally equivalent sequence of statements in X's machine code. Therefore, the source statements of this Modula-2 compiler can be passed into itself running on Y to produce a file containing X's machine code. This file is of course a Modula-2 compiler capable of being run on X. By making the compiler compile itself, a version of the compiler that runs on X has been created. Once this machine code has been transferred to X, a self-sufficient Modula-2 compiler is available on X; hence there is no further use for machine Y for supporting Modula-2 compilation.
This implementation plan is very attractive. Machine Y is only required for compiler development, and once this development has reached the stage at which the compiler can (correctly) compile itself, machine Y is no longer required. Consequently, the original compiler implemented on Y need not be of the highest quality; for example, optimization can be completely disregarded. Further development (and, obviously, conventional use) of the compiler can then continue at leisure on machine X. This approach to compiler implementation is called bootstrapping. Many languages, including C, Pascal, FORTRAN and LISP, have been implemented in this way.

Pascal was first implemented by writing a compiler in Pascal itself. This was done through several bootstrapping processes: the compiler was first translated "by hand" into an available low-level language.

1.3 Compiler Designing Phases

The compiler, being a complex program, is developed through several phases. Each phase transforms the source program from one representation to another. The tasks of a compiler can be divided very broadly into two sub-tasks:
i. The analysis of the source program
ii. The synthesis of the object program

In a typical compiler, the analysis task consists of three phases:
i. Lexical analysis
ii. Syntax analysis
iii. Semantic analysis

The synthesis task is usually considered as a code generation phase, but it can be divided into other distinct phases such as intermediate code generation and code optimization. These four phases function in sequence, as shown in figure 1. The nature of the interface between these four phases depends on the compiler. It is perfectly possible for the four phases to exist as four separate programs.

Figure 1: Compiler Design Phases
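To make the phase sequence of figure 1 concrete, here is a minimal, hedged sketch in C of the four phases composed as separate functions. The representations, function names and stub bodies are assumptions made purely for illustration; they are not the design of any particular compiler.

#include <stdio.h>

/* Placeholder representations passed between phases. */
typedef struct { const char *tokens; } TokenList;
typedef struct { const char *tree;   } ParseTree;

/* Each phase transforms the source program from one
   representation to another (stubbed out for illustration). */
static TokenList lexical_analysis(const char *source) {
    printf("lexical analysis of: %s\n", source);
    TokenList t = { "tokens" }; return t;
}
static ParseTree syntax_analysis(TokenList t) {
    printf("parsing %s\n", t.tokens);
    ParseTree p = { "tree" }; return p;
}
static ParseTree semantic_analysis(ParseTree p) {
    printf("type-checking %s\n", p.tree); return p;
}
static void code_generation(ParseTree p) {
    printf("emitting code for %s\n", p.tree);
}

int main(void) {
    /* The four phases run in sequence, as in figure 1. */
    code_generation(semantic_analysis(syntax_analysis(
        lexical_analysis("x := x + 1"))));
    return 0;
}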

1.3.1 Lexical Analysis

Lexical analysis is the first phase of a compiler. Lexical analysis, also called scanning, performs two important tasks. First, it scans a source program from left to right, character by character, and groups the characters into tokens (or syntactic elements) having a collective meaning. Each token, or basic syntactic element, represents a logically cohesive sequence of characters, such as an identifier (also called a variable), a keyword (if, then, else, etc.), a multi-character operator such as <=, and so on. The output of this phase goes to the next phase, i.e. syntax analysis or parsing. The interaction between the two phases is shown below in figure 2.
Figure 2: Interaction between the first two phases


The second task performed during lexical analysis is to enter each token into a symbol table if it is not already there. Some other tasks performed during lexical analysis are:
i. to remove all comments, tabs, blank spaces and machine characters;
ii. to produce error messages (also called diagnostics) for errors that occur in the source program.

Let us consider the following Pascal language statement:
FOR i := 1 TO 50 DO sum := sum + x[i]; { sum of numbers stored in array x }
After going through the statement, the lexical analyser transforms it into a sequence of tokens:
FOR, i, :=, 1, TO, 50, DO, sum, :=, sum, +, x, [, i, ], ;
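To make this grouping concrete, the following is a hedged sketch in C of a tiny hand-written scanner for statements like the one above. The token classes, keyword list and output format are illustrative assumptions, not the design of any particular compiler.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

static const char *keywords[] = { "for", "to", "do", "if", "then", "else", NULL };

static int is_keyword(const char *s) {
    for (int i = 0; keywords[i] != NULL; i++)
        if (strcmp(keywords[i], s) == 0) return 1;
    return 0;
}

/* Reads one token from stdin and prints its class and spelling;
   returns 0 at end of input. */
static int next_token(void) {
    int c = getchar();
    while (c == ' ' || c == '\t' || c == '\n') c = getchar();   /* skip blanks */
    if (c == EOF) return 0;

    char text[64];
    int n = 0;
    if (isalpha(c)) {                        /* identifier or keyword */
        while (isalnum(c) && n < 63) { text[n++] = (char)tolower(c); c = getchar(); }
        text[n] = '\0';
        ungetc(c, stdin);
        printf("%s  %s\n", is_keyword(text) ? "KEYWORD" : "IDENT", text);
    } else if (isdigit(c)) {                 /* unsigned integer constant */
        while (isdigit(c) && n < 63) { text[n++] = (char)c; c = getchar(); }
        text[n] = '\0';
        ungetc(c, stdin);
        printf("NUMBER   %s\n", text);
    } else if (c == ':') {                   /* ':=' is one multi-character token */
        int d = getchar();
        if (d == '=') printf("ASSIGN   :=\n");
        else { ungetc(d, stdin); printf("PUNCT    :\n"); }
    } else {
        printf("OP/PUNCT %c\n", c);          /* +, ;, [, ] and so on */
    }
    return 1;
}

int main(void) {
    while (next_token())
        ;
    return 0;
}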
Tokens are based on certain grammatical structures. Regular expressions are important notations for specifying these tokens. A regular expression consists of symbols (in the alphabet of the language that is being defined) and a set of operators that allow:
(i) concatenation (combination of strings),
(ii) repetition, and
(iii) alternation.

Examples of Regular Expressions
(i) ab denotes the set of strings {ab}
(ii) a | b denotes either a or b
(iii) a* denotes {empty, a, aa, aaa, ...}
(iv) ab* denotes {a, ab, abb, abbb, ...}
(v) [a-z A-Z] [a-z A-Z 0-9]* gives a definition of a variable, which means that a variable starts with an alphabetic character followed by any number of alphabetic or digit characters.
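As a small illustration of turning example (v) into code, the following hedged C function checks whether a string matches [a-z A-Z][a-z A-Z 0-9]*. The function name is mine, not from the paper.

#include <ctype.h>

/* Returns 1 if s matches [a-zA-Z][a-zA-Z0-9]*, i.e. a letter
   followed by any number of letters or digits; 0 otherwise. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0])) return 0;
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i])) return 0;
    return 1;
}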
Writing a lexical analyser completely from scratch is a fairly challenging task. Several tools have been built for constructing lexical analysers from special-purpose notations based on regular expressions. Perhaps the most famous of these tools is Lex, one of the many utilities available with the Unix operating system. Lex requires that the syntax of each lexical token be defined in terms of a regular expression. Associated with each regular expression is a fragment of code that defines the action to be taken when that expression is recognised.


The Symbol Table

An essential function of a compiler is to record the identifiers used in the source program and the related information about their attributes: type (numeric or character), scope (where in the program the identifier is valid) and, in the case of procedure or function names, such things as the number and types of the arguments, the mechanism for passing each argument and the type of result returned.
A symbol table is a set of locations containing a record for each identifier, with fields for the attributes of the identifier. A symbol table allows us to find the record for each identifier (variable) and to store or retrieve data from that record quickly.
For example, take a declaration written in C such as:
int x, y, z;
The lexical analyser, after going through this declaration, will enter x, y and z into the symbol table. This is shown in the figure given below.

Figure 3: Symbol Table

The first column of this table contains the entries for the variables, and the second contains the addresses of the memory locations where the values of these variables will be stored. The remaining phases enter information about identifiers into the symbol table and then use this information in various ways.
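As an illustration, here is a hedged C sketch of such a symbol table, implemented as a simple linear list. A production compiler would more likely use a hash table, and the field layout and addresses below are assumptions made for the example.

#include <stdio.h>
#include <string.h>

/* One record per identifier, with fields for its attributes. */
struct symbol {
    char name[32];   /* identifier spelling, e.g. "x" */
    char type[16];   /* e.g. "int" */
    int  address;    /* assumed memory location */
};

static struct symbol table[256];
static int nsyms = 0;

/* Return the index of name in the table, or -1 if absent. */
int lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0) return i;
    return -1;
}

/* Enter name only if it is not already there, as the text describes. */
int enter(const char *name, const char *type, int address) {
    int i = lookup(name);
    if (i >= 0) return i;
    snprintf(table[nsyms].name, sizeof table[nsyms].name, "%s", name);
    snprintf(table[nsyms].type, sizeof table[nsyms].type, "%s", type);
    table[nsyms].address = address;
    return nsyms++;
}

int main(void) {
    /* int x, y, z;  -- the lexical analyser enters each identifier */
    enter("x", "int", 1000); enter("y", "int", 1004); enter("z", "int", 1008);
    printf("x is at address %d\n", table[lookup("x")].address);
    return 0;
}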

1.3.2 Syntax Analysis

Every language, whether it is a programming language or a natural language, follows certain grammatical rules that define the syntactic structures of the language. In the C language, for example, a program is made out of a main function consisting of blocks, a block out of statements, a statement out of expressions, an expression out of tokens, and so on. The syntax of programming language constructs can be described by Backus-Naur Form (BNF) notations. These types of notations are also called context-free grammars. Well-formed grammars offer significant advantages to the compiler designer:
(i) A grammar gives a precise, yet easy to understand, syntactic specification of a programming language.
(ii) Tools for designing a parser, which determines whether a source program is syntactically correct, can be developed from certain classes of grammars.
(iii) A well-designed grammar imparts a structure to a programming language that is useful for the translation of source programs into correct object code.

Syntax analysis is the second phase of the compilation process. This process is also called parsing. It performs the following operations:
1. Obtains a group of tokens from the lexical analyser.
2. Determines whether the string of tokens can be generated by the grammar of the language, i.e. it checks whether the expression is syntactically correct or not.
3. Reports syntax error(s), if any.

The output of parsing is a representation of the syntactic structure of a statement in the form of a parse tree (syntax tree).
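To give a feel for how a parser checks a token string against a grammar, here is a hedged C sketch of a recursive-descent parser for the arithmetic-expression grammar used later in this paper (expr, term, factor), rewritten without left recursion so that it suits the recursive-descent technique. The single-character token interface is an assumption for illustration.

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Grammar (left recursion removed for recursive descent):
     expr   -> term { '+' term }
     term   -> factor { '*' factor }
     factor -> '(' expr ')' | digit                        */

static int look;   /* one-character lookahead token */

static void advance(void) { do { look = getchar(); } while (look == ' '); }

static void error(const char *msg) {
    fprintf(stderr, "syntax error: %s\n", msg);
    exit(1);
}

static void expr(void);

static void factor(void) {
    if (look == '(') {
        advance(); expr();
        if (look != ')') error("expected ')'");
        advance();
    } else if (isdigit(look)) {
        advance();
    } else error("expected digit or '('");
}

static void term(void) {
    factor();
    while (look == '*') { advance(); factor(); }
}

static void expr(void) {
    term();
    while (look == '+') { advance(); term(); }
}

int main(void) {
    advance();
    expr();                    /* accepts e.g. "1+2*(3+4)" */
    if (look != '\n' && look != EOF) error("trailing input");
    puts("syntactically correct");
    return 0;
}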

1.3.3 Semantic Analysis

The role of the semantic analyzer is to derive methods by which the structures constructed by the syntax analyzer may be evaluated or analyzed.



The semantic analysis phase checks the source program for semantic errors and gathers data-type information for the subsequent code-generation phase. An important component of semantic analysis is type checking. Here the compiler checks that each operator has operands that are permitted by the source language specification. For example, many programming language definitions require a compiler to report an error every time a real number is used to index an array, as in a[5.6]; here 5.6 is a real value, not an integer.
To illustrate some of the actions of a semantic analyzer, consider the expression a+b-c*d in a language such as Pascal, where a, b and c have data type integer and d has type real. The syntax analyzer produces a parse tree of the form shown in figure 4(a).
One of the tasks of the semantic analyzer is to perform type checking within this expression. By consulting the symbol table, the data types of all the variables can be inserted into the tree, as shown in figure 4(b). The semantic analyzer then performs any required type conversion and labels each node accordingly.

Figure 4: Semantic analysis of an arithmetic expression
The semantic analyser can determine the types of the intermediate results and thus propagate the type attributes through the tree, checking for compatibility as it goes. In our example, the semantic analyzer first considers the types of c and d. According to the Pascal semantic rule integer * real -> real, the * node can be labelled as real. This is shown in figure 4(c).
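A hedged C sketch of this bottom-up propagation of type attributes through a parse tree follows. The node layout and the single conversion rule are assumptions made for illustration.

#include <stdio.h>

typedef enum { T_INT, T_REAL } Type;

/* A parse-tree node for a binary arithmetic expression. */
struct node {
    char op;                  /* '+', '-', '*' or 0 for a leaf */
    Type type;                /* filled in during semantic analysis */
    struct node *left, *right;
};

/* Pascal-style rule: if either operand is real, the result is real
   (integer op integer -> integer, integer op real -> real, etc.). */
static Type check(struct node *n) {
    if (n->op == 0) return n->type;       /* leaf: type from the symbol table */
    Type lt = check(n->left), rt = check(n->right);
    n->type = (lt == T_REAL || rt == T_REAL) ? T_REAL : T_INT;
    return n->type;
}

int main(void) {
    /* a + b - c * d, with a, b, c integer and d real */
    struct node a = {0, T_INT}, b = {0, T_INT}, c = {0, T_INT}, d = {0, T_REAL};
    struct node mul = {'*', T_INT, &c, &d};
    struct node add = {'+', T_INT, &a, &b};
    struct node sub = {'-', T_INT, &add, &mul};
    printf("root type: %s\n", check(&sub) == T_REAL ? "real" : "integer");
    return 0;
}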
Compilers vary widely in the role taken by the semantic analyzer. In some simpler compilers there is no easily identifiable semantic analysis phase; the syntax analyzer itself does semantic analysis and intermediate code generation directly. In other compilers, syntax analysis, semantic analysis and code generation are separate phases. In the next section we discuss the code generation phase.

1.3.4 Code Generation and Optimization

The final phase of the compiler is the code generator. The code generator takes as input an intermediate representation (in the form of a parse tree) of the source program and produces as output an equivalent target program (figure 5).
The target program may take on a variety of forms: absolute machine language, relocatable machine language or assembly language.

Figure 5: Code generation phase

Producing an absolute machine-language program as output has the advantage that it can be placed in a fixed location in memory and immediately executed.
Producing a relocatable machine-language program (object module) as output allows sub-programs to be compiled separately. A set of relocatable object modules can be linked together and loaded for execution by a linking loader. The process of linking and loading to produce relocatable object code may be a little time-consuming, but it provides flexibility in being able to compile subroutines separately and to call other previously compiled programs from the object module. If the target machine does not handle relocation automatically, the compiler must provide relocation information to the loader so that the separately compiled program segments can be linked.
Producing an assembly-language program as output makes the process of code generation somewhat simpler. We can generate symbolic instructions and use the macro facilities of the assembler to help generate code.
A thorough knowledge of the target machine's architecture, as well as its instruction set, is required to write a good code generator. The code generator is concerned with the choice of machine instructions, allocation of machine registers, addressing, and interfacing with the operating system. To produce faster and more compact code, the code generator should include some form of code optimization. This may exploit techniques such as the use of special-purpose machine instructions or addressing modes, register optimization, etc. Code optimization may incorporate both machine-dependent and machine-independent techniques.
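To illustrate the idea, here is a hedged C sketch of a code generator that walks a parse tree and emits instructions for a hypothetical stack machine. The instruction names (PUSH, ADD, SUB, MUL) are invented for the example and do not correspond to any particular target machine.

#include <stdio.h>

/* Same parse-tree shape as before: op == 0 marks a leaf. */
struct node {
    char op;                 /* '+', '-', '*' or 0 */
    const char *name;        /* leaf: variable name */
    struct node *left, *right;
};

/* Post-order walk: emit code for both operands, then the operator. */
static void gen(const struct node *n) {
    if (n->op == 0) { printf("PUSH %s\n", n->name); return; }
    gen(n->left);
    gen(n->right);
    switch (n->op) {
        case '+': puts("ADD"); break;
        case '-': puts("SUB"); break;
        case '*': puts("MUL"); break;
    }
}

int main(void) {
    /* a + b - c * d, parsed as (a + b) - (c * d) */
    struct node a = {0, "a"}, b = {0, "b"}, c = {0, "c"}, d = {0, "d"};
    struct node mul = {'*', 0, &c, &d};
    struct node add = {'+', 0, &a, &b};
    struct node sub = {'-', 0, &add, &mul};
    gen(&sub);               /* prints PUSH a, PUSH b, ADD, PUSH c, ... */
    return 0;
}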

1.4 Software Writing Tools

The two best-known software tools for compiler construction are Lex (a lexical analyser generator) and Yacc (a parser generator), both of which are available under the UNIX operating system. Their continuing popularity is partly due to their widespread availability, but also because they are powerful and easy to use, with a wide range of applicability. This section describes these two software tools.

1.4.1 Lex
Lex is a software tool that takes as input a specification of a set of regular expressions, together with the actions to be taken on recognising each of these expressions. The output of Lex is a program that recognizes the regular expressions and acts appropriately on each. Lex is generally used in the manner depicted in figure 6.

Figure 6: Creating a lexical analyser with Lex

First, a specification of the lexical analyser is prepared by creating a program lex.l in the Lex language. Then lex.l is run through the Lex compiler to produce a C program lex.yy.c, which consists of a C-language recogniser for the regular expressions together with the user-supplied code. Finally, lex.yy.c is run through the C compiler to produce an object program a.out, which is a lexical analyser that transforms an input stream into a sequence of tokens.

Lex supports a very powerful range of operators for the construction of regular expressions. For example, a regular expression for an identifier can be written as:
[A-Za-z] [A-Za-z0-9]*
which represents an arbitrary string of letters and digits beginning with a letter, suitable for matching an identifier in many programming languages. Here is a list of Lex operators, with examples:

Operator                Example   Meaning
* (asterisk)            a*        The set of all strings of zero or more a's, i.e. {empty, a, aa, aaa, ...}
| (or)                  a|b       Either a or b
+ (one or more)         a+        One or more instances of a, i.e. a, aa, aaa, etc.
? (optional)            a?        Zero or one instance of a
[ ] (character class)   [abc]     a | b | c; an alphabetical character class such as [a-z] denotes the regular expression a | b | ... | z

Lex specifications:
A Lex program consists of three parts:
declarations
%%
translation rules
%%
user routines
Any of these three sections may be empty, but the %% separator between the definitions and the rules cannot be omitted.


The declarations section includes declarations of variables, constants and regular definitions. Regular definitions are statements used as components of the regular expressions appearing in the translation rules.
The translation rules of a Lex program, which are the key part of the Lex input, are statements of the form:
p1   { action 1 }
p2   { action 2 }
...
pn   { action n }


where each pi is a regular expression and each action is a program fragment describing what action the lexical analyser should take when pattern pi matches a token. In Lex, the actions are generally written in C; however, they can be in any implementation language.


The third section contains the user routines that are needed by the actions. Alternatively, these procedures can be compiled separately and loaded with the lexical analyser.
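To tie the three parts together, here is a small, hedged sketch of a complete Lex specification in this form. The token codes and the standalone main() are illustrative assumptions, not taken from this paper.

%{
/* Declarations: C code copied verbatim into the generated lex.yy.c. */
#include <stdio.h>
#define IDENT  1
#define NUMBER 2
%}
%%
[ \t\n]+              ;                    /* skip blanks, tabs and newlines */
[A-Za-z][A-Za-z0-9]*  { return IDENT; }    /* the identifier pattern above */
[0-9]+                { return NUMBER; }
.                     { return yytext[0]; }
%%
/* User routines, needed when the generated analyser is run standalone. */
int yywrap(void) { return 1; }

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)
        printf("token %d: %s\n", tok, yytext);
    return 0;
}

Running lex on this file and compiling the resulting lex.yy.c with a C compiler yields an a.out that prints one token per lexeme it recognises.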

1.4.2 Yacc

Yacc (Yet Another Compiler-Compiler) assists in the next phase of the compiler. It creates a parser, which is output in a form suitable for inclusion in the next phase. Yacc is available as a command (utility) on the UNIX system and has been used to help implement hundreds of compilers.
A parser can be constructed using Yacc in the manner illustrated in figure 7.

Figure 7: Yacc functioning

First, a file, say parse.y, containing a Yacc specification for an expression is prepared. The UNIX system command yacc parse.y transforms the file parse.y into a C program called y.tab.c, which is a representation of the parser written in C, along with other C programs that the user may have prepared. y.tab.c is run through the C compiler to produce an object program a.out that performs the translation specified by the original Yacc program. A Yacc source program also has three parts, like Lex. This can be expressed in the Yacc specification as:
declarations
%%
translation rules
%%
supporting C routines
Example: To illustrate how to prepare a Yacc source program, let us construct a simple desk calculator that reads an arithmetic expression, evaluates it, and then prints its numeric value. We shall build the desk calculator starting with the following grammar for arithmetic expressions:
expr -> expr + term | term
term -> term * factor | factor
factor -> ( expr ) | digit
The token digit is a single digit ranging from 0 to 9. A Yacc desk calculator program derived from this grammar is shown in figure 8.

The declarations part. There are two optional sections in the declarations part of a Yacc program. In the first section, we write ordinary C declarations, delimited by %{ and %}. Here we place declarations of any temporaries used by the translation rules or procedures of the second and third sections. In figure 8, this section contains only the include-statement
#include <ctype.h>
which causes the C pre-processor to include the standard header file <ctype.h> containing the predicate isdigit.
Also in the declarations part are declarations of grammar tokens. In figure 8 the statement
%token DIGIT
declares DIGIT to be a token. Tokens declared in this section can then be used in the second and third parts of the Yacc specification.
The translation rules part. In the part of the Yacc specification after the first %% pair, we put the translation rules. Each rule consists of a grammar production and the associated semantic action. A set of productions that we have been writing as
<left side> -> <alt 1> | <alt 2> | ... | <alt n>
would be written in Yacc as
<left side> : <alt 1>  { semantic action 1 }
            | <alt 2>  { semantic action 2 }
            ...
            | <alt n>  { semantic action n }
            ;
In a Yacc production, a quoted single character 'c' is taken to be the terminal symbol c, and unquoted strings of letters and digits not declared to be tokens are taken to be nonterminals. Alternative right sides are separated by a vertical bar, and a semicolon follows each left side with its alternatives and semantic actions. The left side of the first production is taken to be the start symbol.

Figure 8: Yacc specification of a simple desk calculator
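The figure itself is not reproduced in this text, so what follows is a hedged reconstruction of the desk-calculator specification from the surrounding description; it follows the classic textbook example, but is not verbatim from this paper. (As the text says, the C declarations section contains only <ctype.h>; a modern C compiler would also want <stdio.h> for printf.)

%{
#include <ctype.h>   /* for the predicate isdigit */
%}
%token DIGIT
%%
line   : expr '\n'         { printf("%d\n", $1); }
       ;
expr   : expr '+' term     { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor   { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'      { $$ = $2; }
       | DIGIT
       ;
%%
/* Supporting C routine: the lexical analyser yylex() returns each token
   and passes a digit's value to the parser through yylval. */
int yylex(void) {
    int c = getchar();
    if (isdigit(c)) { yylval = c - '0'; return DIGIT; }
    return c;   /* any other character is returned as the token itself */
}

Linking the generated y.tab.c with the Yacc library (cc y.tab.c -ly) supplies the default main() and yyerror() routines.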

A Yacc semantic action is a sequence of C statements. In a semantic action, the symbol $$ refers to the attribute value associated with the nonterminal on the left side, while $i refers to the value associated with the i-th grammar symbol (terminal or nonterminal) on the right. The semantic action is performed whenever we reduce by the associated production, so normally the semantic action computes a value for $$ in terms of the $i's. In the Yacc specification, we have written the two production rules for expressions (expr),
expr -> expr + term | term
and their associated semantic actions, as
expr : expr '+' term  { $$ = $1 + $3; }
     | term
     ;
Note that the nonterminal term in the first production is the third grammar symbol on the right, while '+' is the second. The semantic action associated with the first production adds the value of the expr and the term on the right and assigns the result as the value for the nonterminal expr on the left. We have omitted the semantic action for the second production altogether, since copying the value is the default action for productions with a single grammar symbol on the right. In general, { $$ = $1; } is the default semantic action.
Notice that we have added a new starting production
line : expr '\n'  { printf("%d\n", $1); }
to the Yacc specification. This production says that an input to the desk calculator is to be an expression followed by a newline. The semantic action associated with this production prints the decimal value of the expression.
The supporting C-routines part. The third part of a Yacc specification consists of supporting C routines. A lexical analyser by the name yylex() must be provided. Other procedures, such as error recovery routines, may be added as necessary.
The lexical analyser yylex() produces pairs consisting of a token and its associated attribute value. If a token such as DIGIT is to be returned, the token must be declared in the first section of the Yacc specification. The attribute value associated with a token is communicated to the parser through the Yacc-defined variable yylval.
The C-language routine reads input characters one at a time using getchar(). If the character is a digit, the value of the digit is stored in the variable yylval and the token DIGIT is returned. Otherwise, the character itself is returned as the token.
The power and utility of Yacc should not be underestimated. The effort saved during compiler implementation by using Yacc rather than a handwritten parser can be considerable.

1.5 Program Development Tools

There are other language support tools that a programmer can use to reduce the cost of software development. These additional facilities include tools for program editing, debugging, analysis and documentation. Ideally, such tools should be closely integrated with the language implementation so that, for example, when the compiler detects syntactic errors in the source program, the editor can be entered automatically with the cursor indicating the position of the error.
The rapid production of syntactically correct programs is a major goal. The use of syntax-directed editors is increasing, and they can assist the user in eliminating syntax errors as the program is being input. Such an editor only accepts syntactically correct input and prompts the user, if necessary, to input only those constructs that are syntactically correct at the current position. These editors can considerably improve development efficiency by removing the need for repeated runs of the compiler just to remove syntax errors.
Debugging tools should also be integrated with the programming language implementation, so that bugs can be found and removed in the shortest period of time.
Integrated program development environments are now considered an important aid for the rapid construction of correct software. Considerable progress has been made over the last decade in the provision of powerful software tools to ease the programmer's burden. The application of the basic principles of language and compiler design will help this development continue.

1.6 Conclusion
This paper discussed several issues related to compilers. The initial discussion focused on approaches to compiler development and on the compiler design phases, which include lexical analysis, parsing, semantic analysis and code generation, while the latter part examined two important software tools, Lex and Yacc, as well as program development tools, all of which greatly simplify the implementation of a compiler.

