A program can be written in assembly language as well as in a high-level language. This written
program is called the source program. The source program must be converted into machine
language, and the converted form is called the object program. A translator is required for this
conversion. A program translator translates the source code of a programming language into
machine-language instruction code. Computer programs are generally written in languages such as
COBOL, C, BASIC, and assembly language, which must be translated into machine language before
execution.
Assembler
Compiler
Interpreter
In other words, "A compiler is a software program that compiles program source code files into
an executable program. It is included as part of the integrated development environment (IDE) with
most programming software packages".
Assembler: A program that translates an assembly language program into a machine language
program is called an assembler. If an assembler runs on a computer and produces machine code
for that same computer, it is called a self assembler or resident assembler. If an assembler runs
on one computer and produces machine code for a different computer, it is called a cross
assembler.
Assemblers are further divided into two types: one-pass assemblers and two-pass assemblers. A
one-pass assembler assigns memory addresses to the variables and translates the source code into
machine code simultaneously, in a single pass. A two-pass assembler reads the source code twice.
In the first pass, it reads all the variables and assigns them memory addresses. In the second
pass, it reads the source code and translates it into object code.
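The two passes described above can be sketched in C. This is a minimal illustration, not a real assembler: the toy "label:" / "OP operand" line format and the one-word-per-instruction layout are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy two-pass assembler: each source line is either "label:" or "OP operand".
   Pass 1 assigns an address to every label; pass 2 resolves operand labels. */

#define MAXSYMS 16

struct sym { char name[16]; int addr; } symtab[MAXSYMS];
int nsyms = 0;

int lookup(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(symtab[i].name, name) == 0)
            return symtab[i].addr;
    return -1; /* undefined symbol */
}

void assemble(const char *src[], int nlines, int out[]) {
    int addr, i;
    /* Pass 1: record each "label:" with the address of the next instruction. */
    addr = 0;
    for (i = 0; i < nlines; i++) {
        size_t len = strlen(src[i]);
        if (src[i][len - 1] == ':') {
            strncpy(symtab[nsyms].name, src[i], len - 1);
            symtab[nsyms].name[len - 1] = '\0';
            symtab[nsyms].addr = addr;
            nsyms++;
        } else {
            addr++; /* each instruction occupies one word */
        }
    }
    /* Pass 2: emit the resolved address of each operand label. */
    addr = 0;
    for (i = 0; i < nlines; i++) {
        const char *sp = strchr(src[i], ' ');
        if (src[i][strlen(src[i]) - 1] == ':')
            continue;
        out[addr++] = sp ? lookup(sp + 1) : -1;
    }
}
```

Note that a forward reference such as a jump to a label defined later is exactly what forces the second pass: its address is unknown until pass 1 has seen the whole file.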
With a compiler, the machine code is saved permanently for future use. The machine code
produced by an interpreter, on the other hand, is not saved. An interpreter is a small program
compared to a compiler. It occupies less memory space, so it can be used on a smaller system
with limited memory space.
Applications of compilers
Interpreter Vs Compiler
1. Interpreter: Translates the program one statement at a time.
   Compiler: Scans the entire program and translates it as a whole into machine code.
2. Interpreter: Takes less time to analyze the source code, but the overall execution time is slower.
   Compiler: Takes more time to analyze the source code, but the overall execution time is comparatively faster.
4. Interpreter: Continues translating the program until the first error is met, in which case it stops. Hence debugging is easy.
   Compiler: Generates the error message only after scanning the whole program. Hence debugging is comparatively hard.
Language extension: compilers add capabilities to a host language by what amounts to built-in
macros; e.g. SQL is a database query language that can be embedded in C.
Assembler
Programmers found it difficult to write or read programs in machine language. They began to use a
mnemonic (symbol) for each machine instruction, which they would subsequently translate into
machine language by hand. Such a mnemonic machine language is now called an assembly language.
Programs known as assemblers were written to automate the translation of assembly language into
machine language. The input to an assembler is called the source program; the output is a
machine language translation (object program).
Linker: High-level languages provide built-in header files or libraries. These libraries are
predefined and contain basic functions that are essential for executing a program. Calls to these
functions are connected to the libraries by a program called a linker. If the linker cannot find
the library for a function, it informs the compiler, and the compiler generates an error. The
compiler automatically invokes the linker as the last step in compiling a program.
Besides built-in libraries, the linker also links user-defined functions to user-defined libraries.
Usually a long program is divided into smaller subprograms called modules, and these modules must
be combined before the program can execute. The process of combining the modules is done by the
linker.
The assembler processes one file at a time. Thus the symbol table produced while processing file
A is independent of the symbols defined in file B, and conversely. Thus, it is likely that the same
address will be used for different symbols in each program. The technical term is that the
(local) addresses in the symbol table for file A are relative to file A; they must be relocated
by the linker. This is accomplished by adding the starting address of file A (which in turn is the
sum of the lengths of all the files processed previously in this run) to the relative address.
Assume procedure f, in file A, and procedure g, in file B, are compiled (and assembled)
separately. Assume also that f invokes g. Since the compiler and assembler do not see g when
processing f, it appears impossible for procedure f to know where in memory to find g. The
solution is for the compiler to indicate in the output of the file A compilation that the address of g
is needed. This is called a use of g. When processing file B, the compiler outputs the (relative)
address of g. This is called the definition of g. The assembler passes this information to the linker.
The simplest linker technique is to again make two passes. During the first pass, the linker records
in its external symbol table (a table of external symbols, not a symbol table that is stored
externally) all the definitions encountered. During the second pass, every use can be resolved by
access to the table.
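The relocation and use/definition mechanism above can be sketched in C. This is a simplified illustration of the two linker passes, assuming each definition carries a file-relative address and the start address of its file; real object-file formats are more elaborate.

```c
#include <assert.h>
#include <string.h>

/* Toy two-pass linker step: pass 1 collects definitions into the external
   symbol table, relocating each file-relative address by adding the start
   address of its file; pass 2 resolves each use by table lookup. */

#define MAXDEFS 16

struct def { char name[16]; int addr; } defs[MAXDEFS];
int ndefs = 0;

/* Pass 1: record a definition, relocating its file-relative address. */
void define(const char *name, int file_start, int rel_addr) {
    strcpy(defs[ndefs].name, name);
    defs[ndefs].addr = file_start + rel_addr;
    ndefs++;
}

/* Pass 2: resolve a use; returns -1 for an undefined external symbol. */
int resolve(const char *name) {
    for (int i = 0; i < ndefs; i++)
        if (strcmp(defs[i].name, name) == 0)
            return defs[i].addr;
    return -1;
}
```

For instance, if file A (length 100) is loaded at address 0 and file B follows at 100, a procedure g defined at relative address 40 in file B resolves to absolute address 140.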
Loader: A loader is a program that loads the machine code of a program into system memory. In
computing, the loader is the part of an operating system that is responsible for loading programs.
It is one of the essential stages in the process of starting a program, because it places programs
into memory and prepares them for execution. Loading a program involves reading the contents of
the executable file into memory. Once loading is complete, the operating system starts the program
by passing control to the loaded program code. All operating systems that support program
loading have loaders. In many operating systems the loader is permanently resident in memory.
Examples of Compilers:
Phases of a Compiler
A compiler operates in phases, each of which transforms the source program from one
representation to another. There are six phases of compiler:
Lexical Analysis Phase: The lexical phase reads the characters in the source program and groups
them into a stream of tokens, in which each token represents a logically cohesive sequence of
characters, such as an identifier, a keyword, or a punctuation character. The character sequence
forming a token is called the lexeme for the token.
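Grouping characters into tokens can be sketched as follows. This is a minimal hand-written scanner for illustration; the token kinds (ID, NUM, PUNCT) and the input format are assumptions, not part of any particular compiler.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Minimal lexical-analysis step: scan one token from the input and
   classify it as an identifier, a number, or a punctuation character. */

enum kind { ID, NUM, PUNCT };

/* Reads one token starting at *p, copies its lexeme into lexeme,
   returns its kind, and advances *p past the lexeme. */
enum kind next_token(const char **p, char *lexeme) {
    const char *s = *p;
    char *out = lexeme;
    while (*s == ' ') s++;                       /* skip whitespace */
    if (isalpha((unsigned char)*s)) {            /* identifier: letter(letter|digit)* */
        while (isalnum((unsigned char)*s)) *out++ = *s++;
        *out = '\0'; *p = s; return ID;
    }
    if (isdigit((unsigned char)*s)) {            /* number: digit+ */
        while (isdigit((unsigned char)*s)) *out++ = *s++;
        *out = '\0'; *p = s; return NUM;
    }
    *out++ = *s++; *out = '\0'; *p = s;          /* single punctuation character */
    return PUNCT;
}
```

Running it over the statement count1 = 42 ; yields the token stream ID("count1"), PUNCT("="), NUM("42"), PUNCT(";"), with the whitespace discarded.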
Syntax Analysis Phase: Syntax analysis imposes a hierarchical structure on the token stream. This
hierarchical structure is called a syntax tree. In a syntax tree, an interior node is a record with
a field for the operator and two fields containing pointers to the records for the left and right
children. A leaf is a record with two or more fields: one to identify the token at the leaf, and
the others to record information about the token.
Semantic Analysis Phase: This phase checks the source program for semantic errors and gathers
type information for the subsequent code-generation phase. It uses the hierarchical structure
determined by the syntax-analysis phase to identify the operators and operands of expressions
and statements. An important component of semantic analysis is type checking.
Intermediate Code Generation: The syntax and semantic analyses generate an explicit intermediate
representation of the source program. The intermediate representation should have two
important properties: it should be easy to produce, and easy to translate into the target program.
Intermediate representations can have a variety of forms. One of the forms is three-address
code, which is like the assembly language for a machine in which every location can act like a
register. Three-address code consists of a sequence of instructions, each of which has at most
three operands.
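For example, an assignment such as a = b * c + d (a hypothetical statement, chosen for illustration) could be translated into the following three-address instructions, each with at most three operands and using compiler-generated temporaries t1 and t2:

```
t1 = b * c
t2 = t1 + d
a  = t2
```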
Code Optimization: Code optimization phase attempts to improve the intermediate code, so that
faster-running machine code will result.
Code Generation: The final phase of the compiler is the generation of target code, consisting
normally of relocatable machine code or assembly code. Memory locations are selected for each
of the variables used by the program. Then each intermediate instruction is translated into a
sequence of machine instructions that perform the same task.
Symbol Table Management: A symbol table is a data structure containing a record for each
identifier, with fields for the attributes of the identifier. It records the identifiers used in the
source program and collects information about each identifier, such as its type and where it is
declared.
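A symbol table record can be sketched as a C struct. The particular attribute fields shown (type name, declaration line) are illustrative assumptions; real compilers store many more attributes, and use hashing rather than linear search.

```c
#include <assert.h>
#include <string.h>

/* Sketch of a symbol table: one record per identifier, with fields
   for its attributes. */

#define MAXIDS 32

struct entry {
    char name[32];   /* the identifier's lexeme */
    char type[16];   /* e.g. "int", "float" */
    int  line;       /* line where it was declared */
};

struct entry table[MAXIDS];
int nentries = 0;

/* Record an identifier and its attributes. */
void insert(const char *name, const char *type, int line) {
    strcpy(table[nentries].name, name);
    strcpy(table[nentries].type, type);
    table[nentries].line = line;
    nentries++;
}

/* Returns the entry for name, or NULL if the identifier is unknown. */
struct entry *lookup(const char *name) {
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i].name, name) == 0)
            return &table[i];
    return 0;
}
```

Every phase consults this table: the lexer inserts identifiers, semantic analysis reads their types, and code generation reads their storage information.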
Error Detection and Reporting: Each phase can encounter errors. The lexical phase detects input
that does not form a token. The syntax phase detects tokens that violate the syntax rules. The
semantic phase detects constructs that are syntactically well formed but have no meaning.
Analysis and Synthesis phases of Compiler
A compiler can broadly be divided into two phases based on the way it compiles. A pass refers to
the traversal of a compiler through the entire program. The analysis phase consists of the lexical,
syntax, and semantic phases, and the synthesis phase consists of intermediate code generation,
code optimization, and target code generation.
The Analysis consists of three steps.
Lexical Analysis: The Lexical Analysis in which the stream of characters making up the
source program is read from left-to-right and converted into a stream of words, where a
word is a sequence of characters with a collective meaning.
Syntactic Analysis: The Syntactic Analysis in which words are grouped into nested
collections (grammatical phrases) with a collective meaning represented by a PARSE TREE.
Semantic Analysis: The Semantic Analysis in which certain checks are performed to ensure
that the components of a program fit together meaningfully (type checking, type
conversion) and to report on errors.
Code generator: Code generator produces the object code by deciding on the memory
locations for data, selecting code to access each datum and selecting the registers in which
each computation is to be done. Many computers have only a few high speed registers in
which computations can be performed quickly. A good code generator would attempt to
utilize registers as efficiently as possible.
Code Optimization: This is an optional phase, designed to improve the intermediate code so
that the output runs faster and takes less space. Its output is another intermediate-code
program that does the same job as the original, but in a way that saves time and/or
space.
Compiler structure
Front End (Language specific) and Back End (Machine specific) parts of compilation
Passes of Compiler
1. Phase and Pass are two terms used in the area of compilers.
2. Pass is a reading of a file followed by processing of data from file.
3. Phase is a logical part of the compilation process.
4. A pass is a single time the compiler passes over (goes through) the sources code or some
other representation of it.
5. Typically, most compilers have at least two phases called the front end and the back end,
while they could be either one-pass or multi-pass.
Note: Phase is used to classify compilers according to their construction, while pass is used to
classify compilers according to how they operate.
One pass Compiler: A single pass compiler makes a single pass through the source text, parsing,
analyzing, and generating code only once. In other words, it allows the source code to pass
through each compilation unit only once. It immediately translates each code section into its final
machine code.
Main stages of a single-pass compiler are lexical analysis, syntax analysis, and code generation.
First, the lexical analyzer scans the source code and divides it into tokens. Every programming
language has a grammar, which represents the syntax and legal statements of the language. Then,
the syntax analyzer determines the language constructs described by the grammar. Finally, the
code generator generates the target code. Overall, a single-pass compiler does not optimize the
code; moreover, there is no intermediate code generation.
Fig: Dependency diagram of a typical one-pass compiler
Multi-pass Compiler: A multi-pass compiler converts the program into one or more
intermediate representations between source code and machine code, and reprocesses the
entire compilation unit in each sequential pass. It processes the source code or abstract syntax
tree of a program several times. Multi-pass compilers are sometimes called wide compilers. Each
pass takes the result of the previous pass as its input and creates an intermediate output. In this
way, the (intermediate) code is improved pass by pass, until the final pass emits the final code.
A multi-pass compiler also performs additional tasks such as intermediate code generation,
machine-dependent code optimization, and machine-independent code optimization.
Types of Compilers:
1. Native code compiler: A compiler that compiles source code for the same type of
platform only. The output generated by this type of compiler can only be run on the same
type of computer system and OS that the compiler itself runs on.
2. Cross compiler: A compiler that compiles source code for a different kind of platform.
Used in making software for embedded systems that can be used on multiple platforms.
3. Source-to-source compiler: A compiler that takes high-level language code as input and
outputs source code of another high-level language. Unlike other compilers, which
convert a high-level language into low-level machine language, it can take code written
in Pascal and transform it into C: the conversion of one high-level language into another
high-level language at the same level of abstraction. Thus, it is also known as a transpiler.
4. One-pass compiler: A compiler that performs the whole compilation process in only one
pass.
5. Threaded code compiler: A compiler that simply replaces a string with an appropriate
binary code.
6. Incremental compiler: A compiler that compiles only the modified/changed lines of the
source code and updates the object code.
7. Source compiler: A compiler that converts high-level language source code into
assembly language only.
Lexical analysis
Lexical analysis is the first phase of a compiler. It takes the modified source code from language
preprocessors that are written in the form of sentences. The lexical analyzer breaks these syntaxes
into a series of tokens, by removing any whitespace or comments in the source code.
If the lexical analyzer finds a token invalid, it generates an error. The lexical analyzer works closely
with the syntax analyzer. It reads character streams from the source code, checks for legal tokens,
and passes the data to the syntax analyzer when the latter demands it.
Tokens
A lexeme is the sequence of characters (alphanumeric) forming a token. There are
predefined rules for every lexeme to be identified as a valid token. These rules are defined by
grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns
are defined by means of regular expressions.
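Such a pattern can be tested directly. A sketch using the POSIX regex API (assuming a POSIX system; this is an illustration, not part of any lexer generator) checks whether a lexeme matches the usual identifier pattern letter(letter|digit)*:

```c
#include <assert.h>
#include <regex.h>

/* Returns 1 iff lexeme matches the identifier pattern
   letter(letter|digit)*, expressed as a POSIX extended regex. */
int is_identifier(const char *lexeme) {
    regex_t re;
    int ok;
    /* ^...$ anchors the pattern so the whole lexeme must match. */
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED) != 0)
        return 0;
    ok = (regexec(&re, lexeme, 0, 0, 0) == 0);
    regfree(&re);
    return ok;
}
```

So "count1" matches the identifier pattern, while "1count" does not, because the first character must be a letter.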
Alphabets
An alphabet is any finite set of symbols: {0,1} is the binary alphabet,
{0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the set of
English-language letters.
Strings
Any finite sequence of alphabet symbols is called a string. The length of a string is the total
number of occurrences of symbols in it; e.g., the length of the string tutorialspoint is 14, denoted
|tutorialspoint| = 14. A string having no symbols, i.e. a string of zero length, is known as the empty
string and is denoted by ε (epsilon).
Special Symbols
A typical high-level language contains the following symbols:-
Arithmetic symbols: Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Assignment: =
Preprocessor: #
Language
A language is considered as a set of strings over some finite alphabet. Computer
languages are sets of strings, and mathematical set operations can be performed on
them. Languages that can be described by means of regular expressions are called regular languages.
Bootstrapping
In computer science, bootstrapping is the technique for producing a self-compiling compiler —
that is, a compiler (or assembler) written in the source programming language that it intends to
compile. An initial core version of the compiler (the bootstrap compiler) is written in a different
language (which could be assembly language); successive expanded versions of the compiler are
then developed using this minimal subset of the language.
Many compilers for many programming languages are bootstrapped, including compilers for
BASIC, ALGOL, C, D, Pascal, PL/I, Factor, Haskell, Modula-2, Oberon, OCaml, Common Lisp,
Scheme, Go, Java, Rust, Python, Scala, Nim, Eiffel, and more.
1. Source Language
2. Target Language
3. Implementation Language
T-Diagram
Tombstone diagrams (or T-diagrams) consist of a set of “puzzle pieces” representing compilers and
other related language processing programs. They are used to illustrate and reason about
transformations from a source language (left of T) to a target language (right of T) realized in an
implementation language (bottom of T). They are most commonly found, describing complicated
processes for bootstrapping, porting, and self-compiling of compilers, interpreters, and macro-
processors.
T-diagrams were first introduced for describing bootstrapping and cross-compiling compilers by
McKeeman et al. in 1971.
The T-diagram SCIT denotes a compiler C for source language S and target language T,
implemented in language I.
1. Create a compiler SCAA for a subset S of the desired language L (S ⊂ L) using language A;
this compiler runs on machine A.
2. Write LCSA, a compiler for the full language L, in the subset S, producing code for machine A.
3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L, which
runs on machine A and produces code for machine A.
fig: T-diagrams for bootstrapping (LCSA, SCAA, LCAA)
Cross Compiler
Cross Compiler is a compiler that runs on one computer but produces machine code for a different
type of computer. Cross compilers are used to generate software that can run on computers with
a new architecture or on special-purpose devices that cannot host their own compilers.
For example, a compiler that runs on a Windows 7 PC but generates code that runs on an Android
smartphone is a cross compiler.
A cross compiler is written in a different language from its target language. E.g. SNM denotes a
compiler for the language S, written in a language that runs on machine N, generating output
code that runs on machine M.
Steps involved in producing a compiler for a different machine B (i.e. a cross compiler):
1. Convert LCSA into LCLB (by hand, if necessary). Recall that language S is a subset of
language L.
2. Compile LCLB with LCAA (the available compiler on machine A) to produce LCAB, a cross-
compiler for L which runs on machine A and produces code for machine B.
3. Compile LCLB with the cross-compiler LCAB to produce LCBB, a compiler for language L
which runs on machine B.
Input Buffering
• To ensure that the right lexeme is found, one or more characters may have to be looked at
beyond the next lexeme.
• Hence a two-buffer scheme is introduced to handle large lookaheads safely.
• Techniques for speeding up the process of lexical analyzer such as the use of sentinels to mark
the buffer end have been adopted.
• There are three general approaches for the implementation of a lexical analyzer:
1. By using a lexical-analyzer generator, such as lex compiler to produce the lexical analyzer
from a regular expression based specification. In this, the generator provides routines for
reading and buffering the input.
2. By writing the lexical analyzer in a conventional systems-programming language, using I/O
facilities of that language to read the input.
3. By writing the lexical analyzer in assembly language and explicitly managing the reading of
input.
Buffer Pairs
Fig shows the buffer pairs which are used to hold the input data.
Scheme
• Consists of two buffers, each consists of N-character size which are reloaded alternatively.
• N-Number of characters on one disk block, e.g., 4096.
• N characters are read from the input file to the buffer using one system read command.
• eof is inserted at the end if the number of characters is less than N.
Pointers
lexemeBegin points to the beginning of the current lexeme, which is yet to be found.
forward scans ahead until a match for a pattern is found.
• Once a lexeme is found, forward is set to the character at its right end, and lexemeBegin is then
set to the character immediately after the lexeme just found.
• The current lexeme is the set of characters between the two pointers.
Disadvantages of this scheme
• This scheme works well most of the time, but the amount of lookahead is limited.
• This limited lookahead may make it impossible to recognize tokens in situations where the
distance that the forward pointer must travel is more than the length of the buffer.
• For example, in PL/I it cannot be determined whether DECLARE is a keyword or an array name
until the character that follows the right parenthesis is seen.
Sentinels
• In the previous scheme, each time the forward pointer is moved, a check is done to
ensure that it has not moved off one half of the buffer. If it has, then the other half must be
reloaded.
• Therefore the ends of the buffer halves require two tests for each advance of the forward
pointer.
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1;
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to the beginning of first half;
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
fig: Lookahead code with sentinels
Advantages
• Most of the time, it performs only one test, to see whether the forward pointer points to an eof.
• Only when it reaches the end of a buffer half or the end of input does it perform more tests.
• Since N input characters are encountered between eofs, the average number of tests per input
character is very close to 1.
A compiler or an interpreter for a programming language is often decomposed into two parts:
discovering the structure of the source program, and processing that structure.
Lex and Yacc can generate program fragments (either tokens or hierarchical structure) that solve
the first task. The task of discovering the source structure again is decomposed into subtasks:
1. Split the source file into tokens (Lex).
2. Find the hierarchical structure of the program (Yacc)
Lex source is a table of regular expressions and corresponding program fragments. The table is
translated to a program which reads an input stream, copying it to an output stream and
partitioning the input into strings which match the given expressions. As each such string is
recognized the corresponding program fragment is executed. The recognition of the expressions is
performed by a deterministic finite automaton generated by Lex. The program fragments written
by the user are executed in the order in which the corresponding regular expressions occur in the
input stream.
LEX specifications:
{declarations}
%%
{translation rules }
%%
{auxiliary procedures}
where the definitions and the user subroutines are often omitted. The second %% is optional, but
the first is required to mark the beginning of the rules. The absolute minimum LEX program is thus
%%
(no definitions, no rules) which translates into a program which copies the input to the output
unchanged.
p1 {action 1}
p2 {action 2}
p3 {action 3}
……
……
where each p is a regular expression and each action is a program fragment describing what
action the lexical analyzer should take when a pattern p matches a lexeme. In Lex, the actions
are written in C.
The third section holds whatever auxiliary procedures are needed by the actions.
Alternatively, these procedures can be compiled separately and loaded with the lexical
analyzer.
LEX is one such lexical analyzer generator which produces C code, based on the token
specifications. LEX tool has been widely used to specify lexical analyzers for a variety of
languages. We refer to the tool as Lex Compiler, and to its input specification as the Lex
language. Lex is generally used in the manner depicted in the above figure;
Example:
%{
/* definitions of manifest constants
LT, LE, EQ, NE, GT, GE,
IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws} {/* no action and no return */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
"<" {yylval = LT; return(RELOP);}
"<=" {yylval = LE; return(RELOP);}
"=" {yylval = EQ; return(RELOP);}
"<>" {yylval = NE; return(RELOP);}
">" {yylval = GT; return(RELOP);}
">=" {yylval = GE; return(RELOP);}
%%
int installID() {
/* function to install the lexeme, whose first character is pointed to by yytext, and
whose length is yyleng, into the symbol table and return a pointer thereto */
}
int installNum() {
/* similar to installID, but puts numerical constants into a separate table */
}
Structure of YACC
{ declarations}
%%
{Translation rules}
%%
{programs}
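Following this structure, a minimal Yacc specification might look like the sketch below. The grammar (sums of NUMBER tokens) is an illustrative assumption, not taken from the text; it shows the declarations section, one translation rule per production with its action, and the empty programs section.

```yacc
%{
#include <stdio.h>
%}
%token NUMBER
%%
expr : expr '+' term   { $$ = $1 + $3; }   /* expr -> expr + term */
     | term                                /* expr -> term        */
     ;
term : NUMBER
     ;
%%
```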
1. A variable that is being (partially) defined by the production. This variable is often
called the head of the production.
2. The production symbol →
3. A string of zero or more terminals and variables.
Example of CFG: Given a grammar G = ({S}, {a, b}, P, S). The set of productions P is
S →aSb
S →SS
S→ ε
This grammar generates strings such as abab, aaabbb, and aababb. If we assume that a is left
parenthesis ‘(’ and b is right parenthesis ‘)’, then L(G) is the language of all strings of properly
nested parentheses.
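Membership in L(G) for this particular grammar can be checked with a simple counter; a sketch (reading a as '(' and b as ')', and relying on the standard characterization of balanced parentheses):

```c
#include <assert.h>

/* Recognizer for the language of S -> aSb | SS | ε, reading a as '('
   and b as ')': a string is in L(G) iff the symbols balance overall
   and no prefix closes more parentheses than it has opened. */
int balanced(const char *s) {
    int depth = 0;
    for (; *s; s++) {
        if (*s == 'a') depth++;
        else if (*s == 'b') depth--;
        else return 0;            /* symbol not in the alphabet */
        if (depth < 0) return 0;  /* a ')' before its matching '(' */
    }
    return depth == 0;
}
```

Note the recognizer accepts the strings the text lists as generated (abab, aaabbb, aababb) and rejects strings such as ba, matching the properly-nested-parentheses reading of L(G).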
Derivation Trees
A ‘derivation tree’ is an ordered tree in which the nodes are labeled with the left sides of
productions and in which the children of a node represent the corresponding right sides.
Let G = (V, T, S, P) be a CFG. An ordered tree is a derivation tree for G iff it has the following
properties:
1. The root is labeled S.
2. Every leaf has a label from T ∪ {ε}.
3. Every interior vertex has a label from V.
4. If a vertex has label A ∈ V, and its children are labeled (from left to right) a1, a2, …, an,
then P must contain a production of the form A → a1a2…an.
5. A leaf labeled ε has no siblings.
Example: Consider the grammar with productions
S → aSS
S → b
For the string w = aababbb, we have the derivation tree shown in the figure.
Ambiguity of a Grammar
The grammar given by
Generates strings having an equal number of a’s and b’s. The string “abab” can be generated from
this grammar in two distinct ways, as shown in the following derivation trees:
Similarly, “abab” has two distinct leftmost derivations:
Each of the above derivation trees can be turned into a unique rightmost derivation, or into a
unique leftmost derivation. Each leftmost or rightmost derivation can be turned into a unique
derivation tree. These representations are largely interchangeable.
Since derivation trees, leftmost derivations, and rightmost derivations are equivalent notations,
the following definitions are equivalent:
A string w ∈ L(G) is said to be “ambiguously derivable” if there are two or more different derivation
trees for that string in G.
Definition: A CFG given by G = (N, T, P, S) is said to be “ambiguous” if there exists at least one
string in L(G) which is ambiguously derivable. Otherwise it is unambiguous.
Ambiguity is a property of a grammar, and it is usually, but not always, possible to find an
equivalent unambiguous grammar. An “inherently ambiguous language” is a language for which no
unambiguous grammar exists.
Solution: In order to show that G is ambiguous, we need to find a w ∈ L(G) which is ambiguously
derivable. Assume w = abababa. The two derivation trees for w = abababa are shown below in
Fig. (a) and (b). Therefore, the grammar G is ambiguous.