
STUDY MATERIAL

SUBJECT  : INTRODUCTION TO COMPILER DESIGN
COURSE   : III - B.C.A.
SEMESTER : V
UNIT     : I
STAFF    : S.UMA

UNIT - I

SYLLABUS
Compilers - Analysis of the source program - Phases of a compiler - Cousins of the compiler - Grouping of phases. Simple one-pass compiler - Overview - Syntax definition - Syntax-directed translation - Parsing - A translator for simple expressions. Lexical analysis - Removal of white space and comments - Constants - Recognizing identifiers and keywords - A lexical analyzer - Role of the lexical analyzer - Input buffering - Specification of tokens - Recognition of tokens.

INTRODUCTION

COMPILERS
A compiler is a program that translates an entire source program into machine language:

    Source Program --> Compiler --> Target Program
                          |
                    Error messages

Compilers are classified as:
i) single-pass (one-pass) - generates modest input/output activity. Example : the code for a procedure body is compiled in memory and written out as a unit to secondary storage.
ii) multi-pass
iii) load-and-go
iv) debugging (symbolic debugging)
v) optimizing (code optimization)

Two parts of compilation:
i) Analysis - breaks up the source program into pieces and creates an intermediate representation of the source program.
ii) Synthesis - constructs the desired target program from the intermediate representation.
During analysis, the operations implied by the source program are determined and recorded in a hierarchical structure called a tree.

SYNTAX TREE
Each node represents an operation, and the children of a node represent the arguments of the operation. For example, an assignment statement in Pascal is given as

    position := initial + rate * 60

The syntax tree is:

    :=
    |-- position
    |-- +
        |-- initial
        |-- *
            |-- rate
            |-- 60

EXAMPLES OF SOFTWARE TOOLS
i) Structure Editors
A structure editor takes as input a sequence of commands to build a source program. It also analyzes the program text, putting an appropriate hierarchical structure on the source program.
Example : When the user types while, the editor supplies the matching do.
ii) Pretty Printers
A pretty printer analyzes a program and prints it so that the structure of the program becomes clearly visible.
Example : Comments may appear in a special font, and statements may appear with an amount of indentation proportional to the depth of their nesting in the hierarchical organization of the statements.
iii) Static Checkers
A static checker reads a program, analyzes it, and attempts to identify potential bugs without running the program.
Example : It can catch logical errors, such as the use of a real variable as a pointer.
iv) Interpreters
Instead of producing a target program as a translation, an interpreter performs the operations implied by the source program, executing each statement as it is encountered.

EXAMPLES OF ANALYSIS
i) Text Formatters
A text formatter takes as input a stream of characters, most of which is text to be typeset, but some of which includes commands to indicate paragraphs, figures, or mathematical structures like superscripts and subscripts.
ii) Silicon Compilers

A silicon compiler has a source language that is similar or identical to a conventional programming language. However, the variables of the language represent not only locations in memory, but also logical signals ( 0 or 1 ) or groups of signals in a switching circuit.
iii) Query Interpreters
A query interpreter translates a predicate containing relational and Boolean operators into commands to search a database for records satisfying that predicate.

THE CONTEXT OF A COMPILER
A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes entrusted to a distinct program called a preprocessor. The preprocessor may also expand macros into source language statements. The target program created by the compiler may require further processing before it can be run: the compiler creates assembly code that is translated by an assembler into machine code and then linked together with some library routines into the code that actually runs on the machine.

    skeletal source program
            |
       Preprocessor
            |
      source program
            |
        Compiler
            |
    target assembly program
            |
        Assembler
            |
    relocatable machine code
            |
    Loader/Link-editor  <--  library, relocatable object files
            |
    absolute machine code

ANALYSIS OF THE SOURCE PROGRAM
In compiling, analysis consists of three phases.
1. Linear Analysis (Lexical Analysis or Scanning)
The stream of characters making up the source program is read from left to right and grouped into tokens, which are sequences of characters having a collective meaning. For example, consider
    position := initial + rate * 60

In lexical analysis, the characters in the assignment statement would be grouped into the following tokens :
a) the identifier position
b) the assignment symbol :=
c) the identifier initial
d) the plus sign +
e) the identifier rate
f) the multiplication sign *
g) the number 60
The blanks separating the characters of these tokens would normally be eliminated.
2. Hierarchical Analysis (Parsing or Syntax Analysis)
The characters or tokens are grouped hierarchically into nested collections with collective meaning. It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. The grammatical phrases of the source program are represented by a parse tree. Lexical constructs do not require recursion, while syntactic constructs do. Context-free grammars are a formalization of recursive rules that can be used to guide syntactic analysis. For example, the parse tree for position := initial + rate * 60 is:

    assignment statement
    |-- identifier: position
    |-- :=
    |-- expression
        |-- expression
        |   |-- identifier: initial
        |-- +
        |-- expression
            |-- expression
            |   |-- identifier: rate
            |-- *
            |-- expression
                |-- number: 60

The hierarchical structure of a program is expressed by recursive rules:
1. Any identifier is an expression.
2. Any number is an expression.
3. If expression1 and expression2 are expressions, then so are
       expression1 + expression2
       expression1 * expression2
       ( expression1 )

Symbol Table
As the characters of a token are grouped and removed from the input, so that processing of the next token can begin, the token and its attributes are recorded in a table.
3. Semantic Analysis

This phase checks the source program for semantic errors. It uses the hierarchical structure determined by the syntax analysis phase to identify the operators and operands of expressions and statements.

An important component of semantic analysis is type checking. For example, when a binary arithmetic operator is applied to an integer and a real, the compiler may need to convert the integer to a real.

ANALYSIS IN TEXT FORMATTERS
The input to a text formatter can be viewed as specifying a hierarchy of boxes: rectangular regions to be filled by bit patterns, representing light and dark pixels to be printed by the output device. Consecutive characters not separated by white space are grouped into words, consisting of a sequence of horizontally arranged boxes. Boxes in the TEX system may be built from smaller boxes by arbitrary horizontal and vertical combinations. For example,
    \hbox{ <list of boxes> }
groups the listed boxes horizontally, while \vbox stacks them vertically, so
    \hbox{ \vbox{ ! 1 } \vbox{ @ 2 } }
arranges the two vertical stacks side by side:
    !  @
    1  2

Mathematical expressions are built from operators like sub and sup for subscripts and superscripts. In the EQN preprocessor for mathematics,
    BOX sub box
shrinks box and attaches it to BOX as a subscript; for instance, a sub { i sup 2 } produces a with the subscript i squared.

THE PHASES OF A COMPILER
A compiler operates in phases, each of which transforms the source program from one representation to another.

SYMBOL-TABLE MANAGEMENT
A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to store or retrieve data from that record quickly. Example : in the declaration
    int position, initial, rate;
each identifier in the source program is entered into the symbol table.

ERROR DETECTION AND REPORTING
Each phase can encounter errors. A compiler that stops when it finds the first error is not as helpful as one that keeps going. Errors where the token stream violates the structure rules of the language are determined by the syntax analysis phase.

THE ANALYSIS PHASES
The lexical analyzer enters identifiers into the symbol table and represents them in the token stream by the token id together with a pointer to the symbol-table entry; this pointer is the lexical value of id. The character sequence forming a token is called the lexeme for the token. Syntax analysis imposes a hierarchical structure on the token stream, represented by trees.

INTERMEDIATE CODE GENERATION
After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the source program, which should be easy to produce and easy to translate into the target program.

CODE OPTIMIZATION
This phase attempts to improve the intermediate code, so as to improve the running time of the target program without slowing down compilation unduly.
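The symbol-table structure described above can be sketched as follows. This is a toy Python model, not a prescribed layout: the record fields and the 4-byte width per int are assumptions for illustration.

```python
# Toy symbol table: one record per identifier, with fields for its
# attributes (here just a type and a storage offset).
class SymbolTable:
    def __init__(self):
        self.records = {}
        self.next_offset = 0

    def enter(self, name, type_, width):
        """Store a record when an identifier is first seen."""
        if name not in self.records:
            self.records[name] = {"type": type_, "offset": self.next_offset}
            self.next_offset += width
        return self.records[name]

    def lookup(self, name):
        """Retrieve the record for an identifier, if any."""
        return self.records.get(name)

table = SymbolTable()
for ident in ("position", "initial", "rate"):   # int position, initial, rate;
    table.enter(ident, "int", 4)                # 4-byte width is assumed
```

After entering the three declared identifiers, `table.lookup("rate")` returns a record with offset 8, while an undeclared name yields `None`.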

CODE GENERATION
The final phase generates the target code, consisting of relocatable machine code or assembly code. Memory locations are selected for each of the variables used by the program. Then the intermediate instructions are each translated into a sequence of machine instructions that perform the same task.

The phases of a compiler:

    Source Program
         |
    Lexical Analyzer
         |
    Syntax Analyzer
         |                       Symbol-table Management
    Semantic Analyzer            Error Handler
         |                       (both interact with every phase)
    Intermediate Code Generator
         |
    Code Optimizer
         |
    Code Generator
         |
    Target Program

COUSINS OF THE COMPILER
The input to a compiler may be produced by one or more preprocessors. Preprocessor functions:
a) Macro processing
A preprocessor may allow a user to define macros that are shorthands for longer constructs.
b) File inclusion
A preprocessor may include header files into the program text. For example, the C preprocessor causes the contents of the file global.h to replace the statement
    # include <global.h>
when it processes a file containing this statement.
c) Rational preprocessors
These processors augment older languages with modern flow-of-control and data-structuring facilities. For example, built-in macros for constructs like while statements or if statements.
d) Language extension
These processors attempt to add capabilities to the language by built-in macros. For example, statements beginning with ## are taken by the preprocessor to be database-access statements, translated into procedure calls on routines that perform the database access.

MACRO PROCESSORS
Macro definitions are indicated by some unique character or keyword, like define or macro. They consist of a name for the macro being defined and a body. Macro processors often permit formal parameters in the definition, i.e., symbols to be replaced by values. The use of a macro consists of naming the macro and supplying actual parameters, i.e., values for its formal parameters. The macro processor substitutes the actual parameters for the formal parameters in the body of the macro; the transformed body then replaces the macro use itself.

Assemblers
Some compilers produce assembly code that is passed to an assembler for further processing. Other compilers perform the job of the assembler, producing relocatable machine code that can be passed directly to the loader/link-editor. Assembly code is a mnemonic version of machine code, in which names are used instead of binary codes for operations, and names are also given to memory addresses. Example :
    mov a, R1
    add #2, R1
    mov R1, b
This code moves the contents of the address a into register R1, then adds the constant 2 to it, and stores the result in the location named b.

Two-pass assembly
A pass consists of reading an input file once. In the first pass, all the identifiers that denote storage locations are found and stored in a symbol table; identifiers are assigned storage locations as they are encountered. For example,
    Identifier    Address
    a             0
    b             4
In the second pass, the assembler scans the input again. This time, it translates each operation code into the sequence of bits representing that operation in machine language, and it translates each identifier representing a location into the address given for that identifier in the symbol table. The result of this pass is relocatable machine code.

Loaders and Link-Editors
A loader is a program that performs the two functions of loading and link-editing. The process of loading consists of taking relocatable machine code, altering the relocatable addresses, and placing the altered instructions and data in memory at the proper locations. The link-editor allows us to make a single program from several files of relocatable machine code. These files may have been the result of several different compilations, and one or more may be library files of routines provided by the system.

THE GROUPING OF PHASES
In an implementation, more than one phase is often grouped together.

Front and back ends
The phases are often collected into a front end and a back end. The front end consists of those phases that depend primarily on the source language and are largely independent of the target machine. It includes lexical and syntactic analysis, the creation of the symbol table, semantic analysis, and the generation of intermediate code. A certain amount of code optimization can be done by the front end as well, along with the associated error handling. The back end includes those portions of the compiler that depend on the target machine and, generally, do not depend on the source language: aspects of code optimization, code generation, and the necessary error handling and symbol-table operations.

Passes
A pass consists of reading an input file and writing an output file. For example, lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped into one pass, with the token stream after lexical analysis translated directly into intermediate code.

Reducing the number of passes
Since it takes time to read and write intermediate files, it is desirable to group several phases into one pass. The interface between the lexical and syntactic analyzers can often be limited to a single token. Intermediate and target code generation can be merged into one pass using a technique called back patching. Back patching is a term from assembly: if machine code is generated directly, a label in a three-address statement may refer to an instruction whose address is not yet known; the address is left blank and filled in (back patched) once it becomes known. A three-address statement has the form x := y op z, e.g., x := y * z, where op is any operator and x, y and z are names or constants.

A SIMPLE ONE-PASS COMPILER
A context-free grammar, or Backus-Naur Form (BNF), can be used to help guide the translation of programs. A grammar-oriented compiling technique, known as syntax-directed translation, is very helpful for organizing a compiler front end. The lexical analyzer converts the stream of input characters into a stream of tokens:

    Character stream --> Lexical Analyzer --> Token stream --> Syntax-directed Translator --> Intermediate representation
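Back patching, described above under reducing the number of passes, can be sketched as follows. This is a toy Python model with symbolic instructions: the opcode names and list representation are invented for illustration.

```python
# Toy back patcher: emit jumps to labels before the labels' addresses
# are known, then fill the blank address slots in afterwards.
code = []          # list of [opcode, operand] instructions
unresolved = {}    # label -> indices of instructions waiting for it
labels = {}        # label -> resolved instruction address

def emit(op, operand=None):
    code.append([op, operand])

def emit_jump(label):
    # Forward reference: the target address is left blank for now.
    unresolved.setdefault(label, []).append(len(code))
    code.append(["jmp", None])

def define_label(label):
    # The label's address is now known; patch every waiting jump.
    labels[label] = len(code)
    for index in unresolved.pop(label, []):
        code[index][1] = labels[label]

emit_jump("L1")      # jump forward to a not-yet-known address
emit("nop")
define_label("L1")   # address becomes known here: back patch it
emit("halt")
```

After these calls, the first instruction reads `["jmp", 2]`: the blank slot was patched once L1's address (instruction index 2) became known.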

SYNTAX DEFINITION
A context-free grammar is used for specifying the syntax of a language. A context-free grammar has four components :
1. A set of tokens, known as terminal symbols.
2. A set of nonterminals.
3. A set of productions, where each production consists of a nonterminal called the left side of the production, an arrow, and a sequence of tokens and/or nonterminals, called the right side of the production.
4. A designation of one of the nonterminals as the start symbol.

For example, an if-else statement in C has the form
    if ( expression ) statement else statement

Using the variable expr to denote an expression and the variable stmt to denote a statement, this structure can be expressed as
    stmt -> if ( expr ) stmt else stmt
in which the arrow may be read as "can have the form". Such a rule is called a production. In a production, lexical elements like the keyword if and the parentheses are called tokens. Variables like expr and stmt represent sequences of tokens and are called nonterminals.

A string of tokens is a sequence of zero or more tokens. The string containing zero tokens, written as ε, is called the empty string. A grammar derives strings by beginning with the start symbol and repeatedly replacing a nonterminal by the right side of a production for that nonterminal. The token strings that can be derived from the start symbol form the language defined by the grammar.

For example, a list of digits separated by plus and minus signs is described by the grammar
    list  -> list + digit
    list  -> list - digit
    list  -> digit
    digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9     ( | is read as "or" )
The nonterminals are the symbols that appear on the left side of some production. The string 9-5+2 is a list:
a) 9 is a list by the third production, since 9 is a digit.
b) 9-5 is a list by the second production, since 9 is a list and 5 is a digit.
c) 9-5+2 is a list by the first production, since 9-5 is a list and 2 is a digit.
This derivation corresponds to the parse tree

    list
    |-- list
    |   |-- list
    |   |   |-- digit: 9
    |   |-- -
    |   |-- digit: 5
    |-- +
    |-- digit: 2

Each node in the tree is labeled by a grammar symbol. An interior node and its children correspond to a production: the interior node corresponds to the left side of the production, the children to the right side. Such trees are called parse trees.

PARSE TREES
A parse tree pictorially shows how the start symbol of a grammar derives a string in the language. If nonterminal A has a production A -> XYZ, then a parse tree may have an interior node labeled A with three children labeled X, Y and Z, from left to right:

    A
    |-- X
    |-- Y
    |-- Z

Properties
1. The root is labeled by the start symbol.
2. Each leaf is labeled by a token or by ε.
3. Each interior node is labeled by a nonterminal.

4. If A is the nonterminal labeling some interior node and X1, X2, ..., Xn are the labels of the children of that node from left to right, then A -> X1 X2 ... Xn is a production. Here X1, X2, ..., Xn each stand for a symbol that is either a terminal or a nonterminal. As a special case, if A -> ε, then a node labeled A may have a single child labeled ε.
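The derivation of 9-5+2 shown earlier can be reproduced mechanically by string rewriting. The following Python sketch replaces the leftmost occurrence of a chosen nonterminal at each step; the list-of-symbols representation is an assumption for illustration.

```python
# Rewriting sketch: a sentential form is a list of grammar symbols;
# each step replaces the leftmost occurrence of a nonterminal by the
# right side of one of its productions.
def apply(sentential, nonterminal, right_side):
    i = sentential.index(nonterminal)          # leftmost occurrence
    return sentential[:i] + right_side + sentential[i + 1:]

s = ["list"]
s = apply(s, "list", ["list", "+", "digit"])   # list => list + digit
s = apply(s, "list", ["list", "-", "digit"])   # => list - digit + digit
s = apply(s, "list", ["digit"])                # => digit - digit + digit
s = apply(s, "digit", ["9"])                   # => 9 - digit + digit
s = apply(s, "digit", ["5"])                   # => 9 - 5 + digit
s = apply(s, "digit", ["2"])                   # => 9 - 5 + 2
```

Joining the final sentential form gives the string 9-5+2, matching steps a), b) and c) of the derivation.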

The leaves of a parse tree read from left to right form the yield of the tree, which is the string generated or derived from the nonterminal at the root of the parse tree. The language generated by a grammar is the set of strings that can be generated by some parse tree. The process of finding a parse tree for a given string of tokens is called parsing that string.

Ambiguity
A grammar can have more than one parse tree generating a given string of tokens; such a grammar is said to be ambiguous. Because a string with more than one parse tree has more than one meaning, compiling applications require either an unambiguous grammar, or an ambiguous grammar with additional rules to resolve the ambiguities. For example, with the grammar
    string -> string + string | string - string | 0 | 1 | ... | 9
the string 9-5+2 has more than one parse tree. The two trees correspond to the two readings (9-5)+2 and 9-(5+2):

    (9-5)+2                  9-(5+2)
    string                   string
    |-- string               |-- string: 9
    |   |-- string: 9        |-- -
    |   |-- -                |-- string
    |   |-- string: 5            |-- string: 5
    |-- +                        |-- +
    |-- string: 2                |-- string: 2

Rule 1 : Associativity of operators
The arithmetic operators addition, subtraction, multiplication and division are left associative; exponentiation and the equal sign are right associative. For example, 9-5-2 is equivalent to (9-5)-2 (left associative), and a=b=c is equivalent to a=(b=c) (right associative). Right-associative operators are generated by a grammar such as
    right  -> letter = right | letter
    letter -> a | b | ... | z
The contrast between a left-associative operator (-, using the list grammar) and a right-associative operator (=) shows in their parse trees. For 9-5+2 the tree

    list
    |-- list
    |   |-- list
    |   |   |-- digit: 9
    |   |-- -
    |   |-- digit: 5
    |-- +
    |-- digit: 2

grows down towards the left, while for a=b=c the tree

    right
    |-- letter: a
    |-- =
    |-- right
        |-- letter: b
        |-- =
        |-- right
            |-- letter: c

grows down towards the right.

Rule 2 : Precedence of operators
For example, 9+5*2 has two possible interpretations: (9+5)*2 and 9+(5*2). The associativity of + and * does not resolve this ambiguity, so we need the relative precedence of operators when more than one kind of operator is present. We say * has higher precedence than + if * takes its operands before + does. In ordinary arithmetic, multiplication and division have higher precedence than addition and subtraction, so the expressions 9+5*2 and 9*5+2 are equivalent to 9+(5*2) and (9*5)+2 respectively.

Syntax of expressions
A grammar for arithmetic expressions can be constructed from a table giving the associativity and precedence of operators.

Syntax-directed translation
Syntax-directed translation specifies the translation of a construct in terms of attributes associated with its syntactic components. An attribute may represent any quantity, e.g., a type, a string, or a memory location. A procedural notation for specifying translations is called a translation scheme; such schemes are used here for translating infix expressions into postfix notation.

Postfix notation
The postfix notation for an expression E can be defined inductively :
1. If E is a variable or constant, then the postfix notation for E is E itself.
2. If E is an expression of the form E1 op E2, where op is any binary operator, then the postfix notation for E is E1' E2' op, where E1' and E2' are the postfix notations for E1 and E2, respectively.
3. If E is an expression of the form ( E1 ), then the postfix notation for E1 is also the postfix notation for E.
No parentheses are needed in postfix notation, because the position and number of arguments of the operators permit only one decoding. For example, the postfix notation for (9-5)+2 is 95-2+, and for 9-(5+2) it is 952+-.

Syntax-directed definitions
A syntax-directed definition uses a context-free grammar to specify the syntactic structure of the input. With each grammar symbol it associates a set of attributes, and with each production, a set of semantic rules for computing values of the attributes associated with the symbols appearing in that production.
The grammar and the set of semantic rules constitute the syntax-directed definition. A translation is an input-output mapping. Suppose a node n in the parse tree is labeled by the grammar symbol X; we write X.a to denote the value of attribute a of X at that node. The value of X.a at n is computed using the semantic rule for attribute a associated with the X-production used at n. A parse tree showing the attribute values at each node is called an annotated parse tree.

Synthesized attributes

An attribute is said to be synthesized if its value at a parse-tree node is determined from the attribute values at the children of the node. Synthesized attributes can be evaluated by a single bottom-up traversal of the parse tree. For example, for translating expressions such as (9-5)+2 into postfix:

    Production              Semantic Rule
    expr -> expr1 + term    expr.t := expr1.t || term.t || '+'
    expr -> expr1 - term    expr.t := expr1.t || term.t || '-'
    expr -> term            expr.t := term.t
    term -> 0               term.t := '0'
    ...
    term -> 9               term.t := '9'
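These semantic rules can be evaluated bottom-up over a parse tree. The following Python sketch assumes a tuple representation (op, left, right) for interior expr nodes and plain digit strings for leaves; that representation is an illustration, not a fixed format.

```python
# Bottom-up evaluation of the synthesized attribute t:
#   t(digit)     = the digit itself            (term -> digit)
#   t(e1 op e2)  = t(e1) || t(e2) || op        (expr -> expr op term)
def t(node):
    if isinstance(node, str):        # leaf: a digit
        return node
    op, e1, e2 = node                # interior node: e1 op e2
    return t(e1) + t(e2) + op        # children first, then the node

# parse tree for 9-5+2, grouped as (9-5)+2
tree = ("+", ("-", "9", "5"), "2")
# t(tree) -> "95-2+"
```

Each node's value is computed only from its children's values, which is exactly the defining property of a synthesized attribute.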

Associated with each nonterminal is a string-valued attribute t that represents the postfix notation for the expression generated by that nonterminal in a parse tree. The value of the t-attribute at each node is computed using the semantic rule associated with the production used at that node. The value of the attribute at the root is the postfix notation for the whole string generated by the parse tree. The annotated parse tree for 9-5+2 is:

    expr.t = '95-2+'
    |-- expr.t = '95-'
    |   |-- expr.t = '9'
    |   |   |-- term.t = '9'   (9)
    |   |-- -
    |   |-- term.t = '5'   (5)
    |-- +
    |-- term.t = '2'   (2)

Fig. Attribute values at nodes

Depth-first traversals
A traversal of a tree starts at the root and visits each node of the tree in some order. The semantic rules can be evaluated by the following procedure:

    procedure visit ( n : node );
    begin
        for each child m of n, from left to right do
            visit ( m );
        evaluate the semantic rules at node n
    end

It starts at the root and recursively visits the children of each node in left-to-right order. The semantic rules at a given node are evaluated once all the descendants of that node have been visited. It is called depth-first because it visits an unvisited child of a node whenever it can, moving as far away from the root as quickly as possible.

Translation schemes

A translation scheme is a context-free grammar in which program fragments called semantic actions are embedded within the right sides of the productions. A translation scheme is like a syntax-directed definition, except that the order of evaluation of the semantic rules is explicitly shown. The position at which an action is to be executed is indicated by enclosing it between braces and writing it within the right side of a production:

    rest -> + term { print('+') } rest1

A translation scheme generates an output for each sentence x generated by the underlying grammar, by executing the actions in the order they appear during a depth-first traversal of a parse tree for x. In the parse tree, an action such as { print('+') } is shown as an extra leaf, connected by a dashed line to the node for its production:

    rest
    |-- +
    |-- term
    |-- { print('+') }
    |-- rest1

Emitting a translation
A syntax-directed definition has the simple property if the string representing the translation of the nonterminal on the left side of each production is the concatenation of the translations of the nonterminals on the right, in the same order as in the production, with some additional strings (possibly) interleaved. For example,

    Production              Semantic rule
    expr -> expr1 + term    expr.t := expr1.t || term.t || '+'

Here the translation expr.t is the concatenation of the translations of expr1 and term, followed by the symbol +; note that expr1 appears before term on the right side of the production. An additional string appears between term.t and rest1.t in

    Production              Semantic rule
    rest -> + term rest1    rest.t := term.t || '+' || rest1.t

although the nonterminal term appears before rest1 on the right side.

PARSING
Parsing is the process of determining whether a string of tokens can be generated by a grammar. A parser can be constructed for any grammar; for any context-free grammar there is a parser that takes at most O(n^3) time to parse a string of n tokens.

Parsing methods
i) Bottom-up methods
Construction starts at the leaves and proceeds towards the root. Bottom-up methods can handle a larger class of grammars and translation schemes.

ii) Top-down methods
Construction starts at the root and proceeds towards the leaves. Efficient parsers can be constructed more easily by hand using top-down methods. The top-down construction of a parse tree is done by starting with the root, labeled with the starting nonterminal, and repeatedly performing the following two steps:
a) At node n, labeled with nonterminal A, select one of the productions for A and construct children at n for the symbols on the right side of the production.
b) Find the next node at which a subtree is to be constructed.
The current token being scanned in the input is frequently referred to as the lookahead symbol.

For example, consider the steps in the top-down construction of a parse tree for the input

    array [ num dotdot num ] of integer

according to the grammar

    type   -> simple | ^ id | array [ simple ] of type
    simple -> integer | char | num dotdot num

The tree starts as a single node labeled type. Since the lookahead symbol is array, the production type -> array [ simple ] of type is selected, and children labeled array, [, simple, ], of and type are constructed. As each terminal in the right side is matched, the next token of the input becomes the new lookahead symbol, and the arrow in the parse tree advances to the next child. When the arrow reaches the child labeled with the nonterminal simple and the lookahead symbol is num, the production simple -> num dotdot num is selected and matched against num dotdot num. Finally, with lookahead integer at the rightmost child labeled type, the productions type -> simple and simple -> integer are selected, completing the tree:

    type
    |-- array
    |-- [
    |-- simple
    |   |-- num
    |   |-- dotdot
    |   |-- num
    |-- ]
    |-- of
    |-- type
        |-- simple
            |-- integer

(The original figure showed successive snapshots of the growing parse tree with an arrow marking the current position in the input.)
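The top-down steps above correspond directly to a recursive-descent parser with one procedure per nonterminal. The following Python sketch covers the type and simple productions used in the example (the ^ id alternative is omitted, the input is assumed to be pre-tokenized, and error handling is reduced to an exception):

```python
# Recursive-descent parser for:
#   type   -> simple | array [ simple ] of type
#   simple -> integer | char | num dotdot num
tokens = []
pos = 0

def lookahead():
    return tokens[pos] if pos < len(tokens) else None

def match(t):
    global pos
    if lookahead() != t:
        raise SyntaxError(f"expected {t}, found {lookahead()}")
    pos += 1

def type_():
    if lookahead() == "array":   # type -> array [ simple ] of type
        match("array"); match("["); simple(); match("]"); match("of"); type_()
    else:                        # type -> simple
        simple()

def simple():
    if lookahead() in ("integer", "char"):
        match(lookahead())
    elif lookahead() == "num":   # simple -> num dotdot num
        match("num"); match("dotdot"); match("num")
    else:
        raise SyntaxError(f"unexpected {lookahead()}")

def parse(ts):
    global tokens, pos
    tokens, pos = ts, 0
    type_()
    return pos == len(tokens)    # True if all input was consumed
```

Calling `parse(["array", "[", "num", "dotdot", "num", "]", "of", "integer"])` matches each terminal against the lookahead symbol and calls the procedure for each nonterminal, exactly mirroring the tree construction above.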

PREDICTIVE PARSING
Recursive-descent parsing is a top-down method of syntax analysis in which we execute a set of recursive procedures to process the input. A procedure is associated with each nonterminal of the grammar. A special form of recursive-descent parsing, called predictive parsing, is one in which the lookahead symbol unambiguously determines the procedure selected for each nonterminal. The sequence of procedures called in processing the input implicitly defines a parse tree for the input. Each terminal in the right side is matched with the lookahead symbol, and each nonterminal in the right side leads to a call of its procedure.

FIRST(α) is defined to be the set of tokens that appear as the first symbols of one or more strings generated from α; if α is ε or can generate ε, then ε is also in FIRST(α). Given two productions A -> α and A -> β, recursive-descent parsing without backtracking requires FIRST(α) and FIRST(β) to be disjoint. If the lookahead symbol is in FIRST(α), then α is used; otherwise, if the lookahead symbol is in FIRST(β), then β is used.

When to use ε-productions
A recursive-descent parser will use an ε-production as a default when no other production can be used. For example, in
    stmt      -> begin opt_stmts end
    opt_stmts -> stmt_list | ε

if the lookahead symbol is not in FIRST(stmt_list), then the ε-production is used.

Designing a predictive parser
A predictive parser is a program consisting of a procedure for every nonterminal. Each procedure does two things:
1. It decides which production to use by looking at the lookahead symbol. The production with right side α is used if the lookahead symbol is in FIRST(α). If there is a conflict between two right sides for some lookahead symbol, then this parsing method cannot be used on this grammar. A production with ε on the right side is used if the lookahead symbol is not in the FIRST set of any other right side.
2. The procedure uses a production by mimicking the right side. A nonterminal results in a call to the procedure for that nonterminal, and a token matching the lookahead symbol results in the next input token being read. If at some point the token in the production does not match the lookahead symbol, an error is declared.

LEFT RECURSION
It is possible for a recursive-descent parser to loop forever. A problem

arises with left-recursive productions like
    expr -> expr + term
in which the leftmost symbol on the right side is the same as the nonterminal on the left side of the production. Note: the lookahead symbol changes only when a terminal in the right side is matched. Since the production begins with the nonterminal expr, no change to the input takes place between recursive calls, causing an infinite loop. A left-recursive production can be eliminated by rewriting the offending production. Given
    A -> A α | β
where α and β are sequences of terminals and nonterminals that do not start with A, the productions can be rewritten as
    A  -> β A'
    A' -> α A' | ε
Here A' is a new nonterminal. The production A' -> α A' is right recursive, because this production for A' has A' itself as the rightmost symbol on the right side. For example, in
    expr -> expr + term | term
we have A = expr, α = + term, and β = term. The nonterminal expr is left recursive because the production expr -> expr + term has expr itself as the leftmost symbol on the right side. Rewriting gives
    expr  -> term expr'
    expr' -> + term expr' | ε

A TRANSLATOR FOR SIMPLE EXPRESSIONS
A predictive parser cannot handle a left-recursive grammar. By eliminating the left recursion, we obtain a grammar suitable for use in a predictive recursive-descent translator.

Abstract and concrete syntax
The translation of an input string is an abstract syntax tree in which each node represents an operator and the children of the node represent the operands. The parse tree, by contrast, is called a concrete syntax tree, and the underlying grammar is called a concrete syntax for the language.

Adapting the translation scheme
The left-recursion elimination technique also transforms productions with embedded semantic actions. Productions of the form
    A -> A α | A β | γ
are transformed into
    A  -> γ A'
    A' -> α A' | β A' | ε
Applying this to
    expr -> expr + term { print('+') }
          | expr - term { print('-') }
          | term
    term -> 0 { print('0') } | 1 { print('1') } | ... | 9 { print('9') }
we have A = expr, α = + term { print('+') }, β = - term { print('-') }, and γ = term.
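The adapted scheme corresponds to the following predictive-translator sketch in Python (toy code: tokens are single characters, and the print actions are collected into an output string rather than printed):

```python
# Predictive translator for the adapted scheme:
#   expr -> term rest
#   rest -> + term { print('+') } rest | - term { print('-') } rest | e
#   term -> 0 { print('0') } | ... | 9 { print('9') }
def translate(s):
    out = []
    pos = 0

    def lookahead():
        return s[pos] if pos < len(s) else None

    def expr():
        term(); rest()

    def rest():
        nonlocal pos
        if lookahead() in ("+", "-"):
            op = lookahead(); pos += 1
            term()
            out.append(op)          # semantic action { print(op) }
            rest()
        # otherwise: rest -> e, emit nothing

    def term():
        nonlocal pos
        if lookahead() is None or not lookahead().isdigit():
            raise SyntaxError("digit expected")
        out.append(lookahead())     # semantic action { print(digit) }
        pos += 1

    expr()
    return "".join(out)
```

For instance, `translate("9-5+2")` returns the postfix string "95-2+": the actions run in the order they are met during the depth-first traversal implied by the procedure calls.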

The resulting scheme translates 9-5+2 into 95-2+. The actions (shown as extra leaves) are executed during a depth-first traversal of the parse tree:

    expr
    |-- term
    |   |-- 9
    |   |-- { print('9') }
    |-- rest
        |-- -
        |-- term
        |   |-- 5
        |   |-- { print('5') }
        |-- { print('-') }
        |-- rest
            |-- +
            |-- term
            |   |-- 2
            |   |-- { print('2') }
            |-- { print('+') }
            |-- rest
                |-- ε

Optimizing the translator
If the last statement executed in a procedure body is a recursive call of the same procedure, the call is said to be tail recursive. We can speed up a program by replacing tail recursion by iteration: for a procedure without parameters, a tail-recursive call can be simply replaced by a jump to the beginning of the procedure.

Lexical analysis
A sequence of input characters that comprises a single token is called a lexeme. A lexical analyzer can insulate a parser from the lexeme representation of tokens.

Removal of white space and comments
White space (blanks, tabs and newlines) may appear between tokens. Comments are likewise ignored by the parser and translator, so they may also be treated as white space. If white space is eliminated by the lexical analyzer, the parser will never have to consider it. The alternative of modifying the grammar to incorporate white space into the syntax is not nearly as easy to implement.

Constants
The lexical analyzer passes both the token and its attribute to the parser. We write a token and its attribute as a tuple enclosed between < and >. For example, the input
    31 + 28 + 59
is transformed into the sequence of tuples
    <num, 31> <+, > <num, 28> <+, > <num, 59>
The token + has no attribute. The second components of the tuples, the attributes, play no role during parsing, but are needed during translation.

Recognizing identifiers and keywords
A grammar for a language treats an identifier (naming variables, arrays, functions) as a token. For example, the input
    count = count + increment;
would be

converted by the lexical analyzer into the token stream
    id = id + id ;
This token stream is used for parsing. The lexeme is stored in the symbol table, and a pointer to this symbol-table entry becomes an attribute of the token id.

Fixed character strings such as begin, end and if are called keywords. A mechanism is needed for deciding when a lexeme forms a keyword and when it forms an identifier. The problem is easier to resolve if keywords are reserved, i.e., if they cannot be used as identifiers; then a character string forms an identifier only if it is not a keyword. The problem of isolating tokens also arises when the same characters appear in the lexemes of more than one token, as with <, <= and != in C.

The role of the lexical analyzer
The lexical analyzer acts as a subroutine or coroutine, which is called by the parser whenever it needs a new token. This organization eliminates the need for an intermediate file. The lexical analyzer returns to the parser a representation of the token it has found. The representation is an integer code if the token is a simple construct such as a left parenthesis, comma or colon.

    source                          token
    program --> Lexical Analyzer ----------> Parser
                      ^          <----------
                      |         get next token
                      v
                 Symbol table

Fig. Interaction of lexical analyzer with parser

Lexical analyzers are sometimes divided into a cascade of two phases :
Scanning - responsible for simple tasks, such as eliminating blanks from the input.
Lexical analysis - responsible for the more complex operations.

Functions of the lexical analyzer
To keep track of line numbers, so that line numbers can be associated with error messages.
If the source language supports macro preprocessor functions, these may be implemented as part of lexical analysis.
To strip out white space (redundant blanks and tabs).
To delete comments.
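Distinguishing reserved keywords from identifiers, as described above, can be sketched as a table lookup. This Python fragment is illustrative only: the keyword set and token shapes are assumptions, not a fixed design.

```python
# Toy lexer fragment: a lexeme that looks like an identifier is a
# keyword only if it appears in the reserved-word table.
KEYWORDS = {"begin", "end", "if", "then", "else", "while", "do"}

def scan_word(lexeme):
    """Classify one alphabetic lexeme as a keyword or an identifier."""
    if lexeme in KEYWORDS:
        return (lexeme, None)    # a keyword is its own token, no attribute
    return ("id", lexeme)        # the lexeme becomes the attribute of id

def scan(words):
    return [scan_word(w) for w in words]
```

For example, `scan(["if", "count", "then"])` yields `[("if", None), ("id", "count"), ("then", None)]`: because the keywords are reserved, a string forms an identifier only if it is not in the table.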

Issues in lexical analysis

Simpler design

The separation of lexical analysis from syntax analysis simplifies both phases. For example, comments and white space have already been removed by the time the parser runs.

Compiler efficiency is improved

A separate lexical analyzer allows us to construct a specialized and potentially more efficient processor for the task. A large amount of time is spent reading the source program and partitioning it into tokens. Specialized buffering techniques for reading input characters and processing tokens can significantly speed up the performance of a compiler.

Compiler portability is enhanced

Input-alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer. The representation of special or non-standard symbols can be isolated in the lexical analyzer.

Tokens, Patterns, Lexemes

Token: a set of strings in the input for which the same token is produced as output.

Pattern: the rule describing the set of strings associated with a token. The pattern is said to match each string in the set.

Lexeme: a sequence of characters in the source program that is matched by the pattern for a token.

Tokens appear as terminal symbols in the grammar for the source language. The lexemes matched by the pattern for a token represent strings of characters in the source program that can be treated together as a lexical unit. Typical tokens are keywords, operators, identifiers, constants, literal strings, and punctuation symbols such as parentheses, commas and semicolons. A token representing an identifier is returned to the parser; the lexical analyzer must distinguish between a keyword and a user-defined identifier. To describe the patterns for more complex tokens, we use the regular expression notation developed below. For example:
TOKEN      SAMPLE LEXEMES           INFORMAL DESCRIPTION OF PATTERN
const      const                    const
if         if                       if
relation   <, <=, >, >=, ==, !=     < or <= or > or >= or == or !=
id         pi, count, D2            letter followed by letters and digits
num        3.1416, 0, 6.02E3        any numeric constant
literal    "core dumped"            any characters between " and " except "
Attributes for tokens

When more than one pattern matches a lexeme, the lexical analyzer must provide additional information about the particular lexeme that matched to the subsequent phases of the compiler. The lexical analyzer collects information about tokens into their associated attributes. The tokens influence parsing decisions; the attributes influence the translation of tokens. In practice a token has a single attribute: a pointer to the symbol-table entry in which the information about the token is kept. For certain pairs there is no need for an attribute value; the first component alone is sufficient to identify the lexeme. For example, the tokens and associated attribute values for E = M * C * 2 are written as a sequence of pairs:

<id, pointer to symbol-table entry for E>
<assign-op, >
<id, pointer to symbol-table entry for M>
<multi-op, >
<id, pointer to symbol-table entry for C>
<multi-op, >
<num, integer value 2>

The token num has been given an integer-valued attribute. Alternatively, the compiler may store the character string that forms a number in a symbol table and let the attribute of token num be a pointer to the table entry.
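A tiny tokenizer can produce exactly the pair sequence shown above. This is a minimal sketch assuming a whitespace-separated expression; the token names (id, assign-op, multi-op, num) follow the example, and the identifier lexeme stands in for a symbol-table pointer.

```python
# Minimal sketch: split the input into lexemes and emit <token, attribute>
# pairs. Operators carry no attribute value.
import re

def tokenize(source):
    pairs = []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        if lexeme.isdigit():
            pairs.append(("num", int(lexeme)))   # integer-valued attribute
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            pairs.append(("id", lexeme))         # stand-in for a table pointer
        elif lexeme == "=":
            pairs.append(("assign-op", None))
        elif lexeme == "*":
            pairs.append(("multi-op", None))
        else:
            pairs.append((lexeme, None))
    return pairs

print(tokenize("E = M * C * 2"))
```

Running it on E = M * C * 2 yields the seven pairs listed in the text, with None marking the empty attribute slots.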
Lexical errors

For example, in fi (a == f(x)) the string fi could be a misspelling of the keyword if or an undeclared function identifier. Since fi is a valid identifier, the lexical analyzer must return the token for an identifier and let some later phase of the compiler handle the error.
Error-recovery actions:

i. Deleting an extraneous character.
ii. Inserting a missing character.
iii. Replacing an incorrect character by a correct character.
iv. Transposing two adjacent characters.
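The four repair actions can be illustrated by generating every string reachable from a misspelled word by one edit and checking whether any is a known keyword. This is a toy sketch, not a production recovery strategy; the keyword set is an assumption for the example.

```python
# Hedged sketch of the four single-character repair actions: delete,
# insert, replace, and transpose. Keyword set is illustrative.

KEYWORDS = {"if", "then", "else", "while"}
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def single_edit_repairs(word):
    """All strings reachable by one delete, insert, replace or transpose."""
    out = set()
    for i in range(len(word)):
        out.add(word[:i] + word[i+1:])                # delete character i
        for c in ALPHABET:
            out.add(word[:i] + c + word[i:])          # insert before i
            out.add(word[:i] + c + word[i+1:])        # replace character i
        if i + 1 < len(word):
            out.add(word[:i] + word[i+1] + word[i] + word[i+2:])  # transpose
    for c in ALPHABET:
        out.add(word + c)                             # insert at the end
    return out

def repair(word):
    candidates = KEYWORDS & single_edit_repairs(word)
    return min(candidates) if candidates else None

print(repair("fi"))    # 'if' -- recovered by transposing adjacent characters
```

For the misspelling fi from the example above, transposing the two adjacent characters recovers the keyword if.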

Input buffering

A two-buffer input scheme uses:
a. lookahead to identify tokens
b. sentinels to speed up the lexical analyzer

Three approaches to the implementation of a lexical analyzer:
1. Use a lexical-analyzer generator (such as the Lex compiler) to produce the lexical analyzer from a regular-expression-based specification. The generator provides routines for reading and buffering the input.
2. Write the lexical analyzer in a conventional systems-programming language, using the I/O facilities of that language to read the input.
3. Write the lexical analyzer in assembly language and explicitly manage the reading of input.
Since the lexical analyzer is the only phase of the compiler that reads the source program character by character, a significant amount of compilation time can be spent in the lexical analysis phase. All this processing may be carried out with an extra buffer into which the source file is read and then copied, with modification, into the buffer. Preprocessing the character stream in this way saves the trouble of moving the lookahead pointer back and forth over comments or strings of blanks. The lexical analyzer may need to look ahead several characters beyond the lexeme for a pattern.

Early lexical analyzers used a function to push lookahead characters back into the input stream. Because a large amount of time can be consumed moving characters, specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.

[Fig. An input buffer in two halves, holding E = M * C * * 2 eof, with a forward pointer and a lexeme-beginning pointer.]

A buffer is divided into two N-character halves, where N is the number of characters on one disk block (e.g., 1024 or 4096). N input characters are read into each half of the buffer with one system read command, rather than invoking a read command for each input character. The character eof marks the end of the source file and is different from any input character.
Sentinels

Each buffer half holds a sentinel character at its end. The sentinel is a special character that cannot be part of the source program; eof is a natural choice. The same buffer arrangement, with the sentinels added, becomes:

[Fig. Sentinels at the end of each buffer half: E = M * eof | C * * 2 eof, with the final eof marking the end of the source file, and forward and lexeme-beginning pointers.]
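The benefit of the sentinel is that the inner loop needs only one test per character: a comparison against the sentinel covers both "end of buffer half" and "end of input". The sketch below simulates the idea with strings; the buffer size and the use of a NUL byte as the eof stand-in are assumptions for illustration.

```python
# Sketch of the sentinel idea: each buffer half ends with an eof sentinel,
# so advancing the forward pointer needs a single comparison per character.

EOF = "\0"     # stand-in for the eof sentinel (assumed absent from input)
N = 8          # characters per buffer half (normally one disk-block size)

def scan(source):
    """Yield source characters, reading N-character sentinel-ended halves."""
    halves = [source[i:i+N] + EOF for i in range(0, len(source), N)] or [EOF]
    for half in halves:
        i = 0
        while True:
            ch = half[i]
            i += 1
            if ch == EOF:      # sentinel: end of this half (or of the input)
                break          # in a real analyzer: reload the other half
            yield ch

print("".join(scan("E = M * C * 2")))   # E = M * C * 2
```

A real two-buffer analyzer would refill the opposite half on hitting a mid-buffer sentinel; here each half is simply pre-loaded to keep the sketch short.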
Specification of tokens

The term alphabet (or character class) denotes any finite set of symbols. Typical examples of symbols are letters and characters. The set { 0, 1 } is the binary alphabet; ASCII and EBCDIC are examples of computer alphabets.

A string (or sentence, or word) is a finite sequence of symbols drawn from an alphabet. The length of a string s, written |s|, is the number of occurrences of symbols in s. For example, VARMA is a string of length 5. The empty string, denoted ε, is a special string of length zero.

A language denotes any set of strings over some fixed alphabet. This definition admits trivial languages such as ∅, the empty set, and { ε }, the set containing only the empty string.

The concatenation of strings x and y, written xy, is the string formed by appending y to x. The empty string is the identity element under concatenation: εs = sε = s.
Term : Definition

Prefix of s : A string obtained by removing zero or more trailing symbols of string s. For example, ban is a prefix of banana.

Suffix of s : A string obtained by removing zero or more leading symbols of string s. For example, nana is a suffix of banana.

Substring of s : A string obtained by deleting a prefix and a suffix from s. For example, nan is a substring of banana. Every prefix and every suffix of s is a substring of s, but not every substring of s is a prefix or a suffix of s. For every string s, both s and ε are prefixes, suffixes, and substrings of s.

Proper prefix, suffix, or substring of s : Any nonempty string x that is, respectively, a prefix, suffix, or substring of s such that s ≠ x.

Subsequence of s : Any string formed by deleting zero or more not necessarily contiguous symbols from s. For example, baaa is a subsequence of banana.
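These definitions map directly onto Python's string operations, which makes them easy to check; a minimal sketch using the banana examples from the table.

```python
# Checking the string-term definitions with Python's built-in operations.

s = "banana"

def is_prefix(x, s):    return s.startswith(x)
def is_suffix(x, s):    return s.endswith(x)
def is_substring(x, s): return x in s

def is_subsequence(x, s):
    """x is obtained from s by deleting zero or more (not necessarily
    contiguous) symbols; scan s left to right consuming matches."""
    it = iter(s)
    return all(c in it for c in x)

assert is_prefix("ban", s)
assert is_suffix("nana", s)
assert is_substring("nan", s)
assert is_subsequence("baaa", s)
assert not is_substring("baaa", s)  # a subsequence need not be a substring
print("all checks passed")
```

The last assertion shows the distinction the table draws: baaa is a subsequence of banana but not a substring, because its symbols are not contiguous in banana.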

Operations on Languages

The exponentiation operator extends to languages by defining L^0 to be { ε } and L^i to be L^(i-1) L; that is, L^i is L concatenated with itself i-1 times. For lexical analysis the operations of union, concatenation and closure are defined, and these operations can be applied to languages. The following table gives the definitions of the operations on languages.
Operation : Definition

Union of L and M, written L U M : L U M = { s | s is in L or s is in M }

Concatenation of L and M, written LM : LM = { st | s is in L and t is in M }

Kleene closure of L, written L* : L* = U (i = 0 to ∞) L^i ; L* denotes zero or more concatenations of L

Positive closure of L, written L+ : L+ = U (i = 1 to ∞) L^i ; L+ denotes one or more concatenations of L
Example

Let L = { A, B, …, Z, a, b, …, z } and D = { 0, 1, …, 9 }; that is, L is the alphabet consisting of the upper- and lower-case letters, and D is the alphabet consisting of the ten decimal digits. Since a symbol can be regarded as a string of length one, the sets L and D are each finite languages. Then:
1. L U D is the set of letters and digits.
2. LD is the set of strings consisting of a letter followed by a digit.
3. L^4 is the set of all 4-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L U D)* is the set of all strings of letters and digits beginning with a letter.
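For finite languages, these operations are just set operations and can be computed directly. The sketch below uses tiny two-symbol stand-ins for L and D so the results stay small; the alphabets are an assumption for the example.

```python
# Computing the language operations of the table on small finite languages.

L = {"a", "b"}   # tiny stand-in for the letters
D = {"0", "1"}   # tiny stand-in for the digits

def concat(A, B):
    """LM = { st | s in L and t in M }"""
    return {s + t for s in A for t in B}

def power(A, i):
    """A^0 = {epsilon}; A^i = A^(i-1) A."""
    result = {""}
    for _ in range(i):
        result = concat(result, A)
    return result

print(sorted(L | D))          # ['0', '1', 'a', 'b']   -- L U D
print(sorted(concat(L, D)))   # ['a0', 'a1', 'b0', 'b1'] -- LD
print(len(power(L, 4)))       # 16 -- all 4-letter strings over {a, b}
```

The closures L* and L+ are infinite and cannot be materialized as sets, but power(L, i) gives any finite slice L^i of them.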

Regular expressions

An identifier is a letter followed by zero or more letters or digits. A regular expression is built from a set of defining rules. Each regular expression r denotes a language L(r). The defining rules specify how L(r) is formed by combining, in various ways, the languages denoted by the subexpressions of r. The rules define the regular expressions over an alphabet Σ. Associated with each rule is a specification of the language denoted by the regular expression being defined.
a. ε is a regular expression that denotes { ε }, that is, the set containing the empty string.
b. If a is a symbol in Σ, then a is a regular expression that denotes { a }, i.e., the set containing the string a.
c. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
   i. (r)|(s) is a regular expression denoting L(r) U L(s).
   ii. (r)(s) is a regular expression denoting L(r)L(s).
   iii. (r)* is a regular expression denoting (L(r))*.
   iv. (r) is a regular expression denoting L(r).
A language denoted by a regular expression is said to be a regular set. The specification of a regular expression is an example of a recursive definition. Rules (a) and (b) form the basis of the definition; the term basic symbol refers to ε or a symbol in Σ appearing in a regular expression. Rule (c) is the inductive step and yields compound regular expressions.
Unnecessary parentheses can be avoided in regular expressions by adopting the conventions:
1. The unary operator * has the highest precedence and is left associative.
2. Concatenation has the second highest precedence and is left associative.
3. | has the lowest precedence and is left associative.
A number of algebraic laws are obeyed by regular expressions and can be used to manipulate regular expressions into equivalent forms. Algebraic laws that hold for regular expressions r, s and t:
Axiom                              Description
r|s = s|r                          | is commutative
r|(s|t) = (r|s)|t                  | is associative
(rs)t = r(st)                      concatenation is associative
r(s|t) = rs|rt,  (s|t)r = sr|tr    concatenation distributes over |
εr = r,  rε = r                    ε is the identity element for concatenation
r* = (r|ε)*                        relation between * and ε
r** = r*                           * is idempotent
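Since a regular expression over finite pieces denotes a language built by set union and concatenation, the laws above can be spot-checked on small finite languages; a minimal sketch with illustrative sample sets.

```python
# Spot-checking the algebraic laws on finite languages, where | is set
# union and concatenation pairs every string of one set with every string
# of the other. The sample languages are arbitrary.

def concat(A, B):
    return {s + t for s in A for t in B}

r, s, t = {"a", "ab"}, {"b"}, {"c", ""}

assert r | s == s | r                                      # | is commutative
assert (r | s) | t == r | (s | t)                          # | is associative
assert concat(concat(r, s), t) == concat(r, concat(s, t))  # concat associative
assert concat(r, s | t) == concat(r, s) | concat(r, t)     # distributes over |
assert concat({""}, r) == r == concat(r, {""})             # epsilon identity
print("all laws hold on this sample")
```

Passing on one sample does not prove the laws, of course, but it makes each identity concrete: for instance, the identity law holds because concatenating ε changes no string.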
Regular Definition


If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

d1 → r1, d2 → r2, d3 → r3, …, dn → rn

where each di is a distinct name, and each ri is a regular expression over the symbols in Σ U { d1, d2, …, di-1 }. If ri used dj for some j ≥ i, then ri might be recursively defined, which is not permitted. For example, 5280, 39.37 and 6.39E01 belong to the set of strings described by the following regular definition:

digit   → 0 | 1 | … | 9
digits  → digit digit*
opt_fra → . digits | ε
opt_exp → ( E ( + | - | ε ) digits ) | ε
num     → digits opt_fra opt_exp
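The regular definition for num can be transcribed into Python's re notation, building the pattern from named pieces just as the definition builds it from earlier names. This is a sketch; the variable names mirror the definition above, and ? plays the role of "or ε".

```python
# Transcribing the regular definition for num into Python regex syntax.
import re

digits  = r"[0-9]+"                 # digit digit*
opt_fra = rf"(\.{digits})?"         # . digits | epsilon
opt_exp = rf"(E[+-]?{digits})?"     # ( E ( + | - | eps ) digits ) | eps
num     = re.compile(rf"{digits}{opt_fra}{opt_exp}$")

for lexeme in ["5280", "39.37", "6.39E01", "1.", "E10"]:
    print(lexeme, bool(num.match(lexeme)))
```

The three example strings from the text match, while a trailing bare dot (1.) or a missing integer part (E10) are rejected, as the definition requires.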

Notational Shorthands

One or more instances

If r is a regular expression that denotes the language L(r), then (r)+ is a regular expression that denotes the language (L(r))+. The unary postfix operator + means "one or more instances of". The regular expression a+ denotes the set of all strings of one or more a's. The operator + has the same precedence and associativity as the operator *. The two algebraic identities r* = r+ | ε and r+ = rr* relate the Kleene and positive closure operators.
Zero or one instance

The unary postfix operator ? means "zero or one instance of". The notation r? is shorthand for r | ε. If r is a regular expression, then (r)? is a regular expression that denotes the language L(r) U { ε }. Using these operators, the example above can be rewritten as

digit   → 0 | 1 | … | 9
digits  → digit+
opt_fra → ( . digits )?
opt_exp → ( E ( + | - )? digits )?
num     → digits opt_fra opt_exp

Character classes

The notation [abc], where a, b and c are alphabet symbols, denotes the regular expression a | b | c. The character class [a-z] denotes the regular expression a | b | … | z. Using character classes, identifiers can be described as the strings generated by the regular expression [A-Za-z][A-Za-z0-9]*.
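The identifier pattern above can be tried out directly, since Python's re module uses the same character-class notation; a minimal sketch.

```python
# The identifier character-class pattern from the text, anchored so the
# whole lexeme must match.
import re

identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*$")

for lexeme in ["count", "D2", "x", "2day", "rate!"]:
    print(lexeme, bool(identifier.match(lexeme)))
```

Strings starting with a digit (2day) or containing a non-alphanumeric character (rate!) fail, matching the informal description "letter followed by letters and digits".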
Non regular sets

Regular expressions can denote only a fixed number of repetitions or an unspecified number of repetitions of a given construct; they cannot describe balanced or nested constructs. Repeating strings also cannot be described by regular expressions: the set { ww | w is a string of a's and b's } cannot be denoted by any regular expression, nor can it be described by a context-free grammar.
Recognition of Tokens

The lexical analyzer isolates the lexeme for the next token in the input buffer and produces as output a pair consisting of the appropriate token and attribute value.
Transition Diagram

An intermediate step in the construction of a lexical analyzer is to produce a stylized flowchart called a transition diagram. Transition diagrams depict the actions that take place when a lexical analyzer is called by the parser to get the next token. Positions in a transition diagram are drawn as circles and are called states. The states are connected by arrows called edges. The edges leaving state s have labels indicating the input characters that can appear after the transition diagram has reached state s. The initial state of the transition diagram is labeled the start state; control resides there when we begin to recognize the token. A double-circled state is an accepting state or final state, in which the token has been found. A * marks states at which the input pointer must be retracted by one character. If failure occurs in all transition diagrams, then a lexical error has been detected and an error-recovery routine is invoked. For example, the transition diagram for >= is:

[Fig. Transition diagram: start state 0, an edge labeled > to state 1, and an edge labeled = from state 1 to accepting state 2, recognizing >=.]
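A transition diagram translates almost mechanically into code: one variable holds the current state, and each state inspects the next input character to choose an edge. The sketch below codes the >= diagram, extended with the starred retracting state for a bare > so the lexeme boundary is handled; the function name and return convention are illustrative.

```python
# The >= transition diagram as a small state machine. State 0 is the start
# state; returning at state 1 on '=' corresponds to accepting state 2, and
# returning without consuming corresponds to a starred (retracting) state.

def relop(input_chars, pos):
    """Recognize '>' or '>=' starting at pos; return (token, next_pos)."""
    state = 0
    while True:
        ch = input_chars[pos] if pos < len(input_chars) else None
        if state == 0:
            if ch == ">":
                state, pos = 1, pos + 1     # edge labeled > to state 1
            else:
                return (None, pos)          # failure: not this diagram
        elif state == 1:
            if ch == "=":
                return (">=", pos + 1)      # accepting state 2
            else:
                return (">", pos)           # starred state: retract, accept >

print(relop("a >= b", 2))   # ('>=', 4)
print(relop("a > b", 2))    # ('>', 3)
```

Returning (None, pos) without consuming input models failure of this diagram, after which a real analyzer would try the next transition diagram in sequence.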
