
Introduction

Every programming language has rules that prescribe the syntactic structure of well-formed programs.

In Pascal, for example, a program is made out of blocks, a block out of statements, a statement out of expressions, an expression out of tokens, and so on. The syntax of programming-language constructs can be described by a context-free grammar (CFG) or by BNF notation. Grammars offer significant advantages to both language designers and compiler writers.

The role of the parser

The parser obtains a string of tokens from the lexical analyzer, as shown in the figure below, and verifies that the string of tokens can be generated by the grammar for the source language. We also expect the parser to report any syntax errors in an intelligible fashion, and to recover from commonly occurring errors so that it can continue processing the remainder of its input.

[Figure: the parser within the compiler front end. The lexical analyzer reads the source program and hands tokens to the parser on request ("get next token"); the parser builds a parse tree for the rest of the front end, which produces an intermediate representation; both phases consult the symbol table.]
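The interaction above amounts to a simple pull interface: the parser repeatedly asks the lexical analyzer for the next token. A minimal illustrative sketch in Python (the Token shape and the toy tokenizer are assumptions made for the example, not a fixed API):

# Sketch of the "get next token" interface between lexer and parser.
from collections import namedtuple

Token = namedtuple("Token", ["kind", "text"])

def tokenize(source):
    # Toy lexer: one token per whitespace-separated word.
    for word in source.split():
        if word.isidentifier():
            yield Token("id", word)
        else:
            yield Token(word, word)   # operators, punctuation, etc.
    yield Token("eof", "")

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.lookahead = next(self.tokens)   # pull the first token on demand

    def advance(self):
        self.lookahead = next(self.tokens)   # "get next token"

parser = Parser(tokenize("a + b"))
print(parser.lookahead)   # Token(kind='id', text='a')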

There are three general types of parsers for grammars: universal, top-down, and bottom-up. Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar. These general methods are, however, too inefficient to use in production compilers. The methods commonly used in compilers can be classified as either top-down or bottom-up. As implied by their names, top-down methods build parse trees from the top (root) to the bottom (leaves), while bottom-up methods start from the leaves and work their way up to the root. In either case, the input to the parser is scanned from left to right, one symbol at a time.

Parser definition

A parser for a grammar G is a program that takes as input a string w and produces as output either a parse tree for w, if w is a sentence of G, or an error message indicating that w is not a sentence of G.

Error Recovery Strategies

1. Panic-mode recovery: Consider an error such as

a = b + c   // no semicolon
d = e + f ;

The compiler discards all subsequent tokens until a ; is encountered. This is a crude method, but it often turns out to be the best one. It may skip a considerable amount of input without checking it for additional errors, but it has the advantage of simplicity, and in situations where multiple errors in the same statement are rare it may be quite adequate. (A code sketch follows strategy 2 below.)

2. Phrase-level recovery: On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. For an error like the one above, it would report the error, insert the missing ;, and continue.
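A minimal sketch of panic-mode recovery (strategy 1 above), assuming tokens are plain strings and the semicolon is the only synchronizing token; the function name and token list are illustrative:

# Panic-mode recovery: on a syntax error, discard tokens until a
# synchronizing token (here ';') is found, then resume just past it.
SYNC_TOKENS = {";"}

def panic_mode_recover(tokens, pos):
    # tokens: list of token strings; pos: index where the error occurred.
    # Returns the index at which parsing should resume.
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1                 # discard the offending token
    return pos + 1               # resume after the synchronizing token

# Example: "a = b + c  d = e + f ;" -- the ';' after c is missing.
toks = ["a", "=", "b", "+", "c", "d", "=", "e", "+", "f", ";"]
resume = panic_mode_recover(toks, 5)   # error detected at token "d"
print(toks[resume:])   # [] -- everything up to and including ';' was skipped

Note how the second statement is skipped entirely: this is the "considerable amount of input" the strategy may discard.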


3. Error productions: If we have an idea of the common errors that might occur, we can include productions for them in the grammar at hand. For example, suppose we have a production rule like:

E → + E | - E | * E | / E

Then in

a = +b ;   a = -b ;   a = *b ;   a = /b ;

the last two are error situations. Hence we change the grammar to:

E → + E | - E | * A | / A
A → E

Once the parser encounters * A, it emits an error message asking the user whether a unary * was really intended. Errors anticipated this way are detected at compile time.

4. Global correction: We would like the compiler to make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction. Suppose we have a line like:

THIS IS A OCMPERIL SCALS.

To correct it, an attractor measures how different each token is from the valid inputs and finds the closest valid token to each incorrect one. This is a probabilistic style of error correction. Unfortunately, these methods are in general too costly to implement in terms of space and time.

Context-Free Grammar (CFG)

It is natural to define certain programming-language constructs recursively. For example, if S1 and S2 are statements and E is an expression, then:

(1) "if E then S1 else S2" is a statement;
(2) if S1, S2, ..., Sn are statements, then "begin S1; S2; ...; Sn end" is a statement;
(3) if E1 and E2 are expressions, then "E1 + E2" is an expression.


If we use the syntactic category "statement" to denote the class of statements and "expression" to denote the class of expressions, then (1) and (3) can be expressed by the following rewriting rules, or productions:

statement → if expression then statement else statement
expression → expression + expression

With (2) above we could write

statement → begin statement ; statement ; ... ; statement end

But the use of ellipses (...) would create problems. To overcome this we require rewriting rules with no ellipses; introducing the new symbol statement-list, we write:

statement → begin statement-list end
statement-list → statement | statement ; statement-list

This set of rules is an example of a grammar. In general a grammar involves four components: terminals, nonterminals, a start symbol, and productions. For example, the grammar with the following productions defines simple arithmetic expressions:

expr → expr op expr | ( expr ) | - expr | id
op → + | - | * | /

In this grammar the terminal symbols are id + - * / ( ). The nonterminal symbols are expr and op, and expr is the start symbol.
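The four components of a grammar map naturally onto a small data structure. Below is an illustrative sketch in Python of the arithmetic-expression grammar above; the particular encoding (sets plus a dictionary of productions) is an assumption made for the example, not a standard representation:

# A context-free grammar as data: terminals, nonterminals, start symbol,
# and productions (head -> list of alternative bodies).
grammar = {
    "terminals":    {"id", "+", "-", "*", "/", "(", ")"},
    "nonterminals": {"expr", "op"},
    "start":        "expr",
    "productions": {
        "expr": [["expr", "op", "expr"], ["(", "expr", ")"],
                 ["-", "expr"], ["id"]],
        "op":   [["+"], ["-"], ["*"], ["/"]],
    },
}

# Sanity check: every symbol used in a body is a terminal or a nonterminal.
symbols = grammar["terminals"] | grammar["nonterminals"]
assert all(s in symbols
           for bodies in grammar["productions"].values()
           for body in bodies for s in body)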


Notational Conventions

To avoid always having to state that "these are the terminals," "these are the nonterminals," and so on, the following notational conventions for grammars will be used throughout the remainder of this book.

1. These symbols are terminals:
(a) Lowercase letters early in the alphabet, such as a, b, c.
(b) Operator symbols such as +, *, and so on.
(c) Punctuation symbols such as parentheses, comma, and so on.
(d) The digits 0, 1, ..., 9.
(e) Boldface strings such as id or if, each of which represents a single terminal symbol.

2. These symbols are nonterminals:
(a) Uppercase letters early in the alphabet, such as A, B, C.
(b) The letter S, which, when it appears, is usually the start symbol.
(c) Lowercase, italic names such as expr or stmt.
(d) When discussing programming constructs, uppercase letters may be used to represent nonterminals for the constructs. For example, nonterminals for expressions, terms, and factors are often represented by E, T, and F, respectively.

3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either nonterminals or terminals.

4. Lowercase letters late in the alphabet, chiefly u, v, ..., z, represent (possibly empty) strings of terminals.

5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar symbols. Thus, a generic production can be written as A → α, where A is the head and α the body.

6. A set of productions A → α1, A → α2, ..., A → αk with a common head A (call them A-productions) may be written A → α1 | α2 | ... | αk, the alternatives for A.

7. Unless stated otherwise, the head of the first production is the start symbol.

Derivation

There are several ways to view the process by which a grammar defines a language; one of them is derivation. In a derivation, a production is treated as a rewriting rule in which the nonterminal on the left side is replaced by the string on the right side of the production. A derivation is a sequence of strings α0, α1, α2, ..., αn such that αi-1 ⇒ αi for 0 < i ≤ n.

Example: Consider the following grammar for arithmetic expressions, with the nonterminal E representing an expression:

E → E + E | E * E | ( E ) | - E | id

The production E → -E signifies that an expression preceded by a minus sign is also an expression. This production can be used to generate more complex expressions from simpler ones by replacing any instance of an E by -E. We describe this action by writing E ⇒ -E, which is read "E derives -E". For example, E * E ⇒ (E) * E, or E * E ⇒ E * (E).

We can take a single E and repeatedly apply productions in any order to obtain a sequence of replacements, for example:

E ⇒ -E ⇒ -(E) ⇒ -(id)

We call such a sequence of replacements a derivation of -(id) from E. If S ⇒* α, where α may contain nonterminal symbols, we say that α is a sentential form of the grammar G. A sentence is a sentential form with no nonterminal symbols.

Example: The string -(id + id) is a sentence of the grammar above because there is the derivation

E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)

The strings -E, -(E), -(E + E), -(id + E), -(id + id) appearing in this derivation are all sentential forms of this grammar. We write E ⇒* -(id + id) to indicate that -(id + id) can be derived from E.
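The derivation above can be replayed mechanically by treating each production as a rewriting rule on a list of symbols. A small illustrative sketch in Python (the list encoding and leftmost-replacement helper are assumptions made for the example):

# Replay E => -E => -(E) => -(E+E) => -(id+E) => -(id+id) by rewriting
# the leftmost "E" in the sentential form with the body of a production.
def apply_leftmost(form, body):
    i = form.index("E")              # position of the leftmost E
    return form[:i] + body + form[i + 1:]

form = ["E"]
steps = [["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]]
for body in steps:
    form = apply_leftmost(form, body)
    print(" ".join(form))
# Output:
# - E
# - ( E )
# - ( E + E )
# - ( id + E )
# - ( id + id )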

Parse tree and derivation

A parse tree may be viewed as a graphical representation of a derivation that filters out the choice of replacement order. Each interior node of a parse tree is labeled by a nonterminal A, and the children of the node are labeled, from left to right, by the symbols of the right side of the production by which this A was replaced in the derivation. The leaves of the parse tree are labeled by nonterminals or terminals and, read from left to right, they constitute a sentential form, called the yield or frontier of the tree.

Example: The parse tree for -(id + id).

[Figure: the parse tree for -(id + id), with root E; its frontier, read left to right, is - ( id + id ).]

To see the relationship between the derivation and the parse tree, consider the derivation

E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)

[Figure: the sequence of partial parse trees corresponding to each step of this derivation.]
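The yield of a parse tree can be computed directly from a tree encoding. The sketch below, in Python, encodes the parse tree for -(id + id) as nested tuples (an illustrative encoding, not a standard one) and reads off its frontier:

# Parse tree for -(id + id): interior nodes are (label, *children),
# leaves are plain strings.
tree = ("E", "-",
        ("E", "(",
              ("E", ("E", "id"), "+", ("E", "id")),
         ")"))

def frontier(node):
    # Read the leaves left to right: the yield of the tree.
    if isinstance(node, str):
        return [node]
    label, *children = node
    return [leaf for child in children for leaf in frontier(child)]

print(" ".join(frontier(tree)))   # - ( id + id )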

Example: For the arithmetic-expression grammar above, the sentence id + id * id has two distinct leftmost derivations:

(a) E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

(b) E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

and correspondingly two parse trees for id + id * id.

The parse tree of (a) reflects the conventional precedence of + and *, while the tree of (b) does not: (a) groups the sentence as a + (b * c) rather than as (a + b) * c.

Regular Expressions versus Context-Free Grammars

Every construct that can be described by a regular expression can also be described by a grammar. For example, the regular expression (a|b)*abb and the grammar

A0 → a A0 | b A0 | a A1
A1 → b A2
A2 → b A3
A3 → ε

describe the same language: the set of strings of a's and b's ending in abb. Since every regular set is a context-free language, we may reasonably ask why regular expressions are used to define the lexical syntax of a language:

1. The lexical rules of a language are frequently quite simple, and describing them does not require a notation as powerful as a grammar.
2. Regular expressions generally provide a more concise and easier-to-understand notation for tokens than a grammar.
3. More efficient lexical analyzers can be constructed automatically from regular expressions than from arbitrary grammars.
4. Separating the syntactic structure of a language into lexical and nonlexical parts provides a convenient way of modularizing the front end of a compiler into two manageable-sized components.

Regular expressions are most useful for describing the structure of constructs such as identifiers, constants, keywords, and white space. Grammars, on the other hand, are most useful for describing nested structures such as balanced parentheses, matching begin-end's, corresponding if-then-else's, and so on. These nested structures cannot be described by regular expressions.

Ambiguity

A grammar that produces more than one parse tree (or more than one leftmost derivation, or more than one rightmost derivation) for some sentence is said to be ambiguous. For certain types of parsers it is desirable that the grammar be made unambiguous; otherwise we cannot uniquely determine which parse tree to select for a sentence. For some applications we shall also consider methods whereby certain ambiguous grammars can be used together with disambiguating rules that "throw away" undesirable parse trees, leaving only one parse tree for each sentence.
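The practical consequence of ambiguity is easy to demonstrate: if the three id's in id + id * id stand for the numbers 2, 3, and 4, the two parse trees denote different values. A small sketch (the tuple encoding of trees is an illustrative assumption):

# Two parse trees for 2 + 3 * 4, encoded as nested tuples (op, left, right).
tree_a = ("+", 2, ("*", 3, 4))   # tree (a): + at the root, a + (b * c)
tree_b = ("*", ("+", 2, 3), 4)   # tree (b): * at the root, (a + b) * c

def evaluate(node):
    if isinstance(node, int):
        return node
    op, left, right = node
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r

print(evaluate(tree_a))   # 14 -- the conventional precedence
print(evaluate(tree_b))   # 20 -- what the other parse tree would mean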


Eliminating Ambiguity

Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity. As an example, we shall eliminate the ambiguity from the following "dangling-else" grammar:

stmt → if expr then stmt | if expr then stmt else stmt | other

Here "other" stands for any other statement. According to this grammar, the compound conditional statement

if E1 then if E2 then S1 else S2

is ambiguous, since it has the two parse trees shown in the following figure:
[Figure: two parse trees for "if E1 then if E2 then S1 else S2". In tree (1) the else is associated with the outer if; in tree (2) it is associated with the inner if.]

In all programming languages with conditional statements of this form, the second parse tree is preferred. The general rule is: match each else with the closest previous unmatched then. This disambiguating rule can be incorporated directly into the grammar. For example, we can rewrite the if-statement grammar above as the following unambiguous grammar. The idea is that a statement appearing between a then and an else must be "matched": a matched statement is either an if-then-else statement containing no unmatched statement, or it is any other kind of unconditional statement. Thus we may use the grammar:

stmt → matched_stmt | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt | other
unmatched_stmt → if expr then stmt | if expr then matched_stmt else unmatched_stmt

This grammar is unambiguous.

Eliminating left recursion

A grammar is left recursive if it has a nonterminal A such that there is a derivation A ⇒+ Aα for some string α. Top-down parsing methods cannot handle left-recursive grammars, so a transformation that eliminates left recursion is needed. The left-recursive pair of productions A → Aα | β can be replaced by the non-left-recursive productions

A → β A'
A' → α A' | ε

without changing the set of strings derivable from A. This rule by itself suffices in many grammars.

Example: Consider the following grammar for arithmetic expressions:

E → E + T | T
T → T * F | F
F → ( E ) | id

Eliminating the immediate left recursion (productions of the form A → Aα) from the productions for E and then for T, we obtain:

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

The general technique for eliminating immediate left recursion takes the following form. Given

A → A α1 | A α2 | ... | A αm | β1 | β2 | ... | βn

where no βi begins with A, we replace the A-productions by:

A → β1 A' | β2 A' | ... | βn A'
A' → α1 A' | α2 A' | ... | αm A' | ε

Example:

S → A a | b
A → A c | S d | ε

Here S is left recursive because S ⇒ A a ⇒ S d a, but the left recursion is not immediate.
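The general technique above can be sketched as a small function. The sketch below, in Python, handles only immediate left recursion and assumes a grammar encoded as head → list of alternative bodies, with "ε" marking the empty string (both encoding choices are assumptions made for illustration):

# Eliminate immediate left recursion: A -> A a1 | ... | A am | b1 | ... | bn
# becomes A -> b1 A' | ... | bn A'  and  A' -> a1 A' | ... | am A' | ε.
EPS = "ε"

def eliminate_immediate_left_recursion(head, bodies):
    recursive = [b[1:] for b in bodies if b and b[0] == head]   # the alphas
    rest      = [b for b in bodies if not b or b[0] != head]    # the betas
    if not recursive:
        return {head: bodies}               # nothing to do
    new = head + "'"
    return {
        head: [(b if b != [EPS] else []) + [new] for b in rest],
        new:  [a + [new] for a in recursive] + [[EPS]],
    }

# E -> E + T | T   becomes   E -> T E'   and   E' -> + T E' | ε
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
# {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], ['ε']]}

For indirect left recursion, as in the S/A example above, the productions must first be substituted so that the recursion becomes immediate; that step is omitted from this sketch.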

Left factoring

Left factoring is a grammar transformation that is useful for producing a grammar suitable for predictive parsing. The basic idea is that when it is not clear which of two alternative productions to use to expand a nonterminal A, we may be able to rewrite the A-productions to defer the decision until we have seen enough of the input to make the right choice. For example, given the productions

stmt → if expr then stmt else stmt | if expr then stmt

on seeing the input token if we cannot immediately tell which production to choose to expand stmt. In general, if A → αβ1 | αβ2 are two A-productions and the input begins with a nonempty string derived from α, we do not know whether to expand A to αβ1 or to αβ2. We may defer the decision by expanding A to αA'. Then, after seeing the input derived from α, we expand A' to β1 or to β2. That is, left-factored, the original productions become:

A → α A'
A' → β1 | β2

Left-factoring algorithm

Input: grammar G.
Output: an equivalent left-factored grammar.
Method: for each nonterminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε, i.e. there is a nontrivial common prefix, replace all of the A-productions

A → α β1 | α β2 | ... | α βn | γ

where γ represents all alternatives that do not begin with α, by:

A → α A' | γ
A' → β1 | β2 | ... | βn

Example: The following grammar abstracts the dangling-else problem:

S → i E t S | i E t S e S | a
E → b

Left-factored, this grammar becomes:

S → i E t S S' | a
S' → e S | ε
E → b
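As a sketch, one pass of the method above can be coded for a single nonterminal as follows (Python; the grammar encoding and the single-pass restriction are simplifying assumptions made for the example):

# One left-factoring pass: find the longest prefix alpha shared by two or
# more alternatives and replace A -> alpha b1 | ... | alpha bn | gamma
# by A -> alpha A' | gamma and A' -> b1 | ... | bn ("ε" = empty body).
EPS = "ε"

def common_prefix(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]

def left_factor(head, bodies):
    best = []
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            p = common_prefix(bodies[i], bodies[j])
            if len(p) > len(best):
                best = p
    if not best:
        return {head: bodies}               # no common prefix to factor
    new = head + "'"
    factored = [b[len(best):] or [EPS] for b in bodies
                if b[:len(best)] == best]
    others = [b for b in bodies if b[:len(best)] != best]
    return {head: [best + [new]] + others, new: factored}

# S -> i E t S | i E t S e S | a   (the dangling-else abstraction above)
print(left_factor("S", [["i", "E", "t", "S"],
                        ["i", "E", "t", "S", "e", "S"],
                        ["a"]]))
# {'S': [['i', 'E', 't', 'S', "S'"], ['a']], "S'": [['ε'], ['e', 'S']]}

The output matches the hand-derived result: S → i E t S S' | a and S' → ε | e S.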

