

Compilers: Class Notes

... with a Programming Prerequisite


I do assume you are an experienced programmer. There will be non-trivial programming assignments during this course. Indeed, you will write a compiler for a simple programming language. I also assume that you have at least a passing familiarity with assembler language. In particular, your compiler will produce assembler language. We will not, however, write significant assembly-language programs.

0.10: Academic Integrity Policy


Our policy on academic integrity, which applies to all graduate courses in the department, can be found here.

Roadmap of the Course


1. Chapter 1 touches on all the material.
2. Chapter 2 constructs a simple compiler.
3. Chapters 3-10 fill in the (considerable) gaps.
4. I tend to spend too much time on introductory chapters, but will try not to.

Chapter 1: Introduction to Compiling


Homework Read chapter 1.

1.1: Compilers
A Compiler is a translator from one language, the input or source language, to another language, the output or target language. Often, but not always, the target language is an assembler language or the machine language for a computer processor.

Analysis, synthesis, front and back ends


Modern compilers contain two (large) parts, each of which is often subdivided. These two parts are the front end and the back end. The front end analyzes the source program, determines its constituent parts, and constructs an intermediate representation of the program. Typically the front end is independent of the target language. The back end synthesizes the target program from the intermediate representation produced by the front end. Typically the back end is independent of the source language. This front/back division very much reduces the work for a compiling system that can handle several (N) source languages and several (M) target languages. Instead of NM compilers, we need N front ends and M back ends. For gcc (originally standing for Gnu C Compiler, but now standing for Gnu Compiler Collection), N=7 and M~30 so the savings is considerable.

Syntax Trees
Often the intermediate form produced by the front end is a syntax tree. In simple cases, such as that shown to the right
corresponding to the C statement sequence


x := 2;
y := x+3;

this tree has constants and variables (and nil) for leaves and operators for internal nodes. The back end traverses the tree in Euler-tour order and generates code for each node. (This is quite oversimplified.)
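To make the tree walk concrete, here is a minimal sketch (mine, not the notes' code) of a syntax-tree node and a postorder walk that prints one pseudo-instruction per node; the node layout and the output format are invented for illustration.

#include <stdio.h>

/* A syntax-tree node: a leaf holds a lexeme, an interior node holds an operator. */
struct node {
    char op;                    /* '+', '-', ... ; 0 for a leaf          */
    const char *lexeme;         /* constant or variable name, for leaves */
    struct node *left, *right;  /* children; NULL for leaves             */
};

/* Postorder walk: generate (here, print) code for the children, then the node. */
static void gen(const struct node *n) {
    if (n == NULL) return;
    gen(n->left);
    gen(n->right);
    if (n->op)
        printf("apply %c\n", n->op);
    else
        printf("load  %s\n", n->lexeme);
}

int main(void) {                /* walk the tree for the expression x + 3 */
    struct node x     = {0, "x", NULL, NULL};
    struct node three = {0, "3", NULL, NULL};
    struct node plus  = {'+', NULL, &x, &three};
    gen(&plus);
    return 0;
}

A real back end would of course emit machine or intermediate instructions rather than printing, but the traversal order is the same.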

Other analyzers and synthesizers


Other compiler-like applications also use analysis and synthesis. Some examples include
1. Pretty printer. Can be considered a real compiler with the target language a formatted version of the source.
2. Interpreter. The synthesis traverses the tree and executes the operation at each node (rather than generating code to do so).

The compilation tool chain


[preprocessor] → [compiler] → [assembler] → [linker] → [loader]

We will be primarily focused on the second element of the chain, the compiler. Our target language will be assembly language.

Preprocessors

Preprocessors are normally fairly simple, as in the C language, providing primarily the ability to include files and expand macros. There are exceptions, however. IBM's PL/I, another Algol-like language, had quite an extensive preprocessor, which made available at preprocessor time much of the PL/I language itself (e.g., loops and, I believe, procedure calls). Some preprocessors essentially augment the base language to add additional capabilities. One could consider them as compilers in their own right, having as source this augmented language (say fortran augmented with statements for multiprocessor execution in the guise of fortran comments) and as target the original base language (in this case fortran). Often the preprocessor inserts procedure calls to implement the extensions at runtime.

Assemblers

Assembly code is a mnemonic version of machine code in which names, rather than binary values, are used for machine instructions and memory addresses. Some processors have fairly regular operations and as a result assembly code for them can be fairly natural and not too hard to understand. Other processors, in particular Intel's x86 line, have, let us charitably say, more interesting instructions, with certain registers used for certain things. My laptop has one of these latter processors (pentium 4) so my gcc compiler produces code that from a pedagogical viewpoint is less than ideal. If you have a mac with a ppc processor (the newest macs are x86), your assembly language is cleaner. NYU's ACF features sun computers with sparc processors, which also have regular instruction sets.
Two pass assembly

No matter what the assembly language is, an assembler needs to assign memory locations to symbols (called identifiers) and use the numeric location address in the target machine language produced. Of course the same address must be used for all occurrences of a given identifier and two different identifiers must (normally) be assigned two different locations. The conceptually simplest way to accomplish this is to make two passes over the input (read it once, then read it again from the beginning). During the first pass, each time a new identifier is encountered, an address is assigned and the pair (identifier, address) is stored in a symbol table. During the second pass, whenever an identifier is encountered, its address is looked up in the symbol table and this value is used in the generated machine instruction.
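As a concrete sketch of the pass-one bookkeeping (not the notes' code; the table layout and the next_address policy are assumptions made for illustration):

#include <string.h>

#define MAXSYMS 1000

struct sym { char name[32]; int address; };
static struct sym table[MAXSYMS];
static int nsyms = 0;
static int next_address = 0;          /* next free data location */

/* Pass one: return the address of name, assigning a new one on first sight. */
int address_of(const char *name) {
    for (int i = 0; i < nsyms; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].address;  /* identifier seen before */
    strcpy(table[nsyms].name, name);  /* new identifier: assign the next slot */
    table[nsyms].address = next_address;
    next_address += 4;                /* assume one word per symbol */
    return table[nsyms++].address;
}

During the second pass every identifier is already in the table, so the same call simply returns the stored address for use in the generated machine instruction.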
A Trivial Assembler Program

Consider the following trivial C program that computes and returns the xor of the characters in a string.
int xor(char s[])          /* native C speakers say char *s */
{
    int ans = 0;
    int i = 0;
    while (s[i] != 0) {
        ans = ans ^ s[i];
        i = i + 1;
    }
    return ans;
}

The corresponding assembly language program (produced by gcc -S -fomit-frame-pointer) is


.ie "o." fl xrc .et tx .lb xr gol o .ye xr @ucin tp o, fnto xr o: sb ul $,%s 8 ep mv ol $,4%s) 0 (ep mv ol $,(ep 0 %s) .2 L: mv ol (ep,%a %s) ex ad dl 1(ep,%a 2%s) ex cp mb $,(ex 0 %a) j e .3 L mv ol (ep,%a %s) ex ad dl 1(ep,%a 2%s) ex mvb (ex,ex osl %a)%d la el 4%s) %a (ep, ex xr ol %d,(ex ex %a) mv ol %s,%a ep ex ic nl (ex %a) jp m .2 L .3 L: mv ol 4%s) %a (ep, ex ad dl $,%s 8 ep rt e .ie xr .xr sz o, -o .eto scin .oeGUsak",pobt nt.N-tc,"@rgis .dn "C:(N)346(eto346r,sp34510 pe879" iet GC GU .. Gno ..-1 s-..-., i-..)

You should be able to follow everything from xor: to ret. Indeed most of the rest can be omitted (the .globl xor line is needed). That is, the following assembly program gives the same results.
.globl xor
xor:
        subl    $8, %esp
        movl    $0, 4(%esp)
        movl    $0, (%esp)
.L2:
        movl    (%esp), %eax
        addl    12(%esp), %eax
        cmpb    $0, (%eax)
        je      .L3
        movl    (%esp), %eax
        addl    12(%esp), %eax
        movsbl  (%eax), %edx
        leal    4(%esp), %eax
        xorl    %edx, (%eax)
        movl    %esp, %eax
        incl    (%eax)
        jmp     .L2
.L3:
        movl    4(%esp), %eax
        addl    $8, %esp
        ret

What is happening in this program?
1. The stack pointer originally points to the (unused) frame pointer. Just above is the one parameter S.
2. Allocate (by moving the stack pointer) and initialize the local variables.
3. Add S (an address) to I (giving the address of S[I]).
4. Compare the contents of the above address (i.e., S[I]) to 0 and break out if appropriate.
5. Calculate S[I] xor ans (not using the address just calculated, since I did not ask for optimization). The code actually performs the calculation in the memory location containing ans (ppc and sparc would do this in a register).
6. Increment I and loop.
7. When the loop ends, store the return value (in eax), restore the stack pointer, and return.

Lab assignment 1 is available on the class web site. The programming is trivial; you are just doing inclusive (i.e., normal) OR rather than the XOR I just did. The point of the lab is to give you a chance to become familiar with your compiler and assembler.

Linkers

Linkers, a.k.a. linkage editors, combine the output of the assembler for several different compilations. That is, the horizontal line of the diagram above should really be a collection of lines converging on the linker. The linker has another input, namely libraries, but to the linker the libraries look like other programs compiled and assembled. The two primary tasks of the linker are
1. Relocating relative addresses.
2. Resolving external references (such as the procedure xor() above).
Relocating relative addresses

The assembler processes one file at a time. Thus the symbol table produced while processing file A is independent of the symbols defined in file B, and conversely. Thus, it is likely that the same address will be used for different symbols in each program. The technical term is that the (local) addresses in the symbol table for file A are relative to file A; they must be relocated by the linker. This is accomplished by adding the starting address of file A (which in turn is the sum of the lengths of all the files processed previously in this run) to the relative address.
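A minimal sketch of this step, assuming (purely for illustration) that a file's symbol table is an array of (name, relative address) pairs:

struct sym { char name[32]; int address; };   /* address is relative to the file */

/* Add file A's starting address to every relative address in its table. */
void relocate(struct sym *table, int nsyms, int starting_address) {
    for (int i = 0; i < nsyms; i++)
        table[i].address += starting_address;
}

Here starting_address is the sum of the lengths of all files processed earlier in the run, as described above.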
Resolving external references

Assume procedure f, in file A, and procedure g, in file B, are compiled (and assembled) separately. Assume also that f invokes g. Since the compiler and assembler do not see g when processing f, it appears impossible for procedure f to know where in memory to find g. The solution is for the compiler to indicate in the output of the file A compilation that the address of g is needed. This is called a use of g. When processing file B, the compiler outputs the (relative) address of g. This is called the definition of g. The assembler passes this information to the linker. The simplest linker technique is to again make two passes. During the first pass, the linker records in its external symbol table (a table of external symbols, not a symbol table that is stored externally) all the definitions encountered.

During the second pass, every use can be resolved by access to the table. I will be covering the linker in more detail tomorrow at 5pm in 2250, OS Design.

Loaders

After the linker has done its work, the resulting executable file can be loaded by the operating system into central memory. The details are OS dependent. With early single-user operating systems all programs would be loaded into a fixed address (say 0) and the loader simply copies the file to memory. Today it is much more complicated since (parts of) many programs reside in memory at the same time. Hence the compiler/assembler/linker cannot know the real location for an identifier. Indeed, this real location can change. More information is given in any OS course (e.g., 2250 given wednesdays at 5pm).

1.2: Analysis of the source program


Conceptually, there are three phases of analysis with the output of one phase the input of the next. The phases are called lexical analysis or scanning, syntax analysis or parsing, and semantic analysis.

Lexical analysis or scanning


The character stream input is grouped into tokens. For example, any one of the following
x3 := y + 3 ;
x3  :=  y  +  3  ;
x3:=y+3;

but not
x 3 := y + 3 ;

would be grouped into
1. The identifier x3.
2. The assignment symbol :=.
3. The identifier y.
4. The plus sign.
5. The number 3.
6. The semicolon.

Note that non-significant blanks are normally removed during scanning. In C, most blanks are non-significant. Blanks inside strings are an exception. Note that identifiers, numbers, and the various symbols and punctuation can be defined without recursion (compare with parsing below).

Syntax analysis or parsing


Parsing involves a further grouping in which tokens are grouped into grammatical phrases, which are normally represented in a parse tree. For example
x3 := y + 3 ;

would be parsed into the tree on the right. This parsing would result from a grammar containing rules such as
asst-stmt → id := expr ;
expr → number
     | id
     | expr + expr

Note the recursive definition of expression (expr). Note also the hierarchical decomposition in the figure on the right. The division between scanning and parsing is somewhat arbitrary, but invariably if a recursive definition is involved, it is considered parsing not scanning. Often we utilize a simpler tree called the syntax tree with operators as interior nodes and operands as the children of the operator. The syntax tree on the right corresponds to the parse tree above it. (Technical point.) The syntax tree represents an assignment expression not an assignment statement. In C an assignment statement includes the trailing semicolon. That is, in C (unlike in Algol) the semicolon is a statement terminator not a statement separator.

Semantic analysis
There is more to a front end than simply syntax. The compiler needs semantic information, e.g., the types (integer, real, pointer to array of integers, etc) of the objects involved. This enables checking for semantic errors and inserting type conversion where necessary. For example, if y was declared to be a real and x3 an integer, we need to insert (unary, i.e., one-operand) conversion operators inttoreal and realtoint as shown on the right.

Analysis in text formatters


This section illustrates the use of hierarchical grouping in text-formatting languages (TeX and EQN are used as examples). For example, it shows how you can get subscripted superscripts (or superscripted subscripts).

1.3: The phases of a compiler


[scanner] → [parser] → [semantic analyzer] → [int. code gen] → [opt 1] → [code gen] → [opt 2]

We just examined the first three phases. Modern, high-performance compilers are dominated by their extensive "optimization" phases, which occur before, during, and after code generation. Note that optimization is most assuredly an inaccurate, albeit standard, terminology, as the resulting code is not optimal.

Symbol-table management
As we have seen when discussing assemblers and linkers, a symbol table is used to maintain information about symbols. The compiler uses a symbol table to maintain information across phases as well as within each phase. One key item stored with each symbol is the corresponding type, which is determined during semantic analysis and used (among other places) during code generation.

Error detection and reporting


As you have doubtless noticed, not all programming efforts produce correct programs. If the input to the compiler is not a legal source language program, errors must be detected and reported. It is often much easier to detect that the program is not legal (e.g., the parser reaches a point where the next token cannot legally occur) than to deduce what is the actual error (which may have occurred earlier). It is even harder to reliably deduce what the intended correct program should be.

The analysis phases


The scanner converts
x3 := y + 3 ;

into
id1 := id2 + 3 ;

where id is short for identifier. This is processed by the parser and semantic analyzer to produce the two trees shown above here and here. On some systems, the tree would not contain the symbols themselves as shown in the figures. Instead the tree would contain leaves of the form idi which in turn would refer to the corresponding entries in the symbol table.

Intermediate code generation


Many compilers first generate code for an idealized machine. For example, the intermediate code generated would assume that the target has an unlimited number of registers and that any register can be used for any operation. Another common assumption is that all machine operations take three operands, two source and one target. With these assumptions one generates three-address code by walking the semantic tree. Our example C instruction would produce
temp1 := inttoreal(3)
temp2 := id2 + temp1
temp3 := realtoint(temp2)
id1   := temp3

We see that three-address code can include instructions with fewer than 3 operands. Sometimes three-address code is called quadruples because one can view the previous code sequence as
inttoreal  temp1  3      --
add        temp2  id2    temp1
realtoint  temp3  temp2  --
assign     id1    temp3  --

Each quad has the form


operation  target  source1  source2
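In memory a quad might be represented along these lines; this is a sketch, and the field names and the use of symbol-table indices are my assumptions, not the text's.

/* One three-address instruction ("quad"). */
enum opkind { OP_ADD, OP_INTTOREAL, OP_REALTOINT, OP_ASSIGN };

struct quad {
    enum opkind op;         /* operation                                      */
    int target;             /* symbol-table index of the result               */
    int source1, source2;   /* symbol-table indices of operands; -1 if unused */
};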

Code optimization
This is a very serious subject, one that we will not really do justice to in this introductory course. Some optimizations are fairly easy to see. 1. Since 3 is a constant, the compiler can perform the int to real conversion and replace the first two quads with
add  temp2  id2  3.0

2. The last two quads can be combined into


realtoint  id1  temp2

Code generation
Modern processors have only a limited number of registers. Although some processors, such as the x86, can perform operations directly on memory locations, we will for now assume only register operations. Some processors (e.g., the MIPS architecture) use three-address instructions. However, some processors permit only two addresses; the result overwrites the second source. With these assumptions, code something like the following would be produced for our example, after first assigning memory locations to id1 and id2.
MOVE  id2, R1
ADD   #3.0, R1
RTOI  R1, R2
MOVE  R2, id1

1.4: Cousins of the compiler


I found it more logical to treat these topics (preprocessors, assemblers, linkers, and loaders) earlier.

1.5: The grouping of phases


Logically each phase is viewed as a separate program that reads input and produces output for the next phase, i.e., a pipeline. In practice some phases are combined.

Front and back ends


We discussed this previously. Aho, Sethi, Ullman assert only limited success in producing several compilers for a single machine using a common back end. That is a rather pessimistic view and I wonder if the 2nd edition will change in this area.

Passes
The term pass is used to indicate that the entire input is read during this activity. So two passes means that the input is read twice. We have discussed two pass approaches for both assemblers and linkers. If we implement each phase separately and use multiple passes for some of them, the compiler will perform a large number of I/O operations, an expensive undertaking. As a result techniques have been developed to reduce the number of passes. We will see in the next chapter how to combine the scanner, parser, and semantic analyzer into one phase. Consider the parser. When it needs another token, rather than reading the input file (presumably produced by the scanner), the parser calls the scanner instead. At selected points during the production of the syntax tree, the parser calls the code generator, which performs semantic analysis as well as generating a portion of the intermediate code.

Reducing the number of passes


One problem with combining phases, or with implementing a single phase in one pass, is that it appears that an internal form of the entire program will need to be stored in memory. This problem arises because the downstream phase may need early in its execution information that the upstream phase produces only late in its execution. This motivated the use of symbol tables and a two pass approach. However, a clever one-pass approach is often possible. Consider the assembler (or linker). The good case is when the definition precedes all uses so that the symbol table contains the value of the symbol prior to that value being needed. Now consider the harder case of one or more uses preceding the definition. When a not yet defined symbol is first used, an entry is placed in the symbol table, pointing to this use and indicating that the definition has not yet appeared. Further uses of the same symbol attach their addresses to a linked list of undefined uses of this symbol. When the definition is finally seen, the value is placed in the symbol table, and the linked list is traversed inserting the value in all previously encountered uses. Subsequent uses of the symbol will find its definition in the table. This technique is called backpatching.
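Here is a hedged C sketch of backpatching for such a symbol; the data layout and the code[] array are assumptions made for illustration, not the book's code.

#include <stdlib.h>

#define UNDEFINED (-1)

struct use { int location; struct use *next; };  /* one waiting reference        */
struct sym { int value; struct use *uses; };     /* value, or UNDEFINED + a list */

extern int code[];                               /* the words being generated    */

/* A use of s at instruction word loc: fill it in now, or remember it. */
void use_symbol(struct sym *s, int loc) {
    if (s->value != UNDEFINED) { code[loc] = s->value; return; }
    struct use *u = malloc(sizeof *u);
    u->location = loc;
    u->next = s->uses;                           /* link onto the list of uses   */
    s->uses = u;
}

/* The definition of s finally appears: patch every earlier use. */
void define_symbol(struct sym *s, int v) {
    s->value = v;
    struct use *u = s->uses;
    while (u != NULL) {
        code[u->location] = v;                   /* backpatch                    */
        struct use *next = u->next;
        free(u);
        u = next;
    }
    s->uses = NULL;
}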

1.6: Compiler-construction tools


Originally, compilers were written from scratch, but now the situation is quite different. A number of tools are available to ease the burden.
We will study tools that generate scanners and parsers. This will involve us in some theory, regular expressions for scanners and various grammars for parsers. These techniques are fairly successful. One drawback can be that they do not execute as fast as hand-crafted scanners and parsers. We will also see tools for syntax-directed translation and automatic code generation. The automation in these cases is not as complete. Finally, there is the large area of optimization. This is not automated; however, a basic component of optimization is dataflow analysis (how are values transmitted between parts of a program) and there are tools to help with this task.

Chapter 2: A Simple One-Pass Compiler


Homework: Read chapter 2. Implement a very simple compiler.
1. Simple source language.
2. No optimization.
3. Target language similar to source.
4. No machine-dependent back end.
5. No tools.
6. Little theory.

2.1: Overview
The source language is infix expressions consisting of digits, +, and -; the target is postfix expressions with the same components. The compiler will convert 7+4-5 to 74+5-. Actually, our simple compiler will handle a few other operators as well. We will "tokenize" the input (i.e., write a scanner), model the syntax of the source, and let this syntax direct the translation.

2.2: Syntax definition


This will be done right in the next two chapters. A context-free grammar (CFG) consists of
1. A set of terminal tokens.
2. A set of nonterminals.
3. A set of productions (rules for transforming nonterminals).
4. A start symbol (a nonterminal).

Example:
Terminals: 0 1 2 3 4 5 6 7 8 9 + -
Nonterminals: list digit
Productions:
    list → list + digit
    list → list - digit
    list → digit
    digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Start symbol: list

Watch how we can generate the input 7+4-5 starting with the start symbol, applying productions, and stopping when no productions are possible (we have only terminals).
list
list - digit
list - 5
list + digit - 5
list + 4 - 5
digit + 4 - 5
7 + 4 - 5

It is important that you see that this context-free grammar generates precisely the set of infix expressions with digits (so 25 is not allowed) as operands and + and - as operators. The way you get different final expressions is that you make different choices of which production to apply. There are 3 productions you can apply to list and 10 you can apply to digit. The input cannot have blanks since blank is not a terminal. The empty string is not a legal input since, starting from list, we cannot get to the empty string. If we wanted to include the empty string, we would add the production
list → ε

Homework: 2.1a, 2.1c, 2.2a-c (don't worry about justifying your answers).

Parse trees
The compiler front end runs the above procedure in reverse! It starts with the string 7+4-5 and gets back to list (the start symbol). Reaching the start symbol means that the string is in the language generated by the grammar. While running the procedure in reverse, the front end builds up the parse tree on the right. You can read off the productions from the tree. For any internal (i.e., non-leaf) tree node, its children give the right hand side (RHS) of a production having the node itself as the LHS. The leaves of the tree, read from left to right, form the yield of the tree. We call the tree a derivation of its yield from its root. The tree on the right is a derivation of 7+4-5 from list. Homework: 2.1b

Ambiguity
An ambiguous grammar is one in which there are two or more parse trees yielding the same final string. We wish to avoid such grammars. The grammar above is not ambiguous. For example 1+2+3 can be parsed only one way; the arithmetic must be done left to right. Note that I am not giving a rule of arithmetic, just of this grammar. If you reduced 2+3 to list you would be stuck since it is impossible to generate 1+list. Homework: 2.3 (applied only to parts a, b, and c of 2.2)

Associativity of operators
Our grammar gives left associativity. That is, if you traverse the tree in postorder and perform the indicated arithmetic you will evaluate the string left to right. Thus 8-8-8 would evaluate to -8. If you wished to generate right associativity (normally exponentiation is right associative, so 2**3**2 gives 512 not 64), you would change the first two productions to list → digit + list and list → digit - list.

Precedence of operators
We normally want * to have higher precedence than +. We do this by using an additional nonterminal to indicate the items that have been multiplied. The example below gives the four basic arithmetic operations their normal precedence unless overridden by parentheses. Redundant parentheses are permitted. Equal precedence operations are performed left to right.
expr   → expr + term | expr - term | term
term   → term * factor | term / factor | factor
factor → digit | ( expr )
digit  → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

We use | to indicate that a nonterminal has multiple possible right hand sides. So
A → B | C

is simply shorthand for


A → B
A → C

Do the examples 1+2/3-4*5 and (1+2)/3-4*5 on the board. Note how the precedence is enforced by the grammar; slick!

Statements

Keywords are very helpful for distinguishing statements from one another.
stmt → id := expr
     | if expr then stmt
     | if expr then stmt else stmt
     | while expr do stmt
     | begin opt-stmts end
opt-stmts → stmt-list | ε
stmt-list → stmt-list ; stmt | stmt

Remark: opt-stmts stands for optional statements. The begin-end block can be empty in some languages. The epsilon stands for the empty string. The use of epsilon productions will add complications. Some languages do not permit empty blocks; e.g., Ada has a null statement, which does nothing when executed, for this purpose. The above grammar is ambiguous! This is the notorious dangling else problem. How do you parse if x then if y then z=1 else z=2? Homework: 2.16a, 2.16b

2.3: Syntax-Directed Translation


Specifying the translation of a source language construct in terms of attributes of its syntactic components.

Postfix notation
Operator after operand. Parentheses are not needed. The normal notation we use is called infix. If you start with an infix expression, the following algorithm will give you the equivalent postfix expression.
1. Variables and constants are left alone.
2. E op F becomes E' F' op, where E' and F' are the postfix of E and F.
3. ( E ) becomes E', where E' is the postfix of E.

One question is, given say 1+2-3, what is E, F and op? Does E=1+2, F=3, and op=-? Or does E=1, F=2-3 and op=+? This is the issue of precedence mentioned above. To simplify the present discussion we will start with fully parenthesized infix expressions.

Example: 1+2/3-4*5
1. Start with 1+2/3-4*5
2. Parenthesize (using standard precedence) to get (1+(2/3))-(4*5)
3. Apply the above rules to calculate P{(1+(2/3))-(4*5)}, where P{X} means convert the infix expression X to postfix.
   A. P{(1+(2/3))-(4*5)}
   B. P{(1+(2/3))} P{(4*5)} -
   C. P{1+(2/3)} P{4*5} -
   D. P{1} P{2/3} + P{4} P{5} * -
   E. 1 P{2} P{3} / + 4 5 * -
   F. 1 2 3 / + 4 5 * -

Example: Now do (1+2)/3-4*5
1. Parenthesize to get ((1+2)/3)-(4*5)
2. Calculate P{((1+2)/3)-(4*5)}
   A. P{((1+2)/3)} P{(4*5)} -
   B. P{(1+2)/3} P{4*5} -
   C. P{(1+2)} P{3} / P{4} P{5} * -
   D. P{1+2} 3 / 4 5 * -
   E. P{1} P{2} + 3 / 4 5 * -
   F. 1 2 + 3 / 4 5 * -

Syntax-directed definitions
We want to decorate the parse trees we construct with annotations that give the value of certain attributes of the corresponding node of the tree. We will do the example of translating infix to postfix with 1+2/3-4*5. We use the following grammar, which follows the normal arithmetic terminology where one multiplies and divides factors to obtain terms, which in turn are added and subtracted to form expressions.
expr   → expr + term | expr - term | term
term   → term * factor | term / factor | factor
factor → digit | ( expr )
digit  → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This grammar supports parentheses, although our example does not use them. On the right is a movie in which the parse tree is built from this example. The attribute we will associate with the nodes is the text to be used to print the postfix form of the string in the leaves below the node. In particular the value of this attribute at the root is the postfix form of the entire source.
The book does a simpler grammar (no *, /, or parentheses) for a simpler example. You might find that one easier. The book also does another grammar describing commands to give a robot to move north, east, south, or west by one unit at a time. The attributes associated with the nodes are the current position (for some nodes, including the root) and the change in position caused by the current command (for other nodes). Definition: A syntax-directed definition is a grammar together with a set of semantic rules for computing the attribute values. A parse tree augmented with the attribute values at each node is called an annotated parse tree.

Synthesized Attributes
For the bottom-up approach I will illustrate now, we annotate a node after having annotated its children. Thus the attribute values at a node can depend on the children of the node but not the parent of the node. We call these synthesized attributes, since they are formed by synthesizing the attributes of the children. In chapter 5, when we study top-down annotations as well, we will introduce inherited attributes that are passed down from parents to children. We specify how to synthesize attributes by giving the semantic rules together with the grammar. That is, we give the syntax-directed definition.

Production              Semantic Rule

expr → expr1 + term     expr.t := expr1.t || term.t || '+'
expr → expr1 - term     expr.t := expr1.t || term.t || '-'
expr → term             expr.t := term.t
term → term1 * factor   term.t := term1.t || factor.t || '*'
term → term1 / factor   term.t := term1.t || factor.t || '/'
term → factor           term.t := factor.t
factor → digit          factor.t := digit.t
factor → ( expr )       factor.t := expr.t
digit → 0               digit.t := '0'
digit → 1               digit.t := '1'
digit → 2               digit.t := '2'
digit → 3               digit.t := '3'
digit → 4               digit.t := '4'
digit → 5               digit.t := '5'
digit → 6               digit.t := '6'
digit → 7               digit.t := '7'
digit → 8               digit.t := '8'
digit → 9               digit.t := '9'

We apply these rules bottom-up (starting with the geographically lowest productions, i.e., the lowest lines on the page) and get the annotated graph shown on the right. The annotations are drawn in green. Homework: Draw the annotated graph for (1+2)/3-4*5.

Depth-first traversals
As mentioned in this chapter we are annotating bottom-up. This corresponds to doing a depth-first traversal of the (unannotated) parse tree to produce the annotations. It is often called a postorder traversal because a parent is visited after (i.e., post) its children are visited.

Translation schemes
The bottom-up annotation scheme generates the final result as the annotation of the root. In our infix to postfix example we get the result desired by printing the root annotation. Now we consider another technique that produces its results incrementally. Instead of giving semantic rules for each production (and thereby generating annotations) we can embed program fragments called semantic actions within the productions themselves. In diagrams the semantic action is connected to the node with a distinctive, often dotted, line. The placement of the actions determines the order they are performed. Specifically, one executes the actions in the order they are encountered in a postorder traversal of the tree. Definition: A syntax-directed translation scheme is a context-free grammar with embedded semantic actions. For our infix to postfix translator, the parent either just passes on the attribute of its (only) child or concatenates them left to right and adds something at the end. The equivalent semantic actions would be either to print the new item or print nothing.

Emitting a translation
Here are the semantic actions corresponding to a few of the rows of the table above. Note that the actions are enclosed in {}.
expr → expr + term    { print('+') }
expr → expr - term    { print('-') }
term → term / factor  { print('/') }
term → factor         { null }
digit → 3             { print('3') }

The diagram for 1+2/3-4*5 with attached semantic actions is shown on the right. Given an input, e.g. our favorite 1+2/3-4*5, we just do a depth first (postorder) traversal of the corresponding diagram and perform the semantic actions as they occur. When these actions are print statements as above, we can be said to be emitting the translation. Do a depth first traversal of the diagram on the board, performing the semantic actions as they occur, and confirm that the
translation emitted is in fact 123/+45*-, the postfix version of 1+2/3-4*5.

Homework: Produce the corresponding diagram for (1+2)/3-4*5.

Prefix to infix translation

When we produced postfix, all the prints came at the end (so that the children were already printed). The { actions } do not need to come at the end. We illustrate this by producing infix arithmetic (ordinary) notation from a prefix source. In prefix notation the operator comes first, so +1-23 evaluates to zero. Consider the following grammar. It translates prefix to infix for the simple language consisting of addition and subtraction of digits between 1 and 3 without parentheses (prefix notation and postfix notation do not use parentheses). The resulting parse tree for +1-23 is shown on the right. Note that the output language (infix notation) has parentheses.
rest → + term rest | - term rest | term
term → 1 | 2 | 3

The table below shows the semantic actions or rules needed for our translator.

Production with Semantic Action                                     Semantic Rule

rest → { print('(') } + term { print('+') } rest { print(')') }     rest.t := '(' || term.t || '+' || rest.t || ')'
rest → { print('(') } - term { print('-') } rest { print(')') }     rest.t := '(' || term.t || '-' || rest.t || ')'
rest → term                                                         rest.t := term.t
term → 1 { print('1') }                                             term.t := '1'
term → 2 { print('2') }                                             term.t := '2'
term → 3 { print('3') }                                             term.t := '3'

Homework: 2.8.

Simple syntax-directed definitions

If the semantic rules of a syntax-directed definition all have the property that the new annotation for the left hand side (LHS) of the production is just the concatenation of the annotations for the nonterminals on the RHS in the same order as the nonterminals appear in the production, we call the syntax-directed definition simple. It is still called simple if new strings are interleaved with the original annotations. So the example just done is a simple syntax-directed definition.

Remark: We shall see later that, in many cases, a simple syntax-directed definition permits one to execute the semantic actions while parsing and not construct the parse tree at all.

================ Start Lecture #2 ================


Remarks:
1. Lecture #1 is available on the web separately.
2. (Some of) the typos have been fixed.
3. The true story of the various editions of the book is now in the giant page (but not in the separate lecture 1).

2.4: Parsing
Objective: Given a string of tokens and a grammar, produce a parse tree yielding that string (or at least determine if such a tree exists). We will learn both top-down (begin with the start symbol, i.e. the root of the tree) and bottom up (begin with the leaves) techniques. In the remainder of this chapter we just do top down, which is easier to implement by hand, but is less general. Chapter 4 covers both approaches. Tools (so called parser generators) often use bottom-up techniques. In this section we assume that the lexical analyzer has already scanned the source input and converted it into a sequence of tokens.

Top-down parsing
Consider the following simple language, which derives a subset of the types found in the (now somewhat dated) programming language Pascal. I am using the same example as the book so that the compiler code they give will be applicable. We have two nonterminals, type, which is the start symbol, and simple, which represents the simple types. There are 8 terminals, which are tokens produced by the lexer and correspond closely with constructs in pascal itself. I do not assume you know pascal. (The authors appear to assume the reader knows pascal, but do not assume knowledge of C.) Specifically, we have:
1. integer and char
2. id for identifier
3. array and of, used in array declarations
4. ↑ meaning pointer to
5. num for a (positive whole) number
6. dotdot for .. (used to give a range like 6..9)

The productions are


type → simple
type → ↑ id
type → array [ simple ] of type
simple → integer
simple → char
simple → num dotdot num

Parsing is easy in principle and for certain grammars (e.g., the two above) it actually is easy. The two fundamental steps (we start at the root since this is top-down parsing) are
1. At the current (nonterminal) node, select a production whose LHS is this nonterminal and whose RHS matches the input at this point. Make the RHS the children of this node (one child per RHS symbol).

2. Go to the next node needing a subtree.
When programmed this becomes a procedure for each nonterminal that chooses a production for the node and calls procedures for each nonterminal in the RHS. Thus it is recursive in nature and descends the parse tree. We call these parsers recursive descent. The big problem is what to do if the current node is the LHS of more than one production. The small problem is what do we mean by the next node needing a subtree. The easiest solution to the big problem would be to assume that there is only one production having a given nonterminal as LHS. There are two possibilities:
1. No circularity. For example
expr → term + term - 9
term → factor / factor
factor → digit
digit → 7

But this is very boring. The only possible sentence is


7/7+7/7-9

2. Circularity
ep tr +tr xr em em tr fco /fco em atr atr fco (ep ) atr xr

This is even worse; there are no (finite) sentences. Only an infinite sentence beginning (((((((((. So this won't work. We need to have multiple productions with the same LHS. How about trying them all? We could do this! If we get stuck where the current tree cannot match the input we are trying to parse, we would backtrack. Instead, we will look ahead one token in the input and only choose productions that can yield a result starting with this token. Furthermore, we will (in this section) restrict ourselves to predictive parsing in which there is only one production that can yield a result starting with a given token. This solution to the big problem also solves the small problem. Since we are trying to match the next token in the input, we must choose the leftmost (nonterminal) node to give children to.

Predictive parsing
Let's return to the pascal array type grammar and consider the three productions having type as LHS. Even when I write the short form
type → simple | ↑ id | array [ simple ] of type

I view it as three productions. For each production P we wish to consider the set FIRST(P) consisting of those tokens that can appear as the first symbol of a string derived from the RHS of P. We actually define FIRST(RHS) rather than FIRST(P), but I often say the first set of the production when I should really say the first set of the RHS of the production.

Definition: Let r be the RHS of a production P. FIRST(r) is the set of tokens that can appear as the first symbol in a string derived from r.

To use predictive parsing, we make the following

Assumption: Let P and Q be two productions with the same LHS. Then FIRST(P) and FIRST(Q) are disjoint.

Thus, if we know both the LHS and the token that must be first, there is (at most) one production we can apply. BINGO!

An example of predictive parsing

This table gives the FIRST sets for our pascal array type example.

Production                         FIRST
type → simple                      { integer, char, num }
type → ↑ id                        { ↑ }
type → array [ simple ] of type    { array }
simple → integer                   { integer }
simple → char                      { char }
simple → num dotdot num            { num }

The three productions with type as LHS have disjoint FIRST sets. Similarly the three productions with simple as LHS have disjoint FIRST sets. Thus predictive parsing can be used. We process the input left to right and call the current token lookahead since it is how far we are looking ahead in the input to determine the production to use. The movie on the right shows the process in action. Homework: A. Construct the corresponding table for
rest → + term rest | - term rest | term
term → 1 | 2 | 3

B. Can predictive parsing be used?
End of Homework.

ε-productions
Not all grammars are as friendly as the last example. The first complication is when ε occurs in a RHS. If this happens, or if the RHS can generate ε, then ε is included in FIRST. But ε would always match the current input position! The rule is that if lookahead is not in FIRST of any production with the desired LHS, we use the (unique!) production (with that LHS) that has ε as RHS. The second edition, which I just obtained, now does a C instead of a pascal example. The productions are
stmt → expr ;
     | if ( expr ) stmt
     | for ( optexpr ; optexpr ; optexpr ) stmt
     | other
optexpr → expr | ε

For completeness, on the right is the beginning of a movie for the C example. Note the use of the ε-production at the end since no other entry in FIRST will match ;

Designing a Predictive Parser


Predictive parsers are fairly easy to construct, as we will now see. Since they are recursive descent parsers we go top-down with one procedure for each nonterminal. Do remember that we must have disjoint FIRST sets for all the productions having a given nonterminal as LHS.
1. For each nonterminal, write a procedure that chooses the unique(!) production having lookahead in its FIRST. Use the ε production if no other production matches. If no production matches and there is no ε production, the parse fails.
2. These procedures mimic the RHS of the production. They call procedures for each nonterminal and call match for each terminal. Write a match(terminal) that advances lookahead to the next input token after confirming that the previous value of lookahead equals the terminal argument.
3. Write a main program that initializes lookahead to the first input token and invokes the procedure for the start symbol.
The book has code at this point. We will see code later in this chapter; a sketch follows here.
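Below is a C sketch of these three steps for the type/simple grammar above. The token codes, the stub scanner, and the procedure bodies are my own illustrative assumptions; the book's actual code differs in detail.

#include <stdio.h>
#include <stdlib.h>

enum token { INTEGER, CHAR, NUM, DOTDOT, ID, ARRAY, OF, UPARROW,
             LBRACKET, RBRACKET, DONE };

/* A stub scanner returning the fixed stream: array [ num .. num ] of integer */
enum token nexttoken(void) {
    static enum token input[] = { ARRAY, LBRACKET, NUM, DOTDOT, NUM, RBRACKET,
                                  OF, INTEGER, DONE };
    static int i = 0;
    return input[i++];
}

enum token lookahead;                    /* the current input token */

void error(void) { printf("syntax error\n"); exit(1); }

void match(enum token t) {               /* step 2: consume one terminal */
    if (lookahead == t) lookahead = nexttoken();
    else error();
}

void simple(void);

void type(void) {                        /* step 1: pick the production by FIRST */
    if (lookahead == INTEGER || lookahead == CHAR || lookahead == NUM)
        simple();
    else if (lookahead == UPARROW) { match(UPARROW); match(ID); }
    else if (lookahead == ARRAY) {
        match(ARRAY); match(LBRACKET); simple(); match(RBRACKET);
        match(OF); type();
    } else
        error();
}

void simple(void) {
    if (lookahead == INTEGER) match(INTEGER);
    else if (lookahead == CHAR) match(CHAR);
    else if (lookahead == NUM) { match(NUM); match(DOTDOT); match(NUM); }
    else error();
}

int main(void) {                         /* step 3: prime lookahead, start at the start symbol */
    lookahead = nexttoken();
    type();
    printf("accepted\n");
    return 0;
}

Note how each procedure body simply follows the RHS of the production selected by lookahead.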

Left Recursion
Another complication. Consider
expr → expr + term
expr → term

For the first production the RHS begins with the LHS. This is called left recursion. If a recursive descent parser would pick

this production, the result would be that the next node to consider is again expr and the lookahead has not changed. An infinite loop occurs. Consider instead
expr → term rest
rest → + term rest
rest → ε

Both pairs of productions generate the same possible token strings, namely
term + term + ... + term

The second pair is called right recursive since the RHS ends with (has on the right) the LHS. If you draw the parse trees generated, you will see that, for left recursive productions, the tree grows to the left; whereas, for right recursive, it grows to the right. Note also that, according to the trees generated by the first pair, the additions are performed left to right; whereas, for the second pair, they are performed right to left. That is, for
term + term + term

the tree from the first pair has the right + at the top (why?); whereas, the tree from the second pair has the left + at the top. In general, for any A, R, α, and β, we can replace the pair
A → A α | β

with the triple


A → β R
R → α R | ε

For the example above, A is expr, R is rest, α is + term, and β is term.

2.5: Translator for simple expressions


Objective: an infix to postfix translator for expressions. We start with just plus and minus, specifically the expressions generated by the following grammar. We include a set of semantic actions with the grammar. Note that finding a grammar for the desired language is one problem, constructing a translator for the language given a grammar is another problem. We are tackling the second problem.
expr → expr + term   { print('+') }
expr → expr - term   { print('-') }
expr → term
term → 0             { print('0') }
...
term → 9             { print('9') }

One problem that we must solve is that this grammar is left recursive.

2.5.1: Abstract and concrete syntax


We prefer not to have superfluous nonterminals as they make the parsing less efficient. That is why we don't say that a term produces a digit and a digit produces each of 0,...,9. Ideally the syntax tree would just have the operators + and - and the 10 digits 0,1,...,9. That would be called the abstract syntax tree. A parse tree coming from a grammar is technically called a concrete syntax tree.

2.5.2: Adapting the Translation Scheme


We eliminate the left recursion as we did in 2.4. This time there are two operators + and - so we replace the triple
A → A α | A β | γ

with the quadruple


A → γ R

R → α R | β R | ε

This time we have actions so, for example, α is + term { print('+') }. However, the formulas still hold and we get
expr → term rest
rest → + term { print('+') } rest
     | - term { print('-') } rest
     | ε
term → 0   { print('0') }
...
     | 9   { print('9') }

2.5.3: Procedures for the nonterminals expr, term, and rest


The C code is in the book. Note the else ; in rest(). This corresponds to the epsilon production. As mentioned previously, the epsilon production is only used when all others fail (that is why it is the else arm and not the then or the else if arms).
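Since the book's listing is not reproduced in these notes, here is a rough sketch of what the three procedures look like, assuming lookahead, match(), and error() are defined as in the book; note the bare else ; for the epsilon production.

#include <ctype.h>
#include <stdio.h>

extern int lookahead;          /* assumed: current token, set by match() */
void match(int t);             /* assumed: advance past terminal t       */
void error(void);              /* assumed: report a syntax error         */
void term(void);
void rest(void);

void expr(void) { term(); rest(); }

void rest(void) {
    if (lookahead == '+')      { match('+'); term(); putchar('+'); rest(); }
    else if (lookahead == '-') { match('-'); term(); putchar('-'); rest(); }
    else ;                     /* the epsilon production: do nothing     */
}

void term(void) {
    if (isdigit(lookahead)) { putchar(lookahead); match(lookahead); }
    else error();
}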

2.5.4: Simplifying the translator


These are (useful) programming techniques.

The complete program


In the first edition this is about 40 lines of C code, 12 of which are single { or }. The second edition has equivalent code in java.

2.6: Lexical analysis


Converts a sequence of characters (the source) into a sequence of tokens. A lexeme is the sequence of characters comprising a single token.

2.6.1: Removal of White space and comments


These do not become tokens so that the parser need not worry about them.

2.6.2: Reading ahead


The 2nd edition moves the discussion about x<y versus x<=y into this new section. I have left it 2 sections ahead to more closely agree with our (first) edition.

2.6.3: Constants
This chapter considers only numerical integer constants. They are computed one digit at a time by value=10*value+digit. The parser will therefore receive the token num rather than a sequence of digits. Recall that our previous parsers considered only one digit numbers. The value of the constant is stored as the attribute of the token num. Indeed <token,attribute> pairs are passed from the scanner to the parser.
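A sketch of how the scanner might accumulate such a constant (the function name and the getchar/ungetc framing are illustrative assumptions, not the book's code):

#include <ctype.h>
#include <stdio.h>

#define NUM 256        /* token code returned to the parser */

int tokenval;          /* attribute of the NUM token        */

/* first is the digit that has already been read */
int scan_number(int first) {
    int c;
    tokenval = first - '0';
    while (isdigit(c = getchar()))
        tokenval = 10 * tokenval + (c - '0');   /* value = 10*value + digit */
    ungetc(c, stdin);                           /* push back the non-digit  */
    return NUM;
}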

2.6.4: Recognizing identifiers and keywords


The C statement
sum = sum + x;

contains 6 tokens. The scanner will convert the input into id = id + id ; (id standing for identifier). Although there are three id tokens, the first and second represent the lexeme sum; the third represents x. These must be distinguished. Many language keywords, for example then, are syntactically the same as identifiers. These also must be distinguished. The symbol table will accomplish these tasks. Care must be taken when one lexeme is a proper prefix of another. Consider x<y versus x<=y. When the < is read, the scanner needs to read another character to see if it is an =. But if that second character is y, the current token is < and the y must be pushed back onto the input stream so that the configuration is the same after scanning < as it is after scanning <=. Also consider then versus thenewvalue; one is a keyword and the other an id.
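A sketch of the pushback idea for < versus <= (the token codes and the use of ungetc are illustrative assumptions):

#include <stdio.h>

#define LT 300    /* token code for <   */
#define LE 301    /* token code for <=  */

/* Called after a '<' has been read. */
int scan_less(void) {
    int c = getchar();
    if (c == '=')
        return LE;            /* the lexeme is <=                       */
    ungetc(c, stdin);         /* c belongs to the next token: push back */
    return LT;                /* the lexeme is just <                   */
}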

Interface
As indicated the scanner reads characters and occasionally pushes one back to the input stream. The downstream interface is to the parser to which <token,attribute> pairs are passed.

2.6.5: A lexical analyzer


A few comments on the program given in the text. One inelegance is that, in order to avoid passing a record (struct in C) from the scanner to the parser, the scanner returns the next token and places its attribute in a global variable. Since the scanner converts digits into num's we can shorten the grammar. Here is the shortened version before the elimination of left recursion. Note that the value attribute of a num is its numerical value.
expr → expr + term   { print('+') }
expr → expr - term   { print('-') }
expr → term
term → num           { print(num.value) }

In anticipation of other operators with higher precedence, we introduce factor and, for good measure, include parentheses for overriding the precedence. So our grammar becomes.
expr   → expr + term     { print('+') }
expr   → expr - term     { print('-') }
expr   → term
term   → factor
factor → ( expr ) | num  { print(num.value) }

The factor() procedure follows the familiar recursive descent pattern: find a production with lookahead in FIRST and do what the RHS says.
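A sketch of factor() under that pattern, assuming NUM, tokenval, lookahead, match(), expr(), and error() from the preceding discussion (details are illustrative, not the book's exact code):

#include <stdio.h>

#define NUM 256

extern int lookahead, tokenval;   /* assumed from the scanner interface */
void match(int t);
void expr(void);
void error(void);

void factor(void) {
    if (lookahead == '(') {                /* FIRST of "( expr )" is { ( } */
        match('('); expr(); match(')');
    } else if (lookahead == NUM) {         /* FIRST of "num" is { num }    */
        printf("%d ", tokenval);           /* the semantic action          */
        match(NUM);
    } else
        error();
}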

2.7: Incorporating a symbol table


The symbol table is an important data structure for the entire compiler. For the simple translator, it is primarily used to store and retrieve <lexeme,token> pairs.

Interface
insert(s,t) returns the index of a new entry storing the pair (lexeme s, token t).
lookup(s) returns the index of the entry for s, or 0 if s is not there.

Reserved keywords
Simply insert them into the symbol table prior to examining any input. Then they can be found when used correctly and,

since their corresponding token will not be id, any use of them where an identifier is required can be flagged.
insert("div", div);

Implementation
Probably the simplest would be
struct symtableType {
    char lexeme[BIGNUMBER];
    int  token;
} symtable[ANOTHERBIGNUMBER];

Having a fixed-size entry for every lexeme is space inefficient, so the authors use a (standard) technique of concatenating all the strings into one big string and storing pointers to the beginning of each of the substrings.
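A sketch of that layout, in the spirit of the book's symbol.c but with the sizes and field names chosen here for illustration:

#include <string.h>

#define STRMAX 999    /* size of the big lexeme string */
#define SYMMAX 100    /* maximum number of entries     */

char lexemes[STRMAX];                 /* all lexemes, each '\0'-terminated */
int  lastchar = -1;                   /* last used position in lexemes     */

struct entry { int lexptr; int token; } symtable[SYMMAX];
int lastentry = 0;                    /* entry 0 unused: lookup returns 0 for "not found" */

int insert(const char *s, int tok) {  /* returns the index of the new entry */
    int len = strlen(s);
    lastentry = lastentry + 1;
    symtable[lastentry].token  = tok;
    symtable[lastentry].lexptr = lastchar + 1;    /* where this lexeme starts    */
    strcpy(&lexemes[lastchar + 1], s);            /* append it to the big string */
    lastchar = lastchar + len + 1;
    return lastentry;
}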

2.8: Abstract stack machines


One form of intermediate representation is to assume that the target machine is a simple stack machine (explained very soon). The front end of the compiler translates the source language into instructions for this stack machine and the back end translates stack machine instructions into instructions for the real target machine. We use a very simple stack machine:
- Separate instruction and data memories (no self-modifying code).
- Arithmetic performed on elements of the stack, rather than on registers or arbitrary locations in (data) memory.
- Very simple instructions: arithmetic (we just do integer for now); stack manipulation (push, pop, dup, a few others); control flow.
The machine itself has two hidden registers tos (top of stack) and pc (program counter) that are manipulated by instruction execution but are not explicitly mentioned in the instructions. Similar to implementing an abstract data type.

Arithmetic instructions
An instruction for each simple op (e.g., add, mul). Complicated ops (e.g., sqrt) require several instructions. We assume an instruction exists for each of the ops we use. The instruction consumes one or two operands from the tos and places the result on the tos.

L-values and R-values


Consider Q := Z; or A[f(x)+B*D] := g(B+C*h(x,y));. (I follow the text and use := for the assignment op, which is written = in C/C++. I am using [] for array reference and () for function call.) From a macroscopic view, we have three tasks.
1. Evaluate the left hand side (LHS) to obtain an l-value.
2. Evaluate the RHS to obtain an r-value.
3. Perform the assignment.
Note the differences between l-values and r-values. An l-value corresponds to an address or a location. An r-value corresponds to a value. Neither 12 nor s+t can be used as an l-value, but both are legal r-values.

Stack manipulation
push v push v (onto stack) rvalue l push contents of (location) l lvalue l push address of l pop := pop r-value on tos put into the location specified by l-value 2nd on the stack; both are popped

copy duplicate the top of stack
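To see how these instructions behave, here is a tiny illustrative interpreter core for them; the memory model and function names are my assumptions, not part of the text.

int stack[100], tos = -1;        /* the evaluation stack                   */
int memory[1000];                /* data memory, indexed by an l-value     */

void push(int v)      { stack[++tos] = v; }
void rvalue(int loc)  { push(memory[loc]); }   /* push the contents of loc */
void lvalue(int loc)  { push(loc); }           /* push the address itself  */
void add(void)        { int r = stack[tos--]; stack[tos] += r; }
void assign(void)     {                        /* the := instruction       */
    int r = stack[tos--];        /* r-value on top of the stack            */
    int l = stack[tos--];        /* l-value second on the stack            */
    memory[l] = r;               /* both operands are popped               */
}

The instruction sequences shown below can be executed directly against this little model.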

Translating expressions
Machine instructions to evaluate an expression mimic the postfix form of the expression. That is, we generate code to evaluate the left operand, then code to evaluate the right operand, and finally the code to evaluate the operation itself. For example y := 7 * x + 6 * (z + w) becomes
lvalue y
push 7
rvalue x
*
push 6
rvalue z
rvalue w
+
*
+
:=

To say this more formally we define two attributes. For any nonterminal, the attribute t gives its translation and, for the terminal id, the attribute lexeme gives its string representation. Assuming we have already given the semantic rules for expr (i.e., assuming that the annotation expr.t is known to contain the translation for expr) then the semantic rule for the assignment statement is
stmt → id := expr    { stmt.t := 'lvalue' || id.lexeme || expr.t || ':=' }

Control flow
There are several ways of specifying conditional and unconditional jumps. We choose the following 5 instructions. The simplifying assumption is that the abstract machine supports symbolic labels. The back end of the compiler would have to translate this into machine instructions for the actual computer, e.g. absolute or relative jumps (jump 3450 or jump +500).

label l      target of a jump
goto l       unconditional jump to l
gofalse l    pop the stack; jump to l if the value is false
gotrue l     pop the stack; jump to l if the value is true
halt         stop execution

Translating (if-then) statements


Fairly simple. Generate a new label using the assumed function newlabel(), which we sometimes write without the (), and use it. The semantic rule for an if statement is simply

stmt → if expr then stmt1    { out := newlabel();
                               stmt.t := expr.t || 'gofalse' out || stmt1.t || 'label' out }

Emitting a translation
Rewriting the above as a semantic action (rather than a rule) we get the following, where emit() is a function that prints its arguments in whatever form is required for the abstract machine (e.g., it deals with line length limits, required whitespace, etc).
stmt → if
       expr      { out := newlabel; emit('gofalse', out); }
       then
       stmt1     { emit('label', out) }

Don't forget that expr is itself a nonterminal. So by the time we reach out := newlabel, we will have already parsed expr and thus will have done any associated actions, such as emit()'ing instructions. These instructions will have left a boolean on the tos. It is this boolean that is tested by the emitted gofalse. More precisely, the action written to the right of expr will be the third child of stmt in the tree. Since a postorder traversal visits the children in order, the second child expr will have been visited (just) prior to visiting the action.

Pseudocode for stmt (fig 2.34)

Look how simple it is! Don't forget that the FIRST sets for the productions having stmt as LHS are disjoint!
procedure stmt
    integer test, out;
    if lookahead = id then            // first set is {id} for assignment
        emit('lvalue', tokenval);     // pushes lvalue of lhs
        match(id);                    // move past the lhs
        match(':=');                  // move past the :=
        expr;                         // pushes rvalue of rhs on tos
        emit(':=');                   // do the assignment (omitted in book)
    else if lookahead = 'if' then
        match('if');                  // move past the if
        expr;                         // pushes boolean on tos
        out := newlabel();
        emit('gofalse', out);         // out is integer, emit makes a legal label
        match('then');                // move past the then
        stmt;                         // recursive call
        emit('label', out)            // emit again makes out legal
    else if ...                       // while, repeat/do, etc
    else error();
end stmt;

2.9: Putting the techniques together


Full code for a simple infix to postfix translator. This uses the concepts developed in 2.5-2.7 (it does not use the abstract stack machine material from 2.8). Note that the intermediate language we produced in 2.5-2.7, i.e., the attribute .t or the result of the semantic actions, is essentially the final output desired. Hence we just need the front end.

Description
The grammar with semantic actions is as follows. All the actions come at the end since we are generating postfix; this is not always the case.
start  → list eof
list   → expr ; list               // would normally use | ε as below
       | ε
expr   → expr + term               { print('+') }
       | expr - term               { print('-') }
       | term
term   → term * factor             { print('*') }
       | term / factor             { print('/') }
       | term div factor           { print('DIV') }
       | term mod factor           { print('MOD') }
       | factor
factor → ( expr )
       | id                        { print(id.lexeme) }
       | num                       { print(num.value) }

Eliminate left recursion to get


start       → list eof
list        → expr ; list
            | ε
expr        → term moreterms
moreterms   → + term     { print('+') }   moreterms
            | - term     { print('-') }   moreterms
            | ε
term        → factor morefactors
morefactors → * factor   { print('*') }   morefactors
            | / factor   { print('/') }   morefactors
            | div factor { print('DIV') } morefactors
            | mod factor { print('MOD') } morefactors
            | ε
factor      → ( expr )
            | id         { print(id.lexeme) }
            | num        { print(num.value) }

Show A+B; on board starting with start.
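As a preview of parser.c below, here is a hedged C sketch of how expr and its helper moreterms from this grammar could be coded by recursive descent. The book's actual file codes the analogous term/morefactors pair (shown later under Parser.c); the declarations at the top merely stand in for the interfaces the real program provides, and the value of NONE is an assumption.

/* interfaces provided elsewhere in the front end (lexer.c, parser.c, emitter.c) */
extern int lookahead;            /* next token, kept current by match()          */
void term(void);
void match(int t);
void emit(int token, int attribute);
#define NONE -1                  /* assumed "no attribute" marker                */

void expr(void) {
    term();                      /* expr -> term moreterms                       */
    moreterms();
}

void moreterms(void) {           /* right recursion coded as a loop              */
    while (lookahead == '+' || lookahead == '-') {
        int t = lookahead;       /* remember which operator we saw               */
        match(lookahead);        /* skip over the operator                       */
        term();                  /* parse (and translate) the next term          */
        emit(t, NONE);           /* postfix: the operator comes out last         */
    }
    /* anything else is the epsilon production: just return                      */
}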


Lexer.c

Contains lexan(), the lexical analyzer, which is called by the parser to obtain the next token. The attribute value is assigned to tokenval and white space is stripped.

lexeme                                          token       attribute value
white space                                     -           -
sequence of digits                              NUM         numeric value
div                                             DIV         -
mod                                             MOD         -
other seq of a letter then letters and digits   ID          index into symbol table
eof char                                        DONE        -
other char                                      that char   NONE
Parser.c

Using a recursive descent technique, one writes routines for each nonterminal in the grammar. In fact the book combines term and morefactors into one routine.
term() {
  int t;
  factor();                 /* now we should call morefactors(), but instead code it inline */
  while (true)              /* morefactor nonterminal is right recursive */
    switch (lookahead) {    /* lookahead set by match() */
      case '*': case '/': case DIV: case MOD:   /* all the same */
        t = lookahead;      /* needed for emit() below */
        match(lookahead);   /* skip over the operator */
        factor();           /* see grammar for morefactors */
        emit(t, NONE);
        continue;           /* C semantics for case */
      default:              /* the epsilon production */
        return;
    }
}

Other nonterminals similar.


Emitter.c

The routine emit().
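The notes do not reproduce the file, but a plausible minimal version would simply print one line per postfix operation. The include of a shared header and the exact token names (NUM, ID, DIV, MOD) and symtable layout below are assumptions for illustration, not the course's actual code.

#include <stdio.h>
#include "global.h"      /* assumed shared header: token codes, symtable */

/* Print one postfix "instruction" for token t with attribute tval. */
void emit(int t, int tval) {
    switch (t) {
    case '+': case '-': case '*': case '/':
        printf("%c\n", t);                            break;
    case DIV:  printf("DIV\n");                       break;
    case MOD:  printf("MOD\n");                       break;
    case NUM:  printf("%d\n", tval);                  break;
    case ID:   printf("%s\n", symtable[tval].lexptr); break;
    default:   printf("token %d, tokenval %d\n", t, tval);
    }
}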


Symbol.c and init.c

The insert(s,t) and lookup(s) routines described previously are in symbol.c. The routine init() preloads the symbol table with the defined keywords.
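A hedged sketch of what symbol.c and init.c might contain, using one character array for all the lexemes and one array of entries; the sizes, names, and token-code values below are assumptions, not the course's actual code.

#include <string.h>

#define STRMAX 999               /* size of the lexeme store (assumed)            */
#define SYMMAX 100               /* maximum number of entries (assumed)           */
enum { DIV = 256, MOD = 257 };   /* keyword token codes; actual values assumed    */

struct entry { char *lexptr; int token; };

static char  lexemes[STRMAX];
static int   lastchar = -1;      /* last used position in lexemes                 */
struct entry symtable[SYMMAX];
static int   lastentry = 0;      /* entry 0 is left unused as the "not found" value */

/* return the index of the entry for string s, or 0 if not found */
int lookup(const char *s) {
    for (int p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* add s with the given token and return its index (a real version would check overflow) */
int insert(const char *s, int token) {
    int len = strlen(s);
    lastentry++;
    lastchar += len + 1;
    symtable[lastentry].token  = token;
    symtable[lastentry].lexptr = &lexemes[lastchar - len];
    strcpy(symtable[lastentry].lexptr, s);
    return lastentry;
}

/* init.c: preload the keywords so the lexer later finds them already in the table */
static struct entry keywords[] = { {"div", DIV}, {"mod", MOD}, {0, 0} };

void init(void) {
    for (struct entry *p = keywords; p->token; p++)
        insert(p->lexptr, p->token);
}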
Error.c

Does almost nothing. The only help is that the line number, calculated by lexan(), is printed.

Two Questions
1. How come this compiler was so easy?
2. Why isn't the final exam next week?
One reason is that much was deliberately simplified. Specifically note that:
No real machine code generated (no back end).
No optimizations (improvement to generated code).
FIRST sets disjoint.
No semantic analysis.
Input language very simple.
Output language very simple and closely related to input.
Also, I presented the material way too fast to expect full understanding.

Chapter 3: Lexical Analysis


Homework: Read chapter 3.
Two methods to construct a scanner (lexical analyzer).
1. By hand, beginning with a diagram of what lexemes look like. Then write code to follow the diagram and return the corresponding token and possibly other information.
2. Feed the patterns describing the lexemes to a lexer-generator, which then produces the scanner. The historical lexer-generator is Lex; a more modern one is flex.
Note that the speed (of the lexer, not of the code generated by the compiler) and error reporting/correction are typically much better for a handwritten lexer. As a result most production-level compiler projects write their own lexers.

3.1: The Role of the Lexical Analyzer


The lexer is called by the parser when the latter is ready to process another token.

The lexer also might do some housekeeping such as eliminating whitespace and comments. Some call these tasks scanning, but others call the entire task scanning. After the lexer, individual characters are no longer examined by the compiler; instead tokens (the output of the lexer) are used.

3.1.1: Lexical Analysis Versus Parsing


Why separate lexical analysis from parsing? The reasons are basically software engineering concerns.
1. Simplicity of design. When one detects a well defined subtask (produce the next token), it is often good to separate out the task (modularity).
2. Efficiency. With the task separated it is easier to apply specialized techniques.
3. Portability. Only the lexer need communicate with the outside.

3.1.2: Tokens, Patterns, and Lexemes


A token is a <name,attribute> pair. These are what the parser processes. The attribute might actually be a tuple of several attributes.
A pattern describes the character strings for the lexemes of the token. For example a letter followed by a (possibly empty) sequence of letters and digits.
A lexeme for a token is a sequence of characters that matches the pattern for the token. Note the circularity of the definitions for lexeme and pattern.
Common token classes.
1. One for each keyword. The pattern is trivial.
2. One for each operator or class of operators. A typical class is the comparison operators. Note that these have the same precedence. We might have + and - as the same token, but not + and *.
3. One for all identifiers (e.g. variables, user defined type names, etc).
4. Constants (i.e., manifest constants) such as 6 or hello, but not a constant identifier such as quantum in the Java statement static final int quantum = 3;. There might be one token for integer constants, one for real, one for string, etc.
5. One for each punctuation symbol.
Homework: 3.3.

3.1.3: Attributes for Tokens


We saw an example of attributes in the last chapter. For tokens corresponding to keywords, attributes are not needed since the name of the token tells everything. But consider the token corresponding to integer constants. Just knowing that we have a constant is not enough; subsequent stages of the compiler need to know the value of the constant. Similarly for the token identifier we need to distinguish one identifier from another. The normal method is for the attribute to specify the symbol table entry for this identifier.

3.1.4: Lexical Errors


We saw in this movie an example where parsing got stuck because we reduced the wrong part of the input string. We also learned about FIRST sets that enabled us to determine which production to apply when we are operating left to right on the input. For predictive parsers the FIRST sets for a given nonterminal are disjoint and so we know which production to apply. In general the FIRST sets might not be disjoint so we have to try all the productions whose FIRST set contains the lookahead symbol.
All the above assumed that the input was error free, i.e., that the source was a sentence in the language. What should we do when the input is erroneous and we get to a point where no production can be applied? The simplest solution is to abort the compilation stating that the program is wrong, perhaps giving the line number and location where the parser could not proceed. We would like to do better and at least find other errors. We could perhaps skip input up to a point where we can begin anew (e.g. after a statement ending semicolon), or perhaps make a small change to the input around lookahead so that we can proceed.

================ Start Lecture #3 ================

3.2: Input Buffering


Determining the next lexeme often requires reading the input beyond the end of that lexeme. For example, to determine the end of an identifier normally requires reading the first whitespace character after it. Also just reading > does not determine the lexeme as it could also be >=. When you determine the current lexeme, the characters you read beyond it may need to be read again to determine the next lexeme.

3.2.1: Buffer Pairs


The book illustrates the standard programming technique of using two (sizable) buffers to solve this problem.

3.2.2: Sentinels
A useful programming improvement to combine testing for the end of a buffer with determining the character read.
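A hedged C sketch of the sentinel idea: the buffer is declared one byte longer than the data it holds, and that extra byte always contains a sentinel, so the common path needs only one test per character. The choice of '\0' as the sentinel, the buffer size, and the use of stdin are assumptions; the book's real scheme uses two buffers so that a lexeme straddling a reload is not lost, which this sketch ignores.

#include <stdio.h>

#define BUFSIZE 4096
static char  buf[BUFSIZE + 1];      /* +1 for the sentinel                            */
static char *forward = buf;         /* scanning pointer                               */
#define SENTINEL '\0'               /* assumed never to appear in legitimate input    */

/* refill the buffer from stdin and terminate it with the sentinel */
static int reload(void) {
    size_t n = fread(buf, 1, BUFSIZE, stdin);
    buf[n] = SENTINEL;
    forward = buf;
    return n > 0;
}

/* return the next character; the sentinel test does double duty for buffer end and eof */
int next_char(void) {
    if (*forward == SENTINEL) {                       /* rare case                     */
        if (forward != buf + BUFSIZE && feof(stdin))
            return EOF;                               /* sentinel mid-buffer: real eof */
        if (!reload())
            return EOF;                               /* nothing left to read          */
    }
    return *forward++;
}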

3.3: Specification of Tokens


The chapter turns formal and, in some sense, the course begins. The book is fairly careful about finite vs infinite sets and also uses (without a definition!) the notion of a countable set. (A countable set is either a finite set or one whose elements can be put into one to one correspondence with the positive integers. That is, it is a set whose elements can be counted. The set of rational numbers, i.e., fractions in lowest terms, is countable; the set of real numbers is uncountable, because it is strictly bigger, i.e., it cannot be counted.) We should be careful to distinguish the empty set ∅ from the empty string ε. Formal language theory is a beautiful subject, but I shall suppress my urge to do it "right" and try to go easy on the formalism.

3.3.1: Strings and Languages


We will need a bunch of definitions.
Definition: An alphabet is a finite set of symbols.
Example: {0,1}, presumably (uninteresting), ascii, unicode, ebcdic, latin-1.
Definition: A string over an alphabet is a finite sequence of symbols from that alphabet. Strings are often called words or sentences.
Example: Strings over {0,1}: ε, 0, 1, 111010. Strings over ascii: ε, sysy, the string consisting of 3 blanks.
Definition: The length of a string is the number of symbols (counting duplicates) in the string.
Example: The length of allan, written |allan|, is 5.
Definition: A language over an alphabet is a countable set of strings over the alphabet.
Example: All grammatical English sentences with five, eight, or twelve words is a language over ascii. It is also a language over unicode.
Definition: The concatenation of strings s and t is the string formed by appending the string t to s. It is written st.
Example: εs = s = sε for any string s.
We view concatenation as a product (see Monoid in wikipedia http://en.wikipedia.org/wiki/Monoid). It is thus natural to define s0=ε and si+1=sis.
Example: s1=s, s4=ssss.
More string terminology
A prefix of a string is a portion starting from the beginning and a suffix is a portion ending at the end. More formally,
Definitions: A prefix of s is any string obtained from s by removing (possibly zero) characters from the end of s. A suffix is defined analogously and a substring of s is obtained by deleting a prefix and a suffix.
Example: If s is 123abc, then (1) s itself and ε are each a prefix, suffix, and a substring. (2) 12 and 123a are prefixes. (3) 3abc is a suffix. (4) 23a is a substring.
Definitions: A proper prefix of s is a prefix of s other than ε and s itself. Similarly, proper suffixes and proper substrings of s do not include ε and s.
Definition: A subsequence of s is formed by deleting (possibly zero) positions from s. We say positions rather than characters since s may for example contain 5 occurrences of the character Q and we only want to delete a certain 3 of them.
Example: issssii is a subsequence of Mississippi.
Homework: 3.1b, 3.5 (c and e are optional).

3.3.2: Operations on Languages


Definition: The union of L1 and L2 is simply the set-theoretic union, i.e., it consists of all words (strings) in either L1 or L2.
Example: The union of {Grammatical English sentences with one, three, or five words} with {Grammatical English sentences with two or four words} is {Grammatical English sentences with five or fewer words}.
Definition: The concatenation of L1 and L2 is the set of all strings st, where s is a string of L1 and t is a string of L2. We again view concatenation as a product and write LM for the concatenation of L and M.
Examples: The concatenation of {a,b,c} and {1,2} is {a1,a2,b1,b2,c1,c2}. The concatenation of {a,b,c} and {1,2,ε} is {a1,a2,b1,b2,c1,c2,a,b,c}.
Definition: As with strings, it is natural to define powers of a language L. L0={ε}, which is not ∅. Li+1=LiL.
Definition: The (Kleene) closure of L, denoted L*, is L0 ∪ L1 ∪ L2 ∪ ...
Definition: The positive closure of L, denoted L+, is L1 ∪ L2 ∪ ...
Example: {0,1,2,3,4,5,6,7,8,9}+ gives all unsigned integers, but with some ugly versions. It has 3, 03, 000003. {0} ∪ ( {1,2,3,4,5,6,7,8,9} ({0,1,2,3,4,5,6,7,8,9}*) ) seems better.
In these notes I may write * for * and + for +, but that is strictly speaking wrong and I will not do it on the board or on exams or on lab assignments.
Example: {a,b}* is {ε,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}. {a,b}+ is {a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}. {ε,a,b}* is {ε,a,b,aa,ab,ba,bb,...}. {ε,a,b}+ is the same as {ε,a,b}*.
The book gives other examples based on L={letters} and D={digits}, which you should read.

3.3.3: Regular Expressions


The idea is that the regular expressions over an alphabet consist of ε, the alphabet, and expressions using union, concatenation, and *, but it takes more words to say it right. For example, I didn't include (). Note that (A ∪ B)* is definitely not A* ∪ B* (* does not distribute over ∪) so we need the parentheses. The book's definition includes many () and is more complicated than I think is necessary. However, it has the crucial advantages of being correct and precise. The wikipedia entry doesn't seem to be as precise. I will try a slightly different approach, but note again that there is nothing wrong with the book's approach (which appears in both first and second editions, essentially unchanged).
Definition: The regular expressions and associated languages over an alphabet consist of
1. ε, the empty string; the associated language L(ε) is {ε}.
2. Each symbol x in the alphabet; L(x) is {x}.
3. rs for all regular expressions (REs) r and s; L(rs) is L(r)L(s).
4. r|s for all REs r and s; L(r|s) is L(r) ∪ L(s).
5. r* for all REs r; L(r*) is (L(r))*.
6. (r) for all REs r; L((r)) is L(r).

Parentheses, if present, control the order of operations. Without parentheses the following precedence rules apply.
The postfix unary operator * has the highest precedence. The book mentions that it is left associative. (I don't see how a postfix unary operator can be right associative or how a prefix unary operator such as unary - could be left associative.)
Concatenation has the second highest precedence and is left associative.
| has the lowest precedence and is left associative.
The book gives various algebraic laws (e.g., associativity) concerning these operators. The reason we don't include the positive closure is that for any RE r+ = rr*.
Homework: 3.6 a and b.

3.3.4: Regular Definitions


These will look like the productions of a context free grammar we saw previously, but there are differences. Let Σ be an alphabet, then a regular definition is a sequence of definitions
d1 → r1
d2 → r2
...
dn → rn

where the d's are unique and not in Σ and ri is a regular expression over Σ ∪ {d1,...,di-1}. Note that each di can depend on all the previous d's. Example: C identifiers can be described by the following regular definition
letter_ → A | B | ... | Z | a | b | ... | z | _
digit   → 0 | 1 | ... | 9
CId     → letter_ ( letter_ | digit )*

Homework: 3.7 a,b (c is optional)

3.3.5: Extensions of Regular Expressions


There are many extensions of the basic regular expressions given above. The following three will be frequently used in this course as they are particularly useful for lexical analyzers as opposed to text editors or string oriented programming languages, which have more complicated regular expressions. All three are simply shorthand. That is, the set of possible languages generated using the extensions is the same as the set of possible languages generated without using the extensions.
1. One or more instances. This is the positive closure operator + mentioned above.
2. Zero or one instance. The unary postfix operator ? defined by r? = r | ε for any RE r.
3. Character classes. If a1, a2, ..., an are symbols in the alphabet, then [a1a2...an] = a1 | a2 | ... | an. In the special case where all the a's are consecutive, we can simplify the notation further to just [a1-an].
Examples: C-language identifiers
letter_ → [A-Za-z_]
digit   → [0-9]
CId     → letter_ ( letter_ | digit )*

Unsigned integer or floating point numbers


digit  → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
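To connect the notation to code, here is a hedged C recognizer that follows this definition character by character. It only answers yes/no for a complete string; a real lexer would also compute the numeric value and deal with the longest-match issues discussed in 3.4.

#include <ctype.h>
#include <stdbool.h>

/* digits ( . digits )? ( E [+-]? digits )?  -- true iff s is exactly a number */
bool is_number(const char *s) {
    const char *p = s;
    if (!isdigit((unsigned char)*p)) return false;        /* need at least one digit */
    while (isdigit((unsigned char)*p)) p++;
    if (*p == '.') {                                      /* optional fraction       */
        p++;
        if (!isdigit((unsigned char)*p)) return false;
        while (isdigit((unsigned char)*p)) p++;
    }
    if (*p == 'E') {                                      /* optional exponent       */
        p++;
        if (*p == '+' || *p == '-') p++;
        if (!isdigit((unsigned char)*p)) return false;
        while (isdigit((unsigned char)*p)) p++;
    }
    return *p == '\0';                                    /* consumed everything     */
}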

Homework: 3.8 for the C language (you might need to read a C manual first to find out all the numerical constants in C), 3.10a.

3.4: Recognition of Tokens



Goal is to perform the lexical analysis needed for the following grammar.
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term          // relop is relational operator =, >, etc.
     | term
term → id
     | number

Recall that the terminals are the tokens; the nonterminals produce terminals. A regular definition for the terminals is
digit  → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
letter → [A-Za-z]
id     → letter ( letter | digit )*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | = | <>

We also want the lexer to remove whitespace so we define a new token


ws → ( blank | tab | newline )+

where blank, tab, and newline are symbols used to represent the corresponding ascii characters.
Recall that the lexer will be called by the parser when the latter needs a new token. If the lexer then recognizes the token ws, it does not return it to the parser but instead goes on to recognize the next token, which is then returned. Note that you can't have two consecutive ws tokens in the input because, for a given token, the lexer will match the longest lexeme starting at the current position that yields this token.
The table below summarizes the situation. For the parser all the relational ops are to be treated the same so they are all the same token, relop. Naturally, other parts of the compiler will need to distinguish between the various relational ops so that appropriate code is generated. Hence, they have distinct attribute values.

Lexeme          Token     Attribute
Whitespace      ws        -
if              if        -
then            then      -
else            else      -
An identifier   id        Pointer to table entry
A number        number    Pointer to table entry
<               relop     LT
<=              relop     LE
=               relop     EQ
<>              relop     NE
>               relop     GT
>=              relop     GE

3.4.1: Transition Diagrams


A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each possible token. It shows the decisions that must be made based on the input seen. The two main components are circles representing states (think of them as decision points of the lexer) and arrows representing edges (think of them as the decisions made). The transition diagram (3.12 in the 1st edition, 3.13 in the second) for relop is shown on the right.
1. The double circles represent accepting or final states at which point a lexeme has been found. There is often an action to be done (e.g., returning the token), which is written to the right of the double circle.
2. If we have moved one (or more) characters "too far" in finding the token, one (or more) stars are drawn.
3. An imaginary start state exists and has an arrow coming from it to indicate where to begin the process.
It is fairly clear how to write code corresponding to this diagram. You look at the first character; if it is <, you look at the next character. If that character is =, you return (relop,LE) to the parser. If instead that character is >, you return (relop,NE). If it is another character, return (relop,LT) and adjust the input buffer so that you will read this character again since you have used it for the current lexeme. If the first character was =, you return (relop,EQ).

3.4.2: Recognition of Reserved Words and Identifiers


The transition diagram below corresponds to the regular definition given previously.

Note again the star affixed to the final state. Two questions remain.
1. How do we distinguish between identifiers and keywords such as "then", which also match the pattern in the transition diagram?
2. What is (gettoken(), installID())?
We will continue to assume that the keywords are reserved, i.e., may not be used as identifiers. (What if this is not the case, as in PL/I, which had no reserved words? Then the lexer does not distinguish between keywords and identifiers and the parser must.) We will use the method mentioned last chapter and have the keywords installed into the symbol table prior to any invocation of the lexer. The symbol table entry will indicate that the entry is a keyword.
installID() checks if the lexeme is already in the table. If it is not present, the lexeme is installed as an id token. In either case a pointer to the entry is returned.
gettoken() examines the lexeme and returns the token name, either id or a name corresponding to a reserved keyword.
Both installID() and gettoken() access the buffer to obtain the lexeme of interest. The text also gives another method to distinguish between identifiers and keywords.
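A hedged sketch of the two helpers. It assumes the chapter-2 style symbol table (lookup()/insert()) and two hypothetical variables marking the current lexeme in the buffer; none of these names are from the actual assignment.

#include <string.h>

/* interfaces assumed from symbol.c and a shared header (names hypothetical) */
struct entry { char *lexptr; int token; };
extern struct entry symtable[];
int  lookup(const char *s);          /* 0 if absent, else table index        */
int  insert(const char *s, int token);
enum { ID = 258 };                   /* token code; actual value is assumed  */

extern char *lexemeBegin;            /* start of the current lexeme (assumed) */
extern int   lexemeLen;              /* its length (assumed)                  */

/* return the symbol-table index for the current lexeme, inserting it if new */
int installID(void) {
    char name[128];                  /* assumed maximum identifier length     */
    memcpy(name, lexemeBegin, lexemeLen);
    name[lexemeLen] = '\0';
    int p = lookup(name);
    if (p == 0)
        p = insert(name, ID);        /* not seen before: an ordinary identifier */
    return p;
}

/* return the token name: the keyword's own token if the entry is a keyword, else ID */
int gettoken(void) {
    int p = installID();
    return symtable[p].token;        /* keywords were preloaded by init()     */
}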

3.4.3: Completion of the Running Example


So far we have transition diagrams for identifiers (this diagram also handles keywords) and the relational operators. What remains are whitespace and numbers, which are respectively the simplest and most complicated diagrams seen so far.
Recognizing Whitespace
The diagram itself is quite simple, reflecting the simplicity of the corresponding regular expression.

The "delim" in the diagram represents any of the whitespace characters, say space, tab, and newline. The final star is there because we needed to find a non-whitespace character in order to know when the whitespace ends and this character begins the next token. There is no action performed at the accepting state. Indeed the lexer does not return to the parser, but starts again from its beginning as it still must find the next token. Recognizing Numbers The diagram below is from the second edition. It is essentially a combination of the three diagrams in the first edition.

This certainly looks formidable, but it is not that bad; it follows from the regular expression. In class go over the regular expression and show the corresponding parts in the diagram.
When an accepting state is reached, action is required but is not shown on the diagram. Just as identifiers are stored in a symbol table and a pointer is returned, there is a corresponding number table in which numbers are stored. These numbers are needed when code is generated. Depending on the source language, we may wish to indicate in the table whether this is a real or integer. A similar, but more complicated, transition diagram could be produced if the language permitted complex numbers as well.
Homework: Write transition diagrams for the regular expressions in problems 3.6 a and b, 3.7 a and b.

3.4.4: Architecture of a Transition-Diagram-Based Lexical Analyzer


The idea is that we write a piece of code for each decision diagram. I will show the one for relational operations below (from the 2nd edition). This piece of code contains a case for each state, which typically reads a character and then goes to the next case depending on the character read. The numbers in the circles are the names of the cases.


Accepting states often need to take some action and return to the parser. Many of these accepting states (the ones with stars) need to restore one character of input. This is called retract() in the code.
What should the code for a particular diagram do if at one state the character read is not one of those for which a next state has been defined? That is, what if the character read is not the label of any of the outgoing arcs? This means that we have failed to find the token corresponding to this diagram. The code calls fail(). This is not an error case. It simply means that the current input does not match this particular token. So we need to go to the code section for another diagram after restoring the input pointer so that we start the next diagram at the point where this failing diagram started. If we have tried all the diagrams, then we have a real failure and need to print an error message and perhaps try to repair the input.
Note that the order the diagrams are tried is important. If the input matches more than one token, the first one tried will be chosen.
TOKEN getRelop()                          // TOKEN has two components
  TOKEN retToken = new(RELOP);            // First component set here
  while (true)
    switch(state)
      case 0: c = nextChar();
              if      (c == '<') state = 1;
              else if (c == '=') state = 5;
              else if (c == '>') state = 6;
              else fail();
              break;
      case 1: ...
      ...
      case 8: retract();                  // an accepting state with a star
              retToken.attribute = GT;    // second component
              return(retToken);

Second edition additions
The description above corresponds to the one given in the first edition. The newer edition gives two other methods for combining the multiple transition-diagrams (in addition to the one above).
1. Unlike the method above, which tries the diagrams one at a time, the first new method tries them in parallel. That is, each character read is passed to each diagram (that hasn't already failed). Care is needed when one diagram has accepted the input, but others still haven't failed and may accept a longer prefix of the input.
2. The final possibility discussed, which appears to be promising, is to combine all the diagrams into one. That is easy for the example we have been considering because all the diagrams begin with different characters being matched. Hence we just have one large start state with multiple outgoing edges. It is more difficult when there is a character that can begin more than one diagram.

3.5: The Lexical Analyzer Generator Lex


The newer version, which we will use, is called flex; the f stands for fast. I checked and both lex and flex are on the cs machines. I will use the name lex for both. Lex is itself a compiler that is used in the construction of other compilers (its output is the lexer for the other compiler). The lex language, i.e., the input language of the lex compiler, is described in the next few sections. The compiler writer uses the lex language to specify the tokens of their language as well as the actions to take at each state.

3.5.1: Use of Lex
Let us pretend I am writing a compiler for a language called pink. I produce a file, call it lex.l, that describes pink in a manner shown below. I then run the lex compiler (a normal program), giving it lex.l as input. The lex compiler output is always a file called lex.yy.c, a program written in C. One of the procedures in lex.yy.c (call it pinkLex()) is the lexer itself, which reads a character input stream and produces a sequence of tokens. pinkLex() also sets a global value yylval that is shared with the parser. I then compile lex.yy.c together with the parser (typically the output of lex's cousin yacc, a parser generator) to produce say pinkfront, which is an executable program that is the front end for my pink compiler.

3.5.2: Structure of Lex Programs


The general form of a lex program like lex.l is
declarations
%%
translation rules
%%
auxiliary functions

The lex program for the example we have been working with follows (it is typed in straight from the book).
%{
    /* definitions of manifest constants
       LT, LE, EQ, NE, GT, GE,
       IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
"<"       {yylval = LT; return(RELOP);}
"<="      {yylval = LE; return(RELOP);}
"="       {yylval = EQ; return(RELOP);}
"<>"      {yylval = NE; return(RELOP);}
">"       {yylval = GT; return(RELOP);}
">="      {yylval = GE; return(RELOP);}

%%

int installID() {/* function to install the lexeme, whose first character
                    is pointed to by yytext, and whose length is yyleng,
                    into the symbol table and return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical constants
                     into a separate table */
}

The first, declaration, section includes variables and constants as well as the all-important regular definitions that define the building blocks of the target language, i.e., the language that the generated lexer will analyze.
The next, translation rules, section gives the patterns of the lexemes that the lexer will recognize and the actions to be performed upon recognition. Normally, these actions include returning a token name to the parser and often returning other information about the token via the shared variable yylval. If a return is not specified the lexer continues executing and finds the next lexeme present.
Comments on the Lex Program
Anything between %{ and %} is not processed by lex, but instead is copied directly to lex.yy.c. So we could have had statements like
#define LT 12
#define LE 13

The regular definitions are mostly self explanatory. When a definition is later used it is surrounded by {}. A backslash \ is used when a special symbol like * or . is to be used to stand for itself, e.g. if we wanted to match a literal star in the input for multiplication.
Each rule is fairly clear: when a lexeme is matched by the left, pattern, part of the rule, the right, action, part is executed. Note that the value returned is the name (an integer) of the corresponding token. For simple tokens like the one named IF, which correspond to only one lexeme, no further data need be sent to the parser. There are several relational operators so a specification of which lexeme matched RELOP is saved in yylval. For ids and numbers, the lexeme is stored in a table by the install functions and a pointer to the entry is placed in yylval for future use.
Everything in the auxiliary function section is copied directly to lex.yy.c. Unlike declarations enclosed in %{ %}, however, auxiliary functions may be used in the actions.

3.5.3: Conflict Resolution in Lex


1. Match the longest possible prefix of the input.
2. If this prefix matches multiple patterns, choose the first.
The first rule makes <= one lexeme instead of two. The second rule makes "if" a keyword and not an id.

3.5.3a: Anger Management in Lex


Sorry.

3.5.4: The Lookahead Operator


Sometimes a sequence of characters is only considered a certain lexeme if the sequence is followed by specified other sequences. Here is a classic example. Fortran, PL/I, and some other languages do not have reserved words. In Fortran
IF(X)=3

is a legal assignment statement and the IF is an identifier. However,


IF(X.LT.Y)X=Y

is an if/then statement and IF is a keyword. Sometimes the lack of reserved words makes lexical disambiguation impossible; however, in this case the slash / operator of lex is sufficient to distinguish the two cases. Consider
IF / \(.*\){letter}

This only matches IF when it is followed by a ( some text a ) and a letter. The only FORTRAN statements that match this are the if/then shown above; so we have found a lexeme that matches the if token. However, the lexeme is just the IF and not the rest of the pattern. The slash tells lex to put the rest back into the input and match it for the next and subsequent tokens. Homework: 3.11. Homework: Modify the lex program in section 3.5.2 so that: (1) the keyword while is recognized, (2) the comparison operators are those used in the C language, (3) the underscore is permitted as another letter (this problem is easy).

3.6: Finite Automata


The secret weapon used by lex et al to convert ("compile") its input into a lexer.
Finite automata are like the graphs we saw in transition diagrams but they simply decide if a sentence (input string) is in the language (generated by our regular expression). That is, they are recognizers of language. There are two types of finite automata.
1. Deterministic finite automata (DFA) have for each state (circle in the diagram) exactly one edge leading out for each symbol. So if you know the next symbol and the current state, the next state is determined. That is, the "execution" is deterministic; hence the name.
2. Nondeterministic finite automata (NFA) are the other kind. There are no restrictions on the edges leaving a state: there can be several with the same symbol as label and some edges can be labeled with ε. Thus there can be several possible next states from a given state and a current "lookahead" symbol.
Surprising Theorem: Both DFAs and NFAs are capable of recognizing the same languages, the regular languages, i.e., the languages generated by regular expressions (plus the automata can recognize the empty language).
What does this mean? There are certainly NFAs that are not DFAs. But the language recognized by each such NFA can also be recognized by at least one DFA.
Why mention (confusing) NFAs? The DFA that recognizes the same language as an NFA might be significantly larger than the NFA. The finite automaton that one constructs naturally from a regular expression is often an NFA.

3.6.1: Nondeterministic Finite Automata


Here is the formal definition. A nondeterministic finite automaton (NFA) consists of
1. A finite set of states S.
2. An input alphabet Σ not containing ε.
3. A transition function that gives, for each state and each symbol in Σ ∪ {ε}, a set of next states (or successor states).
4. An element s0 of S, the start state.
5. A subset F of S, the accepting states (or final states).
An NFA is basically a flow chart like the transition diagrams we have already seen. Indeed an NFA (or a DFA, to be formally defined soon) can be represented by a transition graph whose nodes are states and whose edges are labeled with elements of Σ ∪ {ε}. The differences between a transition graph and our previous transition diagrams are:
1. Possibly multiple edges with the same label leaving a single state.
2. An edge may be labeled with ε.
The transition graph to the right is an NFA for the regular expression (a|b)*abb, which (given the alphabet {a,b}) represents all words ending in abb. Consider aababb. If you choose the "wrong" edge for the initial a's you will get stuck or not end at the accepting state. But an NFA accepts a word if any path (beginning at the start state and using the symbols in the word in order) ends at an accepting state. It essentially tries all such paths at once and accepts if any end at an accepting state.
Patterns like (a|b)*abb are useful regular expressions! If the alphabet is ascii, consider *.java.
Homework: For the NFA to the right, indicate all the paths labeled aabb.

3.6.2: Transition Tables


There is an equivalent way to represent an NFA, namely a table giving, for each state s and input symbol x (and ε), the set of successor states x leads to from s. The empty set ∅ is used when there is no edge labeled x emanating from s. The following table corresponds to the transition graph above.

State   a       b
0       {0,1}   {0}
1       ∅       {2}
2       ∅       {3}

The downside of these tables is their size, especially if most of the entries are ∅ since those entries would not take any space in a transition graph.
Homework: Construct the transition table for the NFA in the previous homework problem.

3.6.3: Acceptance of Input Strings by Automata


An NFA accepts a string if the symbols of the string specify a path from the start to an accepting state.
Homework: Does the NFA in the previous homework accept the string aabb?
Again note that these symbols may specify several paths, some of which lead to accepting states and some that don't. In such a case the NFA does accept the string; one successful path is enough. Also note that if an edge is labeled ε, then it can be taken "for free".
For the transition graph above any string can just sit at state 0 since every possible symbol (namely a or b) can go from state 0 back to state 0. So every string can lead to a non-accepting state, but that is not important since if just one path with that string leads to an accepting state, the NFA accepts the string.
The language defined by an NFA or the language accepted by an NFA is the set of strings (a.k.a. words) accepted by the NFA. So the NFA in the diagram above (not the diagram with the homework problem) accepts the same language as the regular expression (a|b)*abb. If A is an automaton (NFA or DFA) we use L(A) for the language accepted by A.
The diagram on the right illustrates an NFA accepting the language L(aa*|bb*). The path 0→3→4→4→4→4 shows that bbbb is accepted by the NFA. Note how the ε that labels the edge 0→3 does not appear in the string bbbb since ε is the empty string.

3.6.4: Deterministic Finite Automata


There is something weird about an NFA if viewed as a model of computation. How is a computer of any realistic construction able to check out all the (possibly infinite number of) paths to determine if any terminate at an accepting state?

We now consider a much more realistic model, a DFA.
Definition: A deterministic finite automaton or DFA is a special case of an NFA having the restrictions
1. No edge is labeled with ε.
2. For any state s and symbol a, there is exactly one edge leaving s with label a.
This is realistic. We are at a state and examine the next character in the string; depending on the character we go to exactly one new state. Looks like a switch statement to me. Minor point: when we write a transition table for a DFA, the entries are elements not sets so there are no {} present.
Simulating a DFA
Indeed a DFA is so reasonable there is an obvious algorithm for simulating it (i.e., reading a string and deciding whether or not it is in the language accepted by the DFA). We present it now.
The second edition has switched to C syntax: = is assignment, == is comparison. I am going to change to this notation since I strongly suspect that most of the class is much more familiar with C/C++/java/C# than with algol60/algol68/pascal/ada (the last is my personal favorite). As I revisit past sections of the notes to fix errors, I will change the examples from algol to C usage of =. I realize that this makes the notes incompatible with the edition you have, but hope and believe that this will not cause any serious problems.
s = s0;                   // start state.  NOTE: = is assignment
c = nextChar();           // a "priming" read
while (c != eof) {
    s = move(s,c);
    c = nextChar();
}
if (s is in F, the set of accepting states) return "yes";
else return "no";

3.7: From Regular Expressions to Automata
3.7.0: Not losing sight of the forest due to the trees
This is not from the book. Do not forget the goal of the chapter is to understand lexical analysis. We saw, when looking at Lex, that regular expressions are a key in this task. So we want to recognize regular expressions (say the ones representing tokens). We are going to see two methods.
1. Convert the regular expression to an NFA and simulate the NFA.
2. Convert the regular expression to an NFA, convert the NFA to a DFA, and simulate the DFA.
So we need to learn 4 techniques.
1. Convert a regular expression to an NFA.
2. Simulate an NFA.
3. Convert an NFA to a DFA.
4. Simulate a DFA.

The list I just gave is in the order the algorithms would be applied, but you would use either 2 or (3 and 4). The two editions differ in the order the techniques are presented, but neither does it in the order I just gave. Indeed, we just did item #4. I will follow the order of the 2nd edition but give pointers to the first edition where they differ.

================ Start Lecture #4 ================


Remark: If you find a particular homework question challenging, ask on the mailing list and an answer will be produced. Remark: I forgot to assign homework for section 3.6. I have added one problem spread into three parts. It is not assigned but it is a question I believe you should be able to do.

3.7.1: Converting an NFA to a DFA


(This is item #3 above and is done in section 3.6 in the first edition.)
The book gives a detailed proof; I am just trying to motivate the ideas.
Let N be an NFA; we construct a DFA D that accepts the same strings as N does. Call a state of N an N-state, and call a state of D a D-state. The idea is that a D-state corresponds to a set of N-states and hence this is called the subset algorithm. Specifically for each string X of symbols we consider all the N-states that can result when N processes X. This set of N-states is a D-state.
Let us consider the transition graph on the right, which is an NFA that accepts strings satisfying the regular expression (a|b)*abb.
The start state of D is the set of N-states that can result when N processes the empty string ε. This is called the ε-closure of the start state s0 of N, and consists of those N-states that can be reached from s0 by following edges labeled with ε. Specifically it is the set {0,1,2,4,7} of N-states. We call this state D0 and enter it in the transition table we are building for D (shown below).
Next we want the a-successor of D0, i.e., the D-state that occurs when we start at D0 and move along an edge labeled a. We call this successor D1. Since D0 consists of the N-states corresponding to ε, D1 is the N-states corresponding to "εa"="a". We compute the a-successor of all the N-states in D0 and then form the ε-closure.
Next we compute the b-successor of D0 the same way and call it D2.

NFA states          DFA state   a    b
{0,1,2,4,7}         D0          D1   D2
{1,2,3,4,6,7,8}     D1          D1   D3
{1,2,4,5,6,7}       D2          D1   D2
{1,2,4,5,6,7,9}     D3          D1   D4
{1,2,4,5,6,7,10}    D4          D1   D2

We continue forming a- and b-successors of all the D-states until no new D-states result (there is only a finite number of subsets of all the N-states so this process does indeed stop). This gives the table above. D4 is the only D-accepting state as it is the only D-state containing the (only) N-accepting state 10.
Theoretically, this algorithm is awful since for a set with k elements, there are 2^k subsets. Fortunately, normally only a small fraction of the possible subsets occur in practice.
Homework: Convert the NFA from the homework for section 3.6 to a DFA.
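To tie this back to the DFA-simulation loop of section 3.6.4, here is a hedged C sketch that hard-codes the D0-D4 transition table just constructed and decides whether a string of a's and b's ends in abb.

#include <stdbool.h>
#include <stdio.h>

/* transition table from the subset construction: rows D0..D4, columns a, b */
static const int delta[5][2] = {
    /* a  b */
    {  1, 2 },   /* D0 */
    {  1, 3 },   /* D1 */
    {  1, 2 },   /* D2 */
    {  1, 4 },   /* D3 */
    {  1, 2 },   /* D4 (the accepting state) */
};

bool matches_abb_suffix(const char *s) {
    int state = 0;                                  /* D0 is the start state      */
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return false;   /* outside the alphabet       */
        state = delta[state][*s == 'b'];
    }
    return state == 4;                              /* D4 is the only accepting state */
}

int main(void) {
    printf("%d\n", matches_abb_suffix("aababb"));   /* prints 1: accepted */
    printf("%d\n", matches_abb_suffix("abba"));     /* prints 0: rejected */
}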

3.7.2: Simulating an NFA



Instead of producing the DFA, we can run the subset algorithm as a simulation itself. This is item #2 in my list of techniques.
S = ε-closure(s0);
c = nextChar();
while (c != eof) {
    S = ε-closure(move(S,c));
    c = nextChar();
}
if (S ∩ F != ∅) return "yes";     // F is the set of accepting states
else return "no";
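A hedged C sketch of the same loop for NFAs with at most 32 states, representing a set of N-states as a bit mask. The eps and delta tables would be filled in from the particular NFA; they are left as externs here, and the 32-state limit is just to keep the sketch short.

#include <stdint.h>
#include <stdbool.h>

#define MAXSTATES 32
typedef uint32_t stateset;                 /* bit i set  <=>  N-state i is in the set */

/* eps[i]      = states reachable from i by one epsilon edge
   delta[i][c] = states reachable from i on input character c */
extern stateset eps[MAXSTATES];
extern stateset delta[MAXSTATES][256];

static stateset eps_closure(stateset S) {
    stateset old;
    do {                                   /* keep adding epsilon-successors until no change */
        old = S;
        for (int i = 0; i < MAXSTATES; i++)
            if (S & (1u << i)) S |= eps[i];
    } while (S != old);
    return S;
}

static stateset move(stateset S, unsigned char c) {
    stateset T = 0;
    for (int i = 0; i < MAXSTATES; i++)
        if (S & (1u << i)) T |= delta[i][c];
    return T;
}

/* does the NFA with start state s0 and accepting-state mask F accept string s? */
bool nfa_accepts(int s0, stateset F, const char *s) {
    stateset S = eps_closure(1u << s0);
    for (; *s; s++)
        S = eps_closure(move(S, (unsigned char)*s));
    return (S & F) != 0;
}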

3.7.3: Efficiency of NFA Simulation


Slick implementation.

3.7.4: Constructing an NFA from a Regular Expression


I give a pictorial proof by induction. This is item #1 from my list of techniques.
1. The base cases are the empty regular expression and the regular expression consisting of a single symbol a in the alphabet.
2. The inductive cases are:
   i. s | t for s and t regular expressions
   ii. st for s and t regular expressions
   iii. s*
   iv. (s), which is trivial since the NFA for s works for (s).
The pictures on the right illustrate the base and inductive cases.
Remarks:
1. The generated NFA has at most twice as many states as there are operators and operands in the RE. This is important for studying the complexity of the NFA.
2. The generated NFA has one start and one accepting state. The accepting state has no outgoing arcs and the start state has no incoming arcs.
3. Note that the diagram for st correctly indicates that the final state of s and the initial state of t are merged. This uses the previous remark that there is only one start and final state.
4. Except for the accepting state, each state of the generated NFA has either one outgoing arc labeled with a symbol or two outgoing arcs labeled with ε.
Do the NFA for (a|b)*abb and see that we get the same diagram that we had before. Do the steps in the normal leftmost, innermost order (or draw a normal parse tree and follow it).
Homework: 3.16 a,b,c
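A hedged C sketch of the inductive construction, building fragments that each have one start and one accepting state as in remark 2. The node representation is an assumption, parsing the regular expression into these calls is not shown, and concatenation is done with an ε edge rather than an actual state merge for simplicity.

#include <stdlib.h>

#define EPS  (-1)                      /* label for an epsilon edge            */

struct state {                         /* at most two outgoing edges (remark 4) */
    int label1, label2;                /* edge labels, EPS, or -2 for "no edge" */
    struct state *out1, *out2;
};

struct nfa { struct state *start, *accept; };   /* one start, one accepting state */

static struct state *new_state(void) {
    struct state *s = calloc(1, sizeof *s);
    s->label1 = s->label2 = -2;
    return s;
}

/* base case: a single symbol c */
struct nfa nfa_symbol(int c) {
    struct nfa n = { new_state(), new_state() };
    n.start->label1 = c;  n.start->out1 = n.accept;
    return n;
}

/* s | t : new start and accept states joined by epsilon edges */
struct nfa nfa_union(struct nfa s, struct nfa t) {
    struct nfa n = { new_state(), new_state() };
    n.start->label1 = EPS;  n.start->out1 = s.start;
    n.start->label2 = EPS;  n.start->out2 = t.start;
    s.accept->label1 = EPS; s.accept->out1 = n.accept;
    t.accept->label1 = EPS; t.accept->out1 = n.accept;
    return n;
}

/* s t : connect s's accepting state to t's start (an epsilon edge instead of a merge) */
struct nfa nfa_concat(struct nfa s, struct nfa t) {
    s.accept->label1 = EPS; s.accept->out1 = t.start;
    return (struct nfa){ s.start, t.accept };
}

/* s* : allow zero repetitions and looping back */
struct nfa nfa_star(struct nfa s) {
    struct nfa n = { new_state(), new_state() };
    n.start->label1 = EPS;  n.start->out1 = s.start;
    n.start->label2 = EPS;  n.start->out2 = n.accept;     /* zero occurrences */
    s.accept->label1 = EPS; s.accept->out1 = s.start;     /* loop back        */
    s.accept->label2 = EPS; s.accept->out2 = n.accept;
    return n;
}

For example, (a|b)*abb would be built as nfa_concat(nfa_concat(nfa_concat(nfa_star(nfa_union(nfa_symbol('a'), nfa_symbol('b'))), nfa_symbol('a')), nfa_symbol('b')), nfa_symbol('b')).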

3.7.5: Efficiency of String-Processing Algorithms


(This is on page 127 of the first edition.) Skipped.

3.8: Design of a Lexical-Analyzer Generator


How lexer-generators like Lex work.

3.8.1: The structure of the generated analyzer


We have seen simulators for DFAs and NFAs. The remaining large question is how is the lex input converted into one of these automatons. Also
1. Lex permits functions to be passed through to the yy.lex.c file. This is fairly straightforward to implement.
2. Lex also supports actions that are to be invoked by the simulator when a match occurs. This is also fairly straightforward.
3. The lookahead operator is not so simple in the general case and is discussed briefly below.
In this section we will use transition graphs; lexer-generators do not draw pictures; instead they use the equivalent transition tables.
Recall that the regular definitions in Lex are mere conveniences that can easily be converted to REs and hence we need only convert REs into an FSA.
We already know how to convert a single RE into an NFA. But lex input will contain several REs (since it wishes to recognize several different tokens). The solution is to
1. Produce an NFA for each RE.
2. Introduce a new start state.
3. Introduce an ε-transition from the new start state to the start of each NFA constructed in step 1.
4. When one reaches one of the accepting states, they do NOT stop. See below for an explanation.
The result is shown to the right. At each of the accepting states (one for each NFA in step 1), the simulator executes the actions specified in the lex program for the corresponding pattern.

3.8.2: Pattern Matching Based on NFAs


We use the algorithm for simulating NFAs presented in 3.7.2. The simulator starts reading characters and calculates the set of states it is at.
At some point the input character does not lead to any state or we have reached the eof. Since we wish to find the longest lexeme matching the pattern we proceed backwards from the current point (where there was no state) until we reach an accepting state (i.e., the set of NFA states, N-states, contains an accepting N-state). Each accepting N-state corresponds to a matched pattern. The lex rule is that if a lexeme matches multiple patterns we choose the pattern listed first in the lex program.
Example
Consider the example below with three patterns and their associated actions and consider processing the input aaba.

Pattern   Action to perform
a         Action1
abb       Action2
a*b+      Action3

1. We begin by constructing the three NFAs. To save space, the third NFA is not the one that would be constructed by our algorithm, but is an equivalent smaller one. For example, some unnecessary ε-transitions have been eliminated. If one views the lex executable as a compiler transforming lex source into NFAs, this would be considered an "optimization".
2. We introduce a new start state and ε-transitions as in the previous section.
3. We start at the ε-closure of the start state, which is {0,1,3,7}.
4. The first a (remember the input is aaba) takes us to {2,4,7}. This includes an accepting state and indeed we have matched the first pattern. However, we do not stop since we may find a longer match.
5. The next a takes us to {7}.
6. The b takes us to {8}.
7. The next a fails since there are no a-transitions out of state 8. So we must back up to before trying the last a.
8. We are back in {8} and ask if one of these N-states (I know there is only one, but there could be more) is an accepting state.
9. Indeed state 8 is accepting for the third pattern. If there were more than one accepting state in the list, we would choose the one in the earliest listed pattern.
10. Action3 would now be performed.

3.8.3: DFA's for Lexical Analyzers


We could also convert the NFA to a DFA and simulate that. The resulting DFA is on the right. Note that it shows the same set of states we had as well as others corresponding to other possible inputs.
We label the accepting states with the pattern matched. If multiple patterns are matched (because the accepting D-state contains multiple accepting N-states), we use the first pattern listed (assuming we are using lex conventions).
Technical point. For a DFA, there must be an outgoing edge from each D-state for each possible character. In the diagram, when there is no NFA state possible, we do not show the edge. Technically we should show these edges, all of which lead to the same D-state, called the dead state, which corresponds to the empty subset of N-states.

3.8.4: Implementing the Lookahead Operator


This has some tricky points. Recall that this lookahead operator is for when you must look further down the input but the extra characters matched are not part of the lexeme. We write the pattern r1/r2. In the NFA we match r1, then treat the / as an ε, and then match r2. It would be fairly easy to describe the situation when the NFA has only one ε-transition at the state where r1 is matched. But it is tricky when there is more than one such transition.

3.9: Optimization of DFA-Based Pattern Matchers


Skipped

3.9.1: Important States of an NFA



Skipped

3.9.2: Functions Computed form the Syntax Tree


Skipped

3.9.3: Computing nullable, firstpos, and lastpos


Skipped

3.9.4: Computing followpos


Skipped

Chapter 4: Syntax Analysis


Homework: Read Chapter 4.

4.1: Introduction
4.1.1: The role of the parser
Conceptually, the parser accepts a sequence of tokens and produces a parse tree. As we saw in the previous chapter the parser calls the lexer to obtain the next token. In practice this might not occur.
1. The source program might have errors.
2. Instead of explicitly constructing the parse tree, the actions that the downstream components of the front end would do on the tree can be integrated with the parser and done incrementally on components of the tree.
There are three classes of grammar-based parsers.
1. universal
2. top-down
3. bottom-up
The universal parsers are not used in practice as they are inefficient. As expected, top-down parsers start from the root of the tree and proceed downward; whereas, bottom-up parsers start from the leaves and proceed upward. The commonly used top-down and bottom-up parsers are not universal. That is, there are grammars that cannot be used with them.
The LL and LR parsers are important in practice. Hand written parsers are often LL. Specifically, the predictive parsers we looked at in chapter two are for LL grammars. The LR grammars form a larger class. Parsers for this class are usually constructed with the aid of automatic tools.

4.1.2: Representative Grammars


Expressions with + and *
E → E + T | T
T → T * F | F

F → ( E ) | id
