
Chapter 3: Lexical Analysis

Homework: Read chapter 3.

Two methods to construct a scanner (lexical analyzer):

1. By hand, beginning with a diagram of what lexemes look like. Then write code to follow the diagram and return the corresponding token and possibly other information.
2. Feed the patterns describing the lexemes to a lexer-generator, which then produces the scanner. The historical lexer-generator is Lex; a more modern one is flex.

Note that the speed (of the lexer, not of the code generated by the compiler) and error reporting/correction are typically much better for a handwritten lexer. As a result, most production-level compiler projects write their own lexers.

3.1: The Role of the Lexical Analyzer


The lexer is called by the parser when the latter is ready to process another token. The lexer also might do some housekeeping such as eliminating whitespace and comments. Some call these tasks scanning, but others call the entire task scanning. After the lexer, individual characters are no longer examined by the compiler; instead tokens (the output of the lexer) are used.

3.1.1: Lexical Analysis Versus Parsing

Why separate lexical analysis from parsing? The reasons are basically software engineering concerns.

1. Simplicity of design. When one detects a well defined subtask (produce the next token), it is often good to separate out the task (modularity).
2. Efficiency. With the task separated it is easier to apply specialized techniques.
3. Portability. Only the lexer need communicate with the outside.

3.1.2: Tokens, Patterns, and Lexemes

A token is a <name,attribute> pair. These are what the parser processes. The attribute might actually be a tuple of several attributes.

A pattern describes the character strings for the lexemes of the token. For example, a letter followed by a (possibly empty) sequence of letters and digits. A lexeme for a token is a sequence of characters that matches the pattern for the token.

Note the circularity of the definitions for lexeme and pattern.

Common token classes:

1. One for each keyword. The pattern is trivial.
2. One for each operator or class of operators. A typical class is the comparison operators. Note that these have the same precedence. We might have + and - as the same token, but not + and *.
3. One for all identifiers (e.g., variables, user-defined type names, etc.).
4. Constants (i.e., manifest constants) such as 6 or "hello", but not a constant identifier such as quantum in the Java statement static final int quantum = 3;. There might be one token for integer constants, one for real, one for string, etc.
5. One for each punctuation symbol.
Homework: 3.3.

3.1.3: Attributes for Tokens

We saw an example of attributes in the last chapter. For tokens corresponding to keywords, attributes are not needed since the name of the token tells everything. But consider the token corresponding to integer constants. Just knowing that we have a constant is not enough; subsequent stages of the compiler need to know the value of the constant. Similarly for the token identifier we need to distinguish one identifier from another. The normal method is for the attribute to specify the symbol table entry for this identifier.

3.1.4: Lexical Errors

We saw in this movie an example where parsing got stuck because we reduced the wrong part of the input string. We also learned about FIRST sets that enabled us to determine which production to apply when we are operating left to right on the input. For predictive parsers the FIRST sets for a given nonterminal are disjoint and so we know which production to apply. In general the FIRST sets might not be disjoint so we have to try all the productions whose FIRST set contains the lookahead symbol.

All the above assumed that the input was error free, i.e., that the source was a sentence in the language. What should we do when the input is erroneous and we get to a point where no production can be applied? The simplest solution is to abort the compilation stating that the program is wrong, perhaps giving the line number and location where the parser could not proceed. We would like to do better and at least find other errors. We could perhaps skip input up to a point where we can begin anew (e.g., after a statement-ending semicolon), or perhaps make a small change to the input around lookahead so that we can proceed.

3.2: Input Buffering


Determining the next lexeme often requires reading the input beyond the end of that lexeme. For example, to determine the end of an identifier normally requires reading the first whitespace character after it. Also, just reading > does not determine the lexeme as it could also be >=. When you determine the current lexeme, the characters you read beyond it may need to be read again to determine the next lexeme.

3.2.1: Buffer Pairs

The book illustrates the standard programming technique of using two (sizable) buffers to solve this problem.

3.2.2: Sentinels

A useful programming improvement is to combine testing for the end of a buffer with determining the character read.
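To make the sentinel idea concrete, here is a minimal C sketch. Everything in it is an illustrative assumption rather than the book's code: '\0' serves as the sentinel, each buffer half holds N characters, and a hypothetical reload() helper fills a half and plants the sentinel after the last character read (planting it early at true end of input). It also assumes reload(buf) was called once at startup.

#include <stdio.h>

#define N 4096                    /* size of each buffer half (an assumption) */

static char buf[2*N + 2];         /* two halves, each followed by a sentinel  */
static char *forward = buf;       /* the scanning pointer                     */

/* Hypothetical helper: read up to N characters into the half starting at p
   and plant the '\0' sentinel immediately after the last character read.    */
extern void reload(char *p);

int nextChar(void) {
    for (;;) {
        char c = *forward++;
        if (c != '\0')
            return c;                            /* the common case: one test */
        if (forward - 1 == buf + N)              /* sentinel ending half 1    */
            reload(buf + N + 1);                 /* forward already at half 2 */
        else if (forward - 1 == buf + 2*N + 1) { /* sentinel ending half 2    */
            reload(buf);
            forward = buf;
        } else
            return EOF;                          /* sentinel marks true EOF   */
    }
}

The payoff is that the common path costs a single comparison per character; the buffer-end and end-of-file checks run only when a sentinel is actually seen.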

3.3: Specification of Tokens


The chapter turns formal and, in some sense, the course begins. The book is fairly careful about finite vs infinite sets and also uses (without a definition!) the notion of a countable set. (A countable set is either a finite set or one whose elements can be put into one-to-one correspondence with the positive integers. That is, it is a set whose elements can be counted. The set of rational numbers, i.e., fractions in lowest terms, is countable; the set of real numbers is uncountable, because it is strictly bigger, i.e., it cannot be counted.) We should be careful to distinguish the empty set ∅ from the empty string ε. Formal language theory is a beautiful subject, but I shall suppress my urge to do it right and try to go easy on the formalism.

3.3.1: Strings and Languages

We will need a bunch of definitions.


Definition: An alphabet is a finite set of symbols.

Example: {0,1}, presumably ∅ (uninteresting), ascii, unicode, ebcdic, latin-1.

Definition: A string over an alphabet is a finite sequence of symbols from that alphabet. Strings are often called words or sentences.

Example: Strings over {0,1}: ε, 0, 1, 111010. Strings over ascii: ε, sysy, the string consisting of 3 blanks.

Definition: The length of a string is the number of symbols (counting duplicates) in the string.

Example: The length of allan, written |allan|, is 5.

Definition: A language over an alphabet is a countable set of strings over the alphabet.

Example: All grammatical English sentences with five, eight, or twelve words is a language over ascii. It is also a language over unicode.


Definition: The concatenation of strings s and t is the string formed by appending the string t to s. It is written st.

Example: εs = sε = s for any string s.

We view concatenation as a product (see Monoid in wikipedia http://en.wikipedia.org/wiki/Monoid). It is thus natural to define s^0 = ε and s^{i+1} = s^i s.

Example: s^1 = s, s^4 = ssss.

More string terminology

A prefix of a string is a portion starting from the beginning and a suffix is a portion ending at the end. More formally,

Definitions: A prefix of s is any string obtained from s by removing (possibly zero) characters from the end of s. A suffix is defined analogously and a substring of s is obtained by deleting a prefix and a suffix.

Example: If s is 123abc, then (1) s itself and ε are each a prefix, suffix, and substring; (2) 12 and 123a are prefixes; (3) 3abc is a suffix; (4) 23a is a substring.

Definitions: A proper prefix of s is a prefix of s other than ε and s itself. Similarly, proper suffixes and proper substrings of s do not include ε and s.

Definition: A subsequence of s is formed by deleting (possibly zero) positions from s. We say positions rather than characters since s may for example contain 5 occurrences of the character Q and we only want to delete a certain 3 of them.

Example: issssii is a subsequence of Mississippi.

Homework: 3.1b, 3.5 (c and e are optional).

3.3.2: Operations on Languages


Definition: The union of L1 and L2 is simply the set-theoretic union, i.e., it consists of all words (strings) in either L1 or L2.

Example: The union of {Grammatical English sentences with one, three, or five words} with {Grammatical English sentences with two or four words} is {Grammatical English sentences with five or fewer words}.

Definition: The concatenation of L1 and L2 is the set of all strings st, where s is a string of L1 and t is a string of L2. We again view concatenation as a product and write LM for the concatenation of L and M.

Examples: The concatenation of {a,b,c} and {1,2} is {a1,a2,b1,b2,c1,c2}. The concatenation of {a,b,c} and {1,2,ε} is {a1,a2,b1,b2,c1,c2,a,b,c}.


Definition: As with strings, it is natural to define powers of a language L: L^0 = {ε}, which is not ∅; L^{i+1} = L^i L.

Definition: The (Kleene) closure of L, denoted L*, is

    L^0 ∪ L^1 ∪ L^2 ∪ ...

Definition: The positive closure of L, denoted L+, is

    L^1 ∪ L^2 ∪ ...

Example: {0,1,2,3,4,5,6,7,8,9}+ gives all unsigned integers, but with some ugly versions: it has 3, 03, and 000003. {0} ∪ ( {1,2,3,4,5,6,7,8,9} ( {0,1,2,3,4,5,6,7,8,9}* ) ) seems better. In these notes I may write the * and + of the closure operators on the line rather than as superscripts, but that is strictly speaking wrong and I will not do it on the board or on exams or on lab assignments.
Example: {a,b}* is {ε,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}. {a,b}+ is {a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...}. {ε,a,b}* is {ε,a,b,aa,ab,ba,bb,...}. {ε,a,b}+ is the same as {ε,a,b}*.

The book gives other examples based on L={letters} and D={digits}, which you should read.

3.3.3: Regular Expressions

The idea is that the regular expressions over an alphabet consist of ε, the alphabet, and expressions using union, concatenation, and *, but it takes more words to say it right. For example, I didn't include (). Note that (A ∪ B)* is definitely not A* ∪ B* (* does not distribute over ∪) so we need the parentheses. The book's definition includes many () and is more complicated than I think is necessary. However, it has the crucial advantages of being correct and precise. The wikipedia entry doesn't seem to be as precise. I will try a slightly different approach, but note again that there is nothing wrong with the book's approach (which appears in both first and second editions, essentially unchanged).
Definition: The regular expressions and associated languages over an alphabet consist of:

1. ε, the empty string; the associated language L(ε) is {ε}.
2. Each symbol x in the alphabet; L(x) is {x}.
3. rs for all regular expressions (REs) r and s; L(rs) is L(r)L(s).
4. r|s for all REs r and s; L(r|s) is L(r) ∪ L(s).
5. r* for all REs r; L(r*) is (L(r))*.
6. (r) for all REs r; L((r)) is L(r).

Parentheses, if present, control the order of operations. Without parentheses the following precedence rules apply. The postfix unary operator * has the highest precedence. The book mentions that it is left associative. (I don't see how a postfix unary operator can be right associative or how a prefix unary operator such as unary - could be left associative.) Concatenation has the second highest precedence and is left associative. | has the lowest precedence and is left associative. The book gives various algebraic laws (e.g., associativity) concerning these operators. The reason we don't include the positive closure is that for any RE, r+ = rr*.
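As a quick worked example of these precedence rules: a|bc* is grouped as a|(b(c*)), since * binds tightest, then the concatenation bc*, and | applies last. Its language is {a, b, bc, bcc, bccc, ...}; to get (a|b) concatenated with c*, parentheses are required, as in (a|b)c*.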
Homework: 3.6 a and b.

3.3.4: Regular Definitions

These will look like the productions of a context free grammar we saw previously, but there are differences. Let Σ be an alphabet; then a regular definition is a sequence of definitions

    d1 → r1
    d2 → r2
    ...
    dn → rn

where the d's are unique and not in Σ, and each ri is a regular expression over Σ ∪ {d1,...,di-1}. Note that each di can depend on all the previous d's.
Example: C identifiers can be described by the following regular definition

    letter_ → A | B | ... | Z | a | b | ... | z | _
    digit   → 0 | 1 | ... | 9
    CId     → letter_ ( letter_ | digit )*

Homework: 3.7 a,b (c is optional)

3.3.5: Extensions of Regular Expressions

There are many extensions of the basic regular expressions given above. The following three will be used frequently in this course as they are particularly useful for lexical analyzers, as opposed to text editors or string-oriented programming languages, which have more complicated regular expressions. All three are simply shorthand. That is, the set of possible languages generated using the extensions is the same as the set of possible languages generated without using the extensions.

1. One or more instances. This is the positive closure operator + mentioned above.
2. Zero or one instance. The unary postfix operator ? defined by r? = r | ε for any RE r.
3. Character classes. If a1, a2, ..., an are symbols in the alphabet, then [a1a2...an] = a1 | a2 | ... | an. In the special case where all the a's are consecutive, we can simplify the notation further to just [a1-an].
Examples:

C-language identifiers

    letter_ → [A-Za-z_]
    digit   → [0-9]
    CId     → letter_ ( letter_ | digit )*

Unsigned integer or floating point numbers

    digit  → [0-9]
    digits → digit+
    number → digits (. digits)? (E[+-]? digits)?
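For instance, number matches 12, 12.34, and 12.34E-5, but not .34 (the leading digits are required) and not 12. (if the optional fraction is present, it must contain at least one digit).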

Homework: 3.8 for the C language (you might need to read a C manual first to find out all the numerical constants in C), 3.10a.

3.4: Recognition of Tokens


Goal is to perform the lexical analysis needed for the following grammar.

    stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
    expr → term relop term    // relop is relational operator =, >, etc.
         | term
    term → id
         | number

Recall that the terminals are the tokens, the nonterminals produce terminals. A regular definition for the terminals is

    digit  → [0-9]
    digits → digit+
    number → digits (. digits)? (E[+-]? digits)?
    letter → [A-Za-z]
    id     → letter ( letter | digit )*
    if     → if
    then   → then
    else   → else
    relop  → < | > | <= | >= | = | <>

We also want the lexer to remove whitespace so we define a new token

    ws → ( blank | tab | newline )+

where blank, tab, and newline are symbols used to represent the corresponding ascii characters.

Recall that the lexer will be called by the parser when the latter needs a new token. If the lexer then recognizes the token ws, it does not return it to the parser but instead goes on to recognize the next token, which is then returned. Note that you can't have two consecutive ws tokens in the input because, for a given token, the lexer will match the longest lexeme starting at the current position that yields this token. The table on the right summarizes the situation.

    Lexeme         Token   Attribute
    Whitespace     ws
    if             if
    then           then
    else           else
    An identifier  id      Pointer to table entry
    A number       number  Pointer to table entry
    <              relop   LT
    <=             relop   LE
    =              relop   EQ
    <>             relop   NE
    >              relop   GT
    >=             relop   GE
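As the table indicates, ws is recognized but never returned to the parser. A minimal C sketch of that suppression logic, where the names TOKEN, WS, and matchToken() are illustrative assumptions (matchToken() is presumed to return the longest lexeme at the current position):

typedef struct { int name; int attribute; } TOKEN;   /* the <name,attribute> pair  */
enum { WS, IF, THEN, ELSE, ID, NUMBER, RELOP };      /* token names (illustrative) */

extern TOKEN matchToken(void);   /* assumed: longest match at the current position */

TOKEN getToken(void) {           /* what the parser actually calls */
    TOKEN t = matchToken();
    if (t.name == WS)            /* ws is dropped; by the longest-match rule the */
        t = matchToken();        /* following lexeme cannot be ws again          */
    return t;
}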

For the parser all the relational ops are to be treated the same so they are all the same token, relop. Naturally, other parts of the compiler will need to distinguish between the various relational ops so that appropriate code is generated. Hence, they have distinct attribute values.

3.4.1: Transition Diagrams

A transition diagram is similar to a flowchart for (a part of) the lexer. We draw one for each possible token. It shows the decisions that must be made based on the input seen. The two main components are circles representing states (think of them as decision points of the lexer) and arrows representing edges (think of them as the decisions made). The transition diagram (3.12 in the 1st edition, 3.13 in the second) for relop is shown on the right.

1. The double circles represent accepting or final states at which point a lexeme has been found. There is often an action to be done (e.g., returning the token), which is written to the right of the double circle.
2. If we have moved one (or more) characters too far in finding the token, one (or more) stars are drawn.
3. An imaginary start state exists and has an arrow coming from it to indicate where to begin the process.

It is fairly clear how to write code corresponding to this diagram. You look at the first character; if it is <, you look at the next character. If that character is =, you return (relop,LE) to the parser. If instead that character is >, you return (relop,NE). If it is another character, return (relop,LT) and adjust the input buffer so that you will read this character again since you have used it for the current lexeme. If the first character was =, you return (relop,EQ).

3.4.2: Recognition of Reserved Words and Identifiers

The transition diagram below corresponds to the regular definition given previously.

Note again the star affixed to the final state.

Two questions remain.

1. How do we distinguish between identifiers and keywords such as then, which also match the pattern in the transition diagram?
2. What is (gettoken(), installID())?

We will continue to assume that the keywords are reserved, i.e., may not be used as identifiers. (What if this is not the case, as in PL/I, which had no reserved words? Then the lexer does not distinguish between keywords and identifiers and the parser must.)

We will use the method mentioned last chapter and have the keywords installed into the symbol table prior to any invocation of the lexer. The symbol table entry will indicate that the entry is a keyword.

installID() checks if the lexeme is already in the table. If it is not present, the lexeme is installed as an id token. In either case a pointer to the entry is returned.

gettoken() examines the lexeme and returns the token name, either id or a name corresponding to a reserved keyword.

Both installID() and gettoken() access the buffer to obtain the lexeme of interest. The text also gives another method to distinguish between identifiers and keywords.
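The following sketch shows one way installID() and gettoken() could cooperate. The symbol-table API (lookup/insert), the ENTRY type, and the token codes are assumptions for illustration; only yytext and yyleng follow the lex convention mentioned later in these notes, and here gettoken() inspects the returned entry rather than the lexeme itself, a simplification.

extern char *yytext;    /* start of the current lexeme (lex convention) */
extern int   yyleng;    /* its length                                   */

/* Hypothetical symbol-table entry and API (assumptions for illustration). */
typedef struct { int token; } ENTRY;       /* token is ID or a keyword code */
enum { ID = 300, IF, THEN, ELSE };         /* illustrative token codes      */
extern ENTRY *lookup(const char *lexeme, int len);
extern ENTRY *insert(const char *lexeme, int len, int token);

ENTRY *installID(void) {
    ENTRY *e = lookup(yytext, yyleng);     /* already present?              */
    if (e == NULL)                         /* no: it cannot be a keyword    */
        e = insert(yytext, yyleng, ID);    /* (those were pre-installed),   */
    return e;                              /* so install it as an id        */
}

int gettoken(ENTRY *e) {
    return e->token;    /* id, or the token of a reserved keyword */
}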

3.4.3: Completion of the Running Example

So far we have transition diagrams for identifiers (this diagram also handles keywords) and the relational operators. What remains are whitespace and numbers, which are the simplest and most complicated diagrams seen so far.

Recognizing Whitespace

The diagram itself is quite simple, reflecting the simplicity of the corresponding regular expression. The delim in the diagram represents any of the whitespace characters, say space, tab, and newline. The final star is there because we needed to find a non-whitespace character in order to know when the whitespace ends and this character begins the next token. There is no action performed at the accepting state. Indeed the lexer does not return to the parser, but starts again from its beginning as it still must find the next token.

Recognizing Numbers

The diagram below is from the second edition. It is essentially a combination of the three diagrams in the first edition. This certainly looks formidable, but it is not that bad; it follows from the regular expression. In class go over the regular expression and show the corresponding parts in the diagram. When an accepting state is reached, action is required but is not shown on the diagram. Just as identifiers are stored in a symbol table and a pointer is returned, there is a corresponding number table in which numbers are stored. These numbers are needed when code is generated. Depending on the source language, we may wish to indicate in the table whether this is a real or an integer. A similar, but more complicated, transition diagram could be produced if the language permitted complex numbers as well.
Homework: Write transition diagrams for the regular expressions in problems 3.6 a and b, 3.7 a and b.

3.4.4: Architecture of a Transition-Diagram-Based Lexical Analyzer

The idea is that we write a piece of code for each decision diagram. I will show the one for relational operations below (from the 2nd edition). This piece of code contains a case for each state, which typically reads a character and then goes to the next case depending on the character read. The numbers in the circles are the names of the cases.

Accepting states often need to take some action and return to the parser. Many of these accepting states (the ones with stars) need to restore one character of input. This is called retract() in the code.

What should the code for a particular diagram do if at one state the character read is not one of those for which a next state has been defined? That is, what if the character read is not the label of any of the outgoing arcs? This means that we have failed to find the token corresponding to this diagram. The code calls fail(). This is not an error case. It simply means that the current input does not match this particular token. So we need to go to the code section for another diagram after restoring the input pointer so that we start the next diagram at the point where this failing diagram started. If we have tried all the diagrams, then we have a real failure and need to print an error message and perhaps try to repair the input. (One possible organization of fail() is sketched after the code below.)

Note that the order in which the diagrams are tried is important. If the input matches more than one token, the first one tried will be chosen.

TOKEN getRelop()                      // TOKEN has two components
{
   TOKEN retToken = new(RELOP);       // First component set here
   while (true) {
      switch(state) {
         case 0: c = nextChar();
                 if (c == '<')      state = 1;
                 else if (c == '=') state = 5;
                 else if (c == '>') state = 6;
                 else fail();
                 break;
         case 1: ...
         ...
         case 8: retract();           // an accepting state with a star
                 retToken.attribute = GT;   // second component
                 return(retToken);
      }
   }
}
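One way to organize fail() and the code that tries the diagrams in order is with setjmp/longjmp. This is only a sketch under assumed names (tryDiagram(), NDIAGRAMS, lexError()), not the book's implementation; it plays the role of the matchToken() helper hypothesized earlier.

#include <setjmp.h>

typedef struct { int name; int attribute; } TOKEN;  /* as in getRelop() above */

#define NDIAGRAMS 4                  /* e.g., relop, id/keyword, number, ws    */

extern char *lexemeBegin, *forward;  /* input-buffer pointers                  */
extern TOKEN tryDiagram(int i);      /* assumed: the code for the i-th diagram */
extern void lexError(const char *m); /* assumed: report a real failure and
                                        do not return                          */

static jmp_buf onFail;

void fail(void) {                    /* the current diagram cannot continue    */
    forward = lexemeBegin;           /* restore the input pointer...           */
    longjmp(onFail, 1);              /* ...and abandon this diagram            */
}

TOKEN matchToken(void) {
    volatile int i;                  /* volatile: modified after setjmp        */
    for (i = 0; ; i++) {
        if (i == NDIAGRAMS)          /* every diagram failed: a real error     */
            lexError("no pattern matches the input here");
        if (setjmp(onFail) == 0)
            return tryDiagram(i);    /* success: a token was found             */
        /* else fail() was called; loop around and try the next diagram        */
    }
}

Note how the loop makes the order of the diagrams matter, exactly as the text says: the first diagram that succeeds determines the token.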

Second edition additions

The description above corresponds to the one given in the first edition. The newer edition gives two other methods for combining the multiple transition diagrams (in addition to the one above).

1. Unlike the method above, which tries the diagrams one at a time, the first new method tries them in parallel. That is, each character read is passed to each diagram (that hasn't already failed). Care is needed when one diagram has accepted the input, but others still haven't failed and may accept a longer prefix of the input.
2. The final possibility discussed, which appears to be promising, is to combine all the diagrams into one. That is easy for the example we have been considering because all the diagrams begin with different characters being matched. Hence we just have one large start state with multiple outgoing edges. It is more difficult when there is a character that can begin more than one diagram.

3.5: The Lexical Analyzer Generator Lex


The newer version, which we will use, is called flex; the f stands for fast. I checked and both lex and flex are on the cs machines. I will use the name lex for both.

Lex is itself a compiler that is used in the construction of other compilers (its output is the lexer for the other compiler). The lex language, i.e., the input language of the lex compiler, is described in the next few sections. The compiler writer uses the lex language to specify the tokens of their language as well as the actions to take at each state.

3.5.1: Use of Lex

Let us pretend I am writing a compiler for a language called pink. I produce a file, call it lex.l, that describes pink in a manner shown below. I then run the lex compiler (a normal program), giving it lex.l as input. The lex compiler output is always a file called lex.yy.c, a program written in C. One of the procedures in lex.yy.c (call it pinkLex()) is the lexer itself, which reads a character input stream and produces a sequence of tokens. pinkLex() also sets a global value yylval that is shared with the parser. I then compile lex.yy.c together with the parser (typically the output of lex's cousin yacc, a parser generator) to produce, say, pinkfront, which is an executable program that is the front end for my pink compiler.

3.5.2: Structure of Lex Programs

The general form of a lex program like lex.l is
declarations
%%
translation rules
%%
auxiliary functions

The lex program for the example we have been working with follows (it is typed in straight from the book).
%{
    /* definitions of manifest constants
       LT, LE, EQ, NE, GT, GE,
       IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
"<"      {yylval = LT; return(RELOP);}
"<="     {yylval = LE; return(RELOP);}
"="      {yylval = EQ; return(RELOP);}
"<>"     {yylval = NE; return(RELOP);}
">"      {yylval = GT; return(RELOP);}
">="     {yylval = GE; return(RELOP);}

%%

int installID() {/* function to install the lexeme, whose first
                    character is pointed to by yytext, and whose
                    length is yyleng, into the symbol table and
                    return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
                     constants into a separate table */
}

The first, declaration, section includes variables and constants as well as the all-important regular definitions that define the building blocks of the target language, i.e., the language that the generated lexer will analyze.

The next, translation rules, section gives the patterns of the lexemes that the lexer will recognize and the actions to be performed upon recognition. Normally, these actions include returning a token name to the parser and often returning other information about the token via the shared variable yylval. If a return is not specified the lexer continues executing and finds the next lexeme present.

Comments on the Lex Program

Anything between %{ and %} is not processed by lex, but instead is copied directly to lex.yy.c. So we could have had statements like
#define LT 12
#define LE 13

The regular definitions are mostly self-explanatory. When a definition is later used it is surrounded by {}. A backslash \ is used when a special symbol like * or . is to be used to stand for itself, e.g., if we wanted to match a literal star in the input for multiplication.

Each rule is fairly clear: when a lexeme is matched by the left, pattern, part of the rule, the right, action, part is executed. Note that the value returned is the name (an integer) of the corresponding token. For simple tokens like the one named IF, which correspond to only one lexeme, no further data need be sent to the parser. There are several relational operators so a specification of which lexeme matched RELOP is saved in yylval. For ids and numbers, the lexeme is stored in a table by the install functions and a pointer to the entry is placed in yylval for future use.

Everything in the auxiliary function section is copied directly to lex.yy.c. Unlike declarations enclosed in %{ %}, however, auxiliary functions may be used in the actions.

3.5.3: Conflict Resolution in Lex

1. Match the longest possible prefix of the input.
2. If this prefix matches multiple patterns, choose the first.

The first rule makes <= one lexeme instead of two. The second rule makes if a keyword and not an id.

3.5.3a: Anger Management in Lex

Sorry.

3.5.4: The Lookahead Operator

Sometimes a sequence of characters is only considered a certain lexeme if the sequence is followed by specified other sequences. Here is a classic example. Fortran, PL/I, and some other languages do not have reserved words. In Fortran
IF(X)=3

is a legal assignment statement and the IF is an identifier. However,


IF(X.LT.Y)X=Y

is an if/then statement and IF is a keyword. Sometimes the lack of reserved words makes lexical disambiguation impossible; however, in this case the slash / operator of lex is sufficient to distinguish the two cases. Consider
IF / \(.*\){letter}

This only matches IF when it is followed by a (, some text, a ), and a letter. The only FORTRAN statements that match this are the if/then shown above; so we have found a lexeme that matches the if token. However, the lexeme is just the IF and not the rest of the pattern. The slash tells lex to put the rest back into the input and match it for the next and subsequent tokens.
Homework: 3.11.

Homework: Modify the lex program in section 3.5.2 so that: (1) the keyword while is recognized, (2) the comparison operators are those used in the C language, (3) the underscore is permitted as another letter (this problem is easy).

3.6: Finite Automata

The secret weapon used by lex et al. to convert (compile) its input into a lexer.

Finite automata are like the graphs we saw in transition diagrams but they simply decide if a sentence (input string) is in the language (generated by our regular expression). That is, they are recognizers of language. There are two types of finite automata.

1. Deterministic finite automata (DFA) have for each state (circle in the diagram) exactly one edge leading out for each symbol. So if you know the next symbol and the current state, the next state is determined. That is, the execution is deterministic; hence the name.
2. Nondeterministic finite automata (NFA) are the other kind. There are no restrictions on the edges leaving a state: there can be several with the same symbol as label and some edges can be labeled with ε. Thus there can be several possible next states from a given state and a current lookahead symbol.

Surprising Theorem: Both DFAs and NFAs are capable of recognizing the same languages, the regular languages, i.e., the languages generated by regular expressions (plus the automata can recognize the empty language).

What does this mean? There are certainly NFAs that are not DFAs. But the language recognized by each such NFA can also be recognized by at least one DFA.

Why mention (confusing) NFAs? The DFA that recognizes the same language as an NFA might be significantly larger than the NFA. The finite automaton that one constructs naturally from a regular expression is often an NFA.

3.6.1: Nondeterministic Finite Automata

Here is the formal definition. A nondeterministic finite automaton (NFA) consists of

1. A finite set of states S.
2. An input alphabet Σ not containing ε.
3. A transition function that gives, for each state and each symbol in Σ ∪ {ε}, a set of next states (or successor states).
4. An element s0 of S, the start state.
5. A subset F of S, the accepting states (or final states).

An NFA is basically a flow chart like the transition diagrams we have already seen. Indeed an NFA (or a DFA, to be formally defined soon) can be represented by a transition graph whose nodes are states and whose edges are labeled with elements of Σ ∪ {ε}. The differences between a transition graph and our previous transition diagrams are:

1. Possibly multiple edges with the same label leaving a single state.
2. An edge may be labeled with ε.

The transition graph to the right is an NFA for the regular expression (a|b)*abb, which (given the alphabet {a,b}) represents all words ending in abb. Consider aababb. If you choose the wrong edge for the initial a's you will get stuck or not end at the accepting state. But an NFA accepts a word if any path (beginning at the start state and using the symbols in the word in order) ends at an accepting state. It essentially tries all such paths at once and accepts if any end at an accepting state.

Patterns like (a|b)*abb are useful regular expressions! If the alphabet is ascii, consider *.java.

Homework: For the NFA to the right, indicate all the paths labeled aabb.

3.6.2: Transition Tables

There is an equivalent way to represent an NFA, namely a table giving, for each state s and input symbol x (and ε), the set of successor states x leads to from s. The empty set ∅ is used when there is no edge labeled x emanating from s. The table below corresponds to the transition graph above.

    State  a      b
    0      {0,1}  {0}
    1      ∅      {2}
    2      ∅      {3}

The downside of these tables is their size, especially if most of the entries are ∅, since those entries would not take any space in a transition graph.
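A plausible C encoding of this table, with each set of states packed into a bitmask (bit i set means state i is in the set). The representation is an illustrative assumption, not something the book prescribes; note how the ∅ entries, free in a transition graph, each occupy a slot here.

#include <stdint.h>

typedef uint32_t StateSet;       /* bit i set <=> NFA state i is in the set */

enum { NSTATES = 4 };            /* states 0..3 of the NFA above            */
enum { SYM_A, SYM_B, NSYMS };    /* the input symbols a and b               */

/* table[s][x] = set of successors of state s on symbol x;
   0 encodes the empty set.                                                 */
static const StateSet table[NSTATES][NSYMS] = {
    /* state 0 */ { (1u<<0)|(1u<<1), 1u<<0 },   /* a -> {0,1}, b -> {0} */
    /* state 1 */ { 0,               1u<<2 },   /* a -> {},    b -> {2} */
    /* state 2 */ { 0,               1u<<3 },   /* a -> {},    b -> {3} */
    /* state 3 */ { 0,               0     },   /* no outgoing edges    */
};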
Homework: Construct the transition table for the NFA in the previous homework problem.

3.6.3: Acceptance of Input Strings by Automata

An NFA accepts a string if the symbols of the string specify a path from the start to an accepting state.
Homework: Does the NFA in the previous homework accept the string aabb?

Again note that these symbols may specify several paths, some of which lead to accepting states and some that don't. In such a case the NFA does accept the string; one successful path is enough. Also note that if an edge is labeled ε, then it can be taken for free.

For the transition graph above any string can just sit at state 0 since every possible symbol (namely a or b) can go from state 0 back to state 0. So every string can lead to a non-accepting state, but that is not important since if just one path with that string leads to an accepting state, the NFA accepts the string.

The language defined by an NFA or the language accepted by an NFA is the set of strings (a.k.a. words) accepted by the NFA. So the NFA in the diagram above (not the diagram with the homework problem) accepts the same language as the regular expression (a|b)*abb. If A is an automaton (NFA or DFA) we use L(A) for the language accepted by A.

The diagram on the right illustrates an NFA accepting the language L(aa*|bb*). The path 0→3→4→4→4→4 shows that bbbb is accepted by the NFA. Note how the ε that labels the edge 0→3 does not appear in the string bbbb since ε is the empty string.

3.6.4: Deterministic Finite Automata

There is something weird about an NFA if viewed as a model of computation. How is a computer of any realistic construction able to check out all the (possibly infinite number of) paths to determine if any terminate at an accepting state? We now consider a much more realistic model, a DFA.
Definition: A deterministic finite automaton or DFA is a special case of an NFA having the restrictions

1. No edge is labeled with ε.
2. For any state s and symbol a, there is exactly one edge leaving s with label a.

This is realistic. We are at a state and examine the next character in the string; depending on the character we go to exactly one new state. Looks like a switch statement to me.

Minor point: when we write a transition table for a DFA, the entries are elements not sets so there are no {} present.

Simulating a DFA

Indeed a DFA is so reasonable there is an obvious algorithm for simulating it (i.e., reading a string and deciding whether or not it is in the language accepted by the DFA). We present it now.

The second edition has switched to C syntax: = is assignment, == is comparison. I am going to change to this notation since I strongly suspect that most of the class is much more familiar with C/C++/java/C# than with algol60/algol68/pascal/ada (the last is my personal favorite). As I revisit past sections of the notes to fix errors, I will change the examples from algol to C usage of =. I realize that this makes the notes incompatible with the edition you have, but hope and believe that this will not cause any serious problems.
s = s0;            // start state.  NOTE: = is assignment
c = nextChar();    // a priming read
while (c != eof) {
   s = move(s,c);
   c = nextChar();
}
if (s is in F, the set of accepting states) return yes;
else return no;

3.7: From Regular Expressions to Automata

3.7.0: Not losing sight of the forest due to the trees
This is not from the book. Do not forget that the goal of the chapter is to understand lexical analysis. We saw, when looking at Lex, that regular expressions are a key in this task. So we want to recognize regular expressions (say the ones representing tokens). We are going to see two methods.

1. Convert the regular expression to an NFA and simulate the NFA.
2. Convert the regular expression to an NFA, convert the NFA to a DFA, and simulate the DFA.

So we need to learn 4 techniques.

1. Convert a regular expression to an NFA.
2. Simulate an NFA.
3. Convert an NFA to a DFA.
4. Simulate a DFA.

The list I just gave is in the order the algorithms would be applied, but you would use either 2 or (3 and 4). The two editions differ in the order the techniques are presented, but neither does it in the order I just gave. Indeed, we just did item #4. I will follow the order of the 2nd edition but give pointers to the first edition where they differ.

================ Start Lecture #4 ================


Remark: If you find a particular homework question challenging, ask on the mailing list and an answer will be produced.


Remark: I forgot to assign homework for section 3.6. I have added one problem spread into three parts. It is not assigned, but it is a question I believe you should be able to do.

3.7.1: Converting an NFA to a DFA

(This is item #3 above and is done in section 3.6 in the first edition.)

The book gives a detailed proof; I am just trying to motivate the ideas.

Let N be an NFA; we construct a DFA D that accepts the same strings as N does. Call a state of N an N-state, and call a state of D a D-state. The idea is that a D-state corresponds to a set of N-states and hence this is called the subset algorithm. Specifically, for each string X of symbols we consider all the N-states that can result when N processes X. This set of N-states is a D-state.

Let us consider the transition graph on the right, which is an NFA that accepts strings satisfying the regular expression (a|b)*abb.

The start state of D is the set of N-states that can result when N processes the empty string ε. This is called the ε-closure of the start state s0 of N, and consists of those N-states that can be reached from s0 by following edges labeled with ε. Specifically it is the set {0,1,2,4,7} of N-states. We call this state D0 and enter it in the transition table we are building for D.

    NFA states        DFA state  a   b
    {0,1,2,4,7}       D0         D1  D2
    {1,2,3,4,6,7,8}   D1         D1  D3
    {1,2,4,5,6,7}     D2         D1  D2
    {1,2,4,5,6,7,9}   D3         D1  D4
    {1,2,4,5,6,7,10}  D4         D1  D2

Next we want the a-successor of D0, i.e., the D-state that occurs when we start at D0 and move along an edge labeled a. We call this successor D1. Since D0 consists of the N-states corresponding to ε, D1 is the N-states corresponding to εa = a. We compute the a-successor of all the N-states in D0 and then form the ε-closure. Next we compute the b-successor of D0 the same way and call it D2.

We continue forming a- and b-successors of all the D-states until no new D-states result (there is only a finite number of subsets of all the N-states so this process does indeed stop). This gives the table above.

D4 is the only D-accepting state as it is the only D-state containing the (only) N-accepting state 10.

Theoretically, this algorithm is awful since for a set with k elements, there are 2^k subsets. Fortunately, normally only a small fraction of the possible subsets occur in practice.
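Since the construction just produced a concrete five-state DFA for (a|b)*abb, its table can be transcribed directly into the simulation scheme of section 3.6.4. A small sketch, under the assumption that the input is a NUL-terminated string over {a,b}:

#include <stdbool.h>

/* The DFA built above: start state D0, accepting state D4. */
enum { D0, D1, D2, D3, D4, NDSTATES };

static const int onA[NDSTATES] = { D1, D1, D1, D1, D1 };   /* a-successors */
static const int onB[NDSTATES] = { D2, D3, D2, D4, D2 };   /* b-successors */

bool accepts(const char *s) {
    int state = D0;
    for ( ; *s != '\0'; s++)
        state = (*s == 'a') ? onA[state] : onB[state];
    return state == D4;          /* accept iff we end at the accepting state */
}

/* accepts("aababb") is true; accepts("aabab") is false. */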
Homework: Convert the NFA from the homework for section 3.6 to a DFA.

3.7.2: Simulating an NFA

Instead of producing the DFA, we can run the subset algorithm as a simulation itself. This is item #2 in my list of techniques.

S = ε-closure(s0);
c = nextChar();
while ( c != eof ) {
   S = ε-closure(move(S,c));
   c = nextChar();
}
if ( S ∩ F != ∅ ) return yes;    // F is the set of accepting states
else return no;
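The same simulation in C, keeping the current set S as a bitmask. The helpers epsClosure() and move(), presumed built from the NFA's transition table, and the StateSet encoding are illustrative assumptions:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

typedef uint32_t StateSet;                /* bit i set <=> NFA state i      */

extern StateSet epsClosure(StateSet S);   /* assumed: close S under ε-edges */
extern StateSet move(StateSet S, int c);  /* assumed: successors of S on c  */
extern StateSet start;                    /* the set {s0}                   */
extern StateSet accepting;                /* F, the set of accepting states */

bool simulateNFA(FILE *in) {
    StateSet S = epsClosure(start);
    int c;
    while ((c = fgetc(in)) != EOF)
        S = epsClosure(move(S, c));
    return (S & accepting) != 0;          /* is S ∩ F nonempty?             */
}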

3.7.3: Efficiency of NFA Simulation

Slick implementation.

3.7.4: Constructing an NFA from a Regular Expression

I give a pictorial proof by induction. This is item #1 from my list of techniques.

1. The base cases are the empty regular expression ε and the regular expression consisting of a single symbol a in the alphabet.
2. The inductive cases are:
   i. s | t for s and t regular expressions,
   ii. st for s and t regular expressions,
   iii. s*,
   iv. (s), which is trivial since the NFA for s works for (s).

The pictures on the right illustrate the base and inductive cases.
Remarks:

1. The generated NFA has at most twice as many states as there are operators and operands in the RE. This is important for studying the complexity of the NFA.
2. The generated NFA has one start and one accepting state. The accepting state has no outgoing arcs and the start state has no incoming arcs.
3. Note that the diagram for st correctly indicates that the final state of s and the initial state of t are merged. This uses the previous remark that there is only one start and final state.
4. Except for the accepting state, each state of the generated NFA has either one outgoing arc labeled with a symbol or two outgoing arcs labeled with ε.

Do the NFA for (a|b)*abb and see that we get the same diagram that we had before. Do the steps in the normal leftmost, innermost order (or draw a normal parse tree and follow it); a sketch of this construction in code appears below.
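The base and inductive cases translate almost mechanically into code; the construction is commonly credited to Thompson. Below is a compact C sketch in which every name is an illustrative assumption. The representation follows remark 4: each state has at most two outgoing edges, with EPS marking an ε-edge.

#include <stdlib.h>

#define EPS  (-1)                   /* label of an ε-edge */
#define NONE (-2)                   /* no edge            */

typedef struct State {
    int label1, label2;             /* EPS, NONE, or an alphabet symbol */
    struct State *out1, *out2;      /* the edge targets                 */
} State;

typedef struct { State *start, *accept; } NFA;  /* one start, one accept
                                                   (remark 2)            */

static State *newState(void) {
    State *s = calloc(1, sizeof *s);
    s->label1 = s->label2 = NONE;
    return s;
}

/* Base case: a single alphabet symbol a. */
NFA nfaSym(int a) {
    NFA n = { newState(), newState() };
    n.start->label1 = a;  n.start->out1 = n.accept;
    return n;
}

/* st: merge s's accepting state with t's start state (remark 3). */
NFA nfaConcat(NFA s, NFA t) {
    *s.accept = *t.start;           /* safe: t.start has no incoming arcs */
    free(t.start);
    return (NFA){ s.start, t.accept };
}

/* s|t: a new start and accept state with ε-edges around both fragments. */
NFA nfaUnion(NFA s, NFA t) {
    NFA n = { newState(), newState() };
    n.start->label1 = EPS;  n.start->out1 = s.start;
    n.start->label2 = EPS;  n.start->out2 = t.start;
    s.accept->label1 = EPS; s.accept->out1 = n.accept;
    t.accept->label1 = EPS; t.accept->out1 = n.accept;
    return n;
}

/* s*: ε-edges allow skipping s entirely or repeating it. */
NFA nfaStar(NFA s) {
    NFA n = { newState(), newState() };
    n.start->label1 = EPS;  n.start->out1 = s.start;
    n.start->label2 = EPS;  n.start->out2 = n.accept;   /* zero copies */
    s.accept->label1 = EPS; s.accept->out1 = s.start;   /* repeat      */
    s.accept->label2 = EPS; s.accept->out2 = n.accept;  /* stop        */
    return n;
}

/* Example: (a|b)*abb built in the normal leftmost, innermost order. */
NFA buildExample(void) {
    NFA ab = nfaUnion(nfaSym('a'), nfaSym('b'));
    return nfaConcat(nfaConcat(nfaConcat(nfaStar(ab), nfaSym('a')),
                               nfaSym('b')), nfaSym('b'));
}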
Homework: 3.16 a,b,c

3.7.5: Efficiency of String-Processing Algorithms

(This is on page 127 of the first edition.) Skipped.

3.8: Design of a Lexical-Analyzer Generator


How lexer-generators like Lex work.

3.8.1: The structure of the generated analyzer

We have seen simulators for DFAs and NFAs. The remaining large question is how the lex input is converted into one of these automata. Also:

1. Lex permits functions to be passed through to the lex.yy.c file. This is fairly straightforward to implement.
2. Lex also supports actions that are to be invoked by the simulator when a match occurs. This is also fairly straightforward.
3. The lookahead operator is not so simple in the general case and is discussed briefly below.

In this section we will use transition graphs; lexer-generators do not draw pictures; instead they use the equivalent transition tables.

Recall that the regular definitions in Lex are mere conveniences that can easily be converted to REs and hence we need only convert REs into an FSA.

We already know how to convert a single RE into an NFA. But lex input will contain several REs (since it wishes to recognize several different tokens). The solution is to:

1. Produce an NFA for each RE.
2. Introduce a new start state.
3. Introduce an ε-transition from the new start state to the start of each NFA constructed in step 1.
4. When one reaches one of the accepting states, one does NOT stop. See below for an explanation.

The result is shown to the right. At each of the accepting states (one for each NFA in step 1), the simulator executes the actions specified in the lex program for the corresponding pattern.

3.8.2: Pattern Matching Based on NFAs

We use the algorithm for simulating NFAs presented in 3.7.2. The simulator starts reading characters and calculates the set of states it is at. At some point the input character does not lead to any state or we have reached the eof. Since we wish to find the longest lexeme matching the pattern, we proceed backwards from the current point (where there was no state) until we reach an accepting state (i.e., the set of NFA states, N-states, contains an accepting N-state).

Each accepting N-state corresponds to a matched pattern. The lex rule is that if a lexeme matches multiple patterns we choose the pattern listed first in the lex-program.

    Pattern  Action to perform
    a        Action1
    abb      Action2
    a*b+     Action3

Consider the example on the right with three patterns and their associated actions, and consider processing the input aaba.

1. We begin by constructing the three NFAs. To save space, the third NFA is not the one that would be constructed by our algorithm, but is an equivalent smaller one. For example, some unnecessary ε-transitions have been eliminated. If one views the lex executable as a compiler transforming lex source into NFAs, this would be considered an optimization.
2. We introduce a new start state and ε-transitions as in the previous section.
3. We start at the ε-closure of the start state, which is {0,1,3,7}.
4. The first a (remember the input is aaba) takes us to {2,4,7}. This includes an accepting state and indeed we have matched the first pattern. However, we do not stop since we may find a longer match.
5. The next a takes us to {7}.
6. The b takes us to {8}.
7. The next a fails since there are no a-transitions out of state 8. So we must back up to before trying the last a.
8. We are back in {8} and ask if one of these N-states (I know there is only one, but there could be more) is an accepting state.
9. Indeed state 8 is accepting for the third pattern. If there were more than one accepting state in the list, we would choose the one in the earliest listed pattern.
10. Action3 would now be performed.

3.8.3: DFA's for Lexical Analyzers

We could also convert the NFA to a DFA and simulate that. The resulting DFA is on the right. Note that it shows the same set of states we had as well as others corresponding to other possible inputs.

We label the accepting states with the pattern matched. If multiple patterns are matched (because the accepting D-state contains multiple accepting N-states), we use the first pattern listed (assuming we are using lex conventions).

Technical point. For a DFA, there must be an outgoing edge from each D-state for each possible character. In the diagram, when there is no NFA state possible, we do not show the edge. Technically we should show these edges, all of which lead to the same D-state, called the dead state, which corresponds to the empty subset of N-states.

3.8.4: Implementing the Lookahead Operator

This has some tricky points. Recall that this lookahead operator is for when you must look further down the input but the extra characters matched are not part of the lexeme. We write the pattern r1/r2. In the NFA we match r1, then treat the / as an ε, and then match r2. It would be fairly easy to describe the situation when the NFA has only one ε-transition at the state where r1 is matched. But it is tricky when there is more than one such transition.

3.9: Optimization of DFA-Based Pattern Matchers


Skipped.

3.9.1: Important States of an NFA

Skipped.

3.9.2: Functions Computed from the Syntax Tree

Skipped.

3.9.3: Computing nullable, firstpos, and lastpos

Skipped.

3.9.4: Computing followpos

Skipped.
